Simple and Practical Python Web Crawling --- 2. Basic Crawler Operations

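I. Summary

One-sentence summary:

Fetching a web page with the requests library is very simple. For example, response = requests.get("https://www.cnblogs.com/Renyi-Fan/p/13264726.html") directly returns the response object for that request.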

import requests
# Fetch a blog post from cnblogs (博客园)
response = requests.get("https://www.cnblogs.com/Renyi-Fan/p/13264726.html")
print(response.status_code)
# print(response.text)

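1. What should we do when crawling sites that restrict automated crawlers, such as Zhihu?

Imitate a browser's user-agent: set the user-agent entry in headers to the value a real browser sends. This is usually enough for sites that only check that header.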

import requests
# Fetch data from Zhihu
headers = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
response = requests.get("https://www.zhihu.com/",headers=headers)
print(response.status_code)
# print(response.text)
# Without the custom headers, requests sends its default: "User-Agent": "python-requests/2.22.0"

II. Basic Crawler Operations

Video for the course that accompanies this post: 2. Basic Crawler Operations - 范仁义 (Fan Renyi) - 读书编程笔记
https://www.fanrenyi.com/video/35/318

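I. Introduction to Crawlers

What is a crawler

A crawler is a program that automatically fetches the content of web pages.

Search engines such as Google and Baidu are, at their core, built on crawlers.

Almost any programming language can be used to write a crawler, but Python is the most popular choice.

Python's libraries for fetching web pages include the built-in urllib and third-party libraries such as requests.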

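The requests library is built on urllib3 (a third-party library, not the built-in urllib). It offers much the same functionality as urllib but is far simpler and more convenient to use.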

Requests is an elegant and simple HTTP library for Python, built for human beings.
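To make the comparison with urllib concrete, here is a minimal sketch of the same kind of GET request written with the built-in urllib only. The helper names `build_request` and `fetch_text` are hypothetical, the 10-second timeout and the UTF-8 fallback are assumptions, and the URL is the cnblogs post used elsewhere in this post. Nothing is sent over the network until `fetch_text` is called.

```python
from urllib.request import Request, urlopen

URL = "https://www.cnblogs.com/Renyi-Fan/p/13264726.html"


def build_request(url, user_agent=None):
    """Build (but do not send) a GET request, optionally with a custom User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return Request(url, headers=headers, method="GET")


def fetch_text(url, timeout=10):
    """Send the request and return (status code, decoded body). Needs network access."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        # urllib hands back raw bytes, so decoding is a manual step here.
        charset = resp.headers.get_content_charset() or "utf-8"  # assumption: fall back to UTF-8
        return resp.status, resp.read().decode(charset, errors="replace")


# Building the request is purely local, so it can be inspected offline.
req = build_request(URL, user_agent="Mozilla/5.0")
print(req.get_method(), req.full_url)
print(req.get_header("User-agent"))  # urllib stores header names capitalized like this
```

With requests, the status code and decoded text come from `response.status_code` and `response.text` on the object returned by `requests.get(url)`, which is the convenience the comparison refers to.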

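Crawler examples

For example, my website https://fanrenyi.com

Or sites such as 电影天堂 (Movie Heaven), and so on

Course overview

This course covers the operations used most often in Python crawling, such as fetching a web page, fetching a page with a POST request, and fetching pages behind a simulated login.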

II. Basic Crawler Operations

# Install the requests library
# pip3 install requests
# Import the requests library
# import requests

In [3]:

import requests
# Fetch a blog post from cnblogs (博客园)
response = requests.get("https://www.cnblogs.com/Renyi-Fan/p/13264726.html")
print(response.status_code)
# print(response.text)
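The cell above assumes the request succeeds. A minimal sketch of a more defensive version follows. The helper name `fetch_html`, the 10-second timeout, and the choice to return `None` on failure are all assumptions made for illustration, not part of the course code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions instead of checking status_code by hand.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers malformed URLs, connection errors, timeouts, and HTTP error statuses.
        print(f"request failed: {exc}")
        return None
    return response.text


# A URL without a scheme is rejected locally, before any network traffic.
print(fetch_html("not-a-valid-url"))
```

For a real page, `fetch_html("https://www.cnblogs.com/Renyi-Fan/p/13264726.html")` should return the HTML as a string when the server answers with a success status.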

In [7]:

import requests
# Fetch data from Zhihu
headers = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}
response = requests.get("https://www.zhihu.com/",headers=headers)
print(response.status_code)
#print(response.text)

In [8]:

import requests
response = requests.get("http://httpbin.org/get")
print(response.text)
# Note the default header in the output below: "User-Agent": "python-requests/2.22.0"

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.22.0", 
    "X-Amzn-Trace-Id": "Root=1-5f0519fc-c5562180e38514b668891480"
  }, 
  "origin": "119.86.158.170", 
  "url": "http://httpbin.org/get"
}
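The httpbin output above shows the User-Agent that requests sends by default. The sketch below prepares the same request locally without sending it, so the default header and the effect of a custom `headers` dict can be compared offline. The shortened browser string `BROWSER_UA` is illustrative and not the full value used earlier.

```python
import requests

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # illustrative, shortened browser string

session = requests.Session()

# prepare_request() merges the session's default headers with the request's own headers.
default_req = session.prepare_request(requests.Request("GET", "http://httpbin.org/get"))
custom_req = session.prepare_request(
    requests.Request("GET", "http://httpbin.org/get", headers={"user-agent": BROWSER_UA})
)

print(default_req.headers["User-Agent"])  # python-requests/<installed version>
print(custom_req.headers["User-Agent"])   # the browser string; header names are case-insensitive
```

Because header names are matched case-insensitively, the lowercase "user-agent" key used in this post replaces the default "User-Agent" rather than being sent alongside it.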

Posted 2020-07-08 07:06 by 范仁义 (Fan Renyi)

