Getting Started with the Requests Library

The 7 main methods of the Requests library

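Method	Purpose
requests.request()	Constructs a request; the base method underlying all the methods below
requests.get()	Fetches an HTML page; corresponds to HTTP GET
requests.head()	Fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()	Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to an HTML page; corresponds to HTTP DELETE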

The requests.get() method

r = requests.get(url, params=None, **kwargs)

url: the URL of the page to fetch

params: extra parameters for the URL, as a dictionary or byte stream

**kwargs: 12 parameters that control access

r = requests.get(url)

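Step 1: construct a Request object that asks the server for a resource

Step 2: return a Response object that contains the server's resource

type(r) --> r is a Response object

The Response object contains everything the server returned, as well as the information from the original Request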

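Attribute	Description
r.status_code	HTTP status of the response; 200 means the request succeeded, 404 means the resource was not found
r.text	The HTTP response body as a string, i.e. the content of the page at the URL
r.encoding	The response encoding guessed from the HTTP header. This guess is not always accurate: if the header has no charset field, it defaults to ISO-8859-1, which cannot decode Chinese
r.apparent_encoding	The response encoding inferred from the content itself (the fallback encoding)
r.content	The HTTP response body in binary form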
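
The effect of a wrong r.encoding can be reproduced without any network access. The sketch below uses only built-in string methods; the sample text and variable names are illustrative assumptions, not part of the original notes. It shows why a page decoded as ISO-8859-1 comes out garbled, and why switching to the encoding detected from the content fixes it.

```python
# Bytes a server might send for a page containing Chinese text (UTF-8 encoded).
raw = '中文'.encode('utf-8')

# Decoding with ISO-8859-1 never fails, but every byte becomes its own
# character, so the result is garbled.
garbled = raw.decode('iso-8859-1')

# Decoding with the encoding the content actually uses recovers the text.
correct = raw.decode('utf-8')

print(repr(garbled))
print(correct)
```

This is the same situation as r.text with r.encoding left at ISO-8859-1 versus r.encoding set to r.apparent_encoding.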

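Requests library exceptions

Exception	Description
requests.ConnectionError	Network connection error
requests.HTTPError	HTTP error
requests.URLRequired	A URL is missing
requests.TooManyRedirects	The number of redirects exceeded the maximum
requests.ConnectTimeout	Timed out while connecting to the remote server
requests.Timeout	The request timed out, i.e. the whole process from sending the request to receiving the content took longer than the allowed time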
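
The exceptions in the table are related by inheritance, which matters when deciding what to catch. The following is a small sketch that only inspects the exception classes and makes no request; the variable names are illustrative.

```python
import requests

# ConnectTimeout inherits from both ConnectionError and Timeout, so an
# `except requests.Timeout` clause also catches connection timeouts.
is_connection_error = issubclass(requests.ConnectTimeout, requests.ConnectionError)
is_timeout = issubclass(requests.ConnectTimeout, requests.Timeout)
print(is_connection_error, is_timeout)

# Every exception in the table derives from requests.RequestException,
# so catching that one class covers all of them.
names = ['ConnectionError', 'HTTPError', 'URLRequired',
         'TooManyRedirects', 'ConnectTimeout', 'Timeout']
all_derived = all(
    issubclass(getattr(requests, name), requests.RequestException)
    for name in names
)
print(all_derived)
```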

Response exceptions

r.raise_for_status()	Checks r.status_code and raises HTTPError if it is a 4xx or 5xx error code; a 200 response passes without an exception
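
To see which status codes actually trigger the exception, the sketch below builds empty Response objects by hand and sets their status code, so no network is needed. The helper raises_http_error is a hypothetical name introduced for this illustration.

```python
import requests


def raises_http_error(status_code):
    """Return True if raise_for_status() raises HTTPError for this code."""
    r = requests.Response()        # empty Response built by hand, no request sent
    r.status_code = status_code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False


results = {code: raises_http_error(code) for code in (200, 204, 301, 404, 500)}
print(results)
```

Only the 4xx and 5xx codes raise; success and redirect codes pass silently.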

A general code framework for fetching web pages

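import requests
import ssl


# Disables certificate verification for urllib/http.client only;
# requests ignores this setting (pass verify=False to requests.get() instead)
ssl._create_default_https_context = ssl._create_unverified_context


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # make sure the text is decoded correctly
        return r.text
    except requests.RequestException:
        return 'An exception occurred!'


if __name__ == '__main__':
    url = 'http://www.baidu.com'
    print(getHTMLText(url))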

Requests methods and the HTTP protocol

HTTP: Hypertext Transfer Protocol
HTTP is a stateless application-layer protocol based on a request/response model
HTTP uses URLs to identify and locate network resources
URL format: http://host[:port][path]

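host: a valid Internet hostname or IP address

port: the port number; defaults to 80

path: the path of the requested resource on the target server or host
A URL is an Internet path for accessing a resource over HTTP; each URL corresponds to one data resource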
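
The three URL parts described above can be pulled apart with the standard library. This is a minimal sketch; the example URLs are made up for illustration.

```python
from urllib.parse import urlsplit

# A URL with an explicit port.
parts = urlsplit('http://www.example.com:8080/docs/index.html')
print(parts.hostname)   # host
print(parts.port)       # port
print(parts.path)       # path

# When the port is omitted, urlsplit reports None; HTTP then uses port 80.
no_port = urlsplit('http://www.example.com/docs/index.html')
effective_port = no_port.port or 80
print(effective_port)
```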

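HTTP operations on resources

Method	Description
GET	Requests the resource at the URL
HEAD	Requests only the header information of the resource at the URL. If the resource is large or expensive to fetch, get the headers first to decide whether it is the resource you actually want
POST	Requests that user-submitted data be appended to the resource at the URL, without changing what was submitted before
PUT	Requests that a resource be stored at the URL, overwriting whatever was there
PATCH	Requests a partial update of the resource at the URL, i.e. changing only part of its content
DELETE	Requests deletion of the resource at the URL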

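Difference between PATCH and PUT

Suppose the URL holds a record UserInfo with 20 fields, including UserID, UserName, and so on.

Requirement: the user wants to change UserName and leave everything else unchanged.

With PATCH, only the partial update for UserName is submitted to the URL.
With PUT, all 20 fields must be submitted to the URL together; any field left out is deleted.

The main benefit of PATCH is that it saves network bandwidth.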
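
The difference can be sketched with plain dictionaries standing in for the stored resource. Everything here is a toy model under stated assumptions: the four-field user_info record and the helpers apply_put and apply_patch are hypothetical, and a real server decides for itself how to handle each method.

```python
import json

# A hypothetical stored resource (4 fields instead of 20, for brevity).
user_info = {'UserID': 1, 'UserName': 'old_name',
             'Email': 'a@example.com', 'City': 'Beijing'}


def apply_put(resource, body):
    """PUT semantics: the submitted body replaces the whole resource."""
    return dict(body)


def apply_patch(resource, body):
    """PATCH semantics: only the submitted fields are changed."""
    updated = dict(resource)
    updated.update(body)
    return updated


patch_body = {'UserName': 'new_name'}
put_body = dict(user_info, UserName='new_name')   # all fields must be sent

after_patch = apply_patch(user_info, patch_body)
after_put = apply_put(user_info, put_body)
after_partial_put = apply_put(user_info, patch_body)   # unsent fields are lost

print(after_patch == after_put)
print(after_partial_put)
print(len(json.dumps(patch_body)), len(json.dumps(put_body)))
```

The last line compares the payload sizes, which is where the bandwidth saving of PATCH comes from.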

HTTP and the Requests library

The HTTP methods GET, HEAD, POST, PUT, PATCH, and DELETE correspond one-to-one in function to the Requests methods requests.get(), requests.head(), requests.post(), requests.put(), requests.patch(), and requests.delete()

Using head() in the Requests library

import requests


r = requests.head('http://httpbin.org/get')
print(r.headers)

-> {'Date': 'Fri, 10 Dec 2021 03:12:50 GMT', 'Content-Type': 'application/json', 'Content-Length': '308', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}

Using post() in the Requests library

import requests
"""Using post"""

payload = {
    'key1': 'value1',
    'key2': 'value2'
}
r = requests.post('http://httpbin.org/post', data=payload)
print(r.text)

->{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.26.0", 
    "X-Amzn-Trace-Id": "Root=1-61b2c6ce-73fa4750611789b955b64336"
  }, 
  "json": null, 
  "origin": "223.104.151.124", 
  "url": "http://httpbin.org/post"
}

A dictionary submitted via POST is automatically encoded as form data

import requests
"""Using post"""

r = requests.post('http://httpbin.org/post', data='ABC')
print(r.text)


->{
  "args": {}, 
  "data": "ABC", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "3", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.26.0", 
    "X-Amzn-Trace-Id": "Root=1-61b2c8dc-1d8157a4005e944f1209fa6b"
  }, 
  "json": null, 
  "origin": "223.104.151.100", 
  "url": "http://httpbin.org/post"
}

A string sent to the URL via POST is automatically placed in the data field
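
The two encodings can also be checked locally by preparing a request without sending it. This is a sketch using requests.Request(...).prepare(); the variable names form_req and raw_req are illustrative.

```python
import requests

url = 'http://httpbin.org/post'

# A dictionary is URL-encoded into a form body, with a matching Content-Type.
form_req = requests.Request(
    'POST', url, data={'key1': 'value1', 'key2': 'value2'}
).prepare()
print(form_req.body)
print(form_req.headers.get('Content-Type'))
print(form_req.headers.get('Content-Length'))

# A string is sent as the raw body, with no Content-Type added.
raw_req = requests.Request('POST', url, data='ABC').prepare()
print(raw_req.body)
print(raw_req.headers.get('Content-Type'))
print(raw_req.headers.get('Content-Length'))
```

The Content-Length values match the ones httpbin echoes back in the outputs above (23 and 3).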

Using put() in the Requests library

import requests
"""Using put"""

payload = {
    'key3': 'value11',
    'key4': 'value22'
}
r = requests.put('http://httpbin.org/put', data=payload)
print(r.text)

->{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key3": "value11", 
    "key4": "value22"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "25", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.26.0", 
    "X-Amzn-Trace-Id": "Root=1-61b2c9db-51a867c810e2faa17dc237c9"
  }, 
  "json": null, 
  "origin": "223.104.151.124", 
  "url": "http://httpbin.org/put"
}

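As with POST, a dictionary passed via PUT is encoded under form; on a real server, PUT overwrites the resource previously stored at the URL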

A closer look at the main Requests methods

requests.request()

requests.request(method, url, **kwargs)

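method: the request method, one of seven such as GET/PUT/POST
url: the URL of the page to fetch

**kwargs: parameters that control access, 13 in total.

params: dictionary or byte sequence, appended to the URL as query parameters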

import requests


dic = {
    'key1': 'value1',
    'key2': 'value2'
}
r = requests.request('GET', 'http://httpbin.org/get', params=dic)
print(r.url)

->http://httpbin.org/get?key1=value1&key2=value2
# The dictionary's contents are appended to the URL

data: dictionary, byte sequence, or file object, sent as the body of the Request

import requests


dic = {
    'key1': 'value1',
    'key2': 'value2'
}
r = requests.request('POST', 'http://httpbin.org/post', data=dic)
print(r.text)
-------------------------------------
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.26.0", 
    "X-Amzn-Trace-Id": "Root=1-61b2fbe5-59a8d6916ef7981471d4afde"
  }, 
  "json": null, 
  "origin": "223.104.151.50", 
  "url": "http://httpbin.org/post"
}

import requests


d = 'body'
r = requests.request('POST', 'http://httpbin.org/post', data=d)
print(r.text)
---------------------------------
{
  "args": {}, 
  "data": "body", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "4", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.26.0", 
    "X-Amzn-Trace-Id": "Root=1-61b2fc2b-7f33c33014341c80049d902f"
  }, 
  "json": null, 
  "origin": "223.104.151.50", 
  "url": "http://httpbin.org/post"
}

json: data in JSON format, sent as the body of the Request

import requests

dic = {
    'key1': 'value1',
    'key2': 'value2'
}
r = requests.request('POST', 'http://httpbin.org/post', json=dic)
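
What the json parameter does to the outgoing request can be inspected without sending anything. The sketch below prepares the same request locally; the variable name req is illustrative.

```python
import json
import requests

dic = {'key1': 'value1', 'key2': 'value2'}

# Prepare the request locally instead of sending it.
req = requests.Request('POST', 'http://httpbin.org/post', json=dic).prepare()

# The dictionary is serialized to JSON and the Content-Type is set to match.
content_type = req.headers['Content-Type']
decoded_body = json.loads(req.body)
print(content_type)
print(decoded_body)
```

Compared with data=dic, the body is a JSON document rather than a URL-encoded form.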

headers: dictionary of custom HTTP headers

import requests

hd = {'user-agent': 'Chrome/10'}
r = requests.request('POST', 'http://httpbin.org/post', headers=hd)

auth: tuple, enables HTTP authentication
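
As a rough illustration of the auth parameter, a (username, password) tuple is turned into an HTTP Basic Authorization header. The sketch prepares the request locally and sends nothing; the credentials 'user' and 'pass' are placeholders chosen for this example.

```python
import base64
import requests

# Prepare a request with a (username, password) tuple; nothing is sent.
req = requests.Request(
    'GET', 'http://httpbin.org/basic-auth/user/pass', auth=('user', 'pass')
).prepare()

auth_header = req.headers['Authorization']
print(auth_header)

# Basic auth is "Basic " followed by base64("username:password").
expected = 'Basic ' + base64.b64encode(b'user:pass').decode('ascii')
print(auth_header == expected)
```
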
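files: dictionary, used to upload files

import requests

# Open the file in binary mode and close it once the upload finishes
with open('data.xlsx', 'rb') as f:
    fs = {'file': f}
    r = requests.request('POST', 'http://httpbin.org/post', files=fs)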

timeout: sets the timeout, in seconds

import requests

r = requests.request('GET', 'http://httpbin.org/get', timeout=10)

proxies: dictionary; sets proxy servers, and can include login credentials

>>> pxs = {'http': 'http://user:pass@10.10.10.1:1234',
...        'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)

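allow_redirects: True/False, default True; switch for following redirects
stream: True/False, default False; when False, the response body is downloaded immediately
verify: True/False, default True; switch for verifying the SSL certificate
cert: path to a local SSL certificate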

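Reusing a session. If you log in to a site with requests.post() and then want to fetch your own information, you need a further requests.get() to the profile page. That is like opening two separate browsers: the two calls are independent, belong to two unrelated sessions, and the second cannot see the login state. Attaching the same cookie to both requests by hand is tedious; sharing one Session object is simpler.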
```
"""Reusing a session"""
import requests


s = requests.Session()
s.get('https://httpbin.org/cookies/set/number/123456789')
r = s.get('https://httpbin.org/cookies')  # use the same Session so the cookie is sent
print(r.text)

----------------------------
{
  "cookies": {"number": "123456789"}
}
```

The output of this program contains the cookie
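
The reason the Session matters can be shown offline by comparing the headers of two prepared requests. In this sketch the cookie is set by hand with s.cookies.set() as a stand-in for one a server would set; the variable names are illustrative.

```python
import requests

url = 'https://httpbin.org/cookies'

s = requests.Session()
s.cookies.set('number', '123456789')   # stands in for a cookie set by the server

# A request prepared through the Session carries the Session's cookies.
with_session = s.prepare_request(requests.Request('GET', url))
session_cookie = with_session.headers.get('Cookie')
print(session_cookie)

# A request prepared on its own knows nothing about that Session.
standalone = requests.Request('GET', url).prepare()
standalone_cookie = standalone.headers.get('Cookie')
print(standalone_cookie)
```

A plain requests.get() behaves like the standalone request, which is why both calls need to go through the same Session object.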