Python requests: timeouts and retries

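1. Background:

The requests module is a staple of Python web scraping, but it also shows up in everyday work: for example, sending a POST request to a partner's API URL to push data, or using a GET request to pull data.

One situation has to be planned for, though: what to do when a request times out, and how to retry after a timeout.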

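2. Connect timeout and read timeout:

A timeout can be either a connect timeout or a read timeout.

Connect timeout

The connect timeout is the number of seconds requests waits for a connection to the remote host to be established.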

import requests
import datetime

url = 'http://www.google.com.hk'
start = datetime.datetime.now()
print('start', start)
try:
    html = requests.get(url, timeout=5).text
    print('success')
except requests.exceptions.RequestException as e:
    print(e)
end = datetime.datetime.now()
print('end', end)
print('Elapsed: {time}'.format(time=(end - start)))

# Result:
# start 2019-11-28 14:19:24.249588
# HTTPConnectionPool(host='www.google.com.hk', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x000001D8ECB1CCC0>, 'Connection to www.google.com.hk timed out. (connect timeout=5)'))
# end 2019-11-28 14:19:29.262519
# Elapsed: 0:00:05.012931

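Because Google is blocked on this network, the connection cannot be established, and the error message reports a connect timeout.

Even without timeout=5, the connection attempt still fails eventually, here after about 21 seconds. That limit is not a requests default (requests applies no timeout unless you pass one); it comes from the operating system giving up on the TCP connection, as the WinError 10060 in the output shows.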

start 2019-11-28 15:00:36.441117
HTTPConnectionPool(host='www.google.com.hk', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000023130B9CCC0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
end 2019-11-28 15:00:57.459768
Elapsed: 0:00:21.018651

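Read timeout

The read timeout is the number of seconds the client waits for the server to send a response. Specifically, it is the time the client waits between bytes sent by the server; in most cases, that means the time before the server sends the first byte.

In short:

  connect timeout ==> the maximum time between starting the connection attempt and the connection being established

  read timeout ==> the maximum time to wait, once connected, for the server to start returning the response

So if you pass a single value as timeout, that value is used for both the connect and the read timeout. To set them separately, pass a tuple of (connect, read):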

r = requests.get('https://github.com', timeout=5)
r = requests.get('https://github.com', timeout=(0.5, 27))
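
To tell the two timeouts apart in code, you can catch requests.exceptions.ConnectTimeout and requests.exceptions.ReadTimeout separately. The following is only a sketch: SlowHandler and classify are made-up names, and the slow server is a local stand-in built with the standard library so the example does not depend on any real site.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that waits 1 second before answering."""

    def do_GET(self):
        time.sleep(1)
        try:
            self.send_response(200)
            self.send_header('Content-Length', '2')
            self.end_headers()
            self.wfile.write(b'ok')
        except OSError:
            pass  # the client has already given up and closed the socket

    def log_message(self, *args):
        pass  # keep the demo output quiet


def classify(url, timeout):
    """Return 'success', 'connect timeout' or 'read timeout' for one GET."""
    session = requests.Session()
    session.trust_env = False  # ignore proxy environment variables for this local demo
    try:
        session.get(url, timeout=timeout)
        return 'success'
    except requests.exceptions.ConnectTimeout:
        return 'connect timeout'
    except requests.exceptions.ReadTimeout:
        return 'read timeout'


# Port 0 lets the OS pick a free port.
server = ThreadingHTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:{}/'.format(server.server_port)

fast_read = classify(url, timeout=(1, 0.2))  # the server needs 1 s, we allow 0.2 s
slow_read = classify(url, timeout=(1, 5))    # 5 s is more than enough
print(fast_read, slow_read)
server.shutdown()
```

Both exception classes inherit from requests.exceptions.Timeout, so a single except clause on Timeout is enough when the distinction does not matter.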

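Example: a request with a 5-second connect timeout and a 10-second read timeout (at most 15 seconds of waiting across the two phases):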

import datetime
import requests

url_login = 'http://www.heibanke.com/accounts/login/?next=/lesson/crawler_ex03/'

session = requests.Session()
session.get(url_login)

token = session.cookies['csrftoken']
session.post(url_login, data={'csrfmiddlewaretoken': token, 'username': 'xx', 'password': 'xx'})

start = datetime.datetime.now()
print('start', start)

url_pw = 'http://www.heibanke.com/lesson/crawler_ex03/pw_list/'
try:
    html = session.get(url_pw, timeout=(5, 10)).text
    print('success')
except requests.exceptions.RequestException as e:
    print(e)

end = datetime.datetime.now()
print('end', end)
print('Elapsed: {time}'.format(time=(end - start)))

# Result:
# start 2019-11-28 19:32:20.589827
# success
# end 2019-11-28 19:32:22.590872
# Elapsed: 0:00:02.001045

If the timeout is set to timeout=(1, 0.5) instead, the error message reports a read timeout:

start 2019-11-28 19:36:38.503593
HTTPConnectionPool(host='www.heibanke.com', port=80): Read timed out. (read timeout=0.5)
end 2019-11-28 19:36:39.005271
Elapsed: 0:00:00.501678

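The read timeout has no default value. If you don't set one, the request waits indefinitely. This is a common reason a crawler hangs without printing any error: the server has stopped responding and nothing ever times out.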
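
One way to avoid forgetting the timeout is to build it into the session. The sketch below subclasses requests.Session and fills in a timeout whenever the caller does not pass one. TimeoutSession and default_timeout are illustrative names, not part of requests, and the (5, 10) values are arbitrary.

```python
import requests


class TimeoutSession(requests.Session):
    """Session that applies a default timeout when the caller gives none.

    TimeoutSession and default_timeout are made-up names for this sketch.
    """

    def __init__(self, default_timeout=(5, 10)):
        super().__init__()
        self.default_timeout = default_timeout  # (connect, read) in seconds

    def request(self, method, url, **kwargs):
        # Only fill in the timeout if the caller did not pass one.
        kwargs.setdefault('timeout', self.default_timeout)
        return super().request(method, url, **kwargs)


session = TimeoutSession(default_timeout=(5, 10))
# session.get(url) now behaves like session.get(url, timeout=(5, 10));
# session.get(url, timeout=1) still overrides the default for that one call.
```

Session.get, Session.post and the other helpers all go through Session.request, which is why overriding that one method is enough.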

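Retrying after a timeout

Usually you don't give up on the first timeout; instead, you set up a mechanism that retries the request several times.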

import requests
import datetime

url = 'http://www.google.com.hk'

def gethtml(url):
    # Try the request up to 3 times and return the body on the first success.
    i = 0
    while i < 3:
        start = datetime.datetime.now()
        print('start', start)
        try:
            html = requests.get(url, timeout=5).text
            return html
        except requests.exceptions.RequestException:
            i += 1
        end = datetime.datetime.now()
        print('end', end)
        print('Elapsed: {time}'.format(time=(end - start)))
    # Every attempt failed.
    return None

if __name__ == '__main__':
    gethtml(url)
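
The loop above retries immediately and on any RequestException. A possible variation, sketched here under assumptions, waits a little longer before each new attempt and retries only on timeouts and connection errors. The function name get_with_retry and its parameters are hypothetical; the get argument exists only so the function can be exercised without a network.

```python
import time

import requests


def get_with_retry(url, attempts=3, delay=1.0, timeout=(5, 10), get=requests.get):
    """Call get(url, timeout=...) up to `attempts` times (attempts >= 1).

    Sleeps delay, 2*delay, 4*delay, ... seconds between attempts and
    re-raises the last error if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return get(url, timeout=timeout)
        except (requests.exceptions.Timeout,
                requests.exceptions.ConnectionError) as error:
            last_error = error
            if attempt < attempts - 1:  # no sleep after the final attempt
                time.sleep(delay * 2 ** attempt)
    raise last_error
```

A call such as get_with_retry('http://www.google.com.hk', attempts=3, timeout=5) would then replace the while loop. Other errors, such as an invalid URL, are raised at once because retrying cannot fix them.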

In fact, requests already ships with a ready-made mechanism for this:

import time
import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))

print(time.strftime('%Y-%m-%d %H:%M:%S'))
try:
    r = s.get('http://www.google.com.hk', timeout=5)
    print(r.text)
except requests.exceptions.RequestException as e:
    print(e)
print(time.strftime('%Y-%m-%d %H:%M:%S'))

# 2019-11-28 19:48:05
# HTTPConnectionPool(host='www.google.com.hk', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x0000019D8D88D208>, 'Connection to www.google.com.hk timed out. (connect timeout=5)'))
# 2019-11-28 19:48:25

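max_retries is the maximum number of retries. With 3 retries plus the initial request, there are 4 attempts in total, which is why the code above takes 20 seconds rather than 15.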
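
HTTPAdapter also accepts a urllib3 Retry object in place of the integer, which allows a pause between retries. The sketch below uses the total, backoff_factor and status_forcelist parameters of Retry; the specific values are assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                                # at most 3 retries after the first attempt
    backoff_factor=0.5,                     # pause between retries, growing exponentially
    status_forcelist=[500, 502, 503, 504],  # also retry when the server answers with these codes
)

s = requests.Session()
adapter = HTTPAdapter(max_retries=retry)
s.mount('http://', adapter)
s.mount('https://', adapter)

# Usage is unchanged, and the timeout still has to be passed per request:
# r = s.get('http://www.google.com.hk', timeout=(5, 10))
```

With a backoff in place, the total running time is the sum of the per-attempt timeouts plus the pauses, so it will exceed the 20 seconds measured above.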

posted @ 2019-11-28 19:50 MrSu
