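day10 - Concurrent web scraping with gevent coroutines
Overview
Earlier we saw that gevent switches to another coroutine automatically whenever it hits an I/O operation. Now let's use gevent to fetch some real web pages.
Fetching pages serially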
from urllib import request
import time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request and get a response object
    data = resp.read()  # read the response body
    # note: every URL is written to the same file, so each page overwrites the previous one
    with open("url.html", "wb") as out:
        out.write(data)
    print('%d bytes received from %s.' % (len(data), url))

urls = ['https://www.python.org',
        'https://www.yahoo.com',
        'https://github.com'
        ]
time_start = time.time()
for i in urls:
    f(i)
print("sync cost:", time.time() - time_start)

# Output
GET: https://www.python.org
48893 bytes received from https://www.python.org.
GET: https://www.yahoo.com
505354 bytes received from https://www.yahoo.com.
GET: https://github.com
51489 bytes received from https://github.com.
sync cost: 3.7278189659118652
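One thing to watch in f: every page is saved as url.html, so each download overwrites the one before it. Here is a minimal sketch of one way around this, assuming a hypothetical helper filename_for (not part of the original code) that derives a file name from the URL's host:

```python
from urllib.parse import urlparse


def filename_for(url):
    """Hypothetical helper: build a per-site file name from the URL's host."""
    host = urlparse(url).netloc  # e.g. "www.python.org"
    # A host may carry a port ("localhost:8001"); ':' is awkward in file names.
    return host.replace(":", "_") + ".html"


for u in ["https://www.python.org", "https://www.yahoo.com", "https://github.com"]:
    print(u, "->", filename_for(u))
```

Inside f, open(filename_for(url), "wb") could then replace open("url.html", "wb").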
Fetching pages with gevent coroutines
from urllib import request
import gevent, time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as out:
        out.write(data)
    print('%d bytes received from %s.' % (len(data), url))

async_time_start = time.time()
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print("async cost:", time.time() - async_time_start)  # elapsed time

# Output
GET: https://www.python.org/
48893 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
504992 bytes received from https://www.yahoo.com/.
GET: https://github.com/
51489 bytes received from https://github.com/.
async cost: 4.873842000961304
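Comparing the two timings (about 3.73 s serially versus 4.87 s with gevent), the coroutine version is no faster. Why?
By default urllib knows nothing about gevent. It uses the standard library's blocking sockets, so when a coroutine calls urlopen, gevent cannot see that an I/O operation is in progress and never gets a chance to switch. urllib, like the plain socket module we studied earlier, therefore does not cooperate with gevent as is: each coroutine blocks until its download finishes, the three requests still run one after another, and the total time is about the same as in the serial version.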
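The same effect can be reproduced without any network access. The sketch below assumes gevent is installed and that monkey patching has not been applied in the process; run_three is a hypothetical helper name. It runs three coroutines that each wait once, first with the blocking time.sleep and then with the cooperative gevent.sleep:

```python
import time

import gevent


def run_three(sleep_func, delay=0.1):
    """Run three coroutines that each wait once; return the elapsed seconds."""
    start = time.monotonic()
    gevent.joinall([gevent.spawn(sleep_func, delay) for _ in range(3)])
    return time.monotonic() - start


# Unpatched time.sleep blocks the whole thread, so the waits add up.
blocking = run_three(time.sleep)
# gevent.sleep yields to the event loop, so the waits overlap.
cooperative = run_three(gevent.sleep)

print("blocking:    %.2fs" % blocking)
print("cooperative: %.2fs" % cooperative)
```

The blocking run should take roughly three times the delay and the cooperative run roughly one delay, which mirrors the difference between unpatched and patched urllib.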
So how do we make gevent aware that urllib is performing I/O? Apply the monkey patch by calling monkey.patch_all():
from gevent import monkey
monkey.patch_all()  # replace the blocking standard-library I/O with gevent's cooperative versions; do this before the other imports

from urllib import request
import gevent, time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as out:
        out.write(data)
    print('%d bytes received from %s.' % (len(data), url))

async_time_start = time.time()
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print("async cost:", time.time() - async_time_start)  # elapsed time

# Output
GET: https://www.python.org/
GET: https://www.yahoo.com/
GET: https://github.com/
48893 bytes received from https://www.python.org/.
51489 bytes received from https://github.com/.
504160 bytes received from https://www.yahoo.com/.
async cost: 1.559157133102417
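Explanation: monkey.patch_all() replaces the blocking parts of the standard library (socket, ssl, time.sleep and so on) with gevent's cooperative versions. urllib is built on those modules, so every point where it would block on network I/O now yields to gevent, much as an explicit gevent.sleep() would. As soon as one coroutine starts waiting, gevent switches to another. That is why all three GET lines are printed before any response arrives, and why the total drops to about 1.56 s.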
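To confirm that the patch took effect, gevent's monkey module can report which standard-library modules it replaced. Here is a small sketch, assuming gevent is installed; patched_report is a hypothetical helper name:

```python
from gevent import monkey
monkey.patch_all()  # patch first, before the other imports

import socket

import gevent.socket


def patched_report(names=("socket", "ssl", "time")):
    """Hypothetical helper: map each module name to whether gevent patched it."""
    return {name: monkey.is_module_patched(name) for name in names}


print(patched_report())
# After patching, the stdlib name refers to gevent's cooperative socket class.
print(socket.socket is gevent.socket.socket)
```

If a module shows up as unpatched, code that uses it will still block the whole thread.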
Multi-socket concurrency in a single thread with gevent
Server
import gevent
from gevent import socket, monkey
monkey.patch_all()

def server(port):
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)  # start listening for incoming TCP connections
    while True:
        cli, addr = s.accept()  # cli is the connection object the server creates for this client
        gevent.spawn(handle_request, cli)  # handle each client in its own coroutine

def handle_request(conn):
    try:
        while True:
            data = conn.recv(1024)  # receive data sent by the client
            if not data:  # empty bytes: the client closed the connection
                break
            print("recv:", data)
            conn.sendall(data)  # echo the data back
    except Exception as ex:
        print(ex)
    finally:
        conn.close()

if __name__ == '__main__':
    server(8001)

# Output
recv: b'huwei'
recv: b'123'
Client
import socket

HOST = 'localhost'  # The remote host
PORT = 8001  # The same port as used by the server
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create the client socket
s.connect((HOST, PORT))  # connect to the server
try:
    while True:
        msg = bytes(input(">>:"), encoding="utf8")
        if not msg:  # skip empty input; nothing would be sent and recv() would wait forever
            continue
        s.sendall(msg)  # send the data to the server
        data = s.recv(1024)  # receive the echoed data
        print('Received', data)
finally:
    s.close()  # reached on Ctrl+C or an error

# Output
>>:huwei
Received b'huwei'
>>:123
Received b'123'
>>:
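The server and client above run as two separate programs. To see several sockets served by one thread in a single run, here is a self-contained sketch that puts an echo server and a few clients in one process. It assumes gevent is installed and uses gevent.socket directly, so no monkey patching is needed; accept_loop, echo_once and run_demo are hypothetical names, and the listener binds to port 0 so the OS picks a free port:

```python
import gevent
from gevent import socket  # cooperative drop-in for the stdlib socket module


def handle(conn):
    """Echo everything back until the peer closes the connection."""
    try:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)
    finally:
        conn.close()


def accept_loop(listener):
    """Accept connections forever, one coroutine per client."""
    while True:
        conn, _addr = listener.accept()
        gevent.spawn(handle, conn)


def echo_once(port, payload):
    """Send one message and collect the echo until it is complete."""
    s = socket.socket()
    try:
        s.connect(("127.0.0.1", port))
        s.sendall(payload)
        received = b""
        while len(received) < len(payload):
            chunk = s.recv(1024)
            if not chunk:
                break
            received += chunk
        return received
    finally:
        s.close()


def run_demo(messages):
    """Start the server, run one client per message, return the echoes."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    listener.listen(16)
    port = listener.getsockname()[1]
    server = gevent.spawn(accept_loop, listener)
    clients = [gevent.spawn(echo_once, port, m) for m in messages]
    gevent.joinall(clients, timeout=5)
    server.kill()  # stop accepting new connections
    listener.close()
    return [c.value for c in clients]


if __name__ == "__main__":
    print(run_demo([b"huwei", b"123"]))
```

All clients and handlers share one thread; gevent switches between them whenever one is waiting on recv, accept or connect.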
