Day 10: Crawling Web Pages Concurrently with gevent Coroutines
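1. Introduction
So far we have only discussed in theory how gevent switches automatically when it hits I/O. Now let's put it into practice: we will use coroutines to crawl several web pages and see how gevent achieves concurrency.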
2. Crawling pages serially
2.1 Serial crawler
Note: let's first look at a serial crawler and measure how long it takes.
from urllib import request  # simple crawling module; use something else for complex jobs
import time
def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)  # open the URL and get a response object
    data = resp.read()  # read the downloaded bytes
    with open("url.html", "wb") as out:  # note: every URL is written to the same file
        out.write(data)
    print('{0} bytes received from {1}'.format(len(data), url))
urls = [
    'http://www.163.com/',
    'https://www.yahoo.com/',
    'https://github.com/'
]
time_start = time.time()  # start time
for url in urls:
    f(url)
print("sync cost", time.time() - time_start)  # time consumed by the run
The output is:
GET:http://www.163.com/
658380 bytes received from http://www.163.com/
GET:https://www.yahoo.com/
468153 bytes received from https://www.yahoo.com/
GET:https://github.com/
55467 bytes received from https://github.com/
sync cost 5.505090951919556  # time consumed by the program
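One side effect of f as written is that every page goes into the same url.html, so each download overwrites the previous one. If you want to keep all the pages, the sketch below shows one possible way to derive a separate file name per URL. The helper name filename_for and the host-plus-hash naming scheme are assumptions made for illustration; they are not part of the original script.

```python
import hashlib
from urllib.parse import urlsplit

def filename_for(url):
    """Build a filesystem-friendly .html name that differs per URL (hypothetical helper)."""
    # Use the host part of the URL as a readable prefix; ':' (from a port) is replaced.
    host = urlsplit(url).netloc.replace(":", "_") or "page"
    # A short hash of the full URL keeps two pages on the same host apart.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{host}_{digest}.html"

for url in ["http://www.163.com/", "https://www.yahoo.com/", "https://github.com/"]:
    print(filename_for(url))
```

Inside f, open(filename_for(url), "wb") could then take the place of open("url.html", "wb").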
2.2 gevent coroutine crawler
Note: the previous run was serial. Now let's run the same requests concurrently with gevent and see the effect.
from urllib import request
import gevent, time
def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as out:
        out.write(data)
    print('{0} bytes received from {1}'.format(len(data), url))
async_time_start = time.time()
gevent.joinall([  # start the coroutines with gevent
    gevent.spawn(f, 'http://www.163.com/'),  # the second value is the argument passed to f; earlier examples took no arguments, so we did not cover it
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print("async cost", time.time() - async_time_start)  # elapsed time
The output is:
GET:http://www.163.com/
658380 bytes received from http://www.163.com/
GET:https://www.yahoo.com/
466264 bytes received from https://www.yahoo.com/
GET:https://github.com/
55459 bytes received from https://github.com/
async cost 6.204461574554443  # elapsed time
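Question: I used concurrency, so why did the run time get longer instead of shorter?
Out of the box, urllib knows nothing about gevent. When it is called from a gevent coroutine it simply blocks: gevent cannot detect urllib's I/O, so it never switches to another coroutine and the requests end up running serially. The same goes for the standard socket module we studied earlier. Handing either of them to gevent unmodified does not help, because gevent has no idea that I/O is taking place, and the program just stalls on each blocking call.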
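To make the point concrete, here is a toy illustration of why a wait that the scheduler cannot see turns everything serial. It is not how gevent is implemented; it is a small generator-based round-robin scheduler, and the names cooperative_task, blocking_task and run_all are made up for this sketch.

```python
from collections import deque

def cooperative_task(name, steps, log):
    """Yields at every simulated I/O wait, so other tasks get a turn."""
    for i in range(steps):
        log.append(f"{name} step {i}")
        yield  # hand control back to the scheduler, like a detected I/O wait

def blocking_task(name, steps, log):
    """Its waits are invisible to the scheduler: it yields only when all work is done."""
    for i in range(steps):
        log.append(f"{name} step {i}")  # a wait here would block everyone
    yield

def run_all(tasks):
    """Round-robin scheduler: resume each task until its next yield."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, do not reschedule it
        queue.append(task)

coop_log = []
run_all([cooperative_task("A", 2, coop_log), cooperative_task("B", 2, coop_log)])
print(coop_log)   # ['A step 0', 'B step 0', 'A step 1', 'B step 1']

block_log = []
run_all([blocking_task("A", 2, block_log), blocking_task("B", 2, block_log)])
print(block_log)  # ['A step 0', 'A step 1', 'B step 0', 'B step 1']
```

The second run is the unpatched urllib situation in miniature: each task runs to completion before the next one starts, because nothing ever tells the scheduler that it could switch.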
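3. Crawling pages concurrently
Since the approach above does not work, how do we let gevent know that urllib is doing I/O?
Answer: apply a patch. Import monkey from gevent and add a single line, monkey.patch_all(); nothing else in the program needs to change.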
3.1 Code
from gevent import monkey  # import monkey
monkey.patch_all()  # make the program's blocking I/O gevent-aware; this one line is enough, and it should run before the other imports
from urllib import request
import gevent, time
def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as out:
        out.write(data)
    print('{0} bytes received from {1}'.format(len(data), url))
urls = [
    'http://www.163.com/',
    'https://www.yahoo.com/',
    'https://github.com/'
]
time_start = time.time()
for url in urls:
    f(url)
print("sync cost", time.time() - time_start)  # timing of the serial run
async_time_start = time.time()
gevent.joinall([
    gevent.spawn(f, 'http://www.163.com/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print("async cost", time.time() - async_time_start)  # timing of the concurrent run
The output is:
GET:http://www.163.com/
658315 bytes received from http://www.163.com/
GET:https://www.yahoo.com/
467577 bytes received from https://www.yahoo.com/
GET:https://github.com/
55467 bytes received from https://github.com/
sync cost 4.895136833190918  # result of the serial run
GET:http://www.163.com/
GET:https://www.yahoo.com/
GET:https://github.com/
658315 bytes received from http://www.163.com/
471042 bytes received from https://www.yahoo.com/
55467 bytes received from https://github.com/
async cost 3.0067789554595947  # result of the concurrent run
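There, you can see the effect. The gap is not huge (about 4.9 s versus 3.0 s in this run), and network conditions play a part as well. The key point is that the patch is required. Conceptually, the patch puts a marker in front of every place in urllib that may perform I/O, and that marker behaves like gevent.sleep(): as soon as urllib would block, gevent switches to another coroutine. In practice, patch_all() does this by replacing blocking standard-library modules such as socket with gevent's cooperative versions.
Note: gevent.sleep() only simulates an I/O operation. The marker means "this is an I/O operation; switch whenever it blocks".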
4. Multi-socket concurrency in a single thread with gevent
4.1 Server
from gevent import monkey
monkey.patch_all()  # patch before anything else is imported
import gevent
from gevent import socket
def server(port):
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)
    while True:
        cli, addr = s.accept()
        gevent.spawn(handle_request, cli)  # handle each connection in its own coroutine
def handle_request(conn):
    try:
        while True:
            data = conn.recv(1024)
            if not data:  # empty bytes: the client closed the connection
                break
            print("recv:", data)
            conn.sendall(data)  # echo the data back
    except Exception as ex:
        print(ex)
    finally:
        conn.close()
if __name__ == '__main__':
    server(8001)
4.2 Client
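import socket
HOST = 'localhost'  # The remote host
PORT = 8001  # The same port as used by the server
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((HOST, PORT))
try:
    while True:
        msg = bytes(input(">>:"), encoding="utf8")
        if not msg:  # sending nothing would leave recv() waiting forever
            continue
        s.sendall(msg)
        data = s.recv(1024)
        print('Received', repr(data))
finally:
    s.close()  # runs when the loop is interrupted (e.g. Ctrl+C) or an error occurs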
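The interactive client is handy for poking at the server by hand, but a scripted round trip is easier to repeat. Below is a minimal sketch using only the standard socket module; the function name echo_roundtrip and its timeout parameter are assumptions for this example, and it expects an echo server that sends back exactly the bytes it receives.

```python
import socket

def echo_roundtrip(host, port, payload, timeout=5.0):
    """Send payload to an echo server and return the bytes echoed back."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        received = bytearray()
        # TCP is a byte stream, so the echo may arrive in several chunks.
        while len(received) < len(payload):
            chunk = conn.recv(1024)
            if not chunk:  # the server closed the connection early
                break
            received.extend(chunk)
        return bytes(received)
```

With the server from section 4.1 running, echo_roundtrip("localhost", 8001, b"hello") should return the same bytes. Calling it from several threads or processes at once is one way to watch the single-threaded server handle multiple connections.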
