day10 - Coroutine Crawler Case Study
1. Preface
So far we have only covered, in theory, how gevent switches automatically on IO. Now let's put it into practice: we will write a small crawler and see how gevent achieves a concurrent effect.
2. Crawling Pages Serially
2.1 Serial version
Note: first let's look at the serial version of the crawler and measure how long it takes.
```python
from urllib import request
import time
# from gevent import monkey
import gevent
# monkey.patch_all()

def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    # note: both pages are written to the same file, so the second
    # download overwrites the first
    with open("url.html", "wb") as fp:   # fp, to avoid shadowing f
        fp.write(data)
    print("{0} bytes received from {1}".format(len(data), url))

urls = [
    "http://www.163.com",
    "http://www.sina.com.cn",
]

time_start = time.time()
for url in urls:
    f(url)
print("sync cost", time.time() - time_start)
```
Output:
GET:http://www.163.com
690840 bytes received from http://www.163.com
GET:http://www.sina.com.cn
604430 bytes received from http://www.sina.com.cn
sync cost 1.0000014305114746
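The serial pattern above reduces to a simple timing sketch. Here `time.sleep()` stands in for the blocking `request.urlopen()` call, and the 0.1-second delays are made-up values for illustration:

```python
import time

def fetch(delay):
    # stand-in for a blocking download such as request.urlopen()
    time.sleep(delay)
    return delay

start = time.time()
results = [fetch(0.1), fetch(0.1)]   # one after the other, like the for loop above
elapsed = time.time() - start
# serial execution: total time is roughly the SUM of the individual delays
print(round(elapsed, 1))
```

This is the baseline that any concurrent version has to beat: two downloads that overlap should cost roughly the time of the slower one, not the sum.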
2.2 gevent coroutine crawler
Note: the code above ran serially; now let's run the same downloads "concurrently" with gevent and compare.
```python
from urllib import request
import time
# from gevent import monkey
import gevent
# monkey.patch_all()

def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as fp:
        fp.write(data)
    print("{0} bytes received from {1}".format(len(data), url))

urls = [
    "http://www.163.com",
    "http://www.sina.com.cn",
]

async_time_start = time.time()
gevent.joinall([
    gevent.spawn(f, "http://www.163.com"),
    gevent.spawn(f, "http://www.sina.com.cn"),
])
print("async cost", time.time() - async_time_start)
```
Output:
GET:http://www.163.com
690840 bytes received from http://www.163.com
GET:http://www.sina.com.cn
604441 bytes received from http://www.sina.com.cn
async cost 0.9600014686584473
Question: I used concurrency, so why did the run time barely drop? There is no visible gain in efficiency.
By default urllib knows nothing about gevent. When a greenlet calls into urllib, the call simply blocks: gevent cannot detect urllib's IO operations, has no idea an IO operation is even in progress, and therefore never switches, so the downloads effectively run serially. Unlike the socket examples we wrote earlier, handing plain urllib calls to gevent does not help; each greenlet just sits there until its download finishes.
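The same failure mode can be reproduced with the standard library alone. Here asyncio plays the role of gevent's event loop (this is an analogy, not gevent itself): an event loop cannot preempt a blocking call it does not know about, so two "concurrent" tasks still run one after the other:

```python
import asyncio
import time

async def blocking_fetch():
    # time.sleep() blocks without notifying the event loop --
    # exactly how urllib.request behaves inside an unpatched greenlet
    time.sleep(0.1)

async def main():
    start = time.time()
    # gather() schedules both tasks "concurrently", but each one
    # monopolizes the loop while its blocking call runs
    await asyncio.gather(blocking_fetch(), blocking_fetch())
    return time.time() - start

elapsed = asyncio.run(main())
# total time is roughly the SUM of the sleeps: the tasks serialized
print(round(elapsed, 1))
```

An event loop only gains anything when the blocking operation cooperates with it; that is exactly the gap monkey-patching closes for gevent.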
3. Crawling Pages Concurrently
Given the result above, how do we make gevent aware that urllib is performing IO?
Answer: by patching. Import monkey from gevent and add a single line, monkey.patch_all(), at the top of the program; nothing else needs to change.
3.1 Code
```python
from urllib import request
import time
from gevent import monkey
import gevent

# mark every IO operation in the program as a switch point;
# this one line is all it takes
monkey.patch_all()

def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as fp:
        fp.write(data)
    print("{0} bytes received from {1}".format(len(data), url))

urls = [
    "http://www.163.com",
    "http://www.sina.com.cn",
]

time_start = time.time()
for url in urls:
    f(url)
print("sync cost", time.time() - time_start)

async_time_start = time.time()
gevent.joinall([
    gevent.spawn(f, "http://www.163.com"),
    gevent.spawn(f, "http://www.sina.com.cn"),
])
print("async cost", time.time() - async_time_start)
```
Now you can see the effect. The gap may still be modest, partly because network latency dominates, but the patch is what makes the switching possible at all. monkey.patch_all() marks every potentially blocking IO call in modules like urllib; the mark works like gevent.sleep(): as soon as the call would block, gevent switches to another greenlet. Note that gevent.sleep() itself only simulates an IO operation; the mark means "this is IO — switch when it blocks."
4. gevent: Multi-socket Concurrency in a Single Thread
4.1 Server
```python
import gevent
from gevent import socket, monkey

monkey.patch_all()

def server(port):
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)
    while True:
        cli, addr = s.accept()
        gevent.spawn(handle_request, cli)   # one greenlet per connection

def handle_request(conn):
    try:
        while True:
            data = conn.recv(1024)
            if not data:                    # peer closed the connection
                conn.shutdown(socket.SHUT_WR)
                break                       # leave the loop instead of echoing b''
            print("recv:", data)
            conn.send(data)                 # echo the data back
    except Exception as ex:
        print(ex)
    finally:
        conn.close()

if __name__ == '__main__':
    server(8001)
```
4.2 Client
```python
import socket

HOST = 'localhost'   # the remote host
PORT = 8001          # the same port as used by the server

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((HOST, PORT))
while True:
    msg = bytes(input(">>:"), encoding="utf8")
    if not msg:      # skip empty input: sending b'' transmits nothing and recv() would block
        continue
    s.sendall(msg)
    data = s.recv(1024)
    print('Received', repr(data))
s.close()
```
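For reference, the switching gevent performs for these sockets is readiness-based IO under the hood: its hub waits until a socket is readable, then resumes the greenlet that was blocked on it. A minimal single-threaded sketch of that pattern using only the standard library (`socket.socketpair()` stands in for a real client/server connection):

```python
import selectors
import socket

sel = selectors.DefaultSelector()
server_side, client_side = socket.socketpair()   # a connected local pair
server_side.setblocking(False)
sel.register(server_side, selectors.EVENT_READ)

client_side.sendall(b"hello")
# the event-loop body: wait until some registered socket is readable,
# then service it -- conceptually what gevent's hub does whenever a
# greenlet blocks on recv()
for key, _ in sel.select(timeout=1):
    data = key.fileobj.recv(1024)
    key.fileobj.sendall(data)                    # echo, like handle_request() above

echoed = client_side.recv(1024)
print(echoed)                                    # b'hello'
sel.close()
server_side.close()
client_side.close()
```

gevent hides this loop behind blocking-looking calls, which is why the server above can serve many connections from a single thread without any explicit select() code.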