day10 - Coroutine Crawler Case Study

1. Foreword

So far we have only discussed, in theory, how gevent switches automatically when it hits IO. Now let's put it into practice: we'll use coroutines for a large-scale crawl and see how gevent actually achieves concurrency.

2. Crawling pages serially

2.1 The serial version

Note: first, let's look at the serial version of the crawler and see how long it takes.

from urllib import request
import time

def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    # note: every download overwrites the same url.html file
    with open("url.html", "wb") as fp:
        fp.write(data)
    print("{0} bytes received from {1}".format(len(data), url))

urls = [
    "http://www.163.com",
    "http://www.sina.com.cn",
]

time_start = time.time()
for url in urls:
    f(url)
print("sync cost", time.time() - time_start)

Output:
GET:http://www.163.com
690840 bytes received from http://www.163.com
GET:http://www.sina.com.cn
604430 bytes received from http://www.sina.com.cn
sync cost 1.0000014305114746

2.2 A gevent coroutine crawler

Note: the version above ran serially; now let's run it concurrently with gevent and compare.

from urllib import request
import time
#from gevent import monkey
import gevent

#monkey.patch_all()

def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as fp:
        fp.write(data)
    print("{0} bytes received from {1}".format(len(data), url))

async_time_start = time.time()
gevent.joinall([
    gevent.spawn(f, "http://www.163.com"),
    gevent.spawn(f, "http://www.sina.com.cn"),
])
print("async cost", time.time() - async_time_start)
Output:
GET:http://www.163.com
690840 bytes received from http://www.163.com
GET:http://www.sina.com.cn
604441 bytes received from http://www.sina.com.cn
async cost 0.9600014686584473


Question: why didn't concurrency shorten the run time? There is no visible gain in efficiency.

The reason is that urllib, by default, knows nothing about gevent. When urllib blocks on IO, gevent cannot detect it: it has no idea urllib performed an IO operation at all, so it never switches greenlets, and everything effectively runs serially. The same applies to the socket code we wrote earlier: handing it to gevent as-is doesn't help, because gevent can't see the blocking IO and simply hangs there.
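This is easy to verify without any network access. The sketch below (illustrative, not from the original post) compares a plain time.sleep(), which is opaque to gevent, with gevent.sleep(), which gevent knows how to switch on:

```python
import time
import gevent

def blocking():
    # time.sleep() is invisible to gevent: the greenlet holds the thread,
    # so the two spawns below run one after another (~0.6s total)
    time.sleep(0.3)

def cooperative():
    # gevent.sleep() yields to the hub, so the two spawns overlap (~0.3s total)
    gevent.sleep(0.3)

start = time.time()
gevent.joinall([gevent.spawn(blocking), gevent.spawn(blocking)])
blocking_cost = time.time() - start

start = time.time()
gevent.joinall([gevent.spawn(cooperative), gevent.spawn(cooperative)])
cooperative_cost = time.time() - start

print("blocking: {0:.1f}s, cooperative: {1:.1f}s".format(blocking_cost, cooperative_cost))
```

urllib's internal blocking calls behave exactly like the time.sleep() case here: gevent never gets a chance to switch.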

3. Crawling pages concurrently

Since the approach above doesn't work, how do we make gevent aware that urllib is doing IO?

Answer: apply a monkey patch. Import monkey and add a single line, monkey.patch_all(); nothing else in the program needs to change.

3.1 The code

from urllib import request
import time
from gevent import monkey
import gevent

monkey.patch_all()  # mark every IO operation in the program; this one line is all it takes

def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as fp:
        fp.write(data)
    print("{0} bytes received from {1}".format(len(data), url))

urls = [
    "http://www.163.com",
    "http://www.sina.com.cn",
]

time_start = time.time()
for url in urls:
    f(url)
print("sync cost", time.time() - time_start)

async_time_start = time.time()
gevent.joinall([
    gevent.spawn(f, "http://www.163.com"),
    gevent.spawn(f, "http://www.sina.com.cn"),
])
print("async cost", time.time() - async_time_start)

 

And there's the effect. The difference still isn't huge, partly because network latency dominates, but the patch is what makes the concurrency possible. What monkey.patch_all() does is mark every place inside urllib (and the rest of the standard library) that might perform blocking IO. Each mark acts like gevent.sleep(): the moment the call blocks, gevent switches to another greenlet. Note that gevent.sleep() simulates an IO operation; the mark means "this is IO, switch away when it blocks".
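Because patch_all() also replaces time.sleep() with gevent's cooperative sleep, the effect of the patch can be demonstrated without any network at all. A minimal sketch (not from the original post), assuming gevent is installed:

```python
from gevent import monkey
monkey.patch_all()  # must run before the patched calls are used

import time
import gevent

# after patch_all(), time.sleep() is gevent's cooperative sleep,
# so even a "blocking" sleep now lets other greenlets run
start = time.time()
gevent.joinall([gevent.spawn(time.sleep, 0.3),
                gevent.spawn(time.sleep, 0.3)])
elapsed = time.time() - start
print("two 0.3s sleeps took {0:.1f}s".format(elapsed))  # they overlap, not 0.6s
```

This is exactly what happens to urllib's socket calls in the crawler above: once patched, any block becomes a switch point.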

4. gevent: multi-socket concurrency in a single thread

4.1 The server

import gevent
from gevent import socket, monkey
monkey.patch_all()

def server(port):
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)
    while True:
        cli, addr = s.accept()
        gevent.spawn(handle_request, cli)  # one greenlet per connection

def handle_request(conn):
    try:
        while True:
            data = conn.recv(1024)
            if not data:  # peer closed the connection
                conn.shutdown(socket.SHUT_WR)
                break
            print("recv:", data)
            conn.send(data)
    except Exception as ex:
        print(ex)
    finally:
        conn.close()

if __name__ == '__main__':
    server(8001)

4.2 The client

 

import socket

HOST = 'localhost'    # the remote host
PORT = 8001           # the same port as used by the server
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((HOST, PORT))
while True:
    msg = bytes(input(">>:"), encoding="utf8")
    s.sendall(msg)
    data = s.recv(1024)
    print('Received', repr(data))
s.close()
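To check that a single thread really does serve many sockets at once, the server and a batch of clients can run inside one gevent process. This is an illustrative sketch along the lines of the server and client above, not the original code; binding to port 0 lets the OS pick a free port so the demo is self-contained:

```python
import gevent
from gevent import socket, monkey
monkey.patch_all()

def handle(conn):
    # echo back until the peer closes
    while True:
        data = conn.recv(1024)
        if not data:
            break
        conn.send(data)
    conn.close()

def serve(listener):
    while True:
        conn, addr = listener.accept()
        gevent.spawn(handle, conn)  # one greenlet per connection

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
listener.listen(500)
port = listener.getsockname()[1]
gevent.spawn(serve, listener)

def client(i):
    s = socket.socket()
    s.connect(("127.0.0.1", port))
    s.sendall("hello {0}".format(i).encode("utf8"))
    reply = s.recv(1024)
    s.close()
    return reply

# 50 clients, all handled concurrently by the single-threaded server
greenlets = [gevent.spawn(client, i) for i in range(50)]
gevent.joinall(greenlets)
replies = [g.get() for g in greenlets]
print(len(replies), "echoes received, first:", replies[0])
```

Every accept, recv, and send here is a patched, cooperative call, so the 50 connections interleave inside one OS thread.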

 

Posted 2018-03-23 11:57 by 东郭仔