Day 10: Crawling web pages concurrently with gevent coroutines

Overview

Earlier we saw that gevent switches between coroutines automatically when it hits an I/O operation. Now we use gevent to fetch some real web pages.

Fetching pages serially

from urllib import request
import time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request and open the response
    data = resp.read()  # read the response body
    # Every call writes to the same url.html, so each page overwrites the last.
    with open("url.html", "wb") as fp:
        fp.write(data)
    print('%d bytes received from %s.' % (len(data), url))

urls = ['https://www.python.org',
        'https://www.yahoo.com',
        'https://github.com'
        ]
time_start = time.time()
for url in urls:
    f(url)
print("Sync cost:", time.time() - time_start)

# Output

GET: https://www.python.org
48893 bytes received from https://www.python.org.
GET: https://www.yahoo.com
505354 bytes received from https://www.yahoo.com.
GET: https://github.com
51489 bytes received from https://github.com.
Sync cost: 3.7278189659118652
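The function above writes every page to the same url.html, so each download overwrites the previous one. If you want to keep all three pages, one option is to derive the file name from the URL. The following is only a minimal sketch: the helper name filename_for and the naming scheme are assumptions made for illustration, not part of the original script.

```python
from urllib.parse import urlparse

def filename_for(url):
    """Hypothetical helper: build a file name such as 'github.com.html' from a URL."""
    host = urlparse(url).netloc or "page"
    # Keep letters, digits, dots and hyphens; replace anything else (e.g. ':') with '_'.
    safe = "".join(ch if ch.isalnum() or ch in ".-" else "_" for ch in host)
    return safe + ".html"

urls = ['https://www.python.org',
        'https://www.yahoo.com',
        'https://github.com'
        ]
names = [filename_for(u) for u in urls]
print(names)
```

Inside f(url), open(filename_for(url), "wb") could then take the place of open("url.html", "wb").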

Fetching pages with gevent coroutines

from urllib import request
import gevent,time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)
    data = resp.read()
    # Every call writes to the same url.html, so each page overwrites the last.
    with open("url.html", "wb") as fp:
        fp.write(data)
    print('%d bytes received from %s.' % (len(data), url))

async_time_start = time.time()
gevent.joinall([
        gevent.spawn(f, 'https://www.python.org/'),
        gevent.spawn(f, 'https://www.yahoo.com/'),
        gevent.spawn(f, 'https://github.com/'),
])
print("Async cost:", time.time() - async_time_start)  # elapsed time

# Output

GET: https://www.python.org/
48893 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
504992 bytes received from https://www.yahoo.com/.
GET: https://github.com/
51489 bytes received from https://github.com/.
Async cost: 4.873842000961304

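Comparing the two timings above, the gevent version is no faster than the serial one (in this run it was actually slower: 4.87 s versus 3.73 s). Why?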

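By default, urllib knows nothing about gevent. It is built on the standard library's blocking socket module, so when it is called from a gevent coroutine, gevent cannot detect that urllib is waiting on I/O. Because gevent never sees the I/O, it never switches to another coroutine, and the three requests run one after another. The same is true of the plain standard-library socket we used earlier: handed to gevent unpatched, it simply blocks. That is why both runs take about the same time.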
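The effect can be reproduced without any network access. The sketch below is only an illustration under stated assumptions: time.sleep() stands in for an unpatched blocking call such as urllib's socket read, and gevent.sleep() stands in for I/O that gevent can see. The function names and the 0.2 second delay are made up for the demonstration.

```python
import time
import gevent

def blocking_task(n):
    time.sleep(0.2)    # unpatched stdlib call: gevent cannot switch away here
    return n

def cooperative_task(n):
    gevent.sleep(0.2)  # yields to gevent, so other coroutines run meanwhile
    return n

def timed(task):
    """Run three coroutines of `task` and return (elapsed seconds, results)."""
    start = time.perf_counter()
    jobs = [gevent.spawn(task, i) for i in range(3)]
    gevent.joinall(jobs)
    return time.perf_counter() - start, [job.value for job in jobs]

blocking_elapsed, blocking_values = timed(blocking_task)
cooperative_elapsed, cooperative_values = timed(cooperative_task)
print("blocking:    %.2f s" % blocking_elapsed)     # roughly 3 x 0.2 s
print("cooperative: %.2f s" % cooperative_elapsed)  # roughly 0.2 s
```

The blocking variant takes about the sum of the three delays, just like the unpatched urllib run, while the cooperative variant takes about as long as a single delay.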

So how do we let gevent know that urllib is performing I/O? Apply gevent's monkey patch by calling monkey.patch_all().

from gevent import monkey

# Patch before importing urllib, so that every later import sees gevent's
# cooperative versions of the standard library's blocking I/O modules.
monkey.patch_all()

from urllib import request
import gevent,time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)
    data = resp.read()
    # Every call writes to the same url.html, so each page overwrites the last.
    with open("url.html", "wb") as fp:
        fp.write(data)
    print('%d bytes received from %s.' % (len(data), url))

async_time_start = time.time()
gevent.joinall([
        gevent.spawn(f, 'https://www.python.org/'),
        gevent.spawn(f, 'https://www.yahoo.com/'),
        gevent.spawn(f, 'https://github.com/'),
])
print("Async cost:", time.time() - async_time_start)  # elapsed time

# Output

GET: https://www.python.org/
GET: https://www.yahoo.com/
GET: https://github.com/
48893 bytes received from https://www.python.org/.
51489 bytes received from https://github.com/.
504160 bytes received from https://www.yahoo.com/.
Async cost: 1.559157133102417

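Explanation: patch_all() does not change urllib itself. It replaces the standard library's blocking modules that urllib relies on (such as socket and ssl) with gevent's cooperative versions. Every place where urllib would block on I/O now hands control back to gevent, much as an explicit gevent.sleep() would, so the coroutine yields and another one runs while it waits. That is why all three GET lines are printed before any response arrives, and why the total time drops to about 1.56 s.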
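To see the patch at work without touching the network, the following minimal sketch applies patch_all() and then runs three coroutines that call the ordinary time.sleep(). The task function and the 0.2 second delay are assumptions made for the demonstration, and monkey.is_module_patched() is used only to inspect the result.

```python
from gevent import monkey
monkey.patch_all()  # patch first, before the other imports

import time
import gevent

def task(n):
    time.sleep(0.2)  # the patched time.sleep now yields to gevent
    return n

start = time.perf_counter()
jobs = [gevent.spawn(task, i) for i in range(3)]
gevent.joinall(jobs)
elapsed = time.perf_counter() - start
values = [job.value for job in jobs]

socket_patched = monkey.is_module_patched("socket")
time_patched = monkey.is_module_patched("time")
print("socket patched:", socket_patched)
print("time patched:", time_patched)
print("elapsed: %.2f s" % elapsed)  # close to one delay, not three
```

The code of task() is plain standard-library Python; only the patch decides whether it blocks the whole thread or cooperates with the other coroutines.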

Multi-socket concurrency in a single thread with gevent

Server

import gevent
from gevent import socket, monkey

monkey.patch_all()

def server(port):
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)  # listen for incoming TCP connections (backlog of 500)
    while True:
        cli, addr = s.accept()  # cli is the connection object the server creates for this client
        gevent.spawn(handle_request, cli)  # handle each client in its own coroutine


def handle_request(conn):
    try:
        while True:
            data = conn.recv(1024)  # receive data sent by the client
            if not data:  # empty bytes: the client closed the connection
                break
            print("recv:", data)
            conn.sendall(data)  # echo the data back

    except Exception as ex:
        print(ex)
    finally:
        conn.close()


if __name__ == '__main__':
    server(8001)

# Output

recv: b'huwei'
recv: b'123'

Client

import socket

HOST = 'localhost'  # The remote host
PORT = 8001  # The same port as used by the server
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create the client socket
s.connect((HOST, PORT))  # connect to the server
while True:
    msg = bytes(input(">>:"), encoding="utf8")
    if not msg:  # an empty message sends nothing and would leave recv() waiting forever
        break
    s.sendall(msg)  # send the data to the server
    data = s.recv(1024)  # receive the echoed data
    print('Received', data)
s.close()

# Output

>>:huwei
Received b'huwei'
>>:123
Received b'123'
>>:
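To check that one thread really serves several sockets at once, the server and a few clients can be run as coroutines in the same process. This is only a sketch under stated assumptions: it uses gevent's own socket module directly, binds to 127.0.0.1 on port 0 so the operating system picks a free port (instead of the fixed 8001 above), and the names accept_loop and client are made up for the demonstration.

```python
import gevent
from gevent import socket  # gevent's cooperative socket module

def handle_request(conn):
    """Echo everything back until the client closes the connection."""
    try:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)
    finally:
        conn.close()

def accept_loop(listener):
    """Accept connections forever, one coroutine per client."""
    while True:
        conn, _addr = listener.accept()
        gevent.spawn(handle_request, conn)

def client(port, message):
    """Send one short message and return the echoed reply."""
    conn = socket.create_connection(("127.0.0.1", port))
    try:
        conn.sendall(message)
        return conn.recv(1024)
    finally:
        conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
listener.listen(5)
port = listener.getsockname()[1]

acceptor = gevent.spawn(accept_loop, listener)
messages = [b"huwei", b"123", b"gevent"]
clients = [gevent.spawn(client, port, m) for m in messages]
gevent.joinall(clients, timeout=5)
replies = [c.value for c in clients]

acceptor.kill()   # stop the accept loop
listener.close()
print(replies)
```

All three clients and the server share a single thread; each recv() or accept() that has to wait simply lets another coroutine run.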

posted @ 2017-12-14 23:46 Mr.hu
