[Python] 10 - Concurrent: asyncio
Ref: HOWTO Fetch Internet Resources Using The urllib Package
Ref: Python High Performance, Second Edition (based on Python 3)
Ref: http://online.fliphtml5.com/odjuw/kcqs/#p=8 (online e-book)
Ref: 廖雪峰's tutorial on asynchronous IO (the most useful of these references)
Ref: Efficient web-scraping with Python's asynchronous programming (reference)
Ref: A Web Crawler With asyncio Coroutines (reference)
Some concepts
Parallelism: parallel
Concurrency: concurrent
Coroutine: coroutine
A coroutine is an even lighter-weight unit than a thread. Just as one process can own many threads, one thread can own many coroutines.
Coroutines are not managed by the operating system kernel; they are controlled entirely by the program itself (that is, they run in user space).
The benefit is a large performance gain: switching between coroutines does not burn resources the way a thread switch does.
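As a toy illustration (a sketch of my own, not taken from the references above): two generator-based coroutines switched by an ordinary Python loop. All of the switching happens in user space; the kernel never schedules anything.

def task(name, steps):
    for i in range(steps):
        print('%s: step %d' % (name, i))
        yield                      # hand control back to the scheduler

def round_robin(tasks):
    # A tiny cooperative scheduler: resume each coroutine in turn.
    while tasks:
        t = tasks.pop(0)
        try:
            next(t)                # run the coroutine until its next yield
            tasks.append(t)        # requeue it for another turn
        except StopIteration:
            pass                   # this coroutine has finished

round_robin([task('A', 2), task('B', 3)])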
Asynchronous IO on Linux
Reference 1: boost coroutine with multi core
Reference 2: poll and select
poll and select are implemented in essentially the same way; only the arguments passed in differ. Their basic flow is as follows (a small Python sketch follows the list):
1. Copy the user's data into kernel space.
2. Estimate the timeout.
3. Iterate over every file and call f_op->poll to get the file's current readiness; if none of the files visited so far is ready, insert a wait_queue node into the file.
4. After the iteration, check the state:
a). if some file is already ready, go to 5;
b). if a signal was raised, restart poll or select (go to 1 or 3);
c). otherwise suspend the process until it times out or is woken up; after that, iterate over all files again to get each file's readiness.
5. Copy the readiness of every file back to user space.
6. Clean up the resources that were allocated.
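Here is a minimal Python sketch of the same readiness-based flow, using the standard-library selectors module over a socket pair (all names and values are illustrative, not tied to any code later in this note):

import selectors
import socket

# A connected socket pair gives us a readable fd without any network setup.
r_sock, w_sock = socket.socketpair()
sel = selectors.DefaultSelector()            # epoll / kqueue / poll / select underneath
sel.register(r_sock, selectors.EVENT_READ)   # register interest in readability

w_sock.send(b'ping')                         # make r_sock readable

# Block until a registered fd is ready or the timeout expires, then get the
# ready set back -- steps 3-5 of the flow above, done by the kernel.
for key, events in sel.select(timeout=1.0):
    print('ready:', key.fileobj.recv(16))

sel.unregister(r_sock)
r_sock.close()
w_sock.close()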
Before we start
The serial strategy with requests.get
import requests
import string
import random

# Generate the URLs
def generate_urls(base_url, num_urls):
    """
    We add random characters to the end of the URL to break any
    caching mechanisms in the requests library or the server
    """
    for i in range(num_urls):
        yield base_url + "".join(random.sample(string.ascii_lowercase, 10))

# Fetch the URLs one after another
def run_experiment(base_url, num_iter=500):
    response_size = 0
    for url in generate_urls(base_url, num_iter):
        print(url)
        response = requests.get(url)
        response_size += len(response.text)
    return response_size

if __name__ == "__main__":
    import time
    delay = 100
    num_iter = 50
    base_url = "http://www.baidu.com/add?name=serial&delay={}&".format(delay)
    start = time.time()
    result = run_experiment(base_url, num_iter)
    end = time.time()
    print("Result: {}, Time: {}".format(result, end - start))
The Gevent approach
[This approach is set aside for now: it is more complicated, and the original snippet was written for Python 2 (urllib2, gevent.coros), so it does not run as given; a Python 3 port is shown below.]
The parts of the code that change are shown below:
from gevent import monkey
monkey.patch_socket()
#----------------------------------
import gevent
from gevent.lock import Semaphore        # gevent.coros was renamed to gevent.lock
from urllib.request import urlopen       # urllib2 exists only in Python 2
from contextlib import closing

import string
import random

def download(url, semaphore):
    # The semaphore caps how many downloads run at once;
    # closing() guarantees the connection is closed.
    with semaphore, closing(urlopen(url)) as data:
        return data.read()

def chunked_requests(urls, chunk_size=100):
    semaphore = Semaphore(chunk_size)
    requests = [gevent.spawn(download, u, semaphore) for u in urls]
    for response in gevent.iwait(requests):
        yield response

def run_experiment(base_url, num_iter=500):
    urls = generate_urls(base_url, num_iter)
    response_futures = chunked_requests(urls, 100)
    response_size = sum(len(r.value) for r in response_futures)
    return response_size
gevent.spawn(): create a new Greenlet object and schedule it to run function(*args, **kwargs).
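A tiny usage sketch of gevent.spawn / joinall (assuming gevent is installed; the worker function and numbers are made up for illustration):

import gevent

def work(n):
    gevent.sleep(0.1)            # cooperative sleep: other greenlets run meanwhile
    return n * n

jobs = [gevent.spawn(work, i) for i in range(5)]   # schedule five greenlets
gevent.joinall(jobs)                               # wait for all of them to finish
print([job.value for job in jobs])                 # -> [0, 1, 4, 9, 16]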
The greenlet source code is small, only about 2000 lines of C, part of which (the stack-register manipulation) is written in assembly.
greenlet's implementation principle in one sentence: switching between coroutines is done by copying and swapping stacks.
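A minimal sketch of that stack switching using the greenlet package directly (assuming greenlet is installed); every switch() call jumps straight onto the other coroutine's stack and resumes it where it last stopped:

from greenlet import greenlet

def foo():
    print('foo: step 1')
    gr_bar.switch()              # jump onto bar's stack
    print('foo: step 2')         # resumed later by bar

def bar():
    print('bar: step 1')
    gr_foo.switch()              # jump back into foo where it left off

gr_foo = greenlet(foo)
gr_bar = greenlet(bar)
gr_foo.switch()                  # start foo; prints foo/bar/foo lines in order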
closing from contextlib
For "file-like" objects that do not support the "with" statement, use contextlib.closing():

from contextlib import closing           # note: "import contextlib.closing" is not valid Python
from urllib.request import urlopen       # Python 3 replacement for urllib.urlopen

with closing(urlopen("http://www.python.org/")) as front_page:
    for line in front_page:
        print(line)
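closing() itself is tiny; the standard library documents it as being roughly equivalent to this context manager:

from contextlib import contextmanager

@contextmanager
def closing(thing):
    try:
        yield thing              # hand the object to the with-block
    finally:
        thing.close()            # always close it on the way out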
Asynchronous IO
1. A simple model
yield has a value: in "n = yield r" the coroutine hands r back to the caller, pauses, and later receives whatever the caller passes in with send().
def consumer():
    r = ''
    while True:
        n = yield r
        if not n:
            return
        print('[CONSUMER] Consuming %s...' % n)
        r = '200 OK'

def produce(c):
    c.send(None)        # <-- prime (start) the generator
    n = 0
    while n < 5:
        n = n + 1
        print('[PRODUCER] Producing %s...' % n)
        r = c.send(n)
        print('[PRODUCER] Consumer return: %s' % r)
    c.close()

#--------------------------------------------------
c = consumer()
produce(c)              # feed messages to the consumer c
2. Where asyncio comes from
The traditional way
Ref: https://www.liaoxuefeng.com/wiki/1016959663602400/1017970488768640
(1) Get a reference to an EventLoop directly from the asyncio module, then
(2) throw the coroutines you want to run into that EventLoop; that is all it takes to get asynchronous IO.
import threading
import asyncio

@asyncio.coroutine      # legacy style; @asyncio.coroutine was removed in Python 3.11
def hello():
    print('Hello world! (%s)' % threading.currentThread())
    yield from asyncio.sleep(1)     # treat this as a time-consuming IO operation
    print('Hello again! (%s)' % threading.currentThread())

loop = asyncio.get_event_loop()                 # (1) get a reference to the EventLoop
tasks = [hello(), hello()]
loop.run_until_complete(asyncio.wait(tasks))    # (2) throw the coroutines into the EventLoop
loop.close()
Asynchronous wget of web pages
writer.drain(): a flow-control method that talks to the underlying IO write buffer. When the buffer reaches its high-water mark, drain() blocks until the buffer falls back below the low-water mark, after which writing can resume. When there is nothing to wait for, drain() returns immediately.
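The wget example below uses the legacy yield from form; as a reference point, the same write-then-drain flow-control pattern in async/await style might look like this (host, port and payload are placeholders):

import asyncio

async def send_lines(host, port, lines):
    reader, writer = await asyncio.open_connection(host, port)
    for line in lines:
        writer.write(line.encode('utf-8'))
        await writer.drain()     # pauses only while the send buffer is over its high-water mark
    writer.close()
    await writer.wait_closed()   # Python 3.7+

# asyncio.run(send_lines('example.com', 80, ['GET / HTTP/1.0\r\nHost: example.com\r\n', '\r\n']))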
#%%
import asyncio

@asyncio.coroutine
def wget(host):
    print('wget %s...' % host)

    # (1) First, obtain the socket's bidirectional pipe
    connect = asyncio.open_connection(host, 80)
    reader, writer = yield from connect

    # (2) Send the request asking for the page
    header = 'GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % host
    writer.write(header.encode('utf-8'))
    yield from writer.drain()

    # (3) Read the response headers
    while True:
        line = yield from reader.readline()
        if line == b'\r\n':
            break
        print('%s header > %s' % (host, line.decode('utf-8').rstrip()))
    # Ignore the body, close the socket
    writer.close()

loop = asyncio.get_event_loop()
tasks = [wget(host) for host in ['www.sina.com.cn', 'www.sohu.com', 'www.163.com']]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
To summarize, there are really only two things to do:
(1) decorate the coroutine with @asyncio.coroutine;
(2) put yield from wherever we do not want to block.
Switching to async / await
Written another way, it looks a bit cleaner.
import threading
import asyncio

async def hello():
    print('Hello world! (%s)' % threading.currentThread())
    await asyncio.sleep(1)      # treat this as a time-consuming IO operation
    print('Hello again! (%s)' % threading.currentThread())

loop = asyncio.get_event_loop()                 # (1) get a reference to the EventLoop
tasks = [hello(), hello()]
loop.run_until_complete(asyncio.wait(tasks))    # (2) throw the coroutines into the EventLoop
loop.close()
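For reference, on Python 3.7+ the event-loop boilerplate can be dropped; a minimal sketch of the same two-coroutine program using asyncio.run() and asyncio.gather():

import threading
import asyncio

async def hello():
    print('Hello world! (%s)' % threading.current_thread())
    await asyncio.sleep(1)                   # stand-in for a slow IO operation
    print('Hello again! (%s)' % threading.current_thread())

async def main():
    await asyncio.gather(hello(), hello())   # run both coroutines concurrently

asyncio.run(main())                          # creates, runs and closes the loop for us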
3. aiohttp to the rescue
Now asyncio moves to the server side!
asyncio gives us concurrent IO inside a single thread. Used only on the client side it does not show its full power; but put asyncio on the server side, e.g. in a web server, and since every HTTP connection is just an IO operation, a single thread plus coroutines can support high concurrency for many users.
# server code
import asyncio
from aiohttp import web

async def index(request):
    await asyncio.sleep(0.5)
    return web.Response(body=b'<h1>Index</h1>')

async def hello(request):
    await asyncio.sleep(0.5)
    text = '<h1>hello, %s!</h1>' % request.match_info['name']
    return web.Response(body=text.encode('utf-8'))

async def init(loop):
    app = web.Application(loop=loop)
    app.router.add_route('GET', '/', index)
    app.router.add_route('GET', '/hello/{name}', hello)

    srv = await loop.create_server(app.make_handler(), '127.0.0.1', 8000)
    print('Server started at http://127.0.0.1:8000...')
    return srv

loop = asyncio.get_event_loop()
loop.run_until_complete(init(loop))
loop.run_forever()
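Note that in recent aiohttp releases the loop= argument and make_handler() are deprecated; a sketch of the same server in the current style (assuming aiohttp 3.x) could look like this:

import asyncio
from aiohttp import web

async def index(request):
    await asyncio.sleep(0.5)
    return web.Response(text='<h1>Index</h1>', content_type='text/html')

async def hello(request):
    await asyncio.sleep(0.5)
    text = '<h1>hello, %s!</h1>' % request.match_info['name']
    return web.Response(text=text, content_type='text/html')

app = web.Application()
app.router.add_get('/', index)
app.router.add_get('/hello/{name}', hello)
web.run_app(app, host='127.0.0.1', port=8000)   # runs its own event loop and blocks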
Asynchronous IO with a million concurrent connections
The linked article is good; see it for details.
One point worth noting: set a cap on the maximum concurrency.
semaphore = asyncio.Semaphore(500)    # limit concurrency to 500
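A minimal sketch of how such a semaphore caps the number of in-flight requests (this assumes aiohttp as the HTTP client; the URLs and the limit are placeholders):

import asyncio
import aiohttp

async def fetch(session, semaphore, url):
    async with semaphore:                    # waits here once `limit` requests are running
        async with session.get(url) as resp:
            return await resp.read()

async def main(urls, limit=500):
    semaphore = asyncio.Semaphore(limit)     # limit concurrency to `limit`
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, u) for u in urls]
        return await asyncio.gather(*tasks)

# pages = asyncio.run(main(["https://www.baidu.com/"] * 1000))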
A runnable code sample:
See: 对python async与await的理解 (on understanding Python's async and await)
import time
import asyncio
import requests

async def test2(i):
    r = await other_test(i)
    print(i, r)

async def other_test(i):
    # Note: requests.get() is a blocking call, so the three HTTP requests still
    # run one after another; only the asyncio.sleep(4) calls overlap. For truly
    # concurrent HTTP, use an async client such as aiohttp (see above).
    r = requests.get(i)
    print(i)
    await asyncio.sleep(4)
    print(time.time() - start)
    return r

url = ["https://segmentfault.com/p/1210000013564725",
       "https://www.jianshu.com/p/83badc8028bd",
       "https://www.baidu.com/"]

loop = asyncio.get_event_loop()
task = [asyncio.ensure_future(test2(i)) for i in url]
start = time.time()
loop.run_until_complete(asyncio.wait(task))
endtime = time.time() - start
print(endtime)
loop.close()
End.