Python's support for concurrent programming
- Multithreading: threading. Overlaps CPU work with IO so the CPU is not left idle while waiting.
- Multiprocessing: multiprocessing. Uses multiple CPU cores, with each core doing both computation and IO.
- Async IO: asyncio. Single-threaded, runs functions asynchronously.
- Use Lock to protect shared resources.
- Use Queue to communicate between threads/processes and to implement the producer-consumer pattern.
- Use thread pools and process pools to simplify submitting tasks, waiting, and collecting results.
CPU-bound vs. IO-bound workloads
- CPU-bound: compression/decompression, encryption/decryption, regular-expression search
- IO-bound: read/write operations (disk and memory); file processing, web crawlers, database access
Multiprocessing, multithreading, and coroutines
- Multiprocessing: Process (multiprocessing). True parallelism on multiple CPU cores, but processes are heavyweight, so only a few can be started. Best for CPU-bound work (see the sketch after this list).
- Multithreading: Thread (threading). Lighter than processes and uses fewer resources, but threads only run concurrently, not in parallel on multiple CPUs (because of the GIL); the number of threads is still limited, each consumes memory, and switching has overhead. Best for IO-bound work with a moderate number of concurrent tasks.
- Coroutines: Coroutine (asyncio). Lowest memory overhead and the largest number of concurrent tasks, but library support is limited (e.g. requests does not support coroutines; use aiohttp instead) and the code is more complex. Best for IO-bound work with very many tasks.
- One process can start multiple threads, and one thread can run multiple coroutines.
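Multiprocessing is listed above but never demonstrated later in these notes, so here is a minimal sketch of parallelizing a CPU-bound check across cores; the is_prime function and the input numbers are made up for illustration:
import math
from multiprocessing import Pool

def is_prime(n):
    # hypothetical CPU-bound task: trial-division primality check
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))

if __name__ == "__main__":
    numbers = [112272535095293] * 8  # made-up workload
    with Pool() as pool:  # defaults to one worker process per CPU core
        print(pool.map(is_prime, numbers))
Pool.map distributes the inputs across worker processes, so up to one check per core runs in parallel.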
GIL (Global Interpreter Lock)
- The GIL (in CPython) allows only one thread to execute Python bytecode at a time, so the threads of one process cannot run Python code in parallel on multiple cores.
- The GIL is released during blocking IO, which is why multithreading still helps IO-bound workloads.
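A minimal sketch of the GIL's effect, using a made-up CPU-bound countdown: splitting the same work across two threads takes about as long as one thread, because only one thread executes Python bytecode at a time.
import threading
import time

def count_down(n):
    # pure CPU-bound loop; the GIL serializes it across threads
    while n > 0:
        n -= 1

N = 50_000_000

start = time.time()
count_down(N)
print("single thread:", time.time() - start)

start = time.time()
t1 = threading.Thread(target=count_down, args=(N // 2,))
t2 = threading.Thread(target=count_down, args=(N // 2,))
t1.start(); t2.start()
t1.join(); t2.join()
print("two threads:", time.time() - start)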
How to create threads in Python
import threading

def my_func(a, b):
    do_craw(a, b)

t = threading.Thread(target=my_func, args=(100, 200))
t.start()
t.join()  # wait for the thread to finish
import requests
import threading
import time

urls = [f"https://www.cnblogs.com/#{page}" for page in range(50)]

def crawl(url):
    r = requests.get(url)
    print(url, len(r.text))

def single_thread():
    print("single thread start")
    for url in urls:
        crawl(url)
    print("single thread end")

def multi_thread():
    print("multi threads start")
    threads = []
    for url in urls:
        threads.append(threading.Thread(target=crawl, args=(url,)))  # args must be a tuple, hence (url,)
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print("multi threads end")

if __name__ == '__main__':
    start = time.time()
    multi_thread()
    end = time.time()
    print("multi-thread time:", end - start)  # multi-thread time: 3.138007879257202
    start2 = time.time()
    single_thread()
    end2 = time.time()
    print("single-thread time:", end2 - start2)  # single-thread time: 11.541695833206177
Producer-consumer crawler
import requests
import threading
import time
from bs4 import BeautifulSoup
import queue
import random
urls = [f"https://www.cnblogs.com/#{page}" for page in range(1, 51)]
def crawl(url):
    r = requests.get(url)
    return r.text

def parse(html):
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post-item-title")
    return [(link["href"], link.get_text()) for link in links]  # a list of (link, title) tuples, one per post on the page
def do_crawl(url_queue, html_queue):
    # producer: take a url from url_queue, fetch it, put the html into html_queue
    while True:
        url = url_queue.get()
        html = crawl(url)
        html_queue.put(html)
        print(threading.current_thread().name, f"crawl {url}",
              "url_queue.size=", url_queue.qsize())
        time.sleep(random.randint(1, 2))

def do_parse(html_queue, fout):
    # consumer: take html from html_queue, parse it, write results to the output file
    while True:
        html = html_queue.get()
        results = parse(html)
        for result in results:
            fout.write(str(result) + "\n")
        print(threading.current_thread().name, "results.size", len(results),
              "html_queue.size=", html_queue.qsize())
        time.sleep(random.randint(1, 2))
if __name__ == '__main__':
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    # start 3 producer (crawl) threads
    for idx in range(3):
        t = threading.Thread(target=do_crawl, args=(url_queue, html_queue),
                             name=f"crawl{idx}")
        t.start()
    fout = open("data.txt", "w", encoding="utf-8")
    # start 2 consumer (parse) threads
    for idx in range(2):
        t = threading.Thread(target=do_parse, args=(html_queue, fout),
                             name=f"parse{idx}")
        t.start()
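Note that do_crawl and do_parse loop forever, so this program never exits on its own. One common fix (an addition, not part of the original code) is to start the workers as daemon threads, which die automatically when the main thread finishes:
# variant of the loops in the __main__ block above:
for idx in range(3):
    t = threading.Thread(target=do_crawl, args=(url_queue, html_queue),
                         name=f"crawl{idx}", daemon=True)
    t.start()
Another option is to push sentinel values into the queues and have each worker break out of its loop when it sees one.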
Thread safety and how to achieve it
- Thread safety means that a function or library, when called from multiple threads, handles the variables shared between those threads correctly, so the program behaves as intended.
- Because thread execution can be switched at any point, unsynchronized access to shared state produces unpredictable results: the code is thread-unsafe.
- Use Lock to make such code thread-safe:
# Usage 1: try-finally pattern
import threading

lock = threading.Lock()
lock.acquire()
try:
    pass  # do something
finally:
    lock.release()

# Usage 2: context manager
import threading

lock = threading.Lock()
with lock:
    pass  # do something; create the lock outside the thread function, then start the threads as usual
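A minimal sketch of the pattern in use, assuming a made-up bank-account example: without the lock, the balance check and the deduction can interleave across threads and overdraw the account.
import threading

lock = threading.Lock()

class Account:
    def __init__(self, balance):
        self.balance = balance

def draw(account, amount):
    with lock:  # without the lock, check and deduction can interleave between threads
        if account.balance >= amount:
            account.balance -= amount

account = Account(1000)
threads = [threading.Thread(target=draw, args=(account, 800)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(account.balance)  # with the lock, this never goes negative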
Thread pools: ThreadPoolExecutor
- Creating and terminating threads consumes system resources; doing it frequently is wasteful, and a pool reuses threads to avoid that overhead.
- Thread pools suit bursts of many requests, or workloads that need many threads where each task is short.
# How to use ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor() as pool:
    results = pool.map(crawl, urls)  # note: map takes a list of arguments (urls), not a single argument
    for result in results:
        print(result)

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(crawl, url) for url in urls]  # note: submit takes a single url
    # two ways to iterate over the futures:
    # 1. in submission (url) order
    for future in futures:
        print(future.result())
    # 2. in completion order: whichever finishes first is returned first
    for future in as_completed(futures):
        print(future.result())
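Process pools, mentioned at the top of these notes, expose the same interface through ProcessPoolExecutor; a minimal sketch, reusing the hypothetical is_prime workload from the multiprocessing sketch above:
import math
from concurrent.futures import ProcessPoolExecutor

def is_prime(n):
    # same made-up CPU-bound check as earlier
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))

if __name__ == "__main__":
    numbers = [112272535095293] * 8  # made-up workload
    with ProcessPoolExecutor() as pool:
        # same map/submit API as ThreadPoolExecutor, but tasks run in
        # separate processes, so CPU-bound work actually uses multiple cores
        for number, prime in zip(numbers, pool.map(is_prime, numbers)):
            print(number, prime)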
Some coroutine concepts
Coroutines
- A coroutine function is declared with async def.
- Calling a coroutine function returns a coroutine object.
- Calling a coroutine directly does not execute the code inside the function.
import asyncio

async def main():
    print('hello')
    await asyncio.sleep(1)
    print('world')

main()  # <coroutine object main at 0x1053bb7c8>
How to run a coroutine:
asyncio.run(); note that the code below still executes main sequentially
import asyncio

async def main():
    print('hello')
    await asyncio.sleep(1)
    print('world')

asyncio.run(main())
# hello
# world
- Awaiting on a coroutine
- The example below uses the await keyword to suspend on the say_after coroutine function, then runs everything with asyncio.run().
- Note that main is still executed sequentially this way.
import asyncio
import time

async def say_after(delay, what):
    await asyncio.sleep(delay)
    print(what)

async def main():
    print(f"started at {time.strftime('%X')}")
    await say_after(1, 'hello')
    await say_after(2, 'world')
    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
# started at 17:13:52
# hello
# world
# finished at 17:13:55
asyncio.create_task()
- This function schedules coroutine functions to run concurrently.
async def main():
    task1 = asyncio.create_task(
        say_after(1, 'hello'))
    task2 = asyncio.create_task(
        say_after(2, 'world'))
    print(f"started at {time.strftime('%X')}")
    # Wait until both tasks are completed (should take
    # around 2 seconds.)
    await task1
    await task2
    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
# started at 17:14:32
# hello
# world
# finished at 17:14:34
Awaitables
- We say that an object is an awaitable object if it can be used in an await expression.
- There are three kinds of awaitable objects: coroutines, Tasks, and Futures.
Tasks
- Tasks are used to run coroutines concurrently.
- When a coroutine is wrapped into a Task with functions like asyncio.create_task(), the coroutine is automatically scheduled to run soon.
- asyncio.create_task() returns a Task object.
import asyncio

async def nested():
    return 42

async def main():
    # Schedule nested() to run soon concurrently
    # with "main()".
    task = asyncio.create_task(nested())
    # "task" can now be used to cancel "nested()", or
    # can simply be awaited to wait until it is complete:
    await task

asyncio.run(main())
Futures
- A Future is a special low-level awaitable object that represents an eventual result of an asynchronous operation.
- Normally there is no need to create Future objects at the application level code.
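For intuition (an illustration, not from the original notes): loop.run_in_executor is one place application code does meet a Future; it wraps a blocking call in an awaitable Future. The sum call here is an arbitrary stand-in for blocking work.
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    # run_in_executor returns an asyncio Future wrapping the blocking call
    future = loop.run_in_executor(None, sum, range(1_000_000))
    print(await future)

asyncio.run(main())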
Coroutine template
import asyncio

async def func1():
    pass
    await asyncio.sleep(1)
    pass

async def func2():
    pass
    await asyncio.sleep(1)
    pass

async def func3():
    pass
    await asyncio.sleep(1)
    pass

# Method 1:
async def main():
    f1 = func1()
    await f1  # await is placed before a coroutine object, and must appear inside an async function
    f2 = func2()
    await f2
    f3 = func3()
    await f3

# Method 2 (recommended):
async def main():
    tasks = [
        asyncio.create_task(func1()),
        asyncio.create_task(func2()),
        asyncio.create_task(func3())]
    await asyncio.wait(tasks)

if __name__ == "__main__":
    asyncio.run(main())
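An aside not in the original notes: asyncio.gather is a common alternative to asyncio.wait; it accepts coroutines directly and returns their results in argument order.
async def main():
    # equivalent concurrency to Method 2, with ordered results
    results = await asyncio.gather(func1(), func2(), func3())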
import asyncio

async def download(url):
    await asyncio.sleep(1)  # stands in for the actual network request

async def main():
    urls = []  # fill in the urls to download
    tasks = []
    for url in urls:
        d = asyncio.create_task(download(url))
        tasks.append(d)
    await asyncio.wait(tasks)

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
import aiohttp

async def aiodownload(url):
    name = url.rsplit("/", 1)[-1]  # derive a file name from the url (placeholder; the original left this unspecified)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # response.content.read() -> bytes; response.text() and response.json()
            # are coroutine methods here, so the parentheses (and await) are required
            data = await response.content.read()  # reading the body is asynchronous, so it must be awaited
    # the aiofiles module can make the file write asynchronous as well (see the sketch below)
    with open(name, mode="wb") as f:
        f.write(data)
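A minimal sketch of the aiofiles variant suggested in the comment above, so that the file write no longer blocks the event loop; the file-name handling mirrors the placeholder assumption in aiodownload:
import aiofiles
import aiohttp

async def aiodownload(url):
    name = url.rsplit("/", 1)[-1]  # placeholder file name, as above
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            data = await response.content.read()
    async with aiofiles.open(name, mode="wb") as f:
        await f.write(data)  # the write itself is now awaitable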