Python Multithreading, Multiprocessing, and Coroutines

Python's Support for Concurrent Programming

  • Multithreading: threading — overlaps CPU work with IO so the CPU does not sit idle waiting
  • Multiprocessing: multiprocessing — uses multiple CPU cores, with each core doing both computation and IO
  • Async IO: asyncio — single-threaded, runs functions asynchronously
  • Use Lock to protect shared resources
  • Use Queue for communication between threads/processes, e.g. the producer-consumer pattern
  • Use thread pools and process pools to simplify submitting tasks, waiting, and collecting results

CPU-Bound vs. IO-Bound Workloads

  • CPU-bound: compression/decompression, encryption/decryption, regular-expression searching
  • IO-bound: read/write operations (disk and network); file processing, web crawlers, database access (a toy illustration follows this list)
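  • A toy illustration (my own, not from the original post): the first function keeps the CPU busy the whole time, while the second spends nearly all of its time waiting on the disk
import hashlib

def cpu_bound(data: bytes) -> str:
    # CPU-bound: repeated hashing keeps the processor fully busy
    for _ in range(100_000):
        data = hashlib.sha256(data).digest()
    return data.hex()

def io_bound(path: str) -> int:
    # IO-bound: almost all of the time goes to waiting for the disk
    with open(path, "rb") as f:
        return len(f.read())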

Multithreading, Multiprocessing, and Coroutines Compared

  • Multiprocessing: Process (multiprocessing) gives true parallelism on multiple CPU cores, but each process is heavyweight, so only a limited number can be started; best suited to CPU-bound work (a minimal sketch follows this list)
  • Multithreading: Thread (threading) is lighter than a process and uses fewer resources, but because of the GIL threads only run concurrently, never in parallel on multiple cores; the number of threads is still limited, each consumes memory, and switching between them has a cost; best suited to IO-bound work with a moderate number of simultaneous tasks
  • Coroutines: Coroutine (asyncio) have the lowest memory overhead and support the largest number of concurrent tasks, but library support is limited (for example requests does not support coroutines, so aiohttp is used instead) and the code is more involved; best suited to IO-bound work with a very large number of tasks
  • One process can start many threads, and one thread can run many coroutines
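  • A minimal multiprocessing sketch (my own illustration; cpu_task is a made-up CPU-bound function): the Process API mirrors threading.Thread, but each worker is a separate interpreter process with its own GIL and can run on its own core
import multiprocessing

def cpu_task(n):
    # CPU-bound work: sum of squares in pure Python
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # Start one process per task; use a Pool or a Queue if the results are needed
    processes = [multiprocessing.Process(target=cpu_task, args=(5_000_000,))
                 for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()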

GIL

  • The Global Interpreter Lock: it ensures that only one thread executes Python bytecode at any given moment, which is why CPU-bound code does not get faster with threads, as the timing sketch below shows
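  • A quick timing sketch of my own (exact numbers vary by machine): the same CPU-bound loop run in two threads takes roughly as long as running it twice sequentially, because the GIL lets only one thread execute bytecode at a time
import threading
import time

def countdown(n=10_000_000):
    # Pure-Python loop: CPU-bound, so the GIL serializes it across threads
    while n > 0:
        n -= 1

start = time.time()
countdown()
countdown()
print("sequential:", time.time() - start)

start = time.time()
t1 = threading.Thread(target=countdown)
t2 = threading.Thread(target=countdown)
t1.start(); t2.start()
t1.join(); t2.join()
print("two threads:", time.time() - start)  # roughly the same, sometimes slower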

Creating Threads in Python

import threading

def my_func(a, b):
    do_craw(a, b)


t = threading.Thread(target=my_func, args=(100, 200))
t.start()
t.join()  # wait for the thread to finish

  • Single-thread vs. multi-thread example
import requests
import threading
import time

urls = [f"https://www.cnblogs.com/#{page}" for page in range(50)]


def crawl(url):
    r = requests.get(url)
    print(url, len(r.text))


def single_thread():
    print("single thread start")
    for url in urls:
        crawl(url)
    print("single thread end")

def multi_thread():
    print("multi threads start")
    threads = []
    for url in urls:
        threads.append(threading.Thread(target=crawl, args=(url,)))  # args must be a tuple, hence (url,)
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print("multi threads end")


if __name__ == '__main__':
    start = time.time()
    multi_thread()
    end = time.time()
    print("多线程用时:", end- start)  # 多线程用时: 3.138007879257202

    start2 = time.time()
    single_url()
    end2 = time.time()
    print("单线程用时:", end2 - start2)  # 单线程用时: 11.541695833206177
  • The multi-threaded version finishes the urls in no particular order, while the single-threaded version processes them strictly in sequence

Producer-Consumer Crawler

import queue
import random
import threading
import time

import requests
from bs4 import BeautifulSoup


urls = [f"https://www.cnblogs.com/#{page}" for page in range(1, 51)]


def crawl(url):
    r = requests.get(url)
    return r.text


def parse(html):
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post-item-title")
    return [(link["href"], link.get_text()) for link in links]  # 返回一个列表,每个元素是一个元组,包含了每个页面上所有文章的链接与标题


def do_crawl(url_queue, html_queue):
    while True:
        url = url_queue.get()
        html = crawl(url)
        html_queue.put(html)
        print(threading.current_thread().name, f"crawl {url}",
              "url_queue.size=", url_queue.qsize())
        time.sleep(random.randint(1, 2))


def do_parse(html_queue, fout):
    while True:
        html = html_queue.get()
        results = parse(html)
        for result in results:
            fout.write(str(result) + "\n")
        print(threading.current_thread().name, f"results.size", len(results),
              "html_queue.size=", html_queue.qsize())
        time.sleep(random.randint(1, 2))


if __name__ == '__main__':
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    for idx in range(3):
        t = threading.Thread(target=do_crawl, args=(url_queue, html_queue),
                             name=f"crawl{idx}")
        t.start()
    fout = open("data.txt", "w", encoding="utf-8")
    for idx in range(2):
        t = threading.Thread(target=do_parse, args=(html_queue, fout),
                             name=f"parse{idx}")
        t.start()
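  • Note that the workers above loop forever (while True) and never exit on their own, and fout is never closed. One common fix, sketched here with names of my own choosing, is to push one sentinel value per worker onto the queue and let a worker exit when it sees it:
STOP = None  # sentinel meaning "no more work"

def do_crawl_v2(url_queue, html_queue):
    while True:
        url = url_queue.get()
        if url is STOP:
            html_queue.put(STOP)  # forward the signal to one parse worker
            break
        html_queue.put(crawl(url))

# In the main block: after the real urls, call url_queue.put(STOP) once per
# crawl thread, join all threads, then close fout.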

Thread Safety and How to Fix It

  • Thread safety means that a function or library, when called from multiple threads, handles shared variables correctly so the program still produces the right result.
  • Because the interpreter can switch threads at any moment, unsynchronized access to shared state can produce unpredictable results, i.e. thread-unsafe behaviour.
  • Use a Lock to protect the critical section:
# Usage 1: try-finally
import threading


lock = threading.Lock()
lock.acquire()
try:
    ...  # do something with the shared resource
finally:
    lock.release()


# Usage 2: context manager
import threading

lock = threading.Lock()
with lock:
    ...  # do something; put `with lock:` around the critical section in the worker function, then start the threads as usual
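  • A small self-contained illustration (the bank-account example is my own, not from the original post): without the lock, the balance check and the subtraction can interleave between the two threads and the balance goes negative; with the lock, the read-check-write sequence becomes atomic
import threading
import time

class Account:
    def __init__(self, balance):
        self.balance = balance

lock = threading.Lock()

def withdraw(account, amount):
    with lock:                   # remove this line (and dedent) to observe the race
        if account.balance >= amount:
            time.sleep(0.1)      # widen the window in which a thread switch can happen
            account.balance -= amount

if __name__ == '__main__':
    account = Account(1000)
    t1 = threading.Thread(target=withdraw, args=(account, 800))
    t2 = threading.Thread(target=withdraw, args=(account, 800))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(account.balance)  # 200 with the lock; may end up at -600 without it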

Thread Pools: ThreadPoolExecutor

  • Creating and destroying threads costs system resources; a pool reuses its threads, so frequently submitting short tasks does not pay that cost every time
  • Thread pools suit sudden bursts of many requests, or workloads that need many threads for tasks whose actual processing time is short
# How to use ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor() as pool:
    results = pool.map(crawl, urls)   # note: urls is a list of arguments, not a single argument
    for result in results:
        print(result)


with ThreadPoolExecutor() as pool:
    futures = [pool.submit(crawl, url) for url in urls]  # note: submit takes a single url

    # Two ways to iterate over the futures:
    # Option 1: results come back in the same order as urls
    for future in futures:
        print(future.result())
    # Option 2: whichever finishes first is returned first
    for future in as_completed(futures):
        print(future.result())
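  • The same concurrent.futures interface also provides process pools for CPU-bound work; a minimal sketch of my own, using a made-up CPU-bound function fib (the `if __name__ == '__main__'` guard is required for multiprocessing on some platforms):
from concurrent.futures import ProcessPoolExecutor

def fib(n):
    # deliberately slow, CPU-bound recursive Fibonacci
    return n if n < 2 else fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:  # defaults to os.cpu_count() workers
        for result in pool.map(fib, [30, 31, 32, 33]):
            print(result)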

Coroutine Concepts

Coroutines

  • async def declares a coroutine function
  • Calling a coroutine function returns a coroutine object
  • Calling a coroutine function directly does not run the code inside it
import asyncio

async def main():
    print('hello')
    await asyncio.sleep(1)
    print('world')

main()  # <coroutine object main at 0x1053bb7c8>

Ways to actually run a coroutine:

  • asyncio.run(); note that the example below still just runs main() sequentially
import asyncio

async def main():
    print('hello')
    await asyncio.sleep(1)
    print('world')

asyncio.run(main())
# hello
# world
  • Awaiting on a coroutine
  • The example below awaits the say_after coroutine function and drives everything with asyncio.run()
  • Note that main() still executes sequentially: the two awaits run one after the other
import asyncio
import time

async def say_after(delay, what):
    await asyncio.sleep(delay)
    print(what)

async def main():
    print(f"started at {time.strftime('%X')}")

    await say_after(1, 'hello')
    await say_after(2, 'world')

    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
# started at 17:13:52
# hello
# world
# finished at 17:13:55
  • asyncio.create_task()
  • Wrapping coroutines in tasks lets them run concurrently, as in the example below
async def main():
    task1 = asyncio.create_task(
        say_after(1, 'hello'))

    task2 = asyncio.create_task(
        say_after(2, 'world'))

    print(f"started at {time.strftime('%X')}")

    # Wait until both tasks are completed (should take
    # around 2 seconds.)
    await task1
    await task2

    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
# started at 17:14:32
# hello
# world
# finished at 17:14:34
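  • asyncio.gather is a closely related way to run coroutines concurrently: it wraps them into tasks itself and returns their results as a list (a short sketch reusing say_after from above):
async def main():
    print(f"started at {time.strftime('%X')}")
    await asyncio.gather(
        say_after(1, 'hello'),
        say_after(2, 'world'))          # gather returns a list of the results
    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
# still takes about 2 seconds in total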

Awaitables

  • We say that an object is an awaitable object if it can be used in an await expression.
  • There are three main types of awaitable objects: coroutines, Tasks, and Futures.

Tasks

  • Tasks are used to schedule coroutines concurrently
  • When a coroutine is wrapped into a Task with functions like asyncio.create_task(), the coroutine is automatically scheduled to run soon:
  • asyncio.create_task() returns a Task object
import asyncio

async def nested():
    return 42

async def main():
    # Schedule nested() to run soon concurrently
    # with "main()".
    task = asyncio.create_task(nested())

    # "task" can now be used to cancel "nested()", or
    # can simply be awaited to wait until it is complete:
    await task

asyncio.run(main())

Futures

  • A Future is a special low-level awaitable object that represents an eventual result of an asynchronous operation.
  • Normally there is no need to create Future objects at the application level code.
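  • One place Futures do show up in everyday code is when bridging a blocking function into asyncio with loop.run_in_executor, which returns an awaitable Future; a minimal sketch (blocking_io is a made-up example function):
import asyncio
import concurrent.futures

def blocking_io():
    # any blocking call, e.g. reading a file or a synchronous HTTP request
    with open(__file__, "rb") as f:
        return len(f.read())

async def main():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, blocking_io)  # awaits the Future
        print(result)

asyncio.run(main())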

Coroutine Templates

  • Using coroutines for general tasks
import asyncio

async def func1():
    ...                       # some work
    await asyncio.sleep(1)    # stand-in for an IO wait
    ...                       # some more work

async def func2():
    ...
    await asyncio.sleep(1)
    ...

async def func3():
    ...
    await asyncio.sleep(1)
    ...


# Approach 1:
async def main():
    f1 = func1()
    await f1  # await is placed before a coroutine object, and must appear inside an async function
    f2 = func2()
    await f2
    f3 = func3()
    await f3

# Approach 2 (recommended):
async def main():
    tasks = [
        asyncio.create_task(func1()),
        asyncio.create_task(func2()),
        asyncio.create_task(func3())]

    await asyncio.wait(tasks)


if __name__ == "__main__":
    asyncio.run(main())
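  • A note of my own: asyncio.gather can replace the create_task + asyncio.wait boilerplate in approach 2, and it also collects the return values in order
# Alternative to approach 2, reusing func1/func2/func3 from above:
async def main():
    results = await asyncio.gather(func1(), func2(), func3())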
  • Coroutines applied to a crawler
import asyncio

async def download(url):
    await asyncio.sleep(1)  # stand-in for the network request

async def main():
    urls = []  # fill in the urls to download

    tasks = []
    for url in urls:
        d = asyncio.create_task(download(url))
        tasks.append(d)

    await asyncio.wait(tasks)

if __name__ == "__main__":
    asyncio.run(main())
	
  • A more complete version with aiohttp
import asyncio
import aiohttp


async def aiodownload(url):
    name = ...  # placeholder: derive a file name for this url
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Ways to read the response body (each must be awaited):
            #   await response.content.read()  -> raw bytes
            #   await response.text()          -> text (note the parentheses)
            #   await response.json()          -> parsed JSON
            # The aiofiles module can make the file write asynchronous as well
            with open(name, mode="wb") as f:
                f.write(await response.content.read())  # reading the body is asynchronous, so it must be awaited
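  • The comment above mentions aiofiles; a hedged sketch of what the write could look like with it (assuming the aiofiles package is installed, and that the caller supplies url and name):
import aiofiles
import aiohttp

async def aiodownload_v2(url, name):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            data = await response.content.read()
    # aiofiles.open is an async context manager, and write() is awaited as well
    async with aiofiles.open(name, mode="wb") as f:
        await f.write(data)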

