Python Multithreading, Multiprocessing, and Coroutines

Python's Support for Concurrent Programming

  • Multithreading: threading — overlaps CPU work with IO so the CPU does not sit idle waiting
  • Multiprocessing: multiprocessing — uses multiple CPU cores, with each core doing both computation and IO
  • Async IO: asyncio — single-threaded, runs functions asynchronously
  • Use Lock to protect shared resources
  • Use Queue for communication between threads/processes, e.g. the producer-consumer pattern
  • Use thread pools and process pools to simplify submitting tasks, waiting, and collecting results

CPU-Bound vs. IO-Bound Workloads

  • CPU-bound: compression/decompression, encryption/decryption, regular-expression searching
  • IO-bound: read/write operations (disk and network); file processing, web crawlers, database access (a toy illustration follows this list)
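  • A toy illustration (my own, not from the original post): the first function keeps the CPU busy the whole time, while the second spends nearly all of its time waiting on the disk
import hashlib

def cpu_bound(data: bytes) -> str:
    # CPU-bound: repeated hashing keeps the processor fully busy
    for _ in range(100_000):
        data = hashlib.sha256(data).digest()
    return data.hex()

def io_bound(path: str) -> int:
    # IO-bound: almost all of the time goes to waiting for the disk
    with open(path, "rb") as f:
        return len(f.read())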

Multithreading, Multiprocessing, and Coroutines Compared

  • Multiprocessing: Process (multiprocessing) gives true parallelism on multiple CPU cores, but each process is heavyweight, so only a limited number can be started; best suited to CPU-bound work (a minimal sketch follows this list)
  • Multithreading: Thread (threading) is lighter than a process and uses fewer resources, but because of the GIL threads only run concurrently, never in parallel on multiple cores; the number of threads is still limited, each consumes memory, and switching between them has a cost; best suited to IO-bound work with a moderate number of simultaneous tasks
  • Coroutines: Coroutine (asyncio) have the lowest memory overhead and support the largest number of concurrent tasks, but library support is limited (for example requests does not support coroutines, so aiohttp is used instead) and the code is more involved; best suited to IO-bound work with a very large number of tasks
  • One process can start many threads, and one thread can run many coroutines
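  • A minimal multiprocessing sketch (my own illustration; cpu_task is a made-up CPU-bound function): the Process API mirrors threading.Thread, but each worker is a separate interpreter process with its own GIL and can run on its own core
import multiprocessing

def cpu_task(n):
    # CPU-bound work: sum of squares in pure Python
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # Start one process per task; use a Pool or a Queue if the results are needed
    processes = [multiprocessing.Process(target=cpu_task, args=(5_000_000,))
                 for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()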

GIL

  • The Global Interpreter Lock: it ensures that only one thread executes Python bytecode at any given moment, which is why CPU-bound code does not get faster with threads, as the timing sketch below shows
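  • A quick timing sketch of my own (exact numbers vary by machine): the same CPU-bound loop run in two threads takes roughly as long as running it twice sequentially, because the GIL lets only one thread execute bytecode at a time
import threading
import time

def countdown(n=10_000_000):
    # Pure-Python loop: CPU-bound, so the GIL serializes it across threads
    while n > 0:
        n -= 1

start = time.time()
countdown()
countdown()
print("sequential:", time.time() - start)

start = time.time()
t1 = threading.Thread(target=countdown)
t2 = threading.Thread(target=countdown)
t1.start(); t2.start()
t1.join(); t2.join()
print("two threads:", time.time() - start)  # roughly the same, sometimes slower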

Creating Threads in Python

import threading

def my_func(a, b):
    do_craw(a, b)


t = threading.Thread(target=my_func, args=(100, 200))
t.start()
t.join()  # wait for the thread to finish

  • Single-thread vs. multi-thread example
import requests
import threading
import time

urls = [f"https://www.cnblogs.com/#{page}" for page in range(50)]


def crawl(url):
    r = requests.get(url)
    print(url, len(r.text))


def single_thread():
    print("single thread start")
    for url in urls:
        crawl(url)
    print("single thread end")

def multi_thread():
    print("multi threads start")
    threads = []
    for url in urls:
        threads.append(threading.Thread(target=crawl, args=(url,)))  # args must be a tuple, hence (url,)
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print("multi threads end")


if __name__ == '__main__':
    start = time.time()
    multi_thread()
    end = time.time()
    print("多线程用时:", end- start)  # 多线程用时: 3.138007879257202

    start2 = time.time()
    single_url()
    end2 = time.time()
    print("单线程用时:", end2 - start2)  # 单线程用时: 11.541695833206177
  • The multi-threaded version finishes the urls in no particular order, while the single-threaded version processes them strictly in sequence

Producer-Consumer Crawler

import queue
import random
import threading
import time

import requests
from bs4 import BeautifulSoup


urls = [f"https://www.cnblogs.com/#{page}" for page in range(1, 51)]


def crawl(url):
    r = requests.get(url)
    return r.text


def parse(html):
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post-item-title")
    return [(link["href"], link.get_text()) for link in links]  # 返回一个列表,每个元素是一个元组,包含了每个页面上所有文章的链接与标题


def do_crawl(url_queue, html_queue):
    while True:
        url = url_queue.get()
        html = crawl(url)
        html_queue.put(html)
        print(threading.current_thread().name, f"crawl {url}",
              "url_queue.size=", url_queue.qsize())
        time.sleep(random.randint(1, 2))


def do_parse(html_queue, fout):
    while True:
        html = html_queue.get()
        results = parse(html)
        for result in results:
            fout.write(str(result) + "\n")
        print(threading.current_thread().name, f"results.size", len(results),
              "html_queue.size=", html_queue.qsize())
        time.sleep(random.randint(1, 2))


if __name__ == '__main__':
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    for idx in range(3):
        t = threading.Thread(target=do_crawl, args=(url_queue, html_queue),
                             name=f"crawl{idx}")
        t.start()
    fout = open("data.txt", "w", encoding="utf-8")
    for idx in range(2):
        t = threading.Thread(target=do_parse, args=(html_queue, fout),
                             name=f"parse{idx}")
        t.start()
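  • Note that the workers above loop forever (while True) and never exit on their own, and fout is never closed. One common fix, sketched here with names of my own choosing, is to push one sentinel value per worker onto the queue and let a worker exit when it sees it:
STOP = None  # sentinel meaning "no more work"

def do_crawl_v2(url_queue, html_queue):
    while True:
        url = url_queue.get()
        if url is STOP:
            html_queue.put(STOP)  # forward the signal to one parse worker
            break
        html_queue.put(crawl(url))

# In the main block: after the real urls, call url_queue.put(STOP) once per
# crawl thread, join all threads, then close fout.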

Thread Safety and How to Fix It

  • Thread safety means that a function or library, when called from multiple threads, handles shared variables correctly so the program still produces the right result.
  • Because the interpreter can switch threads at any moment, unsynchronized access to shared state can produce unpredictable results, i.e. thread-unsafe behaviour.
  • Use a Lock to protect the critical section:
# Usage 1: try-finally
import threading


lock = threading.Lock()
lock.acquire()
try:
    ...  # do something with the shared resource
finally:
    lock.release()


# Usage 2: context manager
import threading

lock = threading.Lock()
with lock:
    ...  # do something; put `with lock:` around the critical section in the worker function, then start the threads as usual
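  • A small self-contained illustration (the bank-account example is my own, not from the original post): without the lock, the balance check and the subtraction can interleave between the two threads and the balance goes negative; with the lock, the read-check-write sequence becomes atomic
import threading
import time

class Account:
    def __init__(self, balance):
        self.balance = balance

lock = threading.Lock()

def withdraw(account, amount):
    with lock:                   # remove this line (and dedent) to observe the race
        if account.balance >= amount:
            time.sleep(0.1)      # widen the window in which a thread switch can happen
            account.balance -= amount

if __name__ == '__main__':
    account = Account(1000)
    t1 = threading.Thread(target=withdraw, args=(account, 800))
    t2 = threading.Thread(target=withdraw, args=(account, 800))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(account.balance)  # 200 with the lock; may end up at -600 without it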

Thread Pools: ThreadPoolExecutor

  • Creating and destroying threads costs system resources; a pool reuses its threads, so frequently submitting short tasks does not pay that cost every time
  • Thread pools suit sudden bursts of many requests, or workloads that need many threads for tasks whose actual processing time is short
# How to use ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor() as pool:
    results = pool.map(crawl, urls)   # note: urls is a list of arguments, not a single argument
    for result in results:
        print(result)


with ThreadPoolExecutor() as pool:
    futures = [pool.submit(crawl, url) for url in urls]  # note: submit takes a single url

    # Two ways to iterate over the futures:
    # Option 1: results come back in the same order as urls
    for future in futures:
        print(future.result())
    # Option 2: whichever finishes first is returned first
    for future in as_completed(futures):
        print(future.result())
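  • The same concurrent.futures interface also provides process pools for CPU-bound work; a minimal sketch of my own, using a made-up CPU-bound function fib (the `if __name__ == '__main__'` guard is required for multiprocessing on some platforms):
from concurrent.futures import ProcessPoolExecutor

def fib(n):
    # deliberately slow, CPU-bound recursive Fibonacci
    return n if n < 2 else fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:  # defaults to os.cpu_count() workers
        for result in pool.map(fib, [30, 31, 32, 33]):
            print(result)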

Coroutine Concepts

Coroutines

  • async def declares a coroutine function
  • Calling a coroutine function returns a coroutine object
  • Calling a coroutine function directly does not run the code inside it
import asyncio

async def main():
    print('hello')
    await asyncio.sleep(1)
    print('world')

main()  # <coroutine object main at 0x1053bb7c8>

Ways to actually run a coroutine:

  • asyncio.run(); note that the example below still just runs main() sequentially
import asyncio

async def main():
    print('hello')
    await asyncio.sleep(1)
    print('world')

asyncio.run(main())
# hello
# world
  • Awaiting on a coroutine
  • The example below awaits the say_after coroutine function and drives everything with asyncio.run()
  • Note that main() still executes sequentially: the two awaits run one after the other
import asyncio
import time

async def say_after(delay, what):
    await asyncio.sleep(delay)
    print(what)

async def main():
    print(f"started at {time.strftime('%X')}")

    await say_after(1, 'hello')
    await say_after(2, 'world')

    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
# started at 17:13:52
# hello
# world
# finished at 17:13:55
  • asyncio.create_task()
  • Wrapping coroutines in tasks lets them run concurrently, as in the example below
async def main():
    task1 = asyncio.create_task(
        say_after(1, 'hello'))

    task2 = asyncio.create_task(
        say_after(2, 'world'))

    print(f"started at {time.strftime('%X')}")

    # Wait until both tasks are completed (should take
    # around 2 seconds.)
    await task1
    await task2

    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
# started at 17:14:32
# hello
# world
# finished at 17:14:34
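  • asyncio.gather is a closely related way to run coroutines concurrently: it wraps them into tasks itself and returns their results as a list (a short sketch reusing say_after from above):
async def main():
    print(f"started at {time.strftime('%X')}")
    await asyncio.gather(
        say_after(1, 'hello'),
        say_after(2, 'world'))          # gather returns a list of the results
    print(f"finished at {time.strftime('%X')}")

asyncio.run(main())
# still takes about 2 seconds in total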

Awaitables

  • We say that an object is an awaitable object if it can be used in an await expression.
  • There are three main types of awaitable objects: coroutines, Tasks, and Futures.

Tasks

  • Tasks are used to schedule coroutines concurrently
  • When a coroutine is wrapped into a Task with functions like asyncio.create_task(), the coroutine is automatically scheduled to run soon:
  • asyncio.create_task() returns a Task object
import asyncio

async def nested():
    return 42

async def main():
    # Schedule nested() to run soon concurrently
    # with "main()".
    task = asyncio.create_task(nested())

    # "task" can now be used to cancel "nested()", or
    # can simply be awaited to wait until it is complete:
    await task

asyncio.run(main())

Futures

  • A Future is a special low-level awaitable object that represents an eventual result of an asynchronous operation.
  • Normally there is no need to create Future objects at the application level code.
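  • One place Futures do show up in everyday code is when bridging a blocking function into asyncio with loop.run_in_executor, which returns an awaitable Future; a minimal sketch (blocking_io is a made-up example function):
import asyncio
import concurrent.futures

def blocking_io():
    # any blocking call, e.g. reading a file or a synchronous HTTP request
    with open(__file__, "rb") as f:
        return len(f.read())

async def main():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, blocking_io)  # awaits the Future
        print(result)

asyncio.run(main())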

Coroutine Templates

  • Using coroutines for general tasks
import asyncio

async def func1():
    ...                       # some work
    await asyncio.sleep(1)    # stand-in for an IO wait
    ...                       # some more work

async def func2():
    ...
    await asyncio.sleep(1)
    ...

async def func3():
    ...
    await asyncio.sleep(1)
    ...


# Approach 1:
async def main():
    f1 = func1()
    await f1  # await is placed before a coroutine object, and must appear inside an async function
    f2 = func2()
    await f2
    f3 = func3()
    await f3

# Approach 2 (recommended):
async def main():
    tasks = [
        asyncio.create_task(func1()),
        asyncio.create_task(func2()),
        asyncio.create_task(func3())]

    await asyncio.wait(tasks)


if __name__ == "__main__":
    asyncio.run(main())
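  • A note of my own: asyncio.gather can replace the create_task + asyncio.wait boilerplate in approach 2, and it also collects the return values in order
# Alternative to approach 2, reusing func1/func2/func3 from above:
async def main():
    results = await asyncio.gather(func1(), func2(), func3())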
  • Coroutines applied to a crawler
import asyncio

async def download(url):
    await asyncio.sleep(1)  # stand-in for the network request

async def main():
    urls = []  # fill in the urls to download

    tasks = []
    for url in urls:
        d = asyncio.create_task(download(url))
        tasks.append(d)

    await asyncio.wait(tasks)

if __name__ == "__main__":
    asyncio.run(main())
	
  • A more complete version with aiohttp
import asyncio
import aiohttp


async def aiodownload(url):
    name = ...  # placeholder: derive a file name for this url
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # Ways to read the response body (each must be awaited):
            #   await response.content.read()  -> raw bytes
            #   await response.text()          -> text (note the parentheses)
            #   await response.json()          -> parsed JSON
            # The aiofiles module can make the file write asynchronous as well
            with open(name, mode="wb") as f:
                f.write(await response.content.read())  # reading the body is asynchronous, so it must be awaited
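  • The comment above mentions aiofiles; a hedged sketch of what the write could look like with it (assuming the aiofiles package is installed, and that the caller supplies url and name):
import aiofiles
import aiohttp

async def aiodownload_v2(url, name):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            data = await response.content.read()
    # aiofiles.open is an async context manager, and write() is awaited as well
    async with aiofiles.open(name, mode="wb") as f:
        await f.write(data)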

