Playing with Python Concurrency (1): Processes
I. Preface
Before we get into concurrency, a quick primer on what CPU-bound and IO-bound mean.
CPU-bound:
Also called compute-bound. CPU-bound tasks are characterized by heavy computation that consumes CPU resources: calculating pi, decoding HD video, and so on, all riding on raw CPU power. Such tasks can be split across multiple workers, but the more tasks there are, the more time is spent switching between them and the lower the CPU's useful throughput. So to use the CPU most efficiently, the number of CPU-bound tasks running concurrently should equal the number of CPU cores. Because these tasks mainly consume CPU, raw execution speed matters a great deal: a scripting language like Python executes slowly and is a poor fit for CPU-bound work, and large compute-bound jobs are best written in C.
IO-bound:
Tasks that touch the network or disk are IO-bound. They consume little CPU; most of their time is spent waiting for IO to complete (because IO is far slower than the CPU and memory). For IO-bound work, more concurrent tasks mean higher CPU utilization, up to a limit. Most everyday workloads are IO-bound, web applications being a typical example. While an IO-bound task runs, 99% of the time goes to IO and very little to the CPU, so replacing a slow scripting language like Python with blazing-fast C buys essentially nothing. The best language for IO-bound work is whichever offers the highest development efficiency (the least code): scripting languages are the first choice, C the worst.
In short: CPU-bound programs call for multithreaded C, while IO-bound programs are well served by multithreading in a scripting language such as Python.
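To make the distinction concrete, here is a minimal sketch (the function names cpu_bound and io_bound are illustrative, not from any library) contrasting the two kinds of task:
import time

def cpu_bound(n: int) -> int:
    # Pure computation: the CPU is busy for the whole call
    return sum(i * i for i in range(n))

def io_bound(seconds: float) -> None:
    # time.sleep stands in for a network or disk wait: the CPU is idle for the whole call
    time.sleep(seconds)

if __name__ == '__main__':
    cpu_bound(10_000_000)  # pins one core at 100%
    io_bound(1.0)          # occupies no core at all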
Concurrent programming is arguably a rite of passage out of junior-programmer territory. The scripts and programs we write early on, crawlers, batch file processors and the like, tend to run sequentially and do only one thing at a time: what we call "single-threaded" programs, also known as synchronous (sync) or blocking programs.
Yet our CPUs have multiple cores and can genuinely multitask. And sometimes a program is stuck in an IO operation, with the CPU doing no computation at all, just waiting. To make full use of the CPU in these situations and improve throughput, we need a different programming model: concurrency.
Python offers three main concurrency models: processes (Process), threads (Thread), and coroutines (Async). Essentially all concurrent code has to address three problems: communication, synchronization, and shared memory.
The three models handle these problems differently, each with its own trade-offs. In terms of how they are scheduled, they split into preemptive (processes and threads) versus cooperative (coroutines), and into parallel (processes) versus non-parallel (threads and coroutines; in CPython, threads cannot execute Python bytecode in parallel because of the GIL).
For more detail on sync vs. async, concurrency vs. parallelism, and threads vs. processes, see https://www.cnblogs.com/minseo/p/15402374.html and https://blog.csdn.net/just_learing/article/details/123785824, both well worth reading.
II. Processes (Process)
Multiprocessing is a natural fit for CPU-bound tasks.
1. Let's start by creating a task function.
import random
import time
import os

def expensive_function(n: int):
    # Print this process's PID and its parent's PID
    print(f'[PID = {os.getpid()}] (Parent process is {os.getppid()}) Executing with n = {n}...')
    sleep_time = random.randint(1, 5)
    time.sleep(sleep_time)
    print(f'[PID = {os.getpid()}] n = {n} Done, slept {sleep_time}s')

if __name__ == '__main__':
    expensive_function(3)
Output:
[PID = 11400] (Parent process is 6840) Executing with n = 3...
[PID = 11400] n = 3 Done, slept 5s
As you can see, the current process's PID is 11400 and its parent's PID is 6840.
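Before reaching for a pool, here is a minimal sketch (assuming the expensive_function defined above) of running the task in a single child process; the child prints a PID different from the parent's:
import multiprocessing

if __name__ == '__main__':
    p = multiprocessing.Process(target=expensive_function, args=(3,))
    p.start()  # launch the child process
    p.join()   # wait for it to finish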
2. Create a process pool.
import random
import time
import os
from concurrent import futures  # futures is Python's common interface for concurrency; executors come in thread and process flavors

def expensive_function(n: int):
    # Print this process's PID and its parent's PID
    print(f'[PID = {os.getpid()}] (Parent process is {os.getppid()}) Executing with n = {n}...')
    sleep_time = random.randint(1, 5)
    time.sleep(sleep_time)
    print(f'[PID = {os.getpid()}] n = {n} Done, slept {sleep_time}s')

def execute_with_pool():
    tasks = [i for i in range(10)]
    print('Starting process pool')
    with futures.ProcessPoolExecutor(max_workers=5) as pool:
        # Feed each element of tasks to expensive_function as its argument,
        # distributing the calls across the pool's worker processes
        pool.map(expensive_function, tasks)

if __name__ == '__main__':
    execute_with_pool()
Output:
Starting process pool
[PID = 22464] (Parent process is 22888) Executing with n = 0...
[PID = 23132] (Parent process is 22888) Executing with n = 1...
[PID = 23004] (Parent process is 22888) Executing with n = 2...
[PID = 11032] (Parent process is 22888) Executing with n = 3...
[PID = 5392] (Parent process is 22888) Executing with n = 4...
[PID = 5392] n = 4 Done, slept 1s
[PID = 5392] (Parent process is 22888) Executing with n = 5...
[PID = 5392] n = 5 Done, slept 1s
[PID = 5392] (Parent process is 22888) Executing with n = 6...
[PID = 23132] n = 1 Done, slept 3s
[PID = 23004] n = 2 Done, slept 3s
[PID = 23132] (Parent process is 22888) Executing with n = 7...
[PID = 23004] (Parent process is 22888) Executing with n = 8...
[PID = 23132] n = 7 Done, slept 1s
[PID = 23132] (Parent process is 22888) Executing with n = 9...
[PID = 22464] n = 0 Done, slept 5s
[PID = 11032] n = 3 Done, slept 5s
[PID = 23004] n = 8 Done, slept 3s
[PID = 5392] n = 6 Done, slept 4s
[PID = 23132] n = 9 Done, slept 3s
Notice that the program starts 5 worker processes right away. Take PID = 5392: as soon as its n = 4 task is done, the same process immediately picks up the n = 5 task, and so on, until the last task (n = 9) finishes and the program exits.
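Note that pool.map as used here throws the return values away (expensive_function returns None anyway). If you need each task's result as it completes, a sketch using submit together with futures.as_completed (reusing the expensive_function above) would look like this:
def execute_with_submit():
    with futures.ProcessPoolExecutor(max_workers=5) as pool:
        # submit returns a Future per task; as_completed yields each one as it finishes
        fs = {pool.submit(expensive_function, n): n for n in range(10)}
        for f in futures.as_completed(fs):
            print(f'task n = {fs[f]} finished, result = {f.result()}')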
Here we used the futures.ProcessPoolExecutor API to get a ready-made process pool. Could we build something like a pool ourselves?
import random
import time
import os
import multiprocessing
from concurrent import futures  # futures is Python's common interface for concurrency; executors come in thread and process flavors

def expensive_function(n: int):
    # Print this process's PID and its parent's PID
    print(f'[PID = {os.getpid()}] (Parent process is {os.getppid()}) Executing with n = {n}...')
    sleep_time = random.randint(1, 5)
    time.sleep(sleep_time)
    print(f'[PID = {os.getpid()}] n = {n} Done, slept {sleep_time}s')

def execute_with_pool():
    tasks = [i for i in range(10)]
    print('Starting process pool')
    with futures.ProcessPoolExecutor(max_workers=5) as pool:
        # Feed each element of tasks to expensive_function as its argument,
        # distributing the calls across the pool's worker processes
        pool.map(expensive_function, tasks)

def execute_with_raw_pool():
    tasks = [i for i in range(10)]
    raw_process_pool = [multiprocessing.Process(target=expensive_function, args=(tasks[i],)) for i in range(10)]
    for p in raw_process_pool:
        p.start()  # launch each process
    # join blocks the calling process (here, the main process) until the joined
    # process finishes; this loop therefore waits for every child to exit
    for p in raw_process_pool:
        p.join()

if __name__ == '__main__':
    execute_with_raw_pool()
Here we used multiprocessing.Process to spin up a batch of processes by hand, with start and join controlling their launch and teardown. Strictly speaking, this is not a pool: there is no worker cap, so all 10 processes start at once (note the 10 distinct PIDs below), whereas ProcessPoolExecutor(max_workers=5) reuses 5 workers.
Output:
[PID = 15652] (Parent process is 2708) Executing with n = 0...
[PID = 3648] (Parent process is 2708) Executing with n = 1...
[PID = 5696] (Parent process is 2708) Executing with n = 2...
[PID = 6604] (Parent process is 2708) Executing with n = 3...
[PID = 2244] (Parent process is 2708) Executing with n = 4...
[PID = 6116] (Parent process is 2708) Executing with n = 5...
[PID = 7024] (Parent process is 2708) Executing with n = 6...
[PID = 12332] (Parent process is 2708) Executing with n = 7...
[PID = 6620] (Parent process is 2708) Executing with n = 8...
[PID = 13856] (Parent process is 2708) Executing with n = 9...
[PID = 5696] n = 2 Done, slept 1s
[PID = 6604] n = 3 Done, slept 1s
[PID = 7024] n = 6 Done, slept 1s
[PID = 13856] n = 9 Done, slept 1s
[PID = 3648] n = 1 Done, slept 3s
[PID = 6116] n = 5 Done, slept 3s
[PID = 12332] n = 7 Done, slept 3s
[PID = 15652] n = 0 Done, slept 4s
[PID = 2244] n = 4 Done, slept 4s
[PID = 6620] n = 8 Done, slept 5s
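The standard library also ships a true pool in multiprocessing itself. A minimal sketch (reusing the expensive_function above) that caps the workers at 5, just as ProcessPoolExecutor did:
def execute_with_mp_pool():
    tasks = [i for i in range(10)]
    # multiprocessing.Pool reuses 5 worker processes instead of spawning 10
    with multiprocessing.Pool(processes=5) as pool:
        pool.map(expensive_function, tasks)

if __name__ == '__main__':
    execute_with_mp_pool()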
3. Inter-process communication
Inter-process communication is usually done with a Queue.
import random
import time
import os
import multiprocessing
from concurrent import futures  # futures is Python's common interface for concurrency; executors come in thread and process flavors

def communicate_between_process():
    q = multiprocessing.Queue()  # create a queue
    p_put = multiprocessing.Process(target=put, args=(q,))  # one process to send messages
    p_get = multiprocessing.Process(target=get, args=(q,))  # one process to receive them
    p_put.start()
    p_get.start()
    p_put.join()
    p_get.join()

def put(q: multiprocessing.Queue):
    msg = ['an', object, 3]  # note: this deliberately includes the builtin class object itself
    print(f'[PID = {os.getpid()}] Sleep 2 Seconds')
    time.sleep(2)
    q.put(msg)
    print(f'[PID = {os.getpid()}] Put msg={msg}')

def get(q: multiprocessing.Queue):
    print(f'[PID = {os.getpid()}] Pulling {q}')
    res = q.get()  # blocks until a message is available
    print(f'[PID = {os.getpid()}] Get res={res}')

if __name__ == '__main__':
    communicate_between_process()
Output:
[PID = 14796] Sleep 2 Seconds
[PID = 15740] Pulling <multiprocessing.queues.Queue object at 0x000001F775470288>
[PID = 14796] Put msg=['an', <class 'object'>, 3]
[PID = 15740] Get res=['an', <class 'object'>, 3]
As we can see, the put process and the get process exchanged a message through the queue.
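Besides Queue, multiprocessing also provides Pipe for point-to-point communication between exactly two endpoints. A minimal sketch (sender and communicate_with_pipe are illustrative names):
import multiprocessing

def sender(conn):
    conn.send(['an', 'example', 3])  # any picklable object can be sent
    conn.close()

def communicate_with_pipe():
    parent_conn, child_conn = multiprocessing.Pipe()  # two connected endpoints
    p = multiprocessing.Process(target=sender, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # blocks until the child sends
    p.join()

if __name__ == '__main__':
    communicate_with_pipe()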
4. Inter-process synchronization
Inter-process synchronization is usually done with a Lock.
import multiprocessing
import os

def f(lock, i: int):
    lock.acquire()  # acquire the lock; if it is already held, other processes block here until it is released
    try:
        print(f'[PID = {os.getpid()}] i = {i} lock acquired, computing...')
    finally:
        lock.release()  # release the lock
        print(f'[PID = {os.getpid()}] i = {i} lock released')

def sync_process():
    # process synchronization
    lock = multiprocessing.Lock()
    ps = []  # holds the processes
    for num in range(10):
        p = multiprocessing.Process(target=f, args=(lock, num))
        ps.append(p)
        p.start()
    for p in ps:
        p.join()

if __name__ == '__main__':
    sync_process()
Output:
[PID = 19944] i = 3 lock acquired, computing...
[PID = 19944] i = 3 lock released
[PID = 24172] i = 2 lock acquired, computing...
[PID = 24172] i = 2 lock released
[PID = 16496] i = 1 lock acquired, computing...
[PID = 16496] i = 1 lock released
[PID = 2424] i = 0 lock acquired, computing...
[PID = 2424] i = 0 lock released
[PID = 13852] i = 5 lock acquired, computing...
[PID = 13852] i = 5 lock released
[PID = 19468] i = 4 lock acquired, computing...
[PID = 19468] i = 4 lock released
[PID = 19092] i = 7 lock acquired, computing...
[PID = 19092] i = 7 lock released
[PID = 18744] i = 6 lock acquired, computing...
[PID = 18744] i = 6 lock released
[PID = 3644] i = 8 lock acquired, computing...
[PID = 3644] i = 8 lock released
[PID = 23460] i = 9 lock acquired, computing...
[PID = 23460] i = 9 lock released
As you can see, because of the lock, the processes cannot execute the guarded section at the same time: only after one process releases the lock can another acquire it and proceed.
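One caveat: if the code between acquire and release raised without the try/finally, the lock would stay held forever and every other process would deadlock. Lock also works as a context manager, which releases automatically. A minimal rewrite of f in that style:
def f(lock, i: int):
    # 'with lock' acquires on entry and releases on exit, even if an exception is raised
    with lock:
        print(f'[PID = {os.getpid()}] i = {i} lock acquired, computing...')
    print(f'[PID = {os.getpid()}] i = {i} lock released')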
5. Shared memory between processes
Variables shared across processes can be managed through a server-process manager, Manager.
import ctypes
import multiprocessing

def modify_vld(int_data, str_data, l: list, d: dict):
    # The arguments are Manager proxies, so mutations here are visible to the parent
    int_data.value = 666
    str_data.value = 'new_qa'
    l.extend([33, 44])
    d['name'] = str_data.value

def share_with_manager():
    # One Manager() is enough; each extra call would spawn a separate manager process
    manager = multiprocessing.Manager()
    int_data = manager.Value(ctypes.c_int, 100)
    str_data = manager.Value(ctypes.c_char_p, 'qa')
    l = manager.list([11, 22])
    d = manager.dict({'age': 24})
    process = multiprocessing.Process(target=modify_vld, args=(int_data, str_data, l, d))
    process.start()
    process.join()
    print(int_data.value)
    print(str_data.value)
    print(l)
    print(d)

if __name__ == '__main__':
    share_with_manager()
Output:
666
new_qa
[11, 22, 33, 44]
{'age': 24, 'name': 'new_qa'}
As you can see, modify_vld changed the value of every shared variable.
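For plain numeric data, multiprocessing.Value and multiprocessing.Array place the data in real shared memory, avoiding the round-trip through the Manager's server process. A minimal sketch (increment and share_with_value are illustrative names):
import multiprocessing

def increment(counter, arr):
    with counter.get_lock():  # guard the read-modify-write on the shared int
        counter.value += 1
    for i in range(len(arr)):
        arr[i] *= 2

def share_with_value():
    counter = multiprocessing.Value('i', 0)       # a shared C int
    arr = multiprocessing.Array('d', [1.0, 2.0])  # a shared C double array
    p = multiprocessing.Process(target=increment, args=(counter, arr))
    p.start()
    p.join()
    print(counter.value, list(arr))  # expected: 1 [2.0, 4.0]

if __name__ == '__main__':
    share_with_value()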