g. Multitasking Crawlers

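1. Processes

1.1 Concepts

Task: a task is a program; for a piece of software, one feature is one task
Process
- The unit in which the operating system executes a task
  - A process is itself an abstract concept
  - That is, a process is a procedure in execution, a running task
- A program is only called a process while it is running
  - It is run by the CPU
- Background process: a service
- Foreground process: shown to the user; it usually has a higher priority and is less likely to be killed
- Advantage of multiple processes: in large projects they make full use of memory and CPU
Multitasking
- Implemented with multiple processes
- Implemented with multiple threads
- Implemented with multiple processes + multiple threads
- Implemented with multiple processes + coroutines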

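1.2 Single-Process, Single-Thread Model

import time

def eat():  # eat something
    while True:
        print('he is eating something')
        time.sleep(2)   # rest a bit before eating again

def praise():  # give praise
    while True:
        print('he is a good man')
        time.sleep(3)

if __name__ == '__main__':
    eat()
    praise()

Note: from the entry point onward the program stays inside eat(). Because eat() loops forever and never returns, execution never reaches praise().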

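1.3 Multitasking with Multiple Processes

import time
from multiprocessing import Process

def eat():  # eat something
    while True:
        print('he is eating something')
        time.sleep(2)   # rest a bit before eating again

def praise():  # give praise
    while True:
        print('he is a good man')
        time.sleep(3)

if __name__ == '__main__':
    eatProcess = Process(target=eat)
    eatProcess.start()
    praise()

Process() creates a child process
Process parameters
- target: the task (function) the child process runs
- name: the process name
- args/kwargs: the arguments passed to the target function
Process instance methods and attributes
- is_alive(): returns whether the process is still running.
- join([timeout]): blocks the calling process until the process whose join() was called terminates or the given timeout expires
- start(): starts the process; it becomes ready and waits to be scheduled by the CPU
- terminate(): stops the process immediately, whether or not its task has finished
- name: the process name
- pid: the process ID
Extra: ps -ef | grep xxx shows the processes of the running program xxx
xxx is usually the file name of the Python script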
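The attributes and methods listed above can be observed on a single process object. The following is a minimal sketch rather than part of the original notes; the helper names `nap` and `describe` are made up for illustration.

```python
import time
from multiprocessing import Process


def nap(seconds):
    """Hypothetical task: just sleep for a while."""
    time.sleep(seconds)


def describe(p):
    """Snapshot of the attributes discussed above."""
    return {"name": p.name, "pid": p.pid, "alive": p.is_alive()}


if __name__ == "__main__":
    p = Process(target=nap, name="napper", args=(0.5,))
    print(describe(p))   # not started yet: pid is None, alive is False
    p.start()
    print(describe(p))   # started: pid is now set
    p.join(timeout=0.1)  # gives up waiting after 0.1 s; the child keeps running
    print(p.is_alive())
    p.terminate()        # stop the child without waiting for nap() to finish
    p.join()             # wait for the terminated child to be cleaned up
    print(p.is_alive(), p.exitcode)
```

Calling `join()` after `terminate()` is a common pattern: `terminate()` only requests the stop, while `join()` waits until the child has actually exited.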

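1.4 Creating a Process and Passing Arguments

Function with parameters

import time

def eat(something='something', food='noodles'):  # eat something
    while True:
        print('he is eating', something, food)
        time.sleep(2)  # rest a bit before eating again

Passing arguments with args

Values are passed by position

eg:

from multiprocessing import Process

eatProcess = Process(target=eat, args=('apple', 'rice'))

Passing arguments with kwargs

Values are passed by keyword

eg:

eatProcess = Process(target=eat, kwargs={'something': 'apple', 'food': 'rice'})

Extra

Creating a process by subclassing

import os
import time
from multiprocessing import Process

class EatProcess(Process):
    def __init__(self, name, something, food):
        super().__init__(name=name)
        self.something = something
        self.food = food

    def run(self):
        print('--child process started-', self.name, self.pid, os.getppid())
        while True:
            print('he is eating', self.something, self.food)
            time.sleep(2)  # rest a bit before eating again

os.getppid(): returns the ID of the current process's parent process

if __name__ == '__main__':
    # instantiate the process class and start it
    eatPrg = EatProcess('eatPro', 'watermelon', 'beer')
    eatPrg.start()

    print('main process', os.getpid())

os.getpid(): returns the ID of the current process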

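Creating processes with a process pool
```
from multiprocessing import Pool

# eat(food) is defined in the last block of this section
pool = Pool(2)
for i in range(10):
    # add a new task to the process pool
    pool.apply_async(eat, args=('food %d ' % i,))
pool.close()  # close the process pool; the pool must be closed before join()
pool.join()  # wait until all child processes have finished before the program exits
```
- Pool(processes=2):
  - processes: the number of worker processes to create
  - if omitted, the value of os.cpu_count() is used
- apply()
  - submits a task to the pool synchronously: the next task cannot start until the previously submitted one has finished
  - the tasks in the pool are therefore executed serially
- apply_async()
  - submits a task to the pool asynchronously: the next task can start even if the previously submitted one has not finished
  - the tasks in the pool are therefore executed in parallel
- Getting results from child processes
  - apply_async(callback=xxxCallback)  # the callback hands the result back to the main process
```
r_l = []

res = pool.apply_async(...)
r_l.append(res)
...
pool.close()
pool.join()
for res in r_l:
    print(res.get())  # fetch the results after all child processes have finished
```
```
import os
import time

def eat(food):
    for i in range(5):
        print('child process', os.getpid(), os.getppid())
        print('--I want to eat- item %d' % i, food, os.getpid())
        time.sleep(2)
```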
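The two ways of getting results back from `apply_async()` (a callback, or calling `get()` on the returned handle) can be put side by side. This is a minimal sketch under assumptions: `square` and `run` are hypothetical names, and the 30-second timeout is an arbitrary safety margin.

```python
from multiprocessing import Pool


def square(n):
    """Hypothetical task; defined at module level so worker processes can import it."""
    return n * n


def run(pool, numbers):
    """Submit every number asynchronously and collect the results in two ways."""
    via_callback = []
    handles = [
        pool.apply_async(square, args=(n,), callback=via_callback.append)
        for n in numbers
    ]
    # get() blocks until that task is done; the timeout avoids waiting forever
    via_get = [h.get(timeout=30) for h in handles]
    pool.close()   # no more tasks will be submitted
    pool.join()    # wait for the workers to exit
    return via_get, sorted(via_callback)


if __name__ == "__main__":
    with Pool(processes=2) as pool:
        print(run(pool, range(5)))
```

The callback runs in the main process, so appending to an ordinary list is enough. The callback results are sorted because tasks may complete in any order, whereas the handles keep submission order.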

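1.5 Process Execution Order

Strictly speaking, processes run in no particular order
Each process has to obtain CPU time from the scheduler, so the order can be treated as random
Main process and child processes
- The process a program starts in is usually called the main process (parent process)
- Other processes started from within a process are called child processes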

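1.6 Variable Scope Across Processes

Variable scope between processes (a matter of memory allocation)
- Each process has its own independent memory space, including its own heap and stack
When a parent process creates a child process and the child uses a variable from the parent, the child works on its own copy of that variable in its own memory

For example

import os
import time
from multiprocessing import Process

names = []  # global variable

def addNames():
    global names  # parent and child run concurrently and do not share globals; the child only has a copy
    names.clear()  # affects only the child's copy, not the main process's variable
    names.append('Disen at ' + str(os.getpid()))

    time.sleep(2)
    print('child process', os.getpid(), names)

if __name__ == '__main__':
    names.append('Jack')
    names.append('Lucy')

    p = Process(target=addNames)
    p.start()

    names.append('Cici')
    print('main process', os.getpid(), names)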

Extra
- Sharing data between processes

Value or Array

  ```
  from multiprocessing import Array, Process, Value

  import os
  import time

  def f(n, a):
      print('child process', os.getpid())
      n.value = 20.5  # assign to the Value
      for i in range(len(a)):
          a[i] += 10
          print('child process', os.getpid(), a[i])
          time.sleep(1)

  if __name__ == '__main__':
      num = Value('f', 0.1)
      arr = Array('i', range(10))
      p = Process(target=f, args=(num, arr))
      p.start()
      p.join()
      print('main process', os.getpid())
      print(num.value)  # read the value
      print(arr[::-1])
  ```

     - The 'f' in Value('f', 0.1) and the 'i' in Array('i', ...) are type codes that correspond to C types
       - 'c': ctypes.c_char
       - 'u': ctypes.c_wchar
       - 'b': ctypes.c_byte
       - 'B': ctypes.c_ubyte
       - 'h': ctypes.c_short
       - 'H': ctypes.c_ushort
       - 'i': ctypes.c_int
       - 'I': ctypes.c_uint
       - 'l': ctypes.c_long
       - 'L': ctypes.c_ulong
       - 'f': ctypes.c_float
       - 'd': ctypes.c_double
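The choice of type code has visible consequences. The sketch below, which needs no child process, illustrates two points: `'f'` stores a 32-bit C float, so a Python float is rounded on the way in, and a `Value` carries its own lock that can guard a read-modify-write such as `+=`. The variable names are illustrative only.

```python
from multiprocessing import Array, Value

single = Value('f', 0.1)    # C float, 32 bits
double = Value('d', 0.1)    # C double, 64 bits, the same precision as a Python float
print(single.value == 0.1)  # False: 0.1 was rounded to single precision
print(double.value == 0.1)  # True

counter = Value('i', 0)
with counter.get_lock():    # a Value is created with its own lock by default
    counter.value += 1      # += is a read followed by a write, so guard it

arr = Array('i', [1, 2, 3])
print(arr[:])               # slicing copies the contents into a plain list
```

When several processes update the same `Value`, wrapping each update in `get_lock()` keeps the read and the write together.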

Manager

A Manager can also register a custom class of our own:
1. Define the User class; if particular attributes must be shared across processes, store them in a type such as Value or a list
2. Define the MyManager class, which inherits from BaseManager and has an empty body; this is a minimal Manager
3. Define a function that creates the MyManager object. Note: a BaseManager object must be started, i.e. start() must be called
4. Create a Lock() object directly

The code:

from multiprocessing import Process, Value, Lock
import os
import time
from multiprocessing.managers import BaseManager

class User:
    def __init__(self, name, salary):
        self.name = name
        self.money = Value('f', salary)

    def increase(self):
        self.money.value += 1000
        time.sleep(1)

    def __repr__(self):
        return '{} money is {}'.format(self.name, self.money.value)

class MyManager(BaseManager):
    # custom Manager
    pass

# register the model class with the manager
MyManager.register("User", User)

def Manager():  # function that creates the Manager object
    m = MyManager()
    m.start()
    return m

def f(user, lock):
    with lock:
        user.increase()
        print(os.getpid(), 'child process', user)

if __name__ == '__main__':
    manager = Manager()  # data manager shared by the processes
    user = manager.User('disen', 100)

    lock = Lock()
    procs = [Process(target=f, args=(user, lock)) for i in range(5)]
    for proc in procs:
        proc.start()

    for proc in procs:
        proc.join()

    print('main process', os.getpid(), user)

The built-in Manager can share an Array and a Value directly:

from multiprocessing import Process, Manager
import os
import time

def f(n, a, lock):
    with lock:  # acquire the synchronization lock
        print('child process', os.getpid())
        n.value += 20.5  # update the Value
        for i in range(len(a)):
            a[i] += 10
            print('child process', os.getpid(), 'running total:', a[i])
            time.sleep(1)

if __name__ == '__main__':
    manager = Manager()  # data manager shared by the processes
    arr = manager.Array('i', range(5))
    num = manager.Value('f', 1.5)

    lock = manager.Lock()
    procs = [Process(target=f, args=(num, arr, lock)) for i in range(5)]
    for proc in procs:
        proc.start()

    for proc in procs:
        proc.join()

    print('main process', os.getpid())
    print(num.value)  # read the value
    print([a for a in arr])

- Manager manages data shared between multiple processes
- with lock: acquires the lock for the duration of the block. The processes share a single lock, and another process can take it only after the with block exits

1.7 Inter-Process Communication

Reference: https://www.cnblogs.com/wangkangluo1/archive/2012/05/14/2498786.html

Two processes can communicate through a pipe or a queue

Pipe

Brief description
- A pipe, also called an unnamed pipe, is the oldest form of IPC (Inter-Process Communication) on UNIX systems
- A pipe connects the data streams of different processes
eg
- Goal: create two processes in which process A sends messages to process B and process B replies to process A
- Imports
```
import os
import time
from multiprocessing import Pipe, Process
```
- Sender
```
def send_msg(pipe):
    for i in range(3):
        if i == 2:
            msg = 'bye'
        else:
            msg = 'Hello, I am Disen'
        print('sent message <{}>: {}'.format(os.getpid(), msg))
        pipe.send(msg)
        if msg == 'bye':
            return

        time.sleep(1)
        print('reply:', pipe.recv())
```
  pipe.send(obj) sends a message (any picklable object)
- Receiver
```
def receive_msg(pipe):
    while True:
        msg = pipe.recv()
        print('received message <{}>: {}'.format(os.getpid(), msg))

        if msg == 'bye':
            print('goodbye, over')
            return

        pipe.send('your message has been received')
        time.sleep(1)
```
  pipe.recv() receives a message
- Main program
```
if __name__ == '__main__':
    pipe = Pipe()  # returns a tuple (conn1, conn2)
    p1 = Process(name='sender', target=send_msg, args=(pipe[0],))
    p2 = Process(name='receiver1', target=receive_msg, args=(pipe[1],))

    p1.start()
    p2.start()

    p1.join()
    p2.join()
    print('---over--')
```
  - multiprocessing.Pipe([duplex]) creates a pipe
  - Returns a pair (conn1, conn2) of Connection objects representing the ends of a pipe.
  - If duplex is True (the default) then the pipe is bidirectional. If duplex is False then the pipe is unidirectional: conn1 can only be used for receiving messages and conn2 can only be used for sending messages.
Queue
- Brief description
  - class multiprocessing.Queue([maxsize])
  - Returns a process shared queue implemented using a pipe and a few locks/semaphores. When a process first puts an item on the queue a feeder thread is started which transfers objects from a buffer into the pipe.
- Queue methods
  - Put: put(obj[, block[, timeout]])
    - Put obj into the queue. If the optional argument block is True (the default) and timeout is None (the default), block if necessary until a free slot is available. If timeout is a positive number, it blocks at most timeout seconds and raises the queue.Full exception if no free slot was available within that time. Otherwise (block is False), put an item on the queue if a free slot is immediately available, else raise the queue.Full exception (timeout is ignored in that case).
    - put_nowait(obj): puts an item without blocking
  - Get: get([block[, timeout]])
    - Remove and return an item from the queue. If optional args block is True (the default) and timeout is None (the default), block if necessary until an item is available. If timeout is a positive number, it blocks at most timeout seconds and raises the queue.Empty exception if no item was available within that time. Otherwise (block is False), return an item if one is immediately available, else raise the queue.Empty exception (timeout is ignored in that case).
    - get_nowait(): gets an item without blocking
  - Queue size: qsize()
  - Is it empty: empty()
  - Is it full: full()
  - Close: close()
- Example: the boss hands out 20 jobs to 5 workers
  - Imports
```
from multiprocessing import Process, Queue
import queue
import time
import os
```
  - The boss function
```
def boss(q):  # the boss hands out jobs
    for i in range(20):
        msg = 'job from the boss: %d ' % i
        q.put(msg)  # stored immediately if the queue is not full; otherwise blocks until there is room
        print(time.strftime('%x %X', time.localtime()), msg)
        time.sleep(0.5)

    print('--boss has handed out all jobs----')
```
  - The worker function
```
def worker(q):
    while True:
        try:
            msg = q.get(timeout=5)  # no message within 5 seconds means the work is done
        except queue.Empty:
            break
        print('at {} worker <{}> received: {}'.format(time.time(), os.getpid(), msg))
        time.sleep(2)
```
  - Program entry point
```
if __name__ == '__main__':
    q = Queue(maxsize=2)  # maximum number of queued messages

    workers = []
    for i in range(5):
        p = Process(target=worker, args=(q, ))
        p.start()
        workers.append(p)  # keep track of all worker processes
    boss(q)  # the boss starts handing out jobs
    for w in workers:
        w.join()  # wait for each worker to finish the remaining jobs and time out

    q.close()
    print('--work finished---over--')
```
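The `Empty` and `Full` exceptions raised by a `multiprocessing.Queue` are defined in the standard `queue` module. The sketch below shows non-blocking `put_nowait()` and a timed `get()` within a single process; the helper names `try_put` and `drain` are hypothetical.

```python
import queue
from multiprocessing import Queue


def try_put(q, item):
    """Put without blocking; report whether there was room."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


def drain(q, timeout=0.5):
    """Collect items until nothing arrives within `timeout` seconds."""
    items = []
    while True:
        try:
            items.append(q.get(timeout=timeout))
        except queue.Empty:
            return items


if __name__ == "__main__":
    q = Queue(maxsize=2)
    print([try_put(q, x) for x in "abc"])  # the third put finds the queue full
    print(drain(q))
```

Treating a timeout as "no more work" is a simple convention; putting an explicit end marker on the queue is an alternative when the producer knows it has finished.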
Sockets
Shared memory
Signals

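2. Threads

2.1 What Is a Thread

Every process contains at least one thread. A process is only an abstraction; what the CPU actually schedules are the threads inside the process
Threads do the actual work, using the resources held by their process; a thread is only a unit of scheduling and owns no resources of its own
Every process creates one thread by default when it starts; this thread is called the main thread (MainThread)
A process (task) may consist of several subtasks. If the process runs only one thread, the subtasks are effectively executed serially, i.e. the program has a single execution path
Extra
- Choosing between multithreading and multiprocessing
  - For CPU-bound applications, use multiple processes
  - For IO-bound applications, use multiple threads
- Characteristics of threads in Python (CPython)
  - At any moment only one thread in a process can be executing Python bytecode; a thread that blocks on I/O or sleeps releases the GIL, so the other threads keep running in the meantime
  - On a multi-CPU system one would normally start several threads to make the most of the cores, but Python threads cannot take advantage of multiple cores.
  - Before any Python thread runs it must acquire the GIL (Global Interpreter Lock). The interpreter then releases the GIL periodically so that other threads get a chance to run: in Python 2 after every 100 bytecode instructions, and since Python 3.2 after a switch interval of 5 ms by default. The GIL effectively puts a lock around the code of every thread, so threads in Python can only take turns; even 100 threads on a 100-core CPU use only one core
- References:
  https://blog.csdn.net/ybdesire/article/details/77842438
  http://dabeaz.com/python/UnderstandingGIL.pdf
- Within one process, threads can communicate with one another directly; communication between processes has to go through an IPC message mechanism (IPC mechanisms include queues and pipes)
- Within one process, changes to the main thread may affect the behavior of the other threads; changes to a parent process do not affect its child processes, because processes are fully isolated from one another
- Threads compared with processes
  - Creating a thread takes very little memory and time; creating a process takes much more memory and time
- What threads and processes have in common: Master-Worker Structures: Main --> Worker(Child)
- Note: exceptions must always be handled inside a thread
- Synchronous and asynchronous
  - Synchronous: when a process issues a request that takes some time to return, the process waits until the result arrives and only then continues.
  - Asynchronous: the opposite of synchronous. The process does not wait; it carries on with the following operations regardless of the state of other processes.
    When a result comes back, the system notifies the process to handle it, which improves efficiency.
Serial, parallel, and concurrent
- The role of the CPU: whether tasks run serially, in parallel, or concurrently, to the user they appear to run at the same time. Processes and threads are only tasks; the CPU does the actual work, and one (single-core) CPU can execute only one task at any given moment
- Serial: tasks are executed one after another; the next task starts only after the previous one has finished.
- Parallel: several tasks run at the same time, which requires several CPUs; the number of CPUs is the number of tasks that can execute at the same moment
- Concurrent: pseudo-parallel, i.e. tasks appear to run at the same time while a single CPU is actually switching back and forth between programs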
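The advice to use threads for IO-bound work can be illustrated with a small timing comparison. This is a sketch under assumptions: `time.sleep` stands in for a blocking I/O call (it releases the GIL while waiting), and the function names are hypothetical.

```python
import threading
import time


def fake_io(seconds, results, index):
    """Stand-in for a blocking I/O call."""
    time.sleep(seconds)
    results[index] = seconds


def run_serial(n, seconds):
    """Run n waits one after another and return the elapsed time."""
    results = [None] * n
    start = time.perf_counter()
    for i in range(n):
        fake_io(seconds, results, i)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    """Run n waits in n threads and return the elapsed time."""
    results = [None] * n
    threads = [
        threading.Thread(target=fake_io, args=(seconds, results, i))
        for i in range(n)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("serial  ", round(run_serial(4, 0.2), 2))
    print("threaded", round(run_threaded(4, 0.2), 2))
```

The threaded run finishes in roughly the time of one wait because the waits overlap. A CPU-bound function in place of `fake_io` would not see this gain under the GIL.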

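2.2 Creating Threads

Python provides two modules for working with threads
- _thread: the low-level interface, close to the underlying C thread API
- threading: the high-level interface, essentially a wrapper around _thread

Imports:

from threading import Thread
import time

Define the task function

def worker(no):
    for i in range(20):
        print(time.time(), no, 'child thread is carrying out plan', i)
        time.sleep(0.5)

Create the thread objects

ts = [Thread(target=worker, args=(i,)) for i in range(3)]

Start the threads
- How threads run: concurrently, in no fixed order
```
for t in ts:
    t.start()
```
Join the child threads
- The main thread waits for the child threads to finish before continuing
```
for t in ts:
    t.join()
```

Extra

Subclassing Thread

class WorkerThread(Thread):
    def __init__(self, name):
        super().__init__(name=name)
        self.name = name

    def time(self):
        return time.strftime('%x %X', time.localtime())

    def run(self):
        for i in range(20):
            # self.ident is the thread ID
            print(self.time(), self.name, 'child thread is carrying out plan', i)
            time.sleep(0.5)

Create and start the threads

ts = [WorkerThread(i) for i in range(3)]
for t in ts:
    t.start()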


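### 2.3 Thread Safety and Thread Locks

- Data is shared between threads

- A thread is the smallest unit of program execution. All code runs inside some thread, and by default all code runs in the main thread

- Example: depositing and withdrawing money

  ```
  from threading import Thread
  import time

  # global variable
  money = 1000

  # deposit
  def add(n):
      global money
      for i in range(6):
          try:
              if money < 5000:
                  sm = 1000
                  money += sm
                  print(n, 'thread deposited', sm, 'balance:', money)
              time.sleep(1)
          except Exception as e:
              print(n, 'error:', e)

  # withdraw
  def sub(n):
      global money
      for i in range(3):
          try:
              if money >= 500:
                  print(n, '-before withdrawal-', money)
                  sm = 1000
                  money -= sm
                  print(n, 'thread withdrew {}, balance:'.format(sm), money)
              time.sleep(2)
          except Exception as e:
              print(n, 'error:', e)

  # program entry point
  if __name__ == '__main__':
      t1 = Thread(target=add, args=(1,))
      t2 = Thread(target=sub, args=(2,))
      t3 = Thread(target=sub, args=(3,))
      t1.start()
      t2.start()
      t3.start()
  ```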

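- Inconsistent shared state with multiple threads

  - The data corruption problem (threads that are not thread-safe)

    - Corruption mainly happens where the operations of two threads interleave
    - Solution: to make a code section execute as a whole, add a thread lock, i.e. acquire the lock before the section runs and release it after the section has finished

  - Thread lock (Lock):

    - Advantage: solves the thread-safety problem, so the data is no longer corrupted
    - Disadvantage: adding locks lowers the program's execution efficiency
    - Deadlock: if a bug occurs after the lock has been acquired and the lock is never released, the threads waiting for it stay blocked. Solution: whenever a lock is acquired, make sure it is always released

  - Creating a lock

    - lock = threading.Lock()
    - lock = threading.RLock()

      - A re-entrant lock, acquired and released in the same way as Lock
      - Acquiring the same RLock repeatedly within one thread does not cause a deadlock

  - Acquiring the lock

    - lock.acquire()

  - Releasing the lock

    - lock.release()

- Making multithreaded code thread-safe

  - What is involved in solving thread safety

    - The threads
    - The operations (functions)
    - The locked region

  - Example: safe deposits and withdrawals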

    ```
    from threading import Thread, Lock
    import time

    # global variable
    money = 1000

    def add(n, l):
        global money
        for i in range(10):
            print(n, 'starting deposit {}...'.format(i))
            l.acquire()
            try:
                if money < 5000:
                    sm = 1000
                    money += sm
                    print(n, 'thread deposited', sm, 'balance:', money)
                time.sleep(0.5)
            except Exception as e:
                print(n, 'error:', e)
            finally:
                l.release()  # always release the lock, even after an error

    def sub(n, l):
        global money
        for i in range(10):
            print(n, 'starting withdrawal {}...'.format(i))
            l.acquire()
            try:
                if money >= 500:
                    print(n, '-before withdrawal-', money)
                    sm = 1000
                    money -= sm
                    print(n, 'thread withdrew {}, balance:'.format(sm), money)
                time.sleep(0.5)
            except Exception as e:
                print(n, 'error:', e)
            finally:
                l.release()  # always release the lock, even after an error

    if __name__ == '__main__':
        lock = Lock()  # create the lock object
        t1 = Thread(target=add, args=(1, lock))
        t2 = Thread(target=sub, args=(2, lock))
        t3 = Thread(target=sub, args=(3, lock))
        t1.start()
        t2.start()
        t3.start()
    ```
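The acquire/try/finally/release pattern can be written more compactly with a `with` statement, which releases the lock on exit even when an exception is raised. The following is a minimal sketch with hypothetical names (`add_many`, `run`, `nested`); it also shows that an `RLock` may be re-acquired by the thread that already holds it.

```python
import threading

counter = 0
lock = threading.Lock()


def add_many(times):
    """Increment the shared counter `times` times, one locked step at a time."""
    global counter
    for _ in range(times):
        with lock:        # acquire on entry, release on exit
            counter += 1  # read-modify-write, so it must not interleave


def run(n_threads=4, times=10_000):
    """Run several incrementing threads and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(times,)) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


def nested(rlock):
    """Re-acquire a lock the current thread already holds."""
    with rlock:
        with rlock:       # fine for RLock; a plain Lock would block here forever
            return True


if __name__ == "__main__":
    print(run())
    print(nested(threading.RLock()))
```

With the lock in place the final count always equals `n_threads * times`.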

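- ThreadLocal (variables inside a thread)

  - Scopes

    - Ordinary variables: i.e. local variables

    - Global variables: global

    - Thread-level variables: stored only within a thread and independent in each thread

  - ThreadLocal

    - Used to store variables that belong to a thread
    - Also part of threading, where it is named local
    - Structured like a dictionary, with the tid (thread id, the unique identifier of a thread) as the key

  - Working with ThreadLocal: verifying how variables are stored per thread

    Step 1: several threads

    Step 2: operate on the same object

    Step 3: use the same name to store and read values

    For example:

    ```
    # coding=utf8
    import threading
    import time
    from random import randint

    data = threading.local()  # thread-local storage


    def addNum(i):
        data.v = i  # each thread has its own independent data.v
        for i in range(5):
            data.v += randint(10, 200)
            print(threading.current_thread().name, data.v)
            time.sleep(1)
        print(threading.current_thread().name, 'final value:', data.v)


    if __name__ == '__main__':
        print("start Threads")
        try:
            for i in range(6):
                t = threading.Thread(target=addNum, args=(i,))
                t.start()

        except Exception as e:
            print("stop Threads")
    ```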

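- Condition variables (Condition)

  - cond = threading.Condition([lock]) creates a condition variable

    - An RLock is bound to it by default
    - A lock of your own can be passed in when the condition variable is created.

  - Usage

    - A condition variable is a synchronization mechanism built on state shared between threads
    - It involves two actions

      - 1. One thread suspends itself, waiting for "the condition to become true"
      - 2. Another thread makes "the condition true" and notifies the waiter

    - Used together with a mutex

      - This prevents race conditions and deadlock
      - A thread must lock the mutex (Lock) before changing the state the condition depends on
      - wait() puts the calling thread on the list of threads waiting for the condition
      - and releases the mutex; the lock is re-acquired before wait() returns

  - Common methods

    - acquire(*args): locks the condition variable
    - release(): unlocks the condition variable
    - wait([timeout]): waits to be woken up; timeout is the maximum time to wait
    - notify(n=1): wakes up at most n waiting threads
    - notify_all(): wakes up all waiting threads (notifyAll() is the older alias, deprecated since Python 3.10)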
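The wait and notify steps can be seen in a small hand-off between two threads. A minimal sketch, assuming a hypothetical function `handoff`; it uses `wait_for()`, which re-checks a predicate around `wait()` so that a notification sent before the consumer starts waiting is not lost.

```python
import threading


def handoff(value):
    """Pass one value from the calling thread to a consumer thread."""
    cond = threading.Condition()   # wraps an RLock by default
    box = []                       # shared state the condition is about
    received = []

    def consumer():
        with cond:                 # wait_for() requires the lock to be held
            # releases the lock while waiting and re-acquires it before returning
            cond.wait_for(lambda: box, timeout=2)
            received.extend(box)

    t = threading.Thread(target=consumer)
    t.start()
    with cond:                     # lock before changing the shared state
        box.append(value)
        cond.notify()              # wake one waiting thread
    t.join()
    return received


if __name__ == "__main__":
    print(handoff(42))
```

The predicate makes the code correct in both orders: if the value is already in `box` when the consumer gets the lock, `wait_for()` returns immediately.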

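- Multithreaded queue

  - In practice this is a producer-consumer queue
  - 1. The producer has a storage limit: when the warehouse is full it stops producing and waits until some stock has been sold

    - Condition: the warehouse is full, wait for consumption

  - 2. When a consumer wants to consume and there is no stock, it waits for production; if there is stock, it consumes

    - Condition: not enough stock, wait for production

  - 3. The object that stores the data

    - A queue can be used

  For example:

  - A custom synchronized multithreaded queue, the ConcurrentQueue class

    ```
    import threading
    import time
    from queue import Queue
    from threading import Condition, Lock

    class ConcurrentQueue():
        # initialization
        def __init__(self, max_size=10):
            self.max_size = max_size  # warehouse capacity
            self.lock = Lock()  # mutex
            self.cond = Condition(self.lock)
            self.q = Queue()  # data object (the warehouse)

        # get data (consume)
        def get(self):
            # acquire the condition variable, which holds the mutex
            with self.cond:
                # while the warehouse is empty the consumer cannot be served
                while self.q.empty():
                    print('warehouse is empty, please wait...')
                    self.cond.wait()  # wait on the condition variable
                # the warehouse is not empty
                obj = self.q.get()  # take the item to consume
                self.cond.notify()  # notify a waiting thread that an item was consumed
            return obj

        # store data (produce)
        def put(self, obj):
            with self.cond:
                # while the warehouse is full nothing more can be produced
                while self.q.qsize() >= self.max_size:
                    print('warehouse is full, production must wait...')
                    self.cond.wait()  # wait on the condition variable
                # the warehouse has room, production can continue
                self.q.put(obj)
                self.cond.notify()  # notify a waiting thread that an item was produced

    # producer function
    def producer(cq: ConcurrentQueue):
        n = 1
        while True:
            cq.put('steamed bun %d' % n)
            print(threading.current_thread().name, 'produced steamed bun %d' % n)
            time.sleep(0.2)
            n += 1

    # consumer function
    def consumer(cq: ConcurrentQueue):
        while True:
            bread = cq.get()  # take a steamed bun
            print(threading.current_thread().name, 'consumed', bread)
            time.sleep(0.5)

    # program entry point for testing
    if __name__ == '__main__':
        cq = ConcurrentQueue(20)
        # create the consumer threads
        cts = [threading.Thread(target=consumer, args=(cq,)) for i in range(3)]
        # create the producer threads
        pts = [threading.Thread(target=producer, args=(cq,)) for i in range(2)]
        for ct in cts:
            ct.start()
        for pt in pts:
            pt.start()
    ```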
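The standard library's `queue.Queue` already provides this blocking behavior: with `maxsize` set, `put()` blocks while the queue is full and `get()` blocks while it is empty. The sketch below is one way to run a bounded producer-consumer exchange that terminates cleanly; the `None` end marker and the function name `run` are assumptions made for this example.

```python
import queue
import threading


def run(n_items=10, n_consumers=3, capacity=4):
    """Produce n_items through a bounded queue and return what was consumed."""
    q = queue.Queue(maxsize=capacity)   # put() blocks while the queue is full
    consumed = []
    consumed_lock = threading.Lock()

    def producer():
        for n in range(n_items):
            q.put(n)
        for _ in range(n_consumers):
            q.put(None)                 # one end marker per consumer: no more work

    def consumer():
        while True:
            item = q.get()              # get() blocks while the queue is empty
            if item is None:
                return
            with consumed_lock:
                consumed.append(item)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(consumed)


if __name__ == "__main__":
    print(run())
```

Writing the queue by hand, as in the ConcurrentQueue class, is still useful for seeing how `Condition` works underneath.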


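## 3. Coroutines

### 3.1 Limitations of Threads

- During multithreaded execution the threads are scheduled preemptively and compete for the CPU, and every switch between threads costs the system some overhead
- With too many threads, scheduling becomes more complex and performance drops
- When synchronizing threads, careless use of mutexes easily leads to deadlock

### 3.2 Definition and Principle of Coroutines

- Coroutines are cooperatively scheduled within a single thread. Also called "micro-threads", they schedule functions or subroutines inside one thread
- Ordinary functions (subroutines) are called level by level: A calls B, B calls C, and the results are returned in reverse order
- Function calls are implemented with a stack, which holds each function's local variables
- A function or subroutine call has one entry point and one return, and the call order is fixed.
- A coroutine, by contrast, can be suspended partway through a function to run another subroutine, and resumed later from where it left off.

### 3.3 Advantages of Coroutines over Threads

- Coroutines run more efficiently than threads

  - Switching is controlled by the program itself and is not a thread switch
  - There is no thread-switching overhead

- Coroutines need no locking

  - Within a single thread there are no simultaneous conflicting writes to a variable
  - Shared resources are controlled without locks; checking state is enough

- Multiple processes + coroutines

  - A coroutine runs within a single thread
  - On a multi-core CPU, multiple processes make full use of the CPU
  - The efficiency of coroutines improves the program's overall performance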

### 3.4 The Three Stages of Coroutines

- yield/send

Producer

    import time

    def producer(l):  # producer
        i = 0
        while True:
            if i < 10:
                l.append(i)
                print('producer', i)
                yield i
                i = i + 1
                time.sleep(1)
            else:
                return

Consumer

    def consumer(l):
        p = producer(l)
        while True:
            try:
                next(p)
                while len(l) > 0:
                    print('consumer', l.pop())
            except StopIteration:  # the producer has finished
                break

Test

    if __name__ == "__main__":
        l = []  # data
        consumer(l)
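The producer above only uses `yield` to hand values out. The `send()` half of this stage passes a value into a suspended generator, where it becomes the result of the `yield` expression. A minimal sketch with a hypothetical running-average generator:

```python
def averager():
    """Generator-based coroutine: receives numbers, yields the running average."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average   # suspend here; send(x) resumes with value = x
        total += value
        count += 1
        average = total / count


avg = averager()
next(avg)            # run to the first yield so the generator can accept send()
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
avg.close()          # stop the generator
```

The initial `next()` call is required because a value can only be sent to a generator that is paused at a `yield`.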


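- yield from / @asyncio.coroutine

  - asyncio is a standard library module introduced in Python 3.4, with built-in support for asynchronous IO.
  - The asyncio programming model is a message loop. We obtain a reference to an EventLoop from the asyncio module and hand it the coroutines to run, which gives us asynchronous IO.
  - Python 3.4 introduced coroutines with asyncio, but they were still based on generator objects.

- await/async keywords

  - Python 3.5 added the async and await keywords, which replace asyncio.coroutine and yield from respectively (the @asyncio.coroutine decorator was later deprecated in Python 3.8 and removed in Python 3.11).
  - Python 3.5 thereby settled the syntax for coroutines.

- Note:

  - asyncio is not the only implementation of coroutines; tornado and gevent provide similar functionality.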

- Case studies

demo1

    import asyncio

    # generator-based style: runs on Python 3.4 to 3.10 (@asyncio.coroutine was removed in 3.11)
    @asyncio.coroutine
    def hello():
        print("Hello world!")
        # asynchronous call to asyncio.sleep(1):
        r = yield from asyncio.sleep(1)
        print("Hello again!")

Get the EventLoop:

    loop = asyncio.get_event_loop()

Run the coroutine

    loop.run_until_complete(hello())
    loop.close()

demo2

We wrap two coroutines as Tasks

    import threading
    import asyncio

    # generator-based style: runs on Python 3.4 to 3.10
    @asyncio.coroutine
    def hello():
        print('Hello world! (%s)' % threading.current_thread())
        yield from asyncio.sleep(1)
        print('Hello again! (%s)' % threading.current_thread())

    loop = asyncio.get_event_loop()
    tasks = [hello(), hello()]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
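For comparison, the same two-coroutine demo can be written in the third-stage syntax. This is a sketch assuming Python 3.7 or later, where `asyncio.run()` is available; the `name` and `delay` parameters are additions made for this example.

```python
import asyncio
import time


async def hello(name, delay):
    print('Hello world!', name)
    await asyncio.sleep(delay)   # suspends this coroutine; the event loop runs the others
    print('Hello again!', name)
    return name


async def main():
    # gather() runs both coroutines concurrently and returns results in argument order
    return await asyncio.gather(hello('first', 0.2), hello('second', 0.2))


if __name__ == '__main__':
    start = time.perf_counter()
    print(asyncio.run(main()))   # creates the event loop, runs main(), closes the loop
    print('elapsed', round(time.perf_counter() - start, 2))
```

Both coroutines run in the same thread, and the total time is close to one delay rather than two because the waits overlap.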


In practice: fetching web pages over HTTP

Use asyncio's asynchronous network connections to fetch the home pages of sina, sohu, and 163

async_wget.py

    import asyncio

    # generator-based style: runs on Python 3.4 to 3.10
    @asyncio.coroutine
    def wget(host):
        print('wget %s...' % host)
        connect = asyncio.open_connection(host, 80)
        reader, writer = yield from connect
        header = 'GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % host
        writer.write(header.encode('utf-8'))
        yield from writer.drain()
        while True:
            line = yield from reader.readline()
            if line == b'\r\n' or not line:  # end of headers, or connection closed
                break
            print('%s header > %s' % (host, line.decode('utf-8').rstrip()))
        # Ignore the body, close the socket
        writer.close()

    loop = asyncio.get_event_loop()
    tasks = [wget(host) for host in ['www.sina.com.cn', 'www.sohu.com', 'www.163.com']]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()


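- Summary

  - asyncio provides comprehensive support for asynchronous IO;
  - asynchronous operations are performed inside a coroutine with yield from (or await in Python 3.5 and later);
  - several coroutines can be wrapped as a group of Tasks and run concurrently.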

Posted @ 2021-03-31 07:38 by 昵称已经被使用
