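Python Crawlers: Concurrent Crawling

Preface

The truth of life: without relentless execution, the world will keep knocking you down.

Don't worry about others overtaking you, because most people cannot keep going for long.

Don't hold on to wishful expectations; drop every useless fantasy.

Be a grounded person and do each task in front of you well,

without assuming that others wish you well or ill.

⚠️ Disclaimer: the crawling techniques and code in this article are for learning, discussion, and technical research only. Do not use them for any commercial purpose or for anything that violates applicable laws and regulations. The author accepts no liability for legal consequences of improper use. Please respect the target site's robots.txt and terms of service, and help keep the web a healthy environment.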

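1. asyncio

1.1 Introduction

asyncio is the Python standard-library module for writing asynchronous programs. It lets you manage many tasks without blocking, which improves throughput and responsiveness, especially for I/O-bound work such as network requests and file reads and writes.

Core concepts

Event loop: the core of asyncio; it schedules and runs coroutines and tasks.

Coroutine: a function defined with async def that can suspend and resume its execution.

Task: a wrapper around a coroutine that schedules it on the event loop. Create one with asyncio.create_task().

Future: a placeholder for the result of an asynchronous operation.

Awaitable: any object that can be used with await, mainly coroutines, Tasks, and Futures.

Common uses

Network operations: writing high-performance network clients and servers.

Concurrent I/O: several I/O operations can be in flight at the same time without blocking the thread that runs the event loop.

Timers: asyncio.sleep() provides a non-blocking delay.

Task scheduling: create_task() and gather() manage multiple coroutines.

Differences between asyncio.gather and asyncio.wait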

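| Feature | `asyncio.gather` | `asyncio.wait` |
| --- | --- | --- |
| Arguments | Awaitables passed as positional arguments (`*aws`); unpack a list with `*tasks` | A single iterable of tasks or futures |
| Return value | A list of results, in argument order | Two sets: `(done, pending)` |
| Exceptions | By default the first exception propagates to the caller awaiting `gather`, while the other awaitables keep running; with `return_exceptions=True` exceptions are returned as results | Never raised by `wait` itself; they must be read from each finished task |
| Typical use | Simple cases where all results are needed together | Cases that handle finished and unfinished tasks separately |
| Ordering | Result order matches the order the awaitables were passed in | `done` and `pending` are unordered sets |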
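
To make the table concrete, here is a minimal sketch that runs without any network access. The `job` coroutine is a hypothetical stand-in for an I/O-bound request; it is not part of the crawler code in this article.

```python
import asyncio


async def job(name, delay, fail=False):
    """Hypothetical stand-in for an I/O-bound coroutine."""
    await asyncio.sleep(delay)
    if fail:
        raise ValueError(f"{name} failed")
    return name


async def demo_gather():
    # Results come back in argument order, not in completion order.
    return await asyncio.gather(job("slow", 0.02), job("fast", 0.01))


async def demo_wait():
    tasks = [
        asyncio.create_task(job("ok", 0.01)),
        asyncio.create_task(job("bad", 0.01, fail=True)),
    ]
    done, pending = await asyncio.wait(tasks)
    # wait() itself does not raise; each finished task has to be inspected.
    ok = sorted(t.result() for t in done if t.exception() is None)
    failed = [str(t.exception()) for t in done if t.exception() is not None]
    return ok, failed, len(pending)


if __name__ == "__main__":
    print(asyncio.run(demo_gather()))
    print(asyncio.run(demo_wait()))
```

With `gather` the caller gets a plain list back; with `wait` the caller decides what to do with each finished or failed task.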

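1.2 Crawling the Douban Top 250 with asyncio

URL: https://movie.douban.com/top250

Pagination is controlled by the start parameter:

Page 1 starts at 0

Page 2 starts at 25

Page 3 starts at 50, and so on.

Code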

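loop = asyncio.get_event_loop():

Gets the event loop, the object that schedules and runs asynchronous tasks.
asyncio.run() (available since Python 3.7) is the recommended way to start a program; calling get_event_loop() when no loop is running is deprecated in recent Python versions.

await loop.run_in_executor(None, partial(requests.get, url.format(page * 25), headers=headers)):

requests.get is a blocking function, so it cannot be awaited directly.
loop.run_in_executor runs the blocking call in a thread pool (None selects the default executor), so the event loop is not blocked.
partial(requests.get, url.format(page * 25), headers=headers) binds the arguments and produces a callable that takes no arguments; this is needed because run_in_executor does not forward keyword arguments.

tasks = [loop.create_task(get_db_movie_info(page)) for page in range(1, 3)]:

Creates 2 tasks (page = 1 and page = 2), each fetching one page of results.

loop.run_until_complete(asyncio.wait(tasks)):

asyncio.wait(tasks) waits for the tasks, which run concurrently.
run_until_complete blocks until all of them have finished.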

import asyncio
import requests
from bs4 import BeautifulSoup
from functools import partial

# {} is filled with the start offset of each page
url = 'https://movie.douban.com/top250?start={}&filter='
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
}
# Create the event loop
loop = asyncio.get_event_loop()


async def get_db_movie_info(page):
    # requests.get blocks, so run it in the default thread pool
    r = await loop.run_in_executor(None, partial(requests.get, url.format(page * 25), headers=headers))
    soup = BeautifulSoup(r.text, 'lxml')
    div_list = soup.find_all('div', attrs={'class': 'hd'})
    for div in div_list:
        print(div.get_text())

if __name__ == '__main__':
    tasks = [loop.create_task(get_db_movie_info(page)) for page in range(1, 3)]
    loop.run_until_complete(asyncio.wait(tasks))
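
The run_in_executor pattern can be tried without touching the network. The sketch below is only an illustration under assumptions: `blocking_fetch` is a hypothetical function that sleeps in place of requests.get, and the program is started with asyncio.run() rather than a module-level loop.

```python
import asyncio
import time
from functools import partial


def blocking_fetch(page, delay=0.05):
    """Hypothetical blocking call standing in for requests.get."""
    time.sleep(delay)
    return f"page-{page}"


async def fetch_all(pages):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    jobs = [loop.run_in_executor(None, partial(blocking_fetch, page)) for page in pages]
    # The blocking calls run in worker threads while the loop stays responsive.
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    print(asyncio.run(fetch_all(range(3))))
```

The blocking calls overlap in the thread pool, so the total time is close to one delay rather than the sum of all of them, as long as the pool has enough workers.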

2. aiohttp

2.1 Introduction

aiohttp is an asynchronous HTTP client and server library, well suited to asynchronous, event-driven code.

Official site: https://docs.aiohttp.org/en/stable/

Client documentation: https://docs.aiohttp.org/en/stable/client.html#aiohttp-client

2.2 Installation

Install aiohttp

pip install aiohttp

2.3 Basic usage

async def: defines an asynchronous function (a coroutine function).

async with ClientSession() as session: creates a ClientSession, the object through which HTTP requests are made, inside an async with context manager. The session is closed automatically when the block exits, so its connections are released even if a request fails.

async with session.get(url, headers=headers) as r: sends an asynchronous GET request. url is the request address and headers are the request headers. r is the response object; its body is read asynchronously.

res = await r.text(): reads the response body as text (HTML). r.text() is a coroutine, so it must be awaited.

import asyncio
from aiohttp import ClientSession

url = 'https://www.baidu.com/'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
}

async def get_baidu():
    async with ClientSession() as session:
        async with session.get(url, headers=headers) as r:
            res = await r.text()
            print(res)

if __name__ == '__main__':
    asyncio.run(get_baidu())

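2.4 Concurrent crawling

print_result: called after a task has finished. task_obj.result() returns the task's return value, which here is the downloaded page content.

asyncio.create_task: creates a task. Each task crawls one website.

add_done_callback(print_result): attaches the print_result callback to a task; it is called with the task once the task is done.

await asyncio.wait(tasks): waits for all tasks to finish. tasks is the list of all tasks.

asyncio.run(main()): starts the event loop and runs main, which creates and runs all the crawler tasks.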

import asyncio
import aiohttp

# A callback can be attached to a task; it receives the finished task
def print_result(task_obj):
    print('Downloaded content:', task_obj.result())

# Baidu spider
async def baidu_spider():
    print('Baidu spider...')
    url = 'https://www.baidu.com'
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            result = await response.text()
            return result

# JD spider
async def jd_spider():
    print('JD spider...')
    url = 'https://www.jd.com'
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            result = await response.text()
            return result

# Taobao spider
async def taobao_spider():
    print('Taobao spider...')
    url = 'https://www.taobao.com/'
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            result = await response.text()
            return result


# Run all spiders
async def main():
    task_baidu = asyncio.create_task(baidu_spider())
    task_baidu.add_done_callback(print_result)

    task_jd = asyncio.create_task(jd_spider())
    task_jd.add_done_callback(print_result)

    task_taobao = asyncio.create_task(taobao_spider())
    task_taobao.add_done_callback(print_result)

    tasks = [task_baidu, task_jd, task_taobao]
    await asyncio.wait(tasks)

if __name__ == '__main__':
    asyncio.run(main())
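
The callback mechanism can also be exercised offline. In this sketch `fake_spider` is a hypothetical coroutine that stands in for an aiohttp request, and functools.partial is used to pass extra arguments to the callback; both are assumptions made for illustration.

```python
import asyncio
from functools import partial


async def fake_spider(url):
    """Hypothetical stand-in for an aiohttp request; returns a fake body."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


def collect(results, url, task):
    # The finished task is always passed as the last positional argument.
    results[url] = task.result()


async def crawl(urls):
    results = {}
    tasks = []
    for url in urls:
        task = asyncio.create_task(fake_spider(url))
        task.add_done_callback(partial(collect, results, url))
        tasks.append(task)
    await asyncio.wait(tasks)
    return results


if __name__ == "__main__":
    print(asyncio.run(crawl(["https://example.com/a", "https://example.com/b"])))
```

Because the three spiders in the article differ only in their URL, a single coroutine parameterized by URL, as here, would also remove the duplication.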

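3. Crawling Douban movies with `aiohttp`

URL: https://movie.douban.com/top250

Pagination is controlled by the start parameter:

Page 1 starts at 0

Page 2 starts at 25

Page 3 starts at 50, and so on.

So the offset is (pageindex - 1) * pagesize: page 1 starts at 0 * 25, page 2 at 1 * 25, and page 3 at 2 * 25.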
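
The offset formula can be written down as a small helper. `page_url` and `PAGE_SIZE` are hypothetical names introduced here for illustration; the URL template is the one used throughout this article.

```python
PAGE_SIZE = 25
URL_TEMPLATE = "https://movie.douban.com/top250?start={}&filter="


def page_url(page_index, page_size=PAGE_SIZE):
    """Return the list URL for a 1-based page index."""
    if page_index < 1:
        raise ValueError("page_index is 1-based")
    # Page 1 -> start=0, page 2 -> start=25, page 3 -> start=50, ...
    return URL_TEMPLATE.format((page_index - 1) * page_size)


if __name__ == "__main__":
    for page in range(1, 4):
        print(page_url(page))
```

With 250 entries and 25 per page, pages 1 to 10 cover start=0 through start=225.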

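Code

async def get_db_movie_info(page): the coroutine that fetches one page of Douban movie data. The page parameter is used to compute the start offset, for example page 0 gives start=0 and page 1 gives start=25.

async with aiohttp.ClientSession() as session: the asynchronous HTTP client session that manages connections.

async with session.get(url.format(page * 25), headers=headers) as r: sends the GET request for the Douban page. url.format(page * 25) builds the URL for each page.

await r.text(): reads the response body asynchronously and returns the HTML text.

loop = asyncio.get_event_loop(): gets the event loop used to schedule and run the tasks. On recent Python versions asyncio.run() is the preferred entry point.

tasks = [loop.create_task(get_db_movie_info(page)) for page in range(10)]: creates 10 tasks, each fetching one page of movie data.

loop.run_until_complete(asyncio.wait(tasks)): runs the event loop until all tasks have finished. The tasks run concurrently on the single event-loop thread.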

import asyncio

import aiohttp
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250?start={}&filter='
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
}

async def get_db_movie_info(page):
    async with aiohttp.ClientSession() as session:
        async with session.get(url.format(page * 25), headers=headers) as r:
            soup = BeautifulSoup(await r.text(), "lxml")
            div_list = soup.find_all("div", class_="hd")
            for div in div_list:
                print(div.get_text())

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    tasks = [loop.create_task(get_db_movie_info(page)) for page in range(10)]
    loop.run_until_complete(asyncio.wait(tasks))

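4. aiomysql

4.1 Introduction

aiomysql is a Python library for talking to MySQL asynchronously, built on asyncio. It handles database operations efficiently and is particularly useful in highly concurrent applications.

aiomysql.create_pool() creates a connection pool and accepts the following parameters:

host: database host name (for example localhost).
port: database port (3306 by default).
user: database user name.
password: database password.
db: name of the database to use.
minsize: minimum number of connections in the pool.
maxsize: maximum number of connections in the pool.

await cursor.execute(sql, params): executes an SQL statement.

await cursor.fetchall(): fetches all rows of the result.

await cursor.fetchone(): fetches a single row.

await cursor.fetchmany(size): fetches the given number of rows.

4.2 Installation

Command

pip install aiomysql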

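4.3 Future objects

4.3.1 Introduction

Future objects

asyncio.Future is a class in the asyncio module that acts as a placeholder for a value (or an exception) that will be available later.

It gives code a way to wait for an asynchronous operation to complete without blocking the event loop.

A Future is in one of three states:

Pending: the operation has not started or is still running; the result is unknown.
Done: the operation has finished; .result() returns the result.
Cancelled: the operation was cancelled.

Lifecycle of a Future

Creation: directly with asyncio.Future() (or loop.create_future()), or indirectly through asyncio helpers such as ensure_future or create_task.
Setting the result or exception: when the asynchronous work finishes, set_result or set_exception stores the outcome in the Future.
Getting the result: await the Future; the await expression evaluates to the result.

Future states and related methods

Common methods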

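| Method | Description |
| --- | --- |
| `.done()` | Returns `True` once the Future has a result, has an exception, or was cancelled. |
| `.result()` | Returns the result. Raises `InvalidStateError` if the Future is not done yet, re-raises the stored exception if one was set, and raises `CancelledError` if the Future was cancelled. |
| `.set_result(value)` | Marks the Future as done and sets its result to `value`. |
| `.set_exception(exc)` | Marks the Future as done and sets its exception to `exc`. |
| `.cancel()` | Cancels the Future. Returns `True` on success and `False` if the Future is already done. |
| `.exception()` | Returns the stored exception, or `None` if the Future finished without one. Raises `InvalidStateError` if the Future is not done yet. |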

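State transitions

On creation: the state is Pending.
On completion: the state becomes Done, via set_result or set_exception.
On cancellation: the state becomes Cancelled, and a result or exception can no longer be set.

How Future relates to asyncio

In asyncio, Future is the foundation of Task; Task is a subclass of Future:

Task:
- asyncio.create_task or asyncio.ensure_future wraps a coroutine in a Task, which is itself a Future.
- The Task manages its own state and stores the coroutine's result or exception in itself.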
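4.3.2 Code

await asyncio.ensure_future(work()):

Wraps the coroutine work() in a Task (a kind of Future) and schedules it on the event loop right away.
await waits for work to finish and yields its return value.
res: receives the return value of work, 123.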

import asyncio

async def work():
    print(111)
    await asyncio.sleep(1)
    print(222)
    return 123

async def main():
    res = await asyncio.ensure_future(work())
    print(res)

# A Task starts running once it is created; await is only needed to wait for its result
asyncio.run(main())
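
The state transitions listed earlier can be observed with a Future that is created and completed by hand. This is a minimal sketch; `set_later` is a hypothetical helper, and in real code a Future is usually completed by a library rather than by the caller.

```python
import asyncio


async def set_later(fut, value, delay=0.01):
    """Hypothetical producer that completes the Future after a short delay."""
    await asyncio.sleep(delay)
    fut.set_result(value)  # Pending -> Done


async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    states = [fut.done()]  # still pending here
    task = asyncio.create_task(set_later(fut, 123))
    result = await fut  # suspends until set_result is called
    states.append(fut.done())
    await task
    return result, states


if __name__ == "__main__":
    print(asyncio.run(main()))
```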

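4.4 Basic usage

async with aiomysql.connect: opens an asynchronous connection to the MySQL database.

async with conn.cursor() as cursor: creates an asynchronous cursor; cursor is used to run statements against the database.

await cursor.execute: executes an SQL statement asynchronously.

asyncio.run(test_mysql()): starts a new event loop and runs the test_mysql coroutine.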

import asyncio

import aiomysql

async def test_mysql():
    async with aiomysql.connect(host='localhost', port=3306, user='root', password='root',db='py_spider') as conn:
        async with conn.cursor() as cursor:
            await cursor.execute('SELECT * from tx_work limit 10;')
            res = await cursor.fetchall()
            print(res)

asyncio.run(test_mysql())

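5. Crawling che168 (Autohome used cars) with coroutines and saving to MySQL

5.1 The page

URL: https://www.che168.com/china/list/#pvareaid=100945

Pick a car listing and click through to its detail page.

Near the bottom of the detail page there is a vehicle profile section.

We want to crawl the parameter configuration shown there.

5.2 The request URL carries a JS callback

5.2.1 Solution

First look for an API endpoint behind the vehicle details. Only if there is none, fall back to reading the page source and scraping the page.

Here we find the API endpoint for the vehicle parameter configuration:

https://cacheapigo.che168.com/CarProduct/GetParam.ashx?specid=39311&callback=configTitle

The endpoint does not return plain JSON, so the response has to be converted. There are two ways:

Use slicing to strip the leading configTitle( and the trailing )
Remove &callback=configTitle from the request URL

We use the second way here.

After removing &callback=configTitle, the response is plain JSON.

That settles the API endpoint for the vehicle parameter configuration:

https://cacheapigo.che168.com/CarProduct/GetParam.ashx?specid=39311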

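5.2.2 How it works

In this URL, the trailing &callback=configTitle is part of a JSONP (JSON with Padding) request.

JSONP is a technique for working around cross-origin restrictions that was common in earlier web development. Because of the browser's same-origin policy, Ajax requests cannot read data from another origin directly. JSONP relies on the fact that a <script> tag may load cross-origin resources: the server returns a piece of JavaScript that calls a callback function, which sidesteps the restriction.

callback=configTitle is the client telling the server to wrap the returned data in a call to the function configTitle.

The server's response then looks like this: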

configTitle({
    "key1": "value1",
    "key2": "value2"
});

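In effect the server generates a piece of JavaScript on the fly, and the JSON data is passed as the argument to configTitle.

The client defines a global function named configTitle to handle the returned data. For example:

function configTitle(data) {
    console.log(data); // logs the JSON data returned by the server
}

When the URL contains callback=configTitle, the server recognizes a JSONP request and wraps the data in the callback, for example: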

configTitle({
    "key1": "value1",
    "key2": "value2"
});

Without callback=configTitle, the server has no callback to wrap the data in, so it returns standard JSON directly:

{
    "key1": "value1",
    "key2": "value2"
}
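
For completeness, the first approach (stripping the wrapper on the client side) could look like the sketch below. `jsonp_to_dict` is a hypothetical helper written for this illustration; it assumes the callback name is a plain identifier and that the response has the `name(...)` shape shown above.

```python
import json
import re

# Matches `callbackName( ... )` with an optional trailing semicolon.
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def jsonp_to_dict(text):
    """Parse a JSONP response; plain JSON is passed through unchanged."""
    match = _JSONP.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


if __name__ == "__main__":
    wrapped = 'configTitle({"key1": "value1", "key2": "value2"});'
    print(jsonp_to_dict(wrapped))
```

Dropping the callback parameter, as done in this article, is simpler whenever the server supports it; the helper is only useful when the wrapper cannot be avoided.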

5.3 Finding the detail ID on the list page

The configuration endpoint takes a specid parameter.

Looking for this specid on the list page, we find that each car's <li> tag carries an attribute such as specid="39311".

https://cacheapigo.che168.com/CarProduct/GetParam.ashx?specid=39311

XPath for the specid:

//ul[@class='viewlist_ul']/li/@specid

5.4 List pagination

The pagination links also have to be generated dynamically.

Clicking through three pages shows how the URL changes; the page number is the digit after "csp":

# Page 1
https://www.che168.com/china/a0_0msdgscncgpi1ltocsp1exx0/
# Page 2
https://www.che168.com/china/a0_0msdgscncgpi1ltocsp2exx0/?pvareaid=102179#currengpostion
# Page 3
https://www.che168.com/china/a0_0msdgscncgpi1ltocsp3exx0/?pvareaid=102179#currengpostion

That settles the list URL:

https://www.che168.com/china/a0_0msdgscncgpi1ltocsp2exx0/?pvareaid=102179#currengpostion

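5.5 Handling the anti-crawling measure that changes the page encoding

The page encoding on che168 can change from request to request. The third-party package chardet is used to detect the encoding at run time. When the page is served as UTF-8-SIG, it contains no specid data.

Install the package

pip install chardet

In my own runs the page encoding was always GB2312, but the issue is recorded here anyway.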

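5.6 Deduplication

Deduplication is based on a Redis set and MD5 hashes:

get_md5: converts the dict dict_item to a string and computes its MD5 hash with hashlib.md5.
The result of str(dict_item) depends on the order of the keys, so the key order must be consistent (json.dumps with sort_keys=True guarantees this).
self.redis_client.sadd adds an element, the MD5 value of the current item, to the Redis set car:filter.
A Redis set rejects duplicates by design:
- If md5_val is already in the set, sadd returns 0.
- If md5_val is not in the set, sadd adds it and returns 1.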
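
A minimal sketch of the fingerprinting idea, using json.dumps with sort_keys=True as suggested above. The `SeenFilter` class is a hypothetical in-memory stand-in for the Redis set so that the example runs without a Redis server; its add() method only mimics the 1/0 return value of SADD.

```python
import hashlib
import json


def get_md5(item):
    """Fingerprint a dict independently of its key order."""
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


class SeenFilter:
    """In-memory stand-in for the Redis set (assumption for this sketch)."""

    def __init__(self):
        self._seen = set()

    def add(self, fingerprint):
        # Mimics SADD: 1 if the value was newly added, 0 if it already existed.
        if fingerprint in self._seen:
            return 0
        self._seen.add(fingerprint)
        return 1


if __name__ == "__main__":
    seen = SeenFilter()
    a = {"name": "x", "price": "1"}
    b = {"price": "1", "name": "x"}  # same data, different key order
    print(seen.add(get_md5(a)), seen.add(get_md5(b)))
```

Hashing `str(item)` would treat `a` and `b` as different items, because their string forms differ.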

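5.7 Code

The connection pool is created in main with async with aiomysql.create_pool(user='root', password='root', db='py_spider') as pool, and the HTTP session is created once with async with aiohttp.ClientSession() as session:

Creating the pool and the session once, near the program's entry point, keeps resource management in one place. Every coroutine shares the same pool and session, so nothing is created repeatedly.
async with guarantees that the pool and the session are closed and cleaned up when their block exits, so they do not have to be closed by hand and cannot leak.
Because they are created only once and passed down as arguments, the rest of the code stays simple.

Some listings are the same car model, so their configuration data can be identical. This is why duplicates appear and why the deduplication step is needed.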

import asyncio
import hashlib

import aiohttp
import aiomysql
import chardet
import redis
from lxml import etree


class CarSpieder:
    # Redis client used for deduplication
    redis_client = redis.Redis()

    # Initialization
    def __init__(self):
        self.list_url = 'https://www.che168.com/china/a0_0msdgscncgpi1ltocsp{}exx0/?pvareaid=102179#currengpostion'
        self.api_url = 'https://cacheapigo.che168.com/CarProduct/GetParam.ashx?specid={}'
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
            "Cookie": "listuserarea=0; fvlid=1737711125769tx3zx3jGeGTF; sessionid=856ad455-d4d4-42d8-9c78-800a6885691a; sessionip=222.67.155.44; area=310112; sessionvisit=b7ebb848-0105-4ae6-a409-fee5ff1c9912; sessionvisitInfo=856ad455-d4d4-42d8-9c78-800a6885691a||102179; che_sessionid=68EA4161-43FF-4F01-AB53-5954A39681F2%7C%7C2025-01-24+17%3A32%3A06.509%7C%7C0; che_sessionvid=4C6A3D37-16B7-45C2-9741-A0BEA3CF5CA8; userarea=510100; showNum=10; ahpvno=14; ahuuid=AF13A51E-EEE9-4D42-8867-36F067A2ED2C; v_no=10; visit_info_ad=68EA4161-43FF-4F01-AB53-5954A39681F2||4C6A3D37-16B7-45C2-9741-A0BEA3CF5CA8||-1||-1||10; che_ref=0%7C0%7C0%7C0%7C2025-01-24+17%3A37%3A52.694%7C2025-01-24+17%3A32%3A06.509; sessionuid=856ad455-d4d4-42d8-9c78-800a6885691a"
        }

    # Collect the spec IDs from one list page
    async def get_car_id(self, page, session, pool):
        async with session.get(self.list_url.format(page), headers=self.headers) as r:
            content = await r.read()
            # When the site detects frequent requests it serves the page in a UTF-8
            # encoding, and that page carries no spec IDs
            encoding = chardet.detect(content)['encoding']
            print(encoding)
            if encoding == 'GB2312' or encoding == 'ISO-8859-1':
                res = content.decode('gbk')
            else:
                # chardet may return None; fall back to UTF-8 in that case
                res = content.decode(encoding or 'utf-8', errors='replace')
                print("Blocked by anti-crawling...")
            tree = etree.HTML(res)
            id_list = tree.xpath("//ul[@class='viewlist_ul']/li/@specid")
            if id_list:
                print(id_list)
                tasks = [asyncio.create_task(self.get_car_info(spec_id, session, pool)) for spec_id in id_list]
                await asyncio.wait(tasks)

    # Request the parameter data for one spec ID
    async def get_car_info(self, spec_id, session, pool):
        async with session.get(self.api_url.format(spec_id), headers=self.headers) as r:
            res = await r.json()
            if res['result'].get('paramtypeitems'):
                item = dict()
                item['name'] = res['result']['paramtypeitems'][0]['paramitems'][0]['value']
                item['price'] = res['result']['paramtypeitems'][0]['paramitems'][1]['value']
                item['brand'] = res['result']['paramtypeitems'][0]['paramitems'][2]['value']
                item['length'] = res['result']['paramtypeitems'][1]['paramitems'][0]['value']
                item['width'] = res['result']['paramtypeitems'][1]['paramitems'][1]['value']
                item['height'] = res['result']['paramtypeitems'][1]['paramitems'][2]['value']
                item['enginemodel'] = res['result']['paramtypeitems'][2]['paramitems'][0]['value']
                item['displacementml'] = res['result']['paramtypeitems'][2]['paramitems'][1]['value']
                item['displacementl'] = res['result']['paramtypeitems'][2]['paramitems'][2]['value']
                # Await the save so it completes before the connection pool is closed
                await self.save_car_info(item, pool)
            else:
                print("No data...")
    # Save one item
    async def save_car_info(self, item, pool):
        async with pool.acquire() as conn:
            async with conn.cursor() as cursor:
                md5_val = self.get_md5(item)
                # sadd returns 1 for a new fingerprint and 0 for a duplicate
                redis_res = self.redis_client.sadd('car:filter', md5_val)
                if redis_res:
                    sql = """
                        INSERT INTO car_info (
                            id, name, price, brand, length, width, height, enginemodel, displacementml, displacementl
                        ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                        """
                    try:
                        await cursor.execute(sql, (
                            0, item['name'], item['price'], item['brand'], item['length'], item['width'],
                            item['height'], item['enginemodel'], item['displacementml'], item['displacementl']))
                        await conn.commit()
                        print('Insert succeeded...')
                    except Exception as e:
                        print('Insert failed:', e)
                        await conn.rollback()
                else:
                    print("Duplicate data")

    # Create the table if it does not exist
    async def create_car_table(self, pool):
        async with pool.acquire() as conn:
            async with conn.cursor() as cursor:
                create_table_sql = """
                            create table car_info(
                                   id int primary key auto_increment,
                                   name varchar(100),
                                   price varchar(100),
                                   brand varchar(100),
                                   length varchar(100),
                                   width varchar(100),
                                   height varchar(100),
                                   enginemodel varchar(100),
                                   displacementml varchar(100),
                                   displacementl varchar(100)
                            )
                            """
                check_table_sql = "show tables like 'car_info'"
                res = await cursor.execute(check_table_sql)
                if not res:
                    await cursor.execute(create_table_sql)

    # Create the HTTP session and one task per list page
    async def create_asyncio_task(self, pool):
        async with aiohttp.ClientSession() as session:
            tasks = [asyncio.create_task(self.get_car_id(page, session, pool)) for page in range(1, 11)]
            await asyncio.wait(tasks)

    # Compute the MD5 fingerprint used for deduplication
    @staticmethod
    def get_md5(dict_item):
        md5 = hashlib.md5()
        md5.update(str(dict_item).encode('utf-8'))
        return md5.hexdigest()

    # Entry point
    async def main(self):
        async with aiomysql.create_pool(user='root', password='root', db='py_spider') as pool:
            await self.create_car_table(pool)
            await self.create_asyncio_task(pool)


if __name__ == '__main__':
    spider = CarSpieder()
    asyncio.run(spider.main())

Redis

MySQL

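6. Multithreading (threading.Thread)

threading.Thread is Python's thread class, used to create and manage threads. Threads are lightweight units of concurrent execution that run inside a single process. threading.Thread wraps the low-level details so that threads are easy to create and manage.

threading.Thread is the core class for multithreaded programming in Python. Its interface supports passing arguments, daemon threads, and more. When using threads, pay attention to thread lifecycle, contention for shared resources, and the effect of the GIL on performance.

Thread safety
Multiple threads can conflict over shared resources; protect them with a lock (threading.Lock).

GIL limitation
Python's global interpreter lock (GIL) allows only one thread at a time to execute Python bytecode. This limits CPU-bound threads; I/O-bound threads such as crawlers still benefit, because the GIL is released while a thread waits on I/O.

Use daemon threads with care
Daemon threads are terminated abruptly when the main thread exits, so they are not suitable for work that must run to completion.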
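
A minimal sketch of protecting a shared counter with threading.Lock. The function name and the numbers are arbitrary choices made for this illustration.

```python
import threading


def count_with_lock(n_threads=4, increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Only one thread at a time may run the read-modify-write step.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


if __name__ == "__main__":
    print(count_with_lock())
```

Without the lock, `counter += 1` is a read-modify-write sequence that two threads can interleave, which may lose updates.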

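6.1 Creating and running a thread

A thread is created by instantiating threading.Thread and starts running when start() is called. What it runs is the function given by the target parameter.

Basic structure

import threading
def my_function():
    print(123)
# Create the thread
thread = threading.Thread(target=my_function)
# Start the thread
thread.start()
# Wait for the thread to finish
thread.join()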

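threading.Thread parameters in detail

The following parameters can be passed when creating a thread object:

threading.Thread(
    group=None,  # reserved, normally None
    target=None, # the function the thread runs
    name=None,   # thread name, "Thread-N" by default
    args=(),     # positional arguments for target, normally a tuple
    kwargs={},   # keyword arguments for target
    daemon=None  # whether the thread is a daemon thread
)

Parameter notes

group: reserved; it currently has no effect and should be None.
target: the function called when the thread runs. If it is not set, the thread does nothing.
name: the thread's name, "Thread-N" by default (N is an increasing number). A custom name helps with debugging and logging.
args: positional arguments for target, normally a tuple. A single argument still needs the trailing comma, for example args=(arg1,).
kwargs: keyword arguments for target, given as a dict.
daemon: whether the thread is a daemon thread.
- True: the program does not wait for this thread when the main thread exits.
- False: the program waits for this thread to finish before exiting.
- None (the default): the value is inherited from the creating thread, which means False for threads created from the main thread.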

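6.2 Common thread methods

Instance methods

start(): starts the thread, which then calls the target function in the new thread.

thread.start()

join(timeout=None): waits for the thread to finish. The calling thread blocks until the target thread finishes or the timeout expires.

thread.join()  # wait for the thread to finish

is_alive(): checks whether the thread is still running.

if thread.is_alive():
    print("Thread is running.")

run(): the code the thread executes, normally the target function. Calling run() directly does not start a new thread; it runs in the current thread.

thread.run()  # calling this directly is not recommended

name: the thread's name, readable and writable as an attribute. The older setName(name) and getName() methods are deprecated since Python 3.10.

thread.name = "MyThread"
print(thread.name)  # prints: MyThread

Module-level functions

current_thread(): returns the thread object for the calling thread.

current = threading.current_thread()
print(current.name)  # prints the current thread's name

active_count(): returns the number of threads currently alive.

print(threading.active_count())

enumerate(): returns a list of all threads currently alive.

print(threading.enumerate())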

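6.3 Daemon threads

Daemon threads are terminated automatically when the main thread ends. They suit background work such as logging.

thread = threading.Thread(target=my_function, daemon=True)
thread.start()

Characteristics of daemon threads:

The program does not wait for daemon threads when the main thread exits.
If a task must run to completion, do not use a daemon thread for it.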

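7. Thread pools (ThreadPoolExecutor)

ThreadPoolExecutor is a class in the concurrent.futures module that creates a thread pool, a managed group of threads that execute submitted tasks. A pool lets you cap the number of threads and manage many concurrent tasks without creating and tracking threads by hand.

ThreadPoolExecutor is mainly used to run tasks asynchronously. Submitted tasks are queued and executed by the pool's worker threads until all of them are done.

7.1 Core parameters

max_workers: the maximum number of worker threads in the pool.

The default is None. This does not mean unlimited: since Python 3.8 the pool then uses a bounded default derived from the CPU count, min(32, os.cpu_count() + 4).
If max_workers is given, at most that many tasks run at the same time; the remaining tasks wait in the pool's queue.

timeout: the ThreadPoolExecutor constructor has no timeout parameter. Timeouts are given where results are awaited, for example Future.result(timeout=...) or executor.map(..., timeout=...).

thread_name_prefix: optional prefix for the names of the worker threads. Defaults to an empty string.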

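7.2 Main methods

submit(fn, *args, **kwargs): submits a task (a function call) and returns a Future object. The Future can be used to check whether the task is done, get its return value, or cancel it. submit() does not block the calling thread.

future = executor.submit(some_function, arg1, arg2)

map(func, *iterables, timeout=None, chunksize=1):

map() takes a function and one or more iterables, like the built-in map(), but the calls are executed concurrently by the pool. Results are yielded in the order of the inputs.

timeout: the maximum time to wait for results when iterating over them.
chunksize: only used by ProcessPoolExecutor; it has no effect with ThreadPoolExecutor.

results = executor.map(some_function, iterable)

shutdown(wait=True): shuts the pool down and releases its resources. With wait=True the call blocks until all pending tasks have finished and all threads have exited. With wait=False, shutdown() returns immediately and the pending tasks keep running in the background.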

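7.3 Future objects

A Future is a placeholder for the outcome of an asynchronous operation. Every call to submit() returns a Future. It represents the task's result and provides methods to check the task's state, get its result, or cancel it.

Common Future methods:

result(timeout=None): returns the task's result. If the task has not finished, it blocks the calling thread until it does. With a timeout, TimeoutError is raised once the timeout expires.
done(): reports whether the task has finished.
cancel(): cancels the task. Returns True if the task had not started yet, otherwise False.
exception(timeout=None): returns the exception raised by the task, or None if it raised nothing. timeout limits how long to wait.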
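
The methods above can be combined as in the following sketch. `square` is a hypothetical task that sleeps briefly to imitate I/O; the pool size of 3 is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(n):
    """Hypothetical task; the sleep imitates waiting on I/O."""
    time.sleep(0.01)
    return n * n


def run_with_submit(numbers):
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Submit everything first so the tasks can overlap.
        futures = {pool.submit(square, n): n for n in numbers}
        # as_completed yields each Future as soon as it finishes.
        return {futures[f]: f.result() for f in as_completed(futures)}


def run_with_map(numbers):
    with ThreadPoolExecutor(max_workers=3) as pool:
        # map() yields results in input order.
        return list(pool.map(square, numbers))


if __name__ == "__main__":
    print(run_with_submit([1, 2, 3]))
    print(run_with_map([1, 2, 3]))
```

submit() with as_completed() suits cases where each result should be handled as soon as it is ready; map() is the shorter form when input order matters.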

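8. Processes (Process)

Process is the class in the multiprocessing module for creating and managing child processes. It lets you run several processes at once, each executing its own task in parallel. On multi-core machines this can speed up CPU-bound programs considerably.

Basic usage

The Process class creates a child process that runs the worker function. After start() the child process begins running, and the main process waits for it to finish by calling join().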

from multiprocessing import Process
import time

def worker():
    print("Worker process started.")
    time.sleep(2)
    print("Worker process finished.")

if __name__ == '__main__':
    p = Process(target=worker)  # create a new process that runs worker
    p.start()  # start the process
    p.join()  # wait for the process to finish
    print("Main process finished.")

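Key attributes and methods

target: the function (or method) the process runs after it starts.

args: arguments for the target function, normally a tuple (a single argument still needs the trailing comma, for example (arg,)).

start(): starts the process, which then runs the target function.

join(): blocks the calling process until the child process has finished. Use it to make sure the child is done before the main process continues.

daemon: whether the process is a daemon process. With daemon=True, the process is terminated when the main process exits.

is_alive(): checks whether the process is still running.

exitcode: the process's exit code.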

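9. Crawling the Douban Top 250 with multiple threads

URL: https://movie.douban.com/top250?start=0&filter=

Code

thread_obj_list = [threading.Thread(target=get_db_movie_info, args=(page,)) for page in range(1, 11)]: creates a list of 10 thread objects. Each thread runs get_db_movie_info(page) independently, with page taking the values 1 to 10.

threading.Thread: the thread class from Python's threading module, used to create and manage threads.
target=get_db_movie_info: the thread's target function, the function that runs when the thread starts. Here it is get_db_movie_info, which crawls the movie information.
args=(page,):
- Supplies the arguments for the target function. Each thread receives one argument, page, taken from range(1, 11).
- args must be an iterable such as a tuple or a list. A single value needs a one-element tuple, written with a trailing comma: args=(page,)

for thread in thread_obj_list:thread.start():

for thread in thread_obj_list: iterates over the list of thread objects.
thread.start(): starts the thread. It becomes ready to run and waits to be scheduled; once scheduled, it executes the target function get_db_movie_info.

List comprehension:

The code below uses a list comprehension, a compact form of an ordinary loop. It is syntactic sugar that makes the code shorter without changing what it does.

 thread_obj_list = [threading.Thread(target=get_db_movie_info, args=(page,)) for page in range(1, 11)]

It is equivalent to:

thread_obj_list = []
for page in range(1, 11):
    thread = threading.Thread(target=get_db_movie_info, args=(page,))
    thread_obj_list.append(thread)

Concurrent execution of the threads:

The 10 threads are started almost simultaneously and crawl different pages.
The order in which they run is not deterministic; it depends on the operating system's thread scheduler.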

import threading
import requests
from lxml import etree

url = 'https://movie.douban.com/top250?start={}&filter='
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
}

def get_db_movie_info(page):
    # page is 1-based, so page 1 maps to start=0
    r = requests.get(url.format((page - 1) * 25), headers=headers).text
    tree = etree.HTML(r)
    res = tree.xpath("//div[@class='hd']/a/span[1]/text()")
    print(res)

if __name__ == '__main__':
    thread_obj_list = [threading.Thread(target=get_db_movie_info, args=(page,)) for page in range(1, 11)]
    for thread in thread_obj_list:
        thread.start()
    # Wait for every thread to finish
    for thread in thread_obj_list:
        thread.join()

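10. Crawling the Douban Top 250 with a thread pool

with ThreadPoolExecutor(max_workers=5) as thread_pool:

with: uses the pool as a context manager, so its resources are released automatically when the block ends. There is no need to call thread_pool.shutdown() by hand.
ThreadPoolExecutor(max_workers=5): creates a pool that runs at most 5 threads at a time. Capping the number of threads avoids the slowdown caused by too many threads competing for resources.

thread_pool.submit(get_db_movie_info, page) and result():

thread_pool.submit(get_db_movie_info, page): submits a task to the pool.
- get_db_movie_info: the target function to run.
- page: the argument passed to the target function.
result(): blocks the main thread until the task has finished, then returns its result.
- It is a method of the Future returned by submit().
- If the task succeeds, it returns the result (None here, because the target function returns nothing).
- If the task raised an exception, result() re-raises it.
- Calling result() right after each submit() waits for that task before the next one is submitted, so the pages would be fetched one at a time. For real concurrency, submit every task first and collect the results afterwards.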

from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import etree

url = 'https://movie.douban.com/top250?start={}&filter='
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
}

def get_db_movie_info(page):
    # page is 1-based, so page 1 maps to start=0
    r = requests.get(url.format((page - 1) * 25), headers=headers)
    if r.status_code == 200:
        tree = etree.HTML(r.text)
        res = tree.xpath("//div[@class='hd']/a/span[1]/text()")
        print(res)
    else:
        print("Request failed, status code:", r.status_code)

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=5) as thread_pool:
        # Submit every page first so the tasks can run concurrently
        futures = [thread_pool.submit(get_db_movie_info, page) for page in range(1, 11)]
        # result() blocks until the task is done and re-raises its exception, if any
        for future in futures:
            future.result()

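11. Crawling Tencent job postings with multiple processes

URL: https://careers.tencent.com/zh-cn/search.html?keyword=python&query=at_1

Code

queue.put(work_info_dict): puts the scraped job posting on the queue so it can be saved later.

queue.get(): takes scraped data off the queue. By default queue.get() blocks until an item is available.

collection.insert_one(dict_data): inserts the scraped data into MongoDB.

queue.task_done(): marks one queued item as processed and decrements the queue's unfinished-task counter; join() returns once the counter reaches 0.

dict_data_queue = Queue(): creates the queue that holds the scraped data (a JoinableQueue).

for page in range(1, 6): creates several processes that crawl different pages concurrently.

Process(target=get_tx_work_info, args=(page, dict_data_queue)): each process requests and crawls one page.

process_save_tx_work_info = Process(target=save_tx_work_info, args=(dict_data_queue,)): creates a separate process that saves the data.

process.daemon = True: makes the process a daemon, so it is killed automatically when the main program ends.

process.start(): starts the process.

time.sleep(10): pauses for 10 seconds to give the crawler processes time to start and put their data on the queue before join() is called. The delay is a fixed guess; joining the crawler processes first would be more reliable.

dict_data_queue.join(): blocks the main process until every item on the queue has been processed (that is, all data has been saved to MongoDB).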

import time
from multiprocessing import Process, JoinableQueue as Queue

import jsonpath
import pymongo
import requests

url = 'https://careers.tencent.com/tencentcareer/api/post/Query'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
}

def get_tx_work_info(page, queue):
    # timestamp=1737790967549&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=1&keyword=python&pageIndex=1&pageSize=10&language=zh-cn&area=cn
    params = {
        'timestamp': 1737790967549,
        'countryId': '',
        'cityId': '',
        'bgIds': '',
        'productId': '',
        'categoryId': '',
        'parentCategoryId': '',
        'attrId': '1',
        'keyword': 'python',
        'pageIndex': page,
        'pageSize': 10,
        'language': 'zh-cn',
        'area': 'cn'
    }
    r = requests.get(url, headers=headers, params=params).json()
    print(r)
    for info in r['Data']['Posts']:
        work_info_dict = dict()
        work_info_dict['PostId'] = jsonpath.jsonpath(info, '$..PostId')[0]
        work_info_dict['RecruitPostName'] = jsonpath.jsonpath(info, '$..RecruitPostName')[0]
        work_info_dict['CountryName'] = jsonpath.jsonpath(info, '$..CountryName')[0]
        work_info_dict['LocationName'] = jsonpath.jsonpath(info, '$..LocationName')[0]
        work_info_dict['CategoryName'] = jsonpath.jsonpath(info, '$..CategoryName')[0]
        work_info_dict['Responsibility'] = jsonpath.jsonpath(info, '$..Responsibility')[0]
        work_info_dict['LastUpdateTime'] = jsonpath.jsonpath(info, '$..LastUpdateTime')[0]
        print(work_info_dict)
        queue.put(work_info_dict)

def save_tx_work_info(queue):
    mongo_client = pymongo.MongoClient('localhost', 27017)
    collection = mongo_client['py_spider']['tx_work']
    while True:
        dict_data = queue.get()
        print(dict_data)
        collection.insert_one(dict_data)
        # Decrement the queue's unfinished-task counter; when it reaches 0, join() stops blocking and the main program can exit
        queue.task_done()

if __name__ == '__main__':
    # multiprocessing requires the entry-point guard
    dict_data_queue = Queue()
    # Build the list of process objects
    process_list = list()
    for page in range(1, 6):
        process_get_tx_work_info = Process(target=get_tx_work_info, args=(page, dict_data_queue))
        process_list.append(process_get_tx_work_info)
    process_save_tx_work_info = Process(target=save_tx_work_info, args=(dict_data_queue,))
    process_list.append(process_save_tx_work_info)
    for process in process_list:
        process.daemon = True
        process.start()
    # Processes take time to start; give the crawlers time to fill the queue
    time.sleep(10)
    # Blocks the main program until the queue's unfinished-task counter reaches 0
    dict_data_queue.join()
    print("Main program exiting")

MongoDB

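12. Crawling iQIYI movie data with multiple threads

URL: https://list.iqiyi.com/www/1/-------------24-1-1-iqiyi--.html?s_source=PCW_SC

Searching the network requests reveals the API endpoint for the movie list.

Code

get_aiqiy_url: builds the paginated API request URLs and puts each one on url_queue. The loop covers pages 1 to 5 (5 URLs).

get_aiqiy_api_json: takes a URL from url_queue, sends the HTTP request with requests.get(), and puts the JSON response on json_queue. It runs in an endless loop, repeatedly taking URLs and requesting data.

parse_aiqiyi_json: takes JSON data from json_queue, extracts the movie information, and puts each movie on dict_data_queue as a dict. The fields of each movie are tvId, description, name, playUrl, imageUrl, categories, and albumName.

save_aiqiyi_dict_data: takes movie dicts from dict_data_queue and inserts them into the aiqiyi_thread collection in MongoDB, printing a message after each successful insert.

main:

Creates a list of threads, starting with the thread that generates the URLs (get_aiqiy_url).
Starts 3 threads that send the API requests and fetch the data (get_aiqiy_api_json).
Starts 1 thread that parses the data into dicts (parse_aiqiyi_json).
Starts 1 thread that saves the parsed data to the database (save_aiqiyi_dict_data).
After all threads have started, calls join() to block the main thread until all work is done.

Queue provides thread-safe data exchange between the threads:

url_queue: holds the API request URLs.
json_queue: holds the JSON responses.
dict_data_queue: holds the movie dicts parsed from the JSON data.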
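
The queue hand-off described above can be reduced to a sketch that runs without network or database access. Everything in it is a simplification made for illustration: `run_pipeline` is a hypothetical function, the fetch stage fabricates a string instead of calling the API, and a plain list stands in for MongoDB.

```python
import threading
from queue import Queue


def run_pipeline(pages):
    """Two-stage producer/consumer pipeline built on queue.Queue."""
    url_queue, item_queue = Queue(), Queue()
    saved = []  # stand-in for the database

    def fetch():
        while True:
            page = url_queue.get()
            item_queue.put(f"item-{page}")  # stand-in for request + parse
            url_queue.task_done()

    def save():
        while True:
            saved.append(item_queue.get())
            item_queue.task_done()

    # Daemon threads loop forever and are discarded when the program exits.
    for target in (fetch, fetch, save):
        threading.Thread(target=target, daemon=True).start()

    for page in pages:
        url_queue.put(page)

    # join() returns once task_done() has been called for every queued item.
    url_queue.join()
    item_queue.join()
    return sorted(saved)


if __name__ == "__main__":
    print(run_pipeline(range(3)))
```

Each item is put on the next queue before task_done() is called on the previous one, so joining the queues in pipeline order guarantees that nothing is still in flight when the function returns.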

import threading
import time
from queue import Queue

import pymongo
import requests


class AiQiYiSpider:
    def __init__(self):
        self.mongo_client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.mongo_client['py_spider']['aiqiyi_thread']
        self.api_url = 'https://pcw-api.iqiyi.com/search/recommend/list?channel_id=1&data_type=1&mode=24&page_id={}&ret_num=48&session=c2dd9840c5e21ce02123870e34190f6a'
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
            "Cookie": "QC005=09b0db975e18c39347afa9548d23589e; QP0037=0; curDeviceState=width%3D1536%3BconduitId%3D%3Bscale%3D125%3Bbrightness%3Ddark%3BisLowPerformPC%3D0%3Bos%3Dbrowser%3Bosv%3D10.0.19044; QC234=10415f3b1029f0cfc246372948186dba; QC235=e9b7a35840314b77859933558ae357fe; QC006=bb2f57868321ec99d24974016b9204a6; T00404=e8282161d4f6944ef9512453f8a58281; QP007=100800; IMS=IggQABj_zsm8BiomCiAxMWZkOGMyM2U4OTE0OWEyYTY0NWU3MzA1ZTc0Njc5YxAAIgByJAogMTFmZDhjMjNlODkxNDlhMmE2NDVlNzMwNWU3NDY3OWMQAIIBBCICEA6KASQKIgogMTFmZDhjMjNlODkxNDlhMmE2NDVlNzMwNWU3NDY3OWM; QC173=0; QC175=%7B%22upd%22%3Atrue%2C%22ct%22%3A%22%22%7D; QC189=8883_A%2C8185_A%2C10274_A%2C8739_B%2C9419_A%2C9379_B%2C9922_B%2C10276_A%2C8004_B%2C5257_B%2C9776_A%2C8873_E%2C10123_A%2C7423_C%2C9082_A%2C8401_A%2C6249_C%2C7996_B%2C9576_B%2C10358_A%2C9365_B%2C5465_B%2C6843_B%2C10096_A%2C6578_B%2C6312_B%2C6091_B%2C8690_C%2C8737_D%2C8742_A%2C9484_B%2C10193_B%2C6752_C%2C10426_B%2C10188_A%2C8971_B%2C7332_B%2C9683_B%2C8665_D%2C6237_A%2C9569_A%2C8983_C%2C7024_C%2C5592_B%2C9117_A%2C6031_B%2C7581_A%2C9506_A%2C9517_A%2C10216_B%2C9394_B%2C8542_B%2C6050_B%2C9167_B%2C9469_B%2C8812_B%2C6832_C%2C7074_C%2C7682_C%2C8867_B%2C5924_D%2C6151_C%2C5468_B%2C10447_A%2C6704_C%2C8808_B%2C8497_B%2C8342_B%2C8871_C%2C9790_A%2C9355_B%2C10389_A%2C8760_B%2C9292_B%2C6629_B%2C5670_B%2C9158_A%2C9805_B%2C9959_C%2C6082_B%2C5335_B; QC198=6c5db1f76d3ba2efa4b6f17656624fb3; TQC030=1; QC199=48aedbde07527cc6cde36993eb35e1a3; __dfp=a1590e70f186b7435599cecfdc3fbd4b2b543eabcc5cdf77da732dba65c4f7d252@1738925126490@1737629127490; QC008=1737629414.1737629414.1737796620.2; QC007=DIRECT; QC191=; QC010=78515861; nu=0; QC186=false",
            "Referer": "https://list.iqiyi.com/www/1/-------------24-1-1-iqiyi--.html?s_source=PCW_SC"
        }
        # Queue of page URLs
        self.url_queue = Queue()
        # Queue of raw JSON responses
        self.json_queue = Queue()
        # Queue of parsed item dicts
        self.dict_data_queue = Queue()

    # Build the paginated request URLs and put them on the queue
    def get_aiqiy_url(self):
        for page in range(1, 6):
            # time.sleep(1)
            self.url_queue.put(self.api_url.format(page))

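    # Request the JSON data for each URL and put it on the queue
    def get_aiqiy_api_json(self):
        while True:
            url = self.url_queue.get()
            try:
                r = requests.get(url, headers=self.headers, timeout=10)
                print("Request URL:", url)
                print("Status code:", r.status_code)
                self.json_queue.put(r.json())
            except (requests.RequestException, ValueError) as e:
                # Log and skip the page so that one failure does not kill the worker
                print("Request failed:", url, e)
            finally:
                # Always mark the task done, otherwise url_queue.join() never returns
                self.url_queue.task_done()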

    # Parse the JSON of each page and put one dict per movie on the queue
    def parse_aiqiyi_json(self):
        while True:
            json_data = self.json_queue.get()
            movies = json_data['data']['list']
            for movie in movies:
                item = dict()
                item['tvId'] = movie['tvId']
                item['description'] = movie['description']
                item['name'] = movie['name']
                item['playUrl'] = movie['playUrl']
                item['imageUrl'] = movie['imageUrl']
                item['categories'] = ','.join(movie['categories'])
                item['albumName'] = movie['albumName']
                self.dict_data_queue.put(item)
            self.json_queue.task_done()

    # Take items off dict_data_queue and insert them into MongoDB
    def save_aiqiyi_dict_data(self):
        while True:
            item = self.dict_data_queue.get()
            self.collection.insert_one(item)
            print("Inserted:", item)
            self.dict_data_queue.task_done()

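    def main(self):
        # List that holds every worker thread
        thread_list = list()
        # Thread that generates the page URLs
        thread_url = threading.Thread(target=self.get_aiqiy_url)
        thread_list.append(thread_url)

        # Three threads that request the JSON data
        for _ in range(3):
            thread_get_api_json = threading.Thread(target=self.get_aiqiy_api_json)
            thread_list.append(thread_get_api_json)
        # One thread that parses the JSON data
        thread_parse_json = threading.Thread(target=self.parse_aiqiyi_json)
        thread_list.append(thread_parse_json)
        # One thread that saves the parsed data
        thread_save_dict_data = threading.Thread(target=self.save_aiqiyi_dict_data)
        thread_list.append(thread_save_dict_data)

        # Mark every thread as a daemon so that the infinite worker loops
        # do not keep the program alive once the main thread exits.
        for thread in thread_list:
            thread.daemon = True
            thread.start()

        # The URL thread finishes on its own; wait for it so that
        # url_queue.join() cannot return before the URLs have been queued.
        thread_url.join()

        # queue.join() blocks the main thread until every item put on that
        # queue has been marked with task_done(). The queues are joined in pipeline order.
        for aiqiyi_queue in [self.url_queue, self.json_queue, self.dict_data_queue]:
            aiqiyi_queue.join()

        print("Crawl finished...")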

if __name__ == '__main__':
    aqy = AiQiYiSpider()
    aqy.main()

MongoDB

Five pages were crawled, 48 records per page.
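
The queue pattern used by the spider above can be reduced to a small sketch. It is not the author's code: `run_pipeline`, `fetch` and `save` are hypothetical names, squaring a number stands in for the HTTP request, and appending to a list stands in for the MongoDB insert. It only illustrates how daemon worker threads, `task_done()` and `join()` fit together.

```python
import threading
from queue import Queue


def run_pipeline(numbers, fetch_workers=3):
    """Square each number in worker threads and collect the results."""
    task_queue = Queue()      # plays the role of url_queue
    result_queue = Queue()    # plays the role of json_queue
    results = []

    def fetch():
        while True:
            n = task_queue.get()
            try:
                result_queue.put(n * n)   # placeholder for the HTTP request
            finally:
                task_queue.task_done()    # one task_done() per get()

    def save():
        while True:
            value = result_queue.get()
            try:
                results.append(value)     # placeholder for the database insert
            finally:
                result_queue.task_done()

    workers = [threading.Thread(target=fetch, daemon=True) for _ in range(fetch_workers)]
    workers.append(threading.Thread(target=save, daemon=True))
    for worker in workers:
        worker.start()

    for n in numbers:
        task_queue.put(n)

    # Join in pipeline order: first all fetches, then all saves.
    task_queue.join()
    result_queue.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4]))   # prints [1, 4, 9, 16]
```

Because each worker puts its result on the next queue before calling `task_done()`, joining the queues in pipeline order is enough to know that every item has passed through every stage.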

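13.Crawling Baidu job listings with a thread pool

13.1fiddler

URL: https://www.telerik.com/download/fiddler

Fill in some basic information and the download becomes available.

Once it has downloaded, just run the installer.

13.2Code

URL: https://talent.baidu.com/jobs/social-list?search=python

Code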

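from concurrent.futures import ThreadPoolExecutor, as_completed: asyncio.as_completed and concurrent.futures.as_completed serve a similar purpose; both let you handle the results of several concurrent tasks in the order in which they finish. concurrent.futures.as_completed is the one used here because the code is built on ThreadPoolExecutor, that is, on a multithreaded execution model.

with ThreadPoolExecutor(max_workers=5) as thread_pool: creates a thread pool that runs at most 5 threads at a time (max_workers=5). ThreadPoolExecutor is the thread pool implementation in Python's concurrent.futures library; it manages thread creation and task scheduling.

thread_pool.submit(self.get_job_data, page):

Submits the self.get_job_data method (which fetches the data for a given page) to the thread pool as a task.
The page argument is the current page number, from 1 to 5.
Returns a Future object that is used to track the status and result of the task.

future_list: holds the Future objects of all submitted tasks.

for future in as_completed(future_list):

as_completed(future_list) iterates over the futures in future_list in the order in which the tasks complete.
Even though the tasks were submitted in page order 1 to 5, whichever task finishes first is processed first.

data = future.result():

If the task completed successfully, this returns the Response object of the API request.
If the task raised an exception while running, that exception is re-raised here.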
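
To see the behavior of `submit()`, `as_completed()` and `future.result()` in isolation, here is a minimal sketch that needs no network access. `fake_fetch` and `crawl` are hypothetical helpers, not part of the spider: `time.sleep` stands in for the request latency, and page 3 is made to fail on purpose so that the exception path of `future.result()` is visible.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(page, delay):
    """Stand-in for a network request: sleep, then return the page number."""
    time.sleep(delay)
    if page == 3:
        raise ValueError(f"page {page} failed")
    return page


def crawl(delays):
    """Submit one task per page and return (finished pages, error messages)."""
    done, errors = [], []
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(fake_fetch, page, delay) for page, delay in delays.items()]
        for future in as_completed(futures):
            try:
                done.append(future.result())   # re-raises the task's exception, if any
            except ValueError as exc:
                errors.append(str(exc))
    return done, errors


if __name__ == "__main__":
    done, errors = crawl({1: 0.3, 2: 0.1, 3: 0.0, 4: 0.2})
    print(done)     # pages in completion order, not submission order
    print(errors)
```

With these delays the shortest task is normally reported first, which is the point of `as_completed()`; the exact order depends on thread scheduling, so code should not rely on it.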

# as_completed must come from concurrent.futures (not asyncio) because the tasks run in a ThreadPoolExecutor
from concurrent.futures import ThreadPoolExecutor, as_completed

import pymysql
import requests


class BaiduJobSpider:
    # Constructor: set up the database connection, request URL and headers
    def __init__(self):
        self.db = pymysql.connect(host="localhost", user="root", password="root", db="py_spider")
        self.cursor = self.db.cursor()
        self.api_url = 'https://talent.baidu.com/httservice/getPostListNew'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
            'Cookie': 'BIDUPSID=79ED59B3DF405E7BE0B2F089BF5636C0; PSTM=1697716565; BAIDUID=79ED59B3DF405E7B87EFE83B3F670F21:FG=1; BAIDUID_BFESS=79ED59B3DF405E7B87EFE83B3F670F21:FG=1; ZFY=E8UL64u1CWxtvnkjGKUmcr39lCirPsWnNcY4Ojzc6Ts:C; H_WISE_SIDS=234020_110085_264353_268593_269904_271171_270102_275171_276572_276589_277162_277356_277636_277639_275732_259642_278054_278390_278574_274779_278791_278388_279020_279039_279610_279711_279998_280266_280304_280373_278414_276438_280619_279201_277759_280809_280902_280557_280873_280636_280926_281043_281153_277970_281148; H_WISE_SIDS_BFESS=234020_110085_264353_268593_269904_271171_270102_275171_276572_276589_277162_277356_277636_277639_275732_259642_278054_278390_278574_274779_278791_278388_279020_279039_279610_279711_279998_280266_280304_280373_278414_276438_280619_279201_277759_280809_280902_280557_280873_280636_280926_281043_281153_277970_281148; Hm_lvt_50e85ccdd6c1e538eb1290bc92327926=1699171013; Hm_lpvt_50e85ccdd6c1e538eb1290bc92327926=1699173864; RT="z=1&dm=baidu.com&si=439a22e1-0524-47fc-94cc-717583dbaefa&ss=lol6js62&sl=0&tt=0&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf"',
            'Referer': 'https://talent.baidu.com/jobs/social-list?search=python'
        }

    # Fetch one page of data from the API
    def get_job_data(self, page):
        post_data = {
            'recruitType': 'SOCIAL',
            'pageSize': 10,
            'keyWord': '',
            'curPage': page,
            'projectType': '',
        }
        return requests.post(self.api_url, data=post_data, headers=self.headers)

    # Parse the response and save each job
    def parse_job_data(self, r):
        datas = r.json()['data']['list']
        for data in datas:
            dict_data = dict()
            # '空' ("empty") is stored when the API returns no education requirement
            dict_data['education'] = data['education'] if data['education'] else '空'
            dict_data['name'] = data['name']
            dict_data['serviceCondition'] = data['serviceCondition']
            self.save_job_data(dict_data)

    # Save one job to MySQL
    def save_job_data(self, dict_data):
        # id is omitted so that MySQL assigns it through auto_increment
        sql = """
                   insert into baidu_job(education, name, service_condition) values (
                       %s, %s, %s
                   )
               """
        try:
            self.cursor.execute(sql, (dict_data['education'], dict_data['name'], dict_data['serviceCondition']))
            self.db.commit()
            print('Row saved...')
        except Exception as e:
            print('Failed to save row:', e)
            self.db.rollback()

    # Create the table if it does not exist yet
    def create_table(self):
        sql = """
                    create table if not exists baidu_job(
                        id int primary key auto_increment,
                        education varchar(200),
                        name varchar(100),
                        service_condition text
                    );
                """
        try:
            self.cursor.execute(sql)
            print('Table created...')
        except Exception as e:
            print('Failed to create table:', e)

    def main(self):
        self.create_table()
        with ThreadPoolExecutor(max_workers=5) as thread_pool:
            future_list = [thread_pool.submit(self.get_job_data, page) for page in range(1, 6)]

            # Parse the results in the order in which the tasks complete
            for future in as_completed(future_list):
                try:
                    data = future.result()  # the Response, or the task's exception re-raised
                    self.parse_job_data(data)  # parse and save
                except Exception as e:
                    print(f"Error in processing: {e}")

        # Alternative: handle the results in submission order instead
        # with ThreadPoolExecutor(max_workers=5) as thread_pool:
        #     future_list = list()
        #     # Submit the requests
        #     for page in range(1, 6):
        #         r = thread_pool.submit(self.get_job_data, page)
        #         future_list.append(r)
        #     # Parse and save the data
        #     for future in future_list:
        #         self.parse_job_data(future.result())

if __name__ == '__main__':
    spider = BaiduJobSpider()
    spider.main()

MySQL (the baidu_job table)

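14.Crawling Mango TV movies with multiple processes

14.1JoinableQueue

14.1.1Basic concepts

JoinableQueue is a queue provided by Python's multiprocessing module (not by the queue module). It is a subclass of multiprocessing.Queue that adds what is needed to coordinate producer and consumer processes.

JoinableQueue inherits from Queue and behaves like an ordinary multiprocessing.Queue, but it has two additional methods:

task_done(): signals that a task has been completed.
join(): blocks until all tasks in the queue have been completed.

These two methods make JoinableQueue a good fit for task scheduling across processes: consumers call task_done() after handling each item, and join() waits until every item that was put on the queue has been marked as done before execution continues. For threads, the standard queue.Queue already offers the same task_done() and join() methods.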

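14.1.2Main methods

put(item, block=True, timeout=None): puts item on the queue. If block is True and the queue is full, the call waits until a slot becomes free.

get(block=True, timeout=None): takes an item from the queue. If the queue is empty and block is True, the call blocks until data is available.

task_done(): specific to JoinableQueue (a plain multiprocessing.Queue does not have it). It marks one previously fetched task as completed and is normally used together with join() to make sure all tasks are finished before the program exits.

join(): blocks the calling thread or process until every task in the queue has been marked as completed. It waits until all items that were put on the queue have been processed.

Task management: put tasks on the queue; the consumers (the processes or threads that handle the tasks) take them off the queue, process them and then call task_done(). join() can be used to wait for all tasks to complete.

Process/thread coordination: coordinates the work of several processes or threads so that execution only continues once all of them have finished their tasks.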
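
The interplay of `put()`, `get()`, `task_done()` and `join()` can be tried out with a minimal sketch. The names `handle`, `worker` and `main` are hypothetical, squaring a number stands in for real work, and a plain `multiprocessing.Queue` is assumed for sending the results back to the parent process.

```python
from multiprocessing import JoinableQueue, Process, Queue


def handle(n):
    """Placeholder for the real work done on one task."""
    return n * n


def worker(tasks, results):
    """Consume tasks forever; the process is a daemon, so it ends with the parent."""
    while True:
        n = tasks.get()
        try:
            results.put(handle(n))
        finally:
            tasks.task_done()   # exactly one task_done() per get()


def main():
    tasks = JoinableQueue()
    results = Queue()
    for _ in range(2):
        Process(target=worker, args=(tasks, results), daemon=True).start()

    numbers = [1, 2, 3, 4]
    for n in numbers:
        tasks.put(n)

    tasks.join()   # blocks until task_done() has been called for every item
    print(sorted(results.get() for _ in numbers))


if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__"` guard matters here: with the spawn start method the child processes import the main module again, and the guard keeps them from starting workers of their own.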

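14.2Code

URL: https://www.mgtv.com/lib/3?lastp=list_index

Scroll down through a few pages and look for the movie list API request first.

Code

params_queue: holds the query parameters of each page request.

json_queue: holds the JSON data returned by the requests.

dict_data_queue: holds the parsed dicts.

main:

Creates all processes and adds them to process_list; they generate the page parameters, request the data, parse it and save it.
Every process is marked as a daemon (daemon=True) before it starts, so it is terminated automatically when the main program exits.
time.sleep(0.3) staggers the start-up so that not too many processes are launched at the same instant; starting a process is itself a relatively expensive operation.
join() on each queue makes sure that all queued tasks are finished before the program exits.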

# from multiprocessing import Queue
import hashlib
import time
from multiprocessing import Process, JoinableQueue

import pymongo
import requests
import redis


class MGTVMovieSpider:
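    # In the multiprocess version the database clients are created as class attributes
    # rather than in __init__ (presumably so that they are not part of the instance
    # state that has to be pickled when a process is started with the spawn method)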
    mongo_client = pymongo.MongoClient('localhost', 27017)
    collection = mongo_client['py_spider']['mgtv_movie_process']
    redis_client = redis.Redis()

    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
        }
        self.api_url = 'https://pianku.api.mgtv.com/rider/list/pcweb/v3'
        self.params_queue = JoinableQueue()
        self.json_queue = JoinableQueue()
        self.dict_data_queue = JoinableQueue()

    # Build the query parameters of each page and put them on params_queue
    def get_mgtv_movie_params(self):
        for page in range(1, 6):
            # allowedRC=1&platform=pcweb&channelId=3&pn=1&pc=80&hudong=1&_support=10000000&kind=a1&edition=a1&area=a1&year=all&chargeInfo=a1&sort=c2
            params = {
                "allowedRC": "1",
                "platform": "pcweb",
                "channelId": "3",
                "pn": page,
                "pc": "80",
                "hudong": "1",
                "_support": "10000000",
                "kind": "a1",
                "edition": "a1",
                "area": "a1",
                "year": "all",
                "chargeInfo": "a1",
                "sort": "c2"
            }
            self.params_queue.put(params)

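    # Take page parameters from params_queue, request the data and put the JSON on json_queue
    def get_mgtv_movie_json(self):
        while True:
            params = self.params_queue.get()
            try:
                r = requests.get(self.api_url, headers=self.headers, params=params, timeout=10)
                self.json_queue.put(r.json())
            except (requests.RequestException, ValueError) as e:
                # Log and skip the page so that one failure does not kill the worker
                print("Request failed:", params, e)
            finally:
                # Always mark the task done, otherwise params_queue.join() never returns
                self.params_queue.task_done()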

    # Take JSON data from json_queue, parse it and put one dict per movie on dict_data_queue
    def parse_mgtv_movie_json(self):
        while True:
            r = self.json_queue.get()
            movies = r['data']['hitDocs']
            for movie in movies:
                dict_data = dict()
                dict_data['clipId'] = movie['clipId']
                dict_data['img'] = movie['img']
                dict_data['kind'] = ','.join(movie['kind'])
                dict_data['story'] = movie['story']
                dict_data['subtitle'] = movie['subtitle']
                dict_data['title'] = movie['title']
                self.dict_data_queue.put(dict_data)
            self.json_queue.task_done()

    # Compute the MD5 hex digest of the dict's string representation (used as a dedup fingerprint)
    @staticmethod
    def get_md5(dict_data):
        return hashlib.md5(str(dict_data).encode('utf-8')).hexdigest()

    # Save the data to MongoDB, using a Redis set to skip duplicates
    def save_mgtv_movie_dict(self):
        while True:
            dict_data = self.dict_data_queue.get()
            dict_data_md5 = self.get_md5(dict_data)
            # sadd returns 1 if the fingerprint is new and 0 if it is already in the set
            res = self.redis_client.sadd("mgtv:movie:filter", dict_data_md5)
            if res:
                self.collection.insert_one(dict_data)
                print("Inserted:", dict_data)
            else:
                print("Duplicate, skipped...")
            self.dict_data_queue.task_done()

    def main(self):
        process_list = list()
        # Process that generates the page parameters
        process_params = Process(target=self.get_mgtv_movie_params)
        process_list.append(process_params)

        # Three processes that request the data
        for _ in range(3):
            process_json = Process(target=self.get_mgtv_movie_json)
            process_list.append(process_json)

        # One process that parses the data
        process_parser = Process(target=self.parse_mgtv_movie_json)
        process_list.append(process_parser)

        # One process that saves the data
        process_save = Process(target=self.save_mgtv_movie_dict)
        process_list.append(process_save)

        for process in process_list:
            process.daemon = True
            process.start()
            time.sleep(0.3)

        # The parameter process finishes on its own; wait for it so that
        # params_queue.join() cannot return before the parameters have been queued.
        process_params.join()

        # Block until every item on each queue has been marked done, in pipeline order
        for task_queue in [self.params_queue, self.json_queue, self.dict_data_queue]:
            task_queue.join()

        print("Crawl finished...")


if __name__ == '__main__':
    spider = MGTVMovieSpider()
    spider.main()

Redis (the mgtv:movie:filter dedup set)

MongoDB (the mgtv_movie_process collection)
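
The deduplication step can be tried without Redis using the sketch below. Two things differ from the spider and are assumptions of this sketch: `SeenFilter` is a hypothetical in-memory stand-in for the Redis set, and `fingerprint` hashes `json.dumps(..., sort_keys=True)` instead of `str(dict_data)`, so that two dicts with the same content but a different key order produce the same digest.

```python
import hashlib
import json


def fingerprint(item):
    """MD5 hex digest of the item, independent of key order."""
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


class SeenFilter:
    """In-memory stand-in for the Redis set used by the spider."""

    def __init__(self):
        self._seen = set()

    def add(self, digest):
        """Return 1 if digest is new and 0 if it was already present, like SADD."""
        if digest in self._seen:
            return 0
        self._seen.add(digest)
        return 1


if __name__ == "__main__":
    seen = SeenFilter()
    a = {"clipId": 1, "title": "demo"}
    b = {"title": "demo", "clipId": 1}   # same data, different key order
    print(seen.add(fingerprint(a)))      # prints 1
    print(seen.add(fingerprint(b)))      # prints 0
```

An in-memory set only works inside one process; the spider uses Redis because the fingerprints then survive restarts and can be shared between processes.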

15.aiofile

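aiofile is a Python library for asynchronous file operations; it lets you read and write file contents asynchronously. Because it works with Python's asyncio, file access does not block the event loop, which helps in I/O-bound programs.

Installing aiofile

pip install aiofile

Basic usage

In the calls below, afp is an opened aiofile file object.

await afp.read(): reads the file asynchronously.

await afp.write(content): writes to the file asynchronously.

async for line in LineReader(afp): reads the file line by line (suitable for large files).

async with AIOFile(file_path, 'a') as afp: opens the file in append mode ('a') for writing.

afp.tell(): returns the current position of the file pointer (available on the file object returned by aiofile.async_open()).

afp.seek(): moves the file pointer (available on the file object returned by aiofile.async_open()).

To run an ordinary synchronous file operation without blocking the event loop, hand it to a thread, for example with asyncio.to_thread() or loop.run_in_executor(). Note that threadpool.wrap, which wraps a synchronous file object as an asynchronous one, belongs to the separate aiofiles package (aiofiles.threadpool.wrap), not to aiofile.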
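
As a point of comparison, here is a standard-library sketch of turning blocking file operations into awaitable ones. It does not use aiofile: it relies on `asyncio.to_thread()` (Python 3.9+), and `write_text`, `read_text` and `roundtrip` are hypothetical helper names.

```python
import asyncio
import os
import tempfile


def write_text(path, text):
    """Ordinary blocking write."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text(path):
    """Ordinary blocking read."""
    with open(path, encoding="utf-8") as f:
        return f.read()


async def roundtrip(text):
    """Write text to a temporary file and read it back without blocking the loop."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Each blocking call runs in a worker thread while the event loop stays free.
        await asyncio.to_thread(write_text, path, text)
        return await asyncio.to_thread(read_text, path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(asyncio.run(roundtrip("hello")))   # prints hello
```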

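16.Downloading Honor of Kings hero skin images with coroutines

16.1Analyzing the page

URL: https://pvp.qq.com/web201605/herolist.shtml

Take the hero 苍 (Cang) as an example: the image file names contain what looks like an Id.

https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/177/177-bigskin-1.jpg
https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/177/177-bigskin-2.jpg
https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/177/177-bigskin-3.jpg

Opening the detail pages of several heroes shows that the image URLs share the same domain and path.

The only parts that differ are the hero Id and the skin number.

Requesting skin number 4 of 苍 returns 404, and this hero indeed has no fourth skin.

So we assume that no hero has more than 30 skins and stop the loop for the current hero as soon as a URL returns 404,

then continue with the next hero.

https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/177/177-bigskin-4.jpg

Looking at the page source, the hero Id appears in every hero's image link.

So we could select each hero's li element with XPath, read the src attribute of its image and cut the Id out of it.

That is a little cumbersome, though, so we first check whether the hero list is also available through an API.

//ul[@class='herolist clearfix']/li/a/img/@src

Capturing the traffic in the browser reveals a herolist.json URL.

Going by its name, this is most likely the hero list as JSON data.

Opening it shows that the ename field in the JSON is the hero Id we are looking for.

https://pvp.qq.com/web201605/js/herolist.json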

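16.2Code

Main steps:

Request self.json_url to get the hero list as JSON data.
Create one concurrent task per hero, each trying to download up to 30 skin images.
Request each skin image URL asynchronously and stop at the first non-200 response (a 404 means the hero has no more skins).
Save each image to the local ./skin_images folder.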
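
The "probe skin numbers until the first non-200 response" logic from the steps above can be sketched on its own, without network access. `probe_skins` and `fake_status` are hypothetical names; `fake_status` pretends that every hero has exactly three skins, and the hero Id 105 is an arbitrary example value rather than something taken from herolist.json.

```python
import asyncio

SKIN_URL = "https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/{0}/{0}-bigskin-{1}.jpg"


async def probe_skins(hero_id, get_status, max_skins=30):
    """Return the skin URLs of one hero, stopping at the first non-200 status."""
    urls = []
    for skin_id in range(1, max_skins + 1):
        url = SKIN_URL.format(hero_id, skin_id)
        if await get_status(url) != 200:
            break   # no more skins for this hero
        urls.append(url)
    return urls


async def fake_status(url):
    """Hypothetical stand-in for an HTTP request: every hero has exactly 3 skins."""
    await asyncio.sleep(0)
    skin_id = int(url.rsplit("-", 1)[1].split(".")[0])
    return 200 if skin_id <= 3 else 404


async def main():
    hero_ids = [177, 105]   # 177 is the Id from the URLs above, 105 is arbitrary
    # One coroutine per hero; gather() runs them concurrently and keeps the input order.
    results = await asyncio.gather(*(probe_skins(h, fake_status) for h in hero_ids))
    return dict(zip(hero_ids, results))


if __name__ == "__main__":
    for hero_id, urls in asyncio.run(main()).items():
        print(hero_id, len(urls))
```

In the real spider the status check and the download are one request; the sketch separates them only so that the stop condition can be tested without a network.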

import asyncio
import os

import aiofile
import aiohttp


class HeroSkinImgSpider:
    def __init__(self):
        # Hero list
        # https://pvp.qq.com/web201605/js/herolist.json
        self.json_url = 'https://pvp.qq.com/web201605/js/herolist.json'
        # Skin image URL template
        # https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/177/177-bigskin-1.jpg
        self.skin_url = 'https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/{}/{}-bigskin-{}.jpg'
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
        }

    # Download every skin image of one hero
    async def get_heroskinimg_content(self, session, hero_dict_data):
        hero_id = hero_dict_data['ename']
        hero_name = hero_dict_data['cname']
        # Assume that no hero has more than 30 skins, so try skin numbers 1 to 30
        for skin_id in range(1, 31):
            async with session.get(self.skin_url.format(hero_id, hero_id, skin_id), headers=self.headers) as r:
                if r.status == 200:
                    content = await r.read()
                    hero_skin_img = hero_name + '-' + str(hero_id) + '-' + str(skin_id) + '.jpg'
                    async with aiofile.async_open('./skin_images/' + hero_skin_img, 'wb') as f:
                        await f.write(content)
                        print('Saved: ' + hero_skin_img)
                else:
                    # A skin number past the last one returns 404, so stop here
                    break

    # Entry point
    async def main(self):
        # Create the output folder if it does not exist yet
        os.makedirs('./skin_images', exist_ok=True)
        # tasks = list()
        async with aiohttp.ClientSession() as session:
            async with session.get(self.json_url, headers=self.headers) as r:
                # content_type=None skips aiohttp's check of the response Content-Type header
                heroes = await r.json(content_type=None)
                tasks = [
                    self.get_heroskinimg_content(session, hero)
                    for hero in heroes
                ]
                await asyncio.gather(*tasks)
                # result = await r.json(content_type=None)
                # for item in result:
                #     coro_obj = self.get_heroskinimg_content(session, item)
                #     tasks.append(asyncio.create_task(coro_obj))
                # await asyncio.wait(tasks)


if __name__ == '__main__':
    spider = HeroSkinImgSpider()
    asyncio.run(spider.main())

Skin images (saved in ./skin_images)


posted @ 2025-04-08 22:17 peng_boke
