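Crawler Case Study: Comparing the Efficiency of Single-Threaded, Multi-Threaded, Multi-Process, and Asynchronous Code

I crawled the 100 Python example problems on Runoob (菜鸟教程) four ways: single-threaded, multi-threaded, multi-process, and asynchronous.

The conclusion first:

Single thread, single process: 0:00:21.835015
Multi-threaded: 0:00:07.817000
Multi-process: 0:00:05.293997
Asynchronous: 0:00:02.583346

Takeaway: the most expensive part of a crawler is the network requests (my rough estimate is about 90% of the total time, although the per-phase breakdown later in this post puts it closer to 70% for the single-threaded run); file writes come a distant second. One caveat on the numbers: the asynchronous script starts its timer inside main(), after the index page has already been fetched, whereas the other three scripts start the timer at the top of the file, so the asynchronous figure excludes one request that the others include.

Comparing the numbers:

- Network request time: assuming each request takes 0.5 s, 100 requests made one after another take about 50 s in total.

- File write time: even with 100 writes, each takes only a few milliseconds, so the total stays under 1 s.
- Concurrency: what matters is whether multiple threads or processes can keep system resources busy and cut the overall waiting time.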
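
The arithmetic above can be turned into a rough model. The sketch below is only an illustration: it assumes every request takes the same fixed time and that workers are never idle, and `estimate_total_seconds` is a name introduced here, not part of the crawler.

```python
import math


def estimate_total_seconds(num_requests, seconds_per_request, workers):
    """Idealized wall time: requests run in rounds of `workers` at a time."""
    rounds = math.ceil(num_requests / workers)
    return rounds * seconds_per_request


if __name__ == "__main__":
    # 1 = serial, 16 = the thread/process pools used here, 100 = everything at once
    for workers in (1, 16, 100):
        total = estimate_total_seconds(100, 0.5, workers)
        print(f"{workers:>3} in flight: {total:.1f} s")
```

Real runs deviate from this model because response times vary and parsing also takes CPU time, but it shows why the number of requests in flight dominates the result.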

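Analysis of the measured data

Model	Time (s)	Relative speedup	Main bottleneck
Single-threaded	21.84	1x (baseline)	Fully serial, no concurrency
Multi-threaded (16 threads)	7.82	~2.8x	GIL contention, thread-switching overhead
Multi-process (16 processes)	5.29	~4.1x	Process creation, IPC overhead
Async coroutines	2.58	~8.5x	Network latency (theoretical limit)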
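
The speedup column can be reproduced from the four timings reported at the top of the post. This is a small sketch; the helper names `parse_elapsed` and `speedups` are introduced here for illustration.

```python
from datetime import timedelta


def parse_elapsed(text):
    """Parse an 'H:MM:SS.ffffff' string as printed by str(timedelta)."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# Timings copied from the results at the top of the post.
MEASURED = {
    "single-threaded": "0:00:21.835015",
    "multi-threaded": "0:00:07.817000",
    "multi-process": "0:00:05.293997",
    "async": "0:00:02.583346",
}


def speedups(measured, baseline="single-threaded"):
    """Return each model's speedup over the baseline, rounded to one decimal."""
    base = parse_elapsed(measured[baseline]).total_seconds()
    return {
        name: round(base / parse_elapsed(text).total_seconds(), 1)
        for name, text in measured.items()
    }


if __name__ == "__main__":
    for name, factor in speedups(MEASURED).items():
        print(f"{name}: {factor}x")
```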

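Async approaches the theoretical limit: total time is governed mainly by the slowest individual responses rather than by the number of requests.
Thread and process overhead: thread switching, GIL contention, and inter-process communication add costs that slow the whole run.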

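Why is asynchronous code faster than multi-threading or multi-processing for I/O-bound work?

1. Resource overhead

Model	Threads/processes	Memory footprint	Context-switch cost
Multi-process	16 processes	High (separate address spaces)	High (kernel-level switch)
Multi-threaded	16 threads	Medium (shared memory)	Medium (kernel-level switch)
Async coroutines	1 thread + N coroutines	Very low	Very low (user-space switch)

Async model: a single thread manages thousands of concurrent tasks as coroutines; the event loop switches between them in user space, without a kernel context switch.
Threads and processes: every switch traps into the kernel, which saves registers, stacks, and other state, so the overhead is significant.

2. Characteristics of I/O-bound work

A web crawler spends most of its time waiting for remote servers to respond (I/O wait), not computing on the local CPU.

Async: while one request is waiting for its response, the event loop immediately switches to another task that is ready, so waiting time is overlapped rather than wasted.

Threads and processes: requests are also issued in parallel, and a thread blocked on I/O does release the CPU to the others, but each worker handles only one request at a time. With 16 workers at most 16 requests are in flight, so 100 requests still need several rounds of waiting.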
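
A minimal sketch of this overlap, using `asyncio.sleep` as a stand-in for waiting on a server (no real network is involved, and `fake_request` is a hypothetical helper): ten waits of 0.1 s each finish in roughly 0.1 s overall instead of 1 s.

```python
import asyncio
import time


async def fake_request(i, delay=0.1):
    await asyncio.sleep(delay)  # stands in for waiting on a server response
    return i


async def run_all(n=10, delay=0.1):
    start = time.perf_counter()
    # All n coroutines wait at the same time on one thread.
    results = await asyncio.gather(*(fake_request(i, delay) for i in range(n)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_all())
    print(f"{len(results)} requests finished in {elapsed:.2f} s")
```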

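3. Python's GIL

Multi-threading: because of the GIL, only one thread executes Python bytecode at any moment.
Blocking I/O releases the GIL, but thread scheduling and lock contention still add overhead.
Multi-processing: sidesteps the GIL, but inter-process communication (IPC) and data serialization (pickle) are expensive.
Async: the GIL still exists, but with a single thread there is nothing to contend with, so coroutines never compete for it.
Further notes
1. Async code avoids GIL contention rather than removing the GIL:
  coroutines run inside one thread, so they never fight over the GIL the way threads do. This is one source of async's lower overhead, but it also means CPU-bound work such as HTML parsing still runs serially.
2. Coroutines can still race on shared resources:
  when several coroutines operate on shared state (global variables, files, database connections) across an await, synchronization such as a lock is still needed to keep the data consistent.
3. Async locks behave differently:
  waiting on an async lock (such as asyncio.Lock) suspends only the waiting coroutine and does not block the event loop, which makes it cheap and fundamentally different from the locks used with threads or processes.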
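
To make points 2 and 3 concrete, here is a small sketch (not taken from the crawler) in which each coroutine reads a shared counter, yields to the event loop, and then writes it back. Without a lock the updates overwrite each other; with `asyncio.Lock` every update is kept. The names `add_one` and `run` are hypothetical.

```python
import asyncio


async def add_one(state, lock=None):
    async def read_then_write():
        current = state["count"]
        await asyncio.sleep(0)          # yield to the event loop mid-update
        state["count"] = current + 1

    if lock is None:
        await read_then_write()         # unprotected: other coroutines interleave
    else:
        async with lock:                # only one coroutine updates at a time
            await read_then_write()


async def run(n, use_lock):
    state = {"count": 0}
    lock = asyncio.Lock() if use_lock else None
    await asyncio.gather(*(add_one(state, lock) for _ in range(n)))
    return state["count"]


if __name__ == "__main__":
    print("without lock:", asyncio.run(run(50, use_lock=False)))
    print("with lock:   ", asyncio.run(run(50, use_lock=True)))
```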

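4. Connection management and reuse

Async: aiohttp reuses TCP connections through its connection pool, which cuts handshake overhead.
In the sample code every request shares one ClientSession, so connections are reused automatically.
Threads and processes: in the sample code every task calls requests.get directly, which opens and closes a connection per request and adds latency. This comes from how the code is written rather than from threading itself; a shared requests.Session per thread or process would reuse connections as well.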
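
The effect of reuse can be observed offline with the standard library alone. The sketch below is an assumption-laden illustration, not the crawler's code: it starts a throwaway local HTTP server, counts how many TCP connections it accepts, and compares five requests over one kept-alive `http.client.HTTPConnection` with five requests that each open a new connection. In the real crawler, `requests.Session` or `aiohttp.ClientSession` would play the role of the reused connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # allows keep-alive connections
    connections = 0
    _lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection.
        super().setup()
        with CountingHandler._lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass                        # keep the demo output quiet


def count_connections(reuse, n=5):
    """Make n requests and return how many TCP connections the server saw."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        if reuse:
            conn = http.client.HTTPConnection(host, port)
            for _ in range(n):
                conn.request("GET", "/")
                conn.getresponse().read()
            conn.close()
        else:
            for _ in range(n):
                conn = http.client.HTTPConnection(host, port)
                conn.request("GET", "/")
                conn.getresponse().read()
                conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("reused connection:", count_connections(reuse=True))
    print("new connection each time:", count_connections(reuse=False))
```

Against a remote HTTPS server each new connection also pays for a TCP and TLS handshake, which is where the latency difference comes from.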

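5. Task scheduling efficiency

Model	Scheduling mechanism	Best suited to
Multi-process	OS process scheduler	CPU-bound work
Multi-threaded	OS thread scheduler	Lightweight I/O-bound concurrency
Async coroutines	User-space event loop (I/O readiness notification)	High-concurrency I/O-bound work

Async event loop: built on epoll (Linux) or kqueue (macOS) for efficient I/O readiness notification, so tasks are resumed exactly when their I/O is ready. On Windows, asyncio's default loop uses I/O completion ports instead.
OS scheduling: thread and process switches are managed by the kernel, which cannot tailor them to the application's behavior.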
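
Readiness notification can be tried directly with the standard `selectors` module, which picks epoll, kqueue, or select depending on the platform. This is a minimal sketch using a local socket pair in place of a real server connection; `readiness_demo` is a hypothetical name.

```python
import selectors
import socket


def readiness_demo():
    sel = selectors.DefaultSelector()   # epoll, kqueue, or select, depending on the OS
    reader, writer = socket.socketpair()
    try:
        reader.setblocking(False)
        sel.register(reader, selectors.EVENT_READ)
        before = sel.select(timeout=0)  # nothing has been sent yet
        writer.sendall(b"response")     # the "server" answers
        after = sel.select(timeout=1)   # the reader is now reported as ready
        data = reader.recv(1024)
        return len(before), len(after), data
    finally:
        sel.close()
        reader.close()
        writer.close()


if __name__ == "__main__":
    print(readiness_demo())
```

An event loop does the same thing for many sockets at once and resumes the coroutine attached to each socket that is reported ready.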

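Why doesn't a 16-core CPU let multi-processing crush async?

Network latency is not a CPU problem
The crawler's bottleneck is the remote server's response time and network transfer speed, not local CPU work.
Even with 16 cores, if each request waits 100 ms for a response, 16 requests in parallel still take at least 100 ms.
The async model can issue hundreds of requests at once, so the total time ≈ the response time of the slowest request.
Resource contention and scheduling cost
- Multi-process: 16 processes each need their own memory, and inter-process communication (serializing data) costs CPU time.
- Async: coroutines in one process share memory, so nothing needs to be serialized.
Connection limits
Every model is bound by operating-system limits such as the maximum number of file descriptors. The difference is the cost per connection: a thread or process per connection is expensive, whereas a coroutine per connection is cheap, so the async model can manage thousands of connections comfortably.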

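Key optimizations at the code level

Advantages of the async version:

async with aiohttp.ClientSession() as session:  # one session, shared connection pool
    tasks = [process_href(href, rootUrl, session) for href in hreflist]
    await asyncio.gather(*tasks)  # submit every task at once

Weaknesses of the multi-threaded and multi-process versions:

# Multi-threaded: every call opens its own connection
def process_href(href):
    response = requests.get(rootUrl + href)  # no connection reuse
    # ...

# Multi-process: processes cannot share a connection pool
with Pool(16) as p:
    p.map(process_href, urls)  # every process issues its requests independently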

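Summary

For I/O-bound work, async's very low overhead and efficient scheduling gave it roughly two to three times the throughput of multi-threading and multi-processing in this test.
Multi-processing suits CPU-bound work (such as numerical computation); in I/O-bound scenarios its resource consumption and scheduling cost become the bottleneck.
Multi-threading is limited by the GIL and kernel scheduling; in this test it was the slowest of the three concurrent models, behind both multi-processing and async.

Choosing a model

Scenario	Recommended model
High-frequency HTTP API calls	Async (`asyncio + aiohttp`)
Batch downloads followed by CPU-heavy processing	Multi-process (`multiprocessing`)
Simple scripts (low concurrency)	Multi-threaded (`concurrent.futures`)

Once you are comfortable with async programming, high-concurrency crawlers and real-time data processing become much easier to handle efficiently.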

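Kernel mode and user mode

A plain-language explanation

Imagine you are a waiter in a restaurant (the event loop) serving several tables (coroutines) at once.

Traditional multi-threading (kernel-mode switch): every time you move between tables you must first go to the manager's office (the OS kernel) to record your current state, and only then switch to the next table.
Coroutines (user-mode switch): you note the current table's progress yourself, on the restaurant floor (user mode), and turn straight to the next table with no outside approval.

User-mode switching skips the repeated trips to the office, so it is naturally more efficient.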

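Technical details

1. Coroutine

What it is: a lightweight, user-level unit of execution whose switch points are chosen by the programmer (in Python, at each await).
Properties:
- Switching between coroutines does not involve the operating system.
- A coroutine's context (its variables and current position) is kept in user-space memory.

2. Event loop

Responsibilities:
1. Watch for I/O events (such as a network response arriving).
2. Schedule coroutines: when a coroutine waits on I/O, suspend it and run other coroutines that are ready.

Workflow:

while True:
    check which coroutines have I/O ready → run those coroutines → repeat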

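3. User mode vs. kernel mode

Aspect	User mode	Kernel mode
Privilege	Low; restricted access to resources	High; can operate hardware and core resources
Switch cost	Low (no CPU mode switch)	High (save/restore registers, privilege switch)
Typical operations	Coroutine switches, application logic	Thread switches, system calls (such as file reads and writes)

How a coroutine switch works (user mode)

Suspend the current coroutine:
- Save its context (such as the program counter and local variables) in memory.
Pick the next coroutine:
- The event loop checks which coroutines have I/O ready.
Resume the next coroutine:
- Load the target coroutine's context from memory and continue running it.

The switch itself needs no kernel involvement; it happens entirely inside the application. (Polling for I/O readiness is still a system call, but one call can cover every pending socket.)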
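
The suspend, pick, and resume steps can be imitated with plain generators. This toy scheduler is a sketch for illustration only: it picks the next task round-robin, whereas a real event loop such as asyncio's picks tasks by I/O readiness. `worker` and `run_round_robin` are hypothetical names.

```python
from collections import deque


def worker(name, steps, log):
    for step in range(steps):
        log.append(f"{name}:{step}")
        yield  # suspend here; local state stays inside the generator object


def run_round_robin(tasks):
    ready = deque(tasks)
    while ready:
        task = ready.popleft()      # pick the next task
        try:
            next(task)              # resume it until its next yield
        except StopIteration:
            continue                # finished, drop it
        ready.append(task)          # still has work, queue it again


if __name__ == "__main__":
    log = []
    run_round_robin([worker("A", 2, log), worker("B", 3, log)])
    print(log)
```

Every switch here is an ordinary Python function call and return, which is why no kernel involvement is needed.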

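Compared with a thread switch (kernel mode)

Trigger:
- The time slice expires, the thread blocks on I/O, or it yields voluntarily.
Trap into the kernel:
- The CPU switches from user mode to kernel mode.
Save context:
- The kernel saves the current thread's registers, stack pointer, and related state.
Schedule a new thread:
- The kernel picks the next thread and loads its context.
Return to user mode:
- The CPU switches back to user mode and runs the new thread.

Each switch typically costs on the order of microseconds (more once cache effects are counted), and under high concurrency the cost adds up.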

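A worked performance comparison

Assume 1000 concurrent network requests, each waiting 100 ms on I/O:

Multi-threaded (1000 threads):
- Switch cost: assume 5 μs per switch
- Total switch cost: 1000 threads × 5 μs = 5 ms
- Total time: 100 ms (I/O-bound) + 5 ms ≈ 105 ms
Async coroutines (single thread):
- Switch cost: assume 0.1 μs per switch (a pure in-memory operation)
- Total switch cost: 1000 coroutines × 0.1 μs = 0.1 ms
- Total time: 100 ms (I/O-bound) + 0.1 ms ≈ 100.1 ms

These figures are illustrative and count one switch per task. The model also leaves out what 1000 threads cost to create and the stack memory each one reserves, which in practice matters more than the switches themselves.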
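
The same back-of-the-envelope calculation as code, using the assumed per-switch costs from the example (they are illustrative inputs, not measurements); `total_ms` is a name introduced here.

```python
def total_ms(io_wait_ms, tasks, switch_cost_ns):
    """I/O wait plus one switch per task, computed in integer nanoseconds."""
    total_ns = io_wait_ms * 1_000_000 + tasks * switch_cost_ns
    return total_ns / 1_000_000


if __name__ == "__main__":
    print("threads:   ", total_ms(100, 1000, 5_000), "ms")  # 5 μs per switch
    print("coroutines:", total_ms(100, 1000, 100), "ms")    # 0.1 μs per switch
```

In this model the switch cost changes the total by only a few percent, so most of the gap in the measured results has to come from other factors, such as the number of requests in flight and connection reuse.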

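Summary

"User-mode switching": coroutine scheduling is controlled entirely by the application and does not go through the OS kernel, which avoids privilege switches and saving kernel-level state.
Advantages:
- Very low switch cost (well below that of a thread switch).
- Very high concurrency (tens of thousands of coroutines are manageable).
Best suited to: I/O-bound work (crawlers, API calls, real-time chat).

With this in mind it is easier to see why async outperforms multi-threading and multi-processing in high-concurrency I/O scenarios.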

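Why multi-processing beat multi-threading here

(1) A mixed load of network requests and CPU parsing

Where multi-threading shines: purely I/O-bound work (for example, sending requests with no parsing afterwards).
Where multi-processing shines: CPU-bound work (for example, heavy HTML parsing or computation).
This task:
The actual task is a mixed load of network requests (I/O) plus HTML parsing (CPU). Multi-processing can run the parsing phase's CPU work in parallel, whereas multi-threading is held back by the GIL and has to parse serially.

(2) The effect of the number of threads or processes

Multi-threaded (16 threads):
- A burst of concurrent requests may trigger the server's anti-crawler rate limiting and reduce throughput (this applies equally to 16 processes).
- Thread-switching overhead grows noticeably under a mixed load.
Multi-process (16 processes):
- Processes run independently, so the parsing phase is fully parallel and unaffected by the GIL.
- Resource utilization is best when the process count is close to the number of CPU cores (assuming the machine has 8 or more cores).

(3) Network fluctuation in the test environment

Network latency varies randomly between runs and can skew the results; a single measurement per model, as used here, is especially sensitive to this.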

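Time breakdown for each version

The per-phase figures below are rough estimates, not separate measurements.

(1) Single-threaded (21.8 s)

Breakdown: 100 network requests (about 15 s) + 100 HTML parses (about 6.8 s).
Bottleneck: fully serial, no concurrency.

(2) Multi-threaded (7.8 s)

Breakdown:
- Network requests: about 3 s (16 concurrent threads, possibly throttled by the server).
- HTML parsing: about 4.8 s (limited by the GIL, so only pseudo-parallel).
Bottleneck: the GIL keeps the parsing phase from using multiple cores.

(3) Multi-process (5.3 s)

Breakdown:
- Network requests: about 2.5 s (16 processes requesting independently).
- HTML parsing: about 2.8 s (parallel across cores, no GIL limit).
Advantage: fully parallel CPU parsing more than offsets the cost of creating processes.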

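Why was multi-processing faster?

(1) CPU parallelism in the parsing phase

Beautiful Soup parsing involves string processing, DOM tree construction, and other CPU work.
- Multiple processes can use several cores at once to speed up parsing, whereas threads parse serially because of the GIL.

(2) Anti-crawler measures

All 16 processes run on the same machine, share one public IP, and send the same User-Agent header, so multi-processing by itself does not make the traffic look more spread out to the server. Differences in rate limiting are therefore unlikely to explain the gap.

(3) The test machine's core count

If the machine has 8 or more cores, 16 processes can keep the CPU fully occupied, while threads remain limited to serial parsing by the GIL.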

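Verification experiments

(1) Vary the number of threads or processes

Multi-threaded: drop the thread count to 4-8 and see whether less switching overhead makes it faster.
Multi-process: set the process count to the number of CPU cores (for example 8) to avoid the scheduling overhead of too many processes.

(2) Separate the network and parsing phases

Network-only test: send the requests without parsing the HTML and see whether multi-threading clearly beats multi-processing.
Parsing-only test: parse locally cached HTML and see whether multi-processing clearly beats multi-threading.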

import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from datetime import datetime
from bs4 import BeautifulSoup


def readHtml(path):
    # Parse one cached HTML file and return a list of [url, title] pairs.
    with open(path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "lxml")
    lis = soup.find("div", id="s-hotsearch-wrapper").find("ul", class_="s-hotsearch-content").find_all("li")
    rows = []  # renamed from `list` to avoid shadowing the built-in
    for li in lis:
        url = li.find("a").get("href")
        title = li.find("a").find("span", class_="title-content-title").get_text()
        rows.append([url, title])
    return rows


if __name__ == '__main__':
    path = r"D:\learn\Python-卢战士优选\learncode\01FirstProject\Test\htmltest"
    paths = [os.path.join(path, file) for file in os.listdir(path)]
    time1 = datetime.now()
    print("*" * 20, time1.strftime('%Y-%m-%d %H:%M:%S'), "*" * 20)
    # Single-threaded, three runs: 0:00:05.779819, 0:00:04.695048, 0:00:04.848575
    # for p in paths:
    #     readHtml(p)
    # Multi-threaded, three runs: 0:00:05.114409, 0:00:05.216166, 0:00:05.280045
    # Judging by these results, threads give almost no benefit for CPU-bound work.
    with ThreadPoolExecutor(max_workers=16) as executor:
        list(executor.map(readHtml, paths))  # list() waits for every task to finish
    # Multi-process, three runs: 0:00:00.056828, 0:00:00.056402, 0:00:00.060979
    # (about 90x faster than one thread with only 16 workers, which is more than
    # parallelism alone can explain; worth re-measuring with the map call active)
    # with ProcessPoolExecutor(max_workers=16) as executor:
    #     list(executor.map(readHtml, paths))
    time2 = datetime.now()
    print("*" * 20, time2.strftime('%Y-%m-%d %H:%M:%S'), "*" * 20)
    print(time2 - time1)
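
The two phases can also be timed separately inside the crawler itself. The following is a minimal sketch under stated assumptions: `timed`, `phase_totals`, and `handle_page` are hypothetical names, the `fetch` and `parse` callables stand in for `requests.get` and the Beautiful Soup code, and the totals are only meaningful for a single-threaded run because the shared dictionary is not synchronized.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_totals = defaultdict(float)   # seconds accumulated per phase


@contextmanager
def timed(phase):
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - start


def handle_page(fetch, parse, url):
    with timed("network"):
        html = fetch(url)           # e.g. the HTTP request
    with timed("parse"):
        return parse(html)          # e.g. the Beautiful Soup extraction


if __name__ == "__main__":
    handle_page(lambda url: "<p>hi</p>", lambda html: html.upper(), "/demo")
    for phase, seconds in phase_totals.items():
        print(f"{phase}: {seconds:.6f} s")
```

Running the single-threaded crawler with such a timer would replace the estimated per-phase split with measured values.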

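Optimization suggestions

(1) A hybrid concurrency model

Division of labor (I tried this and it did not seem to help much):
- Use threads for the network requests (I/O-bound).
- Use processes for the HTML parsing (CPU-bound).

(2) Request rate control

Add a random delay to avoid triggering anti-crawler measures:

import random, time
time.sleep(random.uniform(0.1, 0.5))  # pause before each request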
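
In the asynchronous version `time.sleep` would block the whole event loop, so the pause has to be `asyncio.sleep`, and the number of requests in flight can be capped with `asyncio.Semaphore`. The sketch below is an illustration only: `polite_fetch` and `crawl` are hypothetical names, the sleep stands in for both the delay and the request, and the delays are shortened so the demo finishes quickly.

```python
import asyncio
import random


async def polite_fetch(i, semaphore, stats):
    async with semaphore:                        # at most `limit` requests in flight
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        # Non-blocking pause; other coroutines keep running meanwhile.
        await asyncio.sleep(random.uniform(0.001, 0.005))
        stats["active"] -= 1
        return i


async def crawl(n=30, limit=5):
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(polite_fetch(i, semaphore, stats) for i in range(n)))
    return results, stats["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(crawl())
    print(f"{len(results)} requests, peak concurrency {peak}")
```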

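Summary

The test results show how a mixed-load task (I/O + CPU) behaves in one particular environment:

Multi-processing sidesteps the GIL and uses multiple cores, so it has a clear edge in the parsing phase and is faster overall.
Multi-threading is held to serial parsing by the GIL and cannot realize its full concurrency potential.
Single-threaded code has no concurrency at all and is naturally the slowest.

To push performance further, design a finer-grained concurrency strategy around the task's characteristics instead of relying on a single model.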

Single-threaded, single-process code

'''
Index page for the 100 problems: https://www.runoob.com/python/python-100-examples.html
'''
import datetime
import re

import requests
from bs4 import BeautifulSoup
time1=datetime.datetime.now()
print("*"*20,time1.strftime('%Y-%m-%d %H:%M:%S'),"*"*20)
url="https://www.runoob.com/python/python-100-examples.html"
rootUrl="https://www.runoob.com"
# Fetch the index page to get the list of problems and their links
header={# pretend to be a browser so the request is not blocked
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=header)
# Use the detected encoding to avoid garbled text
response.encoding=response.apparent_encoding
# Parse the page into a BeautifulSoup object
soup = BeautifulSoup(response.text, "lxml")
# Collect the problem links
uls = soup.find("div",id="content").find_all("ul")
hreflist=[]
for ul in uls:
    lis = ul.find_all("li")
    for li in lis:
        hreflist.append(li.find("a").get("href"))

# List that collects the extracted data for every page
pageList=[]
for href in hreflist:
    # Fetch the problem page behind each link
    topicUrl=rootUrl+href
    response = requests.get(topicUrl, headers=header)
    response.encoding=response.apparent_encoding
    soup = BeautifulSoup(response.text, "lxml")
    # 1. Collect every explanatory paragraph of the problem
    allP = soup.find("div", id="content").find_all("p")
    # Dict holding the sections of this page
    dic={}
    # List holding the explanatory paragraphs
    noteList=[]
    '''
    0 Problem: given the four digits 1, 2, 3, 4, how many distinct three-digit numbers without repeated digits can be formed? What are they?
    1 Analysis: each of the hundreds, tens, and units places can hold 1, 2, 3, or 4. Generate all arrangements, then drop those that break the condition.
    2 Source code:
    3 Output of the example above:
    '''
    for p in allP:
        if not p.find("a"):# skip the "back to previous page" paragraph
            noteList.append(p.get_text())
    dic["noteList"]=noteList
    # 2. Extract the example code
    dic["code"]=""  # default, so wirteInFile never hits a missing key
    try:
        code = soup.find("div", id="content").find("div",class_="example")
        if code:
            dic["code"]=code.get_text()
        else:
            pre=soup.find("div", id="content").find("pre")
            dic["code"] = pre.get_text()
    except Exception as e:
        print(f"Problem {hreflist.index(href)+1}: failed to extract the example code: {e}")
    # 3. Extract the execution result
    try:
        result = soup.find("div", id="content").find("pre")
        if result:
            dic["result"]=result.get_text()
        else:
            imgSrc = soup.find("div", id="content").find("img",attrs={"src":re.compile(r"^//")}).attrs["src"]
            dic["result"]=f"The result is not text; see: https:{imgSrc}"
    except Exception as e:
        print(f"Problem {hreflist.index(href)+1}: failed to extract the result: {e}")
        dic["result"] = "No execution result!"
    pageList.append(dic)
# Write the results to a file
def wirteInFile():
    with open("../Test/python100例题.txt","w+",encoding="utf-8") as f:
    # with open("../Test/python100例题.doc","w+",encoding="utf-8") as f:
        for dic in pageList:
            noteList= dic["noteList"]
            code= dic["code"]
            result= dic["result"]
            try:
                if len(noteList)==4:
                    f.write(f"{pageList.index(dic)+1}"+noteList[0]+"\n")
                    f.write(noteList[1]+"\n")
                    f.write(noteList[2]+"\n")
                    f.write(code+"\n")
                    f.write(noteList[3]+"\n")
                    f.write(result+"\n")
                    f.write("*"*50+"\n")
                else:
                    f.write(f"{pageList.index(dic)+1}"+noteList[0]+"\n")
                    f.write(noteList[1]+"\n")
                    f.write(code+"\n")
                    f.write(noteList[2]+"\n")
                    f.write(result+"\n")
                    f.write("*"*50+"\n")
            except Exception as e:
                print(f"Entry {pageList.index(dic)+1} failed: {e}")
    print("Write complete!")
wirteInFile()
time2=datetime.datetime.now()
print("End time","*"*20,time2.strftime('%Y-%m-%d %H:%M:%S'),"*"*20)
print(time2-time1)
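
The writer function is repeated in all four scripts and numbers entries with `pageList.index(dic)`, which rescans the list for every entry and returns the position of the first equal dict. One possible tidy-up is sketched below with `enumerate`. It is not the original code: `write_pages` is a hypothetical name, it writes to any file-like object, and it generalizes the original two branches by assuming the last paragraph is always the one that introduces the output.

```python
import io


def write_pages(pages, out):
    """Write each page dict to `out`, numbering entries with enumerate."""
    for number, page in enumerate(pages, start=1):
        notes = page.get("noteList", [])
        code = page.get("code", "")
        result = page.get("result", "")
        if not notes:
            continue  # nothing to write for this page
        # Leading paragraphs, then the code, then the last paragraph and the result.
        out.write(f"{number}{notes[0]}\n")
        for note in notes[1:-1]:
            out.write(note + "\n")
        out.write(code + "\n")
        if len(notes) > 1:
            out.write(notes[-1] + "\n")
        out.write(result + "\n")
        out.write("*" * 50 + "\n")


if __name__ == "__main__":
    buffer = io.StringIO()
    write_pages([{"noteList": ["Problem", "Analysis", "Source:", "Output:"],
                  "code": "print(1)", "result": "1"}], buffer)
    print(buffer.getvalue())
```

For pages with three or four paragraphs this produces the same line order as the original function.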

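Multi-process code

import datetime
import re
from concurrent.futures import ProcessPoolExecutor

import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
# Note: with the spawn start method (the default on Windows) every worker re-imports
# this module, so the two module-level lines below also run once in each worker process.
time1=datetime.datetime.now()
print("*"*20,time1.strftime('%Y-%m-%d %H:%M:%S'),"*"*20)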

# One unit of work: fetch and parse a single problem page
def process_href(args):
    # Each call receives one tuple, so unpack it
    href, rootUrl ,header= args
    topicUrl = rootUrl + href
    response = requests.get(topicUrl, headers=header)
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, "lxml")
    # 1. Collect every explanatory paragraph of the problem
    allP = soup.find("div", id="content").find_all("p")
    # Dict holding the sections of this page
    dic = {}
    # List holding the explanatory paragraphs
    noteList = []
    '''
    0 Problem: given the four digits 1, 2, 3, 4, how many distinct three-digit numbers without repeated digits can be formed? What are they?
    1 Analysis: each of the hundreds, tens, and units places can hold 1, 2, 3, or 4. Generate all arrangements, then drop those that break the condition.
    2 Source code:
    3 Output of the example above:
    '''
    for p in allP:
        if not p.find("a"):  # skip the "back to previous page" paragraph
            noteList.append(p.get_text())
    dic["noteList"] = noteList
    # 2. Extract the example code
    dic["code"] = ""  # default, so wirteInFile never hits a missing key
    try:
        code = soup.find("div", id="content").find("div", class_="example")
        if code:
            dic["code"] = code.get_text()
        else:
            pre = soup.find("div", id="content").find("pre")
            dic["code"] = pre.get_text()
    except Exception as e:
        # hreflist only exists in the parent process, so report the href itself
        print(f"{href}: failed to extract the example code: {e}")
    # 3. Extract the execution result
    try:
        result = soup.find("div", id="content").find("pre")
        if result:
            dic["result"] = result.get_text()
        else:
            imgSrc = soup.find("div", id="content").find("img", attrs={"src": re.compile(r"^//")}).attrs["src"]
            dic["result"] = f"The result is not text; see: https:{imgSrc}"
    except Exception as e:
        print(f"{href}: failed to extract the result: {e}")
        dic["result"] = "No execution result!"
    return dic

# Write the results to a file
def wirteInFile(pageList):

    with open("../Test/python100例题.txt","w+",encoding="utf-8") as f:
    # with open("../Test/python100例题.doc","w+",encoding="utf-8") as f:
        for dic in pageList:
            noteList= dic["noteList"]
            code= dic["code"]
            result= dic["result"]
            try:
                if len(noteList)==4:
                    f.write(f"{pageList.index(dic)+1}"+noteList[0]+"\n")
                    f.write(noteList[1]+"\n")
                    f.write(noteList[2]+"\n")
                    f.write(code+"\n")
                    f.write(noteList[3]+"\n")
                    f.write(result+"\n")
                    f.write("*"*50+"\n")
                else:
                    f.write(f"{pageList.index(dic)+1}"+noteList[0]+"\n")
                    f.write(noteList[1]+"\n")
                    f.write(code+"\n")
                    f.write(noteList[2]+"\n")
                    f.write(result+"\n")
                    f.write("*"*50+"\n")
            except Exception as e:
                print(f"Entry {pageList.index(dic)+1} failed: {e}")
    print("Write complete!")

if __name__ == '__main__':
    url = "https://www.runoob.com/python/python-100-examples.html"
    rootUrl = "https://www.runoob.com"
    # Fetch the index page to get the list of problems and their links
    header = {  # pretend to be a browser so the request is not blocked
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=header)
    # Use the detected encoding to avoid garbled text
    response.encoding = response.apparent_encoding
    # Parse the page into a BeautifulSoup object
    soup = BeautifulSoup(response.text, "lxml")
    # Collect the problem links
    uls = soup.find("div", id="content").find_all("ul")
    hreflist = []
    for ul in uls:
        lis = ul.find_all("li")
        for li in lis:
            hreflist.append(li.find("a").get("href"))
    # process_href takes a single tuple, so pack the values for each call together
    arglist = [(href, rootUrl,header) for href in hreflist]
    # Multi-process via the standard-library process pool (concurrent.futures):
    # about 5.7 s on average, fastest run 0:00:04.949686
    with ProcessPoolExecutor(max_workers=16) as executor:

        pageList=executor.map(process_href, arglist)
        wirteInFile(list(pageList))
        time2 = datetime.datetime.now()
        print("End time", "*" * 20, time2.strftime('%Y-%m-%d %H:%M:%S'), "*" * 20)
        print(time2 - time1)
    # Multi-process via multiprocessing.Pool: about 6.7 s on average

    # with Pool(16) as p:
    #     # List that collects the extracted data for every page
    #     pageList =p.map(process_href, arglist)
    # wirteInFile(pageList)
    # time2=datetime.datetime.now()
    # print("End time","*"*20,time2.strftime('%Y-%m-%d %H:%M:%S'),"*"*20)
    # print(time2-time1)

Multi-threaded code

import datetime
import re
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

time1=datetime.datetime.now()
print("*"*20,time1.strftime('%Y-%m-%d %H:%M:%S'),"*"*20)
url="https://www.runoob.com/python/python-100-examples.html"
rootUrl="https://www.runoob.com"
# Fetch the index page to get the list of problems and their links
header={# pretend to be a browser so the request is not blocked
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=header)
# Use the detected encoding to avoid garbled text
response.encoding=response.apparent_encoding
# Parse the page into a BeautifulSoup object
soup = BeautifulSoup(response.text, "lxml")
# Collect the problem links
uls = soup.find("div",id="content").find_all("ul")
hreflist=[]
for ul in uls:
    lis = ul.find_all("li")
    for li in lis:
        hreflist.append(li.find("a").get("href"))


# List that collects the extracted data for every page
pageList=[]
# One unit of work: fetch and parse a single problem page
def process_href(href):
    topicUrl = rootUrl + href
    response = requests.get(topicUrl, headers=header)
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, "lxml")
    # 1. Collect every explanatory paragraph of the problem

    allP = soup.find("div", id="content").find_all("p")
    # Dict holding the sections of this page
    dic = {}
    # List holding the explanatory paragraphs
    noteList = []
    '''
    0 Problem: given the four digits 1, 2, 3, 4, how many distinct three-digit numbers without repeated digits can be formed? What are they?
    1 Analysis: each of the hundreds, tens, and units places can hold 1, 2, 3, or 4. Generate all arrangements, then drop those that break the condition.
    2 Source code:
    3 Output of the example above:
    '''
    for p in allP:
        if not p.find("a"):  # skip the "back to previous page" paragraph
            noteList.append(p.get_text())
    dic["noteList"] = noteList
    # 2. Extract the example code
    dic["code"] = ""  # default, so wirteInFile never hits a missing key
    try:
        code = soup.find("div", id="content").find("div", class_="example")
        if code:
            dic["code"] = code.get_text()
        else:
            pre = soup.find("div", id="content").find("pre")
            dic["code"] = pre.get_text()
    except Exception as e:
        print(f"Problem {hreflist.index(href) + 1}: failed to extract the example code: {e}")
    # 3. Extract the execution result
    try:
        result = soup.find("div", id="content").find("pre")
        if result:
            dic["result"] = result.get_text()
        else:
            imgSrc = soup.find("div", id="content").find("img", attrs={"src": re.compile(r"^//")}).attrs["src"]
            dic["result"] = f"The result is not text; see: https:{imgSrc}"
    except Exception as e:
        print(f"Problem {hreflist.index(href) + 1}: failed to extract the result: {e}")
        dic["result"] = "No execution result!"
    pageList.append(dic)
# pageList is a global list that every thread appends to. In CPython list.append is thread-safe because of the GIL, so no lock is needed and the cost is negligible.
# Pages are appended in completion order, though, so the order in the output file can differ from the order on the index page. Limits on concurrent TCP connections can also affect throughput.

# Write the results to a file
def wirteInFile():

    with open("../Test/python100例题.txt","w+",encoding="utf-8") as f:
    # with open("../Test/python100例题.doc","w+",encoding="utf-8") as f:
        for dic in pageList:
            noteList= dic["noteList"]
            code= dic["code"]
            result= dic["result"]
            try:
                if len(noteList)==4:
                    f.write(f"{pageList.index(dic)+1}"+noteList[0]+"\n")
                    f.write(noteList[1]+"\n")
                    f.write(noteList[2]+"\n")
                    f.write(code+"\n")
                    f.write(noteList[3]+"\n")
                    f.write(result+"\n")
                    f.write("*"*50+"\n")
                else:
                    f.write(f"{pageList.index(dic)+1}"+noteList[0]+"\n")
                    f.write(noteList[1]+"\n")
                    f.write(code+"\n")
                    f.write(noteList[2]+"\n")
                    f.write(result+"\n")
                    f.write("*"*50+"\n")
            except Exception as e:
                print(f"Entry {pageList.index(dic)+1} failed: {e}")
    print("Write complete!")

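if __name__ == "__main__":
    # Multi-threaded run time: 0:00:07.817000
    with ThreadPoolExecutor(max_workers=16) as executor:
        # executor.map returns a lazy iterator; its results are not consumed here,
        # so an exception raised inside a worker goes unnoticed.
        executor.map(process_href, hreflist)

    wirteInFile()
    time2=datetime.datetime.now()
    print("End time","*"*20,time2.strftime('%Y-%m-%d %H:%M:%S'),"*"*20)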
    print(time2-time1)
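
Because the threaded script appends to a shared list, entries land in completion order. If the output should follow the order of the index page, one option is to return the result from the worker and collect the results of `executor.map`, which yields them in input order. The sketch below demonstrates the difference with sleeps in place of real requests; `slow_square` and `run` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

completion_order = []
_lock = threading.Lock()


def slow_square(n):
    time.sleep(0.1 * (3 - n))       # smaller n sleeps longer, so it finishes later
    with _lock:
        completion_order.append(n)  # mirrors appending to a shared pageList
    return n * n


def run():
    completion_order.clear()
    with ThreadPoolExecutor(max_workers=3) as executor:
        # map() yields results in the order of the inputs, not of completion.
        return list(executor.map(slow_square, [0, 1, 2]))


if __name__ == "__main__":
    print("map results:     ", run())
    print("completion order:", completion_order)
```

Consuming the iterator with `list()` also re-raises any exception from a worker in the main thread.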

Asynchronous code

import asyncio
import datetime
import re
import aiohttp
import requests
from bs4 import BeautifulSoup


# One unit of work: fetch and parse a single problem page
async def process_href(href, rootUrl, session):
    topicUrl = rootUrl + href
    async with session.get(topicUrl, headers=header) as res:
        soup = BeautifulSoup(await res.text(), "lxml")
        return await analysishtml(soup, href)


async def analysishtml(soup, href):
    # This coroutine contains no await, so the parsing runs synchronously on the event-loop thread.
    # 1. Collect every explanatory paragraph of the problem
    allP = soup.find("div", id="content").find_all("p")
    # Dict holding the sections of this page
    dic = {}
    # List holding the explanatory paragraphs
    noteList = []
    for p in allP:
        if not p.find("a"):  # skip the "back to previous page" paragraph
            noteList.append(p.get_text())
    dic["noteList"] = noteList
    # 2. Extract the example code
    dic["code"] = ""  # default, so wirteInFile never hits a missing key
    try:
        code = soup.find("div", id="content").find("div", class_="example")
        if code:
            dic["code"] = code.get_text()
        else:
            pre = soup.find("div", id="content").find("pre")
            dic["code"] = pre.get_text()
    except Exception as e:
        print(f"Problem {hreflist.index(href) + 1}: failed to extract the example code: {e}")
    # 3. Extract the execution result
    try:
        result = soup.find("div", id="content").find("pre")
        if result:
            dic["result"] = result.get_text()
        else:
            imgSrc = soup.find("div", id="content").find("img", attrs={"src": re.compile(r"^//")}).attrs["src"]
            dic["result"] = f"The result is not text; see: https:{imgSrc}"
    except Exception as e:
        print(f"Problem {hreflist.index(href) + 1}: failed to extract the result: {e}")
        dic["result"] = "No execution result!"
    return dic


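# Write the results to a file (ordinary blocking file I/O; it runs once, after all requests have finished)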
async def wirteInFile(pageList):
    with open("../Test/python100例题.txt", "w+", encoding="utf-8") as f:
        # with open("../Test/python100例题.doc","w+",encoding="utf-8") as f:
        for dic in pageList:
            noteList = dic["noteList"]
            code = dic["code"]
            result = dic["result"]
            try:
                if len(noteList) == 4:
                    f.write(f"{pageList.index(dic) + 1}" + noteList[0] + "\n")
                    f.write(noteList[1] + "\n")
                    f.write(noteList[2] + "\n")
                    f.write(code + "\n")
                    f.write(noteList[3] + "\n")
                    f.write(result + "\n")
                    f.write("*" * 50 + "\n")
                else:
                    f.write(f"{pageList.index(dic) + 1}" + noteList[0] + "\n")
                    f.write(noteList[1] + "\n")
                    f.write(code + "\n")
                    f.write(noteList[2] + "\n")
                    f.write(result + "\n")
                    f.write("*" * 50 + "\n")
            except Exception as e:
                print(f"Entry {pageList.index(dic) + 1} failed: {e}")
    print("Write complete!")


if __name__ == '__main__':
    url = "https://www.runoob.com/python/python-100-examples.html"
    rootUrl = "https://www.runoob.com"
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=header)
    response.encoding = response.apparent_encoding
    # Parse the page into a BeautifulSoup object
    soup = BeautifulSoup(response.text, "lxml")
    # Collect the problem links
    uls = soup.find("div", id="content").find_all("ul")
    # List of links from the table of contents
    hreflist = []
    for ul in uls:
        lis = ul.find_all("li")
        for li in lis:
            hreflist.append(li.find("a").get("href"))


    async def main():
        # Note: the timer starts here, after the index page above has been fetched;
        # the other three scripts start timing before that request.
        time1 = datetime.datetime.now()
        print("*" * 20, time1.strftime('%Y-%m-%d %H:%M:%S'), "*" * 20)
        async with aiohttp.ClientSession(headers=header) as session:
            worklist = [process_href(href, rootUrl, session) for href in hreflist]
            pageList = await asyncio.gather(*worklist)
        await wirteInFile(pageList)
        time2 = datetime.datetime.now()
        print("End time", "*" * 20, time2.strftime('%Y-%m-%d %H:%M:%S'), "*" * 20)
        print(time2 - time1)


    asyncio.run(main())
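
In the listing above, an exception raised by any single `process_href` call (for example when a page lacks the expected `div`) propagates out of `asyncio.gather`, and nothing is written. One way to keep the pages that succeeded is `return_exceptions=True`. This is a sketch only: `fetch_page` is a hypothetical stand-in that fails for one page, and no real requests are made.

```python
import asyncio


async def fetch_page(i):
    await asyncio.sleep(0)           # stands in for the HTTP request
    if i == 2:
        raise ValueError(f"page {i} failed")
    return {"page": i}


async def crawl(n=4):
    # Exceptions are returned in place of results instead of being raised.
    outcomes = await asyncio.gather(
        *(fetch_page(i) for i in range(n)), return_exceptions=True
    )
    pages = [o for o in outcomes if not isinstance(o, Exception)]
    errors = [o for o in outcomes if isinstance(o, Exception)]
    return pages, errors


if __name__ == "__main__":
    pages, errors = asyncio.run(crawl())
    print(f"{len(pages)} pages ok, {len(errors)} failed: {errors}")
```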

posted @ 2025-03-27 09:15 指尖下的世界
