Countering Website Anti-Scraping in Python: From Basics to Practice

In data collection work, anti-scraping mechanisms are the core problem no crawler developer can avoid. Sites escalate from simple User-Agent checks to CAPTCHAs, dynamic rendering, and behavior detection, so a bare requests call stopped being enough long ago. Drawing on hands-on experience, this article walks through the core ideas, tooling, and pitfalls of countering each class of anti-scraping mechanism in Python, covering static defenses, dynamic defenses, and CAPTCHA solving. All code is ready to reuse.

I. Anti-Scraping Mechanisms: Categories and Counter-Strategies

First, map out the common anti-scraping techniques and the logic for defeating each, so development stays focused:

Anti-scraping type | Typical techniques | Core countermeasure | Key tools
Basic identity checks | UA detection, Referer validation, IP bans | Realistic request headers, IP proxy pool, rate control | requests, fake-useragent, proxy pool
Dynamic data loading | Ajax async loading, JS-encrypted parameters | Analyze the endpoint, reverse the JS, regenerate parameters | requests, execjs, mitmproxy
Dynamic page rendering | Vue/React rendering, JS-generated content | Headless-browser rendering, or call the API directly | Selenium, Playwright, Pyppeteer
CAPTCHA gates | Slider, click-select, image-text CAPTCHAs | Third-party solving services, ML recognition | ddddocr, Chaojiying, 2Captcha
Behavior detection | Mouse trajectories, action intervals, cookie checks | Human-like actions, persistent sessions, randomized behavior | Selenium, Playwright
Advanced anti-bot | Fingerprinting (Canvas/Font), anti-debugging | Spoofed browser fingerprint, disabled debug traps, emulated environment | Playwright, FingerprintSwitcher

II. Basic Anti-Scraping: UA, IP, and Rate Limits

These are the most common defenses and the entry ticket to crawler development; the core idea is to make every request look like it came from a real person.

1. Simulating Normal Request Headers (UA/Referer checks)

Sites use the User-Agent to decide whether a request comes from a bot, and the Referer to decide whether its origin is legitimate; a complete, realistic set of headers defeats both checks.

Example code:

import requests
from fake_useragent import UserAgent

# Initialize the UA generator
ua = UserAgent()

# Build the request headers
headers = {
    "User-Agent": ua.random,  # random, realistic User-Agent
    "Referer": "https://www.target.com/",  # pretend we navigated from within the site
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1"
}

# Send the request
url = "https://www.target.com/data"
response = requests.get(url, headers=headers)
print(response.status_code)

Key points:

  • Never hard-code a single UA; use the fake-useragent library to pick random UAs from different browsers;
  • The Referer must match the URL being requested (e.g. when fetching https://www.target.com/list, set the Referer to https://www.target.com);
  • Install the dependencies: pip install requests fake-useragent
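
The table in Section I also lists cookie checks and session persistence: some sites set cookies on the first page view and reject data requests that arrive without them. Below is a minimal sketch (the URLs are placeholders) of reusing one requests.Session so cookies and default headers carry over automatically:

import requests
from fake_useragent import UserAgent

# A Session persists cookies (and reuses connections) across requests,
# which is what a real browser does.
session = requests.Session()
session.headers.update({
    "User-Agent": UserAgent().random,
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
})

# Visit the homepage first so the server can set its cookies...
session.get("https://www.target.com/")  # placeholder URL
# ...then the data request carries those cookies automatically.
resp = session.get("https://www.target.com/data")
print(resp.status_code, session.cookies.get_dict())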

2. IP Proxy Pools (beating IP bans)

High-frequency requests from a single IP get blacklisted, so build an IP proxy pool and switch IPs between requests.

Example code (free proxies, for illustration only):

import random

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}

# Proxy pool (real projects should use paid proxies, e.g. Abuyun or Kuaidaili;
# note both keys use an http:// scheme -- the proxy itself speaks plain HTTP)
proxies_pool = [
    {"http": "http://112.114.34.22:8080", "https": "http://112.114.34.22:8080"},
    {"http": "http://183.237.214.10:9999", "https": "http://183.237.214.10:9999"},
]

# Pick a proxy at random
proxy = random.choice(proxies_pool)

# Send the request (with a timeout and exception handling)
try:
    response = requests.get(
        url="https://www.target.com/data",
        headers=headers,
        proxies=proxy,
        timeout=10  # timeout so a dead proxy can't hang the crawler
    )
    print(f"Request OK via proxy: {proxy}")
except requests.exceptions.ProxyError:
    print(f"Proxy failed: {proxy}")
except requests.exceptions.Timeout:
    print("Request timed out")

Pitfalls to avoid:

  • Free proxies are unstable; in production, prefer paid proxies (choose high-anonymity ones as needed);
  • Re-check proxy health periodically and evict dead IPs (see the sketch below);
  • Don't rotate IPs frantically; let the same IP pause 5-10 seconds between requests.
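
For the health check in the second point, a minimal sketch that keeps only responsive proxies, assuming httpbin.org/ip as a neutral echo endpoint (swap in any URL you control):

import requests

def check_proxies(proxies_pool, test_url="https://httpbin.org/ip", timeout=5):
    """Return only the proxies that answer within the timeout."""
    alive = []
    for proxy in proxies_pool:
        try:
            r = requests.get(test_url, proxies=proxy, timeout=timeout)
            if r.ok:
                alive.append(proxy)
        except requests.RequestException:
            pass  # dead or too slow -- evict it
    return alive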

3. Request Rate Control (beating frequency detection)

Sites infer bots from request intervals, so mimic the pacing of a human user.

Example code:

import random
import time

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
url_list = ["https://www.target.com/page1", "https://www.target.com/page2", "https://www.target.com/page3"]

for url in url_list:
    # Random delay between requests (2-5 seconds)
    sleep_time = random.uniform(2, 5)
    time.sleep(sleep_time)

    # Send the request
    response = requests.get(url, headers=headers)
    print(f"Fetched {url}, waited {sleep_time:.2f}s")

Further refinements:

  • Use time.sleep(random.uniform(1, 3)) instead of a fixed interval;
  • Give different page types different interval ranges (e.g. 2-3 s for list pages, 5-8 s for detail pages; sketch below);
  • Avoid hammering many pages under the same directory in a short burst.
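
A sketch of the per-page-type pacing from the second point (the ranges are illustrative, not tuned to any real site):

import random
import time

# Interval ranges in seconds, per page type -- tune these per target site
INTERVALS = {"list": (2, 3), "detail": (5, 8)}

def polite_sleep(page_type):
    """Sleep a random interval appropriate for the page type."""
    low, high = INTERVALS.get(page_type, (2, 5))
    time.sleep(random.uniform(low, high))

polite_sleep("list")    # before a list page
polite_sleep("detail")  # before a detail page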

III. Dynamic Data: Defeating Ajax and JS-Encrypted Parameters

Many sites load their core data via Ajax, with request parameters (such as sign or token) generated by encryption routines in JS; you must reverse-engineer that JS and reproduce the parameter generation.

1. Analyzing the Ajax Endpoint (capturing with F12)

Taking an e-commerce site as the example, the steps are:

  1. Open Chrome DevTools (F12) and switch to the Network tab;
  2. Reload the page or trigger the data load, then filter for XHR/Fetch requests;
  3. Find the request that returns the target data and inspect its Request URL, Request Method, Request Headers, and Form Data;
  4. Copy the endpoint URL and request it directly (no need to scrape the full page).

Example code (calling the Ajax endpoint):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://www.target.com/",
    "X-Requested-With": "XMLHttpRequest"  # marks the request as Ajax
}

# Ajax endpoint (copied from the DevTools Network tab)
api_url = "https://www.target.com/api/data/list"

# Request parameters (copied from the Form Data panel)
params = {
    "page": 1,
    "size": 20,
    "category": "electronics"
}

# Send a POST request (use GET or POST to match the endpoint)
response = requests.post(api_url, headers=headers, data=params)
data = response.json()  # the endpoint returns JSON directly
print(data["list"])  # extract the target data

2. Reversing JS-Encrypted Parameters (sign/token)

When a request parameter is encrypted (e.g. sign), reverse-engineer the JS code that produces it:

  1. Search DevTools for the parameter name (e.g. sign) to locate the JS function that generates it;
  2. Copy that JS locally and execute it with the execjs library;
  3. Generate the encrypted parameter and attach it to your request.

Example code (running JS with execjs):

import time

import execjs
import requests

# 1. Load the encryption JS (recovered from the site)
with open("encrypt.js", "r", encoding="utf-8") as f:
    js_code = f.read()

# 2. Compile it
ctx = execjs.compile(js_code)

# 3. Call the JS function to produce the encrypted parameter
timestamp = str(int(time.time() * 1000))
sign = ctx.call("generateSign", timestamp, "api_key123")  # pass arguments into the JS function

# 4. Build the request parameters
params = {
    "page": 1,
    "size": 20,
    "timestamp": timestamp,
    "sign": sign
}

# 5. Send the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
response = requests.get("https://www.target.com/api/data", headers=headers, params=params)
print(response.json())

Key points:

  • Install the dependency with pip install PyExecJS (imported as execjs); a JavaScript runtime such as Node.js must also be installed;
  • For complex JS, capture traffic with mitmproxy to trace how the parameter is built (addon sketch below);
  • Don't work from minified JS directly; pretty-print it first (the {} button in Chrome DevTools).
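
For the mitmproxy route, a minimal addon sketch that logs every request carrying the parameter you are tracing; the field name "sign" here is an assumption, so substitute whichever parameter you are hunting:

# sign_sniffer.py -- run with: mitmdump -s sign_sniffer.py
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # Flag any request whose query string or body mentions "sign"
    if "sign" in flow.request.query or b"sign" in (flow.request.content or b""):
        print(flow.request.method, flow.request.pretty_url)
        print(dict(flow.request.query))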

IV. Dynamically Rendered Pages: Vue/React

Pages rendered by frameworks like Vue or React carry no real data in their HTML source; render them with a headless browser or call their backing APIs directly.

1. Playwright (recommended): a modern headless-browser toolkit

Playwright is Microsoft's automation framework. It drives Chromium, Firefox, and WebKit, is more stable than Selenium, and is less likely to trip anti-bot detection.

Example code (rendering a dynamic page):

import time

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch the browser (headless=True for headless mode)
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
    page = context.new_page()

    # Open the page and wait for it to finish loading
    page.goto("https://www.target.com/dynamic-page")
    page.wait_for_load_state("networkidle")  # wait until network traffic settles

    # Extract data via CSS selectors
    items = page.locator(".product-item").all_text_contents()
    for item in items:
        print(item)

    # Simulate pagination
    page.click(".next-page")
    time.sleep(2)  # give the next page time to render
    next_items = page.locator(".product-item").all_text_contents()
    print(next_items)

    browser.close()

Key points:

  • Install Playwright with pip install playwright, then run playwright install chromium;
  • Use wait_for_load_state or wait_for_selector to wait for elements, or you will scrape empty data (sketch below);
  • Act like a human: add random clicks, scrolling, and pauses instead of mechanical loops.
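
A minimal wait_for_selector sketch (the URL and selector are placeholders), blocking on a concrete element instead of sleeping a fixed time:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.target.com/dynamic-page")  # placeholder URL
    # Block until at least one product card is rendered (up to 10 s)
    page.wait_for_selector(".product-item", timeout=10_000)
    print(page.locator(".product-item").count())
    browser.close()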

2. Selenium by Comparison: Fine for Simple Cases

Selenium is the veteran automation tool, easy to pick up and beginner-friendly:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Configure Chrome options
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # headless mode
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
options.add_argument("--disable-blink-features=AutomationControlled")  # hide the automation flag

# Launch the browser
driver = webdriver.Chrome(options=options)
driver.get("https://www.target.com/dynamic-page")

# Wait for the elements to load (explicit wait)
wait = WebDriverWait(driver, 10)
items = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "product-item")))

# Extract the data
for item in items:
    print(item.text)

driver.quit()

V. CAPTCHAs: Slider, Click-Select, and Image-Text Challenges

CAPTCHAs are a major anti-scraping defense; for individual developers, third-party solving services usually beat building your own recognizer.

1. Image-Text CAPTCHAs (ddddocr)

ddddocr is an open-source CAPTCHA recognition library that handles common image-text CAPTCHAs:

import ddddocr
import requests

# One Session so the CAPTCHA and the login share the same cookies --
# otherwise the server pairs the login with a different, fresh CAPTCHA.
session = requests.Session()

# Download the CAPTCHA image
captcha_url = "https://www.target.com/captcha.jpg"
response = session.get(captcha_url)
with open("captcha.jpg", "wb") as f:
    f.write(response.content)

# Recognize it
ocr = ddddocr.DdddOcr()
with open("captcha.jpg", "rb") as f:
    img_bytes = f.read()
captcha_code = ocr.classification(img_bytes)
print(f"Recognized CAPTCHA: {captcha_code}")

# Submit it
data = {
    "username": "test",
    "password": "123456",
    "captcha": captcha_code
}
response = session.post("https://www.target.com/login", data=data)
print(response.text)

2. Slider CAPTCHAs (the Chaojiying solving service)

Complex slider CAPTCHAs call for a paid solving service; using Chaojiying (超级鹰) as the example:

import base64

import requests

# 1. Download the slider CAPTCHA images
bg_url = "https://www.target.com/bg.jpg"  # background image
slide_url = "https://www.target.com/slide.png"  # slider piece
bg_response = requests.get(bg_url)
slide_response = requests.get(slide_url)
with open("bg.jpg", "wb") as f:
    f.write(bg_response.content)
with open("slide.png", "wb") as f:
    f.write(slide_response.content)

# 2. Base64-encode the images (as the solving service expects)
def img_to_base64(img_path):
    with open(img_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

bg_base64 = img_to_base64("bg.jpg")
slide_base64 = img_to_base64("slide.png")

# 3. Call the Chaojiying API to locate the gap
#    (field names and codetype vary by task type -- check Chaojiying's docs)
chaojiying_url = "http://upload.chaojiying.net/Upload/Processing.php"
params = {
    "user": "your_username",
    "pass": "your_password",
    "softid": "999999",
    "codetype": "9004",  # slider/coordinate task type
    "image": bg_base64,
    "slide": slide_base64
}
response = requests.post(chaojiying_url, data=params)
result = response.json()
print(f"Slider position: {result['pic_str']}")
# pic_str typically holds coordinates like "x,y"; take the x offset
offset_x = int(result['pic_str'].split(',')[0])

# 4. Perform the drag (with Playwright)
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.target.com/login")
    # Drag the handle to the computed offset inside the track
    slide = page.locator(".slide-btn")
    slide.drag_to(page.locator(".slide-bg"), target_position={"x": offset_x, "y": 0})
    browser.close()

VI. Advanced Anti-Bot: Fingerprinting and Behavior Detection

1. Modifying the Browser Fingerprint (Playwright)

Sites identify automation tools through Canvas, font, and similar fingerprints, so the fingerprint must be altered:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(
        # Present a consistent environment (locale, timezone, geolocation)
        locale="zh-CN",
        timezone_id="Asia/Shanghai",
        geolocation={"longitude": 116.403963, "latitude": 39.915119},
        permissions=["geolocation"]
    )
    page = context.new_page()

    # Inject JS to override the Canvas fingerprint
    page.add_init_script("""
        // Overwrite toDataURL so Canvas fingerprinting sees a fixed image
        HTMLCanvasElement.prototype.toDataURL = function() {
            return "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVQImWNgYGBgAAAABQABh6FO1AAAAABJRU5ErkJggg==";
        };
    """)

    page.goto("https://www.target.com/fingerprint-test")
    browser.close()
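
Besides Canvas, many detectors simply read navigator.webdriver, which headless Chromium reports as true. A commonly used (though not foolproof) override, sketched here with a placeholder URL:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Hide the automation flag before any page script runs
    page.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
    )
    page.goto("https://www.target.com/fingerprint-test")  # placeholder URL
    print(page.evaluate("navigator.webdriver"))  # now prints None
    browser.close()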

2. Simulating Human Behavior (avoiding robotic patterns)

import random
import time

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.target.com")

    # Simulate mouse scrolling
    page.mouse.wheel(0, random.randint(100, 500))
    time.sleep(random.uniform(0.5, 1.5))

    # Simulate a random click
    page.mouse.click(random.randint(100, 200), random.randint(200, 300))
    time.sleep(random.uniform(1, 2))

    # Type character by character, like a human
    input_box = page.locator("#search-input")
    text = "test keyword"
    for char in text:
        input_box.type(char)
        time.sleep(random.uniform(0.1, 0.3))

    browser.close()

VII. Ground Rules and Pitfalls

  1. Obey the law: collect only public data, never private information, and respect the site's robots.txt;
  2. Don't over-crawl: throttle your request rate so you don't strain the site's servers;
  3. Prefer official APIs: if the site offers an open API, call it instead of scraping;
  4. Adapt continuously: anti-scraping defenses evolve, so retest and adjust your crawler regularly;
  5. Go distributed at scale: large crawls need a distributed architecture (e.g. Scrapy + Scrapy-Redis) to spread the load;
  6. Handle failures: wrap requests in solid exception handling so one failure doesn't kill the crawl (retry sketch below).
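
For point 6, a sketch of a GET helper with exponential backoff plus jitter (the parameters are illustrative):

import random
import time

import requests

def fetch_with_retry(url, headers=None, retries=3, backoff=2.0):
    """GET with exponential backoff; return None instead of crashing the crawl."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp
        except requests.RequestException as e:
            wait = backoff ** attempt + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait:.1f}s")
            time.sleep(wait)
    return None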

VIII. Hands-On Project: A Crawler That Combines the Countermeasures

The following example combines several of the countermeasures above to crawl product data from an e-commerce site:

import random
import time

import requests
from fake_useragent import UserAgent
from playwright.sync_api import sync_playwright

# Initialize helpers
ua = UserAgent()
proxies_pool = [
    {"http": "http://112.114.34.22:8080", "https": "http://112.114.34.22:8080"},
]

def get_proxy():
    """Pick a random proxy."""
    return random.choice(proxies_pool)

def get_headers():
    """Build request headers."""
    return {
        "User-Agent": ua.random,
        "Referer": "https://www.target.com/",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    }

def generate_sign(timestamp):
    """Stub for the sign algorithm recovered by reversing the site's JS
    (see Section III); replace this with the real implementation."""
    return "replace-with-real-sign"

def crawl_list(page_num):
    """Fetch the product list (Ajax endpoint)."""
    url = "https://www.target.com/api/product/list"
    params = {
        "page": page_num,
        "size": 20,
        "timestamp": str(int(time.time() * 1000)),
    }
    # Encrypted sign (implemented after reversing the JS)
    params["sign"] = generate_sign(params["timestamp"])

    headers = get_headers()
    proxy = get_proxy()

    try:
        response = requests.get(
            url,
            headers=headers,
            params=params,
            proxies=proxy,
            timeout=10
        )
        data = response.json()
        return data["data"]["list"]
    except Exception as e:
        print(f"Failed to fetch page {page_num}: {e}")
        return []

def crawl_detail(product_id):
    """Fetch product details (a dynamically rendered page)."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(user_agent=ua.random)
        page.goto(f"https://www.target.com/product/{product_id}")
        page.wait_for_load_state("networkidle")

        # Extract the detail fields
        title = page.locator(".product-title").text_content()
        price = page.locator(".product-price").text_content()
        desc = page.locator(".product-desc").text_content()

        browser.close()
        return {"id": product_id, "title": title, "price": price, "desc": desc}

# Main program
if __name__ == "__main__":
    all_products = []
    for page in range(1, 5):
        # Fetch the list
        list_data = crawl_list(page)
        if not list_data:
            continue
        all_products.extend(list_data)

        # Fetch each detail page
        for product in list_data:
            product_id = product["id"]
            detail_data = crawl_detail(product_id)
            print(detail_data)

            # Random delay
            time.sleep(random.uniform(3, 5))

        # Delay between pages
        time.sleep(random.uniform(5, 8))

    print(f"Crawled {len(all_products)} products in total")