Countering Website Anti-Scraping in Python: From Basics to Practice

In data collection work, anti-scraping mechanisms are the core problem no crawler developer can avoid. Sites escalate from simple User-Agent checks to CAPTCHAs, dynamic rendering, and behavior detection, so a bare requests call stopped being enough long ago. Drawing on hands-on experience, this article walks through the core ideas, tooling, and pitfalls of countering each class of anti-scraping mechanism in Python, covering static defenses, dynamic defenses, and CAPTCHA solving. All code is ready to reuse.

I. Anti-Scraping Mechanisms: Categories and Counter-Strategies

First, map out the common anti-scraping techniques and the logic for defeating each, so development stays focused:

Anti-scraping type | Typical techniques | Core countermeasure | Key tools
Basic identity checks | UA detection, Referer validation, IP bans | Realistic request headers, IP proxy pool, rate control | requests, fake-useragent, proxy pool
Dynamic data loading | Ajax async loading, JS-encrypted parameters | Analyze the endpoint, reverse the JS, regenerate parameters | requests, execjs, mitmproxy
Dynamic page rendering | Vue/React rendering, JS-generated content | Headless-browser rendering, or call the API directly | Selenium, Playwright, Pyppeteer
CAPTCHA gates | Slider, click-select, image-text CAPTCHAs | Third-party solving services, ML recognition | ddddocr, Chaojiying, 2Captcha
Behavior detection | Mouse trajectories, action intervals, cookie checks | Human-like actions, persistent sessions, randomized behavior | Selenium, Playwright
Advanced anti-bot | Fingerprinting (Canvas/Font), anti-debugging | Spoofed browser fingerprint, disabled debug traps, emulated environment | Playwright, FingerprintSwitcher

II. Basic Anti-Scraping: UA, IP, and Rate Limits

These are the most common defenses and the entry ticket to crawler development; the core idea is to make every request look like it came from a real person.

1. Simulating Normal Request Headers (UA/Referer checks)

Sites use the User-Agent to decide whether a request comes from a bot, and the Referer to decide whether its origin is legitimate; a complete, realistic set of headers defeats both checks.

Example code:

import requests
from fake_useragent import UserAgent

# Initialize the UA generator
ua = UserAgent()

# Build the request headers
headers = {
    "User-Agent": ua.random,  # random, realistic User-Agent
    "Referer": "https://www.target.com/",  # pretend we navigated from within the site
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1"
}

# Send the request
url = "https://www.target.com/data"
response = requests.get(url, headers=headers)
print(response.status_code)

Key points:

  • Never hard-code a single UA; use the fake-useragent library to pick random UAs from different browsers;
  • The Referer must match the URL being requested (e.g. when fetching https://www.target.com/list, set the Referer to https://www.target.com);
  • Install the dependencies: pip install requests fake-useragent
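
The table in Section I also lists cookie checks and session persistence: some sites set cookies on the first page view and reject data requests that arrive without them. Below is a minimal sketch (the URLs are placeholders) of reusing one requests.Session so cookies and default headers carry over automatically:

import requests
from fake_useragent import UserAgent

# A Session persists cookies (and reuses connections) across requests,
# which is what a real browser does.
session = requests.Session()
session.headers.update({
    "User-Agent": UserAgent().random,
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
})

# Visit the homepage first so the server can set its cookies...
session.get("https://www.target.com/")  # placeholder URL
# ...then the data request carries those cookies automatically.
resp = session.get("https://www.target.com/data")
print(resp.status_code, session.cookies.get_dict())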

2. IP Proxy Pools (beating IP bans)

High-frequency requests from a single IP get blacklisted, so build an IP proxy pool and switch IPs between requests.

Example code (free proxies, for illustration only):

import random

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}

# Proxy pool (real projects should use paid proxies, e.g. Abuyun or Kuaidaili;
# note both keys use an http:// scheme -- the proxy itself speaks plain HTTP)
proxies_pool = [
    {"http": "http://112.114.34.22:8080", "https": "http://112.114.34.22:8080"},
    {"http": "http://183.237.214.10:9999", "https": "http://183.237.214.10:9999"},
]

# Pick a proxy at random
proxy = random.choice(proxies_pool)

# Send the request (with a timeout and exception handling)
try:
    response = requests.get(
        url="https://www.target.com/data",
        headers=headers,
        proxies=proxy,
        timeout=10  # timeout so a dead proxy can't hang the crawler
    )
    print(f"Request OK via proxy: {proxy}")
except requests.exceptions.ProxyError:
    print(f"Proxy failed: {proxy}")
except requests.exceptions.Timeout:
    print("Request timed out")

Pitfalls to avoid:

  • Free proxies are unstable; in production, prefer paid proxies (choose high-anonymity ones as needed);
  • Re-check proxy health periodically and evict dead IPs (see the sketch below);
  • Don't rotate IPs frantically; let the same IP pause 5-10 seconds between requests.
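
For the health check in the second point, a minimal sketch that keeps only responsive proxies, assuming httpbin.org/ip as a neutral echo endpoint (swap in any URL you control):

import requests

def check_proxies(proxies_pool, test_url="https://httpbin.org/ip", timeout=5):
    """Return only the proxies that answer within the timeout."""
    alive = []
    for proxy in proxies_pool:
        try:
            r = requests.get(test_url, proxies=proxy, timeout=timeout)
            if r.ok:
                alive.append(proxy)
        except requests.RequestException:
            pass  # dead or too slow -- evict it
    return alive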

3. Request Rate Control (beating frequency detection)

Sites infer bots from request intervals, so mimic the pacing of a human user.

Example code:

import random
import time

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
url_list = ["https://www.target.com/page1", "https://www.target.com/page2", "https://www.target.com/page3"]

for url in url_list:
    # Random delay between requests (2-5 seconds)
    sleep_time = random.uniform(2, 5)
    time.sleep(sleep_time)

    # Send the request
    response = requests.get(url, headers=headers)
    print(f"Fetched {url}, waited {sleep_time:.2f}s")

Further refinements:

  • Use time.sleep(random.uniform(1, 3)) instead of a fixed interval;
  • Give different page types different interval ranges (e.g. 2-3 s for list pages, 5-8 s for detail pages; sketch below);
  • Avoid hammering many pages under the same directory in a short burst.
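
A sketch of the per-page-type pacing from the second point (the ranges are illustrative, not tuned to any real site):

import random
import time

# Interval ranges in seconds, per page type -- tune these per target site
INTERVALS = {"list": (2, 3), "detail": (5, 8)}

def polite_sleep(page_type):
    """Sleep a random interval appropriate for the page type."""
    low, high = INTERVALS.get(page_type, (2, 5))
    time.sleep(random.uniform(low, high))

polite_sleep("list")    # before a list page
polite_sleep("detail")  # before a detail page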

III. Dynamic Data: Defeating Ajax and JS-Encrypted Parameters

Many sites load their core data via Ajax, with request parameters (such as sign or token) generated by encryption routines in JS; you must reverse-engineer that JS and reproduce the parameter generation.

1. Analyzing the Ajax Endpoint (capturing with F12)

Taking an e-commerce site as the example, the steps are:

  1. Open Chrome DevTools (F12) and switch to the Network tab;
  2. Reload the page or trigger the data load, then filter for XHR/Fetch requests;
  3. Find the request that returns the target data and inspect its Request URL, Request Method, Request Headers, and Form Data;
  4. Copy the endpoint URL and request it directly (no need to scrape the full page).

Example code (calling the Ajax endpoint):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Referer": "https://www.target.com/",
    "X-Requested-With": "XMLHttpRequest"  # marks the request as Ajax
}

# Ajax endpoint (copied from the DevTools Network tab)
api_url = "https://www.target.com/api/data/list"

# Request parameters (copied from the Form Data panel)
params = {
    "page": 1,
    "size": 20,
    "category": "electronics"
}

# Send a POST request (use GET or POST to match the endpoint)
response = requests.post(api_url, headers=headers, data=params)
data = response.json()  # the endpoint returns JSON directly
print(data["list"])  # extract the target data

2. Reversing JS-Encrypted Parameters (sign/token)

When a request parameter is encrypted (e.g. sign), reverse-engineer the JS code that produces it:

  1. Search DevTools for the parameter name (e.g. sign) to locate the JS function that generates it;
  2. Copy that JS locally and execute it with the execjs library;
  3. Generate the encrypted parameter and attach it to your request.

Example code (running JS with execjs):

import time

import execjs
import requests

# 1. Load the encryption JS (recovered from the site)
with open("encrypt.js", "r", encoding="utf-8") as f:
    js_code = f.read()

# 2. Compile it
ctx = execjs.compile(js_code)

# 3. Call the JS function to produce the encrypted parameter
timestamp = str(int(time.time() * 1000))
sign = ctx.call("generateSign", timestamp, "api_key123")  # pass arguments into the JS function

# 4. Build the request parameters
params = {
    "page": 1,
    "size": 20,
    "timestamp": timestamp,
    "sign": sign
}

# 5. Send the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
response = requests.get("https://www.target.com/api/data", headers=headers, params=params)
print(response.json())

Key points:

  • Install the dependency with pip install PyExecJS (imported as execjs); a JavaScript runtime such as Node.js must also be installed;
  • For complex JS, capture traffic with mitmproxy to trace how the parameter is built (addon sketch below);
  • Don't work from minified JS directly; pretty-print it first (the {} button in Chrome DevTools).
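
For the mitmproxy route, a minimal addon sketch that logs every request carrying the parameter you are tracing; the field name "sign" here is an assumption, so substitute whichever parameter you are hunting:

# sign_sniffer.py -- run with: mitmdump -s sign_sniffer.py
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # Flag any request whose query string or body mentions "sign"
    if "sign" in flow.request.query or b"sign" in (flow.request.content or b""):
        print(flow.request.method, flow.request.pretty_url)
        print(dict(flow.request.query))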

IV. Dynamically Rendered Pages: Vue/React

Pages rendered by frameworks like Vue or React carry no real data in their HTML source; render them with a headless browser or call their backing APIs directly.

1. Playwright (recommended): a modern headless-browser toolkit

Playwright is Microsoft's automation framework. It drives Chromium, Firefox, and WebKit, is more stable than Selenium, and is less likely to trip anti-bot detection.

Example code (rendering a dynamic page):

import time

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch the browser (headless=True for headless mode)
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
    page = context.new_page()

    # Open the page and wait for it to finish loading
    page.goto("https://www.target.com/dynamic-page")
    page.wait_for_load_state("networkidle")  # wait until network traffic settles

    # Extract data via CSS selectors
    items = page.locator(".product-item").all_text_contents()
    for item in items:
        print(item)

    # Simulate pagination
    page.click(".next-page")
    time.sleep(2)  # give the next page time to render
    next_items = page.locator(".product-item").all_text_contents()
    print(next_items)

    browser.close()

Key points:

  • Install Playwright with pip install playwright, then run playwright install chromium;
  • Use wait_for_load_state or wait_for_selector to wait for elements, or you will scrape empty data (sketch below);
  • Act like a human: add random clicks, scrolling, and pauses instead of mechanical loops.
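
A minimal wait_for_selector sketch (the URL and selector are placeholders), blocking on a concrete element instead of sleeping a fixed time:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.target.com/dynamic-page")  # placeholder URL
    # Block until at least one product card is rendered (up to 10 s)
    page.wait_for_selector(".product-item", timeout=10_000)
    print(page.locator(".product-item").count())
    browser.close()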

2. Selenium by Comparison: Fine for Simple Cases

Selenium is the veteran automation tool, easy to pick up and beginner-friendly:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Configure Chrome options
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # headless mode
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
options.add_argument("--disable-blink-features=AutomationControlled")  # hide the automation flag

# Launch the browser
driver = webdriver.Chrome(options=options)
driver.get("https://www.target.com/dynamic-page")

# Wait for the elements to load (explicit wait)
wait = WebDriverWait(driver, 10)
items = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "product-item")))

# Extract the data
for item in items:
    print(item.text)

driver.quit()

V. CAPTCHAs: Slider, Click-Select, and Image-Text Challenges

CAPTCHAs are a major anti-scraping defense; for individual developers, third-party solving services usually beat building your own recognizer.

1. Image-Text CAPTCHAs (ddddocr)

ddddocr is an open-source CAPTCHA recognition library that handles common image-text CAPTCHAs:

import ddddocr
import requests

# One Session so the CAPTCHA and the login share the same cookies --
# otherwise the server pairs the login with a different, fresh CAPTCHA.
session = requests.Session()

# Download the CAPTCHA image
captcha_url = "https://www.target.com/captcha.jpg"
response = session.get(captcha_url)
with open("captcha.jpg", "wb") as f:
    f.write(response.content)

# Recognize it
ocr = ddddocr.DdddOcr()
with open("captcha.jpg", "rb") as f:
    img_bytes = f.read()
captcha_code = ocr.classification(img_bytes)
print(f"Recognized CAPTCHA: {captcha_code}")

# Submit it
data = {
    "username": "test",
    "password": "123456",
    "captcha": captcha_code
}
response = session.post("https://www.target.com/login", data=data)
print(response.text)

2. Slider CAPTCHAs (the Chaojiying solving service)

Complex slider CAPTCHAs call for a paid solving service; using Chaojiying (超级鹰) as the example:

import base64

import requests

# 1. Download the slider CAPTCHA images
bg_url = "https://www.target.com/bg.jpg"  # background image
slide_url = "https://www.target.com/slide.png"  # slider piece
bg_response = requests.get(bg_url)
slide_response = requests.get(slide_url)
with open("bg.jpg", "wb") as f:
    f.write(bg_response.content)
with open("slide.png", "wb") as f:
    f.write(slide_response.content)

# 2. Base64-encode the images (as the solving service expects)
def img_to_base64(img_path):
    with open(img_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

bg_base64 = img_to_base64("bg.jpg")
slide_base64 = img_to_base64("slide.png")

# 3. Call the Chaojiying API to locate the gap
#    (field names and codetype vary by task type -- check Chaojiying's docs)
chaojiying_url = "http://upload.chaojiying.net/Upload/Processing.php"
params = {
    "user": "your_username",
    "pass": "your_password",
    "softid": "999999",
    "codetype": "9004",  # slider/coordinate task type
    "image": bg_base64,
    "slide": slide_base64
}
response = requests.post(chaojiying_url, data=params)
result = response.json()
print(f"Slider position: {result['pic_str']}")
# pic_str typically holds coordinates like "x,y"; take the x offset
offset_x = int(result['pic_str'].split(',')[0])

# 4. Perform the drag (with Playwright)
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.target.com/login")
    # Drag the handle to the computed offset inside the track
    slide = page.locator(".slide-btn")
    slide.drag_to(page.locator(".slide-bg"), target_position={"x": offset_x, "y": 0})
    browser.close()

VI. Advanced Anti-Bot: Fingerprinting and Behavior Detection

1. Modifying the Browser Fingerprint (Playwright)

Sites identify automation tools through Canvas, font, and similar fingerprints, so the fingerprint must be altered:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(
        # Present a consistent environment (locale, timezone, geolocation)
        locale="zh-CN",
        timezone_id="Asia/Shanghai",
        geolocation={"longitude": 116.403963, "latitude": 39.915119},
        permissions=["geolocation"]
    )
    page = context.new_page()

    # Inject JS to override the Canvas fingerprint
    page.add_init_script("""
        // Overwrite toDataURL so Canvas fingerprinting sees a fixed image
        HTMLCanvasElement.prototype.toDataURL = function() {
            return "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVQImWNgYGBgAAAABQABh6FO1AAAAABJRU5ErkJggg==";
        };
    """)

    page.goto("https://www.target.com/fingerprint-test")
    browser.close()
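
Besides Canvas, many detectors simply read navigator.webdriver, which headless Chromium reports as true. A commonly used (though not foolproof) override, sketched here with a placeholder URL:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Hide the automation flag before any page script runs
    page.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
    )
    page.goto("https://www.target.com/fingerprint-test")  # placeholder URL
    print(page.evaluate("navigator.webdriver"))  # now prints None
    browser.close()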

2. Simulating Human Behavior (avoiding robotic patterns)

import random
import time

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.target.com")

    # Simulate mouse scrolling
    page.mouse.wheel(0, random.randint(100, 500))
    time.sleep(random.uniform(0.5, 1.5))

    # Simulate a random click
    page.mouse.click(random.randint(100, 200), random.randint(200, 300))
    time.sleep(random.uniform(1, 2))

    # Type character by character, like a human
    input_box = page.locator("#search-input")
    text = "test keyword"
    for char in text:
        input_box.type(char)
        time.sleep(random.uniform(0.1, 0.3))

    browser.close()

VII. Ground Rules and Pitfalls

  1. Obey the law: collect only public data, never private information, and respect the site's robots.txt;
  2. Don't over-crawl: throttle your request rate so you don't strain the site's servers;
  3. Prefer official APIs: if the site offers an open API, call it instead of scraping;
  4. Adapt continuously: anti-scraping defenses evolve, so retest and adjust your crawler regularly;
  5. Go distributed at scale: large crawls need a distributed architecture (e.g. Scrapy + Scrapy-Redis) to spread the load;
  6. Handle failures: wrap requests in solid exception handling so one failure doesn't kill the crawl (retry sketch below).
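
For point 6, a sketch of a GET helper with exponential backoff plus jitter (the parameters are illustrative):

import random
import time

import requests

def fetch_with_retry(url, headers=None, retries=3, backoff=2.0):
    """GET with exponential backoff; return None instead of crashing the crawl."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp
        except requests.RequestException as e:
            wait = backoff ** attempt + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait:.1f}s")
            time.sleep(wait)
    return None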

VIII. Hands-On Project: A Crawler That Combines the Countermeasures

The following example combines several of the countermeasures above to crawl product data from an e-commerce site:

import random
import time

import requests
from fake_useragent import UserAgent
from playwright.sync_api import sync_playwright

# Initialize helpers
ua = UserAgent()
proxies_pool = [
    {"http": "http://112.114.34.22:8080", "https": "http://112.114.34.22:8080"},
]

def get_proxy():
    """Pick a random proxy."""
    return random.choice(proxies_pool)

def get_headers():
    """Build request headers."""
    return {
        "User-Agent": ua.random,
        "Referer": "https://www.target.com/",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    }

def generate_sign(timestamp):
    """Stub for the sign algorithm recovered by reversing the site's JS
    (see Section III); replace this with the real implementation."""
    return "replace-with-real-sign"

def crawl_list(page_num):
    """Fetch the product list (Ajax endpoint)."""
    url = "https://www.target.com/api/product/list"
    params = {
        "page": page_num,
        "size": 20,
        "timestamp": str(int(time.time() * 1000)),
    }
    # Encrypted sign (implemented after reversing the JS)
    params["sign"] = generate_sign(params["timestamp"])

    headers = get_headers()
    proxy = get_proxy()

    try:
        response = requests.get(
            url,
            headers=headers,
            params=params,
            proxies=proxy,
            timeout=10
        )
        data = response.json()
        return data["data"]["list"]
    except Exception as e:
        print(f"Failed to fetch page {page_num}: {e}")
        return []

def crawl_detail(product_id):
    """Fetch product details (a dynamically rendered page)."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(user_agent=ua.random)
        page.goto(f"https://www.target.com/product/{product_id}")
        page.wait_for_load_state("networkidle")

        # Extract the detail fields
        title = page.locator(".product-title").text_content()
        price = page.locator(".product-price").text_content()
        desc = page.locator(".product-desc").text_content()

        browser.close()
        return {"id": product_id, "title": title, "price": price, "desc": desc}

# Main program
if __name__ == "__main__":
    all_products = []
    for page in range(1, 5):
        # Fetch the list
        list_data = crawl_list(page)
        if not list_data:
            continue
        all_products.extend(list_data)

        # Fetch each detail page
        for product in list_data:
            product_id = product["id"]
            detail_data = crawl_detail(product_id)
            print(detail_data)

            # Random delay
            time.sleep(random.uniform(3, 5))

        # Delay between pages
        time.sleep(random.uniform(5, 8))

    print(f"Crawled {len(all_products)} products in total")