1. Why scrape the comments yourself?
- Product selection: the livestream team needs 5k+ fresh comments per day for sentiment analysis, filtering pain points around "quality / workmanship / shipping".
- Competitor research: for the same Bluetooth earphones, a rival's negative reviews cluster around "static noise", so the new product's comparison copy can sidestep that directly.
- Training data: multimodal large-model work needs paired "text + image + rating" samples.
- Price monitoring: a sudden surge in negative reviews → the seller may be about to cut prices; get an early warning.
The official taobao.itemcomment.get API can return comments, but:
- it requires an enterprise storefront plus brand registration, so 99% of individual accounts get turned away;
- the returned fields are stripped down, and image URLs carry hotlink protection that expires after 2 hours.
→ Scraping the web page is still the most accessible route in 2025. Below we use pure Python to run the whole loop in one go: search → JSONP → images → follow-up reviews → database → BI dashboard.
2. Tech stack (all open source)
| Module | Library | Notes |
|---|---|---|
| HTTP | httpx / aiohttp | async coroutines; the examples below use httpx.AsyncClient |
| Parsing | beautifulsoup4 + lxml | HTML parsing; the JSONP wrapper itself is stripped with a regex |
| JSON | orjson | roughly 30% faster than ujson |
| Concurrency | asyncio + tqdm | a progress bar you can actually see |
| Database | SQLAlchemy 2.0 | ORM + bulk inserts |
| Dedup | redis + bloomfilter | Bloom-filter style dedup, ~90% less memory |
| Proxy | aiohttp-socks | SOCKS5 with username/password |
| Monitoring | loguru + feishu | one WebHook posts reports to a Feishu group |
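The monitoring row is easy to wire up: loguru accepts any callable as a sink, so ERROR-level logs can be forwarded to the same Feishu webhook used in §⑥. A minimal sketch, assuming the FEISHU_HOOK environment variable from the report function later in the article; none of this is in the original code:
```python
import os
import httpx
from loguru import logger

FEISHU_HOOK = os.getenv("FEISHU_HOOK", "")

def feishu_sink(message) -> None:
    """Forward a loguru record to the Feishu group via the incoming webhook."""
    if not FEISHU_HOOK:
        return
    body = {"msg_type": "text", "content": {"text": message.record["message"]}}
    try:
        httpx.post(FEISHU_HOOK, json=body, timeout=5)
    except httpx.HTTPError:
        pass  # never let monitoring take down the crawler

# file log for everything, Feishu alerts only for ERROR and above
logger.add("crawler.log", rotation="50 MB", level="INFO")
logger.add(feishu_sink, level="ERROR")
```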
3. Environment setup from zero (Linux / Windows / macOS)
```bash
# 1. Create a virtual environment
python -m venv venv && source venv/bin/activate  # Win: venv\Scripts\activate
# 2. Install everything in one go (asyncio ships with the standard library, no pip needed)
pip install -U httpx "aiohttp[speedups]" aiohttp-socks tqdm
pip install orjson beautifulsoup4 lxml sqlalchemy pymysql redis loguru
```
4. Core workflow: a 6-step closed loop (all code runs as-is)
① Find the entry point: JSONP endpoint + signature algorithm (working as of 2025-10)
Endpoint:
https://rate.taobao.com/feedRateList.htm?auctionNumId={item_id}&currentPageNum={page}&pageSize=20&rateType=1&orderType=sort_weight&callback=jsonp123
Response:
```javascript
jsonp123({"total":1523,"comments":[{...}]})
```
Signature logic (same scheme as the item detail page):
```python
import time, hashlib, json, httpx

APPKEY = "12574478"

def sign(raw: str) -> str:
    # MD5 digest, upper-case hex
    return hashlib.md5(raw.encode()).hexdigest().upper()

def build_url(item_id: int, page: int = 1) -> str:
    data = json.dumps({"auctionNumId": item_id, "currentPageNum": page}, separators=(",", ":"))
    t = str(int(time.time() * 1000))   # millisecond timestamp, reused as the JSONP callback suffix
    sig = sign(f"{t}&{APPKEY}&{data}&")
    base = "https://rate.taobao.com/feedRateList.htm"
    params = dict(
        auctionNumId=item_id,
        currentPageNum=page,
        pageSize=20,
        rateType=1,
        orderType="sort_weight",
        callback=f"jsonp{t}",
        t=t,
        sign=sig,
    )
    return f"{base}?{httpx.QueryParams(params)}"
```
② Strip the JSONP wrapper + parse the fields
```python
import httpx, re, orjson
from loguru import logger

client = httpx.AsyncClient(headers={"Referer": "https://item.taobao.com/"}, timeout=10)

async def fetch_page(item_id: int, page: int = 1):
    url = build_url(item_id, page)
    resp = await client.get(url)
    text = resp.text.strip()
    # Strip the jsonpXXX( ... ) wrapper, then parse the JSON payload
    data = re.sub(r"^jsonp\d+\(|\)$", "", text)
    return orjson.loads(data)
```
Field mapping (grab all 20+ fields):
```python
def parse_comments(js: dict, item_id: int):
    for c in js.get("comments", []):
        yield {
            "item_id": item_id,
            "rate_id": c["rateId"],
            "author": c["displayUserNick"],
            "content": c["rateContent"],
            "score": (c.get("tmallRateInfo") or {}).get("score", 5),
            "pics": [p.get("url", "") for p in c.get("pics", [])],
            "useful": c.get("useful", 0),
            "append": {
                "days": c["append"]["days"],
                "content": c["append"]["content"],
            } if c.get("append") else None,
            "created_at": c["rateDate"],  # millisecond timestamp
        }
```
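The pipeline in §1 also calls for downloading the review images for multimodal samples, but the original post has no code for that step. Here is a minimal sketch under two assumptions: the `pics` URLs are protocol-relative (`//img.alicdn.com/...`), and saving them next to the script is good enough. It reuses the `client` from §②; `PIC_DIR` is a hypothetical output directory.
```python
import asyncio
from pathlib import Path

PIC_DIR = Path("pics")  # hypothetical output directory
PIC_DIR.mkdir(exist_ok=True)

async def download_pics(row: dict) -> None:
    """Fetch every review image of one parsed comment, reusing the shared httpx client."""
    for i, url in enumerate(row["pics"]):
        if not url:
            continue
        if url.startswith("//"):          # pic URLs are usually protocol-relative
            url = "https:" + url
        resp = await client.get(url)
        if resp.status_code == 200:
            (PIC_DIR / f"{row['rate_id']}_{i}.jpg").write_bytes(resp.content)
```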
③ Async concurrency + progress bar + throttling
A single IP holds up fine at around 15 QPS; past ~200 requests in 10 minutes the slider captcha is almost guaranteed.
```python
import asyncio, tqdm, time
from asyncio import Semaphore

sem = Semaphore(15)  # caps in-flight requests; a true QPS token bucket is sketched below

async def fetch_all(item_id: int, max_page: int = 200):
    first = await fetch_page(item_id, 1)
    total_page = min((first.get("total", 0) + 19) // 20, max_page)
    tasks = [
        fetch_comment_page(item_id, p)
        for p in tqdm.trange(2, total_page + 1, desc="pages")
    ]
    return first, await asyncio.gather(*tasks)

async def fetch_comment_page(item_id: int, page: int):
    async with sem:                 # concurrency limit
        await asyncio.sleep(0.05)   # polite delay
        js = await fetch_page(item_id, page)
        return list(parse_comments(js, item_id))
```
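A Semaphore only limits how many requests are in flight, not requests per second. If you want the actual token bucket the step title promises, a minimal asyncio version could look roughly like this; the rate and capacity values are illustrative, not tuned:
```python
import asyncio, time

class TokenBucket:
    """Simple asyncio token bucket: on average at most `rate` acquisitions per second."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # not enough tokens yet: sleep just long enough for one to refill
                await asyncio.sleep((1 - self.tokens) / self.rate)

bucket = TokenBucket(rate=15, capacity=15)  # ~15 QPS, matching the Semaphore above

async def fetch_comment_page_limited(item_id: int, page: int):
    await bucket.acquire()
    js = await fetch_page(item_id, page)
    return list(parse_comments(js, item_id))
```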
④ Persist: SQLAlchemy 2.0 + bulk inserts + Redis dedup
Create the table (MySQL 8.0):
```sql
CREATE TABLE tb_comment (
    id         BIGINT AUTO_INCREMENT PRIMARY KEY,
    item_id    BIGINT NOT NULL,
    rate_id    BIGINT NOT NULL,
    author     VARCHAR(50) NOT NULL,
    content    TEXT NOT NULL,
    score      TINYINT NOT NULL,
    pics       JSON NULL,
    useful     INT DEFAULT 0,
    append     JSON NULL,
    created_at DATETIME(3) NOT NULL,
    UNIQUE KEY uk_rate (rate_id),
    INDEX idx_item (item_id)
) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;
```
SQLAlchemy ORM:
```python
from sqlalchemy import create_engine, Column, BigInteger, String, Text, DateTime, SmallInteger
from sqlalchemy.orm import declarative_base, sessionmaker
import redis, json, datetime as dt

Base = declarative_base()
engine = create_engine("mysql+pymysql://user:pwd@127.0.0.1:3306/crawler?charset=utf8mb4", pool_pre_ping=True)
Session = sessionmaker(bind=engine, expire_on_commit=False)
rdb = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)

class TbComment(Base):
    __tablename__ = "tb_comment"
    id = Column(BigInteger, primary_key=True)
    item_id = Column(BigInteger, nullable=False)
    rate_id = Column(BigInteger, nullable=False, unique=True)
    author = Column(String(50), nullable=False)
    content = Column(Text, nullable=False)
    score = Column(SmallInteger, nullable=False)
    pics = Column(Text, nullable=True)      # JSON string
    useful = Column(BigInteger, default=0)
    append = Column(Text, nullable=True)    # JSON string
    created_at = Column(DateTime, nullable=False)

def bulk_save(rows: list[dict]):
    """Dedup by rate_id in a Redis set, then bulk-insert only the new rows."""
    pipe = rdb.pipeline()
    for row in rows:
        pipe.sadd("rate_id_set", row["rate_id"])
    exists = pipe.execute()                 # SADD returns 1 = newly added, 0 = already seen
    new_rows = [r for r, e in zip(rows, exists) if e == 1]
    if not new_rows:
        return 0
    for r in new_rows:
        # convert the ms timestamp to datetime, serialize nested fields to JSON strings
        r["created_at"] = dt.datetime.fromtimestamp(r["created_at"] / 1000)
        r["pics"] = json.dumps(r["pics"], ensure_ascii=False)
        r["append"] = json.dumps(r["append"], ensure_ascii=False) if r["append"] else None
    with Session() as s:
        s.bulk_insert_mappings(TbComment, new_rows)
        s.commit()
    return len(new_rows)
```
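The stack table in §2 promises a Bloom filter, but the code above uses a plain Redis set: exact, yet its memory grows with every rate_id. If you actually want the ~90% memory savings, one option is the RedisBloom module. The sketch below assumes the module is loaded on the Redis server and uses raw commands, since the convenience wrappers differ across redis-py versions; the key name is made up:
```python
BF_KEY = "rate_id_bf"   # hypothetical key name

def ensure_bloom(capacity: int = 10_000_000, error_rate: float = 0.001) -> None:
    """Create the Bloom filter once; ignore the error if it already exists."""
    try:
        rdb.execute_command("BF.RESERVE", BF_KEY, error_rate, capacity)
    except redis.ResponseError:
        pass  # already reserved

def filter_new(rows: list[dict]) -> list[dict]:
    """BF.MADD returns 1 per item that was (probably) not seen before."""
    if not rows:
        return []
    added = rdb.execute_command("BF.MADD", BF_KEY, *[r["rate_id"] for r in rows])
    return [r for r, a in zip(rows, added) if a == 1]
```
The trade-off: a Bloom filter can very occasionally mark a genuinely new comment as a duplicate, so keep the UNIQUE KEY on rate_id as the safety net.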
⑤ Main function: one-shot run
```python
async def main(item_id: int = 65231656231, max_page: int = 200):
    first, pages = await fetch_all(item_id, max_page)
    all_rows = list(parse_comments(first, item_id))
    for rows in pages:
        all_rows.extend(rows)
    new_cnt = bulk_save(all_rows)
    dup_rate = 1 - new_cnt / len(all_rows) if all_rows else 0
    logger.success(f"item_id={item_id}: {new_cnt} new comments saved, duplicate rate {dup_rate:.1%}")
    return new_cnt

if __name__ == "__main__":
    asyncio.run(main())
```
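The wrap-up says "change one item_id and you can harvest any category"; running a whole list sequentially is only a few lines more. The ids below are placeholders, and items run one after another so the per-IP rate limit stays intact:
```python
import asyncio

ITEM_IDS = [65231656231, 65231656232]  # placeholder item ids

async def batch():
    for item_id in ITEM_IDS:
        await main(item_id)

if __name__ == "__main__":
    asyncio.run(batch())
```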
⑥ Cloud function / Docker on a schedule: Feishu report at 8 a.m. every day
Dockerfile
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
COPY . .
CMD ["python", "-m", "crawl"]
```
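The Dockerfile copies a requirements.txt that the article never shows; one plausible version is simply the pip list from §3 (pin exact versions yourself as needed):
```text
httpx
aiohttp[speedups]
aiohttp-socks
tqdm
orjson
beautifulsoup4
lxml
sqlalchemy
pymysql
redis
loguru
```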
docker-compose.yml
```yaml
version: "3.9"
services:
  crawler:
    build: .
    environment:
      - DB_URL=mysql+pymysql://user:pwd@db:3306/crawler
      - REDIS_HOST=redis
    depends_on:
      - db
      - redis
  db:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: 123456
      MYSQL_DATABASE: crawler
    volumes:
      - db_data:/var/lib/mysql
  redis:
    image: redis:7-alpine
volumes:
  db_data:
```
Host cron + Feishu
```bash
# host crontab
0 8 * * * docker-compose -f /home/comment/docker-compose.yml up --build
```
Feishu push function (minimal version)
```python
import httpx, os

FEISHU_HOOK = os.getenv("FEISHU_HOOK")

def report(item_id: int, new_cnt: int):
    body = {
        "msg_type": "text",
        "content": {"text": f"Item {item_id}: {new_cnt} new comments saved to the DB~"},
    }
    httpx.post(FEISHU_HOOK, json=body)
```
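One way to wire this into the scheduled container run, assuming main() returns the new-comment count as written in §⑤:
```python
import asyncio

async def scheduled_run(item_id: int = 65231656231):
    """Crawl once, then post the result to the Feishu group."""
    new_cnt = await main(item_id)
    report(item_id, new_cnt)

if __name__ == "__main__":
    asyncio.run(scheduled_run())
```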
5. Pitfalls & anti-bot survival kit
- JSONP wrapper: strip it with the regex ^jsonp\d+\(|\)$, then run orjson.loads on the rest.
- Referer: must be https://item.taobao.com/, otherwise you get a 403.
- Rate limit: a single IP holds at 15 QPS; more than ~200 requests per 10 minutes triggers the slider captcha (a detection-and-backoff sketch follows below).
- Proxy pool: Qingguo Cloud (青果云), roughly 0.8 CNY per GB, enough for about 80k pages.
- Duplicates: the Redis rate_id_set dedups in sub-second time and saves ~90% memory.
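The slider captcha usually arrives as an HTML page instead of a JSONP payload, so the cheapest defence is to check the response before parsing and back off. A rough sketch reusing `client`, `build_url`, `re`, and `orjson` from earlier; the retry count and sleep times are guesses to tune, not measured values:
```python
import asyncio
from loguru import logger

async def fetch_page_safe(item_id: int, page: int, retries: int = 3):
    """Retry with exponential backoff when the response is not a JSONP payload."""
    for attempt in range(retries):
        resp = await client.get(build_url(item_id, page))
        text = resp.text.strip()
        if resp.status_code == 200 and text.startswith("jsonp"):
            return orjson.loads(re.sub(r"^jsonp\d+\(|\)$", "", text))
        # 403 or slider page: wait, then try again (ideally after rotating the proxy)
        logger.warning(f"page {page} blocked (status={resp.status_code}), attempt {attempt + 1}")
        await asyncio.sleep(2 ** attempt * 10)
    raise RuntimeError(f"item {item_id} page {page} still blocked after {retries} retries")
```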
6. Wrap-up
From the JSONP signature, async coroutines, Redis dedup, and SQLAlchemy persistence to the Docker schedule with Feishu group reports, that is one complete Python loop end to end.
All of the code can be dropped straight into PyCharm / VSCode and run; change one item_id and you can harvest any category.
Happy crawling to every ops, video-editing, and algorithm engineer out there, and even happier sales!