Building a Crawler API Service with FastAPI: Turning a Crawler into a Callable API - Detailed Guide - tlnshuju

2025-12-21 19:49 tlnshuju

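In day-to-day development, a crawler usually runs as a script: it is started by hand on a local machine or a particular server and cannot easily be hooked into other systems. FastAPI, a high-performance Python web framework, makes it straightforward to wrap a crawler in a standard HTTP API that can be called from any platform and in many scenarios. This article walks through building a complete "crawler + API" service, from environment setup to deployment and optimization.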

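I. Core advantages: why combine FastAPI with a crawler

Compared with the traditional script approach, combining a crawler with FastAPI has three key advantages:

Reusability: a crawler exposed as an API can be called by front-end, back-end, and mobile clients alike, so the crawling logic is written only once.
Ease of use: API parameters (for example the number of pages or a keyword) control the crawler's behaviour at call time, so a task can be adjusted without changing code.
Performance and tooling: FastAPI supports asynchronous request handling, so several crawler calls can be in flight at once, and it generates interactive API documentation automatically, which lowers debugging cost.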

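II. Environment setup: tools and dependencies

Install the core dependencies first. Using a virtual environment to isolate the project's dependencies is recommended:

Library	Purpose	Install command
FastAPI	Core framework for building the API service	`pip install fastapi`
Uvicorn	ASGI server that runs the FastAPI service	`pip install uvicorn`
Requests	Sends HTTP requests (synchronous crawler)	`pip install requests`
BeautifulSoup4	Parses HTML page data	`pip install beautifulsoup4`
Aiohttp	Sends asynchronous HTTP requests (async crawler optimization)	`pip install aiohttp`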

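III. Hands-on: from crawler script to API service

Using "crawl the article list of a blog site" as the example, the complete service is built in three steps.

1. Write the basic crawler function

Start with a crawler function that can run on its own and takes the page number as a parameter, which makes it easy to hook up to the API later. Save it as crawler.py, since the API module in the next step imports it from there: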

python

from typing import Union

import requests
from bs4 import BeautifulSoup


def crawl_blog(page: int = 1) -> Union[list, dict]:
    """
    Crawl one page of the blog's article list.
    :param page: page number to crawl, defaults to 1
    :return: list of articles (title, link, publish time) on success,
             or a dict with an "error" key on failure
    """
    # Target URL (a placeholder; replace it with a site you are allowed to crawl)
    url = f"https://example-blog.com/page/{page}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/118.0.0.0"}
    try:
        # Send the request and parse the page
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raises on 4xx/5xx status codes
        soup = BeautifulSoup(response.text, "html.parser")
        # Extract the article data
        articles = []
        for item in soup.find_all("div", class_="article-item"):
            title = item.find("h2").text.strip()
            link = item.find("a")["href"]
            publish_time = item.find("span", class_="time").text.strip()
            articles.append({"title": title, "link": link, "publish_time": publish_time})
        return articles
    except Exception as e:
        # Network errors and missing page elements both end up here;
        # report the failure to the caller instead of raising
        return {"error": f"Crawler failed: {e}"}
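
Because `crawl_blog` returns either a list or an error dict, a caller that wants several pages has to check the result type on every call. A minimal sketch of such a loop follows. `crawl_pages` and the stand-in `fake_crawl` are hypothetical names introduced here so the example runs without network access; in real use `crawl_blog` would be passed instead.

```python
from typing import Callable, Union


def crawl_pages(crawl_fn: Callable[[int], Union[list, dict]],
                first_page: int, last_page: int) -> dict:
    """Call crawl_fn for each page in [first_page, last_page].

    Stops at the first page that returns an error dict and reports
    which page failed together with the articles collected so far.
    """
    articles = []
    for page in range(first_page, last_page + 1):
        result = crawl_fn(page)
        if not isinstance(result, list):
            # Same contract as crawl_blog: a non-list result signals failure
            return {"articles": articles, "failed_page": page, "error": result.get("error")}
        articles.extend(result)
    return {"articles": articles, "failed_page": None, "error": None}


def fake_crawl(page: int) -> Union[list, dict]:
    """Hypothetical stand-in for crawl_blog: pages 1-2 succeed, later pages fail."""
    if page > 2:
        return {"error": f"page {page} not found"}
    return [{"title": f"Post {page}", "link": f"/post/{page}", "publish_time": "2025-01-01"}]


if __name__ == "__main__":
    summary = crawl_pages(fake_crawl, 1, 3)
    print(len(summary["articles"]), summary["failed_page"])  # prints: 2 3
```

Passing the crawl function as an argument keeps the loop testable and lets the same helper drive any crawler that follows the list-or-error contract.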

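2. Build the FastAPI service and wrap the crawler as an API

Initialize the FastAPI application and define an endpoint whose core logic is the crawler function. The page to crawl is passed as a URL query parameter: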

python

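from fastapi import FastAPI, Query
# Import the crawler function written above (saved as crawler.py)
from crawler import crawl_blog

# Initialize the FastAPI application
app = FastAPI(
    title="Blog Crawler API Service",
    description="Call the crawler through an API to fetch the blog's article list",
    version="1.0.0"
)


# GET endpoint: fetch the blog articles on the given page
@app.get("/api/crawl/blog", summary="Crawl the blog article list")
def get_blog_articles(page: int = Query(1, ge=1, description="Page number to crawl")):
    """
    Call the crawler and return the blog articles.
    - **page**: optional page number to crawl (default 1, must be >= 1)
    """
    # crawl_blog returns a list on success and an error dict on failure.
    # The HTTP status stays 200 either way; the "code" field carries the outcome.
    result = crawl_blog(page=page)
    return {"code": 200 if isinstance(result, list) else 500, "data": result}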
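
The endpoint signals success or failure through the `code` field of the JSON body rather than through the HTTP status. As a sketch, that envelope logic can be pulled into a small pure function so it can be unit-tested without starting the server. `build_envelope` is a hypothetical helper name, not part of FastAPI.

```python
from typing import Any


def build_envelope(result: Any) -> dict:
    """Wrap a crawler result in the {"code", "data"} envelope used by the endpoint.

    A list means the crawl succeeded (code 200); anything else, such as the
    {"error": ...} dict returned on failure, is reported with code 500.
    """
    succeeded = isinstance(result, list)
    return {"code": 200 if succeeded else 500, "data": result}


if __name__ == "__main__":
    print(build_envelope([{"title": "Hello"}]))
    print(build_envelope({"error": "timeout"}))
```

Note that an empty list still counts as success, so a page with no articles yields code 200 with empty data.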

3. Run and test the API service

Start the service: from the project root, run the FastAPI application with Uvicorn:
bash
```
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
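- main:app: load the app instance from main.py
- --host 0.0.0.0: listen on all network interfaces so that other devices can reach the service
- --reload: enable hot reload for development; the server restarts automatically when the code changes
Test the endpoint:
- Open the interactive documentation that FastAPI generates automatically: http://localhost:8000/docs
- Find the /api/crawl/blog endpoint in the docs, click "Try it out", enter a page number (for example page=2), and click "Execute" to see the crawler's result
- Alternatively, test it with curl: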
  bash
```
curl "http://localhost:8000/api/crawl/blog?page=2"
```
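
Since the point of the service is to be called from other systems, here is a minimal client sketch using only the Python standard library. It assumes the service from this section is running on localhost:8000; `build_crawl_url` and `fetch_articles` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_crawl_url(base_url: str, page: int) -> str:
    """Build the endpoint URL with the page number as a query parameter."""
    return f"{base_url.rstrip('/')}/api/crawl/blog?{urlencode({'page': page})}"


def fetch_articles(base_url: str, page: int = 1, timeout: float = 15.0) -> list:
    """Call the crawl endpoint and return the article list.

    Raises RuntimeError when the envelope's "code" field is not 200.
    """
    with urlopen(build_crawl_url(base_url, page), timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    if payload.get("code") != 200:
        raise RuntimeError(f"crawl failed: {payload.get('data')}")
    return payload["data"]


if __name__ == "__main__":
    # Only builds the URL here; fetch_articles needs the service to be running.
    print(build_crawl_url("http://localhost:8000", 2))
```

The client checks the `code` field in the body because the endpoint reports crawler failures there.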

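IV. Advanced optimization: making the API service more stable and efficient

Once the basic service is running, the following optimizations improve its usability:

1. Going asynchronous: improving concurrency

requests is a synchronous library: each call blocks the calling thread until the response arrives. FastAPI runs plain def endpoints in a thread pool, so the number of concurrent crawler calls is limited by the pool size, and calling requests inside an async def endpoint would stall the whole event loop. Rewriting the crawler with aiohttp and exposing it through an async endpoint lets a single event loop keep many crawler calls in flight at once: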

python

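# Asynchronous crawler function
# (add this to main.py, where `app` is defined in step 2)
from typing import Union

import aiohttp
from bs4 import BeautifulSoup


async def async_crawl_blog(page: int = 1) -> Union[list, dict]:
    url = f"https://example-blog.com/page/{page}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/118.0.0.0"}
    timeout = aiohttp.ClientTimeout(total=10)  # overall limit per request, in seconds
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(url, headers=headers) as response:
                response.raise_for_status()
                html = await response.text()  # read the body without blocking the event loop
        soup = BeautifulSoup(html, "html.parser")
        # Same extraction logic as the synchronous version
        articles = []
        for item in soup.find_all("div", class_="article-item"):
            articles.append({
                "title": item.find("h2").text.strip(),
                "link": item.find("a")["href"],
                "publish_time": item.find("span", class_="time").text.strip(),
            })
        return articles
    except Exception as e:
        return {"error": f"Async crawler failed: {e}"}


# Asynchronous API endpoint
@app.get("/api/async/crawl/blog", summary="Crawl the blog article list asynchronously")
async def async_get_blog_articles(page: int = 1):
    result = await async_crawl_blog(page=page)  # await the crawler coroutine
    return {"code": 200 if isinstance(result, list) else 500, "data": result}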
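
The benefit of the asynchronous version shows up when several pages are fetched at once. The sketch below uses only `asyncio`. `fake_fetch` is a hypothetical stand-in that sleeps instead of doing network I/O (in real use `async_crawl_blog` would take its place), and a semaphore caps how many requests run at the same time so the target site is not flooded.

```python
import asyncio


async def fake_fetch(page: int) -> list:
    """Hypothetical stand-in for async_crawl_blog: waits briefly, returns one article."""
    await asyncio.sleep(0.05)  # simulates network latency
    return [{"title": f"Post on page {page}"}]


async def crawl_many(pages: list, max_concurrency: int = 3) -> dict:
    """Fetch several pages concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)
    running = 0
    peak = 0

    async def fetch_one(page: int) -> list:
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            try:
                return await fake_fetch(page)
            finally:
                running -= 1

    # gather returns results in the same order as the input pages
    results = await asyncio.gather(*(fetch_one(p) for p in pages))
    return {"results": dict(zip(pages, results)), "peak_concurrency": peak}


if __name__ == "__main__":
    outcome = asyncio.run(crawl_many([1, 2, 3, 4, 5], max_concurrency=3))
    print(sorted(outcome["results"]), outcome["peak_concurrency"])
```

The `peak_concurrency` value is tracked only to make the cap observable; a production version would drop it.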

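2. Add caching: avoid repeated crawls

Cache the crawler's results (for example in Redis) so that repeated requests for the same page are served from the cache, which reduces the load on the target site. The example needs the redis package (`pip install redis`) and a Redis server listening on localhost:6379: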

python

import json

import redis
from fastapi import Depends

# Initialize the Redis connection
redis_client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)


def get_redis():
    return redis_client


@app.get("/api/crawl/blog/cached", summary="Blog crawl endpoint with caching")
def get_cached_blog_articles(page: int = 1, cache: redis.Redis = Depends(get_redis)):
    # Check the cache first
    cache_key = f"blog:page:{page}"
    cached_data = cache.get(cache_key)
    if cached_data:
        # Parse with json.loads; never eval() data read from an external store
        return {"code": 200, "data": json.loads(cached_data), "from": "cache"}
    # Cache miss: call the crawler
    result = crawl_blog(page=page)
    if isinstance(result, list):
        # Cache the result as JSON (expires after 1 hour)
        cache.setex(cache_key, 3600, json.dumps(result, ensure_ascii=False))
    return {"code": 200 if isinstance(result, list) else 500, "data": result, "from": "crawler"}
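
The cache-aside flow used by this endpoint (look up, fall back to the crawler, store with an expiry) can be tried without a Redis server. The sketch below is an in-process stand-in, not a replacement for Redis. `TTLCache` and `cached_crawl` are hypothetical names, values are serialized with `json` so they can be read back without `eval`, and the cache is lost when the process exits and is not shared between workers.

```python
import json
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-memory key/value store with per-key expiry (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def setex(self, key: str, ttl_seconds: float, value: str) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)


def cached_crawl(cache: TTLCache, crawl_fn: Callable[[int], object],
                 page: int, ttl: float = 3600) -> dict:
    """Cache-aside lookup: serve from cache if present, otherwise crawl and store."""
    key = f"blog:page:{page}"
    cached = cache.get(key)
    if cached is not None:
        return {"data": json.loads(cached), "from": "cache"}
    result = crawl_fn(page)
    if isinstance(result, list):  # only successful crawls are cached
        cache.setex(key, ttl, json.dumps(result, ensure_ascii=False))
    return {"data": result, "from": "crawler"}


if __name__ == "__main__":
    cache = TTLCache()
    crawl = lambda page: [{"title": f"Post {page}"}]  # stand-in for crawl_blog
    print(cached_crawl(cache, crawl, 1)["from"])  # first call goes to the crawler
    print(cached_crawl(cache, crawl, 1)["from"])  # second call is served from cache
```

Error results are deliberately not cached, so a transient failure on the target site does not get served for the whole expiry period.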

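3. Deployment advice: running reliably in production

Once development is finished, the following configuration helps keep the service stable:

Disable hot reload: in production, run uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4 without --reload. --workers sets the number of worker processes; about twice the number of CPU cores is a common starting point, to be tuned under real load.
Reverse proxy: put Nginx in front as a reverse proxy to serve static assets, balance load, and hide the backend service port.
Process management: manage the Uvicorn process with Supervisor or systemd so that the service restarts automatically after a crash.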
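
The worker count does not have to be hard-coded. As a sketch, the rule of thumb above (about twice the CPU core count) can be computed at start-up. `suggested_workers` and its `cap` parameter are hypothetical, and the result is only a starting point to adjust under real load.

```python
import os


def suggested_workers(multiplier: int = 2, cap: int = 16) -> int:
    """Return multiplier x CPU cores, at least 1 and at most cap."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(cap, cores * multiplier))


if __name__ == "__main__":
    workers = suggested_workers()
    # Print the production command from the text with the computed worker count
    print(f"uvicorn main:app --host 0.0.0.0 --port 8000 --workers {workers}")
```

The cap matters for a crawler service in particular: each worker holds its own connections, so more workers also means more simultaneous requests to the target site.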

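V. Summary

Wrapping a crawler with FastAPI turns a script into an API in three steps: write a parameterized crawler function, build the API endpoint, then test and optimize. This pattern makes the crawler more reusable, and FastAPI's async support and automatic documentation lower integration cost, so crawling capabilities can be brought into business systems quickly.

For further extension, features such as endpoint authentication (for example JWT), request rate limiting, and a crawler task queue can be added to make the service more secure and controllable.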
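
Of these extensions, request rate limiting is the simplest to prototype. The sketch below is a sliding-window limiter kept in process memory. `SlidingWindowLimiter` is a hypothetical class, not a FastAPI feature, and a multi-worker deployment would need shared state (for example in Redis) instead.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
    print([limiter.allow("10.0.0.1") for _ in range(3)])  # prints: [True, True, False]
```

In an endpoint, `limiter.allow(client_ip)` would be checked before calling the crawler, and a rejected call would typically be answered with HTTP 429.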
