AnyCrawl docker部署
desc
• AnyCrawl 提供高性能网页数据爬取,其功能专为 LLM 集成和数据处理而设计 • 支持利用搜索引擎直接查询获取结果内容,类似 searxng • 提供开发者友好的API,支持动态内容抓取,并输出结构化数据,如markdown、网站元信息等 • 支持Docker一键快速部署,资源占用相对较低 • 项目开源,地址参考:https://github.com/any4ai/AnyCrawl • 该项目大概工作原理如下图所示:
部署方法一:
- 提前准备好Docker、docker-compose软件环境
- 新建docker-compose.yml配置文件,内容如下(确保8080不被占用,如已被占用,请修改下面的端口映射配置):
name: anycrawl x-common-service: &common-service networks: - anycrawl-network volumes: - ./storage:/usr/src/app/storage x-common-env: &common-env NODE_ENV: ${NODE_ENV:-production} ANYCRAWL_HEADLESS: ${ANYCRAWL_HEADLESS:-true} ANYCRAWL_PROXY_URL: ${ANYCRAWL_PROXY_URL:-} ANYCRAWL_IGNORE_SSL_ERROR: ${ANYCRAWL_IGNORE_SSL_ERROR:-true} ANYCRAWL_REDIS_URL: ${ANYCRAWL_REDIS_URL:-redis://redis:6379} ANYCRAWL_API_PORT: ${ANYCRAWL_API_PORT:-8080} ANYCRAWL_API_AUTH_ENABLED: ${ANYCRAWL_API_AUTH_ENABLED:-false} ANYCRAWL_API_DB_TYPE: "sqlite" ANYCRAWL_API_DB_CONNECTION: "/usr/src/app/db/database.db" services: api: <<: *common-service image: ghcr.io/any4ai/anycrawl-api environment: <<: *common-env ports: - "8080:8080" volumes: - ./storage:/usr/src/app/storage - ./db:/usr/src/app/db depends_on: - redis scrape-puppeteer: <<: *common-service image: ghcr.io/any4ai/anycrawl-scrape-puppeteer environment: <<: *common-env depends_on: - redis scrape-playwright: <<: *common-service image: ghcr.io/any4ai/anycrawl-scrape-playwright environment: <<: *common-env depends_on: - redis scrape-cheerio: <<: *common-service image: ghcr.io/any4ai/anycrawl-scrape-cheerio environment: <<: *common-env depends_on: - redis redis: image: redis:7-alpine volumes: - redis-data:/data networks: - anycrawl-network command: redis-server --appendonly yes volumes: redis-data: networks: anycrawl-network: driver: bridge
一键启动,执行如下命令
docker-compose up -d
部署方法二:
拉取镜像并运行
docker run -p 8080:8080 ghcr.io/any4ai/anycrawl:latest
或者后台运行
docker run -d -p 8080:8080 ghcr.io/any4ai/anycrawl:latest
注意,这里拉取可能会拉取不下来(修改/etc/docker/daemon.json
,重启 Docker 再拉就行)
把下面任意一条地址直接写到 /etc/docker/daemon.json
里,重启 Docker 再拉就行:
{ "registry-mirrors": [ "https://ghcr.io.dockerproxy.com", "https://ghcr.88888888.xyz", "https://ghcr.dockerproxy.cn" ] }
重启docker
sudo systemctl daemon-reload && sudo systemctl restart docker docker pull ghcr.io/any4ai/anycrawl
正常启动后的一个结果:
验证效果:
1. 爬取一篇文章内容,返回 LLM 友好的markdown内容 • 爬取网页接口:http://127.0.0.1:8080/v1/scrape • 请求方法:POST • 请求参数:
{ "url": "https://blog.luler.top/d/55", //网页链接 "engine": "playwright", //支持多种引擎,cheerio、puppeteer 、playwright,cheerio适合静态网页,puppeteer 、playwright适合动态网页 "proxy": "http://127.0.0.1:10808" //支持设置代理,非必填 }
postman请求示例:
2 获取搜索引擎结果 (SERP)
- • 爬取网页接口:http://127.0.0.1:8080/v1/search
- • 请求方法:POST
- • 请求参数:
{ "url": "https://blog.luler.top/d/55", //网页链接 "engine": "playwright", //支持多种引擎,cheerio、puppeteer 、playwright,cheerio适合静态网页,puppeteer 、playwright适合动态网页 "proxy": "http://127.0.0.1:10808" //支持设置代理,非必填 }
postman请求示例