Web crawling with DeepSeek + crawl4ai: structured extraction driven by natural-language instructions
Official documentation
https://github.com/unclecode/crawl4ai
Environment setup
Python 3.9+ (reason: https://github.com/unclecode/crawl4ai/issues/826)
pip install -U crawl4ai
crawl4ai-setup
python -m playwright install --with-deps chromium
Get a DeepSeek API key
https://platform.deepseek.com/api_keys

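Adapt the official example so that it uses the DeepSeek model (running the official example unchanged requires a proxy/VPN to get past the firewall). The only change is the llm_config argument: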
llm_config = LLMConfig(provider="deepseek/deepseek-chat", api_token="yourkey"),
For the LLM providers that crawl4ai supports, see the official documentation:
https://docs.crawl4ai.com/core/browser-crawler-config/
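
Writing the key directly into the source, as in the api_token="yourkey" line above, makes it easy to leak when the script is shared. A minimal sketch of one alternative, assuming the key is stored in an environment variable; the variable name DEEPSEEK_API_KEY and the helper load_api_key are illustrative choices made here, not part of crawl4ai:

```python
import os


def load_api_key(var_name: str = "DEEPSEEK_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name and this helper are illustrative assumptions,
    not something crawl4ai defines.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

With this in place, the configuration line would become LLMConfig(provider="deepseek/deepseek-chat", api_token=load_api_key()).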

Note: base your changes on the code in the GitHub repository, otherwise the script may fail to run.
import os
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field


class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")


async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
            # provider="ollama/qwen2", api_token="no-token",
            llm_config=LLMConfig(provider="deepseek/deepseek-chat", api_token="yourkey"),
            # On Pydantic v2, model_json_schema() is the non-deprecated equivalent of schema().
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content. One extracted model JSON format should look like this:
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
        ),
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config,
        )
        print(result.extracted_content)


if __name__ == "__main__":
    asyncio.run(main())
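
The script above only prints result.extracted_content. A rough sketch of how that output could be turned into typed records, assuming it is a JSON string holding a list of objects with the three fields from the schema; the ModelFee dataclass, the parse_fees helper, and the sample string are made up here for illustration and do not come from a real crawl:

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ModelFee:
    model_name: str
    input_fee: str
    output_fee: str


def parse_fees(extracted_content: str) -> List[ModelFee]:
    """Parse the JSON string into ModelFee records, skipping incomplete items."""
    items = json.loads(extracted_content)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of objects")
    fees = []
    for item in items:
        if not isinstance(item, dict):
            continue
        try:
            # Only the three schema fields are read; extra keys are ignored.
            fees.append(ModelFee(item["model_name"], item["input_fee"], item["output_fee"]))
        except KeyError:
            # The model left out a field for this entry, so drop it.
            continue
    return fees


# Hand-written sample input (hypothetical), shaped like the format in the instruction.
sample = (
    '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
    '"output_fee": "US$30.00 / 1M tokens", "error": false}, '
    '{"model_name": "incomplete"}]'
)

for fee in parse_fees(sample):
    print(fee.model_name, fee.input_fee, fee.output_fee)
```

In the crawler script, print(result.extracted_content) could then be replaced by a loop over parse_fees(result.extracted_content). Dropping incomplete entries silently is a design choice for this sketch; logging them may be more useful when checking how well the prompt works.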
