MetaGPT Day 05: Crawler Engineer Agent

MetaGPT Crawler Agent

Requirements

1. Rewrite the subscription agent with ActionNode so that a website can be crawled and parsed from a natural-language requirement.
2. Try implementation idea 1, i.e., have the LLM extract the required information directly instead of writing crawler code.
3. At the moment the subscription agent is run through RunSubscription: this single action both generates the subscription agent code and starts the SubscriptionRunner, which means RunSubscription never returns. Try to separate the two, i.e., split an AddSubscriptionTask action out of RunSubscription and let the SubscriptionRunner run on its own (in the same process or in a different one); a minimal sketch of one possible separation follows this list.
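
Here is a minimal in-process sketch of one way to do that separation (the SUB_QUEUE queue, the AddSubscriptionTask action, and the run_subscription_runner coroutine are illustrative names, not part of the original code): the action only registers the subscription on a queue and returns immediately, while a separate coroutine owns the SubscriptionRunner and consumes the queue.

import asyncio
from metagpt.actions.action import Action
from metagpt.schema import Message
from metagpt.subscription import SubscriptionRunner

# Hypothetical in-process hand-off point between the agent and the runner loop.
SUB_QUEUE: asyncio.Queue = asyncio.Queue()

class AddSubscriptionTask(Action):
    """Register a subscription (role, trigger, callback) without blocking on it."""
    async def run(self, role, trigger, callback):
        await SUB_QUEUE.put((role, trigger, callback))
        return Message(content="subscription task added")

async def run_subscription_runner():
    """Own the SubscriptionRunner and keep consuming newly registered tasks."""
    runner = SubscriptionRunner()
    watchdog = None
    while True:
        role, trigger, callback = await SUB_QUEUE.get()
        # subscribe() registers the role's trigger loop as an asyncio task
        await runner.subscribe(role, trigger, callback)
        if watchdog is None:
            # start the runner's monitoring loop once, in the background
            watchdog = asyncio.create_task(runner.run())

For a cross-process variant, RunSubscription could instead persist the generated crawler code and cron expression (for example to a file), and a separate runner process could load them and perform the subscribe itself.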

Implementation Ideas

There are two possible ideas:
1. Implement an agent that can crawl any website we specify, analyze the data, and finally produce a summary (a rough sketch follows this list);
2. Implement an agent that writes the code of a subscription agent: it browses the pages we want to crawl, writes the crawling and information-extraction code, generates a Role, and can even drive SubscriptionRunner directly according to our subscription requirement, so that the whole subscription need is fulfilled end to end.
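
As a rough sketch of idea 1 (the DirectExtract action and the DIRECT_EXTRACT_PROMPT template are illustrative and not part of the original code; page.inner_text is assumed to hold the page's visible text): instead of asking the LLM to write a parse function, each run fetches the page and hands its content, together with the extraction requirement, straight to the LLM.

from metagpt.actions.action import Action
from metagpt.tools.web_browser_engine import WebBrowserEngine

DIRECT_EXTRACT_PROMPT = """## Requirement
{query}

## Page content
{content}

Extract the requested information from the page content above."""

class DirectExtract(Action):
    """Idea 1: let the LLM extract the information, with no generated crawler code."""
    async def run(self, url: str, query: str):
        page = await WebBrowserEngine().run(url)
        # page.inner_text is an assumption here; the outline produced by
        # get_outline() below would work as well and is usually much shorter.
        return await self.llm.aask(DIRECT_EXTRACT_PROMPT.format(query=query, content=page.inner_text))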

Extracting a Page Outline

def get_outline(page):
    """
    从 HTML 页面提取文档大纲信息,返回一个包含元素信息的列表。

    Args:
        page: 包含 HTML 内容的页面对象。

    Returns:
        list: 包含元素信息的列表,每个元素信息是一个字典,包括元素的名称、深度、文本内容、
              可能的 ID 和类别信息。

    Note:
        该函数通过调用 _get_soup 函数,使用 BeautifulSoup 解析 HTML 页面。
        然后,通过递归处理 HTML 树中的每个元素,提取其名称、深度、文本内容等信息。

        在递归过程中,忽略了一些特定的元素(如 script 和 style 标签)。
        对于某些特殊标签(如 svg),只提取名称和深度,而不提取文本内容。

        最终,将提取的元素信息以字典形式组织成列表,表示文档的大纲结构。

    """
    # Parse the HTML page with the _get_soup helper
    soup = _get_soup(page.html)

    # List that accumulates the document outline
    outline = []

    def process_element(element, depth):
        """
        递归处理 HTML 元素,提取其信息并添加到大纲列表中。

        Args:
            element: 当前处理的 HTML 元素。
            depth: 元素的深度。

        Returns:
            None
        """
        # Element tag name
        name = element.name

        # Skip nodes without a tag name (e.g. plain text nodes)
        if not name:
            return

        # Skip script and style tags
        if name in ["script", "style"]:
            return

        # Basic element info: tag name and depth
        element_info = {"name": element.name, "depth": depth}

        # For special tags such as svg, keep only the name and depth, not the text
        if name in ["svg"]:
            element_info["text"] = None
            outline.append(element_info)
            return

        # Text content of the element
        element_info["text"] = element.string

        # Record the "id" attribute if present
        if "id" in element.attrs:
            element_info["id"] = element["id"]

        # Record the "class" attribute if present
        if "class" in element.attrs:
            element_info["class"] = element["class"]

        # Append the element info to the outline
        outline.append(element_info)

        # Recurse into the element's children
        for child in element.children:
            process_element(child, depth + 1)

    # Walk the children of the body element to build the document outline
    for element in soup.body.children:
        process_element(element, 1)

    # Return the completed outline
    return outline
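
A small usage sketch (the outline_to_text helper is illustrative; the same rendering logic appears later inside WriteCrawlerCode): fetch the page with WebBrowserEngine, build the outline with the get_outline function above, and render it as the indented tree that is handed to the LLM.

import asyncio
from metagpt.tools.web_browser_engine import WebBrowserEngine

def outline_to_text(outline) -> str:
    # Indent by depth, join the tag name with its classes, then append the text.
    return "\n".join(
        f"{' ' * i['depth']}{'.'.join([i['name'], *i.get('class', [])])}: {i['text'] or ''}"
        for i in outline
    )

async def main():
    page = await WebBrowserEngine().run("https://pitchhub.36kr.com/financing-flash")
    print(outline_to_text(get_outline(page)))

asyncio.run(main())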

Below is the page outline that is handed to the crawler engineer agent; the crawler engineer writes the crawler code based on it:

div: 
  section.css-170vgkc.ant-layout.ant-layout-has-sider: 
   aside.ant-layout-sider.ant-layout-sider-light: 
    div.ant-layout-sider-children: 
     div.layout-sider-top: 
      div.layout-sider-logo: 
       a.expanded-logo: 
      div.layout-sider-collapsed-icon: 
       img: 
     ul: 
      li: 
       a: 
        img: 
        span: 首页
        div.tip_info: 首页
      li: 
       a.layout-sider-active: 
        img: 
        span: 融资快报
        div.tip_info: 融资快报
      li: 
       a: 
        img: 
        span: 融资事件
        div.tip_info: 融资事件
      li: 
       a: 
        img: 
        span: 项目库
        div.tip_info: 项目库
      li: 
       a: 
        img: 
        span: 机构库
        div.tip_info: 机构库
      li: 
       a: 
        img: 
        span: 项目集
        div.tip_info: 项目集
      li: 
       a: 
        img: 
        span: 定向对接
        div.tip_info: 定向对接
      li: 
       a: 
        img: 
        span: 融通创新
        div.tip_info: 融通创新
   section.site-layout.content-transition-big.ant-layout: 
    main.ant-layout-content: 
     div.pc-layout-header-wrp: 
      div.css-1h3sp1q: 
       div.header-content: 
        div.header-row.css-vxgrp0: 
         div.css-tpekb2: 
          div.search: 
           div.css-1sg8hfp: 
            div.custom-search-input.ant-select-show-search.ant-select-auto-complete.ant-select.ant-select-combobox.ant-select-enabled: 
             div.ant-select-selection.ant-select-selection--single: 
              div.ant-select-selection__rendered: 
               div.ant-select-selection__placeholder: 公司/项目名/投资机构/赛道
               ul: 
                li.ant-select-search.ant-select-search--inline: 
                 div.ant-select-search__field__wrap: 
                  input.ant-input.ant-select-search__field: 
                  span.ant-select-search__field__mirror:  
              span.ant-select-arrow: 
               i.anticon.anticon-down.ant-select-arrow-icon: 
                svg: 
         div.css-aiepd8: 
          div.login: 
           a.backTo36Kr: 返回36氪
           div.css-0: 登录
            div.css-b1f3kf: 登录
             div.login-text: 登录
     div.pc-layout-content-wrapper-outer.css-w72mzi: 
      div.pc-layout-content-wrapper.css-cw5dhi: 
       div: 
        div.css-s23dpp: 
         div.css-vxgrp0: 
          div.css-tpekb2: 
           div.css-kpimdk: 
            h1.page-title: 融资快报
            div.content-flow: 
             div.newsflash-catalog-flow: 
              div.kr-loading-more: 
               div: 
                div.css-xle9x: 
                 div.item-title: 
                  span.type: 快讯
                  a.title: 睿普康完成过亿元A轮融资
                 div.item-desc: 
                  span: 近日,国内专业从事卫星通信、蜂窝通信及电源管理芯片研发企业合肥睿普康集成电路有限公司(简称“睿普康”)成功完成A轮融资,融资总额过亿元。本轮投资方包括深投控、阿玛拉、合肥海恒等专业机构。睿普康创立于2019年,是专业的卫星通信、蜂窝通信及电源管理芯片研发企业,专注于天地一体互联互通终端芯片及物联网芯片研发,所研发的芯片主要应用于汽车通信、智能手机、物联网、智能电网、智慧家庭等领域。(一元航天)
                 ... (rest of the page omitted)


Writing Crawler Code from Natural Language

import datetime
import sys
from typing import Optional
from uuid import uuid4
from aiocron import crontab
from metagpt.actions import UserRequirement
from metagpt.actions.action import Action
from metagpt.actions.action_node import ActionNode
from metagpt.roles import Role
from metagpt.schema import Message
from metagpt.tools.web_browser_engine import WebBrowserEngine
from metagpt.utils.common import CodeParser, any_to_str
from metagpt.utils.parse_html import _get_soup
from pytz import BaseTzInfo
from metagpt.logs import logger

# Define the ActionNodes first
LANGUAGE = ActionNode(
    key="Language",
    expected_type=str,
    instruction="Provide the language used in the project, typically matching the user's requirement language.",
    example="en_us",
)

CRON_EXPRESSION = ActionNode(
    key="Cron Expression",
    expected_type=str,
    instruction="If the user requires scheduled triggering, please provide the corresponding 5-field cron expression. "
    "Otherwise, leave it blank.",
    example="15 14 * * *",
)

CRAWLER_URL_LIST = ActionNode(
    key="Crawler URL List",
    expected_type=list[str],
    instruction="List the URLs user want to crawl. Leave it blank if not provided in the User Requirement.",
    example=["https://example1.com", "https://example2.com"],
)

PAGE_CONTENT_EXTRACTION = ActionNode(
    key="Page Content Extraction",
    expected_type=str,
    instruction="Specify the requirements and tips to extract from the crawled web pages based on User Requirement.",
    example="Retrieve the titles and content of articles published today.",
)

CRAWL_POST_PROCESSING = ActionNode(
    key="Crawl Post Processing",
    expected_type=str,
    instruction="Specify the processing to be applied to the crawled content, such as summarizing today's news.",
    example="Generate a summary of today's news articles.",
)

INFORMATION_SUPPLEMENT = ActionNode(
    key="Information Supplement",
    expected_type=str,
    instruction="If unable to obtain the Cron Expression, prompt the user to provide the time to receive subscription "
    "messages. If unable to obtain the URL List Crawler, prompt the user to provide the URLs they want to crawl. Keep it "
    "blank if everything is clear",
    example="",
)

NODES = [
    LANGUAGE,
    CRON_EXPRESSION,
    CRAWLER_URL_LIST,
    PAGE_CONTENT_EXTRACTION,
    CRAWL_POST_PROCESSING,
    INFORMATION_SUPPLEMENT,
]

PARSE_SUB_REQUIREMENTS_NODE = ActionNode.from_children("ParseSubscriptionReq", NODES)
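# ParseSubscriptionReq aggregates the child nodes defined above; fill() will have
# the LLM produce all of those fields in one structured reply.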

PARSE_SUB_REQUIREMENT_TEMPLATE = """
### User Requirement
{requirements}
"""

SUB_ACTION_TEMPLATE = """
## Requirements
Answer the question based on the provided context {process}. If the question cannot be answered, please summarize the context.

## context
{data}
"""

PROMPT_TEMPLATE = """Please complete the web page crawler parse function to achieve the User Requirement. The parse \
function should take a BeautifulSoup object as input, which corresponds to the HTML outline provided in the Context.

```python
from bs4 import BeautifulSoup

# only complete the parse function
def parse(soup: BeautifulSoup):
    ...
    # Return the object that the user wants to retrieve, don't use print
```

## User Requirement
{requirement}

## Context

The outline of the HTML page to scrape is shown below:

```tree
{outline}
```
"""

# Helper function: build the HTML/CSS outline view of a page
def get_outline(page):
    soup = _get_soup(page.html)
    outline = []

    def process_element(element, depth):
        name = element.name
        if not name:
            return
        if name in ["script", "style"]:
            return

        element_info = {"name": element.name, "depth": depth}

        if name in ["svg"]:
            element_info["text"] = None
            outline.append(element_info)
            return

        element_info["text"] = element.string
        # Check if the element has an "id" attribute
        if "id" in element.attrs:
            element_info["id"] = element["id"]

        if "class" in element.attrs:
            element_info["class"] = element["class"]
        outline.append(element_info)
        for child in element.children:
            process_element(child, depth + 1)

    for element in soup.body.children:
        process_element(element, 1)

    return outline

# Trigger: crontab
class CronTrigger:
    def __init__(self, spec: str, tz: Optional[BaseTzInfo] = None) -> None:
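        # aiocron expects a standard 5-field cron expression; if a 6-field spec
        # (with a leading seconds field) is supplied, drop the extra field.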
        segs = spec.split(" ")
        if len(segs) == 6:
            spec = " ".join(segs[1:])
        self.crontab = crontab(spec, tz=tz)

    def __aiter__(self):
        return self

    async def __anext__(self):
        await self.crontab.next()
        return Message(datetime.datetime.now().isoformat())

# Action that writes the crawler code
class WriteCrawlerCode(Action):
    async def run(self, requirement):
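        # The latest message carries the ParseSubRequirement result; read the URL
        # list and the extraction requirement from its structured instruct_content.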
        requirement: Message = requirement[-1]
        data = requirement.instruct_content.dict()
        urls = data["Crawler URL List"]
        query = data["Page Content Extraction"]

        codes = {}
        for url in urls:
            codes[url] = await self._write_code(url, query)
        return "\n".join(f"# {url}\n{code}" for url, code in codes.items())

    async def _write_code(self, url, query):
        page = await WebBrowserEngine().run(url)
        outline = get_outline(page)
        outline = "\n".join(
            f"{' '*i['depth']}{'.'.join([i['name'], *i.get('class', [])])}: {i['text'] if i['text'] else ''}"
            for i in outline
        )
        code_rsp = await self._aask(PROMPT_TEMPLATE.format(outline=outline, requirement=query))
        code = CodeParser.parse_code(block="", text=code_rsp)
        return code

# Action that parses the subscription requirement
class ParseSubRequirement(Action):
    async def run(self, requirements):
        requirements = "\n".join(i.content for i in requirements)
        context = PARSE_SUB_REQUIREMENT_TEMPLATE.format(requirements=requirements)
        node = await PARSE_SUB_REQUIREMENTS_NODE.fill(context=context, llm=self.llm)
        return node

# Action that runs the subscription agent
class RunSubscription(Action):
    async def run(self, msgs):
        from metagpt.roles.role import Role
        from metagpt.subscription import SubscriptionRunner

        code = msgs[-1].content
        req = msgs[-2].instruct_content.dict()
        urls = req["Crawler URL List"]
        process = req["Crawl Post Processing"]
        spec = req["Cron Expression"]
        SubAction = self.create_sub_action_cls(urls, code, process)
        SubRole = type("SubRole", (Role,), {})
        role = SubRole()
        role._init_actions([SubAction])
        runner = SubscriptionRunner()

        async def callback(msg):
            print(msg)

        await runner.subscribe(role, CronTrigger(spec), callback)
        await runner.run()

    @staticmethod
    def create_sub_action_cls(urls: list[str], code: str, process: str):
        modules = {}
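        # Split the concatenated code back into one chunk per URL using the
        # "# {url}" marker lines written by WriteCrawlerCode (walking the URL list
        # in reverse), and exec each chunk into its own throwaway module.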
        for url in urls[::-1]:
            code, current = code.rsplit(f"# {url}", maxsplit=1)
            name = uuid4().hex
            module = type(sys)(name)
            exec(current, module.__dict__)
            modules[url] = module

        class SubAction(Action):
            async def run(self, *args, **kwargs):
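                # Crawl every subscribed URL, run the generated parse() on each
                # page's soup, then ask the LLM to post-process the extracted data.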
                pages = await WebBrowserEngine().run(*urls)
                if len(urls) == 1:
                    pages = [pages]

                data = []
                for url, page in zip(urls, pages):
                    data.append(getattr(modules[url], "parse")(page.soup))
                return await self.llm.aask(SUB_ACTION_TEMPLATE.format(process=process, data=data))

        return SubAction

# Define the Crawler Engineer role
class CrawlerEngineer(Role):
    name: str = "John"
    profile: str = "Crawling Engineer"
    goal: str = "Write elegant, readable, extensible, efficient code"
    constraints: str = "The code should conform to standards like PEP8 and be modular and maintainable"

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

        self._init_actions([WriteCrawlerCode])
        self._watch([ParseSubRequirement])

# Define the Subscription Assistant role
class SubscriptionAssistant(Role):
    """Analyze user subscription requirements."""

    name: str = "Grace"
    profile: str = "Subscription Assistant"
    goal: str = "analyze user subscription requirements to provide personalized subscription services."
    constraints: str = "utilize the same language as the User Requirement"

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

        self._init_actions([ParseSubRequirement, RunSubscription])
        self._watch([UserRequirement, WriteCrawlerCode])

    async def _think(self) -> bool:
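        # Pick the next action based on what produced the latest message:
        # a new UserRequirement -> ParseSubRequirement (state 0),
        # freshly written crawler code -> RunSubscription (state 1).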
        cause_by = self._rc.history[-1].cause_by
        if cause_by == any_to_str(UserRequirement):
            state = 0
        elif cause_by == any_to_str(WriteCrawlerCode):
            state = 1

        if self._rc.state == state:
            self._rc.todo = None
            return False
        self._set_state(state)
        return True

    async def _act(self) -> Message:
        logger.info(f"{self._setting}: ready to {self._rc.todo}")
        response = await self._rc.todo.run(self._rc.history)
        msg = Message(
            content=response.content,
            instruct_content=response.instruct_content,
            role=self.profile,
            cause_by=self._rc.todo,
            sent_from=self,
        )
        self._rc.memory.add(msg)
        return msg

if __name__ == "__main__":
    import asyncio
    from metagpt.team import Team

    team = Team()
    team.hire([SubscriptionAssistant(), CrawlerEngineer()])
    team.run_project("从36kr创投平台https://pitchhub.36kr.com/financing-flash爬取所有初创企业融资的信息,获取标题,链接, 时间,总结今天的融资新闻,然后在14:20发送给我")
    asyncio.run(team.run())