mini-swe-agent: Tools and Environments

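mini has exactly one weapon: bash. It drops every custom tool (file read, file write, search, browser, MCP) and compresses that whole ecosystem into a single tool definition plus one environment abstraction. models/utils/ contains only two action parsers (toolcall and text-based), and every implementation under environments/ does one thing: pass the command string from the action to a subshell, collect the output, and check for the completion signal. These notes cover how the tool is registered and parsed, how the environments are implemented, and the reasoning behind the "no shell session" design.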

1. The only tool: bash

1.1 Tool definition

models/utils/actions_toolcall.py defines the only tool schema in the whole system:

BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Execute a bash command",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The bash command to execute",
                }
            },
            "required": ["command"],
        },
    },
}

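That is all. There is no path parameter (use cd), no start_line/end_line (use sed -n, or nl plus sed), no search_regex (use grep), and no write_file (use cat with a heredoc). Each removed tool corresponds to a one-line bash command, and those commands are taught to the model explicitly in the "Useful command examples" part of the prompt.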
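
As a rough illustration of that mapping, the sketch below runs the bash counterpart of each removed tool through subprocess.run. The helper name sh, the file notes.txt, and its contents are invented for this example; they are not taken from the mini prompt.

```python
import subprocess
import tempfile


def sh(command: str, cwd: str) -> str:
    """Run one command in a fresh shell and return its combined output."""
    result = subprocess.run(
        command, shell=True, text=True, cwd=cwd,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    return result.stdout


with tempfile.TemporaryDirectory() as workdir:
    # "write_file" becomes cat with a heredoc.
    sh("cat <<'EOF' > notes.txt\nalpha\nbeta\ngamma\nEOF", workdir)
    # "read lines 2-3" becomes sed -n.
    view = sh("sed -n '2,3p' notes.txt", workdir)
    # "search_regex" becomes grep (-n adds line numbers).
    hits = sh("grep -n '^b' notes.txt", workdir)

print(view, end="")
print(hits, end="")
```

Passing cwd here plays the role of the missing path parameter; inside a real command the model would write cd /path && ... instead.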

1.2 Tool parsing

LitellmModel._parse_actions extracts tool_calls from the litellm response:

def _parse_actions(self, response) -> list[dict]:
    tool_calls = response.choices[0].message.tool_calls or []
    return parse_toolcall_actions(tool_calls, format_error_template=self.config.format_error_template)

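parse_toolcall_actions validates strictly:

No tool_calls → FormatError, reminding the model that "every reply must contain at least one tool call"
Tool name is not bash → FormatError, saying the tool is not supported
Argument JSON fails to parse → FormatError, pointing out the malformed arguments
No command argument found → FormatError
All checks pass → returns [{"command": "...", "tool_call_id": "..."}]

FormatError is not fatal. It carries a user message that states why the model's previous turn failed and what the correct format is. The model sees that message on its next turn and can correct itself.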
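
The validation steps above can be sketched roughly as follows. This is not the real parse_toolcall_actions: the function name parse_toolcalls_sketch, the error wording, and the make_call helper (which fakes a tool-call object with SimpleNamespace) are all assumptions made for illustration.

```python
import json
from types import SimpleNamespace


class FormatError(Exception):
    """Carries the user message that is sent back to the model."""

    def __init__(self, message: dict):
        super().__init__(message["content"])
        self.message = message


def parse_toolcalls_sketch(tool_calls: list) -> list[dict]:
    def fail(reason: str):
        raise FormatError({"role": "user", "content": reason})

    if not tool_calls:
        fail("Every reply must contain at least one tool call.")
    actions = []
    for call in tool_calls:
        if call.function.name != "bash":
            fail(f"Unsupported tool: {call.function.name}")
        try:
            args = json.loads(call.function.arguments)
        except json.JSONDecodeError as exc:
            fail(f"Could not parse tool arguments as JSON: {exc}")
        if not isinstance(args, dict) or "command" not in args:
            fail("Missing required argument: command")
        actions.append({"command": args["command"], "tool_call_id": call.id})
    return actions


def make_call(name: str, arguments: str, call_id: str = "call_1"):
    """Hypothetical stand-in for the tool-call object of an API response."""
    return SimpleNamespace(id=call_id, function=SimpleNamespace(name=name, arguments=arguments))


print(parse_toolcalls_sketch([make_call("bash", '{"command": "ls -la"}')]))
try:
    parse_toolcalls_sketch([make_call("read_file", "{}")])
except FormatError as err:
    print(err.message)
```

The point of the design is visible in the except branch: the error is an ordinary message that can be appended to the history, not a crash.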

1.3 Formatting tool results

format_toolcall_observation_messages turns execution results into tool-result messages the model can consume (excerpt):

def format_toolcall_observation_messages(*, actions, outputs, observation_template, ...):
    not_executed = {"output": "", "returncode": -1, "exception_info": "action was not executed"}
    padded_outputs = outputs + [not_executed] * (len(actions) - len(outputs))
    results = []
    for action, output in zip(actions, padded_outputs):
        content = Template(observation_template).render(output=output, **template_vars)
        msg = {
            "content": content,
            "tool_call_id": action.get("tool_call_id"),
            "role": "tool" if "tool_call_id" in action else "user",
        }

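Two details are worth noting:

padding: if an action was never executed because of an exception (for example, it came after an action in the list that had already raised), its slot is filled with not_executed. No message goes missing, so every action in the history still pairs with exactly one observation.
role selection: messages that have a tool_call_id get role="tool" (the native tool call protocol); those without get role="user" (the text protocol or human mode).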
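
To make padding and role selection concrete, here is a small self-contained sketch. It swaps the Jinja observation_template for a plain f-string, and the function name format_observations_sketch is made up, so treat it as an approximation of the behavior described above rather than the library code.

```python
def format_observations_sketch(actions: list[dict], outputs: list[dict]) -> list[dict]:
    not_executed = {"output": "", "returncode": -1, "exception_info": "action was not executed"}
    # Pad so that every action has exactly one observation.
    padded = outputs + [not_executed] * (len(actions) - len(outputs))
    messages = []
    for action, output in zip(actions, padded):
        content = f"<returncode>{output['returncode']}</returncode>\n<output>{output['output']}</output>"
        if "tool_call_id" in action:
            # Native tool call protocol.
            messages.append({"role": "tool", "tool_call_id": action["tool_call_id"], "content": content})
        else:
            # Text protocol or human mode.
            messages.append({"role": "user", "content": content})
    return messages


actions = [
    {"command": "ls", "tool_call_id": "call_1"},
    {"command": "cat README.md", "tool_call_id": "call_2"},  # never executed
    {"command": "pwd"},                                      # never executed, no tool_call_id
]
outputs = [{"output": "README.md\n", "returncode": 0, "exception_info": ""}]
for message in format_observations_sketch(actions, outputs):
    print(message)
```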

1.4 Truncating long output

The default observation_template in mini.yaml truncates output automatically once it exceeds 10000 characters:

{
  "returncode": 0,
  "output_head": "<first 5000 characters>",
  "output_tail": "<last 5000 characters>",
  "elided_chars": 12500,
  "warning": "Output too long."
}

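Truncation does not leave the model guessing. The warning field states plainly that the output was too long, and elided_chars says how much was cut from the middle. The model can usually make a judgment from the head and tail alone, and if it really needs the middle it can fetch a precise line range with something like sed -n '5000,5100p'.

default.yaml (text-based mode) takes a different approach: above 10000 characters it refuses to show the full text at all and returns only head plus tail together with an instruction telling the model to retry with a more precise command. This is the stricter policy: "if unsure, look again, do not guess".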
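
The head/tail truncation can be expressed in a few lines of Python. In mini the logic lives in the YAML template, so the function below (truncate_output_sketch, a made-up name with made-up parameter names) is only an equivalent sketch using the thresholds quoted above.

```python
def truncate_output_sketch(output: str, limit: int = 10000, keep: int = 5000) -> dict:
    """Return the output unchanged if short, else head + tail + bookkeeping."""
    if len(output) <= limit:
        return {"output": output}
    return {
        "output_head": output[:keep],
        "output_tail": output[-keep:],
        "elided_chars": len(output) - 2 * keep,
        "warning": "Output too long.",
    }


long_output = "x" * 22500
observation = truncate_output_sketch(long_output)
print(observation["elided_chars"], observation["warning"])
```

With 22500 characters of output, 5000 are kept at each end and 12500 are elided, which matches the elided_chars value in the example above.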

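All of that is the model side: defining the tool, parsing tool calls, and formatting tool results. Where and how the parsed {"command": "ls -la"} finally gets executed is the environment side's problem.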

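2. Environment abstraction

2.1 A uniform interface

Every environment implementation exposes the same small interface of three methods, with execute as the core: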

class Environment(Protocol):
    def execute(self, action: dict, cwd: str = "", *, timeout: int | None = None) -> dict[str, Any]: ...
    def get_template_vars(self, **kwargs) -> dict[str, Any]: ...
    def serialize(self) -> dict: ...

execute takes an action (which carries a command field) and returns a result dictionary:

{"output": "...", "returncode": 0, "exception_info": ""}

Whether the backend is subprocess (local), docker exec (container), or singularity exec (HPC), this contract stays the same. The agent loop does not need to know where the command runs.
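
One way to see why this contract matters: any object with these three methods can stand in for a real environment. The RecordingEnvironment class and the run_step function below are hypothetical, written only to show that calling code depends on the interface and not on the backend.

```python
from __future__ import annotations

from typing import Any, Protocol


class Environment(Protocol):
    def execute(self, action: dict, cwd: str = "", *, timeout: int | None = None) -> dict[str, Any]: ...
    def get_template_vars(self, **kwargs) -> dict[str, Any]: ...
    def serialize(self) -> dict: ...


class RecordingEnvironment:
    """Hypothetical stand-in that records commands instead of running them."""

    def __init__(self) -> None:
        self.commands: list[str] = []

    def execute(self, action: dict, cwd: str = "", *, timeout: int | None = None) -> dict[str, Any]:
        self.commands.append(action["command"])
        return {"output": f"(pretend output of {action['command']})", "returncode": 0, "exception_info": ""}

    def get_template_vars(self, **kwargs) -> dict[str, Any]:
        return {"backend": "recording", **kwargs}

    def serialize(self) -> dict:
        return {"environment": "recording"}


def run_step(env: Environment, action: dict) -> dict:
    # The caller only relies on the contract, never on the backend.
    return env.execute(action)


env = RecordingEnvironment()
print(run_step(env, {"command": "ls -la"}))
```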

2.2 Why no shell session

The FAQ contains an important passage that spells out the core difference between mini and SWE-agent or other agent frameworks:

Executes actions with subprocess.run — every action is completely independent (as opposed to keeping a stateful shell session running). This makes it trivial to execute the actions in sandboxes (literally just switch out subprocess.run with docker exec) and to scale up effortlessly. Seriously, this is a big deal.

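A stateful shell session means:

Environment variables accumulate across commands (export VAR=value keeps affecting every later command)
The effect of cd persists (the current directory becomes hidden state)
Background processes can linger (python server.py &)
The execution context of each action depends on the entire history of earlier actions
When sandboxing, you have to keep a long-running process alive for every agent instance

A stateless subshell means:

Every subprocess.run starts a brand-new shell
There is no persistent "current directory": the model has to write cd /path && command
Environment variables do not carry over: the model writes VAR=value command
A command killed on timeout takes its shell with it, so no long-lived session is left behind in an unknown state
When sandboxing, you only swap subprocess.run for docker exec container_id

So mini shifts the cost of managing shell session state onto the model. The prompt tells the model so explicitly in two places, "Directory or environment variable changes are not persistent", and gives guidance on how to write commands accordingly. This looks like extra burden on the model, but at the current level of LLM capability that burden is much smaller than the code complexity of maintaining a shell session state machine.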
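
The statelessness is easy to observe directly. The snippet below is a small experiment, not mini code; the helper run and the variable name MINI_DEMO_VAR are invented, and it assumes MINI_DEMO_VAR is not already set in your environment.

```python
import subprocess


def run(command: str) -> str:
    """Run one command in a fresh shell, as a stateless environment would."""
    result = subprocess.run(
        command, shell=True, text=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    return result.stdout.strip()


start_dir = run("pwd")
run("cd / && export MINI_DEMO_VAR=hello")    # state changes die with this shell
after_cd = run("pwd")                        # same directory as before
leaked = run('echo "${MINI_DEMO_VAR:-unset}"')

# The pattern the prompt asks for: put everything a command needs on one line.
combined = run("cd / && MINI_DEMO_VAR=hello sh -c 'pwd; echo $MINI_DEMO_VAR'")

print(after_cd == start_dir, leaked)
print(combined)
```

Neither the cd nor the export survives into the next call, which is why the model has to restate directory and variables every time.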

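The stateless subshell establishes the premise that every command is an independent process. Inside that process, though, many Linux tools default to behavior designed for interactive users, which is unfriendly to an agent. So before subprocess.run is actually called, a set of environment variables is preset to correct those defaults.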

2.3 Preset environment variables

Both the local and docker environments preset a group of environment variables:

env:
  PAGER: cat
  MANPAGER: cat
  LESS: -R
  PIP_PROGRESS_BAR: 'off'
  TQDM_DISABLE: '1'

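Why these are needed:

PAGER=cat / MANPAGER=cat: many Linux commands (git log, man, systemctl status) page their output through less by default, and less is interactive. If a pager starts inside a subprocess and waits for keyboard input, the command hangs. These variables force cat in place of the interactive pager.
LESS=-R: if less still ends up being used (for example because the model sets PAGER=less explicitly), -R makes it pass ANSI color escape sequences through unchanged instead of printing them as visible control codes.
PIP_PROGRESS_BAR=off / TQDM_DISABLE=1: by default pip installs and Python progress bars emit large amounts of control characters (carriage-return overwrites, color codes), which fill the context with unreadable noise.

These are not business-logic settings; they are infrastructure for "making bash behave predictably in a non-interactive context".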
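
A quick way to check what a child process actually sees is to merge the presets over the inherited environment and echo them back. The dictionary name PRESET_ENV is an assumption; the values are the ones listed above.

```python
import os
import subprocess

PRESET_ENV = {
    "PAGER": "cat",
    "MANPAGER": "cat",
    "LESS": "-R",
    "PIP_PROGRESS_BAR": "off",
    "TQDM_DISABLE": "1",
}

result = subprocess.run(
    'echo "$PAGER $MANPAGER $PIP_PROGRESS_BAR $TQDM_DISABLE"',
    shell=True,
    text=True,
    env=os.environ | PRESET_ENV,  # presets win over inherited values (Python 3.9+)
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
print(result.stdout, end="")
```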

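So how do these environment variables reach the child process, and what does execution look like in detail? That brings us to the concrete implementations: LocalEnvironment and DockerEnvironment.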

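2.4 LocalEnvironment: running commands on the local machine

In one sentence, LocalEnvironment takes the command string the model supplied, runs it on the local machine with subprocess.run, collects the result, and returns it. The whole thing is a single function call: a {"command": "..."} dictionary goes in and an output dictionary comes out.

It is easier to follow step by step than by reading the code in one go. The steps below are in execution order.

Step 1: take the command string out of the action

command = action.get("command", "")

action is the dictionary parsed in section 1.2, for example {"command": "ls -la", "tool_call_id": "..."}. This line pulls out the command string. If the action has no command field, the result is an empty string (subprocess.run("", shell=True) runs a shell that does nothing).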

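Step 2: call subprocess.run

result = subprocess.run(
    command,              # ① the command to run
    shell=True,           # ② parsed and executed through /bin/sh
    text=True,            # ③ treat output as text instead of returning bytes
    cwd=cwd or self.config.cwd or os.getcwd(),  # ④ directory to run in
    env=os.environ | self.config.env,           # ⑤ environment variables passed to the child process
    timeout=timeout or self.config.timeout,     # ⑥ maximum run time in seconds
    stdout=subprocess.PIPE,                     # ⑦ capture standard output
    stderr=subprocess.STDOUT,                   # ⑧ merge standard error into standard output
)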

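Step 3: exception handling

So far only the subprocess.run call itself has been shown. In practice the whole call is wrapped in a try-except (omitted from the snippet above):

Command times out → subprocess.TimeoutExpired → returns returncode: -1, with the output noting that the command timed out
Any other exception → tries to recover whatever partial stdout/stderr exists and records the exception information

Whatever goes wrong, the agent does not crash. The error is written into the output dictionary and returned to the model, which can adjust its next move accordingly.

Step 4: build the return dictionary from the result

Two fields of the object returned by subprocess.run matter here:

result.stdout: the full output of the command (stderr has been merged into it)
result.returncode: the exit code; 0 means success, anything else means some kind of failure

These two fields are packed into the environment layer's uniform return format: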

output = {"output": result.stdout, "returncode": result.returncode, "exception_info": ""}

The dictionary then goes back to the agent loop, where format_toolcall_observation_messages from section 1.3 renders it into the message the model sees on its next turn.
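
Putting the four steps together gives something like the function below. It is a simplified sketch, not the LocalEnvironment source: the name execute_sketch is invented, configuration is reduced to two parameters, and the timeout notice is placed in exception_info, which may differ from where mini puts it.

```python
import os
import subprocess


def execute_sketch(action: dict, cwd: str = "", *, timeout: float = 30.0) -> dict:
    command = action.get("command", "")  # step 1
    try:
        result = subprocess.run(         # step 2
            command, shell=True, text=True, cwd=cwd or os.getcwd(),
            timeout=timeout, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        )
    except subprocess.TimeoutExpired as exc:  # step 3
        partial = exc.output or ""
        if isinstance(partial, bytes):  # partial output can arrive as bytes
            partial = partial.decode(errors="replace")
        return {"output": partial, "returncode": -1,
                "exception_info": f"command timed out after {timeout}s"}
    except Exception as exc:
        return {"output": "", "returncode": -1,
                "exception_info": f"{type(exc).__name__}: {exc}"}
    # step 4
    return {"output": result.stdout, "returncode": result.returncode, "exception_info": ""}


print(execute_sketch({"command": "echo hi; echo oops >&2"}))
print(execute_sketch({"command": "exit 3"}))
print(execute_sketch({"command": "sleep 2"}, timeout=0.2))
```

Every path returns the same three keys, so the caller never has to distinguish a failed command from a failed execution.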

One extra feature: telling the model which platform it runs on

def get_template_vars(self, **kwargs) -> dict:
    return recursive_merge(
        self.config.model_dump(),
        platform.uname()._asdict(),  # injects system, release, version, machine
        os.environ,
        kwargs,
    )

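platform.uname() returns information about the current system: system (Linux or Darwin), release (kernel version number), version (detailed version string), and machine (architecture, such as x86_64). These variables are merged into the template variables used when rendering the system prompt. So the {{system}} {{release}} {{version}} {{machine}} at the top of the system prompt renders to something like Linux 6.6.87.2 #1 SMP x86_64 on Linux, and Darwin 24.6.0 ... on macOS. Knowing the platform lets the model pick the right sed -i syntax (macOS needs -i '', Linux takes plain -i) and other platform-specific commands.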
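
The merge-then-render idea can be tried in isolation. This sketch replaces recursive_merge with a flat dictionary merge and Jinja with str.format, and it reads the uname fields explicitly; template_vars_sketch and the sample config are invented for the example.

```python
import platform


def template_vars_sketch(config: dict, **kwargs) -> dict:
    """Flat stand-in for recursive_merge: later sources override earlier ones."""
    uname = platform.uname()
    platform_vars = {
        "system": uname.system,
        "release": uname.release,
        "version": uname.version,
        "machine": uname.machine,
    }
    return {**config, **platform_vars, **kwargs}


template = "{system} {release} {version} {machine}"  # str.format instead of Jinja
header = template.format(**template_vars_sketch({"cwd": "/tmp"}))
print(header)
```

The printed line is whatever your own machine reports, which is exactly the string a prompt header built this way would contain.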

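LocalEnvironment runs commands directly on the current machine, which suits everyday development. Batch evaluation and untrusted tasks need isolation, and that is where DockerEnvironment comes in.

2.5 DockerEnvironment: running commands in an isolated container

If the model runs untrusted code (say, automated bug-fix scripts during batch evaluation, or a one-click deploy script the agent generated that happens to contain rm -rf), Local mode affects your real system directly.

The fix is to run commands inside an isolated container. That is DockerEnvironment.

Docker has to be installed separately; it does not ship with the system. Think of it as a lightweight isolated sandbox: unless you mount host directories into it, processes in the container cannot see the files on your real machine or modify your real system. If a command wrecks something inside the container, the worst case is deleting the container and recreating it; the host is unaffected.

The core of DockerEnvironment is not "complex container management" but a single substitution: subprocess.run(command, shell=True) becomes subprocess.run(["docker", "exec", container_id, "bash", "-lc", command]).

The container is started at initialization: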

def _start_container(self):
    cmd = [
        "docker", "run", "-d", "--name", container_name,
        "-w", self.config.cwd,
        *self.config.run_args,
        self.config.image,
        "sleep", self.config.container_timeout,
    ]

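sleep 2h runs as the container's main command: it keeps the container alive without doing anything, and commands are then injected with docker exec. Once the TTL expires the container exits on its own. cleanup runs (timeout 60 docker stop || docker rm -f) & so that teardown happens asynchronously and does not block.

How a full docker exec command is assembled: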

cmd = [self.config.executable, "exec", "-w", cwd]
for key, value in self.config.env.items():
    cmd.extend(["-e", f"{key}={value}"])
cmd.extend([self.container_id, *self.config.interpreter, command])

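interpreter defaults to ["bash", "-lc"]. -l (login shell) makes bash read /etc/profile and the user's profile file (~/.bash_profile, ~/.bash_login, or ~/.profile, which often sources ~/.bashrc in turn); many executables installed through conda or pip depend on the PATH set there. -c tells bash to execute the string that follows as a command. You can configure interpreter: ["python", "-c"] to have the model write Python instead of bash: the tool is still named bash, but the actual executor is swappable.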

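Whether Local or Docker, the same question arises after a command finishes: how does the agent know the model considers the task done? That is completion detection, the last piece of logic shared by all environments.

2.6 Shared completion-detection logic

_check_finished is a small piece of logic reused by every environment (local, docker, singularity):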

def _check_finished(self, output: dict):
    lines = output.get("output", "").lstrip().splitlines(keepends=True)
    if lines and lines[0].strip() == "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT" and output["returncode"] == 0:
        submission = "".join(lines[1:])
        raise Submitted({
            "role": "exit", "content": submission,
            "extra": {"exit_status": "Submitted", "submission": submission},
        })

The text lines after COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT are passed to the agent as the submission. This lets the model mark completion and deliver the final result in a single action:

echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
cat final_patch.diff

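Everything from the second line on becomes the submission. Because the check also requires returncode == 0, and a multi-line command reports the exit status of the last command it ran, that final command has to succeed for the submission to register. A more elaborate variant: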

echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
echo "### Changes"
git diff
echo "### Test Results"
python -m pytest
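
The detection rule can be exercised with hand-made output dictionaries. The sketch below keeps the same condition as _check_finished but simplifies the exception payload to a plain string; check_finished_sketch and submission_of are made-up names.

```python
class Submitted(Exception):
    """Raised when the completion marker is detected."""


MARKER = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"


def check_finished_sketch(output: dict) -> None:
    lines = output.get("output", "").lstrip().splitlines(keepends=True)
    if lines and lines[0].strip() == MARKER and output["returncode"] == 0:
        raise Submitted("".join(lines[1:]))


def submission_of(output: dict):
    """Return the submission text, or None if the task is not finished."""
    try:
        check_finished_sketch(output)
    except Submitted as done:
        return str(done)
    return None


patch = "diff --git a/app.py b/app.py\n"
print(repr(submission_of({"output": f"{MARKER}\n{patch}", "returncode": 0})))
print(repr(submission_of({"output": f"{MARKER}\n{patch}", "returncode": 1})))
print(repr(submission_of({"output": f"building...\n{MARKER}\n", "returncode": 0})))
```

Only the first case counts as a submission: the marker must be the first non-blank line and the command must exit with 0.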

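Everything above (tool definition, tool parsing, environment execution, completion detection) revolves around toolcall mode, where the model names the command to run through the API's native structured tool_calls field. Not every model supports that, so mini keeps a fallback.

3. Text-based mode

3.1 Why text-based still exists

Before v2, mini used text-based mode: the model embeds a specially tagged bash code block in its reply and the system extracts it with a regular expression. v2 defaults to toolcall mode, but text-based is kept for:

Models without native tool call support (some local models, older APIs)
Setups that use the completions endpoint rather than the responses endpoint
Fine-tuning data, where trajectories in text-based format are more direct

3.2 Regex parsing

parse_regex_actions in actions_text.py extracts the code block with a regular expression: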

def parse_regex_actions(content: str, *, action_regex: str, format_error_template: str) -> list[dict]:
    actions = [a.strip() for a in re.findall(action_regex, content, re.DOTALL)]
    if len(actions) != 1:
        raise FormatError(...)
    return [{"command": action} for action in actions]

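Note: text-based mode strictly limits each turn to one command; len(actions) != 1 raises FormatError. This contrasts with toolcall mode, which supports several tool calls in one reply. The reason is not a technical limitation: text-based parsing is inherently more fragile than structured tool calls, so the extra restriction is there to reduce the rate of format errors.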
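
Here is a self-contained version of the one-command rule. The regular expression is an assumption modeled on the mswea_bash_command fence that appears in text-based transcripts; the real action_regex comes from configuration and may differ. The fence is built from a FENCE constant purely to keep the example readable.

```python
import re

FENCE = "`" * 3
# Assumed pattern: a fenced block tagged mswea_bash_command.
ACTION_REGEX = FENCE + r"mswea_bash_command\s*\n(.*?)\n" + FENCE


class FormatError(Exception):
    pass


def parse_regex_actions_sketch(content: str) -> list[dict]:
    actions = [a.strip() for a in re.findall(ACTION_REGEX, content, re.DOTALL)]
    if len(actions) != 1:
        raise FormatError(f"Expected exactly one command block, found {len(actions)}.")
    return [{"command": actions[0]}]


def block(command: str) -> str:
    """Wrap a command the way a text-based reply would."""
    return f"{FENCE}mswea_bash_command\n{command}\n{FENCE}"


print(parse_regex_actions_sketch("THOUGHT: list files\n" + block("ls -la")))
try:
    parse_regex_actions_sketch(block("ls") + "\n" + block("pwd"))
except FormatError as err:
    print(err)
```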

3.3 Message format differences between the two

The message flow produced by toolcall mode:

system: ...
user: Please solve this issue...
assistant: [tool_calls: bash {"command": "ls"}]
tool: {"returncode": 0, "output": "README.md\nsrc/"}
assistant: [tool_calls: bash {"command": "cat README.md"}]
tool: {"returncode": 0, "output": "# My Project..."}

The message flow produced by text-based mode:

system: ...
user: Please solve this issue...
assistant: THOUGHT: ... \n```mswea_bash_command\nls\n```
user: <returncode>0</returncode>\n<output>README.md\nsrc/</output>
assistant: THOUGHT: ... \n```mswea_bash_command\ncat README.md\n```
user: <returncode>0</returncode>\n<output># My Project...</output>

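In toolcall mode, tool results are tool-role messages; in text-based mode they are user-role messages. The difference comes from the API protocol: the toolcall API requires each tool-role message to match a tool_call_id, whereas text-based mode treats the result as just another message from the user.

That covers the full path through mini's tools and environments.

A minimal mini-swe-agent reproduction

With the same Xiaomi model, the same prompt, and the same task, repeated runs can turn out completely differently:
Run A: ls -la → echo DONE → finished
Run B: ls -la → ls -la → ls capability-view/ → out of steps
Run C: ls && echo DONE (merged into one command)
glm-5.1 tends to be more stable.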

import json
import os
import subprocess
from openai import OpenAI

client = OpenAI(
    base_url="https://aihubmix.com/v1",
    api_key=os.getenv("AIHUBMIX_API_KEY"),
)
MODEL = "coding-glm-5.1-free"

BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Execute a bash command",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}


class Submitted(Exception):
    pass


class FormatError(Exception):
    def __init__(self, *messages: dict):
        super().__init__(str(messages))
        self.messages = list(messages)


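def query():
    response = client.chat.completions.create(
        model=MODEL,
        # "extra" is local bookkeeping; strict OpenAI-compatible endpoints may reject unknown keys.
        messages=[{k: v for k, v in m.items() if k != "extra"} for m in messages],
        tools=[BASH_TOOL],
    )
    choice = response.choices[0]
    tool_calls = choice.message.tool_calls or []

    actions = []
    for tc in tool_calls:
        try:
            args = json.loads(tc.function.arguments)
            actions.append({"command": args["command"], "tool_call_id": tc.id})
        except (json.JSONDecodeError, KeyError, TypeError) as e:
            # Report malformed arguments back to the model instead of crashing.
            raise FormatError({"role": "user", "content": f"Invalid bash tool arguments: {e!r}"})

    # mini-swe-agent style: the model must always call the tool
    if not actions:
        raise FormatError({
            "role": "user",
            "content": "You must call the bash tool. If the task is complete, run echo DONE.",
        })

    message = {
        "role": "assistant",
        "content": choice.message.content or "",
        # Tool results must answer an assistant message that carries the matching tool_calls.
        "tool_calls": [tc.model_dump() for tc in tool_calls],
        "extra": {"actions": actions},
    }
    messages.append(message)
    return message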


def execute(message):
    for action in message["extra"]["actions"]:
        command = action["command"]
        tool_call_id = action["tool_call_id"]
        print(f"[Executing]: {command}")
        # Merge stderr into stdout so the model also sees error messages.
        result = subprocess.run(command, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        output = result.stdout
        print(output, end="")
        lines = output.strip().splitlines()
        if lines and (lines[0].strip() == "DONE" or lines[-1].strip() == "DONE"):
            raise Submitted("Task complete")
        messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call_id,
                "content": json.dumps({"returncode": result.returncode, "output": output}),
            }
        )

task = input("Task: ").strip()
STEP_LIMIT = 3
messages = [
    {"role": "system", "content": "You are a coding assistant that can run bash commands. When you are done, run the command echo DONE to end the task."},
    {"role": "user", "content": f"{task}"},
]
step = 0

while True:
    if step >= STEP_LIMIT:
        print("Out of steps.")
        break
    step += 1  # count every attempt, including format errors, so the loop always ends
    try:
        execute(query())
    except FormatError as e:
        print(f"[FormatError]: {e}")
        messages.extend(e.messages)
    except Submitted:
        print("Done.")
        break

posted @ 2026-05-06 09:12 湾仔码农
