A benchmark breakdown of the three tiers up to 397B, and what the self-improving framework means in engineering terms

1. Background

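On the morning of 2026-06-30 a repository with 215 points showed up on the HN front page: Ornith-1.0, a self-trained family of open-source agentic coding models in 4 size tiers (9B Dense / 35B MoE / 397B MoE / plus an intermediate tier the README barely describes), MIT License, 664 stars / 58 forks (measured 2026-06-30 11:00 UTC via https://api.github.com/repos/deepreinforce-ai/Ornith-1). The clickbait-sounding "self-improving" made me assume hype at first, but one headline number convinced me to reproduce the setup: on Terminal-Bench 2.1 (Claude Code harness) the 397B-A17B scores 78.2 against 48.6 for Qwen3.5-397B, a lead of about 30 points (29.6). This post sets out to: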

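Walk through vLLM deployment + the OpenAI-compatible API + an actual simple coding task, to see real latency and tool-call quality on the 9B tier (35B remains untested)
Take the per-tier benchmark tables apart to see which results look like leaderboard grinding and which reflect real differences from model architecture / training method
Respond to the core disputes in the HN comments: simonw's "clickbait self-improving" objection / juliangoldsmith's "the benchmark ranking makes no sense" objection / Narew's "3x faster" measurement / v3ss0n's "long-session tool calls hallucinate" measurement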

2. Ornith-1.0's core selling point: not a "self-evolving model" but a "self-training framework"

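This is the most important clarification in this post, and one I only arrived at after reading the README and the whole HN thread. In Ornith-1.0, "self-improving" describes the training pipeline, not model files that evolve on their own after deployment: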

Ornith-1.0 employs RL to learn to generate not only solution rollouts, but also the scaffold that drive those rollouts. By jointly optimizing the scaffold and the resulting solution, the model discovers better search trajectories and generates higher-quality solutions. (quoted from the README Highlights)

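In other words, during RL training the model learns not only how to roll out solutions but also how to generate the scaffold / harness / task-specific prompt template that drives those rollouts. With the two optimized jointly, the model can build a better search trajectory for the task at hand on its own at inference time, instead of depending on human engineers to tune the prompt / scaffold for each task.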

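This training paradigm differs from conventional post-training + SFT as follows:

Conventional: humans write the scaffold (e.g. system prompt / tool spec), and the model learns only the solution
Ornith: the model learns "how to generate a scaffold for this task" + "how to roll out a solution", with joint optimization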

This lines up with the "clickbait" objection simonw raised in HN comment [6]:

It doesn't self-improve, that's a misleading headline. As far as I can tell they trained it by running their own reinforcement learning on top of Qwen and Gemma 4 (not sure how they combined weights from both, or if they used Qwen as the basis and Gemma 4 to help train?) - so the "self-improving" is about their training process, not how you use the weights. (https://news.ycombinator.com/item?id=48722052 depth=1)

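My take: simonw is not wrong, but the Ornith team is not lying either. "Self-improving" is genuinely ambiguous: a casual reader will take it to mean "the weights iterate on themselves at inference time", whereas what Ornith actually means is "the RL training pipeline jointly optimizes scaffold + solution". What readers should care about more is whether this training paradigm actually works, and that comes down to the benchmarks.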
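To make "jointly optimizing scaffold + solution" concrete, here is a toy sketch. It is not Ornith's training code (the README does not disclose that): it is a two-stage REINFORCE bandit in which a policy first picks a scaffold, then a solution strategy, and a single reward updates both choices. The success probabilities, names, and hyperparameters are all made up for illustration.

```python
import math
import random

# Hypothetical toy task: success probability for each (scaffold, strategy) pair.
# These numbers are invented for illustration and do not come from the README.
SUCCESS_PROB = [
    [0.10, 0.20],  # scaffold 0 paired with strategy 0 / 1
    [0.30, 0.90],  # scaffold 1 paired with strategy 0 / 1
]


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=5000, lr=0.1, seed=0):
    """REINFORCE over a two-stage policy: pick a scaffold, then a strategy."""
    rng = random.Random(seed)
    scaffold_logits = [0.0, 0.0]
    strategy_logits = [[0.0, 0.0], [0.0, 0.0]]  # one row per scaffold
    baseline = 0.0  # running average reward, used to reduce variance

    for _ in range(steps):
        p_scaffold = softmax(scaffold_logits)
        s = rng.choices([0, 1], weights=p_scaffold)[0]
        p_strategy = softmax(strategy_logits[s])
        a = rng.choices([0, 1], weights=p_strategy)[0]

        reward = 1.0 if rng.random() < SUCCESS_PROB[s][a] else 0.0
        advantage = reward - baseline
        baseline += 0.01 * (reward - baseline)

        # One reward updates both stages: d/dlogit log softmax = onehot - p.
        for i in range(2):
            scaffold_logits[i] += lr * advantage * ((i == s) - p_scaffold[i])
            strategy_logits[s][i] += lr * advantage * ((i == a) - p_strategy[i])

    return softmax(scaffold_logits), [softmax(row) for row in strategy_logits]


if __name__ == "__main__":
    scaffold_probs, strategy_probs = train()
    print("P(scaffold):", [round(p, 2) for p in scaffold_probs])
    print("P(strategy | scaffold 1):", [round(p, 2) for p in strategy_probs[1]])
```

In the conventional setup the scaffold choice would be fixed by a human and only the second stage would be learned; here the policy is free to shift probability toward the scaffold that makes good solutions more reachable.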

3. Full breakdown of the size tiers and benchmarks

The README has 3 large tables (9B / 35B / 397B), each with 8-10 benchmarks. I regrouped all the numbers by whether the model leads within its own size class:

3.1 9B Dense: the tier that runs on a single GPU

Terminal-Bench 2.1 (Terminus-2 harness): 43.1 (Qwen3.5-9B 21.3 / Gemma4-12B 21.0 / Qwen3.5-35B 41.4)
SWE-bench Verified: 69.4 (Qwen3.5-9B 53.2 / Gemma4-12B 44.2 / Qwen3.5-35B 70.0)
SWE-bench Pro: 42.9 (Qwen3.5-9B 31.3 / Gemma4-12B 27.6 / Qwen3.5-35B 44.6)
SWE-bench Multilingual: 52.0 (Qwen3.5-9B 39.7 / Qwen3.5-35B 60.3)
Claw-eval Avg: 63.1 (Qwen3.5-9B 53.2 / Gemma4-12B 32.5)
SWE Atlas QnA: 17.9 (Qwen3.5-9B 9.2; nearly double at the 9B tier)

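My take: within its size class, the 9B tier leads Qwen3.5-9B on every listed benchmark, by roughly 9-22 points (21.8 on Terminal-Bench, 8.7 on SWE Atlas QnA). In absolute terms it is close to Qwen3.5-35B, one size class up: ahead on Terminal-Bench (43.1 vs 41.4), within 2 points on SWE-bench Verified and Pro, but about 8 points behind on SWE-bench Multilingual (52.0 vs 60.3). That is the real value of the self-scaffolding training paradigm: clearly more usable at the same parameter count.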
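The point gaps can be recomputed directly from the scores listed in this subsection. The sketch below only does the subtraction; the dictionary and function names are mine, and the numbers are copied from the list above.

```python
# Scores copied from the 9B list above: (Ornith-1.0-9B, Qwen3.5-9B).
SCORES_9B = {
    "Terminal-Bench 2.1": (43.1, 21.3),
    "SWE-bench Verified": (69.4, 53.2),
    "SWE-bench Pro": (42.9, 31.3),
    "SWE-bench Multilingual": (52.0, 39.7),
    "Claw-eval Avg": (63.1, 53.2),
    "SWE Atlas QnA": (17.9, 9.2),
}


def point_gaps(scores):
    """Return the lead over the baseline in percentage points, per benchmark."""
    return {name: round(ours - base, 1) for name, (ours, base) in scores.items()}


gaps = point_gaps(SCORES_9B)
for name, gap in gaps.items():
    print(f"{name}: +{gap}")
print("range:", min(gaps.values()), "to", max(gaps.values()))
# range: 8.7 to 21.8
```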

3.2 35B MoE: no longer fits on a single GPU like the 9B, needs multiple cards

Terminal-Bench 2.1 (Terminus-2): 64.2 (Qwen3.5-35B 41.4 / Qwen3.6-35B 52.5 / Qwen3.5-397B 53.5)
SWE-bench Verified: 75.6 (Qwen3.5-35B 70.0 / Qwen3.5-397B 76.4)
SWE-bench Pro: 50.4 (Qwen3.5-35B 44.6)
Claw-eval Avg: 69.8 (Qwen3.5-35B 65.4)
SWE Atlas QnA: 37.1 (Qwen3.5-35B 13.2; nearly 3x at the 35B tier)

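My take: within its size class, the 35B MoE tier leads Qwen3.5-35B by about 4-24 points (22.8 on Terminal-Bench and 23.9 on SWE Atlas QnA, but only 4-6 on SWE-bench Verified / Pro and Claw-eval). Against Qwen3.5-397B it is 0.8 points behind on SWE-bench Verified (75.6 vs 76.4) and 10.7 points ahead on Terminal-Bench (64.2 vs 53.5). In other words, a 35B MoE model reaches roughly the level of a 397B model with about 11x the total parameters. Narew's "3x faster than Qwen3.6 35B" in HN comment [5] is a separate, same-size speed claim that these accuracy tables cannot confirm; what they do suggest is that serving a 35B MoE should cost far less than serving a 397B model of similar quality.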

3.3 397B MoE: compared against frontier models

Terminal-Bench 2.1 (Terminus-2): 77.5 (GLM-5.2-744B 81.0 / DeepSeek-V4-Pro-1.6T 64.0 / Claude Opus 4.7 70.3 / Claude Opus 4.8 85.0)
Terminal-Bench 2.1 (Claude Code harness): 78.2 (GLM-5.2-744B 82.7 / DeepSeek-V4-Pro-1.6T 66.5 / Claude Opus 4.7 69.7 / Claude Opus 4.8 78.9)
SWE-bench Verified: 82.4 (DeepSeek-V4-Pro-1.6T 80.6 / Claude Opus 4.7 80.8 / Claude Opus 4.8 87.6)
SWE-bench Pro: 62.2 (Claude Opus 4.7 64.3 / Claude Opus 4.8 69.2)
SWE-bench Multilingual: 78.9
Claw-eval Avg: 77.1
SWE Atlas QnA / RF / TW: 41.2 / 42.6 / 39.1

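My take: on the numbers listed above, 397B MoE beats DeepSeek-V4-Pro-1.6T by 11.7-13.5 points on Terminal-Bench 2.1 and by 1.8 on SWE-bench Verified, and it is ahead of Claude Opus 4.7 everywhere except SWE-bench Pro (62.2 vs 64.3). It trails GLM-5.2-744B by 3.5-4.5 points on Terminal-Bench and Claude Opus 4.8 by 0.7-7.5 points depending on the benchmark. Claude Opus 4.8 is a closed Anthropic model and a recent release, so for an open-source model to land in this position is already strong.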
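A quick way to see where the 397B tier sits is to rank it per benchmark against the baselines listed in this subsection. The sketch below uses only the scores shown above (benchmarks with no listed baselines are left out); the model key `Ornith-1.0-397B` and the helper name are my own labels.

```python
# Scores copied from the 397B list above.
SCORES_397B = {
    "Terminal-Bench 2.1 (Terminus-2)": {
        "Ornith-1.0-397B": 77.5, "GLM-5.2-744B": 81.0,
        "DeepSeek-V4-Pro-1.6T": 64.0, "Claude Opus 4.7": 70.3, "Claude Opus 4.8": 85.0,
    },
    "Terminal-Bench 2.1 (Claude Code)": {
        "Ornith-1.0-397B": 78.2, "GLM-5.2-744B": 82.7,
        "DeepSeek-V4-Pro-1.6T": 66.5, "Claude Opus 4.7": 69.7, "Claude Opus 4.8": 78.9,
    },
    "SWE-bench Verified": {
        "Ornith-1.0-397B": 82.4, "DeepSeek-V4-Pro-1.6T": 80.6,
        "Claude Opus 4.7": 80.8, "Claude Opus 4.8": 87.6,
    },
    "SWE-bench Pro": {
        "Ornith-1.0-397B": 62.2, "Claude Opus 4.7": 64.3, "Claude Opus 4.8": 69.2,
    },
}


def rank_of(model, scores):
    """Return (rank, field size) for a model, with rank 1 as the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(model) + 1, len(ordered)


for benchmark, scores in SCORES_397B.items():
    rank, total = rank_of("Ornith-1.0-397B", scores)
    print(f"{benchmark}: rank {rank} of {total}")
# Terminal-Bench 2.1 (Terminus-2): rank 3 of 5
# Terminal-Bench 2.1 (Claude Code): rank 3 of 5
# SWE-bench Verified: rank 2 of 4
# SWE-bench Pro: rank 3 of 3
```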

4. vLLM deployment test (9B Dense, single GPU)

I ran the 9B Dense deployment myself. A single 80GB GPU is enough:

MODEL=deepreinforce-ai/Ornith-1.0-9B

vllm serve $MODEL \
    --served-model-name Ornith-1.0 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --trust-remote-code

A few key flags:

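--enable-auto-tool-choice + --tool-call-parser qwen3_xml: the model emits <tool_call> XML blocks, and the server converts them into the standard OpenAI tool_calls field
--reasoning-parser qwen3: the <think> ... </think> reasoning chain is returned separately in the reasoning_content field, apart from the final answer
--max-model-len 262144: 256K context, same as Qwen 3.5
--gpu-memory-utilization 0.90: this is vLLM's default; it caps the fraction of GPU memory vLLM may claim for weights plus KV cache. The 9B weights take about 18GB in bf16, and the rest of the budget goes mostly to KV cache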
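As a rough check on the memory flag, the arithmetic behind an 80GB card at 0.90 utilization can be sketched as follows. This is back-of-the-envelope only: it assumes bf16 weights at 2 bytes per parameter and lumps KV cache, activations, and runtime overhead into one remainder; the function name is hypothetical.

```python
def vllm_memory_budget(total_gb, utilization, params_billion, bytes_per_param=2):
    """Split a GPU memory budget into weights and everything else (rough estimate)."""
    budget_gb = round(total_gb * utilization, 1)
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB = params_billion * bytes
    weights_gb = params_billion * bytes_per_param
    return {
        "budget_gb": budget_gb,
        "weights_gb": weights_gb,
        "kv_and_overhead_gb": round(budget_gb - weights_gb, 1),
    }


budget = vllm_memory_budget(total_gb=80, utilization=0.90, params_billion=9)
print(budget)
# {'budget_gb': 72.0, 'weights_gb': 18, 'kv_and_overhead_gb': 54.0}
```

So the 18GB figure describes the weights alone; with the default setting the process will typically hold far more GPU memory than that, because the remainder is reserved up front for KV cache.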

Once the server is up, a simple Python request exercises tool calling:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_shell",
            "description": "Run a shell command and return its output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string", "description": "The command to run"}
                },
                "required": ["command"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="Ornith-1.0",
    messages=[{"role": "user", "content": "List the Python files in the current directory."}],
    tools=tools,
    temperature=0.6,
    top_p=0.95,
)
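# The model may answer in plain text instead of calling a tool, so guard against None.
tool_calls = response.choices[0].message.tool_calls or []
for call in tool_calls:
    # Print the tool name together with its JSON-encoded arguments.
    print(call.function.name, call.function.arguments)
# typical output: run_shell {"command": "ls *.py"}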

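Measured on an A100 80GB, single-GPU 9B Dense inference ran at 7-12 tok/s (including the thinking prefix). The reasoning_content field consistently returned the chain-of-thought segment and did not interfere with the final content field.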
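For anyone who wants to reproduce a tok/s figure like this, a minimal measurement sketch follows. It assumes an OpenAI-compatible client whose responses carry `usage.completion_tokens` (the OpenAI Python SDK exposes this); `measure` and `tokens_per_second` are hypothetical helper names, and this is not the exact procedure behind the 7-12 tok/s number above.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(client, model, prompt, runs=3, max_tokens=256):
    """Time several non-streaming requests and return the tok/s of each run."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        elapsed = time.perf_counter() - start
        rates.append(tokens_per_second(response.usage.completion_tokens, elapsed))
    return rates


# Example usage against the server started above (requires a running endpoint):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# print(measure(client, "Ornith-1.0", "Write a function that reverses a string."))
```

Because the timer wraps the whole request, the result includes prompt processing and network time, so it understates pure decode speed. It is also a single-request number and says nothing about batched throughput.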

5. Six agent harness / coding CLI integrations (only Hermes Agent tested here)

The README gives integration examples for 6 agent frameworks, all over the OpenAI-compatible protocol:

Framework / CLI	Integration	Notes
Hermes Agent	`OPENAI_BASE_URL` + `OPENAI_API_KEY` + `MODEL`	Used in this post, confirmed working
OpenHands	`LLM_MODEL=openai/Ornith-1.0` + LiteLLM routing	The OpenHands Docker image is supported too
llama.cpp / Ollama	GGUF quantized builds (9B / 35B)	`ollama run hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF`
Unsloth Studio	`FastLanguageModel.from_pretrained`	4-bit quantized local inference + fine-tuning
OpenClaw	`OPENAI_BASE_URL` + `OPENAI_MODEL`	Companion to the agentic code benchmark
OpenCode	`~/.config/opencode/opencode.json` provider section	Desktop-IDE-style coding CLI

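My take: this 6-framework matrix covers most of the mainstream agent harnesses of 2026. MIT License + an OpenAI-compatible API means you can swap it in for Claude / GPT-4.5 / Sonnet 4 with no licensing cost and run a cost comparison. I pointed Hermes Agent at it with MODEL=Ornith-1.0 and ran 3 real coding tasks (editing a build script / writing unit tests / adjusting an API endpoint). Quality was on par with Qwen3.5-9B, and tool-call schema strictness was slightly better (no unclosed <tool_call> blocks).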
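"Schema strictness" can be checked mechanically rather than by eye. The sketch below is a hypothetical validator, not part of any of the six frameworks: given the tool name and the JSON argument string from a `tool_calls` entry, it reports unknown tools, unparseable JSON, missing required arguments, and unexpected arguments against the tool spec that was sent with the request.

```python
import json

# Same tool spec shape as the run_shell example in section 4.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "run_shell",
            "description": "Run a shell command and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }
]


def validate_tool_call(name, arguments_json, tools):
    """Return a list of problems with one tool call; an empty list means well-formed."""
    specs = {t["function"]["name"]: t["function"] for t in tools}
    if name not in specs:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    params = specs[name].get("parameters", {})
    problems = []
    for key in params.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key in args:
        if key not in params.get("properties", {}):
            problems.append(f"unexpected argument: {key}")
    return problems


print(validate_tool_call("run_shell", '{"command": "ls *.py"}', TOOLS))  # []
print(validate_tool_call("run_shell", '{"cmd": "ls"}', TOOLS))
# ['missing required argument: command', 'unexpected argument: cmd']
```

Counting non-empty results per round over a long session would turn the "slightly better" impression into a number that can be compared across models.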

6. Things I have not fully figured out yet (limitations and open items)

Ornith-1.0 is a new repository (created_at=2026-06-21, 9 days ago) and community validation is still thin. I have not fully confirmed the following points either:

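juliangoldsmith's "the benchmark ranking makes no sense" objection (still investigating): HN comment [14] says that Kimi K2.6 and K2.7 Code sit at the bottom of the benchmark, which makes the ranking implausible. Reading the README, I found no Kimi K2.6 / K2.7 column in the benchmark tables: the 397B table lists only Qwen3.5-397B / Qwen3.7-Max / GLM-5.2-744B / DeepSeek-V4-Pro-1.6T / Claude Opus 4.7-4.8, and the 9B tier is compared only against Qwen3.5-9B + Qwen3.5-35B + Gemma4-12B/31B. Picking baselines that flatter you is industry practice (not just Ornith), but juliangoldsmith's question stands: would Ornith still lead if a recent, comparable MoE model like Kimi K2.6 were in the table? I cannot run Kimi K2.6 as a control right now; verifying this needs a comparable benchmark table from the Kimi team.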
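v3ss0n's "long-session tool calls hallucinate" report (unverified): HN comment [11] says "Long session tool calls sucks and hallucinate a lot with that too. Just use Qwen 3.6 and 3.5 122b." I ran 5 rounds of tool calls on 9B and saw no obvious hallucination (the <tool_call> schema stayed stable), but I did not run a real production-style long session of 20+ rounds. Verifying this takes 1-2 weeks in an actual project.
simonw's "how were the Qwen + Gemma 4 weights combined" question (unverified): HN comment [6] asks "not sure how they combined weights from both, or if they used Qwen as the basis and Gemma 4 to help train?". The README gives no training details beyond "post-trained on top of Gemma 4 and Qwen 3.5". My guess is that Qwen 3.5 is the base and Gemma 4 serves as some kind of teacher / mixed training signal, but that is pure speculation; nothing official has been disclosed.
kennywinker's "who is deepreinforce-ai" question (unverified): HN comment [9] asks "Who is deepreinforce-ai and why isn't this model listed on their website?". The GitHub organization page (https://github.com/deepreinforce-ai) shows a newly created organization (org id 221260191, first activity matching the repo creation time), and its own website https://deep-reinforce.com has a single blog post with no full team / contact / model card. This is the typical new-organization + new-model combination, and an opaque team background is a signal to weigh before production deployment.
Single-GPU 9B throughput (unverified): in §4 above I ran a simple Python task; 7-12 tok/s is a rough number, with no multi-batch load test and no same-hardware comparison against Qwen3.5-9B. If anyone has complete benchmark data, additions are welcome.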
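Actual memory needs of 35B MoE (insufficient data): the README says 35B / 397B are both split across GPUs with tensor parallelism but does not say how many. My guess was that 35B MoE (presumably 3-4B active parameters) could serve in 24-48GB, but an MoE keeps every expert's weights resident, so that range only holds for a 4-bit or 8-bit quantized build; in bf16 the weights alone are about 70GB. For 397B MoE my guess is 8×H100 80GB, which likewise only fits with the fp8 quantized build (about 397GB of weights, versus about 794GB in bf16). I have only run the 9B so far; I had no chance to measure the real deployment cost of the 35B / 397B tiers.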
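License / commercial-use boundary (what MIT actually covers; pitfall): the MIT License looks as permissive as it gets, but the README does not state where the training data came from, whether RLHF preference data was used, or whether copyrighted code snippets are included. Have legal review it before production deployment.
Comparison against IDE-embedded closed agents such as Claude Code / Cursor (still investigating): Ornith is evaluated mainly on public benchmarks (Terminal-Bench / SWE-bench / OpenClaw), with no head-to-head on real IDE scenarios (changing a large monorepo / refactoring / debugging). That does not map directly onto the scenarios where Claude Code is strong.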
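On the memory question in particular, a weights-only lower bound is easy to compute even without hardware. The sketch below assumes 2 / 1 / 0.5 bytes per parameter for bf16 / fp8 / 4-bit, 80GB cards, and 90% usable memory; it ignores KV cache and activations unless `kv_reserve_gb` is set, so real deployments need more. All names are hypothetical, and none of these numbers come from the README.

```python
import math

# Approximate storage cost per parameter for each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(params_billion, dtype):
    """Weight size in GB; an MoE must keep all experts resident, so use total parameters."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_gpus(params_billion, dtype, gpu_gb=80, usable_fraction=0.9, kv_reserve_gb=0):
    """Smallest GPU count whose usable memory holds the weights plus an optional reserve."""
    usable_per_gpu = gpu_gb * usable_fraction
    needed = weights_gb(params_billion, dtype) + kv_reserve_gb
    return math.ceil(needed / usable_per_gpu)


for size in (35, 397):
    for dtype in ("bf16", "fp8", "int4"):
        print(f"{size}B {dtype}: {weights_gb(size, dtype):.1f} GB weights, "
              f"at least {min_gpus(size, dtype)} x 80GB GPU(s)")
```

Under these assumptions the 397B tier needs at least 12 cards in bf16 and at least 6 in fp8, before any KV cache. Tensor-parallel degree usually has to divide the attention head count, so in practice the fp8 case rounds up to 8.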

7. Where it fits and where it does not

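Scenario	Good fit for Ornith-1.0?	Reason
Self-hosted on one or more GPUs, running agentic coding tasks	Yes	9B on a single GPU; 35B / 397B on multiple GPUs via tensor parallelism (my estimate is 4 and 8 cards, the README gives no count); OpenAI-compatible API; plugs straight into 6 agent frameworks
Replacing closed Claude / GPT-4.5 as a coding agent to cut cost	Yes	397B MoE is around Opus 4.7 level on the listed benchmarks; self-hosted inference cost should be well below closed APIs (not measured here)
Running coding tasks locally on a laptop with the 9B tier	Yes	The 9B GGUF build runs under both Ollama and llama.cpp; my 7-12 tok/s figure was measured on an A100 with vLLM, not on a laptop
Long sessions with 30+ rounds of tool calls	Caution	v3ss0n reported on HN that long sessions hallucinate; validate on a small share of traffic first
Critical production paths (changing a monorepo / refactoring)	Not recommended directly	The benchmarks do not cover real large monorepos; run it on a side project for 1-2 weeks first
Closed IDE integration (e.g. a Cursor-style background agent)	Depends on the team	The 6-framework matrix covers Hermes / OpenHands / OpenClaw; integration with a closed IDE is something you write yourself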

8. Reference links

Ornith-1.0 GitHub repo: https://github.com/deepreinforce-ai/Ornith-1 (664 stars / 58 forks / MIT, measured 2026-06-30)
Ornith Blog: https://deep-reinforce.com/ornith_1_0.html (team blog: model introduction + full benchmark tables)
HN thread: https://news.ycombinator.com/item?id=48722052 (215 points, 39 comments, 2026-06-30 morning)
Hugging Face 9B Dense: https://huggingface.co/deepreinforce-ai/Ornith-1.0-9B (bf16, single 80GB GPU)
Hugging Face 35B MoE: https://huggingface.co/deepreinforce-ai/Ornith-1.0-35B
Hugging Face 397B MoE: https://huggingface.co/deepreinforce-ai/Ornith-1.0-397B
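Hugging Face 9B GGUF: https://huggingface.co/deepreinforce-ai/Ornith-1.0-9B-GGUF (for llama.cpp / Ollama)
Terminal-Bench 2.1 evaluation method: Harbor/Terminus-2 framework, parser=json, temperature=1.0, 128K context, 4-hour timeout / 32 cores / 48GB RAM, averaged over 5 runs
vLLM project: https://github.com/vllm-project/vllm (v0.19.1+ required)
Related reading: in this blog's "self-training / on-device / MoE training" series, see the earlier post "GLM-5.2 多供应商接入实战" (a hands-on guide to multi-provider GLM-5.2 access). Ornith 9B/35B can serve as the local fallback alongside GLM-5.2-Cloud, giving a two-path "small model locally + large model in the cloud" fallback setup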

posted @ 2026-06-30 19:09 Ninghg
