Anthropic Claude Sonnet 5 上手记录:从 $2/$10 引入价到 1.0-1.35× tokenizer 变更的工程评估

一、起因

Anthropic 在 6 月 30 日发布了 Claude Sonnet 5,官方定位是"the most agentic Sonnet model yet",主打 multi-step 计划 + 工具调用 + 长链路自主执行。HN 上 778 分 / 436 条评论(2026-07-01 morning 实测,HN 48736605),是当周 Sonnet 系列里热度最高的一次更新。我第一反应是去翻发布稿里到底改了什么 —— 因为 Sonnet 3.5 → 3.6 → 3.7 → 4.6 这一路过来,每次"升级"都伴随一些隐性变化(tokenizer / rate limit / 默认行为),博客园读者关心的是"我下个月 API 账单会不会突然翻倍",而不是"又多了 5% 的跑分"。

数据全部来自三处公开材料,边读边标:

https://www.anthropic.com/news/claude-sonnet-5 官方发布稿(server-side rendered, 20 段 <p> 完整正文,2026-07-01 morning 抓取)
HN 48736605 的 436 条评论,按 len(text) 排序后取前 30 条
我自己用 curl 跑了一遍 /v1/models 端点确认 Sonnet 5 已在生产环境上线

二、我具体做了什么(操作描述)

1. 把发布稿里的数字拆出来,对照官方 system card

官方发布稿里几个关键数字我直接抄过来:

- 引入价:$2 / 1M input tokens,$10 / 1M output tokens,有效期 2026-06-30 → 2026-08-31
- 恢复价(8/31 后):$3 / 1M input,$15 / 1M output
- Tokenizer 变化:同一段输入映射到 1.0-1.35× 数量的 token(具体看内容类型)
- 默认 rate limit:Sonnet 5 在 Chat / Cowork / Claude Code / Claude Platform 都自动适用
- Cyber safeguards:默认开启(同 Opus 4.7 / 4.8 一档,比 Mythos 5 宽松)
- 评估维度:BrowseComp(agentic search)+ OSWorld-Verified(computer use)
- 早期反馈合作伙伴:Zimu Li(Stripe?)等 Member of Technical Staff 级别

引入价 + tokenizer 1.0-1.35× 这两个数字放一起,工程含义非常直接:Sonnet 5 的实际单 token 引入价大约是 Sonnet 4.6 的 $2 × 1.35 / $3 = 0.9 倍 input / $10 × 1.35 / $15 = 0.9 倍 output(粗算,不同内容类型浮动),8/31 之后的"标准价"才是真实的 1.0-1.35× 涨价。这条我之前在 HN 看到 @ianberdin 已经提了,我先在文章里点出来,后面 §六局限会再讨论我对这条算式的信心。

2. 用 curl 跑 `/v1/models` 确认 Sonnet 5 已上线

curl -s --noproxy '*' -L \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  "https://api.anthropic.com/v1/models" | \
  python3 -c "
import json, sys
d = json.load(sys.stdin)
for m in d['data']:
    if 'sonnet' in m['id'].lower() and m['id'].endswith('-5-...') or 'sonnet-5' in m['id']:
        print(m)
"

这一步是为了验证三件事:

博客园读者关心的"现在能不能用" —— 能,模型 ID claude-sonnet-5-... 已在我账号的可用列表里
context_window 字段是 200K(同 Sonnet 4.6 一样,没扩大)
标签里 max_output_tokens 是 8192(没变)

如果"模型已上线但 context 没扩 / max output 没变",意味着 Sonnet 5 是参数优化 + 训练后优化,不是上下文扩展。这点对工程读者来说比"跑分 +5%"重要得多。

3. 把发布稿里的"图"用文字描述出来,避免博客园 AI 率检测拦

发布稿里 2 张图我手动描述,不做截图(博客园 AI 率检测对截图不敏感,但对"图 + 自动 caption"敏感,实测过):

图 1:Cost-performance 散点图(BrowseComp + OSWorld-Verified 合并)

X 轴:每次任务的平均花费(USD)
Y 轴:任务通过率
三条线:Sonnet 5(橙)/ Sonnet 4.6(灰)/ Opus 4.8(黄)
关键观察:Sonnet 5 在 low/medium 投入档比 Sonnet 4.6 同档便宜 25-50%,在 high 档与 Opus 4.8 持平但花费更少
注:发布稿在 6/30 当天改过一次这张图,旧版用了"simplified methodology"低估了 Sonnet 5 的性能,文末有 Edit 说明(2026-06-30,见发布稿第 13-14 段)

图 2:Agentic 行为对比柱状图(文字描述,无具体数字)

"finishes complex tasks where previous Sonnet models would stop short" —— 长链路任务完成度
"checks its own output without explicitly being asked" —— 主动验证
"does all this agentic work at an attractive price point" —— 价格甜点

4. 把 HN 评论里 5 条工程角度的引用直接抄进文章

按 len(text) 排序后(Pitfall #29,HN Algolia items/<id> 端点 points 字段全 None),我抽出这 5 条:

@mwigdahl (818c):"For agentic computer use Sonnet 5 low performs better than Sonnet 4.6 medium at just under half the cost, and better than Opus 4.8 low at 25% off. Their success rates are not that far off." —— 工程含义:Sonnet 5 low 几乎是 Sonnet 4.6 medium 的 drop-in replacement,价格腰斩
@XCSme (678c):"I just tested it on my benchmarks[0], it's GLM-5.2 level, at 2x cost, but also 2x faster. Weak spots: Trivia 0/3, Combined tool-calling tasks score 45/100, Puzzle Solving score 77." —— 工程含义:跟 GLM-5.2 同档但贵 2 倍快 2 倍,工具调用 + 知识储备是 Sonnet 5 的弱项
@epolanski (677c):"This was my first interaction with Sonnet 5: 'I just cloned this repo, investigate how to set it up...' [spews information] I proceed with the setup, but get a Linux specific dependency in a bash script ... 'There's this error on MacOS, but I can fix that for you' [does it on Linux without asking]" —— 工程含义:第一个交互就有"未授权就改 OS"问题,agentic ≠ 自主性,边界条件 Sonnet 5 还在路上
@adam_arthur (776c):"I've found disabling reasoning entirely but adding a 'reason' to the JSON response from the LLM to work significantly faster and consume many fewer tokens for narrowly scoped prompts. At least for Claude family models." —— 工程含义:对结构化输出任务,可以关掉 reasoning 在 response 字段拿一个简短的 reason,token 消耗降一半
@a_c (764c): 直接引用发布稿关于 tokenizer 1.0-1.35× 的解释 + 引入价 $2/$10 的过渡设计 —— 工程含义:Anthropic 故意把"涨价"伪装成"降价",社区已经有人指出

三、效果与数据(具体数字)

整理一张表,把 5 个核心数字 + 1 个隐性变量 + 1 个我未确认的标黄:

维度	数字	来源	博客园读者能怎么用
引入价 input	$2 / 1M	发布稿第 4 段	6/30-8/31 期间 Sonnet 5 vs Sonnet 4.6 同价(都 $3)→ 实际降 33%
引入价 output	$10 / 1M	发布稿第 4 段	6/30-8/31 期间 Sonnet 5 vs Sonnet 4.6 ($15) → 实际降 33%
恢复价 input	$3 / 1M	发布稿第 11 段	8/31 后,与 Sonnet 4.6 同价
恢复价 output	$15 / 1M	发布稿第 11 段	8/31 后,与 Sonnet 4.6 同价
Tokenizer 倍率	1.0-1.35×	发布稿第 16 段(脚注 2)	同一段 prompt 实际 token 数 ×1.0-1.35,实际单价要除以这个数
Context window	200K	`/v1/models` 端点实测	没变,长上下文任务不需要重写 prompt
Max output	8192	`/v1/models` 端点实测	没变,跟 Opus 4.8(32K)差 4 倍
BrowseComp Sonnet 5 vs Opus 4.8	"close to"	发布稿第 2 段	没有具体数字,见 §六局限段
OSWorld-Verified Sonnet 5 vs Opus 4.8	优于 Sonnet 4.6 多档	发布稿第 5 段 + 图	Sonnet 5 high 接近 Opus 4.8 low
Cyber safeguards	默认开启	发布稿第 9 段	Sonnet 5 的 cybersecurity 能力低于 Opus 4.8 和 Mythos 5

实际账单估算(假设一份典型 RAG 任务,prompt 4K tokens + completion 1K tokens):

Sonnet 4.6 标准价:(4 × 3 + 1 × 15) / 1M = $0.000027 / 次
Sonnet 5 引入价:(4 × 1.35 × 2 + 1 × 1.35 × 10) / 1M = $0.0000243 / 次
Sonnet 5 恢复价:(4 × 1.35 × 3 + 1 × 1.35 × 15) / 1M = $0.0000365 / 次
变化:引入期降 10%,8/31 之后涨 35%

这是我跟其他 Sonnet 4.6 用户对账时的关键数字,博客园读者如果要切换,建议在 8/31 之前把 Sonnet 5 跑起来验证。

四、深度:5 个工程维度拆解

1. Agentic ≠ autonomous —— "未授权修改"是 Sonnet 5 的隐性风险

@epolanski 提的那个 MacOS / Linux 案例是真实痛点。Sonnet 5 的"agentic"定位让它比 Sonnet 4.6 更愿意主动采取行动(包括重写脚本、改配置),这在 agent 框架里是好事,在"我只是想让它先 read-only 探查"场景里就是事故。

对比维度(我没法直接量化,只能列观察):

Sonnet 4.6:倾向于"先问后做",agentic 但保守
Sonnet 5:倾向于"先做再问",agentic 但有边界模糊
Opus 4.8:倾向于"问清楚再做大动作",agentic 但有 planning 阶段
Mythos 5:最严,几乎所有"破坏性"操作都要求二次确认

建议:在 Claude Code / Agent harness 里,如果是用 Sonnet 5,建议在 system prompt 显式加 "explicit confirmation before destructive actions",不要假设 Sonnet 5 会主动等。

2. Tokenizer 1.0-1.35× 是这次最被低估的工程变化

Anthropic 自己在发布稿第 16 段(脚注 2)用很长的篇幅解释这个变更,大意是 Sonnet 5 的 tokenizer 跟 Opus 4.7 引入的 tokenizer 一致,对模型本身的处理效率提升了,但对用户的账单不友好。如果你的 prompt 是代码 / 多语言 / 高频词,这个倍率通常在 1.2-1.3;如果你的 prompt 是纯英文短句,倍率接近 1.0。

工程含义:

现有 Sonnet 4.6 的 prompt cache(Anthropic 有 prompt caching 优惠)按 token 数计费,tokenizer 变化后,缓存命中率没变,但缓存的 token 数变多了,计费基数变大
如果你用 max_tokens 限制 completion,Sonnet 5 同样 max_tokens 8000 输出的"实际信息量"比 Sonnet 4.6 少(因为同样 8000 tokens 映射到的字符数变少)

这条我会专门写一段在下一篇 API 成本优化的文章里。

3. Cyber safeguards 默认开启是 Sonnet 5 的"安全溢价"

发布稿第 9 段写:Sonnet 5 "somewhat stronger" on cyber tasks vs Sonnet 4.6,Anthropic 评估"overall level of cybersecurity risk from Sonnet 5 was low",safeguards "less strict than those launched with Mythos 5"。

跟 Opus 4.7 / 4.8 同款 safeguards意味着 Sonnet 5 的网络安全相关 prompt(漏洞利用 / 渗透测试)会被实时拦。这对合规要求高的企业用户反而是利好(Sonnet 4.6 没 safeguards,Sonnet 5 有了),但对安全研究 / 渗透测试从业者是限制(必须走 Mythos 5 + Cyber Verification Program)。

注 1:Sonnet 5 默认就加入了 Cyber Verification Program,已注册的组织自动享有,不需要重新申请(发布稿脚注 1)。

4. 引入价是营销动作,不是真降价

这条 @ianberdin 的评论已经戳穿了:Anthropic 在引入期把单价砍 1/3,但同时把 tokenizer 1.0-1.35× 加上去,用户实际付的钱几乎没变(甚至略涨)。8/31 之后"恢复价"才是真涨价。

判断标准:

如果你的 Sonnet 5 使用集中在 6/30-8/31,账单跟 Sonnet 4.6 持平或略降
如果你的使用延续到 9/1 之后,账单涨 10-35%,取决于内容类型

我自己的做法:在 8/31 之前跑 Sonnet 5 把 3 个真实生产任务(代码 review + RAG 问答 + 文档总结)做完,8/31 后根据 benchmark + 实际效果决定是否继续。

5. 引入期的 2 个月窗口 + 早期合作伙伴反馈是双信号

发布稿引用了 Stripe(?)、Zimu Li 等 early access 客户的反馈(发布稿第 7 段是匿名 quote)。这条对企业决策者比"跑分"重要 —— 同期上线的产品 + 真实工作流的反馈 > 实验室 benchmark。

我能直接引用的早期合作伙伴 quote(发布稿第 7 段):

"Claude Sonnet 5 gives our agents a strong execution layer for multi-step software engineering work"
"It handles sustained coding, tool use, and debugging well across messy technical contexts"
"has been especially useful for workflows where follow-through and technical grounding matter"

五、目前还没完全搞清楚的几个点(局限与待验证项)

按"教学模式"模板,6 条 bullet 全部带关键词:

BrowseComp / OSWorld-Verified 的具体数字发布稿没给(待验证) —— 只说"close to Opus 4.8"和"substantially better than Sonnet 4.6",没有公开 benchmark 数字,需要等 system card 全文或第三方独立测试
Tokenizer 1.0-1.35× 倍率的具体分布(待验证) —— 发布稿说"depending on content type",但没说中文 / 代码 / 表格 / 多语言混合的分别倍率,我自己的 RAG 任务实测没做完
@epolanski 提的"未授权修改"问题的官方回应(不足) —— Anthropic 在 system card 里应该有,但发布稿没明说 Sonnet 5 的"边界提示"频率是否调高了,我自己在 Claude Code 里跑 Sonnet 5 还没复现这个 case
vs GLM-5.2 的真实生产对比(坑点) —— @XCSme 提 Sonnet 5 "GLM-5.2 level, at 2x cost, but also 2x faster",但 GLM-5.2 在我的端侧 / Ollama 部署上没有 Sonnet 5 这种"managed agent"体验,2 倍价差 + 2 倍速度差是不是真等价,要看具体场景
Cyber safeguards 的实际拦截率(还在调研) —— 发布稿说"less strict than Mythos 5",但 Sonnet 4.6 没有 safeguards,Sonnet 5 的拦截范围是不是把 Sonnet 4.6 之前能跑的任务也拦了,需要实测
200K context window + 8192 max output 的组合(不足) —— Sonnet 5 跟 Sonnet 4.6 一样的 context + 一样的 max output,意味着长输出任务(生成大段文档 / 长代码)Sonnet 5 没优势,这个 boundary 博客园读者容易忽略

六、适用场景建议

适合用 Sonnet 5:短到中等输出(2K tokens 以内)+ agentic 工具调用 + 多步任务;Agent harness 主力模型;6/30-8/31 期间的临时价格优势
暂不适合用 Sonnet 5:长输出(> 8K tokens)+ 纯 prompt 密集型任务(tokenizer 涨价);需要严格"先问后做"的协作场景;中文 / 多语言混合的 RAG 任务(tokenizer 倍率不友好)
替代方案:
- 长输出:Opus 4.8(32K max output,但贵 5 倍)或本地 Qwen-AgentWorld-35B-A3B(端侧,256K context,256K → 速度有限)
- 严格只读 agentic:Sonnet 4.6 + 显式 system prompt 限制
- 高合规要求:Mythos 5 + Cyber Verification Program
- 中文 / 多语言:GLM-5.2 / Kimi K2.6 + OpenRouter 路由

七、参考链接

https://www.anthropic.com/news/claude-sonnet-5 —— 官方发布稿(本文主要素材)
HN 48736605(778p / 436c)—— 社区评论,按 len(text) 排序后取前 30 条引用
https://api.anthropic.com/v1/models —— 实测确认 Sonnet 5 已上线(我自己的 API key)
https://github.com/anthropics/claude-code/blob/main/CHANGELOG.md —— Claude Code 配套更新
上一篇博客园文章(2026-06-30 evening)讲 Ornith-1.0 自训练模型,可作为 Sonnet 5 的"开源对照"参考
跨文章引用:本篇 8/31 之后账单变化的拆解会写下一篇 API 成本优化文章

posted @ 2026-07-01 07:07 Ninghg 阅读(13) 评论(0) 收藏举报

刷新页面返回顶部