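Output Parsers

When you first start working with large language models, it is easy to think of a model response as "it sends back some text and that is the end of it". In real projects, though, the output usually has to flow on into other parts of the program, for example:

Stored in a database;
Returned to a front-end page for rendering;
Handed to the next node in a workflow;
Used as arguments for an Agent, a tool, or an API call;
Fed into risk control, review, statistics, or reporting logic.

At that point "a piece of natural language" is usually not enough. What the program would rather receive is:

A string;
A JSON dictionary;
An object with a fixed set of fields;
A strongly typed data structure with validation rules.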

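Common parser types:

Parser	Result	When to use	Notes
StrOutputParser	String `str`	You only need to display text, with no fields to split out	The simplest one; it typically just extracts the text content of the model output
JsonOutputParser	Python `dict` / `list`	You want the model to return JSON that the program then processes	Good for field extraction, API responses, and passing workflow parameters
PydanticOutputParser	Pydantic object	You need strong typing and runtime validation	Good for business scenarios with explicit requirements on field types, lengths, and ranges

Schema: think of it as a "data structure spec" or an "output format specification", somewhat like type constraints in TypeScript.

The essence of structured output is to define a schema first, and then have the model produce its result according to that schema.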

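Common ways to define structured output:

Approach	How it is defined	Result	Runtime validation	When to use
TypedDict	`typing.TypedDict` from the Python standard library	`dict`	No	You only need a fixed field structure, without strict validation
Pydantic	`BaseModel` + `Field(...)`	Pydantic object	Yes	You need strong typing plus range, length, and required-field validation
JSON Schema	A standard JSON Schema dictionary	Usually a `dict`	Depends on the implementation	You need a structure contract shared across languages and systems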
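Of the three options, JSON Schema is the only one that is plain data rather than a Python class, which is what makes it portable. The sketch below shows what such a dictionary might look like. The `person_schema` name and its fields are made up for illustration, and the commented-out last line assumes a chat model named `model` already exists.

```python
import json

# Hypothetical schema for illustration; "title" and "description" give the
# structure a name and tell the model what it represents.
person_schema = {
    "title": "Person",
    "description": "Basic information about a person.",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Full name"},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
    },
    "required": ["name", "age"],
}

# Because the schema is plain data, it can be serialized and shared with
# services written in other languages.
schema_text = json.dumps(person_schema, indent=2)
restored = json.loads(schema_text)
print(restored["required"])

# With a chat model in scope, the same dictionary could be passed as:
# structured_model = model.with_structured_output(person_schema)
```

The round trip through `json.dumps` and `json.loads` is the point of the example: nothing in the schema depends on Python types.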

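In LangChain, these structures are usually combined with the following call:

model.with_structured_output(...)

TypedDict / Pydantic / JSON Schema are ways to "define the output structure";
with_structured_output(...) is the entry point for "making the model produce output in that structure and parsing it automatically".

Structured output vs. output parsers

Output parsers: the emphasis is on "the model has already produced output; how do I turn it into data my program can use?";
Structured output: the emphasis is on "before the model produces output, I specify what it should look like".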

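Approach 1 (output parser only):

from langchain_core.output_parsers import JsonOutputParser

# `model` is a chat model created beforehand, e.g. with init_chat_model(...)
parser = JsonOutputParser()
# Ask for JSON explicitly; otherwise the reply may be plain prose that cannot be parsed
result = model.invoke("Introduce yourself. Reply with JSON only, using the keys name and role.")
data = parser.invoke(result)

print(data)
# The model generates the result first, then the parser is responsible for parsing it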

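Approach 2 (structured output only):

from typing import TypedDict

class Person(TypedDict):
    name: str
    age: int

structured_model = model.with_structured_output(Person)
result = structured_model.invoke("Please return information about a person")
print(result)

# There is no separate hand-written parser; with_structured_output(...) handles both "constraining the output" and "parsing it automatically"
# This fits best when the model already supports native structured output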

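Approach 3 (structured output + extra validation):

structured_model = model.with_structured_output(Person)
result = structured_model.invoke("Please return information about a person")

# From here you can continue with business validation, pre-storage cleaning, field conversion, and so on
print(result)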
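Approach 3 leaves the "extra validation" step open, so here is one way it could look. This is only a sketch: `clean_person` and its rules (non-empty name, age between 0 and 150) are hypothetical, and the hard-coded `result` dictionary stands in for what `structured_model.invoke(...)` would return.

```python
from typing import TypedDict


class Person(TypedDict):
    name: str
    age: int


def clean_person(raw: dict) -> Person:
    """Hypothetical business check applied after the structured call returns."""
    name = str(raw.get("name", "")).strip()
    if not name:
        raise ValueError("name must not be empty")
    age = int(raw["age"])  # field conversion: accepts 30 as well as "30"
    if not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")
    return Person(name=name, age=age)


# Stand-in for the dict returned by structured_model.invoke(...)
result = {"name": "  Alice ", "age": "30"}
person = clean_person(result)
print(person)
```

Keeping these checks in a separate function means the same rules can run no matter which approach produced the dictionary.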

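Practical scenarios:

Scenario	Requirement / common problem without parsing	Better approach
Chat Q&A, copy polishing, summary display	Only text needs to be shown; structure requirements are low	StrOutputParser
Extracting fields from text, such as "question/answer" or "time/person/event"	Free text is unstable, so downstream logic is hard to write	JsonOutputParser
Returning fixed fields for front-end cards, forms, or list pages	Missing fields or drifting field names break rendering	with_structured_output(TypedDict)
Writing strict data to a database, approval flow, or order system	Types, ranges, and lengths must be validated	`PydanticOutputParser` or `with_structured_output(Pydantic)`
Integrating with external systems over a shared protocol	The contract must be language-neutral and explicit	JSON Schema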

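Common output parser methods:

1. Parsing output

parser.invoke(...): the LangChain / Runnable style, suited to chained composition such as prompt | model | parser. Most examples in this chapter use this form.
parser.parse(text): for the case where "I already have a piece of plain text and just want to parse it".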

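2. Giving the model format instructions

A parser is not only "after-the-fact processing"; often it also helps constrain the model output up front.

The usual method:

parser.get_format_instructions()

It returns a block of format instructions that tells the model: what structure to output; which fields exist; what type each field has; whether only JSON may be returned; and whether extra explanatory text is forbidden. The exact wording depends on the parser.

These instructions are usually inserted into the Prompt so that the model tries to produce parseable output from the start, which lowers the parse failure rate.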

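Common parser usage

1. StrOutputParser

The simplest output parser in LangChain. What it does is very direct: it takes the content the model returned and uses it as a string. It does no structured parsing and does not care about fields, key names, or data types; it only cares about "getting hold of the final text".

Typical use cases:

Q&A bots that display text directly;
Article summaries, title generation, rewriting and polishing;
Translation, continuation, marketing copy;
Any scenario where the result just needs to be shown on the front end, with no fields to split out.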

# The simplest parser in LangChain: takes the content field from the model response and returns it as a plain string, with no JSON or other structured parsing
# Suited to cases where you only care about "what the model said" and do not need to split it into fields
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import os
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from loguru import logger

load_dotenv(encoding="utf-8")

# Build the chat template (same usage as ChatPromptTemplate in chapter 13)
chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a {role}. Please answer my questions briefly."),
        ("human", "Please answer: {question}"),
    ]
)

# Fill in the placeholders to get the message list passed to the model
prompt = chat_prompt.invoke(
    {"role": "AI assistant", "question": "What is LangChain? Answer concisely in under 100 words."}
)
logger.info(prompt)
# Initialize the model (requires an API key and related configuration)
model = init_chat_model(
    model="qwen-plus",
    model_provider="openai",
    api_key=os.getenv("aliQwen-api"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Call the model: pass in the prompt and get back an AIMessage object (the raw output)
result = model.invoke(prompt)
logger.info(f"Raw model output:\n{result}")

# Create the string parser: it only "takes content from result and returns it as str"
# For a single AIMessage this is equivalent to result.content; the benefit of the parser is that it can be chained (prompt | model | parser)
parser = StrOutputParser()

# Parse: parser.invoke(result) extracts content from result and yields a plain string
response = parser.invoke(result)
logger.info(f"Parsed string result:\n{response}")

2. JsonOutputParser

Parses the model's free-text output into structured JSON.

Usage 1

Write the JSON requirements by hand directly in the prompt.

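import os
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from loguru import logger

load_dotenv(encoding="utf-8")

chat_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a {role}. Please answer my questions briefly and return the result as JSON, where the q field is the question and the a field is the answer.",
        ),
        ("human", "Please answer: {question}"),
    ]
)

prompt = chat_prompt.invoke(
    {"role": "AI assistant", "question": "What is LangChain? Answer concisely in under 100 words."}
)
model = init_chat_model(
    model="qwen-plus",
    model_provider="openai",
    api_key=os.getenv("aliQwen-api"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# The model may return text that contains JSON
result = model.invoke(prompt)

# Create the JSON parser (without a Pydantic model bound, the parsed result is a dict/list)
parser = JsonOutputParser()
# Try to parse JSON out of the content of result
response = parser.invoke(result)
logger.info(f"Parsed structured result:\n{response}")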

Usage 2

Use get_format_instructions() so that the parser generates the format instructions itself, then insert them into the Prompt.

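Structured Output

Structured output means not being satisfied with "the model outputs a piece of JSON text", but going further and requiring that:

The output must conform to an explicit schema;
LangChain parses it directly into a dictionary or object for you;
Fields can be validated when necessary.

The official LangChain documentation now emphasizes that many modern models support native structured output.
This means that on models that support it, preferring:

model.with_structured_output(...)

is often more reliable and less work than the traditional "hand-written Prompt + Parser" approach.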

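TypedDict:

TypedDict comes from the Python standard library module typing. Its job is to describe which keys a dictionary should have and what type each key holds. It is more like a structure spec sheet.

One thing to note in particular: TypedDict does not perform real runtime validation. In other words, it leans toward "describing the structure", not "enforcing validation".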
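The lack of runtime validation is easy to confirm with the standard library alone. In this small sketch (the `Person` class is illustrative), a value of the wrong type is accepted without complaint:

```python
from typing import TypedDict


class Person(TypedDict):
    name: str
    age: int


# The annotation says age is an int, but nothing checks that at runtime:
# a TypedDict "instance" is just an ordinary dict.
p = Person(name="Alice", age="not a number")

print(type(p))   # <class 'dict'>
print(p["age"])  # not a number
```

A static type checker would flag the wrong type, but the running program would not.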

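Annotated: adding descriptions to fields

Annotated is also part of the Python standard library module typing. It does not "switch to a different type"; instead, it attaches metadata or a description to an existing type.

Annotated[str, "animal name"]

This is still essentially str, with an extra "this is the animal name" note attached. LangChain can use such metadata to generate a clearer schema description, making it easier for the model to understand what each field should contain. Note that for TypedDict fields, with_structured_output reads the annotation as Annotated[type, default, description], so the description belongs in the third position, for example Annotated[str, ..., "animal name"], where ... marks the field as required.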
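At the plain Python level, the attached metadata can be read back with `typing.get_type_hints`. The sketch below uses only the standard library, with an illustrative `Animal` class, to show that the base type and the note travel together:

```python
from typing import Annotated, TypedDict, get_args, get_type_hints


class Animal(TypedDict):
    animal: Annotated[str, "animal name"]
    emoji: Annotated[str, "emoji for the animal"]


# include_extras=True keeps the Annotated wrapper instead of stripping it
hints = get_type_hints(Animal, include_extras=True)

descriptions = {}
for field, hint in hints.items():
    base_type, *metadata = get_args(hint)  # (str, "animal name") -> str, ["animal name"]
    descriptions[field] = (base_type, metadata[0])

print(descriptions)
```

Libraries that build schemas from type hints can inspect annotations in this way, although each library decides for itself how to interpret the metadata.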

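# LangChain lets you define a structure with TypedDict (Python standard library) or Pydantic; the model then generates output in that structure and it is parsed for you
# What does with_structured_output do?
# Calling llm.with_structured_output(some_type) on a model returns a runnable that "has structured output capability"
# When you call .invoke(messages), the model is guided to output JSON matching the structure, and the result is parsed automatically into a Python dict (or the corresponding type), with no hand-written Parser

import os
from typing import TypedDict, Annotated
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model

load_dotenv(encoding="utf-8")

llm = init_chat_model(
    model="qwen-plus",
    model_provider="openai",
    api_key=os.getenv("aliQwen-api"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Define the structure of "one animal" with TypedDict
# with_structured_output reads Annotated[type, default, description]: ... marks the field as required, and the string is the description shown to the model
class Animal(TypedDict):
    animal: Annotated[str, ..., "animal"]
    emoji: Annotated[str, ..., "emoji"]

# Define the "animal list": a single field animals whose type is a list of Animal
class AnimalList(TypedDict):
    animals: Annotated[list[Animal], ..., "list of animals with their emoji"]


# An ordinary chat message
messages = [{"role": "user", "content": "Generate any three animals along with their emoji"}]
# Bind "structured output" to the model: return and parse according to the AnimalList structure
llm_with_structured_output = llm.with_structured_output(AnimalList)
resp = llm_with_structured_output.invoke(messages)
print(resp)  # A dict matching AnimalList, e.g. {"animals": [{"animal": "cat", "emoji": "🐱"}, ...]}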

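When to use it:

TypedDict fits these situations:

The front-end page needs fixed fields, but the field values do not need complex validation;
Structured data is passed between workflow nodes;
You want a clear structure without bringing in heavy validation logic;
The model itself has good structured output support.

In one sentence: TypedDict fits "I want a stable structure" without necessarily requiring "strict validation of data legality".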

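Pydantic: from describing structure to real validation

Pydantic addresses this: "the data must not only look right, it must actually be valid".

Pydantic is a very widely used data validation library in the Python ecosystem. When an object is created, it can apply the following to fields: type checks; range checks; length checks; custom validation. That is why it matters so much in real business development.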

# Put Pydantic's Field(ge=0, le=150) inside Annotated, i.e. Annotated[int, Field(ge=0, le=150, ...)]; Pydantic then checks at runtime that the value is within 0-150 and raises ValidationError if it is not
# When you need "strong typing + value range validation", use a Pydantic model with Field constraints; LangChain structured output can also use such models for parsing and validation

from typing import Annotated
from pydantic import BaseModel, Field, ValidationError

# Annotated combined with Pydantic's Field: ge=0, le=150 is checked at runtime, and out-of-range values trigger ValidationError
Age = Annotated[int, Field(ge=0, le=150, description="Age, range 0-150")]

class Person(BaseModel):
    name: str
    age: int
    age2: Age  # age2 is validated by Pydantic against Field(ge=0, le=150)

try:
    p = Person(name="z3", age=11, age2=188)  # age2=188 is outside 0-150, so ValidationError is raised
    print(p)
except ValidationError as e:
    print("Data validation failed:")
    print(e)

Annotated by itself validates nothing; the validation actually comes from the rules in Pydantic's Field(...).

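PydanticOutputParser: the full workflow

# How the two differ
# PydanticOutputParser: built specifically for Pydantic models; the parsed result is converted to a Pydantic instance and goes through Pydantic validation (field types, validators), raising an error if it fails
# JsonOutputParser: parses the model text into "arbitrary" JSON (dict/list); a Pydantic model can optionally be bound to describe the shape

# Define Product (name, category, description) with Pydantic, where description uses field_validator to check length >= 10 ->
# create the parser with PydanticOutputParser(pydantic_object=Product) -> get the format instructions with get_format_instructions() and insert them into the Prompt -> after the model responds, parser.invoke(result) yields a Product object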

import os
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from loguru import logger
from pydantic import BaseModel, Field, field_validator

load_dotenv(encoding="utf-8")

class Product(BaseModel):
    """Product information: name, category, description. The description must be at least 10 characters long, checked by the validator below."""

    name: str = Field(description="Product name")
    category: str = Field(description="Product category")
    description: str = Field(description="Product description")

    @field_validator("description")
    @classmethod
    def validate_description(cls, value):
        """Pydantic validator: description must be at least 10 characters long, otherwise ValueError is raised."""
        if len(value) < 10:
            raise ValueError("Product description must be at least 10 characters long")
        return value

# Create the Pydantic output parser: the parsed result is converted to a Product instance and validated
parser = PydanticOutputParser(pydantic_object=Product)
# Generate the "format instructions" string and insert it into the Prompt to guide the model to output JSON with Product's fields
format_instructions = parser.get_format_instructions()

# Put {format_instructions} in the system message and {topic} in the human message
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an AI assistant and may only output structured JSON data\n{format_instructions}"),
        ("human", "Please output product information for: {topic}"),
    ]
)

prompt = prompt_template.format_messages(
    topic="Huawei Mate X7", format_instructions=format_instructions
)
logger.info(prompt)

model = init_chat_model(
    model="qwen-plus",
    model_provider="openai",
    api_key=os.getenv("aliQwen-api"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

result = model.invoke(prompt)
logger.info(f"Raw model output:\n{result.content}")

# Parse: convert result into a Product instance; an error is raised if the format or validation fails
response = parser.invoke(result)
logger.info(f"Parsed structured result:\n{response}")

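This example shows the most complete, classic workflow of the Pydantic route:

Define a Product Pydantic model;
Use Field(...) to add descriptions to fields;
Use field_validator(...) to write finer-grained validation logic;
Create PydanticOutputParser(pydantic_object=Product);
Use get_format_instructions() to generate the format instructions;
Insert the instructions into the Prompt;
After the model responds, parse the output into a Pydantic instance.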

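When to use it

When you run into requirements like the following, Pydantic should usually be your first consideration:

The result is written to a database, and scrambled field types cannot be tolerated;
The model output feeds business systems such as orders, approvals, risk control, or reporting;
Some fields must satisfy range, enumeration, or length limits;
You want invalid data to raise an error immediately rather than slip through quietly.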
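The range, enumeration, and length limits in the list above each map onto a Pydantic feature. The sketch below uses a hypothetical `Order` model (not from any real system) to show all three, along with the "fail immediately" behaviour:

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class Order(BaseModel):
    """Hypothetical order record, used only for illustration."""

    status: Literal["pending", "paid", "cancelled"]  # enumeration
    quantity: int = Field(ge=1, le=100)              # range
    note: str = Field(max_length=20)                 # length


ok = Order(status="paid", quantity=2, note="gift wrap")
print(ok)

try:
    Order(status="shipped", quantity=0, note="x" * 50)
    failed_fields = []
except ValidationError as exc:
    # Every failing field is reported, not just the first one
    failed_fields = sorted(err["loc"][0] for err in exc.errors())

print(failed_fields)
```

Because the error lists every offending field, a caller can log the full picture or feed it back to the model in a retry prompt.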

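In one sentence:

Pydantic fits "I want data that is not just structured, but verifiable, dependable as a safety net, and ready for engineering use".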

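A selection path from simple to strict

Text display only: use StrOutputParser
Simple JSON fields: use JsonOutputParser
Fixed structure, and the model supports structured output: prefer with_structured_output(TypedDict)
Strict validation: use Pydantic + PydanticOutputParser, or with_structured_output(a Pydantic model)
Cross-language or external protocol alignment: use JSON Schema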

posted @ 2026-05-05 15:23 幻影之舞
