SNAPSHOT.md

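# 需求文档转换 (Requirements Document Conversion) Skill: Complete Code Package

Generated: 2026-05-23 16:55
Files included: 8
This file contains the complete source code of the Skill and can be used to rebuild it directly from the directory structure.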


## Directory structure

```
req-doc-conv/
├── SKILL.md
└── scripts/
    ├── check_template.py
    ├── create_custom_template.py
    ├── parse_filter.json
    ├── renderer.py
    ├── requirements.txt
    ├── supplier_parser.py
    └── supplier_schema.py
```

## SKILL.md

---
name: 需求文档转换
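description: Converts a supplier-side technical requirements document (.docx) into a client-facing requirements specification (.docx) with a strictly fixed format. The core feature-list table supports dynamic vertical cell merging, and the left-hand cells of sub-function rows that belong to the same system function must be merged vertically.
compatibility:
  - Python: 3.11 or later
  - python-docx: parsing and generating Word documents
  - Pillow: image processing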

---

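# Overview

This toolchain automatically converts a technical requirements document written by a supplier into a business requirements specification for the client (甲方).

**Core features**:
- Input: a supplier Word document (.docx) containing database design, API interfaces, and technical implementation details
- Output: a client Word document (.docx) in a fixed format
- Data flow: `supplier Word` → JSON → `client Word`
- Custom placeholders: uses the `{{placeholder_name}}` format, with no dependency on Jinja2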

---

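# Directory rules (important)

**All input and output must take place inside the Skill directory. Creating any file in the directory that holds the input document is strictly forbidden.**

The Skill directory is the directory that contains this file (SKILL.md). Scripts locate it automatically, starting from `os.path.dirname(os.path.abspath(__file__))`, so no manual configuration is needed.

| Directory | Purpose | Notes |
|------|------|------|
| `scripts/` | Script files | Parser, renderer, and so on |
| `scripts/parse_filter.json` | Filter configuration | Defines the text to exclude during parsing |
| `templates/` | Template files | `custom_template.docx` must be in this directory |
| `parsed_output/` | Output directory | Holds both the parsed JSON and the final Word file |

**Forbidden**:
- ❌ Creating a `parsed_output/` directory in the input document's directory
- ❌ Creating a `templates/` directory in the input document's directory
- ❌ Auto-generating `custom_template.docx` in the input document's directory
- ❌ Writing output files anywhere outside the Skill directory

**Correct**:
- ✅ Parse output: `<skill_dir>/parsed_output/converted_document.json`
- ✅ Template path: `<skill_dir>/templates/custom_template.docx`
- ✅ Render output: `<skill_dir>/parsed_output/client_document.docx`
- ✅ Image directory: `<skill_dir>/parsed_output/images/`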
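
The path rules above can be sketched in a few lines. This is a minimal illustration, not the Skill's actual code: the helper names `skill_paths` and `is_inside_skill_dir` are hypothetical, and the sketch assumes the calling script lives in `scripts/` directly under the Skill directory.

```python
import os


def skill_paths(script_file: str) -> dict:
    """Resolve the standard Skill locations from a script stored in scripts/."""
    scripts_dir = os.path.dirname(os.path.abspath(script_file))
    skill_dir = os.path.dirname(scripts_dir)  # assumed: parent of scripts/
    out_dir = os.path.join(skill_dir, "parsed_output")
    return {
        "skill_dir": skill_dir,
        "json": os.path.join(out_dir, "converted_document.json"),
        "template": os.path.join(skill_dir, "templates", "custom_template.docx"),
        "docx": os.path.join(out_dir, "client_document.docx"),
        "images": os.path.join(out_dir, "images"),
    }


def is_inside_skill_dir(path: str, skill_dir: str) -> bool:
    """Return True if path resolves to a location inside skill_dir."""
    root = os.path.realpath(skill_dir)
    target = os.path.realpath(path)
    try:
        return os.path.commonpath([root, target]) == root
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False
```

A script could call `is_inside_skill_dir` on every output path before writing and refuse to continue when it returns `False`.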

---

# Environment requirements

**Python version**: all scripts must be run with Python 3.11 or later.

**Commands** (all paths are relative to the Skill directory):
```bash
# Parse the supplier document (output defaults to <skill_dir>/parsed_output/converted_document.json)
py -3.11 scripts/supplier_parser.py <docx_path>

# Render into the template (defaults to <skill_dir>/templates/custom_template.docx)
py -3.11 scripts/renderer.py

# Check the template
py -3.11 scripts/check_template.py templates/custom_template.docx
```

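## Command-line arguments

| Script | Argument | Description |
|------|------|------|
| `supplier_parser.py` | `docx_path` | Path to the supplier Word document (required) |
| | `-o, --output` | Output directory (optional; defaults to `parsed_output/` under the Skill directory) |
| | `-n, --name` | Output JSON file name (optional; defaults to `converted_document.json`) |
| | `--filter` | Path to the filter configuration file (optional; defaults to `scripts/parse_filter.json`) |
| `renderer.py` | `json_path` | Path to the JSON data file (optional; defaults to `parsed_output/converted_document.json`) |
| | `-t, --template` | Path to the Word template (optional; defaults to `templates/custom_template.docx`) |
| | `-o, --output` | Path of the output Word file (optional; defaults to `parsed_output/client_document.docx`) |
| | `--max-funcs` | Maximum number of functions (optional; defaults to 10) |
| | `--max-subs` | Maximum number of sub-functions per function (optional; defaults to 10) |
| `create_custom_template.py` | `-o, --output` | Path of the output template file (optional) |
| | `--max-funcs` | Maximum number of functions (optional) |
| | `--max-subs` | Maximum number of sub-functions per function (optional) |
| | `-f, --force` | Overwrite an existing template (optional) |
| `check_template.py` | `template_path` | Path to the template file (required) |
| `pack.py` | `-o, --output` | Path of the output package file (optional; defaults to `SNAPSHOT.md`) |
| | `--no-md` | Do not pack .md documents (optional) |
| | `--root` | Skill root directory (optional) |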

## Installing dependencies

```bash
py -3.11 -m pip install python-docx Pillow
```

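# Use cases

## Suitable for

- Document standardization: quickly producing a requirements specification in the company's standard format
- Format conversion: converting a supplier's technical document into the target template format

## Not suitable for

- Editing Word files directly (use the docx skill)
- Scenarios that require AI to rewrite the content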

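# Workflow

```
Supplier Word (.docx)
       │
       ▼
supplier_parser.py (Python 3.11+) → converted_document.json
       │
       ▼
renderer.py (Python 3.11+) → Client Word (.docx)
```

## The two stages

1. Parsing stage: `supplier_parser.py` extracts the content of the supplier document and produces structured JSON
2. Rendering stage: `renderer.py` renders the JSON into the target template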

# File structure

```
需求文档转换/
├── scripts/
│   ├── supplier_parser.py       # Parser: Word → JSON
│   ├── renderer.py              # Renderer: JSON → Word
│   ├── supplier_schema.py       # JSON Schema definition
│   ├── create_custom_template.py # Template creation tool
│   ├── check_template.py        # Template checking tool
│   ├── parse_filter.json        # Parse filter configuration
│   ├── requirements.txt         # Dependency list
│   └── pack.py                  # Packing script: exports all code into a single .md file
├── templates/
│   └── custom_template.docx     # Target template (with placeholders)
├── parsed_output/
│   ├── converted_document.json  # Parsed JSON
│   ├── images/                  # Extracted images
│   └── client_document.docx     # Final output
├── SKILL.md                     # Skill description file
└── SNAPSHOT.md                  # Packed output (generated by pack.py)
```

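# Core modules

## 1. Parser (supplier_parser.py)

Parses the supplier Word document into structured JSON.

### Command

```bash
py -3.11 scripts/supplier_parser.py <docx_path>
```

### Extraction rules

The heading text is matched literally, so it is shown in the original Chinese with an English gloss.

| Heading level | Heading text | Target field | Content type |
|------|------|------|------|
| Heading 2 | 原始需求 (original requirements) | `sections.original_req` | Plain text |
| Heading 2 | 需求澄清 (requirements clarification) | `sections.req_clarify` | Plain text |
| Heading 2 | 需求分析 (requirements analysis) | `sections.req_analysis` | Plain text |
| Heading 3 | 方案概述 (solution overview) | `sections.solution_overview` | Plain text |
| Heading 2 | 新增表 (new tables) | `sections.new_tables` | Rich text blocks |
| Heading 2 | 修改表 (modified tables) | `sections.modify_tables` | Rich text blocks |
| Heading 2 | 数据兼容性要求 (data compatibility requirements) | `sections.data_compatibility` | Rich text blocks |
| Heading 2 | 功能点 (functions) | `functions[]` | Nested structure |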

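### Rich text extraction features

- Consecutive runs with the same style are merged automatically to reduce storage
- Bulleted lists are supported (disc, triangle, square, diamond, and so on)
- Numbered lists are supported (Arabic numerals, Chinese numerals, circled numbers, Roman numerals, and so on)
- Multi-level lists are supported (the numbering text is computed automatically)
- Text indentation is supported (hanging and first-line)
- Horizontally merged (gridSpan) and vertically merged (vMerge) table cells are supported
- Rich text inside table cells is supported
- Images are extracted and resized
- Workload column removal: in sub-function tables, any column whose header contains "工作量" (workload) is removed entirely, header and data
- XXX function skipping: when a function name contains "XXX" and all of its sub-function names also contain "XXX", the function and all of its content are skipped
- 中台三问 exclusion: a function whose name contains "中台三问" is skipped automatically and handled separately later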
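
The last three rules can be sketched as two small helpers. This is only an illustration under stated assumptions: the function names `should_skip_function` and `drop_workload_column` are hypothetical, tables are modeled as plain lists of strings rather than the parser's real cell objects, and a function whose name contains "XXX" but has no sub-functions is treated as skippable.

```python
def should_skip_function(name: str, sub_names: list[str]) -> bool:
    """Decide whether a function (功能点) should be left out of the JSON."""
    # Functions about 中台三问 are handled separately, so always skip them here.
    if "中台三问" in name:
        return True
    # Template leftovers: the function and every sub-function still say "XXX".
    return "XXX" in name and all("XXX" in sub for sub in sub_names)


def drop_workload_column(header: list[str], rows: list[list[str]]):
    """Remove every column whose header contains 工作量, header and data alike."""
    keep = [i for i, title in enumerate(header) if "工作量" not in title]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep if i < len(row)] for row in rows]
    return new_header, new_rows
```

For example, a table with the header `["序号", "功能", "工作量(人天)"]` would keep only its first two columns.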

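## 2. Renderer (renderer.py)

Renders the JSON data into the Word template.

### Command

```bash
py -3.11 scripts/renderer.py
```

### Core capabilities

- Supports plain text, rich text, tables, and images
- Renders bulleted lists, numbered lists, and multi-level lists
- Supports text indentation (hanging and first-line)
- Supports the nested function structure (up to 10 functions, each with up to 10 sub-functions)
- Deletes unused placeholder sections automatically
- Preserves the template's styles
- Decides which paragraphs to delete from the placeholder state (no hard-coded heading names)

### List rendering

- Bullets are rendered with standard Unicode characters (• ▶ ■ ◆ ✓ ★ →), so no special font is required
- The parser recognizes Wingdings/PUA bullets in supplier documents and converts them to type names
- Numbered lists get their numbering added as prefix text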

## 3. JSON Schema (supplier_schema.py)

Defines the contract for the JSON data structure.

### Data structure

```json
{
  "doc_title": "document title",
  "doc_date": "2024年6月30日",
  "sections": {
    "original_req": "original requirements text",
    "req_clarify": "requirements clarification text",
    "req_analysis": "requirements analysis text",
    "solution_overview": "solution overview text",
    "new_tables": [],
    "modify_tables": [],
    "data_compatibility": []
  },
  "functions": [
    {
      "name": "function name",
      "scene_desc": "scenario description",
      "sub_functions": [
        {
          "name": "sub-function name",
          "content_blocks": []
        }
      ]
    }
  ],
  "zhongtai_interfaces": [
    {
      "interface_name": "interface name",
      "method": "method name",
      "description": "interface description"
    }
  ],
  "images_dir": "images"
}
```
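
A light structural check against this shape might look like the sketch below. It is not the content of `supplier_schema.py`, which is not shown in this part of the package; `validate_document` and the two tuples are hypothetical names, and only the fields listed above are checked.

```python
TEXT_SECTIONS = ("original_req", "req_clarify", "req_analysis", "solution_overview")
BLOCK_SECTIONS = ("new_tables", "modify_tables", "data_compatibility")


def validate_document(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    sections = data.get("sections")
    if not isinstance(sections, dict):
        errors.append("sections must be an object")
        sections = {}
    # Plain-text sections are strings, rich sections are lists of blocks.
    for key in TEXT_SECTIONS:
        if not isinstance(sections.get(key, ""), str):
            errors.append(f"sections.{key} must be a string")
    for key in BLOCK_SECTIONS:
        if not isinstance(sections.get(key, []), list):
            errors.append(f"sections.{key} must be a list")
    for i, func in enumerate(data.get("functions", [])):
        if not func.get("name"):
            errors.append(f"functions[{i}].name is empty")
        for j, sub in enumerate(func.get("sub_functions", [])):
            if not isinstance(sub.get("content_blocks", []), list):
                errors.append(f"functions[{i}].sub_functions[{j}].content_blocks must be a list")
    return errors
```

Running such a check between the parsing and rendering stages would surface a malformed JSON file before any Word output is written.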

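## 4. Parse filter configuration (parse_filter.json)

The parser can filter out unwanted paragraphs through a configuration file, so that tables of contents, template instructions, and similar noise are not extracted into the JSON.

### Filter rule types

| Rule type | Config key | Description |
|------|------|------|
| Style filter | `exclude_styles` | Filters by paragraph style name (case-insensitive), for example `"toc 1"` or `"TOCHeading"` |
| Text filter | `exclude_texts` | Filters by content: a paragraph is excluded if it contains the text |
| Regex filter | `exclude_text_patterns` | Filters by regular expression, matched against the full text of the paragraph |

### Configuration example

```json
{
    "exclude_styles": ["toc 1", "toc 2", "TOCHeading"],
    "exclude_texts": ["请在此处填写", "模板说明"],
    "exclude_text_patterns": ["^\\d+\\.\\s*.+\\t\\d+$"]
}
```

### Usage

```bash
# Use the default filter configuration (scripts/parse_filter.json)
py -3.11 scripts/supplier_parser.py <docx_path>

# Use a custom filter configuration
py -3.11 scripts/supplier_parser.py <docx_path> --filter my_filter.json
```

Note: content is never filtered because of its rich-text styling (blue italics, for example). Only the rules listed explicitly in the configuration take effect.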
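
How the three rule types might combine is sketched below. The function name `should_exclude` is hypothetical and the parser's real matching code is not shown here; in particular, using `re.search` with `re.DOTALL` is an assumption about what "matched against the full text" means.

```python
import re


def should_exclude(style_name, text: str, cfg: dict) -> bool:
    """Return True if a paragraph matches any rule in the filter config."""
    # 1. Style filter: case-insensitive comparison of the style name.
    styles = {s.lower() for s in cfg.get("exclude_styles", [])}
    if style_name and style_name.lower() in styles:
        return True
    # 2. Text filter: substring match; empty entries are ignored so that
    #    a stray "" in the config cannot exclude every paragraph.
    if any(t and t in text for t in cfg.get("exclude_texts", [])):
        return True
    # 3. Regex filter: patterns carry their own ^...$ anchors.
    return any(
        re.search(pattern, text, re.DOTALL)
        for pattern in cfg.get("exclude_text_patterns", [])
    )
```

With the configuration example above, a table-of-contents line such as `"1. Overview\t5"` would be excluded by the regex rule even when its paragraph style is `Normal`.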


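# Content block formats

## Rich text block (compact array format)

```json
{
  "t": "text",
  "runs": [
    ["plain text"],
    ["bold blue", {"b": true, "c": "0000FF"}],
    ["plain again"]
  ],
  "list": "disc/420/h420",
  "indent": {"left": 420, "hanging": 420}
}
```

### Format notes

- `runs`: a list of arrays, each of the form `[text, style?]`
- Style attributes: `b` (bold), `i` (italic), `u` (underline), `c` (color as HEX), `sz` (font size in half-points)
- `list`: list information (present only on list paragraphs), as a compact string
  - Bullets: `"disc"`, `"triangle"`, `"square"`, and so on, optionally followed by a level such as `:1` and an indent such as `/420/h420`
  - Numbered lists: `"1."`, `"(一)"`, and so on, optionally followed by a level such as `:1` and an indent such as `/0/f420`
- `indent`: indentation (present only when it differs from the default)
  - `{"left": 420, "hanging": 420}` or `{"left": 0, "firstLine": 420}`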
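
The run-merging step that produces the compact `runs` arrays could look roughly like this. It is a sketch, not the parser's implementation: `compact_runs` is a hypothetical name, and the input is assumed to be a list of `(text, style_dict)` pairs already extracted from the Word runs.

```python
def compact_runs(runs):
    """Merge consecutive runs that share a style into [text, style?] arrays."""
    merged = []
    for text, style in runs:
        if not text:
            continue  # empty runs carry no content
        style = style or {}
        if merged and merged[-1][1] == style:
            merged[-1][0] += text  # same style as the previous run: append
        else:
            merged.append([text, dict(style)])
    # Omit the style element entirely when it is empty.
    return [[text, style] if style else [text] for text, style in merged]
```

Leaving out the empty style dictionary is what keeps unstyled runs down to one-element arrays such as `["plain text"]`.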

## Table block

```json
{
  "t": "table",
  "cols": 4,
  "h": [
    {"v": "No.", "r": [["No."]]},
    {"v": "Field name", "r": [["Field name"]]},
    {"v": "Field description", "r": [["Field description"]]},
    {"v": "Remarks", "r": [["Remarks"]]}
  ],
  "d": [
    [{"v": "1", "r": [["1"]]}, {"v": "City", "r": [["City"]]}, {"v": "", "r": []}, {"v": "", "r": []}],
    [{"v": "2", "r": [["2"]]}, {"v": "District", "r": [["District"]]}, {"v": "", "r": []}, {"v": "", "r": []}]
  ],
  "col_widths": [1500, 3500, 2500, 2500]
}
```

### Cell format

- `v`: plain text of the cell
- `r`: rich text runs, `[[text, style?]]`
- `gs`: number of columns merged horizontally (gridSpan)
- `vm`: vertical merge state, `"restart"` or `"continue"`

### Table format

- `col_widths`: list of column widths (optional), in twentieths of a point (1/20 pt), for example `[1500, 3500, 4000]`
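
The requirement that sub-function rows under the same system function share one vertically merged cell maps onto the `vm` flag. The sketch below shows one way to compute those flags for a column; `mark_vertical_merges` and `twips_to_points` are hypothetical helpers, and merging on identical consecutive text is an assumption about how rows are grouped.

```python
def mark_vertical_merges(rows, col=0):
    """Set "vm" on consecutive rows whose cell text in `col` is identical.

    Rows are modified in place and also returned. The first cell of a group
    gets "restart", the following ones get "continue".
    """
    group_start = None
    previous = None
    for index, row in enumerate(rows):
        cell = row[col]
        text = cell.get("v", "")
        if text and text == previous:
            rows[group_start][col]["vm"] = "restart"
            cell["vm"] = "continue"
        else:
            group_start, previous = index, text
    return rows


def twips_to_points(width: int) -> float:
    """Convert a col_widths value (1/20 pt) to points."""
    return width / 20
```

Empty cells are never merged with each other, which avoids joining unrelated blank rows into one tall cell.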

## Image block

```json
{
  "t": "img",
  "p": "images/img_001.png",
  "w": 200,
  "h": 150
}
```

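# Template placeholders

The template uses placeholders in the `{{placeholder_name}}` format.

## Document-level placeholders

- `{{doc_title}}`: document title
- `{{doc_date}}`: document date

## Section placeholders

- `{{original_req}}`: original requirements
- `{{req_clarify}}`: requirements clarification
- `{{req_analysis}}`: requirements analysis
- `{{solution_overview}}`: solution overview
- `{{new_tables}}`: new tables
- `{{modify_tables}}`: modified tables
- `{{data_compatibility}}`: data compatibility requirements

## 中台三问 placeholders

- `{{func_tables}}`: list of 中台三问 interface tables
- `{{zhongtai_scene}}`: 中台三问 scenario description
- `{{happy_path}}`: happy path
- `{{sad_path}}`: sad path

## Function placeholders (N and M each range from 1 to 10)

- `{{func_N_name}}`: function name
- `{{func_N_scene}}`: scenario description
- `{{func_N_has_scene}}`: whether a scenario description exists (boolean, used for conditional rendering)
- `{{func_N_has_subs}}`: whether sub-functions exist (boolean, used for conditional rendering)
- `{{func_N_sub_M_name}}`: sub-function name
- `{{func_N_sub_M_content}}`: sub-function content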
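
The naming scheme is regular enough to generate, and plain-text substitution needs only one regular expression. The sketch below reuses the `\{\{(\w+)\}\}` pattern that appears in `check_template.py` and `renderer.py`; the helpers `function_placeholders` and `fill_text` are hypothetical, and replacing unknown names with an empty string is an assumption about how leftover placeholders are cleaned up.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")


def function_placeholders(max_funcs: int = 10, max_subs: int = 10) -> list[str]:
    """List every text placeholder name generated for functions."""
    names = []
    for n in range(1, max_funcs + 1):
        names += [f"func_{n}_name", f"func_{n}_scene"]
        for m in range(1, max_subs + 1):
            names += [f"func_{n}_sub_{m}_name", f"func_{n}_sub_{m}_content"]
    return names


def fill_text(text: str, values: dict) -> str:
    """Replace each {{name}} with its value; unknown names become empty."""
    return PLACEHOLDER_RE.sub(lambda m: str(values.get(m.group(1), "")), text)
```

With the default limits this yields 220 names, which is why the template is pre-populated by a script rather than by hand.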

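# Technical notes

## Placeholder rendering flow

1. Build the placeholder map (`build_placeholder_map`)
2. Analyze the template structure and determine what to delete (`analyze_template_structure`)
3. Walk the paragraphs and replace placeholders with the actual content
4. Insert tables and images at the correct positions
5. Delete unused placeholder sections
6. Clean up any leftover placeholders

## Style preservation

The styles of template paragraphs are preserved during rendering:

- The original paragraph's style and run styles are saved
- The styles are restored after the paragraph is cleared
- New content is given the template's styles

## Smart deletion

The deletion range is decided automatically from the placeholder state:

- Empty function name → delete the whole function section
- Empty sub-function name → delete the sub-function section
- Empty scenario description → delete the "场景描述" (scenario description) heading
- No sub-functions → delete the "实现方案" (implementation) heading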
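
These four rules can be expressed as a pure function over the placeholder map. The sketch below is an assumption-laden illustration, not the renderer's `analyze_template_structure`: the function name `sections_to_delete` and the tuple labels it returns are hypothetical, and it only decides what to delete without touching any document.

```python
def sections_to_delete(placeholder_map: dict, max_funcs: int = 10, max_subs: int = 10) -> list[tuple]:
    """Return the template parts that should be removed, in document order."""
    targets = []
    for n in range(1, max_funcs + 1):
        if not placeholder_map.get(f"func_{n}_name"):
            targets.append(("function", n))  # whole function section goes
            continue
        if not placeholder_map.get(f"func_{n}_has_scene"):
            targets.append(("scene_heading", n))
        if not placeholder_map.get(f"func_{n}_has_subs"):
            targets.append(("solution_heading", n))
        for m in range(1, max_subs + 1):
            if not placeholder_map.get(f"func_{n}_sub_{m}_name"):
                targets.append(("sub_function", n, m))
    return targets
```

Keeping the decision separate from the XML manipulation makes the rules easy to unit-test without a Word file.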

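# Notes

1. Python version: Python 3.11 or later is required
2. Rich text format: uses the compact array format; runs with the same style are merged during parsing
3. List support: both parsing and rendering support bulleted lists, numbered lists, multi-level lists, and text indentation
4. Text filtering: filter rules in parse_filter.json exclude tables of contents, template instructions, and similar noise during parsing
5. Template limits: at most 10 functions, each with at most 10 sub-functions
6. Image paths: image paths are relative to the directory that contains the JSON file
7. Table style: tables use the default TableGrid style and support horizontal merging (gridSpan), vertical merging, and rich text in cells
8. Blank lines: blank lines in the document are kept after rendering and are not deleted by mistake
9. Template protection: create_custom_template.py does not overwrite an existing template by default; pass -f to force it
10. Packaging: pack.py packs all code into SNAPSHOT.md in one step for delivery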
---
## scripts/check_template.py

```python
"""
Check whether a template file is compliant.
"""
import re
import sys

from docx import Document


def check_template(template_path: str) -> bool:
    """Check the template and return True if it is compliant."""
    doc = Document(template_path)

    print(f"检查模板: {template_path}\n")

    # Collect every {{placeholder}} found in the body paragraphs
    placeholders = set()
    for para in doc.paragraphs:
        placeholders.update(re.findall(r'\{\{(\w+)\}\}', para.text))

    # Placeholders the template is expected to contain
    expected_placeholders = {
        "doc_title", "doc_date",
        "original_req", "req_clarify", "req_analysis", "solution_overview",
        "new_tables", "modify_tables", "data_compatibility",
        "zhongtai_scene", "func_tables",
        "happy_path", "sad_path"
    }

    # Function and sub-function placeholders (10 x 10)
    for i in range(1, 11):
        expected_placeholders.add(f"func_{i}_name")
        expected_placeholders.add(f"func_{i}_scene")
        for j in range(1, 11):
            expected_placeholders.add(f"func_{i}_sub_{j}_name")
            expected_placeholders.add(f"func_{i}_sub_{j}_content")

    print("=== 模板中的占位符 ===")
    for ph in sorted(placeholders):
        status = "[OK]" if ph in expected_placeholders else "[?]"
        print(f"  {status} {{{{ {ph} }}}}")

    print(f"\n总计: {len(placeholders)} 个占位符")

    # Report missing placeholders (first 10 only)
    missing = expected_placeholders - placeholders
    if missing:
        print(f"\n缺失的占位符: {len(missing)} 个")
        for ph in sorted(missing)[:10]:
            print(f"  - {{{{ {ph} }}}}")
        if len(missing) > 10:
            print(f"  ... 还有 {len(missing) - 10} 个")

    # Report placeholders that are not part of the expected set
    unknown = placeholders - expected_placeholders
    if unknown:
        print(f"\n未知占位符: {len(unknown)} 个")
        for ph in sorted(unknown):
            print(f"  ? {{{{ {ph} }}}}")

    # Count headings per style
    print("\n=== 标题样式检查 ===")
    heading_count = {"Heading 1": 0, "Heading 2": 0, "Heading 3": 0, "Heading 4": 0}
    for para in doc.paragraphs:
        style_name = para.style.name if para.style else "Normal"
        if "Heading" in style_name:
            heading_count[style_name] = heading_count.get(style_name, 0) + 1

    for style, count in heading_count.items():
        print(f"  {style}: {count} 个")

    # Verdict
    print("\n=== 检查结果 ===")
    if not missing and not unknown:
        print("[OK] 模板合规,可以进行渲染")
        return True
    else:
        print("[FAIL] 模板存在问题")
        return False


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description='检查模板文件是否合规')
    parser.add_argument('template_path', help='模板文件路径')

    args = parser.parse_args()

    # Exit with a non-zero status when the template is not compliant
    sys.exit(0 if check_template(args.template_path) else 1)

```

---
## scripts/create_custom_template.py

```python
"""
Create the custom placeholder template.
Pre-populates enough function and sub-function placeholders.
"""

import os

from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.shared import Pt

# The Skill directory is the parent of scripts/, where this file is expected
# to live, so the default output stays inside the Skill directory regardless
# of the current working directory.
SKILL_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DEFAULT_TEMPLATE = os.path.join(SKILL_DIR, 'templates', 'custom_template.docx')


def set_chinese_font(run, font_name='宋体'):
    """Set the font of a run, including its East Asian font."""
    run.font.name = font_name
    run._element.rPr.rFonts.set(qn('w:eastAsia'), font_name)


def add_styled_heading(doc, text, level):
    """Add a heading and set all of its runs to 黑体."""
    heading = doc.add_heading(text, level=level)
    for run in heading.runs:
        set_chinese_font(run, '黑体')
    return heading


def add_placeholder(doc, name):
    """Add a body paragraph that holds a single {{name}} placeholder."""
    para = doc.add_paragraph()
    run = para.add_run('{{%s}}' % name)
    set_chinese_font(run)
    return para


def create_custom_template(output_path: str, max_funcs: int = 10, max_subs: int = 10):
    """
    Create the custom placeholder template.

    Args:
        output_path: output path
        max_funcs: maximum number of functions
        max_subs: maximum number of sub-functions per function
    """
    doc = Document()

    # Default font
    style = doc.styles['Normal']
    style.font.name = '宋体'
    style._element.rPr.rFonts.set(qn('w:eastAsia'), '宋体')
    style.font.size = Pt(10.5)

    # Document title
    title = doc.add_paragraph()
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER
    run = title.add_run('{{doc_title}}')
    run.font.size = Pt(22)
    run.bold = True
    set_chinese_font(run, '黑体')

    # Document date
    date_para = doc.add_paragraph()
    date_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    run = date_para.add_run('{{doc_date}}')
    set_chinese_font(run)

    doc.add_paragraph()

    # Sections 1-4: a level-1 heading followed by one placeholder each
    for heading, name in [
        ('一、原始需求', 'original_req'),
        ('二、需求澄清', 'req_clarify'),
        ('三、需求分析', 'req_analysis'),
        ('四、方案概述', 'solution_overview'),
    ]:
        add_styled_heading(doc, heading, 1)
        add_placeholder(doc, name)

    # Section 5: data model
    add_styled_heading(doc, '五、数据模型', 1)
    for heading, name in [
        ('5.1 新增表', 'new_tables'),
        ('5.2 修改表', 'modify_tables'),
        ('5.3 数据兼容性要求', 'data_compatibility'),
    ]:
        add_styled_heading(doc, heading, 2)
        add_placeholder(doc, name)

    # Section 6: function breakdown, with preset functions and sub-functions
    add_styled_heading(doc, '六、功能分解', 1)
    for i in range(1, max_funcs + 1):
        add_styled_heading(doc, '{{func_%d_name}}' % i, 2)
        add_styled_heading(doc, '场景描述', 3)
        add_placeholder(doc, 'func_%d_scene' % i)
        add_styled_heading(doc, '实现方案', 3)
        for j in range(1, max_subs + 1):
            add_styled_heading(doc, '{{func_%d_sub_%d_name}}' % (i, j), 4)
            add_placeholder(doc, 'func_%d_sub_%d_content' % (i, j))

    # Section 7: 中台三问
    add_styled_heading(doc, '七、中台三问', 1)
    for heading, name in [
        ('7.1 场景描述', 'zhongtai_scene'),
        ('7.2 功能列表', 'func_tables'),
    ]:
        add_styled_heading(doc, heading, 2)
        add_placeholder(doc, name)

    # Section 8: test suggestions
    add_styled_heading(doc, '八、测试建议', 1)
    for heading, name in [
        ('正常路径', 'happy_path'),
        ('异常路径', 'sad_path'),
    ]:
        add_styled_heading(doc, heading, 2)
        add_placeholder(doc, name)

    # Save the template
    doc.save(output_path)
    print(f"模板已创建: {output_path}")
    print(f"  - 预设 {max_funcs} 个功能点")
    print(f"  - 每个功能点预设 {max_subs} 个子功能点")

    return output_path


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description='创建自定义占位符模板')
    parser.add_argument('-o', '--output', default=DEFAULT_TEMPLATE, help='输出模板文件路径(默认: <skill_dir>/templates/custom_template.docx)')
    parser.add_argument('--max-funcs', type=int, default=10, help='最大功能点数量(默认: 10)')
    parser.add_argument('--max-subs', type=int, default=10, help='每个功能点最大子功能数量(默认: 10)')
    parser.add_argument('-f', '--force', action='store_true', help='强制覆盖已存在的模板')

    args = parser.parse_args()

    # Never overwrite a template the user may have edited unless -f is given
    if os.path.exists(args.output) and not args.force:
        print(f"模板已存在: {args.output}")
        print("如需覆盖,请使用 -f 或 --force 参数")
        print("跳过模板创建,保留用户手动修改的模板")
    else:
        output_dir = os.path.dirname(args.output)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)

        create_custom_template(args.output, args.max_funcs, args.max_subs)

```

---
## scripts/parse_filter.json

```json
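{
    // Parse filter configuration
    // While the supplier Word document is parsed, paragraphs that match a rule below are skipped and not written to the JSON
    // Content with rich-text styling such as blue italics is unaffected; only what is listed here is filtered

    // Filter by paragraph style name (case-insensitive)
    // Typical use: table-of-contents paragraphs, headers and footers
    "exclude_styles": [
        "toc 1",        // level-1 TOC entry
        "toc 2",        // level-2 TOC entry
        "toc 3",        // level-3 TOC entry
        "toc 4",        // level-4 TOC entry
        "TOC 1",
        "TOC 2",
        "TOC 3",
        "TOC 4",
        "TOCHeading"    // TOC heading
    ],

    // Filter by text content (multi-line text is supported; a paragraph matches if its text contains the entry)
    // Typical use: template instructions, fill-in hints, annotations
    // Use \n for a line break in multi-line text, e.g. "line one\nline two"
    "exclude_texts": [
        "请在此处填写",
        "请填写",
        "此处填写",
        "以下为模板说明",
        "模板说明",
        "本模板使用说明",
        "【说明】",
        "【注】",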
        "【示例】",	"我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来",
		"我是说明文字,这行文字不应该被提取出来"
    ],
    
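    // Filter by regular expression (matched against the paragraph's full text, including text after line breaks)
    // Typical use: table-of-contents lines (in the "1.1. xxx\t5" format)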
    "exclude_text_patterns": [
        "^\\d+\\.\\s*\\d+\\.\\s*\\d*\\.?\\s*.+\\t\\d+$",
        "^\\d+\\.\\s*.+\\t\\d+$"
    ]
}

```

---
## scripts/renderer.py

```python
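# -*- coding: utf-8 -*-
"""
Requirements document renderer v3
Renders JSON data into a Word template.

Features:
- Plain text, rich text, tables, and images
- Bullets, numbered lists, and multi-level lists
- Text indentation (hanging and first-line)
- Nested function structure
- Automatic removal of unused placeholder sections
- Template styles are preserved
- Empty headings are handled gracefully (decided from the placeholder state)
"""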

import os
import json
import re
from docx import Document
from docx.shared import Pt, RGBColor
from docx.oxml.ns import qn
from docx.oxml import OxmlElement


BULLET_CHARS = {
    "disc": "\u2022",
    "triangle": "\u25b6",
    "square": "\u25a0",
    "diamond": "\u25c6",
    "check": "\u2713",
    "star": "\u2605",
    "arrow": "\u2192",
}


def _get_bullet_char(bullet_name):
    return BULLET_CHARS.get(bullet_name, "\u2022")


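def create_table_element(table_data: dict):
    """
    Build the table XML element (supports merged cells and rich text).
    
    Cell format:
    {"v": "text", "r": [[run]], "gs": horizontal span, "vm": vertical merge state}
    
    Legacy format: a cell given as a plain string is converted automatically.
    """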
    cols = table_data.get("cols", 4)
    headers = table_data.get("h", [])
    data = table_data.get("d", [])
    
    tbl = OxmlElement('w:tbl')
    
    tblPr = OxmlElement('w:tblPr')
    tblStyle = OxmlElement('w:tblStyle')
    tblStyle.set(qn('w:val'), 'TableGrid')
    tblPr.append(tblStyle)
    
    tblBorders = OxmlElement('w:tblBorders')
    for border_name in ['top', 'left', 'bottom', 'right', 'insideH', 'insideV']:
        border = OxmlElement(f'w:{border_name}')
        border.set(qn('w:val'), 'single')
        border.set(qn('w:sz'), '4')
        border.set(qn('w:space'), '0')
        border.set(qn('w:color'), '000000')
        tblBorders.append(border)
    tblPr.append(tblBorders)
    tbl.append(tblPr)
    
    tblGrid = OxmlElement('w:tblGrid')
    col_widths = table_data.get("col_widths", None)
    for i in range(cols):
        gridCol = OxmlElement('w:gridCol')
        if col_widths and i < len(col_widths):
            gridCol.set(qn('w:w'), str(col_widths[i]))
        else:
            gridCol.set(qn('w:w'), '2500')
        tblGrid.append(gridCol)
    tbl.append(tblGrid)

    def create_cell(cell_data, is_header=False, col_idx=0):
        if isinstance(cell_data, str):
            cell_data = {"v": cell_data, "r": [[cell_data]]}

        tc = OxmlElement('w:tc')

        tcPr = OxmlElement('w:tcPr')

        # The WordprocessingML schema defines the children of w:tcPr as an
        # ordered sequence (tcW, gridSpan, vMerge, ...), so the width is
        # appended first.
        tcW = OxmlElement('w:tcW')
        if col_widths and col_idx < len(col_widths):
            tcW.set(qn('w:w'), str(col_widths[col_idx]))
        else:
            tcW.set(qn('w:w'), '2500')
        tcW.set(qn('w:type'), 'dxa')
        tcPr.append(tcW)

        if cell_data.get("gs"):
            gridSpan = OxmlElement('w:gridSpan')
            gridSpan.set(qn('w:val'), str(cell_data["gs"]))
            tcPr.append(gridSpan)

        if cell_data.get("vm"):
            vMerge = OxmlElement('w:vMerge')
            vMerge.set(qn('w:val'), cell_data["vm"])
            tcPr.append(vMerge)

        tc.append(tcPr)
        
        p = OxmlElement('w:p')
        
        runs = cell_data.get("r", [])
        if not runs and cell_data.get("v"):
            runs = [[cell_data["v"]]]
        
        for run_item in runs:
            if not run_item:
                continue
            
            text = run_item[0] if len(run_item) > 0 else ""
            if not text:
                continue
            
            r = OxmlElement('w:r')
            rPr = OxmlElement('w:rPr')
            
            style = run_item[1] if len(run_item) > 1 and run_item[1] else {}

            # Header cells are always bold. Emit w:b only once, even when the
            # run's own style is bold too, and append the properties in the
            # order Word itself writes them (b, i, color, sz, u).
            if is_header or style.get("b"):
                rPr.append(OxmlElement('w:b'))
            if style.get("i"):
                rPr.append(OxmlElement('w:i'))
            if style.get("c"):
                color = OxmlElement('w:color')
                color.set(qn('w:val'), style["c"])
                rPr.append(color)
            if style.get("sz"):
                sz = OxmlElement('w:sz')
                sz.set(qn('w:val'), str(style["sz"]))
                rPr.append(sz)
            if style.get("u"):
                u = OxmlElement('w:u')
                u.set(qn('w:val'), 'single')
                rPr.append(u)
            
            if len(rPr) > 0:
                r.append(rPr)
            
            t = OxmlElement('w:t')
            t.text = text
            t.set('{http://www.w3.org/XML/1998/namespace}space', 'preserve')
            r.append(t)
            p.append(r)
        
        tc.append(p)
        return tc
    
    if headers:
        tr = OxmlElement('w:tr')
        for col_idx, h in enumerate(headers):
            tc = create_cell(h, is_header=True, col_idx=col_idx)
            tr.append(tc)
        tbl.append(tr)

    for row_data in data:
        tr = OxmlElement('w:tr')
        for col_idx, cell_data in enumerate(row_data):
            tc = create_cell(cell_data, is_header=False, col_idx=col_idx)
            tr.append(tc)
        tbl.append(tr)
    
    return tbl


def build_placeholder_map(data: dict, max_funcs: int = 10, max_subs: int = 10) -> dict:
    """Build the mapping from placeholder names to their content."""
    placeholder_map = {}
    
    placeholder_map["doc_title"] = data.get("doc_title", "")
    placeholder_map["doc_date"] = data.get("doc_date", "")
    
    sections = data.get("sections", {})
    placeholder_map["original_req"] = sections.get("original_req", "")
    placeholder_map["req_clarify"] = sections.get("req_clarify", "")
    placeholder_map["req_analysis"] = sections.get("req_analysis", "")
    placeholder_map["solution_overview"] = sections.get("solution_overview", "")
    
    placeholder_map["new_tables"] = sections.get("new_tables", [])
    placeholder_map["modify_tables"] = sections.get("modify_tables", [])
    placeholder_map["data_compatibility"] = sections.get("data_compatibility", [])
    
    # 中台三问 interface list -> func_tables placeholder
    zhongtai_interfaces = data.get("zhongtai_interfaces", [])
    if zhongtai_interfaces:
        func_tables_data = []
        for iface in zhongtai_interfaces:
            func_tables_data.append([
                {"v": iface.get("method", ""), "r": [[iface.get("method", "")]]},
                {"v": iface.get("interface_name", ""), "r": [[iface.get("interface_name", "")]]},
                {"v": iface.get("description", ""), "r": [[iface.get("description", "")]]}
            ])
        func_tables_blocks = [{
            "t": "table",
            "cols": 3,
            "h": [
                {"v": "接口ID", "r": [["接口ID"]]},
                {"v": "接口名称", "r": [["接口名称"]]},
                {"v": "功能说明", "r": [["功能说明"]]}
            ],
            "d": func_tables_data,
            "col_widths": [1500, 3500, 4000]
        }]
    else:
        func_tables_blocks = []
    placeholder_map["func_tables"] = func_tables_blocks
    
    placeholder_map["zhongtai_scene"] = data.get("zhongtai_scene", "")
    placeholder_map["happy_path"] = data.get("happy_path", "")
    placeholder_map["sad_path"] = data.get("sad_path", "")
    
    functions = data.get("functions", [])
    for i in range(max_funcs):
        if i < len(functions):
            func = functions[i]
            placeholder_map[f"func_{i+1}_name"] = func.get("name", "")
            placeholder_map[f"func_{i+1}_scene"] = func.get("scene_desc", "")
            placeholder_map[f"func_{i+1}_has_scene"] = bool(func.get("scene_desc", ""))
            placeholder_map[f"func_{i+1}_has_subs"] = bool(func.get("sub_functions", []))
            
            sub_functions = func.get("sub_functions", [])
            for j in range(max_subs):
                if j < len(sub_functions):
                    sub = sub_functions[j]
                    placeholder_map[f"func_{i+1}_sub_{j+1}_name"] = sub.get("name", "")
                    placeholder_map[f"func_{i+1}_sub_{j+1}_content"] = sub.get("content_blocks", [])
                else:
                    placeholder_map[f"func_{i+1}_sub_{j+1}_name"] = ""
                    placeholder_map[f"func_{i+1}_sub_{j+1}_content"] = []
        else:
            placeholder_map[f"func_{i+1}_name"] = ""
            placeholder_map[f"func_{i+1}_scene"] = ""
            placeholder_map[f"func_{i+1}_has_scene"] = False
            placeholder_map[f"func_{i+1}_has_subs"] = False
            for j in range(max_subs):
                placeholder_map[f"func_{i+1}_sub_{j+1}_name"] = ""
                placeholder_map[f"func_{i+1}_sub_{j+1}_content"] = []
    
    return placeholder_map


def find_placeholder(text: str) -> list:
    """Find every placeholder name in the text."""
    return re.findall(r'\{\{(\w+)\}\}', text)


def get_run_style(para):
    """Return the style of the paragraph's first run (used to preserve template styles)."""
    style = {}
    for run in para.runs:
        if run.font:
            if run.font.name:
                style["font_name"] = run.font.name
            if run.font.size:
                style["font_size"] = run.font.size
            if run.font.color and run.font.color.rgb:
                style["font_color"] = str(run.font.color.rgb)
            if run.font.bold is not None:
                style["bold"] = run.font.bold
            if run.font.italic is not None:
                style["italic"] = run.font.italic
            if run.font.underline is not None:
                style["underline"] = run.font.underline
        break
    return style


def apply_run_style(run, style: dict):
    """Apply a saved template style to a run."""
    if not style:
        return
    
    if style.get("font_name"):
        run.font.name = style["font_name"]
        run._element.rPr.rFonts.set(qn('w:eastAsia'), style["font_name"])
    if style.get("font_size"):
        run.font.size = style["font_size"]
    if style.get("font_color"):
        run.font.color.rgb = RGBColor.from_string(style["font_color"])
    if style.get("bold"):
        run.bold = style["bold"]
    if style.get("italic"):
        run.italic = style["italic"]
    if style.get("underline"):
        run.underline = style["underline"]


def apply_style_to_run(run, style: dict):
    """Apply a compact rich-text style to a run (used when rendering rich text)."""
    if not style:
        return
    
    if style.get("b"):
        run.bold = True
    if style.get("i"):
        run.italic = True
    if style.get("u"):
        run.underline = True
    if style.get("c"):
        run.font.color.rgb = RGBColor.from_string(style["c"])
    if style.get("sz"):
        run.font.size = Pt(style["sz"] / 2)


def render_rich_text(para, block: dict, template_style: dict = None):
    """
    Render a rich text block into a paragraph.
    
    Compact format:
    {
      "t": "text",
      "runs": [["plain text"], ["bold blue", {"b": true, "c": "0000FF"}]],
      "list": "disc/420/h420",
      "indent": {"left": 420, "hanging": 420}
    }
    
    Format of the list string:
    - bullet: "disc" or "disc:1" (bullet name:ilvl)
    - numbered: "1." or "1.:1" (numbering text:ilvl)
    - with indent: "disc/420/h420" (list/left indent/hanging or first-line indent)
    """
    if not block:
        return
    
    runs_data = block.get("runs", [])
    list_info = block.get("list", None)
    indent_info = block.get("indent", None)
    
    if list_info:
        _apply_list_to_paragraph(para, list_info)
    
    if indent_info:
        _apply_indent_to_paragraph(para, indent_info)
    
    if not runs_data:
        return
    
    for run_item in runs_data:
        if not run_item or not isinstance(run_item, list):
            continue
        
        text = run_item[0] if len(run_item) > 0 else ""
        if not text:
            continue
        
        run = para.add_run(text)
        apply_run_style(run, template_style)
        
        if len(run_item) > 1 and run_item[1]:
            apply_style_to_run(run, run_item[1])


def _parse_list_string(list_str: str) -> dict:
    """
    Parse a compact list string into structured information.
    
    Formats:
    - "disc" -> bullet, ilvl=0, bullet=disc
    - "disc:1" -> bullet, ilvl=1, bullet=disc
    - "1." -> numbered, ilvl=0, numText="1."
    - "1.:1" -> numbered, ilvl=1, numText="1."
    - "disc/420/h420" -> bullet + indent
    - "1./0/f420" -> numbered + indent
    """
    parts = list_str.split('/')
    
    list_part = parts[0]
    indent_left = None
    indent_special = None
    
    if len(parts) >= 3:
        indent_left = int(parts[1])
        sp = parts[2]
        if sp.startswith('h'):
            indent_special = ("hanging", int(sp[1:]))
        elif sp.startswith('f'):
            indent_special = ("firstLine", int(sp[1:]))
        elif sp != "0":
            # A bare number is treated as a hanging indent
            indent_special = ("hanging", int(sp))
    
    # Split off an optional ":<level>" suffix. Only a purely numeric suffix
    # counts as a level, so numbering text that itself contains ":" is kept whole.
    primary, sep, level = list_part.rpartition(':')
    if sep and level.isdigit():
        ilvl = int(level)
    else:
        primary, ilvl = list_part, 0
    
    if primary in BULLET_CHARS:
        result = {
            "type": "bullet",
            "ilvl": ilvl,
            "bullet": primary,
        }
    else:
        result = {
            "type": "numbered",
            "ilvl": ilvl,
            "numText": primary,
        }
    
    if indent_left is not None or indent_special is not None:
        result["indent"] = {}
        # Keep an explicit left indent of 0, as in {"left": 0, "firstLine": 420}
        if indent_left is not None:
            result["indent"]["left"] = indent_left
        if indent_special:
            result["indent"][indent_special[0]] = indent_special[1]
    
    return result


def _apply_list_to_paragraph(para, list_str, has_indent: bool = False):
    """Apply list information (compact string or parsed dict) to the paragraph XML."""
    info = _parse_list_string(list_str) if isinstance(list_str, str) else list_str
    
    pPr = para._element.find(qn('w:pPr'))
    if pPr is None:
        pPr = OxmlElement('w:pPr')
        para._element.insert(0, pPr)
    
    list_type = info.get("type", "bullet")
    ilvl = info.get("ilvl", 0)
    indent_from_list = info.get("indent", None)
    
    if list_type == "bullet":
        bullet_name = info.get("bullet", "disc")
        bullet_char = _get_bullet_char(bullet_name)
        
        if not has_indent and indent_from_list:
            indent = pPr.find(qn('w:ind'))
            if indent is None:
                indent = OxmlElement('w:ind')
                pPr.append(indent)
            
            if "left" in indent_from_list:
                indent.set(qn('w:left'), str(indent_from_list["left"]))
            if "hanging" in indent_from_list:
                indent.set(qn('w:hanging'), str(indent_from_list["hanging"]))
                if indent.get(qn('w:firstLine')) is not None:
                    del indent.attrib[qn('w:firstLine')]
            elif "firstLine" in indent_from_list:
                indent.set(qn('w:firstLine'), str(indent_from_list["firstLine"]))
                if indent.get(qn('w:hanging')) is not None:
                    del indent.attrib[qn('w:hanging')]
        elif not has_indent:
            indent = pPr.find(qn('w:ind'))
            if indent is None:
                indent = OxmlElement('w:ind')
                pPr.append(indent)
            
            left = 420 + ilvl * 420
            indent.set(qn('w:left'), str(left))
            indent.set(qn('w:hanging'), str(420))
        
        prefix_run = para.add_run(bullet_char + " ")
        rPr = prefix_run._element.find(qn('w:rPr'))
        if rPr is None:
            rPr = OxmlElement('w:rPr')
            prefix_run._element.insert(0, rPr)
        rFonts = rPr.find(qn('w:rFonts'))
        if rFonts is None:
            rFonts = OxmlElement('w:rFonts')
            rPr.insert(0, rFonts)
        rFonts.set(qn('w:ascii'), 'Arial Unicode MS')
        rFonts.set(qn('w:hAnsi'), 'Arial Unicode MS')
        rFonts.set(qn('w:eastAsia'), 'SimSun')
        rFonts.set(qn('w:cs'), 'Arial Unicode MS')
    
    elif list_type == "numbered":
        num_text = info.get("numText", "")
        
        if not has_indent and indent_from_list:
            indent = pPr.find(qn('w:ind'))
            if indent is None:
                indent = OxmlElement('w:ind')
                pPr.append(indent)
            
            if "left" in indent_from_list:
                indent.set(qn('w:left'), str(indent_from_list["left"]))
            if "hanging" in indent_from_list:
                indent.set(qn('w:hanging'), str(indent_from_list["hanging"]))
                if indent.get(qn('w:firstLine')) is not None:
                    del indent.attrib[qn('w:firstLine')]
            elif "firstLine" in indent_from_list:
                indent.set(qn('w:firstLine'), str(indent_from_list["firstLine"]))
                if indent.get(qn('w:hanging')) is not None:
                    del indent.attrib[qn('w:hanging')]
        elif not has_indent:
            indent = pPr.find(qn('w:ind'))
            if indent is None:
                indent = OxmlElement('w:ind')
                pPr.append(indent)
            
            left = 420 + ilvl * 420
            indent.set(qn('w:left'), str(left))
            indent.set(qn('w:firstLine'), str(0))
        
        if num_text:
            # The number text is emitted as a plain leading run
            para.add_run(num_text + " ")


def _apply_indent_to_paragraph(para, indent_info: dict):
    """Apply indent information to the paragraph XML."""
    pPr = para._element.find(qn('w:pPr'))
    if pPr is None:
        pPr = OxmlElement('w:pPr')
        para._element.insert(0, pPr)
    
    indent = pPr.find(qn('w:ind'))
    if indent is None:
        indent = OxmlElement('w:ind')
        pPr.append(indent)
    
    if "left" in indent_info:
        indent.set(qn('w:left'), str(indent_info["left"]))
    if "hanging" in indent_info:
        indent.set(qn('w:hanging'), str(indent_info["hanging"]))
        if indent.get(qn('w:firstLine')) is not None:
            del indent.attrib[qn('w:firstLine')]
    elif "firstLine" in indent_info:
        indent.set(qn('w:firstLine'), str(indent_info["firstLine"]))
        if indent.get(qn('w:hanging')) is not None:
            del indent.attrib[qn('w:hanging')]


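def analyze_template_structure(para_list: list, placeholder_map: dict) -> set:
    """
    Analyze the template structure and decide which paragraphs to delete.
    
    Internally tracks the paragraph range of each function point
    ({func_num: (start_idx, end_idx)}) and of each sub-function
    ({(func_num, sub_num): (start_idx, end_idx)}) to decide what to drop.
    
    Returns:
        The set of paragraph indices (into para_list) that should be removed.
    """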
    paras_to_remove = set()
    func_ranges = {}
    sub_ranges = {}
    
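    # Record per-function flags. Collect them from the map itself rather than
    # assuming at most 10 functions, so a larger --max-funcs keeps working.
    func_has_scene = {}
    func_has_subs = {}
    for key, flag in placeholder_map.items():
        flag_match = re.match(r'func_(\d+)_has_(scene|subs)$', key)
        if not flag_match:
            continue
        target = func_has_scene if flag_match.group(2) == "scene" else func_has_subs
        target[int(flag_match.group(1))] = flag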
    
    # First pass: record where each placeholder paragraph sits
    placeholder_positions = {}  # {placeholder_name: paragraph_index}
    for i, para in enumerate(para_list):
        text = para.text
        placeholders = find_placeholder(text)
        for ph in placeholders:
            placeholder_positions[ph] = i
    
    # Second pass: determine the ranges of function points and sub-functions
    current_func = None
    current_sub = None
    func_start = None
    sub_start = None
    
    for i, para in enumerate(para_list):
        text = para.text
        placeholders = find_placeholder(text)
        style_name = para.style.name if para.style else "Normal"
        
        if "Heading 1" in style_name or "标题 1" in style_name:
            if current_func and func_start is not None:
                func_ranges[current_func] = (func_start, i - 1)
            if current_func and current_sub and sub_start is not None:
                sub_ranges[(current_func, current_sub)] = (sub_start, i - 1)
            current_func = None
            current_sub = None
            func_start = None
            sub_start = None
        
        for ph in placeholders:
            if ph in ["new_tables", "modify_tables", "data_compatibility", "func_tables"]:
                if current_func and func_start is not None:
                    func_ranges[current_func] = (func_start, i - 1)
                if current_func and current_sub and sub_start is not None:
                    sub_ranges[(current_func, current_sub)] = (sub_start, i - 1)
                current_func = None
                current_sub = None
                func_start = None
                sub_start = None
                continue
            
            match = re.match(r'func_(\d+)_name$', ph)
            if match:
                if current_func and func_start is not None:
                    func_ranges[current_func] = (func_start, i - 1)
                
                current_func = int(match.group(1))
                func_start = i
                current_sub = None
                sub_start = None
            
            match = re.match(r'func_(\d+)_sub_(\d+)_name$', ph)
            if match:
                if current_func and current_sub and sub_start is not None:
                    sub_ranges[(current_func, current_sub)] = (sub_start, i - 1)
                
                current_sub = int(match.group(2))
                sub_start = i
    
    if current_func and func_start is not None:
        func_ranges[current_func] = (func_start, len(para_list) - 1)
    if current_func and current_sub and sub_start is not None:
        sub_ranges[(current_func, current_sub)] = (sub_start, len(para_list) - 1)
    
    # Third pass: mark the paragraphs that need to be deleted
    for i, para in enumerate(para_list):
        text = para.text
        placeholders = find_placeholder(text)
        style_name = para.style.name if para.style else "Normal"
        
        # Check for placeholders whose value is empty
        for ph in placeholders:
            value = placeholder_map.get(ph, "")
            
            # Empty function-point name: drop the whole function range
            func_match = re.match(r'func_(\d+)_name$', ph)
            if func_match and not value:
                func_num = int(func_match.group(1))
                if func_num in func_ranges:
                    start, end = func_ranges[func_num]
                    for idx in range(start, end + 1):
                        paras_to_remove.add(idx)
            
            # Empty sub-function name: drop the whole sub-function range
            elif re.match(r'func_(\d+)_sub_(\d+)_name$', ph) and not value:
                match = re.match(r'func_(\d+)_sub_(\d+)_name$', ph)
                func_num = int(match.group(1))
                sub_num = int(match.group(2))
                if (func_num, sub_num) in sub_ranges:
                    start, end = sub_ranges[(func_num, sub_num)]
                    for idx in range(start, end + 1):
                        paras_to_remove.add(idx)
            
            # Other empty placeholders: drop only the paragraph itself
            elif not value and ph in ["new_tables", "modify_tables", "data_compatibility", "func_tables"]:
                paras_to_remove.add(i)
    
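    # Fourth pass: drop section headings based on the per-function flags.
    # Find the "场景描述" (scenario description) and "实现方案" (implementation
    # plan) headings and decide from the flags whether to delete them.
    for i, para in enumerate(para_list):
        text = para.text.strip()
        style_name = para.style.name if para.style else "Normal"
        
        # Only heading-styled paragraphs are candidates ("标题" is the
        # localized style name, matching the check in the second pass)
        if "Heading" not in style_name and "标题" not in style_name:
            continue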
        
        # Look at the nearest placeholder after this heading to find the owning function point
        for j in range(i + 1, min(i + 10, len(para_list))):
            next_text = para_list[j].text
            next_placeholders = find_placeholder(next_text)
            
            for ph in next_placeholders:
                # Scenario-description placeholders
                if ph.startswith("func_") and "_scene" in ph:
                    match = re.match(r'func_(\d+)_scene', ph)
                    if match:
                        func_num = int(match.group(1))
                        if not func_has_scene.get(func_num, False):
                            paras_to_remove.add(i)
                        break
                
                # Implementation-plan (sub-function) placeholders
                elif ph.startswith("func_") and "_sub_" in ph and "_name" in ph:
                    match = re.match(r'func_(\d+)_sub', ph)
                    if match:
                        func_num = int(match.group(1))
                        if not func_has_subs.get(func_num, False):
                            paras_to_remove.add(i)
                        break
            
            if next_placeholders:
                break
    
    return paras_to_remove


def render_template(template_path: str, placeholder_map: dict, context_dir: str, output_path: str):
    """Render the template: fill placeholders, insert tables and images, and save the result."""
    doc = Document(template_path)
    
    table_count = 0
    img_count = 0
    
    para_list = list(doc.paragraphs)
    insertions = {}
    
    # Analyze the template structure to find the paragraphs that must be deleted
    paras_to_remove = analyze_template_structure(para_list, placeholder_map)
    
    # Render the placeholders
    for i, para in enumerate(para_list):
        if i in paras_to_remove:
            continue
        
        text = para.text
        placeholders = find_placeholder(text)
        
        if not placeholders:
            continue
        
        for ph in placeholders:
            if ph not in placeholder_map:
                continue
            
            value = placeholder_map[ph]
            
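            # Save the paragraph style and run style before clearing
            saved_style = para.style.name if para.style else None
            saved_run_style = get_run_style(para)
            
            para.clear()
            
            if saved_style:
                try:
                    para.style = saved_style
                except Exception:
                    # Best effort: keep the current style if it cannot be re-applied
                    pass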
            
            if isinstance(value, str):
                if value:
                    run = para.add_run(value)
                    apply_run_style(run, saved_run_style)
                else:
                    paras_to_remove.add(i)
            
            elif isinstance(value, list) and value:
                elements_to_insert = []
                first_text_done = False
                
                for block in value:
                    block_type = block.get("t", "text")
                    
                    if block_type == "text":
                        if not first_text_done:
                            render_rich_text(para, block, saved_run_style)
                            first_text_done = True
                        else:
                            new_p = OxmlElement('w:p')
                            pPr = OxmlElement('w:pPr')
                            new_p.append(pPr)
                            
                            from docx.text.paragraph import Paragraph
                            new_para = Paragraph(new_p, doc)
                            render_rich_text(new_para, block, saved_run_style)
                            
                            elements_to_insert.append(new_p)
                    
                    elif block_type == "table":
                        tbl = create_table_element(block)
                        elements_to_insert.append(tbl)
                        table_count += 1
                    
                    elif block_type == "img":
                        img_rel_path = block.get("p", "")
                        img_path = os.path.normpath(os.path.join(context_dir, img_rel_path))
                        width = block.get("w", 200)
                        
                        if os.path.exists(img_path):
                            try:
                                new_p = OxmlElement('w:p')
                                pPr = OxmlElement('w:pPr')
                                new_p.append(pPr)
                                
                                from docx.text.paragraph import Paragraph
                                temp_para = Paragraph(new_p, doc)
                                run = temp_para.add_run()
                                run.add_picture(img_path, width=Pt(width))
                                
                                elements_to_insert.append(new_p)
                                img_count += 1
                            except Exception as e:
                                print(f"图片加载失败: {e}")
                
                if not first_text_done:
                    para.add_run(" ")
                
                if elements_to_insert:
                    insertions[i] = elements_to_insert
            
            break
    
    # Insert the queued elements (extra paragraphs, tables, images) after their anchor paragraphs
    for para_idx in sorted(insertions.keys(), reverse=True):
        para = para_list[para_idx]
        para_element = para._element
        parent = para_element.getparent()
        idx = list(parent).index(para_element)
        
        for element in reversed(insertions[para_idx]):
            parent.insert(idx + 1, element)
    
    # Delete the marked paragraphs
    for idx in sorted(paras_to_remove, reverse=True):
        if idx < len(para_list):
            para = para_list[idx]
            para_element = para._element
            parent = para_element.getparent()
            if parent is not None:
                parent.remove(para_element)
    
    # Delete every paragraph that still contains a placeholder
    for para in list(doc.paragraphs):
        text = para.text
        if find_placeholder(text):
            para_element = para._element
            parent = para_element.getparent()
            if parent is not None:
                parent.remove(para_element)
    
    # Note: empty paragraphs are no longer removed, so blank lines in the document are preserved
    
    doc.save(output_path)
    
    print(f"渲染完成: {output_path}")
    print(f"删除了 {len(paras_to_remove)} 个段落")
    print(f"渲染了 {table_count} 个表格")
    print(f"渲染了 {img_count} 张图片")


def render(json_path: str, template_path: str, output_path: str, max_funcs: int = 10, max_subs: int = 10):
    """Main entry point: load the JSON data, build the placeholder map, and render the template."""
    context_dir = os.path.dirname(json_path)
    
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    placeholder_map = build_placeholder_map(data, max_funcs, max_subs)
    
    render_template(template_path, placeholder_map, context_dir, output_path)


if __name__ == "__main__":
    import argparse
    
    skill_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    default_template = os.path.join(skill_dir, "templates", "custom_template.docx")
    default_json = os.path.join(skill_dir, "parsed_output", "converted_document.json")
    default_output = os.path.join(skill_dir, "parsed_output", "client_document.docx")

    parser = argparse.ArgumentParser(description='将JSON数据渲染到Word模板')
    parser.add_argument('json_path', nargs='?', default=default_json, help=f'JSON数据文件路径(默认: {default_json})')
    parser.add_argument('-t', '--template', default=default_template, help=f'Word模板文件路径(默认: {default_template})')
    parser.add_argument('-o', '--output', default=default_output, help=f'输出Word文件路径(默认: {default_output})')
    parser.add_argument('--max-funcs', type=int, default=10, help='最大功能点数量(默认: 10)')
    parser.add_argument('--max-subs', type=int, default=10, help='每个功能点最大子功能数量(默认: 10)')

    args = parser.parse_args()

    render(args.json_path, args.template, args.output, args.max_funcs, args.max_subs)
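
The compact list-string format consumed by `_parse_list_string` above can be exercised in isolation. The sketch below is a simplified, standalone restatement of those rules, not the renderer's own code: `parse_compact_list` and the `BULLET_NAMES` set are hypothetical stand-ins, since the real `BULLET_CHARS` table is defined elsewhere in `renderer.py`.

```python
# Hypothetical stand-in for the keys of the renderer's BULLET_CHARS (assumption).
BULLET_NAMES = {"disc", "triangle", "square", "diamond", "check", "star", "arrow"}


def parse_compact_list(list_str: str) -> dict:
    """Parse strings such as "disc:1" or "1./0/f420" into a dict."""
    parts = list_str.split("/")
    # "marker:level" -> marker plus optional nesting level
    marker, _, level = parts[0].partition(":")
    ilvl = int(level) if level else 0

    if marker in BULLET_NAMES:
        result = {"type": "bullet", "ilvl": ilvl, "bullet": marker}
    else:
        result = {"type": "numbered", "ilvl": ilvl, "numText": marker}

    # Optional "/left/special" suffix carries the indent
    if len(parts) >= 3:
        indent = {}
        left = int(parts[1])
        if left:
            indent["left"] = left
        special = parts[2]
        if special.startswith("f"):
            indent["firstLine"] = int(special[1:])
        elif special.startswith("h"):
            indent["hanging"] = int(special[1:])
        elif special != "0":
            # A bare number is treated as a hanging indent
            indent["hanging"] = int(special)
        result["indent"] = indent
    return result


print(parse_compact_list("disc/420/h420"))
print(parse_compact_list("1.:1"))
```

Keeping the format this compact keeps the intermediate JSON small, at the cost that a number text containing `/` or `:` cannot be represented without escaping.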

scripts/requirements.txt

# Python version requirement: >=3.11
python-docx>=0.8.12
Pillow>=9.0.0
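
Before running the scripts, it may help to confirm that the interpreter and both packages are available. The following is a minimal sketch of such a check; `check_environment` is a hypothetical helper and not part of the Skill. It relies only on the fact that python-docx is imported as `docx` and Pillow as `PIL`.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 11), modules=("docx", "PIL")):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_version:
        wanted = ".".join(str(n) for n in min_version)
        problems.append(f"Python {wanted}+ required, found {sys.version.split()[0]}")
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

The check only tests for presence; it does not compare installed versions against the minimums listed in `requirements.txt`.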

scripts/supplier_parser.py

# -*- coding: utf-8 -*-
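"""
Supplier document parser v3
Parses a supplier .docx file into JSON.

Extraction rules (heading level + exact heading-text match):
- 1.1 原始需求 (original requirements, Heading 2) -> sections.original_req (plain text)
- 1.2 需求澄清 (requirement clarification, Heading 2) -> sections.req_clarify (plain text)
- 2.1 需求分析 (requirement analysis, Heading 2) -> sections.req_analysis (plain text)
- 3.2.3 方案概述 (solution overview, Heading 3) -> sections.solution_overview (plain text)
- 6.1 新增表 (new tables, Heading 2) -> sections.new_tables (rich-text content blocks)
- 6.2 修改表 (modified tables, Heading 2) -> sections.modify_tables (rich-text content blocks)
- 6.3 数据兼容性要求 (data compatibility requirements, Heading 2) -> sections.data_compatibility (rich-text content blocks)
- 4.x 功能点 (function points, Heading 2) -> functions[] (nested structure)

List support:
- Bullets: disc, triangle, square, diamond, check mark, star, and others
- Numbered lists: Arabic numerals, Chinese numerals in parentheses, circled numbers, multi-level numbering, and others
- Text indentation: hanging indent (hanging) and first-line indent (firstLine)
"""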

import os
import re
import json
import hashlib
from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import Table
from docx.text.paragraph import Paragraph
from docx.oxml.ns import qn
from PIL import Image
import io


BULLET_MAP = {
    "\u2022": "disc",
    "\u2024": "disc",
    "\u2219": "disc",
    "\u00b7": "disc",
    "\u2981": "disc",
    "\u25e6": "disc",
    "\u26ac": "disc",
    "\u25cf": "disc",
    "\u26ab": "disc",
    "\u2b24": "disc",
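    # U+25AA / U+25AB (small squares) are listed under "square" below; repeating
    # them here would be overridden by the later duplicate keys.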
    "\uf06c": "disc",
    "\uf0b7": "disc",
    "\uf0a2": "disc",

    "\u25b2": "triangle",
    "\u25b3": "triangle",
    "\u25b4": "triangle",
    "\u25b5": "triangle",
    "\u25b6": "triangle",
    "\u25b7": "triangle",
    "\u25b8": "triangle",
    "\u25b9": "triangle",
    "\u25ba": "triangle",
    "\u25bb": "triangle",
    "\u25bc": "triangle",
    "\u25bd": "triangle",
    "\u25be": "triangle",
    "\u25bf": "triangle",
    "\u25c0": "triangle",
    "\u25c1": "triangle",
    "\u25c2": "triangle",
    "\u25c3": "triangle",
    "\u25c4": "triangle",
    "\u25c5": "triangle",
    "\uf0d8": "triangle",
    "\uf0a7": "triangle",

    "\u25a0": "square",
    "\u25a1": "square",
    "\u25a2": "square",
    "\u25a3": "square",
    "\u25a4": "square",
    "\u25a5": "square",
    "\u25a6": "square",
    "\u25a7": "square",
    "\u25a8": "square",
    "\u25a9": "square",
    "\u25aa": "square",
    "\u25ab": "square",
    "\u25ac": "square",
    "\u25ad": "square",
    "\u25ae": "square",
    "\u25af": "square",
    "\u25fb": "square",
    "\u25fc": "square",
    "\u2b1c": "square",
    "\u2b1b": "square",
    "\uf06e": "square",
    "\uf0a8": "square",
    "\uf0a3": "square",

    "\u25c6": "diamond",
    "\u25c7": "diamond",
    "\u25c8": "diamond",
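    # U+25C9 and U+25CB..U+25CE are circle glyphs, so they map to "disc".
    # U+25CF (black circle) is mapped to "disc" above and must not be repeated
    # here, since a later duplicate key would override that entry.
    "\u25c9": "disc",
    "\u25ca": "diamond",
    "\u25cb": "disc",
    "\u25cc": "disc",
    "\u25cd": "disc",
    "\u25ce": "disc",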
    "\u2726": "diamond",
    "\u2727": "diamond",
    "\u2756": "diamond",
    "\u2757": "diamond",
    "\u2b29": "diamond",
    "\u2b2a": "diamond",
    "\u2b2b": "diamond",
    "\u2b2c": "diamond",
    "\u2b2d": "diamond",
    "\u2b2e": "diamond",
    "\u2b2f": "diamond",
    "\uf075": "diamond",
    "\uf0a4": "diamond",
    "\uf0b5": "diamond",

    "\u2713": "check",
    "\u2714": "check",
    "\u2611": "check",
    "\u2610": "check",
    "\u2705": "check",
    "\u2706": "check",
    "\u2718": "check",
    "\u2716": "check",
    "\u2717": "check",
    "\u2612": "check",
    "\u274e": "check",
    "\u2752": "check",
    "\uf0fc": "check",
    "\uf050": "check",

    "\u2605": "star",
    "\u2606": "star",
    "\u2728": "star",
    "\u2729": "star",
    "\u272a": "star",
    "\u272b": "star",
    "\u272c": "star",
    "\u272d": "star",
    "\u272e": "star",
    "\u272f": "star",
    "\u2730": "star",
    "\u2b50": "star",
    "\u2b51": "star",
    "\u2b52": "star",
    "\u2731": "star",
    "\u2732": "star",
    "\uf0b2": "star",
    "\uf0a5": "star",

    "\u2190": "arrow",
    "\u2191": "arrow",
    "\u2192": "arrow",
    "\u2193": "arrow",
    "\u2194": "arrow",
    "\u2195": "arrow",
    "\u2196": "arrow",
    "\u2197": "arrow",
    "\u2198": "arrow",
    "\u2199": "arrow",
    "\u219a": "arrow",
    "\u219b": "arrow",
    "\u219c": "arrow",
    "\u219d": "arrow",
    "\u219e": "arrow",
    "\u219f": "arrow",
    "\u21a0": "arrow",
    "\u21a1": "arrow",
    "\u21a2": "arrow",
    "\u21a3": "arrow",
    "\u21a4": "arrow",
    "\u21a5": "arrow",
    "\u21a6": "arrow",
    "\u21a7": "arrow",
    "\u21a8": "arrow",
    "\u21a9": "arrow",
    "\u21aa": "arrow",
    "\u21ab": "arrow",
    "\u21ac": "arrow",
    "\u21ad": "arrow",
    "\u21ae": "arrow",
    "\u21af": "arrow",
    "\u21b0": "arrow",
    "\u21b1": "arrow",
    "\u21b2": "arrow",
    "\u21b3": "arrow",
    "\u21b4": "arrow",
    "\u21b5": "arrow",
    "\u21b6": "arrow",
    "\u21b7": "arrow",
    "\u21b8": "arrow",
    "\u21b9": "arrow",
    "\u21ba": "arrow",
    "\u21bb": "arrow",
    "\u21bc": "arrow",
    "\u21bd": "arrow",
    "\u21be": "arrow",
    "\u21bf": "arrow",
    "\u21c0": "arrow",
    "\u21c1": "arrow",
    "\u21c2": "arrow",
    "\u21c3": "arrow",
    "\u21c4": "arrow",
    "\u21c5": "arrow",
    "\u21c6": "arrow",
    "\u21c7": "arrow",
    "\u21c8": "arrow",
    "\u21c9": "arrow",
    "\u21ca": "arrow",
    "\u21cb": "arrow",
    "\u21cc": "arrow",
    "\u21cd": "arrow",
    "\u21ce": "arrow",
    "\u21cf": "arrow",
    "\u21d0": "arrow",
    "\u21d1": "arrow",
    "\u21d2": "arrow",
    "\u21d3": "arrow",
    "\u21d4": "arrow",
    "\u21d5": "arrow",
    "\u21d6": "arrow",
    "\u21d7": "arrow",
    "\u21d8": "arrow",
    "\u21d9": "arrow",
    "\u21da": "arrow",
    "\u21db": "arrow",
    "\u21dc": "arrow",
    "\u21dd": "arrow",
    "\u21de": "arrow",
    "\u21df": "arrow",
    "\u21e0": "arrow",
    "\u21e1": "arrow",
    "\u21e2": "arrow",
    "\u21e3": "arrow",
    "\u21e4": "arrow",
    "\u21e5": "arrow",
    "\u21e6": "arrow",
    "\u21e7": "arrow",
    "\u21e8": "arrow",
    "\u21e9": "arrow",
    "\u21ea": "arrow",
    "\u21eb": "arrow",
    "\u21ec": "arrow",
    "\u21ed": "arrow",
    "\u21ee": "arrow",
    "\u21ef": "arrow",
    "\u21f0": "arrow",
    "\u21f1": "arrow",
    "\u21f2": "arrow",
    "\u21f3": "arrow",
    "\u21f4": "arrow",
    "\u21f5": "arrow",
    "\u21f6": "arrow",
    "\u21f7": "arrow",
    "\u21f8": "arrow",
    "\u21f9": "arrow",
    "\u21fa": "arrow",
    "\u21fb": "arrow",
    "\u21fc": "arrow",
    "\u21fd": "arrow",
    "\u21fe": "arrow",
    "\u21ff": "arrow",
    "\u27a1": "arrow",
    "\u2b05": "arrow",
    "\u2b06": "arrow",
    "\u2b07": "arrow",
    "\u2b08": "arrow",
    "\u2b09": "arrow",
    "\u2b0a": "arrow",
    "\u2b0b": "arrow",
    "\u2b0c": "arrow",
    "\u2b0d": "arrow",
    "\u2b0e": "arrow",
    "\u2b0f": "arrow",
    "\u2b10": "arrow",
    "\u2b11": "arrow",
    "\u2b95": "arrow",
    "\u2b96": "arrow",
    "\u2b97": "arrow",
    "\u2b98": "arrow",
    "\u2b99": "arrow",
    "\u2b9a": "arrow",
    "\u2b9b": "arrow",
    "\u2b9c": "arrow",
    "\u2b9d": "arrow",
    "\u2b9e": "arrow",
    "\u2b9f": "arrow",
}

NUM_FMT_MAP = {
    "decimal": "decimal",
    "chineseCounting": "chineseCounting",
    "decimalEnclosedCircleChinese": "enclosedCircle",
    "decimalEnclosedCircle": "enclosedCircle",
    "lowerLetter": "lowerLetter",
    "upperLetter": "upperLetter",
    "lowerRoman": "lowerRoman",
    "upperRoman": "upperRoman",
    "ideographTraditional": "ideographTraditional",
    "ideographEnclosedCircle": "ideographEnclosedCircle",
    "bullet": "bullet",
}


def _resolve_numbering(doc, num_id, ilvl):
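    """
    Resolve a numbering definition from the document's numbering.xml.
    
    Returns:
        dict with keys type and ilvl, plus bullet (bullet lists) or
        numFmt and lvlText (numbered lists), and indent_left /
        indent_hanging / indent_firstLine when the level defines them;
        None if the definition is not found
    """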
    if not num_id or num_id == "0":
        return None
    
    try:
        numbering_part = doc.part.numbering_part
        numbering_xml = numbering_part._element
        
        num_elem = None
        for n in numbering_xml.findall(qn('w:num')):
            if n.get(qn('w:numId')) == str(num_id):
                num_elem = n
                break
        
        if num_elem is None:
            return None
        
        abstract_num_id = None
        abs_ref = num_elem.find(qn('w:abstractNumId'))
        if abs_ref is not None:
            abstract_num_id = abs_ref.get(qn('w:val'))
        
        if abstract_num_id is None:
            return None
        
        abstract_num = None
        for an in numbering_xml.findall(qn('w:abstractNum')):
            if an.get(qn('w:abstractNumId')) == str(abstract_num_id):
                abstract_num = an
                break
        
        if abstract_num is None:
            return None
        
        for lvl in abstract_num.findall(qn('w:lvl')):
            if lvl.get(qn('w:ilvl')) == str(ilvl):
                numFmt_elem = lvl.find(qn('w:numFmt'))
                lvlText_elem = lvl.find(qn('w:lvlText'))
                
                num_fmt = ""
                if numFmt_elem is not None:
                    raw_fmt = numFmt_elem.get(qn('w:val'), "")
                    num_fmt = NUM_FMT_MAP.get(raw_fmt, raw_fmt)
                
                lvl_text = ""
                if lvlText_elem is not None:
                    lvl_text = lvlText_elem.get(qn('w:val'), "")
                
                result = {
                    "type": "bullet" if num_fmt == "bullet" else "numbered",
                    "ilvl": ilvl,
                }
                
                pPr = lvl.find(qn('w:pPr'))
                if pPr is not None:
                    ind = pPr.find(qn('w:ind'))
                    if ind is not None:
                        left_v = ind.get(qn('w:left'))
                        hanging_v = ind.get(qn('w:hanging'))
                        firstLine_v = ind.get(qn('w:firstLine'))
                        if left_v is not None:
                            result["indent_left"] = int(left_v)
                        if hanging_v is not None:
                            result["indent_hanging"] = int(hanging_v)
                        elif firstLine_v is not None:
                            result["indent_firstLine"] = int(firstLine_v)
                
                if num_fmt == "bullet":
                    bullet_name = BULLET_MAP.get(lvl_text, None)
                    if bullet_name:
                        result["bullet"] = bullet_name
                    else:
                        result["bullet"] = "disc"
                else:
                    result["numFmt"] = num_fmt
                    result["lvlText"] = lvl_text
                
                return result
    except Exception:
        pass
    
    return None


def _compute_num_text(num_fmt, lvl_text, counters):
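    """
    Compute the actual number text from the number format and the level-text template.
    
    The same num_fmt is applied to every %N placeholder in the template.
    
    Args:
        num_fmt: number format (decimal, chineseCounting, enclosedCircle, etc.)
        lvl_text: level-text template (e.g. "%1.", "(%1)", "%1、")
        counters: counter value for each level [c0, c1, c2, ...]
    
    Returns:
        The number text string (e.g. "1.", "(一)", "一、")
    """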
    def format_counter(fmt, val):
        if fmt == "decimal":
            return str(val)
        elif fmt == "chineseCounting":
            cn = ["零", "一", "二", "三", "四", "五", "六", "七", "八", "九", "十",
                  "十一", "十二", "十三", "十四", "十五", "十六", "十七", "十八", "十九", "二十"]
            if 0 < val <= 20:
                return cn[val]
            return str(val)
        elif fmt == "enclosedCircle":
            circles = "①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳"
            if 0 < val <= 20:
                return circles[val - 1]
            return str(val)
        elif fmt == "lowerLetter":
            if 0 < val <= 26:
                return chr(ord('a') + val - 1)
            return str(val)
        elif fmt == "upperLetter":
            if 0 < val <= 26:
                return chr(ord('A') + val - 1)
            return str(val)
        elif fmt == "lowerRoman":
            roman_map = [(1000,'m'),(900,'cm'),(500,'d'),(400,'cd'),
                         (100,'c'),(90,'xc'),(50,'l'),(40,'xl'),
                         (10,'x'),(9,'ix'),(5,'v'),(4,'iv'),(1,'i')]
            result = ''
            for arabic, roman in roman_map:
                while val >= arabic:
                    result += roman
                    val -= arabic
            return result or str(val)
        elif fmt == "upperRoman":
            roman_map = [(1000,'M'),(900,'CM'),(500,'D'),(400,'CD'),
                         (100,'C'),(90,'XC'),(50,'L'),(40,'XL'),
                         (10,'X'),(9,'IX'),(5,'V'),(4,'IV'),(1,'I')]
            result = ''
            for arabic, roman in roman_map:
                while val >= arabic:
                    result += roman
                    val -= arabic
            return result or str(val)
        else:
            return str(val)
    
    if not lvl_text:
        return ""
    
    result = lvl_text
    for i in range(len(counters) - 1, -1, -1):
        placeholder = f"%{i + 1}"
        if placeholder in result:
            result = result.replace(placeholder, format_counter(num_fmt, counters[i]))
    
    return result


def _load_parse_filter(config_path: str = None) -> dict:
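    r"""
    Load the parse-filter configuration (JSONC is supported, i.e. JSON with // line comments).
    
    Config file format (parse_filter.json):
    {
        // explanatory comment
        "exclude_styles": ["toc 1", ...],       // filter by style name
        "exclude_texts": ["请在此处填写", ...],   // filter by text content (multi-line via \n)
        "exclude_text_patterns": ["^\\d+\\.", ...] // filter by regular expression
    }
    """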
    if config_path is None:
        config_path = os.path.join(os.path.dirname(__file__), "parse_filter.json")
    
    default_filter = {
        "exclude_styles": [],
        "exclude_texts": [],
        "exclude_text_patterns": []
    }
    
    if not os.path.exists(config_path):
        return default_filter
    
    try:
        with open(config_path, 'r', encoding='utf-8') as f:
            raw = f.read()
        
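        # Strip // comments line by line; a // inside a string literal is kept.
        cleaned_lines = []
        for line in raw.split('\n'):
            stripped = line.lstrip()
            if stripped.startswith('//'):
                continue  # whole-line comment
            in_string = False
            escape = False
            result = []
            for ch in line:
                if escape:
                    # The character after a backslash is copied verbatim.
                    result.append(ch)
                    escape = False
                    continue
                if ch == '\\':
                    result.append(ch)
                    escape = True
                    continue
                if ch == '"':
                    # An unescaped quote opens or closes a string literal.
                    in_string = not in_string
                    result.append(ch)
                    continue
                if not in_string and ch == '/' and result and result[-1] == '/':
                    # Second slash of a trailing comment: drop the first one and stop.
                    result.pop()
                    break
                result.append(ch)
            cleaned_lines.append(''.join(result))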
        
        cleaned = '\n'.join(cleaned_lines)
        config = json.loads(cleaned)
        
        for key in default_filter:
            if key not in config:
                config[key] = default_filter[key]
        
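        # "exclude_text_prefixes" is not a key this loader supports; discard it if present.
        config.pop("exclude_text_prefixes", None)

        # Pre-compile the regex filters once; invalid patterns are reported and skipped.
        config["_compiled_patterns"] = []
        for pattern in config.get("exclude_text_patterns", []):
            try:
                config["_compiled_patterns"].append(re.compile(pattern, re.DOTALL))
            except re.error as e:
                print(f"Skipping invalid exclude_text_patterns entry {pattern!r}: {e}")

        return config
    except Exception as e:
        # Fall back to "no filtering" instead of aborting the whole parse.
        print(f"Failed to load filter config {config_path}: {e}")
        return default_filter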


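def _should_filter_paragraph(para: Paragraph, parse_filter: dict) -> bool:
    r"""
    Decide whether a paragraph should be filtered out (excluded from JSON parsing).

    Rules, in priority order:
    1. Style match: the paragraph style is in the exclusion list (case-insensitive).
    2. Substring match: the paragraph text contains an excluded text
       (multi-line text is supported; \n stands for a line break).
    3. Regex match: the paragraph text matches an excluded pattern
       (compiled with re.DOTALL, so "." also matches newlines).

    Note: rich-text styling such as blue italics never triggers filtering on its
    own; only what the configuration names explicitly is filtered.
    """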
    if not parse_filter:
        return False
    
    style_name = para.style.name if para.style else ""
    
    for excluded_style in parse_filter.get("exclude_styles", []):
        if style_name.lower() == excluded_style.lower():
            return True
    
    text = para.text.strip()
    if not text:
        return False
    
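    for excluded_text in parse_filter.get("exclude_texts", []):
        # Accept both a real newline and a literal backslash-n in the config.
        normalized = excluded_text.replace('\\n', '\n')
        # An empty entry would match every paragraph, so skip it.
        if normalized and normalized in text:
            return True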
    
    for pattern in parse_filter.get("_compiled_patterns", []):
        if pattern.search(text):
            return True
    
    return False


def iter_block_items(parent):
    """Yield paragraphs and tables in document order."""
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    else:
        raise ValueError("parent must be Document")
    
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


def get_heading_level(para: Paragraph) -> int:
    """Return the heading level (1-4), or 0 for a non-heading paragraph."""
    style_name = para.style.name if para.style else ""
    if "Heading 1" in style_name or "标题 1" in style_name:
        return 1
    elif "Heading 2" in style_name or "标题 2" in style_name:
        return 2
    elif "Heading 3" in style_name or "标题 3" in style_name:
        return 3
    elif "Heading 4" in style_name or "标题 4" in style_name:
        return 4
    return 0


def extract_run_style(run) -> dict:
    """Extract a run's formatting as a compact style dict (keys: b, i, u, c, sz; sz is in half-points)."""
    style = {}
    
    if run.bold:
        style["b"] = True
    if run.italic:
        style["i"] = True
    if run.underline:
        style["u"] = True
    if run.font.color and run.font.color.rgb:
        color = str(run.font.color.rgb)
        if color != "000000":
            style["c"] = color
    if run.font.size:
        sz = int(run.font.size.pt * 2)
        if sz != 24:
            style["sz"] = sz
    
    return style


def _styles_equal(s1: dict, s2: dict) -> bool:
    """Return True if two style dicts are equal (None and an empty dict both mean 'no style')."""
    if not s1 and not s2:
        return True
    if not s1 or not s2:
        return False
    return s1 == s2


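def extract_text_with_format(para: Paragraph, doc=None, num_counters=None) -> dict | None:
    """
    Extract a paragraph's rich text (compact array format; consecutive runs
    with identical styles are merged automatically).

    Format:
    - runs: list of text fragments, each an array [text, style?]
    - array form: [text content, style object (optional)]
    - consecutive fragments with the same style are merged to save space
    - list: list information (list paragraphs only), compact string format
      - bullet: "disc" or "disc:1" or "disc/420/h420"
      - numbered: "1." or "1.:1" or "1./0/f420"
    - indent: indentation (present only when it differs from the default)
      - {"left":420, "hanging":420} or {"left":0, "firstLine":420}

    Example:
    {
      "t": "text",
      "runs": [["plain text"], ["bold blue", {"b": true, "c": "0000FF"}]],
      "list": "disc/420/h420",
      "indent": {"left": 420, "hanging": 420}
    }

    Returns None for a paragraph without visible text.
    """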
    if not para.text.strip():
        return None
    
    raw_items = []
    
    for run in para.runs:
        run_text = run.text
        if not run_text:
            continue
        style = extract_run_style(run)
        raw_items.append((run_text, style))
    
    if not raw_items:
        return None
    
    merged = []
    for text, style in raw_items:
        if merged and _styles_equal(merged[-1][1], style):
            merged[-1] = (merged[-1][0] + text, merged[-1][1])
        else:
            merged.append((text, style))
    
    runs = []
    for text, style in merged:
        if style:
            runs.append([text, style])
        else:
            runs.append([text])
    
    result = {"t": "text", "runs": runs}
    
    list_info = _extract_list_info(para, doc, num_counters)
    if list_info:
        result["list"] = list_info
    else:
        indent_info = _extract_indent_info(para)
        if indent_info:
            result["indent"] = indent_info
    
    return result


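def _extract_list_info(para: Paragraph, doc, num_counters) -> str | None:
    """
    Extract a paragraph's list information as a compact string.

    Format:
    - bullet: "disc" or "disc:1" (symbol name:ilvl; ilvl is omitted when 0)
    - numbered: "1." or "1.:1" (number text:ilvl; ilvl is omitted when 0)
    - with indentation: "disc/420/h420" (symbol/left/hanging) or "1./0/f420" (number text/left/firstLine)

    As written, the indentation part is always appended (at minimum "/0/0").
    Returns None when the paragraph is not a list item.
    """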
    pPr = para._element.find(qn('w:pPr'))
    if pPr is None:
        return None
    
    numPr = pPr.find(qn('w:numPr'))
    if numPr is None:
        return None
    
    numId_elem = numPr.find(qn('w:numId'))
    ilvl_elem = numPr.find(qn('w:ilvl'))
    
    if numId_elem is None:
        return None
    
    num_id = numId_elem.get(qn('w:val'))
    ilvl = int(ilvl_elem.get(qn('w:val'), '0')) if ilvl_elem is not None else 0
    
    if num_id == "0":
        return None
    
    style_name = para.style.name if para.style else ""
    if "Heading" in style_name or "标题" in style_name:
        return None
    
    if doc is None:
        return None
    
    num_def = _resolve_numbering(doc, num_id, ilvl)
    if num_def is None:
        return None
    
    ind = pPr.find(qn('w:ind'))
    indent_parts = []
    left_val = 0
    has_para_indent = False
    
    if ind is not None:
        left = ind.get(qn('w:left'))
        hanging = ind.get(qn('w:hanging'))
        first_line = ind.get(qn('w:firstLine'))
        
        if left is not None:
            left_val = int(left)
            has_para_indent = True
        if hanging is not None:
            indent_parts.append("h" + str(int(hanging)))
            has_para_indent = True
        elif first_line is not None:
            indent_parts.append("f" + str(int(first_line)))
            has_para_indent = True
    
    if not has_para_indent and "indent_left" in num_def:
        left_val = num_def["indent_left"]
        if "indent_hanging" in num_def:
            indent_parts.append("h" + str(num_def["indent_hanging"]))
        elif "indent_firstLine" in num_def:
            indent_parts.append("f" + str(num_def["indent_firstLine"]))
        else:
            indent_parts.append("0")
    
    if indent_parts:
        indent_parts.insert(0, str(left_val))
    elif left_val != 0:
        indent_parts = [str(left_val), "0"]
    else:
        indent_parts = ["0", "0"]
    
    if num_def["type"] == "bullet":
        bullet_name = num_def.get("bullet", "disc")
        parts = [bullet_name]
        if ilvl > 0:
            parts.append(str(ilvl))
        prefix = ":".join(parts)
    else:
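        if num_counters is not None:
            key = (num_id, ilvl)
            # Every item at this level restarts the deeper levels of the same
            # list. Resetting only when the key is first seen would let a
            # sub-item under "2." continue at "c." instead of restarting at "a.".
            for k in list(num_counters.keys()):
                if k[0] == num_id and k[1] > ilvl:
                    del num_counters[k]

            num_counters[key] = num_counters.get(key, 0) + 1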
            
            counters = []
            for lvl in range(ilvl + 1):
                counters.append(num_counters.get((num_id, lvl), 1))
            
            num_text = _compute_num_text(
                num_def.get("numFmt", "decimal"),
                num_def.get("lvlText", ""),
                counters
            )
        else:
            num_text = ""
        
        parts = [num_text] if num_text else ["?"]
        if ilvl > 0:
            parts.append(str(ilvl))
        prefix = ":".join(parts)
    
    if indent_parts:
        return prefix + "/" + "/".join(indent_parts)
    else:
        return prefix


def _extract_indent_info(para: Paragraph) -> dict | None:
    """Extract a paragraph's indentation (left / hanging / firstLine), or None if none is set."""
    pPr = para._element.find(qn('w:pPr'))
    if pPr is None:
        return None
    
    ind = pPr.find(qn('w:ind'))
    if ind is None:
        return None
    
    left = ind.get(qn('w:left'))
    first_line = ind.get(qn('w:firstLine'))
    hanging = ind.get(qn('w:hanging'))
    
    indent = {}
    if left is not None and int(left) != 0:
        indent["left"] = int(left)
    if hanging is not None and int(hanging) != 0:
        indent["hanging"] = int(hanging)
    elif first_line is not None and int(first_line) != 0:
        indent["firstLine"] = int(first_line)
    
    if not indent:
        return None
    
    return indent


def extract_cell_rich_text(cell) -> dict:
    """
    Extract the rich-text content of a table cell.

    Return format:
    {"v": "plain text", "r": [["text fragment", {style}]]}
    """
    runs = []
    plain_text = ""
    
    for para in cell.paragraphs:
        para_text = ""
        para_runs = []
        
        for run in para.runs:
            run_text = run.text
            if not run_text:
                continue
            
            style = extract_run_style(run)
            if style:
                para_runs.append([run_text, style])
            else:
                para_runs.append([run_text])
            para_text += run_text
        
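        if para_runs:
            if runs:
                # Separate paragraphs with a newline in both the runs and the
                # plain text, so that "v" and "r" stay consistent.
                runs.append(["\n"])
                plain_text += "\n"
            runs.extend(para_runs)
            plain_text += para_text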
    
    merged = []
    for text, style in [(r[0], r[1] if len(r) > 1 else None) for r in runs]:
        if merged and _styles_equal(merged[-1][1], style):
            merged[-1] = (merged[-1][0] + text, merged[-1][1])
        else:
            merged.append((text, style))
    
    final_runs = []
    for text, style in merged:
        if style:
            final_runs.append([text, style])
        else:
            final_runs.append([text])
    
    return {"v": plain_text, "r": final_runs}


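def extract_table_compact(table: Table) -> dict:
    """
    Extract a table (merged cells and rich text are supported).

    Format:
    - cols: number of columns
    - h: header row (always the first row); each cell is
      {"v": "text", "r": [[run]], "gs": horizontal span, "vm": vertical-merge state}
    - d: data rows, same cell format

    Merged cells:
    - gs (gridSpan): number of columns merged horizontally, e.g. gs=2 spans 2 columns
    - vm (vMerge): vertical-merge state; "restart" starts a merge, "continue" continues it
    """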
    cols = len(table.columns)
    
    headers = []
    data = []
    
    for row_idx, row in enumerate(table.rows):
        cells = []
        tc_list = row._tr.findall(qn('w:tc'))
        
        for tc_idx, tc in enumerate(tc_list):
            cell_data = {"v": "", "r": []}
            
            p_list = tc.findall(qn('w:p'))
            for p in p_list:
                for r in p.findall('.//' + qn('w:r')):
                    text_elem = r.find(qn('w:t'))
                    if text_elem is not None and text_elem.text:
                        text = text_elem.text
                        cell_data["v"] += text
                        
                        style = {}
                        rPr = r.find(qn('w:rPr'))
                        if rPr is not None:
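                            # w:b / w:i / w:u are toggles: an explicit off value
                            # (e.g. <w:b w:val="0"/>) disables the property.
                            for tag, key in (('w:b', 'b'), ('w:i', 'i'), ('w:u', 'u')):
                                toggle = rPr.find(qn(tag))
                                if toggle is not None and toggle.get(qn('w:val')) not in ('0', 'false', 'off', 'none'):
                                    style[key] = True
                            color_elem = rPr.find(qn('w:color'))
                            if color_elem is not None:
                                c = color_elem.get(qn('w:val'))
                                # "auto" is the automatic color, not an RGB value.
                                if c and c not in ("000000", "auto"):
                                    style["c"] = c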
                            sz_elem = rPr.find(qn('w:sz'))
                            if sz_elem is not None:
                                sz = sz_elem.get(qn('w:val'))
                                if sz and int(sz) != 24:
                                    style["sz"] = int(sz)
                        
                        if style:
                            cell_data["r"].append([text, style])
                        else:
                            cell_data["r"].append([text])
            
            tcPr = tc.find(qn('w:tcPr'))
            if tcPr is not None:
                gs_elem = tcPr.find(qn('w:gridSpan'))
                if gs_elem is not None:
                    gs_val = gs_elem.get(qn('w:val'))
                    if gs_val:
                        cell_data["gs"] = int(gs_val)
                
                vm_elem = tcPr.find(qn('w:vMerge'))
                if vm_elem is not None:
                    vm_val = vm_elem.get(qn('w:val'))
                    if vm_val == "restart":
                        cell_data["vm"] = "restart"
                    else:
                        cell_data["vm"] = "continue"
            
            cells.append(cell_data)
        
        if row_idx == 0:
            headers = cells
        else:
            data.append(cells)
    
    return {"t": "table", "cols": cols, "h": headers, "d": data}


def extract_image_from_paragraph(para: Paragraph, images_dir: str, img_counter: list) -> dict | None:
    """Extract the first embedded image found in a paragraph and save it under images_dir."""
    for run in para.runs:
        drawings = run._element.findall('.//' + qn('a:blip'))
        for drawing in drawings:
            embed = drawing.get(qn('r:embed'))
            if not embed:
                embed = drawing.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed')
            if embed:
                try:
                    image_part = para.part.related_parts[embed]
                    image_bytes = image_part.blob
                    
                    ext = image_part.content_type.split('/')[-1]
                    if ext == 'jpeg':
                        ext = 'jpg'
                    
                    img_hash = hashlib.md5(image_bytes).hexdigest()[:8]
                    img_name = f"img_{img_counter[0]:03d}_{img_hash}.{ext}"
                    img_counter[0] += 1
                    
                    img_path = os.path.join(images_dir, img_name)
                    with open(img_path, 'wb') as f:
                        f.write(image_bytes)
                    
                    pil_img = Image.open(io.BytesIO(image_bytes))
                    w, h = pil_img.size
                    
                    if w > 400:
                        ratio = 400 / w
                        w = 400
                        h = int(h * ratio)
                    
                    return {"t": "img", "p": f"images/{img_name}", "w": w, "h": h}
                    
                except Exception as e:
                    print(f"图片提取失败: {e}")
    
    return None


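def parse_supplier_docx(docx_path: str, output_dir: str = None, filter_config_path: str = None) -> dict:
    """
    Parse a supplier document.

    Args:
        docx_path: path to the .docx file
        output_dir: output directory for images and JSON
            (defaults to <skill_dir>/parsed_output)
        filter_config_path: path to the filter config (optional; defaults to scripts/parse_filter.json)

    Returns:
        The parsed document as a JSON-serializable dict.
    """
    if output_dir is None:
        # Never write next to the input document: SKILL.md requires all output
        # to stay inside the Skill directory.
        output_dir = os.path.join(_get_skill_dir(), "parsed_output")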
    
    images_dir = os.path.join(output_dir, "images")
    os.makedirs(images_dir, exist_ok=True)
    
    doc = Document(docx_path)
    
    result = {
        "doc_title": "",
        "doc_date": "",
        "sections": {
            "original_req": "",
            "req_clarify": "",
            "req_analysis": "",
            "solution_overview": "",
            "new_tables": [],
            "modify_tables": [],
            "data_compatibility": []
        },
        "functions": [],
        "zhongtai_interfaces": [],
        "images_dir": "images"
    }
    
    img_counter = [1]
    blocks = list(iter_block_items(doc))
    num_counters = {}
    parse_filter = _load_parse_filter(filter_config_path)
    
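    # Section matching rules: (heading level, text the heading contains) -> target field.
    # Matching on level + text avoids clashes between parent and child headings with the same name.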
    SECTION_RULES = {
        (2, "原始需求"): ("original_req", "text"),
        (2, "需求澄清"): ("req_clarify", "text"),
        (2, "需求分析"): ("req_analysis", "text"),
        (3, "方案概述"): ("solution_overview", "text"),
        (2, "新增表"): ("new_tables", "blocks"),
        (2, "修改表"): ("modify_tables", "blocks"),
        (2, "数据兼容性要求"): ("data_compatibility", "blocks"),
    }
    
    current_section = None
    current_section_type = None
    current_function = None
    current_sub_function = None
    in_function_section = False
    in_zhongtai_section = False
    zhongtai_current_interface = None
    
    def _finalize_function(func):
        """When a function point ends, drop it if its name and all of its sub-function names contain XXX."""
        if func is None:
            return
        name = func.get("name", "")
        subs = func.get("sub_functions", [])
        if "xxx" in name.lower() and subs:
            all_xxx = all("xxx" in s.get("name", "").lower() for s in subs)
            if all_xxx:
                if func in result["functions"]:
                    result["functions"].remove(func)
    
    for idx, block in enumerate(blocks):
        if isinstance(block, Paragraph):
            if _should_filter_paragraph(block, parse_filter):
                continue
            
            level = get_heading_level(block)
            text = block.text.strip()
            
            # Level-1 heading: a new top-level chapter starts.
            if level == 1:
                _finalize_function(current_function)
                # Only the "功能分解" chapter contains function points.
                in_function_section = "功能分解" in text
                current_section = None
                current_function = None
                in_zhongtai_section = False
                zhongtai_current_interface = None
                continue
            
            # Check the section rules (must come after the level-1 check).
            matched = False
            for (rule_level, rule_text), (target_field, content_type) in SECTION_RULES.items():
                if level == rule_level and rule_text in text:
                    _finalize_function(current_function)
                    current_function = None
                    current_section = target_field
                    current_section_type = content_type
                    matched = True
                    break
            
            if matched:
                continue
            
            # Function-point heading (Heading 2 inside the "功能分解" chapter).
            if level == 2 and in_function_section:
                _finalize_function(current_function)
                # "中台三问能开接口同步" section: switch to the special parsing mode.
                if "中台三问" in text:
                    in_zhongtai_section = True
                    current_function = None
                    current_section = None
                    current_section_type = None
                    continue
                current_function = {
                    "name": text,
                    "scene_desc": "",
                    "sub_functions": []
                }
                result["functions"].append(current_function)
                current_section = "function"
                current_section_type = None
                in_zhongtai_section = False
                zhongtai_current_interface = None
                continue
            
            # Zhongtai mode: each Heading 4 is an interface name.
            # Look for the first table within the interface's scope, i.e. before
            # the next heading of level 4 or above.
            if in_zhongtai_section and level == 4:
                interface_info = {
                    "interface_name": text,
                    "method": "",
                    "description": ""
                }
                # Scan forward for the first table within this interface's scope.
                for j in range(idx + 1, len(blocks)):
                    next_block = blocks[j]
                    if isinstance(next_block, Paragraph):
                        # get_heading_level() only returns 1-4 for headings, so any
                        # heading (the next Heading 4 or a higher-level one) ends the scope.
                        if get_heading_level(next_block) > 0:
                            break
                    elif isinstance(next_block, Table):
                        table_block = extract_table_compact(next_block)
                        # Vertical table: column 1 holds the attribute name, column 2 its value.
                        for row in table_block.get("d", []):
                            if len(row) >= 2:
                                attr_name = row[0].get("v", "") if isinstance(row[0], dict) else str(row[0])
                                attr_value = row[1].get("v", "") if isinstance(row[1], dict) else str(row[1])
                                if "接口名称" in attr_name:
                                    interface_info["interface_name"] = attr_value
                                elif "方法名" in attr_name:
                                    interface_info["method"] = attr_value
                                elif "接口说明" in attr_name:
                                    interface_info["description"] = attr_value
                        result["zhongtai_interfaces"].append(interface_info)
                        break
                continue
            
            # Heading 3 inside a function point: scene description or implementation plan.
            if level == 3 and in_function_section and current_function:
                if "场景描述" in text:
                    current_section = "scene_desc"
                    current_section_type = "text"
                elif "实现方案" in text:
                    current_section = "implementation"
                    current_section_type = None
                else:
                    current_section = None
                continue
            
            # Sub-function heading (Heading 4).
            if level == 4 and in_function_section and current_function:
                current_sub_function = {
                    "name": text,
                    "content_blocks": []
                }
                current_function["sub_functions"].append(current_sub_function)
                current_section = "sub_function"
                current_section_type = "blocks"
                continue
            
            # Any other heading resets the current section.
            if level > 0:
                current_section = None
                continue
            
            # Plain paragraph content.
            else:
                # Plain-text content of the fixed sections.
                if current_section in ["original_req", "req_clarify", "req_analysis", "solution_overview"]:
                    if text:
                        if result["sections"][current_section]:
                            result["sections"][current_section] += "\n" + text
                        else:
                            result["sections"][current_section] = text
                
                # Rich-text sections (new tables, modified tables, data compatibility requirements).
                elif current_section in ["new_tables", "modify_tables", "data_compatibility"]:
                    if text:
                        text_block = extract_text_with_format(block, doc, num_counters)
                        if text_block:
                            result["sections"][current_section].append(text_block)
                    # Also check the paragraph for an image.
                    img_block = extract_image_from_paragraph(block, images_dir, img_counter)
                    if img_block:
                        result["sections"][current_section].append(img_block)
                
                # Scene description of the current function point.
                elif current_section == "scene_desc" and current_function:
                    if text:
                        if current_function["scene_desc"]:
                            current_function["scene_desc"] += "\n" + text
                        else:
                            current_function["scene_desc"] = text
                
                # Content blocks of the current sub-function.
                elif current_section == "sub_function" and current_sub_function:
                    if text:
                        text_block = extract_text_with_format(block, doc, num_counters)
                        if text_block:
                            current_sub_function["content_blocks"].append(text_block)
                    # Also check the paragraph for an image.
                    img_block = extract_image_from_paragraph(block, images_dir, img_counter)
                    if img_block:
                        current_sub_function["content_blocks"].append(img_block)
        
        elif isinstance(block, Table):
            # Tables are only kept inside rich-text sections and sub-functions.
            if current_section in ["new_tables", "modify_tables", "data_compatibility"]:
                table_block = extract_table_compact(block)
                result["sections"][current_section].append(table_block)
            
            elif current_section == "sub_function" and current_sub_function:
                table_block = extract_table_compact(block)
                current_sub_function["content_blocks"].append(table_block)
    
    # End of document: finalize the last function point.
    _finalize_function(current_function)
    
    # Remove the workload ("工作量") column from every table inside a sub-function.
    for func in result["functions"]:
        for sub in func.get("sub_functions", []):
            for block in sub.get("content_blocks", []):
                if block.get("t") == "table":
                    headers = block.get("h", [])
                    workload_idx = None
                    for idx, h in enumerate(headers):
                        if isinstance(h, dict) and "工作量" in h.get("v", ""):
                            workload_idx = idx
                            break
                    if workload_idx is not None:
                        del headers[workload_idx]
                        for row in block.get("d", []):
                            if workload_idx < len(row):
                                del row[workload_idx]
                        block["cols"] = block.get("cols", len(headers)) - 1
    
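    # Scan the first paragraphs for the title and the date. The loop does not
    # stop at the title, since a date that follows it would otherwise never be read.
    for para in doc.paragraphs[:10]:
        text = para.text.strip()
        if not result["doc_title"] and text and not any(c in text for c in ['目录', '版本', '修改']):
            if len(text) > 5 and "需求" in text:
                result["doc_title"] = text

        if re.match(r'\d{4}年\d{1,2}月\d{1,2}日', text):
            result["doc_date"] = text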
    
    return result


def _get_skill_dir() -> str:
    """Return the Skill root directory (the parent of scripts/)."""
    return os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


def main():
    """Entry point with command-line argument support."""
    import argparse
    
    skill_dir = _get_skill_dir()
    default_output = os.path.join(skill_dir, "parsed_output")
    
    parser = argparse.ArgumentParser(description='解析供应商Word文档为JSON格式')
    parser.add_argument('docx_path', help='供应商Word文档路径')
    parser.add_argument('-o', '--output', default=None, help=f'输出目录路径(默认: {default_output})')
    parser.add_argument('-n', '--name', default='converted_document.json', help='输出JSON文件名(默认: converted_document.json)')
    parser.add_argument('--filter', default=None, help='过滤配置文件路径(默认: scripts/parse_filter.json)')
    
    args = parser.parse_args()
    
    docx_path = args.docx_path
    output_dir = args.output if args.output else default_output
    
    os.makedirs(output_dir, exist_ok=True)
    
    print(f"解析文档: {docx_path}")
    result = parse_supplier_docx(docx_path, output_dir, args.filter)
    
    json_path = os.path.join(output_dir, args.name)
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    
    print(f"JSON已保存: {json_path}")
    print(f"图片目录: {os.path.join(output_dir, 'images')}")
    print("\n提示: 已跳过AI转换环节,JSON可直接用于渲染")
    
    print("\n=== 解析结果摘要 ===")
    print(f"文档标题: {result['doc_title']}")
    print(f"文档日期: {result['doc_date']}")
    
    print(f"\n固定章节:")
    for key, value in result['sections'].items():
        if isinstance(value, str):
            preview = (value[:50] + "...") if len(value) > 50 else (value or "(空)")
            print(f"  {key}: {preview}")
        elif isinstance(value, list):
            print(f"  {key}: {len(value)}个内容块")
    
    print(f"\n功能点数量: {len(result['functions'])}")
    for func in result['functions']:
        print(f"  - {func['name']}")
        print(f"    场景描述: {func['scene_desc'][:30]}...")
        print(f"    子功能点数量: {len(func['sub_functions'])}")
        for sub in func['sub_functions']:
            blocks_count = len(sub['content_blocks'])
            print(f"      - {sub['name']} ({blocks_count}个内容块)")


if __name__ == "__main__":
    main()
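
The `//` comment handling that `_load_parse_filter` applies to `parse_filter.json` can be tried in isolation. The block below is a minimal sketch rather than the parser's own code: `strip_line_comments` and `SAMPLE` are hypothetical names introduced for illustration, and the sample values are made up. It assumes the same rule as the loader, namely that a `//` outside a string literal starts a comment while a `//` inside a string (such as a URL) is kept.

```python
import json


def strip_line_comments(raw: str) -> str:
    """Remove // comments that sit outside JSON string literals (hypothetical helper)."""
    cleaned = []
    for line in raw.split("\n"):
        out = []
        in_string = False
        escaped = False
        for i, ch in enumerate(line):
            if in_string:
                out.append(ch)
                if escaped:
                    escaped = False      # character after a backslash is literal
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False    # unescaped quote closes the string
            elif ch == '"':
                in_string = True
                out.append(ch)
            elif ch == "/" and line[i + 1:i + 2] == "/":
                break                    # rest of the line is a comment
            else:
                out.append(ch)
        cleaned.append("".join(out))
    return "\n".join(cleaned)


# Made-up config text; the keys mirror those read from parse_filter.json.
SAMPLE = r'''{
    // styles to skip entirely
    "exclude_styles": ["toc 1"],               // table-of-contents entries
    "exclude_texts": ["see http://example.com"],
    "exclude_text_patterns": ["^\\d+\\."]
}'''

config = json.loads(strip_line_comments(SAMPLE))
print(config["exclude_texts"])
```

Tracking the string state character by character is what keeps the URL in `exclude_texts` intact. A plain `line.split("//")[0]` would cut it in half.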

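scripts/supplier_schema.py

"""
JSON storage template for supplier documents (schema definition).

Document structure:
- doc_title: document title
- doc_date: document date
- sections: fixed chapters
  - original_req: original requirement (plain text)
  - req_clarify: requirement clarification (plain text)
  - req_analysis: requirement analysis (plain text)
  - solution_overview: solution overview (plain text)
  - new_tables / modify_tables / data_compatibility: lists of content blocks
- functions: list of function points (nested structure)
  - name: function point name
  - scene_desc: scene description (plain text)
  - sub_functions: list of sub-functions
    - name: sub-function name
    - content_blocks: list of content blocks (rich text, tables, images)
- zhongtai_interfaces: list of "中台三问" interfaces (interface_name, method, description)
- images_dir: image storage directory

Content block types:
- text: rich text (compact array format)
  {"t": "text", "runs": [["text"], ["styled", {"b": true}]], "list": "disc/420/h420", "indent": {"left": 420, "hanging": 420}}
- table: table (merged cells and rich text supported)
  {"t": "table", "cols": n, "h": [{"v": "text", "r": [[run]], "gs": n, "vm": "state"}], "d": [[...]]}
- image: image {"t": "img", "p": "path", "w": n, "h": n}
"""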

from typing import List, Optional, Literal
from pydantic import BaseModel, Field
import json


class TextStyle(BaseModel):
    b: Optional[bool] = Field(None, description="粗体")
    i: Optional[bool] = Field(None, description="斜体")
    u: Optional[bool] = Field(None, description="下划线")
    c: Optional[str] = Field(None, description="颜色(HEX)")
    sz: Optional[int] = Field(None, description="字号(半磅)")


class TextBlock(BaseModel):
    t: Literal["text"] = "text"
    runs: List[List] = Field(default_factory=list, description="紧凑数组格式: [[text], [text, style]]")
    list: Optional[str] = Field(None, description="列表信息,紧凑字符串格式: disc/420/h420 或 1./0/f420")
    indent: Optional[dict] = Field(None, description="缩进信息: {left: 420, hanging: 420} 或 {left: 0, firstLine: 420}")


class TableCell(BaseModel):
    v: str = Field(default="", description="单元格纯文本")
    r: List[List] = Field(default_factory=list, description="富文本runs: [[text, style?]]")
    gs: Optional[int] = Field(None, description="横向合并列数(gridSpan)")
    vm: Optional[str] = Field(None, description="纵向合并状态: restart/continue")


class TableBlock(BaseModel):
    t: Literal["table"] = "table"
    cols: int = Field(..., description="列数")
    h: List[dict] = Field(default_factory=list, description="表头行(TableCell列表)")
    d: List[List[dict]] = Field(default_factory=list, description="数据行(TableCell二维数组)")
    col_widths: Optional[List[int]] = Field(None, description="列宽列表,单位: twentieths of a point")


class ImageBlock(BaseModel):
    t: Literal["img"] = "img"
    p: str = Field(..., description="图片路径")
    w: int = Field(..., description="宽度")
    h: int = Field(..., description="高度")


class SubFunction(BaseModel):
    name: str = Field(..., description="子功能点名称")
    content_blocks: List[dict] = Field(default_factory=list, description="内容块列表")


class Function(BaseModel):
    name: str = Field(..., description="功能点名称")
    scene_desc: str = Field(default="", description="场景描述")
    sub_functions: List[SubFunction] = Field(default_factory=list, description="子功能点列表")


class Sections(BaseModel):
    original_req: str = Field(default="", description="原始需求")
    req_clarify: str = Field(default="", description="需求澄清")
    req_analysis: str = Field(default="", description="需求分析")
    solution_overview: str = Field(default="", description="方案概述")
    new_tables: list = Field(default_factory=list, description="新增表(富文本内容块)")
    modify_tables: list = Field(default_factory=list, description="修改表(富文本内容块)")
    data_compatibility: list = Field(default_factory=list, description="数据兼容性要求(富文本内容块)")


class ZhongtaiInterface(BaseModel):
    interface_name: str = Field(..., description="Interface name")
    method: str = Field(default="", description="Method name")
    description: str = Field(default="", description="Interface description")


class SupplierDocument(BaseModel):
    doc_title: str = Field(..., description="Document title")
    doc_date: Optional[str] = Field(None, description="Document date")
    sections: Sections = Field(default_factory=Sections, description="Fixed sections")
    functions: List[Function] = Field(default_factory=list, description="List of functions")
    zhongtai_interfaces: List[ZhongtaiInterface] = Field(default_factory=list, description="Interface list for the 中台三问 (middle-platform three questions) section")
    images_dir: str = Field(default="images", description="Image storage directory")


def create_example_json() -> dict:
    """Create an example JSON document (using the compact array format)."""
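    return {
        "doc_title": "需求XXXXX_需求分析说明书",
        "doc_date": "2024年6月30日",
        "sections": {
            "original_req": "This is the original requirement content to be extracted...",
            "req_clarify": "This is the requirement clarification content to be extracted...",
            "req_analysis": "This is the requirement analysis content to be extracted...",
            "solution_overview": "This is the solution overview..."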
        },
        "functions": [
            {
                "name": "Function 1: I am a function name",
                "scene_desc": "This is the scenario description for function 1...",
                "sub_functions": [
                    {
                        "name": "I am a sub-function",
                        "content_blocks": [
                            {
                                "t": "text",
                                "runs": [
                                    ["Rich text, "],
                                    ["blue text", {"c": "0000FF"}],
                                    [", "],
                                    ["red text", {"c": "FF0000"}]
                                ]
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["Small font", {"c": "FF0000", "sz": 21}]
                                ]
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["This is a list item"]
                                ],
                                "list": "disc/420/h420"
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["This is a numbered item"]
                                ],
                                "list": "1./0/f420"
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["This is an indented paragraph"]
                                ],
                                "indent": {"left": 420, "firstLine": 420}
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["A table follows:"]
                                ]
                            },
                            {
                                "t": "table",
                                "cols": 4,
                                "h": [
                                    {"v": "No.", "r": [["No."]]},
                                    {"v": "Field name", "r": [["Field name"]]},
                                    {"v": "Field description", "r": [["Field description"]]},
                                    {"v": "Remarks", "r": [["Remarks"]]}
                                ],
                                "d": [
                                    [{"v": "1", "r": [["1"]]}, {"v": "City", "r": [["City"]]}, {"v": "", "r": []}, {"v": "", "r": []}],
                                    [{"v": "2", "r": [["2"]]}, {"v": "District/County", "r": [["District/County"]]}, {"v": "", "r": []}, {"v": "", "r": []}]
                                ]
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["An image follows:"]
                                ]
                            },
                            {
                                "t": "img",
                                "p": "images/img_001.png",
                                "w": 200,
                                "h": 150
                            }
                        ]
                    }
                ]
            }
        ],
        "images_dir": "images"
    }


if __name__ == "__main__":
    print("=== Compact array format ===")
    example = create_example_json()
    print(json.dumps(example, ensure_ascii=False, indent=2))

    print("\n=== Format notes ===")
    print("runs array format: [[text], [text, style object]]")
    print('Example: [["plain text"], ["bold blue", {"b": true, "c": "0000FF"}]]')
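
The compact formats described in the schema can be decoded with a few lines of standard-library Python. The sketch below is illustrative only: `parse_list_spec` and `runs_to_text` are hypothetical helpers that are not part of the Skill, and reading the list string as marker / left indent / `h` (hanging) or `f` (firstLine) plus a value is an assumption inferred from the `list` and `indent` field descriptions.

```python
from typing import Any, Dict, List

# Assumed meaning of the one-letter prefix on the third segment.
_SPECIAL_INDENT = {"h": "hanging", "f": "firstLine"}


def parse_list_spec(spec: str) -> Dict[str, Any]:
    """Split a compact list string such as "disc/420/h420" into its parts."""
    # rsplit keeps any "/" that might appear inside the marker itself.
    marker, left, special = spec.rsplit("/", 2)
    kind = _SPECIAL_INDENT[special[0]]
    return {"marker": marker, "indent": {"left": int(left), kind: int(special[1:])}}


def runs_to_text(runs: List[list]) -> str:
    """Concatenate the text of each run, ignoring the optional style object."""
    return "".join(run[0] for run in runs)


if __name__ == "__main__":
    print(parse_list_spec("disc/420/h420"))
    print(parse_list_spec("1./0/f420"))
    print(runs_to_text([["plain, "], ["blue", {"c": "0000FF"}]]))
```

Under this reading, the two sample strings map onto the two `indent` shapes given in the field description, which is why the third segment is treated as a prefixed number rather than free text.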

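The `vm` field of `TableCell` takes the values `restart` and `continue`, the same two states Word uses for a vertically merged cell. As a minimal sketch of how those flags could be derived for the left column of the function list table, assuming that consecutive rows with the same value belong to one merged group (the helper names are hypothetical and do not come from `renderer.py`):

```python
from itertools import groupby
from typing import List, Optional


def vertical_merge_flags(values: List[str]) -> List[Optional[str]]:
    """Return one vm flag per row: "restart", "continue", or None."""
    flags: List[Optional[str]] = []
    for _, run in groupby(values):
        length = len(list(run))
        if length == 1:
            flags.append(None)  # a lone row needs no merge
        else:
            flags.append("restart")  # first row of the group opens the merge
            flags.extend(["continue"] * (length - 1))
    return flags


def build_first_column(values: List[str]) -> List[dict]:
    """Build TableCell-shaped dicts for one column, adding vm where needed."""
    cells = []
    for value, flag in zip(values, vertical_merge_flags(values)):
        # Continued cells carry no text of their own in this sketch.
        text = "" if flag == "continue" else value
        cell = {"v": text, "r": [[text]] if text else []}
        if flag is not None:
            cell["vm"] = flag
        cells.append(cell)
    return cells


if __name__ == "__main__":
    for cell in build_first_column(["A", "A", "B", "C", "C", "C"]):
        print(cell)
```

Grouping only consecutive equal values keeps two separate groups apart when the same name reappears further down the table.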
Rebuild steps

1. Install dependencies

pip install -r scripts/requirements.txt

2. Create the template

py -3.11 scripts/create_custom_template.py -o templates/custom_template.docx

3. Run

# Parse the supplier document
py -3.11 scripts/supplier_parser.py <supplier_document.docx>

# Render the client document
py -3.11 scripts/renderer.py
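
Between the two commands it can help to look at what the parser produced before rendering. The following is a small sketch rather than part of the Skill: `summarize` is a hypothetical helper, and it assumes the parsed JSON follows the `SupplierDocument` layout (`functions`, `sub_functions`, `content_blocks`, and a `t` type tag on each block) and sits at the default path `parsed_output/converted_document.json`.

```python
import json
from collections import Counter
from pathlib import Path


def summarize(doc: dict) -> dict:
    """Count functions, sub-functions and content blocks per block type."""
    block_types: Counter = Counter()
    sub_count = 0
    for function in doc.get("functions", []):
        for sub in function.get("sub_functions", []):
            sub_count += 1
            # Blocks without a "t" tag are counted under "?".
            block_types.update(b.get("t", "?") for b in sub.get("content_blocks", []))
    return {
        "functions": len(doc.get("functions", [])),
        "sub_functions": sub_count,
        "blocks": dict(block_types),
    }


if __name__ == "__main__":
    # Assumed default output location, relative to the Skill directory.
    path = Path("parsed_output/converted_document.json")
    if path.is_file():
        print(summarize(json.loads(path.read_text(encoding="utf-8"))))
    else:
        print(f"{path} not found, run supplier_parser.py first")
```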

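Notes

  • Python version >= 3.11
  • supplier_schema.py is a reference file and is not required at runtime (using it requires pydantic to be installed)
  • The output directory parsed_output/ is created automatically at runtime
  • If the template custom_template.docx does not exist, run create_custom_template.py first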
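
The two prerequisites in these notes that a run can trip over (the Python version and the template file) can be checked up front. A minimal sketch, assuming the script is started from the Skill directory; `preflight` is a hypothetical helper and not one of the eight files in the package:

```python
import sys
from pathlib import Path
from typing import List, Tuple


def preflight(skill_dir: Path, version: Tuple[int, int] = sys.version_info[:2]) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems: List[str] = []
    if version < (3, 11):
        problems.append(f"Python {version[0]}.{version[1]} found, 3.11 or later is required")
    template = skill_dir / "templates" / "custom_template.docx"
    if not template.is_file():
        problems.append(f"{template} is missing, run scripts/create_custom_template.py first")
    return problems


if __name__ == "__main__":
    # Assumes the current working directory is the Skill directory.
    issues = preflight(Path.cwd())
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks ready.")
```

The `version` parameter exists only so the check can be exercised without switching interpreters; `parsed_output/` is not checked because it is created at runtime.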

posted on 2026-05-23 09:09  lapin