SNAPSHOT.md
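# 需求文档转换 (Requirements Document Conversion) Skill: complete code package

Generated: 2026-05-23 16:55
Files included: 8

This file contains the full source code of the Skill and can be used to rebuild it by following the directory structure below.

## Directory structure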
```
req-doc-conv/
├── SKILL.md
└── scripts/
    ├── check_template.py
    ├── create_custom_template.py
    ├── parse_filter.json
    ├── renderer.py
    ├── requirements.txt
    ├── supplier_parser.py
    └── supplier_schema.py
```

## SKILL.md

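---
name: 需求文档转换
description: Converts a supplier-side technical requirements document (.docx) into a requirements specification document (.docx) for the client (Party A), in a strictly fixed format. The core feature-list table supports dynamic vertical cell merging; the left-hand cells of sub-function rows that belong to the same system function must be merged vertically.
compatibility:
- Python: 3.11 or later
- python-docx: parsing and generating Word documents
- Pillow: image processing
---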
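# Overview

This toolchain automatically converts a technical requirements document written by a supplier into a business requirements specification aimed at the client (Party A).

**Core features**:
- Input: supplier Word document (.docx) containing database design, API interfaces, and technical implementation details
- Output: client Word document (.docx) in a fixed format
- Data flow: `Supplier Word` → JSON → `Client Word`
- Custom placeholders: uses the `{{placeholder_name}}` format, with no dependency on Jinja2
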
---
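# Directory rules (important)

**All input and output must stay inside the Skill directory. Creating any file in the directory of the input document is strictly forbidden.**

The Skill directory is the directory that contains this file (SKILL.md). The scripts locate it automatically via `os.path.dirname(os.path.abspath(__file__))`, so no manual configuration is needed.

| Directory | Purpose | Notes |
|------|------|------|
| `scripts/` | Script files | Parser, renderer, and so on |
| `scripts/parse_filter.json` | Filter configuration | Defines the text to exclude during parsing |
| `templates/` | Template files | `custom_template.docx` must be in this directory |
| `parsed_output/` | Output directory | Holds both the parsed JSON and the final Word file |
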
**Forbidden**:
- ❌ Creating a `parsed_output/` directory next to the input document
- ❌ Creating a `templates/` directory next to the input document
- ❌ Auto-generating `custom_template.docx` next to the input document
- ❌ Writing output files anywhere outside the Skill directory

**Correct**:
- ✅ Parse output: `<skill_dir>/parsed_output/converted_document.json`
- ✅ Template path: `<skill_dir>/templates/custom_template.docx`
- ✅ Render output: `<skill_dir>/parsed_output/client_document.docx`
- ✅ Image directory: `<skill_dir>/parsed_output/images/`

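The path rules above can be sketched as a small helper. This is a minimal sketch, not the Skill's actual code: `skill_paths` and `ensure_inside` are hypothetical names, and it assumes the scripts live in `<skill_dir>/scripts/`, so the Skill directory is the parent of the script's own directory.

```python
from pathlib import Path


def skill_paths(script_file: str) -> dict:
    """Return the fixed Skill-relative locations for a script path."""
    # Assumption: the script sits in <skill_dir>/scripts/.
    skill_dir = Path(script_file).resolve().parent.parent
    out_dir = skill_dir / "parsed_output"
    return {
        "skill_dir": skill_dir,
        "json": out_dir / "converted_document.json",
        "template": skill_dir / "templates" / "custom_template.docx",
        "docx": out_dir / "client_document.docx",
        "images": out_dir / "images",
    }


def ensure_inside(skill_dir: Path, target: Path) -> Path:
    """Raise ValueError if target would land outside the Skill directory."""
    resolved = target.resolve()
    if not resolved.is_relative_to(skill_dir.resolve()):
        raise ValueError(f"refusing to write outside the Skill directory: {resolved}")
    return resolved
```

Calling `ensure_inside` before every write is one way to enforce the "no files outside the Skill directory" rule in a single place.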
---
# Environment requirements

**Python version**: all scripts must be run with Python 3.11 or later.

**Commands** (all paths are relative to the Skill directory):

```bash
# Parse the supplier document (output defaults to <skill_dir>/parsed_output/converted_document.json)
py -3.11 scripts/supplier_parser.py <docx_path>
# Render into the template (the template is fixed at <skill_dir>/templates/custom_template.docx)
py -3.11 scripts/renderer.py
# Check the template
py -3.11 scripts/check_template.py templates/custom_template.docx
```
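
Command-line arguments:

| Script | Argument | Description |
|---|---|---|
| supplier_parser.py | docx_path | Path to the supplier Word document (required) |
| | -o, --output | Output directory (optional; defaults to parsed_output/ under the Skill directory) |
| | -n, --name | Output JSON file name (optional; default converted_document.json) |
| | --filter | Path to the filter configuration file (optional; default scripts/parse_filter.json) |
| renderer.py | json_path | Path to the JSON data file (optional; default parsed_output/converted_document.json) |
| | -t, --template | Path to the Word template (optional; default templates/custom_template.docx) |
| | -o, --output | Path of the output Word file (optional; default parsed_output/client_document.docx) |
| | --max-funcs | Maximum number of function points (optional; default 10) |
| | --max-subs | Maximum number of sub-functions per function point (optional; default 10) |
| create_custom_template.py | -o, --output | Path of the output template file (optional) |
| | --max-funcs | Maximum number of function points (optional) |
| | --max-subs | Maximum number of sub-functions per function point (optional) |
| | -f, --force | Overwrite an existing template (optional) |
| check_template.py | template_path | Path to the template file (required) |
| pack.py | -o, --output | Path of the packed output file (optional; default SNAPSHOT.md) |
| | --no-md | Do not pack .md documents (optional) |
| | --root | Skill root directory (optional) |
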
Install dependencies:

```bash
py -3.11 -m pip install python-docx Pillow
```

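# Use cases

## When to use

- Document standardization: quickly produce a requirements specification that follows the company's standard format
- Format conversion: convert a supplier's technical document into the target template format

## When not to use

- Editing a Word file directly (use the docx skill)
- Cases where the content needs to be rewritten by AI
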
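# Workflow

```
Supplier Word (.docx)
        │
        ▼
supplier_parser.py (Python 3.11+) → converted_document.json
        │
        ▼
renderer.py (Python 3.11+) → Client Word (.docx)
```

Two phases:
- Parse phase: `supplier_parser.py` extracts the content of the supplier document and produces structured JSON
- Render phase: `renderer.py` renders the JSON into the target template
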
# File structure

```
需求文档转换/
├── scripts/
│   ├── supplier_parser.py          # Parser: Word → JSON
│   ├── renderer.py                 # Renderer: JSON → Word
│   ├── supplier_schema.py          # JSON Schema definition
│   ├── create_custom_template.py   # Template creation tool
│   ├── check_template.py           # Template checking tool
│   ├── parse_filter.json           # Parse filter configuration
│   ├── requirements.txt            # Dependency list
│   └── pack.py                     # Packing script: exports all code into a single md file
├── templates/
│   └── custom_template.docx        # Target template (with placeholders)
├── parsed_output/
│   ├── converted_document.json     # Parsed JSON
│   ├── images/                     # Extracted images
│   └── client_document.docx        # Final output
├── SKILL.md                        # Skill description file
└── SNAPSHOT.md                     # Packed output (generated by pack.py)
```

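# Core modules

## 1. Parser (supplier_parser.py)

Parses the supplier Word document into structured JSON.

Command:

```bash
py -3.11 "scripts/supplier_parser.py" <docx_path>
```

Extraction rules (the heading text is matched as it appears in the supplier document):

| Heading level | Heading text | Target field | Content type |
|---|---|---|---|
| Heading 2 | 原始需求 (original requirement) | sections.original_req | Plain text |
| Heading 2 | 需求澄清 (requirement clarification) | sections.req_clarify | Plain text |
| Heading 2 | 需求分析 (requirement analysis) | sections.req_analysis | Plain text |
| Heading 3 | 方案概述 (solution overview) | sections.solution_overview | Plain text |
| Heading 2 | 新增表 (new tables) | sections.new_tables | Rich-text blocks |
| Heading 2 | 修改表 (modified tables) | sections.modify_tables | Rich-text blocks |
| Heading 2 | 数据兼容性要求 (data compatibility requirements) | sections.data_compatibility | Rich-text blocks |
| Heading 2 | 功能点 (function points) | functions[] | Nested structure |
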
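Rich-text extraction features:
- Consecutive runs with the same style are merged automatically to reduce storage
- Bullet lists (disc, triangle, square, diamond, and so on)
- Numbered lists (Arabic numerals, Chinese numerals, circled numbers, Roman numerals, and so on)
- Multi-level lists (the number text is computed automatically)
- Text indentation (hanging and first-line)
- Horizontal (gridSpan) and vertical (vMerge) table cell merging
- Rich text inside table cells
- Image extraction and resizing
- Workload column removed automatically: in sub-function tables, any column whose header contains "工作量" (workload) is removed entirely, header and data
- XXX function points skipped: when a function point's name contains "XXX" and all of its sub-function names also contain "XXX", the function point and all of its content are skipped
- 中台三问 excluded: a function point whose name contains "中台三问" (the three middle-platform questions) is skipped automatically and handled separately later
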
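The last three rules in the list above can be illustrated with plain Python. This is a minimal sketch under stated assumptions, not the code in `supplier_parser.py`: `should_skip_function` and `drop_workload_column` are hypothetical names, tables are simplified to lists of strings, and a function point with no sub-functions is treated as satisfying the "all sub-function names contain XXX" condition.

```python
def should_skip_function(name: str, sub_names: list[str]) -> bool:
    """Decide whether a function point is left out of the JSON."""
    # Function points named after 中台三问 are handled separately.
    if "中台三问" in name:
        return True
    # Template placeholders: the name and every sub-function name contain "XXX".
    return "XXX" in name and all("XXX" in sub for sub in sub_names)


def drop_workload_column(headers: list[str], rows: list[list[str]]):
    """Remove every column whose header contains 工作量, header and data."""
    keep = [i for i, header in enumerate(headers) if "工作量" not in header]
    new_headers = [headers[i] for i in keep]
    new_rows = [[row[i] for i in keep if i < len(row)] for row in rows]
    return new_headers, new_rows
```

For example, `drop_workload_column(["序号", "功能", "工作量(人天)"], [["1", "登录", "3"]])` keeps only the first two columns.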
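## 2. Renderer (renderer.py)

Renders the JSON data into the Word template.

Command:

```bash
py -3.11 "scripts/renderer.py"
```

Core functionality:
- Plain text, rich text, tables, and images
- Rendering of bullet lists, numbered lists, and multi-level lists
- Text indentation (hanging and first-line)
- Nested function-point structure (up to 10 function points, each with up to 10 sub-functions)
- Unused placeholder sections are removed automatically
- Template styles are preserved
- The paragraphs to delete are determined from placeholder state (no hard-coded heading names)

List rendering notes:
- Bullets are rendered with standard Unicode characters (• ▶ ■ ◆ ✓ ★ →), so no special font is required
- The parser handles Wingdings/PUA bullets in supplier documents and converts them to type names
- Numbered lists get their number prefix text added automatically
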
## 3. JSON Schema (supplier_schema.py)

Defines the contract for the JSON data structure.

Data structure:

```json
{
  "doc_title": "Document title",
  "doc_date": "2024年6月30日",
  "sections": {
    "original_req": "Original requirement text",
    "req_clarify": "Requirement clarification text",
    "req_analysis": "Requirement analysis text",
    "solution_overview": "Solution overview text",
    "new_tables": [],
    "modify_tables": [],
    "data_compatibility": []
  },
  "functions": [
    {
      "name": "Function point name",
      "scene_desc": "Scenario description",
      "sub_functions": [
        {
          "name": "Sub-function name",
          "content_blocks": []
        }
      ]
    }
  ],
  "zhongtai_interfaces": [
    {
      "interface_name": "Interface name",
      "method": "Method name",
      "description": "Interface description"
    }
  ],
  "images_dir": "images"
}
```
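
A structure like the one above can be checked before rendering. The following is a minimal sketch, assuming only the keys shown in the example; `validate_document` is a hypothetical helper and is not taken from `supplier_schema.py`, whose real checks may differ.

```python
SECTION_TEXT_KEYS = ("original_req", "req_clarify", "req_analysis", "solution_overview")
SECTION_BLOCK_KEYS = ("new_tables", "modify_tables", "data_compatibility")


def validate_document(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    sections = data.get("sections")
    if not isinstance(sections, dict):
        problems.append("sections must be an object")
        sections = {}
    for key in SECTION_TEXT_KEYS:
        if not isinstance(sections.get(key, ""), str):
            problems.append(f"sections.{key} must be a string")
    for key in SECTION_BLOCK_KEYS:
        if not isinstance(sections.get(key, []), list):
            problems.append(f"sections.{key} must be a list")
    for n, func in enumerate(data.get("functions", []), start=1):
        if not func.get("name"):
            problems.append(f"functions[{n}] has no name")
        for m, sub in enumerate(func.get("sub_functions", []), start=1):
            if not isinstance(sub.get("content_blocks", []), list):
                problems.append(f"functions[{n}].sub_functions[{m}].content_blocks must be a list")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything that is wrong with a parsed document in one pass.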
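
## 4. Parse filter configuration (parse_filter.json)

The parser can filter out unwanted paragraphs through a configuration file, so that tables of contents, template instructions, and similar noise are not extracted into the JSON.

Filter rule types:

| Rule type | Config key | Description |
|---|---|---|
| Style filter | exclude_styles | Filters by paragraph style name (case-insensitive), e.g. "toc 1", "TOCHeading" |
| Text filter | exclude_texts | Filters by text content; a paragraph that contains the text is excluded |
| Regex filter | exclude_text_patterns | Filters by regular expression, matched against the full paragraph text |

Example configuration:

```json
{
  "exclude_styles": ["toc 1", "toc 2", "TOCHeading"],
  "exclude_texts": ["请在此处填写", "模板说明"],
  "exclude_text_patterns": ["^\\d+\\.\\s*.+\\t\\d+$"]
}
```

Usage:

```bash
# Use the default filter configuration (scripts/parse_filter.json)
py -3.11 scripts/supplier_parser.py <docx_path>
# Use a custom filter configuration
py -3.11 scripts/supplier_parser.py <docx_path> --filter my_filter.json
```

Note: content with rich-text styling such as blue italics is not filtered. Only the rules explicitly listed in the configuration take effect.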
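
How the three rule types might be applied to a single paragraph is sketched below. This is an illustration only: `is_excluded` is a hypothetical function, and using `re.search` with no flags is an assumption about how "matched against the full paragraph text" is implemented in the parser.

```python
import re


def is_excluded(style_name: str, text: str, config: dict) -> bool:
    """Return True if a paragraph matches any rule in the filter config."""
    # Style rule: compare style names case-insensitively.
    styles = {s.lower() for s in config.get("exclude_styles", [])}
    if style_name.lower() in styles:
        return True
    # Text rule: the paragraph only needs to contain the configured text.
    if any(t in text for t in config.get("exclude_texts", [])):
        return True
    # Regex rule: each pattern is tried against the whole paragraph text.
    return any(re.search(p, text) for p in config.get("exclude_text_patterns", []))


example_config = {
    "exclude_styles": ["toc 1", "toc 2", "TOCHeading"],
    "exclude_texts": ["请在此处填写", "模板说明"],
    "exclude_text_patterns": [r"^\d+\.\s*.+\t\d+$"],
}
```

With this configuration, a table-of-contents line such as `"1. 概述\t5"` is excluded by the regex rule even when its paragraph style is `Normal`.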
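
# Content block formats

## Rich-text block (compact array format)

```json
{
  "t": "text",
  "runs": [
    ["plain text"],
    ["bold blue", {"b": true, "c": "0000FF"}],
    ["plain again"]
  ],
  "list": "disc/420/h420",
  "indent": {"left": 420, "hanging": 420}
}
```

Format notes:
- `runs`: a list of arrays, each of the form `[text, style?]`
  - Style attributes: `b` (bold), `i` (italic), `u` (underline), `c` (color, HEX), `sz` (font size in half-points)
- `list`: list information (present only on list paragraphs), as a compact string
  - Bullets: `"disc"`, `"triangle"`, `"square"`, and so on, optionally followed by a level `:1` and an indent `/420/h420`
  - Numbered lists: `"1."`, `"(一)"`, and so on, optionally followed by a level `:1` and an indent `/0/f420`
- `indent`: indentation (present only when it differs from the default), either `{"left": 420, "hanging": 420}` or `{"left": 0, "firstLine": 420}`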
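
Reading the compact `runs` array is straightforward. The sketch below uses two hypothetical helpers (`block_plain_text` and `run_font_size_pt`, neither of which is in the Skill's source) to show how the optional style element and the half-point `sz` unit are handled.

```python
def block_plain_text(block: dict) -> str:
    """Concatenate the text of every run, ignoring styles."""
    return "".join(run[0] for run in block.get("runs", []) if run)


def run_font_size_pt(run: list):
    """Return the font size in points, or None when the run has no sz."""
    style = run[1] if len(run) > 1 and run[1] else {}
    sz = style.get("sz")
    # sz is stored in half-points, so 21 means 10.5 pt.
    return sz / 2 if sz is not None else None


example_block = {
    "t": "text",
    "runs": [["plain text, "], ["bold blue", {"b": True, "c": "0000FF"}], [" plain again"]],
}
```

Because the style element is optional, code that reads a run should always check its length before indexing position 1.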

## Table block

```json
{
  "t": "table",
  "cols": 4,
  "h": [
    {"v": "序号", "r": [["序号"]]},
    {"v": "字段名称", "r": [["字段名称"]]},
    {"v": "字段说明", "r": [["字段说明"]]},
    {"v": "备注", "r": [["备注"]]}
  ],
  "d": [
    [{"v": "1", "r": [["1"]]}, {"v": "地市", "r": [["地市"]]}, {"v": "", "r": []}, {"v": "", "r": []}],
    [{"v": "2", "r": [["2"]]}, {"v": "区县", "r": [["区县"]]}, {"v": "", "r": []}, {"v": "", "r": []}]
  ],
  "col_widths": [1500, 3500, 2500, 2500]
}
```

Cell format:
- `v`: plain text of the cell
- `r`: rich-text runs, `[[text, style?]]`
- `gs`: number of columns merged horizontally (gridSpan)
- `vm`: vertical merge state, `"restart"` or `"continue"`

Table format:
- `col_widths`: list of column widths (optional), in twentieths of a point (1/20 pt), e.g. `[1500, 3500, 4000]`
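
The `vm` markers describe vertical merges row by row: `"restart"` opens a merged group and each following `"continue"` extends it. A minimal sketch of turning those markers back into row spans is shown below. `vertical_spans` is a hypothetical helper, and it assumes that no cell in the table uses `gs`, so a list index corresponds directly to a grid column.

```python
def vertical_spans(rows: list[list[dict]], col: int) -> list[tuple[int, int]]:
    """Return (start_row, row_count) for each merged group in one column."""
    spans = []
    start = None  # row index where the currently open group began
    for r, row in enumerate(rows):
        vm = row[col].get("vm") if col < len(row) else None
        if vm == "restart":
            if start is not None:
                spans.append((start, r - start))
            start = r
        elif vm == "continue" and start is not None:
            continue  # still inside the open group
        else:
            if start is not None:
                spans.append((start, r - start))
                start = None
    if start is not None:
        spans.append((start, len(rows) - start))
    return spans
```

For a feature-list table this gives, for the left-hand column, one span per system function covering all of its sub-function rows.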

## Image block

```json
{
  "t": "img",
  "p": "images/img_001.png",
  "w": 200,
  "h": 150
}
```

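# Template placeholders

The template uses placeholders in the `{{placeholder_name}}` format.

Document-level placeholders:
- `{{doc_title}}` - document title
- `{{doc_date}}` - document date

Section placeholders:
- `{{original_req}}` - original requirement
- `{{req_clarify}}` - requirement clarification
- `{{req_analysis}}` - requirement analysis
- `{{solution_overview}}` - solution overview
- `{{new_tables}}` - new tables
- `{{modify_tables}}` - modified tables
- `{{data_compatibility}}` - data compatibility requirements

中台三问 placeholders:
- `{{func_tables}}` - list of 中台三问 interface tables
- `{{zhongtai_scene}}` - 中台三问 scenario description
- `{{happy_path}}` - normal path
- `{{sad_path}}` - error path

Function-point placeholders (N and M range from 1 to 10):
- `{{func_N_name}}` - function point name
- `{{func_N_scene}}` - scenario description
- `{{func_N_has_scene}}` - whether a scenario description exists (boolean, used for conditional rendering)
- `{{func_N_has_subs}}` - whether sub-functions exist (boolean, used for conditional rendering)
- `{{func_N_sub_M_name}}` - sub-function name
- `{{func_N_sub_M_content}}` - sub-function content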
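
The numbered placeholder names follow a regular pattern, so they can be generated rather than typed out. The sketch below uses a hypothetical helper, `function_placeholders`, and only produces the four name patterns that appear as text in the template.

```python
def function_placeholders(max_funcs: int = 10, max_subs: int = 10) -> list[str]:
    """List the func_N_* and func_N_sub_M_* placeholder names in template order."""
    names = []
    for n in range(1, max_funcs + 1):
        names += [f"func_{n}_name", f"func_{n}_scene"]
        for m in range(1, max_subs + 1):
            names += [f"func_{n}_sub_{m}_name", f"func_{n}_sub_{m}_content"]
    return names
```

With the default limits this yields 10 × (2 + 2 × 10) = 220 names, which is why an unfilled template is large and why unused sections have to be removed after rendering.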
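
# Technical notes

## Placeholder rendering flow

- Build the placeholder map (`build_placeholder_map`)
- Analyze the template structure and determine what to delete (`analyze_template_structure`)
- Walk the paragraphs and replace placeholders with actual content
- Insert tables and images at the correct positions
- Delete unused placeholder sections
- Clean up any leftover placeholders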
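
The replacement and clean-up steps can be shown on a plain string. This is a minimal sketch, not the renderer's paragraph-level logic: `fill_placeholders` is a hypothetical function that handles only string values and drops anything else (tables, block lists, unknown names), which stands in for the final clean-up step.

```python
import re

# Same placeholder syntax as the template: {{name}} with word characters only.
PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")


def fill_placeholders(text: str, mapping: dict) -> str:
    """Replace {{name}} with mapping[name]; unknown or non-string values become empty."""
    def substitute(match: re.Match) -> str:
        value = mapping.get(match.group(1), "")
        return value if isinstance(value, str) else ""
    return PLACEHOLDER_RE.sub(substitute, text)
```

In the real renderer, block values such as tables and images cannot be substituted as text and have to be inserted as separate document elements at the placeholder's position.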
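
## Style preservation

Template paragraph styles are preserved during rendering:
- The original paragraph's style and run styles are saved
- The styles are restored after the paragraph is cleared
- New content takes on the template styles
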
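## Smart deletion

What to delete is determined automatically from placeholder state:
- Empty function point name → delete the entire function point section
- Empty sub-function name → delete the sub-function section
- Empty scenario description → delete the "场景描述" (scenario description) heading
- No sub-functions → delete the "实现方案" (implementation) heading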
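
The four rules above amount to a small decision table per function point. The sketch below expresses it over the JSON structure rather than over Word paragraphs. `deletion_plan` and the keys of its return value are hypothetical names chosen for illustration and do not appear in `renderer.py`.

```python
def deletion_plan(func: dict, max_subs: int = 10) -> dict:
    """Decide which template parts of one function-point slot should be deleted."""
    all_subs = list(range(1, max_subs + 1))
    if not func.get("name"):
        # Empty name: the whole section goes, including every sub-function slot.
        return {"whole_function": True, "scene_heading": True,
                "solution_heading": True, "subs": all_subs}
    subs = func.get("sub_functions", [])
    empty_subs = [m for m in all_subs
                  if m > len(subs) or not subs[m - 1].get("name")]
    return {
        "whole_function": False,
        "scene_heading": not func.get("scene_desc"),
        "solution_heading": not subs,
        "subs": empty_subs,
    }
```

An unused template slot is simply the case `deletion_plan({})`, which marks the whole section for removal.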
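
# Notes

- Python version: Python 3.11 or later is required
- Rich-text format: uses the compact array format; runs with the same style are merged during parsing
- List support: both parsing and rendering support bullet lists, numbered lists, multi-level lists, and text indentation
- Text filtering: filter rules in `parse_filter.json` exclude tables of contents, template instructions, and similar noise during parsing
- Template limits: at most 10 function points, each with at most 10 sub-functions
- Image paths: image paths are relative to the directory that contains the JSON file
- Table style: tables use the default TableGrid style and support horizontal merging (gridSpan), vertical merging, and rich text in cells
- Blank lines: blank lines in the document are kept after rendering and are not deleted by mistake
- Template protection: create_custom_template.py does not overwrite an existing template by default; pass -f to force an overwrite
- Packed delivery: `pack.py` packs all code into SNAPSHOT.md in one step for easy delivery
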
---
## scripts/check_template.py
```python
"""
Check whether a template file is compliant.
"""
from docx import Document
import re


def check_template(template_path: str):
    """Check the template file and report placeholder and heading problems."""
    doc = Document(template_path)
    print(f"检查模板: {template_path}\n")

    # Collect every placeholder found in the body paragraphs
    placeholders = set()
    for para in doc.paragraphs:
        text = para.text
        found = re.findall(r'\{\{(\w+)\}\}', text)
        for ph in found:
            placeholders.add(ph)

    # Placeholders the template is expected to contain
    expected_placeholders = {
        "doc_title", "doc_date",
        "original_req", "req_clarify", "req_analysis", "solution_overview",
        "new_tables", "modify_tables", "data_compatibility",
        "zhongtai_scene", "func_tables",
        "happy_path", "sad_path"
    }

    # Function-point placeholders (10 function points x 10 sub-functions)
    for i in range(1, 11):
        expected_placeholders.add(f"func_{i}_name")
        expected_placeholders.add(f"func_{i}_scene")
        for j in range(1, 11):
            expected_placeholders.add(f"func_{i}_sub_{j}_name")
            expected_placeholders.add(f"func_{i}_sub_{j}_content")

    print("=== 模板中的占位符 ===")
    for ph in sorted(placeholders):
        status = "[OK]" if ph in expected_placeholders else "[?]"
        print(f" {status} {{{{ {ph} }}}}")
    print(f"\n总计: {len(placeholders)} 个占位符")

    # Report missing placeholders (first 10 only)
    missing = expected_placeholders - placeholders
    if missing:
        print(f"\n缺失的占位符: {len(missing)} 个")
        for ph in sorted(missing)[:10]:
            print(f" - {{{{ {ph} }}}}")
        if len(missing) > 10:
            print(f" ... 还有 {len(missing) - 10} 个")

    # Report unknown placeholders
    unknown = placeholders - expected_placeholders
    if unknown:
        print(f"\n未知占位符: {len(unknown)} 个")
        for ph in sorted(unknown):
            print(f" ? {{{{ {ph} }}}}")

    # Count heading styles
    print("\n=== 标题样式检查 ===")
    heading_count = {"Heading 1": 0, "Heading 2": 0, "Heading 3": 0, "Heading 4": 0}
    for para in doc.paragraphs:
        style_name = para.style.name if para.style else "Normal"
        if "Heading" in style_name:
            heading_count[style_name] = heading_count.get(style_name, 0) + 1
    for style, count in heading_count.items():
        print(f" {style}: {count} 个")

    # Verdict
    print("\n=== 检查结果 ===")
    if not missing and not unknown:
        print("[OK] 模板合规,可以进行渲染")
        return True
    else:
        print("[FAIL] 模板存在问题")
        return False


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='检查模板文件是否合规')
    parser.add_argument('template_path', help='模板文件路径')
    args = parser.parse_args()
    check_template(args.template_path)
```

---

## scripts/create_custom_template.py

```python
"""
Create the custom placeholder template.
Pre-populates enough function-point and sub-function placeholders.
"""
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
import os


def set_chinese_font(run, font_name='宋体'):
    """Set the East Asian (Chinese) font of a run."""
    run.font.name = font_name
    run._element.rPr.rFonts.set(qn('w:eastAsia'), font_name)


def create_custom_template(output_path: str, max_funcs: int = 10, max_subs: int = 10):
    """
    Create the custom placeholder template.

    Args:
        output_path: output path
        max_funcs: maximum number of function points
        max_subs: maximum number of sub-functions per function point
    """
    doc = Document()

    # Default font
    style = doc.styles['Normal']
    style.font.name = '宋体'
    style._element.rPr.rFonts.set(qn('w:eastAsia'), '宋体')
    style.font.size = Pt(10.5)

    # ========== Document title ==========
    title = doc.add_paragraph()
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER
    run = title.add_run('{{doc_title}}')
    run.font.size = Pt(22)
    run.bold = True
    set_chinese_font(run, '黑体')

    # Document date
    date_para = doc.add_paragraph()
    date_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    run = date_para.add_run('{{doc_date}}')
    set_chinese_font(run)
    doc.add_paragraph()

    # ========== 1. Original requirement ==========
    h1 = doc.add_heading('一、原始需求', level=1)
    for run in h1.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{original_req}}')
    set_chinese_font(run)

    # ========== 2. Requirement clarification ==========
    h1 = doc.add_heading('二、需求澄清', level=1)
    for run in h1.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{req_clarify}}')
    set_chinese_font(run)

    # ========== 3. Requirement analysis ==========
    h1 = doc.add_heading('三、需求分析', level=1)
    for run in h1.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{req_analysis}}')
    set_chinese_font(run)

    # ========== 4. Solution overview ==========
    h1 = doc.add_heading('四、方案概述', level=1)
    for run in h1.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{solution_overview}}')
    set_chinese_font(run)

    # ========== 5. Data model ==========
    h1 = doc.add_heading('五、数据模型', level=1)
    for run in h1.runs:
        set_chinese_font(run, '黑体')

    # 5.1 New tables
    h2 = doc.add_heading('5.1 新增表', level=2)
    for run in h2.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{new_tables}}')
    set_chinese_font(run)

    # 5.2 Modified tables
    h2 = doc.add_heading('5.2 修改表', level=2)
    for run in h2.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{modify_tables}}')
    set_chinese_font(run)

    # 5.3 Data compatibility requirements
    h2 = doc.add_heading('5.3 数据兼容性要求', level=2)
    for run in h2.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{data_compatibility}}')
    set_chinese_font(run)

    # ========== 6. Function breakdown ==========
    h1 = doc.add_heading('六、功能分解', level=1)
    for run in h1.runs:
        set_chinese_font(run, '黑体')

    # Pre-populated function points and sub-functions
    for i in range(1, max_funcs + 1):
        # Function point heading
        h2 = doc.add_heading('{{func_%d_name}}' % i, level=2)
        for run in h2.runs:
            set_chinese_font(run, '黑体')

        # Scenario description
        h3 = doc.add_heading('场景描述', level=3)
        for run in h3.runs:
            set_chinese_font(run, '黑体')
        p = doc.add_paragraph()
        run = p.add_run('{{func_%d_scene}}' % i)
        set_chinese_font(run)

        # Implementation
        h3 = doc.add_heading('实现方案', level=3)
        for run in h3.runs:
            set_chinese_font(run, '黑体')

        # Sub-functions
        for j in range(1, max_subs + 1):
            # Sub-function heading
            h4 = doc.add_heading('{{func_%d_sub_%d_name}}' % (i, j), level=4)
            for run in h4.runs:
                set_chinese_font(run, '黑体')
            # Sub-function content
            p = doc.add_paragraph()
            run = p.add_run('{{func_%d_sub_%d_content}}' % (i, j))
            set_chinese_font(run)

    # ========== 7. 中台三问 ==========
    h1 = doc.add_heading('七、中台三问', level=1)
    for run in h1.runs:
        set_chinese_font(run, '黑体')

    # 7.1 Scenario description
    h2 = doc.add_heading('7.1 场景描述', level=2)
    for run in h2.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{zhongtai_scene}}')
    set_chinese_font(run)

    # 7.2 Function list
    h2 = doc.add_heading('7.2 功能列表', level=2)
    for run in h2.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{func_tables}}')
    set_chinese_font(run)

    # ========== 8. Test suggestions ==========
    h1 = doc.add_heading('八、测试建议', level=1)
    for run in h1.runs:
        set_chinese_font(run, '黑体')

    # Normal path
    h2 = doc.add_heading('正常路径', level=2)
    for run in h2.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{happy_path}}')
    set_chinese_font(run)

    # Error path
    h2 = doc.add_heading('异常路径', level=2)
    for run in h2.runs:
        set_chinese_font(run, '黑体')
    p = doc.add_paragraph()
    run = p.add_run('{{sad_path}}')
    set_chinese_font(run)

    # Save the template
    doc.save(output_path)
    print(f"模板已创建: {output_path}")
    print(f" - 预设 {max_funcs} 个功能点")
    print(f" - 每个功能点预设 {max_subs} 个子功能点")
    return output_path


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='创建自定义占位符模板')
    parser.add_argument('-o', '--output', default='templates/custom_template.docx', help='输出模板文件路径(默认: templates/custom_template.docx)')
    parser.add_argument('--max-funcs', type=int, default=10, help='最大功能点数量(默认: 10)')
    parser.add_argument('--max-subs', type=int, default=10, help='每个功能点最大子功能数量(默认: 10)')
    parser.add_argument('-f', '--force', action='store_true', help='强制覆盖已存在的模板')
    args = parser.parse_args()

    # Never overwrite a hand-edited template unless --force is given
    if os.path.exists(args.output) and not args.force:
        print(f"模板已存在: {args.output}")
        print("如需覆盖,请使用 -f 或 --force 参数")
        print("跳过模板创建,保留用户手动修改的模板")
    else:
        output_dir = os.path.dirname(args.output)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)
        create_custom_template(args.output, args.max_funcs, args.max_subs)
```

---

## scripts/parse_filter.json

```json
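{
  // Parse filter configuration
  // While parsing the supplier Word document, paragraphs that match a rule here are skipped and not parsed into JSON
  // Content with rich-text styling such as blue italics is unaffected; only what is explicitly listed here is filtered

  // Filter by paragraph style name (case-insensitive)
  // Typical use: table-of-contents paragraphs, headers and footers
  "exclude_styles": [
    "toc 1",        // TOC level 1
    "toc 2",        // TOC level 2
    "toc 3",        // TOC level 3
    "toc 4",        // TOC level 4
    "TOC 1",
    "TOC 2",
    "TOC 3",
    "TOC 4",
    "TOCHeading"    // TOC heading
  ],

  // Filter by text content (multi-line text supported; a paragraph matches if its text contains the entry)
  // Typical use: template instructions, fill-in hints, annotation text
  // Use \n for line breaks in multi-line text, e.g. "第一行\n第二行"
  "exclude_texts": [
    "请在此处填写",
    "请填写",
    "此处填写",
    "以下为模板说明",
    "模板说明",
    "本模板使用说明",
    "【说明】",
    "【注】",
    "【示例】",
    "我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来;我是说明文字,这行文字不应该被提取出来",
    "我是说明文字,这行文字不应该被提取出来"
  ],

  // Filter by regular expression (matched against the full paragraph text, including content after line breaks)
  // Typical use: TOC lines (in the "1.1. xxx\t5" format)
  "exclude_text_patterns": [
    "^\\d+\\.\\s*\\d+\\.\\s*\\d*\\.?\\s*.+\\t\\d+$",
    "^\\d+\\.\\s*.+\\t\\d+$"
  ]
}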
```

---

## scripts/renderer.py

```python
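# -*- coding: utf-8 -*-
"""
Requirements document renderer v3
Renders JSON data into a Word template.

Features:
- Plain text, rich text, tables, and images
- Bullets, numbered lists, and multi-level lists
- Text indentation (hanging and first-line)
- Nested function-point structure
- Automatic removal of unused placeholder sections
- Template styles are preserved
- Graceful handling of empty headings (decided from placeholder state)
"""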
import os
import json
import re
from docx import Document
from docx.shared import Pt, RGBColor
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
BULLET_CHARS = {
    "disc": "\u2022",
    "triangle": "\u25b6",
    "square": "\u25a0",
    "diamond": "\u25c6",
    "check": "\u2713",
    "star": "\u2605",
    "arrow": "\u2192",
}


def _get_bullet_char(bullet_name):
    return BULLET_CHARS.get(bullet_name, "\u2022")


def create_table_element(table_data: dict):
    """
    Build a table XML element (supports merged cells and rich text).

    Cell format:
        {"v": "text", "r": [[run]], "gs": horizontal span, "vm": vertical merge state}
    Legacy format: a cell given as a plain string is converted to the new format.
    """
    cols = table_data.get("cols", 4)
    headers = table_data.get("h", [])
    data = table_data.get("d", [])

    tbl = OxmlElement('w:tbl')
    tblPr = OxmlElement('w:tblPr')
    tblStyle = OxmlElement('w:tblStyle')
    tblStyle.set(qn('w:val'), 'TableGrid')
    tblPr.append(tblStyle)

    # Explicit single-line borders on every edge and inside
    tblBorders = OxmlElement('w:tblBorders')
    for border_name in ['top', 'left', 'bottom', 'right', 'insideH', 'insideV']:
        border = OxmlElement(f'w:{border_name}')
        border.set(qn('w:val'), 'single')
        border.set(qn('w:sz'), '4')
        border.set(qn('w:space'), '0')
        border.set(qn('w:color'), '000000')
        tblBorders.append(border)
    tblPr.append(tblBorders)
    tbl.append(tblPr)

    # Column grid; falls back to 2500 twips when no width is given
    tblGrid = OxmlElement('w:tblGrid')
    col_widths = table_data.get("col_widths", None)
    for i in range(cols):
        gridCol = OxmlElement('w:gridCol')
        if col_widths and i < len(col_widths):
            gridCol.set(qn('w:w'), str(col_widths[i]))
        else:
            gridCol.set(qn('w:w'), '2500')
        tblGrid.append(gridCol)
    tbl.append(tblGrid)

    def create_cell(cell_data, is_header=False, col_idx=0):
        if isinstance(cell_data, str):
            cell_data = {"v": cell_data, "r": [[cell_data]]}
        tc = OxmlElement('w:tc')
        tcPr = OxmlElement('w:tcPr')

        # Children are appended in schema order: tcW, gridSpan, vMerge
        tcW = OxmlElement('w:tcW')
        if col_widths and col_idx < len(col_widths):
            tcW.set(qn('w:w'), str(col_widths[col_idx]))
        else:
            tcW.set(qn('w:w'), '2500')
        tcW.set(qn('w:type'), 'dxa')
        tcPr.append(tcW)
        if cell_data.get("gs"):
            gridSpan = OxmlElement('w:gridSpan')
            gridSpan.set(qn('w:val'), str(cell_data["gs"]))
            tcPr.append(gridSpan)
        if cell_data.get("vm"):
            vMerge = OxmlElement('w:vMerge')
            vMerge.set(qn('w:val'), cell_data["vm"])
            tcPr.append(vMerge)
        tc.append(tcPr)

        p = OxmlElement('w:p')
        runs = cell_data.get("r", [])
        if not runs and cell_data.get("v"):
            runs = [[cell_data["v"]]]
        for run_item in runs:
            if not run_item:
                continue
            text = run_item[0] if len(run_item) > 0 else ""
            if not text:
                continue
            r = OxmlElement('w:r')
            rPr = OxmlElement('w:rPr')
            style = run_item[1] if len(run_item) > 1 and run_item[1] else {}
            # Run properties are appended in schema order: b, i, color, sz, u.
            # Header cells are always bold, so a single w:b covers both cases.
            if is_header or style.get("b"):
                b = OxmlElement('w:b')
                rPr.append(b)
            if style.get("i"):
                i = OxmlElement('w:i')
                rPr.append(i)
            if style.get("c"):
                color = OxmlElement('w:color')
                color.set(qn('w:val'), style["c"])
                rPr.append(color)
            if style.get("sz"):
                sz = OxmlElement('w:sz')
                sz.set(qn('w:val'), str(style["sz"]))
                rPr.append(sz)
            if style.get("u"):
                u = OxmlElement('w:u')
                u.set(qn('w:val'), 'single')
                rPr.append(u)
            if len(rPr) > 0:
                r.append(rPr)
            t = OxmlElement('w:t')
            t.text = text
            t.set('{http://www.w3.org/XML/1998/namespace}space', 'preserve')
            r.append(t)
            p.append(r)
        tc.append(p)
        return tc

    if headers:
        tr = OxmlElement('w:tr')
        for col_idx, h in enumerate(headers):
            tc = create_cell(h, is_header=True, col_idx=col_idx)
            tr.append(tc)
        tbl.append(tr)
    for row_data in data:
        tr = OxmlElement('w:tr')
        for col_idx, cell_data in enumerate(row_data):
            tc = create_cell(cell_data, is_header=False, col_idx=col_idx)
            tr.append(tc)
        tbl.append(tr)
    return tbl


def build_placeholder_map(data: dict, max_funcs: int = 10, max_subs: int = 10) -> dict:
    """Build the mapping from placeholder name to content."""
    placeholder_map = {}
    placeholder_map["doc_title"] = data.get("doc_title", "")
    placeholder_map["doc_date"] = data.get("doc_date", "")

    sections = data.get("sections", {})
    placeholder_map["original_req"] = sections.get("original_req", "")
    placeholder_map["req_clarify"] = sections.get("req_clarify", "")
    placeholder_map["req_analysis"] = sections.get("req_analysis", "")
    placeholder_map["solution_overview"] = sections.get("solution_overview", "")
    placeholder_map["new_tables"] = sections.get("new_tables", [])
    placeholder_map["modify_tables"] = sections.get("modify_tables", [])
    placeholder_map["data_compatibility"] = sections.get("data_compatibility", [])

    # 中台三问 interface list -> func_tables placeholder
    zhongtai_interfaces = data.get("zhongtai_interfaces", [])
    if zhongtai_interfaces:
        func_tables_data = []
        for iface in zhongtai_interfaces:
            func_tables_data.append([
                {"v": iface.get("method", ""), "r": [[iface.get("method", "")]]},
                {"v": iface.get("interface_name", ""), "r": [[iface.get("interface_name", "")]]},
                {"v": iface.get("description", ""), "r": [[iface.get("description", "")]]}
            ])
        func_tables_blocks = [{
            "t": "table",
            "cols": 3,
            "h": [
                {"v": "接口ID", "r": [["接口ID"]]},
                {"v": "接口名称", "r": [["接口名称"]]},
                {"v": "功能说明", "r": [["功能说明"]]}
            ],
            "d": func_tables_data,
            "col_widths": [1500, 3500, 4000]
        }]
    else:
        func_tables_blocks = []
    placeholder_map["func_tables"] = func_tables_blocks
    placeholder_map["zhongtai_scene"] = data.get("zhongtai_scene", "")
    placeholder_map["happy_path"] = data.get("happy_path", "")
    placeholder_map["sad_path"] = data.get("sad_path", "")

    # Fill every template slot; slots beyond the available data get empty values
    functions = data.get("functions", [])
    for i in range(max_funcs):
        if i < len(functions):
            func = functions[i]
            placeholder_map[f"func_{i+1}_name"] = func.get("name", "")
            placeholder_map[f"func_{i+1}_scene"] = func.get("scene_desc", "")
            placeholder_map[f"func_{i+1}_has_scene"] = bool(func.get("scene_desc", ""))
            placeholder_map[f"func_{i+1}_has_subs"] = bool(func.get("sub_functions", []))
            sub_functions = func.get("sub_functions", [])
            for j in range(max_subs):
                if j < len(sub_functions):
                    sub = sub_functions[j]
                    placeholder_map[f"func_{i+1}_sub_{j+1}_name"] = sub.get("name", "")
                    placeholder_map[f"func_{i+1}_sub_{j+1}_content"] = sub.get("content_blocks", [])
                else:
                    placeholder_map[f"func_{i+1}_sub_{j+1}_name"] = ""
                    placeholder_map[f"func_{i+1}_sub_{j+1}_content"] = []
        else:
            placeholder_map[f"func_{i+1}_name"] = ""
            placeholder_map[f"func_{i+1}_scene"] = ""
            placeholder_map[f"func_{i+1}_has_scene"] = False
            placeholder_map[f"func_{i+1}_has_subs"] = False
            for j in range(max_subs):
                placeholder_map[f"func_{i+1}_sub_{j+1}_name"] = ""
                placeholder_map[f"func_{i+1}_sub_{j+1}_content"] = []
    return placeholder_map


def find_placeholder(text: str) -> list:
    """Find all placeholders in the text."""
    return re.findall(r'\{\{(\w+)\}\}', text)


def get_run_style(para):
    """Get the style of the paragraph's first run (used to preserve template styles)."""
    style = {}
    for run in para.runs:
        if run.font:
            if run.font.name:
                style["font_name"] = run.font.name
            if run.font.size:
                style["font_size"] = run.font.size
            if run.font.color and run.font.color.rgb:
                style["font_color"] = str(run.font.color.rgb)
            if run.font.bold is not None:
                style["bold"] = run.font.bold
            if run.font.italic is not None:
                style["italic"] = run.font.italic
            if run.font.underline is not None:
                style["underline"] = run.font.underline
        break  # only the first run is inspected
    return style


def apply_run_style(run, style: dict):
    """Apply a saved template style to a run."""
    if not style:
        return
    if style.get("font_name"):
        run.font.name = style["font_name"]
        run._element.rPr.rFonts.set(qn('w:eastAsia'), style["font_name"])
    if style.get("font_size"):
        run.font.size = style["font_size"]
    if style.get("font_color"):
        run.font.color.rgb = RGBColor.from_string(style["font_color"])
    # Test for presence rather than truthiness so that an explicit False
    # saved by get_run_style (e.g. bold turned off) is restored as well.
    if style.get("bold") is not None:
        run.bold = style["bold"]
    if style.get("italic") is not None:
        run.italic = style["italic"]
    if style.get("underline") is not None:
        run.underline = style["underline"]


def apply_style_to_run(run, style: dict):
    """Apply a compact rich-text style to a run (used for rich-text rendering)."""
    if not style:
        return
    if style.get("b"):
        run.bold = True
    if style.get("i"):
        run.italic = True
    if style.get("u"):
        run.underline = True
    if style.get("c"):
        run.font.color.rgb = RGBColor.from_string(style["c"])
    if style.get("sz"):
        # sz is stored in half-points
        run.font.size = Pt(style["sz"] / 2)


def render_rich_text(para, block: dict, template_style: dict = None):
    """
    Render a rich-text block into a paragraph.

    Compact format:
        {
            "t": "text",
            "runs": [["plain text"], ["bold blue", {"b": true, "c": "0000FF"}]],
            "list": "disc/420/h420",
            "indent": {"left": 420, "hanging": 420}
        }

    Format of the list string:
        - bullet: "disc" or "disc:1" (symbol name:ilvl)
        - numbered: "1." or "1.:1" (number text:ilvl)
        - with indent: "disc/420/h420" (list/left indent/hanging or first-line indent)
    """
    if not block:
        return
    runs_data = block.get("runs", [])
    list_info = block.get("list", None)
    indent_info = block.get("indent", None)
    if list_info:
        _apply_list_to_paragraph(para, list_info)
    if indent_info:
        _apply_indent_to_paragraph(para, indent_info)
    if not runs_data:
        return
    for run_item in runs_data:
        if not run_item or not isinstance(run_item, list):
            continue
        text = run_item[0] if len(run_item) > 0 else ""
        if not text:
            continue
        run = para.add_run(text)
        # Template style first, then the run's own style on top of it
        apply_run_style(run, template_style)
        if len(run_item) > 1 and run_item[1]:
            apply_style_to_run(run, run_item[1])


def _parse_list_string(list_str: str) -> dict:
    """
    Parse a compact list string into structured information.

    Formats:
        - "disc" -> bullet, ilvl=0, bullet=disc
        - "disc:1" -> bullet, ilvl=1, bullet=disc
        - "1." -> numbered, ilvl=0, numText="1."
        - "1.:1" -> numbered, ilvl=1, numText="1."
        - "disc/420/h420" -> bullet + indent
        - "1./0/f420" -> numbered + indent
    """
    parts = list_str.split('/')
    list_part = parts[0]
    indent_left = None
    indent_special = None
    if len(parts) >= 3:
        indent_left = int(parts[1])
        sp = parts[2]
        if sp.startswith('h'):
            indent_special = ("hanging", int(sp[1:]))
        elif sp.startswith('f'):
            indent_special = ("firstLine", int(sp[1:]))
        elif sp != "0":
            # A bare number is treated as a hanging indent
            indent_special = ("hanging", int(sp))

    list_parts = list_part.split(':')
    primary = list_parts[0]
    ilvl = int(list_parts[1]) if len(list_parts) > 1 else 0
    if primary in BULLET_CHARS:
        result = {
            "type": "bullet",
            "ilvl": ilvl,
            "bullet": primary,
        }
    else:
        result = {
            "type": "numbered",
            "ilvl": ilvl,
            "numText": primary,
        }

    if indent_left is not None or indent_special is not None:
        result["indent"] = {}
        # A left indent of 0 is omitted from the result
        if indent_left:
            result["indent"]["left"] = indent_left
        if indent_special:
            result["indent"][indent_special[0]] = indent_special[1]
    return result


def _apply_list_to_paragraph(para, list_str, has_indent: bool = False):
    """Apply list information to the paragraph XML."""
    info = _parse_list_string(list_str) if isinstance(list_str, str) else list_str
    pPr = para._element.find(qn('w:pPr'))
    if pPr is None:
        pPr = OxmlElement('w:pPr')
        para._element.insert(0, pPr)
    list_type = info.get("type", "bullet")
    ilvl = info.get("ilvl", 0)
    indent_from_list = info.get("indent", None)

    if list_type == "bullet":
        bullet_name = info.get("bullet", "disc")
        bullet_char = _get_bullet_char(bullet_name)
        if not has_indent and indent_from_list:
            # Indent carried by the list string
            indent = pPr.find(qn('w:ind'))
            if indent is None:
                indent = OxmlElement('w:ind')
                pPr.append(indent)
            if "left" in indent_from_list:
                indent.set(qn('w:left'), str(indent_from_list["left"]))
            if "hanging" in indent_from_list:
                indent.set(qn('w:hanging'), str(indent_from_list["hanging"]))
                if indent.get(qn('w:firstLine')) is not None:
                    del indent.attrib[qn('w:firstLine')]
            elif "firstLine" in indent_from_list:
                indent.set(qn('w:firstLine'), str(indent_from_list["firstLine"]))
                if indent.get(qn('w:hanging')) is not None:
                    del indent.attrib[qn('w:hanging')]
        elif not has_indent:
            # Default indent: 420 twips per level, hanging 420
            indent = pPr.find(qn('w:ind'))
            if indent is None:
                indent = OxmlElement('w:ind')
                pPr.append(indent)
            left = 420 + ilvl * 420
            indent.set(qn('w:left'), str(left))
            indent.set(qn('w:hanging'), str(420))

        # The bullet is rendered as a literal Unicode prefix run
        prefix_run = para.add_run(bullet_char + " ")
        rPr = prefix_run._element.find(qn('w:rPr'))
        if rPr is None:
            rPr = OxmlElement('w:rPr')
            prefix_run._element.insert(0, rPr)
        rFonts = rPr.find(qn('w:rFonts'))
        if rFonts is None:
            rFonts = OxmlElement('w:rFonts')
            rPr.insert(0, rFonts)
        rFonts.set(qn('w:ascii'), 'Arial Unicode MS')
        rFonts.set(qn('w:hAnsi'), 'Arial Unicode MS')
        rFonts.set(qn('w:eastAsia'), 'SimSun')
        rFonts.set(qn('w:cs'), 'Arial Unicode MS')

    elif list_type == "numbered":
        num_text = info.get("numText", "")
        if not has_indent and indent_from_list:
            # Indent carried by the list string
            indent = pPr.find(qn('w:ind'))
            if indent is None:
                indent = OxmlElement('w:ind')
                pPr.append(indent)
            if "left" in indent_from_list:
                indent.set(qn('w:left'), str(indent_from_list["left"]))
            if "hanging" in indent_from_list:
                indent.set(qn('w:hanging'), str(indent_from_list["hanging"]))
                if indent.get(qn('w:firstLine')) is not None:
                    del indent.attrib[qn('w:firstLine')]
            elif "firstLine" in indent_from_list:
                indent.set(qn('w:firstLine'), str(indent_from_list["firstLine"]))
                if indent.get(qn('w:hanging')) is not None:
                    del indent.attrib[qn('w:hanging')]
        elif not has_indent:
            # Default indent: 420 twips per level, no first-line indent
            indent = pPr.find(qn('w:ind'))
            if indent is None:
                indent = OxmlElement('w:ind')
                pPr.append(indent)
            left = 420 + ilvl * 420
            indent.set(qn('w:left'), str(left))
            indent.set(qn('w:firstLine'), str(0))
        # The number is rendered as a literal text prefix
        if num_text:
            para.add_run(num_text + " ")


def _apply_indent_to_paragraph(para, indent_info: dict):
    """Apply indentation information to the paragraph XML."""
    pPr = para._element.find(qn('w:pPr'))
    if pPr is None:
        pPr = OxmlElement('w:pPr')
        para._element.insert(0, pPr)
    indent = pPr.find(qn('w:ind'))
    if indent is None:
        indent = OxmlElement('w:ind')
        pPr.append(indent)
    if "left" in indent_info:
        indent.set(qn('w:left'), str(indent_info["left"]))
    # hanging and firstLine are mutually exclusive, so the other one is removed
    if "hanging" in indent_info:
        indent.set(qn('w:hanging'), str(indent_info["hanging"]))
        if indent.get(qn('w:firstLine')) is not None:
            del indent.attrib[qn('w:firstLine')]
    elif "firstLine" in indent_info:
        indent.set(qn('w:firstLine'), str(indent_info["firstLine"]))
        if indent.get(qn('w:hanging')) is not None:
            del indent.attrib[qn('w:hanging')]


def analyze_template_structure(para_list: list, placeholder_map: dict) -> dict:
"""
分析模板结构,确定哪些段落需要删除
返回:
- paras_to_remove: 需要删除的段落索引集合
- func_ranges: 每个功能点的段落范围 {func_num: (start_idx, end_idx)}
- sub_ranges: 每个子功能点的段落范围 {(func_num, sub_num): (start_idx, end_idx)}
"""
paras_to_remove = set()
func_ranges = {}
sub_ranges = {}
# 记录功能点状态
func_has_scene = {}
func_has_subs = {}
for i in range(1, 11):
func_has_scene[i] = placeholder_map.get(f"func_{i}_has_scene", False)
func_has_subs[i] = placeholder_map.get(f"func_{i}_has_subs", False)
# 第一遍:记录所有占位符段落的位置
placeholder_positions = {} # {placeholder_name: paragraph_index}
for i, para in enumerate(para_list):
text = para.text
placeholders = find_placeholder(text)
for ph in placeholders:
placeholder_positions[ph] = i
# 第二遍:确定功能点和子功能点的范围
current_func = None
current_sub = None
func_start = None
sub_start = None
for i, para in enumerate(para_list):
text = para.text
placeholders = find_placeholder(text)
style_name = para.style.name if para.style else "Normal"
if "Heading 1" in style_name or "标题 1" in style_name:
if current_func and func_start is not None:
func_ranges[current_func] = (func_start, i - 1)
if current_func and current_sub and sub_start is not None:
sub_ranges[(current_func, current_sub)] = (sub_start, i - 1)
current_func = None
current_sub = None
func_start = None
sub_start = None
for ph in placeholders:
if ph in ["new_tables", "modify_tables", "data_compatibility", "func_tables"]:
if current_func and func_start is not None:
func_ranges[current_func] = (func_start, i - 1)
if current_func and current_sub and sub_start is not None:
sub_ranges[(current_func, current_sub)] = (sub_start, i - 1)
current_func = None
current_sub = None
func_start = None
sub_start = None
continue
match = re.match(r'func_(\d+)_name$', ph)
if match:
if current_func and func_start is not None:
func_ranges[current_func] = (func_start, i - 1)
current_func = int(match.group(1))
func_start = i
current_sub = None
sub_start = None
match = re.match(r'func_(\d+)_sub_(\d+)_name$', ph)
if match:
if current_func and current_sub and sub_start is not None:
sub_ranges[(current_func, current_sub)] = (sub_start, i - 1)
current_sub = int(match.group(2))
sub_start = i
if current_func and func_start is not None:
func_ranges[current_func] = (func_start, len(para_list) - 1)
if current_func and current_sub and sub_start is not None:
sub_ranges[(current_func, current_sub)] = (sub_start, len(para_list) - 1)
# 第三遍:标记需要删除的段落
for i, para in enumerate(para_list):
text = para.text
placeholders = find_placeholder(text)
style_name = para.style.name if para.style else "Normal"
# 检查是否是空占位符
for ph in placeholders:
value = placeholder_map.get(ph, "")
# 空的功能点名称
if re.match(r'func_(\d+)_name$', ph) and not value:
func_num = int(re.match(r'func_(\d+)_name$', ph).group(1))
if func_num in func_ranges:
start, end = func_ranges[func_num]
for idx in range(start, end + 1):
paras_to_remove.add(idx)
# 空的子功能点名称
elif re.match(r'func_(\d+)_sub_(\d+)_name$', ph) and not value:
match = re.match(r'func_(\d+)_sub_(\d+)_name$', ph)
func_num = int(match.group(1))
sub_num = int(match.group(2))
if (func_num, sub_num) in sub_ranges:
start, end = sub_ranges[(func_num, sub_num)]
for idx in range(start, end + 1):
paras_to_remove.add(idx)
# 空的其他占位符
elif not value and ph in ["new_tables", "modify_tables", "data_compatibility", "func_tables"]:
paras_to_remove.add(i)
# 第四遍:根据占位符状态删除相关标题
# 查找"场景描述"和"实现方案"标题,根据状态决定是否删除
for i, para in enumerate(para_list):
text = para.text.strip()
style_name = para.style.name if para.style else "Normal"
# Only process heading-styled paragraphs (English "Heading" or Chinese "标题" style names)
if "Heading" not in style_name and "标题" not in style_name:
continue
# 查找这个标题之后最近的占位符来确定所属功能点
for j in range(i + 1, min(i + 10, len(para_list))):
next_text = para_list[j].text
next_placeholders = find_placeholder(next_text)
for ph in next_placeholders:
# 场景描述相关
if ph.startswith("func_") and "_scene" in ph:
match = re.match(r'func_(\d+)_scene', ph)
if match:
func_num = int(match.group(1))
if not func_has_scene.get(func_num, False):
paras_to_remove.add(i)
break
# 实现方案相关
elif ph.startswith("func_") and "_sub_" in ph and "_name" in ph:
match = re.match(r'func_(\d+)_sub', ph)
if match:
func_num = int(match.group(1))
if not func_has_subs.get(func_num, False):
paras_to_remove.add(i)
break
if next_placeholders:
break
return paras_to_remove
def render_template(template_path: str, placeholder_map: dict, context_dir: str, output_path: str):
"""Render the template: fill placeholders, insert tables and images, and delete unused paragraphs."""
doc = Document(template_path)
table_count = 0
img_count = 0
para_list = list(doc.paragraphs)
insertions = {}
# 分析模板结构,确定需要删除的段落
paras_to_remove = analyze_template_structure(para_list, placeholder_map)
# 渲染占位符
for i, para in enumerate(para_list):
if i in paras_to_remove:
continue
text = para.text
placeholders = find_placeholder(text)
if not placeholders:
continue
for ph in placeholders:
if ph not in placeholder_map:
continue
value = placeholder_map[ph]
# 保存段落样式和run样式
saved_style = para.style.name if para.style else None
saved_run_style = get_run_style(para)
para.clear()
if saved_style:
try:
para.style = saved_style
except Exception:
# Keep the paragraph's current style if the saved style name cannot be applied
pass
if isinstance(value, str):
if value:
run = para.add_run(value)
apply_run_style(run, saved_run_style)
else:
paras_to_remove.add(i)
elif isinstance(value, list) and value:
elements_to_insert = []
first_text_done = False
for block in value:
block_type = block.get("t", "text")
if block_type == "text":
if not first_text_done:
render_rich_text(para, block, saved_run_style)
first_text_done = True
else:
new_p = OxmlElement('w:p')
pPr = OxmlElement('w:pPr')
new_p.append(pPr)
from docx.text.paragraph import Paragraph
new_para = Paragraph(new_p, doc)
render_rich_text(new_para, block, saved_run_style)
elements_to_insert.append(new_p)
elif block_type == "table":
tbl = create_table_element(block)
elements_to_insert.append(tbl)
table_count += 1
elif block_type == "img":
img_rel_path = block.get("p", "")
img_path = os.path.normpath(os.path.join(context_dir, img_rel_path))
width = block.get("w", 200)
if os.path.exists(img_path):
try:
new_p = OxmlElement('w:p')
pPr = OxmlElement('w:pPr')
new_p.append(pPr)
from docx.text.paragraph import Paragraph
temp_para = Paragraph(new_p, doc)
run = temp_para.add_run()
run.add_picture(img_path, width=Pt(width))
elements_to_insert.append(new_p)
img_count += 1
except Exception as e:
print(f"图片加载失败: {e}")
if not first_text_done:
para.add_run(" ")
if elements_to_insert:
insertions[i] = elements_to_insert
break
# 插入表格和图片
for para_idx in sorted(insertions.keys(), reverse=True):
para = para_list[para_idx]
para_element = para._element
parent = para_element.getparent()
idx = list(parent).index(para_element)
for element in reversed(insertions[para_idx]):
parent.insert(idx + 1, element)
# 删除标记的段落
for idx in sorted(paras_to_remove, reverse=True):
if idx < len(para_list):
para = para_list[idx]
para_element = para._element
parent = para_element.getparent()
if parent is not None:
parent.remove(para_element)
# 删除所有仍包含占位符的段落
for para in list(doc.paragraphs):
text = para.text
if find_placeholder(text):
para_element = para._element
parent = para_element.getparent()
if parent is not None:
parent.remove(para_element)
# 注意:不再删除空段落,保留文档中的空行
doc.save(output_path)
print(f"渲染完成: {output_path}")
print(f"删除了 {len(paras_to_remove)} 个段落")
print(f"渲染了 {table_count} 个表格")
print(f"渲染了 {img_count} 张图片")
def render(json_path: str, template_path: str, output_path: str, max_funcs: int = 10, max_subs: int = 10):
    """Main rendering entry point."""
    # Image paths stored in the JSON are relative to the directory that holds the JSON file
    context_dir = os.path.dirname(json_path)
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    placeholder_map = build_placeholder_map(data, max_funcs, max_subs)
    render_template(template_path, placeholder_map, context_dir, output_path)


if __name__ == "__main__":
    import argparse
    # This script lives in <skill_dir>/scripts, so the Skill directory is two levels up
    skill_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    default_template = os.path.join(skill_dir, "templates", "custom_template.docx")
    default_json = os.path.join(skill_dir, "parsed_output", "converted_document.json")
    default_output = os.path.join(skill_dir, "parsed_output", "client_document.docx")
    parser = argparse.ArgumentParser(description='Render JSON data into the Word template')
    parser.add_argument('json_path', nargs='?', default=default_json, help=f'Path to the JSON data file (default: {default_json})')
    parser.add_argument('-t', '--template', default=default_template, help=f'Path to the Word template file (default: {default_template})')
    parser.add_argument('-o', '--output', default=default_output, help=f'Path to the output Word file (default: {default_output})')
    parser.add_argument('--max-funcs', type=int, default=10, help='Maximum number of function points (default: 10)')
    parser.add_argument('--max-subs', type=int, default=10, help='Maximum number of sub-functions per function point (default: 10)')
    args = parser.parse_args()
    render(args.json_path, args.template, args.output, args.max_funcs, args.max_subs)
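The range-based deletion used by `analyze_template_structure` can be illustrated without python-docx. The following is a minimal sketch, assuming paragraphs are plain strings and placeholders follow the `{{placeholder_name}}` convention. `PLACEHOLDER_RE` stands in for `find_placeholder`, whose definition is not shown in this part, and the function names `func_ranges` and `indices_to_remove` are hypothetical. Sub-function ranges and heading handling are left out.

```python
import re

# Assumed placeholder syntax: {{placeholder_name}}
PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")
FUNC_NAME_RE = re.compile(r"func_(\d+)_name$")


def func_ranges(paragraph_texts):
    """Map each function number to the (start, end) paragraph indices it spans."""
    ranges = {}
    current, start = None, None
    for i, text in enumerate(paragraph_texts):
        for ph in PLACEHOLDER_RE.findall(text):
            match = FUNC_NAME_RE.match(ph)
            if match:
                # A new func_N_name closes the previous function's range
                if current is not None:
                    ranges[current] = (start, i - 1)
                current, start = int(match.group(1)), i
    if current is not None:
        ranges[current] = (start, len(paragraph_texts) - 1)
    return ranges


def indices_to_remove(paragraph_texts, placeholder_map):
    """Collect every paragraph index that belongs to a function with an empty name."""
    doomed = set()
    for num, (start, end) in func_ranges(paragraph_texts).items():
        if not placeholder_map.get(f"func_{num}_name"):
            doomed.update(range(start, end + 1))
    return doomed


template = [
    "{{func_1_name}}",
    "{{func_1_scene}}",
    "{{func_2_name}}",
    "{{func_2_scene}}",
]
values = {"func_1_name": "Login", "func_1_scene": "User signs in", "func_2_name": ""}
print(sorted(indices_to_remove(template, values)))
```

Collecting indices first and removing elements afterwards in descending order, as `render_template` does, keeps earlier indices valid while the document tree is being modified.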
scripts/requirements.txt
# Python version requirement: >=3.11
python-docx>=0.8.12
Pillow>=9.0.0
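The minimums listed above could be verified before the scripts run. This is only a sketch under stated assumptions: `parse_version`, `meets_minimum`, and `python_ok` are hypothetical helpers that are not part of the Skill, and the parsing handles plain dotted numeric versions only.

```python
import sys

MIN_PYTHON = (3, 11)  # taken from the comment in requirements.txt


def parse_version(text: str) -> tuple:
    """Turn '0.8.12' into (0, 8, 12).

    Trailing non-digit characters in a piece are ignored, and parsing
    stops at the first piece that has no leading digits.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted versions numerically, so 10.1.0 counts as newer than 9.0.0."""
    return parse_version(installed) >= parse_version(minimum)


def python_ok() -> bool:
    """Check the running interpreter against MIN_PYTHON."""
    return sys.version_info[:2] >= MIN_PYTHON


print(meets_minimum("0.8.11", "0.8.12"), meets_minimum("10.1.0", "9.0.0"))
```

The installed version string could come from `importlib.metadata.version("python-docx")`; that lookup is left out so the sketch runs without the packages installed.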
scripts/supplier_parser.py
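# -*- coding: utf-8 -*-
"""
Supplier document parser v3
Parses a supplier .docx file into JSON.
Extraction rules (matched on heading level plus text contained in the heading):
- 1.1 原始需求 (Heading 2) -> sections.original_req (plain text)
- 1.2 需求澄清 (Heading 2) -> sections.req_clarify (plain text)
- 2.1 需求分析 (Heading 2) -> sections.req_analysis (plain text)
- 3.2.3 方案概述 (Heading 3) -> sections.solution_overview (plain text)
- 6.1 新增表 (Heading 2) -> sections.new_tables (rich-text content blocks)
- 6.2 修改表 (Heading 2) -> sections.modify_tables (rich-text content blocks)
- 6.3 数据兼容性要求 (Heading 2) -> sections.data_compatibility (rich-text content blocks)
- 4.x function points (Heading 2) -> functions[] (nested structure)
List support:
- Bullets: disc, triangle, square, diamond, check mark, star, and others
- Numbered lists: Arabic numerals, Chinese numerals in parentheses, circled numbers, multilevel numbering, and others
- Text indentation: hanging indent (hanging) and first-line indent (firstLine)
"""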
import os
import re
import json
import hashlib
from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import Table
from docx.text.paragraph import Paragraph
from docx.oxml.ns import qn
from PIL import Image
import io
BULLET_MAP = {
"\u2022": "disc",
"\u2024": "disc",
"\u2219": "disc",
"\u00b7": "disc",
"\u2981": "disc",
"\u25e6": "disc",
"\u26ac": "disc",
"\u25cf": "disc",
"\u26ab": "disc",
"\u2b24": "disc",
"\uf06c": "disc",
"\uf0b7": "disc",
"\uf0a2": "disc",
"\u25b2": "triangle",
"\u25b3": "triangle",
"\u25b4": "triangle",
"\u25b5": "triangle",
"\u25b6": "triangle",
"\u25b7": "triangle",
"\u25b8": "triangle",
"\u25b9": "triangle",
"\u25ba": "triangle",
"\u25bb": "triangle",
"\u25bc": "triangle",
"\u25bd": "triangle",
"\u25be": "triangle",
"\u25bf": "triangle",
"\u25c0": "triangle",
"\u25c1": "triangle",
"\u25c2": "triangle",
"\u25c3": "triangle",
"\u25c4": "triangle",
"\u25c5": "triangle",
"\uf0d8": "triangle",
"\uf0a7": "triangle",
"\u25a0": "square",
"\u25a1": "square",
"\u25a2": "square",
"\u25a3": "square",
"\u25a4": "square",
"\u25a5": "square",
"\u25a6": "square",
"\u25a7": "square",
"\u25a8": "square",
"\u25a9": "square",
"\u25aa": "square",
"\u25ab": "square",
"\u25ac": "square",
"\u25ad": "square",
"\u25ae": "square",
"\u25af": "square",
"\u25fb": "square",
"\u25fc": "square",
"\u2b1c": "square",
"\u2b1b": "square",
"\uf06e": "square",
"\uf0a8": "square",
"\uf0a3": "square",
"\u25c6": "diamond",
"\u25c7": "diamond",
"\u25c8": "diamond",
"\u25c9": "diamond",
"\u25ca": "diamond",
"\u25cb": "diamond",
"\u25cc": "diamond",
"\u25cd": "diamond",
"\u25ce": "diamond",
"\u25cf": "diamond",
"\u2726": "diamond",
"\u2727": "diamond",
"\u2756": "diamond",
"\u2757": "diamond",
"\u2b29": "diamond",
"\u2b2a": "diamond",
"\u2b2b": "diamond",
"\u2b2c": "diamond",
"\u2b2d": "diamond",
"\u2b2e": "diamond",
"\u2b2f": "diamond",
"\uf075": "diamond",
"\uf0a4": "diamond",
"\uf0b5": "diamond",
"\u2713": "check",
"\u2714": "check",
"\u2611": "check",
"\u2610": "check",
"\u2705": "check",
"\u2706": "check",
"\u2718": "check",
"\u2716": "check",
"\u2717": "check",
"\u2612": "check",
"\u274e": "check",
"\u2752": "check",
"\uf0fc": "check",
"\uf050": "check",
"\u2605": "star",
"\u2606": "star",
"\u2728": "star",
"\u2729": "star",
"\u272a": "star",
"\u272b": "star",
"\u272c": "star",
"\u272d": "star",
"\u272e": "star",
"\u272f": "star",
"\u2730": "star",
"\u2b50": "star",
"\u2b51": "star",
"\u2b52": "star",
"\u2731": "star",
"\u2732": "star",
"\uf0b2": "star",
"\uf0a5": "star",
"\u2190": "arrow",
"\u2191": "arrow",
"\u2192": "arrow",
"\u2193": "arrow",
"\u2194": "arrow",
"\u2195": "arrow",
"\u2196": "arrow",
"\u2197": "arrow",
"\u2198": "arrow",
"\u2199": "arrow",
"\u219a": "arrow",
"\u219b": "arrow",
"\u219c": "arrow",
"\u219d": "arrow",
"\u219e": "arrow",
"\u219f": "arrow",
"\u21a0": "arrow",
"\u21a1": "arrow",
"\u21a2": "arrow",
"\u21a3": "arrow",
"\u21a4": "arrow",
"\u21a5": "arrow",
"\u21a6": "arrow",
"\u21a7": "arrow",
"\u21a8": "arrow",
"\u21a9": "arrow",
"\u21aa": "arrow",
"\u21ab": "arrow",
"\u21ac": "arrow",
"\u21ad": "arrow",
"\u21ae": "arrow",
"\u21af": "arrow",
"\u21b0": "arrow",
"\u21b1": "arrow",
"\u21b2": "arrow",
"\u21b3": "arrow",
"\u21b4": "arrow",
"\u21b5": "arrow",
"\u21b6": "arrow",
"\u21b7": "arrow",
"\u21b8": "arrow",
"\u21b9": "arrow",
"\u21ba": "arrow",
"\u21bb": "arrow",
"\u21bc": "arrow",
"\u21bd": "arrow",
"\u21be": "arrow",
"\u21bf": "arrow",
"\u21c0": "arrow",
"\u21c1": "arrow",
"\u21c2": "arrow",
"\u21c3": "arrow",
"\u21c4": "arrow",
"\u21c5": "arrow",
"\u21c6": "arrow",
"\u21c7": "arrow",
"\u21c8": "arrow",
"\u21c9": "arrow",
"\u21ca": "arrow",
"\u21cb": "arrow",
"\u21cc": "arrow",
"\u21cd": "arrow",
"\u21ce": "arrow",
"\u21cf": "arrow",
"\u21d0": "arrow",
"\u21d1": "arrow",
"\u21d2": "arrow",
"\u21d3": "arrow",
"\u21d4": "arrow",
"\u21d5": "arrow",
"\u21d6": "arrow",
"\u21d7": "arrow",
"\u21d8": "arrow",
"\u21d9": "arrow",
"\u21da": "arrow",
"\u21db": "arrow",
"\u21dc": "arrow",
"\u21dd": "arrow",
"\u21de": "arrow",
"\u21df": "arrow",
"\u21e0": "arrow",
"\u21e1": "arrow",
"\u21e2": "arrow",
"\u21e3": "arrow",
"\u21e4": "arrow",
"\u21e5": "arrow",
"\u21e6": "arrow",
"\u21e7": "arrow",
"\u21e8": "arrow",
"\u21e9": "arrow",
"\u21ea": "arrow",
"\u21eb": "arrow",
"\u21ec": "arrow",
"\u21ed": "arrow",
"\u21ee": "arrow",
"\u21ef": "arrow",
"\u21f0": "arrow",
"\u21f1": "arrow",
"\u21f2": "arrow",
"\u21f3": "arrow",
"\u21f4": "arrow",
"\u21f5": "arrow",
"\u21f6": "arrow",
"\u21f7": "arrow",
"\u21f8": "arrow",
"\u21f9": "arrow",
"\u21fa": "arrow",
"\u21fb": "arrow",
"\u21fc": "arrow",
"\u21fd": "arrow",
"\u21fe": "arrow",
"\u21ff": "arrow",
"\u27a1": "arrow",
"\u2b05": "arrow",
"\u2b06": "arrow",
"\u2b07": "arrow",
"\u2b08": "arrow",
"\u2b09": "arrow",
"\u2b0a": "arrow",
"\u2b0b": "arrow",
"\u2b0c": "arrow",
"\u2b0d": "arrow",
"\u2b0e": "arrow",
"\u2b0f": "arrow",
"\u2b10": "arrow",
"\u2b11": "arrow",
"\u2b95": "arrow",
"\u2b96": "arrow",
"\u2b97": "arrow",
"\u2b98": "arrow",
"\u2b99": "arrow",
"\u2b9a": "arrow",
"\u2b9b": "arrow",
"\u2b9c": "arrow",
"\u2b9d": "arrow",
"\u2b9e": "arrow",
"\u2b9f": "arrow",
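}
# "\u25cf" (black circle) appears under both "disc" and "diamond" in the literal above,
# and the later entry wins; map it back to "disc".
BULLET_MAP["\u25cf"] = "disc"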
NUM_FMT_MAP = {
"decimal": "decimal",
"chineseCounting": "chineseCounting",
"decimalEnclosedCircleChinese": "enclosedCircle",
"decimalEnclosedCircle": "enclosedCircle",
"lowerLetter": "lowerLetter",
"upperLetter": "upperLetter",
"lowerRoman": "lowerRoman",
"upperRoman": "upperRoman",
"ideographTraditional": "ideographTraditional",
"ideographEnclosedCircle": "ideographEnclosedCircle",
"bullet": "bullet",
}
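def _resolve_numbering(doc, num_id, ilvl):
"""
Resolve a numbering definition from the document's numbering.xml.
Returns:
dict with keys type and ilvl, plus bullet (bullet lists) or numFmt and lvlText (numbered lists),
and optionally indent_left, indent_hanging or indent_firstLine;
None if the definition is not found
"""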
if not num_id or num_id == "0":
return None
try:
numbering_part = doc.part.numbering_part
numbering_xml = numbering_part._element
num_elem = None
for n in numbering_xml.findall(qn('w:num')):
if n.get(qn('w:numId')) == str(num_id):
num_elem = n
break
if num_elem is None:
return None
abstract_num_id = None
abs_ref = num_elem.find(qn('w:abstractNumId'))
if abs_ref is not None:
abstract_num_id = abs_ref.get(qn('w:val'))
if abstract_num_id is None:
return None
abstract_num = None
for an in numbering_xml.findall(qn('w:abstractNum')):
if an.get(qn('w:abstractNumId')) == str(abstract_num_id):
abstract_num = an
break
if abstract_num is None:
return None
for lvl in abstract_num.findall(qn('w:lvl')):
if lvl.get(qn('w:ilvl')) == str(ilvl):
numFmt_elem = lvl.find(qn('w:numFmt'))
lvlText_elem = lvl.find(qn('w:lvlText'))
num_fmt = ""
if numFmt_elem is not None:
raw_fmt = numFmt_elem.get(qn('w:val'), "")
num_fmt = NUM_FMT_MAP.get(raw_fmt, raw_fmt)
lvl_text = ""
if lvlText_elem is not None:
lvl_text = lvlText_elem.get(qn('w:val'), "")
result = {
"type": "bullet" if num_fmt == "bullet" else "numbered",
"ilvl": ilvl,
}
pPr = lvl.find(qn('w:pPr'))
if pPr is not None:
ind = pPr.find(qn('w:ind'))
if ind is not None:
left_v = ind.get(qn('w:left'))
hanging_v = ind.get(qn('w:hanging'))
firstLine_v = ind.get(qn('w:firstLine'))
if left_v is not None:
result["indent_left"] = int(left_v)
if hanging_v is not None:
result["indent_hanging"] = int(hanging_v)
elif firstLine_v is not None:
result["indent_firstLine"] = int(firstLine_v)
if num_fmt == "bullet":
bullet_name = BULLET_MAP.get(lvl_text, None)
if bullet_name:
result["bullet"] = bullet_name
else:
result["bullet"] = "disc"
else:
result["numFmt"] = num_fmt
result["lvlText"] = lvl_text
return result
except Exception:
pass
return None
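def _compute_num_text(num_fmt, lvl_text, counters):
"""
Compute the actual number text from the number format and the level text template.
Every %N placeholder in lvl_text is formatted with num_fmt, the format of the current level.
Args:
num_fmt: number format (decimal, chineseCounting, enclosedCircle, etc.)
lvl_text: level text template (e.g. "%1.", "(%1)", "%1、")
counters: counter value of each level [c0, c1, c2, ...]
Returns:
The rendered number text (e.g. "1.", "(一)", "一、")
"""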
def format_counter(fmt, val):
if fmt == "decimal":
return str(val)
elif fmt == "chineseCounting":
cn = ["零", "一", "二", "三", "四", "五", "六", "七", "八", "九", "十",
"十一", "十二", "十三", "十四", "十五", "十六", "十七", "十八", "十九", "二十"]
if 0 < val <= 20:
return cn[val]
return str(val)
elif fmt == "enclosedCircle":
circles = "①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳"
if 0 < val <= 20:
return circles[val - 1]
return str(val)
elif fmt == "lowerLetter":
if 0 < val <= 26:
return chr(ord('a') + val - 1)
return str(val)
elif fmt == "upperLetter":
if 0 < val <= 26:
return chr(ord('A') + val - 1)
return str(val)
elif fmt == "lowerRoman":
roman_map = [(1000,'m'),(900,'cm'),(500,'d'),(400,'cd'),
(100,'c'),(90,'xc'),(50,'l'),(40,'xl'),
(10,'x'),(9,'ix'),(5,'v'),(4,'iv'),(1,'i')]
result = ''
for arabic, roman in roman_map:
while val >= arabic:
result += roman
val -= arabic
return result or str(val)
elif fmt == "upperRoman":
roman_map = [(1000,'M'),(900,'CM'),(500,'D'),(400,'CD'),
(100,'C'),(90,'XC'),(50,'L'),(40,'XL'),
(10,'X'),(9,'IX'),(5,'V'),(4,'IV'),(1,'I')]
result = ''
for arabic, roman in roman_map:
while val >= arabic:
result += roman
val -= arabic
return result or str(val)
else:
return str(val)
if not lvl_text:
return ""
result = lvl_text
for i in range(len(counters) - 1, -1, -1):
placeholder = f"%{i + 1}"
if placeholder in result:
result = result.replace(placeholder, format_counter(num_fmt, counters[i]))
return result
def _load_parse_filter(config_path: str = None) -> dict:
"""
Load the parse-filter configuration (JSONC is supported, i.e. JSON with // line comments).
Config file format (parse_filter.json):
{
// comment
"exclude_styles": ["toc 1", ...], // filter by style name
"exclude_texts": ["请在此处填写", ...], // filter by text content (multi-line via \n)
"exclude_text_patterns": ["^\\d+\\.", ...] // filter by regular expression
}
"""
if config_path is None:
config_path = os.path.join(os.path.dirname(__file__), "parse_filter.json")
default_filter = {
"exclude_styles": [],
"exclude_texts": [],
"exclude_text_patterns": []
}
if not os.path.exists(config_path):
return default_filter
try:
with open(config_path, 'r', encoding='utf-8') as f:
raw = f.read()
cleaned_lines = []
for line in raw.split('\n'):
stripped = line.lstrip()
if stripped.startswith('//'):
continue
in_string = False
escape = False
result = []
for ch in line:
if escape:
result.append(ch)
escape = False
continue
if ch == '\\':
result.append(ch)
escape = True
continue
if ch == '"':
in_string = not in_string
result.append(ch)
continue
if not in_string and ch == '/' and result and result[-1] == '/':
result.pop()
break
result.append(ch)
cleaned_lines.append(''.join(result))
cleaned = '\n'.join(cleaned_lines)
config = json.loads(cleaned)
for key in default_filter:
if key not in config:
config[key] = default_filter[key]
config.pop("exclude_text_prefixes", None)
config["_compiled_patterns"] = []
for pattern in config.get("exclude_text_patterns", []):
try:
config["_compiled_patterns"].append(re.compile(pattern, re.DOTALL))
except re.error:
pass
return config
except Exception:
return default_filter
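def _should_filter_paragraph(para: Paragraph, parse_filter: dict) -> bool:
"""
Decide whether a paragraph should be filtered out (skipped during JSON parsing).
Filter rules, in priority order:
1. Style match: the paragraph style is in the exclusion list (case-insensitive)
2. Text containment: the paragraph text contains an excluded text (multi-line supported; \n means a line break)
3. Regex match: the paragraph text matches an excluded pattern (re.DOTALL, so . matches newlines)
Note: rich-text styling such as blue italics is never filtered by itself; only what the configuration explicitly lists is filtered.
"""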
if not parse_filter:
return False
style_name = para.style.name if para.style else ""
for excluded_style in parse_filter.get("exclude_styles", []):
if style_name.lower() == excluded_style.lower():
return True
text = para.text.strip()
if not text:
return False
for excluded_text in parse_filter.get("exclude_texts", []):
normalized = excluded_text.replace('\\n', '\n')
if normalized in text:
return True
for pattern in parse_filter.get("_compiled_patterns", []):
if pattern.search(text):
return True
return False
def iter_block_items(parent):
    """Iterate over paragraphs and tables in document order."""
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    else:
        raise ValueError("parent must be Document")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


def get_heading_level(para: Paragraph) -> int:
    """Return the heading level (1-4), or 0 for a non-heading paragraph."""
    style_name = para.style.name if para.style else ""
    if "Heading 1" in style_name or "标题 1" in style_name:
        return 1
    elif "Heading 2" in style_name or "标题 2" in style_name:
        return 2
    elif "Heading 3" in style_name or "标题 3" in style_name:
        return 3
    elif "Heading 4" in style_name or "标题 4" in style_name:
        return 4
    return 0
def extract_run_style(run) -> dict:
    """Extract the style of a run as a compact dict; default values are omitted."""
    style = {}
    if run.bold:
        style["b"] = True
    if run.italic:
        style["i"] = True
    if run.underline:
        style["u"] = True
    if run.font.color and run.font.color.rgb:
        color = str(run.font.color.rgb)
        # Black is treated as the default color and not stored
        if color != "000000":
            style["c"] = color
    if run.font.size:
        # Stored in half-points; 24 (12 pt) is treated as the default size
        sz = int(run.font.size.pt * 2)
        if sz != 24:
            style["sz"] = sz
    return style


def _styles_equal(s1: dict, s2: dict) -> bool:
    """Compare two style dicts; an empty dict and None count as equal."""
    if not s1 and not s2:
        return True
    if not s1 or not s2:
        return False
    return s1 == s2
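def extract_text_with_format(para: Paragraph, doc=None, num_counters=None) -> dict:
"""
Extract the paragraph's rich text (compact array format; consecutive runs with the same style are merged).
Returns None for an empty paragraph.
Format:
- runs: list of text fragments, each an array [text, style?]
- array form: [text content, style object (optional)]
- consecutive text with the same style is merged automatically to save space
- list: list info (list paragraphs only), as a compact string
- bullet: "disc" or "disc:1" or "disc/420/h420"
- numbered: "1." or "1.:1" or "1./0/f420"
- indent: indentation (only for non-list paragraphs with a non-default indent)
- {"left":420, "hanging":420} or {"firstLine":420}
Example (a list paragraph, so "indent" is absent):
{
"t": "text",
"runs": [["plain text"], ["bold blue", {"b": true, "c": "0000FF"}]],
"list": "disc/420/h420"
}
"""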
if not para.text.strip():
return None
raw_items = []
for run in para.runs:
run_text = run.text
if not run_text:
continue
style = extract_run_style(run)
raw_items.append((run_text, style))
if not raw_items:
return None
merged = []
for text, style in raw_items:
if merged and _styles_equal(merged[-1][1], style):
merged[-1] = (merged[-1][0] + text, merged[-1][1])
else:
merged.append((text, style))
runs = []
for text, style in merged:
if style:
runs.append([text, style])
else:
runs.append([text])
result = {"t": "text", "runs": runs}
list_info = _extract_list_info(para, doc, num_counters)
if list_info:
result["list"] = list_info
else:
indent_info = _extract_indent_info(para)
if indent_info:
result["indent"] = indent_info
return result
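def _extract_list_info(para: Paragraph, doc, num_counters) -> str:
"""
Extract the paragraph's list info as a compact string; return None if the paragraph is not a list item.
Format:
- bullet: "disc" or "disc:1" (symbol name:ilvl; ilvl is omitted when 0)
- numbered: "1." or "1.:1" (number text:ilvl; ilvl is omitted when 0)
- with indentation: "disc/420/h420" (symbol name/left/hanging) or "1./0/f420" (number text/left/firstLine)
"""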
pPr = para._element.find(qn('w:pPr'))
if pPr is None:
return None
numPr = pPr.find(qn('w:numPr'))
if numPr is None:
return None
numId_elem = numPr.find(qn('w:numId'))
ilvl_elem = numPr.find(qn('w:ilvl'))
if numId_elem is None:
return None
num_id = numId_elem.get(qn('w:val'))
ilvl = int(ilvl_elem.get(qn('w:val'), '0')) if ilvl_elem is not None else 0
if num_id == "0":
return None
style_name = para.style.name if para.style else ""
if "Heading" in style_name or "标题" in style_name:
return None
if doc is None:
return None
num_def = _resolve_numbering(doc, num_id, ilvl)
if num_def is None:
return None
ind = pPr.find(qn('w:ind'))
indent_parts = []
left_val = 0
has_para_indent = False
if ind is not None:
left = ind.get(qn('w:left'))
hanging = ind.get(qn('w:hanging'))
first_line = ind.get(qn('w:firstLine'))
if left is not None:
left_val = int(left)
has_para_indent = True
if hanging is not None:
indent_parts.append("h" + str(int(hanging)))
has_para_indent = True
elif first_line is not None:
indent_parts.append("f" + str(int(first_line)))
has_para_indent = True
if not has_para_indent and "indent_left" in num_def:
left_val = num_def["indent_left"]
if "indent_hanging" in num_def:
indent_parts.append("h" + str(num_def["indent_hanging"]))
elif "indent_firstLine" in num_def:
indent_parts.append("f" + str(num_def["indent_firstLine"]))
else:
indent_parts.append("0")
if indent_parts:
indent_parts.insert(0, str(left_val))
elif left_val != 0:
indent_parts = [str(left_val), "0"]
else:
indent_parts = ["0", "0"]
if num_def["type"] == "bullet":
bullet_name = num_def.get("bullet", "disc")
parts = [bullet_name]
if ilvl > 0:
parts.append(str(ilvl))
prefix = ":".join(parts)
else:
if num_counters is not None:
key = (num_id, ilvl)
# Advancing this level restarts every deeper level of the same list
for k in list(num_counters.keys()):
if k[0] == num_id and k[1] > ilvl:
del num_counters[k]
num_counters[key] = num_counters.get(key, 0) + 1
counters = []
for lvl in range(ilvl + 1):
counters.append(num_counters.get((num_id, lvl), 1))
num_text = _compute_num_text(
num_def.get("numFmt", "decimal"),
num_def.get("lvlText", ""),
counters
)
else:
num_text = ""
parts = [num_text] if num_text else ["?"]
if ilvl > 0:
parts.append(str(ilvl))
prefix = ":".join(parts)
if indent_parts:
return prefix + "/" + "/".join(indent_parts)
else:
return prefix
def _extract_indent_info(para: Paragraph) -> dict:
    """Extract the paragraph's indentation; return None when no non-zero indent is set."""
    pPr = para._element.find(qn('w:pPr'))
    if pPr is None:
        return None
    ind = pPr.find(qn('w:ind'))
    if ind is None:
        return None
    left = ind.get(qn('w:left'))
    first_line = ind.get(qn('w:firstLine'))
    hanging = ind.get(qn('w:hanging'))
    indent = {}
    if left is not None and int(left) != 0:
        indent["left"] = int(left)
    # A hanging indent takes precedence over a first-line indent
    if hanging is not None and int(hanging) != 0:
        indent["hanging"] = int(hanging)
    elif first_line is not None and int(first_line) != 0:
        indent["firstLine"] = int(first_line)
    if not indent:
        return None
    return indent
def extract_cell_rich_text(cell) -> dict:
"""
Extract the rich-text content of a table cell.
Return format:
{"v": "plain text", "r": [["text fragment", {style}]]}
"""
runs = []
plain_text = ""
for para in cell.paragraphs:
para_text = ""
para_runs = []
for run in para.runs:
run_text = run.text
if not run_text:
continue
style = extract_run_style(run)
if style:
para_runs.append([run_text, style])
else:
para_runs.append([run_text])
para_text += run_text
if para_runs:
if runs and para_text:
runs.append(["\n"])
runs.extend(para_runs)
plain_text += para_text
merged = []
for text, style in [(r[0], r[1] if len(r) > 1 else None) for r in runs]:
if merged and _styles_equal(merged[-1][1], style):
merged[-1] = (merged[-1][0] + text, merged[-1][1])
else:
merged.append((text, style))
final_runs = []
for text, style in merged:
if style:
final_runs.append([text, style])
else:
final_runs.append([text])
return {"v": plain_text, "r": final_runs}
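def extract_table_compact(table: Table) -> dict:
"""
Extract a table (merged cells and rich text are supported).
Format:
- cols: number of columns
- h: header row (the first table row); each cell is {"v": "text", "r": [[run]], "gs": horizontal span, "vm": vertical merge state}
- d: data rows, same cell format
Merged cells:
- gs (gridSpan): number of columns merged horizontally, e.g. gs=2 spans 2 columns
- vm (vMerge): vertical merge state; "restart" starts a merge, "continue" continues it
"""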
cols = len(table.columns)
headers = []
data = []
for row_idx, row in enumerate(table.rows):
cells = []
tc_list = row._tr.findall(qn('w:tc'))
for tc_idx, tc in enumerate(tc_list):
cell_data = {"v": "", "r": []}
p_list = tc.findall(qn('w:p'))
for p in p_list:
for r in p.findall('.//' + qn('w:r')):
text_elem = r.find(qn('w:t'))
if text_elem is not None and text_elem.text:
text = text_elem.text
cell_data["v"] += text
style = {}
rPr = r.find(qn('w:rPr'))
if rPr is not None:
if rPr.find(qn('w:b')) is not None:
style["b"] = True
if rPr.find(qn('w:i')) is not None:
style["i"] = True
if rPr.find(qn('w:u')) is not None:
style["u"] = True
color_elem = rPr.find(qn('w:color'))
if color_elem is not None:
c = color_elem.get(qn('w:val'))
if c and c != "000000":
style["c"] = c
sz_elem = rPr.find(qn('w:sz'))
if sz_elem is not None:
sz = sz_elem.get(qn('w:val'))
if sz and int(sz) != 24:
style["sz"] = int(sz)
if style:
cell_data["r"].append([text, style])
else:
cell_data["r"].append([text])
tcPr = tc.find(qn('w:tcPr'))
if tcPr is not None:
gs_elem = tcPr.find(qn('w:gridSpan'))
if gs_elem is not None:
gs_val = gs_elem.get(qn('w:val'))
if gs_val:
cell_data["gs"] = int(gs_val)
vm_elem = tcPr.find(qn('w:vMerge'))
if vm_elem is not None:
vm_val = vm_elem.get(qn('w:val'))
if vm_val == "restart":
cell_data["vm"] = "restart"
else:
cell_data["vm"] = "continue"
cells.append(cell_data)
if row_idx == 0:
headers = cells
else:
data.append(cells)
return {"t": "table", "cols": cols, "h": headers, "d": data}
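def extract_image_from_paragraph(para: Paragraph, images_dir: str, img_counter: list) -> dict:
"""Extract the first embedded image of a paragraph, save it under images_dir, and return an img block (width capped at 400)."""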
for run in para.runs:
drawings = run._element.findall('.//' + qn('a:blip'))
for drawing in drawings:
embed = drawing.get(qn('r:embed'))
if not embed:
embed = drawing.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed')
if embed:
try:
image_part = para.part.related_parts[embed]
image_bytes = image_part.blob
ext = image_part.content_type.split('/')[-1]
if ext == 'jpeg':
ext = 'jpg'
img_hash = hashlib.md5(image_bytes).hexdigest()[:8]
img_name = f"img_{img_counter[0]:03d}_{img_hash}.{ext}"
img_counter[0] += 1
img_path = os.path.join(images_dir, img_name)
with open(img_path, 'wb') as f:
f.write(image_bytes)
pil_img = Image.open(io.BytesIO(image_bytes))
w, h = pil_img.size
if w > 400:
ratio = 400 / w
w = 400
h = int(h * ratio)
return {"t": "img", "p": f"images/{img_name}", "w": w, "h": h}
except Exception as e:
print(f"图片提取失败: {e}")
return None
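def parse_supplier_docx(docx_path: str, output_dir: str = None, filter_config_path: str = None) -> dict:
"""
Parse a supplier document.
Args:
docx_path: path to the .docx file
output_dir: output directory for images and JSON (optional, defaults to <skill_dir>/parsed_output)
filter_config_path: path to the filter config file (optional, defaults to scripts/parse_filter.json)
Returns:
The parsed JSON data
"""
if output_dir is None:
# Default to <skill_dir>/parsed_output so nothing is written next to the input document
skill_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
output_dir = os.path.join(skill_dir, "parsed_output")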
images_dir = os.path.join(output_dir, "images")
os.makedirs(images_dir, exist_ok=True)
doc = Document(docx_path)
result = {
"doc_title": "",
"doc_date": "",
"sections": {
"original_req": "",
"req_clarify": "",
"req_analysis": "",
"solution_overview": "",
"new_tables": [],
"modify_tables": [],
"data_compatibility": []
},
"functions": [],
"zhongtai_interfaces": [],
"images_dir": "images"
}
img_counter = [1]
blocks = list(iter_block_items(doc))
num_counters = {}
parse_filter = _load_parse_filter(filter_config_path)
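# Section matching rules: (heading level, text contained in the heading) -> target field
# Combining level and text keeps parent and child sections with the same name apart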
SECTION_RULES = {
(2, "原始需求"): ("original_req", "text"),
(2, "需求澄清"): ("req_clarify", "text"),
(2, "需求分析"): ("req_analysis", "text"),
(3, "方案概述"): ("solution_overview", "text"),
(2, "新增表"): ("new_tables", "blocks"),
(2, "修改表"): ("modify_tables", "blocks"),
(2, "数据兼容性要求"): ("data_compatibility", "blocks"),
}
current_section = None
current_section_type = None
current_function = None
current_sub_function = None
in_function_section = False
in_zhongtai_section = False
zhongtai_current_interface = None
def _finalize_function(func):
"""功能点结束时检查:如果功能点含XXX且所有子功能点也含XXX,则移除"""
if func is None:
return
name = func.get("name", "")
subs = func.get("sub_functions", [])
if "xxx" in name.lower() and subs:
all_xxx = all("xxx" in s.get("name", "").lower() for s in subs)
if all_xxx:
if func in result["functions"]:
result["functions"].remove(func)
for idx, block in enumerate(blocks):
if isinstance(block, Paragraph):
if _should_filter_paragraph(block, parse_filter):
continue
level = get_heading_level(block)
text = block.text.strip()
# 检查是否是一级标题(章节开始)
if level == 1:
_finalize_function(current_function)
if "功能分解" in text:
in_function_section = True
current_section = None
elif "数据模型" in text:
in_function_section = False
current_section = None
else:
in_function_section = False
current_section = None
current_function = None
in_zhongtai_section = False
zhongtai_current_interface = None
continue
# 检查是否匹配章节规则(必须在level检查之后)
matched = False
for (rule_level, rule_text), (target_field, content_type) in SECTION_RULES.items():
if level == rule_level and rule_text in text:
_finalize_function(current_function)
current_function = None
current_section = target_field
current_section_type = content_type
matched = True
break
if matched:
continue
# 检查是否是功能点标题 (Heading 2, 在功能分解章节内)
if level == 2 and in_function_section:
_finalize_function(current_function)
# 中台三问能开接口同步:进入特殊解析模式
if "中台三问" in text:
in_zhongtai_section = True
current_function = None
current_section = None
current_section_type = None
continue
current_function = {
"name": text,
"scene_desc": "",
"sub_functions": []
}
result["functions"].append(current_function)
current_section = "function"
current_section_type = None
in_zhongtai_section = False
zhongtai_current_interface = None
continue
# 中台三问模式:Heading 4 = 接口名称
# 在该接口范围内(直到下一个Heading 4或更高级别)查找第一个表格
if in_zhongtai_section and level == 4:
interface_info = {
"interface_name": text,
"method": "",
"description": ""
}
# 向后查找该接口范围内的第一个表格
for j in range(idx + 1, len(blocks)):
next_block = blocks[j]
if isinstance(next_block, Paragraph):
next_level = get_heading_level(next_block)
if 0 < next_level <= 4:
# Stop at the next Heading 4 or at any higher-level heading (levels 1-3)
break
continue
elif isinstance(next_block, Table):
table_block = extract_table_compact(next_block)
# 纵向表格:每行第一列是属性名,第二列是值
for row in table_block.get("d", []):
if len(row) >= 2:
attr_name = row[0].get("v", "") if isinstance(row[0], dict) else str(row[0])
attr_value = row[1].get("v", "") if isinstance(row[1], dict) else str(row[1])
if "接口名称" in attr_name:
interface_info["interface_name"] = attr_value
elif "方法名" in attr_name:
interface_info["method"] = attr_value
elif "接口说明" in attr_name:
interface_info["description"] = attr_value
result["zhongtai_interfaces"].append(interface_info)
break
continue
            # Scenario description or implementation plan (Heading 3)
            if level == 3 and in_function_section and current_function:
                if "场景描述" in text:  # scenario description
                    current_section = "scene_desc"
                    current_section_type = "text"
                elif "实现方案" in text:  # implementation plan
                    current_section = "implementation"
                    current_section_type = None
                else:
                    current_section = None
                continue
            # Sub-function point (Heading 4)
            if level == 4 and in_function_section and current_function:
                current_sub_function = {
                    "name": text,
                    "content_blocks": []
                }
                current_function["sub_functions"].append(current_sub_function)
                current_section = "sub_function"
                current_section_type = "blocks"
                continue
            # Any other heading level resets the current section
            if level > 0:
                current_section = None
                continue
            # Plain paragraph content
            else:
                # Plain-text fixed sections
                if current_section in ["original_req", "req_clarify", "req_analysis", "solution_overview"]:
                    if text:
                        if result["sections"][current_section]:
                            result["sections"][current_section] += "\n" + text
                        else:
                            result["sections"][current_section] = text
                # Rich-text sections (new tables, modified tables, data compatibility requirements)
                elif current_section in ["new_tables", "modify_tables", "data_compatibility"]:
                    if text:
                        text_block = extract_text_with_format(block, doc, num_counters)
                        if text_block:
                            result["sections"][current_section].append(text_block)
                    # Also check for an image (an image-only paragraph has no text)
                    img_block = extract_image_from_paragraph(block, images_dir, img_counter)
                    if img_block:
                        result["sections"][current_section].append(img_block)
                # Scenario description of the current function point
                elif current_section == "scene_desc" and current_function:
                    if text:
                        if current_function["scene_desc"]:
                            current_function["scene_desc"] += "\n" + text
                        else:
                            current_function["scene_desc"] = text
                # Content blocks of the current sub-function point
                elif current_section == "sub_function" and current_sub_function:
                    if text:
                        text_block = extract_text_with_format(block, doc, num_counters)
                        if text_block:
                            current_sub_function["content_blocks"].append(text_block)
                    # Also check for an image
                    img_block = extract_image_from_paragraph(block, images_dir, img_counter)
                    if img_block:
                        current_sub_function["content_blocks"].append(img_block)
        elif isinstance(block, Table):
            # Tables
            if current_section in ["new_tables", "modify_tables", "data_compatibility"]:
                table_block = extract_table_compact(block)
                result["sections"][current_section].append(table_block)
            elif current_section == "sub_function" and current_sub_function:
                table_block = extract_table_compact(block)
                current_sub_function["content_blocks"].append(table_block)
    # End of document: finalize the last function point
    _finalize_function(current_function)
    # Strip the 工作量 (workload) column from every sub-function table
    for func in result["functions"]:
        for sub in func.get("sub_functions", []):
            for block in sub.get("content_blocks", []):
                if block.get("t") == "table":
                    headers = block.get("h", [])
                    workload_idx = None
                    for idx, h in enumerate(headers):
                        if isinstance(h, dict) and "工作量" in h.get("v", ""):
                            workload_idx = idx
                            break
                    if workload_idx is not None:
                        # Read the column count before deleting, so the fallback
                        # to len(headers) is not reduced twice
                        original_cols = block.get("cols", len(headers))
                        del headers[workload_idx]
                        for row in block.get("d", []):
                            if workload_idx < len(row):
                                del row[workload_idx]
                        block["cols"] = original_cols - 1
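    # Guess the title and date from the first 10 paragraphs
    for para in doc.paragraphs[:10]:
        text = para.text.strip()
        # Skip lines that mention 目录 (contents), 版本 (version) or 修改 (revision)
        if text and not any(c in text for c in ['目录', '版本', '修改']):
            # The title is the first line longer than 5 characters containing 需求 (requirement)
            if len(text) > 5 and "需求" in text:
                result["doc_title"] = text
                break
        # Dates look like 2024年6月30日
        if re.match(r'\d{4}年\d{1,2}月\d{1,2}日', text):
            result["doc_date"] = text
    return result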


def _get_skill_dir() -> str:
    """Return the Skill root directory (the parent of scripts/)."""
    return os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


def main():
    """Entry point with command-line argument support."""
    import argparse

    skill_dir = _get_skill_dir()
    default_output = os.path.join(skill_dir, "parsed_output")
    parser = argparse.ArgumentParser(description='Parse a supplier Word document into JSON')
    parser.add_argument('docx_path', help='Path to the supplier Word document')
    parser.add_argument('-o', '--output', default=None, help=f'Output directory (default: {default_output})')
    parser.add_argument('-n', '--name', default='converted_document.json', help='Output JSON file name (default: converted_document.json)')
    parser.add_argument('--filter', default=None, help='Path to the filter config file (default: scripts/parse_filter.json)')
    args = parser.parse_args()

    docx_path = args.docx_path
    output_dir = args.output if args.output else default_output
    os.makedirs(output_dir, exist_ok=True)

    print(f"Parsing document: {docx_path}")
    result = parse_supplier_docx(docx_path, output_dir, args.filter)

    json_path = os.path.join(output_dir, args.name)
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    print(f"JSON saved: {json_path}")
    print(f"Image directory: {os.path.join(output_dir, 'images')}")
    print("\nNote: the AI conversion step was skipped; the JSON can be rendered directly")

    print("\n=== Parse summary ===")
    print(f"Document title: {result['doc_title']}")
    print(f"Document date: {result['doc_date']}")
    print("\nFixed sections:")
    for key, value in result['sections'].items():
        if isinstance(value, str):
            # Truncate long text; show a marker for empty sections
            preview = value[:50] + "..." if len(value) > 50 else (value or "(empty)")
            print(f"  {key}: {preview}")
        elif isinstance(value, list):
            print(f"  {key}: {len(value)} content block(s)")
    print(f"\nFunction points: {len(result['functions'])}")
    for func in result['functions']:
        print(f"  - {func['name']}")
        print(f"    Scenario description: {func['scene_desc'][:30]}...")
        print(f"    Sub-function points: {len(func['sub_functions'])}")
        for sub in func['sub_functions']:
            blocks_count = len(sub['content_blocks'])
            print(f"      - {sub['name']} ({blocks_count} content block(s))")


if __name__ == "__main__":
    main()
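
The post-processing pass that strips the 工作量 (workload) column can be tried in isolation. The following is a minimal standalone sketch, assuming the compact table layout used by the parser (`h` for header cells, `d` for data rows, `cols` for the column count). `drop_column` is a hypothetical helper name for illustration and is not part of the Skill's scripts.

```python
from typing import Any


def drop_column(block: dict[str, Any], keyword: str) -> bool:
    """Remove the first column whose header text contains `keyword`.

    Mutates `block` in place and returns True if a column was removed.
    """
    headers = block.get("h", [])
    idx = next(
        (i for i, h in enumerate(headers)
         if isinstance(h, dict) and keyword in h.get("v", "")),
        None,
    )
    if idx is None:
        return False
    # Read the column count before deleting the header cell
    original_cols = block.get("cols", len(headers))
    del headers[idx]
    for row in block.get("d", []):
        if idx < len(row):  # short rows may not reach this column
            del row[idx]
    block["cols"] = original_cols - 1
    return True


# Sample table in the compact format (illustrative data only)
table = {
    "t": "table",
    "cols": 3,
    "h": [{"v": "序号"}, {"v": "功能"}, {"v": "工作量(人天)"}],
    "d": [[{"v": "1"}, {"v": "登录"}, {"v": "2"}]],
}
removed = drop_column(table, "工作量")
```

Note that this sketch, like the parser, deletes by position, so a row whose cells were merged horizontally could lose the wrong cell; handling `gs` spans would need extra logic.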
scripts/supplier_schema.py
"""
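JSON storage template for supplier documents (schema definition)

Document structure:
- doc_title: document title
- doc_date: document date
- sections: fixed-section content
    - original_req: original requirement (plain text)
    - req_clarify: requirement clarification (plain text)
    - req_analysis: requirement analysis (plain text)
    - solution_overview: solution overview (plain text)
    - new_tables / modify_tables / data_compatibility: rich-text content blocks
- functions: list of function points (nested structure)
    - name: function point name
    - scene_desc: scenario description (plain text)
    - sub_functions: list of sub-function points
        - name: sub-function point name
        - content_blocks: list of content blocks (rich text, tables, images)
- zhongtai_interfaces: list of 中台三问 interfaces (interface_name, method, description)

Content block types:
- text: rich text (compact array format)
  {"t": "text", "runs": [["text"], ["styled", {"b": true}]], "list": "disc/420/h420", "indent": {"left": 420, "hanging": 420}}
- table: table (supports merged cells and rich text)
  {"t": "table", "cols": n, "h": [{"v": "text", "r": [[run]], "gs": n, "vm": "state"}], "d": [[...]]}
- image: image {"t": "img", "p": "path", "w": n, "h": n}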
"""
from typing import List, Optional, Literal
from pydantic import BaseModel, Field
import json


class TextStyle(BaseModel):
    b: Optional[bool] = Field(None, description="bold")
    i: Optional[bool] = Field(None, description="italic")
    u: Optional[bool] = Field(None, description="underline")
    c: Optional[str] = Field(None, description="color (HEX)")
    sz: Optional[int] = Field(None, description="font size (half-points)")


class TextBlock(BaseModel):
    t: Literal["text"] = "text"
    runs: List[List] = Field(default_factory=list, description="compact array format: [[text], [text, style]]")
    list: Optional[str] = Field(None, description="list info as a compact string: disc/420/h420 or 1./0/f420")
    indent: Optional[dict] = Field(None, description="indent info: {left: 420, hanging: 420} or {left: 0, firstLine: 420}")


class TableCell(BaseModel):
    v: str = Field(default="", description="plain text of the cell")
    r: List[List] = Field(default_factory=list, description="rich-text runs: [[text, style?]]")
    gs: Optional[int] = Field(None, description="number of horizontally merged columns (gridSpan)")
    vm: Optional[str] = Field(None, description="vertical merge state: restart/continue")


class TableBlock(BaseModel):
    t: Literal["table"] = "table"
    cols: int = Field(..., description="number of columns")
    h: List[dict] = Field(default_factory=list, description="header row (list of TableCell)")
    d: List[List[dict]] = Field(default_factory=list, description="data rows (2D array of TableCell)")
    col_widths: Optional[List[int]] = Field(None, description="column widths, unit: twentieths of a point")


class ImageBlock(BaseModel):
    t: Literal["img"] = "img"
    p: str = Field(..., description="image path")
    w: int = Field(..., description="width")
    h: int = Field(..., description="height")


class SubFunction(BaseModel):
    name: str = Field(..., description="sub-function point name")
    content_blocks: List[dict] = Field(default_factory=list, description="list of content blocks")


class Function(BaseModel):
    name: str = Field(..., description="function point name")
    scene_desc: str = Field(default="", description="scenario description")
    sub_functions: List[SubFunction] = Field(default_factory=list, description="list of sub-function points")


class Sections(BaseModel):
    original_req: str = Field(default="", description="original requirement")
    req_clarify: str = Field(default="", description="requirement clarification")
    req_analysis: str = Field(default="", description="requirement analysis")
    solution_overview: str = Field(default="", description="solution overview")
    new_tables: list = Field(default_factory=list, description="new tables (rich-text content blocks)")
    modify_tables: list = Field(default_factory=list, description="modified tables (rich-text content blocks)")
    data_compatibility: list = Field(default_factory=list, description="data compatibility requirements (rich-text content blocks)")


class ZhongtaiInterface(BaseModel):
    interface_name: str = Field(..., description="interface name")
    method: str = Field(default="", description="method name")
    description: str = Field(default="", description="interface description")


class SupplierDocument(BaseModel):
    doc_title: str = Field(..., description="document title")
    doc_date: Optional[str] = Field(None, description="document date")
    sections: Sections = Field(default_factory=Sections, description="fixed sections")
    functions: List[Function] = Field(default_factory=list, description="list of function points")
    zhongtai_interfaces: List[ZhongtaiInterface] = Field(default_factory=list, description="list of 中台三问 interfaces")
    images_dir: str = Field(default="images", description="image storage directory")


def create_example_json() -> dict:
    """Build an example JSON document (compact array format)."""
    return {
        # Title and date are kept in the form the parser looks for
        "doc_title": "需求XXXXX_需求分析说明书",
        "doc_date": "2024年6月30日",
        "sections": {
            "original_req": "Original requirement content to be extracted...",
            "req_clarify": "Requirement clarification content to be extracted...",
            "req_analysis": "Requirement analysis content to be extracted...",
            "solution_overview": "Solution overview..."
        },
        "functions": [
            {
                "name": "Function point 1: an example function point name",
                "scene_desc": "Scenario description of function point 1...",
                "sub_functions": [
                    {
                        "name": "An example sub-function point",
                        "content_blocks": [
                            {
                                "t": "text",
                                "runs": [
                                    ["Rich text, "],
                                    ["blue text", {"c": "0000FF"}],
                                    [", "],
                                    ["red text", {"c": "FF0000"}]
                                ]
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["small text", {"c": "FF0000", "sz": 21}]
                                ]
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["A bulleted list item"]
                                ],
                                "list": "disc/420/h420"
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["A numbered list item"]
                                ],
                                "list": "1./0/f420"
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["An indented paragraph"]
                                ],
                                "indent": {"left": 420, "firstLine": 420}
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["A table follows:"]
                                ]
                            },
                            {
                                "t": "table",
                                "cols": 4,
                                "h": [
                                    {"v": "No.", "r": [["No."]]},
                                    {"v": "Field name", "r": [["Field name"]]},
                                    {"v": "Field description", "r": [["Field description"]]},
                                    {"v": "Remarks", "r": [["Remarks"]]}
                                ],
                                "d": [
                                    [{"v": "1", "r": [["1"]]}, {"v": "City", "r": [["City"]]}, {"v": "", "r": []}, {"v": "", "r": []}],
                                    [{"v": "2", "r": [["2"]]}, {"v": "District", "r": [["District"]]}, {"v": "", "r": []}, {"v": "", "r": []}]
                                ]
                            },
                            {
                                "t": "text",
                                "runs": [
                                    ["An image follows:"]
                                ]
                            },
                            {
                                "t": "img",
                                "p": "images/img_001.png",
                                "w": 200,
                                "h": 150
                            }
                        ]
                    }
                ]
            }
        ],
        "images_dir": "images"
    }


if __name__ == "__main__":
    print("=== Compact array format ===")
    example = create_example_json()
    print(json.dumps(example, ensure_ascii=False, indent=2))
    print("\n=== Format notes ===")
    print("runs array format: [[text], [text, style object]]")
    print("Example: [['plain text'], ['bold blue', {'b': true, 'c': '0000FF'}]]")
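
To make the compact formats easier to read, here is a small standard-library sketch that flattens a `runs` array to plain text and splits a compact `list` string. The layout assumed for the list string (marker / left indent / `h` or `f` plus a value) is inferred from the `indent` field description and is not stated explicitly in the schema. Both function names are hypothetical.

```python
from typing import Any


def runs_to_text(runs: list[list[Any]]) -> str:
    """Join the text part of each run, ignoring the optional style object."""
    return "".join(run[0] for run in runs if run)


def parse_list_spec(spec: str) -> dict[str, Any]:
    """Split a compact list string such as 'disc/420/h420'.

    Assumed layout: marker / left indent / h<hanging> or f<firstLine>.
    """
    marker, left, third = spec.split("/")
    key = {"h": "hanging", "f": "firstLine"}[third[0]]
    return {"marker": marker, "left": int(left), key: int(third[1:])}


plain = runs_to_text([["Rich text, "], ["blue text", {"c": "0000FF"}]])
bullet = parse_list_spec("disc/420/h420")
numbered = parse_list_spec("1./0/f420")
```

A string with any other shape raises `ValueError` or `KeyError` here, which seems preferable to guessing at an unknown layout.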
Rebuild steps
1. Install dependencies
    pip install -r scripts/requirements.txt
2. Create the template
    py -3.11 scripts/create_custom_template.py -o templates/custom_template.docx
3. Run
    # Parse the supplier document
    py -3.11 scripts/supplier_parser.py <supplier_document.docx>
    # Render the client document
    py -3.11 scripts/renderer.py
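Notes
- Python version >= 3.11
- supplier_schema.py is a reference file and is not needed at runtime (it requires pydantic)
- The output directory parsed_output/ is created automatically at runtime
- If the template custom_template.docx does not exist, run create_custom_template.py first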
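
The prerequisites in these notes can be checked before running the parser. The following is a minimal sketch, assuming the directory layout described in this document (`templates/custom_template.docx` under the Skill directory). `preflight` is a hypothetical helper and is not one of the Skill's scripts.

```python
import os
import sys


def preflight(skill_dir: str, version: tuple[int, int] = sys.version_info[:2]) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The Skill requires Python 3.11 or later
    if version < (3, 11):
        problems.append(f"Python {version[0]}.{version[1]} found, 3.11+ required")
    # The renderer expects the template inside the Skill directory
    template = os.path.join(skill_dir, "templates", "custom_template.docx")
    if not os.path.isfile(template):
        problems.append("template missing: run scripts/create_custom_template.py first")
    return problems
```

Calling `preflight(".")` from the Skill directory would list anything that still needs fixing. `parsed_output/` is not checked, since the parser creates it on demand.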