Overview of OpenXML and Word

1. Word

1.1 Evolution of the Word format

Version               Underlying standard       File format
Word 97-2003          OLE compound document     .doc
Word 2007 and later   Office Open XML           .docx

OLE compound document technology:

[Figure: internal structure of a Word document]

Office Open XML:

[Figure: internal structure of a Word document]

1.2 Structure of a .docx document

A Word document (.docx) is essentially a ZIP archive that contains multiple XML files and resources. Its main components are:

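|- [Content_Types].xml lets a consumer determine the content type of every part in the package
|- _rels relationship parts; define the relationships between the parts in the ZIP package
|- docProps document property information
    |- app.xml application-specific document properties
    |- core.xml core properties
|- word the main content of the document lives in this directory
    |- _rels
        |- document.xml.rels
    |- theme
    |- document.xml content and properties of all text in the document, both visible and hidden
    |- fontTable.xml
    |- settings.xml document settings
    |- styles.xml
    |- webSettings.xml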

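Internal structure of a Word (.docx) document:

[Figure: internal structure of a Word document]

Tip: a Word document is stored as a package, which is an ordinary ZIP archive. To see this, create a new Word document, change its extension from .docx to .zip, and extract it to inspect the internal structure.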
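The same inspection can be done in code without renaming the file. The following is a minimal sketch that uses only System.IO.Compression from the .NET base library; the class name DocxInspector and the method ListParts are illustrative names, not part of any SDK.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class DocxInspector
{
    // Returns the full path of every entry stored in the package,
    // for example "[Content_Types].xml" or "word/document.xml".
    public static List<string> ListParts(string docPath)
    {
        var names = new List<string>();

        // A .docx file can be opened directly as a ZIP archive.
        using var file = File.OpenRead(docPath);
        using var zip = new ZipArchive(file, ZipArchiveMode.Read);

        foreach (ZipArchiveEntry entry in zip.Entries)
        {
            names.Add(entry.FullName);
        }

        return names;
    }
}
```

Calling DocxInspector.ListParts(@"C:\Desktop\Test.docx") and printing each name should list entries matching the tree shown earlier; the exact set depends on which features the document uses.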

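2. OpenXML

2.1 Overview

2.1.1 Definition of Open XML

Office Open XML (also written Open XML, OpenXML, or OOXML)

A standard designed to cover the features represented in the existing corpus of documents; an open standard for word-processing documents, presentations, and spreadsheets that multiple applications can implement freely on multiple platforms.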

2.1.2 OpenXML SDK

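SDK stands for Software Development Kit.

The Open XML SDK is built on the System.IO.Packaging API and provides strongly typed classes for working with documents that conform to the Open XML file format specification.

Libraries similar to the OpenXML SDK include NPOI, Xceed Words for .NET, and DocX.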

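2.2 Evolution of the standard

Open XML (OOXML for short) is an open standard for word-processing documents, presentations, and spreadsheets that can be freely implemented by multiple applications on different platforms.

No.   Standard         Microsoft implementation notes
1     ECMA-376         MS-OE376
2     ISO/IEC 29500    MS-OI29500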

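2.3 Document types

Open XML defines formats for word-processing, presentation, and spreadsheet documents.

  1. Word-processing documents (Word): described with WordprocessingML markup. A WordprocessingML document is composed of a collection of stories, where each story is one of the following types:

    • Main document
    • Glossary document
    • Subdocument
    • Header
    • Footer
    • Comment
    • Frame
    • Text box
    • Footnote or endnote
  2. Presentations: described with PresentationML markup.

  3. Spreadsheet workbooks: described with SpreadsheetML markup.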

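2.4 Parts of a minimal WordprocessingML document

A minimal document contains three parts: the content types part, the package relationships part, and the main document part.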

  1. The content types part "./[Content_Types].xml" describes the content types of the other two required parts
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>
  2. The package relationships part "./_rels/.rels" describes the relationship between the package and the main document part
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>
  3. The main document part "./word/document.xml" holds the document content
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
        <w:p>
            <w:r>
                <w:t>Hello, world.</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>
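To make the three-part structure concrete, the sketch below writes those parts into a ZIP archive by hand, using only System.IO.Compression and no Open XML SDK. It is an illustration under stated assumptions rather than a recommended way to produce documents: the class name MinimalDocx and its methods are hypothetical, and the relationship Target is written relative to the package root as word/document.xml.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class MinimalDocx
{
    private const string ContentTypes =
        @"<?xml version=""1.0"" encoding=""utf-8""?>
<Types xmlns=""http://schemas.openxmlformats.org/package/2006/content-types"">
  <Default Extension=""rels"" ContentType=""application/vnd.openxmlformats-package.relationships+xml""/>
  <Default Extension=""xml"" ContentType=""application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml""/>
</Types>";

    private const string PackageRels =
        @"<?xml version=""1.0"" encoding=""utf-8""?>
<Relationships xmlns=""http://schemas.openxmlformats.org/package/2006/relationships"">
  <Relationship Id=""rId1"" Type=""http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"" Target=""word/document.xml""/>
</Relationships>";

    private const string MainDocument =
        @"<?xml version=""1.0"" encoding=""utf-8""?>
<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
  <w:body><w:p><w:r><w:t>Hello, world.</w:t></w:r></w:p></w:body>
</w:document>";

    // Writes the three required parts into a new ZIP archive at docPath.
    public static void Write(string docPath)
    {
        using var file = new FileStream(docPath, FileMode.Create);
        using var zip = new ZipArchive(file, ZipArchiveMode.Create);

        AddPart(zip, "[Content_Types].xml", ContentTypes);
        AddPart(zip, "_rels/.rels", PackageRels);
        AddPart(zip, "word/document.xml", MainDocument);
    }

    // Adds one part as UTF-8 text without a byte order mark.
    // The writer is disposed before the next entry is created,
    // because a ZipArchive in Create mode allows one open entry at a time.
    private static void AddPart(ZipArchive zip, string path, string xml)
    {
        ZipArchiveEntry entry = zip.CreateEntry(path);
        using var writer = new StreamWriter(entry.Open(), new UTF8Encoding(false));
        writer.Write(xml);
    }
}
```

Calling MinimalDocx.Write(@"C:\Desktop\Minimal.docx") should produce a package with exactly these three entries. In practice the SDK calls shown in section 3 are the safer route, since they manage content types and relationships for you.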

3. Using the DocumentFormat.OpenXml package to automate Word documents

3.1 Installation

Install-Package DocumentFormat.OpenXml

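3.2 Basic operations

Create a WordprocessingDocument object (the WordprocessingDocument class represents a Word document package):

// Requires: using DocumentFormat.OpenXml; using DocumentFormat.OpenXml.Packaging;
// docPath = @"C:\Desktop\Test.docx";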

using (var wordDoc = WordprocessingDocument.Create(docPath, WordprocessingDocumentType.Document))
{
    // Insert other code here. 
}

Open an existing WordprocessingDocument object (true: read/write mode; false: read-only mode):

// docPath = @"C:\Desktop\Test.docx";

using (var wordDoc = WordprocessingDocument.Open(docPath, true))
{
    // Insert other code here.
}

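3.3 Creating a docx document from XML content

// Requires: using System.Text; using DocumentFormat.OpenXml; using DocumentFormat.OpenXml.Packaging;
// docPath = @"C:\Desktop\Test.docx";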

// To create a new package as a Word document.
public static void CreateNewWordDocument(string docPath)
{
    using (var wordDoc = WordprocessingDocument.Create(docPath, WordprocessingDocumentType.Document))
    {
        // Set the content of the document so that Word can open it.
        var mainPart = wordDoc.AddMainDocumentPart();

        string docXml = @"<?xml version=""1.0"" encoding=""utf-8""?>
                            <w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
                              <w:body>
                                <w:p>
                                  <w:r>
                                    <w:t>Hello World</w:t>
                                  </w:r>
                                </w:p>
                              </w:body>
                            </w:document>";

        using (var stream = mainPart.GetStream())
        {
            byte[] buf = (new UTF8Encoding()).GetBytes(docXml);
            stream.Write(buf, 0, buf.Length);
        }
    }
}

3.4 Getting the comments from a docx document

This essentially reads the contents of the word/comments.xml file:

[Figure: internal structure of a Word document]
// docPath = @"C:\Desktop\Test.docx";

// To get the contents of a document part.
public static string? GetCommentsFromDoc(string docPath)
{
    string? comments = null;

    using (var wordDoc = WordprocessingDocument.Open(docPath, false))
    {
        var mainPart = wordDoc.MainDocumentPart ?? throw new InvalidOperationException("The document has no main document part.");
        var commentsPart = mainPart.WordprocessingCommentsPart;

        // If the comments part does not exist, return null.
        if (commentsPart is null) return null;

        using (var streamReader = new StreamReader(commentsPart.GetStream()))
        {
            comments = streamReader.ReadToEnd();
        }
    }

    return comments;
}
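GetCommentsFromDoc returns the raw XML of the comments part. If only the comment text is needed, that XML can be parsed with System.Xml.Linq. The sketch below assumes the usual layout of comments.xml, in which each w:comment element wraps paragraphs whose runs carry the text in w:t elements; the class name CommentText and the method Extract are illustrative, not SDK members.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CommentText
{
    // WordprocessingML main namespace, the same one used in document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the plain text of each comment, in document order.
    public static List<string> Extract(string commentsXml)
    {
        XDocument doc = XDocument.Parse(commentsXml);

        return doc.Descendants(W + "comment")
                  // A comment's text may be split across several runs, so join every w:t.
                  .Select(c => string.Concat(c.Descendants(W + "t").Select(t => t.Value)))
                  .ToList();
    }
}
```

A caller would check the result of GetCommentsFromDoc for null first and then pass the string to CommentText.Extract. Paragraph breaks inside a comment are not preserved by this sketch.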

3.5 Removing a part from a docx document

// docPath = @"C:\Desktop\Test.docx";

// To remove a document part from a package.
public static void RemovePart(string docPath)
{
    using (var wordDoc = WordprocessingDocument.Open(docPath, true))
    {
        var mainPart = wordDoc.MainDocumentPart;

        if (mainPart is not null && mainPart.DocumentSettingsPart is not null)
        {
            mainPart.DeletePart(mainPart.DocumentSettingsPart);
        }
    }
}

3.6 Replacing the theme of a docx document

// docPath = @"C:\Desktop\Test.docx";
// themeFilePath = @"C:\Desktop\ThemeReplace.xml"

// This method can be used to replace the theme part in a package.
public static void ReplaceTheme(string docPath, string themeFilePath)
{
    using (var wordDoc = WordprocessingDocument.Open(docPath, true))
    {
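        // These are states of the document rather than arguments, so InvalidOperationException fits better than ArgumentNullException.
        if (wordDoc.MainDocumentPart is null || wordDoc.MainDocumentPart.Document.Body is null || wordDoc.MainDocumentPart.ThemePart is null)
        {
            throw new InvalidOperationException("MainDocumentPart and/or Body and/or ThemePart is null.");
        }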

        var mainPart = wordDoc.MainDocumentPart;

        // Delete the old document part.
        mainPart.DeletePart(mainPart.ThemePart);

        // Add a new document part and then add content.
        ThemePart themePart = mainPart.AddNewPart<ThemePart>();

        using (var streamReader = new StreamReader(themeFilePath))
        using (var streamWriter = new StreamWriter(themePart.GetStream(FileMode.Create)))
        {
            streamWriter.Write(streamReader.ReadToEnd());
        }
    }
}

3.7 Replacing content in a part of a docx document

// docPath = @"C:\Desktop\Test.docx";

// To search and replace content in a document part.
static void SearchAndReplace(string docPath)
{
    using (var wordDoc = WordprocessingDocument.Open(docPath, true))
    {
        string? docText = null;

        if (wordDoc.MainDocumentPart is null)
        {
            throw new InvalidOperationException("MainDocumentPart is null.");
        }

        using (var sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
        {
            docText = sr.ReadToEnd();
        }

        Regex regexText = new Regex("Hello world!");
        docText = regexText.Replace(docText, "Hi Everyone!");

        using (var sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
        {
            sw.Write(docText);
        }
    }
}

3.8 Adding text to a docx document

public static void OpenAndAddTextToWordDocument(string docPath, string txt)
{
    // Open the document for editing; the using statement disposes it even if an exception is thrown.
    using (var wordprocessingDocument = WordprocessingDocument.Open(docPath, true))
    {
        // Make sure the main part, the document, and its body exist, then take a reference to the body.
        var mainDocumentPart = wordprocessingDocument.MainDocumentPart ?? wordprocessingDocument.AddMainDocumentPart();
        mainDocumentPart.Document ??= new Document();
        mainDocumentPart.Document.Body ??= mainDocumentPart.Document.AppendChild(new Body());
        Body body = mainDocumentPart.Document.Body!;

        // Add the new text as a paragraph containing a single run.
        Paragraph para = body.AppendChild(new Paragraph());
        Run run = para.AppendChild(new Run());
        run.AppendChild(new Text(txt));
    }
}

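3.9 Adding a table to a docx document

// Requires: using System; using System.Linq; plus the DocumentFormat.OpenXml namespaces.
/// <summary>
/// Appends a bordered table to the document body. Example input:
/// string[,] data = new string[,]
/// {
///     {"Name", "Description"}
/// };
/// </summary>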
public static void AddFixedTable(string docPath, string[,] data)
{
    using (var document = WordprocessingDocument.Open(docPath, true))
    {
        var doc = document.MainDocumentPart.Document;

        Table table = doc.Body.AppendChild(new Table());

        TableProperties props = table.PrependChild(new TableProperties());

        TableBorders tableBorders = props.AppendChild(new TableBorders(
            new TopBorder() { Val = new EnumValue<BorderValues>(BorderValues.Single), Size = 12 },
            new BottomBorder() { Val = new EnumValue<BorderValues>(BorderValues.Single), Size = 12 },
            new LeftBorder() { Val = new EnumValue<BorderValues>(BorderValues.Single), Size = 12 },
            new RightBorder() { Val = new EnumValue<BorderValues>(BorderValues.Single), Size = 12 },
            new InsideHorizontalBorder() { Val = new EnumValue<BorderValues>(BorderValues.Single), Size = 6 },
            new InsideVerticalBorder() { Val = new EnumValue<BorderValues>(BorderValues.Single), Size = 6 }
        ));

        for (var i = 0; i < data.GetLength(0); i++)
        {
            var tr = table.AppendChild(new TableRow());

            for (var j = 0; j < data.GetLength(1); j++)
            {
                var tc = tr.AppendChild(new TableCell());
                tc.Append(new Paragraph(new Run(new Text(data[i, j].ToString()))));

                Run run = tc.Elements<Paragraph>().First().Elements<Run>().First();
                RunProperties runProperties = run.PrependChild(new RunProperties());
                runProperties.AppendChild(new FontSize()
                {
                    Val = "20"
                });
                runProperties.AppendChild(new RunFonts()
                {
                    EastAsia = "宋体"
                });

                // Bold the title
                if (i == 0)
                {
                    Bold bold = runProperties.AppendChild(new Bold());
                }

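                // 1 cm ≈ 567 dxa. The width must be a whole number of dxa, so round before formatting.
                int widthInDxa = (int)Math.Round(7.23 * 567);
                // Cell properties must come before the cell's paragraphs, so prepend them.
                tc.PrependChild(new TableCellProperties(
                    new TableCellWidth { Type = TableWidthUnitValues.Dxa, Width = widthInDxa.ToString() }));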
            }
        }
    }
}
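The factor of 567 in AddFixedTable comes from the definition of the unit: a dxa is one twentieth of a point, there are 72 points to an inch, and an inch is 2.54 cm, so 1 cm = 72 × 20 / 2.54 ≈ 566.93 dxa. The helper below is a small sketch of that arithmetic; the class name Units and the method CmToDxa are illustrative and not part of the SDK.

```csharp
using System;

public static class Units
{
    // 72 points per inch, 20 dxa per point.
    private const double DxaPerInch = 72.0 * 20.0;
    private const double CmPerInch = 2.54;

    // Converts centimeters to dxa, rounded to the nearest whole unit.
    public static int CmToDxa(double cm)
    {
        return (int)Math.Round(cm * DxaPerInch / CmPerInch);
    }
}
```

With this helper the cell width could be written as Units.CmToDxa(7.23).ToString(), which avoids both the hard-coded factor and a fractional width string.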

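4. NuGet packages

DocumentFormat.OpenXml: contains all the strongly typed classes for parts and elements.

DocumentFormat.OpenXml.Framework: contains the underlying framework that the SDK is built on. This package is new as of v3.0 and contains many types that were previously included in DocumentFormat.OpenXml.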

posted on 2024-12-17 14:53  wubing7755