RTF DOM Parser

Download source: /Files/xdesigner/RtfDomParser_1.0_source.zip

Introduction

RTF DOM Parser (RDP for short) is an open source C# library that parses an RTF document and generates an RTF DOM tree. With the DOM tree, .NET programmers can read RTF documents very easily. It is released under the GNU2 license.

Background

RTF is a widely used document format. Many applications support it for exchanging text data, and C# developers sometimes need to read and write RTF documents.

Although RTF is not a very complex format, it is not easy to parse, so I created RDP. It parses an RTF document and generates a DOM tree. The DOM tree is easy to use, so .NET application developers do not need much knowledge of RTF syntax.

RTF format

RTF is not a complex format. An RTF file's content is plain text in which formatting is expressed with keywords. For example, open Windows WordPad, type the characters "abcdefg" without specifying any format, and save the result as an RTF file. If you then open that file with Windows Notepad, you will see content like the following:

{\rtf1\ansi\ansicpg936\deff0\deflang1033\deflangfe2052{\fonttbl{\f0\fmodern\fprq6\fcharset134 \'cb\'ce\'cc\'e5;}}{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\lang2052\f0\fs20 abcdefg\par}

This is a very simple RTF document. To make it easier to analyze, I indent the code as follows.

{\rtf1 \ansi \ansicpg936 \deff0 \deflang1033 \deflangfe2052

       {\fonttbl

              {\f0\fmodern\fprq6\fcharset134 \'cb\'ce\'cc\'e5;}

       }

       {\*\generator Msftedit 5.41.15.1515;}

       \viewkind4\uc1\pard\lang2052\f0\fs20 abcdefg\par

}

RTF content is organized into groups, and groups can be nested. A group starts with "{" and ends with "}". An RTF keyword starts with "\" followed by the keyword name, and some keywords take an integer parameter.
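The group rule above can be sketched in a few lines of C#. This is only an illustration, not RDP's actual code: the class name RtfGroupSketch and the method MaxGroupDepth are hypothetical, and the sketch ignores binary data embedded with \bin. It assumes that a backslash followed by "{", "}" or "\" is an escaped literal character rather than a group delimiter.

```csharp
using System;

// Hypothetical helper for illustration only; it is not part of RDP.
public static class RtfGroupSketch
{
    // Returns the deepest group nesting level found in rtf.
    // Throws FormatException if the braces are not balanced.
    public static int MaxGroupDepth(string rtf)
    {
        int depth = 0;
        int max = 0;
        for (int i = 0; i < rtf.Length; i++)
        {
            char c = rtf[i];
            if (c == '\\')
            {
                // Skip the character after a backslash, so that the
                // escaped forms "\{", "\}" and "\\" are not counted.
                i++;
            }
            else if (c == '{')
            {
                depth++;
                if (depth > max) max = depth;
            }
            else if (c == '}')
            {
                depth--;
                if (depth < 0) throw new FormatException("Unmatched '}'");
            }
        }
        if (depth != 0) throw new FormatException("Unmatched '{'");
        return max;
    }
}
```

For the WordPad sample above, the font entry sits inside \fonttbl, which sits inside the document group, so the deepest nesting level is 3.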

For example, "\ansicpg936" is an RTF keyword: it starts with "\", its name is "ansicpg", and its parameter value is "936". The keyword "\ansi" has the name "ansi" and no parameter.
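To make the keyword rule concrete, here is a minimal C# sketch that splits one keyword into its name and optional integer parameter. It is not how RDP does it internally: RtfKeywordSketch and TryParse are hypothetical names, and the sketch assumes the name is made of ASCII letters and the parameter is an optional "-" followed by decimal digits that fit in an int.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; it is not part of RDP.
public static class RtfKeywordSketch
{
    // Tries to parse one keyword such as "\ansicpg936" starting at index start.
    // Returns false if the text at that position is not a keyword.
    public static bool TryParse(string rtf, int start,
        out string name, out int? parameter)
    {
        name = "";
        parameter = null;
        if (start < 0 || start >= rtf.Length || rtf[start] != '\\') return false;

        // The keyword name is the run of letters after the backslash.
        int i = start + 1;
        int nameStart = i;
        while (i < rtf.Length && IsAsciiLetter(rtf[i])) i++;
        if (i == nameStart) return false; // e.g. "\'" or "\{" is not a keyword
        name = rtf.Substring(nameStart, i - nameStart);

        // The optional parameter is an optional minus sign plus digits.
        int numberStart = i;
        if (i < rtf.Length && rtf[i] == '-') i++;
        int digitStart = i;
        while (i < rtf.Length && rtf[i] >= '0' && rtf[i] <= '9') i++;
        if (i > digitStart)
        {
            parameter = int.Parse(
                rtf.Substring(numberStart, i - numberStart),
                CultureInfo.InvariantCulture);
        }
        return true;
    }

    private static bool IsAsciiLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

Calling TryParse on "\ansicpg936" gives the name "ansicpg" and the parameter 936, while "\ansi" gives the name "ansi" and a null parameter.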

Note that white space, including spaces, tabs, and line breaks, may affect how an RTF document is rendered, so do not indent real RTF code.

RDP

Some .NET programmers have to parse or write RTF documents, so I provide RDP. It follows the RTF standard V1.7, parses an RTF document, and generates a DOM tree, so .NET programmers can use this RTF DOM tree to read the document's content. RTF code is not easy to read, but a DOM tree is very easy to use. I hope RDP can save .NET programmers some time.

RDP has three parts: RTF DOM, RTF Reader, and RTF Writer. RTF DOM is the main part. The following figure describes RDP's structure.

In RTF DOM, RTFDomElement is the base element type. The other document element types, such as bookmark, document, and image, derive from it.

RTFDomDocument is the root element for accessing the RTF DOM. It derives from RTFDomElement and represents the whole RTF document.

The RTF standard has no table or table column types, only table rows and cells, and even a table row is just a special kind of paragraph. In RTF DOM, however, I defined table, table row, table column, and table cell elements to describe the full table structure, so programmers can read table information easily. Recovering the table structure from RTF code is hard, and it took a lot of C# code to implement.

The RTF Reader part is a base module that reads raw RTF source code. It works in read-only mode: it reads RTF code and generates RTF nodes, which reduces RTF DOM's workload. Because RTFReader reads the stream in this read-only way, RDP can quickly parse large RTF files, even ones larger than 100 MB.

RTF Writer is used to generate RTF source code. The current version of RDP is 1.0, the first version, so RTF Writer is not yet powerful. This part contains only the RTFWriter type, which .NET programmers can use to generate RTF source without obvious syntax errors.

 

RDP is easy to use. You can parse an RTF file with the following C# code.

 

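// Create an empty DOM document.
XDesigner.RTF.RTFDomDocument doc = new XDesigner.RTF.RTFDomDocument();

// Parse the RTF file at the path held in rtfFileName into the DOM.
doc.Load(rtfFileName);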

 

After these two lines of C# code, you can get data from the RTF document through the doc variable.

For example, you can use "doc.Info.Title" to get the document's title, and you can enumerate the elements of "doc.Elements" to find the parts you need.
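Putting these pieces together, a small usage sketch might look like the following. It uses only the members mentioned in this article (Load, Info.Title, and Elements). It assumes that doc.Elements can be enumerated with foreach, and it must be compiled with a reference to the RDP library. The class RdpUsageSketch and the method PrintSummary are hypothetical names introduced for this example.

```csharp
using System;

// Hypothetical wrapper for illustration; requires a reference to the RDP library.
public static class RdpUsageSketch
{
    public static void PrintSummary(string rtfFileName)
    {
        // Parse the file into a DOM tree, as shown above.
        XDesigner.RTF.RTFDomDocument doc = new XDesigner.RTF.RTFDomDocument();
        doc.Load(rtfFileName);

        // Read the document title from the document information.
        Console.WriteLine("Title: " + doc.Info.Title);

        // Assumption: doc.Elements supports foreach enumeration.
        // Print the .NET type name of each top-level element.
        foreach (object element in doc.Elements)
        {
            Console.WriteLine(element.GetType().Name);
        }
    }
}
```

Printing the type names is a quick way to see which kinds of elements a document contains before writing code that looks for a specific one.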

posted on 2010-10-13 20:04 by 袁永福 电子病历,医疗信息化
