Word2CHM released

Introduction

Word2CHM is an open source C# program that converts an MS Word document (in 2000/2003 format) to a CHM document. To learn more, visit http://www.sinoreport.net/Word2CHM_Details.aspx .

This is a screen snapshot.

Background

Many people write customer help documents with MS Word, because MS Word is well suited to writing documents that include text, images and tables.

But many customers do not want to read help documents in MS Word format; they prefer the CHM format. So it is useful to convert MS Word documents to CHM documents. This is why I built Word2CHM.

Word2CHM

Word2CHM converts an MS Word document to a CHM document in three steps. The first converts the MS Word document to a single HTML file, the second splits that single HTML file into multiple HTML files, and the third compiles the multiple HTML files into a single CHM file.

First, convert the MS Word document to a single HTML file

The MS Word application supports OLE Automation, so a C# program can host an MS Word application, open a binary MS Word document and save it as an HTML file.

The following sample C# code hosts an MS Word application.
private bool SaveWordToHtml(string docFileName, string htmlFileName)
{
    // check doc file name
    if (System.IO.File.Exists(docFileName) == false )
    {
        this.Alert("File '" + docFileName + "' does not exist!");
        return false;
    }
    // check output directory
    string dir = System.IO.Path.GetDirectoryName(htmlFileName);
    if (System.IO.Directory.Exists(dir) == false )
    {
        this.Alert("Directory '" + dir + "' does not exist!");
        return false;
    }
    object trueValue = true;
    object falseValue = false;
    object missValue = System.Reflection.Missing.Value;
    object fileNameValue = docFileName;
    // create word application instance
    Microsoft.Office.Interop.Word.Application app =
        new Microsoft.Office.Interop.Word.ApplicationClass();
    // make the Word application visible, so that if something goes wrong
    // and this program quits, the user can close Word manually
    app.Visible = true;
    // open document
    Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(
        ref fileNameValue,
        ref missValue,
        ref trueValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue);
    // save a html file
    object htmlFileNameValue = htmlFileName;
    object format = Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatFilteredHTML;
    doc.SaveAs(
        ref htmlFileNameValue ,
        ref format,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue);
    // close document and release resource
    doc.Close(ref falseValue, ref missValue, ref missValue);
    app.Quit(ref falseValue, ref missValue, ref missValue);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(doc);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(app);
    return true;
}

In this C# source code, it is important to call the ReleaseComObject function. By calling ReleaseComObject, the program releases the COM objects it obtained from the Word application instead of waiting for garbage collection.

In many programs that host an MS Word application (or an Excel application), the program calls the Quit function of the application when it no longer needs it. But sometimes the Word process stays alive, which leads to a very serious resource leak. Calling ReleaseComObject reduces this risk.
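
The sample above releases the COM objects only when every call succeeds. As a minimal sketch, assuming a hypothetical helper class named ComCleanup (it is not part of Word2CHM), the release can be placed in a finally block so that it also runs when the work throws an exception. The helper only uses Marshal.IsComObject and Marshal.ReleaseComObject, and it ignores null and non-COM objects.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not taken from the Word2CHM source code.
public static class ComCleanup
{
    // Releases one COM reference; silently ignores null and non-COM objects.
    public static void Release(object candidate)
    {
        if (candidate != null && Marshal.IsComObject(candidate))
        {
            Marshal.ReleaseComObject(candidate);
        }
    }

    // Runs 'work' and releases 'resource' afterwards, even if 'work' throws.
    public static void UseAndRelease<T>(T resource, Action<T> work) where T : class
    {
        try
        {
            work(resource);
        }
        finally
        {
            Release(resource);
        }
    }
}
```

In SaveWordToHtml, the same idea would mean moving the Close, Quit and ReleaseComObject calls into a finally block that follows the Open and SaveAs calls.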

Second, split the single HTML file into multiple HTML files

The HTML file generated by the Word application includes all the content of the Word document. For example, a Word document contains the following content.

 

I save this document as a filtered HTML file. The source code of the HTML file is as follows.

<html>
       <head>
              <meta http-equiv=Content-Type content="text/html; charset=gb2312">
              <meta name=Generator content="Microsoft Word 11 (filtered)">
              <title>Header1</title>
              <style>
               some style code
              </style>
       </head>
       <body lang=ZH-CN style='text-justify-trim:punctuation'>
              <div class=Section1 style='layout-grid'>
                     <h1><span lang=EN-US>Header1</span></h1>
                     <p class=MsoNormal><span lang=EN-US>Content1</span></p>
                     <h2><span lang=EN-US>Header2</span></h2>
                     <p class=MsoNormal><span lang=EN-US>Content2</span></p>
              </div>
       </body>
</html>

In this HTML source code, a single div tag encloses all the content. Word2CHM needs to split this HTML file into two files.

File0.html

<html>
       <head>
              <meta http-equiv=Content-Type content="text/html; charset=gb2312">
              <meta name=Generator content="Microsoft Word 11 (filtered)">
              <title>Header1</title>
       <style>
        --------------
       </style>
       </head>
       <body>
              <h1>Header</h1><hr />
              <p class=MsoNormal><span lang=EN-US>Content1</span></p>
              <hr /><h1>Footer</h1>
       </body>
</html>

File1.html

<html>
       <head>
              <meta http-equiv=Content-Type content="text/html; charset=gb2312">
              <meta name=Generator content="Microsoft Word 11 (filtered)">
              <title>Header2</title>
       <style>
        --------------
       </style>
       </head>
       <body>
              <h1>Header</h1><hr />
              <p class=MsoNormal><span lang=EN-US>Content2</span></p>
              <hr /><h1>Footer</h1>
       </body>
</html>

Here, the program adds the HTML source “<h1>Header</h1><hr />” in front of the HTML content, and adds “<hr /><h1>Footer</h1>” after it. This additional HTML source serves as the header and footer.

In Word2CHM, I use the following C# code to split the HTML file.
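
As a minimal sketch of this wrapping step, the following hypothetical TopicPage class (it is not part of Word2CHM) assembles one topic page from the shared head, a header template, the topic content and a footer. The "@Title" placeholder is the one that Word2CHM replaces in its HelpHeaderHtml property; the class name, method name and parameter names are assumptions made for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper, not taken from the Word2CHM source code.
public static class TopicPage
{
    // headHtml:       everything before <body> (the <html> and <head> part)
    // headerTemplate: header HTML that may contain the "@Title" placeholder
    // footerHtml:     footer HTML, written after the topic content
    public static string Build(
        string headHtml,
        string title,
        string contentHtml,
        string headerTemplate,
        string footerHtml)
    {
        var page = new StringBuilder();
        page.AppendLine(headHtml);
        page.AppendLine("<body>");
        if (headerTemplate != null)
        {
            // substitute the topic title into the header
            page.AppendLine(headerTemplate.Replace("@Title", title));
        }
        page.AppendLine(contentHtml);
        if (footerHtml != null)
        {
            page.AppendLine(footerHtml);
        }
        page.AppendLine("</body>");
        page.AppendLine("</html>");
        return page.ToString();
    }
}
```

Keeping the header and footer as templates means every generated page gets the same frame, and only the title and content differ from page to page.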
string strDir = System.IO.Path.GetDirectoryName(fileName);
string strHtml = null;
System.Text.Encoding encoding = System.Text.Encoding.Default ;
using (StreamReader reader = new StreamReader(fileName, encoding, true))
{
    //set content encoding
    encoding = reader.CurrentEncoding;
    //read HTML source code
    strHtml = reader.ReadToEnd();
}
// locate the <body> tag; everything before it is the document header
int bodyIndex = strHtml.IndexOf("<body");
string strHeader = strHtml.Substring(0, bodyIndex);
string strHeader1 = strHeader;
string strHeader2 = null;
int index = strHeader.IndexOf("<title>");
if (index > 0)
{
    strHeader1 = strHeader.Substring(0, index);
    int indexEndTitle = strHeader.IndexOf("</title>");
    strHeader2 = strHeader.Substring(indexEndTitle + 8);
    // read title (the text between <title> and </title>)
    this.strTitle = strHeader.Substring(index + 7, indexEndTitle - index - 7);
}
else
{
    strTitle = System.IO.Path.GetFileNameWithoutExtension(fileName);
}
// skip to the end of the <body ...> opening tag; searching from bodyIndex
// also works when the document has no <title> (index would be -1 there)
index = strHtml.IndexOf(">", bodyIndex);
string strBody = strHtml.Substring(index + 1);
index = strBody.LastIndexOf("</body>");
strBody = strBody.Substring(0, index);
index = strBody.IndexOf("<div");
if (index >= 0)
{
    index = strBody.IndexOf(">", index+1);
    strBody = strBody.Substring(index + 1 );
    index = strBody.LastIndexOf("</div>");
    strBody = strBody.Substring(0, index);
}
//Split html document by tag <h>
index = strBody.IndexOf("<h");
if (index >= 0)
{
    strBody = strBody.Substring(index);
}
else
{
    strBody = "";
}
strBody = strBody.Trim();
int lastLevel = 1;
int lastNativeLevel = 1;
while (strBody.Length > 0)
{
    int Nativelevel = Convert.ToInt32(strBody.Substring(2, 1));
    int level = Nativelevel;
    if (lastNativeLevel == Nativelevel)
    {
        level = lastLevel;
    }
    else
    {
        if (level > lastLevel + 1)
        {
            level = lastLevel + 1;
        }
    }
    lastNativeLevel = Nativelevel;
    lastLevel = level;
    int index2 = strBody.IndexOf(">");
    int index3 = strBody.IndexOf("</h" + Nativelevel + ">");
    //read the text between <hN> and </hN> as the topic title
    string strTitle = strBody.Substring(index2 + 1, index3 - index2 - 1);
    while (strTitle.IndexOf("<") >= 0)
    {
        int index4 = strTitle.IndexOf("<");
        int index5 = strTitle.IndexOf(">", index4);
        strTitle = strTitle.Remove(index4, index5 - index4 + 1);
    }
    strBody = strBody.Substring(index3 + 5);
    index = strBody.IndexOf("<h");
    if (index == -1)
    {
        index = strBody.Length;
    }
    //read topic content
    string strContent = strBody.Substring(0, index);
    // add node to chm document DOM tree
    CHMNode currentNode = null;
    if (this.Nodes.Count == 0 || level == 1)
    {
        //create node
        currentNode = new CHMNode();
        this.Nodes.Add(currentNode);
    }
    else
    {
        CHMNode parentNode = this.Nodes.LastNode;
        while (true)
        {
            if (parentNode.Nodes.Count == 0)
                break;
            if (parentNode.Level == level - 1)
            {
                break;
            }
            parentNode = parentNode.Nodes.LastNode;
        }
        currentNode = new CHMNode();
        //add child node
        parentNode.Nodes.Add(currentNode);
    }
    //set node's name
    currentNode.Name = strTitle;
    strContent = strContent.Trim();
    if (strContent.Length > 0)
    {
        string strHtmlFileName = "";
        CHMNode node = currentNode;
        while (node != null)
        {
            int NodeIndex = node.Index;
            if (node.Parent == null)
                NodeIndex = this.Nodes.IndexOf(node);
            if (strHtmlFileName.Length > 0)
                strHtmlFileName = NodeIndex + "-" + strHtmlFileName;
            else
                strHtmlFileName = NodeIndex.ToString();
            node = node.Parent;
        }
        strHtmlFileName = "File" + strHtmlFileName + ".html";
        currentNode.Local = strHtmlFileName;
        myFiles.Add(strHtmlFileName);
        strHtmlFileName = System.IO.Path.Combine(strDir, strHtmlFileName);
        //Generate topic html file
        using (StreamWriter writer = new StreamWriter(strHtmlFileName, false, encoding))
        {
            if (strHeader2 != null)
            {
                //write header html source
                writer.Write(strHeader1);
                writer.Write("<title>" + strTitle + "</title>");
                writer.Write(strHeader2);
            }
            else
            {
                writer.Write(strHeader);
            }
            writer.WriteLine("<body style=' margin: 0px 0px 0px 0px; padding: 0px 0px 0px 0px;font-family: Verdana, Arial, Helvetica, sans-serif;' >");
            string header = this.HelpHeaderHtml;
            if (header != null)
            {
                //write header html source code
                header = header.Replace("@Title", strTitle);
                writer.WriteLine(header);
            }
            //write html content
            writer.WriteLine(strContent);
            //write footer html source
            writer.WriteLine(this.HelpFooterHtml);
            writer.WriteLine("</body>");
            writer.WriteLine("</html>");
        }
    }
    if (index == strBody.Length)
    {
        break;
    }
    else
    {
        strBody = strBody.Substring(index);
    }
}//while
//add the files in the companion ".files" directory of the html file to the file list
string strFilesDir = System.IO.Path.ChangeExtension(fileName, "files");
if (System.IO.Directory.Exists(strFilesDir))
{
    string dirName = System.IO.Path.GetFileName(strFilesDir);
    foreach (string name in System.IO.Directory.GetFiles(strFilesDir))
    {
        string name2 = System.IO.Path.GetFileName(name);
        name2 = System.IO.Path.Combine(dirName, name2);
        myFiles.Add(name2);
    }

}

With this C# code, I split the HTML file at the heading tags H1, H2, H3 ... Hn, and set each HTML document's title to the text inside the corresponding Hn tag.
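
The code above finds headings by searching for the text "<h", which would also match a tag such as <hr>. As an alternative sketch (this is not how Word2CHM does it), a regular expression can restrict the match to h1 through h6. The Topic and HeadingSplitter names below are hypothetical, and the sketch only splits the body into topics; it does not build the CHM node tree.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical types, not taken from the Word2CHM source code.
public sealed class Topic
{
    public int Level;       // 1 for <h1>, 2 for <h2>, ...
    public string Title;    // heading text with inner tags removed
    public string Content;  // HTML between this heading and the next one
}

public static class HeadingSplitter
{
    // Matches <h1>..<h6> (with optional attributes) up to the matching close tag.
    private static readonly Regex Heading = new Regex(
        @"<h([1-6])\b[^>]*>(.*?)</h\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any single HTML tag, used to strip markup from the title.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    public static List<Topic> Split(string bodyHtml)
    {
        var topics = new List<Topic>();
        MatchCollection matches = Heading.Matches(bodyHtml);
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            int contentStart = m.Index + m.Length;
            // the content runs up to the next heading, or to the end of the body
            int contentEnd = (i + 1 < matches.Count) ? matches[i + 1].Index : bodyHtml.Length;
            topics.Add(new Topic
            {
                Level = int.Parse(m.Groups[1].Value),
                Title = Tag.Replace(m.Groups[2].Value, "").Trim(),
                Content = bodyHtml.Substring(contentStart, contentEnd - contentStart).Trim()
            });
        }
        return topics;
    }
}
```

For the sample document shown earlier, Split would return two topics: one at level 1 titled Header1 and one at level 2 titled Header2.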

 
 

Third, compile the multiple HTML files into a single CHM file

Word2CHM cannot compile multiple HTML files into a single CHM file by itself. It calls “HTML Help Workshop” to generate the CHM file.

HTML Help Workshop is a Microsoft product that can compile multiple HTML files into a CHM file. It saves its settings in a help project file whose extension is hhp.

In Word2CHM, the program generates the HHP file with the following C# source code.
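
For orientation, this is roughly what a help project file looks like when it contains the options that Word2CHM writes. The option names come from the Word2CHM code; the file names, the title and the Yes/No values are hypothetical and only illustrate the layout.

```ini
; hypothetical values, for illustration only
[OPTIONS]
Compiled file=Help.chm
Contents file=Help.hhc
Default topic=File0.html
Default Window=main
Display compile progress=yes
Full-text search=Yes
Binary TOC=No
Auto Index=No
Binary Index=No
Title=Help
[FILES]
File0.html
File1.html
```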
strOutputText = "";
if (System.IO.File.Exists(compilerExeFileName) == false)
{
    throw new System.IO.FileNotFoundException(compilerExeFileName);
}
string strHHP = System.IO.Path.Combine(this.WorkDirectory, strName + ".hhp");
string strHHC = System.IO.Path.Combine(this.WorkDirectory, strName + ".hhc");
string strCHM = System.IO.Path.Combine(this.WorkDirectory, strName + ".chm");
if (System.IO.File.Exists(strCHM))
{
    System.IO.File.Delete(strCHM);
}
string DefaultTopic = null;
CHMNodeList nodes = this.GetAllNodes();
foreach (CHMNode node in nodes)
{
    if (HasContent(node.Local))
    {
        DefaultTopic = node.Local;
        break;
    }
}
// Generate hhp file
using (System.IO.StreamWriter myWriter = new System.IO.StreamWriter(
           strHHP,
           false,
           System.Text.Encoding.GetEncoding(936)))
{
    myWriter.WriteLine("[OPTIONS]");
    myWriter.WriteLine("Compiled file=" + System.IO.Path.GetFileName(strCHM));
    myWriter.WriteLine("Contents file=" + System.IO.Path.GetFileName(strHHC));
    myWriter.WriteLine("Default topic=" + this.DefaultTopic);
    myWriter.WriteLine("Default Window=main");
    myWriter.WriteLine("Display compile progress=yes");
    myWriter.WriteLine("Full-text search=" + (this.FullTextSearch ? "Yes" : "No"));
    myWriter.WriteLine("Binary TOC=" + (this.BinaryToc ? "Yes" : "No"));
    myWriter.WriteLine("Auto Index=" + (this.AutoIndex ? "Yes" : "No"));
    myWriter.WriteLine("Binary Index=" + (this.BinaryIndex ? "Yes" : "No"));
    //myWriter.WriteLine("Index file=" + System.IO.Path.GetFileName( strIndexFile ));
    myWriter.WriteLine("Title=" + this.Title);
    myWriter.WriteLine("[FILES]");
    foreach (CHMNode node in nodes)
    {
        if (HasContent(node.Local))
        {
            if (myFiles.Contains(node.Local) == false)
            {
                myFiles.Add(node.Local);
            }
        }
    }
    foreach (string fileName in myFiles)
    {
        myWriter.WriteLine(fileName);
    }
}
// Generate hhc file
System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
doc.AppendChild(doc.CreateElement("hhc"));
ToHHCXMLElement(this.myNodes, doc.DocumentElement);
using (System.IO.StreamWriter myWriter = new System.IO.StreamWriter(
           strHHC,
           false,
           System.Text.Encoding.GetEncoding(936)))
{
    myWriter.Write(doc.DocumentElement.InnerXml);
}
// Compile project , generate chm file
ProcessStartInfo start = new ProcessStartInfo(compilerExeFileName, "\"" + strHHP + "\"");
start.UseShellExecute = false;
start.CreateNoWindow = true;
start.RedirectStandardOutput = true;
start.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
System.Diagnostics.Process proc = System.Diagnostics.Process.Start(start);
proc.PriorityClass = System.Diagnostics.ProcessPriorityClass.BelowNormal;
this.strOutputText = proc.StandardOutput.ReadToEnd();
// Delete temporary files
if (deleteTempFile)
{
    System.IO.File.Delete(strHHP);
    System.IO.File.Delete(strHHC);
}
if (System.IO.File.Exists(strCHM))
    return strCHM;
else
    return null;
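
The ToHHCXMLElement function called above is not shown in this article. As a minimal sketch of what a contents file writer could look like, assuming hypothetical TocNode and HhcWriter classes (not the CHMNode class of Word2CHM), the following code writes the nested UL/LI/OBJECT sitemap markup that HTML Help contents files use.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical types, not taken from the Word2CHM source code.
public sealed class TocNode
{
    public string Name;    // text shown in the contents tree
    public string Local;   // topic file name; may be null for a heading without a page
    public List<TocNode> Children = new List<TocNode>();
}

public static class HhcWriter
{
    public static string Build(IEnumerable<TocNode> roots)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML//EN\">");
        sb.AppendLine("<HTML><HEAD></HEAD><BODY>");
        WriteList(sb, roots);
        sb.AppendLine("</BODY></HTML>");
        return sb.ToString();
    }

    // Writes one <UL> level and recurses into child nodes.
    private static void WriteList(StringBuilder sb, IEnumerable<TocNode> nodes)
    {
        sb.AppendLine("<UL>");
        foreach (TocNode node in nodes)
        {
            sb.AppendLine("<LI><OBJECT type=\"text/sitemap\">");
            sb.AppendLine("<param name=\"Name\" value=\"" + WebUtility.HtmlEncode(node.Name) + "\">");
            if (!string.IsNullOrEmpty(node.Local))
            {
                sb.AppendLine("<param name=\"Local\" value=\"" + WebUtility.HtmlEncode(node.Local) + "\">");
            }
            sb.AppendLine("</OBJECT>");
            if (node.Children.Count > 0)
            {
                WriteList(sb, node.Children);
            }
        }
        sb.AppendLine("</UL>");
    }
}
```

The names and file names are HTML-encoded so that quotes or ampersands in a heading do not break the attribute values.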

After generating the HHP file, Word2CHM uses the following C# code to start the compilation and report the result.

string hhcPath = Word2CHM.Properties.Settings.Default.HHCExePath;
if( System.IO.File.Exists( hhcPath ) == false )
{
    MessageBox.Show("Cannot find the executable file '"
        + hhcPath + "' of 'HTML Help Workshop'!");
    return;
}
try
{
    string name = System.IO.Path.ChangeExtension(
        this.myDocument.FileName , "hhp");
    this.Cursor = System.Windows.Forms.Cursors.WaitCursor;
    name = myDocument.CompileProject(
        hhcPath ,
        Word2CHM.Properties.Settings.Default.DeleteTempFile );
    this.Cursor = System.Windows.Forms.Cursors.Default;
    System.Diagnostics.Debug.WriteLine( myDocument.OutputText);
    if (name == null)
        Alert( "Compile error!");
    else
        Alert( "Generated file " + name);
}
catch (Exception ext)
{
    Alert("App error:" + ext.Message);
}

After completing these three steps, Word2CHM has converted a Word document to a CHM file.

posted on 2010-11-15 16:38 by 袁永福 (Electronic Medical Records, Healthcare IT)
