程序中的阿呆

Testing How to Decompile CHM Help Files with C# Code

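Posted on 2005-10-16 08:22 by MicroDream
For the past few days I have been on loan to the planning department to build a CHM-to-PDF tool. Because of some special requirements, we did not use an existing tool. The first step is to obtain the CHM's structure and file contents, which means decompiling the CHM. Microsoft's own tool, Microsoft HTML Help Workshop, can do this, not to mention the many ready-made e-book tools available online. But to keep the process under our own control, I decided to write the decompiling code myself.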

The CHM Format

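CHM (pronounced "chum") originally stands for Compiled HTML Help file. Microsoft introduced it as the replacement for the HLP format, so Microsoft not only ships a free viewer with IE 4.01 and later, but also provides the authoring tool, Microsoft HTML Help Workshop, free of charge. Internally, CHM files use the ITS (InfoTech Storage) format, a very good compressed archive format.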

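Because the ITS format is relatively open, developers abroad long ago produced standalone compilers and decompilers for the CHM format and published their full source code.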

Below are the results of overseas research on CHM that I found online.

  • hhm (GPL2): hhm (HTML Help Maker) is a program that makes ITS files and in the future it will also make Compiled HTML Help (CHM) files. Both types of files are a kind of compressed archive format used on Win98, Win2K and other Microsoft operating systems to store documentation.
  • chmdeco (GPL2): chmdeco (CHM decompiler) is a program that converts the internal files of CHM files back into the hhp, hhc, hhk etc used to compile the documentation.
  • chmspec (GPL2): chmspec (CHM specification) is an effort to document Microsoft's Compiled HTML Help files (CHMs), mainly the internal files, since the archive format is documented already.
  • istorage (BSD): This is just a simple Windows proggie to extract files from those pesky MS compound file objects accessible via OLE's StgOpenStorage function and the IStorage interface exposed by that function. These compound file objects are created by word, excel, & probably other MS progs. Also Macromedia Flash source files (*.fla - there are some of these available from levitated.net, which is an interesting site) are these compound files. These compound file objects can be thought of as the equivalent of tar files, but of course MS went & invented some new format without even considering .zip, .lha, .tar.gz, .cab, blah blah blah. One weird thing about them is that the IStreams inside can & for word & excel do have freaky chars in their names, often as the first char such as in word 2000 docs there are streams named "\x05SummaryInformation" (that is an 0x05 - there are also ones with 0x01). MSDEV .opt files are also these compound files. I updated this in April 2002 to extract Compiled HTML Help (chm) files (also known as InfoTech Storage (ITS) files) too. This feature uses the same IStorage interface, but uses an ITStorage (CLSID=5D02926A-212E-11D0-9DF9-00A0C922E6EC) object (from itss.dll) got from CoCreateInstance to open the file. I found out how to do this from these two code samples: www.keyworks.net/code.htm & helpware.net/delphi/index.html.
  • istorage-make (unicode version) (BSD): This is just a simple proggie to create those pesky MS compound file and InfoTech Storage (ITS) files too. It uses the same interfaces as istorage (an extractor for these files).

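The most useful one I found was http://bonedaddy.net/pabs3/files/istorage-makeu.zip. The published source is C++, while my working language is Delphi, so I planned to port the C++ code to Delphi (I later found that the ActiveX unit in Delphi 7 already provides these declarations).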

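The most important part of the work is simply declaring (referencing) the interface definitions from Ole32.dll in order to use structured storage. Someone has already done this well in C#, and the code is clear and easy to follow. If you need a C# implementation, see http://www.codeproject.com/csharp/DecompilingCHM.asp, which has complete C# code for testing how to decompile CHM help files.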
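The istorage description above uses two different entry points: StgOpenStorage for OLE compound files, and an ITStorage object from itss.dll for CHM (ITS) files. A tool that handles both first has to tell the two containers apart. Here is a minimal Python sketch of that check. It assumes the commonly documented file signatures: CHM files begin with the ASCII bytes `ITSF`, and compound files begin with `D0 CF 11 E0 A1 B1 1A E1`. The names `sniff_container` and `sniff_file` are hypothetical and do not come from any of the tools listed. The sketch only inspects the header and does not decompile anything.

```python
# Assumed signatures: OLE compound files and ITS (CHM) files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ITSF_MAGIC = b"ITSF"


def sniff_container(header: bytes) -> str:
    """Classify a file from its leading bytes (hypothetical helper).

    Returns "compound" for an OLE compound file, "its" for a CHM/ITS
    file, and "unknown" for anything else.
    """
    if header[:8] == OLE_MAGIC:
        return "compound"
    if header[:4] == ITSF_MAGIC:
        return "its"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_container(f.read(8))
```

On Windows, a result of `"its"` would point to the ITStorage route and `"compound"` to StgOpenStorage. The actual extraction still has to go through those COM interfaces, as in the C++ and C# code linked above.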

  • Reference: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/stg/stg/istorage.asp
