C#读取BIFF4格式的Excel文件

NPOI读取BIFF4

使用NPOI读取这种格式的Excel文件时会报错:The supplied data appears to be in BIFF4 format. HSSF only supports the BIFF8 format, try OldExcelExtractor

如果使用OldExcelExtractor类的话,只能提取成文本,无法结构化读取。

 

BIFF格式

刚看到BIFF4格式时,可能会觉得有点奇怪,这跟.xls文件(Excel)有什么关系?

BIFF格式是Microsoft Excel使用的一种二进制文件格式,‌用于存储电子表格数据。‌这种格式被称为二进制交换文件格式(‌Binary Interchange File Format)‌,‌简称BIFF。‌BIFF格式允许Excel文件在不同版本的Excel之间保持兼容性,‌使得用户可以在不同版本的Excel中打开和编辑文件而不会丢失数据或格式。‌BIFF格式是Excel文件格式的一部分,‌它包含了单元格的数据、‌格式、‌公式以及整个工作表的布局信息

 

BIFF版本历史

这里引用loc.gov的英文原文。

大概理解为xls格式经历了不同的版本,从BIFF2到BIFF12。

以下为原文:

The early XLS formats, used for Excel 2.0 (1987) through Excel 4.0 (1992), allowed only a single worksheet. The corresponding file formats were single BIFF streams.

  • BIFF2 for Excel 2.0 (1987)
  • BIFF3 for Excel 3.0 (1990)
  • BIFF4 for Excel 4.0 (1992)

Note that the extension .xlw was used to support multi-sheet "workspaces" starting with Excel 3.0. However, the .xlw file did not contain user data; it was used to configure Excel's user interface presentation of the component sheets. See .xlw File Extension | ReviverSoft for more detail.

With BIFF5, a new structure was introduced; an XLS file now represented a single Workbook with one or many individual Worksheets. A number of streams, including BIFF streams for individual worksheets, are stored in a Microsoft Compound File Binary File container. The compilers of this resource have experimented with saving spreadsheets as XLS files in current versions of Excel. In all cases, the resulting file was in version 3 of CFB (MS-CFB3). Comments welcome. In BIFF7 and earlier, a record in a BIFF stream has a length limit of 2,084 bytes, including the record type and record length fields.

  • BIFF5 for Excel 5.0 (1993) and Excel 95 (1995)
  • BIFF7 was for an option available in Excel 97. Microsoft's 2007 Specification for Microsoft Office Excel 97-2007 Binary File Format (*.xls) states, "For improved backward compatibility, Excel 97 has a save file type option: Microsoft Excel 97 & 5.0/95 Workbook. When a workbook is saved using this file type, Excel writes two complete book streams. The first stream in the file is the Microsoft Excel 5.0/95 format (BIFF5/BIFF7), and the second one is the Microsoft Excel 97 format (BIFF8). The DSF record, which only appears in the BIFF8 stream, indicates the file is a double stream file. To distinguish the two streams, the BIFF5/BIFF7 stream is called Book, and the BIFF8 stream is called Workbook."
  • BIFF8 for Excel 98 (1998) through Excel 2003. In BIFF8, a BIFF record has a length limit of 8,228 bytes, including the record type and record length fields. As noted above, Excel 97 introduced the BIFF8 format in its double-stream file option. This format description is primarily for the BIFF8 variant of the XLS format.

Note that BIFF12 is used in a different binary file format, using a different container file and the file extension .xlsb. It has been available as an alternative to the XML-based XLSX since Excel 2007. See MS-XLSB.

 

 

读取BIFF4格式的Excel文件

对于这种古老的格式,可以使用ExcelDataRead库来进行读取

项目地址:https://github.com/ExcelDataReader/ExcelDataReader

 

ExcelDataRead使用方法

1、Nuget安装ExcelDataReader和ExcelDataReader.DataSet包

2、示例代码

 1 using (var stream = File.Open(filePath, FileMode.Open, FileAccess.Read))
 2 {
 3     //  自动识别格式,支持:
 4     //  - Excel文档 (2.0-2003 format; *.xls)
 5     //  - OpenXml Excel 文档 (2007 format; *.xlsx, *.xlsb)
 6     using (var reader = ExcelReaderFactory.CreateReader(stream))
 7     {
 8         // 可选以下两种方式:
 9 
10         // 1. 使用 reader 方法
11         do
12         {
13             while (reader.Read())
14             {
15                 // reader.GetDouble(0);
16             }
17         } while (reader.NextResult());
18 
19         // 2. 使用 AsDataSet 扩展方法转换为DataSet
20         var result = reader.AsDataSet();
21 
22         // 从result.Tables中取对应的Sheet
23     }
24 }

 

注意:对于.NET5及以上版本的项目,需要Nuget安装System.Text.Encoding.CodePages包,并在读取文档前调用以下代码:

1 System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);

需要调用这句代码的原因是某些编码在.Net Framework时是默认注册的,但是到.NET Core 5.0以后的版本需要手动注册。

 

因为我手里的这个BIFF4格式的文档涉及一些隐私信息,所以没办法上传,但是使用上面的代码是可以正常读取的。

 

中文读取乱码问题

文件中含有中文时,读取时会乱码

 

关于中文乱码的问题,在项目的issue中,有人已经提交了相关的issue,但是目前正式的版本并未解决。

根据项目作者的描述,在当前的开发版本中,已经加入了在创建Reader时指定编码的功能,如下所示:

1 IExcelDataReader excelReader = ExcelReaderFactory.CreateBinaryReader(stream, encoding);

 

但是根据我今天的测试,发现使用了github上的最新代码,还没有这种重载。

 

我手动调试了一下代码,找到了手动设置编码的地方,设置编码为GB2312就可以解决中文乱码的问题。

找到src\ExcelDataReader\Core\BinaryFormat\XlsWorksheet.cs文件,找到317行的GetLabelString函数,将编码强制设置为GB2312(原来的编码是MBCS)

 

1 private string GetLabelString(XlsBiffLabelCell cell, ExtendedFormat effectiveStyle)
2 {
3     // 1. Use encoding from font's character set (BIFF5-8)
4     // 2. If not specified, use encoding from CODEPAGE BIFF record
5     // 3. If not specified, use configured fallback encoding
6     // Encoding is only used on BIFF2-5 byte strings. BIFF8 uses XlsUnicodeString which ignores the encoding.
7     var labelEncoding = Encoding.GetEncoding("GB2312");
8     return cell.GetValue(labelEncoding);
9 }

 

此时再读取,中文就显示正常了。

 

 

参考资料:

https://www.loc.gov/preservation/digital/formats/fdd/fdd000510.shtml

posted @ 2024-08-07 16:22  zhaotianff  阅读(194)  评论(0)    收藏  举报