juniversalchardet, a utility for detecting file encodings: fixing garbled Chinese when reading CSV files in Java

Preface:

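Chinese text comes out garbled when Java reads a CSV file because the encoding is wrong. The usual causes are:

The file was converted at some point and rewritten with the wrong encoding.
The wrong encoding was specified when reading the CSV file.
An exported CSV file displays incorrectly in Excel and WPS; it has to be exported with an explicit encoding of GBK or GB2312.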
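To illustrate the export case, here is a minimal sketch that writes a CSV file encoded as GBK using only the JDK. The class name `GbkCsvExport`, the method `writeCsv`, and the sample rows are hypothetical, and the sketch assumes that cells contain no commas, quotes, or line breaks, so no escaping is done.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class GbkCsvExport {
    // Writes each row as one comma-separated line, encoded as GBK.
    // Assumption: cells need no CSV escaping (no commas, quotes, or line breaks).
    public static void writeCsv(Path target, List<List<String>> rows) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, Charset.forName("GBK"))) {
            for (List<String> row : rows) {
                writer.write(String.join(",", row));
                writer.write("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("export-demo", ".csv");
        // Unicode escapes keep the source file ASCII-only; the cells are Chinese text.
        writeCsv(target, Arrays.asList(
                Arrays.asList("\u59d3\u540d", "\u57ce\u5e02"),
                Arrays.asList("\u5f20\u4e09", "\u5317\u4eac")));
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

The important part is passing the charset explicitly when the writer is created, rather than relying on the platform default.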

So how do we find out the encoding? The rest of this post shows how to detect the encoding of the original file.

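For a CSV file created on Windows, the detector reports WINDOWS-1252, but the encoding we actually have to pass to the parser is GBK or GB2312; only then is the text read correctly.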

This is what I have found so far; I will add other cases as they come up.
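The effect described above can be reproduced with the JDK alone. The following sketch (the class name `MojibakeDemo` and the helper `decode` are hypothetical) encodes two Chinese characters as GBK and then decodes the same bytes once as windows-1252 and once as GBK; only the second decoding gives the original text back.

```java
import java.nio.charset.Charset;

public class MojibakeDemo {
    // Decodes raw bytes using the named charset.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the word for "Chinese", written with escapes to keep the source ASCII-only.
        byte[] gbkBytes = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));

        // Wrong charset: every byte becomes an unrelated Latin character.
        System.out.println("as windows-1252: " + decode(gbkBytes, "windows-1252"));

        // Matching charset: the original text is restored.
        System.out.println("as GBK:          " + decode(gbkBytes, "GBK"));
    }
}
```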

1. What is it?

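juniversalchardet is a Java port of "universalchardet", Mozilla's encoding detection library. It can detect the encoding of a file, so that you can specify that encoding when reading the file and avoid garbled text.

Maven dependency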

<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

2. Supported encodings

Chinese
- ISO-2022-CN
- BIG5
- EUC-TW
- GB18030
- HZ-GB-2312¹
Cyrillic
- ISO-8859-5
- KOI8-R
- WINDOWS-1251
- MACCYRILLIC
- IBM866
- IBM855
Greek
- ISO-8859-7
- WINDOWS-1253
Hebrew
- ISO-8859-8
- WINDOWS-1255
Japanese
- ISO-2022-JP
- SHIFT_JIS
- EUC-JP
Korean
- ISO-2022-KR
- EUC-KR
Unicode
- UTF-8
- UTF-16BE / UTF-16LE
- UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412¹ / X-ISO-10646-UCS-4-2143¹
Others
- WINDOWS-1252

¹ Currently not supported by Java
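Because some of the names in the list are not supported by Java, it can be worth checking a detected name before turning it into a `Charset`. Below is a minimal sketch using only `java.nio.charset`; the class `CharsetSupport`, the method `toCharsetOrDefault`, and the choice of fallback are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class CharsetSupport {
    // Returns the Charset for the detected name, or the fallback when the name
    // is null, syntactically illegal, or not supported by the running JVM.
    public static Charset toCharsetOrDefault(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(detectedName) ? Charset.forName(detectedName) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "GB18030", "HZ-GB-2312", "X-ISO-10646-UCS-4-3412"};
        for (String name : names) {
            // Prints the charset that would actually be used for each detected name.
            System.out.println(name + " -> " + toCharsetOrDefault(name, StandardCharsets.UTF_8));
        }
    }
}
```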

3. Usage in 5 steps

(1) Construct an instance of org.mozilla.universalchardet.UniversalDetector.
(2) Feed some data (typically several thousand bytes) to the detector by calling UniversalDetector.handleData().
(3) Notify the detector of the end of data by calling UniversalDetector.dataEnd().
(4) Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
(5) Don't forget to call UniversalDetector.reset() before you reuse the detector instance.

How to use it

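import java.io.FileInputStream;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4096];
        String fileName = args[0];

        // (1) Create the detector; no listener is needed here.
        UniversalDetector detector = new UniversalDetector(null);

        // (2) Feed data until the detector is done or the file ends.
        //     try-with-resources closes the stream even if reading fails.
        try (FileInputStream fis = new FileInputStream(fileName)) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }

        // (3) Tell the detector that there is no more data.
        detector.dataEnd();

        // (4) Read the result; null means no encoding could be detected.
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) Reset the detector before reusing the instance.
        detector.reset();
    }
}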

Note: when reading the CSV file with the utility class, the encoding to specify is GB2312, not WINDOWS-1252.

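// Hutool utility class: https://www.hutool.cn/docs/#/core/%E6%96%87%E6%9C%AC%E6%93%8D%E4%BD%9C/CSV%E6%96%87%E4%BB%B6%E5%A4%84%E7%90%86%E5%B7%A5%E5%85%B7-CsvUtil
import java.nio.charset.Charset;
import java.util.List;

import cn.hutool.core.io.FileUtil;
import cn.hutool.core.lang.Console;
import cn.hutool.core.text.csv.*;

CsvReader reader = CsvUtil.getReader();
// Read the CSV data from the file, decoding it as GB2312
CsvData data = reader.read(FileUtil.file("test.csv"), Charset.forName("GB2312"));
List<CsvRow> rows = data.getRows();
// Iterate over the rows
for (CsvRow csvRow : rows) {
    // getRawList returns a List in which each item is one CSV cell (i.e. one comma-separated field)
    Console.log(csvRow.getRawList());
}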
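The rule from the note above (a detected WINDOWS-1252 really means GBK/GB2312 for these files) can be written as a small helper that turns the detector's result into the charset to read with. This is only a sketch under assumptions: the class `ReadCharsetResolver` and the method `resolveForChineseCsv` are hypothetical, GBK is used because it is a superset of GB2312, and falling back to GBK when nothing was detected is my own choice rather than something the library prescribes.

```java
import java.nio.charset.Charset;

public class ReadCharsetResolver {
    // Maps the name reported by the detector to the charset used for reading.
    // Assumption: the input is a Chinese CSV file created on Windows, so a
    // WINDOWS-1252 result (or no result at all) is treated as GBK.
    // Charset.forName throws if the detected name is not supported by the JVM.
    public static Charset resolveForChineseCsv(String detectedName) {
        if (detectedName == null || "WINDOWS-1252".equalsIgnoreCase(detectedName)) {
            return Charset.forName("GBK");
        }
        return Charset.forName(detectedName);
    }

    public static void main(String[] args) {
        System.out.println(resolveForChineseCsv("WINDOWS-1252"));
        System.out.println(resolveForChineseCsv("UTF-8"));
        System.out.println(resolveForChineseCsv(null));
    }
}
```

The returned `Charset` can then be passed to whatever reads the file, for example as the second argument of `reader.read(...)` in the Hutool snippet above.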

posted @ 2021-11-26 10:26 _Phoenix
