Detecting UTF-8 Encoding

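Idea: if a GBK-encoded Chinese byte stream is decoded as UTF-8, the invalid sequences turn into the unknown character � (U+FFFD). Its UTF-8 encoding is the bytes -17, -65, -67 (as signed Java bytes), so it always ends with -65, -67.
So first try decoding as UTF-8, then get the bytes back from the resulting string and check whether they contain the unknown character.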
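
The effect can be reproduced in a few lines. This is a minimal sketch and not part of the original post: the class name ReplacementCharDemo and the helper roundTripUtf8 are made up for illustration, and the charset is passed explicitly so the result does not depend on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementCharDemo {
    // Hypothetical helper: decode the bytes as UTF-8, then encode the result as UTF-8 again.
    static byte[] roundTripUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character string "中文", written with escapes
        // so the example does not depend on the source file encoding.
        byte[] gbk = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));

        // GBK bytes are not valid UTF-8 here, so the round trip contains -17, -65, -67 groups.
        System.out.println(Arrays.toString(roundTripUtf8(gbk)));

        // The replacement character on its own.
        System.out.println(Arrays.toString("\uFFFD".getBytes(StandardCharsets.UTF_8)));
        // prints [-17, -65, -67]
    }
}
```

Genuine UTF-8 input survives the round trip unchanged, which is what makes the marker bytes usable as a signal.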

Procedure:

When Java's String(byte[], offset, len) constructor is used directly, the bytes are decoded with the charset returned by

Charset.defaultCharset();
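
Since the default charset differs between machines (on JDK 18 and later it is UTF-8 unless overridden, on older JDKs it follows the OS locale), the assumption can be removed by naming the charset. A small sketch, with DefaultCharsetDemo and decodeExplicit as illustrative names only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Hypothetical helper: same as new String(buf, offset, len),
    // but always decodes as UTF-8 regardless of the platform default.
    static String decodeExplicit(byte[] buf, int offset, int len) {
        return new String(buf, offset, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Shows which charset the three-argument String constructor would use here.
        System.out.println("default charset: " + Charset.defaultCharset());

        byte[] buf = "ab\u4e2d".getBytes(StandardCharsets.UTF_8);
        // Skip the two ASCII bytes and decode the remaining three.
        System.out.println(decodeExplicit(buf, 2, 3));
    }
}
```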


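Assume the default charset is UTF-8.
1. First, convert the byte array buf to a String

String str = new String(buf, 0, buf.length);


2. Get the bytes back from the string

byte[] buf1 = str.getBytes();


3. Check whether the bytes contain a consecutive -65, -67. If they do, the data is not UTF-8.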

public static boolean isUtf8(byte[] buf, int len) {
    // U+FFFD is -17, -65, -67 in UTF-8; look for the trailing -65, -67 pair
    for (int i = 0; i + 1 < len; i++) {
        if (buf[i] == -65 && buf[i + 1] == -67) {
            return false;
        }
    }
    return true;
}
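
One limitation of the marker check is that valid UTF-8 text which itself contains U+FFFD is reported as non-UTF-8. A different technique, not the one used in this post, is to let a strict CharsetDecoder report malformed input instead of replacing it. A sketch under that assumption, with StrictUtf8Check and isValidUtf8 as made-up names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Hypothetical alternative to isUtf8: returns true only if the given
    // range of bytes is well-formed UTF-8.
    static boolean isValidUtf8(byte[] buf, int offset, int len) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Throw on bad input instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buf, offset, len));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Neither approach can tell the encodings apart when GBK bytes happen to form valid UTF-8, so both remain heuristics.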




Code for the whole procedure:

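{
    byte[] src; // source bytes
    int offset; // start position within the source bytes
    int len;    // number of source bytes to read
    String str = new String(src, offset, len); // assumes the default charset is UTF-8
    byte[] buf1 = str.getBytes();
    // use buf1.length: the re-encoded array can be longer than len
    if (!isUtf8(buf1, buf1.length)) {
        byte[] buf2 = new byte[len];
        System.arraycopy(src, offset, buf2, 0, len);
        try {
            str = new String(buf2, "GBK");
        } catch (UnsupportedEncodingException e) {
            // GBK is not available on this JVM; keep the UTF-8 result
        }
    }
    /*-- str is the final result: the Chinese string decoded with the appropriate charset --*/
}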
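
The fragment above can be wrapped into a callable method. This is one possible sketch rather than a definitive version: the class name EncodingGuess and the method decodeUtf8OrGbk are hypothetical, and both charsets are named explicitly so the platform default no longer matters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingGuess {
    // Same marker check as in the post: -65, -67 is the tail of U+FFFD in UTF-8.
    static boolean isUtf8(byte[] buf, int len) {
        for (int i = 0; i + 1 < len; i++) {
            if (buf[i] == -65 && buf[i + 1] == -67) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical wrapper: decode as UTF-8 first, fall back to GBK
    // when the round trip shows replacement characters.
    static String decodeUtf8OrGbk(byte[] src, int offset, int len) {
        String str = new String(src, offset, len, StandardCharsets.UTF_8);
        byte[] roundTrip = str.getBytes(StandardCharsets.UTF_8);
        if (!isUtf8(roundTrip, roundTrip.length)) {
            // Charset.forName throws UnsupportedCharsetException if GBK is missing.
            str = new String(src, offset, len, Charset.forName("GBK"));
        }
        return str;
    }

    public static void main(String[] args) {
        byte[] gbk = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // Both calls should yield the same two-character string.
        System.out.println(decodeUtf8OrGbk(gbk, 0, gbk.length));
        System.out.println(decodeUtf8OrGbk(utf8, 0, utf8.length));
    }
}
```

Passing offset and len straight to the String constructor also removes the need for the intermediate copy into buf2.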

posted @ 2021-01-21 11:26 dnghong
