Automatic detection of HTML encoding


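Perhaps you have tried many approaches to this; so have I, including reading the charset from the HTTP response header and examining the byte patterns of the stream. But you will find that, for a general-purpose platform that has to handle arbitrary pages, none of these approaches is reliable on its own.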
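For reference, the two simpler approaches mentioned above might look roughly like the sketch below. The class name `CharsetHints` and both method names are made up for illustration and are not part of any library. The sketch also shows why these hints are not enough: both methods return null whenever the server sends no charset parameter or the page has no byte order mark, which is common in practice, and a charset declared in the header can simply be wrong.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class CharsetHints
{
    // Extracts the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=gb2312" gives "gb2312". Returns null if absent.
    public static string FromContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return null;
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                string value = p.Substring("charset=".Length).Trim().Trim('"', '\'');
                return value.Length > 0 ? value : null;
            }
        }
        return null;
    }

    // Recognizes a byte order mark at the start of the data.
    // Returns null when there is no BOM, which is the usual case for HTML pages.
    public static string FromBom(byte[] data)
    {
        if (data == null) return null;
        // Check the 4-byte UTF-32 marks first, because the UTF-32 LE mark
        // begins with the same two bytes as the UTF-16 LE mark.
        if (data.Length >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00) return "utf-32le";
        if (data.Length >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF) return "utf-32be";
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) return "utf-8";
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "utf-16le";
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "utf-16be";
        return null;
    }
}
```

These hints are still worth checking first when they are present; the point is only that a general-purpose crawler needs something to fall back on when they are missing or wrong.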

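After several days of experimenting and searching Baidu, Google and so on, the answer I arrived at is that there is currently no method that is guaranteed never to get it wrong, but there is one solution that handles most cases: the encoding detection module used by Mozilla. I found a .NET port of it: NUniversalCharDet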

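using Mozilla.NUniversalCharDet;

// Returns the charset name detected for the given bytes,
// or "utf-8" when the detector cannot reach a conclusion.
public static string DetectEncoding_Bytes(byte[] DetectBuff)
{
    UniversalDetector Det = new UniversalDetector(null);

    // The whole buffer is fed in one call. A loop guarded by Det.IsDone()
    // is only useful when the data arrives in several chunks.
    Det.HandleData(DetectBuff, 0, DetectBuff.Length);

    // Signal the end of input so the detector can make its final decision.
    Det.DataEnd();

    string charset = Det.GetDetectedCharset();

    // Fall back to UTF-8 when nothing was detected.
    return charset ?? "utf-8";
}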
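The method above only returns a charset name; to get the page text, that name still has to be turned into an `Encoding`. Below is a minimal sketch of that step, using only `System.Text`. The class name `CharsetDecoding` and the method `Decode` are hypothetical names chosen for this example, and the fallback to UTF-8 for unrecognized names is an assumption that mirrors the default used in `DetectEncoding_Bytes`.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class CharsetDecoding
{
    // Decodes data with the named charset. Falls back to UTF-8 when the
    // runtime does not recognize or support the name.
    public static string Decode(byte[] data, string charsetName)
    {
        Encoding encoding;
        try
        {
            encoding = Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException)
        {
            // Unknown or null charset name.
            encoding = Encoding.UTF8;
        }
        catch (NotSupportedException)
        {
            // Name is known but the platform cannot provide the code page.
            encoding = Encoding.UTF8;
        }
        return encoding.GetString(data);
    }
}
```

Typical use would be something like `CharsetDecoding.Decode(bytes, DetectEncoding_Bytes(bytes))`. One caveat: on .NET Core and later, legacy code pages such as GB2312 are only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`; without it, this sketch would silently fall back to UTF-8 for those names.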

posted @ 2012-11-05 22:29 代码示例
