Fixing garbled Chinese text in Python

Encoding problems come up often when Python processes text.

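The two encodings most commonly used for Chinese characters are utf8 and gbk, and parsing a txt file or a byte string frequently fails because the data was written in one and read as the other.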
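
As a minimal illustration of that mismatch (the sample text and variable names here are made up for the example, not taken from the original code), the same two characters produce different bytes under each encoding, and decoding with the wrong one either fails or yields different text:

```python
text = "中文"
utf8_bytes = text.encode("utf8")  # 3 bytes per character for this text
gbk_bytes = text.encode("gbk")    # 2 bytes per character for this text

# GBK bytes are not valid UTF-8 here, so strict decoding raises.
try:
    gbk_bytes.decode("utf8")
    gbk_as_utf8_failed = False
except UnicodeDecodeError:
    gbk_as_utf8_failed = True

# Decoding UTF-8 bytes as GBK does not recover the original text;
# errors="replace" substitutes U+FFFD for anything undecodable.
utf8_as_gbk = utf8_bytes.decode("gbk", errors="replace")

print(len(utf8_bytes), len(gbk_bytes), gbk_as_utf8_failed)
print(utf8_as_gbk == text)
```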

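For a line of text, we try utf8 first and then gbk, and return the first result that decodes cleanly. If neither does, we pick whichever encoding decodes more of the input.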
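
One way to measure "how much was decoded" is the `start` attribute of `UnicodeDecodeError`, which holds the byte offset of the first byte that could not be decoded. The sketch below shows that measurement in isolation; `decoded_prefix_len` is a hypothetical helper name used only for this illustration. Note that the offset counts bytes, not characters, so comparing it across encodings is a heuristic rather than an exact measure.

```python
def decoded_prefix_len(data: bytes, encoding: str) -> int:
    """Return how many leading bytes of `data` decode cleanly as `encoding`."""
    try:
        data.decode(encoding)
        return len(data)  # everything decoded
    except UnicodeDecodeError as ex:
        # ex.start is the byte offset of the first undecodable byte
        return ex.start


# 0xff can never appear in UTF-8, so decoding stops after b"abc".
print(decoded_prefix_len(b"abc\xff", "utf8"))
# GBK-encoded Chinese fails as UTF-8 at the very first byte.
print(decoded_prefix_len("中文".encode("gbk"), "utf8"))
```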

def force_decode(string: bytes) -> str:
    """
    Decode bytes as utf8 or gbk.

    Sometimes neither utf8 nor gbk can decode the bytes successfully.
    In that case, pick the encoding that decodes the longer prefix
    and drop the bytes it cannot decode.
    """
    if not isinstance(string, bytes):
        raise ValueError('expected a bytes object')
    # Byte offset of the first decoding failure, one entry per encoding
    decoded_prefix_lens = []
    for encoding in ['utf8', 'gbk']:
        try:
            return string.decode(encoding)
        except UnicodeDecodeError as ex:
            decoded_prefix_lens.append(ex.start)
    # Neither utf8 nor gbk decoded the whole input:
    # select the one that got further before failing
    utf8_len, gbk_len = decoded_prefix_lens
    selected_encoding = 'utf8' if utf8_len > gbk_len else 'gbk'
    return string.decode(selected_encoding, errors='ignore')
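
A function like this only helps if the input is still bytes, so a txt file has to be opened in binary mode before decoding. The sketch below is one possible way to apply a decoder line by line; `decode_lines` and the file name `mixed.txt` are hypothetical and not part of the original code. The decoder is passed in as a parameter so the block stays self-contained.

```python
from typing import Callable, Iterator


def decode_lines(path: str, decoder: Callable[[bytes], str]) -> Iterator[str]:
    """Yield each line of the file at `path`, decoded by `decoder`."""
    # Binary mode ("rb") yields bytes, so Python performs no implicit decoding.
    with open(path, "rb") as f:
        for raw_line in f:
            # Strip the line ending before decoding each line on its own.
            yield decoder(raw_line.rstrip(b"\r\n"))


# Hypothetical usage with the function above:
# for line in decode_lines("mixed.txt", force_decode):
#     print(line)
```

Decoding per line means a file that mixes utf8 and gbk lines can still be read, since each line gets its own choice of encoding.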

 

Code link: https://gist.github.com/albertofwb/b53bf32adca5c245c6dee6642ca5463d

posted @ 2020-06-24 16:46  SurfUniverse