Encoding and decoding between bytes and str in Python

1. When reading a file with read_csv:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 2: invalid start byte

This means the file is not valid UTF-8. The list of standard encodings is at https://docs.python.org/2.4/lib/standard-encodings.html
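To see why a byte like 0xb2 is rejected, it helps to look at its high bits. The sketch below is only an illustration (the helper name `utf8_byte_kind` is made up here, and `b"ab\xb2"` is a stand-in for the file's first bytes, which are not shown in this post). In UTF-8, bytes of the form 10xxxxxx are continuation bytes and cannot start a character.

```python
def utf8_byte_kind(byte: int) -> str:
    """Classify one byte by the role its high bits allow in UTF-8."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"   # 10xxxxxx: never valid as a first byte
    if byte in (0xC0, 0xC1) or byte > 0xF4:
        return "invalid"        # never appears in valid UTF-8
    return "lead"               # starts a 2-, 3- or 4-byte sequence

kind = utf8_byte_kind(0xB2)
print(format(0xB2, "08b"), kind)

# Stand-in data: two ASCII bytes followed by 0xb2 at position 2.
try:
    b"ab\xb2".decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
    position = exc.start
print(position, reason)
```

So the error says less about pandas and more about the bytes themselves: whatever encoding produced them, it was not UTF-8.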

I followed the code given in https://stackoverflow.com/questions/54133455/importing-csv-using-pd-read-csv-invalid-start-byte-error:

import chardet

# Read the raw bytes and let chardet guess the encoding
with open('D:\\path\\file.csv', 'rb') as f:
    rawdata = f.read()
result = chardet.detect(rawdata)
charenc = result['encoding']
print(charenc)

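I then tried GB18030 and GB2312 as the encoding (see https://blog.csdn.net/Junkichan/article/details/51913845), but even GB18030 still did not work.

Why is the file detected as GB2312-encoded, yet reading it with that encoding specified fails?

utf-16 did not work either.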
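Rather than swapping encodings by hand one at a time, the raw bytes can be tested against several candidate codecs in one pass. This is a rough sketch, not a fix for the file above: `decodable_with` is a hypothetical helper, and the sample bytes are a stand-in because the real file is not available here.

```python
def decodable_with(data: bytes, candidates):
    """Return the candidate codec names that decode `data` without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok

# Stand-in for the file's raw bytes (assumed GB2312 text).
sample = "C语言中文网8岁了".encode("gb2312")
working = decodable_with(sample, ["utf-8", "gb2312", "gbk", "gb18030"])
print(working)
```

A codec that decodes without raising is not necessarily the right one, so the decoded text still needs a visual check. If no candidate works at all, the file may contain mixed or damaged bytes.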

https://github.com/rkern/line_profiler/issues/37

https://stackoverflow.com/questions/42339876/error-unicodedecodeerror-utf-8-codec-cant-decode-byte-0xff-in-position-0-in/42340744

In any case, many posts say to read the file like this:

with open(path, 'rb') as f:
  contents = f.read()

This gives a bytes object.
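Reading in 'rb' mode never fails on encoding, because nothing is decoded yet. The decoding step still has to happen somewhere, just explicitly. The sketch below uses a small made-up GBK file (the path and the row values are demo data, not the dataset from this post) and the standard csv module instead of pandas so that it runs on its own.

```python
import csv
import io
import os
import tempfile

# Create a small GBK-encoded CSV as a stand-in for the real file.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "wb") as f:
    f.write("编号,色泽\n1,青绿\n".encode("gbk"))

with open(path, "rb") as f:
    contents = f.read()          # bytes, no decoding has happened yet

text = contents.decode("gbk")    # the codec is now chosen explicitly
rows = list(csv.reader(io.StringIO(text)))
print(type(contents).__name__, rows)
```

Since pandas read_csv also accepts file-like objects, the same `io.StringIO(text)` could presumably be handed to it once the bytes decode cleanly.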

The example at https://www.jb51.net/article/144439.htm, which reads a Chinese dataset, loads without any problem.

import chardet

# Check which encoding this file was originally saved in
with open("./a.txt", 'rb') as f:
    rawdata = f.read()
result = chardet.detect(rawdata)
charenc = result['encoding']
print(charenc)

Then I ran it and looked at the result:

utf-8
>>> rawdata
b'\xe7\xbc\x96\xe5\x8f\xb7,\xe8\x89\xb2\xe6\xb3\xbd,\xe6\xa0\xb9\xe8\x92\x82,\xe6\x95\xb2\xe5\xa3\xb0,\xe7\xba\xb9....

>>> rawdata[1]
188
>>> rawdata[2]
150
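The numbers 188 and 150 are just the second and third bytes shown as integers. A short sketch using the first six bytes of the output above: indexing a bytes object gives an int, while slicing gives bytes that can be decoded.

```python
rawdata = b"\xe7\xbc\x96\xe5\x8f\xb7"   # first six bytes of the output above

second = rawdata[1]        # indexing gives an int
third = rawdata[2]
first_char = rawdata[:3]   # slicing gives bytes

print(second, hex(second))
print(third, hex(third))
print(first_char, first_char.decode("utf-8"))
```

Here 188 is 0xbc and 150 is 0x96, and the first three bytes together form one Chinese character in UTF-8.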

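That file is UTF-8, which is the default encoding, so it is no surprise that it reads fine.

Since my file is not UTF-8 encoded, why not just convert the dataset to UTF-8?

I converted its encoding to UTF-8 directly in Notepad++, and the text showed no garbling.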
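The same conversion can be scripted once the source encoding is known. This is a minimal sketch under that assumption: `convert_to_utf8` is a hypothetical helper, and the file names and GBK content are demo data.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file to UTF-8.

    May raise UnicodeDecodeError if src_encoding does not match the file.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)

tmp = tempfile.mkdtemp()
src_file = os.path.join(tmp, "gbk.csv")
dst_file = os.path.join(tmp, "utf8.csv")
with open(src_file, "wb") as f:
    f.write("编号,色泽\r\n".encode("gbk"))   # demo content

convert_to_utf8(src_file, dst_file, "gbk")
with open(dst_file, "rb") as f:
    converted = f.read()
print(converted)
```

Writing to a separate destination file keeps the original intact in case the guessed source encoding turns out to be wrong.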

2. Byte strings and strings

http://c.biancheng.net/view/2175.html explains this very well!

b="C语言中文网8岁了"
be=b.encode('UTF-8')
bd=be.decode() # decodes with UTF-8 by default

# Output:
>>> be
b'C\xe8\xaf\xad\xe8\xa8\x80\xe4\xb8\xad\xe6\x96\x87\xe7\xbd\x918\xe5\xb2\x81\xe4\xba\x86'
>>> bd
'C语言中文网8岁了'

b="C语言中文网8岁了"
be=b.encode('GB2312')
bd=be.decode()

# Output:
 bd=be.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 1: invalid continuation byte
# The familiar error appears!
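The exception object itself says where decoding broke. A small sketch using the same string as above: it catches the error, reads the offset and the offending byte from the exception, and then decodes with `errors="replace"` to see which parts survive.

```python
be = "C语言中文网8岁了".encode("GB2312")

try:
    be.decode()                       # defaults to UTF-8
except UnicodeDecodeError as exc:
    bad_offset = exc.start            # index of the first undecodable byte
    bad_byte = exc.object[exc.start]  # the byte at that index, as an int
    reason = exc.reason

print(bad_offset, hex(bad_byte), reason)

# U+FFFD replacement characters mark the bytes UTF-8 could not decode.
lossy = be.decode("utf-8", errors="replace")
print(lossy)
```

The `errors="replace"` form is only useful for inspection; the replaced characters are lost, so it is not a way to recover the text.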

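Decoding with the same codec that did the encoding works. Compared with the UTF-8 version above, the two encodings also produce different bytes for the same string: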

b="C语言中文网8岁了"
be=b.encode('GB2312')
bd=be.decode('GB2312')

# Result:
>>> be
b'C\xd3\xef\xd1\xd4\xd6\xd0\xce\xc4\xcd\xf88\xcb\xea\xc1\xcb'
>>> bd
'C语言中文网8岁了'
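The difference between the two byte strings is easy to quantify. In this sketch with the same string, each Chinese character takes three bytes in UTF-8 and two in GB2312, while the ASCII characters take one byte in both.

```python
s = "C语言中文网8岁了"
utf8_bytes = s.encode("utf-8")
gb_bytes = s.encode("gb2312")

# 9 characters: 7 Chinese and 2 ASCII.
print(len(s), len(utf8_bytes), len(gb_bytes))

# Each byte string round-trips only through the codec that produced it.
round_trip_ok = (utf8_bytes.decode("utf-8") == s) and (gb_bytes.decode("gb2312") == s)
print(round_trip_ok, utf8_bytes == gb_bytes)
```

This prints 9, 23 and 16 for the three lengths, so the same str maps to different bytes depending on the codec, and the codec has to travel with the bytes.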

I'll leave it here; from now on I'll just convert everything to UTF-8.

https://blog.csdn.net/lyb3b3b/article/details/74993327 explains this very thoroughly!

posted @ 2020-03-17 16:46 lypbendlf
