The BOM in Text Encodings
What the BOM Is For
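As discussed in this article, UTF-16 and UTF-32 are both encodings whose code units have a fixed width in bytes.
To store a character, its code point has to be saved as bytes. Taking UTF-16 as an example, the character's Unicode code point is converted into 2 bytes (or 4 bytes for code points above U+FFFF).
Those two bytes can be stored in two ways: with the high-order byte first or with the high-order byte last. For example, for '中' (U+4E2D):
UTF-16 big-endian encoding: FE FF 4E 2D
UTF-16 little-endian encoding: FF FE 2D 4E
Whether 4E or 2D comes first is the only difference between the two schemes.
The leading FE FF / FF FE is the BOM (byte order mark), which states whether the data is big-endian or little-endian. When a text editor reads a UTF-16 file, the BOM tells it the byte order, and it parses the rest of the byte stream accordingly.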
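As a minimal sketch of how a reader relies on the BOM, the snippet below decodes the two byte sequences above with Python's generic `utf-16` codec, then shows what happens to the same payload when no BOM is present. The variable names are illustrative only.

```python
# Byte sequences for U+4E2D from the example above, BOM included.
big_endian = bytes.fromhex("FEFF4E2D")
little_endian = bytes.fromhex("FFFE2D4E")

# The generic 'utf-16' codec reads the BOM, picks the byte order,
# and removes the BOM from the decoded text.
decoded_be = big_endian.decode("utf-16")
decoded_le = little_endian.decode("utf-16")

# Without a BOM the same two payload bytes are ambiguous:
# 4E 2D is U+4E2D when read as big-endian, but U+2D4E as little-endian.
payload = bytes.fromhex("4E2D")
as_be = payload.decode("utf-16-be")
as_le = payload.decode("utf-16-le")

print(hex(ord(decoded_be)), hex(ord(decoded_le)))  # both are 0x4e2d
print(hex(ord(as_be)), hex(ord(as_le)))            # 0x4e2d and 0x2d4e
```

Both BOM-prefixed sequences decode to the same character, while the bare payload yields two different characters depending on the byte order the reader assumes.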
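For UTF-16 and UTF-32, a BOM is effectively required, unless the byte order is declared some other way (for example by labelling the data UTF-16BE or UTF-16LE).
UTF-8 can go without a BOM: its code units are single bytes, so a BOM-less UTF-8 file can still be decoded correctly by the UTF-8 rules alone.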
UTF BOMs in Python
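```python
s = '中'

s.encode('utf-16')
# In a Windows command window this returns b'\xff\xfe-N'.
# The bytes repr shows printable ASCII bytes as characters,
# so 0x2d appears as '-' and 0x4e appears as 'N'.

hex(ord('-'))  # '0x2d'
hex(ord('N'))  # '0x4e'

# So the result is b'\xff\xfe\x2d\x4e'.
# Little-endian was used here because the utf-16 codec encodes in the
# platform's native byte order, which depends on the CPU, not only on Python.
```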
Name | UTF-8 | UTF-16 | UTF-16BE | UTF-16LE | UTF-32 | UTF-32BE | UTF-32LE |
---|---|---|---|---|---|---|---|
Smallest code point | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 |
Largest code point | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF |
Code unit size | 8 bits | 16 bits | 16 bits | 16 bits | 32 bits | 32 bits | 32 bits |
Byte order | N/A | <BOM> | big-endian | little-endian | <BOM> | big-endian | little-endian |
Fewest bytes per character | 1 | 2 | 2 | 2 | 4 | 4 | 4 |
Most bytes per character | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
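The "fewest" and "most" rows of the table can be checked with a short sketch. It encodes three sample characters (chosen here for illustration: an ASCII letter, U+4E2D, and U+1F600, which lies outside the Basic Multilingual Plane) and counts the bytes. The explicit big-endian codecs are used so that no BOM is included in the counts.

```python
# Sample characters: ASCII, a BMP CJK character, and one outside the BMP.
samples = {"A": "A", "U+4E2D": "\u4e2d", "U+1F600": "\U0001F600"}

# Explicit-endian codecs, so the lengths do not include a BOM.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]

byte_counts = {
    label: {enc: len(ch.encode(enc)) for enc in encodings}
    for label, ch in samples.items()
}

for label, counts in byte_counts.items():
    print(label, counts)
```

UTF-8 ranges from 1 to 4 bytes per character, UTF-16 uses 2 or 4, and UTF-32 always uses 4.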
```python
# If the byte order is specified explicitly, the output contains no BOM.
s.encode('utf-16be')
# Returns b'N-', i.e. b'\x4e\x2d', with no BOM.
```
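When saving a file, the encoding can be specified in the same way:

```python
with open('res.txt', 'wb') as f:
    # Written without a BOM. Not recommended: UTF-16 and UTF-32
    # should carry a BOM whenever possible.
    f.write(s.encode('utf-16be'))

with open('res.txt', 'w', encoding='utf-16be') as f:
    # s is a Unicode str, so it is encoded to bytes with the given
    # encoding as it is written. As noted above, prefer writing a BOM.
    f.write(s)
```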
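For comparison, here is a small sketch of the BOM-carrying variant: opening the file with `encoding='utf-16'` instead of `'utf-16be'`. It writes to a temporary directory so it can be run safely; the file name is just a placeholder.

```python
import os
import tempfile

s = "\u4e2d"  # the character '中'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "res.txt")  # placeholder file name

    # encoding='utf-16' writes a BOM first, in the platform's native byte order.
    with open(path, "w", encoding="utf-16") as f:
        f.write(s)

    # Inspect the raw bytes that ended up on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading back with 'utf-16' consumes the BOM and recovers the text.
    with open(path, "r", encoding="utf-16") as f:
        text = f.read()

print(raw.hex())   # BOM (2 bytes) followed by the 2-byte code unit
print(text == s)
```

On a little-endian machine the raw bytes start with `ff fe`; on a big-endian one they start with `fe ff`. Either way the reader does not need to be told the byte order.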
UTF-8 BOM
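UTF-8 also has a BOM, but including it is generally not recommended: UTF-8 has no byte-order variants, so its BOM is just a fixed value (EF BB BF).
Text files saved by Windows software often carry a BOM by default, so watch for it when processing them. Reading a BOM-prefixed UTF-8 file in Python as plain `utf-8` gives an unexpected result: the BOM survives as a stray U+FEFF character at the start of the text. This is not because Python cannot handle such files, but because plain `utf-8` treats the input as BOM-less UTF-8.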
Bytes | Encoding Form |
---|---|
00 00 FE FF | UTF-32, big-endian |
FF FE 00 00 | UTF-32, little-endian |
FE FF | UTF-16, big-endian |
FF FE | UTF-16, little-endian |
EF BB BF | UTF-8 |
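The signature table above maps naturally onto a small BOM sniffer. The sketch below is one possible way to write it using the BOM constants from Python's `codecs` module; the names `BOM_TABLE`, `sniff_bom`, and `decode_with_bom` are hypothetical helpers, not part of any library.

```python
import codecs

# Check the 4-byte signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so the order matters.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data using its BOM if present, else the assumed default."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)


print(sniff_bom(bytes.fromhex("FEFF4E2D")))
print(sniff_bom(b"plain ascii"))
```

One caveat of this approach: UTF-16 LE text that begins with U+0000 would also start with FF FE 00 00 and be mistaken for UTF-32 LE. That is rare in real text files, but a BOM check alone cannot rule it out.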
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts. [AF]
In Python, the `utf-8` codec neither writes nor strips a BOM, while `utf-8-sig` does both. When opening a UTF-8 file that has a BOM, specify the `utf-8-sig` encoding.
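The difference between the two codecs can be seen in a short sketch that works on in-memory bytes; the sample text `hello` is arbitrary.

```python
import codecs

# UTF-8 data with a BOM in front, as some Windows tools produce.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain 'utf-8' does not raise, but keeps the BOM as a leading U+FEFF.
with_bom = data.decode("utf-8")

# 'utf-8-sig' strips a leading BOM if one is present...
without_bom = data.decode("utf-8-sig")

# ...and also accepts input that has no BOM at all.
plain = b"hello".decode("utf-8-sig")

# When encoding, 'utf-8-sig' prepends the BOM; 'utf-8' does not.
encoded_sig = "hello".encode("utf-8-sig")

print(ascii(with_bom))     # shows the leading \ufeff escape
print(ascii(without_bom))
```

Because `utf-8-sig` tolerates input without a BOM, it is a reasonable choice for reading UTF-8 files of unknown origin. The same encoding name works with `open(..., encoding='utf-8-sig')`.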