The BOM in Character Encodings

 

What the BOM Is For

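UTF-16 and UTF-32, as discussed in this article, both use code units of a fixed size: 2 bytes and 4 bytes respectively.
A character's code point is stored as bytes. Taking UTF-16 as an example, a code point in the Basic Multilingual Plane becomes a single 2-byte code unit (code points outside it take two code units, 4 bytes in total).
Those two bytes can be stored in two ways, with the most significant byte first or last. For example, '中' (U+4E2D):
UTF-16 big-endian encoding is FEFF4E2D
UTF-16 little-endian encoding is FFFE2D4E
Whether the bytes are stored as 4E 2D or as 2D 4E is the difference between the two schemes.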
 
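The leading FEFF / FFFE is the BOM (byte order mark). It states whether the data is big-endian or little-endian: when a text editor reads a UTF-16 file, the BOM fixes the byte order, and the rest of the byte stream is parsed accordingly.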
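The two byte sequences quoted above can be reproduced with a minimal sketch. It assumes only the standard `codecs` module; the variable names are illustrative and not from the original post.

```python
import codecs

ch = '\u4e2d'  # the character '中', U+4E2D

# Codecs with an explicit byte order write no BOM.
be = ch.encode('utf-16-be')  # bytes 4E 2D
le = ch.encode('utf-16-le')  # bytes 2D 4E

# Prepending the matching BOM gives the sequences quoted in the text.
with_bom_be = codecs.BOM_UTF16_BE + be
with_bom_le = codecs.BOM_UTF16_LE + le

print(with_bom_be.hex().upper())  # FEFF4E2D
print(with_bom_le.hex().upper())  # FFFE2D4E

# The generic 'utf-16' decoder reads the BOM and picks the byte order from it.
assert with_bom_be.decode('utf-16') == ch
assert with_bom_le.decode('utf-16') == ch
```

Either sequence decodes to the same character because the decoder consumes the BOM first and then interprets the remaining bytes in the order it announces.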
 
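For UTF-16 and UTF-32, a BOM is needed in practice, unless the byte order is stated some other way (for example, by labeling the data explicitly as UTF-16BE or UTF-16LE).
UTF-8 can omit the BOM: its byte sequence has no byte-order variants, so a UTF-8 file without a BOM still decodes to the correct characters under the UTF-8 rules.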
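A small sketch of why the byte order has to be known for UTF-16 but not for UTF-8. The variable names are illustrative; the point is that the same two BOM-less bytes decode to different characters depending on the order the reader assumes.

```python
raw = b'\x4e\x2d'  # two bytes, no BOM

as_be = raw.decode('utf-16-be')  # read as code unit 0x4E2D
as_le = raw.decode('utf-16-le')  # read as code unit 0x2D4E

print(hex(ord(as_be)))  # 0x4e2d, the intended character
print(hex(ord(as_le)))  # 0x2d4e, a different character entirely

# UTF-8 has a single byte sequence per code point, so no BOM is required.
utf8 = '\u4e2d'.encode('utf-8')
print(utf8)  # b'\xe4\xb8\xad'
assert utf8.decode('utf-8') == '\u4e2d'
```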
 

UTF BOMs in Python

 
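
s = '中'

s.encode('utf-16')
# Returns b'\xff\xfe-N' (seen here in a Windows command window)
# The repr of a bytes object shows printable ASCII bytes as characters,
# so 0x2d is displayed as '-' and 0x4e as 'N'

hex(ord('-'))  # 0x2d
hex(ord('N'))  # 0x4e

# So the result is b'\xff\xfe\x2d\x4e'
# Little-endian was used by default: the 'utf-16' codec encodes in the machine's native byte order, so this depends on the CPU and not on Python alone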
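The dependence on the platform can be checked with a short sketch that compares the output of the `utf-16` codec against `sys.byteorder`. This assumes CPython's behavior of encoding in native byte order; the names `bom` and `payload` are illustrative.

```python
import codecs
import sys

data = '\u4e2d'.encode('utf-16')
bom, payload = data[:2], data[2:]

# codecs.BOM_UTF16 is the BOM for the machine's native byte order.
assert bom == codecs.BOM_UTF16

if sys.byteorder == 'little':
    assert payload == b'\x2d\x4e'  # low byte first
else:
    assert payload == b'\x4e\x2d'  # high byte first

print(sys.byteorder, data)
```

On common x86 and ARM machines `sys.byteorder` is `'little'`, which matches the result shown in the snippet above.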

 

 

Name                       | UTF-8  | UTF-16  | UTF-16BE   | UTF-16LE      | UTF-32  | UTF-32BE   | UTF-32LE
Smallest code point        | 0000   | 0000    | 0000       | 0000          | 0000    | 0000       | 0000
Largest code point         | 10FFFF | 10FFFF  | 10FFFF     | 10FFFF        | 10FFFF  | 10FFFF     | 10FFFF
Code unit size             | 8 bits | 16 bits | 16 bits    | 16 bits       | 32 bits | 32 bits    | 32 bits
Byte order                 | N/A    | <BOM>   | big-endian | little-endian | <BOM>   | big-endian | little-endian
Fewest bytes per character | 1      | 2       | 2          | 2             | 4       | 4          | 4
Most bytes per character   | 4      | 4       | 4          | 4             | 4       | 4          | 4
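The "fewest" and "most" rows of the table can be checked with a quick sketch. The three sample characters are chosen here for illustration: an ASCII letter, a BMP character, and a character outside the BMP. Codecs with an explicit byte order are used so that no BOM is counted.

```python
samples = ['A', '\u4e2d', '\U0001F600']  # U+0041, U+4E2D, U+1F600
encodings = ['utf-8', 'utf-16-le', 'utf-32-le']

# sizes[char][encoding] = number of bytes needed for that character
sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch in samples:
    print('U+%04X' % ord(ch), sizes[ch])
```

The ASCII letter takes 1, 2 and 4 bytes; the BMP character takes 3, 2 and 4; the character outside the BMP takes 4 bytes in all three encodings.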
 
 
 
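# If the byte order is given explicitly, the bytes carry no BOM
s.encode('utf-16be')

# Returns b'N-'
# i.e. b'\x4e\x2d'
# No BOM

The same choice can be made when saving a file:

with open('res.txt', 'wb') as f:
    f.write(s.encode('utf-16be'))  # written without a BOM; not recommended, UTF-16 and UTF-32 files should carry a BOM where possible

with open('res.txt', 'w', encoding='utf-16be') as f:
    f.write(s)  # s is a str (Unicode), so it is encoded with `encoding` into bytes on the way to the file; as noted above, prefer writing a BOM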
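For comparison, a sketch of the BOM-carrying variant: opening the file with `encoding='utf-16'` (no explicit byte order) lets Python write the BOM itself. A temporary directory is used here so the example does not touch real files; the path and variable names are illustrative.

```python
import codecs
import os
import tempfile

s = '\u4e2d'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'res.txt')

    # The generic 'utf-16' codec writes a BOM at the start of the file.
    with open(path, 'w', encoding='utf-16') as f:
        f.write(s)

    # Inspect the raw bytes, then read the text back with the same codec.
    with open(path, 'rb') as f:
        raw = f.read()
    with open(path, 'r', encoding='utf-16') as f:
        text = f.read()

print(raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(text == s)  # True
```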
 
 

UTF-8 BOM

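UTF-8 also has a BOM, but including it is generally not recommended: UTF-8 has no byte order to signal, so its BOM is just a fixed value.
Some Windows software (older versions of Notepad, for example) saves UTF-8 text files with a BOM by default, so take care when processing such files. Python gives unexpected results on UTF-8 with a BOM, not because it cannot handle the BOM, but because the plain utf-8 codec treats the input as BOM-less UTF-8: no exception is raised, and the BOM is left at the start of the decoded string as the character U+FEFF.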
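What goes wrong with the plain `utf-8` codec can be seen in a small sketch. The sample content `name,value` is made up for illustration; it stands in for the first line of a file saved by a BOM-writing editor.

```python
import codecs

# Bytes as a BOM-writing editor might save them.
raw = codecs.BOM_UTF8 + 'name,value'.encode('utf-8')

text = raw.decode('utf-8')   # decodes without raising
header = text.split(',')[0]  # first field of the line

print(ascii(text))           # '\ufeffname,value'
print(header == 'name')      # False: the field starts with U+FEFF
```

The stray U+FEFF is invisible in most terminals and editors, which is why a comparison or dictionary lookup on the first field can fail with no obvious cause.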
 
Bytes       | Encoding Form
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF       | UTF-16, big-endian
FF FE       | UTF-16, little-endian
EF BB BF    | UTF-8
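The table translates fairly directly into a BOM sniffer. The sketch below is one possible approach, not a standard API: `sniff_bom` is a hypothetical helper name, and the BOM constants come from the standard `codecs` module. The UTF-32 entries are checked before the UTF-16 ones because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Longest BOMs first, so FF FE 00 00 is not mistaken for FF FE.
_BOMS = [
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
]


def sniff_bom(data):
    """Return (codec_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


data = codecs.BOM_UTF32_LE + '\u4e2d'.encode('utf-32-le')
name, skip = sniff_bom(data)
print(name, skip)  # utf-32-le 4
assert data[skip:].decode(name) == '\u4e2d'
```

One caveat: UTF-16 little-endian text whose first character after the BOM is U+0000 produces the same leading bytes as the UTF-32 little-endian BOM, so a sniffer like this is a heuristic and not a proof of the encoding.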
 

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature: an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts. [AF]

 
 
In Python, the utf-8 codec does not write a BOM, while utf-8-sig does. To open a UTF-8 file that has a BOM, specify the utf-8-sig encoding:

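with open('with_bom_utf_8.txt', 'r', encoding='utf-8-sig') as f:
    s = f.read()  # decode the byte stream into Unicode characters with the right codec; in memory each character is an integer, its code point

with open('utf_8.txt', 'w', encoding='utf-8') as f:
    f.write(s)  # encode the Unicode characters with `encoding` and save the resulting byte stream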
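A round-trip sketch of the two codecs on files. It writes to a temporary directory so it can be run as-is; the file names reuse the ones above and the other names are illustrative. It also shows that `utf-8-sig` reads a BOM-less file without trouble, since the BOM is optional when decoding.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, 'with_bom_utf_8.txt')
    without_bom = os.path.join(tmp, 'utf_8.txt')

    # utf-8-sig writes EF BB BF first; utf-8 writes no BOM.
    with open(with_bom, 'w', encoding='utf-8-sig') as f:
        f.write('abc')
    with open(without_bom, 'w', encoding='utf-8') as f:
        f.write('abc')

    with open(with_bom, 'rb') as f:
        raw_with = f.read()
    with open(without_bom, 'rb') as f:
        raw_without = f.read()

    # utf-8-sig skips a BOM if present and accepts its absence.
    with open(with_bom, 'r', encoding='utf-8-sig') as f:
        text_with = f.read()
    with open(without_bom, 'r', encoding='utf-8-sig') as f:
        text_without = f.read()

print(raw_with)     # b'\xef\xbb\xbfabc'
print(raw_without)  # b'abc'
print(text_with == text_without == 'abc')  # True
```

Reading with `utf-8-sig` is therefore a reasonable default when the presence of a BOM is not known in advance.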

 

 
 
 
 
 
 
 
 
 
posted @ 2020-08-22 08:11  duohappy