The BOM in Character Encodings

What the BOM Is For

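This article covers two encodings, UTF-16 and UTF-32, both of which use a code unit of fixed size (2 bytes and 4 bytes respectively).

An encoding stores a character's code point as bytes. UTF-16, for example, turns the character's Unicode code point into 2 bytes (or into 4 bytes, as a surrogate pair, for code points above U+FFFF).

Those two bytes can be stored in two ways: with the most significant byte first or last. Take '中' (U+4E2D) as an example:

UTF-16 big-endian, with BOM: FE FF 4E 2D

UTF-16 little-endian, with BOM: FF FE 2D 4E

Whether 4E comes before 2D or after it: those are the two schemes.

The leading FEFF/FFFE is the BOM (byte order mark), which says whether the data is big-endian or little-endian. When a text editor reads a UTF-16 file, the BOM tells it the byte order, and it parses the byte stream accordingly.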
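
The two layouts above can be reproduced in Python. This is a minimal sketch, assuming the standard `codecs` constants `BOM_UTF16_BE` and `BOM_UTF16_LE` and the explicit-endian codecs `utf-16-be` and `utf-16-le`, which write no BOM themselves, so it is prepended by hand here:

```python
import codecs

s = '中'  # U+4E2D

# Explicit-endian codecs emit no BOM, so add it manually.
big = codecs.BOM_UTF16_BE + s.encode('utf-16-be')
little = codecs.BOM_UTF16_LE + s.encode('utf-16-le')

print(big.hex())     # feff4e2d
print(little.hex())  # fffe2d4e

# The generic 'utf-16' decoder reads the BOM to choose the byte order,
# then drops it from the decoded text.
decoded_big = big.decode('utf-16')
decoded_little = little.decode('utf-16')
print(decoded_big == decoded_little == s)  # True
```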

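For UTF-16 and UTF-32, a BOM is effectively required, unless the byte order is fixed some other way, such as by labeling the data UTF-16BE or UTF-16LE.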

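UTF-8 can go without a BOM: it is a sequence of single bytes with no byte-order ambiguity, so a BOM-less UTF-8 file still decodes to the correct characters.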
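
To see why the byte order matters for UTF-16 but not for UTF-8, here is a small sketch that decodes BOM-less UTF-16 bytes with the wrong byte order. The variable names are illustrative only:

```python
raw = '中'.encode('utf-16-be')   # b'\x4e\x2d', no BOM

right = raw.decode('utf-16-be')  # byte order matches the data
wrong = raw.decode('utf-16-le')  # same bytes, read the other way round

print(hex(ord(right)))  # 0x4e2d
print(hex(ord(wrong)))  # 0x2d4e, a different character entirely

# UTF-8 has a single byte layout, so there is nothing to get wrong.
utf8 = '中'.encode('utf-8')
print(utf8.hex())  # e4b8ad
```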

UTF BOMs in Python

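s = '中'

s.encode('utf-16')
# On a little-endian machine (e.g. Windows on x86) this returns b'\xff\xfe-N'.
# The '-N' is just how Python's bytes repr shows printable ASCII bytes.

hex(ord('-'))  # '0x2d'
hex(ord('N'))  # '0x4e'

# So the result is b'\xff\xfe\x2d\x4e'.
# The 'utf-16' codec encodes in the platform's native byte order, which is why
# little-endian appears here: it depends on the CPU, not on Python alone.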
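
The dependence on native byte order can be checked without assuming which kind of machine the code runs on. This sketch relies on `sys.byteorder` and on `codecs.BOM_UTF16`, the BOM in native byte order; `native_codec` and `expected` are names made up for the example:

```python
import codecs
import sys

s = '中'

# Pick the explicit-endian codec that matches this machine.
native_codec = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'

# codecs.BOM_UTF16 is the BOM written in the platform's native byte order.
expected = codecs.BOM_UTF16 + s.encode(native_codec)

print(sys.byteorder)
print(s.encode('utf-16') == expected)
```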

Source: http://www.unicode.org/faq/utf_bom.html

Name	UTF-8	UTF-16	UTF-16BE	UTF-16LE	UTF-32	UTF-32BE	UTF-32LE
Smallest code point	0000	0000	0000	0000	0000	0000	0000
Largest code point	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF	10FFFF
Code unit size	8 bits	16 bits	16 bits	16 bits	32 bits	32 bits	32 bits
Byte order	N/A	<BOM>	big-endian	little-endian	<BOM>	big-endian	little-endian
Fewest bytes per character	1	2	2	2	4	4	4
Most bytes per character	4	4	4	4	4	4	4
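
The "fewest" and "most bytes per character" rows can be spot-checked with a few sample characters. The sketch uses the explicit-endian codecs so that no BOM inflates the counts; the sample characters are chosen only for illustration:

```python
# U+0041, U+4E2D, and U+1F600 (outside the Basic Multilingual Plane)
samples = ['A', '中', '\U0001F600']
encodings = ['utf-8', 'utf-16-le', 'utf-32-le']

# Map each character to its encoded length in every encoding.
sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, row in sizes.items():
    print(f'U+{ord(ch):04X}', row)
```

UTF-8 ranges from 1 to 4 bytes, UTF-16 uses 2 or 4, and UTF-32 always uses 4.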

 
# With an explicit byte order, the encoded bytes carry no BOM
s.encode('utf-16be')
 
# Returns b'N-', i.e. b'\x4e\x2d'
# No BOM is included

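The same choice applies when saving a file:

with open('res.txt', 'wb') as f:
    f.write(s.encode('utf-16be'))  # written without a BOM; not recommended, UTF-16/UTF-32 should carry a BOM where possible
 
with open('res.txt', 'w', encoding='utf-16be') as f:
    f.write(s)  # s is a str (Unicode text), so it is encoded to bytes with `encoding` on write; as noted above, prefer a BOM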
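
For comparison, here is a sketch of the BOM-carrying variant: opening the file with `encoding='utf-16'` rather than `utf-16be`. It writes into a temporary directory so that it has no side effects; the path and variable names are illustrative:

```python
import codecs
import os
import tempfile

s = '中'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'res.txt')

    # encoding='utf-16' writes a BOM first, in the platform's native byte order.
    with open(path, 'w', encoding='utf-16') as f:
        f.write(s)

    # Look at the raw bytes that ended up on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Reading with 'utf-16' consumes the BOM and takes the byte order from it.
    with open(path, 'r', encoding='utf-16') as f:
        text = f.read()

has_bom = raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(has_bom, len(raw), text == s)  # True 4 True
```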
 

UTF-8 BOM

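UTF-8 also has a BOM, but including it is generally discouraged: UTF-8 has no byte-order ambiguity, so its BOM is just a fixed value (EF BB BF).

Text files saved by Windows software often carry a BOM by default, so take care when processing them. Reading BOM-prefixed UTF-8 in Python with the plain utf-8 codec gives a surprising result: the BOM survives as a leading '\ufeff' character. Python can handle such files; it simply treats the data as BOM-less UTF-8 unless told otherwise.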
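
The difference is easy to see at the bytes level. This short sketch assumes only the standard `utf-8` and `utf-8-sig` codecs and the `codecs.BOM_UTF8` constant:

```python
import codecs

# UTF-8 data with a BOM in front, as some editors save it.
data = codecs.BOM_UTF8 + '中'.encode('utf-8')  # b'\xef\xbb\xbf\xe4\xb8\xad'

plain = data.decode('utf-8')    # BOM survives as the character U+FEFF
sig = data.decode('utf-8-sig')  # BOM is recognized and stripped

print(repr(plain))  # '\ufeff中'
print(repr(sig))    # '中'

# utf-8-sig also accepts input that has no BOM at all.
no_bom = '中'.encode('utf-8').decode('utf-8-sig')
print(repr(no_bom))  # '中'
```

Because `utf-8-sig` tolerates both forms when decoding, it is a reasonable default for reading UTF-8 files of unknown origin.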

Bytes	Encoding Form
00 00 FE FF	UTF-32, big-endian
FF FE 00 00	UTF-32, little-endian
FE FF	UTF-16, big-endian
FF FE	UTF-16, little-endian
EF BB BF	UTF-8
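
The signature table above translates almost directly into a BOM sniffer. This is a minimal sketch, not a complete encoding detector; `sniff_bom` and `_BOMS` are hypothetical names introduced for the example:

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so it must be tested before it.
_BOMS = [
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


data = b'\xff\xfe\x2d\x4e'
encoding, skip = sniff_bom(data)
text = data[skip:].decode(encoding)
print(encoding, repr(text))  # utf-16-le '中'
```

One caveat: FF FE 00 00 could in principle also be UTF-16 little-endian text that starts with a NUL character, so the sketch settles that ambiguity in favor of UTF-32.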

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature, an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts. [AF]

Source: http://www.unicode.org/faq/utf_bom.html#bom1

In Python, the utf-8 codec writes no BOM, while utf-8-sig does. To open a UTF-8 file that has a BOM, specify the utf-8-sig encoding:

with open('with_bom_utf_8.txt', 'r', encoding='utf-8-sig') as f:
    s = f.read()  # decode the byte stream into a str of Unicode code points, dropping the leading BOM
 
with open('utf_8.txt', 'w', encoding='utf-8') as f:
    f.write(s)  # encode the Unicode characters to bytes with `encoding` (no BOM) and write them out
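
When the goal is only to drop the BOM and leave the rest of the file untouched, the same thing can be done on raw bytes. This is a sketch under the assumption that the file fits in memory; `strip_utf8_bom` is a hypothetical helper, and the demo writes to a temporary directory:

```python
import codecs
import os
import tempfile


def strip_utf8_bom(path):
    """Rewrite `path` without a leading UTF-8 BOM; return True if one was removed."""
    with open(path, 'rb') as f:
        raw = f.read()
    if not raw.startswith(codecs.BOM_UTF8):
        return False
    with open(path, 'wb') as f:
        f.write(raw[len(codecs.BOM_UTF8):])
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'with_bom_utf_8.txt')

    # utf-8-sig writes the BOM, giving the helper something to strip.
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('中')

    first = strip_utf8_bom(path)   # BOM present, so it is removed
    second = strip_utf8_bom(path)  # nothing left to remove

    with open(path, 'rb') as f:
        remaining = f.read()

print(first, second, remaining.hex())  # True False e4b8ad
```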

Posted 2020-08-22 08:11 by duohappy
