# The byte order mark problem in UTF-8 encoding

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.[1]

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.
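As a rough illustration of how a consumer might use the BOM to learn the producer's byte order, here is a minimal Python sketch. The helper name `sniff_byte_order` and the sample strings are hypothetical and chosen only for this example; the `codecs.BOM_*` constants come from the Python standard library.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_byte_order(data: bytes):
    """Return (encoding, payload_without_bom), or (None, data) if no BOM is found.

    Hypothetical helper for illustration; it only looks at UTF-16/UTF-32 BOMs.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data


# The same text written by a big-endian and a little-endian producer.
big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

for stream in (big, little):
    encoding, payload = sniff_byte_order(stream)
    print(encoding, payload.decode(encoding))
```

Once the encoding is known, the payload is decoded and the BOM is no longer needed, which matches the point above that the BOM matters for interchange rather than for internal processing.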

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
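The byte sequence and the mis-decoded characters can be reproduced with a few lines of Python. This is only a small demonstration using the standard `str.encode` and `bytes.decode` methods; the variable names are arbitrary.

```python
# Encode the BOM code point U+FEFF as UTF-8 and inspect the bytes.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes.hex())  # efbbbf

# Decode the same three bytes with a single-byte Western encoding,
# which is what a misconfigured editor or browser effectively does.
as_cp1252 = bom_bytes.decode("cp1252")
as_latin1 = bom_bytes.decode("iso-8859-1")
print(as_cp1252)  # ï»¿
print(as_cp1252 == as_latin1)  # True
```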

The Unicode Standard permits the BOM in UTF-8,[2] but neither requires nor recommends its use.[3] Byte order has no meaning in UTF-8,[4] so there the BOM serves only to identify a text stream or file as UTF-8.

Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files by default.

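In short: a BOM carries no byte-order meaning in UTF-8 and its use is not recommended, yet many Windows programs still save UTF-8 files with a BOM (that is, they prepend the three bytes 0xEF 0xBB 0xBF to the file). Windows Notepad is one of them.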
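One way to cope with such files in Python is the standard `utf-8-sig` codec, which drops a leading BOM when decoding and writes one when encoding. The sketch below is a minimal example under that assumption; the helper `strip_utf8_bom` and the sample content are hypothetical and not part of any library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Bytes as a BOM-writing editor might save them.
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain "utf-8" keeps the BOM as an invisible U+FEFF character in the text.
with_bom = raw.decode("utf-8")

# "utf-8-sig" removes a leading BOM if present and is harmless if absent.
without_bom = raw.decode("utf-8-sig")
also_fine = b"hello".decode("utf-8-sig")

print(repr(with_bom))
print(repr(without_bom))
print(strip_utf8_bom(raw))
```

When reading from disk, the same codec can be passed to `open(path, encoding="utf-8-sig")`, so files with and without a BOM are handled by one code path.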

posted @ 2016-04-11 16:01  WilliamHu