What's the difference between UTF-8 and UTF-8 without BOM?

What's the difference between UTF-8 and UTF-8 without BOM?

评论:

UTF-8 can be auto-detected better by contents than by BOM. The method is simple: try to read the file (or a string) as UTF-8 and if that succeeds, assume that the data is UTF-8. Otherwise assume that it is CP1252 (or some other 8 bit encoding). Any non-UTF-8 eight bit encoding will almost certainly contain sequences that are not permitted by UTF-8. Pure ASCII (7 bit) gets interpreted as UTF-8, but the result is correct that way too.

 

Scanning large files for UTF-8 content takes time. A BOM makes this process much faster. In practice you often need to do both. The culprit犯人 nowadays is that still a lot of text content isn't Unicode, and I still bump into tools that say they do Unicode (for instance UTF-8) but emit their content a different codepage.

 

UTF-8 does not have a BOM. When you put a U+FEFF code point at the start of a UTF-8 file, special care must be made to deal with it. This is just one of those Microsoft naming lies, like calling an encoding "Unicode" when there is no such thing.

 

 

回答1

The UTF-8 BOM is a sequence of Bytes at the start of a text-stream (EF BB BF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:   但是应该还是需要使用BOM,否则文件识别可能有问题

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

 

评论:

It might not be recommended but from my experience in Hebrew conversions the BOM is sometimes crucial for UTF-8 recognition in Excel, and may make the difference between Jibrish and Hebrew

 

It might not be recommended but it did wonders to my powershell script when trying to output "æøå"

 

 

回答2

The other excellent answers already answered that:

  • There is no official difference between UTF-8 and BOM-ed UTF-8
  • A BOM-ed UTF-8 string will start with the three following bytes. EF BB BF
  • Those bytes, if present, must be ignored when extracting the string from the file/stream.

But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...

For example, the data [EF BB BF 41 42 43] could either be:

  • The legitimate ISO-8859-1 string "ABC"
  • The legitimate UTF-8 string "ABC"

So while it can be cool to recognize the encoding of a file content by looking at the first bytes, you should not rely on this, as show by the example above

Encodings should be known, not divined.

评论

You understood correctly. The string [EF BB BF 41 42 43] is just a bunch of bytes. You need external information to choose how to interpret it. If you believe those bytes were encoded using ISO-8859-1, then the string is "ABC". If you believe those bytes were encoded using UTF-8, then it is "ABC". If you don't know, then you must try to find out. The BOM could be a clue. The absence of invalid character when decoded as UTF-8 could be another... In the end, unless you can memorize/find the encoding somehow, an array of bytes is just an array of bytes.

 

Convert a file to utf-8, but still shown as ascii

https://stackoverflow.com/questions/11303405/force-encode-from-us-ascii-to-utf-8-iconv

ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.

It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.

 


回答3

This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.

As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code, EF BB BF.

If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.

Why is a BOM not recommended?

  1. Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
  2. It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)

When should you encode with a BOM?

If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.

 

 

 不建议使用BOM

https://github.com/PowerShell/PowerShell/issues/15230

Yea it’s not recommended because a lot of tools on Linux don’t support the BOM for UTF-8.

It’s really a mixed bag and you can probably safely use just utf-8 without a BOM but it’s highly dependent on your environment and what is consuming the files.

 

 https://en.wikipedia.org/wiki/Byte_order_mark

UTF-8

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF.

The Unicode Standard permits the BOM in UTF-8,[3] but does not require or recommend its use.[4] Byte order has no meaning in UTF-8,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[6][7] The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[8]

Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-ASCII bytes in string literals but not at the start of the file.

UTF-8 is a sparse encoding in the sense that a large fraction of possible byte combinations do not result in valid UTF-8 text. Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8. Practically the only exceptions to that are when the text consists purely of ASCII-range bytes. Because all modern encodings use ASCII-range bytes to represent ASCII characters, ASCII-only text can be safely interpreted as UTF-8 regardless of what encoding was intended by the system that emitted the bytes. Because of these considerations, heuristic analysis can detect with high confidence whether UTF-8 is in use, without requiring a BOM.

Microsoft compilers[9] and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Windows PowerShell (up to 5.1) will add a BOM when it saves UTF-8 XML documents. However, PowerShell Core 6 has added a -Encoding switch on some cmdlets called utf8NoBOM so that document can be saved without BOM. Google Docs also adds a BOM when converting a document to a plain text file for download.

 

 

 

posted @ 2019-01-21 13:32  ChuckLu  阅读(820)  评论(0编辑  收藏  举报