# Character Encoding Revisited

"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.

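Unicode is essentially a logical set of code points. Each code point corresponds to one of the smallest basic units of a writing system, such as a single Chinese character or a letter of an alphabet. This is a purely logical mapping: a code point is just a number (1, 2, 3, ...) in the range U+0000 to U+10FFFF.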

UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.

UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.

UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)

UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
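
The surrogate pairs mentioned under UTF-16 follow a fixed arithmetic rule, which the sketch below illustrates. The function name `to_surrogate_pair` is made up for this example, and the sketch assumes the caller only passes code points above U+FFFF.

```cpp
#include <cstdint>
#include <utility>

// Split a code point outside the BMP into a UTF-16 surrogate pair.
// Precondition (not checked here): 0x10000 <= cp <= 0x10FFFF.
std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp) {
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;   // 20-bit offset
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));    // top 10 bits
    const char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return {high, low};
}
```

For example, the emoji U+1F600 becomes the two code units 0xD83D 0xDE00.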

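The UTF family are the real "encodings": they convert the code points defined by Unicode into a binary form used for storage, which is closer to the actual physical representation. Take the Korean character "한" as an example. Its Unicode code point is U+D55C, which is just a number: 54620 in decimal, or 1101 0101 0101 1100 in binary. All of this is still at the logical level. Encoded as UTF-8, the physical binary representation becomes 11101101 10010101 10011100, that is, the bytes 237 149 156 in decimal (355 225 234 in octal) or ED 95 9C in hexadecimal. These are the values actually stored in the computer, the real encoded values.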
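
The step from code point to UTF-8 bytes can be sketched in a few lines. This is a minimal illustration, not a validating encoder: `encode_utf8` is a name chosen for this example, and it does not reject surrogates or values above U+10FFFF.

```cpp
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
// Sketch only: invalid inputs (surrogates, values > 0x10FFFF) are not rejected.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encode_utf8(0xD55C)` yields the three bytes ED 95 9C, matching the example above.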

ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.

ASCII is the simplest encoding of all, so there is not much more to say about it.

ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" which is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.

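The term "ANSI" comes up a lot; it mostly refers to the default encoding of the local system. On a Simplified Chinese system, for example, that means GB2312 (in practice its superset GBK, Windows code page 936).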

Stack Overflow also has a very good answer on this topic: https://stackoverflow.com/questions/402283/stdwstring-vs-stdstring

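To convert between UTF-8 and UTF-16, you can use the codecvt facilities introduced in C++11 (note that they are deprecated since C++17). See https://stackoverflow.com/questions/4804298/how-to-convert-wstring-into-string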
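
As a rough sketch of that codecvt-based conversion, the two helpers below wrap `std::wstring_convert` with `std::codecvt_utf8_utf16<char16_t>`. The helper names are invented for this example, and since `<codecvt>` is deprecated since C++17, a compiler may emit deprecation warnings for it.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes -> UTF-16 code units.
// std::wstring_convert throws std::range_error on malformed input.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// UTF-16 code units -> UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

Using `char16_t` rather than `wchar_t` keeps the result UTF-16 on every platform, since `wchar_t` is 16 bits on Windows but usually 32 bits elsewhere.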

The GB2312 character set is a subset of the Unicode character set. This means that every character defined in GB2312 is also defined in Unicode.

However, GB2312 codes and Unicode codes are totally unrelated. For example, the GB2312 character with code value 0xB0A1 has the Unicode code value 0x554A. There is no mathematical formula to convert a GB2312 code to the Unicode code of the same character, so conversion has to go through a lookup table.
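
Because no formula exists, a converter is essentially a table lookup. The sketch below shows the shape of such a lookup with a deliberately tiny table: the first entry is the pair quoted above, the second (0xB0A2 for 阿, U+963F) is added here for illustration, and `gb2312_to_unicode` is a hypothetical name. A real converter would load the complete mapping table.

```cpp
#include <cstdint>
#include <unordered_map>

// Look up the Unicode code point for a two-byte GB2312 code.
// Returns 0 when the code is missing from this two-entry excerpt.
char32_t gb2312_to_unicode(std::uint16_t gb) {
    static const std::unordered_map<std::uint16_t, char32_t> table = {
        {0xB0A1, 0x554A},  // 啊, the pair quoted in the text
        {0xB0A2, 0x963F},  // 阿, added for illustration
    };
    const auto it = table.find(gb);
    return it == table.end() ? U'\0' : it->second;
}
```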

An encoded byte stream (UTF-16 in particular) should be written to a file opened in binary mode; it can be corrupted if the file is opened in text mode.

The NL->CRLF conversion performed in text mode isn't going to do pretty things to UTF-16 files, since it will insert one byte 0x0D instead of the two-byte code unit 0x000D (bytes 0x00 0x0D in big-endian order).
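
One way to avoid that corruption is to serialize the code units by hand into a stream opened with `std::ios::binary`. The following is a minimal sketch under that assumption; `write_utf16le` is a name invented here, and the choice of little-endian order with a BOM is just one common convention.

```cpp
#include <fstream>
#include <string>

// Write UTF-16 code units as little-endian bytes, preceded by a BOM.
// Returns false if the file could not be opened or written.
bool write_utf16le(const std::string& path, const std::u16string& text) {
    std::ofstream out(path, std::ios::binary);  // binary: no NL -> CRLF translation
    if (!out) return false;
    out.put('\xFF').put('\xFE');                // BOM U+FEFF in little-endian byte order
    for (char16_t unit : text) {
        out.put(static_cast<char>(unit & 0xFF));         // low byte first
        out.put(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    }
    return static_cast<bool>(out.flush());
}
```

With binary mode, a newline code unit 0x000A is written as exactly the two bytes 0x0A 0x00, and nothing is inserted in between.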

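    #include <codecvt>  // std::codecvt_utf8 (deprecated since C++17)
    #include <fstream>
    #include <iostream>
    #include <locale>
    #include <string>
    int main() {
        std::wstring str = L"\uD55C";  // sample text; the original snippet assumes `str` already exists
        std::locale loc(std::locale(), new std::codecvt_utf8<wchar_t>);  // facet: wchar_t -> UTF-8
        std::wofstream ofs("test.txt");
        ofs.imbue(loc);  // imbue before writing so the facet applies to all output
        std::cout << "Writing to file (UTF-8)... ";
        ofs << str;
        std::cout << "done!\n";
    }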

On Windows, writing UTF-8 output to a text file poses no problem at all: simply write a std::string that holds the UTF-8 bytes through a regular ofstream.
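
A minimal sketch of that approach follows, assuming the `std::string` already contains valid UTF-8 bytes. The function name `write_utf8` is chosen for this example.

```cpp
#include <fstream>
#include <string>

// Write a std::string that already holds UTF-8 bytes.
// A narrow ofstream copies the bytes without re-encoding them.
bool write_utf8(const std::string& path, const std::string& utf8) {
    std::ofstream out(path);  // plain narrow stream, no imbue needed
    if (!out) return false;
    out << utf8;
    return static_cast<bool>(out.flush());
}
```

Text mode is harmless here because the NL->CRLF translation only adds the single byte 0x0D, which is itself valid UTF-8.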

posted @ 2020-03-03 00:48  IT屁民