MySQL 学习笔记(六)---character 与 collation

1.character 与 collation

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.

collation 命名:A collation name starts with the name of the character set with which it is associated, generally followed by one or more suffixes indicating other collation characteristics. But the binary character set has a single collation, also named binary, with no suffixes.A language-specific collation includes a language name.Collation suffixes indicate whether a collation is case-sensitive, accent-sensitive(声调敏感), or kana-sensitive (or some combination thereof), or binary.For nonbinary collation names that do not specify accent sensitivity, it is determined by case sensitivity.

可能用到的SQL语句

-- 查看可用的Character
SHOW CHARACTER SET;
-- 支持模糊查询
SHOW CHARACTER SET LIKE 'latin%'; ##例如查看latin开头的

-- 查看Collations
SHOW COLLATION;
-- 支持模糊查询
SHOW COLLATION WHERE Charset = 'latin1';

2.UTF-8 for Metadata

MySQL stores metadata in a Unicode character set, namely UTF-8. This does not cause any disruption if you never use accented or non-Latin characters. But if you do, you should be aware that metadata is in UTF-8.

The server sets the character_set_system system variable to the name of the metadata character set:

 SHOW VARIABLES LIKE 'character_set_system';

为:utf8.

3.字符集的设置也会影响到客户端

Character set issues affect not only data storage, but also communication between client programs and the MySQL server. If you want the client program to communicate with the server using a character set different from the default, you'll need to indicate which one.

4.four levels

There are default settings for character sets and collations at four levels: server, database, table, and column. 

Server

MySQL Server has a server character set and a server collation.They can be set explicitly at server startup on the command line or in an option file and changed at runtime.

Database

Every database has a database character set and a database collation. The CREATE DATABASE and ALTER DATABASE statements have optional clauses for specifying the database character set and collation.(All database options are stored in a text file named db.opt that can be found in the database directory.).

Table

Every table has a table character set and a table collation. The CREATE TABLE and ALTER TABLE statements have optional clauses for specifying the table character set and collation.The table character set and collation are used as default values for column definitions if the column character set and collation are not specified in individual column definitions. 

column

Every “character” column (that is, a column of type CHAR, VARCHAR, a TEXT type, or any synonym) has a column character set and a column collation. Column definition syntax for CREATE TABLE and ALTER TABLE has optional clauses for specifying the column character set and collation.If you use ALTER TABLE to convert a column from one character set to another, MySQL attempts to map the data values, but if the character sets are incompatible, there may be data loss.

5.Connection Character Sets and Collations

A “connection” is what a client program makes when it connects to the server, to begin a session within which it interacts with the server. The client sends SQL statements, such as queries, over the session connection. The server sends responses, such as result sets or error messages, over the connection back to the client.

Additional character set and collation system variables are involved in handling traffic for the connection between a client and the server. Every client has session-specific connection-related character set and collation system variables. These session system variable values are initialized at connect time, but can be changed within the session.

The server takes the character_set_client system variable to be the character set in which statements are sent by the client.

6.Client Program Connection Character Set Configuration

When a client connects to the server, it indicates which character set it wants to use for communication with the server. (Actually, the client indicates the default collation for that character set, from which the server can determine the character set.) The server uses this information to set the character_set_client, character_set_results, character_set_connection system variables to the character set, and collation_connection to the character set default collation. In effect, the server performs the equivalent of a SET NAMES operation.

If the server does not support the requested character set or collation, it falls back to using the server character set and collation to configure the connection.

7.Character Set for Error Message Disposition

If the server writes the error message to the error log, it writes it in UTF-8, as constructed, without conversion to another character set.If the server sends the error message to a client program, the server converts it from UTF-8 to the character set specified by the character_set_results system variable.

8.The binary Collation Compared to _bin Collations

binary Collation

Binary strings (as stored using the BINARY, VARBINARY, and BLOB data types) have a character set and collation named binary. Binary strings are sequences of bytes and the numeric values of those bytes determine comparison and sort order.

_bin Collations

Nonbinary strings (as stored using the CHAR, VARCHAR, and TEXT data types) have a character set and collation other than binary. A given nonbinary character set can have several collations, each of which defines a particular comparison and sort order for the characters in the set. One of these is the binary collation, indicated by a _bin suffix in the collation name. For example, the binary collation for utf8 and latin1 is named utf8_bin and latin1_bin, respectively.

两者的区别

(1)The Unit for Comparison and Sorting

Binary strings are sequences of bytes.Nonbinary strings are sequences of characters, which might be multibyte.

(2)Character Set Conversion

For binary string columns, no conversion occurs.A nonbinary string has a character set and is automatically converted to another character set in many cases, even when the string has a _bin collation.

(3)Lettercase Conversion

Collations for nonbinary character sets provide information about lettercase of characters, so characters in a nonbinary string can be converted from one lettercase to another, even for _bin collations that ignore lettercase for ordering.The concept of lettercase does not apply to bytes in a binary string.例如LOWER(); UPPER();函数。

(4)Trailing Space Handling in Comparisons

Nonbinary strings have PAD SPACE behavior for all collations, including _bin collations. Trailing spaces are insignificant in comparisons(对结尾的空格不敏感).For binary strings, all bytes are significant in comparisons, including trailing spaces(对结尾的空格敏感)。

(5)Trailing Space Handling for Inserts and Retrievals

CHAR(N) columns store nonbinary strings N characters long. For inserts, values shorter than N characters are extended with spaces. For retrievals, trailing spaces are removed. BINARY(N) columns store binary strings N bytes long. For inserts, values shorter than N bytes are extended with 0x00 bytes. For retrievals, nothing is removed; a value of the declared length is always returned.

9.Unicode Support

The UTF-8 (Unicode Transformation Format with 8-bit units) method for encoding Unicode data is implemented according to RFC 3629, which describes encoding sequences that take from one to four bytes. The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:

• Basic Latin letters, digits, and punctuation signs use one byte.

• Most European and Middle East script letters fit into a 2-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.

• Korean, Chinese, and Japanese ideographs use 3-byte or 4-byte sequences.

10.MySQL supports these Unicode character sets

• utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.

• utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.

• utf8: An alias for utf8mb3.

• ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character.

• utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like ucs2 but with an extension for supplementary characters.

• utf16le: The UTF-16LE encoding for the Unicode character set. Like utf16 but little-endian rather than big-endian.

• utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.

Character Set Supported Characters Required Storage Per Character
utf8mb3, utf8 BMP only 1, 2, or 3 bytes
ucs2 BMP only 2 bytes
utf8mb4 BMP and supplementary 1, 2, 3, or 4 bytes
utf16 BMP and supplementary 2 or 4 bytes
utf16le BMP and supplementary 2 or 4 bytes
utf16le BMP and supplementary 4 bytes

注意:Characters outside the BMP compare as REPLACEMENT CHARACTER and convert to '?' when converted to a Unicode character set that supports only BMP characters (utf8mb3 or ucs2).

11.The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)

utfmb4的特征:

(1)Supports BMP and supplementary characters.

(2)Requires a maximum of four bytes per multibyte character.

utf8mb4 contrasts with the utf8mb3 character set, which supports only BMP characters and uses a maximum of three bytes per character: 

(3)For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.

(4)For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there will be none.

utf8mb4 is a superset of utf8mb3.

12.The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding)

utf8mb3特征

(1)Supports BMP characters only (no support for supplementary characters).

(2)Requires a maximum of three bytes per multibyte character.

13.Converting Between 3-Byte and 4-Byte Unicode Character Sets

Conversion from utf8mb3 to utf8mb4 presents no problems,but these are the primary potential incompatibilities:

• For the variable-length character data types (VARCHAR and the TEXT types), the maximum permitted length in characters is less for utf8mb4 columns than for utf8mb3 columns.

• For all character data types (CHAR, VARCHAR, and the TEXT types), the maximum number of characters that can be indexed is less for utf8mb4 columns than for utf8mb3 columns.

Consequently, to convert tables from utf8mb3 to utf8mb4, it may be necessary to change some column or index definitions.

举例说明

A TINYTEXT column can hold up to 255 bytes, so it can hold up to 85 3-byte or 63 4-byte characters. Suppose that you have a TINYTEXT column that uses utf8mb3 but must be able to contain more than 63 characters. You cannot convert it to utf8mb4 unless you also change the data type to a longer type such as TEXT.

Similarly, a very long VARCHAR column may need to be changed to one of the longer TEXT types if you want to convert it from utf8mb3 to utf8mb4.

InnoDB has a maximum index length of 767 bytes for tables that use COMPACT or REDUNDANT row format, so for utf8mb3 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8mb3 columns with indexes longer than 191 characters, you must index a smaller number of characters.

14 Restrictions on ucs2, utf16, utf16le, and utf32 Character Sets

• None of them can be used as the client character set.

• It is currently not possible to use LOAD DATA to load data files that use these character sets.

• FULLTEXT indexes cannot be created on a column that uses any of these character sets.

• The use of ENCRYPT() with these character sets is not recommended because the underlying system call expects a string terminated by a zero byte.

 

--------笔记

posted @ 2021-11-09 22:13  东山絮柳仔  阅读(475)  评论(0编辑  收藏  举报