When does one character equal two characters?

1. The question raised by ß and ss

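This started while I was looking up the difference between the MySQL collations utf8_general_ci and utf8_unicode_ci and came across a Stack Overflow answer that surprised me: when comparing and sorting, the utf8_unicode_ci collation treats ß and ss as exactly equal, as shown in the figure: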

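The answer in the figure says that treating ß and ss as equal is what "many people normally want". To me they could not possibly be equal. Quite apart from looking different, ß is one character and ss is two, so how can they compare equal? Yet in some countries and regions, ß == ss is apparently the expected result.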

Likewise, if you search within a page in Chrome for the character ß, occurrences of ss are matched as well. Even Chrome treats ß and ss as equal, as shown in the figure:

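One more experiment: create two tables with identical structure, one using the utf8_general_ci collation and the other utf8_unicode_ci, and insert into each table one row whose name is ß and one row whose name is ss. Then run a query; the results are shown in the figure: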

In the table that uses utf8_unicode_ci, querying for name = 'ss' also returns the row whose name is ß. This suggests that the reason the comparison ß == ss is true has something to do with Unicode.

2. What is ß

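My first guess was that ß is a ligature composed of s + ZWJ + s, which would explain why ß equals ss. But when I looked the character up, it turned out to contain no ZWJ (zero width joiner); it is a single character in its own right (U+00DF, LATIN SMALL LETTER SHARP S). Later I found some related material, as shown in the figure: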
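The single-character claim can be checked by inspecting code points. The following is a minimal Python sketch that uses only the standard library; the variable names are illustrative, and the results depend on the Unicode data bundled with the interpreter.

```python
import unicodedata

sharp_s = "\u00df"  # ß
ZWJ = "\u200d"      # ZERO WIDTH JOINER

# ß is a single code point, not a sequence joined by ZWJ.
code_points = [f"U+{ord(ch):04X}" for ch in sharp_s]
print(code_points)                # ['U+00DF']
print(unicodedata.name(sharp_s))  # LATIN SMALL LETTER SHARP S
print(ZWJ in sharp_s)             # False

# It has no decomposition either, so Unicode normalization alone
# never turns ß into "ss".
has_decomposition = unicodedata.decomposition(sharp_s) != ""
nfkd_unchanged = unicodedata.normalize("NFKD", sharp_s) == sharp_s
print(has_decomposition, nfkd_unchanged)  # False True
```

So whatever makes ß compare equal to ss, it is not that ß is secretly two characters.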

The figure says that in the en-US and de-DE cultures ss equals ß, and that ß is the German Eszett. So now we know that ß is a German letter.

3. Why ß equals ss

Now that we know ß is a German letter, why does ß equal ss?

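As noted above, ss equals ß in the en-US and de-DE cultures, and the fact that this comparison is true has something to do with Unicode. So I went through the relevant parts of the Unicode documentation and found what I believe answers the question.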

3.1 case

First, look at the three cases:

case	example
upper case	THIS IS UPPER CASE
lower case	this is lower case
title case	This Is Title Case

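Letter case exists only in some writing systems, for example the Latin alphabet (used by English, German and many other languages), Greek and Cyrillic. Chinese characters, Korean Hangul and Japanese kana have no notion of upper or lower case.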
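The three cases, and the fact that uncased scripts are unaffected by them, can be seen with Python's built-in string methods. This is only an illustrative sketch; the sample strings are made up for the example.

```python
sample = "this is lower case"

upper = sample.upper()
lower = sample.lower()
title = sample.title()
print(upper)  # THIS IS LOWER CASE
print(lower)  # this is lower case
print(title)  # This Is Lower Case

# Chinese characters have no case, so every conversion leaves them unchanged.
cjk = "汉字"
cjk_unchanged = cjk.upper() == cjk.lower() == cjk.title() == cjk
print(cjk_unchanged)  # True

# isupper()/islower() are both False when a string has no cased characters.
print(cjk.isupper(), cjk.islower())  # False False
```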

3.2 case mapping

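For characters that have case, Unicode defines the mappings code point -> upper case, code point -> lower case and code point -> title case. Converting a character to a given case using these mappings is called case mapping. The mappings are stored in the UCD (Unicode Character Database); for example, U+0061 (a) has the uppercase mapping U+0041 (A).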
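Python's str methods apply the case mappings of the Unicode version bundled with the interpreter, so they give a convenient way to look at individual entries. The helper case_mappings below is a hypothetical name introduced for this sketch, not part of any library.

```python
def case_mappings(ch: str) -> dict:
    """Return the upper, lower and title case forms of a single character."""
    return {"upper": ch.upper(), "lower": ch.lower(), "title": ch.title()}

print(case_mappings("a"))  # {'upper': 'A', 'lower': 'a', 'title': 'A'}

# U+01C6 (ǆ) is a character whose three case forms are all different:
# upper is U+01C4 (Ǆ), title is U+01C5 (ǅ).
dz = case_mappings("\u01c6")
dz_code_points = [f"U+{ord(c):04X}" for c in dz.values()]
print(dz_code_points)  # ['U+01C4', 'U+01C6', 'U+01C5']

# A case mapping is not always one character to one character:
# the uppercase mapping of ß is the two-character string "SS".
sharp_s_upper = case_mappings("\u00df")["upper"]
print(sharp_s_upper)  # SS
```

The title case form only differs from upper case for a handful of characters such as ǆ, which is why it is a separate mapping.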

3.3 case folding

There are many definitions of case folding online:

wikipedia : case folding is the conversion of letter case in a string
igi-global : The process of converting all the characters in a document into the same case, either all upper case or lower case, in order to speed up comparisons during the indexing process
w3.org : case folding is the process of making two texts which differ only in case identical for comparison purposes, that is, it is meant for the purpose of string matching

Roughly speaking, case folding is the process of converting all letters in a string to a single case so that strings can be compared.
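As a small illustration of folding for comparison purposes, Python exposes case folding as str.casefold(). The function name equals_ignoring_case is made up for this sketch.

```python
def equals_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after case folding both of them."""
    return a.casefold() == b.casefold()

print(equals_ignoring_case("Hello", "HELLO"))     # True
print(equals_ignoring_case("Straße", "STRASSE"))  # True
print(equals_ignoring_case("ß", "ss"))            # True

# Plain lowercasing is not enough here: lower() leaves ß untouched.
lower_equal = "ß".lower() == "ss".lower()
print(lower_equal)  # False
```

Note that this is a caseless comparison of strings in Python, not a reproduction of how a MySQL collation compares values.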

3.4 case fold mapping

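Case mapping defines upper, lower and title case mappings for characters, while case fold mapping defines their case folding. Most characters have no case at all, so they need no case fold mapping. For those that do have case, most mappings are one-to-one.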

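One-to-one here means that a single character maps to a single character. This kind of mapping is called a common case fold mapping; for example, U+0041 (A) folds to U+0061 (a).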

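Some special characters, however, fold to two characters. The German letter ß, for example, folds to ss. (ß is itself already a lowercase letter; its uppercase mapping is SS.) A special mapping such as ß -> ss is called a full case fold mapping.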
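To make the difference between common and full mappings concrete, here is a sketch that parses a few lines written in the format of the UCD file CaseFolding.txt and applies them. The names EXCERPT, load_folds and fold are hypothetical helpers for this example, and the four-line excerpt is nowhere near the complete table; real code should rely on a library such as str.casefold().

```python
# A few lines in the format of the UCD file CaseFolding.txt:
#   <code>; <status>; <mapping>; # <name>
# Status C = common, F = full, S = simple (T, Turkic, is left out here).
EXCERPT = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""

def load_folds(text, statuses):
    """Build {char: folded string} from the lines with the given statuses."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#")[0].split(";")
        code, status, mapping = (f.strip() for f in fields[:3])
        if status in statuses:
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return table

def fold(s, table):
    """Fold each character; characters not in the table map to themselves."""
    return "".join(table.get(ch, ch) for ch in s)

full = load_folds(EXCERPT, {"C", "F"})    # full case folding
simple = load_folds(EXCERPT, {"C", "S"})  # simple case folding

print(fold("Aß", full))    # ass
print(fold("Aß", simple))  # aß
print(fold("ß", full) == fold("ss", full))  # True
```

Full case folding combines the C and F lines and may change the length of a string, while simple case folding combines C and S and always keeps one character per character. Under the full mapping ß and ss fold to the same string.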

As for why Unicode defines the mapping ß : ss in the first place, I did not look into it in detail; going any further would really be a matter of de-DE culture.

In the end I still chose utf8_general_ci, because to me (zh-CN culture) ß == ss is really very strange.

Posted on 2018-11-04 00:29 by delta_lt_0
