多维数据查询效率分析（2） - taowen

公告

上次我们分析了在附加属性表这样表结构设计下的PostgreSQL查询效率。由于PostgreSQL众所周知的所谓“性能”问题，所以有必要再用使用MyISAM引擎的MySQL再来实验一遍。在我们详细分析了两种常见的开源数据库之后，话题将会进一步引申到按行存储的数据库结构以及索引对于查询效率的影响。以下实验中的MySQL为MariaDB发行版本。还是从建表开始：

MariaDB [veil]> show create table cc2;
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                         |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| cc2   | CREATE TABLE `cc2` (
  `contact_id` int(11) NOT NULL,
  `cad_id` int(11) NOT NULL,
  `value` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [veil]> select count(*) from cc2;
+----------+
| count(*) |
+----------+
|  5904385 |
+----------+
1 row in set (0.00 sec)

MySQL在表上缓存了count(*)的结果，所以查询是不需要花费时间的。再来创建必要的索引：

MariaDB [veil]> create index contact_id_idx on cc2(contact_id);
Query OK, 5904385 rows affected (3.11 sec)             
Records: 5904385  Duplicates: 0  Warnings: 0

MariaDB [veil]> create index cad_id_value_idx on cc2(cad_id, value);
Query OK, 5904385 rows affected (8.17 sec)             
Records: 5904385  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from cc2 where cad_id = 101 and value = 5;
+----------+
| count(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.18 sec)

我们可以看到，这速度那是刚刚的。比较PostgreSQL中的相同的查询，速度要快上4倍。但是：

MariaDB [veil]> select count(contact_id) from cc2 where cad_id = 101 and value = 5;
+-------------------+
| count(contact_id) |
+-------------------+
|            998839 |
+-------------------+
1 row in set (0.41 sec)

这个查询比前一个慢，是因为count(*)是数返回行的rowid，而count(contact_id)是数真正的contact_id列，而这个列的值是不包含在cad_id_value_idx中的，如果创建更多的索引的话，速度就要更快一些了：

MariaDB [veil]> create index cad_id_value_contact_id_idx on cc2(cad_id, value, contact_id);
Query OK, 5904385 rows affected (13.37 sec)            
Records: 5904385  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(contact_id) from cc2 where cad_id = 101 and value = 5;
+-------------------+
| count(contact_id) |
+-------------------+
|            998839 |
+-------------------+
1 row in set (0.21 sec)

这个在MySQL中被称作covering index。大概PostgreSQL 9.2还没发布的index only query也是这个意思吧，我猜的。
现在，让我们来看看MySQL是否能够在两个条件的情况下表现得比PostgreSQL强。首先尝试INTERSECT吧：

MariaDB [veil]> select count(*) from (select contact_id from cc as a1 where a1.cad_id = 101 and a1.value = 5 intersect select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7) as temp;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'intersect select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7' at line 1

我靠，MySQL居然不支持INTERSECT。由于我们知道IN + SUBQUERY肯定是更慢的，所以就只剩INNER JOIN这一种写法了。

MariaDB [veil]> select count(*) from cc as a1, cc as a2 where a1.contact_id = a2.contact_id and a1.cad_id = 101 and a1.value = 5 and a2.cad_id = 102 and a2.value = 7;
+----------+
| count(*) |
+----------+
|   164788 |
+----------+
1 row in set (6.56 sec)

MariaDB [veil]> select count(a1.contact_id) from cc as a1, cc as a2 where a1.contact_id = a2.contact_id and a1.cad_id = 101 and a1.value = 5 and a2.cad_id = 102 and a2.value = 7;
+----------------------+
| count(a1.contact_id) |
+----------------------+
|               164788 |
+----------------------+
1 row in set (6.67 sec)

你没看错，这结果就是这么惨不忍睹。它甚至比PostgreSQL用IN + SUBQUERY实现得还要慢。看看到底是啥状况吧：

MariaDB [veil]> explain extended select count(a1.contact_id) from cc as a1, cc as a2 where a1.contact_id = a2.contact_id and a1.cad_id = 101 and a1.value = 5 and a2.cad_id = 102 and a2.value = 7;
+----+-------------+-------+------+------------------------------------+-------------------------+---------+--------------------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys                      | key                     | key_len | ref                            | rows   | filtered | Extra                    |
+----+-------------+-------+------+------------------------------------+-------------------------+---------+--------------------------------+--------+----------+--------------------------+
|  1 | SIMPLE      | a1    | ref  | contact_id,cad_id_value_contact_id | cad_id_value_contact_id | 10      | const,const                    | 808542 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | a2    | ref  | contact_id,cad_id_value_contact_id | cad_id_value_contact_id | 15      | const,const,veil.a1.contact_id |      1 |   100.00 | Using index              |
+----+-------------+-------+------+------------------------------------+-------------------------+---------+--------------------------------+--------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)

虽然不像PostgreSQL的分析结果那般详细。但是至少还是可以看出，索引确实是被利用上了。所以我感觉，MySQL真的不擅长复杂的join。这就是问题了，join处理不好，多条件的情况就没法支持好了。

对于我们的第一次尝试，可以总结出以下几点：

基于磁盘的方案行不通，必须放在内存中
无条件：PostgreSQL慢，MySQL慢
一个条件：PostgreSQL快，MySQL更快（拜covering index所赐）
两个条件：PostgreSQL慢，MySQL更慢。INNER JOIN稍慢于INTERSECT。
三个条件：PostgreSQL快于MySQL，而且INNER JOIN要快于INTERSECT。

总体来说，要把请求响应时间控制在一秒之内那难度是相当的大啊。所以建模思路必须变化。那我们再来看看把行变列会不会好一些吧：

MariaDB [veil]> show create table spike;
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                         |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| spike | CREATE TABLE `spike` (
  `contact_id` int(11) NOT NULL,
  `a1` int(11) DEFAULT NULL,
  `a2` int(11) DEFAULT NULL,
  `a3` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [veil]> select * from spike limit 10;
+------------+------+------+------+
| contact_id | a1   | a2   | a3   |
+------------+------+------+------+
|     800001 |    4 | NULL | NULL |
|     800003 |    5 |    7 |    1 |
|     800004 |    3 | NULL | NULL |
|     800005 |    3 | NULL | NULL |
|     800006 |    5 | NULL | NULL |
|     800007 | NULL |    6 | NULL |
|     800008 |    5 | NULL | NULL |
|     800009 |    4 |    8 | NULL |
|     800011 | NULL |    7 | NULL |
|     800012 |    3 |    6 |    1 |
+------------+------+------+------+
10 rows in set (0.01 sec)

列a1来自于cad_id=101的值，a2来自于cad_id=102的值，a3来自于cad_id=6的值。

首先试试一个条件的性能吧：

MariaDB [veil]> select count(*) from spike where a1 = 5;
+----------+
| count(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.26 sec)

速度不错，即便我们还没给a1加索引呢。这是因为PostgreSQL和MySQL这样的按行存储的数据库，在需要做sequential scan的时候，性能很大程度上决定于行的大小。如果我们有100行，这性能就肯定不会这么好了。不信的话可以试试：

MariaDB [veil]> alter table spike add column dummy1 TEXT;
Query OK, 4074980 rows affected (1.28 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5;
+----------+
| count(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.64 sec)

不出意料，仅仅只是添加了一个TEXT字段，查询就慢了这么多。为了克服按行存储的查询速度随行的列数增加和越来越慢的问题，索引变得特别必要：

MariaDB [veil]> create index a1_idx on spike(a1);        
Query OK, 4074980 rows affected (4.97 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5;
+----------+
| count(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.22 sec)

我们可以看到，查询时间又变得正常了。因为a1_idx是一个covering index，所以count无需真的读取行，只需要查询索引就可以知道有多少行了。而且MYISAM又不像PostgreSQL那样，因为MVCC还需要对原始行更新hint bit，所以整个查询就和行的大小一点关系都没有了。

MariaDB [veil]> alter table spike add column dummy2 TEXT;
Query OK, 4074980 rows affected (5.00 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> alter table spike add column dummy3 TEXT;
Query OK, 4074980 rows affected (5.30 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5;
+----------+
| count(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.23 sec)

对吧，没骗你吧。
那么两个条件如何？先把表结构回到正常的状态，把索引都去掉：

MariaDB [veil]> alter table spike drop column dummy1;
Query OK, 4074980 rows affected (5.19 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> alter table spike drop column dummy2;
Query OK, 4074980 rows affected (4.98 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> alter table spike drop column dummy3;
Query OK, 4074980 rows affected (3.41 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> alter table spike drop index a1_idx;
Query OK, 4074980 rows affected (0.46 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> show create table spike;             
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                         |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| spike | CREATE TABLE `spike` (
  `contact_id` int(11) NOT NULL,
  `a1` int(11) DEFAULT NULL,
  `a2` int(11) DEFAULT NULL,
  `a3` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

接下来用两个条件做查询：

MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.30 sec)

速度不错嘛！给a1创建一个索引，应该速度会更快吧：

MariaDB [veil]> create index a1_idx on spike(a1);                  
Query OK, 4074980 rows affected (3.06 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.45 sec)

尼玛居然更慢了……原因不是索引没有被使用，而是因为用了索引所以更慢了

MariaDB [veil]> explain select count(*) from spike where a1 = 5 and a2 = 7;
+----+-------------+-------+------+---------------+--------+---------+-------+---------+-------------+
| id | select_type | table | type | possible_keys | key    | key_len | ref   | rows    | Extra       |
+----+-------------+-------+------+---------------+--------+---------+-------+---------+-------------+
|  1 | SIMPLE      | spike | ref  | a1_idx        | a1_idx | 5       | const | 1420338 | Using where |
+----+-------------+-------+------+---------------+--------+---------+-------+---------+-------------+
1 row in set (0.01 sec)

a1索引确实被使用了，但是a2并没有在同一个索引之中。所以这不是一个covering index，查询就不能仅仅在索引中查询了，还要回到原始的行存储中去检查a2=7。

MariaDB [veil]> create index a2_idx on spike(a2);
Query OK, 4074980 rows affected (5.71 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.40 sec)

给a2_idx创建了一个单独的索引也于事无补，因为在扫描a1_idx的时候，还是没有a2。查询仍然需要回到原始的行存储中去获得a2。

MariaDB [veil]> create index a1_a2_idx on spike(a1, a2);
Query OK, 4074980 rows affected (9.47 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.06 sec)

现在，我们才看到什么叫做速度！它是完全基于索引的。

MariaDB [veil]> alter table spike add column dummy1 TEXT;
Query OK, 4074980 rows affected (12.33 sec)            
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> alter table spike add column dummy2 TEXT;
Query OK, 4074980 rows affected (12.87 sec)            
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> alter table spike add column dummy3 TEXT;
Query OK, 4074980 rows affected (13.15 sec)            
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.06 sec)

把行的大小增大也没有关系，因为压根就不会去读真正的行。但是一旦把covering index移除，情况就回到原来那个德行了：

MariaDB [veil]> alter table spike drop index a1_a2_idx;
Query OK, 4074980 rows affected (8.74 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.62 sec)

MariaDB [veil]> alter table spike drop index a1_idx;
Query OK, 4074980 rows affected (5.31 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> alter table spike drop index a2_idx;
Query OK, 4074980 rows affected (1.93 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
|   164788 |
+----------+
1 row in set (0.72 sec)

所以嘛，wide table的表设计确实会是一个问题。一旦索引罩不住了，情况就糟糕了。

如果covering index这么棒的话，那为何不给所有的查询都建立covering index呢？这样问题可不就解决了么。

真是牛逼的想法，试试吧：

MariaDB [veil]> create index a1_a2_a3_idx on spike(a1, a2, a3);
Query OK, 4074980 rows affected (7.03 sec)             
Records: 4074980  Duplicates: 0  Warnings: 0

MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7 and a3 = 1;
+----------+
| count(*) |
+----------+
|     6748 |
+----------+
1 row in set (0.01 sec)

真的耶！covering index又立功了。似乎一旦建立了这样的一个索引，所有的查询问题都解决了。

MariaDB [veil]> select count(*) from spike where a1 = 5;                      
+----------+
| count(*) |
+----------+
|   998839 |
+----------+
1 row in set (0.28 sec)

看见没，它确实管用！

MariaDB [veil]> select count(*) from spike where a3 = 1;
+----------+
| count(*) |
+----------+
|   207228 |
+----------+
1 row in set (1.06 sec)

我靠，高兴太早了。栽了。原因是索引的顺序是a1，a2，a3，对于a1的查询因为和索引建立的顺序相符，所以用的上。类似的a1，a2的查询也能用得上。但是a3和建立顺序不符，就无法利用上。如果要真正的达到覆盖所有的情况，那么将是一个排列组合的结果。对于十几列的表来说，这个数字会大得让系统无法承受。

总结起来就是，无论是磁盘还是内存，无论是附加的属性表，还是把属性建模成列，无论是PostgreSQL还是MySQL都没有办法提供Responsive Query。传统的按行存储的数据库无法满足多维数据的高速查询需求。

posted on 2012-02-26 16:43 taowen 阅读(1896) 评论(1) 收藏举报

刷新页面返回顶部