【转】关于distinct和group by的效率问题
【原文地址】http://www.itpub.net/thread-1392256-1-1.html
一直以来都有这么个疑问,关于distinct和group by的效率问题
比如下面这个2个语句:
select distinct type from BBTEST;
和
select type from BBTEST group by type
两个语句的结果是一样的,就是取出表中的type值,去处重复记录。
但是这两者的执行效率如何呢,是否有差异?
在10gR2以前,distinct走的是SORT UNIQUE,group by走的是SORT GROUP BY,两者都需要进行排序;
10gR2以后,默认情况下distinct改走HASH UNIQUE,group by走HASH GROUP BY(至少在上面的简单例子中肯定是这样的)
测试例子:
BBTEST表有100万行,type字段上有索引。
多次执行后,统计信息固定输出如下:
SQL> select distinct type from bbtest;
T
-
1
3
6
8
7
5
9
4
2
已选择9行。
已用时间: 00: 00: 00.23
执行计划
----------------------------------------------------------
Plan hash value: 3426129975
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 958K| 2808K| | 3790 (6)| 00:00:46 |
| 1 | HASH UNIQUE | | 958K| 2808K| 22M| 3790 (6)| 00:00:46 |
| 2 | TABLE ACCESS FULL| BBTEST | 958K| 2808K| | 1098 (4)| 00:00:14 |
--------------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement
统计信息
----------------------------------------------------------
0 recursive calls
0 db block gets
4850 consistent gets
0 physical reads
0 redo size
490 bytes sent via SQL*Net to client
385 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
9 rows processed
SQL> select type from bbtest group by type;
T
-
1
8
5
2
3
6
7
9
4
已选择9行。
已用时间: 00: 00: 00.25
执行计划
----------------------------------------------------------
Plan hash value: 19366041
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 958K| 2808K| 1247 (15)| 00:00:15 |
| 1 | HASH GROUP BY | | 958K| 2808K| 1247 (15)| 00:00:15 |
| 2 | TABLE ACCESS FULL| BBTEST | 958K| 2808K| 1098 (4)| 00:00:14 |
-----------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement
统计信息
----------------------------------------------------------
0 recursive calls
0 db block gets
4850 consistent gets
0 physical reads
0 redo size
490 bytes sent via SQL*Net to client
385 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
9 rows processed
从这个数据级别上来看,HASH UNIQUE和HASH GROUP BY并没有多少差异
但是HASH UNIQUE的Cost明显比HASH GROUP BY高
都是一个全表扫描,逻辑读肯定是一样的。不一样的是排序之类的
确实,在COST上HASH UNIQUE要高,而且HASH UNIQUE还用到了22M的TempSpc,看来在内部处理上还是有差异的。
HASH GROUP BY 为何不使用临时表空间?是否也可以理解为HASH GROUP BY的算法都在内存中完成,而HASH UNIQUE则需要借助临时表空间?
不知道哪有这方面的详细算法介绍,好多网上资料都语焉不详的
1000万行表的测试:
SQL> alter system flush buffer_cache;
系统已更改。
已用时间: 00: 00: 02.37
SQL> select distinct type from bbtest;
T
-
1
3
6
8
7
5
9
4
2
已选择9行。
已用时间: 00: 00: 10.20
执行计划
----------------------------------------------------------
Plan hash value: 3426129975
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9203K| 26M| | 37198 (6)| 00:0:27 |
| 1 | HASH UNIQUE | | 9203K| 26M| 211M| 37198 (6)| 00:0:27 |
| 2 | TABLE ACCESS FULL| BBTEST | 9203K| 26M| | 11161 (4)| 00:0:14 |
-------------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement
统计信息
----------------------------------------------------------
0 recursive calls
0 db block gets
49379 consistent gets
49366 physical reads
0 redo size
490 bytes sent via SQL*Net to client
385 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
9 rows processed
SQL> alter system flush buffer_cache;
系统已更改。
已用时间: 00: 00: 00.01
SQL> select type from bbtest group by type;
T
-
5
2
3
6
9
1
8
7
4
已选择9行。
已用时间: 00: 00: 11.12
执行计划
----------------------------------------------------------
Plan hash value: 19366041
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9203K| 26M| 12810 (16)| 00:02:34 |
| 1 | HASH GROUP BY | | 9203K| 26M| 12810 (16)| 00:02:34 |
| 2 | TABLE ACCESS FULL| BBTEST | 9203K| 26M| 11161 (4)| 00:02:14 |
-----------------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement
统计信息
----------------------------------------------------------
4 recursive calls
0 db block gets
49452 consistent gets
49366 physical reads
0 redo size
490 bytes sent via SQL*Net to client
385 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
9 rows processed
在COST上HASH UNIQUE是HASH GROUP BY的3倍,但实际执行速度上反而快一点。二者都没有sorts (memory)和sorts (disk)
我想应该是算法处理上的不同导致的。但是单纯的distinct效率比group by高却是无从说起,只能说distinct花费更多的CPU资源,执行速度并不比group by慢
浙公网安备 33010602011771号