当索引遇上null

原文来自newkid的大作《受楼下老大的NULL贴启发，讲两个NULL与索引的小技巧》。

众所周知，Oracle的B-Tree索引不包含null的条目，为此引申出了诸多“is null/is not null不走索引”的伪言，在SQL Server方面我写过一篇《关于数据库是否使用索引的讨论，我想说的》，这里就不再熬述了。

newkid大侠展示了一个方法让B-Tree索引包含null条目，让“is null” 的条件判断也能走索引。该方法是创建一个“伪复合索引”：

首先，创建一些测试数据：

create table t02(id number,names varchar(100)) ; 
insert into t02 select rownum ,object_name from all_objects;  
update t02 set id=NULL where id=1234;
update t02 set id=NULL where id=2234;
update t02 set id=NULL where id=3234;
update t02 set id=NULL where id=4234;

经过一轮折腾数据的分布如下：

select count(*),count(id) from t02; 
  COUNT(*)  COUNT(ID)
---------- ----------
     40686      40682

总共40686条记录，其中id is null的有4条。往常在id列上建一个普通的索引，可以预见对于非null的条件查询非常有帮助，而对于id is null 就无能为力了：

create index idx__t02_id on t02(id);    

select * from t02 where id=1234 ; 
-------------------------------------------------------------------------------------------
| Id  | Operation                   | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |             |     1 |    65 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T02         |     1 |    65 |     1   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IDX__T02_ID |     1 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

select * from t02 where id is null ;
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    15 |   975 |    57   (4)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T02  |    15 |   975 |    57   (4)| 00:00:01 |
--------------------------------------------------------------------------

id is null 这个条件判断有将近 1/10000 的选择率，不走索引有时这让人非常郁闷。此时，再建一个“伪符合索引”。

CREATE INDEX cidx__t02_id ON t02(id,0);

select * from t02 where id is null ;
--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |    15 |   975 |    18   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T02          |    15 |   975 |    18   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | CIDX__T02_ID |  2035 |       |     7   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

id is null 也能走索引了。

通常来说，数据量越大问题越多，春运就是一个典型的例子，索引也是一样，大量的DML语句需要维护索引，大数据量会导致索引的重建和维护产生风险。某些场景中，我们仅仅关心某一小部分的数据而不是所有数据，而索引通常是针对所有的数据进行索引的，为了解决这个问题Tom在《Oracle 9i&10g变成艺术》这一书中描述了一种基于函数索引排除大部分无关记录的方式，其原理利用了Oracle的索引不保存null条目的特点。

例如，某些时候我们仅关心某个状态字段为1的记录，而这部分记录很少，其他状态的记录从来不会去查询。

我们创建一些示例数据展示这一点：

create table t03(id number,name varchar(200) ); 
insert into t03 select rownum,object_name from all_objects; 
update t03 set id=1 where id<50; 
update t03 set id=0 where id<>1;

一轮折腾之后，数据分布如下：

select id,count(*) from t03 group by id;

        ID   COUNT(*)
---------- ----------
         1         49
         0      40641

平时我们感兴趣的数据是 id=1 那部分数据，如果建立一个普通B-Tree索引就要保存40690个条目，但我们实际关心其中的49条。

此时，我们建立两个索引，顺便收集统计信息：

create index fidx__t03_id_eq_1 on t03( decode(id,1,1) ) ;
exec dbms_stats.gather_table_stats( user , 't03' , cascade=>true );

索引fidx__t03_id_eq_1仅包含我们关心的49条记录：

select index_name,index_type,blevel,leaf_blocks,num_rows,status,distinct_keys
from user_indexes 
where table_name='T03'; 
INDEX_NAME         INDEX_TYPE              BLEVEL LEAF_BLOCKS   NUM_ROWS STATUS   DISTINCT_KEYS
------------------ ---------------------- ------- ----------- ---------- -------- -------------
IDX__T03_ID        NORMAL                       1          74      40641 VALID                2
FIDX__T03_ID_EQ_1  FUNCTION-BASED NORMAL        0           1         49 VALID                1

我们再来看看实际的查询：

select count(*) from t03 where decode(id,1,1)=1;

---------------------------------------------------------------------------------------
| Id  | Operation         | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                   |     1 |     3 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |                   |     1 |     3 |            |          |
|*  2 |   INDEX RANGE SCAN| FIDX__T03_ID_EQ_1 |    49 |   147 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          1  consistent gets
          


select count(*) from t03 where id=1;
---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |     1 |     3 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |             |     1 |     3 |            |          |
|*  2 |   INDEX RANGE SCAN| IDX__T03_ID |    40 |   120 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          2  consistent gets

可以看到，我们节省了consistent gets，由于数据量不大，优势不是太明显，但是可以看到索引的LEAF_BLOCKS数量大大减少，对于id<>1的DML语句不会影响索引FIDX__T03_ID_EQ_1，而且索引体积很少，维护成本非常低廉。

posted @ 2010-06-30 16:06 killkill 阅读(4822) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

当索引遇上null

公告