PostgreSQL SQL优化之NOT IN问题

在我们平时写SQL时，如果遇到需要排除某些数据时，往往使用id <> xxx and id <> xxx，进而改进为id not in (xxx, xxx);

这样写没有问题，而且简化了SQL，但是往往有些极端情况，使用not in就会造成极大的性能损耗，例如：

select * from test where id not in (select id from test_back) and info like '%test%';

这样的话select id from test_back将成为一个子查询，而且不会走索引，每次走一遍全表扫描。

每一条满足info like '%test%'的记录都会去调用这个方法去判断id是否不在子查询中，具体的执行计划见下面的例子。

改进方法：

1）使用test和test_back进行联合查询，id <> id明显是不行的，这样只会判断同一关联条件下的一行中的id是否相同，无法做到排除某些id。

2）正确的方式应该使用not exists，将条件下推到里面，就不会出现子查询了：

select * from test t1 where info like '%test%' and not exits (select 1 from test_back t2 where t2.id = t1.id);

apple=# \d test
                Table "public.test"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 id     | integer |           | not null |
 info   | text    |           |          |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

apple=# truncate test;
TRUNCATE TABLE
apple=# insert into test select generate_series(1, 100), 'test'||round(random()*10000)::text;
INSERT 0 100
apple=# select * from test limit 1;
 id |   info
----+----------
  1 | test9526
(1 row)

apple=# insert into test select generate_series(101, 200), 'tes'||round(random()*10000)::text;
INSERT 0 100                            
apple=# create table test_back as  select * from test where id between 50 and 70;
SELECT 21
apple=# explain select * from test where id not in (select id from test_back) and info like '%test%';
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on test  (cost=25.88..30.88 rows=49 width=12)
   Filter: ((NOT (hashed SubPlan 1)) AND (info ~~ '%test%'::text))
   SubPlan 1
     ->  Seq Scan on test_back  (cost=0.00..22.70 rows=1270 width=4)
(4 rows)

apple=# explain select * from test t1 where info like '%test%' and not exists (select 1 from test_back t2 where t2.id = t1.id);
                               QUERY PLAN
-------------------------------------------------------------------------
 Hash Anti Join  (cost=1.47..7.13 rows=89 width=12)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on test t1  (cost=0.00..4.50 rows=99 width=12)
         Filter: (info ~~ '%test%'::text)
   ->  Hash  (cost=1.21..1.21 rows=21 width=4)
         ->  Seq Scan on test_back t2  (cost=0.00..1.21 rows=21 width=4)
(6 rows)

例子里面没有建索引，建索引后，这种优化方式效果更好。

那么进一步扩展来说：

1）!= 不是标准的SQL，<>才是，这两个在PostgreSQL中是等效的。

2）exits和not exits的意思是逐条将条件下放到判断条件，而jion方式是先对表进行笛卡尔积，然后判断同行之间的各列值是否满足关系。

2020-11-25补充：

-----------------------------

除了使用exists、in之外，PG也还支持使用=any的方式，有时候可能更优，参考博客：

https://blog.csdn.net/weixin_39540651/article/details/104990653

1、= any这种写法，不会走subplan，因此不涉及hash table的问题。和work_mem设置大小无关。因此在pg中十分推荐使用这种写法。--相当于把多行结果，放到了一个有数组中进行判断，如果把数组弄成有序的，会更好。

2、exists，由于优化器会默认它只需要搜索到1条命中目标就不搜了，所以优化器评估是否使用hash table时，需要的内存相对较少，即使较小的work_mem也可能使用hashtable。

3、in ，当出现在subquery中时，优化器评估这个subquery是否要构建哈希TABLE，直接和subquery的大小相关，所以需要较大的work_mem才会选择使用hashtable。

posted @ 2019-07-26 14:46 狂神314 阅读(9486) 评论(0) 收藏举报

刷新页面返回顶部

狂神314

PostgreSQL SQL优化之NOT IN问题

公告