1 pg_trgm原理

1.1 pg_trgm 原理

pg_trgm是PostgreSQL的一个内置模块,用于处理模糊文本搜索。我们将了解pg_trgm的基本原理,并示范如何应用它进行地址搜索。

pg_trgm(PostgreSQL Trigram)是一个用于模糊文本搜索的扩展模块。它基于 trigram(三元组)的概念,即将字符串分割成三个字符的集合。pg_trgm提供了一组函数和操作符,可以计算两个字符串之间的相似度,并快速找到与目标字符串相似的字符串。

pg_trgm模块包含了几个关键函数,其中最常用的是similarity和 % 操作符。similarity函数可以计算两个字符串之间的相似度,返回一个0到1之间的值,值越接近1表示相似度越高。%操作符用于在一个字符串中搜索与另一个字符串相似的部分。

性能和资源消耗

虽然pg_trgm插件可以显著提高文本查询的性能,但索引的创建和维护也会消耗一定的系统资源。因此,在使用时需要根据实际情况进行权衡‌。

替代方案

虽然pg_trgm在处理文本相似度方面表现出色,但也有其他插件如pg_bigm提供了类似的功能。pg_bigm将对象每2个字节为一组进行切割,而pg_trgm则是每3个字符为一组。因此,pg_bigm可以支持任意字数的模糊查询‌。

1.2 pg_trgm 安装

请参考:https://www.cnblogs.com/xibuhaohao/articles/18783858

SELECT name, default_version,installed_version FROM pg_available_extensions;

CREATE EXTENSION IF NOT EXISTS pg_trgm;   

2 pg_trgm测试

2.1 新建测试表

新建一张测试表,包含一个主键列,四个varchar 字符串列

tpccdb-# \d mytest
Table "public.mytest"
Column | Type | Collation | Nullable | Default
-------------+-----------------------------+-----------+----------+---------
id | bigint | | |
sku1 | character varying(100) | | |
sku2 | character varying(100) | | |
sku3 | character varying(100) | | |
sku4 | character varying(100) | | |
insert_time | timestamp without time zone | | | now()

2.2 对前三个字符串列创建GIN倒排索引

CREATE INDEX CONCURRENTLY idx_gin_sku1 ON mytest USING gin (sku1 gin_trgm_ops);

CREATE INDEX CONCURRENTLY idx_gin_sku2 ON mytest USING gin (sku2 gin_trgm_ops);

CREATE INDEX CONCURRENTLY idx_gin_sku3 ON mytest USING gin (sku3 gin_trgm_ops);

2.3 造测试数据

造5000万测试数据:

tpccdb=# insert into mytest1(id,sku1,sku2,sku3,sku4) SELECT
generate_series(1, 50000000) AS id,
substring(md5(random()::text), 1, 10) AS sku1,substring(md5(random()::text), 1, 15) AS sku2,substring(md5(random()::text), 1, 20) AS sku3,substring(md5(random()::text), 1, 25) AS sku4;

2.4 查询测试

1)单字段查询,走了索引
tpccdb=# select * from mytest where sku1 like '%e7eefd7ffc%';
id | sku1 | sku2 | sku3 | sku4 | insert_time
----+------------+------------+------------+------------+----------------------------
9 | e7eefd7ffc | d2e05ee34f | 101f44dad9 | daae8fb159 | 2025-03-20 19:19:28.162067
(1 row)

Time: 3.596 ms
tpccdb=# explain select * from mytest where sku1 like '%e7eefd7ffc%';
QUERY PLAN
-------------------------------------------------------------------------------
Bitmap Heap Scan on mytest (cost=211.95..2107.65 rows=510 width=60)
Recheck Cond: ((sku1)::text ~~ '%e7eefd7ffc%'::text)
-> Bitmap Index Scan on idx_gin_sku1 (cost=0.00..211.83 rows=510 width=0)
Index Cond: ((sku1)::text ~~ '%e7eefd7ffc%'::text)
(4 rows)

Time: 0.403 ms


2)多字段带 or 的查询,查询字段上都有倒排索引,也都走了索引
tpccdb=# select * from mytest where sku1 like '%57121eabfd%' or sku2 like '3f5174a238' or sku3 like '79d397e350' ;
id | sku1 | sku2 | sku3 | sku4 | insert_time
---------+------------+------------+------------+------------+----------------------------
5000000 | 57121eabfd | 3f5174a238 | 79d397e350 | fe8e00634a | 2025-03-20 19:20:07.566315
(1 row)

Time: 17.573 ms
tpccdb=#
tpccdb=#
tpccdb=# explain select * from mytest where sku1 like '%57121eabfd%' or sku2 like '3f5174a238' or sku3 like '79d397e350' ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on mytest (cost=780.22..2685.66 rows=512 width=60)
Recheck Cond: (((sku1)::text ~~ '%57121eabfd%'::text) OR ((sku2)::text ~~ '3f5174a238'::text) OR ((sku3)::text ~~ '79d397e350'::text))
-> BitmapOr (cost=780.22..780.22 rows=512 width=0)
-> Bitmap Index Scan on idx_gin_sku1 (cost=0.00..211.83 rows=510 width=0)
Index Cond: ((sku1)::text ~~ '%57121eabfd%'::text)
-> Bitmap Index Scan on idx_gin_sku2 (cost=0.00..284.01 rows=1 width=0)
Index Cond: ((sku2)::text ~~ '3f5174a238'::text)
-> Bitmap Index Scan on idx_gin_sku3 (cost=0.00..284.01 rows=1 width=0)
Index Cond: ((sku3)::text ~~ '79d397e350'::text)
(9 rows)

3)多字段带 or 的查询,只有第四个查询字段上没有倒排索引,则都不走索引,直接全表扫描且并行执行
tpccdb=# select * from mytest where sku1 like '%57121eabfd%' or sku2 like '3f5174a238' or sku3 like '79d397e350' or sku4 like 'fe8e00634a';
id | sku1 | sku2 | sku3 | sku4 | insert_time
---------+------------+------------+------------+------------+----------------------------
5000000 | 57121eabfd | 3f5174a238 | 79d397e350 | fe8e00634a | 2025-03-20 19:20:07.566315
(1 row)

Time: 537.911 ms
tpccdb=# explain select * from mytest where sku1 like '%57121eabfd%' or sku2 like '3f5174a238' or sku3 like '79d397e350' or sku4 like 'fe8e00634a';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..101508.13 rows=513 width=60)
Workers Planned: 2
-> Parallel Seq Scan on mytest (cost=0.00..100456.83 rows=214 width=60)
Filter: (((sku1)::text ~~ '%57121eabfd%'::text) OR ((sku2)::text ~~ '3f5174a238'::text) OR ((sku3)::text ~~ '79d397e350'::text) OR ((sku4)::text ~~ 'fe8e00634a'::text))
(4 rows)

Time: 0.464 ms
4)中文查询测试

造中文测试数据一条:

tpccdb=# update mytest1 set sku4='蓝臻奶粉-110g' where sku4='a7a3e5801527e5b6c806601fe';     

查看数据:  

tpccdb=# select * from mytest1 order by id limit 10;
id | sku1 | sku2 | sku3 | sku4 | insert_time
----+------------+-----------------+----------------------+---------------------------+----------------------------
1 | f479220795 | 6d5411ec1130208 | 13bc0b9a0baffbbefba6 | 蓝臻奶粉-110g | 2025-03-21 08:54:41.055103
2 | 83f13461ab | d2c2adc1f220694 | 9b0bdfd205835dba1f41 | b0eb4b061f16799a930ee8071 | 2025-03-21 08:54:41.055103
3 | 6b3808aaa2 | 15b42ce930576b6 | 3d58dfa9b72998a27434 | fa75049da30bb731c00709c21 | 2025-03-21 08:54:41.055103
4 | eb5b1cafe5 | b985b51d8c3376c | 4548d327bb3be2846de5 | c4c18c7e1bbf8713ecfe4dd67 | 2025-03-21 08:54:41.055103
5 | d4cacfd24e | fe0707990f85307 | 4fc346f3a18990fc68f8 | e68f2db5547464542e211d8bc | 2025-03-21 08:54:41.055103
6 | 6599faf104 | 8747837e06b8ced | 57c0ecc54cca09e2ca89 | 906beeac1720dd98503f3be73 | 2025-03-21 08:54:41.055103
7 | 4fe23d1375 | fddb091d4ab5227 | 629db13b578f03b527ed | 7e99bdbf0c9066b4a071f0f96 | 2025-03-21 08:54:41.055103
8 | b0cfa2652b | 8a44df721c0f3aa | abb27b6bacb485ffd369 | 8c630b67537c72a0c4f08db97 | 2025-03-21 08:54:41.055103
9 | 9317237fc7 | 393a71877568693 | cdb74b48e0fcdea99dd4 | 9e70b5ab6e6db7d67d1f8bf7b | 2025-03-21 08:54:41.055103
10 | c48d24b29b | 15b447c4ecb9d8e | 16a35ced264166244f47 | 846db4ace4141deec2f7de277 | 2025-03-21 08:54:41.055103
(10 rows)

测试查询:当中文字符大于等于三个时,才会走索引与英文一致

tpccdb=# explain select * from mytest1 where sku4 like '%蓝臻%';
QUERY PLAN
--------------------------------------------------------------------------------
Gather (cost=1000.00..1031147.67 rows=5000 width=90)
Workers Planned: 2
-> Parallel Seq Scan on mytest1 (cost=0.00..1029647.67 rows=2083 width=90)
Filter: ((sku4)::text ~~ '%蓝臻%'::text)
(4 rows)

Time: 0.407 ms
tpccdb=# explain select * from mytest1 where sku4 like '%臻奶粉%';
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on mytest1 (cost=330.75..19125.71 rows=5000 width=90)
Recheck Cond: ((sku4)::text ~~ '%臻奶粉%'::text)
-> Bitmap Index Scan on idx_gin_sku4 (cost=0.00..329.50 rows=5000 width=0)
Index Cond: ((sku4)::text ~~ '%臻奶粉%'::text)
(4 rows)

 posted on 2025-03-20 20:30  xibuhaohao  阅读(135)  评论(0)    收藏  举报