【案例】PostgreSQL一个序列引发的惨案

问题背景

用户反馈删除序列失败,报锁超时的错误,排查发现因为该序列被使用,所以没法删除。

尝试kill持锁的会话,但是kill不掉,尝试重启数据库也失败,最终重启主机,启动数据库后才成功删除序列。

问题分析

检查发现序列定义有问题,用户使用navicat、dbeaver等图像界面修改序列当前值,由于设置错误把序列的cache设置过大。

一旦使用该序列,相关进程就一直在生成cache,不接收中断信号,即使使用pg_terminate_backend杀会话,甚至停止数据库,相关进程都不接收中断信息,只能重启主机。

问题影响

创建测试对象

-- 序列
create sequence test_s cache 111111111111111;
-- 创建测试表
create table test_t1 (id int,name varchar(100));
create table test_t2 (id int,name varchar(100));
insert into test_t2 values (1,'zhangsan');

1、数据库不能正常重启

使用序列,如下查询一直执行不过去

postgres(9510)=#select nextval('test_s');

查询会话信息

postgres(9561)=#select * from pg_stat_activity where pid = '9510';
-[ RECORD 1 ]----+------------------------------
datid            | 13593
datname          | postgres
pid              | 9510
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2025-06-11 03:23:43.150979+08
xact_start       | 2025-06-11 03:25:07.312909+08
query_start      | 2025-06-11 03:25:07.312909+08
state_change     | 2025-06-11 03:25:07.312911+08
wait_event_type  | 
wait_event       | 
state            | active
backend_xid      | 
backend_xmin     | 490
query            | select nextval('test_s');
backend_type     | client backend

-- 尝试kill 会话
postgres(9561)=#select pg_terminate_backend(9510);
-[ RECORD 1 ]--------+--
pg_terminate_backend | t

-- 如下9510会话并没有被杀掉
postgres(9561)=#select * from pg_stat_activity where pid = '9510';
-[ RECORD 1 ]----+------------------------------
datid            | 13593
datname          | postgres
pid              | 9510
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2025-06-11 03:23:43.150979+08
xact_start       | 2025-06-11 03:25:07.312909+08
query_start      | 2025-06-11 03:25:07.312909+08
state_change     | 2025-06-11 03:25:07.312911+08
wait_event_type  | 
wait_event       | 
state            | active
backend_xid      | 
backend_xmin     | 490
query            | select nextval('test_s');
backend_type     | client backend

尝试关闭数据库,如下关闭失败

[postgres@centos7 ~]$ pg_ctl stop
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
[postgres@centos7 ~]$ psql
psql: error: FATAL:  the database system is shutting down

#如下进程都在
[postgres@centos7 ~]$ ps ux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres   9465  0.0  0.0 115564  2132 pts/0    S    03:23   0:00 -bash
postgres   9500  0.0  0.3 272644 13528 ?        Ss   03:23   0:00 /postgresql/pg12/bin/postgres
postgres   9501  0.0  0.0 124936   920 ?        Ss   03:23   0:00 postgres: logger   
postgres   9503  0.0  0.0 272644  1096 ?        Ss   03:23   0:00 postgres: checkpointer   
postgres   9507  0.0  0.0 127056  1088 ?        Ss   03:23   0:00 postgres: stats collector   
postgres   9509  0.0  0.0 124960  2076 pts/0    S+   03:23   0:00 psql
postgres   9510 79.1  0.1 274616  7600 ?        Rs   03:23   5:21 postgres: postgres postgres [local] SELECT
postgres   9536  0.0  0.0 115564  2096 pts/1    S    03:25   0:00 -bash
postgres   9560  0.0  0.0 124960  2056 pts/1    S+   03:25   0:00 psql
postgres   9591  0.0  0.0 115564  2124 pts/2    S    03:28   0:00 -bash
postgres   9619  0.0  0.0 155360  1868 pts/2    R+   03:30   0:00 ps u

重启主机

[root@centos7 ~]# reboot

重启pg服务

[postgres@centos7 ~]$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2025-06-11 03:35:28.866 CST [9470] LOG:  starting PostgreSQL 12.17 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit
2025-06-11 03:35:28.867 CST [9470] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2025-06-11 03:35:28.867 CST [9470] LOG:  listening on IPv6 address "::", port 5432
2025-06-11 03:35:28.870 CST [9470] LOG:  listening on Unix socket "/postgresql/pgdata/.s.PGSQL.5432"
2025-06-11 03:35:28.885 CST [9470] LOG:  redirecting log output to logging collector process
2025-06-11 03:35:28.885 CST [9470] HINT:  Future log output will appear in directory "pg_log".
 done
server started
[postgres@centos7 ~]$

如上测试,如果涉及到一些需要重启库的操作(例如修改postmaster级别的参数),此时会失败,最终只能重启主机之后才能重启数据库。

2、锁表影响

锁表场景有两种情况

场景1:只影响使用序列的单张表

会话1:会话9484会一直卡主,执行不过去

postgres(9484)=#insert into test_t1 values (nextval('test_s'),'zhangsan');

会话2:会话9540会被阻塞

postgres(9540)=#alter table test_t1 add column sex varchar(10);

会话3:查看持锁情况,insert(会话9484)阻塞了alter(会话9540)

postgres(9593)=#select pid,pg_blocking_pids(pid),wait_event,wait_event_type,state,query from pg_stat_activity where pid in (9484,9540);
 pid  | pg_blocking_pids | wait_event | wait_event_type | state  |                           query                            
------+------------------+------------+-----------------+--------+------------------------------------------------------------
 9484 | {}               |            |                 | active | insert into test_t1 values (nextval('test_s'),'zhangsan');
 9540 | {9484}           | relation   | Lock            | active | alter table test_t1 add column sex varchar(10);
(2 rows)

此时会话9540可以被kill掉,但是会话9484不能被kill掉

postgres(9593)=#select pg_terminate_backend(9484);
 pg_terminate_backend 
----------------------
 t
(1 row)

postgres(9593)=#select pg_terminate_backend(9540);
 pg_terminate_backend 
----------------------
 t
(1 row)

postgres(9593)=#select pid,pg_blocking_pids(pid),wait_event,wait_event_type,state,query from pg_stat_activity where pid in (9484,9540);
 pid  | pg_blocking_pids | wait_event | wait_event_type | state  |                           query                         
------+------------------+------------+-----------------+--------+-----------------------------------------------------------
 9484 | {}               |            |                 | active | insert into test_t1 values (nextval('test_s'),'zhangsan');
(1 row)

锁信息:

/*

 * These are the valid values of type LOCKMODE for all the standard lock
 * methods (both DEFAULT and USER).
   */

/* NoLock is not a lock mode, but a flag value meaning "don't get a lock" */
#define NoLock					0

#define AccessShareLock			1		/* SELECT */
#define RowShareLock			2		/* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock		3		/* INSERT, UPDATE, DELETE */
#define ShareUpdateExclusiveLock 4		/* VACUUM (non-FULL),ANALYZE, CREATE INDEX CONCURRENTLY */
#define ShareLock				5		/* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock	6		/* like EXCLUSIVE MODE, but allows ROW SHARE */
#define ExclusiveLock			7		/* blocks ROW SHARE/SELECT...FOR UPDATE */
#define AccessExclusiveLock		8		/* ALTER TABLE, DROP TABLE, VACUUM FULL, and unqualified LOCK TABLE */

如上根据锁信息,insert为3级锁,和4-8级锁冲突,也就是说后续针对test_t1表的VACUUM、ANALYZE、ALTER TABLE、DROP TABLE、VACUUM FULL等操作都没法进行,只能重启主机,然后重启数据库,修改序列的cache后才能正常。

场景2:事务中会影响其他表

该场景影响就比较大了,如下示例

会话1:insert语句卡主执行不过去

postgres(9508)=#begin;
BEGIN
postgres(9508)=#update test_t2 set name='lisi' where id=1;
UPDATE 1
postgres(9508)=#insert into test_t1 values (nextval('test_s'),'zhangsan');

会话2:update语句卡主执行不过去

postgres(9542)=#update test_t2 set name='wangwu' where id=1;

会话3:查询持锁信息

postgres(9569)=#select pid,pg_blocking_pids(pid),wait_event,wait_event_type,state,query from pg_stat_activity where pid in (9508,9542);
-[ RECORD 1 ]----+-----------------------------------------------------------
pid              | 9508
pg_blocking_pids | {}
wait_event       | 
wait_event_type  | 
state            | active
query            | insert into test_t1 values (nextval('test_s'),'zhangsan');
-[ RECORD 2 ]----+-----------------------------------------------------------
pid              | 9542
pg_blocking_pids | {9508}
wait_event       | transactionid
wait_event_type  | Lock
state            | active
query            | update test_t2 set name='wangwu' where id=1;

如上此时受影响的就是test_t1和test_t2两张表了,test_t1表在insert,test_t2表在update,都是3级锁,也就是说后续针对test_t1和test_t2表的VACUUM、ANALYZE、ALTER TABLE、DROP TABLE、VACUUM FULL等操作都没法进行,只能重启主机,然后重启数据库,修改序列的cache后才能正常。

设想一下如果当前场景下会话1比较复杂,涉及更多的表,则影响范围更大。

posted @ 2025-06-11 04:13  kahnyao  阅读(44)  评论(0)    收藏  举报