【案例】PostgreSQL一个序列引发的惨案
问题背景
用户反馈删除序列失败,报锁超时的错误,排查发现因为该序列被使用,所以没法删除。
尝试kill持锁的会话,但是kill不掉,尝试重启数据库也失败,最终重启主机,启动数据库后才成功删除序列。
问题分析
检查发现序列定义有问题,用户使用navicat、dbeaver等图像界面修改序列当前值,由于设置错误把序列的cache设置过大。
一旦使用该序列,相关进程就一直在生成cache,不接收中断信号,即使使用pg_terminate_backend杀会话,甚至停止数据库,相关进程都不接收中断信息,只能重启主机。
问题影响
创建测试对象
-- 序列
create sequence test_s cache 111111111111111;
-- 创建测试表
create table test_t1 (id int,name varchar(100));
create table test_t2 (id int,name varchar(100));
insert into test_t2 values (1,'zhangsan');
1、数据库不能正常重启
使用序列,如下查询一直执行不过去
postgres(9510)=#select nextval('test_s');
查询会话信息
postgres(9561)=#select * from pg_stat_activity where pid = '9510';
-[ RECORD 1 ]----+------------------------------
datid | 13593
datname | postgres
pid | 9510
usesysid | 10
usename | postgres
application_name | psql
client_addr |
client_hostname |
client_port | -1
backend_start | 2025-06-11 03:23:43.150979+08
xact_start | 2025-06-11 03:25:07.312909+08
query_start | 2025-06-11 03:25:07.312909+08
state_change | 2025-06-11 03:25:07.312911+08
wait_event_type |
wait_event |
state | active
backend_xid |
backend_xmin | 490
query | select nextval('test_s');
backend_type | client backend
-- 尝试kill 会话
postgres(9561)=#select pg_terminate_backend(9510);
-[ RECORD 1 ]--------+--
pg_terminate_backend | t
-- 如下9510会话并没有被杀掉
postgres(9561)=#select * from pg_stat_activity where pid = '9510';
-[ RECORD 1 ]----+------------------------------
datid | 13593
datname | postgres
pid | 9510
usesysid | 10
usename | postgres
application_name | psql
client_addr |
client_hostname |
client_port | -1
backend_start | 2025-06-11 03:23:43.150979+08
xact_start | 2025-06-11 03:25:07.312909+08
query_start | 2025-06-11 03:25:07.312909+08
state_change | 2025-06-11 03:25:07.312911+08
wait_event_type |
wait_event |
state | active
backend_xid |
backend_xmin | 490
query | select nextval('test_s');
backend_type | client backend
尝试关闭数据库,如下关闭失败
[postgres@centos7 ~]$ pg_ctl stop
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
[postgres@centos7 ~]$ psql
psql: error: FATAL: the database system is shutting down
#如下进程都在
[postgres@centos7 ~]$ ps ux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 9465 0.0 0.0 115564 2132 pts/0 S 03:23 0:00 -bash
postgres 9500 0.0 0.3 272644 13528 ? Ss 03:23 0:00 /postgresql/pg12/bin/postgres
postgres 9501 0.0 0.0 124936 920 ? Ss 03:23 0:00 postgres: logger
postgres 9503 0.0 0.0 272644 1096 ? Ss 03:23 0:00 postgres: checkpointer
postgres 9507 0.0 0.0 127056 1088 ? Ss 03:23 0:00 postgres: stats collector
postgres 9509 0.0 0.0 124960 2076 pts/0 S+ 03:23 0:00 psql
postgres 9510 79.1 0.1 274616 7600 ? Rs 03:23 5:21 postgres: postgres postgres [local] SELECT
postgres 9536 0.0 0.0 115564 2096 pts/1 S 03:25 0:00 -bash
postgres 9560 0.0 0.0 124960 2056 pts/1 S+ 03:25 0:00 psql
postgres 9591 0.0 0.0 115564 2124 pts/2 S 03:28 0:00 -bash
postgres 9619 0.0 0.0 155360 1868 pts/2 R+ 03:30 0:00 ps u
重启主机
[root@centos7 ~]# reboot
重启pg服务
[postgres@centos7 ~]$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2025-06-11 03:35:28.866 CST [9470] LOG: starting PostgreSQL 12.17 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit
2025-06-11 03:35:28.867 CST [9470] LOG: listening on IPv4 address "0.0.0.0", port 5432
2025-06-11 03:35:28.867 CST [9470] LOG: listening on IPv6 address "::", port 5432
2025-06-11 03:35:28.870 CST [9470] LOG: listening on Unix socket "/postgresql/pgdata/.s.PGSQL.5432"
2025-06-11 03:35:28.885 CST [9470] LOG: redirecting log output to logging collector process
2025-06-11 03:35:28.885 CST [9470] HINT: Future log output will appear in directory "pg_log".
done
server started
[postgres@centos7 ~]$
如上测试,如果涉及到一些需要重启库的操作(例如修改postmaster级别的参数),此时会失败,最终只能重启主机之后才能重启数据库。
2、锁表影响
锁表场景有两种情况
场景1:只影响使用序列的单张表
会话1:会话9484会一直卡主,执行不过去
postgres(9484)=#insert into test_t1 values (nextval('test_s'),'zhangsan');
会话2:会话9540会被阻塞
postgres(9540)=#alter table test_t1 add column sex varchar(10);
会话3:查看持锁情况,insert(会话9484)阻塞了alter(会话9540)
postgres(9593)=#select pid,pg_blocking_pids(pid),wait_event,wait_event_type,state,query from pg_stat_activity where pid in (9484,9540);
pid | pg_blocking_pids | wait_event | wait_event_type | state | query
------+------------------+------------+-----------------+--------+------------------------------------------------------------
9484 | {} | | | active | insert into test_t1 values (nextval('test_s'),'zhangsan');
9540 | {9484} | relation | Lock | active | alter table test_t1 add column sex varchar(10);
(2 rows)
此时会话9540可以被kill掉,但是会话9484不能被kill掉
postgres(9593)=#select pg_terminate_backend(9484);
pg_terminate_backend
----------------------
t
(1 row)
postgres(9593)=#select pg_terminate_backend(9540);
pg_terminate_backend
----------------------
t
(1 row)
postgres(9593)=#select pid,pg_blocking_pids(pid),wait_event,wait_event_type,state,query from pg_stat_activity where pid in (9484,9540);
pid | pg_blocking_pids | wait_event | wait_event_type | state | query
------+------------------+------------+-----------------+--------+-----------------------------------------------------------
9484 | {} | | | active | insert into test_t1 values (nextval('test_s'),'zhangsan');
(1 row)
锁信息:
/*
* These are the valid values of type LOCKMODE for all the standard lock
* methods (both DEFAULT and USER).
*/
/* NoLock is not a lock mode, but a flag value meaning "don't get a lock" */
#define NoLock 0
#define AccessShareLock 1 /* SELECT */
#define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */
#define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL),ANALYZE, CREATE INDEX CONCURRENTLY */
#define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW SHARE */
#define ExclusiveLock 7 /* blocks ROW SHARE/SELECT...FOR UPDATE */
#define AccessExclusiveLock 8 /* ALTER TABLE, DROP TABLE, VACUUM FULL, and unqualified LOCK TABLE */
如上根据锁信息,insert为3级锁,和4-8级锁冲突,也就是说后续针对test_t1表的VACUUM、ANALYZE、ALTER TABLE、DROP TABLE、VACUUM FULL等操作都没法进行,只能重启主机,然后重启数据库,修改序列的cache后才能正常。
场景2:事务中会影响其他表
该场景影响就比较大了,如下示例
会话1:insert语句卡主执行不过去
postgres(9508)=#begin;
BEGIN
postgres(9508)=#update test_t2 set name='lisi' where id=1;
UPDATE 1
postgres(9508)=#insert into test_t1 values (nextval('test_s'),'zhangsan');
会话2:update语句卡主执行不过去
postgres(9542)=#update test_t2 set name='wangwu' where id=1;
会话3:查询持锁信息
postgres(9569)=#select pid,pg_blocking_pids(pid),wait_event,wait_event_type,state,query from pg_stat_activity where pid in (9508,9542);
-[ RECORD 1 ]----+-----------------------------------------------------------
pid | 9508
pg_blocking_pids | {}
wait_event |
wait_event_type |
state | active
query | insert into test_t1 values (nextval('test_s'),'zhangsan');
-[ RECORD 2 ]----+-----------------------------------------------------------
pid | 9542
pg_blocking_pids | {9508}
wait_event | transactionid
wait_event_type | Lock
state | active
query | update test_t2 set name='wangwu' where id=1;
如上此时受影响的就是test_t1和test_t2两张表了,test_t1表在insert,test_t2表在update,都是3级锁,也就是说后续针对test_t1和test_t2表的VACUUM、ANALYZE、ALTER TABLE、DROP TABLE、VACUUM FULL等操作都没法进行,只能重启主机,然后重启数据库,修改序列的cache后才能正常。
设想一下如果当前场景下会话1比较复杂,涉及更多的表,则影响范围更大。