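Checkpoint stuck because of an invalidated logical replication slot
Symptoms
The first sign was a pg_basebackup run that hung until it timed out. In pg_stat_activity the backup process was blocked waiting for the checkpointer, and the checkpointer process itself showed the wait event IPC:CheckpointStart.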
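The check described above can be sketched as a small shell helper around psql. This is only a sketch, assuming psql can connect through the usual PG* environment variables; the function name show_checkpoint_waiters is made up for illustration, while backend_type, wait_event_type and wait_event are standard pg_stat_activity columns.

```shell
#!/bin/sh
# Hypothetical helper: list the checkpointer and all walsender backends
# (a pg_basebackup session also runs as a walsender) together with what
# they are currently waiting on.
# Assumes psql picks up connection settings from PGHOST, PGUSER, etc.
show_checkpoint_waiters() {
  psql -X -At -F ' | ' -c "
    SELECT pid, backend_type, state, wait_event_type, wait_event
      FROM pg_stat_activity
     WHERE backend_type IN ('checkpointer', 'walsender')
     ORDER BY backend_type, pid"
}
```

Calling show_checkpoint_waiters on the database host prints one line per process, which makes it easy to see whether the checkpointer is still waiting.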
The pstack output of the checkpointer process:
# pstack checkpointer_pid
# Frame #3 shows the process stuck in the step that invalidates obsolete replication slots
#0 0x00007ffff71e20c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x0000000000753956 in WaitEventSetWait ()
#2 0x000000000075d92c in ConditionVariableTimedSleep ()
#3 0x000000000072ba7b in InvalidateObsoleteReplicationSlots ()
#4 0x000000000051390d in CreateCheckPoint ()
#5 0x00000000006f5bc2 in CheckpointerMain ()
#6 0x0000000000523725 in AuxiliaryProcessMain ()
#7 0x00000000006ffac9 in StartChildProcess ()
#8 0x0000000000700cb2 in reaper ()
#9 <signal handler called>
#10 0x00007ffff71d8b23 in __select_nocancel () from /lib64/libc.so.6
#11 0x0000000000482cd4 in ServerLoop ()
#12 0x00000000007022c3 in PostmasterMain ()
#13 0x000000000048421e in main ()
Next, check the replication slots in the database:
# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size
--------------------------------------------+----------+-----------+--------+------------------+-----------+--------+------------+------+--------------+--------------+---------------------+------------+---------------
*****_publication | pgoutput | logical | 16394 | ****** | f | t | 94753 | | 43147922 | 101/A53C5660 | 101/A53E4A98 | unreserved | -1297197432
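One replication slot has wal_status = unreserved, which means the slot no longer retains the WAL it needs and some of those files are due to be removed at the next checkpoint. This slot is most likely the culprit behind the stuck checkpoint.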
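To find such slots without reading the full view output, the query can be narrowed to the problematic states. The following is a sketch, assuming PostgreSQL 13 or later (where wal_status, safe_wal_size and max_slot_wal_keep_size exist) and a psql that connects through the PG* environment variables; list_at_risk_slots is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: show only slots whose WAL is no longer guaranteed
# ('unreserved') or already gone ('lost'), next to the configured limit.
list_at_risk_slots() {
  psql -X -At -F ' | ' -c "
    SELECT slot_name, slot_type, active_pid, wal_status, safe_wal_size,
           current_setting('max_slot_wal_keep_size') AS keep_size
      FROM pg_replication_slots
     WHERE wal_status IN ('unreserved', 'lost')"
}
```

An empty result means no slot is currently in one of these two states.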
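What a checkpoint has to do:
- Flush dirty data pages
- Flush the clog to disk
- Delete the empty data files left behind by commands such as truncate and drop (these commands reclaim the space used by the table's data files immediately, but the empty files are not removed until checkpoint). With a large number of unlogged tables or temp tables, a checkpoint can take a long time.
- Flush slots (write ReplicationSlotPersistentData to disk), so a misbehaving replication slot can block the checkpoint. In this case the stack above shows the wait inside InvalidateObsoleteReplicationSlots, where the checkpointer waits for the process holding the obsolete slot to release it.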
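So the source of the problem is found, and dropping this logical replication slot should fix it. However, the slot's walsender process cannot be ended with pg_terminate_backend(), so the slot cannot be dropped either.
How to kill the walsender process?
The walsender of the logical replication slot does not respond to pg_terminate_backend(), but pg_stat_activity shows its wait event as ClientWrite, which suggests ending the session by forcing a network timeout.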
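The walsender that holds a slot can be looked up by joining pg_replication_slots.active_pid to pg_stat_activity.pid, which returns the wait event and the client address in one step. This is a sketch under the same assumptions as any psql one-liner (connection via PG* environment variables); slot_walsender_info is a hypothetical name. The query is fed on stdin because psql does not expand :'variables' in a -c command.

```shell
#!/bin/sh
# Hypothetical helper: $1 = slot name.
# Prints the walsender pid, client address/port and current wait event.
slot_walsender_info() {
  psql -X -At -F ' | ' -v slot="$1" <<'SQL'
SELECT s.slot_name, a.pid, a.client_addr, a.client_port,
       a.wait_event_type, a.wait_event
  FROM pg_replication_slots s
  JOIN pg_stat_activity a ON a.pid = s.active_pid
 WHERE s.slot_name = :'slot';
SQL
}
```

The client_addr and client_port columns in the output are the values needed for the firewall rules.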
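An imperfect workaround:
# Look up the client IP and port of that process
select client_addr, client_port from pg_stat_activity where pid=<pid of the slot's walsender process>;
# Use iptables to drop traffic for that client IP and port
# (on inbound packets the client port is the source port, hence --sport)
iptables -I INPUT -p tcp --sport 51758 -s 192.168.0.1 -j DROP
iptables -I OUTPUT -p tcp --dport 51758 -d 192.168.0.1 -j DROP
# Check the iptables packet counters to confirm the rules are matching
iptables -vnL
# Wait for the TCP keepalive timeout, then run pg_terminate_backend again; this time the process is killed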
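DROP rules like these stay in place until they are deleted, so it helps to add and remove them as a pair. The sketch below assumes it runs as root on the database server; block_client and unblock_client are hypothetical names. Inbound packets are matched on the client port as source port, outbound packets on the client port as destination port. Note that iptables -D only removes a rule whose specification is identical to the one that was inserted, so rules added by hand must be deleted with exactly the options they were added with.

```shell
#!/bin/sh
# Hypothetical helpers; $1 = client IP, $2 = client port
# (both taken from pg_stat_activity for the walsender session).
block_client() {
  iptables -I INPUT  -p tcp -s "$1" --sport "$2" -j DROP &&
  iptables -I OUTPUT -p tcp -d "$1" --dport "$2" -j DROP
}

# Remove the same two rules once the walsender has been terminated.
unblock_client() {
  iptables -D INPUT  -p tcp -s "$1" --sport "$2" -j DROP
  iptables -D OUTPUT -p tcp -d "$1" --dport "$2" -j DROP
}
```

For the session in this case the calls would be block_client 192.168.0.1 51758 and, after the walsender is gone, unblock_client 192.168.0.1 51758.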
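Once the walsender process of the replication slot has been killed, the logical replication slot can be dropped (select pg_drop_replication_slot()), and checkpoints complete normally again.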
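The final step can be sketched as one psql session that drops the slot, forces a checkpoint to confirm it now completes, and lists the remaining slots. This assumes a role that is allowed to run CHECKPOINT (a superuser, for example) and a connection through the PG* environment variables; drop_slot_and_checkpoint is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: $1 = name of the slot to drop.
# ON_ERROR_STOP makes psql stop at the first failing statement, e.g. if
# the slot is still active because the walsender has not exited yet.
drop_slot_and_checkpoint() {
  psql -X -At -v ON_ERROR_STOP=1 -v slot="$1" <<'SQL'
SELECT pg_drop_replication_slot(:'slot');
CHECKPOINT;
SELECT slot_name, active, wal_status FROM pg_replication_slots;
SQL
}
```

If the CHECKPOINT statement returns promptly, the checkpointer is no longer waiting on the slot.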