An invalidated logical replication slot causes checkpoint to hang

Symptoms

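First, a backup running pg_basebackup hung and timed out. pg_stat_activity showed that the backup process was blocked by the checkpointer process, and a closer look at the checkpointer showed that its wait event was IPC:CheckpointStart.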
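
The check described above can be sketched as a small shell helper. This is only a sketch: it assumes psql can connect using the usual PG* environment variables and that the server is PostgreSQL 10 or later (where pg_stat_activity has the backend_type column). The function names wait_events_sql and show_wait_events are made up for illustration.

```shell
# Hypothetical helper: print the SQL that lists checkpointer and walsender
# sessions with their wait events. pg_basebackup connects over the
# replication protocol, so its session also shows up as a walsender.
wait_events_sql() {
  cat <<'SQL'
select pid, backend_type, application_name, wait_event_type, wait_event, state
from pg_stat_activity
where backend_type in ('checkpointer', 'walsender')
order by backend_type, pid;
SQL
}

# Run the query; connection settings are taken from the environment
# (PGHOST, PGUSER, ...), which is an assumption of this sketch.
show_wait_events() {
  wait_events_sql | psql -X -f -
}
```

Calling show_wait_events on the database host prints one row per matching process, which makes it easy to see the checkpointer and the walsender side by side.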

pstack of the checkpointer process:

# pstack checkpointer_pid
# Frame #3 shows it is stuck in the step that invalidates obsolete replication slots
#0  0x00007ffff71e20c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x0000000000753956 in WaitEventSetWait ()
#2  0x000000000075d92c in ConditionVariableTimedSleep ()
#3  0x000000000072ba7b in InvalidateObsoleteReplicationSlots ()
#4  0x000000000051390d in CreateCheckPoint ()
#5  0x00000000006f5bc2 in CheckpointerMain ()
#6  0x0000000000523725 in AuxiliaryProcessMain ()
#7  0x00000000006ffac9 in StartChildProcess ()
#8  0x0000000000700cb2 in reaper ()
#9  <signal handler called>
#10 0x00007ffff71d8b23 in __select_nocancel () from /lib64/libc.so.6
#11 0x0000000000482cd4 in ServerLoop ()
#12 0x00000000007022c3 in PostmasterMain ()
#13 0x000000000048421e in main ()

Next, check the replication slots in the database:

# select * from pg_replication_slots ;
slot_name                  |  plugin  | slot_type | datoid |     database     | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn | wal_status | safe_wal_size 
---------------------------+----------+-----------+--------+------------------+-----------+--------+------------+------+--------------+--------------+---------------------+------------+---------------
 *****_publication         | pgoutput | logical   |  16394 | ******           | f         | t      |      94753 |      |     43147922 | 101/A53C5660 | 101/A53E4A98        | unreserved |   -1297197432

One replication slot has wal_status = unreserved, so this slot is most likely the culprit behind the stuck checkpoint.
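
To put numbers on that row, the LSN and safe_wal_size values can be converted to bytes in the shell. The sketch below assumes the usual LSN layout (high 32 bits before the slash, low 32 bits after it, both hexadecimal) and reads a negative safe_wal_size as "the slot is already past the WAL retention limit", which is an interpretation rather than something the post states. The helper name lsn_to_bytes is made up.

```shell
# Convert an LSN such as 101/A53C5660 to a byte position.
# The part before the slash is the high 32 bits, the part after it the low 32 bits.
lsn_to_bytes() (
  hi=${1%%/*}
  lo=${1##*/}
  echo $(( (0x$hi << 32) + 0x$lo ))
)

# Values copied from the pg_replication_slots row above.
restart=$(lsn_to_bytes 101/A53C5660)
flushed=$(lsn_to_bytes 101/A53E4A98)
echo "confirmed_flush_lsn - restart_lsn = $(( $flushed - $restart )) bytes"

# safe_wal_size from the same row, converted to whole MiB past the limit.
safe_wal_size=-1297197432
echo "past the limit by about $(( (0 - $safe_wal_size) / 1024 / 1024 )) MiB"
```

With the values from the row, the slot's own window is small (about 125 KiB between restart_lsn and confirmed_flush_lsn), while safe_wal_size shows the slot is more than 1 GiB past the limit.
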
What a checkpoint does:

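  1. Flush dirty data pages.
  2. Flush the clog to disk.
  3. Remove the empty data files left behind by commands such as truncate and drop (these commands reclaim the space used by the table's data files immediately, but the empty files themselves are only removed at checkpoint). With a large number of unlogged tables or temp tables, the checkpoint may take a long time.
  4. Flush slots (write ReplicationSlotPersistentData to disk), which is why a broken replication slot can block a checkpoint.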
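
Because item 4 means a bad slot can hold up a checkpoint, it may be worth listing slots that are already in trouble. A minimal sketch follows, assuming PostgreSQL 13 or later (where pg_replication_slots has wal_status and safe_wal_size and the max_slot_wal_keep_size setting exists) and a psql that connects via the environment. The function names are hypothetical.

```shell
# Hypothetical helper: SQL that lists slots whose WAL is no longer
# guaranteed to be kept, plus the configured retention limit.
risky_slots_sql() {
  cat <<'SQL'
select slot_name, slot_type, active, active_pid, wal_status, safe_wal_size
from pg_replication_slots
where wal_status in ('unreserved', 'lost');
show max_slot_wal_keep_size;
SQL
}

# Run it with the connection settings from the environment (an assumption).
list_risky_slots() {
  risky_slots_sql | psql -X -f -
}
```

An empty first result means no slot is currently in the unreserved or lost state.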

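The root cause is found: dropping this logical replication slot should fix it. However, the slot's walsender process cannot be terminated with pg_terminate_backend(), so the slot cannot be dropped either.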

How to kill the walsender process?

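pg_terminate_backend() cannot end the logical replication slot's walsender process, but pg_stat_activity shows that the session's wait event is ClientWrite, which suggests using a network timeout to end the session.
An imperfect workaround: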

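# Find the client IP and port of that process
select client_addr, client_port from pg_stat_activity where pid=<pid of the slot's walsender process>;

# Use iptables to block that client IP and port
# (packets from the client carry client_port as their source port, hence --sport on INPUT)
iptables -I INPUT -p tcp --sport 51758 -s 192.168.0.1 -j DROP
iptables -I OUTPUT -p tcp --dport 51758 -d 192.168.0.1 -j DROP
# Check the iptables drop counters
iptables -vnL

# Wait for the TCP keepalive timeout, then run pg_terminate_backend again and the process can be killed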
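
The iptables rules stay in place until they are removed, so it helps to generate the blocking rules and their matching removal rules together. The sketch below only prints the commands and does not run them. The function name client_rules is made up, and the IP and port are the example values used above.

```shell
# Print the iptables commands for one client connection (dry run).
# $1 = action: I (insert rule) or D (delete rule)
# $2 = client_addr, $3 = client_port from pg_stat_activity
# Packets arriving from the client have client_port as source port (INPUT, --sport);
# packets sent to the client have it as destination port (OUTPUT, --dport).
client_rules() {
  echo "iptables -$1 INPUT -p tcp -s $2 --sport $3 -j DROP"
  echo "iptables -$1 OUTPUT -p tcp -d $2 --dport $3 -j DROP"
}

client_rules I 192.168.0.1 51758   # rules that block the connection
client_rules D 192.168.0.1 51758   # matching rules to remove afterwards
```

After reviewing the printed commands, they can be run as root, for example by piping the output to sh. Run the D variant once the walsender has gone.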

Once the slot's walsender process has been killed, the logical replication slot can be dropped (select pg_drop_replication_slot()), and checkpoint then works normally.
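
The final step can be sketched as one helper that drops the slot, forces a checkpoint, and lists what is left. It assumes psql connects via the environment as a superuser. The function name is hypothetical, and the slot name is passed as an argument because the real name is masked in the output above.

```shell
# Hypothetical helper: drop the slot named in $1, run a checkpoint,
# then list the remaining slots. The SQL is fed on stdin so that psql
# substitutes the :'slot' variable as a properly quoted literal.
drop_slot_and_checkpoint() {
  psql -X -v ON_ERROR_STOP=1 -v slot="$1" <<'SQL'
select pg_drop_replication_slot(:'slot');
checkpoint;
select slot_name, wal_status from pg_replication_slots;
SQL
}
```

ON_ERROR_STOP=1 makes psql stop if the drop fails (for example because the slot is still active), so the checkpoint is not attempted while the slot is still there.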

posted @ 2024-04-11 11:07  清风生