KingbaseES复制冲突中谁阻塞walreplay

背景

回顾一下流复制冲突相关参数:

hot_standby_feedback: 从库反馈给主库快照, 主库vacuum时不回收最老快照之后产生的垃圾,注:备库长查询将导致主库表膨胀。
vacuum_defer_cleanup_age: 当触发vacuum时,延迟指定事务后触发。
recovery_min_apply_delay: 如果将此参数设置为5分钟,则只有在备库时间至少比主库提交时间多五分钟时,备库才会重放每个wal。
max_standby_streaming_delay: 当startup replay stream wal record时, 如果遇到冲突, startup最多等多久。
max_standby_archive_delay: 当startup replay archive(restore command) wal record时, 如果遇到冲突, startup最多等多久。

根据有关视图查询复制冲突次数,冲突是否被阻塞,以及阻塞者

冲突次数
test=# select * from sys_stat_database_conflicts ;
datid | datname | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock
-------+-----------+------------------+------------+----------------+-----------------+----------------
14187 | test | 0 | 0 | 0 | 0 | 0
16053 | security | 0 | 0 | 0 | 0 | 0
1 | template1 | 0 | 0 | 0 | 0 | 0
14186 | template0 | 0 | 0 | 0 | 0 | 0
16387 | test | 0 | 0 | 4 | 0 | 0
17527 | test23 | 0 | 0 | 0 | 0 | 0
(6 rows)

查看当前是否存在冲突,当startup进程的等待事件为空, 表示它被堵塞了,可以理解为此时startup进程什么工作也没做。

test=# select * from sys_stat_activity where backend_type ='startup' and wait_event is null;
-[ RECORD 1 ]----+---------------------------------
datid |
datname |
pid | 21060
usesysid |
usename |
application_name |
client_addr |
client_hostname |
client_port |
backend_start | 2020-02-29 00:26:28.478013+08
xact_start |
query_start |
state_change |
wait_event_type |
wait_event |
state |
backend_xid |
backend_xmin |
query |
backend_type | startup

当前startup等待事件
当startup在回放wal时, 等待中通常有io等操作. 这种情况不是conflict堵塞。

test=# select * from sys_stat_activity where backend_type ='startup';
-[ RECORD 1 ]----+------------------------------
datid |
datname |
pid | 21060
usesysid |
usename |
application_name |
client_addr |
client_hostname |
client_port |
backend_start | 2020-02-29 00:26:28.478013+08
xact_start |
query_start |
state_change |
wait_event_type | IO
wait_event | DataFileExtend
state |
backend_xid |
backend_xmin |
query |
backend_type | startup

RecoveryWalAll 通常表示startup进程正在等待wal, 通常此时standby处于未delay状态.对应wait_event_type 是Activity。

test=# select * from sys_stat_activity where backend_type ='startup';
-[ RECORD 1 ]----+------------------------------
datid |
datname |
pid | 21060
usesysid |
usename |
application_name |
client_addr |
client_hostname |
client_port |
backend_start | 2020-02-29 00:26:28.478013+08
xact_start |
query_start |
state_change |
wait_event_type | Activity
wait_event | RecoveryWalAll
state |
backend_xid |
backend_xmin |
query |
backend_type | startup

startup 可能被哪个查询堵塞了呢
通常是时间越早, 越可能是堵塞startup的query。
或者xmin, xid越早, 越可能是堵塞startup的query。

test=# select *,xact_start,query_start,state,user,query from sys_stat_activity where datname=current_database() and state<>'idle' order by xact_start limit 5;
-[ RECORD 1 ]----+---------------------------------------------------------------------------------------------------------------------------
datid | 16387
datname | test
pid | 29015
usesysid | 10
usename | test
application_name | psql
client_addr |
client_hostname |
client_port | -1
backend_start | 2020-03-10 19:01:22.577305+08
xact_start | 2020-03-10 19:01:42.257888+08
query_start | 2020-03-10 19:01:43.750416+08
state_change | 2020-03-10 19:01:43.750577+08
wait_event_type | Client
wait_event | ClientRead
state | idle in transaction
backend_xid |
backend_xmin | 4556
query | select * from abc limit 1;
backend_type | client backend
xact_start | 2020-03-10 19:01:42.257888+08
query_start | 2020-03-10 19:01:43.750416+08
state | idle in transaction
user | test
query | select * from abc limit 1;

通常可以使用以下sql进行查询

1、时间最老

select a.* from
(
select *,row_number() over (partition by state order by xact_start) as rn
from sys_stat_activity
where datname=current_database()
and pid<>sys_backend_pid()
and state<>'idle'
) a,
(
select * from sys_stat_activity where backend_type ='startup' and wait_event is null
)b
where a.rn <= 1
order by a.xact_start;

2、或 (事务号最老)

select a.* from
(
select *,row_number() over (partition by state order by least(backend_xid::text::int8,backend_xmin::text::int8)) as rn
from sys_stat_activity
where datname=current_database()
and pid<>sys_backend_pid()
and state<>'idle'
) a,
(
select * from sys_stat_activity where backend_type ='startup' and wait_event is null
)b
where a.rn <= 1
order by least(a.backend_xid::text::int8,a.backend_xmin::text::int8);

可能堵塞了wal apply 的 query 如下

-[ RECORD 1 ]----+------------------------------
datid | 16387
datname | test
pid | 30448
usesysid | 10
usename | test
application_name | psql
client_addr |
client_hostname |
client_port | -1
backend_start | 2020-03-10 19:13:36.670184+08
xact_start | 2020-03-10 19:13:38.696822+08
query_start | 2020-03-10 19:13:40.856399+08
state_change | 2020-03-10 19:13:40.85716+08
wait_event_type | Client
wait_event | ClientRead
state | idle in transaction
backend_xid |
backend_xmin | 4565
query | select * from abc limit 1;
backend_type | client backend
rn | 1

注意, 这样找到的疑似堵塞startup replay的查询, 但是结果不精确。

期待未来有相关sql能关联查询出准确的阻塞者, 可以找到精确的堵塞wal replay的query。

注意目前提供的blocking的系统函数只能用于查询重量级锁的冲突。

备库查看堵塞了多少wal还没有被replay

select sys_is_wal_replay_paused(),
sys_last_wal_receive_lsn(),sys_last_wal_replay_lsn(),
sys_size_pretty(sys_wal_lsn_diff(sys_last_wal_receive_lsn(),sys_last_wal_replay_lsn()));

pg_is_wal_replay_paused | pg_last_wal_receive_lsn | pg_last_wal_replay_lsn | pg_size_pretty
-------------------------+-------------------------+------------------------+----------------
f | 4/A5C524A0 | 4/A5C524A0 | 0 bytes

(1 row)

posted @ 2024-04-03 17:03  KINGBASE研究院  阅读(8)  评论(0编辑  收藏  举报