KingbaseES V8R6集群运维案例之---同城双中心failover切换
案例说明:
对于同城双中心的集群架构,当生产中心的主库数据库服务down时,默认是本中心的同步备库被选举为新的主库,只有当生产中心内所有节点都down时,灾备中心的和生产中心主库连接的同步备库节点提升为新主库。
适用版本:
KingbaseES V8R6
集群架构:
      ID | Name  | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1 | primary | * running |          | running | 2690 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 2874 | no      | 1 second(s) ago
 3  | node3 | standby |   running | node1    | running | 4118 | no      | 0 second(s) ago
 注:
 1)生产中心:node1、node2
 2)灾备中心:node3
案例一:中心内failover切换
一、集群状态信息
[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location       | Priority | Timeline | LSN_Lag | Connection string                                                                                                               
----+-------+---------+-----------+----------+----------------+----------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | production     | 100      | 5        |         | host=192.168.1.101 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | production     | 100      | 5        | 0 bytes | host=192.168.1.102 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 3  | node3 | standby |   running | node1    | local_disaster | 100      | 5        | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 
 流复制状态信息:
test=# select application_name ,client_addr , state ,sent_lsn,write_lsn,flush_lsn,replay_lsn,sync_state
test-# from sys_stat_replication;
 application_name |  client_addr  |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | sync_state
------------------+---------------+-----------+------------+------------+------------+------------+------------
 node2            | 192.168.1.102 | streaming | 0/13002570 | 0/13002570 | 0/13002570 | 0/13002570 | sync
 node3            | 192.168.1.103 | streaming | 0/13002570 | 0/13002570 | 0/13002570 | 0/13002570 | potential
二、repmgr.conf相关配置
[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep failover
failover='automatic'
failover_need_server_alive='none'
[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep sync
sync_in_same_location='0'
synchronous='sync'
三、模拟主库数据库服务down
[kingbase@node101 bin]$ ./sys_ctl stop -D /data/kingbase/tptc/rh6/data/
四、查看备库hangr.log日志
1、生产中心备库
1)生产中心备库尝试连接主库数据库服务,超过阈值10次失败后,触发failover切换。
[2023-07-04 11:21:27] [INFO] sleeping up to 6 seconds until next reconnection attempt
[2023-07-04 11:21:33] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-07-04 11:21:33] [WARNING] unable to reconnect to node "node1" (ID: 1) after 10 attempts
2)断开所有备库的流复制
[2023-07-04 11:21:33] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2023-07-04 11:21:33] [WARNING] wal receiver not running
[2023-07-04 11:21:33] [NOTICE] WAL receiver disconnected on all sibling nodes
[2023-07-04 11:21:33] [INFO] WAL receiver disconnected on all 1 sibling nodes
3)生产中心备库在候选备库选举中获胜,将提升为新的主库
[2023-07-04 11:21:33] [INFO] 1 active sibling nodes registered
[2023-07-04 11:21:33] [INFO] 3 total nodes registered
[2023-07-04 11:21:33] [INFO] primary node  "node1" (ID: 1) and this node have the same location ("production")
[2023-07-04 11:21:33] [INFO] local node's last receive lsn: 0/100000A0
[2023-07-04 11:21:33] [INFO] checking state of sibling node "node3" (ID: 3)
[2023-07-04 11:21:33] [INFO] node "node3" (ID: 3) reports its upstream is node 1, last seen 55 second(s) ago
[2023-07-04 11:21:33] [INFO] standby node "node3" (ID: 3) last saw primary node 55 second(s) ago
[2023-07-04 11:21:33] [NOTICE] in quorum/sync/all mode, node "node3" (ID: 3) replication is not in streaming+quorum/sync stat, skipping
[2023-07-04 11:21:33] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2023-07-04 11:21:33] [NOTICE] promotion candidate is "node2" (ID: 2)
[2023-07-04 11:21:33] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2023-07-04 11:21:33] [NOTICE] this node is the winner, will now promote itself and inform other nodes
4)测试网络连通性(ping网关),执行promote提升备库为新主库。
[2023-07-04 11:21:33] [INFO] try to ping the trusted_servers "192.168.1.1" before execute promote_command
[2023-07-04 11:21:35] [NOTICE] PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
--- 192.168.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.198/0.201/0.204/0.003 ms
[2023-07-04 11:21:35] [NOTICE] successfully ping one or more of the trusted_servers "192.168.1.1"
[2023-07-04 11:21:35] [INFO] promote_command is:
  "/home/kingbase/cluster/tptc/rh6/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/tptc/rh6/kingbase/etc/repmgr.conf"
[WARNING] 1 sibling nodes found, but option "--siblings-follow" not specified
[DETAIL] these nodes will remain attached to the current primary:
  node3 (node ID: 3)
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node2" (ID: 2) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 0/100000A0
[2023-07-04 11:21:35] [NOTICE] try to stop old primary db (host: "192.168.1.101")
[INFO] SET synchronous TO "async" on primary host
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node2" (ID: 2) was successfully promoted to primary
2、灾备中心备库
[2023-07-04 11:21:42] [NOTICE] [thread pid:4567] node (ID: 3; host: "192.168.1.103") will follow myself, ready to auto-recovery
[2023-07-04 11:21:42] [NOTICE] [thread pid:4567] Now, the primary host ip: 192.168.1.102
[2023-07-04 11:21:43] [INFO] [thread pid:4567] ES connection to host "192.168.1.103" succeeded, ready to do auto-recovery
[2023-07-04 11:21:43] [INFO] node "node3" (ID: 3, HOST: 192.168.1.103) auto-recovery: STANDBY FOLLOW
[2023-07-04 11:21:45] [NOTICE] executing repmgr command "/home/kingbase/cluster/tptc/rh6/kingbase/bin/repmgr standby follow -f /home/kingbase/cluster/tptc/rh6/kingbase/etc/repmgr.conf -W --upstream-node-id=2"
---如上所示,对同城灾备中心备库执行repmgr standby follow。
3、灾备中心sys_log日志
如下所示,备库在failover过程中,连接新的主库(upstream)节点时有个被recovery,数据库服务短暂重启的过程。
2023-07-04 11:21:50.579 CST,,,10041,,64a3904e.2739,1,,2023-07-04 11:21:50 CST,,0,LOG,00000,"database system was shut down in recovery at 2023-07-04 11:21:50 CST",,,,,,,,,""
2023-07-04 11:21:50.579 CST,,,10041,,64a3904e.2739,2,,2023-07-04 11:21:50 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2023-07-04 11:21:50.586 CST,,,10041,,64a3904e.2739,3,,2023-07-04 11:21:50 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/100000A0",,,,,,,,,""
2023-07-04 11:21:50.586 CST,,,10041,,64a3904e.2739,4,,2023-07-04 11:21:50 CST,1/0,0,LOG,00000,"invalid record length at 0/100000A0: wanted 24, got 0",,,,,,,,,""
2023-07-04 11:21:50.586 CST,,,10028,,64a3904e.272c,2,,2023-07-04 11:21:50 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""
五、切换后集群状态
[kingbase@node102 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location       | Priority | Timeline | LSN_Lag | Connection string                                                                                                               
----+-------+---------+-----------+----------+----------------+----------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | production     | 100      |          |         | host=192.168.1.101 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | primary | * running |          | production     | 100      | 6        |         | host=192.168.1.102 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 3  | node3 | standby |   running | node2    | local_disaster | 100      | 5        | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
[WARNING] following issues were detected
  - unable to connect to node "node1" (ID: 1)
[HINT] execute with --verbose option to see connection error messages
案例二:中心之间failover切换
1、生产中心集群状态
如下图所示,生产中心node1节点数据服务down,现在node2为集群主节点。

2、repmgr.conf相关配置
[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep failover
failover='automatic'
failover_need_server_alive='none'
[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep sync
sync_in_same_location='0'
synchronous='sync'
3、模拟node2节点数据库服务down
[kingbase@node102 bin]$ ./sys_ctl stop -D /data/kingbase/tptc/rh6/data/
4、查看灾备备库hamgr.log
2023-06-27 14:48:03] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-06-27 14:48:03] [WARNING] unable to reconnect to node "node1" (ID: 1) after 10 attempts
[2023-06-27 14:48:03] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2023-06-27 14:48:03] [WARNING] wal receiver not running
[2023-06-27 14:48:03] [NOTICE] WAL receiver disconnected on all sibling nodes
[2023-06-27 14:48:03] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2023-06-27 14:48:03] [INFO] 0 active sibling nodes registered
[2023-06-27 14:48:03] [INFO] 3 total nodes registered
[2023-06-27 14:48:03] [INFO] primary node "node1" (ID: 1) has location "production", this node's location is "local_disaster"
[2023-06-27 14:48:03] [INFO] no other sibling nodes and servers alive - we win by default
[2023-06-27 14:48:03] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2023-06-27 14:48:03] [NOTICE] this node is the only available candidate and will now promote itself
[2023-06-27 14:48:03] [INFO] try to ping the trusted_servers "192.168.1.1" before execute promote_command
[2023-06-27 14:48:05] [NOTICE] PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
--- 192.168.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.207/0.229/0.249/0.021 ms
[2023-06-27 14:48:05] [NOTICE] successfully ping one or more of the trusted_servers "192.168.1.1"
[2023-06-27 14:48:05] [INFO] promote_command is:
  "/home/kingbase/cluster/tptc/rh6/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/tptc/rh6/kingbase/etc/repmgr.conf"
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node3" (ID: 3) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 0/9001118
[INFO] SET synchronous TO "async" on primary host
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node3" (ID: 3) was successfully promoted to primary
---如上所示,此灾备中心备库节点是集群唯一节点,生产中心所有节点down后,提升为新的主库。
 
                    
                     
                    
                 
                    
                
 
 
                
            
         
         浙公网安备 33010602011771号
浙公网安备 33010602011771号