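KingbaseES V8R6 Cluster Operations Case: Analyzing Cluster State When the Standby's NIC Goes Down
Case description:
This case tests and analyzes the state of a KingbaseES V8R6 cluster after the network interface (NIC) on the standby host is brought down.
Applicable version:
KingbaseES V8R6
Host node information: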
[kingbase@node101 bin]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.101   node101     # primary
192.168.1.102   node102     # standby
Cluster node information:
ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node101 | primary | * running |          | running | 11180 | no      | n/a
 2  | node102 | standby |   running | node101  | running | 9242  | no      | 0 second(s) ago
I. Check cluster state and configuration
1. Cluster node status
[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                         
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 1        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 1        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
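The connection strings above carry TCP keepalive settings. As a rough sketch, assuming these parameters follow the usual libpq semantics (an assumption, since KingbaseES is PostgreSQL-derived but the source does not state this), the time needed to notice a dead idle connection is approximately keepalives_idle + keepalives_interval * keepalives_count. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def parse_conninfo(conninfo):
    """Split a simple 'key=value key=value' conninfo string into a dict.

    Assumes values contain no spaces or quotes, as in the output above.
    """
    return dict(item.split("=", 1) for item in conninfo.split())


def keepalive_detection_seconds(params):
    """Approximate time to declare an idle connection dead.

    Assumption: libpq-style semantics, i.e. wait keepalives_idle seconds,
    then send keepalives_count probes spaced keepalives_interval apart.
    """
    idle = int(params["keepalives_idle"])
    interval = int(params["keepalives_interval"])
    count = int(params["keepalives_count"])
    return idle + interval * count


# Connection string of node102, copied from the cluster show output.
conninfo = (
    "host=192.168.1.102 user=system dbname=esrep port=54321 "
    "connect_timeout=10 keepalives=1 keepalives_idle=10 "
    "keepalives_interval=1 keepalives_count=3"
)
params = parse_conninfo(conninfo)
print(params["host"], params["port"], keepalive_detection_seconds(params))
```

With the values shown in this cluster (10, 1, 3), the estimate is about 13 seconds, separate from the 10-second connect_timeout that applies to new connection attempts.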
2. Cluster configuration

II. Test: bring the standby's NIC down
1. Bring the standby's NIC down
[root@node102 ~]# ifconfig enp0s3 down

2. Check the standby's messages log

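3. Standby hamgr.log
The log shows that the standby's connections to both the primary and itself were closed, so it could no longer serve connections normally.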

4. Check cluster node status from the primary
[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string                                 
----+---------+---------+---------------+----------+----------+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running     |          | default  | 100      | 1        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby | ? unreachable | node101  | default  | 100      | ?        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
WARNING: following issues were detected
  - unable to connect to node "node102" (ID: 2)
  - node "node102" (ID: 2) is registered as an active standby but is unreachable
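As the output above shows, node101 is still the running primary and node102 is merely reported as unreachable: the cluster did not trigger a primary/standby switchover.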
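The same check can be scripted. The following is a minimal sketch that parses the text of `repmgr cluster show` and reports which node is the running primary and which nodes are unreachable. It assumes the pipe-separated table layout seen in this article; the function names are hypothetical, and the sample text is an abbreviated copy of the output above with the connection strings shortened.

```python
SAMPLE = """\
 ID | Name    | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+---------------+----------+----------+----------+----------+-------------------
 1  | node101 | primary | * running     |          | default  | 100      | 1        | host=192.168.1.101
 2  | node102 | standby | ? unreachable | node101  | default  | 100      | ?        | host=192.168.1.102
WARNING: following issues were detected
  - unable to connect to node "node102" (ID: 2)
"""


def parse_cluster_show(text):
    """Turn the table printed by 'repmgr cluster show' into a list of dicts."""
    header = None
    nodes = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # separator, WARNING, and bullet lines carry no cells
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # the first table line holds the column names
        else:
            nodes.append(dict(zip(header, cells)))
    return nodes


def summarize(nodes):
    """Report running primaries and unreachable nodes by name."""
    return {
        "primary": [n["Name"] for n in nodes
                    if n["Role"] == "primary" and n["Status"].endswith("running")],
        "unreachable": [n["Name"] for n in nodes
                        if "unreachable" in n["Status"]],
    }


summary = summarize(parse_cluster_show(SAMPLE))
print(summary)
```

If the primary list still contains only node101 after the standby's NIC goes down, no promotion has taken place.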
III. Standby NIC restored (up)
1. Check cluster status
[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                     
----+---------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 1        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 1        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
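2. Check the standby's hamgr.log
As the log below shows, once the standby's NIC was back up, the standby performed recovery by receiving the WAL stream and resynchronized with the primary.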
[2022-03-29 16:11:45] [INFO] node "node102" (ID: 2) monitoring upstream node "node101" (ID: 1) in normal state
[2022-03-29 16:11:45] [ERROR] unable to determine if server is in recovery
[2022-03-29 16:11:45] [DETAIL]
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
[2022-03-29 16:11:45] [DETAIL] query text is:
SELECT pg_catalog.pg_is_in_recovery()
[2022-03-29 16:11:47] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2022-03-29 16:12:24] [ERROR] is_rep_sync_streaming(): get 2 tuples
[2022-03-29 16:12:45] [ERROR] is_wal_all_recevied(): get 0 tuples
[2022-03-29 16:12:45] [ERROR] is_rep_sync_streaming(): get 0 tuples
[2022-03-29 16:12:47] [ERROR] is_wal_all_recevied(): get 0 tuples
[2022-03-29 16:12:47] [ERROR] is_rep_sync_streaming(): get 0 tuples
[2022-03-29 16:12:49] [ERROR] is_wal_all_recevied(): get 0 tuples
[2022-03-29 16:12:49] [ERROR] is_rep_sync_streaming(): get 0 tuples
[2022-03-29 16:16:47] [INFO] node "node102" (ID: 2) monitoring upstream node "node101" (ID: 1) in normal state
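To read such a log more quickly, the sketch below counts entries per level and measures the time between the first ERROR and the next "in normal state" INFO line. It assumes the `[timestamp] [LEVEL] message` layout shown above; the function names are hypothetical, and the sample is an abbreviated copy of the log. The resulting interval is only a rough proxy for how long the monitoring daemon on the standby reported problems, not an exact measure of the outage.

```python
import re
from collections import Counter
from datetime import datetime

LOG = """\
[2022-03-29 16:11:45] [INFO] node "node102" (ID: 2) monitoring upstream node "node101" (ID: 1) in normal state
[2022-03-29 16:11:45] [ERROR] unable to determine if server is in recovery
[2022-03-29 16:11:45] [DETAIL]
server closed the connection unexpectedly
[2022-03-29 16:11:47] [NOTICE] upstream is available but upstream connection has gone away, resetting
[2022-03-29 16:12:24] [ERROR] is_rep_sync_streaming(): get 2 tuples
[2022-03-29 16:12:49] [ERROR] is_rep_sync_streaming(): get 0 tuples
[2022-03-29 16:16:47] [INFO] node "node102" (ID: 2) monitoring upstream node "node101" (ID: 1) in normal state
"""

LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\]\s*(.*)$")


def parse_log(text):
    """Return (timestamp, level, message) tuples; continuation lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            entries.append((stamp, match.group(2), match.group(3)))
    return entries


def seconds_until_normal(entries):
    """Seconds from the first ERROR to the next 'in normal state' INFO line."""
    first_error = None
    for stamp, level, message in entries:
        if first_error is None:
            if level == "ERROR":
                first_error = stamp
        elif level == "INFO" and "in normal state" in message:
            return int((stamp - first_error).total_seconds())
    return None  # no error seen, or no recovery logged after it


entries = parse_log(LOG)
levels = Counter(level for _, level, _ in entries)
print(dict(levels), seconds_until_normal(entries))
```

For the excerpt in this article, the first ERROR is logged at 16:11:45 and the next normal-state message at 16:16:47, an interval of 302 seconds.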
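IV. Summary
 1. A network failure on the standby caused by its NIC going down does not trigger a primary/standby switchover. Once the NIC is back up, the cluster returns to normal.
 2. If the standby's database service goes down and recovery is set to 'automatic' or 'standby', the standby's database service is recovered automatically.
 3. This case was tested on a one-primary, one-standby architecture. In a one-primary, multi-standby architecture, if the NIC of the standby whose sync state is 'sync' goes down, the remaining standbys hold an election and one of them has its sync state promoted to 'sync'.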