postgresql 高可用 repmgr 的使用之六 1 Primary + 1 Standby 的 auto failover

os:ubunbu 16.04
postgresql:9.6.8
repmgr:4.1.1

192.168.56.101 node1
192.168.56.102 node2

生产环境确保网络稳定的前提下,数据库能够 auto failover 是最好的状态。如果是同步复制的话,建议配置两个slave,如果是异步复制,也建议配置两个slave。

需要实现 auto failover,就需要启用 repmgrd,而启用 repmgrd 就必须设置 postgresql.conf 的 shared_preload_libraries = ‘repmgr’

$ vi /etc/postgresql/9.6/main/postgresql.conf
shared_preload_libraries = 'repmgr'

/etc/repmgr.conf 的内容

node1 节点上查看,node2 节点上基本相同。

# more /etc/repmgr.conf

##############################################
# node information
node_id=1
node_name=node1
conninfo='host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/9.6/main'
use_replication_slots=true
pg_bindir='/usr/lib/postgresql/9.6/bin'
pg_ctl_options='-l /var/log/postgres/startup.log'
rsync_options=--exclude=postgresql.local.conf --archive
#use_primary_conninfo_password=true
location='location01'
##############################################
# service_start_command
service_start_command   = 'sudo pg_ctlcluster 9.6 main start'
service_stop_command    = 'sudo pg_ctlcluster 9.6 main stop'
service_restart_command = 'sudo pg_ctlcluster 9.6 main restart'
service_reload_command  = 'sudo pg_ctlcluster 9.6 main reload' 
service_promote_command = 'sudo pg_ctlcluster 9.6 main promote'
##############################################
# log info
log_level=INFO
log_status_interval=10
log_file=/var/log/postgresql/repmgrd.log
##############################################
# repmgrd options
monitoring_history=yes
monitor_interval_secs=5
reconnect_attempts=10
reconnect_interval=5
##############################################
# automatic failover
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'

启动 repmgrd

每个节点都是需要启动的
手动启动

# repmgrd -f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid

系统启动

# vi /etc/default/repmgrd 

# default settings for repmgrd. This file is source by /bin/sh from
# /etc/init.d/repmgrd

# disable repmgrd by default so it won't get started upon installation
# valid values: yes/no
REPMGRD_ENABLED=yes

# configuration file (required)
REPMGRD_CONF="/etc/repmgr.conf"

# additional options
#REPMGRD_OPTS=""

# user to run repmgrd as
REPMGRD_USER=postgres

# repmgrd binary
REPMGRD_BIN=/usr/bin/repmgrd

# pid file
REPMGRD_PIDFILE=/var/run/repmgrd.pid

REPMGRD_CONF 指向了配置文件
非常有必要配置成系统启用。

# systemctl status repmgrd.service
# systemctl start repmgrd.service
# systemctl enable repmgrd.service

node1 上的 repmgrd 进程

# ps -ef|grep -i repmgrd |grep -v grep
postgres  4091     1  0 14:30 ?        00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf

# cat /var/log/postgresql/repmgrd.log 
[2018-09-25 14:30:28] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 14:30:28] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 14:30:28] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2018-09-25 14:30:28] [NOTICE] monitoring cluster primary "node1" (node ID: 1)
[2018-09-25 14:30:38] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 14:30:48] [INFO] monitoring primary node "node1" (node ID: 1) in normal state

node2 上的 repmgrd 进程

# ps -ef|grep -i repmgrd |grep -v grep
postgres  4349     1  0 14:18 ?        00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf

# cat /var/log/postgresql/repmgrd.log 
[2018-09-25 14:33:44] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 14:33:44] [INFO] connecting to database "host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 14:33:44] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2018-09-25 14:33:44] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
[2018-09-25 14:33:55] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 14:33:55] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 14:34:05] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 14:34:05] [DETAIL] last monitoring statistics update was 5 seconds ago

手动关闭主库模拟异常

node1 节点上操作

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
 
$ pg_ctl -D /var/lib/postgresql/9.6/main -m fast stop
或者
$ sudo pg_ctlcluster 9.6 main stop

$ tail -f /var/log/postgresql/repmgrd.log
[2018-09-25 15:08:23] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 15:08:23] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 15:08:23] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2018-09-25 15:08:23] [NOTICE] monitoring cluster primary "node1" (node ID: 1)
[2018-09-25 15:08:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:53] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:03] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:13] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:23] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:53] [WARNING] unable to connect to local node
[2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts
[2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts
[2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts
[2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts
[2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts
[2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts
[2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts
[2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts
[2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts
[2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
[2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
[2018-09-25 15:10:38] [NOTICE] unable to connect to local node, falling back to degraded monitoring
[2018-09-25 15:10:38] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state
[2018-09-25 15:10:38] [DETAIL] waiting for the node to become available
[2018-09-25 15:10:48] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state
[2018-09-25 15:10:48] [DETAIL] waiting for the node to become available

node2 上查看

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Connection string                                            
----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | ? unreachable |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
	Is the server running on host "192.168.56.101" and accepting
	TCP/IP connections on port 5432?"
  - node "node1" (ID: 1) is registered as an active primary but is unreachable
$ tail -f /var/log/postgresql/repmgrd.log 

[2018-09-25 15:07:58] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 15:07:58] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 15:07:58] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2018-09-25 15:07:58] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
[2018-09-25 15:08:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:08] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:18] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:28] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:38] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:48] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:58] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:58] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:08] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:18] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:28] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:38] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:48] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:53] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
[2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts
[2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts
[2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts
[2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts
[2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts
[2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts
[2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts
[2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts
[2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts
[2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
[2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
[2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself
[2018-09-25 15:10:38] [INFO] promote_command is:
  "/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
[2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log"

[2018-09-25 15:10:38] [NOTICE] promoting standby to primary
[2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
[2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful
[2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2018-09-25 15:10:39] [INFO] switching to primary monitoring mode
[2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2)
[2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
[2018-09-25 15:10:59] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
[2018-09-25 15:11:09] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
[2018-09-25 15:11:19] [INFO] monitoring primary node "node2" (node ID: 2) in normal state

在经过 10*5s的时常后,发生了 auto failover: promoting standby to primary,如下

[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
[2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
[2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself
[2018-09-25 15:10:38] [INFO] promote_command is:
  "/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
[2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log"

[2018-09-25 15:10:38] [NOTICE] promoting standby to primary
[2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
[2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful
[2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2018-09-25 15:10:39] [INFO] switching to primary monitoring mode
[2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2)
[2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state

node2 上查看 cluster show

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
	Is the server running on host "192.168.56.101" and accepting
	TCP/IP connections on port 5432?"
	

虚拟机掉电模拟异常

经过前面的测试,node2 现在是 master,node1 通过 node rejoin后变为slave,细节可以参考上一篇blog。
node2 节点掉电前状态查看

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

node2 开始掉电

node1 上查看

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Connection string                                            
----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
 1  | node1 | standby |   running     | node2    | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | ? unreachable |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - when attempting to connect to node "node2" (ID: 2), following error encountered :
"could not connect to server: No route to host
	Is the server running on host "192.168.56.102" and accepting
	TCP/IP connections on port 5432?"
  - node "node2" (ID: 2) is registered as an active primary but is unreachable
  

此时 node1 节点的也提升为 master,提升为master的日志输出基本类似,不再重复粘贴。

参考:
https://www.2ndquadrant.com/en/resources/repmgr/
https://github.com/2ndQuadrant/repmgr
https://repmgr.org/docs/4.1/using-repmgrd.html

posted @ 2018-09-25 17:10  peiybpeiyb  阅读(968)  评论(0编辑  收藏  举报