day10-06-MHA Failure Simulation

MHA Failure Simulation and Recovery

1. Take down the db01 database

# Use whichever matches the init system (SysV init script or systemd)
/etc/init.d/mysqld stop
systemctl stop mysqld

Check the manager log; it shows the whole failover process:

Mon Oct 19 15:24:38 2020 - [info] Starting ping health check on 10.0.50.61(10.0.50.61:3306)..
Mon Oct 19 15:24:38 2020 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Mon Oct 19 15:28:42 2020 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Mon Oct 19 15:28:42 2020 - [info] Executing SSH check script: exit 0
Mon Oct 19 15:28:42 2020 - [info] HealthCheck: SSH to 10.0.50.61 is reachable.
Mon Oct 19 15:28:44 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.0.50.61' (111))
Mon Oct 19 15:28:44 2020 - [warning] Connection failed 2 time(s)..
Mon Oct 19 15:28:46 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.0.50.61' (111))
Mon Oct 19 15:28:46 2020 - [warning] Connection failed 3 time(s)..
Mon Oct 19 15:28:48 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.0.50.61' (111))
Mon Oct 19 15:28:48 2020 - [warning] Connection failed 4 time(s)..
Mon Oct 19 15:28:48 2020 - [warning] Master is not reachable from health checker!
Mon Oct 19 15:28:48 2020 - [warning] Master 10.0.50.61(10.0.50.61:3306) is not reachable!
Mon Oct 19 15:28:48 2020 - [warning] SSH is reachable.
Mon Oct 19 15:28:48 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/mha/app1.cnf again, and trying to connect to all servers to check server status..
Mon Oct 19 15:28:48 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon Oct 19 15:28:48 2020 - [info] Reading application default configuration from /etc/mha/app1.cnf..
Mon Oct 19 15:28:48 2020 - [info] Reading server configuration from /etc/mha/app1.cnf..
Mon Oct 19 15:28:49 2020 - [info] GTID failover mode = 1
Mon Oct 19 15:28:49 2020 - [info] Dead Servers:
Mon Oct 19 15:28:49 2020 - [info]   10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:49 2020 - [info] Alive Servers:
Mon Oct 19 15:28:49 2020 - [info]   10.0.50.62(10.0.50.62:3306)
Mon Oct 19 15:28:49 2020 - [info]   10.0.50.63(10.0.50.63:3306)
Mon Oct 19 15:28:49 2020 - [info] Alive Slaves:
Mon Oct 19 15:28:49 2020 - [info]   10.0.50.62(10.0.50.62:3306)  Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:49 2020 - [info]     GTID ON
Mon Oct 19 15:28:49 2020 - [info]     Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:49 2020 - [info]   10.0.50.63(10.0.50.63:3306)  Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:49 2020 - [info]     GTID ON
Mon Oct 19 15:28:49 2020 - [info]     Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:49 2020 - [info] Checking slave configurations..
Mon Oct 19 15:28:49 2020 - [info]  read_only=1 is not set on slave 10.0.50.62(10.0.50.62:3306).
Mon Oct 19 15:28:49 2020 - [info]  read_only=1 is not set on slave 10.0.50.63(10.0.50.63:3306).
Mon Oct 19 15:28:49 2020 - [info] Checking replication filtering settings..
Mon Oct 19 15:28:49 2020 - [info]  Replication filtering check ok.
Mon Oct 19 15:28:49 2020 - [info] Master is down!
Mon Oct 19 15:28:49 2020 - [info] Terminating monitoring script.
Mon Oct 19 15:28:49 2020 - [info] Got exit code 20 (Master dead).
Mon Oct 19 15:28:49 2020 - [info] MHA::MasterFailover version 0.58.
Mon Oct 19 15:28:49 2020 - [info] Starting master failover.
Mon Oct 19 15:28:49 2020 - [info] 
Mon Oct 19 15:28:49 2020 - [info] * Phase 1: Configuration Check Phase..
Mon Oct 19 15:28:49 2020 - [info] 
Mon Oct 19 15:28:49 2020 - [info] HealthCheck: SSH to 10.0.50.63 is reachable.
Mon Oct 19 15:28:50 2020 - [info] Binlog server 10.0.50.63 is reachable.
Mon Oct 19 15:28:51 2020 - [info] GTID failover mode = 1
Mon Oct 19 15:28:51 2020 - [info] Dead Servers:
Mon Oct 19 15:28:51 2020 - [info]   10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] Checking master reachability via MySQL(double check)...
Mon Oct 19 15:28:51 2020 - [info]  ok.
Mon Oct 19 15:28:51 2020 - [info] Alive Servers:
Mon Oct 19 15:28:51 2020 - [info]   10.0.50.62(10.0.50.62:3306)
Mon Oct 19 15:28:51 2020 - [info]   10.0.50.63(10.0.50.63:3306)
Mon Oct 19 15:28:51 2020 - [info] Alive Slaves:
Mon Oct 19 15:28:51 2020 - [info]   10.0.50.62(10.0.50.62:3306)  Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info]     GTID ON
Mon Oct 19 15:28:51 2020 - [info]     Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info]   10.0.50.63(10.0.50.63:3306)  Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info]     GTID ON
Mon Oct 19 15:28:51 2020 - [info]     Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] Starting GTID based failover.
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Oct 19 15:28:51 2020 - [info] Executing master IP deactivation script:
Mon Oct 19 15:28:51 2020 - [info]   /usr/local/bin/master_ip_failover --orig_master_host=10.0.50.61 --orig_master_ip=10.0.50.61 --orig_master_port=3306 --command=stopssh --ssh_user=root
Mon Oct 19 15:28:51 2020 - [info]  done.
Mon Oct 19 15:28:51 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Oct 19 15:28:51 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] * Phase 3: Master Recovery Phase..
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000019:1188
Mon Oct 19 15:28:51 2020 - [info] Retrieved Gtid Set: b4dfda8a-0f89-11eb-a610-525400c58226:1-5
Mon Oct 19 15:28:51 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Oct 19 15:28:51 2020 - [info]   10.0.50.62(10.0.50.62:3306)  Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info]     GTID ON
Mon Oct 19 15:28:51 2020 - [info]     Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info]   10.0.50.63(10.0.50.63:3306)  Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info]     GTID ON
Mon Oct 19 15:28:51 2020 - [info]     Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000019:1188
Mon Oct 19 15:28:51 2020 - [info] Retrieved Gtid Set: b4dfda8a-0f89-11eb-a610-525400c58226:1-5
Mon Oct 19 15:28:51 2020 - [info] Oldest slaves:
Mon Oct 19 15:28:51 2020 - [info]   10.0.50.62(10.0.50.62:3306)  Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info]     GTID ON
Mon Oct 19 15:28:51 2020 - [info]     Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info]   10.0.50.63(10.0.50.63:3306)  Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info]     GTID ON
Mon Oct 19 15:28:51 2020 - [info]     Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] * Phase 3.3: Determining New Master Phase..
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] Searching new master from slaves..
Mon Oct 19 15:28:51 2020 - [info]  Candidate masters from the configuration file:
Mon Oct 19 15:28:51 2020 - [info]  Non-candidate masters:
Mon Oct 19 15:28:51 2020 - [info] New master is 10.0.50.62(10.0.50.62:3306)
Mon Oct 19 15:28:51 2020 - [info] Starting master failover..
Mon Oct 19 15:28:51 2020 - [info] 
From:
10.0.50.61(10.0.50.61:3306) (current master)
 +--10.0.50.62(10.0.50.62:3306)
 +--10.0.50.63(10.0.50.63:3306)

To:
10.0.50.62(10.0.50.62:3306) (new master)
 +--10.0.50.63(10.0.50.63:3306)
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Mon Oct 19 15:28:51 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info]  Waiting all logs to be applied.. 
Mon Oct 19 15:28:51 2020 - [info]   done.
Mon Oct 19 15:28:51 2020 - [info] -- Saving binlog from host 10.0.50.63 started, pid: 19160
Mon Oct 19 15:28:52 2020 - [info] 
Mon Oct 19 15:28:52 2020 - [info] Log messages from 10.0.50.63 ...
Mon Oct 19 15:28:52 2020 - [info] 
Mon Oct 19 15:28:51 2020 - [info] Fetching binary logs from binlog server 10.0.50.63..
Mon Oct 19 15:28:51 2020 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=mysql-bin.000019  --start_pos=1188 --output_file=/var/tmp/saved_binlog_binlog1_20201019152849.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.58 --oldest_version=5.7.26-log  --binlog_dir=/data/mysql/mha/binlog_server 
  Creating /var/tmp if not exists..    ok.
 Concat binary/relay logs from mysql-bin.000019 pos 1188 to mysql-bin.000019 EOF into /var/tmp/saved_binlog_binlog1_20201019152849.binlog ..
No additional binlog events found.
Event not exists.
Mon Oct 19 15:28:51 2020 - [info] Additional events were not found from the binlog server. No need to save.
Mon Oct 19 15:28:52 2020 - [info] End of log messages from 10.0.50.63.
Mon Oct 19 15:28:52 2020 - [info] No binlog events found from 10.0.50.63. Skipping
Mon Oct 19 15:28:52 2020 - [info] Getting new master's binlog name and position..
Mon Oct 19 15:28:52 2020 - [info]  mysql-bin.000004:1188
Mon Oct 19 15:28:52 2020 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.50.62', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Oct 19 15:28:52 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000004, 1188, b4dfda8a-0f89-11eb-a610-525400c58226:1-5
Mon Oct 19 15:28:52 2020 - [info] Executing master IP activate script:
Mon Oct 19 15:28:52 2020 - [info]   /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=10.0.50.61 --orig_master_ip=10.0.50.61 --orig_master_port=3306 --new_master_host=10.0.50.62 --new_master_ip=10.0.50.62 --new_master_port=3306 --new_master_user='mha'   --new_master_password=xxx
Enabling the VIP - 10.0.50.69/24 on the new master - 10.0.50.62 
Mon Oct 19 15:28:52 2020 - [info]  OK.
Mon Oct 19 15:28:52 2020 - [info] ** Finished master recovery successfully.
Mon Oct 19 15:28:52 2020 - [info] * Phase 3: Master Recovery Phase completed.
Mon Oct 19 15:28:52 2020 - [info] 
Mon Oct 19 15:28:52 2020 - [info] * Phase 4: Slaves Recovery Phase..
Mon Oct 19 15:28:52 2020 - [info] 
Mon Oct 19 15:28:52 2020 - [info] 
Mon Oct 19 15:28:52 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Mon Oct 19 15:28:52 2020 - [info] 
Mon Oct 19 15:28:52 2020 - [info] -- Slave recovery on host 10.0.50.63(10.0.50.63:3306) started, pid: 19176. Check tmp log /var/log/mha/app1/10.0.50.63_3306_20201019152849.log if it takes time..
Mon Oct 19 15:28:54 2020 - [info] 
Mon Oct 19 15:28:54 2020 - [info] Log messages from 10.0.50.63 ...
Mon Oct 19 15:28:54 2020 - [info] 
Mon Oct 19 15:28:52 2020 - [info]  Resetting slave 10.0.50.63(10.0.50.63:3306) and starting replication from the new master 10.0.50.62(10.0.50.62:3306)..
Mon Oct 19 15:28:52 2020 - [info]  Executed CHANGE MASTER.
Mon Oct 19 15:28:53 2020 - [info]  Slave started.
Mon Oct 19 15:28:53 2020 - [info]  gtid_wait(b4dfda8a-0f89-11eb-a610-525400c58226:1-5) completed on 10.0.50.63(10.0.50.63:3306). Executed 0 events.
Mon Oct 19 15:28:54 2020 - [info] End of log messages from 10.0.50.63.
Mon Oct 19 15:28:54 2020 - [info] -- Slave on host 10.0.50.63(10.0.50.63:3306) started.
Mon Oct 19 15:28:54 2020 - [info] All new slave servers recovered successfully.
Mon Oct 19 15:28:54 2020 - [info] 
Mon Oct 19 15:28:54 2020 - [info] * Phase 5: New master cleanup phase..
Mon Oct 19 15:28:54 2020 - [info] 
Mon Oct 19 15:28:54 2020 - [info] Resetting slave info on the new master..
Mon Oct 19 15:28:54 2020 - [info]  10.0.50.62: Resetting slave info succeeded.
Mon Oct 19 15:28:54 2020 - [info] Master failover to 10.0.50.62(10.0.50.62:3306) completed successfully.
Mon Oct 19 15:28:54 2020 - [info] Deleted server1 entry from /etc/mha/app1.cnf .
Mon Oct 19 15:28:54 2020 - [info] 

----- Failover Report -----

app1: MySQL Master failover 10.0.50.61(10.0.50.61:3306) to 10.0.50.62(10.0.50.62:3306) succeeded

Master 10.0.50.61(10.0.50.61:3306) is down!

Check MHA Manager logs at mysql-node03:/var/log/mha/app1/manager for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 10.0.50.61(10.0.50.61:3306)
Selected 10.0.50.62(10.0.50.62:3306) as a new master.
10.0.50.62(10.0.50.62:3306): OK: Applying all logs succeeded.
10.0.50.62(10.0.50.62:3306): OK: Activated master IP address.
10.0.50.63(10.0.50.63:3306): OK: Slave started, replicating from 10.0.50.62(10.0.50.62:3306)
10.0.50.62(10.0.50.62:3306): Resetting slave info succeeded.
Master failover to 10.0.50.62(10.0.50.62:3306) completed successfully.
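
The log above is long, so it can help to reduce it to the phase markers plus the final result. The following is a minimal sketch, not part of MHA itself: the function name mha_failover_summary is made up for illustration, and it assumes the manager log is the /var/log/mha/app1/manager file used in this walkthrough.

```shell
# Print the phase markers and the final result line from an MHA manager log.
# mha_failover_summary is a hypothetical helper name, not an MHA command.
mha_failover_summary() {
    local log_file="$1"
    # Phase headers look like "... - [info] * Phase 3.1: Getting Latest Slaves Phase.."
    grep -E '\* Phase [0-9.]+:' "$log_file"
    # The closing line names the new master when the failover succeeded.
    grep -E 'Master failover to .* completed successfully' "$log_file" | tail -n 1
}
```

Example call on the manager node: mha_failover_summary /var/log/mha/app1/manager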

2. Recover from the failure

1) Start the failed node

db01:
[root@mysql-node01 binlog]# systemctl start mysqld
[root@mysql-node01 binlog]# ss -lntp|grep 3306
LISTEN     0      80        [::]:3306                  [::]:*                   users:(("mysqld",pid=4521,fd=33))
[root@mysql-node01 binlog]#

2) On db01, run a CHANGE MASTER TO statement to rebuild the replication relationship

# The statement to use is printed in the log on the MHA manager node

[root@mysql-node03 binlog_server]# grep 'CHANGE' /var/log/mha/app1/manager
Mon Oct 19 15:28:52 2020 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.50.62', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Oct 19 15:28:52 2020 - [info]  Executed CHANGE MASTER.
[root@mysql-node03 binlog_server]# 
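
The grep output still has to be trimmed by hand, and the log masks the password as 'xxx'. As a rough sketch, the statement can be pulled out and the real password filled in with sed and awk. The helper name mha_change_master_stmt is hypothetical, and the substitution assumes the password has no '&' or backslash characters, which awk would treat specially.

```shell
# Print the last CHANGE MASTER TO statement suggested in an MHA manager log,
# with the masked password 'xxx' replaced by the real replication password.
# mha_change_master_stmt is a hypothetical helper name, not an MHA command.
mha_change_master_stmt() {
    local log_file="$1"
    local repl_password="$2"
    # Keep only the text after "Statement should be: ", up to the final semicolon.
    sed -n 's/.*Statement should be: \(CHANGE MASTER TO .*;\)[[:space:]]*$/\1/p' "$log_file" \
        | tail -n 1 \
        | awk -v pw="$repl_password" -v q="'" \
            '{ sub("MASTER_PASSWORD=" q "xxx" q, "MASTER_PASSWORD=" q pw q); print }'
}
```

Example call on the manager node: mha_change_master_stmt /var/log/mha/app1/manager repl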

Run on db01:

CHANGE MASTER TO MASTER_HOST='10.0.50.62', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='repl';


mysql> CHANGE MASTER TO MASTER_HOST='10.0.50.62', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='repl'
    -> ;
Query OK, 0 rows affected, 2 warnings (0.03 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
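
The two fields that matter in that output are Slave_IO_Running and Slave_SQL_Running; both should be Yes before moving on. A small sketch of checking this from the shell is shown here. The function name replication_ok is made up for illustration, and how the mysql client authenticates (for example through ~/.my.cnf) is an assumption left to the reader.

```shell
# Read `show slave status\G` output on stdin and succeed only when both
# replication threads report "Yes". replication_ok is a hypothetical helper.
replication_ok() {
    awk -F': ' '
        /^ *Slave_IO_Running:/  { io = $2;  gsub(/[[:space:]]/, "", io) }
        /^ *Slave_SQL_Running:/ { sql = $2; gsub(/[[:space:]]/, "", sql) }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}
```

Example call on db01: mysql -e 'show slave status\G' | replication_ok && echo "replication is running"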

3) Restore the old master's entry in the configuration file
vi /etc/mha/app1.cnf

[server1]
hostname=10.0.50.61
port=3306
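
The entry is missing because the manager ran with --remove_dead_master_conf and deleted server1 during failover (see "Deleted server1 entry" in the log). Instead of editing with vi, the block can be appended only when it is absent. This is a sketch under that assumption; mha_add_server_entry is a hypothetical helper name.

```shell
# Append a [serverN] block to an MHA config file unless that section exists.
# mha_add_server_entry is a hypothetical helper name, not an MHA command.
mha_add_server_entry() {
    local conf_file="$1"
    local section="$2"
    local host="$3"
    local port="$4"
    # -x matches the whole line, -F treats "[server1]" as a literal string.
    if grep -qxF "[$section]" "$conf_file"; then
        echo "[$section] already present in $conf_file" >&2
        return 0
    fi
    printf '\n[%s]\nhostname=%s\nport=%s\n' "$section" "$host" "$port" >> "$conf_file"
}
```

Example call on the manager node: mha_add_server_entry /etc/mha/app1.cnf server1 10.0.50.61 3306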

4) Restore the binlog server

[root@mysql-node03 binlog_server]# pwd
/data/mysql/mha/binlog_server
[root@mysql-node03 binlog_server]#
[root@mysql-node03 binlog_server]# rm -rf ./mysql-bin.00001*
[root@mysql-node03 binlog_server]# mysqlbinlog -R --host=10.0.50.62 --user=mha --password=mha --raw --stop-never mysql-bin.000004 &
[1] 19260
[root@mysql-node03 binlog_server]# mysqlbinlog: [Warning] Using a password on the command line interface can be insecure.

[root@mysql-node03 binlog_server]#
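
The starting file mysql-bin.000004 is the new master's binlog file that MHA recorded in the "Master Recovery succeeded" line of the failover log. One possible way to read it from the log rather than typing it by hand is sketched here. The helper name mha_new_master_binlog is hypothetical, and the sketch assumes that file has not been purged on the new master since the failover.

```shell
# Print the new master's binlog file name recorded by MHA at failover time.
# mha_new_master_binlog is a hypothetical helper name, not an MHA command.
mha_new_master_binlog() {
    local log_file="$1"
    # The log line reads "... File:Pos:Exec_Gtid_Set: <file>, <pos>, <gtid set>".
    sed -n 's/.*Master Recovery succeeded\. File:Pos:Exec_Gtid_Set: \([^,]*\),.*/\1/p' "$log_file" \
        | tail -n 1
}
```

Example call on the manager node: mha_new_master_binlog /var/log/mha/app1/manager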

5) Start the MHA manager

[root@mysql-node03 binlog_server]# nohup masterha_manager --conf=/etc/mha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/mha/app1/manager.log 2>&1 &
[2] 19262
[root@mysql-node03 binlog_server]#
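
Since the manager starts in the background, it is worth confirming that monitoring is actually up. MHA ships masterha_check_status for this; the sketch below polls any status command until it succeeds. The wrapper name wait_until_ok is made up for illustration, and it assumes masterha_check_status exits with status 0 while the manager is monitoring normally.

```shell
# Run a command repeatedly until it exits 0 or the attempts are used up.
# wait_until_ok is a hypothetical helper name, not an MHA command.
# Arguments: <attempts> <delay in seconds> <command> [args...]
wait_until_ok() {
    local attempts="$1"
    local delay="$2"
    shift 2
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Example call on the manager node: wait_until_ok 10 1 masterha_check_status --conf=/etc/mha/app1.cnf && echo "MHA manager is monitoring"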
posted @ 2022-11-24 20:30  oldSimon