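day10-06 - MHA Failure Simulation
MHA failure simulation and recovery
1. Stop the database on db01 (use whichever command matches how mysqld is managed on the host)
/etc/init.d/mysqld stop
systemctl stop mysqld
Check the manager log to follow the failover process: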
Mon Oct 19 15:24:38 2020 - [info] Starting ping health check on 10.0.50.61(10.0.50.61:3306)..
Mon Oct 19 15:24:38 2020 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Mon Oct 19 15:28:42 2020 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Mon Oct 19 15:28:42 2020 - [info] Executing SSH check script: exit 0
Mon Oct 19 15:28:42 2020 - [info] HealthCheck: SSH to 10.0.50.61 is reachable.
Mon Oct 19 15:28:44 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.0.50.61' (111))
Mon Oct 19 15:28:44 2020 - [warning] Connection failed 2 time(s)..
Mon Oct 19 15:28:46 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.0.50.61' (111))
Mon Oct 19 15:28:46 2020 - [warning] Connection failed 3 time(s)..
Mon Oct 19 15:28:48 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.0.50.61' (111))
Mon Oct 19 15:28:48 2020 - [warning] Connection failed 4 time(s)..
Mon Oct 19 15:28:48 2020 - [warning] Master is not reachable from health checker!
Mon Oct 19 15:28:48 2020 - [warning] Master 10.0.50.61(10.0.50.61:3306) is not reachable!
Mon Oct 19 15:28:48 2020 - [warning] SSH is reachable.
Mon Oct 19 15:28:48 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/mha/app1.cnf again, and trying to connect to all servers to check server status..
Mon Oct 19 15:28:48 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon Oct 19 15:28:48 2020 - [info] Reading application default configuration from /etc/mha/app1.cnf..
Mon Oct 19 15:28:48 2020 - [info] Reading server configuration from /etc/mha/app1.cnf..
Mon Oct 19 15:28:49 2020 - [info] GTID failover mode = 1
Mon Oct 19 15:28:49 2020 - [info] Dead Servers:
Mon Oct 19 15:28:49 2020 - [info] 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:49 2020 - [info] Alive Servers:
Mon Oct 19 15:28:49 2020 - [info] 10.0.50.62(10.0.50.62:3306)
Mon Oct 19 15:28:49 2020 - [info] 10.0.50.63(10.0.50.63:3306)
Mon Oct 19 15:28:49 2020 - [info] Alive Slaves:
Mon Oct 19 15:28:49 2020 - [info] 10.0.50.62(10.0.50.62:3306) Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:49 2020 - [info] GTID ON
Mon Oct 19 15:28:49 2020 - [info] Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:49 2020 - [info] 10.0.50.63(10.0.50.63:3306) Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:49 2020 - [info] GTID ON
Mon Oct 19 15:28:49 2020 - [info] Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:49 2020 - [info] Checking slave configurations..
Mon Oct 19 15:28:49 2020 - [info] read_only=1 is not set on slave 10.0.50.62(10.0.50.62:3306).
Mon Oct 19 15:28:49 2020 - [info] read_only=1 is not set on slave 10.0.50.63(10.0.50.63:3306).
Mon Oct 19 15:28:49 2020 - [info] Checking replication filtering settings..
Mon Oct 19 15:28:49 2020 - [info] Replication filtering check ok.
Mon Oct 19 15:28:49 2020 - [info] Master is down!
Mon Oct 19 15:28:49 2020 - [info] Terminating monitoring script.
Mon Oct 19 15:28:49 2020 - [info] Got exit code 20 (Master dead).
Mon Oct 19 15:28:49 2020 - [info] MHA::MasterFailover version 0.58.
Mon Oct 19 15:28:49 2020 - [info] Starting master failover.
Mon Oct 19 15:28:49 2020 - [info]
Mon Oct 19 15:28:49 2020 - [info] * Phase 1: Configuration Check Phase..
Mon Oct 19 15:28:49 2020 - [info]
Mon Oct 19 15:28:49 2020 - [info] HealthCheck: SSH to 10.0.50.63 is reachable.
Mon Oct 19 15:28:50 2020 - [info] Binlog server 10.0.50.63 is reachable.
Mon Oct 19 15:28:51 2020 - [info] GTID failover mode = 1
Mon Oct 19 15:28:51 2020 - [info] Dead Servers:
Mon Oct 19 15:28:51 2020 - [info] 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] Checking master reachability via MySQL(double check)...
Mon Oct 19 15:28:51 2020 - [info] ok.
Mon Oct 19 15:28:51 2020 - [info] Alive Servers:
Mon Oct 19 15:28:51 2020 - [info] 10.0.50.62(10.0.50.62:3306)
Mon Oct 19 15:28:51 2020 - [info] 10.0.50.63(10.0.50.63:3306)
Mon Oct 19 15:28:51 2020 - [info] Alive Slaves:
Mon Oct 19 15:28:51 2020 - [info] 10.0.50.62(10.0.50.62:3306) Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info] GTID ON
Mon Oct 19 15:28:51 2020 - [info] Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] 10.0.50.63(10.0.50.63:3306) Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info] GTID ON
Mon Oct 19 15:28:51 2020 - [info] Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] Starting GTID based failover.
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Oct 19 15:28:51 2020 - [info] Executing master IP deactivation script:
Mon Oct 19 15:28:51 2020 - [info] /usr/local/bin/master_ip_failover --orig_master_host=10.0.50.61 --orig_master_ip=10.0.50.61 --orig_master_port=3306 --command=stopssh --ssh_user=root
Mon Oct 19 15:28:51 2020 - [info] done.
Mon Oct 19 15:28:51 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Oct 19 15:28:51 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] * Phase 3: Master Recovery Phase..
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000019:1188
Mon Oct 19 15:28:51 2020 - [info] Retrieved Gtid Set: b4dfda8a-0f89-11eb-a610-525400c58226:1-5
Mon Oct 19 15:28:51 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Oct 19 15:28:51 2020 - [info] 10.0.50.62(10.0.50.62:3306) Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info] GTID ON
Mon Oct 19 15:28:51 2020 - [info] Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] 10.0.50.63(10.0.50.63:3306) Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info] GTID ON
Mon Oct 19 15:28:51 2020 - [info] Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000019:1188
Mon Oct 19 15:28:51 2020 - [info] Retrieved Gtid Set: b4dfda8a-0f89-11eb-a610-525400c58226:1-5
Mon Oct 19 15:28:51 2020 - [info] Oldest slaves:
Mon Oct 19 15:28:51 2020 - [info] 10.0.50.62(10.0.50.62:3306) Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info] GTID ON
Mon Oct 19 15:28:51 2020 - [info] Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info] 10.0.50.63(10.0.50.63:3306) Version=5.7.26-log (oldest major version between slaves) log-bin:enabled
Mon Oct 19 15:28:51 2020 - [info] GTID ON
Mon Oct 19 15:28:51 2020 - [info] Replicating from 10.0.50.61(10.0.50.61:3306)
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] * Phase 3.3: Determining New Master Phase..
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] Searching new master from slaves..
Mon Oct 19 15:28:51 2020 - [info] Candidate masters from the configuration file:
Mon Oct 19 15:28:51 2020 - [info] Non-candidate masters:
Mon Oct 19 15:28:51 2020 - [info] New master is 10.0.50.62(10.0.50.62:3306)
Mon Oct 19 15:28:51 2020 - [info] Starting master failover..
Mon Oct 19 15:28:51 2020 - [info]
From:
10.0.50.61(10.0.50.61:3306) (current master)
+--10.0.50.62(10.0.50.62:3306)
+--10.0.50.63(10.0.50.63:3306)
To:
10.0.50.62(10.0.50.62:3306) (new master)
+--10.0.50.63(10.0.50.63:3306)
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Mon Oct 19 15:28:51 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] Waiting all logs to be applied..
Mon Oct 19 15:28:51 2020 - [info] done.
Mon Oct 19 15:28:51 2020 - [info] -- Saving binlog from host 10.0.50.63 started, pid: 19160
Mon Oct 19 15:28:52 2020 - [info]
Mon Oct 19 15:28:52 2020 - [info] Log messages from 10.0.50.63 ...
Mon Oct 19 15:28:52 2020 - [info]
Mon Oct 19 15:28:51 2020 - [info] Fetching binary logs from binlog server 10.0.50.63..
Mon Oct 19 15:28:51 2020 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=mysql-bin.000019 --start_pos=1188 --output_file=/var/tmp/saved_binlog_binlog1_20201019152849.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.58 --oldest_version=5.7.26-log --binlog_dir=/data/mysql/mha/binlog_server
Creating /var/tmp if not exists.. ok.
Concat binary/relay logs from mysql-bin.000019 pos 1188 to mysql-bin.000019 EOF into /var/tmp/saved_binlog_binlog1_20201019152849.binlog ..
No additional binlog events found.
Event not exists.
Mon Oct 19 15:28:51 2020 - [info] Additional events were not found from the binlog server. No need to save.
Mon Oct 19 15:28:52 2020 - [info] End of log messages from 10.0.50.63.
Mon Oct 19 15:28:52 2020 - [info] No binlog events found from 10.0.50.63. Skipping
Mon Oct 19 15:28:52 2020 - [info] Getting new master's binlog name and position..
Mon Oct 19 15:28:52 2020 - [info] mysql-bin.000004:1188
Mon Oct 19 15:28:52 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.50.62', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Oct 19 15:28:52 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000004, 1188, b4dfda8a-0f89-11eb-a610-525400c58226:1-5
Mon Oct 19 15:28:52 2020 - [info] Executing master IP activate script:
Mon Oct 19 15:28:52 2020 - [info] /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=10.0.50.61 --orig_master_ip=10.0.50.61 --orig_master_port=3306 --new_master_host=10.0.50.62 --new_master_ip=10.0.50.62 --new_master_port=3306 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 10.0.50.69/24 on the new master - 10.0.50.62
Mon Oct 19 15:28:52 2020 - [info] OK.
Mon Oct 19 15:28:52 2020 - [info] ** Finished master recovery successfully.
Mon Oct 19 15:28:52 2020 - [info] * Phase 3: Master Recovery Phase completed.
Mon Oct 19 15:28:52 2020 - [info]
Mon Oct 19 15:28:52 2020 - [info] * Phase 4: Slaves Recovery Phase..
Mon Oct 19 15:28:52 2020 - [info]
Mon Oct 19 15:28:52 2020 - [info]
Mon Oct 19 15:28:52 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Mon Oct 19 15:28:52 2020 - [info]
Mon Oct 19 15:28:52 2020 - [info] -- Slave recovery on host 10.0.50.63(10.0.50.63:3306) started, pid: 19176. Check tmp log /var/log/mha/app1/10.0.50.63_3306_20201019152849.log if it takes time..
Mon Oct 19 15:28:54 2020 - [info]
Mon Oct 19 15:28:54 2020 - [info] Log messages from 10.0.50.63 ...
Mon Oct 19 15:28:54 2020 - [info]
Mon Oct 19 15:28:52 2020 - [info] Resetting slave 10.0.50.63(10.0.50.63:3306) and starting replication from the new master 10.0.50.62(10.0.50.62:3306)..
Mon Oct 19 15:28:52 2020 - [info] Executed CHANGE MASTER.
Mon Oct 19 15:28:53 2020 - [info] Slave started.
Mon Oct 19 15:28:53 2020 - [info] gtid_wait(b4dfda8a-0f89-11eb-a610-525400c58226:1-5) completed on 10.0.50.63(10.0.50.63:3306). Executed 0 events.
Mon Oct 19 15:28:54 2020 - [info] End of log messages from 10.0.50.63.
Mon Oct 19 15:28:54 2020 - [info] -- Slave on host 10.0.50.63(10.0.50.63:3306) started.
Mon Oct 19 15:28:54 2020 - [info] All new slave servers recovered successfully.
Mon Oct 19 15:28:54 2020 - [info]
Mon Oct 19 15:28:54 2020 - [info] * Phase 5: New master cleanup phase..
Mon Oct 19 15:28:54 2020 - [info]
Mon Oct 19 15:28:54 2020 - [info] Resetting slave info on the new master..
Mon Oct 19 15:28:54 2020 - [info] 10.0.50.62: Resetting slave info succeeded.
Mon Oct 19 15:28:54 2020 - [info] Master failover to 10.0.50.62(10.0.50.62:3306) completed successfully.
Mon Oct 19 15:28:54 2020 - [info] Deleted server1 entry from /etc/mha/app1.cnf .
Mon Oct 19 15:28:54 2020 - [info]
----- Failover Report -----
app1: MySQL Master failover 10.0.50.61(10.0.50.61:3306) to 10.0.50.62(10.0.50.62:3306) succeeded
Master 10.0.50.61(10.0.50.61:3306) is down!
Check MHA Manager logs at mysql-node03:/var/log/mha/app1/manager for details.
Started automated(non-interactive) failover.
Invalidated master IP address on 10.0.50.61(10.0.50.61:3306)
Selected 10.0.50.62(10.0.50.62:3306) as a new master.
10.0.50.62(10.0.50.62:3306): OK: Applying all logs succeeded.
10.0.50.62(10.0.50.62:3306): OK: Activated master IP address.
10.0.50.63(10.0.50.63:3306): OK: Slave started, replicating from 10.0.50.62(10.0.50.62:3306)
10.0.50.62(10.0.50.62:3306): Resetting slave info succeeded.
Master failover to 10.0.50.62(10.0.50.62:3306) completed successfully.
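The log above is long, so it can help to reduce it to the phase banners and the elected master. The following is a minimal sketch, assuming the manager log uses the `[info]` line format shown above; `summarize_failover` is a hypothetical helper name, not part of MHA.

```shell
# Sketch: print the failover phases and the elected master from an MHA manager log.
# summarize_failover is a hypothetical helper; pass the log path as the first argument.
summarize_failover() {
  local log="$1"
  # Phase banners ("* Phase N: ...") mark each stage of the failover.
  grep -E '\* Phase [0-9.]+:' "$log" | sed -E 's/^.*\[info\] *//'
  # The line that names the newly elected master.
  grep -E 'New master is' "$log" | sed -E 's/^.*\[info\] *//'
}
```

Usage on the manager node from this walkthrough would be `summarize_failover /var/log/mha/app1/manager`.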
2. Recover from the failure
1) Start the failed node
db01:
[root@mysql-node01 binlog]# systemctl start mysqld
[root@mysql-node01 binlog]# ss -lntp|grep 3306
LISTEN 0 80 [::]:3306 [::]:* users:(("mysqld",pid=4521,fd=33))
[root@mysql-node01 binlog]#
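The `ss` check above can be scripted so that later steps only run once mysqld is listening. This is a minimal sketch; `port_listening` is a hypothetical helper that reads `ss -lnt` style output on stdin and assumes the local address is the fourth column, as in the output shown above.

```shell
# Sketch: succeed if the given TCP port appears as a LISTEN socket in `ss -lnt` output.
# port_listening is a hypothetical helper; it reads the ss output from stdin.
port_listening() {
  local port="$1"
  awk -v p=":$port" '
    # Column 4 is the local address; compare its suffix with ":<port>".
    $1 == "LISTEN" && substr($4, length($4) - length(p) + 1) == p { ok = 1 }
    END { exit !ok }
  '
}
```

For example: `ss -lnt | port_listening 3306 && echo "mysqld is listening"`.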
2) On db01, run a CHANGE MASTER TO statement to rebuild replication
# The suggested statement can be found in the log on the MHA manager node
[root@mysql-node03 binlog_server]# grep 'CHANGE' /var/log/mha/app1/manager
Mon Oct 19 15:28:52 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.50.62', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Oct 19 15:28:52 2020 - [info] Executed CHANGE MASTER.
[root@mysql-node03 binlog_server]#
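The `grep` above returns whole log lines. A small sketch that pulls out just the suggested statement is shown below; `extract_change_master` is a hypothetical helper name. Note that MHA masks the password as 'xxx' in the log, so the real replication password still has to be substituted before running the statement.

```shell
# Sketch: print only the suggested CHANGE MASTER statement from the manager log.
# extract_change_master is a hypothetical helper; pass the log path as the first argument.
extract_change_master() {
  # Keep the text from "CHANGE MASTER TO" up to the closing semicolon; first match only.
  sed -n 's/^.*Statement should be: \(CHANGE MASTER TO .*;\).*$/\1/p' "$1" | head -n 1
}
```

Usage on the manager node: `extract_change_master /var/log/mha/app1/manager`.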
Run on db01 (the log masks the password as 'xxx'; use the real replication password):
CHANGE MASTER TO MASTER_HOST='10.0.50.62', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='repl';
mysql> CHANGE MASTER TO MASTER_HOST='10.0.50.62', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='repl'
-> ;
Query OK, 0 rows affected, 2 warnings (0.03 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
mysql> show slave status\G
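The output of `show slave status\G` is not shown above; the two fields that matter for this step are `Slave_IO_Running` and `Slave_SQL_Running`, which should both be `Yes`. The sketch below checks them from a script. `replication_ok` is a hypothetical helper that reads the `\G` output on stdin; how the `mysql` client authenticates is left as an assumption.

```shell
# Sketch: succeed only if both replication threads report "Yes".
# replication_ok is a hypothetical helper; it reads `SHOW SLAVE STATUS\G` output from stdin.
replication_ok() {
  awk -F': ' '
    # Match the exact field names so Slave_SQL_Running_State is not picked up.
    $1 ~ /^ *Slave_IO_Running$/  { io  = $2 }
    $1 ~ /^ *Slave_SQL_Running$/ { sql = $2 }
    END { exit !(io == "Yes" && sql == "Yes") }
  '
}
```

For example, on db01: `mysql -e 'SHOW SLAVE STATUS\G' | replication_ok && echo "replication threads are running"`.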
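3) Restore the old master's entry in the configuration file (MHA deleted the server1 entry from /etc/mha/app1.cnf during failover)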
vi /etc/mha/app1.cnf
[server1]
hostname=10.0.50.61
port=3306
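Editing the file by hand works; for repeated drills, the same change can be made idempotent. This is a minimal sketch, assuming the entry needs only the `hostname` and `port` keys shown above; `restore_server_entry` is a hypothetical helper name.

```shell
# Sketch: append a [serverN] section to the MHA config unless it is already there.
# restore_server_entry is a hypothetical helper: <conf> <section> <host> <port>.
restore_server_entry() {
  local conf="$1" section="$2" host="$3" port="$4"
  # Leave the file untouched if the section header already exists.
  if grep -qF "[$section]" "$conf"; then
    return 0
  fi
  printf '\n[%s]\nhostname=%s\nport=%s\n' "$section" "$host" "$port" >> "$conf"
}
```

For this walkthrough the call would be `restore_server_entry /etc/mha/app1.cnf server1 10.0.50.61 3306`.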
4) Restore the binlog server
[root@mysql-node03 binlog_server]# pwd
/data/mysql/mha/binlog_server
[root@mysql-node03 binlog_server]#
[root@mysql-node03 binlog_server]# rm -rf ./mysql-bin.00001*
[root@mysql-node03 binlog_server]# mysqlbinlog -R --host=10.0.50.62 --user=mha --password=mha --raw --stop-never mysql-bin.000004 &
[1] 19260
[root@mysql-node03 binlog_server]# mysqlbinlog: [Warning] Using a password on the command line interface can be insecure.
[root@mysql-node03 binlog_server]#
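The starting file `mysql-bin.000004` used above matches the new master's binlog name reported in the failover log. A sketch for reading that name from the log instead of copying it by hand follows; `new_master_binlog_file` is a hypothetical helper, and it assumes the file:position pair is on the line right after the "Getting new master's binlog name and position" message, as in the log shown earlier.

```shell
# Sketch: print the new master's binlog file name recorded in the manager log.
# new_master_binlog_file is a hypothetical helper; pass the log path as the first argument.
new_master_binlog_file() {
  awk '
    # The line after the marker holds "<file>:<position>"; strip the prefix and the position.
    found { sub(/^.*\[info\] */, ""); sub(/:[0-9]+$/, ""); print; exit }
    /Getting new master.s binlog name and position/ { found = 1 }
  ' "$1"
}
```

Usage on the manager node: `new_master_binlog_file /var/log/mha/app1/manager`.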
5) Start the MHA manager
[root@mysql-node03 binlog_server]# nohup masterha_manager --conf=/etc/mha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/mha/app1/manager.log 2>&1 &
[2] 19262
[root@mysql-node03 binlog_server]#
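After the manager starts, its state can be confirmed with `masterha_check_status --conf=/etc/mha/app1.cnf`. The sketch below extracts the master host from that command's output. `status_master` is a hypothetical helper, and the output format it matches ("... is running(0:PING_OK), master:<host>") is an assumption that should be checked against the installed MHA version.

```shell
# Sketch: print the master host if the manager reports PING_OK, otherwise print nothing.
# status_master is a hypothetical helper; it reads masterha_check_status output from stdin.
# Assumed output format: "app1 (pid:NNN) is running(0:PING_OK), master:10.0.50.62"
status_master() {
  sed -n 's/^.*is running(0:PING_OK), master:\(.*\)$/\1/p'
}
```

For example: `masterha_check_status --conf=/etc/mha/app1.cnf | status_master`, which in this walkthrough would be expected to name 10.0.50.62 once monitoring has resumed.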
