MySQL 5.7复制配置不规范修改导致的坑
2、重新复现、详细分析与验证
- 为了方便问题排查,我们需要按照复现过程重新来一次(再把master_auto_position从1切换为0),这一次,我们每一步昨晚都先检查一些信息,以便找出问题的起源以及找到从根本上杜绝这个问题发生的方法!!
-
需要检查的信息有:
- stop slave;之前先show slave status查看一下复制的大概位置信息
- 查看并收集stop slave;时、change master to master_auto_position=0时、 start slave;时三个步骤的如下信息:
* 从库查看复制信息:show slave status;
* 从库查看slave_master_info表:select * from mysql.slave_master_info;
* 从库查看slave_relay_log_info表:select * from mysql.slave_relay_log_info;
* 从库查看slave_worker_info表:select * from mysql.slave_worker_info;
* 从库查看replication_applier_status_by_worker表:select * from performance_schema.replication_applier_status_by_worker;
* 从库查看replication_connection_status表:select * from performance_schema.replication_connection_status;
* 从库查看relay log是否被清理,解析relay log,查看所有relay log中当前的起始GTID
* 根据show slave status输出信息中的Master_Log_File、Read_Master_Log_Pos、Relay_Master_Log_File、Exec_Master_Log_Pos、Retrieved_Gtid_Set、Executed_Gtid_Set值解析主库的binlog,找到对应这些值的事务相关event信息
* 从库解析binlog file,查看show slave status输出中Executed_Gtid_Set值、Executed_Gtid_Set值相比Retrieved_Gtid_Set值缺失的GTID在从库的binlog中是否能找到
-
为了方便大家理解下文中将要提到的几个步骤中的冗长信息,这里将这些信息整理为三个脑图(这里紧列出关键信息,为了简洁,GTID去掉了UUID部分,只留下事务ID),有兴趣的可以跟着下文向下看详细信息,兴趣不大的可以从这里直接跳转到"解决方法验证"小节
-
执行stop slave语句时主从的相关详细信息
- 执行change master to master_auto_position=0语句时主从的相关详细信息
-
执行stop slave语句时主从的相关详细信息
- 执行start slave语句时主从的相关详细信息
2.1. 重新复现与详细分析
- sysbench压力根据步骤需要,可能暂停或者重新执行
- sysbench执行对主库加压
- 从库上执行show slave status\G语句查看复制状态
-
admin@localhost : (none) 12:19:20> show slave status\G;
-
*************************** 1. row ***************************
-
Slave_IO_State: Waiting for master to send event
-
......
-
Master_Log_File: mysql-bin.000038
-
Read_Master_Log_Pos: 19058431
-
Relay_Log_File: mysql-relay-bin.000006
-
Relay_Log_Pos: 11609958
-
Relay_Master_Log_File: mysql-bin.000038
-
Slave_IO_Running: Yes
-
Slave_SQL_Running: Yes
-
......
-
Exec_Master_Log_Pos: 11609745
-
......
-
Seconds_Behind_Master: 61
-
Master_SSL_Verify_Server_Cert: No
-
Last_IO_Errno: 0
-
Last_IO_Error:
-
Last_SQL_Errno: 0
-
Last_SQL_Error:
-
......
-
Slave_SQL_Running_State: Waiting for slave workers to process their queues
-
......
-
Last_IO_Error_Timestamp:
-
Last_SQL_Error_Timestamp:
-
......
-
Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-764951
-
Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756013
-
Auto_Position: 1
-
......
- 1 row in set (0.00 sec)
- 待到从库复制有一定延迟时,执行stop slave语句
-
admin@localhost : (none) 12:21:36> stop slave;
- Query OK, 0 rows affected (0.11 sec)
- 从库执行show slave status语句查看复制状态
-
admin@localhost : (none) 12:21:43> show slave status\G;
-
*************************** 1. row ***************************
-
Slave_IO_State:
-
......
-
Master_Log_File: mysql-bin.000038
-
Read_Master_Log_Pos: 19989790
-
Relay_Log_File: mysql-relay-bin.000006
-
Relay_Log_Pos: 12148076
-
Relay_Master_Log_File: mysql-bin.000038
-
Slave_IO_Running: No
-
Slave_SQL_Running: No
-
......
-
Exec_Master_Log_Pos: 12147863
-
......
-
Seconds_Behind_Master: NULL
-
Master_SSL_Verify_Server_Cert: No
-
Last_IO_Errno: 0
-
Last_IO_Error:
-
Last_SQL_Errno: 0
-
Last_SQL_Error:
-
......
-
Slave_SQL_Running_State:
-
......
-
Last_IO_Error_Timestamp:
-
Last_SQL_Error_Timestamp:
-
......
-
Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
-
Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655
-
Auto_Position: 1
-
......
- 1 row in set (0.00 sec)
- 现在,把对主库的sysbench进程终止
- 从库查询slave_master_info表
-
admin@localhost : (none) 12:21:46> select * from mysql.slave_master_info\G;
-
*************************** 1. row ***************************
-
Number_of_lines: 25
-
Master_log_name: mysql-bin.000038
-
Master_log_pos: 19989790
-
Host: 10.10.20.14
-
User_name: qfsys
-
User_password: letsg0
-
Port: 3306
-
Connect_retry: 60
-
Enabled_ssl: 0
-
Ssl_ca:
-
Ssl_capath:
-
Ssl_cert:
-
Ssl_cipher:
-
Ssl_key:
-
Ssl_verify_server_cert: 0
-
Heartbeat: 5
-
Bind:
-
Ignored_server_ids: 0
-
Uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
-
Retry_count: 86400
-
Ssl_crl:
-
Ssl_crlpath:
-
Enabled_auto_position: 1
-
Channel_name:
-
Tls_version:
- 1 row in set (0.00 sec)
- 从库查询slave_relay_log_info表
-
admin@localhost : (none) 12:25:57> select * from mysql.slave_relay_log_info\G;
-
*************************** 1. row ***************************
-
Number_of_lines: 7
-
Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
-
'# 与show slave status输出中的Relay_log_pos值相等'
-
Relay_log_pos: 12148076
-
Master_log_name: mysql-bin.000038
-
Master_log_pos: 12147863
-
Sql_delay: 0
-
Number_of_workers: 4
-
Id: 1
-
Channel_name:
- 1 row in set (0.00 sec)
- 从库查询slave_worker_info表
-
admin@localhost : (none) 12:26:23> select * from mysql.slave_worker_info\G;
-
*************************** 1. row ***************************
-
Id: 1
-
Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
-
Relay_log_pos: 12146410
-
Master_log_name: mysql-bin.000038
-
Master_log_pos: 12146197
-
Checkpoint_relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
-
Checkpoint_relay_log_pos: 12130583
-
Checkpoint_master_log_name: mysql-bin.000038
-
Checkpoint_master_log_pos: 12130370
-
Checkpoint_seqno: 18
-
Checkpoint_group_size: 64
-
Checkpoint_group_bitmap: HD
-
Channel_name:
-
*************************** 2. row ***************************
-
Id: 2
-
Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
-
Relay_log_pos: 12144744
-
Master_log_name: mysql-bin.000038
-
Master_log_pos: 12144531
-
Checkpoint_relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
-
Checkpoint_relay_log_pos: 12130583
-
Checkpoint_master_log_name: mysql-bin.000038
-
Checkpoint_master_log_pos: 12130370
-
Checkpoint_seqno: 16
-
Checkpoint_group_size: 64
- Checkpoint_group_bitmap:
- 从库查询replication_applier_status_by_worker表
-
admin@localhost : (none) 12:27:02> select * from performance_schema.replication_applier_status_by_worker\G;
-
*************************** 1. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 1
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756653
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
-
*************************** 2. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 2
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756651
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
-
*************************** 3. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 3
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756652
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
-
*************************** 4. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 4
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756654
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 4 rows in set (0.00 sec)
- 从库查询replication_connection_status表
-
admin@localhost : (none) 12:28:29> select * from performance_schema.replication_connection_status\G;
-
*************************** 1. row ***************************
-
CHANNEL_NAME:
-
GROUP_NAME:
-
SOURCE_UUID: b57c75c2-6205-11e7-8d9f-525400a4b2e1
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
COUNT_RECEIVED_HEARTBEATS: 10030
-
LAST_HEARTBEAT_TIMESTAMP: 2017-07-09 12:19:36
-
RECEIVED_TRANSACTION_SET: b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 1 row in set (0.00 sec)
- 查看gtid_executed表与show master status信息,从下面的结果中可以看到,gtid_executed表中记录的GTID事务号是741923,远小于show master status的最大GTID事务号756655,说明这个表中记录GTID信息并不是实时的
-
admin@localhost : (none) 02:01:49> select * from mysql.gtid_executed\G;
-
*************************** 1. row ***************************
-
source_uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
-
interval_start: 1
-
interval_end: 741923
-
1 row in set (0.01 sec)
-
admin@localhost : (none) 02:06:54> show master status;
-
+------------------+----------+--------------+------------------+-----------------------------------------------+
-
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
-
+------------------+----------+--------------+------------------+-----------------------------------------------+
-
| mysql-bin.000002 | 12074497 | | | b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655 |
-
+------------------+----------+--------------+------------------+-----------------------------------------------+
- 1 row in set (0.00 sec)
- 查看并解析从库的relay log,找到relay log中记录的GTID起始事务号
-
[root@luoxiaobo-02 ~]# cd /data/mysqldata1/relaylog/
-
[root@luoxiaobo-02 relaylog]# ll
-
总用量 19532
-
-rw-r-----. 1 mysql mysql 460 7月 8 22:15 mysql-relay-bin.000005
-
-rw-r-----. 1 mysql mysql 19990003 7月 9 12:21 mysql-relay-bin.000006
-
-rw-r-----. 1 mysql mysql 120 7月 8 22:15 mysql-relay-bin.index
-
'#mysql-relay-bin.000005中没有数据相关的GTID'
-
[root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000005 |grep -i GTID_NEXT
-
mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
-
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
-
'#解析mysql-relay-bin.000006结果先放到一个文件暂存'
-
[root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000006 > a.sql
-
[root@luoxiaobo-02 relaylog]# grep -i GTID_NEXT a.sql|head -3
-
mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
-
'#mysql-relay-bin.000006中的第一个GTID'
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:741926'/*!*/;
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:741927'/*!*/;
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:741928'/*!*/;
-
[root@luoxiaobo-02 relaylog]# grep -i GTID_NEXT a.sql|tail -3
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766069'/*!*/;
-
'#mysql-relay-bin.000006中的最后一个GTID'
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766070'/*!*/;
-
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
-
'# 对于最后一个GTID,我们需要在解析结果中把完整的event组内容贴出来,方便后续排查(这里也可以看到,relay log中最优一个事务只有GTID EVENT,没有其他的EVENT,'\
-
'# 说明这个事务记录的event组并不完整)'
-
......
-
COMMIT/*!*/;
-
# at 19989938
-
#170709 12:21:43 server id 330614 end_log_pos 19989790 CRC32 0xef47a7bb GTID last_committed=24138 sequence_number=24145
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766070'/*!*/;
-
BEGIN /*added by mysqlbinlog */ /*!*/;
-
ROLLBACK /* added by mysqlbinlog */ /*!*/;
-
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
-
DELIMITER ;
-
# End of log file
- ......
- 现在,到主库上解析stop slave之后,IO和SQL协调器线程停止的位置(mysql-bin.000038文件的pos:19989790和12147863)
-
'# 解析结果先暂存到一个文件'
-
[root@luoxiaobo-01 binlog]# mysqlbinlog -vv mysql-bin.000038 > a.sql
-
mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
-
[root@luoxiaobo-01 binlog]#
-
'# 打开这个文件,搜索19989790和12147863'
-
## pos:19989790
-
'#上一个事务的COMMIT语句'
-
COMMIT/*!*/;
-
# at 19989725
-
'# 从库的IO线程最后停止的pos对应的event在这里,这是一个事务event组中的第一个event'
-
#170709 12:21:43 server id 330614 end_log_pos 19989790 CRC32 0xef47a7bb GTID last_committed=24138 sequence_number=24145
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766070'/*!*/;
-
# at 19989790
-
'#这个事务缺失的后续event组'
-
#170709 12:21:43 server id 330614 end_log_pos 19989869 CRC32 0x15f97bf9 Query thread_id=8 exec_time=0 error_code=0
-
SET TIMESTAMP=1499574103/*!*/;
-
BEGIN
-
/*!*/;
-
# at 19989869
-
#170709 12:21:43 server id 330614 end_log_pos 19990052 CRC32 0x5dde7aa8 Rows_query
-
# UPDATE sbtest1 SET c='61636976773-16107654940-51623712437-75581944876-25176120731-37361704412-40193407500-44604516796-09703856563-85070507959' WHERE id=2507560
-
# at 19990052
-
#170709 12:21:43 server id 330614 end_log_pos 19990111 CRC32 0x119bffe4 Table_map: `sbtest`.`sbtest1` mapped to number 120
-
# at 19990111
-
#170709 12:21:43 server id 330614 end_log_pos 19990527 CRC32 0xcc9c086d Update_rows: table id 120 flags: STMT_END_F
-
BINLOG '
-
......
-
'/*!*/;
-
......
-
# at 19990527
-
#170709 12:21:43 server id 330614 end_log_pos 19990558 CRC32 0x542c9d8c Xid = 42809
-
'#标记这个事务结束'
-
COMMIT/*!*/;
-
......
-
## pos:12147863
-
'#上一个事务的COMMIT语句'
-
COMMIT/*!*/;
-
# at 12147030
-
#170709 12:20:38 server id 330614 end_log_pos 12147095 CRC32 0x741990e3 GTID last_committed=14722 sequence_number=14730
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:756655'/*!*/;
-
# at 12147095
-
#170709 12:20:38 server id 330614 end_log_pos 12147174 CRC32 0x15ecc69f Query thread_id=8 exec_time=0 error_code=0
-
SET TIMESTAMP=1499574038/*!*/;
-
BEGIN
-
/*!*/;
-
# at 12147174
-
#170709 12:20:38 server id 330614 end_log_pos 12147357 CRC32 0x54ed17d2 Rows_query
-
# UPDATE sbtest4 SET c='59725621137-97256297494-12005645169-65422620013-80006649645-18254436888-53122616738-78749990655-60632029435-76689256326' WHERE id=2204328
-
# at 12147357
-
#170709 12:20:38 server id 330614 end_log_pos 12147416 CRC32 0xc0fd34bd Table_map: `sbtest`.`sbtest4` mapped to number 123
-
# at 12147416
-
#170709 12:20:38 server id 330614 end_log_pos 12147832 CRC32 0x5c45d713 Update_rows: table id 123 flags: STMT_END_F
-
BINLOG
-
......
-
# at 12147832
-
'#从库的SQL协调器线程最后停止的pos对应的event在这里,这是一个事务的event组中的最后一个event'
-
#170709 12:20:38 server id 330614 end_log_pos 12147863 CRC32 0xd650cd60 Xid = 26692
-
'# 标记这个事务结束'
-
COMMIT/*!*/;
- ......
-
从上文中描述我们可以得知,复制突然停止时,IO线程处在一个事务开始的GTID EVENT位置,而SQL协调器线程处于一个事务结束的xid event位置,这也证实了Read_Master_Log_Pos是以event为单位进行更新的。
- 解析从库的最后一个binlog,查找最后一个事务的GTID是否与Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655 相对应,两者相等
-
[root@luoxiaobo-02 relaylog]# cd ../binlog/
-
[root@luoxiaobo-02 binlog]# ll
-
总用量 46196
-
-rw-r-----. 1 mysql mysql 35219969 7月 8 22:11 mysql-bin.000001
-
-rw-r-----. 1 mysql mysql 12074497 7月 9 12:21 mysql-bin.000002
-
-rw-r-----. 1 mysql mysql 104 7月 8 22:11 mysql-bin.index
-
[root@luoxiaobo-02 binlog]# mysqlbinlog -vv mysql-bin.000002 |grep -i GTID_NEXT |tail -3
-
mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:756654'/*!*/;
-
'# 从库的binlog中最后一个GTID 事务号是756655,与stop slave时show slave status输出的Executed_Gtid_Set中最大事务号相等'
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:756655'/*!*/;
- SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/
- 留意观察每一个步骤相对于stop slave步骤有什么变化?以下会在每一个发生变化的位置进行注解
- 从库执行change master to master_auto_position=0语句
-
admin@localhost : (none) 04:00:37> change master to master_auto_position=0;
- Query OK, 0 rows affected (0.05 sec)
- 从库执行show slave status语句查看复制状态
-
admin@localhost : (none) 04:02:54> show slave status\G;
-
*************************** 1. row ***************************
-
Slave_IO_State:
-
......
-
Master_Log_File: mysql-bin.000038
-
Read_Master_Log_Pos: 19989790
-
'# relay log文件名从之前的mysql-relay-bin.000006变为了mysql-relay-bin.000001'
-
Relay_Log_File: mysql-relay-bin.000001
-
'#relay log pos从之前的12148076变为了4'
-
Relay_Log_Pos: 4
-
Relay_Master_Log_File: mysql-bin.000038
-
Slave_IO_Running: No
-
Slave_SQL_Running: No
-
......
-
'# 从这里可以看到,SQL协调器线程的停止位置发生了变化,与之前stop slave;时IO线程停止位置相同了(从12147863变为了19989790)'
-
Exec_Master_Log_Pos: 19989790
-
......
-
Seconds_Behind_Master: NULL
-
Master_SSL_Verify_Server_Cert: No
-
Last_IO_Errno: 0
-
Last_IO_Error:
-
Last_SQL_Errno: 0
-
Last_SQL_Error:
-
......
-
Slave_SQL_Running_State:
-
......
-
Last_IO_Error_Timestamp:
-
Last_SQL_Error_Timestamp:
-
......
-
'#从这里可以看到,接收的GTID SET值被清空了!'
-
Retrieved_Gtid_Set:
-
Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655
-
Auto_Position: 0
-
......
- 1 row in set (0.00 sec)
- 从库查询slave_master_info表
-
admin@localhost : (none) 04:03:04> select * from mysql.slave_master_info\G;
-
*************************** 1. row ***************************
-
Number_of_lines: 25
-
Master_log_name: mysql-bin.000038
-
Master_log_pos: 19989790
-
Host: 10.10.20.14
-
User_name: qfsys
-
User_password: letsg0
-
Port: 3306
-
Connect_retry: 60
-
Enabled_ssl: 0
-
Ssl_ca:
-
Ssl_capath:
-
Ssl_cert:
-
Ssl_cipher:
-
Ssl_key:
-
Ssl_verify_server_cert: 0
-
Heartbeat: 5
-
Bind:
-
Ignored_server_ids: 0
-
Uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
-
Retry_count: 86400
-
Ssl_crl:
-
Ssl_crlpath:
-
'#从这里可以看到,从之前的1变为了0'
-
Enabled_auto_position: 0
-
Channel_name:
-
Tls_version:
- 1 row in set (0.00 sec)
- 从库查询slave_relay_log_info表
-
admin@localhost : (none) 04:07:18> select * from mysql.slave_relay_log_info\G;
-
*************************** 1. row ***************************
-
Number_of_lines: 7
-
'# relay log文件名从之前的mysql-relay-bin.000006变为了mysql-relay-bin.000001'
-
Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000001
-
'#relay log pos从之前的12148076变为了4'
-
Relay_log_pos: 4
-
Master_log_name: mysql-bin.000038
-
'# 从这里可以看到,SQL协调器线程的停止位置发生了变化,与之前stop slave;时IO线程停止位置相同了(从12147863变为了19989790)'
-
Master_log_pos: 19989790
-
Sql_delay: 0
-
'# 记录worker线程数量的值变为了0'
-
Number_of_workers: 0
-
Id: 1
-
Channel_name:
- 1 row in set (0.00 sec)
- 从库查询slave_worker_info表,从下面的结果中可以看到,worker线程信息被清空了
-
admin@localhost : (none) 04:11:07> select * from mysql.slave_worker_info\G;
- Empty set (0.00 sec)
- 从库查询replication_applier_status_by_worker表,从下面的信息中可以看到,这个表记录的信息没有任何变化
-
admin@localhost : (none) 04:20:17> select * from performance_schema.replication_applier_status_by_worker\G;
-
*************************** 1. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 1
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756653
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
-
*************************** 2. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 2
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756651
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
-
*************************** 3. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 3
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756652
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
-
*************************** 4. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 4
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756654
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 4 rows in set (0.00 sec)
- 从库查询replication_connection_status表
-
admin@localhost : (none) 04:21:04> select * from performance_schema.replication_connection_status\G;
-
*************************** 1. row ***************************
-
CHANNEL_NAME:
-
GROUP_NAME:
-
SOURCE_UUID: b57c75c2-6205-11e7-8d9f-525400a4b2e1
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
COUNT_RECEIVED_HEARTBEATS: 10030
-
LAST_HEARTBEAT_TIMESTAMP: 2017-07-09 12:19:36
-
' #从这里可以看到,之前记录的b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069 值,现在变为空了'
-
RECEIVED_TRANSACTION_SET:
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 1 row in set (0.01 sec)
- 查看gtid_executed表与show master status信息,从下面的结果中可以看到,与之前stop slave时查看到的信息相比,没有任何变化
-
admin@localhost : (none) 04:22:19> select * from mysql.gtid_executed\G;
-
*************************** 1. row ***************************
-
source_uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
-
interval_start: 1
-
interval_end: 741923
-
1 row in set (0.00 sec)
-
admin@localhost : (none) 04:23:56> show master status;
-
+------------------+----------+--------------+------------------+-----------------------------------------------+
-
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
-
+------------------+----------+--------------+------------------+-----------------------------------------------+
-
| mysql-bin.000002 | 12074497 | | | b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655 |
-
+------------------+----------+--------------+------------------+-----------------------------------------------+
- 1 row in set (0.00 sec)
- 查看并解析从库的relay log,找到relay log中记录的GTID起始事务号
-
[root@luoxiaobo-02 ~]# cd /data/mysqldata1/relaylog/
-
[root@luoxiaobo-02 relaylog]# ll
-
总用量 65424
-
-rw-r--r--. 1 root root 66984802 7月 9 13:01 a.sql
-
'#这里可以看到,relay log文件有变化,000002编号的文件不见了'
-
-rw-r-----. 1 mysql mysql 194 7月 9 16:02 mysql-relay-bin.000001
-
-rw-r-----. 1 mysql mysql 60 7月 9 16:02 mysql-relay-bin.index
-
[root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000001
-
mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
-
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
-
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
-
DELIMITER /*!*/;
-
# at 4
-
#170709 16:02:54 server id 330615 end_log_pos 123 CRC32 0xa0b99fea Start: binlog v 4, server v 5.7.18-log created 170709 16:02:54 at startup
-
# This Format_description_event appears in a relay log and was generated by the slave thread.
-
# at 123
-
#170709 16:02:54 server id 330615 end_log_pos 194 CRC32 0x2c2d5d7f Previous-GTIDs
-
'#从这里可以看到,这个Previous-GTIDs event中记录的GTID SET就是之前show slave status中的Retrieved_Gtid_Set值'
-
# b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
-
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
-
DELIMITER ;
-
# End of log file
-
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
- /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/
-
从上面的内容中可以看到,之前relay log中的日志全部被清理了,当前mysql-relay-bin.000001中记录的有效数据除了Previous-GTIDs event中记录的GTID SET之外,没有其他有用数据,而且这个GTID SET是使用之前show slave status输出中的Retrieved_Gtid_Set值填充的
-
主库中change master to master_auto_position=0之后的binlog这里就不需要去解析了,因为现在IO和SQL协调器线程停止的位置相同了(mysql-bin.000038文件的pos:19989790)
-
从库中change master to master_auto_position=0之后的binlog这里也不需要去解析了,因为现在还没有启动复制,从库中的数据理论上是不会发生变化的
-
通过以上发生变化的信息我们可以得出,change master to master_auto_position=0语句执行了relay log的清理,slave_worker_info表中worker线程的信息清理,SQL协调器线程位置的重置(设置为stop slave;时IO线程的位置,并更新了slave_relay_log_info表和内存中的SQL线程和relay log位置信息),更新了replication_connection_status表的RECEIVED_TRANSACTION_SET字段值为空、slave_master_info表的Enabled_auto_position字段值为0
- 留意观察每一个步骤相对于change master to master_auto_position=0步骤有什么变化?以下会在每一个发生变化的位置进行注解
- 从库执行start slave语句
-
admin@localhost : (none) 05:09:02> start slave;
- Query OK, 0 rows affected (0.01 sec)
- 从库执行show slave status语句查看复制状态
-
admin@localhost : (none) 05:09:06> show slave status\G;
-
*************************** 1. row ***************************
-
Slave_IO_State: Waiting for master to send event
-
......
-
Master_Log_File: mysql-bin.000038
-
'# IO线程正常运行,该位置持续正常变化'
-
Read_Master_Log_Pos: 40855542
-
'# relay log文件名从之前的mysql-relay-bin.000001变为了mysql-relay-bin.000002,说明新生成了一个relay log文件'
-
Relay_Log_File: mysql-relay-bin.000002
-
'# IO线程正常运行时,该位置持续正常变化'
-
Relay_Log_Pos: 320
-
Relay_Master_Log_File: mysql-bin.000038
-
'#IO线程运行状态正常'
-
Slave_IO_Running: Yes
-
'#SQL线程未运行'
-
Slave_SQL_Running: No
-
......
-
'# SQL线程位置未发生变化'
-
Exec_Master_Log_Pos: 19989790
-
......
-
Seconds_Behind_Master: NULL
-
Master_SSL_Verify_Server_Cert: No
-
Last_IO_Errno: 0
-
Last_IO_Error:
-
Last_SQL_Errno: 1782
-
'# SQL线程应用relay log时发生错误,错误binlog pos发生在master log mysql-bin.000038, end_log_pos 19989869'
-
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'NOT_YET_DETERMINED' at master \
-
log mysql-bin.000038, end_log_pos 19989869. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
-
......
-
Slave_SQL_Running_State:
-
......
-
Last_IO_Error_Timestamp:
-
'#SQL线程发生错误的时间'
-
Last_SQL_Error_Timestamp: 170709 17:09:06
-
......
-
'# 接收到的GTID SET重新生成了'
-
Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:766071-791118
-
'# 由于SQL线程报错终止,Executed_Gtid_Set位置未发生变化,说明SQL线程在复制启动时未应用任何事务就立即报错终止了'
-
Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655
-
Auto_Position: 0
-
......
- 1 row in set (0.00 sec)
- 现在,把对主库的sysbench进程终止
- 由于IO线程运行正常,相对于主库的binlog pos会持续发生正常变化,不需要查看从库的slave_master_info表
- 从库查询slave_relay_log_info表,从下面的信息中可以可以看到,除了Number_of_workers字段发生变化之外,其他字段没有发生变化
-
admin@localhost : (none) 05:22:12> select * from mysql.slave_relay_log_info\G;
-
*************************** 1. row ***************************
-
Number_of_lines: 7
-
Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000001
-
Relay_log_pos: 4
-
Master_log_name: mysql-bin.000038
-
Master_log_pos: 19989790
-
Sql_delay: 0
-
'# worker线程数量值变为4'
-
Number_of_workers: 4
-
Id: 1
-
Channel_name:
- 1 row in set (0.00 sec)
- 从库查询slave_worker_info表,从下面的信息中可以看到,slave_worker_info中有线程信息了,但是缺没有任何有用的信息,show processlist可以看到SQL线程全部停止了
-
admin@localhost : (none) 05:24:31> select * from mysql.slave_worker_info\G;
-
*************************** 1. row ***************************
-
Id: 1
-
Relay_log_name:
-
Relay_log_pos: 0
-
Master_log_name:
-
Master_log_pos: 0
-
Checkpoint_relay_log_name:
-
Checkpoint_relay_log_pos: 0
-
Checkpoint_master_log_name:
-
Checkpoint_master_log_pos: 0
-
Checkpoint_seqno: 0
-
Checkpoint_group_size: 64
-
Checkpoint_group_bitmap:
-
Channel_name:
-
*************************** 2. row ***************************
-
Id: 2
-
Relay_log_name:
-
Relay_log_pos: 0
-
Master_log_name:
-
Master_log_pos: 0
-
Checkpoint_relay_log_name:
-
Checkpoint_relay_log_pos: 0
-
Checkpoint_master_log_name:
-
Checkpoint_master_log_pos: 0
-
Checkpoint_seqno: 0
-
Checkpoint_group_size: 64
-
Checkpoint_group_bitmap:
-
Channel_name:
-
*************************** 3. row ***************************
-
Id: 3
-
Relay_log_name:
-
Relay_log_pos: 0
-
Master_log_name:
-
Master_log_pos: 0
-
Checkpoint_relay_log_name:
-
Checkpoint_relay_log_pos: 0
-
Checkpoint_master_log_name:
-
Checkpoint_master_log_pos: 0
-
Checkpoint_seqno: 0
-
Checkpoint_group_size: 64
-
Checkpoint_group_bitmap:
-
Channel_name:
-
*************************** 4. row ***************************
-
Id: 4
-
Relay_log_name:
-
Relay_log_pos: 0
-
Master_log_name:
-
Master_log_pos: 0
-
Checkpoint_relay_log_name:
-
Checkpoint_relay_log_pos: 0
-
Checkpoint_master_log_name:
-
Checkpoint_master_log_pos: 0
-
Checkpoint_seqno: 0
-
Checkpoint_group_size: 64
-
Checkpoint_group_bitmap:
-
Channel_name:
-
4 rows in set (0.00 sec)
-
ERROR:
-
No query specified
-
admin@localhost : (none) 05:26:57> show processlist;
-
+----+-------------+-----------+------+---------+------+----------------------------------+------------------+
-
| Id | User | Host | db | Command | Time | State | Info |
-
+----+-------------+-----------+------+---------+------+----------------------------------+------------------+
-
| 13 | admin | localhost | NULL | Query | 0 | starting | show processlist |
-
| 14 | system user | | NULL | Connect | 1083 | Waiting for master to send event | NULL |
-
+----+-------------+-----------+------+---------+------+----------------------------------+------------------+
- 2 rows in set (0.01 sec)
- 从库查询replication_applier_status_by_worker表
-
admin@localhost : (none) 05:27:09> select * from performance_schema.replication_applier_status_by_worker\G;
-
*************************** 1. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 1
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
'#该字段变为空值'
-
LAST_SEEN_TRANSACTION:
-
'# 这里可以看到,WORKER_ID=1的线程的报错的错误代码和错误详细信息,通过前面show slave status输出Executed_Gtid_Set的值未发生变化,可以推理出在start slave;启动复制时,'\
-
'# 第一个worker线程分配到的就是stop slave时IO线程读取不完整的事务,所以,start slave;时SQL线程应用第一个事务就报错停止了'
-
LAST_ERROR_NUMBER: 1782
-
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'NOT_YET_DETERMINED' at master log mysql-bin.000038, end_log_pos 19989869; Error '@@SESSION.GTID_NEXT \
-
cannot be set to ANONYMOUS when @@GLOBAL.GTID_MODE = ON.' on query. Default database: 'sbtest'. Query: 'BEGIN'
-
LAST_ERROR_TIMESTAMP: 2017-07-09 17:09:06
-
*************************** 2. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 2
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
'# 该字段变为空值'
-
LAST_SEEN_TRANSACTION:
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
-
*************************** 3. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 3
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
'# 该字段变为空值'
-
LAST_SEEN_TRANSACTION:
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
-
*************************** 4. row ***************************
-
CHANNEL_NAME:
-
WORKER_ID: 4
-
THREAD_ID: NULL
-
SERVICE_STATE: OFF
-
'# 该字段变为空值'
-
LAST_SEEN_TRANSACTION:
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 4 rows in set (0.00 sec)
- 从库查询replication_connection_status表
-
admin@localhost : (none) 06:00:08> select * from performance_schema.replication_connection_status\G;
-
*************************** 1. row ***************************
-
CHANNEL_NAME:
-
GROUP_NAME:
-
SOURCE_UUID: b57c75c2-6205-11e7-8d9f-525400a4b2e1
-
THREAD_ID: 56
-
SERVICE_STATE: ON
-
COUNT_RECEIVED_HEARTBEATS: 10703
-
LAST_HEARTBEAT_TIMESTAMP: 2017-07-09 18:05:13
-
' #该字段值变为重新生成的Retrieved_Gtid_Set值'
-
RECEIVED_TRANSACTION_SET: b57c75c2-6205-11e7-8d9f-525400a4b2e1:766071-791118
-
LAST_ERROR_NUMBER: 0
-
LAST_ERROR_MESSAGE:
-
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 1 row in set (0.00 sec)
-
查看gtid_executed表与show master status信息,由于SQL线程一启动就报错,所以理论上不会发生任何变化,不需要查看
- 由于IO线程正常运行,relay log会持续正常发生变化,解析relay log时只需要查看起始位置的日志即可
-
[root@luoxiaobo-02 ~]# cd /data/mysqldata1/relaylog/
-
[root@luoxiaobo-02 relaylog]# ll
-
总用量 93968
-
-rw-r--r--. 1 root root 66984802 7月 9 13:01 a.sql
-
-rw-r-----. 1 mysql mysql 247 7月 9 17:09 mysql-relay-bin.000001
-
-rw-r-----. 1 mysql mysql 23258448 7月 9 18:10 mysql-relay-bin.000002
-
-rw-r-----. 1 mysql mysql 120 7月 9 17:09 mysql-relay-bin.index
-
'#mysql-relay-bin.000001中没有数据相关的GTID,只有一个Previous-GTIDs event记录GTID SET事务号为741924-766069'
-
[root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000001
-
mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
-
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
-
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
-
DELIMITER /*!*/;
-
# at 4
-
#170709 16:02:54 server id 330615 end_log_pos 123 CRC32 0xa0b99fea Start: binlog v 4, server v 5.7.18-log created 170709 16:02:54 at startup
-
# This Format_description_event appears in a relay log and was generated by the slave thread.
-
# at 123
-
#170709 16:02:54 server id 330615 end_log_pos 194 CRC32 0x2c2d5d7f Previous-GTIDs
-
# b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
-
# at 194
-
#170709 17:09:06 server id 330615 end_log_pos 247 CRC32 0x1584ebb7 Rotate to mysql-relay-bin.000002 pos: 4
-
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
-
DELIMITER ;
-
# End of log file
-
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
-
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
-
'# mysql-relay-bin.000002解析结果先放到a.sql文件中暂存'
-
[root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000002 > a.sql
-
mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
-
'# vim打开这个文件,查看起始位置的日志数据'
-
# at 123
-
'# 从这里可以看到,从第一次复现时一样,Previous-GTIDs EVENT记录了一个"[empty]"而并不是一个GTID SET'
-
#170709 17:09:06 server id 330615 end_log_pos 154 CRC32 0xa1000bf3 Previous-GTIDs
-
# [empty]
-
# at 154
-
'#这里就是SQL协调器线程报错终止的位置'
-
#700101 8:00:00 server id 330614 end_log_pos 0 CRC32 0xfff429de Rotate to mysql-bin.000038 pos: 19989790
-
# at 201
-
#170708 22:14:19 server id 330614 end_log_pos 0 CRC32 0x40b1240b Start: binlog v 4, server v 5.7.18-log created 170708 22:14:19
-
BINLOG '
-
......
-
'/*!*/;
-
# at 320
-
' # 这里就是worker线程报错终止的位置'
-
#170709 12:21:43 server id 330614 end_log_pos 19989869 CRC32 0x15f97bf9 Query thread_id=8 exec_time=0 error_code=0
-
SET TIMESTAMP=1499574103/*!*/;
-
......
-
BEGIN
-
/*!*/;
-
# at 399
-
#170709 12:21:43 server id 330614 end_log_pos 19990052 CRC32 0x5dde7aa8 Rows_query
-
# UPDATE sbtest1 SET c='61636976773-16107654940-51623712437-75581944876-25176120731-37361704412-40193407500-44604516796-09703856563-85070507959' WHERE id=2507560
-
# at 582
-
#170709 12:21:43 server id 330614 end_log_pos 19990111 CRC32 0x119bffe4 Table_map: `sbtest`.`sbtest1` mapped to number 120
-
# at 641
-
#170709 12:21:43 server id 330614 end_log_pos 19990527 CRC32 0xcc9c086d Update_rows: table id 120 flags: STMT_END_F
-
BINLOG
-
......
-
# at 1057
-
#170709 12:21:43 server id 330614 end_log_pos 19990558 CRC32 0x542c9d8c Xid = 42809
-
COMMIT/*!*/;
-
# at 1088
-
#170709 12:21:43 server id 330614 end_log_pos 19990623 CRC32 0x7fb6a8d9 GTID last_committed=24138 sequence_number=24146
-
'#从这里可以发现,relay
log中第一个可见的GTID是事务号是766071,与show slave
status输出的Retrieved_Gtid_Set值起始GTID事务号相同,所以可以推理出Retrieved_Gtid_Set值就是从这里开始计算起始值的'
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766071'/*!*/;
- ......
- 现在,到主库上解析start slave之后,SQL协调器线程和worker线程停止的位置(mysql-bin.000038文件的pos:19989790和19989869)
-
'# 由于主库的binlog不会受到从库的影响,所以这里直接把上一次解析主库binlog的相关内容重新贴出来'
-
COMMIT/*!*/; #上一个事务的COMMIT语句
-
# at 19989725
-
'# 从库的SQL协调器线程最后停止的pos对应的event在这里,这是一个事务event组中的第一个event'
-
#170709 12:21:43 server id 330614 end_log_pos 19989790 CRC32 0xef47a7bb GTID last_committed=24138 sequence_number=24145
-
SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766070'/*!*/;
-
# at 19989790
-
'#这里是worker线程停止的位置'
-
#170709 12:21:43 server id 330614 end_log_pos 19989869 CRC32 0x15f97bf9 Query thread_id=8 exec_time=0 error_code=0
-
SET TIMESTAMP=1499574103/*!*/;
-
BEGIN
-
/*!*/;
-
'# at 19989869 # 19989869 位置也是下一个event开始的位置,从上文中的信息可以得知,worker线程由于丢失前面的GTID EVENT,start slave;时对于这个事务读取到的第一个even't\
-
'# 就是这个Rows_query event,没有读取到GTID EVENT,所以被误认为是匿名事务,而在gtid_mode系统参数为ON时,不允许读取到匿名事务,所以报错终止了'
-
#170709 12:21:43 server id 330614 end_log_pos 19990052 CRC32 0x5dde7aa8 Rows_query
-
# UPDATE sbtest1 SET c='61636976773-16107654940-51623712437-75581944876-25176120731-37361704412-40193407500-44604516796-09703856563-85070507959' WHERE id=2507560
-
# at 19990052
-
#170709 12:21:43 server id 330614 end_log_pos 19990111 CRC32 0x119bffe4 Table_map: `sbtest`.`sbtest1` mapped to number 120
-
# at 19990111
-
#170709 12:21:43 server id 330614 end_log_pos 19990527 CRC32 0xcc9c086d Update_rows: table id 120 flags: STMT_END_F
-
BINLOG '
-
......
-
'/*!*/;
-
......
-
# at 19990527
-
#170709 12:21:43 server id 330614 end_log_pos 19990558 CRC32 0x542c9d8c Xid = 42809
-
'#标记这个事务结束'
-
COMMIT/*!*/;
- ......
-
从上文中描述我们可以得知,复制突然报错时,SQL协调器线程的停止位置并不是一个事务结束的xid event位置(与stop slave时正常停止不一样,正常停止时SQL协调器线程停止的位置是一个事务的xid event的位置,说明正常停止时,SQL线程会继续做正在执行的事务才停止),而是一个事务开始的GTID EVENT位置,这也说明了Exec_Master_Log_Pos是以event为单位进行更新的。
-
由于start slave启动复制时,SQL线程就立即报错终止,所以理论上从库数据不会发生变化,不需要解析从库的binlog
2.2. 解决方法验证
- 现在,通过上述分析过程中的详细信息,我们知道了引起报错的起因是由于主库持续写入的情况下,stop slave;导致IO线程只写入了一个事务的部分event到relay log中,然后在change master语句时mysql server清理了relay log,SQL线程被更新为IO线程停止的位置,导致start slave时SQL应用了一个event不完整的事务,所以报错了,那么,IO线程停止的位置我们不可控,SQL线程我们通过上文中的长篇解说已经知道,stop slave;的时候停止位置一定是最近一次提交事务的binlog pos,so..解决方法就是:change master语句的时候乖乖按照官方文档的要求指定master_log_file和master_log_pos选项(这样在change master语句时,IO线程就会以SQL线程的位置为准),如果你不想relay log被清理,那么就同时指定relay_log_file和relay_log_pos选项。
- 现在我们在从库重新执行stop slave;change master to master_auto_position=1;start slave; 恢复复制
-
admin@localhost : (none) 06:57:51> stop slave;change master to master_auto_position=1;start slave;
-
Query OK, 0 rows affected (0.00 sec)
-
Query OK, 0 rows affected (0.02 sec)
- Query OK, 0 rows affected (0.02 sec)
- show slave status看看复制是否恢复正常
-
admin@localhost : (none) 06:59:12> show slave status\G;
-
*************************** 1. row ***************************
-
Slave_IO_State: Waiting for master to send event
-
......
-
Master_Log_File: mysql-bin.000038
-
Read_Master_Log_Pos: 44331651
-
Relay_Log_File: mysql-relay-bin.000002
-
Relay_Log_Pos: 32184202
-
Relay_Master_Log_File: mysql-bin.000038
-
Slave_IO_Running: Yes
-
Slave_SQL_Running: Yes
-
......
-
Exec_Master_Log_Pos: 44331651
-
......
-
Seconds_Behind_Master: 0
-
Master_SSL_Verify_Server_Cert: No
-
Last_IO_Errno: 0
-
Last_IO_Error:
-
Last_SQL_Errno: 0
-
Last_SQL_Error:
-
......
-
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
-
......
-
Last_IO_Error_Timestamp:
-
Last_SQL_Error_Timestamp:
-
......
-
Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756656-795291
-
Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-795291
-
Auto_Position: 1
-
......
- 1 row in set (0.00 sec)
- 从上面的结果中可以看到,复制已恢复正常,现在在主库上使用pt-table-checksum校验主从数据一致性
-
[root@luoxiaobo-01 ~]# pt-table-checksum --nocheck-replication-filters --no-check-binlog-format --replicate=xiaoboluo.checksums --databases sbtest h=localhost,u=admin,p=letsg0,P=3306
-
.....
-
'# 如果DIFFS列出现有不为0值时,就表示主备数据不一致,可以执行如下语句执行同步,同步之后再次执行pt-table-checksum校验直到DIFFS列不会出现0值为止'
- pt-table-sync --replicate=xiaoboluo.checksums --databases sbtest h=localhost,u=admin,p=letsg0,P=3306 --execute
-
启动sysbench压测进程
- 从库等到复制有一些延迟时,执行top slave;
-
admin@localhost : (none) 07:15:14> stop slave;
- Query OK, 0 rows affected (0.04 sec)
- 从库执行一次show slave status并查看Relay_Master_Log_File、Exec_Master_Log_Pos、Relay_Log_File、Relay_Log_Pos选项值
-
admin@localhost : (none) 07:15:19> show slave status\G;
-
*************************** 1. row ***************************
-
Slave_IO_State:
-
......
-
Master_Log_File: mysql-bin.000038
-
Read_Master_Log_Pos: 52989383
-
Relay_Log_File: mysql-relay-bin.000002
-
Relay_Log_Pos: 36434532
-
Relay_Master_Log_File: mysql-bin.000038
-
Slave_IO_Running: No
-
Slave_SQL_Running: No
-
......
-
Exec_Master_Log_Pos: 48581981
-
......
-
Seconds_Behind_Master: NULL
-
Master_SSL_Verify_Server_Cert: No
-
Last_IO_Errno: 0
-
Last_IO_Error:
-
Last_SQL_Errno: 0
-
Last_SQL_Error:
-
......
-
Last_IO_Error_Timestamp:
-
Last_SQL_Error_Timestamp:
-
......
-
Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756656-805843
-
Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-800552
-
Auto_Position: 1
-
......
- 1 row in set (0.00 sec)
- 从库执行change master语句切换master_auto_position=0,带上前面获取的四个选项值
-
admin@localhost : (none) 07:15:23> change master to master_auto_position=0,master_log_file='mysql-bin.000038',master_log_pos=48581981,relay_log_file='mysql-relay-bin.000002',\
-
relay_log_pos=36434532;
- Query OK, 0 rows affected (0.00 sec)
- 从库查看一下relay log,从下面的结果中可以看到relay log并没有被清理
-
[root@luoxiaobo-02 binlog]# cd ../relaylog/
-
[root@luoxiaobo-02 relaylog]# ll
-
总用量 144792
-
-rw-r--r--. 1 root root 82262702 7月 9 18:14 a.sql
-
-rw-r-----. 1 mysql mysql 247 7月 9 18:59 mysql-relay-bin.000001
-
-rw-r-----. 1 mysql mysql 40841934 7月 9 19:15 mysql-relay-bin.000002
- -rw-r-----. 1 mysql mysql 120 7月 9 18:59 mysql-relay-bin.index
- 启动复制
-
admin@localhost : (none) 07:17:44> start slave;
- Query OK, 0 rows affected (0.01 sec)
- 查看复制状态,从下面的结果中可以看到,复制一切正常
-
admin@localhost : (none) 07:18:43> show slave status\G;
-
*************************** 1. row ***************************
-
Slave_IO_State: Queueing master event to the relay log
-
......
-
Master_Log_File: mysql-bin.000038
-
Read_Master_Log_Pos: 104115141
-
Relay_Log_File: mysql-relay-bin.000002
-
Relay_Log_Pos: 36931000
-
Relay_Master_Log_File: mysql-bin.000038
-
Slave_IO_Running: Yes
-
Slave_SQL_Running: Yes
-
......
-
Exec_Master_Log_Pos: 49078449
-
......
-
Seconds_Behind_Master: 221
-
Master_SSL_Verify_Server_Cert: No
-
Last_IO_Errno: 0
-
Last_IO_Error:
-
Last_SQL_Errno: 0
-
Last_SQL_Error:
-
......
-
Slave_SQL_Running_State: Waiting for dependent transaction to commit
-
......
-
Last_IO_Error_Timestamp:
-
Last_SQL_Error_Timestamp:
-
......
-
Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756656-867218
-
Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-801145
-
Auto_Position: 0
-
......
- 1 row in set (0.14 sec)
-
现在,停止sysbench进程
- 等到复制无延迟时,再次使用pt-table-checksum再校验一把(校验结果省略,推理上这里不会出现数据不一致)
-
[root@luoxiaobo-01 ~]# pt-table-checksum --nocheck-replication-filters --no-check-binlog-format --replicate=xiaoboluo.checksums --databases sbtest h=localhost,u=admin,p=letsg0,P=3306
- ......
3、总结
-
从上文中冗长的分析过程,我们可以得知,导致出问题的起因是主库持续写入的情况下,从库执行stop
slave;时,IO线程读取并只向relay log中写入了一个事务的部分event(event组不完整,只写了GTID
EVENT),后续change master to
master_auto_position=0语句既没有指定请求主库的binlog位置,也没有指定从库的relay log,导致mysql
server执行了清理relay log(清理了IO线程最后写入的GTID EVENT)以及 relay
log和SQL线程相关的表和内存中的位置信息,并把SQL线程的位置指向了IO线程的停止位置(IO线程读取的主库binlog file and
pos),后续start
slave;时从库就从IO线程停止的位置开始复制,由于IO线程停止时的位置处在一个事务的中间,SQL线程从这个位置开始应用时,立即发现了这个事务缺少GTID
event,被误认为是一个
匿名事务(gtid_mode系统参数设置为ON时,IO线程和SQL线程都不允许碰到匿名事务),所以SQL协调器线程和worker线程就都报错终止了。
- 这个问题的解决办法就是:执行change master to master_auto_position=0语句的时候,乖乖按照官方文档要求,指定master_log_file和master_log_pos选项,如果你不想relay log被清理,那么同时指定relay_log_file或relay_log_pos选项,如:change master to master_auto_position=0,master_log_file=Relay_Master_Log_File,master_log_pos=Exec_Master_Log_Pos,relay_log_file=Relay_Log_File,relay_log_pos=Relay_Log_Pos;
-
对于复制模式从master_auto_position=1切换为master_auto_position=0这种操作,个人有以下几点看法:
- change master to master_auto_position=0;不带master_log_file和master_log_pos是非常不规范的操作,官方文档中也明确说明了需要指定master_log_file和master_log_pos选项,因为master_auto_position=0时即停止使用基于GTID的复制协议,启用基于文件的复制协议,这个时候如果复制有延迟,就算mysql server没有自动清理relay log,也有可能被人为清理掉,而一旦被清理且start slave;之后,IO线程继续从上次stop slave;的位置开始请求主库的binlog,SQL线程也只能从relay log的起始位置开始重放,从而报错、丢失数据的问题就接踵而来。
- 你真的有不得不切换到master_auto_position=0的需求吗?还是一个伪需求呢?如果没有特殊的不得不切换为0的需求,建议在5.7版本中尽量使用master_auto_position=1
-
注意:
- 使用change master to master_auto_position=0不指定SQL线程位置和relay log文件时,start slave;不一定会报错,如果IO线程读取主库binlog时没有发生部分读,可能就不会发生报错这个问题(但此时可能会引发从库丢失数据的问题)。如果你想复现这个问题,你可以把master_auto_position恢复为1启动复制之后,再次切换为0(注意,切换期间,sysbench压力需要持续进行)
-
回顾与思考(卖个关子)
-
根据官方描述,change master
…语句不带relay_log_file和relay_log_pos选项都会导致relay
log被清理,那么,在GTID复制架构下,执行master_auto_position=1时,会不会有报错和数据丢失的情况发生呢?为什么?
* 不会,因为master_auto_position=1时,在执行start slave;语句时,IO线程向主库请求的GTID位置是根据公式"UNION(@@global.gtid_executed, Retrieved_gtid_set - last_received_GTID)"自动计算的,所以无论relay log是否清理,起始GTID位置使用包含了从库当前数据的gtid_executed位置。当然,要保证start slave;时SQL线程正常拉起,前提是上次SQL线程是正常停止的 - GTID复制架构下,如果stop slave时,IO线程正好停止在一个事务的xid event位置,然后执行change master to master_auto_position=0;start slave;语句会发生什么情况?(温馨提示:尤其需要留意这种情况,本文不做演示,请自行尝试)
* start slave;之后,IO线程会从上次停止的位置继续请求主库binlog(新生成的relay log中这个时候就不存在读取到一个事务的部分event),而SQL线程也可以继续正常执行IO线程新生成的relay log。注意:这个时候从库就发生数据丢失了,丢失的就是执行change master to master_auto_position=0;语句清理的那部分relay log
* 后续再切换为master_auto_position=1,虽然从库会把缺失的事务重新从向主库请求,也会在从库进行重放,但是。。这个时候,事务的提交顺序乱了,可能导致UPDATE、DELETE、INSERT顺序错乱,从而导致虽然数据行数主从库一致,但是行内数据可能并不一致,如果你运气好,颠倒的事务顺序之后有数据发生冲突,那么SQL线程就会报错终止,如果运气不好,顺序颠倒之后数据没有发生冲突,那么如果你不使用pt工具校验一下主备数据一致性,你可能就很难发现到底是哪里出了问题。当应用反馈说从库查询数据有异常时,你也很可能一脸懵逼! -
根据官方描述,change master
…语句不带relay_log_file和relay_log_pos选项都会导致relay
log被清理,那么,在GTID复制架构下,执行master_auto_position=1时,会不会有报错和数据丢失的情况发生呢?为什么?
-
全文参考资料
- https://dev.mysql.com/doc/refman/5.7/en/change-master-to.html
- https://dev.mysql.com/doc/refman/5.7/en/replication-features-transaction-inconsistencies.html
- https://dev.mysql.com/doc/refman/5.7/en/show-slave-status.html
- https://dev.mysql.com/doc/refman/5.7/en/start-slave.html
- https://dev.mysql.com/doc/refman/5.7/en/stop-slave.html
- https://dev.mysql.com/doc/refman/5.7/en/replication-mode-change-online-disable-gtids.html
- https://dev.mysql.com/doc/refman/5.7/en/replication-applier-status-by-worker-table.html
- https://dev.mysql.com/doc/refman/5.7/en/replication-connection-status-table.html
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/28218939/viewspace-2142235/

浙公网安备 33010602011771号