MySQL Server-id踩到的坑

最近踩到一个说大不大，说小不小的坑，在此分享出来给各位同学。事情是这样的，线上有2台服务器，1主1从。A -> B，B服务器从A服务器同步数据。每天使用xtrabackup在B服务器上面进行全备。某天A服务器挂了，后来由于某种原因无法进入系统了，只有重装了系统，那么此时要恢复A服务器的步骤就是在A服务器部署mysql实例，从B服务器上面拿备份恢复到A，再根据POS点change到B服务器，让A服务器从B服务器同步。此时是B -> A。相信熟悉MySQL的人都知道步骤是没有问题的。

但在这过程中还是出问题了，在A服务器从新从B服务器同步完成以后，确认没有延时以后，此时把A重新恢复成了原来的角色，也就是主库，架构又变回了A -> B。恢复完成以后询问开发说没有异常。到第二天的时候有玩家反馈数据不正确。此时进行数据差异的查找。最后发现A的数据比B的数据少。在经过几番查找以及回忆操作步骤以后，发现踩了大坑。那就是我们安装mysql实例的时候，server-id是根据服务器ip地址的后2位生成的，比如ip地址是：192.168.5.6，那么server-id就是56。

有同学会问了，和server-id有毛关系啊。大家仔细想想mysql的双主是怎样确定binlog是否需要应用的？没错，那就是server-id，如果server-id是自己的就不再应用binlog，那么我踩的坑就是当A再次重新向B同步的时候，A的server-id还是老的，没有修改，B服务器的binlog里面记录的server-id就是A服务器的server-id，最后导致有一部分binlog没有应用。原理已经说明了，那么接下来进行简单的实验就可以论证了。

环境：

自己搭建一个测试环境，简单的1主1从。

我主库的server-id是

mysql> show variables like '%server_id%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 25152 |
+---------------+-------+
1 row in set (0.00 sec)

从库的server-id是

mysql> show variables like '%server_id%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 25250 |
+---------------+-------+
1 row in set (0.00 sec)

主库建库，建表，插入数据：

mysql> create database yayun;
Query OK, 1 row affected (0.00 sec)

mysql> create table yayun.tb1 ( id int, age int, name char(20), primary key(id) );  
Query OK, 0 rows affected (0.07 sec)

mysql> use yayun
Database changed
mysql> insert into tb1 (id,age,name)values(1,18,'aa');
Query OK, 1 row affected (0.00 sec)

mysql> insert into tb1 (id,age,name)values(2,18,'bb');
Query OK, 1 row affected (0.00 sec)

mysql> select * from tb1;
+----+------+------+
| id | age  | name |
+----+------+------+
|  1 |   18 | aa   |
|  2 |   18 | bb   |
+----+------+------+
2 rows in set (0.00 sec)

mysql>

从库查询：

mysql> select * from tb1;
+----+------+------+
| id | age  | name |
+----+------+------+
|  1 |   18 | aa   |
|  2 |   18 | bb   |
+----+------+------+
2 rows in set (0.00 sec)

mysql>

此时数据是一致的。

接下来在从库备份数据，并且记录pos点。（这里模拟的是从库每天进行的备份）

mysqldump -uroot -p --master-data=2 yayun > /tmp/backup_yayun.sql

下面在主库继续进行insert，update操作。

mysql> insert into tb1 (id,age,name)values(3,19,'cc');
Query OK, 1 row affected (0.00 sec)

mysql> update tb1 set name='yayun' where id=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from tb1;
+----+------+-------+
| id | age  | name  |
+----+------+-------+
|  1 |   18 | yayun |
|  2 |   18 | bb    |
|  3 |   19 | cc    |
+----+------+-------+
3 rows in set (0.00 sec)

mysql>

查询从库记录：

mysql> select * from tb1;
+----+------+-------+
| id | age  | name  |
+----+------+-------+
|  1 |   18 | yayun |
|  2 |   18 | bb    |
|  3 |   19 | cc    |
+----+------+-------+
3 rows in set (0.00 sec)

mysql>

可以看到此时主从数据是一致的。接下来我们就当主库挂了。重新需要拉取备份，然后向从库同步数据。
1. 把备份文件backup_yayun.sql拉到主库。

2. 把从库的同步断掉，清掉同步信息。

从库操作：

mysql> stop slave;reset slave all;
Query OK, 0 rows affected (0.03 sec)

Query OK, 0 rows affected (0.05 sec)

mysql>

主库操作：

mysql -uroot -p yayun < backup_yayun.sql

查看pos点：

[root@mdw ~]# grep -i change backup_yayun.sql 
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000004', MASTER_LOG_POS=4070;
[root@mdw ~]#

主库change到原来的从库

mysql> CHANGE MASTER TO  MASTER_HOST='10.36.25.250',MASTER_USER='repl',MASTER_PASSWORD='123',MASTER_LOG_FILE='mysql-bin.000004',MASTER_LOG_POS=4070;
Query OK, 0 rows affected (0.10 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

查询数据：

如果查询出来的数据是下面的数据，那么就是正确的：

+----+------+-------+
| id | age  | name  |
+----+------+-------+
|  1 |   18 | yayun |
|  2 |   18 | bb    |
|  3 |   19 | cc    |
+----+------+-------+

我们实际查询一下：

mysql> select * from tb1;
+----+------+------+
| id | age  | name |
+----+------+------+
|  1 |   18 | aa   |
|  2 |   18 | bb   |
+----+------+------+
2 rows in set (0.00 sec)

mysql>

卧槽，发生了什么，怎么数据少了，而且id等于1的name字段结果也不一样？

下面我们看看原来老的从库的binlog

#160908 13:45:57 server id 25152  end_log_pos 4239      Query   thread_id=16    exec_time=0     error_code=0
SET TIMESTAMP=1473313557/*!*/;
insert into tb1 (id,age,name)values(3,19,'cc')
/*!*/;
# at 4239
#160908 13:45:57 server id 25152  end_log_pos 4266      Xid = 160
COMMIT/*!*/;
# at 4266
#160908 13:46:20 server id 25152  end_log_pos 4325      Query   thread_id=16    exec_time=0     error_code=0
SET TIMESTAMP=1473313580/*!*/;
BEGIN
/*!*/;
# at 4325
#160908 13:46:20 server id 25152  end_log_pos 4427      Query   thread_id=16    exec_time=0     error_code=0
SET TIMESTAMP=1473313580/*!*/;
update tb1 set name='yayun' where id=1
/*!*/;
# at 4427
#160908 13:46:20 server id 25152  end_log_pos 4454      Xid = 162
COMMIT/*!*/;
DELIMITER ;

可以看见有insert，update，但是server id都是25152，也就是主库的。这也就是为什么少了数据的原因。开头也提到过了。

如果我们在新的主库上面进行update，如果这条记录在从库没有存在，而且主从的binlog是row模式，那么就会触发1032错误，复制将中断，由于我的是mixed模式，同步一直没有报错，没有早发现问题。我update语句加limit就会触发row模式，下面我们试试。

主库：

mysql> update tb1 set name='abcd' where id=3 limit 1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql>

从库：

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.36.25.250
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 4653
               Relay_Log_File: relaylog.000002
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table yayun.tb1; Can't find record in 'tb1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 4626
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4454
              Relay_Log_Space: 601
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows event on table yayun.tb1; Can't find record in 'tb1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 4626
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 25250
1 row in set (0.00 sec)

可以看见抛1032错误，主库有这条记录，从库没有，同时触发了row模式，就会导致复制中断。

结论：

1. 在重新搭建复制关系的时候一定注意server-id。

2. 线上对数据一致性要求比较高的一定要使用row模式。

posted @ 2016-09-08 14:10 yayun 阅读(33423) 评论(6) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

Atlas

MySQL Server-id踩到的坑

公告