MySQL InnoDB Cluster: Pitfalls When Rejoining Nodes to the Cluster
2026-03-05 16:21 潇湘隐者
A case study: two nodes of a MySQL InnoDB Cluster were stuck in MISSING state. After taking over and talking it through, the rough cause emerged: all three nodes of the cluster had been restarted at the same time, then repeatedly restarted in further attempts, and the resulting problem was never actually fixed. Instead, the first node was simply used as the bootstrap node. The current state is as follows (node names have been obfuscated):
MySQL mysql01:3306 ssl JS > cluster.status();
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysql01:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE_PARTIAL",
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.",
        "topology": {
            "mysql01:3306": {
                "address": "mysql01:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.39"
            },
            "mysql02:3306": {
                "address": "mysql02:3306",
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ],
                "memberRole": "SECONDARY",
                "memberState": "OFFLINE",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.39"
            },
            "mysql03:3306": {
                "address": "mysql03:3306",
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ],
                "memberRole": "SECONDARY",
                "memberState": "OFFLINE",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.39"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "mysql01:3306"
}
MySQL mysql01:3306 ssl JS >
Normally, if all goes well, rejoining each node and letting it recover its data is all that is needed.
MySQL mysql01:3306 ssl JS > cluster.rejoinInstance('mysqladm@mysql02:3306')
Validating instance configuration at mysql02:3306...
This instance reports its own address as mysql02:3306
Instance configuration is suitable.
ERROR: A GTID set check of the MySQL instance at 'mysql02:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.
mysql02:3306 has the following errant GTIDs that do not exist in the cluster:
0d3d8524-a642-11ef-849c-b496919a423e:36017566-36017580
Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to rejoining the instance to the cluster.
Discarding these extra GTID events can either be done manually or by completely overwriting the state of mysql02:3306 with a physical snapshot from an existing cluster member. To achieve this remove the instance from the cluster and add it back using <Cluster>.addInstance() and setting the 'recoveryMethod' option to 'clone'.
ERROR: RuntimeError: The instance 'mysql02:3306' contains errant transactions that did not originate from the cluster.
Cluster.rejoinInstance: The instance 'mysql02:3306' contains errant transactions that did not originate from the cluster. (RuntimeError)
But as shown above, when the cluster originally broke, node mysql01 did not hold the latest transactions, yet it was used as the bootstrap node anyway. So now that node mysql02 tries to rejoin, its extra transactions conflict with the cluster.
By this point the application had already been running against mysql01 alone for some time, so mysql01 held a large amount of new data. The only option was to remove mysql02 and mysql03 from the cluster and add them back.
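The errant GTID range in the error message is exactly what MySQL's GTID_SUBTRACT() function computes: transactions present in the candidate's gtid_executed but absent from the cluster's. As a rough illustration of that set arithmetic, here is a minimal sketch (not MySQL's implementation; the gtid_executed values are assumptions reconstructed from the error message above):

```python
# Minimal emulation of GTID_SUBTRACT() set arithmetic for a single
# source UUID. Real GTID sets may contain many UUIDs and comma-separated
# interval lists; this sketch only handles the "uuid:lo-hi[:lo-hi...]" form.

def parse_gtid_set(s):
    """Split 'uuid:1-5:7-9' into (uuid, [(1, 5), (7, 9)])."""
    uuid, *ivs = s.split(":")
    intervals = []
    for iv in ivs:
        lo, _, hi = iv.partition("-")
        intervals.append((int(lo), int(hi or lo)))
    return uuid, intervals

def subtract_intervals(a, b):
    """Return the parts of interval list a not covered by interval list b."""
    result = []
    for lo, hi in a:
        pieces = [(lo, hi)]
        for blo, bhi in b:
            nxt = []
            for plo, phi in pieces:
                if bhi < plo or blo > phi:      # no overlap, keep piece whole
                    nxt.append((plo, phi))
                    continue
                if plo < blo:                   # left remainder survives
                    nxt.append((plo, blo - 1))
                if phi > bhi:                   # right remainder survives
                    nxt.append((bhi + 1, phi))
            pieces = nxt
        result.extend(pieces)
    return result

# gtid_executed on mysql02 (candidate) vs mysql01 (primary) -- assumed values
candidate = "0d3d8524-a642-11ef-849c-b496919a423e:1-36017580"
primary   = "0d3d8524-a642-11ef-849c-b496919a423e:1-36017565"

uuid, cand_ivs = parse_gtid_set(candidate)
_, prim_ivs = parse_gtid_set(primary)
errant = subtract_intervals(cand_ivs, prim_ivs)
print(uuid + ":" + ":".join(f"{lo}-{hi}" for lo, hi in errant))
# -> 0d3d8524-a642-11ef-849c-b496919a423e:36017566-36017580
```

On a live server the same check is simply `SELECT GTID_SUBTRACT(candidate_gtid_executed, primary_gtid_executed);`, which is what the shell runs under the hood before rejecting the rejoin.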
MySQL mysql01:3306 ssl JS > cluster.removeInstance('mysqladm@mysql02:3306')
ERROR: mysql02:3306 is reachable but has state OFFLINE
To safely remove it from the cluster, it must be brought back ONLINE. If not possible, use the 'force' option to remove it anyway.
Do you want to continue anyway (only the instance metadata will be removed)? [y/N]: yes
The instance will be removed from the InnoDB Cluster.
NOTE: Transaction sync was skipped
NOTE: The instance 'mysql02:3306' is OFFLINE, Group Replication stop skipped.
The instance 'mysql02:3306' was successfully removed from the cluster.
MySQL mysql01:3306 ssl JS > cluster.status();
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysql01:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE_PARTIAL",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
        "topology": {
            "mysql01:3306": {
                "address": "mysql01:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.39"
            },
            "mysql03:3306": {
                "address": "mysql03:3306",
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ],
                "memberRole": "SECONDARY",
                "memberState": "OFFLINE",
                "mode": "n/a",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.39"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "mysql01:3306"
}
Next, add node mysql02 back to the cluster. Normally this step should also go smoothly, but here it hit a big pitfall, as shown below:
MySQL mysql01:3306 ssl JS > cluster.addInstance('mysqladm@mysql02:3306');
WARNING: A GTID set check of the MySQL instance at 'mysql02:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.
mysql02:3306 has the following errant GTIDs that do not exist in the cluster:
0d3d8524-a642-11ef-849c-b496919a423e:36017566-36017580
WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of mysql02:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.
Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
Please select a recovery method [C]lone/[A]bort (default Abort): C
Validating instance configuration at mysql02:3306...
This instance reports its own address as mysql02:3306
Instance configuration is suitable.
NOTE: Group Replication will communicate with other members using 'mysql02:3306'. Use the localAddress option to override.
* Checking connectivity and SSL configuration...
A new instance will be added to the InnoDB Cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.
NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.
* Waiting for clone to finish...
NOTE: mysql02:3306 is being cloned from mysql01:3306
ERROR: The clone process has failed: Error dropping database (can't rmdir './backup/', errno: 17 - File exists) (1010)
ERROR: Error dropping database (can't rmdir './backup/', errno: 17 - File exists)
Cluster.addInstance: Error dropping database (can't rmdir './backup/', errno: 17 - File exists) (RuntimeError)
In this environment the MySQL data directory is /data/mysql, but the backup filesystem is mounted at /data/mysql/backup, inside the data directory, and that mount point is owned by root. Why whoever deployed it came up with such a layout is a mystery, but the result is that adding the node to the InnoDB Cluster blows up (a big pitfall indeed): during clone recovery the server treats backup as a database schema to drop, hits a permission error, and cannot remove the directory /data/mysql/backup, as shown in the error above.
# df -h
Filesystem Size Used Avail Use% Mounted on
...........................................................
/dev/mapper/vg00-mysql 500G 7.3G 493G 2% /data/mysql
/dev/sdb 3.7T 27G 3.7T 1% /data/mysql/backup
............................................................
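This pitfall generalizes: the clone plugin's DROP DATA stage removes every directory under the datadir as if it were a schema, so any foreign mount point or root-owned directory parked inside the datadir will break clone-based recovery. Before adding an instance, it is worth scanning the datadir for such directories. A minimal sketch (the datadir path and the `mysql` owner name are assumptions about this environment; this uses the Unix-only `pwd` module):

```python
import os
import pwd

def foreign_dirs(datadir, owner="mysql"):
    """Flag subdirectories of datadir that a clone DROP DATA stage could
    choke on: separate mount points, or directories not owned by `owner`."""
    datadir_dev = os.stat(datadir).st_dev
    flagged = []
    for entry in os.scandir(datadir):
        if not entry.is_dir(follow_symlinks=False):
            continue
        st = entry.stat(follow_symlinks=False)
        reasons = []
        if st.st_dev != datadir_dev:
            reasons.append("separate mount point")
        try:
            if pwd.getpwuid(st.st_uid).pw_name != owner:
                reasons.append(f"owned by uid {st.st_uid}, not {owner}")
        except KeyError:
            reasons.append(f"unknown owner uid {st.st_uid}")
        if reasons:
            flagged.append((entry.path, reasons))
    return flagged

if __name__ == "__main__":
    # In this case the scan would have flagged /data/mysql/backup
    # (a separate mount, owned by root) before the clone ever ran.
    for path, reasons in foreign_dirs("/data/mysql"):
        print(path, "->", "; ".join(reasons))
```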
After that failure, trying to add the node back produced a metadata error; it could no longer join the InnoDB Cluster at all:
MySQL mysql01:3306 ssl JS > cluster.addInstance('mysqladm@mysql02:3306');
ERROR: MySQL Error 1356: Failed to execute query on Metadata server mysql02:3306: View 'mysql_innodb_cluster_metadata.v2_this_instance' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
Cluster.addInstance: Failed to execute query on Metadata server mysql02:3306: View 'mysql_innodb_cluster_metadata.v2_this_instance' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them (MySQL Error 1356)
MySQL mysql01:3306 ssl JS >
After trying several approaches, the final fix was to drop the InnoDB Cluster metadata schema mysql_innodb_cluster_metadata on node mysql02, as follows:
mysql> SET GLOBAL super_read_only = OFF;
Query OK, 0 rows affected (0.00 sec)
mysql> drop database mysql_innodb_cluster_metadata;
Query OK, 12 rows affected (0.02 sec)
mysql> exit
Then add node mysql02 back into the MySQL InnoDB Cluster:
MySQL mysql01:3306 ssl JS > cluster.addInstance('mysqladm@mysql02:3306')
NOTE: A GTID set check of the MySQL instance at 'mysql02:3306' determined that it is missing transactions that were purged from all cluster members.
Clone based recovery was selected because it seems to be safely usable.
Validating instance configuration at mysql02:3306...
This instance reports its own address as mysql02:3306
Instance configuration is suitable.
NOTE: Group Replication will communicate with other members using 'mysql02:3306'. Use the localAddress option to override.
* Checking connectivity and SSL configuration...
A new instance will be added to the InnoDB Cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.
NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.
* Waiting for clone to finish...
NOTE: mysql02:3306 is being cloned from mysql01:3306
** Stage DROP DATA: Completed
NOTE: mysql02:3306 is shutting down...
* Waiting for server restart... ready
* mysql02:3306 has restarted, waiting for clone to finish...
** Stage RESTART: Completed
* Clone process has finished: 730.94 MB transferred in 6 sec (121.82 MB/s)
Incremental state recovery is now in progress.
* Waiting for distributed recovery to finish...
NOTE: 'mysql02:3306' is being recovered from 'mysql01:3306'
* Distributed recovery has finished
The instance 'mysql02:3306' was successfully added to the cluster.
MySQL mysql01:3306 ssl JS > cluster.status()
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysql01:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures.",
        "topology": {
            "mysql01:3306": {
                "address": "mysql01:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.39"
            },
            "mysql02:3306": {
                "address": "mysql02:3306",
                "memberRole": "SECONDARY",
                "mode": "R/O",
                "readReplicas": {},
                "replicationLag": "applier_queue_applied",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.39"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "mysql01:3306"
}
Adding node mysql03 back to the cluster was then straightforward, so I won't go into it here. When you take over a project or help someone troubleshoot, an unfamiliar environment or a non-standard configuration makes it very easy to step into pitfalls. If you inherit an unfamiliar production environment, be careful with every operation.