# RocketMQ Deployment Modes Navigation Study Notes 4noface

Posted on 2025-10-01 06:07 by 吾以观复

Version: 5.0
Deployment Modes
Apache RocketMQ 5.0 completes basic message sending and receiving with the NameServer, Broker, and Proxy components. In 5.0, the Proxy and Broker can be run in either Local mode or Cluster mode depending on actual needs. In general, if there are no special requirements, or if you are following a smooth upgrade path from earlier versions, choose Local mode.

In Local mode, the Broker and Proxy are deployed in the same process; you only need to add a few simple Proxy settings on top of the existing Broker configuration.
In Cluster mode, the Broker and Proxy are deployed separately, i.e. you additionally deploy Proxy nodes on top of the existing cluster.
Local Mode Deployment
Since the Proxy and Broker run in the same process in Local mode and the Proxy itself is stateless, the main cluster configuration can still be done on the Broker side.

Start NameServer
The NameServer must be started before the Broker. For production use, to ensure high availability, it is recommended that a cluster of typical size run three NameServers. The startup command is the same on every node:

First start the NameServer:

$ nohup sh mqnamesrv &

Verify whether the NameServer started successfully:

$ tail -f ~/logs/rocketmqlogs/namesrv.log
The Name Server boot success...

Start Broker + Proxy
Single-group single-replica mode
Warning
This setup carries significant risk: there is only one Broker node, so the whole service becomes unavailable whenever the Broker restarts or goes down. It is not recommended for production and should only be used for local testing.

Start the Broker + Proxy:

$ nohup sh bin/mqbroker -n localhost:9876 --enable-proxy &

Verify whether the Broker started successfully. For example, if the Broker's IP is 192.168.1.2 and its name is broker-a:

$ tail -f ~/logs/rocketmqlogs/broker_default.log
The broker[xxx, 192.168.1.2:10911] boot success...

Multi-group (cluster) single-replica mode
All nodes in the cluster are deployed as Masters, with no Slave replicas, for example 2 Masters or 3 Masters. The pros and cons of this mode are as follows:

Pros: configuration is simple; a single Master going down or being restarted for maintenance has no impact on applications; with disks configured as RAID10, messages are not lost even when a machine crashes unrecoverably, because RAID10 disks are very reliable (asynchronous flush may lose a small number of messages, synchronous flush loses none); performance is the highest.

Cons: while a single machine is down, messages on that machine that have not been consumed cannot be subscribed to until the machine recovers, so message real-time delivery is affected.

Start the Broker + Proxy cluster

On machine A, start the first Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-noslave/broker-a.properties --enable-proxy &

On machine B, start the second Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-noslave/broker-b.properties --enable-proxy &

...

Note
The startup commands above assume a single NameServer. For a cluster with multiple NameServers, separate the addresses after -n in the Broker startup command with semicolons, for example 192.168.1.1:9876;192.168.1.2:9876.
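
For instance, a hedged example with two NameServer addresses (the second IP is illustrative; the list is quoted so the shell does not treat the semicolon as a command separator):

```shell
$ nohup sh bin/mqbroker -n '192.168.1.1:9876;192.168.1.2:9876' -c $ROCKETMQ_HOME/conf/2m-noslave/broker-a.properties --enable-proxy &
```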

Multi-node (cluster) multi-replica mode, asynchronous replication
Each Master is paired with one Slave, forming multiple Master-Slave groups. HA uses asynchronous replication, with a short (millisecond-level) message lag between master and slave. The pros and cons of this mode are as follows:

Pros: even if a disk is damaged, very few messages are lost, and message real-time delivery is unaffected; when a Master goes down, consumers can still consume from the Slave, and this process is transparent to applications with no manual intervention; performance is almost the same as the multi-Master mode.

Cons: a small number of messages may be lost when the Master goes down and its disk is damaged.

Start the Broker + Proxy cluster

On machine A, start the first Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-async/broker-a.properties --enable-proxy &

On machine B, start the second Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-async/broker-b.properties --enable-proxy &

On machine C, start the first Slave. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-async/broker-a-s.properties --enable-proxy &

On machine D, start the second Slave. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-async/broker-b-s.properties --enable-proxy &

Multi-node (cluster) multi-replica mode, synchronous dual write
Each Master is paired with one Slave, forming multiple Master-Slave pairs. HA uses synchronous dual write: success is returned to the application only after both master and slave have written the message. The pros and cons of this mode are as follows:

Pros: neither data nor service has a single point of failure; when a Master goes down messages have no lag; both service availability and data availability are very high.

Cons: performance is slightly lower than the asynchronous replication mode (roughly 10% lower), the RT for sending a single message is slightly higher, and in the current version the slave cannot automatically switch to master after the master node goes down.

Start the Broker + Proxy cluster

On machine A, start the first Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-sync/broker-a.properties --enable-proxy &

On machine B, start the second Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-sync/broker-b.properties --enable-proxy &

On machine C, start the first Slave. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-sync/broker-a-s.properties --enable-proxy &

On machine D, start the second Slave. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-sync/broker-b-s.properties --enable-proxy &

Tip
A Broker and its Slave are paired by specifying the same BrokerName. The Master's BrokerId must be 0, and a Slave's BrokerId must be a number greater than 0. A Master may also have multiple Slaves attached; multiple Slaves under the same Master are distinguished by different BrokerId values. $ROCKETMQ_HOME refers to the RocketMQ installation directory; you need to set this environment variable yourself.
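
To make the pairing concrete, here is a hedged sketch of what the shipped broker-a.properties and broker-a-s.properties templates under conf/2m-2s-async/ typically look like (exact values may differ across releases):

```properties
# conf/2m-2s-async/broker-a.properties (Master of group broker-a)
brokerClusterName=DefaultCluster
brokerName=broker-a
brokerId=0
deleteWhen=04
fileReservedTime=48
brokerRole=ASYNC_MASTER
flushDiskType=ASYNC_FLUSH

# conf/2m-2s-async/broker-a-s.properties (Slave of the same group: same brokerName, brokerId > 0)
brokerClusterName=DefaultCluster
brokerName=broker-a
brokerId=1
deleteWhen=04
fileReservedTime=48
brokerRole=SLAVE
flushDiskType=ASYNC_FLUSH
```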

New HA Mode in 5.0
This provides a more flexible HA mechanism that lets users better balance cost, service availability, and data reliability, and supports both business-messaging and stream-storage scenarios. For details, see the master-slave automatic switchover mode deployment section later in this document.

Cluster Mode Deployment
In Cluster mode, the Broker and Proxy are deployed separately; you can deploy the Proxy after both the NameServer and the Broker have been started.

In Cluster mode, a Proxy cluster corresponds one-to-one to a Broker cluster, which can be configured with rocketMQClusterName in the Proxy configuration file rmq-proxy.json.
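
As an illustration, a minimal rmq-proxy.json could contain just that cluster binding (the cluster name below is an example value):

```json
{
  "rocketMQClusterName": "DefaultCluster"
}
```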

Start NameServer

First start the NameServer:

$ nohup sh mqnamesrv &

Verify whether the NameServer started successfully:

$ tail -f ~/logs/rocketmqlogs/namesrv.log
The Name Server boot success...

Start Broker
Single-group single-replica mode
Warning
This setup carries significant risk: there is only one Broker node, so the whole service becomes unavailable whenever the Broker restarts or goes down. It is not recommended for production and should only be used for local testing.

On machine A, start the first Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 &

Multi-group (cluster) single-replica mode
All nodes in the cluster are deployed as Masters, with no Slave replicas, for example 2 Masters or 3 Masters. The pros and cons of this mode are as follows:

Pros: configuration is simple; a single Master going down or being restarted for maintenance has no impact on applications; with disks configured as RAID10, messages are not lost even when a machine crashes unrecoverably, because RAID10 disks are very reliable (asynchronous flush may lose a small number of messages, synchronous flush loses none); performance is the highest.

Cons: while a single machine is down, messages on that machine that have not been consumed cannot be subscribed to until the machine recovers, so message real-time delivery is affected.

On machine A, start the first Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-noslave/broker-a.properties &

On machine B, start the second Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-noslave/broker-b.properties &

...

Note
The startup commands above assume a single NameServer. For a cluster with multiple NameServers, separate the addresses after -n in the Broker startup command with semicolons, for example 192.168.1.1:9876;192.168.1.2:9876.

Multi-node (cluster) multi-replica mode, asynchronous replication
Each Master is paired with one Slave, forming multiple Master-Slave groups. HA uses asynchronous replication, with a short (millisecond-level) message lag between master and slave. The pros and cons of this mode are as follows:

Pros: even if a disk is damaged, very few messages are lost, and message real-time delivery is unaffected; when a Master goes down, consumers can still consume from the Slave, and this process is transparent to applications with no manual intervention; performance is almost the same as the multi-Master mode.

Cons: a small number of messages may be lost when the Master goes down and its disk is damaged.

On machine A, start the first Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-async/broker-a.properties &

On machine B, start the second Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-async/broker-b.properties &

On machine C, start the first Slave. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-async/broker-a-s.properties &

On machine D, start the second Slave. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-async/broker-b-s.properties &

Multi-node (cluster) multi-replica mode, synchronous dual write
Each Master is paired with one Slave, forming multiple Master-Slave pairs. HA uses synchronous dual write: success is returned to the application only after both master and slave have written the message. The pros and cons of this mode are as follows:

Pros: neither data nor service has a single point of failure; when a Master goes down messages have no lag; both service availability and data availability are very high.

Cons: performance is slightly lower than the asynchronous replication mode (roughly 10% lower), the RT for sending a single message is slightly higher, and in the current version the slave cannot automatically switch to master after the master node goes down.

On machine A, start the first Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-sync/broker-a.properties &

On machine B, start the second Master. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-sync/broker-b.properties &

On machine C, start the first Slave. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-sync/broker-a-s.properties &

On machine D, start the second Slave. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqbroker -n 192.168.1.1:9876 -c $ROCKETMQ_HOME/conf/2m-2s-sync/broker-b-s.properties &

Tip
A Broker and its Slave are paired by specifying the same BrokerName. The Master's BrokerId must be 0, and a Slave's BrokerId must be a number greater than 0. A Master may also have multiple Slaves attached; multiple Slaves under the same Master are distinguished by different BrokerId values. $ROCKETMQ_HOME refers to the RocketMQ installation directory; you need to set this environment variable yourself.

New HA Mode in 5.0
This provides a more flexible HA mechanism that lets users better balance cost, service availability, and data reliability, and supports both business-messaging and stream-storage scenarios. For details, see the master-slave automatic switchover mode deployment section later in this document.

Start Proxy
Multiple Proxy instances can be started on multiple machines.

On machine A, start the first Proxy. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqproxy -n 192.168.1.1:9876 &

On machine B, start the second Proxy. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqproxy -n 192.168.1.1:9876 &

On machine C, start the third Proxy. For example, if the NameServer's IP is 192.168.1.1:

$ nohup sh bin/mqproxy -n 192.168.1.1:9876 &

If you need to specify a configuration file, use -pc or --proxyConfigPath:

Custom configuration file:

$ nohup sh bin/mqproxy -n 192.168.1.1:9876 -pc /path/to/proxyConfig.json &


Version: 5.0
Master-Slave Automatic Switchover Mode Deployment
(Architecture diagram)

This document describes how to deploy a RocketMQ cluster that supports automatic master-slave switchover; its architecture is shown in the figure above. The main addition is the Controller component, which supports automatic switchover and can be deployed standalone or embedded in the NameServer.

Reference
For details, see the design documentation and the quick start guide.

Controller Deployment
The Controller component provides the leader-election capability. To make the Controller itself fault tolerant, deploy it with three or more replicas (following the Raft majority protocol).

Note
A Controller deployed as a single replica can still perform Broker failover, but if that single Controller fails, the switchover capability is affected; normal sending and receiving on the existing cluster is not affected.

There are two ways to deploy the Controller. One is embedded in the NameServer, enabled with the enableControllerInNamesrv setting (it can be enabled selectively; it is not required on every NameServer). In this mode the NameServer itself remains stateless, so if a majority of NameServers go down in embedded mode, only the switchover capability is affected; route discovery and other existing functions keep working. The other is standalone deployment, where the Controller component is deployed separately.

Deploying the Controller Embedded in the NameServer
(Embedded deployment diagram)

For embedded deployment, simply set enableControllerInNamesrv=true in the NameServer configuration file and fill in the Controller settings:

enableControllerInNamesrv = true
controllerDLegerGroup = group1
controllerDLegerPeers = n0-127.0.0.1:9877;n1-127.0.0.1:9878;n2-127.0.0.1:9879
controllerDLegerSelfId = n0
controllerStorePath = /home/admin/DledgerController
enableElectUncleanMaster = false
notifyBrokerRoleChanged = true

Parameter descriptions:

enableControllerInNamesrv: whether to enable the Controller inside the NameServer; default false.
controllerDLegerGroup: the name of the DLedger Raft Group; keep it consistent within the same DLedger Raft Group.
controllerDLegerPeers: the address and port information of each node in the DLedger Group; this configuration must be identical on every node in the same Group.
controllerDLegerSelfId: the node id, which must be one of the entries in controllerDLegerPeers and unique within the Group.
controllerStorePath: the Controller log storage path. The Controller is stateful; after a restart or crash it relies on this log to recover data, so this directory is very important and must not be deleted casually.
enableElectUncleanMaster: whether a Master may be elected from outside the SyncStateSet. If true, a replica with lagging data may be chosen as Master and messages may be lost; default false.
notifyBrokerRoleChanged: whether to proactively notify the Broker replica group when roles change on it; default true.
After setting the parameters, start the NameServer with the configuration file:

$ nohup sh bin/mqnamesrv -c namesrv.conf &

Standalone Controller Deployment
(Architecture diagram)

For standalone deployment, just run the following script:

$ nohup sh bin/mqcontroller -c controller.conf &

The mqcontroller script is located at distribution/bin/mqcontroller in the source package, and its configuration parameters are the same as in embedded mode.
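
Since the parameters are the same as in embedded mode (minus the enableControllerInNamesrv switch), a controller.conf for standalone deployment might look like the following sketch, reusing the example values shown above:

```properties
controllerDLegerGroup = group1
controllerDLegerPeers = n0-127.0.0.1:9877;n1-127.0.0.1:9878;n2-127.0.0.1:9879
controllerDLegerSelfId = n0
controllerStorePath = /home/admin/DledgerController
enableElectUncleanMaster = false
notifyBrokerRoleChanged = true
```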

Note
Even with a standalone Controller deployment, NameServers still need to be deployed separately to provide route discovery.

Broker Deployment
Brokers are started the same way as before, with the following additional parameters:

enableControllerMode: the master switch for Broker controller mode; automatic master-slave switchover is enabled only when this is true. Default false.
controllerAddr: the Controller addresses, separated by semicolons when there are multiple Controllers, for example controllerAddr = 127.0.0.1:9877;127.0.0.1:9878;127.0.0.1:9879
syncBrokerMetadataPeriod: interval for syncing Broker replica information to the Controller. Default 5000 (5s).
checkSyncStateSetPeriod: interval for checking the SyncStateSet; a check may shrink the SyncStateSet. Default 5000 (5s).
syncControllerMetadataPeriod: interval for syncing Controller metadata, mainly to obtain the address of the active Controller. Default 10000 (10s).
haMaxTimeSlaveNotCatchup: the maximum time a Slave may lag behind the Master; a Slave in the SyncStateSet that exceeds this interval is removed from the SyncStateSet. Default 15000 (15s).
storePathEpochFile: location of the epoch file. The epoch file is very important and must not be deleted casually. Defaults to the store directory.
allAckInSyncStateSet: if true, a message must be replicated to every replica in the SyncStateSet before success is returned to the client, which guarantees no message loss. Default false.
syncFromLastFile: whether a Slave starting with an empty disk should replicate from the last file. Default false.
asyncLearner: if true, the replica never enters the SyncStateSet, i.e. it can never be elected Master, and it always acts as a learner replica doing asynchronous replication. Default false.
inSyncReplicas: the number of replica groups that must stay in sync; default 1. This parameter is ignored when allAckInSyncStateSet=true.
minInSyncReplicas: the minimum number of in-sync replica groups; if the number of replicas in the SyncStateSet is smaller than minInSyncReplicas, putMessage returns PutMessageStatus.IN_SYNC_REPLICAS_NOT_ENOUGH directly. Default 1.
In Controller mode, the Broker configuration must set enableControllerMode=true and provide controllerAddr, and the Broker is started with the following command:

$ nohup sh bin/mqbroker -c broker.conf &
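
For illustration, the controller-related lines in such a broker.conf might look like the sketch below, appended to an ordinary broker configuration (cluster and broker names and the controller addresses are example values):

```properties
brokerClusterName = DefaultCluster
brokerName = broker-a
# brokerId and brokerRole are assigned by the Controller in this mode
enableControllerMode = true
controllerAddr = 127.0.0.1:9877;127.0.0.1:9878;127.0.0.1:9879
allAckInSyncStateSet = false
```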

Note
In automatic master-slave switchover mode, the Broker does not need to specify brokerId or brokerRole; these are assigned by the Controller component.

Compatibility
This mode does not add or modify any client-level APIs, so there are no client-side compatibility issues.

The NameServer's own capabilities are unchanged, so there are no NameServer compatibility issues. If enableControllerInNamesrv is enabled and the controller parameters are configured correctly, the controller function is turned on.

If a Broker sets enableControllerMode=false, it still runs as before. If enableControllerMode=true is set, a Controller must be deployed and the parameters configured correctly for the Broker to run normally.

The specific behavior is shown in the following table:

| | Old NameServer | Old NameServer + standalone Controller | New NameServer with controller enabled | New NameServer with controller disabled |
| --- | --- | --- | --- | --- |
| Old Broker | Runs normally, cannot switch | Runs normally, cannot switch | Runs normally, cannot switch | Runs normally, cannot switch |
| New Broker with Controller mode enabled | Cannot come online normally | Runs normally, can switch | Runs normally, can switch | Cannot come online normally |
| New Broker with Controller mode disabled | Runs normally, cannot switch | Runs normally, cannot switch | Runs normally, cannot switch | Runs normally, cannot switch |

Upgrade Notes
As the compatibility table shows, the NameServer can simply be upgraded normally with no compatibility issues. If you do not want to upgrade the NameServer, you can deploy the Controller component standalone to obtain switchover capability.

There are two cases for upgrading Brokers:

(1) Upgrading a Master-Slave deployment to the Controller switchover architecture

You can upgrade in place with existing data. For each Broker group, stop the master and slave Brokers and make sure their CommitLogs are aligned (you can disable writes to that group for a while before upgrading, or copy the data to ensure consistency), then install the new package and restart.

Note
If the master and slave CommitLogs are not aligned, bring the master online before the slave; otherwise messages may be lost due to data truncation.

(2) Upgrading from the original DLedger mode to the Controller switchover architecture

Because the message data format in the original DLedger mode differs from the Master-Slave format, there is no in-place upgrade path that preserves data. With multiple Broker groups deployed, you can disable writes to one group for a while (just make sure all existing messages have been consumed, for example by checking the message retention time), then clear everything under the store directory except config/topics.json and subscriptionGroup.json (keeping the topic and subscription metadata), and perform an empty-disk upgrade.
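
As a hedged sketch of that cleanup step (assuming the default store path ~/store, the broker already stopped, and illustrative backup paths; adjust to your environment and always back up first):

```shell
# stop the broker first, then:
cd ~/store
mkdir -p /tmp/rocketmq-meta
cp config/topics.json config/subscriptionGroup.json /tmp/rocketmq-meta/   # keep topic & subscription metadata
cd ~ && rm -rf ~/store                                                    # start from an empty disk
mkdir -p ~/store/config
cp /tmp/rocketmq-meta/*.json ~/store/config/
```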


Background

The current RocketMQ Raft mode mainly replaces the original Commitlog with the DLedger Commitlog, giving the Commitlog election and replication capabilities, but this also causes some problems:

  • In Raft mode, a Broker group must have three or more replicas, and replica ACKs must follow the majority protocol.
  • RocketMQ ends up with two HA replication flows, and replication in Raft mode cannot reuse RocketMQ's native storage capabilities.

Therefore we want to use DLedger to implement a Raft-based consensus module (the DLedger Controller) as an optional leader-election component that supports standalone deployment and can also be embedded in the NameServer. Brokers complete Master election by interacting with the Controller, which solves the problems above. We call this new mode the Controller mode.

Architecture

Core Idea

(Architecture diagram)

The figure shows the core architecture of the Controller mode, described as follows:

  • DledgerController: using DLedger, build a DLedger Controller that keeps metadata strongly consistent. A Raft election chooses one Active DLedger Controller as the primary controller. The DLedger Controller can be embedded in the NameServer or deployed standalone. Its main job is to store and manage each Broker group's SyncStateSet and, when a group's Master Broker goes offline or is network-partitioned, to proactively issue scheduling commands to switch that group's Master.
  • SyncStateSet: the set of Slave replicas in a broker replica group that are keeping up with the Master, plus the Master itself. The main criterion is the gap between Master and Slave. When the Master goes offline, the new Master is elected from the SyncStateSet. Changes to the SyncStateSet are mainly initiated by the Master Broker: the Master determines Shrink and Expand through scheduled checks and the replication process, and sends Alter SyncStateSet requests to the Controller election component.
  • AutoSwitchHAService: a new HAService that, on top of DefaultHAService, supports switching the BrokerRole and converting between Master and Slave (under the Controller's control). It also unifies the log replication flow and performs log truncation in the HA HANDSHAKE stage.
  • ReplicasManager: an intermediate component that links the layers. Upward, it periodically syncs control commands from the Controller; downward, it periodically monitors the HAService state and modifies the SyncStateSet at the appropriate time. The ReplicasManager regularly syncs the Controller's metadata about its Broker; when the Controller elects a new Master, the ReplicasManager detects the metadata change and performs the BrokerRole switch.

DLedgerController Core Design

(DLedgerController core design diagram)

The figure shows the core design of the DLedgerController:

  • The DLedgerController can be embedded in the NameServer or deployed standalone.
  • The Active DLedgerController is the Leader elected by DLedger. It accepts event requests from clients, reaches consensus through DLedger, and finally applies the events to the in-memory metadata state machine.
  • A Not Active DLedgerController, i.e. a Follower, replicates the event log from the Active DLedgerController through DLedger and applies it directly to its state machine.

Log Replication

Basic Concepts and Flow

To unify the log replication flow, distinguish the replication boundary of each Master's term, and make log truncation easy, the concept of MasterEpoch is introduced; it represents the current Master's term number (similar in meaning to the Raft term).

Each Master term has a MasterEpoch and a StartOffset, which are the term number and the starting log offset of that Master.

Note that the MasterEpoch is decided by the Controller and is monotonically increasing.

In addition, an EpochFile is introduced to store the sequence of <Epoch, StartOffset> entries.

When a Broker becomes Master, it will:

  • Truncate the Commitlog to the boundary of its last message.

  • Persist the latest <MasterEpoch, startOffset> to the EpochFile, where startOffset is the current CommitLog MaxPhyOffset.

  • Then the HAService listens for connections, creates HAConnections, and works with the Slaves to complete the protocol interaction.

When a Broker becomes Slave, it will:

Ready stage:

  • Truncate the Commitlog to the boundary of its last message.

  • Establish a connection with the Master.

Handshake stage:

  • Perform log truncation. The key point is that the Slave compares its local epoch and startOffset with the Master's to find the truncation point, and then truncates its log.

Transfer stage:

  • Synchronize the log from the Master.

Truncation Algorithm

The detailed log truncation algorithm works as follows:

  • In the HANDSHAKE stage, the Slave obtains the Master's EpochCache from the Master.

  • The Slave compares the received Master EpochCache <StartOffset, EndOffset> entries with its local ones, from the newest epoch backwards. If an entry's Epoch and StartOffset both match, that Epoch is valid: the truncation point is the smaller EndOffset of the two; after truncating, the Slave corrects its own <Epoch, StartOffset> information and enters the Transfer stage. If they do not match, the Slave moves on to its previous epoch and keeps comparing until the truncation point is found.

// Slave-side epoch cache: epoch -> (startOffset, endOffset)
TreeMap<Integer /* epoch */, Pair<Long /* startOffset */, Long /* endOffset */>> epochMap;
long truncateOffset = -1;

// iterate epochs from largest to smallest
Iterator<Map.Entry<Integer, Pair<Long, Long>>> iterator =
        epochMap.descendingMap().entrySet().iterator();

while (iterator.hasNext()) {
    Map.Entry<Integer, Pair<Long, Long>> curEntry = iterator.next();
    // the Master's (startOffset, endOffset) for the same epoch; null if the Master has no such epoch
    Pair<Long, Long> masterOffset = findMasterOffsetByEpoch(curEntry.getKey());

    if (masterOffset != null
            && curEntry.getValue().getObject1().equals(masterOffset.getObject1())) {
        // epochs match and start offsets match: truncate to the smaller end offset
        truncateOffset = Math.min(curEntry.getValue().getObject2(), masterOffset.getObject2());
        break;
    }
}

Replication Flow

Because HA replicates the log as a byte stream, log boundaries cannot be distinguished (a transferred batch may span multiple MasterEpochs), so the Slave cannot detect MasterEpoch changes and therefore cannot update its EpochFile in time.

We therefore made the following improvements:

When the Master transfers the log, it guarantees that each batch it sends belongs to a single epoch and never spans multiple epochs. Two new variables are added to WriteSocketService:

  • currentTransferEpoch: the epoch that the current WriteSocketService.nextTransferFromWhere falls into.

  • currentTransferEpochEndOffset: the end offset of currentTransferEpoch. If currentTransferEpoch == MaxEpoch, then currentTransferEpochEndOffset = -1, meaning there is no bound.

When WriteSocketService transfers the next batch of logs (say the batch's total size is size) and finds that nextTransferFromWhere + size > currentTransferEpochEndOffset, it limits the selectMappedBufferResult to currentTransferEpochEndOffset. Finally, it advances currentTransferEpoch and currentTransferEpochEndOffset to the next epoch. (A small sketch of this clamping is shown below.)
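
A minimal, hedged sketch of that clamping logic (a standalone helper for illustration, not the actual AutoSwitchHAConnection code):

```java
/**
 * Compute how many bytes of the next batch may be transferred without crossing
 * the current epoch boundary. An end offset of -1 means the current epoch is the
 * newest one and imposes no bound.
 */
static long limitBatchToEpoch(long nextTransferFromWhere, long batchSize,
                              long currentTransferEpochEndOffset) {
    if (currentTransferEpochEndOffset < 0) {
        return batchSize;
    }
    return Math.min(batchSize, currentTransferEpochEndOffset - nextTransferFromWhere);
}
```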

Correspondingly, when the Slave receives the log, if it detects an epoch change from the header, it records the new entry in its local epoch file.

Replication Protocol

As described above, AutoSwitchHAService divides log replication into multiple stages. The protocol of this HAService is introduced below.

Handshake stage

1. AutoSwitchHAClient (Slave) sends a HANDSHAKE packet to the Master, as follows:

(Packet layout diagram)

current state(4byte) + Two flags(4byte) + slaveBrokerId(8byte)

  • Current state is the current HAConnectionState, i.e. HANDSHAKE.

  • Two flags are two status bits: isSyncFromLastFile indicates whether replication should start from the Master's last file, and isAsyncLearner indicates whether this Slave replicates asynchronously and joins the Master as a Learner.

  • slaveBrokerId is the Slave's brokerId, used later when joining the SyncStateSet.
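
For illustration only, a hedged sketch of encoding this Slave-to-Master handshake header with the layout above (the method name, constant, and flag bit positions are assumptions, not the actual AutoSwitchHAClient code):

```java
import java.nio.ByteBuffer;

static ByteBuffer encodeHandshakeHeader(int handshakeState, boolean isSyncFromLastFile,
                                        boolean isAsyncLearner, long slaveBrokerId) {
    ByteBuffer header = ByteBuffer.allocate(4 + 4 + 8);   // state + flags + slaveBrokerId
    header.putInt(handshakeState);                        // current HAConnectionState (HANDSHAKE)
    int flags = (isSyncFromLastFile ? 1 : 0) | (isAsyncLearner ? 1 << 1 : 0); // two flag bits
    header.putInt(flags);
    header.putLong(slaveBrokerId);
    header.flip();
    return header;
}
```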

2. AutoSwitchHAConnection (Master) sends a HANDSHAKE packet back to the Slave, as follows:

(Packet layout diagram)

current state(4byte) + body size(4byte) + offset(8byte) + epoch(4byte) + body

  • Current state is the current HAConnectionState, i.e. HANDSHAKE.
  • Body size is the length of the body.
  • Offset is the maximum offset of the Master's log.
  • Epoch is the Master's Epoch.
  • The Body carries the Master's EpochEntryList.

After receiving the Master's reply, the Slave performs the log truncation flow described above locally.

Transfer stage

1. AutoSwitchHAConnection (Master) continuously sends log packets to the Slave, as follows:

(Packet layout diagram)

current state(4byte) + body size(4byte) + offset(8byte) + epoch(4byte) + epochStartOffset(8byte) + additionalInfo(confirmOffset) (8byte) + body

  • Current state: the current HAConnectionState, i.e. TRANSFER.
  • Body size: the length of the body.
  • Offset: the starting offset of this batch of logs.
  • Epoch: the MasterEpoch that this batch of logs belongs to.
  • epochStartOffset: the StartOffset of the MasterEpoch of this batch.
  • confirmOffset: the minimum offset among the replicas in the SyncStateSet.
  • Body: the log data.

2. AutoSwitchHAClient (Slave) sends ACK packets to the Master:

(Packet layout diagram)

current state(4byte) + maxOffset(8byte)

  • Current state: the current HAConnectionState, i.e. TRANSFER.
  • MaxOffset: the current maximum log offset of the Slave.

Master Election

Basic Flow

ElectMaster re-elects a new Master from the SyncStateSet when the Master of a Broker replica group goes offline or becomes unreachable. The election is initiated either by the Controller itself or through the electMaster operations command.

Whether the Controller is deployed standalone or embedded in the NameServer, it monitors each Broker's connection channel. If a Broker's channel becomes inactive, it checks whether that Broker is a Master; if so, it triggers the master election flow.

Electing a Master is relatively simple: we only need to pick one replica from the SyncStateSet of that Broker group to become the new Master, apply the result to the in-memory metadata after DLedger consensus, and finally notify the corresponding Broker replica group of the result.

SyncStateSet Changes

The SyncStateSet is the key input to master election, and changes to it are mainly initiated by the Master Broker: through scheduled checks and the replication process, the Master determines when to Shrink or Expand the SyncStateSet and sends Alter SyncStateSet requests to the Controller election component.

Shrink

Shrinking the SyncStateSet means removing from the SyncStateSet those replicas that have fallen too far behind the Master. The criteria are as follows (a small sketch of the check follows this list):

  • A haMaxTimeSlaveNotCatchUp parameter is added.

  • The HAConnection records lastCaughtUpTimeMs, the timestamp of the last time the Slave caught up with the Master. Its meaning: every time the Master sends data to the Slave (transferData), it records its current MaxOffset as lastMasterMaxOffset and the current timestamp as lastTransferTimeMs.

  • When ReadSocketService receives a slaveAckOffset, if slaveAckOffset >= lastMasterMaxOffset, then lastCaughtUpTimeMs is updated to lastTransferTimeMs.

  • A scheduled task on the Master scans each HAConnection; if (cur_time - connection.lastCaughtUpTimeMs) > haMaxTimeSlaveNotCatchUp, that Slave is considered out-of-sync.

  • Once a Slave is detected as out-of-sync, the Master immediately reports the new SyncStateSet to the Controller, thereby shrinking the SyncStateSet.
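
A minimal, hedged sketch of that out-of-sync check (illustrative types, not the actual Broker code):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Recompute the SyncStateSet from per-slave lastCaughtUpTimeMs values. If any slave is
 * out of sync, the caller would report the shrunk set to the Controller.
 */
static Set<Long> shrinkSyncStateSet(Map<Long /* brokerId */, Long /* lastCaughtUpTimeMs */> slaves,
                                    long masterBrokerId, long nowMs, long haMaxTimeSlaveNotCatchUpMs) {
    Set<Long> newSyncStateSet = new HashSet<>();
    newSyncStateSet.add(masterBrokerId); // the master is always part of its own SyncStateSet
    for (Map.Entry<Long, Long> e : slaves.entrySet()) {
        if (nowMs - e.getValue() <= haMaxTimeSlaveNotCatchUpMs) {
            newSyncStateSet.add(e.getKey()); // still keeping up with the master
        }
    }
    return newSyncStateSet;
}
```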

Expand

If a Slave replica catches up with the Master, the Master needs to send an Alter SyncStateSet request to the Controller promptly. The condition for joining the SyncStateSet is slaveAckOffset >= ConfirmOffset (the minimum MaxOffset of all replicas currently in the SyncStateSet).
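
For illustration, a hedged sketch of how that condition could be evaluated (types and names are illustrative):

```java
import java.util.Collection;

/** ConfirmOffset: the minimum MaxOffset among all replicas currently in the SyncStateSet. */
static long confirmOffset(Collection<Long> maxOffsetsOfSyncStateSetReplicas) {
    return maxOffsetsOfSyncStateSetReplicas.stream().mapToLong(Long::longValue).min().orElse(-1L);
}

/** A slave may be added to the SyncStateSet once its acked offset reaches the confirm offset. */
static boolean mayExpand(long slaveAckOffset, long confirmOffset) {
    return slaveAckOffset >= confirmOffset;
}
```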

References

RIP-44 original text

RIP 44 Support DLedger Controller
Status
Current State: accept
Authors: RongtongJin, ZhangHeng Huang
Shepherds: duhengforever@apache.org, dongeforver@apache.org
Mailing List discussion: dev@rocketmq.apache.org
Pull Request: https://github.com/apache/rocketmq/pull/4484
Released:
Related Docs: English 中文版
Background & Motivation
What do we need to do
Will we add a new module?

Yes, a new controller module will be added.

Will we add new APIs?

No additions or modifications to any client-level APIs

Admin tools add related new commands

There will be a new API for the internal interaction between broker and controller

Will we add a new feature?

Yes, dledger controller is a new feature.

Why should we do that
Are there any problems of our current project?

After the release of RocketMQ 4.5.0, the DLedger mode (raft) was introduced. The raft commitlog under this architecture is used to replace the original commitlog so that it has the ability to failover. However, there are some disadvantages going with this architecture due to the raft capability on replication, including:

To have failover ability, the number of replicas in the broker group must be 3 or more

Acks from replicas need to strictly follow the majority rule of the Raft protocol, that is, 3-replica architecture requires acks from 2 replicas to return, and 5-replica architecture requires acks from 3 to return

Since the store repository relies on OpenMessaging DLedger in DLedger mode, Native storage and replication capabilities of RocketMQ (such as transientStorePool and zero-copy capabilities) cannot be reused, and maintenance becomes difficult as well.

To handle the problems mentioned above, RIP-44 wants to support a DLedger controller. With this improvement, the DLedger (Raft) capability will be abstracted onto the upper layer, becoming an optional and loosely coupled coordination component named the DLedger Controller.

After the DLedger Controller is deployed, the master-slave architecture will also be equipped with failover capability. The DLedger Controller can optionally be embedded into the NameServer (the NameServer itself remains stateless and cannot provide electoral capabilities when the majority is down), or it can be deployed independently.

DLedger controller is an optional component that does not change the previous operation and maintenance mode. Compared with other components, its downtime will not affect online services. In addition, RIP-44 unifies the storage and replication of RocketMQ, resulting in lower maintenance costs and faster development iterations. In terms of compatibility, the master-slave architecture can upgrade without compatibility problems.

What can we benefit from proposed changes?
RIP-44 enables RocketMQ to have the optional failover capability in the Master-Slave deployment and can utilize RocketMQ's native storage and replication capabilities, with consistent log data and no message loss.

Goals
What problem is this proposal designed to solve?
This enables RocketMQ to have the optional failover capability in the Master-Slave deployment and can utilize RocketMQ's native storage and replication capabilities, with consistent log data and no message loss.

Non-Goals.
What problem is this proposal NOT designed to solve?
The following problems are NOT goals of this RIP:

Introducing external coordination components such as ZooKeeper or etcd to solve the failover problem

Making the master-slave election capability mandatory

Changes
Architecture
Core idea

SyncStateSet: It mainly represents the replicas in a broker group that keep up with the master (including the master itself). The main criterion is the gap between the master and the slave. When the master goes offline we elect a new master from the SyncStateSet.

Active Controller: Use Raft capabilities to build a DLedger controller that ensures metadata consistency. Through a Raft election, an Active DLedger Controller is elected as the active controller. The DLedger Controller can be embedded in the Nameserver and turned on by a switch (if a majority of the nameservers go down, only the failover ability is affected; the nameserver's original capabilities remain stateless). Additionally, the DLedger Controller supports standalone deployment.

Alter SyncStateSet: The master broker will regularly detect whether the SyncStateSet needs to be expanded or reduced, and report the SyncStateSet information through the API. The DLedger Controller will maintain the SyncStateSet of the broker replica group with strong consistency.

Elect Master: Once the Master of a Broker's replica group goes offline, the Active DLedger Controller is notified via a heartbeat mechanism (either the NameServer's lightweight heartbeat when embedded, or the Controller's own heartbeat module when deployed independently). The Elect Master API is then called to re-select a master from the SyncStateSet, and instructions are sent to the Broker to complete the master-slave switch.

Replication: Broker's internal replication adds epoch and start offset, epoch represents the version, and start offset is the start physical offset from the version number. Data truncation after switching is completed with both epoch and start offset to ensure commitlog consistency.

Detailed design of election
DLedger Controller
We need a controller with strong metadata consistency to manage the SyncStateSet and switch the master when a broker's master broker goes offline. Currently DLedger, as a Raft Commitlog-based repository, fits the bill just fine.

DLedger Controller has the following two deployment scenarios:

Embedded NameServer: If you want to ensure the high availability of the election module, you must start at least three NameServers, turn on the Controller switch in the NameServer to start the plug-in of the DLedger Controller. The Controller relies on the DLedger capability to ensure strong consistency. In addition, DLedger Controller is not as complicated as zookeeper and etcd. It tries to simplify external API, has no monitoring mechanism, and relies on polling and notification instead.

Independent deployment: DLedger Controller also supports independent deployment

SyncStateSet
The SyncStateSet list represents a list of synchronized replicas. It mainly represents the number of Slave replicas that follow the Master plus the Master in a group of broker replicas. The main criterion is the difference between Master and Slave. When the Master goes offline, we will select a new Master from the SyncStateSet.

Changes to the SyncStateSet are primarily initiated by the Master Broker. The Master Broker determines SyncStateSet Shrink and Expand requests through scheduled checks and the synchronization process, and initiates an Alter SyncStateSet request to the controller. After the controller applies it successfully, the broker updates its locally cached SyncStateSet.

Shrink SyncStateSet
Shrink SyncStateSet, refers to the removal of those replicas that are too far from the Master in the SyncStateSet replica set. The gap here is mainly in several aspects:

Whether to establish a connection with the Master Broker, if disconnected, remove the Slave from the SyncStateSet (no connection may be a network problem, heartbeat timeout, etc.).
The haMaxTimeSlaveNotCatchUp parameter is added, and HaConnection records the timestamp lastCaughtUpTimeMs of the last time the Slave catches up with the Master. The meaning of the timestamp is: every time the Master sends data (transferData) to the Slave, record its current MaxOffset as lastMasterMaxOffset and the current timestamp lastTransferTimeMs. If slaveAckOffset>=lastMasterMaxOffset is reported in ReadSocketService, lastCaughtUpTimeMs is updated to lastTransferTimeMs.
The scheduled task scans each connection, if (cur_time - connection.lastCaughtUpTimeMs) > haMaxTimeSlaveNotCatchUp, the Slave is Out-of-sync.
Finally, if it is determined that a Slave is out-of-sync, the Master needs to update the SyncStateSet to the Controller Alter in time. If the update is successful, it will be applied locally.

Expand SyncStateSet
Similarly, if a Slave replica catches up with the Master, the Master needs to update the Controller Alter SyncStateSet in a timely manner. If the update succeeds, it will be applied locally. The terms of joining SyncStateSet are SlaveAckOffset >= ConfirmOffset (the concept of ConfirmOffset is described below as the minimum MaxOffset for all current SyncStateSet replicas).

Controller API
The Controller builds consistent metadata internally based on Dledger, and provides API for modifying and reading metadata externally. The main external APIs are as follows:

public interface Controller {

    /**
     * Alter SyncStateSet of broker replicas.
     * @param request AlterSyncStateSetRequestHeader
     * @return RemotingCommand(AlterSyncStateSetResponseHeader)
     */
    CompletableFuture<RemotingCommand> alterSyncStateSet(
        final AlterSyncStateSetRequestHeader request, final SyncStateSet syncStateSet);

    /**
     * Elect new master for a broker.
     * @param request ElectMasterRequest
     * @return RemotingCommand(ElectMasterResponseHeader)
     */
    CompletableFuture<RemotingCommand> electMaster(final ElectMasterRequestHeader request);

    /**
     * Register api when a replicas of a broker startup.
     * @param request RegisterBrokerRequest
     * @return RemotingCommand(RegisterBrokerResponseHeader)
     */
    CompletableFuture<RemotingCommand> registerBroker(final BrokerRegisterRequestHeader request);

    /**
     * Get the Replica Info for a target broker.
     * @param request GetRouteInfoRequest
     * @return RemotingCommand(GetReplicaInfoResponseHeader)
     */
    CompletableFuture<RemotingCommand> getReplicaInfo(final GetReplicaInfoRequestHeader request);

    /**
     * Get Metadata of controller.
     * @return RemotingCommand(GetControllerMetadataResponseHeader)
     */
    RemotingCommand getControllerMetadata();
}
AlterSyncStateSet

First of all, the AlterSyncStateSet must be initiated by the master of a broker. The slave has no right to initiate the request, and the request will report the latest SyncStateSet of the group of brokers to the controller. The request processing does a pre-check first.

Pre-check logic:

Check whether the Broker that initiated the request is the Master; an Alter SyncStateSet request can only be initiated by the Master Broker
Compare the SyncStateSet epochs; the request may be a stale AlterSyncStateSet request
Check the correctness of the SyncStateSet, that is, whether all Brokers in the SyncStateSet are alive brokers (verified through a lightweight heartbeat mechanism)
The new SyncStateSet must contain the current leader, because the Master cannot be removed from the SyncStateSet
If the check passes, we can generate an Alter SyncStateSet event, initiate consensus request through Dledger, and finally modify the in-memory SyncStateSet.

Elect Master

ElectMaster mainly selects a new Master from the SyncStateSet when the Master of a Broker replica group is offline or inaccessible. This event is initiated by the Controller itself.

The heartbeat mechanism is mainly used here. The Broker will report the heartbeat to the Controller on a regular basis, and the Controller will also regularly scan the timeout Broker (scanNotActiveBroker). If a Broker's heartbeat times out, the Controller will determine whether it is a Master (brokerId = 0) and if it is a Master, it will initiate ElectMaster to elect a new Broker Master.

The method of electing a Master is relatively simple: we only need to select a surviving replica (one whose heartbeat has not timed out) from the SyncStateSet of that group of Brokers to become the new Master, and generate an ElectMaster event. After the event passes DLedger consensus, it is applied to the in-memory metadata and the corresponding Broker replica group is notified of the result (the Broker also polls getReplicaInfo to obtain the Master information of its own replica group, as extra assurance against lost notifications).

In addition, the controller adds the enableUncleanMasterElect parameter. At this time, if the SyncStateSet does not have a copy that meets the requirements, it can be selected from all the current surviving copies, but a large number of messages may be lost.
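
For intuition, a hedged sketch of that selection rule (illustrative types and names; the real Controller works on its replicated metadata state machine):

```java
import java.util.Optional;
import java.util.Set;

/**
 * Pick a new master: prefer an alive replica inside the SyncStateSet; only when unclean
 * election is allowed fall back to any alive replica (at the risk of losing messages).
 */
static Optional<String> electMaster(Set<String> syncStateSet, Set<String> aliveReplicas,
                                    boolean enableElectUncleanMaster) {
    Optional<String> candidate = syncStateSet.stream().filter(aliveReplicas::contains).findFirst();
    if (candidate.isPresent() || !enableElectUncleanMaster) {
        return candidate;
    }
    return aliveReplicas.stream().findFirst();
}
```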

RegisterBroker

RegisterBroker is first called when the Broker comes online, making a registration request to the Controller.

The Controller returns the brokerId and Master of the replicas. That is, in Controller mode, brokerId are determined and assigned by the Controller.

Assignment of brokerId
When a broker goes online for the first time and no brokerId can be found in the metadata, an ApplyBrokerId event proposal is initiated to assign a brokerId to the Broker (a new brokerId is applied for if the Broker is not the Master; otherwise the brokerId is 0).

Master-slave relationship is determined when online
In addition, when the brokers of a group first go online there is no master yet. The first broker to call in tries to become the master by forming an ElectMasterEvent and submitting the proposal; the first Broker in the replica group whose ElectMasterEvent log is successfully applied becomes the Master, with a SyncStateSet containing only itself.

GetReplicaInfo

This API is mainly used by the Broker to regularly pull the latest metadata information from the Controller. In order to prevent the loss of notifications after the master election, the broker will also call this method regularly, so as to know who is the master of this replica group.

GetControllerMetadata

This API obtains the active controller (the Leader of the Controller) and returns its IP address. Among the APIs above, only GetControllerMetadata can be called on any controller (leader or follower); the other APIs can only be called on the active controller.

Consistent metadata
Consistent metadata is finally constructed by applying the log events of Commitlog in DLedger. The data structure is as follows:

private final Map<String/* brokerName */, BrokerInfo> replicaInfoTable;
private final Map<String/* brokerName */, SyncStateInfo> syncStateSetInfoTable;

BrokerInfo data structure

/**
 * Broker info, mapping from brokerAddress to {brokerId, brokerHaAddress}.
 */
public class BrokerInfo {
    private final String clusterName;
    private final String brokerName;
    // Start from 1
    private final AtomicLong brokerIdCount;
    private final HashMap<String/* Address */, Long/* brokerId */> brokerIdTable;
}

SyncStateInfo data structure

/**
 * Manages the master and syncStateSet of broker replicas.
 */
public class SyncStateInfo {
    private final String clusterName;
    private final String brokerName;
    private Set<String/* Address */> syncStateSet;
    private int syncStateSetEpoch;
    private String masterAddress;
    private int masterEpoch;
}

Event

Events here can also refer to log types; the state machine will eventually apply the events in the DLedger Commitlog to build consistent metadata.

AlterSyncStateSet Event

It is initiated by the AlterInSyncReplicas API after verification, and the corresponding SyncStateInfo data in syncStateSetInfoTable is modified after the state machine is applied.

ElectMaster Event

It is initiated by the ElectMaster API and RegisterBroker API after passing the verification. After the state machine log is applied, the corresponding SyncStateInfo data in the syncStateSetInfoTable is modified, masterEpoch = masterEpoch+1, the master is changed to elect a broker, and the SyncStateSet is changed to only the set of elected brokers.

ApplyBrokerId Event

Initiated by the RegisterBroker API. When the broker queries the replica group information, it is initiated when the brokerId cannot be found. When the state machine applies the log, it will first check whether it already exists. If it already exists, it will not apply again.

Detailed design of replication
Replication solution
On the basis of the original replication, multiple stages are added to complete the necessary data truncation and offset report-back. The replication process proceeds in stages, with different tasks completed in HANDSHAKE (the ordinary basic version does essentially nothing there, while the election version performs the comparison and truncation based on epoch and startOffset).

public enum HAConnectionState {
    /**
     * Ready to start connection.
     */
    READY,
    /**
     * CommitLog consistency checking.
     */
    HANDSHAKE,
    /**
     * Synchronizing data.
     */
    TRANSFER,
    /**
     * Temporarily stop transferring.
     */
    SUSPEND,
    /**
     * Connection shutdown.
     */
    SHUTDOWN,
}
Overall process
As shown in the figure below, it is the overall process of log replication:

Master and Slave respectively accept commands from Controller and execute ChangeToXXX
Master starts AutoSwitchHAService, listens for connections, and creates AutoSwitchHAConnection
Slave starts AutoSwitchHAClient, and in Ready stage, connects to Master (connectToMaster)
After the connection is completed, AutoSwitchHAClient enters the HandShake phase, and sends a Handshake packet, including some status bits and the address of the Slave.
AutoSwitchHAConnection echoes the handshake packet, which includes its local EpochEntry array.
AutoSwitchHAClient compares the received MasterEpochEntryArray with the local EpochEntryArray and performs the corresponding log truncation process, so as to be consistent with the Master.
After the truncation is completed, AutoSwitchHAClient enters the Transfer phase and continuously reports slaveOffset to AutoSwitchHAConnection.
AutoSwitchHAConnection sends log packets to enter the log replication process.

Election Truncation Algorithm
First, add MasterEpoch, which represents the version number of Master, analogous to Term in Raft.

The MasterEpoch is specified by the Controller to ensure that there is only one MasterEpoch at the same time, that is, only one Master exists. Whenever a new Master is elected, the local maximum MaxOffset will be used as the startOffset of the MasterEpoch.

In addition, a new epochFile is added, stored in the ~/store folder, which stores each epoch and its corresponding log start sequence startOffset. (Considering the importance of the file, it is not stored in the checkpoint file)

The algorithm is described as follows:

In order to facilitate the understanding of the algorithm, the concept of endOffset is added here. endOffset is actually the startOffset of the next epoch (the maximum position if there is no next epoch), and does not need to be stored in the actual implementation.

The Slave compares the obtained Master <startoffset, endoffset> entries with its own. If the startoffset is equal, the epoch is valid; the truncation point is the smaller endoffset of the two. After truncation it corrects its own <epoch, startoffset> information and enters the transfer stage. If they are not equal, it moves to the slave's previous epoch and repeats the comparison until the truncation point is found.

// Slave-side epoch cache: epoch -> (startOffset, endOffset), iterated from the largest epoch down
TreeMap<Integer, Pair<Long /* startOffset */, Long /* endOffset */>> epochMap;

Iterator<Map.Entry<Integer, Pair<Long, Long>>> iterator = epochMap.descendingMap().entrySet().iterator();
long truncateOffset = -1;

while (iterator.hasNext()) {
    Map.Entry<Integer, Pair<Long, Long>> curEntry = iterator.next();
    Pair<Long, Long> masterOffset = findMasterOffsetByEpoch(curEntry.getKey());
    if (masterOffset != null && curEntry.getValue().getObject1().equals(masterOffset.getObject1())) {
        // same epoch and same start offset: truncate to the smaller end offset
        truncateOffset = Math.min(curEntry.getValue().getObject2(), masterOffset.getObject2());
        break;
    }
}
If no truncateOffset is found, for example because the Master's files have expired and been deleted, manual processing is required. When truncating, the consistency of the consume queue and the commitlog must be ensured, so the consume queue is truncated as well.

Algorithm Consistency Proof:

The epoch is specified by the Controller. The consensus protocol ensures that all epochs assigned to the broker are unique and increasing. During each epoch, the controller assigns only one broker to be the master. And each broker becomes Master and truncates the commitlog to the boundary of the last message.

For each log (epoch, offset), only one master is responsible for accepting the log, and the slave will only replicate the log from the same master. So for the same (epoch, offset), it will represent the same log entry.

Every time the Slave truncates the log according to the above algorithm, it is guaranteed that the log before truncateOffset is consistent with the Master.

Truncation example analysis:

(1) The endoffset of epoch0 is truncated to the minimum of 900. When unCleanMasterElect=false, allAckInSyncStateSet=true, only the messages within 900 are committed messages.

(2) A loses 900-1000 messages due to asynchronous flush. Finally, the endoffset of epoch0 is truncated to 900, which is the minimum of the two. A will continue to copy B's messages, and no messages will be lost.

(3) The endoffset of epoch1 is truncated to the minimum of 1200, which does not actually need to be truncated.

(4) The endoffset of epoch0 is truncated to the minimum of 900.

(5) Both B and C are truncated to 800 of epoch1.

(6) The endoffset of epoch0 is truncated to the minimum of 1000.

Epoch file updates and corrections
The epoch file stores Map<epoch, startOffset> information, which needs to be updated in time, especially since the slave and master are stream-based replication, the slave cannot perceive the epoch change during the master-standby replication process (because the message will not be parsed). In addition, the map in the epoch file needs to be corrected in time, mainly in the following scenarios:

When preparing the handshake, correct the map after truncation so that it only contains entries up to the master's truncation point
When recovering, correct entries that exceed the valid offset
When files are deleted, if the deletion goes beyond the endOffset of a certain epoch, that epoch is removed
In order to solve the problem that the slave based on stream replication cannot perceive the change of epoch, the offset field will be added to the header of the transfer phase. The specific process is as follows:

Each time a batch of messages is sent in WriteSocketService, a Header is sent at the same time:

/**
 * Header protocol in syncing msg from master.
 * current state + body size + offset + epoch +
 * epochStartOffset + additionalInfo(confirmOffset).
 */
public static final int MSG_HEADER_SIZE = 4 + 4 + 8 + 4 + 8 + 8;

Among them, epoch represents the version number of all messages in this data stream sent by the master, and epochStartOffset represents the startOffset of the epoch. When the master transmits the log, it is guaranteed that a batch sent at a time is in the same epoch and cannot span multiple epochs. Two new variables can be added to WriteSocketService:

currentTransferEpoch: indicates which epoch the current WriteSocketService.nextTransferFromWhere corresponds to
currentTransferEpochEndOffset: corresponds to the end offset of currentTransferEpoch. If currentTransferEpoch == current maximum epoch, then currentTransferEpochEndOffset = -1, indicating no bound.
When WriteSocketService transmits the next batch of logs (assuming the total size of this batch is size), if it finds that nextTransferFromWhere + size > currentTransferEpochEndOffset, it limits the selectMappedBufferResult to currentTransferEpochEndOffset.
Finally, it advances currentTransferEpoch and currentTransferEpochEndOffset to the next epoch. Correspondingly, when the Slave accepts the log, if an epoch change is found in the header, it is recorded in the local epoch file.
ConfirmOffset
If the original commitlog.getMaxOffset were used as the endpoint of reput, truncation in the algorithm could cause a read-uncommitted situation: with two consumer groups subscribing to the same topic, one might receive a message that is later truncated after a master/slave switch while the other never receives it. ConfirmOffset is therefore introduced as a new concept: the consume queue is dispatched only up to confirmOffset, to prevent messages that might be truncated from being read.

Computationally, the Master confirmOffset is the smallest MaxOffset point in all SyncStateSet replicas. The confirmOffset of the Slave is determined by two values. One is the Header that transmits the Master's current confirmOffset when the Master transmits, and the other is the current maximum confirmOffset. The minimum of the two values is taken.

A solution without losing messages
Due to the time difference between reports, the Master elected from SyncStateSet may still lose messages, so it needs to be used together with RIP-34. For example, if inSyncReplicas=2 (synchronous replication) in the case of two replicas, the Master elected in SyncStateSet after it hangs up must include all committed messages, and no messages will be lost. For example, if inSyncReplicas=2 under three replicas, after the Master hangs up, at least one of them will keep up with the Master replica. If the Slave that chooses to keep up becomes the master, no messages will be lost.

In addition, the allAckInSyncStateSet parameter is added. This parameter requires that all replicas in SyncStateSet must be acked before returning to the client. Enabling this parameter can ensure that no messages are lost (when allAckInSyncStateSet is true, inSyncReplicas will be invalid).

The implementation must ensure that the new SyncStateSet will be applied locally after the new SyncStateSet is successfully updated to the Controller.

Explain:

Assuming there are two replicas A and B, where A is the Master and B is the Slave, the SyncStateSet starts as {A, B}. After B goes down, A applies to the Controller for a SyncStateSet change, and only after the Controller confirms the update succeeded does A update its local SyncStateSet to {A} (until the update succeeds, message sends fail). This ensures that once allAckInSyncStateSet is enabled, every replica in the SyncStateSet holds all committed messages, that is, messages are not lost.

When the number of brokers in the SyncStateSet is less than minInSyncReplicas, the message will fail to send.

Detailed design of broker

The figure shows the interaction process between broker and controller:

The main logic for the interaction between the Broker and the Controller is in the component ReplicasManager
When the broker goes online, ReplicasManager initiates a registration request through RegisterBrokerToController (the request is made whenever the brokerId or the master-slave relationship is not yet known) and obtains the BrokerId and MasterAddress
Next, ReplicasManager calls the API of AutoSwitchHAService and executes the ChangeToxxx process
AutoSwitchHAService operates log truncation and replication according to the replication process described above
Online process
The broker side needs a complete online process.

Add the controllerAddress parameter, which is mainly used to configure the IP list of the Controller. Before the Broker is officially registered with the Namesrv, it will first register with the Controller, obtain the master-slave relationship and the brokerId (determine whether it is the master or the slave and the brokerId), and then register with the Nameserver.

The broker will obtain the IP of the active controller node from any controller (the background will also periodically obtain and update the IP of the active controller node), and then access the active controller.

Master-slave relationship
In the election mode, the parameters brokerRole and brokerId are invalid, and finally the controller decides the master/slave relationship.

The master-slave relationship on first launch: if the broker group has no SyncStateSet yet, the first broker in the group to get its consensus request accepted by the active controller becomes the master (its SyncStateSet contains only itself; thanks to the log and consensus protocol, only one Master is produced even if several brokers initiate requests at the same time), and the rest become slaves. brokerId allocation starts from 1, that is, each broker is numbered starting at 1; a broker's brokerId becomes 0 while it is master, and its original number is restored when it becomes a slave again.

The master-slave relationship is determined after going online again: subject to SyncStateSet

Regularly obtain the master-slave relationship: Each broker will periodically poll the Controller to obtain master-slave information (getReplicationInfo), even if it updates its own role (to prevent notification loss).

Switching process
It is basically the same as the current switching process of DLedger mode.

Turn on/off some scanning threads for secondary messages (timing, transactions, POP ACK, etc.)
Modify brokerId and brokerRole
Re-register with Nameserver
If you switch to the master, you also need to wait for consumeQueue dispatch, topicQueueTable recover, etc.

Async Learner
We also introduced a new role for the Broker: Async Learner, similar to the Learner role in Raft, which can be turned on with the parameter isAsyncLearner.

If the Broker enables the AsyncLearner role, it notifies the Master through the AutoSwitchHA protocol during the log replication process. In subsequent log replication, the Master will not add the AsyncLearner to the SyncStateSet, which ensures there is no need to wait for the AsyncLearner's ack (asynchronous replication); at the same time it will not join the election process (because it is not in the SyncStateSet).

Applicable scenarios: Asynchronous replication in different data centers.

Interface Design/Change
Method signature/behavior changes
New RequestCode

// Alter syncStateSet
public static final int CONTROLLER_ALTER_SYNC_STATE_SET = 1001;

// Elect master
public static final int CONTROLLER_ELECT_MASTER = 1002;

// Register broker
public static final int CONTROLLER_REGISTER_BROKER = 1003;

// Get replica info
public static final int CONTROLLER_GET_REPLICA_INFO = 1004;

// Get controller metadata
public static final int CONTROLLER_GET_METADATA_INFO = 1005;

// Get syncStateData (used for adminTool)
public static final int CONTROLLER_GET_SYNC_STATE_DATA = 1006;

// Get brokerEpoch (used for adminTool)
public static final int GET_BROKER_EPOCH_CACHE = 1007;

// Notify broker role changed
public static final int NOTIFY_BROKER_ROLE_CHANGED = 1008;
The Controller interface is as follows:

/**
 * The api for controller
 */
public interface Controller {

    /**
     * Startup controller
     */
    void startup();

    /**
     * Shutdown controller
     */
    void shutdown();

    /**
     * Start scheduling controller events, this function only will be triggered when the controller becomes leader.
     */
    void startScheduling();

    /**
     * Stop scheduling controller events, this function only will be triggered when the controller shutdown leaderShip.
     */
    void stopScheduling();

    /**
     * Whether this controller is in leader state.
     */
    boolean isLeaderState();

    /**
     * Alter SyncStateSet of broker replicas.
     * @param request AlterSyncStateSetRequestHeader
     * @return RemotingCommand(AlterSyncStateSetResponseHeader)
     */
    CompletableFuture<RemotingCommand> alterSyncStateSet(
        final AlterSyncStateSetRequestHeader request, final SyncStateSet syncStateSet);

    /**
     * Elect new master for a broker.
     * @param request ElectMasterRequest
     * @return RemotingCommand(ElectMasterResponseHeader)
     */
    CompletableFuture<RemotingCommand> electMaster(final ElectMasterRequestHeader request);

    /**
     * Register api when a replicas of a broker startup.
     * @param request RegisterBrokerRequest
     * @return RemotingCommand(RegisterBrokerResponseHeader)
     */
    CompletableFuture<RemotingCommand> registerBroker(final BrokerRegisterRequestHeader request);

    /**
     * Get the Replica Info for a target broker.
     * @param request GetRouteInfoRequest
     * @return RemotingCommand(GetReplicaInfoResponseHeader)
     */
    CompletableFuture<RemotingCommand> getReplicaInfo(final GetReplicaInfoRequestHeader request);

    /**
     * Get Metadata of controller
     * @return RemotingCommand(GetControllerMetadataResponseHeader)
     */
    RemotingCommand getControllerMetadata();

    /**
     * Get inSyncStateData for target brokers, this api is used for admin tools.
     */
    CompletableFuture<RemotingCommand> getSyncStateData(final List<String> brokerNames);

    /**
     * Get the remotingServer used by the controller, the upper layer will reuse this remotingServer.
     */
    RemotingServer getRemotingServer();
}

CLI command changes

Add two CLI commands

GetSyncStateSetSubCommand
sh bin/mqadmin getSyncStateSet
usage: mqadmin getSyncStateSet -a [-b ] [-c ] [-h] [-i ] [-n ]
-a,--controllerAddress the address of controller
-b,--brokerName which broker to fetch
-c,--clusterName which cluster
-h,--help Print help
-i,--interval the interval(second) of get info
-n,--namesrvAddr Name server address list, eg:'192.168.0.1:9876;192.168.0.2:9876'
Used to get the SyncStateSet information of a Broker group or cluster

GetBrokerEpochCommand
sh bin/mqadmin getBrokerEpoch
usage: mqadmin getBrokerEpoch [-b ] [-c ] [-h] [-i ] [-n ]
-b,--brokerName which broker to fetch
-c,--clusterName which cluster
-h,--help Print help
-i,--interval the interval(second) of get info
-n,--namesrvAddr Name server address list, eg: '192.168.0.1:9876;192.168.0.2:9876'
Used to get the Epoch information of a Broker

Log format or content changes
The controller component will add a log configuration file

https://github.com/apache/rocketmq/blob/5.0.0-beta-dledger-controller/distribution/conf/logback_controller.xml

Add the following parameters

Controller main parameters:

enableControllerInNamesrv: Whether to enable the controller in the Nameserver, the default is false. If it is deployed for the embedded Nameserver, it needs to be set to true in the NameServer configuration file
controllerDLegerGroup: The name of the DLedger Raft Group, which can be consistent with the same DLedger Raft Group.
controllerDLegerPeers: Port information of each node in the DLedger Group, the configuration of each node in the same Group must be consistent.
controllerDLegerSelfId: Node id, which must belong to one of controllerDLegerPeers; each node within the same Group must be unique.
controllerStorePath: The controller log storage location. The controller is stateful. When the controller restarts or crashes, it needs to rely on the log to recover data. This directory is very important and cannot be easily deleted.
enableElectUncleanMaster: Whether the Master can be elected from outside the SyncStateSet. If true, the copy with the data behind may be selected as the Master and messages will be lost. The default is false.
notifyBrokerRoleChanged: Whether to actively notify when the role on the broker replica group changes, the default is true.
Broker's new parameters:

enableControllerMode: The master switch of the Broker controller mode, only if the value is true, the controller mode will be turned on. Defaults to false.
controllerAddr: The address of the controller. Multiple controllers are separated by semicolons. E.g controllerAddr = 127.0.0.1:9877;127.0.0.1:9878;127.0.0.1:9879
controllerDeployedStandAlone: whether the controller is deployed independently; true if the controller is deployed standalone, false if it is deployed embedded in the Nameserver. Defaults to false.
syncBrokerMetadataPeriod: The time interval for synchronizing Broker replica information to the controller. Default 5000 (5s).
checkSyncStateSetPeriod: The time interval for checking SyncStateSet, checking SyncStateSet may shrink SyncState. Default 5000 (5s).
syncControllerMetadataPeriod: The time interval for synchronizing the controller metadata, mainly to obtain the address of the active controller. Default 10000 (10s).
haMaxTimeSlaveNotCatchup: Indicates that the slave does not keep up with the maximum time interval of the Master. If the slave in the SyncStateSet exceeds the time interval, it will be removed from the SyncStateSet. Default is 15000 (15s).
storePathEpochFile: The location to store the epoch file. The epoch file is very important and cannot be deleted at will. The default is in the store directory.
allAckInSyncStateSet: If the value is true, a message needs to be copied to each copy in the SyncStateSet to return success to the client, which can ensure that messages are not lost. Defaults to false.
syncFromLastFile: If the slave is started from an empty disk, whether to copy from the last file. Defaults to false.
asyncLearner: If the value is true, the replica will not enter the SyncStateSet, that is, it will not be elected as the Master, but will always be used as a learner copy for asynchronous replication. Defaults to false.
Compatibility, Deprecation, and Migration Plan
This mode does not add or modify any client-level APIs, and there is no client-side compatibility issue.

No modification has been made to the capabilities of the Nameserver itself, and there is no compatibility problem with the Nameserver. If enableControllerInNamesrv is enabled and the controller parameters are configured correctly, the controller function is enabled.

If the Broker sets enableControllerMode=false, it will still run in the previous way. If enableControllerMode=true is set, the controller needs to be deployed and the parameters are configured correctly to run normally.

The specific behavior is shown in the following table:

| | Old nameserver | Old nameserver + standalone controllers | New nameserver, controller enabled | New nameserver, controller disabled |
| --- | --- | --- | --- | --- |
| Old broker | Normal running, cannot failover | Normal running, cannot failover | Normal running, cannot failover | Normal running, cannot failover |
| New broker, controller mode enabled | Unable to go online normally | Normal running, can failover | Normal running, can failover | Unable to go online normally |
| New broker, controller mode disabled | Normal running, cannot failover | Normal running, cannot failover | Normal running, cannot failover | Normal running, cannot failover |
Are there deprecated APIs?
No

How do we do migration?
It can be seen from the above compatibility statement that the NameServer can be upgraded normally without compatibility issues. If you do not want to upgrade the Nameserver, you can deploy the Controller component independently to obtain the switching capability.

For the Broker upgrade, there are two situations:

(1) Master-Slave deployment is upgraded to Controller switching architecture

You can upgrade with data directly. For each group of Brokers, shut down the active and standby Brokers to ensure that the Commitlogs of the active and standby are aligned (you can disable writing to the group of Brokers for a period of time before upgrading, or copy store data to ensure consistency), upgrade the package then restart it.

(2) The original DLedger mode is upgraded to the Controller switching architecture

Due to the difference between the original DLedger mode message data format and the data format in Master-Slave, no upgrade path with data is provided. In the case of deploying multiple groups of Brokers, you can disable writing to a certain group of brokers for a period of time (as long as you confirm that all the existing messages are consumed, for example, it is determined according to the storage time of the messages), and then clear the store directory except config/topics.json, and subscriptionGroup.json (retaining the metadata of the topic and subscription relationship), and then perform an empty disk upgrade.

Rejected Alternatives
How does alternatives solve the issue you proposed?
Introduce external coordination components such as zookeeper, etcd to achieve switching capabilities

Pros and Cons of alternatives
Pros:

There is no need to write part of the code of the controller, and the switching capability can be quickly realized

Cons:

Additional operation and maintenance costs are incurred after the introduction of external components

Why should we reject above alternatives
The DLedger mode has been used in the community for many years. DLedger itself is a Raft-based Commitlog repository, on top of which a fault-tolerant consensus component can be built. Through deep integration with RocketMQ, the DLedger Controller can be made very lightweight and loosely coupled, and it can optionally be deployed standalone or embedded in the NameServer. The component does not add much operation and maintenance burden, so there is no need to introduce additional external coordination components.

Appendix
Testing report

Quick Start

Deploy and upgrade guide

Code review advice
