Understanding Redis Replication in Depth

Replication

A few things to understand ASAP about Redis replication.

1) Redis replication is asynchronous, but you can configure a master to
   stop accepting writes if it appears to be not connected with at least
   a given number of slaves.
2) Redis slaves are able to perform a partial resynchronization with the
   master if the replication link is lost for a relatively small amount of
   time. You may want to configure the replication backlog size (see the next
   sections of this file) with a sensible value depending on your needs.
3) Replication is automatic and does not need user intervention. After a
   network partition slaves automatically try to reconnect to masters
   and resynchronize with them.

 

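How replication is implemented

1. Set the master's address and port

In short, this step runs the SLAVEOF command. SLAVEOF is asynchronous: once the masterhost and masterport attributes have been set, the replica returns OK to the client that sent SLAVEOF. The OK only means that the replication instruction has been accepted; the actual replication work starts after OK has been returned.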

 

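2. Create the socket connection.

After SLAVEOF has been executed, the replica creates a socket connection to the master using the IP and port given in the command. If the connection is created successfully, the replica associates a file event handler dedicated to replication with this socket. This handler carries out the rest of the replication work, such as receiving the RDB file and receiving the write commands propagated by the master.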

 

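3. Send the PING command.

Once the replica has become a client of the master, the first thing it does is send a PING command to the master. The PING serves two purposes:

1. Check that the socket can be read and written normally.

2. Check that the master can process command requests normally.

If the replica reads a "PONG" reply, the network connection between master and replica is healthy and the master can process the commands the replica sends.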

 

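4. Authentication

After receiving the "PONG" reply from the master, the replica moves on to authentication. If the replica has the masterauth option set, it authenticates; otherwise it skips this step.

When authentication is required, the replica sends an AUTH command to the master, with the value of the replica's masterauth option as its argument.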

 

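5. Send port information.

After authentication, the replica executes REPLCONF listening-port <port-number> to send its listening port to the master.

The master records it in the slave_listening_port attribute of the corresponding client state. It can be seen with info Replication: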

127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:1
slave0:ip=127.0.0.1,port=6380,state=online,offset=3696,lag=0
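
For monitoring scripts, a slaveN line like the one above is easy to split into fields. The following is a minimal Python sketch; the helper name parse_slave_line is made up for illustration and is not part of any Redis client library.

```python
def parse_slave_line(line):
    """Parse one 'slaveN:key=value,...' line from INFO replication."""
    name, _, fields = line.strip().partition(":")
    info = {}
    for pair in fields.split(","):
        key, _, value = pair.partition("=")
        # Numeric fields (port, offset, lag) become ints; the rest stay strings.
        info[key] = int(value) if value.isdigit() else value
    return name, info


name, info = parse_slave_line(
    "slave0:ip=127.0.0.1,port=6380,state=online,offset=3696,lag=0"
)
print(name, info["port"], info["state"], info["offset"], info["lag"])
```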

 

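6. Synchronization.

The replica sends the PSYNC command to the master to perform synchronization, bringing its own database up to the master's current state.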
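
Steps 3 to 6 can be sketched as the ordered list of commands a replica sends over the socket. This is only an illustration: the function names are invented here, the real handshake sends more REPLCONF options than this article covers, and the "?" and -1 arguments are what a replica with no previous replication state passes to PSYNC.

```python
def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = str(arg).encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def handshake_commands(listening_port, masterauth=None, replid="?", offset=-1):
    """Commands sent by the replica in steps 3-6, in order (hypothetical helper)."""
    commands = [("PING",)]                       # step 3
    if masterauth is not None:                   # step 4 runs only if masterauth is set
        commands.append(("AUTH", masterauth))
    commands.append(("REPLCONF", "listening-port", listening_port))  # step 5
    commands.append(("PSYNC", replid, offset))   # step 6
    return commands


for cmd in handshake_commands(6380, masterauth="secret"):
    print(encode_command(*cmd))
```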

 

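7. Command propagation

Once synchronization is complete, master and replica enter the command propagation phase. From then on, the master only has to keep sending the write commands it executes to the replica, and the replica only has to keep receiving and executing them, for the two to stay consistent.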

 

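8. Heartbeat

During the command propagation phase, the replica by default sends the following command to the master once per second:

REPLCONF ACK <replication_offset>

Here replication_offset is the replica's current replication offset.

Sending REPLCONF ACK serves three purposes for master and replica:

1> Detect the state of the network connection between master and replica.

2> Help implement the min-slaves options (min-slaves-to-write and min-slaves-max-lag).

3> Detect whether any commands have been lost.

The REPLCONF ACK command and the replication backlog were both added in Redis 2.8. Before that, neither master nor replica would notice if a command was lost during propagation.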
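
To make the three purposes concrete, here is a rough Python model of what a master could track per replica from REPLCONF ACK. It is a sketch, not Redis's implementation: the Replica class and the function names are invented for illustration, and time is passed in explicitly so the example is deterministic.

```python
class Replica:
    """Master-side view of one replica (fields are illustrative)."""

    def __init__(self, name):
        self.name = name
        self.ack_offset = 0     # offset reported by the last REPLCONF ACK
        self.ack_time = None    # when that ACK arrived, in seconds


def on_replconf_ack(replica, offset, now):
    """Record a 'REPLCONF ACK <offset>' received at time `now`."""
    replica.ack_offset = offset
    replica.ack_time = now


def good_replicas(replicas, max_lag, now):
    """Count replicas whose last ACK is at most `max_lag` seconds old."""
    return sum(
        1 for r in replicas
        if r.ack_time is not None and now - r.ack_time <= max_lag
    )


def writes_allowed(replicas, min_slaves_to_write, max_lag, now):
    """Mimic min-slaves-to-write / min-slaves-max-lag; 0 disables the check."""
    if min_slaves_to_write == 0 or max_lag == 0:
        return True
    return good_replicas(replicas, max_lag, now) >= min_slaves_to_write


def bytes_behind(master_offset, replica):
    """A positive value means the replica has not yet confirmed some data."""
    return master_offset - replica.ack_offset
```

With two replicas where only one has acknowledged within the last 10 seconds, writes_allowed returns False for min_slaves_to_write=2 and True for 1.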

 

Replication-related parameters

slaveof <masterip> <masterport>
masterauth <master-password>

slave-serve-stale-data yes

slave-read-only yes

repl-diskless-sync no

repl-diskless-sync-delay 5

repl-ping-slave-period 10

repl-timeout 60

repl-disable-tcp-nodelay no

repl-backlog-size 1mb

repl-backlog-ttl 3600

slave-priority 100

min-slaves-to-write 3
min-slaves-max-lag 10

slave-announce-ip 5.5.5.5
slave-announce-port 1234

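Where:

slaveof <masterip> <masterport>: enables replication. This one directive is all that is needed.

masterauth <master-password>: if the master has a password set through the requirepass parameter, the slave must set this parameter.

slave-serve-stale-data: whether the slave may keep serving clients while the master-slave link is down or while replication is still being established. The default is yes, which allows it to serve clients, at the risk of returning stale data.

slave-read-only: puts the slave in read-only mode. Note that read-only mode only restricts write operations from clients; it has no effect on administrative commands.

repl-diskless-sync, repl-diskless-sync-delay: whether to use diskless replication. To reduce disk overhead on the master, Redis supports diskless replication, in which the generated RDB file is not saved to disk but sent directly to the replica over the network. Diskless replication suits setups where the master's disk is slow but network bandwidth is plentiful. Note that diskless replication is currently still experimental.

repl-ping-slave-period: the interval, in seconds, at which the master sends a PING command to the slave.

repl-timeout: the replication timeout.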

# The following option sets the replication timeout for:
#
# 1) Bulk transfer I/O during SYNC, from the point of view of slave.
# 2) Master timeout from the point of view of slaves (data, pings).
# 3) Slave timeout from the point of view of masters (REPLCONF ACK pings).
#
# It is important to make sure that this value is greater than the value
# specified for repl-ping-slave-period otherwise a timeout will be detected
# every time there is low traffic between the master and the slave.
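
The constraint in the comment above (repl-timeout must be greater than repl-ping-slave-period) is easy to check mechanically. A small Python sketch, assuming the configuration is available as plain "name value" lines; both helper names are made up for this example.

```python
def parse_directives(conf_text):
    """Collect 'name value' directives, skipping blank lines and # comments."""
    directives = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        directives[name] = value.strip()
    return directives


def timeout_is_safe(conf_text):
    """True if repl-timeout is strictly greater than repl-ping-slave-period."""
    conf = parse_directives(conf_text)
    return int(conf["repl-timeout"]) > int(conf["repl-ping-slave-period"])


print(timeout_is_safe("repl-ping-slave-period 10\nrepl-timeout 60"))
```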

 

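repl-disable-tcp-nodelay: when set to yes, the master waits for a while before sending TCP packets so that small packets can be merged. The exact wait depends on the Linux kernel and is typically 40 milliseconds. This suits complex network environments or links where bandwidth is tight. The default is no.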

 

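repl-backlog-size: the size of the replication backlog. The replication backlog is a fixed-length queue kept on the master. It is used for the partial resynchronization introduced in Redis 2.8.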

# Set the replication backlog size. The backlog is a buffer that accumulates
# slave data when slaves are disconnected for some time, so that when a slave
# wants to reconnect again, often a full resync is not needed, but a partial
# resync is enough, just passing the portion of data the slave missed while
# disconnected.
#
# The bigger the replication backlog, the longer the time the slave can be
# disconnected and later be able to perform a partial resynchronization.
#
# The backlog is only allocated once there is at least a slave connected.

In other words, the backlog is only allocated once a slave has connected.
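
The idea of a fixed-length queue indexed by replication offset can be modelled in a few lines of Python. This is a simplified sketch: the class name and methods are invented for illustration, and it trims a bytearray rather than reproducing Redis's own buffer management. The property names mirror the repl_backlog_* fields shown by INFO.

```python
class ReplicationBacklog:
    """Keeps only the most recent `size` bytes of the replication stream."""

    def __init__(self, size):
        self.size = size
        self.buf = bytearray()
        self.master_repl_offset = 0   # offset of the last byte produced so far

    def feed(self, data):
        """Append propagated bytes and drop the oldest ones beyond `size`."""
        self.buf.extend(data)
        self.master_repl_offset += len(data)
        if len(self.buf) > self.size:
            del self.buf[: len(self.buf) - self.size]

    @property
    def histlen(self):
        return len(self.buf)

    @property
    def first_byte_offset(self):
        return self.master_repl_offset - len(self.buf) + 1

    def read_from(self, offset):
        """Bytes from `offset` onward, or None when a full resync is needed."""
        if offset < self.first_byte_offset or offset > self.master_repl_offset + 1:
            return None
        return bytes(self.buf[offset - self.first_byte_offset:])


backlog = ReplicationBacklog(size=10)
backlog.feed(b"abcdefgh")
print(backlog.read_from(5))   # the replica missed bytes 5..8
backlog.feed(b"ijkl")         # oldest bytes are evicted
print(backlog.read_from(2))   # offset 2 is gone from the backlog
```

A larger size keeps older offsets readable for longer, which is why a bigger backlog tolerates longer disconnections.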

 

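repl-backlog-ttl: if all slaves of a master have disconnected and none reconnects within the specified time, the master frees the backlog. repl-backlog-ttl sets that time. The default is 3600 seconds; a value of 0 means the backlog is never freed.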

 

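slave-priority: sets the priority of the slave, used by Redis Sentinel during failover. The lower the value, the higher the priority for promotion to master. Note that a value of 0 means the slave never takes part in master election.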
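
The priority rule can be illustrated with a small Python sketch. It only models the two facts stated above (lowest non-zero value wins, 0 is excluded); the function name is hypothetical, and a real Sentinel applies further criteria to break ties that are not covered in this article.

```python
def pick_by_priority(priorities):
    """priorities maps a slave name to its slave-priority value.

    Returns the name with the lowest non-zero priority, or None when
    every slave has priority 0 and is therefore not eligible.
    """
    eligible = {name: prio for name, prio in priorities.items() if prio != 0}
    if not eligible:
        return None
    return min(eligible, key=eligible.get)


print(pick_by_priority({"slave-a": 100, "slave-b": 50, "slave-c": 0}))
```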

 

slave-announce-ip, slave-announce-port: typically used with port forwarding or NAT, to tell the master the IP and port at which the slave is actually reachable.

 

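The synchronization process

1. The replica sends the PSYNC command to the master.

2. On receiving PSYNC, the master executes BGSAVE to generate an RDB file in the background, and uses a buffer to record every write command executed from this point on.

3. When BGSAVE finishes, the master sends the resulting RDB file to the replica. The replica receives and loads the RDB file, bringing its database to the state the master's database was in when BGSAVE started.

4. The master sends all the write commands recorded in the buffer to the replica. The replica executes them, bringing its database to the master's current state.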
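
The four steps amount to "snapshot plus replay". The toy Python model below uses plain dicts in place of databases and (key, value) pairs in place of write commands; it is only meant to show why the replica ends up identical to the master, and the function name is invented for this example.

```python
def full_sync(master_db, writes_during_bgsave):
    """Simulate steps 2-4 and return the replica's resulting database."""
    snapshot = dict(master_db)            # step 2: BGSAVE captures the current state
    buffered = []
    for key, value in writes_during_bgsave:
        master_db[key] = value            # the master keeps serving writes...
        buffered.append((key, value))     # ...and records them in the buffer

    replica_db = dict(snapshot)           # step 3: the replica loads the RDB file
    for key, value in buffered:           # step 4: the replica replays the buffer
        replica_db[key] = value
    return replica_db


master = {"a": 1}
replica = full_sync(master, [("b", 2), ("a", 3)])
print(replica == master)
```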

 

Note that the buffer mentioned in step 2 has a size limit, which is set by client-output-buffer-limit slave 256mb 64mb 60. The syntax and meaning of this parameter are as follows:

# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
#
# A client is immediately disconnected once the hard limit is reached, or if
# the soft limit is reached and remains reached for the specified number of
# seconds (continuously).

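This means that if the buffer grows beyond 256 MB, or stays above 64 MB for 60 seconds, the master immediately closes the connection to the replica. After the disconnection, once 60 seconds (repl-timeout) have passed, the replica notices that it has received no data from the master and restarts replication.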
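
The hard and soft limits can be illustrated with a small Python function that walks over timestamped buffer sizes. It is a sketch under simplifying assumptions: the function name and the sampling model are invented here, and the exact behaviour at the boundaries (equal to a limit, exactly 60 seconds) is not meant to match Redis byte for byte.

```python
MB = 1024 * 1024


def disconnect_time(samples, hard_limit, soft_limit, soft_seconds):
    """samples: (time_in_seconds, buffer_bytes) pairs in time order.

    Returns the time at which the client would be disconnected, or None.
    """
    soft_since = None                      # when the soft limit was first exceeded
    for t, size in samples:
        if size >= hard_limit:
            return t                       # hard limit: disconnect immediately
        if size >= soft_limit:
            if soft_since is None:
                soft_since = t
            elif t - soft_since >= soft_seconds:
                return t                   # above the soft limit for too long
        else:
            soft_since = None              # dropped below: the timer resets
    return None


samples = [(0, 10 * MB), (10, 70 * MB), (40, 80 * MB), (75, 90 * MB)]
print(disconnect_time(samples, 256 * MB, 64 * MB, 60))
```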

 

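Before Redis 2.8, if replication between master and replica was interrupted by a network problem, the replica would still run the SYNC command and do a full resynchronization when the connection was re-established, which was inefficient. Starting with Redis 2.8, the PSYNC command replaces SYNC for the synchronization step of replication.

PSYNC supports two modes: full resynchronization and partial resynchronization.

Log of a full resynchronization: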

master:

19544:M 05 Oct 20:44:04.713 * Slave 127.0.0.1:6380 asks for synchronization
19544:M 05 Oct 20:44:04.713 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for 'dc419fe03ddc9ba30cf2a2cf1894872513f1ef96', my replication IDs are 'f8a035fdbb7cfe435652b3445c2141f98a65e437' and '0000000000000000000000000000000000000000')
19544:M 05 Oct 20:44:04.713 * Starting BGSAVE for SYNC with target: disk
19544:M 05 Oct 20:44:04.713 * Background saving started by pid 20585
20585:C 05 Oct 20:44:04.723 * DB saved on disk
20585:C 05 Oct 20:44:04.723 * RDB: 0 MB of memory used by copy-on-write
19544:M 05 Oct 20:44:04.813 * Background saving terminated with success
19544:M 05 Oct 20:44:04.814 * Synchronization with slave 127.0.0.1:6380 succeeded

slave:

19746:S 05 Oct 20:44:04.288 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
19746:S 05 Oct 20:44:04.288 * SLAVE OF 127.0.0.1:6379 enabled (user request from 'id=3 addr=127.0.0.1:37128 fd=8 name= age=929 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=slaveof')
19746:S 05 Oct 20:44:04.712 * Connecting to MASTER 127.0.0.1:6379
19746:S 05 Oct 20:44:04.712 * MASTER <-> SLAVE sync started
19746:S 05 Oct 20:44:04.712 * Non blocking connect for SYNC fired the event.
19746:S 05 Oct 20:44:04.713 * Master replied to PING, replication can continue...
19746:S 05 Oct 20:44:04.713 * Trying a partial resynchronization (request dc419fe03ddc9ba30cf2a2cf1894872513f1ef96:1191).
19746:S 05 Oct 20:44:04.713 * Full resync from master: f8a035fdbb7cfe435652b3445c2141f98a65e437:1190
19746:S 05 Oct 20:44:04.713 * Discarding previously cached master state.
19746:S 05 Oct 20:44:04.814 * MASTER <-> SLAVE sync: receiving 224566 bytes from master
19746:S 05 Oct 20:44:04.814 * MASTER <-> SLAVE sync: Flushing old data
19746:S 05 Oct 20:44:04.815 * MASTER <-> SLAVE sync: Loading DB in memory
19746:S 05 Oct 20:44:04.817 * MASTER <-> SLAVE sync: Finished with success

 

Log of a partial resynchronization:

master:

19544:M 05 Oct 20:42:06.423 # Connection with slave 127.0.0.1:6380 lost.
19544:M 05 Oct 20:42:06.753 * Slave 127.0.0.1:6380 asks for synchronization
19544:M 05 Oct 20:42:06.753 * Partial resynchronization request from 127.0.0.1:6380 accepted. Sending 0 bytes of backlog starting from offset 1037.

slave:

19746:S 05 Oct 20:42:06.423 # Connection with master lost.
19746:S 05 Oct 20:42:06.423 * Caching the disconnected master state.
19746:S 05 Oct 20:42:06.752 * Connecting to MASTER 127.0.0.1:6379
19746:S 05 Oct 20:42:06.752 * MASTER <-> SLAVE sync started
19746:S 05 Oct 20:42:06.752 * Non blocking connect for SYNC fired the event.
19746:S 05 Oct 20:42:06.753 * Master replied to PING, replication can continue...
19746:S 05 Oct 20:42:06.753 * Trying a partial resynchronization (request f8a035fdbb7cfe435652b3445c2141f98a65e437:1037).
19746:S 05 Oct 20:42:06.753 * Successful partial resynchronization with master.
19746:S 05 Oct 20:42:06.753 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

 

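In Redis 4.0, master_replid and offset are stored in the RDB file. When a replica is shut down gracefully and restarted, Redis can reload master_replid and offset from the RDB file, which makes a partial resynchronization possible.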

 

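Partial resynchronization relies on the following three parts:

1. The replication offsets of master and replica.

2. The master's replication backlog.

3. The node's run ID, whose role in PSYNC is taken over by the replication ID (master_replid) in Redis 4.0.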

 

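When a replica is promoted to master, the other replicas must resynchronize with the new master. Before Redis 4.0, this was a full resynchronization because master_replid had changed. From Redis 4.0 on, the new master remembers the old master's master_replid and offset, so it can accept partial resynchronization requests from the other replicas even though the master_replid in those requests differs from its own. Under the hood, when slaveof no one is executed, master_replid and master_repl_offset+1 are copied into master_replid2 and second_repl_offset.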
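
Putting the pieces together, the master's decision between a partial and a full resynchronization can be approximated as follows. This Python sketch is written from the behaviour described in this article and visible in the logs above, not copied from Redis; the function name is invented, and the dict keys reuse the INFO field names.

```python
def can_partial_resync(req_replid, req_offset, master):
    """Decide whether 'PSYNC <req_replid> <req_offset>' can be served partially.

    `master` is a dict holding the INFO replication fields of the master.
    """
    same_history = req_replid == master["master_replid"]
    # After a failover, the previous replication ID stays valid up to
    # second_repl_offset.
    old_history = (
        req_replid == master["master_replid2"]
        and req_offset <= master["second_repl_offset"]
    )
    if not (same_history or old_history):
        return False

    # The requested offset must still be covered by the backlog.
    first = master["repl_backlog_first_byte_offset"]
    last = first + master["repl_backlog_histlen"]
    return master["repl_backlog_active"] == 1 and first <= req_offset <= last


promoted = {
    "master_replid": "b" * 40,
    "master_replid2": "a" * 40,      # ID inherited from the old master
    "second_repl_offset": 1001,
    "repl_backlog_active": 1,
    "repl_backlog_first_byte_offset": 1,
    "repl_backlog_histlen": 1200,
}
print(can_partial_resync("a" * 40, 900, promoted))    # old ID, offset within range
print(can_partial_resync("c" * 40, 900, promoted))    # unknown ID
```

The replication IDs and offsets in the usage example are made-up values chosen to show one accepted and one rejected request.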

 

Replication-related variables

# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6380,state=online,offset=5698,lag=0
slave1:ip=127.0.0.1,port=6381,state=online,offset=5698,lag=0
master_replid:e071f49c8d9d6719d88c56fa632435fba83e145d
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:5698
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:5698

# Replication
role:slave
master_host:127.0.0.1
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:126
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:15715bc0bd37a71cae3d08b9566f001ccbc739de
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:126
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:126

 

Where:

role: Value is "master" if the instance is replica of no one, or "slave" if the instance is a replica of some master instance. Note that a replica can be master of another replica (chained replication).

master_replid: The replication ID of the Redis server. Every Redis node is dynamically assigned a 40-character hexadecimal string as its run ID at startup; the replication ID has the same format and identifies the master's replication history.

master_replid2: The secondary replication ID, used for PSYNC after a failover. When slaveof no one is executed, master_replid and master_repl_offset+1 are copied into master_replid2 and second_repl_offset.

master_repl_offset: The server's current replication offset, that is, the master's replication offset.

second_repl_offset: The offset up to which replication IDs are accepted.

repl_backlog_active: Flag indicating replication backlog is active, that is, whether the backlog has been enabled.

repl_backlog_size: Total size in bytes of the replication backlog buffer. This is the value of repl-backlog-size.

repl_backlog_first_byte_offset: The master offset of the replication backlog buffer, that is, the earliest master offset still held in the backlog.

repl_backlog_histlen: Size in bytes of the data in the replication backlog buffer.


If the instance is a replica, these additional fields are provided:

master_host: Host or IP address of the master.

master_port: Master listening TCP port.

master_link_status: Status of the link (up/down) between master and replica.

master_last_io_seconds_ago: Number of seconds since the last interaction with master. The master sends a PING to the replica every 10 seconds to check that the replica is alive and connected. This variable shows how long ago master and replica last exchanged a heartbeat.

master_sync_in_progress: Indicate the master is syncing to the replica, that is, whether the master is currently synchronizing data to the replica. My own guess is that this covers either a full or a partial resynchronization.

slave_repl_offset: The replication offset of the replica instance.

slave_priority: The priority of the instance as a candidate for failover.

slave_read_only: Flag indicating if the replica is read-only.


If a SYNC operation is on-going, these additional fields are provided:

master_sync_left_bytes: Number of bytes left before syncing is complete. 

master_sync_last_io_seconds_ago: Number of seconds since last transfer I/O during a SYNC operation. 


If the link between master and replica is down, an additional field is provided:

master_link_down_since_seconds: Number of seconds since the link is down, that is, how long the master-replica connection has been interrupted.

 

The following field is always provided:

connected_slaves: Number of connected replicas.

 

If the server is configured with the min-slaves-to-write (or starting with Redis 5 with the min-replicas-to-write) directive, an additional field is provided:

min_slaves_good_slaves: Number of replicas currently considered good, that is, replicas in a healthy state.

 

For each replica, the following line is added:
slaveXXX: id, IP address, port, state, offset, lag. This line describes the state of one slave.

slave0:ip=127.0.0.1,port=6381,state=online,offset=1288,lag=1

 

How to monitor replication lag

# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6381,state=online,offset=560,lag=0
slave1:ip=127.0.0.1,port=6380,state=online,offset=560,lag=0
master_replid:15715bc0bd37a71cae3d08b9566f001ccbc739de
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:560

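Here master_repl_offset is the master's replication offset, and the offset field in each slaveX line is the replication offset of the corresponding replica. The difference between the two is the replication lag, in bytes.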
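
The subtraction can be automated by parsing the text of INFO replication. The Python sketch below works on a plain string so it needs no Redis connection; the function name replication_lag_bytes is made up for this example, and how the text is obtained (redis-cli, a client library) is left open.

```python
def replication_lag_bytes(info_text):
    """Return {slave_name: master_repl_offset - slave_offset} in bytes."""
    master_offset = None
    replica_offsets = {}
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue                      # skips '# Replication' and blank lines
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and "offset=" in value:
            fields = dict(pair.split("=", 1) for pair in value.split(","))
            replica_offsets[key] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found")
    return {name: master_offset - off for name, off in replica_offsets.items()}


sample = """# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6381,state=online,offset=560,lag=0
slave1:ip=127.0.0.1,port=6380,state=online,offset=560,lag=0
master_repl_offset:560
"""
print(replication_lag_bytes(sample))
```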

 

How to estimate the backlog buffer size

t * (master_repl_offset2 - master_repl_offset1 ) / (t2 - t1)

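master_repl_offset1 and master_repl_offset2 are the values of master_repl_offset sampled at times t1 and t2, so the fraction is the write rate in bytes per second. t is how long the disconnections may last in seconds.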
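
As a worked example of the formula, here is a direct Python translation. The function and parameter names are chosen for this sketch, and the sample numbers are invented.

```python
def estimate_backlog_size(offset1, t1, offset2, t2, max_disconnect_seconds):
    """t * (master_repl_offset2 - master_repl_offset1) / (t2 - t1), in bytes."""
    write_rate = (offset2 - offset1) / (t2 - t1)    # bytes per second
    return max_disconnect_seconds * write_rate


# master_repl_offset grew from 1000 to 61000 over 60 seconds (1000 bytes/s);
# to survive a 120-second disconnection the backlog needs about 120000 bytes.
print(estimate_backlog_size(1000, 0, 61000, 60, 120))
```

In practice it is reasonable to sample during peak write traffic and leave some headroom on top of the result.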

 


posted @ 2018-10-08 08:48  iVictor