Redis Master-Slave Replication and Sentinel 02

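1. Why replicate

  • 1. Keep multiple copies of the data, which makes the service highly available
  • 2. Improve read performance by spreading read requests across nodes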

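2. Key points and difficulties of replication

  • 1. How to specify what is replicated
  • 2. Incremental or full, and how to implement incremental replication
  • 3. Replicating without affecting front-end business operations
  • 4. What to do when the network is interrupted
  • 5. How to keep data that was sent from being lost before it reaches the slave
  • 6. How to detect that the replicated data source has changed, which would corrupt the data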

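3. Replication steps

graph LR
A[Full sync]--Incremental sync-->B[Command propagation]
3.1 Specify the master
  • 1. Set slaveof in the configuration file
  • 2. Or run the slaveof command on the slave node
3.2 Establish the socket connection
  • Based on the slaveof setting from the configuration file or the command line, the slave creates a socket connection to the master
3.3 Send the PING command (sent once the connection is created)
  • 1. The PING checks that the socket can be read and written normally
  • 2. It also checks that the master can process command requests normally
  • 3. If the slave gets no reply to PING within the allowed time, the network is considered faulty; the slave closes the socket and reconnects
  • 4. If the slave receives an error from the master, for example BUSY Redis is busy running a script. You can..., the slave also disconnects and reconnects
  • 5. If the slave receives PONG, everything is normal and the next step can proceed
3.4 Authentication
  • 1. If the slave has the masterauth option set, it authenticates; otherwise this step is skipped
  • 2. Authentication is done by sending the auth command to the master: auth passwd
  • 3. If the master has no requirepass set, the error "no password is set" is returned
  • 4. If the master's password differs from the slave's, the error "invalid password" is returned
3.5 Send port information
  • 1. The slave runs REPLCONF listening-port <port-number> to tell the master the port on which the slave listens for commands
  • 2. This port is what the info command on the master shows for the slave; in other words, the slave proactively reports its listening port to the master
3.6 Synchronization
  • The master and the slave act as clients of each other, so each can send commands to the other and reply to them
3.7 Command propagation
  • After executing a command, the master sends it to the slave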
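
The handshake commands in 3.3 to 3.5 (PING, AUTH, REPLCONF listening-port) are ordinary Redis commands. As a rough illustration, the Python sketch below frames them as RESP arrays of bulk strings, which is a format a Redis server accepts from any client. The helper name encode_command, the password passwd, and the port 6380 are placeholders chosen for this example; the exact bytes a slave puts on the wire may differ between Redis versions.

```python
def encode_command(*args: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        # Each argument is a bulk string: $<length>\r\n<bytes>\r\n
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


# Handshake order from 3.3 to 3.5; the password and port are placeholder values.
handshake = [
    encode_command("PING"),
    encode_command("AUTH", "passwd"),
    encode_command("REPLCONF", "listening-port", "6380"),
]

for frame in handshake:
    print(frame)
```

Writing these frames to a socket connected to a master, in this order, would walk through the same checks the slave performs before synchronization starts.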

4. Synchronization process log

5. Configuration notes

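slaveof <masterip> <masterport>
# the data source to replicate from
masterauth <master-password>
# password used to authenticate with the data source
slave-serve-stale-data yes
# yes: while the link to the master is down or a sync is in progress, the slave still serves client requests; the drawback is that clients may read stale data
# no: the slave rejects client requests with the error "SYNC with master in progress"
slave-read-only yes
# whether the slave is read-only; a writable slave may become inconsistent with the master
repl-timeout 60
# replication timeout in seconds; it applies to:
# the bulk transfer during SYNC, as seen by the slave (a large transfer can hit the timeout)
# master timeout as seen by the slave (no data and no pings received)
# slave timeout as seen by the master (no REPLCONF ACK pings received)
repl-disable-tcp-nodelay no
# yes: Redis uses fewer TCP packets and less bandwidth to send data to slaves by packing more data into each packet, at the cost of some delay; on Linux with the default kernel configuration the delay can reach 40 ms
# no: packets are used less efficiently, but latency is lower
repl-backlog-size 1mb
# fixed-size buffer on the master; it decides whether a slave needs a full sync after a network interruption. If the data the slave is missing is still in the backlog, an incremental sync is done; if the backlog was released after the timeout or the data was overwritten, a full sync happens
repl-backlog-ttl 3600
# after the slaves have been disconnected from the master for this many seconds, the backlog is released
slave-priority 100
# priority of this slave; when the master stops working, Redis Sentinel uses it to choose a slave to promote. A lower value means a higher priority, so the slave with the lowest value is preferred
# the master stops accepting front-end writes when the conditions below are not met
min-slaves-to-write 3
# minimum number of slaves that must be online; the default is 0, which disables the feature
min-slaves-max-lag 10
# maximum allowed lag in seconds; a slave lagging longer than this does not count as online, and writes stop when too few slaves remain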

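6. Synchronization flow

[Figure: synchronization flow]

7. Full synchronization process

[Figure: full synchronization process]

  • 7.1 Run slaveof on the slave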
415:S 20 Nov 14:17:17.330 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
415:S 20 Nov 14:17:17.331 * SLAVE OF 172.16.10.140:6379 enabled (user request from 'id=4 addr=127.0.0.1:55027 fd=11 name= age=198 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=slaveof')
415:S 20 Nov 14:17:17.586 * Connecting to MASTER 172.16.10.140:6379
415:S 20 Nov 14:17:17.586 * MASTER <-> SLAVE sync started
415:S 20 Nov 14:17:17.586 * Non blocking connect for SYNC fired the event.
415:S 20 Nov 14:17:17.587 * Master replied to PING, replication can continue...
415:S 20 Nov 14:17:17.587 * Trying a partial resynchronization (request 572caecf4c0bf264880b2e3899a3dae52e7704e9:1).
415:S 20 Nov 14:17:17.592 * Full resync from master: 030a3c44c4f64eb9a02c3b36f3891226fc2074fe:0
415:S 20 Nov 14:17:17.592 * Discarding previously cached master state.
415:S 20 Nov 14:17:17.681 * MASTER <-> SLAVE sync: receiving 201 bytes from master
415:S 20 Nov 14:17:17.698 * MASTER <-> SLAVE sync: Flushing old data
415:S 20 Nov 14:17:19.605 * MASTER <-> SLAVE sync: Loading DB in memory
415:S 20 Nov 14:17:19.605 * MASTER <-> SLAVE sync: Finished with success
415:S 20 Nov 14:17:19.606 * Background append only file rewriting started by pid 687
415:S 20 Nov 14:17:19.631 * AOF rewrite child asks to stop sending diffs.
687:C 20 Nov 14:17:19.631 * Parent agreed to stop sending diffs. Finalizing AOF...
687:C 20 Nov 14:17:19.631 * Concatenating 0.00 MB of AOF diff received from parent.
687:C 20 Nov 14:17:19.632 * SYNC append only file rewrite performed
687:C 20 Nov 14:17:19.632 * AOF rewrite: 2 MB of memory used by copy-on-write
415:S 20 Nov 14:17:19.707 * Background AOF rewrite terminated with success
415:S 20 Nov 14:17:19.707 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
415:S 20 Nov 14:17:19.707 * Background AOF rewrite finished successfully
  • 7.2 Master log
10110:M 20 Nov 14:17:16.884 * Slave 172.16.10.141:6379 asks for synchronization
10110:M 20 Nov 14:17:16.884 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '572caecf4c0bf264880b2e3899a3dae52e7704e9', my replication IDs are '14f0fbdd33f13d8e6d07c13bb0a184ba7a43c258' and 'ff98eda832c57bef003947b34ae024063689ca44')
10110:M 20 Nov 14:17:16.885 * Starting BGSAVE for SYNC with target: disk
10110:M 20 Nov 14:17:16.888 * Background saving started by pid 11565
11565:C 20 Nov 14:17:16.891 * DB saved on disk
11565:C 20 Nov 14:17:16.891 * RDB: 6 MB of memory used by copy-on-write
10110:M 20 Nov 14:17:16.978 * Background saving terminated with success
10110:M 20 Nov 14:17:16.978 * Synchronization with slave 172.16.10.141:6379 succeeded
  • 7.3 Shut down the master
# master log
12519:C 20 Nov 14:22:10.243 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
12519:C 20 Nov 14:22:10.243 # Redis version=4.0.8, bits=64, commit=00000000, modified=0, pid=12519, just started
12519:C 20 Nov 14:22:10.243 # Configuration loaded
12520:M 20 Nov 14:22:10.245 * Increased maximum number of open files to 10032 (it was originally set to 1024).
12520:M 20 Nov 14:22:10.245 # Creating Server TCP listening socket *:6379: bind: Address already in use
10110:M 20 Nov 14:23:36.032 # User requested shutdown...
10110:M 20 Nov 14:23:36.032 * Calling fsync() on the AOF file.
10110:M 20 Nov 14:23:36.032 * Removing the pid file.
10110:M 20 Nov 14:23:36.032 # Redis is now ready to exit, bye bye...

# slave log
415:S 20 Nov 14:23:36.736 # Connection with master lost.
415:S 20 Nov 14:23:36.736 * Caching the disconnected master state.
415:S 20 Nov 14:23:37.456 * Connecting to MASTER 172.16.10.140:6379
415:S 20 Nov 14:23:37.456 * MASTER <-> SLAVE sync started
415:S 20 Nov 14:23:37.456 # Error condition on socket for SYNC: Connection refused
415:S 20 Nov 14:23:38.458 * Connecting to MASTER 172.16.10.140:6379
415:S 20 Nov 14:23:38.459 * MASTER <-> SLAVE sync started
415:S 20 Nov 14:23:38.459 # Error condition on socket for SYNC: Connection refused
415:S 20 Nov 14:23:39.462 * Connecting to MASTER 172.16.10.140:6379
  • 7.4 Start the master
# slave log
415:S 20 Nov 14:24:39.625 # Error condition on socket for SYNC: Connection refused
415:S 20 Nov 14:24:40.626 * Connecting to MASTER 172.16.10.140:6379
415:S 20 Nov 14:24:40.626 * MASTER <-> SLAVE sync started
415:S 20 Nov 14:24:40.627 * Non blocking connect for SYNC fired the event.
415:S 20 Nov 14:24:40.627 * Master replied to PING, replication can continue...
415:S 20 Nov 14:24:40.628 * Trying a partial resynchronization (request 030a3c44c4f64eb9a02c3b36f3891226fc2074fe:702).
415:S 20 Nov 14:24:40.629 * Full resync from master: 1e1b4acf86e7882c044eb952136e04e5a70b077b:0
415:S 20 Nov 14:24:40.629 * Discarding previously cached master state.
415:S 20 Nov 14:24:40.712 * MASTER <-> SLAVE sync: receiving 216 bytes from master
415:S 20 Nov 14:24:40.712 * MASTER <-> SLAVE sync: Flushing old data
415:S 20 Nov 14:24:40.712 * MASTER <-> SLAVE sync: Loading DB in memory
415:S 20 Nov 14:24:40.712 * MASTER <-> SLAVE sync: Finished with success
415:S 20 Nov 14:24:40.713 * Background append only file rewriting started by pid 1102
415:S 20 Nov 14:24:40.737 * AOF rewrite child asks to stop sending diffs.
1102:C 20 Nov 14:24:40.737 * Parent agreed to stop sending diffs. Finalizing AOF...
1102:C 20 Nov 14:24:40.737 * Concatenating 0.00 MB of AOF diff received from parent.
1102:C 20 Nov 14:24:40.737 * SYNC append only file rewrite performed
1102:C 20 Nov 14:24:40.738 * AOF rewrite: 2 MB of memory used by copy-on-write
415:S 20 Nov 14:24:40.829 * Background AOF rewrite terminated with success
415:S 20 Nov 14:24:40.829 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
415:S 20 Nov 14:24:40.829 * Background AOF rewrite finished successfully

# master log: the run_id (replication ID) has changed, so a full sync is performed
12992:M 20 Nov 14:24:39.924 * Slave 172.16.10.141:6379 asks for synchronization
12992:M 20 Nov 14:24:39.925 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '030a3c44c4f64eb9a02c3b36f3891226fc2074fe', my replication IDs are '510ae8234d41a712b9c60fe63a4cf193fc3a9fe2' and '0000000000000000000000000000000000000000')
12992:M 20 Nov 14:24:39.925 * Starting BGSAVE for SYNC with target: disk
12992:M 20 Nov 14:24:39.925 * Background saving started by pid 13002
13002:C 20 Nov 14:24:39.927 * DB saved on disk
13002:C 20 Nov 14:24:39.927 * RDB: 6 MB of memory used by copy-on-write
12992:M 20 Nov 14:24:40.008 * Background saving terminated with success
12992:M 20 Nov 14:24:40.008 * Synchronization with slave 172.16.10.141:6379 succeeded

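8. Incremental replication after a disconnection

[Figure: incremental replication after a disconnection]

Slave restart
  • 8.1 Shut down the slave
# the master logs the lost connection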
12992:M 20 Nov 14:30:33.092 # Connection with slave 172.16.10.141:6379 lost.
  • 8.2 The master keeps receiving writes
127.0.0.1:6379> set k11 v11
OK
127.0.0.1:6379> set k22 v22
OK
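  • 8.3 Start the slave. A restarted slave also performs a full sync, because its own run_id has changed and the replication ID it requests no longer matches the master's
# slave log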
1520:S 20 Nov 14:31:55.315 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1520:S 20 Nov 14:31:55.315 * SLAVE OF 172.16.10.140:6379 enabled (user request from 'id=2 addr=127.0.0.1:55195 fd=10 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=slaveof')
1520:S 20 Nov 14:31:55.712 * Connecting to MASTER 172.16.10.140:6379
1520:S 20 Nov 14:31:55.712 * MASTER <-> SLAVE sync started
1520:S 20 Nov 14:31:55.712 * Non blocking connect for SYNC fired the event.
1520:S 20 Nov 14:31:55.712 * Master replied to PING, replication can continue...
1520:S 20 Nov 14:31:55.713 * Trying a partial resynchronization (request 3a389f3b7dc9e3a394e6fdac5b7028e59aa635a8:1).
1520:S 20 Nov 14:31:55.715 * Full resync from master: 1e1b4acf86e7882c044eb952136e04e5a70b077b:575
1520:S 20 Nov 14:31:55.715 * Discarding previously cached master state.
1520:S 20 Nov 14:31:55.784 * MASTER <-> SLAVE sync: receiving 235 bytes from master
1520:S 20 Nov 14:31:55.785 * MASTER <-> SLAVE sync: Flushing old data
1520:S 20 Nov 14:31:55.785 * MASTER <-> SLAVE sync: Loading DB in memory
1520:S 20 Nov 14:31:55.785 * MASTER <-> SLAVE sync: Finished with success
1520:S 20 Nov 14:31:55.786 * Background append only file rewriting started by pid 1533
1520:S 20 Nov 14:31:55.809 * AOF rewrite child asks to stop sending diffs.
1533:C 20 Nov 14:31:55.809 * Parent agreed to stop sending diffs. Finalizing AOF...
1533:C 20 Nov 14:31:55.809 * Concatenating 0.00 MB of AOF diff received from parent.
1533:C 20 Nov 14:31:55.809 * SYNC append only file rewrite performed
1533:C 20 Nov 14:31:55.809 * AOF rewrite: 6 MB of memory used by copy-on-write
1520:S 20 Nov 14:31:55.812 * Background AOF rewrite terminated with success
1520:S 20 Nov 14:31:55.812 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
1520:S 20 Nov 14:31:55.812 * Background AOF rewrite finished successfully

# master log
12992:M 20 Nov 14:31:55.010 * Slave 172.16.10.141:6379 asks for synchronization
12992:M 20 Nov 14:31:55.010 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '3a389f3b7dc9e3a394e6fdac5b7028e59aa635a8', my replication IDs are '1e1b4acf86e7882c044eb952136e04e5a70b077b' and '0000000000000000000000000000000000000000')
12992:M 20 Nov 14:31:55.010 * Starting BGSAVE for SYNC with target: disk
12992:M 20 Nov 14:31:55.011 * Background saving started by pid 14369
14369:C 20 Nov 14:31:55.013 * DB saved on disk
14369:C 20 Nov 14:31:55.013 * RDB: 6 MB of memory used by copy-on-write
12992:M 20 Nov 14:31:55.081 * Background saving terminated with success
12992:M 20 Nov 14:31:55.081 * Synchronization with slave 172.16.10.141:6379 succeeded
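Slave loses its connection and resynchronizes incrementally (the data is still in the backlog)
  • 1. After the slave goes offline, the master keeps receiving writes
# slave
systemctl stop network && sleep 60 && systemctl start network &
  • 2. After the slave comes back online
# master log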
12992:M 20 Nov 15:17:37.019 # Disconnecting timedout slave: 172.16.10.141:6379
12992:M 20 Nov 15:17:37.019 # Connection with slave 172.16.10.141:6379 lost.
12992:M 20 Nov 15:17:38.092 * Slave 172.16.10.141:6379 asks for synchronization
12992:M 20 Nov 15:17:38.093 * Partial resynchronization request from 172.16.10.141:6379 accepted. Sending 165 bytes of backlog starting from offset 4388.

# slave log
1705:S 20 Nov 15:17:38.792 # MASTER timeout: no data nor PING received...
1705:S 20 Nov 15:17:38.793 # Connection with master lost.
1705:S 20 Nov 15:17:38.793 * Caching the disconnected master state.
1705:S 20 Nov 15:17:38.793 * Connecting to MASTER 172.16.10.140:6379
1705:S 20 Nov 15:17:38.794 * MASTER <-> SLAVE sync started
1705:S 20 Nov 15:17:38.795 * Non blocking connect for SYNC fired the event.
1705:S 20 Nov 15:17:38.795 * Master replied to PING, replication can continue...
1705:S 20 Nov 15:17:38.795 * Trying a partial resynchronization (request 1e1b4acf86e7882c044eb952136e04e5a70b077b:4388).
1705:S 20 Nov 15:17:38.796 * Successful partial resynchronization with master.
1705:S 20 Nov 15:17:38.796 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

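9. Does a full sync happen again after the master is stopped and restarted?

  • Yes. The master's run_id (replication ID) changes on restart, so a full sync takes place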
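
The accept-or-reject decisions visible in the logs of sections 7 and 8 can be approximated with a small Python model. This is a simplified sketch, not the Redis source: the class MasterState and the function psync_decision are names invented here, the fields borrow their names from the info replication output, and the backlog length 4552 is inferred from the log line that sends 165 bytes starting at offset 4388.

```python
from dataclasses import dataclass


@dataclass
class MasterState:
    replid: str                      # master_replid
    replid2: str                     # master_replid2
    second_repl_offset: int          # second_repl_offset
    backlog_first_byte_offset: int   # repl_backlog_first_byte_offset
    backlog_histlen: int             # repl_backlog_histlen


def psync_decision(master: MasterState, req_replid: str, req_offset: int) -> str:
    """Return 'partial' or 'full' for a request of the form <replid>:<offset>."""
    id_matches = req_replid == master.replid or (
        req_replid == master.replid2 and req_offset <= master.second_repl_offset
    )
    if not id_matches:
        return "full"  # "Replication ID mismatch" in the master log
    start = master.backlog_first_byte_offset
    end = start + master.backlog_histlen
    if not (start <= req_offset <= end):
        return "full"  # the requested data is no longer in the backlog
    return "partial"


master = MasterState(
    replid="1e1b4acf86e7882c044eb952136e04e5a70b077b",
    replid2="0000000000000000000000000000000000000000",
    second_repl_offset=-1,
    backlog_first_byte_offset=1,
    backlog_histlen=4552,
)

# Slave that only lost its network: same replication ID, offset still in the backlog.
print(psync_decision(master, "1e1b4acf86e7882c044eb952136e04e5a70b077b", 4388))
# Slave asking with an ID the master does not know (for example after a restart).
print(psync_decision(master, "030a3c44c4f64eb9a02c3b36f3891226fc2074fe", 702))
```

Under this model a restarted master always answers with a full sync, because the replication ID it generates at startup cannot match the ID the slave cached.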

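10. Heartbeat detection

By default the slave sends the heartbeat command REPLCONF ACK <replication_offset> to the master once per second

  • The heartbeat reveals the state of the network. The info command shows a lag field, the master-slave delay in seconds, which is normally 0 or 1
  • The heartbeat carries the slave's current replication offset. If commands sent by the master were lost on the way, this frequent heartbeat lets the master notice the wrong offset quickly and resend the missing commands to the slave
  • The heartbeat also makes the min-slaves feature possible: when the master-slave state is unhealthy, the master refuses writes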
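
The min-slaves check mentioned in the last bullet can be sketched as follows. This is a simplified Python model under stated assumptions: the function name writes_allowed is made up for this example, and the lag values stand for the lag field that info replication reports for each slave.

```python
from typing import List


def writes_allowed(replica_lags: List[int],
                   min_slaves_to_write: int,
                   min_slaves_max_lag: int) -> bool:
    """Decide whether the master should accept writes.

    replica_lags holds, per slave, the seconds since its last REPLCONF ACK.
    """
    if min_slaves_to_write == 0 or min_slaves_max_lag == 0:
        return True  # a value of 0 disables the feature
    good = sum(1 for lag in replica_lags if lag <= min_slaves_max_lag)
    return good >= min_slaves_to_write


# With min-slaves-to-write 3 and min-slaves-max-lag 10:
print(writes_allowed([0, 1, 2], 3, 10))   # three healthy slaves
print(writes_allowed([0, 1, 12], 3, 10))  # one slave lags by more than 10 seconds
```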

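11. Problems that Redis high availability must solve

  • 1. Multiple nodes hold the same data
    • Replication
  • 2. How a new master is produced after the master goes down
  • 3. How the slaves automatically connect to the new master after the master goes down
  • 4. How to determine that the master is down
  • 5. What to do with the old master once it recovers
  • 6. How to monitor the health of all Redis nodes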

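12. What is Sentinel

  • 1. It is part of the Redis program itself
  • 2. Main functions
    • 2.1 Monitoring: checks the health of Redis nodes
    • 2.2 Notification: reports detected changes to related systems or Redis instances through the Redis publish/subscribe mechanism
    • 2.3 Automatic failover: when the master goes down, a new master is elected
    • 2.4 Configuration management: Redis instances can obtain certain shared information from Sentinel
  • 3. Sentinel is itself distributed, which removes its own single point of failure
    [Figure]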
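12.1 Install and configure Sentinel
  • 1. Copy the configuration and adapt it for the slave
port 6380
logfile "/home/liubx/redisdata/slave1/logs/redis.log"
pidfile /var/run/redis.pid
# the pidfile path must differ from the master's
dir /home/liubx/redisdata/slave1
slaveof localhost 6379
  • 2. Sentinel configuration
The Redis installation directory contains a configuration file named sentinel.conf
daemonize yes
logfile "/home/liubx/sentinel/sentinel.log"
sentinel monitor mymaster 127.0.0.1 6379 1
# monitor name, IP, port, number of votes (quorum)
# one Sentinel can monitor several masters
  • 3. Start Sentinel (either command works)
    • redis-sentinel ../sentinel.conf
    • redis-server ../sentinel.conf --sentinel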
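12.2 HA steps
  • 1. Decide subjectively whether the master is down
  • 2. Decide objectively that the master is down
  • 3. The Sentinels elect the one that will carry out the failover (several Sentinels monitor the same master)
  • 4. Failover
    • Select the new master
    • Change the replication target of the slaves
    • Turn the old master into a slave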
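12.3 Subjective down detection
1. By default Sentinel sends a PING once per second to check whether the related nodes are online
  • This covers the master, the slaves of that master, and the other Sentinels
  • A reply of +PONG, -LOADING, or -MASTERDOWN means the node is online; anything else means it is not
2. If no valid PING reply arrives within a certain period, the node is considered subjectively down
  • The period is set with sentinel down-after-milliseconds master 50000, in milliseconds
  • This setting affects not only the master but also all slaves of that master and the other Sentinels monitoring the same master
    • For example, the master is 1.1, this Sentinel is 1.2, the slaves 1.3 and 1.4 both point to 1.1, and another Sentinel 1.5 also monitors 1.1. If Sentinel 1.2 is configured with 10000 milliseconds, then 1.2 uses 10000 milliseconds to judge 1.1, 1.3, 1.4, and 1.5 subjectively down
  • Different Sentinels may use different values for this setting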
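12.4 Objective down detection
When a sufficient number of Sentinels, counting this one, also judge the master to be down, this Sentinel considers the master objectively down
  • The number is the num argument of sentinel monitor master ip port num
Sentinels open connections to one another and send commands to learn the other Sentinels' verdicts
  • They send sentinel is-master-down-by-addr <ip> <port> <current_epoch> <runid>
    • current_epoch: the configuration epoch, which can be understood as an election round counter
    • runid: the instance ID of a Sentinel. It may be *, which means the request only asks whether the master is down; a concrete ID means the request is a vote request for the leader Sentinel
    • ip: the IP address of the master that the Sentinel judged subjectively down
    • port: the port of that master
  • A Sentinel that receives the command replies with three values
    • down_state: 1 means the master is down, 0 means it is not
    • leader_runid: * means the reply only concerns whether the master is down; a concrete value is the run ID of the local leader Sentinel
    • leader_epoch: when leader_runid is a concrete run ID, this is that instance's configuration epoch, similar to a configuration version; when leader_runid is *, it is 0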
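
The two checks from 12.3 and 12.4 can be put side by side in a short Python sketch. It is a simplified model rather than Sentinel's implementation: the function names are invented for this example, peer_down_states stands for the down_state values returned by the other Sentinels, and the Sentinel's own verdict is counted toward the quorum, which matches the "#quorum 1/1" lines in the Sentinel logs of section 13.

```python
from typing import List


def is_subjectively_down(now_ms: int, last_valid_reply_ms: int,
                         down_after_ms: int) -> bool:
    """sdown: no valid PING reply for longer than down-after-milliseconds."""
    return now_ms - last_valid_reply_ms > down_after_ms


def is_objectively_down(own_sdown: bool, peer_down_states: List[int],
                        quorum: int) -> bool:
    """odown: this Sentinel plus enough peers report the master as down."""
    if not own_sdown:
        return False
    agreeing = 1 + sum(1 for state in peer_down_states if state == 1)
    return agreeing >= quorum


# Last valid reply 60 s ago with down-after-milliseconds 50000.
sdown = is_subjectively_down(now_ms=60_000, last_valid_reply_ms=0, down_after_ms=50_000)
# Quorum 2: one of the two peers agrees.
print(sdown, is_objectively_down(sdown, [1, 0], quorum=2))
```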
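12.5 Electing the leader Sentinel
  • Any Sentinel that finds the master objectively down can start an election
  • A Sentinel can vote only once per election, first come, first served
  • After each round of voting, whether it succeeded or not, the voting round (the epoch) is incremented by one
  • A Sentinel that receives more than half of the votes becomes the leader Sentinel and carries out the failover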
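12.6 Election example

Scenario: three Sentinels numbered 1, 2, and 3; the master's IP is 192.168.1.110 and its port is 6379
Steps:

  • Sentinel 1 is the first to judge the master subjectively down
  • Sentinel 1 sends sentinel is-master-down-by-addr 192.168.1.110 6379 1 * to Sentinels 2 and 3
  • With their replies, Sentinel 1 meets the condition for judging the master objectively down
  • Sentinel 1 starts an election by sending sentinel is-master-down-by-addr 192.168.1.110 6379 1 ab12cd34 (its own instance ID) to Sentinels 2 and 3
  • Sentinel 2 receives the request from Sentinel 1 first, so it votes for Sentinel 1. Its reply contains 1, ab12cd34, 1, meaning that the master is down, the elected Sentinel's instance ID is ab12cd34, and the election epoch is 1
  • On receiving Sentinel 2's reply, Sentinel 1 finds that it holds more than half of the votes, so it becomes the leader and carries out the failover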
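
The voting rules from 12.5 and the three-Sentinel scenario above can be simulated in a few lines of Python. This is a toy model that only assumes what the text states: one vote per Sentinel per epoch, first come, first served, and a majority needed to win. The class Sentinel, the function run_election, and the run IDs other than ab12cd34 are hypothetical.

```python
from typing import Dict, List


class Sentinel:
    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.votes: Dict[int, str] = {}  # epoch -> run ID this Sentinel voted for

    def request_vote(self, epoch: int, candidate_id: str) -> str:
        """Grant the vote to the first candidate seen in an epoch and keep it."""
        return self.votes.setdefault(epoch, candidate_id)


def run_election(candidate: Sentinel, peers: List[Sentinel], epoch: int) -> bool:
    """Return True if the candidate collects more than half of all votes."""
    voters = [candidate] + peers  # the candidate votes for itself first
    granted = sum(
        1 for s in voters
        if s.request_vote(epoch, candidate.run_id) == candidate.run_id
    )
    return granted > len(voters) // 2


s1, s2, s3 = Sentinel("ab12cd34"), Sentinel("sentinel-2"), Sentinel("sentinel-3")
elected = run_election(s1, [s2, s3], epoch=1)
late_attempt = run_election(s2, [s1, s3], epoch=1)  # everyone already voted in epoch 1
print(elected, late_attempt)
```

A Sentinel that loses in one epoch can only try again in a later epoch, when every Sentinel has a fresh vote to give.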
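12.7 Failover
1. Select the new master
  • Remove slaves of the master that are themselves down
  • Remove slaves that have not replied to Sentinel's info command within the last 5 seconds
  • Remove slaves that have been disconnected from the master for more than down-after-milliseconds*10 milliseconds
  • Sort the remaining slaves by priority; the higher the priority (the lower the slave-priority value), the more likely a slave is to be chosen
  • With equal priority, sort by replication offset; a larger offset means newer data
  • Send slaveof no one to the chosen slave to change its role
  • Send the info command once per second; when the reply contains role:master, the promotion has succeeded
2. Change the replication target of the slaves
  • Send the slaveof command to the other slaves
3. Turn the old master into a slave
  • The old master is offline, so nothing is done to it for now. Sentinel records in its internal state that the old master is now a slave and sends it the slaveof command once it reconnects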
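
The selection rules in step 1 can be sketched in Python as a filter followed by a sort. This is a simplified model of the listed rules only: the Replica fields and the function select_new_master are names invented for this example, and a lower priority value is treated as the better one, in line with how slave-priority is interpreted by Redis.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    is_down: bool
    ms_since_info_reply: int          # time since the last reply to info
    ms_disconnected_from_master: int  # how long the link to the master was down
    priority: int                     # slave-priority value (lower is preferred)
    repl_offset: int                  # replication offset (larger is newer)


def select_new_master(replicas: List[Replica], down_after_ms: int) -> Optional[Replica]:
    candidates = [
        r for r in replicas
        if not r.is_down
        and r.ms_since_info_reply <= 5000
        and r.ms_disconnected_from_master <= down_after_ms * 10
    ]
    if not candidates:
        return None
    # Best priority first, then the largest replication offset.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset))


replicas = [
    Replica("a", False, 1000, 0, 100, 500),
    Replica("b", False, 1000, 0, 100, 900),
    Replica("c", True, 1000, 0, 1, 999),       # down, filtered out
    Replica("d", False, 9000, 0, 1, 999),      # stale info reply, filtered out
    Replica("e", False, 1000, 600000, 1, 999), # disconnected too long, filtered out
]
chosen = select_new_master(replicas, down_after_ms=50000)
print(chosen.name if chosen else None)
```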

13. Sentinel

13.1
  • 1) Current master-slave setup
127.0.0.1:6379> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=172.16.10.141,port=6379,state=online,offset=5266,lag=0
master_replid:1e1b4acf86e7882c044eb952136e04e5a70b077b
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:5266
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:5266
  • 2) Configure Sentinel on both nodes
vi /usr/local/redis/etc/sentinel.conf
dir "/usr/local/redis/work"
logfile "/usr/local/redis/sentinel.log"
daemonize yes
protected-mode no
sentinel monitor mymaster 172.16.3.140 6379 1
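# the name mymaster above is arbitrary, but the monitor line must come before any line that references the name; otherwise Sentinel reports that the name cannot be found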
sentinel auth-pass mymaster foobared
  • 3) Start Sentinel monitoring
redis-sentinel /usr/local/redis/etc/sentinel.conf
25401:X 20 Nov 15:30:06.428 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
25401:X 20 Nov 15:30:06.428 # Redis version=4.0.8, bits=64, commit=00000000, modified=0, pid=25401, just started
25401:X 20 Nov 15:30:06.428 # Configuration loaded
25402:X 20 Nov 15:30:06.430 * Increased maximum number of open files to 10032 (it was originally set to 1024).
25402:X 20 Nov 15:30:06.431 * Running mode=sentinel, port=26379.
25402:X 20 Nov 15:30:06.431 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
25402:X 20 Nov 15:30:06.432 # Sentinel ID is 20c0b9c989a852c87c59d913cd1c17c5b7bc2414
25402:X 20 Nov 15:30:06.432 # +monitor master mymaster 172.16.10.140 6379 quorum 1
25402:X 20 Nov 15:30:06.433 * +slave slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
25402:X 20 Nov 15:30:06.902 * +sentinel sentinel ff661bc57580186ec6bd2c5162925381e0eef451 172.16.10.141 26379 @ mymaster 172.16.10.140 6379

5778:X 20 Nov 15:30:03.530 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
5778:X 20 Nov 15:30:03.530 # Redis version=4.0.8, bits=64, commit=00000000, modified=0, pid=5778, just started
5778:X 20 Nov 15:30:03.530 # Configuration loaded
5779:X 20 Nov 15:30:03.532 * Increased maximum number of open files to 10032 (it was originally set to 1024).
5779:X 20 Nov 15:30:03.533 * Running mode=sentinel, port=26379.
5779:X 20 Nov 15:30:03.534 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
5779:X 20 Nov 15:30:03.535 # Sentinel ID is ff661bc57580186ec6bd2c5162925381e0eef451
5779:X 20 Nov 15:30:03.535 # +monitor master mymaster 172.16.10.140 6379 quorum 1
5779:X 20 Nov 15:30:03.537 * +slave slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:30:09.198 * +sentinel sentinel 20c0b9c989a852c87c59d913cd1c17c5b7bc2414 172.16.10.140 26379 @ mymaster 172.16.10.140 6379
  • 4) Shut down the master 172.16.10.140
# after a short while the slave becomes the master
5779:X 20 Nov 15:32:42.152 # +sdown master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.152 # +odown master mymaster 172.16.10.140 6379 #quorum 1/1
5779:X 20 Nov 15:32:42.152 # +new-epoch 1
5779:X 20 Nov 15:32:42.152 # +try-failover master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.153 # +vote-for-leader ff661bc57580186ec6bd2c5162925381e0eef451 1
5779:X 20 Nov 15:32:42.155 # 20c0b9c989a852c87c59d913cd1c17c5b7bc2414 voted for ff661bc57580186ec6bd2c5162925381e0eef451 1
5779:X 20 Nov 15:32:42.253 # +elected-leader master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.253 # +failover-state-select-slave master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.308 # +selected-slave slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.308 * +failover-state-send-slaveof-noone slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.366 * +failover-state-wait-promotion slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:43.095 # +promoted-slave slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:43.095 # +failover-state-reconf-slaves master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:43.147 # +failover-end master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:43.147 # +switch-master mymaster 172.16.10.140 6379 172.16.10.141 6379
5779:X 20 Nov 15:32:43.147 * +slave slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
5779:X 20 Nov 15:33:13.204 # +sdown slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379

# Sentinel log on the original master host
25402:X 20 Nov 15:32:41.451 # +new-epoch 1
25402:X 20 Nov 15:32:41.452 # +vote-for-leader ff661bc57580186ec6bd2c5162925381e0eef451 1
25402:X 20 Nov 15:32:41.459 # +sdown master mymaster 172.16.10.140 6379
25402:X 20 Nov 15:32:41.459 # +odown master mymaster 172.16.10.140 6379 #quorum 1/1
25402:X 20 Nov 15:32:41.459 # Next failover delay: I will not start a failover before Tue Nov 20 15:38:42 2018
25402:X 20 Nov 15:32:42.445 # +config-update-from sentinel ff661bc57580186ec6bd2c5162925381e0eef451 172.16.10.141 26379 @ mymaster 172.16.10.140 6379
25402:X 20 Nov 15:32:42.445 # +switch-master mymaster 172.16.10.140 6379 172.16.10.141 6379
25402:X 20 Nov 15:32:42.446 * +slave slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
25402:X 20 Nov 15:33:12.473 # +sdown slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
  • 5) The master role has moved to one of the slaves
  • 6) Redis log on the new master
1705:S 20 Nov 15:32:41.912 * MASTER <-> SLAVE sync started
1705:S 20 Nov 15:32:41.912 # Error condition on socket for SYNC: Connection refused
1705:M 20 Nov 15:32:42.366 # Setting secondary replication ID to 1e1b4acf86e7882c044eb952136e04e5a70b077b, valid up to offset: 22832. New replication ID is 6e7a0afb3aa5dbfc2c5b6c4f78afe8a9f0d0035c
1705:M 20 Nov 15:32:42.366 * Discarding previously cached master state.
1705:M 20 Nov 15:32:42.366 * MASTER MODE enabled (user request from 'id=60 addr=172.16.10.141:55503 fd=11 name=sentinel-ff661bc5-cmd age=159 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
1705:M 20 Nov 15:32:42.367 # CONFIG REWRITE executed with success.
  • 7) Start the failed master again
# original master host
25402:X 20 Nov 15:50:25.990 # -sdown slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
25402:X 20 Nov 15:50:35.967 * +convert-to-slave slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
# new master host
5779:X 20 Nov 15:50:26.744 # -sdown slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
  • 8) sentinel.conf is rewritten automatically
dir "/usr/local/redis/work"
logfile "/usr/local/redis/work/sentinel.log"
daemonize yes
protected-mode no
sentinel myid 20c0b9c989a852c87c59d913cd1c17c5b7bc2414
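# the name mymaster above is arbitrary, but the monitor line must come before any line that references the name; otherwise Sentinel reports that the name cannot be found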
sentinel monitor mymaster 172.16.10.141 6379 1
# Generated by CONFIG REWRITE
port 26379
sentinel auth-pass mymaster foobared
sentinel config-epoch mymaster 1
sentinel leader-epoch mymaster 1
sentinel known-slave mymaster 172.16.10.140 6379
sentinel known-sentinel mymaster 172.16.10.141 26379 ff661bc57580186ec6bd2c5162925381e0eef451
sentinel current-epoch 1

dir "/usr/local/redis/work"
logfile "/usr/local/redis/work/sentinel.log"
daemonize yes
protected-mode no
sentinel myid ff661bc57580186ec6bd2c5162925381e0eef451
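# the name mymaster above is arbitrary, but the monitor line must come before any line that references the name; otherwise Sentinel reports that the name cannot be found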
sentinel monitor mymaster 172.16.10.141 6379 1
# Generated by CONFIG REWRITE
port 26379
sentinel auth-pass mymaster foobared
sentinel config-epoch mymaster 1
sentinel leader-epoch mymaster 1
sentinel known-slave mymaster 172.16.10.140 6379
sentinel known-sentinel mymaster 172.16.10.140 26379 20c0b9c989a852c87c59d913cd1c17c5b7bc2414
sentinel current-epoch 1
  • 9) Note
21239:X 29 Mar 16:43:12.722 # +try-failover master mymaster 172.16.3.140 6379
21239:X 29 Mar 16:43:12.724 # +vote-for-leader 863c1c8c627415dbc3004deb529d27df2299c2df 95
21239:X 29 Mar 16:43:23.438 # -failover-abort-not-elected master mymaster 172.16.3.140 6379
21239:X 29 Mar 16:43:23.497 # Next failover delay: I will not start a failover before Thu Mar 29 16:49:13 2018

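If failover does not happen after the master is stopped, as in the log above, either of the two fixes below can be used; I used the first one.

1) If the Redis instances are not configured with
protected-mode yes
bind 192.168.98.136

then add the following to the Sentinel configuration file:
protected-mode no

2) If the Redis instances are configured with
protected-mode yes
bind 192.168.98.136

then add the following to the Sentinel configuration file:
protected-mode yes
bind 192.168.98.136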
posted @ 2018-09-11 13:39  Jenvid