Redis Master-Slave Replication and Sentinel 02

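1. Why replicate

1. Keep multiple copies of the data, so the service can be made highly available
2. Improve read performance by spreading read requests across replicas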

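2. Key points and difficulties of replication

1. How to specify the source to replicate from
2. Full or incremental, and how to implement incremental replication
3. Replication must not disrupt client-facing operations
4. What to do after the network is interrupted
5. How to make sure data that has been sent is not lost before it reaches the slave
6. How to detect that the replication source has changed, which would otherwise corrupt the data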

3. Replication steps

graph LR
FullSync[Full sync] -- Incremental sync --> Propagation[Command propagation]

3.1 Specify the master

1. Set slaveof in the configuration file
2. Run the slaveof command on the slave node

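3.2 Establish the socket connection

Based on the slaveof setting in the configuration file or the slaveof command, the slave creates a socket connected to the master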

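3.3 Send the PING command (sent once the connection is created)

1. The PING command checks that the socket can be read and written normally
2. It also checks that the master can process command requests normally
3. If the slave does not receive a reply to PING within the timeout, the network is considered faulty, and the slave closes the socket and reconnects
4. If the slave receives an error from the master, such as BUSY Redis is busy running a script. You can..., the slave disconnects and reconnects
5. If the slave receives PONG, everything is normal and the next step can proceed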

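3.4 Authentication

1. If the slave has the masterauth option set, it authenticates; otherwise it does not
2. Authentication is done by sending the AUTH command to the master: auth passwd
3. If the master has not set requirepass, it replies with the error "no password is set"
4. If the master's password differs from the slave's, the error "invalid password" is returned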

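3.5 Send port information

1. The slave runs REPLCONF listening-port <port-number> to tell the master which port the slave listens on for commands
2. The port number lets the master show the slave's port in the output of the info command; in other words, the slave proactively tells the master its listening port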
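
As a rough illustration of steps 3.3 to 3.5, the sketch below encodes the commands a slave sends during the handshake (PING, an optional AUTH, then REPLCONF listening-port) as RESP arrays of bulk strings, the format Redis clients use on the wire. The helper names encode_command and handshake_commands, the port 6380 and the password are assumptions made for this example; none of this is taken from the Redis source.

```python
from typing import List, Optional


def encode_command(*args: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    parts = [b"*" + str(len(args)).encode() + b"\r\n"]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def handshake_commands(listening_port: int,
                       masterauth: Optional[str] = None) -> List[bytes]:
    """Return the handshake commands in the order described in 3.3 to 3.5."""
    commands = [encode_command("PING")]
    if masterauth is not None:
        # Only sent when the slave has masterauth configured (step 3.4).
        commands.append(encode_command("AUTH", masterauth))
    commands.append(
        encode_command("REPLCONF", "listening-port", str(listening_port))
    )
    return commands


if __name__ == "__main__":
    for raw in handshake_commands(6380, masterauth="foobared"):
        print(raw)
```

The sketch only builds the bytes; a real slave also has to read and check each reply before sending the next command.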

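3.6 Synchronization

The master and the slave become clients of each other and can send commands and replies to each other

3.7 Command propagation

After the master executes a write command, it propagates the command to the slave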

4. Synchronization process record

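5. Configuration notes

slaveof <masterip> <masterport>
# Specifies the data source to replicate from
masterauth <master-password>
# Password for authenticating with the data source
slave-serve-stale-data yes
# yes: while the link to the master is down or a sync is in progress, the slave still serves client requests; the drawback is that clients may read stale data
# no: the slave rejects client requests and returns the error "SYNC with master in progress"
slave-read-only yes
# Whether the slave is read-only; if it is writable, its data may diverge from the master's
repl-timeout 60
# Replication timeout in seconds; it covers:
# bulk data transfer to the slave during SYNC, which can time out when the data set is large
# master timeout as seen by the slave (no data and no ping received)
# slave timeout as seen by the master (no REPLCONF ACK pings received)
repl-disable-tcp-nodelay no
# yes: Redis uses fewer TCP packets and less bandwidth to send data to slaves by packing several pieces of data into one packet, at the cost of added delay, up to 40 ms with the default Linux kernel configuration
# no: packets are used less efficiently, but latency is lower
repl-backlog-size 1mb
# Fixed-size backlog buffer on the master; it decides whether a slave needs a full sync after a network interruption: if the data the slave is missing is still in the backlog, an incremental sync is done; if it has timed out or been evicted from the backlog, a full sync happens
repl-backlog-ttl 3600
# Once the master has had no connected slaves for this many seconds, the backlog is released
slave-priority 100
# Priority of the slave; when the master stops working, Redis Sentinel uses it to pick a slave to promote. A lower value means a higher priority, and 0 means the slave is never promoted
# The master refuses client writes when fewer than min-slaves-to-write slaves are connected with a lag of at most min-slaves-max-lag seconds
min-slaves-to-write 3
# Minimum number of slaves that must be online; the default is 0, which disables the feature
min-slaves-max-lag 10
# Maximum lag in seconds; a slave lagging more than this does not count, and once too few slaves remain the master stops accepting writes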
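
The last two settings work together as a gate on writes. A minimal Python sketch of that gate follows, assuming the master knows, for every connected slave, how many seconds have passed since its last heartbeat. The function name master_accepts_writes and the sample lag values are hypothetical; this is a reading of the two settings, not the Redis implementation.

```python
from typing import List


def master_accepts_writes(slave_lags: List[int],
                          min_slaves_to_write: int,
                          min_slaves_max_lag: int) -> bool:
    """Decide whether the master should accept a write.

    slave_lags holds, per connected slave, the seconds since its last
    heartbeat was received (the lag field shown by the info command).
    """
    if min_slaves_to_write == 0 or min_slaves_max_lag == 0:
        return True  # feature disabled
    # A slave only counts when its lag is within the allowed maximum.
    good_slaves = sum(1 for lag in slave_lags if lag <= min_slaves_max_lag)
    return good_slaves >= min_slaves_to_write


if __name__ == "__main__":
    # Three slaves, one of them lagging 15 seconds.
    print(master_accepts_writes([0, 1, 15], 3, 10))
```

With min-slaves-to-write 3 and min-slaves-max-lag 10, one slow or disconnected slave out of three is enough to make the master refuse writes, so the values should be chosen with the number of slaves in mind.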

6. Synchronization flow

7. Full synchronization

7.1 Running slaveof on the slave

415:S 20 Nov 14:17:17.330 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
415:S 20 Nov 14:17:17.331 * SLAVE OF 172.16.10.140:6379 enabled (user request from 'id=4 addr=127.0.0.1:55027 fd=11 name= age=198 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=slaveof')
415:S 20 Nov 14:17:17.586 * Connecting to MASTER 172.16.10.140:6379
415:S 20 Nov 14:17:17.586 * MASTER <-> SLAVE sync started
415:S 20 Nov 14:17:17.586 * Non blocking connect for SYNC fired the event.
415:S 20 Nov 14:17:17.587 * Master replied to PING, replication can continue...
415:S 20 Nov 14:17:17.587 * Trying a partial resynchronization (request 572caecf4c0bf264880b2e3899a3dae52e7704e9:1).
415:S 20 Nov 14:17:17.592 * Full resync from master: 030a3c44c4f64eb9a02c3b36f3891226fc2074fe:0
415:S 20 Nov 14:17:17.592 * Discarding previously cached master state.
415:S 20 Nov 14:17:17.681 * MASTER <-> SLAVE sync: receiving 201 bytes from master
415:S 20 Nov 14:17:17.698 * MASTER <-> SLAVE sync: Flushing old data
415:S 20 Nov 14:17:19.605 * MASTER <-> SLAVE sync: Loading DB in memory
415:S 20 Nov 14:17:19.605 * MASTER <-> SLAVE sync: Finished with success
415:S 20 Nov 14:17:19.606 * Background append only file rewriting started by pid 687
415:S 20 Nov 14:17:19.631 * AOF rewrite child asks to stop sending diffs.
687:C 20 Nov 14:17:19.631 * Parent agreed to stop sending diffs. Finalizing AOF...
687:C 20 Nov 14:17:19.631 * Concatenating 0.00 MB of AOF diff received from parent.
687:C 20 Nov 14:17:19.632 * SYNC append only file rewrite performed
687:C 20 Nov 14:17:19.632 * AOF rewrite: 2 MB of memory used by copy-on-write
415:S 20 Nov 14:17:19.707 * Background AOF rewrite terminated with success
415:S 20 Nov 14:17:19.707 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
415:S 20 Nov 14:17:19.707 * Background AOF rewrite finished successfully

7.2 Master log

10110:M 20 Nov 14:17:16.884 * Slave 172.16.10.141:6379 asks for synchronization
10110:M 20 Nov 14:17:16.884 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '572caecf4c0bf264880b2e3899a3dae52e7704e9', my replication IDs are '14f0fbdd33f13d8e6d07c13bb0a184ba7a43c258' and 'ff98eda832c57bef003947b34ae024063689ca44')
10110:M 20 Nov 14:17:16.885 * Starting BGSAVE for SYNC with target: disk
10110:M 20 Nov 14:17:16.888 * Background saving started by pid 11565
11565:C 20 Nov 14:17:16.891 * DB saved on disk
11565:C 20 Nov 14:17:16.891 * RDB: 6 MB of memory used by copy-on-write
10110:M 20 Nov 14:17:16.978 * Background saving terminated with success
10110:M 20 Nov 14:17:16.978 * Synchronization with slave 172.16.10.141:6379 succeeded

7.3 Master shutdown

# Master log
12519:C 20 Nov 14:22:10.243 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
12519:C 20 Nov 14:22:10.243 # Redis version=4.0.8, bits=64, commit=00000000, modified=0, pid=12519, just started
12519:C 20 Nov 14:22:10.243 # Configuration loaded
12520:M 20 Nov 14:22:10.245 * Increased maximum number of open files to 10032 (it was originally set to 1024).
12520:M 20 Nov 14:22:10.245 # Creating Server TCP listening socket *:6379: bind: Address already in use
10110:M 20 Nov 14:23:36.032 # User requested shutdown...
10110:M 20 Nov 14:23:36.032 * Calling fsync() on the AOF file.
10110:M 20 Nov 14:23:36.032 * Removing the pid file.
10110:M 20 Nov 14:23:36.032 # Redis is now ready to exit, bye bye...

# Slave log
415:S 20 Nov 14:23:36.736 # Connection with master lost.
415:S 20 Nov 14:23:36.736 * Caching the disconnected master state.
415:S 20 Nov 14:23:37.456 * Connecting to MASTER 172.16.10.140:6379
415:S 20 Nov 14:23:37.456 * MASTER <-> SLAVE sync started
415:S 20 Nov 14:23:37.456 # Error condition on socket for SYNC: Connection refused
415:S 20 Nov 14:23:38.458 * Connecting to MASTER 172.16.10.140:6379
415:S 20 Nov 14:23:38.459 * MASTER <-> SLAVE sync started
415:S 20 Nov 14:23:38.459 # Error condition on socket for SYNC: Connection refused
415:S 20 Nov 14:23:39.462 * Connecting to MASTER 172.16.10.140:6379

7.4 Master startup

# Slave log
415:S 20 Nov 14:24:39.625 # Error condition on socket for SYNC: Connection refused
415:S 20 Nov 14:24:40.626 * Connecting to MASTER 172.16.10.140:6379
415:S 20 Nov 14:24:40.626 * MASTER <-> SLAVE sync started
415:S 20 Nov 14:24:40.627 * Non blocking connect for SYNC fired the event.
415:S 20 Nov 14:24:40.627 * Master replied to PING, replication can continue...
415:S 20 Nov 14:24:40.628 * Trying a partial resynchronization (request 030a3c44c4f64eb9a02c3b36f3891226fc2074fe:702).
415:S 20 Nov 14:24:40.629 * Full resync from master: 1e1b4acf86e7882c044eb952136e04e5a70b077b:0
415:S 20 Nov 14:24:40.629 * Discarding previously cached master state.
415:S 20 Nov 14:24:40.712 * MASTER <-> SLAVE sync: receiving 216 bytes from master
415:S 20 Nov 14:24:40.712 * MASTER <-> SLAVE sync: Flushing old data
415:S 20 Nov 14:24:40.712 * MASTER <-> SLAVE sync: Loading DB in memory
415:S 20 Nov 14:24:40.712 * MASTER <-> SLAVE sync: Finished with success
415:S 20 Nov 14:24:40.713 * Background append only file rewriting started by pid 1102
415:S 20 Nov 14:24:40.737 * AOF rewrite child asks to stop sending diffs.
1102:C 20 Nov 14:24:40.737 * Parent agreed to stop sending diffs. Finalizing AOF...
1102:C 20 Nov 14:24:40.737 * Concatenating 0.00 MB of AOF diff received from parent.
1102:C 20 Nov 14:24:40.737 * SYNC append only file rewrite performed
1102:C 20 Nov 14:24:40.738 * AOF rewrite: 2 MB of memory used by copy-on-write
415:S 20 Nov 14:24:40.829 * Background AOF rewrite terminated with success
415:S 20 Nov 14:24:40.829 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
415:S 20 Nov 14:24:40.829 * Background AOF rewrite finished successfully

# Master log: the run_id (replication ID) has changed, so a full sync happens
12992:M 20 Nov 14:24:39.924 * Slave 172.16.10.141:6379 asks for synchronization
12992:M 20 Nov 14:24:39.925 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '030a3c44c4f64eb9a02c3b36f3891226fc2074fe', my replication IDs are '510ae8234d41a712b9c60fe63a4cf193fc3a9fe2' and '0000000000000000000000000000000000000000')
12992:M 20 Nov 14:24:39.925 * Starting BGSAVE for SYNC with target: disk
12992:M 20 Nov 14:24:39.925 * Background saving started by pid 13002
13002:C 20 Nov 14:24:39.927 * DB saved on disk
13002:C 20 Nov 14:24:39.927 * RDB: 6 MB of memory used by copy-on-write
12992:M 20 Nov 14:24:40.008 * Background saving terminated with success
12992:M 20 Nov 14:24:40.008 * Synchronization with slave 172.16.10.141:6379 succeeded

8. Incremental replication after a disconnect

Slave restart

8.1 Slave shutdown

# The master records the lost connection
12992:M 20 Nov 14:30:33.092 # Connection with slave 172.16.10.141:6379 lost.

8.2 The master keeps writing data

127.0.0.1:6379> set k11 v11
OK
127.0.0.1:6379> set k22 v22
OK

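8.3 Slave startup: the restarted slave also does a full sync, because it comes back with a new replication ID that the master does not recognize

# Slave log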
1520:S 20 Nov 14:31:55.315 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1520:S 20 Nov 14:31:55.315 * SLAVE OF 172.16.10.140:6379 enabled (user request from 'id=2 addr=127.0.0.1:55195 fd=10 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=slaveof')
1520:S 20 Nov 14:31:55.712 * Connecting to MASTER 172.16.10.140:6379
1520:S 20 Nov 14:31:55.712 * MASTER <-> SLAVE sync started
1520:S 20 Nov 14:31:55.712 * Non blocking connect for SYNC fired the event.
1520:S 20 Nov 14:31:55.712 * Master replied to PING, replication can continue...
1520:S 20 Nov 14:31:55.713 * Trying a partial resynchronization (request 3a389f3b7dc9e3a394e6fdac5b7028e59aa635a8:1).
1520:S 20 Nov 14:31:55.715 * Full resync from master: 1e1b4acf86e7882c044eb952136e04e5a70b077b:575
1520:S 20 Nov 14:31:55.715 * Discarding previously cached master state.
1520:S 20 Nov 14:31:55.784 * MASTER <-> SLAVE sync: receiving 235 bytes from master
1520:S 20 Nov 14:31:55.785 * MASTER <-> SLAVE sync: Flushing old data
1520:S 20 Nov 14:31:55.785 * MASTER <-> SLAVE sync: Loading DB in memory
1520:S 20 Nov 14:31:55.785 * MASTER <-> SLAVE sync: Finished with success
1520:S 20 Nov 14:31:55.786 * Background append only file rewriting started by pid 1533
1520:S 20 Nov 14:31:55.809 * AOF rewrite child asks to stop sending diffs.
1533:C 20 Nov 14:31:55.809 * Parent agreed to stop sending diffs. Finalizing AOF...
1533:C 20 Nov 14:31:55.809 * Concatenating 0.00 MB of AOF diff received from parent.
1533:C 20 Nov 14:31:55.809 * SYNC append only file rewrite performed
1533:C 20 Nov 14:31:55.809 * AOF rewrite: 6 MB of memory used by copy-on-write
1520:S 20 Nov 14:31:55.812 * Background AOF rewrite terminated with success
1520:S 20 Nov 14:31:55.812 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
1520:S 20 Nov 14:31:55.812 * Background AOF rewrite finished successfully

# Master log
12992:M 20 Nov 14:31:55.010 * Slave 172.16.10.141:6379 asks for synchronization
12992:M 20 Nov 14:31:55.010 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for '3a389f3b7dc9e3a394e6fdac5b7028e59aa635a8', my replication IDs are '1e1b4acf86e7882c044eb952136e04e5a70b077b' and '0000000000000000000000000000000000000000')
12992:M 20 Nov 14:31:55.010 * Starting BGSAVE for SYNC with target: disk
12992:M 20 Nov 14:31:55.011 * Background saving started by pid 14369
14369:C 20 Nov 14:31:55.013 * DB saved on disk
14369:C 20 Nov 14:31:55.013 * RDB: 6 MB of memory used by copy-on-write
12992:M 20 Nov 14:31:55.081 * Background saving terminated with success
12992:M 20 Nov 14:31:55.081 * Synchronization with slave 172.16.10.141:6379 succeeded

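Slave loses its network connection and does an incremental sync (the backlog data is still available)

1. After the slave goes offline, the master keeps writing data

# slave
systemctl stop network && sleep 60 && systemctl start network &

2. After the slave comes back online

# Master log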
12992:M 20 Nov 15:17:37.019 # Disconnecting timedout slave: 172.16.10.141:6379
12992:M 20 Nov 15:17:37.019 # Connection with slave 172.16.10.141:6379 lost.
12992:M 20 Nov 15:17:38.092 * Slave 172.16.10.141:6379 asks for synchronization
12992:M 20 Nov 15:17:38.093 * Partial resynchronization request from 172.16.10.141:6379 accepted. Sending 165 bytes of backlog starting from offset 4388.

# Slave log
1705:S 20 Nov 15:17:38.792 # MASTER timeout: no data nor PING received...
1705:S 20 Nov 15:17:38.793 # Connection with master lost.
1705:S 20 Nov 15:17:38.793 * Caching the disconnected master state.
1705:S 20 Nov 15:17:38.793 * Connecting to MASTER 172.16.10.140:6379
1705:S 20 Nov 15:17:38.794 * MASTER <-> SLAVE sync started
1705:S 20 Nov 15:17:38.795 * Non blocking connect for SYNC fired the event.
1705:S 20 Nov 15:17:38.795 * Master replied to PING, replication can continue...
1705:S 20 Nov 15:17:38.795 * Trying a partial resynchronization (request 1e1b4acf86e7882c044eb952136e04e5a70b077b:4388).
1705:S 20 Nov 15:17:38.796 * Successful partial resynchronization with master.
1705:S 20 Nov 15:17:38.796 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

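9. Does restarting the master after stopping it trigger another full sync

Yes: the master's run_id (replication ID) changes on restart, so a full sync happens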
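
The logs in sections 7 and 8 show the two outcomes of a resynchronization request: a full resync when the replication ID does not match, and a partial one when the ID matches and the requested offset is still in the backlog. The Python sketch below models that decision under stated assumptions: MasterState and psync_decision are hypothetical names, the fields mirror what info replication reports, and the backlog length in the usage lines is an illustrative value derived from the "Sending 165 bytes of backlog starting from offset 4388" log line.

```python
from dataclasses import dataclass


@dataclass
class MasterState:
    replid: str                  # master_replid
    replid2: str                 # master_replid2 (previous ID, if any)
    second_replid_offset: int    # offset up to which replid2 is valid
    backlog_first_offset: int    # repl_backlog_first_byte_offset
    backlog_histlen: int         # repl_backlog_histlen


def psync_decision(master: MasterState, slave_replid: str,
                   slave_offset: int) -> str:
    """Return "CONTINUE" for a partial resync, "FULLRESYNC" otherwise."""
    id_matches = slave_replid == master.replid or (
        slave_replid == master.replid2
        and slave_offset <= master.second_replid_offset
    )
    if not id_matches:
        return "FULLRESYNC"  # "Replication ID mismatch" in the master log
    backlog_end = master.backlog_first_offset + master.backlog_histlen
    if master.backlog_first_offset <= slave_offset <= backlog_end:
        return "CONTINUE"    # the missing bytes are still in the backlog
    return "FULLRESYNC"      # the data was already evicted from the backlog


if __name__ == "__main__":
    master = MasterState(
        replid="1e1b4acf86e7882c044eb952136e04e5a70b077b",
        replid2="0" * 40,
        second_replid_offset=-1,
        backlog_first_offset=1,
        backlog_histlen=4552,  # illustrative value
    )
    print(psync_decision(master, master.replid, 4388))
    print(psync_decision(master, "3a389f3b7dc9e3a394e6fdac5b7028e59aa635a8", 1))
```

A restarted master starts with a fresh replication ID, so the first check fails for every slave and each of them falls back to a full sync.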

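10. Heartbeat detection

By default the slave sends a heartbeat command to the master once per second: REPLCONF ACK <replication_offset>

Heartbeats reveal the state of the network; the info command shows a lag field for each slave, the number of seconds since its last heartbeat, which is normally 0 or 1
Each heartbeat carries the slave's current replication offset; if commands sent by the master were lost on the way, this frequent heartbeat exposes the offset mismatch quickly, and the master can resend the missing commands to the slave
Heartbeats also make the min-slaves feature possible: when the master-slave state is unhealthy, the master is not allowed to accept writes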

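11. Problems Redis high availability has to solve

1. Multiple nodes hold the same data
- Replication
2. How a new master is produced after the master goes down
3. How slaves automatically connect to the new master after the master goes down
4. How to determine that the master is down
5. What to do with the old master after it recovers
6. How to monitor the health of all Redis nodes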

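12. What is Sentinel

1. It is part of the Redis program itself
2. Main functions
- 2.1 Monitoring: checks the health of Redis nodes
- 2.2 Notification: reports detected changes to related systems or Redis instances, implemented through Redis publish/subscribe
- 2.3 Automatic failover: when the master goes down, a new master is elected
- 2.4 Configuration management: Redis instances can obtain certain shared information through Sentinel
3. Sentinel itself is distributed, which removes its own single point of failure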

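12.1 Installing and configuring Sentinel

1. Copy the configuration for the slave

port 6380
logfile "/home/liubx/redisdata/slave1/logs/redis.log"
pidfile /var/run/redis.pid
# The pidfile path must not be the same as the master's
dir /home/liubx/redisdata/slave1
slaveof localhost 6379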

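2. Sentinel configuration

The Redis installation directory contains a configuration file named sentinel.conf
daemonize yes
logfile "/home/liubx/sentinel/sentinel.log"
sentinel monitor mymaster 127.0.0.1 6379 1
# master name, IP, port, quorum (number of votes)
# One sentinel can monitor multiple masters

3. Start Sentinel (either command works)
- redis-sentinel ../sentinel.conf
- redis-server ../sentinel.conf --sentinel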

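12.2 HA steps

1. Subjectively judge whether the master is down
2. Objectively judge that the master is down
3. The sentinels elect the one that will carry out the failover (several sentinels jointly monitor the master)
4. Failover
- Select the new master
- Change the replication target of the slaves
- Turn the old master into a slave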

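12.3 Subjective down

1. By default a PING command is sent once per second to check whether the related nodes are online

The targets include the master, the slaves of that master, and the other sentinels
A reply of +PONG, -LOADING, or -MASTERDOWN means the node is online; anything else means it is not

2. If the PING replies are invalid for a certain period of time, the node is marked subjectively down

The period is set with sentinel down-after-milliseconds master 50000, in milliseconds
This setting affects not only the master but also all slaves of that master and the other sentinels monitoring the same master
- For example, the master's IP is 1.1 and this sentinel's IP is 1.2; slaves 1.3 and 1.4 both point to master 1.1; another sentinel at 1.5 also monitors 1.1. If sentinel 1.2 is configured with 10000 milliseconds, then 1.2 uses 10000 milliseconds to judge 1.1, 1.3, 1.4, and 1.5 subjectively down
Different sentinels may use different values for this setting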
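
The subjective-down rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the sentinel remembers the time of the last valid PING reply per node, and the function names and millisecond timestamps are made up for the example.

```python
VALID_PING_REPLIES = ("+PONG", "-LOADING", "-MASTERDOWN")


def is_valid_ping_reply(reply: str) -> bool:
    """Only the three replies listed above count as proof of life."""
    return reply.startswith(VALID_PING_REPLIES)


def is_subjectively_down(last_valid_reply_ms: int, now_ms: int,
                         down_after_ms: int) -> bool:
    """True when no valid reply has arrived for longer than
    down-after-milliseconds."""
    return now_ms - last_valid_reply_ms > down_after_ms


if __name__ == "__main__":
    # Sentinel 1.2 from the example above, configured with 10000 ms.
    print(is_subjectively_down(last_valid_reply_ms=0, now_ms=12000,
                               down_after_ms=10000))
```

Because the check is local to one sentinel, two sentinels with different down-after-milliseconds values can disagree about the same node for a while.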

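12.4 Objective down

When enough sentinels judge the master to be down (the sentinel's own judgment counts, as the #quorum 1/1 line in the failover log in section 13 shows), this sentinel considers the master objectively down

The required number is the num argument of sentinel monitor master ip port num

Sentinels open connections to each other and send a command to learn the other sentinels' judgment

They send sentinel is-master-down-by-addr <ip> <port> <current_epoch> <runid>
- current_epoch is the configuration epoch, which can also be understood as an election round counter
- runid is the sentinel's instance ID; it can be *, which means the command only asks whether the master is down; a concrete ID means the command requests a vote for the leader sentinel
- ip is the IP address of the master that the sentinel has judged subjectively down
- port is the port of the master judged to be down
A sentinel that receives this command replies with the following three values
- down_state: 1 means the master is down, 0 means it is not
- leader_runid: * means the reply only concerns whether the master is down; a concrete value is the run ID of the local leader sentinel
- leader_epoch: when the previous field is a concrete run ID, this is that instance's configuration epoch, similar to a configuration version; when the previous field is *, this field is 0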
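
The objective-down check then amounts to counting down_state values against the quorum. The sketch below is a hedged illustration: the function name is_objectively_down is hypothetical, replies are modelled as (down_state, leader_runid, leader_epoch) tuples as described above, and counting the asking sentinel's own judgment is an assumption based on the #quorum 1/1 line in the failover log later in this post.

```python
from typing import Iterable, Tuple

# (down_state, leader_runid, leader_epoch), as returned by other sentinels
Reply = Tuple[int, str, int]


def is_objectively_down(self_sdown: bool, replies: Iterable[Reply],
                        quorum: int) -> bool:
    """Count agreeing sentinels and compare the total with the quorum."""
    if not self_sdown:
        return False  # a sentinel only asks others after its own sdown
    # Assumption: the asking sentinel's own judgment counts as one vote.
    votes = 1 + sum(1 for down_state, _, _ in replies if down_state == 1)
    return votes >= quorum


if __name__ == "__main__":
    replies = [(1, "*", 0), (0, "*", 0)]
    print(is_objectively_down(True, replies, quorum=2))
```

With quorum 1, as in the two-sentinel setup used in this post, a single sentinel's own judgment is already enough to declare the master objectively down.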

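12.5 Electing the leader sentinel

Any sentinel that finds the master objectively down can start an election
A sentinel can vote only once per election, first come first served
After a round of voting finishes, whether or not it succeeded, the voting round, that is the epoch, is incremented by one
A sentinel that receives more than half of the votes becomes the leader sentinel and is responsible for carrying out the failover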

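12.6 Election example

Scenario: three sentinels numbered 1, 2, 3; the master's IP is 192.168.1.110 and its port is 6379
Steps:

Sentinel 1 is the first to judge the master subjectively down
Sentinel 1 sends sentinel is-master-down-by-addr 192.168.1.110 6379 1 * to sentinels 2 and 3
After receiving their replies, sentinel 1 has met the condition for judging the master objectively down
Sentinel 1 starts an election and sends sentinel is-master-down-by-addr 192.168.1.110 6379 1 ab12cd34 (sentinel 1's own instance ID) to sentinels 2 and 3
Sentinel 2 receives the message; because sentinel 1's request is the first one it has received, it votes for 1 and replies with 1, ab12cd34, 1, meaning the master is down, the elected sentinel's instance ID is ab12cd34, and the election epoch is 1
Sentinel 1 receives sentinel 2's reply and sees it now holds more than half of the votes, so it becomes the leader sentinel and carries out the failover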
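
The example above can be replayed with a small Python simulation. It is a simplified model, assuming each sentinel grants one vote per epoch on a first come, first served basis and that a candidate needs more than half of all sentinels. The Sentinel class, run_election, and the run IDs s2 and s3 are hypothetical; only ab12cd34 comes from the example.

```python
from typing import List, Optional


class Sentinel:
    def __init__(self, runid: str) -> None:
        self.runid = runid
        self.voted_epoch = 0                  # last epoch a vote was cast in
        self.voted_for: Optional[str] = None  # who received that vote

    def request_vote(self, candidate_runid: str, epoch: int) -> Optional[str]:
        """Grant the vote to the first candidate seen in a new epoch and
        return the run ID this sentinel voted for."""
        if epoch > self.voted_epoch:
            self.voted_epoch = epoch
            self.voted_for = candidate_runid
        return self.voted_for


def run_election(candidate: Sentinel, others: List[Sentinel],
                 epoch: int) -> bool:
    """Return True when the candidate wins more than half of all votes."""
    votes = 0
    # The candidate votes for itself first, then asks the others.
    for sentinel in [candidate] + others:
        if sentinel.request_vote(candidate.runid, epoch) == candidate.runid:
            votes += 1
    total = len(others) + 1
    return votes > total // 2


if __name__ == "__main__":
    s1, s2, s3 = Sentinel("ab12cd34"), Sentinel("s2"), Sentinel("s3")
    print(run_election(s1, [s2, s3], epoch=1))
```

If sentinel 2 tried to run its own election in the same epoch afterwards, every vote, including its own, would already be taken, so it would have to wait for the next epoch.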

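12.7 Failover

1. Select the new master

Remove slaves of the old master that are in the down state
Remove slaves that have not replied to the sentinel's info command within the last 5 seconds
Remove slaves that have been disconnected from the master for more than down-after-milliseconds*10 milliseconds
Sort the remaining slaves by priority; the higher the priority (the lower the slave-priority value), the more likely a slave is chosen
If priorities are equal, sort by replication offset; a larger offset means newer data
Send slaveof no one to the chosen slave to turn it into a master
Send the info command once per second; when the reply contains role:master, the promotion has succeeded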
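
The filtering and sorting rules above map naturally onto a filter followed by a sort. The Python sketch below is one possible reading of them: SlaveInfo, its field names and select_new_master are hypothetical, and the rule that a slave with priority 0 is never promoted comes from the Redis documentation for slave-priority rather than from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlaveInfo:
    name: str
    is_down: bool
    ms_since_info_reply: int          # time since the last reply to info
    ms_disconnected_from_master: int  # how long the master link was down
    priority: int                     # slave-priority, lower is preferred
    repl_offset: int                  # replication offset


def select_new_master(slaves: List[SlaveInfo],
                      down_after_ms: int) -> Optional[SlaveInfo]:
    """Apply the three filters, then pick by priority and offset."""
    candidates = [
        s for s in slaves
        if not s.is_down
        and s.ms_since_info_reply <= 5000
        and s.ms_disconnected_from_master <= down_after_ms * 10
        and s.priority != 0  # priority 0 means "never promote"
    ]
    if not candidates:
        return None
    # Lowest slave-priority value first, then the largest offset.
    return min(candidates, key=lambda s: (s.priority, -s.repl_offset))


if __name__ == "__main__":
    slaves = [
        SlaveInfo("a", False, 1000, 0, 100, 500),
        SlaveInfo("b", False, 1000, 0, 100, 900),
    ]
    chosen = select_new_master(slaves, down_after_ms=30000)
    print(chosen.name if chosen else None)
```

Returning None when no slave survives the filters models the case where the failover cannot proceed because there is nothing suitable to promote.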

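2. Change the replication target of the slaves

Sending the slaveof command to the other slaves is enough

3. Turn the old master into a slave

Because the old master is already down, nothing is sent to it yet, but the sentinel records in its internal state that the old master is now a slave; when the old master reconnects, the sentinel sends it the slaveof command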

13. Sentinel

13.1

1) Current master-slave setup

127.0.0.1:6379> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=172.16.10.141,port=6379,state=online,offset=5266,lag=0
master_replid:1e1b4acf86e7882c044eb952136e04e5a70b077b
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:5266
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:5266
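
When checking replication health from a script, output like the above can be parsed into a dictionary. The Python sketch below assumes the text has already been fetched (for example with redis-cli) and only handles the key:value layout shown here; the function name parse_info_replication is made up for this example.

```python
from typing import Dict, List


def parse_info_replication(text: str) -> Dict[str, object]:
    """Parse the Replication section of info into a dict.

    Lines such as slave0:ip=...,port=...,offset=...,lag=... are collected
    under the "slaves" key as a list of field dictionaries.
    """
    info: Dict[str, object] = {}
    slaves: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the section header
        key, _, value = line.partition(":")
        if key.startswith("slave") and "=" in value:
            fields = dict(item.split("=", 1) for item in value.split(","))
            slaves.append(fields)
        else:
            info[key] = value
    info["slaves"] = slaves
    return info


if __name__ == "__main__":
    sample = (
        "# Replication\n"
        "role:master\n"
        "connected_slaves:1\n"
        "slave0:ip=172.16.10.141,port=6379,state=online,offset=5266,lag=0\n"
        "master_repl_offset:5266\n"
    )
    parsed = parse_info_replication(sample)
    print(parsed["role"], parsed["slaves"])
```

Comparing master_repl_offset with each slave's offset gives the number of bytes a slave is behind, and lag gives the seconds since its last heartbeat.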

2) Configure Sentinel on both nodes

vi /usr/local/redis/etc/sentinel.conf
dir "/usr/local/redis/work"
logfile "/usr/local/redis/sentinel.log"
daemonize yes
protected-mode no
sentinel monitor mymaster 172.16.10.140 6379 1
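# The name mymaster above is arbitrary, but the monitor line must come before any line that references the name, otherwise Sentinel reports that the name cannot be found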
sentinel auth-pass mymaster foobared

3) Start Sentinel monitoring: redis-sentinel /usr/local/redis/etc/sentinel.conf

25401:X 20 Nov 15:30:06.428 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
25401:X 20 Nov 15:30:06.428 # Redis version=4.0.8, bits=64, commit=00000000, modified=0, pid=25401, just started
25401:X 20 Nov 15:30:06.428 # Configuration loaded
25402:X 20 Nov 15:30:06.430 * Increased maximum number of open files to 10032 (it was originally set to 1024).
25402:X 20 Nov 15:30:06.431 * Running mode=sentinel, port=26379.
25402:X 20 Nov 15:30:06.431 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
25402:X 20 Nov 15:30:06.432 # Sentinel ID is 20c0b9c989a852c87c59d913cd1c17c5b7bc2414
25402:X 20 Nov 15:30:06.432 # +monitor master mymaster 172.16.10.140 6379 quorum 1
25402:X 20 Nov 15:30:06.433 * +slave slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
25402:X 20 Nov 15:30:06.902 * +sentinel sentinel ff661bc57580186ec6bd2c5162925381e0eef451 172.16.10.141 26379 @ mymaster 172.16.10.140 6379

5778:X 20 Nov 15:30:03.530 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
5778:X 20 Nov 15:30:03.530 # Redis version=4.0.8, bits=64, commit=00000000, modified=0, pid=5778, just started
5778:X 20 Nov 15:30:03.530 # Configuration loaded
5779:X 20 Nov 15:30:03.532 * Increased maximum number of open files to 10032 (it was originally set to 1024).
5779:X 20 Nov 15:30:03.533 * Running mode=sentinel, port=26379.
5779:X 20 Nov 15:30:03.534 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
5779:X 20 Nov 15:30:03.535 # Sentinel ID is ff661bc57580186ec6bd2c5162925381e0eef451
5779:X 20 Nov 15:30:03.535 # +monitor master mymaster 172.16.10.140 6379 quorum 1
5779:X 20 Nov 15:30:03.537 * +slave slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:30:09.198 * +sentinel sentinel 20c0b9c989a852c87c59d913cd1c17c5b7bc2414 172.16.10.140 26379 @ mymaster 172.16.10.140 6379

4) Shut down the master 172.16.10.140

# After a while the slave becomes the master
5779:X 20 Nov 15:32:42.152 # +sdown master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.152 # +odown master mymaster 172.16.10.140 6379 #quorum 1/1
5779:X 20 Nov 15:32:42.152 # +new-epoch 1
5779:X 20 Nov 15:32:42.152 # +try-failover master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.153 # +vote-for-leader ff661bc57580186ec6bd2c5162925381e0eef451 1
5779:X 20 Nov 15:32:42.155 # 20c0b9c989a852c87c59d913cd1c17c5b7bc2414 voted for ff661bc57580186ec6bd2c5162925381e0eef451 1
5779:X 20 Nov 15:32:42.253 # +elected-leader master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.253 # +failover-state-select-slave master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.308 # +selected-slave slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.308 * +failover-state-send-slaveof-noone slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:42.366 * +failover-state-wait-promotion slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:43.095 # +promoted-slave slave 172.16.10.141:6379 172.16.10.141 6379 @ mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:43.095 # +failover-state-reconf-slaves master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:43.147 # +failover-end master mymaster 172.16.10.140 6379
5779:X 20 Nov 15:32:43.147 # +switch-master mymaster 172.16.10.140 6379 172.16.10.141 6379
5779:X 20 Nov 15:32:43.147 * +slave slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
5779:X 20 Nov 15:33:13.204 # +sdown slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379

# Log on the original master host
25402:X 20 Nov 15:32:41.451 # +new-epoch 1
25402:X 20 Nov 15:32:41.452 # +vote-for-leader ff661bc57580186ec6bd2c5162925381e0eef451 1
25402:X 20 Nov 15:32:41.459 # +sdown master mymaster 172.16.10.140 6379
25402:X 20 Nov 15:32:41.459 # +odown master mymaster 172.16.10.140 6379 #quorum 1/1
25402:X 20 Nov 15:32:41.459 # Next failover delay: I will not start a failover before Tue Nov 20 15:38:42 2018
25402:X 20 Nov 15:32:42.445 # +config-update-from sentinel ff661bc57580186ec6bd2c5162925381e0eef451 172.16.10.141 26379 @ mymaster 172.16.10.140 6379
25402:X 20 Nov 15:32:42.445 # +switch-master mymaster 172.16.10.140 6379 172.16.10.141 6379
25402:X 20 Nov 15:32:42.446 * +slave slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
25402:X 20 Nov 15:33:12.473 # +sdown slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379

5) The master role has moved to one of the slaves
6) Redis log on the new master

1705:S 20 Nov 15:32:41.912 * MASTER <-> SLAVE sync started
1705:S 20 Nov 15:32:41.912 # Error condition on socket for SYNC: Connection refused
1705:M 20 Nov 15:32:42.366 # Setting secondary replication ID to 1e1b4acf86e7882c044eb952136e04e5a70b077b, valid up to offset: 22832. New replication ID is 6e7a0afb3aa5dbfc2c5b6c4f78afe8a9f0d0035c
1705:M 20 Nov 15:32:42.366 * Discarding previously cached master state.
1705:M 20 Nov 15:32:42.366 * MASTER MODE enabled (user request from 'id=60 addr=172.16.10.141:55503 fd=11 name=sentinel-ff661bc5-cmd age=159 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
1705:M 20 Nov 15:32:42.367 # CONFIG REWRITE executed with success.

7) Start the failed master again

# Original master
25402:X 20 Nov 15:50:25.990 # -sdown slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
25402:X 20 Nov 15:50:35.967 * +convert-to-slave slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379
# New master
5779:X 20 Nov 15:50:26.744 # -sdown slave 172.16.10.140:6379 172.16.10.140 6379 @ mymaster 172.16.10.141 6379

8) sentinel.conf is rewritten automatically

dir "/usr/local/redis/work"
logfile "/usr/local/redis/work/sentinel.log"
daemonize yes
protected-mode no
sentinel myid 20c0b9c989a852c87c59d913cd1c17c5b7bc2414
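# The name mymaster is arbitrary, but the monitor line must come before any line that references the name, otherwise Sentinel reports that the name cannot be found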
sentinel monitor mymaster 172.16.10.141 6379 1
# Generated by CONFIG REWRITE
port 26379
sentinel auth-pass mymaster foobared
sentinel config-epoch mymaster 1
sentinel leader-epoch mymaster 1
sentinel known-slave mymaster 172.16.10.140 6379
sentinel known-sentinel mymaster 172.16.10.141 26379 ff661bc57580186ec6bd2c5162925381e0eef451
sentinel current-epoch 1

dir "/usr/local/redis/work"
logfile "/usr/local/redis/work/sentinel.log"
daemonize yes
protected-mode no
sentinel myid ff661bc57580186ec6bd2c5162925381e0eef451
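# The name mymaster is arbitrary, but the monitor line must come before any line that references the name, otherwise Sentinel reports that the name cannot be found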
sentinel monitor mymaster 172.16.10.141 6379 1
# Generated by CONFIG REWRITE
port 26379
sentinel auth-pass mymaster foobared
sentinel config-epoch mymaster 1
sentinel leader-epoch mymaster 1
sentinel known-slave mymaster 172.16.10.140 6379
sentinel known-sentinel mymaster 172.16.10.140 26379 20c0b9c989a852c87c59d913cd1c17c5b7bc2414
sentinel current-epoch 1

9) Note

21239:X 29 Mar 16:43:12.722 # +try-failover master mymaster 172.16.3.140 6379
21239:X 29 Mar 16:43:12.724 # +vote-for-leader 863c1c8c627415dbc3004deb529d27df2299c2df 95
21239:X 29 Mar 16:43:23.438 # -failover-abort-not-elected master mymaster 172.16.3.140 6379
21239:X 29 Mar 16:43:23.497 # Next failover delay: I will not start a failover before Thu Mar 29 16:49:13 2018

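If failover fails like this after stopping the master, there are two fixes; I used the first one

1) If the Redis instance is not configured with
protected-mode yes
bind 192.168.98.136

then add the following to the Sentinel configuration file
protected-mode no

and that is enough

2) If the Redis instance is configured with
protected-mode yes
bind 192.168.98.136

then add the following to the Sentinel configuration file
protected-mode yes
bind 192.168.98.136

and that is enough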

posted @ 2018-09-11 13:39 Jenvid