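Redis Sentinel High Availability: Implementation Notes

Background:

     An earlier article, "Redis Replication, Sentinel Setup and Principles", covered how Sentinel works, how it is implemented, and how to set it up. This article goes through the Redis Sentinel setup in detail.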

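Installation:

     This section covers building from source. The examples use Ubuntu 16.04 and Redis 3.2.0. The installation steps are:

  • Download the source package: wget http://download.redis.io/releases/redis-3.2.0.tar.gz
  • Install the dependencies: sudo apt-get install gcc tcl
  • Extract and build: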
    #tar zxvf redis-3.2.0.tar.gz
    ...
    ...
    #make
    ...
    Hint: It's a good idea to run 'make test' ;)
    #make test
    ...
    \o/ All tests passed without errors!
    ...
    #make install

    Note: make test may well fail at this step with the following error:

    [err]: Test replication partial resync: ok psync (diskless: yes, reconnect: 1) in tests/integration/replication-psync.tcl

    Expected condition '[s -1 sync_partial_ok] > 0' to be true ([s -1 sync_partial_ok] > 0)

    The likely cause is that this test case times out on low-spec machines; the environment used here is an LXC virtual machine. There are two ways to avoid it:

    1: In the extracted source directory, edit the test file
    # vi tests/integration/replication-psync.tcl
    and change after 100 to after 500
    
    2: Run make test under taskset
    # taskset -c 1 make test

    Redis is now compiled and installed.

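  • The build directory contains two configuration files:
    redis.conf and sentinel.conf. See the earlier article for a description of the options.
  • Test environment architecture:
    three Redis instances (1 master, 2 slaves) and 3 sentinels. M: 10.0.3.110; S: 10.0.3.92 and 10.0.3.66; one sentinel instance runs on each Redis host. Edit the configuration files as follows:
    redis.conf
    # Redis configuration file example.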
    # ./redis-server /path/to/redis.conf
    
    ################################## INCLUDES ###################################
    
    # include /path/to/local.conf
    # include /path/to/other.conf
    
    ################################## NETWORK #####################################
    
    bind 10.0.3.110
    
    protected-mode yes
    
    port 6379
    
    tcp-backlog 511
    
    unixsocket "/tmp/redis.sock"
    unixsocketperm 700
    
    timeout 0
    
    tcp-keepalive 0
    
    ################################# GENERAL #####################################
    
    daemonize yes
    
    pidfile "/var/run/redis6379.pid"
    
    loglevel notice
    
    logfile "/var/log/redis/redis_6379.log"
    
    # syslog-enabled no
    # syslog-ident redis
    # syslog-facility local0
    
    databases 16
    supervised no
    
    ################################ SNAPSHOTTING  ################################
    
    save 900 1
    save 300 10
    save 60 10000
    
    stop-writes-on-bgsave-error yes
    
    rdbcompression yes
    
    rdbchecksum yes
    
    dbfilename "dump_6379.rdb"
    
    dir "/var/lib/redis_6379"
    
    ################################# REPLICATION #################################
    
    # slaveof <masterip> <masterport>
    masterauth "dxydxy"
    
    slave-serve-stale-data yes
    slave-read-only yes
    
    repl-diskless-sync no
    repl-diskless-sync-delay 5
    
    # repl-ping-slave-period 10
    # repl-timeout 60
    
    repl-disable-tcp-nodelay no
    repl-backlog-size 5mb
    repl-backlog-ttl 3600
    
    slave-priority 100
    
    #min-slaves-to-write 3
    #min-slaves-max-lag 10
    
    ################################## SECURITY ###################################
    
    requirepass "dxydxy"
    # rename-command CONFIG b840fc02d524045429941cc15f59e41cb7be6c52
    # rename-command CONFIG ""
    
    ################################### LIMITS ####################################
    
    maxclients 1000
    #maxmemory <bytes>
    maxmemory-policy noeviction
    # maxmemory-samples 5
    
    ############################## APPEND ONLY MODE ###############################
    
    appendonly yes
    
    appendfilename "appendonly_6379.aof"
    
    # appendfsync always
    appendfsync everysec
    # appendfsync no
    
    no-appendfsync-on-rewrite no
    auto-aof-rewrite-percentage 100
    auto-aof-rewrite-min-size 64mb
    aof-load-truncated yes
    
    ################################ LUA SCRIPTING  ###############################
    
    lua-time-limit 5000
    
    ################################ REDIS CLUSTER  ###############################
    
    # cluster-enabled yes
    # cluster-config-file nodes-6379.conf
    # cluster-node-timeout 15000
    # cluster-slave-validity-factor 10
    # cluster-migration-barrier 1
    # cluster-require-full-coverage yes
    
    ################################## SLOW LOG ###################################
    
    slowlog-log-slower-than 10000
    slowlog-max-len 128
    
    ################################ LATENCY MONITOR ##############################
    
    latency-monitor-threshold 0
    
    ############################# EVENT NOTIFICATION ##############################
    
    notify-keyspace-events ""
    
    ############################### ADVANCED CONFIG ###############################
    
    hash-max-ziplist-entries 512
    hash-max-ziplist-value 64
    
    list-max-ziplist-entries 512
    list-max-ziplist-value 64
    
    list-compress-depth 0
    set-max-intset-entries 512
    
    zset-max-ziplist-entries 128
    zset-max-ziplist-value 64
    
    hll-sparse-max-bytes 3000
    
    activerehashing yes
    
    client-output-buffer-limit normal 0 0 0
    client-output-buffer-limit slave 256mb 64mb 60
    client-output-buffer-limit pubsub 32mb 8mb 60
    
    hz 10
    aof-rewrite-incremental-fsync yes
    
    list-max-ziplist-size -2

    sentinel.conf

    port 16379
    
    dir "/var/lib/sentinel_16379"
    
    logfile "/var/log/redis/sentinel_16379.log"
    
    daemonize yes
    
    protected-mode no
    
    sentinel monitor dxy 10.0.3.110 6379 2
    
    sentinel auth-pass dxy dxydxy
    
    sentinel down-after-milliseconds dxy 15000
    
    sentinel failover-timeout dxy 120000
    
    # Custom scripts run after a failover, e.g. to send email or move the VIP
    #sentinel notification-script <master-name> <script-path>
    #sentinel client-reconfig-script <master-name> <script-path>

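    The configuration files are stored under /etc/redis/; create the directories they reference. Unlike the setup in "Redis Replication, Sentinel Setup and Principles", every Redis instance here requires a password (requirepass).
    Note: when a master requires a password, both clients and slaves must supply it when connecting. The master sets its own password with requirepass, and without that password a connection cannot use the master. A slave sets the password it uses to reach the master with masterauth. Clients authenticate with AUTH. With Sentinel, a master may become a slave and a slave may become a master, so both options must be set on every instance. The sentinels also connect to the master and the slaves, so they need: sentinel auth-pass <master_name> xxxxx.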
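
The password rule above can be checked mechanically before starting anything. The following is a minimal Python 3 sketch, not part of the original setup: parse_conf, sentinel_auth_pass and check_auth are hypothetical helper names, and the parser assumes the plain "directive value" layout of the files shown above.

```python
import shlex


def parse_conf(text):
    """Parse 'directive value...' lines into a dict; later lines win.

    Comments and blank lines are skipped. Repeated directives (save,
    sentinel, ...) keep only their last value, which is enough for the
    single-valued directives checked below.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # strips quotes, e.g. "dxydxy" -> dxydxy
        conf[parts[0]] = parts[1:]
    return conf


def sentinel_auth_pass(text, master_name):
    """Return the password from 'sentinel auth-pass <name> <password>', or None."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue
        parts = shlex.split(line)
        if len(parts) == 4 and parts[:3] == ["sentinel", "auth-pass", master_name]:
            return parts[3]
    return None


def check_auth(redis_text, sentinel_text, master_name):
    """List the ways the three password settings disagree (empty list = consistent)."""
    conf = parse_conf(redis_text)
    requirepass = (conf.get("requirepass") or [None])[0]
    masterauth = (conf.get("masterauth") or [None])[0]
    problems = []
    if requirepass != masterauth:
        problems.append("requirepass and masterauth differ")
    if sentinel_auth_pass(sentinel_text, master_name) != requirepass:
        problems.append("sentinel auth-pass does not match requirepass")
    return problems
```

Run against the redis.conf and sentinel.conf contents of each host, an empty result means that instance can act as either master or slave after a failover and that the sentinels can still authenticate to it.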

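  • Create the redis group and user, and give them ownership of every directory named in the configuration files.
    # groupadd redis
    # useradd -g redis redis
    # chown -R redis:redis redis/
    # chown -R redis:redis /etc/redis/
  • Start each Redis instance:
    redis-server /etc/redis/redis.conf

        Note: on startup the Redis log reports several WARNINGs: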

    29407:M 14 Jun 14:36:42.186 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
    Fix: add the line net.core.somaxconn = 1024 to /etc/sysctl.conf, then run sysctl -p
    
    29407:M 14 Jun 14:36:42.186 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
    Fix: echo 1 > /proc/sys/vm/ (or use the sysctl command quoted in the warning above)
    
    29407:M 14 Jun 14:36:42.187 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
    Fix: echo never > /sys/kernel/mm/transparent_hugepage/enabled

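    About the WARNINGs:

    net.core.somaxconn is a Linux kernel parameter that caps the backlog of a listening socket.
    The backlog is the socket's listen queue: a connection request that has not yet been handled or fully established waits there.
    The server takes requests off the backlog as it handles them, and a handled request no longer occupies the queue.
    If the server is slow enough that the queue fills up, new requests are rejected.
    net.core.somaxconn therefore limits the size of the queue for incoming TCP connections.
    For a heavily loaded web service that constantly accepts new connections, the default of 128 is too small. Raising it to 1024 or more is recommended for most environments.
    
    
    The overcommit_memory parameter:
    It sets the kernel's memory allocation policy (optional; choose according to the server's actual situation).
    /proc/sys/vm/overcommit_memory
    Possible values: 0, 1, 2.
    0: the kernel uses a heuristic to check whether enough memory is available for the request; if so, the allocation succeeds, otherwise it fails and an error is returned to the process.
    1: the kernel always allows the allocation, regardless of the current memory state.
    2: the kernel never overcommits: the total committed memory may not exceed swap plus a configurable share (vm.overcommit_ratio) of physical RAM.
    Note: when Redis dumps data it forks a child process. In theory the child occupies as much memory as the parent: if the parent uses 8 GB, another 8 GB has to be accounted for the child. If the system cannot cover that, the Redis server may go down, or suffer high I/O load and reduced performance. The better allocation policy here is therefore 1 (the kernel always allows the allocation, regardless of the current memory state).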
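
The three kernel settings behind these WARNINGs can be verified before Redis is started. Below is a small Python 3 sketch under stated assumptions: check_kernel_settings and read_file are hypothetical names, the paths are the ones quoted in the warnings, and the THP test simply looks for the bracketed "[never]" marker that the kernel uses to show the active mode.

```python
def read_file(path):
    """Return the stripped contents of a /proc or /sys file."""
    with open(path) as fh:
        return fh.read().strip()


def check_kernel_settings(tcp_backlog=511, read=read_file):
    """Return a list of warnings for the settings discussed above.

    `read` is injectable so the check can be exercised without touching
    the real /proc and /sys files.
    """
    warnings = []
    if int(read("/proc/sys/net/core/somaxconn")) < tcp_backlog:
        warnings.append("somaxconn is lower than tcp-backlog")
    if read("/proc/sys/vm/overcommit_memory") != "1":
        warnings.append("vm.overcommit_memory is not 1")
    # The active THP mode is shown in brackets, e.g. "always madvise [never]".
    if "[never]" not in read("/sys/kernel/mm/transparent_hugepage/enabled"):
        warnings.append("Transparent Huge Pages are not disabled")
    return warnings


if __name__ == "__main__":
    for warning in check_kernel_settings():
        print("WARNING:", warning)
```

The default tcp_backlog of 511 mirrors the tcp-backlog value in the redis.conf above; pass a different value if that setting is changed.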
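  • Once replication is established (slaveof), start each sentinel instance:
    redis-sentinel /etc/redis/sentinel.conf

    Note: a problem appeared at this point, and the culprit was the protected-mode parameter. The log shows: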

    2208:X 14 Jun 23:13:09.185 * +sentinel sentinel ebf9b1b4a5cc98bffead5d0996b8f43deb806641 10.0.3.92 16379 @ dxy 10.0.3.110 6379
    2208:X 14 Jun 23:13:24.234 # +sdown sentinel ebf9b1b4a5cc98bffead5d0996b8f43deb806641 10.0.3.92 16379 @ dxy 10.0.3.110 6379
    2208:X 14 Jun 23:14:18.888 * +sentinel sentinel 07e189ae6c30d4951d3eb48e9effd948de026c3b 10.0.3.66 16379 @ dxy 10.0.3.110 6379
    2208:X 14 Jun 23:14:33.962 # +sdown sentinel 07e189ae6c30d4951d3eb48e9effd948de026c3b 10.0.3.66 16379 @ dxy 10.0.3.110 6379

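    The log shows that, apart from the local sentinel, the other two sentinels were marked subjectively down (SDOWN) exactly 15 seconds after being discovered (down-after-milliseconds 15000). A sentinel sends PING heartbeats to the master (and to the other instances it monitors) to check whether they are alive; if an instance does not answer with PONG within a given period, or answers with an error, the sentinel unilaterally considers it unavailable (subjectively down, SDOWN for short). down-after-milliseconds specifies that period, in milliseconds.
    The timestamps show that the sentinels could not reach each other, which led to SDOWN (see the discovery mechanism described in "Redis Replication, Sentinel Setup and Principles"). Because no error message was logged, the cause took a long time to find. In the end, logging in to a sentinel revealed it: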

    # redis-cli -h 10.0.3.110 -p 16379
    10.0.3.110:16379> info
    DENIED Redis is running in protected mode because protected mode is enabled, no bind address was specified, no authentication password is requested to clients. In this mode connections are only accepted from the loopback interface. If you want to connect from external computers to Redis you may adopt one of the following solutions: 1) Just disable protected mode sending the command 'CONFIG SET protected-mode no' from the loopback interface by connecting to Redis from the same host the server is running, however MAKE SURE Redis is not publicly accessible from internet if you do so. Use CONFIG REWRITE to make this change permanent. 2) Alternatively you can just disable the protected mode by editing the Redis configuration file, and setting the protected mode option to 'no', and then restarting the server. 3) If you started the server manually just for testing, restart it with the '--protected-mode no' option. 4) Setup a bind address or an authentication password. NOTE: You only need to do one of the above things in order for the server to start accepting connections from the outside.

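    The long message boils down to this: when Redis has neither a bind address nor a password configured, protected mode takes effect and Redis accepts connections only from the IPv4 and IPv6 loopback addresses. External clients are not silently dropped; they receive this error so that the user gets a hint about what went wrong. See the author's discussion for details; the suggestion given there is to turn protected mode off: --protected-mode no. This explains the symptom: the sentinels had no bind address and no password, so protected mode was active and each sentinel rejected connections from the others, which put them into SDOWN. Adding this to sentinel.conf:

    protected-mode no

    solved the problem. protected-mode was introduced in 3.2 and is enabled by default. The relevant documentation: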

    # Protected mode is a layer of security protection, in order to avoid that
    # Redis instances left open on the internet are accessed and exploited.
    #
    # When protected mode is on and if:
    #
    # 1) The server is not binding explicitly to a set of addresses using the
    #    "bind" directive.
    # 2) No password is configured.
    #
    # The server only accepts connections from clients connecting from the
    # IPv4 and IPv6 loopback addresses 127.0.0.1 and ::1, and from Unix domain
    # sockets.
    #
    # By default protected mode is enabled. You should disable it only if
    # you are sure you want clients from other hosts to connect to Redis
    # even if no authentication is configured, nor a specific set of interfaces
    # are explicitly listed using the "bind" directive.
    protected-mode yes
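
The rule quoted above can be written down as a small decision function. This Python 3 sketch only models the documented conditions for TCP clients; accepts_connection is a hypothetical name and nothing here is taken from the Redis source code.

```python
import ipaddress


def accepts_connection(client_ip, protected_mode=True, has_bind=False,
                       has_password=False):
    """Return True if a TCP client at client_ip would be served.

    Protected mode only restricts clients when it is on AND no bind
    address AND no password are configured. In that case only loopback
    clients are served. Note: is_loopback covers all of 127.0.0.0/8 and
    ::1, slightly more than the two addresses named in the comment.
    """
    if not protected_mode or has_bind or has_password:
        return True
    return ipaddress.ip_address(client_ip).is_loopback


# The failing case from the log: a sentinel on 10.0.3.92 talking to a
# sentinel that has no bind address and no password configured.
print(accepts_connection("10.0.3.92"))                        # False
print(accepts_connection("10.0.3.92", protected_mode=False))  # True
```

This is why the local sentinel looked healthy while the two remote ones went to SDOWN: only the loopback case passes the check.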
  • Start the sentinels and check the log (startup succeeded):
    2253:X 14 Jun 23:48:05.477 # Sentinel ID is 68fdb1e07c0998b119e4678f7aead7742a7b1f64
    2253:X 14 Jun 23:48:05.477 # +monitor master dxy 10.0.3.110 6379 quorum 2
    2253:X 14 Jun 23:48:05.478 * +slave slave 10.0.3.92:6379 10.0.3.92 6379 @ dxy 10.0.3.110 6379
    2253:X 14 Jun 23:48:05.512 * +slave slave 10.0.3.66:6379 10.0.3.66 6379 @ dxy 10.0.3.110 6379
    2253:X 14 Jun 23:48:14.894 * +sentinel sentinel b2fb07a1cce853ddec86a993428fb09edf15b6c1 10.0.3.92 16379 @ dxy 10.0.3.110 6379
    2253:X 14 Jun 23:48:23.346 * +sentinel sentinel d9b198d75ede190fc63d95af8a7ca58e1a395c9b 10.0.3.66 16379 @ dxy 10.0.3.110 6379
  • Check the status to verify that Sentinel is set up correctly (log in to any sentinel):
    10.0.3.92:16379> info sentinel
    # Sentinel
    sentinel_masters:1
    sentinel_tilt:0
    sentinel_running_scripts:0
    sentinel_scripts_queue_length:0
    sentinel_simulate_failure_flags:0
    master0:name=dxy,status=ok,address=10.0.3.110:6379,slaves=2,sentinels=3

    The master0 line above (status=ok, slaves=2, sentinels=3) shows that the sentinels started successfully.
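
The same check can be scripted. The Python 3 sketch below parses a master line of the form shown in the info sentinel output; parse_master_line and sentinel_healthy are hypothetical helper names, and the expected counts (2 slaves, 3 sentinels) are the ones from this test environment.

```python
def parse_master_line(line):
    """Parse 'master0:name=dxy,status=ok,...' into a dict of strings."""
    # Split on the first ':' only; the address field contains a ':' too.
    _, _, fields = line.strip().partition(":")
    return dict(item.split("=", 1) for item in fields.split(","))


def sentinel_healthy(line, expected_slaves, expected_sentinels):
    """True if the master is ok and all slaves and sentinels are visible."""
    info = parse_master_line(line)
    return (info.get("status") == "ok"
            and int(info.get("slaves", 0)) == expected_slaves
            and int(info.get("sentinels", 0)) == expected_sentinels)


line = "master0:name=dxy,status=ok,address=10.0.3.110:6379,slaves=2,sentinels=3"
print(parse_master_line(line)["address"])  # 10.0.3.110:6379
print(sentinel_healthy(line, 2, 3))        # True
```

With the protected-mode problem described earlier, the sentinels count would stay below 3 and the check would fail.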

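Testing:

Note: the virtual machines above could not reach a mail server, so the tests use a different environment: Redis 2.8.4 with three Redis instances (1 master, 2 slaves) and 3 sentinels. M: 192.168.200.208 (port 6379); S: 192.168.200.199 and 192.168.200.73; one sentinel instance (port 7379) runs on each Redis host.

① Inspect the state with info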

192.168.200.208:6379> info replication
# Replication
role:master
connected_slaves:2
slave0:ip=192.168.200.199,port=6379,state=online,offset=354835,lag=0
slave1:ip=192.168.200.73,port=6379,state=online,offset=354835,lag=0
master_repl_offset:354974 
repl_backlog_active:1
repl_backlog_size:5242880 
repl_backlog_first_byte_offset:2
repl_backlog_histlen:354973
192.168.200.208:6379>

192.168.200.208:7379> info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
192.168.200.208:7379> sentinel master dxy
 1) "name"
 2) "dxy"
 3) "ip"
 4) "192.168.200.208"
 5) "port"
 6) "6379"
 7) "runid"
 8) "50ad7cfe6676fc1a1e671ead4a780958942879fc"
 9) "flags"
10) "master"
11) "pending-commands"
12) "0"
13) "last-ok-ping-reply"
14) "682"
15) "last-ping-reply"
16) "682"
17) "info-refresh"
18) "3301"
19) "role-reported"
20) "master"
21) "role-reported-time"
22) "1930980"
23) "config-epoch"
24) "4"
25) "num-slaves"
26) "2"
27) "num-other-sentinels"
28) "2"
29) "quorum"
30) "2"
31) "down-after-milliseconds"
32) "30000"
33) "failover-timeout"
34) "180000"
35) "parallel-syncs"
36) "1"
37) "client-reconfig-script"
38) "/opt/bin/notify.py"

192.168.200.208:7379> sentinel slaves dxy
1)  1) "name"
    2) "192.168.200.199:6379"
    3) "ip"
    4) "192.168.200.199"  
    5) "port"
    6) "6379"
    7) "runid"
    8) "c4e7bf53f7cee3c28bc369e1db656f879bf41947"
    9) "flags"
   10) "slave"
   11) "pending-commands" 
   12) "0"
   13) "last-ok-ping-reply"
   14) "591"
   15) "last-ping-reply"  
   16) "591"
   17) "info-refresh"
   18) "3606"
   19) "role-reported"
   20) "slave"
   21) "role-reported-time"
   22) "1971346"
   23) "master-link-down-time"
   24) "0"
   25) "master-link-status"
   26) "ok"
   27) "master-host"
   28) "192.168.200.208"
   29) "master-port"
   30) "6379"
   31) "slave-priority"
   32) "100"
   33) "slave-repl-offset"
   34) "400362"
2)  1) "name"
    2) "192.168.200.73:6379"
    3) "ip"
    4) "192.168.200.73"
    5) "port"
    6) "6379"
    7) "runid"
    8) "64ad290c43bba2b062220029c4c91274bb4465b9"
    9) "flags"
   10) "slave"
   11) "pending-commands"
   12) "0"
   13) "last-ok-ping-reply"
   14) "591"
   15) "last-ping-reply"
   16) "591"
   17) "info-refresh"
   18) "4817"
   19) "role-reported"
   20) "slave"
   21) "role-reported-time"
   22) "326006"
   23) "master-link-down-time"
   24) "0"
   25) "master-link-status"
   26) "ok"
   27) "master-host"
   28) "192.168.200.208"
   29) "master-port"
   30) "6379"
   31) "slave-priority"
   32) "100"
   33) "slave-repl-offset"
   34) "400085"

② Verify failover

Kill the master and follow the switchover in the sentinel log:

[7637] 17 Jun 12:11:08.728 # +sdown master dxy 192.168.200.208 6379   # enters subjectively down (SDOWN)
[7637] 17 Jun 12:11:08.819 # +odown master dxy 192.168.200.208 6379   #quorum 2/2 # after the vote, enters objectively down (ODOWN)
[7637] 17 Jun 12:11:08.819 # +new-epoch 5                             # new epoch (configuration version)
[7637] 17 Jun 12:11:08.819 # +try-failover master dxy 192.168.200.208 6379  # failover conditions met; waiting for the election among the sentinels
[7637] 17 Jun 12:11:08.819 # +vote-for-leader 38da843c4ad8baf95dcfdcd968ae6c2f05ab995c 5  # votes for a leader
[7637] 17 Jun 12:11:08.820 # 192.168.200.199:7379 voted for 38da843c4ad8baf95dcfdcd968ae6c2f05ab995c 5
[7637] 17 Jun 12:11:08.820 # 192.168.200.73:7379 voted for 38da843c4ad8baf95dcfdcd968ae6c2f05ab995c 5
[7637] 17 Jun 12:11:08.909 # +elected-leader master dxy 192.168.200.208 6379 # this sentinel is elected leader
[7637] 17 Jun 12:11:08.909 # +failover-state-select-slave master dxy 192.168.200.208 6379 # selecting a slave to promote to master
[7637] 17 Jun 12:11:08.965 # +selected-slave slave 192.168.200.73:6379 192.168.200.73 6379 @ dxy 192.168.200.208 6379 # slave 73 is selected as the new master
[7637] 17 Jun 12:11:08.965 * +failover-state-send-slaveof-noone slave 192.168.200.73:6379 192.168.200.73 6379 @ dxy 192.168.200.208 6379 # switching the role of the selected slave (SLAVEOF NO ONE)
[7637] 17 Jun 12:11:09.017 * +failover-state-wait-promotion slave 192.168.200.73:6379 192.168.200.73 6379 @ dxy 192.168.200.208 6379 # waiting for the promotion to take effect
[7637] 17 Jun 12:11:09.867 # +promoted-slave slave 192.168.200.73:6379 192.168.200.73 6379 @ dxy 192.168.200.208 6379 # promotion confirmed
[7637] 17 Jun 12:11:09.867 # +failover-state-reconf-slaves master dxy 192.168.200.208 6379 # failover state changes to reconf-slaves
[7637] 17 Jun 12:11:09.957 * +slave-reconf-sent slave 192.168.200.199:6379 192.168.200.199 6379 @ dxy 192.168.200.208 6379 # the sentinel sends SLAVEOF to point this slave at the new master
[7637] 17 Jun 12:11:10.887 * +slave-reconf-inprog slave 192.168.200.199:6379 192.168.200.199 6379 @ dxy 192.168.200.208 6379 # the slave is being reconfigured for the new master, but replication has not completed yet
[7637] 17 Jun 12:11:10.887 * +slave-reconf-done slave 192.168.200.199:6379 192.168.200.199 6379 @ dxy 192.168.200.208 6379 # the slave is reconfigured for the new master and synchronized with it
[7637] 17 Jun 12:11:10.946 # -odown master dxy 192.168.200.208 6379 # the old master leaves objectively down
[7637] 17 Jun 12:11:10.946 # +failover-end master dxy 192.168.200.208 6379 # failover completed successfully
[7637] 17 Jun 12:11:10.946 # +switch-master dxy 192.168.200.208 6379 192.168.200.73 6379 # now monitoring the new master
[7637] 17 Jun 12:11:10.946 * +slave slave 192.168.200.199:6379 192.168.200.199 6379 @ dxy 192.168.200.73 6379 # slave discovered
[7637] 17 Jun 12:11:10.947 * +slave slave 192.168.200.208:6379 192.168.200.208 6379 @ dxy 192.168.200.73 6379
[7637] 17 Jun 12:11:40.960 # +sdown slave 192.168.200.208:6379 192.168.200.208 6379 @ dxy 192.168.200.73 6379
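
The +switch-master line is the one that matters to anything watching the log, since it names both the old and the new master. A minimal Python 3 sketch for pulling it out of a sentinel log follows; find_switch and SWITCH_RE are hypothetical names, and the pattern assumes the line layout shown in the log above.

```python
import re

# "+switch-master <name> <old-ip> <old-port> <new-ip> <new-port>"
SWITCH_RE = re.compile(
    r"\+switch-master (?P<name>\S+) (?P<old_ip>\S+) (?P<old_port>\d+) "
    r"(?P<new_ip>\S+) (?P<new_port>\d+)"
)


def find_switch(lines):
    """Return (name, old_addr, new_addr) of the last +switch-master, or None."""
    result = None
    for line in lines:
        match = SWITCH_RE.search(line)
        if match:
            result = (
                match.group("name"),
                (match.group("old_ip"), int(match.group("old_port"))),
                (match.group("new_ip"), int(match.group("new_port"))),
            )
    return result


log = [
    "[7637] 17 Jun 12:11:10.946 # +failover-end master dxy 192.168.200.208 6379",
    "[7637] 17 Jun 12:11:10.946 # +switch-master dxy 192.168.200.208 6379 192.168.200.73 6379",
]
print(find_switch(log))
```

Keeping only the last match means that repeated failovers in one log file resolve to the current master.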

Start the old master again and check the log:

[98910] 17 Jun 12:29:01.856 # -sdown slave 192.168.200.208:6379 192.168.200.208 6379 @ dxy 192.168.200.73 6379
[98910] 17 Jun 12:29:11.793 * +convert-to-slave slave 192.168.200.208:6379 192.168.200.208 6379 @ dxy 192.168.200.73 6379  # failover succeeded: the old master is now a slave

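More log events are described in the previous article. Sentinel also has a client-reconfig-script option, which is explained next. 

High availability through a failover script: the client-reconfig-script parameter names a script that Sentinel runs when a failover happens.

The parameter is documented as follows: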

# When the master changed because of a failover a script can be called in
# order to perform application-specific tasks to notify the clients that the
# configuration has changed and the master is at a different address.
# 
# The following arguments are passed to the script:
#
# <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
#
# <state> is currently always "failover"
# <role> is either "leader" or "observer"
# 
# The arguments from-ip, from-port, to-ip, to-port are used to communicate
# the old address of the master and the new address of the elected slave
# (now a master).
#
# This script should be resistant to multiple invocations.

Arguments passed to the script:

<master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
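
A script that receives these seven arguments can turn them into a named record before acting on them. The Python 3 sketch below is only an illustration; FailoverEvent and parse_args are hypothetical names and are not used by the script shown later in this article.

```python
from collections import namedtuple

# Field order follows the argument list documented above.
FailoverEvent = namedtuple(
    "FailoverEvent",
    "master_name role state from_ip from_port to_ip to_port",
)


def parse_args(argv):
    """Build a FailoverEvent from sys.argv-style input (argv[0] is the script path)."""
    if len(argv) != 8:
        raise ValueError("expected 7 arguments, got %d" % (len(argv) - 1))
    event = FailoverEvent(*argv[1:])
    # Ports arrive as strings; convert them once here.
    return event._replace(from_port=int(event.from_port),
                          to_port=int(event.to_port))


event = parse_args(["notify.py", "dxy", "leader", "start",
                    "192.168.200.73", "6379", "192.168.200.208", "6379"])
print(event.role, event.to_ip, event.to_port)  # leader 192.168.200.208 6379
```

Checking the argument count up front gives a clear error if the script is started by hand with the wrong arguments, and only the sentinel whose role is "leader" should then act on the event.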

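The script sends an alert email after a failover and moves the VIP to the new master, somewhat like MHA for MySQL. It is deliberately simple and does no extra checking; it can be hardened for more complex situations. Implementation:

①: First set up SSH key trust between the three Redis hosts so that they can log in to each other without a password.

Create a key pair with ssh-keygen, pressing Enter to accept the defaults; the public key ends up in .ssh/id_rsa.pub
ssh-keygen -t rsa  

Copy id_rsa.pub to the other two machines and append it to the authorized keys: 
cat id_rsa.pub >> /root/.ssh/authorized_keys 

Note: in this test the sentinel and Redis instances share hosts, so a sentinel that has to operate on its local Redis host (bring the VIP down or up) must also be able to SSH to itself. Each host's own public key therefore goes into its own authorized_keys as well, so that every server's authorized_keys contains all three keys (three lines).

②: Before the first run the VIP must already be set on the master; that is, once Redis Sentinel is up, configure the VIP on the master.

③: Obtain the required IPs from the information Sentinel reports to the script.

④: Send the email, write the log, and bring the VIP down and up remotely.

The paramiko module must be installed first: easy_install paramiko, which needs these packages: apt-get install python-setuptools python-dev build-essential libffi-dev libssl-dev; alternatively, simply run: apt-get install python-paramiko.

The complete script (for logging, see the references at the end):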

#!/usr/bin/env python
#-*-encoding:utf8-*-
#------------------------------------------------
# Name:        notify.py
# Purpose:     actions to run after a failover
# Author:      zhoujy
# Created:     2016-06-17
#------------------------------------------------
import os
import sys
import time
import datetime
import smtplib
import subprocess
import fileinput
import logging
import paramiko
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.utils import COMMASPACE, formatdate

reload(sys)
sys.setdefaultencoding('utf8')

def send_mail(to, subject, text, from_mail, server="localhost"):
    message = MIMEMultipart()
    message['From'] = from_mail
    message['To'] = COMMASPACE.join(to)
    message['Date'] = formatdate(localtime=True)
    message['Subject'] = subject
    message.attach(MIMEText(text,_charset='utf-8'))
    smtp = smtplib.SMTP(server)
    smtp.sendmail(from_mail, to, message.as_string())
    smtp.close()

# bring the VIP down
def down_vip(hostname,port):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=hostname,port=port)
    stdin, stdout, stderr = ssh.exec_command("ifconfig eth0:0 down")
#    print stdout.readlines()
    # read stderr once: a second readlines() would return an empty list
    errors = stderr.readlines()
    if  not errors :
        print "down vip ok..."
    else :
        print errors
    ssh.close()

# bring the VIP up
def up_vip(hostname,port,vip):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=hostname,port=port)
    stdin, stdout, stderr = ssh.exec_command("ifconfig eth0:0 %s;arping -c 3 -A %s;hash -r" %(vip,vip))
#    print stdout.readlines()
    # read stderr once: a second readlines() would return an empty list
    errors = stderr.readlines()
    if  not errors :
        print "up vip ok..."
    else :
        print errors
    ssh.close()

if __name__ == "__main__":
# SSH port of the servers
    ssh_port = 22
# the VIP to move
    vip      = '192.168.200.2'
# configure the log output format and destination with logging.basicConfig
    logging.basicConfig(level=logging.INFO,
                format=':::%(levelname)s::: \n%(message)s',
                datefmt='%a, %d %b %Y %H:%M:%S',
                filename='/var/log/redis/failover.txt',
                filemode='a')
# define a StreamHandler that prints INFO-level messages to standard error and attach it to the root logger
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
    console.setFormatter(formatter)
    logging.getLogger('').addHandler(console)

    time =  (datetime.datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    message = sys.argv[1:]
    master_name = sys.argv[1]
    role = sys.argv[2]
    stats = sys.argv[3]
    from_ip = sys.argv[4]
    from_port = sys.argv[5]
    to_ip = sys.argv[6]
    to_port = sys.argv[7]
    messages = "++++++++++++++++++++++++++"+time+" failover++++++++++++++++++++++++++"+'\n'+' '.join(message)
    subject = ''' Redis 【%s】 Failover ''' %master_name
    info = ''' %s : Redis Master %s failover %s(%s:%s) to %s(%s:%s) succeeded ! '''  %(time,master_name,from_ip,from_ip,from_port,to_ip,to_ip,to_port)
    mail_list =['zjy@dxyer.com']
    if role == 'leader':
        logging.info(messages)
        down_vip(from_ip,ssh_port)
        up_vip(to_ip,ssh_port,vip)
        send_mail(mail_list, subject.encode("utf8"), info +' and VIP do sucessed !!', "Redis_failover_report@ls.xxx.net", server="192.168.xxx.xxx")

When a failover happens, the alert email reads:

2016-06-17 19:06:42 : Redis Master dxy failover 192.168.200.73(192.168.200.73:6379) to 192.168.200.208(192.168.200.208:6379) succeeded !  and VIP do sucessed !!

The log file records:

:::INFO:::
++++++++++++++++++++++++++2016-06-17 19:06:42 failover++++++++++++++++++++++++++
dxy leader start 192.168.200.73 6379 192.168.200.208 6379
:::INFO:::
Connected (version 2.0, client OpenSSH_6.6.1p1)
:::INFO:::
Authentication (publickey) successful!
:::INFO:::
Connected (version 2.0, client OpenSSH_6.6.1p1)
:::INFO:::
Authentication (publickey) successful!

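BTW: applications can connect to Redis through the VIP, which gives a degree of high availability. When the VIP moves, the service is interrupted; how long it stays unavailable depends mainly on the detection time (down-after-milliseconds: 30 seconds by default, and it can be lowered, e.g. 5000 for 5 seconds) and on how quickly the application reconnects. Alternatively, a Sentinel-aware client such as Java's Jedis provides high availability directly: it asks the sentinels for the current master and then connects to that master.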
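
The "ask the sentinels, then connect to the master" approach can be sketched without any client library. The Python 3 example below sends SENTINEL get-master-addr-by-name over a plain socket; encode_command, parse_master_addr and get_master are hypothetical helper names, and the reply parser is a simplification that assumes the whole reply arrives in a single recv call.

```python
import socket


def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode() if isinstance(arg, str) else arg
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def parse_master_addr(reply):
    """Parse the reply to get-master-addr-by-name into (ip, port), or None."""
    lines = reply.split(b"\r\n")
    # Expected layout: *2, $<len>, ip, $<len>, port, ''
    if lines[0] != b"*2" or len(lines) < 5:
        return None  # e.g. "*-1" for an unknown master name, or an error
    return lines[2].decode(), int(lines[4])


def get_master(sentinels, master_name, timeout=1.0):
    """Ask each sentinel in turn and return the first (ip, port) answer."""
    for host, port in sentinels:
        try:
            with socket.create_connection((host, port), timeout=timeout) as conn:
                conn.sendall(encode_command(
                    "SENTINEL", "get-master-addr-by-name", master_name))
                addr = parse_master_addr(conn.recv(4096))
        except OSError:
            continue  # this sentinel is unreachable; try the next one
        if addr:
            return addr
    return None
```

In the test environment above this would be called as get_master([("192.168.200.208", 7379), ("192.168.200.199", 7379), ("192.168.200.73", 7379)], "dxy"), and the application would reconnect to the returned address after a failover instead of relying on the VIP.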

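Summary:

Together with "Redis Replication, Sentinel Setup and Principles", this article gives a working picture of how Redis Sentinel high availability is implemented. Sentinel is fairly simple; when the load is moderate and a single machine can meet the requirements, Redis Sentinel is a good choice.

 

References:

Redis Replication, Sentinel Setup and Principles

Cluster Failover Solutions

Python's logging module

python paramiko

Redis Sentinel High-Availability Architecture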

 

posted @ 2016-06-20 12:28 jyzhou