Keepalived high availability for nginx, with Zabbix split-brain monitoring

 

Introduction to keepalived

What is keepalived?

Keepalived was originally designed for the LVS load-balancing software, to manage and monitor the state of the service nodes in an LVS cluster; VRRP-based high-availability support was added later. As a result, besides managing LVS, Keepalived can also serve as a high-availability solution for other services, such as Nginx, HAProxy and MySQL.

Keepalived implements high availability mainly through VRRP, the Virtual Router Redundancy Protocol. VRRP was created to solve the single point of failure of static routing: it keeps the network running without interruption when individual nodes go down.

So Keepalived can, on the one hand, configure and manage LVS and health-check the nodes behind it, and on the other hand provide high availability for system network services in general.

Keepalived official website

Key functions of keepalived

keepalived has three important functions:

  • managing the LVS load-balancing software
  • health-checking the nodes of an LVS cluster
  • providing high availability (failover) for system network services

How keepalived failover works

Failover between Keepalived high-availability peers is implemented with VRRP (the Virtual Router Redundancy Protocol).

While Keepalived is working normally, the master node keeps sending heartbeat messages (via multicast) to the backup node to announce that it is still alive. When the master fails it can no longer send heartbeats, so the backup stops receiving them, invokes its takeover program, and takes over the master's IP resources and services. When the master recovers, the backup releases the IP resources and services it took over during the failure and returns to its backup role.

So what is VRRP?
VRRP, short for Virtual Router Redundancy Protocol, exists to eliminate the single point of failure of static routing; it uses an election mechanism to hand the routing task to one of the VRRP routers.

Keepalived principles

Keepalived high-availability architecture diagram

(diagram omitted)

How keepalived works

Keepalived high-availability pairs communicate over VRRP, so let's start with VRRP:
1) VRRP, the Virtual Router Redundancy Protocol, was created to solve the single point of failure of static routing.
2) VRRP hands the routing task to one of the VRRP routers through an election mechanism.
3) VRRP uses IP multicast (default multicast address 224.0.0.18) for communication between the high-availability peers.
4) In operation the master node sends packets and the backup nodes receive them; when a backup stops receiving the master's packets, it starts its takeover program and takes over the master's resources. There can be several backups, elected by priority, but in day-to-day Keepalived operations a single master/backup pair is the norm.
5) VRRP can authenticate and encrypt its packets, but upstream Keepalived still recommends configuring the authentication type and password in plain text.
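As an aside on point 3: the VRRP standard also assigns every virtual router a well-known virtual MAC address derived from its virtual_router_id (keepalived only puts this MAC on the wire when use_vmac is configured). A minimal sketch of the mapping; the helper name is mine:

```shell
# The VRRP spec maps each virtual router to the virtual MAC
# 00:00:5e:00:01:XX, where XX is the virtual_router_id in hex.
vrrp_virtual_mac() {
    printf '00:00:5e:00:01:%02x\n' "$1"
}

vrrp_virtual_mac 51    # 51 is the virtual_router_id used later in this article
```

For VRID 51 this prints 00:00:5e:00:01:33, which is the MAC you would see in ARP tables when use_vmac is enabled.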

With VRRP covered, let me describe how the Keepalived service itself works:

Keepalived peers communicate over VRRP, and VRRP determines master and backup by election: the master has the higher priority, so while it is up it holds all the resources and the backup node waits. When the master dies, the backup takes over the master's resources and serves in its place.
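The election described above boils down to "highest priority wins". A toy sketch of that rule, not keepalived code (names and the "name:priority" format are my own; the real protocol also breaks ties by IP address, which is omitted here):

```shell
# Toy model of the VRRP election rule: among the advertised priorities,
# the highest one wins the MASTER role.
elect_master() {
    # arguments are "name:priority" pairs; prints the name of the winner
    printf '%s\n' "$@" | sort -t: -k2 -rn | head -n1 | cut -d: -f1
}

elect_master "master:100" "backup:90"    # -> master
```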

Among Keepalived peers, only the server acting as master keeps sending VRRP advertisements to announce that it is alive, and the backup does not preempt it. When the master becomes unavailable, i.e. the backup no longer hears its advertisements, the backup starts the relevant services and takes over the resources to keep the business running; the takeover can complete in under a second.

 

keepalived configuration files

The default keepalived configuration file

//The main keepalived configuration file is /etc/keepalived/keepalived.conf
[root@master ~]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {       //global settings (these affect the whole daemon)
   notification_email {     //alert recipient addresses
     acassen@firewall.loc
     failover@firewall.loc
     sysadmin@firewall.loc
   }
   notification_email_from Alexandre.Cassen@firewall.loc    //alert sender address
   smtp_server 192.168.200.1    //SMTP server address
   smtp_connect_timeout 30      //SMTP connect timeout
   router_id LVS_DEVEL          //router identifier; must be unique within the LAN
   vrrp_skip_check_adv_addr
   vrrp_strict
   vrrp_garp_interval 0
   vrrp_gna_interval 0
}

vrrp_instance VI_1 {        //define a VRRP instance
    state MASTER            //initial state of this keepalived node: MASTER|BACKUP
    interface eth0          //NIC the VRRP instance binds to, used to send VRRP packets
    virtual_router_id 51    //virtual router ID; must be identical across the cluster
    priority 100            //priority; the highest value wins the master role
    nopreempt               //disable preemption
    advert_int 1            //advertisement interval between master and backup
    authentication {        //authentication settings
        auth_type PASS      //authentication method, here a password
        auth_pass 1111      //must be identical on every keepalived in the cluster; an 8-character random string is recommended
    }
    virtual_ipaddress {     //the VIP address(es) to use
        192.168.170.250
    }
}

virtual_server 192.168.170.250 80 {    //virtual server definition
    delay_loop 6        //health-check interval
    lb_algo rr          //LVS scheduling algorithm
    lb_kind NAT         //LVS forwarding mode
    persistence_timeout 50      //persistence timeout, in seconds
    protocol TCP        //layer-4 protocol

    sorry_server 192.168.170.134 80   //fallback server that answers clients when all real servers are down

    real_server 192.168.170.133 80 {    //a server that actually handles requests
        weight 1    //server weight, default 1
        HTTP_GET {
            url {
              path /testurl/test.jsp    //URL path to check
              digest 640205b7b0fc66c1ea91c463fac6334d   //digest of the expected response
            }
            url {
              path /testurl2/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            url {
              path /testurl3/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            connect_timeout 3       //connection timeout
            nb_get_retry 3          //number of GET retries
            delay_before_retry 3    //delay before each retry
        }
    }

    real_server 192.168.200.3 1358 {
        weight 1
        HTTP_GET {
            url {
              path /testurl/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334c
            }
            url {
              path /testurl2/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334c
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}


Environment

 

System      Hostname    IP
CentOS 8    master      192.168.170.133
CentOS 8    slave       192.168.170.134
VIP: 192.168.170.250
 

 

Installing keepalived

Master node

//Preparation: turn off the firewall and SELinux
[root@master ~]# systemctl stop firewalld
[root@master ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
[root@master ~]# setenforce 0
[root@master ~]# sed -ri 's/^(SELINUX=).*/\1disabled/g' /etc/selinux/config

//Configure a network repo and install some common tools
[root@master ~]# curl -o /etc/yum.repos.d/CentOS7-Base-163.repo http://mirrors.163.com/.help/CentOS7-Base-163.repo
[root@master ~]# sed -i 's/\$releasever/7/g' /etc/yum.repos.d/CentOS7-Base-163.repo
[root@master ~]# sed -i 's/^enabled=.*/enabled=1/g' /etc/yum.repos.d/CentOS7-Base-163.repo
[root@master ~]# yum -y install epel-release vim wget gcc gcc-c++



//Install keepalived
[root@master ~]# yum -y install keepalived

 

Backup node

//Turn off the firewall and SELinux
[root@slave ~]# systemctl stop firewalld
[root@slave ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
[root@slave ~]# setenforce 0
[root@slave ~]# sed -ri 's/^(SELINUX=).*/\1disabled/g' /etc/selinux/config

//Configure a network repo
[root@slave ~]# curl -o /etc/yum.repos.d/CentOS7-Base-163.repo http://mirrors.163.com/.help/CentOS7-Base-163.repo
[root@slave ~]# sed -i 's/\$releasever/7/g' /etc/yum.repos.d/CentOS7-Base-163.repo
[root@slave ~]# sed -i 's/^enabled=.*/enabled=1/g' /etc/yum.repos.d/CentOS7-Base-163.repo
[root@slave ~]# yum -y install epel-release vim wget gcc gcc-c++

//Install keepalived
[root@slave ~]# yum -y install keepalived

 

 

Files installed by the keepalived package

[root@slave ~]# rpm -ql keepalived
/etc/keepalived                  //configuration directory
/etc/keepalived/keepalived.conf       //the main configuration file
/etc/sysconfig/keepalived
/usr/bin/genhash
/usr/lib/.build-id
/usr/lib/.build-id/6c
/usr/lib/.build-id/6c/fcab96b8b176cef32532ae9cbd10f36ec694c3
/usr/lib/.build-id/c6
/usr/lib/.build-id/c6/776cd07d0f0df98dab704859c21192bc551e3c
/usr/lib/systemd/system/keepalived.service           //the systemd service unit
......
......

 

 

Install nginx on both the master and the backup

[root@master ~]# yum install -y nginx
[root@master ~]# systemctl start nginx
[root@master ~]# ss -antl
State    Recv-Q    Send-Q          Local Address:Port         Peer Address:Port    
LISTEN   0         128                   0.0.0.0:80                0.0.0.0:*       
LISTEN   0         128                   0.0.0.0:22                0.0.0.0:*       
LISTEN   0         128                      [::]:80                   [::]:*       
LISTEN   0         128                      [::]:22                   [::]:*   
[root@master ~]# cd /usr/share/nginx/html/
[root@master html]# echo "master node" > index.html 

[root@slave ~]# yum install -y nginx
[root@slave ~]# systemctl start nginx
[root@slave ~]# ss -antl
State    Recv-Q    Send-Q          Local Address:Port         Peer Address:Port    
LISTEN   0         128                   0.0.0.0:80                0.0.0.0:*       
LISTEN   0         128                   0.0.0.0:22                0.0.0.0:*       
LISTEN   0         128                      [::]:80                   [::]:*       
LISTEN   0         128                      [::]:22                   [::]:*       
[root@slave ~]# cd /usr/share/nginx/html/
[root@slave html]# echo "backup node" > index.html

 

Test access to the nginx pages

(screenshots omitted)

Configure keepalived on the master

 

//Create the following scripts in the /scripts directory
[root@master keepalived]# cat /scripts/check_n.sh
#!/bin/bash

# Stop keepalived (releasing the VIP) when no nginx process is running
nginx_status=`ps -ef|grep -v "grep" | grep "nginx"|wc -l`

if [ $nginx_status -lt 1 ];then
           systemctl stop keepalived
fi
[root@master scripts]# chmod +x check_n.sh
[root@master scripts]# vim notify.sh
#!/bin/bash
VIP=$2
sendmail (){
        subject="${VIP}'s server keepalived state has changed"
        content="`date +'%F %T'`: `hostname`'s state changed to master"
        echo "$content" | mail -s "$subject" 1470044516@qq.com
}
case "$1" in
  master)
        nginx_status=$(ps -ef|grep -Ev "grep|$0"|grep '\bnginx\b'|wc -l)
        if [ $nginx_status -lt 1 ];then
            systemctl start nginx
        fi
        sendmail
  ;;
  backup)
        nginx_status=$(ps -ef|grep -Ev "grep|$0"|grep '\bnginx\b'|wc -l)
        if [ $nginx_status -gt 0 ];then
            systemctl stop nginx
        fi
  ;;
  *)
        echo "Usage:$0 master|backup VIP"
  ;;
esac
[root@master scripts]# chmod +x notify.sh
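One caveat about check_n.sh above: counting `ps` output will also count any command line that merely contains the string "nginx". A hypothetical pgrep-based variant is sketched below; it only prints the intended action so the logic is easy to verify (a real script would run `systemctl stop keepalived` instead of echoing):

```shell
# Hypothetical alternative to check_n.sh using pgrep; prints the action
# instead of executing it.
check_nginx() {
    # pgrep -c matches on the process name, so e.g. a "vim nginx.conf"
    # session is not miscounted as a running nginx
    if [ "$(pgrep -c nginx)" -lt 1 ]; then
        echo "nginx down - would stop keepalived"
    else
        echo "nginx running"
    fi
}

check_nginx
```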

 

[root@master ~]# cd /etc/keepalived/
[root@master keepalived]# cp keepalived.conf{,.bak}
[root@master keepalived]# cat keepalived.conf
! Configuration File for keepalived

global_defs {
   router_id lb01
}

vrrp_script nginx_check {
    script "/scripts/check_n.sh"
    interval 10
    weight -20
}

vrrp_instance VI_1 {
    state MASTER
    interface ens160
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mei
    }
    virtual_ipaddress {
        192.168.170.250
    }
    track_script {
        nginx_check
    }
    notify_master "/scripts/notify.sh master 192.168.170.250"
    notify_backup "/scripts/notify.sh backup 192.168.170.250"
}

virtual_server 192.168.170.250 80 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 50
    protocol TCP

    real_server 192.168.170.133 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }

    real_server 192.168.170.134 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}

 

 

Configure keepalived.conf on the backup

The notify script

[root@slave ~]# mkdir /scripts
[root@slave ~]# cd /scripts/
[root@slave scripts]# vim notify.sh
#!/bin/bash
VIP=$2
sendmail (){
        subject="${VIP}'s server keepalived state has changed"
        content="`date +'%F %T'`: `hostname`'s state changed to master"
        echo "$content" | mail -s "$subject" 1470044516@qq.com
}
case "$1" in
  master)
        nginx_status=$(ps -ef|grep -Ev "grep|$0"|grep '\bnginx\b'|wc -l)
        if [ $nginx_status -lt 1 ];then
            systemctl start nginx
        fi
        sendmail
  ;;
  backup)
        nginx_status=$(ps -ef|grep -Ev "grep|$0"|grep '\bnginx\b'|wc -l)
        if [ $nginx_status -gt 0 ];then
            systemctl stop nginx
        fi
  ;;
  *)
        echo "Usage:$0 master|backup VIP"
  ;;
esac

[root@slave scripts]# chmod +x notify.sh
[root@slave scripts]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
   router_id lb02
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens160
    virtual_router_id 51
    priority 90
    nopreempt
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mei
    }
    virtual_ipaddress {
        192.168.170.250
    }
    notify_master "/scripts/notify.sh master 192.168.170.250"
    notify_backup "/scripts/notify.sh backup 192.168.170.250"
}

virtual_server 192.168.170.250 80 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 50
    protocol TCP

    real_server 192.168.170.133 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }

    real_server 192.168.170.134 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}

 

Start the services

//On the master
[root@master ~]# systemctl enable --now keepalived.service 
Created symlink /etc/systemd/system/multi-user.target.wants/keepalived.service → /usr/lib/systemd/system/keepalived.service.
[root@master ~]# systemctl enable --now nginx.service 
Created symlink /etc/systemd/system/multi-user.target.wants/nginx.service → /usr/lib/systemd/system/nginx.service.


//On the backup
[root@slave ~]# systemctl enable --now  keepalived.service 
Created symlink /etc/systemd/system/multi-user.target.wants/keepalived.service → /usr/lib/systemd/system/keepalived.service.

 

 

[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:fa:2c:7d brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.133/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1201sec preferred_lft 1201sec
    inet 192.168.170.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[root@master ~]# ss -antl
State    Recv-Q     Send-Q          Local Address:Port         Peer Address:Port    
LISTEN   0          128                   0.0.0.0:80                0.0.0.0:*       
LISTEN   0          128                   0.0.0.0:22                0.0.0.0:*       
LISTEN   0          128                      [::]:80                   [::]:*       
LISTEN   0          128                      [::]:22                   [::]:

[root@slave ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1151sec preferred_lft 1151sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[root@slave ~]# ss -antl
State    Recv-Q     Send-Q          Local Address:Port         Peer Address:Port    
LISTEN   0          128                   0.0.0.0:22                0.0.0.0:*       
LISTEN   0          128                      [::]:22                   [::]:*

 

 

While the master is healthy

(screenshots omitted)

When the master fails, the backup node automatically takes over

[root@master ~]# systemctl stop nginx.service 
[root@master ~]# ss -antl
State    Recv-Q     Send-Q          Local Address:Port         Peer Address:Port    
LISTEN   0          128                   0.0.0.0:22                0.0.0.0:*       
LISTEN   0          128                      [::]:22                   [::]:*       
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:fa:2c:7d brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.133/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1029sec preferred_lft 1029sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever



[root@slave ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1016sec preferred_lft 1016sec
    inet 192.168.170.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[root@slave ~]# ss -antl
State    Recv-Q     Send-Q          Local Address:Port         Peer Address:Port    
LISTEN   0          128                   0.0.0.0:80                0.0.0.0:*       
LISTEN   0          128                   0.0.0.0:22                0.0.0.0:*       
LISTEN   0          128                      [::]:80                   [::]:*       
LISTEN   0          128                      [::]:22                   [::]:*


When the master recovers

[root@master ~]# systemctl start nginx.service 
[root@master ~]# systemctl restart keepalived.service 
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:fa:2c:7d brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.133/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 930sec preferred_lft 930sec
    inet 192.168.170.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[root@master ~]# ss -antl
State    Recv-Q     Send-Q          Local Address:Port         Peer Address:Port    
LISTEN   0          128                   0.0.0.0:80                0.0.0.0:*       
LISTEN   0          128                   0.0.0.0:22                0.0.0.0:*       
LISTEN   0          128                      [::]:80                   [::]:*       
LISTEN   0          128                      [::]:22                   [::]:*



//The backup node automatically steps down
[root@slave ~]# ss -antl
State    Recv-Q     Send-Q          Local Address:Port         Peer Address:Port    
LISTEN   0          128                   0.0.0.0:22                0.0.0.0:*       
LISTEN   0          128                      [::]:22                   [::]:*       
[root@slave ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 907sec preferred_lft 907sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever


Monitoring split-brain with Zabbix

Environment

Host      IP                 Software installed
master    192.168.170.133    LAMP stack, zabbix_server, zabbix_agentd
slave     192.168.170.134    zabbix_agentd

 

 

For how to install Zabbix itself, see the separate posts on Zabbix monitoring and Zabbix configuration.

 

 

Split-brain

In a high-availability (HA) system, when the heartbeat link between the two nodes breaks, a system that used to act as one coordinated whole splits into two independent nodes. Having lost contact with each other, both assume the peer has failed, and the HA software on the two nodes starts fighting, like a split brain, over the shared resources and services. The consequences are severe: either the shared resources get carved up and neither side can bring the services up, or both sides bring the services up and read and write the shared storage at the same time, corrupting data (a classic example is a database's online logs getting corrupted).

The countermeasures against HA split-brain that are generally agreed on are roughly these:

  • Add redundant heartbeat links (e.g. two links, making the heartbeat itself HA) to reduce the chance of split-brain.
  • Use a disk lock: the serving side locks the shared disk so that during a split-brain the other side cannot take it away. This has a notable downside: if the side holding the shared disk never releases the lock, the other side can never get the disk. In practice, if the serving node suddenly crashes it cannot run the unlock command, and the standby can never take over the shared resources and services. Hence some HA designs use a "smart" lock: the serving side engages the disk lock only when it sees all heartbeat links down (it cannot perceive the peer at all), and leaves the disk unlocked the rest of the time.
  • Set up an arbitration mechanism, e.g. a reference IP (such as the gateway's). When the heartbeat links are completely down, each node pings the reference IP; a failed ping means the break is at the local end. If the local link that carries not only the heartbeat but also the outward-facing service is down, starting (or keeping) the application service is useless, so that node voluntarily gives up the contest and lets the side that can reach the reference IP run the service. To be safer still, the side that cannot ping the reference IP can simply reboot itself, fully releasing any shared resources it might still hold.
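The reference-IP arbitration idea can be sketched in a few lines of shell. This is a minimal illustration with placeholder messages, not a production fencing script; a real one would release the VIP or reboot instead of echoing:

```shell
# Minimal sketch of reference-IP arbitration: if this node can still reach
# the reference IP (e.g. the gateway), its own link is fine and it may keep
# competing; otherwise the break is local and it should yield.
check_arbitration() {
    if ping -c1 -W1 "$1" >/dev/null 2>&1; then
        echo "reference IP reachable - keep competing for the resources"
    else
        echo "local link down - yield and release the resources"
    fi
}

check_arbitration 127.0.0.1    # a real deployment would pass the gateway IP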

Causes of split-brain

In general, split-brain happens for the following kinds of reasons:

    • The heartbeat link between the HA pair fails, so the nodes cannot communicate normally
      • the heartbeat cable is broken (severed or aged)
      • NICs or their drivers have failed, or there are IP misconfigurations or conflicts (directly connected NICs)
      • devices along the heartbeat path have failed (NICs, switches)
      • the arbitration machine has a problem (in arbitration-based designs)
    • An iptables firewall on the HA servers blocks the heartbeat traffic
    • The heartbeat NIC addresses or similar settings are misconfigured, so heartbeats cannot be sent
    • Other misconfigurations, such as mismatched heartbeat methods, heartbeat broadcast conflicts, or software bugs

 

 

Split-brain should be monitored from the standby server, via a custom Zabbix check.
What should it watch? Whether a VIP shows up on the standby.

A VIP appearing on the standby means one of two things:

  • split-brain has occurred
  • a normal master-to-backup failover

The check only flags the possibility of split-brain; it cannot guarantee one actually happened, because a normal failover also moves the VIP to the standby.
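One way to narrow the two cases down, sketched with a helper of my own (not part of this article's setup): a standby that holds the VIP and can still reach the master over the ordinary network is likely in a split-brain, while a standby that holds the VIP and cannot reach the master has probably seen a normal failover:

```shell
# Hypothetical classifier for the standby's point of view.
classify_vip_state() {
    # $1 = 1 if the VIP is present locally, $2 = 1 if the master answers ping
    if [ "$1" -eq 1 ] && [ "$2" -eq 1 ]; then
        echo "possible split-brain"
    elif [ "$1" -eq 1 ]; then
        echo "normal failover"
    else
        echo "standby - no VIP here"
    fi
}

classify_vip_state 1 1    # -> possible split-brain
```

In practice the two inputs would come from `ip a show` (VIP presence) and a ping to the master's real IP; even then this is a heuristic, not proof.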

 

 

The VIP check script

[root@slave scripts]# pwd
/scripts
[root@slave scripts]# cat check_backupip.sh 
#!/bin/bash

# Print 0 when the VIP is absent from ens160, otherwise 1
if [ `ip a show ens160 |grep 192.168.170.250|wc -l` -eq 0 ];then
        echo "0"
else
        echo "1"
fi
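For reuse, check_backupip.sh could be generalized to take the interface and the VIP as arguments; this variant is a sketch of mine, not what the article deploys:

```shell
# Hypothetical generalization of check_backupip.sh: prints 1 if the given
# VIP is configured on the given interface, 0 otherwise.
check_vip() {
    # -F treats the VIP as a literal string (dots are not regex wildcards),
    # -w avoids matching a longer address that merely contains the VIP
    if ip a show "$1" 2>/dev/null | grep -qwF "$2"; then
        echo 1
    else
        echo 0
    fi
}

check_vip ens160 192.168.170.250
```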

 

Enable custom checks in the agent configuration on the backup node

[root@slave scripts]# vim /usr/local/etc/zabbix_agentd.conf
//append at the end of the file
UnsafeUserParameters=1
UserParameter=check.backup,/scripts/check_backupip.sh

 

 

Test from the master node that the value can be fetched

//the backup node does not currently hold the VIP
[root@slave ~]# ip a show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1407sec preferred_lft 1407sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever


//test from the master node
[root@master ~]# zabbix_get -s 192.168.170.134 -k "check.backup"   
0

 

Add the monitoring item in the Zabbix web UI

(screenshots omitted)

Add a trigger

(screenshots omitted)

Add a media type

(screenshots omitted)

Add a user

(screenshots omitted)

Add an action

(screenshots omitted)

Fire the trigger

//Stop nginx on the master node to simulate a split-brain condition
[root@master ~]# ss -antl
State    Recv-Q    Send-Q          Local Address:Port          Peer Address:Port    
LISTEN   0         128                   0.0.0.0:81                 0.0.0.0:*       
LISTEN   0         128                   0.0.0.0:22                 0.0.0.0:*       
LISTEN   0         128                   0.0.0.0:10050              0.0.0.0:*       
LISTEN   0         128                   0.0.0.0:10051              0.0.0.0:*       
LISTEN   0         128                   0.0.0.0:9000               0.0.0.0:*       
LISTEN   0         128                         *:80                       *:*       
LISTEN   0         128                      [::]:81                    [::]:*       
LISTEN   0         128                      [::]:22                    [::]:*       
LISTEN   0         80                          *:3306                     *:*       
[root@master ~]# systemctl stop nginx.service 
[root@master ~]# ip a show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:fa:2c:7d brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.133/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1617sec preferred_lft 1617sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link dadfailed tentative noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::7675:93c:79de:89ad/64 scope link dadfailed tentative noprefixroute 
       valid_lft forever preferred_lft forever


[root@slave ~]# ip a show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1542sec preferred_lft 1542sec
    inet 192.168.170.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever


posted @ 2021-05-21 02:02 by 取个名字真滴难