Keepalived high availability for nginx, with Zabbix monitoring of split-brain
Introduction to keepalived
What is keepalived?
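Keepalived was originally designed for the LVS load balancer, to manage and monitor the state of each service node in an LVS cluster. VRRP support was added later to provide high availability. As a result, besides managing LVS, Keepalived can also serve as a high-availability solution for other services (for example Nginx, Haproxy, or MySQL).
Keepalived implements high availability mainly through VRRP. VRRP stands for Virtual Router Redundancy Protocol. It was created to remove the single point of failure of static routing, so that the network keeps running without interruption when an individual node goes down.
So Keepalived can configure and manage LVS and health-check the nodes behind LVS, and it can also provide high availability for system network services.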
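keepalived official website
Key functions of keepalived
keepalived has three key functions:
- managing the LVS load balancer
- health-checking the nodes of an LVS cluster
- providing high availability (failover) for system network services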
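How keepalived high-availability failover works
Failover between Keepalived high-availability services is implemented with VRRP (Virtual Router Redundancy Protocol).
While Keepalived is working normally, the Master node keeps sending heartbeat messages (by multicast) to the Backup node to tell it that the Master is still alive. When the Master fails, it can no longer send heartbeats, so the Backup stops receiving them, runs its takeover routine, and takes over the Master's IP resources and services. When the Master recovers (with preemption enabled, which is the default), the Backup releases the IP resources and services it took over during the failure and returns to its standby role.
So what is VRRP?
VRRP, the Virtual Router Redundancy Protocol, was created to solve the single point of failure of static routing. It uses an election mechanism to hand the routing task to one of the VRRP routers.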
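keepalived principles
keepalived high-availability architecture diagram
How keepalived works
The nodes of a Keepalived high-availability pair talk to each other over VRRP, so we start with VRRP:
1) VRRP, the Virtual Router Redundancy Protocol, was created to solve the single point of failure of static routing.
2) VRRP uses an election mechanism to hand the routing task to one of the VRRP routers.
3) VRRP uses IP multicast (default multicast address 224.0.0.18) for communication between the nodes of the pair.
4) In operation the master sends packets and the backup receives them. When the backup no longer receives packets from the master, it starts its takeover routine and takes over the master's resources. There can be several backups, elected by priority, but in day-to-day Keepalived operations a single master/backup pair is the usual setup.
5) VRRP defines authentication for its packets, but Keepalived officially still recommends configuring the authentication type and password in plaintext.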
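With VRRP covered, here is how the Keepalived service itself works:
Keepalived nodes communicate over VRRP, and VRRP decides master and backup by election. The master has a higher priority than the backup, so in normal operation the master holds all the resources and the backup waits. When the master goes down, the backup takes over the master's resources and serves clients in its place.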
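The priority election described above can be sketched in a few lines of shell. This is only an illustration, not how keepalived is implemented: `elect_master` is a hypothetical helper, the node names and priorities are the ones used later in this article, and tie-breaking (which real VRRP resolves by comparing IP addresses) is ignored.

```bash
#!/usr/bin/env bash
# Sketch of a priority-based election.
# Input on stdin: one "name priority" pair per line.
# Assumption: ties are not handled; real VRRP breaks ties by IP address.
elect_master() {
    # Sort numerically on the priority column, highest first,
    # then print the name of the first (winning) node.
    sort -k2,2nr | head -n 1 | awk '{print $1}'
}

# Both nodes are alive: the node with priority 100 wins.
printf 'master 100\nslave 90\n' | elect_master

# The master has stopped advertising, so only the backup takes part.
printf 'slave 90\n' | elect_master
```

The first call prints `master` and the second prints `slave`, which mirrors the takeover: once the higher-priority node drops out, the remaining node wins by default.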
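Between Keepalived services, only the server acting as master keeps sending VRRP advertisement packets to tell the backup it is still alive, and during that time the backup does not preempt the master. When the master becomes unavailable, that is, when the backup no longer hears the master's advertisements, the backup starts the relevant services and takes over the resources to keep the business running. At its fastest, takeover can complete in under 1 second.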
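How quickly the backup notices a dead master depends on the advertisement interval and the backup's priority. The sketch below computes the VRRPv2 timer from RFC 3768 (Master_Down_Interval = 3 x Advertisement_Interval + Skew_Time, with Skew_Time = (256 - Priority) / 256 seconds). The helper name `master_down_interval` is hypothetical, and it is an assumption here that the installed keepalived runs VRRPv2 with these standard timers.

```bash
#!/usr/bin/env bash
# master_down_interval ADVERT_INT PRIORITY
# Prints how many seconds a backup waits without advertisements
# before it declares the master down (VRRPv2 timers, RFC 3768).
master_down_interval() {
    awk -v adv="$1" -v prio="$2" \
        'BEGIN { printf "%.3f\n", 3 * adv + (256 - prio) / 256 }'
}

# Backup from this article: advert_int 1, priority 90.
master_down_interval 1 90
```

With `advert_int 1` and priority 90 this prints `3.648`, so sub-second takeover needs a shorter advertisement interval than the 1 second used in this article.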
keepalived configuration file
Default keepalived configuration file
# The main keepalived configuration file is /etc/keepalived/keepalived.conf
[root@master ~]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {                            # global settings (they affect every instance)
   notification_email {                  # alert recipient addresses
     acassen@firewall.loc
     failover@firewall.loc
     sysadmin@firewall.loc
   }
   notification_email_from Alexandre.Cassen@firewall.loc   # alert sender address
   smtp_server 192.168.200.1             # mail server address
   smtp_connect_timeout 30               # mail server connection timeout
   router_id LVS_DEVEL                   # router identifier, unique within the LAN
   vrrp_skip_check_adv_addr
   vrrp_strict
   vrrp_garp_interval 0
   vrrp_gna_interval 0
}

vrrp_instance VI_1 {                     # define a VRRP instance
    state MASTER                         # initial state of this node: MASTER or BACKUP
    interface eth0                       # interface the instance binds to, used to send VRRP packets
    virtual_router_id 51                 # virtual router ID, must match within one cluster
    priority 100                         # priority decides master/backup; higher wins
    nopreempt                            # do not preempt (keepalived honors this only when state is BACKUP)
    advert_int 1                         # advertisement interval between master and backup
    authentication {                     # authentication settings
        auth_type PASS                   # authentication type, here a password
        auth_pass 1111                   # must match on every node of the cluster; 8 random digits recommended
    }
    virtual_ipaddress {                  # the VIP address(es) to use
        192.168.170.250
    }
}

virtual_server 192.168.170.250 80 {      # define a virtual server
    delay_loop 6                         # health-check interval
    lb_algo rr                           # LVS scheduling algorithm
    lb_kind NAT                          # LVS forwarding mode
    persistence_timeout 50               # persistence timeout, in seconds
    protocol TCP                         # layer-4 protocol

    sorry_server 192.168.170.134 80      # fallback server that answers clients when every RS is down

    real_server 192.168.170.133 80 {     # a real server that handles requests
        weight 1                         # server weight, default 1
        HTTP_GET {
            url {
              path /testurl/test.jsp     # URL path to check
              digest 640205b7b0fc66c1ea91c463fac6334d   # expected digest
            }
            url {
              path /testurl2/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            url {
              path /testurl3/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            connect_timeout 3            # connection timeout
            nb_get_retry 3               # number of GET retries
            delay_before_retry 3         # delay before each retry
        }
    }

    real_server 192.168.200.3 1358 {
        weight 1
        HTTP_GET {
            url {
              path /testurl/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334c
            }
            url {
              path /testurl2/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334c
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}
Environment
System | Hostname | IP |
centos8 | master | 192.168.170.133 |
centos8 | slave | 192.168.170.134 |
VIP: 192.168.170.250
Installing keepalived
Master node
# Preparation: turn off the firewall and SELinux
[root@master ~]# systemctl stop firewalld
[root@master ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
[root@master ~]# setenforce 0
[root@master ~]# sed -ri 's/^(SELINUX=).*/\1disabled/g' /etc/selinux/config
# Configure a network repo and install some common tools
[root@master ~]# curl -o /etc/yum.repos.d/CentOS7-Base-163.repo http://mirrors.163.com/.help/CentOS7-Base-163.repo
[root@master ~]# sed -i 's/\$releasever/7/g' /etc/yum.repos.d/CentOS7-Base-163.repo
[root@master ~]# sed -i 's/^enabled=.*/enabled=1/g' /etc/yum.repos.d/CentOS7-Base-163.repo
[root@master ~]# yum -y install epel-release vim wget gcc gcc-c++
# Install keepalived
[root@master ~]# yum -y install keepalived
Backup node
# Turn off the firewall and SELinux
[root@slave ~]# systemctl stop firewalld
[root@slave ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
[root@slave ~]# setenforce 0
[root@slave ~]# sed -ri 's/^(SELINUX=).*/\1disabled/g' /etc/selinux/config
# Configure a network repo
[root@slave ~]# curl -o /etc/yum.repos.d/CentOS7-Base-163.repo http://mirrors.163.com/.help/CentOS7-Base-163.repo
[root@slave ~]# sed -i 's/\$releasever/7/g' /etc/yum.repos.d/CentOS7-Base-163.repo
[root@slave ~]# sed -i 's/^enabled=.*/enabled=1/g' /etc/yum.repos.d/CentOS7-Base-163.repo
[root@slave ~]# yum -y install epel-release vim wget gcc gcc-c++
# Install keepalived
[root@slave ~]# yum -y install keepalived
Files created by the yum installation of keepalived
[root@slave ~]# rpm -ql keepalived
/etc/keepalived                                  # configuration directory
/etc/keepalived/keepalived.conf                  # main configuration file
/etc/sysconfig/keepalived
/usr/bin/genhash
/usr/lib/.build-id
/usr/lib/.build-id/6c
/usr/lib/.build-id/6c/fcab96b8b176cef32532ae9cbd10f36ec694c3
/usr/lib/.build-id/c6
/usr/lib/.build-id/c6/776cd07d0f0df98dab704859c21192bc551e3c
/usr/lib/systemd/system/keepalived.service       # systemd unit file
......
......
Install nginx on both the master and the backup
[root@master ~]# yum install -y nginx
[root@master ~]# systemctl start nginx
[root@master ~]# ss -antl
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:80           0.0.0.0:*
LISTEN   0        128      0.0.0.0:22           0.0.0.0:*
LISTEN   0        128      [::]:80              [::]:*
LISTEN   0        128      [::]:22              [::]:*
[root@master ~]# cd /usr/share/nginx/html/
[root@master html]# echo "master node" > index.html

[root@slave ~]# yum install -y nginx
[root@slave ~]# systemctl start nginx
[root@slave ~]# ss -antl
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:80           0.0.0.0:*
LISTEN   0        128      0.0.0.0:22           0.0.0.0:*
LISTEN   0        128      [::]:80              [::]:*
LISTEN   0        128      [::]:22              [::]:*
[root@slave ~]# cd /usr/share/nginx/html/
[root@slave html]# echo "backup node" > index.html
Test access to the nginx pages
Configure keepalived on the master
# Create the following scripts in the /scripts directory
[root@master keepalived]# cat /scripts/check_n.sh
#!/bin/bash
# Stop keepalived when no nginx process is left, so the VIP can move to the backup
nginx_status=$(ps -ef | grep -v "grep" | grep "nginx" | wc -l)
if [ "$nginx_status" -lt 1 ]; then
    systemctl stop keepalived
fi
[root@master scripts]# chmod +x check_n.sh
[root@master scripts]# vim notify.sh
#!/bin/bash
# Usage: notify.sh master|backup VIP
VIP=$2

# Mail a notice that this host has become master
sendmail() {
    subject="${VIP}'s server keepalived state has changed"
    content="$(date +'%F %T'): $(hostname)'s state changed to master"
    echo "$content" | mail -s "$subject" 1470044516@qq.com
}

case "$1" in
master)
    # Becoming master: make sure nginx is running, then send the notice
    nginx_status=$(ps -ef | grep -Ev "grep|$0" | grep '\bnginx\b' | wc -l)
    if [ "$nginx_status" -lt 1 ]; then
        systemctl start nginx
    fi
    sendmail
    ;;
backup)
    # Becoming backup: stop nginx if it is running
    nginx_status=$(ps -ef | grep -Ev "grep|$0" | grep '\bnginx\b' | wc -l)
    if [ "$nginx_status" -gt 0 ]; then
        systemctl stop nginx
    fi
    ;;
*)
    echo "Usage:$0 master|backup VIP"
    ;;
esac
[root@master scripts]# chmod +x notify.sh
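The health check above counts every `ps` line that contains the text "nginx", which would also count an editor that has nginx.conf open. A stricter variant is sketched below under the assumption that matching the exact command name is what is wanted. `count_nginx` is a hypothetical helper, not part of the original scripts; it reads command names from stdin so it can be tried without a running nginx.

```bash
#!/usr/bin/env bash
# count_nginx: read one command name per line on stdin and print how many
# of them are exactly "nginx" (a sketch, not the script used in this article).
count_nginx() {
    # -x matches whole lines only, -c prints the number of matching lines.
    grep -c -x 'nginx'
}

# Dry run on sample data: two nginx processes among unrelated ones.
printf 'systemd\nnginx\nnginx\nvim\n' | count_nginx

# On a real host the command names would come from ps:
#   ps -eo comm= | count_nginx
```

The dry run prints `2`. A line such as `vim nginx.conf` would not be counted, because only whole-line matches qualify.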
[root@master ~]# cd /etc/keepalived/
[root@master keepalived]# cp keepalived.conf{,.bak}
[root@master keepalived]# cat keepalived.conf
! Configuration File for keepalived

global_defs {
   router_id lb01
}

vrrp_script nginx_check {
    script "/scripts/check_n.sh"
    interval 10
    weight -20
}

vrrp_instance VI_1 {
    state MASTER
    interface ens160
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mei
    }
    virtual_ipaddress {
        192.168.170.250
    }
    track_script {
        nginx_check
    }
    notify_master "/scripts/notify.sh master 192.168.170.250"
    notify_backup "/scripts/notify.sh backup 192.168.170.250"
}

virtual_server 192.168.170.250 80 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 50
    protocol TCP

    real_server 192.168.170.133 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }

    real_server 192.168.170.134 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}
Configure keepalived.conf on the backup
Using the notify script
[root@slave ~]# mkdir /scripts
[root@slave ~]# cd /scripts/
[root@slave scripts]# vim notify.sh
#!/bin/bash
# Usage: notify.sh master|backup VIP
VIP=$2

# Mail a notice that this host has become master
sendmail() {
    subject="${VIP}'s server keepalived state has changed"
    content="$(date +'%F %T'): $(hostname)'s state changed to master"
    echo "$content" | mail -s "$subject" 1470044516@qq.com
}

case "$1" in
master)
    # Becoming master: make sure nginx is running, then send the notice
    nginx_status=$(ps -ef | grep -Ev "grep|$0" | grep '\bnginx\b' | wc -l)
    if [ "$nginx_status" -lt 1 ]; then
        systemctl start nginx
    fi
    sendmail
    ;;
backup)
    # Becoming backup: stop nginx if it is running
    nginx_status=$(ps -ef | grep -Ev "grep|$0" | grep '\bnginx\b' | wc -l)
    if [ "$nginx_status" -gt 0 ]; then
        systemctl stop nginx
    fi
    ;;
*)
    echo "Usage:$0 master|backup VIP"
    ;;
esac
[root@slave scripts]# chmod +x notify.sh
! Configuration File for keepalived

global_defs {
   router_id lb02
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens160
    virtual_router_id 51
    priority 90
    nopreempt
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mei
    }
    virtual_ipaddress {
        192.168.170.250
    }
    notify_master "/scripts/notify.sh master 192.168.170.250"
    notify_backup "/scripts/notify.sh backup 192.168.170.250"
}

virtual_server 192.168.170.250 80 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 50
    protocol TCP

    real_server 192.168.170.133 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }

    real_server 192.168.170.134 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}
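The master and backup configurations only work as a pair if `virtual_router_id`, `auth_pass` and `advert_int` match and the master's priority is higher. A small consistency check might look like the sketch below. The helpers `conf_value` and `check_pair` are hypothetical, and the parsing assumes each setting sits on its own `key value` line, as in the files shown in this article.

```bash
#!/usr/bin/env bash
# conf_value FILE KEY: print the value of the first "KEY value" line in FILE.
conf_value() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# check_pair MASTER_CONF BACKUP_CONF: report settings that break the pairing.
# Returns 0 when the pair looks consistent, 1 otherwise.
check_pair() {
    local rc=0 key
    for key in virtual_router_id auth_pass advert_int; do
        if [ "$(conf_value "$1" "$key")" != "$(conf_value "$2" "$key")" ]; then
            echo "mismatch: $key"
            rc=1
        fi
    done
    if [ "$(conf_value "$1" priority)" -le "$(conf_value "$2" priority)" ]; then
        echo "master priority must be higher than backup priority"
        rc=1
    fi
    return "$rc"
}

# Demo with two trimmed-down configs written to temporary files.
tmp=$(mktemp -d)
printf 'virtual_router_id 51\npriority 100\nadvert_int 1\nauth_pass mei\n' > "$tmp/master.conf"
printf 'virtual_router_id 51\npriority 90\nadvert_int 1\nauth_pass mei\n' > "$tmp/backup.conf"
check_pair "$tmp/master.conf" "$tmp/backup.conf" && echo "pair looks consistent"
rm -rf "$tmp"
```

On real hosts the two arguments would be the master's keepalived.conf and a copy of the backup's file fetched to the same machine.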
Start the services
# On the master
[root@master ~]# systemctl enable --now keepalived.service
Created symlink /etc/systemd/system/multi-user.target.wants/keepalived.service → /usr/lib/systemd/system/keepalived.service.
[root@master ~]# systemctl enable --now nginx.service
Created symlink /etc/systemd/system/multi-user.target.wants/nginx.service → /usr/lib/systemd/system/nginx.service.
# On the backup
[root@slave ~]# systemctl enable --now keepalived.service
Created symlink /etc/systemd/system/multi-user.target.wants/keepalived.service → /usr/lib/systemd/system/keepalived.service.
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:fa:2c:7d brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.133/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1201sec preferred_lft 1201sec
    inet 192.168.170.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@master ~]# ss -antl
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:80           0.0.0.0:*
LISTEN   0        128      0.0.0.0:22           0.0.0.0:*
LISTEN   0        128      [::]:80              [::]:*
LISTEN   0        128      [::]:22              [::]:*

[root@slave ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1151sec preferred_lft 1151sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@slave ~]# ss -antl
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:22           0.0.0.0:*
LISTEN   0        128      [::]:22              [::]:*
When the master is healthy
When the master fails, the backup node takes over automatically
[root@master ~]# systemctl stop nginx.service
[root@master ~]# ss -antl
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:22           0.0.0.0:*
LISTEN   0        128      [::]:22              [::]:*
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:fa:2c:7d brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.133/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1029sec preferred_lft 1029sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

[root@slave ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1016sec preferred_lft 1016sec
    inet 192.168.170.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@slave ~]# ss -antl
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:80           0.0.0.0:*
LISTEN   0        128      0.0.0.0:22           0.0.0.0:*
LISTEN   0        128      [::]:80              [::]:*
LISTEN   0        128      [::]:22              [::]:*
When the master recovers
[root@master ~]# systemctl start nginx.service
[root@master ~]# systemctl restart keepalived.service
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:fa:2c:7d brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.133/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 930sec preferred_lft 930sec
    inet 192.168.170.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@master ~]# ss -antl
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:80           0.0.0.0:*
LISTEN   0        128      0.0.0.0:22           0.0.0.0:*
LISTEN   0        128      [::]:80              [::]:*
LISTEN   0        128      [::]:22              [::]:*

# The backup node steps down automatically
[root@slave ~]# ss -antl
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:22           0.0.0.0:*
LISTEN   0        128      [::]:22              [::]:*
[root@slave ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 907sec preferred_lft 907sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
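Monitoring split-brain with Zabbix
Environment
Host | IP address | Installed |
master | 192.168.170.133 | LAMP stack, zabbix_server, zabbix_agentd |
slave | 192.168.170.134 | zabbix_agentd |
For the Zabbix installation itself, see the posts "zabbix monitoring" and "zabbix configuration".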
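Split-brain
In a high-availability (HA) system, when the "heartbeat link" connecting the two nodes is cut, a system that used to act as one coordinated whole splits into two independent nodes. Having lost contact, each assumes the other has failed. The HA software on the two nodes then behaves like a split-brain patient, fighting over the shared resources and over who starts the application service, with serious consequences: either the shared resources get divided up and the service cannot start on either side, or the service starts on both sides and both read and write the shared storage at once, corrupting data (a common case is errors in a database's online logs, which are written in rotation).
The generally agreed countermeasures against split-brain in HA systems are roughly these:
- Add redundant heartbeat links, for example two heartbeat links (making the heartbeat itself highly available), to make split-brain as unlikely as possible.
- Enable disk locking. The node currently serving locks the shared disk, so that during a split-brain the other side cannot take the shared disk away. Disk locking has a significant drawback of its own: if the node holding the shared disk does not actively unlock it, the other node can never get it. In practice, a serving node that suddenly hangs or crashes cannot run the unlock command, so the backup node cannot take over the shared resources and the application service. This led to "smart" locks in some HA designs: the serving node enables the disk lock only when it finds that all heartbeat links are down (it cannot see its peer), and leaves the disk unlocked the rest of the time.
- Set up an arbitration mechanism. For example, configure a reference IP (such as the gateway IP). When the heartbeat link is completely down, each node pings the reference IP. If the ping fails, the break is on that node's own side: not only the heartbeat but also its outward-facing service link is down, so starting (or continuing) the application service there is pointless. That node gives up the competition and lets the node that can reach the reference IP run the service. To be safer still, the node that cannot reach the reference IP simply reboots itself, which fully releases any shared resources it may still hold.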
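The arbitration idea in the last item can be sketched as a small shell function. This is a minimal sketch rather than a production fencing mechanism: `arbitrate` is a hypothetical helper, the probe command is passed in as arguments so the logic can be tried without a network, and the gateway address in the comment is an invented example.

```bash
#!/usr/bin/env bash
# arbitrate PROBE_CMD [ARGS...]
# Run the probe (normally a ping of the reference IP) and print
# "keep" when it succeeds or "release" when it fails.
arbitrate() {
    if "$@" >/dev/null 2>&1; then
        echo "keep"      # reference IP reachable: this side may hold the resources
    else
        echo "release"   # reference IP unreachable: the break is local, give up
    fi
}

# Real use (192.168.170.2 is a hypothetical gateway address):
#   [ "$(arbitrate ping -c 2 -W 1 192.168.170.2)" = "release" ] && systemctl stop keepalived

# Dry run with stand-in probes instead of ping:
arbitrate true
arbitrate false
```

The dry run prints `keep` and then `release`. Passing the probe as arguments keeps the decision logic separate from the network check, so either can be changed on its own.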
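Causes of split-brain
In general, split-brain has the following causes:
- The heartbeat link between the high-availability servers fails, so they cannot communicate normally:
  - the heartbeat cable is faulty (broken or aged)
  - the NIC or its driver is faulty, or there is an IP misconfiguration or conflict (directly connected NICs)
  - a device on the heartbeat path is faulty (NIC or switch)
  - the arbitration machine has a problem (when an arbitration scheme is used)
- An iptables firewall enabled on the high-availability servers blocks the heartbeat messages
- The heartbeat NIC address or related settings on the high-availability servers are wrong, so heartbeats fail to be sent
- Other misconfiguration, such as mismatched heartbeat methods, heartbeat broadcast conflicts, or software bugs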
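Split-brain should be monitored on the backup server, using a custom Zabbix item.
What should be monitored? Whether the VIP is present on the backup.
The VIP shows up on the backup in two situations:
- a split-brain has occurred
- a normal master/backup switchover has occurred
So this check only flags a possible split-brain. It cannot prove that one has happened, because a normal switchover also moves the VIP to the backup.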
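To tell the two situations apart, the VIP state of both nodes has to be compared. The sketch below assumes a monitoring host that can still query both agents and has already obtained a 0/1 "has VIP" value from each. `classify` is a hypothetical helper and is not part of the setup in this article, which only checks the backup.

```bash
#!/usr/bin/env bash
# classify MASTER_HAS_VIP BACKUP_HAS_VIP   (each argument is 0 or 1)
# Assumption: both values were collected at roughly the same time.
classify() {
    case "$1$2" in
        10) echo "normal: the master holds the VIP" ;;
        01) echo "failover: only the backup holds the VIP" ;;
        11) echo "split-brain: both nodes hold the VIP" ;;
        00) echo "outage: no node holds the VIP" ;;
        *)  echo "usage: classify 0|1 0|1" >&2; return 2 ;;
    esac
}

classify 1 0
classify 0 1
classify 1 1
```

Only the `11` case is a definite split-brain. In a real network partition the monitoring host may be unable to reach one of the nodes, in which case this comparison cannot be made.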
Script that checks for the VIP
[root@slave scripts]# pwd
/scripts
[root@slave scripts]# cat check_backupip.sh
#!/bin/bash
# Print 1 if the VIP is present on ens160, otherwise 0
if [ "$(ip a show ens160 | grep 192.168.170.250 | wc -l)" -eq 0 ]; then
    echo "0"
else
    echo "1"
fi
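The plain `grep` in the script treats the dots as wildcards and also matches longer addresses that merely start with the VIP. A stricter match is sketched below. `has_vip` is a hypothetical helper that reads `ip -o -4 addr show` style output from stdin, and the sample lines are shortened versions of the output shown in this article.

```bash
#!/usr/bin/env bash
# has_vip VIP
# Read `ip -o -4 addr show` style output on stdin.
# Print 1 if VIP is configured as an address, otherwise 0.
has_vip() {
    # -F takes the pattern literally; the leading space and trailing slash
    # anchor the whole address, so 192.168.170.25 does not match 192.168.170.250.
    if grep -qF " $1/"; then
        echo 1
    else
        echo 0
    fi
}

# Dry run on sample lines: first without the VIP, then with it.
printf '2: ens160    inet 192.168.170.134/24 brd 192.168.170.255 scope global ens160\n' \
    | has_vip 192.168.170.250
printf '2: ens160    inet 192.168.170.250/32 scope global ens160\n' \
    | has_vip 192.168.170.250

# On the backup node the input would come from the ip command:
#   ip -o -4 addr show ens160 | has_vip 192.168.170.250
```

The dry run prints `0` and then `1`, the same values the Zabbix item expects.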
On the backup node, edit the agent configuration file to enable custom monitoring
[root@slave scripts]# vim /usr/local/etc/zabbix_agentd.conf
# Append at the end of the file
UnsafeUserParameters=1
UserParameter=check.backup,/scripts/check_backupip.sh
From the master node, test whether the value can be fetched
# The backup node does not hold the VIP
[root@slave ~]# ip a show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1407sec preferred_lft 1407sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
# Test from the master node
[root@master ~]# zabbix_get -s 192.168.170.134 -k "check.backup"
0
Add the monitored item in the Zabbix web UI
Add a trigger
Add a media type
Add a user
Add an action
Fire the trigger
# Stop nginx on the master so the VIP moves to the backup, which fires the same trigger a split-brain would
[root@master ~]# ss -antl
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128      0.0.0.0:81           0.0.0.0:*
LISTEN   0        128      0.0.0.0:22           0.0.0.0:*
LISTEN   0        128      0.0.0.0:10050        0.0.0.0:*
LISTEN   0        128      0.0.0.0:10051        0.0.0.0:*
LISTEN   0        128      0.0.0.0:9000         0.0.0.0:*
LISTEN   0        128      *:80                 *:*
LISTEN   0        128      [::]:81              [::]:*
LISTEN   0        128      [::]:22              [::]:*
LISTEN   0        80       *:3306               *:*
[root@master ~]# systemctl stop nginx.service
[root@master ~]# ip a show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:fa:2c:7d brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.133/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1617sec preferred_lft 1617sec
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link dadfailed tentative noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::7675:93c:79de:89ad/64 scope link dadfailed tentative noprefixroute
       valid_lft forever preferred_lft forever

[root@slave ~]# ip a show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:81:b8:da brd ff:ff:ff:ff:ff:ff
    inet 192.168.170.134/24 brd 192.168.170.255 scope global dynamic noprefixroute ens160
       valid_lft 1542sec preferred_lft 1542sec
    inet 192.168.170.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::c8b5:ee83:7837:cb77/64 scope link dadfailed tentative noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::76dd:4d80:292a:6a7a/64 scope link noprefixroute
       valid_lft forever preferred_lft forever