Linux High-Availability (HA) Cluster: A Hands-On Three-Node Corosync + Pacemaker Configuration Walkthrough (Part 3) - LeeHang

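I. Environment Preparation

Node information:

Node name	IP address (service network)	IP address (heartbeat network)	Role
node1	10.0.0.11	192.168.1.11	Cluster node
node2	10.0.0.22	192.168.1.22	Cluster node
node3	10.0.0.33	192.168.1.33	Cluster node

Shared configuration:

VIP: 10.0.0.100
Service: Apache (listens on the VIP; data directory /var/www/html).
OS: CentOS 7/8 or Rocky Linux 8.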

II. Basic Configuration

1. Configure the hostname and hosts file (all nodes)

# Set the hostname (node1 shown as an example)
hostnamectl set-hostname node1

# Edit /etc/hosts and add the following entries
10.0.0.11 node1
10.0.0.22 node2
10.0.0.33 node3
192.168.1.11 node1-hb
192.168.1.22 node2-hb
192.168.1.33 node3-hb
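
A missing or mistyped hosts entry is a common reason for later pcs and Corosync failures, so it can help to check the file on every node before going further. The following is a minimal sketch, not part of pcs or Corosync: check_hosts is a hypothetical helper name, and the expected entries are simply the six lines listed above.

```bash
# Hypothetical helper: verify that a hosts file maps every cluster name
# to the address expected by this article.
check_hosts() (
    hosts_file=$1
    status=0
    while read -r ip name; do
        # A line matches when its first field is the IP and any later field is the name.
        if awk -v ip="$ip" -v name="$name" '
            $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
            END { exit !found }' "$hosts_file"; then
            echo "ok: $name -> $ip"
        else
            echo "MISSING: $name -> $ip"
            status=1
        fi
    done <<EOF
10.0.0.11 node1
10.0.0.22 node2
10.0.0.33 node3
192.168.1.11 node1-hb
192.168.1.22 node2-hb
192.168.1.33 node3-hb
EOF
    exit "$status"
)
```

Run it as check_hosts /etc/hosts on each node; a non-zero exit status means at least one entry is missing.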

2. Disable the firewall and set SELinux to permissive (all nodes)

systemctl stop firewalld && systemctl disable firewalld
setenforce 0
sed -i 's/SELINUX=enforcing/SELINUX=permissive/g' /etc/selinux/config

3. Time synchronization (all nodes)

yum install -y chrony
systemctl enable chronyd && systemctl start chronyd
chronyc sources -v

III. Install Corosync and Pacemaker (all nodes)

# CentOS/Rocky Linux
yum install -y corosync pacemaker pcs fence-agents-all
systemctl enable pcsd && systemctl start pcsd

# Set the hacluster user's password (must be identical on all nodes)
echo "hacluster" | passwd --stdin hacluster

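IV. Configure the Corosync Cluster

1. Initialize cluster authentication (any node)

# pcs 0.9 syntax (CentOS 7, which produced the output below).
# On pcs 0.10+ (CentOS 8 / Rocky Linux 8) use: pcs host auth node1 node2 node3 -u hacluster -p hacluster
pcs cluster auth node1 node2 node3 -u hacluster -p hacluster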

[root@node1:~]$pcs cluster auth node1 node2 node3 -u hacluster -p hacluster
node1: Authorized
node3: Authorized
node2: Authorized

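2. Create and start the cluster (any node)

# pcs 0.9 syntax; on pcs 0.10+ drop --name: pcs cluster setup web_cluster node1 node2 node3
pcs cluster setup --name web_cluster node1 node2 node3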
[root@node1:~]$pcs cluster setup --name web_cluster node1 node2 node3
Destroying cluster on nodes: node1, node2, node3...
node3: Stopping Cluster (pacemaker)...
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node1: Successfully destroyed cluster
node2: Successfully destroyed cluster
node3: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'node1', 'node2', 'node3'
node1: successful distribution of the file 'pacemaker_remote authkey'
node2: successful distribution of the file 'pacemaker_remote authkey'
node3: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node1: Succeeded
node2: Succeeded
node3: Succeeded

Synchronizing pcsd certificates on nodes node1, node2, node3...
node1: Success
node3: Success
node2: Success
Restarting pcsd on the nodes in order to reload the certificates...
node1: Success
node3: Success
node2: Success


pcs cluster start --all
[root@node1:~]$pcs cluster start --all
node1: Starting Cluster (corosync)...
node2: Starting Cluster (corosync)...
node3: Starting Cluster (corosync)...
node3: Starting Cluster (pacemaker)...
node1: Starting Cluster (pacemaker)...
node2: Starting Cluster (pacemaker)...

pcs cluster enable --all
[root@node1:~]$pcs cluster enable --all
node1: Cluster Enabled
node2: Cluster Enabled
node3: Cluster Enabled

# Check the cluster status
pcs status cluster
[root@node1:~]$pcs status cluster
Cluster Status:
 Stack: corosync
 Current DC: node1 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
 Last updated: Wed Apr  2 11:42:11 2025
 Last change: Wed Apr  2 11:41:50 2025 by hacluster via crmd on node1
 3 nodes configured
 0 resource instances configured

PCSD Status:
  node1: Online
  node2: Online
  node3: Online
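
The PCSD Status section above is easy to check by eye with three nodes, but a script can do the same check. This is a minimal sketch that assumes the output format shown above; pcsd_offline_nodes is a hypothetical helper name, not a pcs subcommand.

```bash
# Hypothetical helper: read `pcs status cluster` output on stdin and print
# every node in the "PCSD Status:" section that is not reported as Online.
pcsd_offline_nodes() {
    awk '
        /^PCSD Status:/ { in_pcsd = 1; next }
        in_pcsd && NF == 2 && $2 != "Online" { sub(/:$/, "", $1); print $1 }
    '
}
```

Typical use would be pcs status cluster | pcsd_offline_nodes; empty output means pcsd is reachable on every node.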

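3. Configure Corosync to use the dedicated heartbeat network

(1) Generate the shared key (on one node only, e.g. node1):
corosync-keygen
This command writes the key to /etc/corosync/authkey.
Corosync uses this key for authentication and encryption when the crypto_cipher / crypto_hash parameters (or the legacy secauth option) are enabled in the totem section of the configuration file.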

(2) Distribute the key to all nodes
# Example: copy the key from node1 to node2 and node3
sudo scp /etc/corosync/authkey root@node2:/etc/corosync/authkey
sudo scp /etc/corosync/authkey root@node3:/etc/corosync/authkey

(3) Check the key file ownership and permissions (all nodes)
sudo chown root:root /etc/corosync/authkey
sudo chmod 600 /etc/corosync/authkey
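
After copying the key, it is worth confirming on each node that the file exists and is not readable by other users. The sketch below assumes GNU stat (as shipped on CentOS and Rocky Linux); check_authkey is a hypothetical helper name, and treating modes 400 and 600 as acceptable is an assumption of this example.

```bash
# Hypothetical helper: check that the Corosync key exists and has a
# restrictive mode. Defaults to /etc/corosync/authkey.
check_authkey() (
    key=${1:-/etc/corosync/authkey}
    [ -f "$key" ] || { echo "missing: $key"; exit 1; }
    mode=$(stat -c '%a' "$key")   # GNU stat: octal permission bits
    case $mode in
        400|600) echo "ok: $key mode $mode" ;;
        *) echo "too permissive: $key mode $mode"; exit 1 ;;
    esac
)
```

To confirm that all nodes hold the same key, compare the output of sha256sum /etc/corosync/authkey across the three nodes.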

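(4) Enable encryption and authentication in the configuration file

# Edit the Corosync configuration file (all nodes).
# corosync.conf only treats lines that begin with # as comments, so keep each comment on its own line.
vi /etc/corosync/corosync.conf

[root@node1:~]$cat /etc/corosync/corosync.conf
totem {
	version: 2
	# Enable authentication between nodes (legacy option, deprecated in Corosync 2.x in favor of crypto_cipher and crypto_hash)
	secauth: on
	# Requires the shared key generated beforehand with corosync-keygen
	crypto_cipher: aes256
	crypto_hash: sha256

	interface {
		ringnumber: 0
		# Network address of the heartbeat subnet
		bindnetaddr: 192.168.1.0
		mcastaddr: 239.255.1.1
		# Multicast port 5405
		mcastport: 5405
		ttl: 1
	}
}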

nodelist {
    node {
        # Hostname that resolves to the heartbeat address; node1-hb is 192.168.1.11
        ring0_addr: node1-hb
        name: node1
        nodeid: 1
    }

    node {
        # Hostname that resolves to the heartbeat address; node2-hb is 192.168.1.22
        ring0_addr: node2-hb
        name: node2
        nodeid: 2
    }

    node {
        # Hostname that resolves to the heartbeat address; node3-hb is 192.168.1.33
        ring0_addr: node3-hb
        name: node3
        nodeid: 3
    }
}


logging {
	fileline: off
	to_stderr: no
	to_logfile: yes
	# Log file path
	logfile: /var/log/cluster/corosync.log
	to_syslog: yes
	debug: off
	timestamp: on
	logger_subsys {
		subsys: QUORUM
		debug: off
	}
}

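quorum {
	provider: corosync_votequorum
	# expected_votes is the total number of votes the cluster expects; with one vote per node it must equal the node count (3 here).
	# Quorum needs a majority: floor(3 / 2) + 1 = 2 votes, so the cluster keeps running with one node down and loses quorum with two down.
	# When a nodelist section is present, votequorum can calculate this value automatically; an explicit value here overrides it, so a wrong number can cause quorum loss or split-brain.
	# Two-node clusters need special handling (the two_node parameter).
	expected_votes: 3
}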
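
The majority rule behind expected_votes is simple arithmetic. As a small worked example, assuming one vote per node and no special votequorum options such as two_node, the following sketch prints the quorum threshold and the number of node failures a cluster can absorb; quorum_votes and tolerated_failures are hypothetical helper names.

```bash
# Votes needed for quorum with one vote per node: floor(expected_votes / 2) + 1.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail before quorum is lost.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 5; do
    echo "nodes=$n quorum=$(quorum_votes "$n") tolerated_failures=$(tolerated_failures "$n")"
done
# nodes=2 quorum=2 tolerated_failures=0
# nodes=3 quorum=2 tolerated_failures=1
# nodes=5 quorum=3 tolerated_failures=2
```

The two-node row shows why a plain two-node cluster cannot survive any failure without extra configuration, while three nodes tolerate one.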

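Special case: two-node cluster
```bash
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    # Required for a two-node cluster
    two_node: 1
}
```

Why a two-node cluster needs special configuration:
With 2 votes, quorum is floor(2 / 2) + 1 = 2, so by default losing either node costs the cluster its quorum and the surviving node stops managing resources.
two_node: 1 tells votequorum that this is a two-node cluster and lets a single surviving node keep quorum. It also enables wait_for_all by default, so both nodes must see each other once after startup before the cluster becomes quorate.
expected_votes: 2 still states a total of 2 votes; the two_node logic is what allows 1 vote to be enough. Because both halves of a partitioned two-node cluster stay quorate, working fencing (STONITH) is essential.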



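(5) Restart the Corosync service and verify
# Make sure /etc/corosync/corosync.conf is identical on all nodes first.
# Option 1: ask the running Corosync to re-read its configuration
pcs cluster reload corosync
# Option 2: restart the service on every node (changes to the totem crypto settings generally need a full restart)
systemctl restart corosync
Check the cluster status and confirm that all nodes have joined with no authentication errors:
# Check the ring status
corosync-cfgtool -s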
[root@node1:~]$corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
	id	= 192.168.1.11
	status	= ring 0 active with no faults
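
The ring check can also be scripted, for example in a post-change verification step. This is a minimal sketch that assumes the Corosync 2.x output format shown above (a "status = ..." line per ring); rings_healthy is a hypothetical helper name, and other Corosync versions print a different format.

```bash
# Hypothetical helper: read `corosync-cfgtool -s` output on stdin and succeed
# only if at least one ring is listed and every ring reports no faults.
rings_healthy() {
    awk '
        /^[ \t]*status[ \t]*=/ {
            rings++
            if ($0 !~ /active with no faults/) bad++
        }
        END { if (rings == 0 || bad > 0) exit 1 }
    '
}
```

For example: corosync-cfgtool -s | rings_healthy && echo "rings OK". Run it on each node, since the command only reports the local node's rings.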

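V. Configure Cluster Properties and STONITH

1. Set cluster properties (STONITH and no-quorum-policy)

# Enable STONITH, the mechanism that isolates (fences) or reboots a failed node to prevent split-brain
pcs property set stonith-enabled=true
# no-quorum-policy defines what the cluster does when it loses quorum. With freeze, a partition
# without quorum keeps its already-running resources but does not start resources or recover
# resources from nodes outside the partition, which prevents split-brain.
pcs property set no-quorum-policy=freeze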

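2. Configure the STONITH device (example using fence_ipmilan)

# Assumes the nodes use IPMI out-of-band management.
# An IPMI fence device can only power-cycle the node its BMC belongs to, so create one device
# per node; repeat this for node2 and node3 with each node's own BMC address and credentials.
pcs stonith create ipmi-fence-node1 fence_ipmilan \
  pcmk_host_list="node1" \
  ipaddr="192.168.1.50" \
  login="admin" passwd="secret" \
  op monitor interval=60s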

VI. Configure Highly Available Resources

1. Create the VIP resource

pcs resource create WebVIP ocf:heartbeat:IPaddr2 \
  ip=10.0.0.100 cidr_netmask=24 \
  op monitor interval=30s

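2. Create the Apache resource

# Install Apache (all nodes). Do not enable httpd with systemctl; Pacemaker starts and stops it.
yum install -y httpd

# Create the Apache resource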
pcs resource create WebServer systemd:httpd \
  op monitor interval=60s

3. Configure resource constraints

# Keep Apache on the same node as the VIP
pcs constraint colocation add WebServer with WebVIP

# Start order: VIP first, then Apache
pcs constraint order start WebVIP then start WebServer

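4. (Optional) Configure DRBD storage replication

# See the DRBD setup steps (each node needs a dedicated backing disk partition prepared in advance)
# This assumes a DRBD resource named drbd_web is already configured
pcs resource create WebData ocf:linbit:drbd drbd_resource=drbd_web
# pcs 0.10+ syntax; creates the promotable clone WebData-clone (on pcs 0.9 use "pcs resource master" instead)
pcs resource promotable WebData promoted-max=1 promoted-node-max=1
pcs resource create FsWeb Filesystem device="/dev/drbd0" directory="/var/www/html" fstype="ext4"

# Update the constraints: mount the filesystem only where DRBD is promoted, and only after promotion
pcs constraint colocation add FsWeb with master WebData-clone
pcs constraint colocation add WebServer with FsWeb
pcs constraint order promote WebData-clone then start FsWeb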

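VII. Test Failover

1. Simulate a node failure

# Stop the cluster services (Pacemaker and Corosync) on node1
pcs cluster stop node1

# Check whether the resources move to node2 or node3
pcs status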
pcs status

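2. Recover the node and rejoin the cluster

# Start the cluster services on node1
pcs cluster start node1

# Check where the resources run now. No location constraints are defined in this article, so they may stay on the node they failed over to.
pcs status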
pcs status

3. Verify the VIP and service availability

curl http://10.0.0.100
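
A single request only shows the state at one moment. To see how long the service is unreachable during a failover, the VIP from the environment section (10.0.0.100) can be probed repeatedly while a node is stopped. The following is a minimal sketch; probe_vip is a hypothetical helper name, and the 2-second timeout and default of 10 attempts are arbitrary choices for illustration.

```bash
# Hypothetical helper: request a URL repeatedly and count successes and failures.
# Usage: probe_vip URL [ATTEMPTS] [DELAY_SECONDS]
probe_vip() (
    url=$1
    attempts=${2:-10}
    delay=${3:-1}
    ok=0
    fail=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl return non-zero on HTTP errors as well as connection errors
        if curl --silent --fail --max-time 2 --output /dev/null "$url"; then
            ok=$((ok + 1))
        else
            fail=$((fail + 1))
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "ok=$ok fail=$fail"
    [ "$fail" -eq 0 ]
)
```

For example, start probe_vip http://10.0.0.100 60 1 from a client machine, then run pcs cluster stop on the node that currently holds the VIP; the fail count roughly approximates the outage in seconds.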

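VIII. Key Commands and Monitoring

1. Common commands

pcs status                  # Show the overall cluster status
pcs resource show           # Show resource status (pcs 0.10+: pcs resource status)
pcs constraint list         # List all constraints
crm_mon -1                  # Print a one-shot status snapshot (omit -1 to monitor continuously)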

2. Viewing logs

journalctl -u corosync -f   # Follow the Corosync log
journalctl -u pacemaker -f  # Follow the Pacemaker log

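IX. Common Problems and Solutions

1. Split-brain

Symptom: nodes cannot communicate with each other, and the same resource is started on more than one node.
Solution:
- Make sure STONITH is configured correctly.
- Check the quorum settings (a three-node cluster needs at least 2 nodes alive).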

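2. A resource fails to start

Check:
- Resource agent script permissions (e.g. ocf:heartbeat:IPaddr2).
- Address or port conflicts (e.g. whether the VIP is already in use elsewhere).
- SELinux or the firewall blocking the service from starting.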

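3. A node cannot join the cluster

Check:
- Whether the heartbeat network is reachable between nodes (ping 192.168.1.x).
- Whether the node names and IP addresses in the Corosync configuration file are correct, and whether /etc/corosync/authkey is identical on all nodes.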
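
The heartbeat reachability check can be run against all peers in one go. This is a minimal sketch assuming the node1-hb, node2-hb and node3-hb names from /etc/hosts and the Linux iputils ping options (-c count, -W timeout in seconds); check_heartbeat is a hypothetical helper name.

```bash
# Hypothetical helper: ping each given heartbeat host once and report the result.
# Exits non-zero if any host is unreachable.
check_heartbeat() (
    status=0
    for host in "$@"; do
        if ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
            echo "$host reachable"
        else
            echo "$host UNREACHABLE"
            status=1
        fi
    done
    exit "$status"
)
```

For example: check_heartbeat node1-hb node2-hb node3-hb. A successful ping only proves basic IP connectivity; Corosync's own UDP traffic on port 5405 can still be blocked by a firewall.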

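X. Summary

With the steps above, you have built a three-node highly available web cluster. The key points are:

Quorum configuration: a three-node cluster needs at least 2 nodes alive to manage resources.
STONITH: required to prevent split-brain from corrupting data.
Resource constraints: ensure the VIP and the service run on the same node.

This approach extends to other services (such as MySQL or NFS): replace the resource definitions to get service-level high availability.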

posted on 2025-04-02 09:21 LeeHang