Getting Started with Pacemaker: Understanding Failover Configuration

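Overview:
The following procedure creates a Pacemaker cluster that runs a service and moves that service from one node to the other when it becomes unavailable on the node running it. Working through it shows how to create a service in a two-node cluster and what happens when the service fails on the node where it is running.
The example configures a two-node Pacemaker cluster running an Apache HTTP server. You can then stop the Apache service on one node and verify that the service remains available.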

System version:

[root@node201 ~]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)

In this example:

The nodes are node01.hw.net and node02.hw.net.
The floating IP address is 192.168.1.120.
Node information:
[root@node01 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.1.11  node01 node01.hw.net
192.168.1.12  node02 node02.hw.net

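Prerequisites

  • Two nodes running CentOS 7.9 that can communicate with each other
  • A floating IP address on the same network as one of the nodes' statically assigned IP addresses
  • The name of each node listed in the /etc/hosts file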
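The third prerequisite can be checked before going any further. The following is a minimal sketch, not part of the original procedure: the helper names hosts_has_name and check_cluster_hosts are made up for illustration, and the check only looks for an exact name match on non-comment lines of a hosts-format file.

```shell
#!/bin/sh
# hosts_has_name FILE NAME  (hypothetical helper)
# Succeeds if NAME appears as a hostname or alias on a non-comment line of FILE.
hosts_has_name() {
    awk -v name="$2" '
        /^[[:space:]]*#/ { next }            # skip comment lines
        {
            for (i = 2; i <= NF; i++) {      # field 1 is the IP address
                if ($i ~ /^#/) break         # trailing comment: stop scanning
                if ($i == name) found = 1
            }
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# check_cluster_hosts FILE NAME...  (hypothetical helper)
# Prints one line per name; returns non-zero if any name is missing.
check_cluster_hosts() {
    file=$1; shift
    rc=0
    for n in "$@"; do
        if hosts_has_name "$file" "$n"; then
            echo "$n: ok"
        else
            echo "$n: missing"
            rc=1
        fi
    done
    return $rc
}
```

On each node this could be run as: check_cluster_hosts /etc/hosts node01.hw.net node02.hw.net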

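Procedure
On both nodes, install the Red Hat High Availability Add-On packages from the High Availability channel, then start and enable the pcsd service. The transcript below uses dnf, which was installed on these hosts; on a stock CentOS 7 system the equivalent yum commands apply.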

[root@node01 ~]# dnf install pcs pacemaker fence-agents-all

[root@node01 ~]# dnf list pcs pacemaker fence-agents-all
Repository epel is listed more than once in the configuration
Repository epel-debuginfo is listed more than once in the configuration
Repository epel-source is listed more than once in the configuration
Last metadata expiration check: 0:00:15 ago on Wed 08 May 2024 02:30:49 PM CST.
Installed Packages
fence-agents-all.x86_64                              4.2.1-41.el7                                        @System
pacemaker.x86_64                                     1.1.23-1.el7                                        @System
pcs.x86_64                                           0.9.169-3.el7.centos                                @System
Available Packages
fence-agents-all.x86_64                              4.2.1-41.el7_9.6                                    updates
pacemaker.x86_64                                     1.1.23-1.el7_9.1                                    updates
pcs.x86_64                                           0.9.169-3.el7.centos.3                              updates

[root@node01 ~]# systemctl start pcsd.service
[root@node01 ~]# systemctl enable pcsd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/pcsd.service to /usr/lib/systemd/system/pcsd.service.
[root@node01 ~]# systemctl status pcsd.service
● pcsd.service - PCS GUI and remote configuration interface
   Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2024-05-08 14:31:49 CST; 39s ago
     Docs: man:pcsd(8)
           man:pcs(8)
 Main PID: 27710 (pcsd)
   CGroup: /system.slice/pcsd.service
           └─27710 /usr/bin/ruby /usr/lib/pcsd/pcsd

May 08 14:31:48 node01 systemd[1]: Starting PCS GUI and remote configuration interface...
May 08 14:31:49 node01 systemd[1]: Started PCS GUI and remote configuration interface.

If you are running the firewalld daemon, enable the ports required by the Red Hat High Availability Add-On on both nodes.

# firewall-cmd --permanent --add-service=high-availability
# firewall-cmd --reload

Set a password for the hacluster user on both nodes of the cluster.

[root@node02 ~]# id hacluster
uid=189(hacluster) gid=189(haclient) groups=189(haclient)

[root@node02 ~]# passwd hacluster
Changing password for user hacluster.
New password:
BAD PASSWORD: The password is shorter than 8 characters
Retype new password:
passwd: all authentication tokens updated successfully.

On the node from which you will run pcs commands, authenticate the hacluster user for each node in the cluster.

# Remove the old configuration
[root@node01 ~]# pcs cluster  node  remove www.hw.net --force
www.hw.net: Stopping Cluster (pacemaker)...
www.hw.net: Successfully destroyed cluster
Error: Unable to update any nodes

[root@node01 ~]#  pcs cluster auth node01.hw.net node02.hw.net
Username: hacluster
Password:
node01.hw.net: Authorized
node02.hw.net: Authorized

[root@node02 ~]# pcs cluster auth node01.hw.net node02.hw.net
node01.hw.net: Already authorized
node02.hw.net: Already authorized

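Create a cluster named my_cluster with both nodes as members. This command creates and starts the cluster. Because pcs configuration commands take effect for the entire cluster, you only need to run it from one node.
Run the following command on one node of the cluster.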

[root@node01 ~]#  pcs cluster setup --name my_cluster --start node01.hw.net node02.hw.net
Destroying cluster on nodes: node01.hw.net, node02.hw.net...
node01.hw.net: Stopping Cluster (pacemaker)...
node02.hw.net: Stopping Cluster (pacemaker)...
node01.hw.net: Successfully destroyed cluster
node02.hw.net: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'node01.hw.net', 'node02.hw.net'
node01.hw.net: successful distribution of the file 'pacemaker_remote authkey'
node02.hw.net: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node01.hw.net: Succeeded
node02.hw.net: Succeeded

Starting cluster on nodes: node01.hw.net, node02.hw.net...
node01.hw.net: Starting Cluster (corosync)...
node02.hw.net: Starting Cluster (corosync)...
node01.hw.net: Starting Cluster (pacemaker)...
node02.hw.net: Starting Cluster (pacemaker)...

Synchronizing pcsd certificates on nodes node01.hw.net, node02.hw.net...
node01.hw.net: Success
node02.hw.net: Success
Restarting pcsd on the nodes in order to reload the certificates...
node01.hw.net: Success
node02.hw.net: Success

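A Red Hat High Availability cluster requires that fencing be configured. The reasons for this requirement are described in "Fencing in a Red Hat High Availability Cluster". This walkthrough only demonstrates how failover works in this configuration, so fencing is disabled by setting the stonith-enabled cluster option to false.
Warning
Do not use stonith-enabled=false on a production cluster. It tells the cluster to simply assume that failed nodes have been safely fenced.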

[root@node01 ~]# pcs status
Cluster name: my_cluster
WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: node01.hw.net (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Wed May  8 14:45:55 2024
Last change: Wed May  8 14:45:25 2024 by hacluster via crmd on node01.hw.net

2 nodes configured
0 resource instances configured

Online: [ node01.hw.net node02.hw.net ]
No resources
Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@node02 ~]# pcs property set stonith-enabled=false

[root@node02 ~]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: node01.hw.net (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Wed May  8 14:46:29 2024
Last change: Wed May  8 14:46:17 2024 by root via cibadmin on node01.hw.net

2 nodes configured
0 resource instances configured
Online: [ node01.hw.net node02.hw.net ]
No resources
Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
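The two pcs status transcripts above differ in one detail: the WARNINGS entry about stonith disappears once stonith-enabled is set to false. A small sketch of how a script might detect that state, assuming the warning text matches the pcs 0.9 output shown above (the function name stonith_warning_present is made up for illustration):

```shell
#!/bin/sh
# stonith_warning_present  (hypothetical helper)
# Reads `pcs status` text on stdin and succeeds if it contains the
# "no stonith devices" warning shown in the transcript above.
stonith_warning_present() {
    grep -q 'No stonith devices and stonith-enabled is not false'
}
```

A possible use on a cluster node: pcs status | stonith_warning_present && echo "fencing is neither configured nor disabled"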

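With the cluster created and fencing disabled, check the status of the cluster.
Note
When you run the pcs cluster status command, the output may differ slightly from the example while the system components are starting up.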

[root@node01 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: node01.hw.net (version 1.1.23-1.el7-9acf116022) - partition with quorum
 Last updated: Wed May  8 14:47:00 2024
 Last change: Wed May  8 14:46:17 2024 by root via cibadmin on node01.hw.net
 2 nodes configured
 0 resource instances configured

PCSD Status:
  node01.hw.net: Online
  node02.hw.net: Online

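On both nodes, install the Apache HTTP server and create a web page that displays a simple text message. If you are running the firewalld daemon, enable the ports required by httpd.
Note
Do not use systemctl enable to make any service that the cluster will manage start at system boot.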

# dnf install -y httpd wget
...
# firewall-cmd --permanent --add-service=http
# firewall-cmd --reload

# cat <<-END >/var/www/html/index.html
<html>
<body>My Test Site - $(hostname)</body>
</html>
END

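For the Apache resource agent to obtain the status of Apache, create the following file on each node in the cluster, in addition to the existing configuration, to enable the status server URL.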

# cat <<-END > /etc/httpd/conf.d/status.conf
<Location /server-status>
SetHandler server-status
Order deny,allow
Deny from all
Allow from 127.0.0.1
Allow from ::1
</Location>
END
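The Order, Deny and Allow directives above are the Apache 2.2 style of access control. Assuming httpd 2.4 (the version CentOS 7 ships), the same restriction can also be written with the Require directive. The sketch below wraps that variant in a function so the target path is explicit; the function name write_status_conf is made up for illustration.

```shell
#!/bin/sh
# write_status_conf PATH  (hypothetical helper)
# Writes an httpd 2.4 style status configuration to PATH.
# "Require local" limits access to requests coming from the local host.
write_status_conf() {
    cat > "$1" <<'EOF'
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
EOF
}
```

For example: write_status_conf /etc/httpd/conf.d/status.conf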

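Create the IPaddr2 and apache resources for the cluster to manage. The 'IPaddr2' resource is a floating IP address that must not already be associated with a physical node. If the NIC device for the 'IPaddr2' resource is not specified, the floating IP must be on the same network as the node's statically assigned IP address.
You can display a list of all available resource types with the pcs resource list command, and use the pcs resource describe resourcetype command to display the parameters you can set for a given resource type. For example, the following command displays the parameters you can set for a resource of type apache: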

# pcs resource describe apache
...

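In this example, the IP address resource and the apache resource are both configured as part of a group named apachegroup, which ensures that the resources run together on the same node.
Run the following commands on one node of the cluster: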

[root@node02 ~]# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=192.168.1.120 --group apachegroup
[root@node02 ~]# pcs resource create WebSite ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf statusurl="http://localhost/server-status" --group apachegroup
[root@node02 ~]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: node01.hw.net (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Wed May  8 14:54:16 2024
Last change: Wed May  8 14:54:06 2024 by root via cibadmin on node02.hw.net

2 nodes configured
2 resource instances configured

Online: [ node01.hw.net node02.hw.net ]

Full list of resources:

 Resource Group: apachegroup
     ClusterIP  (ocf::heartbeat:IPaddr2):       Started node01.hw.net
     WebSite    (ocf::heartbeat:apache):        Started node01.hw.net

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
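The node hosting each resource can be read off the "Started" column of the status output above. As a rough sketch, assuming the Pacemaker 1.1 text layout shown in this transcript (other versions format resources differently), a helper like the following extracts it; the name resource_node is made up for illustration.

```shell
#!/bin/sh
# resource_node NAME  (hypothetical helper)
# Reads `pcs status` text on stdin and prints the node on which the
# resource NAME is reported as Started. Prints nothing if not found.
resource_node() {
    awk -v res="$1" '
        NF >= 3 && $1 == res && $(NF-1) == "Started" { print $NF; exit }
    '
}
```

A possible use: pcs status | resource_node WebSite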

# The VIP and the httpd service are running on node01
[root@node01 ~]# ps -ef |grep httpd
root      4868     1  0 14:54 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    4869  4868  0 14:54 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    4870  4868  0 14:54 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    4871  4868  0 14:54 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    4872  4868  0 14:54 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    4873  4868  0 14:54 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid

[root@node01 ~]# ip add sh
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:6c:30:8f brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.11/24 brd 192.168.1.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.120/24 brd 192.168.1.255 scope global secondary enp0s3
       valid_lft forever preferred_lft forever

[root@node02 ~]# ps -ef |grep httpd
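Checking by eye which node holds the floating IP works for two nodes, but the same check can be scripted. A minimal sketch, assuming the `ip addr show` output format seen above; the function name has_ip is made up for illustration. It compares the address exactly, so 192.168.1.12 does not match 192.168.1.120.

```shell
#!/bin/sh
# has_ip ADDRESS  (hypothetical helper)
# Reads `ip addr show` text on stdin and succeeds if ADDRESS is
# configured as an IPv4 address on any interface.
has_ip() {
    awk -v ip="$1" '
        $1 == "inet" { split($2, a, "/"); if (a[1] == ip) found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

A possible use on either node: ip addr show | has_ip 192.168.1.120 && echo "this node holds the floating IP"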

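Note that in this instance, the apachegroup service is running on node01.hw.net.

Access the website you created, stop the service on the node where it is running, and see how the service fails over to the second node.
Point a browser at the website using the floating IP address you configured. It should display the text message you defined, including the name of the node on which the website is running.
Stop the Apache web service. Using killall -9 simulates an application-level crash.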
# killall -9 httpd

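Check the cluster status. You should see that stopping the web service caused a failed action, but that the cluster software restarted the service on the node where it had been running, so you should still be able to access the website.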

As shown below, after the httpd service on node01 went down it was restarted automatically, without a failover:

[root@node01 ~]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: node01.hw.net (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Wed May  8 14:57:38 2024
Last change: Wed May  8 14:54:06 2024 by root via cibadmin on node02.hw.net

2 nodes configured
2 resource instances configured

Online: [ node01.hw.net node02.hw.net ]

Full list of resources:

 Resource Group: apachegroup
     ClusterIP  (ocf::heartbeat:IPaddr2):       Started node01.hw.net
     WebSite    (ocf::heartbeat:apache):        Started node01.hw.net

Failed Resource Actions:
* WebSite_monitor_10000 on node01.hw.net 'not running' (7): call=13, status=complete, exitreason='',
    last-rc-change='Wed May  8 14:57:27 2024', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
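The "Failed Resource Actions" section is what distinguishes this status from a healthy one. As an illustrative sketch, assuming the Pacemaker 1.1.23 text format shown above, the following helper lists each failed operation together with its node; the name failed_actions is made up for illustration.

```shell
#!/bin/sh
# failed_actions  (hypothetical helper)
# Reads `pcs status` text on stdin and prints "<operation> <node>" for
# every entry in the "Failed Resource Actions:" section.
failed_actions() {
    awk '
        /^Failed Resource Actions:/    { in_section = 1; next }
        in_section && /^[[:space:]]*$/ { in_section = 0 }   # blank line ends the section
        in_section && $1 == "*"        { print $2, $4 }     # "* <op> on <node> ..."
    '
}
```

A possible use: pcs status | failed_actions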

Once the service is up and running again, clear the failure status.

[root@node01 ~]# pcs resource cleanup WebSite
Cleaned up ClusterIP on node02.hw.net
Cleaned up ClusterIP on node01.hw.net
Cleaned up WebSite on node02.hw.net
Cleaned up WebSite on node01.hw.net
Waiting for 1 reply from the CRMd. OK

[root@node01 ~]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: node01.hw.net (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Wed May  8 14:59:00 2024
Last change: Wed May  8 14:58:58 2024 by hacluster via crmd on node01.hw.net

2 nodes configured
2 resource instances configured

Online: [ node01.hw.net node02.hw.net ]

Full list of resources:

 Resource Group: apachegroup
     ClusterIP  (ocf::heartbeat:IPaddr2):       Started node01.hw.net
     WebSite    (ocf::heartbeat:apache):        Started node01.hw.net

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

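Put the node on which the service is running into standby mode. Note that because fencing is disabled, we cannot effectively simulate a node-level failure (such as pulling a power cable); fencing is required for the cluster to recover from such situations.
[root@node01 ~]# pcs node standby node01.hw.net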

Check the status of the cluster and note where the service is now running.

[root@node01 ~]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: node01.hw.net (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Wed May  8 15:00:35 2024
Last change: Wed May  8 14:59:44 2024 by root via cibadmin on node01.hw.net

2 nodes configured
2 resource instances configured

Node node01.hw.net: standby
Online: [ node02.hw.net ]

Full list of resources:

 Resource Group: apachegroup
     ClusterIP  (ocf::heartbeat:IPaddr2):       Started node02.hw.net
     WebSite    (ocf::heartbeat:apache):        Started node02.hw.net

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
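In the status above, the standby node is reported on its own "Node ...: standby" line rather than in the Online list. A small sketch for picking such nodes out of the text, assuming the Pacemaker 1.1 format shown here; the function name standby_nodes is made up for illustration.

```shell
#!/bin/sh
# standby_nodes  (hypothetical helper)
# Reads `pcs status` text on stdin and prints the name of every node
# reported on a "Node <name>: standby" line.
standby_nodes() {
    awk '$1 == "Node" && $NF == "standby" { sub(/:$/, "", $2); print $2 }'
}
```

A possible use: pcs status | standby_nodes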

# The VIP and the httpd service have moved to node02
[root@node02 ~]# ip add sh
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:7c:f2:c3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.12/24 brd 192.168.1.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.120/24 brd 192.168.1.255 scope global secondary enp0s3
       valid_lft forever preferred_lft forever


[root@node02 ~]# ps -ef |grep httpd
root     28894     1  0 14:59 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache   29023 28894  0 14:59 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache   29024 28894  0 14:59 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache   29025 28894  0 14:59 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache   29026 28894  0 14:59 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache   29027 28894  0 14:59 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid

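Access the website. The service should still be available, and the displayed message should indicate the node on which the service is now running.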
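Because the test page was generated with $(hostname) expanded at creation time, each node serves a page carrying its own name, so a client can tell which node answered. The sketch below extracts that name from the page body; it assumes the exact "My Test Site - " text used when the page was created, and the function name site_node is made up for illustration.

```shell
#!/bin/sh
# site_node  (hypothetical helper)
# Reads the test page HTML on stdin and prints the hostname that
# follows "My Test Site - ". Prints nothing if the marker is absent.
site_node() {
    sed -n 's/.*My Test Site - \([^<]*\)<.*/\1/p'
}
```

From a client machine, something like the following could be left running during the failover: while sleep 1; do curl -s --max-time 2 http://192.168.1.120/ | site_node; done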

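To restore cluster services to the first node, take the node out of standby mode. This will not necessarily move the service back to that node.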

[root@node01 ~]# pcs node unstandby node01.hw.net
[root@node01 ~]# pcs status
Cluster name: my_cluster
Stack: corosync
Current DC: node01.hw.net (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Wed May  8 15:03:00 2024
Last change: Wed May  8 15:02:56 2024 by root via cibadmin on node01.hw.net

2 nodes configured
2 resource instances configured

Online: [ node01.hw.net node02.hw.net ]

Full list of resources:

 Resource Group: apachegroup
     ClusterIP  (ocf::heartbeat:IPaddr2):       Started node02.hw.net
     WebSite    (ocf::heartbeat:apache):        Started node02.hw.net

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Reboot the node02 system; the httpd service then switches to node01.

Finally, to clean up, stop the cluster services on both nodes.

[root@node01 ~]# pcs cluster stop --all
node02.hw.net: Stopping Cluster (pacemaker)...
node01.hw.net: Stopping Cluster (pacemaker)...
node02.hw.net: Stopping Cluster (corosync)...
node01.hw.net: Stopping Cluster (corosync)...
posted @ 2024-05-09 10:34  天涯客1224