Starting and Monitoring consul-server

I. Use consul-server for service discovery to dynamically manage node_exporter on every node

wget https://releases.hashicorp.com/consul/1.15.4/consul_1.15.4_linux_amd64.zip
unzip consul_1.15.4_linux_amd64.zip
mv consul /usr/local/bin/
consul version 
useradd -M -s /sbin/nologin consul
mkdir -p /etc/consul.d /opt/consul/data
chown -R consul:consul /opt/consul /etc/consul.d

Create the server configuration file

cat > /etc/consul.d/server.hcl <<'EOF'
datacenter = "dc1"
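node_name  = "consul_server_BJ"   # underscores are not valid in DNS names, so Consul warns that this node cannot be looked up via DNS; use hyphens if that matters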
server     = true
bootstrap_expect = 1
client_addr = "0.0.0.0"
#bind_addr   = "{{ GetInterfaceIP \"eth0\" }}"   # resolves the eth0 IP automatically; a literal address also works
bind_addr   = "192.168.70.15"
data_dir    = "/opt/consul/data"
ui          = true
ports {
  http = 8500
}
EOF
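
Before starting the agent, the configuration directory can be checked with `consul validate`. The wrapper below is a minimal sketch; the function name `validate_consul_config` is made up for illustration, and the default directory is the one used in this post.

```shell
#!/bin/bash
# Validate a Consul configuration directory before (re)starting the agent.
# validate_consul_config is a hypothetical helper name, not a Consul command.
validate_consul_config() {
  local dir="${1:-/etc/consul.d}"
  if ! command -v consul >/dev/null 2>&1; then
    echo "consul binary not found in PATH" >&2
    return 127
  fi
  # "consul validate" parses the files and exits non-zero on errors.
  consul validate "$dir"
}

# Usage:
#   validate_consul_config /etc/consul.d && echo "config OK"
```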

Run it as a systemd service

cat > /etc/systemd/system/consul.service <<'EOF'
[Unit]
Description=Consul Service Discovery
After=network.target

[Service]
User=consul
Group=consul
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now consul
systemctl status consul -l

Open the web UI: http://192.168.70.15:8500
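
Besides the web UI, the HTTP API can confirm that the server has elected a leader. The endpoint `/v1/status/leader` returns a JSON string such as `"192.168.70.15:8300"`, or an empty string while no leader exists. The helper below is a small sketch; its name `leader_is_elected` is an assumption for illustration.

```shell
#!/bin/bash
# Succeeds when the body returned by /v1/status/leader names a leader.
# leader_is_elected is a hypothetical helper name.
leader_is_elected() {
  local body
  # Strip the JSON quotes so that an empty string ("") becomes truly empty.
  body="$(printf '%s' "$1" | tr -d '"')"
  [ -n "$body" ]
}

# Usage (assumes the server from this post is reachable):
#   if leader_is_elected "$(curl -s http://192.168.70.15:8500/v1/status/leader)"; then
#     echo "leader elected"
#   else
#     echo "no leader yet"
#   fi
```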

Example service definition for registering node_exporter with service discovery:

{
  "service": {
    "name": "node-exporter",
    "tags": ["prometheus", "node-exporter", "metrics"],
    "port": 9100,
    "meta": {
      "metrics_path": "/metrics",
      "scheme": "http",
      "job": "node0000747"
    },
    "check": {
      "name": "node-exporter-health",
      "http": "http://localhost:9100/metrics",
      "interval": "30s",
      "timeout": "5s"
    }
  }
}
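
One way to load a definition like this is `consul services register`, which sends the file to the local agent. The sketch below assumes a Consul agent runs on the node that hosts node_exporter (the health check targets localhost) and that the JSON was saved as `./node-exporter.json`; both the file name and the function name are illustrative.

```shell
#!/bin/bash
# Register a service definition file with the local Consul agent.
# register_service_file and the example path are hypothetical names.
register_service_file() {
  local file="$1"
  if [ ! -r "$file" ]; then
    echo "service definition not readable: $file" >&2
    return 1
  fi
  # By default the CLI talks to the agent at 127.0.0.1:8500.
  consul services register "$file"
}

# Usage, on the node where node_exporter listens on port 9100:
#   register_service_file ./node-exporter.json
```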


II. Monitoring consul-server itself: fetch the member list from the HTTP API with curl, write it to a file, and let node_exporter read that file

1. On the Consul server, create the script /usr/local/bin/consul_members_exporter.sh

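#!/bin/bash
# Query the Consul API for member status and write Prometheus-format metrics.

OUTPUT_DIR="/var/lib/node_exporter/textfile_collector"
OUTPUT_FILE="$OUTPUT_DIR/consul_members.prom"
# Temporary name does not end in .prom, so node_exporter ignores it.
TMP_FILE="$OUTPUT_FILE.$$"

mkdir -p "$OUTPUT_DIR"

# Fetch the member list once so both metrics come from the same snapshot.
if ! MEMBERS_JSON="$(curl -sf --max-time 5 http://127.0.0.1:8500/v1/agent/members)"; then
  # Consul is unreachable: remove the old file so stale "alive" samples do not linger.
  rm -f "$OUTPUT_FILE"
  exit 1
fi

{
  # Metric header
  echo '# HELP consul_serf_lan_member_status Serf LAN member status: 1=alive, 2=leaving, 3=left, 4=failed'
  echo '# TYPE consul_serf_lan_member_status gauge'
  # One sample per member, labelled with node name and address
  printf '%s\n' "$MEMBERS_JSON" | jq -r '
    .[] |
    "consul_serf_lan_member_status{member=\"" + .Name + "\",addr=\"" + .Addr + "\"} " + (.Status | tostring)
  '
  # Total member count
  echo '# HELP consul_serf_lan_members_total Total number of LAN members'
  echo '# TYPE consul_serf_lan_members_total gauge'
  echo "consul_serf_lan_members_total $(printf '%s\n' "$MEMBERS_JSON" | jq 'length')"
} > "$TMP_FILE"

# Rename atomically so node_exporter never reads a half-written file.
mv "$TMP_FILE" "$OUTPUT_FILE"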

Make it executable and test it:

chmod +x /usr/local/bin/consul_members_exporter.sh
bash /usr/local/bin/consul_members_exporter.sh
cat /var/lib/node_exporter/textfile_collector/consul_members.prom

Add a cron job

echo '* * * * * root /usr/local/bin/consul_members_exporter.sh > /dev/null 2>&1' > /etc/cron.d/consul-members

Have node_exporter load the textfile directory

node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

If node_exporter is managed by systemd, edit /etc/systemd/system/node_exporter.service:

[Service]
ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

systemctl daemon-reload
systemctl restart node_exporter
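
After the restart it is worth confirming that node_exporter actually exposes the new samples. The filter below is a small sketch that keeps the Consul member metrics plus `node_textfile_scrape_error`, the flag the textfile collector sets when a .prom file cannot be parsed. The function name is hypothetical, and port 9100 is assumed to be the node_exporter default.

```shell
#!/bin/bash
# Read a node_exporter /metrics dump on stdin and keep only the Consul member
# samples and the textfile collector's error flag.
# filter_consul_member_metrics is a hypothetical helper name.
filter_consul_member_metrics() {
  # "|| true" keeps the exit status zero when nothing matches.
  grep -E '^(consul_serf_lan_member|node_textfile_scrape_error)' || true
}

# Usage:
#   curl -s http://127.0.0.1:9100/metrics | filter_consul_member_metrics
```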

2. Prometheus scraping and alerting

Prometheus picks up consul_serf_lan_member_status automatically when it scrapes node_exporter.
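
That scrape only happens once Prometheus knows the node_exporter targets. A minimal sketch of a scrape job that discovers them through Consul follows, written as a heredoc in the same style as the rest of this post. The job name and the output path are assumptions; the server address, the service name, and the meta keys (`job`, `metrics_path`, `scheme`) come from the service definition in part I.

```shell
#!/bin/bash
# Write a scrape-job fragment that discovers node_exporter through Consul.
# OUT is a hypothetical path; merge the fragment into prometheus.yml by hand.
OUT="${OUT:-./consul-sd-scrape.yml}"

cat > "$OUT" <<'EOF'
scrape_configs:
  - job_name: "consul-node-exporter"      # assumed job name
    consul_sd_configs:
      - server: "192.168.70.15:8500"      # Consul server address from this post
        services: ["node-exporter"]       # service name from the definition
    relabel_configs:
      # Consul service meta is exposed as __meta_consul_service_metadata_<key>.
      - source_labels: [__meta_consul_service_metadata_job]
        regex: "(.+)"
        target_label: job
      - source_labels: [__meta_consul_service_metadata_metrics_path]
        regex: "(.+)"
        target_label: __metrics_path__
      - source_labels: [__meta_consul_service_metadata_scheme]
        regex: "(.+)"
        target_label: __scheme__
      # Keep the Consul node name as a label for dashboards and alerts.
      - source_labels: [__meta_consul_node]
        target_label: node
EOF
echo "wrote $OUT"
```

Each relabel rule uses the regex `(.+)`, so a missing meta key leaves the default label untouched. The merged configuration can be checked with `promtool check config` before reloading Prometheus.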
Alert rules:

groups:
  - name: consul-members
    rules:
      - alert: ConsulNodeNotAlive
        expr: consul_serf_lan_member_status != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Consul node {{ $labels.member }} is in an abnormal state"
          description: "Node {{ $labels.member }} ({{ $labels.addr }}) currently reports status code {{ $value }} (1=alive, 2=leaving, 3=left, 4=failed)"

Alternatively, write the rules like this to avoid overreacting to the left state:

groups:
  - name: consul-members
    rules:
      # Alert only on failed (unreachable) and leaving; do not alert on left (departure confirmed)
      - alert: ConsulNodeNotAlive
        expr: consul_serf_lan_member_status == 4 or consul_serf_lan_member_status == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Consul node {{ $labels.member }} is in an abnormal state"
          description: "Node {{ $labels.member }} reports status {{ $value }} (2=leaving, 4=failed)"

      # Handle left separately: tell the administrator the node is offline, without treating it as an urgent failure
      - alert: ConsulNodeLeft
        expr: consul_serf_lan_member_status == 3
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Node {{ $labels.member }} is marked as left"
          description: "This node was forced to leave the cluster and will be cleaned up automatically after the tombstone timeout"
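
The numeric codes in these expressions are easier to read with their names attached. The helper below is a small sketch that maps the four codes listed in the metric's HELP text to names; `serf_status_name` is a hypothetical function, and any other value is reported as "unknown".

```shell
#!/bin/bash
# Translate a Serf member status code into its name, using the mapping
# from the HELP line of consul_serf_lan_member_status.
# serf_status_name is a hypothetical helper name.
serf_status_name() {
  case "$1" in
    1) echo "alive" ;;
    2) echo "leaving" ;;
    3) echo "left" ;;
    4) echo "failed" ;;
    *) echo "unknown" ;;
  esac
}

# Usage: print each member with a readable status.
#   curl -s http://127.0.0.1:8500/v1/agent/members \
#     | jq -r '.[] | "\(.Name) \(.Status)"' \
#     | while read -r name code; do
#         echo "$name $(serf_status_name "$code")"
#       done
```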

 

 

posted @ 2026-04-26 01:09  ZANAN