Starting and monitoring consul-server
I. Using consul-server for service discovery to dynamically manage node_exporter on every node
wget https://releases.hashicorp.com/consul/1.15.4/consul_1.15.4_linux_amd64.zip
unzip consul_1.15.4_linux_amd64.zip
mv consul /usr/local/bin/
consul version
useradd -M -s /sbin/nologin consul
mkdir -p /etc/consul.d /opt/consul/data
chown -R consul:consul /opt/consul /etc/consul.d
Create the startup configuration file
cat > /etc/consul.d/server.hcl <<'EOF'
datacenter = "dc1"
node_name = "consul_server_BJ"
server = true
bootstrap_expect = 1
client_addr = "0.0.0.0"
# Take the eth0 IP automatically, or set the address by hand as done below
#bind_addr = "{{ GetInterfaceIP \"eth0\" }}"
bind_addr = "192.168.70.15"
data_dir = "/opt/consul/data"
ui = true
ports {
  http = 8500
}
EOF
Run it as a systemd service
cat > /etc/systemd/system/consul.service <<'EOF'
[Unit]
Description=Consul Service Discovery
After=network.target

[Service]
User=consul
Group=consul
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now consul
systemctl status consul -l
Open the web UI: http://192.168.70.15:8500
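A command-line check can complement the UI. The sketch below assumes the agent answers on 127.0.0.1:8500 as configured above; `has_leader` is a hypothetical helper name, not part of Consul. It relies on the `/v1/status/leader` endpoint, which returns the leader address as a JSON string and an empty string while no leader is elected.

```shell
#!/bin/bash
# has_leader RESPONSE
# Succeeds when the /v1/status/leader response names a leader.
# The endpoint returns a JSON string such as "192.168.70.15:8300",
# or "" while no leader has been elected.
has_leader() {
  local response="$1"
  local addr
  # Strip the surrounding double quotes and any whitespace.
  addr=$(printf '%s' "$response" | tr -d '" \n\r')
  [ -n "$addr" ]
}

# Usage against a live agent:
#   has_leader "$(curl -s http://127.0.0.1:8500/v1/status/leader)" && echo "leader elected"
```

With `bootstrap_expect = 1` the single server should elect itself shortly after startup, so a failing check a minute after `systemctl enable --now consul` is worth investigating in the service logs.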
Example service-discovery registration template for node_exporter:
{
  "service": {
    "name": "node-exporter",
    "tags": ["prometheus", "node-exporter", "metrics"],
    "port": 9100,
    "meta": {
      "metrics_path": "/metrics",
      "scheme": "http",
      "job": "node0000747"
    },
    "check": {
      "name": "node-exporter-health",
      "http": "http://localhost:9100/metrics",
      "interval": "30s",
      "timeout": "5s"
    }
  }
}
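Since only the `job` value and possibly the port differ between nodes, the template can be rendered per node. This is a minimal sketch: the function name `render_node_exporter_service` and the target file `/etc/consul.d/node-exporter.json` are assumptions for illustration, and it presumes a Consul agent runs on the node and loads `/etc/consul.d`.

```shell
#!/bin/bash
# render_node_exporter_service JOB [PORT]
# Prints a Consul service definition for one node_exporter instance,
# using the same fields as the template above. PORT defaults to 9100.
render_node_exporter_service() {
  local job="$1"
  local port="${2:-9100}"
  cat <<EOF
{
  "service": {
    "name": "node-exporter",
    "tags": ["prometheus", "node-exporter", "metrics"],
    "port": ${port},
    "meta": {
      "metrics_path": "/metrics",
      "scheme": "http",
      "job": "${job}"
    },
    "check": {
      "name": "node-exporter-health",
      "http": "http://localhost:${port}/metrics",
      "interval": "30s",
      "timeout": "5s"
    }
  }
}
EOF
}

# Example (hypothetical file name), run on the node itself:
#   render_node_exporter_service node0000747 > /etc/consul.d/node-exporter.json
#   consul reload
```

Placing the file in the agent's config directory and running `consul reload` is one way to register the service; `consul services register <file>` accepts the same file format.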

II. Monitoring consul-server itself: fetch the member list from the HTTP API with curl, write it to a file, and let node_exporter read that file
1. On the Consul server, create the script /usr/local/bin/consul_members_exporter.sh
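#!/bin/bash
# Fetch member status from the Consul API and write it as Prometheus text-format metrics.
set -euo pipefail

OUTPUT_DIR="/var/lib/node_exporter/textfile_collector"
OUTPUT_FILE="$OUTPUT_DIR/consul_members.prom"
# Temporary name does not end in .prom, so node_exporter ignores it.
TMP_FILE="$OUTPUT_FILE.$$"

mkdir -p "$OUTPUT_DIR"
trap 'rm -f "$TMP_FILE"' EXIT

# Query the API once and reuse the response; if curl fails, the script stops
# here and the previous metrics file is left untouched.
members=$(curl -sf http://127.0.0.1:8500/v1/agent/members)

{
  echo '# HELP consul_serf_lan_member_status Serf LAN member status: 1=alive, 2=leaving, 3=left, 4=failed'
  echo '# TYPE consul_serf_lan_member_status gauge'
  # One sample per member, labelled with its name and address
  jq -r '.[] | "consul_serf_lan_member_status{member=\"" + .Name + "\",addr=\"" + .Addr + "\"} " + (.Status | tostring)' <<<"$members"
  # Total number of members
  echo '# HELP consul_serf_lan_members_total Total number of LAN members'
  echo '# TYPE consul_serf_lan_members_total gauge'
  echo "consul_serf_lan_members_total $(jq 'length' <<<"$members")"
} > "$TMP_FILE"

# Rename into place so node_exporter never reads a half-written file.
mv "$TMP_FILE" "$OUTPUT_FILE"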
Make it executable and test it:
chmod +x /usr/local/bin/consul_members_exporter.sh
bash /usr/local/bin/consul_members_exporter.sh
cat /var/lib/node_exporter/textfile_collector/consul_members.prom
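Beyond reading the file by eye, the generated metrics can be summarized locally. The sketch below is an illustration only: `count_not_alive` is a hypothetical helper that counts `consul_serf_lan_member_status` samples whose value is not 1 (alive), assuming label values contain no spaces.

```shell
#!/bin/bash
# count_not_alive
# Reads Prometheus text-format lines on stdin and prints how many
# consul_serf_lan_member_status samples have a value other than 1 (alive).
count_not_alive() {
  awk '
    # Sample lines start with the metric name followed by "{";
    # the sample value is the last whitespace-separated field.
    /^consul_serf_lan_member_status[{]/ && $NF != 1 { n++ }
    END { print n + 0 }
  '
}

# Usage on the server:
#   count_not_alive < /var/lib/node_exporter/textfile_collector/consul_members.prom
```

A result of 0 on a healthy cluster is a quick sign that both the script and the API response look as expected before the cron job is enabled.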
Add a cron job that runs the script every minute
echo '* * * * * root /usr/local/bin/consul_members_exporter.sh > /dev/null 2>&1' > /etc/cron.d/consul-members
Have node_exporter load the textfile directory
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
If node_exporter is managed by systemd, edit /etc/systemd/system/node_exporter.service:
[Service]
ExecStart=/usr/local/bin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
systemctl daemon-reload
systemctl restart node_exporter
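2. Prometheus scraping and alerting
Once node_exporter exposes the textfile metrics, Prometheus picks up consul_serf_lan_member_status through its regular node_exporter scrape of the Consul server; no separate scrape target is needed.
Alert rules: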
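This presumes Prometheus already scrapes node_exporter on the Consul server. Since part I registers `node-exporter` in Consul, one possible scrape job uses Consul service discovery. The sketch below only prints such a job; the function name `print_consul_sd_job`, the job name `consul-node-exporter`, and the `consul_node` label are assumptions, and the output would have to be merged by hand into the `scrape_configs` section of prometheus.yml.

```shell
#!/bin/bash
# print_consul_sd_job [SERVER]
# Prints a Prometheus scrape job that discovers the node-exporter service
# through Consul. SERVER defaults to the Consul address used in this post.
print_consul_sd_job() {
  local server="${1:-192.168.70.15:8500}"
  cat <<EOF
scrape_configs:
  - job_name: "consul-node-exporter"
    consul_sd_configs:
      - server: "${server}"
        services: ["node-exporter"]
    relabel_configs:
      # Carry the Consul node name into a label.
      - source_labels: [__meta_consul_node]
        target_label: consul_node
      # Use the "job" key from the service meta block as the job label.
      - source_labels: [__meta_consul_service_metadata_job]
        regex: (.+)
        target_label: job
EOF
}

# Usage: review the output before merging it into prometheus.yml
#   print_consul_sd_job 192.168.70.15:8500
```

The `regex: (.+)` guard leaves the default job label in place for services that carry no `job` entry in their meta block.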
groups:
  - name: consul-members
    rules:
      - alert: ConsulNodeNotAlive
        expr: consul_serf_lan_member_status != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Consul node {{ $labels.member }} is in an abnormal state"
          description: "Node {{ $labels.member }} ({{ $labels.addr }}) currently has status code {{ $value }}; 1=alive, 2=leaving, 3=left, 4=failed"
Alternatively, write the rules as follows to avoid overreacting to the left state:
groups:
  - name: consul-members
    rules:
      # Alert only on failed (unreachable) and leaving; do not alert on left (confirmed departure)
      - alert: ConsulNodeNotAlive
        expr: consul_serf_lan_member_status == 4 or consul_serf_lan_member_status == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Consul node {{ $labels.member }} is in an abnormal state"
          description: "Node {{ $labels.member }} has status {{ $value }} (2=leaving, 4=failed)"
      # Handle left separately: tell the administrator the node is offline, without treating it as an urgent failure
      - alert: ConsulNodeLeft
        expr: consul_serf_lan_member_status == 3
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Node {{ $labels.member }} is marked as left"
          description: "This node has left the cluster (gracefully or via force-leave) and will be cleaned up automatically once the tombstone timeout expires"
