
AlertManager Alerting Mechanism

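The previous [article](https://www.cnblogs.com/zydev/p/16848444.html) covered how Prometheus sends out alerts.

Alerts sent that way, however, get no grouping, no inhibition, and no handling of repeated alerts; providing those features is the job of the AlertManager component.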

Here we use AlertManager to receive alerts and examine the effect of three configuration parameters: group_wait, group_interval, and repeat_interval.

1. Start a custom program

We use a webhook to receive the alerts that AlertManager passes along.
webhook-receiver.go

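package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

// AlertHandler prints the receive time and the raw body of each webhook call.
func AlertHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body) // io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+)
	if err != nil {
		fmt.Printf("read body err, %v\n", err)
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}
	fmt.Println(time.Now())
	fmt.Printf("%s\n\n", string(body))
	fmt.Fprintf(w, "Hi there, I love %s!", r.URL.Path[1:])
}

func main() {
	// The path and port must match the webhook url configured in alertmanager.yml.
	http.HandleFunc("/alert/webhook", AlertHandler)
	log.Fatal(http.ListenAndServe(":8090", nil))
}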

2. Configure alertmanager

Configure alertmanager to deliver alerts to the program above.

alertmanager.yml

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: webhook
  webhook_configs:
  - url: http://192.168.1.104:8090/alert/webhook

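3. Send alerts

alertmanager describes its API in https://github.com/prometheus/alertmanager/blob/main/api/v2/openapi.yaml;
paste that file into https://editor.swagger.io to browse all of the endpoints.

Here we use Postman to send a POST request to http://192.168.1.200:9093/api/v2/alerts to create an alert.
EndsAt must be later than the time of sending; otherwise alertmanager will automatically ignore the alert.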

[
    {
        "Labels": {
            "alertname": "NodeCpuPressure",
            "IP": "192.168.2.101"
        },
        "Annotations": {
            "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "StartsAt": "2022-11-02T06:33:44.662Z", 
        "EndsAt": "2022-11-02T08:33:44.662Z"
    }
]
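The same request can also be sent from Go instead of Postman. The following is only a minimal sketch: the type `postableAlert`, the helpers `newAlert` and `postAlerts`, and the constant `defaultTTL` are names made up for this illustration, and the lowercase JSON field names follow the v2 OpenAPI file mentioned above. EndsAt is computed from the current time so that it is always later than the time of sending.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// defaultTTL is an illustrative lifetime for a test alert (an assumption, not from the article).
const defaultTTL = 2 * time.Hour

// postableAlert carries the same four fields as the request body above,
// using the lowercase field names from the v2 OpenAPI spec.
type postableAlert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

// newAlert builds an alert that starts now and ends ttl from now,
// so EndsAt is always later than the time of sending.
func newAlert(name, ip string, ttl time.Duration) postableAlert {
	now := time.Now().UTC()
	return postableAlert{
		Labels:      map[string]string{"alertname": name, "IP": ip},
		Annotations: map[string]string{"summary": fmt.Sprintf("%s, IP: %s", name, ip)},
		StartsAt:    now,
		EndsAt:      now.Add(ttl),
	}
}

// postAlerts sends the alerts as a JSON array to <baseURL>/api/v2/alerts.
func postAlerts(baseURL string, alerts []postableAlert) error {
	body, err := json.Marshal(alerts)
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Post(baseURL+"/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	alert := newAlert("NodeCpuPressure", "192.168.2.101", defaultTTL)
	if err := postAlerts("http://192.168.1.200:9093", []postableAlert{alert}); err != nil {
		fmt.Println("post failed:", err)
	}
}
```

The base URL is the alertmanager address used throughout this article; adjust it for your own environment.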

Next, we call the AlertManager API to query Alerts (GET /api/v2/alerts) and Groups (GET /api/v2/alerts/groups), either directly from a browser or with curl on the command line. The Alerts query returns:

[
    {
        "annotations": {
            "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "endsAt": "2022-11-02T08:33:44.662Z",
        "fingerprint": "27e1a08813b1ec3b",
        "receivers": [
            {
                "name": "webhook"
            }
        ],
        "startsAt": "2022-11-02T06:33:44.662Z",
        "status": {
            "inhibitedBy": [],
            "silencedBy": [],
            "state": "active"
        },
        "updatedAt": "2022-11-02T06:37:02.089Z",
        "labels": {
            "IP": "192.168.2.101",
            "alertname": "NodeCpuPressure"
        }
    }
]

The Groups query returns:

[
    {
        "alerts": [
            {
                "annotations": {
                    "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
                },
                "endsAt": "2022-11-02T08:33:44.662Z",
                "fingerprint": "27e1a08813b1ec3b",
                "receivers": [
                    {
                        "name": "webhook"
                    }
                ],
                "startsAt": "2022-11-02T06:33:44.662Z",
                "status": {
                    "inhibitedBy": [],
                    "silencedBy": [],
                    "state": "active"
                },
                "updatedAt": "2022-11-02T06:37:02.089Z",
                "labels": {
                    "IP": "192.168.2.101",
                    "alertname": "NodeCpuPressure"
                }
            }
        ],
        "labels": {
            "alertname": "NodeCpuPressure"
        },
        "receiver": {
            "name": "webhook"
        }
    }
]

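We can see that AlertManager automatically created a Group whose labels are {alertname=NodeCpuPressure}, and that it contains the alert we just sent.

Now look at what alertmanager sent to our program: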

{
    "receiver": "webhook",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "IP": "192.168.2.101",
                "alertname": "NodeCpuPressure"
            },
            "annotations": {
                "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
            },
            "startsAt": "2022-11-02T06:33:44.662Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "",
            "fingerprint": "27e1a08813b1ec3b"
        }
    ],
    "groupLabels": {
        "alertname": "NodeCpuPressure"
    },
    "commonLabels": {
        "IP": "192.168.2.101",
        "alertname": "NodeCpuPressure"
    },
    "commonAnnotations": {
        "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
    },
    "externalURL": "http://629720e6c34d:9093",
    "version": "4",
    "groupKey": "{}:{alertname=\"NodeCpuPressure\"}",
    "truncatedAlerts": 0
}
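To work with this payload in the receiver instead of just printing it, it can be decoded into Go structs. The sketch below covers only fields that appear in the payload above; the type names `webhookAlert` and `webhookPayload` and the helper `summarize` are invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// webhookAlert and webhookPayload list only fields visible in the payload above.
type webhookAlert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	Fingerprint string            `json:"fingerprint"`
}

type webhookPayload struct {
	Receiver     string            `json:"receiver"`
	Status       string            `json:"status"`
	Alerts       []webhookAlert    `json:"alerts"`
	GroupLabels  map[string]string `json:"groupLabels"`
	CommonLabels map[string]string `json:"commonLabels"`
	GroupKey     string            `json:"groupKey"`
}

// summarize decodes a webhook body and returns a one-line description of it.
func summarize(body []byte) (string, error) {
	var p webhookPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return "", err
	}
	return fmt.Sprintf("%s: group %s has %d alert(s)", p.Status, p.GroupKey, len(p.Alerts)), nil
}

func main() {
	// A shortened copy of the payload shown above.
	body := []byte(`{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"fingerprint":"27e1a08813b1ec3b"}],"groupKey":"{}:{alertname=\"NodeCpuPressure\"}"}`)
	line, err := summarize(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(line)
}
```

Inside AlertHandler, `summarize(body)` could be called right after the body has been read.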

Next we send another alert with the same alertname as the one above.

[
    {
        "Labels": {
            "alertname": "NodeCpuPressure",
            "instance": "192.168.1.200:9100"
        },
        "Annotations": {
            "summary": "NodeCpuPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"
        },
        "StartsAt": "2022-11-02T06:33:44.662Z", 
        "EndsAt": "2022-11-02T08:33:44.662Z"
    }
]

Check what the program received this time:

{
    "receiver": "webhook",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "NodeCpuPressure",
                "instance": "192.168.1.200:9100"
            },
            "annotations": {
                "summary": "NodeCpuPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"
            },
            "startsAt": "2022-11-02T06:33:44.662Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "",
            "fingerprint": "e729dc1c8ec3316e"
        },
        {
            "status": "firing",
            "labels": {
                "IP": "192.168.2.101",
                "alertname": "NodeCpuPressure"
            },
            "annotations": {
                "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
            },
            "startsAt": "2022-11-02T06:33:44.662Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "",
            "fingerprint": "27e1a08813b1ec3b"
        }
    ],
    "groupLabels": {
        "alertname": "NodeCpuPressure"
    },
    "commonLabels": {
        "alertname": "NodeCpuPressure"
    },
    "commonAnnotations": {},
    "externalURL": "http://629720e6c34d:9093",
    "version": "4",
    "groupKey": "{}:{alertname=\"NodeCpuPressure\"}",
    "truncatedAlerts": 0
}

The two alerts were combined into a single notification and sent to our program; this is alert grouping.
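The grouping seen here can be imitated in a few lines of Go. This is a simplified sketch, not AlertManager's actual code: `groupLabels` and `commonLabels` are made-up helpers. The first keeps only the labels named in group_by, so both alerts map to the same group. The second keeps the label pairs shared by every alert, which is why commonLabels in the payload above shrank to alertname only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupLabels keeps only the labels named in groupBy and renders them
// in sorted order, e.g. {alertname="NodeCpuPressure"}.
func groupLabels(labels map[string]string, groupBy []string) string {
	keys := append([]string(nil), groupBy...)
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		if v, ok := labels[k]; ok {
			parts = append(parts, fmt.Sprintf("%s=%q", k, v))
		}
	}
	return "{" + strings.Join(parts, ",") + "}"
}

// commonLabels returns the label pairs shared by every alert in the group.
func commonLabels(alerts []map[string]string) map[string]string {
	common := map[string]string{}
	if len(alerts) == 0 {
		return common
	}
	for k, v := range alerts[0] {
		common[k] = v
	}
	for _, a := range alerts[1:] {
		for k, v := range common {
			if av, ok := a[k]; !ok || av != v {
				delete(common, k)
			}
		}
	}
	return common
}

func main() {
	first := map[string]string{"alertname": "NodeCpuPressure", "IP": "192.168.2.101"}
	second := map[string]string{"alertname": "NodeCpuPressure", "instance": "192.168.1.200:9100"}
	groupBy := []string{"alertname"}

	fmt.Println(groupLabels(first, groupBy))
	fmt.Println(groupLabels(second, groupBy))
	fmt.Println(commonLabels([]map[string]string{first, second}))
}
```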

Next we send an alert with a different alertname.

[
    {
        "Labels": {
            "alertname": "NodeMemPressure",
            "instance": "192.168.1.200:9100"
        },
        "Annotations": {
            "summary": "NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"
        },
        "StartsAt": "2022-11-02T06:33:44.662Z", 
        "EndsAt": "2022-11-02T08:33:44.662Z"
    }
]

The program promptly received a notification:

2022-11-02 15:03:27.3966965 +0800 CST m=+13583.346347101
{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.200:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T06:33:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"543ee1360ede8267"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure","instance":"192.168.1.200:9100"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}

It contains only the alert whose alertname is NodeMemPressure.

4. Resolve an alert

Now we send the following "resolve" request (that is, the same alert with EndsAt set to a time in the past).

[
    {
        "Labels": {
            "alertname": "NodeMemPressure",
            "instance": "192.168.1.200:9100"
        },
        "Annotations": {
            "summary": "NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"
        },
        "StartsAt": "2022-11-02T06:33:44.662Z", 
        "EndsAt": "2022-11-02T06:50:44.662Z"
    }
]

The program promptly received a resolved notification:

2022-11-02 15:28:17.5246315 +0800 CST m=+15073.474282101
{"receiver":"webhook","status":"resolved","alerts":[{"status":"resolved","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.200:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T06:33:44.662Z","endsAt":"2022-11-02T06:50:44.662Z","generatorURL":"","fingerprint":"543ee1360ede8267"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure","instance":"192.168.1.200:9100"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}

The status of the alert has changed to resolved.
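The rule behind this behaviour can be written down as a small Go function. It is a sketch of the idea only, with the hypothetical helpers `alertStatus` and `statusAt`: an alert counts as resolved once its EndsAt is set and no longer in the future, while the zero EndsAt seen in the firing payloads (0001-01-01T00:00:00Z) means still firing.

```go
package main

import (
	"fmt"
	"time"
)

// alertStatus reports "resolved" once endsAt is set and not in the future
// relative to now; otherwise the alert counts as "firing".
func alertStatus(endsAt, now time.Time) string {
	if !endsAt.IsZero() && !endsAt.After(now) {
		return "resolved"
	}
	return "firing"
}

// statusAt is a string-based convenience wrapper around alertStatus.
func statusAt(endsAt, now string) (string, error) {
	e, err := time.Parse(time.RFC3339, endsAt)
	if err != nil {
		return "", err
	}
	n, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return "", err
	}
	return alertStatus(e, n), nil
}

func main() {
	// Receive time of the resolved notification above (15:28:17 +0800), in UTC.
	now := "2022-11-02T07:28:17Z"
	for _, endsAt := range []string{
		"2022-11-02T06:50:44.662Z", // EndsAt from the resolve request
		"2022-11-02T08:33:44.662Z", // EndsAt from the earlier firing requests
		"0001-01-01T00:00:00Z",     // zero EndsAt as seen in firing payloads
	} {
		status, err := statusAt(endsAt, now)
		if err != nil {
			fmt.Println("parse failed:", err)
			return
		}
		fmt.Println(endsAt, "->", status)
	}
}
```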

5. Parameter explanation

group_wait (default: 30s)
How long to initially wait to send a notification for a group of alerts. Allows to wait for an inhibiting alert to arrive or collect more initial alerts for the same group. (Usually ~0s to few minutes.)

group_interval (default: 5m)
How long to wait before sending a notification about new alerts that are added to a group of alerts for which an initial notification has already been sent. (Usually ~5m or more.)

repeat_interval (default: 4h)
How long to wait before sending a notification again if it has already been sent successfully for an alert. (Usually ~3h or more).

6. Experiment

Parameters

group_wait: 10s
group_interval: 5m
repeat_interval: 20m

At 16:24:55, create the first alert, instance=192.168.1.100:9100
At 16:25:00, create the second alert, instance=192.168.1.101:9100
At 16:26:14, create the third alert, instance=192.168.1.102:9100

Notifications received by the program:

2022-11-02 16:25:25.1830502 +0800 CST m=+18501.132700801
{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.100:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"80a1a8cfbe10362e"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.101:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"8a4e47f851b7fa01"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}


2022-11-02 16:30:25.1791246 +0800 CST m=+18801.128775201
{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.100:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"80a1a8cfbe10362e"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.101:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"8a4e47f851b7fa01"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.102:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"45fb02806009ff64"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}


2022-11-02 16:55:25.1825506 +0800 CST m=+20301.132201201
{"receiver":"webhook","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.100:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"80a1a8cfbe10362e"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.101:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"8a4e47f851b7fa01"},{"status":"firing","labels":{"alertname":"NodeMemPressure","instance":"192.168.1.102:9100"},"annotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"startsAt":"2022-11-02T08:24:44.662Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"","fingerprint":"45fb02806009ff64"}],"groupLabels":{"alertname":"NodeMemPressure"},"commonLabels":{"alertname":"NodeMemPressure"},"commonAnnotations":{"summary":"NodeMemPressure, IP: 192.168.1.200, Value: 90%, Threshold: 85%"},"externalURL":"http://629720e6c34d:9093","version":"4","groupKey":"{}:{alertname=\"NodeMemPressure\"}","truncatedAlerts":0}

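The first and second alerts arrived less than group_wait apart, so they were grouped into one notification.
The third alert arrived more than group_wait after the first, so it had to wait for group_interval; the second notification went out at group_wait + group_interval.
The third notification (same content as the second) went out 25 minutes after the second, at the first group_interval tick at which more than repeat_interval had elapsed (group_interval * 5 = 25m > repeat_interval = 20m).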
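The spacing between the three notifications can be checked from the timestamps printed by the receiver. Below is a small sketch with a made-up helper `gaps`; the timestamps are the ones logged above with the sub-second digits dropped.

```go
package main

import (
	"fmt"
	"time"
)

// gaps parses wall-clock timestamps (sub-second digits dropped) and
// returns the time elapsed between consecutive entries.
func gaps(stamps []string) ([]time.Duration, error) {
	const layout = "2006-01-02 15:04:05"
	var out []time.Duration
	var prev time.Time
	for i, s := range stamps {
		t, err := time.Parse(layout, s)
		if err != nil {
			return nil, err
		}
		if i > 0 {
			out = append(out, t.Sub(prev))
		}
		prev = t
	}
	return out, nil
}

func main() {
	g, err := gaps([]string{
		"2022-11-02 16:25:25", // first notification
		"2022-11-02 16:30:25", // second notification
		"2022-11-02 16:55:25", // third notification
	})
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(g) // [5m0s 25m0s]
}
```

The first gap equals group_interval (5m); the second is five group_interval ticks (25m), the first tick that lies beyond repeat_interval (20m).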

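Process for a single alert
After receiving the alert, alertmanager waits group_wait (10s) and sends the first notification.
Until group_interval is reached (5m 10s), it sleeps.
When group_interval is reached (5m 10s), repeat_interval (20m 10s) has not yet passed, so it sleeps again.
At the first group_interval tick after repeat_interval (20m 10s) has been exceeded, it sends the second notification.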
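The single-alert process can be turned into a small simulation. This is a simplified model, not AlertManager's implementation, and `notifyOffsets` is a name invented here. It assumes the group is evaluated at group_wait and then every group_interval, and that a repeat is sent only when strictly more than repeat_interval has passed since the last notification, which matches the 25-minute gap observed in the experiment.

```go
package main

import (
	"fmt"
	"time"
)

// Values from the experiment section.
const (
	groupWait      = 10 * time.Second
	groupInterval  = 5 * time.Minute
	repeatInterval = 20 * time.Minute
)

// notifyOffsets returns, as offsets from the moment the alert arrives, the
// times at which a notification is sent for one unchanged, still-firing alert.
// Model (an assumption): the group is evaluated at wait, wait+interval, ...;
// the first evaluation always notifies, later ones only when strictly more
// than repeat has passed since the last notification.
func notifyOffsets(wait, interval, repeat, horizon time.Duration) []time.Duration {
	if interval <= 0 {
		return nil // avoid an endless loop on a non-positive interval
	}
	var sent []time.Duration
	for tick := wait; tick <= horizon; tick += interval {
		if len(sent) == 0 || tick-sent[len(sent)-1] > repeat {
			sent = append(sent, tick)
		}
	}
	return sent
}

func main() {
	// Simulate the first hour after the alert arrives.
	fmt.Println(notifyOffsets(groupWait, groupInterval, repeatInterval, time.Hour))
	// [10s 25m10s 50m10s]
}
```

With these values the tick at 20m 10s is skipped because exactly 20m, not more, has passed since the first notification, so the repeat goes out one tick later at 25m 10s.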

posted @ 2022-11-02 16:22  头痛不头痛