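Alertmanager notes
I. Alerts received by Alertmanager are marked [resolved]
1. Problem description
Prometheus and Alertmanager are deployed in different clusters, and resolve_timeout is set to 20m.
The Alertmanager log shows that every alert sent from Prometheus to Alertmanager is marked [resolved]: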
level=debug ts=2024-04-12T07:37:11.974Z caller=dispatch.go:165 component=dispatcher msg="Received alert" alert=POD内存使用率[f83426f][resolved]
2. Resolution
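Investigation showed that clock synchronization was not enabled on the cluster hosting Prometheus, so its clock had drifted from the clock of the cluster hosting Alertmanager. Because of this time difference, Alertmanager treated the incoming alerts as already expired and marked them [resolved]. Prometheus stamps each alert it sends with an end time computed from its own clock, so when that clock runs behind Alertmanager's, the end time is already in the past on arrival; resolve_timeout only applies to alerts that carry no end time.
After clock synchronization was enabled, the problem went away.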
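A quick way to confirm this kind of drift is to compare the local clock with the time the Alertmanager host reports. The sketch below is only an illustration: it assumes curl and GNU date are available, that the Alertmanager HTTP server returns a standard Date response header, and it reuses the service address from the curl example in these notes as a placeholder. The function names are hypothetical.

```shell
#!/bin/sh
# Absolute difference between two Unix timestamps, in seconds.
skew_seconds() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  echo "$diff"
}

# Read the Date response header of an HTTP endpoint and convert it to
# epoch seconds. Assumes curl and GNU date (-d) are installed.
remote_epoch() {
  hdr=$(curl -sI "$1" | tr -d '\r' | awk 'tolower($1) == "date:" { $1 = ""; print; exit }')
  date -u -d "$hdr" +%s
}

# Example, run from the Prometheus side (placeholder URL):
# skew_seconds "$(date -u +%s)" "$(remote_epoch http://alertmanager-svc.pcl:9093/-/healthy)"
```

A result of more than a few seconds suggests the two clusters are not synchronized. The Date header has one-second resolution, so this is a coarse check rather than a replacement for NTP or chrony status output.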
II. Selected Alertmanager configuration explained
global:
  resolve_timeout: 10m    # an alert that carries no end time and is not received again within 10 minutes is marked resolved
route:
  receiver: pero-api      # alert receiver
  group_by:               # alerts are grouped by the following labels
  - alertTarget
  - alertLevel
  - rule_id
  - metricId
  - cluster
  group_wait: 5m          # how long to wait after the first alert of a new group arrives before sending the group to the receiver
  group_interval: 5m      # once a group has been notified, how long to wait before notifying about new alerts added to that group
  repeat_interval: 2h     # an alert that has already been sent and is still unresolved is re-sent to the receiver every 2 hours
  ### The settings above are the defaults for all routes; the settings below are per-route ###
  routes:
  - receiver: pero-api
    group_by:
    - alertTarget
    - alertLevel
    - rule_id
    - metricId
    - cluster
    match:
      peroGroup: pero-api
    routes:
    - match:
        alertLevel: "4"
      group_wait: 5s
      group_interval: 5s
    - match:
        alertLevel: "3"
      group_wait: 30s
      group_interval: 30s
    - match:
        alertLevel: "2"
      group_wait: 2m
      group_interval: 2m
    - match:
        alertLevel: "1"
      group_wait: 5m
      group_interval: 5m
inhibit_rules:            # inhibition rules
- source_match:           # a level-4 alert in the pero-api group inhibits a level-3 alert in the pero-api group, provided the labels listed in equal also match
    alertLevel: "4"
    peroGroup: pero-api
  target_match:
    alertLevel: "3"
    peroGroup: pero-api
  equal: ['peroGroup', 'metricId', 'cluster', 'rule_id', 'instance', 'alertTarget']
receivers:                # send alert notifications to a third-party application
- name: pero-api
  webhook_configs:
  - send_resolved: true
    url: http://pero-api-svc.pcl:8080/pero/api/v3/send    # pero-api application endpoint
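To make the route tree in this configuration easier to reason about, the following sketch hard-codes its matching logic as a small shell function. It is a hand-written model of this one configuration, not something Alertmanager provides, and the function name group_wait_for is hypothetical.

```shell
#!/bin/sh
# Prints the group_wait that the route tree in this configuration would apply.
# $1 = value of the peroGroup label, $2 = value of the alertLevel label
group_wait_for() {
  if [ "$1" != "pero-api" ]; then
    echo "5m"            # no child route matches: root route default
    return
  fi
  case "$2" in
    4) echo "5s" ;;
    3) echo "30s" ;;
    2) echo "2m" ;;
    1) echo "5m" ;;
    *) echo "5m" ;;      # stays on the pero-api route, which inherits the root group_wait
  esac
}

group_wait_for pero-api 4   # prints 5s
group_wait_for pero-api 3   # prints 30s
```

For the real configuration file, amtool (shipped with Alertmanager) offers check-config for syntax validation and a routes test subcommand for seeing which receiver a label set reaches; the model above is only meant to show how the nested routes override the defaults.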
III. Manually sending alerts to Alertmanager
1. Payload
[
  {
    "labels": {
      "alertTarget": "xdd666",
      "alertLevel": "1",
      "rule_id": "98",
      "instance": "xdd实例",
      "bizSystem": "xdd轻微系统",
      "alertname": "资源变更",
      "groupId": "",
      "log_info": "xdd轻微ddddddddddddddddddddddddddddddddddddddddddddd",
      "description": "资源对象:xdd轻微ddddd变更",
      "thirdStrategyId": "100",
      "clusterId": "sdsdfs-wewew-dsewe-sdcsd-ssdvdd",
      "log_time": "2024-11-04 16:16:16",
      "metricType": "Deployment",
      "rulesName": "xdd轻微告警名称",
      "clusterName": "cluster-pero",
      "namespace": "xdd轻微",
      "alertGroup": "pero-api",
      "alertModel": "resource"
    },
    "annotations": {
      "alertContent": "xdd轻微ddddddddddddddddddddddddddddddddddddddddddddd变更"
    }
  },
  {
    "labels": {
      "alertTarget": "xdd666",
      "alertLevel": "2",
      "rule_id": "97",
      "instance": "xdd实例",
      "bizSystem": "xdd中度系统",
      "alertname": "资源变更",
      "groupId": "",
      "log_info": "xdd中度ddddddddddddddddddddddddddddddddddddddddddddd",
      "description": "资源对象:xdd中度ddddd变更",
      "thirdStrategyId": "100",
      "clusterId": "sdsdfs-wewew-dsewe-sdcsd-ssdvdd",
      "log_time": "2024-11-04 16:16:16",
      "metricType": "Deployment",
      "rulesName": "xdd中度告警名称",
      "clusterName": "cluster-pero",
      "namespace": "xdd中度",
      "alertGroup": "pero-api",
      "alertModel": "resource"
    },
    "annotations": {
      "alertContent": "xdd中度ddddddddddddddddddddddddddddddddddddddddddddd变更"
    }
  }
]
Notes on the labels:
Adding a label, or changing the value of an existing label ("key": "value"), makes Alertmanager treat the alert as a new alert and send it. Adding a label whose value is empty does not count as a new alert.
Changing a label value from non-empty to empty also causes an alert to be sent.
In practice the behavior was not always consistent. Changing label values carelessly can easily trigger an alert storm.
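The label behavior noted for this payload is consistent with Alertmanager identifying an alert by its set of labels and, as far as I know, discarding labels whose value is empty. The sketch below models that idea in plain shell; it is not Alertmanager's actual fingerprint algorithm, and label_identity is a hypothetical helper that assumes labels are passed as key=value arguments.

```shell
#!/bin/sh
# Normalises a label set given as key=value arguments:
# drops labels whose value is empty, sorts the rest, joins them with commas.
# Two label sets that normalise to the same string model "the same alert".
label_identity() {
  printf '%s\n' "$@" | grep -v '=$' | LC_ALL=C sort | paste -sd, -
}

# An empty label does not change the identity:
label_identity 'alertname=demo' 'alertLevel=1' 'groupId='
label_identity 'alertname=demo' 'alertLevel=1'
# Both calls print: alertLevel=1,alertname=demo

# Changing a value does, so this would be a different (new) alert:
label_identity 'alertname=demo' 'alertLevel=2'
```

Under this model, annotations play no part in the identity, which may explain why editing some fields appears to have no effect while editing others produces a new alert.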
2. curl command
curl -v --request POST \
  --url http://alertmanager-svc.pcl:9093/api/v2/alerts \
  --header 'Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxxx' \
  --header 'content-type: application/json' \
  --data '[
  {
    "labels": {
      "alertTarget": "xdd666",
      "alertLevel": "1",
      "rule_id": "98",
      "instance": "xdd实例",
      "bizSystem": "xdd轻微系统",
      "alertname": "资源变更",
      "groupId": "",
      "log_info": "xdd轻微ddddddddddddddddddddddddddddddddddddddddddddd",
      "description": "资源对象:xdd轻微ddddd变更",
      "thirdStrategyId": "100",
      "clusterId": "sdsdfs-wewew-dsewe-sdcsd-ssdvdd",
      "log_time": "2024-11-04 16:16:16",
      "metricType": "Deployment",
      "rulesName": "xdd轻微告警名称",
      "clusterName": "cluster-pero",
      "namespace": "xdd轻微",
      "alertGroup": "pero-api",
      "alertModel": "resource"
    },
    "annotations": {
      "alertContent": "xdd轻微ddddddddddddddddddddddddddddddddddddddddddddd变更"
    }
  },
  {
    "labels": {
      "alertTarget": "xdd666",
      "alertLevel": "2",
      "rule_id": "97",
      "instance": "xdd实例",
      "bizSystem": "xdd中度系统",
      "alertname": "资源变更",
      "groupId": "",
      "log_info": "xdd中度ddddddddddddddddddddddddddddddddddddddddddddd",
      "description": "资源对象:xdd中度ddddd变更",
      "thirdStrategyId": "100",
      "clusterId": "sdsdfs-wewew-dsewe-sdcsd-ssdvdd",
      "log_time": "2024-11-04 16:16:16",
      "metricType": "Deployment",
      "rulesName": "xdd中度告警名称",
      "clusterName": "cluster-pero",
      "namespace": "xdd中度",
      "alertGroup": "pero-api",
      "alertModel": "resource"
    },
    "annotations": {
      "alertContent": "xdd中度ddddddddddddddddddddddddddddddddddddddddddddd变更"
    }
  }
]'
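The payload in this command carries no timestamps, so Alertmanager fills them in itself (the end time then depends on resolve_timeout). The v2 alerts API also accepts optional startsAt and endsAt fields in RFC 3339 format. The sketch below builds a minimal one-alert payload with explicit times; rfc3339 and build_alert are hypothetical helpers, GNU date is assumed, and no JSON escaping is done, so the arguments must not contain quotes or backslashes.

```shell
#!/bin/sh
# Formats epoch seconds as an RFC 3339 UTC timestamp (assumes GNU date).
rfc3339() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Prints a one-alert payload with explicit start and end times.
# $1 = alertname, $2 = alertLevel, $3 = start (epoch seconds), $4 = lifetime in seconds
build_alert() {
  printf '[{"labels":{"alertname":"%s","alertLevel":"%s"},"startsAt":"%s","endsAt":"%s"}]\n' \
    "$1" "$2" "$(rfc3339 "$3")" "$(rfc3339 $(( $3 + $4 )))"
}

# Example: an alert starting now that stays active for 10 minutes.
build_alert demo 1 "$(date -u +%s)" 600
```

The output can be piped into the same curl command by replacing the inline body with --data @- so that curl reads the payload from standard input.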
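IV. Alert inhibition does not take effect in an Alertmanager cluster
1. Scenario
Inhibition rule: higher-level alerts inhibit lower-level alerts.
Alertmanager deployment mode: an Alertmanager cluster, accessed through a load-balancer address.
Because alerts are sent to a load-balancer address, some of them may reach one node of the Alertmanager cluster while the rest reach other nodes.
As a result, inhibition rules are triggered only among alerts received by the same Alertmanager node; alerts that land on different nodes do not inhibit each other.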
2. Solution
When sending alerts, send the same alerts to every node of the Alertmanager cluster.
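For manually sent alerts, this fan-out can be sketched as a loop over all node addresses. The node URLs and the alerts.json file name below are placeholders, post_to_all is a hypothetical helper, and the Authorization header from the earlier curl command is left out for brevity. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Prints (DRY_RUN=1) or runs one curl POST per Alertmanager node.
# $1 = payload file, remaining arguments = base URL of every node in the cluster
post_to_all() {
  payload=$1
  shift
  for node in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "curl -sS -X POST -H 'content-type: application/json' --data @$payload $node/api/v2/alerts"
    else
      # Keep going if one node fails, but report it on stderr.
      curl -sS -X POST -H 'content-type: application/json' \
        --data "@$payload" "$node/api/v2/alerts" || echo "failed: $node" >&2
    fi
  done
}

# Example with placeholder node addresses:
# DRY_RUN=1 post_to_all alerts.json http://alertmanager-0:9093 http://alertmanager-1:9093
```

When Prometheus is the sender, the equivalent approach is to list every Alertmanager instance in its alerting configuration rather than pointing it at the load balancer, which is also what the Alertmanager high-availability documentation advises.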