Notes on alertmanager

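I. Alerts received by alertmanager are marked [resolved]

1. Problem description
prometheus and alertmanager were deployed in different clusters, with resolve_timeout set to 20m.
The alertmanager logs showed that every alert sent from prometheus to alertmanager was marked [resolved]: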

level=debug ts=2024-04-12T07:37:11.974Z caller=dispatch.go:165 component=dispatcher msg="Received alert" alert=POD内存使用率[f83426f][resolved]

 

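2. Solution
Investigation showed that clock synchronization was not enabled on the cluster running prometheus, so its clock had drifted away from the clock of the alertmanager cluster. Judged against alertmanager's own clock, the incoming alerts had already passed their end time (or the resolve_timeout window), so alertmanager marked them [resolved] as soon as they arrived.
After clock synchronization was enabled, the problem went away.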
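
To see why clock skew alone can produce [resolved], here is a minimal Python sketch. It assumes a simplified model: the sender stamps each alert with an end time slightly ahead of its own clock, and the receiver treats an alert as resolved once that end time is not after the receiver's clock. The function names, the 4-minute validity window, and the 10-minute skew are illustrative assumptions, not values taken from the incident.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical validity window the sender adds to its own clock (assumption).
VALIDITY = timedelta(minutes=4)

def stamp_ends_at(sender_now: datetime) -> datetime:
    """End time the sender attaches to a firing alert, per its own clock."""
    return sender_now + VALIDITY

def is_resolved(ends_at: datetime, receiver_now: datetime) -> bool:
    """Simplified receiver rule: resolved once ends_at is not after 'now'."""
    return ends_at <= receiver_now

true_now = datetime(2024, 4, 12, 7, 37, 11, tzinfo=timezone.utc)

# Clocks in sync: the alert is still firing on arrival.
in_sync = is_resolved(stamp_ends_at(true_now), true_now)

# Sender clock 10 minutes behind: ends_at is already in the past on arrival.
skewed = is_resolved(stamp_ends_at(true_now - timedelta(minutes=10)), true_now)

print(in_sync, skewed)  # False True
```

In this model, any skew larger than the validity window makes a firing alert look resolved on arrival, which matches the symptom in the log line above.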

 

II. Explanation of selected alertmanager configuration

global:
  resolve_timeout: 10m  # if an alert is not received again within 10 minutes, mark it as resolved (used only for alerts that carry no end time)
route:
  receiver: pero-api  # receiver of the alerts
  group_by:  # group alerts by the following labels
     - alertTarget
     - alertLevel
     - rule_id
     - metricId
     - cluster
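  group_wait: 5m  # how long to wait after the first alert of a new group arrives before sending that group to the receiver
  group_interval: 5m  # for a group that has already been notified, how long to wait before sending a notification about alerts newly added to it
  repeat_interval: 2h  # for an alert that has already been sent and is still not resolved, resend it to the receiver every 2 hours
  ### the settings above are the defaults for all routes; the settings below are specific to each route ###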
  routes:
  - receiver: pero-api
    group_by:
     - alertTarget
     - alertLevel
     - rule_id
     - metricId
     - cluster
    match:
      peroGroup: pero-api
    routes:
    - match:
        alertLevel: "4"
      group_wait: 5s
      group_interval: 5s
    - match:
        alertLevel: "3"
      group_wait: 30s
      group_interval: 30s
    - match:
        alertLevel: "2"
      group_wait: 2m
      group_interval: 2m
    - match:
        alertLevel: "1"
      group_wait: 5m
      group_interval: 5m
inhibit_rules:  # inhibit rules
  - source_match:  # a level-4 alert in group pero-api inhibits a level-3 alert in group pero-api, provided the labels listed under equal also match
      alertLevel: "4"
      peroGroup: pero-api
    target_match:
      alertLevel: "3"
      peroGroup: pero-api
    equal: ['peroGroup','metricId', 'cluster', 'rule_id', 'instance', 'alertTarget']

receivers:  # send alert notifications to a third-party application
- name: pero-api
  webhook_configs:
  - send_resolved: true
    url: http://pero-api-svc.pcl:8080/pero/api/v3/send # endpoint of the pero-api application
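
The way nested routes inherit and override the timing settings can be sketched in Python. This is a simplified model, not alertmanager's implementation: it assumes equality matching only, that the first matching child wins (no `continue`), and that a child keeps its parent's value for any setting it does not set. The `Route` class and `effective_timing` function are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

TIMING_KEYS = ("group_wait", "group_interval", "repeat_interval")

@dataclass
class Route:
    match: Dict[str, str] = field(default_factory=dict)
    group_wait: Optional[str] = None
    group_interval: Optional[str] = None
    repeat_interval: Optional[str] = None
    routes: List["Route"] = field(default_factory=list)

def effective_timing(route: Route, labels: Dict[str, str],
                     inherited: Optional[Dict[str, Optional[str]]] = None
                     ) -> Dict[str, Optional[str]]:
    """Walk the tree and return the timing of the deepest matching route."""
    inherited = inherited or dict.fromkeys(TIMING_KEYS)
    # A route keeps its parent's value for any setting it does not override.
    timing = {k: getattr(route, k) or inherited[k] for k in TIMING_KEYS}
    for child in route.routes:
        if all(labels.get(k) == v for k, v in child.match.items()):
            return effective_timing(child, labels, timing)  # first match wins
    return timing

# The route tree from the configuration above, timing settings only.
levels = {"4": "5s", "3": "30s", "2": "2m", "1": "5m"}
root = Route(
    group_wait="5m", group_interval="5m", repeat_interval="2h",
    routes=[Route(
        match={"peroGroup": "pero-api"},
        routes=[Route(match={"alertLevel": lvl}, group_wait=t, group_interval=t)
                for lvl, t in levels.items()],
    )],
)

urgent = effective_timing(root, {"peroGroup": "pero-api", "alertLevel": "4"})
other = effective_timing(root, {"peroGroup": "something-else"})
print(urgent)  # group_wait and group_interval are 5s, repeat_interval stays 2h
print(other)   # falls back to the top-level defaults
```

Under these assumptions, a level-4 alert in pero-api is sent after 5s but is still repeated only every 2h, because no nested route overrides repeat_interval.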

 

III. Manually sending alerts to alertmanager

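1. Payload

Notes:
Adding a label, or changing the value of an existing one ("key": "value"), makes alertmanager treat the payload as a new alert and send it. Adding a label whose value is empty does not create a new alert.
Changing a label's value from non-empty to empty also causes an alert to be sent.
This is consistent with alertmanager identifying an alert by its set of non-empty labels, although in practice the behavior did not always look predictable. Changing label values casually can easily cause an alert storm.

[
	{
		"labels": {
			"alertTarget": "xdd666",
			"alertLevel": "1",
			"rule_id": "98",
			"instance": "xdd实例",
			"bizSystem": "xdd轻微系统",
			"alertname": "资源变更",
			"groupId": "",
			"log_info": "xdd轻微ddddddddddddddddddddddddddddddddddddddddddddd",
			"description": "资源对象:xdd轻微ddddd变更",
			"thirdStrategyId": "100",
			"clusterId": "sdsdfs-wewew-dsewe-sdcsd-ssdvdd",
			"log_time": "2024-11-04 16:16:16",
			"metricType": "Deployment",
			"rulesName": "xdd轻微告警名称",
			"clusterName": "cluster-pero",
			"namespace": "xdd轻微",
			"alertGroup": "pero-api",
			"alertModel": "resource"
		},
		"annotations": {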
			"alertContent": "xdd轻微ddddddddddddddddddddddddddddddddddddddddddddd变更"
		}
	},
	{
		"labels": {
			"alertTarget": "xdd666",
			"alertLevel": "2",
			"rule_id": "97",
			"instance": "xdd实例",
			"bizSystem": "xdd中度系统",
			"alertname": "资源变更",
			"groupId": "",
			"log_info": "xdd中度ddddddddddddddddddddddddddddddddddddddddddddd",
			"description": "资源对象:xdd中度ddddd变更",
			"thirdStrategyId": "100",
			"clusterId": "sdsdfs-wewew-dsewe-sdcsd-ssdvdd",
			"log_time": "2024-11-04 16:16:16",
			"metricType": "Deployment",
			"rulesName": "xdd中度告警名称",
			"clusterName": "cluster-pero",
			"namespace": "xdd中度",
			"alertGroup": "pero-api",
			"alertModel": "resource"
		},
		"annotations": {
			"alertContent": "xdd中度ddddddddddddddddddddddddddddddddddddddddddddd变更"
		}
	}
]
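
The notes about which changes count as a new alert can be pictured with a small Python sketch. It assumes that an alert's identity is its set of labels with empty-valued labels dropped, and that annotations play no part in it; the real implementation hashes the labels into a fingerprint, so the `identity` helper here is only a stand-in.

```python
from typing import Dict, FrozenSet, Tuple

def identity(alert: Dict[str, Dict[str, str]]) -> FrozenSet[Tuple[str, str]]:
    """Identity of an alert in this sketch: its non-empty labels only."""
    return frozenset((k, v) for k, v in alert.get("labels", {}).items() if v != "")

base = {"labels": {"alertTarget": "xdd666", "alertLevel": "1", "groupId": ""},
        "annotations": {"alertContent": "a"}}

# Adding a label with an empty value: same identity, not a new alert.
added_empty = {"labels": {**base["labels"], "extra": ""},
               "annotations": base["annotations"]}
# Changing a label value: different identity, treated as a new alert.
changed = {"labels": {**base["labels"], "alertLevel": "2"},
           "annotations": base["annotations"]}
# Emptying a label that had a value: the label disappears, so identity changes.
emptied = {"labels": {**base["labels"], "alertLevel": ""},
           "annotations": base["annotations"]}
# Changing only an annotation: identity unchanged.
new_annotation = {"labels": base["labels"],
                  "annotations": {"alertContent": "b"}}

for name, alert in [("added_empty", added_empty), ("changed", changed),
                    ("emptied", emptied), ("new_annotation", new_annotation)]:
    print(name, "new alert" if identity(alert) != identity(base) else "same alert")
```

Seen this way, labels such as log_info or log_time, whose values change on every event, turn each event into a separate alert, which is one plausible source of the alert storms mentioned above.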

2. curl command

curl -v --request POST \
  --url http://alertmanager-svc.pcl:9093/api/v2/alerts \
  --header 'Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxxx' \
  --header 'content-type: application/json' \
  --data '[
    {
        "labels": {
            "alertTarget": "xdd666",
            "alertLevel": "1",
            "rule_id": "98",
            "instance": "xdd实例",
            "bizSystem": "xdd轻微系统",
            "alertname": "资源变更",
            "groupId": "",
            "log_info": "xdd轻微ddddddddddddddddddddddddddddddddddddddddddddd",
            "description": "资源对象:xdd轻微ddddd变更",
            "thirdStrategyId": "100",
            "clusterId": "sdsdfs-wewew-dsewe-sdcsd-ssdvdd",
            "log_time": "2024-11-04 16:16:16",
            "metricType": "Deployment",
            "rulesName": "xdd轻微告警名称",
            "clusterName": "cluster-pero",
            "namespace": "xdd轻微",
            "alertGroup": "pero-api",
            "alertModel": "resource"
        },
        "annotations": {
            "alertContent": "xdd轻微ddddddddddddddddddddddddddddddddddddddddddddd变更"
        }
    },
    {
        "labels": {
            "alertTarget": "xdd666",
            "alertLevel": "2",
            "rule_id": "97",
            "instance": "xdd实例",
            "bizSystem": "xdd中度系统",
            "alertname": "资源变更",
            "groupId": "",
            "log_info": "xdd中度ddddddddddddddddddddddddddddddddddddddddddddd",
            "description": "资源对象:xdd中度ddddd变更",
            "thirdStrategyId": "100",
            "clusterId": "sdsdfs-wewew-dsewe-sdcsd-ssdvdd",
            "log_time": "2024-11-04 16:16:16",
            "metricType": "Deployment",
            "rulesName": "xdd中度告警名称",
            "clusterName": "cluster-pero",
            "namespace": "xdd中度",
            "alertGroup": "pero-api",
            "alertModel": "resource"
        },
        "annotations": {
            "alertContent": "xdd中度ddddddddddddddddddddddddddddddddddddddddddddd变更"
        }
    }
]
'

 

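IV. Alert inhibition does not take effect in an alertmanager cluster

1. Scenario

Inhibit rule: higher-level alerts inhibit lower-level alerts.

alertmanager deployment mode: an alertmanager cluster, accessed through a load-balancer address.

Because alerts are sent to a load-balancer address, some alerts may land on one node of the alertmanager cluster while the rest land on other nodes.

As a result, inhibit rules only take effect among alerts received by the same alertmanager node; alerts held by different nodes do not inhibit each other.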
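
A small Python simulation makes the effect concrete. It is a sketch under stated assumptions: each node evaluates inhibition only against the alerts it has itself received, matching is plain equality, and a missing label counts as an empty value when comparing the `equal` labels. The rule constants mirror the inhibit rule from section II; the label values (`m1`, `c1`, `i1`) are made up.

```python
from typing import Dict, List

Labels = Dict[str, str]

SOURCE = {"alertLevel": "4", "peroGroup": "pero-api"}
TARGET = {"alertLevel": "3", "peroGroup": "pero-api"}
EQUAL = ["peroGroup", "metricId", "cluster", "rule_id", "instance", "alertTarget"]

def matches(labels: Labels, matcher: Labels) -> bool:
    return all(labels.get(k) == v for k, v in matcher.items())

def is_inhibited(alert: Labels, seen_by_node: List[Labels]) -> bool:
    """True if some alert held by the same node inhibits this one."""
    if not matches(alert, TARGET):
        return False
    return any(
        matches(src, SOURCE)
        and all(src.get(k, "") == alert.get(k, "") for k in EQUAL)
        for src in seen_by_node
    )

common = {"peroGroup": "pero-api", "metricId": "m1", "cluster": "c1",
          "rule_id": "98", "instance": "i1", "alertTarget": "xdd666"}
high = {**common, "alertLevel": "4"}
low = {**common, "alertLevel": "3"}

# The load balancer splits the alerts: node_a gets the high one, node_b the low one.
node_a, node_b = [high], [low]
split = is_inhibited(low, node_b)

# Every node receives both alerts.
node_a = node_b = [high, low]
fanned_out = is_inhibited(low, node_b)

print(split, fanned_out)  # False True
```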

2. Solution

When sending alerts, send the same alerts to every node of the alertmanager cluster.
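
For a custom sender, the fan-out could look like the following Python sketch, which uses only the standard library. The node addresses are hypothetical placeholders, the Authorization header from the curl command is left out for brevity, and `build_requests` / `send_to_all` are illustrative names. When prometheus itself is the sender, the equivalent is to list every alertmanager instance in its configuration rather than a single load-balanced address.

```python
import json
import urllib.request
from typing import Callable, Dict, List

def build_requests(nodes: List[str], alerts: List[dict]) -> List[urllib.request.Request]:
    """One POST request per alertmanager node, all carrying the same alerts."""
    body = json.dumps(alerts).encode("utf-8")
    return [
        urllib.request.Request(
            url=f"{node}/api/v2/alerts",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        for node in nodes
    ]

def send_to_all(nodes: List[str], alerts: List[dict],
                send: Callable = urllib.request.urlopen) -> Dict[str, object]:
    """Send to every node; record the HTTP status, or the error, per node."""
    results: Dict[str, object] = {}
    for node, req in zip(nodes, build_requests(nodes, alerts)):
        try:
            with send(req) as resp:
                results[node] = resp.status
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            results[node] = exc
    return results

# Hypothetical per-node addresses; replace with the real ones.
NODES = ["http://alertmanager-0.example:9093", "http://alertmanager-1.example:9093"]
```

Calling `send_to_all(NODES, alerts)` with the payload from section III posts it to each node in turn. A failure on one node is recorded and does not stop delivery to the others.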

Posted 2024-04-12 17:07 by wdgde