AlertManager

Data structure of the alerts that Alertmanager receives:

type Alert struct {
   Status       string    `json:"status"`
   Labels       KV        `json:"labels"`
   Annotations  KV        `json:"annotations"`
   StartsAt     time.Time `json:"startsAt"`
   EndsAt       time.Time `json:"endsAt"`
   GeneratorURL string    `json:"generatorURL"`
   Fingerprint  string    `json:"fingerprint"`
}

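Alerts are treated as the same alert only when their label sets are identical. A single rule configured in a Prometheus rules file can therefore produce several distinct alerts, one for each distinct label set.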
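To illustrate how identity follows from the label set, here is a minimal sketch that hashes the sorted labels. The helper name labelFingerprint and the FNV-1a hashing scheme are assumptions made for this example; Alertmanager's own fingerprint algorithm differs in detail.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// KV mirrors the label map type used in the Alert struct.
type KV map[string]string

// labelFingerprint is a hypothetical helper: it hashes the sorted
// name/value pairs so that equal label sets yield equal fingerprints.
func labelFingerprint(labels KV) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random, so sort first

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // separator, so "ab"+"c" differs from "a"+"bc"
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	a := KV{"alertname": "HighCPU", "instance": "node-1"}
	b := KV{"instance": "node-1", "alertname": "HighCPU"}
	c := KV{"alertname": "HighCPU", "instance": "node-2"}

	// a and b hold the same label set, so they are the same alert.
	fmt.Println(labelFingerprint(a) == labelFingerprint(b))
	// c comes from the same rule but differs in "instance": a separate alert.
	fmt.Println(labelFingerprint(a) == labelFingerprint(c))
}
```

Here a and c could both come from one rule evaluated against two instances, which is how a single rule ends up producing multiple alerts.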

At startup, Alertmanager reads its configuration from the file given by the --config.file flag.

Global configuration

global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]
 
  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ slack_api_url_file: <filepath> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
 
  # The default HTTP client configuration
  [ http_config: <http_config> ]
 
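  # Used when an alert does not include EndsAt: if the alert is not updated
  # within this time, Alertmanager declares it resolved.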
  [ resolve_timeout: <duration> | default = 5m ]

templates
route

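Routes are organized into a routing tree.

Every alert enters the tree at the root node, which must match all alerts.

When an alert matches a node, it is then checked against that node's child nodes in order.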

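receiver: the name of the receiver that the alert is sent to.

group_by: the label keys used to group Alerts; Alerts in the same group are combined and sent to the receiver together.

continue: whether an alert that has matched this route should go on to be matched against the subsequent sibling routes.

match and matchers: key-value matching rules that an alert's labels must satisfy (match is deprecated in favor of matchers).

group_wait: how long to wait before sending the first notification for a newly created group. Alerts that arrive for the group during this time are combined into the same notification.

group_interval: for a group that has already been notified, how long to wait before sending a notification about new alerts added to it.

repeat_interval: how long to wait before re-sending a notification that has already been sent.

mute_time_intervals: the names of the time intervals, defined in the top-level mute_time_intervals section, during which this route is muted.

routes: a list of child routes.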
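The traversal of the routing tree, including the effect of continue, can be sketched as follows. The Route type here is a simplified, hypothetical model (equality matchers only, no receiver inheritance), not Alertmanager's actual implementation.

```go
package main

import "fmt"

type KV map[string]string

// Route is a simplified, hypothetical model of one routing-tree node.
type Route struct {
	Receiver string
	Match    KV // equality matchers; an empty map matches every alert
	Continue bool
	Routes   []*Route
}

// matches reports whether all matchers of this node hold for the labels.
func (r *Route) matches(labels KV) bool {
	for k, v := range r.Match {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// resolve returns the receivers of the deepest matching nodes.
func (r *Route) resolve(labels KV) []string {
	if !r.matches(labels) {
		return nil
	}
	var out []string
	for _, child := range r.Routes {
		got := child.resolve(labels)
		out = append(out, got...)
		// Stop at the first matching child unless it sets continue.
		if len(got) > 0 && !child.Continue {
			break
		}
	}
	if len(out) == 0 {
		out = []string{r.Receiver} // no child matched: this node handles it
	}
	return out
}

func main() {
	root := &Route{Receiver: "default", Routes: []*Route{
		{Receiver: "db-team", Match: KV{"service": "mysql"}, Continue: true},
		{Receiver: "pager", Match: KV{"severity": "critical"}},
		{Receiver: "audit", Match: KV{"severity": "critical"}},
	}}

	fmt.Println(root.resolve(KV{"service": "mysql", "severity": "critical"})) // [db-team pager]
	fmt.Println(root.resolve(KV{"service": "nginx", "severity": "info"}))     // [default]
}
```

In the first call the alert reaches "pager" only because "db-team" sets continue; "audit" is never reached because "pager" does not.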

receivers

In the configuration file, receivers define where alert notifications are delivered, such as a webhook, email, or DingTalk.

receivers:
- name: <name>
  xxx_configs:

xxx can be email, pagerduty, pushover, slack, opsgenie, webhook, victorops, or wechat.
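For the webhook case, the receiving service gets a JSON body whose fields follow the json tags of the structs in this post. The following is a minimal sketch of decoding it; the webhookPayload type, the decodePayload helper, and the sample body are assumptions made for illustration, and only a subset of the fields is decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type KV map[string]string

// Alert repeats the struct shown at the top of this post.
type Alert struct {
	Status       string    `json:"status"`
	Labels       KV        `json:"labels"`
	Annotations  KV        `json:"annotations"`
	StartsAt     time.Time `json:"startsAt"`
	EndsAt       time.Time `json:"endsAt"`
	GeneratorURL string    `json:"generatorURL"`
	Fingerprint  string    `json:"fingerprint"`
}

// webhookPayload is a hypothetical type covering only the fields used here.
type webhookPayload struct {
	Receiver string  `json:"receiver"`
	Status   string  `json:"status"`
	Alerts   []Alert `json:"alerts"`
}

// decodePayload parses a webhook request body into webhookPayload.
func decodePayload(body []byte) (webhookPayload, error) {
	var p webhookPayload
	err := json.Unmarshal(body, &p)
	return p, err
}

func main() {
	// Sample body written for this example, not captured from a real server.
	body := []byte(`{
	  "receiver": "ops",
	  "status": "firing",
	  "alerts": [{
	    "status": "firing",
	    "labels": {"alertname": "HighCPU", "severity": "critical"},
	    "annotations": {"summary": "CPU usage is high"},
	    "startsAt": "2022-03-08T10:24:00Z"
	  }]
	}`)

	p, err := decodePayload(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, a := range p.Alerts {
		fmt.Println(p.Receiver, a.Status, a.Labels["alertname"], a.StartsAt.Format(time.RFC3339))
	}
}
```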

inhibit_rules

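Alert inhibition rules: an alert that matches the target matchers is muted while an alert that matches the source matchers is firing, provided both alerts have the same values for the labels listed in equal.

Example: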

"inhibit_rules":
- "equal":
  - "namespace"
  - "alertname"
  "source_match":
    "severity": "critical"
  "target_match_re":
    "severity": "warning|info"
- "equal":
  - "namespace"
  - "alertname"
  "source_match":
    "severity": "warning"
  "target_match_re":
    "severity": "info"

mute_time_intervals

Mute time intervals: named time periods during which the routes that reference them are muted:

mute_time_intervals:
- name: <string>
  time_intervals:
    [ - <time_interval> ... ]

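Processing flow:

(1) When an Alert is received, its labels determine which Routes it belongs to (there can be several Routes; a Route has multiple Groups, and a Group has multiple Alerts).

(2) The Alert is assigned to a Group; if no matching Group exists, a new one is created.

The combined alert data structure is: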

type Data struct {
   Receiver string `json:"receiver"`
   Status   string `json:"status"`
   Alerts   Alerts `json:"alerts"`
   GroupLabels       KV `json:"groupLabels"`
   CommonLabels      KV `json:"commonLabels"`
   CommonAnnotations KV `json:"commonAnnotations"`
   ExternalURL string `json:"externalURL"`
}

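(3) A new Group waits for the time given by group_wait (Alerts for the same Group may arrive in the meantime), determines whether each Alert is resolved (resolve_timeout applies to Alerts that carry no EndsAt), and then sends a notification.

(4) An existing Group waits for the time given by group_interval and determines whether its Alerts are resolved. A notification is sent when the Group has changed or when more than repeat_interval has passed since the last notification.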

posted @ 2022-03-08 10:24 扬羽流风