[k8s] Prometheus + Alertmanager binary install: simple email alerting

  • The goal of this task is to send an alert email with Alertmanager.
  • This environment uses the Prometheus binary components.
  • We monitor the memory of one node and send an email alert when usage exceeds 2% (an intentionally low threshold, for testing).

For deploying on a k8s cluster, refer to the official Prometheus documentation.

Environment preparation

Download the binaries from https://prometheus.io/download/:

https://github.com/prometheus/prometheus/releases/download/v2.0.0/prometheus-2.0.0.linux-amd64.tar.gz
https://github.com/prometheus/alertmanager/releases/download/v0.12.0/alertmanager-0.12.0.linux-amd64.tar.gz
https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz

Extract

/root/
├── alertmanager -> alertmanager-0.12.0.linux-amd64
├── alertmanager-0.12.0.linux-amd64
├── alertmanager-0.12.0.linux-amd64.tar.gz
├── node_exporter-0.15.2.linux-amd64
├── node_exporter-0.15.2.linux-amd64.tar.gz
├── prometheus -> prometheus-2.0.0.linux-amd64
├── prometheus-2.0.0.linux-amd64
└── prometheus-2.0.0.linux-amd64.tar.gz
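
A minimal sketch of the commands that produce the layout above (assuming everything is done under /root, with the URLs from the previous step):

cd /root
wget https://github.com/prometheus/prometheus/releases/download/v2.0.0/prometheus-2.0.0.linux-amd64.tar.gz
wget https://github.com/prometheus/alertmanager/releases/download/v0.12.0/alertmanager-0.12.0.linux-amd64.tar.gz
wget https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
tar xf prometheus-2.0.0.linux-amd64.tar.gz
tar xf alertmanager-0.12.0.linux-amd64.tar.gz
tar xf node_exporter-0.15.2.linux-amd64.tar.gz
# symlinks so the rest of the article can use short paths
ln -s prometheus-2.0.0.linux-amd64 prometheus
ln -s alertmanager-0.12.0.linux-amd64 alertmanager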

Experiment architecture

Prometheus scrapes node_exporter on the node and evaluates rule.yml; when the memory rule fires, the alert is sent to Alertmanager, which delivers the email.

Configure Alertmanager

Create alert.yml

[root@n1 alertmanager]# ls
alertmanager  alert.yml  amtool  data  LICENSE  NOTICE  simple.yml

alert.yml defines who sends the notification, which events trigger it, who receives it, how it is delivered, and so on.

cat alert.yml 
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'maotai@163.com'
  smtp_auth_username: 'maotai@163.com'
  smtp_auth_password: '123456'


templates:
  - '/root/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver


receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'maotai@foxmail.com'
  
  
Once configured, start Alertmanager (0.12 still uses single-dash flags):

./alertmanager -config.file=./alert.yml
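
A quick way to confirm Alertmanager came up (a sketch; 9093 is the default listen port and /api/v1/status is part of the v1 HTTP API):

# is the listener up?
ss -lntp | grep 9093
# ask Alertmanager for its runtime/config status
curl -s http://localhost:9093/api/v1/status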

Configure Prometheus

Alerting rule configuration, rule.yml (referenced from prometheus.yml)

Send an email alert when memory usage exceeds 2% (a low threshold, for testing).

$ cat rule.yml 
groups:
- name: test-rule
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 2
    for: 1m
    labels:
      severity: warning 
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}: Memory usage is above 2% (current value is: {{ $value }}%)"

The key is this expression:

(node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 2
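
Read it as: used memory is total minus (free + buffers + cached), divided by total and multiplied by 100 to get a percentage; the alert fires once that percentage has stayed above 2 for 1m. You can sanity-check the expression against the HTTP query API before putting it into rule.yml (a sketch, assuming Prometheus listens on 192.168.14.11:9090 as in the prometheus.yml shown later):

# e.g. MemTotal=8 GiB, MemFree=1 GiB, Buffers=0.5 GiB, Cached=2.5 GiB
# => (8 - 4) / 8 * 100 = 50% used, which is > 2, so the rule would fire
curl -G -s 'http://192.168.14.11:9090/api/v1/query' \
  --data-urlencode 'query=(node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached)) / node_memory_MemTotal * 100'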

labels attaches labels to the alerts produced by this rule.

annotations (the alert description) is what ends up in the notification body.

Where do the metric keys (node_memory_MemTotal / node_memory_Buffers / node_memory_Cached) come from? That is covered below.

prometheus.yml configuration

  • Add a job for node_exporter.

  • Add the alerting rules via rule_files; that section references rule.yml.

$ cat prometheus.yml 
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]

rule_files:
  - /root/prometheus/rule.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.14.11:9090']
  - job_name: linux
    static_configs:
      - targets: ['192.168.14.11:9100']
        labels:
          instance: db1

After configuring, start Prometheus and open its web UI; the node target now shows up under Targets.
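
A sketch of the start-up sequence (paths follow the directory layout from the beginning of the article; promtool ships in the Prometheus tarball):

# start node_exporter on the monitored node (listens on :9100 by default)
/root/node_exporter-0.15.2.linux-amd64/node_exporter &

# validate the config and the rule file before starting
cd /root/prometheus
./promtool check config prometheus.yml
./promtool check rules rule.yml

# Prometheus 2.x uses double-dash flags
./prometheus --config.file=prometheus.yml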

View the metrics exposed by node_exporter
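
For example, you can pull the raw metrics page with curl and filter for the memory keys used above (a sketch, assuming node_exporter runs on 192.168.14.11:9100 as in prometheus.yml):

curl -s http://192.168.14.11:9100/metrics | grep -E '^node_memory_(MemTotal|MemFree|Buffers|Cached) '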

Open the Alerts page to see the state of each alert rule (inactive / pending / firing).

The keys used in these expressions can all be found on that metrics page (provided the corresponding exporter is installed); write your alert expressions against these keys.

Check the received email
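
If no mail shows up, you can isolate the SMTP configuration by pushing a hand-made alert straight to Alertmanager and watching whether it is delivered (a sketch using the v1 alerts API; the alertname here is made up for the test). With the alert.yml above it goes to default-receiver after group_wait (30s):

curl -s -XPOST http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": { "alertname": "ManualTest", "severity": "warning" },
    "annotations": { "summary": "manual test alert" }
  }
]'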


WeChat alert configuration

global:
  # The smarthost and SMTP sender used for mail notifications.
  resolve_timeout: 6m
  smtp_smarthost: 'x.x.x.x:25'
  smtp_from: 'maomao@qq.com'
  smtp_auth_username: 'maomao'
  smtp_auth_password: 'maomao@qq.com'
  smtp_require_tls: false

  # The auth token for Hipchat.
  hipchat_auth_token: '1234556789'
  # Alternative host for Hipchat. 
  hipchat_api_url: 'https://123'
  wechat_api_url: "https://123"
  wechat_api_secret: "123"
  wechat_api_corp_id: "123"
  

# The directory from which notification templates are read.
templates:
- 'templates/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first 
  # notification.
  group_wait: 3s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 1h

  # A default receiver
  receiver: maotai


  routes:
  - match:
      job: "11"
      #service: "node_exporter"
    routes:
    - match:
        status: yellow
      receiver: maotai
    - match:
        status: orange
      receiver: berlin


# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is 
# already critical.
inhibit_rules:
- source_match:
    service: 'up'
  target_match:
    service: 'mysql'
  # Apply inhibition if the alertname is the same.
  equal: ["instance"]

- source_match:
    service: "mysql"
  target_match:
    service: "mysql-query"
  equal: ['instance']

- source_match:
    service: "A"
  target_match:
    service: "B"
  equal: ["instance"]

- source_match:
    service: "B"
  target_match:
    service: "C"
  equal: ["instance"]

receivers:
- name: 'maotai'
  email_configs:
  - to: 'maotai@qq.com'
    send_resolved: true
    html: '{{ template "email.default.html" . }}'
    headers: { Subject: "[mail] Test monitoring alert email" }
    
- name: "berlin"
  wechat_configs:
  - send_resolved: true
    to_user: "@all"
    to_party: ""
    to_tag: ""
    agent_id: "1"
    corp_id: "xxxxxxx"
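
After editing the configuration, Alertmanager can pick it up without a full restart; reloading on SIGHUP is long-standing behaviour, and an HTTP reload endpoint is also exposed (a sketch):

# reload via signal
kill -HUP $(pidof alertmanager)

# or via HTTP, if your version exposes the lifecycle endpoint
curl -XPOST http://localhost:9093/-/reload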


