马达加斯加的老腊肉

Prometheus + node_exporter + alertmanager + prometheus_webhook_dingtalk + Grafana (non-container setup): notes on building a simple monitoring and alerting platform

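I. Purpose

Get to know the monitoring systems that are popular today by building one hands-on.

II. Environment

A virtual machine.

III. Installation, configuration, and debugging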

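1. Download page for Prometheus and its related packages: https://prometheus.io/download/

2. Grafana download page: https://grafana.com/grafana/download

3. Installation

(1) Download, unpack, and install Prometheus (tutorials are easy to find online, so these notes skip that part); then configure and start Prometheus.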

  prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# Alertmanager configuration
#  - job_name: 'Alertmanager'
#    static_configs:

        #- alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

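*Note: targets uses localhost rather than the server's IP because a hard-coded IP stops working as soon as the server's address changes.*

To start Prometheus, change into the installation directory and run: ./prometheus --config.file=prometheus.yml &

netstat -tpln shows that port 9090 is now listening, and Prometheus can be reached at ip:9090.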
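The port check can also be scripted instead of read by eye. The sketch below is only an illustration: is_listening is a hypothetical helper name, and it assumes the usual netstat -tpln (or ss -tln) layout in which listening sockets carry the word LISTEN and the local address ends in :PORT.

```shell
# is_listening PORT
# Reads `netstat -tpln` (or `ss -tln`) output on stdin and succeeds if a
# LISTEN line has a local address ending in :PORT.
is_listening() {
  awk -v port="$1" '
    /LISTEN/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ (":" port "$")) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

Usage: netstat -tpln | is_listening 9090 && echo "Prometheus is listening". The same helper works for the other ports in these notes (9100, 9093, 8060, 9091).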

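(2) Install and start node_exporter (tutorials are available online; skipped here), then hook it up to Prometheus.

Start node_exporter: change into the installation directory and run ./node_exporter &

netstat -tpln shows that port 9100 is now listening.

Edit prometheus.yml, restart Prometheus, and check at ip:9090 that the node_exporter target has been added and is UP.

prometheus.yml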

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
#  - job_name: 'Alertmanager'
#    static_configs:
    
        #- alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node_self'
    scheme: http
    #tls_config:
      #ca_file: node_exporter.crt
    static_configs:
    - targets: ['localhost:9100']

After the restart, Prometheus is working correctly if ip:9090 looks like the screenshots below.

image

image
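The same check can be made from the command line through the Prometheus HTTP API, which lists scrape targets at /api/v1/targets. The helper below is a rough sketch: count_health is a hypothetical name, and it assumes the API returns compact JSON in which each active target carries a "health":"up" (or "down" / "unknown") field.

```shell
# count_health STATE
# Reads the JSON body of /api/v1/targets on stdin and prints how many
# active targets report the given health ("up", "down" or "unknown").
count_health() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}
```

Usage: curl -s http://localhost:9090/api/v1/targets | count_health up. With the configuration above, two targets (prometheus and node_self) should be counted as up.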

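(3) Install and configure alertmanager + prometheus_webhook_dingtalk to collect alerts and push alert messages to DingTalk; then modify the Prometheus configuration to connect to alertmanager and add the alert rules file rules.yml.

Install and start prometheus_webhook_dingtalk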

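Start prometheus_webhook_dingtalk: change into the installation directory and run
nohup ./prometheus-webhook-dingtalk --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx" --ding.profile="dev_dingding=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxx2" >dingding.log 2>&1 &
Notes: 1. https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx and https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxx2 are the webhook URLs of robots you create yourself in DingTalk. Several robots can be specified when the webhook service starts, but each --ding.profile needs a distinct name (here one is ops_dingding and the other is dev_dingding). 2. The redirection is written as >dingding.log 2>&1 so that both stdout and stderr end up in dingding.log; with the order reversed (2>&1 1>dingding.log), stderr would not reach the log file.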

Once started, it listens on port 8060 by default.
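Before wiring the robots into alertmanager, it can help to confirm that a robot URL accepts messages at all. The sketch below builds the JSON body of a DingTalk text message; dingtalk_text_payload is a hypothetical helper name, it only escapes backslashes and double quotes (no newlines), and a robot with keyword or signature security settings may still reject the message.

```shell
# dingtalk_text_payload MESSAGE
# Prints the JSON body of a DingTalk robot "text" message.
# Only backslashes and double quotes are escaped; MESSAGE must be one line.
dingtalk_text_payload() {
  local content
  content=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"msgtype":"text","text":{"content":"%s"}}\n' "$content"
}
```

Usage: dingtalk_text_payload "test from ops" | curl -s -H 'Content-Type: application/json' --data-binary @- "https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx"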

(4) Configure alertmanager.yml and start the alertmanager service.

alertmanager.yml

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
  - receiver: 'test.yaya'
    match:
      priority: P0
    continue: true
  - receiver: 'web.hook'
    match:
      priority: P0
    continue: true
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:8060/dingtalk/ops_dingding/send'
#inhibit_rules:
  #- source_match:
    #  severity: 'critical'
    #target_match:
    #  severity: 'warning'
    #equal: ['alertname', 'dev', 'instance']
- name: 'test.yaya'
  webhook_configs:
  - url: 'http://127.0.0.1:8060/dingtalk/dev_dingding/send'
#inhibit_rules:
#  - source_match:
#      severity: 'critical'
#    target_match:
#      severity: 'warning'
#    equal: ['alertname','dev', 'instance']

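*Note: for the receivers test.yaya and web.hook under routes, without continue: true alertmanager stops at the first route that matches and never evaluates the later matching routes. Each url is the prometheus_webhook_dingtalk endpoint for one of the two robots, ops_dingding and dev_dingding.*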
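The effect of continue: true is easier to see with a toy simulation. The sketch below is not alertmanager code; it only mimics the sibling-route rule for a single label (priority), and route_alert, ROUTES and DEFAULT_RECEIVER are hypothetical names chosen for this illustration.

```shell
# Child routes in the order they appear in alertmanager.yml,
# one per line: "receiver priority continue".
ROUTES='test.yaya P0 true
web.hook P0 true'
DEFAULT_RECEIVER='web.hook'

# route_alert PRIORITY [ROUTES]
# Prints the receivers that an alert carrying label priority=PRIORITY reaches.
route_alert() {
  local priority="$1" routes="${2:-$ROUTES}" matched=0
  local receiver want cont
  while read -r receiver want cont; do
    if [ "$want" = "$priority" ]; then
      echo "$receiver"
      matched=1
      # Without continue: true, evaluation stops at the first match.
      if [ "$cont" != "true" ]; then break; fi
    fi
  done <<EOF
$routes
EOF
  # No child matched: the alert falls back to the top-level receiver.
  if [ "$matched" -eq 0 ]; then echo "$DEFAULT_RECEIVER"; fi
}
```

With the routes above, route_alert P0 prints both test.yaya and web.hook, so both robots are notified. If the first route had continue set to false, only test.yaya would be printed; an alert without priority P0 goes to the top-level receiver only.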

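Change into the installation directory and run ./alertmanager --config.file=alertmanager.yml &. If port 9093 is listening, the service has started correctly.

Configure prometheus.yml to connect to alertmanager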
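The alertmanager-to-DingTalk path can be exercised without waiting for a real alert by posting a hand-made alert to alertmanager. The sketch below assumes an alertmanager version that serves the v2 API at /api/v2/alerts; test_alert_json is a hypothetical helper and the alert name ManualTest is made up for the test.

```shell
# test_alert_json ALERTNAME PRIORITY
# Prints a one-element alert list in the shape accepted by
# POST /api/v2/alerts (labels are required, annotations optional).
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","priority":"%s"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2"
}
```

Usage: test_alert_json ManualTest P0 | curl -s -H 'Content-Type: application/json' --data-binary @- http://localhost:9093/api/v2/alerts. Because the alert carries priority P0, it should be routed to both DingTalk robots after group_wait has elapsed.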

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
      - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node_self'
    scheme: http
    #tls_config:
      #ca_file: node_exporter.crt
    static_configs:
    - targets: ['localhost:9100']

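*Note: rule_files names the alert rule files. A relative path such as rules.yml is resolved against the directory that holds prometheus.yml, which here is the Prometheus installation directory.*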

rules.yml

groups:
- name: "service alert test"
  rules:
  - alert: "memory usage alert"
    expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 40
    for: 1m
    labels:
      #token: {{ .Values.prometheus.prometheusSpec.externalLabels.env }}-bigdata
      priority: P0
      status: alert
    annotations:
      description: "Big data alert: IP address {{$labels.instance}}, memory usage above 40% (currently {{$value}}%)"
      summary: "Big data alert: memory usage above 40% (currently {{$value}}%)"

*Note: the metrics used in rules.yml, such as node_memory_MemAvailable_bytes, are collected by node_exporter; search online for more about them.*

Restart Prometheus.

Open ip:9090 in a browser.
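To see what the rule expression evaluates to, the same arithmetic can be reproduced on sample numbers. The sketch below is only a worked example: mem_used_percent and exceeds_threshold are hypothetical helpers, and the byte values in the usage note are invented.

```shell
# mem_used_percent AVAILABLE_BYTES TOTAL_BYTES
# Same arithmetic as the rule: 100 - (available * 100 / total).
mem_used_percent() {
  awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 - (avail * 100) / total }'
}

# exceeds_threshold AVAILABLE_BYTES TOTAL_BYTES [LIMIT]
# Succeeds when usage is above LIMIT percent (default 40, as in rules.yml).
exceeds_threshold() {
  awk -v used="$(mem_used_percent "$1" "$2")" -v limit="${3:-40}" \
    'BEGIN { exit ((used + 0 > limit + 0) ? 0 : 1) }'
}
```

For a host with 8000000000 bytes in total and 6000000000 available, mem_used_percent reports 25.0, so the rule stays quiet; with only 4000000000 available it reports 50.0 and the condition holds. Prometheus additionally waits for the condition to hold for the whole for: 1m window before the alert fires. If promtool from the Prometheus release is at hand, ./promtool check rules rules.yml validates the file syntax.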

image

If the alert moves from pending to firing and the alert message arrives in DingTalk, everything is working.


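(5) Install Grafana and graph the node_exporter metrics and selected push_gateway metrics.

Install Grafana (search online for instructions) and start it with systemctl start grafana (the official deb/rpm packages name the unit grafana-server).

The initial login username/password is admin/admin.

After installation, add Prometheus as the data source, then download a basic node_exporter dashboard JSON file and import it into Grafana; this is enough to pull in the node_exporter data and produce monitoring graphs.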
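The data source can also be created through the Grafana HTTP API instead of the web UI. This is a sketch under assumptions: grafana_datasource_json is a hypothetical helper, Grafana is assumed to listen on its default port 3000, and the admin/admin credentials are the initial ones mentioned above.

```shell
# grafana_datasource_json NAME PROMETHEUS_URL
# Prints the JSON body for POST /api/datasources describing a
# Prometheus data source that Grafana queries server-side ("proxy").
grafana_datasource_json() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$1" "$2"
}
```

Usage: grafana_datasource_json Prometheus http://localhost:9090 | curl -s -u admin:admin -H 'Content-Type: application/json' --data-binary @- http://localhost:3000/api/datasources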


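Feed push_gateway data into a custom monitoring graph:

1. Install push_gateway and start the service; it listens on port 9091.

Collect custom metrics and write them to push_gateway: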

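#!/bin/bash
# Push memory usage (percent) to push_gateway under the job wx_job.
# Assumes the last column of the Mem row of free -m is "available".
avl=$(free -m | grep Mem | awk '{print $NF}')
total=$(free -m | grep Mem | awk '{print $2}')
# used percent = (total - available) * 100 / total, so the value matches the metric name
res=$(echo "scale=3; (${total} - ${avl}) * 100 / ${total}" | bc)
#echo ${res}%
echo "Mem_use_persent ${res}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/wx_job

The second script, /wuxiao/jb/jk_disk.sh, pushes the usage of the root filesystem:

#!/bin/bash
# Use% of the filesystem mounted at /, with the trailing % stripped.
res=$(df -h | grep -E "/$" | awk '{print $5}' | awk -F"%" '{print $1}')
#echo ${res}
echo "disk_jk_use_persent ${res}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/jk_disk_use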
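What gets piped to curl is one line of the Prometheus text exposition format: a metric name, a space, and a value, ending in a newline. The sketch below wraps that in a small helper that also rejects names the format does not allow; format_gauge is a hypothetical name, and the TYPE line is an optional addition that marks the metric as a gauge.

```shell
# format_gauge NAME VALUE
# Prints a gauge sample in the text exposition format, or fails if NAME
# is not a valid metric name ([a-zA-Z_:][a-zA-Z0-9_:]*).
format_gauge() {
  local name="$1" value="$2"
  case "$name" in
    ''|[!a-zA-Z_:]*|*[!a-zA-Z0-9_:]*)
      echo "invalid metric name: $name" >&2
      return 1
      ;;
  esac
  printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$value"
}
```

Usage: format_gauge disk_jk_use_persent "$res" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/jk_disk_use. A pushed group stays in push_gateway until it is overwritten or removed; curl -X DELETE http://127.0.0.1:9091/metrics/job/jk_disk_use removes it.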

Configure prometheus.yml so that Prometheus scrapes push_gateway, then restart Prometheus.

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
      - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node_self'  
    scheme: http
    #tls_config:
      #ca_file: node_exporter.crt
    static_configs: 
    - targets: ['localhost:9100']
  - job_name: 'pushgateway' 
    static_configs:
      - targets: ['localhost:9091']
        labels:
          instance: pushgateway

Log in to Grafana and configure it:

image

image

image

*Note: make sure the data source and the metric to be graphed are filled in correctly.*

image

image

Once it is configured, just save.

posted on 2021-07-18 00:39  马达加斯加的老腊肉
