Deploying Prometheus + Grafana + Alertmanager + prometheus-webhook-dingtalk with Docker Compose: Prometheus monitoring with dashboards and DingTalk alerts
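1. What each service does
- Prometheus: an open-source system monitoring and alerting framework, inspired by Google's Borgmon monitoring system
- Alertmanager: handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups them, routes them to the correct receiver integration, and also takes care of silencing and inhibition
- Node_Exporter: an exporter that collects resource metrics from a node; deploy it on every node that Prometheus monitors
- prometheus-webhook-dingtalk: the DingTalk alerting plugin
- Grafana: monitoring visualization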
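2. Create the prometheus directory that holds everything, and host information
There is only one server, 10.1.1.10, and it runs all the services. To monitor one more machine, add a job to the configuration file and start a Node_Exporter service on the machine being monitored.
Paths in this document may not match your environment exactly; check the file paths against the paths written in the compose files.
mkdir -p /data/prometheus   # every step that follows is run from inside this directory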
cd /data/prometheus
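Because every later step assumes paths relative to /data/prometheus, it can help to create the whole tree up front. What follows is a minimal sketch and not part of the original procedure: the function name make_layout is hypothetical, the directory names are taken from the volume mounts used in docker-compose.yml, and the demo runs in a scratch directory so nothing on the real host is touched.

```shell
#!/bin/sh
# make_layout BASE_DIR
# Hypothetical helper: creates the directories that the compose file mounts.
make_layout() {
    base="$1"
    mkdir -p "$base/prometheus/data" \
             "$base/rules" \
             "$base/alert" \
             "$base/grafana/grafana-storage"
    # The two data directories are opened up the same way this post does it.
    chmod 777 "$base/prometheus/data" "$base/grafana/grafana-storage"
}

# Dry run in a scratch directory; on the real host pass /data/prometheus.
demo_dir="$(mktemp -d)"
make_layout "$demo_dir"
find "$demo_dir" -type d | sort
```

On the monitoring host the call would be make_layout /data/prometheus.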
3. Create the Prometheus configuration file and data directory, which Prometheus reads at startup
mkdir -p prometheus/data    # data directory for Prometheus; relative to /data/prometheus so it matches the compose mounts
chmod 777 prometheus/data   # the container runs as a non-root user and needs write access
vim prometheus/prometheus.yml
global:
  scrape_interval: 15s      # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate rules
  scrape_timeout: 10s       # timeout for each scrape
# scrape configuration list
scrape_configs:
  - job_name: prometheus    # required; attached automatically as the job label, must be unique
    static_configs:
      - targets: ['10.1.1.10:9090']   # Prometheus IP and port
        labels:
          instance: prometheus        # label
  - job_name: ehospital-exploit-database
    static_configs:
      - targets: ['10.1.1.10:9100']
        labels:
          instance: ehospital-exploit-database
alerting:                   # Alertmanager settings
  alertmanagers:
    - static_configs:
        - targets:
            - 10.1.1.10:9093   # the alerting component
rule_files:                 # rule files; wildcards are allowed
  - "/etc/prometheus/rules/*.yml"
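Monitoring one more machine only means appending another entry under scrape_configs. As a rough illustration, the hypothetical helper scrape_job prints such an entry. The job name db-node-2 and the address 10.1.1.11 are made-up examples; 9100 is the Node_Exporter port used throughout this post.

```shell
#!/bin/sh
# scrape_job NAME HOST [PORT]
# Hypothetical helper: prints one scrape_configs entry, indented so it can be
# pasted directly under "scrape_configs:" in prometheus.yml.
scrape_job() {
    name="$1"
    host="$2"
    port="${3:-9100}"   # Node_Exporter port unless told otherwise
    cat <<EOF
  - job_name: ${name}
    static_configs:
      - targets: ['${host}:${port}']
        labels:
          instance: ${name}
EOF
}

# Example values only; replace them with the real job name and address.
scrape_job db-node-2 10.1.1.11
```

After pasting the output into prometheus.yml, restart the prometheus container so the new job is picked up.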
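4. Create the alerting rule file and the recording rule file that supplies its trigger conditions; the Prometheus configuration loads both
4.1:
mkdir rules                  # create the rules directory first
vim rules/alert-rules.yml    # general-purpose alert rules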
groups:
  - name: prometheus-alert
    rules:
      - alert: prometheus-down
        expr: prometheus:up == 0
        for: 1m
        labels:
          severity: 'critical'
        annotations:
          summary: "instance: {{ $labels.instance }} is down"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-cpu-high
        expr: prometheus:cpu:total:percent > 80
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU usage is high: {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has been above 80% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-cpu-iowait-high
        expr: prometheus:cpu:iowait:percent >= 12
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU iowait is high: {{ $value }}"
          description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU iowait has been at or above 12% for three minutes."
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-load-load1-high
        # prometheus:cpu:count has no recording rule in record-rules.yml; define one or this alert cannot fire
        expr: (prometheus:load:load1) > (prometheus:cpu:count) * 1.2
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} load1 is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-memory-high
        expr: prometheus:memory:used:percent > 85
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} memory usage is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-high
        expr: prometheus:disk:used:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk usage is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-read-count-high
        expr: prometheus:disk:read:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read IOPS is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-write-count-high
        expr: prometheus:disk:write:count:rate > 2000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write IOPS is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-read-mb-high
        expr: prometheus:disk:read:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk read throughput (MB/s) is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-disk-write-mb-high
        expr: prometheus:disk:write:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} disk write throughput (MB/s) is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-filefd-allocated-percent-high
        expr: prometheus:filefd_allocated:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} open file descriptor usage is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netin-error-rate-high
        expr: prometheus:network:netin:error:rate > 4
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet error rate is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netin-packet-rate-high
        expr: prometheus:network:netin:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet rate is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-netout-packet-rate-high
        expr: prometheus:network:netout:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} outbound packet rate is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-network-tcp-total-count-high
        expr: prometheus:network:tcp:total:count > 40000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} TCP connection count is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-process-zoom-total-count-high
        # prometheus:process:zoom:total:count has no recording rule in record-rules.yml; define one or this alert cannot fire
        expr: prometheus:process:zoom:total:count > 10
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} zombie process count is high: {{ $value }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
      - alert: prometheus-time-offset-high
        # prometheus:time:offset has no recording rule in record-rules.yml; define one or this alert cannot fire
        expr: prometheus:time:offset > 0.03
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} {{ $labels.desc }} {{ $value }} {{ $labels.unit }}"
          description: ""
          value: "{{ $value }}"
          instance: "{{ $labels.instance }}"
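An indentation mistake in a rule file stops Prometheus from loading it, so it is worth validating the file before starting the stack. One way, sketched here on the assumption that Docker can pull the prom/prometheus image (which ships the promtool binary), is to run promtool check rules in a throwaway container. The wrapper name check_rules and the DRY_RUN switch are hypothetical conveniences; with DRY_RUN=1 the command is only printed.

```shell
#!/bin/sh
# check_rules RULES_DIR
# Hypothetical wrapper: validates RULES_DIR/alert-rules.yml with promtool
# from the prom/prometheus image. DRY_RUN=1 prints the command instead.
check_rules() {
    rules_dir="$1"
    set -- docker run --rm \
        -v "$rules_dir:/rules:ro" \
        --entrypoint promtool \
        prom/prometheus check rules /rules/alert-rules.yml
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show the command that would run for the directory used in this post.
DRY_RUN=1 check_rules /data/prometheus/rules
```

Dropping DRY_RUN=1 runs the check for real; promtool reports the number of rules found or the line it could not parse.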
4.2:
vim rules/record-rules.yml   # recording rules that produce the series used by the alerts
groups:
  - name: prometheus-record
    rules:
      - expr: up{job!="prometheus"}
        record: prometheus:up
        labels:
          desc: "whether the node is online: 1 = up, 0 = down"
          unit: " "
          job: "prometheus"
      - expr: time() - node_boot_time_seconds{}
        record: prometheus:node_uptime
        labels:
          desc: "node uptime"
          unit: "s"
          job: "prometheus"
      ##############################################################################
      # cpu #
      - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
        record: prometheus:cpu:total:percent
        labels:
          desc: "total CPU usage of the node, in percent"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="idle"}[5m]))) * 100
        record: prometheus:cpu:idle:percent
        labels:
          desc: "CPU idle percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="iowait"}[5m]))) * 100
        record: prometheus:cpu:iowait:percent
        labels:
          desc: "CPU iowait percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="system"}[5m]))) * 100
        record: prometheus:cpu:system:percent
        labels:
          desc: "CPU system percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode="user"}[5m]))) * 100
        record: prometheus:cpu:user:percent
        labels:
          desc: "CPU user percentage of the node"
          unit: "%"
          job: "prometheus"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job!="prometheus",mode=~"softirq|nice|irq|steal"}[5m]))) * 100
        record: prometheus:cpu:other:percent
        labels:
          desc: "CPU percentage of the node spent in other modes"
          unit: "%"
          job: "prometheus"
      ##############################################################################
      # memory #
      - expr: node_memory_MemTotal_bytes{job!="prometheus"}
        record: prometheus:memory:total
        labels:
          desc: "total memory of the node"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemFree_bytes{job!="prometheus"}
        record: prometheus:memory:free
        labels:
          desc: "free memory of the node"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemFree_bytes{job!="prometheus"}
        record: prometheus:memory:used
        labels:
          desc: "used memory of the node"
          unit: byte
          job: "prometheus"
      - expr: node_memory_MemTotal_bytes{job!="prometheus"} - node_memory_MemAvailable_bytes{job!="prometheus"}
        record: prometheus:memory:actualused
        labels:
          desc: "memory actually in use on the node"
          unit: byte
          job: "prometheus"
      - expr: (1-(node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
        record: prometheus:memory:used:percent
        labels:
          desc: "memory usage of the node, in percent"
          unit: "%"
          job: "prometheus"
      - expr: ((node_memory_MemAvailable_bytes{job!="prometheus"} / (node_memory_MemTotal_bytes{job!="prometheus"})))* 100
        record: prometheus:memory:free:percent
        labels:
          desc: "available memory of the node, in percent"
          unit: "%"
          job: "prometheus"
      ##############################################################################
      # load #
      - expr: sum by (instance) (node_load1{job!="prometheus"})
        record: prometheus:load:load1
        labels:
          desc: "1-minute system load"
          unit: " "
          job: "prometheus"
      - expr: sum by (instance) (node_load5{job!="prometheus"})
        record: prometheus:load:load5
        labels:
          desc: "5-minute system load"
          unit: " "
          job: "prometheus"
      - expr: sum by (instance) (node_load15{job!="prometheus"})
        record: prometheus:load:load15
        labels:
          desc: "15-minute system load"
          unit: " "
          job: "prometheus"
      ##############################################################################
      # disk #
      - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:total
        labels:
          desc: "total disk space of the node"
          unit: byte
          job: "prometheus"
      - expr: node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:free
        labels:
          desc: "free disk space of the node"
          unit: byte
          job: "prometheus"
      - expr: node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"}
        record: prometheus:disk:usage:used
        labels:
          desc: "used disk space of the node"
          unit: byte
          job: "prometheus"
      - expr: (1 - node_filesystem_avail_bytes{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job!="prometheus",fstype=~"ext4|xfs"}) * 100
        record: prometheus:disk:used:percent
        labels:
          desc: "disk usage of the node, in percent"
          unit: "%"
          job: "prometheus"
      - expr: irate(node_disk_reads_completed_total{job!="prometheus"}[1m])
        record: prometheus:disk:read:count:rate
        labels:
          desc: "disk read operations per second"
          unit: "ops/s"
          job: "prometheus"
      - expr: irate(node_disk_writes_completed_total{job!="prometheus"}[1m])
        record: prometheus:disk:write:count:rate
        labels:
          desc: "disk write operations per second"
          unit: "ops/s"
          job: "prometheus"
      # the original listing had the read and written byte counters swapped between the next two rules
      - expr: (irate(node_disk_read_bytes_total{job!="prometheus"}[1m]))/1024/1024
        record: prometheus:disk:read:mb:rate
        labels:
          desc: "device read throughput in MB"
          unit: "MB/s"
          job: "prometheus"
      - expr: (irate(node_disk_written_bytes_total{job!="prometheus"}[1m]))/1024/1024
        record: prometheus:disk:write:mb:rate
        labels:
          desc: "device write throughput in MB"
          unit: "MB/s"
          job: "prometheus"
      ##############################################################################
      # filesystem #
      - expr: (1 -node_filesystem_files_free{job!="prometheus",fstype=~"ext4|xfs"} / node_filesystem_files{job!="prometheus",fstype=~"ext4|xfs"}) * 100
        record: prometheus:filesystem:used:percent
        labels:
          desc: "inode usage of the node, in percent"
          unit: "%"
          job: "prometheus"
      ##############################################################################
      # filefd #
      - expr: node_filefd_allocated{job!="prometheus"}
        record: prometheus:filefd_allocated:count
        labels:
          desc: "number of open file descriptors on the node"
          unit: "count"
          job: "prometheus"
      - expr: node_filefd_allocated{job!="prometheus"}/node_filefd_maximum{job!="prometheus"} * 100
        record: prometheus:filefd_allocated:percent
        labels:
          desc: "open file descriptors on the node, in percent of the maximum"
          unit: "%"
          job: "prometheus"
      ##############################################################################
      # network #
      # the byte counters are multiplied by 8 so the recorded value really is bit/s
      - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: prometheus:network:netin:bit:rate
        labels:
          desc: "bits received per second, per network interface"
          unit: "bit/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
        record: prometheus:network:netout:bit:rate
        labels:
          desc: "bits sent per second, per network interface"
          unit: "bit/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netin:packet:rate
        labels:
          desc: "packets received per second, per network interface"
          unit: "packets/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netout:packet:rate
        labels:
          desc: "packets sent per second, per network interface"
          unit: "packets/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netin:error:rate
        labels:
          desc: "receive errors per second detected by the device driver"
          unit: "packets/s"
          job: "prometheus"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
        record: prometheus:network:netout:error:rate
        labels:
          desc: "transmit errors per second detected by the device driver"
          unit: "packets/s"
          job: "prometheus"
      # node_tcp_connection_states comes from node_exporter's tcpstat collector, which is off by default (--collector.tcpstat)
      - expr: node_tcp_connection_states{job!="prometheus", state="established"}
        record: prometheus:network:tcp:established:count
        labels:
          desc: "current number of established connections on the node"
          unit: "count"
          job: "prometheus"
      - expr: node_tcp_connection_states{job!="prometheus", state="time_wait"}
        record: prometheus:network:tcp:timewait:count
        labels:
          desc: "number of time_wait connections on the node"
          unit: "count"
          job: "prometheus"
      - expr: sum by (environment,instance) (node_tcp_connection_states{job!="prometheus"})
        record: prometheus:network:tcp:total:count
        labels:
          desc: "total number of TCP connections on the node"
          unit: "count"
          job: "prometheus"
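Each alert expression refers to a series created by a recording rule, so a name that is used but never recorded gives an alert that silently never fires. A small grep-based cross-check can catch this. It is a sketch only: missing_records is a hypothetical helper, it assumes every recorded name starts with prometheus: as in this post, and the two demo files are cut-down stand-ins for alert-rules.yml and record-rules.yml.

```shell
#!/bin/sh
# missing_records ALERT_FILE RECORD_FILE
# Hypothetical helper: prints every prometheus:* name used on an alert
# "expr:" line that has no matching "record:" line. Plain text matching,
# no YAML or PromQL parsing.
missing_records() {
    used="$(grep -E '^[[:space:]]*expr:' "$1" | grep -oE 'prometheus:[a-z0-9_:]+' | sort -u)"
    defined="$(grep -E '^[[:space:]]*record:' "$2" | grep -oE 'prometheus:[a-z0-9_:]+' | sort -u)"
    for name in $used; do
        printf '%s\n' "$defined" | grep -qxF "$name" || printf '%s\n' "$name"
    done
}

# Demo with two tiny stand-in files in a scratch directory.
tmp="$(mktemp -d)"
cat > "$tmp/alert.yml" <<'EOF'
- alert: prometheus-down
  expr: prometheus:up == 0
- alert: prometheus-time-offset-high
  expr: prometheus:time:offset > 0.03
EOF
cat > "$tmp/record.yml" <<'EOF'
- expr: up{job!="prometheus"}
  record: prometheus:up
EOF
missing_records "$tmp/alert.yml" "$tmp/record.yml"   # prints prometheus:time:offset
```

Pointed at the two rule files as listed in this post, the check should report prometheus:cpu:count, prometheus:process:zoom:total:count and prometheus:time:offset, none of which has a recording rule here.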
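5. Create the Grafana data directory and configuration file; Grafana stores its data here
mkdir -p grafana/grafana-storage
chmod 777 grafana/grafana-storage
A copy of the grafana.ini configuration file can be taken from a Grafana container and saved as grafana/grafana.ini.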
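One possible way to get that copy is to print the file from a throwaway container and redirect it to the host. This is a sketch under two assumptions: the image keeps its default configuration at /etc/grafana/grafana.ini (the same path the compose file mounts to), and the image provides cat. The helper name copy_grafana_ini and the DRY_RUN switch are hypothetical. The file should exist before the stack starts, because Docker creates a directory at a bind-mount source path that is missing.

```shell
#!/bin/sh
# copy_grafana_ini DEST
# Hypothetical helper: writes the image's default grafana.ini to DEST.
# DRY_RUN=1 prints the command instead of running it.
copy_grafana_ini() {
    dest="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker run --rm --entrypoint cat grafana/grafana /etc/grafana/grafana.ini > $dest"
    else
        docker run --rm --entrypoint cat grafana/grafana /etc/grafana/grafana.ini > "$dest"
    fi
}

# Show the command for the path used in this post.
DRY_RUN=1 copy_grafana_ini grafana/grafana.ini
```

Run from /data/prometheus without DRY_RUN=1 to create the file for real.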
6. Create the Alertmanager configuration, used to send alerts to the webhook
mkdir alert
vim alert/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  group_by: [alertname]
  routes:
    - receiver: webhook
      group_wait: 10s
receivers:
  - name: webhook
    webhook_configs:
      - url: http://10.1.1.10:8060/dingtalk/webhook1/send   # address of the webhook; webhook1 is the profile name set in docker-compose.yml
        send_resolved: true
7. Write the docker-compose file that starts the services
vim docker-compose.yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus
    restart: "always"
    ports:
      - 9090:9090
    container_name: "prometheus"
    volumes:
      - "./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml"
      - "./rules:/etc/prometheus/rules"
      - "./prometheus/data:/prometheus"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'   # config file path; matches the mount above
      - '--storage.tsdb.path=/prometheus'                # data path; matches the mount above
  # alerting component
  alertmanager:
    image: prom/alertmanager:latest
    restart: "always"
    ports:
      - 9093:9093
    container_name: "alertmanager"
    volumes:
      - "./alert/alertmanager.yml:/etc/alertmanager/alertmanager.yml"
  # DingTalk plugin
  webhook:
    image: timonwong/prometheus-webhook-dingtalk
    restart: "always"
    ports:
      - 8060:8060
    container_name: "webhook"
    command:
      # DingTalk robot URL; put the robot's token after access_token=
      - '--ding.profile=webhook1=https://oapi.dingtalk.com/robot/send?access_token=*'
  # web UI
  grafana:
    image: grafana/grafana
    restart: "always"
    ports:
      - 3000:3000
    container_name: "grafana"
    volumes:
      - "./grafana/grafana.ini:/etc/grafana/grafana.ini"   # copy this config file out yourself
      - "./grafana/grafana-storage:/var/lib/grafana"
7.2 Start the stack
docker-compose -f docker-compose.yml up -d
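Once the containers are up, the path from Alertmanager to DingTalk can be exercised without waiting for a real incident by posting a hand-made alert to Alertmanager's v2 API (POST /api/v2/alerts). The sketch only builds the JSON body; test_alert_json is a hypothetical helper and the alert name manual-test is invented for the example. The curl line is left commented out because it needs the running stack from this post.

```shell
#!/bin/sh
# test_alert_json ALERTNAME INSTANCE
# Hypothetical helper: prints a one-element alert list in the shape accepted
# by Alertmanager's POST /api/v2/alerts endpoint.
test_alert_json() {
    cat <<EOF
[
  {
    "labels": {"alertname": "$1", "instance": "$2", "severity": "info"},
    "annotations": {"summary": "manual test alert for $2"}
  }
]
EOF
}

# Print the body for an invented alert on the host used in this post.
test_alert_json manual-test 10.1.1.10

# To actually send it (requires the running stack):
# curl -s -X POST -H 'Content-Type: application/json' \
#      -d "$(test_alert_json manual-test 10.1.1.10)" \
#      http://10.1.1.10:9093/api/v2/alerts
```

With the route configured in alertmanager.yml, the message should reach the DingTalk group after the group_wait delay, and a resolved notice should follow once the alert expires.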
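8. Create and start the collector service with node-exporter-compose.yml
vim node-exporter-compose.yml
version: '3.2'
services:
  node-exporter:
    image: prom/node-exporter
    restart: "always"
    ports:
      - 9100:9100
    container_name: "node-exporter"
    volumes:
      - "/proc:/host/proc:ro"
      - "/sys:/host/sys:ro"
      - "/:/rootfs:ro"
    command:   # point the exporter at the host mounts above; without these flags it reads the container's own /proc and /sys
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
docker-compose -f node-exporter-compose.yml up -d
For each additional machine, create one copy of this file and start it there. The monitoring host itself can run one as well.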
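To confirm that a freshly started exporter is serving metrics, one can fetch /metrics and pick out a single value. The helper metric_value is hypothetical and deliberately simple: it reads the Prometheus text format on stdin and only matches metrics without labels, such as node_load1. The sample input is made up so the block runs without a live exporter.

```shell
#!/bin/sh
# metric_value NAME
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints the value of the label-free metric called NAME.
metric_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Made-up sample in the exporter's output format.
cat <<'EOF' | metric_value node_load1
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_load5 0.35
EOF
# prints 0.42
```

Against a real exporter the input would come from curl -s http://10.1.1.10:9100/metrics piped into metric_value node_load1.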
9. Verify
docker ps -a     # check that the containers are running
netstat -nltp    # check that the ports are listening
Open ip:9090 in a browser.
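The same check can be scripted against each service's HTTP endpoint. This sketch assumes curl is installed and uses the health paths that Prometheus and Alertmanager (/-/healthy) and Grafana (/api/health) expose, plus the exporter's /metrics page; the function names stack_urls and check_stack are hypothetical. Only the URL list is printed when the block runs, so it is safe to try anywhere.

```shell
#!/bin/sh
# stack_urls HOST
# Hypothetical helper: prints one probe URL per service in this post.
stack_urls() {
    host="$1"
    echo "http://${host}:9090/-/healthy"    # Prometheus
    echo "http://${host}:9093/-/healthy"    # Alertmanager
    echo "http://${host}:3000/api/health"   # Grafana
    echo "http://${host}:9100/metrics"      # node-exporter
}

# check_stack HOST
# Probes every URL and prints OK or FAIL next to it.
check_stack() {
    for url in $(stack_urls "$1"); do
        if curl -fsS -o /dev/null --max-time 5 "$url"; then
            echo "OK   $url"
        else
            echo "FAIL $url"
        fi
    done
}

# List the URLs for the server used in this post.
stack_urls 10.1.1.10
```

Running check_stack 10.1.1.10 on a machine that can reach the server then shows which services answer.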

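10. Configure Grafana
Download a dashboard template from the official Grafana site.
Plugin address:
That completes the deployment. Thanks for reading; if you repost this, please credit this article.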
