Deploying Prometheus on an Intranet Server

I. Installing Prometheus
-------------------------------------------------------------------------------------------
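The components involved and their roles:

- Prometheus server: the main Prometheus server (port: 9090);
- Node Exporter: collects hardware and operating-system metrics from the host (port: 9100);
- cAdvisor: collects metrics about the containers running on the host (port: 8080);
- Grafana: displays the Prometheus monitoring dashboards (port: 3000);
- Alertmanager: receives the alerts sent by Prometheus and delivers them through the configured notification channels with the configured content (also deployed on the dockerA host; port: 9093).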
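
The ports in the list above can be turned into a quick "is it listening?" check. The following is a minimal sketch, assuming `ss` (from iproute2) is installed; the function names `component_port`, `is_listening` and `report_ports` are made up for this illustration.

```shell
#!/usr/bin/env bash
# Map each component from the list above to its default port.
component_port() {
  case "$1" in
    prometheus)    echo 9090 ;;
    node_exporter) echo 9100 ;;
    cadvisor)      echo 8080 ;;
    grafana)       echo 3000 ;;
    alertmanager)  echo 9093 ;;
    *) echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# Succeed if something on this host listens on the given TCP port.
# In `ss -lnt` output, column 4 is the local address:port.
is_listening() {
  ss -lnt | awk -v port="$1" '$4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

# Print one status line per component.
report_ports() {
  local name port
  for name in prometheus node_exporter cadvisor grafana alertmanager; do
    port=$(component_port "$name")
    if is_listening "$port"; then
      echo "$name ($port): listening"
    else
      echo "$name ($port): not listening"
    fi
  done
}
```

After sourcing the file, run `report_ports` on the monitoring server. A component that is not installed on that host simply shows up as "not listening".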

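Enable each service at boot once it has been installed (the installation steps follow in the sections below):
systemctl enable prometheus.service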
systemctl enable alertmanager
systemctl enable node_exporter.service
systemctl enable grafana-server.service
systemctl enable PrometheusAlert

 

1. Download
https://prometheus.io/download

2. Create the user and set ownership

[root@ntp1 src]# groupadd prometheus
[root@ntp1 src]# useradd -g prometheus -s /sbin/nologin prometheus
[root@ntp1 src]# tar -zxvf prometheus-2.18.1.linux-amd64.tar.gz -C /usr/local/
[root@ntp1 local]# mv prometheus-2.18.1.linux-amd64/ prometheus
[root@ntp1 local]# cd prometheus/
[root@ntp1 prometheus]# mkdir -p {data,logs,conf,rules}
[root@ntp1 prometheus]# chown -R prometheus:prometheus /usr/local/prometheus

3. Set up Prometheus as a system service
vim /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/usr/local/prometheus/data

Restart=on-failure

[Install]
WantedBy=multi-user.target

4. Start the service

systemctl daemon-reload
systemctl enable prometheus.service
systemctl restart prometheus.service
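
It can help to confirm that Prometheus actually came up after the restart. The sketch below polls the server's built-in `/-/healthy` endpoint. The helper names `retry` and `prometheus_healthy` are hypothetical, and the default address `localhost:9090` assumes the check runs on the Prometheus host itself.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Run COMMAND until it succeeds or ATTEMPTS runs have been used up.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Probe the health endpoint; curl -f turns a non-2xx answer into a failure.
prometheus_healthy() {
  curl -fsS "http://${1:-localhost:9090}/-/healthy" >/dev/null
}
```

For example, `retry 10 1 prometheus_healthy` waits up to roughly ten seconds. The configuration itself can be validated beforehand with the `promtool` binary that ships in the same tarball: `/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml`.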

II. Installing Alertmanager
-------------------------------------------------------------------------------------------
[root@ntp1 src]# tar -zxvf alertmanager-0.20.0.linux-amd64.tar.gz -C /usr/local/
[root@ntp1 local]# mv alertmanager-0.20.0.linux-amd64/ alertmanager

[root@ntp1 local]# vim /etc/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target


[root@ntp1 local]# systemctl daemon-reload
[root@ntp1 local]# systemctl start alertmanager
[root@ntp1 local]# systemctl status alertmanager
[root@ntp1 local]# systemctl enable alertmanager
[root@ntp1 local]# netstat -nltup|grep 9093
tcp6 0 0 :::9093 :::* LISTEN 10597/alertmanager


[root@localhost alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['instance']   # group alerts by the instance label
  group_wait: 1s           # wait before sending the first notification for a new group
  group_interval: 10s      # wait before notifying about new alerts added to an existing group
  repeat_interval: 5m      # wait before re-sending a notification that is still firing
  receiver: 'web.hook.prometheusalert'
  routes:
  - receiver: 'prometheusalert-dingding'
    # group_wait: 10m
    match:
      level: '2'           # only alerts labelled level="2" take this route
receivers:
- name: 'web.hook.prometheusalert'
  webhook_configs:
  - url: 'http://localhost:8080/prometheus/alert'
- name: 'prometheusalert-dingding'
  webhook_configs:
  - url: 'http://localhost:8080/prometheus/router?ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxx'
-------------------------------------------------------------------------------------------
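
To see which receiver an alert ends up at, a hand-made alert can be posted to Alertmanager's v2 HTTP API. This is a rough sketch: the function names and the alert name `ManualTest` are invented for the example, the address defaults to `localhost:9093`, and the arguments are not JSON-escaped, so keep them to plain words.

```shell
#!/usr/bin/env bash
# test_alert_payload INSTANCE LEVEL
# Build a one-element alert list carrying the labels the route tree looks at.
test_alert_payload() {
  printf '[{"labels":{"alertname":"ManualTest","instance":"%s","level":"%s"},"annotations":{"summary":"manual routing test"}}]\n' "$1" "$2"
}

# send_test_alert INSTANCE LEVEL [HOST:PORT]
# Post the payload to Alertmanager; curl reads the body from stdin (-d @-).
send_test_alert() {
  test_alert_payload "$1" "$2" |
    curl -fsS -H 'Content-Type: application/json' -d @- \
      "http://${3:-localhost:9093}/api/v2/alerts"
}
```

With the routing above, `send_test_alert testhost 2` should reach the `prometheusalert-dingding` receiver, while any other level falls through to `web.hook.prometheusalert`. The file can also be syntax-checked with the `amtool` binary from the same tarball: `/usr/local/alertmanager/amtool check-config /usr/local/alertmanager/alertmanager.yml`.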


III. Installing and Configuring node_exporter
-------------------------------------------------------------------------------------------
1. Download and unpack the package, then set ownership

wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz

[root@localhost local]# tar -zxvf node_exporter-0.17.0.linux-amd64.tar.gz -C /usr/local/
[root@localhost local]# mv node_exporter-0.17.0.linux-amd64/ node_exporter

[root@localhost local]# chown -R prometheus:prometheus /usr/local/node_exporter

2. Create the systemd unit file node_exporter.service

# vim /usr/lib/systemd/system/node_exporter.service

[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

3. Start the service

systemctl daemon-reload
systemctl enable node_exporter.service
systemctl start node_exporter.service

4. Metrics reported by the monitored host (the data shown in Grafana originates here)

Open http://192.168.100.205:9100/metrics to see exactly which metrics the exporter exposes.
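
The metrics page is long, so it is often easier to pull out a single metric on the command line. A small sketch, assuming `curl` and `awk` are available; `metric_samples` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# metric_samples NAME
# Read Prometheus exposition text on stdin and print only the samples of NAME,
# skipping the "# HELP" / "# TYPE" comment lines.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }
    {
      metric = $1
      sub(/\{.*/, "", metric)   # drop the {label="..."} part, if any
      if (metric == name) print
    }'
}
```

Usage: `curl -s http://192.168.100.205:9100/metrics | metric_samples node_load1`. The name is matched exactly, so `node_load1` does not also return `node_load15`.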


5. Add a client host to monitoring

5.1 Install the agent on the client

[root@dockerhome src]# tar -zxvf node_exporter-0.17.0.linux-amd64.tar.gz -C /usr/local/
[root@dockerhome local]# mv node_exporter-0.17.0.linux-amd64/ node_exporter
# vim /etc/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target

Create the user and set ownership, then start the service and open the firewall port:
[root@dockerhome node_exporter]# groupadd prometheus
[root@dockerhome node_exporter]# useradd -g prometheus -s /sbin/nologin prometheus
[root@dockerhome node_exporter]# chown -R prometheus:prometheus /usr/local/node_exporter/
[root@dockerhome node_exporter]# systemctl daemon-reload
[root@dockerhome node_exporter]# systemctl restart node_exporter
[root@dockerhome node_exporter]# firewall-cmd --add-port=9100/tcp --permanent
success
[root@dockerhome node_exporter]# firewall-cmd --reload
success

-------------------------------------------------------------------------------------------
IV. Installing and Configuring Grafana
-------------------------------------------------------------------------------------------
1. Download and install
wget https://dl.grafana.com/oss/release/grafana-6.7.3-1.x86_64.rpm
yum localinstall grafana-6.7.3-1.x86_64.rpm

2. Start the service
systemctl daemon-reload
systemctl enable grafana-server.service
systemctl start grafana-server.service

3. Open the web UI at http://ip:3000

Default username/password: admin/admin

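4. Add a data source in Grafana
On the home page after logging in, click "Configuration - Data Sources" to open the add-data-source page, and configure it as follows:
Name: prometheus
Type: prometheus
URL: http://192.168.100.205:9090/
Access: Server
Uncheck Default, leave everything else at its default, and click "Add".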
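
The same data source can also be created without the web UI through Grafana's HTTP API (`POST /api/datasources`). This is a sketch under assumptions: it uses the default admin/admin login unless `GRAFANA_AUTH` is set, Grafana is reachable at `localhost:3000`, and both function names are made up. In the API, the UI's "Server" access mode is spelled `proxy`.

```shell
#!/usr/bin/env bash
# grafana_datasource_payload NAME URL
# JSON body for a non-default Prometheus data source in Server (proxy) mode.
grafana_datasource_payload() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":false}\n' "$1" "$2"
}

# add_grafana_datasource [HOST:PORT]
# Post the payload using basic auth; curl reads the body from stdin (-d @-).
add_grafana_datasource() {
  grafana_datasource_payload prometheus "http://192.168.100.205:9090" |
    curl -fsS -u "${GRAFANA_AUTH:-admin:admin}" \
      -H 'Content-Type: application/json' -d @- \
      "http://${1:-localhost:3000}/api/datasources"
}
```

Run `add_grafana_datasource` once on the Grafana host. If the admin password has been changed, export `GRAFANA_AUTH=user:password` first.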
5. Import a dashboard template
-------------------------------------------------------------------------------------------

V. Connecting Prometheus to Alertmanager, Adding Monitored Hosts, and Configuring Alert Rules
5.1 Configuration files
[root@localhost prometheus]# cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 10s # Same as the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
      - 'localhost:9093' # connect to Alertmanager

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/usr/local/prometheus/rules/rule*.yml" # alert rule files
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - file_sd_configs:
    - files:
      - 'conf/host.yml' # list of hosts whose node_exporter is scraped
      refresh_interval: 10s
    job_name: Host
    metrics_path: /metrics
    relabel_configs:
    - source_labels: [__address__]
      regex: (.*)
      target_label: instance
      replacement: $1
    - source_labels: [__address__]
      regex: (.*)
      target_label: __address__
      replacement: $1:9100
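
The two `relabel_configs` rules in the Host job are what allow conf/host.yml to list bare IP addresses. They can be traced by hand for one target. The function below only mimics the rules in plain shell for illustration; `relabel_target` is a hypothetical name and Prometheus does not run this code.

```shell
#!/usr/bin/env bash
# relabel_target ADDRESS
# Mimic the two relabel rules for a single file_sd target.
relabel_target() {
  local address=$1
  local instance=$address      # rule 1: instance    <- __address__ (bare IP)
  address="${address}:9100"    # rule 2: __address__ <- "<old __address__>:9100"
  echo "instance=${instance} __address__=${address}"
}
```

`relabel_target 172.18.240.18` prints `instance=172.18.240.18 __address__=172.18.240.18:9100`: the target is scraped on node_exporter's port, while the `instance` label stays free of the port number.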


[root@localhost rules]# cat rule_host.yml
groups:
- name: Host status alerts
  rules:
  - alert: Host status
    expr: up == 0
    for: 5s
    labels:
      status: critical
    annotations:
      summary: "{{$labels.instance}} is down"
      description: "{{$labels.instance}} has been down for more than 5 seconds"

  - alert: CPU usage
    expr: (100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
    for: 3s
    labels:
      status: warning
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high!"
      description: "{{$labels.instance}} CPU usage is above 80% (current: {{$value}}%)"

  - alert: Memory usage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 3s
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} memory usage is too high!"
      description: "{{$labels.instance}} memory usage is above 80% (current: {{$value}}%)"

  - alert: Disk capacity
    expr: (1 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"})) * 100 > 80
    for: 3s
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} {{$labels.mountpoint}} disk partition usage is too high!"
      description: "{{$labels.instance}} {{$labels.mountpoint}} disk partition usage is above 80% (current: {{$value}}%)"

  - alert: IO performance
    expr: avg(irate(node_disk_io_time_seconds_total[3m])) by (instance) * 100 > 80
    for: 3s
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} disk I/O utilization is too high!"
      description: "{{$labels.instance}} disk I/O utilization is above 80% (current: {{$value}}%)"
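
The memory rule is plain arithmetic over two gauges, (total - available) / total * 100, compared against 80. A worked example in shell, with the hypothetical helper `used_pct` standing in for the PromQL expression:

```shell
#!/usr/bin/env bash
# used_pct TOTAL AVAILABLE
# Same formula as the memory rule: (total - available) / total * 100.
used_pct() {
  awk -v t="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (t - a) / t * 100 }'
}
```

A host with 8000000000 bytes of memory, of which 1000000000 are available, gives `used_pct 8000000000 1000000000` = 87.5, which is above the threshold of 80, so the alert would fire. The rule file itself can be checked with `/usr/local/prometheus/promtool check rules /usr/local/prometheus/rules/rule_host.yml`.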

 

[root@localhost prometheus]# cat conf/host.yml
- labels:
    service: autofind
  targets:
  - 172.18.240.18
  - 192.168.1.202
  - 192.168.1.203
  - 172.18.240.52
  - 172.18.240.99

5.2 Add the monitored hosts' IPs on the server
[root@ops001 prometheus-2.4.3.linux-amd64]# vim conf/host.yml
- labels:
    service: autofind
  targets:
  - 172.18.240.18
  - 192.168.1.202
  - 172.18.240.52
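
Adding a host by hand in vim works, but the edit is easy to script. The sketch below appends an IP only if it is not already listed. It assumes the file has a single group whose `targets:` list is the last key, that entries are indented as `  - <ip>`, and that the file ends with a newline; `add_target` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# add_target FILE IP
# Append IP to the targets list unless an identical entry already exists.
add_target() {
  local file=$1 ip=$2
  if grep -qxF -- "  - ${ip}" "$file"; then
    echo "already present: ${ip}"
  else
    printf '  - %s\n' "$ip" >> "$file"
    echo "added: ${ip}"
  fi
}
```

Usage: `add_target /usr/local/prometheus/conf/host.yml 172.18.240.99`. Because the Host job reads this file through `file_sd_configs`, Prometheus picks up the new target without a restart.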

5.3 Check the result
http://192.168.1.203:3000/


5.4 Configure alert rules
// Configure an alert rule that fires when a host goes down
[root@localhost prometheus]# vi rules/rule_host.yml


-------------------------------------------------------------------------------------------
VI. Alert Notification Setup
https://github.com/feiyu563/PrometheusAlert

6.1 Create the service

[root@localhost linux]# tar -zxvf PrometheusAlert.tar.gz -C /usr/local/
[root@localhost linux]# vim /etc/systemd/system/PrometheusAlert.service
[Unit]
Description=PrometheusAlert
After=network-online.target

[Service]
Type=simple
User=prometheus
Restart=on-failure
WorkingDirectory=/usr/local/PrometheusAlert/PrometheusAlert/example/linux/
ExecStart=/usr/local/PrometheusAlert/PrometheusAlert/example/linux/PrometheusAlert

[Install]
WantedBy=multi-user.target


6.2 Edit the configuration file
#---------------------↓webhook-----------------------
# Enable the DingTalk alert channel? Several channels can be enabled at once (0 = off, 1 = on)
open-dingding=1
# Default DingTalk robot URL
ddurl=https://oapi.dingtalk.com/robot/send?access_token=7dab8205a446c43f9xxxxxxxxxxxxxxx
# @ everyone? (0 = off, 1 = on)
dd_isatall=0


-----------------------------------------------------------------------------------------

Test
curl -H "Content-Type: application/json" -d '{"msgtype":"text","text":{"content":"prometheus alert test"}}' 'https://oapi.dingtalk.com/robot/send?access_token=7dab8205a446c43f9xxxxxxxxxxxxxxx'
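
The same test can be wrapped in a small function so that the robot token is not typed on the command line. This is only a sketch: the environment variable `DINGTALK_TOKEN` and both function names are assumptions of this example, and the message text is not JSON-escaped, so keep it free of double quotes and backslashes.

```shell
#!/usr/bin/env bash
# dingtalk_text_payload MESSAGE
# Build the JSON body for a DingTalk "text" message.
dingtalk_text_payload() {
  printf '{"msgtype":"text","text":{"content":"%s"}}\n' "$1"
}

# send_dingtalk_text MESSAGE
# Post the message to the robot; aborts if DINGTALK_TOKEN is unset or empty.
send_dingtalk_text() {
  : "${DINGTALK_TOKEN:?set DINGTALK_TOKEN to the robot access token}"
  dingtalk_text_payload "$1" |
    curl -fsS -H 'Content-Type: application/json' -d @- \
      "https://oapi.dingtalk.com/robot/send?access_token=${DINGTALK_TOKEN}"
}
```

Usage: `export DINGTALK_TOKEN=...` once in the shell, then `send_dingtalk_text 'prometheus alert test'`.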

 

VII. Summary
-------------------------------------------------------------------------------------------

[root@localhost local]# systemctl restart alertmanager
[root@localhost local]# systemctl restart PrometheusAlert
[root@localhost local]# systemctl restart prometheus
[root@localhost local]# systemctl restart grafana-server
[root@localhost local]# systemctl restart node_exporter
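
After restarting everything, a single loop can confirm that each unit is running. A minimal sketch, assuming the unit names used in this guide; `STACK_SERVICES` and `stack_status` are hypothetical names.

```shell
#!/usr/bin/env bash
# Units installed in this guide, in the order they were set up.
STACK_SERVICES="prometheus alertmanager node_exporter grafana-server PrometheusAlert"

# Print one line per unit; return non-zero if any unit is not active.
stack_status() {
  local svc failed=0
  for svc in $STACK_SERVICES; do
    if systemctl is-active --quiet "$svc"; then
      echo "$svc: active"
    else
      echo "$svc: NOT active"
      failed=1
    fi
  done
  return "$failed"
}
```

Run `stack_status` on the monitoring server. The non-zero exit status makes it usable in scripts, for example `stack_status || journalctl -u prometheus -n 50`.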

 

posted @ 2020-06-01 16:57  db小白