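Deploying blackbox_exporter for black-box monitoring
1. blackbox_exporter overview
blackbox_exporter probes targets over HTTP, HTTPS, DNS, TCP, ICMP, and gRPC.
For example, an HTTP probe can check whether a website returns status code 200 to decide whether the service is healthy.
A TCP probe can check whether a port on a host is listening.
An ICMP probe can ping a host to test connectivity.
A gRPC probe can call an interface and verify that the service is working.
A DNS probe can check domain name resolution.
2. Download blackbox_exporter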
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
3. Extract the archive
[root@node-exporter42 ~]# tar xf blackbox_exporter-0.25.0.linux-amd64.tar.gz -C /yanshier/softwares/
4. Write the systemd unit file
[root@node-exporter42 ~]# cat > /lib/systemd/system/blackbox_exporter.service <<EOF
[Unit]
Description=blackbox service
Documentation=https://www.yanshier.com/
After=network.target
[Service]
ExecStart=/yanshier/softwares/blackbox_exporter-0.25.0.linux-amd64/blackbox_exporter --config.file="/yanshier/softwares/blackbox_exporter-0.25.0.linux-amd64/blackbox.yml" --web.listen-address=:9115
[Install]
WantedBy=multi-user.target
EOF
5. Start blackbox_exporter
[root@node-exporter42 ~]# systemctl daemon-reload
[root@node-exporter42 ~]#
[root@node-exporter42 ~]# systemctl enable --now blackbox_exporter
Created symlink /etc/systemd/system/multi-user.target.wants/blackbox_exporter.service → /lib/systemd/system/blackbox_exporter.service.
[root@node-exporter42 ~]#
[root@node-exporter42 ~]# ss -ntl | grep 9115
LISTEN 0 4096 *:9115 *:*
[root@node-exporter42 ~]#
6. Open the blackbox_exporter web UI
http://10.0.0.42:9115/
7. View the list of modules built into blackbox_exporter
http://10.0.0.42:9115/config
8. Manually probe the Baidu website with the http_2xx module:
http://10.0.0.42:9115/probe?target=baidu.com&module=http_2xx&debug=true
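The manual probe above is an ordinary HTTP GET, so it can also be scripted. The following is a minimal sketch, assuming the exporter address used in this document; the helper names build_probe_url and probe_succeeded are hypothetical and not part of blackbox_exporter. It omits debug=true so that the response is plain metrics text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_probe_url(exporter: str, target: str, module: str = "http_2xx") -> str:
    """Build the /probe URL for a blackbox_exporter instance given as host:port."""
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/probe?{query}"


def probe_succeeded(metrics_text: str) -> bool:
    """Return True if the metrics text reports probe_success with value 1."""
    for line in metrics_text.splitlines():
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    return False


if __name__ == "__main__":
    # Needs network access to the exporter deployed in the steps above.
    url = build_probe_url("10.0.0.42:9115", "baidu.com")
    with urlopen(url, timeout=10) as resp:
        print(url, probe_succeeded(resp.read().decode()))
```

The URL builder and the parser are kept separate so that the parsing logic can be checked without a running exporter.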
- Integrating Prometheus with the blackbox_exporter http_2xx module for website monitoring
1. Edit the Prometheus configuration file
[root@prometheus-server31 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: "yanshier-blackbox"
    # Path of the probe endpoint on blackbox_exporter
    metrics_path: /probe
    # Module to use; if omitted, blackbox_exporter falls back to http_2xx.
    params:
      module: [http_2xx]
    static_configs:
      # Targets to probe
      - targets:
          - www.jd.com
          - www.yanshier.com
          - 10.0.0.51:3000
    # Prometheus does not scrape these targets directly; blackbox_exporter probes them on its behalf.
    relabel_configs:
      # Pass the original target address as the "target" URL parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Point the scrape address at blackbox_exporter
      - target_label: __address__
        replacement: 10.0.0.42:9115
      # instance would otherwise follow the rewritten __address__, so set it back to the probed target
      - source_labels: [__param_target]
        target_label: instance
  - job_name: 'yanshier_blackbox_exporter'
    static_configs:
      - targets: ['10.0.0.42:9115']
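The effect of the three relabel rules is easier to see when they are applied by hand to one target. The sketch below models only these three rules in plain Python; it is not Prometheus's relabeling engine, and the function name relabel_for_blackbox is hypothetical.

```python
def relabel_for_blackbox(address: str, exporter: str = "10.0.0.42:9115") -> dict:
    """Apply the three relabel rules from the blackbox job to a single target."""
    labels = {"__address__": address}
    # Rule 1: copy the original address into the "target" URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: scrape blackbox_exporter instead of the target itself.
    labels["__address__"] = exporter
    # Rule 3: keep the probed target as the instance label.
    labels["instance"] = labels["__param_target"]
    return labels


if __name__ == "__main__":
    for target in ["www.jd.com", "www.yanshier.com", "10.0.0.51:3000"]:
        print(relabel_for_blackbox(target))
```

For each target, the scrape request goes to the exporter while the instance label still names the site that was probed.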
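2. Hot-reload the configuration
[root@prometheus-server31 ~]# curl -X POST 10.0.0.31:9090/-/reload
3. Verify that the targets are up
http://10.0.0.31:9090/targets
4. Metrics exposed by the blackbox_exporter http_2xx probe
probe_http_ssl:
A value of 1 means the instance was probed over HTTPS; 0 means plain HTTP.
probe_http_status_code:
The HTTP status code returned by the site; 0 means the probe failed.
probe_http_duration_seconds:
Time spent in each phase of the request.
probe_duration_seconds:
Total time taken by the probe.
probe_success:
Whether the probe succeeded: 1 for success, 0 for failure.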
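To show how these metrics fit together, here is a small sketch that parses a probe response and condenses it into one status line. The sample text and its values are invented for illustration, and parse_samples and summarize are hypothetical helpers; the parser handles only simple lines without timestamps.

```python
def parse_samples(text: str) -> dict:
    """Parse exposition lines into {series: value}, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; the rest is the series.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def summarize(samples: dict) -> str:
    """Condense the http_2xx probe metrics into a single status line."""
    ok = samples.get("probe_success") == 1.0
    code = int(samples.get("probe_http_status_code", 0))
    scheme = "https" if samples.get("probe_http_ssl") == 1.0 else "http"
    total = samples.get("probe_duration_seconds", 0.0)
    return f"{'UP' if ok else 'DOWN'} status={code} scheme={scheme} duration={total:.3f}s"


# Illustrative sample only; real values come from the /probe endpoint.
SAMPLE = """\
# TYPE probe_success gauge
probe_success 1
probe_http_status_code 200
probe_http_ssl 1
probe_duration_seconds 0.25
probe_http_duration_seconds{phase="connect"} 0.05
"""

if __name__ == "__main__":
    print(summarize(parse_samples(SAMPLE)))
```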
5. Import Grafana dashboard templates
7587
13659
Using the blackbox_exporter ICMP module from Prometheus to check whether hosts are alive
1. Edit the Prometheus configuration file
[root@prometheus-server31 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: 'yanshier-blackbox-exporter-icmp'
    metrics_path: /probe
    params:
      # If no module is given, the default is "http_2xx". The name must match a module defined in blackbox.yml, otherwise the probe fails.
      module: [icmp]
    static_configs:
      - targets:
          - 10.0.0.41
          - 10.0.0.42
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.0.0.42:9115
2. Reload the configuration
[root@prometheus-server31 ~]# curl -X POST 10.0.0.31:9090/-/reload
3. Open the Prometheus web UI
http://10.0.0.31:9090/targets
4. Open the blackbox_exporter web UI
http://10.0.0.42:9115/
5. Filter by job in Grafana
Filter on the job label "yanshier-blackbox-exporter-icmp".
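The ICMP job reuses the relabeling of the HTTP job and changes only the module and the target list, so such jobs can be generated rather than copied. The sketch below is one way to do that with the standard library; blackbox_job is a hypothetical helper, and it prints JSON on the assumption that the YAML parser accepts JSON-style flow syntax.

```python
import json


def blackbox_job(name: str, module: str, targets: list,
                 exporter: str = "10.0.0.42:9115") -> dict:
    """Return one scrape job as a dict mirroring the YAML structure used above."""
    return {
        "job_name": name,
        "metrics_path": "/probe",
        "params": {"module": [module]},
        "static_configs": [{"targets": list(targets)}],
        "relabel_configs": [
            {"source_labels": ["__address__"], "target_label": "__param_target"},
            {"source_labels": ["__param_target"], "target_label": "instance"},
            {"target_label": "__address__", "replacement": exporter},
        ],
    }


if __name__ == "__main__":
    job = blackbox_job("yanshier-blackbox-exporter-icmp", "icmp",
                       ["10.0.0.41", "10.0.0.42"])
    # Review the output before merging it into scrape_configs.
    print(json.dumps([job], indent=2))
```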
- Using the blackbox_exporter TCP module from Prometheus to check whether ports are open
1. Edit the Prometheus configuration file
[root@prometheus-server31 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: 'yanshier-blackbox-exporter-tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 10.0.0.41:80
          - 10.0.0.42:22
          - 10.0.0.31:9090
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.0.0.42:9115
2. Reload the configuration
[root@prometheus-server31 ~]# curl -X POST http://10.0.0.31:9090/-/reload
[root@prometheus-server31 ~]#
3. Open the Prometheus web UI
http://10.0.0.31:9090/targets
4. Open the blackbox_exporter web UI
http://10.0.0.42:9115/
5. View the data in Grafana
Filter on the job label "yanshier-blackbox-exporter-tcp".
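Conceptually, a TCP connect probe succeeds when a connection to host:port can be opened within a timeout. The sketch below illustrates that idea with the Python standard library; it is not blackbox_exporter's implementation, and tcp_port_open is a hypothetical name.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Targets taken from the job above.
    for host, port in [("10.0.0.41", 80), ("10.0.0.42", 22), ("10.0.0.31", 9090)]:
        print(f"{host}:{port}", tcp_port_open(host, port))
```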
Using the blackbox_exporter ssh_banner module from Prometheus to check whether the SSH service is alive
1. Edit the Prometheus configuration file
[root@prometheus-server31 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: 'yanshier-blackbox-exporter-ssh'
    metrics_path: /probe
    params:
      module: [ssh_banner]
    static_configs:
      - targets:
          - 10.0.0.41:22
          - 10.0.0.43:22
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.0.0.42:9115
2. Reload the configuration
[root@prometheus-server31 ~]# curl -X POST http://10.0.0.31:9090/-/reload
[root@prometheus-server31 ~]#
3. Open the Prometheus web UI
http://10.0.0.31:9090/targets
4. Open the blackbox_exporter web UI
http://10.0.0.42:9115/
5. View the data in Grafana
Filter on the job label "yanshier-blackbox-exporter-ssh".
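An SSH banner check goes one step further than a plain TCP connect: it reads the first bytes the server sends and checks that they look like an SSH identification string. The sketch below assumes the pattern ^SSH-2.0-, which is what the ssh_banner module in the exporter's example configuration is understood to expect; check the /config page of your exporter to confirm. The function names are hypothetical.

```python
import re
import socket

# Assumed banner pattern; verify it against the module shown on /config.
SSH_BANNER = re.compile(r"^SSH-2\.0-")


def is_ssh_banner(first_line: str) -> bool:
    """Return True if the text starts with an SSH 2.0 identification string."""
    return SSH_BANNER.match(first_line) is not None


def ssh_alive(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Connect, read the greeting, and check it against the banner pattern."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(256).decode("ascii", errors="replace")
    except OSError:
        return False
    return is_ssh_banner(banner)


if __name__ == "__main__":
    for host in ["10.0.0.41", "10.0.0.43"]:
        print(host, ssh_alive(host))
```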
- Deploying the Pushgateway component
1. What Pushgateway is for
Pushgateway accepts custom metrics pushed by users' own scripts and jobs and holds them temporarily until Prometheus scrapes them.
2. Download Pushgateway
wget https://github.com/prometheus/pushgateway/releases/download/v1.10.0/pushgateway-1.10.0.linux-amd64.tar.gz
3. Extract the archive
[root@node-exporter42 ~]# tar xf pushgateway-1.10.0.linux-amd64.tar.gz -C /usr/local/bin/ pushgateway-1.10.0.linux-amd64/pushgateway --strip-components=1
4. Create the data directory
[root@node-exporter42 ~]# mkdir -pv /yanshier/data/pushgateway
5. Write the systemd unit file
cat > /lib/systemd/system/pushgateway.service <<EOF
[Unit]
Description=pushgateway services
Documentation=https://www.yanshier.com
After=network.target
[Service]
ExecStart=/usr/local/bin/pushgateway --web.telemetry-path="/metrics" --web.listen-address=:9091 --web.enable-lifecycle --persistence.file=/yanshier/data/pushgateway/pushgateway.data --persistence.interval=1m
[Install]
WantedBy=multi-user.target
EOF
6. Start the service
[root@node-exporter42 ~]# systemctl daemon-reload
[root@node-exporter42 ~]#
[root@node-exporter42 ~]# systemctl enable --now pushgateway.service
Created symlink /etc/systemd/system/multi-user.target.wants/pushgateway.service → /lib/systemd/system/pushgateway.service.
[root@node-exporter42 ~]#
[root@node-exporter42 ~]# ss -ntl | grep 9091
LISTEN 0 4096 *:9091 *:*
[root@node-exporter42 ~]#
7. Open the Pushgateway web UI
http://10.0.0.42:9091/
- Hands-on: integrating Prometheus with Pushgateway
1. Edit the Prometheus configuration file
[root@prometheus-server31 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: 'yanshier-pushgateway'
    static_configs:
      - targets: ['10.0.0.42:9091']
2. Hot-reload the configuration
[root@prometheus-server31 ~]# curl -X POST 10.0.0.31:9090/-/reload
3. Verify that the configuration took effect
http://10.0.0.31:9090/targets
4. Send test data to Pushgateway
4.1 Send a single sample
[root@node-exporter41 ~]# echo "students_online_count 78" | curl --data-binary @- http://10.0.0.42:9091/metrics/job/yanshier-student-online
4.2 Send multiple samples
cat <<EOF | curl --data-binary @- http://10.0.0.42:9091/metrics/job/yanshier_hobby/instance/10.0.0.99
# TYPE xijiao_count counter
xijiao_count{name="wanghaonan",age="22"} 365
xijiao_count{name="songlpngyang",age="22"} 366
xijiao_count{name="wanghuifeng",age="21"} 98
xijiao_count{name="yuanshuhao",age="23"} 86
xijiao_count{name="libowen",age="23"} 66
xijiao_count{name="luozhiyang",age="24"} 32
# TYPE game_seconds gauge
# HELP game_seconds play game times.
game_seconds{name="yanbo"} 3600
EOF
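The same pushes can be made from a program instead of curl. The sketch below uses only the Python standard library and mirrors the URL layout and text format of the curl examples; push_url, format_sample, and push are hypothetical helpers, and label values are not escaped, so it suits only simple values like the ones shown here.

```python
from urllib.request import Request, urlopen


def push_url(gateway: str, job: str, instance: str = "") -> str:
    """Build the push URL, optionally including an instance grouping label."""
    url = f"http://{gateway}/metrics/job/{job}"
    return f"{url}/instance/{instance}" if instance else url


def format_sample(name: str, value, labels: dict = None) -> str:
    """Format one sample line in the text exposition format (no escaping)."""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in labels.items())
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def push(gateway: str, job: str, lines: list, instance: str = "") -> int:
    """POST all lines in one request; the body must end with a newline."""
    body = ("\n".join(lines) + "\n").encode()
    req = Request(push_url(gateway, job, instance), data=body, method="POST")
    with urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    lines = ["# TYPE game_seconds gauge",
             format_sample("game_seconds", 3600, {"name": "yanbo"})]
    print(push("10.0.0.42:9091", "yanshier_hobby", lines, instance="10.0.0.99"))
```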
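5. Display the data in Grafana
Build a custom dashboard. Details are omitted here; see the video.
Example: monitoring the 12 TCP connection states with Prometheus
1. Write the collection script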
[root@node-exporter41 ~]# cat > /usr/local/bin/tcp_status.sh <<'EOF'
#!/bin/bash
# Define counters for the 12 TCP states
ESTABLISHED_COUNT=0
SYN_SENT_COUNT=0
SYN_RECV_COUNT=0
FIN_WAIT1_COUNT=0
FIN_WAIT2_COUNT=0
TIME_WAIT_COUNT=0
CLOSE_COUNT=0
CLOSE_WAIT_COUNT=0
LAST_ACK_COUNT=0
LISTEN_COUNT=0
CLOSING_COUNT=0
UNKNOWN_COUNT=0
# Job name
JOB_NAME=tcp_status
# Instance name
INSTANCE_NAME=harbor250
# Pushgateway host
HOST=10.0.0.42
# Pushgateway port
PORT=9091
# The 12 TCP states
ALL_STATUS=(ESTABLISHED SYN_SENT SYN_RECV FIN_WAIT1 FIN_WAIT2 TIME_WAIT CLOSE CLOSE_WAIT LAST_ACK LISTEN CLOSING UNKNOWN)
# Declare an associative array, similar to a Python dict or a Go map
declare -A tcp_status
# Count the connections in each of the 12 states
for i in "${ALL_STATUS[@]}"
do
    # -w matches whole words so that CLOSE does not also count CLOSE_WAIT; -c prints the count
    temp=$(netstat -untalp | grep -cw "$i")
    tcp_status[${i}]=$temp
done
# Push the results to Pushgateway
for i in "${!tcp_status[@]}"
do
    data="$i ${tcp_status[$i]}"
    # Note: pushing one labeled series per request, as in the commented line below, keeps only the last state.
    # A POST replaces every metric with the same name in the group, so all series of one metric name must be sent in a single request.
    #data="yanshier_tcp_all_status{status=\"$i\"} ${tcp_status[$i]}"
    #echo $data
    echo "$data" | curl --data-binary @- http://${HOST}:${PORT}/metrics/job/${JOB_NAME}/instance/${INSTANCE_NAME}
    # sleep 1
done
EOF
2. Add a cron job that pushes the data to Pushgateway
[root@node-exporter41 ~]# echo "*/5 * * * * /usr/local/bin/tcp_status.sh" >> /var/spool/cron/crontabs/root
[root@node-exporter41 ~]#
[root@node-exporter41 ~]# crontab -l
*/5 * * * * /usr/local/bin/tcp_status.sh
[root@node-exporter41 ~]#
[root@node-exporter41 ~]# chmod +x /usr/local/bin/tcp_status.sh
[root@node-exporter41 ~]#
[root@node-exporter41 ~]# /usr/local/bin/tcp_status.sh
[root@node-exporter41 ~]#
3. Check the Pushgateway web UI
http://10.0.0.42:9091/
4. Reference
4.1
[root@prometheus-server31 ~]# sed -i 's/\xc2\xa0/ /g' /usr/local/bin/tcp_status2.sh
[root@prometheus-server31 ~]#
[root@prometheus-server31 ~]# cat -A /usr/local/bin/tcp_status2.sh
#!/bin/bash
pushgateway_url="http://10.0.0.42:9091/metrics/job/tcp_status"
time=$(date +%Y-%m-%d+%H:%M:%S)
state="SYN-SENT SYN-RECV FIN-WAIT-1 FIN-WAIT-2 TIME-WAIT CLOSE CLOSE-WAIT LAST-ACK LISTEN CLOSING ESTAB"
for i in $state
do
    # Compare the State column exactly so that CLOSE does not also count CLOSE-WAIT
    t=$(ss -tan | awk -v s="$i" '$1 == s' | wc -l)
    echo "tcp_connections{state=\"$i\"} $t" >> /tmp/tcp.txt
done
# Push all series in a single request
cat /tmp/tcp.txt | curl --data-binary @- "$pushgateway_url"
rm -f /tmp/tcp.txt
[root@prometheus-server31 ~]#
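Both scripts above count states by text matching. As an alternative sketch, the counting and formatting can be done in Python, producing one request body that carries every state as a label on a single metric. The sample ss output is invented for illustration, and count_states and to_exposition are hypothetical helpers.

```python
from collections import Counter


def count_states(ss_output: str) -> Counter:
    """Count the first column (State) of `ss -tan` output, skipping the header."""
    counts = Counter()
    for line in ss_output.splitlines()[1:]:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def to_exposition(counts: Counter, metric: str = "tcp_connections") -> str:
    """Render all states as labeled series of one metric, ready for one push."""
    lines = [f"# TYPE {metric} gauge"]
    for state in sorted(counts):
        lines.append(f'{metric}{{state="{state}"}} {counts[state]}')
    return "\n".join(lines) + "\n"


# Illustrative sample only; in practice read the output of `ss -tan`.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:22          0.0.0.0:*
ESTAB      0      0      10.0.0.41:22        10.0.0.1:50000
ESTAB      0      0      10.0.0.41:22        10.0.0.1:50001
CLOSE-WAIT 0      0      10.0.0.41:80        10.0.0.1:50002
"""

if __name__ == "__main__":
    print(to_exposition(count_states(SAMPLE)), end="")
```

Because the state is compared as a whole field, CLOSE and CLOSE-WAIT are never confused, and a single body avoids one push overwriting another.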
Example: a custom exporter written in Python
1.1 Install pip3
[root@prometheus-node42 ~]# apt update
[root@prometheus-node42 ~]# apt install -y python3-pip
1.2 Configure a pip mirror
[root@node-exporter41 ~]# mkdir ~/.pip
[root@node-exporter41 ~]#
[root@node-exporter41 ~]# vim ~/.pip/pip.conf
[root@node-exporter41 ~]#
[root@node-exporter41 ~]# cat ~/.pip/pip.conf
# [global]
# index-url=https://pypi.tuna.tsinghua.edu.cn/simple
# [install]
# trusted-host=pypi.douban.com
[global]
index-url=https://mirrors.aliyun.com/pypi/simple
[install]
trusted-host=mirrors.aliyun.com
[root@node-exporter41 ~]#
1.3 Install the required Python modules
[root@node-exporter41 ~]# pip3 install flask prometheus_client
[root@node-exporter41 ~]# pip3 list
1.4 Write the code
[root@node-exporter41 ~]# cat flask_metric.py
from prometheus_client import start_http_server, Counter, Summary
from flask import Flask, jsonify
from wsgiref.simple_server import make_server
import time
app = Flask(__name__)
# Create a metric to track time spent and requests made
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')
COUNTER_TIME = Counter("request_count", "Total request count of the host")
@app.route("/apps")
@REQUEST_TIME.time()
def requests_count():
    COUNTER_TIME.inc()
    return jsonify({"office": "https://www.yanshier.com"}, {"auther": "Jason Yin"})
if __name__ == "__main__":
    start_http_server(8000)
    httpd = make_server('0.0.0.0', 8001, app)
    httpd.serve_forever()
[root@node-exporter41 ~]#
1.5 Start the Python program
[root@node-exporter41 ~]# python3 flask_metric.py
...
# Once the test client is running, log lines like the following may appear
10.0.0.43 - - [13/Nov/2024 17:31:57] "GET /apps HTTP/1.1" 200 64
10.0.0.43 - - [13/Nov/2024 17:31:57] "GET /apps HTTP/1.1" 200 64
10.0.0.43 - - [13/Nov/2024 17:31:57] "GET /apps HTTP/1.1" 200 64
10.0.0.43 - - [13/Nov/2024 17:31:57] "GET /apps HTTP/1.1" 200 64
10.0.0.43 - - [13/Nov/2024 17:31:57] "GET /apps HTTP/1.1" 200 64
10.0.0.43 - - [13/Nov/2024 17:31:57] "GET /apps HTTP/1.1" 200 64
10.0.0.43 - - [13/Nov/2024 17:31:57] "GET /apps HTTP/1.1" 200 64
10.0.0.43 - - [13/Nov/2024 17:31:57] "GET /apps HTTP/1.1" 200 64
10.0.0.43 - - [13/Nov/2024 17:31:57] "GET /apps HTTP/1.1" 200 64
1.6 Client-side test
[root@node-exporter43 ~]# cat yanshier_curl_metrics.sh
#!/bin/bash
URL=http://10.0.0.41:8001/apps
while true;do
    curl_num=$(( $RANDOM%50+1 ))
    sleep_num=$(( $RANDOM%5+1 ))
    for c_num in `seq $curl_num`;do
        curl -s $URL &> /dev/null
    done
    sleep $sleep_num
done
[root@node-exporter43 ~]#
[root@node-exporter43 ~]# bash yanshier_curl_metrics.sh
1.7 Scrape the custom Python exporter from Prometheus
[root@prometheus-server31 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: "yinzhengjie_python_custom_metrics"
    static_configs:
      - targets:
          - 10.0.0.41:8000
1.8 Reload the configuration
curl -X POST http://10.0.0.31:9090/-/reload
1.9 Verify that Prometheus is collecting the data
http://10.0.0.31:9090/targets
1.10 Graphs in Grafana
request_count_total
Total number of requests to /apps.
increase(request_count_total{job="yinzhengjie_python_custom_metrics"}[1m])
Number of requests over the last minute.
irate(request_count_total{job="yinzhengjie_python_custom_metrics"}[1m])
Instantaneous per-second request rate (QPS), based on the two most recent samples in the 1m window.
request_processing_seconds_sum{job="yinzhengjie_python_custom_metrics"} / request_processing_seconds_count{job="yinzhengjie_python_custom_metrics"}
Average time spent processing a request.
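The arithmetic behind these expressions can be worked through with plain numbers. The sketch below is a simplification: real PromQL also handles counter resets and extrapolates to the window boundaries, which this code ignores. The function names and sample values are illustrative only.

```python
def increase(first: float, last: float) -> float:
    """Counter growth between two samples, assuming no counter reset in between."""
    return last - first


def per_second_rate(first: float, last: float, seconds: float) -> float:
    """Growth divided by the time between the two samples."""
    return increase(first, last) / seconds


def average_latency(total_seconds: float, request_count: float) -> float:
    """Mirror of request_processing_seconds_sum / request_processing_seconds_count."""
    return total_seconds / request_count if request_count else 0.0


if __name__ == "__main__":
    # Hypothetical samples: the counter read 100, then 160 one minute later.
    print(increase(100, 160))
    print(per_second_rate(100, 160, 60))
    # Hypothetical summary: 3 seconds spent on 150 requests in total.
    print(average_latency(3.0, 150))
```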
Prometheus federation
1. Configure the Prometheus 32 node
[root@prometheus-server32 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: 'yanshier-prometheus32'
    static_configs:
      - targets: ["10.0.0.41:9100"]
[root@prometheus-server32 ~]# curl -X POST 10.0.0.32:9090/-/reload
[root@prometheus-server32 ~]#
2. Configure the Prometheus 33 node
[root@prometheus-server33 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: 'yanshier-prometheus33'
    static_configs:
      - targets: ["10.0.0.42:9100","10.0.0.43:9100"]
[root@prometheus-server33 ~]# curl -X POST 10.0.0.33:9090/-/reload
[root@prometheus-server33 ~]#
3. Verify the configuration on each node
http://10.0.0.32:9090/targets
http://10.0.0.33:9090/targets
Tip:
Also run the PromQL query node_cpu_guest_seconds_total on both Prometheus servers.
4. Configure federation on Prometheus 31
[root@prometheus-server31 ~]# vim /yanshier/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml
...
  - job_name: "prometheus-federate-32"
    metrics_path: "/federate"
    # Controls how label conflicts are resolved. Valid values are true and false; the default is false.
    # true: keep the labels from the scraped data and ignore the conflicting server-side labels, so the original labels are preserved.
    # false: attach the server-side labels and rename the conflicting scraped labels with an "exported_" prefix.
    honor_labels: true
    params:
      "match[]":
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node.*"}'
    static_configs:
      - targets:
          - "10.0.0.32:9090"
  - job_name: "prometheus-federate-33"
    metrics_path: "/federate"
    honor_labels: true
    params:
      "match[]":
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node.*"}'
    static_configs:
      - targets:
          - "10.0.0.33:9090"
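The effect of honor_labels on a federated series can be illustrated with a small model of the conflict resolution. This sketch covers only the job and instance style of conflict and is not Prometheus code; merge_labels is a hypothetical name.

```python
def merge_labels(scraped: dict, server: dict, honor_labels: bool) -> dict:
    """Resolve conflicts between scraped labels and labels the server attaches."""
    merged = dict(scraped)
    for name, value in server.items():
        if name not in scraped:
            # No conflict: simply attach the server-side label.
            merged[name] = value
        elif not honor_labels:
            # Conflict, honor_labels=false: keep the scraped value under an
            # exported_ prefix and let the server-side label win.
            merged["exported_" + name] = scraped[name]
            merged[name] = value
        # Conflict, honor_labels=true: the scraped value is kept unchanged.
    return merged


if __name__ == "__main__":
    scraped = {"job": "yanshier-prometheus32", "instance": "10.0.0.41:9100"}
    server = {"job": "prometheus-federate-32", "instance": "10.0.0.32:9090"}
    print(merge_labels(scraped, server, honor_labels=True))
    print(merge_labels(scraped, server, honor_labels=False))
```

With honor_labels set to true, the federated series keep the job names from the downstream servers, which is why a later query can still filter on job names beginning with "yanshier".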
5. Hot-reload the configuration
[root@prometheus-server31 ~]# curl -X POST 10.0.0.31:9090/-/reload
6. Verify that the configuration took effect
http://10.0.0.31:9090/targets
Query with PromQL:
node_cpu_guest_seconds_total{job=~"yanshier.*"}
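When federation returns no data, it can help to request the /federate endpoint of a downstream server directly with the same match[] selectors. The sketch below builds such a URL with the standard library; federate_url is a hypothetical helper, and the server address is the one used in this document.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def federate_url(server: str, matchers: list) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    # doseq=True repeats the key for every list element.
    query = urlencode({"match[]": matchers}, doseq=True)
    return f"http://{server}/federate?{query}"


if __name__ == "__main__":
    url = federate_url("10.0.0.32:9090",
                       ['{job="prometheus"}', '{__name__=~"node.*"}'])
    # Paste the printed URL into curl or a browser to inspect the raw series.
    print(url)
    print(parse_qs(urlsplit(url).query)["match[]"])
```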
7. Import the Grafana dashboard
1860