1. Lab Environment

Software: Prometheus, Grafana, Alertmanager

Hardware: a k8s cluster (one control-plane node, worker nodes)

Kubernetes: single-master installation with kubeadm

2. Prometheus

2.1. Features

  • A multi-dimensional data model: each time series is uniquely identified by its metric name and a set of key-value label pairs;
  • A flexible query language (PromQL) that can add, multiply, join, and otherwise combine metrics;
  • Deployable standalone on a single node, with no dependency on external distributed storage;
  • Time-series data is collected via an HTTP-based pull model;
  • Short-lived jobs can push time series to the Prometheus server through an intermediary gateway (pushgateway);
  • Efficient storage: each sample takes roughly 3.5 bytes; 3 million time series at a 30 s interval retained for 60 days consume roughly 200 GB of disk;
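The first point, a series being uniquely identified by its metric name plus label set, can be illustrated by parsing one line of Prometheus's text exposition format. A minimal sketch (it ignores escape sequences, timestamps, and comment lines; the sample metric value is made up):

```python
import re

def parse_sample(line):
    """Parse one line of the Prometheus text exposition format into
    (metric_name, labels, value). Minimal sketch: no escapes, timestamps,
    or comment/HELP lines are handled."""
    m = re.match(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$', line)
    name, label_body, value = m.group(1), m.group(2) or "", float(m.group(3))
    labels = dict(re.findall(r'(\w+)="([^"]*)"', label_body))
    return name, labels, value

# The metric name plus the full label set identifies exactly one series:
name, labels, value = parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6')
```

Two samples that differ in even one label value belong to two different series.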

2.2. Components

the main Prometheus server which scrapes and stores time series data

client libraries for instrumenting application code

push gateway for supporting short-lived jobs

special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.

an alertmanager to handle alerts

various support tools

Overview|Prometheus

2.3. Zabbix vs. Prometheus

| Zabbix | Prometheus |
| --- | --- |
| Backend written in C, UI in PHP; customization is difficult | Backend written in Go, Grafana as the frontend; customization by editing JSON is relatively easy |
| Cluster scale tops out at around 10,000 nodes | Supports larger clusters, and is faster |
| Well suited to monitoring physical-machine environments | Well suited to cloud environments, with better integration for OpenStack and Kubernetes |
| Monitoring data stored in a relational database such as MySQL; hard to extend existing data with new dimensions | Monitoring data stored in a time-series database; easy to build new aggregations over existing data |
| Simple to install: a single zabbix-server package covers all server-side functions | More complex to install: monitoring, alerting, and the UI are separate components |
| Mature graphical interface | Weaker UI; much of the configuration requires editing config files |


3. Installation

3.1. node-exporter

curl http://192.168.56.129:9100/metrics

docker hub

[root@anyu967master1 ~]# kubectl create ns monitor-sa
[root@anyu967master1 ~]# docker pull prom/node-exporter
[root@anyu967master1 prometheus]# kubectl get pods -n monitor-sa -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP               NODE             NOMINATED NODE   READINESS GATES
node-exporter-fswvs   1/1     Running   0          57s   192.168.56.130   anyu967node1     <none>           <none>
node-exporter-z4gtq   1/1     Running   0          56s   192.168.56.129   anyu967master1   <none>           <none>
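The DaemonSet manifest behind these Pods is not shown in the original; a minimal sketch of what it might look like (file name and exact fields are assumptions — `hostNetwork: true` is what makes each Pod reachable on its node's IP at :9100, matching the `curl` check above):

```yaml
# node-exporter.yaml (illustrative sketch)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor-sa
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostNetwork: true            # expose :9100 on the node IP itself
      tolerations:                 # also run on the control-plane node
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: node-exporter
        image: prom/node-exporter
        ports:
        - containerPort: 9100
```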

# Create the service account "monitor" in the monitor-sa namespace
[root@anyu967master1 prometheus]# kubectl create serviceaccount monitor -n monitor-sa
serviceaccount/monitor created
# Create a clusterrolebinding named monitor-clusterrolebinding, binding cluster-admin to monitor-sa:monitor (ClusterRoleBindings are cluster-scoped, so -n has no effect here)
[root@anyu967master1 prometheus]# kubectl create clusterrolebinding monitor-clusterrolebinding -n monitor-sa --clusterrole=cluster-admin --serviceaccount=monitor-sa:monitor    
clusterrolebinding.rbac.authorization.k8s.io/monitor-clusterrolebinding created

3.2. Installing Prometheus
Dashboards-k8s

# Create the ConfigMap that holds the Prometheus configuration
[root@anyu967master1 prometheus]# kubectl apply -f prometheus-cfg.yaml 
configmap/prometheus-config created
[root@anyu967master1 prometheus]# kubectl get configmap -n monitor-sa
NAME                DATA   AGE
kube-root-ca.crt    1      6h34m
prometheus-config   1      87s

[root@anyu967master1 prometheus]# kubectl describe configmap prometheus-config -n monitor-sa
Name:         prometheus-config
Namespace:    monitor-sa
Labels:       app=prometheus
Annotations:  <none>
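The contents of prometheus-cfg.yaml are not shown above. A minimal sketch of what the ConfigMap might contain (job name, intervals, and the relabel rule are illustrative assumptions; the 10250→9100 rewrite points node discovery at the node-exporter port):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor-sa
  labels:
    app: prometheus
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s       # how often targets are scraped
      scrape_timeout: 10s
      evaluation_interval: 1m    # how often alert rules are evaluated
    scrape_configs:
    - job_name: kubernetes-node
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100' # rewrite the kubelet port to node-exporter
        target_label: __address__
        action: replace
```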

[root@anyu967master1 prometheus]# kubectl apply -f prometheus-svc.yaml
service/prometheus created
[root@anyu967master1 prometheus]# kubectl get svc -n monitor-sa
NAME         TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
prometheus   NodePort   10.106.162.34   <none>        9090:30920/TCP   10s  # http://192.168.56.130:30920
[root@anyu967master1 prometheus]# kubectl get pods -n monitor-sa -o wide -l app=prometheus
NAME                                 READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
prometheus-server-65586786cc-gksmn   1/1     Running   0          16m   10.244.87.25   anyu967node1   <none>           <none>
# Hot-reload the configuration:   curl -X POST http://10.244.87.25:9090/-/reload


3.3. Installing Grafana

k8s-for-prometheus-dashboard

[root@anyu967master1 prometheus]# kubectl apply -f grafana.yaml 
deployment.apps/monitoring-grafana created
service/monitoring-grafana created
[root@anyu967master1 prometheus]# kubectl get pods -n kube-system -o wide -l k8s-app=grafana
NAME                                 READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
monitoring-grafana-d47cdf886-jc48x   1/1     Running   0          93s   10.244.87.27   anyu967node1   <none>           <none>
[root@anyu967master1 prometheus]# kubectl get svc -n kube-system
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
kube-dns             ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   24d
metrics-server       ClusterIP   10.100.133.133   <none>        443/TCP                  18d
monitoring-grafana   NodePort    10.104.250.21    <none>        80:31019/TCP             4m12s   # http://192.168.56.130:31019/?orgId=1
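After Grafana is up, Prometheus must be added as a data source, either in the Grafana UI or declaratively via a provisioning file. A sketch of the latter (the in-cluster DNS name assumes the Service created by prometheus-svc.yaml is named prometheus in monitor-sa, as shown earlier):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml (illustrative)
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://prometheus.monitor-sa.svc:9090
  isDefault: true
```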


3.4. Installing kube-state-metrics

kube-state-metrics listens to the API server and generates state metrics for resource objects such as Deployments, Nodes, and Pods. Note that kube-state-metrics only exposes metrics; it does not store them — Prometheus can scrape and store them. It focuses on business-level metadata: Deployment and Pod replica status, for example — how many replicas were scheduled and how many are currently available, how many Pods are running/stopped/terminated, how many times a Pod has restarted, and how many Jobs are running.
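The "scheduled vs. available replicas" comparison above can be sketched by post-processing kube-state-metrics output. The metric names below are real kube-state-metrics metrics, but the sample text and values are made up for illustration:

```python
import re

# Illustrative kube-state-metrics exposition output (values are made up):
SAMPLE = """\
kube_deployment_spec_replicas{namespace="default",deployment="web"} 3
kube_deployment_status_replicas_available{namespace="default",deployment="web"} 2
"""

def series(text, metric):
    """Map frozenset of label pairs -> value, for one metric name."""
    out = {}
    for m in re.finditer(rf'^{metric}\{{(.*?)\}}\s+(\S+)$', text, re.M):
        labels = frozenset(re.findall(r'(\w+)="([^"]*)"', m.group(1)))
        out[labels] = float(m.group(2))
    return out

spec = series(SAMPLE, "kube_deployment_spec_replicas")
avail = series(SAMPLE, "kube_deployment_status_replicas_available")
# Unavailable replicas per Deployment — the kind of gap an alert rule keys on:
unavailable = {k: spec[k] - avail.get(k, 0.0) for k in spec}
```

In practice this subtraction would be done in PromQL rather than in client code; the sketch just shows what the raw data looks like.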

# kubectl config use-context kubernetes-admin@kubernetes   # switch the current context
# kubectl create sa kube-state-metrics
# kubectl create clusterrolebinding kube-state-metrics --clusterrole=kube-state-metrics --serviceaccount=kube-system:kube-state-metrics
[root@anyu967master1 prometheus]# kubectl apply -f kube-state-metrics-rabc.yaml
serviceaccount/kube-state-metrics unchanged
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics unchanged
[root@anyu967master1 prometheus]# kubectl apply -f kube-state-metrics-deploy.yaml 
deployment.apps/kube-state-metrics created
[root@anyu967master1 prometheus]# kubectl get pods -n kube-system -o wide -l app=kube-state-metrics
NAME                                  READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
kube-state-metrics-6989c984d6-gmjlk   1/1     Running   0          57s   10.244.87.32   anyu967node1   <none>           <none>
[root@anyu967master1 prometheus]# kubectl get svc -n kube-system
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
kube-dns             ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   25d
kube-state-metrics   ClusterIP   10.107.111.194   <none>        8080/TCP                 23s
metrics-server       ClusterIP   10.100.133.133   <none>        443/TCP                  19d
monitoring-grafana   NodePort    10.104.250.21    <none>        80:31019/TCP             25h

3.5. Alertmanager

Kubernetes Monitoring Handbook 07: Monitoring controller-manager
Monitoring the controller-manager && scheduler components with kube-prometheus
Monitoring kube-controller-manager and kube-scheduler with Prometheus

Kubernetes Monitoring Handbook 09: Monitoring etcd
Prometheus from Beginner to Practice

Alert-handling workflow:

1. The Prometheus server scrapes the HTTP endpoints exposed on the monitored hosts (call one of them endpoint A), collecting data at the interval set by 'scrape_interval' in the Prometheus configuration;
2. When endpoint A becomes unavailable, the server keeps trying to fetch data from it until 'scrape_timeout' elapses, at which point the target's state is marked DOWN;
3. Independently, Prometheus evaluates the alert rules at the interval set by 'evaluation_interval' (default 1m). When an evaluation finds endpoint A DOWN, i.e. up == 0 is true, the alert is activated, enters the 'PENDING' state, and its activation time is recorded;
4. At the next rule-evaluation cycle, if up == 0 is still true, Prometheus checks whether the alert has been active longer than the rule's 'for' duration. If not, it waits for the next cycle; if so, the alert transitions to 'FIRING' and the alert data is posted to the Alertmanager API;
5. When Alertmanager receives the alert, it assigns it to a group, waits for the configured 'group_wait' period, and only then sends the notification;
6. New alerts may join the same alert group while it is waiting. If a notification for the group has already been sent, the next one goes out after 'group_interval'. With email notifications, for example, alerts belonging to the same group are batched into a single email;
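The timing parameters in the steps above map directly onto configuration fields. A minimal sketch of the two pieces of config involved (the rule name, severity, and durations are illustrative assumptions, not taken from the original files):

```yaml
# Prometheus alert rule: up == 0 puts the alert into PENDING;
# once it has held for the "for" duration, it becomes FIRING.
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m              # how long up == 0 must persist before FIRING
    labels:
      severity: critical
---
# Alertmanager routing: grouping and pacing of notifications.
route:
  group_by: ['alertname']
  group_wait: 30s        # initial wait before a group's first notification
  group_interval: 5m     # wait before notifying about new alerts in the group
  repeat_interval: 4h    # re-send interval for alerts that keep firing
```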

What the Prometheus Operator can do

[root@anyu967master1 prometheus]# kubectl apply -f alertmanager-cm.yaml 
configmap/alertmanager created
[root@anyu967master1 prometheus]# kubectl get configmap -n monitor-sa
NAME                DATA   AGE
alertmanager        1      64s
kube-root-ca.crt    1      2d2h
prometheus-config   1      43h

# Fix the failure to scrape controller-manager metrics: they are served only over TLS, at https://192.168.56.129:10257/metrics
[root@anyu967master1 prometheus]# kubectl get ep -n kube-system kube-controller-manager -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: "2022-12-10T17:10:23Z"
  labels:
    k8s-app: kube-controller-manager
    service.kubernetes.io/headless: ""
  name: kube-controller-manager
  namespace: kube-system
  resourceVersion: "343633"
  uid: 9f07cc5b-435d-42fd-8ca5-b2d94c616e13
subsets:
- addresses:
  - ip: 192.168.56.129
    nodeName: anyu967master1
    targetRef:
      kind: Pod
      name: kube-controller-manager-anyu967master1
      namespace: kube-system
      uid: 9c212344-11a8-4aec-9920-d043a67fa076
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP

[root@anyu967master1 prometheus]# kubectl edit configmap kube-proxy -n kube-system
# In the embedded KubeProxyConfiguration, bind the metrics endpoint to all
# interfaces (the default 127.0.0.1:10249 is unreachable from Prometheus):
kind: KubeProxyConfiguration
metricsBindAddress: "0.0.0.0:10249"
[root@anyu967master1 prometheus]# kubectl get pods -n kube-system |grep kube-proxy |awk '{print $1}' |xargs kubectl delete pods -n kube-system
pod "kube-proxy-pntbq" deleted
pod "kube-proxy-z9562" deleted
[root@anyu967master1 prometheus]# netstat -anlp |grep :10249
tcp6       0      0 :::10249                :::*                    LISTEN      81735/kube-proxy
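With kube-proxy now listening on 0.0.0.0:10249 on every node, a scrape job for it can be added to prometheus.yml along these lines (a sketch; the job name is an assumption and the targets are the two node IPs from this lab):

```yaml
- job_name: kubernetes-kube-proxy
  static_configs:
  - targets: ['192.168.56.129:10249', '192.168.56.130:10249']
```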

[root@anyu967master1 prometheus]# curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://localhost:2379/metrics | head -n 6
[root@anyu967master1 prometheus]# kubectl describe pod etcd-anyu967master1 -n kube-system
[root@anyu967master1 prometheus]# kubectl get pod etcd-anyu967master1 -n kube-system -o yaml

[root@anyu967master1 prometheus]# kubectl -n monitor-sa create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/server.key --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/ca.crt
secret/etcd-certs created
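To let Prometheus scrape etcd with these certificates, the Secret has to be mounted into the Prometheus Pod and referenced from a TLS scrape job. A hedged sketch of both fragments (the mount path and job name are assumptions; the cert file names match the Secret created above):

```yaml
# In the Prometheus Deployment spec: mount the etcd-certs Secret.
volumeMounts:
- name: etcd-certs
  mountPath: /var/run/secrets/kubernetes.io/etcd-certs
volumes:
- name: etcd-certs
  secret:
    secretName: etcd-certs
---
# In prometheus.yml: scrape etcd over mutual TLS using the mounted certs,
# mirroring the curl command shown earlier.
- job_name: kubernetes-etcd
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/etcd-certs/ca.crt
    cert_file: /var/run/secrets/kubernetes.io/etcd-certs/server.crt
    key_file: /var/run/secrets/kubernetes.io/etcd-certs/server.key
  static_configs:
  - targets: ['192.168.56.129:2379']
```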



posted on 2023-04-22 16:30 by anyu967