Monitoring - Prometheus 09 - Monitoring Kubernetes
- Over the past few years cloud computing has become one of the hottest areas of distributed computing, and the growth of open-source projects such as Docker, Kubernetes, and Prometheus has greatly accelerated it.
- Kubernetes uses Docker for container management. If the combination of Docker and Kubernetes is the cornerstone of the cloud-native era, then Prometheus gives cloud native its wings. As the cloud-native community keeps growing and application scenarios become more complex, a complete and open monitoring platform built for cloud-native environments is needed. Prometheus emerged in exactly this context and supports Kubernetes natively.
- Deploying Prometheus the traditional way involves relatively complex steps. As the Operator pattern has matured, deploying Prometheus through the Operator is now the recommended approach: more of the operational work is folded into the Operator, which simplifies both the procedure and the deployment itself.
1. Introduction to the Prometheus Operator
- The Prometheus Operator for Kubernetes provides easy monitoring definitions for Kubernetes services and for deploying and managing Prometheus instances.
- The Prometheus Operator (hereafter simply "the Operator") provides the following features:
- Create/destroy: easily launch a Prometheus instance in a Kubernetes namespace, so a specific application or team can adopt the Operator with little effort.
- Simple configuration: configure the fundamentals of Prometheus, such as version, storage, retention policy, and replicas, through Kubernetes resources.
1.1. Prometheus Operator architecture
- The Prometheus Operator architecture is shown in Figure 11-1:

- The components in this architecture run in the Kubernetes cluster as Kubernetes custom resources, and each plays a different role (a minimal sketch of how they fit together follows this list).
- Operator: deploys and manages Prometheus Server according to custom resources (Custom Resource Definitions, CRDs), watches those custom resources for changes, and reacts accordingly; it is the control center of the whole system.
- Prometheus resource: declares the desired state of the Prometheus StatefulSet; the Prometheus Operator keeps the running StatefulSet consistent with the definition at all times.
- Prometheus Server: the Operator deploys the Prometheus Server cluster according to the Prometheus resource; these custom resources can be viewed as the StatefulSets used to manage the Prometheus Server cluster.
- Alertmanager resource: declares the desired state of the Alertmanager StatefulSet; the Prometheus Operator keeps the running StatefulSet consistent with the definition at all times.
- ServiceMonitor resource: declares the list of targets Prometheus monitors. It selects the corresponding Service Endpoints by label, and Prometheus Server scrapes metrics through the selected Services.
- Service: simply put, the object that Prometheus monitors.
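- As a rough sketch of how these resources fit together (the names and labels below are illustrative only, not taken from a real deployment): a Prometheus resource selects ServiceMonitors by label, and each ServiceMonitor selects the Services to scrape by label.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
  namespace: monitoring
spec:
  replicas: 1
  serviceAccountName: prometheus
  serviceMonitorSelector:   #only ServiceMonitors with these labels are turned into scrape configs
    matchLabels:
      team: frontend
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    team: frontend          #matched by the serviceMonitorSelector above
spec:
  selector:
    matchLabels:
      app: example-app      #matches the labels on the Service to be scraped
  endpoints:
  - port: web               #named port on the Service
    interval: 30s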
1.2. Custom resources of the Prometheus Operator
- The Prometheus Operator has four custom resources:
- Prometheus
- ServiceMonitor
- Alertmanager
- PrometheusRule
- List all the resources in the namespace
]# kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n monitoring
1. The Prometheus resource
- The Prometheus custom resource (CRD) declares the desired state of the Prometheus StatefulSet; the Prometheus Operator keeps the running StatefulSet consistent with the definition at all times. It includes options such as the number of replicas, persistent storage, and the Alertmanagers the Prometheus instances send alerts to.
- The Prometheus Operator generates a StatefulSet in the same namespace from the Prometheus resource. Every Prometheus Pod mounts a Secret named <prometheus-name> that holds the Prometheus configuration. The Operator generates that configuration from the referenced ServiceMonitors and keeps the Secret up to date: changes to ServiceMonitors or to the Prometheus resource are continuously reconciled through these same steps (see the command below for a quick way to inspect the result).
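- The generated configuration is stored in that Secret as a gzip-compressed file (prometheus.yaml.gz, which is also what the StatefulSet's --config-file argument points at). A quick way to inspect it, assuming the Secret name from the Helm-based deployment shown in this article:
//dump the Prometheus configuration the Operator generated from the ServiceMonitors
]# kubectl get secret prometheus-kube-prometheus-prometheus -n monitoring -o go-template='{{ index .data "prometheus.yaml.gz" }}' | base64 -d | gunzip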
- View the Prometheus resource (content from the Helm deployment of kube-prometheus)
]# kubectl edit -n monitoring prometheus.monitoring.coreos.com/kube-prometheus-prometheus
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
annotations:
meta.helm.sh/release-name: kube-prometheus
meta.helm.sh/release-namespace: monitoring
labels:
app.kubernetes.io/component: prometheus
app.kubernetes.io/instance: kube-prometheus
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: kube-prometheus
helm.sh/chart: kube-prometheus-8.1.11
name: kube-prometheus-prometheus
namespace: monitoring
spec:
affinity: #affinity rules
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/component: prometheus
app.kubernetes.io/instance: kube-prometheus
app.kubernetes.io/name: kube-prometheus
namespaces:
- monitoring
topologyKey: kubernetes.io/hostname
weight: 1
alerting: #the Alertmanager(s) this Prometheus sends alerts to
alertmanagers:
- name: kube-prometheus-alertmanager
namespace: monitoring
pathPrefix: /
port: http
containers: #containers
- name: prometheus #the prometheus container
livenessProbe: #liveness probe
failureThreshold: 10
httpGet:
path: /-/healthy
port: web
scheme: HTTP
initialDelaySeconds: 0
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 3
readinessProbe: #readiness probe
failureThreshold: 10
httpGet:
path: /-/ready
port: web
scheme: HTTP
initialDelaySeconds: 0
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 3
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: false
runAsNonRoot: true
startupProbe: #startup probe
failureThreshold: 60
httpGet:
path: /-/ready
port: web
scheme: HTTP
initialDelaySeconds: 0
periodSeconds: 15
successThreshold: 1
timeoutSeconds: 3
- name: config-reloader #the config-reloader container
livenessProbe:
failureThreshold: 6
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: reloader-web
timeoutSeconds: 5
readinessProbe:
failureThreshold: 6
initialDelaySeconds: 15
periodSeconds: 20
successThreshold: 1
tcpSocket:
port: reloader-web
timeoutSeconds: 5
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: false
runAsNonRoot: true
enableAdminAPI: false
evaluationInterval: 30s
externalUrl: http://127.0.0.1:9090/
image: docker.io/bitnami/prometheus:2.39.1-debian-11-r1 #image
listenLocal: false
logFormat: logfmt #log format
logLevel: info
paused: false
podMetadata:
labels:
app.kubernetes.io/component: prometheus
app.kubernetes.io/instance: kube-prometheus
app.kubernetes.io/name: kube-prometheus
podMonitorNamespaceSelector: {}
podMonitorSelector: {}
portName: web
probeNamespaceSelector: {}
probeSelector: {}
replicas: 1 #number of Prometheus replicas. Prometheus itself has no clustering; multiple replicas are just identical Prometheus instances run to avoid a single point of failure
retention: 10d
routePrefix: /
ruleNamespaceSelector: {}
ruleSelector: {} #which PrometheusRules this Prometheus loads, selected by label
scrapeInterval: 30s
securityContext:
fsGroup: 1001
runAsUser: 1001
serviceAccountName: kube-prometheus-prometheus
serviceMonitorNamespaceSelector: {} #which namespaces Prometheus looks in for ServiceMonitors, selected by namespace label; empty selects all
serviceMonitorSelector: {} #which ServiceMonitors Prometheus uses, selected by label; empty selects all
shards: 1
storage: #storage volume
volumeClaimTemplate:
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 8Gi
storageClassName: nfs-client
status:
availableReplicas: 1
conditions:
- lastTransitionTime: "2022-10-22T21:24:42Z"
status: "True"
type: Available
- lastTransitionTime: "2022-10-22T21:12:51Z"
status: "True"
type: Reconciled
paused: false
replicas: 1
shardStatuses:
- availableReplicas: 1
replicas: 1
shardID: "0"
unavailableReplicas: 0
updatedReplicas: 1
unavailableReplicas: 0
updatedReplicas: 1
- View the StatefulSet generated from the Prometheus resource (content from the Helm deployment of kube-prometheus)
]# kubectl edit -n monitoring statefulset.apps/prometheus-kube-prometheus-prometheus apiVersion: apps/v1 kind: StatefulSet metadata: annotations: meta.helm.sh/release-name: kube-prometheus meta.helm.sh/release-namespace: monitoring generation: 1 labels: app.kubernetes.io/component: prometheus app.kubernetes.io/instance: kube-prometheus app.kubernetes.io/managed-by: Helm app.kubernetes.io/name: kube-prometheus helm.sh/chart: kube-prometheus-8.1.11 operator.prometheus.io/name: kube-prometheus-prometheus operator.prometheus.io/shard: "0" name: prometheus-kube-prometheus-prometheus namespace: monitoring ownerReferences: - apiVersion: monitoring.coreos.com/v1 blockOwnerDeletion: true controller: true kind: Prometheus name: kube-prometheus-prometheus spec: podManagementPolicy: Parallel replicas: 1 revisionHistoryLimit: 10 selector: matchLabels: app.kubernetes.io/instance: kube-prometheus-prometheus app.kubernetes.io/managed-by: prometheus-operator app.kubernetes.io/name: prometheus operator.prometheus.io/name: kube-prometheus-prometheus operator.prometheus.io/shard: "0" prometheus: kube-prometheus-prometheus serviceName: prometheus-operated template: metadata: annotations: kubectl.kubernetes.io/default-container: prometheus creationTimestamp: null labels: app.kubernetes.io/component: prometheus app.kubernetes.io/instance: kube-prometheus-prometheus app.kubernetes.io/managed-by: prometheus-operator app.kubernetes.io/name: prometheus app.kubernetes.io/version: 2.39.0 operator.prometheus.io/name: kube-prometheus-prometheus operator.prometheus.io/shard: "0" prometheus: kube-prometheus-prometheus spec: affinity: podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: - podAffinityTerm: labelSelector: matchLabels: app.kubernetes.io/component: prometheus app.kubernetes.io/instance: kube-prometheus app.kubernetes.io/name: kube-prometheus namespaces: - monitoring topologyKey: kubernetes.io/hostname weight: 1 automountServiceAccountToken: true containers: - args: - --web.console.templates=/etc/prometheus/consoles - --web.console.libraries=/etc/prometheus/console_libraries - --storage.tsdb.retention.time=10d - --config.file=/etc/prometheus/config_out/prometheus.env.yaml - --storage.tsdb.path=/prometheus - --web.enable-lifecycle - --web.external-url=http://127.0.0.1:9090/ - --web.route-prefix=/ - --web.config.file=/etc/prometheus/web_config/web-config.yaml image: docker.io/bitnami/prometheus:2.39.1-debian-11-r1 imagePullPolicy: IfNotPresent livenessProbe: failureThreshold: 10 httpGet: path: /-/healthy port: web scheme: HTTP periodSeconds: 10 successThreshold: 1 timeoutSeconds: 3 name: prometheus ports: - containerPort: 9090 name: web protocol: TCP readinessProbe: failureThreshold: 10 httpGet: path: /-/ready port: web scheme: HTTP periodSeconds: 10 successThreshold: 1 timeoutSeconds: 3 resources: {} securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL readOnlyRootFilesystem: false runAsNonRoot: true startupProbe: failureThreshold: 60 httpGet: path: /-/ready port: web scheme: HTTP periodSeconds: 15 successThreshold: 1 timeoutSeconds: 3 terminationMessagePath: /dev/termination-log terminationMessagePolicy: FallbackToLogsOnError volumeMounts: - mountPath: /etc/prometheus/config_out name: config-out readOnly: true - mountPath: /etc/prometheus/certs name: tls-assets readOnly: true - mountPath: /prometheus name: prometheus-kube-prometheus-prometheus-db subPath: prometheus-db - mountPath: /etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0 name: 
prometheus-kube-prometheus-prometheus-rulefiles-0 - mountPath: /etc/prometheus/web_config/web-config.yaml name: web-config readOnly: true subPath: web-config.yaml - args: - --listen-address=:8080 - --reload-url=http://127.0.0.1:9090/-/reload - --config-file=/etc/prometheus/config/prometheus.yaml.gz - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml - --watched-dir=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0 command: - /bin/prometheus-config-reloader env: - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: SHARD value: "0" image: docker.io/bitnami/prometheus-operator:0.60.1-debian-11-r0 imagePullPolicy: IfNotPresent livenessProbe: failureThreshold: 6 initialDelaySeconds: 10 periodSeconds: 10 successThreshold: 1 tcpSocket: port: reloader-web timeoutSeconds: 5 name: config-reloader ports: - containerPort: 8080 name: reloader-web protocol: TCP readinessProbe: failureThreshold: 6 initialDelaySeconds: 15 periodSeconds: 20 successThreshold: 1 tcpSocket: port: reloader-web timeoutSeconds: 5 resources: limits: cpu: 100m memory: 50Mi requests: cpu: 100m memory: 50Mi securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL readOnlyRootFilesystem: false runAsNonRoot: true terminationMessagePath: /dev/termination-log terminationMessagePolicy: FallbackToLogsOnError volumeMounts: - mountPath: /etc/prometheus/config name: config - mountPath: /etc/prometheus/config_out name: config-out - mountPath: /etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0 name: prometheus-kube-prometheus-prometheus-rulefiles-0 dnsPolicy: ClusterFirst initContainers: - args: - --watch-interval=0 - --listen-address=:8080 - --config-file=/etc/prometheus/config/prometheus.yaml.gz - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml - --watched-dir=/etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0 command: - /bin/prometheus-config-reloader env: - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: SHARD value: "0" image: docker.io/bitnami/prometheus-operator:0.60.1-debian-11-r0 imagePullPolicy: IfNotPresent name: init-config-reloader ports: - containerPort: 8080 name: reloader-web protocol: TCP resources: limits: cpu: 100m memory: 50Mi requests: cpu: 100m memory: 50Mi securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL readOnlyRootFilesystem: true terminationMessagePath: /dev/termination-log terminationMessagePolicy: FallbackToLogsOnError volumeMounts: - mountPath: /etc/prometheus/config name: config - mountPath: /etc/prometheus/config_out name: config-out - mountPath: /etc/prometheus/rules/prometheus-kube-prometheus-prometheus-rulefiles-0 name: prometheus-kube-prometheus-prometheus-rulefiles-0 restartPolicy: Always schedulerName: default-scheduler securityContext: fsGroup: 1001 runAsUser: 1001 serviceAccount: kube-prometheus-prometheus serviceAccountName: kube-prometheus-prometheus terminationGracePeriodSeconds: 600 volumes: - name: config secret: defaultMode: 420 secretName: prometheus-kube-prometheus-prometheus - name: tls-assets projected: defaultMode: 420 sources: - secret: name: prometheus-kube-prometheus-prometheus-tls-assets-0 - emptyDir: {} name: config-out - configMap: defaultMode: 420 name: prometheus-kube-prometheus-prometheus-rulefiles-0 name: prometheus-kube-prometheus-prometheus-rulefiles-0 - name: web-config secret: defaultMode: 420 secretName: prometheus-kube-prometheus-prometheus-web-config 
updateStrategy: type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null name: prometheus-kube-prometheus-prometheus-db spec: accessModes: - ReadWriteOnce resources: requests: storage: 8Gi storageClassName: nfs-client volumeMode: Filesystem status: phase: Pending status: collisionCount: 0 currentReplicas: 1 currentRevision: prometheus-kube-prometheus-prometheus-8cbb4d97f observedGeneration: 1 readyReplicas: 1 replicas: 1 updateRevision: prometheus-kube-prometheus-prometheus-8cbb4d97f updatedReplicas: 1
2. The ServiceMonitor resource
- The ServiceMonitor custom resource (CRD) declares how to monitor a dynamic set of services. It selects the services (targets) to be monitored by label.
- For the Prometheus Operator to monitor an application in the Kubernetes cluster, an Endpoints object for it must exist.
- An Endpoints object is essentially a list of IP addresses.
- Endpoints objects are built from Services. A Service discovers Pods through its label selector and adds them to the Endpoints object.
- A Service can expose one or more ports, which are typically backed by multiple Endpoints entries, each pointing to a Pod.
- The Prometheus Operator introduces the ServiceMonitor object, which discovers those Endpoints objects and has Prometheus monitor the corresponding Pods.
- The endpoints section of ServiceMonitor.Spec configures which ports of those Endpoints are scraped for metrics and with which parameters.
- Note: endpoints (lowercase) is a field of the ServiceMonitor CRD, while Endpoints (capitalized) is a Kubernetes resource type.
- ServiceMonitors and the targets they discover can come from any namespace. This is important for cross-namespace monitoring, for example from the monitoring namespace.
- Use serviceMonitorNamespaceSelector under Prometheus.Spec to restrict, per Prometheus server, the namespaces from which ServiceMonitors are selected.
- Use namespaceSelector under ServiceMonitor.Spec to restrict the namespaces from which Endpoints objects may be discovered. To discover targets in all namespaces, namespaceSelector must be empty.
- View a ServiceMonitor resource (content from the Helm deployment of kube-prometheus)
]# kubectl edit -n monitoring servicemonitor.monitoring.coreos.com/kube-prometheus-node-exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
annotations:
meta.helm.sh/release-name: kube-prometheus
meta.helm.sh/release-namespace: monitoring
generation: 1
labels:
app.kubernetes.io/instance: kube-prometheus
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: node-exporter
helm.sh/chart: node-exporter-3.2.1
name: kube-prometheus-node-exporter
namespace: monitoring
spec:
endpoints:
- port: metrics
interval: 15s #how often to scrape the Endpoints
relabelings: #relabeling rules
- action: replace
regex: (.*)
replacement: $1
sourceLabels:
- __meta_kubernetes_pod_node_name
targetLabel: instance
jobLabel: jobLabel
namespaceSelector: #which namespaces the monitored Endpoints are discovered in (selected here by name)
matchNames:
- monitoring
selector: #which Endpoints to monitor, selected by the labels on their Service
matchLabels:
app.kubernetes.io/instance: kube-prometheus
app.kubernetes.io/name: node-exporter
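- For reference, the ServiceMonitor above matches a Service roughly like the following (a sketch of the node-exporter Service as the chart typically renders it; field values here are illustrative, check the real object with kubectl get svc -n monitoring kube-prometheus-node-exporter -o yaml):
apiVersion: v1
kind: Service
metadata:
  name: kube-prometheus-node-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/instance: kube-prometheus   #matched by the ServiceMonitor's selector
    app.kubernetes.io/name: node-exporter
spec:
  selector:
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/name: node-exporter
  ports:
  - name: metrics          #the named port referenced by endpoints.port in the ServiceMonitor
    port: 9100
    targetPort: metrics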
3. PrometheusRule
- The PrometheusRule CRD declares the Prometheus rules needed by one or more Prometheus instances.
- Alerting and recording rules can be saved and applied as YAML files and are loaded dynamically without restarting Prometheus.
- Example PrometheusRule resources: https://github.com/prometheus-operator/kube-prometheus/tree/main/manifests
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: node-exporter
app.kubernetes.io/part-of: kube-prometheus
prometheus: k8s
role: alert-rules
name: node-exporter-rules
namespace: monitoring
spec:
groups:
- name: node-exporter
rules:
- alert: NodeFilesystemSpaceFillingUp
annotations:
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available space left and is filling
up.
runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemspacefillingup
summary: Filesystem is predicted to run out of space within the next 24 hours.
expr: |
(
node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40
and
predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 1h
labels:
severity: warning
...
4. Alertmanager
- The Alertmanager resource declares the desired state of the Alertmanager StatefulSet; the Prometheus Operator keeps the running StatefulSet consistent with the definition at all times. It includes options such as the number of replicas and persistent storage.
- The Prometheus Operator generates a StatefulSet in the same namespace from the Alertmanager resource. Every Alertmanager Pod mounts a Secret named <alertmanager-name>.
- With two or more configured replicas, the Operator runs the Alertmanager instances in high-availability mode; a minimal sketch follows.
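- A minimal sketch of an Alertmanager resource running in high-availability mode (the name is illustrative; the Operator clusters the replicas automatically):
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example
  namespace: monitoring
spec:
  replicas: 3   #two or more replicas are run as an HA Alertmanager cluster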
- View the Alertmanager resource (content from the Helm deployment of kube-prometheus)
]# kubectl edit -n monitoring alertmanager.monitoring.coreos.com/kube-prometheus-alertmanager apiVersion: monitoring.coreos.com/v1 kind: Alertmanager metadata: annotations: meta.helm.sh/release-name: kube-prometheus meta.helm.sh/release-namespace: monitoring generation: 1 labels: app.kubernetes.io/component: alertmanager app.kubernetes.io/instance: kube-prometheus app.kubernetes.io/managed-by: Helm app.kubernetes.io/name: kube-prometheus helm.sh/chart: kube-prometheus-8.1.11 name: kube-prometheus-alertmanager namespace: monitoring spec: affinity: podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: - podAffinityTerm: labelSelector: matchLabels: app.kubernetes.io/component: alertmanager app.kubernetes.io/instance: kube-prometheus app.kubernetes.io/name: kube-prometheus namespaces: - monitoring topologyKey: kubernetes.io/hostname weight: 1 containers: - livenessProbe: failureThreshold: 120 httpGet: path: /-/healthy port: web scheme: HTTP initialDelaySeconds: 0 periodSeconds: 5 successThreshold: 1 timeoutSeconds: 3 name: alertmanager readinessProbe: failureThreshold: 120 httpGet: path: /-/ready port: web scheme: HTTP initialDelaySeconds: 0 periodSeconds: 5 successThreshold: 1 timeoutSeconds: 3 securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL readOnlyRootFilesystem: false runAsNonRoot: true - livenessProbe: failureThreshold: 6 initialDelaySeconds: 10 periodSeconds: 10 successThreshold: 1 tcpSocket: port: reloader-web timeoutSeconds: 5 name: config-reloader readinessProbe: failureThreshold: 6 initialDelaySeconds: 15 periodSeconds: 20 successThreshold: 1 tcpSocket: port: reloader-web timeoutSeconds: 5 securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL readOnlyRootFilesystem: false runAsNonRoot: true externalUrl: http://127.0.0.1:9093/ image: docker.io/bitnami/alertmanager:0.24.0-debian-11-r46 listenLocal: false logFormat: logfmt logLevel: info paused: false podMetadata: labels: app.kubernetes.io/component: alertmanager app.kubernetes.io/instance: kube-prometheus app.kubernetes.io/name: kube-prometheus portName: web replicas: 1 resources: {} retention: 120h routePrefix: / securityContext: fsGroup: 1001 runAsUser: 1001 serviceAccountName: kube-prometheus-alertmanager storage: #定义存储卷 volumeClaimTemplate: metadata: {} spec: accessModes: - ReadWriteOnce resources: requests: storage: 8Gi storageClassName: nfs-client
- View the StatefulSet generated from the Alertmanager resource (content from the Helm deployment of kube-prometheus)
]# kubectl edit -n monitoring statefulset.apps/alertmanager-kube-prometheus-alertmanager apiVersion: apps/v1 kind: StatefulSet metadata: annotations: meta.helm.sh/release-name: kube-prometheus meta.helm.sh/release-namespace: monitoring prometheus-operator-input-hash: "13509733468393518222" generation: 1 labels: app.kubernetes.io/component: alertmanager app.kubernetes.io/instance: kube-prometheus app.kubernetes.io/managed-by: Helm app.kubernetes.io/name: kube-prometheus helm.sh/chart: kube-prometheus-8.1.11 name: alertmanager-kube-prometheus-alertmanager namespace: monitoring ownerReferences: - apiVersion: monitoring.coreos.com/v1 blockOwnerDeletion: true controller: true kind: Alertmanager name: kube-prometheus-alertmanager spec: podManagementPolicy: Parallel replicas: 1 revisionHistoryLimit: 10 selector: matchLabels: alertmanager: kube-prometheus-alertmanager app.kubernetes.io/instance: kube-prometheus-alertmanager app.kubernetes.io/managed-by: prometheus-operator app.kubernetes.io/name: alertmanager serviceName: alertmanager-operated template: metadata: annotations: kubectl.kubernetes.io/default-container: alertmanager creationTimestamp: null labels: alertmanager: kube-prometheus-alertmanager app.kubernetes.io/component: alertmanager app.kubernetes.io/instance: kube-prometheus-alertmanager app.kubernetes.io/managed-by: prometheus-operator app.kubernetes.io/name: alertmanager app.kubernetes.io/version: 0.24.0 spec: affinity: podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: - podAffinityTerm: labelSelector: matchLabels: app.kubernetes.io/component: alertmanager app.kubernetes.io/instance: kube-prometheus app.kubernetes.io/name: kube-prometheus namespaces: - monitoring topologyKey: kubernetes.io/hostname weight: 1 containers: - args: - --config.file=/etc/alertmanager/config_out/alertmanager.env.yaml - --storage.path=/alertmanager - --data.retention=120h - --cluster.listen-address= - --web.listen-address=:9093 - --web.external-url=http://127.0.0.1:9093/ - --web.route-prefix=/ - --cluster.peer=alertmanager-kube-prometheus-alertmanager-0.alertmanager-operated:9094 - --cluster.reconnect-timeout=5m - --web.config.file=/etc/alertmanager/web_config/web-config.yaml env: - name: POD_IP valueFrom: fieldRef: apiVersion: v1 fieldPath: status.podIP image: docker.io/bitnami/alertmanager:0.24.0-debian-11-r46 imagePullPolicy: IfNotPresent livenessProbe: failureThreshold: 120 httpGet: path: /-/healthy port: web scheme: HTTP periodSeconds: 5 successThreshold: 1 timeoutSeconds: 3 name: alertmanager ports: - containerPort: 9093 name: web protocol: TCP - containerPort: 9094 name: mesh-tcp protocol: TCP - containerPort: 9094 name: mesh-udp protocol: UDP readinessProbe: failureThreshold: 120 httpGet: path: /-/ready port: web scheme: HTTP initialDelaySeconds: 3 periodSeconds: 5 successThreshold: 1 timeoutSeconds: 3 resources: requests: memory: 200Mi securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL readOnlyRootFilesystem: false runAsNonRoot: true terminationMessagePath: /dev/termination-log terminationMessagePolicy: FallbackToLogsOnError volumeMounts: - mountPath: /etc/alertmanager/config name: config-volume - mountPath: /etc/alertmanager/config_out name: config-out readOnly: true - mountPath: /etc/alertmanager/certs name: tls-assets readOnly: true - mountPath: /alertmanager name: alertmanager-kube-prometheus-alertmanager-db subPath: alertmanager-db - mountPath: /etc/alertmanager/web_config/web-config.yaml name: web-config readOnly: true subPath: web-config.yaml - args: - 
--listen-address=:8080 - --reload-url=http://127.0.0.1:9093/-/reload - --config-file=/etc/alertmanager/config/alertmanager.yaml.gz - --config-envsubst-file=/etc/alertmanager/config_out/alertmanager.env.yaml command: - /bin/prometheus-config-reloader env: - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name - name: SHARD value: "-1" image: docker.io/bitnami/prometheus-operator:0.60.1-debian-11-r0 imagePullPolicy: IfNotPresent livenessProbe: failureThreshold: 6 initialDelaySeconds: 10 periodSeconds: 10 successThreshold: 1 tcpSocket: port: reloader-web timeoutSeconds: 5 name: config-reloader ports: - containerPort: 8080 name: reloader-web protocol: TCP readinessProbe: failureThreshold: 6 initialDelaySeconds: 15 periodSeconds: 20 successThreshold: 1 tcpSocket: port: reloader-web timeoutSeconds: 5 resources: limits: cpu: 100m memory: 50Mi requests: cpu: 100m memory: 50Mi securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL readOnlyRootFilesystem: false runAsNonRoot: true terminationMessagePath: /dev/termination-log terminationMessagePolicy: FallbackToLogsOnError volumeMounts: - mountPath: /etc/alertmanager/config name: config-volume readOnly: true - mountPath: /etc/alertmanager/config_out name: config-out dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: fsGroup: 1001 runAsUser: 1001 serviceAccount: kube-prometheus-alertmanager serviceAccountName: kube-prometheus-alertmanager terminationGracePeriodSeconds: 120 volumes: - name: config-volume secret: defaultMode: 420 secretName: alertmanager-kube-prometheus-alertmanager-generated - name: tls-assets projected: defaultMode: 420 sources: - secret: name: alertmanager-kube-prometheus-alertmanager-tls-assets-0 - emptyDir: {} name: config-out - name: web-config secret: defaultMode: 420 secretName: alertmanager-kube-prometheus-alertmanager-web-config updateStrategy: type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: creationTimestamp: null name: alertmanager-kube-prometheus-alertmanager-db spec: accessModes: - ReadWriteOnce resources: requests: storage: 8Gi storageClassName: nfs-client volumeMode: Filesystem status: phase: Pending status: collisionCount: 0 currentReplicas: 1 currentRevision: alertmanager-kube-prometheus-alertmanager-b74c5965d observedGeneration: 1 readyReplicas: 1 replicas: 1 updateRevision: alertmanager-kube-prometheus-alertmanager-b74c5965d updatedReplicas: 1
2. Deploying kube-prometheus with Helm
- The Prometheus deployment environment is as follows:
- Kubernetes version v1.20.14.
- Helm version v3.8.2.
- kube-prometheus chart version bitnami/kube-prometheus:8.1.11.
2.1. Creating a dynamic storage volume
- Create the dynamic storage volume.
- See section "6.2 Dynamic storage volumes" at https://www.cnblogs.com/maiblogs/p/16392831.html; only the steps up to "Create the NFS StorageClass" are needed (a sketch of the resulting StorageClass follows).
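- For reference, the StorageClass that the values.yaml below refers to as nfs-client looks roughly like this (the provisioner name depends on the NFS provisioner you deployed, for example the nfs-subdir-external-provisioner; this is a sketch, not the exact manifest from the linked article):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: k8s-sigs.io/nfs-subdir-external-provisioner   #must match your NFS provisioner deployment
parameters:
  archiveOnDelete: "false"   #do not keep an archive copy when a PVC is deleted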
2.2. Deploying kube-prometheus
- kube-prometheus:8.1.11 automatically installs the following components:
- prometheus-operator
- prometheus
- kube-state-metrics
- node-exporter
- blackbox-exporter
- alertmanager
1. Create the namespace
]# kubectl create namespace monitoring
2. Download the kube-prometheus chart
]# helm repo add bitnami https://charts.bitnami.com/bitnami
]# helm search repo prometheus
]# helm pull bitnami/kube-prometheus
3. Modify values.yaml
//unpack the chart
]# tar zvfx kube-prometheus-8.1.11.tgz
//edit values.yaml
]# vim ./kube-prometheus/values.yaml
prometheus:
ingress:
enabled: true
hostname:
annotations: {kubernetes.io/ingress.class: "nginx"}
extraRules:
- host: prometheus.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: kube-prometheus-prometheus
port:
number: 9090
externalUrl: "http://127.0.0.1:9090/"
persistence:
enabled: true
storageClass: "nfs-client"
alertmanager:
ingress:
enabled: true
hostname:
annotations: {kubernetes.io/ingress.class: "nginx"}
extraRules:
- host: alertmanager.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: kube-prometheus-alertmanager
port:
number: 9093
externalUrl: "http://127.0.0.1:9093/"
persistence:
enabled: true
storageClass: "nfs-client"
- The modified values.yaml file.
4. Install kube-prometheus
]# helm install kube-prometheus kube-prometheus/ -n monitoring
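//verify that the components listed in 2.2 come up before moving on
]# kubectl get pods -n monitoring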
5. Access Prometheus and Alertmanager
//edit the hosts file (C:\Windows\System32\drivers\etc\hosts)
10.1.1.11 prometheus.local alertmanager.local
- Open http://prometheus.local:32080/ to access Prometheus.

- Open http://alertmanager.local:32080/ to access Alertmanager.

2.3. Setting up alerting
2.3.1. Configuring Alertmanager
1. View the Alertmanager configuration file
]# kubectl exec alertmanager-kube-prometheus-alertmanager-0 -n monitoring -- cat /etc/alertmanager/config_out/alertmanager.env.yaml
]# kubectl get secret alertmanager-kube-prometheus-alertmanager -n monitoring -o go-template='{{ index .data "alertmanager.yaml" }}' | base64 -d
global:
resolve_timeout: 5m
receivers:
- name: "null"
route:
group_by:
- job
group_interval: 5m
group_wait: 30s
receiver: "null"
repeat_interval: 12h
routes:
- match:
alertname: Watchdog
receiver: "null"
2. Modify the Alertmanager configuration file
- Create alertmanager.yaml.
- Note that this alertmanager.yaml has two extra top-level keys, alertmanager and config, because it is passed to Helm as a values file.
- Note that if route has no child routes, routes: [] must be set explicitly.
- Otherwise the config reloader logs an error such as: level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="undefined receiver \"null\" used in route"
- Note that the Pod mounts the storage volume at /alertmanager/.
]# vim alertmanager.yaml
alertmanager:
config:
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.qq.com:465'
smtp_from: 'xxx@qq.com'
smtp_auth_username: 'xxx@qq.com'
smtp_auth_password: 'xxx'
smtp_require_tls: false
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 10s
receiver: 'email'
routes: []
receivers:
- name: 'email'
email_configs:
- to: 'xxx@xxx.com.cn'
templates:
- '/alertmanager/template.tmpl'
3. Put the alert template template.tmpl on the storage volume (28800e9 nanoseconds = 8 hours, shifting the UTC timestamp to UTC+8)
]# vim /data1/monitoring-alertmanager-kube-prometheus-alertmanager-db-alertmanager-kube-prometheus-alertmanager-0-pvc-85d5342a-9f8c-41c3-95fe-f11c77579b0c/alertmanager-db/template.tmpl
{{ define "__subject" }}
{{ if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts }}
{{ .Labels.alertname }}{{ .Annotations.title }}
{{ end }}{{ end }}{{ end }}
{{ define "email.default.html" }}
{{ range .Alerts }}
Alert name: {{ .Annotations.title }} <br>
Severity: {{ .Labels.severity }} <br>
Host: {{ .Labels.instance }} <br>
Details: {{ .Annotations.description }} <br>
Team: {{ .Labels.team }} <br>
Time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end }}
4. Rolling-update kube-prometheus
- Apply the alertmanager.yaml values file
]# helm upgrade kube-prometheus kube-prometheus/ --values=alertmanager.yaml -n monitoring
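- To confirm the new configuration has been rendered, dump the generated Secret again (the same command as in step 1):
]# kubectl get secret alertmanager-kube-prometheus-alertmanager -n monitoring -o go-template='{{ index .data "alertmanager.yaml" }}' | base64 -d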
2.3.2. Creating alert rules
1. Create a PrometheusRule resource
]# vim node-exporter-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
app.kubernetes.io/component: exporter
app.kubernetes.io/name: node-exporter
app.kubernetes.io/part-of: kube-prometheus
prometheus: k8s
role: alert-rules
name: node-exporter-rules
namespace: monitoring
spec:
groups:
- name: node-exporter
rules:
- alert: NodeFilesystemSpaceFillingUp
expr: up == 0
for: 10s
labels:
severity: "告警级别critical"
team: "维护团队OPS"
annotations:
title: "告警名称Instance {{ $labels.instance }} down"
description: "告警信息{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 3 minutes."
2. Apply the PrometheusRule resource
]# kubectl apply -f node-exporter-rules.yaml
//list the PrometheusRule resources
]# kubectl get prometheusrule -A
NAMESPACE    NAME                  AGE
monitoring   node-exporter-rules   39s
2.4. Deploying Grafana
1. Download the Grafana chart
]# helm repo add bitnami https://charts.bitnami.com/bitnami
]# helm search repo grafana
]# helm pull bitnami/grafana
2. Modify values.yaml
//unpack the chart
]# tar zvfx grafana-8.2.12.tgz
//edit values.yaml
]# vim grafana/values.yaml
admin:
password: "admin"
persistence:
storageClass: "nfs-client"
ingress:
enabled: true
hostname:
annotations: {kubernetes.io/ingress.class: "nginx"}
extraRules:
- host: grafana.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: grafana
port:
number: 3000
3. Install Grafana
]# helm install grafana grafana/ -n monitoring
4. Access Grafana
//edit the hosts file (C:\Windows\System32\drivers\etc\hosts)
10.1.1.11 prometheus.local alertmanager.local grafana.local
- Open http://grafana.local:32080/ to access Grafana.
5. Add a data source
- In Kubernetes, services inside the cluster can reach one another through internal DNS names of the form Service_Name.Namespace_Name.svc.cluster.local.
- Services in the same namespace can also reach each other directly by Service_Name, so the Prometheus data source URL can be built as shown below.
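- For the Prometheus deployed above, the data source URL is therefore (kube-prometheus-prometheus is the Service name used in the values.yaml in 2.2, and 9090 is the Prometheus web port):
http://kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
- Since Grafana also runs in the monitoring namespace, http://kube-prometheus-prometheus:9090 works as well.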

3. Deploying kube-prometheus from GitHub
- Note that this is only a quick-start installation; no persistent volumes are used.
- The Prometheus deployment environment is as follows:
- Kubernetes version v1.20.14.
- kube-prometheus version 0.8.0.
- Compatibility between kube-prometheus and Kubernetes:

1. Download kube-prometheus
]# wget https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v0.8.0.tar.gz
2. Quick deployment of kube-prometheus
]# tar zvfx v0.8.0.tar.gz
]# cd kube-prometheus-0.8.0/
//replace k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0 with bitnami/kube-state-metrics:2.0.0
]# vim ./manifests/kube-state-metrics-deployment.yaml
//first create the namespace, CRDs, prometheus-operator, etc.
]# kubectl create -f manifests/setup
//then deploy all the components
]# kubectl create -f manifests/
- If Prometheus has been deployed before, clean up any possible leftovers first
]# cd kube-prometheus-0.8.0/
]# kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup
- View the related resources
//list the Pods
]# kubectl get pods -A
NAMESPACE    NAME                                   READY   STATUS    RESTARTS   AGE
monitoring   alertmanager-main-0                    2/2     Running   0          31s
monitoring   alertmanager-main-1                    2/2     Running   0          31s
monitoring   alertmanager-main-2                    2/2     Running   0          31s
monitoring   blackbox-exporter-55c457d5fb-4jvmm     3/3     Running   0          30s
monitoring   grafana-9df57cdc4-l7gxk                1/1     Running   0          29s
monitoring   kube-state-metrics-6cb48468f8-dbdnc    3/3     Running   0          29s
monitoring   node-exporter-6svtr                    2/2     Running   0          29s
monitoring   node-exporter-hpfw9                    2/2     Running   0          29s
monitoring   node-exporter-jksr2                    2/2     Running   0          29s
monitoring   prometheus-adapter-59df95d9f5-rxzdg    1/1     Running   0          29s
monitoring   prometheus-adapter-59df95d9f5-zs46x    1/1     Running   0          29s
monitoring   prometheus-k8s-0                       2/2     Running   1          29s
monitoring   prometheus-k8s-1                       2/2     Running   1          29s
monitoring   prometheus-operator-7775c66ccf-bfxd9   2/2     Running   0          29m
...
//list the Services
]# kubectl get svc -A
NAMESPACE    NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
monitoring   alertmanager-main       ClusterIP   10.20.24.158    <none>        9093/TCP                     64s
monitoring   alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   64s
monitoring   blackbox-exporter       ClusterIP   10.20.248.189   <none>        9115/TCP,19115/TCP           64s
monitoring   grafana                 ClusterIP   10.20.214.103   <none>        3000/TCP                     63s
monitoring   kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP            63s
monitoring   node-exporter           ClusterIP   None            <none>        9100/TCP                     63s
monitoring   prometheus-adapter      ClusterIP   10.20.74.223    <none>        443/TCP                      63s
monitoring   prometheus-k8s          ClusterIP   10.20.73.57     <none>        9090/TCP                     63s
monitoring   prometheus-operated     ClusterIP   None            <none>        9090/TCP                     62s
monitoring   prometheus-operator     ClusterIP   None            <none>        8443/TCP                     30m
...
//list the Deployments
]# kubectl get deployment -A
NAMESPACE    NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
monitoring   blackbox-exporter     1/1     1            1           109s
monitoring   grafana               1/1     1            1           108s
monitoring   kube-state-metrics    1/1     1            1           108s
monitoring   prometheus-adapter    2/2     2            2           108s
monitoring   prometheus-operator   1/1     1            1           30m
...
//list the StatefulSets
]# kubectl get sts -A
NAMESPACE    NAME                READY   AGE
monitoring   alertmanager-main   3/3     2m1s
monitoring   prometheus-k8s      2/2     119s
...
3. Access the services
- Access Prometheus
- http://10.1.1.11:19090/
//listen on 10.1.1.11:19090 and forward requests to port 9090 of the Pods behind the Service
]# kubectl port-forward svc/prometheus-k8s --address=10.1.1.11 19090:9090 -n monitoring
- Access Grafana
- http://10.1.1.11:13000/ (admin:admin)
//listen on 10.1.1.11:13000 and forward requests to port 3000 of the Pods behind the Service
]# kubectl port-forward svc/grafana --address=10.1.1.11 13000:3000 -n monitoring
- Access Alertmanager
- http://10.1.1.11:19093/
//listen on 10.1.1.11:19093 and forward requests to port 9093 of the Pods behind the Service
]# kubectl port-forward svc/alertmanager-main --address=10.1.1.11 19093:9093 -n monitoring
4. Create Ingress rules
]# vim prometheus-ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: prometheus-ingress
namespace: monitoring
annotations:
kubernetes.io/ingress.class: "nginx"
spec:
rules:
- host: prometheus.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus-k8s
port:
number: 9090
- host: alertmanager.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: alertmanager-main
port:
number: 9093
- host: grafana.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: grafana
port:
number: 3000
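- Apply the Ingress rules (assuming the manifest above was saved as prometheus-ingress):
]# kubectl apply -f prometheus-ingress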