Installing prometheus-operator Monitoring on Kubernetes 1.13.1
I. Introduction to Prometheus
Prometheus is an open-source monitoring tool aimed at cloud-native applications. As the first monitoring project to graduate from the CNCF, it carries high expectations from developers, and much of the Kubernetes community regards it as the first-choice solution for container monitoring and the de facto standard in this space. This article walks through how to quickly deploy a monitoring solution for Kubernetes.
II. Installation Steps
1. Contents of /app/prometheus-operator/alertmanager.yaml. This file configures the sender and recipients of alert emails:
```yaml
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_hello: 'smtp.exmail.qq.com:25'
  smtp_from: 'lihaichun@netschina.com'
  smtp_smarthost: 'smtp.exmail.qq.com:25'
  smtp_auth_username: 'lihaichun@netschina.com'
  smtp_auth_password: 'XXXX'
  smtp_require_tls: false
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s
  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 30s
  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  #repeat_interval: 20s
  repeat_interval: 12h
  # A default receiver
  # If an alert isn't caught by a route, send it to default.
  receiver: default
  # All the above attributes are inherited by all child routes and can
  # overwritten on each.
  # The child route trees.
  routes:
  - match:
      severity: critical
    receiver: email_alert
receivers:
- name: 'default'
  email_configs:
  - to: 'lihaichun@zhixueyun.com,zhujun@zhixueyun.com,ouyangluping@zhixueyun.com,tangjie@zhixueyun.com'
    send_resolved: true
- name: 'email_alert'
  email_configs:
  - to: 'lihaichun@zhixueyun.com,zhujun@zhixueyun.com,ouyangluping@zhixueyun.com,tangjie@zhixueyun.com'
    send_resolved: true
templates: []
```
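Before this file is packaged into the alertmanager-main secret (step 5 below), its syntax can optionally be checked locally with amtool, the CLI that ships with Alertmanager:

```sh
# Validate the Alertmanager configuration syntax locally
amtool check-config /app/prometheus-operator/alertmanager.yaml
```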
2. Contents of /app/prometheus-operator/bundle.yaml:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-operator
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - '*'
- apiGroups:
  - monitoring.coreos.com
  resources:
  - alertmanagers
  - prometheuses
  - prometheuses/finalizers
  - alertmanagers/finalizers
  - servicemonitors
  - prometheusrules
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - delete
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  verbs:
  - get
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    k8s-app: prometheus-operator
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: prometheus-operator
  template:
    metadata:
      labels:
        k8s-app: prometheus-operator
    spec:
      containers:
      - args:
        - --kubelet-service=kube-system/kubelet
        - --logtostderr=true
        - --config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1
        - --prometheus-config-reloader=quay.io/coreos/prometheus-config-reloader:v0.27.0
        image: quay.io/coreos/prometheus-operator:v0.27.0
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
      nodeSelector:
        beta.kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-operator
  namespace: monitoring
```
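After this bundle has been applied (step 5 below), a quick sanity check that the operator is running and has registered its CRDs; both are plain kubectl queries, nothing specific to this setup:

```sh
# The operator Deployment should report 1/1 available
kubectl -n monitoring get deploy prometheus-operator
# The monitoring.coreos.com CRDs (prometheuses, alertmanagers, ...) should be listed
kubectl get crd | grep monitoring.coreos.com
```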
3. Contents of the /app/prometheus-operator/manifests directory:
```
[root@iZbp1at8fph52evh70atb1Z prometheus-operator]# pwd
/app/prometheus-operator
[root@iZbp1at8fph52evh70atb1Z prometheus-operator]# ls
alertmanager.yaml  bundle.yaml  manifests
[root@iZbp1at8fph52evh70atb1Z prometheus-operator]# cd manifests/
[root@iZbp1at8fph52evh70atb1Z manifests]# ls
alertmanager-alertmanager.yaml
alertmanager-service.yaml
alertmanager-serviceAccount.yaml
alertmanager-serviceMonitor.yaml
grafana-dashboardDatasources.yaml
grafana-dashboardDefinitions.yaml
grafana-dashboardSources.yaml
grafana-deployment.yaml
grafana-service.yaml
grafana-serviceAccount.yaml
kube-state-metrics-clusterRole.yaml
kube-state-metrics-clusterRoleBinding.yaml
kube-state-metrics-deployment.yaml
kube-state-metrics-role.yaml
kube-state-metrics-roleBinding.yaml
kube-state-metrics-service.yaml
kube-state-metrics-serviceAccount.yaml
kube-state-metrics-serviceMonitor.yaml
node-exporter-clusterRole.yaml
node-exporter-clusterRoleBinding.yaml
node-exporter-daemonset.yaml
node-exporter-service.yaml
node-exporter-serviceAccount.yaml
node-exporter-serviceMonitor.yaml
prometheus-adapter-clusterRole.yaml
prometheus-adapter-clusterRoleBinding.yaml
prometheus-adapter-clusterRoleBindingDelegator.yaml
prometheus-adapter-clusterRoleServerResources.yaml
prometheus-adapter-configMap.yaml
prometheus-adapter-deployment.yaml
prometheus-adapter-roleBindingAuthReader.yaml
prometheus-adapter-service.yaml
prometheus-adapter-serviceAccount.yaml
prometheus-clusterRole.yaml
prometheus-clusterRoleBinding.yaml
prometheus-prometheus.yaml
prometheus-roleBindingConfig.yaml
prometheus-roleBindingSpecificNamespaces.yaml
prometheus-roleConfig.yaml
prometheus-roleSpecificNamespaces.yaml
prometheus-rules.yaml
prometheus-service.yaml
prometheus-serviceAccount.yaml
prometheus-serviceMonitor.yaml
prometheus-serviceMonitorApiserver.yaml
prometheus-serviceMonitorCoreDNS.yaml
prometheus-serviceMonitorKubeControllerManager.yaml
prometheus-serviceMonitorKubeScheduler.yaml
prometheus-serviceMonitorKubelet.yaml
```
4. Contents of /app/prometheus-operator/manifests/prometheus-rules.yaml:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: k8s.rules
    rules:
    - expr: |
        sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
      record: namespace:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum by (namespace, pod_name, container_name) (
          rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])
        )
      record: namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum(container_memory_usage_bytes{job="kubelet", image!="", container_name!=""}) by (namespace)
      record: namespace:container_memory_usage_bytes:sum
    - expr: |
        sum by (namespace, label_name) (
          sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace, pod_name)
          * on (namespace, pod_name) group_left(label_name)
          label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
        )
      record: namespace_name:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum by (namespace, label_name) (
          sum(container_memory_usage_bytes{job="kubelet",image!="", container_name!=""}) by (pod_name, namespace)
          * on (namespace, pod_name) group_left(label_name)
          label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
        )
      record: namespace_name:container_memory_usage_bytes:sum
    - expr: |
        sum by (namespace, label_name) (
          sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) by (namespace, pod)
          * on (namespace, pod) group_left(label_name)
          label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
        )
      record: namespace_name:kube_pod_container_resource_requests_memory_bytes:sum
    - expr: |
        sum by (namespace, label_name) (
          sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} and on(pod) kube_pod_status_scheduled{condition="true"}) by (namespace, pod)
          * on (namespace, pod) group_left(label_name)
          label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
        )
      record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum
  - name: kube-scheduler.rules
    rules:
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_binding_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_binding_latency:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_binding_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_binding_latency:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_binding_latency_microseconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_binding_latency:histogram_quantile
  - name: kube-apiserver.rules
    rules:
    - expr: |
        histogram_quantile(0.99, sum(rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.99"
      record: cluster_quantile:apiserver_request_latencies:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.9"
      record: cluster_quantile:apiserver_request_latencies:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(apiserver_request_latencies_bucket{job="apiserver"}[5m])) without(instance, pod)) / 1e+06
      labels:
        quantile: "0.5"
      record: cluster_quantile:apiserver_request_latencies:histogram_quantile
  - name: node.rules
    rules:
    - expr: sum(min(kube_pod_info) by (node))
      record: ':kube_pod_info_node_count:'
    - expr: |
        max(label_replace(kube_pod_info{job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
      record: 'node_namespace_pod:kube_pod_info:'
    - expr: |
        count by (node) (sum by (node, cpu) (
          node_cpu_seconds_total{job="node-exporter"}
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        ))
      record: node:node_num_cpu:sum
    - expr: |
        1 - avg(rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m]))
      record: :node_cpu_utilisation:avg1m
    - expr: |
        1 - avg by (node) (
          rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:)
      record: node:node_cpu_utilisation:avg1m
    - expr: |
        sum(node_load1{job="node-exporter"})
        /
        sum(node:node_num_cpu:sum)
      record: ':node_cpu_saturation_load1:'
    - expr: |
        sum by (node) (
          node_load1{job="node-exporter"}
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
        /
        node:node_num_cpu:sum
      record: 'node:node_cpu_saturation_load1:'
    - expr: |
        1 -
        sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"})
        /
        sum(node_memory_MemTotal_bytes{job="node-exporter"})
      record: ':node_memory_utilisation:'
    - expr: |
        sum(node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"})
      record: :node_memory_MemFreeCachedBuffers_bytes:sum
    - expr: |
        sum(node_memory_MemTotal_bytes{job="node-exporter"})
      record: :node_memory_MemTotal_bytes:sum
    - expr: |
        sum by (node) (
          (node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"})
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_memory_bytes_available:sum
    - expr: |
        sum by (node) (
          node_memory_MemTotal_bytes{job="node-exporter"}
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_memory_bytes_total:sum
    - expr: |
        (node:node_memory_bytes_total:sum - node:node_memory_bytes_available:sum)
        /
        scalar(sum(node:node_memory_bytes_total:sum))
      record: node:node_memory_utilisation:ratio
    - expr: |
        1e3 * sum(
          (rate(node_vmstat_pgpgin{job="node-exporter"}[1m])
          + rate(node_vmstat_pgpgout{job="node-exporter"}[1m]))
        )
      record: :node_memory_swap_io_bytes:sum_rate
    - expr: |
        1 -
        sum by (node) (
          (node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_Buffers_bytes{job="node-exporter"})
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
        /
        sum by (node) (
          node_memory_MemTotal_bytes{job="node-exporter"}
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: 'node:node_memory_utilisation:'
    - expr: |
        1 - (node:node_memory_bytes_available:sum / node:node_memory_bytes_total:sum)
      record: 'node:node_memory_utilisation_2:'
    - expr: |
        1e3 * sum by (node) (
          (rate(node_vmstat_pgpgin{job="node-exporter"}[1m])
          + rate(node_vmstat_pgpgout{job="node-exporter"}[1m]))
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_memory_swap_io_bytes:sum_rate
    - expr: |
        avg(irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]))
      record: :node_disk_utilisation:avg_irate
    - expr: |
        avg by (node) (
          irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m])
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_disk_utilisation:avg_irate
    - expr: |
        avg(irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3)
      record: :node_disk_saturation:avg_irate
    - expr: |
        avg by (node) (
          irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_disk_saturation:avg_irate
    - expr: |
        max by (namespace, pod, device) ((node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} - node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}) / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
      record: 'node:node_filesystem_usage:'
    - expr: |
        max by (namespace, pod, device) (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
      record: 'node:node_filesystem_avail:'
    - expr: |
        sum(irate(node_network_receive_bytes_total{job="node-exporter",device="eth0"}[1m])) +
        sum(irate(node_network_transmit_bytes_total{job="node-exporter",device="eth0"}[1m]))
      record: :node_net_utilisation:sum_irate
    - expr: |
        sum by (node) (
          (irate(node_network_receive_bytes_total{job="node-exporter",device="eth0"}[1m]) +
          irate(node_network_transmit_bytes_total{job="node-exporter",device="eth0"}[1m]))
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_net_utilisation:sum_irate
    - expr: |
        sum(irate(node_network_receive_drop_total{job="node-exporter",device="eth0"}[1m])) +
        sum(irate(node_network_transmit_drop_total{job="node-exporter",device="eth0"}[1m]))
      record: :node_net_saturation:sum_irate
    - expr: |
        sum by (node) (
          (irate(node_network_receive_drop_total{job="node-exporter",device="eth0"}[1m]) +
          irate(node_network_transmit_drop_total{job="node-exporter",device="eth0"}[1m]))
          * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_net_saturation:sum_irate
  - name: kube-prometheus-node-recording.rules
    rules:
    - expr: sum(rate(node_cpu{mode!="idle",mode!="iowait"}[3m])) BY (instance)
      record: instance:node_cpu:rate:sum
    - expr: sum((node_filesystem_size{mountpoint="/"} - node_filesystem_free{mountpoint="/"})) BY (instance)
      record: instance:node_filesystem_usage:sum
    - expr: sum(rate(node_network_receive_bytes[3m])) BY (instance)
      record: instance:node_network_receive_bytes:rate:sum
    - expr: sum(rate(node_network_transmit_bytes[3m])) BY (instance)
      record: instance:node_network_transmit_bytes:rate:sum
    - expr: sum(rate(node_cpu{mode!="idle",mode!="iowait"}[5m])) WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu) BY (instance, cpu)) BY (instance)
      record: instance:node_cpu:ratio
    - expr: sum(rate(node_cpu{mode!="idle",mode!="iowait"}[5m]))
      record: cluster:node_cpu:sum_rate5m
    - expr: cluster:node_cpu:rate5m / count(sum(node_cpu) BY (instance, cpu))
      record: cluster:node_cpu:ratio
  - name: kubernetes-absent
    rules:
    - alert: AlertmanagerDown
      annotations:
        message: k8s-master-10.80.154.143 Alertmanager has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-alertmanagerdown
      expr: |
        absent(up{job="alertmanager-main"} == 1)
      for: 1m
      labels:
        severity: critical
    - alert: KubeAPIDown
      annotations:
        message: k8s-master-10.80.154.143 KubeAPI has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown
      expr: |
        absent(up{job="apiserver"} == 1)
      for: 1m
      labels:
        severity: critical
    - alert: KubeStateMetricsDown
      annotations:
        message: k8s-master-10.80.154.143 KubeStateMetrics has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatemetricsdown
      expr: |
        absent(up{job="kube-state-metrics"} == 1)
      for: 1m
      labels:
        severity: critical
    - alert: KubeletDown
      annotations:
        message: k8s-master-10.80.154.143 Kubelet has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown
      expr: |
        absent(up{job="kubelet"} == 1)
      for: 1m
      labels:
        severity: critical
    - alert: NodeExporterDown
      annotations:
        message: k8s-master-10.80.154.143 NodeExporter has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeexporterdown
      expr: |
        absent(up{job="node-exporter"} == 1)
      for: 1m
      labels:
        severity: critical
    - alert: PrometheusDown
      annotations:
        message: k8s-master-10.80.154.143 Prometheus has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusdown
      expr: |
        absent(up{job="prometheus-k8s"} == 1)
      for: 1m
      labels:
        severity: critical
  - name: kubernetes-apps
    rules:
    - alert: KubePodCrashLooping
      annotations:
        message: k8s-master-10.80.154.143 Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping
      expr: |
        rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0
      for: 1m
      labels:
        severity: critical
    - alert: KubePodNotReady
      annotations:
        message: k8s-master-10.80.154.143 Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready
      expr: |
        sum by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}) > 0
      for: 1m
      labels:
        severity: critical
    - alert: KubeDeploymentGenerationMismatch
      annotations:
        message: k8s-master-10.80.154.143 Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch
      expr: |
        kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"}
      for: 1m
      labels:
        severity: critical
    - alert: KubeDeploymentReplicasMismatch
      annotations:
        message: k8s-master-10.80.154.143 Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than an hour.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch
      expr: |
        kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"}
      for: 1m
      labels:
        severity: critical
    - alert: KubeStatefulSetReplicasMismatch
      annotations:
        message: k8s-master-10.80.154.143 StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch
      expr: |
        kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"}
      for: 1m
      labels:
        severity: critical
    - alert: KubeStatefulSetGenerationMismatch
      annotations:
        message: k8s-master-10.80.154.143 StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch
      expr: |
        kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"}
      for: 1m
      labels:
        severity: critical
    - alert: KubeStatefulSetUpdateNotRolledOut
      annotations:
        message: k8s-master-10.80.154.143 StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout
      expr: |
        max without (revision) (
          kube_statefulset_status_current_revision{job="kube-state-metrics"}
            unless
          kube_statefulset_status_update_revision{job="kube-state-metrics"}
        )
          *
        (
          kube_statefulset_replicas{job="kube-state-metrics"}
            !=
          kube_statefulset_status_replicas_updated{job="kube-state-metrics"}
        )
      for: 1m
      labels:
        severity: critical
    - alert: KubeDaemonSetRolloutStuck
      annotations:
        message: k8s-master-10.80.154.143 Only {{ $value }}% of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck
      expr: |
        kube_daemonset_status_number_ready{job="kube-state-metrics"}
          /
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} * 100 < 100
      for: 1m
      labels:
        severity: critical
    - alert: KubeDaemonSetNotScheduled
      annotations:
        message: k8s-master-10.80.154.143 '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.'
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled
      expr: |
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0
      for: 1m
      labels:
        severity: warning
    - alert: KubeDaemonSetMisScheduled
      annotations:
        message: k8s-master-10.80.154.143 '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.'
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled
      expr: |
        kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
      for: 1m
      labels:
        severity: warning
    - alert: KubeCronJobRunning
      annotations:
        message: k8s-master-10.80.154.143 CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecronjobrunning
      expr: |
        time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600
      for: 1m
      labels:
        severity: warning
    - alert: KubeJobCompletion
      annotations:
        message: k8s-master-10.80.154.143 Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion
      expr: |
        kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0
      for: 1m
      labels:
        severity: warning
    - alert: KubeJobFailed
      annotations:
        message: k8s-master-10.80.154.143 Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed
      expr: |
        kube_job_status_failed{job="kube-state-metrics"} > 0
      for: 1m
      labels:
        severity: warning
  - name: kubernetes-resources
    rules:
    - alert: KubeCPUOvercommit
      annotations:
        message: k8s-master-10.80.154.143 'Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.'
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
      expr: |
        sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)
          /
        sum(node:node_num_cpu:sum)
          >
        (count(node:node_num_cpu:sum)-1) / count(node:node_num_cpu:sum)
      for: 1m
      labels:
        severity: info
    - alert: zxyKubeCPUOvercommit
      annotations:
        message: k8s-master-10.80.154.143 'Container CPU usage is above 100%; the current value is {{ printf "%0.0f" $value }}% in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
      expr: |
        round(100 * label_join(label_join(sum(rate(container_cpu_usage_seconds_total{container_name != "POD", image !=""}[1m])) by (pod_name, container_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_cpu_cores) by (pod, container, namespace)) > 100
      for: 1m
      labels:
        severity: critical
    - alert: zxyKubeMemoryOvercommit
      annotations:
        message: k8s-master-10.80.154.143 'Container memory usage is above 100%; the current value is {{ printf "%0.0f" $value }}% in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
      expr: |
        round(100 * label_join(label_join(sum(container_memory_usage_bytes{container_name != "POD", image !=""}) by (container_name, pod_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace)) > 100
      for: 1m
      labels:
        severity: critical
    - alert: KubeMemOvercommit
      annotations:
        message: k8s-master-10.80.154.143 Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit
      expr: |
        sum(namespace_name:kube_pod_container_resource_requests_memory_bytes:sum)
          /
        sum(node_memory_MemTotal_bytes)
          >
        (count(node:node_num_cpu:sum)-1) / count(node:node_num_cpu:sum)
      for: 1m
      labels:
        severity: warning
    - alert: KubeCPUOvercommit
      annotations:
        message: k8s-master-10.80.154.143 Cluster has overcommitted CPU resource requests for Namespaces.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
      expr: |
        sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="requests.cpu"})
          /
        sum(node:node_num_cpu:sum) > 1.5
      for: 1m
      labels:
        severity: warning
    - alert: KubeMemOvercommit
      annotations:
        message: k8s-master-10.80.154.143 Cluster has overcommitted memory resource requests for Namespaces.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit
      expr: |
        sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="requests.memory"})
          /
        sum(node_memory_MemTotal_bytes{job="node-exporter"}) > 1.5
      for: 1m
      labels:
        severity: warning
    - alert: KubeQuotaExceeded
      annotations:
        message: k8s-master-10.80.154.143 Namespace {{ $labels.namespace }} is using {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded
      expr: |
        100 * kube_resourcequota{job="kube-state-metrics", type="used"}
          / ignoring(instance, job, type)
        (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
          > 90
      for: 1m
      labels:
        severity: warning
    - alert: CPUThrottlingHigh
      annotations:
        message: k8s-master-10.80.154.143 '{{ printf "%0.0f" $value }}% throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container_name }} in pod {{ $labels.pod_name }}.'
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh
      expr: |
        100 * sum(increase(container_cpu_cfs_throttled_periods_total{}[5m])) by (container_name, pod_name, namespace)
          /
        sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container_name, pod_name, namespace)
          > 99
      for: 1m
      labels:
        severity: warning
  - name: kubernetes-storage
    rules:
    - alert: KubePersistentVolumeUsageCritical
      annotations:
        message: k8s-master-10.80.154.143 The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ printf "%0.2f" $value }}% free.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeusagecritical
      expr: |
        100 * kubelet_volume_stats_available_bytes{job="kubelet"}
          /
        kubelet_volume_stats_capacity_bytes{job="kubelet"}
          < 3
      for: 1m
      labels:
        severity: critical
    - alert: KubePersistentVolumeFullInFourDays
      annotations:
        message: k8s-master-10.80.154.143 Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ printf "%0.2f" $value }}% is available.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefullinfourdays
      expr: |
        100 * (
          kubelet_volume_stats_available_bytes{job="kubelet"}
            /
          kubelet_volume_stats_capacity_bytes{job="kubelet"}
        ) < 15
        and
        predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0
      for: 1m
      labels:
        severity: critical
    - alert: KubePersistentVolumeErrors
      annotations:
        message: k8s-master-10.80.154.143 The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors
      expr: |
        kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
      for: 1m
      labels:
        severity: critical
  - name: kubernetes-system
    rules:
    - alert: KubeNodeNotReady
      annotations:
        message: k8s-master-10.80.154.143 '{{ $labels.node }} has been unready for more than an hour.'
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready
      expr: |
        kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
      for: 1m
      labels:
        severity: warning
    - alert: KubeVersionMismatch
      annotations:
        message: k8s-master-10.80.154.143 There are {{ $value }} different versions of Kubernetes components running.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch
      expr: |
        count(count(kubernetes_build_info{job!="kube-dns"}) by (gitVersion)) > 1
      for: 1m
      labels:
        severity: warning
    - alert: KubeClientErrors
      annotations:
        message: k8s-master-10.80.154.143 Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf "%0.0f" $value }}% errors.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors
      expr: |
        (sum(rate(rest_client_requests_total{code!~"2..|404"}[5m])) by (instance, job)
          /
        sum(rate(rest_client_requests_total[5m])) by (instance, job))
        * 100 > 1
      for: 1m
      labels:
        severity: warning
    - alert: KubeClientErrors
      annotations:
        message: k8s-master-10.80.154.143 Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf "%0.0f" $value }} errors / second.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors
      expr: |
        sum(rate(ksm_scrape_error_total{job="kube-state-metrics"}[5m])) by (instance, job) > 0.1
      for: 1m
      labels:
        severity: warning
    - alert: KubeletTooManyPods
      annotations:
        message: k8s-master-10.80.154.143 Kubelet {{ $labels.instance }} is running {{ $value }} Pods, close to the limit of 110.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods
      expr: |
        kubelet_running_pod_count{job="kubelet"} > 110 * 0.9
      for: 1m
      labels:
        severity: warning
    - alert: KubeAPILatencyHigh
      annotations:
        message: k8s-master-10.80.154.143 The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh
      expr: |
        cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 1
      for: 1m
      labels:
        severity: warning
    - alert: KubeAPILatencyHigh
      annotations:
        message: k8s-master-10.80.154.143 The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh
      expr: |
        cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
      for: 1m
      labels:
        severity: critical
    - alert: KubeAPIErrorsHigh
      annotations:
        message: k8s-master-10.80.154.143 API server is returning errors for {{ $value }}% of requests.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh
      expr: |
        sum(rate(apiserver_request_count{job="apiserver",code=~"^(?:5..)$"}[5m])) without(instance, pod)
          /
        sum(rate(apiserver_request_count{job="apiserver"}[5m])) without(instance, pod) * 100 > 10
      for: 1m
      labels:
        severity: critical
    - alert: KubeAPIErrorsHigh
      annotations:
        message: k8s-master-10.80.154.143 API server is returning errors for {{ $value }}% of requests.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh
      expr: |
        sum(rate(apiserver_request_count{job="apiserver",code=~"^(?:5..)$"}[5m])) without(instance, pod)
          /
        sum(rate(apiserver_request_count{job="apiserver"}[5m])) without(instance, pod) * 100 > 5
      for: 1m
      labels:
        severity: warning
    - alert: KubeClientCertificateExpiration
      annotations:
        message: k8s-master-10.80.154.143 Kubernetes API certificate is expiring in less than 7 days.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration
      expr: |
        histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800
      labels:
        severity: warning
    - alert: KubeClientCertificateExpiration
      annotations:
        message: k8s-master-10.80.154.143 Kubernetes API certificate is expiring in less than 24 hours.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration
      expr: |
        histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
      labels:
        severity: critical
  - name: alertmanager.rules
    rules:
    - alert: AlertmanagerConfigInconsistent
      annotations:
        message: k8s-master-10.80.154.143 The configuration of the instances of the Alertmanager cluster `{{$labels.service}}` are out of sync.
      expr: |
        count_values("config_hash", alertmanager_config_hash{job="alertmanager-main"}) BY (service) / ON(service) GROUP_LEFT() label_replace(prometheus_operator_spec_replicas{job="prometheus-operator",controller="alertmanager"}, "service", "alertmanager-$1", "name", "(.*)") != 1
      for: 1m
      labels:
        severity: critical
    - alert: AlertmanagerFailedReload
      annotations:
        message: k8s-master-10.80.154.143 Reloading Alertmanager's configuration has failed for {{ $labels.namespace }}/{{ $labels.pod }}.
      expr: |
        alertmanager_config_last_reload_successful{job="alertmanager-main"} == 0
      for: 1m
      labels:
        severity: warning
    - alert: AlertmanagerMembersInconsistent
      annotations:
        message: k8s-master-10.80.154.143 Alertmanager has not found all other members of the cluster.
      expr: |
        alertmanager_cluster_members{job="alertmanager-main"}
          != on (service) GROUP_LEFT()
        count by (service) (alertmanager_cluster_members{job="alertmanager-main"})
      for: 1m
      labels:
        severity: critical
  - name: general.rules
    rules:
    - alert: TargetDown
      annotations:
        message: k8s-master-10.80.154.143 '{{ $value }}% of the {{ $labels.job }} targets are down.'
      expr: 100 * (count(up == 0) BY (job) / count(up) BY (job)) > 10
      for: 1m
      labels:
        severity: warning
  - name: kube-prometheus-node-alerting.rules
    rules:
    - alert: NodeDiskRunningFull
      annotations:
        message: k8s-master-10.80.154.143 Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 24 hours.
      expr: |
        (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)
      for: 1m
      labels:
        severity: warning
    - alert: NodeDiskRunningFull
      annotations:
        message: k8s-master-10.80.154.143 Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 2 hours.
      expr: |
        (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)
      for: 1m
      labels:
        severity: critical
  - name: prometheus.rules
    rules:
    - alert: PrometheusConfigReloadFailed
      annotations:
        description: Reloading Prometheus' configuration has failed for {{$labels.namespace}}/{{$labels.pod}}
        summary: Reloading Prometheus' configuration failed
      expr: |
        prometheus_config_last_reload_successful{job="prometheus-k8s"} == 0
      for: 1m
      labels:
        severity: warning
    - alert: PrometheusNotificationQueueRunningFull
      annotations:
        description: Prometheus' alert notification queue is running full for {{$labels.namespace}}/{{ $labels.pod}}
        summary: Prometheus' alert notification queue is running full
      expr: |
        predict_linear(prometheus_notifications_queue_length{job="prometheus-k8s"}[5m], 60 * 30) > prometheus_notifications_queue_capacity{job="prometheus-k8s"}
      for: 1m
      labels:
        severity: warning
    - alert: PrometheusErrorSendingAlerts
      annotations:
        description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}
        summary: Errors while sending alerts from Prometheus
      expr: |
        rate(prometheus_notifications_errors_total{job="prometheus-k8s"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-k8s"}[5m]) > 0.01
      for: 1m
      labels:
        severity: warning
    - alert: PrometheusErrorSendingAlerts
      annotations:
        description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}
        summary: Errors while sending alerts from Prometheus
      expr: |
        rate(prometheus_notifications_errors_total{job="prometheus-k8s"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-k8s"}[5m]) > 0.03
      for: 1m
      labels:
        severity: critical
    - alert: PrometheusNotConnectedToAlertmanagers
      annotations:
        description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected to any Alertmanagers
        summary: Prometheus is not connected to any Alertmanagers
      expr: |
        prometheus_notifications_alertmanagers_discovered{job="prometheus-k8s"} < 1
      for: 1m
      labels:
        severity: warning
    - alert: PrometheusTSDBReloadsFailing
      annotations:
        description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} reload failures over the last four hours.'
        summary: Prometheus has issues reloading data blocks from disk
      expr: |
        increase(prometheus_tsdb_reloads_failures_total{job="prometheus-k8s"}[2h]) > 0
      for: 1m
      labels:
        severity: warning
    - alert: PrometheusTSDBCompactionsFailing
      annotations:
        description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} compaction failures over the last four hours.'
        summary: Prometheus has issues compacting sample blocks
      expr: |
        increase(prometheus_tsdb_compactions_failed_total{job="prometheus-k8s"}[2h]) > 0
      for: 1m
      labels:
        severity: warning
    - alert: PrometheusTSDBWALCorruptions
      annotations:
        description: '{{$labels.job}} at {{$labels.instance}} has a corrupted write-ahead log (WAL).'
        summary: Prometheus write-ahead log is corrupted
      expr: |
        tsdb_wal_corruptions_total{job="prometheus-k8s"} > 0
      for: 1m
      labels:
        severity: warning
    - alert: PrometheusNotIngestingSamples
      annotations:
        description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} isn't ingesting samples.
        summary: Prometheus isn't ingesting samples
      expr: |
        rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s"}[5m]) <= 0
      for: 1m
      labels:
        severity: warning
    - alert: PrometheusTargetScrapesDuplicate
      annotations:
        description: '{{$labels.namespace}}/{{$labels.pod}} has many samples rejected due to duplicate timestamps but different values'
        summary: Prometheus has many samples rejected
      expr: |
        increase(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-k8s"}[5m]) > 0
      for: 1m
      labels:
        severity: warning
  - name: prometheus-operator
    rules:
    - alert: PrometheusOperatorReconcileErrors
      annotations:
        message: k8s-master-10.80.154.143 Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator"}[5m]) > 0.1
      for: 1m
      labels:
        severity: warning
    - alert: PrometheusOperatorNodeLookupErrors
      annotations:
        message: k8s-master-10.80.154.143 Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
      expr: |
        rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator"}[5m]) > 0.1
      for: 1m
      labels:
        severity: warning
```
A note on for:: `for: 1m` does not mean "check once per minute"; it means the alert condition must keep evaluating to true for 1 minute before the alert fires (with `for: 1h` it must hold for a full hour). How often the email for a still-firing alert is re-sent is controlled by `repeat_interval` in alertmanager.yaml, set to 12h above.
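Once the manifests have been applied (step 5 below), you can confirm the operator has picked up this rule file by listing the PrometheusRule objects:

```sh
kubectl -n monitoring get prometheusrules
```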
The following two alert rules were added deliberately, so that an email notification is sent when a container's CPU or memory usage reaches 85% of its limit:
```yaml
- alert: zxyKubeCPUOvercommit
  annotations:
    message: 'Container CPU usage is above 85%; the current value is {{ printf "%0.0f" $value }}% in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
  expr: |
    round(100 * label_join(label_join(sum(rate(container_cpu_usage_seconds_total{container_name != "POD", image !=""}[1m])) by (pod_name, container_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_cpu_cores) by (pod, container, namespace)) > 85
  for: 1m
  labels:
    severity: critical
- alert: zxyKubeMemoryOvercommit
  annotations:
    message: 'Container memory usage is above 85%; the current value is {{ printf "%0.0f" $value }}% in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit
  expr: |
    round(100 * label_join(label_join(sum(container_memory_usage_bytes{container_name != "POD", image !=""}) by (container_name, pod_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace)) > 85
  for: 1m
  labels:
    severity: critical
```
You can also define alerts of your own; the expressions used by the Grafana dashboards are a good reference for what to query:
```
sum(label_replace(container_memory_usage_bytes{namespace="$namespace", pod_name="$pod", container_name!="POD", container_name!=""}, "container", "$1", "container_name", "(.*)")) by (container)

sum(kube_pod_container_resource_requests_memory_bytes{namespace="$namespace", pod="$pod"}) by (container)

sum(label_replace(container_memory_usage_bytes{namespace="$namespace", pod_name="$pod"}, "container", "$1", "container_name", "(.*)")) by (container) / sum(kube_pod_container_resource_requests_memory_bytes{namespace="$namespace", pod="$pod"}) by (container)

sum(kube_pod_container_resource_limits_memory_bytes{namespace="$namespace", pod="$pod", container!=""}) by (container)

sum(label_replace(container_memory_usage_bytes{namespace="$namespace", pod_name="$pod", container_name!=""}, "container", "$1", "container_name", "(.*)")) by (container) / sum(kube_pod_container_resource_limits_memory_bytes{namespace="$namespace", pod="$pod"}) by (container)

round(100 * label_join(label_join(sum(rate(container_cpu_usage_seconds_total{container_name != "POD", image !=""}[1m])) by (pod_name, container_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_cpu_cores) by (pod, container, namespace)) > 75

round(100 * label_join(label_join(sum(container_memory_usage_bytes{container_name != "POD", image !=""}) by (container_name, pod_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace)) > 75
```
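Any of these expressions can be tried against the Prometheus HTTP API before being turned into a rule, once the Grafana variables $namespace and $pod are replaced with concrete values; a sketch, where the namespace "monitoring" is only an example:

```sh
# POST the query to the /api/v1/query endpoint; the result comes back as JSON
curl -s 'http://120.27.159.108:30001/api/v1/query' \
  --data-urlencode 'query=sum(kube_pod_container_resource_requests_memory_bytes{namespace="monitoring"}) by (container)'
```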
Download:
https://zxytest.zhixueyun.com/installer/prometheus-operator.zip
5. Startup commands
```sh
kubectl create namespace monitoring
kubectl delete secret alertmanager-main -n monitoring
kubectl create secret generic alertmanager-main --from-file=/app/prometheus-operator/alertmanager.yaml -n monitoring

# Prefix every alert message so the emails identify which environment fired,
# e.g. zxy9.zhixueyun.com (note the trailing space in the replacement):
sed -i 's/message: /message: zxy9.zhixueyun.com /g' /app/prometheus-operator/manifests/prometheus-rules.yaml

# bundle.yaml must be applied first, otherwise the services under manifests/
# cannot start:
kubectl create -f /app/prometheus-operator/bundle.yaml
kubectl create -f /app/prometheus-operator/manifests
```
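Then watch the monitoring namespace until all pods reach Running:

```sh
kubectl -n monitoring get pods -w
```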
6. Teardown commands
```sh
kubectl delete secret alertmanager-main -n monitoring
kubectl delete -f /app/prometheus-operator/manifests
```
7. Testing
```
[root@iZbp1at8fph52evh70atb1Z manifests]# kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
alertmanager-main       NodePort    10.254.71.140   <none>        9093:30093/TCP      6m55s
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,6783/TCP   6m51s
grafana                 NodePort    10.254.83.196   <none>        3000:30000/TCP      6m55s
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP   6m55s
node-exporter           ClusterIP   None            <none>        9100/TCP            6m55s
prometheus-adapter      ClusterIP   10.254.92.97    <none>        443/TCP             6m55s
prometheus-k8s          NodePort    10.254.148.92   <none>        9090:30001/TCP      6m55s
prometheus-operated     ClusterIP   None            <none>        9090/TCP            6m44s
prometheus-operator     ClusterIP   None            <none>        8080/TCP            7h48m
```
Grafana is available at http://120.27.159.108:30000
prometheus-k8s is available at http://120.27.159.108:30001
The Alerts page of the Prometheus UI shows the current alerting rules: rules in red are firing, rules in green are healthy. The prometheus operator creates a set of alerting rules automatically by default.
Alert emails are then delivered to the recipients configured in alertmanager.yaml.
To suppress alerts, open Alertmanager at http://120.27.159.108:30093 and click Silence to define what should be suppressed.
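Silences can also be created from the command line with amtool; a sketch, where the matcher, duration, and comment are example values:

```sh
amtool silence add alertname=KubeCPUOvercommit --duration=2h \
  --comment="maintenance window" --alertmanager.url=http://120.27.159.108:30093
```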
8. Analyzing the alerts: open http://120.27.159.108:30001/alerts
Take alert: KubeCPUOvercommit as an example. Its expression is:

```
sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum) / sum(node:node_num_cpu:sum) > (count(node:node_num_cpu:sum) - 1) / count(node:node_num_cpu:sum)
```

The left-hand side is the total CPU requested across all namespaces divided by the total number of CPU cores across all k8s nodes. The component series break down as follows:
- sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum): the total CPU requests summed over all namespaces
- namespace_name:kube_pod_container_resource_requests_cpu_cores:sum: the total CPU requests per namespace
- kube_pod_container_resource_requests_cpu_cores: the CPU request of each pod container
- node:node_num_cpu:sum: the total number of CPU cores on each k8s node
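As a worked example: in a cluster of 3 nodes with 4 cores each, the right-hand side evaluates to (3 - 1) / 3 ≈ 0.67, so the alert fires once total CPU requests exceed roughly 8 of the 12 cores, i.e. more than could be rescheduled elsewhere if one node failed.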
Open http://120.27.159.108:30001/graph and enter kube_pod_container_resource_limits_memory_bytes to look up every pod's memory limit.
9. Disk space alerting: fire when usage exceeds 85%
```yaml
- expr: |
    max by (namespace, pod, device) ((node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} - node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}) / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
  record: 'node:node_filesystem_usage:'
- expr: |
    max by (namespace, pod, device) (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
  record: 'node:node_filesystem_avail:'
- alert: NodeDiskRunningFull
  annotations:
    message: Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 24 hours.
  expr: |
    (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)
  for: 30m
  labels:
    severity: warning
- alert: NodeDiskRunningFull
  annotations:
    message: Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 2 hours.
  expr: |
    (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)
  for: 10m
  labels:
    severity: critical
```
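If you also want an alert purely on current usage, independent of the predicted trend, a minimal sketch reusing the node:node_filesystem_usage: recording rule above (the alert name here is made up, not part of the upstream rules):

```yaml
- alert: NodeDiskUsageHigh  # hypothetical name, not in the upstream mixin
  annotations:
    message: Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} is more than 85% full.
  expr: |
    node:node_filesystem_usage: > 0.85
  for: 10m
  labels:
    severity: warning
```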
10. If node-exporter fails to start with the following error:
```
[root@iZbp14qk2dtp82q129jrzqZ manifests]# kubectl logs node-exporter-9kg72 -n monitoring -c kube-rbac-proxy
I0308 06:29:35.477100   19438 main.go:209] Generating self signed cert as no cert is provided
log: exiting because of error: log: cannot create log: open /tmp/kube-rbac-proxy.iZbp1hkg813np4ep5cuakvZ.unknownuser.log.INFO.20190308-062935.19438: permission denied
```
then edit node-exporter-daemonset.yaml so the pods run as root:

```yaml
runAsNonRoot: false
runAsUser: 0
```
```yaml
apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
      - args:
        - --web.listen-address=127.0.0.1:9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        image: quay.io/prometheus/node-exporter:v0.16.0
        name: node-exporter
        resources:
          limits:
            cpu: 250m
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        volumeMounts:
        - mountPath: /host/proc
          name: proc
          readOnly: false
        - mountPath: /host/sys
          name: sys
          readOnly: false
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
      - args:
        - --secure-listen-address=$(IP):9100
        - --upstream=http://127.0.0.1:9100/
        env:
        - name: IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        image: quay.io/coreos/kube-rbac-proxy:v0.4.0
        name: kube-rbac-proxy
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: https
        resources:
          limits:
            cpu: 20m
            memory: 40Mi
          requests:
            cpu: 10m
            memory: 20Mi
      hostNetwork: true
      hostPID: true
      nodeSelector:
        beta.kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: false
        runAsUser: 0
      serviceAccountName: node-exporter
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - hostPath:
          path: /proc
        name: proc
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: root
```
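After editing, re-apply the DaemonSet and check that the pods come back up:

```sh
kubectl apply -f /app/prometheus-operator/manifests/node-exporter-daemonset.yaml
kubectl -n monitoring get ds node-exporter
```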
11. It is best to pin both alertmanager-alertmanager.yaml and prometheus-prometheus.yaml to the master node via nodeName: <k8s_master_ip>:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeName: 10.80.154.143
  #nodeSelector:
  #  beta.kubernetes.io/os: linux
  replicas: 2
  resources:
    requests:
      memory: 600Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.5.0
```
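If the operator version in use rejects nodeName in the Prometheus spec, the same pinning can be achieved with a node label plus nodeSelector, which the Prometheus CRD does expose; a sketch, where the label key and value are arbitrary choices:

```sh
# Label the master node (the node name is a placeholder)
kubectl label node <master-node-name> monitoring=master
```

and in the Prometheus/Alertmanager spec:

```yaml
  nodeSelector:
    monitoring: master
```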
12. Note that kind: ReplicationController should be changed to kind: Deployment. Under a ReplicationController, the pod memory reported by container_memory_usage_bytes{container_name!="POD",image!=""} is inaccurate; under a Deployment it is accurate. For example, with kind: ReplicationController, the pod memory shown by kubectl top po does not match the value Prometheus observes.
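To verify this, compare kubectl top with the raw cAdvisor series for the same pod; the pod and namespace names below are placeholders:

```sh
kubectl top pod <pod-name> -n <namespace>
```

and in the Prometheus /graph UI:

```
sum(container_memory_usage_bytes{namespace="<namespace>", pod_name="<pod-name>", container_name!="POD", image!=""})
```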
References:
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus