Prometheus Key Metrics Cheat Sheet

🧱 Node Exporter Metrics (Host-Level Monitoring)

node_cpu_seconds_total{mode!~"idle"}

Cumulative CPU time in non-idle modes (a counter; wrap it in rate() to derive a usage percentage)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

Ratio of available to total memory; memory usage is 1 minus this ratio
node_disk_io_time_seconds_total

Cumulative time the disk spent doing I/O; rate() of this counter gives disk utilization
node_load1 / node_load5 / node_load15

System load averages over 1, 5, and 15 minutes; a measure of overall CPU pressure
node_network_receive_bytes_total / node_network_transmit_bytes_total

Inbound / outbound network traffic (counters, in bytes)
node_boot_time_seconds / node_time_seconds

Host boot time and current time; subtract boot time from current time to get uptime
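
As a worked illustration of the arithmetic behind these metrics, the Python sketch below mirrors what rate() and the memory ratio compute. The function names are hypothetical, the sample values are made up, and counter resets are ignored for simplicity; the PromQL in the comments uses the standard node_exporter metric names.

```python
# Rough PromQL equivalents:
#   CPU usage %:    100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
#   Memory usage %: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
#   Uptime (s):     node_time_seconds - node_boot_time_seconds


def counter_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Average per-second increase of a counter over a window (resets ignored)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (last_value - first_value) / window_seconds


def cpu_usage_percent(idle_first: float, idle_last: float, window_seconds: float) -> float:
    """CPU usage % of one core, from two samples of its idle-seconds counter."""
    idle_fraction = counter_rate(idle_first, idle_last, window_seconds)
    return 100.0 * (1.0 - idle_fraction)


def memory_usage_percent(available_bytes: float, total_bytes: float) -> float:
    """Memory usage % derived from MemAvailable and MemTotal."""
    return 100.0 * (1.0 - available_bytes / total_bytes)


if __name__ == "__main__":
    # Hypothetical samples: the idle counter grew by 240 s over a 300 s window.
    print(round(cpu_usage_percent(1000.0, 1240.0, 300.0), 2))
    # Hypothetical host with 4 GiB available out of 16 GiB total.
    print(memory_usage_percent(4 * 2**30, 16 * 2**30))
```

Working from the idle counter rather than summing every non-idle mode keeps the calculation to a single series per core, which is why the idle-based form is the common way to express CPU usage.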

☸️ kube-state-metrics Metrics (Kubernetes Resource State)

kube_pod_status_phase{phase!="Running"}

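Pods not in the Running phase (for example Pending or Failed). Each series is a 0/1 gauge per Pod and phase, so filter with == 1 or sum the values to get a count; note that this selector also matches Succeeded Pods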
kube_pod_container_status_restarts_total

Cumulative restart count of each container in a Pod; useful for troubleshooting stability problems
kube_deployment_status_replicas / ...updated... / ...available...

Checks whether a Deployment has its full set of replicas (total, updated, available)
kube_node_status_condition{condition="Ready",status!="true"}

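Nodes that are not Ready (NotReady). Each series is a 0/1 gauge, so append == 1 to select only nodes whose Ready condition is currently false or unknown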
kube_job_status_failed

Number of failed Pods for a Job; suited to monitoring batch data-processing jobs
kube_persistentvolumeclaim_status_phase{phase!="Bound"}

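PVCs that are not Bound (again a 0/1 gauge, so filter with == 1); an unbound PVC may prevent volumes from being mounted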
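
Because kube-state-metrics exposes one 0/1 series per object and state, a label selector alone also returns the inactive (0) series. The Python sketch below shows the filtering and counting that `== 1` plus `sum by (phase)` perform in PromQL. The function name, the sample data, and the choice to treat Succeeded as healthy are assumptions made for illustration.

```python
from collections import Counter

# Rough PromQL equivalent:
#   sum by (phase) (kube_pod_status_phase{phase!~"Running|Succeeded"} == 1)


def count_unhealthy_pods(samples, healthy_phases=("Running", "Succeeded")):
    """Count pods per phase, skipping healthy phases and inactive (0) series.

    samples: iterable of (labels, value) pairs, where labels is a dict
    that contains at least a "phase" key.
    """
    counts = Counter()
    for labels, value in samples:
        if value == 1 and labels["phase"] not in healthy_phases:
            counts[labels["phase"]] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up samples: every pod has one series per phase, only one of them is 1.
    samples = [
        ({"pod": "api-0", "phase": "Running"}, 1),
        ({"pod": "api-0", "phase": "Pending"}, 0),
        ({"pod": "job-7", "phase": "Failed"}, 1),
        ({"pod": "web-2", "phase": "Pending"}, 1),
    ]
    print(count_unhealthy_pods(samples))
```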

🐋 cAdvisor Metrics (Container-Level Monitoring)

container_cpu_usage_seconds_total

Cumulative CPU time consumed by the container; rate() over a 5-minute window gives the average number of CPU cores in use
container_memory_usage_bytes

Container memory usage, including page cache
container_memory_rss

Resident set size: memory actually resident, excluding reclaimable cache; useful for OOMKill analysis
container_fs_usage_bytes / container_fs_limit_bytes

Container filesystem usage ratio (usage divided by limit)
container_start_time_seconds / container_last_seen

Container start time / time of the last sample; used to tell whether a container restarted recently or has stopped reporting
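
The timestamp and ratio checks described above can be sketched in Python as follows. The helper names and the default thresholds (300 s and 60 s) are assumptions chosen for illustration, not values recommended by cAdvisor.

```python
from typing import Optional

# Rough PromQL equivalents:
#   CPU cores used:   rate(container_cpu_usage_seconds_total[5m])
#   Filesystem usage: container_fs_usage_bytes / container_fs_limit_bytes
#   Recently started: time() - container_start_time_seconds < 300
#   Possibly gone:    time() - container_last_seen > 60


def fs_usage_ratio(usage_bytes: float, limit_bytes: float) -> Optional[float]:
    """Fraction of the filesystem limit in use, or None when no limit is reported."""
    if limit_bytes <= 0:
        return None
    return usage_bytes / limit_bytes


def started_recently(start_time: float, now: float, threshold_seconds: float = 300.0) -> bool:
    """True if the container started less than threshold_seconds ago."""
    return now - start_time < threshold_seconds


def is_stale(last_seen: float, now: float, max_age_seconds: float = 60.0) -> bool:
    """True if the container has not been seen for more than max_age_seconds."""
    return now - last_seen > max_age_seconds


if __name__ == "__main__":
    now = 1_700_000_000.0  # hypothetical Unix timestamp
    print(fs_usage_ratio(8 * 2**30, 10 * 2**30))
    print(started_recently(now - 120, now))
    print(is_stale(now - 15, now))
```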

📡 consul_exporter Metrics (Service Discovery / Health Status)

consul_up

Whether the Consul API is reachable
consul_raft_peers

Current number of Raft peers
consul_raft_leader

Whether a Leader exists (1 means it does)
consul_health_service_status{status="critical"}

Service instances whose health status is critical
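
To interpret the Raft peer count, it helps to recall the standard quorum arithmetic: a Raft cluster of n servers needs floor(n/2) + 1 of them to agree. The Python sketch below works that out, assuming the value of consul_raft_peers is used as n; the function names are hypothetical.

```python
def raft_quorum(peers: int) -> int:
    """Smallest number of servers that forms a majority in a Raft cluster."""
    if peers < 1:
        raise ValueError("a Raft cluster needs at least one peer")
    return peers // 2 + 1


def failures_tolerated(peers: int) -> int:
    """How many servers can fail while a majority remains."""
    return peers - raft_quorum(peers)


if __name__ == "__main__":
    for peers in (1, 3, 4, 5):
        print(peers, raft_quorum(peers), failures_tolerated(peers))
```

An even peer count raises the quorum without raising the number of tolerated failures, which is why odd cluster sizes such as 3 or 5 are the usual choice.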

📊 Prometheus Self-Monitoring Metrics

prometheus_tsdb_head_series

Number of currently active time series in the TSDB head; a measure of storage pressure
prometheus_target_interval_length_seconds

Actual interval between scrapes; can reveal targets that are being scraped more slowly than configured
prometheus_notifications_sent_total

Number of alert notifications sent to Alertmanager; wrap it in rate() to measure how frequently alerts fire

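✅ Interview Tips

Know the units of the common metrics (bytes, seconds, counts)
Know how to combine metrics for fault diagnosis (CPU jitter, memory leaks, frequent restarts)
Be able to write sensible alert expressions (>80%, repeated restarts, abnormal states)
Understand where the data comes from (node_exporter, cAdvisor, kube-state-metrics, and so on)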
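
The alert conditions mentioned above (a sustained value over 80%, repeated restarts) can be sketched in Python as follows. This is a simplified illustration: the function names and thresholds are assumptions, and unlike PromQL's increase() the counter helper does not extrapolate to the window boundaries, although it does treat a drop in value as a counter reset.

```python
# Rough PromQL equivalents (thresholds are illustrative assumptions):
#   Frequent restarts: increase(kube_pod_container_status_restarts_total[15m]) > 3
#   High memory usage: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 80


def counter_increase(values):
    """Total increase of a counter across ordered samples, tolerating resets."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        # A decrease means the counter restarted from zero, so count cur in full.
        total += cur - prev if cur >= prev else cur
    return total


def restarting_too_often(restart_counts, max_restarts=3):
    """True if the restart counter grew by more than max_restarts in the window."""
    return counter_increase(restart_counts) > max_restarts


def sustained_above(percent_samples, threshold=80.0):
    """True only if every sample exceeds the threshold, like an alert 'for:' clause."""
    return bool(percent_samples) and all(s > threshold for s in percent_samples)


if __name__ == "__main__":
    print(restarting_too_often([2, 3, 5, 7]))   # hypothetical restart counter samples
    print(sustained_above([85.0, 91.2, 88.4]))  # hypothetical usage percentages
    print(sustained_above([85.0, 79.9, 88.4]))
```

Requiring every sample in the window to exceed the threshold avoids alerting on a single spike, which is the same motivation as the `for:` duration in a Prometheus alerting rule.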

posted @ 2025-07-10 11:38 弗拉宾教头
