Awesome Prometheus alerts
Reposted from https://awesome-prometheus-alerts.grep.to/rules#host-and-hardware (mirror: http://t.zoukankan.com/shoufu-p-14110485.html)
Collection of alerting rules
⚠️ Caution ⚠️
Alert thresholds depend on the nature of your applications.
Some queries on this page use arbitrary tolerance thresholds.
Building an efficient and battle-tested monitoring platform takes time. 😉
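The rules are listed individually below; in practice they are collected into a rule group in a file that `prometheus.yml` references. A minimal sketch of that wiring, using the first rule of the next section (the file path and group name are arbitrary examples):

```yaml
# prometheus.yml (excerpt): point Prometheus at one or more rule files
rule_files:
  - /etc/prometheus/rules/awesome-alerts.yml

# /etc/prometheus/rules/awesome-alerts.yml: wrap the alerts from this page in a group
groups:
  - name: awesome-prometheus-alerts
    rules:
      - alert: PrometheusJobMissing
        expr: absent(up{job="prometheus"})
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Prometheus job missing (instance {{ $labels.instance }})
          description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```

A rule file can be validated with `promtool check rules <file>` before reloading Prometheus.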
- 
1.1. Prometheus self-monitoring (25 rules)
1.1.1. Prometheus job missingA Prometheus job has disappeared[copy]- alert: PrometheusJobMissing expr: absent(up{job="prometheus"}) for: 5m labels: severity: warning annotations: summary: Prometheus job missing (instance {{ $labels.instance }}) description: A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.2. Prometheus target missingA Prometheus target has disappeared. An exporter might be crashed.[copy]- alert: PrometheusTargetMissing expr: up == 0 for: 5m labels: severity: critical annotations: summary: Prometheus target missing (instance {{ $labels.instance }}) description: A Prometheus target has disappeared. An exporter might be crashed.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.3. Prometheus all targets missing: A Prometheus job no longer has any living targets. - alert: PrometheusAllTargetsMissing expr: count by (job) (up) == 0 for: 5m labels: severity: critical annotations: summary: Prometheus all targets missing (instance {{ $labels.instance }}) description: A Prometheus job no longer has any living targets.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.4. Prometheus configuration reload failurePrometheus configuration reload error[copy]- alert: PrometheusConfigurationReloadFailure expr: prometheus_config_last_reload_successful != 1 for: 5m labels: severity: warning annotations: summary: Prometheus configuration reload failure (instance {{ $labels.instance }}) description: Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.5. Prometheus too many restartsPrometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.[copy]- alert: PrometheusTooManyRestarts expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2 for: 5m labels: severity: warning annotations: summary: Prometheus too many restarts (instance {{ $labels.instance }}) description: Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.6. Prometheus AlertManager configuration reload failureAlertManager configuration reload error[copy]- alert: PrometheusAlertmanagerConfigurationReloadFailure expr: alertmanager_config_last_reload_successful != 1 for: 5m labels: severity: warning annotations: summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }}) description: AlertManager configuration reload error\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.7. Prometheus AlertManager config not syncedConfigurations of AlertManager cluster instances are out of sync[copy]- alert: PrometheusAlertmanagerConfigNotSynced expr: count(count_values("config_hash", alertmanager_config_hash)) > 1 for: 5m labels: severity: warning annotations: summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }}) description: Configurations of AlertManager cluster instances are out of sync\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.8. Prometheus AlertManager E2E dead man switch: Prometheus DeadManSwitch is an always-firing alert, used as an end-to-end test of Prometheus through the Alertmanager (see the routing sketch after this section). - alert: PrometheusAlertmanagerE2eDeadManSwitch expr: vector(1) for: 5m labels: severity: critical annotations: summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }}) description: Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.9. Prometheus not connected to alertmanager: Prometheus cannot connect to the Alertmanager. - alert: PrometheusNotConnectedToAlertmanager expr: prometheus_notifications_alertmanagers_discovered < 1 for: 5m labels: severity: critical annotations: summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }}) description: Prometheus cannot connect to the Alertmanager\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.10. Prometheus rule evaluation failuresPrometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.[copy]- alert: PrometheusRuleEvaluationFailures expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus rule evaluation failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.11. Prometheus template text expansion failuresPrometheus encountered {{ $value }} template text expansion failures[copy]- alert: PrometheusTemplateTextExpansionFailures expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus template text expansion failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.12. Prometheus rule evaluation slow: Prometheus rule evaluation took more time than the scheduled interval. It indicates slower storage backend access or overly complex queries. - alert: PrometheusRuleEvaluationSlow expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds for: 5m labels: severity: warning annotations: summary: Prometheus rule evaluation slow (instance {{ $labels.instance }}) description: Prometheus rule evaluation took more time than the scheduled interval. It indicates slower storage backend access or overly complex queries.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.13. Prometheus notifications backlogThe Prometheus notification queue has not been empty for 10 minutes[copy]- alert: PrometheusNotificationsBacklog expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0 for: 5m labels: severity: warning annotations: summary: Prometheus notifications backlog (instance {{ $labels.instance }}) description: The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.14. Prometheus AlertManager notification failing: Alertmanager is failing to send notifications. - alert: PrometheusAlertmanagerNotificationFailing expr: rate(alertmanager_notifications_failed_total[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }}) description: Alertmanager is failing to send notifications\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.15. Prometheus target emptyPrometheus has no target in service discovery[copy]- alert: PrometheusTargetEmpty expr: prometheus_sd_discovered_targets == 0 for: 5m labels: severity: critical annotations: summary: Prometheus target empty (instance {{ $labels.instance }}) description: Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.16. Prometheus target scraping slowPrometheus is scraping exporters slowly[copy]- alert: PrometheusTargetScrapingSlow expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60 for: 5m labels: severity: warning annotations: summary: Prometheus target scraping slow (instance {{ $labels.instance }}) description: Prometheus is scraping exporters slowly\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.17. Prometheus large scrapePrometheus has many scrapes that exceed the sample limit[copy]- alert: PrometheusLargeScrape expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10 for: 5m labels: severity: warning annotations: summary: Prometheus large scrape (instance {{ $labels.instance }}) description: Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.18. Prometheus target scrape duplicatePrometheus has many samples rejected due to duplicate timestamps but different values[copy]- alert: PrometheusTargetScrapeDuplicate expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Prometheus target scrape duplicate (instance {{ $labels.instance }}) description: Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.19. Prometheus TSDB checkpoint creation failuresPrometheus encountered {{ $value }} checkpoint creation failures[copy]- alert: PrometheusTsdbCheckpointCreationFailures expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} checkpoint creation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.20. Prometheus TSDB checkpoint deletion failuresPrometheus encountered {{ $value }} checkpoint deletion failures[copy]- alert: PrometheusTsdbCheckpointDeletionFailures expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} checkpoint deletion failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.21. Prometheus TSDB compactions failed: Prometheus encountered {{ $value }} TSDB compaction failures. - alert: PrometheusTsdbCompactionsFailed expr: increase(prometheus_tsdb_compactions_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB compaction failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.22. Prometheus TSDB head truncations failedPrometheus encountered {{ $value }} TSDB head truncation failures[copy]- alert: PrometheusTsdbHeadTruncationsFailed expr: increase(prometheus_tsdb_head_truncations_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB head truncation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.23. Prometheus TSDB reload failuresPrometheus encountered {{ $value }} TSDB reload failures[copy]- alert: PrometheusTsdbReloadFailures expr: increase(prometheus_tsdb_reloads_failures_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB reload failures (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB reload failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.24. Prometheus TSDB WAL corruptionsPrometheus encountered {{ $value }} TSDB WAL corruptions[copy]- alert: PrometheusTsdbWalCorruptions expr: increase(prometheus_tsdb_wal_corruptions_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB WAL corruptions\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.1.25. Prometheus TSDB WAL truncations failedPrometheus encountered {{ $value }} TSDB WAL truncation failures[copy]- alert: PrometheusTsdbWalTruncationsFailed expr: increase(prometheus_tsdb_wal_truncations_failed_total[3m]) > 0 for: 5m labels: severity: critical annotations: summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }}) description: Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
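Rule 1.1.8 above fires permanently by design: it only becomes useful once Alertmanager routes it to something that expects a constant heartbeat and raises the alarm when the heartbeat stops. A minimal Alertmanager routing sketch, assuming a hypothetical external watchdog endpoint (the receiver name and URL are illustrative):

```yaml
# alertmanager.yml (excerpt)
route:
  receiver: default
  routes:
    # Route the always-firing dead man's switch to its own receiver and
    # re-notify often, so the external watchdog notices quickly if it goes silent.
    - match:
        alertname: PrometheusAlertmanagerE2eDeadManSwitch
      receiver: deadmansswitch
      repeat_interval: 5m

receivers:
  - name: default
    # ... your usual notification integrations ...
  - name: deadmansswitch
    webhook_configs:
      - url: https://deadmanswitch.example.com/ping
        send_resolved: false
```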
 
- 
- 
1.2. Host and hardware: node-exporter (26 rules; a minimal scrape configuration sketch follows this section)
1.2.1. Host out of memoryNode memory is filling up (< 10% left)[copy]- alert: HostOutOfMemory expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 for: 5m labels: severity: warning annotations: summary: Host out of memory (instance {{ $labels.instance }}) description: Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.2. Host memory under memory pressureThe node is under heavy memory pressure. High rate of major page faults[copy]- alert: HostMemoryUnderMemoryPressure expr: rate(node_vmstat_pgmajfault[1m]) > 1000 for: 5m labels: severity: warning annotations: summary: Host memory under memory pressure (instance {{ $labels.instance }}) description: The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.3. Host unusual network throughput inHost network interfaces are probably receiving too much data (> 100 MB/s)[copy]- alert: HostUnusualNetworkThroughputIn expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100 for: 5m labels: severity: warning annotations: summary: Host unusual network throughput in (instance {{ $labels.instance }}) description: Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.4. Host unusual network throughput outHost network interfaces are probably sending too much data (> 100 MB/s)[copy]- alert: HostUnusualNetworkThroughputOut expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100 for: 5m labels: severity: warning annotations: summary: Host unusual network throughput out (instance {{ $labels.instance }}) description: Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.5. Host unusual disk read rateDisk is probably reading too much data (> 50 MB/s)[copy]- alert: HostUnusualDiskReadRate expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50 for: 5m labels: severity: warning annotations: summary: Host unusual disk read rate (instance {{ $labels.instance }}) description: Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.6. Host unusual disk write rateDisk is probably writing too much data (> 50 MB/s)[copy]- alert: HostUnusualDiskWriteRate expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50 for: 5m labels: severity: warning annotations: summary: Host unusual disk write rate (instance {{ $labels.instance }}) description: Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.7. Host out of disk spaceDisk is almost full (< 10% left)[copy]# please add ignored mountpoints in node_exporter parameters like # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)" - alert: HostOutOfDiskSpace expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 for: 5m labels: severity: warning annotations: summary: Host out of disk space (instance {{ $labels.instance }}) description: Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.8. Host disk will fill in 4 hoursDisk will fill in 4 hours at current write rate[copy]- alert: HostDiskWillFillIn4Hours expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0 for: 5m labels: severity: warning annotations: summary: Host disk will fill in 4 hours (instance {{ $labels.instance }}) description: Disk will fill in 4 hours at current write rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.9. Host out of inodesDisk is almost running out of available inodes (< 10% left)[copy]- alert: HostOutOfInodes expr: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint ="/rootfs"} * 100 < 10 for: 5m labels: severity: warning annotations: summary: Host out of inodes (instance {{ $labels.instance }}) description: Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.10. Host unusual disk read latencyDisk latency is growing (read operations > 100ms)[copy]- alert: HostUnusualDiskReadLatency expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Host unusual disk read latency (instance {{ $labels.instance }}) description: Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.11. Host unusual disk write latencyDisk latency is growing (write operations > 100ms)[copy]- alert: HostUnusualDiskWriteLatency expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Host unusual disk write latency (instance {{ $labels.instance }}) description: Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.12. Host high CPU loadCPU load is > 80%[copy]- alert: HostHighCpuLoad expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Host high CPU load (instance {{ $labels.instance }}) description: CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.13. Host context switchingContext switching is growing on node (> 1000 / s)[copy]# 1000 context switches is an arbitrary number. # Alert threshold depends on nature of application. # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58 - alert: HostContextSwitching expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000 for: 5m labels: severity: warning annotations: summary: Host context switching (instance {{ $labels.instance }}) description: Context switching is growing on node (> 1000 / s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.14. Host swap is filling upSwap is filling up (>80%)[copy]- alert: HostSwapIsFillingUp expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80 for: 5m labels: severity: warning annotations: summary: Host swap is filling up (instance {{ $labels.instance }}) description: Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.15. Host SystemD service crashedSystemD service crashed[copy]- alert: HostSystemdServiceCrashed expr: node_systemd_unit_state{state="failed"} == 1 for: 5m labels: severity: warning annotations: summary: Host SystemD service crashed (instance {{ $labels.instance }}) description: SystemD service crashed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.16. Host physical component too hotPhysical hardware component too hot[copy]- alert: HostPhysicalComponentTooHot expr: node_hwmon_temp_celsius > 75 for: 5m labels: severity: warning annotations: summary: Host physical component too hot (instance {{ $labels.instance }}) description: Physical hardware component too hot\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.17. Host node overtemperature alarmPhysical node temperature alarm triggered[copy]- alert: HostNodeOvertemperatureAlarm expr: node_hwmon_temp_alarm == 1 for: 5m labels: severity: critical annotations: summary: Host node overtemperature alarm (instance {{ $labels.instance }}) description: Physical node temperature alarm triggered\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.18. Host RAID array got inactive: RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically. - alert: HostRaidArrayGotInactive expr: node_md_state{state="inactive"} > 0 for: 5m labels: severity: critical annotations: summary: Host RAID array got inactive (instance {{ $labels.instance }}) description: RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.19. Host RAID disk failureAt least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap[copy]- alert: HostRaidDiskFailure expr: node_md_disks{state="failed"} > 0 for: 5m labels: severity: warning annotations: summary: Host RAID disk failure (instance {{ $labels.instance }}) description: At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.20. Host kernel version deviationsDifferent kernel versions are running[copy]- alert: HostKernelVersionDeviations expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1 for: 5m labels: severity: warning annotations: summary: Host kernel version deviations (instance {{ $labels.instance }}) description: Different kernel versions are running\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.21. Host OOM kill detectedOOM kill detected[copy]- alert: HostOomKillDetected expr: increase(node_vmstat_oom_kill[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Host OOM kill detected (instance {{ $labels.instance }}) description: OOM kill detected\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.22. Host EDAC Correctable Errors detected{{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.[copy]- alert: HostEdacCorrectableErrorsDetected expr: increase(node_edac_correctable_errors_total[5m]) > 0 for: 5m labels: severity: info annotations: summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }}) description: {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.23. Host EDAC Uncorrectable Errors detected{{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.[copy]- alert: HostEdacUncorrectableErrorsDetected expr: node_edac_uncorrectable_errors_total > 0 for: 5m labels: severity: warning annotations: summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }}) description: {{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.24. Host Network Receive Errors{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.[copy]- alert: HostNetworkReceiveErrors expr: increase(node_network_receive_errs_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Host Network Receive Errors (instance {{ $labels.instance }}) description: {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.25. Host Network Transmit Errors{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.[copy]- alert: HostNetworkTransmitErrors expr: increase(node_network_transmit_errs_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Host Network Transmit Errors (instance {{ $labels.instance }}) description: {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.2.26. Host Network Interface Saturated: The network interface "{{ $labels.device }}" on "{{ $labels.instance }}" is getting overloaded. - alert: HostNetworkInterfaceSaturated expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 for: 5m labels: severity: warning annotations: summary: Host Network Interface Saturated (instance {{ $labels.instance }}) description: The network interface "{{ $labels.device }}" on "{{ $labels.instance }}" is getting overloaded.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
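The host rules above assume node_exporter metrics are being scraped under some job. A minimal scrape configuration sketch, assuming node_exporter listens on its default port 9100 and using hypothetical host names:

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - host1.example.com:9100
          - host2.example.com:9100
```

As noted in rule 1.2.7, pseudo filesystems are best excluded at the exporter, e.g. by starting node_exporter with "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)", so they do not trigger the disk-space alerts.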
 
- 
- 
1.3. Docker containers: google/cAdvisor (6 rules; a Docker Compose sketch for running cAdvisor follows this section)
1.3.1. Container killedA container has disappeared[copy]- alert: ContainerKilled expr: time() - container_last_seen > 60 for: 5m labels: severity: warning annotations: summary: Container killed (instance {{ $labels.instance }}) description: A container has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.3.2. Container CPU usageContainer CPU usage is above 80%[copy]# cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly. # If you want to exclude it from this alert, just use: container_cpu_usage_seconds_total{name!=""} - alert: ContainerCpuUsage expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Container CPU usage (instance {{ $labels.instance }}) description: Container CPU usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.3.3. Container Memory usageContainer Memory usage is above 80%[copy]# See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d - alert: ContainerMemoryUsage expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Container Memory usage (instance {{ $labels.instance }}) description: Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.3.4. Container Volume usageContainer Volume usage is above 80%[copy]- alert: ContainerVolumeUsage expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance)) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Container Volume usage (instance {{ $labels.instance }}) description: Container Volume usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.3.5. Container Volume IO usageContainer Volume IO usage is above 80%[copy]- alert: ContainerVolumeIoUsage expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Container Volume IO usage (instance {{ $labels.instance }}) description: Container Volume IO usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.3.6. Container high throttle rateContainer is being throttled[copy]- alert: ContainerHighThrottleRate expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1 for: 5m labels: severity: warning annotations: summary: Container high throttle rate (instance {{ $labels.instance }}) description: Container is being throttled\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
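The cAdvisor rules above assume cAdvisor itself is running and scraped on its metrics port (8080 by default). A minimal Docker Compose sketch; the image name/tag and mounted paths follow cAdvisor's usual run instructions but should be treated as assumptions to adapt:

```yaml
# docker-compose.yml (sketch)
version: "3"
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest   # pin a specific version in practice
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```

Prometheus then scrapes the container on port 8080 like any other static target.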
 
- 
- 
1.4. Blackbox: prometheus/blackbox_exporter (8 rules; a probe scrape configuration sketch follows this section)
1.4.1. Blackbox probe failedProbe failed[copy]- alert: BlackboxProbeFailed expr: probe_success == 0 for: 5m labels: severity: critical annotations: summary: Blackbox probe failed (instance {{ $labels.instance }}) description: Probe failed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.4.2. Blackbox slow probeBlackbox probe took more than 1s to complete[copy]- alert: BlackboxSlowProbe expr: avg_over_time(probe_duration_seconds[1m]) > 1 for: 5m labels: severity: warning annotations: summary: Blackbox slow probe (instance {{ $labels.instance }}) description: Blackbox probe took more than 1s to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.4.3. Blackbox probe HTTP failureHTTP status code is not 200-399[copy]- alert: BlackboxProbeHttpFailure expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400 for: 5m labels: severity: critical annotations: summary: Blackbox probe HTTP failure (instance {{ $labels.instance }}) description: HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.4.4. Blackbox SSL certificate will expire soonSSL certificate expires in 30 days[copy]- alert: BlackboxSslCertificateWillExpireSoon expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30 for: 5m labels: severity: warning annotations: summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }}) description: SSL certificate expires in 30 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.4.5. Blackbox SSL certificate will expire soonSSL certificate expires in 3 days[copy]- alert: BlackboxSslCertificateWillExpireSoon expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3 for: 5m labels: severity: critical annotations: summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }}) description: SSL certificate expires in 3 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.4.6. Blackbox SSL certificate expiredSSL certificate has expired already[copy]- alert: BlackboxSslCertificateExpired expr: probe_ssl_earliest_cert_expiry - time() <= 0 for: 5m labels: severity: critical annotations: summary: Blackbox SSL certificate expired (instance {{ $labels.instance }}) description: SSL certificate has expired already\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.4.7. Blackbox probe slow HTTPHTTP request took more than 1s[copy]- alert: BlackboxProbeSlowHttp expr: avg_over_time(probe_http_duration_seconds[1m]) > 1 for: 5m labels: severity: warning annotations: summary: Blackbox probe slow HTTP (instance {{ $labels.instance }}) description: HTTP request took more than 1s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.4.8. Blackbox probe slow pingBlackbox ping took more than 1s[copy]- alert: BlackboxProbeSlowPing expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1 for: 5m labels: severity: warning annotations: summary: Blackbox probe slow ping (instance {{ $labels.instance }}) description: Blackbox ping took more than 1s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
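Blackbox probes are scraped indirectly: Prometheus asks the exporter to probe a target via its /probe endpoint and relabels the result so that "instance" reflects the probed URL. A minimal sketch, assuming blackbox_exporter on its default port 9115 and an "http_2xx" module defined in its configuration (host names and targets are illustrative):

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]               # must exist in blackbox.yml
    static_configs:
      - targets:
          - https://example.com        # endpoints to probe
          - https://example.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115   # where blackbox_exporter runs
```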
 
- 
- 
1.5. Windows Server: prometheus-community/windows_exporter (5 rules)
1.5.1. Windows Server collector ErrorCollector {{ $labels.collector }} was not successful[copy]- alert: WindowsServerCollectorError expr: windows_exporter_collector_success == 0 for: 5m labels: severity: critical annotations: summary: Windows Server collector Error (instance {{ $labels.instance }}) description: Collector {{ $labels.collector }} was not successful\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.5.2. Windows Server service StatusWindows Service state is not OK[copy]- alert: WindowsServerServiceStatus expr: windows_service_status{status="ok"} != 1 for: 5m labels: severity: critical annotations: summary: Windows Server service Status (instance {{ $labels.instance }}) description: Windows Service state is not OK\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.5.3. Windows Server CPU UsageCPU Usage is more than 80%[copy]- alert: WindowsServerCpuUsage expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80 for: 5m labels: severity: warning annotations: summary: Windows Server CPU Usage (instance {{ $labels.instance }}) description: CPU Usage is more than 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.5.4. Windows Server memory UsageMemory usage is more than 90%[copy]- alert: WindowsServerMemoryUsage expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90 for: 5m labels: severity: warning annotations: summary: Windows Server memory Usage (instance {{ $labels.instance }}) description: Memory usage is more than 90%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
1.5.5. Windows Server disk Space UsageDisk usage is more than 80%[copy]- alert: WindowsServerDiskSpaceUsage expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024 ) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80 for: 5m labels: severity: critical annotations: summary: Windows Server disk Space Usage (instance {{ $labels.instance }}) description: Disk usage is more than 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
2.1. MySQL: prometheus/mysqld_exporter (8 rules; an Alertmanager inhibition sketch follows this section)
2.1.1. MySQL downMySQL instance is down on {{ $labels.instance }}[copy]- alert: MysqlDown expr: mysql_up == 0 for: 5m labels: severity: critical annotations: summary: MySQL down (instance {{ $labels.instance }}) description: MySQL instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.1.2. MySQL too many connectionsMore than 80% of MySQL connections are in use on {{ $labels.instance }}[copy]- alert: MysqlTooManyConnections expr: avg by (instance) (max_over_time(mysql_global_status_threads_connected[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80 for: 5m labels: severity: warning annotations: summary: MySQL too many connections (instance {{ $labels.instance }}) description: More than 80% of MySQL connections are in use on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.1.3. MySQL high threads runningMore than 60% of MySQL connections are in running state on {{ $labels.instance }}[copy]- alert: MysqlHighThreadsRunning expr: avg by (instance) (max_over_time(mysql_global_status_threads_running[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 60 for: 5m labels: severity: warning annotations: summary: MySQL high threads running (instance {{ $labels.instance }}) description: More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.1.4. MySQL Slave IO thread not runningMySQL Slave IO thread not running on {{ $labels.instance }}[copy]- alert: MysqlSlaveIoThreadNotRunning expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_io_running == 0 for: 5m labels: severity: critical annotations: summary: MySQL Slave IO thread not running (instance {{ $labels.instance }}) description: MySQL Slave IO thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.1.5. MySQL Slave SQL thread not runningMySQL Slave SQL thread not running on {{ $labels.instance }}[copy]- alert: MysqlSlaveSqlThreadNotRunning expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_sql_running == 0 for: 5m labels: severity: critical annotations: summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }}) description: MySQL Slave SQL thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.1.6. MySQL Slave replication lag: MySQL replication lag on {{ $labels.instance }}. - alert: MysqlSlaveReplicationLag expr: mysql_slave_status_master_server_id > 0 and ON (instance) (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) > 300 for: 5m labels: severity: warning annotations: summary: MySQL Slave replication lag (instance {{ $labels.instance }}) description: MySQL replication lag on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.1.7. MySQL slow queries: The MySQL server has new slow queries. - alert: MysqlSlowQueries expr: rate(mysql_global_status_slow_queries[2m]) > 0 for: 5m labels: severity: warning annotations: summary: MySQL slow queries (instance {{ $labels.instance }}) description: The MySQL server has new slow queries.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.1.8. MySQL restartedMySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.[copy]- alert: MysqlRestarted expr: mysql_global_status_uptime < 60 for: 5m labels: severity: warning annotations: summary: MySQL restarted (instance {{ $labels.instance }}) description: MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
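Every rule on this page carries a severity label (warning or critical), which Alertmanager can use to mute redundant noise: for example, once MysqlDown (critical) fires for an instance, the warning-level MySQL alerts for that same instance add little. A minimal inhibition sketch along those lines (purely illustrative; tune the "equal" labels to your label set):

```yaml
# alertmanager.yml (excerpt)
inhibit_rules:
  # If a critical alert is firing for an instance, mute warning-level
  # alerts for the same instance.
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['instance']
```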
 
- 
- 
2.2. PostgreSQL: wrouesnel/postgres_exporter (25 rules)
2.2.1. Postgresql downPostgresql instance is down[copy]- alert: PostgresqlDown expr: pg_up == 0 for: 5m labels: severity: critical annotations: summary: Postgresql down (instance {{ $labels.instance }}) description: Postgresql instance is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.2. Postgresql restartedPostgresql restarted[copy]- alert: PostgresqlRestarted expr: time() - pg_postmaster_start_time_seconds < 60 for: 5m labels: severity: critical annotations: summary: Postgresql restarted (instance {{ $labels.instance }}) description: Postgresql restarted\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.3. Postgresql exporter errorPostgresql exporter is showing errors. A query may be buggy in query.yaml[copy]- alert: PostgresqlExporterError expr: pg_exporter_last_scrape_error > 0 for: 5m labels: severity: warning annotations: summary: Postgresql exporter error (instance {{ $labels.instance }}) description: Postgresql exporter is showing errors. A query may be buggy in query.yaml\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.4. Postgresql replication lagPostgreSQL replication lag is going up (> 10s)[copy]- alert: PostgresqlReplicationLag expr: (pg_replication_lag) > 10 and ON(instance) (pg_replication_is_replica == 1) for: 5m labels: severity: warning annotations: summary: Postgresql replication lag (instance {{ $labels.instance }}) description: PostgreSQL replication lag is going up (> 10s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.5. Postgresql table not vacuumed: Table has not been vacuumed for 24 hours. - alert: PostgresqlTableNotVacuumed expr: time() - pg_stat_user_tables_last_autovacuum > 60 * 60 * 24 for: 5m labels: severity: warning annotations: summary: Postgresql table not vacuumed (instance {{ $labels.instance }}) description: Table has not been vacuumed for 24 hours\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.6. Postgresql table not analyzedTable has not been analyzed for 24 hours[copy]- alert: PostgresqlTableNotAnalyzed expr: time() - pg_stat_user_tables_last_autoanalyze > 60 * 60 * 24 for: 5m labels: severity: warning annotations: summary: Postgresql table not analyzed (instance {{ $labels.instance }}) description: Table has not been analyzed for 24 hours\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.7. Postgresql too many connectionsPostgreSQL instance has too many connections[copy]- alert: PostgresqlTooManyConnections expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > pg_settings_max_connections * 0.9 for: 5m labels: severity: warning annotations: summary: Postgresql too many connections (instance {{ $labels.instance }}) description: PostgreSQL instance has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.8. Postgresql not enough connectionsPostgreSQL instance should have more connections (> 5)[copy]- alert: PostgresqlNotEnoughConnections expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5 for: 5m labels: severity: warning annotations: summary: Postgresql not enough connections (instance {{ $labels.instance }}) description: PostgreSQL instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.9. Postgresql dead locksPostgreSQL has dead-locks[copy]- alert: PostgresqlDeadLocks expr: rate(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Postgresql dead locks (instance {{ $labels.instance }}) description: PostgreSQL has dead-locks\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.10. Postgresql slow queriesPostgreSQL executes slow queries[copy]- alert: PostgresqlSlowQueries expr: pg_slow_queries > 0 for: 5m labels: severity: warning annotations: summary: Postgresql slow queries (instance {{ $labels.instance }}) description: PostgreSQL executes slow queries\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.11. Postgresql high rollback rateRatio of transactions being aborted compared to committed is > 2 %[copy]- alert: PostgresqlHighRollbackRate expr: rate(pg_stat_database_xact_rollback{datname!~"template.*"}[3m]) / rate(pg_stat_database_xact_commit{datname!~"template.*"}[3m]) > 0.02 for: 5m labels: severity: warning annotations: summary: Postgresql high rollback rate (instance {{ $labels.instance }}) description: Ratio of transactions being aborted compared to committed is > 2 %\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.12. Postgresql commit rate lowPostgres seems to be processing very few transactions[copy]- alert: PostgresqlCommitRateLow expr: rate(pg_stat_database_xact_commit[1m]) < 10 for: 5m labels: severity: critical annotations: summary: Postgresql commit rate low (instance {{ $labels.instance }}) description: Postgres seems to be processing very few transactions\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.13. Postgresql low XID consumptionPostgresql seems to be consuming transaction IDs very slowly[copy]- alert: PostgresqlLowXidConsumption expr: rate(pg_txid_current[1m]) < 5 for: 5m labels: severity: warning annotations: summary: Postgresql low XID consumption (instance {{ $labels.instance }}) description: Postgresql seems to be consuming transaction IDs very slowly\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.14. Postgresql low XLOG consumption: Postgres seems to be consuming XLOG very slowly. - alert: PostgresqlLowXlogConsumption expr: rate(pg_xlog_position_bytes[1m]) < 100 for: 5m labels: severity: warning annotations: summary: Postgresql low XLOG consumption (instance {{ $labels.instance }}) description: Postgres seems to be consuming XLOG very slowly\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.15. Postgresql WALE replication stoppedWAL-E replication seems to be stopped[copy]- alert: PostgresqlWaleReplicationStopped expr: rate(pg_xlog_position_bytes[1m]) == 0 for: 5m labels: severity: critical annotations: summary: Postgresql WALE replication stopped (instance {{ $labels.instance }}) description: WAL-E replication seems to be stopped\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.16. Postgresql high rate statement timeoutPostgres transactions showing high rate of statement timeouts[copy]- alert: PostgresqlHighRateStatementTimeout expr: rate(postgresql_errors_total{type="statement_timeout"}[5m]) > 3 for: 5m labels: severity: critical annotations: summary: Postgresql high rate statement timeout (instance {{ $labels.instance }}) description: Postgres transactions showing high rate of statement timeouts\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.17. Postgresql high rate deadlockPostgres detected deadlocks[copy]- alert: PostgresqlHighRateDeadlock expr: rate(postgresql_errors_total{type="deadlock_detected"}[1m]) * 60 > 1 for: 5m labels: severity: critical annotations: summary: Postgresql high rate deadlock (instance {{ $labels.instance }}) description: Postgres detected deadlocks\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.18. Postgresql replication lag bytes: Postgres replication lag (in bytes) is high. - alert: PostgresqlReplicationLagBytes expr: (pg_xlog_position_bytes and pg_replication_is_replica == 0) - GROUP_RIGHT(instance) (pg_xlog_position_bytes and pg_replication_is_replica == 1) > 1e+09 for: 5m labels: severity: critical annotations: summary: Postgresql replication lag bytes (instance {{ $labels.instance }}) description: Postgres replication lag (in bytes) is high\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.19. Postgresql unused replication slotUnused Replication Slots[copy]- alert: PostgresqlUnusedReplicationSlot expr: pg_replication_slots_active == 0 for: 5m labels: severity: warning annotations: summary: Postgresql unused replication slot (instance {{ $labels.instance }}) description: Unused Replication Slots\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.20. Postgresql too many dead tuples: The number of PostgreSQL dead tuples is too large. - alert: PostgresqlTooManyDeadTuples expr: ((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 unless ON(instance) (pg_replication_is_replica == 1) for: 5m labels: severity: warning annotations: summary: Postgresql too many dead tuples (instance {{ $labels.instance }}) description: The number of PostgreSQL dead tuples is too large\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.21. Postgresql split brainSplit Brain, too many primary Postgresql databases in read-write mode[copy]- alert: PostgresqlSplitBrain expr: count(pg_replication_is_replica == 0) != 1 for: 5m labels: severity: critical annotations: summary: Postgresql split brain (instance {{ $labels.instance }}) description: Split Brain, too many primary Postgresql databases in read-write mode\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.22. Postgresql promoted nodePostgresql standby server has been promoted as primary node[copy]- alert: PostgresqlPromotedNode expr: pg_replication_is_replica and changes(pg_replication_is_replica[1m]) > 0 for: 5m labels: severity: warning annotations: summary: Postgresql promoted node (instance {{ $labels.instance }}) description: Postgresql standby server has been promoted as primary node\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.23. Postgresql configuration changedPostgres Database configuration change has occurred[copy]- alert: PostgresqlConfigurationChanged expr: {__name__=~"pg_settings_.*"} != ON(__name__) {__name__=~"pg_settings_([^t]|t[^r]|tr[^a]|tra[^n]|tran[^s]|trans[^a]|transa[^c]|transac[^t]|transact[^i]|transacti[^o]|transactio[^n]|transaction[^_]|transaction_[^r]|transaction_r[^e]|transaction_re[^a]|transaction_rea[^d]|transaction_read[^_]|transaction_read_[^o]|transaction_read_o[^n]|transaction_read_on[^l]|transaction_read_onl[^y]).*"} OFFSET 5m for: 5m labels: severity: warning annotations: summary: Postgresql configuration changed (instance {{ $labels.instance }}) description: Postgres Database configuration change has occurred\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.24. Postgresql SSL compression activeDatabase connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.[copy]- alert: PostgresqlSslCompressionActive expr: sum(pg_stat_ssl_compression) > 0 for: 5m labels: severity: critical annotations: summary: Postgresql SSL compression active (instance {{ $labels.instance }}) description: Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.2.25. Postgresql too many locks acquiredToo many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.[copy]- alert: PostgresqlTooManyLocksAcquired expr: ((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20 for: 5m labels: severity: critical annotations: summary: Postgresql too many locks acquired (instance {{ $labels.instance }}) description: Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
2.3. SQL Server: Ozarklake/prometheus-mssql-exporter (2 rules)
2.3.1. SQL Server down: SQL Server instance is down. - alert: SqlServerDown expr: mssql_up == 0 for: 5m labels: severity: critical annotations: summary: SQL Server down (instance {{ $labels.instance }}) description: SQL Server instance is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.3.2. SQL Server deadlockSQL Server is having some deadlock.[copy]- alert: SqlServerDeadlock expr: rate(mssql_deadlocks[1m]) > 0 for: 5m labels: severity: warning annotations: summary: SQL Server deadlock (instance {{ $labels.instance }}) description: SQL Server is having some deadlock.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
2.4. PGBouncer: spreaker/prometheus-pgbouncer-exporter (3 rules)
2.4.1. PGBouncer active connections: PGBouncer pools are filling up. - alert: PgbouncerActiveConnections expr: pgbouncer_pools_server_active_connections > 200 for: 5m labels: severity: warning annotations: summary: PGBouncer active connections (instance {{ $labels.instance }}) description: PGBouncer pools are filling up\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.4.2. PGBouncer errors: PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console. - alert: PgbouncerErrors expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[5m]) > 10 for: 5m labels: severity: warning annotations: summary: PGBouncer errors (instance {{ $labels.instance }}) description: PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.4.3. PGBouncer max connectionsThe number of PGBouncer client connections has reached max_client_conn.[copy]- alert: PgbouncerMaxConnections expr: rate(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: PGBouncer max connections (instance {{ $labels.instance }}) description: The number of PGBouncer client connections has reached max_client_conn.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
2.5. Redis: oliver006/redis_exporter (11 rules)
2.5.1. Redis downRedis instance is down[copy]- alert: RedisDown expr: redis_up == 0 for: 5m labels: severity: critical annotations: summary: Redis down (instance {{ $labels.instance }}) description: Redis instance is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.2. Redis missing masterRedis cluster has no node marked as master.[copy]- alert: RedisMissingMaster expr: count(redis_instance_info{role="master"}) == 0 for: 5m labels: severity: critical annotations: summary: Redis missing master (instance {{ $labels.instance }}) description: Redis cluster has no node marked as master.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.3. Redis too many mastersRedis cluster has too many nodes marked as master.[copy]- alert: RedisTooManyMasters expr: count(redis_instance_info{role="master"}) > 1 for: 5m labels: severity: critical annotations: summary: Redis too many masters (instance {{ $labels.instance }}) description: Redis cluster has too many nodes marked as master.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.4. Redis disconnected slavesRedis not replicating for all slaves. Consider reviewing the redis replication status.[copy]- alert: RedisDisconnectedSlaves expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1 for: 5m labels: severity: critical annotations: summary: Redis disconnected slaves (instance {{ $labels.instance }}) description: Redis not replicating for all slaves. Consider reviewing the redis replication status.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.5. Redis replication brokenRedis instance lost a slave[copy]- alert: RedisReplicationBroken expr: delta(redis_connected_slaves[1m]) < 0 for: 5m labels: severity: critical annotations: summary: Redis replication broken (instance {{ $labels.instance }}) description: Redis instance lost a slave\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.6. Redis cluster flappingChanges have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).[copy]- alert: RedisClusterFlapping expr: changes(redis_connected_slaves[5m]) > 2 for: 5m labels: severity: critical annotations: summary: Redis cluster flapping (instance {{ $labels.instance }}) description: Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.7. Redis missing backup: Redis has not been backed up for 24 hours. - alert: RedisMissingBackup expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24 for: 5m labels: severity: critical annotations: summary: Redis missing backup (instance {{ $labels.instance }}) description: Redis has not been backed up for 24 hours\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.8. Redis out of memoryRedis is running out of memory (> 90%)[copy]- alert: RedisOutOfMemory expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90 for: 5m labels: severity: warning annotations: summary: Redis out of memory (instance {{ $labels.instance }}) description: Redis is running out of memory (> 90%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.9. Redis too many connectionsRedis instance has too many connections[copy]- alert: RedisTooManyConnections expr: redis_connected_clients > 100 for: 5m labels: severity: warning annotations: summary: Redis too many connections (instance {{ $labels.instance }}) description: Redis instance has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.10. Redis not enough connectionsRedis instance should have more connections (> 5)[copy]- alert: RedisNotEnoughConnections expr: redis_connected_clients < 5 for: 5m labels: severity: warning annotations: summary: Redis not enough connections (instance {{ $labels.instance }}) description: Redis instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.5.11. Redis rejected connections: Some connections to Redis have been rejected. - alert: RedisRejectedConnections expr: increase(redis_rejected_connections_total[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Redis rejected connections (instance {{ $labels.instance }}) description: Some connections to Redis have been rejected\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
2.6. MongoDB: percona/mongodb_exporter // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
- 
2.6. MongoDB: dcu/mongodb_exporter (10 rules)
2.6.1. MongoDB replication lagMongodb replication lag is more than 10s[copy]- alert: MongodbReplicationLag expr: avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10 for: 5m labels: severity: critical annotations: summary: MongoDB replication lag (instance {{ $labels.instance }}) description: Mongodb replication lag is more than 10s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.6.2. MongoDB replication Status 3: MongoDB replica set member is either performing startup self-checks, or transitioning from completing a rollback or resync. - alert: MongodbReplicationStatus3 expr: mongodb_replset_member_state == 3 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 3 (instance {{ $labels.instance }}) description: MongoDB replica set member is either performing startup self-checks, or transitioning from completing a rollback or resync\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.6.3. MongoDB replication Status 6: MongoDB replica set member, as seen from another member of the set, is not yet known. - alert: MongodbReplicationStatus6 expr: mongodb_replset_member_state == 6 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 6 (instance {{ $labels.instance }}) description: MongoDB replica set member, as seen from another member of the set, is not yet known\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.6.4. MongoDB replication Status 8: MongoDB replica set member, as seen from another member of the set, is unreachable. - alert: MongodbReplicationStatus8 expr: mongodb_replset_member_state == 8 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 8 (instance {{ $labels.instance }}) description: MongoDB replica set member, as seen from another member of the set, is unreachable\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.6.5. MongoDB replication Status 9MongoDB Replication set member is actively performing a rollback. Data is not available for reads[copy]- alert: MongodbReplicationStatus9 expr: mongodb_replset_member_state == 9 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 9 (instance {{ $labels.instance }}) description: MongoDB Replication set member is actively performing a rollback. Data is not available for reads\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.6.6. MongoDB replication Status 10MongoDB Replication set member was once in a replica set but was subsequently removed[copy]- alert: MongodbReplicationStatus10 expr: mongodb_replset_member_state == 10 for: 5m labels: severity: critical annotations: summary: MongoDB replication Status 10 (instance {{ $labels.instance }}) description: MongoDB Replication set member was once in a replica set but was subsequently removed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.6.7. MongoDB number cursors openToo many cursors opened by MongoDB for clients (> 10k)[copy]- alert: MongodbNumberCursorsOpen expr: mongodb_metrics_cursor_open{state="total_open"} > 10000 for: 5m labels: severity: warning annotations: summary: MongoDB number cursors open (instance {{ $labels.instance }}) description: Too many cursors opened by MongoDB for clients (> 10k)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.6.8. MongoDB cursors timeoutsToo many cursors are timing out[copy]- alert: MongodbCursorsTimeouts expr: increase(mongodb_metrics_cursor_timed_out_total[10m]) > 100 for: 5m labels: severity: warning annotations: summary: MongoDB cursors timeouts (instance {{ $labels.instance }}) description: Too many cursors are timing out\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.6.9. MongoDB too many connectionsToo many connections (> 500)[copy]- alert: MongodbTooManyConnections expr: mongodb_connections{state="current"} > 500 for: 5m labels: severity: warning annotations: summary: MongoDB too many connections (instance {{ $labels.instance }}) description: Too many connections (> 500)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.6.10. MongoDB virtual memory usageHigh memory usage: virtual memory is more than 3 times the mapped memory[copy]- alert: MongodbVirtualMemoryUsage expr: (sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3 for: 5m labels: severity: warning annotations: summary: MongoDB virtual memory usage (instance {{ $labels.instance }}) description: High memory usage: virtual memory is more than 3 times the mapped memory\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
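These rules only produce data if Prometheus is already scraping the MongoDB exporter. A minimal scrape_config sketch, assuming the exporter listens on its commonly used port 9216 (job name, host and port here are placeholders to adapt to your deployment):

scrape_configs:
  - job_name: mongodb                       # placeholder job name
    static_configs:
      - targets: ['mongodb-host:9216']      # assumed exporter address, replace with your own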
 
- 
- 
2. 7. RabbitMQ (official exporter) : rabbitmq/rabbitmq-prometheus (9 rules)[copy all]- 
2.7.1. Rabbitmq node downLess than 3 nodes running in RabbitMQ cluster[copy]- alert: RabbitmqNodeDown expr: sum(rabbitmq_build_info) < 3 for: 5m labels: severity: critical annotations: summary: Rabbitmq node down (instance {{ $labels.instance }}) description: Less than 3 nodes running in RabbitMQ cluster\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.2. Rabbitmq node not distributedDistribution link state is not 'up'[copy]- alert: RabbitmqNodeNotDistributed expr: erlang_vm_dist_node_state < 3 for: 5m labels: severity: critical annotations: summary: Rabbitmq node not distributed (instance {{ $labels.instance }}) description: Distribution link state is not 'up'\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.3. Rabbitmq instances different versionsRunning different versions of RabbitMQ in the same cluster can lead to failures.[copy]- alert: RabbitmqInstancesDifferentVersions expr: count(count(rabbitmq_build_info) by (rabbitmq_version)) > 1 for: 5m labels: severity: warning annotations: summary: Rabbitmq instances different versions (instance {{ $labels.instance }}) description: Running different versions of RabbitMQ in the same cluster can lead to failures.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.4. Rabbitmq memory highA node uses more than 90% of allocated RAM[copy]- alert: RabbitmqMemoryHigh expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90 for: 5m labels: severity: warning annotations: summary: Rabbitmq memory high (instance {{ $labels.instance }}) description: A node uses more than 90% of allocated RAM\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.5. Rabbitmq file descriptors usageA node uses more than 90% of file descriptors[copy]- alert: RabbitmqFileDescriptorsUsage expr: rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 90 for: 5m labels: severity: warning annotations: summary: Rabbitmq file descriptors usage (instance {{ $labels.instance }}) description: A node uses more than 90% of file descriptors\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.6. Rabbitmq too much unackToo many unacknowledged messages[copy]- alert: RabbitmqTooMuchUnack expr: sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000 for: 5m labels: severity: warning annotations: summary: Rabbitmq too much unack (instance {{ $labels.instance }}) description: Too many unacknowledged messages\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.7. Rabbitmq too much connectionsThe total number of connections on a node is too high (> 1000)[copy]- alert: RabbitmqTooMuchConnections expr: rabbitmq_connections > 1000 for: 5m labels: severity: warning annotations: summary: Rabbitmq too much connections (instance {{ $labels.instance }}) description: The total number of connections on a node is too high (> 1000)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.8. Rabbitmq no queue consumerA queue has less than 1 consumer[copy]- alert: RabbitmqNoQueueConsumer expr: rabbitmq_queue_consumers < 1 for: 5m labels: severity: warning annotations: summary: Rabbitmq no queue consumer (instance {{ $labels.instance }}) description: A queue has less than 1 consumer\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.9. Rabbitmq unroutable messagesA queue has unroutable messages[copy]- alert: RabbitmqUnroutableMessages expr: increase(rabbitmq_channel_messages_unroutable_returned_total[5m]) > 0 or increase(rabbitmq_channel_messages_unroutable_dropped_total[5m]) > 0 for: 5m labels: severity: warning annotations: summary: Rabbitmq unroutable messages (instance {{ $labels.instance }}) description: A queue has unroutable messages\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
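The metric names used above (rabbitmq_build_info, rabbitmq_queue_*, erlang_vm_*) come from RabbitMQ's built-in rabbitmq_prometheus plugin, so RabbitMQ itself is the scrape target. A minimal scrape_config sketch, assuming the plugin's usual default port 15692 (verify against your deployment):

scrape_configs:
  - job_name: rabbitmq                      # placeholder job name
    static_configs:
      - targets: ['rabbitmq-host:15692']    # assumed plugin port; the plugin is enabled with rabbitmq-plugins enable rabbitmq_prometheus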
 
- 
- 
2. 7. RabbitMQ : kbudde/rabbitmq-exporter (11 rules)[copy all]- 
2.7.1. Rabbitmq downRabbitMQ node down[copy]- alert: RabbitmqDown expr: rabbitmq_up == 0 for: 5m labels: severity: critical annotations: summary: Rabbitmq down (instance {{ $labels.instance }}) description: RabbitMQ node down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.2. Rabbitmq cluster downLess than 3 nodes running in RabbitMQ cluster[copy]- alert: RabbitmqClusterDown expr: sum(rabbitmq_running) < 3 for: 5m labels: severity: critical annotations: summary: Rabbitmq cluster down (instance {{ $labels.instance }}) description: Less than 3 nodes running in RabbitMQ cluster\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.3. Rabbitmq cluster partitionCluster partition[copy]- alert: RabbitmqClusterPartition expr: rabbitmq_partitions > 0 for: 5m labels: severity: critical annotations: summary: Rabbitmq cluster partition (instance {{ $labels.instance }}) description: Cluster partition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.4. Rabbitmq out of memoryMemory available for RabbitMQ is low (< 10%)[copy]- alert: RabbitmqOutOfMemory expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90 for: 5m labels: severity: warning annotations: summary: Rabbitmq out of memory (instance {{ $labels.instance }}) description: Memory available for RabbitMQ is low (< 10%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.5. Rabbitmq too many connectionsRabbitMQ instance has too many connections (> 1000)[copy]- alert: RabbitmqTooManyConnections expr: rabbitmq_connectionsTotal > 1000 for: 5m labels: severity: warning annotations: summary: Rabbitmq too many connections (instance {{ $labels.instance }}) description: RabbitMQ instance has too many connections (> 1000)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.6. Rabbitmq dead letter queue filling upDead letter queue is filling up (> 10 msgs)[copy]- alert: RabbitmqDeadLetterQueueFillingUp expr: rabbitmq_queue_messages{queue="my-dead-letter-queue"} > 10 for: 5m labels: severity: critical annotations: summary: Rabbitmq dead letter queue filling up (instance {{ $labels.instance }}) description: Dead letter queue is filling up (> 10 msgs)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.7. Rabbitmq too many messages in queueQueue is filling up (> 1000 msgs)[copy]- alert: RabbitmqTooManyMessagesInQueue expr: rabbitmq_queue_messages_ready{queue="my-queue"} > 1000 for: 5m labels: severity: warning annotations: summary: Rabbitmq too many messages in queue (instance {{ $labels.instance }}) description: Queue is filling up (> 1000 msgs)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.8. Rabbitmq slow queue consumingQueue messages are consumed slowly (> 60s)[copy]- alert: RabbitmqSlowQueueConsuming expr: time() - rabbitmq_queue_head_message_timestamp{queue="my-queue"} > 60 for: 5m labels: severity: warning annotations: summary: Rabbitmq slow queue consuming (instance {{ $labels.instance }}) description: Queue messages are consumed slowly (> 60s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.9. Rabbitmq no consumerQueue has no consumer[copy]- alert: RabbitmqNoConsumer expr: rabbitmq_queue_consumers == 0 for: 5m labels: severity: critical annotations: summary: Rabbitmq no consumer (instance {{ $labels.instance }}) description: Queue has no consumer\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.10. Rabbitmq too many consumersQueue should have only 1 consumer[copy]- alert: RabbitmqTooManyConsumers expr: rabbitmq_queue_consumers > 1 for: 5m labels: severity: critical annotations: summary: Rabbitmq too many consumers (instance {{ $labels.instance }}) description: Queue should have only 1 consumer\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.7.11. Rabbitmq unactive exchangeExchange receives fewer than 5 msgs per second[copy]- alert: RabbitmqUnactiveExchange expr: rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5 for: 5m labels: severity: warning annotations: summary: Rabbitmq unactive exchange (instance {{ $labels.instance }}) description: Exchange receives fewer than 5 msgs per second\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
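Every rule in this collection attaches a severity label of warning or critical, which Alertmanager can use to route notifications to different receivers. A minimal routing sketch (receiver names are placeholders; this uses the older match syntax, newer Alertmanager versions also accept matchers):

route:
  receiver: default                  # placeholder receiver names throughout
  routes:
    - match:
        severity: critical
      receiver: oncall-pager
    - match:
        severity: warning
      receiver: team-chat
receivers:
  - name: default
  - name: oncall-pager
  - name: team-chat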
 
- 
- 
2. 8. Elasticsearch : justwatchcom/elasticsearch_exporter (13 rules)[copy all]- 
2.8.1. Elasticsearch Heap Usage Too HighThe heap usage is over 90% for 5m[copy]- alert: ElasticsearchHeapUsageTooHigh expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90 for: 5m labels: severity: critical annotations: summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }}) description: The heap usage is over 90% for 5m\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.2. Elasticsearch Heap Usage warningThe heap usage is over 80% for 5m[copy]- alert: ElasticsearchHeapUsageWarning expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80 for: 5m labels: severity: warning annotations: summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }}) description: The heap usage is over 80% for 5m\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.3. Elasticsearch disk space lowThe disk usage is over 80%[copy]- alert: ElasticsearchDiskSpaceLow expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20 for: 5m labels: severity: warning annotations: summary: Elasticsearch disk space low (instance {{ $labels.instance }}) description: The disk usage is over 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.4. Elasticsearch disk out of spaceThe disk usage is over 90%[copy]- alert: ElasticsearchDiskOutOfSpace expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10 for: 5m labels: severity: critical annotations: summary: Elasticsearch disk out of space (instance {{ $labels.instance }}) description: The disk usage is over 90%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.5. Elasticsearch Cluster RedElastic Cluster Red status[copy]- alert: ElasticsearchClusterRed expr: elasticsearch_cluster_health_status{color="red"} == 1 for: 5m labels: severity: critical annotations: summary: Elasticsearch Cluster Red (instance {{ $labels.instance }}) description: Elastic Cluster Red status\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.6. Elasticsearch Cluster YellowElastic Cluster Yellow status[copy]- alert: ElasticsearchClusterYellow expr: elasticsearch_cluster_health_status{color="yellow"} == 1 for: 5m labels: severity: warning annotations: summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }}) description: Elastic Cluster Yellow status\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.7. Elasticsearch Healthy NodesNumber of healthy nodes is less than number_of_nodes[copy]- alert: ElasticsearchHealthyNodes expr: elasticsearch_cluster_health_number_of_nodes < number_of_nodes for: 5m labels: severity: critical annotations: summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }}) description: Number of healthy nodes is less than number_of_nodes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.8. Elasticsearch Healthy Data NodesNumber of healthy data nodes is less than number_of_data_nodes[copy]- alert: ElasticsearchHealthyDataNodes expr: elasticsearch_cluster_health_number_of_data_nodes < number_of_data_nodes for: 5m labels: severity: critical annotations: summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }}) description: Number of healthy data nodes is less than number_of_data_nodes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.9. Elasticsearch relocation shardsNumber of relocation shards for 20 min[copy]- alert: ElasticsearchRelocationShards expr: elasticsearch_cluster_health_relocating_shards > 0 for: 5m labels: severity: critical annotations: summary: Elasticsearch relocation shards (instance {{ $labels.instance }}) description: Number of relocation shards for 20 min\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.10. Elasticsearch initializing shardsNumber of initializing shards for 10 min[copy]- alert: ElasticsearchInitializingShards expr: elasticsearch_cluster_health_initializing_shards > 0 for: 5m labels: severity: warning annotations: summary: Elasticsearch initializing shards (instance {{ $labels.instance }}) description: Number of initializing shards for 10 min\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.11. Elasticsearch unassigned shardsNumber of unassigned shards for 2 min[copy]- alert: ElasticsearchUnassignedShards expr: elasticsearch_cluster_health_unassigned_shards > 0 for: 5m labels: severity: critical annotations: summary: Elasticsearch unassigned shards (instance {{ $labels.instance }}) description: Number of unassigned shards for 2 min\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.12. Elasticsearch pending tasksNumber of pending tasks for 10 min. Cluster works slowly.[copy]- alert: ElasticsearchPendingTasks expr: elasticsearch_cluster_health_number_of_pending_tasks > 0 for: 5m labels: severity: warning annotations: summary: Elasticsearch pending tasks (instance {{ $labels.instance }}) description: Number of pending tasks for 10 min. Cluster works slowly.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.8.13. Elasticsearch no new documentsNo new documents for 10 min![copy]- alert: ElasticsearchNoNewDocuments expr: rate(elasticsearch_indices_docs{es_data_node="true"}[10m]) < 1 for: 5m labels: severity: warning annotations: summary: Elasticsearch no new documents (instance {{ $labels.instance }}) description: No new documents for 10 min!\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
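Note that 2.8.7 and 2.8.8 compare against number_of_nodes and number_of_data_nodes, which are placeholders rather than real series, so those two rules will never fire until the placeholder is replaced with the expected size of your cluster. A sketch assuming a three-node cluster (adjust the constant, annotations omitted):

- alert: ElasticsearchHealthyNodes
  expr: elasticsearch_cluster_health_number_of_nodes < 3   # 3 = expected number of nodes in this example
  for: 5m
  labels:
    severity: critical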
 
- 
- 
2. 9. Cassandra : instaclustr/cassandra-exporter// @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
- 
2. 9. Cassandra : criteo/cassandra_exporter (18 rules)[copy all]- 
2.9.1. Cassandra hints countCassandra hints count has changed on {{ $labels.instance }}; some nodes may be down[copy]- alert: CassandraHintsCount expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3 for: 5m labels: severity: critical annotations: summary: Cassandra hints count (instance {{ $labels.instance }}) description: Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.2. Cassandra compaction task pendingMany Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.[copy]- alert: CassandraCompactionTaskPending expr: avg_over_time(cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"}[30m]) > 100 for: 5m labels: severity: warning annotations: summary: Cassandra compaction task pending (instance {{ $labels.instance }}) description: Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.3. Cassandra viewwrite latencyHigh viewwrite latency on {{ $labels.instance }} cassandra node[copy]- alert: CassandraViewwriteLatency expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile",service="cas"} > 100000 for: 5m labels: severity: warning annotations: summary: Cassandra viewwrite latency (instance {{ $labels.instance }}) description: High viewwrite latency on {{ $labels.instance }} cassandra node\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.4. Cassandra cool hackerIncrease of Cassandra authentication failures[copy]- alert: CassandraCoolHacker expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5 for: 5m labels: severity: warning annotations: summary: Cassandra cool hacker (instance {{ $labels.instance }}) description: Increase of Cassandra authentication failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.5. Cassandra node downCassandra node down[copy]- alert: CassandraNodeDown expr: sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra node down (instance {{ $labels.instance }}) description: Cassandra node down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.6. Cassandra commitlog pending tasksUnexpected number of Cassandra commitlog pending tasks[copy]- alert: CassandraCommitlogPendingTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15 for: 5m labels: severity: warning annotations: summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }}) description: Unexpected number of Cassandra commitlog pending tasks\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.7. Cassandra compaction executor blocked tasksSome Cassandra compaction executor tasks are blocked[copy]- alert: CassandraCompactionExecutorBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0 for: 5m labels: severity: warning annotations: summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }}) description: Some Cassandra compaction executor tasks are blocked\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.8. Cassandra flush writer blocked tasksSome Cassandra flush writer tasks are blocked[copy]- alert: CassandraFlushWriterBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0 for: 5m labels: severity: warning annotations: summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }}) description: Some Cassandra flush writer tasks are blocked\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.9. Cassandra repair pending tasksSome Cassandra repair tasks are pending[copy]- alert: CassandraRepairPendingTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2 for: 5m labels: severity: warning annotations: summary: Cassandra repair pending tasks (instance {{ $labels.instance }}) description: Some Cassandra repair tasks are pending\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.10. Cassandra repair blocked tasksSome Cassandra repair tasks are blocked[copy]- alert: CassandraRepairBlockedTasks expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count"} > 0 for: 5m labels: severity: warning annotations: summary: Cassandra repair blocked tasks (instance {{ $labels.instance }}) description: Some Cassandra repair tasks are blocked\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.11. Cassandra connection timeouts totalSome connections between nodes are ending in timeout[copy]- alert: CassandraConnectionTimeoutsTotal expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5 for: 5m labels: severity: critical annotations: summary: Cassandra connection timeouts total (instance {{ $labels.instance }}) description: Some connections between nodes are ending in timeout\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.12. Cassandra storage exceptionsSomething is going wrong with Cassandra storage[copy]- alert: CassandraStorageExceptions expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1 for: 5m labels: severity: critical annotations: summary: Cassandra storage exceptions (instance {{ $labels.instance }}) description: Something is going wrong with Cassandra storage\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.13. Cassandra tombstone dumpToo many tombstones scanned in queries[copy]- alert: CassandraTombstoneDump expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000 for: 5m labels: severity: critical annotations: summary: Cassandra tombstone dump (instance {{ $labels.instance }}) description: Too many tombstones scanned in queries\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.14. Cassandra client request unavailable writeWrite failures have occurred because too many nodes are unavailable[copy]- alert: CassandraClientRequestUnvailableWrite expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra client request unavailable write (instance {{ $labels.instance }}) description: Write failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.15. Cassandra client request unavailable readRead failures have occurred because too many nodes are unavailable[copy]- alert: CassandraClientRequestUnvailableRead expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra client request unavailable read (instance {{ $labels.instance }}) description: Read failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.16. Cassandra client request write failureA lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.[copy]- alert: CassandraClientRequestWriteFailure expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra client request write failure (instance {{ $labels.instance }}) description: A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.17. Cassandra client request read failureA lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.[copy]- alert: CassandraClientRequestReadFailure expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"}[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Cassandra client request read failure (instance {{ $labels.instance }}) description: A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.9.18. Cassandra cache hit rate key cacheKey cache hit rate is below 85%[copy]- alert: CassandraCacheHitRateKeyCache expr: cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85 for: 5m labels: severity: critical annotations: summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }}) description: Key cache hit rate is below 85%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
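All of the Cassandra rules select the same cassandra_stats metric through a long name label, which is tedious to repeat in dashboards and ad-hoc queries. A recording rule can precompute such a selector under a shorter series name; a sketch for the tombstone percentile used in 2.9.13 (the recorded name and group are illustrative):

groups:
  - name: cassandra-recording            # illustrative group name
    rules:
      - record: cassandra:tombstones_scanned:p99
        expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"}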
 
- 
- 
2. 10. Zookeeper : cloudflare/kafka_zookeeper_exporter// @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
- 
2. 11. Kafka : danielqsj/kafka_exporter (2 rules)[copy all]- 
2.11.1. Kafka topics replicasKafka topic has fewer than 3 in-sync replicas[copy]- alert: KafkaTopicsReplicas expr: sum(kafka_topic_partition_in_sync_replica) by (topic) < 3 for: 5m labels: severity: critical annotations: summary: Kafka topics replicas (instance {{ $labels.instance }}) description: Kafka topic has fewer than 3 in-sync replicas\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
2.11.2. Kafka consumers groupKafka consumer group lag is too high (> 50)[copy]- alert: KafkaConsumersGroup expr: sum(kafka_consumergroup_lag) by (consumergroup) > 50 for: 5m labels: severity: critical annotations: summary: Kafka consumers group (instance {{ $labels.instance }}) description: Kafka consumer group lag is too high (> 50)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
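The lag rule in 2.11.2 fires as soon as any consumer group accumulates more than 50 messages of total lag, which is very tight for high-throughput topics. One possible adjustment, sketched here with a hypothetical group name and arbitrary thresholds, is to keep the strict limit only for latency-sensitive groups and raise it everywhere else (annotations omitted):

- alert: KafkaConsumersGroupCriticalLag                                                   # hypothetical variant of 2.11.2
  expr: sum(kafka_consumergroup_lag{consumergroup="payments"}) by (consumergroup) > 50    # "payments" is a placeholder group name
  for: 5m
  labels:
    severity: critical
- alert: KafkaConsumersGroupLag
  expr: sum(kafka_consumergroup_lag{consumergroup!="payments"}) by (consumergroup) > 5000 # looser, arbitrary threshold
  for: 5m
  labels:
    severity: warning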
 
- 
- 
3. 1. Nginx : nginx-lua-prometheus (3 rules)[copy all]- 
3.1.1. Nginx high HTTP 4xx error rateToo many HTTP requests with status 4xx (> 5%)[copy]- alert: NginxHighHttp4xxErrorRate expr: sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }}) description: Too many HTTP requests with status 4xx (> 5%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.1.2. Nginx high HTTP 5xx error rateToo many HTTP requests with status 5xx (> 5%)[copy]- alert: NginxHighHttp5xxErrorRate expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }}) description: Too many HTTP requests with status 5xx (> 5%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.1.3. Nginx latency highNginx p99 latency is higher than 10 seconds[copy]- alert: NginxLatencyHigh expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node)) > 10 for: 5m labels: severity: warning annotations: summary: Nginx latency high (instance {{ $labels.instance }}) description: Nginx p99 latency is higher than 10 seconds\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
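The latency rule in 3.1.3 computes the p99 over a 30-minute window, which smooths out brief spikes but also delays detection. A variant with a shorter window reacts faster at the cost of more noise; a sketch with only the range changed (annotations omitted):

- alert: NginxLatencyHigh
  expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[5m])) by (host, node)) > 10   # 5m window instead of 30m
  for: 5m
  labels:
    severity: warning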
 
- 
- 
3. 2. Apache : Lusitaniae/apache_exporter (3 rules)[copy all]- 
3.2.1. Apache downApache down[copy]- alert: ApacheDown expr: apache_up == 0 for: 5m labels: severity: critical annotations: summary: Apache down (instance {{ $labels.instance }}) description: Apache down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.2.2. Apache workers loadApache workers in busy state are approaching the max workers count (80% of workers busy) on {{ $labels.instance }}[copy]- alert: ApacheWorkersLoad expr: (sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80 for: 5m labels: severity: critical annotations: summary: Apache workers load (instance {{ $labels.instance }}) description: Apache workers in busy state are approaching the max workers count (80% of workers busy) on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.2.3. Apache restartApache has just been restarted, less than one minute ago.[copy]- alert: ApacheRestart expr: apache_uptime_seconds_total / 60 < 1 for: 5m labels: severity: warning annotations: summary: Apache restart (instance {{ $labels.instance }}) description: Apache has just been restarted, less than one minute ago.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
3. 3. HaProxy : Embedded exporter (HAProxy >= v2)// @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
- 
3. 3. HaProxy : prometheus/haproxy_exporter (HAProxy < v2) (16 rules)[copy all]- 
3.3.1. HAProxy downHAProxy down[copy]- alert: HaproxyDown expr: haproxy_up == 0 for: 5m labels: severity: critical annotations: summary: HAProxy down (instance {{ $labels.instance }}) description: HAProxy down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.2. HAProxy high HTTP 4xx error rate backendToo many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}[copy]- alert: HaproxyHighHttp4xxErrorRateBackend expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }}) description: Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.3. HAProxy high HTTP 5xx error rate backendToo many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}[copy]- alert: HaproxyHighHttp5xxErrorRateBackend expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }}) description: Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.4. HAProxy high HTTP 4xx error rate serverToo many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}[copy]- alert: HaproxyHighHttp4xxErrorRateServer expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }}) description: Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.5. HAProxy high HTTP 5xx error rate serverToo many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}[copy]- alert: HaproxyHighHttp5xxErrorRateServer expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }}) description: Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.6. HAProxy server response errorsToo many response errors to {{ $labels.server }} server (> 5%).[copy]- alert: HaproxyServerResponseErrors expr: sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 for: 5m labels: severity: critical annotations: summary: HAProxy server response errors (instance {{ $labels.instance }}) description: Too many response errors to {{ $labels.server }} server (> 5%).\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.7. HAProxy backend connection errorsToo many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.[copy]- alert: HaproxyBackendConnectionErrors expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100 for: 5m labels: severity: critical annotations: summary: HAProxy backend connection errors (instance {{ $labels.instance }}) description: Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.8. HAProxy server connection errorsToo many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.[copy]- alert: HaproxyServerConnectionErrors expr: sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100 for: 5m labels: severity: critical annotations: summary: HAProxy server connection errors (instance {{ $labels.instance }}) description: Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.9. HAProxy backend max active sessionHAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).[copy]- alert: HaproxyBackendMaxActiveSession expr: avg_over_time((sum by (backend) (haproxy_server_max_sessions) / sum by (backend) (haproxy_server_limit_sessions))[2m:]) * 100 > 80 for: 5m labels: severity: warning annotations: summary: HAProxy backend max active session (instance {{ $labels.instance }}) description: HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.10. HAProxy pending requestsSome HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend[copy]- alert: HaproxyPendingRequests expr: sum by (backend) (haproxy_backend_current_queue) > 0 for: 5m labels: severity: warning annotations: summary: HAProxy pending requests (instance {{ $labels.instance }}) description: Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.11. HAProxy HTTP slowing downAverage request time is increasing[copy]- alert: HaproxyHttpSlowingDown expr: avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 2 for: 5m labels: severity: warning annotations: summary: HAProxy HTTP slowing down (instance {{ $labels.instance }}) description: Average request time is increasing\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.12. HAProxy retry highHigh rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend[copy]- alert: HaproxyRetryHigh expr: sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10 for: 5m labels: severity: warning annotations: summary: HAProxy retry high (instance {{ $labels.instance }}) description: High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.13. HAProxy backend downHAProxy backend is down[copy]- alert: HaproxyBackendDown expr: haproxy_backend_up == 0 for: 5m labels: severity: critical annotations: summary: HAProxy backend down (instance {{ $labels.instance }}) description: HAProxy backend is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.14. HAProxy server downHAProxy server is down[copy]- alert: HaproxyServerDown expr: haproxy_server_up == 0 for: 5m labels: severity: critical annotations: summary: HAProxy server down (instance {{ $labels.instance }}) description: HAProxy server is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.15. HAProxy frontend security blocked requestsHAProxy is blocking requests for security reasons[copy]- alert: HaproxyFrontendSecurityBlockedRequests expr: sum by (frontend) (rate(haproxy_frontend_requests_denied_total[1m])) > 10 for: 5m labels: severity: warning annotations: summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }}) description: HAProxy is blocking requests for security reasons\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.3.16. HAProxy server healthcheck failureSome server healthchecks are failing on {{ $labels.server }}[copy]- alert: HaproxyServerHealthcheckFailure expr: increase(haproxy_server_check_failures_total[1m]) > 0 for: 5m labels: severity: warning annotations: summary: HAProxy server healthcheck failure (instance {{ $labels.instance }}) description: Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
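Several descriptions in this group interpolate {{ $labels.fqdn }}, but an aggregation such as sum by (backend) keeps only the backend label, so fqdn would render empty in the notification. If your scrape relabeling actually attaches an fqdn label, including it in the grouping preserves it; a sketch of the 4xx backend rule with the extra label (annotations omitted):

- alert: HaproxyHighHttp4xxErrorRateBackend
  expr: sum by (backend, fqdn) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend, fqdn) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
  for: 5m
  labels:
    severity: critical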
 
- 
- 
3. 4. Traefik : Embedded exporter (3 rules)[copy all]- 
3.4.1. Traefik backend downAll Traefik backends are down[copy]- alert: TraefikBackendDown expr: count(traefik_backend_server_up) by (backend) == 0 for: 5m labels: severity: critical annotations: summary: Traefik backend down (instance {{ $labels.instance }}) description: All Traefik backends are down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.4.2. Traefik high HTTP 4xx error rate backendTraefik backend 4xx error rate is above 5%[copy]- alert: TraefikHighHttp4xxErrorRateBackend expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Traefik high HTTP 4xx error rate backend (instance {{ $labels.instance }}) description: Traefik backend 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.4.3. Traefik high HTTP 5xx error rate backendTraefik backend 5xx error rate is above 5%[copy]- alert: TraefikHighHttp5xxErrorRateBackend expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Traefik high HTTP 5xx error rate backend (instance {{ $labels.instance }}) description: Traefik backend 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
3. 4. Traefik : Embedded exporter v2 (3 rules)[copy all]- 
3.4.1. Traefik service downAll Traefik services are down[copy]- alert: TraefikServiceDown expr: count(traefik_service_server_up) by (service) == 0 for: 5m labels: severity: critical annotations: summary: Traefik service down (instance {{ $labels.instance }}) description: All Traefik services are down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.4.2. Traefik high HTTP 4xx error rate serviceTraefik service 4xx error rate is above 5%[copy]- alert: TraefikHighHttp4xxErrorRateService expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Traefik high HTTP 4xx error rate service (instance {{ $labels.instance }}) description: Traefik service 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
3.4.3. Traefik high HTTP 5xx error rate serviceTraefik service 5xx error rate is above 5%[copy]- alert: TraefikHighHttp5xxErrorRateService expr: sum(rate(traefik_service_requests_total{code=~"5.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5 for: 5m labels: severity: critical annotations: summary: Traefik high HTTP 5xx error rate service (instance {{ $labels.instance }}) description: Traefik service 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
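These traefik_service_* series only exist once Traefik's Prometheus metrics provider is enabled in its static configuration and the metrics endpoint is scraped. A minimal static-configuration sketch (traefik.yml, v2 syntax; the scrape target and any further metrics options depend on your setup):

metrics:
  prometheus: {}        # enables the built-in Prometheus metrics provider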
 
- 
- 
4. 1. PHP-FPM : bakins/php-fpm-exporter// @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
- 
4. 2. JVM : java-client (1 rule)[copy all]- 
4.2.1. JVM memory filling upJVM memory is filling up (> 80%)[copy]- alert: JvmMemoryFillingUp expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8 for: 5m labels: severity: warning annotations: summary: JVM memory filling up (instance {{ $labels.instance }}) description: JVM memory is filling up (> 80%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
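In the rule above the {area="heap"} matcher only appears on the right-hand side. Because PromQL binary operators match series on their full label sets, the non-heap series on the left simply find no partner and drop out, so the rule still behaves as intended; putting the matcher on both sides just makes the intent explicit. A sketch (annotations omitted):

- alert: JvmMemoryFillingUp
  expr: jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.8
  for: 5m
  labels:
    severity: warning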
 
- 
- 
4. 3. Sidekiq : Strech/sidekiq-prometheus-exporter (2 rules)[copy all]- 
4.3.1. Sidekiq queue sizeSidekiq queue {{ $labels.name }} is growing[copy]- alert: SidekiqQueueSize expr: sidekiq_queue_size > 100 for: 5m labels: severity: warning annotations: summary: Sidekiq queue size (instance {{ $labels.instance }}) description: Sidekiq queue {{ $labels.name }} is growing\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
4.3.2. Sidekiq scheduling latency too highSidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.[copy]- alert: SidekiqSchedulingLatencyTooHigh expr: max(sidekiq_queue_latency) > 120 for: 5m labels: severity: critical annotations: summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }}) description: Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
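The latency rule in 4.3.2 takes max() across every queue, so the firing alert carries no queue label. Grouping by the queue name keeps that context, assuming sidekiq_queue_latency exposes the same name label that 4.3.1 interpolates as {{ $labels.name }}; a sketch (annotations omitted):

- alert: SidekiqSchedulingLatencyTooHigh
  expr: max(sidekiq_queue_latency) by (name) > 120
  for: 5m
  labels:
    severity: critical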
 
- 
- 
5. 1. Kubernetes : kube-state-metrics (32 rules)[copy all]- 
5.1.1. Kubernetes Node readyNode {{ $labels.node }} has been unready for a long time[copy]- alert: KubernetesNodeReady expr: kube_node_status_condition{condition="Ready",status="true"} == 0 for: 5m labels: severity: critical annotations: summary: Kubernetes Node ready (instance {{ $labels.instance }}) description: Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.2. Kubernetes memory pressure{{ $labels.node }} has MemoryPressure condition[copy]- alert: KubernetesMemoryPressure expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 for: 5m labels: severity: critical annotations: summary: Kubernetes memory pressure (instance {{ $labels.instance }}) description: {{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.3. Kubernetes disk pressure{{ $labels.node }} has DiskPressure condition[copy]- alert: KubernetesDiskPressure expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1 for: 5m labels: severity: critical annotations: summary: Kubernetes disk pressure (instance {{ $labels.instance }}) description: {{ $labels.node }} has DiskPressure condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.4. Kubernetes out of disk{{ $labels.node }} has OutOfDisk condition[copy]- alert: KubernetesOutOfDisk expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1 for: 5m labels: severity: critical annotations: summary: Kubernetes out of disk (instance {{ $labels.instance }}) description: {{ $labels.node }} has OutOfDisk condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.5. Kubernetes out of capacity{{ $labels.node }} is out of capacity[copy]- alert: KubernetesOutOfCapacity expr: sum(kube_pod_info) by (node) / sum(kube_node_status_allocatable_pods) by (node) * 100 > 90 for: 5m labels: severity: warning annotations: summary: Kubernetes out of capacity (instance {{ $labels.instance }}) description: {{ $labels.node }} is out of capacity\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.6. Kubernetes Job failedJob {{$labels.namespace}}/{{$labels.exported_job}} failed to complete[copy]- alert: KubernetesJobFailed expr: kube_job_status_failed > 0 for: 5m labels: severity: warning annotations: summary: Kubernetes Job failed (instance {{ $labels.instance }}) description: Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.7. Kubernetes CronJob suspendedCronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended[copy]- alert: KubernetesCronjobSuspended expr: kube_cronjob_spec_suspend != 0 for: 5m labels: severity: warning annotations: summary: Kubernetes CronJob suspended (instance {{ $labels.instance }}) description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.8. Kubernetes PersistentVolumeClaim pendingPersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending[copy]- alert: KubernetesPersistentvolumeclaimPending expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1 for: 5m labels: severity: warning annotations: summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }}) description: PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.9. Kubernetes Volume out of disk spaceVolume is almost full (< 10% left)[copy]- alert: KubernetesVolumeOutOfDiskSpace expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10 for: 5m labels: severity: warning annotations: summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }}) description: Volume is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.10. Kubernetes Volume full in four days{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.[copy]- alert: KubernetesVolumeFullInFourDays expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0 for: 5m labels: severity: critical annotations: summary: Kubernetes Volume full in four days (instance {{ $labels.instance }}) description: {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.11. Kubernetes PersistentVolume errorPersistent volume is in bad state[copy]- alert: KubernetesPersistentvolumeError expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }}) description: Persistent volume is in bad state\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.12. Kubernetes StatefulSet downA StatefulSet went down[copy]- alert: KubernetesStatefulsetDown expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1 for: 5m labels: severity: critical annotations: summary: Kubernetes StatefulSet down (instance {{ $labels.instance }}) description: A StatefulSet went down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.13. Kubernetes HPA scaling abilityPod is unable to scale[copy]- alert: KubernetesHpaScalingAbility expr: kube_hpa_status_condition{status="false", condition ="AbleToScale"} == 1 for: 5m labels: severity: warning annotations: summary: Kubernetes HPA scaling ability (instance {{ $labels.instance }}) description: Pod is unable to scale\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.14. Kubernetes HPA metric availabilityHPA is not able to collect metrics[copy]- alert: KubernetesHpaMetricAvailability expr: kube_hpa_status_condition{status="false", condition="ScalingActive"} == 1 for: 5m labels: severity: warning annotations: summary: Kubernetes HPA metric availability (instance {{ $labels.instance }}) description: HPA is not able to collect metrics\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.15. Kubernetes HPA scale capabilityThe maximum number of desired Pods has been hit[copy]- alert: KubernetesHpaScaleCapability expr: kube_hpa_status_desired_replicas >= kube_hpa_spec_max_replicas for: 5m labels: severity: warning annotations: summary: Kubernetes HPA scale capability (instance {{ $labels.instance }}) description: The maximum number of desired Pods has been hit\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.16. Kubernetes Pod not healthyPod has been in a non-ready state for longer than an hour.[copy]- alert: KubernetesPodNotHealthy expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes Pod not healthy (instance {{ $labels.instance }}) description: Pod has been in a non-ready state for longer than an hour.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.17. Kubernetes pod crash loopingPod {{ $labels.pod }} is crash looping[copy]- alert: KubernetesPodCrashLooping expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5 for: 5m labels: severity: warning annotations: summary: Kubernetes pod crash looping (instance {{ $labels.instance }}) description: Pod {{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.18. Kubernetes ReplicaSet mismatchReplicaSet replicas mismatch[copy]- alert: KubernetesReplicassetMismatch expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas for: 5m labels: severity: warning annotations: summary: Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }}) description: ReplicaSet replicas mismatch\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.19. Kubernetes Deployment replicas mismatchDeployment Replicas mismatch[copy]- alert: KubernetesDeploymentReplicasMismatch expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available for: 5m labels: severity: warning annotations: summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }}) description: Deployment Replicas mismatch\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.20. Kubernetes StatefulSet replicas mismatchA StatefulSet has not matched the expected number of replicas for longer than 15 minutes.[copy]- alert: KubernetesStatefulsetReplicasMismatch expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas for: 5m labels: severity: warning annotations: summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }}) description: A StatefulSet has not matched the expected number of replicas for longer than 15 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.21. Kubernetes Deployment generation mismatchA Deployment has failed but has not been rolled back.[copy]- alert: KubernetesDeploymentGenerationMismatch expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation for: 5m labels: severity: critical annotations: summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }}) description: A Deployment has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.22. Kubernetes StatefulSet generation mismatchA StatefulSet has failed but has not been rolled back.[copy]- alert: KubernetesStatefulsetGenerationMismatch expr: kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation for: 5m labels: severity: critical annotations: summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }}) description: A StatefulSet has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.23. Kubernetes StatefulSet update not rolled outStatefulSet update has not been rolled out.[copy]- alert: KubernetesStatefulsetUpdateNotRolledOut expr: max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated) for: 5m labels: severity: critical annotations: summary: Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }}) description: StatefulSet update has not been rolled out.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.24. Kubernetes DaemonSet rollout stuckSome Pods of DaemonSet are not scheduled or not ready[copy]- alert: KubernetesDaemonsetRolloutStuck expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }}) description: Some Pods of DaemonSet are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.25. Kubernetes DaemonSet misscheduledSome DaemonSet Pods are running where they are not supposed to run[copy]- alert: KubernetesDaemonsetMisscheduled expr: kube_daemonset_status_number_misscheduled > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }}) description: Some DaemonSet Pods are running where they are not supposed to run\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.26. Kubernetes CronJob too longCronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.[copy]- alert: KubernetesCronjobTooLong expr: time() - kube_cronjob_next_schedule_time > 3600 for: 5m labels: severity: warning annotations: summary: Kubernetes CronJob too long (instance {{ $labels.instance }}) description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.27. Kubernetes job completionKubernetes Job failed to complete[copy]- alert: KubernetesJobCompletion expr: kube_job_spec_completions - kube_job_status_succeeded > 0 or kube_job_status_failed > 0 for: 5m labels: severity: critical annotations: summary: Kubernetes job completion (instance {{ $labels.instance }}) description: Kubernetes Job failed to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.28. Kubernetes API server errorsKubernetes API server is experiencing high error rate[copy]- alert: KubernetesApiServerErrors expr: sum(rate(apiserver_request_count{job="apiserver",code=~"^(?:5..)$"}[2m])) / sum(rate(apiserver_request_count{job="apiserver"}[2m])) * 100 > 3 for: 5m labels: severity: critical annotations: summary: Kubernetes API server errors (instance {{ $labels.instance }}) description: Kubernetes API server is experiencing high error rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.29. Kubernetes API client errorsKubernetes API client is experiencing high error rate[copy]- alert: KubernetesApiClientErrors expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[2m])) by (instance, job) / sum(rate(rest_client_requests_total[2m])) by (instance, job)) * 100 > 1 for: 5m labels: severity: critical annotations: summary: Kubernetes API client errors (instance {{ $labels.instance }}) description: Kubernetes API client is experiencing high error rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.30. Kubernetes client certificate expires next weekA client certificate used to authenticate to the apiserver is expiring next week.[copy]- alert: KubernetesClientCertificateExpiresNextWeek expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60 for: 5m labels: severity: warning annotations: summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }}) description: A client certificate used to authenticate to the apiserver is expiring next week.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.31. Kubernetes client certificate expires soonA client certificate used to authenticate to the apiserver is expiring in less than 24 hours.[copy]- alert: KubernetesClientCertificateExpiresSoon expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60 for: 5m labels: severity: critical annotations: summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }}) description: A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.1.32. Kubernetes API server latencyKubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.[copy]- alert: KubernetesApiServerLatency expr: histogram_quantile(0.99, sum(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"}) WITHOUT (instance, resource)) / 1e+06 > 1 for: 5m labels: severity: warning annotations: summary: Kubernetes API server latency (instance {{ $labels.instance }}) description: Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
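Each of the one-line entries above is a complete alerting rule; to use it, it only needs to be placed inside the usual groups: wrapper of a Prometheus rule file and referenced from rule_files: in prometheus.yml. A minimal sketch, taking rule 5.1.28 as the example (the file name kubernetes.rules.yml and the group name are illustrative, not part of the original collection):

# kubernetes.rules.yml (referenced from the rule_files: section of prometheus.yml)
groups:
  - name: kubernetes
    rules:
      - alert: KubernetesApiServerErrors
        expr: sum(rate(apiserver_request_count{job="apiserver",code=~"^(?:5..)$"}[2m])) / sum(rate(apiserver_request_count{job="apiserver"}[2m])) * 100 > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes API server errors (instance {{ $labels.instance }})"
          description: "Kubernetes API server is experiencing high error rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

The file can be syntax-checked with promtool check rules kubernetes.rules.yml before Prometheus is reloaded.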
 
- 
- 
5. 2. Nomad : Embedded exporter // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
- 
5. 3. Consul : prometheus/consul_exporter (3 rules)[copy all]- 
5.3.1. Consul service healthcheck failedService: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`[copy]- alert: ConsulServiceHealthcheckFailed expr: consul_catalog_service_node_healthy == 0 for: 5m labels: severity: critical annotations: summary: Consul service healthcheck failed (instance {{ $labels.instance }}) description: Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.3.2. Consul missing master nodeNumber of Consul raft peers should be 3 in order to preserve quorum.[copy]- alert: ConsulMissingMasterNode expr: consul_raft_peers < 3 for: 5m labels: severity: critical annotations: summary: Consul missing master node (instance {{ $labels.instance }}) description: Number of Consul raft peers should be 3 in order to preserve quorum.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.3.3. Consul agent unhealthyA Consul agent is down[copy]- alert: ConsulAgentUnhealthy expr: consul_health_node_status{status="critical"} == 1 for: 5m labels: severity: critical annotations: summary: Consul agent unhealthy (instance {{ $labels.instance }}) description: A Consul agent is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
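The three Consul rules above expect the metrics exposed by prometheus/consul_exporter. A minimal scrape job sketch for prometheus.yml; the target host below is an assumption, and :9107 is simply the port the exporter usually listens on, so adjust both to the actual deployment:

scrape_configs:
  - job_name: 'consul'
    static_configs:
      - targets: ['consul-exporter.example.local:9107']  # assumed consul_exporter address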
 
- 
- 
5. 4. Etcd (13 rules)[copy all]- 
5.4.1. Etcd insufficient MembersEtcd cluster should have an odd number of members[copy]- alert: EtcdInsufficientMembers expr: count(etcd_server_id) % 2 == 0 for: 5m labels: severity: critical annotations: summary: Etcd insufficient Members (instance {{ $labels.instance }}) description: Etcd cluster should have an odd number of members\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.2. Etcd no LeaderEtcd cluster has no leader[copy]- alert: EtcdNoLeader expr: etcd_server_has_leader == 0 for: 5m labels: severity: critical annotations: summary: Etcd no Leader (instance {{ $labels.instance }}) description: Etcd cluster has no leader\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.3. Etcd high number of leader changesEtcd leader changed more than 3 times during last hour[copy]- alert: EtcdHighNumberOfLeaderChanges expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3 for: 5m labels: severity: warning annotations: summary: Etcd high number of leader changes (instance {{ $labels.instance }}) description: Etcd leader changed more than 3 times during last hour\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.4. Etcd high number of failed GRPC requestsMore than 1% GRPC request failure detected in Etcd for 5 minutes[copy]- alert: EtcdHighNumberOfFailedGrpcRequests expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.01 for: 5m labels: severity: warning annotations: summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) description: More than 1% GRPC request failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.5. Etcd high number of failed GRPC requestsMore than 5% GRPC request failure detected in Etcd for 5 minutes[copy]- alert: EtcdHighNumberOfFailedGrpcRequests expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.05 for: 5m labels: severity: critical annotations: summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) description: More than 5% GRPC request failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.6. Etcd GRPC requests slowGRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes[copy]- alert: EtcdGrpcRequestsSlow expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) > 0.15 for: 5m labels: severity: warning annotations: summary: Etcd GRPC requests slow (instance {{ $labels.instance }}) description: GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.7. Etcd high number of failed HTTP requestsMore than 1% HTTP failure detected in Etcd for 5 minutes[copy]- alert: EtcdHighNumberOfFailedHttpRequests expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.01 for: 5m labels: severity: warning annotations: summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }}) description: More than 1% HTTP failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.8. Etcd high number of failed HTTP requestsMore than 5% HTTP failure detected in Etcd for 5 minutes[copy]- alert: EtcdHighNumberOfFailedHttpRequests expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.05 for: 5m labels: severity: critical annotations: summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }}) description: More than 5% HTTP failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.9. Etcd HTTP requests slowHTTP requests slowing down, 99th percentile is over 0.15s for 5 minutes[copy]- alert: EtcdHttpRequestsSlow expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15 for: 5m labels: severity: warning annotations: summary: Etcd HTTP requests slow (instance {{ $labels.instance }}) description: HTTP requests slowing down, 99th percentile is over 0.15s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.10. Etcd member communication slowEtcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes[copy]- alert: EtcdMemberCommunicationSlow expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.15 for: 5m labels: severity: warning annotations: summary: Etcd member communication slow (instance {{ $labels.instance }}) description: Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.11. Etcd high number of failed proposalsEtcd server got more than 5 failed proposals in the past hour[copy]- alert: EtcdHighNumberOfFailedProposals expr: increase(etcd_server_proposals_failed_total[1h]) > 5 for: 5m labels: severity: warning annotations: summary: Etcd high number of failed proposals (instance {{ $labels.instance }}) description: Etcd server got more than 5 failed proposals in the past hour\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.12. Etcd high fsync durationsEtcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes[copy]- alert: EtcdHighFsyncDurations expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5 for: 5m labels: severity: warning annotations: summary: Etcd high fsync durations (instance {{ $labels.instance }}) description: Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
5.4.13. Etcd high commit durationsEtcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes[copy]- alert: EtcdHighCommitDurations expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25 for: 5m labels: severity: warning annotations: summary: Etcd high commit durations (instance {{ $labels.instance }}) description: Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
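The failed-gRPC-request rules 5.4.4 and 5.4.5 above share one expression and differ only in threshold and severity. The same query, spread over several lines purely for readability (PromQL accepts # comments and arbitrary whitespace):

  sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) by (grpc_service, grpc_method)  # failed calls per service/method
/
  sum(rate(grpc_server_handled_total[5m])) by (grpc_service, grpc_method)                    # all calls per service/method
> 0.01  # 1% for the warning rule; the critical rule uses 0.05 (5%)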
 
- 
- 
5. 5. Linkerd : Embedded exporter (1 rule)[copy all]- 
5.5.1. Linkerd high error rateLinkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%[copy]- alert: LinkerdHighErrorRate expr: sum(rate(request_errors_total[5m])) by (deployment, statefulset, daemonset) / sum(rate(request_total[5m])) by (deployment, statefulset, daemonset) * 100 > 10 for: 5m labels: severity: warning annotations: summary: Linkerd high error rate (instance {{ $labels.instance }}) description: Linkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
5. 6. Istio // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
- 
6. 1. Ceph : Embedded exporter (13 rules)[copy all]- 
6.1.1. Ceph StateCeph instance unhealthy[copy]- alert: CephState expr: ceph_health_status != 0 for: 5m labels: severity: critical annotations: summary: Ceph State (instance {{ $labels.instance }}) description: Ceph instance unhealthy\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.2. Ceph monitor clock skewCeph monitor clock skew detected. Please check ntp and hardware clock settings[copy]- alert: CephMonitorClockSkew expr: abs(ceph_monitor_clock_skew_seconds) > 0.2 for: 5m labels: severity: warning annotations: summary: Ceph monitor clock skew (instance {{ $labels.instance }}) description: Ceph monitor clock skew detected. Please check ntp and hardware clock settings\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.3. Ceph monitor low spaceCeph monitor storage is low.[copy]- alert: CephMonitorLowSpace expr: ceph_monitor_avail_percent < 10 for: 5m labels: severity: warning annotations: summary: Ceph monitor low space (instance {{ $labels.instance }}) description: Ceph monitor storage is low.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.4. Ceph OSD DownCeph Object Storage Daemon Down[copy]- alert: CephOsdDown expr: ceph_osd_up == 0 for: 5m labels: severity: critical annotations: summary: Ceph OSD Down (instance {{ $labels.instance }}) description: Ceph Object Storage Daemon Down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.5. Ceph high OSD latencyCeph Object Storage Daemon latency is high. Please check that it is not stuck in a weird state.[copy]- alert: CephHighOsdLatency expr: ceph_osd_perf_apply_latency_seconds > 10 for: 5m labels: severity: warning annotations: summary: Ceph high OSD latency (instance {{ $labels.instance }}) description: Ceph Object Storage Daemon latency is high. Please check that it is not stuck in a weird state.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.6. Ceph OSD low spaceCeph Object Storage Daemon is running out of space. Please add more disks.[copy]- alert: CephOsdLowSpace expr: ceph_osd_utilization > 90 for: 5m labels: severity: warning annotations: summary: Ceph OSD low space (instance {{ $labels.instance }}) description: Ceph Object Storage Daemon is running out of space. Please add more disks.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.7. Ceph OSD reweightedCeph Object Storage Daemon takes too much time to resize.[copy]- alert: CephOsdReweighted expr: ceph_osd_weight < 1 for: 5m labels: severity: warning annotations: summary: Ceph OSD reweighted (instance {{ $labels.instance }}) description: Ceph Object Storage Daemon takes too much time to resize.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.8. Ceph PG downSome Ceph placement groups are down. Please ensure that all the data are available.[copy]- alert: CephPgDown expr: ceph_pg_down > 0 for: 5m labels: severity: critical annotations: summary: Ceph PG down (instance {{ $labels.instance }}) description: Some Ceph placement groups are down. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.9. Ceph PG incompleteSome Ceph placement groups are incomplete. Please ensure that all the data are available.[copy]- alert: CephPgIncomplete expr: ceph_pg_incomplete > 0 for: 5m labels: severity: critical annotations: summary: Ceph PG incomplete (instance {{ $labels.instance }}) description: Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.10. Ceph PG inconsistentSome Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.[copy]- alert: CephPgInconsistant expr: ceph_pg_inconsistent > 0 for: 5m labels: severity: warning annotations: summary: Ceph PG inconsistent (instance {{ $labels.instance }}) description: Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.11. Ceph PG activation longSome Ceph placement groups are taking too long to activate.[copy]- alert: CephPgActivationLong expr: ceph_pg_activating > 0 for: 5m labels: severity: warning annotations: summary: Ceph PG activation long (instance {{ $labels.instance }}) description: Some Ceph placement groups are taking too long to activate.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.12. Ceph PG backfill fullSome Ceph placement groups are located on a full Object Storage Daemon in the cluster. Those PGs may become unavailable shortly. Please check OSDs, change their weight, or reconfigure CRUSH rules.[copy]- alert: CephPgBackfillFull expr: ceph_pg_backfill_toofull > 0 for: 5m labels: severity: warning annotations: summary: Ceph PG backfill full (instance {{ $labels.instance }}) description: Some Ceph placement groups are located on a full Object Storage Daemon in the cluster. Those PGs may become unavailable shortly. Please check OSDs, change their weight, or reconfigure CRUSH rules.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.1.13. Ceph PG unavailableSome Ceph placement groups are unavailable.[copy]- alert: CephPgUnavailable expr: ceph_pg_total - ceph_pg_active > 0 for: 5m labels: severity: critical annotations: summary: Ceph PG unavailable (instance {{ $labels.instance }}) description: Some Ceph placement groups are unavailable.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
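Because every rule here uses for: 5m, it can be exercised with promtool test rules before it ever reaches production. A minimal sketch for the CephOsdDown rule above, assuming the Ceph rules were saved to ceph.rules.yml; the file names and series labels are hypothetical:

# ceph_test.yml (run with: promtool test rules ceph_test.yml)
rule_files:
  - ceph.rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'ceph_osd_up{instance="ceph-node-1", job="ceph"}'  # hypothetical OSD that stays down
        values: '0x15'
    alert_rule_test:
      - eval_time: 10m  # well past the 5m "for:" window, so the alert should be firing
        alertname: CephOsdDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: ceph-node-1
              job: ceph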
 
- 
- 
6. 2. SpeedTest : Speedtest exporter (2 rules)[copy all]- 
6.2.1. SpeedTest Slow Internet DownloadInternet download speed is currently {{humanize $value}} Mbps.[copy]- alert: SpeedtestSlowInternetDownload expr: avg_over_time(speedtest_download[30m]) < 75 for: 5m labels: severity: warning annotations: summary: SpeedTest Slow Internet Download (instance {{ $labels.instance }}) description: Internet download speed is currently {{humanize $value}} Mbps.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.2.2. SpeedTest Slow Internet UploadInternet upload speed is currently {{humanize $value}} Mbps.[copy]- alert: SpeedtestSlowInternetUpload expr: avg_over_time(speedtest_upload[30m]) < 20 for: 5m labels: severity: warning annotations: summary: SpeedTest Slow Internet Upload (instance {{ $labels.instance }}) description: Internet upload speed is currently {{humanize $value}} Mbps.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
6. 3. ZFS : node-exporter // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
- 
6. 4. OpenEBS : Embedded exporter (1 rule)[copy all]- 
6.4.1. OpenEBS used pool capacityOpenEBS pool uses more than 80% of its capacity[copy]- alert: OpenebsUsedPoolCapacity expr: (openebs_used_pool_capacity_percent) > 80 for: 5m labels: severity: warning annotations: summary: OpenEBS used pool capacity (instance {{ $labels.instance }}) description: OpenEBS pool uses more than 80% of its capacity\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
6. 5. Minio : Embedded exporter (2 rules)[copy all]- 
6.5.1. Minio disk offlineMinio disk is offline[copy]- alert: MinioDiskOffline expr: minio_offline_disks > 0 for: 5m labels: severity: critical annotations: summary: Minio disk offline (instance {{ $labels.instance }}) description: Minio disk is offline\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.5.2. Minio storage space exhaustedMinio storage space is low (< 10 GB)[copy]- alert: MinioStorageSpaceExhausted expr: minio_disk_storage_free_bytes / 1024 / 1024 / 1024 < 10 for: 5m labels: severity: warning annotations: summary: Minio storage space exhausted (instance {{ $labels.instance }}) description: Minio storage space is low (< 10 GB)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
6. 6. Juniper : czerwonk/junos_exporter (3 rules)[copy all]- 
6.6.1. Juniper switch downThe switch appears to be down[copy]- alert: JuniperSwitchDown expr: junos_up == 0 for: 5m labels: severity: critical annotations: summary: Juniper switch down (instance {{ $labels.instance }}) description: The switch appears to be down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.6.2. Juniper high Bandwidth Usage 1GiBInterface is highly saturated for at least 1 min. (> 0.90GiB/s)[copy]- alert: JuniperHighBandwithUsage1gib expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90 for: 5m labels: severity: critical annotations: summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }}) description: Interface is highly saturated for at least 1 min. (> 0.90GiB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
6.6.3. Juniper high Bandwidth Usage 1GiBInterface is getting saturated for at least 1 min. (> 0.80GiB/s)[copy]- alert: JuniperHighBandwithUsage1gib expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80 for: 5m labels: severity: warning annotations: summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }}) description: Interface is getting saturated for at least 1 min. (> 0.80GiB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
6. 7. CoreDNS : Embedded exporter (1 rule)[copy all]- 
6.7.1. CoreDNS Panic CountNumber of CoreDNS panics encountered[copy]- alert: CorednsPanicCount expr: increase(coredns_panic_count_total[10m]) > 0 for: 5m labels: severity: critical annotations: summary: CoreDNS Panic Count (instance {{ $labels.instance }}) description: Number of CoreDNS panics encountered\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
 
- 
- 
7. 1. Thanos (3 rules)[copy all]- 
7.1.1. Thanos compaction haltedThanos compaction has failed to run and is now halted.[copy]- alert: ThanosCompactionHalted expr: thanos_compactor_halted == 1 for: 5m labels: severity: critical annotations: summary: Thanos compaction halted (instance {{ $labels.instance }}) description: Thanos compaction has failed to run and is now halted.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
7.1.2. Thanos compact bucket operation failureThanos compaction has failing storage operations[copy]- alert: ThanosCompactBucketOperationFailure expr: rate(thanos_objstore_bucket_operation_failures_total[1m]) > 0 for: 5m labels: severity: critical annotations: summary: Thanos compact bucket operation failure (instance {{ $labels.instance }}) description: Thanos compaction has failing storage operations\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
- 
7.1.3. Thanos compact not runThanos compaction has not run in 24 hours.[copy]- alert: ThanosCompactNotRun expr: (time() - thanos_objstore_bucket_last_successful_upload_time) > 24*60*60 for: 5m labels: severity: critical annotations: summary: Thanos compact not run (instance {{ $labels.instance }}) description: Thanos compaction has not run in 24 hours.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}
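All rules in this collection carry a severity label of either warning or critical, which makes a severity-based Alertmanager route a natural companion to them. A minimal routing sketch, assuming receivers named slack and pager are defined elsewhere in alertmanager.yml:

route:
  receiver: slack            # assumed default receiver; catches warnings and anything unmatched
  routes:
    - matchers:
        - severity="critical"
      receiver: pager        # assumed receiver that pages the on-call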
 