Awesome Prometheus alerts

Source: http://t.zoukankan.com/shoufu-p-14110485.html

Reposted from https://awesome-prometheus-alerts.grep.to/rules#host-and-hardware

Collection of alerting rules


⚠️ Caution ⚠️

Alert thresholds depend on the nature of your applications.
Some queries on this page use arbitrary tolerance thresholds.

Building an efficient and battle-tested monitoring platform takes time. 😉
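
The snippets below are bare rule entries. To load one into Prometheus, wrap it in a `groups:` block inside a rules file and reference that file from `prometheus.yml`. A minimal sketch (the file path and group name are placeholders, not part of the original collection):

        # /etc/prometheus/rules/example.rules.yml (hypothetical path)
        groups:
          - name: example
            rules:
              - alert: PrometheusJobMissing
                expr: absent(up{job="prometheus"})
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: Prometheus job missing (instance {{ $labels.instance }})
                  description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

        # prometheus.yml (fragment)
        rule_files:
          - /etc/prometheus/rules/*.rules.yml

A rules file can be validated with `promtool check rules <file>` before reloading Prometheus.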



 

  • 1.1. Prometheus self-monitoring (25 rules)

    • 1.1.1. Prometheus job missing

       A Prometheus job has disappeared

       

        - alert: PrometheusJobMissing
          expr: absent(up{job="prometheus"})
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus job missing (instance {{ $labels.instance }})
            description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.2. Prometheus target missing

       A Prometheus target has disappeared. An exporter might have crashed.

       

        - alert: PrometheusTargetMissing
          expr: up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus target missing (instance {{ $labels.instance }})
            description: "A Prometheus target has disappeared. An exporter might have crashed.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.3. Prometheus all targets missing

       A Prometheus job no longer has any living targets.

       

        - alert: PrometheusAllTargetsMissing
          expr: count by (job) (up) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus all targets missing (instance {{ $labels.instance }})
            description: "A Prometheus job no longer has any living targets.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.4. Prometheus configuration reload failure

       Prometheus configuration reload error

       

        - alert: PrometheusConfigurationReloadFailure
          expr: prometheus_config_last_reload_successful != 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
            description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.5. Prometheus too many restarts

       Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.

       

        - alert: PrometheusTooManyRestarts
          expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus too many restarts (instance {{ $labels.instance }})
            description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.6. Prometheus AlertManager configuration reload failure

       AlertManager configuration reload error

       

        - alert: PrometheusAlertmanagerConfigurationReloadFailure
          expr: alertmanager_config_last_reload_successful != 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
            description: "AlertManager configuration reload error\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.7. Prometheus AlertManager config not synced

       Configurations of AlertManager cluster instances are out of sync

       

        - alert: PrometheusAlertmanagerConfigNotSynced
          expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
            description: "Configurations of AlertManager cluster instances are out of sync\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.8. Prometheus AlertManager E2E dead man switch

       Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.

       

        - alert: PrometheusAlertmanagerE2eDeadManSwitch
          expr: vector(1)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
            description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
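
       An always-firing alert only provides an end-to-end test if something notices when it stops arriving. A common companion is an Alertmanager route that forwards the watchdog to an external dead man's switch with a short repeat_interval. The sketch below is illustrative; the receiver URL is a placeholder:

        # alertmanager.yml (fragment)
        route:
          routes:
            - match:
                alertname: PrometheusAlertmanagerE2eDeadManSwitch
              receiver: deadmansswitch
              repeat_interval: 5m
        receivers:
          - name: deadmansswitch
            webhook_configs:
              - url: https://deadmansswitch.example.com/ping  # placeholder endpoint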

       

    • 1.1.9. Prometheus not connected to alertmanager

       Prometheus cannot connect to the alertmanager

       

        - alert: PrometheusNotConnectedToAlertmanager
          expr: prometheus_notifications_alertmanagers_discovered < 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
            description: "Prometheus cannot connect to the alertmanager\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.10. Prometheus rule evaluation failures

       Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.

       

        - alert: PrometheusRuleEvaluationFailures
          expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.11. Prometheus template text expansion failures

       Prometheus encountered {{ $value }} template text expansion failures

       

        - alert: PrometheusTemplateTextExpansionFailures
          expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} template text expansion failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.12. Prometheus rule evaluation slow

       Prometheus rule evaluation took longer than the scheduled interval. This indicates slow storage backend access or an overly complex query.

       

        - alert: PrometheusRuleEvaluationSlow
          expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
            description: "Prometheus rule evaluation took longer than the scheduled interval. This indicates slow storage backend access or an overly complex query.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.13. Prometheus notifications backlog

       The Prometheus notification queue has not been empty for 10 minutes

       

        - alert: PrometheusNotificationsBacklog
          expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus notifications backlog (instance {{ $labels.instance }})
            description: "The Prometheus notification queue has not been empty for 10 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.14. Prometheus AlertManager notification failing

       Alertmanager is failing to send notifications

       

        - alert: PrometheusAlertmanagerNotificationFailing
          expr: rate(alertmanager_notifications_failed_total[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
            description: "Alertmanager is failing to send notifications\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.15. Prometheus target empty

       Prometheus has no target in service discovery

       

        - alert: PrometheusTargetEmpty
          expr: prometheus_sd_discovered_targets == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus target empty (instance {{ $labels.instance }})
            description: "Prometheus has no target in service discovery\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.16. Prometheus target scraping slow

       Prometheus is scraping exporters slowly

       

        - alert: PrometheusTargetScrapingSlow
          expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus target scraping slow (instance {{ $labels.instance }})
            description: "Prometheus is scraping exporters slowly\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.17. Prometheus large scrape

       Prometheus has many scrapes that exceed the sample limit

       

        - alert: PrometheusLargeScrape
          expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus large scrape (instance {{ $labels.instance }})
            description: "Prometheus has many scrapes that exceed the sample limit\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.18. Prometheus target scrape duplicate

       Prometheus has many samples rejected due to duplicate timestamps but different values

       

        - alert: PrometheusTargetScrapeDuplicate
          expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
            description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.19. Prometheus TSDB checkpoint creation failures

       Prometheus encountered {{ $value }} checkpoint creation failures

       

        - alert: PrometheusTsdbCheckpointCreationFailures
          expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} checkpoint creation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.20. Prometheus TSDB checkpoint deletion failures

       Prometheus encountered {{ $value }} checkpoint deletion failures

       

        - alert: PrometheusTsdbCheckpointDeletionFailures
          expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.21. Prometheus TSDB compactions failed

       Prometheus encountered {{ $value }} TSDB compaction failures

       

        - alert: PrometheusTsdbCompactionsFailed
          expr: increase(prometheus_tsdb_compactions_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB compaction failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.22. Prometheus TSDB head truncations failed

       Prometheus encountered {{ $value }} TSDB head truncation failures

       

        - alert: PrometheusTsdbHeadTruncationsFailed
          expr: increase(prometheus_tsdb_head_truncations_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.23. Prometheus TSDB reload failures

       Prometheus encountered {{ $value }} TSDB reload failures

       

        - alert: PrometheusTsdbReloadFailures
          expr: increase(prometheus_tsdb_reloads_failures_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB reload failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.24. Prometheus TSDB WAL corruptions

       Prometheus encountered {{ $value }} TSDB WAL corruptions

       

        - alert: PrometheusTsdbWalCorruptions
          expr: increase(prometheus_tsdb_wal_corruptions_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.25. Prometheus TSDB WAL truncations failed

       Prometheus encountered {{ $value }} TSDB WAL truncation failures

       

        - alert: PrometheusTsdbWalTruncationsFailed
          expr: increase(prometheus_tsdb_wal_truncations_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       


  • 1.2. Host and hardware : node-exporter (26 rules)

    • 1.2.1. Host out of memory

       Node memory is filling up (< 10% left)

       

        - alert: HostOutOfMemory
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host out of memory (instance {{ $labels.instance }})
            description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.2. Host memory under memory pressure

       The node is under heavy memory pressure. High rate of major page faults

       

        - alert: HostMemoryUnderMemoryPressure
          expr: rate(node_vmstat_pgmajfault[1m]) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host memory under memory pressure (instance {{ $labels.instance }})
            description: "The node is under heavy memory pressure. High rate of major page faults\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.3. Host unusual network throughput in

       Host network interfaces are probably receiving too much data (> 100 MB/s)

       

        - alert: HostUnusualNetworkThroughputIn
          expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual network throughput in (instance {{ $labels.instance }})
            description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.4. Host unusual network throughput out

       Host network interfaces are probably sending too much data (> 100 MB/s)

       

        - alert: HostUnusualNetworkThroughputOut
          expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual network throughput out (instance {{ $labels.instance }})
            description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.5. Host unusual disk read rate

       Disk is probably reading too much data (> 50 MB/s)

       

        - alert: HostUnusualDiskReadRate
          expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual disk read rate (instance {{ $labels.instance }})
            description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.6. Host unusual disk write rate

       Disk is probably writing too much data (> 50 MB/s)

       

        - alert: HostUnusualDiskWriteRate
          expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual disk write rate (instance {{ $labels.instance }})
            description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.7. Host out of disk space

       Disk is almost full (< 10% left)

       

        # please add ignored mountpoints in node_exporter parameters like
        # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)"
        - alert: HostOutOfDiskSpace
          expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host out of disk space (instance {{ $labels.instance }})
            description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
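
       If node_exporter cannot be restarted with the flag mentioned in the comment above, pseudo filesystems can instead be excluded in the expression itself. A possible variant (the fstype list is an assumption; adjust it to your mounts):

        - alert: HostOutOfDiskSpace
          expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 10
          for: 5m
          labels:
            severity: warning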

       

    • 1.2.8. Host disk will fill in 4 hours

       Disk will fill in 4 hours at current write rate

       

        - alert: HostDiskWillFillIn4Hours
          expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host disk will fill in 4 hours (instance {{ $labels.instance }})
            description: "Disk will fill in 4 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.9. Host out of inodes

       Disk is almost running out of available inodes (< 10% left)

       

        - alert: HostOutOfInodes
          expr: node_filesystem_files_free{mountpoint="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host out of inodes (instance {{ $labels.instance }})
            description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.10. Host unusual disk read latency

       Disk latency is growing (read operations > 100ms)

       

        - alert: HostUnusualDiskReadLatency
          expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual disk read latency (instance {{ $labels.instance }})
            description: "Disk latency is growing (read operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.11. Host unusual disk write latency

       Disk latency is growing (write operations > 100ms)

       

        - alert: HostUnusualDiskWriteLatency
          expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual disk write latency (instance {{ $labels.instance }})
            description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.12. Host high CPU load

       CPU load is > 80%

       

        - alert: HostHighCpuLoad
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host high CPU load (instance {{ $labels.instance }})
            description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.13. Host context switching

       Context switching is growing on node (> 1000 / s)

       

        # 1000 context switches is an arbitrary number.
        # Alert threshold depends on nature of application.
        # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
        - alert: HostContextSwitching
          expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host context switching (instance {{ $labels.instance }})
            description: "Context switching is growing on node (> 1000 / s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.14. Host swap is filling up

       Swap is filling up (>80%)

       

        - alert: HostSwapIsFillingUp
          expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host swap is filling up (instance {{ $labels.instance }})
            description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.15. Host SystemD service crashed

       SystemD service crashed

       

        - alert: HostSystemdServiceCrashed
          expr: node_systemd_unit_state{state="failed"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host SystemD service crashed (instance {{ $labels.instance }})
            description: "SystemD service crashed\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.16. Host physical component too hot

       Physical hardware component too hot

       

        - alert: HostPhysicalComponentTooHot
          expr: node_hwmon_temp_celsius > 75
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host physical component too hot (instance {{ $labels.instance }})
            description: "Physical hardware component too hot\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.17. Host node overtemperature alarm

       Physical node temperature alarm triggered

       

        - alert: HostNodeOvertemperatureAlarm
          expr: node_hwmon_temp_alarm == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Host node overtemperature alarm (instance {{ $labels.instance }})
            description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.18. Host RAID array got inactive

       RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.

       

        - alert: HostRaidArrayGotInactive
          expr: node_md_state{state="inactive"} > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Host RAID array got inactive (instance {{ $labels.instance }})
            description: "RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.19. Host RAID disk failure

       At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap

       

        - alert: HostRaidDiskFailure
          expr: node_md_disks{state="failed"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host RAID disk failure (instance {{ $labels.instance }})
            description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.20. Host kernel version deviations

       Different kernel versions are running

       

        - alert: HostKernelVersionDeviations
          expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host kernel version deviations (instance {{ $labels.instance }})
            description: "Different kernel versions are running\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.21. Host OOM kill detected

       OOM kill detected

       

        - alert: HostOomKillDetected
          expr: increase(node_vmstat_oom_kill[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host OOM kill detected (instance {{ $labels.instance }})
            description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.22. Host EDAC Correctable Errors detected

       {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.

       

        - alert: HostEdacCorrectableErrorsDetected
          expr: increase(node_edac_correctable_errors_total[5m]) > 0
          for: 5m
          labels:
            severity: info
          annotations:
            summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
            description: "{{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.23. Host EDAC Uncorrectable Errors detected

       {{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.

       

        - alert: HostEdacUncorrectableErrorsDetected
          expr: node_edac_uncorrectable_errors_total > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
            description: "{{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.24. Host Network Receive Errors

       {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.

       

        - alert: HostNetworkReceiveErrors
          expr: increase(node_network_receive_errs_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host Network Receive Errors (instance {{ $labels.instance }})
            description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.25. Host Network Transmit Errors

       {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.

       

        - alert: HostNetworkTransmitErrors
          expr: increase(node_network_transmit_errs_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host Network Transmit Errors (instance {{ $labels.instance }})
            description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.26. Host Network Interface Saturated

       The network interface "{{ $labels.device }}" on "{{ $labels.instance }}" is getting overloaded. (node_network_* metrics expose the interface under the device label.)

       

        - alert: HostNetworkInterfaceSaturated
          expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host Network Interface Saturated (instance {{ $labels.instance }})
            description: "The network interface \"{{ $labels.device }}\" on \"{{ $labels.instance }}\" is getting overloaded.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
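
       All node_* series in this section come from node_exporter, which listens on port 9100 by default. A minimal scrape job sketch (target addresses are placeholders):

        # prometheus.yml (fragment)
        scrape_configs:
          - job_name: node
            static_configs:
              - targets:
                  - host1.example.com:9100
                  - host2.example.com:9100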

       


  • 1.3. Docker containers : google/cAdvisor (6 rules)

    • 1.3.1. Container killed

       A container has disappeared

       

        - alert: ContainerKilled
          expr: time() - container_last_seen > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container killed (instance {{ $labels.instance }})
            description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.2. Container CPU usage

       Container CPU usage is above 80%

       

        # cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly.
        # If you want to exclude it from this alert, just use: container_cpu_usage_seconds_total{name!=""}
        - alert: ContainerCpuUsage
          expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container CPU usage (instance {{ $labels.instance }})
            description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.3. Container Memory usage

       Container Memory usage is above 80%

       

        # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
        - alert: ContainerMemoryUsage
          expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container Memory usage (instance {{ $labels.instance }})
            description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.4. Container Volume usage

       Container Volume usage is above 80%

       

        - alert: ContainerVolumeUsage
          expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance)) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container Volume usage (instance {{ $labels.instance }})
            description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.5. Container Volume IO usage

       Container Volume IO usage is above 80%

       

        - alert: ContainerVolumeIoUsage
          expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container Volume IO usage (instance {{ $labels.instance }})
            description: "Container Volume IO usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.6. Container high throttle rate

       Container is being throttled

       

        - alert: ContainerHighThrottleRate
          expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container high throttle rate (instance {{ $labels.instance }})
            description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       


  • 1.4. Blackbox : prometheus/blackbox_exporter (8 rules)

    • 1.4.1. Blackbox probe failed

       Probe failed

       

        - alert: BlackboxProbeFailed
          expr: probe_success == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Blackbox probe failed (instance {{ $labels.instance }})
            description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.2. Blackbox slow probe

       Blackbox probe took more than 1s to complete

       

        - alert: BlackboxSlowProbe
          expr: avg_over_time(probe_duration_seconds[1m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Blackbox slow probe (instance {{ $labels.instance }})
            description: "Blackbox probe took more than 1s to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.3. Blackbox probe HTTP failure

       HTTP status code is not 200-399

       

        - alert: BlackboxProbeHttpFailure
          expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
            description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.4. Blackbox SSL certificate will expire soon

       SSL certificate expires in 30 days

       

        - alert: BlackboxSslCertificateWillExpireSoon
          expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
            description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.5. Blackbox SSL certificate will expire soon

       SSL certificate expires in 3 days

       

        - alert: BlackboxSslCertificateWillExpireSoon
          expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
            description: "SSL certificate expires in 3 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.6. Blackbox SSL certificate expired

       SSL certificate has expired already

       

        - alert: BlackboxSslCertificateExpired
          expr: probe_ssl_earliest_cert_expiry - time() <= 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
            description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.7. Blackbox probe slow HTTP

       HTTP request took more than 1s

       

        - alert: BlackboxProbeSlowHttp
          expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})
            description: "HTTP request took more than 1s\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.8. Blackbox probe slow ping

       Blackbox ping took more than 1s

       

        - alert: BlackboxProbeSlowPing
          expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Blackbox probe slow ping (instance {{ $labels.instance }})
            description: "Blackbox ping took more than 1s\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
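
       The probe_* series in this section are produced by blackbox_exporter, which probes whatever target Prometheus passes as a URL parameter. The usual relabeling copies the probed URL into the instance label, which is why these alerts can report {{ $labels.instance }}. A sketch, assuming the exporter runs locally on its default port 9115 with a plain HTTP module:

        # blackbox.yml - probe module definition
        modules:
          http_2xx:
            prober: http
            timeout: 5s

        # prometheus.yml (fragment)
        scrape_configs:
          - job_name: blackbox
            metrics_path: /probe
            params:
              module: [http_2xx]
            static_configs:
              - targets:
                  - https://example.com  # probed endpoint (placeholder)
            relabel_configs:
              - source_labels: [__address__]
                target_label: __param_target
              - source_labels: [__param_target]
                target_label: instance
              - target_label: __address__
                replacement: 127.0.0.1:9115  # blackbox_exporter itself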

       


  • 1.5. Windows Server : prometheus-community/windows_exporter (5 rules)

    • 1.5.1. Windows Server collector Error

       Collector {{ $labels.collector }} was not successful

       

        - alert: WindowsServerCollectorError
          expr: windows_exporter_collector_success == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Windows Server collector Error (instance {{ $labels.instance }})
            description: "Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.5.2. Windows Server service Status

       Windows Service state is not OK

       

        - alert: WindowsServerServiceStatus
          expr: windows_service_status{status="ok"} != 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Windows Server service Status (instance {{ $labels.instance }})
            description: "Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.5.3. Windows Server CPU Usage

       CPU Usage is more than 80%

       

        - alert: WindowsServerCpuUsage
          expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Windows Server CPU Usage (instance {{ $labels.instance }})
            description: "CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.5.4. Windows Server memory Usage

       Memory usage is more than 90%

       

        - alert: WindowsServerMemoryUsage
          expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Windows Server memory Usage (instance {{ $labels.instance }})
            description: "Memory usage is more than 90%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.5.5. Windows Server disk Space Usage

       Disk usage is more than 80%

       

        - alert: WindowsServerDiskSpaceUsage
          expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024 ) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Windows Server disk Space Usage (instance {{ $labels.instance }})
            description: "Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       


  • 2.1. MySQL : prometheus/mysqld_exporter (8 rules)

    • 2.1.1. MySQL down

       MySQL instance is down on {{ $labels.instance }}

       

        - alert: MysqlDown
          expr: mysql_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MySQL down (instance {{ $labels.instance }})
            description: "MySQL instance is down on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.2. MySQL too many connections

       More than 80% of MySQL connections are in use on {{ $labels.instance }}

       

        - alert: MysqlTooManyConnections
          expr: avg by (instance) (max_over_time(mysql_global_status_threads_connected[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL too many connections (instance {{ $labels.instance }})
            description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.3. MySQL high threads running

       More than 60% of MySQL connections are in running state on {{ $labels.instance }}

       

        - alert: MysqlHighThreadsRunning
          expr: avg by (instance) (max_over_time(mysql_global_status_threads_running[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL high threads running (instance {{ $labels.instance }})
            description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.4. MySQL Slave IO thread not running

       MySQL Slave IO thread not running on {{ $labels.instance }}

       

        - alert: MysqlSlaveIoThreadNotRunning
          expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_io_running == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MySQL Slave IO thread not running (instance {{ $labels.instance }})
            description: "MySQL Slave IO thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.5. MySQL Slave SQL thread not running

       MySQL Slave SQL thread not running on {{ $labels.instance }}

       

        - alert: MysqlSlaveSqlThreadNotRunning
          expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_sql_running == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }})
            description: "MySQL Slave SQL thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.6. MySQL Slave replication lag

       MySQL replication lag on {{ $labels.instance }}

       

        - alert: MysqlSlaveReplicationLag
          expr: mysql_slave_status_master_server_id > 0 and ON (instance) (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) > 300
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL Slave replication lag (instance {{ $labels.instance }})
            description: "MySQL replication lag on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.7. MySQL slow queries

       MySQL server has new slow queries.

       

        - alert: MysqlSlowQueries
          expr: rate(mysql_global_status_slow_queries[2m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL slow queries (instance {{ $labels.instance }})
            description: "MySQL server has new slow queries.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.8. MySQL restarted

       MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.

       

        - alert: MysqlRestarted
          expr: mysql_global_status_uptime < 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL restarted (instance {{ $labels.instance }})
            description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
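
       The mysql_* series in this section come from mysqld_exporter (default port 9104), which needs database credentials, commonly supplied through the DATA_SOURCE_NAME environment variable. A sketch with placeholder credentials and targets:

        # environment for mysqld_exporter (placeholder credentials)
        #   DATA_SOURCE_NAME="exporter:password@(127.0.0.1:3306)/"

        # prometheus.yml (fragment)
        scrape_configs:
          - job_name: mysql
            static_configs:
              - targets:
                  - db1.example.com:9104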

       


  • 2.2. PostgreSQL : wrouesnel/postgres_exporter (25 rules)

    • 2.2.1. Postgresql down

       Postgresql instance is down

       

        - alert: PostgresqlDown
          expr: pg_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql down (instance {{ $labels.instance }})
            description: "Postgresql instance is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.2. Postgresql restarted

       Postgresql restarted

       

        - alert: PostgresqlRestarted
          expr: time() - pg_postmaster_start_time_seconds < 60
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql restarted (instance {{ $labels.instance }})
            description: "Postgresql restarted\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.3. Postgresql exporter error

       Postgresql exporter is showing errors. A query may be buggy in query.yaml

       

        - alert: PostgresqlExporterError
          expr: pg_exporter_last_scrape_error > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql exporter error (instance {{ $labels.instance }})
            description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.4. Postgresql replication lag

       PostgreSQL replication lag is going up (> 10s)

       

        - alert: PostgresqlReplicationLag
          expr: (pg_replication_lag) > 10 and ON(instance) (pg_replication_is_replica == 1)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql replication lag (instance {{ $labels.instance }})
            description: "PostgreSQL replication lag is going up (> 10s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.5. Postgresql table not vacuumed

       Table has not been vacuumed for 24 hours

       

        - alert: PostgresqlTableNotVacuumed
          expr: time() - pg_stat_user_tables_last_autovacuum > 60 * 60 * 24
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql table not vacuumed (instance {{ $labels.instance }})
            description: "Table has not been vacuumed for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.6. Postgresql table not analyzed

       Table has not been analyzed for 24 hours

       

        - alert: PostgresqlTableNotAnalyzed
          expr: time() - pg_stat_user_tables_last_autoanalyze > 60 * 60 * 24
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql table not analyzed (instance {{ $labels.instance }})
            description: "Table has not been analyzed for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.7. Postgresql too many connections

       PostgreSQL instance has too many connections

       

        - alert: PostgresqlTooManyConnections
          expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > pg_settings_max_connections * 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql too many connections (instance {{ $labels.instance }})
            description: "PostgreSQL instance has too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.8. Postgresql not enough connections

       PostgreSQL instance should have more connections (> 5)

       

        - alert: PostgresqlNotEnoughConnections
          expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql not enough connections (instance {{ $labels.instance }})
            description: "PostgreSQL instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.9. Postgresql dead locks

       PostgreSQL has dead-locks

       

        - alert: PostgresqlDeadLocks
          expr: rate(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql dead locks (instance {{ $labels.instance }})
            description: "PostgreSQL has dead-locks\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.10. Postgresql slow queries

       PostgreSQL executes slow queries

       

        - alert: PostgresqlSlowQueries
          expr: pg_slow_queries > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql slow queries (instance {{ $labels.instance }})
            description: "PostgreSQL executes slow queries\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.11. Postgresql high rollback rate

       Ratio of transactions being aborted compared to committed is > 2 %

       

        - alert: PostgresqlHighRollbackRate
          expr: rate(pg_stat_database_xact_rollback{datname!~"template.*"}[3m]) / rate(pg_stat_database_xact_commit{datname!~"template.*"}[3m]) > 0.02
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql high rollback rate (instance {{ $labels.instance }})
            description: "Ratio of transactions being aborted compared to committed is > 2 %\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.12. Postgresql commit rate low

       Postgres seems to be processing very few transactions

       

        - alert: PostgresqlCommitRateLow
          expr: rate(pg_stat_database_xact_commit[1m]) < 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql commit rate low (instance {{ $labels.instance }})
            description: "Postgres seems to be processing very few transactions\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.13. Postgresql low XID consumption

       Postgresql seems to be consuming transaction IDs very slowly

       

        - alert: PostgresqlLowXidConsumption
          expr: rate(pg_txid_current[1m]) < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql low XID consumption (instance {{ $labels.instance }})
            description: "Postgresql seems to be consuming transaction IDs very slowly\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.14. Postgresql low XLOG consumption

       Postgres seems to be consuming XLOG very slowly

       

        - alert: PostgresqlLowXlogConsumption
          expr: rate(pg_xlog_position_bytes[1m]) < 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql low XLOG consumption (instance {{ $labels.instance }})
            description: "Postgres seems to be consuming XLOG very slowly\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.15. Postgresql WALE replication stopped

       WAL-E replication seems to be stopped

       

        - alert: PostgresqlWaleReplicationStopped
          expr: rate(pg_xlog_position_bytes[1m]) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql WALE replication stopped (instance {{ $labels.instance }})
            description: "WAL-E replication seems to be stopped\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.16. Postgresql high rate statement timeout

       Postgres transactions showing high rate of statement timeouts

       

        - alert: PostgresqlHighRateStatementTimeout
          expr: rate(postgresql_errors_total{type="statement_timeout"}[5m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql high rate statement timeout (instance {{ $labels.instance }})
            description: "Postgres transactions showing high rate of statement timeouts\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.17. Postgresql high rate deadlock

       Postgres detected deadlocks

       

        - alert: PostgresqlHighRateDeadlock
          expr: rate(postgresql_errors_total{type="deadlock_detected"}[1m]) * 60 > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql high rate deadlock (instance {{ $labels.instance }})
            description: "Postgres detected deadlocks\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.18. Postgresql replication lag bytes

       Postgres replication lag (in bytes) is high

       

        - alert: PostgresqlReplicationLagBytes
          expr: (pg_xlog_position_bytes and pg_replication_is_replica == 0) - GROUP_RIGHT(instance) (pg_xlog_position_bytes and pg_replication_is_replica == 1) > 1e+09
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql replication lag bytes (instance {{ $labels.instance }})
            description: "Postgres replication lag (in bytes) is high\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.19. Postgresql unused replication slot

       Unused Replication Slots

       

        - alert: PostgresqlUnusedReplicationSlot
          expr: pg_replication_slots_active == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql unused replication slot (instance {{ $labels.instance }})
            description: "Unused Replication Slots\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.20. Postgresql too many dead tuples

       The number of PostgreSQL dead tuples is too large

       

        - alert: PostgresqlTooManyDeadTuples
          expr: ((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 unless ON(instance) (pg_replication_is_replica == 1)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql too many dead tuples (instance {{ $labels.instance }})
            description: "The number of PostgreSQL dead tuples is too large\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.21. Postgresql split brain

       Split Brain, too many primary Postgresql databases in read-write mode[copy]

       

        - alert: PostgresqlSplitBrain
          expr: count(pg_replication_is_replica == 0) != 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql split brain (instance {{ $labels.instance }})
            description: Split Brain, too many primary Postgresql databases in read-write mode\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.2.22. Postgresql promoted node

       Postgresql standby server has been promoted to primary node[copy]

       

        - alert: PostgresqlPromotedNode
          expr: pg_replication_is_replica and changes(pg_replication_is_replica[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql promoted node (instance {{ $labels.instance }})
            description: Postgresql standby server has been promoted to primary node\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.2.23. Postgresql configuration changed

       Postgres Database configuration change has occurred[copy]

       

        - alert: PostgresqlConfigurationChanged
          expr: {__name__=~"pg_settings_.*"} != ON(__name__) {__name__=~"pg_settings_([^t]|t[^r]|tr[^a]|tra[^n]|tran[^s]|trans[^a]|transa[^c]|transac[^t]|transact[^i]|transacti[^o]|transactio[^n]|transaction[^_]|transaction_[^r]|transaction_r[^e]|transaction_re[^a]|transaction_rea[^d]|transaction_read[^_]|transaction_read_[^o]|transaction_read_o[^n]|transaction_read_on[^l]|transaction_read_onl[^y]).*"} OFFSET 5m
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql configuration changed (instance {{ $labels.instance }})
            description: Postgres Database configuration change has occurred\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
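
       The sprawling regex in the expression above exists because Prometheus regexes use RE2, which has no negative lookahead; the prefix enumeration excludes exactly pg_settings_transaction_read_only, a setting that flips when a standby is promoted and would otherwise fire this alert on every failover. A hedged, arguably clearer equivalent applies two matchers to __name__:

        - alert: PostgresqlConfigurationChanged
          expr: {__name__=~"pg_settings_.*", __name__!="pg_settings_transaction_read_only"} != ON(__name__) {__name__=~"pg_settings_.*", __name__!="pg_settings_transaction_read_only"} OFFSET 5m
          for: 5m
          labels:
            severity: warning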

       

    • 2.2.24. Postgresql SSL compression active

       Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.[copy]

       

        - alert: PostgresqlSslCompressionActive
          expr: sum(pg_stat_ssl_compression) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql SSL compression active (instance {{ $labels.instance }})
            description: Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.2.25. Postgresql too many locks acquired

       Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.[copy]

       

        - alert: PostgresqlTooManyLocksAcquired
          expr: ((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql too many locks acquired (instance {{ $labels.instance }})
            description: Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
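
       A note on deployment: each "copy all" block above is meant to live in a rule file that prometheus.yml references. A minimal sketch, with file paths as assumptions:

        # /etc/prometheus/prometheus.yml (fragment; paths are assumptions)
        rule_files:
          - "alerts/postgresql.rules.yml"

        # /etc/prometheus/alerts/postgresql.rules.yml
        groups:
          - name: postgresql
            rules:
              - alert: PostgresqlTooManyLocksAcquired
                expr: ((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20
                for: 5m
                labels:
                  severity: critical
                annotations:
                  summary: Postgresql too many locks acquired (instance {{ $labels.instance }})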

       


  • 2. 3. SQL Server : Ozarklake/prometheus-mssql-exporter (2 rules)[copy all]

    • 2.3.1. SQL Server down

       SQL Server instance is down[copy]

       

        - alert: SqlServerDown
          expr: mssql_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: SQL Server down (instance {{ $labels.instance }})
            description: SQL Server instance is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.3.2. SQL Server deadlock

       SQL Server is having some deadlock.[copy]

       

        - alert: SqlServerDeadlock
          expr: rate(mssql_deadlocks[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: SQL Server deadlock (instance {{ $labels.instance }})
            description: SQL Server is having some deadlock.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 4. PGBouncer : spreaker/prometheus-pgbouncer-exporter (3 rules)[copy all]

    • 2.4.1. PGBouncer active connections

       PGBouncer pools are filling up[copy]

       

        - alert: PgbouncerActiveConnections
          expr: pgbouncer_pools_server_active_connections > 200
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: PGBouncer active connections (instance {{ $labels.instance }})
            description: PGBouncer pools are filling up\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.4.2. PGBouncer errors

       PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.[copy]

       

        - alert: PgbouncerErrors
          expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[5m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: PGBouncer errors (instance {{ $labels.instance }})
            description: PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.4.3. PGBouncer max connections

       The number of PGBouncer client connections has reached max_client_conn.[copy]

       

        - alert: PgbouncerMaxConnections
          expr: rate(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: PGBouncer max connections (instance {{ $labels.instance }})
            description: The number of PGBouncer client connections has reached max_client_conn.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 5. Redis : oliver006/redis_exporter (11 rules)[copy all]

    • 2.5.1. Redis down

       Redis instance is down[copy]

       

        - alert: RedisDown
          expr: redis_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis down (instance {{ $labels.instance }})
            description: Redis instance is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.2. Redis missing master

       Redis cluster has no node marked as master.[copy]

       

        - alert: RedisMissingMaster
          expr: count(redis_instance_info{role="master"}) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis missing master (instance {{ $labels.instance }})
            description: Redis cluster has no node marked as master.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.3. Redis too many masters

       Redis cluster has too many nodes marked as master.[copy]

       

        - alert: RedisTooManyMasters
          expr: count(redis_instance_info{role="master"}) > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis too many masters (instance {{ $labels.instance }})
            description: Redis cluster has too many nodes marked as master.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.4. Redis disconnected slaves

       Redis not replicating for all slaves. Consider reviewing the redis replication status.[copy]

       

        - alert: RedisDisconnectedSlaves
          expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis disconnected slaves (instance {{ $labels.instance }})
            description: Redis not replicating for all slaves. Consider reviewing the redis replication status.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.5. Redis replication broken

       Redis instance lost a slave[copy]

       

        - alert: RedisReplicationBroken
          expr: delta(redis_connected_slaves[1m]) < 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis replication broken (instance {{ $labels.instance }})
            description: Redis instance lost a slave\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.6. Redis cluster flapping

       Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).[copy]

       

        - alert: RedisClusterFlapping
          expr: changes(redis_connected_slaves[5m]) > 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis cluster flapping (instance {{ $labels.instance }})
            description: Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.7. Redis missing backup

       Redis has not been backed up for 24 hours[copy]

       

        - alert: RedisMissingBackup
          expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis missing backup (instance {{ $labels.instance }})
            description: Redis has not been backed up for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.8. Redis out of memory

       Redis is running out of memory (> 90%)[copy]

       

        - alert: RedisOutOfMemory
          expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Redis out of memory (instance {{ $labels.instance }})
            description: Redis is running out of memory (> 90%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.9. Redis too many connections

       Redis instance has too many connections[copy]

       

        - alert: RedisTooManyConnections
          expr: redis_connected_clients > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Redis too many connections (instance {{ $labels.instance }})
            description: Redis instance has too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.10. Redis not enough connections

       Redis instance has too few connections (< 5)[copy]

       

        - alert: RedisNotEnoughConnections
          expr: redis_connected_clients < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Redis not enough connections (instance {{ $labels.instance }})
            description: Redis instance has too few connections (< 5)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.11. Redis rejected connections

       Some connections to Redis have been rejected[copy]

       

        - alert: RedisRejectedConnections
          expr: increase(redis_rejected_connections_total[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis rejected connections (instance {{ $labels.instance }})
            description: Some connections to Redis have been rejected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
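
       The severity label carried by all of these rules is what AlertManager routes on. A minimal routing sketch; the receiver names are hypothetical:

        # alertmanager.yml (fragment); receiver names are assumptions
        route:
          receiver: default
          routes:
            - match:
                severity: critical
              receiver: pager-oncall
            - match:
                severity: warning
              receiver: chat-alerts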

       


  • 2. 6. MongoDB : percona/mongodb_exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 2. 6. MongoDB : dcu/mongodb_exporter (10 rules)[copy all]

    • 2.6.1. MongoDB replication lag

       Mongodb replication lag is more than 10s[copy]

       

        - alert: MongodbReplicationLag
          expr: avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication lag (instance {{ $labels.instance }})
            description: Mongodb replication lag is more than 10s\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.2. MongoDB replication Status 3

       A MongoDB replica set member is either performing startup self-checks or transitioning from completing a rollback or resync[copy]

       

        - alert: MongodbReplicationStatus3
          expr: mongodb_replset_member_state == 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 3 (instance {{ $labels.instance }})
            description: A MongoDB replica set member is either performing startup self-checks or transitioning from completing a rollback or resync\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.3. MongoDB replication Status 6

       A MongoDB replica set member, as seen from another member of the set, is not yet known[copy]

       

        - alert: MongodbReplicationStatus6
          expr: mongodb_replset_member_state == 6
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 6 (instance {{ $labels.instance }})
            description: A MongoDB replica set member, as seen from another member of the set, is not yet known\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.4. MongoDB replication Status 8

       A MongoDB replica set member, as seen from another member of the set, is unreachable[copy]

       

        - alert: MongodbReplicationStatus8
          expr: mongodb_replset_member_state == 8
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 8 (instance {{ $labels.instance }})
            description: A MongoDB replica set member, as seen from another member of the set, is unreachable\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.5. MongoDB replication Status 9

       MongoDB Replication set member is actively performing a rollback. Data is not available for reads[copy]

       

        - alert: MongodbReplicationStatus9
          expr: mongodb_replset_member_state == 9
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 9 (instance {{ $labels.instance }})
            description: MongoDB Replication set member is actively performing a rollback. Data is not available for reads\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.6. MongoDB replication Status 10

       MongoDB Replication set member was once in a replica set but was subsequently removed[copy]

       

        - alert: MongodbReplicationStatus10
          expr: mongodb_replset_member_state == 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 10 (instance {{ $labels.instance }})
            description: MongoDB Replication set member was once in a replica set but was subsequently removed\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
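
       Rules 2.6.2 through 2.6.6 each watch one unhealthy replica set state. If one alert per state is too granular, a hedged consolidated variant (the alert name is hypothetical) fires on any of them:

        - alert: MongodbReplicationMemberUnhealthy
          expr: mongodb_replset_member_state == 3 or mongodb_replset_member_state == 6 or mongodb_replset_member_state == 8 or mongodb_replset_member_state == 9 or mongodb_replset_member_state == 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replica set member unhealthy (instance {{ $labels.instance }})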

       

    • 2.6.7. MongoDB number cursors open

       Too many cursors opened by MongoDB for clients (> 10k)[copy]

       

        - alert: MongodbNumberCursorsOpen
          expr: mongodb_metrics_cursor_open{state="total_open"} > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MongoDB number cursors open (instance {{ $labels.instance }})
            description: Too many cursors opened by MongoDB for clients (> 10k)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.8. MongoDB cursors timeouts

       Too many cursors are timing out[copy]

       

        - alert: MongodbCursorsTimeouts
          expr: increase(mongodb_metrics_cursor_timed_out_total[10m]) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MongoDB cursors timeouts (instance {{ $labels.instance }})
            description: Too many cursors are timing out\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.9. MongoDB too many connections

       Too many connections[copy]

       

        - alert: MongodbTooManyConnections
          expr: mongodb_connections{state="current"} > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MongoDB too many connections (instance {{ $labels.instance }})
            description: Too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.10. MongoDB virtual memory usage

       High virtual memory usage (virtual / mapped ratio > 3)[copy]

       

        - alert: MongodbVirtualMemoryUsage
          expr: (sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MongoDB virtual memory usage (instance {{ $labels.instance }})
            description: High virtual memory usage (virtual / mapped ratio > 3)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 7. RabbitMQ (official exporter) : rabbitmq/rabbitmq-prometheus (9 rules)[copy all]

    • 2.7.1. Rabbitmq node down

       Less than 3 nodes running in RabbitMQ cluster[copy]

       

        - alert: RabbitmqNodeDown
          expr: sum(rabbitmq_build_info) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq node down (instance {{ $labels.instance }})
            description: Less than 3 nodes running in RabbitMQ cluster\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
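
       The hard-coded 3 assumes a three-node cluster; adjust it to your topology. For a five-node cluster, for example:

        expr: sum(rabbitmq_build_info) < 5   # assumes a five-node cluster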

       

    • 2.7.2. Rabbitmq node not distributed

       Distribution link state is not 'up'[copy]

       

        - alert: RabbitmqNodeNotDistributed
          expr: erlang_vm_dist_node_state < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq node not distributed (instance {{ $labels.instance }})
            description: Distribution link state is not 'up'\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.3. Rabbitmq instances different versions

       Running different versions of RabbitMQ in the same cluster can lead to failures.[copy]

       

        - alert: RabbitmqInstancesDifferentVersions
          expr: count(count(rabbitmq_build_info) by (rabbitmq_version)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq instances different versions (instance {{ $labels.instance }})
            description: Running different versions of RabbitMQ in the same cluster can lead to failures.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.4. Rabbitmq memory high

       A node uses more than 90% of its allocated RAM[copy]

       

        - alert: RabbitmqMemoryHigh
          expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq memory high (instance {{ $labels.instance }})
            description: A node uses more than 90% of its allocated RAM\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.5. Rabbitmq file descriptors usage

       A node uses more than 90% of its file descriptors[copy]

       

        - alert: RabbitmqFileDescriptorsUsage
          expr: rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq file descriptors usage (instance {{ $labels.instance }})
            description: A node uses more than 90% of its file descriptors\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.6. Rabbitmq too much unack

       Too many unacknowledged messages[copy]

       

        - alert: RabbitmqTooMuchUnack
          expr: sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq too much unack (instance {{ $labels.instance }})
            description: Too many unacknowledged messages\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.7. Rabbitmq too much connections

       The total number of connections on a node is too high[copy]

       

        - alert: RabbitmqTooMuchConnections
          expr: rabbitmq_connections > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq too much connections (instance {{ $labels.instance }})
            description: The total number of connections on a node is too high\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.8. Rabbitmq no queue consumer

       A queue has less than 1 consumer[copy]

       

        - alert: RabbitmqNoQueueConsumer
          expr: rabbitmq_queue_consumers < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq no queue consumer (instance {{ $labels.instance }})
            description: A queue has less than 1 consumer\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.9. Rabbitmq unroutable messages

       A queue has unroutable messages[copy]

       

        - alert: RabbitmqUnroutableMessages
          expr: increase(rabbitmq_channel_messages_unroutable_returned_total[5m]) > 0 or increase(rabbitmq_channel_messages_unroutable_dropped_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq unroutable messages (instance {{ $labels.instance }})
            description: A queue has unroutable messages\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 7. RabbitMQ : kbudde/rabbitmq-exporter (11 rules)[copy all]

    • 2.7.1. Rabbitmq down

       RabbitMQ node down[copy]

       

        - alert: RabbitmqDown
          expr: rabbitmq_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq down (instance {{ $labels.instance }})
            description: RabbitMQ node down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.2. Rabbitmq cluster down

       Less than 3 nodes running in RabbitMQ cluster[copy]

       

        - alert: RabbitmqClusterDown
          expr: sum(rabbitmq_running) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq cluster down (instance {{ $labels.instance }})
            description: Less than 3 nodes running in RabbitMQ cluster\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.3. Rabbitmq cluster partition

       Cluster partition[copy]

       

        - alert: RabbitmqClusterPartition
          expr: rabbitmq_partitions > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq cluster partition (instance {{ $labels.instance }})
            description: Cluster partition\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.4. Rabbitmq out of memory

       Memory available for RabbitMQ is low (< 10%)[copy]

       

        - alert: RabbitmqOutOfMemory
          expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq out of memory (instance {{ $labels.instance }})
            description: Memory available for RabbitMQ is low (< 10%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.5. Rabbitmq too many connections

       RabbitMQ instance has too many connections (> 1000)[copy]

       

        - alert: RabbitmqTooManyConnections
          expr: rabbitmq_connectionsTotal > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq too many connections (instance {{ $labels.instance }})
            description: RabbitMQ instance has too many connections (> 1000)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.6. Rabbitmq dead letter queue filling up

       Dead letter queue is filling up (> 10 msgs)[copy]

       

        - alert: RabbitmqDeadLetterQueueFillingUp
          expr: rabbitmq_queue_messages{queue="my-dead-letter-queue"} > 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq dead letter queue filling up (instance {{ $labels.instance }})
            description: Dead letter queue is filling up (> 10 msgs)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
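
       my-dead-letter-queue is a placeholder. If your dead letter queues follow a naming convention, a single hedged variant can cover them all with a regex matcher:

        expr: rabbitmq_queue_messages{queue=~".*dead-letter.*"} > 10   # hypothetical naming convention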

       

    • 2.7.7. Rabbitmq too many messages in queue

       Queue is filling up (> 1000 msgs)[copy]

       

        - alert: RabbitmqTooManyMessagesInQueue
          expr: rabbitmq_queue_messages_ready{queue="my-queue"} > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq too many messages in queue (instance {{ $labels.instance }})
            description: Queue is filling up (> 1000 msgs)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.8. Rabbitmq slow queue consuming

       Queue messages are consumed slowly (> 60s)[copy]

       

        - alert: RabbitmqSlowQueueConsuming
          expr: time() - rabbitmq_queue_head_message_timestamp{queue="my-queue"} > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq slow queue consuming (instance {{ $labels.instance }})
            description: Queue messages are consumed slowly (> 60s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.9. Rabbitmq no consumer

       Queue has no consumer[copy]

       

        - alert: RabbitmqNoConsumer
          expr: rabbitmq_queue_consumers == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq no consumer (instance {{ $labels.instance }})
            description: Queue has no consumer\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.10. Rabbitmq too many consumers

       Queue should have only 1 consumer[copy]

       

        - alert: RabbitmqTooManyConsumers
          expr: rabbitmq_queue_consumers > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq too many consumers (instance {{ $labels.instance }})
            description: Queue should have only 1 consumer\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.11. Rabbitmq inactive exchange

       Exchange receives fewer than 5 msgs per second[copy]

       

        - alert: RabbitmqInactiveExchange
          expr: rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq inactive exchange (instance {{ $labels.instance }})
            description: Exchange receives fewer than 5 msgs per second\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 8. Elasticsearch : justwatchcom/elasticsearch_exporter (13 rules)[copy all]

    • 2.8.1. Elasticsearch Heap Usage Too High

       The heap usage is over 90% for 5m[copy]

       

        - alert: ElasticsearchHeapUsageTooHigh
          expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})
            description: The heap usage is over 90% for 5m\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.2. Elasticsearch Heap Usage warning

       The heap usage is over 80% for 5m[copy]

       

        - alert: ElasticsearchHeapUsageWarning
          expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }})
            description: The heap usage is over 80% for 5m\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.3. Elasticsearch disk space low

       The disk usage is over 80%[copy]

       

        - alert: ElasticsearchDiskSpaceLow
          expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch disk space low (instance {{ $labels.instance }})
            description: The disk usage is over 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.4. Elasticsearch disk out of space

       The disk usage is over 90%[copy]

       

        - alert: ElasticsearchDiskOutOfSpace
          expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch disk out of space (instance {{ $labels.instance }})
            description: The disk usage is over 90%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.5. Elasticsearch Cluster Red

       Elastic Cluster Red status[copy]

       

        - alert: ElasticsearchClusterRed
          expr: elasticsearch_cluster_health_status{color="red"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch Cluster Red (instance {{ $labels.instance }})
            description: Elastic Cluster Red status\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.6. Elasticsearch Cluster Yellow

       Elastic Cluster Yellow status[copy]

       

        - alert: ElasticsearchClusterYellow
          expr: elasticsearch_cluster_health_status{color="yellow"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }})
            description: Elastic Cluster Yellow status\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.7. Elasticsearch Healthy Nodes

       Number of healthy nodes is less than the expected number_of_nodes[copy]

       

        - alert: ElasticsearchHealthyNodes
          expr: elasticsearch_cluster_health_number_of_nodes < number_of_nodes
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
            description: Number of healthy nodes is less than the expected number_of_nodes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
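
       number_of_nodes in the expression is a placeholder for the expected cluster size, not a real metric, so the rule will not work as pasted. A deployable variant hard-codes the expectation, e.g. for a three-node cluster:

        expr: elasticsearch_cluster_health_number_of_nodes < 3   # assumes a 3-node cluster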

       

    • 2.8.8. Elasticsearch Healthy Data Nodes

       Number of healthy data nodes is less than the expected number_of_data_nodes[copy]

       

        - alert: ElasticsearchHealthyDataNodes
          expr: elasticsearch_cluster_health_number_of_data_nodes < number_of_data_nodes
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }})
            description: Number of healthy data nodes is less than the expected number_of_data_nodes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.9. Elasticsearch relocation shards

       Number of relocation shards for 20 min[copy]

       

        - alert: ElasticsearchRelocationShards
          expr: elasticsearch_cluster_health_relocating_shards > 0
          for: 20m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch relocation shards (instance {{ $labels.instance }})
            description: Number of relocation shards for 20 min\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.10. Elasticsearch initializing shards

       Number of initializing shards for 10 min[copy]

       

        - alert: ElasticsearchInitializingShards
          expr: elasticsearch_cluster_health_initializing_shards > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch initializing shards (instance {{ $labels.instance }})
            description: Number of initializing shards for 10 min\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.11. Elasticsearch unassigned shards

       Number of unassigned shards for 2 min[copy]

       

        - alert: ElasticsearchUnassignedShards
          expr: elasticsearch_cluster_health_unassigned_shards > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch unassigned shards (instance {{ $labels.instance }})
            description: Number of unassigned shards for 2 min\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.12. Elasticsearch pending tasks

       Number of pending tasks for 10 min. Cluster works slowly.[copy]

       

        - alert: ElasticsearchPendingTasks
          expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch pending tasks (instance {{ $labels.instance }})
            description: Number of pending tasks for 10 min. Cluster works slowly.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.13. Elasticsearch no new documents

       No new documents for 10 min![copy]

       

        - alert: ElasticsearchNoNewDocuments
          expr: rate(elasticsearch_indices_docs{es_data_node="true"}[10m]) < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch no new documents (instance {{ $labels.instance }})
            description: No new documents for 10 min!\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 9. Cassandra : instaclustr/cassandra-exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 2. 9. Cassandra : criteo/cassandra_exporter (18 rules)[copy all]

    • 2.9.1. Cassandra hints count

       Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down[copy]

       

        - alert: CassandraHintsCount
          expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra hints count (instance {{ $labels.instance }})
            description: Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.2. Cassandra compaction task pending

       Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.[copy]

       

        - alert: CassandraCompactionTaskPending
          expr: avg_over_time(cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"}[30m]) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra compaction task pending (instance {{ $labels.instance }})
            description: Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.3. Cassandra viewwrite latency

       High viewwrite latency on {{ $labels.instance }} cassandra node[copy]

       

        - alert: CassandraViewwriteLatency
          expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile",service="cas"} > 100000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra viewwrite latency (instance {{ $labels.instance }})
            description: High viewwrite latency on {{ $labels.instance }} cassandra node\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.4. Cassandra cool hacker

       Increase of Cassandra authentication failures[copy]

       

        - alert: CassandraCoolHacker
          expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra cool hacker (instance {{ $labels.instance }})
            description: Increase of Cassandra authentication failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.5. Cassandra node down

       Cassandra node down[copy]

       

        - alert: CassandraNodeDown
          expr: sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra node down (instance {{ $labels.instance }})
            description: Cassandra node down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.6. Cassandra commitlog pending tasks

       Unexpected number of Cassandra commitlog pending tasks[copy]

       

        - alert: CassandraCommitlogPendingTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
            description: Unexpected number of Cassandra commitlog pending tasks\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.7. Cassandra compaction executor blocked tasks

       Some Cassandra compaction executor tasks are blocked[copy]

       

        - alert: CassandraCompactionExecutorBlockedTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
            description: Some Cassandra compaction executor tasks are blocked\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.8. Cassandra flush writer blocked tasks

       Some Cassandra flush writer tasks are blocked[copy]

       

        - alert: CassandraFlushWriterBlockedTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
            description: Some Cassandra flush writer tasks are blocked\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.9. Cassandra repair pending tasks

       Some Cassandra repair tasks are pending[copy]

       

        - alert: CassandraRepairPendingTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra repair pending tasks (instance {{ $labels.instance }})
            description: Some Cassandra repair tasks are pending\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.10. Cassandra repair blocked tasks

       Some Cassandra repair tasks are blocked[copy]

       

        - alert: CassandraRepairBlockedTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra repair blocked tasks (instance {{ $labels.instance }})
            description: Some Cassandra repair tasks are blocked\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.11. Cassandra connection timeouts total

       Some connections between nodes are ending in timeout[copy]

       

        - alert: CassandraConnectionTimeoutsTotal
          expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
            description: Some connections between nodes are ending in timeout\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.12. Cassandra storage exceptions

       Something is going wrong with cassandra storage[copy]

       

        - alert: CassandraStorageExceptions
          expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra storage exceptions (instance {{ $labels.instance }})
            description: Something is going wrong with cassandra storage\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.13. Cassandra tombstone dump

       Too many tombstones scanned in queries[copy]

       

        - alert: CassandraTombstoneDump
          expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra tombstone dump (instance {{ $labels.instance }})
            description: Too many tombstones scanned in queries\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.14. Cassandra client request unavailable write

       Write failures have occurred because too many nodes are unavailable[copy]

       

        - alert: CassandraClientRequestUnavailableWrite
          expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
            description: Write failures have occurred because too many nodes are unavailable\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.15. Cassandra client request unavailable read

       Read failures have occurred because too many nodes are unavailable[copy]

       

        - alert: CassandraClientRequestUnavailableRead
          expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
            description: Read failures have occurred because too many nodes are unavailable\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.16. Cassandra client request write failure

       A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.[copy]

       

        - alert: CassandraClientRequestWriteFailure
          expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra client request write failure (instance {{ $labels.instance }})
            description: A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.17. Cassandra client request read failure

       A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.[copy]

       

        - alert: CassandraClientRequestReadFailure
          expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra client request read failure (instance {{ $labels.instance }})
            description: A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.18. Cassandra cache hit rate key cache

       Key cache hit rate is below 85%[copy]

       

        - alert: CassandraCacheHitRateKeyCache
          expr: cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }})
            description: Key cache hit rate is below 85%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 10. Zookeeper : cloudflare/kafka_zookeeper_exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 2. 11. Kafka : danielqsj/kafka_exporter (2 rules)[copy all]

    • 2.11.1. Kafka topics replicas

       Kafka topic has fewer than 3 in-sync replicas[copy]

       

        - alert: KafkaTopicsReplicas
          expr: sum(kafka_topic_partition_in_sync_replica) by (topic) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kafka topics replicas (instance {{ $labels.instance }})
            description: Kafka topic has fewer than 3 in-sync replicas\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.11.2. Kafka consumers group

       Kafka consumer group lag is too high (> 50)[copy]

       

        - alert: KafkaConsumersGroup
          expr: sum(kafka_consumergroup_lag) by (consumergroup) > 50
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kafka consumers group (instance {{ $labels.instance }})
            description: Kafka consumer group lag is too high (> 50)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
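
       For faster triage, lag can also be broken down per topic, since the exporter labels kafka_consumergroup_lag with both consumergroup and topic; a hedged variant:

        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 50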

       


  • 3. 1. Nginx : nginx-lua-prometheus (3 rules)[copy all]

    • 3.1.1. Nginx high HTTP 4xx error rate

       Too many HTTP requests with status 4xx (> 5%)[copy]

       

        - alert: NginxHighHttp4xxErrorRate
          expr: sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 4xx (> 5%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.1.2. Nginx high HTTP 5xx error rate

       Too many HTTP requests with status 5xx (> 5%)[copy]

       

        - alert: NginxHighHttp5xxErrorRate
          expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 5xx (> 5%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.1.3. Nginx latency high

       Nginx p99 latency is higher than 10 seconds[copy]

       

        - alert: NginxLatencyHigh
          expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node)) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Nginx latency high (instance {{ $labels.instance }})
            description: Nginx p99 latency is higher than 10 seconds\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
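
       Whatever thresholds you settle on, rule files can be validated before reloading Prometheus; a minimal sketch (the file path is an assumption):

        promtool check rules /etc/prometheus/alerts/nginx.rules.yml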

       


  • 3. 2. Apache : Lusitaniae/apache_exporter (3 rules)[copy all]

    • 3.2.1. Apache down

       Apache down[copy]

       

        - alert: ApacheDown
          expr: apache_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Apache down (instance {{ $labels.instance }})
            description: Apache down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.2.2. Apache workers load

       Apache workers in busy state approach the max workers count (80% busy) on {{ $labels.instance }}[copy]

       

        - alert: ApacheWorkersLoad
          expr: (sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Apache workers load (instance {{ $labels.instance }})
            description: Apache workers in busy state approach the max workers count (80% busy) on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.2.3. Apache restart

       Apache has just been restarted, less than one minute ago.[copy]

       

        - alert: ApacheRestart
          expr: apache_uptime_seconds_total / 60 < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Apache restart (instance {{ $labels.instance }})
            description: Apache has just been restarted, less than one minute ago.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 3. 3. HaProxy : Embedded exporter (HAProxy >= v2)

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 3. 3. HaProxy : prometheus/haproxy_exporter (HAProxy < v2) (16 rules)[copy all]

    • 3.3.1. HAProxy down

       HAProxy down[copy]

       

        - alert: HaproxyDown
          expr: haproxy_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy down (instance {{ $labels.instance }})
            description: HAProxy down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.2. HAProxy high HTTP 4xx error rate backend

       Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}[copy]

       

        - alert: HaproxyHighHttp4xxErrorRateBackend
          expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
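
       Note the parenthesization in the expression above: PromQL aggregation operators take their argument in parentheses, so the valid form is sum by (backend) (rate(...)), never sum by (backend) rate(...). The same form applies to rules 3.3.3 through 3.3.8 and 3.3.10 below:

        # valid PromQL aggregation over a rate
        sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m]))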

       

    • 3.3.3. HAProxy high HTTP 5xx error rate backend

       Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}[copy]

       

        - alert: HaproxyHighHttp5xxErrorRateBackend
          expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.4. HAProxy high HTTP 4xx error rate server

       Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}[copy]

       

        - alert: HaproxyHighHttp4xxErrorRateServer
          expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.5. HAProxy high HTTP 5xx error rate server

       Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}[copy]

       

        - alert: HaproxyHighHttp5xxErrorRateServer
          expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.6. HAProxy server response errors

       Too many response errors to {{ $labels.server }} server (> 5%).[copy]

       

        - alert: HaproxyServerResponseErrors
          expr: sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy server response errors (instance {{ $labels.instance }})
            description: Too many response errors to {{ $labels.server }} server (> 5%).\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.7. HAProxy backend connection errors

       Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.[copy]

       

        - alert: HaproxyBackendConnectionErrors
          expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy backend connection errors (instance {{ $labels.instance }})
            description: Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.8. HAProxy server connection errors

       Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.[copy]

       

        - alert: HaproxyServerConnectionErrors
          expr: sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy server connection errors (instance {{ $labels.instance }})
            description: Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.9. HAProxy backend max active session

       HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).[copy]

       

        - alert: HaproxyBackendMaxActiveSession
          expr: avg_over_time((sum by (backend) (haproxy_server_max_sessions) / sum by (backend) (haproxy_server_limit_sessions))[2m:]) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy backend max active session (instance {{ $labels.instance }})
            description: HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
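
       The [2m:] selector in this expression is a PromQL subquery: it re-evaluates the inner ratio over the last 2 minutes (at the default resolution) so that avg_over_time can smooth a computed expression rather than a raw series. Schematically:

        # <instant expression>[<range>:<resolution>]  -- resolution may be omitted
        avg_over_time(
          (
              sum by (backend) (haproxy_server_max_sessions)
            / sum by (backend) (haproxy_server_limit_sessions)
          )[2m:]
        ) * 100 > 80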

       

    • 3.3.10. HAProxy pending requests

       Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend[copy]

       

        - alert: HaproxyPendingRequests
          expr: sum by (backend) (haproxy_backend_current_queue) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy pending requests (instance {{ $labels.instance }})
            description: Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.11. HAProxy HTTP slowing down

       Average request time is above 2 seconds[copy]

       

        - alert: HaproxyHttpSlowingDown
          expr: avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy HTTP slowing down (instance {{ $labels.instance }})
            description: Average request time is above 2 seconds\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.12. HAProxy retry high

       High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend[copy]

       

        - alert: HaproxyRetryHigh
          expr: sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy retry high (instance {{ $labels.instance }})
            description: High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
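
       rate() must be applied to the raw counter before aggregation: it expects a range vector, and summing first would also hide per-series counter resets. The idiomatic ordering is therefore rate, then sum:

        # per-series rate over a 1m window, then aggregated per backend
        sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10
        # rate(sum by (backend) (haproxy_backend_retry_warnings_total)) would not even parse:
        # rate() takes a range vector, and sum() returns an instant vector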

       

    • 3.3.13. HAProxy backend down

       HAProxy backend is down[copy]

       

        - alert: HaproxyBackendDown
          expr: haproxy_backend_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy backend down (instance {{ $labels.instance }})
            description: HAProxy backend is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.14. HAProxy server down

       HAProxy server is down[copy]

       

        - alert: HaproxyServerDown
          expr: haproxy_server_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy server down (instance {{ $labels.instance }})
            description: HAProxy server is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.15. HAProxy frontend security blocked requests

       HAProxy is blocking requests for security reasons[copy]

       

        - alert: HaproxyFrontendSecurityBlockedRequests
          expr: sum by (frontend) (rate(haproxy_frontend_requests_denied_total[1m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }})
            description: HAProxy is blocking requests for security reasons\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.16. HAProxy server healthcheck failure

       Some server healthchecks are failing on {{ $labels.server }}[copy]

       

        - alert: HaproxyServerHealthcheckFailure
          expr: increase(haproxy_server_check_failures_total[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy server healthcheck failure (instance {{ $labels.instance }})
            description: Some server healthchecks are failing on {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 3. 4. Traefik : Embedded exporter (3 rules)[copy all]

    • 3.4.1. Traefik backend down

       All Traefik backends are down[copy]

       

        - alert: TraefikBackendDown
          expr: count(traefik_backend_server_up) by (backend) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik backend down (instance {{ $labels.instance }})
            description: All Traefik backends are down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.4.2. Traefik high HTTP 4xx error rate backend

       Traefik backend 4xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp4xxErrorRateBackend
          expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik high HTTP 4xx error rate backend (instance {{ $labels.instance }})
            description: Traefik backend 4xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.4.3. Traefik high HTTP 5xx error rate backend

       Traefik backend 5xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp5xxErrorRateBackend
          expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik high HTTP 5xx error rate backend (instance {{ $labels.instance }})
            description: Traefik backend 5xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 3. 5. Traefik : Embedded exporter v2 (3 rules)[copy all]

    • 3.5.1. Traefik service down

       All Traefik services are down[copy]

       

        - alert: TraefikServiceDown
          expr: count(traefik_service_server_up) by (service) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik service down (instance {{ $labels.instance }})
            description: All Traefik services are down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.5.2. Traefik high HTTP 4xx error rate service

       Traefik service 4xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp4xxErrorRateService
          expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik high HTTP 4xx error rate service (instance {{ $labels.instance }})
            description: Traefik service 4xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.5.3. Traefik high HTTP 5xx error rate service

       Traefik service 5xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp5xxErrorRateService
          expr: sum(rate(traefik_service_requests_total{code=~"5.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik high HTTP 5xx error rate service (instance {{ $labels.instance }})
            description: Traefik service 5xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 4. 1. PHP-FPM : bakins/php-fpm-exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 4. 2. JVM : java-client (1 rule)[copy all]

    • 4.2.1. JVM memory filling up

       JVM memory is filling up (> 80%)[copy]

       

        - alert: JvmMemoryFillingUp
          expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: JVM memory filling up (instance {{ $labels.instance }})
            description: JVM memory is filling up (> 80%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
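
       The division works because PromQL binary operators match series on identical label sets: only the jvm_memory_bytes_used series carrying area="heap" find a partner on the filtered right-hand side. An equivalent, more explicit variant filters both operands:

        jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.8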

       


  • 4. 3. Sidekiq : Strech/sidekiq-prometheus-exporter (2 rules)[copy all]

    • 4.3.1. Sidekiq queue size

       Sidekiq queue {{ $labels.name }} is growing[copy]

       

        - alert: SidekiqQueueSize
          expr: sidekiq_queue_size > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Sidekiq queue size (instance {{ $labels.instance }})
            description: Sidekiq queue {{ $labels.name }} is growing\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 4.3.2. Sidekiq scheduling latency too high

       Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.[copy]

       

        - alert: SidekiqSchedulingLatencyTooHigh
          expr: max(sidekiq_queue_latency) > 120
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }})
            description: Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 1. Kubernetes : kube-state-metrics (32 rules)[copy all]

    • 5.1.1. Kubernetes Node ready

       Node {{ $labels.node }} has been unready for a long time[copy]

       

        - alert: KubernetesNodeReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes Node ready (instance {{ $labels.instance }})
            description: Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.2. Kubernetes memory pressure

       {{ $labels.node }} has MemoryPressure condition[copy]

       

        - alert: KubernetesMemoryPressure
          expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes memory pressure (instance {{ $labels.instance }})
            description: "{{ $labels.node }} has MemoryPressure condition\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.1.3. Kubernetes disk pressure

       {{ $labels.node }} has DiskPressure condition[copy]

       

        - alert: KubernetesDiskPressure
          expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes disk pressure (instance {{ $labels.instance }})
            description: "{{ $labels.node }} has DiskPressure condition\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.1.4. Kubernetes out of disk

       {{ $labels.node }} has OutOfDisk condition[copy]

       

        - alert: KubernetesOutOfDisk
          expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes out of disk (instance {{ $labels.instance }})
            description: "{{ $labels.node }} has OutOfDisk condition\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.1.5. Kubernetes out of capacity

       {{ $labels.node }} is out of capacity[copy]

       

        - alert: KubernetesOutOfCapacity
          expr: sum(kube_pod_info) by (node) / sum(kube_node_status_allocatable_pods) by (node) * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes out of capacity (instance {{ $labels.instance }})
            description: "{{ $labels.node }} is out of capacity\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.1.6. Kubernetes Job failed

       Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete[copy]

       

        - alert: KubernetesJobFailed
          expr: kube_job_status_failed > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes Job failed (instance {{ $labels.instance }})
            description: Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.7. Kubernetes CronJob suspended

       CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended[copy]

       

        - alert: KubernetesCronjobSuspended
          expr: kube_cronjob_spec_suspend != 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes CronJob suspended (instance {{ $labels.instance }})
            description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.8. Kubernetes PersistentVolumeClaim pending

       PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending[copy]

       

        - alert: KubernetesPersistentvolumeclaimPending
          expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }})
            description: PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.9. Kubernetes Volume out of disk space

       Volume is almost full (< 10% left)[copy]

       

        - alert: KubernetesVolumeOutOfDiskSpace
          expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }})
            description: Volume is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.10. Kubernetes Volume full in four days

       {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days.[copy]

       

        - alert: KubernetesVolumeFullInFourDays
          expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes Volume full in four days (instance {{ $labels.instance }})
            description: "{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
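
       predict_linear fits a linear trend to the last 6 hours of the series and extrapolates 4 * 24 * 3600 seconds (four days) ahead; a negative result means the volume is projected to be full by then. Note that {{ $value }} is therefore the extrapolated available bytes, not a percentage:

        # fires when the 6h linear trend predicts no free bytes within 4 days
        predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0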

       

    • 5.1.11. Kubernetes PersistentVolume error

       Persistent volume is in bad state[copy]

       

        - alert: KubernetesPersistentvolumeError
          expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }})
            description: Persistent volume is in bad state\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.12. Kubernetes StatefulSet down

       A StatefulSet went down[copy]

       

        - alert: KubernetesStatefulsetDown
          expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes StatefulSet down (instance {{ $labels.instance }})
            description: A StatefulSet went down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.13. Kubernetes HPA scaling ability

       Pod is unable to scale[copy]

       

        - alert: KubernetesHpaScalingAbility
          expr: kube_hpa_status_condition{status="false", condition ="AbleToScale"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes HPA scaling ability (instance {{ $labels.instance }})
            description: Pod is unable to scale\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.14. Kubernetes HPA metric availability

       HPA is not able to collect metrics[copy]

       

        - alert: KubernetesHpaMetricAvailability
          expr: kube_hpa_status_condition{status="false", condition="ScalingActive"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes HPA metric availability (instance {{ $labels.instance }})
            description: HPA is not able to collect metrics\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.15. Kubernetes HPA scale capability

       The maximum number of desired Pods has been hit[copy]

       

        - alert: KubernetesHpaScaleCapability
          expr: kube_hpa_status_desired_replicas >= kube_hpa_spec_max_replicas
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes HPA scale capability (instance {{ $labels.instance }})
            description: The maximum number of desired Pods has been hit\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.16. Kubernetes Pod not healthy

       Pod has been in a non-ready state for longer than an hour.[copy]

       

        - alert: KubernetesPodNotHealthy
          expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
            description: Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.17. Kubernetes pod crash looping

       Pod {{ $labels.pod }} is crash looping[copy]

       

        - alert: KubernetesPodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
            description: Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
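
       The multiplier converts the per-second restart rate into a per-5-minutes count: rate(...[15m]) yields restarts per second averaged over 15 minutes, and * 60 * 5 rescales that to restarts per 5 minutes, so the rule fires above roughly 5 restarts per 5 minutes:

        # restarts/s (15m average) * 300 s  =  restarts per 5 minutes
        rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5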

       

    • 5.1.18. Kubernetes ReplicaSet mismatch

       ReplicaSet replicas mismatch[copy]

       

        - alert: KubernetesReplicassetMismatch
          expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }})
            description: ReplicaSet replicas mismatch\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.19. Kubernetes Deployment replicas mismatch

       Deployment Replicas mismatch[copy]

       

        - alert: KubernetesDeploymentReplicasMismatch
          expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
            description: Deployment Replicas mismatch\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.20. Kubernetes StatefulSet replicas mismatch

       A StatefulSet has not matched the expected number of replicas for longer than 5 minutes.[copy]

       

        - alert: KubernetesStatefulsetReplicasMismatch
          expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }})
            description: A StatefulSet has not matched the expected number of replicas for longer than 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.21. Kubernetes Deployment generation mismatch

       A Deployment has failed but has not been rolled back.[copy]

       

        - alert: KubernetesDeploymentGenerationMismatch
          expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }})
            description: A Deployment has failed but has not been rolled back.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.22. Kubernetes StatefulSet generation mismatch

       A StatefulSet has failed but has not been rolled back.[copy]

       

        - alert: KubernetesStatefulsetGenerationMismatch
          expr: kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }})
            description: A StatefulSet has failed but has not been rolled back.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.23. Kubernetes StatefulSet update not rolled out

       StatefulSet update has not been rolled out.[copy]

       

        - alert: KubernetesStatefulsetUpdateNotRolledOut
          expr: max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }})
            description: StatefulSet update has not been rolled out.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.24. Kubernetes DaemonSet rollout stuck

       Some Pods of DaemonSet are not scheduled or not ready[copy]

       

        - alert: KubernetesDaemonsetRolloutStuck
          expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})
            description: Some Pods of DaemonSet are not scheduled or not ready\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.25. Kubernetes DaemonSet misscheduled

       Some DaemonSet Pods are running where they are not supposed to run[copy]

       

        - alert: KubernetesDaemonsetMisscheduled
          expr: kube_daemonset_status_number_misscheduled > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }})
            description: Some DaemonSet Pods are running where they are not supposed to run\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.26. Kubernetes CronJob too long

       CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.[copy]

       

        - alert: KubernetesCronjobTooLong
          expr: time() - kube_cronjob_next_schedule_time > 3600
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes CronJob too long (instance {{ $labels.instance }})
            description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.27. Kubernetes job completion

       Kubernetes Job failed to complete[copy]

       

        - alert: KubernetesJobCompletion
          expr: kube_job_spec_completions - kube_job_status_succeeded > 0 or kube_job_status_failed > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes job completion (instance {{ $labels.instance }})
            description: Kubernetes Job failed to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.28. Kubernetes API server errors

       Kubernetes API server is experiencing high error rate[copy]

       

        - alert: KubernetesApiServerErrors
          expr: sum(rate(apiserver_request_count{job="apiserver",code=~"^(?:5..)$"}[2m])) / sum(rate(apiserver_request_count{job="apiserver"}[2m])) * 100 > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes API server errors (instance {{ $labels.instance }})
            description: Kubernetes API server is experiencing high error rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.29. Kubernetes API client errors

       Kubernetes API client is experiencing high error rate[copy]

       

        - alert: KubernetesApiClientErrors
          expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[2m])) by (instance, job) / sum(rate(rest_client_requests_total[2m])) by (instance, job)) * 100 > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes API client errors (instance {{ $labels.instance }})
            description: Kubernetes API client is experiencing high error rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.30. Kubernetes client certificate expires next week

       A client certificate used to authenticate to the apiserver is expiring next week.[copy]

       

        - alert: KubernetesClientCertificateExpiresNextWeek
          expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }})
            description: A client certificate used to authenticate to the apiserver is expiring next week.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
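
       The first conjunct keeps the rule silent when no client certificates are observed at all; the second takes the 1st percentile (histogram_quantile(0.01, ...)) of remaining certificate lifetime, i.e. the soonest-expiring certificates, and compares it against 7*24*60*60 = 604800 seconds. The critical variant below is the same shape with 24*60*60 = 86400 seconds:

        # 1st percentile of remaining client-certificate lifetime
        histogram_quantile(0.01,
          sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))
        ) < 7 * 24 * 60 * 60  # 604800 seconds = one week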

       

    • 5.1.31. Kubernetes client certificate expires soon

       A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.[copy]

       

        - alert: KubernetesClientCertificateExpiresSoon
          expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }})
            description: A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.32. Kubernetes API server latency

       Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.[copy]

       

        - alert: KubernetesApiServerLatency
          expr: histogram_quantile(0.99, sum(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"}) WITHOUT (instance, resource)) / 1e+06 > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes API server latency (instance {{ $labels.instance }})
            description: Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 2. Nomad : Embedded exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 5. 3. Consul : prometheus/consul_exporter (3 rules)[copy all]

    • 5.3.1. Consul service healthcheck failed

       Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`[copy]

       

        - alert: ConsulServiceHealthcheckFailed
          expr: consul_catalog_service_node_healthy == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Consul service healthcheck failed (instance {{ $labels.instance }})
            description: "Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.3.2. Consul missing master node

       The number of Consul raft peers should be 3 in order to preserve quorum.[copy]

       

        - alert: ConsulMissingMasterNode
          expr: consul_raft_peers < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Consul missing master node (instance {{ $labels.instance }})
            description: The number of Consul raft peers should be 3 in order to preserve quorum.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
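
       Raft quorum is floor(N/2) + 1, so a 3-node cluster keeps quorum (2 of 3) through one failure, while 2 peers still need 2 votes and tolerate none; hence alerting as soon as the peer count drops below 3:

        # quorum(N) = floor(N/2) + 1
        # N=3 -> quorum 2, tolerates 1 failure;  N=2 -> quorum 2, tolerates 0
        consul_raft_peers < 3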

       

    • 5.3.3. Consul agent unhealthy

       A Consul agent is down[copy]

       

        - alert: ConsulAgentUnhealthy
          expr: consul_health_node_status{status="critical"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Consul agent unhealthy (instance {{ $labels.instance }})
            description: A Consul agent is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 4. Etcd (13 rules)[copy all]

    • 5.4.1. Etcd insufficient Members

       Etcd cluster should have an odd number of members[copy]

       

        - alert: EtcdInsufficientMembers
          expr: count(etcd_server_id) % 2 == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Etcd insufficient Members (instance {{ $labels.instance }})
            description: Etcd cluster should have an odd number of members\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.2. Etcd no Leader

       Etcd cluster has no leader[copy]

       

        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Etcd no Leader (instance {{ $labels.instance }})
            description: Etcd cluster has no leader\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.3. Etcd high number of leader changes

       Etcd leader changed more than 3 times during last hour[copy]

       

        - alert: EtcdHighNumberOfLeaderChanges
          expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of leader changes (instance {{ $labels.instance }})
            description: Etcd leader changed more than 3 times during last hour\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.4. Etcd high number of failed GRPC requests

       More than 1% GRPC request failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedGrpcRequests
          expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
            description: More than 1% GRPC request failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.5. Etcd high number of failed GRPC requests

       More than 5% GRPC request failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedGrpcRequests
          expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
            description: More than 5% GRPC request failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.6. Etcd GRPC requests slow

       GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes[copy]

       

        - alert: EtcdGrpcRequestsSlow
          expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) > 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd GRPC requests slow (instance {{ $labels.instance }})
            description: GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.7. Etcd high number of failed HTTP requests

       More than 1% HTTP failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedHttpRequests
          expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
            description: More than 1% HTTP failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.8. Etcd high number of failed HTTP requests

       More than 5% HTTP failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedHttpRequests
          expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
            description: More than 5% HTTP failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.9. Etcd HTTP requests slow

       HTTP requests slowing down, 99th percentile is over 0.15s for 5 minutes[copy]

       

        - alert: EtcdHttpRequestsSlow
          expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd HTTP requests slow (instance {{ $labels.instance }})
            description: HTTP requests slowing down, 99th percentile is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.10. Etcd member communication slow

       Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes[copy]

       

        - alert: EtcdMemberCommunicationSlow
          expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd member communication slow (instance {{ $labels.instance }})
            description: Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.11. Etcd high number of failed proposals

       Etcd server got more than 5 failed proposals past hour[copy]

       

        - alert: EtcdHighNumberOfFailedProposals
          expr: increase(etcd_server_proposals_failed_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of failed proposals (instance {{ $labels.instance }})
            description: Etcd server got more than 5 failed proposals past hour\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.12. Etcd high fsync durations

       Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes[copy]

       

        - alert: EtcdHighFsyncDurations
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high fsync durations (instance {{ $labels.instance }})
            description: Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.13. Etcd high commit durations

       Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes[copy]

       

        - alert: EtcdHighCommitDurations
          expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high commit durations (instance {{ $labels.instance }})
            description: Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 5. Linkerd : Embedded exporter (1 rule)[copy all]

    • 5.5.1. Linkerd high error rate

       Linkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%[copy]

       

        - alert: LinkerdHighErrorRate
          expr: sum(rate(request_errors_total[5m])) by (deployment, statefulset, daemonset) / sum(rate(request_total[5m])) by (deployment, statefulset, daemonset) * 100 > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Linkerd high error rate (instance {{ $labels.instance }})
            description: Linkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 6. Istio

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 6. 1. Ceph : Embedded exporter (13 rules)[copy all]

    • 6.1.1. Ceph State

       Ceph instance unhealthy[copy]

       

        - alert: CephState
          expr: ceph_health_status != 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph State (instance {{ $labels.instance }})
            description: Ceph instance unhealthy\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.2. Ceph monitor clock skew

       Ceph monitor clock skew detected. Please check ntp and hardware clock settings[copy]

       

        - alert: CephMonitorClockSkew
          expr: abs(ceph_monitor_clock_skew_seconds) > 0.2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph monitor clock skew (instance {{ $labels.instance }})
            description: Ceph monitor clock skew detected. Please check ntp and hardware clock settings\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.3. Ceph monitor low space

       Ceph monitor storage is low.[copy]

       

        - alert: CephMonitorLowSpace
          expr: ceph_monitor_avail_percent < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph monitor low space (instance {{ $labels.instance }})
            description: Ceph monitor storage is low.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.4. Ceph OSD Down

       Ceph Object Storage Daemon Down[copy]

       

        - alert: CephOsdDown
          expr: ceph_osd_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph OSD Down (instance {{ $labels.instance }})
            description: Ceph Object Storage Daemon Down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.5. Ceph high OSD latency

       Ceph Object Storage Daemon latency is high. Please check if it is stuck in a weird state.[copy]

       

        - alert: CephHighOsdLatency
          expr: ceph_osd_perf_apply_latency_seconds > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph high OSD latency (instance {{ $labels.instance }})
            description: Ceph Object Storage Daemon latency is high. Please check if it is stuck in a weird state.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.6. Ceph OSD low space

       Ceph Object Storage Daemon is going out of space. Please add more disks.[copy]

       

        - alert: CephOsdLowSpace
          expr: ceph_osd_utilization > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph OSD low space (instance {{ $labels.instance }})
            description: Ceph Object Storage Daemon is going out of space. Please add more disks.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.7. Ceph OSD reweighted

       Ceph Object Storage Daemon is taking too much time to resize.[copy]

       

        - alert: CephOsdReweighted
          expr: ceph_osd_weight < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph OSD reweighted (instance {{ $labels.instance }})
            description: Ceph Object Storage Daemon is taking too much time to resize.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.8. Ceph PG down

       Some Ceph placement groups are down. Please ensure that all the data are available.[copy]

       

        - alert: CephPgDown
          expr: ceph_pg_down > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph PG down (instance {{ $labels.instance }})
            description: Some Ceph placement groups are down. Please ensure that all the data are available.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.9. Ceph PG incomplete

       Some Ceph placement groups are incomplete. Please ensure that all the data are available.[copy]

       

        - alert: CephPgIncomplete
          expr: ceph_pg_incomplete > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph PG incomplete (instance {{ $labels.instance }})
            description: Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.10. Ceph PG inconsistent

       Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.[copy]

       

        - alert: CephPgInconsistent
          expr: ceph_pg_inconsistent > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph PG inconsistent (instance {{ $labels.instance }})
            description: Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.11. Ceph PG activation long

       Some Ceph placement groups are taking too long to activate.[copy]

       

        - alert: CephPgActivationLong
          expr: ceph_pg_activating > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph PG activation long (instance {{ $labels.instance }})
            description: Some Ceph placement groups are taking too long to activate.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.12. Ceph PG backfill full

       Some Ceph placement groups are located on a full Object Storage Daemon. These PGs may become unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.[copy]

       

        - alert: CephPgBackfillFull
          expr: ceph_pg_backfill_toofull > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph PG backfill full (instance {{ $labels.instance }})
            description: Some Ceph placement groups are located on a full Object Storage Daemon. These PGs may become unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.13. Ceph PG unavailable

       Some Ceph placement groups are unavailable.[copy]

       

        - alert: CephPgUnavailable
          expr: ceph_pg_total - ceph_pg_active > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph PG unavailable (instance {{ $labels.instance }})
            description: Some Ceph placement groups are unavailable.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 2. SpeedTest : Speedtest exporter (2 rules)[copy all]

    • 6.2.1. SpeedTest Slow Internet Download

       Internet download speed is currently {{humanize $value}} Mbps.[copy]

       

        - alert: SpeedtestSlowInternetDownload
          expr: avg_over_time(speedtest_download[30m]) < 75
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: SpeedTest Slow Internet Download (instance {{ $labels.instance }})
            description: Internet download speed is currently {{humanize $value}} Mbps.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.2.2. SpeedTest Slow Internet Upload

       Internet upload speed is currently {{humanize $value}} Mbps.[copy]

       

        - alert: SpeedtestSlowInternetUpload
          expr: avg_over_time(speedtest_upload[30m]) < 20 
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: SpeedTest Slow Internet Upload (instance {{ $labels.instance }})
            description: Internet upload speed is currently {{humanize $value}} Mbps.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 3. ZFS : node-exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 6. 4. OpenEBS : Embedded exporter (1 rule)[copy all]

    • 6.4.1. OpenEBS used pool capacity

       OpenEBS Pool uses more than 80% of its capacity[copy]

       

        - alert: OpenebsUsedPoolCapacity
          expr: (openebs_used_pool_capacity_percent) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: OpenEBS used pool capacity (instance {{ $labels.instance }})
            description: OpenEBS Pool uses more than 80% of its capacity\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 5. Minio : Embedded exporter (2 rules)[copy all]

    • 6.5.1. Minio disk offline

       Minio disk is offline[copy]

       

        - alert: MinioDiskOffline
          expr: minio_offline_disks > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Minio disk offline (instance {{ $labels.instance }})
            description: Minio disk is offline\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.5.2. Minio storage space exhausted

       Minio storage space is low (< 10 GB)[copy]

       

        - alert: MinioStorageSpaceExhausted
          expr: minio_disk_storage_free_bytes / 1024 / 1024 / 1024 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Minio storage space exhausted (instance {{ $labels.instance }})
            description: Minio storage space is low (< 10 GB)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 6. Juniper : czerwonk/junos_exporter (3 rules)[copy all]

    • 6.6.1. Juniper switch down

       The switch appears to be down[copy]

       

        - alert: JuniperSwitchDown
          expr: junos_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Juniper switch down (instance {{ $labels.instance }})
            description: The switch appears to be down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.6.2. Juniper high Bandwidth Usage 1GiB

       Interface is highly saturated for at least 1 min. (> 0.90 Gbit/s)[copy]

       

        - alert: JuniperHighBandwidthUsage1gib
          expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }})
            description: Interface is highly saturated for at least 1 min. (> 0.90 Gbit/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.6.3. Juniper high Bandwidth Usage 1GiB

       Interface is getting saturated for at least 1 min. (> 0.80 Gbit/s)[copy]

       

        - alert: JuniperHighBandwidthUsage1gib
          expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }})
            description: Interface is getting saturated for at least 1 min. (> 0.80 Gbit/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 7. CoreDNS : Embedded exporter (1 rule)[copy all]

    • 6.7.1. CoreDNS Panic Count

       Number of CoreDNS panics encountered[copy]

       

        - alert: CorednsPanicCount
          expr: increase(coredns_panic_count_total[10m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: CoreDNS Panic Count (instance {{ $labels.instance }})
            description: Number of CoreDNS panics encountered\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 7. 1. Thanos (3 rules)[copy all]

    • 7.1.1. Thanos compaction halted

       Thanos compaction has failed to run and is now halted.[copy]

       

        - alert: ThanosCompactionHalted
          expr: thanos_compactor_halted == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Thanos compaction halted (instance {{ $labels.instance }})
            description: Thanos compaction has failed to run and is now halted.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 7.1.2. Thanos compact bucket operation failure

       Thanos compaction has failing storage operations[copy]

       

        - alert: ThanosCompactBucketOperationFailure
          expr: rate(thanos_objstore_bucket_operation_failures_total[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Thanos compact bucket operation failure (instance {{ $labels.instance }})
            description: Thanos compaction has failing storage operations\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 7.1.3. Thanos compact not run

       Thanos compaction has not run in 24 hours.[copy]

       

        - alert: ThanosCompactNotRun
          expr: (time() - thanos_objstore_bucket_last_successful_upload_time) > 24*60*60
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Thanos compact not run (instance {{ $labels.instance }})
            description: Thanos compaction has not run in 24 hours.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
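
       Since every rule on this page carries a severity label of warning or critical, a matching Alertmanager route can fan the two levels out to different receivers. A minimal sketch, with placeholder receiver names and no notification integrations configured:

        route:
          receiver: default
          routes:
            - match:
                severity: critical
              receiver: pager
            - match:
                severity: warning
              receiver: mail
        receivers:
          - name: default
          - name: pager
          - name: mail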

       


Awesome Prometheus alerts is maintained by samber.
