Awesome Prometheus alerts

Source: http://t.zoukankan.com/shoufu-p-14110485.html

Reposted from https://awesome-prometheus-alerts.grep.to/rules#host-and-hardware

Collection of alerting rules


⚠️ Caution ⚠️

Alert thresholds depend on the nature of your applications.
Some queries on this page use arbitrary tolerance thresholds.

Building an efficient and battle-tested monitoring platform takes time. 😉
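
The snippets below are bare rule entries. To load one into Prometheus, wrap it in a `groups:` block inside a rules file and reference that file from `prometheus.yml`. A minimal sketch (the file path and group name are placeholders, not part of the original collection):

        # /etc/prometheus/rules/example.rules.yml (hypothetical path)
        groups:
          - name: example
            rules:
              - alert: PrometheusJobMissing
                expr: absent(up{job="prometheus"})
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: Prometheus job missing (instance {{ $labels.instance }})
                  description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

        # prometheus.yml (fragment)
        rule_files:
          - /etc/prometheus/rules/*.rules.yml

A rules file can be validated with `promtool check rules <file>` before reloading Prometheus.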



 

  • 1.1. Prometheus self-monitoring (25 rules)

    • 1.1.1. Prometheus job missing

       A Prometheus job has disappeared

       

        - alert: PrometheusJobMissing
          expr: absent(up{job="prometheus"})
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus job missing (instance {{ $labels.instance }})
            description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.2. Prometheus target missing

       A Prometheus target has disappeared. An exporter might have crashed.

       

        - alert: PrometheusTargetMissing
          expr: up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus target missing (instance {{ $labels.instance }})
            description: "A Prometheus target has disappeared. An exporter might have crashed.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.3. Prometheus all targets missing

       A Prometheus job no longer has any living targets.

       

        - alert: PrometheusAllTargetsMissing
          expr: count by (job) (up) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus all targets missing (instance {{ $labels.instance }})
            description: "A Prometheus job no longer has any living targets.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.4. Prometheus configuration reload failure

       Prometheus configuration reload error

       

        - alert: PrometheusConfigurationReloadFailure
          expr: prometheus_config_last_reload_successful != 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
            description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.5. Prometheus too many restarts

       Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.

       

        - alert: PrometheusTooManyRestarts
          expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus too many restarts (instance {{ $labels.instance }})
            description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.6. Prometheus AlertManager configuration reload failure

       AlertManager configuration reload error

       

        - alert: PrometheusAlertmanagerConfigurationReloadFailure
          expr: alertmanager_config_last_reload_successful != 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
            description: "AlertManager configuration reload error\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.7. Prometheus AlertManager config not synced

       Configurations of AlertManager cluster instances are out of sync

       

        - alert: PrometheusAlertmanagerConfigNotSynced
          expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
            description: "Configurations of AlertManager cluster instances are out of sync\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.8. Prometheus AlertManager E2E dead man switch

       Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.

       

        - alert: PrometheusAlertmanagerE2eDeadManSwitch
          expr: vector(1)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
            description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
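
       An always-firing alert only provides an end-to-end test if something notices when it stops arriving. A common companion is an Alertmanager route that forwards the watchdog to an external dead man's switch with a short repeat_interval. The sketch below is illustrative; the receiver URL is a placeholder:

        # alertmanager.yml (fragment)
        route:
          routes:
            - match:
                alertname: PrometheusAlertmanagerE2eDeadManSwitch
              receiver: deadmansswitch
              repeat_interval: 5m
        receivers:
          - name: deadmansswitch
            webhook_configs:
              - url: https://deadmansswitch.example.com/ping  # placeholder endpoint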

       

    • 1.1.9. Prometheus not connected to alertmanager

       Prometheus cannot connect to the alertmanager

       

        - alert: PrometheusNotConnectedToAlertmanager
          expr: prometheus_notifications_alertmanagers_discovered < 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
            description: "Prometheus cannot connect to the alertmanager\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.10. Prometheus rule evaluation failures

       Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.

       

        - alert: PrometheusRuleEvaluationFailures
          expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.11. Prometheus template text expansion failures

       Prometheus encountered {{ $value }} template text expansion failures

       

        - alert: PrometheusTemplateTextExpansionFailures
          expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} template text expansion failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.12. Prometheus rule evaluation slow

       Prometheus rule evaluation took longer than the scheduled interval. This indicates slow storage backend access or an overly complex query.

       

        - alert: PrometheusRuleEvaluationSlow
          expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
            description: "Prometheus rule evaluation took longer than the scheduled interval. This indicates slow storage backend access or an overly complex query.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.13. Prometheus notifications backlog

       The Prometheus notification queue has not been empty for 10 minutes

       

        - alert: PrometheusNotificationsBacklog
          expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus notifications backlog (instance {{ $labels.instance }})
            description: "The Prometheus notification queue has not been empty for 10 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.14. Prometheus AlertManager notification failing

       Alertmanager is failing to send notifications

       

        - alert: PrometheusAlertmanagerNotificationFailing
          expr: rate(alertmanager_notifications_failed_total[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
            description: "Alertmanager is failing to send notifications\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.15. Prometheus target empty

       Prometheus has no target in service discovery

       

        - alert: PrometheusTargetEmpty
          expr: prometheus_sd_discovered_targets == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus target empty (instance {{ $labels.instance }})
            description: "Prometheus has no target in service discovery\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.16. Prometheus target scraping slow

       Prometheus is scraping exporters slowly

       

        - alert: PrometheusTargetScrapingSlow
          expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus target scraping slow (instance {{ $labels.instance }})
            description: "Prometheus is scraping exporters slowly\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.17. Prometheus large scrape

       Prometheus has many scrapes that exceed the sample limit

       

        - alert: PrometheusLargeScrape
          expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus large scrape (instance {{ $labels.instance }})
            description: "Prometheus has many scrapes that exceed the sample limit\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.18. Prometheus target scrape duplicate

       Prometheus has many samples rejected due to duplicate timestamps but different values

       

        - alert: PrometheusTargetScrapeDuplicate
          expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
            description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.19. Prometheus TSDB checkpoint creation failures

       Prometheus encountered {{ $value }} checkpoint creation failures

       

        - alert: PrometheusTsdbCheckpointCreationFailures
          expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} checkpoint creation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.20. Prometheus TSDB checkpoint deletion failures

       Prometheus encountered {{ $value }} checkpoint deletion failures

       

        - alert: PrometheusTsdbCheckpointDeletionFailures
          expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.21. Prometheus TSDB compactions failed

       Prometheus encountered {{ $value }} TSDB compaction failures

       

        - alert: PrometheusTsdbCompactionsFailed
          expr: increase(prometheus_tsdb_compactions_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB compaction failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.22. Prometheus TSDB head truncations failed

       Prometheus encountered {{ $value }} TSDB head truncation failures

       

        - alert: PrometheusTsdbHeadTruncationsFailed
          expr: increase(prometheus_tsdb_head_truncations_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.23. Prometheus TSDB reload failures

       Prometheus encountered {{ $value }} TSDB reload failures

       

        - alert: PrometheusTsdbReloadFailures
          expr: increase(prometheus_tsdb_reloads_failures_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB reload failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.24. Prometheus TSDB WAL corruptions

       Prometheus encountered {{ $value }} TSDB WAL corruptions

       

        - alert: PrometheusTsdbWalCorruptions
          expr: increase(prometheus_tsdb_wal_corruptions_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.1.25. Prometheus TSDB WAL truncations failed

       Prometheus encountered {{ $value }} TSDB WAL truncation failures

       

        - alert: PrometheusTsdbWalTruncationsFailed
          expr: increase(prometheus_tsdb_wal_truncations_failed_total[3m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
            description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       


  • 1.2. Host and hardware : node-exporter (26 rules)

    • 1.2.1. Host out of memory

       Node memory is filling up (< 10% left)

       

        - alert: HostOutOfMemory
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host out of memory (instance {{ $labels.instance }})
            description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.2. Host memory under memory pressure

       The node is under heavy memory pressure. High rate of major page faults

       

        - alert: HostMemoryUnderMemoryPressure
          expr: rate(node_vmstat_pgmajfault[1m]) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host memory under memory pressure (instance {{ $labels.instance }})
            description: "The node is under heavy memory pressure. High rate of major page faults\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.3. Host unusual network throughput in

       Host network interfaces are probably receiving too much data (> 100 MB/s)

       

        - alert: HostUnusualNetworkThroughputIn
          expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual network throughput in (instance {{ $labels.instance }})
            description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.4. Host unusual network throughput out

       Host network interfaces are probably sending too much data (> 100 MB/s)

       

        - alert: HostUnusualNetworkThroughputOut
          expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual network throughput out (instance {{ $labels.instance }})
            description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.5. Host unusual disk read rate

       Disk is probably reading too much data (> 50 MB/s)

       

        - alert: HostUnusualDiskReadRate
          expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual disk read rate (instance {{ $labels.instance }})
            description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.6. Host unusual disk write rate

       Disk is probably writing too much data (> 50 MB/s)

       

        - alert: HostUnusualDiskWriteRate
          expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual disk write rate (instance {{ $labels.instance }})
            description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.7. Host out of disk space

       Disk is almost full (< 10% left)

       

        # please add ignored mountpoints in node_exporter parameters like
        # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)"
        - alert: HostOutOfDiskSpace
          expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host out of disk space (instance {{ $labels.instance }})
            description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
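
       If node_exporter cannot be restarted with the flag mentioned in the comment above, pseudo filesystems can instead be excluded in the expression itself. A possible variant (the fstype list is an assumption; adjust it to your mounts):

        - alert: HostOutOfDiskSpace
          expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 10
          for: 5m
          labels:
            severity: warning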

       

    • 1.2.8. Host disk will fill in 4 hours

       Disk will fill in 4 hours at current write rate

       

        - alert: HostDiskWillFillIn4Hours
          expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host disk will fill in 4 hours (instance {{ $labels.instance }})
            description: "Disk will fill in 4 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.9. Host out of inodes

       Disk is almost running out of available inodes (< 10% left)

       

        - alert: HostOutOfInodes
          expr: node_filesystem_files_free{mountpoint="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host out of inodes (instance {{ $labels.instance }})
            description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.10. Host unusual disk read latency

       Disk latency is growing (read operations > 100ms)

       

        - alert: HostUnusualDiskReadLatency
          expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual disk read latency (instance {{ $labels.instance }})
            description: "Disk latency is growing (read operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.11. Host unusual disk write latency

       Disk latency is growing (write operations > 100ms)

       

        - alert: HostUnusualDiskWriteLatency
          expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host unusual disk write latency (instance {{ $labels.instance }})
            description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.12. Host high CPU load

       CPU load is > 80%

       

        - alert: HostHighCpuLoad
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host high CPU load (instance {{ $labels.instance }})
            description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.13. Host context switching

       Context switching is growing on node (> 1000 / s)

       

        # 1000 context switches is an arbitrary number.
        # Alert threshold depends on nature of application.
        # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
        - alert: HostContextSwitching
          expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host context switching (instance {{ $labels.instance }})
            description: "Context switching is growing on node (> 1000 / s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.14. Host swap is filling up

       Swap is filling up (>80%)

       

        - alert: HostSwapIsFillingUp
          expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host swap is filling up (instance {{ $labels.instance }})
            description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.15. Host SystemD service crashed

       SystemD service crashed

       

        - alert: HostSystemdServiceCrashed
          expr: node_systemd_unit_state{state="failed"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host SystemD service crashed (instance {{ $labels.instance }})
            description: "SystemD service crashed\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.16. Host physical component too hot

       Physical hardware component too hot

       

        - alert: HostPhysicalComponentTooHot
          expr: node_hwmon_temp_celsius > 75
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host physical component too hot (instance {{ $labels.instance }})
            description: "Physical hardware component too hot\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.17. Host node overtemperature alarm

       Physical node temperature alarm triggered

       

        - alert: HostNodeOvertemperatureAlarm
          expr: node_hwmon_temp_alarm == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Host node overtemperature alarm (instance {{ $labels.instance }})
            description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.18. Host RAID array got inactive

       RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.

       

        - alert: HostRaidArrayGotInactive
          expr: node_md_state{state="inactive"} > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Host RAID array got inactive (instance {{ $labels.instance }})
            description: "RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.19. Host RAID disk failure

       At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap

       

        - alert: HostRaidDiskFailure
          expr: node_md_disks{state="failed"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host RAID disk failure (instance {{ $labels.instance }})
            description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.20. Host kernel version deviations

       Different kernel versions are running

       

        - alert: HostKernelVersionDeviations
          expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host kernel version deviations (instance {{ $labels.instance }})
            description: "Different kernel versions are running\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.21. Host OOM kill detected

       OOM kill detected

       

        - alert: HostOomKillDetected
          expr: increase(node_vmstat_oom_kill[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host OOM kill detected (instance {{ $labels.instance }})
            description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.22. Host EDAC Correctable Errors detected

       {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.

       

        - alert: HostEdacCorrectableErrorsDetected
          expr: increase(node_edac_correctable_errors_total[5m]) > 0
          for: 5m
          labels:
            severity: info
          annotations:
            summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
            description: "{{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.23. Host EDAC Uncorrectable Errors detected

       {{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.

       

        - alert: HostEdacUncorrectableErrorsDetected
          expr: node_edac_uncorrectable_errors_total > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
            description: "{{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.24. Host Network Receive Errors

       {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.

       

        - alert: HostNetworkReceiveErrors
          expr: increase(node_network_receive_errs_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host Network Receive Errors (instance {{ $labels.instance }})
            description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.25. Host Network Transmit Errors

       {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.

       

        - alert: HostNetworkTransmitErrors
          expr: increase(node_network_transmit_errs_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host Network Transmit Errors (instance {{ $labels.instance }})
            description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.2.26. Host Network Interface Saturated

       The network interface "{{ $labels.device }}" on "{{ $labels.instance }}" is getting overloaded. (node_network_* metrics expose the interface under the device label.)

       

        - alert: HostNetworkInterfaceSaturated
          expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Host Network Interface Saturated (instance {{ $labels.instance }})
            description: "The network interface \"{{ $labels.device }}\" on \"{{ $labels.instance }}\" is getting overloaded.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
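
       All node_* series in this section come from node_exporter, which listens on port 9100 by default. A minimal scrape job sketch (target addresses are placeholders):

        # prometheus.yml (fragment)
        scrape_configs:
          - job_name: node
            static_configs:
              - targets:
                  - host1.example.com:9100
                  - host2.example.com:9100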

       


  • 1.3. Docker containers : google/cAdvisor (6 rules)

    • 1.3.1. Container killed

       A container has disappeared

       

        - alert: ContainerKilled
          expr: time() - container_last_seen > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container killed (instance {{ $labels.instance }})
            description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.2. Container CPU usage

       Container CPU usage is above 80%

       

        # cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly.
        # If you want to exclude it from this alert, just use: container_cpu_usage_seconds_total{name!=""}
        - alert: ContainerCpuUsage
          expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container CPU usage (instance {{ $labels.instance }})
            description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.3. Container Memory usage

       Container Memory usage is above 80%

       

        # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
        - alert: ContainerMemoryUsage
          expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container Memory usage (instance {{ $labels.instance }})
            description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.4. Container Volume usage

       Container Volume usage is above 80%

       

        - alert: ContainerVolumeUsage
          expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance)) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container Volume usage (instance {{ $labels.instance }})
            description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.5. Container Volume IO usage

       Container Volume IO usage is above 80%

       

        - alert: ContainerVolumeIoUsage
          expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container Volume IO usage (instance {{ $labels.instance }})
            description: "Container Volume IO usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.3.6. Container high throttle rate

       Container is being throttled

       

        - alert: ContainerHighThrottleRate
          expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Container high throttle rate (instance {{ $labels.instance }})
            description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       


  • 1.4. Blackbox : prometheus/blackbox_exporter (8 rules)

    • 1.4.1. Blackbox probe failed

       Probe failed

       

        - alert: BlackboxProbeFailed
          expr: probe_success == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Blackbox probe failed (instance {{ $labels.instance }})
            description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.2. Blackbox slow probe

       Blackbox probe took more than 1s to complete

       

        - alert: BlackboxSlowProbe
          expr: avg_over_time(probe_duration_seconds[1m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Blackbox slow probe (instance {{ $labels.instance }})
            description: "Blackbox probe took more than 1s to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.3. Blackbox probe HTTP failure

       HTTP status code is not 200-399

       

        - alert: BlackboxProbeHttpFailure
          expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
            description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.4. Blackbox SSL certificate will expire soon

       SSL certificate expires in 30 days

       

        - alert: BlackboxSslCertificateWillExpireSoon
          expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
            description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.5. Blackbox SSL certificate will expire soon

       SSL certificate expires in 3 days

       

        - alert: BlackboxSslCertificateWillExpireSoon
          expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
            description: "SSL certificate expires in 3 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.6. Blackbox SSL certificate expired

       SSL certificate has expired already

       

        - alert: BlackboxSslCertificateExpired
          expr: probe_ssl_earliest_cert_expiry - time() <= 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
            description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.7. Blackbox probe slow HTTP

       HTTP request took more than 1s

       

        - alert: BlackboxProbeSlowHttp
          expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})
            description: "HTTP request took more than 1s\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.4.8. Blackbox probe slow ping

       Blackbox ping took more than 1s

       

        - alert: BlackboxProbeSlowPing
          expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Blackbox probe slow ping (instance {{ $labels.instance }})
            description: "Blackbox ping took more than 1s\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
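
       The probe_* series in this section are produced by blackbox_exporter, which probes whatever target Prometheus passes as a URL parameter. The usual relabeling copies the probed URL into the instance label, which is why these alerts can report {{ $labels.instance }}. A sketch, assuming the exporter runs locally on its default port 9115 with a plain HTTP module:

        # blackbox.yml - probe module definition
        modules:
          http_2xx:
            prober: http
            timeout: 5s

        # prometheus.yml (fragment)
        scrape_configs:
          - job_name: blackbox
            metrics_path: /probe
            params:
              module: [http_2xx]
            static_configs:
              - targets:
                  - https://example.com  # probed endpoint (placeholder)
            relabel_configs:
              - source_labels: [__address__]
                target_label: __param_target
              - source_labels: [__param_target]
                target_label: instance
              - target_label: __address__
                replacement: 127.0.0.1:9115  # blackbox_exporter itself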

       


  • 1.5. Windows Server : prometheus-community/windows_exporter (5 rules)

    • 1.5.1. Windows Server collector Error

       Collector {{ $labels.collector }} was not successful

       

        - alert: WindowsServerCollectorError
          expr: windows_exporter_collector_success == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Windows Server collector Error (instance {{ $labels.instance }})
            description: "Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.5.2. Windows Server service Status

       Windows Service state is not OK

       

        - alert: WindowsServerServiceStatus
          expr: windows_service_status{status="ok"} != 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Windows Server service Status (instance {{ $labels.instance }})
            description: "Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.5.3. Windows Server CPU Usage

       CPU Usage is more than 80%

       

        - alert: WindowsServerCpuUsage
          expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Windows Server CPU Usage (instance {{ $labels.instance }})
            description: "CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.5.4. Windows Server memory Usage

       Memory usage is more than 90%

       

        - alert: WindowsServerMemoryUsage
          expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Windows Server memory Usage (instance {{ $labels.instance }})
            description: "Memory usage is more than 90%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 1.5.5. Windows Server disk Space Usage

       Disk usage is more than 80%

       

        - alert: WindowsServerDiskSpaceUsage
          expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024 ) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Windows Server disk Space Usage (instance {{ $labels.instance }})
            description: "Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       


  • 2.1. MySQL : prometheus/mysqld_exporter (8 rules)

    • 2.1.1. MySQL down

       MySQL instance is down on {{ $labels.instance }}

       

        - alert: MysqlDown
          expr: mysql_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MySQL down (instance {{ $labels.instance }})
            description: "MySQL instance is down on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.2. MySQL too many connections

       More than 80% of MySQL connections are in use on {{ $labels.instance }}

       

        - alert: MysqlTooManyConnections
          expr: avg by (instance) (max_over_time(mysql_global_status_threads_connected[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL too many connections (instance {{ $labels.instance }})
            description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.3. MySQL high threads running

       More than 60% of MySQL connections are in running state on {{ $labels.instance }}

       

        - alert: MysqlHighThreadsRunning
          expr: avg by (instance) (max_over_time(mysql_global_status_threads_running[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL high threads running (instance {{ $labels.instance }})
            description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.4. MySQL Slave IO thread not running

       MySQL Slave IO thread not running on {{ $labels.instance }}

       

        - alert: MysqlSlaveIoThreadNotRunning
          expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_io_running == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MySQL Slave IO thread not running (instance {{ $labels.instance }})
            description: "MySQL Slave IO thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.5. MySQL Slave SQL thread not running

       MySQL Slave SQL thread not running on {{ $labels.instance }}

       

        - alert: MysqlSlaveSqlThreadNotRunning
          expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_sql_running == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }})
            description: "MySQL Slave SQL thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.6. MySQL Slave replication lag

       MySQL replication lag on {{ $labels.instance }}

       

        - alert: MysqlSlaveReplicationLag
          expr: mysql_slave_status_master_server_id > 0 and ON (instance) (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) > 300
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL Slave replication lag (instance {{ $labels.instance }})
            description: "MySQL replication lag on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.7. MySQL slow queries

       MySQL server has new slow queries.

       

        - alert: MysqlSlowQueries
          expr: rate(mysql_global_status_slow_queries[2m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL slow queries (instance {{ $labels.instance }})
            description: "MySQL server has new slow queries.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.1.8. MySQL restarted

       MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.

       

        - alert: MysqlRestarted
          expr: mysql_global_status_uptime < 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MySQL restarted (instance {{ $labels.instance }})
            description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
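
       The mysql_* series in this section come from mysqld_exporter (default port 9104), which needs database credentials, commonly supplied through the DATA_SOURCE_NAME environment variable. A sketch with placeholder credentials and targets:

        # environment for mysqld_exporter (placeholder credentials)
        #   DATA_SOURCE_NAME="exporter:password@(127.0.0.1:3306)/"

        # prometheus.yml (fragment)
        scrape_configs:
          - job_name: mysql
            static_configs:
              - targets:
                  - db1.example.com:9104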

       


  • 2.2. PostgreSQL : wrouesnel/postgres_exporter (25 rules)

    • 2.2.1. Postgresql down

       Postgresql instance is down

       

        - alert: PostgresqlDown
          expr: pg_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql down (instance {{ $labels.instance }})
            description: "Postgresql instance is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.2. Postgresql restarted

       Postgresql restarted

       

        - alert: PostgresqlRestarted
          expr: time() - pg_postmaster_start_time_seconds < 60
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql restarted (instance {{ $labels.instance }})
            description: "Postgresql restarted\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.3. Postgresql exporter error

       Postgresql exporter is showing errors. A query may be buggy in query.yaml

       

        - alert: PostgresqlExporterError
          expr: pg_exporter_last_scrape_error > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql exporter error (instance {{ $labels.instance }})
            description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.4. Postgresql replication lag

       PostgreSQL replication lag is going up (> 10s)

       

        - alert: PostgresqlReplicationLag
          expr: (pg_replication_lag) > 10 and ON(instance) (pg_replication_is_replica == 1)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql replication lag (instance {{ $labels.instance }})
            description: "PostgreSQL replication lag is going up (> 10s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.5. Postgresql table not vacuumed

       Table has not been vacuumed for 24 hours

       

        - alert: PostgresqlTableNotVacuumed
          expr: time() - pg_stat_user_tables_last_autovacuum > 60 * 60 * 24
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql table not vacuumed (instance {{ $labels.instance }})
            description: "Table has not been vacuumed for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.6. Postgresql table not analyzed

       Table has not been analyzed for 24 hours

       

        - alert: PostgresqlTableNotAnalyzed
          expr: time() - pg_stat_user_tables_last_autoanalyze > 60 * 60 * 24
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql table not analyzed (instance {{ $labels.instance }})
            description: "Table has not been analyzed for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.7. Postgresql too many connections

       PostgreSQL instance has too many connections

       

        - alert: PostgresqlTooManyConnections
          expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > pg_settings_max_connections * 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql too many connections (instance {{ $labels.instance }})
            description: "PostgreSQL instance has too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.8. Postgresql not enough connections

       PostgreSQL instance should have more connections (> 5)

       

        - alert: PostgresqlNotEnoughConnections
          expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql not enough connections (instance {{ $labels.instance }})
            description: "PostgreSQL instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.9. Postgresql dead locks

       PostgreSQL has dead-locks

       

        - alert: PostgresqlDeadLocks
          expr: rate(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql dead locks (instance {{ $labels.instance }})
            description: "PostgreSQL has dead-locks\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.10. Postgresql slow queries

       PostgreSQL executes slow queries

       

        - alert: PostgresqlSlowQueries
          expr: pg_slow_queries > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql slow queries (instance {{ $labels.instance }})
            description: "PostgreSQL executes slow queries\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.11. Postgresql high rollback rate

       Ratio of transactions being aborted compared to committed is > 2 %

       

        - alert: PostgresqlHighRollbackRate
          expr: rate(pg_stat_database_xact_rollback{datname!~"template.*"}[3m]) / rate(pg_stat_database_xact_commit{datname!~"template.*"}[3m]) > 0.02
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql high rollback rate (instance {{ $labels.instance }})
            description: "Ratio of transactions being aborted compared to committed is > 2 %\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.12. Postgresql commit rate low

       Postgres seems to be processing very few transactions

       

        - alert: PostgresqlCommitRateLow
          expr: rate(pg_stat_database_xact_commit[1m]) < 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql commit rate low (instance {{ $labels.instance }})
            description: "Postgres seems to be processing very few transactions\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.13. Postgresql low XID consumption

       Postgresql seems to be consuming transaction IDs very slowly

       

        - alert: PostgresqlLowXidConsumption
          expr: rate(pg_txid_current[1m]) < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql low XID consumption (instance {{ $labels.instance }})
            description: "Postgresql seems to be consuming transaction IDs very slowly\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.14. Postgresql low XLOG consumption

       Postgres seems to be consuming XLOG very slowly

       

        - alert: PostgresqlLowXlogConsumption
          expr: rate(pg_xlog_position_bytes[1m]) < 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql low XLOG consumption (instance {{ $labels.instance }})
            description: "Postgres seems to be consuming XLOG very slowly\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.15. Postgresql WALE replication stopped

       WAL-E replication seems to be stopped

       

        - alert: PostgresqlWaleReplicationStopped
          expr: rate(pg_xlog_position_bytes[1m]) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql WALE replication stopped (instance {{ $labels.instance }})
            description: "WAL-E replication seems to be stopped\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.16. Postgresql high rate statement timeout

       Postgres transactions showing high rate of statement timeouts

       

        - alert: PostgresqlHighRateStatementTimeout
          expr: rate(postgresql_errors_total{type="statement_timeout"}[5m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql high rate statement timeout (instance {{ $labels.instance }})
            description: "Postgres transactions showing high rate of statement timeouts\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.17. Postgresql high rate deadlock

       Postgres detected deadlocks

       

        - alert: PostgresqlHighRateDeadlock
          expr: rate(postgresql_errors_total{type="deadlock_detected"}[1m]) * 60 > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql high rate deadlock (instance {{ $labels.instance }})
            description: "Postgres detected deadlocks\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.18. Postgresql replication lag bytes

       Postgres replication lag (in bytes) is high

       

        - alert: PostgresqlReplicationLagBytes
          expr: (pg_xlog_position_bytes and pg_replication_is_replica == 0) - GROUP_RIGHT(instance) (pg_xlog_position_bytes and pg_replication_is_replica == 1) > 1e+09
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql replication lag bytes (instance {{ $labels.instance }})
            description: "Postgres replication lag (in bytes) is high\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.19. Postgresql unused replication slot

       Unused Replication Slots

       

        - alert: PostgresqlUnusedReplicationSlot
          expr: pg_replication_slots_active == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql unused replication slot (instance {{ $labels.instance }})
            description: "Unused Replication Slots\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.20. Postgresql too many dead tuples

       The number of PostgreSQL dead tuples is too large

       

        - alert: PostgresqlTooManyDeadTuples
          expr: ((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 unless ON(instance) (pg_replication_is_replica == 1)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql too many dead tuples (instance {{ $labels.instance }})
            description: "The number of PostgreSQL dead tuples is too large\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 2.2.21. Postgresql split brain

       Split Brain, too many primary Postgresql databases in read-write mode[copy]

       

        - alert: PostgresqlSplitBrain
          expr: count(pg_replication_is_replica == 0) != 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql split brain (instance {{ $labels.instance }})
            description: Split Brain, too many primary Postgresql databases in read-write mode\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.2.22. Postgresql promoted node

       Postgresql standby server has been promoted to primary node[copy]

       

        - alert: PostgresqlPromotedNode
          expr: pg_replication_is_replica and changes(pg_replication_is_replica[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql promoted node (instance {{ $labels.instance }})
            description: Postgresql standby server has been promoted to primary node\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.2.23. Postgresql configuration changed

       Postgres Database configuration change has occurred[copy]

       

        - alert: PostgresqlConfigurationChanged
          expr: {__name__=~"pg_settings_.*"} != ON(__name__) {__name__=~"pg_settings_([^t]|t[^r]|tr[^a]|tra[^n]|tran[^s]|trans[^a]|transa[^c]|transac[^t]|transact[^i]|transacti[^o]|transactio[^n]|transaction[^_]|transaction_[^r]|transaction_r[^e]|transaction_re[^a]|transaction_rea[^d]|transaction_read[^_]|transaction_read_[^o]|transaction_read_o[^n]|transaction_read_on[^l]|transaction_read_onl[^y]).*"} OFFSET 5m
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Postgresql configuration changed (instance {{ $labels.instance }})
            description: Postgres Database configuration change has occurred\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
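
       The sprawling regex in the expression above exists because Prometheus regexes use RE2, which has no negative lookahead; the prefix enumeration excludes exactly pg_settings_transaction_read_only, a setting that flips when a standby is promoted and would otherwise fire this alert on every failover. A hedged, arguably clearer equivalent applies two matchers to __name__:

        - alert: PostgresqlConfigurationChanged
          expr: {__name__=~"pg_settings_.*", __name__!="pg_settings_transaction_read_only"} != ON(__name__) {__name__=~"pg_settings_.*", __name__!="pg_settings_transaction_read_only"} OFFSET 5m
          for: 5m
          labels:
            severity: warning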

       

    • 2.2.24. Postgresql SSL compression active

       Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.[copy]

       

        - alert: PostgresqlSslCompressionActive
          expr: sum(pg_stat_ssl_compression) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql SSL compression active (instance {{ $labels.instance }})
            description: Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.2.25. Postgresql too many locks acquired

       Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.[copy]

       

        - alert: PostgresqlTooManyLocksAcquired
          expr: ((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Postgresql too many locks acquired (instance {{ $labels.instance }})
            description: Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
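
       A note on deployment: each "copy all" block above is meant to live in a rule file that prometheus.yml references. A minimal sketch, with file paths as assumptions:

        # /etc/prometheus/prometheus.yml (fragment; paths are assumptions)
        rule_files:
          - "alerts/postgresql.rules.yml"

        # /etc/prometheus/alerts/postgresql.rules.yml
        groups:
          - name: postgresql
            rules:
              - alert: PostgresqlTooManyLocksAcquired
                expr: ((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20
                for: 5m
                labels:
                  severity: critical
                annotations:
                  summary: Postgresql too many locks acquired (instance {{ $labels.instance }})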

       


  • 2. 3. SQL Server : Ozarklake/prometheus-mssql-exporter (2 rules)[copy all]

    • 2.3.1. SQL Server down

       SQL Server instance is down[copy]

       

        - alert: SqlServerDown
          expr: mssql_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: SQL Server down (instance {{ $labels.instance }})
            description: SQL Server instance is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.3.2. SQL Server deadlock

       SQL Server is having some deadlock.[copy]

       

        - alert: SqlServerDeadlock
          expr: rate(mssql_deadlocks[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: SQL Server deadlock (instance {{ $labels.instance }})
            description: SQL Server is having some deadlock.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 4. PGBouncer : spreaker/prometheus-pgbouncer-exporter (3 rules)[copy all]

    • 2.4.1. PGBouncer active connections

       PGBouncer pools are filling up[copy]

       

        - alert: PgbouncerActiveConnections
          expr: pgbouncer_pools_server_active_connections > 200
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: PGBouncer active connections (instance {{ $labels.instance }})
            description: PGBouncer pools are filling up\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.4.2. PGBouncer errors

       PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.[copy]

       

        - alert: PgbouncerErrors
          expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[5m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: PGBouncer errors (instance {{ $labels.instance }})
            description: PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.4.3. PGBouncer max connections

       The number of PGBouncer client connections has reached max_client_conn.[copy]

       

        - alert: PgbouncerMaxConnections
          expr: rate(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: PGBouncer max connections (instance {{ $labels.instance }})
            description: The number of PGBouncer client connections has reached max_client_conn.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 5. Redis : oliver006/redis_exporter (11 rules)[copy all]

    • 2.5.1. Redis down

       Redis instance is down[copy]

       

        - alert: RedisDown
          expr: redis_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis down (instance {{ $labels.instance }})
            description: Redis instance is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.2. Redis missing master

       Redis cluster has no node marked as master.[copy]

       

        - alert: RedisMissingMaster
          expr: count(redis_instance_info{role="master"}) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis missing master (instance {{ $labels.instance }})
            description: Redis cluster has no node marked as master.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.3. Redis too many masters

       Redis cluster has too many nodes marked as master.[copy]

       

        - alert: RedisTooManyMasters
          expr: count(redis_instance_info{role="master"}) > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis too many masters (instance {{ $labels.instance }})
            description: Redis cluster has too many nodes marked as master.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.4. Redis disconnected slaves

       Redis not replicating for all slaves. Consider reviewing the redis replication status.[copy]

       

        - alert: RedisDisconnectedSlaves
          expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis disconnected slaves (instance {{ $labels.instance }})
            description: Redis not replicating for all slaves. Consider reviewing the redis replication status.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.5. Redis replication broken

       Redis instance lost a slave[copy]

       

        - alert: RedisReplicationBroken
          expr: delta(redis_connected_slaves[1m]) < 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis replication broken (instance {{ $labels.instance }})
            description: Redis instance lost a slave\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.6. Redis cluster flapping

       Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).[copy]

       

        - alert: RedisClusterFlapping
          expr: changes(redis_connected_slaves[5m]) > 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis cluster flapping (instance {{ $labels.instance }})
            description: Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.7. Redis missing backup

       Redis has not been backed up for 24 hours[copy]

       

        - alert: RedisMissingBackup
          expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis missing backup (instance {{ $labels.instance }})
            description: Redis has not been backed up for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.8. Redis out of memory

       Redis is running out of memory (> 90%)[copy]

       

        - alert: RedisOutOfMemory
          expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Redis out of memory (instance {{ $labels.instance }})
            description: Redis is running out of memory (> 90%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.9. Redis too many connections

       Redis instance has too many connections[copy]

       

        - alert: RedisTooManyConnections
          expr: redis_connected_clients > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Redis too many connections (instance {{ $labels.instance }})
            description: Redis instance has too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.10. Redis not enough connections

       Redis instance has too few connections (< 5)[copy]

       

        - alert: RedisNotEnoughConnections
          expr: redis_connected_clients < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Redis not enough connections (instance {{ $labels.instance }})
            description: Redis instance has too few connections (< 5)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.5.11. Redis rejected connections

       Some connections to Redis have been rejected[copy]

       

        - alert: RedisRejectedConnections
          expr: increase(redis_rejected_connections_total[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Redis rejected connections (instance {{ $labels.instance }})
            description: Some connections to Redis have been rejected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
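
       The severity label carried by all of these rules is what AlertManager routes on. A minimal routing sketch; the receiver names are hypothetical:

        # alertmanager.yml (fragment); receiver names are assumptions
        route:
          receiver: default
          routes:
            - match:
                severity: critical
              receiver: pager-oncall
            - match:
                severity: warning
              receiver: chat-alerts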

       


  • 2. 6. MongoDB : percona/mongodb_exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 2. 6. MongoDB : dcu/mongodb_exporter (10 rules)[copy all]

    • 2.6.1. MongoDB replication lag

       Mongodb replication lag is more than 10s[copy]

       

        - alert: MongodbReplicationLag
          expr: avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication lag (instance {{ $labels.instance }})
            description: Mongodb replication lag is more than 10s\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.2. MongoDB replication Status 3

       A MongoDB replica set member is either performing startup self-checks or transitioning from completing a rollback or resync[copy]

       

        - alert: MongodbReplicationStatus3
          expr: mongodb_replset_member_state == 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 3 (instance {{ $labels.instance }})
            description: A MongoDB replica set member is either performing startup self-checks or transitioning from completing a rollback or resync\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.3. MongoDB replication Status 6

       A MongoDB replica set member, as seen from another member of the set, is not yet known[copy]

       

        - alert: MongodbReplicationStatus6
          expr: mongodb_replset_member_state == 6
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 6 (instance {{ $labels.instance }})
            description: A MongoDB replica set member, as seen from another member of the set, is not yet known\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.4. MongoDB replication Status 8

       A MongoDB replica set member, as seen from another member of the set, is unreachable[copy]

       

        - alert: MongodbReplicationStatus8
          expr: mongodb_replset_member_state == 8
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 8 (instance {{ $labels.instance }})
            description: A MongoDB replica set member, as seen from another member of the set, is unreachable\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.5. MongoDB replication Status 9

       MongoDB Replication set member is actively performing a rollback. Data is not available for reads[copy]

       

        - alert: MongodbReplicationStatus9
          expr: mongodb_replset_member_state == 9
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 9 (instance {{ $labels.instance }})
            description: MongoDB Replication set member is actively performing a rollback. Data is not available for reads\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.6. MongoDB replication Status 10

       MongoDB Replication set member was once in a replica set but was subsequently removed[copy]

       

        - alert: MongodbReplicationStatus10
          expr: mongodb_replset_member_state == 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replication Status 10 (instance {{ $labels.instance }})
            description: MongoDB Replication set member was once in a replica set but was subsequently removed\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
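
       Rules 2.6.2 through 2.6.6 each watch one unhealthy replica set state. If one alert per state is too granular, a hedged consolidated variant (the alert name is hypothetical) fires on any of them:

        - alert: MongodbReplicationMemberUnhealthy
          expr: mongodb_replset_member_state == 3 or mongodb_replset_member_state == 6 or mongodb_replset_member_state == 8 or mongodb_replset_member_state == 9 or mongodb_replset_member_state == 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: MongoDB replica set member unhealthy (instance {{ $labels.instance }})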

       

    • 2.6.7. MongoDB number cursors open

       Too many cursors opened by MongoDB for clients (> 10k)[copy]

       

        - alert: MongodbNumberCursorsOpen
          expr: mongodb_metrics_cursor_open{state="total_open"} > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MongoDB number cursors open (instance {{ $labels.instance }})
            description: Too many cursors opened by MongoDB for clients (> 10k)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.8. MongoDB cursors timeouts

       Too many cursors are timing out[copy]

       

        - alert: MongodbCursorsTimeouts
          expr: increase(mongodb_metrics_cursor_timed_out_total[10m]) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MongoDB cursors timeouts (instance {{ $labels.instance }})
            description: Too many cursors are timing out\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.9. MongoDB too many connections

       Too many connections[copy]

       

        - alert: MongodbTooManyConnections
          expr: mongodb_connections{state="current"} > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MongoDB too many connections (instance {{ $labels.instance }})
            description: Too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.6.10. MongoDB virtual memory usage

       High virtual memory usage (virtual / mapped ratio > 3)[copy]

       

        - alert: MongodbVirtualMemoryUsage
          expr: (sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: MongoDB virtual memory usage (instance {{ $labels.instance }})
            description: High virtual memory usage (virtual / mapped ratio > 3)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 7. RabbitMQ (official exporter) : rabbitmq/rabbitmq-prometheus (9 rules)[copy all]

    • 2.7.1. Rabbitmq node down

       Less than 3 nodes running in RabbitMQ cluster[copy]

       

        - alert: RabbitmqNodeDown
          expr: sum(rabbitmq_build_info) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq node down (instance {{ $labels.instance }})
            description: Less than 3 nodes running in RabbitMQ cluster\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
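
       The hard-coded 3 assumes a three-node cluster; adjust it to your topology. For a five-node cluster, for example:

        expr: sum(rabbitmq_build_info) < 5   # assumes a five-node cluster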

       

    • 2.7.2. Rabbitmq node not distributed

       Distribution link state is not 'up'[copy]

       

        - alert: RabbitmqNodeNotDistributed
          expr: erlang_vm_dist_node_state < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq node not distributed (instance {{ $labels.instance }})
            description: Distribution link state is not 'up'\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.3. Rabbitmq instances different versions

       Running different versions of RabbitMQ in the same cluster can lead to failures.[copy]

       

        - alert: RabbitmqInstancesDifferentVersions
          expr: count(count(rabbitmq_build_info) by (rabbitmq_version)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq instances different versions (instance {{ $labels.instance }})
            description: Running different versions of RabbitMQ in the same cluster can lead to failures.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.4. Rabbitmq memory high

       A node uses more than 90% of its allocated RAM[copy]

       

        - alert: RabbitmqMemoryHigh
          expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq memory high (instance {{ $labels.instance }})
            description: A node uses more than 90% of its allocated RAM\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.5. Rabbitmq file descriptors usage

       A node uses more than 90% of its file descriptors[copy]

       

        - alert: RabbitmqFileDescriptorsUsage
          expr: rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq file descriptors usage (instance {{ $labels.instance }})
            description: A node uses more than 90% of its file descriptors\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.6. Rabbitmq too much unack

       Too many unacknowledged messages[copy]

       

        - alert: RabbitmqTooMuchUnack
          expr: sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq too much unack (instance {{ $labels.instance }})
            description: Too many unacknowledged messages\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.7. Rabbitmq too much connections

       The total number of connections on a node is too high[copy]

       

        - alert: RabbitmqTooMuchConnections
          expr: rabbitmq_connections > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq too much connections (instance {{ $labels.instance }})
            description: The total number of connections on a node is too high\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.8. Rabbitmq no queue consumer

       A queue has less than 1 consumer[copy]

       

        - alert: RabbitmqNoQueueConsumer
          expr: rabbitmq_queue_consumers < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq no queue consumer (instance {{ $labels.instance }})
            description: A queue has less than 1 consumer\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.9. Rabbitmq unroutable messages

       A queue has unroutable messages[copy]

       

        - alert: RabbitmqUnroutableMessages
          expr: increase(rabbitmq_channel_messages_unroutable_returned_total[5m]) > 0 or increase(rabbitmq_channel_messages_unroutable_dropped_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq unroutable messages (instance {{ $labels.instance }})
            description: A queue has unroutable messages\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 7. RabbitMQ : kbudde/rabbitmq-exporter (11 rules)[copy all]

    • 2.7.1. Rabbitmq down

       RabbitMQ node down[copy]

       

        - alert: RabbitmqDown
          expr: rabbitmq_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq down (instance {{ $labels.instance }})
            description: RabbitMQ node down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.2. Rabbitmq cluster down

       Less than 3 nodes running in RabbitMQ cluster[copy]

       

        - alert: RabbitmqClusterDown
          expr: sum(rabbitmq_running) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq cluster down (instance {{ $labels.instance }})
            description: Less than 3 nodes running in RabbitMQ cluster\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.3. Rabbitmq cluster partition

       Cluster partition[copy]

       

        - alert: RabbitmqClusterPartition
          expr: rabbitmq_partitions > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq cluster partition (instance {{ $labels.instance }})
            description: Cluster partition\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.4. Rabbitmq out of memory

       Memory available for RabbitMQ is low (< 10%)[copy]

       

        - alert: RabbitmqOutOfMemory
          expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq out of memory (instance {{ $labels.instance }})
            description: Memory available for RabbitMQ is low (< 10%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.5. Rabbitmq too many connections

       RabbitMQ instance has too many connections (> 1000)[copy]

       

        - alert: RabbitmqTooManyConnections
          expr: rabbitmq_connectionsTotal > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq too many connections (instance {{ $labels.instance }})
            description: RabbitMQ instance has too many connections (> 1000)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.6. Rabbitmq dead letter queue filling up

       Dead letter queue is filling up (> 10 msgs)[copy]

       

        - alert: RabbitmqDeadLetterQueueFillingUp
          expr: rabbitmq_queue_messages{queue="my-dead-letter-queue"} > 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq dead letter queue filling up (instance {{ $labels.instance }})
            description: Dead letter queue is filling up (> 10 msgs)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
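
       my-dead-letter-queue is a placeholder. If your dead letter queues follow a naming convention, a single hedged variant can cover them all with a regex matcher:

        expr: rabbitmq_queue_messages{queue=~".*dead-letter.*"} > 10   # hypothetical naming convention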

       

    • 2.7.7. Rabbitmq too many messages in queue

       Queue is filling up (> 1000 msgs)[copy]

       

        - alert: RabbitmqTooManyMessagesInQueue
          expr: rabbitmq_queue_messages_ready{queue="my-queue"} > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq too many messages in queue (instance {{ $labels.instance }})
            description: Queue is filling up (> 1000 msgs)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.8. Rabbitmq slow queue consuming

       Queue messages are consumed slowly (> 60s)[copy]

       

        - alert: RabbitmqSlowQueueConsuming
          expr: time() - rabbitmq_queue_head_message_timestamp{queue="my-queue"} > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq slow queue consuming (instance {{ $labels.instance }})
            description: Queue messages are consumed slowly (> 60s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.9. Rabbitmq no consumer

       Queue has no consumer[copy]

       

        - alert: RabbitmqNoConsumer
          expr: rabbitmq_queue_consumers == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq no consumer (instance {{ $labels.instance }})
            description: Queue has no consumer\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.10. Rabbitmq too many consumers

       Queue should have only 1 consumer[copy]

       

        - alert: RabbitmqTooManyConsumers
          expr: rabbitmq_queue_consumers > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Rabbitmq too many consumers (instance {{ $labels.instance }})
            description: Queue should have only 1 consumer\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.7.11. Rabbitmq inactive exchange

       Exchange receives fewer than 5 msgs per second[copy]

       

        - alert: RabbitmqInactiveExchange
          expr: rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Rabbitmq inactive exchange (instance {{ $labels.instance }})
            description: Exchange receives fewer than 5 msgs per second\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 8. Elasticsearch : justwatchcom/elasticsearch_exporter (13 rules)[copy all]

    • 2.8.1. Elasticsearch Heap Usage Too High

       The heap usage is over 90% for 5m[copy]

       

        - alert: ElasticsearchHeapUsageTooHigh
          expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})
            description: The heap usage is over 90% for 5m\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.2. Elasticsearch Heap Usage warning

       The heap usage is over 80% for 5m[copy]

       

        - alert: ElasticsearchHeapUsageWarning
          expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }})
            description: The heap usage is over 80% for 5m\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.3. Elasticsearch disk space low

       The disk usage is over 80%[copy]

       

        - alert: ElasticsearchDiskSpaceLow
          expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch disk space low (instance {{ $labels.instance }})
            description: The disk usage is over 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.4. Elasticsearch disk out of space

       The disk usage is over 90%[copy]

       

        - alert: ElasticsearchDiskOutOfSpace
          expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch disk out of space (instance {{ $labels.instance }})
            description: The disk usage is over 90%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.5. Elasticsearch Cluster Red

       Elastic Cluster Red status[copy]

       

        - alert: ElasticsearchClusterRed
          expr: elasticsearch_cluster_health_status{color="red"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch Cluster Red (instance {{ $labels.instance }})
            description: Elastic Cluster Red status\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.6. Elasticsearch Cluster Yellow

       Elastic Cluster Yellow status[copy]

       

        - alert: ElasticsearchClusterYellow
          expr: elasticsearch_cluster_health_status{color="yellow"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }})
            description: Elastic Cluster Yellow status\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.7. Elasticsearch Healthy Nodes

       Number of healthy nodes is less than the expected number_of_nodes[copy]

       

        - alert: ElasticsearchHealthyNodes
          expr: elasticsearch_cluster_health_number_of_nodes < number_of_nodes
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
            description: Number of healthy nodes is less than the expected number_of_nodes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
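
       number_of_nodes in the expression is a placeholder for the expected cluster size, not a real metric, so the rule will not work as pasted. A deployable variant hard-codes the expectation, e.g. for a three-node cluster:

        expr: elasticsearch_cluster_health_number_of_nodes < 3   # assumes a 3-node cluster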

       

    • 2.8.8. Elasticsearch Healthy Data Nodes

       Number of healthy data nodes is less than the expected number_of_data_nodes[copy]

       

        - alert: ElasticsearchHealthyDataNodes
          expr: elasticsearch_cluster_health_number_of_data_nodes < number_of_data_nodes
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }})
            description: Number of healthy data nodes is less than the expected number_of_data_nodes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.9. Elasticsearch relocation shards

       Number of relocation shards for 20 min[copy]

       

        - alert: ElasticsearchRelocationShards
          expr: elasticsearch_cluster_health_relocating_shards > 0
          for: 20m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch relocation shards (instance {{ $labels.instance }})
            description: Number of relocation shards for 20 min\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.10. Elasticsearch initializing shards

       Number of initializing shards for 10 min[copy]

       

        - alert: ElasticsearchInitializingShards
          expr: elasticsearch_cluster_health_initializing_shards > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch initializing shards (instance {{ $labels.instance }})
            description: Number of initializing shards for 10 min\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.11. Elasticsearch unassigned shards

       Number of unassigned shards for 2 min[copy]

       

        - alert: ElasticsearchUnassignedShards
          expr: elasticsearch_cluster_health_unassigned_shards > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: Elasticsearch unassigned shards (instance {{ $labels.instance }})
            description: Number of unassigned shards for 2 min\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.12. Elasticsearch pending tasks

       Number of pending tasks for 10 min. Cluster works slowly.[copy]

       

        - alert: ElasticsearchPendingTasks
          expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch pending tasks (instance {{ $labels.instance }})
            description: Number of pending tasks for 10 min. Cluster works slowly.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.8.13. Elasticsearch no new documents

       No new documents for 10 min![copy]

       

        - alert: ElasticsearchNoNewDocuments
          expr: rate(elasticsearch_indices_docs{es_data_node="true"}[10m]) < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Elasticsearch no new documents (instance {{ $labels.instance }})
            description: No new documents for 10 min!\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 9. Cassandra : instaclustr/cassandra-exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 2. 9. Cassandra : criteo/cassandra_exporter (18 rules)[copy all]

    • 2.9.1. Cassandra hints count

       Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down[copy]

       

        - alert: CassandraHintsCount
          expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra hints count (instance {{ $labels.instance }})
            description: Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.2. Cassandra compaction task pending

       Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.[copy]

       

        - alert: CassandraCompactionTaskPending
          expr: avg_over_time(cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"}[30m]) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra compaction task pending (instance {{ $labels.instance }})
            description: Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.3. Cassandra viewwrite latency

       High viewwrite latency on {{ $labels.instance }} cassandra node[copy]

       

        - alert: CassandraViewwriteLatency
          expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile",service="cas"} > 100000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra viewwrite latency (instance {{ $labels.instance }})
            description: High viewwrite latency on {{ $labels.instance }} cassandra node\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.4. Cassandra cool hacker

       Increase of Cassandra authentication failures[copy]

       

        - alert: CassandraCoolHacker
          expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra cool hacker (instance {{ $labels.instance }})
            description: Increase of Cassandra authentication failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.5. Cassandra node down

       Cassandra node down[copy]

       

        - alert: CassandraNodeDown
          expr: sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra node down (instance {{ $labels.instance }})
            description: Cassandra node down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.6. Cassandra commitlog pending tasks

       Unexpected number of Cassandra commitlog pending tasks[copy]

       

        - alert: CassandraCommitlogPendingTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
            description: Unexpected number of Cassandra commitlog pending tasks\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.7. Cassandra compaction executor blocked tasks

       Some Cassandra compaction executor tasks are blocked[copy]

       

        - alert: CassandraCompactionExecutorBlockedTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
            description: Some Cassandra compaction executor tasks are blocked\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.8. Cassandra flush writer blocked tasks

       Some Cassandra flush writer tasks are blocked[copy]

       

        - alert: CassandraFlushWriterBlockedTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
            description: Some Cassandra flush writer tasks are blocked\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.9. Cassandra repair pending tasks

       Some Cassandra repair tasks are pending[copy]

       

        - alert: CassandraRepairPendingTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra repair pending tasks (instance {{ $labels.instance }})
            description: Some Cassandra repair tasks are pending\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.10. Cassandra repair blocked tasks

       Some Cassandra repair tasks are blocked[copy]

       

        - alert: CassandraRepairBlockedTasks
          expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Cassandra repair blocked tasks (instance {{ $labels.instance }})
            description: Some Cassandra repair tasks are blocked\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.11. Cassandra connection timeouts total

       Some connections between nodes are ending in timeout[copy]

       

        - alert: CassandraConnectionTimeoutsTotal
          expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
            description: Some connections between nodes are ending in timeout\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.12. Cassandra storage exceptions

       Something is going wrong with cassandra storage[copy]

       

        - alert: CassandraStorageExceptions
          expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra storage exceptions (instance {{ $labels.instance }})
            description: Something is going wrong with cassandra storage\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.13. Cassandra tombstone dump

       Too many tombstones scanned in queries[copy]

       

        - alert: CassandraTombstoneDump
          expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra tombstone dump (instance {{ $labels.instance }})
            description: Too many tombstones scanned in queries\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.14. Cassandra client request unavailable write

       Write failures have occurred because too many nodes are unavailable[copy]

       

        - alert: CassandraClientRequestUnavailableWrite
          expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
            description: Write failures have occurred because too many nodes are unavailable\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.15. Cassandra client request unavailable read

       Read failures have occurred because too many nodes are unavailable[copy]

       

        - alert: CassandraClientRequestUnavailableRead
          expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
            description: Read failures have occurred because too many nodes are unavailable\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.16. Cassandra client request write failure

       A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.[copy]

       

        - alert: CassandraClientRequestWriteFailure
          expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra client request write failure (instance {{ $labels.instance }})
            description: A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.17. Cassandra client request read failure

       A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.[copy]

       

        - alert: CassandraClientRequestReadFailure
          expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"}[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra client request read failure (instance {{ $labels.instance }})
            description: A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.9.18. Cassandra cache hit rate key cache

       Key cache hit rate is below 85%[copy]

       

        - alert: CassandraCacheHitRateKeyCache
          expr: cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }})
            description: Key cache hit rate is below 85%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 2. 10. Zookeeper : cloudflare/kafka_zookeeper_exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 2. 11. Kafka : danielqsj/kafka_exporter (2 rules)[copy all]

    • 2.11.1. Kafka topics replicas

       Kafka topic has fewer than 3 in-sync replicas[copy]

       

        - alert: KafkaTopicsReplicas
          expr: sum(kafka_topic_partition_in_sync_replica) by (topic) < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kafka topics replicas (instance {{ $labels.instance }})
            description: Kafka topic has fewer than 3 in-sync replicas\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 2.11.2. Kafka consumers group

       Kafka consumer group lag is too high (> 50)[copy]

       

        - alert: KafkaConsumersGroup
          expr: sum(kafka_consumergroup_lag) by (consumergroup) > 50
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kafka consumers group (instance {{ $labels.instance }})
            description: Kafka consumer group lag is too high (> 50)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
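
       For faster triage, lag can also be broken down per topic, since the exporter labels kafka_consumergroup_lag with both consumergroup and topic; a hedged variant:

        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 50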

       


  • 3. 1. Nginx : nginx-lua-prometheus (3 rules)[copy all]

    • 3.1.1. Nginx high HTTP 4xx error rate

       Too many HTTP requests with status 4xx (> 5%)[copy]

       

        - alert: NginxHighHttp4xxErrorRate
          expr: sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 4xx (> 5%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.1.2. Nginx high HTTP 5xx error rate

       Too many HTTP requests with status 5xx (> 5%)[copy]

       

        - alert: NginxHighHttp5xxErrorRate
          expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 5xx (> 5%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.1.3. Nginx latency high

       Nginx p99 latency is higher than 10 seconds[copy]

       

        - alert: NginxLatencyHigh
          expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node)) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Nginx latency high (instance {{ $labels.instance }})
            description: Nginx p99 latency is higher than 10 seconds\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
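
       Whatever thresholds you settle on, rule files can be validated before reloading Prometheus; a minimal sketch (the file path is an assumption):

        promtool check rules /etc/prometheus/alerts/nginx.rules.yml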

       


  • 3. 2. Apache : Lusitaniae/apache_exporter (3 rules)[copy all]

    • 3.2.1. Apache down

       Apache down[copy]

       

        - alert: ApacheDown
          expr: apache_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Apache down (instance {{ $labels.instance }})
            description: Apache down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.2.2. Apache workers load

       Apache workers in busy state approach the max workers count (80% busy) on {{ $labels.instance }}[copy]

       

        - alert: ApacheWorkersLoad
          expr: (sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Apache workers load (instance {{ $labels.instance }})
            description: Apache workers in busy state approach the max workers count (80% busy) on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.2.3. Apache restart

       Apache has just been restarted, less than one minute ago.[copy]

       

        - alert: ApacheRestart
          expr: apache_uptime_seconds_total / 60 < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Apache restart (instance {{ $labels.instance }})
            description: Apache has just been restarted, less than one minute ago.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 3. 3. HaProxy : Embedded exporter (HAProxy >= v2)

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 3. 3. HaProxy : prometheus/haproxy_exporter (HAProxy < v2) (16 rules)[copy all]

    • 3.3.1. HAProxy down

       HAProxy down[copy]

       

        - alert: HaproxyDown
          expr: haproxy_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy down (instance {{ $labels.instance }})
            description: HAProxy down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.2. HAProxy high HTTP 4xx error rate backend

       Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}[copy]

       

        - alert: HaproxyHighHttp4xxErrorRateBackend
          expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
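
       Note the parenthesization in the expression above: PromQL aggregation operators take their argument in parentheses, so the valid form is sum by (backend) (rate(...)), never sum by (backend) rate(...). The same form applies to rules 3.3.3 through 3.3.8 and 3.3.10 below:

        # valid PromQL aggregation over a rate
        sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m]))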

       

    • 3.3.3. HAProxy high HTTP 5xx error rate backend

       Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}[copy]

       

        - alert: HaproxyHighHttp5xxErrorRateBackend
          expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.4. HAProxy high HTTP 4xx error rate server

       Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}[copy]

       

        - alert: HaproxyHighHttp4xxErrorRateServer
          expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.5. HAProxy high HTTP 5xx error rate server

       Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}[copy]

       

        - alert: HaproxyHighHttp5xxErrorRateServer
          expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})
            description: Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.6. HAProxy server response errors

       Too many response errors to {{ $labels.server }} server (> 5%).[copy]

       

        - alert: HaproxyServerResponseErrors
          expr: sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy server response errors (instance {{ $labels.instance }})
            description: Too many response errors to {{ $labels.server }} server (> 5%).\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.7. HAProxy backend connection errors

       Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.[copy]

       

        - alert: HaproxyBackendConnectionErrors
          expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy backend connection errors (instance {{ $labels.instance }})
            description: Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.8. HAProxy server connection errors

       Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.[copy]

       

        - alert: HaproxyServerConnectionErrors
          expr: sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy server connection errors (instance {{ $labels.instance }})
            description: Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.9. HAProxy backend max active session

       HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).[copy]

       

        - alert: HaproxyBackendMaxActiveSession
          expr: avg_over_time((sum by (backend) (haproxy_server_max_sessions) / sum by (backend) (haproxy_server_limit_sessions))[2m:]) * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy backend max active session (instance {{ $labels.instance }})
            description: HAProxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
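
       The [2m:] selector in this expression is a PromQL subquery: it re-evaluates the inner ratio over the last 2 minutes (at the default resolution) so that avg_over_time can smooth a computed expression rather than a raw series. Schematically:

        # <instant expression>[<range>:<resolution>]  -- resolution may be omitted
        avg_over_time(
          (
              sum by (backend) (haproxy_server_max_sessions)
            / sum by (backend) (haproxy_server_limit_sessions)
          )[2m:]
        ) * 100 > 80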

       

    • 3.3.10. HAProxy pending requests

       Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend[copy]

       

        - alert: HaproxyPendingRequests
          expr: sum by (backend) (haproxy_backend_current_queue) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy pending requests (instance {{ $labels.instance }})
            description: Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.11. HAProxy HTTP slowing down

       Average request time is above 2 seconds[copy]

       

        - alert: HaproxyHttpSlowingDown
          expr: avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy HTTP slowing down (instance {{ $labels.instance }})
            description: Average request time is above 2 seconds\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.12. HAProxy retry high

       High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend[copy]

       

        - alert: HaproxyRetryHigh
          expr: sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy retry high (instance {{ $labels.instance }})
            description: High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
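
       rate() must be applied to the raw counter before aggregation: it expects a range vector, and summing first would also hide per-series counter resets. The idiomatic ordering is therefore rate, then sum:

        # per-series rate over a 1m window, then aggregated per backend
        sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10
        # rate(sum by (backend) (haproxy_backend_retry_warnings_total)) would not even parse:
        # rate() takes a range vector, and sum() returns an instant vector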

       

    • 3.3.13. HAProxy backend down

       HAProxy backend is down[copy]

       

        - alert: HaproxyBackendDown
          expr: haproxy_backend_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy backend down (instance {{ $labels.instance }})
            description: HAProxy backend is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.14. HAProxy server down

       HAProxy server is down[copy]

       

        - alert: HaproxyServerDown
          expr: haproxy_server_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: HAProxy server down (instance {{ $labels.instance }})
            description: HAProxy server is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.15. HAProxy frontend security blocked requests

       HAProxy is blocking requests for security reasons[copy]

       

        - alert: HaproxyFrontendSecurityBlockedRequests
          expr: sum by (frontend) (rate(haproxy_frontend_requests_denied_total[1m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }})
            description: HAProxy is blocking requests for security reasons\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.3.16. HAProxy server healthcheck failure

       Some server healthchecks are failing on {{ $labels.server }}[copy]

       

        - alert: HaproxyServerHealthcheckFailure
          expr: increase(haproxy_server_check_failures_total[1m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HAProxy server healthcheck failure (instance {{ $labels.instance }})
            description: Some server healthchecks are failing on {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 3. 4. Traefik : Embedded exporter (3 rules)[copy all]

    • 3.4.1. Traefik backend down

       All Traefik backends are down[copy]

       

        - alert: TraefikBackendDown
          expr: count(traefik_backend_server_up) by (backend) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik backend down (instance {{ $labels.instance }})
            description: All Traefik backends are down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.4.2. Traefik high HTTP 4xx error rate backend

       Traefik backend 4xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp4xxErrorRateBackend
          expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik high HTTP 4xx error rate backend (instance {{ $labels.instance }})
            description: Traefik backend 4xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.4.3. Traefik high HTTP 5xx error rate backend

       Traefik backend 5xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp5xxErrorRateBackend
          expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik high HTTP 5xx error rate backend (instance {{ $labels.instance }})
            description: Traefik backend 5xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 3. 5. Traefik : Embedded exporter v2 (3 rules)[copy all]

    • 3.5.1. Traefik service down

       All Traefik services are down[copy]

       

        - alert: TraefikServiceDown
          expr: count(traefik_service_server_up) by (service) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik service down (instance {{ $labels.instance }})
            description: All Traefik services are down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.5.2. Traefik high HTTP 4xx error rate service

       Traefik service 4xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp4xxErrorRateService
          expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik high HTTP 4xx error rate service (instance {{ $labels.instance }})
            description: Traefik service 4xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 3.5.3. Traefik high HTTP 5xx error rate service

       Traefik service 5xx error rate is above 5%[copy]

       

        - alert: TraefikHighHttp5xxErrorRateService
          expr: sum(rate(traefik_service_requests_total{code=~"5.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Traefik high HTTP 5xx error rate service (instance {{ $labels.instance }})
            description: Traefik service 5xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 4. 1. PHP-FPM : bakins/php-fpm-exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 4. 2. JVM : java-client (1 rule)[copy all]

    • 4.2.1. JVM memory filling up

       JVM memory is filling up (> 80%)[copy]

       

        - alert: JvmMemoryFillingUp
          expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: JVM memory filling up (instance {{ $labels.instance }})
            description: JVM memory is filling up (> 80%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
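
       The division works because PromQL binary operators match series on identical label sets: only the jvm_memory_bytes_used series carrying area="heap" find a partner on the filtered right-hand side. An equivalent, more explicit variant filters both operands:

        jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.8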

       


  • 4. 3. Sidekiq : Strech/sidekiq-prometheus-exporter (2 rules)[copy all]

    • 4.3.1. Sidekiq queue size

       Sidekiq queue {{ $labels.name }} is growing[copy]

       

        - alert: SidekiqQueueSize
          expr: sidekiq_queue_size > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Sidekiq queue size (instance {{ $labels.instance }})
            description: Sidekiq queue {{ $labels.name }} is growing\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 4.3.2. Sidekiq scheduling latency too high

       Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.[copy]

       

        - alert: SidekiqSchedulingLatencyTooHigh
          expr: max(sidekiq_queue_latency) > 120
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }})
            description: Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 1. Kubernetes : kube-state-metrics (32 rules)[copy all]

    • 5.1.1. Kubernetes Node ready

       Node {{ $labels.node }} has been unready for a long time[copy]

       

        - alert: KubernetesNodeReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes Node ready (instance {{ $labels.instance }})
            description: Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.2. Kubernetes memory pressure

       {{ $labels.node }} has MemoryPressure condition[copy]

       

        - alert: KubernetesMemoryPressure
          expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes memory pressure (instance {{ $labels.instance }})
            description: "{{ $labels.node }} has MemoryPressure condition\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.1.3. Kubernetes disk pressure

       {{ $labels.node }} has DiskPressure condition[copy]

       

        - alert: KubernetesDiskPressure
          expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes disk pressure (instance {{ $labels.instance }})
            description: "{{ $labels.node }} has DiskPressure condition\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.1.4. Kubernetes out of disk

       {{ $labels.node }} has OutOfDisk condition[copy]

       

        - alert: KubernetesOutOfDisk
          expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes out of disk (instance {{ $labels.instance }})
            description: "{{ $labels.node }} has OutOfDisk condition\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.1.5. Kubernetes out of capacity

       {{ $labels.node }} is out of capacity[copy]

       

        - alert: KubernetesOutOfCapacity
          expr: sum(kube_pod_info) by (node) / sum(kube_node_status_allocatable_pods) by (node) * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes out of capacity (instance {{ $labels.instance }})
            description: "{{ $labels.node }} is out of capacity\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.1.6. Kubernetes Job failed

       Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete[copy]

       

        - alert: KubernetesJobFailed
          expr: kube_job_status_failed > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes Job failed (instance {{ $labels.instance }})
            description: Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.7. Kubernetes CronJob suspended

       CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended[copy]

       

        - alert: KubernetesCronjobSuspended
          expr: kube_cronjob_spec_suspend != 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes CronJob suspended (instance {{ $labels.instance }})
            description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.8. Kubernetes PersistentVolumeClaim pending

       PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending[copy]

       

        - alert: KubernetesPersistentvolumeclaimPending
          expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }})
            description: PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.9. Kubernetes Volume out of disk space

       Volume is almost full (< 10% left)[copy]

       

        - alert: KubernetesVolumeOutOfDiskSpace
          expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }})
            description: Volume is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.10. Kubernetes Volume full in four days

       {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days.[copy]

       

        - alert: KubernetesVolumeFullInFourDays
          expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes Volume full in four days (instance {{ $labels.instance }})
            description: "{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
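
       predict_linear fits a linear trend to the last 6 hours of the series and extrapolates 4 * 24 * 3600 seconds (four days) ahead; a negative result means the volume is projected to be full by then. Note that {{ $value }} is therefore the extrapolated available bytes, not a percentage:

        # fires when the 6h linear trend predicts no free bytes within 4 days
        predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0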

       

    • 5.1.11. Kubernetes PersistentVolume error

       Persistent volume is in bad state[copy]

       

        - alert: KubernetesPersistentvolumeError
          expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }})
            description: Persistent volume is in bad state\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.12. Kubernetes StatefulSet down

       A StatefulSet went down[copy]

       

        - alert: KubernetesStatefulsetDown
          expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes StatefulSet down (instance {{ $labels.instance }})
            description: A StatefulSet went down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.13. Kubernetes HPA scaling ability

       Pod is unable to scale[copy]

       

        - alert: KubernetesHpaScalingAbility
          expr: kube_hpa_status_condition{status="false", condition ="AbleToScale"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes HPA scaling ability (instance {{ $labels.instance }})
            description: Pod is unable to scale\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.14. Kubernetes HPA metric availability

       HPA is not able to collect metrics[copy]

       

        - alert: KubernetesHpaMetricAvailability
          expr: kube_hpa_status_condition{status="false", condition="ScalingActive"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes HPA metric availability (instance {{ $labels.instance }})
            description: HPA is not able to collect metrics\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.15. Kubernetes HPA scale capability

       The maximum number of desired Pods has been hit[copy]

       

        - alert: KubernetesHpaScaleCapability
          expr: kube_hpa_status_desired_replicas >= kube_hpa_spec_max_replicas
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes HPA scale capability (instance {{ $labels.instance }})
            description: The maximum number of desired Pods has been hit\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.16. Kubernetes Pod not healthy

       Pod has been in a non-ready state for longer than an hour.[copy]

       

        - alert: KubernetesPodNotHealthy
          expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
            description: Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.17. Kubernetes pod crash looping

       Pod {{ $labels.pod }} is crash looping[copy]

       

        - alert: KubernetesPodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
            description: Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
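
       The multiplier converts the per-second restart rate into a per-5-minutes count: rate(...[15m]) yields restarts per second averaged over 15 minutes, and * 60 * 5 rescales that to restarts per 5 minutes, so the rule fires above roughly 5 restarts per 5 minutes:

        # restarts/s (15m average) * 300 s  =  restarts per 5 minutes
        rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5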

       

    • 5.1.18. Kubernetes ReplicaSet mismatch

       ReplicaSet replicas mismatch[copy]

       

        - alert: KubernetesReplicassetMismatch
          expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }})
            description: ReplicaSet replicas mismatch\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.19. Kubernetes Deployment replicas mismatch

       Deployment Replicas mismatch[copy]

       

        - alert: KubernetesDeploymentReplicasMismatch
          expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
            description: Deployment Replicas mismatch\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.20. Kubernetes StatefulSet replicas mismatch

       A StatefulSet has not matched the expected number of replicas for longer than 5 minutes.[copy]

       

        - alert: KubernetesStatefulsetReplicasMismatch
          expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }})
            description: A StatefulSet has not matched the expected number of replicas for longer than 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.21. Kubernetes Deployment generation mismatch

       A Deployment has failed but has not been rolled back.[copy]

       

        - alert: KubernetesDeploymentGenerationMismatch
          expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }})
            description: A Deployment has failed but has not been rolled back.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.22. Kubernetes StatefulSet generation mismatch

       A StatefulSet has failed but has not been rolled back.[copy]

       

        - alert: KubernetesStatefulsetGenerationMismatch
          expr: kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }})
            description: A StatefulSet has failed but has not been rolled back.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.23. Kubernetes StatefulSet update not rolled out

       StatefulSet update has not been rolled out.[copy]

       

        - alert: KubernetesStatefulsetUpdateNotRolledOut
          expr: max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }})
            description: StatefulSet update has not been rolled out.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.24. Kubernetes DaemonSet rollout stuck

       Some Pods of DaemonSet are not scheduled or not ready[copy]

       

        - alert: KubernetesDaemonsetRolloutStuck
          expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})
            description: Some Pods of DaemonSet are not scheduled or not ready\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.25. Kubernetes DaemonSet misscheduled

       Some DaemonSet Pods are running where they are not supposed to run[copy]

       

        - alert: KubernetesDaemonsetMisscheduled
          expr: kube_daemonset_status_number_misscheduled > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }})
            description: Some DaemonSet Pods are running where they are not supposed to run\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.26. Kubernetes CronJob too long

       CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.[copy]

       

        - alert: KubernetesCronjobTooLong
          expr: time() - kube_cronjob_next_schedule_time > 3600
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes CronJob too long (instance {{ $labels.instance }})
            description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.27. Kubernetes job completion

       Kubernetes Job failed to complete[copy]

       

        - alert: KubernetesJobCompletion
          expr: kube_job_spec_completions - kube_job_status_succeeded > 0 or kube_job_status_failed > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes job completion (instance {{ $labels.instance }})
            description: Kubernetes Job failed to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.28. Kubernetes API server errors

       Kubernetes API server is experiencing high error rate[copy]

       

        - alert: KubernetesApiServerErrors
          expr: sum(rate(apiserver_request_count{job="apiserver",code=~"^(?:5..)$"}[2m])) / sum(rate(apiserver_request_count{job="apiserver"}[2m])) * 100 > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes API server errors (instance {{ $labels.instance }})
            description: Kubernetes API server is experiencing high error rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.29. Kubernetes API client errors

       Kubernetes API client is experiencing high error rate[copy]

       

        - alert: KubernetesApiClientErrors
          expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[2m])) by (instance, job) / sum(rate(rest_client_requests_total[2m])) by (instance, job)) * 100 > 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes API client errors (instance {{ $labels.instance }})
            description: Kubernetes API client is experiencing high error rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.30. Kubernetes client certificate expires next week

       A client certificate used to authenticate to the apiserver is expiring next week.[copy]

       

        - alert: KubernetesClientCertificateExpiresNextWeek
          expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }})
            description: A client certificate used to authenticate to the apiserver is expiring next week.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
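
       The first conjunct keeps the rule silent when no client certificates are observed at all; the second takes the 1st percentile (histogram_quantile(0.01, ...)) of remaining certificate lifetime, i.e. the soonest-expiring certificates, and compares it against 7*24*60*60 = 604800 seconds. The critical variant below is the same shape with 24*60*60 = 86400 seconds:

        # 1st percentile of remaining client-certificate lifetime
        histogram_quantile(0.01,
          sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))
        ) < 7 * 24 * 60 * 60  # 604800 seconds = one week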

       

    • 5.1.31. Kubernetes client certificate expires soon

       A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.[copy]

       

        - alert: KubernetesClientCertificateExpiresSoon
          expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }})
            description: A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.1.32. Kubernetes API server latency

       Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.[copy]

       

        - alert: KubernetesApiServerLatency
          expr: histogram_quantile(0.99, sum(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"}) WITHOUT (instance, resource)) / 1e+06 > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes API server latency (instance {{ $labels.instance }})
            description: Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 2. Nomad : Embedded exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 5. 3. Consul : prometheus/consul_exporter (3 rules)[copy all]

    • 5.3.1. Consul service healthcheck failed

       Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`[copy]

       

        - alert: ConsulServiceHealthcheckFailed
          expr: consul_catalog_service_node_healthy == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Consul service healthcheck failed (instance {{ $labels.instance }})
            description: "Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

       

    • 5.3.2. Consul missing master node

       The number of Consul raft peers should be 3 in order to preserve quorum.[copy]

       

        - alert: ConsulMissingMasterNode
          expr: consul_raft_peers < 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Consul missing master node (instance {{ $labels.instance }})
            description: The number of Consul raft peers should be 3 in order to preserve quorum.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
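
       Raft quorum is floor(N/2) + 1, so a 3-node cluster keeps quorum (2 of 3) through one failure, while 2 peers still need 2 votes and tolerate none; hence alerting as soon as the peer count drops below 3:

        # quorum(N) = floor(N/2) + 1
        # N=3 -> quorum 2, tolerates 1 failure;  N=2 -> quorum 2, tolerates 0
        consul_raft_peers < 3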

       

    • 5.3.3. Consul agent unhealthy

       A Consul agent is down[copy]

       

        - alert: ConsulAgentUnhealthy
          expr: consul_health_node_status{status="critical"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Consul agent unhealthy (instance {{ $labels.instance }})
            description: A Consul agent is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 4. Etcd (13 rules)[copy all]

    • 5.4.1. Etcd insufficient Members

       Etcd cluster should have an odd number of members[copy]

       

        - alert: EtcdInsufficientMembers
          expr: count(etcd_server_id) % 2 == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Etcd insufficient Members (instance {{ $labels.instance }})
            description: Etcd cluster should have an odd number of members\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.2. Etcd no Leader

       Etcd cluster has no leader[copy]

       

        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Etcd no Leader (instance {{ $labels.instance }})
            description: Etcd cluster has no leader\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.3. Etcd high number of leader changes

       Etcd leader changed more than 3 times during last hour[copy]

       

        - alert: EtcdHighNumberOfLeaderChanges
          expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of leader changes (instance {{ $labels.instance }})
            description: Etcd leader changed more than 3 times during last hour\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.4. Etcd high number of failed GRPC requests

       More than 1% GRPC request failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedGrpcRequests
          expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
            description: More than 1% GRPC request failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.5. Etcd high number of failed GRPC requests

       More than 5% GRPC request failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedGrpcRequests
          expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
            description: More than 5% GRPC request failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.6. Etcd GRPC requests slow

       GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes[copy]

       

        - alert: EtcdGrpcRequestsSlow
          expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) > 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd GRPC requests slow (instance {{ $labels.instance }})
            description: GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.7. Etcd high number of failed HTTP requests

       More than 1% HTTP failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedHttpRequests
          expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
            description: More than 1% HTTP failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.8. Etcd high number of failed HTTP requests

       More than 5% HTTP failure detected in Etcd for 5 minutes[copy]

       

        - alert: EtcdHighNumberOfFailedHttpRequests
          expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
            description: More than 5% HTTP failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.9. Etcd HTTP requests slow

       HTTP requests slowing down, 99th percentile is over 0.15s for 5 minutes[copy]

       

        - alert: EtcdHttpRequestsSlow
          expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd HTTP requests slow (instance {{ $labels.instance }})
            description: HTTP requests slowing down, 99th percentile is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.10. Etcd member communication slow

       Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes[copy]

       

        - alert: EtcdMemberCommunicationSlow
          expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd member communication slow (instance {{ $labels.instance }})
            description: Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.11. Etcd high number of failed proposals

       Etcd server got more than 5 failed proposals past hour[copy]

       

        - alert: EtcdHighNumberOfFailedProposals
          expr: increase(etcd_server_proposals_failed_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high number of failed proposals (instance {{ $labels.instance }})
            description: Etcd server got more than 5 failed proposals past hour\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.12. Etcd high fsync durations

       Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes[copy]

       

        - alert: EtcdHighFsyncDurations
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high fsync durations (instance {{ $labels.instance }})
            description: Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 5.4.13. Etcd high commit durations

       Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes[copy]

       

        - alert: EtcdHighCommitDurations
          expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Etcd high commit durations (instance {{ $labels.instance }})
            description: Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 5. Linkerd : Embedded exporter (1 rule)[copy all]

    • 5.5.1. Linkerd high error rate

       Linkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%[copy]

       

        - alert: LinkerdHighErrorRate
          expr: sum(rate(request_errors_total[5m])) by (deployment, statefulset, daemonset) / sum(rate(request_total[5m])) by (deployment, statefulset, daemonset) * 100 > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Linkerd high error rate (instance {{ $labels.instance }})
            description: Linkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 5. 6. Istio

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 6. 1. Ceph : Embedded exporter (13 rules)[copy all]

    • 6.1.1. Ceph State

       Ceph instance unhealthy[copy]

       

        - alert: CephState
          expr: ceph_health_status != 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph State (instance {{ $labels.instance }})
            description: Ceph instance unhealthy\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.2. Ceph monitor clock skew

       Ceph monitor clock skew detected. Please check ntp and hardware clock settings[copy]

       

        - alert: CephMonitorClockSkew
          expr: abs(ceph_monitor_clock_skew_seconds) > 0.2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph monitor clock skew (instance {{ $labels.instance }})
            description: Ceph monitor clock skew detected. Please check ntp and hardware clock settings\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.3. Ceph monitor low space

       Ceph monitor storage is low.[copy]

       

        - alert: CephMonitorLowSpace
          expr: ceph_monitor_avail_percent < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph monitor low space (instance {{ $labels.instance }})
            description: Ceph monitor storage is low.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.4. Ceph OSD Down

       Ceph Object Storage Daemon Down[copy]

       

        - alert: CephOsdDown
          expr: ceph_osd_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph OSD Down (instance {{ $labels.instance }})
            description: Ceph Object Storage Daemon Down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.5. Ceph high OSD latency

       Ceph Object Storage Daemon latency is high. Please check if it is stuck in a weird state.[copy]

       

        - alert: CephHighOsdLatency
          expr: ceph_osd_perf_apply_latency_seconds > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph high OSD latency (instance {{ $labels.instance }})
            description: Ceph Object Storage Daemon latency is high. Please check if it is stuck in a weird state.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.6. Ceph OSD low space

       Ceph Object Storage Daemon is going out of space. Please add more disks.[copy]

       

        - alert: CephOsdLowSpace
          expr: ceph_osd_utilization > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph OSD low space (instance {{ $labels.instance }})
            description: Ceph Object Storage Daemon is going out of space. Please add more disks.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.7. Ceph OSD reweighted

       Ceph Object Storage Daemon is taking too much time to resize.[copy]

       

        - alert: CephOsdReweighted
          expr: ceph_osd_weight < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph OSD reweighted (instance {{ $labels.instance }})
            description: Ceph Object Storage Daemon is taking too much time to resize.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.8. Ceph PG down

       Some Ceph placement groups are down. Please ensure that all the data are available.[copy]

       

        - alert: CephPgDown
          expr: ceph_pg_down > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph PG down (instance {{ $labels.instance }})
            description: Some Ceph placement groups are down. Please ensure that all the data are available.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.9. Ceph PG incomplete

       Some Ceph placement groups are incomplete. Please ensure that all the data are available.[copy]

       

        - alert: CephPgIncomplete
          expr: ceph_pg_incomplete > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph PG incomplete (instance {{ $labels.instance }})
            description: Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.10. Ceph PG inconsistent

       Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.[copy]

       

        - alert: CephPgInconsistent
          expr: ceph_pg_inconsistent > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph PG inconsistent (instance {{ $labels.instance }})
            description: Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.11. Ceph PG activation long

       Some Ceph placement groups are taking too long to activate.[copy]

       

        - alert: CephPgActivationLong
          expr: ceph_pg_activating > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph PG activation long (instance {{ $labels.instance }})
            description: Some Ceph placement groups are taking too long to activate.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.12. Ceph PG backfill full

       Some Ceph placement groups are located on a full Object Storage Daemon. These PGs may become unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.[copy]

       

        - alert: CephPgBackfillFull
          expr: ceph_pg_backfill_toofull > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Ceph PG backfill full (instance {{ $labels.instance }})
            description: Some Ceph placement groups are located on a full Object Storage Daemon. These PGs may become unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.1.13. Ceph PG unavailable

       Some Ceph placement groups are unavailable.[copy]

       

        - alert: CephPgUnavailable
          expr: ceph_pg_total - ceph_pg_active > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Ceph PG unavailable (instance {{ $labels.instance }})
            description: Some Ceph placement groups are unavailable.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 2. SpeedTest : Speedtest exporter (2 rules)[copy all]

    • 6.2.1. SpeedTest Slow Internet Download

       Internet download speed is currently {{humanize $value}} Mbps.[copy]

       

        - alert: SpeedtestSlowInternetDownload
          expr: avg_over_time(speedtest_download[30m]) < 75
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: SpeedTest Slow Internet Download (instance {{ $labels.instance }})
            description: Internet download speed is currently {{humanize $value}} Mbps.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.2.2. SpeedTest Slow Internet Upload

       Internet upload speed is currently {{humanize $value}} Mbps.[copy]

       

        - alert: SpeedtestSlowInternetUpload
          expr: avg_over_time(speedtest_upload[30m]) < 20 
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: SpeedTest Slow Internet Upload (instance {{ $labels.instance }})
            description: Internet upload speed is currently {{humanize $value}} Mbps.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 3. ZFS : node-exporter

      // @TODO: Please contribute => https://github.com/samber/awesome-prometheus-alerts 👋
      

  • 6. 4. OpenEBS : Embedded exporter (1 rule)[copy all]

    • 6.4.1. OpenEBS used pool capacity

       OpenEBS Pool uses more than 80% of its capacity[copy]

       

        - alert: OpenebsUsedPoolCapacity
          expr: (openebs_used_pool_capacity_percent) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: OpenEBS used pool capacity (instance {{ $labels.instance }})
            description: OpenEBS Pool uses more than 80% of its capacity\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 5. Minio : Embedded exporter (2 rules)[copy all]

    • 6.5.1. Minio disk offline

       Minio disk is offline[copy]

       

        - alert: MinioDiskOffline
          expr: minio_offline_disks > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Minio disk offline (instance {{ $labels.instance }})
            description: Minio disk is offline\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.5.2. Minio storage space exhausted

       Minio storage space is low (< 10 GB)[copy]

       

        - alert: MinioStorageSpaceExhausted
          expr: minio_disk_storage_free_bytes / 1024 / 1024 / 1024 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Minio storage space exhausted (instance {{ $labels.instance }})
            description: Minio storage space is low (< 10 GB)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 6. Juniper : czerwonk/junos_exporter (3 rules)[copy all]

    • 6.6.1. Juniper switch down

       The switch appears to be down[copy]

       

        - alert: JuniperSwitchDown
          expr: junos_up == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Juniper switch down (instance {{ $labels.instance }})
            description: The switch appears to be down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.6.2. Juniper high Bandwidth Usage 1GiB

       Interface is highly saturated for at least 1 min. (> 0.90 Gbit/s)[copy]

       

        - alert: JuniperHighBandwidthUsage1gib
          expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }})
            description: Interface is highly saturated for at least 1 min. (> 0.90 Gbit/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 6.6.3. Juniper high Bandwidth Usage 1GiB

       Interface is getting saturated for at least 1 min. (> 0.80 Gbit/s)[copy]

       

        - alert: JuniperHighBandwidthUsage1gib
          expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }})
            description: Interface is getting saturated for at least 1 min. (> 0.80 Gbit/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 6. 7. CoreDNS : Embedded exporter (1 rule)[copy all]

    • 6.7.1. CoreDNS Panic Count

       Number of CoreDNS panics encountered[copy]

       

        - alert: CorednsPanicCount
          expr: increase(coredns_panic_count_total[10m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: CoreDNS Panic Count (instance {{ $labels.instance }})
            description: Number of CoreDNS panics encountered\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       


  • 7. 1. Thanos (3 rules)[copy all]

    • 7.1.1. Thanos compaction halted

       Thanos compaction has failed to run and is now halted.[copy]

       

        - alert: ThanosCompactionHalted
          expr: thanos_compactor_halted == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Thanos compaction halted (instance {{ $labels.instance }})
            description: Thanos compaction has failed to run and is now halted.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 7.1.2. Thanos compact bucket operation failure

       Thanos compaction has failing storage operations[copy]

       

        - alert: ThanosCompactBucketOperationFailure
          expr: rate(thanos_objstore_bucket_operation_failures_total[1m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Thanos compact bucket operation failure (instance {{ $labels.instance }})
            description: Thanos compaction has failing storage operations\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}

       

    • 7.1.3. Thanos compact not run

       Thanos compaction has not run in 24 hours.[copy]

       

        - alert: ThanosCompactNotRun
          expr: (time() - thanos_objstore_bucket_last_successful_upload_time) > 24*60*60
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Thanos compact not run (instance {{ $labels.instance }})
            description: Thanos compaction has not run in 24 hours.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}
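
       Since every rule on this page carries a severity label of warning or critical, a matching Alertmanager route can fan the two levels out to different receivers. A minimal sketch, with placeholder receiver names and no notification integrations configured:

        route:
          receiver: default
          routes:
            - match:
                severity: critical
              receiver: pager
            - match:
                severity: warning
              receiver: mail
        receivers:
          - name: default
          - name: pager
          - name: mail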

       


Awesome Prometheus alerts is maintained by samber.
