Grafana Study Notes (9): Alerting - Labels and annotations

1. Introduction

Labels and annotations contain information about an alert. Both have the same structure, a set of named values, but their intended uses are different. An example of a label, or the equivalent annotation, might be alertname="test".

The main difference between a label and an annotation is that labels are used to differentiate an alert from all other alerts, while annotations are used to add additional information to an existing alert.

For example, consider two high CPU alerts: one for server1 and another for server2. In such an example we might have a label called server where the first alert has the label server="server1" and the second alert has the label server="server2". However, we might also want to add a description to each alert such as "The CPU usage for server1 is above 75%.", where server1 and 75% are replaced with the name and CPU usage of the server (please refer to the documentation on templating labels and annotations for how to do this). This kind of description would be more suitable as an annotation.

Labels
Labels contain information that identifies an alert. An example of a label might be server=server1. Each alert can have more than one label, and the complete set of labels for an alert is called its label set. It is this label set that identifies the alert.

For example, an alert might have the label set {alertname="High CPU usage",server="server1"} while another alert might have the label set {alertname="High CPU usage",server="server2"}. These are two separate alerts because although their alertname labels are the same, their server labels are different.

The label set for an alert is a combination of the labels from the datasource, custom labels from the alert rule, and a number of reserved labels such as alertname.
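
To make the idea of a label set identifying an alert concrete, here is a minimal Go sketch. It is not how Grafana computes alert identity internally; `labelSetKey` is a hypothetical helper that only illustrates that two alerts are the same exactly when all of their labels agree.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelSetKey is a hypothetical helper: it builds a canonical string for a
// label set by sorting the label names, so two label sets produce the same
// key only if every name and value is identical.
func labelSetKey(labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, fmt.Sprintf("%s=%q", name, labels[name]))
	}
	return "{" + strings.Join(parts, ",") + "}"
}

func main() {
	a := map[string]string{"alertname": "High CPU usage", "server": "server1"}
	b := map[string]string{"alertname": "High CPU usage", "server": "server2"}

	fmt.Println(labelSetKey(a))
	fmt.Println(labelSetKey(b))
	// Same alertname, different server: the keys differ, so these are two alerts.
	fmt.Println(labelSetKey(a) == labelSetKey(b))
}
```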

Custom Labels
Custom labels are additional labels from the alert rule. Like annotations, custom labels must have a name, and their value can contain a combination of text and template code that is evaluated when an alert is fired. How to template custom labels is covered in section 4.

When using custom labels with templates it is important to make sure that the label value does not change between consecutive evaluations of the alert rule as this will end up creating large numbers of distinct alerts. However, it is OK for the template to produce different label values for different alerts. For example, do not put the value of the query in a custom label as this will end up creating a new set of alerts each time the value changes. Instead use annotations.

It is also important to make sure that the label set for an alert does not have two or more labels with the same name. If a custom label has the same name as a label from the datasource then it will replace that label. However, should a custom label have the same name as a reserved label then the custom label will be omitted from the alert.
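
The collision rules above can be sketched in Go as follows. This is only an illustration of the stated precedence (custom over data source, reserved over custom); `mergeLabels` and the sample label names are hypothetical and are not part of Grafana.

```go
package main

import "fmt"

// mergeLabels is a hypothetical helper that applies the precedence described
// in the text: a custom label replaces a data source label with the same
// name, but a custom label that collides with a reserved label is omitted.
func mergeLabels(datasource, custom, reserved map[string]string) map[string]string {
	merged := make(map[string]string)
	for name, value := range datasource {
		merged[name] = value
	}
	for name, value := range custom {
		if _, isReserved := reserved[name]; isReserved {
			continue // the custom label is dropped, the reserved one is kept
		}
		merged[name] = value
	}
	for name, value := range reserved {
		merged[name] = value
	}
	return merged
}

func main() {
	datasource := map[string]string{"instance": "server1", "team": "infra"}
	custom := map[string]string{"team": "platform", "alertname": "my name"}
	reserved := map[string]string{"alertname": "High CPU usage"}

	// team comes from the custom label, alertname from the reserved label.
	fmt.Println(mergeLabels(datasource, custom, reserved))
}
```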

Annotations
Annotations are named pairs that add additional information to existing alerts. There are a number of suggested annotations in Grafana such as description, summary, runbook_url, dashboardUId and panelId. Like custom labels, annotations must have a name, and their value can contain a combination of text and template code that is evaluated when an alert is fired. If an annotation contains template code, the template is evaluated once when the alert is fired. It is not re-evaluated, even when the alert is resolved. How to template annotations is covered in section 4.

2. Label matchers

Use labels and label matchers to link alert rules to notification policies and silences. This allows for a very flexible way to manage your alert instances, specify which policy should handle them, and which alerts to silence.
A label matcher consists of three distinct parts: the label, the value, and the operator.
  • The Label field is the name of the label to match. It must exactly match the label name.
  • The Value field matches against the corresponding value for the specified Label name. How it matches depends on the Operator value.
  • The Operator field is the operator to match against the label value. The available operators are:

Operator    Description
=           Select labels that are exactly equal to the value.
!=          Select labels that are not equal to the value.
=~          Select labels that regex-match the value.
!~          Select labels that do not regex-match the value.
If you are using multiple label matchers, they are combined using the AND logical operator. This means that all matchers must match in order to link a rule to a policy.

Example scenario
If you define the following set of labels for your alert:

{ foo=bar, baz=qux, id=12 }

then:

A label matcher defined as foo=bar matches this alert rule.
A label matcher defined as foo!=bar does not match this alert rule.
A label matcher defined as id=~[0-9]+ matches this alert rule.
A label matcher defined as baz!~[0-9]+ matches this alert rule.
Two label matchers defined as foo=bar and id=~[0-9]+ match this alert rule.
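
The scenario above can be reproduced with a small Go sketch. It is not Grafana's or Alertmanager's implementation; the `matcher` type and `matchAll` function are hypothetical. Two assumptions are made here: regular expressions are anchored so that they must match the whole value (the behaviour of Prometheus-style matchers), and a missing label is treated as an empty value.

```go
package main

import (
	"fmt"
	"regexp"
)

// matcher is a hypothetical representation of one label matcher.
type matcher struct {
	Label, Operator, Value string
}

// matches evaluates a single matcher against a label set.
func (m matcher) matches(labels map[string]string) (bool, error) {
	actual := labels[m.Label] // assumption: a missing label behaves like ""
	switch m.Operator {
	case "=":
		return actual == m.Value, nil
	case "!=":
		return actual != m.Value, nil
	case "=~", "!~":
		// Assumption: the regex must match the entire value.
		re, err := regexp.Compile("^(?:" + m.Value + ")$")
		if err != nil {
			return false, err
		}
		return re.MatchString(actual) == (m.Operator == "=~"), nil
	}
	return false, fmt.Errorf("unknown operator %q", m.Operator)
}

// matchAll combines matchers with AND: every matcher must match.
func matchAll(labels map[string]string, matchers ...matcher) bool {
	for _, m := range matchers {
		ok, err := m.matches(labels)
		if err != nil || !ok {
			return false
		}
	}
	return true
}

func main() {
	labels := map[string]string{"foo": "bar", "baz": "qux", "id": "12"}

	fmt.Println(matchAll(labels, matcher{"foo", "=", "bar"}))      // matches
	fmt.Println(matchAll(labels, matcher{"foo", "!=", "bar"}))     // does not match
	fmt.Println(matchAll(labels, matcher{"id", "=~", "[0-9]+"}))   // matches
	fmt.Println(matchAll(labels, matcher{"baz", "!~", "[0-9]+"}))  // matches
	fmt.Println(matchAll(labels, matcher{"foo", "=", "bar"}, matcher{"id", "=~", "[0-9]+"}))
}
```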

3. Labels in Grafana Alerting

This topic explains why labels are a fundamental component of alerting.

The complete set of labels for an alert is what uniquely identifies an alert within Grafana alerts.
The Alertmanager uses labels to match alerts for silences and alert groups in notification policies.
The alerting UI shows labels for every alert instance generated during evaluation of that rule.
Contact points can access labels to dynamically generate notifications that contain information specific to the alert that is resulting in a notification.
You can add labels to an alerting rule. Labels are manually configurable, use template functions, and can reference other labels. Labels added to an alerting rule take precedence in the event of a collision between labels (except in the case of Grafana reserved labels).

External Alertmanager Compatibility
Grafana’s built-in Alertmanager supports both Unicode label keys and values. If you are using an external Prometheus Alertmanager, label keys must be compatible with their data model. This means that label keys must only contain ASCII letters, numbers, as well as underscores and match the regex [a-zA-Z_][a-zA-Z0-9_]*. Any invalid characters will be removed or replaced by the Grafana alerting engine before being sent to the external Alertmanager according to the following rules:

  • Whitespace will be removed.
  • Invalid ASCII characters will be replaced with _.
  • All other characters will be replaced with their lower-case hex representation. If this is the first character it will be prefixed with _.
    Example: A label key/value pair Alert! 🔔="🔥" will become Alert_0x1f514="🔥".
    Note: If multiple label keys are sanitized to the same value, the duplicates will have a short hash of the original label appended as a suffix.
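
The three sanitization rules can be sketched in Go as below. This is an illustration written from the rules as stated, not Grafana's actual code; `sanitizeLabelKey` is a hypothetical name. The sketch assumes "first character" means the first character of the output, and it does not implement the hash suffix for duplicate keys or any special handling of a leading digit.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isValidKeyChar reports whether r is allowed in a Prometheus label key
// (ASCII letters, digits and underscore).
func isValidKeyChar(r rune) bool {
	return r == '_' ||
		(r >= 'a' && r <= 'z') ||
		(r >= 'A' && r <= 'Z') ||
		(r >= '0' && r <= '9')
}

// sanitizeLabelKey is a hypothetical helper applying the rules from the text:
// whitespace is removed, other invalid ASCII becomes "_", and any non-ASCII
// character becomes its lower-case hex code point.
func sanitizeLabelKey(key string) string {
	var out strings.Builder
	for _, r := range key {
		switch {
		case isValidKeyChar(r):
			out.WriteRune(r)
		case unicode.IsSpace(r):
			// Whitespace is dropped.
		case r <= unicode.MaxASCII:
			out.WriteByte('_')
		default:
			if out.Len() == 0 {
				out.WriteByte('_') // hex form would otherwise start the key
			}
			fmt.Fprintf(&out, "%#x", r)
		}
	}
	return out.String()
}

func main() {
	fmt.Println(sanitizeLabelKey("Alert! 🔔")) // the example from the text
	fmt.Println(sanitizeLabelKey("🔥"))
}
```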
Grafana reserved labels
Note: Labels prefixed with grafana_ are reserved by Grafana for special use. If a manually configured label beginning with grafana_ is added, it may be overwritten in case of collision. To stop the Grafana Alerting engine from adding a reserved label, you can disable it via the `disabled_labels` option in the [unified_alerting.reserved_labels] configuration section.
Grafana reserved labels can be used in the same way as manually configured labels. The current list of available reserved labels is maintained in the Grafana documentation.

4. Templating labels and annotations

You can use templates to include data from queries and expressions in labels and annotations. For example, you might want to set the severity label for an alert based on the value of the query, or use the instance label from the query in a summary annotation so you know which server is experiencing high CPU usage.

All templates should be written in text/template. Regardless of whether you are templating a label or an annotation, you should write each template inline inside the label or annotation that you are templating. This means you cannot share templates between labels and annotations, and instead you will need to copy templates wherever you want to use them.

Each template is evaluated whenever the alert rule is evaluated, and is evaluated for every alert separately. For example, if your alert rule has a templated summary annotation, and the alert rule has 10 firing alerts, then the template will be executed 10 times, once for each alert. You should try to avoid doing expensive computations in your templates as much as possible.

Examples
Rather than write a complete tutorial on text/template, the following examples attempt to show the most common use-cases we have seen for templates. You can use these examples verbatim, or adapt them as necessary for your use case. For more information on how to write text/template refer to the text/template documentation.
Print all labels, comma separated
To print all labels, comma separated, print the $labels variable:
{{ $labels }}
For example, given an alert with the labels alertname=High CPU usage, grafana_folder=CPU alerts and instance=server1, this would print:
alertname=High CPU usage, grafana_folder=CPU alerts, instance=server1
If you are using classic conditions then $labels will not contain any labels from the query. Refer to the $labels variable for more information.

Print all labels, one per line
To print all labels, one per line, use a `range` to iterate over each key/value pair and print them individually. Here `$k` refers to the name and `$v` refers to the value of the current label:
{{ range $k, $v := $labels -}}
{{ $k }}={{ $v }}
{{ end }}
For example, given an alert with the labels `alertname=High CPU usage`, `grafana_folder=CPU alerts` and `instance=server1`, this would print:
alertname=High CPU usage
grafana_folder=CPU alerts
instance=server1
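
To try this template outside Grafana, a rough Go sketch with the standard text/template package might look like the following. Plain text/template has no `$labels` variable, so the sketch binds it with a small prelude; that prelude, the `render` helper and the data struct are assumptions of this example and are not Grafana's API.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// prelude binds $labels so the template body can be written as in the text.
// This binding is an assumption of the sketch.
const prelude = `{{- $labels := .Labels -}}`

// onePerLine is the template from the text, unchanged.
const onePerLine = `{{ range $k, $v := $labels -}}
{{ $k }}={{ $v }}
{{ end }}`

// render is a hypothetical helper that executes a template against a label set.
func render(tmpl string, labels map[string]string) (string, error) {
	t, err := template.New("annotation").Parse(prelude + tmpl)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	data := struct{ Labels map[string]string }{Labels: labels}
	if err := t.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	labels := map[string]string{
		"alertname":      "High CPU usage",
		"grafana_folder": "CPU alerts",
		"instance":       "server1",
	}
	out, err := render(onePerLine, labels)
	if err != nil {
		panic(err)
	}
	// text/template visits map keys in sorted order, so the lines are stable.
	fmt.Print(out)
}
```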
Print an individual label
To print an individual label use the `index` function with the `$labels` variable:
The host {{ index $labels "instance" }} has exceeded 80% CPU usage for the last 5 minutes
For example, given an alert with the label `instance=server1`, this would print:
The host server1 has exceeded 80% CPU usage for the last 5 minutes
Print the value of a query
To print the value of an instant query you can print its Ref ID using the `index` function and the `$values` variable:  `{{ index $values "A" }}`
For example, given an instant query that returns the value 81.2345, this will print: `81.2345`
To print the value of a range query you must first reduce it from a time series to an instant vector with a reduce expression. You can then print the result of the reduce expression by using its Ref ID instead. For example, if the reduce expression takes the average of A and has the Ref ID B you would write: `{{ index $values "B" }}`

Print the humanized value of a query
To print the humanized value of an instant query use the humanize function: `{{ humanize (index $values "A").Value }}`
For example, given an instant query that returns the value 81.2345, this will print: `81.234`
To print the humanized value of a range query you must first reduce it from a time series to an instant vector with a reduce expression. You can then print the result of the reduce expression by using its Ref ID instead. For example, if the reduce expression takes the average of A and has the Ref ID B you would write: `{{ humanize (index $values "B").Value }}`

Print the value of a query as a percentage
To print the value of an instant query as a percentage use the `humanizePercentage` function: `{{ humanizePercentage (index $values "A").Value }}`
This function expects the value to be a decimal number between 0 and 1. If the value is instead a decimal number between 0 and 100, you can divide it by 100 either in your query or by using a math expression. If the query is a range query you must first reduce it from a time series to an instant vector with a reduce expression.

Set a severity from the value of a query
To set a severity label from the value of a query use an if statement and the greater than comparison function. Make sure to use decimals (80.0, 50.0, 0.0, etc) when doing comparisons against $values as text/template does not support type coercion. You can find a list of all the supported comparison functions in the text/template documentation.
{{ if (gt $values.A.Value 80.0) -}}
high
{{ else if (gt $values.A.Value 50.0) -}}
medium
{{ else -}}
low
{{- end }}
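
The severity template can be exercised with plain text/template as a rough check of its logic. In this sketch the `queryValue` type, the prelude that binds `$values`, and the `severity` helper are all assumptions made for illustration. The result is also trimmed with strings.TrimSpace, which is a choice of this sketch, because the template leaves a trailing newline after "high" and "medium".

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// queryValue is a hypothetical stand-in for one entry of $values.
type queryValue struct {
	Labels map[string]string
	Value  float64
}

// prelude binds $values; this binding is an assumption of the sketch.
const prelude = `{{- $values := .Values -}}`

// severityTmpl is the template from the text, unchanged.
const severityTmpl = `{{ if (gt $values.A.Value 80.0) -}}
high
{{ else if (gt $values.A.Value 50.0) -}}
medium
{{ else -}}
low
{{- end }}`

// severity renders the template for a given value of query A.
func severity(a float64) (string, error) {
	t, err := template.New("severity").Parse(prelude + severityTmpl)
	if err != nil {
		return "", err
	}
	data := struct{ Values map[string]queryValue }{
		Values: map[string]queryValue{"A": {Value: a}},
	}
	var out strings.Builder
	if err := t.Execute(&out, data); err != nil {
		return "", err
	}
	// Trimming is this sketch's choice, to get a clean label value.
	return strings.TrimSpace(out.String()), nil
}

func main() {
	for _, v := range []float64{81.2345, 60, 10} {
		s, err := severity(v)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%v -> %s\n", v, s)
	}
}
```

The comparison works because both operands are floating point: the `Value` field is a float64 and the constants are written with a decimal point.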

Print all labels from a classic condition
You cannot use $labels to print labels from the query if you are using classic conditions, and must use $values instead. The reason for this is classic conditions discard these labels to enforce uni-dimensional behavior (at most one alert per alert rule). If classic conditions didn’t discard these labels, then queries that returned many time series would cause alerts to flap between firing and resolved constantly as the labels would change every time the alert rule was evaluated.

Instead, the $values variable contains the reduced values of all time series for all conditions that are firing. For example, if you have an alert rule with a query A that returns two time series, and a classic condition B with two conditions, then $values would contain B0, B1, B2 and B3. If the classic condition B had just one condition, then $values would contain just B0 and B1.

To print all labels of all firing time series use the following template (make sure to replace B in the regular expression with the Ref ID of the classic condition if it’s different):
{{ range $k, $v := $values -}}
{{ if (match "B[0-9]+" $k) -}}
{{ $k }}: {{ $v.Labels }}{{ end }}
{{ end }}

For example, a classic condition for two time series exceeding a single condition would print:
B0: instance=server1
B1: instance=server2

If the classic condition has two or more conditions, and a time series exceeds multiple conditions at the same time, then its labels will be duplicated for each condition that is exceeded:
B0: instance=server1
B1: instance=server2
B2: instance=server1
B3: instance=server2

If you need to print unique labels you should consider changing your alert rules from uni-dimensional to multi-dimensional instead. You can do this by replacing your classic condition with reduce and math expressions.

Print all values from a classic condition
To print all values from a classic condition take the previous example and replace $v.Labels with $v.Value:
{{ range $k, $v := $values -}}
{{ if (match "B[0-9]+" $k) -}}
{{ $k }}: {{ $v.Value }}{{ end }}
{{ end }}

For example, a classic condition for two time series exceeding a single condition would print:
B0: 81.2345
B1: 84.5678

If the classic condition has two or more conditions, and a time series exceeds multiple conditions at the same time, then $values will contain the values of all conditions:
B0: 81.2345
B1: 92.3456
B2: 84.5678
B3: 95.6789
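
The classic condition template relies on a `match` function, which is not one of the built-in functions of text/template. To experiment with the template in plain Go, one option is to register a stand-in, as in this sketch. The `match` implementation (a thin wrapper over regexp.MatchString), the `queryValue` type, the prelude and `renderValues` are assumptions of this example and may differ from what Grafana provides.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"text/template"
)

// queryValue is a hypothetical stand-in for one entry of $values.
type queryValue struct {
	Labels map[string]string
	Value  float64
}

// prelude binds $values; this binding is an assumption of the sketch.
const prelude = `{{- $values := .Values -}}`

// classicValues is the template from the text, unchanged.
const classicValues = `{{ range $k, $v := $values -}}
{{ if (match "B[0-9]+" $k) -}}
{{ $k }}: {{ $v.Value }}{{ end }}
{{ end }}`

// renderValues executes a template with a stand-in "match" function.
func renderValues(tmpl string, values map[string]queryValue) (string, error) {
	funcs := template.FuncMap{
		// Assumed behaviour: report whether the pattern matches the text.
		"match": func(pattern, text string) (bool, error) {
			return regexp.MatchString(pattern, text)
		},
	}
	t, err := template.New("annotation").Funcs(funcs).Parse(prelude + tmpl)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	data := struct{ Values map[string]queryValue }{Values: values}
	if err := t.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	values := map[string]queryValue{
		"B0": {Labels: map[string]string{"instance": "server1"}, Value: 81.2345},
		"B1": {Labels: map[string]string{"instance": "server2"}, Value: 84.5678},
	}
	out, err := renderValues(classicValues, values)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```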
Variables
The following variables are available to you when templating labels and annotations:
The labels variable
The $labels variable contains all labels from the query. For example, suppose you have a query that returns CPU usage for all of your servers, and you have an alert rule that fires when any of your servers have exceeded 80% CPU usage for the last 5 minutes. You want to add a summary annotation to the alert that tells you which server is experiencing high CPU usage. With the $labels variable you can write a template that prints a human-readable sentence such as:  CPU usage for {{ index $labels "instance" }} has exceeded 80% for the last 5 minutes

The value variable
The `$value` variable is a string containing the labels and values of all instant queries; threshold, reduce and math expressions, and classic conditions in the alert rule. It does not contain the results of range queries, as these can return anywhere from 10s to 10,000s of rows or metrics. If it did, for especially large queries a single alert could use 10s of MBs of memory and Grafana would run out of memory very quickly.
To print the `$value` variable in the summary you would write something like this: CPU usage for {{ index $labels "instance" }} has exceeded 80% for the last 5 minutes: {{ $value }}
The output would look something like this: `CPU usage for instance1 has exceeded 80% for the last 5 minutes: [ var='A' labels={instance=instance1} value=81.234 ]`

Here `var='A'` refers to the instant query with Ref ID A, `labels={instance=instance1}` refers to the labels, and `value=81.234` refers to the average CPU usage over the last 5 minutes.

If you want to print just some of the string instead of the full string then use the `$values` variable. It contains the same information as `$value`, but in a structured table, and is much easier to use than writing a regular expression to match just the text you want.

The values variable
The $values variable is a table containing the labels and floating point values of all instant queries and expressions, indexed by their Ref IDs.
To print the value of the instant query with Ref ID A: CPU usage for {{ index $labels "instance" }} has exceeded 80% for the last 5 minutes: {{ index $values "A" }}
For example, given an alert with the label `instance=server1` and an instant query with the value 81.2345, this would print: CPU usage for server1 has exceeded 80% for the last 5 minutes: 81.2345
If the query in Ref ID A is a range query rather than an instant query then add a reduce expression with Ref ID B and replace `(index $values "A")` with `(index $values "B")`: CPU usage for {{ index $labels "instance" }} has exceeded 80% for the last 5 minutes: {{ index $values "B" }}


Posted 2023-11-22 16:44 by 钱塘江畔