Prometheus: gotchas and PromQL notes

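Notes for reference

Miscellaneous notes

A label (tag) value must not be empty; Prometheus treats a label with an empty value as if the label did not exist.
Prometheus has exactly four metric types: Counter, Gauge, Histogram, and Summary.
- Description of the metric types: https://blog.csdn.net/qq_26531719/article/details/112391592
Histogram and Summary both exist to show how observed values are distributed.
A histogram metric always exposes three kinds of series together: {metric}_bucket, {metric}_count, and {metric}_sum.
Queries that aggregate many time series can be slow. You can configure a recording rule in Prometheus to precompute them, as in the following example.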

To record the result of the expression into a new metric called job_instance_mode:node_cpu_seconds:avg_rate5m, create a file with the following recording rule and save it as prometheus.rules.yml:

groups:
- name: cpu-node
  rules:
  - record: job_instance_mode:node_cpu_seconds:avg_rate5m
    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
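To make the aggregation in the rule concrete, here is a minimal Java sketch of what `avg by (job, instance, mode)` does to a set of already computed per-second rates: series that agree on the listed labels are grouped, and the remaining labels (here `cpu`) are averaged away. The `AvgBySketch` and `Series` names are hypothetical and exist only for illustration; this is not how Prometheus itself is implemented.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of "avg by (...)" grouping; not Prometheus code. */
final class AvgBySketch {

    /** One input series: its labels plus an already computed value (e.g. a 5m rate). */
    static final class Series {
        final Map<String, String> labels = new LinkedHashMap<>();
        final double value;

        /** labelPairs is a flat list: name1, value1, name2, value2, ... */
        Series(double value, String... labelPairs) {
            this.value = value;
            for (int i = 0; i + 1 < labelPairs.length; i += 2) {
                labels.put(labelPairs[i], labelPairs[i + 1]);
            }
        }
    }

    /** Groups series by the given label names and averages the values in each group. */
    static Map<String, Double> avgBy(List<Series> input, String... by) {
        Map<String, double[]> acc = new LinkedHashMap<>(); // group key -> {sum, count}
        for (Series s : input) {
            StringBuilder key = new StringBuilder();
            for (String name : by) {
                if (key.length() > 0) {
                    key.append(',');
                }
                // A missing label behaves like an empty value.
                key.append(name).append('=').append(s.labels.getOrDefault(name, ""));
            }
            double[] sumAndCount = acc.computeIfAbsent(key.toString(), k -> new double[2]);
            sumAndCount[0] += s.value;
            sumAndCount[1] += 1;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            result.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return result;
    }
}
```

With a recording rule, Prometheus performs this grouping once per evaluation interval and stores the result as a new series, so dashboards read the precomputed value instead of repeating the aggregation.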

***
To make Prometheus pick up this new rule, add a rule_files statement in your prometheus.yml. The config should now look like this:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

PromQL section https://prometheus.io/docs/prometheus/latest/querying/basics/

Functions https://prometheus.io/docs/prometheus/latest/querying/functions/
Operators https://prometheus.io/docs/prometheus/latest/querying/operators/
Examples https://prometheus.io/docs/prometheus/latest/querying/examples/

Data types

In Prometheus's expression language, an expression or sub-expression can evaluate to one of four types:

Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp
Range vector - a set of time series containing a range of data points over time for each time series
Scalar - a simple numeric floating point value
String - a simple string value; currently unused

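Literal types
- String literals
- Float literals
Time series selectors
Instant vector selectors return a single sample per series at a given point in time
- Example: http_requests_total
Range vector selectors return all samples per series within the specified time range
- Example: http_requests_total{job="prometheus"}[5m]
Label matching operators used in selectors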

It is also possible to negatively match a label value, or to match label values against regular expressions. The following label matching operators exist:

=: Select labels that are exactly equal to the provided string.
!=: Select labels that are not equal to the provided string.
=~: Select labels that regex-match the provided string.
!~: Select labels that do not regex-match the provided string.
For example, this selects all http_requests_total time series for staging, testing, and development environments and HTTP methods other than GET.

http_requests_total{environment=~"staging|testing|development",method!="GET"}
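The four operators can be illustrated with a small Java sketch. It relies on two facts about PromQL matchers: regex matches are fully anchored (the pattern must match the whole label value), and a series without the label is treated as having an empty value. The class and method names are hypothetical, and `java.util.regex` is used as a stand-in for the RE2 syntax that Prometheus uses, so unusual patterns may behave differently.

```java
import java.util.regex.Pattern;

/** Illustrative sketch of the four label matching operators; not Prometheus code. */
final class LabelMatcherSketch {

    /**
     * Applies one matcher to a label value.
     * op is one of "=", "!=", "=~", "!~"; a missing label (null) is treated as "".
     */
    static boolean matches(String op, String labelValue, String operand) {
        String value = labelValue == null ? "" : labelValue;
        switch (op) {
            case "=":
                return value.equals(operand);
            case "!=":
                return !value.equals(operand);
            case "=~":
                // Pattern.matches requires the whole value to match (fully anchored).
                return Pattern.matches(operand, value);
            case "!~":
                return !Pattern.matches(operand, value);
            default:
                throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        String envRegex = "staging|testing|development";
        // Selected: environment matches the regex and method is not GET.
        System.out.println(matches("=~", "staging", envRegex) && matches("!=", "POST", "GET"));
        // Not selected: "staging-eu" only contains "staging", it does not fully match.
        System.out.println(matches("=~", "staging-eu", envRegex));
    }
}
```

Because of the anchoring, a prefix match has to be written explicitly, for example `environment=~"staging.*"`.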

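Time units. Each unit takes an integer count; decimals such as 1.5h are not accepted. Older releases accept only a single unit per duration, while newer Prometheus releases also allow several units concatenated from largest to smallest, for example 1h30m.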

ms - milliseconds
s - seconds
m - minutes
h - hours
d - days - assuming a day has always 24h
w - weeks - assuming a week has always 7d
y - years - assuming a year has always 365d
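As a rough illustration of these rules, the following Java sketch converts a duration string into milliseconds. It assumes the unit lengths listed above and accepts concatenated units in descending order (such as `1h30m`), which newer Prometheus releases allow. `DurationSketch` is a hypothetical name, and this is not the parser Prometheus uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for PromQL-style durations; not Prometheus code. */
final class DurationSketch {

    // Units in descending order; each is optional and may appear at most once.
    private static final Pattern DURATION = Pattern.compile(
            "(?:(\\d+)y)?(?:(\\d+)w)?(?:(\\d+)d)?(?:(\\d+)h)?(?:(\\d+)m)?(?:(\\d+)s)?(?:(\\d+)ms)?");

    // Milliseconds per unit, in the same order as the capture groups above.
    private static final long[] UNIT_MILLIS = {
            365L * 24 * 3600 * 1000, // y: always 365d
            7L * 24 * 3600 * 1000,   // w: always 7d
            24L * 3600 * 1000,       // d: always 24h
            3600L * 1000,            // h
            60L * 1000,              // m
            1000L,                   // s
            1L                       // ms
    };

    /** Returns the duration in milliseconds, or throws if the text is not valid. */
    static long toMillis(String text) {
        Matcher m = DURATION.matcher(text);
        if (text.isEmpty() || !m.matches()) {
            throw new IllegalArgumentException("not a valid duration: " + text);
        }
        long total = 0;
        for (int i = 0; i < UNIT_MILLIS.length; i++) {
            String digits = m.group(i + 1);
            if (digits != null) {
                total += Long.parseLong(digits) * UNIT_MILLIS[i];
            }
        }
        return total;
    }
}
```

For example, `toMillis("5m")` yields 300000, while `toMillis("1.5h")` and `toMillis("30m1h")` are rejected because of the decimal and the wrong unit order.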

Offset modifier
Applies to both instant vectors and range vectors

http_requests_total offset 5m
sum(http_requests_total{method="GET"} offset 5m)

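@ modifier ** Introduced in Prometheus v2.25.0; older versions report an error, and it has to be enabled through a flag before it can be used **
The @ modifier changes the evaluation time of individual instant and range vectors in a query. The time given to the @ modifier is a Unix timestamp, written as a float literal.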

http_requests_total @ 1609746000
sum(http_requests_total{method="GET"} @ 1609746000) 
rate(http_requests_total[5m] @ 1609746000)
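The number after `@` is seconds since the Unix epoch. As a small sketch, the value used in these examples can be derived from a readable UTC time with `java.time.Instant`; the `AtModifierSketch` class and its `at` helper are hypothetical names used only for illustration.

```java
import java.time.Instant;

/** Illustrative helper that builds a query string with an @ modifier. */
final class AtModifierSketch {

    /** Returns "<selector> @ <unix seconds>" for the given evaluation instant. */
    static String at(String selector, Instant when) {
        return selector + " @ " + when.getEpochSecond();
    }

    public static void main(String[] args) {
        // 2021-01-04T07:40:00Z is 1609746000 seconds after the Unix epoch.
        System.out.println(at("http_requests_total", Instant.parse("2021-01-04T07:40:00Z")));
    }
}
```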

It can be combined with offset; the following two queries are equivalent

# offset after @
http_requests_total @ 1609746000 offset 5m
# offset before @
http_requests_total offset 5m @ 1609746000

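*** By default this modifier is disabled, because it breaks the invariant that PromQL never looks ahead of the evaluation time for samples. It can be enabled by setting the --enable-feature=promql-at-modifier flag. See the disabled features documentation for details on this flag. ***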

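Subquery

Histograms and summaries

Histogram
A histogram samples observations over a period of time and counts them per configured bucket as well as in total. For a base metric name <basename>, it exposes the following series: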

<basename>_bucket{le="<upper inclusive bound>"}             number of observations less than or equal to the bound
# Of the 2 requests in total, 0 had an HTTP response time <= 0.005 seconds
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.005",} 0.0
# Of the 2 requests in total, 0 had an HTTP response time <= 0.01 seconds
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.01",} 0.0
<basename>_sum                          sum of all observed values
# Meaning: the 2 HTTP requests took 13.107670803000001 seconds in total
io_namespace_http_requests_latency_seconds_histogram_sum{path="/",method="GET",code="200",} 13.107670803000001
<basename>_count                        total number of observations
# Meaning: 2 HTTP requests have occurred so far
io_namespace_http_requests_latency_seconds_histogram_count{path="/",method="GET",code="200",} 2.0

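A Prometheus histogram is a cumulative histogram. Suppose each bucket is 0.2s wide: the first bucket counts requests with a response time of at most 0.2s, the second counts requests with a response time of at most 0.4s, and so on. Every bucket therefore contains the samples of all the buckets before it, which is why it is called cumulative.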
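The cumulative layout can be sketched in a few lines of Java. This is an illustration only, assuming 0.2s-wide buckets as in the description; `CumulativeHistogramSketch` is a hypothetical class and not the client library (real clients may store per-bucket counts and accumulate them only when the metrics are exposed).

```java
import java.util.Arrays;

/** Minimal cumulative histogram, for illustration only (not the client library). */
final class CumulativeHistogramSketch {
    private final double[] upperBounds;    // finite "le" bounds, ascending
    private final long[] cumulativeCounts; // one per finite bound, plus le="+Inf" last
    private double sum;

    CumulativeHistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.cumulativeCounts = new long[upperBounds.length + 1];
    }

    /** Records one observation in every bucket whose upper bound it does not exceed. */
    void observe(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                cumulativeCounts[i]++;
            }
        }
        cumulativeCounts[upperBounds.length]++; // the +Inf bucket counts everything
        sum += value;
    }

    long[] buckets() { return cumulativeCounts.clone(); }

    /** Equivalent of <basename>_count: same value as the +Inf bucket. */
    long count() { return cumulativeCounts[upperBounds.length]; }

    /** Equivalent of <basename>_sum. */
    double sum() { return sum; }

    public static void main(String[] args) {
        CumulativeHistogramSketch h = new CumulativeHistogramSketch(0.2, 0.4, 0.6);
        for (double v : new double[] {0.1, 0.3, 0.5, 0.9}) {
            h.observe(v);
        }
        // Buckets le=0.2, le=0.4, le=0.6, le=+Inf
        System.out.println(Arrays.toString(h.buckets())); // [1, 2, 3, 4]
    }
}
```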

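Why is it designed as a cumulative histogram?

Imagine that extra labels are added to a histogram metric, or that it is split into more buckets: analysing the sample data becomes more and more expensive. If the histogram is cumulative, some buckets can be dropped at scrape time as needed. This lowers the cost of running Prometheus while still allowing a rough estimate of the quantiles of the observed values. With this approach, users can reduce the number of scraped samples dynamically without changing application code.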
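How a quantile can be estimated from cumulative buckets is sketched below. The approach mirrors the idea behind the PromQL function histogram_quantile: find the first bucket whose cumulative count reaches the target rank, then interpolate linearly inside that bucket. The class name and array-based signature are assumptions made for this example, and the real function handles further edge cases that are left out here.

```java
/** Illustrative quantile estimate from cumulative buckets; not Prometheus code. */
final class HistogramQuantileSketch {

    /**
     * @param q           quantile in [0, 1]
     * @param upperBounds finite bucket bounds, ascending (at least one, without +Inf)
     * @param cumulative  cumulative counts: one per finite bound plus a final +Inf entry
     */
    static double estimate(double q, double[] upperBounds, long[] cumulative) {
        long total = cumulative[cumulative.length - 1];
        if (total == 0) {
            return Double.NaN; // no observations, nothing to estimate
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulative[i] >= rank) {
                // The lowest bucket is assumed to start at 0.
                double lower = i == 0 ? 0.0 : upperBounds[i - 1];
                long below = i == 0 ? 0 : cumulative[i - 1];
                long inBucket = cumulative[i] - below;
                if (inBucket == 0) {
                    return lower;
                }
                // Assume observations are spread evenly inside the bucket.
                return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
            }
        }
        // The rank falls into the +Inf bucket: report the highest finite bound.
        return upperBounds[upperBounds.length - 1];
    }
}
```

With bounds 0.2, 0.4, 0.6 and cumulative counts 1, 2, 3, 4, the estimated median is about 0.4. The result depends only on the buckets that are present, so dropping a bucket makes the estimate coarser but does not invalidate it.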

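Java usage

import io.prometheus.client.Histogram;

class YourClass {
  // Registered once with the default registry; the default buckets are used.
  static final Histogram requestLatency = Histogram.build()
     .name("requests_latency_seconds").help("Request latency in seconds.").register();

  // Request stands for the application's own request type.
  void processRequest(Request req) {
    // Start timing; the elapsed time is recorded by observeDuration().
    Histogram.Timer requestTimer = requestLatency.startTimer();
    try {
      // Your code here.
    } finally {
      // Record the duration even if the handler throws.
      requestTimer.observeDuration();
    }
  }
}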

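Summary
A summary is similar to a histogram in that it represents sampled observations over a period of time, but it stores quantiles directly instead of deriving them from buckets.

Compared with a Histogram, a Summary differs as follows:

Both include <basename>_sum and <basename>_count.
A Histogram needs <basename>_bucket to compute quantiles, whereas a Summary stores the quantile values directly.
For a base metric name <basename>, a summary exposes the following series: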

<basename>{quantile="<φ>"}
# Meaning: the median response time of these 12 HTTP requests is 3.052404983s
io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.5",} 3.052404983
# Meaning: the 90th percentile (0.9 quantile) response time of these 12 HTTP requests is 8.003261666s
io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.9",} 8.003261666
<basename>_sum
# Meaning: the total response time of these 12 HTTP requests is 51.029495508s
io_namespace_http_requests_latency_seconds_summary_sum{path="/",method="GET",code="200",} 51.029495508
<basename>_count
# Meaning: 12 HTTP requests have occurred so far
io_namespace_http_requests_latency_seconds_summary_count{path="/",method="GET",code="200",} 12.0
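Neither metric type exposes the mean directly, but it follows from the _sum and _count series. A minimal worked example in Java, using the sample values above (the class name is hypothetical):

```java
/** Worked arithmetic: mean observation from the _sum and _count series. */
final class MeanFromSumCountSketch {

    /** Returns sum / count, or NaN when nothing has been observed yet. */
    static double mean(double sum, double count) {
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        // 51.029495508 s spread over 12 requests is roughly 4.25 s per request.
        System.out.println(mean(51.029495508, 12.0));
    }
}
```

The mean of about 4.25s is well above the 3.05s median, which suggests that a few slow requests skew the distribution. In PromQL the same idea is usually applied to rates of the two series, so that the mean covers a recent window rather than the whole process lifetime.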

Java usage

import io.prometheus.client.Summary;

class YourClass {
  // Tracks request sizes; without quantiles only _sum and _count are exposed.
  static final Summary receivedBytes = Summary.build()
     .name("requests_size_bytes").help("Request size in bytes.").register();
  static final Summary requestLatency = Summary.build()
     .quantile(0.5, 0.05)   // median, with 5% tolerated error
     .quantile(0.9, 0.01)   // 90th percentile, with 1% tolerated error
     .name("requests_latency_seconds").help("Request latency in seconds.").register();

  void processRequest(Request req) {
    Summary.Timer requestTimer = requestLatency.startTimer();
    try {
      // Your code here.
    } finally {
      receivedBytes.observe(req.size());
      requestTimer.observeDuration();
    }
  }
}

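Differences between Histogram and Summary
Summary quantiles are computed directly on the client, which involves frequent global lock operations and can affect the performance of highly concurrent programs. A histogram only needs to increment an atomic counter per bucket on the client. A Summary therefore uses more client CPU and memory.
On the server side, the quantile values produced by a Summary cannot be aggregated (for example with sum or avg), while histograms support all kinds of operations. As a result, a histogram costs the server more than a Summary does.
A histogram stores per-bucket sample counts, so it can only estimate quantiles, whereas a Summary reports them precisely (within its configured error tolerance).
Two rules of thumb
If you need to aggregate, choose histograms.
If you have a good idea of the range and distribution of the values you will observe, choose histograms. If you need precise quantiles, choose a summary.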
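The aggregation point can be checked with a small Java sketch. It uses made-up latencies from two hypothetical instances: averaging the per-instance medians gives a misleading figure, while summing cumulative bucket counts gives exactly the histogram of the combined data. All names and numbers here are assumptions chosen for illustration.

```java
import java.util.Arrays;

/** Illustrative comparison of aggregating quantiles versus aggregating buckets. */
final class AggregationSketch {

    /** Exact median of a non-empty array. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Cumulative bucket counts: one per finite bound, plus a final +Inf entry. */
    static long[] cumulativeBuckets(double[] values, double[] upperBounds) {
        long[] counts = new long[upperBounds.length + 1];
        for (double v : values) {
            for (int i = 0; i < upperBounds.length; i++) {
                if (v <= upperBounds[i]) {
                    counts[i]++;
                }
            }
            counts[upperBounds.length]++;
        }
        return counts;
    }

    /** Element-wise sum, which is what summing bucket series across instances does. */
    static long[] sumBuckets(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    static double[] concat(double[] a, double[] b) {
        double[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        double[] instanceA = {1, 1, 1, 1, 1, 1, 1, 1, 1}; // nine fast requests
        double[] instanceB = {10};                        // one slow request
        double[] bounds = {2, 20};

        double avgOfMedians = (median(instanceA) + median(instanceB)) / 2.0;
        System.out.println(avgOfMedians);                         // 5.5
        System.out.println(median(concat(instanceA, instanceB))); // 1.0

        long[] merged = sumBuckets(cumulativeBuckets(instanceA, bounds),
                                   cumulativeBuckets(instanceB, bounds));
        System.out.println(Arrays.toString(merged));              // [9, 10, 10]
    }
}
```

The averaged medians (5.5) are far from the true median of all ten requests (1.0), whereas the summed buckets are identical to the buckets that a single histogram over all ten requests would have produced.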

Posted 2021-02-24 09:51 by rudolf_lin
