[metricsQL]通过VictoriaMetrics上报的 histogram 格式中的 `vmrange` 来不精确地统计超时时间

作者:张富春(ahfuzhang)，转载时请注明作者和引用链接，谢谢！

cnblogs博客
zhihu
Github
公众号:一本正经的瞎扯

1. VictoriaMetrics 上报 histogram 数据的套路

VictoriaMetrics 的轻量级高性能上报组件 https://github.com/VictoriaMetrics/metrics 中，使用了特别的方法来上报 histogram 数据。
例如这个分桶的时延数据：

vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="5.995e-04...6.813e-04"} 4
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="6.813e-04...7.743e-04"} 1415
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="7.743e-04...8.799e-04"} 6045
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="8.799e-04...1.000e-03"} 18729
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.000e-03...1.136e-03"} 11002
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.136e-03...1.292e-03"} 2296
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.292e-03...1.468e-03"} 475
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.468e-03...1.668e-03"} 143
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.668e-03...1.896e-03"} 74
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.896e-03...2.154e-03"} 31
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="2.154e-03...2.448e-03"} 22
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="2.448e-03...2.783e-03"} 20
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="2.783e-03...3.162e-03"} 3
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="3.162e-03...3.594e-03"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="3.594e-03...4.084e-03"} 2
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="4.084e-03...4.642e-03"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="4.642e-03...5.275e-03"} 4
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="5.275e-03...5.995e-03"} 3
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="6.813e-03...7.743e-03"} 5
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="7.743e-03...8.799e-03"} 5
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="8.799e-03...1.000e-02"} 8
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.000e-02...1.136e-02"} 6
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.136e-02...1.292e-02"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.292e-02...1.468e-02"} 2
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.468e-02...1.668e-02"} 3
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.896e-02...2.154e-02"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="2.448e-02...2.783e-02"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="4.642e-02...5.275e-02"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="5.275e-02...5.995e-02"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="5.995e-02...6.813e-02"} 2
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="7.743e-02...8.799e-02"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="1.136e-01...1.292e-01"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="2.448e-01...2.783e-01"} 1
vm_http_request_duration_seconds_bucket{path="/metrics",vmrange="4.642e-01...5.275e-01"} 1
vm_http_request_duration_seconds_sum{path="/metrics"} 40.88543205599984
vm_http_request_duration_seconds_count{path="/metrics"} 40310

VM 使用了 vmrange 这个字段，而不是 prometheus 规范中的 le 字段。

2. 需求

需求就是，假定接口的超时时间是 500 毫秒，能不能根据上面的 metrics 数据，统计出超时的请求数有多少？

3. 思考过程

vmrange 中是一个范围，设定了桶的起始值。
完美的情况是：500ms 刚好是某条数据的开始值，或者结束值。
但是，也可能出现 vmrange="0.4...0.6"，这样就无法精确得知 500 毫秒以上的请求数。
不过，不精确的统计也是够用的。
所以，正如标题所写，我试着实现不精确的统计。

$ 4. 查询表达式

sum(
(
	(label_value(label_replace(vm_http_request_duration_seconds_bucket {container="vlsingle",customer_brand="xx",customer_cluster="xxxx",customer_env="stress-test",endpoint="http",instance="172.20.10.252:9428",job="vlsingle-singlenode-k8s",namespace="logging",path="/metrics",pod="vlsingle-singlenode-k8s-86d48b689-67vrx",prometheus="metrics/vmcluster-realtime",service="vlsingle-singlenode-k8s"},"le","$1","vmrange", "([^\\n]+)\\.\\.\\.([^\\n]+)"),"le")>0.5)
	/
	(label_value(label_replace(vm_http_request_duration_seconds_bucket {container="vlsingle",customer_brand="xx",customer_cluster="xxxx",customer_env="stress-test",endpoint="http",instance="172.20.10.252:9428",job="vlsingle-singlenode-k8s",namespace="logging",path="/metrics",pod="vlsingle-singlenode-k8s-86d48b689-67vrx",prometheus="metrics/vmcluster-realtime",service="vlsingle-singlenode-k8s"},"le","$1","vmrange", "([^\\n]+)\\.\\.\\.([^\\n]+)"),"le")>0.5)
) and 
    label_replace(vm_http_request_duration_seconds_bucket {container="vlsingle",customer_brand="xx",customer_cluster="xxxx",customer_env="stress-test",endpoint="http",instance="172.20.10.252:9428",job="vlsingle-singlenode-k8s",namespace="logging",path="/metrics",pod="vlsingle-singlenode-k8s-86d48b689-67vrx",prometheus="metrics/vmcluster-realtime",service="vlsingle-singlenode-k8s"},"le","$1","vmrange", "([^\\n]+)\\.\\.\\.([^\\n]+)")
)

表达式的计算流程如下：

label_replace(vm_http_request_duration_seconds_bucket{vmrange="xxxx"}, ,"le","$1","vmrange", "([^{\n]+)\.\.\.([}\n]+)")
- 通过正则表达式，取出 vmrange 中的第一段，变成 prometheus 规范中的 le 字段
label_value(vm_http_request_duration_seconds_bucket{le="xx"}, "le")>0.5
- 过滤出时延范围大于 0.5 的 time series
- 这一步过滤后，并非返回 true/false 或者 1/0，而是返回了 label_value 的值
- 我认为这是不合理的，提了个 issue: label_value(timeseries,"label_name")>0.5 should be return 1 or 0, should not return label value
因为 label_value(xx,label_name)>0.5 这里返回了值，我不得不把这个表达式自己除以自己，以便得到 1
上面的表达式，再与全部的符合条件的 time series 做 and 运算，完成了过滤
过滤后再做 sum
- 最终实现了统计 延迟超过 500 毫秒的请求数

label_value 这部分表达式重复计算了，白白浪费了资源：

不管怎么样，记得上得到了期待的结果。
😃 have fun.

posted on 2025-09-19 18:05 ahfuzhang 阅读(18) 评论(0) 收藏举报

刷新页面返回顶部

ahfuzhang

[metricsQL]通过VictoriaMetrics上报的 histogram 格式中的 `vmrange` 来不精确地统计超时时间

1. VictoriaMetrics 上报 histogram 数据的套路

2. 需求

3. 思考过程

导航

公告