Loki 写入限流 ingestion rate limit exceeded：定位疯狂刷日志的服务

Loki 报 ingestion rate limit exceeded、日志开始丢，要止血很简单——把限流调大就行。但调大之前得先回答一个更难的问题：到底是谁在疯狂刷日志？ Loki 不像 Prometheus 有现成的 per-target 抓取量统计，你面对的只是一句 org_id=fake 的报错和一块越涨越快的磁盘。

这篇记录一条可复现的定位路径，核心结论先放这里：

定位"谁在刷 Loki"，不要依赖 /index/volume 端点——它在 boltdb-shipper 下按 label 聚合会给出误导性的空值；直接用 /loki/api/v1/query_range 拉原始日志，按流（stream）的 label 和行数看，一眼就能锁定大户。

注：文中主机名、服务名、命名空间、路径均为脱敏占位，替换成你自己的即可。

环境

Kubernetes 集群，日志方案 = Loki（单体 / monolithic）+ Promtail（DaemonSet）
Loki：grafana/loki 2.9.x，boltdb-shipper + filesystem 后端，auth_enabled: false（单租户，租户名固定为 fake）
Promtail：标准 kubernetes-pods scrape + relabel，push 到 http://loki.<ns>.svc.cluster.local:3100/loki/api/v1/push
排查全程只用 kubectl + Loki 的 HTTP API，只读

一、现象：distributor 在大量拒收

起点其实是个无关的目录排查——一个 redis pod 的挂载盘里冒出个陌生的 logger/ 目录，结构是 chunks / index / cache / compactor / wal，这是 Loki 的文件系统存储布局。顺藤摸瓜找到集群里唯一的 Loki（loki-0），翻它的日志时，真正的问题暴露了：

kubectl -n <ns> logs loki-0 | grep -iE 'level=error|rate limit' | tail

level=error component=distributor path=write msg="write operation failed"
  details="ingestion rate limit exceeded for user fake (limit: 67108864 bytes/sec)
  while attempting to ingest '1587' lines totaling '1048156' bytes,
  reduce log volume or contact your Loki administrator ..." org_id=fake

读三个数就够了：

limit: 67108864 bytes/sec = 64 MiB/s，这是 fake 租户的 ingestion rate limit。
每条报错丢一批 ~1500 行 / ~1MB。
时间戳连续刷——此刻仍在丢。

这里要强调代价，否则后面的折腾显得没必要：被 distributor 拒收的日志不会重试、永久丢失。也就是说，正当你需要靠日志排查别的故障时，Grafana 里对应时段是有缺口的。限流不是"限速慢一点"，是"直接扔掉"。

顺带确认一句 compactor / retention 是正常的（marker.go ... "no marks file found" 每分钟一条、无报错），排除"删不掉导致堆积"这条岔路。问题收敛到一点：有服务在以远超 64MB/s 的速率灌日志，得找出它。

二、弯路：`/index/volume` 端点把我带偏了

要找大户，第一反应是按 label 维度统计写入量。Loki 有个 /loki/api/v1/index/volume 端点，能按 label 聚合字节数。先 port-forward 把 Loki 的 3100 端口转出来：

kubectl -n <ns> port-forward pod/loki-0 13100:3100

然后按 namespace / pod / service_name 各聚合一次：

END=$(date +%s)000000000
START=$(date -d '-30 min' +%s)000000000
for tl in namespace pod service_name node_name; do
  echo "== by $tl =="
  curl -sG http://127.0.0.1:13100/loki/api/v1/index/volume \
    --data-urlencode "query={$tl=~\".+\"}" \
    --data-urlencode "start=$START" --data-urlencode "end=$END" \
    --data-urlencode 'aggregateBy=labels' \
    --data-urlencode "targetLabels=$tl" --data-urlencode 'limit=20'
  echo
done

结果让人懵——每个维度都只返回一条，且 label 值是空的：

by namespace   → 2718 MB  {namespace: ""}
by pod         → 3193 MB  {pod: ""}
by service_name→ 3170 MB  {service_name: ""}
by node_name   → 3322 MB  {node_name: ""}

30 分钟 ~3.3GB 全归到了"空 label"。我当时的（错误）推断是：这些日志根本没带 label，relabel 失效了，所以才无法按服务区分。还顺着编了套"可能是 vcluster 环境拿不到 pod metadata 导致 relabel 把标签都置空"的故事。

这个判断是错的。 教训写在这：/index/volume 端点在 boltdb-shipper 存储下，配 aggregateBy=labels + targetLabels 时会把结果错误地归并到空值桶，这是查询端的统计假象，不代表数据真没 label。被一个聚合端点的输出带着跑，是这次最大的弯路。

三、实锤：`query_range` 直接拉原文

与其信聚合，不如直接看数据。换 /loki/api/v1/query_range 拉最近 15 分钟的原始日志，direction=backward 取最新，selector 用一个一定有值的 label——stream（promtail 的 cri pipeline 一定会打 stream=stdout/stderr）：

END=$(date +%s)000000000
START=$(date -d '-15 min' +%s)000000000
curl -sG http://127.0.0.1:13100/loki/api/v1/query_range \
  --data-urlencode 'query={stream=~".+"}' \
  --data-urlencode "start=$START" --data-urlencode "end=$END" \
  --data-urlencode 'limit=120' --data-urlencode 'direction=backward' \
  | python3 -c "
import sys, json
d = json.load(sys.stdin)
rs = d['data']['result']
print('streams:', len(rs))
for s in rs:
    print('=== LBL:', s['stream'], 'lines=', len(s['values']))
    for v in s['values'][:3]:
        print('   |', v[1][:160])
"

这里有个容易踩的坑：query_range 返回的每个 result，label 集合在 s['stream'] 字段里，不是 s['metric']（metric 是 metric query 才有的）。一开始我写成 s['metric'] 直接 KeyError，白跑一趟。

修正后，结果一锤定音——最新的日志几乎被同一个服务占满，而且 label 完全正常：

streams: 2
=== LBL: {'namespace': 'prod', 'app': 'app-inference-free',
          'pod': 'app-inference-free-86886fcf8d-dpzcm',
          'node_name': 'node-xxxxx', 'stream': 'stderr', ...}  lines=115
   | 14:57:26.978 DEBUG infer.binding:bind_input_data:867 - Tensor 'xxx_cond' not found in engine bindings. Skipping.
   | 14:57:26.977 DEBUG infer.binding:bind_input_data:867 - Tensor 'xxx_cond' not found in engine bindings. Skipping.
   | 14:57:26.953 DEBUG infer.binding:bind_input_data:867 - Tensor 'xxx_cond' not found in engine bindings. Skipping.

两件事同时被证伪 / 证实：

label 是正常的（namespace=prod、pod、node_name 全有值），第二步"relabel 失效"的判断彻底推翻——promtail 工作得好好的，是 volume 端点骗了我。
元凶清晰：prod 命名空间的某推理服务（app-inference-*，且是多副本部署），把一条 DEBUG 日志毫秒级连刷。

四、根因：一条无害 DEBUG × 高频 × 多副本

那条日志本身：

DEBUG infer.binding:bind_input_data:867 - Tensor 'xxx_cond' not found in engine bindings. Skipping.

含义是推理引擎每次绑定输入张量时，发现某个可选张量不在引擎 bindings 里、跳过——这是完全正常、无害的行为。问题纯粹在于：

它被放在 DEBUG 级别，却在生产环境开着 DEBUG 打印；
它在每次推理的热路径上，单副本每秒就能打几十上百条；
这个服务是多副本（十几个 pod）。

三者一乘，单一条无意义日志的总写入轻松冲破 64MB/s。这解释了全部现象：limit 被打满、日志被拒收、磁盘被噪声快速填满。

五、横向核查：换个集群，同一个坑

定位到根因后有个容易被跳过、却很关键的动作：确认这个 DEBUG 到底是怎么开的。 因为它决定了影响面是一个集群还是一片。

查这个服务的 deployment，env 里 grep log / level / debug，一个都没有；启动命令也只是 python3 -m run.xxx、无参数：

kubectl -n prod get deploy <app-inference> \
  -o jsonpath='{range .spec.template.spec.containers[*].env[*]}{.name}{"\n"}{end}' \
  | grep -iE 'log|level|debug'
# （无输出）

日志格式 时间 | DEBUG | 模块:函数:行号 - msg 是 loguru 的默认样式，而 loguru 不显式设级别时默认就落在 DEBUG。两件事就此明确：

运维侧没有 env 开关可以关它——必须业务在代码/配置里设 logger 级别。
既然级别是镜像/代码里写死的默认，那所有跑同一镜像的集群，行为必然一致。

顺着第二点，切到另外两个集群的 kubeconfig 拉同一服务的日志验证：

for ctx in cluster-b cluster-c; do
  echo "== $ctx =="
  kubectl --context $ctx -n prod logs deploy/<app-inference> --tail=200 \
    | awk -F'|' '{gsub(/ /,"",$2); print $2}' | sort | uniq -c
done

结果三个集群全部在打 DEBUG。其中一个集群该服务有上百个副本，日志量级最大，它的 DEBUG 里甚至有单条打印多行张量 shape 的 dump，比那条 "Tensor not found" 更费。

至此问题升级：这不是"某个集群的 Loki 被打满"，而是"同一镜像在多个集群普遍开着 DEBUG"。最先告警的集群，只是因为它的日志后端容量最小（单租户限流 64MB/s）先扛不住而已——它是症状最早暴露的地方，不是问题唯一所在的地方。

六、整改：从源头关，而不是把限流调大

按优先级（注意是所有同镜像集群统一改，不是只改最先告警的那个）：

【首选】业务调日志级别：把该服务由 DEBUG 调到 INFO。噪声从源头消失，限流和磁盘问题同时解决。这类"热路径上的可选项缺失"日志，本就不该在生产以 DEBUG 高频输出。

【次选】Promtail 侧丢弃（治标，业务还没改时止血）：

scrape_configs:
  - job_name: kubernetes-pods
    pipeline_stages:
      - match:
          selector: '{namespace="prod", app=~"app-inference.*"}'
          stages:
            - drop:
                expression: ".*not found in engine bindings. Skipping.*"

【兜底】抬高限流：limits_config.ingestion_rate_mb / per_stream_rate_limit 调大。但这只是治标，且这种噪声留着白占磁盘，不推荐单独用。

limits_config:
  ingestion_rate_mb: 64
  ingestion_burst_size_mb: 128
  per_stream_rate_limit: 32MB
  per_stream_rate_limit_burst: 64MB

真正的修复是第 1 条。限流是症状，DEBUG 刷屏才是病。

快速参考

遇到 ingestion rate limit exceeded 的定位顺序：

确认报错和限速值：

kubectl -n <ns> logs <loki-pod> | grep -i 'rate limit'
# limit: 67108864 bytes/sec == 64 MiB/s

port-forward 出 Loki HTTP 端口（注意：若 kubectl 是 shell alias，后台 nohup 起 port-forward 时 alias 不展开，要用完整命令）：
```
kubectl -n <ns> port-forward pod/<loki-pod> 13100:3100
```

别用 /index/volume 按 label 聚合（boltdb-shipper 下会误归并到空值）。直接 query_range 拉原文：

curl -sG http://127.0.0.1:13100/loki/api/v1/query_range \
  --data-urlencode 'query={stream=~".+"}' \
  --data-urlencode "start=$(date -d '-15 min' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode 'limit=120' --data-urlencode 'direction=backward'

解析返回时，label 在 result[].stream，不是 result[].metric。

几条铁律：

限流 = 直接丢弃日志，不是减速；定位要趁它还在刷。
定位日志大户，原始数据（query_range）优先于聚合端点（volume）。
找到大户后优先从源头降日志级别，promtail drop 和抬限流都是治标。
生产环境别在推理 / 请求热路径上开 DEBUG。
根因若在镜像/代码默认（而非 env 开关），横向把所有用同镜像的集群都查一遍——最先告警的往往只是日志后端容量最小的那个。

posted @ 2026-06-03 15:11 Hello_worlds 阅读(7) 评论(0) 收藏举报

刷新页面返回顶部

大军军军之技术落地

责人之心责己、恕己之心恕人

Loki 写入限流 ingestion rate limit exceeded：定位疯狂刷日志的服务

环境

一、现象：distributor 在大量拒收

二、弯路：`/index/volume` 端点把我带偏了

三、实锤：`query_range` 直接拉原文

四、根因：一条无害 DEBUG × 高频 × 多副本

五、横向核查：换个集群，同一个坑

六、整改：从源头关，而不是把限流调大

快速参考

公告

大军军军之技术落地

责人之心责己、恕己之心恕人

Loki 写入限流 ingestion rate limit exceeded：定位疯狂刷日志的服务

环境

一、现象：distributor 在大量拒收

二、弯路：/index/volume 端点把我带偏了

三、实锤：query_range 直接拉原文

四、根因：一条无害 DEBUG × 高频 × 多副本

五、横向核查：换个集群，同一个坑

六、整改：从源头关，而不是把限流调大

快速参考

公告

二、弯路：`/index/volume` 端点把我带偏了

三、实锤：`query_range` 直接拉原文