TiKV Troubleshooting Guide 1

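This is a standard troubleshooting flow table for TiKV on TiDB v8.1.0, organized strictly along a fixed investigation chain and mapped to the existing online-* dashboards.
It can be used directly as an SOP (production troubleshooting runbook).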


🧭 TiKV v8.1.0 Standard Troubleshooting Flow (Table-Form SOP)

| Order | Layer (fixed chain) | Dashboard | Core metrics (present on the panels) | Abnormal signal | Root-cause direction | Next step |
|---|---|---|---|---|---|---|
| 1 | Entry (Read / Write) | online-Performance-Overview | Read latency / Write latency / QPS | latency ↑ or QPS fluctuating | Is the system abnormal at all? | KV / Transaction |
| 2 | KV / Transaction (request type) | online-Performance-Read / Write | scan / get / prewrite / commit / lock wait | scan ↑ / commit ↑ / lock wait ↑ | Request-type problem | Coprocessor / RocksDB / Raft |
| 3 | Coprocessor (is it scanning?) | online-Performance-Read | scan keys / coprocessor latency | scan keys ↑↑ | SQL scan pressure | RocksDB |
| 4 | RocksDB (is storage stalled?) | online-TiKV-Details / online-TiKV-Raw | write stall / compaction / cache miss / read latency | write stall > 0 / compaction ↑ | LSM-tree bottleneck | Disk |
| 5 | Disk (is IO the bottleneck?) | online-Disk-Performance | disk read/write latency / IO util / throughput | latency > 10 ms / IO util 100% | Disk bottleneck | Raft |
| 6 | Raft (is replication slow?) | online-TiKV-Details / online-TiKV-Raw | propose / apply / snapshot | propose ↑ / apply ↑ / snapshot ↑ | Raft replication problem | Region |
| 7 | Region (is there a hotspot?) | online-PD | hot region / leader distribution / region load | leaders concentrated / obvious hotspot | Hot key / skew | CPU |
| 8 | CPU (final symptom) | online-TiKV-Details | CPU usage / raftstore / coprocessor thread | CPU > 80% | Final resource bottleneck | Trace back to the upper layers |

🧠 Standard Troubleshooting Path (follow strictly in order)

online-Performance-Overview
   ↓
online-Performance-Read / Write
   ↓
KV / Transaction analysis
   ↓
Coprocessor (scan)
   ↓
RocksDB (storage)
   ↓
Disk (IO)
   ↓
Raft (replication)
   ↓
PD Region (hotspot)
   ↓
CPU (final symptom)
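The fixed order above can be written down as data so that a checklist tool or script walks it the same way every time. The sketch below is only an illustration: the `CHAIN` list mirrors the layers and dashboards named in this SOP, while `walk_chain`, `first_abnormal`, and the `abnormal_layers` input are hypothetical names introduced here.

```python
# Fixed investigation order from the SOP: (layer, dashboards to open).
CHAIN = [
    ("Entry", ["online-Performance-Overview"]),
    ("KV / Transaction", ["online-Performance-Read", "online-Performance-Write"]),
    ("Coprocessor", ["online-Performance-Read"]),
    ("RocksDB", ["online-TiKV-Details", "online-TiKV-Raw"]),
    ("Disk", ["online-Disk-Performance"]),
    ("Raft", ["online-TiKV-Details", "online-TiKV-Raw"]),
    ("Region", ["online-PD"]),
    ("CPU", ["online-TiKV-Details"]),
]


def walk_chain(abnormal_layers):
    """Return the flagged layers in SOP order.

    `abnormal_layers` is a hypothetical set filled in by the on-call
    engineer after checking each dashboard.
    """
    return [layer for layer, _ in CHAIN if layer in abnormal_layers]


def first_abnormal(abnormal_layers):
    """Return the earliest flagged layer in the chain, or None."""
    hits = walk_chain(abnormal_layers)
    return hits[0] if hits else None
```

For example, `walk_chain({"CPU", "RocksDB"})` lists RocksDB before CPU, which matches the rule that CPU is read as the final symptom rather than the starting point.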

Entry (Read / Write)

KV / Transaction (request type)

Coprocessor (is it scanning?)

RocksDB (is storage stalled?)

Disk (is IO the bottleneck?)

Raft (is replication slow?)

Region (is there a hotspot?)

CPU (final symptom)


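🔥 Four Standard Root-Cause Categories (final conclusion layer)

| Root-cause type | Where it shows up | Typical symptom |
|---|---|---|
| 🔥 Hotspot | PD / Region | leaders concentrated / QPS concentrated |
| 💾 Storage bottleneck | RocksDB | write stall / compaction |
| 🧱 IO bottleneck | Disk | high latency / IO saturated |
| 🔁 Raft bottleneck | TiKV Details | abnormal apply / propose / snapshot |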

 


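🧭 2. Entry Layer (is the system abnormal globally?)

| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
|---|---|---|---|---|---|---|
| Entry | Business traffic surge | client / API | online-Performance-Overview | QPS surge + latency rising | Compare against the historical QPS baseline | Traffic-driven |
| Entry | Overall latency increase | gRPC / TiKV endpoint | online-Performance-Overview | read/write latency ↑ | p99 and p999 rise together | Systemic bottleneck |
| Entry | Request jitter | scheduler / endpoint | online-Performance-Overview | large latency fluctuation | Check whether the jitter is periodic | Resource contention |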
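The two confirmation steps in this table (comparing QPS with a historical baseline, and checking whether p99 and p999 rise together) can be sketched as below. This is a minimal illustration, not a tuned detector: the function names, the `factor=1.5` surge multiplier, and the `ratio=1.2` rise threshold are assumptions chosen for the example, and the inputs are values read off online-Performance-Overview by hand or exported from it.

```python
from statistics import mean


def qps_surge(current_qps, baseline_samples, factor=1.5):
    """True if current QPS exceeds the historical mean by `factor`.

    `factor` is an assumed threshold; pick one that fits the workload.
    """
    baseline = mean(baseline_samples)
    return current_qps > baseline * factor


def tails_rise_together(p99_before, p99_now, p999_before, p999_now, ratio=1.2):
    """True if both p99 and p999 grew by at least `ratio` (assumed value)."""
    return p99_now >= p99_before * ratio and p999_now >= p999_before * ratio
```

If `qps_surge` is true, the traffic-driven row applies; if QPS is flat but both tails rise together, the table points toward a systemic bottleneck further down the chain.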

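🧭 3. KV / Transaction Layer (breaking requests down by semantics)

| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
|---|---|---|---|---|---|---|
| KV | Scan amplification | coprocessor + storage | online-Performance-Read | scan keys ↑↑ | scan far exceeds get | SQL scan problem |
| KV | Get hotspot | storage read thread | online-Performance-Read | get latency ↑ | QPS concentrated on a single key | Hot key |
| KV | Prewrite blocked | txn scheduler | online-Performance-Write | prewrite latency ↑ | Stuck in the write phase | Transaction conflict |
| KV | Slow commit | raftstore + txn | online-Performance-Write | commit latency ↑ | Commits piling up | Raft / IO |
| KV | Lock wait | lock manager | online-Performance-Write | lock wait ↑ | Wait time takes a high share | Concurrency conflict |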
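One way to apply this table is to find which request type regressed the most and look up the direction it points to. The sketch below assumes latency snapshots per request type have been read from online-Performance-Read / online-Performance-Write; the `DIRECTION` mapping restates the table, while `worst_regression` and the dictionary keys are hypothetical names for illustration.

```python
# Root-cause direction per request type, restated from the table above.
DIRECTION = {
    "scan": "SQL scan problem",
    "get": "Hot key",
    "prewrite": "Transaction conflict",
    "commit": "Raft / IO",
    "lock_wait": "Concurrency conflict",
}


def worst_regression(before_ms, now_ms):
    """Return (request_type, direction) for the largest relative increase.

    Both arguments map request type -> latency in ms. Types with a zero
    or missing baseline are skipped to avoid division by zero.
    """
    ratios = {
        op: now_ms[op] / before_ms[op]
        for op in before_ms
        if op in now_ms and before_ms[op] > 0
    }
    if not ratios:
        return None
    op = max(ratios, key=ratios.get)
    return op, DIRECTION.get(op, "unknown")
```

Relative increase is used instead of absolute latency because scans are normally slower than point gets, so raw values are not comparable across request types.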

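🧭 4. Coprocessor Layer (SQL scan pressure)

| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
|---|---|---|---|---|---|---|
| Coprocessor | Large-range scan | coprocessor thread | online-Performance-Read | scan keys ↑↑ | read latency and scan rise together | Unreasonable SQL |
| Coprocessor | Index not used | coprocessor | online-Performance-Read | region scan ↑ | Confirm a full table scan with EXPLAIN | Missing index |
| Coprocessor | Cop task backlog | coprocessor queue | online-TiKV-Details | cop queue ↑ | Pending tasks growing | Insufficient CPU |
| Coprocessor | Region scan hotspot | coprocessor + PD | online-PD | leaders concentrated | Obvious hotspot | Data skew |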
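The "confirm a full table scan with EXPLAIN" step can be partly automated by scanning the operator ids in the first column of TiDB's EXPLAIN output. The sketch below assumes those ids have already been collected into a list of strings; `TableFullScan` and `IndexFullScan` are TiDB operator names, whereas the function name and the sample inputs are hypothetical.

```python
# Operator name prefixes that indicate a full scan in TiDB EXPLAIN output.
FULL_SCAN_PREFIXES = ("TableFullScan", "IndexFullScan")


def full_scan_operators(explain_ids):
    """Return the operator ids that are full scans.

    `explain_ids` is the `id` column of EXPLAIN, which may carry tree
    drawing characters such as "└─" that are stripped here.
    """
    cleaned = (op.strip(" └─├│") for op in explain_ids)
    return [op for op in cleaned if op.startswith(FULL_SCAN_PREFIXES)]
```

An empty result does not prove the query is cheap (a range scan can still cover most of a table), so this check is best read together with the scan keys panel.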

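🧭 5. RocksDB Layer (LSM storage engine)

| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
|---|---|---|---|---|---|---|
| RocksDB | Write stall | compaction thread | online-TiKV-Details | write stall > 0 | How long the stall lasts | LSM blocked |
| RocksDB | Compaction backlog | compaction worker | online-TiKV-Details | compaction score ↑ | score stays > 1 | Write amplification |
| RocksDB | Too many L0 files | compaction pipeline | online-TiKV-Raw | L0 file count ↑ | level0 slowdown | Write jitter |
| RocksDB | Cache miss | block cache | online-TiKV-Details | read miss ↑ | Cache hit rate drops | Cold data |
| RocksDB | Slow WAL fsync | WAL thread | online-TiKV-Raw | fsync latency ↑ | disk await rises at the same time | IO problem |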
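The first two rows distinguish "any write stall at all" from "compaction score staying above 1". A minimal sketch of that distinction follows, assuming the two series have been exported as plain lists of samples; the function names and the `min_consecutive=3` window are assumptions made for this example.

```python
def sustained_above(samples, threshold, min_consecutive):
    """True if `samples` stays above `threshold` for a run of
    at least `min_consecutive` consecutive points."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


def rocksdb_signals(stall_samples, score_samples, min_consecutive=3):
    """Return the RocksDB rows of the table that the samples match."""
    signals = []
    if any(v > 0 for v in stall_samples):
        signals.append("write stall")
    if sustained_above(score_samples, 1.0, min_consecutive):
        signals.append("compaction backlog")
    return signals
```

A single score sample above 1 is ignored on purpose, since the table asks for a sustained condition rather than a momentary spike.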

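🧭 6. Disk IO Layer (physical storage bottleneck)

| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
|---|---|---|---|---|---|---|
| Disk | High IO latency | kernel / disk | online-Disk-Performance | latency > 10 ms | iostat await ↑ | Slow disk |
| Disk | IO utilization at 100% | block device | online-Disk-Performance | util 100% | iowait ↑ | Saturation |
| Disk | Throughput maxed out | disk bandwidth | online-Disk-Performance | throughput at max | Device limit reached | Bandwidth bottleneck |
| Disk | Snapshot writes | raft snapshot | online-TiKV-Raw | write spike | snapshot count ↑ | Replica recovery |
| Disk | IO queue congestion | OS scheduler | online-Disk-Performance | avg queue ↑ | await keeps growing | Scheduling bottleneck |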
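When the dashboard is unavailable, the await and util figures in this table can be approximated on the host from two samples of `/proc/diskstats`, which is roughly what iostat does. The sketch below assumes the standard Linux field layout of that file; the function names and the sample device lines in the usage note are illustrative only.

```python
def parse_diskstats_line(line):
    """Parse one /proc/diskstats line into the counters needed below."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),      # reads completed
        "read_ms": int(f[6]),    # time spent reading (ms)
        "writes": int(f[7]),     # writes completed
        "write_ms": int(f[10]),  # time spent writing (ms)
        "io_ms": int(f[12]),     # time the device was busy (ms)
    }


def await_and_util(prev, curr, interval_ms):
    """Return (average await in ms, utilization in percent) between samples."""
    ios = (curr["reads"] - prev["reads"]) + (curr["writes"] - prev["writes"])
    waited = (curr["read_ms"] - prev["read_ms"]) + (curr["write_ms"] - prev["write_ms"])
    await_ms = waited / ios if ios else 0.0
    util_pct = 100.0 * (curr["io_ms"] - prev["io_ms"]) / interval_ms
    return await_ms, min(util_pct, 100.0)
```

Take two samples of the same device one second apart, pass `interval_ms=1000`, and compare the result with the 10 ms latency and 100% util thresholds from the table.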

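🧭 7. Raft Layer (distributed replication and consistency)

| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
|---|---|---|---|---|---|---|
| Raft | Propose backlog | raftstore thread | online-TiKV-Details | propose ↑ | write latency ↑ | Write pressure |
| Raft | Slow apply | apply thread | online-TiKV-Details | apply latency ↑ | commit latency ↑ | IO bottleneck |
| Raft | Frequent snapshots | raft snapshot | online-TiKV-Raw | snapshot ↑↑ | Replication is slow | Replica rebuild |
| Raft | Frequent leader changes | raft election | online-PD | leader churn ↑ | Region flapping | Instability |
| Raft | Slow Region sync | raft log | online-TiKV-Details | log lag ↑ | Followers falling behind | Network / IO |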
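The last two rows (leader churn and follower log lag) reduce to simple arithmetic once the raw values are available. The sketch below is only an illustration of that arithmetic: the inputs (a sequence of observed leader store ids, and per-follower matched log indexes) are hypothetical, as are the function names and the `max_lag=1000` threshold.

```python
def leader_changes(leader_history):
    """Count how many times the leader changed in an observed sequence."""
    return sum(1 for a, b in zip(leader_history, leader_history[1:]) if a != b)


def lagging_followers(leader_last_index, follower_match_index, max_lag=1000):
    """Return {peer: lag} for followers more than `max_lag` entries behind.

    `max_lag` is an assumed threshold, not a TiKV default.
    """
    lags = {
        peer: leader_last_index - idx
        for peer, idx in follower_match_index.items()
    }
    return {peer: lag for peer, lag in lags.items() if lag > max_lag}
```

A follower that stays in the `lagging_followers` result across several checks is the kind of case the table attributes to network or IO problems on that peer.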

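🧭 8. PD / Region Layer (hotspots and scheduling)

| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
|---|---|---|---|---|---|---|
| Region | Hotspot key | PD scheduler | online-PD | leaders concentrated | QPS concentrated | Data skew |
| Region | Region too large | region split | online-PD | region size ↑ | Low split frequency | Granularity problem |
| Region | Unbalanced load | scheduler balance | online-PD | uneven store QPS | Skewed leader distribution | Scheduling problem |
| Region | Splits too frequent | region split | online-PD | split rate ↑ | IO spike | Write amplification |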
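"Leaders concentrated" and "uneven store QPS" can be turned into numbers by comparing each store with the cluster average. The sketch below assumes per-store leader counts and QPS have been read from online-PD; the function names and the `factor=2.0` cutoff are assumptions for illustration.

```python
from statistics import mean


def leader_skew(leaders_per_store):
    """Ratio of the busiest store's leader count to the average.

    1.0 means perfectly even; larger values mean more concentration.
    """
    counts = list(leaders_per_store.values())
    avg = mean(counts)
    return max(counts) / avg if avg else 0.0


def hot_stores(qps_per_store, factor=2.0):
    """Stores whose QPS exceeds the average by `factor` (assumed cutoff)."""
    avg = mean(list(qps_per_store.values()))
    return sorted(s for s, q in qps_per_store.items() if q > avg * factor)
```

If `leader_skew` is close to 1.0 but `hot_stores` is non-empty, leaders are evenly spread and the imbalance is more likely a hot key or hot Region than a scheduling problem.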

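🧭 9. CPU Layer (final resource bottleneck)

| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
|---|---|---|---|---|---|---|
| CPU | Coprocessor computation | coprocessor thread | online-TiKV-Details | CPU ↑ + scan ↑ | Scan takes a large share of the flame graph | SQL pressure |
| CPU | Hotspot saturating a single core | raftstore single thread | online-TiKV-Details | one core at 100% | Locate the thread with top | Hot key |
| CPU | Compaction CPU | compaction thread | online-TiKV-Details | CPU stays high | compaction score ↑ | Storage pressure |
| CPU | Raft apply CPU | apply thread | online-TiKV-Details | CPU ↑ | apply latency ↑ | Write pressure |
| CPU | Transaction conflict handling | lock / scheduler | online-TiKV-Details | CPU + wait ↑ | lock wait ↑ | Concurrency conflict |
| CPU | GC pressure | GC thread | online-TiKV-Raw | CPU rises slowly | tombstone ↑ | Historical data |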
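The "locate the thread with top" step usually means looking at per-thread CPU (for example from `top -H`) and grouping threads by pool. The sketch below assumes the samples are already available as `(thread_name, cpu_percent)` pairs; the thread names in the usage note are illustrative and may not match the names in a given TiKV build, and the `limit=95.0` cutoff is an assumption.

```python
import re
from collections import defaultdict


def cpu_by_pool(thread_samples):
    """Sum CPU per thread pool by stripping trailing numeric suffixes.

    For example "raftstore-1-0" and "raftstore-1-1" both fall under
    the pool "raftstore".
    """
    pools = defaultdict(float)
    for name, cpu in thread_samples:
        pool = re.sub(r"(?:[-_]\d+)+$", "", name)
        pools[pool] += cpu
    return dict(pools)


def saturated_threads(thread_samples, limit=95.0):
    """Threads at or above `limit` percent of one core."""
    return [name for name, cpu in thread_samples if cpu >= limit]
```

A single saturated thread with a low pool total matches the "one core at 100%" row, while a high pool total with no saturated thread points to broad pressure on that pool.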

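🔥 10. Five Ultimate Root-Cause Groups (engineering convergence layer)

| Category | Sub-categories |
|---|---|
| 🔥 Hotspot | leaders concentrated / single-key hotspot / Region skew |
| 💾 Storage bottleneck | compaction backlog / write stall / WAL fsync |
| 🧱 IO bottleneck | disk await / snapshot writes / cache miss |
| 🔁 Raft bottleneck | propose/apply backlog / snapshot recovery |
| 🧠 CPU bottleneck | coprocessor / compaction / hotspot / transaction conflict |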
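The grouping above can be kept as a lookup table so that a confirmed symptom is always filed under the same category in incident notes. This is a sketch only: the category and sub-category strings restate the table, and `categories_for` is a hypothetical helper.

```python
# Category -> sub-categories, restated from the table above.
CATEGORIES = {
    "Hotspot": ["leaders concentrated", "single-key hotspot", "Region skew"],
    "Storage bottleneck": ["compaction backlog", "write stall", "WAL fsync"],
    "IO bottleneck": ["disk await", "snapshot writes", "cache miss"],
    "Raft bottleneck": ["propose/apply backlog", "snapshot recovery"],
    "CPU bottleneck": ["coprocessor", "compaction", "hotspot", "transaction conflict"],
}


def categories_for(symptom):
    """Return every category that lists `symptom` as a sub-category."""
    return [cat for cat, subs in CATEGORIES.items() if symptom in subs]
```

Returning a list rather than a single value leaves room for a symptom to be filed under more than one category if the table is extended.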

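🎯 11. Engineering-Level Core Summary (must be retained)

In essence, TiKV performance problems are a layer-by-layer amplification effect along this chain:
SQL access pattern → Coprocessor scan → RocksDB LSM structure → physical Disk IO limits → Raft replication → PD scheduling → CPU thread contention.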



posted on 2026-06-09 17:40 by 小镇-做题家
