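Below is a "standard troubleshooting flow table" built strictly on the chain you specified, TiDB v8.1.0, and your existing dashboards (online-*).
This version can be used directly as an SOP (production troubleshooting runbook).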
🧭 TiKV v8.1.0 Standard Troubleshooting Flow (Table-Form SOP)
| Step | Layer (fixed chain) | Dashboard | Core metric types (actually present on the panels) | Abnormal signal | Root-cause direction | Next step |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Entry (Read / Write) | online-Performance-Overview | Read latency / Write latency / QPS | latency ↑ or QPS fluctuating | Is the system abnormal at all | KV / Transaction |
| 2 | KV / Transaction (request type) | online-Performance-Read / Write | scan / get / prewrite / commit / lock wait | scan ↑ / commit ↑ / lock wait ↑ | Request-type problem | Coprocessor / RocksDB / Raft |
| 3 | Coprocessor (is it scanning?) | online-Performance-Read | scan keys / coprocessor latency | scan keys ↑↑ | SQL scan pressure | RocksDB |
| 4 | RocksDB (is storage stalled?) | online-TiKV-Details / online-TiKV-Raw | write stall / compaction / cache miss / read latency | write stall > 0 / compaction ↑ | LSM-tree bottleneck | Disk |
| 5 | Disk (is IO the bottleneck?) | online-Disk-Performance | disk read/write latency / IO util / throughput | latency > 10 ms / IO util 100% | Disk bottleneck | Raft |
| 6 | Raft (is replication slow?) | online-TiKV-Details / online-TiKV-Raw | propose / apply / snapshot | propose ↑ / apply ↑ / snapshot ↑ | Raft replication problem | Region |
| 7 | Region (is there a hotspot?) | online-PD | hot region / leader distribution / region load | leaders concentrated / clear hotspot | Hot key / skew | CPU |
| 8 | CPU (final manifestation) | online-TiKV-Details | CPU usage / raftstore / coprocessor thread | CPU > 80% | Final resource bottleneck | Trace back to the upper layers |
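The numeric thresholds in the table can be written down as data so that a set of readings copied from the panels is checked the same way every time. The following is a minimal sketch, not an official tool: the field names (`write_stall`, `disk_latency_ms`, `io_util_pct`, `cpu_pct`) are hypothetical labels for values read off the dashboards by hand, and only the signals that carry an explicit number in the table are encoded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical field names; values are copied by hand from the panels.
Reading = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    step: int                        # step number in the SOP table
    layer: str                       # layer the signal belongs to
    signal: str                      # signal text as written in the table
    check: Callable[[Reading], bool]


# Only signals with an explicit threshold in the table are encoded here.
RULES: List[Rule] = [
    Rule(4, "RocksDB", "write stall > 0", lambda r: r.get("write_stall", 0.0) > 0),
    Rule(5, "Disk", "latency > 10 ms", lambda r: r.get("disk_latency_ms", 0.0) > 10),
    Rule(5, "Disk", "IO util 100%", lambda r: r.get("io_util_pct", 0.0) >= 100),
    Rule(8, "CPU", "CPU > 80%", lambda r: r.get("cpu_pct", 0.0) > 80),
]


def fired_signals(reading: Reading) -> List[Tuple[int, str, str]]:
    """Return (step, layer, signal) for every rule the reading violates, in SOP order."""
    return [(rule.step, rule.layer, rule.signal) for rule in RULES if rule.check(reading)]


if __name__ == "__main__":
    sample = {"write_stall": 2, "disk_latency_ms": 4, "io_util_pct": 100, "cpu_pct": 50}
    for step, layer, signal in fired_signals(sample):
        print(f"step {step}: {layer}: {signal}")
```

The "↑" signals are left out on purpose: they only make sense relative to a baseline, which this sketch does not model.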
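🧠 Standard Troubleshooting Path (follow strictly in order)
By dashboard:
online-Performance-Overview
↓
online-Performance-Read / Write
↓
KV / Transaction analysis
↓
Coprocessor (scan)
↓
RocksDB (storage)
↓
Disk (IO)
↓
Raft (replication)
↓
PD Region (hotspot)
↓
CPU (final manifestation)
By layer:
Entry (Read / Write)
↓
KV / Transaction (request type)
↓
Coprocessor (is it scanning?)
↓
RocksDB (is storage stalled?)
↓
Disk (is IO the bottleneck?)
↓
Raft (is replication slow?)
↓
Region (is there a hotspot?)
↓
CPU (final manifestation)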
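Because the path is meant to be walked strictly in order, it can help to sort whatever layers look abnormal back into that order before digging in. A minimal sketch, assuming the layer names are typed in by the operator (the names in `SOP_ORDER` are just the labels used in this document):

```python
from typing import Iterable, List

# Layer labels in the fixed order used by this SOP.
SOP_ORDER: List[str] = [
    "Entry",
    "KV / Transaction",
    "Coprocessor",
    "RocksDB",
    "Disk",
    "Raft",
    "Region",
    "CPU",
]


def suspects_in_order(flagged: Iterable[str]) -> List[str]:
    """Return the flagged layers sorted into SOP order; reject unknown names."""
    flagged_set = set(flagged)
    unknown = flagged_set - set(SOP_ORDER)
    if unknown:
        raise ValueError(f"unknown layer(s): {sorted(unknown)}")
    return [layer for layer in SOP_ORDER if layer in flagged_set]


if __name__ == "__main__":
    # CPU and Disk both look bad: Disk comes earlier in the chain, so check it first.
    print(suspects_in_order(["CPU", "Disk"]))
```

CPU always sorts last, which matches its role here as the final manifestation that is traced back to an upper layer.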
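🔥 Four Standard Root-Cause Categories (final conclusion layer)
| Root-cause type | Where it shows up | Typical symptoms |
| --- | --- | --- |
| 🔥 Hotspot | PD / Region | leaders concentrated / QPS concentrated |
| 💾 Storage bottleneck | RocksDB | write stall / compaction |
| 🧱 IO bottleneck | Disk | high latency / IO saturated |
| 🔁 Raft bottleneck | TiKV Details | apply / propose / snapshot anomalies |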
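🧭 2. Entry Layer (is anything globally abnormal?)
| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
| --- | --- | --- | --- | --- | --- | --- |
| Entry | Business traffic surge | client / API | online-Performance-Overview | QPS surge + latency rising | Compare against the historical QPS baseline | Traffic-driven |
| Entry | Overall latency rise | gRPC / TiKV endpoint | online-Performance-Overview | read/write latency ↑ | p99 and p999 rise together | Systemic bottleneck |
| Entry | Request jitter | scheduler / endpoint | online-Performance-Overview | large latency swings | Check whether the jitter is periodic | Resource contention |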
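The two confirmations in this table (comparing against a historical baseline, and checking that p99 and p999 rise together) are easy to get wrong by eye. Here is a small sketch of both, assuming latency samples have been exported as plain numbers. The `factor=1.5` cut-off is an arbitrary illustration, not a value from this SOP, and the percentile uses the simple nearest-rank method.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def rose(current: float, baseline: float, factor: float = 1.5) -> bool:
    """True if current exceeds baseline by more than the (assumed) factor."""
    return baseline > 0 and current > baseline * factor


def tail_rose_together(now: Sequence[float], before: Sequence[float], factor: float = 1.5) -> bool:
    """True if both p99 and p99.9 rose versus the baseline window."""
    return all(
        rose(percentile(now, p), percentile(before, p), factor) for p in (99, 99.9)
    )


if __name__ == "__main__":
    before = [float(x) for x in range(1, 101)]   # baseline latencies in ms
    now = [x * 2 for x in before]                # everything twice as slow
    print(rose(150, 90), tail_rose_together(now, before))
```

If only p999 rises while p99 stays flat, the function returns False, which points away from a systemic bottleneck and toward a narrower cause.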
🧭 3. KV / Transaction Layer (breaking requests down by semantics)
| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
| --- | --- | --- | --- | --- | --- | --- |
| KV | Scan amplification | coprocessor + storage | online-Performance-Read | scan keys ↑↑ | scan far exceeds get | SQL scan problem |
| KV | Get hotspot | storage read thread | online-Performance-Read | get latency ↑ | QPS concentrated on a single key | Hot key |
| KV | Prewrite blocked | txn scheduler | online-Performance-Write | prewrite latency ↑ | stuck in the write phase | Transaction conflict |
| KV | Slow commit | raftstore + txn | online-Performance-Write | commit latency ↑ | commits piling up | Raft / IO |
| KV | Lock wait | lock manager | online-Performance-Write | lock wait ↑ | wait time makes up a large share | Concurrency conflict |
🧭 4. Coprocessor Layer (SQL scan pressure)
| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
| --- | --- | --- | --- | --- | --- | --- |
| Coprocessor | Large-range scan | coprocessor thread | online-Performance-Read | scan keys ↑↑ | read latency and scan rise together | Poorly designed SQL |
| Coprocessor | Index not hit | coprocessor | online-Performance-Read | region scan ↑ | EXPLAIN confirms a full table scan | Missing index |
| Coprocessor | Cop task backlog | coprocessor queue | online-TiKV-Details | cop queue ↑ | pending tasks growing | Insufficient CPU |
| Coprocessor | Region scan hotspot | coprocessor + PD | online-PD | leaders concentrated | clear hotspot | Data skew |
🧭 5. RocksDB Layer (LSM storage engine)
| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
| --- | --- | --- | --- | --- | --- | --- |
| RocksDB | write stall | compaction thread | online-TiKV-Details | write stall > 0 | stall duration | LSM blocked |
| RocksDB | compaction backlog | compaction worker | online-TiKV-Details | compaction score ↑ | score stays > 1 | Write amplification |
| RocksDB | Too many L0 files | compaction pipeline | online-TiKV-Raw | L0 file count ↑ | level0 slowdown | Write jitter |
| RocksDB | cache miss | block cache | online-TiKV-Details | read miss ↑ | cache hit ratio drops | Cold data |
| RocksDB | Slow WAL fsync | wal thread | online-TiKV-Raw | fsync latency ↑ | disk await rises in step | IO problem |
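Two confirmations in this table depend on duration rather than a single reading: the compaction score staying above 1, and how long a write stall lasts. A minimal sketch of both checks over a series of evenly spaced samples follows. How the samples are exported from the panels, and how many points count as "sustained", are assumptions left to the operator.

```python
from typing import Sequence


def sustained_above(series: Sequence[float], threshold: float, min_points: int) -> bool:
    """True if the most recent min_points samples are all strictly above threshold."""
    if min_points <= 0:
        raise ValueError("min_points must be positive")
    tail = series[-min_points:]
    # A series shorter than min_points cannot prove a sustained condition.
    return len(tail) == min_points and all(v > threshold for v in tail)


def longest_stall_run(stall_counts: Sequence[float]) -> int:
    """Length of the longest run of consecutive samples with write stall > 0."""
    longest = current = 0
    for value in stall_counts:
        current = current + 1 if value > 0 else 0
        longest = max(longest, current)
    return longest


if __name__ == "__main__":
    compaction_score = [0.8, 1.2, 1.3, 1.5]
    print(sustained_above(compaction_score, 1.0, 3))    # last three samples are all > 1
    print(longest_stall_run([0, 1, 2, 0, 3, 1, 1, 0]))  # longest stall spans three samples
```

Multiplying the run length by the sampling interval gives a rough stall duration.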
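🧭 6. Disk IO Layer (physical storage bottleneck)
| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
| --- | --- | --- | --- | --- | --- | --- |
| Disk | High IO latency | kernel / disk | online-Disk-Performance | latency > 10 ms | iostat await ↑ | Slow disk |
| Disk | IO utilization at 100% | block device | online-Disk-Performance | util 100% | iowait ↑ | Saturation |
| Disk | Throughput maxed out | disk bandwidth | online-Disk-Performance | throughput at max | device limit reached | Bandwidth bottleneck |
| Disk | Snapshot writes | raft snapshot | online-TiKV-Raw | write spike | snapshot count ↑ | Replica recovery |
| Disk | IO queue congestion | OS scheduler | online-Disk-Performance | avg queue ↑ | await keeps growing | Scheduling bottleneck |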
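When the dashboard and `iostat` disagree, it can be useful to recompute await and utilization from raw counters. On Linux, `/proc/diskstats` exposes cumulative per-device counters (I/Os completed, milliseconds spent on reads and writes, milliseconds the device was busy), and await-style figures are derived from the difference between two snapshots. The sketch below assumes those counters have already been read into a small record. The `DiskCounters` type and its field names are hypothetical, and parsing the file itself is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskCounters:
    """One snapshot of cumulative counters for a single device (hypothetical record)."""
    reads_completed: int
    writes_completed: int
    read_time_ms: int    # total ms spent on reads
    write_time_ms: int   # total ms spent on writes
    io_time_ms: int      # total ms the device had I/O in flight


def await_ms(prev: DiskCounters, curr: DiskCounters) -> float:
    """Average time per completed I/O between two snapshots, in ms."""
    ios = (curr.reads_completed - prev.reads_completed) + (
        curr.writes_completed - prev.writes_completed
    )
    if ios <= 0:
        return 0.0  # no I/O completed in the interval
    busy = (curr.read_time_ms - prev.read_time_ms) + (
        curr.write_time_ms - prev.write_time_ms
    )
    return busy / ios


def util_pct(prev: DiskCounters, curr: DiskCounters, interval_ms: float) -> float:
    """Share of the interval during which the device was busy, capped at 100."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return min(100.0, (curr.io_time_ms - prev.io_time_ms) / interval_ms * 100)


if __name__ == "__main__":
    prev = DiskCounters(1000, 2000, 5000, 30000, 40000)
    curr = DiskCounters(1100, 2400, 6000, 37000, 45000)
    print(await_ms(prev, curr), util_pct(prev, curr, interval_ms=5000))
```

In this made-up sample the device completes 500 I/Os in 8000 ms of accumulated service time, so await is 16 ms, above the 10 ms line used in this SOP, and the device is busy for the whole 5-second interval.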
🧭 7. Raft Layer (distributed replication and consistency)
| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
| --- | --- | --- | --- | --- | --- | --- |
| Raft | Propose backlog | raftstore thread | online-TiKV-Details | propose ↑ | write latency ↑ | Write pressure |
| Raft | Slow apply | apply thread | online-TiKV-Details | apply latency ↑ | commit latency ↑ | IO bottleneck |
| Raft | Frequent snapshots | raft snapshot | online-TiKV-Raw | snapshot ↑↑ | replication slow | Replica rebuild |
| Raft | Frequent leader changes | raft election | online-PD | leader churn ↑ | region jitter | Instability |
| Raft | Slow region sync | raft log | online-TiKV-Details | log lag ↑ | followers falling behind | Network / IO |
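🧭 8. PD / Region Layer (hotspots and scheduling)
| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
| --- | --- | --- | --- | --- | --- | --- |
| Region | Hot key | PD scheduler | online-PD | leaders concentrated | QPS concentrated | Data skew |
| Region | Oversized region | region split | online-PD | region size ↑ | low split frequency | Granularity problem |
| Region | Unbalanced load | scheduler balance | online-PD | uneven store QPS | skewed leader distribution | Scheduling problem |
| Region | Overly frequent splits | region split | online-PD | split rate ↑ | IO spike | Write amplification |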
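"Leaders concentrated" and "skewed leader distribution" can be turned into a single number by comparing the busiest store with the average. A minimal sketch, assuming per-store leader counts have been copied from the PD dashboard into a dictionary (the store names are placeholders, and what ratio counts as "skewed" is left to the operator):

```python
from typing import Dict


def leader_skew(leaders_per_store: Dict[str, int]) -> float:
    """Ratio of the busiest store's leader count to the mean; 1.0 means perfectly even."""
    counts = list(leaders_per_store.values())
    total = sum(counts)
    if not counts or total <= 0:
        raise ValueError("need at least one store with a positive leader count")
    return max(counts) * len(counts) / total


def busiest_store(leaders_per_store: Dict[str, int]) -> str:
    """Name of the store holding the most leaders."""
    return max(leaders_per_store, key=leaders_per_store.__getitem__)


if __name__ == "__main__":
    leaders = {"store-1": 300, "store-2": 100, "store-3": 100}
    print(busiest_store(leaders), round(leader_skew(leaders), 2))
```

The same ratio works for per-store QPS, which covers the "uneven store QPS" row.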
🧭 9. CPU Layer (final resource bottleneck)
| Layer | Specific cause | Thread / module | Dashboard | Metric behavior | How to confirm | Root-cause direction |
| --- | --- | --- | --- | --- | --- | --- |
| CPU | Coprocessor compute | coprocessor thread | online-TiKV-Details | CPU ↑ + scan ↑ | flame graph shows a high scan share | SQL pressure |
| CPU | Hotspot saturating one core | raftstore single thread | online-TiKV-Details | single core at 100% | locate the thread with top | Hot key |
| CPU | Compaction CPU | compaction thread | online-TiKV-Details | CPU stays high | compaction score ↑ | Storage pressure |
| CPU | Raft apply CPU | apply thread | online-TiKV-Details | CPU ↑ | apply latency ↑ | Write pressure |
| CPU | Txn conflict handling | lock / scheduler | online-TiKV-Details | CPU + wait ↑ | lock wait ↑ | Concurrency conflict |
| CPU | GC pressure | GC thread | online-TiKV-Raw | CPU rising slowly | tombstone ↑ | Historical data |
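The "single core at 100%" row is the one case where the overall CPU figure can look healthy while one thread is the bottleneck. A small sketch of that check, assuming per-thread CPU percentages have been collected by hand (for example from a per-thread `top` view). The thread names and both limits (`one_core_limit=95.0`, `overall_limit=80.0`) are illustrative assumptions; only the 80% figure comes from this SOP.

```python
from typing import Dict, List


def saturated_threads(cpu_pct_by_thread: Dict[str, float], one_core_limit: float = 95.0) -> List[str]:
    """Threads using at least one_core_limit percent of a single core, sorted by name."""
    return sorted(name for name, pct in cpu_pct_by_thread.items() if pct >= one_core_limit)


def hidden_hotspot(
    cpu_pct_by_thread: Dict[str, float],
    cores: int,
    overall_limit: float = 80.0,
    one_core_limit: float = 95.0,
) -> bool:
    """True if overall CPU is under the limit but at least one thread saturates a core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    overall = sum(cpu_pct_by_thread.values()) / cores
    return overall <= overall_limit and bool(saturated_threads(cpu_pct_by_thread, one_core_limit))


if __name__ == "__main__":
    # Hypothetical thread names and readings.
    threads = {"raftstore-0": 99.0, "apply-0": 40.0, "cop-0": 20.0}
    print(saturated_threads(threads), hidden_hotspot(threads, cores=8))
```

When this returns True, the CPU > 80% rule in the flow table will not fire, so the thread-level view is the only place the problem shows up.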
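🔥 10. Five Ultimate Root-Cause Categories (engineering convergence layer)
| Category | Subtypes |
| --- | --- |
| 🔥 Hotspot | leaders concentrated / single-key hotspot / region skew |
| 💾 Storage bottleneck | compaction backlog / write stall / WAL fsync |
| 🧱 IO bottleneck | disk await / snapshot writes / cache miss |
| 🔁 Raft bottleneck | propose/apply backlog / snapshot recovery |
| 🧠 CPU bottleneck | coprocessor / compaction / hotspot / txn conflict |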
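To close out an investigation, the observed symptoms can be looked up against this table to see which categories they fall under. A minimal sketch: the dictionary is a direct transcription of the table with English labels, and the exact-match lookup is a simplification chosen for illustration.

```python
from typing import Dict, Iterable, List

# Transcribed from the category table; labels are this document's wording.
CATEGORIES: Dict[str, List[str]] = {
    "Hotspot": ["leaders concentrated", "single-key hotspot", "region skew"],
    "Storage bottleneck": ["compaction backlog", "write stall", "WAL fsync"],
    "IO bottleneck": ["disk await", "snapshot writes", "cache miss"],
    "Raft bottleneck": ["propose/apply backlog", "snapshot recovery"],
    "CPU bottleneck": ["coprocessor", "compaction", "hotspot", "txn conflict"],
}


def categories_for(symptoms: Iterable[str]) -> Dict[str, List[str]]:
    """Group the given symptoms under every category that lists them (case-insensitive)."""
    wanted = {s.strip().lower() for s in symptoms}
    hits: Dict[str, List[str]] = {}
    for category, subtypes in CATEGORIES.items():
        matched = [s for s in subtypes if s.lower() in wanted]
        if matched:
            hits[category] = matched
    return hits


if __name__ == "__main__":
    print(categories_for(["write stall", "disk await"]))
```

Symptoms landing in more than one category is expected, since the layers amplify each other.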
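🎯 11. Engineering-Level Core Summary (must keep)
In essence, TiKV performance problems are a layer-by-layer amplification effect along this chain:
SQL access pattern → Coprocessor scan → RocksDB LSM structure → Disk IO physical limits → Raft replication → PD scheduling → CPU thread contention.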