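Capturing cache misses with simpleperf
I. Introduction
1. Common false-sharing pitfalls on Android
(1) Common causes
Arrays of thread-pool statistics counters, where adjacent elements are written by different threads;
ring buffers whose head/tail fields are packed together with statistics fields;
shared state structures in the binder/rendering path whose small fields are written frequently;
in Java, several volatile fields grouped in one object and updated by different threads.
(2) Common fixes
Avoid placing several frequently written fields in the same object; split them into separate objects so the memory layout separates them naturally;
prefer LongAdder over a single AtomicLong under high concurrency;
for hot spots in the native layer, apply alignment/padding in C++ first.
Rule of thumb: "several threads write different variables, yet performance looks as if they were fighting over one lock, and it improves markedly once 64B alignment is added" is almost always cache false sharing.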
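The align/padding fix for native code can be sketched as follows. This is a minimal illustration, not code from any Android component: the names `PackedCounter`, `PaddedCounter`, `CounterArray` and `same_cache_line` are hypothetical, and the 64-byte line size is an assumption that matches the 64B figure used above. It shows that `alignas(64)` gives every per-thread counter its own cache line, while the unpadded layout lets neighbouring counters share one.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (the 64B figure used in the text).
constexpr std::size_t kCacheLineBytes = 64;

// Unpadded: 8-byte counters sit back to back, so neighbours share a line.
struct PackedCounter {
    std::atomic<uint64_t> value{0};
};

// Padded: alignas rounds both alignment and sizeof up to one cache line.
struct alignas(kCacheLineBytes) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLineBytes,
              "each padded counter should occupy exactly one cache line");
static_assert(alignof(PaddedCounter) == kCacheLineBytes,
              "each padded counter should start on a line boundary");

// True when both addresses fall inside the same 64-byte line.
inline bool same_cache_line(const void* a, const void* b) {
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kCacheLineBytes;
    };
    return line(a) == line(b);
}

// Per-thread statistics slots: thread i is expected to write only slots[i].
template <typename Counter, std::size_t N>
struct CounterArray {
    Counter slots[N];

    void add(std::size_t i, uint64_t n) {
        slots[i].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Sum of all slots; read side only, so it can run on any thread.
    uint64_t total() const {
        uint64_t sum = 0;
        for (const Counter& s : slots) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }
};
```

With `CounterArray<PaddedCounter, 4>`, writers on different threads no longer invalidate each other's lines, at the cost of 64 bytes per slot instead of 8. If such counters are allocated on the heap, C++17 or later is needed for the over-aligned allocation to be honoured.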
2. Listing the miss events the system supports
8295:/ # simpleperf list | grep miss
  branch-load-misses
  dTLB-load-misses
  iTLB-load-misses
  L1-dcache-load-misses
  L1-icache-load-misses
  branch-misses
  cache-misses    // generic name
  armv8_pmuv3/ll_cache_miss_rd/
  raw-l1d-cache-lmiss-rd    # Level 1 data cache long-latency read miss
  raw-l1i-cache-lmiss    # Level 1 instruction cache long-latency miss
  raw-l2d-cache-lmiss-rd    # Level 2 data cache long-latency read miss
  raw-l2i-cache-lmiss (may not supported)    # Level 2 instruction cache long-latency miss
  raw-l3d-cache-lmiss-rd    # Level 3 data cache long-latency read miss
  raw-ll-cache-miss (may not supported)    # Attributable Last level data or unified cache miss
  raw-ll-cache-miss-rd    # Attributable Last Level cache memory read miss
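II. Investigating cache misses
When you see a high cache-miss rate on Android, the most practical approach is to use the PMU counters in two steps: quantify first, then locate the hot spots.
1. First check system-wide whether there is a "systemic high miss rate"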
8295:/# simpleperf stat -a --duration 10 -e cpu-cycles,instructions,cache-misses,cache-references
Performance counter statistics:

#           count  event_name          # count / runtime,  runtime / enabled_time
   40,875,971,576  cpu-cycles          # 0.510409 GHz  (100%)
   47,001,761,611  instructions        # 0.869669 cycles per instruction  (100%)
      337,555,623  cache-misses        # 1.924752% miss rate  (100%)    // with the experiment below running, the system-wide cache-miss rate reaches 12%
   17,537,614,171  cache-references    # 218.994 M/sec  (100%)

Total test time: 10.008294 seconds.
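The data captured this way is presented in terms of cache misses rather than CPU usage. Focus on three metrics:
miss rate = cache-misses / cache-references = 337,555,623 / 17,537,614,171 = 1.92%
IPC = instructions / cpu-cycles = 47,001,761,611 / 40,875,971,576 = 1.15
MPKI = cache-misses / instructions × 1000 = 337,555,623 / 47,001,761,611 × 1000 = 7.18
Typical acceptable ranges for the three metrics:
(1) miss rate: commonly acceptable 1% ~ 5%; good < 2%; needs attention > 5%; clearly abnormal (in most workloads) > 10%.
(2) IPC: little cores typically 0.6 ~ 1.2; big cores typically 1.0 ~ 2.0 (some high-ILP workloads go higher). If it stays below 0.7 for a long time, that usually points to heavy memory-access stalls, branch mispredictions, or pipeline bubbles.
(3) MPKI: good < 5; acceptable 5 ~ 15; high 15 ~ 30; very high > 30 (usually slows things down noticeably).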
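The three formulas can be turned into a small helper for post-processing `simpleperf stat` counts. This is a sketch under assumptions: `PmuCounts`, `DerivedMetrics`, `derive_metrics` and `classify_mpki` are hypothetical names, not part of simpleperf, and the MPKI bands are simply the rule-of-thumb ranges listed above.

```cpp
#include <cstdint>
#include <string>

// Raw counts as printed by `simpleperf stat` (hypothetical container type).
struct PmuCounts {
    uint64_t cpu_cycles;
    uint64_t instructions;
    uint64_t cache_misses;
    uint64_t cache_references;
};

struct DerivedMetrics {
    double miss_rate_pct;  // cache-misses / cache-references, in percent
    double ipc;            // instructions / cpu-cycles
    double mpki;           // cache-misses per 1000 instructions
};

// Applies the three formulas; a metric stays 0 when its denominator is 0.
DerivedMetrics derive_metrics(const PmuCounts& c) {
    DerivedMetrics m{};
    if (c.cache_references != 0) {
        m.miss_rate_pct = 100.0 * static_cast<double>(c.cache_misses) /
                          static_cast<double>(c.cache_references);
    }
    if (c.cpu_cycles != 0) {
        m.ipc = static_cast<double>(c.instructions) /
                static_cast<double>(c.cpu_cycles);
    }
    if (c.instructions != 0) {
        m.mpki = 1000.0 * static_cast<double>(c.cache_misses) /
                 static_cast<double>(c.instructions);
    }
    return m;
}

// Maps MPKI onto the rule-of-thumb bands: <5, 5~15, 15~30, >30.
std::string classify_mpki(double mpki) {
    if (mpki < 5.0) return "good";
    if (mpki <= 15.0) return "acceptable";
    if (mpki <= 30.0) return "high";
    return "very high";
}
```

Feeding in the system-wide sample above (40,875,971,576 cycles, 47,001,761,611 instructions, 337,555,623 misses, 17,537,614,171 references) reproduces the hand-computed 1.92%, 1.15 and 7.18, which falls in the "acceptable" MPKI band.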
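2. Then locate the functions that are cache-miss hot spots
8295:/# simpleperf record -a -g --duration 10 -e cache-misses -o perf_miss_a.data
This writes perf_miss_a.data to the current directory; then parse it under D:\android-ndk-r25c\output.
In practice, this capture alone is mostly enough to pin the high cache-miss rate on the experiment program shown below.
3. Narrow down to the target process/thread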
// Per process
8295:/# simpleperf stat -p <PID> --duration 10 -e cpu-cycles,instructions,cache-misses,cache-references
// Per thread
8295:/# simpleperf stat -t <TID> --duration 10 -e cpu-cycles,instructions,cache-misses,cache-references
// Capture the cache-miss hot-spot functions:
8295:/# simpleperf record -p <PID> -g --duration 10 -e cache-misses -o perf_miss_a.data
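III. Experiment
1. Experiment code
main() calls two functions: one causes a large number of cache misses, and the other busy-waits for the same amount of time. We then capture cache-miss data and check whether the result matches expectations.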
android/frameworks/native/cmds/my_cache_miss_test$ tree
.
├── Android.bp
└── cache_miss_test.cpp
(1) Android.bp
android/frameworks/native/cmds/my_cache_miss_test$ cat Android.bp
cc_binary {
    name: "cache_miss_test_static",
    srcs: ["cache_miss_test.cpp"],
    static_executable: true,
    static_libs: ["libc"],
    cflags: [
        "-Wall",
        "-Werror",
        "-Wno-unused-function",
        "-Wno-unused-parameter",
        "-Wno-unused-variable",
    ],
}
(2) cache_miss_test.cpp
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define L1_CACHE_BYTES 64
#define ONE_MB (1024 * 1024)

/* Monotonic clock in nanoseconds. */
__attribute__((noinline)) uint64_t cmt_now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Spin without touching the buffer until delta ns have passed since start. */
__attribute__((noinline)) void cmt_do_busy_delay(uint64_t start, uint64_t delta)
{
    uint64_t i = 0; /* 64-bit unsigned so a long spin cannot overflow */
    uint64_t now;

    while (1) {
        i++;
        /* Read the clock only once every 128K iterations. */
        if (i % (ONE_MB / 8) == 0) {
            now = cmt_now_ns();
            if (now > start + delta) {
                break;
            }
        }
    }
}

/* Touch one byte per cache line: rw != 0 reads, rw == 0 writes. */
__attribute__((noinline)) void cmt_read_write_every_cache_byte(char *p, int rw, size_t size)
{
    volatile char *vp = p; /* volatile so the loads/stores are not optimized away */
    char v = (char)0xab;
    size_t i;

    for (i = 0; i < size; i += L1_CACHE_BYTES) {
        if (rw) {
            v = vp[i];
        } else {
            vp[i] = v;
        }
    }
}

/* Usage: cache_miss_test_static <every_byte> <read_write> <size in MB> */
int main(int argc, char *argv[])
{
    int ret, cnt = 0;
    int rw = 1, eb = L1_CACHE_BYTES; /* eb is only printed; the stride stays L1_CACHE_BYTES */
    size_t size = (size_t)1024 * ONE_MB;
    char *p = NULL;

    switch (argc) {
    case 4:
        size = (size_t)atoi(argv[3]) * ONE_MB;
        [[fallthrough]];
    case 3:
        rw = atoi(argv[2]);
        [[fallthrough]];
    case 2:
        eb = atoi(argv[1]);
        [[fallthrough]];
    default:
        break;
    }

    printf("rw=%d, eb=%d, size=%zuMB\n", rw, eb, size / ONE_MB);

    //p = (char *)malloc(size);
    ret = posix_memalign((void **)&p, L1_CACHE_BYTES, size);
    if (ret || !p) {
        printf("posix_memalign failed\n");
        return -1;
    }

    while (1) {
        uint64_t before, now;

        printf("cnt=%d\n", ++cnt);

        before = cmt_now_ns();
        cmt_read_write_every_cache_byte(p, rw, size);
        now = cmt_now_ns();
        /* Busy-wait for as long as the memory pass just took. */
        cmt_do_busy_delay(now, now - before);
        //printf("delta_1=%lu, delta_2=%lu\n", (now - before)/1000, (cmt_now_ns()-now)/1000);
    }

    free(p);

    return 0;
}
2. Experiment data
# simpleperf stat -t 24246 --duration 10 -e cpu-cycles,instructions,cache-misses,cache-references
Performance counter statistics:

#           count  event_name          # count / runtime,  runtime / enabled_time
   22,591,403,594  cpu-cycles          # 2.359433 GHz  (100%)
   42,973,017,079  instructions        # 0.525711 cycles per instruction  (100%)
    1,561,376,318  cache-misses        # 96.397930% miss rate  (100%)
    1,619,719,750  cache-references    # 169.226 M/sec  (100%)

Total test time: 10.002188 seconds.
The test process shows a 96.4% miss rate. Run the following command to capture data for a cache-miss flame graph:
# simpleperf record -p 24246 -g --duration 15 -e cache-misses -o perf_3.data
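After placing the file under D:\android-ndk-r25c and parsing it, cmt_read_write_every_cache_byte() alone accounts for 99.96% of the cache misses, while the busy-waiting cmt_do_busy_delay() shows no cache misses, which matches expectations. The result is shown below: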

posted on 2026-06-04 18:08 Hello-World3