Capturing cache misses with simpleperf

I. Introduction

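1. Common false-sharing pitfalls on Android

(1) Common causes

Thread-pool statistics counter arrays (adjacent elements written by different threads);
A ring buffer whose head/tail indices are packed next to its statistics fields;
Shared state structures in the binder/rendering path whose small fields are written frequently;
In Java, several volatile fields grouped in one object and updated by different threads;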

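(2) Common fixes

Avoid placing several frequently written fields in the same object;
Split them into separate objects (so the memory layout separates them naturally);
Under high contention, prefer LongAdder over a single AtomicLong;
If the hotspot is in the native layer, fix it in C++ first with align/padding;

A one-sentence test: "multiple threads write different variables, yet performance looks as if they were fighting over one lock, and it improves markedly once the variables are aligned to 64 bytes". That is almost certainly cache false sharing.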
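
The counter-array pitfall and the align/padding fix above can be sketched in C++ roughly as follows. This is a minimal illustration, not code taken from any Android component: the names PackedCounter, PaddedCounter and bump_all are hypothetical, and the 64-byte line size is an assumption that is typical for ARMv8 cores but should be verified for the target CPU.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size (hypothetical constant; check the target SoC).
constexpr std::size_t kCacheLine = 64;

// Packed layout: neighbouring counters share one cache line, so threads that
// write different elements still invalidate each other's copy of the line.
struct PackedCounter {
    std::atomic<uint64_t> value{0};
};

// Padded layout: alignas rounds both alignment and size up to one cache line,
// so each counter owns a line by itself.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Starts one thread per slot; each thread increments only its own slot.
// Returns the sum of all slots after the threads have joined.
template <typename Counter, std::size_t N>
uint64_t bump_all(std::array<Counter, N>& slots, uint64_t iters) {
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < N; ++t) {
        workers.emplace_back([&slots, t, iters] {
            for (uint64_t i = 0; i < iters; ++i) {
                slots[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    uint64_t total = 0;
    for (auto& s : slots) {
        total += s.value.load();
    }
    return total;
}
```

Calling bump_all on a std::array<PackedCounter, 4> and on a std::array<PaddedCounter, 4> with the same iteration count produces the same total; if false sharing is the bottleneck, the padded run is expected to finish faster, although the size of the gap depends on the core count and the cache topology.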

4. List the miss events the system supports

8295:/ # simpleperf list | grep miss
  branch-load-misses
  dTLB-load-misses
  iTLB-load-misses
  L1-dcache-load-misses
  L1-icache-load-misses
  branch-misses
  cache-misses // generic name
  armv8_pmuv3/ll_cache_miss_rd/
  raw-l1d-cache-lmiss-rd                   # Level 1 data cache long-latency read miss
  raw-l1i-cache-lmiss                      # Level 1 instruction cache long-latency miss
  raw-l2d-cache-lmiss-rd                   # Level 2 data cache long-latency read miss
  raw-l2i-cache-lmiss (may not supported)  # Level 2 instruction cache long-latency miss
  raw-l3d-cache-lmiss-rd                   # Level 3 data cache long-latency read miss
  raw-ll-cache-miss (may not supported)    # Attributable Last level data or unified cache miss
  raw-ll-cache-miss-rd                     # Attributable Last Level cache memory read miss

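II. Troubleshooting cache misses

When a high cache-miss count shows up on Android, the most practical approach is to use the PMU counters in two steps: quantify first, then locate the hotspots.

1. First check system-wide whether the miss rate is systemically high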

8295:/# simpleperf stat -a --duration 10 -e cpu-cycles,instructions,cache-misses,cache-references
Performance counter statistics:

#          count  event_name         # count / runtime,  runtime / enabled_time
  40,875,971,576  cpu-cycles         # 0.510409 GHz                     (100%)
  47,001,761,611  instructions       # 0.869669 cycles per instruction  (100%)
     337,555,623  cache-misses       # 1.924752% miss rate              (100%) // with the experiment below running, the system-wide miss rate reaches as high as 12%
  17,537,614,171  cache-references   # 218.994 M/sec                    (100%)

Total test time: 10.008294 seconds.

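These figures are raw event counts rather than CPU-usage percentages. Focus on three derived metrics:

miss rate = cache-misses / cache-references = 337,555,623 / 17,537,614,171 = 1.92%
IPC = instructions / cpu-cycles = 47,001,761,611 / 40,875,971,576 = 1.15 (the reciprocal of the 0.869669 cycles per instruction printed above)
MPKI = cache-misses / instructions × 1000 = 337,555,623 / 47,001,761,611 × 1000 = 7.18

Rough acceptable ranges for the three metrics:

(1) miss rate: commonly acceptable 1% ~ 5%; good < 2%; needs attention > 5%; clearly abnormal (in most workloads) > 10%.
(2) IPC: little cores typically 0.6 ~ 1.2; big cores typically 1.0 ~ 2.0 (some high-ILP workloads go higher). A sustained value below 0.7 usually indicates heavy memory stalls, branch mispredictions, or pipeline bubbles.
(3) MPKI: good < 5; acceptable 5 ~ 15; high 15 ~ 30; very high > 30 (usually slows things down noticeably).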
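
The three formulas and the MPKI buckets above can be written as a small C++ helper. This is only a sketch: the PmuCounts struct and the function names are hypothetical, the input values are copied from the simpleperf stat output above, and the buckets are the rules of thumb from this section rather than hard limits.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Raw PMU counts as printed by `simpleperf stat` (hypothetical container type).
struct PmuCounts {
    uint64_t cpu_cycles;
    uint64_t instructions;
    uint64_t cache_misses;
    uint64_t cache_references;
};

// miss rate = cache-misses / cache-references, as a percentage.
double miss_rate_percent(const PmuCounts& c) {
    return 100.0 * static_cast<double>(c.cache_misses) /
           static_cast<double>(c.cache_references);
}

// IPC = instructions / cpu-cycles.
double ipc(const PmuCounts& c) {
    return static_cast<double>(c.instructions) /
           static_cast<double>(c.cpu_cycles);
}

// MPKI = cache misses per 1000 instructions.
double mpki(const PmuCounts& c) {
    return 1000.0 * static_cast<double>(c.cache_misses) /
           static_cast<double>(c.instructions);
}

// Buckets follow the rule-of-thumb MPKI ranges listed above.
std::string classify_mpki(double v) {
    if (v < 5.0) return "good";
    if (v <= 15.0) return "acceptable";
    if (v <= 30.0) return "high";
    return "very high";
}

// Prints the derived metrics for one set of counters.
void print_report(const PmuCounts& c) {
    std::printf("miss rate = %.2f%%, IPC = %.2f, MPKI = %.2f (%s)\n",
                miss_rate_percent(c), ipc(c), mpki(c),
                classify_mpki(mpki(c)).c_str());
}
```

Fed with the system-wide counters above (40,875,971,576 cycles, 47,001,761,611 instructions, 337,555,623 misses, 17,537,614,171 references), the helpers reproduce the hand-computed 1.92%, 1.15 and 7.18, and the MPKI falls in the "acceptable" bucket.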

2. Then locate the functions that are cache-miss hotspots

8295:/# simpleperf record -a -g --duration 10 -e cache-misses -o perf_miss_a.data

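This writes perf_miss_a.data (the file named by -o) to the current directory; then parse it on the host under D:\android-ndk-r25c\output.

In practice, this capture alone was enough to identify the experiment program below as the cause of the high cache-miss count.

3. Narrow down to the target process/thread

// process level
8295:/# simpleperf stat -p <PID> --duration 10 -e cpu-cycles,instructions,cache-misses,cache-references
// thread level
8295:/# simpleperf stat -t <TID> --duration 10 -e cpu-cycles,instructions,cache-misses,cache-references

// capture the cache-miss hotspot functions:
8295:/# simpleperf record -p <PID> -g --duration 10 -e cache-misses -o perf_miss_a.data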

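III. Experiment

1. Experiment code

main() runs two functions in a loop: one causes a large number of cache misses, and the other busy-waits for the same amount of time. Then capture the cache-miss data and check whether the result matches expectations.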

android/frameworks/native/cmds/my_cache_miss_test$ tree
.
├── Android.bp
└── cache_miss_test.cpp

(1) Android.bp

android/frameworks/native/cmds/my_cache_miss_test$ cat Android.bp 

cc_binary {
    name: "cache_miss_test_static",
    srcs: ["cache_miss_test.cpp"],
    static_executable: true,
    static_libs: ["libc"],
    cflags: [
        "-Wall",
        "-Werror",
        "-Wno-unused-function",
        "-Wno-unused-parameter",
        "-Wno-unused-variable",
    ],
}

(2) cache_miss_test.cpp

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define L1_CACHE_BYTES 64
#define ONE_MB (1024 * 1024)

__attribute__((noinline)) uint64_t cmt_now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

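// Spin without touching the buffer until `delta` ns have passed since `start`.
__attribute__((noinline)) void cmt_do_busy_delay(uint64_t start, uint64_t delta)
{
    uint64_t i = 0;  // unsigned 64-bit so a long spin cannot overflow a signed int
    uint64_t now;

    while (1) {
        i++;
        // Read the clock only once every 128K iterations to keep its overhead low.
        if (i % (ONE_MB/8) == 0) {
            now = cmt_now_ns();
            if (now > start + delta) {
                break;
            }
        }
    }
}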

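// rw != 0: read one byte per cache line; rw == 0: write one byte per cache line.
__attribute__((noinline)) void cmt_read_write_every_cache_byte(char *p, int rw, int size)
{
    // Access through a volatile pointer so an optimizing compiler cannot drop
    // the loads, whose result is otherwise unused.
    volatile char *vp = p;
    int i, v = 0xab;

    for (i = 0; i < size; i += L1_CACHE_BYTES) {
        if (rw) {
            v = vp[i];
        } else {
            vp[i] = v;
        }
    }
}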

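// Usage: cache_miss_test_static <every_byte> <read_write> <size_mb>
//   read_write: nonzero = read pass, 0 = write pass; size_mb: buffer size in MB.
//   every_byte is parsed and printed only; the stride stays at L1_CACHE_BYTES.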
int main(int argc, char *argv[])
{
    int ret, cnt = 0;
    int rw = 1, eb = L1_CACHE_BYTES, size = 1024 * ONE_MB;
    char *p;

    switch (argc) {
    case 4:
        size = atoi(argv[3]) * ONE_MB;
        [[fallthrough]];
    case 3:
        rw = atoi(argv[2]);
        [[fallthrough]];
    case 2:
        eb = atoi(argv[1]);
        [[fallthrough]];
    default:
        break;
    }
    printf("rw=%d, eb=%d, size=%dMB\n", rw, eb, (size/ONE_MB));

    //p = (char *)malloc(size);
    ret = posix_memalign((void **)&p, L1_CACHE_BYTES, size);
    if (ret || !p) {
        printf("posix_memalign failed\n");
        return -1;
    }

    while (1) {
        uint64_t before, now;
        printf("cnt=%d\n", ++cnt);
        before = cmt_now_ns();
        cmt_read_write_every_cache_byte(p, rw, size);
        now = cmt_now_ns();
        cmt_do_busy_delay(now, now - before);
        //printf("delta_1=%lu, delta_2=%lu\n", (now - before)/1000, (cmt_now_ns()-now)/1000);
    }
    free(p);
    return 0;
}
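
Why the read/write pass is expected to miss on almost every access can be checked with a little arithmetic, sketched below. The helper lines_touched and the constant names are hypothetical; the 64-byte stride and the default 1024 MB buffer come from the listing above. The assumption is that the buffer is far larger than any on-chip cache, so lines touched early in a pass have been evicted again before the next pass reaches them.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kLineBytes = 64;          // stride used by the test loop
constexpr uint64_t kOneMB = 1024 * 1024;

// Number of accesses made by `for (i = 0; i < size; i += stride)`,
// i.e. the number of distinct cache lines one pass touches when
// stride equals the line size.
constexpr uint64_t lines_touched(uint64_t size_bytes, uint64_t stride) {
    return (size_bytes + stride - 1) / stride;
}

// Prints how many lines one pass over `size_mb` megabytes touches.
void print_lines_per_pass(uint64_t size_mb) {
    std::printf("%llu MB buffer -> %llu cache lines per pass\n",
                static_cast<unsigned long long>(size_mb),
                static_cast<unsigned long long>(
                    lines_touched(size_mb * kOneMB, kLineBytes)));
}
```

For the default 1024 MB buffer this gives 16,777,216 lines per pass, each touched exactly once, which is consistent with a miss rate close to 100% for the pass itself.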

2. Experiment data

# simpleperf stat -t 24246 --duration 10 -e cpu-cycles,instructions,cache-misses,cache-references
Performance counter statistics:

#          count  event_name         # count / runtime,  runtime / enabled_time
  22,591,403,594  cpu-cycles         # 2.359433 GHz                     (100%)
  42,973,017,079  instructions       # 0.525711 cycles per instruction  (100%)
   1,561,376,318  cache-misses       # 96.397930% miss rate             (100%)
   1,619,719,750  cache-references   # 169.226 M/sec                    (100%)

Total test time: 10.002188 seconds.

The test program shows a 96.4% miss rate. Run the following command to capture a cache-miss flame graph:

# simpleperf record -p 24246 -g --duration 15 -e cache-misses -o perf_3.data

After copying the file to D:\android-ndk-r25c and parsing it, cmt_read_write_every_cache_byte() alone accounts for 99.96% of the cache misses, while the busy-waiting cmt_do_busy_delay() shows none, which matches expectations.

posted on 2026-06-04 18:08 Hello-World3