Repro Demo for the bthread_join Memory Visibility Issue

Test environment:
CPU: aarch64
OS: openEuler 22.03 (LTS-SP4)

The code:
```cpp
/**
 * Repro demo for the bthread_join memory visibility issue.
 *
 * Idea: a bthread writes a large amount of data into a shared structure, and
 * the main thread reads it immediately after bthread_join returns. Under
 * ARM's (aarch64) weak memory ordering, because bthread_join lacks an
 * acquire fence, the joining thread may read stale cached data.
 *
 * Strategies to raise the repro probability:
 * 1. Use multiple CPU cores so the bthread and the main thread run on
 *    different physical cores.
 * 2. Iterate many times and count the failures.
 * 3. Write enough data to widen the store buffer / invalidate queue window.
 * 4. Read immediately after join to shrink the timing window.
 * 5. Run several threads concurrently to increase contention.
 */

#include <bthread/bthread.h>
#include <butil/logging.h>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>
#include <thread>
#include "gflags/gflags.h"

// Magic number used to detect whether the writes are visible.
static constexpr uint64_t MAGIC_VALUE = 0xDEADBEEFCAFEBABEULL;
static constexpr uint64_t JEMALLOC_JUNK = 0xa5a5a5a5a5a5a5a5ULL;

// Amount of data written by the bthread per iteration (the larger it is,
// the longer the store buffer takes to drain).
static constexpr int DATA_SIZE = 64;

struct SharedData {
    // A vector, to mimic the emplace_back pattern from production.
    std::vector<uint64_t> values;
    // Extra padding writes to widen the stale-read window.
    uint64_t padding[DATA_SIZE];
};

static void* bthread_worker(void* arg) {
    SharedData* data = reinterpret_cast<SharedData*>(arg);

    // Write the padding first (fills up the store buffer).
    for (int i = 0; i < DATA_SIZE; ++i) {
        data->padding[i] = MAGIC_VALUE;
    }

    // Write the vector last (mimics production's emplace_back).
    data->values.emplace_back(MAGIC_VALUE);

    return nullptr;
}

struct Stats {
    std::atomic<uint64_t> total_iters{0};
    std::atomic<uint64_t> stale_reads{0};
    std::atomic<uint64_t> vector_stale{0};
    std::atomic<uint64_t> padding_stale{0};
};

static void run_test(Stats* stats, int iterations) {
    uint64_t local_stale = 0;
    uint64_t local_vec_stale = 0;
    uint64_t local_pad_stale = 0;

    for (int i = 0; i < iterations; ++i) {
        // Use a fresh SharedData each iteration so the memory is "new",
        // making the jemalloc-junk -> stale-read scenario more likely.
        SharedData* data = new SharedData();
        data->values.reserve(1);  // pre-allocate; avoid realloc in emplace_back

        bthread_t tid;
        int rc = bthread_start_background(&tid, nullptr, bthread_worker, data);
        if (rc != 0) {
            fprintf(stderr, "bthread_start_background failed: %d\n", rc);
            delete data;
            continue;
        }

        // By its documented semantics, all writes made by the bthread
        // should be visible once bthread_join returns.
        bthread_join(tid, nullptr);

        // Check visibility immediately (no extra work, keep the window small).
        bool has_stale = false;

        // Check the padding.
        for (int j = 0; j < DATA_SIZE; ++j) {
            if (data->padding[j] != MAGIC_VALUE) {
                local_pad_stale++;
                has_stale = true;
                fprintf(stderr, "[%s:%d][iter %d] padding[%d] stale read: got 0x%lx%s\n",
                        __FILE__, __LINE__, i, j, data->padding[j],
                        (data->padding[j] == JEMALLOC_JUNK) ? " (jemalloc junk!)" : "");
                break;  // report only the first one
            }
        }

        // Check the vector.
        if (data->values.empty() || data->values[0] != MAGIC_VALUE) {
            local_vec_stale++;
            has_stale = true;
            if (data->values.empty()) {
                fprintf(stderr, "[%s:%d][iter %d] vector is EMPTY after join!\n",
                        __FILE__, __LINE__, i);
            } else {
                uint64_t got = data->values[0];
                fprintf(stderr, "[%s:%d][iter %d] vector stale read: got 0x%lx, expected 0x%lx%s\n",
                        __FILE__, __LINE__, i, got, MAGIC_VALUE,
                        (got == JEMALLOC_JUNK) ? " (jemalloc junk!)" : "");
            }
        }

        if (has_stale) {
            local_stale++;
        }

        delete data;
    }

    stats->total_iters.fetch_add(iterations, std::memory_order_relaxed);
    stats->stale_reads.fetch_add(local_stale, std::memory_order_relaxed);
    stats->vector_stale.fetch_add(local_vec_stale, std::memory_order_relaxed);
    stats->padding_stale.fetch_add(local_pad_stale, std::memory_order_relaxed);
}

int main(int argc, char* argv[]) {
    google::ParseCommandLineFlags(&argc, &argv, true);
    int num_threads = 4;       // number of concurrent threads
    int iterations = 1000000;  // iterations per thread

    if (argc > 1) num_threads = atoi(argv[1]);
    if (argc > 2) iterations = atoi(argv[2]);

    printf("=== bthread_join Memory Visibility Stress Test ===\n");
    printf("Threads: %d, Iterations per thread: %d\n", num_threads, iterations);
    printf("Total iterations: %d\n", num_threads * iterations);
    printf("Platform: ");
#if defined(__aarch64__)
    printf("aarch64 (ARM64) - weak memory model, expect stale reads\n");
#elif defined(__x86_64__)
    printf("x86_64 - TSO strong memory model, stale reads unlikely\n");
#else
    printf("unknown\n");
#endif
    printf("Running...\n\n");

    Stats stats;
    std::vector<std::thread> threads;

    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back(run_test, &stats, iterations);
    }

    for (auto& t : threads) {
        t.join();
    }

    printf("\n=== Results ===\n");
    printf("Total iterations: %lu\n", stats.total_iters.load());
    printf("Stale read incidents: %lu\n", stats.stale_reads.load());
    printf("  - vector stale: %lu\n", stats.vector_stale.load());
    printf("  - padding stale: %lu\n", stats.padding_stale.load());

    if (stats.stale_reads.load() > 0) {
        double rate = 100.0 * stats.stale_reads.load() / stats.total_iters.load();
        printf("\n*** STALE READS DETECTED! Rate: %.6f%% ***\n", rate);
        printf("This confirms bthread_join lacks acquire fence on this platform.\n");
    } else {
        printf("\nNo stale reads detected in this run.\n");
#if defined(__aarch64__)
        printf("Try increasing iterations or thread count for higher probability.\n");
#elif defined(__x86_64__)
        printf("Expected on x86 (TSO) - hardware provides strong ordering.\n");
#endif
    }

    return stats.stale_reads.load() > 0 ? 1 : 0;
}
```
The fix in brpc's bthread_join:
```diff
--- a/src/bthread/task_group.cpp
+++ b/src/bthread/task_group.cpp
namespace bthread {
@@ -594,6 +596,14 @@ int TaskGroup::join(bthread_t tid, void** return_value) {
             return errno;
         }
     }
+    if (brpc::FLAGS_brpc_bthread_join_add_fence) {
+#if defined(__aarch64__) || defined(__arm__)
+        // On ARM's weak memory model, ensure all memory writes made by the
+        // joined bthread are visible to the joining thread after join returns.
+        // This matches the semantic guarantee provided by pthread_join().
+        butil::atomic_thread_fence(butil::memory_order_acquire);
+#endif
+    }
     if (return_value) {
         *return_value = NULL;
     }
```
Run results:
With the gflag disabled, the output is:
=== bthread_join Memory Visibility Stress Test ===
Threads: 4, Iterations per thread: 1000000
Total iterations: 4000000
Platform: aarch64 (ARM64) - weak memory model, expect stale reads
Running...
[bthread_join_visibility_demo.cpp:92][iter 171378] padding[13] stale read: got 0x0
[bthread_join_visibility_demo.cpp:92][iter 557221] padding[0] stale read: got 0x0

=== Results ===
Total iterations: 4000000
Stale read incidents: 2
- vector stale: 0
  - padding stale: 2

*** STALE READS DETECTED! Rate: 0.000050% ***
This confirms bthread_join lacks acquire fence on this platform.
With the gflag enabled, the output is:
=== bthread_join Memory Visibility Stress Test ===
Threads: 4, Iterations per thread: 1000000
Total iterations: 4000000
Platform: aarch64 (ARM64) - weak memory model, expect stale reads
Running...
=== Results ===
Total iterations: 4000000
Stale read incidents: 0
- vector stale: 0
  - padding stale: 0

No stale reads detected in this run.
Result analysis
Comparison
| Scenario | Stale reads | Notes |
|------|-----------|------|
| FLAGS_brpc_bthread_join_add_fence=false | 2 (0.00005%) | padding[0] and padding[13] read 0x0 |
| FLAGS_brpc_bthread_join_add_fence=true | 0 | all 4,000,000 iterations correct |
Analysis
With the gflag enabled, TaskGroup::join() executes atomic_thread_fence(memory_order_acquire) after the while loop exits (line 604), which maps to the ARM instruction DMB ISHLD. It forces the CPU to process its pending invalidate queue entries, ensuring that every write made by the bthread is visible to the joining thread.
With it disabled, the while loop exits via a plain load of *m->version_butex with no acquire fence. The invalidate queue may still contain unprocessed invalidation messages, so the main thread can read a stale cache line (the 0x0 written at construction).
Conclusion
This pair of experiments directly demonstrates:
- Root cause confirmed: the missing acquire fence in bthread_join is the root cause of the stale reads on ARM.
- Fix is effective: a single DMB ISHLD completely eliminates the problem.