Repro Demo for the bthread_join Memory Visibility Issue

Test environment:
CPU: aarch64
OS: openEuler 22.03 (LTS-SP4)

The code:
```cpp
/**
 * Repro demo for the bthread_join memory visibility issue.
 *
 * Idea: a bthread writes a large amount of data into a shared structure, and
 * the main thread reads it immediately after bthread_join returns. Under
 * ARM's (aarch64) weak memory ordering, because bthread_join lacks an
 * acquire fence, the joining thread may read stale cached data.
 *
 * Strategies to raise the repro probability:
 * 1. Use multiple CPU cores so the bthread and the main thread run on
 *    different physical cores.
 * 2. Iterate many times and count the failures.
 * 3. Write enough data to widen the store buffer / invalidate queue window.
 * 4. Read immediately after join to shrink the timing window.
 * 5. Run several threads concurrently to increase contention.
 */

#include <bthread/bthread.h>
#include <butil/logging.h>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>
#include <thread>
#include "gflags/gflags.h"

// Magic number used to detect whether the writes are visible.
static constexpr uint64_t MAGIC_VALUE = 0xDEADBEEFCAFEBABEULL;
static constexpr uint64_t JEMALLOC_JUNK = 0xa5a5a5a5a5a5a5a5ULL;

// Amount of data written by the bthread per iteration (the larger it is,
// the longer the store buffer takes to drain).
static constexpr int DATA_SIZE = 64;

struct SharedData {
    // A vector, to mimic the emplace_back pattern from production.
    std::vector<uint64_t> values;
    // Extra padding writes to widen the stale-read window.
    uint64_t padding[DATA_SIZE];
};

static void* bthread_worker(void* arg) {
    SharedData* data = reinterpret_cast<SharedData*>(arg);

    // Write the padding first (fills up the store buffer).
    for (int i = 0; i < DATA_SIZE; ++i) {
        data->padding[i] = MAGIC_VALUE;
    }

    // Write the vector last (mimics production's emplace_back).
    data->values.emplace_back(MAGIC_VALUE);

    return nullptr;
}

struct Stats {
    std::atomic<uint64_t> total_iters{0};
    std::atomic<uint64_t> stale_reads{0};
    std::atomic<uint64_t> vector_stale{0};
    std::atomic<uint64_t> padding_stale{0};
};

static void run_test(Stats* stats, int iterations) {
    uint64_t local_stale = 0;
    uint64_t local_vec_stale = 0;
    uint64_t local_pad_stale = 0;

    for (int i = 0; i < iterations; ++i) {
        // Use a fresh SharedData each iteration so the memory is "new",
        // making the jemalloc-junk -> stale-read scenario more likely.
        SharedData* data = new SharedData();
        data->values.reserve(1);  // pre-allocate; avoid realloc in emplace_back

        bthread_t tid;
        int rc = bthread_start_background(&tid, nullptr, bthread_worker, data);
        if (rc != 0) {
            fprintf(stderr, "bthread_start_background failed: %d\n", rc);
            delete data;
            continue;
        }

        // By its documented semantics, all writes made by the bthread
        // should be visible once bthread_join returns.
        bthread_join(tid, nullptr);

        // Check visibility immediately (no extra work, keep the window small).
        bool has_stale = false;

        // Check the padding.
        for (int j = 0; j < DATA_SIZE; ++j) {
            if (data->padding[j] != MAGIC_VALUE) {
                local_pad_stale++;
                has_stale = true;
                fprintf(stderr, "[%s:%d][iter %d] padding[%d] stale read: got 0x%lx%s\n",
                        __FILE__, __LINE__, i, j, data->padding[j],
                        (data->padding[j] == JEMALLOC_JUNK) ? " (jemalloc junk!)" : "");
                break;  // report only the first one
            }
        }

        // Check the vector.
        if (data->values.empty() || data->values[0] != MAGIC_VALUE) {
            local_vec_stale++;
            has_stale = true;
            if (data->values.empty()) {
                fprintf(stderr, "[%s:%d][iter %d] vector is EMPTY after join!\n",
                        __FILE__, __LINE__, i);
            } else {
                uint64_t got = data->values[0];
                fprintf(stderr, "[%s:%d][iter %d] vector stale read: got 0x%lx, expected 0x%lx%s\n",
                        __FILE__, __LINE__, i, got, MAGIC_VALUE,
                        (got == JEMALLOC_JUNK) ? " (jemalloc junk!)" : "");
            }
        }

        if (has_stale) {
            local_stale++;
        }

        delete data;
    }

    stats->total_iters.fetch_add(iterations, std::memory_order_relaxed);
    stats->stale_reads.fetch_add(local_stale, std::memory_order_relaxed);
    stats->vector_stale.fetch_add(local_vec_stale, std::memory_order_relaxed);
    stats->padding_stale.fetch_add(local_pad_stale, std::memory_order_relaxed);
}

int main(int argc, char* argv[]) {
    google::ParseCommandLineFlags(&argc, &argv, true);
    int num_threads = 4;       // number of concurrent threads
    int iterations = 1000000;  // iterations per thread

    if (argc > 1) num_threads = atoi(argv[1]);
    if (argc > 2) iterations = atoi(argv[2]);

    printf("=== bthread_join Memory Visibility Stress Test ===\n");
    printf("Threads: %d, Iterations per thread: %d\n", num_threads, iterations);
    printf("Total iterations: %d\n", num_threads * iterations);
    printf("Platform: ");
#if defined(__aarch64__)
    printf("aarch64 (ARM64) - weak memory model, expect stale reads\n");
#elif defined(__x86_64__)
    printf("x86_64 - TSO strong memory model, stale reads unlikely\n");
#else
    printf("unknown\n");
#endif
    printf("Running...\n\n");

    Stats stats;
    std::vector<std::thread> threads;

    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back(run_test, &stats, iterations);
    }

    for (auto& t : threads) {
        t.join();
    }

    printf("\n=== Results ===\n");
    printf("Total iterations: %lu\n", stats.total_iters.load());
    printf("Stale read incidents: %lu\n", stats.stale_reads.load());
    printf("  - vector stale: %lu\n", stats.vector_stale.load());
    printf("  - padding stale: %lu\n", stats.padding_stale.load());

    if (stats.stale_reads.load() > 0) {
        double rate = 100.0 * stats.stale_reads.load() / stats.total_iters.load();
        printf("\n*** STALE READS DETECTED! Rate: %.6f%% ***\n", rate);
        printf("This confirms bthread_join lacks acquire fence on this platform.\n");
    } else {
        printf("\nNo stale reads detected in this run.\n");
#if defined(__aarch64__)
        printf("Try increasing iterations or thread count for higher probability.\n");
#elif defined(__x86_64__)
        printf("Expected on x86 (TSO) - hardware provides strong ordering.\n");
#endif
    }

    return stats.stale_reads.load() > 0 ? 1 : 0;
}
```
The fix in brpc's bthread_join:
```diff
--- a/src/bthread/task_group.cpp
+++ b/src/bthread/task_group.cpp
namespace bthread {
@@ -594,6 +596,14 @@ int TaskGroup::join(bthread_t tid, void** return_value) {
             return errno;
         }
     }
+    if (brpc::FLAGS_brpc_bthread_join_add_fence) {
+#if defined(__aarch64__) || defined(__arm__)
+        // On ARM's weak memory model, ensure all memory writes made by the
+        // joined bthread are visible to the joining thread after join returns.
+        // This matches the semantic guarantee provided by pthread_join().
+        butil::atomic_thread_fence(butil::memory_order_acquire);
+#endif
+    }
     if (return_value) {
         *return_value = NULL;
     }
```
Run results:
With the gflag disabled, the output is:
=== bthread_join Memory Visibility Stress Test ===
Threads: 4, Iterations per thread: 1000000
Total iterations: 4000000
Platform: aarch64 (ARM64) - weak memory model, expect stale reads
Running...
[bthread_join_visibility_demo.cpp:92][iter 171378] padding[13] stale read: got 0x0
[bthread_join_visibility_demo.cpp:92][iter 557221] padding[0] stale read: got 0x0

=== Results ===
Total iterations: 4000000
Stale read incidents: 2
- vector stale: 0
  - padding stale: 2

*** STALE READS DETECTED! Rate: 0.000050% ***
This confirms bthread_join lacks acquire fence on this platform.
With the gflag enabled, the output is:
=== bthread_join Memory Visibility Stress Test ===
Threads: 4, Iterations per thread: 1000000
Total iterations: 4000000
Platform: aarch64 (ARM64) - weak memory model, expect stale reads
Running...
=== Results ===
Total iterations: 4000000
Stale read incidents: 0
- vector stale: 0
  - padding stale: 0

No stale reads detected in this run.
Result analysis
Comparison
| Scenario | Stale reads | Notes |
|------|-----------|------|
| FLAGS_brpc_bthread_join_add_fence=false | 2 (0.00005%) | padding[0] and padding[13] read 0x0 |
| FLAGS_brpc_bthread_join_add_fence=true | 0 | all 4,000,000 iterations correct |
Analysis
With the gflag enabled, TaskGroup::join() executes atomic_thread_fence(memory_order_acquire) after the while loop exits (line 604), which maps to the ARM instruction DMB ISHLD. It forces the CPU to process its pending invalidate queue entries, ensuring that every write made by the bthread is visible to the joining thread.
With it disabled, the while loop exits via a plain load of *m->version_butex with no acquire fence. The invalidate queue may still contain unprocessed invalidation messages, so the main thread can read a stale cache line (the 0x0 written at construction).
Conclusion
This pair of experiments directly demonstrates:
- Root cause confirmed: the missing acquire fence in bthread_join is the root cause of the stale reads on ARM.
- Fix is effective: a single DMB ISHLD completely eliminates the problem.