False Sharing

Original: https://www.kernel.org/doc/html/latest/kernel-hacking/false-sharing.html

What is False Sharing


False sharing is related to the cache coherence mechanism that maintains the consistency of a cache line stored in multiple CPUs' caches; an academic definition for it is given in [1]. Consider a struct with a refcount and a string:

struct foo {
        refcount_t refcount;
        ...
        char name[16];
} ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) _share_ one cache line, as shown below:

              +-----------+                     +-----------+
              |   CPU 0   |                     |   CPU 1   |
              +-----------+                     +-----------+
             /                                        |
            /                                         |
           V                                          V
       +----------------------+             +----------------------+
       | A      B             | Cache 0     | A       B            | Cache 1
       +----------------------+             +----------------------+
                           |                  |
---------------------------+------------------+-----------------------------
                           |                  |
                         +----------------------+
                         |                      |
                         +----------------------+
            Main Memory  | A       B            |
                         +----------------------+

'refcount' is modified frequently, but 'name' is set once at object creation time and is never modified. When many CPUs access 'foo' at the same time, with 'refcount' being only bumped by one CPU frequently and 'name' being read by other CPUs, all those reading CPUs have to reload the whole cache line over and over due to the 'sharing', even though 'name' is never changed.

There are many real-world cases of performance regressions caused by false sharing. One of these involves the rw_semaphore 'mmap_lock' inside the mm_struct struct, whose cache line layout change triggered a regression that Linus analyzed in [2].

There are two key factors for harmful false sharing:

  • A global datum accessed (shared) by many CPUs

  • In the concurrent accesses to the data, there is at least one write operation: write/write or write/read cases.

The sharing could be from totally unrelated kernel components, or different code paths of the same kernel component.

False Sharing Pitfalls


Back when a platform had only one or a few CPUs, hot data members could be purposely put in the same cache line to keep them cache-hot and to save cacheline/TLB footprint, like a lock and the data protected by it. But on recent large systems with hundreds of CPUs, this may no longer work when the lock is heavily contended, as the lock owner CPU writes to the data while other CPUs are busy spinning on the lock.

Looking at past cases, there are several frequently occurring patterns for false sharing:

  • lock (spinlock/mutex/semaphore) and data protected by it are purposely put in one cache line.

  • global data being put together in one cache line. Some kernel subsystems have many global parameters of small size (4 bytes), which can easily be grouped together and put into one cache line.

  • data members of a big data structure randomly sitting together without being noticed (cache line is usually 64 bytes or more), like 'mem_cgroup' struct.

The following 'Possible Mitigations' section provides real-world examples.

False sharing can easily happen unless it is intentionally checked for, and it is valuable to run specific tools on performance-critical workloads to detect false sharing that affects performance, and to optimize accordingly.

How to detect and analyze False Sharing


perf record/report/stat are widely used for performance tuning, and once hotspots are detected, tools like 'perf-c2c' and 'pahole' can be further used to detect and pinpoint the possible false sharing data structures. 'addr2line' is also good at decoding instruction pointers when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing hits, the decoded functions (file and line number) accessing those cache lines, and the in-line offset of the data. Simple commands are:

$ perf c2c record -ag sleep 3
$ perf c2c report --call-graph none -k vmlinux

When running the above while testing will-it-scale's tlb_flush1 case, perf reports something like:

Total records                     :    1658231
Locked Load/Store Operations      :      89439
Load Operations                   :     623219
Load Local HITM                   :      92117
Load Remote HITM                  :        139

#----------------------------------------------------------------------
    4        0     2374        0        0        0  0xff1100088366d880
#----------------------------------------------------------------------
  0.00%   42.29%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81373b7b         0       231       129     5312        64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
  0.00%   13.10%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81374718         0       226        97     3551        64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
  0.00%   11.20%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c29bf         0       170       136      555        64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
  0.00%    7.62%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c3ec5         0       175       108      632        64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
  0.00%   23.29%    0.00%    0.00%    0.00%   0x10     1       1  0xffffffff81372d0a         0       234       279     1051        64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction to perf-c2c is [3].

'pahole' decodes data structure layouts delimited at cache line granularity. Users can match the offset in perf-c2c output against pahole's decoding to locate the exact data members. For global data, users can search for the data address in System.map.
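As an illustration (the output below is schematic, not captured from a real vmlinux), pahole annotates each member with its offset and size and marks cache line boundaries, which is what gets matched against the offsets in the perf-c2c report:

```
$ pahole -C foo vmlinux
struct foo {
        refcount_t                 refcount;        /*     0     4 */
        char                       name[16];        /*     4    16 */

        /* size: 64, cachelines: 1, members: 2 */
        /* padding: 44 */
} __attribute__((__aligned__(64)));
```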

Possible Mitigations


False sharing does not always need to be mitigated. False sharing mitigations should balance performance gains with complexity and space consumption. Sometimes, lower performance is OK, and it's unnecessary to hyper-optimize every rarely used data structure or a cold data path.

Cases of false sharing hurting performance are seen more frequently as core counts increase. Because of these detrimental effects, many patches have been proposed across a variety of subsystems (like networking and memory management) and merged. Some common mitigations (with examples) are:

  • Separate hot global data into its own dedicated cache line, even if it is just a 'short' type. The downside is more consumption of memory, cache lines and TLB entries.

    • Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")
  • Reorganize the data structure, separating the interfering members into different cache lines. One downside is that it may introduce new false sharing between other members.

    • Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")
  • Replace 'write' with 'read' when possible, especially in loops. For some global variables, use compare(read)-then-write instead of an unconditional write. For example, use:

    if (!test_bit(XXX))
            set_bit(XXX);
    

    instead of directly calling "set_bit(XXX);"; similarly for atomic_t data:

    if (atomic_read(XXX) == AAA)
            atomic_set(XXX, BBB);
    
    • Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")

    • Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

  • Turn hot global data to 'per-cpu data + global data' when possible, or reasonably increase the threshold for syncing per-cpu data to global data, to reduce or postpone the 'write' to that global data.

    • Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")

    • Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

Surely, all mitigations should be carefully verified to not cause side effects. To avoid introducing false sharing when coding, it's better to:

  • Be aware of cache line boundaries

  • Group mostly read-only fields together

  • Group things that are written at the same time together

  • Separate frequently read and frequently written fields on different cache lines.

and it is better to add a comment stating the false sharing consideration.

One note is that sometimes, even after a severe false sharing problem is detected and solved, performance may still show no obvious improvement, as the hotspot simply shifts to a new place.

Miscellaneous

其他

One open issue is that the kernel has an optional data structure randomization mechanism, which also randomizes the cache line sharing situation among data members.

posted @ 2023-12-01 11:01 dolinux