We mention the phrase Mechanical Sympathy quite a lot - in fact, it's the title of Martin's blog.  It's about understanding how the underlying hardware operates and programming in a way that works with it, not against it.

We get a number of comments and questions about the mysterious cache line padding in the RingBuffer, and I referred to it in the last post.  Since this lends itself to pretty pictures, it's the next thing I thought I'd tackle.

Comp Sci 101

One of the things I love about working at LMAX is that all that stuff I learnt at university and in my A Level Computing actually means something.  So often as a developer you can get away with not understanding the CPU, data structures or Big O notation - I spent 10 years of my career forgetting all that.  But it turns out that if you do know about these things, and you apply that knowledge, you can come up with some very clever, very fast code.

So, a refresher for those of us who studied this at school, and an intro for those who didn't.  Beware - this post contains massive over-simplifications.

The CPU is the heart of your machine and the thing that ultimately has to do all the operations, executing your program.  Main memory (RAM) is where your data (including the lines of your program) lives.  We're going to ignore stuff like hard drives and networks here because the Disruptor is aimed at running as much as possible in memory.

The CPU has several layers of cache between it and main memory, because even accessing main memory is too slow.  If you're doing the same operation on a piece of data multiple times, it makes sense to load this into a place very close to the CPU when it's performing the operation (think a loop counter - you don't want to be going off to main memory to fetch this to increment it every time you loop around).

The closer the cache is to the CPU, the faster it is and the smaller it is.  L1 cache is small and very fast, and right next to the core that uses it.  L2 is bigger and slower, and still only used by a single core.  L3 is more common with modern multi-core machines, and is bigger again, slower again, and shared across cores on a single socket.  Finally you have main memory, which is shared across all cores and all sockets.

When the CPU is performing an operation, it's first going to look in L1 for the data it needs, then L2, then L3, and finally if it's not in any of the caches the data needs to be fetched all the way from main memory.  The further it has to go, the longer the operation will take.  So if you're doing something very frequently, you want to make sure that data is in L1 cache.

Martin and Mike's QCon presentation gives some indicative figures for the cost of cache misses:

Latency from CPU to...                      Approx. number of CPU cycles   Approx. time in nanoseconds
Main memory                                                                ~60-80ns
QPI transit (between sockets, not drawn)                                   ~20ns
L3 cache                                    ~40-45 cycles                  ~15ns
L2 cache                                    ~10 cycles                     ~3ns
L1 cache                                    ~3-4 cycles                    ~1ns
Register                                    1 cycle

If you're aiming for an end-to-end latency of something like 10 milliseconds, an 80 nanosecond trip to main memory to get some missing data is going to take a serious chunk of that.

Cache lines

Now the interesting thing to note is that it's not individual items that get stored in the cache - i.e. it's not a single variable or a single pointer.  The cache is made up of cache lines, typically 64 bytes each, and a line effectively references a block of locations in main memory.  A Java long is 8 bytes, so in a single cache line you could have 8 long variables.
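
If you want to sanity-check that arithmetic, here's a trivial sketch.  Note the 64 bytes is an assumption - it's typical for x86, not a guarantee - and on Linux, getconf LEVEL1_DCACHE_LINESIZE will tell you what your machine actually uses:

    // How many Java longs fit in one (assumed) 64-byte cache line?
    public class CacheLineMath {
        public static void main(String[] args) {
            final int cacheLineBytes = 64;  // assumption: typical for x86
            final int longsPerLine = cacheLineBytes / Long.BYTES;  // a long is 8 bytes
            System.out.println(longsPerLine + " longs per cache line");  // prints 8
        }
    }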

(I'm going to ignore the multiple cache levels for simplicity)

This is brilliant if you're accessing an array of longs - when one value from the array gets loaded into the cache, you get up to 7 more for free.  So you can walk that array very quickly.  In fact, you can iterate over any data structure that is allocated to contiguous blocks in memory very quickly.  I made a passing reference to this in the very first post about the ring buffer, and it explains why we use an array for it.

So if items in your data structure aren't sat next to each other in memory (linked lists, I'm looking at you) you don't get the advantage of freebie cache loading.  You could be getting a cache miss for every item in that data structure.
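
If you want to see the difference for yourself, here's a rough sketch - deliberately unscientific (no warm-up, and the JIT will skew things), and entirely my own illustration rather than anything from the Disruptor - that walks the same values as a contiguous array and as a linked list:

    import java.util.LinkedList;
    import java.util.List;

    // Naive illustration: summing a long[] walks contiguous memory, so each
    // cache line loaded brings the next few values along for free; a
    // LinkedList chases pointers to nodes scattered around the heap.
    public class TraversalSketch {
        public static void main(String[] args) {
            final int size = 1_000_000;
            long[] array = new long[size];
            List<Long> list = new LinkedList<>();
            for (int i = 0; i < size; i++) {
                array[i] = i;
                list.add((long) i);
            }

            long start = System.nanoTime();
            long sum = 0;
            for (long value : array) {
                sum += value;
            }
            System.out.printf("array:      sum=%d, %dus%n", sum, (System.nanoTime() - start) / 1_000);

            start = System.nanoTime();
            sum = 0;
            for (long value : list) {
                sum += value;
            }
            System.out.printf("LinkedList: sum=%d, %dus%n", sum, (System.nanoTime() - start) / 1_000);
        }
    }

Expect the array walk to win comfortably on most hardware, but don't treat this as a proper benchmark - use a real benchmarking harness if you want numbers you can trust.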

However, there is a drawback to all this free loading.  Imagine your long isn't part of an array.  Imagine it's just a single variable.  Let's call it head, for no real reason.  Then imagine you have another variable in your class right next to it.  Let's arbitrarily call it tail.  Now, when you load head into your cache, you get tail for free.

Which sounds fine.  Until you realise that tail is being written to by your producer, and head is being written to by your consumer.  These two variables aren't actually closely associated, and in fact are going to be used by two different threads that might be running on two different cores.
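
In code, the problematic layout is as simple as this (a minimal sketch - the names head and tail come from the description above, not from the Disruptor's actual fields):

    // Two hot variables that will almost certainly be laid out next to each
    // other in memory, and so end up sharing a 64-byte cache line.
    public class SharedCounters {
        public volatile long head;  // written by the consumer
        public volatile long tail;  // written by the producer
    }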

Imagine your consumer updates the value of head.  The cache value is updated, the value in memory is updated, and any other cache lines that contain head are invalidated, because those caches no longer have the shiny new value.  And remember that we're dealing with the whole cache line here - we can't just mark head on its own as invalid.

Now if some process running on the other core just wants to read the value of tail, the whole cache line needs to be re-read from main memory.  So a thread which is nothing to do with your consumer is reading a value which is nothing to do with head, and it's slowed down by a cache miss.

Of course this is even worse if two separate threads are writing to the two different values. Both cores are going to be invalidating the cache line on the other core and having to re-read it every time the other thread has written to it. You've basically got write-contention between the two threads even though they're writing to two different variables.
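
Here's a rough harness (again my own sketch, not Disruptor code) that shows that write-contention in action - it inlines the same two-field layout as the SharedCounters sketch above, with each thread hammering its own field:

    // Demo of the contention described above: each write invalidates the
    // other core's copy of the shared cache line, so both threads stall.
    public class FalseSharingDemo {
        static class SharedCounters {
            volatile long head;  // written by the "consumer" thread
            volatile long tail;  // written by the "producer" thread
        }

        public static void main(String[] args) throws InterruptedException {
            final SharedCounters counters = new SharedCounters();
            final long iterations = 100_000_000L;

            Thread producer = new Thread(() -> {
                for (long i = 0; i < iterations; i++) {
                    counters.tail = i;
                }
            });
            Thread consumer = new Thread(() -> {
                for (long i = 0; i < iterations; i++) {
                    counters.head = i;
                }
            });

            long start = System.nanoTime();
            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
            System.out.printf("Took %dms%n", (System.nanoTime() - start) / 1_000_000);
        }
    }

Run it as-is and note the time; the fix in the next section should make a noticeable difference if you apply it to this toy class too.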

This is called false sharing, because every time you access head you get tail too, and every time you access tail, you get head as well.  All this is happening under the covers, and no compiler warning is going to tell you that you just wrote code that's going to be very inefficient for concurrent access.

Our solution - magic cache line padding

You'll see that the Disruptor eliminates this problem, at least for architectures with a cache line size of 64 bytes or less, by adding padding to ensure the ring buffer's sequence number is never in a cache line with anything else.

    public long p1, p2, p3, p4, p5, p6, p7;  // cache line padding
    private volatile long cursor = INITIAL_CURSOR_VALUE;
    public long p8, p9, p10, p11, p12, p13, p14;  // cache line padding

So there's no false sharing, no unintended contention with any other variables, no needless cache misses.

It's worth doing this on your Entry classes too - if you have different consumers writing to different fields, you're going to need to make sure there's no false sharing between each of the fields.
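
Something like this, say, if two consumers each own one field (the field names here are hypothetical, just to show the shape):

    // Each hot field gets a cache line's worth of padding around it, so
    // writes to one field can never invalidate the line holding the other.
    public class PaddedEntry {
        public long p1, p2, p3, p4, p5, p6, p7;         // cache line padding
        private volatile long fieldWrittenByConsumerA;
        public long p8, p9, p10, p11, p12, p13, p14;    // cache line padding
        private volatile long fieldWrittenByConsumerB;
        public long p15, p16, p17, p18, p19, p20, p21;  // cache line padding
    }

(A caveat if you try this on a newer JVM: some JVMs can optimise away fields that are never read, and from Java 8 onwards the JDK's @Contended annotation - enabled for your own classes with -XX:-RestrictContended - is a more robust way to get the same effect.)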

EDIT: Martin wrote a more technically correct and detailed post about false sharing, and posted performance results too.
