

Original article: http://mechanitis.blogspot.com/2011/08/dissecting-disruptor-why-its-so-fast.html
My recent slow-down in posting is because I've been trying to write a post explaining memory barriers and their applicability in the Disruptor. The problem is, no matter how much I read and no matter how many times I ask the ever-patient Martin and Mike questions trying to clarify some point, I just don't intuitively grasp the subject. I guess I don't have the deep background knowledge required to fully understand.
So, rather than make an idiot of myself trying to explain something I don't really get, I'm going to try and cover, at an abstract / massive-simplification level, what I do understand in the area. Martin has written a post going into memory barriers in some detail, so hopefully I can get away with skimming the subject.
Disclaimer: any errors in the explanation are completely my own, and no reflection on the implementation of the Disruptor or on the LMAX guys who actually do know about this stuff.
What's the point?
My main aim in this series of blog posts is to explain how the Disruptor works and, to a slightly lesser extent, why. In theory I should be able to provide a bridge between the code and the technical paper by talking about it from the point of view of a developer who might want to use it.
The paper mentioned memory barriers, and I wanted to understand what they were, and how they apply.
What's a Memory Barrier?
It's a CPU instruction. Yes, once again, we're thinking about CPU-level stuff in order to get the performance we need (Martin's famous Mechanical Sympathy). Basically it's an instruction to a) ensure the order in which certain operations are executed and b) influence visibility of some data (which might be the result of executing some instruction).
Compilers and CPUs can re-order instructions, provided the end result is the same, to try and optimise performance. Inserting a memory barrier tells the CPU and the compiler that what happened before that command needs to stay before that command, and what happens after needs to stay after. All similarities to a trip to Vegas are entirely in your own mind.

The other thing a memory barrier does is force an update of the various CPU caches - for example, a write barrier will flush all the data that was written before the barrier out to cache, therefore any other thread that tries to read that data will get the most up-to-date version regardless of which core or which socket it might be executing on.
What's this got to do with Java?
Now I know what you're thinking - this isn't assembler. It's Java.
The magic incantation here is the word volatile (something I felt was never clearly explained in the Java certification). If your field is volatile, the Java Memory Model inserts a write barrier instruction after you write to it, and a read barrier instruction before you read from it.

This means if you write to a volatile field, you know that:
1. Any thread accessing that field after the point at which you wrote to it will get the updated value.
2. Anything you did before you wrote that field is guaranteed to have happened and any updated data values will also be visible, because the memory barrier flushed all earlier writes to the cache.
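The two guarantees above can be sketched in a few lines of Java. This is a minimal illustration, not Disruptor code: the class `VolatilePublication` and its field and method names are made up for the example. A plain field is written first, then a volatile flag; a reader that sees the flag set is guaranteed by the Java Memory Model to see the earlier plain write too.

```java
final class VolatilePublication {
    private long payload;            // plain field, written BEFORE the volatile write
    private volatile boolean ready;  // the volatile write publishes everything above it

    void publish(long value) {
        payload = value;  // ordinary write
        ready = true;     // volatile write: earlier writes become visible to readers of 'ready'
    }

    long awaitPayload() {
        while (!ready) {          // volatile read on every spin
            Thread.onSpinWait();  // hint to the CPU that we are busy-waiting (Java 9+)
        }
        return payload;           // safe: ordered after the volatile read that saw 'true'
    }

    public static void main(String[] args) {
        VolatilePublication box = new VolatilePublication();
        new Thread(() -> box.publish(42L)).start();
        System.out.println(box.awaitPayload()); // prints 42
    }
}
```

If `ready` were not volatile, the reader could legally spin forever or see a stale `payload`.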
The RingBuffer cursor is one of these magic volatile thingies, and it's one of the reasons we can get away with implementing the Disruptor without locking.

A write to the volatile field (cursor) creates a memory barrier which ultimately brings all the caches up to date (or at least invalidates them accordingly).
So, if your downstream consumer (C2) sees that an earlier consumer (C1) reaches number 12, when C2 reads entries up to 12 from the ring buffer it will get all updates C1 made to the entries before it updated its sequence number.
Basically everything that happens after C2 gets the updated sequence number (shown in blue above) must occur after everything C1 did to the ring buffer before updating its sequence number (shown in black).
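The C1/C2 relationship can be sketched roughly as follows. This is not the Disruptor's implementation; `SequenceGatingSketch`, `runC1` and `runC2` are hypothetical names, and a plain `long[]` stands in for the ring buffer entries. C1 updates entries and then publishes its progress with one volatile write; C2 only reads entries once it has seen that sequence number.

```java
final class SequenceGatingSketch {
    static final int SIZE = 16;
    final long[] entries = new long[SIZE];  // stand-in for ring buffer entries (plain memory)
    volatile long c1Sequence = -1;          // last slot C1 has finished with

    // C1: update the entries, then publish progress with a single volatile write.
    void runC1(int upTo) {
        for (int i = 0; i <= upTo; i++) {
            entries[i] = i * 10L;           // plain writes
        }
        c1Sequence = upTo;                  // volatile write: everything above happens-before it
    }

    // C2: wait until C1 has reached 'upTo', then read the entries C1 touched.
    long runC2(int upTo) {
        while (c1Sequence < upTo) {         // volatile read
            Thread.onSpinWait();
        }
        long sum = 0;
        for (int i = 0; i <= upTo; i++) {
            sum += entries[i];              // guaranteed to see C1's updates
        }
        return sum;
    }

    public static void main(String[] args) {
        SequenceGatingSketch ring = new SequenceGatingSketch();
        new Thread(() -> ring.runC1(12)).start();
        System.out.println(ring.runC2(12)); // prints 780
    }
}
```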
Impact on performance
Memory barriers, being another CPU-level instruction, don't have the same cost as locks - the kernel isn't interfering and arbitrating between multiple threads. But nothing comes for free. Memory barriers do have a cost - the compiler/CPU cannot re-order instructions, which could potentially lead to not using the CPU as efficiently as possible, and refreshing the caches obviously has a performance impact. So don't think that using volatile instead of locking will get you away scot free.
You'll notice that the Disruptor implementation tries to read from and write to the sequence number as infrequently as possible. Every read or write of a volatile field is a relatively costly operation. However, recognising this also plays in quite nicely with batching behaviour - if you know you shouldn't read from or write to the sequences too frequently, it makes sense to grab a whole batch of Entries and process them before updating the sequence number, both on the Producer and Consumer side. Here's an example from BatchConsumer:
    long nextSequence = sequence + 1;   // read of the volatile sequence field
    while (running)
    {
        try
        {
            final long availableSequence = consumerBarrier.waitFor(nextSequence);
            while (nextSequence <= availableSequence)
            {
                entry = consumerBarrier.getEntry(nextSequence);
                handler.onAvailable(entry);
                nextSequence++;         // local variable only, no volatile access
            }
            handler.onEndOfBatch();
            sequence = entry.getSequence();   // one volatile write per batch
        }
        ...
        catch (final Exception ex)
        {
            exceptionHandler.handle(ex, entry);
            sequence = entry.getSequence();   // volatile write
            nextSequence = entry.getSequence() + 1;
        }
    }
(You'll note this is the "old" code and naming conventions. Because this is in line with my previous blog posts, I thought it was slightly less confusing than switching straight to the new conventions.)
In the code above, we use a local variable to increment during our loop over the entries the consumer is processing. This means we read from and write to the volatile sequence field (the assignments to sequence, shown in bold in the original post) as infrequently as we can get away with.
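Since the BatchConsumer excerpt above is not runnable on its own, here is a self-contained sketch of the same idea. It is not the real BatchConsumer: `BatchSketch`, `consumeBatch` and the `volatileWrites` counter are invented for illustration, and the "handler" is reduced to a counter. The point is only that the loop runs on a local variable and the volatile field is written once per batch.

```java
final class BatchSketch {
    private volatile long sequence = -1;  // in a real system other threads would read this
    int volatileWrites = 0;               // demo bookkeeping only, not part of the technique
    long processed = 0;                   // stand-in for the handler's work

    // Process every entry up to and including availableSequence as one batch.
    void consumeBatch(long availableSequence) {
        long next = sequence + 1;         // single volatile read for the whole batch
        while (next <= availableSequence) {
            processed++;                  // "handle" the entry
            next++;                       // local variable: no memory barrier involved
        }
        sequence = next - 1;              // single volatile write for the whole batch
        volatileWrites++;
    }

    long sequence() {
        return sequence;
    }

    public static void main(String[] args) {
        BatchSketch consumer = new BatchSketch();
        consumer.consumeBatch(3);         // entries 0..3
        consumer.consumeBatch(9);         // entries 4..9
        System.out.println(consumer.processed + " entries, "
                + consumer.volatileWrites + " volatile writes"); // prints "10 entries, 2 volatile writes"
    }
}
```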
In Summary
Memory barriers are CPU instructions that allow you to make certain assumptions about when data will be visible to other processes. In Java, you implement them with the volatile keyword. Using volatile means you don't necessarily have to add locks willy nilly, and will give you performance improvements over using them. However you need to think a little more carefully about your design, in particular how frequently you use volatile fields, and how frequently you read and write them.
PS Given that the New World Order in the Disruptor uses totally different naming conventions now to everything I've blogged about so far, I guess the next post is mapping the old world to the new one.
Original article: http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast_22.html
We mention the phrase Mechanical Sympathy quite a lot, in fact it's even Martin's blog title. It's about understanding how the underlying hardware operates and programming in a way that works with that, not against it.
We get a number of comments and questions about the mysterious cache line padding in the RingBuffer, and I referred to it in the last post. Since this lends itself to pretty pictures, it's the next thing I thought I would tackle.
Comp Sci 101
One of the things I love about working at LMAX is all that stuff I learnt at university and in my A Level Computing actually means something. So often as a developer you can get away with not understanding the CPU, data structures or Big O notation - I spent 10 years of my career forgetting all that. But it turns out that if you do know about these things, and you apply that knowledge, you can come up with some very clever, very fast code.
So, a refresher for those of us who studied this at school, and an intro for those who didn't. Beware - this post contains massive over-simplifications.
The CPU is the heart of your machine and the thing that ultimately has to do all the operations, executing your program. Main memory (RAM) is where your data (including the lines of your program) lives. We're going to ignore stuff like hard drives and networks here because the Disruptor is aimed at running as much as possible in memory.
The CPU has several layers of cache between it and main memory, because even accessing main memory is too slow. If you're doing the same operation on a piece of data multiple times, it makes sense to load this into a place very close to the CPU when it's performing the operation (think a loop counter - you don't want to be going off to main memory to fetch this to increment it every time you loop around).

The closer the cache is to the CPU, the faster it is and the smaller it is. L1 cache is small and very fast, and right next to the core that uses it. L2 is bigger and slower, and still only used by a single core. L3 is more common with modern multi-core machines, and is bigger again, slower again, and shared across cores on a single socket. Finally you have main memory, which is shared across all cores and all sockets.
When the CPU is performing an operation, it's first going to look in L1 for the data it needs, then L2, then L3, and finally if it's not in any of the caches the data needs to be fetched all the way from main memory. The further it has to go, the longer the operation will take. So if you're doing something very frequently, you want to make sure that data is in L1 cache.
Martin and Mike's QCon presentation gives some indicative figures for the cost of cache misses:
| Latency from CPU to... | Approx. number of CPU cycles | Approx. time in nanoseconds |
| --- | --- | --- |
| Main memory | | ~60-80ns |
| QPI transit (between sockets, not drawn) | | ~20ns |
| L3 cache | ~40-45 cycles | ~15ns |
| L2 cache | ~10 cycles | ~3ns |
| L1 cache | ~3-4 cycles | ~1ns |
| Register | 1 cycle | |
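Now the interesting thing to note is that it's not individual items that get stored in the cache - i.e. it's not a single variable, a single pointer. The cache is made up of cache lines, typically 64 bytes, and each one effectively references a location in main memory. A Java long is 8 bytes, so in a single cache line you could have 8 long variables.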
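The arithmetic is simple enough to write down. The sketch below assumes a 64-byte cache line (typical, not universal) and, for `lineOf`, that the array's data happens to start on a cache line boundary, which the JVM does not promise. `CacheLineMath` and its methods are hypothetical helper names.

```java
final class CacheLineMath {
    static final int CACHE_LINE_BYTES = 64;  // assumption: common on current x86 hardware

    // How many 8-byte longs fit in one cache line.
    static int longsPerLine() {
        return CACHE_LINE_BYTES / Long.BYTES;
    }

    // Which cache line (counted from the start of the data) element 'index' of a long[] falls in,
    // assuming the data starts exactly on a line boundary.
    static int lineOf(int index) {
        return (index * Long.BYTES) / CACHE_LINE_BYTES;
    }

    static boolean sameLine(int indexA, int indexB) {
        return lineOf(indexA) == lineOf(indexB);
    }

    public static void main(String[] args) {
        System.out.println(longsPerLine());  // prints 8
        System.out.println(sameLine(0, 7));  // prints true
        System.out.println(sameLine(7, 8));  // prints false
    }
}
```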

There's a drawback to all this free loading, though. Imagine your long isn't part of an array. Imagine it's just a single variable. Let's call it head, for no real reason. Then imagine you have another variable in your class right next to it. Let's arbitrarily call it tail. Now, when you load head into your cache, you get tail for free.

Which sounds fine. Until you realise that tail is being written to by your producer, and head is being written to by your consumer. These two variables aren't actually closely associated, and in fact are going to be used by two different threads that might be running on two different cores.

Imagine your consumer updates the value of head. The cache value is updated, the value in memory is updated, and any other cache lines that contain head are invalidated because other caches will not have the shiny new value. And remember that we deal with the level of the whole line, we can't just mark head as being invalid.

Now if a thread running on another core just wants to read the value of tail, the whole cache line needs to be re-read from main memory. So a thread which is nothing to do with your consumer is reading a value which is nothing to do with head, and it's slowed down by a cache miss.

This is false sharing: every time you access head you get tail too, and every time you access tail, you get head as well. All this is happening under the covers, and no compiler warning is going to tell you that you just wrote code that's going to be very inefficient for concurrent access.

You'll see that the Disruptor eliminates this problem, at least for architectures with a cache line size of 64 bytes or less, by adding padding to ensure the ring buffer's sequence number is never in a cache line with anything else.
    public long p1, p2, p3, p4, p5, p6, p7; // cache line padding
    private volatile long cursor = INITIAL_CURSOR_VALUE;
    public long p8, p9, p10, p11, p12, p13, p14; // cache line padding
It's worth doing this in your Entry classes too - if you have different consumers writing to different fields, you're going to need to make sure there's no false sharing between each of the fields.
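To see the effect for yourself, a rough experiment along these lines can help. Note that this swaps in a different technique from the field padding shown above: instead of padding fields (whose layout is ultimately up to the JVM), it spaces two counters 8 slots apart in an `AtomicLongArray`, so they are 64 bytes apart and cannot share a 64-byte line. `FalseSharingSketch` and its method names are invented for the example, and the timings are indicative only; this is not a proper benchmark.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class FalseSharingSketch {
    static final long ITERATIONS = 2_000_000L;

    // One thread repeatedly updates its own slot.
    static void hammer(AtomicLongArray slots, int index) {
        for (long i = 0; i < ITERATIONS; i++) {
            slots.incrementAndGet(index);
        }
    }

    // Runs two writer threads on two different slots and returns the elapsed nanoseconds.
    static long timeTwoWriters(AtomicLongArray slots, int indexA, int indexB) {
        Thread a = new Thread(() -> hammer(slots, indexA));
        Thread b = new Thread(() -> hammer(slots, indexB));
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        AtomicLongArray slots = new AtomicLongArray(32);
        long adjacent = timeTwoWriters(slots, 0, 1);   // 8 bytes apart: very likely the same line
        long spaced = timeTwoWriters(slots, 8, 16);    // 64 bytes apart: different 64-byte lines
        System.out.println("adjacent slots: " + adjacent / 1_000_000 + " ms");
        System.out.println("spaced slots:   " + spaced / 1_000_000 + " ms");
    }
}
```

On a multi-core machine the adjacent run is usually noticeably slower, but the gap depends on the hardware, the JVM and what else is running, so treat any single run with suspicion.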
EDIT: Martin wrote a more technically correct and detailed post about false sharing, and posted performance results too.




