What We Talk About When We Talk About Distributed Systems
For quite some time now I’ve been trying to learn about distributed systems, and it’s fair to say that once you start digging, there seems to be no end to it: the rabbit hole goes on and on. The literature on distributed systems is quite extensive, with lots of papers coming from different universities, plus quite a few books to choose from. For a total beginner like me, it proved hard to decide which paper to read or which book to buy.
At the same time, I’ve found that several bloggers recommend this or that paper that one must know in order to become a distributed systems engineer (whatever that means). So the list of things to read grows: FLP, Zab, Time, Clocks and the Ordering of Events in a Distributed System, Viewstamped Replication, Paxos, Chubby, and on and on. My problem is that many times I don’t see a justification for why I should read this or that paper. I love the idea of learning for knowledge’s sake, to satisfy curiosity, but at the same time one needs to prioritise what to read, since the day only has 24 hours.
Apart from the abundance of papers and research material mentioned above, there are also lots of books. Having bought quite a few of them and read chapters here and there, I started to see that a book with a promising title could be quite unrelated to what I was looking for, or that its content didn’t directly target the problems I would have liked to solve. Therefore, I would like to go over what I think are the main concepts of distributed systems, citing papers, books or resources where one can learn about them.
As I’m continuously learning while writing these words, please have some patience, expect some mistakes, and be aware that I will try to expand whatever I end up writing here.
Before we start, I must tell you that I have presented this blog post at various conferences, so here are the slides if you are interested:
And here’s a video of when I presented this talk at the Erlang User Conference in Stockholm:
Let’s start with the article:
Distributed systems algorithms can be classified according to different kinds of attributes. Some classifications are: the timing model; the kind of interprocess communication used; the failure model assumed for the algorithm; and many others, as we will see.
Here are the main concepts we will see:
- Timing Model
- Interprocess Communication
- Failure Modes
- Failure Detectors
- Leader Election
- Consensus
- Quorums
- Time In Distributed Systems
- A Quick Look At FLP
Timing Model

Here we have the synchronous model, the asynchronous model and the partially synchronous model.
The synchronous model is the simplest one to use; here components take steps simultaneously, in what are called synchronous rounds. The time for a message to be delivered is usually known, and we can also assume the speed of each process, i.e. how long it takes for a process to execute one step of the algorithm. The problem with this model is that it doesn’t reflect reality very well, even less in a distributed system context, where one can send a message to another process and only wish that the stars align so the message arrives at said process. The good thing is that using this model it is possible to achieve theoretical results that later can be translated to other models. For example, due to the guarantees this model provides about time, if a problem cannot be solved under these timing guarantees, then it would probably be impossible to solve once we relax them (think of a Perfect Failure Detector, for instance).
The asynchronous model gets a bit more complex. Here components can take steps in whatever order they choose, and they don’t offer any guarantee about the speed at which they will take such steps. One problem with this model is that while it can be simple to describe and closer to reality, it still doesn’t reflect it properly. For example, a process might take infinitely long to respond to a request, but in a real project we would probably impose a timeout on said request, and once the timeout expires we would abort the request. A difficulty that comes with this model is how to assure the liveness condition of a process. One of the most famous impossibility results, the “Impossibility of Distributed Consensus with One Faulty Process”, belongs to this timing model, where it is not possible to detect if a process has crashed or if the process is just taking an infinitely long time to reply to a message.
In the partially synchronous model, components have some information about timing: they have access to almost-synchronised clocks, or they might have approximations of how long messages take to be delivered, or of how long it takes a process to execute a step.
The book Distributed Algorithms by Nancy Lynch is actually organised in sections based on these timing models.
Interprocess Communication

Here we need to think about how processes in the system exchange information. They can do it by sending messages to each other, in the message passing model, or by using the shared memory model, where they share data by accessing shared variables.
One thing to keep in mind is that we can use a message passing algorithm to build a distributed shared memory object. A common example in books is the implementation of a read/write register. We also have queues and stacks, which are used by some authors to describe consistency properties, like linearizability. We should not confuse shared memory as a way to share data between processes by accessing a shared variable with shared memory abstractions built on top of message passing, like the ones just mentioned.
Back to the message passing model, we have another abstraction to consider when trying to understand algorithms: the kind of link used between processes (think of channels used to send messages back and forth between processes). These links offer certain guarantees to the algorithm using them. For example, there’s the Perfect Links abstraction, which has reliable delivery and sends no duplicates; this abstraction assures exactly-once delivery. We can easily see that this abstraction doesn’t reflect the real world either, so there are other kinds of link abstractions used by algorithm designers when they try to design models that are closer to real systems. Keep in mind that even if the Perfect Links abstraction is not so realistic, it can still be useful: for example, if we can prove a problem is impossible to solve even assuming perfect links, then we know a whole bunch of related problems might not be solvable either. On the topic of links, authors usually consider or assume FIFO message ordering, like in Zab.
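To make the link abstraction concrete, here is a minimal Python sketch of the deduplication half of perfect links. The names (`PerfectLink`, `on_receive`) are illustrative assumptions, not from any library, and reliable delivery itself (retransmission plus acknowledgements) is left out: the sketch only shows how tagging each message with a unique id lets the receiver deliver it to the application at most once.

```python
class PerfectLink:
    """Sketch of the no-duplicates half of a perfect link."""

    def __init__(self, deliver_callback):
        self.deliver = deliver_callback
        self.seen = set()  # ids of messages already delivered

    def on_receive(self, msg_id, payload):
        # Drop retransmitted duplicates so the upper layer
        # sees each message id at most once.
        if msg_id in self.seen:
            return
        self.seen.add(msg_id)
        self.deliver(payload)

delivered = []
link = PerfectLink(delivered.append)
link.on_receive(1, "hello")
link.on_receive(1, "hello")  # duplicate: suppressed
link.on_receive(2, "world")
# delivered == ["hello", "world"]
```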
Failure Modes

I already wrote an article about failure modes in distributed systems, but it is worth reiterating here. A property of a distributed system model is what kind of process failures are assumed. In the crash-stop failure mode, a process is assumed to be correct until it crashes. Once it crashes, it never recovers. There’s also the crash-recovery model, where processes can recover after a fault; in this case, some algorithms also include a way for the process to recover the state it had before crashing. This can be done either by reading from persistent storage or by communicating with other processes in a group. It’s worth noting that for some group membership algorithms, a process that crashes and then recovers might not be considered the same process that was alive before. This usually depends on whether we have dynamic groups or fixed groups.
There are also failure modes where processes fail to receive or send messages; these are called omission failure modes. There are different kinds of omissions as well: a process can fail to receive a message, or fail to send one. Why does this matter? Imagine the scenario where a group of processes implements a distributed cache. If a process is failing to reply to requests from other processes in the group, even though it is able to receive requests from them, that process will still have its state up to date, which means it can reply to read requests from clients.
A more complex failure mode is the one called Byzantine or arbitrary failure mode, where processes can send wrong information to their peers; they can impersonate other processes; they can reply to other processes with the correct data while garbling their local database contents; and more.
When thinking about the design of a system, we should consider which kinds of process failures we want to cope with. Birman (see Guide to Reliable Distributed Systems) argues that usually we don’t need to cope with Byzantine failures. He cites work done at Yahoo! where they concluded that crash failures are way more common than Byzantine failures.
Failure Detectors

Depending on the process failure mode and timing assumptions, we can construct abstractions that take care of reporting to the system whether a process has crashed, or whether it is suspected to have crashed. There are Perfect Failure Detectors that never give a false positive. Given a crash-stop failure mode plus a synchronous system, we can implement this algorithm just by using timeouts. If we ask processes to periodically ping back to the failure detector process, we know exactly when a ping should arrive at the failure detector (due to the synchronous model guarantees). If the ping doesn’t arrive after a certain configurable timeout, then we can assume the other node has crashed.
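As a sketch of the idea, here is a toy perfect failure detector in Python for the crash-stop, synchronous setting. The names and the timeout value are illustrative; the key point is that under synchronous guarantees a missed heartbeat deadline is proof of a crash, not merely a suspicion.

```python
class PerfectFailureDetector:
    """Toy detector for the synchronous, crash-stop model."""

    def __init__(self, processes, timeout):
        self.timeout = timeout
        self.last_heartbeat = {p: 0 for p in processes}
        self.crashed = set()

    def heartbeat(self, process, now):
        # A process pinged back; record when.
        self.last_heartbeat[process] = now

    def check(self, now):
        # In the synchronous model, a heartbeat older than
        # `timeout` can only mean the process has crashed.
        for p, t in self.last_heartbeat.items():
            if now - t > self.timeout:
                self.crashed.add(p)
        return self.crashed

fd = PerfectFailureDetector(["p", "q"], timeout=5)
fd.heartbeat("p", now=8)   # p pinged recently
fd.heartbeat("q", now=1)   # q's last ping was long ago
# at time 10, only q has missed its deadline: check(10) == {"q"}
```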
In a more realistic system, it might not be possible to always assume the time needed for a message to reach its destination, or how long it will take a process to execute a step. In this case we can have a failure detector p that reports a process q as suspected if q doesn’t reply after a timeout of N milliseconds. If q later replies, then p will remove q from the list of suspected processes and increase N, since it doesn’t know what the actual network delay between itself and q is, but it wants to stop suspecting q of having crashed: q was alive, it just took longer than N to ping back. If at some point q does crash, p will first suspect it has crashed, and it will never revise its judgement (since q will never ping back). A better description of this algorithm can be found in Introduction to Reliable and Secure Distributed Programming under the name “Eventually Perfect Failure Detector”.
Failure detectors usually offer two properties: completeness and accuracy. For the Eventually Perfect Failure Detector type, we have the following:
- Strong Completeness: Eventually, every process that crashes is permanently suspected by every correct process.
- Eventual Strong Accuracy: Eventually, no correct process is suspected by any correct process.
Failure detectors have been crucial in solving consensus in the asynchronous model. There’s a quite famous impossibility result presented in the FLP paper mentioned above. This paper talks about the impossibility of consensus in asynchronous distributed systems where one process might fail. One way to get around this impossibility result is to introduce a failure detector that can circumvent the problem.
Leader Election

Related to the problem of failure detectors is that of doing the opposite: determining which process hasn’t crashed and is therefore working properly. This process will then be trusted by other peers in the network and will be considered the leader that can coordinate some distributed actions. This is the case for protocols like Raft or Zab that depend on a leader to coordinate actions.
Having a leader in a protocol introduces asymmetry between nodes, since non-leader nodes will then be followers. A consequence is that the leader node can end up being a bottleneck for many operations, so depending on the problem we are trying to solve, using a protocol that requires leader election might not be what we want. Note that most protocols that achieve consistency via some sort of consensus use a leader process and a set of followers. See Paxos, Zab or Raft for some examples.
Consensus

The consensus or agreement problem was first introduced in the paper Reaching Agreement in the Presence of Faults by Pease, Shostak and Lamport. There they introduced the problem like this:
Fault-tolerant systems often require a means by which independent processors or processes can arrive at an exact mutual agreement of some kind. It may be necessary, for example, for the processors of a redundant system to synchronise their internal clocks periodically. Or they may have to settle upon a value of a time-varying input sensor that gives each of them a slightly different reading.
So consensus is the problem of reaching agreement among independent processes. These processes propose values for a certain problem, like the current reading of their sensor, and then agree on a common action based on the proposed values. For example, a car might have various sensors providing it with information about the brakes’ temperature levels. These readings might have some variation depending on the precision of each sensor and so on, but the ABS computer needs to agree on how much pressure it should apply to the brakes. That’s a consensus problem being solved in our everyday lives. The book Fault-Tolerant Real-Time Systems explains consensus and other problems in distributed systems in the context of the automotive industry.
A process that implements some form of consensus works by exposing an API with propose and decide functions. A process will propose a certain value when consensus starts, and then it will have to decide on a value based on the values that were proposed in the system. These algorithms must satisfy certain properties: Termination, Validity, Integrity and Agreement. For example, for Regular Consensus we have:
- Termination: Every correct process eventually decides some value.
- Validity: If a process decides v, then v was proposed by some process.
- Integrity: No process decides twice.
- Agreement: No two correct processes decide differently.
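These properties can be checked mechanically against a recorded execution. The following Python sketch is not a consensus algorithm; it is only a validator for the safety properties (validity, integrity, agreement) over a hypothetical trace format, which is an assumption made for illustration.

```python
def check_consensus_trace(proposals, decisions):
    """proposals: {process: proposed value};
    decisions: {process: list of values it decided}."""
    for p, decided in decisions.items():
        # Integrity: no process decides twice.
        if len(decided) > 1:
            return False
        # Validity: a decided value was proposed by some process.
        if decided and decided[0] not in proposals.values():
            return False
    # Agreement: no two (correct) processes decide differently.
    decided_values = {d[0] for d in decisions.values() if d}
    return len(decided_values) <= 1

good = check_consensus_trace({"p": 1, "q": 2}, {"p": [2], "q": [2]})
bad = check_consensus_trace({"p": 1, "q": 2}, {"p": [1], "q": [2]})
# good is True; bad violates agreement, so it is False
```

Note that Termination is a liveness property: it cannot be checked against a finite trace like this, only the safety properties can.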
For more details on consensus, please consult the original paper mentioned above. The following books are also a great reference:
- Introduction to Reliable and Secure Distributed Programming, Chapter 5.
- Fault-tolerant Agreement in Synchronous Message-passing Systems.
- Communication and Agreement Abstractions for Fault-tolerant Asynchronous Distributed Systems.
Quorums

Quorums are a tool for designing fault-tolerant distributed systems. Quorums refer to intersecting sets of processes that can be used to understand the characteristics of a system when some processes might fail.
For example, if we have an algorithm where N processes have crash-failure modes, we have a quorum of processes whenever a majority of processes apply a certain operation to the system, for example a write to the database. If a minority of processes crash, that is, N/2 - 1 process crashes, we still have a majority of processes that know about the last operation applied to the system. For example, Raft uses majorities when committing logs to the system. The leader will apply an entry to its state machine as soon as half the servers in the cluster have replied to its log-replication request. The leader plus half the servers constitute a majority. This has the advantage that Raft doesn’t need to wait for the whole cluster to reply to a log-replication RPC request.
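The arithmetic behind majority quorums is small enough to write down. A sketch in Python (the cluster sizes are just examples):

```python
def majority(n):
    # Smallest set size guaranteed to intersect any other
    # set of the same size in an n-node cluster.
    return n // 2 + 1

def crash_tolerance(n):
    # How many crash failures still leave a majority alive.
    return n - majority(n)

# A 5-node cluster commits with 3 acknowledgements and
# tolerates 2 crashes; two majorities always overlap (3 + 3 > 5).
```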
Another example: let’s say we want to limit access to a shared resource to one process at a time. This resource is guarded by a set S of processes. Whenever a process p wants to access the resource, it first needs to ask permission from a majority of the processes in S. A majority of processes in S grant access to the resource to p. Now a process q arrives in the system and tries to access the shared resource. No matter which processes it contacts in S, q will never reach a majority of processes that will grant it access to the shared resource until the resource is freed by p. See The Load, Capacity, and Availability of Quorum Systems for more details.
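Here is a minimal Python sketch of that scenario, under assumed names (`Guard`, `acquire`): each guard grants its vote to at most one client at a time, and a client considers the resource acquired only with grants from a majority, so two clients can never both hold it. A real protocol would also release partial grants on failure and cope with faulty guards, which this sketch ignores.

```python
class Guard:
    """One of the processes in the set S guarding the resource."""

    def __init__(self):
        self.granted_to = None  # client holding this guard's vote

    def request(self, client):
        # Grant only if no other client holds this guard's vote.
        if self.granted_to is None:
            self.granted_to = client
            return True
        return False

    def release(self, client):
        if self.granted_to == client:
            self.granted_to = None

def acquire(guards, client):
    # The client holds the resource only with a majority of grants.
    grants = sum(g.request(client) for g in guards)
    return grants > len(guards) // 2
```

With five guards, a first client p collects all five votes; a second client q then cannot reach three grants until p releases.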
Quorums don’t always refer to a majority of processes. Sometimes even more processes are needed to form a quorum for an operation to succeed, as in the case of a group of N processes that can suffer Byzantine failures. In this case, if f is the number of tolerated process failures, a quorum will be a set of more than (N + f) / 2 processes. See Introduction to Reliable and Secure Distributed Programming.
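The size requirement can be captured in a couple of Python functions. The example numbers are assumptions; the point is that two quorums of more than (N + f) / 2 processes overlap in more than f processes, so their intersection always contains at least one correct process.

```python
def byzantine_quorum(n, f):
    # Smallest integer strictly greater than (n + f) / 2.
    return (n + f) // 2 + 1

def min_overlap(n, f):
    # Two quorums of size q drawn from n processes share
    # at least 2q - n members.
    q = byzantine_quorum(n, f)
    return 2 * q - n

# With n = 10 and f = 3: the quorum size is 7, and any two quorums
# overlap in at least 4 processes, more than the f = 3 that could be
# Byzantine, so the overlap contains at least one correct process.
```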
If you are interested in this topic, there’s a whole book dedicated to quorums in distributed systems: Quorum Systems: With Applications to Storage and Consensus by Marko Vukolic.
Time in Distributed Systems
Understanding time and its consequences is one of the biggest problems in distributed systems. We are used to events in our life happening one after the other, with a perfectly defined happened-before order, but when we have a series of distributed processes exchanging messages, accessing resources concurrently, and so on, how can we tell which process event happened before another? To be able to answer these kinds of questions, processes would need to share a synchronised clock and know exactly how long it takes for electrons to move around the network, for CPUs to schedule tasks, and so on. Clearly this is not quite possible in a real-world system.
The seminal paper that discusses these issues is called Time, Clocks, and the Ordering of Events in a Distributed System, which introduced the concept of logical clocks. Logical clocks are a way of assigning a number to an event in the system; said numbers are not related to the actual passage of time, but to the processing of events by a node in a distributed system.
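Lamport’s rules are compact enough to sketch directly. A minimal Python version (the names are illustrative): increment the clock on every local or send event, and on receipt take the maximum of the local clock and the message’s timestamp before incrementing, so a receive is always ordered after the corresponding send.

```python
class LamportClock:
    """Minimal logical clock following Lamport's rules."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # A send is an event; its timestamp travels with the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Jump past the sender's timestamp, then count the receive.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # a's clock ticks to 1
t_recv = b.receive(t_send)  # b's clock becomes max(0, 1) + 1 == 2
# t_recv > t_send: the receive is ordered after the send
```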
For a very interesting discussion on time in distributed systems, I recommend reading the article There Is No Now by Justin Sheehy.
I would claim that time and its problems in distributed systems is one of the crucial concepts to understand. The idea of simultaneity is something we have to let go of. This is related to the old belief in “Absolute Knowledge”, when we used to think that such a thing as absolute knowledge was attainable. The laws of physics show us that even light requires some time to get from one place to another, so by the time it reaches our eyes and is processed by our brains, whatever the light is communicating is an old view of the world. This idea is discussed by Umberto Eco in the book Inventing the Enemy, in the chapter “Absolute and Relative”.
A Quick Look At FLP
To finalise this article, let’s take a quick look at the Impossibility of Distributed Consensus with One Faulty Process paper to try to relate the concepts we have just learned about distributed systems.
The abstract starts like this:
The consensus problem involves an asynchronous system of processes, some of which may be unreliable.
So we have an asynchronous system, where no timing assumptions are made, either about processing speed or about the time required for messages to reach other processes. We also know that some of these processes may crash.
The issue here is that in usual technical jargon, asynchronous might refer to a way of processing requests, like RPC for example, where a process p sends an asynchronous request to process q, and while q is processing the request, p keeps doing other things, that is: p doesn’t block waiting for a reply. We can see that this definition is completely different from the one used in the distributed systems literature, so without this knowledge, it’s quite hard to fully understand the meaning of even the first sentence of the FLP paper.
Later in the paper they say:
In this paper, we show the surprising result that no completely asynchronous consensus protocol can tolerate even a single unannounced process death. We do not consider Byzantine failures, and we assume that the message system is reliable—it delivers all messages correctly and exactly once.
So the paper only considers the crash-stop failure mode discussed above (sometimes called fail-stop). We can also see that there are no omission failures, since the message system is reliable.
And finally they also add this constraint:
Finally, we do not postulate the ability to detect the death of a process, so it is impossible for one process to tell whether another has died (stopped entirely) or is just running very slowly.
So we can’t use failure detectors either.
To recap, this means that the FLP impossibility result applies to asynchronous systems with fail-stop processes, with access to a reliable message system, and where detecting the death of a process is not possible. Without knowing the theory behind the different models of distributed systems, we might miss many of these details, or interpret them in a totally different way from what the authors meant.
For a more detailed overview of FLP, please take a look at this blog post: A Brief Tour of FLP Impossibility.
Also, it is interesting to read the paper Stumbling over Consensus Research: Misunderstandings and Issues by Marcos Aguilera, which discusses what it means for FLP to be an impossibility result for distributed systems (spoiler alert: it is not the same level of impossibility as the halting problem).
As you can see, learning about distributed systems takes time. It’s a very vast topic, with tons of research on each of its sub-areas. At the same time, implementing and verifying distributed systems is also quite complex. There are many subtle places where a mistake can leave our implementation totally broken under unexpected circumstances.
What if we choose the wrong quorum and then our new fancy replication algorithm loses critical data? Or we choose a very conservative quorum that slows down our application needlessly, making us break SLAs with customers? What if the problem we are trying to solve doesn’t need consensus at all and we can live with eventual consistency? Perhaps our system has the wrong timing assumptions? Or it uses a failure detector unfit for the underlying system properties? What if we decide to optimise an algorithm like Raft, by avoiding a small step here or there, and end up breaking its safety guarantees? All these things and many more can happen if we don’t understand the underlying theory of distributed systems.
OK, I get it, I won’t reinvent the distributed systems wheel, but with such a vast literature and set of problems, where to start, then? As stated at the top of this article, I think randomly reading papers will get you nowhere, as shown with the FLP paper, where understanding the first sentence requires you to know about the various timing models. Therefore I recommend the following books in order to get started:
Distributed Algorithms by Nancy Lynch. This book is kinda the bible of distributed systems. It covers the various models cited above, with sections containing algorithms for each of them.
Introduction to Reliable and Secure Distributed Programming by Christian Cachin et al. Besides being a very good introduction, it covers many kinds of consensus algorithms. The book is full of pseudo-code explaining the algorithms, which is a good thing to have.
Of course there are many more books, but I think these two are a good start. If you feel you need to dive deeper, here’s the list of resources used in this article:
- Marcos K. Aguilera. 2010. Stumbling over consensus research: misunderstandings and issues. In Replication, Bernadette Charron-Bost, Fernando Pedone, and André Schiper (Eds.). Springer-Verlag, Berlin, Heidelberg 59-72.
- Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte. 2008. Interval Tree Clocks. In Proceedings of the 12th International Conference on Principles of Distributed Systems (OPODIS ‘08), Theodore P. Baker, Alain Bui, and Sébastien Tixeuil (Eds.). Springer-Verlag, Berlin, Heidelberg, 259-274.
- Kenneth P. Birman. 2012. Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services. Springer Publishing Company, Incorporated.
- Mike Burrows. 2006. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th symposium on Operating systems design and implementation (OSDI ‘06). USENIX Association, Berkeley, CA, USA, 335-350.
- Christian Cachin, Rachid Guerraoui, and Luis Rodrigues. 2014. Introduction to Reliable and Secure Distributed Programming (2nd ed.). Springer Publishing Company, Incorporated.
- Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (March 1996), 225-267.
- Umberto Eco. 2013. Inventing the Enemy: Essays. Mariner Books.
- Colin J. Fidge. 1988. Timestamps in message-passing systems that preserve the partial ordering. Proceedings of the 11th Australian Computer Science Conference 10 (1), 56–66.
- Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1983. Impossibility of distributed consensus with one faulty process. In Proceedings of the 2nd ACM SIGACT-SIGMOD symposium on Principles of database systems (PODS ‘83). ACM, New York, NY, USA, 1-7.
- Maurice P. Herlihy and Jeannette M. Wing. 1990. Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12, 3 (July 1990), 463-492.
- Leslie Lamport. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558-565.
- Leslie Lamport. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133-169.
- Nancy A. Lynch. 1996. Distributed Algorithms. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
- Moni Naor and Avishai Wool. 1998. The Load, Capacity, and Availability of Quorum Systems. SIAM J. Comput. 27, 2 (April 1998), 423-447.
- Brian M. Oki and Barbara H. Liskov. 1988. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. In Proceedings of the seventh annual ACM Symposium on Principles of distributed computing (PODC ‘88). ACM, New York, NY, USA, 8-17.
- Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC’14), Garth Gibson and Nickolai Zeldovich (Eds.). USENIX Association, Berkeley, CA, USA, 305-320.
- M. Pease, R. Shostak, and L. Lamport. 1980. Reaching Agreement in the Presence of Faults. J. ACM 27, 2 (April 1980), 228-234.
- Stefan Poledna. 1996. Fault-Tolerant Real-Time Systems: The Problem of Replica Determinism. Kluwer Academic Publishers, Norwell, MA, USA.
- Michel Raynal. 2010. Communication and Agreement Abstractions for Fault-Tolerant Asynchronous Distributed Systems (1st ed.). Morgan and Claypool Publishers.
- Michel Raynal. 2010. Fault-tolerant Agreement in Synchronous Message-passing Systems (1st ed.). Morgan and Claypool Publishers.
- Benjamin Reed and Flavio P. Junqueira. 2008. A simple totally ordered broadcast protocol. In Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware (LADIS ‘08). ACM, New York, NY, USA, Article 2, 6 pages.
- Justin Sheehy. 2015. There Is No Now. ACM Queue.
- Marko Vukolic. 2012. Quorum Systems: With Applications to Storage and Consensus. Morgan and Claypool Publishers.