# 分布式系统理论之Quorum机制

WARO牺牲了更新服务的可用性，最大程度地增强了读服务的可用性。而Quorum就是更新服务和读服务之间进行一个折衷。

Quorum机制是“抽屉原理”的一个应用。定义如下：假设有N个副本，更新操作wi 在W个副本中更新成功之后，才认为此次更新操作wi 成功。称成功提交的更新操作对应的数据为：“成功提交的数据”。对于读操作而言，至少需要读R个副本才能读到此次更新的数据。其中，W+R>N ，即W和R有重叠。一般，W+R=N+1

①Quorum机制无法保证强一致性

因为，仅仅通过Quorum机制无法确定最新已经成功提交的版本号。

1）如何读取最新的数据？---在已经知道最近成功提交的数据版本号的前提下，最多读R个副本就可以读到最新的数据了。

2）如何确定 最高版本号 的数据是一个成功提交的数据？---继续读其他的副本，直到读到的 最高版本号副本 出现了W次。

②基于Quorum机制选择 primary

比如，(V2，V2，V1，V1，V1），R=3，如果读取的3个副本是：(V1，V1，V1)则高版本的 V2需要丢弃。

HDFS高可用性实现

HDFS的运行依赖于NameNode，如果NameNode挂了，那么整个HDFS就用不了了，因此就存在单点故障(single point of failure)；其次，如果需要升级或者维护停止NameNode，整个HDFS也用不了。为了解决这个问题，采用了QJM机制(Quorum Journal Manager)实现HDFS的HA（High Availability）。注意，一开始采用的“共享存储”机制，关于共享存储机制的不足，可参考：（还提到了QJM的优点）

In a typical HA cluster, two separate machines are configured as NameNodes.
At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state.
The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node,
both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). 

''''''''''''''''''''''''''''''''''

'''''''''''''''''''''''''''''''''

Active NameNode 向 JournalNode 集群提交 EditLog 是同步的

In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information
regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes,
and send block location information and heartbeats to both.

Split Brain 是指在同一时刻有两个认为自己处于 Active 状态的 NameNode。

when a NameNode sends any message (or remote procedure call) to a JournalNode, it includes its epoch number as part of the request.
Whenever the JournalNode receives such a message, it compares the epoch number against a locally stored value called the promised epoch.
If the request is coming from a newer epoch, then it records that new epoch as its promised epoch.
If instead the request is coming from an older epoch, then it rejects the request. This simple policy avoids split-brain

when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes,
so there is no potential for corrupting the file system metadata from a split-brain scenario.

https://issues.apache.org/jira/secure/attachment/12547598/qjournal-design.pdf