# Understanding the Raft Algorithm in One Article

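Raft is a strongly consistent, decentralized, highly available distributed protocol that is widely used in engineering practice. The emphasis on engineering is deliberate, because in academic theory the brightest star is still the famous Paxos. But Paxos has a reputation: the few who truly understand it find it simple, those who have not yet understood it find it hard, and most people only half understand it. I spent a lot of time and read a lot of material without truly understanding it myself. Then I read the raft paper, whose two authors mention that they too spent a long time trying to understand Paxos and found it hard to follow, which led them to develop the raft algorithm.

This article analyzes the raft protocol based on the paper In Search of an Understandable Consensus Algorithm. That said, readers are still encouraged to read the paper directly.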

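# Overview of the raft algorithm

The number one goal of the Raft algorithm is to be easy to understand (Understandable), as the title of the paper makes clear. While improving understandability, Raft is still no worse than Paxos in performance, reliability, and availability.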

Raft is more understandable than Paxos and also provides a better foundation for building practical systems

To reach the goal of understandability, raft does a lot of work, of which two things matter most:

• Problem decomposition
• State simplification

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

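In the raft protocol, a node is in exactly one of the following three states at any moment:

• leader
• follower
• candidate

A state transition diagram makes the differences among these three states easy to see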

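## The election process in detail

• Increment the node's local current term and switch to the candidate state
• Vote for itself
• Send RequestVote RPCs to the other nodes in parallel
• Wait for replies from the other nodes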
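The four steps above can be sketched in a few lines. This is only an illustration, not code from the paper: the `Node` and `RequestVote` classes, their field names, and `start_election` are hypothetical names chosen here, and the network is replaced by simply returning the messages that would be sent.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical in-memory view of one raft node (illustration only)."""
    node_id: str
    current_term: int = 0
    state: str = "follower"
    voted_for: Optional[str] = None
    votes_received: Set[str] = field(default_factory=set)


@dataclass(frozen=True)
class RequestVote:
    """Hypothetical message shape; a real RPC also carries last log index/term."""
    term: int
    candidate_id: str
    to: str


def start_election(node: Node, peers: List[str]) -> List[RequestVote]:
    # Step 1: bump the local term and become a candidate.
    node.current_term += 1
    node.state = "candidate"
    # Step 2: vote for itself.
    node.voted_for = node.node_id
    node.votes_received = {node.node_id}
    # Step 3: one RequestVote per peer (sent in parallel in a real system).
    # Step 4, waiting for replies, is left to the caller.
    return [RequestVote(node.current_term, node.node_id, p) for p in peers]
```

Calling `start_election(Node("A"), ["B", "C"])` moves node A to term 1 as a candidate and returns two messages addressed to B and C.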

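During this process, depending on the messages received from other nodes, there are three possible outcomes:

1. It receives votes from a majority of nodes and becomes leader
2. It is told that another node has already been elected, so it switches itself to follower
3. It does not receive votes from a majority within a period of time, so it stays in the candidate state and starts a new election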
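As a rough sketch of how a candidate might decide between these outcomes, the function below maps what the candidate has observed to its next role. The function name, its parameters, and the returned strings are assumptions made for this example; `heard_from_leader` is assumed to mean that a message arrived from a leader whose term is at least the candidate's own.

```python
def resolve_election(votes: int, cluster_size: int,
                     heard_from_leader: bool, timed_out: bool) -> str:
    """Return the candidate's next role (hypothetical labels)."""
    majority = cluster_size // 2 + 1
    if votes >= majority:
        return "leader"             # won a majority of votes
    if heard_from_leader:
        return "follower"           # someone else has already been elected
    if timed_out:
        return "candidate-retry"    # no majority in time: start a new election
    return "candidate-waiting"      # keep waiting for more replies
```

Note that the majority is computed over the whole cluster, not over the replies received, so 2 votes are not enough in a 4-node cluster.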

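• Within any given term, a single node can cast at most one vote
• The candidate must know at least as much as the voter does (the sections on log replication and safety cover this in detail)
• first-come-first-served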
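These three rules can be expressed as a small vote handler. This is a minimal sketch under stated assumptions: `Voter`, `handle_request_vote`, and the boolean `candidate_log_ok` are hypothetical names, and `candidate_log_ok` stands in for the "knows at least as much" check, which is passed in as a precomputed flag here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    """Hypothetical per-node voting state (illustration only)."""
    current_term: int = 0
    voted_for: Optional[str] = None


def handle_request_vote(voter: Voter, term: int, candidate_id: str,
                        candidate_log_ok: bool) -> bool:
    if term < voter.current_term:
        return False                      # stale candidate
    if term > voter.current_term:
        voter.current_term = term         # new term: the vote is available again
        voter.voted_for = None
    if voter.voted_for not in (None, candidate_id):
        return False                      # already voted in this term
    if not candidate_log_ok:
        return False                      # candidate knows less than the voter
    voter.voted_for = candidate_id        # first qualifying request wins
    return True
```

Granting the same candidate twice in one term returns True both times, so a retransmitted request does not change the outcome.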

In the third case, no node wins a majority of votes, as in the situation shown in the figure below:

# log replication

## Replicated state machines

Consensus algorithms are generally implemented on top of replicated state machines. What is a replicated state machine?

If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.

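Put simply: same initial state + same inputs = same final state. The quote contains a very important word, deterministic: different nodes must process the inputs with the same deterministic function and must not introduce nondeterministic values such as local time. As for how to guarantee that all nodes get the same inputs in the same order, a replicated log is a very good idea: a log is persistent and order-preserving, and it is the cornerstone of most distributed systems.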
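A toy example may make the "same state + same inputs in the same order" idea concrete. The key-value command format and the `apply_log` function below are invented for this illustration and are not part of raft itself.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical command format: (operation, key, value)
Command = Tuple[str, str, Optional[int]]


def apply_log(log: List[Command]) -> Dict[str, int]:
    """Deterministically replay a log against an empty key-value store."""
    state: Dict[str, int] = {}
    for op, key, value in log:
        if op == "set":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state


log = [("set", "x", 1), ("set", "y", 2), ("del", "x", None)]
replica_a = apply_log(log)
replica_b = apply_log(list(log))  # a second replica replaying the same log
```

Both replicas end in the same state because `apply_log` depends only on its input. Replaying the same commands in a different order can give a different result, which is why the log has to preserve order.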

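The figure below illustrates this log-based replicated state machine

## The complete request flow

• The leader issues AppendEntries RPCs in parallel
• The leader waits for responses from a majority
• The leader applies the entry to its state machine
• The leader notifies the followers to apply the log entry

So what does the log look like on each node?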

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers
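To illustrate what "replicated on a majority" means numerically, the sketch below finds the highest log index that a majority of servers already hold. The function name and the `match_index` dictionary (follower id to highest replicated index) are assumptions for this example, and the sketch deliberately ignores the current-term restriction described in the State Machine Safety section.

```python
from typing import Dict


def majority_replicated_index(leader_last_index: int,
                              match_index: Dict[str, int]) -> int:
    """Highest index stored on a majority of servers, leader included."""
    # The leader always holds its own entries, so count it as well.
    indexes = sorted([leader_last_index] + list(match_index.values()),
                     reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th largest value is held by at least `majority` servers.
    return indexes[majority - 1]
```

With a leader at index 7 and followers at 7, 5, 3, and 2, the sorted list is [7, 7, 5, 3, 2] and the third value, 5, is the highest index present on three of the five servers.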

# safety

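A distributed algorithm can be evaluated on many properties, such as

• safety: nothing bad ever happens.
• liveness: something good eventually happens.

The safety property must hold under any system model: under no circumstances may the system make an irreversible mistake or return incorrect content to a client. For example, raft guarantees that committed log entries will not be rolled back, which is a safety property. That raft eventually brings all nodes to a consistent state is a liveness property.

The raft protocol guarantees the following properties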

### Election safety

• A node can cast at most one vote in a given term;

### log matching

• If two entries in different logs have the same index and term, then they store the same command.
• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

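When nothing goes wrong, log matching is easy to satisfy, but once a node crash occurs the situation becomes complicated, as in the figure below

Note: a-f in the figure above are not six followers, but six states that a given follower might be in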

To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point.

s2 prevLogTerm and prevLogIndex in AppendEntries come from logs[nextIndex[x] - 1]
s4 The leader receives the follower's reply. If the return value is False, then nextIndex[x] -= 1 and go back to s2. Otherwise
s5 Sync all log entries from nextIndex[x] onward
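The back-off loop described by these steps can be sketched as follows. This is a simplified illustration, not the paper's pseudocode: logs are plain Python lists of `(term, command)` tuples with 1-based indexes, `append_entries` and `sync_follower` are hypothetical names, the RPC is a direct function call, and the follower truncates its log unconditionally after the matching point.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command), an assumed representation


def append_entries(follower_log: List[Entry], prev_index: int,
                   prev_term: int, entries: List[Entry]) -> bool:
    """Follower side: accept only if the entry at prev_index matches prev_term."""
    if prev_index > len(follower_log):
        return False                       # follower's log is too short
    if prev_index > 0 and follower_log[prev_index - 1][0] != prev_term:
        return False                       # same index, different term
    del follower_log[prev_index:]          # drop everything after the match
    follower_log.extend(entries)
    return True


def sync_follower(leader_log: List[Entry], follower_log: List[Entry]) -> int:
    """Leader side: back nextIndex off until the follower accepts."""
    next_index = len(leader_log) + 1
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[prev_index:]):
            return next_index              # value of nextIndex that succeeded
        next_index -= 1                    # rejected: retry one entry earlier
```

The loop always terminates, because a request with `prev_index` equal to 0 is accepted by any follower. Decrementing by one entry per round trip is the simplest strategy; it can take many rounds when the logs diverge far back.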

### leader completeness vs election restriction

• A log entry counts as committed only once it has been replicated to a majority of nodes

voter denies its vote if its own log is more up-to-date than that of the candidate.

If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
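The comparison in the quote above is small enough to write out directly. The function name and parameter names are chosen for this sketch; each log is summarized by the term and index of its last entry.

```python
def candidate_is_up_to_date(cand_last_term: int, cand_last_index: int,
                            voter_last_term: int, voter_last_index: int) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != voter_last_term:
        # Different last terms: the later term wins, regardless of length.
        return cand_last_term > voter_last_term
    # Same last term: the longer (or equally long) log wins.
    return cand_last_index >= voter_last_index
```

A voter grants its vote only when this returns True. A shorter log whose last entry has a later term still counts as more up-to-date than a longer log that ends in an earlier term.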

# corner case

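Let us consider reads and writes in this situation.

First, if a client sends a request to NodeB, NodeB cannot replicate the log entry to a majority of nodes, so it will not tell the client that the write succeeded, and nothing goes wrong.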

## State Machine Safety

One property was not covered in detail in the earlier discussion of safety, namely State Machine Safety:

State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

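If a node has applied the log entry at a given position to its state machine, no other node may apply a different entry at that position. Put simply, all nodes should apply the same entry at the same position (index in log entries). Yet there seem to be situations that violate this principle: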

Raft never commits log entries from previous terms by counting replicas.
Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property.
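The rule quoted above can be sketched as a commit-index update on the leader. The names here are assumptions for illustration: `log_terms[i - 1]` holds the term of the entry at 1-based index `i`, and `match_index` maps every server, the leader included, to the highest index it is known to store.

```python
from typing import Dict, List


def advance_commit_index(log_terms: List[int], match_index: Dict[str, int],
                         commit_index: int, current_term: int) -> int:
    """Return the new commit index under the current-term restriction."""
    cluster_size = len(match_index)
    # Scan from the newest entry down to just above the current commit index.
    for n in range(len(log_terms), commit_index, -1):
        replicas = sum(1 for m in match_index.values() if m >= n)
        on_majority = replicas > cluster_size // 2
        # Only an entry from the leader's own term may be committed by counting.
        if on_majority and log_terms[n - 1] == current_term:
            return n   # entries before n are committed indirectly
    return commit_index
```

In a five-node cluster whose leader is in term 4, an entry from term 2 at index 2 stays uncommitted even when three servers hold it. Once a term-4 entry at index 3 reaches three servers, the commit index jumps to 3 and index 2 is committed along with it.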

Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term.

# Summary

• At most one vote per term, first-come-first-served