1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).
2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.
3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary. Section 3.2 discusses this further.
4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order.
5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.
6. The secondaries all reply to the primary indicating that they have completed the operation.
7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.
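Note that the client first sends the data to all the replicas and waits for each replica to acknowledge that it has received the data. Only then does the client send the WRITE command to the primary chunkserver (the primary is designated by the master). The primary chunkserver coordinates the write order and sends the WRITE command to each of the other chunkservers. The primary waits for ACKs from the other chunkservers before it reports a successful write to the client.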
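The separation between pushing data (step 3) and ordering the write (steps 4 through 7) can be sketched as a small in-memory simulation. This is only an illustration under simplifying assumptions, not GFS code: the names `ChunkServer`, `Primary`, `client_write`, `fail_writes`, and `max_attempts` are all hypothetical, the master and lease lookup (steps 1 and 2) are omitted, the buffer is a plain dict rather than an LRU cache, and network calls are replaced by direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkServer:
    name: str
    fail_writes: bool = False  # hypothetical switch to simulate a replica error
    buffer: dict = field(default_factory=dict)  # pushed data not yet applied
    chunk: list = field(default_factory=list)   # applied mutations: (serial, data)

    def push(self, data_id, data):
        # Step 3: stage the data only; the chunk itself is not modified yet.
        self.buffer[data_id] = data
        return True  # acknowledgement to the client

    def apply(self, serial, data_id):
        # Steps 4-5: apply staged data under the serial number chosen by the primary.
        if self.fail_writes or data_id not in self.buffer:
            return False
        self.chunk.append((serial, self.buffer.pop(data_id)))
        return True


class Primary(ChunkServer):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        if data_id not in self.buffer:
            # Failure at the primary: no serial number assigned, nothing forwarded.
            return False
        # Step 4: assign the next serial number and apply locally.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        # Steps 5-6: forward to every secondary and collect their replies.
        replies = [s.apply(serial, data_id) for s in self.secondaries]
        # Step 7: report success only if every secondary succeeded.
        return all(replies)


def client_write(primary, data, data_id, max_attempts=3):
    """Return the attempt number that succeeded, or None if all attempts failed."""
    replicas = [primary, *primary.secondaries]
    for attempt in range(1, max_attempts + 1):
        # Step 3: push the data to all replicas and wait for every ack.
        if not all(r.push(data_id, data) for r in replicas):
            continue
        # Step 4: only after all acks, send the write request to the primary.
        if primary.write(data_id):
            return attempt
        # A failed attempt may leave the data applied on some replicas only,
        # and a retry re-applies it under a new serial number.
    return None
```

With two healthy secondaries, `client_write(Primary("p", [s1, s2]), b"hello", "d1")` succeeds on the first attempt and all three replicas hold the same `(serial, data)` list. Setting `fail_writes = True` on one secondary makes every attempt fail while the primary and the healthy secondary still apply the data on each retry, which mirrors the inconsistent region described in step 7.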