Zookeeper Study Notes

 

Zookeeper is a general-purpose coordination service.
The ZooKeeper service comprises an ensemble of servers that use replication to achieve high availability and performance.

 

What do we mean by coordination as a service?
  Example: VMware-FT's test-and-set (t-a-s) server
    If one replica can't talk to the other, it grabs the t-a-s lock and becomes the sole server
    Must be exclusive to avoid two primaries (e.g. if network partition)
    Must be fault-tolerant
  Example: GFS (more speculative)
    Perhaps agreement on which meta-data replica should be master
    Perhaps recording list of chunk servers, which chunks, who is primary
  Other examples: MapReduce, YMB, Crawler, etc.
    Who is the master, list of workers (Group Membership)
    Master failover (Failure Detection & Leader Election)
    Division of work; status of tasks (Configuration Management)
  A general-purpose service would save much effort!

 

Could we use a linearizable key/value store as a generic coordination service?
  For example, to choose new GFS master if multiple replicas want to take over?
  perhaps
    Put("master", my IP address)
    if Get("master") == my IP address:
      act as master
  problem: a racing Put() may execute after the Get()
    2nd Put() overwrites first, so two masters, oops
    Put() and Get() are not a good API for mutual exclusion!
  problem: what to do if master fails?
    perhaps master repeatedly Put()s a fresh timestamp?
    lots of polling...
  problem: clients need to know when master changes
    periodic Get()s?
    lots of polling...

 

Zookeeper API overview
  data model: a file-system-like tree of znodes
    file names, file content, directories, path names
  typical use: configuration info in znodes
    set of machines that participate in the application
    which machine is the primary
  each znode has a version number
  types of znodes:
    regular (a.k.a. persistent)
    ephemeral
    sequential: name + seqno
  watches
    allow clients to receive timely notifications of changes w/o requiring polling
    one-time trigger associated with a session
    client lib will re-establish watches on a new server if the original connection is lost
  sessions
    have an associated timeout, for detecting faulty clients
    persist across Zookeeper servers (a connection loss does not end a session)
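
  A minimal sketch of sessions and watches using the Java client API (the connection string, path, and timeout are illustrative; no retry or close logic, and /app/config is assumed to already exist):

    import org.apache.zookeeper.*;

    public class WatchSketch {
        public static void main(String[] args) throws Exception {
            // 10-second session timeout: if the client stops heartbeating,
            // the server expires the session (and deletes its ephemeral znodes)
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {
                // session-level watcher: connection state changes arrive here
                System.out.println("session event: " + event.getState());
            });

            // one-time trigger: fires once when /app/config next changes
            byte[] cfg = zk.getData("/app/config", event -> {
                System.out.println("config changed: " + event.getType());
                // re-read (and re-register the watch) here if still interested
            }, null);
            System.out.println(new String(cfg));
        }
    }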

 

Operations on znodes
  create(path, data, flags)
    exclusive -- only first create indicates success
  delete(path, version)
    if znode.version = version, then delete
  exists(path, watch)
    watch=true means also send notification if path is later created/deleted
  getData(path, watch)
  setData(path, data, version)
    if znode.version = version, then update
  getChildren(path, watch)
  sync()
    sync then read ensures writes before sync are visible to same client's read
    client could instead submit a write
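
  A brief hedged sketch exercising a few of the operations above with the Java client (paths and data are made up, /app is assumed to exist, and the flags map to CreateMode):

    import java.util.List;
    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;

    public class ZnodeOps {
        static void demo(ZooKeeper zk) throws Exception {
            try {
                // exclusive create: only the first concurrent creator succeeds
                zk.create("/app/primary", "host1:9999".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
                // someone else created it first
            }
            Stat stat = new Stat();
            byte[] who = zk.getData("/app/primary", false, stat);  // data + version
            List<String> children = zk.getChildren("/app", false); // list children
            // conditional delete: succeeds only if the version still matches
            zk.delete("/app/primary", stat.getVersion());
        }
    }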


The ZooKeeper API is well tuned to synchronization:
  + exclusive file creation; exactly one concurrent create returns success
  + getData()/setData(x, version) supports mini-transactions
  + sessions automate actions when clients fail (e.g. release lock on failure)
  + sequential files create order among multiple clients
  + watches -- avoid polling


Ordering guarantees
  * Linearizable writes (use ZAB to totally order writes)
    clients send writes to the leader
    the leader chooses an order, numbered by "zxid"
    the leader sends writes to replicas, which all execute in zxid order
  * FIFO client order (Session consistency)
    each client specifies an order for its operations (reads AND writes)
    writes:
      writes appear in the global write order in client-specified order
    reads:
      each read executes at a particular point in the write order
      a client's successive reads execute at non-decreasing points in the order
      a client's read executes after all previous writes by that client
        a server may block a client's read to wait for previous write, or sync()

Q: Why does this make sense?
  I.e. why OK for reads to return stale data?
    why OK for client 1 to see new data, then client 2 sees older data?
  Note that the staleness of reads is bounded
    syncLimit:
      Amount of time, in ticks (see tickTime), to allow followers to sync with ZooKeeper
          If followers fall too far behind a leader, they will be dropped

A:

  At a high level:
    not as painful for programmers as it may seem
    very helpful for read performance!
      ZooKeeper processes reads locally at each server
      read capacity scales linearly with the number of servers

  Why is ZooKeeper useful despite loose consistency (compared to linearizability)?
    sync() causes subsequent client reads to see preceding writes.
      useful when a read must see the latest data (see the sketch after this list)
      sync() makes linearizable history possible, but it hurts performance
    Writes are well-behaved, e.g. exclusive test-and-set operations
      writes really do execute in order, on latest data.
    Read order rules ensure "read your own writes". (Read-your-write Consistency)
    Read order rules help reasoning.
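
    A minimal sketch of the sync-then-read pattern mentioned above, using the Java client (the Java sync() call is asynchronous, so this blocks on its callback before reading; the path and error handling are simplified):

      import java.util.concurrent.CountDownLatch;
      import org.apache.zookeeper.ZooKeeper;

      public class SyncRead {
          static byte[] syncThenRead(ZooKeeper zk, String path) throws Exception {
              CountDownLatch done = new CountDownLatch(1);
              zk.sync(path, (rc, p, ctx) -> done.countDown(), null);
              done.await();                         // wait until the sync completes
              return zk.getData(path, false, null); // read after preceding writes are visible
          }
      }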

A few consequences of the ordering guarantees:
  Leader must preserve client write order across leader failure.
  Replicas must enforce "a client's reads never go backwards in zxid order" (Monotonic Reads)
    despite replica failure.
  Client must track highest zxid it has read
    to help ensure next read doesn't go backwards
    even if sent to a different replica


Example 1: Add one to a number stored in a ZooKeeper znode
  what if the read returns stale data?
    write will write the wrong value!
  what if another client concurrently updates?
    will one of the increments be lost?

  while true:
    x, v := getData("f")
    if setData("f", x + 1, version=v):
      break
  this is a "mini-transaction", effect is atomic read-modify-write

 

Example 2: Simple Locks
  acquire():
    while true:
      if create("lf", ephemeral=true), success
      if exists("lf", watch=true)
        wait for notification

  release(): (voluntarily or session timeout)
    delete("lf")

 

Example 3: Locks without Herd Effect

  1. create a "sequential" file
  2. list files
  3. if no lower-numbered, lock is acquired!
  4. if exists(next-lower-numbered, watch=true)
  5.     wait for event...
  6. goto 2
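
  A sketch of the same recipe with the Java client (the parent znode "/lock" and child prefix "lock-" are assumptions; the zero-padded sequence suffix makes lexicographic sort equal to numeric sort):

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;

    public class HerdFreeLock {
        static String acquire(ZooKeeper zk) throws Exception {
            // 1. create a sequential, ephemeral child, e.g. /lock/lock-0000000042
            String me = zk.create("/lock/lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String myName = me.substring("/lock/".length());
            while (true) {
                // 2. list the children and sort by sequence number
                List<String> children = zk.getChildren("/lock", false);
                Collections.sort(children);
                int i = children.indexOf(myName);
                if (i == 0) return me;                 // 3. lowest number: lock acquired
                // 4. watch only the next-lower-numbered znode
                CountDownLatch gone = new CountDownLatch(1);
                Stat s = zk.exists("/lock/" + children.get(i - 1),
                                   event -> gone.countDown());
                if (s != null) gone.await();           // 5. wait for the event
                // 6. goto 2: the predecessor may have crashed rather than released
            }
        }

        static void release(ZooKeeper zk, String me) throws Exception {
            zk.delete(me, -1);
        }
    }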

 

Note on using ZK locks
  Different from single-machine thread locks!
  If lock holder fails, system automatically releases locks.
  So locks are not really enforcing atomicity of other activities.
  To make writes atomic, use "ready" trick or mini-transactions.
  Useful for master/leader election.
    New leader must inspect state and clean up.
  Or soft locks, for performance but not correctness
    e.g. only one worker does each Map or Reduce task (but OK if done twice)
    e.g. a URL crawled by only one worker (but OK if done twice)


Zookeeper Performance Optimizations
  Reads are performed on a local replica of the database
  Clients can send async writes to leader (async = don't have to wait).
  Leader batches up many requests to reduce net and disk-write overhead.
  Assumes lots of active clients.
  Fuzzy snapshots (and idempotent updates) so snapshot doesn't stop writes.

 

Is the resulting performance good?
  Table 1 in the paper
  High read throughput -- and goes up with number of servers!
  Lower write throughput -- and goes down with number of servers!
  21,000 writes/second is pretty good!
    Maybe limited by time to persist log to hard drives.
    But still MUCH higher than 10 milliseconds per disk write -- batching.


ZooKeeper is a successful design
  see ZooKeeper's Wikipedia page for a list of projects that use it
  Rarely eliminates all the complexity from distribution.
    e.g. GFS master still needs to replicate file meta-data.
    e.g. GFS primary has its own plan for replicating chunks.
  But does bite off a bunch of common cases:
    Master election.
    Persistent master state (if state is small).
    Who is the current master? (name service).
    Worker registration.
    Work queues.


----------------------------------------------------------------------------

 

Persistence

  write-ahead log of committed operations
  periodic snapshots of the in-memory database

 

Idempotent Operations
  operation: <txnType, path, value, newVersionNumber>
    e.g. <SetDataTXN, /foo, f3, 2>
  leader transforms a write request into a txn and fills in the updated state of the znode

 

Fuzzy Snapshots
  ZooKeeper creates the snapshot from its in-memory database while still allowing writes to the database
  depth-first scan of the whole tree
  atomically read the metadata and data of each znode and write them to disk
  snapshots may not correspond to the state of ZooKeeper at any single point in time
    but that's OK:
      after reboot, ZooKeeper applies the commit log starting from the point at which the snapshot began
      the replay turns the fuzzy snapshot into a consistent snapshot of the application state
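
  A toy sketch (not ZooKeeper's real data structures; requires Java 16+ for records) of why idempotent txns make fuzzy snapshots safe to replay: each txn records the final data and version, so applying it to a znode that already reflects it changes nothing:

    import java.util.HashMap;
    import java.util.Map;

    public class IdempotentReplay {
        record Znode(String data, int version) {}
        record SetDataTxn(String path, String data, int newVersion) {}

        static void apply(Map<String, Znode> tree, SetDataTxn txn) {
            // overwrite with the state recorded in the txn; never "version++"
            tree.put(txn.path(), new Znode(txn.data(), txn.newVersion()));
        }

        public static void main(String[] args) {
            Map<String, Znode> tree = new HashMap<>();
            SetDataTxn txn = new SetDataTxn("/foo", "f3", 2);
            apply(tree, txn);
            apply(tree, txn);                     // replaying the same txn is harmless
            System.out.println(tree.get("/foo")); // Znode[data=f3, version=2]
        }
    }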


Details of batching and pipelining for performance  

There are two things going on here. First, the ZooKeeper leader (really the leader's Zab layer) batches together multiple client operations in order to send them efficiently over the network, and in order to efficiently write them to disk. For both network and disk, it's often far more efficient to send a batch of N small items all at once than it is to send or write them one at a time. This kind of batching is only effective if the leader sees many client requests at the same time; so it depends on there being lots of active clients. 

The second aspect of pipelining is that ZooKeeper makes it easy for each client to keep many write requests outstanding at a time, by supporting asynchronous operations. From the client's point of view, it can send lots of write requests without having to wait for the responses (which arrive later, as notifications after the writes commit). From the leader's point of view, that client behavior gives the leader lots of requests to accumulate into big efficient batches. 
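
A minimal sketch of such client-side pipelining with the Java client's asynchronous setData (the paths under /app/task- are made up; rc == 0 is the OK result code):

  import org.apache.zookeeper.ZooKeeper;

  public class AsyncWrites {
      static void pipelineWrites(ZooKeeper zk) {
          for (int i = 0; i < 1000; i++) {
              byte[] status = ("done-" + i).getBytes();
              // version -1 = unconditional write; the callback runs after commit
              zk.setData("/app/task-" + i, status, -1,
                  (rc, path, ctx, stat) -> {
                      if (rc != 0) System.err.println("write failed: " + path);
                  }, null);
          }
          // the loop never waits, so the leader sees a stream of requests that it
          // can batch into large network messages and disk writes
      }
  }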

 

Notification Corner Case

There is one case where a watch may be missed: a watch for the existence of a znode that has not yet been created will be missed if the znode is created and then deleted while the client is disconnected.

book p103 [TBD]


Q: Why do the authors of the paper say ZooKeeper is wait-free?

A:

The precise definition of wait-free: A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. This definition was introduced in the following paper by Herlihy:

https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf 

Definition of wait-freedom from Wikipedia: Wait-freedom is the strongest  non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes. 

Zookeeper is wait-free because it processes one client's requests without needing to wait for other clients to take action. This is partially a consequence of the API: despite being designed to support client/client coordination and synchronization, no ZooKeeper API call is defined in a way that would require one client to wait for another. In contrast, a system that supported a lock acquire operation that waited for the current lock holder to release the lock would not be wait-free.

Ultimately, however, ZooKeeper clients often need to wait for each other, and ZooKeeper does provide a waiting mechanism -- watches. The main effect of wait-freedom on the API is that watches are factored out from other operations. The combination of atomic test-and-set updates (e.g. file creation and writes condition on version) with watches allows clients to synthesize more complex blocking abstractions (e.g. locks and barriers).

 

Q: Zookeeper session timeout vs. Chubby lease timeout?
A:
  [TBD]

 

Q: How to do leader election with Zookeeper
A:

  [TBD]

  See https://zookeeper.apache.org/doc/current/recipes.html

 

Q: Any order guarantees for Zookeeper notifications ?

A:

If a client is watching for a change, the client will see the notification event before it sees the new state of the system after the change is made.


Q. Zab vs. Raft/Paxos?
A:
  [TBD]

  See https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs.+Paxos


Q: Can the ZooKeeper ensemble be configured such that the leader does not accept connections from clients?
A:
  Yes. The leaderServes cluster option controls this. The leader accepts client connections by default ("yes").
  The leader machine coordinates updates; for higher update throughput, at the slight expense of read throughput, the leader can be configured to not accept clients and instead focus on coordination.
  See https://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html
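
  A hedged zoo.cfg sketch (values other than leaderServes are illustrative defaults):

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1:2888:3888
    server.2=zk2:2888:3888
    server.3=zk3:2888:3888
    # dedicate the leader to coordinating updates; clients connect to followers only
    leaderServes=no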


Q: Is a Zookeeper slow read (sync + read) linearizable?
A:
  No. The use of the sync operation before performing a read does not guarantee linearizable reads, as the following snippet taken from the zookeeper book states:

  "There is a caveat to the use of sync, which is fairly technical and deeply entwined with ZooKeeper internals. (Feel free to skip it.) Because ZooKeeper is supposed to serve reads fast and scale for read-dominated workloads, the implementation of sync has been simplified and it doesn't really traverse the execution pipeline as a regular update operation, like create, setData, or delete. It simply reaches the leader, and the leader queues a response back to the follower that sent it. There is a small chance that the leader thinks that it is the leader l, but doesn't have support from a quorum any longer because the quorum now supports a different leader, lʹ . In this case, the leader l might not have all updates that have been processed, and the sync call might not be able to honor its guarantee."

 

Q: Zookeeper vs. Chubby
A:
  Chubby is a lock service; Zookeeper is not a lock service but a coordination service whose API clients can use to implement locks
  Zookeeper's consistency model (sequential consistency with linearizable writes) is more relaxed than Chubby's (Chubby uses Paxos; exact consistency model TBD)
  Zookeeper provides watches to enable efficient waiting; Chubby doesn't have such a notification mechanism


References

      Paper: "ZooKeeper: wait-free coordination for internet-scale systems" (USENIX ATC 2010)
  Zookeeper programmer's guide: https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html
  https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf (wait free, universal objects, etc.)
  MIT 6.824 Zookeeper case study: https://pdos.csail.mit.edu/6.824/notes/l-zookeeper.txt
  The Zookeeper book: https://t.hao0.me/files/zookeeper.pdf
