Notes on Amazon's Dynamo (in English)

I recently read Amazon's Dynamo paper. The "always writeable" business goal, the use of DHTs, vector clocks and Merkle trees, and the trade-off between consistency and high availability (following the CAP conjecture: partition tolerance is taken as a given, so only one of the other two can be kept) are all very interesting. I recommend reading these notes alongside the original paper.

What is the problem that this paper tries to solve? How would you summarise its main idea in a few sentences? How does it work in more detail?

What is good about the paper? What is not good about the paper?

To what extent is the design of Dynamo inspired by Distributed Hash Tables (DHTs)? What are the advantages and disadvantages of such a design?
(Section 3.3 of the paper)
Dynamo can be described as a zero-hop DHT: each node maintains enough routing information locally to route a request directly to the appropriate node.
P2P DHTs: global, multi-hop routing
Dynamo: routing state kept locally, one hop

How does the design of Dynamo compare to that of BigTable?
Dynamo: simple key-value access by primary key; trades ACID (transactional) guarantees for availability
BigTable: structured / semi-structured data (a sparse, multi-dimensional sorted map)


Key points:

target: always writeable
consistency & availability (in Dynamo's setting): fundamentally in conflict
Dynamo's answer: weak consistency, i.e. eventual consistency
conflicting versions are tracked with vector clocks

Dynamo

Requirements

  • simple query model: read/write operations on a single data item identified by a unique key; no operation spans multiple items and no relational schema is needed

  • consistency & availability: in conflict whenever failures or partitions occur

    Experience at Amazon has shown that data stores that provide ACID guarantees tend to have poor availability.

  • efficiency: runs on commodity hardware infrastructure and must still meet stringent SLAs

  • other: Dynamo is an internal service, so there are no security-related requirements such as authentication and authorization

Target: meet SLA

Figure 1 (Amazon's service-oriented architecture): typically, the aggregator services are stateless, although they use extensive caching.

the common industry standard: SLAs described by the average, median and expected variance
Amazon instead: SLAs expressed and measured at the 99.9th percentile of the response-time distribution
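
To make the gap concrete, here is a tiny Python sketch (the latency numbers are synthetic, not Amazon's measurements): the mean looks healthy while the 99.9th percentile exposes the slow tail that the SLA is actually written against.

```python
# Synthetic latency sample: most requests are fast, a small tail is very slow.
import random

random.seed(0)
latencies_ms = ([random.gauss(20, 5) for _ in range(9950)] +
                [random.uniform(200, 500) for _ in range(50)])
latencies_ms.sort()

mean = sum(latencies_ms) / len(latencies_ms)
p999 = latencies_ms[int(0.999 * len(latencies_ms)) - 1]   # simple nearest-rank percentile

print(f"mean  = {mean:.1f} ms")    # looks fine, hides the tail
print(f"p99.9 = {p999:.1f} ms")    # the figure a 99.9% SLA is judged by
```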


design

it is well known that when dealing with the possibility of network failures, strong consistency and high data availability cannot be achieved simultaneously

conflict resolution: needed, because Dynamo is an eventually consistent data store

An important design consideration is to decide when to perform the process of resolving update conflicts,
e.g. whether conflicts should be resolved during reads (the traditional choice) or during writes (Dynamo's choice, which keeps the store "always writeable")

who performs the process of conflict resolution (see the sketch after this list)

  • the data store: simple policies only, e.g. "last write wins"
  • the application: more flexible, can choose the resolution that best suits its semantics (e.g. merging divergent shopping carts)
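
A minimal Python sketch of the two policies, using the paper's shopping-cart scenario; the record layout, timestamps and variable names are invented for illustration.

```python
# Two divergent versions of the same cart, written on different replicas.
from datetime import datetime

v_a = {"items": {"book", "dvd"}, "written_at": datetime(2007, 1, 1, 10, 5)}
v_b = {"items": {"book", "cd"},  "written_at": datetime(2007, 1, 1, 10, 3)}

# Data-store policy: "last write wins" -- simple, but the cd silently disappears.
lww = max([v_a, v_b], key=lambda v: v["written_at"])

# Application policy: merge the divergent carts -- no "add to cart" is ever lost.
merged = v_a["items"] | v_b["items"]

print(sorted(lww["items"]))   # ['book', 'dvd']
print(sorted(merged))         # ['book', 'cd', 'dvd']
```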

Other key principles: Incremental scalability, Symmetry, Decentralization, Heterogeneity


Related work: P2P systems


Architecture

partitioning, replication, versioning, membership, failure handling and scaling.

interface

get() put()

partitioning

problems with the basic consistent hashing algorithm (hash ring):

  • non-uniform data and load distribution
  • oblivious to the heterogeneity in the performance of nodes

improvement:

virtual node: A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node.
when a new node is added to the system, it is assigned multiple positions (henceforth, “tokens”) in the ring.
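
A minimal Python sketch of a consistent-hash ring with virtual nodes (tokens). The hash function, token count and class names are illustrative choices, not Dynamo's actual (Java) implementation.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Map an arbitrary string to a position on the ring (128-bit space here).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, tokens_per_node: int = 8):
        self.tokens_per_node = tokens_per_node
        self.ring = []                       # sorted list of (token_position, physical_node)

    def add_node(self, node: str) -> None:
        # Each physical node is assigned multiple positions ("tokens") on the ring,
        # which evens out the load and lets beefier machines take more tokens.
        for i in range(self.tokens_per_node):
            bisect.insort(self.ring, (ring_hash(f"{node}#{i}"), node))

    def coordinator(self, key: str) -> str:
        # Walk clockwise from the key's position; the first token found owns the key.
        pos = ring_hash(key)
        idx = bisect.bisect(self.ring, (pos,)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing()
for n in ("A", "B", "C"):
    ring.add_node(n)
print(ring.coordinator("shopping-cart:123"))   # one of 'A', 'B', 'C'
```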

Replication

In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 clockwise successor nodes in the ring.
e.g. in Figure 2: key K is coordinated by B and replicated at C and D

with virtual nodes, several consecutive positions may belong to the same physical node; to avoid this, the preference list skips positions in the ring so that it contains only distinct physical nodes, and it holds more than N nodes to cope with possible node failures (see the sketch below)
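
Continuing the HashRing sketch above, a rough sketch of how such a preference list could be built: walk clockwise from the key and skip any token whose physical node is already in the list. The function name and signature are made up.

```python
import bisect

def preference_list(ring: HashRing, key: str, n_replicas: int = 3) -> list:
    pos = ring_hash(key)
    start = bisect.bisect(ring.ring, (pos,))
    nodes = []
    for i in range(len(ring.ring)):                    # at most one full lap of the ring
        _, node = ring.ring[(start + i) % len(ring.ring)]
        if node not in nodes:                          # skip duplicate physical nodes
            nodes.append(node)
        if len(nodes) == n_replicas:
            break
    return nodes

print(preference_list(ring, "shopping-cart:123"))      # N distinct physical nodes, e.g. ['B', 'C', 'A']
```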

Data versioning(important for consistency)

Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously.
(temporary inconsistencies)

thus, multiple versions of the same object may be present in the system at the same time

vector clocks: capture causality between different versions of the same object

format: a list of (node, counter) pairs

data conflict (causally unrelated versions): return all versions to the client / business logic to reconcile

size restriction on the clock: node failures can route writes through many different coordinators, growing the (node, counter) list, so Dynamo truncates the oldest pairs once a threshold is reached
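
A small Python sketch of the vector-clock bookkeeping. The dict-based clock representation and function names are assumptions; the scenario mirrors the D1..D5 example in the paper's Figure 3.

```python
# A clock is a dict {node: counter}; "a descends b" means a causally includes b.
def descends(a: dict, b: dict) -> bool:
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def write(clock: dict, coordinator: str) -> dict:
    # The coordinator of a write bumps its own counter in the object's clock.
    new = dict(clock)
    new[coordinator] = new.get(coordinator, 0) + 1
    return new

d1 = write({}, "Sx")                 # {'Sx': 1}
d2 = write(d1, "Sx")                 # {'Sx': 2}
d3 = write(d2, "Sy")                 # {'Sx': 2, 'Sy': 1}
d4 = write(d2, "Sz")                 # {'Sx': 2, 'Sz': 1}

print(descends(d3, d2))                      # True  -> d2 is obsolete and can be dropped
print(descends(d3, d4), descends(d4, d3))    # False False -> siblings: return both to the client
```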

Execution of operation: get() & put()

how does a client find a node to operate on?

  1. route the request through a generic load balancer that selects a node based on load information
  2. use a partition-aware client library that routes the request directly to the appropriate coordinator

configurable values: R and W.
R is the minimum number of nodes that must participate in a successful read operation.
W is the minimum number of nodes that must participate in a successful write operation.
Setting R and W such that R + W > N yields a quorum-like system.
In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas.
For this reason, R and W are usually configured to be less than N, to provide better latency.
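
A quick sketch of the quorum arithmetic; (N, R, W) = (3, 2, 2) below is just the common configuration mentioned in the paper, and the latency numbers are invented.

```python
N, R, W = 3, 2, 2

# R + W > N guarantees the read set and write set of replicas overlap,
# so a read contacts at least one replica holding the latest write.
assert R + W > N

def quorum_latency(replica_latencies_ms: list, k: int) -> float:
    # The operation completes when the k-th fastest replica has replied,
    # so that reply dictates the latency.
    return sorted(replica_latencies_ms)[k - 1]

print(quorum_latency([12.0, 48.0, 230.0], W))   # 48.0: a put returns after 2 acks, not 3
```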

Handling Failures (temporary node failure): Hinted Handoff

sloppy quorum: reads and writes operate on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the ring

handling the failure of an entire data center: each object is replicated across multiple data centers
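
If a node in the preference list is temporarily unreachable, the write is accepted by another node together with a hint naming the replica it was meant for; the hinted copy is delivered back when that node recovers. A rough Python sketch with hypothetical node names, health map and hint layout.

```python
def write_with_handoff(key, value, pref_list, healthy, stores, hints):
    for node in pref_list:
        if healthy[node]:
            stores[node][key] = value
        else:
            # Hand the write to the next healthy node outside the preference list,
            # tagged with a hint naming the intended recipient.
            fallback = next(n for n in healthy if healthy[n] and n not in pref_list)
            stores[fallback][key] = value
            hints.setdefault(fallback, []).append((node, key, value))

stores = {n: {} for n in "ABCDE"}
healthy = {"A": False, "B": True, "C": True, "D": True, "E": True}
hints = {}

write_with_handoff("cart:1", "v1", ["A", "B", "C"], healthy, stores, hints)
print(hints)   # {'D': [('A', 'cart:1', 'v1')]} -- D holds A's replica until A comes back
```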

Handling Failures (permanent node failure): Replica synchronization

Merkle tree: To detect the inconsistencies between replicas faster and to minimize the amount of transferred data

leaves hash the individual keys' values and parents hash their children, so the tree is built bottom-up; during anti-entropy two replicas compare hashes top-down and only traverse the branches that differ
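
A compact Python sketch of computing a Merkle root over a key range and comparing two replicas; the hash choice (SHA-256) and the trick of duplicating the last node on odd levels are assumptions for illustration.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(values: list) -> bytes:
    level = [h(v) for v in values]              # leaves: hashes of individual values
    while len(level) > 1:
        if len(level) % 2:                      # pad odd levels by repeating the last hash
            level.append(level[-1])
        level = [h(level[i] + level[i + 1])     # parents hash the concatenation of children
                 for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"v1", b"v2", b"v3", b"v4"]
replica_b = [b"v1", b"v2", b"v3-stale", b"v4"]

# Equal roots -> the whole range is in sync; unequal roots -> descend into the
# children and transfer only the keys under the branches that differ.
print(merkle_root(replica_a) == merkle_root(replica_b))   # False
```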

Ring Membership

how are virtual nodes mapped to physical nodes?

When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent hash space) and maps nodes to their respective token sets.
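
Per the paper, a gossip-based protocol propagates these membership changes and maintains an eventually consistent view of membership (also quoted in the class discussion below). A toy Python sketch with a made-up {node: version} view layout.

```python
import random

def merge(a: dict, b: dict) -> dict:
    # Keep the higher version number per member.
    out = dict(a)
    for node, version in b.items():
        out[node] = max(out.get(node, -1), version)
    return out

views = {n: {m: 0 for m in "ABCD"} for n in "ABCD"}   # each node's current membership view
views["A"]["E"] = 1                                   # node A sees node E join the ring

random.seed(42)
rounds = 0
while not all("E" in view for view in views.values()):
    rounds += 1
    for node in views:                                # each round: gossip with one random peer
        peer = random.choice([p for p in views if p != node])
        views[node] = views[peer] = merge(views[node], views[peer])

print(f"membership change reached every node after {rounds} gossip round(s)")
```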

Adding/Removing Storage Nodes

when a new node joins, the nodes that previously owned the key ranges now claimed by its tokens transfer those keys to it, and then drop the now-redundant copies
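
Reusing the HashRing and preference_list sketches from above, a toy demonstration that only the keys falling into the new node's token ranges change owner when it joins.

```python
keys = [f"cart:{i}" for i in range(1000)]
before = {k: preference_list(ring, k, 1)[0] for k in keys}   # coordinator per key, 3-node ring

ring.add_node("D")                                           # D picks its tokens on the ring
after = {k: preference_list(ring, k, 1)[0] for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# The former owners of these ranges hand the keys to D and then stop serving them.
print(f"{len(moved)} of {len(keys)} keys transferred to D; the rest stay put")
```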


Implementation: all Java

  • request coordination
  • membership and failure detection
  • local persistence engine (pluggable: Berkeley DB, MySQL, or an in-memory buffer with a persistent backing store)

EXPERIENCES & LESSONS

Class discussion

internal service, so security concerns (authentication, authorization) are out of scope

virtual-node idea -> load balancing and flexibility: physical nodes land at pseudo-random positions on the logical ring according to their token sets

other large-scale distributed systems brought up in discussion:

blockchain: aimed at security & anonymity
Web3

how consistent hashing works: the ring is partitioned into key ranges

DHT (distributed hash table) ring: each node owns the range between its predecessor's position and its own

how data is found: walk clockwise along the ring efficiently to locate the owning node

gossip-based protocol: propagates membership changes and maintains an eventually consistent view of membership

use binary search over the sorted token list to find the destination node

distinct physical nodes: the preference list skips particular positions in the ring
the N successor positions are virtual nodes, and several of them may sit on the same physical node, so positions that map to an already-chosen physical node are skipped

Brewer's conjecture: CAP Theorem

consistency, availability, and partition-tolerance: pick 2 out of 3!
Dynamo's design: network partitions are taken as a given, so strong consistency is sacrificed to gain high availability
