Dynamo: Amazon's Highly Available Key-value Store
Authors: DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin,
Sivasubramanian, Vosshall, and Vogels
Overview
- Amazon's platform comprises many different services with varied characteristics
- Shopping carts, customer preferences, product catalog
- Reliability and scalability are critical
- Customers will take their business elsewhere if they don't get the
service they expect
- Millions of components work together to provide Amazon's service using a highly decentralized, loosely coupled, service-oriented architecture
- Google also uses this model of building reliable software to run on
commodity PCs
- Bottom line: failure is the norm
- Dynamo's focus: providing primary-key access to a data store
- Sufficient for many services like shopping carts, customer
preferences
- An RDBMS is overkill for these workloads and does not allow trading consistency for availability
- Uses eventual consistency (lazy replication/gossiping) to trade
consistency for availability
System Design
- Query Model: DHT-like semantics. A data item (blob) is identified by a key (a minimal interface sketch follows this list).
- ACID Properties: Trade consistency for availability. No isolation
guarantees are provided.
- Efficiency: Example SLA: provide a response within 300ms for 99.9%
of its requests for a peak client load of 500 requests per
second.
- Assumptions: The environment is non-hostile.
- Key Design Choices:
- Increase availability using optimistic replication
- Data store is always writable -- conflict resolution happens during
reads
- Application can perform conflict resolution
- Incremental scalability
- Symmetry - nodes have same responsibilities
- Decentralization
- Heterogeneity
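A minimal Python sketch of this kind of primary-key interface. The names DynamoStore and Context are illustrative, not Amazon's actual API; the point is that get can return several versions and put carries the version context being superseded.

    # Illustrative sketch of a Dynamo-style primary-key interface.
    # DynamoStore and Context are hypothetical names, not Amazon's API.

    class Context:
        """Opaque version metadata (e.g., a vector clock) returned by get()."""
        def __init__(self, clock):
            self.clock = clock

    class DynamoStore:
        def get(self, key):
            """Return a list of (value, context) pairs; more than one pair
            means divergent versions the application must reconcile."""
            raise NotImplementedError

        def put(self, key, context, value):
            """Store value under key; context identifies which version(s)
            this write supersedes."""
            raise NotImplementedError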
System Architecture
- Operations supported are get and put
- Each node is assigned an ID in a Chord-like circular ID space (chosen
randomly from the ID space)
- Each key is hashed using MD5 to generate an ID
- Data with ID id is stored at node n = successor(id) (in Chord terminology)
- Data is replicated at N-1 successors of n
- Nodes responsible for storing a particular key form its preference list (sketched in code after the figure link below)
  - e.g., Node D below stores (A, B], (B, C], and (C, D]
- Image at:
http://s3.amazonaws.com/wernervogels/public/sosp/sosp-figure2-small.png
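A rough Python sketch of consistent hashing with a preference list, assuming one token per node and MD5-hashed keys; Dynamo's virtual nodes are omitted, and the node names and example key below are made up.

    import hashlib
    from bisect import bisect_left

    class Ring:
        def __init__(self, nodes):
            # Place each node on the circular ID space; for illustration its
            # position (token) is just the MD5 hash of its name.
            self.tokens = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def preference_list(self, key, n_replicas=3):
            # The first n_replicas distinct nodes walking clockwise from the
            # key's position: the successor plus its N-1 successors.
            key_id = self._hash(key)
            positions = [t for t, _ in self.tokens]
            start = bisect_left(positions, key_id) % len(self.tokens)
            return [self.tokens[(start + i) % len(self.tokens)][1]
                    for i in range(min(n_replicas, len(self.tokens)))]

    ring = Ring(["A", "B", "C", "D"])
    print(ring.preference_list("cart:12345"))   # three nodes clockwise from the key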

Membership and Failure Detection
- An administrator uses a command-line or browser-based tool to
add/remove nodes
- The tool contacts a node, provides it with the new membership
information, and the node writes the info to persistent storage
- Every second, each node randomly chooses another node and exchanges a view of the membership (a gossip sketch appears at the end of this subsection)
- At startup, nodes choose their token sets (IDs) and this partitioning
information is also exchanged via the gossiping protocol
- If a new node X, inserted between A and B in the figure above, joins and becomes responsible for a set of keys, its successors (e.g., B, C, and D) hand off the appropriate subset of their keys to the new node
- Each node knows the token ranges handled by the other nodes and can
forward requests directly
- To avoid partitioning, all nodes know about a set of seeds and
exchange membership information with them
- Failure is detected when a node fails to respond to communication
- Because data is replicated, a failed node does not make data unavailable; the remaining replicas continue to serve requests
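A small Python sketch of the once-per-second membership gossip. The view data structure and the "higher version wins" merge rule are illustrative assumptions, not the paper's exact protocol.

    import random

    class Member:
        def __init__(self, name, peers=None):
            self.name = name
            self.peers = peers or []          # other Member objects
            # membership view: node name -> (status, version)
            self.view = {name: ("up", 1)}

        def gossip_once(self):
            # Pick a random peer and merge membership views in both
            # directions, keeping the entry with the higher version.
            if not self.peers:
                return
            peer = random.choice(self.peers)
            for a, b in ((self, peer), (peer, self)):
                for node, (status, version) in a.view.items():
                    if node not in b.view or b.view[node][1] < version:
                        b.view[node] = (status, version)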
get/put Operations
- A request is routed to a coordinator using a generic load
balancer or partition-aware client
- The coordinator is one of the top N nodes in the preference list
- To meet SLAs, the write coordinator is usually the node that replied fastest to a prior read (this assumes writes typically follow reads)
- A quorum-like system is used to ensure that R replicas participate in a
successful read and W replicas participate in a successful write (where R
and W are configurable parameters)
- The coordinator replies that a get is successful if R replicas respond,
and a put is successful if W replicas respond
- Lower values for R and W give better latency but weaker consistency (see the quorum sketch below)
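A Python sketch of the R/W quorum rule, with replicas faked as in-memory dicts and made-up function names: a put reports success once W replicas acknowledge, and a get returns once R replicas have responded.

    N, R, W = 3, 2, 2                         # typical Dynamo configuration

    def coordinate_put(replicas, key, value):
        # Send the write to the top N nodes in the preference list and
        # report success if at least W of them acknowledged.
        acks = 0
        for replica in replicas[:N]:
            try:
                replica[key] = value          # stand-in for a network call
                acks += 1
            except Exception:
                pass                          # a failed replica simply doesn't ack
        return acks >= W

    def coordinate_get(replicas, key):
        # Read until R replicas have responded; return every version seen
        # so the client can reconcile if they diverge.
        responses = []
        for replica in replicas[:N]:
            if key in replica:
                responses.append(replica[key])
            if len(responses) >= R:
                return responses
        return None                           # fewer than R replicas responded

    replicas = [dict() for _ in range(N)]     # three in-memory "replicas"
    coordinate_put(replicas, "cart:12345", ["item-1"])
    print(coordinate_get(replicas, "cart:12345"))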
Versioning
- A get may produce several versions of the same object
- e.g., the latest write did not propagate to a replica that
responded to a subsequent read
- Vector clocks are used to keep versioning information
- Image at:
http://s3.amazonaws.com/wernervogels/public/sosp/sosp-figure3-small.png

- When a client does a put, it must specify which version it is
updating
- If a get results in multiple versions, the client must reconcile
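A Python sketch of vector-clock comparison, loosely following the figure above: writes handled by different coordinators produce clocks where neither descends from the other, so a get returns both and the client must reconcile.

    def increment(clock, node):
        # Return a new clock (dict: node name -> counter) with node's
        # counter bumped by one.
        new = dict(clock)
        new[node] = new.get(node, 0) + 1
        return new

    def descends(a, b):
        # True if clock a is equal to or newer than clock b on every entry.
        return all(a.get(node, 0) >= count for node, count in b.items())

    def concurrent(a, b):
        # Neither clock descends from the other: the versions diverged.
        return not descends(a, b) and not descends(b, a)

    v1 = increment({}, "Sx")        # {'Sx': 1}
    v2 = increment(v1, "Sy")        # {'Sx': 1, 'Sy': 1}
    v3 = increment(v1, "Sz")        # {'Sx': 1, 'Sz': 1}
    print(concurrent(v2, v3))       # True -> a get returns both versions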
Implementation and Evaluation Discussion
- Typically, N=3, R=2, and W=2 (see the quorum-overlap check after this list)
- Experiments were run on a couple hundred nodes
- The system is configured so that data is replicated across nodes at
different data centers
- Divergent Versions: an experiment looked at the shopping cart service
over a 24 hour period
- 99.94% of requests saw 1 version
- 0.00057% saw 2 versions
- 0.00047% saw 3 versions
- 0.00009% saw 4 versions
- Does this system scale?
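A quick check of why the typical configuration behaves well: with N=3, R=2, W=2 we have R + W > N, so every read quorum overlaps every write quorum and a read contacts at least one replica that saw the latest successful write (ignoring sloppy-quorum and hinted-handoff corner cases).

    from itertools import combinations

    N, R, W = 3, 2, 2
    replicas = range(N)
    assert R + W > N                  # 2 + 2 = 4 > 3
    # Brute-force: every W-replica write set intersects every R-replica read set.
    assert all(set(w) & set(r)
               for w in combinations(replicas, W)
               for r in combinations(replicas, R))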
Sami Rollins
Wednesday, 07-Jan-2009 15:13:20 PST