Replication


Overview

Replication involves storing copies of data at multiple computers. There are several benefits of using this technique, including increased availability (the data remains accessible even if one copy fails or is unreachable), fault tolerance, and improved performance (requests can be served by a nearby or lightly loaded replica).

From the client's point of view, there is a single logical copy of the data. If a client makes an update to one replica, that change should be reflected at all other replicas. Consider the following scenario (sketched in code after the list below) -- we have a bank with two replicated servers. A user may connect to either one. Initially, account A has $5 and account B has $0.

  1. The user connects to server 1 and updates account A to contain $10.
  2. The user transfers $5 from account A to account B via server 2.
  3. Server 1 fails before data is propagated.
  4. Account A has a balance of 0 and account B has a balance of $5.
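
A minimal sketch of this scenario in Python (the class and variable names are illustrative, not part of the original notes); it simulates two replicas that do not propagate updates before one of them fails:

  class Server:
      def __init__(self, accounts):
          self.accounts = dict(accounts)   # this server's local copy of the data

      def set_balance(self, account, amount):
          self.accounts[account] = amount

      def transfer(self, src, dst, amount):
          self.accounts[src] -= amount
          self.accounts[dst] += amount

  initial = {"A": 5, "B": 0}
  server1 = Server(initial)
  server2 = Server(initial)

  server1.set_balance("A", 10)     # 1. update A to $10 at server 1
  server2.transfer("A", "B", 5)    # 2. transfer $5 from A to B at server 2
  # 3. server 1 fails before its update reaches server 2
  print(server2.accounts)          # 4. {'A': 0, 'B': 5} -- the $10 update is lost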

This example violates sequential consistency: there is no single interleaving of the user's operations, applied in program order, that produces the final balances A = $0 and B = $5 -- the update that set A to $10 was simply lost.

This requirement is similar to causality: the transfer causally depends on the earlier update, so any replica that applies the transfer should also reflect the update it depends on.

The basic model for managing replicated data includes the following components: clients, which issue requests; front ends, which communicate with one or more replica managers on behalf of the clients and hide the replication from them; and replica managers, which each hold a copy of the data and apply operations to it.
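
A rough sketch of how these roles fit together (all class and method names here are illustrative assumptions, not a prescribed API):

  class ReplicaManager:
      # Holds one copy of the data and applies operations to it.
      def __init__(self):
          self.state = {}

      def apply(self, request):
          op, key, value = request
          if op == "write":
              self.state[key] = value
          return self.state.get(key)

  class FrontEnd:
      # Hides replication from the client by talking to replica managers.
      def __init__(self, replica_managers):
          self.replica_managers = replica_managers

      def request(self, request):
          # How many managers are contacted, and in what order, depends on
          # the replication model (passive, active, or lazy).
          return self.replica_managers[0].apply(request)

  # A client calls the front end as if there were a single copy of the data.
  fe = FrontEnd([ReplicaManager(), ReplicaManager()])
  fe.request(("write", "A", 5))
  print(fe.request(("read", "A", None)))   # 5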

There are several models for providing replication: passive, active, and lazy.


Passive Replication

In the passive replication model, front ends interact with a single, primary replica manager. The primary replica manager responds to requests and sends updates to several secondary replica managers. In the event that the primary fails, a secondary replica manager can take its place. The sequence of events is as follows (a code sketch follows the list):

  1. Request: The front end issues the request, containing a unique identifier, to the primary replica manager.
  2. Coordination: The primary takes each request atomically, in the order in which it receives it. It checks the unique identifier in case it has already executed the request; if so, it simply re-sends the response.
  3. Execution: The primary executes the request and stores the response.
  4. Agreement: If the request is an update then the primary sends the updated state, the response, and the unique identifier to all the backups. The backups send an acknowledgement.
  5. Response: The primary responds to the front end, which hands the response back to the client.
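
A rough sketch of the primary-backup flow described above (the names are illustrative; a real implementation would use view-synchronous group communication rather than direct method calls):

  class Backup:
      def __init__(self):
          self.state = {}

      def update(self, request_id, new_state, response):
          self.state = dict(new_state)          # install the primary's new state
          return "ack"

  class Primary:
      def __init__(self, backups):
          self.state = {}
          self.backups = backups
          self.responses = {}                   # request id -> cached response

      def handle(self, request_id, op, key, value=None):
          if request_id in self.responses:      # 2. duplicate request:
              return self.responses[request_id] #    just re-send the old response
          if op == "write":                     # 3. execute the request
              self.state[key] = value
          response = self.state.get(key)
          if op == "write":                     # 4. agreement: push state to backups
              for b in self.backups:
                  assert b.update(request_id, self.state, response) == "ack"
          self.responses[request_id] = response
          return response                       # 5. reply to the front end

  primary = Primary([Backup(), Backup()])
  print(primary.handle("req-1", "write", "A", 10))   # 10
  print(primary.handle("req-1", "write", "A", 10))   # duplicate: cached 10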

How can we communicate updates while ensuring that we can tolerate a failure of the primary replica before, during, and after updating? This is a group communication problem. However, we need the ability to manage group membership so that we can be certain that everyone has received all updates. A group membership service must maintain an agreed view of which replica managers currently belong to the group, detect when members fail or join, and notify the remaining members whenever that membership changes.

Using this service, we can provide view-synchronous communication. View-synchronous communication is an extension of reliable multicast. It uses a view, which is a list of the processes currently belonging to the group. When membership changes, a new view is sent to all members. All messages that originate in a given view must be delivered before a new view is delivered. It is a bit like a cut in the message timeline.
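
A small illustration (hypothetical process names, not a real group-communication toolkit) of the view-synchronous delivery rule -- every message multicast in view i is delivered before view i+1 is delivered:

  delivery_log = [
      ("view", 1, ["p1", "p2", "p3"]),
      ("msg",  1, "update-a"),
      ("msg",  1, "update-b"),
      ("view", 2, ["p1", "p2"]),       # p3 crashed; the new view acts as a cut
      ("msg",  2, "update-c"),
  ]

  def view_synchronous(log):
      current_view = None
      for kind, view_id, _ in log:
          if kind == "view":
              current_view = view_id
          elif view_id != current_view:
              return False             # a message crossed a view boundary
      return True

  print(view_synchronous(delivery_log))   # True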

Implementing view-synchronous communication is costly and requires several rounds of communication for each multicast. This can also lead to delays in responding to the client -- a clear disadvantage of passive replication. However, passive replication can tolerate n crash failures if n+1 replicas are present. It is also very easy from the front end's point of view, since it only communicates with a single server. Finally, because all replicas have the same record of updates, it is easy for a secondary replica to take over from a failed primary.


Active Replication

In the active replication model, a front end multicasts a request to all replicas. The sequence of events is as follows (a code sketch follows the list):

  1. Request: The front end attaches a unique identifier to the request and multicasts it to the group of replica managers, using a totally ordered, reliable multicast primitive. The front end is assumed to fail by crashing at worst. It does not issue the next request until it has received a response.
  2. Coordination: The group communication system delivers the request to every correct replica manager in the same (total) order.
  3. Execution: Every replica manager executes the request. Since they are state machines and since requests are delivered in the same total order, correct replica managers all process the request identically. The response contains the client's unique request identifier.
  4. Agreement: No agreement phase is needed, because of the multicast delivery semantics.
  5. Response: Each replica manager sends its response to the front end. The number of replies that the front end collects depends upon the failure assumptions and on the multicast algorithm. If, for example, the goal is to tolerate only crash failures and the multicast satisfies uniform agreement and ordering properties, then the front end passes the first response to arrive back to the client and discards the rest.
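
A sketch of active replication as replicated state machines (the totally ordered, reliable multicast is simulated here by a single shared list of requests; building that multicast is the hard part in practice, and the names below are illustrative):

  class ReplicaManager:
      def __init__(self):
          self.state = {}

      def apply(self, request):
          request_id, op, key, value = request
          if op == "write":
              self.state[key] = value
          return (request_id, self.state.get(key))

  replicas = [ReplicaManager() for _ in range(3)]

  totally_ordered_requests = [       # every correct replica sees this same order
      ("req-1", "write", "A", 10),
      ("req-2", "write", "B", 5),
      ("req-3", "read",  "A", None),
  ]

  for request in totally_ordered_requests:
      responses = [r.apply(request) for r in replicas]
      print(responses[0])            # with crash failures only, keep the first reply

  # Identical requests in an identical order yield identical replica states.
  assert all(r.state == replicas[0].state for r in replicas)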

If the goal of the system is to tolerate Byzantine failures, then to tolerate f failures the system must use 2f+1 replicas, and the front end must wait until it collects f+1 identical responses.
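
A sketch of how a front end might vote on the replies to mask Byzantine replicas: with 2f+1 replicas, any value reported by at least f+1 of them must have come from at least one correct replica (the names and values below are illustrative):

  from collections import Counter

  def vote(responses, f):
      value, count = Counter(responses).most_common(1)[0]
      if count >= f + 1:
          return value
      raise RuntimeError("no value was received f+1 times")

  f = 1                           # tolerate one Byzantine replica
  responses = [10, 10, 99]        # 2f+1 = 3 replicas; one of them is lying
  print(vote(responses, f))       # 10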

Implementing totally ordered, reliable multicast is equivalent to the consensus problem. It cannot be done in an asynchronous system unless we use a technique such as failure detectors. In addition, active replication may be slow from the client's point of view.


Lazy Replication

The goal of lazy replication is to provide increased performance (lower latency) at the cost of reduced consistency. Replicas share information by periodically sending gossip messages. Replicas may be slightly out-of-sync, but eventually catch up. Vector clocks are used to ensure causal ordering of messages. The process works as follows:

  1. Request: The front end normally sends requests to only a single replica manager. However, a front end will communicate with a different replica manager when the one it normally uses fails or becomes unreachable, and it may try one or more others if the normal manager is heavily loaded.
  2. Update response: If the request is an update then the replica manager replies as soon as it has received the update.
  3. Coordination: The replica manager that receives a request does not process it until it can apply the request according to the required ordering constraints. This may involve receiving updates from other replica managers, in gossip messages. No other coordination between replica managers is involved.
  4. Execution: The replica manager executes the request.
  5. Query response: If the request is a query then the replica manager replies at this point.
  6. Agreement: The replica managers update one another by exchanging gossip messages, which contain the most recent updates they have received. They are said to update one another in a lazy fashion, in that gossip messages may be exchanged only occasionally, after several updates have been collected, or when a replica manager finds out that it is missing an update sent to one of its peers that it needs to process a request.

Causality is preserved by using vector clocks. The front end keeps a vector clock with an entry for each replica manager. The clock is included in every request and the replica manager returns an updated clock with every response.
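
A small sketch of the vector clock bookkeeping (the entry names are illustrative; each entry counts the updates seen from one replica manager):

  def dominates(a, b):
      # True if clock a has seen at least everything clock b has seen.
      return all(a.get(k, 0) >= v for k, v in b.items())

  def merge(a, b):
      # Pointwise maximum: the combined knowledge of two clocks.
      return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

  replica_clock = {"rm1": 3, "rm2": 1}
  client_clock  = {"rm1": 2, "rm2": 1}
  print(dominates(replica_clock, client_clock))   # True: safe to answer the query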

Queries: When a replica receives a query, it can only respond if its timestamp is greater than or equal to the timestamp sent by the client -- that is, the replica has seen at least as many updates as the client has. If the replica's timestamp is smaller than the client's timestamp, the replica will queue the request until it has received the relevant updates.

Updates: An update is essentially a write request. Updates are processed in much the same way as queries: a replica acknowledges an update as soon as it receives it, but it must wait until all causally-previous updates have arrived before applying it.
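
A sketch of a replica manager that holds queries and updates until its own clock dominates the timestamp on the request (a simplification of the lazy scheme above; the names are illustrative):

  class ReplicaManager:
      def __init__(self, name, peers):
          self.name = name
          self.clock = {p: 0 for p in peers}   # updates seen from each replica
          self.state = {}
          self.pending = []                    # requests waiting on missing updates

      def _ready(self, req_clock):
          return all(self.clock.get(k, 0) >= v for k, v in req_clock.items())

      def query(self, key, req_clock):
          if not self._ready(req_clock):
              self.pending.append(("query", key, req_clock))
              return None                      # answered later, after gossip
          return self.state.get(key), dict(self.clock)

      def update(self, key, value, req_clock):
          if self._ready(req_clock):
              self.state[key] = value
              self.clock[self.name] += 1       # one more update applied here
          else:
              self.pending.append(("update", key, value, req_clock))
          return dict(self.clock)              # acknowledge the update right away

  rm = ReplicaManager("rm1", ["rm1", "rm2"])
  print(rm.update("A", 10, {"rm1": 0, "rm2": 0}))   # {'rm1': 1, 'rm2': 0}
  print(rm.query("A", {"rm1": 1, "rm2": 1}))        # None: still missing rm2's update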

Gossiping: Periodically, replicas exchange messages containing a log of past updates. The frequency of gossip messages depends on the application. More frequent gossip results in more overhead, while less frequent gossip increases the chance that replicas will be out of sync.
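
A sketch of one gossip exchange (the data layout is an assumption made for illustration: each log entry is a (timestamp, update) pair). The sender passes along updates the receiver has not yet seen, and the receiver merges the two vector clocks:

  def gossip(sender_log, sender_clock, receiver_log, receiver_clock):
      seen = {ts for ts, _ in receiver_log}
      for ts, update in sender_log:
          if ts not in seen:                   # forward only missing updates
              receiver_log.append((ts, update))
      merged = {k: max(sender_clock.get(k, 0), receiver_clock.get(k, 0))
                for k in set(sender_clock) | set(receiver_clock)}
      return receiver_log, merged

  log1, clock1 = [(("rm1", 1), ("A", 10))], {"rm1": 1, "rm2": 0}
  log2, clock2 = [(("rm2", 1), ("B", 5))],  {"rm1": 0, "rm2": 1}
  log2, clock2 = gossip(log1, clock1, log2, clock2)
  print(len(log2), clock2)   # 2 updates known; both clock entries are now 1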

Lazy replication is good enough for lots of applications. It deals with failure nicely and provides high availability. Unfortunately, updates can be lost in the event that a replica crashes before its updates are sent to the other replicas. Lazy replication is also not appropriate for applications where tight synchronization (e.g., video conferencing) is required.


Sami Rollins

Date: 2008-02-13