Multicast and Group Communication

Overview

A common operation in many applications is to deliver the same message to multiple recipients. Your booksite CDN is a good example of this. When a new book is added to the site, it needs to be distributed to all of the mirror sites. One way of doing this is to have the main server send the new content to each of the mirrors -- multiple unicast. This approach puts a significant burden on the server, which must transmit one full copy per mirror over its own network link. IP multicast and overlay multicast provide mechanisms for delivering the same message to multiple recipients in a more efficient manner.

More generally, group communication refers to the idea of enabling a process to communicate with a group of processes, typically without knowing which processes belong to the group. Group communication is often concerned with reliable delivery of messages and the order in which messages are received at each process.

IP Multicast

IP Multicast enables one-to-many communication at the network layer. A source sends a packet destined for a group of other hosts and the intermediate routers take care of replicating the packet as necessary. The intermediate routers are also responsible for tracking which hosts are members of the group. Applications that use IP Multicast send their data over UDP, so delivery is unreliable: packets may be lost or arrive out of order.

To join a multicast group, a host sends a join message, using the Internet Group Management Protocol (IGMP), to its first-hop router. Groups are identified by a single class D IP address (in the range 224.0.0.0 to 239.255.255.255). In this way, messages destined for a group are addressed to the appropriate IP address, just like any other message.
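From an application's point of view, joining a group usually means setting a socket option and letting the operating system send the IGMP message. The following is a minimal sketch using Python's standard socket module; the helper names membership_request and open_receiver, and any group address and port passed to them, are illustrative assumptions rather than part of any protocol.

```python
import ipaddress
import socket


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Build the 8-byte request passed to IP_ADD_MEMBERSHIP."""
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast (class D) address")
    # 4 bytes of group address followed by 4 bytes of local interface address.
    return socket.inet_aton(group) + socket.inet_aton(interface)


def open_receiver(group: str, port: int) -> socket.socket:
    """Open a UDP socket and ask the kernel to join the group.

    The kernel, not this code, emits the IGMP message to the first-hop router.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock
```

A receiver would then call something like open_receiver("224.1.1.1", 5007) and read datagrams with recvfrom. Whether any traffic actually arrives depends on the local network supporting multicast.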

In order to deliver a single message to several destinations, the routers that connect the members of the group organize into a tree. There are several algorithms for determining the edges of the tree. A shared tree is formed by selecting a center node. When a router receives a join from an attached host, it sends a join message toward the center node for the given group. Each router that receives the join will note that a downstream router is joined to the given group. The join will stop when it reaches the center, or if it reaches another router that is already part of the shared tree.
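The way a join propagates toward the center can be illustrated with a small simulation. This is only a sketch: the function join_shared_tree, the router names, and the dictionary used for per-router state are assumptions made for illustration, and real routers find the path to the center through their unicast routing tables.

```python
def join_shared_tree(path_to_center, group, state):
    """Propagate a join hop by hop toward the center node.

    path_to_center: routers from the host's first-hop router to the center.
    state: dict mapping router -> {group -> set of downstream neighbors}.
    Returns the routers that processed the join.
    """
    visited = []
    downstream = "host"  # the first-hop router records the attached host
    for router in path_to_center:
        groups = state.setdefault(router, {})
        already_on_tree = group in groups
        # Note that a downstream neighbor is joined to this group.
        groups.setdefault(group, set()).add(downstream)
        visited.append(router)
        if already_on_tree:
            break  # the rest of the path is already part of the shared tree
        downstream = router
    return visited


state = {}
first = join_shared_tree(["R1", "R2", "C"], "g", state)
second = join_shared_tree(["R4", "R2", "C"], "g", state)
print(first, second)
```

The first join travels all the way to the center C. The second stops at R2, which is already on the tree and simply adds R4 as another downstream neighbor.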

Overlay Multicast

Unfortunately, IP multicast is not widely supported. Though not as efficient, overlay multicast achieves many of the same benefits without requiring support from ISPs. In an overlay multicast scheme, end hosts organize into a delivery tree. Each host receives content from its parent and delivers content to its children. In this way, the load of distributing content is shared by all of the members of the group.

The main components of an overlay multicast application are the join algorithm and the procedure for ensuring fault tolerance.

Join Algorithm

Joining an overlay multicast group involves determining an appropriate parent and possibly a set of children. This is not unlike joining a peer-to-peer network, which we'll discuss in a few weeks. One method, derived from the Host Multicast Tree Protocol (HMTP), is to use a rendezvous point (RP). A new node (n) queries the RP for the root of the tree. n begins at the root and recursively walks down the tree searching for an appropriate place to join. At each iteration, n determines whether the current node is an appropriate parent, for example by asking whether it can accept new children and/or measuring the RTT to reach it. n either selects the current node as a parent, or asks it for its set of children and repeats the process for each child.
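The walk down the tree can be sketched as follows. This is a simplified illustration, not HMTP itself: the Node class, the max_children limit, the rtt_ms field (standing in for a measurement the joining node would make), and the RTT threshold are all assumptions, and the search here simply takes the first acceptable node in depth-first order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    max_children: int
    rtt_ms: float  # hypothetical RTT as measured by the joining node
    children: list = field(default_factory=list)

    def can_accept(self) -> bool:
        return len(self.children) < self.max_children


def find_parent(current, rtt_limit_ms):
    """Return the first acceptable parent at or below `current`, or None."""
    if current.can_accept() and current.rtt_ms <= rtt_limit_ms:
        return current
    # Otherwise ask the current node for its children and try each in turn.
    for child in current.children:
        candidate = find_parent(child, rtt_limit_ms)
        if candidate is not None:
            return candidate
    return None


def join(root, newcomer, rtt_limit_ms=100.0):
    """Attach `newcomer` under the chosen parent; return that parent or None."""
    parent = find_parent(root, rtt_limit_ms)
    if parent is not None:
        parent.children.append(newcomer)
    return parent
```

For example, if the root is already full, a new node n skips it and attaches to the first of the root's children that has room and is close enough.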

Fault Tolerance

Nodes may fail at any time, so detecting and tolerating faults is extremely important. Typically, nodes exchange "keep alive" messages to notify neighbors that they are still alive. In HMTP, child nodes periodically send a REFRESH message to their parents. If a parent fails to receive a REFRESH message from a child within a given period of time, it can remove that child from its child list. If a child attempts to contact its parent and cannot, it must find a new parent. One option is for the parentless child to reinitiate the join procedure. For efficiency, a child can also cache information about the nodes it tried during its initial join procedure and attempt to join nodes it previously identified as potentially good parents.
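Both sides of this failure handling can be sketched in a few lines. The class and function names, the timeout value, and the use of explicit timestamps passed in by the caller are assumptions made for illustration; they are not taken from the HMTP specification.

```python
class ParentState:
    """Tracks when each child last sent a REFRESH."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_refresh = {}  # child id -> time of last REFRESH

    def on_refresh(self, child, now):
        self.last_refresh[child] = now

    def expire_children(self, now):
        """Remove and return children whose REFRESH is overdue."""
        dead = [c for c, t in self.last_refresh.items()
                if now - t > self.timeout_s]
        for c in dead:
            del self.last_refresh[c]
        return dead


def pick_new_parent(cached_candidates, is_reachable):
    """Try cached candidates in order; None means rerun the full join."""
    for candidate in cached_candidates:
        if is_reachable(candidate):
            return candidate
    return None
```

A parent would call expire_children on a timer, while an orphaned child would call pick_new_parent with the list it cached during its original join and fall back to the rendezvous point only if that returns None.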

Group Communication and Reliable Multicast

Multicast is essentially an implementation of group communication. The basic model for multicast supports the following operations:

To B-multicast (g, m): for each process p in g, send(p, m);

On receive(m) at p: B-deliver(m) at p.
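The two operations above translate almost directly into code. In this sketch the Process class and the send function are stand-ins: send is modeled as a direct, reliable function call rather than a real network channel.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.delivered = []  # messages handed to the application

    def receive(self, m):
        # Basic multicast delivers on receipt, with no extra checks.
        self.b_deliver(m)

    def b_deliver(self, m):
        self.delivered.append(m)


def send(p, m):
    """Stand-in for a one-to-one send; here a direct call."""
    p.receive(m)


def b_multicast(g, m):
    """Send m to every process in group g."""
    for p in g:
        send(p, m)


group = [Process("p1"), Process("p2"), Process("p3")]
b_multicast(group, "hello")
```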

If the communication channels between processes are unreliable, it may be desirable to add reliability to basic multicast. Reliability typically involves some kind of acknowledgement scheme where processes acknowledge receipt of a particular message. In practice, this can result in ACK implosion: a process sends one message and receives n-1 responses, where n is the total number of processes in the group. To reduce the number of acknowledgements required, a negative acknowledgement scheme can be used, in which a receiver stays silent unless it detects that it has missed a message.
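One way a receiver can detect a missed message is by looking for gaps in the sender's sequence numbers. The sketch below assumes a single sender that numbers its messages starting from 0; the class name NackReceiver is made up for illustration, and retransmission and NACK suppression are left out.

```python
class NackReceiver:
    """Detects gaps in one sender's sequence numbers."""

    def __init__(self):
        self.next_expected = 0
        self.received = {}  # seq -> payload

    def on_message(self, seq, payload):
        """Store the message and return the sequence numbers to NACK."""
        self.received[seq] = payload
        # Anything between the next expected number and seq is missing.
        missing = [s for s in range(self.next_expected, seq)
                   if s not in self.received]
        # Advance past every message now held in order.
        while self.next_expected in self.received:
            self.next_expected += 1
        return missing
```

When messages arrive in order the receiver sends nothing at all, so the sender hears from the group only when something has gone wrong.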

Basic, reliable multicast does not ensure that messages are delivered to the application in any particular order. If a message is delayed in the network, it may arrive after another message that was actually sent first. Depending on the application, there are several possible ordering requirements a developer might enforce:

FIFO ordering: If a correct process issues multicast(g, m) and then multicast(g, m'), then every correct process that delivers m' will deliver m before m'.

Causal ordering: If multicast(g, m) $\to$ multicast(g, m'), where $\to$ is the happened-before relation induced only by messages sent between the members of g, then any correct process that delivers m' will deliver m before m'.

Total ordering: If a correct process delivers message m before it delivers m', then any other correct process that delivers m' will deliver m before m'.
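Of the three orderings listed above, FIFO is the simplest to enforce. The following is a minimal sketch, assuming each sender numbers its own messages starting from 0 and each receiver holds back any message that arrives early; the class name FifoReceiver is an illustrative assumption.

```python
from collections import defaultdict


class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)  # sender -> next number to deliver
        self.held = defaultdict(dict)     # sender -> {seq: message} held back
        self.delivered = []

    def receive(self, sender, seq, m):
        self.held[sender][seq] = m
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            self.delivered.append(self.held[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1


r = FifoReceiver()
r.receive("p", 1, "m'")  # arrives early, so it is held back
r.receive("p", 0, "m")   # releases both m and m'
```

Messages from different senders are not ordered relative to each other; that is exactly the gap that causal and total ordering address.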

FIFO ordering can be achieved by having each process attach a sequence number to every message that it sends to the group, so that receivers can hold back any message that arrives ahead of its predecessors. Total ordering can be achieved in two ways. The first is to use a separate sequencer process that assigns sequence numbers. The second is a distributed algorithm whereby a process sends a message, the remaining processes each propose a sequence number for the message, and the sender selects the highest sequence number proposed. Causal ordering can be achieved by using vector timestamps. Refer to Figure 12.16 for the algorithm.
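To give a feel for the vector timestamp approach, here is a sketch of the usual hold-back rule. It is not a reproduction of the figure: the class name CausalProcess, the tuple message format, and the fixed group size n are assumptions made for illustration.

```python
class CausalProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n      # v[k] = multicasts from process k delivered here
        self.held = []        # received but not yet deliverable
        self.delivered = []

    def multicast(self, m):
        """Stamp m with this process's vector and deliver it locally."""
        self.v[self.pid] += 1
        self.delivered.append(m)
        return (self.pid, list(self.v), m)

    def receive(self, msg):
        if msg[0] == self.pid:
            return  # own message, already delivered in multicast()
        self.held.append(msg)
        self._drain()

    def _deliverable(self, sender, ts):
        # Next message from the sender, and nothing it depends on is missing.
        return (ts[sender] == self.v[sender] + 1 and
                all(ts[k] <= self.v[k] for k in range(len(ts)) if k != sender))

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for msg in list(self.held):
                sender, ts, m = msg
                if self._deliverable(sender, ts):
                    self.held.remove(msg)
                    self.delivered.append(m)
                    self.v[sender] += 1
                    progress = True


p0, p1, p2 = (CausalProcess(i, 3) for i in range(3))
msg1 = p0.multicast("m")
p1.receive(msg1)
msg2 = p1.multicast("m'")  # causally after m
p2.receive(msg2)           # held back: p2 has not yet delivered m
p2.receive(msg1)           # delivers m, then m'
```

Here p2 receives m' before m, but because the timestamp on m' shows that its sender had already delivered m, p2 holds m' back until m arrives.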

Sami Rollins

Wednesday, 07-Jan-2009 15:13:20 PST