Lab 3: Distributed Failure Detection

Create your repository on GitHub: https://classroom.github.com/a/Nbg82g3n

One of the most difficult problems in distributed systems is determining when a node has failed. Is it simply running slow? Is the network down but the node is up? Did it shut down cleanly? Crash? In this lab, you will build a failure detection framework using heartbeats – intermittent status updates that notify the system that a node is still online.

To get started, design an overlay network that describes how communication will happen between your components. You may choose a hub-and-spoke model where a central component listens for pings from participating nodes and records their liveness information, or you could use a ring topology to pass liveness messages through the network. Use protocol buffers for communication.
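For example, in the hub-and-spoke option, each node might periodically serialize a small heartbeat message with protocol buffers and send it to the central server. The sketch below is only one possibility: it assumes a hypothetical generated package pb (produced by protoc from a .proto file you would write) containing a Heartbeat message with NodeId and TimestampMs fields, and it uses UDP. Your message layout, package paths, and transport are entirely up to you.

package node

// Rough sketch of a node sending heartbeats to a central hub over UDP.
// The pb package and the Heartbeat fields are placeholders for whatever
// protoc generates from your own .proto file.

import (
    "log"
    "net"
    "time"

    "google.golang.org/protobuf/proto"

    pb "example.com/lab3/heartbeat" // hypothetical generated package
)

func sendHeartbeats(hubAddr, nodeID string, interval time.Duration) error {
    conn, err := net.Dial("udp", hubAddr) // e.g., "orion01:35000"
    if err != nil {
        return err
    }
    defer conn.Close()

    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for range ticker.C {
        msg := &pb.Heartbeat{
            NodeId:      nodeID,
            TimestampMs: time.Now().UnixMilli(),
        }
        data, err := proto.Marshal(msg)
        if err != nil {
            return err
        }
        if _, err := conn.Write(data); err != nil {
            log.Printf("heartbeat send failed: %v", err)
        }
    }
    return nil
}

Whether the hub replies to each heartbeat (so a rejected node finds out it must re-initialize) or you use a separate control channel is part of your design.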

After designing your topology, build your components and allow for configurable thresholds to determine when a node has failed. For instance, perhaps you expect a heartbeat from each component every 5 seconds, and if three heartbeats are missed you will consider the node failed. Once a node is considered failed, you should NOT let it re-enter the system. Continuing this example, if a node sends another heartbeat after 25 seconds of silence, the update should be rejected and the node should be told to re-initialize itself as a new node.
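To make the thresholds concrete, here is a rough sketch of what the tracking logic might look like on the listening side of a hub-and-spoke design. The Detector type, its field names, and the sweep strategy are placeholders rather than a required structure; a ring design would track liveness differently.

package node

import (
    "sync"
    "time"
)

// Detector tracks liveness on the hub side. Thresholds are configurable:
// a node is failed after missedLimit consecutive heartbeat intervals with
// no update.
type Detector struct {
    mu          sync.Mutex
    interval    time.Duration        // expected heartbeat interval, e.g. 5s
    missedLimit int                  // e.g. 3 missed heartbeats => failed
    lastSeen    map[string]time.Time // nodeID -> last heartbeat time
    failed      map[string]bool      // nodes that may never re-enter
}

func NewDetector(interval time.Duration, missedLimit int) *Detector {
    return &Detector{
        interval:    interval,
        missedLimit: missedLimit,
        lastSeen:    make(map[string]time.Time),
        failed:      make(map[string]bool),
    }
}

// Record handles an incoming heartbeat. It returns false if the node has
// already been declared failed, in which case the caller should tell the
// node to re-initialize itself as a new node.
func (d *Detector) Record(nodeID string) bool {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.failed[nodeID] {
        return false // reject: failed nodes cannot re-enter
    }
    d.lastSeen[nodeID] = time.Now()
    return true
}

// Sweep should be called periodically (e.g. once per interval) to mark
// nodes that have missed too many heartbeats as failed.
func (d *Detector) Sweep() []string {
    d.mu.Lock()
    defer d.mu.Unlock()
    var newlyFailed []string
    cutoff := time.Duration(d.missedLimit) * d.interval
    for id, seen := range d.lastSeen {
        if time.Since(seen) > cutoff {
            d.failed[id] = true
            delete(d.lastSeen, id)
            newlyFailed = append(newlyFailed, id)
        }
    }
    return newlyFailed
}

With interval set to 5 seconds and missedLimit set to 3, a silent node is reported after roughly 15 seconds, and a heartbeat that arrives 25 seconds later is rejected by Record, at which point the hub can tell the node to re-initialize.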

This lab is less prescriptive than the last; you have much more freedom to design it the way you think will work best. To test your code, run it with at least 100 nodes on the orion cluster. If you’re wondering how to start so many nodes on the cluster, set up Passwordless SSH and use a script like this:

#!/usr/bin/env bash

port_prefix=35 # Put your assigned port prefix here.
               # See: https://www.cs.usfca.edu/~mmalensek/cs677/schedule/materials/ports.html
nodes=100      # Number of nodes to run

# Server list. You can comment out servers that you don't want to use with '#'
servers=(
    "orion01"
    "orion02"
    "orion03"
    "orion04"
    "orion05"
    "orion06"
    "orion07"
    "orion08"
    "orion09"
    "orion10"
    "orion11"
    "orion12"
)

for (( i = 0; i < nodes; i++ )); do
    port=$(( port_prefix * 1000 + i ))
    server=$(( i % ${#servers[@]} ))   # round-robin across the server list

    # This will ssh to the machine and run 'node orion01 <some port>' in the
    # background.
    echo "Starting node on ${servers[${server}]} on port ${port}"
    ssh "${servers[${server}]}" "${HOME}/go/bin/node orion01 ${port}" &
done

echo "Startup complete"

Place this in a file such as startup.sh and run chmod +x startup.sh to make it executable. The script assumes your node binary is named node, is installed in ${HOME}/go/bin, connects to a central server on orion01, and listens on the port passed as its second argument, so you'll have to tweak it as necessary. You might not need something this complicated depending on your network design.
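If your node does take a central server hostname and a listen port on the command line the way the script assumes, its argument handling might look roughly like the sketch below. The argument order and names here are assumptions, so match them to however your own node is actually configured.

package main

import (
    "log"
    "os"
    "strconv"
)

// Sketch of the command-line interface the startup script above assumes:
//   node <hub-host> <listen-port>
// Adjust (or replace entirely) to fit your own design.
func main() {
    if len(os.Args) != 3 {
        log.Fatalf("usage: %s <hub-host> <listen-port>", os.Args[0])
    }
    hubHost := os.Args[1]                       // central server, e.g. "orion01"
    listenPort, err := strconv.Atoi(os.Args[2]) // this node's own port
    if err != nil {
        log.Fatalf("invalid port %q: %v", os.Args[2], err)
    }

    log.Printf("starting node: hub=%s, listening on port %d", hubHost, listenPort)
    // ... dial the hub, start sending heartbeats, listen for requests, etc.
}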

To kill nodes:

servers=(
    "orion01"
    "orion02"
    "orion03"
    "orion04"
    "orion05"
    "orion06"
    "orion07"
    "orion08"
    "orion09"
    "orion10"
    "orion11"
    "orion12"
)

for server in "${servers[@]}"; do
    echo "${server}"
    ssh "${server}" "pkill -u $(whoami) node"
done

NOTE: You don’t have to use these scripts. Just make sure you can demonstrate your cluster running with at least 100 nodes.

Creating Failures

After you have your nodes started and running, you should be able to demonstrate (1) that no nodes are reported as failed under normal operating conditions, and (2) that if you kill a node by sshing to the target machine and killing its process with kill <some-pid>, your system detects and reports the failure. (Use pgrep -u $(whoami) node to find processes named "node" running under your account, then use the PIDs it reports to issue kill commands.)

Submission

  1. Check your code (including your .proto file(s)) into your repository.
  2. Provide a short writeup about your network topology, thresholds, and communication flow. You can use diagrams if that makes your life easier.
  3. Create a short video (screen recording) of your system running and detecting failures. It's probably easiest to do this with Zoom. Check the file into your repository, or share it with the professor separately via Google Drive if it's too large.