CS 677 Big Data

Project 1: Distributed File System (v 1.1)

Starter repository on GitHub: https://classroom.github.com/a/ofomjCru

In this project, you will build your own distributed file system (DFS) based on the technologies we’ve studied from Amazon, Google, and others. Your DFS will support multiple storage nodes responsible for managing data. Key features include:

Your implementation must be done in Java, and we will test it using the orion cluster here in the CS department. Communication between components must be implemented via sockets (not RMI, RPC or similar technologies) and you may not use any external libraries. The Java Development Kit has everything you need to complete this assignment.
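As a rough illustration of socket-based messaging with only the JDK, the sketch below shows a server loop that reads a length-prefixed message and writes a reply. The port number, framing, and class name are illustrative choices, not requirements.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch: a component listening for length-prefixed messages on a plain socket.
public class MessageServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(5555)) {
            while (true) {
                try (Socket client = server.accept();
                     DataInputStream in = new DataInputStream(client.getInputStream());
                     DataOutputStream out = new DataOutputStream(client.getOutputStream())) {
                    int length = in.readInt();            // message length prefix
                    byte[] payload = new byte[length];
                    in.readFully(payload);                // then the message body
                    // ... dispatch on the message type encoded in the payload ...
                    byte[] reply = "OK".getBytes(StandardCharsets.UTF_8);
                    out.writeInt(reply.length);           // reply with the same framing
                    out.write(reply);
                }
            }
        }
    }
}
```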

Since this is a graduate-level class, you have leeway on how you design and implement your system. However, you should be able to explain your design decisions. Additionally, you must include the following components:

Coordinator

The Coordinator’s job is simple: it acts as a gatekeeper to the system so that the administrator can add or remove nodes and monitor the health of the cluster. The Coordinator maintains a canonical routing table, which contains a list of active storage nodes and their positions in the system hash space. Your DFS will implement a zero-hop distributed hash table (DHT) design where each node can locate a file given its name without intermediate routing steps. We will use the SHA-1 hash algorithm.
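A minimal sketch of mapping names into the SHA-1 hash space using only the JDK: the helper below interprets the 160-bit digest of a file name as an unsigned integer. The class and method names are illustrative.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: map a file or chunk name to a position in the SHA-1 hash space by
// interpreting the 160-bit digest as an unsigned integer in [0, 2^160).
public final class HashSpace {
    public static BigInteger position(String name) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(name.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);   // signum 1 -> always non-negative
    }
}
```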

When a new storage node joins your DFS, the first thing it does is contact the Coordinator. The Coordinator will determine whether the node is allowed to enter the system, assign it a Node ID, and place it in the system hash ring. You get to choose how nodes are positioned within the hash space – remember to justify your algorithm in your design document.
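One possible placement strategy is to hash the joining node's host:port string into the same space as file names; the sketch below assumes that approach and reuses the HashSpace helper from the previous example. NodeInfo and the TreeMap-based routing table are illustrative names only, and evenly spaced positions or multiple virtual positions per node are equally valid alternatives.

```java
import java.math.BigInteger;
import java.security.NoSuchAlgorithmException;
import java.util.TreeMap;

// Sketch: one way to position a joining node in the hash ring.
public class HashRing {
    public static class NodeInfo {
        final int nodeId;
        final String host;
        final int port;
        NodeInfo(int nodeId, String host, int port) {
            this.nodeId = nodeId;
            this.host = host;
            this.port = port;
        }
    }

    private final TreeMap<BigInteger, NodeInfo> ring = new TreeMap<>();

    public BigInteger addNode(int nodeId, String host, int port) throws NoSuchAlgorithmException {
        BigInteger position = HashSpace.position(host + ":" + port);  // hash host:port into the ring
        ring.put(position, new NodeInfo(nodeId, host, port));
        return position;
    }
}
```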

The Coordinator is also responsible for detecting storage node failures and ensuring the system replication level is maintained. In your DFS, every chunk will be replicated twice, for a total of 3 copies of each chunk. This means if a node goes down, you can re-route retrievals to a backup copy. You’ll also maintain the replication level by creating more copies in the event of a failure. You will need to design an algorithm for determining replica placement.
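One reasonable (but not required) placement policy is to store the primary copy on the first node at or after the chunk's ring position and the replicas on the next nodes clockwise. The method below sketches that policy; it is written to live inside the HashRing class from the previous example and also needs java.util.List, java.util.ArrayList, and java.util.Map imports.

```java
// Sketch: choose the primary plus replica nodes by walking the ring clockwise.
public List<NodeInfo> nodesFor(BigInteger chunkPosition, int copies) {
    List<NodeInfo> chosen = new ArrayList<>();
    if (ring.isEmpty()) return chosen;
    Map.Entry<BigInteger, NodeInfo> entry = ring.ceilingEntry(chunkPosition);
    if (entry == null) entry = ring.firstEntry();         // wrap around the ring
    while (chosen.size() < copies && chosen.size() < ring.size()) {
        chosen.add(entry.getValue());
        entry = ring.higherEntry(entry.getKey());          // next node clockwise
        if (entry == null) entry = ring.firstEntry();
    }
    return chosen;
}
```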

The Coordinator should never see any files or file names, and does not handle any client storage or retrieval requests. If the Coordinator goes down, file storage and retrieval operations should continue to work. When the Coordinator comes back online, it will request a copy of the last known good hash space and resume its usual operations.

Storage Node

Storage nodes are responsible for routing client requests as well as storing and retrieving file chunks. When a chunk is stored, it will be checksummed so on-disk corruption can be detected.
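A sketch of one way to do this with the JDK's MessageDigest: compute a checksum when the chunk is written and recompute it on each read. SHA-1 is used here only for consistency with the hash space; CRC32 or MD5 would work as well, and the class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: checksum a chunk on write, verify it on read to detect corruption.
public final class ChunkChecksum {
    public static byte[] checksum(byte[] chunkData) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-1").digest(chunkData);
    }

    public static boolean isIntact(Path chunkFile, byte[] storedChecksum)
            throws IOException, NoSuchAlgorithmException {
        byte[] current = checksum(Files.readAllBytes(chunkFile));
        return MessageDigest.isEqual(current, storedChecksum);
    }
}
```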

Some messages that your storage node could accept (although you are certainly free to design your own):

Another approach is encoding the chunk number in the file names. Metadata (checksums, chunk numbers) should be stored alongside the files on disk.
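As one hedged example of that layout, the sketch below names each chunk <fileName>_chunk<N> and writes a sibling .meta file holding the chunk number, total chunk count, and checksum. The naming convention and the Properties-based metadata format are illustrative choices only.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.Properties;

// Sketch: store a chunk and its metadata side by side in the storage directory.
public final class ChunkStore {
    public static void writeChunk(Path storageDir, String fileName, int chunkNumber,
                                  int totalChunks, byte[] data, byte[] checksum)
            throws IOException {
        Path chunkPath = storageDir.resolve(fileName + "_chunk" + chunkNumber);
        Files.write(chunkPath, data);

        Properties meta = new Properties();
        meta.setProperty("chunkNumber", Integer.toString(chunkNumber));
        meta.setProperty("totalChunks", Integer.toString(totalChunks));
        meta.setProperty("checksum", Base64.getEncoder().encodeToString(checksum));
        try (OutputStream out = Files.newOutputStream(
                storageDir.resolve(chunkPath.getFileName() + ".meta"))) {
            meta.store(out, "chunk metadata");
        }
    }
}
```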

The storage nodes will send a heartbeat to the Coordinator periodically to let it know that they are still alive. Every 5 seconds is a good interval for sending these. The heartbeat contains the free space available at the node and the total number of requests processed (storage, retrievals, etc.). If the layout of the hash space has changed, the Coordinator will respond with an updated node list.
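A minimal sketch of such a heartbeat loop, using a ScheduledExecutorService and a hypothetical sendToCoordinator() helper that stands in for whatever socket message format you design:

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: report free space and requests processed to the Coordinator every 5 s.
public class HeartbeatSender {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final File storageDir;
    private volatile long requestsProcessed = 0;   // incremented by the request handlers

    public HeartbeatSender(File storageDir) {
        this.storageDir = storageDir;
    }

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long freeSpace = storageDir.getUsableSpace();     // bytes available on disk
            sendToCoordinator(freeSpace, requestsProcessed);  // hypothetical helper
        }, 0, 5, TimeUnit.SECONDS);
    }

    private void sendToCoordinator(long freeSpace, long requests) {
        // open a socket to the Coordinator and write the heartbeat message
    }
}
```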

On startup, each storage node is given a storage directory and the hostname/IP of the Coordinator. Any old files present in the storage directory should be removed. The Coordinator will respond with the current state of the hash space (including the position of the new node).
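A small sketch of that cleanup step, assuming java.nio.file is used for the storage directory; the class name is illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Sketch: wipe leftover files from the storage directory at startup,
// creating the directory if it does not yet exist.
public final class StorageDirSetup {
    public static void clear(Path storageDir) throws IOException {
        if (!Files.exists(storageDir)) {
            Files.createDirectories(storageDir);
            return;
        }
        try (Stream<Path> entries = Files.walk(storageDir)) {
            entries.sorted(Comparator.reverseOrder())     // children before parents
                   .filter(p -> !p.equals(storageDir))    // keep the directory itself
                   .forEach(p -> p.toFile().delete());
        }
    }
}
```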

Client

The client’s main functions include:

The client will also be able to print out a list of active nodes (retrieved from the Coordinator), the total disk space available in the cluster (in GB), and the number of requests handled by each node. Given a specific storage node, the client should be able to retrieve a list of files stored there (including the chunk number, e.g., ‘5 of 17’ or similar).

NOTE: Your client must either accept command line arguments or provide its own text-based command entry interface. Recompiling your client to execute different actions is not allowed and will incur a 5-point deduction.
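A minimal sketch of a text-based command loop; the command names (store, get, list-nodes) are illustrative, not required.

```java
import java.util.Scanner;

// Sketch: a simple interactive shell for issuing DFS commands.
public class ClientShell {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.print("dfs> ");
        while (input.hasNextLine()) {
            String[] tokens = input.nextLine().trim().split("\\s+");
            switch (tokens[0]) {
                case "store":      /* send the file to the responsible node(s) */ break;
                case "get":        /* retrieve and reassemble a file */           break;
                case "list-nodes": /* ask the Coordinator for the active nodes */ break;
                case "exit":       return;
                default:           System.out.println("unknown command");
            }
            System.out.print("dfs> ");
        }
    }
}
```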

Tips and Resources

Project Deliverables

This project will be worth 20% of your course grade (20 points). The deliverables include:

Note: your system must be able to support at least 12 active storage nodes, i.e., the entire orion cluster.

Extra Credit

Milestones and Checkpoints

Here are some milestones to guide your implementation:

Checkpoints: we’ll have two checkpoints where you demonstrate completion of the components listed above. You do not have to complete the components in the order listed, but you should be making steady progress. Checkpoint grading:

You are required to work alone on this project. However, you are certainly free to discuss the project with your peers. We will also conduct in-class lab sessions where you can get help, discuss issues, and think about your design.

Grading

You will have a one-on-one interview and code review to grade your assignment. You will demonstrate the required functionality and walk through your design.

I will deduct points if you violate any of the requirements listed in this document — for example, using an unauthorized external library. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.

Changelog