CS 686 Big Data

CS686 Project 1: Distributed File System (v 1.4)

Due: October 18

In this project, you will build your own distributed file system (DFS) based on the technologies we’ve studied from Amazon, Google, and others. Your DFS will support multiple storage nodes responsible for managing data. Key features include:

Your implementation must be done in Java, and we will test it using the bass cluster here in the CS department. Communication between components must be implemented via sockets (not RMI or similar technologies) and you may not use any external libraries. The Java Development Kit has everything you need to complete this assignment.
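Because raw sockets deliver a stream of bytes rather than discrete messages, you will need some way to mark where one message ends and the next begins. One common approach is length-prefixed framing, sketched below using only JDK classes. The class name `MessageFraming`, its method names, and the 64 MB size cap are assumptions made for illustration, not requirements of this project.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper: frames each message as a 4-byte length followed by the payload. */
class MessageFraming {
    // Assumed upper bound so a corrupt length field cannot trigger a huge allocation.
    static final int MAX_MESSAGE_BYTES = 64 * 1024 * 1024;

    /** Writes the payload length as a big-endian int, then the payload itself. */
    static void writeMessage(OutputStream out, byte[] payload) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(payload.length);
        data.write(payload);
        data.flush();
    }

    /** Reads one complete message, blocking until all of its bytes have arrived. */
    static byte[] readMessage(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();
        if (length < 0 || length > MAX_MESSAGE_BYTES) {
            throw new IOException("Invalid message length: " + length);
        }
        byte[] payload = new byte[length];
        data.readFully(payload); // loops internally until the buffer is full
        return payload;
    }
}
```

With a real connection you would pass `socket.getOutputStream()` and `socket.getInputStream()` to these methods. How you encode the payload itself (message type, file name, chunk number, and so on) is up to you.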

Since this is a graduate-level class, you have leeway on how you design and implement your system. However, you should be able to explain your design decisions. Additionally, you must include the following components:

Version Control

To set up your submission repository on GitHub, visit: https://classroom.github.com/a/Yoknj2ce

In the spirit of versioning, I will update the version number at the top of this document every time a change is made and list any changes in the changelog below.

Controller

The Controller is responsible for managing resources in the system, somewhat like an HDFS NameNode. When a new storage node joins your DFS, the first thing it does is contact the Controller. The Controller manages a few data structures:

When a client wishes to store a new file, it sends a storage request to the Controller, which replies with a list of destination storage nodes for the chunks. The Controller itself should never see any of the actual file contents, only their metadata.
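As one possible way to answer a storage request, the sketch below picks destination nodes for a single chunk by preferring the nodes with the most free space. The placement policy, the class name `ChunkPlacement`, and the `freeBytesByNode` map are all assumptions for illustration; the value 3 reflects the replication level this project requires. You are free to use a different policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical Controller helper: chooses where the copies of one chunk should go. */
class ChunkPlacement {
    static final int REPLICATION_LEVEL = 3; // one original plus two replicas

    /**
     * Returns REPLICATION_LEVEL distinct node names, most free space first.
     * Nodes that cannot hold the chunk are skipped.
     */
    static List<String> chooseNodes(Map<String, Long> freeBytesByNode, long chunkBytes) {
        List<Map.Entry<String, Long>> candidates = new ArrayList<>();
        for (Map.Entry<String, Long> entry : freeBytesByNode.entrySet()) {
            if (entry.getValue() >= chunkBytes) {
                candidates.add(entry);
            }
        }
        if (candidates.size() < REPLICATION_LEVEL) {
            throw new IllegalStateException("Not enough storage nodes with free space");
        }
        // Sort by free space (descending); break ties by node name for determinism.
        candidates.sort((a, b) -> {
            int byFree = Long.compare(b.getValue(), a.getValue());
            return byFree != 0 ? byFree : a.getKey().compareTo(b.getKey());
        });
        List<String> chosen = new ArrayList<>();
        for (int i = 0; i < REPLICATION_LEVEL; i++) {
            chosen.add(candidates.get(i).getKey());
        }
        return chosen;
    }
}
```

Note that only node names and sizes pass through this code; the chunk data itself goes directly from the client to the storage nodes.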

The Controller is also responsible for detecting storage node failures and ensuring the system replication level is maintained. In your DFS, every chunk will be replicated twice, for a total of three copies of each chunk. This means that if a storage node goes down, you can re-route retrievals to a backup copy. You’ll also maintain the replication level by creating more copies in the event of a failure.
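To make the repair step concrete, here is a minimal sketch of how the Controller might plan new copies after a failure. It assumes the Controller keeps a map from chunk identifier to the set of nodes holding that chunk; the names `ReplicationRepair`, `CopyTask`, and `planRepairs` are hypothetical, and the choice of source and target nodes is deliberately simplistic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical Controller helper: plans copies needed to restore the replication level. */
class ReplicationRepair {
    static final int REPLICATION_LEVEL = 3;

    /** One instruction: ask `source` to send its copy of `chunkId` to `target`. */
    static final class CopyTask {
        final String chunkId;
        final String source;
        final String target;

        CopyTask(String chunkId, String source, String target) {
            this.chunkId = chunkId;
            this.source = source;
            this.target = target;
        }
    }

    static List<CopyTask> planRepairs(Map<String, Set<String>> replicasByChunk,
                                      Set<String> liveNodes) {
        List<CopyTask> tasks = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : replicasByChunk.entrySet()) {
            Set<String> holders = new TreeSet<>(entry.getValue());
            holders.retainAll(liveNodes);       // ignore copies on failed nodes
            if (holders.isEmpty()) {
                continue;                       // no surviving copy to replicate from
            }
            String source = holders.iterator().next();
            for (String candidate : new TreeSet<>(liveNodes)) {
                if (holders.size() >= REPLICATION_LEVEL) {
                    break;
                }
                if (holders.add(candidate)) {   // true only if it lacked this chunk
                    tasks.add(new CopyTask(entry.getKey(), source, candidate));
                }
            }
        }
        return tasks;
    }
}
```

A fuller design would also weigh free space and spread the copy work across several source nodes, but the core idea is the same: find under-replicated chunks, pick a surviving copy, and pick a node that does not already hold it.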

Storage Node

Storage nodes are responsible for storing and retrieving file chunks. When a chunk is stored, it will be checksummed so on-disk corruption can be detected.
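As an illustration of the checksum idea, the sketch below computes a CRC32 over a chunk when it is stored and re-checks it when the chunk is read back. CRC32 is only one option and is an assumption here, not a requirement; the JDK also provides stronger digests through `java.security.MessageDigest`. The class and method names are hypothetical.

```java
import java.util.zip.CRC32;

/** Hypothetical storage-node helper for detecting on-disk corruption. */
class ChunkChecksum {
    /** Computes the CRC32 of the chunk contents. */
    static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    /** Returns true if the bytes read from disk still match the stored checksum. */
    static boolean isIntact(byte[] chunkFromDisk, long storedChecksum) {
        return checksum(chunkFromDisk) == storedChecksum;
    }
}
```

A storage node would save the checksum alongside the chunk at write time and call something like `isIntact` on every retrieval. You will need to decide what the node should do when the check fails.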

Some messages that your storage node could accept (although you are certainly free to design your own):

One alternative is to create a unique identifier for each chunk, in which case you wouldn’t need the file name + chunk number combination. If you go that route, you’ll have to think about how to tell clients how to reconstruct the file correctly.
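If you do go with unique chunk identifiers, one way to preserve ordering is a per-file manifest that lists the identifiers in file order. The sketch below is only an illustration of that idea: the `FileManifest` class, the use of `java.util.UUID` for identifiers, and the in-memory reassembly are all assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical manifest: records the chunk identifiers of one file, in order. */
class FileManifest {
    final String fileName;
    final List<String> chunkIds = new ArrayList<>(); // position = chunk order

    FileManifest(String fileName) {
        this.fileName = fileName;
    }

    /** Registers the next chunk of the file and returns its new identifier. */
    String addChunk() {
        String id = UUID.randomUUID().toString();
        chunkIds.add(id);
        return id;
    }

    /** Concatenates retrieved chunks in manifest order to rebuild the file. */
    byte[] reassemble(Map<String, byte[]> chunksById) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String id : chunkIds) {
            byte[] data = chunksById.get(id);
            if (data == null) {
                throw new IllegalStateException("Missing chunk " + id);
            }
            out.write(data, 0, data.length);
        }
        return out.toByteArray();
    }
}
```

For large files you would stream chunks to disk rather than hold them all in memory, but the ordering logic stays the same.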

Finally, the storage nodes will send a heartbeat to the Controller periodically. The heartbeat includes chunk metadata to keep the Controller up to date, while also letting it know that the node is still alive. Heartbeats should be sent every 5 seconds and include only the latest changes at the node, not an entire list of its files. However, the Controller can also ask a storage node to send a complete file list (useful if the Controller failed and wants to rebuild its view of the system state). You can also include the amount of free space available at the node in your heartbeat messages so that the Controller has an idea of resource availability.
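The two halves of the heartbeat mechanism can be sketched as follows: a buffer on the storage node that accumulates changes between heartbeats, and a detector on the Controller that flags nodes that have gone quiet. The class names are hypothetical, and the timeout of three missed heartbeats (15 seconds) is an assumption; only the 5-second interval comes from this document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Storage-node side: collects chunk changes made since the last heartbeat. */
class HeartbeatBuffer {
    private final List<String> newChunks = new ArrayList<>();

    synchronized void recordStored(String chunkId) {
        newChunks.add(chunkId);
    }

    /** Returns only the changes since the previous call, then resets the buffer. */
    synchronized List<String> drain() {
        List<String> delta = new ArrayList<>(newChunks);
        newChunks.clear();
        return delta;
    }
}

/** Controller side: tracks when each node was last heard from. */
class FailureDetector {
    static final long HEARTBEAT_INTERVAL_MS = 5_000;           // from the project spec
    static final long TIMEOUT_MS = 3 * HEARTBEAT_INTERVAL_MS;  // assumed tolerance

    private final Map<String, Long> lastSeen = new HashMap<>();

    synchronized void onHeartbeat(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
    }

    /** Returns the nodes whose most recent heartbeat is older than the timeout. */
    synchronized List<String> findFailed(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > TIMEOUT_MS) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }
}
```

On the storage node, a `java.util.concurrent.ScheduledExecutorService` with `scheduleAtFixedRate` is one convenient way to call `drain()` and send the result every 5 seconds. Timestamps are passed in as parameters here so the logic is easy to test; in a running system you would supply `System.currentTimeMillis()`.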

Client

The client’s main functions include:

The client will also be able to print out a list of files (retrieved from the Controller), and the total available disk space in the cluster (in GB).
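As a small illustration of the disk space report, the sketch below sums per-node free space (as the Controller might learn it from heartbeats) and converts it to GB. The class name and the `freeBytesByNode` map are hypothetical, and treating 1 GB as 10^9 bytes is an assumption; use 2^30 if you prefer binary units, but be consistent.

```java
import java.util.Map;

/** Hypothetical helper: totals the free space reported by all storage nodes. */
class ClusterSpace {
    static final double BYTES_PER_GB = 1_000_000_000.0; // assumed decimal gigabytes

    static double totalFreeGb(Map<String, Long> freeBytesByNode) {
        long totalBytes = 0;
        for (long free : freeBytesByNode.values()) {
            totalBytes += free;
        }
        return totalBytes / BYTES_PER_GB;
    }
}
```

Each storage node can obtain its own free space with `java.io.File.getUsableSpace()` on its storage directory.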

Tips and Resources

Project Deliverables

This project will be worth 20% of your course grade (20 points). The deliverables include:

Note: your system must be able to support at least 10 active storage nodes. During grading, you will launch these components on the bass cluster.

Milestones

Here are some milestones to guide your implementation:

You are required to work alone on this project. However, you are certainly free to discuss the project with your peers. We will also conduct in-class lab sessions where you can:

Grading

You will have a one-on-one interview and code review to grade your assignment. You will demonstrate the required functionality and walk through your design.

I will deduct points if you violate any of the requirements listed in this document — for example, using an unauthorized external library. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.

Changelog