Project 2: Distributed Computation Engine (v1.0)

Starter repository on GitHub: https://classroom.github.com/a/g3hbBmuV

Project 1 dealt with the storage aspects of big data, so in Project 2 we will shift our focus to computation. You will extend your DFS to support MapReduce jobs; the specific features we will support (data partitioning, job submission, load balancing, the Map/Shuffle/Reduce phases, and progress reporting) are described in the sections below.

Your implementation must be done in Go (unless otherwise arranged with the instructor), and we will test it using the orion cluster here in the CS department. Communication between components must be implemented via sockets (not RMI, RPC, or similar technologies; in particular, you are not allowed to use gRPC for this project), and you may not use any external libraries beyond those explicitly stated in the project spec without instructor approval.

Once again, since this is a graduate-level class, you have leeway on how you design and implement your system. As usual, you should be able to explain your design decisions.

Partitioning the Data

While your DFS in Project 1 treated all files as opaque blobs of binary data, your MapReduce implementation must be able to process files line by line. This means that your partitioner will now need to be datatype aware: if a file is text-based, it should be split at the nearest line boundary rather than at an exact byte count based on the chunk size. (Note: binary files should still be partitioned the same way as in Project 1.)

This partitioning strategy allows us to provide the first input to a MapReduce job: <line_number, line_text> pairs.
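For text files, one possible way to find the cut points is sketched below (the function name and in-memory slice layout are illustrative, and your partitioner may well operate on streams instead): back up from the nominal chunk size to the nearest newline so that no line is split across two chunks.

package partition

import "bytes"

// splitOnLines is a minimal sketch: it cuts data into chunks of at most
// maxChunk bytes, moving each cut back to the nearest preceding newline so
// that no line straddles two chunks. Binary files skip this logic and are
// cut at exact byte offsets, as in Project 1.
func splitOnLines(data []byte, maxChunk int) [][]byte {
	var chunks [][]byte
	for len(data) > maxChunk {
		cut := bytes.LastIndexByte(data[:maxChunk], '\n')
		if cut < 0 {
			// A single line longer than the chunk size; fall back to a hard cut.
			cut = maxChunk - 1
		}
		chunks = append(chunks, data[:cut+1])
		data = data[cut+1:]
	}
	if len(data) > 0 {
		chunks = append(chunks, data)
	}
	return chunks
}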

Job Submission

It is highly recommended to add a new node to your system to manage computations. This node will receive job submissions, determine relevant storage nodes based on the input file, and then transfer the job to the target nodes for execution.

Since Go projects are compiled to native binaries, distributing jobs is easy: receive the compiled Go program as input, transfer it to the relevant storage nodes, and then run it. You can assume that all nodes in your system will be running on the same platform (Linux).

As an example, imagine the following:

# Submit the job to the Computation Manager:
submit_job ./jobs/wordcount_job huge_input_file.txt output_file.txt

# (Computation Manager distributes the binary to Storage Nodes that hold chunks
# of 'huge_input_file.txt')

# On the Storage Node, the binary is run:
/some/tmp/dir/wordcount_job localhost:35008 <job_id>

In this example, wordcount_job connects to the Storage Node running on the local machine (localhost:35008) and provides a job ID. The Storage Node will stream <line_number, line_text> pairs to the job based on the job ID. Alternatively, the job could be given a list of chunk names, which it would pass back to the Storage Node for retrieval.
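On the Storage Node side, launching a received job binary is mostly a matter of shelling out to it. The sketch below assumes the binary has already been transferred and written to a temporary directory; the function name and argument order simply mirror the example above and are not required.

package worker

import (
	"fmt"
	"os"
	"os/exec"
)

// runJob is a sketch of a Storage Node launching a received job binary.
// binPath is wherever the node saved the transferred executable, addr is the
// node's own host:port, and jobID matches the ID shown in the example above.
func runJob(binPath, addr, jobID string) error {
	if err := os.Chmod(binPath, 0o755); err != nil { // make the binary executable
		return err
	}
	cmd := exec.Command(binPath, addr, jobID)
	cmd.Stdout = os.Stdout // surface the job's output for debugging
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("job %s failed: %w", jobID, err)
	}
	return nil
}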

NOTE: This is only one approach for distributing and running jobs. You are free to design your job submission algorithm however you wish; it will often follow naturally from your DFS design.

Load Balancing

Since each file stored in your DFS will have multiple chunks associated with it, and each chunk will have at least 3 replicas available, we can determine a job scheduling strategy that will balance the load across cluster nodes. Ideally we want as many nodes as possible to participate in the computation to increase parallelism.

You should have a way to determine the number of reducers required by your jobs so that the Computation Manager can provide a list of reducers to the Storage Nodes during the Map phase. NOTE: the number of reducers needed is ultimately up to the algorithm, so the job itself should provide this configuration information. Choose reducers that will be co-located with active Map tasks to improve data locality.
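One simple way to reason about the scheduling decision is sketched below (the maps here are hypothetical stand-ins for whatever chunk/replica metadata your DFS already tracks): give each chunk to whichever of its replica holders currently has the fewest map tasks, which naturally spreads work across the cluster. Reducers can then be drawn from the nodes that end up with map work, keeping shuffle traffic local.

package scheduler

// assignMaps is a sketch of a greedy load-balancing pass: each chunk is
// assigned to the replica holder that currently has the fewest map tasks.
// replicas maps chunk name -> the nodes holding a copy of that chunk;
// the returned plan maps node -> the chunks it should process.
func assignMaps(replicas map[string][]string) map[string][]string {
	load := make(map[string]int)      // node -> number of map tasks assigned so far
	plan := make(map[string][]string) // node -> chunks it will process
	for chunk, holders := range replicas {
		best := holders[0]
		for _, node := range holders[1:] {
			if load[node] < load[best] {
				best = node
			}
		}
		load[best]++
		plan[best] = append(plan[best], chunk)
	}
	return plan
}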

The Map Phase

In the Map phase, your job will accept <line_number, line_text> pairs, process them, and then produce <key, value> pairs.

You don’t need to worry about the datatypes of the key or value; treat them as raw bytes and convert them to other types (strings, ints, etc.) as necessary in your jobs.
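One possible shape for the job-side contract is sketched below; the type and function names are illustrative, not part of the spec.

package mapreduce

import "bytes"

// Pair is a raw-bytes key/value pair; jobs convert the bytes to strings,
// ints, etc. as needed.
type Pair struct {
	Key   []byte
	Value []byte
}

// MapFunc is one possible signature for the Map phase: it receives a line
// number and the line's text, and emits zero or more <key, value> pairs.
type MapFunc func(lineNumber int, lineText []byte) []Pair

// Example: a word count mapper that emits <word, "1"> for every word in a line.
func wordCountMap(_ int, lineText []byte) []Pair {
	var out []Pair
	for _, word := range bytes.Fields(lineText) {
		out = append(out, Pair{Key: word, Value: []byte("1")})
	}
	return out
}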

The Shuffle Phase

In the Shuffle phase, <key, value> pairs from the Map phase are sent to their destination reducer nodes based on the key. E.g., all outputs with key=hello will be sent to reducer 1, outputs with key=world will be sent to reducer 2, and so on. Your method for assigning reducers does not have to be complicated; hashing the key and taking the result modulo the number of reducers would be acceptable here.

This phase creates groupings of data. Ultimately, you might go from something like:

<San_Francisco, Golden_Gate_Park>
<San_Francisco, Ghirardelli_Square>
<San_Francisco, Fishermans_Wharf>
<San_Francisco, USF>

…from several mappers to:

<San_Francisco, [Golden_Gate_Park, Ghirardelli_Square, Fishermans_Wharf, USF]>

…with everything for a given key sent to a single reducer node for post-processing.
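A minimal sketch of a modulo-based assignment, using a standard-library FNV hash so that every occurrence of a key lands on the same reducer:

package shuffle

import "hash/fnv"

// reducerFor assigns a key to one of numReducers reducers by hashing the key
// bytes and taking the result modulo the reducer count.
func reducerFor(key []byte, numReducers int) int {
	h := fnv.New32a()
	h.Write(key) // the FNV hash's Write never returns an error
	return int(h.Sum32()) % numReducers
}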

The Reduce Phase

The Reduce phase is nearly identical to the Map phase, except it receives <key, [list, of, value]> pairs and produces <key, value> pairs as its final output.
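Continuing the illustrative interface from the Map phase sketch (again, none of these names are mandated), the reducer side might look like this:

// Continuing the hypothetical mapreduce package from the Map phase sketch;
// the Pair type is defined there.
package mapreduce

import "strconv"

// ReduceFunc is one possible signature for the Reduce phase: for each key it
// receives every value produced for that key and emits final <key, value> pairs.
type ReduceFunc func(key []byte, values [][]byte) []Pair

// Example: a word count reducer that sums the "1" values emitted by the mapper.
func wordCountReduce(key []byte, values [][]byte) []Pair {
	return []Pair{{Key: key, Value: []byte(strconv.Itoa(len(values)))}}
}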

Since our DFS in Project 1 was not required to support append operations, you should store intermediate files in a temporary location, append to them as necessary, and then only store the final outputs back in the DFS.
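For example (a sketch only; the temp-file naming and directory layout are up to you), a reducer can accumulate intermediate data in an append-mode file on its local disk and write only the finished output back through your DFS client:

package reducer

import (
	"os"
	"path/filepath"
)

// appendIntermediate is a sketch of spilling shuffle data for a job into an
// append-only temporary file on the local disk. Only the final reduce output
// would be stored back in the DFS.
func appendIntermediate(jobID string, data []byte) error {
	path := filepath.Join(os.TempDir(), jobID+".intermediate")
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(data)
	return err
}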

Reporting Progress

Each time a map or reduce task completes, send a message to the Computation Manager so progress can be reported to the client. Progress reporting does not have to be any more fine-grained than this.

Tips and Resources

Project Deliverables

This project will be worth 12 points. The deliverables include:

Note: your system must be able to support at least 12 active storage nodes, i.e., the entire orion cluster.

Grading

We’ll schedule a demo and code review to grade your assignment. You will demonstrate the required functionality and walk through your design.

I will deduct points if you violate any of the requirements listed in this document (for example, by using an unauthorized external library). I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate packages or modules based on functionality, and include comments in your source where appropriate.

Changelog