Project 2: Distributed Computation Engine (v 1.0)
Starter repository on GitHub: https://classroom.github.com/a/g3hbBmuV
Project 1 dealt with the storage aspects of big data, so in Project 2 we will shift our focus to computations. You will extend your DFS to support MapReduce jobs. Specific features we’ll support include:
- Datatype-aware chunk partitioning
- Job submission and monitoring, including pushing computations to nodes for data locality
- Load balancing across computation nodes
- The Map, Shuffle, and Reduce phases of computation
Your implementation must be done in Go (unless otherwise arranged with the instructor), and we will test it using the orion cluster here in the CS department. Communication between components must be implemented via sockets (not RMI, RPC, or similar technologies; in particular, you are not allowed to use gRPC for this project), and you may not use any external libraries beyond those explicitly stated in the project spec without instructor approval.
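For reference, a minimal sketch of socket communication using only Go's standard net package follows. The port and the newline-delimited framing are illustrative assumptions, not part of the spec; your components will need a real serialization scheme.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "localhost:26999") // illustrative port
	if err != nil {
		panic(err)
	}
	done := make(chan struct{})

	// Server side: accept one connection and read a newline-delimited message.
	go func() {
		defer close(done)
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		msg, _ := bufio.NewReader(conn).ReadString('\n')
		fmt.Print("server received: ", msg)
	}()

	// Client side: connect and send a message.
	conn, err := net.Dial("tcp", "localhost:26999")
	if err != nil {
		panic(err)
	}
	fmt.Fprintln(conn, "hello from client")
	conn.Close()
	<-done
}
```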
Once again, since this is a graduate-level class, you have leeway on how you design and implement your system. As usual, you should be able to explain your design decisions.
Partitioning the Data
While your DFS in Project 1 treated all files as opaque blobs of binary data, your MapReduce implementation must be able to process files line by line. This means that your partitioner will now need to be datatype aware: if a file is text-based, it should be split on the closest line boundaries rather than at an exact byte offset derived from the chunk size. (Note: if a file is binary, partition it the same way as in Project 1.)
This partitioning strategy allows us to provide the first input to a MapReduce job: <line_number, line_text> pairs.
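One possible way to pick a split point is sketched below: round the target chunk size down to the nearest newline so no line straddles two chunks. The function name and the backward-scan strategy are assumptions; your partitioner may work differently (for example, by scanning forward or streaming the data).

```go
package partition

// splitOnLineBoundary returns the number of bytes to place in the current
// chunk: the target chunk size, rounded down to the nearest '\n' so that
// no line is split across chunks.
func splitOnLineBoundary(data []byte, chunkSize int) int {
	if len(data) <= chunkSize {
		return len(data) // last (or only) chunk
	}
	for i := chunkSize; i > 0; i-- {
		if data[i-1] == '\n' {
			return i // split just after the newline
		}
	}
	// No newline within chunkSize bytes (e.g., one very long line);
	// fall back to an exact split.
	return chunkSize
}
```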
Job Submission
It is highly recommended to add a new node to your system to manage computations. This node will receive job submissions, determine relevant storage nodes based on the input file, and then transfer the job to the target nodes for execution.
Since Go projects are compiled to native binaries, distributing jobs is easy: receive the compiled Go program as input, transfer it to the relevant storage nodes, and then run it. You can assume that all nodes in your system will be running on the same platform (Linux).
As an example, imagine the following:
```bash
# Submit the job to the Computation Manager:
submit_job ./jobs/wordcount_job huge_input_file.txt output_file.txt

# (Computation Manager distributes the binary to Storage Nodes that hold chunks
# of 'huge_input_file.txt')

# On the Storage Node, the binary is run:
/some/tmp/dir/wordcount_job localhost:35008 <job_id>
```
In this example, wordcount_job connects to the Storage Node running on the local machine (localhost:35008) and provides a job ID number. The Storage Node will stream <line_number, line_text> pairs to the job based on the job ID. Alternatively, a list of chunk names could be provided to the job that would be passed back to the Storage Node for retrieval.
NOTE: This is one approach for distributing and running the jobs. You are free to design your job submission algorithm as you wish (often around your DFS design).
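If you do follow the approach above, the Storage Node side of launching a received job binary might look roughly like the following sketch. The temp-directory layout, argument order, and function name are assumptions, not requirements.

```go
package storagenode

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// runJob writes a received job binary to a temporary directory, marks it
// executable, and launches it with the Storage Node's address and a job ID,
// mirroring the example invocation above.
func runJob(binary []byte, jobID, storageNodeAddr string) error {
	dir, err := os.MkdirTemp("", "mapreduce-jobs")
	if err != nil {
		return err
	}
	path := filepath.Join(dir, jobID)
	if err := os.WriteFile(path, binary, 0o755); err != nil {
		return err
	}
	cmd := exec.Command(path, storageNodeAddr, jobID)
	cmd.Stdout = os.Stdout // forward or capture output as your design requires
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("job %s failed: %w", jobID, err)
	}
	return nil
}
```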
Load Balancing
Since each file stored in your DFS will have multiple chunks associated with it, and each chunk will have at least 3 replicas available, we can determine a job scheduling strategy that will balance the load across cluster nodes. Ideally we want as many nodes as possible to participate in the computation to increase parallelism.
You should have a way to determine the number of reducers required by your jobs so that the Computation Manager can provide a list of reducers to the Storage Nodes during the Map phase. NOTE: the number of reducers needed is ultimately up to the algorithm, so the job itself should provide this configuration information. Choose reducers that will be co-located with active Map tasks to improve data locality.
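As one example of a mapper-selection strategy, a greedy sketch follows: assign each chunk to the replica holder with the fewest map tasks so far, so work spreads across as many nodes as possible. The data structures are hypothetical, and your scheduler may weigh other factors.

```go
package scheduler

// chooseMappers assigns each chunk to the replica holder that currently has
// the fewest assigned map tasks.
func chooseMappers(chunkReplicas map[string][]string) map[string]string {
	load := make(map[string]int)          // node -> number of assigned map tasks
	assignment := make(map[string]string) // chunk -> chosen node

	for chunk, replicas := range chunkReplicas {
		best := replicas[0]
		for _, node := range replicas[1:] {
			if load[node] < load[best] {
				best = node
			}
		}
		assignment[chunk] = best
		load[best]++
	}
	return assignment
}
```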
The Map Phase
In the Map phase, your job will accept <line_number, line_text> pairs, process them, and then produce <key, value> pairs.
You don’t need to worry about the datatypes of the key or value; treat them as raw bytes and convert them to other types (strings, ints, etc.) as necessary in your jobs.
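For instance, a word-count-style mapper under a hypothetical interface where keys and values are raw bytes might look like this. The emit callback and signature are assumptions about your framework, not requirements.

```go
package job

import "bytes"

// Map consumes one <line_number, line_text> pair and emits <key, value>
// pairs via the provided callback. Keys and values are raw bytes; the job
// decides how to interpret them.
func Map(lineNumber int, lineText []byte, emit func(key, value []byte)) {
	for _, word := range bytes.Fields(lineText) {
		emit(bytes.ToLower(word), []byte("1")) // e.g., emit a count of 1 per word
	}
}
```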
The Shuffle Phase
In the Shuffle phase, <key, value> pairs from the Map phase are sent to their destination reducer nodes based on the key. E.g., all outputs with key=hello will be sent to reducer 1, outputs with key=world will be sent to reducer 2, and so on. Your method for assigning reducers does not have to be complicated; hashing the key modulo the number of reducers would be acceptable here.
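A minimal sketch of that assignment, using the standard library's FNV hash (the function name is illustrative):

```go
package shuffle

import "hash/fnv"

// reducerFor maps a key to one of numReducers reducer indices. All pairs
// with the same key hash to the same reducer, which is what the Shuffle
// phase requires.
func reducerFor(key []byte, numReducers int) int {
	h := fnv.New32a()
	h.Write(key)
	return int(h.Sum32()) % numReducers
}
```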
This phase creates groupings of data. Ultimately, you might go from something like:
<San_Francisco, Golden_Gate_Park>
<San_Francisco, Ghirardelli_Square>
<San_Francisco, Fishermans_Wharf>
<San_Francisco, USF>
…from several mappers to:
<San_Francisco, [Golden_Gate_Park, Ghirardelli_Square, Fishermans_Wharf, USF]>
…all sent to a particular reducer node for post-processing.
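A reducer-side grouping sketch follows, done in memory under the assumption that the grouped data fits; a real implementation might sort spilled files instead. The function name and pair representation are illustrative.

```go
package shuffle

// groupByKey collects all values that share a key so the reducer can be
// called once per key with the full list of values.
func groupByKey(pairs [][2][]byte) map[string][][]byte {
	groups := make(map[string][][]byte)
	for _, kv := range pairs {
		k := string(kv[0])
		groups[k] = append(groups[k], kv[1])
	}
	return groups
}
```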
The Reduce Phase
The Reduce phase is nearly identical to the Map phase, except it receives <key, [list, of, values]> pairs and produces <key, value> pairs as its final output.
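A matching word-count-style reducer under the same hypothetical interface as the Map sketch above:

```go
package job

import "strconv"

// Reduce receives a key and all of its grouped values, and emits final
// <key, value> pairs. Here it sums the counts emitted by the mapper above.
func Reduce(key []byte, values [][]byte, emit func(key, value []byte)) {
	total := 0
	for _, v := range values {
		n, err := strconv.Atoi(string(v))
		if err != nil {
			continue // skip malformed counts in this sketch
		}
		total += n
	}
	emit(key, []byte(strconv.Itoa(total)))
}
```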
Since our DFS in Project 1 was not required to support append operations, you should store intermediate files in a temporary location, append to them as necessary, and then only store the final outputs back in the DFS.
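For example, intermediate records can be appended to local temp files with the standard library; the path layout and helper name here are assumptions.

```go
package reducer

import "os"

// appendIntermediate appends a record to a local intermediate file, creating
// it if necessary. Final results should be written back to the DFS only once
// the job is done.
func appendIntermediate(path string, record []byte) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(record)
	return err
}
```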
Reporting Progress
Each time a map or reduce task completes, send a message to the Computation Manager so progress can be reported to the client. Progress reporting does not have to be more fine-grained than this.
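One possible shape for such a message, assuming JSON-encoded updates over your existing socket connection (the field names and encoding are hypothetical; use whatever message format your system already defines):

```go
package progress

import (
	"encoding/json"
	"net"
)

// TaskProgress is one possible progress update sent to the Computation
// Manager whenever a map or reduce task completes.
type TaskProgress struct {
	JobID     string `json:"job_id"`
	TaskID    string `json:"task_id"`
	Phase     string `json:"phase"` // "map" or "reduce"
	Completed bool   `json:"completed"`
}

// reportProgress sends a single progress update over an open connection.
func reportProgress(conn net.Conn, p TaskProgress) error {
	return json.NewEncoder(conn).Encode(p)
}
```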
Tips and Resources
- Once again: log events in your system! In particular, you should print out your load balancing decisions as they are made.
- Use the orion cluster (orion01 – orion12) to test your code in a distributed setting.
- To store your data, use /bigdata/$(whoami), where $(whoami) expands to your user name. DO NOT use your regular home directory, even for intermediate files, as it will fill up and your account will get locked (and you can potentially lose data).
Project Deliverables
This project will be worth 12 points. The deliverables include:
- [3 pts]: Computation Manager
- [1] Determining chunk locations and distributing jobs to corresponding Storage Nodes
- [1] Load balancing (mapper and reducer selections)
- [1] Notifying clients of job progress and completion
- [5 pts]: Computation Nodes
- [1] Receiving and running jobs
- [1] Map phase
- [1] Shuffle phase
- [1] Reduce phase
- [1] Reporting progress
- [1 pt]: A classic MapReduce word count implementation using your framework
- [2 pts]: Another job of your choosing. Find a dataset, come up with some type of analysis to perform, and implement the job.
- [1 pt]: Design document and retrospective (due after code submission). You may use UML diagrams, Visio, OmniGraffle, etc. This is mostly for your own benefit later, when you want to refer back to the project or explain it in interviews. It outlines:
- Components of your MapReduce implementation
- Design decisions
- Messages the components will use to communicate
- Answers to retrospective questions
Note: your system must be able to support at least 12 active storage nodes, i.e., the entire orion cluster.
Grading
We’ll schedule a demo and code review to grade your assignment. You will demonstrate the required functionality and walk through your design.
I will deduct points if you violate any of the requirements listed in this document (for example, using an unauthorized external library). I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate packages or modules based on functionality, and include comments in your source where appropriate.
Changelog
- 10/20: Version 1.0 posted