Project 2: Distributed Computation Engine (v 1.0)

Starter repository on GitHub: https://classroom.github.com/a/B597N69M

Project 1 dealt with the storage aspects of big data, so in Project 2 we will shift our focus to computations. You will extend your DFS to support MapReduce jobs. Specific features we’ll support include:

Datatype-aware chunk partitioning
Job submission and monitoring, including pushing computations to nodes for data locality
Load balancing across computation nodes
The Map, Shuffle, and Reduce phases of computation

Your implementation must be done in Go (unless otherwise arranged with the instructor), and we will test it using the orion cluster here in the CS department. Communication between components must be implemented via sockets (not RMI, RPC or similar technologies. In particular, you are not allowed to use gRPC for this project) and you may not use any external libraries other than those explicitly stated in the project spec without instructor approval.

Once again, since this is a graduate-level class, you have leeway on how you design and implement your system. As usual, you should be able to explain your design decisions.

Partitioning the Data

While your DFS in Project 1 treated all files as opaque blobs of binary data, your MapReduce implementation must be able to process files line by line. This means that your partitioner will now need to be datatype aware: i.e., if the file is text-based then it should be split on the closest line boundaries rather than an exact amount of bytes based on the chunk size. (Note: if a file is binary, then you should partition it in the same way as Project 1).

This partitioning strategy allows us to provide the first input to a MapReduce job: <line_number, line_text> pairs.

Job Submission

It is highly recommended to add a new node to your system to manage computations. This node will receive job submissions, determine relevant storage nodes based on the input file, and then transfer the job to the target nodes for execution.

Go supports plugins — compiled shared object files — that can be transferred to machines, loaded, and executed. You can encapsulate your MapReduce jobs as plugin files that are pushed to the storage nodes for processing. Alternatively, since Go projects are compiled to native binaries, you can encapsulate your jobs as standalone executables as well. In that case, distributing jobs is easy: receive the compiled Go program as input, transfer it to the relevant storage nodes, and then run it. You can assume that all nodes in your system will be running on the same platform (Linux). If you go the compiled executable route, you’ll need a way for the job to communicate with the storage node; stdin/stdout is acceptable, although sockets may be easier to integrate into your project.

NOTE: the crucial thing to understand here is that MapReduce jobs should NOT be compiled into your computation engine. A user should be able to build a new MapReduce job (based on an interface you provide), compile it, and submit it to your system without needing to shut down or recompile your project.

Load Balancing

Since each file stored in your DFS will have multiple chunks associated with it, and each chunk will have at least 3 replicas available, we can determine a job scheduling strategy that will balance the load across cluster nodes. Ideally we want as many nodes as possible to participate in the computation to increase parallelism.

You should have a way to determine the number of reducers required by your jobs so that the Computation Manager can provide a list of reducers to the Storage Nodes during the Map phase. NOTE: the number of reducers needed is ultimately up to the algorithm, so the job itself should provide this configuration information. Choose reducers that will be co-located with active Map tasks to improve data locality.

The Map Phase

In the Map phase, your job will accept <line_number, line_text> pairs, process them, and then produce <key, value> pairs.

You don’t need to worry about the datatypes of the key or value; treat them as raw bytes and convert them to other types (strings, ints, etc.) as necessary in your jobs. This puts more work on job developers but makes our implementation simpler.

The Shuffle Phase

In the Shuffle phase, <key, value> pairs from the Map phase are sent to their destination reducer nodes based on the key. E.g., all outputs with key=hello will be sent to reducer 1, outputs with key=world will be sent to reducer 2, and so on. Your method for assigning reducers does not have to be complicated; modulo would be acceptable here.

This phase creates groupings of data. Ultimately, you might go from something like:

<San_Francisco, Golden_Gate_Park> <San_Francisco, Ghirardelli_Square> <San_Francisco, Fishermans_Wharf> <San_Francisco, USF>

…from several mappers to:

<San_Francisco, [Golden_Gate_Park, Ghirardelli_Square, Fishermans_Wharf, USF]>

All sent to a particular reducer node for post-processing.

NOTE: in a fully-fledged implementation of MapReduce, sorting operations must be carried out using an external sort – an algorithm that does not require all data to reside in memory. You are allowed to perform in-memory sorts for this assignment because it will simplify your implementation. However, bonus points will be awarded for implementing an external sort.

The Reduce Phase

The Reduce phase is nearly identical to the Map phase, except it receives <key, [list, of, value]> pairs and produces <key, value> pairs as its final output.

Since our DFS in Project 1 was not required to support append operations, you should store intermediate files in a temporary location, append to them as necessary, and then only store the final outputs back in the DFS.

Reporting Progress

Each time a map or reduce task completes, send a message to the Computation Manager so progress can be reported to the client. Progress does not have to be more fine grained than this.

Tips and Resources

Once again: log events in your system! In particular, you should print out your load balancing decisions as they are made.
Use the orion cluster (orion01 – orion12) to test your code in a distributed setting.
- To store your data, use /bigdata/$(whoami), where $(whoami) expands to your user name. DO NOT use your regular home directory even for intermediate files, as it will fill up and your account will get locked (and you can potentially lose data).

Project Deliverables

This project will be worth 12 points. The deliverables include:

[3 pts]: Computation Manager
- [1] Determining chunk locations and distributing jobs to corresponding Storage Nodes
- [1] Load balancing (mapper and reducer selections)
- [1] Notifying clients of job progress and that the job is complete
[5 pts]: Computation Nodes
- [1] Receiving and running jobs
- [1] Map phase
- [1] Shuffle phase
- [1] Reduce phase
- [1] Reporting progress
[1 pts]: A classic MapReduce word count implementation using your framework
[2 pts]: Re-implementing the log analyzer lab as a MapReduce job
[1 pts]: Design document and retrospective (due after code submission). You may use UML diagrams, Vizio, OmniGraffle, etc. This is more to benefit you later when you want to refer back to the project or explain it in interviews etc. It outlines:
- Components of your MapReduce implementation
- Design decisions
- Messages the components will use to communicate
- Answers to retrospective questions

Note: your system must be able to support at least 12 active storage nodes, i.e., the entire orion cluster.

Grading

We’ll schedule a demo and code review to grade your assignment. You will demonstrate the required functionality and walk through your design. Here is what you will be required to do during the demo

I will deduct points if you violate any of the requirements listed in this document — for example, using an unauthorized external library. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes or modules based on functionality, and include comments in your source where appropriate.