CS 686 Big Data

CS686 Project 2: Spatiotemporal Analysis with MapReduce (v 1.3)

Due: November 13

In this assignment, we’ll analyze a dataset collected by NOAA for modeling and predicting climate phenomena: the North American Mesoscale Forecast System (NAM). You’ll write MapReduce jobs that filter and aggregate features from the dataset. You may use any programming language you wish, although the examples will be provided in Java. You may not use any third-party libraries or tools without clearing them with the instructor first.

For more information about the dataset, see the data dictionary page.
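As a rough illustration of what "filter and aggregate" means here, the following sketch simulates the map, shuffle, and reduce phases in plain Java with no Hadoop dependency. The column positions (a grouping key in column 0, a numeric feature in column 1), the class and method names, and the sample records are all made up for illustration; the real layout is described in the data dictionary. In an actual job, the same logic would live in your own Mapper and Reducer classes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FilterAggregateSketch {

    // Hypothetical column positions; consult the data dictionary for the real ones.
    static final int KEY_COLUMN = 0;
    static final int VALUE_COLUMN = 1;

    /** "Map" step: parse one tab-delimited record; return null to filter it out. */
    static Map.Entry<String, Double> map(String line) {
        String[] fields = line.split("\t");
        if (fields.length <= VALUE_COLUMN) {
            return null; // too few columns: drop the record
        }
        try {
            double value = Double.parseDouble(fields[VALUE_COLUMN]);
            return new AbstractMap.SimpleEntry<>(fields[KEY_COLUMN], value);
        } catch (NumberFormatException e) {
            return null; // non-numeric feature: drop the record
        }
    }

    /** "Reduce" step: average all values that share one key. */
    static double reduce(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Runs map, groups by key (the "shuffle"), then reduces each group. */
    static Map<String, Double> run(List<String> lines) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Double> kv = map(line);
            if (kv != null) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : groups.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample records, including one that should be filtered out.
        List<String> sample = Arrays.asList(
                "regionA\t10.0", "regionA\t20.0", "regionB\t5.0", "bad line");
        run(sample).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Keeping the parsing and filtering in one small method makes it easy to unit test locally before submitting anything to the cluster.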

Version Control

To set up your submission repository on GitHub, visit: https://classroom.github.com/a/LPs4Bqng

In the spirit of versioning, I will update the version number at the top of this document every time a change is made and list any changes in the changelog below.

Deliverable I

For this project, you’ll produce several small MapReduce jobs. Each of the tasks below can be broken up into several jobs, or you can combine some of them. As usual, some aspects of these questions are left up to your own interpretation. There are no right/wrong answers, but you should be able to justify your approach.

Important: many of the questions are answered with a time or location. You should also output relevant feature values to back up your answer. For instance, if I ask you which city has the most fast food restaurants per capita, you shouldn’t just say “Paducah, Kentucky.” You should also output how many restaurants there are, the population, etc. Please also include how long each MapReduce job took to run.
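One way to follow this advice is to carry the whole record through the reduction rather than only the winning key, and to time the work in the driver. The sketch below is plain Java; the `Observation` layout (location, timestamp, temperature, humidity) and the sample values are hypothetical and not taken from the dataset. In a Hadoop driver, the timed section would presumably wrap the call that runs the job, such as `job.waitForCompletion(true)`.

```java
import java.util.Arrays;
import java.util.List;

class SupportingValuesSketch {

    /** Hypothetical record layout; the field names are illustrative only. */
    static final class Observation {
        final String location;
        final String timestamp;
        final double temperature;
        final double humidity;

        Observation(String location, String timestamp, double temperature, double humidity) {
            this.location = location;
            this.timestamp = timestamp;
            this.temperature = temperature;
            this.humidity = humidity;
        }

        @Override
        public String toString() {
            return location + "\t" + timestamp
                    + "\ttemperature=" + temperature
                    + "\thumidity=" + humidity;
        }
    }

    /** Returns the hottest observation, keeping every field so the answer can be justified. */
    static Observation hottest(List<Observation> observations) {
        Observation best = null;
        for (Observation o : observations) {
            if (best == null || o.temperature > best.temperature) {
                best = o;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<Observation> sample = Arrays.asList(
                new Observation("locA", "t1", 291.5, 40.0),
                new Observation("locB", "t2", 305.2, 12.0),
                new Observation("locC", "t3", 280.0, 85.0));

        long start = System.nanoTime();
        Observation answer = hottest(sample);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Report the answer together with its supporting values and the run time.
        System.out.println(answer);
        System.out.println("Elapsed: " + elapsedMs + " ms");
    }
}
```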



Point Breakdown

Getting Started

Here are some useful commands for working with HDFS and Hadoop.

# --- Initial Setup ---
# Create a directory for our project outputs:
hdfs dfs -mkdir "/tmp/$(whoami)"

# Don't allow others to read the directory:
hdfs dfs -chmod 700 "/tmp/$(whoami)"
# ---------------------

# View the files in the dataset:
hdfs dfs -ls /tmp/cs686/nam

# Launch MapReduce application (the output path below is an example; pick your own)
yarn jar \
    ./target/project2-1.0.jar \
    edu.usfca.cs.mr.wordcount.WordCountJob \
    /tmp/cs686/nam/nam_mini.tdv \
    "/tmp/$(whoami)/output_123"

# View active MapReduce applications, determine your applicationId
yarn application -list

# After your job completes, view its logs:
yarn logs -applicationId <app_id_here>

# Killing an application:
yarn application -kill <app_id_here>

# Deleting an output directory:
hdfs dfs -rm -r -skipTrash /tmp/yourname/output_123


Here are some milestones to guide your implementation:

You are required to work alone on this project. However, you are certainly free to discuss the project with your peers. We will also conduct in-class lab sessions where you can:


You will have a one-on-one interview and code review to grade your assignment. You will demonstrate the required functionality and walk through your design.

I will deduct points if you violate any of the requirements listed in this document — for example, using an unauthorized external library. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.