CS686 Project 2: Spatiotemporal Analysis with MapReduce (v 1.3)

Due: November 13

In this assignment, we’ll analyze a dataset collected by NOAA for modeling and predicting climate phenomena: the North American Mesoscale Forecast System (NAM). You’ll write MapReduce jobs that filter and aggregate features from the dataset. You are allowed to use any programming language you wish, although the examples will be provided in Java. You are not allowed to use any third-party libraries or tools without running them by the instructor first.

For more information about the dataset, see the data dictionary page.

Version Control

To set up your submission repository on GitHub, visit: https://classroom.github.com/a/LPs4Bqng

In the spirit of versioning, I will update the version number at the top of this document every time a change is made and list any changes in the changelog below.

Deliverable I

For this project, you’ll produce several small MapReduce jobs. Each of the tasks below can be broken up into several jobs, or you can combine some of them. As usual, some aspects of these questions are left up to your own interpretation. There are no right/wrong answers, but you should be able to justify your approach.

Important: many of the questions are answered with a time or location. You should also output relevant feature values to back up your answer. For instance, if I ask you which city has the most fast food restaurants per capita, you shouldn’t just say “Paducah, Kentucky.” You should also output how many restaurants there are, the population, etc. Please also include how long each MapReduce job took to run.

Warm-up

[0.5 pt] How many records are in the dataset?
[0.5 pt] Are there any Geohashes that have snow depths greater than zero for the entire year? List some of the top Geohashes.
[0.5 pt] When and where was the hottest temperature observed in the dataset? Is it an anomaly?
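
For the warm-up questions, one way to keep the supporting values attached to an answer is to carry the whole observation through the reduce step instead of only the maximum. The sketch below shows that idea in plain Java, without any Hadoop classes, so the logic can be checked locally before it is moved into a Mapper and Reducer. The class name, the column positions, and the sample values are all hypothetical; consult the data dictionary for the real field layout.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of the map and reduce logic; no Hadoop classes are used. */
final class HottestObservation {
    // Hypothetical column positions; the real ones come from the data dictionary.
    static final int TIMESTAMP_COL = 0;
    static final int GEOHASH_COL = 1;
    static final int TEMPERATURE_COL = 2;

    /** A candidate answer together with the values that back it up. */
    static final class Observation {
        final long timestamp;
        final String geohash;
        final double temperature;

        Observation(long timestamp, String geohash, double temperature) {
            this.timestamp = timestamp;
            this.geohash = geohash;
            this.temperature = temperature;
        }

        @Override
        public String toString() {
            return timestamp + "\t" + geohash + "\t" + temperature;
        }
    }

    /** "Map" step: parse one tab-delimited line into an Observation. */
    static Observation parse(String line) {
        String[] fields = line.split("\t");
        return new Observation(
                Long.parseLong(fields[TIMESTAMP_COL]),
                fields[GEOHASH_COL],
                Double.parseDouble(fields[TEMPERATURE_COL]));
    }

    /** "Reduce" step: keep the observation with the highest temperature. */
    static Observation hottest(List<String> lines) {
        Observation best = null;
        for (String line : lines) {
            Observation current = parse(line);
            if (best == null || current.temperature > best.temperature) {
                best = current;
            }
        }
        return best; // null when the input is empty
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the NAM dataset.
        List<String> sample = Arrays.asList(
                "1420070400000\t9q8y\t285.5",
                "1436313600000\t9mud\t318.2",
                "1446336000000\tdr5r\t290.1");
        System.out.println(hottest(sample));
    }
}
```

Because the result still holds the timestamp and Geohash, the job output answers "when and where" directly and gives you something to inspect when deciding whether the value is an anomaly.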

Analysis

[1 pt] Where are you most likely to be struck by lightning? Use a precision of 4 Geohash characters and provide the top 3 locations.
[1.5 pt] What is the driest month in the Bay Area? This should include a histogram with data from each month. (Note: how did you determine what data points are in the Bay Area?)
[3 pt] After graduating from USF, you found a startup that aims to provide personalized travel itineraries using big data analysis. Given your own personal preferences, build a plan for a year of travel across 5 locations. Or, in other words: pick 5 regions. What is the best time of year to visit them based on the dataset?
[3 pt] Your travel startup is so successful that you move on to green energy; here, you want to help power companies plan out the locations of solar and wind farms across North America. Write a MapReduce job that locates the top 3 places for solar and wind farms, as well as a combination of both (solar + wind farm). You will report a total of 9 Geohashes as well as their relevant attributes (for example, cloud cover and wind speeds).
- If you’d like to do some data fusion to answer this question, the maps here and here might be helpful.
[3 pt] Given a Geohash prefix, create a climate chart for the region. This includes high, low, and average temperatures, as well as monthly average rainfall (precipitation). Here’s a (poor quality) script that will generate this for you.
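
Several of the analysis questions come down to the same pattern: filter records by Geohash prefix, group them by a key such as the month, and keep running statistics per group. The sketch below illustrates that pattern for the climate chart in plain Java, without Hadoop classes. It assumes each input line has already been reduced to month, Geohash, temperature, and precipitation separated by tabs; that layout, the class names, and the sample rows are hypothetical and do not reflect the actual dataset format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of per-month aggregation for one Geohash prefix. */
final class ClimateChartSketch {
    /** Running statistics for a single month. */
    static final class MonthStats {
        double high = Double.NEGATIVE_INFINITY;
        double low = Double.POSITIVE_INFINITY;
        double temperatureSum = 0.0;
        double precipitationSum = 0.0;
        long count = 0;

        void add(double temperature, double precipitation) {
            high = Math.max(high, temperature);
            low = Math.min(low, temperature);
            temperatureSum += temperature;
            precipitationSum += precipitation;
            count++;
        }

        double averageTemperature() { return temperatureSum / count; }
        double averagePrecipitation() { return precipitationSum / count; }
    }

    /**
     * Groups matching records by month.
     * Assumed (hypothetical) line layout: month, geohash, temperature,
     * precipitation, separated by tabs.
     */
    static Map<Integer, MonthStats> summarize(List<String> lines, String prefix) {
        Map<Integer, MonthStats> byMonth = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (!fields[1].startsWith(prefix)) {
                continue; // outside the requested region
            }
            int month = Integer.parseInt(fields[0]);
            byMonth.computeIfAbsent(month, m -> new MonthStats())
                    .add(Double.parseDouble(fields[2]), Double.parseDouble(fields[3]));
        }
        return byMonth;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the NAM dataset.
        List<String> sample = Arrays.asList(
                "1\t9q8yy\t10.0\t2.0",
                "1\t9q8yz\t20.0\t4.0",
                "2\t9q8yy\t15.0\t0.0",
                "1\tdr5ru\t99.0\t9.0");
        for (Map.Entry<Integer, MonthStats> e : summarize(sample, "9q8y").entrySet()) {
            MonthStats s = e.getValue();
            System.out.println(e.getKey() + "\t" + s.high + "\t" + s.low + "\t"
                    + s.averageTemperature() + "\t" + s.averagePrecipitation());
        }
    }
}
```

Keeping sums and counts rather than precomputed averages means partial results from different mappers or combiners can be merged safely before the final division.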

Point Breakdown

[13 pts] - Deliverable I
[2 pts] - Project Retrospective

Getting Started

Here are some useful commands for working with HDFS and Hadoop:

# --- Initial Setup ---
# Create a directory for our project outputs:
hdfs dfs -mkdir "/tmp/$(whoami)"

# Don't allow others to read the directory:
hdfs dfs -chmod 700 "/tmp/$(whoami)"
# ---------------------

# View the files in the dataset:
hdfs dfs -ls /tmp/cs686/nam

# Launch MapReduce application
yarn jar \
    ./target/project2-1.0.jar \
    edu.usfca.cs.mr.wordcount.WordCountJob \
    /tmp/cs686/nam/nam_mini.tdv \
    "/tmp/$(whoami)/output_job_1"

# View active MapReduce applications, determine your applicationId
yarn application -list

# After your job completes, view its logs:
yarn logs -applicationId <app_id_here>

# Killing an application:
yarn application -kill <app_id_here>

# Deleting an output directory:
hdfs dfs -rm -r -skipTrash "/tmp/$(whoami)/output_123"

Milestones

Here are some milestones to guide your implementation:

Week 1: Make sure you can run a MapReduce job, and get familiar with the tools.
Weeks 2 and 3: Deliverable I

You are required to work alone on this project. However, you are certainly free to discuss the project with your peers. We will also conduct in-class lab sessions where you can:

Get help, discuss issues, think about your design
Demonstrate working functionality early to receive partial credit for completed milestones. If you have misunderstood a requirement or have an error in your logic, you will find out early.

Grading

You will have a one-on-one interview and code review to grade your assignment. You will demonstrate the required functionality and walk through your design.

I will deduct points if you violate any of the requirements listed in this document — for example, using an unauthorized external library. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.

Changelog

11/14: Added link to project retrospective document.
11/5: Removed Deliverable II due to cluster troubles. Added a link to the 30% sample dataset for local analysis.
11/2: Added a few more hints
10/29: Added a few minor clarifications, more info to the data dictionary.
10/23: Version 1.0 posted