CS 686 Big Data

CS686 Project 3: In-Memory Analysis with Spark (v 1.1)

Due: December 7

This assignment extends our previous work with MapReduce. For the first part of the assignment, we’ll continue to use the North American Mesoscale Forecast System (NAM) dataset. As with the previous assignment, you may use any programming language you wish. You may not use any third-party libraries or tools without asking the instructor first.

For more information about the dataset, see the data dictionary page.

NOTE: Since everyone is required to use their own machine (or AWS) for this assignment, I have uploaded a small sample of the full dataset for you to use if your machine can’t fit NAM_2015_S in memory.

Version Control

To set up your submission repository on GitHub, visit:

In the spirit of versioning, I will update the version number at the top of this document every time a change is made and list any changes in the changelog below.

Deliverable I: Individual Work

Example report (1 feature):

Feature: temperature_tropopause
Max value: 272.3
Min value: 250.4
Average: 266.6
Std. Dev: 54.8

Feature: friction_velocity_surface
...
(etc)
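
A report like the one above can be produced in a single pass if each partition keeps a small running summary that can later be merged with the others. The following is a minimal sketch in plain Python, not a required design. It assumes the report wants the population standard deviation (the sample version would divide by count - 1), and the names Stats, summarize, and format_report are hypothetical. The add and merge methods play the roles of the per-record and per-partition functions you would hand to an aggregate-style operation such as Spark's RDD.aggregate.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass
class Stats:
    """Running summary of one feature; partial summaries can be merged."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, x):
        """Fold one observation into the summary (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        return self

    def merge(self, other):
        """Combine two partial summaries without revisiting the data."""
        if other.count == 0:
            return self
        if self.count == 0:
            return other
        total = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / total
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / total
        return Stats(total, mean, m2,
                     min(self.minimum, other.minimum),
                     max(self.maximum, other.maximum))

    @property
    def std_dev(self):
        """Population standard deviation (assumption; see lead-in)."""
        return math.sqrt(self.m2 / self.count) if self.count else float("nan")


def summarize(partitions):
    """Summarize each partition independently, then merge the partials."""
    partials = [reduce(Stats.add, part, Stats()) for part in partitions]
    return reduce(Stats.merge, partials, Stats())


def format_report(feature, s):
    """Render one feature in the layout of the example report."""
    return (f"Feature: {feature}\n"
            f"Max value: {s.maximum:.1f}\n"
            f"Min value: {s.minimum:.1f}\n"
            f"Average: {s.mean:.1f}\n"
            f"Std. Dev: {s.std_dev:.1f}")
```

Calling summarize on a list of lists (one inner list per partition) returns a single Stats object. Keeping only count, mean, and m2 per partition means no partition ever needs to hold the full column in memory.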

Example report:

temperature_tropopause, friction_velocity_surface    0.99
temperature_tropopause, relative_humidity_zerodegc_isotherm    0.72
...
(etc)
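
If the number in the last column is read as a Pearson correlation coefficient (an assumption; this handout does not name the measure), it can be computed from six running sums per feature pair, which merge across partitions by simple addition. The sketch below is plain Python with hypothetical function names. The raw-sum formula is compact but can lose precision when values are large relative to their spread, so treat it as a starting point rather than a finished implementation.

```python
import math
from functools import reduce


def to_sums(pair):
    """Map one (x, y) observation to its contribution to the running sums."""
    x, y = pair
    return (1, x, y, x * x, y * y, x * y)


def add_sums(a, b):
    """Merge two partial sums element-wise (associative and commutative)."""
    return tuple(p + q for p, q in zip(a, b))


def pearson(sums):
    """Pearson r from (n, sum x, sum y, sum x^2, sum y^2, sum xy)."""
    n, sx, sy, sxx, syy, sxy = sums
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if var_x <= 0 or var_y <= 0:
        # A constant (or empty) column has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation(pairs):
    """Reduce an iterable of (x, y) pairs to a single coefficient."""
    zero = (0, 0.0, 0.0, 0.0, 0.0, 0.0)
    return pearson(reduce(add_sums, map(to_sums, pairs), zero))
```

Because add_sums is associative, the same two functions fit a map step followed by a reduce step, with one set of sums per pair of feature names.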

Deliverable II: Team Analysis

Given the work you did in the previous assignments, design your own problem to analyze as a team.

The first step is choosing a dataset to analyze. Since this course is all about big data, the dataset should be sufficiently large; 5 GB is the minimum, but you should also take the number of files/records into account. A very large number of tiny records is fine; a small number of very large files is not (for instance, a 5 GB Blu-ray movie dataset would probably only contain half a movie…). Moral of the story: you have some flexibility here, but if you have any doubts, ask the instructor. And if you absolutely cannot find an interesting dataset that meets these requirements, ask the instructor.
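
A quick way to sanity-check a candidate dataset against the size guideline above is to total its bytes and count its files. This is a rough sketch only: dataset_footprint and meets_minimum are hypothetical helpers, "5 GB" is assumed to mean 5 × 10^9 bytes, and counting files says nothing about the number of records inside each file.

```python
import os

MIN_BYTES = 5 * 10**9  # assumption: "5 GB" read as decimal gigabytes


def dataset_footprint(root):
    """Return (total_bytes, file_count) for all regular files under root."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skip broken links and special files
                total_bytes += os.path.getsize(path)
                file_count += 1
    return total_bytes, file_count


def meets_minimum(total_bytes, min_bytes=MIN_BYTES):
    """True if the dataset reaches the assumed minimum size."""
    return total_bytes >= min_bytes
```

A result with a large byte total but only a handful of files is a signal to check with the instructor before committing to the dataset.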

You are encouraged to use any resources at your disposal; AWS, Azure, and Google Compute Engine are all fair game here.

Requirements:

Point Breakdown

Deadlines

Grading

I will deduct points if you violate any of the requirements listed in this document (for example, by using an unauthorized external library). I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.

Changelog