CS686 Project 3: In-Memory Analysis with Spark (v1.1)
Due: December 7
This assignment extends our previous work with MapReduce. For the first part of the assignment, we’ll continue to use the North American Mesoscale Forecast System (NAM) dataset. As with the previous assignment, you may use any programming language you wish, but you may not use any third-party libraries or tools without asking the instructor first.
For more information about the dataset, see the data dictionary page.
NOTE: Since everyone is required to use their own machine (or AWS) for this assignment, I have uploaded a small sample of the dataset for use in case your machine can’t fit NAM_2015_S in memory.
Version Control
To set up your submission repository on GitHub, visit:
- (for Deliverable I): https://classroom.github.com/a/LjB4wIsw
- (for Deliverable II): https://classroom.github.com/g/RXRW2Ah2
In the spirit of versioning, I will update the version number at the top of this document every time a change is made and list any changes in the changelog below.
Deliverable I: Individual Work
- [3 pt] Spark provides a different programming paradigm compared to MapReduce. Choose three of the questions from the previous assignment (not including the record count) and re-implement them using Spark.
- For this question, provide an overview of your experience with these jobs: did they complete faster? Was the programming model better or worse than MapReduce’s? Which did you prefer?
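Example Spark job (a minimal sketch, not a required approach; the file path, delimiter, and column indices below are hypothetical placeholders for whichever question you choose):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hottest-location").getOrCreate()

GEOHASH_COL = 1   # hypothetical column index for the geohash
TEMP_COL = 12     # hypothetical column index for temperature_surface

lines = spark.sparkContext.textFile("hdfs://localhost:9000/nam/*")

# Per-geohash maximum temperature, then the single hottest location overall
hottest = (lines
    .map(lambda line: line.split("\t"))
    .map(lambda cols: (cols[GEOHASH_COL], float(cols[TEMP_COL])))
    .reduceByKey(max)
    .max(key=lambda pair: pair[1]))

print(hottest)
```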
- [2 pt] It is often useful to understand feature properties and distributions. Build a Spark job that computes summary statistics for each feature. This should include:
- Minimum and maximum values
- Average
- Standard deviation
Example report (1 feature):
Feature: temperature_tropopause
Max value: 272.3
Min value: 250.4
Average: 266.6
Std. Dev: 5.8
Feature: friction_velocity_surface
...
(etc)
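One way to produce this report (a sketch, assuming the data has already been loaded into a DataFrame `df` with one numeric column per feature, e.g., via spark.read.csv with a schema):

```python
from pyspark.sql import functions as F

for feature in df.columns:
    stats = df.agg(
        F.max(feature).alias("max"),
        F.min(feature).alias("min"),
        F.avg(feature).alias("avg"),
        F.stddev(feature).alias("stddev"),
    ).first()
    print(f"Feature: {feature}")
    print(f"  Max value: {stats['max']}")
    print(f"  Min value: {stats['min']}")
    print(f"  Average: {stats['avg']}")
    print(f"  Std. Dev: {stats['stddev']}")
```

Note that this runs one Spark job per feature; you can combine all of the aggregates into a single `df.agg(...)` call (or start from `df.describe()`) to compute everything in one pass.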
- [2 pt] Determine how features relate to one another using Pearson’s correlation coefficient (PCC). The output of this job should include (1) feature pairs sorted by absolute correlation coefficient, and (2) a correlation matrix visualization (heatmaps are a good option).
- Here’s an example heatmap generation script.
Example report:
temperature_tropopause, friction_velocity_surface 0.99
temperature_tropopause, relative_humidity_zerodegc_isotherm 0.72
...
(etc)
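A possible starting point for the matrix (a sketch, again assuming a numeric DataFrame `df`; Correlation.corr computes Pearson by default):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vec_df = assembler.transform(df).select("features")

# Single-row result holding the full correlation matrix
matrix = Correlation.corr(vec_df, "features").head()[0].toArray()

# Flatten the upper triangle into pairs sorted by absolute correlation
names = df.columns
pairs = [(names[i], names[j], matrix[i][j])
         for i in range(len(names))
         for j in range(i + 1, len(names))]
for a, b, r in sorted(pairs, key=lambda p: -abs(p[2])):
    print(f"{a}, {b} {r:.2f}")
```

The resulting matrix can be fed directly to the heatmap script linked above.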
- [2 pt] Using the feature summaries and correlations as your guide, choose a feature to predict via machine learning models in MLlib. You will need to explain:
- The feature you will predict
- Features used to train the model
- How you partitioned your data
- Why the model choices make sense (for example: if two features are highly correlated, then perhaps you don’t need a model…)
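A hedged sketch of the training step (the target and feature columns below are hypothetical choices; substitute your own, and note the 80/20 random split as one possible partitioning):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Illustrative choices only; justify your own in the write-up
feature_cols = ["relative_humidity_zerodegc_isotherm",
                "friction_velocity_surface"]
target = "temperature_surface"

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(df).select("features", target)

train, test = data.randomSplit([0.8, 0.2], seed=42)

model = LinearRegression(labelCol=target).fit(train)
preds = model.transform(test)

rmse = RegressionEvaluator(labelCol=target, metricName="rmse").evaluate(preds)
print(f"RMSE: {rmse}")
```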
- [1 pt] A visualization built from the dataset. You can choose any type of visualization here, but it should help tell a story about the data.
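One low-friction option (a sketch; the sampled columns are hypothetical): sample a small fraction of the data in Spark, pull it to the driver, and plot it locally:

```python
import matplotlib.pyplot as plt

sample = (df.select("relative_humidity_zerodegc_isotherm",
                    "temperature_surface")
            .sample(fraction=0.01, seed=7)
            .toPandas())

plt.scatter(sample["relative_humidity_zerodegc_isotherm"],
            sample["temperature_surface"], s=2, alpha=0.3)
plt.xlabel("relative_humidity_zerodegc_isotherm")
plt.ylabel("temperature_surface")
plt.title("Sampled NAM readings")
plt.savefig("scatter.png")
```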
Deliverable II: Team Analysis
Given the work you did in the previous assignments, design your own problem to analyze as a team.
The first step is choosing a dataset to analyze. Since this course is all about big data, the dataset should be sufficiently large: 5 GB is the minimum, but you should also take the number of files/records into account. A very large number of tiny records is fine; a small number of very large files is not (for instance, a 5 GB Blu-Ray movie dataset would probably only contain half a movie…). Moral of the story: you have some flexibility here, but if you have any doubts, or absolutely cannot find an interesting dataset that meets these requirements, ask the instructor.
You are encouraged to use any resources at your disposal; AWS, Azure, Google Compute Engine are all fair game here.
Requirements:
- [1 pt] Start with a collaboration plan that lists your group members and the dataset you will analyze. If you will collect the data yourself, describe how. You should also provide a high-level overview of your goals for the project; your grade will be measured against these goals.
- [1 pt] Collect and clean the data for analysis.
- [1 pt] Provide a basic set of summary information about the features in your dataset (you can reuse the code you developed in Deliverable I).
- [4 pt] Analysis. You may employ machine learning models, visualizations, etc., and can leverage the code you developed previously.
- [6 pt] Final project presentation. This should briefly cover your dataset, your analysis goals, your results, and a demo (if applicable).
Point Breakdown
- [10 pt] Deliverable I
- [13 pt] Deliverable II
- [2 pt] Project Retrospective
Deadlines
- Nov 22: Collaboration plan
- Dec 1: Deliverable I
- Dec 6: Final presentations
Grading
I will deduct points if you violate any of the requirements listed in this document (for example, using an unauthorized external library). I may also deduct points for poor design or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.
Changelog
- 11/30: Added small sample dataset
- 11/20: Version 1.0 posted