Project 3: Spatiotemporal Analysis with Spark (v 1.0)

Starter repository on GitHub: https://classroom.github.com/a/Zp2cUk4s

In this assignment, we’ll analyze a dataset collected by NOAA for modeling and predicting climate phenomena: the North American Mesoscale Forecast System (NAM). You’ll write Spark jobs that filter and aggregate features from the dataset. You may use any Spark-compatible programming language and any libraries, as long as the libraries don’t implement the assignment for you (check with the instructor first if you’re unsure).

Dataset Location

You can find the files in /bigdata/mmalensek/nam/3hr on orion03 and orion04. The dataset is in .tdv format: tab-delimited values with an initial header line that describes the features. You can find a list of features and their units here.

A smaller 10% sample is available for local analysis in /bigdata/mmalensek/nam/3hr_sample. If you use this sample, please make a note of it in your project report.
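
For reference, here is a minimal PySpark sketch (assuming Python; any Spark-compatible language works) that loads the tab-delimited files and runs a simple filter-and-aggregate job. The paths come from this handout, but the feature name `temperature_surface` is a hypothetical placeholder; substitute a real column from the dataset’s header.

```python
# Minimal sketch: load the .tdv files and run one filter/aggregate job.
# NOTE: "temperature_surface" is a hypothetical column name -- replace it
# with an actual feature from the dataset's header.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nam-analysis").getOrCreate()

# Tab-delimited values with a header row. inferSchema is convenient for
# exploration; an explicit schema will load faster on the full dataset.
df = (spark.read
      .option("sep", "\t")
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/bigdata/mmalensek/nam/3hr_sample"))  # or /bigdata/mmalensek/nam/3hr

df.printSchema()

# Example job: filter rows above freezing, then aggregate.
(df.filter(F.col("temperature_surface") > 273.15)
   .agg(F.avg("temperature_surface").alias("avg_temp_k"))
   .show())
```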

Deliverables

You will submit two deliverables for Project 3:

Some aspects of these questions are left to your own interpretation. Occasionally there is no exact right or wrong answer, but you should be able to justify your approach.

Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your notebook or provide some statistics about it (population, nearby landmarks). Perhaps the question involves some obscure concept or subculture; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!). Please also include how long each Spark job ran. Failure to include necessary contextual information will result in deductions, regardless of whether you found the correct answer.
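
The Spark web UI reports per-job durations, but if you’d rather capture timings in the notebook itself, one sketch (reusing `df`, `F`, and the hypothetical column name from the loading example above) is to time around the action, since transformations are lazy:

```python
import time

start = time.perf_counter()
# Only the action (count) triggers the job; the filter above it is lazy.
n = df.filter(F.col("temperature_surface") > 273.15).count()
print(f"Job took {time.perf_counter() - start:.1f} s ({n} matching rows)")
```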

Analysis

Self-Guided Analysis

Now it’s time to analyze a dataset of your own. Find a dataset online and use Spark to analyze it. You should:

  1. [0.5 pt] Describe the dataset
  2. [0.5 pt] Outline the types of insights you hope to gain from it and make hypotheses about what you might find.
  3. [1 pt] Collect and clean the data for analysis and provide a basic set of summary information about the features in your dataset (see the sketch after this list).
  4. [6 pt] Design at least 3 “questions” (along the lines of those above) and answer them. Remember that presentation matters here.
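
As one possible starting point for step 3, here is a sketch of per-feature summaries, assuming your cleaned dataset is already loaded as a DataFrame named `my_df` (a hypothetical name):

```python
from pyspark.sql import functions as F

# Built-in per-column statistics: count, mean, stddev, min, max.
my_df.describe().show()

# Null count per column -- quick evidence of how clean the data is.
my_df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in my_df.columns]
).show()
```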

Grading

Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.

Steep deductions will be applied if you do most of your analysis on the driver, i.e., not distributed across the cluster with Spark!
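
To illustrate the distinction (with a hypothetical DataFrame `df` and column name): collecting rows to the driver and looping over them in plain Python is the pattern to avoid, while the equivalent distributed aggregation runs on the executors and ships only the final value back.

```python
from pyspark.sql import functions as F

# Anti-pattern: every row is pulled to the driver, then summed in Python.
# total = sum(row["precipitation"] for row in df.collect())

# Distributed version: the sum happens on the executors; only one number
# ("precipitation" is a hypothetical column) returns to the driver.
total = df.agg(F.sum("precipitation").alias("total")).first()["total"]
```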