CS 677 Big Data

Project 3: Spatiotemporal Analysis with Spark (v 1.0)

Starter repository on GitHub: https://classroom.github.com/g/lY4Jm4vN

In this assignment, we’ll analyze a dataset collected by NOAA for modeling and predicting climate phenomena: the North American Mesoscale Forecast System (NAM). You’ll write Spark jobs that filter and aggregate features from the dataset. You may use any Spark-compatible programming language and any libraries, as long as they don’t implement or complete the assignment for you (check with the instructor first if you’re unsure).

Dataset Location

You can find the files in /bigdata/mmalensek/data/nam on orion01.

For more information about the dataset, see the data dictionary page.


You will submit two deliverables for Project 3:

As usual, some aspects of these questions are left up to your own interpretation. Sometimes there is no single right or wrong answer, but you should be able to justify your approach.

Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your notebook or provide some statistics about it (population, nearby landmarks). Perhaps the question involves some obscure concept or subculture; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!). Please also include how long each Spark job ran. Failure to include necessary contextual information will result in deductions, regardless of whether you found the correct answer.
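One way to capture job run times is to wrap each job in a small timer. The sketch below is only an illustration: the helper name `timed` and the `timings` dictionary are hypothetical, and the summation is a stand-in for a real Spark job. Keep in mind that Spark transformations are lazy, so the timed block needs to contain an action (such as a count or a collect) for the measurement to mean anything.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

with timed("stand-in job", timings):
    # Stand-in workload (an assumption for illustration). In a notebook this
    # would be a Spark action, since transformations alone return immediately.
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Collecting the durations in one dictionary makes it easy to print a single timing table at the end of the notebook.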


Note: each member of your group (if applicable) should answer the following questions individually.


Example report:

temperature_tropopause, friction_velocity_surface    0.99
temperature_tropopause, relative_humidity_zerodegc_isotherm    0.72
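The report above appears to list pairs of features alongside a coefficient. Assuming those values are Pearson correlation coefficients (the handout does not say so explicitly), the following sketch shows how such a report could be produced for a handful of columns. The feature names and values here are made up for illustration and are not taken from the NAM dataset; on the full dataset you would more likely compute the coefficient in Spark itself (for example, with PySpark's `DataFrame.stat.corr`).

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_report(columns):
    """Return (name_a, name_b, r) for every feature pair, strongest first."""
    rows = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


# Toy values, made up for illustration; not taken from the NAM dataset.
toy = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [2.0, 4.0, 6.0, 8.0],
    "feature_c": [1.0, 3.0, 2.0, 4.0],
}

for a, b, r in correlation_report(toy):
    print(f"{a}, {b}    {r:.2f}")
```

Sorting by absolute value keeps strong negative correlations near the top of the report instead of burying them at the bottom.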

Option 1: Advanced Analysis

You’ve had the opportunity to analyze two datasets thus far; now it’s time to analyze a dataset of your own. Find a dataset online and use Spark (or Hadoop) to analyze it. You should:

  1. [0.5 pt] Describe the dataset
  2. [0.5 pt] Outline the types of insights you hope to gain from it
  3. [1 pt] Make hypotheses about what you might find
  4. [6 pt] Design at least 3 “questions” (along the lines of those above) and answer them. Remember that presentation matters here.

Option 2: Big Data Systems

As an alternative, you can set up a big data system that you are interested in using (such as Flink, Kafka, or Cassandra). The process here is:

  1. [1 pt] System setup (must run on all 12 orion machines)
  2. [2 pt] Setup document: describe the tutorials/resources you used and any issues you came across during the setup process
  3. [3 pt] Store some of the data from P2 or P3 and then port 3 questions to the system. (Note: this may be difficult depending on the type of system you’re experimenting with. You can propose alternatives.)
  4. [2 pt] Comparison: benchmark one aspect of this system against one we used previously (Cassandra vs HDFS, Spark vs Flink, etc.). It’s okay if the comparison isn’t 100% fair.
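For the comparison step, a small harness that runs the same task several times on each system and reports the median can make the numbers less sensitive to one-off hiccups. This is a minimal sketch, not a required method: the function names are hypothetical, and the two tasks are stand-ins for whatever operation you choose to benchmark on each system.

```python
import statistics
import time


def benchmark(task, repeats=5):
    """Run `task` `repeats` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(name, durations):
    """Format the median and the min-max range of a list of durations."""
    return (
        f"{name}: median {statistics.median(durations):.4f} s, "
        f"range {min(durations):.4f}-{max(durations):.4f} s "
        f"over {len(durations)} runs"
    )


# Stand-in workloads (assumptions for illustration). Replace each with the
# operation being compared, e.g. the same query issued against each system.
def system_a_task():
    sorted(range(50_000, 0, -1))


def system_b_task():
    sum(x * x for x in range(50_000))


for name, task in [("system A", system_a_task), ("system B", system_b_task)]:
    print(summarize(name, benchmark(task)))
```

Reporting the range next to the median shows how noisy the measurements were, which helps when explaining why a comparison may not be entirely fair.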



Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.