Project 3: Spatiotemporal Analysis with Spark (v 1.0)
Starter repository on GitHub: https://classroom.github.com/g/lY4Jm4vN
In this assignment, we’ll analyze a dataset collected by NOAA for modeling and predicting climate phenomena: the North American Mesoscale Forecast System (NAM). You’ll write Spark jobs that filter and aggregate features from the dataset. You are allowed to use any Spark-compatible programming language, and can use any libraries as long as they don’t implement/complete the assignment for you (check with the instructor first if you’re unsure).
You can find the files in
For more information about the dataset, see the data dictionary page.
You will submit two deliverables for Project 3:
- A Jupyter notebook containing your code and answers to the questions below with your thoughts/opinions/analysis.
- A project retrospective
As usual, some aspects of these questions are left up to your own interpretation. Some questions have no single right or wrong answer, but you should be able to justify your approach.
Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your notebook or provide some statistics about it (population, nearby landmarks). Perhaps the question involves some obscure concept or subculture; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!). Please also include how long each Spark job ran. Failure to include necessary contextual information will result in deductions, regardless of whether you found the correct answer.
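One way to record how long each job ran is a small timing helper around the cell that triggers the work. This is only a sketch: `timed` and `timings` are hypothetical names, and the `sum(...)` call stands in for a real Spark action.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration (in seconds) of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("example job", timings):
    # Stand-in for a Spark action such as df.count(). Spark evaluates lazily,
    # so time the action (count, collect, write, ...), not the transformations.
    total = sum(range(1_000_000))

print(f"example job took {timings['example job']:.3f} s")
```

Collecting the durations in one dictionary makes it easy to print a summary table at the end of the notebook.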
Note: each member of your group (if applicable) should answer the following questions individually.
[0.5 pt] Unknown Feature: Choose a feature from the data dictionary above that you have never heard of before. Inspect some of the values for the feature (such as its average, min, max, etc.) and try to guess what it measures. Was your hypothesis correct? (Note: if you are a professional meteorologist, you can skip this question ;-))
[0.5 pt] Hot hot hot: When and where was the hottest temperature observed in the dataset? Is it an anomaly?
[1 pt] So Snowy: Find a location that is snowy all year (there are several). Locate a nearby town/city and provide a small writeup about it. Include pictures if you’d like.
[1 pt] Strangely Snowy: Find a location that contains snow while its surroundings do not. Why does this occur? Is it a high mountain peak in a desert?
[1 pt] Lightning rod: Where are you most likely to be struck by lightning? Use a precision of at least 4 Geohash characters and provide the top 3 locations.
[1 pt] Drying out: Choose a region in North America (defined by one or more Geohashes) and determine when its driest month is. This should include a histogram with data from each month.
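As a rough illustration of the per-month aggregation, the following plain-Python sketch averages a value by month, picks the minimum, and prints a text histogram. The (month, precipitation) pairs are invented, and in your notebook you would normally plot the result with a charting library instead.

```python
from collections import defaultdict


def monthly_means(records):
    """Return {month: mean value} for an iterable of (month, value) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [running sum, count]
    for month, value in records:
        totals[month][0] += value
        totals[month][1] += 1
    return {m: s / c for m, (s, c) in sorted(totals.items())}


# Made-up (month, precipitation) observations for one region.
sample = [(1, 4.0), (1, 2.0), (2, 1.0), (7, 0.0), (7, 0.5), (12, 5.0)]

means = monthly_means(sample)
driest = min(means, key=means.get)

for month, mean in means.items():
    bar = "#" * round(mean * 4)  # crude text histogram, 4 characters per unit
    print(f"{month:2d} {mean:5.2f} {bar}")
print("driest month:", driest)
```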
[2 pt] Travel Startup: After graduating from USF, you found a startup that aims to provide personalized travel itineraries using big data analysis. Given your own personal preferences, build a plan for a year of travel across 5 locations. Or, in other words: pick 5 regions. What is the best time of year to visit them based on the dataset?
- One avenue here could be determining the comfort index for a region. You could incorporate several features: not too hot, not too cold, dry, humid, windy, etc. There are several different ways of calculating this available online, and you could also analyze how well your own metrics do.
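To make the comfort index idea concrete, here is a deliberately simple, made-up scoring function. It is not one of the published indices; the ideal values, penalty ranges, and weights are all assumptions you would replace with your own preferences.

```python
def comfort_score(temp_c, humidity_pct, wind_ms,
                  ideal_temp_c=22.0, ideal_humidity_pct=45.0, max_wind_ms=10.0):
    """Return a score in [0, 1]; 1 means every input sits at its ideal value.

    Hypothetical weights: temperature 50%, humidity 30%, wind 20%.
    """
    temp_penalty = min(abs(temp_c - ideal_temp_c) / 20.0, 1.0)
    humidity_penalty = min(abs(humidity_pct - ideal_humidity_pct) / 55.0, 1.0)
    wind_penalty = min(max(wind_ms, 0.0) / max_wind_ms, 1.0)
    return 1.0 - (0.5 * temp_penalty + 0.3 * humidity_penalty + 0.2 * wind_penalty)


# Made-up monthly averages for one region: (temperature C, humidity %, wind m/s).
monthly = {
    "Jan": (2.0, 80.0, 6.0),
    "May": (21.0, 50.0, 3.0),
    "Aug": (34.0, 30.0, 2.0),
}

scores = {month: comfort_score(*values) for month, values in monthly.items()}
best_month = max(scores, key=scores.get)
print(best_month, round(scores[best_month], 2))
```

Scoring each month per region and taking the maximum gives one defensible answer to "when should I visit?", and comparing it against a published index is a good way to analyze how well your own metric does.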
[1 pt] Escaping the fog: After becoming rich from your startup, you are looking for the perfect location to build your Bay Area mansion with unobstructed views. Find the locations that are the least foggy and show them on a map.
[2 pt] SolarWind, Inc.: You get bored enjoying the amazing views from your mansion, so you start a new company; here, you want to help power companies plan out the locations of solar and wind farms across North America. Locate the top 3 places for solar and wind farms, as well as a combination of both (solar + wind farm). You will report a total of 9 Geohashes as well as their relevant attributes (for example, cloud cover and wind speeds).
[2 pt] Climate Chart: Given a Geohash prefix, create a climate chart for the region. This includes high, low, and average temperatures, as well as monthly average rainfall (precipitation). Here’s a (poor quality) script that will generate this for you.
- Earn up to 1 point of extra credit for enhancing/improving this chart (or porting it to a more feature-rich visualization library)
[2 pt] Influencers: Determine how features influence each other using Pearson’s correlation coefficient (PCC). The output for this job should include (1) feature pairs sorted by absolute correlation coefficient, and (2) a correlation matrix visualization (heatmaps are a good option).
- Here’s an example heatmap generation script.
Example of the sorted feature-pair output:
temperature_tropopause, friction_velocity_surface 0.99
temperature_tropopause, relative_humidity_zerodegc_isotherm 0.72
... (etc)
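For reference, Pearson's correlation coefficient for two features $x$ and $y$ is $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$. The sketch below computes it in plain Python for a few made-up columns and sorts the pairs by absolute value; the feature names are placeholders, and on the full dataset you would compute the coefficients with Spark rather than on the driver.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (undefined for constant input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def ranked_pairs(columns):
    """Return (name_a, name_b, r) for every feature pair, largest |r| first."""
    pairs = [(a, b, pearson(columns[a], columns[b]))
             for a, b in combinations(columns, 2)]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Made-up columns; the names are placeholders, not real NAM features.
columns = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [2.0, 4.0, 6.0, 8.0],    # exactly 2 * feature_a
    "feature_c": [4.0, 3.0, 2.5, 1.0],    # decreases as feature_a increases
    "feature_d": [1.0, -1.0, -1.0, 1.0],  # no linear relation to feature_a
}

ranked = ranked_pairs(columns)
for a, b, r in ranked:
    print(f"{a}, {b} {r:.2f}")
```

Sorting by absolute value keeps strong negative correlations near the top of the list, where a plain descending sort would bury them at the bottom.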
[2 pt] Prediction/Classification: Using what you learned above as your guide, choose a feature to predict or classify via machine learning models in MLlib. You will need to explain:
- The feature you will predict/classify
- Features used to train the model
- How you partitioned your data
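When you explain how you partitioned your data, it helps to show a reproducible split. The plain-Python sketch below shuffles rows with a fixed seed and cuts them 80/20; the fraction, seed, and function name `split_rows` are assumptions. Spark DataFrames offer `randomSplit` for the same purpose.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle rows with a fixed seed and split them into (train, test) lists."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    shuffled = list(rows)      # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 observations
train, test = split_rows(rows)
print(len(train), len(test))
```

A purely random split is only one option. With spatiotemporal data you may prefer to hold out a time range or a set of regions, and whichever approach you choose is worth justifying in your writeup.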
Option 1: Advanced Analysis
You’ve had the opportunity to analyze two datasets thus far; now it’s time to analyze a dataset of your own. Find a dataset online and use Spark (or Hadoop) to analyze it. You should:
- [0.5 pt] Describe the dataset
- [0.5 pt] Outline the types of insights you hope to gain from it
- [1 pt] Make hypotheses about what you might find
- [6 pt] Design at least 3 “questions” (along the lines of those above) and answer them. Remember that presentation matters here.
Option 2: Big Data Systems
As an alternative, you can set up a big data system that you are interested in using (such as Flink, Kafka, Cassandra, etc). The process here is:
- [1 pt] System setup (must run on all 12
- [2 pt] Setup document: describe the tutorials/resources you used and any issues you came across during the setup process
- [3 pt] Store some of the data from P2 or P3 and then port 3 questions to the system. (Note: this may be difficult depending on the type of system you’re experimenting with. You can propose alternatives)
- [2 pt] Comparison: benchmark one aspect of this system against one we used previously (Cassandra vs HDFS, Spark vs Flink, etc). It’s okay if the comparison isn’t 100% fair.
- [1 pt] Project retrospective
Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.
- 11/12: Version 1.0 posted