Project 3: Spatiotemporal Analysis with Spark (v 1.0)
Starter repository on GitHub: https://classroom.github.com/a/Zp2cUk4s
In this assignment, we’ll analyze a dataset collected by NOAA for modeling and predicting climate phenomena: the North American Mesoscale Forecast System (NAM). You’ll write Spark jobs that filter and aggregate features from the dataset. You are allowed to use any Spark-compatible programming language, and can use any libraries as long as they don’t implement/complete the assignment for you (check with the instructor first if you’re unsure).
Dataset Location
You can find the files in /bigdata/mmalensek/nam/3hr on orion03 and orion04. The dataset is in .tdv format: tab-delimited values with an initial header that describes the features. You can find a list of features and their units here.
A smaller 10% sample is available for local analysis in /bigdata/mmalensek/nam/3hr_sample. If you use this sample, please make note of it in your project report.
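To get started, here's a minimal PySpark sketch for loading the .tdv files into a DataFrame. The SparkSession setup is generic, and reading the whole directory at once is just one option; adjust the path for your environment:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nam-analysis").getOrCreate()

# .tdv files are tab-delimited with a header row describing the features.
# The path below is the 10% sample; swap in /bigdata/mmalensek/nam/3hr
# for the full dataset.
nam = (spark.read
       .option("sep", "\t")
       .option("header", "true")
       .option("inferSchema", "true")  # convenient, but slow on the full set
       .csv("/bigdata/mmalensek/nam/3hr_sample"))

nam.printSchema()
```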
Deliverables
You will submit two deliverables for Project 3:
- A Jupyter notebook containing your code and answers to the questions below with your thoughts/opinions/analysis.
- A project retrospective
Some aspects of these questions are left up to your own interpretation. In some cases there is no exactly right or wrong answer, but you should be able to justify your approach.
Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your notebook or provide some statistics about it (population, nearby landmarks). Perhaps the question involves some obscure concept or subculture; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!). Please also include how long each Spark job ran. Failure to include necessary contextual information will result in deductions, regardless of whether you found the correct answer.
Analysis
- [0.5 pt] Strangely Snowy: Find a location that contains snow while its surroundings do not. Why does this occur? Is it a high mountain peak in a desert?
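One way to approach this (a sketch, not the required method): aggregate a snow indicator per small Geohash cell and flag cells that are snowy while their enclosing region is not. The column names `geohash` and `snow` are placeholders; substitute the actual feature names from the list above.

```python
from pyspark.sql import functions as F

# 'geohash' and 'snow' are placeholders -- check the feature list.
cells = (nam
    .withColumn("gh4", F.substring("geohash", 1, 4))
    .groupBy("gh4")
    .agg(F.avg("snow").alias("snow_frac")))   # fraction of snowy records

# Compare each 4-character cell against its enclosing 3-character region.
regions = (cells
    .withColumn("gh3", F.substring("gh4", 1, 3))
    .groupBy("gh3")
    .agg(F.avg("snow_frac").alias("region_snow_frac")))

anomalies = (cells
    .withColumn("gh3", F.substring("gh4", 1, 3))
    .join(regions, "gh3")
    .filter((F.col("snow_frac") > 0.5) & (F.col("region_snow_frac") < 0.1))
    .orderBy(F.desc("snow_frac")))

anomalies.show()
```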
- [0.5 pt] Climate Chart: Given a Geohash prefix as an input, build a function that will create a climate chart for the region. This includes high, low, and average temperatures, as well as monthly average rainfall (precipitation). Here’s a (poor quality) script that will generate this for you, but you should probably modify it to make sure your units, scale, etc. are all presented in a readable fashion.
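A starting point for the aggregation side of the chart, assuming a `timestamp` column in epoch milliseconds and placeholder feature names (check the feature list for the real ones):

```python
from pyspark.sql import functions as F

def climate_summary(df, prefix):
    """Monthly temperature/precipitation summary for a Geohash prefix."""
    return (df
        .filter(F.col("geohash").startswith(prefix))
        .withColumn("month",
                    F.month(F.from_unixtime(F.col("timestamp") / 1000)))
        .groupBy("month")
        .agg(F.max("temperature_surface").alias("high"),
             F.min("temperature_surface").alias("low"),
             F.avg("temperature_surface").alias("avg_temp"),
             F.avg("precipitation_surface").alias("avg_precip"))
        .orderBy("month"))

# Example call with a hypothetical prefix (roughly the SF Bay Area):
climate_summary(nam, "9q8y").show()
```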
- [1.5 pt] Travel Startup: After graduating from USF, you found a startup that aims to provide personalized travel itineraries using big data analysis. Given your own personal preferences, build a plan for a year of travel across 5 locations. Or, in other words: pick 5 regions. What is the best time of year to visit them based on the dataset?
- Part of this involves determining the comfort index for a region. You could incorporate several features: not too hot, not too cold, dry, humid, windy, etc. There are several different ways of calculating this available online, and you could also analyze how well your own metrics do. (A minimal scoring sketch follows this question.)
- Another part of this involves presentation. You have to convince your potential customers that your travel itinerary is better than something they could come up with themselves with a little Googling. You can use pictures, information about local points of interest, etc.
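For the comfort index, one minimal sketch: score each record by its distance from an "ideal" temperature plus weighted penalties, then find each region's most comfortable month. All feature names and weights here are assumptions to adapt:

```python
from pyspark.sql import functions as F, Window

IDEAL_TEMP_K = 295.0  # ~22 C; pick your own ideal

discomfort = (nam
    .withColumn("month",
                F.month(F.from_unixtime(F.col("timestamp") / 1000)))
    .withColumn("discomfort",
        F.abs(F.col("temperature_surface") - F.lit(IDEAL_TEMP_K))
        + 0.1 * F.col("relative_humidity")   # placeholder weights
        + 0.5 * F.col("wind_speed"))
    .groupBy(F.substring("geohash", 1, 3).alias("region"), "month")
    .agg(F.avg("discomfort").alias("avg_discomfort")))

# Best month per region: the one with the lowest average discomfort.
w = Window.partitionBy("region").orderBy("avg_discomfort")
best_month = (discomfort
    .withColumn("rank", F.row_number().over(w))
    .filter("rank = 1")
    .drop("rank"))
```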
- [0.5 pt] Escaping the fog: After becoming rich from your startup, you are looking for the perfect location to build your Bay Area mansion with unobstructed views. Find the locations that are the least foggy and show them on a map.
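For the map, one quick option is to decode the top cells' Geohashes into coordinates and scatter-plot them. This sketch assumes a `fog`-like feature (substitute the real name) and uses the pygeohash library, though any Geohash decoder works:

```python
import matplotlib.pyplot as plt
import pygeohash as pgh
from pyspark.sql import functions as F

# 'fog' is a placeholder feature name; lower average = clearer views.
clearest = (nam
    .groupBy(F.substring("geohash", 1, 4).alias("gh"))
    .agg(F.avg("fog").alias("avg_fog"))
    .orderBy("avg_fog")
    .limit(50)
    .collect())   # only 50 rows -- safe to bring back to the driver

lats, lons = zip(*[pgh.decode(row["gh"]) for row in clearest])
plt.scatter(lons, lats, c=[row["avg_fog"] for row in clearest])
plt.colorbar(label="average fog")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.show()
```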
- [1 pt] SolarWind, Inc.: After getting rich from your travel startup you get bored and start a new company; here, you want to help power companies plan out the locations of solar and wind farms across North America. Locate the top 3 places for solar and wind farms, as well as a combination of both (solar + wind farm). You will report a total of 9 Geohashes as well as their relevant attributes (for example, cloud cover and wind speeds).
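Since solar and wind favor different conditions, one way to produce all nine rankings is to normalize each per-cell average to [0, 1] and combine the two scores. The feature names `cloud_cover` and `wind_speed` are placeholders:

```python
from pyspark.sql import functions as F

by_cell = (nam
    .groupBy(F.substring("geohash", 1, 4).alias("gh"))
    .agg(F.avg("cloud_cover").alias("clouds"),
         F.avg("wind_speed").alias("wind")))

stats = by_cell.agg(F.min("clouds"), F.max("clouds"),
                    F.min("wind"), F.max("wind")).first()

scored = (by_cell
    # Less cloud cover = better solar; normalize and invert.
    .withColumn("solar_score",
                1 - (F.col("clouds") - stats[0]) / (stats[1] - stats[0]))
    # More wind = better wind farms.
    .withColumn("wind_score",
                (F.col("wind") - stats[2]) / (stats[3] - stats[2]))
    .withColumn("combo_score", F.col("solar_score") + F.col("wind_score")))

scored.orderBy(F.desc("solar_score")).show(3)
scored.orderBy(F.desc("wind_score")).show(3)
scored.orderBy(F.desc("combo_score")).show(3)
```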
- [1 pt] Climate Change: Using two-character geohash aggregates across the entire NAM grid, determine temperature trends over the past 5 years. For the regions that have experienced an increase in temperature, build a correlation matrix using Pearson’s correlation coefficient (PCC) to determine how the variables influence one another. Finally, determine whether or not the correlations are different based on the region (e.g., maybe temperature has increased in lockstep with humidity in one location but not another). Analyze your results: can you draw any conclusions from what you’ve found?
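MLlib can compute the PCC matrix directly once the features are assembled into a vector. In this sketch, `warming_region_df` is a hypothetical DataFrame holding the records for one two-character region, and the feature names are placeholders:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

features = ["temperature_surface", "relative_humidity", "pressure_surface"]

vecs = (VectorAssembler(inputCols=features, outputCol="vec")
        .transform(warming_region_df)   # records for one two-char region
        .select("vec"))

# One PCC matrix per region; repeat per prefix and compare the matrices.
pcc = Correlation.corr(vecs, "vec", "pearson").head()[0]
print(pcc.toArray())
```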
- [2 pt] Weather Station: Write a multi-threaded server (outside of Spark) that reads files from the dataset — one file per thread — and then streams them out on a socket for a Spark streaming context to consume (note: not ALL the files have to be opened at once! :-)). The program should produce records as fast as the network will support, i.e., faster than real time. A minimal sketch of both halves follows this list. Using Spark, consume the streams and then:
- Choose five geographical locations to aggregate. You will filter out any other locations present in the streams.
- Build an online summary of surface temperature, pressure, humidity, precipitation, visibility, and wind speed for the geographical locations you selected.
- Produce a visual overview of these summaries. You have freedom to show the data however you’d like, but the idea here is to give the viewer a high-level summary of the weather in different locations and how it is changing in real time (well, actually faster than real time in this case!). A basic approach could be to show each metric separately on a 5-by-6 grid. Your visualization should either update in place in the Jupyter notebook as data arrives, or you can build a video by exporting each frame of the visualization to a file and then combining them.
- Turn in a video of your weather station in action.
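Here's the promised sketch of both halves. It assumes a single Spark consumer connects, and `GEOHASH_IDX` and `update_summaries` (your aggregation/plotting hook) are hypothetical placeholders for you to fill in:

```python
# --- Server side: plain Python, runs outside of Spark ---
import socket
import threading

send_lock = threading.Lock()

def stream_file(path, conn):
    """Stream one dataset file over the connection; one thread per file."""
    with open(path) as f:
        next(f)  # skip the per-file feature header
        for line in f:
            with send_lock:          # keep whole lines intact on the wire
                conn.sendall(line.encode())

def serve(paths, port=9999):
    with socket.create_server(("0.0.0.0", port)) as server:
        conn, _ = server.accept()    # assumes a single Spark consumer
        threads = [threading.Thread(target=stream_file, args=(p, conn))
                   for p in paths]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

# --- Spark side: consume the stream in your notebook ---
from pyspark.streaming import StreamingContext

GEOHASH_IDX = 0   # hypothetical: index of the Geohash field in a record
CHOSEN = {"9q8y", "dr5r", "9v1z", "c23n", "djds"}  # example prefixes

ssc = StreamingContext(spark.sparkContext, batchDuration=5)
records = (ssc.socketTextStream("localhost", 9999)
           .map(lambda line: line.split("\t"))
           .filter(lambda rec: rec[GEOHASH_IDX][:4] in CHOSEN))

# update_summaries is your own hook: update running aggregates and
# redraw the visualization for each batch.
records.foreachRDD(lambda rdd: update_summaries(rdd))

ssc.start()
ssc.awaitTermination()
```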
- [1 pt] Prediction/Classification: Revisit any of the problems above and enhance them using machine learning models from MLlib (a pipeline sketch follows this list). You will need to explain:
- The feature you will predict/classify
- Features used to train the model
- How you partitioned your data
- How the prediction/classification improves your analysis
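For example, if you revisit the climate questions, a regression pipeline might look like the sketch below. The feature and label names are placeholders, and the 80/20 random split is just one partitioning choice:

```python
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Hypothetical example: predict surface temperature from other features.
assembler = VectorAssembler(
    inputCols=["relative_humidity", "pressure_surface", "wind_speed"],
    outputCol="features")
lr = LinearRegression(featuresCol="features",
                      labelCol="temperature_surface")

train, test = nam.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)

rmse = RegressionEvaluator(labelCol="temperature_surface",
                           metricName="rmse").evaluate(model.transform(test))
print(f"Test RMSE: {rmse:.3f}")
```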
Self-Guided Analysis
Now it’s time to analyze a dataset of your own. Find a dataset online and use Spark to analyze it. You should:
- [0.5 pt] Describe the dataset
- [0.5 pt] Outline the types of insights you hope to gain from it and make hypotheses about what you might find.
- [1 pt] Collect and clean the data for analysis and provide a basic set of summary information about the features in your dataset (a starter sketch follows this list).
- [6 pt] Design at least 3 “questions” (along the lines of those above) and answer them. Remember that presentation matters here.
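For the collection/cleaning step, a quick first pass at summary information might look like this; the path, format, and column handling all depend on the dataset you choose:

```python
from pyspark.sql import functions as F

# Replace the path and format with whatever you collected.
my_df = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("path/to/your/dataset.csv"))

my_df.printSchema()
my_df.describe().show()   # count / mean / stddev / min / max per column
my_df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
              for c in my_df.columns]).show()   # null counts per column
```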
Grading
Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.
Steep deductions will be applied if you do most of your analysis on the driver, i.e., not distributed using Spark!