Project 2: Climate Analysis with MapReduce (v 1.0)
Starter repository on GitHub: https://classroom.github.com/g/avTjFMiD
In this assignment, we’ll analyze a dataset collected from the National Oceanic and Atmospheric Administration’s (NOAA) surface reference network (USCRN). The network is composed of around 150 weather stations based in the USA and is tasked with determining how the US climate has changed (and is changing) over time. For more information, visit the project homepage.
You can find the files in
For more information about the dataset, see the data dictionary page.
You will submit two deliverables for Project 2:
- The code for your MapReduce jobs
- A project report
For this project, you’ll end up producing several small MapReduce jobs. Each of the tasks below can be broken up into several jobs, or you can combine some of them. Some aspects of these questions are left up to your own interpretation; the point of the project is not to stifle creativity with exact, black-and-white answers. In other words, there are often no right or wrong answers, but you should be able to justify your approach.
Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your report or provide some statistics about it (population, nearby landmarks). Pictures and maps are a good thing. Perhaps the question involves some obscure concept; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!). Failure to include necessary contextual information will result in deductions, regardless of whether you found the correct answer.
Finally, remember that a huge part of any analysis job is cleaning the data.
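As a starting point for cleaning, here is a minimal sketch of a field parser you might call from a mapper. It assumes missing readings are encoded with sentinel values such as -9999.0 or -99999.0; check the data dictionary for the exact codes used by each field. The names `MISSING_SENTINELS` and `clean_value` are illustrative only.

```python
import math

# Assumed missing-data sentinels; confirm them against the data dictionary.
MISSING_SENTINELS = {-9999.0, -99999.0}


def clean_value(raw):
    """Return the field as a float, or None if it is missing or unparseable."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if math.isnan(value) or value in MISSING_SENTINELS:
        return None
    return value
```

Returning None (rather than 0) keeps missing readings out of sums, averages, and min/max comparisons later on.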
[1 pt] Extremes: When and where were the hottest and coldest surface and air temperatures observed in the dataset? Are they anomalies? If so, what were the hottest and coldest non-anomalous temperatures?
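One way to structure the extremes job is sketched below in plain Python, with the map and reduce phases written as ordinary functions so you can test the logic before porting it to your MapReduce framework. The record layout and the field names (`station`, `timestamp`, `air_temperature`, `surface_temperature`) are assumptions for illustration, and readings are assumed to be already cleaned (missing values set to None).

```python
from collections import defaultdict

# Hypothetical feature names; substitute the real column names.
FEATURES = ("air_temperature", "surface_temperature")


def map_extremes(record):
    """Emit (feature, (value, station, timestamp)) for each usable reading."""
    for feature in FEATURES:
        value = record.get(feature)
        if value is not None:
            yield feature, (value, record["station"], record["timestamp"])


def reduce_extremes(observations):
    """Pick the hottest and coldest observation for one feature."""
    observations = list(observations)
    return {
        "hottest": max(observations, key=lambda obs: obs[0]),
        "coldest": min(observations, key=lambda obs: obs[0]),
    }


def find_extremes(records):
    """Simulate the shuffle step locally by grouping mapper output by key."""
    grouped = defaultdict(list)
    for record in records:
        for feature, observation in map_extremes(record):
            grouped[feature].append(observation)
    return {feature: reduce_extremes(obs) for feature, obs in grouped.items()}
```

Keeping the station and timestamp alongside each value is what lets you answer "when and where" and then judge whether an extreme looks like an anomaly.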
[1 pt] Drying out: Choose a region in North America (defined by Geohash, which may include several weather stations) and determine which month is its driest. Your answer should include a histogram with data from each month.
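A rough sketch of the aggregation behind this question: filter records by Geohash prefix, then average precipitation per calendar month. The field names (`geohash`, `month`, `precipitation`) are hypothetical, and using the mean per observation as the measure of "dryness" is only one reasonable interpretation; monthly totals would be another.

```python
from collections import defaultdict


def monthly_mean_precipitation(records, geohash_prefix):
    """Average precipitation per calendar month for stations in a region."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        if not record["geohash"].startswith(geohash_prefix):
            continue  # station lies outside the chosen region
        if record["precipitation"] is None:
            continue  # skip missing readings
        totals[record["month"]] += record["precipitation"]
        counts[record["month"]] += 1
    return {month: totals[month] / counts[month] for month in totals}


def driest_month(monthly_means):
    """Return the month with the lowest mean precipitation."""
    return min(monthly_means, key=monthly_means.get)
```

The dictionary returned by `monthly_mean_precipitation` is also the data you would plot in the histogram.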
[2 pt] Moving out: Matthew, a student in your Big Data class, really likes the Bay Area weather but due to financial limitations will never be able to own a house there. Find similarly-sized regions with similar weather patterns so Matthew can move away for good.
- You should consider more than just one or two features from the dataset here, and think carefully about your methodology.
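To make "similar weather patterns" concrete, one possible methodology (not the only one) is to build a feature vector per region from a first MapReduce pass, standardize each feature so that no single unit dominates, and rank regions by Euclidean distance to the target. The sketch below assumes the per-region vectors have already been computed; the function names and the choice of distance metric are illustrative assumptions.

```python
import math


def standardize(region_features):
    """Z-score each feature column across all regions."""
    regions = list(region_features)
    n_features = len(region_features[regions[0]])
    scaled = {region: [] for region in regions}
    for i in range(n_features):
        column = [region_features[region][i] for region in regions]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
        for region in regions:
            value = region_features[region][i]
            # A constant column carries no information, so map it to 0.
            scaled[region].append((value - mean) / std if std else 0.0)
    return scaled


def rank_by_similarity(region_features, target):
    """Return the other regions, most similar to `target` first."""
    scaled = standardize(region_features)
    distances = {
        region: math.dist(scaled[region], scaled[target])
        for region in scaled
        if region != target
    }
    return sorted(distances, key=distances.get)
```

For example, with made-up vectors of (mean temperature, mean humidity, mean precipitation), `rank_by_similarity({"9q8y": [15.0, 70.0, 3.0], "9mud": [16.0, 68.0, 3.2], "dr5r": [5.0, 50.0, 9.0]}, "9q8y")` ranks "9mud" ahead of "dr5r". These numbers are toy values, not results from the dataset. Annual means hide seasonality, so you may want one entry per month per feature instead.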
[2 pt] Travel Startup: After graduating from USF, you found a startup that aims to provide personalized travel itineraries using big data analysis. Given your own personal preferences, build a plan for a year of travel across 5 locations. Or, in other words: pick 5 regions. What is the best time of year to visit them based on the dataset?
- Part of your answer should include the comfort index for a region. There are several different ways of calculating this available online. Note: you don’t need to use this for choosing the regions, though.
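As one example of a comfort index, the sketch below uses a commonly quoted form of Thom's discomfort index, DI = T - 0.55 (1 - 0.01 RH)(T - 14.5), with air temperature T in degrees Celsius and relative humidity RH in percent. This is just one of the formulas you will find online; if you pick a different one, cite it in your report. The function name is illustrative.

```python
def discomfort_index(temp_c, relative_humidity_pct):
    """Thom's discomfort index from air temperature (°C) and humidity (%)."""
    return temp_c - 0.55 * (1 - 0.01 * relative_humidity_pct) * (temp_c - 14.5)
```

You could compute this per observation in a mapper, average it per region and month in a reducer, and then use the monthly values to argue for the best time to visit.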
[2 pt] SolarWind, Inc.: You get bored enjoying the amazing views from your mansion that you bought with the money made with your travel startup, so you start a new company; here, you want to help power companies plan out the locations of solar and wind farms across North America. Locate the top 3 places for solar and wind farms, as well as a combination of both (solar + wind farm). You will report a total of 9 Geohashes as well as their relevant attributes (for example, cloud cover and wind speeds).
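Once a first job has produced one solar score and one wind score per Geohash (for example, mean solar radiation and mean wind speed), the ranking step is small. The sketch below is one possible approach: it assumes such per-Geohash scores already exist, and it scores combined sites by multiplying the min-max normalized values, which is an arbitrary design choice you should justify or replace.

```python
import heapq


def min_max(scores):
    """Rescale scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {key: (value - lo) / span if span else 0.0 for key, value in scores.items()}


def top_sites(solar, wind, n=3):
    """Return the top-n Geohashes for solar, wind, and combined farms."""
    solar_norm, wind_norm = min_max(solar), min_max(wind)
    # The product rewards sites that are good at both, not just one.
    combined = {
        geohash: solar_norm[geohash] * wind_norm[geohash]
        for geohash in solar_norm
        if geohash in wind_norm
    }

    def top(scores):
        return heapq.nlargest(n, scores, key=scores.get)

    return top(solar), top(wind), top(combined)
```

Normalizing before combining matters because the two features are measured in different units.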
[2 pt] Climate Chart: Given a Geohash prefix, create a climate chart for the region. This includes high, low, and average temperatures, as well as monthly average rainfall (precipitation). Here’s a (poor quality) script that will generate this for you.
- Earn up to 1 point of extra credit for enhancing/improving this chart (or porting it to a more feature-rich visualization library)
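The numbers behind a climate chart can be gathered in a single pass if each month carries a small partial aggregate that can be merged associatively, which also makes the logic usable as a combiner. The sketch below is an assumption about how you might organize this; the tuple layout and function names are illustrative, and "average rainfall" is computed here per observation rather than per monthly total.

```python
from functools import reduce


def to_partial(temp, precip):
    """Wrap one observation as (low, high, temp_sum, precip_sum, count)."""
    return (temp, temp, temp, precip, 1)


def merge(a, b):
    """Combine two partial aggregates; safe to apply in any order."""
    return (
        min(a[0], b[0]),
        max(a[1], b[1]),
        a[2] + b[2],
        a[3] + b[3],
        a[4] + b[4],
    )


def finalize(partial):
    """Turn a merged aggregate into the values plotted on the chart."""
    low, high, temp_sum, precip_sum, count = partial
    return {
        "low": low,
        "high": high,
        "avg_temp": temp_sum / count,
        "avg_precip": precip_sum / count,
    }


def summarize(partials):
    """Reduce all partial aggregates for one month into a summary."""
    return finalize(reduce(merge, partials))
```

Carrying the sum and count separately, instead of a running average, is what keeps the merge correct when partial results arrive from different mappers.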
[2 pt] Correlation is not Causation: Determine how features influence each other using Pearson’s correlation coefficient (PCC). The output for this job should include (1) feature pairs sorted by absolute correlation coefficient, and (2) a correlation matrix visualization (heatmaps are a good option).
- Here’s an example heatmap generation script.
Example output for the sorted feature pairs:
temperature_tropopause, friction_velocity_surface 0.99
temperature_tropopause, relative_humidity_zerodegc_isotherm 0.72
... (etc)
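Pearson’s correlation coefficient fits MapReduce well because it can be computed from six running sums per feature pair (n, Σx, Σy, Σx², Σy², Σxy), and those sums can be added together across mappers. The sketch below shows the arithmetic in plain Python; the function names are illustrative, and this textbook form can lose precision when values are large relative to their variance, so treat it as a starting point.

```python
import math


def accumulate(pairs):
    """Build the running sums for one feature pair from (x, y) observations."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return n, sx, sy, sxx, syy, sxy


def pearson_from_sums(n, sx, sy, sxx, syy, sxy):
    """Compute the PCC from running sums; NaN if either feature is constant."""
    covariance = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    denominator = math.sqrt(var_x * var_y)
    return covariance / denominator if denominator else float("nan")


def sort_by_abs(correlations):
    """Sort {(feature_a, feature_b): r} by absolute correlation, strongest first."""
    return sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
```

Sorting by absolute value keeps strong negative correlations near the top of the list, next to strong positive ones.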
[3 pt] Now that you’re familiar with the dataset, it’s time to choose your own adventure. Come up with a question that you will likely be able to answer with climate data, and then implement a MapReduce job (or set of jobs) to answer the question. This question is worth the most points, so it should be more sophisticated than the others. You should describe:
- The question you want to answer or problem you want to solve
- Your hypothesis: without doing any analysis, what is the most likely outcome?
- Features you will use
- A writeup describing the results, including visualizations/plots/etc if applicable. Was your hypothesis correct?
Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.
- 10/17: Version 1.0 posted