CS 677 Big Data

Project 2: Climate Analysis with MapReduce (v 1.0)

Starter repository on GitHub: https://classroom.github.com/g/avTjFMiD

In this assignment, we’ll analyze a dataset collected from the National Oceanic and Atmospheric Administration’s (NOAA) U.S. Climate Reference Network (USCRN). The network is composed of around 150 weather stations based in the USA and is tasked with determining how the US climate has changed (and is changing) over time. For more information, visit the project homepage.

Dataset Location

You can find the files in /bigdata/mmalensek/ncdc on orion11 and orion12.

For more information about the dataset, see the data dictionary page.


You will submit two deliverables for Project 2:

For this project, you’ll end up producing several small MapReduce jobs. Each of the tasks below can be broken up into several jobs, or you can combine some of them. Some aspects of these questions are left up to your own interpretation; the point of the project is not to stifle creativity with exact, black-and-white answers. In other words, there are often no right or wrong answers, but you should be able to justify your approach.
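If it helps to picture what one "small job" looks like, here is a minimal local sketch of the map, shuffle, and reduce phases in plain Python. It is not Hadoop code and not the starter repository's structure; the two-field "station,temperature" input format and the function names map_record, reduce_max, and run_job are assumptions made up for illustration.

```python
from collections import defaultdict


def map_record(line):
    """Mapper: emit (station_id, temperature) for one input line.

    Assumes a hypothetical 'station,temperature' line format.
    """
    station_id, temperature = line.split(",")
    yield station_id, float(temperature)


def reduce_max(station_id, temperatures):
    """Reducer: keep the highest temperature seen for one station."""
    return station_id, max(temperatures)


def run_job(lines):
    """Simulate map -> shuffle -> reduce on a single machine."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reduce_max(key, values) for key, values in groups.items())


sample = ["A,10.5", "B,3.0", "A,12.0"]
result = run_job(sample)
print(result)
```

A question that needs two passes (for example, a per-station maximum followed by a ranking of stations) could be written as two such jobs chained together, or folded into one job with a richer reducer; either choice is fine if you can justify it.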

Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your report or provide some statistics about it (population, nearby landmarks). Pictures and maps are a good thing. Perhaps the question involves some obscure concept; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!). Failure to include necessary contextual information will result in deductions, regardless of whether you found the correct answer.

Finally, remember that a huge part of any analysis job is cleaning the data.
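As a starting point for cleaning, the sketch below turns a raw field into a number or rejects it. The sentinel values in MISSING_SENTINELS are an assumption for illustration only; check the data dictionary for the markers this dataset actually uses. The name clean_value is also hypothetical.

```python
import math

# Assumed placeholder values for "no reading"; confirm against the data dictionary.
MISSING_SENTINELS = {-9999.0, -99.0}


def clean_value(raw):
    """Return the field as a float, or None if it is missing or unparseable."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # empty field, text, or None
    if math.isnan(value) or value in MISSING_SENTINELS:
        return None  # explicit "missing" marker
    return value


readings = ["12.5", "-9999.0", "abc", "nan", "-3.2"]
cleaned = [v for v in map(clean_value, readings) if v is not None]
print(cleaned)
```

In a mapper, a record whose key field cleans to None would typically be skipped rather than emitted, so that a sentinel never ends up inside a minimum, maximum, or average.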


Example report:

temperature_tropopause, friction_velocity_surface    0.99
temperature_tropopause, relative_humidity_zerodegc_isotherm    0.72
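The example lines above look like feature pairs followed by a score. Assuming that score is a Pearson correlation coefficient (the handout does not say so), one way to compute it in a MapReduce setting is to have each mapper emit partial sums that a reducer can add together. The function names summarize, merge, and pearson are hypothetical.

```python
import math


def summarize(pairs):
    """Partial sums for one input split: (n, sum x, sum y, sum x^2, sum y^2, sum xy)."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)


def merge(a, b):
    """Combine two partial summaries; this is what a reducer would do."""
    return tuple(p + q for p, q in zip(a, b))


def pearson(stats):
    """Pearson r from merged sums, or None if either feature has no variance."""
    n, sx, sy, sxx, syy, sxy = stats
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if var_x <= 0 or var_y <= 0:
        return None
    return cov / math.sqrt(var_x * var_y)


split_a = [(1.0, 2.0), (2.0, 4.0)]
split_b = [(3.0, 6.0), (4.0, 8.0)]
r = pearson(merge(summarize(split_a), summarize(split_b)))
print(round(r, 2))
```

The sum-of-products form is convenient because the partial results are just added, but it can lose precision when values are large relative to their spread; a numerically stabler variant may be worth considering for real data.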

Advanced Analysis


Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.