Project 1: Atmospheric Data Analysis (v 1.1)

Starter repository on GitHub: https://classroom.github.com/a/LSVOAoYP

This assignment gives you the opportunity to put your C skills to use. We will be analyzing data from the National Oceanic and Atmospheric Administration (NOAA) North American Mesoscale Forecast System to learn more about the climate in a few different states.

C is a great match for data analysis in at least one respect: speed. When you’re processing millions of lines of data, a compiled C program will typically finish much faster than an equivalent script in an interpreted language.

In this programming assignment, you will get more familiar with:

File I/O
String manipulation routines
Reading basic tab-delimited value (TDV) files
C structs
Dynamic memory allocation
Pointers!

Demo

Here’s a sample run, passing in two test files:

./climate data_tn.tdv data_wa.tdv

Opening file: data_tn.tdv
Opening file: data_wa.tdv
States found: TN WA
-- State: TN --
Number of Records: 17097
Average Humidity: 49.4%
Average Temperature: 58.3F
Max Temperature: 110.4F on Mon Aug  3 11:00:00 2015
Min Temperature: -11.1F on Fri Feb 20 04:00:00 2015
Lightning Strikes: 781
Records with Snow Cover: 107
Average Cloud Cover: 53.0%
-- State: WA --
Number of Records: 48357
Average Humidity: 61.3%
Average Temperature: 52.9F
Max Temperature: 125.7F on Sun Jun 28 17:00:00 2015
Min Temperature: -18.7F on Wed Dec 30 04:00:00 2015
Lightning Strikes: 1190
Records with Snow Cover: 1383
Average Cloud Cover: 54.5%
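
One way to produce per-state numbers like those in the sample run is to keep running totals for each state and update them as each record is read. The sketch below is only an illustration: the struct and function names (state_stats, stats_add, stats_avg_humidity) are made up for this example and are not part of the starter repository, and only the record count, average humidity, and max temperature are tracked.

```c
/* Running totals for one state. All names here are hypothetical. */
struct state_stats {
    unsigned long num_records;
    double humidity_sum;     /* sum of humidity percentages seen so far */
    double max_temp;         /* highest temperature seen so far, in Kelvin */
    long long max_temp_time; /* millisecond timestamp of that reading */
};

/* Folds one observation into the running totals. */
void stats_add(struct state_stats *s, double humidity, double temp_k,
        long long timestamp_ms)
{
    /* The first record always seeds the maximum. */
    if (s->num_records == 0 || temp_k > s->max_temp) {
        s->max_temp = temp_k;
        s->max_temp_time = timestamp_ms;
    }
    s->humidity_sum += humidity;
    s->num_records++;
}

/* Returns the average humidity, or 0.0 if no records were added. */
double stats_avg_humidity(const struct state_stats *s)
{
    if (s->num_records == 0) {
        return 0.0;
    }
    return s->humidity_sum / s->num_records;
}
```

Starting from a zero-initialized struct (struct state_stats s = {0};), each parsed line results in one call to stats_add. The minimum temperature, lightning, snow, and cloud cover totals could follow the same pattern.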

Testing Your Code

There are three data files included to test your code:

data_tn.tdv
data_wa.tdv
data_multi.tdv.gz

data_multi.tdv.gz is compressed to save space. To decompress it, use your favorite archive utility or the command line:

gunzip data_multi.tdv.gz

Each file contains one record per line with fields separated by tab characters (\t). The columns are organized as follows:

TN	1424325600000	dn20t1kz0xrz	67.0	0.0	0.0	0.0	101872.0	262.5665
TN	1422770400000	dn2dcstxsf5b	23.0	0.0	100.0	0.0	100576.0	277.8087
TN	1422792000000	dn2sdp6pbb5b	96.0	0.0	100.0	0.0	100117.0	278.49207
TN	1422748800000	dn2fjteh8e80	6.0	0.0	100.0	0.0	100661.0	278.28485
TN	1423396800000	dn2k0y7ffcup	14.0	0.0	100.0	0.0	100176.0	282.02142

...

Fields:

State code (e.g., CA, TX)
Timestamp (time of observation as a UNIX timestamp, in milliseconds)
Geolocation (geohash string)
Humidity (0 - 100%)
Snow (1 = snow present, 0 = no snow)
Cloud cover (0 - 100%)
Lightning strikes (1 = lightning strike, 0 = no lightning)
Pressure (Pa)
Surface temperature (Kelvin)
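
The field list above maps naturally onto a C struct. Here is a minimal sketch of one possible layout along with a parser built on sscanf. The names climate_record and parse_record are hypothetical, the geohash buffer assumes 12-character geohashes like those in the sample rows, and sscanf is just one option (strtok with strtod would also work).

```c
#include <stdio.h>

/* One observation from a data file. Struct and field names are
 * illustrative, not part of the starter code. */
struct climate_record {
    char state[3];       /* two-letter state code plus '\0' */
    long long timestamp; /* milliseconds since the Unix epoch */
    char geohash[13];    /* assumes 12 characters plus '\0' */
    double humidity;     /* percent, 0 - 100 */
    double snow;         /* 1 = snow present, 0 = no snow */
    double cloud_cover;  /* percent, 0 - 100 */
    double lightning;    /* 1 = lightning strike, 0 = no lightning */
    double pressure;     /* Pa */
    double surface_temp; /* Kelvin */
};

/* Parses one tab-delimited line into rec.
 * Returns 1 if all nine fields were read, 0 otherwise. */
int parse_record(const char *line, struct climate_record *rec)
{
    /* A space in the format matches any whitespace, including tabs. */
    int matched = sscanf(line, "%2s %lld %12s %lf %lf %lf %lf %lf %lf",
            rec->state, &rec->timestamp, rec->geohash,
            &rec->humidity, &rec->snow, &rec->cloud_cover,
            &rec->lightning, &rec->pressure, &rec->surface_temp);
    return matched == 9;
}
```

The timestamp is stored as a long long because values such as 1424325600000 do not fit in a 32-bit int.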

We will also test your programs with other input files. Note: you can assume that each line in the files will contain all the fields. No need to check for malformed files or lines.

Hints and Resources

The dataset contains temperatures in Kelvin rather than degrees Fahrenheit. To convert K to F, you can use the following formula:

deg_f = deg_k * 1.8 - 459.67
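
As a small illustration, the formula can be wrapped in a helper function. The name kelvin_to_fahrenheit is only a suggestion.

```c
/* Converts a temperature in Kelvin to degrees Fahrenheit. */
double kelvin_to_fahrenheit(double deg_k)
{
    return deg_k * 1.8 - 459.67;
}
```

For example, the freezing point of water (273.15 K) converts to 32 degrees F, and the 262.5665 K reading in the first sample row comes out to roughly 12.9F when printed with one decimal place.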

The times the measurements were taken are expressed as Unix timestamps. These can be converted to string form with the ctime function. The timestamps in the data files are in milliseconds, but ctime expects seconds, so you will also need to divide them by 1000:

#include <time.h>

timestamp = timestamp / 1000;
printf("Time: %s", ctime(&timestamp));
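
A slightly fuller sketch of the same idea, with the types spelled out. The helper names ms_to_seconds and print_time are hypothetical, and the code assumes the millisecond timestamp was read into a long long.

```c
#include <stdio.h>
#include <time.h>

/* Converts a millisecond timestamp from the data files to the
 * whole-second time_t value that ctime expects. */
time_t ms_to_seconds(long long timestamp_ms)
{
    return (time_t) (timestamp_ms / 1000);
}

/* Prints "<label>: <date string>". The string returned by ctime
 * already ends with a newline, so none is added here. */
void print_time(const char *label, long long timestamp_ms)
{
    time_t seconds = ms_to_seconds(timestamp_ms);
    printf("%s: %s", label, ctime(&seconds));
}
```

Note that ctime formats the time in the local time zone, so the hour printed for a given timestamp depends on the machine's time zone settings.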

Finally, be careful when determining which C data types to use in your struct. For example, the millisecond timestamps in the data files are too large to fit in a 32-bit int. If you’re wondering what can be stored in different data types, check Wikipedia’s page on C Data Types.

Grading

The grade breakdown for this assignment is:

12pts Correct climate statistics
5pts Error handling (missing files, using perror, etc). Note: you can assume the data files we provide do not have any malformed data or missing fields.
5pts Support for processing multiple files
3pts Function documentation and comments
2pts Code style (no commented out blocks of code, unused variables, inconsistent indentation)
2pts Correct formatting and unit conversions
1pt Program usage message
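
For the error handling and usage message items, a minimal sketch might look like the following. The function names and the exact wording of the usage message are assumptions, not requirements of the assignment.

```c
#include <stdio.h>

/* Prints a usage message to stderr. The wording is only an example. */
void print_usage(const char *program_name)
{
    fprintf(stderr, "Usage: %s tdv_file1 tdv_file2 ... tdv_fileN\n",
            program_name);
}

/* Opens a data file for reading. On failure, reports the reason
 * with perror and returns NULL so the caller can skip the file. */
FILE *open_data_file(const char *path)
{
    printf("Opening file: %s\n", path);
    FILE *file = fopen(path, "r");
    if (file == NULL) {
        perror("fopen");
    }
    return file;
}
```

A main function could call print_usage when no file arguments are given, then loop over the remaining arguments, calling open_data_file on each and closing every successfully opened file with fclose when done.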

Changelog

Initial version posted (2/6)
Added hints and dataset info, project released (2/12)