CS 677 Big Data

Project 3: Social Network Analysis with Spark (v 1.0)

Starter repository on GitHub: https://classroom.github.com/g/xszYnXHN

In this project, we will analyze a large dataset of user comments from the popular news aggregation website Reddit. On Reddit, members submit content including news stories, articles, images, or videos, and can also moderate the site by voting submissions up or down. The site is organized into a multitude of subreddits that specialize in particular types of content or discussion. For example, /r/politics covers US politics, and /r/technology focuses on tech news.

Like content submissions, comments can be voted up or down, and several other metadata items are tracked for each one. You’ll use these features to help you write Spark jobs that filter, aggregate, and glean insights from the dataset. You are encouraged to use Python (and Jupyter Notebooks) for this assignment, and you are given more leeway on using third-party libraries; for instance, since the dataset is in JSON format, you may incorporate a JSON parser into your codebase.
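As a rough illustration of the filter-and-aggregate pattern described above, here is a minimal sketch in plain Python. It assumes each line of the dataset is one JSON object and that comments carry a "subreddit" field; the field names and the sample records are assumptions made up for illustration, so check them against the actual files before relying on them.

```python
import json
from collections import Counter


def parse_comment(line):
    """Parse one line of JSON; return a dict, or None if the line is malformed."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def comments_per_subreddit(lines):
    """Count comments per subreddit.

    The 'subreddit' field name is an assumption about the dataset schema.
    """
    counts = Counter()
    for line in lines:
        record = parse_comment(line)
        if record is not None and "subreddit" in record:
            counts[record["subreddit"]] += 1
    return counts


# Made-up sample lines (not taken from the real dataset).
sample_lines = [
    '{"author": "alice", "subreddit": "politics", "score": 5}',
    '{"author": "bob", "subreddit": "technology", "score": -2}',
    '{"author": "carol", "subreddit": "politics", "score": 1}',
    'not valid json',
]

subreddit_counts = comments_per_subreddit(sample_lines)
print(subreddit_counts.most_common())
```

In a Spark job, the same idea would likely be expressed by passing a function like parse_comment to map on an RDD of text lines, filtering out the None results, and then aggregating by key; the plain-Python version here is only meant to make the logic easy to test on a small sample first.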

Some students are familiar with Reddit while others may not be. As you explore the dataset, feel free to ask questions on Piazza or in your discussion groups. It’s also worth noting that since Reddit originated in the US, the comments and submissions will likely trend towards being US-centric. Additionally, certain demographics may be under- or over-represented in the dataset. You should keep factors such as these in mind as you perform your analysis.

Dataset Location

You can find the files in /bigdata/mmalensek/reddit/ on orion11 and orion12. This dataset was sourced from here.


You will submit two deliverables for Project 3:

As usual, some aspects of these questions are left up to your own interpretation. Occasionally there are no right/wrong answers, but you should be able to justify your approach.

Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your notebook or provide some statistics about it (population, nearby landmarks). Perhaps the question involves some obscure concept or subculture; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!). Failure to include necessary contextual information will result in deductions, regardless of whether you found the correct answer.

One final note: many of the questions below ask you to find a specific user (or group of users). Be wary of bots and other automated (non-human) accounts; perhaps the user that wrote the most comments in a particular time frame was just a bot that posts advertisements. In some cases you will want to ignore these accounts, so finding the top N users could be a better approach than finding the absolute top user.
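The top-N idea above can be sketched as follows. This is a hedged example in plain Python: the looks_like_bot heuristic, the account names it checks, and the sample data are all hypothetical, and a name-based filter is crude (it will miss many bots and may flag some humans), so treat it as a starting point rather than a reliable bot detector.

```python
from collections import Counter


def looks_like_bot(username):
    """Crude, hypothetical heuristic for spotting automated accounts.

    The specific names checked here are assumptions for illustration only.
    """
    name = username.lower()
    return name in ("[deleted]", "automoderator") or name.endswith("bot")


def top_n_users(authors, n=3, skip_bots=True):
    """Return the n most frequent authors as (name, comment_count) pairs."""
    counts = Counter(
        author for author in authors
        if not (skip_bots and looks_like_bot(author))
    )
    return counts.most_common(n)


# Made-up list of comment authors (not taken from the real dataset).
sample_authors = [
    "alice", "AdBot", "AdBot", "AdBot", "bob", "alice", "carol", "[deleted]",
]

top_users = top_n_users(sample_authors, n=2)
print(top_users)
```

Reporting the top N with and without the filter makes it easy to see which accounts were excluded and to justify that choice in your notebook.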



Final Analysis

You’ve had the opportunity to analyze two datasets thus far; now it’s time to analyze a dataset of your own. Find a dataset online and use Spark (or Hadoop) to analyze it. You should:

  1. [0.5 pt] Describe the dataset
  2. [0.5 pt] Outline the types of insights you hope to gain from it
  3. [1 pt] Make hypotheses about what you might find
  4. [4 pt] Design at least 3 “questions” (along the lines of those above) and answer them. Remember that presentation matters here. ML models are a good choice for some datasets; you can describe what you’ll try to predict or classify and outline your experiences with various models.


Hints and Tips

Some hints to remember while you’re analyzing the data:


Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.