Project 3: Social Network Analysis with Spark (v 1.0)

Starter repository on GitHub:

In this project, we will analyze a large dataset of user comments from the popular news aggregation website Reddit. On Reddit, members submit content including news stories, articles, images, or videos, and they help moderate the site by voting submissions up or down. The site is organized into a multitude of subreddits that specialize in particular types of content or discussion. For example, /r/politics covers US politics, and /r/technology focuses on tech news.

Like content submissions, comments can be voted up or down, and several other metadata items are tracked for each one. You’ll use these features to write Spark jobs that filter, aggregate, and glean insights from the dataset. You can use any language that Spark supports for this project, and you are allowed to use third-party libraries (unless they completely trivialize the assignment; check if in doubt).

Some students are familiar with Reddit while others may not be. As you explore the dataset, feel free to ask questions on Campuswire. It’s also worth noting that since Reddit originated in the US, the comments and submissions will likely trend towards being US-centric. Additionally, certain demographics may be under- or over-represented in the dataset. You should keep factors such as these in mind as you perform your analysis.

Dataset Location

You can find the files in /bigdata/mmalensek/data/ on orion05. This dataset was sourced from here.


You will submit two deliverables for Project 3:

Some aspects of these questions are left up to your own interpretation. In some cases there is no single right or wrong answer, but you should be able to justify your approach.

Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your notebook or provide some statistics about it (population, nearby landmarks). Perhaps the question involves some obscure concept or subculture; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!)

One final note: many of the questions below ask you to find a specific user (or group of users). Be wary of bots and other automated (non-human) accounts; for example, the user who wrote the most comments in a particular time frame may just be a bot that posts advertisements. In some cases you will want to ignore these accounts, so finding the top N users could be a better approach than finding the absolute top user.
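To make the top-N idea concrete, here is a minimal sketch in plain Python (not Spark) of counting comments per author while skipping accounts that look automated. Everything specific in it is an assumption made for illustration: the "author" field name, the denylist entries, the "bot" suffix heuristic, and the helper names looks_automated and top_n_authors. Check the actual dataset schema and choose your own bot criteria; a suffix check like this one will produce false positives (a human named "abbot", for instance).

```python
from collections import Counter

# Hypothetical denylist and heuristic; adjust after inspecting the real data.
# "[deleted]" is a placeholder for removed accounts, not a real user.
EXCLUDED_AUTHORS = {"AutoModerator", "[deleted]"}
BOT_SUFFIXES = ("bot",)


def looks_automated(author: str) -> bool:
    """Crude check: explicit denylist, or a name ending in a bot-like suffix."""
    return author in EXCLUDED_AUTHORS or author.lower().endswith(BOT_SUFFIXES)


def top_n_authors(comments, n=10):
    """Return the n most prolific authors as (author, count) pairs,
    ignoring accounts flagged by looks_automated()."""
    counts = Counter(
        c["author"] for c in comments if not looks_automated(c["author"])
    )
    return counts.most_common(n)


# Tiny made-up sample; the "author" key is an assumed field name.
sample = [
    {"author": "alice"}, {"author": "alice"}, {"author": "bob"},
    {"author": "AutoModerator"}, {"author": "AutoModerator"},
    {"author": "AutoModerator"},
    {"author": "ad_bot"}, {"author": "ad_bot"}, {"author": "ad_bot"},
]

print(top_n_authors(sample, n=2))  # [('alice', 2), ('bob', 1)]
```

Without the filter, the two automated accounts would take both top spots in this sample. In a Spark job the same shape would typically be a filter, a group-by on the author column, a count, a descending sort, and a limit of N. Looking at the top N rather than only the first result also lets you spot and discard a bot by eye when no heuristic catches it.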


Comment Analysis I

Comment Analysis II

Personal Dataset Analysis


Hints and Tips

Some hints to remember while you’re analyzing the data:


Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.