Project 3: Social Network Analysis with Spark (v 1.0)
Starter repository on GitHub: https://classroom.github.com/a/XnfnAxlB
In this project, we will analyze a large dataset of user comments from the popular news aggregation website Reddit. On Reddit, members submit content including news stories, articles, images, and videos, and can also moderate the site by voting submissions up or down. The site is organized into a multitude of subreddits that specialize in particular types of content or discussion. For example, /r/politics covers US politics, and /r/technology focuses on tech news.
Similar to the content submissions, comments can also be voted up or down and several other metadata items are tracked. You’ll use these features to help you write Spark jobs that filter, aggregate, and glean insights from the dataset. You can use any language that Spark supports for this project, and you are allowed to use 3rd party libraries (unless they completely trivialize the assignment – check if in doubt).
Some students are familiar with Reddit, while others may not be. As you explore the dataset, feel free to ask questions on Campuswire. It’s also worth noting that since Reddit originated in the US, the comments and submissions will likely trend toward being US-centric. Additionally, certain demographics may be under- or over-represented in the dataset. You should keep factors such as these in mind as you perform your analysis.
Dataset Location
You can find the files in /bigdata/mmalensek/data/ on orion05. This dataset was sourced from here.
Deliverables
You will submit two deliverables for Project 3:
- A Jupyter notebook (or lab report if not using Jupyter) containing your code and answers to the questions below with your thoughts/opinions/analysis.
- A project retrospective
Some aspects of these questions are left up to your own interpretation. In some cases there is no exact right or wrong answer, but you should be able to justify your approach.
Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your notebook or provide some statistics about it (population, nearby landmarks). Perhaps the question involves some obscure concept or subculture; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!)
One final note: many of the questions below ask for you to find a specific user (or group of users). Be wary of bots or automated (non-human) accounts; perhaps the user that wrote the most comments in a particular time frame was just a bot that posts advertisements – in some cases, you will want to ignore these, so finding the top N users could be a better approach than finding the absolute top user.
Warm-up
- [1 pt] Best Post Award: Choose a particular day and determine what the most upvoted comment was. (Include the comment in your report, of course!)
- [1 pt] Top Comments: For the user you found in the previous question, find their three most-upvoted comments overall across the entire dataset.
- [1 pt] Subreddit Growth: How many unique subreddits were there at the beginning and end of 2020? Be sure to explain your approach!
- [1 pt] Busiest Months: Are there any months of the year where users post more frequently? Since reddit will continue to grow, determining this is not always as straightforward as it might seem. You may want to use feature scaling to normalize comments per month before comparing across years.
Comment Analysis I
- [1 pt] Readability: write a function that computes the Gunning Fog Index and Flesch-Kincaid readability (both reading ease and grade level) of user comments. Then:
  - Choose a subreddit and plot the distribution of these scores.
  - Compare the readability of two subreddits of your choosing. Analyze the results.
- [2 pt] Toxicity: write a function to perform sentiment analysis on comments. You can use a library to do this or roll your own. Next, compare the toxicity of two subreddits of your choosing. Analyze the results.
- [2 pt] Targeted Advertising: given a specific user, find out more about them: where they’re from, what things they like/dislike, and other data about their background (think of at least two more things to determine). Note that this should be automated; I should be able to give you a username and you’ll produce a report for them. Provide three sample user reports that we can exploit to create ads they won’t be able to resist.
- [1 pt] COVID-19 Origin: find the first occurrence of a user posting about COVID-19. How many days was this before the lockdown in San Francisco?
- [1 pt] COVID-19 Attitudes: using your sentiment analysis function (or another heuristic you come up with), show how attitudes toward COVID-19 changed over time.
- [2 pt] Trending Topics: assuming you receive Reddit comments as a live stream, build a Spark Streaming application that will consume the stream and determine what topics are trending over a particular window of time. One approach might involve filtering the data and then using reservoir sampling to create the sample.
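To get you started on the readability metrics, here is a plain-Python sketch of the standard formulas. The syllable counter is a rough vowel-group heuristic (real syllabification is harder), so treat its output as approximate; in a Spark job you might apply this function in a map or wrap it in a UDF.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, dropping a
    trailing silent 'e'. Every word counts as at least one syllable."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text):
    """Return Gunning Fog, Flesch Reading Ease, and Flesch-Kincaid
    grade level for a comment, or None if there is nothing to score."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    wps = len(words) / len(sentences)        # words per sentence
    spw = sum(count_syllables(w) for w in words) / len(words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return {
        "gunning_fog": 0.4 * (wps + 100 * complex_words / len(words)),
        "flesch_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "fk_grade": 0.39 * wps + 11.8 * spw - 15.59,
    }
```

Short, simple sentences score high on reading ease and low on grade level, which is a quick sanity check for your implementation.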
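If you take the reservoir sampling route for Trending Topics, the core algorithm (Vitter's Algorithm R) fits in a few lines. This sketch samples uniformly from any iterable of unknown length; in a streaming job you would feed it the filtered comments from each window.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of size k from a stream whose
    length is unknown in advance (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each item ends up in the final sample with probability k/n, which is exactly the uniformity you need before counting topic frequencies over the sample.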
Comment Analysis II
- [2 pt] Now that you’ve found the answers to the questions above, design your own question. It should be sufficiently difficult, and you should be creative! You should start with a question, and then propose a predicted answer or hypothesis before writing a Spark job to answer it. Some ideas:
- Visualization of related features. Your visualization should help tell a story.
- Clustering related users, comments, or subreddits
- Summary statistics: finding mins, maxes, standard deviations, or even correlations between variables to tell us something about a subreddit or multiple subreddits. For example, perhaps users that visit /r/technology also frequently visit /r/android.
- Friend graph: can you link together ‘related’ users based on some shared interest? Maybe several users visit the same collection of subreddits. The PageRank algorithm could come in handy here.
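If you try the friend-graph idea, a minimal power-iteration PageRank is easy to prototype. This is a plain-Python sketch over a toy adjacency dict; at the scale of this dataset you would use GraphFrames or your own Spark implementation instead.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over {node: [outgoing neighbors]}.
    Dangling nodes spread their rank evenly across the graph."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            outs = links.get(n, [])
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling node
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank
```

On a symmetric cycle every node converges to the same rank, while nodes with many inbound links rank higher; both make good sanity checks.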
Personal Dataset Analysis
- [4 pt] Along the lines of the previous question, this time you’ll use Spark to analyze a dataset of your choosing. The scope of your question should reflect the fact that it is worth 4 points, so you are welcome to come up with multiple smaller questions if you’d prefer.
Wrap-Up
- [1 pt] Project retrospective
Hints and Tips
Some hints to remember while you’re analyzing the data:
- The dataset is not censored. We’re all adults here, but don’t put offensive material in your writeup.
- The dataset contains text-based identifiers that allow you to reconstruct threads/replies.
- In some cases, you may want to remove “[deleted]” comments.
- Comments may contain quotes of other users, which you may want to consider as separate from the comment itself. These are prefixed with > (like email replies).
- In some cases, you may be able to get the correct answer without reading every line in the dataset; think about ways you can avoid reading data to speed up your computations.
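For example, the “[deleted]” and quoted-line hints could be handled by one small cleaning function applied early in your pipeline (a sketch; adapt it to however your parser represents comment bodies, e.g. as a Spark map function):

```python
def clean_comment(body):
    """Return the comment text with quoted lines (prefixed '>') removed,
    or None for deleted comments that carry no usable text."""
    if body == "[deleted]":
        return None
    kept = [line for line in body.split("\n")
            if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()
```

Dropping quoted lines before computing readability or sentiment keeps one user's words from being attributed to another.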
Grading
Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.
Changelog
- 11/15: Version 1.0 posted