CS 677 Big Data

Project 2: Social Network Analysis with MapReduce (v 1.1)

Starter repository on GitHub: https://classroom.github.com/a/fvtwVXOb

In this project, we will analyze a large dataset of user comments from the popular news aggregation website Reddit. On Reddit, members submit content including news stories, articles, images, or videos, and also moderate the site by voting submissions up or down. The site is organized into a multitude of subreddits that specialize in particular types of content or discussion. For example, /r/politics covers US politics, and /r/technology focuses on tech news.

Like content submissions, comments can be voted up or down, and several other metadata items are tracked. You’ll use these features to help you write MapReduce jobs that filter, aggregate, and glean insights from the dataset. You must use Java for this assignment, but you are given more leeway on using third-party libraries; for instance, since the dataset is in JSON format, you may incorporate a JSON parser into your codebase. However, libraries that trivialize the assignment (anything that implements MapReduce-related functionality) are not allowed. For plotting or visualizations, you are not required to use Java (matplotlib, R, or even Excel is fine there). If in doubt, ask first.
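To make the filter-and-aggregate idea concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that counts comments per subreddit. It rests on several assumptions: that each input line is one JSON object, that the object has a string field named "subreddit", and that a simple regular expression is an acceptable stand-in for a real JSON parser. None of these are confirmed by this document, so check the dataset before relying on them. The class and method names are hypothetical. In an actual job, the logic in mapLine would live in your Mapper and the summing in your Reducer.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubredditCountSketch {
    // ASSUMPTION: the field is called "subreddit" and holds a plain string.
    // A regex is only a stand-in here; a real JSON parser is more robust.
    private static final Pattern SUBREDDIT =
            Pattern.compile("\"subreddit\"\\s*:\\s*\"([^\"\\\\]*)\"");

    /** "Map" step: extract the subreddit name from one JSON line, if present. */
    static Optional<String> mapLine(String jsonLine) {
        Matcher m = SUBREDDIT.matcher(jsonLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** "Reduce" step: sum the number of comments seen for each subreddit. */
    static Map<String, Long> countBySubreddit(List<String> jsonLines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable output
        for (String line : jsonLines) {
            mapLine(line).ifPresent(sub -> counts.merge(sub, 1L, Long::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample records, not taken from the real dataset.
        List<String> sample = List.of(
                "{\"author\":\"alice\",\"subreddit\":\"politics\",\"score\":3}",
                "{\"author\":\"bob\",\"subreddit\":\"technology\",\"score\":-1}",
                "{\"author\":\"carol\",\"subreddit\":\"politics\",\"score\":10}");
        System.out.println(countBySubreddit(sample)); // prints {politics=2, technology=1}
    }
}
```

Lines without the field are skipped rather than treated as errors, which is usually the safer default when a large dataset may contain malformed records.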

Some students are familiar with Reddit while others may not be. As you explore the dataset, feel free to ask questions on Piazza or in your discussion groups. It’s also worth noting that since Reddit originated in the US, the comments and submissions will likely trend towards being US-centric. Additionally, certain demographics may be under- or over-represented by the dataset. You should keep factors such as these in mind as you perform your analysis.

Dataset Location

You can find the files in /bigdata/mmalensek/data/ on orion01. This dataset was sourced from here.

Deliverables

You will submit three deliverables for Project 2:

You will likely produce several small MapReduce jobs in this project. Each of the tasks below can be broken up into several jobs, or you can combine some of them. However, I should be able to run your code in “one shot,” i.e., if I want the answer to a certain question, I should be able to launch the appropriate job via YARN and see its results without needing to run other stages or scripts.

As usual, some aspects of these questions are left up to your own interpretation. Occasionally there are no right/wrong answers, but you should be able to justify your approach.

Important: many of the questions are best answered with context. Think of it this way: if I ask you for a location, you may want to embed a map in your notebook or provide some statistics about it (population, nearby landmarks). Perhaps the question involves some obscure concept or subculture; in that case, a link to the appropriate Wikipedia article is useful. Combining different forms of media through data fusion can tell a compelling story (…just make sure the story isn’t misleading!). Please also include how long each MapReduce job ran. Failure to include necessary contextual information will result in deductions, regardless of whether you found the correct answer.
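One way to capture job run times for the report is to time the blocking call in your driver. The following is a minimal sketch using only the Java standard library; the class and method names are hypothetical. It assumes your driver waits for the job to finish (for example, through Hadoop's Job.waitForCompletion), so that wall-clock time around that call approximates the job's duration.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

public class JobTimerSketch {
    /** Formats a duration as minutes and seconds, e.g. "2m 05.250s". */
    static String format(Duration d) {
        return String.format("%dm %02d.%03ds",
                d.toMinutes(), d.toSecondsPart(), d.toMillisPart());
    }

    /** Runs the given work, prints how long it took, and returns its result. */
    static <T> T timed(String label, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            Duration elapsed = Duration.ofNanos(System.nanoTime() - start);
            System.out.println(label + " took " + format(elapsed));
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload; in a real driver this lambda would wrap the
        // call that blocks until the MapReduce job completes.
        boolean ok = timed("demo job", () -> {
            Thread.sleep(50);
            return true;
        });
        System.out.println("succeeded: " + ok);
    }
}
```

Wall-clock time measured this way includes scheduling and queueing delays on the cluster, so it may differ from the per-task times reported in the job logs.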

One final note: many of the questions below ask for you to find a specific user (or group of users). Be wary of bots or automated (non-human) accounts; perhaps the user that wrote the most comments in a particular time frame was just a bot that posts advertisements – in some cases, you will want to ignore these, so finding the top N users could be a better approach than finding the absolute top user.
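The top-N idea can be sketched with a bounded min-heap, which keeps only the N largest entries seen so far. This is a minimal illustration using the Java standard library, not a required design. The class name, the method name, and the exclusion list (including the sample account names) are assumptions made for the example; how you identify bots in the real dataset is up to you.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

public class TopUsersSketch {
    /**
     * Returns up to n users with the highest counts, largest first,
     * skipping any user named in the excluded set.
     */
    static List<Map.Entry<String, Long>> topN(
            Map<String, Long> counts, int n, Set<String> excluded) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        Comparator<Map.Entry<String, Long>> byCount = Map.Entry.comparingByValue();
        // Min-heap: the smallest of the current top n sits at the head.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (excluded.contains(e.getKey())) {
                continue; // ASSUMPTION: known bot accounts are listed by name
            }
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest so only n entries remain
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        Map<String, Long> counts = Map.of(
                "AutoModerator", 900L, "alice", 40L, "bob", 25L, "carol", 10L);
        System.out.println(topN(counts, 2, Set.of("AutoModerator")));
    }
}
```

Because the heap never holds more than N entries, memory use stays small even when the number of distinct users is large. The order of users with equal counts is not specified in this sketch.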

Warm-up

Analysis

Wrap-Up

Hints and Tips

Some hints to remember while you’re analyzing the data:

Here are some useful commands for working with HDFS and Hadoop.

# Create a directory for datasets:
hdfs dfs -mkdir "/datasets"

# Store a file there:
hdfs dfs -put ./gutenbooks.txt /datasets

# View the file(s) in the dataset:
hdfs dfs -ls /datasets

# Create a directory for project outputs:
hdfs dfs -mkdir "/sentiment_output"

# Launch MapReduce application
yarn jar \
    ./target/P2-1.0.jar \
    edu.usfca.cs.mr.wordcount.WordCountJob \
    /datasets/gutenbooks.txt \
    /output_job_1

# View active MapReduce applications, determine your applicationId
yarn application -list

# View an interactive list of jobs:
yarn top

# After your job completes, view its logs:
yarn logs -applicationId <app_id_here>

# Killing an application:
yarn application -kill <app_id_here>

# Deleting an output directory:
hdfs dfs -rm -r -skipTrash /tmp/yourname/output_123

Grading

Your grade is largely based on the quality of your report. I will deduct points if you violate any of the requirements listed in this document. I may also deduct points for poor design and/or formatting; please use good development practices, break your code into separate classes based on functionality, and include comments in your source where appropriate.

Changelog