Getting Started with Hadoop

We’ll assume your HDFS + YARN + MapReduce setup is complete.

Remember: the ResourceManager has to be started from the node that will run it, so ssh to that node before running start-yarn.sh.
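
For example (the hostname here is hypothetical; use whichever node runs your ResourceManager):

ssh orion01        # hypothetical ResourceManager node
start-yarn.sh      # starts the ResourceManager here and the NodeManagers on the worker nodes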

Copying Data In

You can view the dataset in /bigdata/mmalensek/twitter/. HDFS will let you store directories, so you could do something like:

hdfs dfs -put /bigdata/mmalensek/data /

to copy it into your HDFS root (/).
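
Once the copy finishes, you can sanity-check what landed in HDFS (paths match the example above):

hdfs dfs -ls /data       # list the files that were copied in
hdfs dfs -du -h /data    # check the total size, human-readable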

If you want to inspect the dataset up close, copy a few of the files to your computer. I would also highly recommend running local MapReduce jobs to test your code; waiting for a distributed job to run only to have it crash with a NullPointerException (or something along those lines) is very frustrating.
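
As a rough sketch, with a local Hadoop install you can run your job jar directly against files on your own machine. Local mode is usually the default on a standalone install, and the -D option below (only honored if your driver uses ToolRunner) just makes that explicit; the paths are hypothetical:

hadoop jar target/P2-1.0.jar edu.usfca.cs.mr.wordcount.WordCountJob -D mapreduce.framework.name=local ~/samples/RC_2016_01 ~/samples/out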

Getting Familiar with the Data

Assuming you have a few of the data files on your own computer, you can do some basic analysis with tools like grep:

grep '"subreddit":"politics"' RC_2016_01 | grep trump

This isn’t perfect, but it’s a quick way to do some initial analysis.
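
The same idea extends to quick counts; for example, to count the politics comments that mention “trump” (case-insensitive):

grep '"subreddit":"politics"' RC_2016_01 | grep -ci trump    # count matching lines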

You could also open them with a JSON viewer, but remember that the files are very large and will probably take a long time to load. If you open a file in vim, pressing Ctrl+C interrupts the load so you can see the first lines of the file right away.
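
Since each comment is a single JSON record on its own line (which is what the grep example above relies on), a lighter-weight option is to pretty-print just the first record, assuming Python is available:

head -n 1 RC_2016_01 | python -m json.tool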

Running Your Job

You can run the WordCount example on the test files; a couple of tips follow below.

To run the job:

yarn jar ./target/P2-1.0.jar edu.usfca.cs.mr.wordcount.WordCountJob /data/2012/RC_2012_04 /outputs/01

Note: The paths you are specifying are located within HDFS by default (since we are operating in clustered mode). If you set up a local install of Hadoop on your own system, you can specify local paths instead.

The job will run for a while and report progress. Use yarn top to get a general overview of what’s going on.
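
You can also list your applications and check on a specific one from the command line:

yarn application -list                       # show running applications and their IDs
yarn application -status <application-id>    # details for one application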

When the job finishes, you can list its output files (assuming the paths above) with:

hdfs dfs -ls /outputs/01

Depending on the job, you’ll see a number of output files there. In the following job, I had 12 reducers:

-rw-r--r--   1 mmalensek supergroup          0 2018-10-24 23:28 /outputs/01/_SUCCESS
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00000
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00001
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00002
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00003
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00004
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00005
-rw-r--r--   1 mmalensek supergroup         23 2018-10-24 23:28 /outputs/01/part-r-00006
-rw-r--r--   1 mmalensek supergroup         23 2018-10-24 23:28 /outputs/01/part-r-00007
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00008
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00009
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00010
-rw-r--r--   1 mmalensek supergroup         22 2018-10-24 23:28 /outputs/01/part-r-00011
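
The number of part-r-* files matches the number of reducers. If your driver uses ToolRunner, one way to change it is on the command line (shown here with 4 reducers as an example; the output path is also just an example and must not already exist), or you can call job.setNumReduceTasks() in your driver:

yarn jar ./target/P2-1.0.jar edu.usfca.cs.mr.wordcount.WordCountJob -D mapreduce.job.reduces=4 /data/2012/RC_2012_04 /outputs/02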

To view one of the output files, use hdfs dfs -cat /outputs/01/<file>. The contents will be written to your terminal!
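
If the file is big, pipe it through head so it doesn’t flood your terminal:

hdfs dfs -cat /outputs/01/part-r-00000 | head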

As long as these aren’t too huge, you can merge them back together and pull them out of HDFS in one step:

hadoop fs -getmerge /outputs/01/ ./merged_output.txt

Notice how the second path is local, not inside HDFS.
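
From there you can use normal command-line tools on the merged file. For instance, assuming the usual word-count output format (word, tab, count), this shows the highest counts first:

sort -k2 -rn merged_output.txt | head    # sort numerically by the count column, largest first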