Getting Started with Hadoop
We’ll assume your HDFS + YARN + MapReduce setup is complete.
Remember: you need to start the ResourceManager from the node itself, so ssh there before running start-yarn.sh.
Copying Data In
You can view the dataset in /bigdata/mmalensek/twitter/. HDFS will let you store directories, so you could do something like:
hdfs dfs -put /bigdata/mmalensek/data /
to copy it into your HDFS root (/).
If you want to inspect the dataset up close, copy a few of the files to your computer. I would also highly recommend running local MapReduce jobs to test your code; waiting for a distributed job to run, only to have it crash with a NullPointerException or something along those lines, will be very frustrating.
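As a quick sanity check before going distributed, you can approximate what a word-count job should produce using plain Unix tools on a small local sample. This is a sketch with a fabricated sample file, not part of the actual assignment workflow:

```shell
# Create a tiny stand-in for a slice of the real data.
printf 'the quick fox\nthe lazy dog\n' > sample.txt

# Approximate a word-count MapReduce job: split into words (map),
# sort to group identical words (shuffle), count each group (reduce).
tr -s ' ' '\n' < sample.txt | sort | uniq -c | sort -rn
```

Comparing output like this against your job's results on the same sample is a cheap way to catch logic bugs before a long cluster run.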
Getting Familiar with the Data
Assuming you have a few of the data files on your own computer, you can do some basic analysis with tools like grep:
grep '"subreddit":"politics"' RC_2016_01 | grep trump
This isn’t perfect, but serves as a way to do initial analysis quickly.
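For instance, to get a rough count of matching records instead of eyeballing the output, add grep -c. This sketch fabricates a couple of JSON-lines records in the same shape as the data; the file name is made up:

```shell
# Fabricate a few records shaped like the real data.
printf '%s\n' \
  '{"subreddit":"politics","body":"trump said something"}' \
  '{"subreddit":"funny","body":"a joke"}' \
  '{"subreddit":"politics","body":"an update"}' > RC_sample

# Count how many politics records mention a keyword.
grep '"subreddit":"politics"' RC_sample | grep -c trump
```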
You could also open them with a JSON viewer, but remember that each file is very large and will probably take a long time to load. If you open a file in vim, you can press Ctrl+C to interrupt the load and see the first lines of the file right away.
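Another option that avoids loading the whole file is to grab just the first record with head and pretty-print it. This is a sketch on a fabricated one-record file; substitute one of the real data files:

```shell
# Fabricate a one-record stand-in for a real data file.
echo '{"subreddit":"politics","body":"example"}' > RC_tiny

# Pretty-print the first JSON record without reading the rest of the file.
head -n 1 RC_tiny | python3 -m json.tool
```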
Running Your Job
You can run the WordCount example on the test files. A couple tips here:
- The files don’t have to be extracted before being used as inputs. Hadoop takes care of decompressing them automatically.
- You can specify a whole directory of files, or a single file. Either way, the relevant input chunks will be processed by your job.
To run the job:
yarn jar ./target/P2-1.0.jar edu.usfca.cs.mr.wordcount.WordCountJob /data/2012/RC_2012_04 /outputs/01
Note: The paths you are specifying are located within HDFS by default (since we are operating in clustered mode). If you set up a local install of Hadoop on your own system, you can specify local paths instead.
The job will run for a while and report progress. Use yarn top to get a general overview of what’s going on.
When the job is over, you can list its output files (assuming the paths above) with:
hdfs dfs -ls /outputs/01
Depending on the job, you’ll see a number of output files there. For the job below, I had 12 reducers, so there is one part-r file per reducer:
-rw-r--r-- 1 mmalensek supergroup 0 2018-10-24 23:28 /outputs/01/_SUCCESS
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00000
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00001
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00002
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00003
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00004
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00005
-rw-r--r-- 1 mmalensek supergroup 23 2018-10-24 23:28 /outputs/01/part-r-00006
-rw-r--r-- 1 mmalensek supergroup 23 2018-10-24 23:28 /outputs/01/part-r-00007
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00008
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00009
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00010
-rw-r--r-- 1 mmalensek supergroup 22 2018-10-24 23:28 /outputs/01/part-r-00011
To view one of the output files, use hdfs dfs -cat /outputs/01/<file>. The contents will be written to your terminal!
As long as these aren’t too huge, you can merge them back together and pull them out of HDFS in one step:
hadoop fs -getmerge /outputs/01/ ./merged_output.txt
Notice how the second path is local, not inside HDFS.
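Once the merged file is local, ordinary Unix tools work on it again. For example, here is one way to find the highest-count words; this sketch fabricates merged_output.txt in the tab-separated key/count format that Hadoop’s default TextOutputFormat writes:

```shell
# Fabricate a small merged word-count file (word<TAB>count per line).
printf 'apple\t3\nbanana\t10\ncherry\t7\n' > merged_output.txt

# Sort numerically on the count column, largest first, and show the top 2.
sort -k2,2 -rn merged_output.txt | head -n 2
```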