Setting up Apache Hadoop
Hadoop is a software platform designed for large-scale computation and data storage. The modern Hadoop ecosystem is made up of several components, including:
- Hadoop MapReduce - Distributed computation
- HDFS (Hadoop Distributed File System) - Distributed storage
- YARN (Yet Another Resource Negotiator) - Cluster management and job scheduling
Big Data projects such as HBase, Spark, Hive, and Storm often build on or extend these components.
Prerequisites
Getting the Software: Orion Machines
Hadoop 3.3 is available in /bigdata/hadoop-3.3.4. You won’t need to copy/change any of the files there; we will set up a local configuration in your home directory.
Optional: Personal Installation
If you also want to run Hadoop on your own computer, you can download the binary distribution. Extract it somewhere and use that location as your HADOOP_HOME.
Background: HDFS
First, let’s focus on the distributed file system: HDFS.
Components: HDFS
HDFS has two primary components:
- Datanode: manages file system blocks, chunks of data that are usually 128 MB in size. Large files (e.g., a 10 GB CSV) may span multiple blocks and be stored across several datanodes.
- Namenode: maintains file system metadata (the namespace). This includes the file system tree and the corresponding datanodes that hold the blocks associated with each file.
The namenode is a single point of failure: if it is lost, then there is no way of knowing how to reconstruct files from the blocks spread across the datanodes. To mitigate this risk, it is possible to run a hot standby namenode for high availability depending on the fault tolerance requirements of the deployment. In our case, we can get by without high availability.
There is also a secondary namenode. This node is responsible for merging live file system modifications with the on-disk persistent image of the file system (‘fsimage’). The secondary can be located on the same machine as the primary, but it’s recommended to have it on a different machine since it has similar resource requirements. Note that the secondary namenode does not provide high availability! Rather, it periodically checkpoints the file system image by merging in the live modifications.
File Retrieval Workflow
When you read a file in HDFS, the following steps occur:
- Contact namenode with file name
- Namenode responds with block locations (a list of datanodes)
- For each datanode in the list, request relevant blocks
- Reconstruct the file on the client side by stitching together incoming blocks
We will be operating HDFS in a clustered environment, where files are distributed (and replicated) across several machines.
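This entire workflow is handled transparently by the HDFS client. Once your cluster is up and running (we set it up later in this guide), retrieving a file is a single command; the file name here is just a placeholder:
# Copy a file out of HDFS to the local file system. Behind the scenes, the
# client contacts the namenode, fetches the blocks from the relevant
# datanodes, and reassembles the file locally:
$ hdfs dfs -get /some_file.txt ./some_file.txt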
Background: YARN and MapReduce
With an operational distributed file system, we can run applications that read/analyze the data (generally with data locality, i.e., tasks will be colocated with their relevant data blocks).
Components: YARN
Previous versions of Hadoop were tightly coupled with the MapReduce computing paradigm. YARN decouples job management and scheduling from the MapReduce framework and allows for more flexibility. YARN is composed of the following components:
- ResourceManager: manages the computing resources (nodes) in the cluster
- NodeManager: manages a particular node (host/server)
- Container: a resource allocation request (CPU cores, memory) on a node
- Task: a process that runs in a container
So under YARN, MapReduce is just one type of task that can run.
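Once YARN is up and running (we start it later in this guide), you can see these components from the command line. For instance, the following lists the NodeManagers that have registered with the ResourceManager, along with the containers running on each:
# List all nodes (NodeManagers) known to the ResourceManager:
$ yarn node -list -all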
Computation Workflow
When you run a MapReduce application, the following steps occur:
- The application asks for resources from the ResourceManager, and an ApplicationMaster is launched in a container
- The ApplicationMaster requests additional resources needed for execution
- Tasks execute
- The ApplicationMaster shuts down, followed by the client application shutting down
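You can watch this workflow from the command line while a job is running. A quick sketch (the application ID is a placeholder reported by the first command, and viewing logs this way assumes log aggregation is enabled in your configuration):
# List applications currently known to the ResourceManager:
$ yarn application -list
# After an application finishes, view its aggregated container logs:
$ yarn logs -applicationId <application-id>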
Cluster Setup
The official documentation outlines setting up a Hadoop cluster. We will walk through a simplified version of the setup in this guide, but if you run into trouble or want to delve deeper into the configuration, use the following links as a starting point.
IMPORTANT: Make sure you are using the correct documentation for your version of Hadoop.
- Single-node setup: http://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/SingleCluster.html
- Cluster setup: http://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/ClusterSetup.html
Environment Setup
We need to set up environment variables for the various Hadoop components to function correctly. Edit your ~/.bashrc (or .zshenv if you are a zsh user) and add the following:
# Use the latest version of Java:
export JAVA_HOME="/usr/lib/jvm/java"
# Location of the Hadoop installation:
export HADOOP_HOME="/bigdata/hadoop-3.3.4"
# Location of our local configuration (more on this later):
export HADOOP_CONF_DIR="${HOME}/hadoop-config"
# Where Hadoop should store log files:
export HADOOP_LOG_DIR="/bigdata/students/$(whoami)/logs"
# In our configuration, these are just aliases for $HADOOP_HOME:
export HADOOP_MAPRED_HOME="${HADOOP_HOME}"
export YARN_HOME="${HADOOP_HOME}"
# Add Hadoop binaries to the user PATH environment variable
export PATH="${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin"
# Add the location of native libraries (not required, for better performance):
export LD_LIBRARY_PATH="${HADOOP_HOME}/lib/native:${LD_LIBRARY_PATH}"
Note that the default USF configuration sources .bashrc from .bash_profile. If this isn’t the case with your setup, you may need to add it (source ~/.bashrc inside your .bash_profile).
Log out and back in (or source ~/.bashrc). Now you should be able to run Hadoop commands such as hdfs and yarn.
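A quick way to verify the environment is to make sure the binaries resolve to the shared installation and print version information:
# Both commands should resolve to binaries under /bigdata/hadoop-3.3.4:
$ which hdfs yarn
# Print the Hadoop version and build details:
$ hadoop version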
Configuration
The default configuration files are found in the etc/hadoop directory of the Hadoop installation:
$ ls $HADOOP_HOME/etc/hadoop
capacity-scheduler.xml httpfs-log4j.properties mapred-site.xml
configuration.xsl httpfs-signature.secret shellprofile.d
container-executor.cfg httpfs-site.xml ssl-client.xml.example
core-site.xml kms-acls.xml ssl-server.xml.example
hadoop-env.cmd kms-env.sh user_ec_policies.xml.template
hadoop-env.sh kms-log4j.properties workers
hadoop-metrics2.properties kms-site.xml yarn-env.cmd
hadoop-policy.xml log4j.properties yarn-env.sh
hadoop-user-functions.sh.example mapred-env.cmd yarnservice-log4j.properties
hdfs-site.xml mapred-env.sh yarn-site.xml
httpfs-env.sh mapred-queues.xml.template
We need to customize the installation, so let’s make a copy of these files in our home directory. You might remember that we set the location of HADOOP_CONF_DIR to $HOME/hadoop-config, so that’s where the files will be copied:
$ mkdir -v $HADOOP_CONF_DIR
$ cp -rv $HADOOP_HOME/etc/hadoop/* $HADOOP_CONF_DIR
# Since we're installing on a non-Windows platform, let's remove the .cmd
# files (not required):
$ rm $HADOOP_CONF_DIR/*.cmd
At this point, we just have a few steps left:
- Creating a list of workers (servers that will participate in our cluster)
- Deciding which servers will host the various Hadoop components
- Deciding what ports the components will run on (since everyone will be installing their own copy of Hadoop…)
Generating the Worker List
We will reserve orion11 and orion12 for NameNodes and ResourceManagers, with the remaining 10 machines executing tasks through NodeManager instances. To generate a worker list, we can do the following:
$ cd $HADOOP_CONF_DIR
$ for i in {01..10}; do echo "orion${i}"; done > workers
$ cat workers
orion01
orion02
orion03
orion04
orion05
orion06
orion07
orion08
orion09
orion10
Creating our Configuration
Use the starter files to create your Hadoop configuration. It is important that you run your Hadoop/HDFS components on unique ports; otherwise, you’ll have conflicts and components will fail to run properly. See the port assignment list.
Note: you can copy these starter files over your base configuration or use them as a guide.
# Go home first:
$ cd
# Download the starter files and untar:
$ wget 'https://www.cs.usfca.edu/~mmalensek/cs677/schedule/materials/starter-config.tar.gz'
$ tar xvf starter-config.tar.gz
Within the starter files, you will need to edit the following (you can find the relevant keys with the grep command if you’d like; see the example after the list below), or use the configure.sh script:
- port – any XML keys with ‘port’ inside need to be updated with a unique port (from those assigned to you).
- namenode – hostname of your NameNode
- secondary-nn – hostname of your secondary NameNode
- resourcemanager – hostname of your ResourceManager
- username – your CS username. You can find this by running the whoami command.
Note that any hostname with 0.0.0.0 means that the component in question will be run on a worker and bind to the local host address. Don’t add a hostname here!
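If you’d like to locate these keys by hand instead of (or before) running configure.sh, grep makes this easy. A rough sketch, run from wherever the starter XML files were extracted:
# Show every line mentioning a port so you can substitute your assigned ports:
$ grep -n 'port' *.xml
# Show the placeholder hostnames and username that need to be filled in:
$ grep -n 'namenode\|secondary-nn\|resourcemanager\|username' *.xml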
The configure.sh script will make these changes for you; all you need to supply are the hostnames of the NameNode, Secondary NameNode, and ResourceManager. After running the script, check over the files to make sure they were configured correctly and then copy them into your $HADOOP_CONF_DIR:
$ cp -v *.xml $HADOOP_CONF_DIR
Formatting the NameNode
Before we can start HDFS, we’ll need to format the NameNode. IMPORTANT: this must be done on the NameNode host itself, not remotely from another machine. So you will need to ssh <namenode-hostname> first:
# Note: this might be orion12 if you used it for the NameNode:
$ ssh orion11
$ hdfs namenode -format
If your HDFS installation breaks, do not simply follow these steps again and reformat the NameNode; doing so will change the cluster ID, and none of your DataNodes will be able to start up. Instead, inspect the logs to determine what is wrong.
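For example, if the NameNode or a DataNode refuses to start, check the end of its most recent log file. The exact file names depend on the daemon and host, so adjust the glob as needed:
# Logs are written to the directory we set in HADOOP_LOG_DIR earlier:
$ ls $HADOOP_LOG_DIR
# Check the tail of the NameNode log for recent error messages:
$ tail -n 50 $HADOOP_LOG_DIR/*namenode*.log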
Testing the Distributed File System
Let’s start the DFS and see how much storage space we have.
$ start-dfs.sh
Starting namenodes on [orion11]
Starting datanodes
Starting secondary namenodes [orion12]
$ hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://orion11:20000 65.5 T 6.1 G 63.9 T 0%
Around 65 TB. Not too bad! Note that this will change depending on the underlying FS usage, so you shouldn’t expect to get the exact same results here. Now let’s store a file in the file system:
# This will store './some_file.txt' into the root of your DFS:
$ hdfs dfs -put ./some_file.txt /
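You can verify that the file made it by listing the root of the DFS:
# some_file.txt should show up in the listing:
$ hdfs dfs -ls /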
Testing a YARN Job
Now we need to start YARN and run a test job. ssh to the ResourceManager before running start-yarn.sh.
$ start-yarn.sh
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar \
wordcount /file-to-count.txt /output-dir
The paths here refer to locations in HDFS. NOTE: if you’d like to use local files (i.e., files that are not stored in HDFS) as inputs or outputs, you must prefix them with file:// and give the complete path to their location:
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar \
wordcount file:///path/to/file/file-to-count.txt file:///path/to/output/output-dir
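When the job completes, the results land in the output directory as one file per reducer. Assuming the HDFS invocation above, something like the following will display the word counts:
# A _SUCCESS marker and part-r-* files should be present:
$ hdfs dfs -ls /output-dir
# Print the word counts produced by the reducers:
$ hdfs dfs -cat /output-dir/part-r-*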
The next step is to store some large files in HDFS and verify that they are being chunked correctly and dispersed across the machines.
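The hdfs fsck utility is handy here; it reports how a file was split into blocks and which datanodes hold each block (and its replicas). A sketch with a placeholder file name:
# Put a reasonably large file (several hundred MB or more) into HDFS:
$ hdfs dfs -put ./large_file.bin /
# Report the blocks that make up the file and the datanodes storing them:
$ hdfs fsck /large_file.bin -files -blocks -locations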