Setting up Apache Hadoop
Hadoop is a software platform designed for large-scale computation and data storage. The modern Hadoop ecosystem is made up of several components, including:
- Hadoop MapReduce - Distributed computation
- HDFS (Hadoop Distributed File System) - Distributed storage
- YARN (Yet Another Resource Negotiator) - Cluster management and job scheduling
Big Data projects such as HBase, Spark, Hive, and Storm often build on or extend these components.
Prerequisites
Getting the Software: Orion Machines
Hadoop 3.3 is available in /bigdata/hadoop-3.3.4. You won’t need to copy/change any of the files there; we will set up a local configuration in your home directory.
Optional: Personal Installation
If you also want to run Hadoop on your own computer, you can download the binary distribution. Extract it somewhere and use that location as your HADOOP_HOME.
Background: HDFS
First, let’s focus on the distributed file system: HDFS.
Components: HDFS
HDFS has two primary components:
- Datanode: manages file system blocks, chunks of data that are usually 128 MB in size. Large files (e.g., a 10 GB CSV) may span multiple blocks and be stored across several datanodes.
- Namenode: maintains file system metadata (the namespace). This includes the file system tree and the corresponding datanodes that hold the blocks associated with each file.
The namenode is a single point of failure: if it is lost, then there is no way of knowing how to reconstruct files from the blocks spread across the datanodes. To mitigate this risk, it is possible to run a hot standby namenode for high availability depending on the fault tolerance requirements of the deployment. In our case, we can get by without high availability.
There is also a secondary namenode. This node is responsible for merging live file system modifications with the on-disk persistent image of the file system (‘fsimage’). The secondary can be located on the same machine as the primary, but it’s recommended to have it on a different machine since it has similar resource requirements. Note that the secondary namenode does not provide high availability! Rather, it periodically checkpoints the file system image by merging in the live modifications.
File Retrieval Workflow
When you read a file in HDFS, the following steps occur:
- Contact namenode with file name
- Namenode responds with block locations (a list of datanodes)
- For each datanode in the list, request relevant blocks
- Reconstruct the file on the client side by stitching together incoming blocks
We will be operating HDFS in a clustered environment, where files are distributed (and replicated) across several machines.
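This entire workflow is handled transparently by the HDFS client. Once your cluster is up and running (we set it up later in this guide), retrieving a file is a single command; the file name here is just a placeholder:
# Copy a file out of HDFS to the local file system. Behind the scenes, the
# client contacts the namenode, fetches the blocks from the relevant
# datanodes, and reassembles the file locally:
$ hdfs dfs -get /some_file.txt ./some_file.txt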
Background: YARN and MapReduce
With an operational distributed file system, we can run applications that read/analyze the data (generally with data locality, i.e., tasks will be colocated with their relevant data blocks).
Components: YARN
Previous versions of Hadoop were tightly coupled with the MapReduce computing paradigm. YARN decouples job management and scheduling from the MapReduce framework and allows for more flexibility. YARN is composed of the following components:
- ResourceManager: manages the computing resources (nodes) in the cluster
- NodeManager: manages a particular node (host/server)
- Container: a resource allocation request (CPU cores, memory) on a node
- Task: a process that runs in a container
So under YARN, MapReduce is just one type of task that can run.
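Once YARN is up and running (we start it later in this guide), you can see these components from the command line. For instance, the following lists the NodeManagers that have registered with the ResourceManager, along with the containers running on each:
# List all nodes (NodeManagers) known to the ResourceManager:
$ yarn node -list -all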
Computation Workflow
When you run a MapReduce application, the following steps occur:
- The application asks for resources from the ResourceManager, and an ApplicationMaster is launched in a container
- The ApplicationMaster requests additional resources needed for execution
- Tasks execute
- The ApplicationMaster shuts down, followed by the client application shutting down
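You can watch this workflow from the command line while a job is running. A quick sketch (the application ID is a placeholder reported by the first command, and viewing logs this way assumes log aggregation is enabled in your configuration):
# List applications currently known to the ResourceManager:
$ yarn application -list
# After an application finishes, view its aggregated container logs:
$ yarn logs -applicationId <application-id>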
Cluster Setup
The official documentation outlines setting up a Hadoop cluster. We will walk through a simplified version of the setup in this guide, but if you run into trouble or want to delve deeper into the configuration, use the following links as a starting point.
IMPORTANT: Make sure you are using the correct documentation for your version of Hadoop.
- Single-node setup: http://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/SingleCluster.html
- Cluster setup: http://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/ClusterSetup.html
Environment Setup
We need to set up environment variables for the various Hadoop components to function correctly. Edit your ~/.bashrc (or .zshenv if you are a zsh user) and add the following:
# Use the latest version of Java:
export JAVA_HOME="/usr/lib/jvm/java"
# Location of the Hadoop installation:
export HADOOP_HOME="/bigdata/hadoop-3.3.4"
# Location of our local configuration (more on this later):
export HADOOP_CONF_DIR="${HOME}/hadoop-config"
# Where Hadoop should store log files:
export HADOOP_LOG_DIR="/bigdata/students/$(whoami)/logs"
# In our configuration, these are just aliases for $HADOOP_HOME:
export HADOOP_MAPRED_HOME="${HADOOP_HOME}"
export YARN_HOME="${HADOOP_HOME}"
# Add Hadoop binaries to the user PATH environment variable
export PATH="${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin"
# Add the location of native libraries (not required, for better performance):
export LD_LIBRARY_PATH="${HADOOP_HOME}/lib/native:${LD_LIBRARY_PATH}"
Note that the default USF configuration sources .bashrc from .bash_profile. If this isn’t the case with your setup, you may need to add it (source ~/.bashrc inside your .bash_profile).
Log out and back in (or source ~/.bashrc). Now you should be able to run Hadoop commands such as hdfs and yarn.
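A quick way to verify the environment is to make sure the binaries resolve to the shared installation and print version information:
# Both commands should resolve to binaries under /bigdata/hadoop-3.3.4:
$ which hdfs yarn
# Print the Hadoop version and build details:
$ hadoop version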
Configuration
The default configuration files are found in the etc/hadoop directory of the Hadoop installation:
$ ls $HADOOP_HOME/etc/hadoop
capacity-scheduler.xml httpfs-log4j.properties mapred-site.xml
configuration.xsl httpfs-signature.secret shellprofile.d
container-executor.cfg httpfs-site.xml ssl-client.xml.example
core-site.xml kms-acls.xml ssl-server.xml.example
hadoop-env.cmd kms-env.sh user_ec_policies.xml.template
hadoop-env.sh kms-log4j.properties workers
hadoop-metrics2.properties kms-site.xml yarn-env.cmd
hadoop-policy.xml log4j.properties yarn-env.sh
hadoop-user-functions.sh.example mapred-env.cmd yarnservice-log4j.properties
hdfs-site.xml mapred-env.sh yarn-site.xml
httpfs-env.sh mapred-queues.xml.template
We need to customize the installation, so let’s make a copy of these files in our home directory. You might remember that we set the location of HADOOP_CONF_DIR to $HOME/hadoop-config, so that’s where the files will be copied:
$ mkdir -v $HADOOP_CONF_DIR
$ cp -rv $HADOOP_HOME/etc/hadoop/* $HADOOP_CONF_DIR
# Since we're installing on a non-Windows platform, let's remove the .cmd
# files (not required):
$ rm $HADOOP_CONF_DIR/*.cmd
At this point, we just have a few steps left:
- Creating a list of workers (servers that will participate in our cluster)
- Deciding which servers will host the various Hadoop components
- Deciding what ports the components will run on (since everyone will be installing their own copy of Hadoop…)
Generating the Worker List
We will reserve orion11 and orion12 for NameNodes and ResourceManagers, with the remaining 10 machines executing tasks through NodeManager instances. To generate a worker list, we can do the following:
$ cd $HADOOP_CONF_DIR
$ for i in {01..10}; do echo "orion${i}"; done > workers
$ cat workers
orion01
orion02
orion03
orion04
orion05
orion06
orion07
orion08
orion09
orion10
Creating our Configuration
Use the starter files to create your Hadoop configuration. It is important that you run your Hadoop/HDFS components on unique ports; otherwise, you’ll have conflicts and components will fail to run properly. See the port assignment list.
Note: you can copy these starter files over your base configuration or use them as a guide.
# Go home first:
$ cd
# Download the starter files and untar:
$ wget 'https://www.cs.usfca.edu/~mmalensek/cs677/schedule/materials/starter-config.tar.gz'
$ tar xvf starter-config.tar.gz
Within the starter files, you will need to edit the following (you can find the relevant keys with the grep command if you’d like; see the example after the list below), or use the configure.sh script:
- port – any XML keys with ‘port’ inside need to be updated with a unique port (from those assigned to you).
- namenode – hostname of your NameNode
- secondary-nn – hostname of your secondary NameNode
- resourcemanager – hostname of your ResourceManager
- username – your CS username. You can find this by running the whoami command.
Note that any hostname with 0.0.0.0 means that the component in question will be run on a worker and bind to the local host address. Don’t add a hostname here!
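If you’d like to locate these keys by hand instead of (or before) running configure.sh, grep makes this easy. A rough sketch, run from wherever the starter XML files were extracted:
# Show every line mentioning a port so you can substitute your assigned ports:
$ grep -n 'port' *.xml
# Show the placeholder hostnames and username that need to be filled in:
$ grep -n 'namenode\|secondary-nn\|resourcemanager\|username' *.xml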
The configure.sh script will make these changes for you; all you need to supply are the hostnames of the NameNode, Secondary NameNode, and ResourceManager. After running the script, check over the files to make sure they were configured correctly and then copy them into your $HADOOP_CONF_DIR:
$ cp -v *.xml $HADOOP_CONF_DIR
Formatting the NameNode
Before we can start HDFS, we’ll need to format the NameNode. IMPORTANT: this must be done on the NameNode host itself, not remotely from another machine. So you will need to ssh <namenode-hostname> first:
# Note: this might be orion12 if you used it for the NameNode:
$ ssh orion11
$ hdfs namenode -format
If your HDFS installation breaks, do not simply follow these steps again and reformat the NameNode; doing so will change the cluster ID, and none of your DataNodes will be able to start up. Instead, inspect the logs to determine what is wrong.
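For example, if the NameNode or a DataNode refuses to start, check the end of its most recent log file. The exact file names depend on the daemon and host, so adjust the glob as needed:
# Logs are written to the directory we set in HADOOP_LOG_DIR earlier:
$ ls $HADOOP_LOG_DIR
# Check the tail of the NameNode log for recent error messages:
$ tail -n 50 $HADOOP_LOG_DIR/*namenode*.log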
Testing the Distributed File System
Let’s start the DFS and see how much storage space we have.
$ start-dfs.sh
Starting namenodes on [orion11]
Starting datanodes
Starting secondary namenodes [orion12]
$ hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://orion11:20000 65.5 T 6.1 G 63.9 T 0%
Around 65 TB. Not too bad! Note that this will change depending on the underlying FS usage, so you shouldn’t expect to get the exact same results here. Now let’s store a file in the file system:
# This will store './some_file.txt' into the root of your DFS:
$ hdfs dfs -put ./some_file.txt /
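You can verify that the file made it by listing the root of the DFS:
# some_file.txt should show up in the listing:
$ hdfs dfs -ls /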
Testing a YARN Job
Now we need to start YARN and run a test job. ssh to the ResourceManager before running start-yarn.sh.
$ start-yarn.sh
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar \
wordcount /file-to-count.txt /output-dir
The paths here refer to locations in HDFS. NOTE: if you’d like to use local files (i.e., files that are not stored in HDFS) as inputs or outputs, you must prefix them with file:// and give the complete path to their location:
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar \
wordcount file:///path/to/file/file-to-count.txt file:///path/to/output/output-dir
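When the job completes, the results land in the output directory as one file per reducer. Assuming the HDFS invocation above, something like the following will display the word counts:
# A _SUCCESS marker and part-r-* files should be present:
$ hdfs dfs -ls /output-dir
# Print the word counts produced by the reducers:
$ hdfs dfs -cat /output-dir/part-r-*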
The next step is to store some large files in HDFS and verify that they are being chunked correctly and dispersed across the machines.
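The hdfs fsck utility is handy here; it reports how a file was split into blocks and which datanodes hold each block (and its replicas). A sketch with a placeholder file name:
# Put a reasonably large file (several hundred MB or more) into HDFS:
$ hdfs dfs -put ./large_file.bin /
# Report the blocks that make up the file and the datanodes storing them:
$ hdfs fsck /large_file.bin -files -blocks -locations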