Setting up Apache Spark
Spark is a cluster computing framework that supports working sets (distributed shared memory) and has a less restrictive programming interface than MapReduce. This allows iterative algorithms, such as those used to train machine learning models, to execute efficiently across large, distributed datasets.
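As a sketch of what that buys you: in the pyspark shell you’ll set up below, an RDD can be cached in cluster memory and reused on every pass of an iterative loop (the HDFS path here is a placeholder, just like in the Word Count example later in this guide):
# Cache the working set once; each pass reuses it instead of re-reading from HDFS.
data = sc.textFile("hdfs://orionXX:PORT/some/data").cache()
for i in range(10):
    n = data.filter(lambda line, i=i: str(i) in line).count()
    print(i, n)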
While it is possible to run Spark jobs under YARN, we will configure a standalone cluster in this guide. For storage, we’ll use our existing HDFS cluster.
Prerequisites
- Passwordless SSH
- HDFS
Getting the Software: Orion Machines
Spark 3.3.2 is available in /bigdata/spark-3.3.2-bin-hadoop3. As with Hadoop, we will set up a local configuration in your home directory.
Optional: Personal Installation
If you want to run Spark on your own computer, install it with your package manager (on macOS, you can use brew install apache-spark), or download the binary distribution from: http://spark.apache.org/downloads.html
Background: Spark
The two main components in Spark are the Cluster Manager and the Workers. The Workers launch Spark Executors: Java processes that run the individual tasks of a job.
To run a Spark job, you’ll need a Driver. This can be a machine in the cluster or your own laptop; it manages the state of the job and submits tasks to be executed. Each driver has a SparkContext – you’ll see this when you start spark-shell or pyspark.
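Concretely, when you launch one of those shells later in this guide, the SparkContext is created for you as sc, and actions on an RDD turn into tasks that run on the executors. A tiny illustration (the output shown is indicative, not captured from the cluster):
sc                                     # the driver's pre-created SparkContext
rdd = sc.parallelize(range(8), 4)      # 4 partitions -> 4 tasks
rdd.glom().collect()                   # e.g. [[0, 1], [2, 3], [4, 5], [6, 7]]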
Cluster Setup
The official documentation outlines setting up a Spark cluster. We will walk through the setup here, but if you run into trouble or want to delve deeper into the configuration, use the following links as a starting point.
- Simple Setup: http://spark.apache.org/docs/latest/
- Cluster setup: http://spark.apache.org/docs/latest/cluster-overview.html
Environment Setup
We need to set up some environment variables for the various Spark components. Edit your ~/.bashrc (or ~/.zshenv if you are a zsh user) and add the following:
# Use the latest version of Java:
# (You only need this once if it was already configured for Hadoop)
export JAVA_HOME="/usr/lib/jvm/java"
# Location of the Spark installation:
export SPARK_HOME="/bigdata/spark-3.3.2-bin-hadoop3"
# Location of our local configuration:
export SPARK_CONF_DIR="${HOME}/spark-config"
# Add Spark binaries to the user PATH environment variable
export PATH="${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin"
Log out and back in (or source the config file). You should now be able to run spark-shell or PYSPARK_PYTHON=python3 pyspark for local jobs. Note that we’ll fix the pyspark command to not require the PYSPARK_PYTHON=python3 preamble later in the guide.
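As a quick smoke test of the local setup, try a small computation inside the pyspark shell (this runs entirely in local mode, so no cluster is needed yet):
# Sum the integers 1 through 100; should print 5050.
sc.parallelize(range(1, 101)).reduce(lambda a, b: a + b)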
Configuration
The default configuration files are found in the conf directory:
ls $SPARK_HOME/conf
docker.properties.template    metrics.properties.template     workers.template
fairscheduler.xml.template    spark-defaults.conf.template
log4j.properties.template     spark-env.sh.template
As with Hadoop, let’s use these files as a starting point. SPARK_CONF_DIR is set to $HOME/spark-config; let’s create this now:
mkdir -v $SPARK_CONF_DIR
cp -rv $SPARK_HOME/conf/* $SPARK_CONF_DIR
Similar to the Hadoop setup, we need to configure a few more items:
- Worker list
- The cluster manager (‘master’)
Generating the Worker List
We will reserve orion11 and orion12 for master nodes, with the remaining 10 nodes used for workers.
cd $SPARK_CONF_DIR
for i in {01..10}; do echo "orion${i}"; done > workers
cat workers
orion01
orion02
orion03
orion04
orion05
orion06
orion07
orion08
orion09
orion10
Configuring the Cluster
First, choose where you’d like to run your master (orion11 or orion12). Then rename and edit spark-env.sh.template:
mv spark-env.sh.template spark-env.sh
# Edit the file (replace vim with your favorite editor) and set the following
# variables. Replace XX with your port range prefix.
vim spark-env.sh
# Set this to whichever master you chose above (orion11 or orion12):
SPARK_MASTER_HOST=orion12
SPARK_MASTER_PORT=XX071
SPARK_MASTER_WEBUI_PORT=XX072
SPARK_WORKER_PORT=XX073
SPARK_WORKER_WEBUI_PORT=XX074
# Limit each worker to 2 cores:
SPARK_WORKER_CORES=2
# Keep worker scratch space and logs under your own directory on /bigdata:
SPARK_WORKER_DIR=/bigdata/students/$(whoami)/spark-worker
SPARK_LOG_DIR=/bigdata/students/$(whoami)/spark-logs
Testing the Cluster
You will need to start the master node on the machine itself, so ssh to your master node first:
ssh orion12
start-master.sh
start-workers.sh
Note: for the first startup, you may get warning messages as log directories are created.
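Once both scripts finish, the master’s web UI (SPARK_MASTER_WEBUI_PORT, i.e. XX072 in the configuration above) lists the registered workers, which is a quick way to confirm that all ten came up; you may need the same kind of SSH port forwarding described later in this guide to reach it from your own machine.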
Next, let’s run a job on our cluster. Ideally you’d use the last free node for running jobs; if your master is on orion12, then submit jobs from orion11 for best results.
ssh orion11
PYSPARK_PYTHON=python3 pyspark --master=spark://orion12:XX071
Time to do the world-famous Word Count example (the official Spark documentation includes more examples):
# Use your namenode and its port here to read data from HDFS:
text_file = sc.textFile("hdfs://orionXX:PORT/some/data")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://orionXX:PORT/output/data")
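Before (or after) saving, you can sanity-check a few results interactively; takeOrdered is standard PySpark, and the exact counts depend on your input:
counts.takeOrdered(10, key=lambda pair: -pair[1])   # ten most frequent (word, count) pairs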
You’re now ready to use Spark!
Jupyter (and Python) Configuration
To use Jupyter Notebooks with Spark, you will need to add a few more environment variables to your ~/.bashrc or ~/.zshenv:
# (Add the anaconda installation binaries to your path)
export PATH="${PATH}:/home2/anaconda3/bin"
export PYSPARK_PYTHON=/home2/anaconda3/bin/python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=XX075'
# --- NOTE: Replace XX with your port range prefix. ------^^
Remember to source your shell configuration files again for the changes to take effect.
Note that if you’d prefer to just use the Python shell, you can leave out the PYSPARK_DRIVER_PYTHON lines. One nice alternative to Jupyter is ipython (set PYSPARK_DRIVER_PYTHON=ipython and omit the PYSPARK_DRIVER_PYTHON_OPTS line).
The configuration above starts the Jupyter notebook server on port XX075. You can then forward this to your local machine:
# Run this to start the driver (ideally in tmux, screen, etc.)
# Here, 'orion12' refers to the location of the master node.
pyspark --master=spark://orion12:XX071
# Then, in another terminal on your local machine:
# ssh -J USERNAME@stargate.cs.usfca.edu USERNAME@orionYY -L 8080:localhost:XX075
# NOTE1: You need ProxyJump set up for the above to work
# NOTE2: YY is the orion machine you started pyspark on (can be any machine you want)
Now you can navigate to http://localhost:8080 on your machine and access your notebook from there.
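A good first cell is a quick check that the notebook’s kernel is attached to your standalone cluster rather than running in local mode (sc is created by pyspark when it launches the notebook; the master URL is whatever you passed with --master):
sc.master                                # should show spark://orion12:XX071
sc.parallelize(range(100), 10).count()   # runs 10 small tasks on the cluster and returns 100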