Setting up Apache Spark
Spark is a cluster computing framework that supports working sets (distributed shared memory) and has a less restrictive programming interface than MapReduce. This allows iterative algorithms, such as those used to train machine learning models, to execute efficiently across large, distributed datasets.
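As a sketch of what that buys you: in the pyspark shell you’ll set up below, an RDD can be cached in cluster memory and reused on every pass of an iterative loop (the HDFS path here is a placeholder, just like in the Word Count example later in this guide):
# Cache the working set once; each pass reuses it instead of re-reading from HDFS.
data = sc.textFile("hdfs://orionXX:PORT/some/data").cache()
for i in range(10):
    n = data.filter(lambda line, i=i: str(i) in line).count()
    print(i, n)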
While it is possible to run Spark jobs under YARN, we will configure a standalone cluster in this guide. For storage, we’ll use our existing HDFS cluster.
Prerequisites
- Passwordless SSH
- HDFS
Getting the Software: Orion Machines
Spark 3.3.2 is available in /bigdata/spark-3.3.2-bin-hadoop3. As with Hadoop, we will set up a local configuration in your home directory.
Optional: Personal Installation
If you want to run Spark on your own computer, install it with your package manager (on macOS, you can use brew install apache-spark), or download the binary distribution from: http://spark.apache.org/downloads.html
Background: Spark
The two main components in Spark are the Cluster Manager and the Workers. The Workers launch Spark Executors: Java processes that run the individual tasks of a job.
To run a Spark job, you’ll need a Driver. This can be a machine in the cluster or your own laptop; it manages the state of the job and submits tasks to be executed. Each driver has a SparkContext – you’ll see this when you start spark-shell or pyspark.
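Concretely, when you launch one of those shells later in this guide, the SparkContext is created for you as sc, and actions on an RDD turn into tasks that run on the executors. A tiny illustration (the output shown is indicative, not captured from the cluster):
sc                                     # the driver's pre-created SparkContext
rdd = sc.parallelize(range(8), 4)      # 4 partitions -> 4 tasks
rdd.glom().collect()                   # e.g. [[0, 1], [2, 3], [4, 5], [6, 7]]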
Cluster Setup
The official documentation outlines setting up a Spark cluster. We will walk through the setup here, but if you run into trouble or want to delve deeper into the configuration, use the following links as a starting point.
- Simple Setup: http://spark.apache.org/docs/latest/
- Cluster setup: http://spark.apache.org/docs/latest/cluster-overview.html
Environment Setup
We need to set up some environment variables for the various Spark components. Edit your ~/.bashrc (or ~/.zshenv if you are a zsh user) and add the following:
# Use the latest version of Java:
# (You only need this once if it was already configured for Hadoop)
export JAVA_HOME="/usr/lib/jvm/java"
# Location of the Spark installation:
export SPARK_HOME="/bigdata/spark-3.3.2-bin-hadoop3"
# Location of our local configuration:
export SPARK_CONF_DIR="${HOME}/spark-config"
# Add Spark binaries to the user PATH environment variable
export PATH="${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin"
Log out and back in (or source the config file). You should now be able to run spark-shell or PYSPARK_PYTHON=python3 pyspark for local jobs. Note that we’ll fix the pyspark command to not require the PYSPARK_PYTHON=python3 preamble later in the guide.
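As a quick smoke test of the local setup, try a small computation inside the pyspark shell (this runs entirely in local mode, so no cluster is needed yet):
# Sum the integers 1 through 100; should print 5050.
sc.parallelize(range(1, 101)).reduce(lambda a, b: a + b)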
Configuration
The default configuration files are found in the conf directory:
ls $SPARK_HOME/conf
docker.properties.template    metrics.properties.template     workers.template
fairscheduler.xml.template    spark-defaults.conf.template
log4j.properties.template     spark-env.sh.template
As with Hadoop, let’s use these files as a starting point. SPARK_CONF_DIR is set to $HOME/spark-config; let’s create this now:
mkdir -v $SPARK_CONF_DIR
cp -rv $SPARK_HOME/conf/* $SPARK_CONF_DIR
Similar to the Hadoop setup, we need to configure a few more items:
- Worker list
- The cluster manager (‘master’)
Generating the Worker List
We will reserve orion11 and orion12 for master nodes, with the remaining 10 nodes used for workers.
cd $SPARK_CONF_DIR
for i in {01..10}; do echo "orion${i}"; done > workers
cat workers
orion01
orion02
orion03
orion04
orion05
orion06
orion07
orion08
orion09
orion10
Configuring the Cluster
First, choose where you’d like to run your master (orion11 or orion12). Then rename and edit spark-env.sh.template:
mv spark-env.sh.template spark-env.sh
# Edit the file (replace vim with your favorite editor) and set the following
# variables. Replace XX with your port range prefix.
vim spark-env.sh
# Set this to whichever master you chose above (orion11 or orion12):
SPARK_MASTER_HOST=orion12
SPARK_MASTER_PORT=XX071
SPARK_MASTER_WEBUI_PORT=XX072
SPARK_WORKER_PORT=XX073
SPARK_WORKER_WEBUI_PORT=XX074
# Limit each worker to 2 cores:
SPARK_WORKER_CORES=2
# Keep worker scratch space and logs under your own directory on /bigdata:
SPARK_WORKER_DIR=/bigdata/students/$(whoami)/spark-worker
SPARK_LOG_DIR=/bigdata/students/$(whoami)/spark-logs
Testing the Cluster
You will need to start the master node on the machine itself, so ssh to your master node first:
ssh orion12
start-master.sh
start-workers.sh
Note: for the first startup, you may get warning messages as log directories are created.
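Once both scripts finish, the master’s web UI (SPARK_MASTER_WEBUI_PORT, i.e. XX072 in the configuration above) lists the registered workers, which is a quick way to confirm that all ten came up; you may need the same kind of SSH port forwarding described later in this guide to reach it from your own machine.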
Next, let’s run a job on our cluster. Ideally you’d use the last free node for running jobs; if your master is on orion12, then submit jobs from orion11 for best results.
ssh orion11
PYSPARK_PYTHON=python3 pyspark --master=spark://orion12:XX071
Time to do the world-famous Word Count example (the official Spark documentation includes more examples):
# Use your namenode and its port here to read data from HDFS:
text_file = sc.textFile("hdfs://orionXX:PORT/some/data")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://orionXX:PORT/output/data")
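Before (or after) saving, you can sanity-check a few results interactively; takeOrdered is standard PySpark, and the exact counts depend on your input:
counts.takeOrdered(10, key=lambda pair: -pair[1])   # ten most frequent (word, count) pairs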
You’re now ready to use Spark!
Jupyter (and Python) Configuration
To use Jupyter Notebooks with Spark, you will need to add a few more environment variables to your ~/.bashrc or ~/.zshenv:
# (Add the anaconda installation binaries to your path)
export PATH="${PATH}:/home2/anaconda3/bin"
export PYSPARK_PYTHON=/home2/anaconda3/bin/python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=XX075'
# --- NOTE: Replace XX with your port range prefix. ------^^
Remember to source your shell configuration files again for the changes to take effect.
Note that if you’d prefer to just use the Python shell, you can leave out the PYSPARK_DRIVER_PYTHON lines. One nice alternative to Jupyter is ipython (set PYSPARK_DRIVER_PYTHON=ipython and omit the PYSPARK_DRIVER_PYTHON_OPTS line).
The configuration above starts the Jupyter notebook server on port XX075. You can then forward this to your local machine:
# Run this to start the driver (ideally in tmux, screen, etc.)
# Here, 'orion12' refers to the location of the master node.
pyspark --master=spark://orion12:XX071
# Then, in another terminal on your local machine:
# ssh -J USERNAME@stargate.cs.usfca.edu USERNAME@orionYY -L 8080:localhost:XX075
# NOTE1: You need ProxyJump set up for the above to work
# NOTE2: YY is the orion machine you started pyspark on (can be any machine you want)
Now you can navigate to http://localhost:8080 on your machine and access your notebook from there.
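A good first cell is a quick check that the notebook’s kernel is attached to your standalone cluster rather than running in local mode (sc is created by pyspark when it launches the notebook; the master URL is whatever you passed with --master):
sc.master                                # should show spark://orion12:XX071
sc.parallelize(range(100), 10).count()   # runs 10 small tasks on the cluster and returns 100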