Spark 2 setup

This demo will cover:

  • Install standalone Spark on your local machine
  • Set up the PySpark REPL interface

Requirements

  • This demo will be done with Python 3
  • Java 8
  • Jupyter Notebook

Download Spark 2 from https://spark.apache.org/downloads.html

  1. Choose the most recent 2.x build
  2. Choose package: Pre-built for Apache Hadoop 2.7 and later. (Standalone installation does not require you to have Hadoop installed)
  3. Download the file from the generated link
  4. Move the downloaded file to a suitable location on your hard drive
  5. Run this command to unpack the download: sudo tar -xvzf <path to spark binary file download>
  6. Run ls to confirm the files have unpacked

Standalone Spark requires some environment variables to be set.

Open your bash profile (nano ~/.bash_profile) and set the following:

export SPARK_HOME="/path/to/spark2/folder"
export PATH="$SPARK_HOME/bin:$PATH"

Note: if you don't have JAVA_HOME set, this will need to be done as well.

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

Exit nano and reload the bash profile

source ~/.bash_profile

Check whether PySpark is already installed, and install it if it isn't:

pyspark --version

# if not installed
pip install pyspark

With PySpark installed, we can start a Spark shell:

pyspark

Once the shell starts up, we can access the SparkContext by typing sc, and exit the shell with exit()

>>> sc
<SparkContext master=local[*] appName=PySparkShell>
>>> exit()
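
You can also run a small job against the SparkContext to confirm everything works before exiting. A minimal sketch, where the numbers are just made-up sample data:

>>> rdd = sc.parallelize([1, 2, 3, 4, 5])   # distribute a small local list
>>> rdd.sum()
15
>>> rdd.map(lambda x: x * 2).collect()      # transform and pull results back to the driver
[2, 4, 6, 8, 10]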

Instead of interacting with the PySpark shell directly, we can set up Jupyter Notebook to launch when we start up Spark 2

We will need to declare more environment variables

Open the bash profile again (nano ~/.bash_profile) and add the following:

export PYSPARK_SUBMIT_ARGS="pyspark-shell"
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Save and close the file. Then, reload the bash profile
source ~/.bash_profile

Now, when you run pyspark, a Jupyter notebook server will start instead of the plain shell.
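
In a new notebook, the PySpark startup script already defines sc and a SparkSession named spark, so you can use them directly in the first cell. A minimal sketch, where the rows are made-up sample data:

# spark is the SparkSession created by the PySpark startup script
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.show()

# sc is still available for RDD work
sc.parallelize(range(100)).count()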