Spark 2 setup
Demo will cover:
- Install standalone Spark on your local machine
- Set up the PySpark REPL interface
Requirements
- This demo will be done with Python 3
- Java 8
- Jupyter notebooks
Download Spark 2 from https://spark.apache.org/downloads.html
- Choose the most recent 2.x build
- Choose package: Pre-built for Apache Hadoop 2.7 and later. (Standalone installation does not require you to have Hadoop installed)
- Download the file from the generated link
- Move the downloaded file to a suitable location on your hard drive
- Run this command to unpack the download:
sudo tar -xvzf <path to spark binary file download>
ls
to confirm the files have unpacked.
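As a concrete sketch, assuming the archive landed in ~/Downloads and is named spark-2.4.8-bin-hadoop2.7.tgz (a hypothetical filename; substitute whatever 2.x build you downloaded):
sudo tar -xvzf ~/Downloads/spark-2.4.8-bin-hadoop2.7.tgz
ls spark-2.4.8-bin-hadoop2.7
# you should see folders such as bin, conf, jars, and python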
Standalone Spark requires some environment variables to be set.
Open your bash profile, nano ~/.bash_profile
and set the following:
export SPARK_HOME="/path/to/spark2/folder"
export PATH="$SPARK_HOME/bin:$PATH"
Note: if you don't have JAVA_HOME set, this will need to be done as well.
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
Exit nano and reload the bash profile
source ~/.bash_profile
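As a quick sanity check (not part of the original steps), you can confirm the variables took effect:
echo $SPARK_HOME    # should print the path to your Spark folder
echo $JAVA_HOME     # should print your Java 8 home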
Check whether PySpark is available; if not, install it:
pyspark --version
# if not installed
pip install pyspark
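As an extra check (an assumption on my part, not in the original demo), you can confirm the package imports cleanly from Python:
python3 -c "import pyspark; print(pyspark.__version__)"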
With PySpark installed, we can start a Spark shell:
pyspark
Once the shell starts up, we can access the SparkContext by typing sc,
and exit the shell with exit()
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
>>> exit()
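While the shell is open, a quick way to confirm the context works is to run a tiny job against it; a minimal sketch (the numbers are only illustrative):
>>> rdd = sc.parallelize(range(100))           # distribute a small range across the local cores
>>> rdd.sum()                                  # runs a job; should return 4950
4950
>>> rdd.filter(lambda x: x % 2 == 0).count()   # count the even numbers
50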
Instead of interacting with the PySpark shell directly, we can set up Jupyter notebooks to launch when we start up Spark 2.
We will need to declare a few more environment variables.
Open the bash profile again, nano ~/.bash_profile
and add the following:
export PYSPARK_SUBMIT_ARGS="pyspark-shell"
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Save and close the file. Then, reload the bash profile
source ~/.bash_profile
Now, when you run pyspark,
a Jupyter notebook server will start.
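Once the notebook opens in your browser, create a new notebook and the Spark context is already available, just as in the shell. A small sketch of a first cell (assuming the same local[*] master as above; the exact output ordering may differ):
# sc is created for you by the PySpark driver
print(sc)                                      # <SparkContext master=local[*] appName=PySparkShell>
words = sc.parallelize(["spark", "jupyter", "spark", "python"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())                        # e.g. [('spark', 2), ('jupyter', 1), ('python', 1)]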