Jupyter Notebook for PySpark

Goal: Create a Spark project for Python in Jupyter Notebook.

Download the Apache Spark binary, untar it to a location of your choice, and set the SPARK_HOME environment variable to that location. For example:

export SPARK_HOME=/usr/lib/spark-2.1.1-bin-hadoop2.7
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.ip='*' --NotebookApp.port=8888 --NotebookApp.open_browser=False"

You can add these lines to $SPARK_HOME/conf/spark-env.sh to make them persistent across reboots.

Now, launch Jupyter with Spark session support:

$ cd ~/workspace/python
$ $SPARK_HOME/bin/pyspark

Open a web browser at http://localhost:8888, create a new Python 3 notebook, and check whether the SparkContext object (sc) is available to get started with PySpark code.
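
For a quick sanity check, a minimal cell like the following (the numbers are only an illustration) should report the Spark version and run a trivial job using the pre-created sc handle:

# Quick sanity check in a notebook cell; sc is created by pyspark
print(sc.version)                          # e.g. 2.1.1
print(sc.parallelize(range(100)).sum())    # trivial job; should print 4950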

Alternatively, you can start Jupyter the normal way and then create the Spark session yourself.

Install py4j using pip and run the following before launching Jupyter:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
export SPARK_HOME=/usr/lib/spark-2.1.1-bin-hadoop2.7
jupyter notebook --NotebookApp.ip='*' --NotebookApp.port=8888 --NotebookApp.open_browser=False

Open Jupyter in the browser at http://localhost:8888 and create a new notebook.

import sys, glob, os

# Make the pyspark and py4j packages shipped with the Spark distribution importable
SPARK_HOME = os.environ['SPARK_HOME']
sys.path.append(SPARK_HOME + "/python")
sys.path.append(glob.glob(SPARK_HOME + "/python/lib/py4j*.zip")[0])

from pyspark.sql import SparkSession

# Create the Spark session (with Hive support) and convenience handles
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sql = spark.sql
print(sc.uiWebUrl)   # URL of the Spark web UI for this application
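
Once the session is up, a short cell like the one below (a minimal sketch; the view name "numbers" is arbitrary) confirms that DataFrames and SQL queries work:

# Verify the session with a tiny DataFrame and an SQL query
df = spark.range(10).toDF("n")
df.createOrReplaceTempView("numbers")
sql("SELECT count(*) AS cnt FROM numbers").show()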

Run Jupyter

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

Launch Jupyter with the Spark backend: $ $SPARK_HOME/bin/pyspark

Submit .py file using spark-submit

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
unset PYSPARK_DRIVER_PYTHON_OPTS

Submit the job: $ $SPARK_HOME/bin/spark-submit <.py file>
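
For reference, a minimal self-contained script that can be submitted this way might look like the sketch below (the file name wordcount_example.py and the in-memory input are just illustrations):

# wordcount_example.py - hypothetical example script for spark-submit
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # spark-submit supplies the cluster configuration; the script only names the app
    spark = SparkSession.builder.appName("WordCountExample").getOrCreate()
    sc = spark.sparkContext

    # Count words in a small in-memory dataset (replace with your own input)
    lines = sc.parallelize(["hello spark", "hello jupyter", "hello world"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    for word, count in counts.collect():
        print(word, count)

    spark.stop()

Then submit it with: $ $SPARK_HOME/bin/spark-submit wordcount_example.py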