Jupyter Notebook for PySpark

Goal: Create a Spark project for Python in Jupyter Notebook.

Download the Apache Spark binary, untar it to a location of your choice, and set the SPARK_HOME environment variable to that location. For example:

export SPARK_HOME=/usr/lib/spark-2.1.1-bin-hadoop2.7
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.ip='*' --NotebookApp.port=8888 --NotebookApp.open_browser=False"

You can add these lines to $SPARK_HOME/conf/spark-env.sh to make them persistent across reboots.

Now, launch Jupyter with Spark session support:

$ cd ~/workspace/python
$ $SPARK_HOME/bin/pyspark

Open a web browser at http://localhost:8888, create a new Python 3 notebook, and check whether the SparkContext object (sc) is available to get started with PySpark code.
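
For a quick sanity check, a minimal cell like the following (the numbers are only an illustration) should report the Spark version and run a trivial job using the pre-created sc handle:

# Quick sanity check in a notebook cell; sc is created by pyspark
print(sc.version)                          # e.g. 2.1.1
print(sc.parallelize(range(100)).sum())    # trivial job; should print 4950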

Alternatively, you can start Jupyter the normal way and then create the Spark session yourself.

Install py4j using pip and run the following before launching Jupyter:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
export SPARK_HOME=/usr/lib/spark-2.1.1-bin-hadoop2.7
jupyter notebook --NotebookApp.ip='*' --NotebookApp.port=8888 --NotebookApp.open_browser=False

Open Jupyter in the browser at http://localhost:8888 and create a new notebook.

import sys, glob, os

# Make the pyspark and py4j packages shipped with the Spark distribution importable
SPARK_HOME = os.environ['SPARK_HOME']
sys.path.append(SPARK_HOME + "/python")
sys.path.append(glob.glob(SPARK_HOME + "/python/lib/py4j*.zip")[0])

from pyspark.sql import SparkSession

# Create the Spark session (with Hive support) and convenience handles
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sql = spark.sql
print(sc.uiWebUrl)   # URL of the Spark web UI for this application
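
Once the session is up, a short cell like the one below (a minimal sketch; the view name "numbers" is arbitrary) confirms that DataFrames and SQL queries work:

# Verify the session with a tiny DataFrame and an SQL query
df = spark.range(10).toDF("n")
df.createOrReplaceTempView("numbers")
sql("SELECT count(*) AS cnt FROM numbers").show()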

Run Jupyter

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

Launch Jupyter with the Spark backend: $ $SPARK_HOME/bin/pyspark

Submit .py file using spark-submit

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
unset PYSPARK_DRIVER_PYTHON_OPTS

Submit the job: $ $SPARK_HOME/bin/spark-submit <.py file>
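
For reference, a minimal self-contained script that can be submitted this way might look like the sketch below (the file name wordcount_example.py and the in-memory input are just illustrations):

# wordcount_example.py - hypothetical example script for spark-submit
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # spark-submit supplies the cluster configuration; the script only names the app
    spark = SparkSession.builder.appName("WordCountExample").getOrCreate()
    sc = spark.sparkContext

    # Count words in a small in-memory dataset (replace with your own input)
    lines = sc.parallelize(["hello spark", "hello jupyter", "hello world"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    for word, count in counts.collect():
        print(word, count)

    spark.stop()

Then submit it with: $ $SPARK_HOME/bin/spark-submit wordcount_example.py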