Jupyter Notebook for PySpark
Goal: Create a Spark project for Python in Jupyter Notebook.
Download the Apache Spark binary, untar it to a location of your choice, and set the SPARK_HOME environment variable to that location. For example:
export SPARK_HOME=/usr/lib/spark-2.1.1-bin-hadoop2.7
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.ip='0.0.0.0' --NotebookApp.port=8888 --NotebookApp.open_browser=False"
You can add these lines to $SPARK_HOME/conf/spark-env.sh to make them persist across reboots.
Now, launch Jupyter with Spark session support:
$ cd ~/workspace/python
$ $SPARK_HOME/bin/pyspark
Open a web browser at http://localhost:8888 (the port configured above), create a new Python 3 notebook, and check that the SparkContext object is available to get started with PySpark code.
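A quick sanity check for the first notebook cell might look like the sketch below; sc and spark are injected by the pyspark launcher, not defined in the notebook.
print(sc.version)                 # Spark version, e.g. 2.1.1
print(sc.master)                  # master URL, e.g. local[*]
rdd = sc.parallelize(range(100))
print(rdd.sum())                  # 4950 if the context is working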
Alternatively, you may want to start Jupyter the normal way and then create the Spark session yourself.
Install py4j with pip (pip install py4j) and run the following before launching Jupyter:
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
export SPARK_HOME=/usr/lib/spark-2.1.1-bin-hadoop2.7
jupyter notebook --NotebookApp.ip='*' --NotebookApp.port=8888 --NotebookApp.open_browser=False
Open Jupyter in the browser at http://localhost:8888, create a new notebook, and run the following setup code:
# Make the PySpark and py4j libraries importable from the Spark installation.
import sys, glob, os
SPARK_HOME = os.environ['SPARK_HOME']
sys.path.append(SPARK_HOME + "/python")
sys.path.append(glob.glob(SPARK_HOME + "/python/lib/py4j*.zip")[0])

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# setIfMissing keeps any value already supplied externally (e.g. by spark-submit).
conf = (SparkConf()
        .setAppName("PySpark Application")
        .setIfMissing("spark.master", "local[*]")
        .setIfMissing("spark.local.dir", "/tmp/spark")
        .setIfMissing("spark.driver.memory", "5G")
        .setIfMissing("spark.driver.cores", "4"))

spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
# Alternative: create a bare SparkContext instead of a SparkSession
# from pyspark import SparkContext
# sc = SparkContext(conf=conf)
sc = spark.sparkContext  # SparkContext for the RDD API
print(sc.uiWebUrl)       # URL of the Spark web UI for this session
sql = spark.sql          # shorthand for running SQL queries
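To confirm the session works end to end, a small smoke test along these lines can go in the next cell; the column names and values are purely illustrative.
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
w = Window.partitionBy("key").orderBy("value")
df.withColumn("rank", F.row_number().over(w)).show()
sql("SELECT 1 AS ok").show()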
Run Jupyter
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Launch Jupyter with the Spark backend:
$ $SPARK_HOME/bin/pyspark

Submit a .py file using spark-submit
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
unset PYSPARK_DRIVER_PYTHON_OPTS
Submit the job:
$ $SPARK_HOME/bin/spark-submit <.py file>
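For reference, a minimal self-contained script suitable for spark-submit might look like the sketch below; the file name example_job.py and the app name are illustrative.
# example_job.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("ExampleJob").getOrCreate()
    df = spark.range(1000)                      # DataFrame with ids 0..999
    print(df.selectExpr("sum(id)").first()[0])  # 499500
    spark.stop()
It would then be submitted with $ $SPARK_HOME/bin/spark-submit example_job.py.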