Running Spark on Windows
Configure Spark to Run on Windows
1. Create the directory c:\hadoop\bin
2. Download winutils.exe from https://github.com/steveloughran/winutils/tree/master/hadoop-2.6.0/bin and save it to c:\hadoop\bin
3. Set a system environment variable HADOOP_HOME=c:\hadoop
4. Add %HADOOP_HOME%\bin to the PATH system environment variable: PATH=%HADOOP_HOME%\bin;%PATH% (the setx example after this list shows one way to do steps 3 and 4)
5. Create a folder c:\tmp\hive and from a command prompt run c:\hadoop\bin\winutils.exe chmod -R 777 \tmp\hive
6. Verify the file permissions with c:\hadoop\bin\winutils.exe ls \tmp\hive
7. Download the Spark binary from http://spark.apache.org/downloads.html and unzip it to c:\
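Steps 3 and 4 can be scripted with setx (a sketch; setx is a standard Windows command, but /M requires an administrator prompt, values longer than 1024 characters get truncated, and changes only apply to command prompts opened afterwards):
c:\> REM /M writes system-level (machine) variables
c:\> setx /M HADOOP_HOME c:\hadoop
c:\> REM open a new prompt before this line so %HADOOP_HOME% resolves
c:\> setx /M PATH "%HADOOP_HOME%\bin;%PATH%"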
More details are in the sections below.
Configure Python 3 to run PySpark
1. Install Python 3.5
2. Open a new command prompt. This ensures the command prompt loads all the new environment variables.
3. Start the PySpark shell by running C:\spark-1.6.2-bin-hadoop2.6\bin\pyspark.cmd
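Once the shell is up, a quick sanity check (a minimal example, not part of the original setup) is to run a trivial job:
>>> # should print 5050 if the SparkContext (sc) is working
>>> sc.parallelize(range(1, 101)).sum()
5050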
Install Jupyter
c:\> pip3 install jupyter
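If you would rather run PySpark inside a Jupyter notebook than the plain shell, one approach (a sketch; PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are standard Spark environment variables) is:
c:\> REM tell Spark to launch the driver through Jupyter
c:\> set PYSPARK_DRIVER_PYTHON=jupyter
c:\> set PYSPARK_DRIVER_PYTHON_OPTS=notebook
c:\> C:\spark-1.6.2-bin-hadoop2.6\bin\pyspark.cmd
A notebook server starts, and new notebooks have the sc SparkContext predefined.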
Install psutil (used by PySpark to monitor memory usage when spilling data to disk during shuffles)
c:\> pip3 install psutil
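Shuffle-heavy operations such as reduceByKey are where this matters; a small illustrative example to try in the PySpark shell:
>>> # 100,000 values hashed into 10 keys forces a shuffle
>>> pairs = sc.parallelize(range(100000)).map(lambda x: (x % 10, 1))
>>> pairs.reduceByKey(lambda a, b: a + b).count()
10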
Verify IPython by running the following command
c:\> ipython
Configure Zeppelin
1. Download the Zeppelin binary from https://zeppelin.apache.org/download.html (version 0.6.0 is compatible with Spark 1.6.2) and unzip it to c:\.
2. Go to C:\zeppelin-0.6.0-bin-all\conf, copy zeppelin-env.cmd.template to zeppelin-env.cmd, and set the following variables:
set SPARK_HOME=C:\spark-1.6.2-bin-hadoop2.6
set ZEPPELIN_HOME=C:\zeppelin-0.6.0-bin-all
REM SPARK_HOME must be set before PYTHONPATH references it;
REM adjust the py4j version to match the file in %SPARK_HOME%\python\lib
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.9-src.zip
3. If you want to make pyspark the default interpreter, update C:\zeppelin-0.6.0-bin-all\conf\zeppelin-site.xml (see the example property after this list).
4. Now open a new command prompt, change directory to C:\zeppelin-0.6.0-bin-all, and run the following command to start the Zeppelin service:
C:\zeppelin-0.6.0-bin-all\bin\zeppelin.cmd
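For step 3, the default interpreter is the first class listed in the zeppelin.interpreters property of zeppelin-site.xml. A sketch of the edit, moving PySpark to the front (class names as shipped with Zeppelin 0.6.0; keep the rest of the list from your copy of the file):
<property>
  <name>zeppelin.interpreters</name>
  <!-- the first entry becomes the default interpreter -->
  <value>org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter</value>
</property>
Once the service is running, open http://localhost:8080 (Zeppelin's default port), create a note, and test with a paragraph such as:
%pyspark
print(sc.parallelize([1, 2, 3, 4]).sum())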
Adding a proxy for Spark packages
If you are behind an HTTP proxy, --packages downloads will fail unless the proxy settings are passed to the driver JVM:
bin\spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>" --packages <somePackage>
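For example (hypothetical values: proxy.example.com:8080 is a placeholder proxy, and spark-csv is simply a package known to work with Spark 1.6):
REM example values only; substitute your own proxy host/port and package
bin\spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080" --packages com.databricks:spark-csv_2.10:1.5.0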