Running Spark on Windows
Configure Spark to Run on Windows
1. Create the directory c:\hadoop\bin
2. Download winutils.exe from https://github.com/steveloughran/winutils/tree/master/hadoop-2.6.0/bin and save it to c:\hadoop\bin
3. Set a system environment variable HADOOP_HOME=c:\hadoop
4. Add %HADOOP_HOME%\bin to the PATH system environment variable: PATH=%HADOOP_HOME%\bin;%PATH% (the setx example after this list shows one way to do steps 3 and 4)
5. Create a folder c:\tmp\hive and from a command prompt run c:\hadoop\bin\winutils.exe chmod -R 777 \tmp\hive
6. Verify the file permissions with c:\hadoop\bin\winutils.exe ls \tmp\hive
7. Download the Spark binary from http://spark.apache.org/downloads.html and unzip it to c:\
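Steps 3 and 4 can be scripted with setx (a sketch; setx is a standard Windows command, but /M requires an administrator prompt, values longer than 1024 characters get truncated, and changes only apply to command prompts opened afterwards):
c:\> REM /M writes system-level (machine) variables
c:\> setx /M HADOOP_HOME c:\hadoop
c:\> REM open a new prompt before this line so %HADOOP_HOME% resolves
c:\> setx /M PATH "%HADOOP_HOME%\bin;%PATH%"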
More details are in the sections below.
Configure Python 3 to run PySpark
1. Install Python 3.5
2. Open a new command prompt. This ensures the command prompt loads all the new environment variables.
3. Start the PySpark shell by running C:\spark-1.6.2-bin-hadoop2.6\bin\pyspark.cmd
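Once the shell is up, a quick sanity check (a minimal example, not part of the original setup) is to run a trivial job:
>>> # should print 5050 if the SparkContext (sc) is working
>>> sc.parallelize(range(1, 101)).sum()
5050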
Install Jupyter
c:\> pip3 install jupyter
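If you would rather run PySpark inside a Jupyter notebook than the plain shell, one approach (a sketch; PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are standard Spark environment variables) is:
c:\> REM tell Spark to launch the driver through Jupyter
c:\> set PYSPARK_DRIVER_PYTHON=jupyter
c:\> set PYSPARK_DRIVER_PYTHON_OPTS=notebook
c:\> C:\spark-1.6.2-bin-hadoop2.6\bin\pyspark.cmd
A notebook server starts, and new notebooks have the sc SparkContext predefined.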
Install psutil (used by PySpark to monitor memory usage when spilling data to disk during shuffles)
c:\> pip3 install psutil
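Shuffle-heavy operations such as reduceByKey are where this matters; a small illustrative example to try in the PySpark shell:
>>> # 100,000 values hashed into 10 keys forces a shuffle
>>> pairs = sc.parallelize(range(100000)).map(lambda x: (x % 10, 1))
>>> pairs.reduceByKey(lambda a, b: a + b).count()
10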
Verify IPython by running the following command
c:\> ipython
Configure Zeppelin
1. Download the Zeppelin binary from https://zeppelin.apache.org/download.html (version 0.6.0 is compatible with Spark 1.6.2) and unzip it to c:\.
2. Go to C:\zeppelin-0.6.0-bin-all\conf, copy zeppelin-env.cmd.template to zeppelin-env.cmd, and set the following variables:
set SPARK_HOME=C:\spark-1.6.2-bin-hadoop2.6
set ZEPPELIN_HOME=C:\zeppelin-0.6.0-bin-all
REM SPARK_HOME must be set before PYTHONPATH references it;
REM adjust the py4j version to match the file in %SPARK_HOME%\python\lib
set PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.9-src.zip
3. If you want to make pyspark the default interpreter, update C:\zeppelin-0.6.0-bin-all\conf\zeppelin-site.xml (see the example property after this list).
4. Now open a new command prompt, change directory to C:\zeppelin-0.6.0-bin-all, and run the following command to start the Zeppelin service:
C:\zeppelin-0.6.0-bin-all\bin\zeppelin.cmd
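For step 3, the default interpreter is the first class listed in the zeppelin.interpreters property of zeppelin-site.xml. A sketch of the edit, moving PySpark to the front (class names as shipped with Zeppelin 0.6.0; keep the rest of the list from your copy of the file):
<property>
  <name>zeppelin.interpreters</name>
  <!-- the first entry becomes the default interpreter -->
  <value>org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter</value>
</property>
Once the service is running, open http://localhost:8080 (Zeppelin's default port), create a note, and test with a paragraph such as:
%pyspark
print(sc.parallelize([1, 2, 3, 4]).sum())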
Adding a proxy for Spark packages
If you are behind an HTTP proxy, --packages downloads will fail unless the proxy settings are passed to the driver JVM:
bin\spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=<proxyHost> -Dhttp.proxyPort=<proxyPort> -Dhttps.proxyHost=<proxyHost> -Dhttps.proxyPort=<proxyPort>" --packages <somePackage>
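For example (hypothetical values: proxy.example.com:8080 is a placeholder proxy, and spark-csv is simply a package known to work with Spark 1.6):
REM example values only; substitute your own proxy host/port and package
bin\spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080" --packages com.databricks:spark-csv_2.10:1.5.0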