Setup Spark Cluster

Install wget

$ sudo yum install -y wget

Install Oracle JDK 1.8 on CentOS

Download the 64-bit Linux rpm (for example, jdk-8u112-linux-x64.rpm) from the Oracle download site.

$ wget \
--no-check-certificate \
--no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" \
http://download.oracle.com/otn-pub/java/jdk/8u112-b15/jdk-8u112-linux-x64.rpm

Install the rpm with yum.

$ sudo yum localinstall -y jdk-8u112-linux-x64.rpm

Set JAVA_HOME in /etc/profile by adding the following line at the end of the file.

export JAVA_HOME=/usr/java/jdk1.8.0_112
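
Running the setup more than once can leave duplicate export lines in /etc/profile. A minimal sketch of one way to avoid that is shown here; the helper name append_once is hypothetical, and the example writes to a scratch file rather than the real /etc/profile.

```shell
#!/usr/bin/env bash
# Append a line to a file only if that exact line is not already there.
# Usage: append_once <file> <line>
# (append_once is an illustrative helper, not part of any tool used in this guide.)
append_once() {
    local file="$1" line="$2"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demonstration against a scratch file.
profile="$(mktemp)"
append_once "$profile" 'export JAVA_HOME=/usr/java/jdk1.8.0_112'
append_once "$profile" 'export JAVA_HOME=/usr/java/jdk1.8.0_112'   # second call changes nothing
cat "$profile"
```

To use it on /etc/profile itself, the function has to run in a root shell, because the file is not writable by ordinary users.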

Download the Spark binary (version 2.0.2, built for Hadoop 2.7, at the time of writing).

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz

Download the Zeppelin binary (version 0.6.2 at the time of writing).

$ wget http://www-us.apache.org/dist/zeppelin/zeppelin-0.6.2/zeppelin-0.6.2-bin-all.tgz

Untar the Spark and Zeppelin binaries and move them to /usr/lib.

$ tar xf spark-2.0.2-bin-hadoop2.7.tgz
$ tar xf zeppelin-0.6.2-bin-all.tgz
$ sudo mv spark-2.0.2-bin-hadoop2.7 /usr/lib
$ sudo mv zeppelin-0.6.2-bin-all /usr/lib
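
As an alternative to extracting in place and then moving the directories, tar can unpack straight into the target directory with its -C option. The sketch below wraps that in a small helper; the name extract_to is hypothetical.

```shell
#!/usr/bin/env bash
# Extract an archive directly into a destination directory.
# Usage: extract_to <archive> <destination-dir>
# (extract_to is an illustrative helper name.)
extract_to() {
    local archive="$1" dest="$2"
    # Create the destination if needed, then unpack into it.
    mkdir -p "$dest" && tar xf "$archive" -C "$dest"
}

# Example with this guide's files (needs a root shell to write to /usr/lib):
#   extract_to spark-2.0.2-bin-hadoop2.7.tgz /usr/lib
#   extract_to zeppelin-0.6.2-bin-all.tgz /usr/lib
```

Either approach should leave the same directories under /usr/lib, so the environment variables in the next step are unaffected.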

Add the following environment variables to /etc/profile.

export SPARK_HOME=/usr/lib/spark-2.0.2-bin-hadoop2.7
export ZEPPELIN_HOME=/usr/lib/zeppelin-0.6.2-bin-all
export PATH=$SPARK_HOME/bin:$PATH

After updating /etc/profile, apply the changes to the current shell by running the following command.

$ source /etc/profile
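
Before starting any services, it can help to confirm that the variables are set and point at real directories. The following is a minimal sketch of such a check; the function name check_home_vars is hypothetical, and it only sees variables that have been exported.

```shell
#!/usr/bin/env bash
# Report whether each named environment variable is set and points to an
# existing directory. Returns non-zero if any variable fails the check.
# (check_home_vars is an illustrative helper name.)
check_home_vars() {
    local status=0 name value
    for name in "$@"; do
        # printenv prints nothing for an unset (or unexported) variable.
        value="$(printenv "$name" || true)"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory"
            status=1
        else
            echo "$name=$value ok"
        fi
    done
    return "$status"
}

check_home_vars JAVA_HOME SPARK_HOME ZEPPELIN_HOME \
    || echo "Fix /etc/profile and source it again."
```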

Start the Zeppelin daemon.

$ $ZEPPELIN_HOME/bin/zeppelin-daemon.sh start
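
The daemon may need some time before its web server accepts connections. One possible way to wait for it is to poll the port, as sketched below. This assumes bash (for its /dev/tcp feature) and that Zeppelin listens on port 8080, as in the URL used later in this guide; the function name wait_for_port is hypothetical.

```shell
#!/usr/bin/env bash
# Poll host:port about once per second until it accepts a TCP connection
# or the timeout (in seconds, default 60) is used up.
# Requires bash: /dev/tcp/<host>/<port> is a bash redirection feature.
# (wait_for_port is an illustrative helper name.)
wait_for_port() {
    local host="$1" port="$2" timeout="${3:-60}" waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # Try to open a connection in a subshell; it closes when the subshell exits.
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Example:
#   wait_for_port localhost 8080 60 && echo "Zeppelin is accepting connections"
```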

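Turn off the firewall at the OS level, if not already done. The commands below are for the iptables service (the CentOS 6 default); on CentOS 7, where firewalld is the default, stop and disable the firewalld service instead.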

$ sudo service iptables stop
$ sudo chkconfig iptables off

If you are using an EC2 instance on AWS, make sure its security group has inbound rules allowing TCP traffic on ports 8080 (Zeppelin) and 4040 (Spark UI).
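
If you manage the security group with the AWS CLI rather than the console, the inbound rules can be added with aws ec2 authorize-security-group-ingress. The sketch below only prints the commands so they can be reviewed first. The group ID and CIDR range are placeholders, and the helper name print_ingress_cmds is hypothetical.

```shell
#!/usr/bin/env bash
# Print (do not run) one AWS CLI command per port that would open
# inbound TCP traffic from the given CIDR range.
# Usage: print_ingress_cmds <security-group-id> <cidr> <port>...
# (print_ingress_cmds is an illustrative helper name.)
print_ingress_cmds() {
    local group_id="$1" cidr="$2" port
    shift 2
    for port in "$@"; do
        echo "aws ec2 authorize-security-group-ingress --group-id $group_id --protocol tcp --port $port --cidr $cidr"
    done
}

# sg-0123456789abcdef0 and 203.0.113.0/24 are placeholders; use your own
# security group ID and the address range you connect from.
print_ingress_cmds sg-0123456789abcdef0 203.0.113.0/24 8080 4040
```

Restricting the CIDR range to your own addresses keeps the notebook from being reachable by everyone.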

Now you can open the Zeppelin notebook in a browser, replacing the host name below with your instance's public DNS name.

http://ec2-35-165-67-12.us-west-2.compute.amazonaws.com:8080

Next Steps