Set up Zeppelin

Download the latest binary release of Spark from here and of Zeppelin from here. Suppose you save both archives in the ~/Downloads directory.

$ cd ~/Downloads

$ tar xf spark-2.2.2-bin-hadoop2.7.tgz

$ tar xf zeppelin-0.7.2-bin-all.tgz

Move the extracted Spark and Zeppelin directories to /usr/lib.

$ sudo mv spark-2.2.2-bin-hadoop2.7 /usr/lib

$ sudo mv zeppelin-0.7.2-bin-all /usr/lib
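For these release archives the extracted directory is named after the archive without its .tgz suffix, so the name passed to mv has to match the archive exactly, version number included. A minimal sketch of a helper that derives one name from the other; the function name extracted_dir is hypothetical and not part of Spark or Zeppelin:

```shell
# extracted_dir ARCHIVE: print the directory name a release archive is
# expected to unpack to, assuming it is the file name minus ".tgz".
extracted_dir() {
  basename "$1" .tgz
}

# Example: print the mv commands for review instead of running them.
for archive in spark-2.2.2-bin-hadoop2.7.tgz zeppelin-0.7.2-bin-all.tgz; do
  echo "sudo mv $(extracted_dir "$archive") /usr/lib"
done
```

Printing the commands first makes a version mismatch between the archive and the directory name easy to spot before anything is moved.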

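Replace the configuration shipped with the Spark binary with your existing one. Skip this step if you do not have an existing Spark conf directory at /etc/spark/conf. The commands assume SPARK_HOME already points to /usr/lib/spark-2.2.2-bin-hadoop2.7.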

$ cd $SPARK_HOME

$ sudo mv conf conf.old

$ sudo ln -s /etc/spark/conf

$ cd $SPARK_HOME/conf

$ sudo ln -s /etc/hive/conf/hive-site.xml
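A dangling symbolic link is easy to create here if /etc/spark/conf or /etc/hive/conf/hive-site.xml does not exist on the machine. The following is a small sketch of a check, with link_ok as a hypothetical helper name:

```shell
# link_ok PATH: succeed only if PATH is a symbolic link whose target exists.
link_ok() {
  [ -L "$1" ] && [ -e "$1" ]
}

# Usage (assumes SPARK_HOME is set as in this guide):
#   link_ok "$SPARK_HOME/conf" && link_ok "$SPARK_HOME/conf/hive-site.xml" \
#     && echo "Spark conf links look fine"
```

The -L test confirms the path is a link, and -e follows the link, so it fails when the target is missing.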

Prerequisites: Java 8 and Python 3 must be installed.

Start the Zeppelin server and check its status. If the Zeppelin server is already running, use the restart command instead of start.

$ export JAVA_HOME=/usr/java/jdk1.8.0_112

$ export SPARK_HOME=/usr/lib/spark-2.2.2-bin-hadoop2.7

$ export ZEPPELIN_HOME=/usr/lib/zeppelin-0.7.2-bin-all

$ export PYSPARK_PYTHON=python3

$ export PYSPARK_DRIVER_PYTHON=ipython

$ cd $ZEPPELIN_HOME

$ bin/zeppelin-daemon.sh start

$ bin/zeppelin-daemon.sh status
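The exports above last only for the current shell session. Zeppelin's launcher scripts source conf/zeppelin-env.sh when that file exists, so one option is to keep the same settings there. A minimal sketch, assuming the paths used in this guide; the function name write_zeppelin_env is hypothetical, and the function overwrites any existing zeppelin-env.sh:

```shell
# write_zeppelin_env CONF_DIR: write the exports from this guide to
# CONF_DIR/zeppelin-env.sh, replacing any existing file of that name.
write_zeppelin_env() {
  cat > "$1/zeppelin-env.sh" <<'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_112
export SPARK_HOME=/usr/lib/spark-2.2.2-bin-hadoop2.7
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython
EOF
}

# Usage (needs write access to the Zeppelin conf directory):
#   write_zeppelin_env "$ZEPPELIN_HOME/conf"
#   "$ZEPPELIN_HOME/bin/zeppelin-daemon.sh" restart
```

Back up an existing zeppelin-env.sh before trying this, since the sketch replaces the file instead of merging into it.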

Set up R for Spark

The following instructions set up SparkR with Zeppelin.

These instructions were tested in the following environment:

  • OS: CentOS 6.7

  • Java 1.8

  • Apache Spark 2.0.2

  • Zeppelin 0.6.2

  • R 3.3.1

$ sudo yum install epel-release

$ sudo yum -y install R R-devel libcurl-devel openssl-devel

Create a file named install.r with the content below.

chooseCRANmirror(graphics=FALSE, ind=46)

install.packages("devtools")

install.packages("caTools")

install.packages("ggplot2")

install.packages("mplot")

install.packages("googleVis")

install.packages("glmnet")

install.packages("pROC")

install.packages("data.table")

install.packages("caret")

install.packages("sqldf")

install.packages("wordcloud")

install.packages("knitr")

library(devtools)

install_github('ramnathv/rCharts')

Run the file:

$ sudo Rscript install.r
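install.packages() reports a failed installation as a warning, so Rscript can exit successfully even when a package did not install. A hedged sketch of a follow-up check: r_check_expr is a hypothetical helper that only builds an R expression, and requireNamespace() is the base R function that does the actual test.

```shell
# r_check_expr PKG...: print an R expression that stops with an error
# if any of the named packages cannot be loaded.
r_check_expr() {
  pkgs=""
  for p in "$@"; do
    # Build a comma-separated list of quoted package names.
    pkgs="${pkgs:+$pkgs, }\"$p\""
  done
  printf 'pk <- c(%s); ok <- vapply(pk, requireNamespace, logical(1), quietly = TRUE); if (!all(ok)) stop("missing: ", paste(pk[!ok], collapse = ", "))\n' "$pkgs"
}

# Usage (requires R to be installed):
#   Rscript -e "$(r_check_expr devtools ggplot2 caret rCharts)"
```

Because stop() makes Rscript exit with a non-zero status, the check can be chained with && in a provisioning script.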

Many R developers are used to RStudio. You can install RStudio Server with yum as shown below.

$ wget https://download2.rstudio.org/rstudio-server-rhel-1.0.44-x86_64.rpm

$ sudo yum localinstall -y --nogpgcheck rstudio-server-rhel-1.0.44-x86_64.rpm

Start RStudio Server and check its status:

$ sudo rstudio-server start

$ sudo rstudio-server status

If RStudio Server is running, you can open RStudio in the browser on port 8787.

For example: http://localhost:8787/
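The server can take a few seconds to start answering after it is launched. The sketch below polls the URL from the command line, assuming curl is installed; wait_for_http is a hypothetical helper name.

```shell
# wait_for_http URL TRIES DELAY: poll URL with curl until it answers or
# TRIES attempts have failed, sleeping DELAY seconds between attempts.
wait_for_http() {
  url=$1
  tries=$2
  delay=$3
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s silences progress output, -f treats HTTP errors (>= 400) as failures
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_for_http http://localhost:8787/ 10 2 && echo "RStudio Server is up"
```

A redirect to the sign-in page still counts as success here, because only status codes of 400 and above make curl -f fail.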

If you want to run a machine learning algorithm using SparkR, take a look at this doc.