Setup Zeppelin

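Download the latest binary release of Spark from here and of Zeppelin from here. The steps below assume you saved both archives in the ~/Downloads directory; adjust the file names in the commands to match the versions you downloaded.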

$ cd ~/Downloads
$ tar xf spark-2.2.2-bin-hadoop2.7.tgz
$ tar xf zeppelin-0.7.2-bin-all.tgz

Move the extracted Spark and Zeppelin directories to /usr/lib.

$ sudo mv spark-2.2.2-bin-hadoop2.7 /usr/lib
$ sudo mv zeppelin-0.7.2-bin-all /usr/lib
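The directory that tar creates is named after the archive, so the version in the mv command has to match the archive you actually downloaded. A minimal sketch of deriving the directory name from the archive name, assuming the archives follow the usual name.tgz pattern; the helper name archive_dir is hypothetical and not part of Spark or Zeppelin:

```shell
#!/bin/sh
# Hypothetical helper: derive the extracted directory name from an archive name
# by stripping the .tgz suffix (assumes the archive unpacks to a directory of
# the same name, as the Spark and Zeppelin binary archives above do).
archive_dir() {
    basename "$1" .tgz
}

spark_dir=$(archive_dir spark-2.2.2-bin-hadoop2.7.tgz)
zeppelin_dir=$(archive_dir zeppelin-0.7.2-bin-all.tgz)

# Print the move commands instead of running them, so they can be reviewed first.
echo "sudo mv $spark_dir /usr/lib"
echo "sudo mv $zeppelin_dir /usr/lib"
```

Deriving both names from the archive file names keeps the tar and mv steps from drifting apart when you switch versions.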

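Replace the bundled Spark configuration with your existing one. Skip this step if you do not have an existing Spark conf directory. The commands assume SPARK_HOME points to /usr/lib/spark-2.2.2-bin-hadoop2.7; because that directory is owned by root, the changes need sudo.

$ cd $SPARK_HOME
$ sudo mv conf conf.old
$ sudo ln -s /etc/spark/conf

Then link the Hive configuration into the Spark conf directory.

$ cd $SPARK_HOME/conf
$ sudo ln -s /etc/hive/conf/hive-site.xml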
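The backup-and-link step can be made safe to re-run. The sketch below is one way to do that, not part of the original procedure: the function name link_conf is hypothetical, and the demo runs against throwaway directories rather than the real /usr/lib and /etc paths.

```shell
#!/bin/sh
# Hypothetical helper: back up $1/conf to $1/conf.old and replace it with a
# symlink to $2. Skips the swap when conf is already a symlink, so running it
# a second time is harmless.
link_conf() {
    spark_home=$1
    shared_conf=$2
    if [ -L "$spark_home/conf" ]; then
        echo "conf is already a symlink; nothing to do"
        return 0
    fi
    if [ ! -d "$shared_conf" ]; then
        echo "missing $shared_conf; skipping" >&2
        return 1
    fi
    mv "$spark_home/conf" "$spark_home/conf.old" &&
        ln -s "$shared_conf" "$spark_home/conf"
}

# Dry run in a temporary directory that stands in for SPARK_HOME and /etc/spark/conf.
demo=$(mktemp -d)
mkdir -p "$demo/spark/conf" "$demo/etc-spark-conf"
link_conf "$demo/spark" "$demo/etc-spark-conf"
ls -l "$demo/spark"
```

On a real installation the function would be called with the Spark home and /etc/spark/conf, from a shell that has permission to modify /usr/lib.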

Prerequisites: install Java 8 and Python 3. The PYSPARK_DRIVER_PYTHON setting below also requires IPython.

Start the Zeppelin server and check its status. If the server is already running, use the restart command instead of start.

export JAVA_HOME=/usr/java/jdk1.8.0_112
export SPARK_HOME=/usr/lib/spark-2.2.2-bin-hadoop2.7
export ZEPPELIN_HOME=/usr/lib/zeppelin-0.7.2-bin-all
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython
cd $ZEPPELIN_HOME
bin/zeppelin-daemon.sh start
bin/zeppelin-daemon.sh status
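A wrong path in one of the exported variables is a common reason for the daemon to fail at startup. The following is a minimal preflight sketch that checks the three home directories before starting Zeppelin; the function name check_dirs is hypothetical, and the check only tests that each directory exists, not that it contains a working installation.

```shell
#!/bin/sh
# Hypothetical preflight check: report which of the given directories are missing.
# Prints one line per missing directory and returns non-zero if any is missing.
check_dirs() {
    missing=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            missing=1
        fi
    done
    return $missing
}

# Validate the three home directories before calling zeppelin-daemon.sh.
# The placeholder defaults make an unexported variable show up as a missing path.
if check_dirs "${JAVA_HOME:-unset-JAVA_HOME}" \
              "${SPARK_HOME:-unset-SPARK_HOME}" \
              "${ZEPPELIN_HOME:-unset-ZEPPELIN_HOME}"; then
    echo "environment looks complete"
else
    echo "fix the paths above before starting Zeppelin"
fi
```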

Set up R for Spark

The following instructions set up SparkR with Zeppelin.

They were tested in the following environment:

  • OS: CentOS 6.7
  • Java 1.8
  • Apache Spark 2.0.2
  • Zeppelin 0.6.2
  • R 3.3.1

Install R and its build dependencies from the EPEL repository.

$ sudo yum -y install epel-release
$ sudo yum -y install R R-devel libcurl-devel openssl-devel

Create a file named install.r with the content below.

# Select a CRAN mirror without prompting. The index refers to a position in
# R's mirror list, which can change between R versions; adjust it if needed.
chooseCRANmirror(graphics=FALSE, ind=46)

# Packages available from CRAN
install.packages("devtools")
install.packages("caTools")
install.packages("ggplot2")
install.packages("mplot")
install.packages("googleVis")
install.packages("glmnet")
install.packages("pROC")
install.packages("data.table")
install.packages("caret")
install.packages("sqldf")
install.packages("wordcloud")
install.packages("knitr")

# rCharts is installed from GitHub using devtools
library(devtools)
install_github('ramnathv/rCharts')

Run the script:

$ sudo Rscript install.r

Many R developers are used to RStudio. You can install RStudio Server with yum as shown below.

$ wget https://download2.rstudio.org/rstudio-server-rhel-1.0.44-x86_64.rpm
$ sudo yum localinstall -y --nogpgcheck rstudio-server-rhel-1.0.44-x86_64.rpm

Start RStudio Server and check its status.

$ sudo rstudio-server start
$ sudo rstudio-server status

If RStudio Server is running, you can open RStudio in a browser on port 8787.

For example: http://localhost:8787/

If you want to run a machine learning algorithm using SparkR, take a look at this doc.