Setup Zeppelin
Download the latest binary releases of Apache Spark and Apache Zeppelin. The commands below assume the archives were saved in the ~/Downloads directory; adjust the file names to the versions you downloaded.
$ cd ~/Downloads
$ tar xf spark-2.2.2-bin-hadoop2.7.tgz
$ tar xf zeppelin-0.7.2-bin-all.tgz
Move the extracted Spark and Zeppelin directories to /usr/lib.
$ sudo mv spark-2.2.2-bin-hadoop2.7 /usr/lib
$ sudo mv zeppelin-0.7.2-bin-all /usr/lib
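The directory named in each mv command has to match what tar extracted, so a mistyped version makes the move fail. Below is a minimal sketch that derives the directory name from the archive name instead of typing it twice. The helper names archive_dir and install_archive are hypothetical, and the sketch assumes each archive unpacks to a directory named after the file, as these two do.

```shell
#!/bin/sh
# archive_dir (hypothetical helper): strip a trailing .tgz or .tar.gz
# to get the directory name the archive is assumed to unpack to.
archive_dir() {
    name=$(basename "$1")
    name=${name%.tgz}
    name=${name%.tar.gz}
    printf '%s\n' "$name"
}

# install_archive (hypothetical helper): extract an archive in the current
# directory, then move the resulting directory under /usr/lib.
install_archive() {
    tar xf "$1" && sudo mv "$(archive_dir "$1")" /usr/lib
}
```

For example, from ~/Downloads: install_archive zeppelin-0.7.2-bin-all.tgz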
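Replace the bundled Spark configuration with your existing one. Skip this step if you do not have an existing Spark configuration directory (/etc/spark/conf). SPARK_HOME must already be set to the Spark directory under /usr/lib, as in the exports further down.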
$ cd $SPARK_HOME
$ mv conf conf.old
$ ln -s /etc/spark/conf
$ cd $SPARK_HOME/conf
$ sudo ln -s /etc/hive/conf/hive-site.xml
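The backup-and-link part of this step can be made safe to re-run. The following is a sketch only: link_conf is a hypothetical helper, and it assumes the Spark directory is writable by the current user. It skips the step when there is no existing configuration directory and does nothing if conf is already a symlink.

```shell
#!/bin/sh
# link_conf (hypothetical helper): replace <spark_home>/conf with a symlink
# to an existing configuration directory, keeping the old one as conf.old.
link_conf() {
    spark_home=$1
    conf_src=$2
    # Nothing to link if there is no existing configuration directory.
    [ -d "$conf_src" ] || { echo "no $conf_src, skipping" >&2; return 0; }
    # Already linked on a previous run: leave it alone.
    [ -L "$spark_home/conf" ] && return 0
    # Keep the bundled configuration as a backup, if present.
    if [ -d "$spark_home/conf" ]; then
        mv "$spark_home/conf" "$spark_home/conf.old" || return 1
    fi
    ln -s "$conf_src" "$spark_home/conf"
}
```

Usage: link_conf "$SPARK_HOME" /etc/spark/conf. Linking hive-site.xml into /etc/spark/conf still needs sudo, as in the command above.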
Prerequisites: install Java 8 and Python 3, plus IPython, which the PYSPARK_DRIVER_PYTHON setting below uses.
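The prerequisites can be checked before the server is started. Below is a minimal sketch with a hypothetical helper, require_cmd. It only checks that the commands are on PATH, not their versions; java -version (which prints to stderr) should report 1.8 for Java 8.

```shell
#!/bin/sh
# require_cmd (hypothetical helper): succeed only if every named command
# is found on PATH; report each missing one on stderr.
require_cmd() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing prerequisite: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: require_cmd java python3 ipython && bin/zeppelin-daemon.sh start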
Start the Zeppelin server and check its status. If the server is already running, use the restart command instead of start.
export JAVA_HOME=/usr/java/jdk1.8.0_112
export SPARK_HOME=/usr/lib/spark-2.2.2-bin-hadoop2.7
export ZEPPELIN_HOME=/usr/lib/zeppelin-0.7.2-bin-all
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython
cd $ZEPPELIN_HOME
bin/zeppelin-daemon.sh start
bin/zeppelin-daemon.sh status
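Exports typed into a shell are lost when that shell closes. Zeppelin's startup scripts read conf/zeppelin-env.sh if it exists, so the settings can be kept there. The sketch below uses a hypothetical helper, write_zeppelin_env, and copies the paths from the exports above. It overwrites the target file, so merge by hand if you already have one.

```shell
#!/bin/sh
# write_zeppelin_env (hypothetical helper): write the exports used above
# into the given file, e.g. $ZEPPELIN_HOME/conf/zeppelin-env.sh.
# ZEPPELIN_HOME itself is left out: it is needed to locate this file.
write_zeppelin_env() {
    cat > "$1" <<'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_112
export SPARK_HOME=/usr/lib/spark-2.2.2-bin-hadoop2.7
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython
EOF
}
```

Usage: write_zeppelin_env "$ZEPPELIN_HOME/conf/zeppelin-env.sh", then bin/zeppelin-daemon.sh restart.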
Set up R for Spark
The following instructions set up SparkR with Zeppelin.
They were tested in the following environment:
OS: CentOS 6.7
Java 1.8
Apache Spark 2.0.2
Zeppelin 0.6.2
R 3.3.1
$ sudo yum install epel-release
$ sudo yum -y install R R-devel libcurl-devel openssl-devel
Create a file named install.r with the content below.
# Pick a CRAN mirror non-interactively; ind is an index into the mirror list of the installed R version.
chooseCRANmirror(graphics=FALSE, ind=46)
# Packages from CRAN.
install.packages("devtools")
install.packages("caTools")
install.packages("ggplot2")
install.packages("mplot")
install.packages("googleVis")
install.packages("glmnet")
install.packages("pROC")
install.packages("data.table")
install.packages("caret")
install.packages("sqldf")
install.packages("wordcloud")
install.packages("knitr")
# rCharts is installed from GitHub via devtools.
library(devtools)
install_github("ramnathv/rCharts")
Run the script with Rscript:
$ sudo Rscript install.r
Many R developers are used to RStudio. You can install RStudio Server with yum as shown below.
$ wget https://download2.rstudio.org/rstudio-server-rhel-1.0.44-x86_64.rpm
$ sudo yum localinstall -y --nogpgcheck rstudio-server-rhel-1.0.44-x86_64.rpm
Start RStudio Server and check its status.
$ sudo rstudio-server start
$ sudo rstudio-server status
If RStudio Server is running, you can open RStudio in a browser on port 8787.
For example: http://localhost:8787/
If you want to run a machine learning algorithm using SparkR, take a look at this doc.