Set Up a Spark Cluster in Standalone Mode

(In Progress)

Server Name(s)

Make configuration changes on the master and copy them to the slave machines so that the Spark configuration is consistent across the cluster. DevOps tools such as Puppet or Chef can also be used to manage cluster configuration.

/opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh

/opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-defaults.conf

/opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/log4j.properties

/etc/init.d/http-proxy.sh

Copying the configuration files from the master to the slaves:

Give full permissions to everyone so that the files can be modified from the other machines.

$ sudo chmod 777 /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh

$ sudo chmod 777 /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-defaults.conf

$ sudo chmod 777 /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/log4j.properties

$ scp /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh chilamku@hc4t01140.itcs.hpecorp.net:/tmp ==> scp prompts for the destination user's password

To avoid entering the password every time, you can set up passwordless SSH, which works via SSH keys. For that, the user must have the same login on each machine (see the key setup sketch after the scp commands below).

$ scp /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-defaults.conf chilamku@hc4t01140.itcs.hpecorp.net:/tmp

$ scp /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/log4j.properties chilamku@hc4t01140.itcs.hpecorp.net:/tmp

$ scp /etc/init.d/http-proxy.sh chilamku@hc4t01140.itcs.hpecorp.net:/tmp
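A minimal passwordless SSH setup, assuming the same user exists on every machine (generate a key once on the master, then copy it to each slave):

$ ssh-keygen -t rsa   # accept the defaults; leave the passphrase empty

$ ssh-copy-id chilamku@hc4t01140.itcs.hpecorp.net   # repeat for each slave machine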

Moving the configuration files from the temp folder to their target folders on the slave machines:

Example (run on each slave):

$ sudo mv /tmp/spark-env.sh /tmp/spark-defaults.conf /tmp/log4j.properties /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/

$ sudo mv /tmp/http-proxy.sh /etc/init.d/

Set http proxy for wget and curl

export http_proxy="<servername:port>"

export https_proxy="<servername:port>"

Example,

export http_proxy="web-proxy.sgp.hp.com:8080"

export https_proxy="web-proxy.sgp.hp.com:8080"

Create a file /etc/init.d/http-proxy.sh with the lines below. This will make the proxy setting permanent.

$ sudo vi /etc/init.d/http-proxy.sh

-----------Content of the file - do not include this line in the file ------

#!/bin/bash

export http_proxy="web-proxy.sgp.hp.com:8080"

export https_proxy="web-proxy.sgp.hp.com:8080"
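One way to have these variables picked up automatically (an assumption, not part of the original setup) is to make the script executable and source it from each user's shell profile:

$ sudo chmod +x /etc/init.d/http-proxy.sh

$ echo "source /etc/init.d/http-proxy.sh" >> ~/.bashrc   # assumes bash; adjust for your shell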

Create Download Directory

Create a download directory and keep all downloaded files in it. Since /home/$USER does not have enough available space, create the directory under /opt/mount1/ and add a symlink in the home directory.

$ sudo mkdir /opt/mount1/Downloads

$ sudo chown $USER -R /opt/mount1/Downloads

$ rm -rf ~/Downloads #delete ~/Downloads dir if it exists

$ ln -s /opt/mount1/Downloads ~/Downloads

Search for JDK versions

$ sudo yum search jdk

Example: install from yum repo

$ sudo yum install -y java-1.8.0-openjdk-devel.x86_64

Install Oracle JDK

The JDK installation directory is under /usr/lib/jvm; choose the 1.8 version.

$ ls -d /usr/lib/jvm/java-1.8*
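If Spark or other tools need JAVA_HOME set explicitly, you can export it in your shell profile or in conf/spark-env.sh. The path below is an assumption based on the OpenJDK 1.8 package installed above; adjust it to the directory you found:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk   # assumed path; verify with the ls above

export PATH=$JAVA_HOME/bin:$PATH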

Configure Spark

Download spark binary

Location: http://spark.apache.org/downloads.html

Choose version 1.6.2

Choose the package type Pre-built for Hadoop 2.6 (the downloaded tarball is spark-1.6.2-bin-hadoop2.6.tgz)

Get the URL and run the command

$ cd ~/Downloads

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz

Untar and uncompress the Spark archive

$ tar -zxf spark-1.6.2-bin-hadoop2.6.tgz

Check which mount point has 200 GB+ of free space

$ df -h

You can also check the available space for a specific mount point:

$ df -h /opt/mount1

Move spark directory to /opt/mount1

$ sudo mv spark-1.6.2-bin-hadoop2.6 /opt/mount1/

Update spark env variables

$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6

Update conf/spark-env.sh

$ sudo cp conf/spark-env.sh.template conf/spark-env.sh

$ sudo vi conf/spark-env.sh

--- Set the following variables -----

SPARK_MASTER_IP=<servername for spark master server>

SPARK_HOME=/opt/mount1/spark-1.6.2-bin-hadoop2.6

SPARK_DRIVER_MEMORY=4g

SPARK_WORKER_MEMORY=25g

SPARK_WORKER_CORES=8

SPARK_WORKER_DIR=/opt/mount1/spark-working-dir

SPARK_LOG_DIR=/opt/mount1/spark-logs

Create spark working directory and give full permission

$ sudo mkdir /opt/mount1/spark-working-dir

$ sudo chmod 777 /opt/mount1/spark-working-dir

Create spark log directory

$ sudo mkdir /opt/mount1/spark-logs

$ sudo chmod 777 /opt/mount1/spark-logs

Configure log output

$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6

$ cp conf/log4j.properties.template conf/log4j.properties

Use the following content as the log configuration (exclude this line):

log4j.rootCategory=INFO, console, RollingAppender

log4j.appender.console=org.apache.log4j.ConsoleAppender

log4j.appender.console.Threshold=WARN

log4j.appender.console.target=System.err

log4j.appender.console.layout=org.apache.log4j.PatternLayout

log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose

log4j.logger.org.spark-project.jetty=WARN

log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR

log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO

log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

log4j.logger.org.apache.parquet=ERROR

log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support

log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL

log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

log4j.appender.RollingAppender=org.apache.log4j.DailyRollingFileAppender

log4j.appender.RollingAppender.Threshold=INFO

log4j.appender.RollingAppender.File=/opt/mount1/spark-logs/spark-jobs.log

log4j.appender.RollingAppender.DatePattern='.'yyyy-MM-dd

log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout

log4j.appender.RollingAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n

Add the proxy settings to the Spark configuration so that it can download Spark packages

$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6

$ sudo cp conf/spark-defaults.conf.template conf/spark-defaults.conf

$ sudo vi conf/spark-defaults.conf

Add the following lines (exclude this line):

spark.master spark://<master server name>:7077

spark.eventLog.enabled true

spark.eventLog.dir file:///opt/mount1/spark-logs/event-log

spark.history.fs.logDirectory file:///opt/mount1/spark-logs/event-log

spark.driver.extraJavaOptions -Dhttp.proxyHost=hostname -Dhttp.proxyPort=port -Dhttps.proxyHost=host -Dhttps.proxyPort=port
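For example, using the proxy host shown earlier in this guide:

spark.driver.extraJavaOptions -Dhttp.proxyHost=web-proxy.sgp.hp.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=web-proxy.sgp.hp.com -Dhttps.proxyPort=8080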

Create spark event log directory

$ sudo mkdir /opt/mount1/spark-logs/event-log

$ sudo chmod 777 /opt/mount1/spark-logs/event-log

Start Spark Services

Note: start the master on ONE machine only, the one designated as SPARK_MASTER_IP in spark-env.sh. The slave process should be started on all slave machines. For a small cluster, you can also start the slave daemon on the machine designated as the master.

$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6

$ sbin/start-master.sh

Now you can open the Spark master web UI:

http://<masterserver name>:8080

There you will find the Spark master URL, which looks like:

spark://<servername>:7077

Follow the same steps on the other machines as well.

To start the slave daemon, you need the Spark master URL:

$ sbin/start-slave.sh spark://<servername of the spark master>:7077

Start the history server on the master server; this service should be started on only ONE machine.

$ sbin/start-history-server.sh

The Spark history server is accessible at http://<spark history server>:18080

Start the Thrift JDBC server on the master machine

$ sbin/start-thriftserver.sh

The JDBC service listens on port 10000; you can test it using any SQL client, such as Beeline or SQL Workbench:

$ /opt/mount1/spark-1.6.2-bin-hadoop2.6/bin/beeline -u jdbc:hive2://<server for thrift service>:10000/default
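Once connected, run a quick sanity check from the Beeline prompt (example statements; any simple query works):

> show databases;

> show tables;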

To check whether a service is active:

$ sudo jps -l

Another way to check whether the service is up:

$ sudo netstat -tulpn | grep 18080

Test Spark Cluster

$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6/

$ bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://<spark master>:7077 --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

To verify, open the history server:

http://<history server>:18080

You should be able to see the job and verify that it ran on all executors.

Configure Zeppelin

Download and untar zeppelin

$ wget http://archive.apache.org/dist/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0-bin-all.tgz

$ tar -zxf zeppelin-0.6.0-bin-all.tgz

Move zeppelin to /opt/mount1/

$ sudo mv zeppelin-0.6.0-bin-all /opt/mount1/

$ cd /opt/mount1/zeppelin-0.6.0-bin-all

Update zeppelin-env.sh (copy it from conf/zeppelin-env.sh.template first if the file does not exist)

$ sudo vi conf/zeppelin-env.sh

Add the following lines to the file

export ZEPPELIN_HOME=/opt/mount1/zeppelin-0.6.0-bin-all

export ZEPPELIN_LOG_DIR=/opt/mount1/zeppelin-log

export SPARK_HOME=/opt/mount1/spark-1.6.2-bin-hadoop2.6/

Create zeppelin log directory

$ sudo mkdir /opt/mount1/zeppelin-log

$ sudo chmod 777 /opt/mount1/zeppelin-log

Edit the Zeppelin site config to change the Zeppelin web port (the default is 8080, which conflicts with the Spark master web UI port)

$ sudo cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml

$ sudo vi conf/zeppelin-site.xml

Change zeppelin.server.port to 8070. In zeppelin.interpreters, move org.apache.zeppelin.spark.PySparkInterpreter to the first position so it becomes the default interpreter. A sketch of the port property is shown below.
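The port change looks roughly like this in conf/zeppelin-site.xml (a sketch; adjust the existing property rather than adding a duplicate):

<property>

<name>zeppelin.server.port</name>

<value>8070</value>

<description>Server port.</description>

</property>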

Then start zeppelin daemon

$ bin/zeppelin-daemon.sh start

Open http://<zeppelin server>:8070

JDBC connection

bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=web-proxy.corp.hp.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=web-proxy.corp.hp.com -Dhttps.proxyPort=8080" --packages com.databricks:spark-csv_2.10:1.5.0

Create a share folder and put all data copied via scp into it:

$ sudo mkdir /opt/mount1/share

$ sudo chmod 777 /opt/mount1/share

$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6/

$ bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=web-proxy.corp.hp.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=web-proxy.corp.hp.com -Dhttps.proxyPort=8080" --packages com.databricks:spark-csv_2.10:1.4.0,com.databricks:spark-xml_2.10:0.3.3,com.google.guava:guava:11.0 --jars /opt/mount1/share/sqljdbc4.jar --verbose

Configure MySQL (on ONE server only). Install the MySQL server:

$ sudo yum install -y mysql-server.x86_64

$ sudo service mysqld start

Set root password

$ sudo mysqld_safe --skip-grant-tables &

$ mysql -u root

mysql> use mysql; update user set password = PASSWORD('pass123') where user = 'root';

mysql> flush privileges;

More instructions here https://support.rackspace.com/how-to/mysql-resetting-a-lost-mysql-root-password/

$ sudo service mysqld restart

Create a user 'spark'@'%' and grant all privileges (a sketch is shown after the link below).

More instructions here http://blog.einext.com/apache-spark/working-with-mysql-from-spark-sql
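A minimal sketch of the user creation, assuming the 'spark' / 'SPARK12345' credentials that hive-site.xml below expects (tighten the grant scope as needed):

mysql> GRANT ALL PRIVILEGES ON *.* TO 'spark'@'%' IDENTIFIED BY 'SPARK12345';

mysql> FLUSH PRIVILEGES;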

Create a Hive metastore to keep Spark SQL persisted tables

Create a file /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/hive-site.xml with the following content:

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files -->

<!-- that are implied by Hadoop setup variables. -->

<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive -->

<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->

<!-- resource). -->

<!-- Hive Execution Parameters -->

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://<mysql server>:3306/metastore?createDatabaseIfNotExist=true</value>

<description>JDBC connect string for a JDBC metastore</description>

</property>

<property>

<name>javax.jdo.option.ConnectionDriverName</name>

<value>com.mysql.jdbc.Driver</value>

<description>Driver class name for a JDBC metastore</description>

</property>

<property>

<name>javax.jdo.option.ConnectionUserName</name>

<value>spark</value>

</property>

<property>

<name>javax.jdo.option.ConnectionPassword</name>

<value>SPARK12345</value>

</property>

<property>

<name>datanucleus.fixedDatastore</name>

<value>true</value>

</property>

<property>

<name>datanucleus.autoCreateSchema</name>

<value>false</value>

</property>

<property>

<name>hive.metastore.uris</name>

<value>thrift://<server name>:9083</value>

<description>IP address (or fully-qualified domain name) and port of the metastore host</description>

</property>

</configuration>

Create a Hive warehouse directory on the server that runs the Thrift server. You need this to persist table information in Spark SQL.

$ sudo mkdir -p /user/hive/warehouse/

$ sudo chmod 777 /user/hive/warehouse/

Add the slave names to the following file

/opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/slaves

hc4t01141.itcs.hpecorp.net

hc4t01140.itcs.hpecorp.net

hc4t01142.itcs.hpecorp.net
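With conf/slaves populated and passwordless SSH configured, the workers can also be started from the master in one step (an alternative to running start-slave.sh on each machine):

$ sbin/start-slaves.sh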

Download mysql connector jdbc

$ cd ~/Downloads

$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.39.tar.gz

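As a possible follow-up (an assumption; the original stops at the download), extract the archive and make the connector jar available to Spark, for example via --jars as done with sqljdbc4.jar above:

$ tar -zxf mysql-connector-java-5.1.39.tar.gz

$ cp mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar /opt/mount1/share/   # jar name assumed from the archive version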

Set up a shared directory across the cluster using NFS

Designate one of the machines to host the data files; that is where we are going to run the NFS server. Suppose hc4t01140 is your NFS file server.

On hc4t01140, install NFS applications

$ sudo yum install nfs-utils nfs-utils-lib

Start NFS services

$ sudo chkconfig nfs on

$ sudo service rpcbind start

$ sudo service nfs start

Create a folder and treat it as the data folder for your entire cluster. This is not efficient for a larger cluster because NFS becomes the I/O bottleneck.

$ sudo mkdir /data

$ sudo chmod 777 /data

Enter the directories that you want to share in the /etc/exports file

$ sudo vi /etc/exports

Add one entry per client machine that will mount the share (or use a subnet/wildcard). For example, for one of the other machines in this cluster:

/data hc4t01141.itcs.hpecorp.net(rw,sync,no_root_squash,no_subtree_check)

Apply the new entry to the exports file

$ sudo exportfs -a

On the client machines (run on all machines except the NFS server)

$ sudo yum install nfs-utils nfs-utils-lib

Create a mount point

$ sudo mkdir -p /mnt/nfs/data

$ sudo mount hc4t01140:/data /mnt/nfs/data

$ sudo chmod 777 /mnt/nfs/data

Verify mount points

$ sudo df -h

$ sudo mount

$ ls -l /mnt/nfs/data

Testing the access

$ touch /mnt/nfs/data/nfs.test

Enter the mount details in /etc/fstab for persistence - to be done on each machine, including the server.

$ sudo vi /etc/fstab

Add the following line

hc4t01140:/data /mnt/nfs/data nfs auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0 0
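To verify the fstab entry without rebooting, you can ask mount to process fstab and then check the mount point (a quick sanity check):

$ sudo mount -a

$ df -h /mnt/nfs/data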