Setup a Spark Cluster in Standalone Mode
(In Progress)
Server Name(s)
Make changes in the configuration on the master and copy them to the slave machines to make sure the configuration for spark is consistent across the cluster. There are devops tools puppet, chef to manager cluster configuration.
/opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh
/opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-defaults.conf
/opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/log4j.properties
/etc/init.d/http-proxy.sh
Copying the Configuration files from master to slaves :
Give full permission to everyone so that they can be modified from the other machines.
$ sudo chmod 777 /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh
$ sudo chmod 777 /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-defaults.conf
$ sudo chmod 777 /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/log4j.properties
$ scp /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh chilamku@hc4t01140.itcs.hpecorp.net:/tmp ==> It asks Source user password and Destination user password
To avoid password every time, you can set up passwordless SSH, which works via ssh key. For that the user has to have the same login for each of the machine.
$ scp /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/spark-defaults.conf chilamku@hc4t01140.itcs.hpecorp.net:/tmp
$ scp /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/log4j.properties chilamku@hc4t01140.itcs.hpecorp.net:/tmp
$ scp /etc/init.d/http-proxy.sh chilamku@hc4t01140.itcs.hpecorp.net:/tmp
Moving the Configuration files from the Temp folder to the Root Folder in the slave Machines :
Example :
mv <filesname> <filename> /<conf/>
Set http proxy for wget and curl
export http_proxy="<servername:port>"
export https_proxy="<servername:port>"
Example,
export http_proxy="web-proxy.sgp.hp.com:8080"
export https_proxy="web-proxy.sgp.hp.com:8080"
Create a file http-proxy.sh with the lines below under /etc/init.d/http-proxy.sh. This will make the proxy setting permanent.
$ sudo vi /etc/init.d/http-proxy.sh
-----------Content of the file - do not include this line in the file ------
#!/bin/bash
export http_proxy="web-proxy.sgp.hp.com:8080"
export https_proxy="web-proxy.sgp.hp.com:8080"
Create Download Directory
Create a download directory and keep all downloaded files in this directory. Since /home/$USER does not have enough available space, you can create the home dir in /opt/mount1/ and create a symlink in home
$ sudo mkdir /opt/mount1/Downloads
$ sudo chown $USER -R /opt/mount1/Downloads
$ rm -rf ~/Downloads #delete ~/Downloads dir if it exists
$ ln -s /opt/mount1/Downloads ~/Downloads
Search for JDK versions
$ sudo yum search jdk
Example: install from yum repo
$ sudo yum install -y java-1.8.0-openjdk-devel.x86_64
Install Oracle JDK
The installation directory of the JDK choose 1.8.
$ /usr/lib/jvm/java-1<press tab>
Configure Spark
Download spark binary
Location: http://spark.apache.org/downloads.html
Choose version 1.6.2
Choose package type as Pre-built for Hadoop 2.7.2 or later
Get the URL and run the command
$ cd ~/Downloads
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz
Untar and uncompress the spark
$ tar -zxf spark-1.6.2-bin-hadoop2.6.tgz
Check the directory which has free space of 200GB +
$ df -h
You can verify the available space for a directory mount point as well.
$ df -h /opt/mount1
Move spark directory to /opt/mount1
$ sudo mv spark-1.6.2-bin-hadoop2.6 /opt/mount1/
Update spark env variables
$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6
Update con/spark-env.sh
$ sudo cp conf/spark-env.sh.template conf/spark-env.sh
$ sudo vi conf/spark-env.sh
--- Set the following variables -----
SPARK_MASTER_IP=<servername for spark master server>
SPARK_HOME=/opt/mount1/spark-1.6.2-bin-hadoop2.6
SPARK_DRIVER_MEMORY=4g
SPARK_WORKER_MEMORY=25g
SPARK_WORKER_CORES=8
SPARK_WORKER_DIR=/opt/mount1/spark-working-dir
SPARK_LOG_DIR=/opt/mount1/spark-logs
Create spark working directory and give full permission
$ sudo mkdir /opt/mount1/spark-working-dir
$ sudo chmod 777 /opt/mount1/spark-working-dir
Create spark log directory
$ sudo mkdir /opt/mount1/spark-logs
$ sudo chmod 777 /opt/mount1/spark-logs
Configure log output
$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6
$ cp conf/log4j.properties.template conf/log4j.properties
Use the following content as log configuration ... exclude this line.
log4j.rootCategory=INFO, console, RollingAppender
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Threshold=WARN
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.appender.RollingAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.RollingAppender.Threshold=INFO
log4j.appender.RollingAppender.File=/opt/mount1/spark-logs/spark-jobs.log
log4j.appender.RollingAppender.DatePattern='.'yyyy-MM-dd
log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n
Add Proxy to Spark configuration so that it can down any spark packages
$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6
$ sudo cp conf/spark-defaults.conf.template conf/spark-defaults.conf
$ sudo vi conf/spark-defaults.conf
Add the following line ---- exclude this line
spark.master spark://<master server name>:7077
spark.eventLog.enabled true
spark.eventLog.dir file:///opt/mount1/spark-logs/event-log
spark.history.fs.logDirectory file:///opt/mount1/spark-logs/event-log
spark.driver.extraJavaOptions -Dhttp.proxyHost=hostname -Dhttp.proxyPort=port -Dhttps.proxyHost=host -Dhttps.proxyPort=port
Create spark event log directory
$ sudo mkdir /opt/mount1/spark-logs/event-log
$ sudo chmod 777 /opt/mount1/spark-logs/event-log
Start Spark Services
Note: start master on ONE machine only designated as SPARK_MASTER_IP in spark-env.sh. slave process should be started on all slave machine. For a small cluster you can start slave daemon on machine designated as master.
$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6
sbin/start-master.sh
Now you can launch the spark master URL
http://<masterserver name>:8080
You will get the spark url .. it looks like this
spark://<servername>:7077
Follow the steps in the other machine also
Start the slave daemon, you need to get the spark master URL.
$ sbin/start-slave.sh spark://<servername of the spark master>:7077
Start history server on the master server .. this service should be started only on ONE machine
$ sbin/start-history-server.sh
Spark history server is accessible as http://<spark history server>:18080
Start thrift jdbc server on master machine
$ sbin/start-thriftserver.sh
The jdbc service is started on port 10000, you can test it using any sql client app such (beeline or SQL Workbench)
$ /opt/mount1/spark-1.6.2-bin-hadoop2.6/bin/beeline -u jdbc:hive2://<server for thrift service>:10000/default
To check whether is a service is active,
$ sudo jps -l
Another way to check whether the service is up,
$ sudo netstat -tulpn | grep 18080
Test Spark Cluster
$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6/
$ bin/spark-submit --class org.apache.spark.examples.SparkPi --master <spark master>:7077 --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
To verify go open history server
http://<history server>:18080
You should be able to see the job and verify that the job has run on all executors
Configure Zeppelin
Download and untar zeppelin
$ wget http://archive.apache.org/dist/zeppelin/zeppelin-0.6.0/zeppelin-0.6.0-bin-all.tgz
$ tar -zxf zeppelin-0.6.0-bin-all.tgz
Move zeppelin to /opt/mount1/
$ sudo mv zeppelin-0.6.0-bin-all /opt/mount1/
$ cd /opt/mount1/zeppelin-0.6.0-bin-all
Update zeppelin-env.sh
$ sudo vi conf/zeppelin-env.sh
Add the following lines to the file
export ZEPPELIN_HOME=/opt/mount1/zeppelin-0.6.0-bin-all
export ZEPPELIN_LOG_DIR=/opt/mount1/zeppelin-log
export SPARK_HOME=/opt/mount1/spark-1.6.2-bin-hadoop2.6/
Create zeppelin log directory
$ sudo mkdir /opt/mount1/zeppelin-log
$ sudo chmod 777 /opt/mount1/zeppelin-log
Edit zeppelin site to change the zeppelin web port (default is 8080 which conflicts with spark master web ui port)
$ sudo cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml
$ sudo vi conf/zeppelin-site.xml
to change zeppelin.server.port to 8070
and zeppelin.interpreters ... move org.apache.zeppelin.spark.PySparkInterpreter as first option
Then start zeppelin daemon
bin/zeppelin-daemon.sh start
Open http://<zeppelin server>:8070
JDBC connection
bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=web-proxy.corp.hp.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=web-proxy.corp.hp.com -Dhttps.proxyPort=8080" --packages com.databricks:spark-csv_2.10:1.5.0
Put all data from scp to this share folder
$ sudo mkdir /opt/mount1/share
$ sudo chmod 777 /opt/mount1/share
$ cd /opt/mount1/spark-1.6.2-bin-hadoop2.6/
$ bin/spark-shell --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=web-proxy.corp.hp.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=web-proxy.corp.hp.com -Dhttps.proxyPort=8080" --packages com.databricks:spark-csv_2.10:1.4.0,com.databricks:spark-xml_2.10:0.3.3,com.google.guava:guava:11.0 --jars /opt/mount1/share/sqljdbc4.jar --verbose
Configure mysql - only ONE server only, install mysql server
sudo yum install -y mysql-server.x86_64
sudo service mysqld start
Set root password
sudo mysqld_safe --skip-grant-tables &
mysql> user mysql; update user set password= PASSWORD('pass123') where user = 'root';
mysql> flush privileges;
More instructions here https://support.rackspace.com/how-to/mysql-resetting-a-lost-mysql-root-password/
$ sudo service mysqld restart
Create a user 'spark'@''%' and grant all previleges
More instructions here http://blog.einext.com/apache-spark/working-with-mysql-from-spark-sql
Create hive metastore to keep spark sql persisted tables
Create a file /opt/mount1/spark-1.6.2-bin-hadoop2.6/conf/hive-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files -->
<!-- that are implied by Hadoop setup variables. -->
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive -->
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
<!-- resource). -->
<!-- Hive Execution Parameters -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://<mysql server>:3306/metastore?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>spark</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>SPARK12345</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<server name>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
</configuration>
Create a hive warehouse directory on the server that run the thrift server. You need this setting to persist tables information in Spark SQL.
$ sudo mkdir -p /user/hive/warehouse/
$ sudo chmod 777 /user/hive/warehouse/
Add the slave names to the following file
/opt/mount1/spark-1.6.2-bin-hadoop2.6/slaves
hc4t01141.itcs.hpecorp.net
hc4t01140.itcs.hpecorp.net
hc4t01142.itcs.hpecorp.net
Download mysql connector jdbc
$ cd ~/Downloads
$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.39.tar.gz
$
Setup a shared directory across cluster using NFS
Designate one of the machine to host the data files. That is where we are going to host a NFS server. Suppose, hc4t01140 is your NFS file server.
On hc4t01140, install NFS applications
$ sudo yum install nfs-utils nfs-utils-lib
Start NFS services
$ sudo chkconfig nfs on
$ sudo service rpcbind start
$ sudo service nfs start
Create a folder - consider this as data folder for your entire cluster. This not efficient for a larger cluster becasue NFS becomes your I/O bottleneck for your cluster.
$ sudo mkdir /data
$ sudo chmod 777 /data
Enter the directories that you want to share in /etc/exports file
$ sudo vi /etc/exports
Add the following line
/data hc4t01140(rw,sync,no_root_squash,no_subtree_check)
Apply the new entry to the exports file
$ sudo exportfs -a
On the client machines (run on all machines except the NFS server)
$ sudo yum install nfs-utils nfs-utils-lib
Create a mount point
$ sudo mkdir -p /mnt/nfs/data
$ sudo mount hc4t01140:/data /mnt/nfs/data
$ sudo chmod 777 /mnt/nfs/data
Verify mount points
$ sudo df -h
$ sudo mount
$ ls -l /mnt/nfs/data
Testing the access
$ touch /mnt/nfs/data/nfs.test
Enter the mount details in fstab for persistence - to be done on each machine including server
$ sudo vi /etc/fstab
Add the following line
hc4t01140:/data /mnt/nfs/data nfs auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0 0