Sbt build manager
The following instructions are for CentOS 6.7 Linux
Install sbt
Full installation instructions are available in the official sbt documentation.
Here is how to do it on RPM-based Linux distributions:
$ curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
$ sudo yum install -y sbt
Verify the version of sbt
$ sbt about
Run the sbt command. The first time, it will download the necessary Ivy libraries, which may take several minutes. When it finishes, exit the sbt shell.
$ sbt
sbt looks for sources in the following locations.
Sources in the base directory
Sources in src/main/scala or src/main/java
Tests in src/test/scala or src/test/java
Data files in src/main/resources or src/test/resources
jars in lib
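A minimal sketch of creating that standard layout from the project root (directory names follow the conventions listed above):

```shell
# Create the standard sbt source layout under the project root
mkdir -p src/main/scala src/main/java \
         src/test/scala src/test/java \
         src/main/resources src/test/resources \
         lib
```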
Create a project directory
$ mkdir -p ~/workspace/WordCount
$ cd ~/workspace/WordCount
Create the sbt build definition file and call it build.sbt.
$ vi build.sbt
name := "WordCount"
version := "1.0"
scalaVersion := "2.11.8"
resolvers += Resolver.mavenLocal
val sparkVersion = "2.1.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-mllib" % sparkVersion,
"org.apache.spark" %% "spark-graphx" % sparkVersion
)
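When the jar is submitted with spark-submit to a cluster that already ships Spark, a common variant (an assumption, not part of the original build) marks the Spark dependencies as "provided" so they are excluded from the packaged jar:

```scala
// Variant: mark Spark as "provided" so spark-submit supplies it at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)
```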
Create a directory structure to organize your code, and inside it create a WordCount.scala source file.
$ mkdir -p src/main/scala
$ vi src/main/scala/WordCount.scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length == 0) {
      System.err.println("Usage: WordCount <input path>")
      System.exit(2)
    }
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("My word count")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile(args(0))
    rdd
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
  }
}
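The transformation pipeline above can be sketched without Spark using plain Scala collections, which is handy for checking the logic before submitting a job (this sketch is an illustration, not part of the Spark application):

```scala
object WordCountSketch {
  val lines = Seq("to be or not", "to be")

  // flatMap -> map -> reduceByKey, expressed on ordinary Scala collections
  val counts: Map[String, Int] = lines
    .flatMap(_.split(" "))                                  // split into words
    .groupBy(identity)                                      // word -> occurrences
    .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit =
    counts.foreach(println)
}
```

groupBy plus map plays the role that reduceByKey plays on an RDD: both collapse all pairs with the same key into a single count.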
Compile and generate the package. The jar will be placed in the target directory under the project directory.
$ sbt package
For continuous compilation (recompile automatically whenever sources change):
$ sbt ~compile
Submit the jar to the cluster
$ $SPARK_HOME/bin/spark-submit --class WordCount target/scala-2.11/wordcount_2.11-1.0.jar file:/etc/passwd
Advanced Configuration
Add Logging Properties
$ mkdir -p src/main/resources
Create a src/main/resources/log4j.properties file with the following content:
log4j.rootLogger=INFO,console,file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Threshold=WARN
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=spark-app.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
Add Jars
If you want to add jars to your project, create a lib directory (at the same level as src) and place the jar files inside it. When you submit the job to Spark, make sure to supply the jars with the --jars argument, or create a fat jar. To create a fat jar you need an additional sbt plugin called sbt-assembly.
Add Eclipse Nature to Project
You need an extra sbt plugin called sbteclipse to create an Eclipse project. Once the plugin is configured, run the following command.
$ sbt eclipse
Generic Scala Code Execution
Note: If you want to run some Scala code, there are a few ways of doing it.
$ mkdir Sample
$ cd Sample
$ mkdir -p src/main/scala
$ echo 'object Sample { def main(args: Array[String]) = println("Hello World!") }' > src/main/scala/Sample.scala
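For readability, the one-liner above is equivalent to this expanded form (same object, just formatted):

```scala
// Expanded form of the one-line Sample object; behavior is identical
object Sample {
  def main(args: Array[String]): Unit =
    println("Hello World!")
}
```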
Create build.sbt with the following content
name := "Sample"
version := "1.0"
scalaVersion := "2.10.6"
View the directory structure
.
├── build.sbt
└── src
    └── main
        └── scala
            └── Sample.scala
Run the project
$ sbt run
There are a few more ways of executing the code. The options below require you to install Scala first. Follow the instructions below if you want to install Scala on CentOS.
Using scala
$ scala -cp target/scala-2.10/sample_2.10-1.0.jar Sample
Using java
$ java -cp /usr/lib/scala/lib/scala-library.jar:target/scala-2.10/sample_2.10-1.0.jar Sample
Install Scala on CentOS
$ wget http://www.scala-lang.org/files/archive/scala-2.10.6.tgz
$ tar xvf scala-2.10.6.tgz
$ sudo mv scala-2.10.6 /usr/lib
$ sudo ln -s /usr/lib/scala-2.10.6 /usr/lib/scala
$ export PATH=$PATH:/usr/lib/scala/bin
$ scala -version
Check out the available Scala versions at http://www.scala-lang.org/files/archive/
Install Eclipse plugin for sbt
Create a folder ~/.sbt/0.13/plugins/
$ mkdir -p ~/.sbt/0.13/plugins/
Create a file ~/.sbt/0.13/plugins/plugins.sbt with the following content
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.0.1")
Now you can add the Eclipse nature to an sbt-managed project. Once run, you can import the project into Eclipse.
$ sbt eclipse
Every time you make changes to the project's build.sbt, you have to rerun the above command.
How to make a fat/uber jar
Add a new file project/assembly.sbt with the following content
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.4")
Update build.sbt as below.
name := "WordCount"
version := "1.0"
scalaVersion := "2.11.8"
mainClass in Compile := Some("WordCount")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
val sparkVersion = "2.1.0"
val scope = "compile"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % scope,
"org.apache.spark" %% "spark-sql" % sparkVersion % scope
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
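With the settings above, assemblyJarName evaluates to WordCount-1.0.jar. The string interpolation can be sketched in plain Scala (an illustration of the naming, not sbt code):

```scala
object JarNameSketch {
  // Mimics assemblyJarName := s"${name.value}-${version.value}.jar"
  // using the name and version set in build.sbt
  val name = "WordCount"
  val version = "1.0"
  val jarName = s"$name-$version.jar"

  def main(args: Array[String]): Unit =
    println(jarName)  // WordCount-1.0.jar
}
```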
Now, in the console, run the following to create the fat jar.
$ sbt assembly