Sbt build manager

The following instructions are for CentOS 6.7 Linux.

Install sbt

Full instructions are available in the official sbt documentation.

Here is how to install it on RPM-based Linux distributions:

$ curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
$ sudo yum install -y sbt

Verify the version of sbt

$ sbt about

Run the sbt command. The first time, it will download the necessary Ivy libraries, which may take several minutes. When it finishes, exit the sbt shell.

$ sbt

By convention, sbt looks for sources in the following locations.

  • Sources in the base directory
  • Sources in src/main/scala or src/main/java
  • Tests in src/test/scala or src/test/java
  • Data files in src/main/resources or src/test/resources
  • Jars in lib

Create a project directory

$ mkdir -p ~/workspace/WordCount
$ cd ~/workspace/WordCount

Create the sbt build definition file, called build.sbt. The sbt documentation covers build.sbt in more detail.

$ vi build.sbt
name := "WordCount "
version := "1.0"
scalaVersion := "2.11.8"

resolvers += Resolver.mavenLocal

val sparkVersion = "2.1.1"
libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"      % sparkVersion,
    "org.apache.spark" %% "spark-streaming" % sparkVersion,
    "org.apache.spark" %% "spark-sql"       % sparkVersion,
    "org.apache.spark" %% "spark-mllib"     % sparkVersion,
    "org.apache.spark" %% "spark-graphx"    % sparkVersion    
)
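A note on the `%%` operator used above: it tells sbt to append the project's Scala binary version to the artifact name when resolving the dependency. The fragment below is a sketch for illustration, showing two equivalent declarations for a 2.11 build:

```scala
// With %%, sbt appends the Scala binary version (here 2.11) to the
// artifact name, so these two declarations resolve the same artifact:
libraryDependencies += "org.apache.spark" %% "spark-core"      % "2.1.1"
libraryDependencies += "org.apache.spark" %  "spark-core_2.11" % "2.1.1"
```

Using `%%` is preferred because the artifact name stays correct if you later change scalaVersion.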

Create a directory structure to organize your code, and inside it create a WordCount.scala source file.

$ mkdir -p src/main/scala
$ vi src/main/scala/WordCount.scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length == 0) {
      System.err.println("Usage: WordCount <input path>")
      System.exit(2)
    }
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("My word count")
    val sc = new SparkContext(conf)

    // Read the input file, split each line into words,
    // and count the occurrences of each word.
    val rdd = sc.textFile(args(0))
    rdd
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
  }
}
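The RDD pipeline above can be sketched with plain Scala collections, which is handy for sanity-checking the transformation without a Spark cluster. WordCountLocal and its sample input are made up for illustration:

```scala
// A minimal, Spark-free sketch of the same word-count logic using
// plain Scala collections (groupBy stands in for reduceByKey).
object WordCountLocal {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                // split each line into words
      .groupBy(identity)                    // group identical words together
      .map { case (w, ws) => (w, ws.size) } // count each group

  def main(args: Array[String]): Unit =
    countWords(Seq("hello world", "hello sbt")).foreach(println)
}
```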

Compile and generate the package. The jar will be placed in the target directory under the project directory.

$ sbt package

Continuous compilation (sbt recompiles on every source change).

$ sbt ~compile

Submit the jar to the cluster

$ $SPARK_HOME/bin/spark-submit --class WordCount target/scala-2.11/wordcount_2.11-1.0.jar file:/etc/passwd

Advanced Configuration

Add Logging Properties

$ mkdir -p src/main/resources

Create a src/main/resources/log4j.properties file with the following content

log4j.rootLogger=INFO,console,file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Threshold=WARN
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=spark-app.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

Add Jars

If you want to add jars to your project, create a lib directory [at the same level as src] and place the jar files inside it. When you submit the job to Spark, make sure to supply the jars with the --jars argument, or create a fat jar. To create a fat jar you need an additional sbt plugin called sbt-assembly.

Add Eclipse Nature to Project

You need an extra sbt plugin called sbteclipse to create an Eclipse project. Once the plugin is configured, run the following command.

$ sbt eclipse

Generic Scala Code Execution

Note: If you want to run some standalone Scala code, there are a few ways of doing it.

$ mkdir Sample
$ cd Sample
$ mkdir -p src/main/scala

$ echo 'object Sample { def main(args: Array[String]) = println("Hello World!") }' > src/main/scala/Sample.scala

Create build.sbt with the following content

name := "Sample"
version := "1.0"
scalaVersion := "2.10.6"

View the directory structure

.
├── build.sbt
└── src
    └── main
        └── scala
            └── Sample.scala

Run the project

$ sbt run

There are a few more ways of executing the code. The options below require Scala to be installed first; follow the instructions further down if you want to install Scala on CentOS.

Using scala

$ scala -cp target/scala-2.10/sample_2.10-1.0.jar Sample

Using java

$ java -cp /usr/lib/scala/lib/scala-library.jar:target/scala-2.10/sample_2.10-1.0.jar Sample

Install Scala on CentOS

$ wget http://www.scala-lang.org/files/archive/scala-2.10.6.tgz
$ tar xvf scala-2.10.6.tgz
$ sudo mv scala-2.10.6 /usr/lib
$ sudo ln -s /usr/lib/scala-2.10.6 /usr/lib/scala
$ export PATH=$PATH:/usr/lib/scala/bin
$ scala -version

Check out the available Scala versions at http://www.scala-lang.org/files/archive/

Install Eclipse plugin for sbt

Create a folder ~/.sbt/0.13/plugins/

$ mkdir -p ~/.sbt/0.13/plugins/

Create a file ~/.sbt/0.13/plugins/plugins.sbt with the following content

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.0.1")

Now you can add the Eclipse nature to any sbt-managed project. Once run, you can import the project into Eclipse.

$ sbt eclipse

Every time you make changes to the project's build.sbt, you have to run the above command again.

How to make a fat/uber jar

Add a new file project/assembly.sbt with the following content

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.4")

Update build.sbt as below.

name := "WordCount"
version := "1.0"
scalaVersion := "2.11.8"
mainClass in Compile := Some("WordCount")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
val sparkVersion = "2.1.0"
val scope = "compile"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % scope,
  "org.apache.spark" %% "spark-sql" % sparkVersion % scope
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
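One common variation, not required by the configuration above but worth knowing: when the fat jar will be submitted to a cluster that already ships Spark, the Spark dependencies are usually given the "provided" scope so sbt-assembly excludes them from the jar. A sketch of the changed dependency block:

```scala
// Marking Spark dependencies "provided" keeps them out of the fat jar,
// since spark-submit puts Spark's own classes on the classpath anyway.
val sparkVersion = "2.1.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)
```

This keeps the assembly small, but the jar will then only run under spark-submit, not as a standalone java -jar.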

Now, in the console, type the following to create the fat jar.

$ sbt assembly