Create Spark Project in Eclipse

Goal: Create a Scala project for Spark, using Maven for library management.

1. Install Scala IDE for Eclipse from Eclipse Marketplace

2. Switch to Scala perspective.

Window > Perspective > Open Perspective > Other ... > Scala > Click OK

3. Create a new Scala project using the menu bar option File > New > Scala Project. Give the project a name of your choice.

4. Right click on the project and configure it as a Maven project. There is no need to change any configuration, though the group ID usually takes a form like com.mycompany.projectname.

5. Open the generated pom.xml and add the following properties and dependencies:

<properties>
    <sparkVersion>2.1.1</sparkVersion>
    <scalaVersion>2.11</scalaVersion>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scalaVersion}</artifactId>
        <version>${sparkVersion}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scalaVersion}</artifactId>
        <version>${sparkVersion}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scalaVersion}</artifactId>
        <version>${sparkVersion}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_${scalaVersion}</artifactId>
        <version>${sparkVersion}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-graphx_${scalaVersion}</artifactId>
        <version>${sparkVersion}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <!-- For Spark 2.x, the Kafka integration module is spark-streaming-kafka-0-10 -->
        <artifactId>spark-streaming-kafka-0-10_${scalaVersion}</artifactId>
        <version>${sparkVersion}</version>
    </dependency>
</dependencies>
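Note that Maven does not compile Scala sources on its own, so a build section along the following lines is typically also needed before Maven Install will produce usable classes. This is a sketch: the plugin version shown and the source directory are assumptions; adjust them to your setup (an Eclipse Scala project may keep sources directly under "src").

```xml
<build>
    <!-- Point Maven at the Scala sources; adjust if your project uses a different layout -->
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```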

6. After saving pom.xml, right click on the project and select Maven > Update Project so the dependencies are downloaded.

7. Now add a new Scala object file "WordCount" to the project.


import org.apache.spark.{SparkContext, SparkConf}

object WordCount {
  def main(args: Array[String]): Unit = {
    val path = "/etc/passwd"
    val conf = new SparkConf()
    conf.setAppName("sample demo")
    conf.setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile(path).
      flatMap(_.split("\\W+")).
      filter(_.length > 0).
      map((_, 1)).
      reduceByKey(_ + _).
      collect.
      foreach(println)
  }
}

8. Select the file and run it locally from the menu bar: Run > Run. This step creates a run configuration. The run might fail complaining that the input path does not exist; that is fine.

9. In the code you might notice that the input path is hard coded as "/etc/passwd". If you want to pass the input and output paths as command line arguments instead, you have to make a few code changes. Then right click on WordCount.scala > Run As > Run Configurations... > select the Arguments tab > in Program Arguments, type "input output". This tells the program to treat the "input" directory under the working directory as the input path and "output" as the directory in which to save the results. Now create an "input" directory and put a test file inside it, then refresh the project. If a folder called "output" already exists, delete it, and delete it again before every rerun from Eclipse.
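One way the code change might look, as a sketch: both paths are read from the program arguments, and the collect/foreach is replaced with saveAsTextFile so the results land in the output directory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Read input and output paths from the program arguments, e.g. "input output"
    val Array(inputPath, outputPath) = args.take(2)
    val conf = new SparkConf().setAppName("sample demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile(inputPath).
      flatMap(_.split("\\W+")).
      filter(_.length > 0).
      map((_, 1)).
      reduceByKey(_ + _).
      saveAsTextFile(outputPath) // fails if the output directory already exists
    sc.stop()
  }
}
```

saveAsTextFile is why the "output" folder must be deleted before every rerun: Spark refuses to overwrite an existing output directory.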

10. Now, right click on the project and select Run As > Maven Install. This creates a jar file and places it under the "target" folder. You have to refresh the project to see the jar file.

11. Take this jar and submit it using the spark-submit command.

$ $SPARK_HOME/bin/spark-submit --class WordCount --master local[*] --verbose Sample-0.0.1-SNAPSHOT.jar

Note: if you submit the job on YARN, create the "/etc/passwd" path in HDFS and put a sample file there before running.
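A sketch of what the YARN preparation and submission might look like; the HDFS paths simply mirror the hard coded one, and the deploy mode is an assumption:

```
# Put a sample input file into HDFS at the path the program expects
$ hdfs dfs -mkdir -p /etc
$ hdfs dfs -put /etc/passwd /etc/passwd

# Submit to YARN instead of local mode
$ $SPARK_HOME/bin/spark-submit --class WordCount --master yarn --deploy-mode client Sample-0.0.1-SNAPSHOT.jar
```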

Additional Configuration for Logging

Create a directory named "resources" inside the project directory and create a log4j.properties file inside it with the following content:

log4j.rootLogger=INFO,console,file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Threshold=WARN
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.Threshold=INFO
log4j.appender.file.File=spark-app.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
 

If you want to control the logging threshold for your application classes, add the following lines to log4j.properties. In the example below, com.einext is the package name for your classes.

log4j.logger.com.einext=INFO, console
log4j.additivity.com.einext=false 
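For the package-level threshold to have any effect, your classes need to log through Log4j. A minimal sketch (the package name com.einext and object name are just the example from above):

```scala
package com.einext

import org.apache.log4j.Logger

object Example {
  // The logger name becomes "com.einext.Example", so the
  // log4j.logger.com.einext setting above applies to it
  private val log = Logger.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    log.info("application started")  // goes to spark-app.log (file threshold is INFO)
    log.warn("something noteworthy") // also shown on the console (threshold WARN)
  }
}
```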

Right click on the resources folder > Build Path > Use as Source Folder.

Now, if you run the above WordCount.scala program again, you will see fewer debug messages. You can control the output through the "file" and "console" log handlers.

Here is a sample project: https://github.com/abulbasar/spark-scala-examples