Hadoop MR Project Using Maven

Go to project workspace. Choose the directory as per your choice. 
$ cd ~/workspace

Setup maven

$ wget http://www-eu.apache.org/dist/maven/maven-3/3.5.3/binaries/apache-maven-3.5.3-bin.tar.gz
$ tar -xf apache-maven-3.5.3-bin.tar.gz
$ export PATH=~/apache-maven-3.5.3/bin/:$PATH

Note: you can put the last line, in ~/.bashrc file so that the configuration persist.

Create a maven project using quickstart archetype.

$ mvn archetype:generate -DgroupId=com.example -DartifactId=HadoopMRExamples -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

$ cd HadoopMRExamples

Note, the package name is "com.example" for the java classes.

Update the maven pom.xml with the repository details and dependencies.

$ vi pom.xml
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>HadoopMRExamples </artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>HadoopMRExamples</name>
    <url>http://maven.apache.org</url>
    <repositories>
        <repository>
            <id>hadoop-releases</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0-cdh5.7.0</version>
        </dependency>
    </dependencies>
</project>

Note: Link to the repo based on the distribution of hadoop you are going to run against. Update the repository URL in the pom.xml.

Find the target version of hadoop in your system. Update the version of the hadoop-client dependency in the pom.xml as you see.

$ hadoop version
Hadoop 2.7.0-mapr-1703
Subversion git@github.com:mapr/private-hadoop-common.git -r f5076e15a823858a16376c5ffaee5fe6941eff49
Compiled by root on 2017-04-03T18:29Z
Compiled with protoc 2.5.0
From source with checksum 368fca39fd60b4f626792b2fa3927dd3
This command was run using /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1703.jar

Add WordCount.java source file.

$ mkdir -p src/main/java/com/example

$ vi src/main/java/com/example/WordCount.java

package com.example;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Create a local directory called input and put a sample file there.

$ mkdir input
$ echo "Hello world" >> input/sample.text
$ echo "Goodmorning world" >> input/sample.text

If you want to run locally,

$ mvn clean compile exec:java -Dexec.mainClass=com.example.WordCount -Dexec.args="input output" -X

After the run the finished, you should see a new directory created called output.

$ ls output/
part-r-00000 _SUCCESS
$ cat output/part-r-00000
goodmorning 1 
hello 1
world 2

When you are ready to test against the cluster, compile and package the project

$ mvn clean package

Make sure users have permission to write intermediate files in yarn staging directory.

$ sudo -u hdfs hadoop fs -chmod 777 /tmp/hadoop-yarn/staging

Deploy the jar using hadoop command. Create a directory called "input" inside your HDFS home directory and create a sample text file to run.

$ hadoop jar target/HadoopMRExamples-1.0-SNAPSHOT.jar com.example.WordCount input output

Project in the github

https://github.com/abulbasar/HadoopMRExamples