Hadoop MR Project Using Maven

Go to project workspace. Choose the directory as per your choice.

$ cd ~/workspace

Setup maven

$ wget http://www-eu.apache.org/dist/maven/maven-3/3.5.3/binaries/apache-maven-3.5.3-bin.tar.gz

$ tar -xf apache-maven-3.5.3-bin.tar.gz

$ export PATH=~/apache-maven-3.5.3/bin/:$PATH

Note: you can put the last line, in ~/.bashrc file so that the configuration persist.

Create a maven project using quickstart archetype.

$ mvn archetype:generate -DgroupId=com.example -DartifactId=HadoopMRExamples -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

$ cd HadoopMRExamples

Note, the package name is "com.example" for the java classes.

Update the maven pom.xml with the repository details and dependencies.

$ vi pom.xml

<?xml version="1.0"?>

<groupId>com.example</groupId>

<artifactId>HadoopMRExamples </artifactId>

<version>1.0-SNAPSHOT</version>

<name>HadoopMRExamples</name>

<url>http://maven.apache.org</url>

<id>hadoop-releases</id>

<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>

</releases>

<enabled>false</enabled>

</snapshots>

</repository>

</repositories>

<groupId>junit</groupId>

<artifactId>junit</artifactId>

</dependency>

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-client</artifactId>

</dependency>

</dependencies>

</project>

Note: Link to the repo based on the distribution of hadoop you are going to run against. Update the repository URL in the pom.xml.

Hortonworks: http://repo.hortonworks.com/content/repositories/releases
Cloudera: https://repository.cloudera.com/artifactory/cloudera-repos
MapR: http://repository.mapr.com/nexus/content/groups/mapr-public

Find the target version of hadoop in your system. Update the version of the hadoop-client dependency in the pom.xml as you see.

$ hadoop version

Hadoop 2.7.0-mapr-1703

Subversion git@github.com:mapr/private-hadoop-common.git -r f5076e15a823858a16376c5ffaee5fe6941eff49

Compiled by root on 2017-04-03T18:29Z

Compiled with protoc 2.5.0

From source with checksum 368fca39fd60b4f626792b2fa3927dd3

This command was run using /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1703.jar

Add WordCount.java source file.

$ mkdir -p src/main/java/com/example

$ vi src/main/java/com/example/WordCount.java

package com.example;

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

public static class TokenizerMapper

extends Mapper<Object, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(Object key, Text value, Context context

) throws IOException, InterruptedException {

StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable> {

private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,

Context context

) throws IOException, InterruptedException {

int sum = 0;

for (IntWritable val : values) {

sum += val.get();

}

result.set(sum);

context.write(key, result);

}

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

Job job = Job.getInstance(conf, "word count");

job.setJarByClass(WordCount.class);

job.setMapperClass(TokenizerMapper.class);

job.setCombinerClass(IntSumReducer.class);

job.setReducerClass(IntSumReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

System.exit(job.waitForCompletion(true) ? 0 : 1);

}

Create a local directory called input and put a sample file there.

$ mkdir input

$ echo "Hello world" >> input/sample.text

$ echo "Goodmorning world" >> input/sample.text

If you want to run locally,

$ mvn clean compile exec:java -Dexec.mainClass=com.example.WordCount -Dexec.args="input output" -X

After the run the finished, you should see a new directory created called output.

$ ls output/

part-r-00000 _SUCCESS

$ cat output/part-r-00000

goodmorning 1

hello 1

world 2

When you are ready to test against the cluster, compile and package the project

$ mvn clean package

Make sure users have permission to write intermediate files in yarn staging directory.

$ sudo -u hdfs hadoop fs -chmod 777 /tmp/hadoop-yarn/staging

Deploy the jar using hadoop command. Create a directory called "input" inside your HDFS home directory and create a sample text file to run.

$ hadoop jar target/HadoopMRExamples-1.0-SNAPSHOT.jar com.example.WordCount input output

Project in the github

https://github.com/abulbasar/HadoopMRExamples