Hadoop MR Project Using Maven
Go to project workspace. Choose the directory as per your choice.
$ cd ~/workspace
Setup maven
$ wget http://www-eu.apache.org/dist/maven/maven-3/3.5.3/binaries/apache-maven-3.5.3-bin.tar.gz
$ tar -xf apache-maven-3.5.3-bin.tar.gz
$ export PATH=~/apache-maven-3.5.3/bin/:$PATH
Note: you can put the last line, in ~/.bashrc file so that the configuration persist.
Create a maven project using quickstart archetype.
$ mvn archetype:generate -DgroupId=com.example -DartifactId=HadoopMRExamples -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
$ cd HadoopMRExamples
Note, the package name is "com.example" for the java classes.
Update the maven pom.xml with the repository details and dependencies.
$ vi pom.xml
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>HadoopMRExamples </artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>HadoopMRExamples</name>
<url>http://maven.apache.org</url>
<repositories>
<repository>
<id>hadoop-releases</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0-cdh5.7.0</version>
</dependency>
</dependencies>
</project>
Note: Link to the repo based on the distribution of hadoop you are going to run against. Update the repository URL in the pom.xml.
Hortonworks: http://repo.hortonworks.com/content/repositories/releases
Cloudera: https://repository.cloudera.com/artifactory/cloudera-repos
MapR: http://repository.mapr.com/nexus/content/groups/mapr-public
Find the target version of hadoop in your system. Update the version of the hadoop-client dependency in the pom.xml as you see.
$ hadoop version
Hadoop 2.7.0-mapr-1703
Subversion git@github.com:mapr/private-hadoop-common.git -r f5076e15a823858a16376c5ffaee5fe6941eff49
Compiled by root on 2017-04-03T18:29Z
Compiled with protoc 2.5.0
From source with checksum 368fca39fd60b4f626792b2fa3927dd3
This command was run using /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1703.jar
Add WordCount.java source file.
$ mkdir -p src/main/java/com/example
$ vi src/main/java/com/example/WordCount.java
package com.example;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Create a local directory called input and put a sample file there.
$ mkdir input
$ echo "Hello world" >> input/sample.text
$ echo "Goodmorning world" >> input/sample.text
If you want to run locally,
$ mvn clean compile exec:java -Dexec.mainClass=com.example.WordCount -Dexec.args="input output" -X
After the run the finished, you should see a new directory created called output.
$ ls output/
part-r-00000 _SUCCESS
$ cat output/part-r-00000
goodmorning 1
hello 1
world 2
When you are ready to test against the cluster, compile and package the project
$ mvn clean package
Make sure users have permission to write intermediate files in yarn staging directory.
$ sudo -u hdfs hadoop fs -chmod 777 /tmp/hadoop-yarn/staging
Deploy the jar using hadoop command. Create a directory called "input" inside your HDFS home directory and create a sample text file to run.
$ hadoop jar target/HadoopMRExamples-1.0-SNAPSHOT.jar com.example.WordCount input output
Project in the github
https://github.com/abulbasar/HadoopMRExamples