Submit MapReduce Job

Hadoop ships with a number of example MapReduce programs that can be submitted to the cluster using the hadoop jar command. Running the examples jar without arguments prints the list of available programs:

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.

aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.

bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.

dbcount: An example job that counts pageviews from a database.

distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.

grep: A map/reduce program that counts the matches of a regex in the input.

join: A job that effects a join over sorted, equally partitioned datasets.

multifilewc: A job that counts words from several files.

pentomino: A map/reduce tile laying program to find solutions to pentomino problems.

pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.

randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.

randomwriter: A map/reduce program that writes 10GB of random data per node.

secondarysort: An example defining a secondary sort to the reduce.

sort: A map/reduce program that sorts the data written by the random writer.

sudoku: A sudoku solver.

teragen: Generates data for the terasort.

terasort: Runs the terasort.

teravalidate: Checks the results of the terasort.

wordcount: A map/reduce program that counts the words in the input files.

wordmean: A map/reduce program that counts the average length of the words in the input files.

wordmedian: A map/reduce program that counts the median length of the words in the input files.

wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

Suppose you want to run the pi example, which does not require any input or output paths in HDFS:

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 2
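
The first argument is the number of map tasks and the second is the number of samples per map; more samples give a better estimate of Pi at the cost of a longer run. For example, with hypothetical values:

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 1000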

List all YARN applications, both finished and currently running:

$ yarn application -appStates ALL -list
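
The -appStates flag accepts a comma-separated list of states. For example, to list only the applications that are currently running:

$ yarn application -list -appStates RUNNING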

Find the application ID (it starts with application_, for example application_1472665646474_0001) and use it to view the logs of all containers. This assumes that log aggregation is enabled (yarn.log-aggregation-enable=true).

$ yarn logs -applicationId <application id>
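
Aggregated logs can be long, so it is often convenient to redirect them to a local file. Depending on your Hadoop version, you may also be able to narrow the output to a single container with -containerId (check yarn logs -help on your version; older releases also require -nodeAddress):

$ yarn logs -applicationId <application id> > app.log
$ yarn logs -applicationId <application id> -containerId <container id>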

Specify the logging level during job submission. You can also set these properties cluster-wide in mapred-site.xml.

$ HADOOP_ROOT_LOGGER=INFO,console hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi -Dmapreduce.map.log.level=WARN -Dmapreduce.reduce.log.level=INFO 2 2

Note:

  • Set HADOOP_ROOT_LOGGER to a lower (more verbose) threshold than the map and reduce logging levels. Otherwise, HADOOP_ROOT_LOGGER will suppress messages below its own threshold.

  • You can also use the JobHistory Web UI to find logs for individual tasks. Follow the steps below.

    1. Open http://localhost:19888/jobhistory

    2. Click on your Job Id hyperlink

    3. In the navigation bar on the left-hand side, click on Map Tasks (or Reduce Tasks, depending on your need)

    4. Click on the Name of the task that you are interested in. Most likely you will want to look at tasks with FAILED status

    5. Click on the logs hyperlink to view the logs. These logs will be consistent with the logging level you specified in the command.

Specify the number of reducers:

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduces=3 /user/cloudera/sample /user/cloudera/out
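
Each reducer writes one output file, so with three reducers the output directory should contain part-r-00000 through part-r-00002 (assuming the default output format):

$ hdfs dfs -ls /user/cloudera/out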

Submit a job with output compression:

$ hadoop jar <.jar file> <class name> \
    -Dmapreduce.output.fileoutputformat.compress=true \
    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    -Dmapreduce.map.output.compress=true \
    -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    <input path> <output path>
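
As a concrete sketch, assuming the wordcount example and the sample input path used earlier, with a hypothetical output path /user/cloudera/out_gz (the output path must not already exist):

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
    -Dmapreduce.output.fileoutputformat.compress=true \
    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    /user/cloudera/sample /user/cloudera/out_gz

With GzipCodec and the default text output format, the reducer output files carry a .gz extension (for example, part-r-00000.gz).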