MapReduce Algorithm

Business Problems that Big Data is solving today

  • Extracting attributes from unstructured data such as images to help consumers find the right match during searches, e.g. "find red t-shirt"

  • How should marketing expenditure be split across channels such as web, iOS, Android, Facebook, and Twitter?

  • How much inventory will be obsolete next month in each location?

  • What products are on backorder?

  • How will this large order affect current inventory levels?

  • Detect fraudulent supplier websites

Some common use cases of MapReduce

  • Query log processing

  • Crawling, indexing, and search

  • Analytics, text processing, and sentiment analysis

  • Machine learning (such as Markov chains and the Naive Bayes classifier)

  • Recommendation systems

  • Document clustering and classification

  • Bioinformatics (alignment, re-calibration, germline ingestion, and DNA/RNA sequencing)

  • Genome analysis (biomarker analysis and regression algorithms such as linear and Cox regression)

When MapReduce is suitable for computation

  • When you have to handle lots of input data (e.g., aggregate or compute statistics over large amounts of data).

  • When you need to take advantage of parallel and distributed computing, data storage, and data locality.

  • When you can do many tasks independently without synchronization.

  • When you can take advantage of sorting and shuffling.

  • When you need fault tolerance and you cannot afford job failures.
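
The points above about independent tasks and about sorting and shuffling can be illustrated with a minimal in-memory sketch of a word count. This is not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and only mimic the three stages on a single machine.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped on its own; no mapper needs another mapper's state,
    # which is what lets a real cluster run the map tasks in parallel.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle(mapped)]
```

In a real framework the shuffle step is performed by the platform between the map and reduce tasks; here it is an ordinary function so the data flow is visible.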

Here are scenarios where MapReduce should not be used:

  • If the computation of a value depends on previously computed values. One good example is the Fibonacci series, where each value is the sum of the previous two values:

F(k + 2) = F(k + 1) + F(k)

  • If the data set is small enough to be computed on a single machine. It is better to do this as a single reduce(map(data)) operation rather than going through the entire MapReduce process.

  • If synchronization is required to access shared data.

  • If all of your input data fits in memory.

  • If one operation depends on other operations.

  • If basic computations are processor-intensive.
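
Two of the cases above can be sketched in a few lines. The first function shows why the Fibonacci series resists parallel splitting: every step needs the two results before it. The second shows the single reduce(map(data)) form suggested for small data sets. The function names are hypothetical, and the seed values F(0) = 0 and F(1) = 1 are an assumption, since the text gives only the recurrence.

```python
from functools import reduce


def fibonacci(n):
    """Return the first n Fibonacci values, assuming F(0) = 0 and F(1) = 1."""
    values = []
    a, b = 0, 1
    for _ in range(n):
        values.append(a)
        # The next value depends on the two previous ones, so the loop
        # iterations cannot be handed to independent workers.
        a, b = b, a + b
    return values


def sum_of_squares(data):
    """Process a small data set with one in-process map and one reduce."""
    squared = map(lambda x: x * x, data)               # map step
    return reduce(lambda acc, x: acc + x, squared, 0)  # reduce step
```

For data that fits on one machine, a call such as sum_of_squares(data) avoids the job scheduling and shuffle overhead of a full MapReduce run.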

Basic MapReduce Patterns

  • Distributed Task Execution

  • Counting and summing, as in log analysis and data querying

  • Collating, as in building an inverted index and ETL

  • Filtering, parsing, and validating, as in log analysis, data querying, ETL, and data validation

  • Sorting
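
As an illustration of the collating pattern, the following is a minimal single-machine sketch of an inverted index. The names map_document and build_inverted_index are hypothetical, and the grouping dictionary stands in for the shuffle that a real framework would perform between map and reduce.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def build_inverted_index(documents):
    """Collate the emitted pairs into a term -> sorted list of document ids."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)  # stands in for the shuffle
    # Reduce step: turn each term's collected ids into a sorted posting list.
    return {term: sorted(ids) for term, ids in grouped.items()}
```

The counting and summing pattern has the same shape; only the reduce step changes, from collecting the values to adding them up.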

Relational MapReduce Patterns

  • Selection or filtering

  • Projection

  • Union

  • Intersection

  • Difference

  • Group-By and Aggregation

  • Join - map join and hash join
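
The map join (hash join) named in the last item can be sketched as follows, assuming one data set is small enough to hold in memory. The small side is loaded into a hash table and each record of the large side is joined during a single map pass, so no reduce phase is needed. The record layouts and the name map_side_join are hypothetical.

```python
def map_side_join(orders, customers):
    """Inner-join (order_id, customer_id, amount) records to customer names."""
    # Build the in-memory hash table from the small data set.
    customer_by_id = {customer_id: name for customer_id, name in customers}

    joined = []
    for order_id, customer_id, amount in orders:
        name = customer_by_id.get(customer_id)
        if name is not None:  # inner join: drop orders with no matching customer
            joined.append((order_id, name, amount))
    return joined
```

In a distributed job every mapper would hold its own copy of the small table. When neither side fits in memory, the join has to be done on the reduce side instead, which this sketch does not cover.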


  • The book Data Algorithms: Recipes for Scaling Up with Hadoop and Spark highlights the following: