MapReduce Algorithm

Business problems that big data is solving today

  • Extracting attributes from unstructured data such as images, so that consumers find the right match during searches (e.g., "find red t-shirt")
  • How to split marketing expenditure across channels such as web, iOS, Android, Facebook, and Twitter
  • How much inventory will be obsolete next month in each location?
  • What products are on backorder?
  • How will this large order affect current inventory levels?
  • Detect fraudulent supplier websites

Some common use cases of MapReduce

  • Query log processing
  • Crawling, indexing, and search
  • Analytics, text processing, and sentiment analysis
  • Machine learning (such as Markov chains and the Naive Bayes classifier)
  • Recommendation systems
  • Document clustering and classification
  • Bioinformatics (alignment, re-calibration, germline ingestion, and DNA/RNA sequencing)
  • Genome analysis (biomarker analysis and regression algorithms such as linear and Cox regression)

When MapReduce is suitable for computation

  • When you have to handle large volumes of input data (e.g., to aggregate or compute statistics over it).
  • When you need to take advantage of parallel and distributed computing, data storage, and data locality.
  • When you can do many tasks independently without synchronization.
  • When you can take advantage of sorting and shuffling (see the word-count sketch after this list).
  • When you need fault tolerance and you cannot afford job failures.
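
When these conditions hold, the classic word-count job is the textbook illustration: every map call runs independently, and the framework's sort-and-shuffle step groups values by key before reduction. Below is a minimal, single-process Python sketch of the three phases; the function names (map_phase, shuffle_phase, reduce_phase) are illustrative only and not part of any framework API.

    from itertools import groupby
    from operator import itemgetter

    def map_phase(documents):
        # Map: emit (word, 1) for every word; each document is processed
        # independently, so this step parallelizes trivially.
        for doc in documents:
            for word in doc.split():
                yield (word.lower(), 1)

    def shuffle_phase(pairs):
        # Shuffle: sort by key and group, so all counts for one word are
        # adjacent (in a real cluster the framework performs this step).
        return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

    def reduce_phase(grouped):
        # Reduce: sum the counts for each word.
        for word, group in grouped:
            yield (word, sum(count for _, count in group))

    docs = ["big data is big", "map reduce handles big data"]
    print(dict(reduce_phase(shuffle_phase(map_phase(docs)))))
    # {'big': 3, 'data': 2, 'handles': 1, 'is': 1, 'map': 1, 'reduce': 1}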

Conversely, here are scenarios where MapReduce should not be used:

  • If the computation of a value depends on previously computed values. One good example is the Fibonacci series, where each value is the sum of the previous two (a short sketch of this dependency follows the list):

F(k + 2) = F(k + 1) + F(k)

  • If the data set is small enough to be computed on a single machine. It is better to do this as a single reduce(map(data)) operation rather than going through the entire MapReduce process.
  • If synchronization is required to access shared data.
  • If all of your input data fits in memory.
  • If one operation depends on other operations.
  • If basic computations are processor-intensive.
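
To make the first point concrete, a Fibonacci-style computation has a loop-carried dependency: every new value needs the value produced in the immediately preceding step, so the work cannot be split into independent map tasks. A tiny Python sketch of that dependency:

    def fibonacci(n):
        # Each iteration depends on the result of the previous one,
        # so the loop cannot be broken into independent map tasks.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    print([fibonacci(k) for k in range(10)])   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]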

Basic MapReduce Patterns

  • Distributed Task Execution
  • Counting and Summing, as in log analysis and data querying
  • Collating, as in building an inverted index or in ETL (see the sketch after this list)
  • Filtering, Parsing, and Validating, as in log analysis, data querying, ETL, and data validation
  • Sorting
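
As an illustration of the collating pattern, an inverted index can be built with a mapper that emits (term, doc_id) pairs and a reducer that gathers the posting list for each term. The grouping dictionary below stands in for the shuffle phase, and all names and sample documents are illustrative rather than part of any framework.

    from collections import defaultdict

    def map_doc(doc_id, text):
        # Map: emit (term, doc_id) for every term in the document.
        for term in text.lower().split():
            yield (term, doc_id)

    def reduce_term(term, doc_ids):
        # Reduce (collate): gather the sorted, de-duplicated posting list.
        return (term, sorted(set(doc_ids)))

    docs = {1: "red t-shirt cotton", 2: "blue t-shirt", 3: "red dress"}

    grouped = defaultdict(list)            # stands in for sort-and-shuffle
    for doc_id, text in docs.items():
        for term, d in map_doc(doc_id, text):
            grouped[term].append(d)

    index = dict(reduce_term(term, ids) for term, ids in grouped.items())
    print(index["t-shirt"])                # [1, 2]
    print(index["red"])                    # [1, 3]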

Relational MapReduce Patterns

  • Selection or filtering
  • Projection
  • Union
  • Intersection
  • Difference
  • Group-By and Aggregation
  • Join - map join and hash join (a map-side hash join sketch follows this list)
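
Of these, the join pattern benefits most from a concrete sketch. In a map-side (hash) join, the smaller relation is loaded into an in-memory hash table and made available to every mapper, so each record of the larger relation is joined during the map phase with no shuffle of the smaller table. The relation names and fields below are made up for illustration.

    # Small relation: broadcast to every mapper as an in-memory hash table.
    customers = {"c1": "Alice", "c2": "Bob"}             # cust_id -> name

    # Large relation: streamed through the mappers record by record.
    orders = [("o1", "c1", 250), ("o2", "c2", 75), ("o3", "c9", 10)]

    def map_side_join(order, customer_lookup):
        # Map: join each order against the broadcast hash table.
        order_id, cust_id, amount = order
        name = customer_lookup.get(cust_id)              # hash lookup
        if name is not None:                             # inner join: drop misses
            yield (order_id, name, amount)

    joined = [row for order in orders for row in map_side_join(order, customers)]
    print(joined)   # [('o1', 'Alice', 250), ('o2', 'Bob', 75)]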

Reference:

  • Mahmoud Parsian's Data Algorithms: Recipes for Scaling Up with Hadoop and Spark (O'Reilly) highlights the following: