# Map Reduce Algorithm

Business problems that Big Data is solving today

• Extracting attributes from unstructured data such as images, so that consumer searches like "find red t-shirt" return the right match

• How should marketing expenditure be split across channels such as web, iOS, Android, Facebook, and Twitter?

• How much inventory will be obsolete next month in each location?

• What products are on backorder?

• How will this large order affect current inventory levels?

• Detecting fraudulent supplier websites

Some common use cases of MapReduce

• Query log processing

• Crawling, indexing, and search

• Analytics, text processing, and sentiment analysis

• Machine learning (such as Markov chains and the Naive Bayes classifier)

• Recommendation systems

• Document clustering and classification

• Bioinformatics (alignment, re-calibration, germline ingestion, and DNA/RNA sequencing)

• Genome analysis (biomarker analysis, and regression algorithms such as linear and Cox regression)

When MapReduce is suitable for computation

• When you have to handle lots of input data (e.g., aggregate or compute statistics over large amounts of data).

• When you need to take advantage of parallel and distributed computing, data storage, and data locality.

• When you can do many tasks independently without synchronization.

• When you can take advantage of sorting and shuffling.

• When you need fault tolerance and you cannot afford job failures.
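
To make the criteria above concrete, here is a minimal single-process sketch of the map, shuffle/sort, and reduce steps applied to a word count. It only imitates what a framework such as Hadoop does across many machines; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for illustration, not part of any library.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line.

    Each line is processed independently, so lines could be mapped in parallel.
    """
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key and return the groups sorted by key.

    This stands in for the sort-and-shuffle step a real framework performs.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return [reduce_phase(key, values) for key, values in shuffle(mapped)]


# Hypothetical sample input
lines = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(lines)
print(counts)
```

Because `map_phase` never looks at another line and `reduce_phase` never looks at another key, both steps need no synchronization, which is the property that lets a cluster run them in parallel.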

Scenarios where MapReduce should not be used:

• If the computation of a value depends on previously computed values. One good example is the Fibonacci series, where each value is the sum of the previous two values:

F(k + 2) = F(k + 1) + F(k)

• If the data set is small enough to be computed on a single machine. It is better to do this as a single reduce(map(data)) operation rather than going through the entire MapReduce process.

• If synchronization is required to access shared data.

• If all of your input data fits in memory.

• If one operation depends on other operations.

• If basic computations are processor-intensive.
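
Two of the cases above are easy to illustrate. The sketch below first computes the Fibonacci series, where every step needs the two preceding results and so cannot be split into independent tasks, and then shows the plain `reduce(map(data))` form that is sufficient for a small data set on one machine. The function name `fibonacci` and the sample data are assumptions chosen for illustration.

```python
from functools import reduce


def fibonacci(n):
    """Return the first n Fibonacci numbers, starting from F(0) = 0, F(1) = 1."""
    values = [0, 1]
    for k in range(n - 2):
        # F(k + 2) = F(k + 1) + F(k): this step cannot start before
        # the two previous steps have finished.
        values.append(values[k + 1] + values[k])
    return values[:n]


fib = fibonacci(8)

# Small data set: a single in-process reduce(map(data)) is enough.
data = [3, 1, 4, 1, 5, 9]
sum_of_squares = reduce(lambda acc, x: acc + x, map(lambda x: x * x, data), 0)

print(fib)
print(sum_of_squares)
```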

Basic MapReduce Patterns

• Counting and summing, as in log analysis and data querying

• Collating, as in building an inverted index and ETL

• Filtering, parsing, and validating, as in log analysis, data querying, ETL, and data validation

• Sorting
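
As a rough sketch of the patterns above, the following in-memory example expresses collating (an inverted index) and filtering plus counting (server errors in a log) with the same mapper/reducer shape. The helper `run_mapreduce`, the mapper and reducer names, and the sample documents and log lines are all hypothetical; a real job would run on a framework rather than a single dictionary.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Tiny single-process stand-in for a MapReduce job (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are processed in sorted order, mimicking the sort step.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Collating: build an inverted index (term -> documents containing it).
def index_mapper(doc):
    doc_id, text = doc
    for term in set(text.split()):
        yield term, doc_id


def index_reducer(term, doc_ids):
    return sorted(doc_ids)


# Filtering and counting: keep only 5xx responses, then count per status.
def error_mapper(line):
    status, _path = line.split()
    if status.startswith("5"):  # the filter lives in the mapper
        yield status, 1


def sum_reducer(status, ones):
    return sum(ones)


docs = [("d1", "big data map reduce"), ("d2", "map join"), ("d3", "big join")]
logs = ["200 /home", "500 /cart", "404 /missing", "500 /pay"]

inverted_index = run_mapreduce(docs, index_mapper, index_reducer)
server_errors = run_mapreduce(logs, error_mapper, sum_reducer)
print(inverted_index["big"], server_errors)
```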

Relational MapReduce Patterns

• Selection or filtering

• Projection

• Union

• Intersection

• Difference

• Group-By and Aggregation

• Join - map join and hash join
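
A few of the relational patterns above can be sketched in plain Python. The example below assumes a small `users` table that fits in memory, which is the situation where a map (hash) join applies: each mapper looks records up in a local dictionary instead of shuffling both tables. It then shows group-by with aggregation and a set intersection. The table contents and the function names are illustrative assumptions only.

```python
from collections import defaultdict

# Hypothetical input tables
orders = [(1, "u1", 30.0), (2, "u2", 12.5), (3, "u1", 7.5)]  # (order_id, user_id, amount)
users = [("u1", "Ana"), ("u2", "Bo"), ("u3", "Cy")]          # (user_id, name)

# Map (hash) join: the small table is loaded into a dict available to every mapper.
user_names = dict(users)


def join_mapper(order):
    _order_id, user_id, amount = order
    if user_id in user_names:  # selection: drop orders with unknown users
        yield user_names[user_id], amount  # projection: keep name and amount


def group_and_sum(pairs):
    """Group-by and aggregation: sum the values emitted for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def intersection(left, right):
    """Keep records that appear in both inputs, tagging each by its source."""
    seen = defaultdict(set)
    for tag, records in (("L", left), ("R", right)):
        for record in records:
            seen[record].add(tag)
    return sorted(record for record, tags in seen.items() if tags == {"L", "R"})


spend_by_name = group_and_sum(
    pair for order in orders for pair in join_mapper(order)
)
active_buyers = intersection(
    [user_id for _, user_id, _ in orders],
    [user_id for user_id, _ in users],
)
print(spend_by_name, active_buyers)
```

When neither table fits in memory, the usual alternative is to emit the join key from both tables and let the shuffle bring matching records to the same reducer, at the cost of moving more data.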

Reference:

• The book "Data Algorithms: Recipes for Scaling Up with Hadoop and Spark" highlights the following: