MapReduce Algorithm
Business Problems that Big Data is solving today
Extracting attributes from unstructured data such as images to help consumers find the right match during searches, e.g. "find red t-shirt"
How to split marketing expenditure across channels such as web, iOS, Android, Facebook, and Twitter
How much inventory will be obsolete next month in each location?
What products are on backorder?
How will this large order affect current inventory levels?
Detect fraudulent supplier websites
Some common use cases of MapReduce
Query log processing
Crawling, indexing, and search
Analytics, text processing, and sentiment analysis
Machine learning (such as Markov chains and the Naive Bayes classifier)
Recommendation systems
Document clustering and classification
Bioinformatics (alignment, re-calibration, germline ingestion, and DNA/RNA sequencing)
Genome analysis (biomarker analysis, and regression algorithms such as linear and Cox)
When MapReduce is suitable for computation
When you have to handle lots of input data (e.g., aggregate or compute statistics over large amounts of data).
When you need to take advantage of parallel and distributed computing, data storage, and data locality.
When you can do many tasks independently without synchronization.
When you can take advantage of sorting and shuffling.
When you need fault tolerance and you cannot afford job failures.
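The criteria above (independent tasks, plus a sort and shuffle step between them) can be illustrated with a minimal single-process sketch. This is a simulation under stated assumptions, not Hadoop code: `run_mapreduce`, `word_mapper`, and `sum_reducer` are hypothetical names chosen for illustration, and a real framework would distribute the map and reduce calls across machines.

```python
from collections import defaultdict
from itertools import chain


def run_mapreduce(records, mapper, reducer):
    """Single-process simulation of the map, shuffle/sort, and reduce phases."""
    # Map: each record is handled independently, so this step needs no synchronization.
    mapped = chain.from_iterable(mapper(record) for record in records)
    # Shuffle: bring together all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Sort + reduce: visit keys in sorted order, one reducer call per key.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def word_mapper(line):
    # Emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    # Aggregate all counts emitted for one word.
    return sum(values)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_mapreduce(lines, word_mapper, sum_reducer)
print(word_counts)
```

Because the mapper only looks at one line and the reducer only looks at one key, both phases could in principle be split across many workers without changing the result.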
Conversely, MapReduce should not be used in the following scenarios:
If the computation of a value depends on previously computed values. One good example is the Fibonacci series, where each value is the sum of the previous two values:
F(k + 2) = F(k + 1) + F(k)
If the data set is small enough to be computed on a single machine. It is better to do this as a single reduce(map(data)) operation rather than going through the entire MapReduce process.
If synchronization is required to access shared data.
If all of your input data fits in memory.
If one operation depends on other operations.
If basic computations are processor-intensive.
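Two of the scenarios above can be sketched in a few lines of plain Python. The first function shows why the Fibonacci recurrence resists parallel mapping: each step needs the two results before it. The second part shows the single reduce(map(data)) form for a data set small enough for one machine. The names `fibonacci`, `words`, and `total_chars` are illustrative assumptions.

```python
from functools import reduce
from operator import add


def fibonacci(n):
    """Return the first n Fibonacci numbers, starting from F(0) = 0."""
    values = []
    a, b = 0, 1
    for _ in range(n):
        values.append(a)
        # The next value depends on the two previous ones, so the loop
        # iterations cannot be handed to independent mappers.
        a, b = b, a + b
    return values


fib = fibonacci(10)
print(fib)

# Small data: a single reduce(map(data)) call on one machine is enough.
words = ["map", "reduce", "shuffle"]
total_chars = reduce(add, map(len, words), 0)
print(total_chars)
```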
Basic MapReduce Patterns
Distributed Task Execution
Counting and summing, as in log analysis and data querying
Collating, as in building an inverted index and ETL
Filtering, parsing, and validating, as in log analysis, data querying, ETL, and data validation
Sorting
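Several of these basic patterns can be sketched in a single process as follows. The log format ("status path bytes"), the sample documents, and the helper names `shuffle` and `parse_and_validate` are assumptions made for illustration; a real job would run the same logic inside framework mappers and reducers.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


# Hypothetical log lines in the form "status path bytes".
log_lines = [
    "200 /index.html 512",
    "404 /missing 0",
    "200 /about.html 256",
    "bad line",
]


def parse_and_validate(line):
    # Filtering, parsing, validating: malformed lines emit nothing.
    parts = line.split()
    if len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit():
        yield int(parts[0]), int(parts[2])


# Counting and summing: total bytes per status code.
status_pairs = [pair for line in log_lines for pair in parse_and_validate(line)]
bytes_by_status = {status: sum(sizes) for status, sizes in shuffle(status_pairs).items()}
print(bytes_by_status)

# Collating: an inverted index from term to the documents containing it.
docs = {"d1": "map reduce", "d2": "reduce sort", "d3": "map sort shuffle"}
term_pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]
inverted_index = {term: sorted(set(ids)) for term, ids in shuffle(term_pairs).items()}
print(inverted_index)
```

Sorting comes for free here: the shuffle step returns keys in sorted order, which mirrors how the framework's sort phase is commonly used to order output by key.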
Relational MapReduce Patterns
Selection or filtering
Projection
Union
Intersection
Difference
Group-By and Aggregation
Join - map join and hash join
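The relational patterns above can be sketched on small in-memory tables. The tables, field names, and the `shuffle` helper are hypothetical. The join shown is one reading of "map join": a map-side join that looks up each order in an in-memory hash table built from the smaller table, which assumes that table fits in memory on every mapper.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 30},
    {"order_id": 2, "customer_id": "c2", "amount": 15},
    {"order_id": 3, "customer_id": "c1", "amount": 55},
]
customers = [
    {"customer_id": "c1", "name": "Ada"},
    {"customer_id": "c2", "name": "Lin"},
]

# Selection and projection: map-only, no reducer needed.
large = [(o["order_id"], o["amount"]) for o in orders if o["amount"] >= 30]

# Group-by and aggregation: key by customer, sum in the reducer.
totals = {
    cid: sum(amounts)
    for cid, amounts in shuffle((o["customer_id"], o["amount"]) for o in orders).items()
}

# Union, intersection, difference: key by record, tag each with its source set.
set_r, set_s = ["a", "b", "c"], ["b", "c", "d"]
grouped = shuffle([(x, "R") for x in set_r] + [(x, "S") for x in set_s])
union = list(grouped)
intersection = [k for k, tags in grouped.items() if set(tags) == {"R", "S"}]
difference = [k for k, tags in grouped.items() if set(tags) == {"R"}]  # R minus S

# Map-side hash join: the small table becomes a dict shared by every mapper.
name_by_id = {c["customer_id"]: c["name"] for c in customers}
map_joined = [
    (o["order_id"], name_by_id[o["customer_id"]], o["amount"])
    for o in orders
    if o["customer_id"] in name_by_id
]
print(large, totals, union, intersection, difference, map_joined)
```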
Reference:
The book Data Algorithms: Recipes for Scaling Up with Hadoop and Spark highlights the following: