MapReduce Design Patterns

This article is featured in the new DZone Guide to Big Data Processing, Volume III. Get your free copy for more insightful articles, industry statistics and more.

This article discusses four main MapReduce design patterns:

1. Input-Map-Reduce-Output
2. Input-Map-Output
3. Input-Multiple Maps-Reduce-Output
4. Input-Map-Combiner-Reduce-Output

Here are some real-world scenarios to help you understand when to use which design pattern.

Input-Map-Reduce-Output

This pattern is used when we want to perform an aggregation operation:

[Image: Input-Map-Reduce-Output flow]

[Image: sample employee input records]

To count the total salary by gender, we set Gender as the key and Salary as the value. The output of the Map function is:

[Image: Map output as (gender, salary) key-value pairs]

The intermediate shuffle and sort groups these pairs by key, giving the input for the Reduce function:

[Image: grouped (gender, list of salaries) pairs]

And the output of the Reduce function is:

[Image: Reduce output with the total salary per gender]
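
To make the walkthrough concrete, here is a minimal Hadoop (Java) sketch of this pattern. The comma-separated record layout (name, gender, salary) is an assumption made for illustration; it is not taken from the article's figures.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (gender, salary) for every record.
// Assumed record layout: name,gender,salary
class SalaryByGenderMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        context.write(new Text(fields[1].trim()),
                      new LongWritable(Long.parseLong(fields[2].trim())));
    }
}

// Reduce: sum every salary that arrives under the same gender key.
class SalaryByGenderReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text gender, Iterable<LongWritable> salaries,
                          Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable salary : salaries) {
            total += salary.get();
        }
        context.write(gender, new LongWritable(total));
    }
}
```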

Input-Map-Output

[Image: Input-Map-Output flow]

The Reduce function is mainly used for aggregation and computation. However, if we only want to change the format of the data, then the Input-Map-Output pattern is used:

[Image: input records before the format change]

[Image: the same records after the format change]
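
As a sketch of this map-only pattern, the hypothetical job below rewrites comma-separated records as tab-separated ones. Setting the number of reducers to zero is what turns it into Input-Map-Output.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map: change only the record format (CSV to TSV); nothing is aggregated.
class CsvToTsvMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(NullWritable.get(),
                      new Text(line.toString().replace(",", "\t")));
    }
}

public class FormatChangeDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv-to-tsv");
        job.setJarByClass(FormatChangeDriver.class);
        job.setMapperClass(CsvToTsvMapper.class);
        job.setNumReduceTasks(0); // zero reducers: map output goes straight to HDFS
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```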

Input-Multiple Maps-Reduce-Output

In the Input-Multiple Maps-Reduce-Output design pattern, our input comes from two files, each with a different schema. (Note that if two or more files share the same schema, there is no need for two mappers; we can write the logic once in a single mapper class and provide multiple input files.)

[Image: Input-Multiple Maps-Reduce-Output flow]

This pattern is also used for reduce-side joins:

[Image: reduce-side join of the two inputs]
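
Below is a sketch of a reduce-side join under assumed schemas: an employee file (empId, name, deptId) and a department file (deptId, deptName). The file layouts and class names are hypothetical. Each mapper emits the join key (the department id) with a tagged value, and the reducer matches records by tag.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Mapper for the employee file (assumed layout: empId,name,deptId).
class EmployeeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] f = line.toString().split(",");
        context.write(new Text(f[2].trim()), new Text("EMP|" + f[1].trim()));
    }
}

// Mapper for the department file (assumed layout: deptId,deptName).
class DepartmentMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] f = line.toString().split(",");
        context.write(new Text(f[0].trim()), new Text("DEPT|" + f[1].trim()));
    }
}

// Reducer: for each deptId, pair every employee with the department name.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text deptId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String deptName = null;
        List<String> employees = new ArrayList<>();
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("DEPT|")) {
                deptName = s.substring(5);   // strip "DEPT|"
            } else {
                employees.add(s.substring(4)); // strip "EMP|"
            }
        }
        for (String name : employees) {
            context.write(new Text(name), new Text(deptName));
        }
    }
}

public class ReduceSideJoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side-join");
        job.setJarByClass(ReduceSideJoinDriver.class);
        // One mapper per input schema: the heart of this pattern.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, EmployeeMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, DepartmentMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```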

Input-Map-Combiner-Reduce-Output

[Image: Input-Map-Combiner-Reduce-Output flow]


A Combiner, also known as a semi-reducer, is an optional class that accepts input from the Map class and passes its output key-value pairs on to the Reducer class. The purpose of the Combiner is to reduce the workload of the Reducer.

In a MapReduce program, 20% of the work is done in the Map phase, also known as the data preparation phase, which runs in parallel.

The other 80% of the work is done in the Reduce phase, known as the computation phase. It offers far less parallelism than the Map phase, so it is slower. To cut the computation time, some of the Reduce phase's work can be done in a Combiner phase.
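
As a sketch, the driver below reuses the salary-by-gender classes from the earlier example (so the same assumptions carry over). Because summation is associative and commutative, the same Reducer class can also serve as the Combiner; it pre-aggregates each mapper's output locally, so far less data is shuffled to the reducers.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "salary-sum-with-combiner");
        job.setJarByClass(CombinerDriver.class);
        job.setMapperClass(SalaryByGenderMapper.class);
        // The combiner runs on each mapper's local output before the shuffle,
        // so partial sums (not raw records) cross the network.
        job.setCombinerClass(SalaryByGenderReducer.class);
        job.setReducerClass(SalaryByGenderReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```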

Scenario

There are five departments, and we need to calculate the total salary by department, then by gender. However, there are additional rules for calculating these totals. After calculating the total for each department by gender:

If the department’s total salary is over 200,000, add 25,000 to the total.

Otherwise, if the department's total salary is greater than 100,000, add 10,000 to the total.

A reducer sketch implementing these rules follows the figure below.

[Image: department and gender salary totals with the bonus rules applied]
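
This minimal reducer sketch assumes the map phase emits a composite key such as "Sales,F" (a hypothetical format) with the salary as the value, and that the 200,000 bracket takes precedence over the 100,000 one; the article leaves both details open. Note that a combiner for this job must only sum and never apply the bonus rules, since a partial sum could trigger a bonus too early.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Final reduce step: sum the salaries for one (department, gender) group,
// then apply the bonus rules from the scenario.
// Assumed key format: "department,gender", e.g. "Sales,F" (hypothetical).
class DeptGenderSalaryReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text deptAndGender, Iterable<LongWritable> salaries,
                          Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable salary : salaries) {
            total += salary.get();
        }
        // Assumption: the higher bracket takes precedence.
        if (total > 200_000) {
            total += 25_000;
        } else if (total > 100_000) {
            total += 10_000;
        }
        context.write(deptAndGender, new LongWritable(total));
    }
}
```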

For more information on machine learning, neural networks, data health, and more, get your free copy of the new DZone Guide to Big Data Processing, Volume III!



Abdul J. Gaspar
