What Is MapReduce?
MapReduce is a framework that runs on a computational
cluster to mine large datasets. The name derives from its
use of the map() and reduce() functions borrowed from
functional programming languages.
•“Map” applies a function to every member of the dataset and
returns a list of results
•“Reduce” collates and resolves the results from one or
more mapping operations executed in parallel
•Very large datasets are divided into large subsets called splits
•A parallelized operation performed on all splits yields
the same results as if it were executed against the whole
dataset before it was divided into splits
•Implementations separate business logic from multiprocessing logic
•MapReduce framework developers focus on process
dispatching, locking, and logic flow
•App developers focus on implementing the business logic
without worrying about infrastructure or scalability issues
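The claim that processing the splits in parallel gives the same answer as processing the whole dataset can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the helper names (make_splits, process_split, run_on_splits), the split size, and the use of a thread pool in place of a real cluster. The equivalence holds in this example because the reduce step (addition) is associative; it is not a statement about how any particular MapReduce framework is implemented.

```python
from functools import reduce
from multiprocessing.dummy import Pool  # thread-based pool from the standard library


def make_splits(dataset, split_size):
    """Divide the dataset into consecutive subsets ("splits")."""
    return [dataset[i:i + split_size] for i in range(0, len(dataset), split_size)]


def map_square(x):
    """The "business logic": the function applied to every member."""
    return x * x


def process_split(split):
    """Map over one split, then reduce that split to a partial sum."""
    return sum(map(map_square, split))


def run_on_splits(dataset, split_size=4, workers=3):
    """Process every split in parallel, then combine the partial results."""
    splits = make_splits(dataset, split_size)
    with Pool(workers) as pool:
        partials = pool.map(process_split, splits)
    return reduce(lambda a, b: a + b, partials, 0)


def run_on_whole(dataset):
    """Same computation against the undivided dataset, for comparison."""
    return sum(map(map_square, dataset))


data = list(range(1, 11))
# Both paths compute the sum of squares of 1..10.
assert run_on_splits(data) == run_on_whole(data) == 385
```

Note how map_square knows nothing about splits or workers, while run_on_splits knows nothing about squaring: that is the separation of business logic from multiprocessing logic described above, in miniature.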
Implementation patterns
The Map(k1, v1) -> list(k2, v2) function is applied to every
item in the split. It produces a list of (k2, v2) pairs for each call.
The framework groups all the results with the same key
together in a new split.
The Reduce(k2, list(v2)) -> list(v3) function is applied
to each intermediate results split to produce a collection
of values v3 in the same domain. This collection may contain
zero or more values. The desired result consists of all the v3
collections, often aggregated into one result file.
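
The Map, group, and Reduce steps above can be sketched as a single-process word count in Python. This is only an illustration of the pattern, not the API of any real framework: map_fn, group_by_key, reduce_fn, and map_reduce are hypothetical names, k1 is assumed to be a line number, and v1 the text of that line.

```python
from collections import defaultdict


def map_fn(k1, v1):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in the line."""
    return [(word.lower(), 1) for word in v1.split()]


def group_by_key(pairs):
    """Framework step: gather all values that share the same key k2."""
    grouped = defaultdict(list)
    for k2, v2 in pairs:
        grouped[k2].append(v2)
    return grouped


def reduce_fn(k2, values):
    """Reduce(k2, list(v2)) -> list(v3): here, a one-element list with the total."""
    return [sum(values)]


def map_reduce(records):
    """Run map on every record, group by key, then reduce each group."""
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    grouped = group_by_key(intermediate)
    return {k2: reduce_fn(k2, v2s) for k2, v2s in grouped.items()}


lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
result = map_reduce(lines)
print(result["the"])  # [3]
print(result["fox"])  # [2]
```

In a real deployment the map calls would run on many nodes and the grouping would involve moving data between them; the sketch keeps everything in one process so the data flow stays visible.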