But what is MapReduce itself?
I think it's best to start from the basics:

`map` is an operation that applies a single-argument function to each element of a sequence of inputs, producing an equal-sized sequence of outputs. For example, given the sequence `[1, 2, 3]` and the function `f(x) => x * 2`, the output would be `[2, 4, 6]`. As long as `map` is given a pure function -- one without side effects -- this operation is completely parallelizable.
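To make that concrete, here's the same example using Java streams (purely an in-memory illustration; the class name is just for the example):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MapExample {
    public static void main(String[] args) {
        // apply the pure function x -> x * 2 to every element; one output per input
        List<Integer> doubled = Stream.of(1, 2, 3)
                .map(x -> x * 2)
                .collect(Collectors.toList());
        System.out.println(doubled);  // prints [2, 4, 6]
    }
}
```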
`reduce` is an operation that applies a two-argument function to a sequence of inputs. For each element in the sequence, one argument is the return value from the previous invocation, and the other argument is the element itself. For example, given the sequence `[2, 4, 6]` and the function `f(x, y) => x + y`, the output would be `12` (assuming that the initial value for `x` is 0). Since `reduce` relies on the previous output, this operation is inherently sequential (I know, the linked article says that reduce is parallelizable; I'll get to that).
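And the same reduction with Java streams, where the identity value 0 plays the role of the initial `x`:

```java
import java.util.stream.Stream;

public class ReduceExample {
    public static void main(String[] args) {
        // the accumulator is applied left to right: ((0 + 2) + 4) + 6
        int sum = Stream.of(2, 4, 6)
                .reduce(0, (x, y) -> x + y);
        System.out.println(sum);  // prints 12
    }
}
```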
How does it work?
That depends on the implementation. It could happen entirely in memory (for example, the Java 8 Streams framework), or it could be distributed, as in the case of Hadoop. I'll assume that you mean the latter.
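Before getting into Hadoop, here's the in-memory case in miniature: the two operations composed as one Java stream pipeline. Nothing is distributed here, but the shape is exactly what Hadoop scales out:

```java
import java.util.stream.Stream;

public class MapThenReduce {
    public static void main(String[] args) {
        // map doubles each element; reduce then folds the results into a single sum
        int result = Stream.of(1, 2, 3)
                .map(x -> x * 2)
                .reduce(0, Integer::sum);
        System.out.println(result);  // prints 12
    }
}
```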
The basic idea of Hadoop is that it breaks the input (nominally a group of files) into a sequence of sequences, which can then be processed by a parallel `map` operation. There are a couple of differences, however, between Hadoop and the map-reduce operations I described above:
- The output of a `map` operation is actually a sequence, which Hadoop then flattens. So if each call in the example above returned `[x, 2*x, 3*x]` instead of a single value, the input `[1, 2, 3]` would produce `[[1, 2, 3], [2, 4, 6], [3, 6, 9]]`, which Hadoop flattens to `[1, 2, 3, 2, 4, 6, 3, 6, 9]`.
- The sequences of objects are actually sequences of key-value pairs. Your code may care about the key, the value, or both.
- The framework partitions the output of the `map` operation by key, so the `reduce` operation is called once for each key produced by the mappers (and those calls can therefore be parallelized; see the sketch after this list for an in-memory analogy of the flattening and grouping).
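As an in-memory analogy for those last two points (plain Java streams, not the Hadoop API): the flattening behaves like `flatMap`, and the partition-by-key step behaves like a `groupingBy`. A sketch, with made-up input lines:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingAnalogy {
    public static void main(String[] args) {
        // hypothetical input: two "lines" of text, standing in for Hadoop's input splits
        List<String> lines = Arrays.asList("the cat", "the dog");

        // flatMap plays the role of Hadoop's map: each line emits zero or more words,
        // and the per-line results are flattened into one stream.
        // groupingBy plays the role of the partition-by-key step: all values for the
        // same key end up in one sequence, ready for a per-key reduce.
        Map<String, List<Integer>> grouped = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(
                        word -> word,
                        Collectors.mapping(w -> 1, Collectors.toList())));

        System.out.println(grouped);  // {the=[1, 1], cat=[1], dog=[1]} -- key order may vary
    }
}
```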
So, here's (roughly) how Hadoop processes the WordCount example program (a code sketch follows the list):
- The Hadoop framework reads each of the files configured for the job, and partitions them into sequences of lines.
- Each sequence is fed to a mapper, as a key-value pair. The mapper ignores the key, and breaks the value into zero or more words.
- The mapper then writes one pair for each token in the input line: the key is the word, and the value is the number 1 (because the word has been seen once). The mapper could be smarter, and count the number of times that each word appears, writing key-value pairs for distinct words (this might make sense if the input were entire files, rather than lines).
- The framework groups mapper outputs by key, producing a sequence of key-value pairs where the value is itself a sequence.
- The framework then calls a reducer for each key-sequence pair. Since these are independent, they can be parallelized.
- The reducer sums all the values in the sequence, and emits a single key-value pair (you could really think of this as just another `map` operation).
- The framework writes those key-value pairs to the output file (and if you think of the reducer as a `map` operation, then the framework itself does the `reduce` ... probably best not to go down that path).
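In code, the mapper and reducer look roughly like the standard Hadoop WordCount example (a sketch; details may differ slightly between Hadoop versions):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Called once per line; ignores the key and writes (word, 1) for each token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Called once per key, with that key's grouped values; sums them and writes one pair.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```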
Can the grouping phase also be done in parallel or only the map and reduce phases?
No, because the Hadoop reducer processes a sequence of values, and you need to know all of the values before you call the reducer.
Except ... Hadoop also lets you define a "combiner", which is like a reduce step that can take incomplete results and produce something that is then passed to the actual reduce step. The WordCount example does this: it takes a key (word) and a sequence of values (counts) and sums them.

The combiner is useful for reducing the space required for intermediate results, but before calling the final reduce stage you still need to know that all of the results for a given key have ended up in the same place.
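Wiring in the combiner is a one-line addition to the job setup; WordCount reuses its reducer class as the combiner, which works because summing is associative and the combiner's output type matches the reducer's input type. A sketch of the driver, assuming the mapper/reducer classes sketched above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // the combiner does a partial, per-mapper sum before the data crosses the network
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```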
Is the main purpose of MapReduce really just parallelization when indexing large amounts of data?
I would rephrase that as "the main purpose of frameworks such as Hadoop is to simplify the distribution and coordination of parallel operations on large amounts of data."
For example, let's say that you've scanned all 18 million books in the Harvard library system. Now you have a lot of images, and you want to use OCR to translate them to text. You could manually partition the files amongst a set of machines, and manually start the OCR tasks on each of those machines. And inevitably, you'd find that some machines finish the task before others. Or maybe one of the machines has a hardware failure, and you have to figure out what it's completed and redistribute the things that it hasn't.
Or you could use a framework that distributes the files to machines that have capacity (a distributed `map`, with no `reduce`). But then, if you want, you could also automatically generate a keyword index for all of those translated files (another `map`, along with a `reduce`).