7

When people talk about MapReduce, they usually think of Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain MapReduce on its own, without Hadoop, but I still have some questions.

  • Does MapReduce really have an intermediate phase called grouping as the article describes?

  • Can the grouping phase also be done in parallel or only the map and reduce phases?

  • Do the map and reduce operations described in the article make sense for the proposed problem (indexing web pages by keywords)? They look too simple to me.

  • Is the main purpose of MapReduce really just parallelization when indexing large amounts of data?

  • Do you think too many people know Hadoop without understanding the fundamentals of MapReduce? Is it a problem?

4 Answers

4

MapReduce does indeed have a grouping phase. The map phase essentially consists of transforming inputs into (key, value) pairs. Because the reduce phase consists of "aggregating" all the values associated with the same key, you cannot avoid grouping all values by key before the reduce phase. This can take a lot of time, since values must be shuffled across the cluster.

The grouping phase can be done in parallel. Basically, a cluster node is associated with each generated key, and all the generated (key, value) pairs are sent to the node associated with their key. This typically puts a heavy load on the network, and this phase is generally considered network-bound.
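As a rough illustration of that routing (plain Java with a hypothetical nodeFor helper; this is not Hadoop's partitioner API, just the idea behind it):

    // Minimal sketch of key-based routing: every (key, value) pair produced by the
    // map phase is sent to node hash(key) mod N, so all pairs sharing a key end up
    // on the same node, where the reduce phase can then run locally.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PartitionSketch {
        static int nodeFor(String key, int numNodes) {
            // floorMod avoids negative indices when hashCode() is negative
            return Math.floorMod(key.hashCode(), numNodes);
        }

        public static void main(String[] args) {
            int numNodes = 4;
            List<Map.Entry<String, Integer>> mapOutput = List.of(
                    Map.entry("apple", 1), Map.entry("banana", 1), Map.entry("apple", 1));

            // Group pairs by destination node; on a real cluster this is a network transfer.
            Map<Integer, List<Map.Entry<String, Integer>>> perNode = new HashMap<>();
            for (Map.Entry<String, Integer> pair : mapOutput) {
                perNode.computeIfAbsent(nodeFor(pair.getKey(), numNodes), n -> new ArrayList<>())
                       .add(pair);
            }
            perNode.forEach((node, pairs) -> System.out.println("node " + node + " -> " + pairs));
        }
    }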

Indexing Web pages by keyword is a typical application of MapReduce. More generally, reversing a dictionary (building an inverted index) can almost directly be described as a MapReduce task. You can see it as a process for building a basic search engine: you want to find the Web sites that contain a particular keyword. Because you don't have time to browse all the sites for each incoming query, you have to prepare a reverse dictionary of (keyword, websites) pairs in advance.
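A minimal single-process sketch of that reverse dictionary (plain Java collections, not the Hadoop API; the two pages are made up for illustration):

    import java.util.Map;
    import java.util.Set;
    import java.util.TreeMap;
    import java.util.TreeSet;

    public class InvertedIndexSketch {
        public static void main(String[] args) {
            // Toy input: url -> page text
            Map<String, String> pages = Map.of(
                    "site-a.example", "cats and dogs",
                    "site-b.example", "dogs chase cats");

            // "Map" step: emit one (keyword, url) pair per word on each page.
            // "Group + reduce" step: collect, per keyword, the set of urls containing it.
            Map<String, Set<String>> index = new TreeMap<>();
            pages.forEach((url, text) -> {
                for (String word : text.split("\\s+")) {
                    index.computeIfAbsent(word, w -> new TreeSet<>()).add(url);
                }
            });

            index.forEach((word, urls) -> System.out.println(word + " -> " + urls));
        }
    }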

MapReduce is not limited to indexing tasks. The map and reduce tasks can be more or less complex than those required for indexing, and you can combine several map and reduce tasks to carry out more complex data processing. For instance, Apache Pig provides a SQL-like language for describing "complex" MapReduce jobs.

I don't think you can correctly use Hadoop (directly) without mastering its fundamentals. You cannot correctly describe a map or a reduce task if you don't understand how the framework will use it to process the submitted data. A deeper understanding of the MapReduce mechanism also helps you understand why a job takes so long, or why adding CPUs will not reduce processing time.

mgoeminne
  • 1,158
  • 6
  • 11
  • That's great! Thanks very much. One additional question: did you see any mistake or incorrect information in the referenced article? – Eddie Bravo Jan 07 '17 at 09:24
  • I have not tested the source code, but the ideas seem OK. However, as I said, the grouping phase is also "parallelised" in MapReduce, since you can move all your pairs simultaneously. Naively, you could associate an id with each of your N nodes (from 0 to N-1), then simultaneously calculate the hash h(k) of each key and send the pair (k, v) to node h(k) mod N. Please note you don't have to finish the map phase before starting the shuffling, so MapReduce is, for multiple reasons, more efficient than the code proposed in the blog post. – mgoeminne Jan 07 '17 at 09:49
4

But what is MapReduce itself?

I think it's best to start from the basics:

  • map is an operation that applies a single-argument function to a sequence of inputs, producing an equal-sized sequence of outputs.

    For example, assume a sequence of inputs [1, 2, 3] and a function f(x) => x * 2. The output would be [2, 4, 6].

    As long as map is given a pure function -- one without side-effects -- this operation is completely parallelizable.

  • reduce is an operation that applies a two-argument function to a sequence of inputs. For each element in the sequence, one argument is the return value from the previous invocation, and the second argument is the element.

    For example, consider the sequence [2, 4, 6] and a function f(x,y) => x + y. The output would be 12 (assuming that the initial value for x is 0).

    Since reduce relies on the previous output, this operation is sequential (I know, the linked article says that reduce is parallelizable; I'll get to that). Both operations are sketched in code just after this list.
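A minimal illustration of both operations using the Java 8 Stream API (single-process, just to make the definitions concrete):

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class MapReduceBasics {
        public static void main(String[] args) {
            List<Integer> input = Arrays.asList(1, 2, 3);

            // map: apply the pure function x -> x * 2 to every element -> [2, 4, 6]
            List<Integer> doubled = input.stream()
                    .map(x -> x * 2)
                    .collect(Collectors.toList());

            // reduce: fold the sequence with (x, y) -> x + y, starting from 0 -> 12
            int sum = doubled.stream().reduce(0, (x, y) -> x + y);

            System.out.println(doubled + " reduces to " + sum); // [2, 4, 6] reduces to 12
        }
    }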

How does it work?

That depends on the implementation. It could happen entirely in memory (for example, the Java 8 Stream framework), or it could be distributed, as in the case of Hadoop. I'll assume that you mean the latter.

The basic idea of Hadoop is that it breaks the input (nominally a group of files) into a sequence of sequences, which can then be processed by a parallel map operation. There are a couple of differences, however, between Hadoop and the map-reduce operations I described above:

  • The output of a map operation is actually a sequence, which Hadoop then flattens (so, it's as if the example above took [1, 2, 3] and produced [[1,2,3], [2,4,6], [3,6,9]], which Hadoop translates to [1, 2, 3, 2, 4, 6, 3, 6, 9]; see the flatMap sketch after this list).
  • The sequences of objects are actually sequences of key-value pairs. Your code may care about the key, the value, or both.
  • The framework partitions the output of the map operation by key, so the reduce operation is called once for each key produced by the mapper (and so can be parallelized).
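In stream terms, that flattening is what flatMap does; a minimal sketch (plain Java streams, not the Hadoop API; the x -> [x, 2x, 3x] function is just the made-up example above):

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class FlattenSketch {
        public static void main(String[] args) {
            List<Integer> input = Arrays.asList(1, 2, 3);

            // Each input element maps to a whole sequence [x, 2x, 3x]; flatMap then
            // concatenates those sequences, much like Hadoop flattens mapper output.
            List<Integer> flattened = input.stream()
                    .flatMap(x -> Arrays.asList(x, 2 * x, 3 * x).stream())
                    .collect(Collectors.toList());

            System.out.println(flattened); // [1, 2, 3, 2, 4, 6, 3, 6, 9]
        }
    }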

So here's (roughly) how Hadoop processes the WordCount example program (a code sketch of the same flow follows this list):

  1. The Hadoop framework reads each of the files configured for the job, and partitions them into sequences of lines.
  2. Each sequence is fed to a mapper, as a key-value pair. The mapper ignores the key, and breaks the value into zero or more words.
  3. The mapper then writes one pair for each token in the input line: the key is the word, and the value is the number 1 (because the word has been seen once). The mapper could be smarter, and count the number of times that each word appears, writing key-value pairs for distinct words (this might make sense if the input were entire files, rather than lines).
  4. The framework groups mapper outputs by key, producing a sequence of key-value pairs where the value is itself a sequence.
  5. The framework then calls a reducer for each key-sequence pair. Since these are independent, they can be parallelized.
  6. The reducer sums all the values in the sequence, and emits a single key-value pair (you could really think of this as just another map operation).
  7. The framework writes those key-value pairs to the output file (and if you think of the reducer as a map operation, then the framework itself does the reduce ... probably best not to go down that path).
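A single-process sketch of that flow using Java streams (the real WordCount example uses Hadoop's Mapper and Reducer classes; this only mirrors the data flow described above):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class WordCountFlowSketch {
        public static void main(String[] args) {
            // Step 1: the "files", already partitioned into lines (toy input).
            List<String> lines = Arrays.asList("to be or not to be", "be quick");

            // Steps 2-3: the "mapper" splits each line and emits one (word, 1) pair per word.
            // Steps 4-6: pairs are grouped by word and the 1s are summed, which is what
            //            the per-key reducers do (in parallel) on a real cluster.
            Map<String, Integer> counts = lines.stream()
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    .collect(Collectors.groupingBy(word -> word, Collectors.summingInt(w -> 1)));

            // Step 7: the framework would write these pairs to the output files.
            counts.forEach((word, count) -> System.out.println(word + "\t" + count));
        }
    }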

Can the grouping phase also be done in parallel or only the map and reduce phases?

No, because the Hadoop reducer processes a sequence of values, and you need to know all of the values before you call the reducer.

Except ... Hadoop also lets you define a "combiner", which is like a reduce step that can take incomplete results, and produce something that is then passed to the actual reduce step. The WordCount example does this: it takes a key (word) and sequence of values (counts) and sums them.

The combiner is useful for reducing the space required for intermediate results, but before calling the final reduce stage you still need to know that all of the results for a given key have ended up in the same place.
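A rough sketch of what a WordCount-style combiner buys you (plain Java with made-up partial counts; this is not Hadoop's Combiner API, just the merge it performs):

    import java.util.HashMap;
    import java.util.Map;

    public class CombinerSketch {
        public static void main(String[] args) {
            // Hypothetical partial counts already combined locally on two mapper nodes,
            // so only one small (word, count) pair per word has to cross the network.
            Map<String, Integer> node1 = Map.of("be", 2, "quick", 1);
            Map<String, Integer> node2 = Map.of("be", 1, "or", 1);

            // The final reduce step just repeats the same summation over the partial counts.
            Map<String, Integer> total = new HashMap<>(node1);
            node2.forEach((word, count) -> total.merge(word, count, Integer::sum));

            System.out.println(total); // {be=3, quick=1, or=1} (iteration order may vary)
        }
    }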

Is the main purpose of MapReduce really just parallelization when indexing large amounts of data?

I would rephrase that as "the main purpose of frameworks such as Hadoop is to simplify the distribution and coordination of parallel operations on large amounts of data."

For example, let's say that you've scanned all 18 million books in the Harvard library system. Now you have a lot of images, and you want to use OCR to translate them to text. You could manually partition the files amongst a set of machines, and manually start the OCR tasks on each of those machines. And inevitably, you'd find that some machines finish the task before others. Or maybe one of the machines has a hardware failure, and you have to figure out what it's completed and redistribute the things that it hasn't.

Or you could use a framework that distributes the files to machines that have capacity (a distributed map, no reduce). But then, if you want, you could also automatically generate a keyword index for all of those translated files (another map along with a reduce).

kdgregory
  • 5,220
  • 23
  • 27
1

MapReduce allows you to create a program that will run on a distributed data center.

The program is divided into two phases. The first is sent to the servers where the data lives and performs a pre-calculation. The results of this phase are sent to the second part of the program to be consolidated.

This works best for data processing where the results of the first part are small. For example, a select account, count(*) can be run on each node to produce a small partial result, which is then sent to one server for consolidation; the consolidation is just a sum of the partial sums. More or less like this:

Example of a sum query in hadoop-like systems
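In code, the "sum of partial sums" idea looks roughly like this (a single-process sketch with made-up per-node counts):

    import java.util.Arrays;
    import java.util.List;

    public class SumOfSumsSketch {
        public static void main(String[] args) {
            // Hypothetical per-node partial results of "select count(*)":
            // each node only counted the rows it stores locally.
            List<Integer> partialCounts = Arrays.asList(120, 87, 243);

            // The consolidation step is just a sum of the partial sums, which is
            // why this kind of query distributes so well.
            int total = partialCounts.stream().mapToInt(Integer::intValue).sum();
            System.out.println("total rows: " + total); // 450
        }
    }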

Although Hadoop will work for things that are not easily partitionable, it will not be as efficient. Take the GraySort competition: Hadoop sorts quickly, but requires a lot more hardware than its competitors.

Hadoop will be efficient for straightforward sums, selects without joins, and a lot of data analytics. It will not be as efficient for sorts, count(distinct), joins, and so on. Mathematically speaking, it works best when the aggregation is distributive, so that combining partial results gives the same answer as computing over all the data at once.

Lucas
  • 298
  • 1
  • 4
0

MapReduce is a divide-and-conquer strategy, with some constraints on the division and aggregation policy.

All MapReduce problems must be divisible, such that each sub-portion of the problem is uniform, or "the same". This permits any worker to work on a sub-chunk of the problem.

All MapReduce results must share the same "type" or record structure. Thus you can produce results like

(a, b, c)
(null, b, c)
(a, b, null)

where a, b, c are compatible data types, but you cannot produce outputs like

(a, b, c)
(a, d, c)
(a, b)

A function must be declared that can aggregate two sets of results.

f(R1, R2) = R3

such that all fields in R1 and R2 are sensibly combined in the output record set R3. This function must be symmetric (commutative), meaning

f(R1, R2) = R3
f(R2, R1) = R3

is true for all sets of records (otherwise you'll never get stable output across runs).
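For example (a made-up record type and merge rule, just to make the constraint concrete), an aggregation function over per-key count records that satisfies the symmetry requirement could look like this:

    import java.util.HashMap;
    import java.util.Map;

    public class AggregateSketch {
        // f(R1, R2) = R3: merge two partial result sets into one.
        static Map<String, Integer> aggregate(Map<String, Integer> r1, Map<String, Integer> r2) {
            Map<String, Integer> r3 = new HashMap<>(r1);
            r2.forEach((key, count) -> r3.merge(key, count, Integer::sum));
            return r3;
        }

        public static void main(String[] args) {
            Map<String, Integer> r1 = Map.of("a", 2, "b", 1);
            Map<String, Integer> r2 = Map.of("b", 3, "c", 5);

            // Because addition is commutative, swapping the arguments yields the same
            // result set, so the order in which chunks arrive does not matter.
            System.out.println(aggregate(r1, r2)); // {a=2, b=4, c=5} (order may vary)
            System.out.println(aggregate(r2, r1)); // same contents
        }
    }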

These constraints permit a multi-stage processing where many of the stages are done in parallel.

  1. Something chops up the input data.
  2. While bits are produced by #1, resources are consumed to create partial "chunk" answers.
  3. While #1 and #2 are proceeding, the chunk answers are "reduced" to fewer chunks.
  4. Once #1 is done, #2 has processed all of the input chunks, and #3 has aggregated all of the chunk answers into one, report the combined result.

Since each "problem solver" has their "bit" of input, and they can work on producing their "output chunk" without all of the input, which make it easy to spread the work across multiple machines on the network. Each "combining" worker would then aggregate the results of the "problem solvers" until all chunks of work were aggregated.

Edwin Buck
  • 489
  • 3
  • 7