63

I have been reading about MapReduce for a while -- but what I can't understand is how someone would make a decision to use (or not use) MapReduce.

I mean, what are the problem patterns that signal that MapReduce could be used?

dodgy_coder
treecoder

12 Answers

49

It's basically problems that are huge, but not hard. Travelling salesman depends crucially on the distance between any given pair of cities, so while it can be broken down into many parts, the partial results cannot be recombined so that the globally optimal solution emerges (well, probably not; if you know a way, please apply for your Fields medal now).

On the other hand, counting frequencies of words in a gigantic corpus is trivially partitionable, and trivially recombinable (you just add up the vectors computed for the segments of the corpus), so map-reduce is the obvious solution.
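
For the word-count case, the "recombine" step really is just addition. Here is a minimal Python sketch (the corpus segments are invented for illustration):

from collections import Counter

# Hypothetical corpus, already split into segments (e.g. one per machine)
segments = [
    "call me ishmael some years ago",
    "never mind how long precisely",
    "call me ishmael",
]

# "Map": count word frequencies independently within each segment
partial_counts = [Counter(segment.split()) for segment in segments]

# "Reduce": recombining the partial results is just adding the counts
total = sum(partial_counts, Counter())
print(total.most_common(3))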

In practice, more problems tend to be easily recombinable than not, so the decision whether to parallelize a task or not has more to do with how huge the task is, and less with how hard it is.

Kilian Foth
  • If you're looking for an approximate answer to the traveling salesman problem, you can trivially pick the answer with the minimum distance to merge. – dan_waterworth Apr 17 '12 at 06:40
  • I did not understand your explanation of why MapReduce would not be suitable for Travelling Salesman. –  Apr 18 '12 at 05:47
  • 11
    It is suitable for finding *a* solution, maybe even a very good one - just partition the set of cities into smaller sets, e.g. 1-10, 11-20, 21-30, find optimal routes between them, and join them with hops of 10->11, 20->21 and 30->1. But the point of the problem is to find the *optimal* route, and there is no guarantee that the optimal route is partitioned this way - it might actually start with 1->25! In other words, to find the correct partitioning, you must basically know the solution already! That's why finding the *optimal* route is not susceptible to the partition-and-reassemble trick – Kilian Foth Apr 18 '12 at 06:38
  • 3
    @KilianFoth, you could do an exhaustive search by splitting the solution space into "starting at 1", "starting at 2", ..., then solving the problem at each of these nodes by partitioning the space again in the same way. Merging at the root is simply finding the shortest route, merging at another branch is finding the shortest 'child route + route from branch to child'. – dan_waterworth Apr 18 '12 at 07:56
  • You're right, you *can* use partitioning to compute the optimal solution if you do it carefully - you'll even get the expected speed-up by approximately the number of nodes. It's just that, since the original problem is exponentially expensive, that probably won't move the problem from 'too expensive' to 'feasible' as it usually does with other problems. So technically it's a possible application, just not a typical one. – Kilian Foth Apr 18 '12 at 09:04
  • 5
    In case you have the solution, remember that you are only eligible for your Fields medal if you are under 40. – Francesco Apr 22 '12 at 22:28
  • @dan_waterworth: You cannot merge these solutions. Merging at the branch is 'finding the shortest child route (**that does not use any node from the ancestors**) + route from branch to child. – Aditya Apr 23 '12 at 23:33
30

Can the problem be solved efficiently using distributed computing?

If the answer to this question is yes, then you have a candidate problem for MapReduce. That is because the problem pattern lends itself to being split into smaller isolated problems.

Your task: Parse this book

An example works well to illustrate this. You have a large document (Moby Dick by Herman Melville) and your job is to perform a frequency analysis of all the words used in it.

The sequential approach

You can do this sequentially by getting your fastest machine (you've got plenty lying around) and running over the text from start to finish maintaining a hash map of every word you find (the key) and incrementing the frequency (value) every time you parse a word. Simple, straightforward and slow.

The MapReduce approach

Approaching this from a different perspective, you note that you have all these spare machines lying around and you could split this task up into chunks. Give each machine a 1 MB block of text to parse into a hash map and then collate all the hash maps from each into a single result. This is a layered MapReduce solution.

The process of reading a line of text and gathering the words is the Map phase (you create a simple map representing the words in the line with their frequency 1,2,3 etc), then the Reduce phase is when each machine collates their line maps into a single aggregate map.

The overall solution comes from a further Reduce phase where all the aggregate maps are aggregated (that word again) into a final map. Slightly more complex, massively parallel and quick.
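
As a rough sketch of that layered structure (simulated in a single Python process rather than on real machines; the chunks and helper names are illustrative):

from collections import defaultdict

def map_chunk(text):
    # Map phase: one machine turns its block of text into a word -> count map
    counts = defaultdict(int)
    for word in text.split():
        counts[word] += 1
    return counts

def reduce_maps(maps):
    # Reduce phase: collate several hash maps into a single aggregate map
    aggregate = defaultdict(int)
    for m in maps:
        for word, count in m.items():
            aggregate[word] += count
    return aggregate

# Pretend each string is the block of text handed to one machine
chunks = ["call me ishmael", "some years ago never mind", "call me ishmael again"]
per_machine = [map_chunk(c) for c in chunks]   # map runs on every machine in parallel
final = reduce_maps(per_machine)               # the further Reduce phase merges the aggregates
print(dict(final))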

Summary

So, to summarise, if your problem lends itself to being represented by keys, values, and aggregate operations on those values in isolation, then you have a candidate problem for MapReduce.

Gary
  • 2
    Meh; that's an oversimplification. MapReduce is about partitioning the data, applying a function to the pieces in parallel _without communication between the analyzers_, and then applying another function to combine the bits. Not all distributable problems fit that model. – Donal Fellows Apr 25 '12 at 08:52
  • 2
    Fair point - but it serves as a useful introduction and allows someone to "box" their problem. – Gary Apr 25 '12 at 15:57
13

The MapReduce pattern is taken from the world of functional programming. It is a process for applying something called a catamorphism over a data-structure in parallel. Functional programmers use catamorphisms for pretty much every simple transformation or summarization.

Assuming your data is a tree, the deciding factor is whether you can compute a value for a node using only the data contained in that node and the computed values for its children.

For example you can compute the size of a tree using a catamorphism; you would compute the sum of the computed values for all children plus one.
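
A small sketch of that tree-size catamorphism in Python (the Tree class and cata helper are invented for illustration; a functional language would typically derive the fold from the data type):

class Tree:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def cata(node, f):
    # Combine a node's own data with the already-computed values for its children
    return f(node.value, [cata(child, f) for child in node.children])

# Size of a tree: one for this node plus the sum of the children's computed sizes
size = lambda value, child_sizes: 1 + sum(child_sizes)

t = Tree("root", [Tree("a"), Tree("b", [Tree("c")])])
print(cata(t, size))  # 4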

dan_waterworth
  • Good answer, I'm not sure if @good_computer was referring to the specific MapReduce framework developed by Google. And I do not know if MapReduce (again the Google framework) applies to something else than types isomorphic to lists. – scarfridge Apr 17 '12 at 06:59
  • 1
    @scarfridge, I assumed that the OP wasn't referring to the Google specific framework. I consulted the Wikipedia article with regard to whether it is only used for lists or for trees in general before posting. http://en.wikipedia.org/wiki/MapReduce#Overview – dan_waterworth Apr 17 '12 at 07:54
  • 2
    If only it were called _MapFold_; that would be so much easier to understand. – Aditya Apr 23 '12 at 23:34
6

This WPI - Applications of Map Reduce (ppt) may be of interest to you. It discusses different applications of MR, and as one of the discussed cases, it shows how, using 100 EC2 instances and 24 hours, the New York Times was able to convert 4 TB of scanned articles to 1.5 TB of PDF documents.

Another set of examples where MR helped speed up performance: Aster - SQL Map Reduce shows some case studies of SQL-MapReduce technology, including fraud detection, transformations, and others.

NoChance
  • 1
    If you end up with one pdf per scanned article, then you are just using a distributed map, not MapReduce. In map-reduce you apply a reduction to the results of the map to obtain a single result. – Pete Kirkham May 13 '16 at 10:53
6

Map/Reduce is a specific form of a specific kind of algorithm. You use it to transform one huge data set into another data set. (The result dataset may or may not be huge.) If you don't want a static data output set as a result of static data input, then Map/Reduce is not appropriate. Map/Reduce can easily tell you how many John Smiths are in the Manhattan phone book, but it is not well suited to build a web server.

The way Map/Reduce works is:

  • Map takes pairs of keys (k1) and values (v1) and maps them into a new set of keys (k2) and values (v2).
  • Reduce takes all the v2 values with the same k2 key and produces a new value (v3).

The result is that a list of (k1, v1) pairs is transformed into a list of (v3)s. (Of course, the value "v3" can be a composite that includes k2, which could be defined to be equal to k1.)
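
A toy, single-process sketch of that shape (the driver and the phone-book data are made up; a real framework distributes the grouping and the reduce calls across machines):

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map: each (k1, v1) pair is turned into zero or more (k2, v2) pairs
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    # Reduce: all v2 values sharing a k2 key are combined into a single v3
    return [reduce_fn(k2, v2s) for k2, v2s in grouped.items()]

# Example: count John Smiths per borough in a phone book
phone_book = [(1, ("John Smith", "Manhattan")), (2, ("Jane Doe", "Manhattan")),
              (3, ("John Smith", "Manhattan")), (4, ("John Smith", "Brooklyn"))]
map_fn = lambda k1, v1: [(v1[1], 1)] if v1[0] == "John Smith" else []
reduce_fn = lambda k2, v2s: (k2, sum(v2s))
print(run_mapreduce(phone_book, map_fn, reduce_fn))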

So you use it:

  1. If you have so much data to start with that running it all sequentially through one or two servers would take too long, and

  2. You can conceive of the output data as being a list of values or key-value pairs (generally not too hard when you remember "key" only means "unique label"), and

  3. Whatever the relationship, you are sure that each piece of input data only impacts the output value for one output key.

If your data can all be processed sequentially by a single server, then since that is the dominant computing paradigm (the one servers are built for and programmers are trained on), use a single server.

The map stage has to partition all the input data by output key. It doesn't have to produce the output value associated with the output key (that's done by the reduce stage), but it does have to uniquely assign each input key value pair to contribute to at most one output key's value. If the data is too interrelated then map reduce might not be able to handle the problem. On the other hand, it may just be that you need to use multiple rounds of map/reduce.

If you can't figure out how to turn your data transformation into a map/reduce, then of course it's not a solution.

There is a real art to figuring out if a problem can be decomposed into something Map/Reduce can handle. For example v1 and v2 might not be in the input or output data sets at all. If you just want to count unique items in the input data, then k1 = k2 = an item and v1 = v2 = 1 or 0 or really anything. Reduce just produces v3 as the sum of the number of k2's it was given.

So it's hard to say for sure that a data transformation cannot be done using Map/Reduce, but the above gives you some guideposts.

Old Pro
3

MapReduce works on any problem that is made up of exactly 2 functions at some level of abstraction. The first function is applied to each of the items in the input set, and the second function aggregates the results.

So, any time you want to get (1) result from (n) inputs, and all inputs can be examined/used by (1) function, you can use MapReduce. Again, this is at some specific level of abstraction. The (1) function may be some grouping function that checks the input and decides which of several other functions to use.

This is useful when you don't know in advance how much input you will have, when you need to share out discrete "units" of work, or when you want a single return value to represent the entire result (e.g. running five thousand unit tests and, if less than x% fail, returning success).
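
A sketch of that last example in Python (the fake tests and the failure threshold are invented; only the map/aggregate shape matters):

from concurrent.futures import ThreadPoolExecutor

def run_test(test):
    # The "map" function: applied to each input independently
    return test()  # True for pass, False for fail

def aggregate(results, max_fail_pct=5.0):
    # The "reduce" function: collapse all the results into a single success/failure answer
    failed = sum(1 for passed in results if not passed)
    return 100.0 * failed / len(results) <= max_fail_pct

tests = [lambda: True] * 4900 + [lambda: False] * 100  # stand-in for five thousand unit tests
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_test, tests))
print("success" if aggregate(results) else "failure")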

Spencer Rathbun
3

Most of the answers here seem to be some variation of explaining what map reduce does, which is valid. But the question, which was what pattern signals that you might be able to use map reduce, is not really addressed by that.

If the naive, non functional, implementation of the problem you are looking at involves looping over something and then updating something outside the loop with some state from inside the loop, chances are you have something that ports well to map reduce. Especially if you can generalize the updating of the central state to a function that works with just two parameters and can guarantee this function is commutative and associative.
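
For example, a running total kept outside a loop translates directly: the loop body becomes the map function and the accumulator update becomes a two-argument combiner that is commutative and associative. A small sketch (the order data is invented):

from functools import reduce

orders = [("widget", 3, 2.50), ("gadget", 1, 9.99), ("widget", 2, 2.50)]

# Naive version: a loop updating state that lives outside the loop
total = 0.0
for name, qty, price in orders:
    total += qty * price

# Map/reduce version: per-item work in the map, a commutative and
# associative combiner (addition) in the reduce, so the work can be
# partitioned across machines and the partial sums recombined
line_totals = map(lambda o: o[1] * o[2], orders)
total_mr = reduce(lambda a, b: a + b, line_totals, 0.0)
assert total == total_mr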

The reason you might want to use map reduce, if that is true, is twofold: 1) it might be a bit cleaner and easier to test and debug if you break things into map and reduce functions. 2) Map reduce functions are stateless and may be run concurrently, which speeds things up if you have multiple CPUs available and something like Hadoop or Spark that makes use of that to run things in a cluster.

This is nice if you are looping over a lot of stuff but your mileage may vary depending on how complex your map/reduces are. It's quite common to end up with a sequential chain or tree of map reductions where in the end everything is still bottlenecked on some complex reduction step at the end of the chain. For example many graph algorithms are difficult to scale efficiently with just map reduce.

The simplest example that works well with map reduce is counting stuff, which is a very cheap reduction. This is why word count is an often used example for map reduce. You can pretty much expect linear scalability for performance with that use case: every CPU you add makes it faster.

2

Is it parallelisable?

Any parallelisable problem is essentially map and fold; conversely, the map step is inherently parallelisable (and the fold step might be, depending on the structure over which it's folding), so this is a bidirectional property.

Peter Taylor
  • 3
    This is only the case for [Embarrassingly parallel](http://en.wikipedia.org/wiki/Embarrassingly_parallel) problems. There are plenty of problems which are highly parallelisable, but which contain enough interaction between elements that a simple MapReduce wouldn't be efficient. – Mark Booth Apr 17 '12 at 14:44
    Thanks for the link, I didn't know about the embarrassingly parallel term. Aren't all map reduce solvable problems embarrassingly parallel? – Paul Sanwald Apr 23 '12 at 14:23
  • 1
    There are many embarrassingly parallel problems, not all of which need the reduce part. – Donal Fellows Apr 25 '12 at 08:54
2

If you do much functional programming, you start running into situations that call for a general map and reduce. You probably even see them in imperative programming, but don't recognize them behind the mask of loops and accumulators.

As an example of one that came up for me recently, I've been working on a parser in Haskell. To test my parser, I pump a list of string fragments through the parser, and then I want to get a single string of my results that I can output to see if it parsed right. So that looks like:

--my initial set of test data, a list
tests = ["string1", "string2", "string3", ...]

--Map Step: turn strings into parsed results
--note the type, which demonstrates the map
applyParser :: [String] -> [Token]
--The actual function
applyParser input = map parser input

--Second map, turn tokens into output
showTokens :: [Token] -> [String]
showTokens t = map show t

--Reduce step, concat the results
combineResults :: [String] -> String
--In Haskell, reduce is a left fold: foldl takes an operation to fold with, a starting element, and a list to fold on
combineResults strings = foldl (++) "" strings

--Finished program
testParser = print (combineResults(showTokens(applyParser tests)))

Of course, this is just pedagogical. My actual code looks a bit different and uses more built-in functions (the explicit fold isn't needed, since Haskell already includes unlines, which does [String] -> String). My main point was that I didn't anticipate using a map/reduce when I started; it just aligned with my needs. I wanted to do some stuff with lists, then turn my list into a single element of output. The use of map/reduce emerged naturally.

String processing (like parsing) is one very obvious use of map reduction: mapping is the application of various transformations on the input text, and reduce is putting the result text back together again as output. Likewise, a compiler could be similar, using folds to turn a stream of Abstract Syntax Tree elements into a better form (optimizing).

CodexArcanum
2

Here are the major questions I use to probe a decision to use (or not use) MapReduce.

  • Is achieving reasonable parallel execution performance with minimal programmer effort important for a given problem?
  • Do I have a large number (hundreds) of parallel execution elements available?
  • Is there excellent communication bandwidth/throughput among the parallel execution elements?
  • Do I need to process a huge amount (TB) of data?
  • Does the problem I am trying to solve decompose into Map and Reduce operation?

    • Map: Execute the same operation on all data.
    • Reduce: Execute the same operation on each group of data produced by Map.
David Pointer
1

Say you are searching across a cluster of servers and one is unable to respond at that moment. Because MapReduce could not reach that node, it will reschedule that piece of the larger Map for later and then perform either the Map or the Reduce on it. Essentially it tries to guarantee that all the information is available despite the unpredictability of software and hardware in such environments.

1

In effect, it's a generic "divide and conquer" pattern, so that solutions for distributing the computation can be written generically.

A simple example is a large document. The problem is that you want to count the number of letters in that document. Instead of running on a single machine, you can break it down into an array of all the words in the document. Then you can process each word individually, and combine the results back together.

The pattern is useful because once you get a generic map/reduce implementation working, you can solve any problem using that same software layer; you just need to express your problem in terms of it.

ianpojman