
I define an algorithm to be best if it minimizes the total run time on commodity hardware (e.g. normal desktop and server PCs).

I have sets A and B. I also have a function f that is invoked as f(a, b, n), where a is some item in A, b is some item in B, and n is some natural number. f(a, b, n) returns a positive real number.

These properties are true:

  • f(a,b,n) has a randomized component in it, so invoking the function twice will likely give two different answers (close to each other, but nonetheless different). Larger values of n only reduce the estimation error of f.
  • f(a,b,n) takes twice as long to compute as f(a,b,n/2).
  • f(a,b,n) = (f(a,b,n/2) + f(a,b,n/2))/2
  • Computing f(a_1,b_1,n) is independent of computing f(a_2,b_2,n), where a_1,a_2 are distinct elements of A, and b_1,b_2 are distinct elements of B.

My goal is to compute the value of answer as follows:

count = 0
output = 0

for a in A:
    for b in B:
        output += f(a,b,n)
        count += 1

answer = output / count

return answer

But the code above is highly serialized. What I wish to do is parallelize it so that I minimize the total number of seconds needed to obtain answer.

Just to reiterate: I am running this application on a single computer, which is why I am considering a multi-threaded application. I do not wish to distribute it over multiple machines. All I want is to run it really fast over the many cores that I have on a single computer.

caveman
  • Is `n` static or does it ever change? And to make sure, is `f` a pure function? Could you memoize the results or does it have side effects? – Vincent Savard Aug 24 '16 at 14:35
  • Yes **n** is static/constant, and yes **f** is a pure function. Memoizing output from **f** is sort of useless, because I'll ask it to compute a certain input only once. I.e. I will not ask it to re-compute the same thing again. – caveman Aug 24 '16 at 14:43
  • Yeah I figured that if `n` is static, memoization wouldn't be of use, but wasn't sure if the first two properties were relevant, so I was just making sure. – Vincent Savard Aug 24 '16 at 14:45
  • Also, `f(a,b,n) = (f(a,b,n/2) + f(a,b,n/2))/2` means that `f(a,b,n) = (2*f(a,b,n/2))/2 = f(a, b, n/2)`. Is `n` even used? – Vincent Savard Aug 24 '16 at 16:50
  • So when you said `f` was pure, you actually meant that `f` is not pure... – Vincent Savard Aug 25 '16 at 13:22
  • Right. But it is asymptotically pure, if this makes sense. It's a randomized function. It has random coins inside it. Do you know a better name for it? I.e. the average of all **f(a,b,n)** runs converges to a single number in the limit as the number of runs approaches infinity (law of large numbers), and this asymptotic value is unique for each **a** or **b**. I.e. asymptotically **f(a,b,n) = f(c,d,n)** only if either **c = a** or **d = b** --- sorry for this. I hope it's clear now? – caveman Aug 25 '16 at 13:31
  • Does running `f(a,b,n)` effectively just run `f(a,b,1)` a number of times equal to `n`, then average the results? – Tanner Swett Aug 26 '16 at 01:42
  • @TannerSwett yes – caveman Aug 26 '16 at 02:26
  • First, the "count" increment in the loop is superfluous, delete it. Instead, set `count=size(A)*size(B)` after the summation. Second, can't you simply split the tuples (a,b) of AxB into k partitions, where k is the number of CPUs you have, and let each partial sum be calculated by a different thread? The result then can be summed up finally, similar to your own solution attempt, but without this hoodoo-woodoo around n. – Doc Brown Aug 28 '16 at 12:55
  • @caveman: ...just out of curiosity ...the code is just pseudocode and you're not planning to write this in python, right? – dagnelies Aug 29 '16 at 15:11
  • It would probably help if you added a tag for the programming language you will be using. Different languages have different ways of supporting threading; any answers you get now will be general and superficial. – Martin Maat Sep 02 '16 at 07:52
  • I feel like the question in its current state is a bit too broad: To get *the best* performance, you need all details you could possibly get (structure of A and B, maybe even the expected number of elements in each one, which CPU instructions are available, what does f do exactly, how big are your caches, ...). In its current state, we can only give some general guidelines on optimizing, but that doesn't accomplish the goal of the question (providing the fastest algorithm). – hoffmale Sep 02 '16 at 08:36
  • I saw the commodity hardware constraint (and does that exclude GPU?), but do you have a target implementation language / library? Java/Scala would have a different structure than C/C++ with OpenCL/OpenMP. – Kristian H Sep 02 '16 at 11:49
  • @DocBrown what do you mean by *hoodoo-woodoo around n*? – caveman Sep 03 '16 at 04:39
  • @KristianH it's C – caveman Sep 03 '16 at 04:40
  • @caveman: it seems the dependency on n and what you wrote about it is totally irrelevant to the question. When you remove this part from the question, the problem looks pretty trivial, which makes me wonder if we understood your issue correctly. So if you are still waiting for a better answer (than, for example, the sketch algorithm I gave in my comment above in one sentence), you need to clarify where we misunderstood you. – Doc Brown Sep 03 '16 at 06:18
  • @caveman: it is also good etiquette to state what's wrong with the answers or accept the one that seems best. – dagnelies Sep 04 '16 at 06:16
  • @DocBrown You have a good point. **f(a,b,n) = (f(a,b) + f(a,b) + ... + f(a,b))/n** -- please post your answer. I may likely end up accepting it, and actually change my API to follow your answer. – caveman Sep 04 '16 at 09:29
  • @dagnelies You know what is really good etiquette? Not discussing off-topic matters. The question is not about "what is good etiquette". The question is about "optimize an algorithm". Another good etiquette is for you to read previous answers. My answer was there (it got deleted only recently) and it was identical to yours, except for using a **counter** variable, which I placed for clarity as some people find size*size hard. If you wish to discuss etiquette, discuss it in Meta. I feel sorry that no one got the 100 points, but I am also sorry that a bunch of wrong people deleted my answer. – caveman Sep 04 '16 at 09:32

4 Answers


There's not a lot to work with here. You are not divulging f, and you are not asking to parallelize the internals of f, just the doubly-nested loops that invoke it. While there is some relationship between f(a,b,n) and f(a,b,n-1), I don't see how to take advantage of it because of the undisclosed randomizing component.

(I presume this is your intent, but just to be clear, there may be a better solution if we could understand the randomizing component, since repeating that over and over looks like where all the work really is, and doing something different instead might be most effective.)

So, the only thing you can do is slice the data to keep all the cores busy.

There are three ways you can slice the data: subranges of A, subranges of B, and subranges of n; the last of these you've already done in your own answer.

You also haven't divulged the structure of A or B, except that they are obviously collections, or at least generators.

If they are collections materialized in memory (arrays, lists, etc.) and of significant size, then each core iterating over A and B could thrash the cache. Unless, that is, the cores somehow cooperate and happen to be working on the same range of A and B at the same time, in which case they'll actually get a boost from each other!

But if they happen to get significantly out of sync with each other (say, because conditional logic in f doesn't evaluate the same way), then they'll fight each other for the cache.

(An analogy is copying large folders/directories on your hard drive: if you start a second copy of different folders at the same time, both copies will more than likely slow to a crawl, and together take 10x longer than running the two copies serially.)

So, to mitigate the cache fighting, if that is indeed a problem, you could limit the individual CPU runs to a portion of the A x B matrix that fits in the cache, and have all the CPUs work on that limited set until they are all done, then move on to another subset of the A x B matrix. If one CPU finishes first, I probably would not even give it new work, presuming that (a) this cache thrashing is a real problem for your domain, and (b) all the CPUs will finish their A x B subset more-or-less at the same time. Assuming all that, we'd probably be better off running all CPUs to completion on one subset before embarking on the next.

Of course, you also want to spawn as few threads as you can, because threads represent overhead as well; that is a benefit of the slicing in your answer. But it is possible that out-of-sync threads thrash the cache badly enough that fixing that would be worth spawning additional threads.

On the other hand, spawning exactly as many threads as cores, but with an algorithm that works incrementally on well-defined subranges of the A x B matrix, then waits for all the other cores to acknowledge completion before starting on the next subrange, may provide the best of all solutions. Each thread announces completion of its work on the subset and then suspends itself, waiting for notification that all threads have completed.

So each core would march through the same A x B subset using your notion of n/C iterations, then all the cores would move on to the next A x B subset.
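To make this concrete, here is a minimal sketch of that barrier idea in C with pthreads. The setup is assumed, not given in the question: A and B are taken to be plain arrays of doubles, f a C function with the signature shown, and the core count and tile size are hard-coded.

#include <pthread.h>
#include <stddef.h>

/* Assumed problem data -- hypothetical names, not from the question. */
extern double f(double a, double b, int n);
extern double A[], B[];
extern size_t lenA, lenB;

#define NCORES 4    /* one worker thread per core */
#define TILE   64   /* tile edge; pick it so a TILE x TILE block fits in cache */

static pthread_barrier_t barrier;
static double partial[NCORES];   /* one slot per thread, so no locking needed */

struct worker_arg { int id; int n; };

static void *worker(void *p) {
    struct worker_arg *w = p;
    double sum = 0.0;
    /* All threads march through the same tile of A x B together;
       within a tile, thread w->id takes every NCORES-th row. */
    for (size_t i0 = 0; i0 < lenA; i0 += TILE) {
        for (size_t j0 = 0; j0 < lenB; j0 += TILE) {
            for (size_t i = i0 + w->id; i < i0 + TILE && i < lenA; i += NCORES)
                for (size_t j = j0; j < j0 + TILE && j < lenB; j++)
                    sum += f(A[i], B[j], w->n);
            /* Wait for every thread to finish this tile before any thread
               moves on, so the tile stays hot in cache for all cores. */
            pthread_barrier_wait(&barrier);
        }
    }
    partial[w->id] = sum;
    return NULL;
}

double average(int n) {
    pthread_t tid[NCORES];
    struct worker_arg args[NCORES];
    pthread_barrier_init(&barrier, NULL, NCORES);
    for (int t = 0; t < NCORES; t++) {
        args[t].id = t;
        args[t].n = n;
        pthread_create(&tid[t], NULL, worker, &args[t]);
    }
    double total = 0.0;
    for (int t = 0; t < NCORES; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    pthread_barrier_destroy(&barrier);
    return total / ((double)lenA * (double)lenB);
}

The one barrier per tile is what keeps the cores in lockstep on the same cache-sized subset.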


In fact, even using subranges of A x B in a single-threaded algorithm might improve performance over ranging over all of A x B numerous times, which is entirely possible if the memory touched in the A x B matrix doesn't fit in the cache. Each run through A x B needs to bring the entirety of both data structures into the cache again (and maybe even again and again just for one pass through A x B), whereas a single thread running on a manageable subset could bring each such subset into the cache only once for the n iterations. So you might start with a single-threaded version of matrix subsetting, and then add the parallel synchronization.

Erik Eidt
  • Please give an example, or a proof, where knowing the inner workings of the function **f** will lead you to a different optimal solution, even when it satisfies the properties that I have listed in my question. – caveman Sep 03 '16 at 04:54
  • For example, if `f(a,b,n) = (a*b + rand(0,1) + f(a,b,n-1)) / n` with `f(a,b,0) = 0` on an 8-core CPU, one could split the addition of a*b over 6-7 designated threads (possibly enhanced with SSE instructions or similar, likely to be RAM-speed bound), and 1-2 threads that simply calculate a sum of size(A)*size(B)*n randomly generated numbers, which gets divided by n at the end and added to the result of the other threads (no IO, just pure calculations -> no RAM bottleneck). Without knowing the internals of f, such a split of calculations would be impossible. – hoffmale Sep 03 '16 at 08:32
  • however, that approach would be rather useless if `f(a,b,n) = (rand(0,2) * a * b + (n-1) * f(a,b,n-1)) / n` with `f(a,b,0) = 0`, since now the randomness isn't independent from the result of `a*b` anymore. (both f functions would adhere to the criteria you mentioned in your question afaict) – hoffmale Sep 03 '16 at 08:46
  • Do you have an example that makes sense? This just doesn't make sense at all. – caveman Sep 06 '16 at 01:49

Here is a solution to the much more general problem of parallelizing any loop cleanly.

Just let each thread pull items (pairs of a,b in your case) one after another and compute the partial result until all are consumed.

Main thread:

output = 0
iter = iterator_over(A,B)

// start threads and wait until done

answer = output / size(A) / size(B)
return answer

Each thread:

res = 0
while true:
   synchronized:
       if !iter.hasNext():
           break
        a,b = iter.next()

   res += f(a, b, n)

synchronized:
    output += res

For optimal performance, the number of threads should equal the number of CPU cores.
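A rough C rendering of this pseudocode with pthreads follows; it assumes (hypothetically, since the question does not say) that A and B are arrays of doubles and that f has the signature shown.

#include <pthread.h>
#include <stddef.h>

/* Assumed problem data -- hypothetical names, not from the question. */
extern double f(double a, double b, int n);
extern double A[], B[];
extern size_t lenA, lenB;

#define NTHREADS 4   /* set this to the number of CPU cores */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static size_t next_pair = 0;   /* shared iterator over A x B, flattened */
static double output = 0.0;
static int N;                  /* the n to pass to f */

static void *worker(void *unused) {
    (void)unused;
    double res = 0.0;
    for (;;) {
        /* synchronized: pull the next (a,b) pair, if any remain */
        pthread_mutex_lock(&lock);
        size_t k = next_pair++;
        pthread_mutex_unlock(&lock);
        if (k >= lenA * lenB)
            break;
        res += f(A[k / lenB], B[k % lenB], N);
    }
    /* synchronized: fold this thread's partial sum into the output */
    pthread_mutex_lock(&lock);
    output += res;
    pthread_mutex_unlock(&lock);
    return NULL;
}

double answer(int n) {
    pthread_t tid[NTHREADS];
    N = n;
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return output / ((double)lenA * (double)lenB);
}

Pulling one pair per lock acquisition is the simplest version; if f turns out to be cheap, each thread can grab a batch of pairs per acquisition to reduce contention.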

dagnelies
  • This is identical to my attempted answer, except that you are not using a counter variable, and -instead- effectively using **count=size(A)*size(B)** -- agree? – caveman Sep 03 '16 at 04:43
  • @caveman: no idea, I don't see an answer from you – dagnelies Sep 03 '16 at 20:17
  • I re-pasted my answer again. Please have a look ASAP before wrongs happen again and it gets deleted. Basically wrong people made the wrong decision to vote for deleting it, and there were enough wrong people that they managed to wrongfully delete my answer. – caveman Sep 04 '16 at 09:24

This answer builds on top of Erik Eidt's answer.

There are three ways you can slice the data: subranges of A, subranges of B, and subranges of n; the last of these you've already done in your own answer.

In addition to just slicing one of the sets, it is possible to slice on both A and B.

Before I explain, I must first apologize for not using mathematical notations. This site does not support MathJax, therefore it is not possible to write notations that contain more than one level of subscripts.

The name for slicing on more than one subset (or dimension of data) is [loop tiling](https://en.wikipedia.org/wiki/Loop_nest_optimization).

To simplify things, let's just slice A as follows: (A1...A10), (A11...A20), (A21...A30), ..., and likewise B as follows: (B1...B10), (B11...B20), (B21...B30), ...

The choice of slicing A and B into groups of 10 is arbitrary. You should experiment with different group sizes to find a better combination.

The pseudocode is then:

for each ten consecutive items taken from A, namely A(10*j+1 ... 10*j+10)
    for each ten consecutive items taken from B, namely B(10*k+1 ... 10*k+10)
        for each item in the subrange A(10*j+1 ... 10*j+10), namely "a"
            for each item in the subrange B(10*k+1 ... 10*k+10), namely "b"
                for i = 1 ... n
                    process f(a, b, ...)

Some details are omitted in the pseudocode in order to highlight the specifics of the "loop tiling" technique.

I omitted the parallelization aspect, but basically one can choose one of the `for` loops and convert it into a `parallel for` to fully maximize multicore utilization, as in the sketch below.
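For illustration, a hedged OpenMP sketch of the tiled loop above, with the parallel for placed on the outer tile loops; the names and types (f, A, B) are my assumptions, not given in the question.

/* Compile with: gcc -fopenmp ... */
#include <stddef.h>

extern double f(double a, double b, int n);   /* assumed signature */

double tiled_average(const double *A, size_t lenA,
                     const double *B, size_t lenB, int n)
{
    const size_t T = 10;   /* tile size; tune this experimentally */
    double output = 0.0;

    /* Whole tiles are handed out to threads, so each thread's working
       set is one T x T block, small enough to stay in cache. */
    #pragma omp parallel for collapse(2) reduction(+:output) schedule(dynamic)
    for (size_t j = 0; j < lenA; j += T)
        for (size_t k = 0; k < lenB; k += T)
            for (size_t a = j; a < j + T && a < lenA; a++)
                for (size_t b = k; b < k + T && b < lenB; b++)
                    output += f(A[a], B[b], n);

    return output / ((double)lenA * (double)lenB);
}

The schedule(dynamic) clause hands a fresh tile to whichever thread becomes idle, which also helps when some tiles happen to take longer than others.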


Some more general advice.

As explained in the linked Wikipedia article above, there is no simple rule that will help you choose the optimal subrange sizes for A and B. Real-world computer and software performance is influenced by too many factors. Instead, one has to try different values and see which combination runs faster. Such performance experimentation also reveals which performance factor dominates; this can only be discovered empirically, and is not easy to predict from theory alone.

Since you are using C, you have several choices:

  • OpenMP
  • MPI
  • Manually written multithreaded code using pthreads

I see that you have pthreads in the tag. However, for this type of computation, it would be easier to achieve better performance by using OpenMP instead.

You should already know about thread safety. If you do not, it is likely that no multithreaded program you write will give a correct answer in the first place. So it is very important that you either know what thread safety is, or else avoid using multiple threads.

One way to increase multi-core CPU utilization without multithreading is to use multiple processes (OS processes). Namely: open multiple console terminals and launch the program in each terminal, instructing each instance to save its output to a different file. When all the programs finish, combine the files into the final result.
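As a sketch of the same multi-process idea done programmatically, here is a variation using fork() and pipes instead of separate terminals and output files; partial_sum is a hypothetical helper that sums f over one worker's slice of A x B.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

#define P 4   /* number of worker processes */

/* Hypothetical helper: sums f(a,b,n) over worker p's slice of A x B. */
extern double partial_sum(int p, int nworkers);

int main(void) {
    int pipes[P][2];

    for (int p = 0; p < P; p++) {
        if (pipe(pipes[p]) != 0)
            return 1;
        if (fork() == 0) {
            /* child process: compute one slice and send the result back */
            double s = partial_sum(p, P);
            write(pipes[p][1], &s, sizeof s);
            _exit(0);
        }
    }

    double total = 0.0;
    for (int p = 0; p < P; p++) {
        double s;
        read(pipes[p][0], &s, sizeof s);   /* collect one partial sum */
        total += s;
        wait(NULL);
    }
    printf("combined sum: %f\n", total);
    return 0;
}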

rwong
  • Any reason that you don't do loop tiling for the parameter **n**? I.e. why not 1..10, 11..20, ..., etc? – caveman Sep 06 '16 at 01:47
  • @caveman: It is of course possible to do loop tiling on **n** as well. If you are using OpenMP, you can in fact just add a `#pragma omp parallel for` before each level of for loop to achieve parallelization. It just means you have more parallelization parameters to choose from, which takes more time. In the end, it might not be necessary to squeeze out the last 10% of performance. – rwong Sep 06 '16 at 06:46
  • How is your answer different from this one? http://programmers.stackexchange.com/a/329609/243450 -- it seems to me that he is doing exactly your loop tiling? --- as for **n**, I wonder if you did it to maximize the cache hit? – caveman Sep 06 '16 at 16:32
  • @caveman: that other answer generalizes the approach, by glossing over the detail of how the Cartesian set product of **A, B** would be iterated through, i.e. how the tuples will be sorted. For example, if one iterates through the Cartesian product via a [Z-order curve / Morton code](https://en.wikipedia.org/wiki/Z-order_curve). To the extent that cache doesn't matter, every answer should be roughly equivalent in performance. But if cache does affect performance, as Erik Eidt posits, then it would be important to limit each core's tasks to a subset of **A** and a subset of **B**. – rwong Sep 06 '16 at 16:41
  • As for **n**, once a subset of **A** and a subset of **B** has been populated in a core's cache, it makes sense to have this core perform all of the work that can be done with this subset. This means processing the entirety of **n** within this core. Different cores should focus on different subsets of **A** and **B**. – rwong Sep 06 '16 at 16:43

Say that I have C CPU cores, and that I wish to run this in parallel on all C cores. Here is my current solution:

Every thread runs the same code; here is thread k, for k = 1, 2, ..., C:

THREAD_k
count = 0
output = 0

for a in A:
    for b in B:
        output += f(a,b,n/C)
        count += 1

thread_k_answer = output / count

return thread_k_answer

Then finally I define the final answer as follows:

answer = (thread_1_answer + thread_2_answer + ... + thread_C_answer)/C
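For reference, a minimal pthreads rendering of this n/C scheme in C; as elsewhere, treating A and B as arrays of doubles and the signature of f are my assumptions, not given.

#include <pthread.h>
#include <stddef.h>

/* Assumed problem data -- hypothetical names. */
extern double f(double a, double b, int n);
extern double A[], B[];
extern size_t lenA, lenB;

#define C 4             /* number of cores / threads */
static int n_over_C;    /* n / C, read-only once the threads start */
static double thread_answer[C];

static void *run(void *arg) {
    int id = (int)(size_t)arg;
    double output = 0.0;
    /* every thread sweeps the full A x B, but with only n/C per pair */
    for (size_t i = 0; i < lenA; i++)
        for (size_t j = 0; j < lenB; j++)
            output += f(A[i], B[j], n_over_C);
    thread_answer[id] = output / ((double)lenA * (double)lenB);
    return NULL;
}

double final_answer(int n) {
    pthread_t tid[C];
    n_over_C = n / C;
    for (int t = 0; t < C; t++)
        pthread_create(&tid[t], NULL, run, (void *)(size_t)t);
    double sum = 0.0;
    for (int t = 0; t < C; t++) {
        pthread_join(tid[t], NULL);
        sum += thread_answer[t];
    }
    return sum / C;   /* average of the C per-thread answers */
}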
caveman