26

Presuming that I have written some sequential code that can be broken down into multiple isolated tasks, it seems it could be more efficient to introduce concurrency.

For example:

print(expensive_calc_one() + expensive_calc_two())

Assuming expensive_calc_one and expensive_calc_two are pure functions, but also computationally quite expensive, it would seem sensible for a compiler to optimise the code by introducing concurrency and allowing the two functions to run in parallel.

I know that this also has its downsides (context switching adds overhead, and some computers still only have one logical core).

Are there any compilers which would introduce concurrency into previously non-concurrent code, and are there any general patterns for it (or reasons not to do it)?

Ben
  • 369
  • 3
  • 5
  • 31
    Most languages don't lend themselves well to this, because the compiler does not have information on whether or not the functions can interfere with each other if run in parallel. Functional languages that mark pure and impure functions would be best suited to this. However, I can't tell which compilers do this, if any. – Theraot Jan 31 '21 at 13:51
  • 1
    There's some useful discussion (specifically in context of GCC) [here](https://stackoverflow.com/questions/40088101/can-gcc-make-my-code-parallel). – Ben Jan 31 '21 at 15:23
  • 6
    Are we talking about _concurrency_ or _asynchronicity_ here? Concurrency forces the threads to run in parallel, asynchronicity does not enforce it but allows for it to happen if the runtime machine so chooses. Taking a simple example: the members of a rock band have to perform at the same time when playing live (= concurrency), but when recording a studio track, they're not required (but are allowed) to record their bits at the same time. However, the studio recording can only be released when they've all played their individual parts (= asynchronicity). – Flater Feb 01 '21 at 01:15
  • @Flater, it seems *concurrency* is the only reasonable interpretation of what he desires. He wants two long-winded operations to run concurrently so that they are finished sooner than if they ran sequentially. There is no apparent reason to think that these operations need to be synchronous like a live band playing together in time and in lockstep, but equally, no need for them to be asynchronous like recording separate studio tracks at different times - it's really a case of recording separate studio tracks *at the same time* to the maximum extent possible. – Steve Feb 01 '21 at 03:02
  • @Steve: That doesn't mean the question could not include the compiler injecting asynchronicity, which will usually run those jobs in parallel _if and when_ it has the threads available to do so. By your interpretation, if I had an example of asynchronicty, it would be pointless to mention in this question then. But another can argue that asynchronicity _is_ an attempt at concurrency which allows (as a plan B) for non-concurrency if the host machine does not offer more than one thread, which effectively improves runtime compatibility and is therefore more desirable than forced concurrency. – Flater Feb 01 '21 at 10:10
  • 1
    You normally have to tell a compiler to do so, and it is only a few that can. It is a much better idea to use a modern language that has parallelism supported directly and then code accordingly. (Java is an example). Perhaps even use a framework supporting spreading the load over multiple machines. – Thorbjørn Ravn Andersen Feb 01 '21 at 14:24
  • 4
    Not precisely what you're looking for, but I'd suggest taking a look at https://en.wikipedia.org/wiki/Automatic_parallelization. A cursory search shows that GCC, Oracle's Fortran compiler, and Intel's C++ compiler at least seem to have some form of for loop auto-parallelization. – GrumpyYoungMan Feb 01 '21 at 16:24
  • Just to explain why I haven't marked anything as the answer yet - I would like to see some discussion over why it is so difficult, with reference to theoretical limitations and historical debate. – Ben Feb 01 '21 at 16:36
  • @Flater "Concurrency forces the threads to run in parallel". This isn't true, you're conflating two separate concepts. A single-core computer runs many processes concurrently but none will ever run in parallel. https://stackoverflow.com/a/1050257/ . If band members record an album in separate parts but are all present at the studio at the same time then the album is being recorded concurrently, but not in parallel. – JShorthouse Feb 02 '21 at 14:31
  • @JShorthouse: You're conflating existence and execution. You're using "concurrency" to refer to the existence of the thread, rather than when it executes. The same applies to your adaptation of my example. In your situation, the album is **not** being _recorded_ concurrently. But it is true that in your situation the band members _are present_ concurrently. Those are two very different things to focus on. OP's question conflates concurrency and parallelism, and therefore so did I (since non-parallel concurrency is essentially asynchronicity for the purpose of the question I asked in comments) – Flater Feb 02 '21 at 14:45
  • @Flater Concurrency is not parallelism. These terms are very well defined in computer science. "Concurrent execution" simply means that multiple processes are "in-flight" at the same time. It does not imply parallelism. Again, a single core processor can concurrently execute processes without parallelism. The only way that these concepts are linked is that processes can only be executed in parallel if they are concurrent. https://softwareengineering.stackexchange.com/a/190725/ https://en.wikipedia.org/wiki/Concurrent_computing – JShorthouse Feb 02 '21 at 15:32
  • @JShorthouse Concurrency is a loaded word. It can mean "doing things at the same time" or it can mean "appearing to do things at the same time". The central idea is the appearance of multiple things happening at once. Any further distinction, while it may be perfectly correct, is irrelevant for the question I asked in the comments. – Flater Feb 02 '21 at 15:39
  • 1
    @Flater Maybe in the general sense. Concurrency in the context of computer science has a very specific definition that is not ambiguous. It has nothing to do with appearance of parallelism. If I slow the clock speed of single core CPU down to 25Hz it will neither be "doing things at the same time" or "appearing to do things at the same time", but will still be able to execute concurrent programs because concurrency is not parallelism. – JShorthouse Feb 02 '21 at 15:48
  • Some open source compilers (e.g. [Chicken Scheme](http://www.call-cc.org/)...) are compiling *to* C: they generate C code. You could improve them to sometimes also generate [OpenCL](https://en.wikipedia.org/wiki/OpenCL). The reasons for not doing so are theoretical ([Rice's theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem)...) and practical (compiler development costs money). – Basile Starynkevitch Feb 02 '21 at 18:51

4 Answers

54

Assuming expensive_calc_one and expensive_calc_two are pure functions

Unfortunately, determining whether a function is pure is equivalent to solving the Halting Problem in the general case. So, you cannot have an Ahead-of-Time compiler which can in the general case decide whether a function is pure or not.

You have to help the compiler by explicitly designing the language in such a way that the compiler has a chance to decide purity. You have to force the programmer to explicitly annotate such functions, for example, or do something like Haskell or Clean, where you clearly isolate side effects using the type system.

but also computationally quite expensive

Unfortunately, determining in an Ahead-of-Time compiler whether a function is "computationally quite expensive" is also equivalent to solving the Halting Problem. So, you would need to force the programmer to explicitly annotate computationally expensive functions for the compiler to parallelize.

Now, if you have to force the programmer to explicitly annotate pure and computationally expensive functions as candidates for parallelization, then is it really automatic parallelization? How is that different from simply annotating functions for parallelization?

Note that some of those problems could be addressed by performing the automatic parallelization at runtime. At runtime, you can simply benchmark a function and see how long it runs, for example. Then, the next time it is called, you evaluate it in parallel. (Of course, if the function performs memoization, then your guess will be wrong.)
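A minimal Python sketch of that runtime strategy (the function names, threshold, and `time.sleep` stand-ins for real work are all invented for illustration): time each function when it runs, and combine a pair of calls in parallel only once both have been observed to be expensive. Note that CPython threads only help here because `sleep` releases the GIL; genuinely CPU-bound work would need a process pool instead.

```python
import time
from functools import wraps
from concurrent.futures import ThreadPoolExecutor

_runtimes = {}       # observed wall-clock time per function name
EXPENSIVE = 0.05     # threshold in seconds (arbitrary for this sketch)

def timed(fn):
    """Record how long fn takes each time it is called."""
    @wraps(fn)
    def wrapper(*args):
        start = time.perf_counter()
        result = fn(*args)
        _runtimes[fn.__name__] = time.perf_counter() - start
        return result
    return wrapper

def combine(f, g):
    """Return f() + g(); in parallel only if both looked expensive before."""
    if (_runtimes.get(f.__name__, 0) > EXPENSIVE
            and _runtimes.get(g.__name__, 0) > EXPENSIVE):
        with ThreadPoolExecutor(max_workers=2) as pool:
            fut_f, fut_g = pool.submit(f), pool.submit(g)
            return fut_f.result() + fut_g.result()
    return f() + g()  # first call: sequential, but timings get recorded

@timed
def calc_one():
    time.sleep(0.1)   # stand-in for expensive work
    return 1

@timed
def calc_two():
    time.sleep(0.1)
    return 2

first = combine(calc_one, calc_two)   # sequential, records timings
second = combine(calc_one, calc_two)  # now runs both in parallel
print(first, second)  # 3 3
```

A real runtime would also have to handle the memoization problem mentioned above: a function that was slow once may be instant the next time.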

Are there any compilers which would introduce concurrency into previously non-concurrent code

Not really. Auto-parallelization has been (one of) the holy grail(s) of compiler research for over half a century, and is still as far away today as it was 50–70 years ago. Some compilers perform parallelization at a very small scale, by auto-vectorization, e.g. performing multiple arithmetic operations in parallel by compiling them to vector instructions (MMX/SSE on AMD64, for example). However, this is generally done on a scale of only a handful of instructions, not entire functions.

There are, however, languages where the language constructs themselves have been designed for parallelism. For example, in Fortress, a for loop executes all its iterations in parallel. That means, of course, that you are not allowed to write for loops where different iterations depend on each other. Another example is Go, which has the go keyword for spawning a goroutine.

However, in this case, you either have the programmer explicitly telling the compiler "execute this in parallel", or you have the language explicitly telling the programmer "this language construct will be executed in parallel". So, it's really the same as, say, Java, except it is much better integrated into the language.

But doing it fully automatically is near impossible, unless the language has been specifically designed with it in mind.

And even if the language is designed for it, you often have the opposite problem now: you have so much parallelism that the scheduling overhead completely dominates the execution time.

As an example: in Excel, (conceptually) all cells are evaluated in parallel. Or more precisely, they are evaluated based on their data dependencies. However, if you were to actually evaluate all formulae in parallel, you would have a massive amount of extremely simple parallel "codelets".

There was, apparently, an experiment in having a Haskell implementation evaluate expressions in parallel. Even though the concurrent abstraction in Haskell (a "spark") is quite lightweight (it is just a pointer to a "thunk", which in turn is just a piece of un-evaluated code), this generated so many sparks that the overhead of managing the sparks overwhelmed the runtime.

When you do something like this, you essentially end up with the opposite problem compared to an imperative language: instead of having a hard time breaking up huge sequential code into smaller parallel bits, you have a hard time combining tiny parallel bits into reasonably-sized sequential bits. While this is semantically easier, because you cannot break code by serializing pure parallel functions, it is still quite hard to get the degree of parallelism and the size of the sequential bits right.
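That granularity problem is easy to reproduce even in Python (the `tiny` function and the data are made up for illustration): each task below does far too little work to pay for its own scheduling, so the pooled version typically comes out slower than the plain loop, while of course producing the same answer.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tiny(x):
    return x + 1          # far too little work to be worth a task switch

data = list(range(10_000))

t0 = time.perf_counter()
sequential = [tiny(x) for x in data]
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(tiny, data))
t_par = time.perf_counter() - t0

print(sequential == parallel)  # True: same result either way
print(f"sequential {t_seq:.4f}s, pooled {t_par:.4f}s")
```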

Glorfindel
  • 3,137
  • 6
  • 25
  • 33
Jörg W Mittag
  • 101,921
  • 24
  • 218
  • 318
  • It's interesting to point out that those issues are undecidable. I guess you wouldn't want to heuristically guess them, but in theory there could be some cases in which it is obvious that it is computationally expensive, or that two functions do not affect each other. Do you have any links to discussions about auto-parallelisation? – Ben Jan 31 '21 at 15:12
  • 9
    Being generally undecidable isn't actually important. That's about a function containing code that would make it not pure, and deciding whether that code ever gets executed. But that is such a rare case... What you need is just to check whether there is anything that would make the function impure. Functions that are sometimes pure, sometimes not, are a tiny minority. – gnasher729 Jan 31 '21 at 18:50
  • 26
    @gnasher729: It is true that compilers do undecidable things all the time. Escape Analysis is undecidable, but compilers do it nonetheless. You just have to live with the fact that sometimes the answer is "I don't know", which in this case means that you cannot perform the optimization because you risk crashing the program. Devirtualization is another example: undecidable in the general case, but implemented in many C++ compilers as an optimization. However, devirtualization fails often enough that C++ programmers are taught to avoid virtual functions. Dead code detection: the Java Language … – Jörg W Mittag Jan 31 '21 at 19:36
  • 8
    … Specification specifies a specific subset of cases where it is performed, and the language is specifically designed to allow it. The rules about definite assignment, "effectively final", etc. in C++, Java, C# and co. are there precisely because liveness *in general* is undecidable. And so on. In general, you need specific restrictions in the language and/or help from the programmer to make it work often enough that it is worth it. For concurrency, specifically, I feel that having the language be explicitly concurrent is much easier than bolting it on as a compiler optimization. – Jörg W Mittag Jan 31 '21 at 19:39
  • 4
    However, I may also be misunderstanding the OP's question here. I am assuming "given an existing language, not specially designed for concurrency, can I automatically discover potential concurrency as a compiler optimization?" If we are allowed to assume a language that is *specifically designed* for automatic concurrency, that's a whole different ballgame. Then we end up with something like Fortress, data flow languages, Occam-π, APL, vector languages, etc. – Jörg W Mittag Jan 31 '21 at 19:42
  • 1
    Data-parallelism in a single *long-running* loop can be exploited by compilers by auto-magically creating thread-level parallelism as well as auto-vectorizing with SIMD. OpenMP `#pragma omp parallel ...` hints the compiler to do this, and `gcc -ftree-parallelize-loops=4` will even do that on non-hinted loop. Maybe not a good idea without profile-guided optimization to find actually hot loops, though, because of the high overhead! [C loop optimization help for final assignment (with compiler optimization disabled)](https://stackoverflow.com/q/32000917) shows it doing a poor job. (@Ben) – Peter Cordes Feb 01 '21 at 23:20
  • 1
    Unrolling with multiple registers (stuff like software pipelining to hide latency) is another way to expose data parallelism to the hardware, letting the CPU find the instruction-level parallelism that creates, and [more easily run multiple instructions per clock cycle](https://softwareengineering.stackexchange.com/questions/349972/how-does-a-single-thread-run-on-multiple-cores/350024#350024). (Putting all 3 of these together is how you write a matmul function that keeps both SIMD FMA units busy most clock cycles on all cores, although you'd often use manual parallelization there.) – Peter Cordes Feb 01 '21 at 23:26
  • Programmers generally know many things which compilers could benefit from knowing, but couldn't possibly know unless told. Languages designed for performance should focus on letting programmers give compilers useful information, rather than expecting compilers to figure things out on their own. A memory-compare function which is optimized for best performance in cases where two blocks match perfectly will yield sub-optimal performance when given blocks that differ within the first few bytes, and vice versa, but a compiler can't be expected to know which form will work better unless told. – supercat Feb 02 '21 at 17:37
  • Intel's Fortran Compiler checks for pure functions in `DO CONCURRENT`, so there's at least one ahead of time capable of determining if a function is pure. It's explained here: https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2023-1/do-concurrent.html – UpTide Jul 25 '23 at 23:24
9

Compilers are not generally smart enough to do this, in particular because most languages don't have a sufficiently reliable concept of a “pure function”. There can be rather subtle interactions between threads. Clearly, they should not access the same memory regions. But the program might rely on invariants that could be broken through concurrent modifications. Introducing threads is also a fairly drastic change to the original program. Some other aspects of “automatic async/await” were discussed here.

However, given a concurrent program, some compilers/runtimes can make that program run in parallel without substantial extra work. These languages generally have a concept of an executor that is responsible for completing pending tasks, where tasks can be I/O operations or async functions that are being awaited. There can often be different executor implementations, i.e. different event loops, and sometimes even multiple running executors within the same program.

With executors, all I have to say is that my functions are async and where they can be suspended. The runtime is then responsible for deciding how to schedule them, whether in a single-threaded event loop or on multiple threads in parallel. Having at most one thread per CPU and using an in-process scheduler avoids task-switching overhead. Typically, a language with async/await would express your code snippet like this:

# start expensive computations
task_one = expensive_calc_one()
task_two = expensive_calc_two()

# wait for both to complete and print the result
print(await task_one + await task_two)
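As a concrete, runnable Python version of that snippet (the `expensive_calc_*` bodies are stand-ins; truly CPU-bound work would additionally need `run_in_executor` or a process pool, since the event loop alone interleaves but does not parallelize):

```python
import asyncio

async def expensive_calc_one():
    await asyncio.sleep(0.1)   # stand-in for real awaitable work
    return 20

async def expensive_calc_two():
    await asyncio.sleep(0.1)
    return 22

async def main():
    # start both computations before awaiting either one
    task_one = asyncio.create_task(expensive_calc_one())
    task_two = asyncio.create_task(expensive_calc_two())
    return await task_one + await task_two

result = asyncio.run(main())
print(result)  # 42
```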

However, languages differ in how exactly async functions are executed. When an async function is invoked, some languages execute it immediately up to its first suspension point, which cannot lead to parallelism. Others create a task without executing any part of the function. CPU-bound tasks that should execute in the “background” might still require extra annotation.

Languages that follow this general strategy are C#, Python, Rust, and many others. Python doesn't really support parallelism though, even when using multiple threads (the GIL allows only one thread to execute Python bytecode at a time). In Rust, the type system ensures that data can be sent to a different thread, or shared across threads – and prevents the creation of such tasks if not. In Go, an await can be simulated with single-element channels, whereas tasks can be created with the go keyword. JavaScript can start multiple workers, but the event loop is single-threaded. Executors are also available in Java and Swift, albeit without real await syntax.

Haskell is probably the only language that could feasibly do automatic multi-threading, because every function (that doesn't involve the IO monad) is pure by definition. But I don't think Haskell/GHC actually starts multiple threads implicitly.

amon
  • 132,749
  • 27
  • 279
  • 375
  • 1
    As far as I know, there was a prototype of a Haskell implementation that automatically parallelized evaluation. The problem was that because every expression in Haskell is pure and lazy, *every expression* can be evaluated in parallel. Even a simple program would create ginormous amounts of parallelism, to the point where scheduling overhead was completely dominating execution. Which is why they switched to the current scheme with explicit annotations. Where automatic parallelization in imperative languages has trouble *breaking up* code, this implementation had trouble *combining* it. – Jörg W Mittag Jan 31 '21 at 14:31
  • 1
    I appreciate the effort you have put in to this answer, but I would really like some explanation as to why this kind of optimisation is harder than other optimisations, and ideally some historical context of people choosing not to implement it. I understand that it's difficult, but not why it's too difficult to implement. – Ben Jan 31 '21 at 15:02
  • 1
    @Ben Jörg's answer discusses why some of the issues mentioned in my first paragraph are undecidable. This doesn't mean such approaches would have to be totally useless, e.g. practical type systems like Scala and C++ are undecidable as well. But auto-parallelism is really not a feature that can be bolted onto a language because it breaks many basic assumptions a programmer has about program execution, e.g. regarding atomic variable updates. AFAIK only Rust and Haskell could auto-parallelize safely, but they have their own reasons for not doing so (zero-cost principle, bad results in practice). – amon Jan 31 '21 at 15:13
  • 1
    @amon Essentially, "auto-parallelisation" is the term I was looking for, and it does seem to be implemented in some compilers. Your answer isn't really what I was looking for (although interesting in its own right). – Ben Jan 31 '21 at 15:16
2

Introducing parallel execution (not the same as concurrency, as noted in the comments) is actually done quite a bit at runtime.

It sounds like what you're most interested in is instruction level parallelism where two instructions are running simultaneously, finishing in half(ish) the time it would take them to run sequentially. Specifically, you're talking about out-of-order execution where independent steps can be run in any order. I think other answers have done a good job of talking about why this is difficult, but I wanted to give a few examples of where the generic idea of automatic parallelization is actually being used successfully.

Specifically, if you do this during runtime you can use heuristic data based on what the program is actually doing, rather than (much more complicated) trying to analyze the program statically.

Speculative execution/branch prediction: We have a normal order of (calculate which branch to take)->(execute that branch). But we can sometimes parallelize this so that we (calculate which branch to take), (execute branch a), (execute branch b) simultaneously--and then only apply the results of whichever branch it turned out that we needed to take. Sometimes this can be additionally optimized by recognizing that certain branches are much more likely than others (maybe a loop repeats hundreds of times before choosing the exit-loop branch) and by analyzing the program during runtime the computer can make choices about which branch to run based on how likely each branch is.
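A toy model of that idea in Python (the branch functions and the slow condition are invented for illustration, and the branches must be side-effect-free for this to be safe): start both branches speculatively while the condition is still being evaluated, then keep only the result the condition selects.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def branch_taken():
    return "taken"

def branch_not_taken():
    return "not taken"

def slow_condition():
    time.sleep(0.05)   # deciding which branch to take is itself slow
    return True

with ThreadPoolExecutor(max_workers=2) as pool:
    # speculatively start BOTH branches before the condition is known
    fut_a = pool.submit(branch_taken)
    fut_b = pool.submit(branch_not_taken)
    # keep one result; the other is computed but simply discarded
    result = fut_a.result() if slow_condition() else fut_b.result()

print(result)  # taken
```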

Beyond just parallelizing your actual program, we can parallelize the work around running your program. Just-in-time compilation: Java used to be S-L-O-W but now is (reasonably) fast... by parallelizing running the program and compiling the program. The JIT compiler even does runtime analysis to spend more time optimizing the compilation of the areas of the program that are spending more time in execution and skips areas that aren't spending much or any time in execution.

Some of this can be done at a hardware level, too...such as instruction pipelining. If different parts of a processor are used for read, execute, and write, then a processor may be reading the next (predicted) instruction while executing the current one and writing the results from the previous one.

user3067860
  • 773
  • 5
  • 7
0

No. Yes, sort of, but not the way you think.

For procedural languages like C/C++ or Pascal or Java it is generally not possible to simply run functions in parallel because functions can have state which can end up affecting the result depending on the order of invocation.

Languages like Go allow you to mark function calls as able to be executed in parallel, as so-called goroutines (the keyword is simply go - the first part of the language's name). However, goroutines can't really be used in the same way as regular functions, since they are executed asynchronously. Unlike multithreading in languages like C, not all goroutine invocations run in a separate thread: Go automatically decides how many threads are needed to execute the goroutines, which makes them a lot easier to use than raw threads.

However there are languages which can sort of do what you want. But it's not because the compiler can do it. It is because the language allows compilers to make such assumptions. These are pure functional languages: Lisp, Haskell, Erlang etc.

The reason pure functional languages can do this is that they don't have variables. They only have constants (so much so that most of these languages call their constants "variables"). Having no variables means there are no state changes, which means compilers can assume they can call any function in any order and the result will be the same, because results depend only on the arguments to functions.

Google's use of Lisp's map and reduce to search through their early database of websites is a good example of automatic parallelisation. Because each invocation of the map callback can be executed independently, what Google originally did to scale was run their original algorithm on a version of Lisp with a parallel map. They didn't have to change their algorithm - they just had to use a different interpreter/compiler.
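The general pattern can be sketched in Python (the `score` function and the data are made up for illustration): because the mapped function is pure, swapping the sequential `map` for a pool's `map` changes nothing except how the work is scheduled.

```python
from functools import reduce
from concurrent.futures import ThreadPoolExecutor

def score(page):                  # pure: depends only on its argument
    return len(page)

pages = ["foo", "barbar", "bazbazbaz"]

# sequential map/reduce
sequential = reduce(lambda a, b: a + b, map(score, pages))

# parallel map/reduce: only the map implementation changes
with ThreadPoolExecutor() as pool:
    parallel = reduce(lambda a, b: a + b, pool.map(score, pages))

print(sequential, parallel)  # 18 18
```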

slebetman
  • 1,394
  • 9
  • 9