This left me wondering: how important is multithreading in the current industry scenario?
In performance-critical fields where the performance comes from our own code rather than third-party code doing the heavy lifting, I'd tend to consider things in this order of importance from the CPU's perspective (the GPU is a wildcard I won't go into):
- Memory Efficiency (e.g., locality of reference)
- Algorithmic
- Multithreading
- SIMD
- Other Optimizations (e.g., static branch prediction hints)
Note that this list is not based solely on raw importance but also on a lot of other dynamics: the impact each item has on maintenance, how straightforward each is to apply (if it isn't, it's worth considering further in advance), its interactions with the other items on the list, and so on.
Memory Efficiency
Most might be surprised at my choice of memory efficiency over algorithmic efficiency. It's because memory efficiency interacts with all four other items on this list, and because consideration of it usually falls in the "design" category rather than the "implementation" category. There is admittedly a bit of a chicken-or-egg problem here, since understanding memory efficiency often requires considering all four other items on the list, while all four of them also require considering memory efficiency. Yet it's at the heart of everything.
For example, suppose we need a data structure that offers linear-time sequential access and constant-time insertion at the back, and nothing else, for small elements. The naive choice to reach for would be a linked list. That's disregarding memory efficiency. When we bring memory efficiency into the mix, we end up choosing more contiguous structures in this scenario: growable array-based structures, coarser contiguous nodes linked together (e.g., one storing 128 elements per node), or at the very least a linked list backed by a pool allocator. These have a dramatic edge in spite of having the same algorithmic complexity. Likewise, we often choose quicksort on an array over merge sort in spite of quicksort's inferior worst-case complexity, simply because of memory efficiency.
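To make the contiguous-node idea concrete, here's a minimal sketch of an "unrolled" list. The `UnrolledList` name and the 128-element block size are illustrative, not from any particular library; the point is that sequential traversal touches whole cache lines of useful data instead of chasing one pointer per element.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Each node holds a contiguous block of elements, so sequential traversal
// touches far fewer cache lines than a one-element-per-node linked list.
template <typename T, std::size_t BlockSize = 128>
class UnrolledList {
    struct Node {
        std::array<T, BlockSize> items;
        std::size_t count = 0;
        std::unique_ptr<Node> next;
    };
    std::unique_ptr<Node> head;
    Node* tail = nullptr;
    std::size_t total = 0;

public:
    // Constant-time insertion at the back, amortized over block fills.
    void push_back(const T& value) {
        if (!tail || tail->count == BlockSize) {
            auto node = std::make_unique<Node>();
            Node* raw = node.get();
            if (tail) tail->next = std::move(node);
            else      head = std::move(node);
            tail = raw;
        }
        tail->items[tail->count++] = value;
        ++total;
    }

    std::size_t size() const { return total; }

    // Linear-time sequential access, block by contiguous block.
    template <typename Fn>
    void for_each(Fn fn) const {
        for (Node* n = head.get(); n; n = n->next.get())
            for (std::size_t i = 0; i < n->count; ++i)
                fn(n->items[i]);
    }
};
```

Same big-O as a plain linked list, but the traversal is mostly unit-stride reads, which is where the dramatic practical edge comes from.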
Likewise, we can't have efficient multithreading if our memory access patterns are so granular and scattered that we maximize false sharing while locking at the most granular levels in the code. So memory efficiency multiplies the efficiency of multithreading; it's a prerequisite to getting the most out of threads.
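The false-sharing point can be sketched in a few lines. This is a minimal illustration assuming a 64-byte cache line (that value is hardware-dependent), and `PaddedCounter`/`count_in_parallel` are hypothetical names:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Per-thread counters. Without the alignas(64) padding, adjacent counters
// can land on the same cache line, and two threads incrementing their
// "own" counters still ping-pong that line between cores (false sharing).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

long count_in_parallel(int num_threads, long increments_per_thread) {
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&, t] {
            for (long i = 0; i < increments_per_thread; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```

The result is identical either way; only the memory layout decides whether the threads actually run independently.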
Every other item on the list has a complex interaction with data, and focusing on how data is represented is ultimately in the vein of memory efficiency. Each of them can be bottlenecked by an inappropriate way of representing or accessing data.
Another reason memory efficiency is so important is that it can apply throughout an entire codebase. Generally, when people imagine that inefficiencies accumulate from little bits of work here and there, it's a sign that they need to grab a profiler. Yet in low-latency fields, or ones dealing with very limited hardware, profiling sessions can genuinely indicate no clear hotspots (just time dispersed all over the place) in a codebase that's blatantly inefficient in the way it allocates, copies, and accesses memory. This is typically about the only time an entire codebase can be susceptible to a performance concern that leads to a whole new set of standards applied throughout it, and memory efficiency is often at the heart of the matter.
Algorithmic
This one's pretty much a given, as the choice of sorting algorithm can make the difference between a massive input taking months to sort versus seconds. It makes the biggest impact of all when the choice is between, say, a really sub-par quadratic or cubic algorithm and a linearithmic one, or between linear and logarithmic or constant time, at least until we have something like 1,000,000-core machines (at which point memory efficiency would become even more important).
It's not at the top of my personal list, however, since anyone competent in their field would know, for example, to use an acceleration structure for frustum culling. We're saturated with algorithmic knowledge, and knowing things like using a variant of a trie, such as a radix tree, for prefix-based searches is baby stuff. Lacking this kind of basic knowledge of the field we're working in, algorithmic efficiency would certainly rise to the top, but often algorithmic efficiency is trivial.
Also, inventing new algorithms can be a necessity in some fields (e.g., in mesh processing I've had to invent hundreds, since they either didn't exist before or the implementations of similar features in other products were proprietary secrets, not published in any paper). However, once we're past the problem-solving part and have found a way to get correct results, and once efficiency becomes the goal, the only way to really gain it is to consider how we're interacting with data (memory). Without understanding memory efficiency, the new algorithm can become needlessly complex through futile efforts to make it faster, when the only thing it needed was a little more consideration of memory efficiency to yield a simpler, more elegant algorithm.
Lastly, algorithms tend to be more in the "implementation" category than memory efficiency. They're often easier to improve in hindsight, even when a sub-optimal algorithm was used initially. For example, an inferior image processing algorithm is often implemented in just one local place in the codebase and can be swapped out with a better one later. However, if all image processing algorithms are tied to a Pixel interface with a sub-optimal memory representation, and the only way to correct it is to change the way multiple pixels are represented (not just one), then we're often SOL and will have to rewrite much of the codebase toward an Image interface. The same goes for replacing a sorting algorithm: it's usually an implementation detail, while a complete change to the underlying representation of the data being sorted, or the way it's passed through messages, might require interfaces to be redesigned.
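A rough sketch of the two interface granularities, with `Pixel` and `Image` as illustrative names (the internal layout shown is just one possibility, not a recommendation):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-pixel interface: the memory representation of a single pixel leaks
// into every algorithm that touches one, so changing it later means
// rewriting all of those algorithms.
struct Pixel {
    float r, g, b, a;   // locked-in, per-pixel representation
};

// Image-level interface: algorithms see whole images, so the internal
// layout (interleaved, planar, half-float, ...) stays a private
// implementation detail that can change without redesigning callers.
class Image {
    std::vector<std::uint8_t> data_;   // representation is private
    int width_ = 0, height_ = 0;
public:
    Image(int w, int h)
        : data_(static_cast<std::size_t>(w) * h * 4), width_(w), height_(h) {}
    int width() const  { return width_; }
    int height() const { return height_; }
    void fill(std::uint8_t v) { std::fill(data_.begin(), data_.end(), v); }
    std::uint8_t sample(int x, int y, int channel) const {
        return data_[(static_cast<std::size_t>(y) * width_ + x) * 4 + channel];
    }
};
```

The design question is where the abstraction boundary sits, not which container is used behind it.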
Multithreading
Multithreading is a tough one in the context of performance since it's a micro-level optimization playing to hardware characteristics, but our hardware is really scaling in that direction. Already I have peers who have 32 cores (I only have 4).
Yet multithreading is among the most dangerous micro-optimizations known to a professional when the purpose is to speed up software. The race condition is pretty much the deadliest bug possible, since it's so nondeterministic in nature (it might only show up once every few months on a developer's machine, at the most inconvenient time, outside of a debugging context, if at all). So it arguably has the most negative impact on maintainability and the potential correctness of code among all of these items, especially since multithreading bugs can easily fly under the radar of even the most careful testing.
Nevertheless, it's becoming so important. While it may still not always trump something like memory efficiency (which can sometimes make things a hundred times faster) given the number of cores we have now, we're seeing more and more cores. Of course, even with 100-core machines, I'd still put memory efficiency at the top of the list, since thread efficiency is generally impossible without it. A program can use a hundred threads on such a machine and still be slow lacking efficient memory representation and access patterns (which will tie in to locking patterns).
SIMD
SIMD is also a bit awkward since the registers keep getting wider, with plans to get wider still. Originally we saw 64-bit MMX registers, followed by 128-bit XMM registers capable of 4 single-precision floating-point operations in parallel. Now we're seeing 256-bit YMM registers capable of 8 in parallel, and there are already plans for 512-bit registers that would allow 16 in parallel.
These gains interact with and multiply the efficiency of multithreading. Yet SIMD can degrade maintainability just as much as multithreading. Even though SIMD bugs aren't necessarily as difficult to reproduce and fix as a deadlock or race condition, portability is awkward: ensuring that the code runs on everyone's machine, and uses the appropriate instructions based on each machine's hardware capabilities, takes real effort.
Another thing is that while compilers today usually don't beat expertly written SIMD code, they easily beat naive attempts. They might improve to the point where we no longer have to do it manually, or at least not so manually as writing intrinsics or straight-up assembly (perhaps with just a little human guidance).
Again though, without a memory layout that's efficient for vectorized processing, SIMD is useless. We'll end up just loading one scalar field into a wide register only to do one operation on it. At the heart of all of these items is a dependency on memory layouts to be truly efficient.
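As a concrete illustration of layout enabling vectorization, here's a minimal structure-of-arrays sketch (the `ParticlesSoA` name and its fields are hypothetical). A compiler can typically auto-vectorize the unit-stride loop over a single field; an array-of-structs layout would force strided loads that waste most of each wide register on fields the loop never touches.

```cpp
#include <cstddef>
#include <vector>

// Structure-of-arrays: each field sits in its own contiguous array, so a
// loop over one field fills entire vector registers with useful data.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// Unit-stride loads of x only; a good candidate for auto-vectorization.
void translate_x(ParticlesSoA& p, float dx) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += dx;
}
```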
Other Optimizations
These are what I'd suggest we start calling "micro" optimizations nowadays, if the word is to suggest not just going beyond an algorithmic focus but making changes that have a minuscule impact on performance.
Often, for example, optimizing for branch prediction actually requires a change in algorithm or memory layout. If it's attempted merely through hints and rearranging code for static prediction, that tends to improve only the first-time execution of the code, making the effects questionable if not outright negligible.
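For instance, a data-level fix often beats hints: partitioning the data first makes a formerly unpredictable branch almost perfectly predictable without changing the result. A sketch, with an illustrative function name and threshold:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sum only the values >= 128. Over random bytes, the branch inside the
// loop is unpredictable; partitioning the data first groups all the
// taken cases together, so the branch mispredicts roughly once, at the
// boundary. Same answer, different data arrangement.
std::int64_t sum_large(std::vector<std::uint8_t> data) {
    std::partition(data.begin(), data.end(),
                   [](std::uint8_t b) { return b >= 128; });
    std::int64_t sum = 0;
    for (std::uint8_t b : data)
        if (b >= 128) sum += b;
    return sum;
}
```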
Back to Multithreading for Performance
So anyway, how important is multithreading in a performance context? On my 4-core machine, it can ideally make things about 5 times faster (that's what I can get with hyperthreading). It would be considerably more important to my colleague with 32 cores, and it will only become more important in the years to come.
So it's pretty important. But it's useless to just throw a bunch of threads at the problem if the memory efficiency isn't there to allow locks to be used sparingly, to reduce false sharing, etc.
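That intuition can be made concrete with Amdahl's law: if only a fraction p of the work parallelizes (the rest being serialized by locks, allocation, memory bandwidth, and so on), n cores give a speedup of 1 / ((1 - p) + p / n). A tiny sketch:

```cpp
// Amdahl's law: even a small serial fraction (e.g. lock contention
// caused by poor memory granularity) caps what extra cores can buy.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.95, even 32 cores yield only about a 12.5x speedup, which is why shrinking the serialized portion (often a memory-efficiency problem) matters more than adding threads.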
Multithreading Outside of Performance
Multithreading isn't always about sheer performance in a straightforward throughput kind of sense. Sometimes it's used to balance a load even at the possible cost of throughput to improve responsiveness to the user, or to allow the user to do more multitasking without waiting for things to finish (ex: continue browsing while downloading a file).
In those cases, I'd suggest that multithreading rises even higher toward the top (possibly even above memory efficiency), since it's then about user-end design rather than about getting the most out of the hardware. It's often going to dominate interface designs and the way we structure entire codebases in such scenarios.
When we're not simply parallelizing a tight loop over a massive data structure, multithreading falls into the really hardcore "design" category, and design always trumps implementation.
So in those cases, I'd say considering multithreading upfront is absolutely critical, even more than memory representation and access.