TL;DR: Yes, but not for most tasks. That's why the current iteration of this idea is hybrid CPUs with some performance cores and some efficiency cores, now that we have the transistor budget to put that many cores onto a consumer laptop/desktop CPU.
"But can't a system with many simple cores be more efficient than one with a single-digit number of advanced cores for some tasks?"
"For some tasks" is the rub. They're a lot worse for many other tasks, ones that haven't been or can't easily be parallelized. For looping over a medium-sized array, for example, it's often not worth talking to other CPU cores about doing some of the work, because the latency involved is comparable to the time it would take just doing the work in a single thread.
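A rough sketch of that overhead, using Python threads (which have far higher hand-off costs than raw inter-core latency, but the shape of the tradeoff is the same): splitting a medium-sized sum across 4 worker threads versus just doing it in one thread. The chunk sizes and thread count here are arbitrary illustrative choices.

```python
# Illustrative only: compare summing a "medium-sized" array in one
# thread vs. splitting it across 4 threads. The coordination cost
# often swamps the work itself at this size.
import time
from concurrent.futures import ThreadPoolExecutor

data = list(range(10_000))

t0 = time.perf_counter()
single = sum(data)                       # one thread, no coordination
t_single = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = [data[i::4] for i in range(4)]   # 4 strided chunks
    parallel = sum(pool.map(sum, chunks))     # distribute, then combine
t_parallel = time.perf_counter() - t0

assert single == parallel   # same answer either way
print(f"single: {t_single*1e6:.0f} us, 4 threads: {t_parallel*1e6:.0f} us")
```

On most machines the threaded version loses badly at this problem size; only for much larger arrays does distributing the work pay for itself.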
And a "many simple-ish CPU" processor is worse than GPUs for tasks that are highly parallel, and don't have much data-dependent branching. I.e. where high latency can be tolerated to achieve the high throughput per-power and per-die-area GPUs can provide. So as other answers have pointed out, the middle ground between throughput-optimized GPUs and latency-optimized CPUs isn't very big in terms of commercial demand. (Stuff like CPU SIMD does it well enough for most things, although with power-efficiency becoming ever more important, there is room for hybrid CPUs with some efficiency cores.)
Per-thread performance is very important for things that aren't embarrassingly parallel. (And also because memory bandwidth and cache footprint scale somewhat with the number of threads for many workloads, so there's a lower limit on how simple/small you'd want to make each core without a totally different architecture such as a GPU.)
A system with fewer big cores can use SMT (Simultaneous Multi-Threading, for example hyperthreading) to make those big cores look like 2x as many smaller cores. (Or 4x or 8x, for example in IBM POWER CPUs.) This is not as power-efficient as actually having more smaller cores, but is in the same ballpark. And of course simple OS context-switching makes it possible to run as many software threads as you want on a core, while the reverse is not possible: there's no simple way to use lots of simple cores to run one thread fast.
There are diminishing returns from making a single core ever bigger and more complex. The opposite of this question, Why not make one big CPU core?, has answers covering that tradeoff.
Related: Modern Microprocessors A 90-Minute Guide! has a section on SMT and multi-core, and is excellent background reading on CPU design constraints like power.
Making large cache-coherent systems is hard, which limits scaling to huge numbers of CPU cores. The biggest Xeon and Epyc chips pack 56 or 64 physical cores onto one die.
Compare this with Xeon Phi compute cards, which were pretty much what you're wondering about: AVX-512 bolted onto low-power Silvermont cores, going up to 72 cores per card with some high-bandwidth memory. (And 4-way SMT to hide ALU and memory latency, so it actually supported 4x that many threads.)
They discontinued that line in 2018 due to lack of demand. This article says it's "never seen any commercial success in the market". You couldn't get big speedups from running existing binaries on it; code generally needed to be recompiled to take advantage of AVX-512. (I think Intel's toolchain was supposed to be able to auto-parallelize some loops, so source changes might have been smaller or less necessary than for using GPUs.) And it omitted AVX-512BW, so it wasn't good for high-quality video encoding (x264/x265 as opposed to fixed-function hardware); I think it was primarily good for FP work, which means it was competing with GPUs. (Some of the reason may also have been Intel shifting effort to a new from-the-ground-up architecture for "exascale" computing, after seeing how the computing landscape evolved since the start of the Larrabee project in the mid 2000s; Larrabee was originally designed as a more-programmable GPU based on x86 cores(!), but evolved into Xeon Phi once GPGPU did that job better and Intel found a more appropriate niche.)
Hybrid / heterogeneous CPUs: some fast cores, some efficient cores
The latest iteration of your idea is to have a mix of cores, so you can still have a few fast cores for serial / latency-sensitive stuff.
Some code is only somewhat parallel, or has a few different threads doing separate tasks that are individually serial. (Not really distributing one task across many threads.)
ARM has been doing that for a while (calling it big.LITTLE), and Intel's new Alder Lake design with a mix of Performance (Golden Cove) cores and Efficiency (Gracemont) cores is exactly this: add some relatively-small throughput-optimized cores that don't push so far into the diminishing returns for spending more power to increase per-thread throughput.
So when doing "light weight" work where an E-core is enough to keep up with something that isn't useful to do faster (like playing a video or typing / clicking around on a web page), only that small core needs to be powered up.
Or when doing some number crunching / video encoding / whatever with lots of thread-level parallelism, 4 E-cores in the area of one P-core give you more total throughput. But you still have some P-cores for tasks that aren't (or simply haven't been) parallelized.
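A back-of-envelope version of that area argument (the 4-cores-per-P-core-area figure is roughly what Intel claimed for Gracemont vs. Golden Cove; the per-thread ratio of 0.6 is just an assumed illustrative number, not a measurement):

```python
# Assumed numbers, not measurements: if one E-core delivers ~0.6x the
# per-thread throughput of a P-core, but four E-cores fit in the die
# area of one P-core, the E-core cluster wins on total throughput
# per area -- for work that actually scales across threads.
e_per_thread = 0.6        # assumed: E-core throughput relative to a P-core
e_cores_per_p_area = 4    # rough area ratio, Gracemont vs. Golden Cove
cluster_throughput = e_per_thread * e_cores_per_p_area
assert cluster_throughput > 1.0   # beats one P-core in the same area
print(cluster_throughput)
```

The flip side of the same arithmetic: a single thread pinned to one of those E-cores only runs at 0.6x speed, which is why you still want some P-cores around.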
(I wrote in more detail about Alder Lake on superuser.com.)
Even the E cores on Alder Lake do fairly wide superscalar out-of-order exec, and can have quite good throughput on code where the instruction-level parallelism is easy for the CPU to find. (On ARM big.LITTLE, the little cores are often in-order, but still 3-wide superscalar with stuff like hit-under-miss caches to find some memory-level parallelism. e.g. Cortex-A53)
For most systems for general workloads, it's not commercially viable to not have any latency-optimized cores that have high single-threaded performance. Many tasks don't easily parallelize, or simply haven't been because that's much more programming effort. (Although low-end smartphones sometimes only use low-end cores; people would rather have cheap and slow than no phone at all, and power efficiency matters even more than for laptops.)
Previous many-small-cores CPUs:
I already mentioned Xeon Phi, but years before that another interesting example was Sun UltraSPARC T1, aka Niagara, released in 2005.
It was an 8-core CPU (or 4 or 6 core for other models), at a time when x86 CPUs were only just starting1 to introduce dual-core such as Athlon X2. They weren't even trying to aim for high per-thread performance, which was essential for most interactive use back then. Instead they were aiming at server / database workloads with many connections so there was already plenty of thread-level parallelism for software, even back then.
Each core had a fairly simple pipeline, and was a "barrel" processor, rotating between up to 4 non-stalled logical cores aka hardware threads. (Basically like 4-way SMT, but in-order so instructions from separate threads never mix in an execution unit.) Keeping cores small and simple made locking overhead lower, I think.
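The rotation policy above can be sketched as a toy round-robin scheduler: each cycle, issue from the next hardware thread in order, skipping any that are stalled (e.g. waiting on a cache miss). This is a simplified illustration, not Niagara's actual issue logic.

```python
# Toy model of a barrel processor's thread rotation: one instruction
# issued per cycle, round-robin over hardware threads, skipping
# stalled ones.
def barrel_schedule(stalled_by_cycle, n_threads=4):
    """stalled_by_cycle: one set of stalled thread ids per cycle.
    Returns the thread id issued each cycle (None if all stalled)."""
    order = []
    nxt = 0  # next thread to consider
    for stalled in stalled_by_cycle:
        issued = None
        for i in range(n_threads):
            cand = (nxt + i) % n_threads
            if cand not in stalled:
                issued = cand
                nxt = (cand + 1) % n_threads
                break
        order.append(issued)
    return order

# Thread 2 stalled (say, on a cache miss) for the first three cycles:
print(barrel_schedule([{2}, {2}, {2}, set()]))  # -> [0, 1, 3, 0]
```

As long as at least one thread is runnable, the pipeline stays busy every cycle; that's how the design hides memory latency without out-of-order execution.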
32 logical cores was a huge number back in 2005. (And later generations were multi-socket capable, allowing 2x or 4x that in a whole system.) The wiki article mentions that without special compiler options (to auto-parallelize, I assume), it left a lot of performance on the table for workloads that weren't already parallelized, like gzip
or MySQL (back in 2005, and IDK if they were talking about a single big query or what). So that's the downside of having many weak cores, especially if you really lean into it like Sun did by making it with not much cache, and depending on the barrel processor to hide latency.
Footnote 1: Part of that was that most of the x86 market was for machines that would run Windows, and until Windows XP the mainstream versions weren't SMP-capable IIRC. But servers had been multi-socket for a long time, achieving SMP by having multiple physical CPU packages in separate sockets. But primarily it was that transistor budgets were still at a point where more cache and wider/deeper OoO exec were still providing significant gains.
Parallelization of many tasks is hard
Communication between cores is always going to be pretty expensive (at least high-latency), partly because the physical distances are inherently large, and partly because of the out-of-order execution cores use to hide memory and L3-cache latency. A core can speculatively see its own actions (e.g. via store forwarding), but stores can't be made visible to other cores speculatively, otherwise you'd have to roll them back in every core that saw them on detecting a branch mispredict. (Which is completely impractical in terms of maintaining a consistent point to roll back to, and would defeat the purpose of having it be a separate core.)
So inter-core communication takes at least the latency of the interconnect and cache-coherency protocol, plus the latency of the store buffer and out-of-order exec window. (And the memory-ordering barriers normally involved limit out-of-order execution somewhat.)
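To get a feel for the gap, here's a sketch that measures the round-trip cost of handing a tiny piece of work to another thread and waiting for the reply, versus just doing it locally. Python's queue round-trip (which includes locks and OS wakeups) is orders of magnitude slower than raw hardware inter-core latency, so treat this as a loose analogy for "communication costs dwarf tiny work items", not as a measurement of the interconnect.

```python
# Loose analogy: round-trip to another thread vs. doing trivial work
# locally. Queue hand-off cost stands in for inter-core communication.
import queue
import threading
import time

req, resp = queue.Queue(), queue.Queue()

def worker():
    while True:
        x = req.get()
        if x is None:          # sentinel: shut down
            return
        resp.put(x + 1)        # the "work": one addition

t = threading.Thread(target=worker)
t.start()

N = 1000
t0 = time.perf_counter()
for i in range(N):
    req.put(i)
    assert resp.get() == i + 1   # wait for the other thread's reply
roundtrip = (time.perf_counter() - t0) / N

t0 = time.perf_counter()
for i in range(N):
    _ = i + 1                    # same work, done locally
local = (time.perf_counter() - t0) / N

req.put(None)
t.join()
print(f"round-trip: {roundtrip*1e6:.1f} us/op, local: {local*1e9:.1f} ns/op")
```

The per-operation round-trip is typically microseconds, while the local addition is nanoseconds, which is why work only gets farmed out to other cores in chunks big enough to amortize the hand-off.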