73

I am trying to understand, at a high-level, how single threads run across multiple cores. Below is my best understanding. I do not believe it is correct though.

Based on my reading of Hyper-threading, it seems the OS organizes the instructions of all threads in such a way that they are not waiting on each other. Then the front-end of the CPU further organizes those instructions by distributing one thread to each core, and distributes independent instructions from each thread among any open cycles.

So if there is only a single thread, then the OS will not do any optimization. However, the front-end of the CPU will distribute independent instruction sets among each core.

According to https://stackoverflow.com/a/15936270, a specific programming language may create more or fewer threads, but that is irrelevant to how those threads are handled. The OS and the CPU take care of that, so it happens regardless of the programming language used.

Just to clarify, I am asking about a single thread run across multiple cores, not about running multiple threads on a single core.

What is wrong with my summary? Where and how are a thread's instructions split up among multiple cores? Does the programming language matter? I know this is a broad subject; I am hoping for a high-level understanding of it.

Evorlor
  • 7
    A set of instructions for a single software thread may run on many cores, but not at once. – This company is turning evil. Jun 01 '17 at 17:51
  • 1
    You're mixing software threads (which involve the OS scheduler) and hardware threads or HyperThreading (a CPU feature that makes one core behave like two). – ugoren Jun 01 '17 at 21:44
  • 3
    I have 20 drivers and 4 trucks. How is it possible that one driver can deliver packages with two trucks? How is it possible that one truck can have multiple drivers? The answer to both questions is the same. Take turns. – Eric Lippert Jun 09 '17 at 18:21

3 Answers

95

The operating system offers time slices of CPU to threads that are eligible to run.

If there is only one core, then the operating system schedules the most eligible thread to run on that core for a time slice. After a time slice is completed, or when the running thread blocks on IO, or when the processor is interrupted by external events, the operating system reevaluates what thread to run next (and it could choose the same thread again or a different one).

Eligibility to run is based on some combination of fairness, priority, and readiness, and by this method various threads get time slices, some more than others.

If there are multiple cores, N, then the operating system schedules the most eligible N threads to run on the cores.
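
To make that concrete: a process only benefits from those N cores if it offers the scheduler N runnable threads. Below is a minimal sketch (assuming a POSIX system with pthreads; the thread count and busy loop are arbitrary, purely for illustration) of a program explicitly creating several threads, which the OS then schedules independently onto cores.

/* Minimal sketch: the OS can only spread this work across cores because
 * the program explicitly creates several threads.
 * Compile with something like: gcc demo.c -pthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4   /* arbitrary; the OS decides which cores (if any) run them */

static void *worker(void *arg)
{
    long id = (long)arg;
    volatile long sum = 0;              /* volatile so the busy loop isn't optimized away */
    for (long i = 0; i < 100000000; i++)
        sum += i;
    printf("thread %ld finished\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    /* Each pthread_create adds one schedulable entity; with N eligible
     * threads and N cores, the scheduler can run them all at once. */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}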

Processor Affinity is an efficiency consideration. Each time a CPU runs a different thread than before, it tends to slow down a bit because its cache is warm for the previous thread, but cold to the new one. Thus, running the same thread on the same processor over numerous time slices is an efficiency advantage.

However, the operating system is free to offer one thread time-slices on different CPUs, and it could rotate through all the CPUs on different time slices. It cannot, however, as @gnasher729 says, run one thread on multiple CPUs simultaneously.
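
On Linux a program can also restrict which cores the scheduler may use for it. A minimal sketch (pthread_setaffinity_np is a non-portable GNU extension, and pinning to core 0 here is an arbitrary choice for illustration):

#define _GNU_SOURCE          /* needed for pthread_setaffinity_np and CPU_SET */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);        /* restrict the calling thread to core 0 */

    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
    else
        printf("pinned to core 0; all of this thread's time slices will run there\n");
    return 0;
}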

Hyperthreading is a method in hardware by which a single enhanced CPU core can support execution of two or more different threads simultaneously. (Such a CPU can offer additional threads at lower cost in silicon real-estate than additional full cores.) This enhanced CPU core needs to support additional state for the other threads, such as CPU register values, and also has coordination state & behavior that enables sharing of functional units within that CPU without conflating the threads.

While hyperthreading is technically challenging from a hardware perspective, from the programmer's perspective the execution model is merely that of additional CPU cores rather than anything more complex. So, the operating system sees additional CPU cores, though there are some new processor affinity issues because several hyperthreaded threads share one CPU core's cache architecture.


We might naively think that two threads running on a hyperthreaded core each run half as fast as they would with their own full core. But this is not necessarily the case, since a single thread's execution is full of slack cycles, and some of them can be used by the other hyperthreaded thread. Further, even during non-slack cycles, one thread may be using different functional units than the other, so simultaneous execution can occur. The enhanced CPU for hyperthreading may have a few more of certain heavily used functional units specifically to support that.

Erik Eidt
  • 3
    "Thus, running the same thread on the same processor over numerous time slices is an efficiency advantage." Wouldn't it have to be *contiguous* time slices? Otherwise, the caches would be wiped by other threads, no? +1 for a nice explanation. – jpmc26 Jun 01 '17 at 21:53
  • 1
    @jpmc26, quite so: contiguous time slices on the same CPU core for full affinity advantage! – Erik Eidt Jun 01 '17 at 22:05
  • If there are N cores, they're *each* running the kernel, and they each *independently* pick a task that's ready to run. There's no master controller that decides what *other* cores should be running. (At least that's how Linux's scheduler is designed). It's a distributed algorithm where each core runs the scheduler function when needed, and it chooses a ready thread (whose affinity mask matches that core). Timeslices on different cores are not in lock-step with each other, even if they all have the same max length. "The OS assigning threads to cores" is one way to summarize, though. – Peter Cordes Jun 02 '17 at 00:21
  • 1
    Hyperthreading (and other SMT implementations) are best with code that has a lot of branch-mispredicts or latency-bound operations. If one thread already saturates the front-end issue rate (4 uops per clock on Intel CPUs), then there's not much to gain from running that on two logical cores of the same physical core. In some cases, the more carefully tuned/optimized the code is, the less it benefits from HT. – Peter Cordes Jun 02 '17 at 00:28
  • @PeterCordes There's many independent functional blocks in each core; it may even be impossible to use all of them by a single thread of code. That was the original reasoning behind HT; the actual efficiency does depend on what's running, of course. If one thread does integer addition while the other does FPU or vector operations, they can get pretty close to double the processing speed. High-latency operations benefit just as easily on a single thread, as long as they are sufficiently independent from each other. – Luaan Jun 02 '17 at 08:24
  • This answer misses the mark because the OP is actually asking about out-of-order execution, and the real misunderstanding is just the use of the word "core." – Dan Jun 02 '17 at 09:10
  • 2
    @Luaan: HT is often good, but the situation isn't as simple as you describe. The front-end issue bandwidth (4 uops per clock on Intel, 6 on Ryzen) is equally shared between threads (unless one is stalled). If that's the bottleneck, then like I said HT won't help at all. It's not uncommon for Skylake to come close to that in a well-tuned loop, if there's a mix of loads, ALU and stores... Transistors are cheap (and can't all be switching at once or the CPU would melt), so modern x86 CPUs have more execution ports than the front-end can feed (with many execution units being replicated... – Peter Cordes Jun 02 '17 at 11:45
  • 2
    ... on multiple ports)... This might seem like a waste, but often a loop will only use one kind of ALU execution unit at once, so having duplicates of everything means that whatever kind of code is running, there are multiple ports for its instructions. So the reason you cited for benefiting from HT is not that common, since most code has some loads and/or stores taking up front-end bandwidth, and what's left is often not enough to saturate the execution units. – Peter Cordes Jun 02 '17 at 11:45
  • 2
    @Luaan: Also, in Intel CPUs the integer and FP/vector execution units share the same execution *ports*. For example, the FP FMA/mul/add units are on ports 0/1. But the integer multiplier is also on port1, and simple integer ops can run on any of the 4 execution ports (diagram in my answer). A second thread using up issue bandwidth will slow them both down even if they don't compete for execution units, but often there's a net throughput gain if they don't compete too badly for cache. Even well-tuned high-throughput code like x264/x265 (video encoders) benefit about 15% on Skylake from HT. – Peter Cordes Jun 02 '17 at 11:49
  • 3
    @luaan In addition to what Peter said, your claim that *"That was the original reasoning behind HT"* is incorrect. The original reasoning behind HT was that the NetBurst microarchitecture had lengthened the pipeline to such an extreme extent (for the purposes of driving up the clock speed) that branch mispredictions and other pipeline bubbles absolutely *killed* performance. HT was one of Intel's solutions to minimizing the amount of time that this big expensive chip's execution units sat idle because of bubbles in the pipeline: code from other threads could be inserted and run in those holes. – Cody Gray - on strike Jun 02 '17 at 13:43
  • Unfortunately, NetBurst's trace-cache was so small (and decoders so weak) that HT didn't help a lot in keeping the OOO core busy. I've never optimized anything for netburst myself, but what I've read is that often the best use for a second thread was to run a prefetch loop (as described in [Ulrich Drepper's What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf), which was written in the P4 era. Most of that excellent guide is still useful, but the prefetch-thread advice is outdated now that HW prefetchers are smarter, and HT doesn't suck.) (@Cody) – Peter Cordes Jun 03 '17 at 07:12
  • Hyperthreading is not `can support execution of two or more different threads simultaneously`. It is simply a hardware implementation of context switching. Only one thread actually runs on the CPU at a time but the hardware can quickly switch between two threads without needing to swap values of all registers between threads (context switch). Context switching is expensive. So much so in fact that singlethreading became a performance boost (node.js, netty, twisted etc.). – slebetman Jun 04 '17 at 02:47
  • @slebetman, oh no: I say you're selling hyperthreading short by calling it context switching in hardware! Context switching involves (1) some context/state that is directly usable for execution, (2) an alternate area to save context that is not directly usable for execution, (3) swapping between these areas, first to save existing context to the alternate area, and second to reload the actionable context/state from the alternate area. Because swapping takes effort, it is impossible for context switching to make forward progress on two threads in the same clock cycle. – Erik Eidt Jun 04 '17 at 05:07
  • 1
    @slebetman, hyperthreading isn't context switching, as *the hardware is actually maintaining two (or more) directly usable contexts, concurrently*: there is ***no swapping of contexts*** between a main area and an alternate area. Instead of swapping there is tracking of what belongs to which context, as bit of additional state that goes with operations and operands. And under the right circumstances, the hyperthreads can actually all execute concurrently, e.g. both threads can make forward progress in the same CPU clock cycle by non-overlapping use of otherwise available functional units. – Erik Eidt Jun 04 '17 at 05:07
  • @slebetman: Intel's hyperthreading is a pretty fine-grained SMT implementation. Erik is right: **both threads can have uops in flight in the out-of-order core simultaneously**, so there is true interleaving between threads. The front-end fetch/decode/issue stages alternate between threads (or when one thread is stalled, the other gets every cycle), but the execution units are competitively shared. Several other microarchitectural resources are statically partitioned (including the ROB so one thread can't fill it while stalled on a cache-miss, which would starve the other thread). – Peter Cordes Jun 05 '17 at 22:16
  • @PeterCordes: By Erik's definition hyperthreading is not an SMT implementation. It is a superscalar optimisation across two threads. – slebetman Jun 06 '17 at 00:39
  • 1
    @slebetman: What exactly do you think [the definition of SMT is](https://en.wikipedia.org/wiki/Simultaneous_multithreading), and why don't you think Intel's hyperthreading is SMT? The only way it could be any finer-grained would be if the front-end could issue a mix of uops from both threads every cycle, instead of alternating issuing groups of 4 every cycle when both threads have 4 uops ready to issue. If one thread is running low-IPC code, it will maybe only issue a group of 4 uops every 3 or 4 cycles, while the other thread gets the rest of the front-end cycles, so there's little to gain. – Peter Cordes Jun 06 '17 at 01:30
  • @slebetman: The wikipedia article's wording doesn't support my argument very well. The "taxonomy" (definitions) section seems to be talking about SMT for in-order cores, because they say "In simultaneous multithreading, instructions from more than one thread can be executed in any given pipeline stage at a time." But then in the taxonomy section, they only talk about "issue", not what happens once the out-of-order core gets the uops. But then later in the article, all iterations of Intel's Hyperthreading (P4, Atom, Nehalem/SnB, and Xeon Phi) are described as SMT. – Peter Cordes Jun 06 '17 at 01:41
29

There is no such thing as a single thread running on multiple cores simultaneously.

It doesn't mean, however, that instructions from one thread cannot be executed in parallel. There are mechanisms called instruction pipelining and out-of-order execution that allow it. Each core has a lot of redundant resources that are not utilized by simple instructions, so multiple such instructions can be run together (as long as the next one doesn't depend on the previous result). However, this still happens inside a single core.
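
As a rough illustration (a sketch, not a benchmark; exact speedups depend on the CPU and compiler), the two functions below do the same number of floating-point additions, but the second exposes two independent dependency chains that the out-of-order machinery inside one core can overlap:

double sum_one_chain(const double *x, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += x[i];                 /* every add must wait for the previous add */
    return s;
}

double sum_two_chains(const double *x, long n)   /* assumes n is even */
{
    double s0 = 0.0, s1 = 0.0;
    for (long i = 0; i < n; i += 2) {
        s0 += x[i];                /* chain 0 */
        s1 += x[i + 1];            /* chain 1: independent of chain 0 */
    }
    return s0 + s1;                /* may differ slightly from the serial sum due to FP rounding */
}

Because floating-point addition has a latency of several cycles, the first loop is limited to one add per add-latency, while the second can keep two adds in flight at once — and all of this happens inside a single core.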

Hyper-threading is a kind of extreme variant of this idea, in which one core not only executes instructions from one thread in parallel, but mixes instructions from two different threads to optimize resource usage even further.

Related Wikipedia entries: Instruction pipelining, out-of-order execution.

Frax
  • 3
    They cannot run simultaneously, but they can run in parallel? Are these not the same thing? – Evorlor Jun 01 '17 at 23:42
  • 12
    @Evorlor The key thing here is the difference between a core and an execution unit. A single thread can only run on one core, but a processor can use dynamic analysis to work out which instructions being executed by a core do not depend on each other and execute these on different execution units simultaneously. One core may have several execution units. – user1937198 Jun 01 '17 at 23:55
  • 3
    @Evorlor: An out-of-order CPU can find and exploit the **instruction-level parallelism** within the instruction stream of a single thread. e.g. often the instructions that update a loop counter are independent from some of the other work a loop does. Or in an `a[i] = b[i] + c[i]` loop, each iteration is independent, so loads, adds, and stores from different iterations can be in flight at once. It has to preserve the illusion that the instructions executed in program order, but for example a store that misses in cache doesn't delay the thread (until it runs out of space in the store buffer). – Peter Cordes Jun 02 '17 at 00:09
  • 3
    @user1937198: The phrase "dynamic analysis" would better suit a JIT compiler. Out-of-order CPUs don't *really* analyze; it's more like a greedy algorithm that runs whatever instructions have been decoded&issued and have their inputs ready. (The out-of-order reordering window is limited by a few microarchitectural resources, for example Intel Sandybridge has a ReOrder Buffer size of 168 uops. See also [measuring the ROB size experimentally](http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/)). All implemented with hardware state-machines to handle 4 uops per clock. – Peter Cordes Jun 02 '17 at 00:17
  • @PeterCordes Well it is analysis, it's just very simple analysis. I was trying to express that the instructions are processed during decode and execution by the core rather than being done separately from code execution, without getting into the details of how it actually happens. I felt the reorder buffer is an unnecessary complication for someone struggling with the idea of hardware vs software threads. Maybe I should have said the processor dynamically tracks dependencies to allow sending uops to an execution unit earlier. – user1937198 Jun 02 '17 at 00:41
  • 2
    @user1937198: Yeah, agreed. I was just worried that the phrase you chose made it sound like the "analysis" was saved along with the instructions somewhere for reuse the next time the program was run. I wanted to emphasize that **out-of-order execution can only find ILP on a very local level, only up to a couple hundred instructions, not between two independent loops (unless they're short).** Anyway, I ended up [writing an answer where I could go into way too much detail](https://softwareengineering.stackexchange.com/a/350024/165079), since tuning code for x86 microarchitectures is what I do. – Peter Cordes Jun 02 '17 at 01:23
  • @PeterCordes An interesting variation is Intel's Itanium, which has *explicit* instruction level parallelism - determined by the compiler, rather than the CPU. A JIT compiler on Itanium could do dynamic analysis of OOE well enough. Sadly, Itanium wasn't really popular... :) – Luaan Jun 02 '17 at 08:27
  • 3
    @Luaan yeah, it was an interesting idea, but AOT compilers *still* aren't smart enough to fully exploit it. Also, Linus Torvalds (and others) have argued that exposing that much of the internals of the pipeline is a big constraint on future designs. e.g. you can't really increase the pipeline width without changing the ISA. Or you build a CPU that tracks dependencies the usual way, and maybe issues two VLIW groups in parallel, but then you've lost the CPU-complexity benefit of EPIC but still have the downsides (lost issue bandwidth when the compiler can't fill a word). – Peter Cordes Jun 02 '17 at 12:05
  • 1
    Maybe JIT optimization would be common even for languages like C/C++ if Itanium had caught on. – Peter Cordes Jun 02 '17 at 12:08
  • Doesn't it depend on the kind of processor you're running on? (And also exactly how you define "thread" and "core".) For instance a SIMD processor like a GPU can be running the same thread on many execution units, using different data on each. – jamesqf Jun 03 '17 at 21:38
  • "An out-of-order CPU can find and exploit the instruction-level parallelism within the instruction stream of a single thread" @PeterCordes technically, you can have OOO scalar (NEC SX/5). OOO speaks to scheduling whereas superscalar speaks to parallelism. I think you'd need OOO superscalar for your example. – Olsonist Nov 15 '19 at 20:21
  • 1
    @Olsonist: that's true, but it's only relevant for instructions with latencies > 1. So it's still exploiting ILP to run 1 instruction per clock instead of being limited to less than that by dependencies. This involves having multiple instructions in-flight at once = parallelism. My answer on this question is based on modern x86 CPUs which are superscalar OoO, though, because that's what people most want to know about. – Peter Cordes Nov 15 '19 at 20:24
29

summary: Finding and exploiting the (instruction-level) parallelism in a single-threaded program is done purely in hardware, by the CPU core it's running on. And only over a window of a couple hundred instructions, not large-scale reordering.

Single-threaded programs get no benefit from multi-core CPUs, except that other things can run on the other cores instead of taking time away from the single-threaded task.


> the OS organizes the instructions of all threads in such a way that they are not waiting on each other.

The OS does NOT look inside the instruction streams of threads. It only schedules threads to cores.

Actually, each core runs the OS's scheduler function when it needs to figure out what to do next. Scheduling is a distributed algorithm. To better understand multi-core machines, think of each core as running the kernel separately. Just like a multi-threaded program, the kernel is written so that its code on one core can safely interact with its code on other cores to update shared data structures (like the list of threads that are ready to run).

Anyway, the OS is involved in helping multi-threaded processes exploit thread-level parallelism which must be explicitly exposed by manually writing a multi-threaded program. (Or by an auto-parallelizing compiler with OpenMP or something).
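
For instance, a minimal OpenMP sketch (a hypothetical example; compile with something like gcc -fopenmp, and the array size is arbitrary) where the programmer explicitly marks a loop as parallel, the runtime creates worker threads, and the OS then schedules those threads onto cores:

#include <stdio.h>

#define N (1 << 20)

static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) {       /* serial setup */
        b[i] = i;
        c[i] = 2.0 * i;
    }

    #pragma omp parallel for            /* each thread gets a chunk of the iterations */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}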

> Then the front-end of the CPU further organizes those instructions by distributing one thread to each core, and distributes independent instructions from each thread among any open cycles.

A CPU core is only running one stream of instructions at a time, if it isn't halted (asleep until the next interrupt, e.g. timer interrupt). Often that's a thread, but it could also be a kernel interrupt handler, or miscellaneous kernel code if the kernel decided to do something other than just return to the previous thread after handling an interrupt or system call.

With HyperThreading or other SMT designs, a physical CPU core acts like multiple "logical" cores. The only difference from an OS perspective between a quad-core-with-hyperthreading (4c8t) CPU and a plain 8-core machine (8c8t) is that an HT-aware OS will try to schedule threads to separate physical cores so they don't compete with each other. An OS that didn't know about hyperthreading would just see 8 cores (unless you disable HT in the BIOS, then it would only detect 4).
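
That's also all a program normally sees. A quick sketch (assuming Linux/glibc, where _SC_NPROCESSORS_ONLN reports the number of online logical CPUs):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Logical CPUs: a 4c8t hyperthreaded chip and an 8c8t chip both report 8. */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical CPUs: %ld\n", logical);
    /* Telling physical cores apart from HT siblings needs extra information,
     * e.g. /proc/cpuinfo or CPUID; not shown here. */
    return 0;
}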


The term "front-end" refers to the part of a CPU core that fetches machine code, decodes the instructions, and issues them into the out-of-order part of the core. Each core has its own front-end, and it's part of the core as a whole. Instructions it fetches are what the CPU is currently running.

Inside the out-of-order part of the core, instructions (or uops) are dispatched to execution ports when their input operands are ready and there's a free execution port. This doesn't have to happen in program order, so this is how an OOO CPU can exploit the instruction-level parallelism within a single thread.

If you replace "core" with "execution unit" in your idea, you're close to correct. Yes, the CPU does distribute independent instructions/uops to execution units in parallel. (But there's a terminology mix-up, since you said "front-end" when really it's the CPU's instruction-scheduler aka Reservation Station that picks instructions ready to execute).

Out-of-order execution can only find ILP on a very local level, only up to a couple hundred instructions, not between two independent loops (unless they're short).


For example, the asm equivalent of this

int i=0,j=0;
do {
    i++;
    j++;
} while(42);

will run about as fast as the same loop only incrementing one counter on Intel Haswell. i++ only depends on the previous value of i, while j++ only depends on the previous value of j, so the two dependency chains can run in parallel without breaking the illusion of everything being executed in program order.

On x86, the loop would look something like this:

.loop:
    inc eax
    inc edx
    jmp .loop

Haswell has 4 integer execution ports, and all of them have adder units, so it can sustain a throughput of up to 4 inc instructions per clock if they're all independent. (With latency=1, so you only need 4 registers to max out the throughput by keeping 4 inc instructions in flight. Contrast this with vector-FP MUL or FMA: latency=5 throughput=0.5 needs 10 vector accumulators to keep 10 FMAs in flight to max out the throughput. And each vector can be 256b, holding 8 single-precision floats).
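
To sketch the multiple-accumulator idea in C (a hedged example, not from the original text: it assumes x86 with AVX2+FMA, compiled with something like gcc -O2 -mavx2 -mfma, uses double precision rather than the single-precision case discussed above, uses only four accumulators for brevity where roughly ten chains are needed to saturate Haswell's FMA throughput, and assumes n is a multiple of 16 so no cleanup loop is shown):

#include <immintrin.h>

double dot_fma(const double *x, const double *y, long n)
{
    /* Four independent accumulators = four dependency chains of FMAs,
     * so several FMAs can be in flight at once instead of one serial chain. */
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd();
    __m256d acc3 = _mm256_setzero_pd();

    for (long i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i),      _mm256_loadu_pd(y + i),      acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 4),  _mm256_loadu_pd(y + i + 4),  acc1);
        acc2 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 8),  _mm256_loadu_pd(y + i + 8),  acc2);
        acc3 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 12), _mm256_loadu_pd(y + i + 12), acc3);
    }

    /* Combine the accumulators, then reduce the final vector to a scalar. */
    __m256d sum = _mm256_add_pd(_mm256_add_pd(acc0, acc1),
                                _mm256_add_pd(acc2, acc3));
    double lanes[4];
    _mm256_storeu_pd(lanes, sum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}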

The taken-branch is also a bottleneck: a loop always takes at least one whole clock per iteration, because taken-branch throughput is limited to 1 per clock. I could put one more instruction inside the loop without reducing performance, unless it also reads/writes eax or edx in which case it would lengthen that dependency chain. Putting 2 more instructions in the loop (or one complex multi-uop instruction) would create a bottleneck on the front-end, since it can only issue 4 uops per clock into the out-of-order core. (See this SO Q&A for some details on what happens for loops that aren't a multiple of 4 uops: the loop-buffer and uop cache make things interesting.)


In more complex cases, finding the parallelism requires looking at a larger window of instructions. (e.g. maybe there's a sequence of 10 instructions that all depend on each other, then some independent ones).

The Re-Order Buffer capacity is one of the factors that limits the out-of-order window size. On Intel Haswell, it's 192 uops. (And you can even measure it experimentally, along with register-renaming capacity (register-file size).) Low-power CPU cores like ARM have much smaller ROB sizes, if they do out-of-order execution at all.

Also note that CPUs need to be pipelined as well as out-of-order. So they have to fetch and decode instructions well ahead of the ones being executed, preferably with enough throughput to refill buffers after missing any fetch cycles. Branches are tricky, because we don't know where to even fetch from if we don't know which way a branch went. This is why branch-prediction is so important. (And why modern CPUs use speculative execution: they guess which way a branch will go and start fetching/decoding/executing that instruction stream. When a misprediction is detected, they roll back to the last known-good state and execute from there.)
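
As a small illustration of why prediction matters (a hedged sketch: compilers may turn this into branchless code at higher optimization levels, and exact costs are CPU-dependent), the branch below is cheap when the data is sorted, because the predictor sees long runs of the same outcome, and much slower when the data is shuffled, because roughly half the predictions are wrong and each mispredict throws away speculative work:

#include <stddef.h>

long count_big(const unsigned char *data, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128)   /* predictable on sorted input, ~random on shuffled input */
            count++;
    }
    return count;
}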

If you want to read more about CPU internals, there are some links in the Stackoverflow x86 tag wiki, including to Agner Fog's microarch guide, and to David Kanter's detailed writeups with diagrams of Intel and AMD CPUs. From his Intel Haswell microarchitecture writeup, this is the final diagram of the whole pipeline of a Haswell core (not the whole chip).

This is a block diagram of a single CPU core. A quad-core CPU has 4 of these on a chip, each with their own L1/L2 caches (sharing an L3 cache, memory controllers, and PCIe connections to the system devices).

[Diagram: Haswell full pipeline, from David Kanter's Haswell writeup]

I know this is overwhelmingly complicated. Kanter's article also shows parts of this to talk about the frontend separately from the execution units or the caches, for example.

Peter Cordes
  • 2
    "Finding and exploiting the (instruction-level) parallelism in a single-threaded program is done purely in hardware" Note that this only applies to conventional ISAs, not VLIWs in which ILP is determined completely by the compiler or programmer, or cooperatively between hardware and software. – Hadi Brais May 25 '18 at 18:51
  • (Follows your second quote '>'): *A CPU core is only running one stream of instructions*: Does this line include the possibility of superscalar processor, which in my current understanding run more than one stream in ILP? – Kindred Dec 08 '18 at 13:05
  • So what you mean "front-end" is everything above the 192 Entry ROB in the diagram? – Kindred Dec 08 '18 at 14:01
  • For @Frax's answer, second comment: Is his statement *"A single thread can only run on one core"* correct for HyperThreading? – Kindred Dec 08 '18 at 14:50
  • 1
    @user7813604: yes. Hyperthreading can't parallelize a single thread. It does the reverse: it runs multiple threads on one core, reducing per-thread performance but increasing overall throughput. – Peter Cordes Dec 08 '18 at 15:01
  • @PeterCordes: Thank you so much! Now it's more clear. – Kindred Dec 08 '18 at 15:03
  • 1
    @user7813604: The whole point of ILP is finding which instructions can be run in parallel while still maintaining the *illusion* that each instruction ran in order, each one finishing before the next one starts. A scalar pipelined CPU may need to stall sometimes for dependencies if latency is higher than 1. But it's an even bigger deal for superscalar CPUs. – Peter Cordes Dec 08 '18 at 15:05
  • It seems like my statement in second comment is incorrect because superscalar processor is SISD so only one instr. stream but two fetch at the same time? – Kindred Dec 08 '18 at 15:05
  • But people (may be incorrect) said that ILP is indeed executed preemptively behind the scene. Or you may help me understand the details of this, what I want to know is that whether ILP indeed in parallel, i.e. a++, b++; executed independently and simultaneously, can be finished at the same physical instant. – Kindred Dec 08 '18 at 15:06
  • 1
    @user7813604: yes, my answer literally uses that as an example. Haswell for example can execute up to 4 `inc` instructions in the same clock cycle, to its 4 integer ALU execution units. – Peter Cordes Dec 08 '18 at 15:13
  • 1
    @user7813604: Yes, ILP is how much *can* be executed in parallel. A real CPU will have a limited ability to find and exploit ILP by actually running it in parallel within a single core, e.g. up to 4-wide superscalar in Intel. This answer tries to explain that with examples. – Peter Cordes Dec 08 '18 at 15:18
  • OK, now I understand what you mean, parallelism is basically ILP, I shouldn't care about deeper phrase of parallel. Thank you so much, now I have to change some of my notes. – Kindred Dec 08 '18 at 15:53
  • "Finding and exploiting ... over a window of a couple hundred instructions, ..." What you say is correct but the finding part is shorter and the exploiting part is longer. Haswell can have 192 in-flight instructions. I think the ILP search is within the 60-entry scheduler list. And then these numbers are magnitudes rather than a search window size. A slow load can sit there while hundreds of register instructions fly through. Anyways, there's an availability bias for readers (like me) to conflate all this. I dunno if that list is shared by the hyperthreads but it makes sense that it would be. – Olsonist Aug 26 '19 at 18:58
  • @Olsonist: The RS (scheduler) size is the *instantaneous* window for any given cycle. But uops leave the RS after executing successfully, tracked only by the ROB after that, not both. So one slow instruction like a cache-miss load doesn't stop independent later instructions from being executed. Related: [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](//stackoverflow.com/q/51986046) shows the crossover from fitting in the RS to not, for a weird special case with two chains of equal latency uops. – Peter Cordes Aug 26 '19 at 20:54
  • @Olsonist: The ROB issues and retires strictly in order, so a cache-miss load typically leads to filling the ROB, which is only 224 uops on Skylake. Maybe that's what you meant by "hundreds"? Anyway, that's why it needs to be larger than RS, which lets uops leave out of order (on execution not retirement). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ where ROB vs. PRF size limits are explored. – Peter Cordes Aug 26 '19 at 20:57