Of course the condition is checked every single time. You cannot avoid this.
Branch prediction and many of the other tricks that modern CPUs employ are all about doing as much work as possible in parallel. One of the main mechanisms for achieving this parallelism is the CPU pipeline. When the pipeline is full, you get the most benefit; when it is not, performance suffers.
Usually, a condition is immediately followed by a conditional branch instruction, which will either branch or not branch, depending on the result of the condition. This means that there are two different streams of instructions that the CPU might need to follow.
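To make that concrete, here is a sketch of the compare-and-branch pair a compiler typically emits for a simple if. The exact instructions depend on the compiler, target, and optimization level (an optimizer may even replace this particular branch with a branchless conditional move), so the assembly in the comments is schematic:

```c
#include <stdio.h>

/* The condition below becomes a compare followed by a conditional
   branch: exactly the instruction pair described above. */
int clamp_negative_to_zero(int x)
{
    if (x < 0)   /*   cmp  <x>, 0   ; evaluate the condition        */
        x = 0;   /*   jge  skip     ; conditionally branch over...  */
                 /*   mov  <x>, 0   ; ...the "condition true" path  */
    return x;    /* skip: ...                                       */
}

int main(void)
{
    printf("%d\n", clamp_negative_to_zero(-5)); /* prints 0 */
    printf("%d\n", clamp_negative_to_zero(7));  /* prints 7 */
    return 0;
}
```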
Unfortunately, immediately after loading the condition instruction and the branch instruction, the CPU does not know yet what the condition will evaluate to, but it still has to keep loading stuff into the pipeline. So it picks one of the two sets of instructions based on a guess ("prediction") as to what the condition will evaluate to.
Later on, the CPU finds out whether its guess was right or wrong.
- If the guess turns out to be right, then the branch went to the correct place, and the right instructions were loaded into the pipeline, so all is good.
- If it turns out that the guess was wrong, then all the instructions loaded into the pipeline after the conditional branch were wrong; they must be discarded, and instruction fetching must start over from the right place (the sketch after this list tries to make that cost measurable).
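You can observe this cost from ordinary code. The classic demonstration is to run the same branchy loop over random data (where the predictor guesses wrong about half the time) and over sorted data (where it is almost always right). This is only a sketch: the array size and threshold are arbitrary choices, clock() is a coarse timer, and an aggressive optimizer may vectorize the loop or emit a conditional move, which hides the effect:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

/* The branch inside this loop is what we are measuring. */
static long sum_big(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= 128)  /* ~50/50 on random data, predictable on sorted */
            sum += a[i];
    return sum;
}

static int cmp_int(const void *p, const void *q)
{
    return *(const int *)p - *(const int *)q;
}

int main(void)
{
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_big(a, N);            /* random order: frequent mispredictions */
    clock_t t1 = clock();

    qsort(a, N, sizeof a[0], cmp_int);  /* sorting makes the branch predictable */

    clock_t t2 = clock();
    long s2 = sum_big(a, N);
    clock_t t3 = clock();

    printf("unsorted: %ld ticks (sum=%ld)\n", (long)(t1 - t0), s1);
    printf("sorted:   %ld ticks (sum=%ld)\n", (long)(t3 - t2), s2);
    return 0;
}
```

On typical hardware the sorted run is noticeably faster, even though it performs exactly the same arithmetic.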
Amendment
In response to StarWeaver's comment, to give an idea of what the CPU has to do in order to execute a single instruction:
Consider something as simple as MOV AX,[SI+10], which we humans naïvely think of as "load AX with the word at SI plus 10". Roughly, the CPU has to:
- emit the contents of PC (the "program counter register") to the address bus;
- read the instruction opcode from the data bus;
- increment PC;
- decode the opcode to discover that it is supposed to be followed by an operand;
- emit the contents of PC to the address bus;
- read the operand (in this case 10) from the data bus;
- increment PC;
- feed the operand and SI to the adder;
- emit the result of the adder to the address bus;
- read the word from the data bus into AX.
This is a whopping 10 steps. Some of these steps will be optimized away even in non-pipelined CPUs; for example, the CPU will almost always increment PC in parallel with the next step, which is easy to do because PC is a very, very special register that is never used for any other job, so there is no possibility of contention between different parts of the CPU for access to it. But still, we are left with 8 steps for such a simple instruction, and note that I am already assuming some degree of sophistication on the part of the CPU: for example, I am assuming that no whole extra step is needed for the adder to actually carry out the addition before its result can be read, and that the adder's output can be sent directly to the address bus without being stored in some intermediate internal addressing register.
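If it helps to see those steps as code, here is the same sequence written out as straight-line C on a made-up toy machine, with the buses modeled as ordinary variables. Everything here (the word-addressed memory, the register widths) is invented purely for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* A toy, word-addressed machine: the buses are plain variables and each
   numbered comment matches a step from the list above. */
uint16_t memory[65536];
uint16_t PC, SI, AX;

static void mov_ax_si_plus_10(void)
{
    uint16_t address_bus, data_bus, opcode, operand;

    address_bus = PC;               /* 1.  emit PC to the address bus      */
    data_bus = memory[address_bus]; /* 2.  read the opcode                 */
    opcode = data_bus;
    PC++;                           /* 3.  increment PC                    */
    (void)opcode;                   /* 4.  decode: an operand follows      */
    address_bus = PC;               /* 5.  emit PC to the address bus      */
    data_bus = memory[address_bus]; /* 6.  read the operand (here, 10)     */
    operand = data_bus;
    PC++;                           /* 7.  increment PC                    */
    address_bus = SI + operand;     /* 8-9. feed SI and the operand to the
                                            adder; its result drives the
                                            address bus                    */
    data_bus = memory[address_bus]; /* 10. read the word into AX           */
    AX = data_bus;
}

int main(void)
{
    SI = 100;
    memory[SI + 10] = 42;  /* the word our instruction will fetch */
    memory[1] = 10;        /* the instruction's operand, after the opcode */
    PC = 0;

    mov_ax_si_plus_10();
    printf("AX = %d\n", AX);  /* prints AX = 42 */
    return 0;
}
```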
Now, consider that there exist more complicated addressing modes, like MOV AX,[EDX+ESI*4+10], and even far more complicated instructions, like MUL (which implicitly multiplies AX by its operand), which actually perform loops inside the CPU to calculate their result.
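To illustrate what "loops inside the CPU" can look like, here is the textbook shift-and-add algorithm that early microcoded multipliers were based on. This is an illustrative sketch, not actual microcode from any real CPU:

```c
#include <stdint.h>
#include <stdio.h>

/* Shift-and-add multiplication: the kind of internal loop an early,
   microcoded multiply instruction had to run, one iteration per bit
   of the operand. */
static uint32_t mul16(uint16_t a, uint16_t b)
{
    uint32_t product = 0;
    uint32_t shifted = a;

    for (int bit = 0; bit < 16; bit++) { /* one iteration per operand bit */
        if (b & 1)
            product += shifted;          /* conditionally accumulate      */
        shifted <<= 1;                   /* shift the addend              */
        b >>= 1;                         /* advance to the next bit       */
    }
    return product;
}

int main(void)
{
    printf("%u\n", mul16(300, 7));       /* prints 2100 */
    printf("%u\n", mul16(65535, 65535)); /* prints 4294836225 */
    return 0;
}
```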
So, my point here is that the "atomic level" metaphor is far from suitable for the CPU instruction level. It might be suitable at the pipeline-step level, if you don't want to go all the way down to the actual logic-gate level.