Of course the condition is checked every single time. You cannot avoid this.
Branch prediction and many of the other tricks that modern CPUs employ are all about doing as much work as possible in parallel. One of the main mechanisms for achieving this parallelism is the CPU pipeline. When the pipeline is full, you get the most benefit; when it is not, performance suffers.
Usually, a condition is immediately followed by a conditional branch instruction, which will either branch or not branch, depending on the result of the condition. This means that there are two different streams of instructions that the CPU might need to follow.
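To make that concrete, here is a sketch of the compare-and-branch pair a compiler typically emits for a simple if. The exact instructions depend on the compiler, target, and optimization level (an optimizer may even replace this particular branch with a branchless conditional move), so the assembly in the comments is schematic:

```c
#include <stdio.h>

/* The condition below becomes a compare followed by a conditional
   branch: exactly the instruction pair described above. */
int clamp_negative_to_zero(int x)
{
    if (x < 0)   /*   cmp  <x>, 0   ; evaluate the condition        */
        x = 0;   /*   jge  skip     ; conditionally branch over...  */
                 /*   mov  <x>, 0   ; ...the "condition true" path  */
    return x;    /* skip: ...                                       */
}

int main(void)
{
    printf("%d\n", clamp_negative_to_zero(-5)); /* prints 0 */
    printf("%d\n", clamp_negative_to_zero(7));  /* prints 7 */
    return 0;
}
```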
Unfortunately, immediately after loading the condition instruction and the branch instruction, the CPU does not know yet what the condition will evaluate to, but it still has to keep loading stuff into the pipeline. So it picks one of the two sets of instructions based on a guess ("prediction") as to what the condition will evaluate to.
Later on, the CPU finds out whether its guess was right or wrong.
- If the guess turns out to be right, then the branch went to the correct place, and the right instructions were loaded into the pipeline, so all is good.
- If it turns out that the guess was wrong, then all the instructions loaded into the pipeline after the conditional branch were wrong; they must be discarded, and instruction fetching must start over from the right place (the sketch after this list tries to make that cost measurable).
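You can observe this cost from ordinary code. The classic demonstration is to run the same branchy loop over random data (where the predictor guesses wrong about half the time) and over sorted data (where it is almost always right). This is only a sketch: the array size and threshold are arbitrary choices, clock() is a coarse timer, and an aggressive optimizer may vectorize the loop or emit a conditional move, which hides the effect:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

/* The branch inside this loop is what we are measuring. */
static long sum_big(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= 128)  /* ~50/50 on random data, predictable on sorted */
            sum += a[i];
    return sum;
}

static int cmp_int(const void *p, const void *q)
{
    return *(const int *)p - *(const int *)q;
}

int main(void)
{
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_big(a, N);            /* random order: frequent mispredictions */
    clock_t t1 = clock();

    qsort(a, N, sizeof a[0], cmp_int);  /* sorting makes the branch predictable */

    clock_t t2 = clock();
    long s2 = sum_big(a, N);
    clock_t t3 = clock();

    printf("unsorted: %ld ticks (sum=%ld)\n", (long)(t1 - t0), s1);
    printf("sorted:   %ld ticks (sum=%ld)\n", (long)(t3 - t2), s2);
    return 0;
}
```

On typical hardware the sorted run is noticeably faster, even though it performs exactly the same arithmetic.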
Amendment
In response to StarWeaver's comment, to give an idea of what the CPU has to do in order to execute a single instruction:
Consider something as simple as MOV AX,[SI+10], which we humans naïvely think of as "load AX with the word at SI plus 10". Roughly, the CPU has to:
- emit the contents of PC (the "program counter register") to the address bus;
- read the instruction opcode from the data bus;
- increment PC;
- decode the opcode to discover that it is supposed to be followed by an operand;
- emit the contents of PC to the address bus;
- read the operand (in this case 10) from the data bus;
- increment PC;
- feed the operand and SI to the adder;
- emit the result of the adder to the address bus;
- read the word from the data bus into AX.
This is a whopping 10 steps. Some of these steps will be optimized away even in non-pipelined CPUs; for example, the CPU will almost always increment PC in parallel with the next step, which is easy to do because PC is a very, very special register that is never used for any other job, so there is no possibility of contention between different parts of the CPU for access to it. But still, we are left with 8 steps for such a simple instruction, and note that I am already assuming some degree of sophistication on the part of the CPU: for example, I am assuming that no whole extra step is needed for the adder to actually carry out the addition before its result can be read, and that the adder's output can be sent directly to the address bus without being stored in some intermediate internal addressing register.
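If it helps to see those steps as code, here is the same sequence written out as straight-line C on a made-up toy machine, with the buses modeled as ordinary variables. Everything here (the word-addressed memory, the register widths) is invented purely for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* A toy, word-addressed machine: the buses are plain variables and each
   numbered comment matches a step from the list above. */
uint16_t memory[65536];
uint16_t PC, SI, AX;

static void mov_ax_si_plus_10(void)
{
    uint16_t address_bus, data_bus, opcode, operand;

    address_bus = PC;               /* 1.  emit PC to the address bus      */
    data_bus = memory[address_bus]; /* 2.  read the opcode                 */
    opcode = data_bus;
    PC++;                           /* 3.  increment PC                    */
    (void)opcode;                   /* 4.  decode: an operand follows      */
    address_bus = PC;               /* 5.  emit PC to the address bus      */
    data_bus = memory[address_bus]; /* 6.  read the operand (here, 10)     */
    operand = data_bus;
    PC++;                           /* 7.  increment PC                    */
    address_bus = SI + operand;     /* 8-9. feed SI and the operand to the
                                            adder; its result drives the
                                            address bus                    */
    data_bus = memory[address_bus]; /* 10. read the word into AX           */
    AX = data_bus;
}

int main(void)
{
    SI = 100;
    memory[SI + 10] = 42;  /* the word our instruction will fetch */
    memory[1] = 10;        /* the instruction's operand, after the opcode */
    PC = 0;

    mov_ax_si_plus_10();
    printf("AX = %d\n", AX);  /* prints AX = 42 */
    return 0;
}
```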
Now, consider that there exist more complicated addressing modes, like MOV AX,[EDX+ESI*4+10], and even far more complicated instructions, like MUL (which implicitly multiplies AX by its operand), which actually perform loops inside the CPU to calculate their result.
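To illustrate what "loops inside the CPU" can look like, here is the textbook shift-and-add algorithm that early microcoded multipliers were based on. This is an illustrative sketch, not actual microcode from any real CPU:

```c
#include <stdint.h>
#include <stdio.h>

/* Shift-and-add multiplication: the kind of internal loop an early,
   microcoded multiply instruction had to run, one iteration per bit
   of the operand. */
static uint32_t mul16(uint16_t a, uint16_t b)
{
    uint32_t product = 0;
    uint32_t shifted = a;

    for (int bit = 0; bit < 16; bit++) { /* one iteration per operand bit */
        if (b & 1)
            product += shifted;          /* conditionally accumulate      */
        shifted <<= 1;                   /* shift the addend              */
        b >>= 1;                         /* advance to the next bit       */
    }
    return product;
}

int main(void)
{
    printf("%u\n", mul16(300, 7));       /* prints 2100 */
    printf("%u\n", mul16(65535, 65535)); /* prints 4294836225 */
    return 0;
}
```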
So, my point here is that the "atomic level" metaphor is far from suitable for the CPU instruction level. It might be suitable at the pipeline-step level, if you don't want to go all the way down to the actual logic-gate level.