Why do we need stalls even if branches can be determined?

Question

I am learning about pipelining and was reading about control hazards from the book Computer Organization and Design: The Hardware/Software Interface (MIPS Edition). There is a paragraph in the book (Chapter 4.6) that has me puzzled:

Let's assume that we put in enough extra hardware so that we can test registers, calculate the branch address, and update the PC during the second stage of the pipeline (see COD Section 4.9 (Control hazards) for details). Even with this extra hardware, the pipeline involving conditional branches would look like the figure below. The lw instruction, executed if the branch fails, is stalled one extra 200 ps clock cycle before starting.

I don't quite understand what the paragraph is saying here. My initial guess was that, even in a hypothetical scenario where you can determine whether the branch is taken and update the program counter within the one clock cycle available before the next instruction must be fetched, we would still need a stall. That doesn't make sense to me: if we know what to do, why not just do it and move on? I am clearly missing something, but I can't piece it together.

Below is the picture referenced in the text:

periblepsis · Accepted Answer · 2023-04-24T23:32:50.747

I don't understand your confusion. Just walk yourself through it step by step, as if this were all on a piece of paper and you were emulating the computer by hand, where you are limited to certain parallel operations and each instruction is limited to a certain series of steps in getting its work done.

An instruction cannot be decoded and its meaning worked out while it is being fetched. Only afterwards.

So imagine you are doing a 'fetch' by hand, but you are not allowed to look at what you are fetching until after you have fetched it and put it into a little decoder box.

Once it is in the box and you are now in the 2nd cycle for this instruction, you are allowed to peek at it and decide your next action for it.

Meanwhile, of course, you also want to fetch the next instruction, in parallel. That instruction will be whatever immediately follows. Before actually decoding the current beq instruction (which you don't yet know is a beq), the only assumption you can make is that execution continues in sequence, so you keep using the PC as the address for the next instruction fetch on the assumption that it will be needed.

Now, finally, you have brought in (fetched) that beq instruction, and only now, on its 2nd cycle, do you find out what it is. Assume that you are allowed to see this fact and do all of the other necessary work to determine whether or not the branch is taken (here the condition is true, so a branch is indicated). You get to do all of that in this 2nd cycle. That's fine. However, in this 2nd cycle of the beq instruction, you are also fetching something from memory in its 1st cycle, using the incremented PC (pointing to the instruction after the beq.)

In this 2nd cycle of the beq, you are now allowed to know that the beq condition is true (they said you have enough hardware for that) and that you need to update the PC so that it will correctly fetch the or (that they show you there in the diagram.) But even if you can completely update the PC register in this 2nd cycle because you designed a lot of hardware to get that done, the new PC value still cannot be used to fetch the or instruction until the next cycle, when a new fetch can begin (because the fetch stage is currently busy fetching the instruction after the beq.)

Note also that the fetch that took place before you could decode the instruction on its 2nd cycle is a wasted fetch. It has to be thrown away, since you now know that you should no longer execute it. (This loss of a fetch is just another way of expressing the waste involved that the authors are trying to get you to follow along about.)
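If paper is not at hand, the same walk-through can be roughly sketched in Python. This is only a toy model of the first two stages (fetch and decode), not the book's datapath. The four-instruction program, the list indices used as addresses, and the function name simulate are all made up for illustration, and the beq is simply assumed to be taken and fully resolved by the end of its 2nd cycle.

```python
# Toy model of only the first two pipeline stages (IF = fetch, ID = decode).
# Assumption: a taken branch is fully resolved by the END of its ID cycle,
# which is the "extra hardware" scenario described in the book.

# Hypothetical program: (name, branch_target_index or None).
PROGRAM = [
    ("add", None),
    ("beq", 3),     # assumed always taken; target is index 3 (the "or")
    ("lw", None),   # sequential successor, fetched before the beq is decoded
    ("or", None),   # branch target
]

def simulate(program, max_cycles=20):
    """Return one (cycle, instruction_in_IF, instruction_in_ID) row per cycle."""
    pc = 0          # index of the next instruction to fetch
    in_id = None    # instruction currently being decoded
    rows = []
    for cycle in range(1, max_cycles + 1):
        # The fetch uses whatever the PC holds at the START of the cycle.
        in_if = program[pc] if pc < len(program) else None
        if in_if is None and in_id is None:
            break   # nothing left in either stage
        rows.append((cycle,
                     in_if[0] if in_if else "-",
                     in_id[0] if in_id else "-"))
        if in_id is not None and in_id[1] is not None:
            # A taken branch was decoded this cycle: redirect the PC and
            # throw away the instruction that was fetched alongside it.
            pc = in_id[1]
            in_id = None    # the squashed fetch becomes a bubble in ID
        else:
            # Normal flow: the fetched instruction moves on to decode.
            in_id = in_if
            if in_if is not None:
                pc += 1
    return rows

for cycle, in_if, in_id in simulate(PROGRAM):
    print(f"cycle {cycle}: IF={in_if:<4} ID={in_id}")
```

In cycle 3 the beq is in decode while the lw is being fetched. In cycle 4 the or is fetched and the decode stage is empty. That empty slot is the one-cycle penalty, and it appears even though the new PC was ready at the end of cycle 3.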

Note also that things would be even worse if you had refused to do a fetch (in parallel) on the assumption that the instruction following the current one is next up. In that still-worse situation, where you choose not to do anything until you are sure you need to do it, it's like the worst-case laundry scenario they discussed just before this point in the book. So if you don't perform the fetch until you know whether or not you will need it, then you waste even more cycles: you lose a cycle on every branch, not only on the ones that turn out to be taken. Think about it.
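The difference between the two policies can be put into rough numbers. The sketch below assumes a five-stage pipeline, a branch that resolves in its 2nd stage (so each bubble costs exactly one cycle), and no other hazards. The function names and the instruction counts in the example are hypothetical, chosen only to show the arithmetic.

```python
# Rough cycle counts for two branch-handling policies.
# Assumptions: 5-stage pipeline, branch resolved in stage 2, so every
# bubble costs one cycle; no data hazards or cache misses are modeled.

PIPELINE_STAGES = 5  # IF, ID, EX, MEM, WB

def total_cycles(instructions, bubbles):
    # The first instruction needs PIPELINE_STAGES cycles to finish; every
    # later instruction finishes one cycle after the previous one, and
    # each bubble pushes everything behind it back by one more cycle.
    return PIPELINE_STAGES + (instructions - 1) + bubbles

def stall_on_every_branch(instructions, branches, taken):
    # Refuse to fetch until the branch is decoded: one bubble per branch,
    # whether or not it turns out to be taken.
    return total_cycles(instructions, bubbles=branches)

def fetch_ahead(instructions, branches, taken):
    # Always fetch the sequential instruction: the fetch is wasted (one
    # bubble) only when the branch is actually taken.
    return total_cycles(instructions, bubbles=taken)

# Made-up workload: 100 instructions, 20 of them branches, 12 of those taken.
print(stall_on_every_branch(100, 20, 12))  # 124
print(fetch_ahead(100, 20, 12))            # 116
```

Under these assumptions, fetching ahead is never slower than waiting, because the number of taken branches cannot exceed the number of branches.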

I think one of the very best ways to understand computer architecture is to simply do everything by hand, first. Get out a piece of paper and set down some rules of operation in front of you. Use someone else's pipeline design and instruction execution steps. Lay out the sheet of paper with little boxes for each instruction and make yourself a marker (a button from a shirt is fine) to represent your PC counter, which you are allowed to move around on the paper as you execute code by hand. Then write yourself a short program (compute a GCD, for example) and then execute it, completely, by hand using paper following the rules. Spend a day doing that and you will pretty much no longer have as many questions.

(Have a look at this site from a generous soul to see roughly what I'm hinting towards. That's a Bell Labs product from 1968 to teach computing. It's old. So it doesn't have the pipelined stages added to it. But it gives the idea of what I'm asking you to consider doing on paper.)

(Note to all, as the authors write, "A majority of the readers for this book do not plan to become computer architects.")
