In this diagram, can we simply remove the write-back stage, since the mux has been pushed back into the memory access stage and there is no logic left in the write-back stage? Or is the stage there because the register file cannot be read and written simultaneously, so we have to wait an extra clock for the write to finish? If the register file has internal forwarding, can I remove the write-back stage?
-
It could have been done for uniformity in the clock duration. Also, what would be the advantage of doing the register writeback earlier, without waiting for the memory stage? It could conceivably help for register-to-register operations, but that would be a considerably more complex architecture than a 5-stage pipeline. – Lorenzo Marcantonio Mar 29 '21 at 09:20
-
I don't think that's a good idea. If a write-back is necessary, for example for a load instruction, you do it on that clock cycle. Otherwise you do nothing on that clock cycle. The architecture stays simple this way. – Mitu Raj Mar 29 '21 at 16:04
-
I feel like we can remove the write-back stage as long as we deal with the logic, data hazards, and control hazards properly. In fact, we can remove all the stage registers if we want, and then it becomes a single-cycle processor :D. The problem is the trade-off between performance and area. A 5-stage pipeline has the highest clock frequency but takes more area. A 4-stage pipeline has a lower clock frequency than the 5-stage one because the MEM and WB stages are combined, so that stage needs more time to finish its work. A single-cycle processor takes the smallest area but has the lowest clock frequency. – Lei Gao Mar 29 '21 at 19:18
1 Answer
Yes, you could eliminate the WB stage.
In fact, if you really want to trade complexity for performance, you would bypass the ALU output (EX stage) back into the input to the EX stage and skip the MEM and WB stages.
The pipeline shown in the original post is dramatically sub-optimal -- it wastes two ticks on many instructions.
Consider an ADD followed by an AND, for instance. As soon as the ADD result pops out of the EX stage, it is ready to be used in the next instruction. But the pipeline above sends it through two more pipe stages. So if this were a real processor, you'd have to stuff at least one other instruction between the ADD and the AND. Back when Babbage was a little boy, I saw a study that said most input operands were generated within the last two instructions.
So all the processor designs that I've seen have implemented a bypass path from EX out to EX in. Perhaps there are lower-power/performance designs that leave out the bypass, but it seems like a bad trade.
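To put rough numbers on the ADD/AND case, here is a toy Python model that counts the bubbles an in-order pipeline inserts for back-to-back dependent ALU instructions. It is only a sketch under stated assumptions: `Instr`, `stall_cycles`, and the `result_latency` values are hypothetical names and figures chosen for illustration, loads and branches are ignored, and the no-bypass case assumes the register file writes before it reads within a cycle.

```python
from typing import NamedTuple, Sequence


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def stall_cycles(program: Sequence[Instr], result_latency: int) -> int:
    """Count bubbles for an in-order, one-issue-per-cycle pipeline.

    result_latency is the number of cycles between a producer entering EX
    and the earliest cycle a dependent instruction may enter EX:
      1 -> EX-to-EX bypass (assumed)
      3 -> no bypass, result read from the register file after WB (assumed)
    """
    ready = {}   # register name -> earliest cycle a consumer may enter EX
    cycle = 0    # cycle in which the current instruction enters EX
    stalls = 0
    for ins in program:
        earliest = max(ready.get(ins.src1, 0), ready.get(ins.src2, 0))
        if earliest > cycle:
            stalls += earliest - cycle   # bubbles inserted while waiting
            cycle = earliest
        ready[ins.dest] = cycle + result_latency
        cycle += 1
    return stalls


# ADD r1, r2, r3 followed immediately by AND r4, r1, r5
program = [
    Instr("add", "r1", "r2", "r3"),
    Instr("and", "r4", "r1", "r5"),
]

print("with EX-to-EX bypass:", stall_cycles(program, result_latency=1))
print("without bypass:      ", stall_cycles(program, result_latency=3))
```

Under these assumptions the bypassed pipeline runs the pair with no bubbles, while the version without a bypass has to wait for the ADD result to travel through MEM and WB before the AND can use it.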
