It's hard to build cooler CPUs, because the gates, flip-flops, static RAM and data-movement busses that must all work reliably are all
OVER DESIGNED
for performance margin.
You may need a 7 ps NAND gate. But because of variations in the doping (which was already a problem with 2020 and earlier CPUs), you design it for 4 ps ± 1 ps (as an example). The gate has to charge and discharge the metal-to-metal capacitance of its output metallization, which runs very near other pieces of metal that may be switching in the OPPOSITE DIRECTION. Thus our little gate has to do the work of two gates so far as handling parasitics. (This is a matter of routing, and correctable if the human or tool bothers to detect the timing challenge.)
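To put rough numbers on that margin, here is a minimal sketch in Python. It assumes gate delay scales linearly with the effective load capacitance, and that a neighbouring wire switching in the opposite direction makes the coupling capacitance count double. The 7 ps, 4 ps and ± 1 ps figures are the example values above. The 30% coupling fraction and the function name are made up for illustration and are not taken from any real process.

```python
def effective_delay_ps(quiet_delay_ps, coupling_fraction, miller_factor):
    """Scale a gate delay by the change in effective load capacitance.

    quiet_delay_ps    : delay when the neighbouring wire is not switching
    coupling_fraction : share of the load that is wire-to-wire capacitance (assumed)
    miller_factor     : 0 = neighbour switches the same way, 1 = neighbour quiet,
                        2 = neighbour switches the opposite way
    """
    # Load is normalised so that a quiet neighbour gives exactly 1.0.
    load = (1.0 - coupling_fraction) + miller_factor * coupling_fraction
    return quiet_delay_ps * load


REQUIRED_PS = 7.0         # from the text: what the logic actually needs
DESIGN_PS = 4.0           # from the text: what you design for
SPREAD_PS = 1.0           # from the text: variation from doping
COUPLING_FRACTION = 0.3   # hypothetical, for illustration only

slow_corner_ps = DESIGN_PS + SPREAD_PS
worst_case_ps = effective_delay_ps(slow_corner_ps, COUPLING_FRACTION, 2.0)

print(f"slow corner, quiet neighbour:    {slow_corner_ps:.1f} ps")
print(f"slow corner, opposite switching: {worst_case_ps:.1f} ps")
print(f"margin left against {REQUIRED_PS:.1f} ps:     {REQUIRED_PS - worst_case_ps:.1f} ps")
```

With these assumed values most of the gap between the 4 ps target and the 7 ps requirement is already used up by the slow corner plus one hostile neighbour, before any supply noise is considered.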
And the GROUND metal bounces up and down, because OTHER gates are also busy at the same time (or within 10 or 20 picoseconds of the same time), and their need for charge causes Ground transients.
Ditto for the VDD metal.
This bouncing of GND and VDD requires MORE OVERDESIGN.
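A rough way to see why simultaneous switching matters is the sketch below. It models only the inductive part of the transient on a shared rail, V = L * N * dI/dt, and ignores rail resistance and on-chip decoupling capacitance. Every number in it (shared inductance, per-gate current step, edge time) is a hypothetical placeholder chosen for illustration, not a measured value.

```python
def supply_bounce_volts(shared_inductance_h, gates_switching,
                        current_step_a, edge_time_s):
    """Estimate the transient on a shared GND or VDD rail as V = L * N * dI/dt."""
    di_dt = current_step_a / edge_time_s  # current slew of one gate, in A/s
    return shared_inductance_h * gates_switching * di_dt


# All three values below are assumptions, not data from a real chip.
SHARED_INDUCTANCE_H = 10e-12   # 10 pH of shared rail inductance
CURRENT_STEP_A = 100e-6        # each gate demands 100 uA
EDGE_TIME_S = 10e-12           # over a 10 ps edge

for n in (1, 10, 100):
    bounce = supply_bounce_volts(SHARED_INDUCTANCE_H, n,
                                 CURRENT_STEP_A, EDGE_TIME_S)
    print(f"{n:4d} gates switching together: about {bounce * 1e3:.2f} mV")
```

In this simple model the bounce grows in direct proportion to the number of gates that switch within the same edge time, which is why the designer cannot size one gate in isolation.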
The result is an MCU that is reliable for many years and tolerates some variation in supply voltage, but may be 5:1 overdesigned. But the MCU is dependable in its state machine behavior.
But it is overdesigned.
But the state changes are trustable.
But it is overdesigned.