Problem setting
I'm interested in estimating an upper bound on how many floating point operations (FLOP) can be performed per Joule of energy dissipation on existing digital CMOS hardware. To do this, I want to first estimate the energy dissipation per unit length of interconnect, and then estimate the total length of interconnect that gets charged up per FLOP. The former is relatively straightforward to calculate, since charging a wire capacitance \$C\$ to a voltage \$V\$ has an associated energy \$E = \frac{1}{2} CV^2\$; the latter is trickier and is the focus of this question: what is the average total length of interconnect wire that needs to be charged/discharged per FLOP?
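To make the decomposition explicit, the quantity I want to bound is roughly

\$\$\frac{\text{FLOP}}{\text{J}} \lesssim \left( \tfrac{1}{2} C' V^2 \cdot L_{\text{FLOP}} \right)^{-1},\$\$

where \$C'\$ is the interconnect capacitance per unit length, \$V\$ is the supply voltage, and \$L_{\text{FLOP}}\$ is the total wire length charged per FLOP.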
I'm unsure whether my approach to estimating it is sound and whether the numbers I've used are reasonable, and I'd like to sense-check it with people who have more expertise than I do. Note that I'm interested in an order-of-magnitude estimate, and that I'm most interested in this question from the perspective of AI accelerators (e.g. NVIDIA H100 GPUs).
My attempt
My attempt at estimating the total interconnect wire length per FLOP is to split the calculation into two parts: (1) estimate the average length of interconnect wire using Rent's Rule, and (2) estimate the number of wires that need to be charged up per FLOP. The total is then obtained by multiplying these two results together.
Several papers derive distributions of interconnect wire lengths in microprocessors from Rent's Rule; these distributions roughly follow a power law. This book chapter from 2015, for instance, finds a distribution ranging from around 10 um to 1 cm. Eyeballing that distribution, the length-weighted average is plausibly on the order of 100 um.
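As a rough consistency check on the 100 um figure, here is a minimal sketch that computes the length-weighted average of an assumed power-law wire-length distribution over the 10 um to 1 cm range. The exponent k is my own made-up parameter, not a value taken from the book chapter:

```python
import numpy as np

# Toy model: wire-length density p(l) ~ l**(-k) on [l_min, l_max],
# loosely mimicking a Rent's-Rule-derived distribution.
# The exponent k is an assumption; the length range is the ~10 um
# to ~1 cm span quoted from the book chapter.
k = 2.0          # assumed power-law exponent
l_min = 10e-6    # 10 um
l_max = 1e-2     # 1 cm

# Length-weighted average <l> = integral(l * p(l) dl) / integral(p(l) dl),
# approximated by Riemann sums on a uniform grid.
l = np.linspace(l_min, l_max, 1_000_000)
p = l ** (-k)
avg_len = np.sum(l * p) / np.sum(p)

print(f"average wire length ~ {avg_len * 1e6:.0f} um")   # ~70 um for k = 2
```

This lands in the same ballpark as the ~100 um I eyeballed, though the result is fairly sensitive to the assumed exponent.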
To determine the number of wires that get charged up, I note that floating point operations in AI are currently often done at 16-bit precision. According to this Wikipedia article (which doesn't cite a source), a 16-bit multiplier requires around \$10^4\$ transistors. If charging one wire corresponds to communicating 1 bit across it, and if roughly one wire is needed per transistor (I'm unsure whether this is valid), we get \$10^4\$ interconnect wires. Finally, we assume that a FLOP requires the same number of transistors as one multiply, up to a small constant factor (e.g. 2x), which we ignore.
Combining these two results, the total interconnect length I estimate per FLOP is \$10^4 \cdot 10^{-4} \: \text{m} = 1 \: \text{m}\$. As a sanity check, this paper argues that capacitances per unit length are roughly 1 pF/cm; assuming a supply voltage of around 1 V, this gives an energy per unit length of roughly \$10^{-10}\$ J/m, and hence around 100 pJ/FLOP. The same paper also quotes an empirical estimate of around 10 pJ/FLOP, so I'm off by an order of magnitude somehow.
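For reference, here is a minimal sketch of the full back-of-the-envelope calculation. The supply voltage of ~1 V is my own assumption (the paper only gives the capacitance per unit length):

```python
# All inputs are the order-of-magnitude values used above; v_dd ~ 1 V is assumed.
n_wires      = 1e4     # wires charged per FLOP (~transistor count of a 16-bit multiplier)
avg_wire_len = 1e-4    # m, weighted-average interconnect length from Rent's Rule
cap_per_len  = 1e-10   # F/m (1 pF/cm, from the cited paper)
v_dd         = 1.0     # V, assumed supply voltage

total_len       = n_wires * avg_wire_len                      # ~1 m of wire per FLOP
energy_per_flop = 0.5 * cap_per_len * v_dd**2 * total_len     # ~5e-11 J

print(f"interconnect charged per FLOP: {total_len:.0f} m")
print(f"energy per FLOP: {energy_per_flop * 1e12:.0f} pJ")    # ~50-100 pJ vs ~10 pJ empirical
```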
There are some weaknesses in this argument that I'm aware of. For instance, the average length of interconnect overall may not be the same as the average length of interconnect that actually gets charged up (e.g. perhaps a FLOP is typically performed in a spatially local fashion, such that most of the wires that get charged up are shorter than what Rent's Rule would predict). Perhaps this is why my calculation above yields a number higher than the empirical estimate, though I'm not sure.
Question
Do you think this general approach to estimating the average interconnect length is valid for an order-of-magnitude estimate? What do you think are the most important objections to my calculation, and how should I modify it accordingly?
[Edit for clarification: I'm most interested in feedback on the calculation of the average length of interconnect wire and the number of wires charged up per FLOP. These are the quantities I'm by far the most uncertain about, and errors in them could change my conclusions substantially.]