5

Problem setting

I'm interested in estimating an upper bound on how many floating point operations (FLOP) can be performed per joule of energy dissipation on existing digital CMOS hardware. To do this, I want to first estimate the energy dissipation per unit length of interconnect, and then estimate the total length of interconnect that gets charged up per FLOP. The former is relatively straightforward to calculate, since charging a wire capacitance has associated energy $E = \frac{1}{2} C V^2$; the latter is trickier and is the focus of this question, i.e. "What is the average total length of interconnect wire that needs to be charged/discharged per FLOP?"

I'm unsure about whether my approach for estimating it is good and whether or not the numbers I've used are valid, and would like to sense-check it with people with more expertise than myself. Note that I'm interested in an order-of-magnitude estimate, and that I'm most interested in this question from the perspective of AI accelerators (e.g. NVIDIA H100 GPUs).

My attempt

My attempt at estimating the total interconnect wire length per FLOP is to split the calculation into two parts: (1) estimate the average length of interconnect wire using Rent's Rule, and (2) estimate the number of wires that need to be charged up per FLOP. The total is then obtained by multiplying these two results together.

  1. Several papers derive a distribution of interconnect wire lengths in microprocessors based on Rent's Rule, which follows something roughly like a power law. This book chapter from 2015, for instance, finds a distribution ranging from around 10 µm to 1 cm. Eyeballing the distribution, the weighted average is plausibly on the order of 100 µm.
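To sanity-check that eyeballed figure, here is a quick sketch. The power-law exponent of -2 is my own assumption, not a number from the chapter; only the 10 µm to 1 cm range comes from the cited distribution.

```python
import math

# Hypothetical power-law wire-length density p(l) ~ l^-2 on [l_min, l_max].
# The exponent -2 is an assumption; the range comes from the cited chapter.
l_min, l_max = 10e-6, 1e-2   # 10 um to 1 cm, in metres

# Mean length E[l] = (integral of l * l^-2 dl) / (integral of l^-2 dl),
# evaluated analytically for this density:
mean_length = math.log(l_max / l_min) / (1.0 / l_min - 1.0 / l_max)
print(f"mean wire length ~ {mean_length * 1e6:.0f} um")  # ~70 um
```

This lands at the same ~100 µm order of magnitude, though the result is quite sensitive to the assumed exponent.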

  2. To determine the number of wires that get charged up, I note that floating point operations in AI are now often done at 16-bit precision. According to this Wikipedia article (which doesn't cite a source), a 16-bit multiplier requires around $10^4$ transistors. If charging up one wire length corresponds to communicating 1 bit across the wire, and assuming roughly one wire is needed per transistor (I'm unsure if this is valid), we have $10^4$ interconnect wires. Finally, we assume that a multiply requires the same number of transistors as a generic FLOP up to some small constant factor (e.g. 2x), which we ignore.

Combining these two results, the total interconnect length I estimate per FLOP is $10^4 \cdot 10^{-4}\ \text{m} = 1\ \text{m}$. As a sanity check, this paper argues that capacitances per unit length are roughly 1 pF/cm, which (taking $V \approx 1\ \text{V}$) suggests energies per unit length of order $10^{-10}$ J/m, and thus around 100 pJ/FLOP. The same paper also quotes an empirical estimate of around 10 pJ/FLOP, so I'm off by an order of magnitude somehow.
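For concreteness, the full chain of numbers can be written out as a sketch. The supply swing of 1 V is my assumption; at order-of-magnitude level, the factor of 1/2 in the capacitor energy is what separates 50 pJ from the 100 pJ quoted above.

```python
# Order-of-magnitude sketch of the estimate above.
# V_dd ~ 1 V is an assumption; the other numbers are from the post.
avg_wire_length = 100e-6   # m, eyeballed from the Rent's Rule distribution
wires_per_flop = 1e4       # ~one wire per transistor of a 16-bit multiplier
cap_per_length = 1e-10     # F/m (1 pF/cm, from the cited paper)
v_dd = 1.0                 # V, assumed supply swing

total_length = avg_wire_length * wires_per_flop              # -> 1 m
energy_per_flop = 0.5 * cap_per_length * v_dd**2 * total_length
print(f"total interconnect per FLOP: {total_length:.0f} m")
print(f"energy per FLOP: {energy_per_flop * 1e12:.0f} pJ")   # ~50 pJ
```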

There are some weaknesses with this argument that I'm aware of. For instance, the average length of interconnect overall may not be the same as the average length of interconnect that is actually charged up (e.g. perhaps a FLOP is typically performed in spatially local fashion, such that most wires that are charged up are shorter than what Rent's Rule would predict). Perhaps this is the reason why my previous calculation yields a number higher than the empirical estimate, though I'm not sure.

Question

Do you think this general approach to estimating average interconnect length is valid for an order-of-magnitude estimate? What do you think are the most important objections to my calculation, and how should I try to modify it accordingly?

[Edit for clarification: I'm most interested in feedback on the calculation of the average length of interconnect wire, and the number of wires charged up per FLOP. These are the quantities I'm by far the most uncertain about and the values of which could change my conclusions a lot if I've made an error.]

anson
  • 53
  • 4
  • 2
    Are you sure the interconnect is going to outweigh the gate capacitance of the various transistors involved in the computation? – Hearth Jun 06 '23 at 02:46
  • I'm more curious how the `average length` could be shown to be a useful measure in developing an answer to the question. – periblepsis Jun 06 '23 at 03:17
  • @Hearth I'm not sure because I'm anything but an expert on these kinds of topics, but the basis for my sanity check calculation is that I've heard that most energy dissipation in modern processors is via the interconnect - e.g. [this paper from the post](https://arxiv.org/pdf/1609.05510.pdf). This is a claim I've seen pop up in a few different places – anson Jun 06 '23 at 03:37
  • The wire will have inductance and form a transmission line as you raise frequency high enough. Have you considered this? – Andy aka Jun 06 '23 at 08:00
  • @Andyaka Interesting, I hadn't considered this. All the sources I've seen for estimating interconnect energy costs just look at the capacitance. Do you have an intuition for how significant the associated heating effects might be? (For reference, fig ES20 from the [IRDS 2022 report](https://irds.ieee.org/images/files/pdf/2022/2022IRDS_ES.pdf) suggests clock frequencies approaching 10 GHz) – anson Jun 06 '23 at 19:02
  • The designers of your FPU will have optimized its layout to keep the critical path as short as possible so that it hits high clock speeds. They'll do this by putting the sequential elements in the multiplication process as close as possible to one another. Thus the wires that carry the most signals will be short, while longer wires will carry signals much less often. For this reason I don't think taking the average length makes sense; most signals won't ever go over average-length wires. – user1850479 Jun 06 '23 at 19:43
  • @user1850479 Yes, I agree this is an important consideration. Do you know how I might go about estimating the length of wires that are actually charged up, if this approach using Rent's rule is problematic? I've tried searching for relevant estimates but am unsure what references to consult. I've also considered taking some kind of weighted average from the wire length distribution with more weight being placed on shorter interconnects, but don't know what kind of weighting is appropriate. – anson Jun 06 '23 at 20:25
  • 2
    Since you have no idea what Nvidia optimized for when they designed their FPU, what constraints they faced, or how successful they were, I think your approach is hopeless. You have no idea what they chose to do. Maybe instead you could work backwards from the energy per FLOP and estimate the fraction expended on the wiring? At least then you have real measurements to anchor yourself to reality. – user1850479 Jun 06 '23 at 20:46

1 Answer

3

I do not think this approach can yield numbers that are useful, even to one order of magnitude. It is more than two orders of magnitude off what can be calculated empirically: an NVIDIA L4 handily achieves roughly 574 fJ/FLOP, about 20-fold below the paper's lower (10 pJ/FLOP) estimate and some 200 times below your estimate.

By "approach" I mean your method of estimating the average length and average number of interconnects per FLOP. The broader approach of estimating the total wire length that must be charged per FLOP can work, but not without some corrections.

This is hardly an exhaustive list; I would hope other answers or comments might improve on it. These are just the corrections I can think of that would have a significant impact.

  1. Interconnect charging only occurs on a low-to-high transition, in other words a bit going from 0 to 1. During a floating point calculation, nowhere near all $10^4$ interconnects are going to be charged, as many will already be charged and others will be transitioning from 1 to 0. What matters is how many transistors actually have to change state on average, which is going to be a fair bit fewer than the total in the circuit.

  2. Interconnect length on a CPU is not even remotely uniform, and a 16-bit floating point unit lies at one extreme of this. The only structures more regular than an FPU are things like SRAM cache (memory). FPUs have repeating patterns of transistors that minimize interconnect length between them, and are far more compact than other parts of a CPU core. The average interconnect length in an FPU is going to sit well below the chip-wide average. And it can vary wildly from one FPU to another, even if both are half-precision units: there are multiple architectures in use here, with the main tradeoff being speed/power/die area, and that directly translates into different interconnect lengths. Any estimate here has to look at the actual FPU architecture in question, with an interconnect estimate made specific to that type of FPU.

  3. Bits exist at the logic gate level, not the transistor level. Each gate is composed of several transistors that are typically directly adjacent to each other, with negligible interconnect length between them. You should be estimating interconnects using gate count, not transistor count. Be careful not to confuse transistor gates (as in gate, drain, and source) with logic gates (AND, OR, NOT, XOR, etc.). Estimating gate count from transistor count is possible but fraught with peril: gates with many inputs can comprise far more transistors. For a first-order estimate, though, just assuming 2-input gates is fine; depending on the type of gate, they will be made of 2 to 6 transistors each.

  4. Modern ICs use bus precharging. This is a fancy way of saying that the interconnects are kept at some voltage above ground during logic low, but low enough not to cause a logic-high condition. The interconnects then swing between this raised level and the logic-high voltage. This reduces the amount of charge needed, as the effective voltage swing is lowered, improving both speed and power consumption.
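As a rough illustration of how points 1-4 might reshape the original 100 pJ/FLOP figure, here is a sketch in which every correction factor (activity factor, transistors per gate, FPU wire length, reduced swing) is an assumed placeholder, not a measured value:

```python
# Illustrative correction of the ~100 pJ/FLOP estimate using points 1-4.
# Every correction factor below is an assumption for illustration only.
transistors = 1e4            # 16-bit multiplier, from the question
transistors_per_gate = 4     # point 3: assume 2-input gates (2-6 transistors)
activity_factor = 0.15       # point 1: fraction of nets switching 0 -> 1 (assumed)
avg_wire_length = 30e-6      # point 2: FPU wires shorter than chip average (assumed)
cap_per_length = 1e-10       # F/m, as in the question
v_swing = 0.8                # point 4: reduced swing from precharging (assumed)

nets = transistors / transistors_per_gate        # one net per logic gate output
charged_length = nets * activity_factor * avg_wire_length
energy_per_flop = 0.5 * cap_per_length * v_swing**2 * charged_length
print(f"corrected estimate: {energy_per_flop * 1e12:.2f} pJ/FLOP")
```

With these (arbitrary but plausible) factors, the estimate lands in the sub-picojoule range, i.e. the same ballpark as the L4 figure above. But that agreement is only as good as the assumed factors; the point is only that the corrections plausibly account for the two-orders-of-magnitude gap.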

metacollin
  • 27,884
  • 4
  • 64
  • 119
  • I'm relatively skeptical that one could come up with even a reasonable order-of-magnitude estimate of the energy used without quite an in-depth understanding of the structure of a specific GPU - a floating point operation is quite complex, and has had decades of development to optimise every element of the structure – BeB00 Jun 11 '23 at 23:51