
I'm trying to figure out just how expensive floating point fundamentally is at the hardware level -- for example, how many more transistors a 32-bit floating-point multiplier costs compared to an integer one.

To be specific:

  • A 32-bit floating point multiplier, versus a 32-bit integer multiplier.
  • Both have a throughput of one result per clock cycle.
  • The FP does not need IEEE semantics; it can make the simplifications typical of GPUs, e.g. no exceptions, a fixed rounding mode, and denormals flushed to zero.
  • The integer multiplier only produces 32 bits of result and throws away the rest.
  • If it matters, say the target clock speed is 50 MHz and the implementation technology is CMOS.
  • I'm just considering the arithmetic hardware itself, not other issues like control logic, register renaming, etc.

Roughly how much more expensive is the floating point circuit? For example, twice as many transistors?

rwallace
  • There are various tradeoffs to make in an actual implementation. You could fire up your favorite HDL package, synthesize two units with your choices for those tradeoffs, and compare. – PlasmaHH Apr 16 '18 at 11:50
  • @PlasmaHH What sort of tradeoffs exist, aside from the ones I mentioned? – rwallace Apr 16 '18 at 11:53
  •
    Real implementations have lots of complex things like register renaming, ways to access the data, and superscalar execution. Then there is speed: you say you want one clock cycle, but it's easier at 5 MHz than at 5 GHz. On the former you can take some lazy routes and basically run loops internally, but at 5 GHz you would need a really fine-tuned design. And then there is probably fine print like carry bits, or how the 64-bit result in the integer case is stored, and whatnot, that may or may not influence whether you fit within the time constraint of a single cycle at the desired clock. – PlasmaHH Apr 16 '18 at 11:58
  • @PlasmaHH Okay, added clarifications accordingly. – rwallace Apr 16 '18 at 12:03
  •
    Have you ever seen a block diagram of a floating-point multiplier? It has an integer multiplier at its core, to handle the multiplication of the mantissas. It also has shifters before and after to handle denormalization and renormalization. Then there's the separate path that deals with the exponents. Some of these blocks are \$O(n^2)\$ and some of them are \$O(n)\$, but they all have different scale factors for the number of transistors. That's a lot of variables, and I don't think it's really possible to distill it down to a single generic factor. – Dave Tweed Apr 16 '18 at 12:31
  •
    You say, *"throughput of one clock cycle"*, which implies that you want a result every cycle. This is not difficult at all. The key difference is *latency* -- integer multiplication can take just one cycle, but floating-point will typically require 3-4 cycles (pipelined) because of the pre- and post-processing required on the numbers. – Dave Tweed Apr 16 '18 at 12:54
  • @DaveTweed Right, that was my understanding, that for floating point multiplication, it's reasonable to ask for throughput of one cycle, but latency of one cycle would be difficult or impossible. – rwallace Apr 16 '18 at 13:05

1 Answer


I'm writing this with no experience whatsoever, so take this answer with a grain of salt. That said...

A typical 32-bit float has 23 bits of fraction. Multiplying two of these only requires a 23x23 multiplier, keeping the upper 23 bits. The exponents are then added. Adds are cheap.

Your 32-bit integer has 32 bits, so you need a 32x32 bit multiplier, keeping the upper or lower 32 bits.

So: Your floating point multiplier ought to be cheaper.
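
To make the structure concrete, here is a hedged software sketch (not any real hardware design) of the datapath sketched above: a 24x24 integer multiply of the significands, an exponent add, and a one-bit normalization shift, with the GPU-style simplifications from the question (flush-to-zero, truncation rounding, no exceptions). The function name and layout are illustrative only.

```c
#include <stdint.h>

/* Simplified (non-IEEE) FP32 multiply: flush-to-zero, truncation
 * rounding, no exceptions or NaN handling. Operates on the raw bit
 * patterns of two IEEE-format floats. Illustrative sketch only. */
static uint32_t fp32_mul_simple(uint32_t a, uint32_t b)
{
    uint32_t sign = (a ^ b) & 0x80000000u;   /* XOR of sign bits */
    int32_t  ea = (a >> 23) & 0xFF;
    int32_t  eb = (b >> 23) & 0xFF;

    /* Flush-to-zero: treat denormal inputs (biased exponent 0) as zero. */
    if (ea == 0 || eb == 0)
        return sign;                         /* signed zero */

    /* Restore the implicit leading 1 -> 24-bit significands. */
    uint32_t ma = (a & 0x007FFFFFu) | 0x00800000u;
    uint32_t mb = (b & 0x007FFFFFu) | 0x00800000u;

    /* The core of the unit: a 24x24 -> 48-bit integer multiply. */
    uint64_t prod = (uint64_t)ma * mb;

    /* The product lies in [2^46, 2^48): normalize back to 24 bits,
     * dropping the low bits (truncation). Exponents just add. */
    int32_t exp = ea + eb - 127;             /* remove one bias */
    if (prod & (1ull << 47)) {               /* product >= 2.0 */
        prod >>= 24;
        exp += 1;
    } else {
        prod >>= 23;
    }

    if (exp <= 0)   return sign;                 /* underflow -> zero (FTZ) */
    if (exp >= 255) return sign | 0x7F800000u;   /* overflow -> infinity */

    return sign | ((uint32_t)exp << 23) | ((uint32_t)prod & 0x007FFFFFu);
}
```

Note that the only wide datapath element here is the 24x24 multiply; everything else (sign XOR, exponent adder, one-bit normalization shift) is linear in the operand width, which is why the significand multiplier dominates the transistor count.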

pipe
  • That is a consideration, certainly. On the other hand, floating point operands need to be shifted to bring them into alignment, and a barrel shifter has, like a multiplier, quadratic cost, so that's a counterweight to start with. – rwallace Apr 16 '18 at 12:29
  •
    @pipe, Actually, they are 24-bit mantissas. The 24th bit is not stored explicitly, because its value is implied by the exponent -- 1 for normalized numbers, and 0 for denormalized numbers. – Dave Tweed Apr 16 '18 at 12:36
  •
    @rwallace: You're thinking of floating-point addition/subtraction. Multiplication does not require that kind of alignment. – Dave Tweed Apr 16 '18 at 12:38
  • @DaveTweed Good point. I stand corrected. – rwallace Apr 16 '18 at 12:40
  • @DaveTweed Sure, but the implicit bit is also a constant so it's easier to handle in the multiplication. Also, the user doesn't seem to want IEEE floats or (de)normalized stuff so I thought that a general idea is good enough here. – pipe Apr 16 '18 at 13:28
  • No, it isn't any easier to handle; you really do need a 24x24 multiplier. – Dave Tweed Apr 16 '18 at 13:50
  •
    Floating point multiplication should be cheaper; it's add and subtract that require barrel shifters to get single-cycle operation. – Spehro Pefhany Apr 16 '18 at 15:12
  • If you allow subnormals, the result bits you need might not be the high half of a 24x24 => 24-bit multiply. Some real CPUs (e.g. Intel) handle subnormal multiplies (or subnormal add results) by taking a microcode assist instead of making the critical path longer to handle the general case of renormalizing. This is true even in Skylake where add and multiply both run on FMA units. [Related](https://electronics.stackexchange.com/questions/452181/why-does-intels-haswell/452193#452193) about FP add vs. mul complexity, although that specific question has a different answer. – Peter Cordes Nov 20 '19 at 00:29
  • Also, I think you need some (or all) of the low-half bits to produce a *correctly-rounded* result. If the rounding mode is towards +Inf, any non-zero bits in the low half mean rounding up the magnitude for a positive integer. Only with rounding = truncation can you just take the high half of the 24x24 multiply. IEEE-754 requires the Basic operations (+ - * / sqrt) to produce Correctly Rounded results (error <= 0.5ulp, i.e. all mantissa bits correct.) – Peter Cordes Nov 20 '19 at 00:53
  • Although the FP multiplier can be smaller (due to a narrower mantissa, small exponent, and not needing to handle denorms or exceptions), it's not very useful without an adder (for GPU MACs). And the required FP adder will be either much larger or much slower than a 32-bit integer adder, due to required normalization. So there's still a trade-off. – hotpaw2 Dec 12 '19 at 17:43