
When using a regular FPGA such as Xilinx Spartan 3 or Virtex 5, how many cycles does a double-precision floating-point 64-bit multiplication or division take to execute?

As far as I understand, the FPGA does not have a hard FPU, so you need to build one using the standard IEEE libraries or other resources. This means it won't execute in a single cycle, so I'm looking for a rough estimate to compare the performance of a 100 MHz CPU with a 100 MHz Spartan/Virtex FPGA.

I'm primarily interested in floating-point operators, but if you have experience with integer operations that would be appreciated as well.

Robin Rodricks

  • First, it is worth clarifying the question: you talk about 64-bit MPY/divide, which would imply 64-bit integer multiply/divide - then you mention FPU, which implies double-precision floating point. Details of the answer will differ for each... –  Dec 21 '12 at 12:21
  • For one multiplication at a time, the time is probably comparable or slightly in favour of the CPU. Obviously the advantage of the FPGA is that you can have a lot of them in parallel. – pjc50 Dec 21 '12 at 13:04

5 Answers


I haven't done this for double precision FP, but the same principles apply as for single precision, for which I have implemented division (as multiply by reciprocal).

What these FPGAs do have, instead of FPUs, is hardwired DSP/multiplier blocks, capable of performing an 18*18 or (on Virtex-5) 18*25 multiplication in a single cycle. The larger devices have on the order of a thousand of these; even the top-end Spartan-3 and Spartan-6 parts have 126 or 180 of them.

So you can decompose a large multiplication into smaller operations using several of these (2 for the Virtex-5 doing single precision) using the DSP's adders or FPGA fabric to sum the partial products.
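The decomposition works the same way long multiplication does, just with DSP-sized limbs instead of decimal digits. A minimal sketch in Python (the limb width and loop structure here are illustrative, not what any particular synthesis tool emits):

```python
# Sketch: split a wide unsigned multiply into 17-bit limbs, the way a
# synthesizer maps it onto 18x18 DSP blocks (one bit reserved for sign,
# leaving 17 usable bits for unsigned operands). LIMB is an assumption.
LIMB = 17

def split(x, width):
    """Split x into little-endian LIMB-bit chunks covering `width` bits."""
    mask = (1 << LIMB) - 1
    return [(x >> shift) & mask for shift in range(0, width, LIMB)]

def wide_mul(a, b, width):
    """Sum shifted partial products, as the DSP adder tree / fabric would."""
    acc = 0
    for i, ai in enumerate(split(a, width)):
        for j, bj in enumerate(split(b, width)):
            acc += (ai * bj) << (LIMB * (i + j))  # one DSP block per term
    return acc

# 24-bit operands, as for single-precision mantissas: 2x2 = 4 partial products
assert wide_mul(0xFFFFFF, 0xABCDEF, 24) == 0xFFFFFF * 0xABCDEF
```

In hardware each partial product is one DSP block, and the shifts are free wiring; only the adder tree costs extra logic or DSP cascade stages.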

You will get an answer in a few cycles - 3 or 4 for SP, maybe 5 for DP - depending on how you compose the adder tree (and sometimes, where the synth tools insist on adding pipeline registers!).

However that is the latency - as it is pipelined, throughput will be 1 result per clock cycle.

For division, I approximated a reciprocal operator using a lookup table followed by quadratic interpolation. This was accurate to better than single-precision and would extend (with more hardware) to DP if I wanted. In Spartan-6 it takes 2 BlockRams and 4 DSP/multipliers, and a couple of hundred LUT/FF pairs.
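The table-plus-interpolation idea can be sketched in software. This is not the author's exact design (the table size, segment layout, and use of on-the-fly samples instead of stored polynomial coefficients are all assumptions), but it shows why a small BlockRAM table plus a few multipliers is enough:

```python
# Sketch: approximate 1/x on [1, 2) with a coarse table plus quadratic
# interpolation. A hardware version stores per-segment polynomial
# coefficients in BlockRAM and evaluates them with DSP multipliers.
TABLE_BITS = 7           # 128 segments -- an illustrative choice
N = 1 << TABLE_BITS
STEP = 1.0 / N

def recip(x):
    assert 1.0 <= x < 2.0
    i = min(int((x - 1.0) / STEP), N - 1)
    x0 = 1.0 + i * STEP
    # Three samples spanning the segment (in hardware: table lookups)
    f0, f1, f2 = 1.0 / x0, 1.0 / (x0 + STEP), 1.0 / (x0 + 2 * STEP)
    t = (x - x0) / STEP
    # Newton forward-difference quadratic through the three samples
    return f0 + t * (f1 - f0) + 0.5 * t * (t - 1.0) * (f2 - 2.0 * f1 + f0)

# Worst-case relative error over a fine sweep stays well below 1e-6,
# i.e. better than single precision needs after rounding
err = max(abs(recip(1 + k / 10000) * (1 + k / 10000) - 1.0)
          for k in range(10000))
assert err < 1e-6
```

Mantissas of normalized floats always fall in [1, 2), which is why one table over that interval covers every input after exponent handling.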

Its latency is 8 cycles, but again the throughput is single-cycle, so by combining it with the above multiplier, you get one division per clock cycle. It should exceed 100MHz in Spartan-3. In Spartan-6 the synthesis estimate is 185MHz but that's with 1.6ns on a single routing path, so 200MHz is within reason.

In Virtex-5 it reached 200MHz without effort, as did its square root twin. I had a couple of summer students attempt to re-pipeline it - with less than 12 cycles latency they got close to 400MHz - 2.5 ns for a square root.

But remember you have maybe a hundred to a thousand DSP units? That gives you one or two orders of magnitude more processing power than a single FP unit.

  • Thanks for your answer, Brian, but isn't your whole answer about integer multiply / divide? I'm primarily interested in floating point. – Robin Rodricks Dec 21 '12 at 13:47
  • No, as I said, single precision, meaning 32-bit floating point. The same principles apply for doubles, but the resource usage is obviously higher. –  Dec 21 '12 at 15:08

When using a regular FPGA such as Xilinx Spartan 3 or Virtex 5, how many cycles does a double-precision floating-point 64-bit multiplication or division take to execute?

The answer is: Yes!

But seriously, it is super hard to come up with a number. When designing any complex logic there is always a trade-off between different things, and no one approach is good for all designs. I'll try to cover the big ones.

With logic design, one trade-off is size vs. speed. As an easy example, let's say that a single floating point multiplier is too slow. To speed it up, all you have to do is add a second multiplier. Your logic size doubles, but so does the number of multiplies per second. Even just looking at a single multiplier, there are different ways to multiply numbers; some are fast and large, others are small and slow.

Another trade-off is clock speed vs. clocks per multiply. I could design some logic that would do a single floating point multiply in one clock. But that would also require the clock to be slower-- maybe as slow as 10 MHz. Or, I could design it to work with a 100 MHz clock but it would require 10 clocks per multiply. The overall speed is the same (one multiply in 100 ns), but one has a faster clock.

Related to the previous paragraph is the trade-off of clock speed vs. multiply latency. There is a technique in logic design called pipelining. Basically you take a chunk of logic and break it up into smaller stages, where each stage takes one clock cycle to complete. The advantage here is that each stage can be working on a multiply while the other stages are working on other multiplies. For example, let's say that we're running at 100 MHz with a 10 stage pipeline. This means that it will take 10 clocks for each multiply, but the logic is also working on 10 different multiplies at the same time! The cool thing is that it is completing a multiply on every clock cycle. So the effective clocks per multiply is 1, it just takes 10 clocks for each of those multiplies to complete.
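The latency-vs-throughput arithmetic above is simple enough to write down. A quick sketch (stage count and clock rate taken from the example, not from any real part):

```python
# Pipelined latency vs. throughput: a 10-stage pipeline at 100 MHz takes
# 10 cycles for the first result, then delivers one result per cycle.
CLOCK_HZ = 100e6
STAGES = 10

def total_cycles(n_ops):
    """Cycles to finish n_ops back-to-back multiplies in a full pipeline."""
    return STAGES + (n_ops - 1)

# A single multiply: 10 cycles = 100 ns of latency
assert total_cycles(1) == 10
# A million multiplies: the 10-cycle fill time is noise; ~1 cycle each
assert total_cycles(1_000_000) == 1_000_009
```

This is why quoting "clocks per multiply" is ambiguous: for a lone operation it is the latency, for a stream it approaches one.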

So the answer to your question, how fast can an FPGA do a multiply, is really up to you. FPGA's come in different sizes and speeds, and you can dedicate as much of that logic to the task at hand as you want. But let's look at one specific scenario...

Let's say that we want to use the largest Spartan-3A and all we care about is 32-bit floating point multiplies. A 32-bit float multiply requires a 24x24 integer multiplier and an 8-bit adder. This requires four of the dedicated multiplier blocks and some generic slices (too few to care about). The XC3S1400A has 32 dedicated multipliers, so we can do eight of our floating point multipliers in parallel. A very rough guess on the clock speed would be about 100 MHz. We can fully pipeline this design so that we complete eight 32-bit floating point multiplies per clock cycle, for an effective speed of 800 million floating point multiplies per second.

A double precision multiply requires 9 dedicated multiplier blocks per floating point multiply, so we could only do 3 multiplies in parallel-- resulting in a speed of about 300 million 64-bit floating point multiplies per second.

For comparison, let's consider the newer Xilinx Virtex-7 series. The dedicated multipliers in that are bigger, so we only need 6 dedicated multiplier blocks for a 64-bit floating point multiply. There are also 1,920 dedicated multipliers on the largest part-- so we can do 320 double precision floating point multiplies in parallel. Those parts are also much faster. I estimate that we can run those parts at 200 MHz, giving us a total speed of 64 BILLION double precision floating point multiplies per second. Of course, those chips cost about US$10,000 each.
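The estimates in the last three paragraphs all follow one formula: parallel units times clock rate. A sketch that reproduces them (DSP counts and clock rates are the answer's own rough figures, not datasheet guarantees):

```python
# Back-of-envelope FPGA multiply throughput: how many multipliers fit,
# times one result per clock from each pipelined multiplier.
def fp_muls_per_sec(total_dsp_blocks, dsp_per_multiply, clock_hz):
    parallel_units = total_dsp_blocks // dsp_per_multiply
    return parallel_units * clock_hz

# Spartan-3A XC3S1400A: 32 DSPs, 4 per single-precision multiply, ~100 MHz
assert fp_muls_per_sec(32, 4, 100e6) == 800e6
# Same part, 9 DSPs per double-precision multiply -> ~300M/s
assert fp_muls_per_sec(32, 9, 100e6) == 300e6
# Largest Virtex-7: 1920 DSPs, 6 per double multiply, ~200 MHz -> 64G/s
assert fp_muls_per_sec(1920, 6, 200e6) == 64e9
```

The integer division matters: leftover DSP blocks that can't form a complete multiplier contribute nothing, which is why the Spartan-3A double-precision case gets only 3 units out of 32 blocks.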

Floating point division is much harder to do quickly. The logic is much bigger, especially in an FPGA, and it runs much slower. The same is true for most CPU's, in that the division instructions (floating and fixed point) run much slower. If speed is important then you want to eliminate as many of the divides as possible. For example, rather than dividing by 5, you should multiply by 0.2. In fact, on many systems it is faster to calculate a reciprocal and then do a multiply than to just do a divide.
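The divide-avoidance trick pays off most when the same divisor is reused, so the one expensive reciprocal is amortized over many cheap multiplies. A tiny illustrative sketch (function and names are hypothetical):

```python
# Hoist a reciprocal out of the hot path: one divide up front, then
# only multiplies per element. Same idea applies in HDL pipelines.
def scale_all(values, divisor):
    inv = 1.0 / divisor          # the single expensive operation
    return [v * inv for v in values]

assert scale_all([10.0, 20.0], 5.0) == [2.0, 4.0]
```

Note that `x * (1/d)` and `x / d` can differ in the last bit due to double rounding, which is usually acceptable but worth knowing when exact IEEE division semantics are required.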

The same trade-offs apply to division as multiplication-- it is just that division is always going to be much slower and much bigger than multiplication.

  • A TI DSP or even a GPU on a Raspberry Pi 3 will suck the magic smoke out of what was once an ALU on FPGA. – dhchdhd Aug 29 '17 at 14:29

At least with Altera's ALT_FP division component, a double-precision 64-bit division (52-bit mantissa) takes 10, 24 or 61 clock cycles (selectable). Single extended precision can vary: e.g. for a 43-bit division with an 11-bit exponent and a 26-bit mantissa, it lets you select an output latency of 8, 18 or 35 clocks. Start ISE and check what you can get on Xilinx.

Tomas D.
  • Are these numbers latency, or throughput? –  Dec 21 '12 at 13:25
  • Double precision clock cycles are selectable? By what factor? To use more/fewer blocks? And what about multiplication? – Robin Rodricks Dec 21 '12 at 13:48
  • I haven't explored the component parameters; I just opened the main window and copied what it says. You probably need to read the docs and check what other parameters the component GUI offers. So for now I can't answer either question. – Tomas D. Dec 25 '12 at 13:22

There's no reason it can't take a single cycle. It would likely be a rather long cycle, however, and use a lot of resources...

Martin Thompson

I have implementations of double-precision, floating-point multiply and divide. The multiplication takes 13 clock cycles and the divide takes 109 clock cycles. Both are pipelined for 100% throughput (one result per clock) and around 200MHz operation on a Xilinx V5. I don't know how many fewer clocks you could get at 100MHz, but dividing by two would be a safe bet.

I also have single-precision floating-point implementations which take 10 and 51 clocks under the same situation.

Jim