When using a regular FPGA such as Xilinx Spartan 3 or Virtex 5, how
many cycles does a double-precision floating-point 64-bit
multiplication or division take to execute?
The answer is: Yes!
But seriously, it is super hard to come up with a number. When designing any complex logic there is always a trade-off between different things, and no one approach is good for all designs. I'll try to cover the big ones.
With logic design one trade-off is size vs. speed. As an easy example, let's say that a single floating point multiplier is too slow. To speed it up, all you have to do is add a second multiplier. Your logic size doubles, but so does the number of multiplies per second. But even just looking at a single multiplier, there are different ways to multiply numbers; some are fast and large, others are small and slow.
Another trade-off is clock speed vs. clocks per multiply. I could design some logic that would do a single floating point multiply in one clock. But that would also require the clock to be slower-- maybe as slow as 10 MHz. Or, I could design it to work with a 100 MHz clock but it would require 10 clocks per multiply. The overall speed is the same (one multiply in 100 ns), but one has a faster clock.
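To make the arithmetic concrete, here is a minimal Python sketch of the two designs described above. The function name is just something I made up for illustration; the clock rates and cycle counts are the ones from the paragraph.

```python
def time_per_multiply_ns(clock_mhz, clocks_per_multiply):
    """Time for one multiply to finish, in nanoseconds."""
    period_ns = 1000.0 / clock_mhz          # one clock period in ns
    return period_ns * clocks_per_multiply

# Design A: one multiply per clock, but the clock is only 10 MHz.
single_cycle = time_per_multiply_ns(10, 1)

# Design B: 100 MHz clock, but each multiply needs 10 clocks.
multi_cycle = time_per_multiply_ns(100, 10)

print(single_cycle, multi_cycle)  # both come out to 100 ns
```

Both designs take the same time per multiply; the difference is only in how that time is sliced into clock cycles.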
Related to the previous paragraph is the trade-off of clock speed vs. multiply latency. There is a technique in logic design called pipelining. Basically you take a chunk of logic and break it up into smaller stages, where each stage takes one clock cycle to complete. The advantage here is that each stage can be working on a multiply while the other stages are working on other multiplies. For example, let's say that we're running at 100 MHz with a 10 stage pipeline. This means that it will take 10 clocks for each multiply, but the logic is also working on 10 different multiplies at the same time! The cool thing is that it is completing a multiply on every clock cycle. So the effective clocks per multiply is 1; it just takes 10 clocks for each of those multiplies to complete.
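If the latency vs. throughput distinction is hard to picture, a toy software model may help. This is only a sketch of the idea, not how real hardware is described: the `simulate_pipeline` name and the list-based "pipeline" are my own invention, and each stage here just carries the operands along instead of doing a slice of the real work.

```python
def simulate_pipeline(operands, stages=10):
    """Feed one (a, b) pair per clock into a `stages`-deep pipeline.

    Returns a list of (issue_clock, finish_clock, product) tuples.
    """
    pipe = [None] * stages          # pipe[i] holds the work in stage i
    pending = list(operands)
    finished = []
    clock = 0
    while pending or any(slot is not None for slot in pipe):
        clock += 1
        # On each clock edge a new multiply enters stage 0 ...
        new_item = (pending.pop(0), clock) if pending else None
        pipe = [new_item] + pipe
        # ... and whatever was in the last stage comes out finished.
        done = pipe.pop()
        if done is not None:
            (a, b), issued = done
            finished.append((issued, clock, a * b))
    return finished

results = simulate_pipeline([(2, 3), (4, 5), (6, 7), (8, 9), (10, 11)])
for issued, done_at, product in results:
    print(f"issued at clock {issued}, finished at clock {done_at}: {product}")
```

Every multiply finishes 10 clocks after it was issued (the latency), but once the pipeline is full a result comes out on every clock (the throughput).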
So the answer to your question, how fast can an FPGA do a multiply, is really up to you. FPGAs come in different sizes and speeds, and you can dedicate as much of that logic to the task at hand as you want. But let's look at one specific scenario...
Let's say that we want to use the largest Spartan-3A and all we care about is 32-bit floating point multiplies. A 32-bit float multiply requires a 24x24 integer multiplier and an 8-bit adder. This requires four of the dedicated multiplier blocks and some generic slices (too few to care about). The XC3S1400A has 32 dedicated multipliers, so we can build eight of our floating point multipliers in parallel. A very rough guess on the clock speed would be about 100 MHz. We can fully pipeline this design so that we can complete eight 32-bit floating point multiplies per clock cycle, for an effective speed of 800 million floating point multiplies per second.
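Why four blocks? A 24-bit operand doesn't fit in one dedicated multiplier, so the product has to be assembled from smaller partial products. The Python sketch below shows one way to do that, assuming each block is an 18x18 signed multiplier (so it can handle unsigned operands of up to 17 bits). The split into 12-bit halves is my own choice for illustration; the synthesis tools may well partition the operands differently.

```python
HALF_BITS = 12
HALF_MASK = (1 << HALF_BITS) - 1

def mul24_from_four_blocks(a, b):
    """Multiply two unsigned 24-bit integers using four 12x12 products."""
    if not (0 <= a < 1 << 24 and 0 <= b < 1 << 24):
        raise ValueError("operands must be unsigned 24-bit values")
    a_hi, a_lo = a >> HALF_BITS, a & HALF_MASK
    b_hi, b_lo = b >> HALF_BITS, b & HALF_MASK

    # Each of these four products would map to one dedicated multiplier.
    hh = a_hi * b_hi
    hl = a_hi * b_lo
    lh = a_lo * b_hi
    ll = a_lo * b_lo

    # Shift the partial products into place and add them up.
    return (hh << (2 * HALF_BITS)) + ((hl + lh) << HALF_BITS) + ll

a, b = 0xC90FDB, 0xADF854   # two arbitrary 24-bit mantissas
assert mul24_from_four_blocks(a, b) == a * b
```

The shifts and adds are the "generic slices" part; the four multiplications are what eat the dedicated blocks.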
A double precision multiply requires 9 dedicated multiplier blocks per floating point multiply, so we could only do 3 multiplies in parallel-- resulting in a speed of about 300 million 64-bit floating point multiplies per second.
For comparison, let's consider the newer Xilinx Virtex-7 series. The dedicated multipliers in that are bigger, so we only need 6 dedicated multiplier blocks for a 64-bit floating point multiply. There are also 1,920 dedicated multipliers on the largest part-- so we can do 320 double precision floating point multiplies in parallel. Those parts are also much faster. I estimate that we can run those parts at 200 MHz, giving us a total speed of 64 BILLION double precision floating point multiplies per second. Of course, those chips cost about US$10,000 each.
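All of these estimates follow the same back-of-the-envelope formula: divide the number of dedicated multiplier blocks by the blocks needed per floating point multiply (rounding down), then multiply by the clock rate, assuming a fully pipelined design. Here it is as a small Python sketch; the function name is made up, and the block counts and clock speeds are just my rough numbers from above.

```python
def multiplies_per_second(total_blocks, blocks_per_multiply, clock_hz):
    """Rough peak throughput for fully pipelined multipliers."""
    # Leftover blocks that can't form a whole multiplier are wasted.
    parallel_units = total_blocks // blocks_per_multiply
    # One result per unit per clock once the pipelines are full.
    return parallel_units * clock_hz

spartan_single = multiplies_per_second(32, 4, 100_000_000)
spartan_double = multiplies_per_second(32, 9, 100_000_000)
virtex7_double = multiplies_per_second(1920, 6, 200_000_000)

print(f"Spartan-3A, 32-bit: {spartan_single:,} per second")
print(f"Spartan-3A, 64-bit: {spartan_double:,} per second")
print(f"Virtex-7,   64-bit: {virtex7_double:,} per second")
```

These are peak numbers. A real design also has to get operands to the multipliers that fast, which this ignores entirely.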
Floating point division is much harder to do quickly. The logic is much bigger, especially in an FPGA, and it runs much slower. The same is true for most CPUs, in that the division instructions (floating and fixed point) run much slower. If speed is important then you want to eliminate as many of the divides as possible. For example, rather than dividing by 5, you should multiply by 0.2. In fact, on many systems it is faster to calculate a reciprocal and then do a multiply than to just do a divide.
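In software terms the reciprocal trick looks like the sketch below. It is only meant to show the shape of the transformation (Python itself won't show you the speed difference), and the function names are made up. One caveat I should label clearly: multiplying by a reciprocal is not always bit-for-bit identical to dividing, because the reciprocal is rounded first, so the results can differ in the last bit.

```python
import math

def scale_by_divide(values, divisor):
    """One divide per element."""
    return [v / divisor for v in values]

def scale_by_reciprocal(values, divisor):
    """One divide total, then one multiply per element."""
    reciprocal = 1.0 / divisor
    return [v * reciprocal for v in values]

samples = [1.0, 2.5, 3.0, 7.0, 1234.5678]
divided = scale_by_divide(samples, 5.0)
multiplied = scale_by_reciprocal(samples, 5.0)

# Equal to within rounding, though not necessarily bit-identical.
for d, m in zip(divided, multiplied):
    assert math.isclose(d, m, rel_tol=1e-15)
```

The payoff grows with the number of values that share the same divisor, since the one expensive divide is amortized over all of them.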
The same trade-offs apply to division as multiplication-- it is just that division is always going to be much slower and much bigger than multiplication.