45

Why does hardware division take so much longer than multiplication on a microcontroller? E.g., on a dsPIC, a division takes 19 cycles, while multiplication takes only one clock cycle.

I went through some tutorials, including Division algorithm and Multiplication algorithm on Wikipedia. Here is my reasoning.

A division algorithm, like the restoring slow-division method on Wikipedia, is a recursive algorithm. This means that (intermediate) results from step k are used as inputs to step k+1, so these algorithms cannot be parallelized. Therefore, it takes at least n cycles to complete the division, where n is the number of bits in the dividend. For 16-bit dividends, this is at least 16 cycles.
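
For concreteness, here is a rough behavioral-Verilog sketch of the restoring method as I understand it (the module and signal names are made up for illustration; I have no idea whether the dsPIC's divider actually looks anything like this):

// Rough sketch of restoring division: one quotient bit per clock cycle.
// Purely illustrative; a real divider may differ in many details.
module restoring_div #(parameter N = 16) (
   input              clk,
   input              start,
   input  [N-1:0]     dividend,
   input  [N-1:0]     divisor,
   output reg [N-1:0] quotient,
   output reg [N-1:0] remainder,
   output reg         done
);
   reg [N-1:0] q;     // dividend shifts out, quotient bits shift in
   reg [N:0]   r;     // partial remainder, one extra bit for the trial subtraction
   reg [4:0]   count; // wide enough for N <= 31

   always @(posedge clk) begin
      if (start) begin
         q     <= dividend;
         r     <= 0;
         count <= N;
         done  <= 0;
      end else if (count != 0) begin
         // Step k needs the partial remainder from step k-1,
         // which is why the loop cannot be parallelized.
         if ({r[N-1:0], q[N-1]} >= {1'b0, divisor}) begin
            r <= {r[N-1:0], q[N-1]} - {1'b0, divisor};
            q <= {q[N-2:0], 1'b1};
         end else begin
            r <= {r[N-1:0], q[N-1]};
            q <= {q[N-2:0], 1'b0};
         end
         count <= count - 1;
      end else if (!done) begin
         quotient  <= q;
         remainder <= r[N-1:0];
         done      <= 1'b1;
      end
   end
endmodule // restoring_div

With N = 16 that is 16 iterations plus some overhead, which is roughly the same ballpark as the 18-19 cycles quoted for the dsPIC.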

A multiplication algorithm doesn't need to be recursive, which means that it is possible to parallelize it. However, there are many different multiplication algorithms, and I don't have a clue which one may be used by microcontrollers. How does multiplication work in hardware on a microcontroller?

I've found the Dadda multiplier algorithm, which is supposed to take only one clock cycle to finish. What I don't get is that Dadda's algorithm proceeds in three steps, where the results of step 1 are used in step 2, and so on. According to this, it should take at least three clock cycles to finish.

Peter Mortensen
  • 1,676
  • 3
  • 17
  • 23
Marko Gulin
  • 1,531
  • 2
  • 17
  • 32
  • 2
    The algorithm doesn't really define the number of clock cycles. Your specific CPU might have a hardware multiplier/divider working in one cycle or 20 cycles regardless of the internal implementation. – Eugene Sh. Jan 16 '17 at 18:55
  • 1
    OP, can you provide a link that gives more information on the 19 vs 1 cycles you talk about? Something specific about your DSP. – Vladimir Cravero Jan 16 '17 at 18:57
  • 2
    Thanks for the answers. Here is the datasheet for my microcontroller: http://ww1.microchip.com/downloads/en/DeviceDoc/70005127c.pdf . See the Instruction set overview, starting on page 292. It says that all DIV instructions take 18 cycles, while all MUL instructions take only 1 cycle. But this is not specific to this MCU; I've seen the same in many other MCUs. – Marko Gulin Jan 16 '17 at 19:00
  • What takes longer if **you** do paper-and-pencil multiplication and division? Why? – Curd Jan 16 '17 at 22:56
  • 2
    @Curd, well, they're about the same, aren't they. Are for me. I don't think that illustrates it as well as you might imagine. – TonyM Jan 16 '17 at 23:08
  • 1
    @TonyM the paper-and-pen division algorithm I have in my mind inherently applies a series of multiplications of divisor-digits × 1-digit in each step, plus a modulo operation and a subtraction, so with exactly *that* approach to division in my head, I'd argue that division requires more effort than multiplication. Your point is still valid, because instead of trying out exhaustively how many times the divisor fits into the remainder, there are good heuristics. – Marcus Müller Jan 17 '17 at 10:00
  • 3
    The other factor is economics and usage patterns. Most usages invoke multiply far more often than divide. Dedicating a large area of silicon to a faster hardware divide function that will be used relatively infrequently is poor economics. Better to make a smaller and cheaper chip, or to use the extra logic in a more productive manner. BTW when I started with minicomputers, divide wasn't always an instruction. On some machines it was a software library call, like square root. – nigel222 Jan 17 '17 at 10:46
  • @nigel222 Thanks, I'll add this to my brief summary. – Marko Gulin Jan 17 '17 at 10:47
  • In passing, are there or were there machines that optimised divide-by-ten in hardware? – nigel222 Jan 17 '17 at 10:49
  • @nigel222 puuuh; I could imagine some pocket calculator MCUs might have a BCD conversion unit/instruction, and that would make that trivial to implement in software (just ignore lowest digit). – Marcus Müller Jan 17 '17 at 11:44
  • @MarkoGulin might also add that when executing division frequently, chances are that you actually want floating point numbers, which are available on many CPUs these days, and dividing those is a different kind of problem (it does incorporate at least one fixed-point division, but that has limited input range) – Marcus Müller Jan 17 '17 at 11:46
  • 1
    Historically even mainframes implemented BCD-string instructions. In financial programming, the cost of binary-to-decimal conversions was regarded as too high. These days just about anything register to register in a general-purpose computer's CPU is "free" and it's accessing the RAM that slows things down. Between then and now, optimised divide-by-ten for integer-to-decimal conversions? I just wondered - never heard of it. – nigel222 Jan 17 '17 at 13:19
  • 1
    Of note wrt floating-point divide. The Cray-1 didn't have one, only a reciprocal-approximation instruction which then used a multiply and two more instructions to get full-accuracy division done. Intel floating-point units also have a reciprocal-approximation instruction. I don't know what it's used for. Also, x = y / const compiles to multiplication by the reciprocal of const. fdiv is needed only for dividing by a variable. – nigel222 Jan 17 '17 at 13:32
  • 1
    [Why is division more expensive than multiplication?](http://stackoverflow.com/q/15745819/995714), [Why is division so much more complex than other arithmetic operations?](http://scicomp.stackexchange.com/q/187/22956) – phuclv Jan 17 '17 at 14:08
  • @TonyM: I doubt that both take about the same time. Of course you have to look at calculations with more than just 1 or 2 digit numbers. Just try it: multiply two random 6 digit numbers to get a 12 digit result, and divide a random 12 digit number by a random 6 digit number to get a 6 digit result, and see how long each takes. – Curd Jan 17 '17 at 20:24

6 Answers

37

A divider maps much less elegantly to typical hardware. Take Lattice ICE40 FPGAs as examples.

Let us compare two cases: this 8x8 bit to 16 bit multiplier:

module multiply (clk, a, b, result);
   input clk;
   input [7:0] a;
   input [7:0] b;
   output reg [15:0] result;   // reg, so it can be assigned inside the always block
   always @(posedge clk)
     result <= a * b;
endmodule // multiply

and this divider that takes two 8-bit operands and produces an 8-bit result:

module divide(clk, a, b, result);
   input clk;
   input [7:0] a;
   input [7:0] b;
   output reg [7:0] result;   // reg, so it can be assigned inside the always block
   always @(posedge clk)
     result <= a / b;
endmodule // divide

(Yes, I know, the clock does nothing here except register the result; the arithmetic itself is purely combinational.)

An overview of the generated schematic when mapping the multiplier to an ICE40 FPGA can be found here and the divider here.

The synthesis statistics from Yosys are:

multiply

  • Number of wires: 155
  • Number of wire bits: 214
  • Number of public wires: 4
  • Number of public wire bits: 33
  • Number of memories: 0
  • Number of memory bits: 0
  • Number of processes: 0
  • Number of cells: 191
    • SB_CARRY 10
    • SB_DFF 16
    • SB_LUT4 165

divide

  • Number of wires: 145
  • Number of wire bits: 320
  • Number of public wires: 4
  • Number of public wire bits: 25
  • Number of memories: 0
  • Number of memory bits: 0
  • Number of processes: 0
  • Number of cells: 219
    • SB_CARRY 85
    • SB_DFF 8
    • SB_LUT4 126

It's worth noting that the synthesized netlists for the full-width multiplier and the divider aren't that different in size. However, if you look at the pictures below, you'll notice the multiplier has a depth of maybe 15 levels, whereas the divider looks more like 50 or so; the critical path (i.e. the longest path that can occur during operation) is what defines the speed!


You won't be able to read this anyway; it's just there to give a visual impression, and I think the difference in complexity is easy to spot. These are single-cycle multiplier/divider implementations!

Multiply

Multiply on an ICE40 (warning: ~100 Mpixel image)

Scaled image of the multiplier

Divide

Divide on an ICE40 (warning: ~100 Mpixel image)

Scaled image of the divider

Marcus Müller
  • 88,280
  • 5
  • 131
  • 237
  • Dear Marcus, thank you very much for this answer! Now, I'll try to simplify your answer, and please correct me if I'm wrong: both divide and multiply operations require several iterations which can be done in a "single clock cycle". However, the number of dedicated gates required for a single-cycle divide operation is much higher than for a single-cycle multiply operation. It is a tradeoff between hardware complexity and speed. Did I get this right? – Marko Gulin Jan 16 '17 at 21:57
  • 4
    no, you can implement them non-iteratively. But it will simply take quite some time until the valid result "ripples" through the logic. The implementations above are non-iterative. – Marcus Müller Jan 16 '17 at 22:04
  • Can you share the graph specs for these images? I want to run them though my own rendering algorithm. – QED Jan 17 '17 at 03:16
  • 13
    I want a wall poster of the divider. – Ian Howson Jan 17 '17 at 05:40
  • @MarcusMüller Hmm, as I can see from your post, the two algorithms (divide/multiply) are almost equal when it comes to required number of gates. However, the divide algorithm is much "deeper" (15 vs. 50 in your post), causing a much longer propagation delay. Please have a look at this in your post: "you'll notice the multiplier has maybe a depth of 15, whereas the multiplier looks more like 50 or so". I guess what you meant is "the divider looks more like 50 or so". – Marko Gulin Jan 17 '17 at 09:07
  • 1
    @QED I'd point you to Yosys to apply to the verilog above: All I did was `yosys`, `read_verilog multiply.v`, `synth_ice40`, `show -format svg -prefix multiply_ice40` . But stand by, I'll repeat that with `show -viewer /bin/true -prefix multiply_ice40` to get the .dot files :) – Marcus Müller Jan 17 '17 at 09:38
  • @QED see my two Github gist links above , [1](https://gist.github.com/marcusmueller/edcf41d7a96b32405c0c0a08ff5f1a4a) and [divide](https://gist.github.com/marcusmueller/3182ccae22a3e714c351c7768a11b075) – Marcus Müller Jan 17 '17 at 09:45
  • 8
    There's a PDF now in the [multiply](https://gist.github.com/marcusmueller/edcf41d7a96b32405c0c0a08ff5f1a4a) gist. It's 3378 × 3177 mm, so please discuss with your significant other before you put this on the bedroom ceiling. – Marcus Müller Jan 17 '17 at 09:51
  • @MarcusMüller Can you please review my short summary from all the answers. – Marko Gulin Jan 17 '17 at 10:25
  • 3
    Your 100 megapixel images are impressive, but are way overkill for the point you're trying to make, and they cause huge problems for anyone trying to view this page on a device with limited memory such as a phone or tablet. If you want to display the images inline, please find a way to produce a lower resolution preview. – Dave Tweed Jan 17 '17 at 14:24
  • 4
    Yo, those graphviz charts are off the hook, yo! – Spencer Williams Jan 17 '17 at 23:05
  • 1
    @SpencerWilliams yo, that's how my Free and Open Source FPGA synthesis flows, man! [Yosys](http://www.clifford.at/yosys/) does that all for me; and much, much more! – Marcus Müller Jan 18 '17 at 14:57
11

We can have multiple layers of logic per clock cycle, but there is a limit. Exactly how many layers of logic we can have, and how complex those layers can be, depends on our clock speed and our semiconductor process.

However, there are many different multiplication algorithms, and I don't have a clue which one may be used by microcontrollers

Afaict most multiplication in computers uses a variant of binary long multiplication. Binary long multiplication involves

  • Shifting one operand by various different amounts
  • Masking the shifted numbers based on the second operand
  • Adding the results of the masking together.

So let's take a look at implementing this in hardware (a rough Verilog sketch follows the list below).

  • Shifting is just a matter of how we wire things up, so it comes for free.
  • Masking requires AND gates. That means one layer of logic, so from a time point of view it's cheap.
  • Addition is relatively expensive due to the need for a carry chain. Fortunately, there is a trick we can use: for most of the addition stages, rather than adding two numbers to produce one, we can add three numbers to produce two (carry-save addition).
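
Here is a purely illustrative Verilog sketch of that shift/mask/add structure for 8x8 bits; the module name is made up, and the single long chain of adders at the end is a deliberate simplification of what a real multiplier (with carry-save reduction) would do:

// Purely illustrative sketch of binary long multiplication in hardware.
// A real multiplier would reduce the partial products with carry-save
// (3-to-2) stages rather than one long chain of ripple adders.
module long_mult8 (
   input  [7:0]  a,
   input  [7:0]  b,
   output [15:0] result
);
   wire [15:0] pp [0:7];   // shifted and masked partial products
   genvar i;
   generate
      for (i = 0; i < 8; i = i + 1) begin : gen_pp
         // "shifting" is just wiring; "masking" is an AND with one bit of b
         assign pp[i] = b[i] ? ({8'b0, a} << i) : 16'b0;
      end
   endgenerate
   // adding the masked partial products together
   assign result = pp[0] + pp[1] + pp[2] + pp[3]
                 + pp[4] + pp[5] + pp[6] + pp[7];
endmodule // long_mult8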

So let's ballpark how many logic stages we need for an 8x8 multiplier with a 16-bit result. For simplicity, let's assume we don't try to optimise for the fact that not all of the intermediate results have bits in all of the positions.

Let's assume that a full adder is implemented in two "gate stages" (a sketch of such a carry-save stage follows the list below).

  • 1 for masking to produce 8 intermediate results.
  • 2 to add groups of three numbers to reduce the 8 intermediate results to 6
  • 2 to add groups of three numbers to reduce the 6 intermediate results to 4
  • 2 to add a group of three numbers to reduce the 4 intermediate results to 3
  • 2 to add a group of three numbers to reduce the 3 intermediate results to 2
  • 32 to add up the final two results.

So about 41 logic stages total, most of which are spent adding up the last two intermediate results.
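
For reference, a single bit of one of those 3-to-2 (carry-save) reduction steps is just a full adder; the sketch below (illustrative only) shows why it costs roughly two gate stages:

// Purely illustrative: one bit of a 3-to-2 carry-save stage, i.e. a full adder.
// Both outputs are about two gate levels deep, which is the "2 gate stages"
// assumed per reduction step above.
module csa_bit (
   input  a,
   input  b,
   input  c,
   output sum,    // two XOR levels
   output carry   // AND level followed by OR level
);
   assign sum   = a ^ b ^ c;
   assign carry = (a & b) | (b & c) | (a & c);
endmodule // csa_bit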

This could be improved further by exploiting the fact that not all the intermediate results have all bits present (that is basically what the Dadda multiplier does), by using a carry-lookahead adder for the final step, by adding 7 numbers to produce 3 instead of three to produce two (reducing the number of stages at the price of more and wider gates), etc.

Those are all minor details though; the important point is that the number of stages needed to multiply two n-bit numbers and produce a 2n-bit result is roughly proportional to n.


On the other hand, if we look at division algorithms, we find they all have an iterative process where:

  1. What is done in one iteration depends heavily on the results of the previous iteration.
  2. The number of logic stages required to implement an iteration is roughly proportional to n (subtraction and comparison are very similar in complexity to addition).
  3. The number of iterations is also roughly proportional to n.

So the number of logic stages required to implement division is roughly proportional to n squared.

Peter Green
  • 21,158
  • 1
  • 38
  • 76
  • Thank you for your answer. I've read on Wiki that Dadda's algorithm is very efficient when it comes to the number of gates required to implement it in hardware. Despite that, most hardware uses "binary long multiplication"? – Marko Gulin Jan 16 '17 at 22:03
  • 1
    It looks like Dadda's algorithm is an optimised version of binary long multiplication. – Peter Green Jan 16 '17 at 22:04
  • I burn 8 cycles to do a 1/x division. I then use that against an 8 cycle multiplication for a fixed cost of 16 cycles. – b degnan Jan 17 '17 at 12:23
  • This nicely demonstrates that multiplication is not so much worse than addition after all. – Hagen von Eitzen Jan 17 '17 at 19:11
  • 1
    An iteration requires a subtraction which can be done in O(lgN) stages using O(NlgN) hardware, or O(sqrt(N)) stages using O(N) hardware. The essential point, though, is that multiplication requires O(lgN) stages, while division requires O(NlgN) stages. Not O(N*N), but larger than multiplication by a factor of O(N) unless one starts by taking an approximate reciprocal so as to allow more work to be done per step. – supercat Jan 17 '17 at 22:00
  • Why does adding the final two results require 32 stages? Wouldn't a carry-lookahead adder help here? – Nayuki Jan 18 '17 at 05:47
9

Slow division is inherently iterative so it tends to take longer. There are somewhat faster slow division algorithms than the simple ones, using lookup tables. The SRT algorithm produces two bits per cycle. An error in such a table was the cause of the infamous Pentium FDIV bug (ca. 1994). Then there are so-called fast division algorithms.

Of course, in principle, you could simply use a huge lookup table to calculate the product or quotient of two numbers, and thus get results in a single cycle, but that tends to rapidly get impractical as the number of bits per number increases.
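
As a (deliberately toy) illustration of why that blows up, the sketch below is a hypothetical single-cycle "divider" that is nothing but a lookup table; for 4-bit operands the ROM already needs 2^8 = 256 entries, and for 16-bit operands it would need 2^32:

// Toy sketch: a "divider" that is nothing but a lookup table.
// n-bit operands need a ROM with 2^(2n) entries, so this only scales
// to very small n (here 4x4 -> 256 entries).
module lut_div4 (
   input      [3:0] a,   // dividend
   input      [3:0] b,   // divisor
   output reg [3:0] q
);
   reg [3:0] rom [0:255];
   integer i, j;
   initial
      for (i = 0; i < 16; i = i + 1)
         for (j = 0; j < 16; j = j + 1)
            rom[i*16 + j] = (j == 0) ? 4'hF : i / j;   // saturate on divide-by-zero
   always @* q = rom[{a, b}];
endmodule // lut_div4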

Spehro Pefhany
  • 376,485
  • 21
  • 320
  • 842
  • But the bottom line is - division algorithms cannot be parallelized, unlike multiplication algorithms, and that is why they are so much slower? – Marko Gulin Jan 16 '17 at 19:02
  • 2
    @MarkoGulin "cannot" is a very strong assertion. It's certainly not straightforward. – Spehro Pefhany Jan 16 '17 at 19:14
  • 3
    I think you could weaken it from "division algorithms cannot be parallelized" to "the ways we have found to parallelize division are more straining on the hardware implementing the division than parallelized multiplication." Sphero gives an example of how to do single-cycle division using O(2^n) gates to multiply n-bit numbers... but that's just not practical. – Cort Ammon Jan 17 '17 at 00:14
  • 2
    Long division can exploit parallelism to any desired degree by computing an approximate reciprocal which, when multiplied by the divisor, yields a result of the form 1000...xxxx. When working with a divisor in such a form with N leading zeroes, it's easy to compute N bits of the result with each step. – supercat Jan 17 '17 at 02:02
6

Practical division algorithms are all based on numerical sequences that converge to the quotient.

  • There are additive methods, such as non-restoring or SRT, which work by adding or subtracting 2^N to/from the quotient and correspondingly adding or subtracting 2^N * divisor to/from the partial remainder until it has converged to zero.

  • There are multiplicative methods, such as Newton-Raphson or Goldschmidt, which are root-finding methods where division is calculated as the inverse of multiplication.

Additive methods give one or a few bits per cycle. Multiplicative methods double the number of correct bits each iteration, but need some initial approximation, often obtained from a constant table.
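
As a purely illustrative sketch of the multiplicative idea (the module name and fixed-point format are my own choices, not any particular design), one Newton-Raphson refinement step for the reciprocal, x_next = x_k * (2 - D * x_k), could look like this:

// Illustrative only: one Newton-Raphson step for 1/D in Q2.16 fixed point.
// Assumes D has been normalized to 0.5 <= D < 1.0 and that x_k is already a
// reasonable approximation of 1/D (e.g. from a small constant table), so all
// intermediate values stay below 4.0. Each such step roughly doubles the
// number of correct bits.
module nr_recip_step (
   input  [17:0] d,      // divisor D, Q2.16
   input  [17:0] x_k,    // current estimate of 1/D, Q2.16
   output [17:0] x_next  // refined estimate, Q2.16
);
   localparam [17:0] TWO = 18'd131072;       // 2.0 in Q2.16

   wire [35:0] dx   = d * x_k;               // Q4.32, close to 1.0
   wire [17:0] dx_q = dx[33:16];             // truncate back to Q2.16
   wire [35:0] prod = x_k * (TWO - dx_q);    // x_k * (2 - D*x_k), Q4.32
   assign x_next = prod[33:16];              // back to Q2.16
endmodule // nr_recip_step

A real divider of this kind gets a first estimate from a small table, refines it with two or three such steps, and then multiplies by the dividend (with rounding/correction logic) to produce the quotient.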

The "slow" and "fast" labels are misleading, as the actual speed depends on the number of bits, on how much hardware is devoted to the function (and a fast multiplier is very large)...

Division is slower than multiplication because there is no direct, parallel method for calculating it: either there is an iteration, or hardware is replicated to implement the iteration as cascaded (or pipelined) blocks.

TEMLIB
  • 3,347
  • 1
  • 14
  • 21
5

A division algorithm (in fact, any algorithm) can be made to run in one clock cycle, if you are willing to pay for the extra transistors and a lower allowed clock rate.

Suppose you have a set of gates that implements one clock cycle of an existing multi-cycle division algorithm. To make the algorithm single cycle, use multiple stages of hardware (similar to that used in one stage of the multi-cycle algorithm), with the output of one stage feeding the next stage.
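
For example, here is a purely illustrative Verilog sketch (not any particular MCU's divider; the module name and parameters are made up) of the restoring-division step unrolled into N cascaded combinational stages:

// Purely illustrative: the iterative restoring-division step unrolled into N
// cascaded combinational stages, trading area and clock speed for a
// single-cycle result.
module unrolled_div #(parameter N = 8) (
   input  [N-1:0] dividend,
   input  [N-1:0] divisor,
   output [N-1:0] quotient,
   output [N-1:0] remainder
);
   wire [N:0]   r [0:N];   // partial remainders between stages
   wire [N-1:0] q [0:N];   // quotient bits accumulated so far
   assign r[0] = 0;
   assign q[0] = 0;
   genvar i;
   generate
      for (i = 0; i < N; i = i + 1) begin : stage
         // shift the remainder left and bring in the next dividend bit
         wire [N:0] trial = {r[i][N-1:0], dividend[N-1-i]};
         wire       ge    = trial >= {1'b0, divisor};
         assign r[i+1] = ge ? trial - {1'b0, divisor} : trial;
         assign q[i+1] = {q[i][N-2:0], ge};
      end
   endgenerate
   assign quotient  = q[N];
   assign remainder = r[N][N-1:0];
endmodule // unrolled_div

Each generate iteration turns into its own block of gates, so the area grows roughly N-fold compared to a one-step-per-cycle divider, and the critical path runs through all N stages.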

Of course, the reason not to do it that way is that it uses a lot of transistors. For example, a 16-bit division may use nearly 16× more transistors. Also, having more stages of gates lowers the maximum allowed clock frequency (because there are more stages of propagation delay).

user4574
  • 11,816
  • 17
  • 30
  • To put it another way: you *could* slow down the rest of the CPU so everything else was as slow as division, but that would be insane vs. letting it take many cycles while having everything else fast. – Peter Cordes Mar 10 '22 at 11:24
1

Why does hardware division take so much longer than multiplication on a microcontroller?

This is not an electronics question. At best it is a computer question, better addressed to Stack Overflow.

See, for example, here: Is multiplication faster than float division?

In reality, it is a Real Life question: Why does division take so much longer than multiplication?

Which would you rather calculate on paper?

51 * 82

or

4182 / 51
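
To make the asymmetry concrete (a hand-worked illustration): 51 × 82 is just two shifted partial products, 51 × 2 = 102 and 51 × 80 = 4080, which you can form independently and then add to get 4182. But 4182 / 51 forces you to guess each quotient digit, multiply to check the guess, and subtract before you can even start on the next digit: try 8 → 51 × 8 = 408, remainder 418 − 408 = 10, bring down the 2 to get 102, try 2 → 51 × 2 = 102, remainder 0, giving 82.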

Division takes longer than multiplication because it is harder to do.

Nick Gammon
  • 908
  • 1
  • 9
  • 23