14

I recently compiled some C++ code for the ATmega1284P in Atmel Studio and was analyzing the timings of some routines using my scope. To my surprise, a loop I thought I had optimized was taking longer than expected.

After taking a peek at the assembly code, I noticed that the compiler has compiled the following multiplication:

word *= 10;

into repeated addition:

add    r28, r28
adc    r29, r29
add    r18, r18
adc    r19, r19
add    r18, r18
adc    r19, r19
add    r18, r18
adc    r19, r19
add    r18, r28
adc    r19, r29

Now although I admire the optimizations in this repeated addition, there is no reason to do so when the chip has hardware multiplication instructions available, like mul.

1284P hardware multiplication supported

Why is gcc behaving this way and how can I tell it to use hardware multiplication instructions? Note that I have set up the multiplication so overflow is not an issue. word is a uint16_t.

Peter Mortensen
  • 1,676
  • 3
  • 17
  • 23
Hackstaar
  • 886
  • 6
  • 22

2 Answers2

21

Assuming that you want to optimize for speed:

Unless you really find out how many cycles it takes if you use mul, you can't compare.

So let’s try:

  • If you use the mul instruction, you need two of them (the operand is 16 bit). So that already costs 4 cycles.
  • You have to load constant 10 into one register: 1 cycle.
  • Then you have to add both 16 bit results. That costs another 2 cycles.
  • Then you have to consider, that the results of both muls always go into register pair r1:r0, but in order to add both results you have to move one of them to some other place: that costs 2 more cycles.

So that makes it already 4 + 1 + 2 + 2 = 9 cycles, just one cycle less than what you got with repeated addition, but you've used more registers. That alone may justify repeated addition (depending on what else is done, having used one more register may cost you more than one cycle).

EDIT:
The calculation above was a back-of-the-envelope calculation; further analysis (cf. 2012rcampion's answer) shows that moving a register pair can be done in 1 cycle (there's a special instruction) and for adding the two multiplication results one add is enough (because we are not interested in a full 3 byte result but want to do a wrap-around with 16-bit width); on the other hand additional measures are required: e.g. clearing the zero register (I guess r0).
Note also that the variant with mul uses much more registers: 7 instead of only 4; a high price because it may cause several more cycles (and code size) at other places.

Curd
  • 16,043
  • 34
  • 43
  • 2
    Ahah! The fact that the value is stored in a r1:r0 escaped me. That makes sense for a multiply-by 10 being optimized down to additions. I tried multiplying by 100 and it used mul's like I expected. Marked as accepted. – Hackstaar Apr 28 '21 at 08:21
  • 2
    Ten additions still seems less than optimal, wouldn't 5 additions followed by an ROR be better? – Glen Yates Apr 28 '21 at 15:53
  • It seems to be more complicated than 10 simple additions. Four registers are involved. Is it ((2*(2*word)) + 1*word)*2? – Peter Mortensen Apr 28 '21 at 16:03
  • 1
    Glen Yates: Perhaps because the registers are 8-bit registers? (Not a rhetorical question) – Peter Mortensen Apr 28 '21 at 16:08
  • 1
    @PeterMortensen: Yep, it looks like each `add`+`adc` pair is a single 16 bit addition. So what the assembly code quoted by the OP seems to be doing is calculating `2 * word` (one doubling) and `8 * word` (three doublings) and then adding them together. Five 16 bit additions all together. (It also seems to assume that the original value of `word` is found both in `r18`/`r19` and in `r28`/ `r29`. Presumably that's ensured by something before the quoted code that the OP left out.) – Ilmari Karonen Apr 28 '21 at 16:26
  • 1
    @PeterMortensen Thanks, my assembly experience is from a different platform and of course I meant LSL and ROL instead of right shifts and rolls. – Glen Yates Apr 28 '21 at 16:37
  • @PeterMortensen this is correct, and that's another reason why multiplying by ten is more efficiently done this way. Ten additions would take more cycles than mul. – Hackstaar Apr 28 '21 at 20:03
4

I can't tell you why GCC is doing the multiply this way, but I can tell you that it is not faster than using mul. I used the the datasheet (section 33) to count the number of cycles the following routine takes (AVR-GCC 9.2.0, compiler explorer):

#include <stdint.h>

uint16_t times_ten(uint16_t word) {
    return word * 10;
}

For -O1 through -O3 (optimize for speed):

movw r18,r24 ; 1 cycle 
lsl r18      ; 1
rol r19      ; 1
lsl r18      ; 1
rol r19      ; 1
add r24,r18  ; 1
adc r25,r19  ; 1
lsl r24      ; 1
rol r25      ; 1
             ; 9 total

Note the combination of lsl and rol to multiply a register pair by two is the same speed as adding a register pair to itself with add and adc as shown in the question (2 cycles and 2 instructions).

And for -Os (optimize for code size):

ldi r18,lo8(10)  ; 1
movw r20,r24     ; 1
mul r18,r20      ; 2
movw r24,r0      ; 1
mul r18,r21      ; 2
add r25,r0       ; 1
clr __zero_reg__ ; 1
                 ; 9 total

Interestingly, both methods take the same number of cycles, and using mul uses fewer instructions. In fact, done multiple times as part of a larger function where 10 can be saved in a register and zeroing r1 can be deferred until the end, using mul would be two cycles faster!

2012rcampion
  • 592
  • 3
  • 16
  • 2
    Interesting... I have considered moving to Atmelcrochip's new IDE, which has an improved compiler and the extra option to pay some enormous fee for 'revolutionary optimizations.' Perhaps being a little more intelligent with multiplication is one of those perks, but I think I'll stick with my additions for now... – Hackstaar Apr 29 '21 at 00:24
  • Note that the variant without `mul` uses only **4** registers (r18, r19, r24, r25) while the variant with `mul` uses **7** registers (r0, r1, r18, r20, r21, r24, r25); a big disadvantage of the latter. – Curd Apr 30 '21 at 12:20