AVR assembly: fastest way to increment two combined bytes

Question

What is the fastest way to increment two combined bytes in assembler (assuming I'm working on an 8-bit CPU)? Currently I'm doing this:

OVF1_handler: ; TIMER1 overflow ISR

lds r21, timerhl ; load low byte into working register; 2 cycles
add r21, counter_inc ; add 1 to working register (value of counter_inc is 1); 1 cycle

brbs 0, OVF1_handler_carry ; branch if bit 0 (carry flag bit) of SREG is set; 1 cycle if false . 2 cycles if true
sts timerhl, r21 ; otherwise write value back to variable; 2 cycles
reti ; we're done

OVF1_handler_carry: ; in case of carry bit is set
    sts timerhl, r21 ; write value of low byte back to variable; 2 cycles

    lds r21, timerhh ; load high byte into working register; 2 cycles
    inc r21 ; increment it by 1 (no carry check needed here); 1 cycle
    sts timerhh, r21 ; write value of high byte back to variable; 2 cycles

reti ; we're done

So in sum there are

255 * (2+1+1+2) + (2+1+2+2+2+1+2) = 1542 cycles

to count from 0 to 256 (255 times (2+1+1+2) because no overflow plus 1 time (2+1+2+2+2+1+2) when overflow occurs).

Is my calculation correct and is there a faster way?

jippie · Accepted Answer · 2013-04-07T21:12:36.530

8

Have a bit more trust in your compiler. Write the code in C, compile it and look at the disassembly. Unsure which toolchain you use, but avr-gcc creates pretty well optimized code.

lds     r24 , lowbyte   ; 2 clocks
lds     r25 , highbyte  ; 2 clocks
adiw    r24 , 0x01      ; 2 clocks - Add Immediate to Word (= 16 bit)
sts     lowbyte  , r24  ; 2 clocks
sts     highbyte , r25  ; 2 clocks

You can disassemble the .elf file with the following command (provided you use the gcc toolchain):

avr-objdump -C -d $(src).elf

BTW: You probably need to push the used registers to stack beforehand and pop them afterwards (2 cycles each). Also remember that an interrupt (including reti) lasts at least 8 clock cycles apart from the instructions being executed.

; TIMER1_OVF            ;  4 clocks
push    r24             ;  2 clocks
IN      r24 , SREG      ;  1 clock  - save CPU flags
push    r24             ;  2 clocks
push    r25             ;  2 clocks
; do the addition above - 10 clocks
pop     r25             ;  2 clocks
pop     r24             ;  2 clocks
OUT     SREG , r24      ;  1 clock
pop     r24             ;  2 clocks
reti                    ;  4 clocks
; total 32 clock ticks

edited Apr 07 '13 at 21:12

answered Apr 07 '13 at 20:57

jippie

33,033
16
93
160

Or you can give `avr-gcc` an argument to output disasembly in the compilation process. – angelatlarge Apr 07 '13 at 20:58
Personally I think avr-gcc is generating a messy listing, it does contain lots of comments though. – jippie Apr 07 '13 at 21:00
So there are 10 clocks in sum. Counting from 0 to 256 would take then `256 * 10 = 2560` clocks. That's 1000 clocks more than in my code. – arminb Apr 07 '13 at 21:11
It has predictable number of clock cycles, no matter if you have a carry half way or not. And it is shorter than your code in case of a carry. – jippie Apr 07 '13 at 21:13
I dont't understand. My example also has predictable number of clock cycles. 255 times it takes `(2+1+1+2)` cycles and 1 time it takes `(2+1+2+2+2+1+2)` cycles. I'm not loking for a short looking code but for the fastet :o) – arminb Apr 07 '13 at 21:16
No, the execution time of your code is different in case your low byte overflows. Therefore you have 2 exit pionts (`reti`) – jippie Apr 07 '13 at 21:17
Yes that's right but this fact makes it faster or doesn't it? Your code always takes 2560 clock cycles and mine 1542. – arminb Apr 07 '13 at 21:19
Why would you use two bytes if you only want to count to 256? – jippie Apr 07 '13 at 21:22
Because I must use 2 bytes if I want to count to values greater than 255. In fact I want to count to 65535 but also in this case my code also takes less clock cycles (394752) and yours 655360. Sorry if I'm misunderstanding something. – arminb Apr 07 '13 at 21:26
Hmm .. think you may be right. On top you would gain extra because you only have a single register to push/pop. – jippie Apr 07 '13 at 21:27
Notice that you don't push the CPU flags onto stack (SREG in my answer), that may affect your main loop. – jippie Apr 07 '13 at 21:30
But I could use r24 and r25 only for this purpose so I don't have to load and store it. That would save me 8 clock cycles wouldn't it? So I only had to `adiw r24 , 0x01` which would only take 2 cycles. That would take 131072 clock cycles for counting from 0 to 65535. – arminb Apr 07 '13 at 21:32
1

That is a possibility, yes, and the advantage for using assembly. Still your CPU flags will change on every interrupt, which may affect your main loop. – jippie Apr 07 '13 at 21:39
1

If that is not a problem, then there are some AVR's that have room for 2 instructions in the interrupt vector table. On top of that, if you don't use the next entry in that table you can use it yourself anyway. So with a bit of luck you can fit the entire interrupt routine in the vector table. Saves you a branch (2 clocks). – jippie Apr 07 '13 at 21:45

score 1 · Answer 2 · answered Dec 28 '22 at 17:24

When resorting to assembler,

still value readable / maintainable higher than saved two bytes. Or cycles.
Unless that saving makes what otherwise would be broken.
try to get things done without ado.
check with a decent compiler if it doesn't do better.
godbolt is a valuable resource here.

A stab or three at incrementing a two byte integral value in an ISR:

OVF1_handler:          ; TIMER1 overflow ISR
    push r21           ; 1 cycle (2 if AVRe)
    in   r21, SREG     ; 1 cycle
    push r21           ; 1 cycle (2 if AVRe)

    lds  r21, timerhl  ; load low byte into working register; 2 cycles
    subi r21, -1       ; 1 cycle
    sts  timerhl, r21  ; write value back to variable; 2 cycles
#if !BEST_WORST_CASE  // best worst case cycles: process carry unconditionally
# if !BEST_PREDOMINANT_CASE    // no duplicated epilogue, 7/11 cycles
    brcs OVF1_epilogue ; 1 cycle if no carry, else 2
# else // BEST_PREDOMINANT_CASE  // duplicated epilogue, 6/12 cycles
    brcc OVF1_overflow ; 1 cycle if carry, else 2
    pop  r21
    out  SREG, r21
    pop  r21
    reti               ; we're done
# endif
#endif

OVF1_overflow:         ; in case lower byte overflowed
    lds  r21, timerhh  ; load high byte into working register; 2 cycles
    sbci r21, -1       ; process carry; 1 cycle - no faster than addiw
    sts  timerhh, r21  ; write value of high byte back to variable; 2 cycles

OVF1_epilogue:
    pop  r21           ; 2 cycles
    out  SREG, r21     ; 1 cycle
    pop  r21           ; 2 cycles
    reti               ; we're done

most of any advantage over using addiw comes from not saving&restoring a second register.

AVR assembly: fastest way to increment two combined bytes

2 Answers2