Set off by a question on CodeReview, I revisited "8-bit AVR unsigned division".
Trying much the same as for below 8 bit/8 bit → Q, R implementation, I found assessment of cycle counts of an "unrolled non-restoring bit-wise 16/16 → Q, R" tedious as well as error prone - e.g., overlooking exceeding the limit for conditional branch displacement.
(Using neglected skills and a fitting Atmel Studio 6.2 installation)
; 8-bit AVR non-restoring udiv proof-of-principle, uses about two cycles more than libgcc's
; (<https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/avr/lib1funcs.S>)
; improved with substituting 8 for 9 and "rol r_arg1" for "rjmp __udivmodqi4_ep"
.def __sor = r21 /* -divisor */
_udivmodq:; 60-68 cycles "+ ret", 20 words. Q = 255, R = dividend when divisor 0
ldi cnt, 8 ; init loop counter
clr __sor ; 1
sub __sor, dvsor; 2 __sor = -dvsor
clr rem ; 3 there will be four jumps to and fro, tops
lsl dividend ; 4
pos_low:
rol rem ; shift dividend into remainder
add rem, __sor ; 1 adding to keep carry interpretation consistent
brcc neg_rol ; 2 remainder turned negative
pos_rol: ; four jumps here, at most; same number to neg
rol dividend ; 4 shift dividend (with CARRY)
dec cnt ; 5 decrement loop counter
brne pos_low ; 6 8
ret
neg_low:
rol rem ; shift dividend into remainder
add rem, dvsor ; 1 adding to keep carry interpretation consistent
brcs pos_rol ; 2 remainder restored to positive
neg_rol: ; four jumps here, at most; one less to pos
rol dividend ; 4
dec cnt ; 5 decrement loop counter
brne neg_low ; 6 8
; 63 upper bound - tight? 62 because 7 taken brcxs at most?!
add rem, dvsor ; restore remainder
ret
How should I assess worst case runtime?
How should I assess correctness of advanced implementations, say, along the lines of Fast 16-bit / 16-bit unsigned division for megaAVR?