How to assess worst case runtime and correctness of an assembly implementation?

Question

Set off by a question on CodeReview, I revisited "8-bit AVR unsigned division".
Trying much the same as for below 8 bit/8 bit → Q, R implementation, I found assessment of cycle counts of an "unrolled non-restoring bit-wise 16/16 → Q, R" tedious as well as error prone - e.g., overlooking exceeding the limit for conditional branch displacement.
(Using neglected skills and a fitting Atmel Studio 6.2 installation)

; 8-bit AVR non-restoring udiv proof-of-principle, uses about two cycles more than libgcc's
; (<https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/avr/lib1funcs.S>)
;  improved with substituting 8 for 9 and "rol  r_arg1" for "rjmp   __udivmodqi4_ep"
.def    __sor = r21 /* -divisor */
_udivmodq:; 60-68 cycles "+ ret", 20 words. Q = 255, R = dividend when divisor 0
    ldi cnt, 8      ;           init loop counter
    clr __sor       ;   1
    sub __sor, dvsor;   2       __sor = -dvsor
    clr rem         ;   3       there will be four jumps to and fro, tops
    lsl dividend    ;   4
pos_low:
    rol rem         ;           shift dividend into remainder
    add rem, __sor  ;   1       adding to keep carry interpretation consistent
    brcc neg_rol    ;   2       remainder turned negative
pos_rol:            ;           four jumps here, at most; same number to neg
    rol dividend    ;   4       shift dividend (with CARRY)
    dec cnt         ;   5       decrement loop counter
    brne pos_low    ;   6    8
    ret

neg_low:
    rol rem         ;           shift dividend into remainder
    add rem, dvsor  ;   1       adding to keep carry interpretation consistent
    brcs pos_rol    ;   2       remainder restored to positive
neg_rol:            ;           four jumps here, at most; one less to pos
    rol dividend    ;   4
    dec cnt         ;   5       decrement loop counter
    brne neg_low    ;   6    8
                    ;   63 upper bound - tight? 62 because 7 taken brcxs at most?!
    add rem, dvsor  ;           restore remainder
    ret

How should I assess worst case runtime?
How should I assess correctness of advanced implementations, say, along the lines of Fast 16-bit / 16-bit unsigned division for megaAVR?

(Would have tagged [tag:timing-analysis] it that didn't read *circuit*. While not mentioning software in general or assembly in particular, *analysis* looked general enough.) — greybeard, Jan 21 '21 at 13:29
(Just found [user4574's original post](https://electronics.stackexchange.com/q/537350).) (*Quad Programmable Comparators*?) — greybeard, Jan 21 '21 at 13:36
These are very traditional theoretical _computer science_ questions, especially the one verifying correctness but also when you want _worst case_ runtime instead of a more practical one. Doesn't change because it's in assembly language, for example Knuth goes over all his algorithms using assembly. — pipe, Jan 21 '21 at 13:57
The limited input range of some 65280 combinations and straightforward timing of the AVR architecture makes an exhaustive test tractable here. You can sample a single-cycle timer before and after the call to measure the duration, though you'll want to cancel for sampling overhead by subtracting the duration of a known dummy return function. — doynax, Jan 23 '21 at 09:04
@doynax: practical with the 8/8bit→Q&R above, but, as stated, I'm at a loss with 16/16→16&16, let alone 24/16→16&16. (Currently trying to get Atmel Studio 6.2 to debug/simulate my code-) — greybeard, Jan 23 '21 at 09:56
@greybeard: Of course, I apologize for not properly reading your question. If so then we are back to analytic methods for proving correctness and worst-case timing. I suppose that you might transcribe the code instruction-by-instruction into a model of the algorithm on computer where an exhaustive 2^32 evaluation is tractable, but that is about as far as this approach scales. This algorithm seems to have limited branching and no loops, so timing should at least be a straightforward through a couple of test cases or manual count. — doynax, Jan 23 '21 at 10:05
(@doynax: Atmel's simulation still misses the very first cycle - or is that me not updating in half a decade? Actually, inputs that *may* be critical/extremal are easy enough to construct for "the bit-wise methods". But then, there are "advanced methods", starting with *multiplication by reciprocal*, and numerical limit for reducing lookup table size (see user4574's question).) — greybeard, Jan 23 '21 at 10:48
@greybeard: The generic Atmel simulator is likely to be far too slow for this size of problem anyway, I was rather referring to transcribing the assembly code line-by-line equivalent C and adding a constant to a cycle table. Or testing inputs exercising the limited paths in the 16-bit variant. As for proving correctness of numerical methods I don't see any straightforward non-exhaustive solutions short of analytic proof, probably your best bet is to prove equivalence to a well-known algorithm. To be honest I barely follow the theorems for Knuths algorithm D. — doynax, Jan 23 '21 at 10:58
... In any event you will of course want to do a sanity-check verifying the results and cycle counts hardware for a subset of cases. Randomized tests are always useful in this area..After all manual transcription and analysis is always error-prone. — doynax, Jan 23 '21 at 11:01

How to assess worst case runtime and correctness of an assembly implementation?

0 Answers0