What is the fastest (in clock cycles) 16-bit x 16-bit unsigned integer division algorithm that will run on an ATMEGA1284?

Question

What is the fastest (in clock cycles) division algorithm that will run on an ATMEGA1284?

The dividend is an unsigned 16-bit number passed into the algorithm in a pair of 8-bit registers.
The divisor is an unsigned 16-bit number passed into the algorithm in a pair of 8-bit registers.
The quotient and remainder are passed out in any pairs of 8-bit registers.
The algorithm code (plus any lookup tables) must consume less than 5K bytes of flash memory.
The algorithm may return any values for division by 0.

AVR Instruction Set Manual:
https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-avr-instruction-set-manual.pdf

AVR200: Multiply and Divide Routines:
https://ww1.microchip.com/downloads/en/AppNotes/doc0936.pdf

The algorithm I have so far takes between 57 and 68 clock cycles and uses 768 bytes of lookup tables. I ran an exhaustive 16-hour test that verified that it calculates the correct quotient and remainder for all 2^32 combinations of dividend and divisor.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;ARGUMENTS:  r16, r17, r18, r19
;  r16:r17 = N (numerator)
;  r18:r19 = D (divisor)
;RETURNS:    r20, r21
;  r20:r21 (quotient)
;  r22:r23 (remainder)
;
;DESCRIPTION:  divides an unsigned 16 bit number N by unsigned 16 bit divisor D
;
;ALGORITHM OVERVIEW
;
;RZERO = 0;
;if(D < 256){
;  N2 = (N * ((R1H_TBL[D] << 8) + R1L_TBL[D])) >> 16;
;  P  = N2 * D
;}else{
;  D1 = (R1H_TBL[D] * D) >> 8
;  N1 = (R1H_TBL[D] * N) >> 8
;  if(D1 < 256){
;    N2 = N1 >> 8;
;  }else{
;    N2 = N2 * R2_TBL[D1 & 0xFF];
;  }
;  P = N2 * D;
;  if(P > 65535){
;    N2 = N2 - 1    ;//Decrement quotient
;    N1 = N2 - P + D;//Calculate remainder
;    return;//return quotient in N2, remainder in N1
;  }
;}
;N1 = N - P;
;if(P > N){
;  N2 = N2 - 1;//decrease quotient
;  N1 = N1 + D;//increase reamainder
;  return;//return quotient in N2, remainder in N1
;}
;if(N1 > D){
;  N2 = N2 + 1;
;  N1 = N1 - D;
;  return;//return quotient in N2, remainder in N1
;}
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.align 256
;Recipricol table #1, high byte.
;R1H_TBL[x] = min( high(2^16/x) / 256 , 255)
R1H_TBL:
.db 0xFF, 0xFF, 0x80, 0x55, 0x40, 0x33, 0x2A, 0x24, 0x20, 0x1C, 0x19, 0x17, 0x15, 0x13, 0x12, 0x11
.db 0x10, 0x0F, 0x0E, 0x0D, 0x0C, 0x0C, 0x0B, 0x0B, 0x0A, 0x0A, 0x09, 0x09, 0x09, 0x08, 0x08, 0x08
.db 0x08, 0x07, 0x07, 0x07, 0x07, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x05, 0x05, 0x05, 0x05, 0x05
.db 0x05, 0x05, 0x05, 0x05, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04
.db 0x04, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03
.db 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02
.db 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02
.db 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02
.db 0x02, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
;Recipricol table #1, low byte.
;R1L_TBL[x] = min( low(2^16/x) mod 256 , 255)
R1L_TBL:
.db 0xFF, 0xFF, 0x00, 0x55, 0x00, 0x33, 0xAA, 0x92, 0x00, 0x71, 0x99, 0x45, 0x55, 0xB1, 0x49, 0x11
.db 0x00, 0x0F, 0x38, 0x79, 0xCC, 0x30, 0xA2, 0x21, 0xAA, 0x3D, 0xD8, 0x7B, 0x24, 0xD3, 0x88, 0x42
.db 0x00, 0xC1, 0x87, 0x50, 0x1C, 0xEB, 0xBC, 0x90, 0x66, 0x3E, 0x18, 0xF4, 0xD1, 0xB0, 0x90, 0x72
.db 0x55, 0x39, 0x1E, 0x05, 0xEC, 0xD4, 0xBD, 0xA7, 0x92, 0x7D, 0x69, 0x56, 0x44, 0x32, 0x21, 0x10
.db 0x00, 0xF0, 0xE0, 0xD2, 0xC3, 0xB5, 0xA8, 0x9B, 0x8E, 0x81, 0x75, 0x69, 0x5E, 0x53, 0x48, 0x3D
.db 0x33, 0x29, 0x1F, 0x15, 0x0C, 0x03, 0xFA, 0xF1, 0xE8, 0xE0, 0xD8, 0xD0, 0xC8, 0xC0, 0xB9, 0xB1
.db 0xAA, 0xA3, 0x9C, 0x95, 0x8F, 0x88, 0x82, 0x7C, 0x76, 0x70, 0x6A, 0x64, 0x5E, 0x59, 0x53, 0x4E
.db 0x49, 0x43, 0x3E, 0x39, 0x34, 0x30, 0x2B, 0x26, 0x22, 0x1D, 0x19, 0x14, 0x10, 0x0C, 0x08, 0x04
.db 0x00, 0xFC, 0xF8, 0xF4, 0xF0, 0xEC, 0xE9, 0xE5, 0xE1, 0xDE, 0xDA, 0xD7, 0xD4, 0xD0, 0xCD, 0xCA
.db 0xC7, 0xC3, 0xC0, 0xBD, 0xBA, 0xB7, 0xB4, 0xB2, 0xAF, 0xAC, 0xA9, 0xA6, 0xA4, 0xA1, 0x9E, 0x9C
.db 0x99, 0x97, 0x94, 0x92, 0x8F, 0x8D, 0x8A, 0x88, 0x86, 0x83, 0x81, 0x7F, 0x7D, 0x7A, 0x78, 0x76
.db 0x74, 0x72, 0x70, 0x6E, 0x6C, 0x6A, 0x68, 0x66, 0x64, 0x62, 0x60, 0x5E, 0x5C, 0x5A, 0x58, 0x57
.db 0x55, 0x53, 0x51, 0x50, 0x4E, 0x4C, 0x4A, 0x49, 0x47, 0x46, 0x44, 0x42, 0x41, 0x3F, 0x3E, 0x3C
.db 0x3B, 0x39, 0x38, 0x36, 0x35, 0x33, 0x32, 0x30, 0x2F, 0x2E, 0x2C, 0x2B, 0x29, 0x28, 0x27, 0x25
.db 0x24, 0x23, 0x21, 0x20, 0x1F, 0x1E, 0x1C, 0x1B, 0x1A, 0x19, 0x18, 0x16, 0x15, 0x14, 0x13, 0x12
.db 0x11, 0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01
;Recipricol table #2
;R2_TBL[x] = min( 2^16/(x+256), 255)
R2_TBL:
.db 0xFF, 0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA, 0xF9, 0xF8, 0xF7, 0xF6, 0xF5, 0xF4, 0xF3, 0xF2, 0xF1
.db 0xF0, 0xF0, 0xEF, 0xEE, 0xED, 0xEC, 0xEB, 0xEA, 0xEA, 0xE9, 0xE8, 0xE7, 0xE6, 0xE5, 0xE5, 0xE4
.db 0xE3, 0xE2, 0xE1, 0xE1, 0xE0, 0xDF, 0xDE, 0xDE, 0xDD, 0xDC, 0xDB, 0xDB, 0xDA, 0xD9, 0xD9, 0xD8
.db 0xD7, 0xD6, 0xD6, 0xD5, 0xD4, 0xD4, 0xD3, 0xD2, 0xD2, 0xD1, 0xD0, 0xD0, 0xCF, 0xCE, 0xCE, 0xCD
.db 0xCC, 0xCC, 0xCB, 0xCA, 0xCA, 0xC9, 0xC9, 0xC8, 0xC7, 0xC7, 0xC6, 0xC5, 0xC5, 0xC4, 0xC4, 0xC3
.db 0xC3, 0xC2, 0xC1, 0xC1, 0xC0, 0xC0, 0xBF, 0xBF, 0xBE, 0xBD, 0xBD, 0xBC, 0xBC, 0xBB, 0xBB, 0xBA
.db 0xBA, 0xB9, 0xB9, 0xB8, 0xB8, 0xB7, 0xB7, 0xB6, 0xB6, 0xB5, 0xB5, 0xB4, 0xB4, 0xB3, 0xB3, 0xB2
.db 0xB2, 0xB1, 0xB1, 0xB0, 0xB0, 0xAF, 0xAF, 0xAE, 0xAE, 0xAD, 0xAD, 0xAC, 0xAC, 0xAC, 0xAB, 0xAB
.db 0xAA, 0xAA, 0xA9, 0xA9, 0xA8, 0xA8, 0xA8, 0xA7, 0xA7, 0xA6, 0xA6, 0xA5, 0xA5, 0xA5, 0xA4, 0xA4
.db 0xA3, 0xA3, 0xA3, 0xA2, 0xA2, 0xA1, 0xA1, 0xA1, 0xA0, 0xA0, 0x9F, 0x9F, 0x9F, 0x9E, 0x9E, 0x9D
.db 0x9D, 0x9D, 0x9C, 0x9C, 0x9C, 0x9B, 0x9B, 0x9A, 0x9A, 0x9A, 0x99, 0x99, 0x99, 0x98, 0x98, 0x98
.db 0x97, 0x97, 0x97, 0x96, 0x96, 0x95, 0x95, 0x95, 0x94, 0x94, 0x94, 0x93, 0x93, 0x93, 0x92, 0x92
.db 0x92, 0x91, 0x91, 0x91, 0x90, 0x90, 0x90, 0x90, 0x8F, 0x8F, 0x8F, 0x8E, 0x8E, 0x8E, 0x8D, 0x8D
.db 0x8D, 0x8C, 0x8C, 0x8C, 0x8C, 0x8B, 0x8B, 0x8B, 0x8A, 0x8A, 0x8A, 0x89, 0x89, 0x89, 0x89, 0x88
.db 0x88, 0x88, 0x87, 0x87, 0x87, 0x87, 0x86, 0x86, 0x86, 0x86, 0x85, 0x85, 0x85, 0x84, 0x84, 0x84
.db 0x84, 0x83, 0x83, 0x83, 0x83, 0x82, 0x82, 0x82, 0x82, 0x81, 0x81, 0x81, 0x81, 0x80, 0x80, 0x80

.def NH    = r16 .def NL    = r17
.def DH    = r18 .def DL    = r19
.def N2H   = r20 .def N2L   = r21
.def N1H   = r22 .def N1L   = r23
.def PRODL = r0  .def PRODH = r1
.def PH    = r2  .def PL    = r3
.def D1H   = r4  .def D1L   = r5
.def RZERO = r6
.def Rx    = r7 

idivu_16x16_x:
    clr RZERO                 ;1
    ;Check that DH is not zero
    tst DH                    ;1
    brne idivu_16x16_x_dhne   ;2
    ;code for D < 256   
idivu_16x8:
    ;lookup low byte of recipricol into P.
    ;where P = min(2^16 / D,2^16-1)
    mov zl, DL               ;1
    ldi zh, high(R1L_TBL*2)  ;1 
    lpm PL, Z                ;3
    ldi zh, high(R1H_TBL*2)  ;1 
    lpm PH, Z                ;3
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;calculate N2 = (P * N) >> 16
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;     NH:NL
    ;  X  RH:RL
    ;------------------------------------------
    ;   N2H    |   N2L    |  N1H     | dropped
    ;----------+----------+----------+---------
    ;          |          | H(PL*NL) | L(PL*NL)
    ;          | H(PL*NH) | L(PL*NH) |
    ;          | H(PH*NL) | L(PH*NL) |
    ; H(PH*NH) | L(PH*NH) |          |
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 
    mul NL , PL     ;1  PL*NL
    mov N1H, PRODH  ;1  N1H <= H(PL*NL)
    mul NH , PH     ;1  PH*NH
    mov N2H, PRODH  ;1  N2H <= H(PH*NH)
    mov N2L, PRODL  ;1  N2L <= L(PH*NH)
    mul NH , PL     ;1  PL*NH
    add N1H, PRODL  ;1  N1H <= H(PL*NL) + L(PL*NH) 
    adc N2L, PRODH  ;1  N2L <= L(PH*NH) + H(PL*NH)
    adc N2H, RZERO  ;1  propagate carry to N2H      
    mul NL , PH     ;1  PH*NL
    add N1H, PRODL  ;1  N1H <= H(PL*NL) + L(PL*NH) + L(PH*NL)
    adc N2L, PRODH  ;1  N2L <= H(PH*NL) + L(PH*NH) + H(PL*NH)
    adc N2H, RZERO  ;1  propagate carry to N2H  
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;calculate P = N2 * DL ,note DH=0
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;    
    mul N2L, DL              ;1
    mov PL, PRODL            ;1
    mov PH, PRODH            ;1
    mul N2H, DL              ;1
    add PH, PRODL            ;1
    rjmp idivu_16x16_x_adj_n ;2
    ;code for D >= 256
idivu_16x16_x_dhne:          
    ;Lookup Rx = min(256 / DH, 255)     
    mov zl, DH               ;1
    ldi zh, high(R1H_TBL*2)  ;1 
    lpm Rx, Z                ;3
    ;D1 = (D * Rx) >> 8          
    mul Rx , DH              ;1
    mov D1L, PRODL           ;1
    mov D1H, PRODH           ;1
    mul Rx , DL              ;1
    add D1L, PRODH           ;1
    adc D1H, RZERO           ;1
    ;N1 = (D * Rx) >> 8          
    mul Rx , NH              ;1
    mov N1L, PRODL           ;1
    mov N1H, PRODH           ;1
    mul Rx , NL              ;1
    add N1L, PRODH           ;1
    adc N1H, RZERO           ;1
    ;if D1H = 0 then use Rx = 256, otherwise use table   
    tst D1H                  ;1
    brne idivu_16x16_x_dxhne ;2
    mov N2L, N1H             ;1
    clr N2H                  ;1
    rjmp idivu_16x16_x_checkn;2
    idivu_16x16_x_dxhne:
    ;Lookup Rx = (2 ^ 16) \ (256 + D1H)
    mov zl, D1L              ;1
    ldi zh, high(R2_TBL*2)   ;1
    lpm Rx, Z                ;3
    ;N2 = (N1 * R2) >> 16
    mul Rx  , N1H            ;1
    mov PL  , PRODL          ;1
    mov N2L , PRODH          ;1
    mul Rx , N1L             ;1
    add PL , PRODH           ;1
    adc N2L, RZERO           ;1
    clr N2H                  ;1
    idivu_16x16_x_checkn:
    ;Check result (it may be off by +/- 1)
    ;P = N2 * D
    ;NOTE:  N2 <=255 so NxH = 0, also P < 2^16 so we can discard upper byte of DH * NxL
    mul DL , N2L             ;1
    mov PL, PRODL            ;1
    mov PH, PRODH            ;1
    mul DH , N2L             ;1
    add PH , PRODL           ;1 
    brcc idivu_16x16_x_adj_n ;2
    ;if multiply overflowed then...
    ;decrement quotient
    ;calculate remainder as N - P + D
    subi N2L, 0x01           ;1
    sbci N2H, 0x00           ;1
    mov N1L, NL              ;1
    mov N1H, NH              ;1
    sub N1L, PL              ;1
    sbc N1H, PH              ;1
    add  N1L, DL             ;1
    adc  N1H, DH             ;1
    rjmp idivu_16x16_x_end   ;2
    ;Adjust result up or down by 1 if needed.
    idivu_16x16_x_adj_n:
    ;Add -P to N, with result in P
    mov N1L, NL              ;1
    mov N1H, NH              ;1
    sub N1L, PL              ;1
    sbc N1H, PH              ;1
    brsh idivu_16x16_x_pltn  ;2
    idivu_16x16_x_decn2:
    ;if P > N then decrement quotient, add to remainder
    subi N2L, 0x01           ;1
    sbci N2H, 0x00           ;1
    add  N1L, DL             ;1
    adc  N1H, DH             ;1
    rjmp idivu_16x16_x_end   ;2
    idivu_16x16_x_pltn:
    ;test remainder to D 
    cp  N1L, DL              ;1
    cpc N1H, DH              ;1
    ;if remainder < D then goto end
    brlo idivu_16x16_x_end   ;2
    ;if remainder >= D then increment quotient, reduce remainder
    subi N2L, 0xFF           ;1
    sbci N2H, 0xFF           ;1
    sub N1L, DL              ;1
    sbc N1H, DH              ;1
    idivu_16x16_x_end:
    ret
    .undef NH    .undef NL   
    .undef DH    .undef DL   
    .undef N2H   .undef N2L  
    .undef N1H   .undef N1L  
    .undef PRODL .undef PRODH
    .undef PH    .undef PL   
    .undef D1H   .undef D1L  
    .undef RZERO 
    .undef Rx

What is your question, exactly? No doubt there might be faster algorithms, but what exactly is your definition of the “fastest” algorithm? Are you asking us to optimize your code? — StarCat, Dec 14 '20 at 20:54
Pretty sure that the C operator `/` would perform fairly well. — Eugene Sh., Dec 14 '20 at 21:00
@StarCat Fastest means, "takes the least clock cycles" on this processor, and consumes less than 5K bytes of flash memory to hold the code and any associated tables. There might always be some (yet unknown) faster algorithm out there. I am just asking for people to give the fastest algorithms they know of for this processor architecture. The algorithms put out by Atmel in their application notes take either 243 cycles, or 173 cycles. I was able to improve on this and get it down to 68 cycles. If someone finds a way to save clock cycles in my existing code, that also counts as an answer. — user4574, Dec 14 '20 at 21:14
@EugeneSh. Pretty sure it wouldn't be compared to this, but is this code the absolute fastest _possible_? I doubt it - need to take a closer look! — Bruce Abbott, Dec 14 '20 at 21:16
@EugeneSh. The code generated by a C compiler is likely going to be some variant of shift and subtract type division. Therefore its probably comparable to Atmel's app note, and probably several times slower than what I already have. Having said that, I would be glad to see a disassembly listing that proves otherwise. — user4574, Dec 14 '20 at 21:21
@EugeneSh. For the purposes of this question "fastest known published algorithm" or "fastest I could come up with that's better than what's been published" counts as "fastest". — user4574, Dec 14 '20 at 21:24
@user4574 http://avr-gcc.senthilthecoder.com/ - here you can play around with the disassembly. — Eugene Sh., Dec 14 '20 at 21:27
In fact the original godbolt works too: https://gcc.godbolt.org/z/sqxcvq .. but it has a call to a builtin function, which is not helpful I guess — Eugene Sh., Dec 14 '20 at 21:29
@BruceAbbott This code is 2.5X faster than what Atmel published, but I agree, it could be faster. The lookup tables I am using don't have enough bits to directly compute the answer and require an extra step where I add +/-1 to correct the result at the end. The extra step uses about 10 cycles. Also in one branch there is a multiply that can overflow and requires an extra check which takes a few cycles. Using more bits in the lookup coefficients would eliminate the checks, but at the cost of even more cycles due to the larger multiply operations required to use them. — user4574, Dec 14 '20 at 21:32
@user4574 this is a general purpose division algorithm. I am pretty sure most code running on a small microcontroller does not need a general purpose algorithm. I.e. I would think in many if not most cases either the divisor or the dividend will be known at compile time, which will probably create opportunities to create code that’s faster than this. — StarCat, Dec 14 '20 at 21:32
@StarCat In this case I need a general division algorithm. I am using it to calculate the intersection points between shapes and the edge of the screen for graphics drawing algorithms. I will also probably be using it to do math on captured wave-forms as well. — user4574, Dec 14 '20 at 21:35
@EugeneSh. Based on your link, it looks like GCC uses a shift and subtract type algorithm on the processor. Its very small, code wise, but not that fast. — user4574, Dec 14 '20 at 21:52
@user4574 Have you attempted an unrolled, non-restoring co-routine approach using only registers and no memory reads? I suspect it will take longer. But I have to ask, anyway. — jonk, Dec 15 '20 at 04:50
@user4574 I often want to normalize inputs, prior. (This makes it very close to floating point, though.) I find it less valuable to know that 14/31813 is 0 R 14, as nothing new has been uncovered that couldn't otherwise have been learned just with a simple comparison. But I do often find the operation useful if I instead learn that the result is 59065 R 54991 E -27 D 63626 = (59065 + 54991/63626)*2^(-27). It's still an integer divide. Just preceded with normalization prior to the same division steps. — jonk, Dec 15 '20 at 10:37
(@jonk: I have tinkered with not-restoring-until-return - for "the 8-iteration path for divisors with non-zero high byte". At 7-8 cycles per bit faster than the 5 cycles per bit path for "up to 7 bit divisors" above 11 iterations, on par with 5-6 cycles per bit for "8 bit divisors". 4iteration2copy *smaller* than a similar speed 2iteration4copy restoring/non-performing one.) — greybeard, Feb 27 '21 at 09:07

score 1 · Answer 1 · answered Feb 26 '21 at 09:13

1

This question has been re-asked on Code Review@SE. From that post:

I am looking for ways to reduce either the code size, lookup table size, or number of clock cycles.

From my assessment, access to R1H_TBL(table of high bytes of reciprocals) & R2_TBL[x] is on the critical path, making it less attractive to go beyond investing a single cycle to "special case away" the upper half of R1H_TBL.
R1L_TBL is off the critical path, adding to the annoyance of not seeing how to shrink it - I invested hours on end trying to do polynomial approximation using arithmetic of an 8-bit processor.

Would you believe half the integers are even?
When \$e = 2h, h \in ℕ\$, \$\lfloor\frac{2^{16}}{e}\rfloor\$ = \$\lfloor\frac{\lfloor\frac{2^{16}}{h}\rfloor}{2}\rfloor\$: no need for entries for even numbers!
Not done working out the details (esp carry from high to low byte) or checking applicability to R2_TBL.
I stumbled upon this re-assessing unrolled shift-and-"subtract" (before: 88 cycles "+ return", after ~77, going for 70) after finding I tackled shaving off iterations the wrong way. (The more successful way is using a table for small divisors - and a single comparison deciding what is small.)

answered Feb 26 '21 at 09:13

greybeard

1,469
1
4
17

I was able to get down to 62 cycles so far. By re-arranging some registers I was able to use the MOVW instructions to move some register pairs in 1 clock cycle. The instruction set reference shows 2 cycles for MOVW, but the part data-sheet shows 1. Also generating the table into RAM allows 2-cycle table access instead of 3-cycles for LPM. This part has a lot of RAM so using 256*3 bytes of RAM is doable. – user4574 Feb 26 '21 at 19:47
(@user4574: Currently, I'm afraid the above doesn't lead to another viable reduction of table size - haven't found a way to do the necessary "scalings" fast enough. Can you recommend a presentation of Goldschmidt's division & Newton-Raphson for non-floating-point representations? (I (re-)discovered [Even/Seidel/Ferguson: *A parametric error analysis of Goldschmidt’s division algorithm*](https://www.sciencedirect.com/science/article/pii/S0022000004000960).) Is there some publicly accessible material your code is based on?) – greybeard Feb 27 '21 at 06:57
My approach to developing the algorithm was to look up division algorithms on Wikipedia to get general ideas. https://en.wikipedia.org/wiki/Division_algorithm . And to look at any recommendations from Atmel (which I liked in the original post). I don't actually know of any good papers I could recommend. Although, it should be noted that with the existing tables you get 8-bit x 8-bit division almost for free if you were writing that function. – user4574 Feb 27 '21 at 23:28

What is the fastest (in clock cycles) 16-bit x 16-bit unsigned integer division algorithm that will run on an ATMEGA1284?

1 Answers1

Linked