13

If a hardware doesn't support modulus or division operations, it takes many more CPU cycles to simulate modulus/division by software. Is there any faster way to calculate division and modulus if the operand is 10?

In my project I frequently need to calculate integer modulus 10. In particular, I'm working on PIC16F and need to show a number on an LCD. There are 4 digits to support, so there are 4 calls to modulus and division function (software implementation). That is, like the following:

digit = number % 10;    // call to an expensive function
number /= 10;           // call to an expensive function
somehow_lit_segments();

digit = number % 10;    // call to an expensive function
number /= 10;           // call to an expensive function
somehow_lit_segments();

digit = number % 10;    // call to an expensive function
number /= 10;           // call to an expensive function
somehow_lit_segments();

digit = number % 10;    // call to an expensive function
number /= 10;           // call to an expensive function
somehow_lit_segments();

There are other areas that uses similar code.

feetwet
  • 2,322
  • 9
  • 31
  • 55
Donotalo
  • 347
  • 1
  • 3
  • 7
  • Why are a few dozen calls/sec a problem? I wouldn't bother unless the project is fully functional and bug-free. – Nick T Apr 05 '11 at 21:53
  • I've noticed that if I continuously displaying some number in the main busy loop, the button response becomes slow. I.e., to detect that a button has been pressed, I have to press that button a bit longer. This happens when system clock is running 32768 Hz. – Donotalo Apr 05 '11 at 23:10
  • Are you using interrupts? Why are you using a 32kHz xtal; usually you can get lower power performance if you operate faster and go to sleep when idle. – Nick T Apr 05 '11 at 23:13
  • i'm using interrupts. but just to update display it doesn't worth to switch to high speed oscillation. power-wise. for my project. it has to be run low speed clock nearly 90% of its life time. – Donotalo Apr 05 '11 at 23:20
  • 2
    As general note, the book **Hacker's Delight** by Henry S. Warren, Jr. is *the* source for clever bit-twiddling trickery. I looked for division suggestions, and it doesn't have anything for divide by 10 that is superior to any of the answers below. – RBerteig Apr 06 '11 at 01:39
  • Hacker's Delight is a great resource, also just fun to read. (Note: doesn't actually have anything to do with hacking.) – tcrosley Nov 26 '13 at 18:40
  • Of course it has to do with hacking! I don't see the point in pretending that hacking is cracking. http://www.catb.org/jargon/html/H/hack.html – Yann Vernier May 30 '14 at 10:57

10 Answers10

12

Heres a binary to BCD algorithm I used several years ago based on one found here. I was using an external BCD to 7 seg display driver so the result could be written to the proper ports directly as packed BCD for output.

This is fairly fast if you have a hardware multiplier in the PIC, I was using a PIC18F97J60. If you don't have a hardware multiplier on your PIC, consider using shift + add for the multiplication.

This takes in an unsigned 16-bit int and returns packed BCD with 5 digits, it could be modified and made faster for 4 digits. It uses shift + additions to approximate division by 10 but given the limited input range it is exact for this use. You may want to pack the result differently as well to line up with how your using the result.

void intToPackedBCD( uint16_t n, uint8_t *digits ) {
    
    uint8_t d4, d3, d2, d1, d0, q;  //d4 MSD, d0 LSD

    d1 = (n>>4)  & 0xF;
    d2 = (n>>8)  & 0xF;
    d3 = (n>>12) & 0xF;

    d0 = 6*(d3 + d2 + d1) + (n & 0xF);
    q = (d0 * 0xCD) >> 11;
    d0 = d0 - 10*q;

    d1 = q + 9*d3 + 5*d2 + d1;
    q = (d1 * 0xCD) >> 11;
    d1 = d1 - 10*q;

    d2 = q + 2*d2;
    q = (d2 * 0x1A) >> 8;
    d2 = d2 - 10*q;

    d3 = q + 4*d3;
    d4 = (d3 * 0x1A) >> 8;
    d3 = d3 - 10*d4;

    digits[0] = (d4<<4) | (d3);
    digits[1] = (d2<<4) | (d1);
    digits[2] = (d0<<4);
}
phuclv
  • 458
  • 4
  • 19
Mark
  • 11,627
  • 1
  • 31
  • 38
  • great link, thanks! it doesn't only optimizes the speed, it also decreases the code size. I've implemented "12 bit binary to 4 ASCII Decimal Digits" from your link because that doesn't involve any multiplication. – Donotalo Apr 06 '11 at 18:13
8

You can convert from binary to packed BCD without any division using double dabble algorithm. It uses only shift and add 3.

For example convert 24310 = 111100112 to binary

0000 0000 0000   11110011   Initialization
0000 0000 0001   11100110   Shift
0000 0000 0011   11001100   Shift
0000 0000 0111   10011000   Shift
0000 0000 1010   10011000   Add 3 to ONES, since it was 7
0000 0001 0101   00110000   Shift
0000 0001 1000   00110000   Add 3 to ONES, since it was 5
0000 0011 0000   01100000   Shift
0000 0110 0000   11000000   Shift
0000 1001 0000   11000000   Add 3 to TENS, since it was 6
0001 0010 0001   10000000   Shift
0010 0100 0011   00000000   Shift
   2    4    3
       BCD

This algorithm is very efficient when there's no hardware divisor available. More over only left shift by 1 is used, so it's fast even when a barrel shifter is not available

phuclv
  • 458
  • 4
  • 19
8

Assuming unsigned integers, division and multiplication can be formed from bit shifts. And from (integer) division and multiplication, modulo can be derived.

To multiply by 10:

y = (x << 3) + (x << 1);

To divide by 10 is more difficult. I know of several division algorithms. If I recall correctly, there is a way to divide by 10 quickly using bit shifts and subtraction, but I can't remember the exact method. If that's not true, then this is a divide algorithm which manages <130 cycles. I'm not sure what micro you're using, but you can use that in some way, even if you have to port it.

EDIT: Somebody says over at Stack Overflow, if you can tolerate a bit of error and have a large temporary register, this will work:

temp = (ms * 205) >> 11;  // 205/2048 is nearly the same as /10

Assuming you have division and multiplication, modulo is simple:

mod = x - ((x / z) * z)
Thomas O
  • 31,546
  • 57
  • 182
  • 320
4

Depending on the amount of digits you need you may be able to use the brute force method (d - input number, t - output ASCII string):

t--;
if (d >= 1000) t++; *t = '0'; while (d >= 1000) { d -= 1000; *t += 1; }
if (d >= 100) t++; *t = '0'; while (d >= 100) { d -= 100; *t += 1;}
if (d >= 10) t++; *t = '0'; while (d >= 10) { d -= 10; *t += 1;}
t++; *t = '0' + d;

You can also change the multiple ifs into a loop, with powers of ten obtained by multiplication or a lookup table.

jpc
  • 5,302
  • 1
  • 24
  • 45
2

Application note AVR204 describes algorithms for BCD arithmetic, including conversion from binary to BCD and vice versa. The appnote is by Atmel, which is AVR, but the described algorithms are processor-independent.

greybeard
  • 1,469
  • 1
  • 4
  • 17
stevenvh
  • 145,145
  • 21
  • 455
  • 667
1

I don't have a good answer, but there's a great discussion on our sister site Stack Overflow on the exact same subject of division and modulo optimization.

Do you have enough memory to implement a lookup table?

Hackers Delight has a paper on optimal division algorithms.

Adam Lawrence
  • 32,921
  • 3
  • 58
  • 110
1

Have you considered holding that value as BCD all the time (using simple special "BCD increment" and "BCD add" subroutines), rather than holding that value in binary form and converting to BCD as needed (using a more difficult to understand "convert from binary to BCD" subroutine)?

At one time, all computers stored all data as decimal digits (ten-position gears, two-out-of-five code vacuum tubes, BCD, etc.), and that legacy still lingers on today. (see Why do real-time clock chips use BCD ).

davidcary
  • 17,426
  • 11
  • 66
  • 115
  • The number to be displayed on the LCD is a variable, ranging from -1999 to 1999. It indicates a temperature and is calculated in binary format. – Donotalo Apr 06 '11 at 14:06
1

The PICList is an amazing resource for people programming PIC processors.

BCD conversion

Have you considered using an off-the-shelf tried-and-tested binary-to-BCD subroutine specifically optimized for the PIC16F?

In particular, people on the PICList have spent a lot of time optimizing binary-to-BCD conversions on a PIC16F. Those routines (each one hand-optimized for a specific size) are summarized at "PIC Microcontoller Radix Conversion Math Methods" http://www.piclist.com/techref/microchip/math/radix/index.htm

integer division and mod

On a CPU like the PIC16F, a subroutine specialized to divide by a constant is often much faster than a general-purpose "divide variable A by variable B" routine. You may want to put your constant (in this case, "0.1") in the "Code Generation for Constant Multiplication/Division" http://www.piclist.com/techref/piclist/codegen/constdivmul.htm or check out the canned routines near http://www.piclist.com/techref/microchip/math/basic.htm .

davidcary
  • 17,426
  • 11
  • 66
  • 115
1

Given an 8x8 hardware multiply, one can compute a divmod-10 of an arbitrary-size number by using a routine which computes it for 12-bit number in the range 0-2559 via the procedure:

  1. Assume original number in OrigH:OrigL
  2. Divide the original number by two and store that in TempH:TempL
  3. Add the MSB of TempL*51 to the LSB of TempH*51. That is the approximate quotient
  4. Multiply the approximate quotient by 10, discarding the MSB of the value.
  5. Subtract the LSB of that result from the LSB of the original number.
  6. If that value is 10 or greater (max will be 19), subtract 10 and add 1 to the approximate quotient

I would suggest writing a divmod routine which the MSB of the number will be in W, and the LSB pointed to by FSR; the routine should store the quotient in FSR with post-decrement and leave the remainder in W. To divide a 32-bit long by 10, one would then use something like:

  movlw 0
  lfsr 0,_number + 3   ; Point to MSB
  call _divmod10_step
  call _divmod10_step
  call _divmod10_step
  call _divmod10_step

A divmod-6 step would be very similar, except using constants of 85 and 6 rather than 51 and 10. In either case, I would expect the divmod10_step would be 20 cycles (plus four for the call/return) so a short divmod10 would be about 50 cycles and a long divmod10 would be about 100 (if one special-cases the first step, one could save a few cycles).

supercat
  • 45,939
  • 2
  • 84
  • 143
1

this may not be the fastest but is a simple way.

 a = 65535;

    l = 0;
    m = 0;
    n = 0;
    o = 0;
    p = 0;

    while (a >= 10000)
    {   a -= 10000;
        l += 1;
    }
     while (a >= 1000)
    {   a -= 1000;
        m += 1;
    }
     while (a >= 100)
    {   a -= 100;
        n += 1;
    }
     while (a >= 10)
    {   a -= 10;
        o += 1;
    }
     while (a > 0)
    {   a -= 1;
        p += 1;
    }
sergiu
  • 21
  • 3