Why does modulo operation consume more power?

Question

I mainly use Cortex-M4F or Cortex-M0/0+ devices such as:

STM32L4, G4
STM32L0, G0

I sometimes see blogs like this one saying "Avoid Modulo" for lower power consumption.

Comparing these two programs, is this a proper way to avoid the modulo operation?

Assume the MCU supply voltage could be either 1.8 V or 3.3 V.

while(1) { // CODE X, I thought removing conditional statements could perform better
    my_index++;
    if (my_index >= 8) my_index = 0;
}

VS

while(1) { // CODE Y
    my_index = (my_index + 1) % 8;
}

Also, why does the modulo operation consume more power?

You should compare the assembler code generated for both versions. — Uwe, Mar 13 '22 at 18:38
Here is the assembly for the unsigned modulo operation from the "standard library" that LLVM (a compiler backend) needs to create ARM assembly. You can see it's quite a few more instructions than an if statement. https://code.woboq.org/llvm/compiler-rt/lib/builtins/arm/aeabi_uidivmod.S.html — Miron, Mar 13 '22 at 19:40
"modulo 8" will not in practice be implemented as a modulo operation (by any decent compiler, anyway). There are easier and more power efficient implementations for any "modulo 2^n". — , Mar 13 '22 at 20:26
The comparison OP uses is not a good one because a more "efficient" operation will be executed more times per second due to the while loop, thus drawing more power. — abligh, Mar 14 '22 at 09:23
Efficient code improves power consumption when it increases the opportunity for the CPU to enter a sleep state; there is exactly zero opportunity for that in either of the given examples. So unless the instructions in one example use more energy *per clock cycle*, both examples will use the same amount of power, and the second one will use less time (and therefore less energy) per iteration. — Rodney, Mar 14 '22 at 13:13
If a system will wake up when necessary to do work, and go back to sleep once it's done until there's more work to be done, then the amount of time required to perform a chunk of work will strongly correlate with the amount of energy required to perform that chunk of work, even if the code actually doing the work is agnostic to the notion of energy consumption. — supercat, Mar 14 '22 at 22:05

TimWescott · Accepted Answer · 2022-03-15T16:55:37.253

That set of rules makes sense, sort of. But they're limited.

Specifically, the blanket rule "don't use modulo" is somewhat misguided, and should really mean "avoid the use of code that results in a divide operation".

(My set of rules would be to understand how the hardware works, compile to assembly and inspect the result, profile your code, and benchmark, benchmark, benchmark).

If you had a line of code that said a = b % c; then (assuming that a, b, and c are integers) you're specifying to the compiler that c could be any integer value. It would have to compile in a divide operation. Divide operations take lots of either time or logic area; in either case that translates to energy consumed to perform a divide.

In your specific case, where you say my_index = (my_index + 1) % 8; then the compiler -- even with optimizations set at their lowest level -- will probably turn that into the machine language equivalent of my_index = (my_index + 1) & 0x0007;. In other words, it won't divide (way expensive), it won't even branch (less expensive), but it'll mask (least expensive on most processors today). But this will only happen if the modulo is by a power of two.

You could ensure that by just using my_index = (my_index + 1) & 0x0007;, at the cost of code comprehension and thus code maintainability. Comment it well if you go down that road.
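One way to keep the mask form safe is to derive the mask from the size and let the compiler reject a size that is not a power of two. This is only a minimal sketch, assuming a C11 compiler; `RING_SIZE`, `RING_MASK`, and `ring_next` are hypothetical names that do not appear in the question.

```c
#include <assert.h>   /* static_assert (C11) */

#define RING_SIZE 8u                 /* hypothetical buffer size */
#define RING_MASK (RING_SIZE - 1u)   /* only meaningful for powers of two */

/* Refuse to compile if someone later changes RING_SIZE to e.g. 10. */
static_assert(RING_SIZE != 0u && (RING_SIZE & RING_MASK) == 0u,
              "RING_SIZE must be a power of two");

/* Advance an unsigned index and wrap it with a mask instead of a modulo.
 * For a power-of-two size this equals (index + 1u) % RING_SIZE. */
static unsigned int ring_next(unsigned int index)
{
    return (index + 1u) & RING_MASK;
}
```

With this in place, `my_index = ring_next(my_index);` steps through 0..7 and wraps to 0, and a later change to a non power of two fails at compile time instead of silently breaking the wrap.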

So in your specific case, as long as that % 8 doesn't change, or only ever changes to % N where N is always \$2^n\$, and the compiler knows that, the speed won't change. But if you or someone else comes along later and changes it to my_index = (my_index + 1) % 17; (or any other non power of 2) then suddenly your code will have a divide operation and it'll be more power-hungry. In that case, using the conditional statement will be less expensive.
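For a wrap value that is not a power of two, the conditional form produces the same sequence as the modulo without asking for a divide. A minimal sketch, assuming the index is unsigned and already in the range 0..limit-1; `wrap_increment` is a hypothetical helper name.

```c
/* Increment index and wrap to 0 at limit, using a compare instead of %.
 * Assumes limit > 0 and index < limit on entry. */
static unsigned int wrap_increment(unsigned int index, unsigned int limit)
{
    index++;
    if (index >= limit) {
        index = 0u;
    }
    return index;
}
```

`my_index = wrap_increment(my_index, 17u);` gives the same values as `(my_index + 1) % 17`. Whether it is actually cheaper on a given core and compiler is still something to confirm in the generated assembly.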

(In C/C++, you make sure that the compiler knows the value of a constant ahead of time by using a #define statement, or (depending on the optimizer) declaring it const unsigned int or (stronger, if it's C++) declaring it constexpr int. Other compiled languages (e.g. Rust) have their own ways of making this happen.)

Note: I wouldn't be surprised if a good optimizing compiler turned the 'if' construct into a mask -- but I wouldn't be surprised if it didn't, either. Ditto, a really good optimizing compiler might see the my_index = (my_index + 1) % 17; and infer the conditional construct. I don't think I'd count on that without looking at the assembly, and I don't think I'd trust it 100% -- I might use it, but I'd put a comment in the code about crossing my fingers and hoping the compiler plays nice.

Unless you're absolutely backed against the wall for power consumption, you should also be thinking about code readability and fragility. Someone will come along later and need to understand that code, and will appreciate it if it's not a minefield full of opportunities for screwing it up. That someone may be future-you, so be nice!

edited Mar 15 '22 at 16:55

answered Mar 13 '22 at 17:24


If `my_index` is a signed type, `%8` will be more expensive than `&7`, unless the compiler manages to prove that the value is non-negative at that point. (Because the asm will have to produce a negative remainder for negative `my_index`.) – Peter Cordes Mar 14 '22 at 03:44
I'm not really sure the `%8` form is any more "readable" than the `&7` form and in either case the `if (my_index == 8)` is arguably more self-documenting. One gotcha with using & is if you had a macro `my_index = (my_index + 1) & (INDEX_MAX -1)` you could be making the assumption `INDEX_MAX` must be a power of two but someone down the line might not share that assumption, then they set it to 10 and break your code. – Rodney Mar 14 '22 at 13:20
Another thing to note - ARM can potentially do conditional code without a branch, but not if the instruction decoder is limited to the thumb instruction set (and I think those part numbers do indeed have that limitation) – Rodney Mar 14 '22 at 13:27
@Rodney: My preference is to have two adjacent defines, for e.g. `SER_RXQUEUE_SIZE` and `SER_RXQUEUE_MASK`, with the latter being defined just below the former, as `(SER_RXQUEUE_SIZE-1)`. One could then use a `#if` to squawk if `(SER_RXQUEUE_SIZE & SER_RXQUEUE_MASK)` is non-zero. – supercat Mar 14 '22 at 15:08
@Rodney: Cortex-M4 doesn't have full Thumb, but does support `it` / `ite` (to predicate up to 4 later instructions). https://godbolt.org/z/shqaGW6d5 . You're right that M0 doesn't support it, or at least compilers choose not to use it for that test. IIRC, the predicate is evaluated separately for each predicated instruction, so flag-setting instructions like `cmp` inside one `it` block work just like in ARM mode, and so the `it` itself can't necessarily be handled as just a jump over the false part. There is a restriction on a predicated branch being the last insn in an `it` block, though. – Peter Cordes Mar 15 '22 at 02:11
@PeterCordes good point about signed and the %8. It may not _always_ be more expensive, because the compiler writer has some leeway in how they implement it so they _could_ make it so that `(a % 8) == (a & 0x0007)` is always true. Usually it'll be whatever the processor does under the hood. – TimWescott Mar 15 '22 at 05:31
@Rodney yes. I'm assuming that if you're trying to save every erg by trimming out clock ticks, you may be willing to make the code a bit less readable. I touch on that in my answer, but basically, if you need to optimize that much, it may be worth having two lines of comments and one really strange line of code, just to get it working right. – TimWescott Mar 15 '22 at 05:33
@TimWescott: In contexts where you're branching or booleanizing on 0/non-zero, `int % 8` can be optimized to just an AND or TEST / TST instruction, unless you defeat it with `a%2 == 1` (And real compilers do this https://godbolt.org/z/xPnjPWq7f). The example in the question was not one of those cases. In ISO C99 and later, `a%8` is fully defined to be a remainder, not modulo, to go with `/` being fully defined as truncating towards zero, not like Python. (And yes, this is what all modern mainstream CPUs with HW division at all do under the hood) IIRC, C89 did allow the choice you suggest. – Peter Cordes Mar 15 '22 at 05:47
@PeterCordes Argh. I'm a wee bit behind in my reading! I hadn't meant that one would actually be doing that test, just trying to illustrate my (apparently obsolete) point. – TimWescott Mar 15 '22 at 05:51
`…or only ever changes to % N where N is always 2^n…` **and** the compiler knows it (it’s a compile-time constant, not a variable passed to a function, for instance) – jcaron Mar 15 '22 at 11:34
@jcaron thanks. I've incorporated that into my answer. – TimWescott Mar 15 '22 at 16:19
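
The signed-versus-unsigned point raised in these comments can be checked directly. A minimal sketch, assuming C99 or later (where `%` truncates toward zero) and a two's complement `int`; the function names are hypothetical.

```c
/* Remainder form: for negative inputs the result is negative or zero. */
static int wrap_rem(int value)
{
    return value % 8;
}

/* Mask form: the result is always in the range 0..7. */
static int wrap_mask(int value)
{
    return value & 7;
}
```

For non-negative inputs the two agree: `wrap_rem(9)` and `wrap_mask(9)` are both 1. For -1 the remainder form gives -1 while the mask form gives 7, which is why a compiler cannot replace a signed `% 8` with a single AND unless it can prove the value is non-negative.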

Lundin · Answer 2 · 2022-03-15T14:10:51.340

First of all, in case it isn't obvious: longer execution time means more power consumption. Though if you are mostly interested in reduced power consumption, look at the system clock before anything else.

Why does modulo operation consume more power?

It doesn't on a modern compiler. Division and modulo are heavy CPU operations for most cores, but C code generating actual div etc instructions was mostly happening up until some 15-20 years back. Modern compilers will pick the best code when you enable optimizations and avoid division when it can be avoided. Also division is a bigger problem performance-wise on low-end MCUs like 8 and 16 bitters.

However, it should be mentioned that it's somewhat common practice in embedded systems to run with all optimizations disabled. Mostly because compiler optimizers of various mediocre-quality embedded compilers rightfully built up a nasty reputation of being buggy, back in the 90s and early 2000s.

If you run with optimizations disabled, then you are of course on your own and have to perform all optimizations manually - which is definitely not recommended practice unless you have in-depth knowledge about both C and the target CPU.

Let's disassemble your particular code examples in gcc-arm-none-eabi, with -O3. I made these stand-alone examples:

void func1 (void)
{
  static unsigned int my_index;
  while(1) 
  { 
    my_index++;
    if (my_index >= 8) my_index = 0;

    volatile unsigned int out = my_index;
  }
}

void func2 (void)
{
  static unsigned int my_index;
  while(1) 
  { 
    my_index = (my_index + 1) % 8;
    volatile unsigned int out = my_index;
  }
}

The volatile variables are needed as a side effect, to block the optimizer from removing the code entirely. Now, disassembling this using Godbolt https://godbolt.org/z/bM5M5v38h, we get nearly identical machine code. No division in sight. The version with addition is actually performing ever so slightly worse because of the cmp instruction (branch).

I thought removing conditional statements could perform better

Yes, generally, and in your case it actually does, though it's a micro-optimization. Cortex-M cores in general have neither advanced branch prediction nor cache memories. On an M0 it's not worth the headache to even consider. I believe some STM32x4 have some hardware support for a simple form of branch predication. Higher-end M7 etc. will have cache, and then avoiding branches matters more.

In general:

You should strive to write code that is as readable as possible. Then optimize when there are actual performance bottlenecks in your code. Manual optimization is highly qualified work and requires lots of experience.

In this particular case I'd say the addition/counter version is more readable so I would use that regardless of a few CPU ticks more or less.

As for the blog you linked, the author is not a complete rookie and makes some good points, but there are some strange and even misguided things mentioned. Let me comment on that bullet list you got the modulo comment from:

Use the “Static Const” Class as much as possible to prevent runtime copying of arrays, structures etc. that consumes power.

I have no idea what a "Static Const" Class is supposed to mean. C is notably case-sensitive and doesn't have a class keyword. I assume the author doesn't know proper C terminology and actually means to say something like: Use static storage class specifiers and const correctness whenever possible. If that's what they meant to say, that's generally good advice.

Use Pointers. They are probably the most difficult part of the C language to understand for beginners but they are the best for accessing structures and unions efficiently.

It's kind of like telling the construction worker to use concrete... it's mandatory, not an option. Pointers are a fundamental building block of C.

Avoid Modulo!

Not really good advice, as proven above.

Local variables over global variables where possible. Local variables are contained in the CPU while global variables are stored in the RAM, the CPU accesses local variables faster.

Generally correct, although file scope variables ("globals") may get temporarily stored in registers too. The main reason to avoid truly global (external linkage) variables is program design, not performance.

Also, the difference between register and RAM access is not that big on most MCUs; this comment mostly applies to high-end CPUs like x86, Cortex A, Power PC etc. When manually optimizing memory access for mid-range MCUs like Cortex M you should rather consider flash vs RAM, since flash often has wait states.

However, reducing the scope of variables is always a good thing for readability, to minimize bugs and to reduce namespace clutter.

Unsigned data types are your best friend where possible.

True, though not because of performance but because of implicit conversions and poorly-defined behavior of signed/negative operands when used in bitwise operations.

Adopt “countdown” for loops where possible.

OK, this is even worse dinosaur advice than the modulo one. The rationale for this is well known: a compare against zero is faster than a compare against a value. But compilers have been able to do that optimization for ages! Do not write down-counting loops; that's simply obfuscation for nothing gained. This was valid advice around 1993, not in 2022.

Instead of bit fields for unsigned integers, use bit masks.

Good advice, though again not because of performance but because of portability and poorly-defined behavior.
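
The bit-mask style recommended in the last item can look like the sketch below. The register layout (`CTRL_ENABLE`, `CTRL_MODE_POS`, `CTRL_MODE_MASK`) is made up for illustration and does not correspond to any real peripheral; `ctrl_set_mode` is likewise a hypothetical helper.

```c
#include <stdint.h>

#define CTRL_ENABLE     ((uint32_t)1u << 0)              /* hypothetical enable bit */
#define CTRL_MODE_POS   4u                               /* hypothetical 2-bit field at bits 5:4 */
#define CTRL_MODE_MASK  ((uint32_t)3u << CTRL_MODE_POS)

/* Return reg with the mode field replaced by mode (0..3); other bits unchanged. */
static uint32_t ctrl_set_mode(uint32_t reg, uint32_t mode)
{
    reg &= ~CTRL_MODE_MASK;                            /* clear the field */
    reg |= (mode << CTRL_MODE_POS) & CTRL_MODE_MASK;   /* insert the new value */
    return reg;
}
```

For example, `ctrl_set_mode(CTRL_ENABLE, 2u)` yields 0x21. Unlike a bit field, the position of every bit is spelled out in the source rather than left to the compiler's layout rules.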

Overall the quality of the blog post is mixed: very sound advice sits next to plain bad advice. I would stop reading that blog. Please note how often in this answer I have to go back 20-30 years in time when commenting on it.

The "poorly defined behavior of signed operands" is another relic of the previous century. C99 fixed that. I agree that the embedded world suffers a lot from outdated stories. There's a good reason why GCC/ARM has become such a staple of embedded development, but that has raised the bar for the competition. — MSalters, Mar 15 '22 at 12:42
Re “_Static Const_”: In C (but not in C++) a global `const` variable has by default external linkage. Thus, unless you perform link-time optimization, it will be allocated in RAM. A global `static const` has internal linkage, and the compiler has a chance to optimize away its storage. — Edgar Bonet, Mar 15 '22 at 12:46
@MSalters No, by that I mean weaknesses in the C language itself. Left-shifting a signed number into the sign bit gives undefined behavior in C17. Right-shifting a negative number gives implementation-defined behavior. Bitwise complement `~` of a small integer number can turn unsigned numbers signed and negative. Signed arithmetic overflow is undefined behavior in C but well-defined in the CPU ISA. And so on. — Lundin, Mar 15 '22 at 12:47
@EdgarBonet Yes, no, maybe :) I've seen linkers do all manner of allocations when faced with something that only got `const` qualifier at file scope. Could be RAM, could be ROM, or it could be some Harvard architecture weirdness linking in case that applies. Adding `static` is good practice for sure, but no guarantees. Always check the map file to be sure. — Lundin, Mar 15 '22 at 12:51
In the old world odor, if I did mod 8 by masking and included a comment "This computes the number modulo 8", my boss would insist that the comment be removed. Really, that's how it was! — richard1941, Mar 18 '22 at 03:55
@richard1941 Well that's weird. I'd always leave a comment like that, even when doing something far more obvious like `x >>= 1; // divide by 2`. — Lundin, Mar 18 '22 at 07:31
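
The integer-promotion trap with `~` mentioned in the comments above is easy to reproduce. A minimal sketch, assuming a two's complement platform where `int` is wider than 8 bits; the function names are hypothetical.

```c
#include <stdint.h>

/* The operand is promoted to int before ~ is applied, so the result
 * is a negative int even though the operand was unsigned. */
static int complement_promoted(uint8_t value)
{
    return ~value;
}

/* Casting back to uint8_t keeps the result in the 8-bit unsigned range. */
static uint8_t complement_u8(uint8_t value)
{
    return (uint8_t)~value;
}
```

With an input of 0x0F, the first function returns -16 and the second returns 0xF0, so a comparison such as `~x == 0xF0` on a `uint8_t x` is false without the cast.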
