9

I know people say code optimization should only bring out hidden bugs in a program, but hear me out. I am stuck on a screen, waiting until some input arrives via an interrupt.

Here is what I see in the debugger. Notice the line inspected and the expression value intercepted.

Code in image:

//...
ui::Context::addEventListener(ui::EventType::JOYSTICK_DOWN, &constant_dynamic_handler);

while (true) {
    if (choice != 0) //debugger pause
        break;
}

ui::Context::removeEventListener(ui::EventType::JOYSTICK_DOWN, &constant_dynamic_handler);

if (choice == 1) goto constant;
else if (choice == 2) goto dynamic;
else if (choice == 3) goto reset;
else if (choice == 4) goto exit;
//...

//debugger view: 
//expression: choice
//value: 1

The constant_dynamic_handler is a lambda function, declared earlier, that just changes choice to some integer other than 0. The fact that I can pause inside the loop means the loop is not exited, yet the value has in fact changed. I cannot step over a single line in the debugger: it fails to read memory on the CPU and requires a restart before I can debug again.

choice is declared simply in the same scope as the if-statement block, as int choice = 0;. It is altered only within an interrupt listener triggered by a hardware input.

The program works with the -O0 flag, but not with -O1 or -O2.

I'm using an NXP K60 with C++11, if that matters. Is it my problem? Is there anything I'm not aware of? I am a beginner at MCU programming, and I assumed this code would work on a desktop (edit: just tried it, and it doesn't work there either).

Daniel Cheung
    Have you compiled it with optimizations on your desktop? – Arsenal Jul 23 '18 at 08:08
    I wouldn't expect this to work on a desktop system either. C compilers are allowed to read a variable once, and then assume it doesn't change, unless it's declared volatile. Every compiler I've used in the last 20 years performs this optimisation. – Jules Jul 23 '18 at 08:18
    Others have already pointed out the reason why this doesn't work with `-O2`. I'd also recommend (as a matter of style) simplifying the loop - why loop forever, and then break when a condition is met, rather than just doing `while (choice == 0) {}` ? – psmears Jul 23 '18 at 09:52
    Post code, not images of code. – Lorraine Jul 23 '18 at 10:30
    How is `choice` declared? – Lorraine Jul 23 '18 at 10:30
  • @psmears I had that before; I changed it to this because that didn't work. – Daniel Cheung Jul 23 '18 at 10:39
  • @Wilson The image I provided only aimed to prove my point with what I saw during debugging. – Daniel Cheung Jul 23 '18 at 10:40
  • What version of C++ are you using? C++11 introduced some new assumptions about loops like `while(true) { /*...*/ }`, which might cause additional headaches... (See [this part](https://youtu.be/g7entxbQOCc?t=9m56s) of a talk for a relevant example.) – hoffmale Jul 23 '18 at 10:47
  • @hoffmale C++11 indeed. – Daniel Cheung Jul 23 '18 at 10:48
  • @Jules You're right. I never did try that out. And now I know it also doesn't work on desktop. – Daniel Cheung Jul 23 '18 at 10:52
  • Not related to the specific question, but "goto considered harmful" (google it). If you're expecting anyone else to look at this code, you'd better have a seriously good reason for those gotos, else you should rewrite that code. For multiple conditions on the same variable, a switch statement is more elegant (and faster if the compiler doesn't realise it can turn the if statement into a switch statement at optimisation). Also your while loop could just be written "while (choice == 0);". – Graham Jul 23 '18 at 17:16
  • @Graham The reason I used `goto` here is because the previous version didn't work. I was exhausting all possible ways to write the code that has the same effect. – Daniel Cheung Jul 23 '18 at 18:59
  • (1) While the image itself may be helpful, it's more helpful to post the code itself so people can test it. See https://meta.stackoverflow.com/q/303812 . (2) Please post a [Minimal Complete Verifiable Example](https://stackoverflow.com/help/mcve), which includes the declaration of `choice`, so you won't receive comments such as [this one](https://electronics.stackexchange.com/questions/387181#comment942979_387181). It's entirely possible for the bug to lie outside the code you've shown; although the program appears to start behaving wrongly there, you may have invoked undefined behavior long before. –  Jul 24 '18 at 05:26
  • This would've been much better asked on stack overflow - where you have a lot more dedicated c++ people. – UKMonkey Jul 24 '18 at 09:19
    Your question discriminates against the blind and hard of sight. Please post _actual code_ in _text format_. – Lightness Races in Orbit Jul 24 '18 at 09:22
  • Note that debuggers and optimizers don't always mix. If you want to know what's going on in the code, you should really be looking at the disassembly. A good IDE will let you step through the assembly and see e.g. what registers are being compared and what addresses are being loaded. If the optimizer produced an infinite loop, you'll find a branch instruction that points to its own address. – Adam Haun Jul 27 '18 at 16:23

2 Answers

55

The code optimizer has analyzed the code and from what it can see the value of choice will never change. And since it will never change, there's no point in checking it in the first place.

The fix is to declare the variable volatile so that the compiler is forced to emit code that checks its value regardless of the optimization level used.
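A minimal sketch of the fix, assuming `choice` is declared as described in the question (the function name is made up for illustration):

```cpp
// `volatile` forces the compiler to perform a fresh load of `choice`
// on every loop iteration, because the value may change outside
// normal control flow (here, in the interrupt handler).
volatile int choice = 0;

void wait_for_choice() {
    while (choice == 0) {
        // busy-wait: each iteration re-reads `choice` from memory
    }
}
```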

Ignacio Vazquez-Abrams
  • I once ran across an embedded compiler that **ignored** volatile when you turned the optimiser on.... Oh how we laughed, it took ages to find and we wound up grovelling through the assembly output. All variables that are modified outside the normal control flow should be declared volatile, and yes, "volatile const" is a thing! – Dan Mills Jul 23 '18 at 09:58
  • In modern C++, `std::atomic choice` would be good for communication between an interrupt handler and other code. Use `choice.store(value, std::memory_order_relaxed)`, and in this loop `uint8_t tmp; while(0 == (tmp=choice.load(std::memory_order_relaxed))) {}` would be good. (And probably compile to the same asm as `volatile`) – Peter Cordes Jul 23 '18 at 14:06
    @PeterCordes: using `std::atomic` will very likely produce **different** assembly compared to `volatile` (unless you're using that weird MSVC extension, which IIRC only works for x86, x64 and possibly ARM). Atomics need to update the value atomically, i.e. no observer should be able to see any intermediate state. OTOH `volatile` only says "this value might have changed since you last read it", which is much less restrictive. Also, on some platforms there are special instructions for some special volatile values, e.g. memory mapped registers. – hoffmale Jul 23 '18 at 18:28
    @hoffmale: You're talking about atomic RMW operations like `.fetch_add` (or `choice++`). Yes of course that compiles differently than `volatile`, to x86 `lock add` for example ([Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850)). But pure-load and pure-store are already atomic for 8-bit integers on all ISAs I'm aware of (except early DEC Alpha which only has word-sized loads/stores). (e.g. x86, [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881)). – Peter Cordes Jul 23 '18 at 18:43
  • @hoffmale: just for fun, I looked at gcc output for x86 and MSP430 on Godbolt, and the asm is in fact the same for volatile or atomic for the writer and the spin-wait: https://godbolt.org/g/4MbgNN. Good point that volatile can imply special instructions, that's exactly what you *don't* want for communication between an interrupt handler and regular code, and a good reason to choose `atomic`. If you ever want to increment, you can write it as `tmp = choice; tmp++; choice=tmp;`, which would be a allowed to compile to `add byte [choice], 1` on x86 (without a `lock` prefix). – Peter Cordes Jul 23 '18 at 18:45
  • I am curious at how the compiler sees the code. The `choice` variable was *referenced* within the lambda which clearly causes a side-effect. Wouldn't that prompt the compiler not to optimize away the variable, as the variable might change? – Daniel Cheung Jul 23 '18 at 19:02
    @DanielCheung: The "problem" with that reference is that it doesn't modify the value inside the loop (it happens concurrently in an interrupt handler) - and there are no constructs (like `volatile` or synchronization primitives) to hint that `choice` might be changed elsewhere. – hoffmale Jul 23 '18 at 19:15
    @DanielCheung: **A data race on a non-`atomic` variable is Undefined Behaviour in C++11** (concurrent read+write), which is why the compiler is allowed to convert `while(!choice){}` into `if(!choice) infloop();`, i.e. to hoist the load out of the loop. Lots of code repeatedly references the same global variable within a function, and forcing C++11 compilers to de-optimize it would suck a lot. – Peter Cordes Jul 23 '18 at 20:27
  • @PeterCordes An infinite loop with no side effects is Undefined Behaviour. https://en.cppreference.com/w/cpp/language/ub As such the compiler is allowed to optimise further if it wants, and just simply remove it entirely. – UKMonkey Jul 24 '18 at 09:18
  • @PeterCordes No, that's not correct. Being non-atomic is not the reason the compiler is allowed to hoist the load out of the loop. It's a lot simpler, as UKMonkey has explained. – iheanyi Jul 24 '18 at 13:55
    @iheanyi: UKMonkey is correct, compilers are *allowed* to do that, but that's not what gcc does in practice. GCC does the transformation I described: converting it into check once then infinite-loop. Data race UB is what enables that transformation, and would still enable it even if there was a `volatile` access to something else inside the loop. (Or something like `printf`, if it could prove that `printf` couldn't change the value of `choice`. e.g. with `static int choice`.) IMO data race UB is the more important thing to understand here because it explains a whole range of optimizations. – Peter Cordes Jul 24 '18 at 14:07
  • @iheanyi: gcc chooses not to optimize away infinite loop like `while(42)` or `while(u++ <= UINT_MAX)`. Some compilers do optimize based on infloop UB (at least sometimes? I tried to create an example but failed: https://godbolt.org/g/KofYh6). But I think last time I saw this, the compiler (maybe MSVC?) did preserve obviously-intentional infinite loops like `while(42){}` even if other infloops disappeared. – Peter Cordes Jul 24 '18 at 14:16
  • @hoffmale: I added an answer to clear up all the stuff people have been replying to me about in comments. – Peter Cordes Jul 24 '18 at 18:57
    Thinking about it some more... There are some optimization passes for atomics (`x = 2; x = 3` might be optimized to `x = 3;`, so no store for `2` since no observer would be able to tell whether that store happened or not) that cannot be done for `volatile` values (since reading or writing a `volatile` value might cause a side effect for some special values, e.g. an IO port). – hoffmale Jul 25 '18 at 06:42
  • @UKMonkey "Reading an object designated by a volatile glvalue ([basic.lval]), modifying an object, calling a library I/O function, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment." So no, if you have infinite loop waiting on a non-atomic volatile, it is not conforming to throw it out. – Alice Jul 26 '18 at 12:17
  • @Alice but it's not volatile. – UKMonkey Jul 26 '18 at 12:19
    @hoffmale: just FYI, yes ISO C++ on paper *allows* that kind of optimization for atomics. In practice compilers basically treat all atomics pretty much as `volatile atomic` and don't optimize, because of potential problems it could create. See [Why don't compilers merge redundant std::atomic writes?](//stackoverflow.com/q/45960387) – Peter Cordes Dec 10 '19 at 00:26
  • @DanMills - I laughed when I first encountered `volatile const` until I realised that its use was actually `volatile const*` which *does* make sense - it's a pointer with a constant address to data which is volatile - an I/O port. – Den-Jason Sep 22 '20 at 05:19
  • @Den-Jason It can get worse than that, DMA ping pong buffer target pointer updated in an ISR for example, both the pointer and the data are volatile, or a pointer to a read only status register, volatile const * const uint32_t.... – Dan Mills Sep 22 '20 at 14:01
  • @PeterCordes: If the authors of the Standard had intended that an infinite loop invite completely arbitrary behavior, they could have specified as a constraint that all loops must either have side effects or terminate. The notion that "may assume" invites UB if an assumption is violated is contrary to use of the term "assume" in other fields of human endeavor. Someone who is told they may assume a bridge will be repaired by the time they make a certain trip should be excused for being late if they arrive at the bridge and then have to make a detour, but should not be excused... – supercat Feb 28 '22 at 18:00
  • ...for ignoring "BRIDGE OUT" signs on the basis that they couldn't possibly be correct unless the assumption were violated. The "as-if" rule can't really accommodate situations where some, but not all, behaviors that are inconsistent with sequential execution should be viewed as acceptable, but IMHO the Standard was intended to allow compilers to delay (possibly indefinitely) the execution of a loop across code which is statically reachable from within it, and whose execution once reached could not be affected by anything within the loop. That would allow many optimizations... – supercat Feb 28 '22 at 18:05
  • ...that would be negated if infinite loops were treated as UB (since the latter treatment would make it necessary for programmers to include dummy side effects within loops to prevent compilers from generating code that behaves in unacceptable fashion). – supercat Feb 28 '22 at 18:06
13

(Cross-site duplicate on SO about the thread case, rather than the interrupt/signal-handler case). Also related: When to use volatile with multi threading?


A data race on a non-atomic variable¹ is Undefined Behaviour in C++11², i.e. potentially-concurrent read+write or write+write without any synchronization to provide a happens-before relationship, e.g. a mutex or release/acquire synchronization.


The compiler is allowed to assume that no other thread has modified choice between two reads of it (because that would be data-race UB (Undefined Behaviour)), so it can CSE and hoist the check out of the loop.

This is in fact what gcc does (and most other compilers too):

while(!choice){}

optimizes into asm that looks like this:

if(!choice)     // conditional branch outside the loop to skip it
    while(1){}  // infinite loop, like ARM  .L2: b .L2

This happens in the target-independent part of gcc, so it applies to all architectures.

You want the compiler to be able to do this kind of optimization, because real code contains stuff like for (int i=0 ; i < global_size ; i++ ) { ... }. You want the compiler to be able to load the global outside the loop, not keep re-loading it every loop iteration, or for every access later in a function. Data has to be in registers for the CPU to work with it, not memory.
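As a hand-written illustration of that hoisting (not compiler output; `global_size` and `sum_hoisted` are made-up names for this sketch):

```cpp
// Equivalent of what the optimizer does to `for (i=0; i<global_size; i++)`:
// the global is loaded into a register once, before the loop, instead of
// being re-loaded from memory on every iteration.
int global_size = 0;

int sum_hoisted(const int* data) {
    int n = global_size;           // single hoisted load
    int total = 0;
    for (int i = 0; i < n; i++)    // loop compares against a register
        total += data[i];
    return total;
}
```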


The compiler could even assume the code is never reached with choice == 0, because an infinite loop with no side effects is Undefined Behaviour. (Reads / writes of non-volatile variables don't count as side effects). Stuff like printf is a side-effect, but calling a non-inline function would also stop the compiler from optimizing away the re-reads of choice, unless it was static int choice. (Then the compiler would know that printf couldn't modify it, unless something in this compilation unit passed &choice to a non-inline function. i.e. escape analysis might allow the compiler to prove that static int choice couldn't be modified by a call to an "unknown" non-inline function.)

In practice real compilers don't optimize away simple infinite loops, they assume (as a quality-of-implementation issue or something) that you meant to write while(42){}. But an example in https://en.cppreference.com/w/cpp/language/ub shows that clang will optimize away an infinite loop if there was code with no side effects in it which it optimized away.


Officially supported 100% portable / legal C++11 ways to do this:

You don't really have multiple threads, you have an interrupt handler. In C++11 terms, that's exactly like a signal handler: it can run asynchronously with your main program, but on the same core.

C and C++ have had a solution for that for a long time: volatile sig_atomic_t is guaranteed to be OK to write in a signal handler and read in your main program:

An integer type which can be accessed as an atomic entity even in the presence of asynchronous interrupts made by signals.

void reader() {
    volatile sig_atomic_t shared_choice = 0;
    auto handler = [&shared_choice](int x) { shared_choice = x; };  // lambda that sets shared_choice

    // ... register handler as the interrupt handler

    sig_atomic_t choice;        // non-volatile local to read it into
    while ((choice = shared_choice) == 0) {
        // if your CPU has any kind of power-saving instruction like x86 pause, do it here.
        // or a sleep-until-next-interrupt like x86 hlt
    }

    // ... unregister it.

    switch (choice) {
        case 1: goto constant;
        ...
        case 0: // you could build the loop around this switch instead of a separate spinloop
                // but it doesn't matter much
    }
}

Other volatile types are not guaranteed by the standard to be atomic (although in practice they are up to at least pointer width on normal architectures like x86 and ARM, because locals will be naturally aligned. uint8_t is a single byte, and modern ISAs can atomically store a byte without a read/modify/write of the surrounding word, despite any misinformation you may have heard about word-oriented CPUs).

What you'd really like is a way to make a specific access volatile, instead of needing a separate variable. You might be able to do that with *(volatile sig_atomic_t*)&choice, like the Linux kernel's ACCESS_ONCE macro, but Linux compiles with strict-aliasing disabled to make that kind of thing safe. I think in practice that would work on gcc/clang, but I think it's not strictly legal C++.
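A sketch of that cast-based idiom (the macro name is made up for this example; as noted, it works in practice on gcc/clang but is not strictly conforming C++):

```cpp
#include <signal.h>   // for sig_atomic_t

int choice = 0;  // ordinary non-volatile variable, as in the question

// ACCESS_ONCE-style cast: force a single volatile read of an
// otherwise ordinary variable, without declaring it volatile.
#define READ_ONCE_SIG(x) (*(volatile sig_atomic_t*)&(x))

int spin_until_nonzero() {
    int c;
    // every iteration performs a fresh volatile load of `choice`
    while ((c = READ_ONCE_SIG(choice)) == 0) {}
    return c;
}
```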


With std::atomic<T> for lock-free T

(with std::memory_order_relaxed to get efficient asm without barrier instructions, like you can get from volatile)

C++11 introduced a standard mechanism to handle the case where one thread reads a variable while another thread (or signal handler) writes it.

It provides control over memory-ordering, with sequential-consistency by default, which is expensive and not needed for your case. std::memory_order_relaxed atomic loads/stores will compile to the same asm (for your K60 ARM Cortex-M4 CPU) as volatile uint8_t, with the advantage of letting you use a uint8_t instead of whatever width sig_atomic_t is, while still avoiding even a hint of C++11 data race UB.

(Of course it's only portable to platforms where atomic<T> is lock-free for your T; otherwise async access from the main program and an interrupt handler can deadlock. C++ implementations aren't allowed to invent writes to surrounding objects, so if they have uint8_t at all, it should be lock-free atomic. Or just use unsigned char. But for types too wide to be naturally atomic, atomic<T> will use a hidden lock. With regular code unable to ever wake up and release a lock while the only CPU core is stuck in an interrupt handler, you're screwed if a signal/interrupt arrives while that lock is held.)

#include <atomic>
#include <stdint.h>

volatile uint8_t v;
std::atomic<uint8_t> a;

void a_reader() {
    while (a.load(std::memory_order_relaxed) == 0) {}
    // std::atomic_signal_fence(std::memory_order_acquire); // optional
}
void v_reader() {
    while (v == 0) {}
}

Both compile to the same asm, with gcc7.2 -O3 for ARM, on the Godbolt compiler explorer

a_reader():
    ldr     r2, .L7      @ load the address of the global
.L2:                     @ do {
    ldrb    r3, [r2]        @ zero_extendqisi2
    cmp     r3, #0
    beq     .L2          @ }while(choice eq 0)
    bx      lr
.L7:
    .word   .LANCHOR0


void v_writer() {
    v = 1;
}

void a_writer() {
    // a = 1;  // seq_cst needs a DMB, or x86 xchg or mfence
    a.store(1, std::memory_order_relaxed);
}

ARM asm for both:

    ldr     r3, .L15
    movs    r2, #1
    strb    r2, [r3, #1]
    bx      lr

So in this case for this implementation, volatile can do the same thing as std::atomic. On some platforms, volatile might imply using special instructions necessary for accessing memory-mapped I/O registers. (I'm not aware of any platforms like that, and it's not the case on ARM. But that's one feature of volatile you definitely don't want).


With atomic, you can even block compile-time reordering with respect to non-atomic variables, with no extra runtime cost if you're careful.

Don't use .load(mo_acquire), that will make asm that's safe with respect to other threads running on other cores at the same time. Instead, use relaxed loads/stores and use atomic_signal_fence (not thread_fence) after a relaxed load, or before a relaxed store, to get acquire or release ordering.

A possible use-case would be an interrupt handler that writes a small buffer and then sets an atomic flag to indicate that it's ready. Or an atomic index to specify which of a set of buffers.

Note that if the interrupt handler can run again while the main code is still reading the buffer, you have data race UB (and an actual bug on real hardware). In pure C++, where there are no timing restrictions or guarantees, you might have theoretical potential UB (which the compiler should assume never happens).

But it's only UB if it actually happens at runtime; If your embedded system has realtime guarantees then you may be able to guarantee that the reader can always finish checking the flag and reading the non-atomic data before the interrupt can fire again, even in the worst-case where some other interrupt comes in and delays things. You might need some kind of memory barrier to make sure the compiler doesn't optimize by continuing to reference the buffer, instead of whatever other object you read the buffer into. The compiler doesn't understand that UB-avoidance requires reading the buffer once right away, unless you tell it that somehow. (Something like GNU C asm("":::"memory") should do the trick, or even asm(""::"m"(shared_buffer[0]):"memory")).
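A sketch of that buffer-plus-flag pattern (names and buffer size are illustrative), using relaxed atomics plus atomic_signal_fence as described above:

```cpp
#include <atomic>
#include <cstdint>

static uint8_t buffer[16];                 // non-atomic payload
static std::atomic<bool> ready{false};     // atomic publish flag

void interrupt_handler() {                 // stands in for the real ISR
    for (int i = 0; i < 16; i++)
        buffer[i] = static_cast<uint8_t>(i);
    // compiler barrier: buffer writes can't sink below the flag store
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

bool try_read(uint8_t out[16]) {
    if (!ready.load(std::memory_order_relaxed))
        return false;
    // compiler barrier: buffer reads can't hoist above the flag load
    std::atomic_signal_fence(std::memory_order_acquire);
    for (int i = 0; i < 16; i++)
        out[i] = buffer[i];
    return true;
}
```

The fences cost nothing at runtime on a single core; they only constrain compile-time reordering, which is exactly what's needed between an interrupt handler and the main code.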


Of course, read/modify/write operations like a++ will compile differently from v++, to a thread-safe atomic RMW, using an LL/SC retry loop, or an x86 lock add [mem], 1. The volatile version will compile to a load, then a separate store. You can express this with atomics like:

uint8_t non_atomic_inc() {
    auto tmp = a.load(std::memory_order_relaxed);
    uint8_t old_val = tmp;
    tmp++;
    a.store(tmp, std::memory_order_relaxed);
    return old_val;
}

If you ever actually want to increment choice in memory, volatile avoids some syntax pain, if a plain load/store pair is what you want instead of an actual atomic increment. But remember that every access to a volatile or atomic object is an extra load or store, so you should really just choose when to read it into a non-atomic / non-volatile local.

Compilers don't currently optimize atomics, but the standard allows it in cases that are safe unless you use volatile atomic<uint8_t> choice.

Again, what we'd really like is atomic access while the interrupt handler is registered, then normal access afterwards.

C++20 provides this with std::atomic_ref<>

But neither gcc nor clang actually supports this in their standard library yet (libstdc++ or libc++): "no member named 'atomic_ref' in namespace 'std'", with gcc and clang -std=gnu++2a. There shouldn't be a problem actually implementing it, though; GNU C builtins like __atomic_load work on regular objects, so atomicity is on a per-access basis rather than a per-object basis.

void reader(){ 
    uint8_t choice;
    {  // limited scope for the atomic reference
       std::atomic_ref<uint8_t> atomic_choice(choice);
       auto choice_setter = [&atomic_choice] (int x) { atomic_choice = x; };

       ui::Context::addEventListener(ui::EventType::JOYSTICK_DOWN, &choice_setter);
       while(!atomic_choice) {}

       ui::Context::removeEventListener(ui::EventType::JOYSTICK_DOWN, &choice_setter);

    }

    switch(choice) { // then it's a normal non-atomic / non-volatile variable
    }
}

You probably end up with one extra load of the variable vs. while(!(choice = shared_choice)) ;, but if you're calling a function between the spinloop and when you use it, it's probably easier not to force the compiler to record the last read result in another local (which it may have to spill). Or, after deregistering, you could do a final choice = shared_choice; so the compiler can keep choice in a register only, instead of re-reading the atomic or volatile.


Footnote 1: volatile

Even data-races on volatile are technically UB, but in that case the behaviour you get in practice on real implementations is useful, and normally identical to atomic with memory_order_relaxed, if you avoid atomic read-modify-write operations.

When to use volatile with multi threading? explains in more detail for the multi-core case: basically never, use std::atomic instead (with memory_order_relaxed).

Compiler-generated code that loads or stores uint8_t is atomic on your ARM CPU. Read/modify/write like choice++ would not be an atomic RMW on volatile uint8_t choice, just an atomic load, then a later atomic store which could step on other atomic stores.

Footnote 2: C++03:

Before C++11, the ISO C++ standard didn't say anything about threads, but older compilers worked the same way; C++11 basically just made it official that the way compilers already worked was correct, applying the as-if rule to preserve the behaviour of only a single thread, unless you use special language features.

Peter Cordes
    Re: "volatile might imply using special instructions necessary for accessing memory-mapped I/O registers." - It does on the Xtensa ISA of, e.g., those ESP-8266 chips: The docs say the compiler should insert the `MEMW` ("memory wait") instruction before reads and after writes to volatile variables to make sure the data has propagated through any/all pipelines or caches. IIRC, there was also a known silicon bug where multiple writes to the same memory location in quick succession (w/o MEMW) could cause earlier writes to be skipped and only propagate the later writes to off-core hardware/memory. – JimmyB Apr 10 '19 at 11:51