The best way is to use on-chip timers: SysTick, RTC or peripheral timers. These have the advantage that the timing is precise, deterministic and easily adapted if the CPU clock speed changes. Optionally, you can even let the CPU sleep and use a wake-up interrupt.
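As a rough example, on a Cortex-M target with the CMSIS core headers available, a SysTick-based delay could look like the sketch below. This is only a sketch: the device header name is a placeholder for whatever your vendor provides, and the 1 ms tick rate is an arbitrary choice.

#include <stdint.h>
#include "stm32f4xx.h"  /* placeholder: vendor device header that pulls in the CMSIS core */

static volatile uint32_t ms_ticks;

void SysTick_Handler (void)
{
    ms_ticks++;                          /* one tick per millisecond */
}

void delay_init (void)
{
    /* SystemCoreClock tracks the actual CPU clock, so the delay stays
       correct if the clock setup changes */
    SysTick_Config(SystemCoreClock / 1000u);
}

void delay_ms (uint32_t ms)
{
    uint32_t start = ms_ticks;
    while ((ms_ticks - start) < ms)
    {
        __WFI();                         /* optional: sleep until the next interrupt */
    }
}

The unsigned subtraction handles counter wrap-around, and __WFI() is the "let the CPU sleep" part mentioned above.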
Dirty "busy-delay" loops on the other hand, are rarely accurate and come with various problems such as "tight coupling" to a specific CPU instruction set and clock.
Some things of note:
- Toggling a GPIO pin repeatedly is a bad idea since this will draw current needlessly, and potentially also cause EMC issues if the pin is connected to traces.
- Using NOP instructions might not work. On many architectures (Cortex-M, for example) NOP is only a hint, and the CPU is free to remove it from the pipeline before it executes, so it is not guaranteed to consume any time.
If you insist on generating a dirty busy-loop, then it is sufficient to volatile-qualify the loop iterator. For example:
#include <stdint.h>

void dirty_delay (void)
{
    for (volatile uint32_t i = 0; i < 50000u; i++)
        ;
}
This is guaranteed to generate suitably crappy code, since the volatile forces every access of i to go through memory. For example, ARM gcc with -O3 -ffreestanding
gives:
dirty_delay:
        mov     r3, #0
        sub     sp, sp, #8
        str     r3, [sp, #4]
        ldr     r3, [sp, #4]
        ldr     r2, .L7
        cmp     r3, r2
        bhi     .L1
.L3:
        ldr     r3, [sp, #4]
        add     r3, r3, #1
        str     r3, [sp, #4]
        ldr     r3, [sp, #4]
        cmp     r3, r2
        bls     .L3
.L1:
        add     sp, sp, #8
        bx      lr
.L7:
        .word   49999
From there on you can in theory calculate how many cycles each instruction takes and change the magic number 50000 accordingly. Pipelining, branch prediction etc. mean that the code might execute faster than the plain sum of the cycle counts, though. And since the compiler decided to involve the stack, data caching could also play a part.
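To give a purely hypothetical illustration (the numbers are assumptions, not measurements from any particular chip): if the six instructions between .L3 and the bls cost around 8 cycles per iteration and the core runs at 16 MHz, then 50000 iterations take roughly 50000 * 8 / 16000000 ≈ 25 ms, and you would scale the constant down to about 20000 for a 10 ms delay. Any change of compiler, optimization level or clock invalidates such a figure.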
My whole point here is that accurately calculating how much time this code will actually take is difficult. Trial & error benchmarking with a scope is probably a more sensible idea than attempting theoretical calculations.
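If you do benchmark it with a scope, the usual trick is to wrap the delay in a single pin toggle and measure the resulting pulse width (one edge pair per measurement, unlike toggling inside the loop, is harmless). A sketch, where gpio_set()/gpio_clear() are hypothetical placeholders for your vendor's register access:

void dirty_delay (void);

/* gpio_set()/gpio_clear() are placeholders, not a real API */
extern void gpio_set (void);
extern void gpio_clear (void);

void benchmark_delay (void)
{
    gpio_set ();     /* rising edge: start of the pulse */
    dirty_delay ();
    gpio_clear ();   /* falling edge: pulse width on the scope = actual delay */
}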