The best way is to use on-chip timers: SysTick, RTC or peripheral timers. These have the advantage that the timing is precise, deterministic and easily adapted if the CPU clock speed changes. Optionally, you can even let the CPU sleep and use a wake-up interrupt.
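As a rough example, on a Cortex-M target with the CMSIS core headers available, a SysTick-based delay could look like the sketch below. This is only a sketch: the device header name is a placeholder for whatever your vendor provides, and the 1 ms tick rate is an arbitrary choice.

#include <stdint.h>
#include "stm32f4xx.h"  /* placeholder: vendor device header that pulls in the CMSIS core */

static volatile uint32_t ms_ticks;

void SysTick_Handler (void)
{
    ms_ticks++;                          /* one tick per millisecond */
}

void delay_init (void)
{
    /* SystemCoreClock tracks the actual CPU clock, so the delay stays
       correct if the clock setup changes */
    SysTick_Config(SystemCoreClock / 1000u);
}

void delay_ms (uint32_t ms)
{
    uint32_t start = ms_ticks;
    while ((ms_ticks - start) < ms)
    {
        __WFI();                         /* optional: sleep until the next interrupt */
    }
}

The unsigned subtraction handles counter wrap-around, and __WFI() is the "let the CPU sleep" part mentioned above.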
Dirty "busy-delay" loops on the other hand, are rarely accurate and come with various problems such as "tight coupling" to a specific CPU instruction set and clock.
Some things of note:
- Toggling a GPIO pin repeatedly is a bad idea since this will draw current needlessly, and potentially also cause EMC issues if the pin is connected to traces.
- Using NOP instructions might not work. On many architectures (Cortex-M, for example) NOP is only a hint, and the CPU is free to remove it from the pipeline before it executes, so it is not guaranteed to consume any time.
If you insist on generating a dirty busy-loop, then it is sufficient to volatile-qualify the loop iterator. For example:
#include <stdint.h>

void dirty_delay (void)
{
    for (volatile uint32_t i = 0; i < 50000u; i++)
        ;
}
This is guaranteed to generate suitably crappy code, since the volatile forces every access of i to go through memory. For example, ARM gcc with -O3 -ffreestanding
gives:
dirty_delay:
        mov     r3, #0
        sub     sp, sp, #8
        str     r3, [sp, #4]
        ldr     r3, [sp, #4]
        ldr     r2, .L7
        cmp     r3, r2
        bhi     .L1
.L3:
        ldr     r3, [sp, #4]
        add     r3, r3, #1
        str     r3, [sp, #4]
        ldr     r3, [sp, #4]
        cmp     r3, r2
        bls     .L3
.L1:
        add     sp, sp, #8
        bx      lr
.L7:
        .word   49999
From there on you can in theory calculate how many cycles each instruction takes and change the magic number 50000 accordingly. Pipelining, branch prediction etc. mean that the code might execute faster than the plain sum of the cycle counts, though. And since the compiler decided to involve the stack, data caching could also play a part.
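To give a purely hypothetical illustration (the numbers are assumptions, not measurements from any particular chip): if the six instructions between .L3 and the bls cost around 8 cycles per iteration and the core runs at 16 MHz, then 50000 iterations take roughly 50000 * 8 / 16000000 ≈ 25 ms, and you would scale the constant down to about 20000 for a 10 ms delay. Any change of compiler, optimization level or clock invalidates such a figure.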
My whole point here is that accurately calculating how much time this code will actually take is difficult. Trial & error benchmarking with a scope is probably a more sensible idea than attempting theoretical calculations.
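If you do benchmark it with a scope, the usual trick is to wrap the delay in a single pin toggle and measure the resulting pulse width (one edge pair per measurement, unlike toggling inside the loop, is harmless). A sketch, where gpio_set()/gpio_clear() are hypothetical placeholders for your vendor's register access:

void dirty_delay (void);

/* gpio_set()/gpio_clear() are placeholders, not a real API */
extern void gpio_set (void);
extern void gpio_clear (void);

void benchmark_delay (void)
{
    gpio_set ();     /* rising edge: start of the pulse */
    dirty_delay ();
    gpio_clear ();   /* falling edge: pulse width on the scope = actual delay */
}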