
I'm working on an audio application on the Nucleo F411RE and I've noticed that my processing was too slow, making the application skip some samples.

Looking at my disassembly, I figured that given the number of instructions and the 100 MHz CPU clock (which I set in STM32CubeMX), it should be a lot faster.

I checked the SYSCLK value and it is 100 MHz as expected. To be 100% sure I put 1000 "nop" instructions in my main loop and measured 10 µs, which does correspond to a 100 MHz clock.

I measured exactly the time taken by my processing: it takes 14.5 µs, i.e. 1450 clock cycles. I think that's way too much, considering the processing is pretty simple:

for(i=0; i<12; i++)
{
    el1.osc[i].phase += el1.osc[i].phaseInc;  // 0.16 µs
    if(el1.osc[i].phase >= 1.0) // 0.20 µs (for the whole "if")
        el1.osc[i].phase -= 1.0;
    el1.osc[i].value = sine[ (int16_t)(el1.osc[i].phase * RES) ]; // 0.96 µs
    el1.val += el1.osc[i].value * el1.osc[i].amp; // 0.28 µs
} // that's a total of 1.63 µs for the whole loop

where phase and phaseInc are single-precision floats, value is an int16_t, and sine[] is a look-up table containing 1024 int16_t values.
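For context, here is a minimal self-contained version of the loop that actually compiles; the struct layout, the RES definition, and the table size are assumptions (only the member names appear in the post):

```c
#include <stdint.h>

#define NUM_OSC 12
#define RES     1024.0   /* assumed: table size as in the post; note this is a double constant */

/* Assumed declarations: names 'el1', 'osc', 'sine' are from the question,
 * the exact layout is a guess. */
typedef struct {
    float   phase;       /* current phase in [0, 1) */
    float   phaseInc;    /* phase increment per sample */
    float   amp;         /* per-oscillator amplitude */
    int16_t value;       /* last sample read from the table */
} Osc;

struct {
    Osc   osc[NUM_OSC];
    float val;           /* summed output sample */
} el1;

int16_t sine[1024];      /* wavetable, filled elsewhere */

void process(void)
{
    for (int i = 0; i < NUM_OSC; i++) {
        el1.osc[i].phase += el1.osc[i].phaseInc;
        if (el1.osc[i].phase >= 1.0)  /* 1.0 is a double constant, as in the post */
            el1.osc[i].phase -= 1.0;
        el1.osc[i].value = sine[(int16_t)(el1.osc[i].phase * RES)];
        el1.val += el1.osc[i].value * el1.osc[i].amp;
    }
}
```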

It shouldn't be more than about 500 cycles, right? I looked at the disassembly, and it does use the floating-point instructions... For example, the disassembly of the last line is: vfma.f32 => 3 cycles, vcvt.s32.f32 => 1 cycle, vstr => 2 cycles, ldrh.w => 2 cycles.

(cycle timings according to this). So that's a total of 8 cycles for that line, which is the "biggest" one. I don't really get why it's so slow... Maybe because I'm using structures or something?

If anybody has an idea, I'd be glad to hear it.

EDIT: I just measured the time line by line; you can see it in the code above. It seems the most time-consuming line is the look-up table line, which would mean that memory access time is what's critical? How could I improve that?

EDIT2: disassembly, as requested by @BruceAbbott (sorry it's a bit messy, probably because of the way it was optimized by the compiler):

membrane1.mode[i].phase += membrane1.mode[i].phaseInc;
0800192e:   vldr s14, [r5, #12]
08001932:   vldr s15, [r5, #8]
08001936:   vadd.f32 s15, s15, s14
0800193a:   adds r5, #24
179 if(membrane1.mode[i].phase >= 1.0)
0800193c:   vcmpe.f32 s15, s16
08001940:   vmrs APSR_nzcv, fpscr
180 membrane1.mode[i].phase -= 1.0;
08001944:   itt ge
08001946:   vmovge.f32 s14, #112    ; 0x70
0800194a:   vsubge.f32 s15, s15, s14
0800194e:   vstr s15, [r5, #-16]
182 membrane1.mode[i].value = sine[(int16_t)(membrane1.mode[i].phase * RES)];
08001952:   ldr.w r0, [r5, #-16]
08001956:   bl 0x80004bc <__extendsfdf2>
0800195a:   ldr r3, [pc, #112]      ; (0x80019cc <main+428>)
0800195c:   movs r2, #0
0800195e:   bl 0x8000564 <__muldf3>
08001962:   bl 0x8000988 <__fixdfsi>
08001966:   ldr r3, [pc, #104]      ; (0x80019d0 <main+432>)
184 membrane1.val += membrane1.mode[i].value * membrane1.mode[i].amp;
08001968:   vldr s13, [r5, #-4]
182 membrane1.mode[i].value = sine[(int16_t)(membrane1.mode[i].phase * RES)];
0800196c:   sxth r0, r0
0800196e:   ldrh.w r3, [r3, r0, lsl #1]
08001972:   strh.w r3, [r5, #-8]
184 membrane1.val += membrane1.mode[i].value * membrane1.mode[i].amp;
08001976:   sxth r3, r3
08001978:   vmov s15, r3
0800197c:   sxth r3, r4
0800197e:   vcvt.f32.s32 s14, s15
08001982:   vmov s15, r3
08001986:   vcvt.f32.s32 s15, s15
174 for(i=0; i<12; i++) // VADD.F32 : 1 cycle
0800198a:   cmp r5, r6
184 membrane1.val += membrane1.mode[i].value * membrane1.mode[i].amp;
0800198c:   vfma.f32 s15, s14, s13
08001990:   vcvt.s32.f32 s15, s15
08001994:   vstr s15, [sp, #4]
08001998:   ldrh.w r4, [sp, #4]
0800199c:   bne.n 0x800192e <main+270>
Florent
  • Have you considered memory access times? – Jonathan Wheeler May 01 '16 at 18:37
  • Well, it isn't that simple to calculate sin(x) 12 times – Marko Buršič May 01 '16 at 18:38
  • Not really... To be honest I'm really a beginner at optimization... But could access time increase the time taken by my processing that much? And how could I optimize that? – Florent May 01 '16 at 18:39
  • @MarkoBuršič Sorry if my code wasn't clear, sine[] is a wavetable, not a function – Florent May 01 '16 at 18:39
  • Is the FPU (Floating Point Unit) of the controller enabled? – Bence Kaulics May 01 '16 at 18:41
  • It is, I added '-mfloat-abi=hard -mfpu=fpv4-sp-d16' to my compiler commands – Florent May 01 '16 at 18:42
  • @JonathanWheeler I checked out the datasheet for the part, it is a '0 wait state' part. I assume that you are using the pin toggle method to benchmark your loop. Use the same method to check each part. Put a pin toggle around the inner `if` statement. If that is short, then try to put it around the sine lookup table. So on... – slightlynybbled May 01 '16 at 18:42
  • @slightlynybbled Thanks a lot, I'll try that right now – Florent May 01 '16 at 18:43
  • Can you show us the disassembly listing? – Bruce Abbott May 01 '16 at 18:58
  • @BruceAbbott I'll edit my post – Florent May 01 '16 at 19:02
  • As an audio guy: please please _please_ use an integer for the phase. You won't accumulate jitter, and you'll get the wraparound for free if you select the size just right and ignore some finer details about the C standard. This turns everything into integer operations as well, so the whole loop shouldn't take more than a few cycles. This is however not an answer to the question, because you didn't ask how to optimize. :) – pipe May 01 '16 at 19:38
  • Thanks @pipe, good idea! I'll do that and re-measure to see if it speeds things up enough!! I was using floats because at first I was doing interpolation, but the interpolation was too slow so I had to ditch it – Florent May 01 '16 at 20:08
  • Well, I changed phase and phaseInc to uint16_t but it didn't change the time... Maybe I did it the wrong way. Also, the most time-consuming line is `membrane1.mode[i].value = sine[membrane1.mode[i].phase];` (0.96 µs). Is there a way to optimize this? – Florent May 01 '16 at 20:22
  • EDIT: I've managed to do it eventually, it now takes only 5 µs thanks to @pipe... thanks a lot guys !!! – Florent May 01 '16 at 20:28
  • Bumping the clock faster on an MCU doesn't necessarily make things run faster; the flash often requires more wait states, slowing everything down. Once fetched, and if processor-bound, sure, but if flash-bound then no. Ideally you want to be just below the speed boundary where you have to switch to the next larger wait state. – old_timer May 01 '16 at 20:44
  • @dwelch I'm not experienced at all on the subject, and I didn't really get your last sentence... Do you have some resources on the subject? thanks! – Florent May 01 '16 at 21:01
  • I think I wrote a novel in one of these SO answers, and stuff at github. Say for example your flash has one wait state at or under 24 MHz, two from 24 to 32, and three from 32 on up. So at 24 MHz you can access the flash at basically 12 MHz, but just above 24 it is 8 MHz, since you need that next wait state. And as you go up with my make-believe numbers: at 32 MHz the flash is 10.7 MHz, but just above it, 8 MHz. So far 24 is the fastest as far as accessing the flash goes (these are made-up numbers). The flash will be rated at some maximum speed and that is where those clock boundaries will be found – old_timer May 01 '16 at 21:10
  • The wait state requirements will have boundaries at clock speeds based on the flash designed into that MCU. So my point is that your plan of going to 100 MHz to make it a lot faster may not only not make it faster, but may make it slower, depending on where 100 MHz lands and whether you are flash-bound or processor-bound. So that is what my comment was about. I didn't read all the other posts, but the answer related to 64-bit float (you used 1.0 instead of 1.0F, forcing a soft float, which you should have seen as a big jump in the size of the binary) is at least not helping your performance. – old_timer May 01 '16 at 21:12
  • In case the slowness is (partly) due to Flash accesses you could try putting the function into RAM: http://stackoverflow.com/questions/15137214/how-to-run-code-from-ram-on-arm-architecture – Michael May 01 '16 at 21:41
  • @dwelch a little update about your advice: I've looked into the wait states for the STM32F4 series and it turns out that it is 0 wait states at any clock frequency. However, you led me to learn about this whole matter and I will keep it in mind in my later projects. Thank you very much! @Michael: yes, that is a good idea, your link is very interesting although no feedback is given there. I will give mine as soon as I put this into practice! thanks. – Florent May 06 '16 at 22:56
  • Yep, always check... – old_timer May 07 '16 at 00:50
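To make pipe's suggestion from the comments concrete, here is a sketch of an integer phase accumulator, assuming a 1024-entry table (the function name and scaling are illustrative, not from the thread): a uint32_t phase maps the full 2^32 range onto one oscillator period, so unsigned overflow wraps the phase for free, and the table index is just the top bits.

```c
#include <stdint.h>

#define TABLE_BITS 10                          /* assumed: 1024-entry wavetable */
int16_t sine[1 << TABLE_BITS];                 /* filled elsewhere */

/* Integer phase accumulator: the whole uint32_t range is one period,
 * so the addition wraps modulo 2^32 (unsigned overflow is well defined
 * in C) and no float compare/subtract is needed. */
int16_t osc_next(uint32_t *phase, uint32_t phaseInc)
{
    *phase += phaseInc;                        /* wraps for free */
    return sine[*phase >> (32 - TABLE_BITS)];  /* top 10 bits index the table */
}
```

A phase increment of (f / fs) * 2^32 then gives an output frequency f at sample rate fs, and the whole per-sample cost is an add, a shift, and a load.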

1 Answer


In your disassembly we see calls to 64-bit (double precision) math functions:

08001956:   bl 0x80004bc <__extendsfdf2>
...
0800195e:   bl 0x8000564 <__muldf3>
08001962:   bl 0x8000988 <__fixdfsi>

The STM32F4 only supports 32-bit (single precision) floating point in hardware, so these functions must be implemented in software and will take many cycles to execute. To ensure that all calculations are done in 32 bits, you should define all your floating-point numbers (including constants) as type float.
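A minimal sketch of the fix (names assumed, not from the post): with a plain 1024.0 constant, the float operand is promoted to double and the whole expression goes through the soft-double helpers seen above; the f suffix keeps the multiply in single precision so the hardware FPU can do it directly.

```c
#include <stdint.h>

#define RES 1024.0f   /* the 'f' suffix makes this a float, not a double */

int16_t sine[1024];   /* wavetable, filled elsewhere */

int16_t lookup(float phase)
{
    /* phase * RES is now float * float, so the compiler can emit a
     * single vmul.f32 + vcvt instead of calling the software
     * __extendsfdf2 / __muldf3 / __fixdfsi routines. */
    return sine[(int16_t)(phase * RES)];
}
```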

Bruce Abbott
  • The OP code is written to use double, so this makes sense. Start by changing the 1.0 to 1.0F; without that they are double per the language standard and the things around them have to be promoted. – old_timer May 01 '16 at 20:42
  • I tried changing my RES constant to 1024.0f and it does help a lot. Thank you very much to you two! I had no idea that the language standard makes every floating-point constant a double and promotes everything... Is there a way to change that behaviour, with compiler options for example? – Florent May 01 '16 at 20:59
  • That would be compiler-specific, maybe, maybe not. But within the language just use the F to specify single float, or maybe typecast (float)(1.0); the compiler should optimize out the conversion and make it compile-time, not runtime. – old_timer May 01 '16 at 21:16