2

I have a C function, generated automatically by Mathematica, that performs a very long calculation on a microcontroller, and I am currently trying to optimize it for speed. It contains hundreds of calculations and looks like this:

void calcResult(float *result, float arg1, float arg2, ... float argN){
   float tmp1 = arg1 * 2 + arg2;
   float tmp2 = tmp1/arg3 + tmp1;
   ...
   ...
   float tmpN = tmp320 + tmp15 * 2;
   *result = tmp2 + tmpN;
}

I asked myself whether it might be faster not to use individual "tmp" variables but a single array with as many elements as there are "tmp" variables, or whether there is some other way to speed up such a function (under the assumption that the calculations needed to obtain "result" are already optimized in terms of calculation time).
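For illustration, the array-based variant being asked about might look like this (a sketch based on the short example above; `calcResult_array` and the three-step chain are made up for demonstration):

```c
/* Hypothetical array-based variant of the generated function:
 * the same dependency chain, but with tmp[i] instead of the
 * individual tmp1..tmpN variables. */
void calcResult_array(float *result, float arg1, float arg2, float arg3)
{
    float tmp[3];                  /* one slot per former tmpN variable   */
    tmp[0] = arg1 * 2 + arg2;      /* was: float tmp1 = arg1 * 2 + arg2; */
    tmp[1] = tmp[0] / arg3 + tmp[0];
    tmp[2] = tmp[0] + tmp[1] * 2;  /* stands in for hundreds of steps    */
    *result = tmp[1] + tmp[2];
}
```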

Edit:

In my opinion, [Is micro-optimisation important when coding?](https://softwareengineering.stackexchange.com/questions/99445/is-micro-optimisation-important-when-coding) doesn't answer my specific question, even though the goal there is also to optimise code.

theNewOne
  • 31
  • 1
  • 5
  • 2
    Why don't you just profile both options and find out? – Philip Kendall Jan 19 '19 at 13:28
  • I should do that :-) However, I was also interested in whether there is another possibility to speed up a function like this – theNewOne Jan 19 '19 at 13:30
  • 1
    These are floating point operations, which means there are limitations as to compiler optimization opportunities. In particular, floating point operations are not associative. The compiler might just preserve the entire expression tree, only able to schedule operations without changing the tree structure. – rwong Jan 19 '19 at 13:46
  • 1
    It also depends on whether the microcontroller target and the compiler support special floating-point acceleration instructions, such as SIMD, or some arcane extensions. – rwong Jan 19 '19 at 13:47
  • I should think that you would get more of a speedup by returning the float rather than using the indirect output parameter. However, the only way to know how it affects your application under the operating conditions it is intended to run is to profile it. C isn't my primary language, so I could be wrong... but I was under the impression that indirect pointers aren't as efficient as stack based operations. – Berin Loritsch Jan 19 '19 at 15:56
  • Using an array with constant indexes (e.g. `temp[1]` vs. `tmp1`) in place of individual variables won't help performance; at best, either it won't hurt (e.g. if the processor has no floating point registers), or, the compiler will revert them back to individual variables anyway (if it is good). You might experiment with the signature of the method, passing an array of floats vs. individual floats might help; that brings the caller(s) (not shown) into play. – Erik Eidt Jan 19 '19 at 16:13
  • @gnat IMO [Is micro-optimisation important when coding?](https://softwareengineering.stackexchange.com/questions/99445/is-micro-optimisation-important-when-coding) is not helpful for answering my specific question, even if the goal is optimisation as well. – theNewOne Jan 22 '19 at 06:49
  • 1
    @theNewOne: your sentence "it does not answer my specific question" does not convince me. The two top answers both seem to fit your case perfectly, whether you like it or not. – Doc Brown Jan 22 '19 at 07:32
  • I know you will profile the code using an array; my guess is it'd be a tiny bit faster if you pass an array instead of N arguments. Anyway, have you tried optimisation instead of micro-optimisation? – imel96 Jan 22 '19 at 09:13

3 Answers

4

The ultimate answer, as always, is to profile both approaches and let empirical results decide. However, I would not expect a difference in speed, given modern C compiler technology. You might even make the code slower because you're tying the compiler's hands in two ways:

  1. You're dictating a specific order for the variables to appear on the stack (assuming the array is stack-allocated)
  2. You're dictating a specific storage location for the temporaries (on the stack, rather than letting the compiler keep them in registers and spill only when necessary)

It's likely that your compiler can figure this out anyway (see LLVM's Mem2Reg pass), but why not tell your compiler what you mean? The compiler is then free to make register allocation, software pipelining, and stack-locality decisions for you.

I would also recommend declaring all your temporaries `const`, because even if your compiler can infer that on its own, it's helpful for debugging your code generator.
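As a sketch, applying this to the question's example (illustrative only; the real generated code has hundreds of such lines):

```c
/* Same shape as the generated code, but with const temporaries.
 * A second, accidental write to any tmpN is now a compile-time error. */
void calcResult(float *result, float arg1, float arg2, float arg3)
{
    const float tmp1 = arg1 * 2 + arg2;
    const float tmp2 = tmp1 / arg3 + tmp1;
    const float tmp3 = tmp1 + tmp2 * 2;   /* stands in for the long tail */
    *result = tmp2 + tmp3;
}
```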

Alex Reinking
  • 1,607
  • 11
  • 16
  • As a small complement: I tried to profile it on a Cortex M4. I got more or less the same results with and without the `const` qualifier before all my temporary variables, e.g. `const float tmp1`. See https://ibb.co/fDcCt6C (green/red --> with/without `const` qualifier). I did not try the solution with an array yet. Maybe I'll add this later. – theNewOne Jan 22 '19 at 06:42
  • @theNewOne I wouldn't have expected a performance difference with the const keywords, but it does prevent your code generator from accidentally overwriting a value it didn't mean to. – Alex Reinking Jan 22 '19 at 06:54
2

When these conditions are satisfied:

  • The array is local (confined to the function scope)
  • The array is only used locally
  • Nothing about the array's structure (e.g. pointers, references, type) is leaked into another function (via a function call or a store),
  • Etc ... (I might be wrong since I'm not experienced in compiler optimizations)

The compiler may apply scalar replacement of aggregates (SRA, also SRoA). In that case the local-array approach (with each element assigned exactly once) and the local-variables approach may be optimized identically, without reserving stack space for hundreds of float elements or variables.

To check whether this is the case, it is necessary to take a look at the disassembly from an optimized build. There might be factors that prevent the compiler from applying SRA.
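A minimal way to check (a sketch, assuming a GCC- or Clang-style cross toolchain; the toolchain name and flags are illustrative): compile a cut-down version of the function with `-O2 -S` and see whether the generated assembly still reserves a stack frame for the array:

```c
/* calc.c -- compile with e.g. `arm-none-eabi-gcc -O2 -S calc.c` and inspect
 * calc.s. If SRA fired, tmp[] never gets a stack slot; its elements live
 * in registers (virtual registers during optimization) instead. */
float calc(float a, float b, float c)
{
    float tmp[3];                 /* local; its address never escapes */
    tmp[0] = a * 2 + b;
    tmp[1] = tmp[0] / c + tmp[0];
    tmp[2] = tmp[0] + tmp[1] * 2;
    return tmp[1] + tmp[2];
}
```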

If SRA isn't performed, the array approach has a serious problem: the function will reserve enough stack space for hundreds of float variables - the full size of the array - because the compiler determined that it couldn't "eliminate" the array. Compare this to the local scalar variables approach, where compilers can rotate values through registers and back to memory, and reuse memory for values that are already expired or disused.

Once again, I highly recommend taking a look at the disassembly, from both optimized and unoptimized builds. You'll learn something. I did not learn compiler theory or optimizations in school. Most of what I learned about these topics comes from years of reading compiler-generated disassembly and reading expertly-written articles online.

The function might also be inlined, in which case its optimization outcome also depends on how the caller is written.

rwong
  • 16,695
  • 3
  • 33
  • 81
0

As others have already pointed out, I would not get my hopes up too high regarding the effect of your suggested changes. Your best bet is to spot calculation chains that do not have to be executed sequentially and run those on separate threads. That would likely get you a noticeable performance gain.

However, looking at your example, it appears every step needs the result of the preceding step, so in this particular case that does not seem to be a viable option.

Martin Maat
  • 18,218
  • 3
  • 30
  • 57
  • Thank you, but in my case I have to run this on a small embedded platform. Multithreading is not an option here. – theNewOne Jan 22 '19 at 06:52