105

I am a religious person and make efforts not to commit sins. That is why I tend to write small (even smaller than that, to paraphrase Robert C. Martin) functions, to comply with the several commandments laid down by the Clean Code bible. But while checking some stuff, I landed on this post, below which I read this comment:

Remember that the cost of a method call can be significant, depending on the language. There's almost always a tradeoff between writing readable code and writing performant code.

Under what conditions is this quoted statement still valid nowadays given the rich industry of performant modern compilers?

That is my only question, and it is not about whether I should write long or small functions. I just point out that your feedback may (or may not) alter my attitude and leave me unable to resist the temptation of the blasphemers.

Billal Begueradj
  • 2
    It depends on the size of the call stack, which is usually a problem only when you are using recursively called functions. – Fabio Sep 09 '17 at 07:19
  • 12
    Write readable and maintainable code. Only when you face a stack overflow issue should you rethink your approach. – Fabio Sep 09 '17 at 07:22
  • 3
    What makes you believe there are only compiled languages? Or that everybody always is free to use a "performant modern compiler"? – Doc Brown Sep 09 '17 at 10:55
  • 35
    A general answer here is impossible. There are too many different compilers, implementing too many different language specifications. And then there are JIT-compiled languages, dynamically interpreted languages, and so on. Suffice it to say, though, if you're compiling native C or C++ code with a modern compiler, you don't have to worry about the costs of a function call. The optimizer will inline these whenever it's appropriate. As a micro-optimization enthusiast, I *rarely* see compilers making inlining decisions that I or my benchmarks disagree with. – Cody Gray - on strike Sep 09 '17 at 12:19
  • 7
    Speaking from personal experience, I write code in a proprietary language that is fairly modern in terms of capability, but function calls are ridiculously expensive, to the point where even typical for loops have to be optimized for speed: `for(Integer index = 0, size = someList.size(); index < size; index++)` instead of simply `for(Integer index = 0; index < someList.size(); index++)`. Just because your compiler was made in the last few years doesn't necessarily mean you can forego profiling. – phyrfox Sep 09 '17 at 13:08
  • 6
    @phyrfox that just makes sense, getting the value of someList.size() outside the loop instead of calling it every time through the loop. That's especially true if there is any chance of a synchronization issue where readers and writers might try to clash during the iteration, in which case you would also want to protect the list against any changes during iteration. – Craig Tullis Sep 09 '17 at 20:33
  • 2
    The statement will always be true because it's hedged. There will always exist some language that doesn't have a compiler/interpreter that knows how to optimize away the method call overhead. There will always be problems to solve that need it optimized away. There will always be people who use that as an excuse to write procedural code even when none of that applies. Optimize code for the humans that have to read it first. Then when we know it needs to be faster we can make changes. – candied_orange Sep 10 '17 at 05:09
  • 1
    @phyrfox: does `someList.size()` traverse the list to calculate its size? – ninjalj Sep 10 '17 at 11:41
  • 1
    @ninjalj It's a language compiled into byte code running in a VM that runs on byte code. As such, things like manipulating the stack (e.g. function calls) seems to take an inordinate amount of time. Fortunately, I'm in touch with some of the developers that are updating the system, so hopefully I can get them to make some changes. Really, if they just inlined a few common methods, I think they'd get a lot better performance. – phyrfox Sep 10 '17 at 15:31
  • 1
    I wouldn't worry too much about committing sins! What counts as a sin is relative and usually made up to control people. Remember that 99% of the world does not do what they should anyways. The "leaders" are the biggest sinners of all. E.g., there exists no OS that has been written with perfect performance in mind (even though it is nearly achievable), nor perfect anything. They just try to get stuff out the door to keep the $$$ rolling in. So, feel free to "sin", everyone is doing it. It's not wrong, just try not to hurt anyone in the process! You can achieve balance, which is what is important! Yin/Yang! – AbstractDissonance Sep 11 '17 at 04:08
  • 1
    I don't want to single out PHP, but it deserves a special mention: its function-call overhead is almost uniquely high among mainstream programming languages. – rwong Sep 11 '17 at 04:39
  • 1
    C# offers the [AggressiveInlining] attribute if you want a function to be inlined which the compiler failed to inline (but generally, the compiler is smart enough to do it right). – Bernhard Hiller Sep 11 '17 at 08:05
  • 4
    The quoted statement is always valid. Past, present and future. It's a platitude: there's always a tradeoff, the trick is knowing when to trade what off, and when to even bother looking into it. This statement is more a reminder not to sacrifice everything blindly to a single cause and dig your heels in when there's a real problem that warrants bending your rules. Too much bad advice and decisions stem from foolish consistencies, the hobgoblins of small minds. In other words: don't become a Clean Code bigot. (That said, I'll take that over a micro-optimizing performance nut any day). – Jeroen Mostert Sep 11 '17 at 12:12
  • 8
    Beware of taking small functions too far, it may obfuscate the code just as efficiently as a monolithic mega-function does. If you don't believe me, check out some of the http://www.ioccc.org/ winners: Some code everything into a single `main()`, others split everything into some 50 tiny functions, and all are utterly unreadable. The trick is, as always, to strike a good *balance*. – cmaster - reinstate monica Sep 11 '17 at 13:05
  • 5
    The unfortunately irony is those who follow the Clean Code book produce the opposite; it appears clean only to other members of the cult. – Frank Hileman Sep 11 '17 at 15:28

12 Answers

152

It depends on your domain.

If you are writing code for a low-power microcontroller, then the method call cost might be significant. But if you are creating a normal website or application, then the method call cost will be negligible compared to the rest of the code. In that case, it is always more worthwhile to focus on the right algorithms and data structures instead of micro-optimizations like method calls.

And there is also the question of the compiler inlining the methods for you. Most compilers are intelligent enough to inline functions where possible.

And last, there is the golden rule of performance: ALWAYS PROFILE FIRST. Don't write "optimized" code based on assumptions. If you are unsure, write both cases and see which is better.
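
For illustration, here is a minimal C++ sketch of "write both cases and see which is better" (the add function, the iteration count, and the timing harness are assumptions made up for the example). Built with optimizations enabled, most modern compilers inline the call and the two timings come out essentially identical, which is the point:

#include <chrono>
#include <cstdint>
#include <iostream>

// Tiny helper whose call cost we want to measure (illustrative only).
static std::uint64_t add(std::uint64_t a, std::uint64_t b) { return a + b; }

int main() {
    constexpr std::uint64_t N = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    std::uint64_t sum1 = 0;
    for (std::uint64_t i = 0; i < N; ++i) sum1 = add(sum1, i);   // through a function call
    auto t1 = std::chrono::steady_clock::now();

    std::uint64_t sum2 = 0;
    for (std::uint64_t i = 0; i < N; ++i) sum2 = sum2 + i;       // hand-inlined version
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "call:    " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "inlined: " << std::chrono::duration<double>(t2 - t1).count() << " s\n"
              << (sum1 == sum2 ? "results match\n" : "bug!\n");
}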

Euphoric
  • 14
    And e.g. the HotSpot compiler performes *Speculative Inlining*, which is in some sense inlining even when it is *not* possible. – Jörg W Mittag Sep 09 '17 at 10:20
  • 52
    In fact, in a web application, the *whole* code is probably insignificant in relation to the DB access and the network traffic... – AnoE Sep 09 '17 at 15:45
  • 76
    I'm actually into embedded and ultra low power with a very old compiler that barely knows what optimization means, and believe me even though the function calls matter it's never the first place to look for optimization. Even in this niche domain the code quality comes first in this case. – Tim Sep 09 '17 at 18:49
  • @TimF: I mean if the compiler doesn't optimize things well then it's quite plausible that the call overhead is dwarfed by all the other inefficiencies. I'd be curious what would happen in cases where the compiler *can* optimize well, but for some reason can't inline things well. – user541686 Sep 10 '17 at 19:11
  • 2
    @Mehrdad Even in this case I'd be surprised if there wasn't anything more relevant to optimize in the code. When profiling the code I see things way heavier than the functions calls, and that's where it's relevant to look for optimization. Some devs go crazy for one or two unoptimized LOC but when you profile the SW you realize that design matters more than this, at least for the largest part of the code. When you find the bottleneck you can try to optimize it, and it'll have so much more impact than low-level arbitrary optimization such as writing big functions to avoid the calls overhead. – Tim Sep 10 '17 at 20:00
  • 9
    Good answer! Your last point should be first: **Always profile before deciding where to optimise**. – CJ Dennis Sep 11 '17 at 02:32
  • 3
    On the other hand, it is important that a programmer **learn to use a profiler regularly, early in the career**. That way you learn about what to do and what not to. If bad programming habits are formed early in the career, it will be hard to fix, and it may in turn affect one's career path. Though it also depend on the corporate culture. In some work environments, time-to-market is more important; code performance may be much less important. In those environments it's okay to be fast and break things. Though even Facebook is now distancing itself from that motto. – rwong Sep 11 '17 at 04:41
  • 1
    @Mehrdad: On the old Cray vector machines, the older Cray compilers didn't really do much analysis to decide when to inline; instead they would simply inline to a fixed depth. They were pretty good at doing all sorts of crazy vector loop optimizations, but if what should be a vector operation is a one-line function calling a one-line function that calls a one-line function, you get function calls, can't use the vector unit, and your performance is shot. Fortunately, there was an obscure compiler flag to adjust the limit, and later versions of the compiler were smarter about it. –  Sep 11 '17 at 12:38
    True code optimization usually increases the overall complexity and reduces legibility. Most modern compilers take care of things like unrolling loops, inlining, etc. On the other hand, the compiler is horrible at optimizing logic mistakes or an engineer's lack of knowledge of the language — this can cause a lot of performance problems that should be reduced through experience, code reviews, profiling, etc. For example, I often see junior developers load the same database record multiple times during a request. – Phil M Sep 11 '17 at 18:54
  • In the (paraphrased) words of my DSP prof., start with a good design, then profile. Then consider restructuring to help the compiler. Then add compiler directives. Then hand-code asm. Between each step, you stop and ask if its good enough, otherwise you'll drive yourself crazy optimizing things that don't matter :) – mbrig Sep 11 '17 at 21:43
  • 1
    Bad answer. He was asking about how compilers optimize function calls, not whether optimization is justified across different domains. – aaa90210 Sep 12 '17 at 07:07
    In a C program, if I add a wrapper function which in turn calls another function without any other code, does the compiler optimize away and remove one function call? E.g. `void func1(void) { func2(); }`. Here func1() is just a wrapper to retain some compatibility with other code. – Rajesh Mar 30 '20 at 09:14
60

Function call overhead depends entirely on the language, and at what level you are optimizing.

On an ultra low level, function calls and even more so virtual method calls may be costly if they lead to branch misprediction or CPU cache misses. If you've written assembler, you'll also know that you need a few extra instructions to save and restore registers around a call. It is not true that a “sufficiently smart” compiler would be able to inline the correct functions to avoid this overhead, because compilers are limited by the semantics of the language (especially around features like interface method dispatch or dynamically loaded libraries).

On a high level, languages like Perl, Python, and Ruby do a lot of bookkeeping per function call, making those calls comparatively costly. This is made worse by meta-programming. I once sped up a piece of Python software 3x just by hoisting function calls out of a very hot loop. In performance-critical code, inlining helper functions can have a noticeable effect.

But the vast majority of software is not so extremely performance-critical that you would be able to notice function call overhead. In any case, writing clean, simple code pays off:

  • If your code is not performance-critical, this makes maintenance easier. Even in performance-critical software, the majority of the code won't be a “hot spot”.

  • If your code is performance-critical, simple code makes it easier to understand the code and spot opportunities for optimization. The biggest wins usually don't come from micro-optimizations like inlining functions, but from algorithmic improvements. Or phrased differently: don't do the same thing faster. Find a way to do less.
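
As a small, contrived illustration of "find a way to do less" (a toy example only): replacing the loop with the closed form removes the work entirely, which dwarfs anything gained by shaving call overhead:

#include <cstdint>

// "Do the same thing faster": micro-optimize the loop body.
std::uint64_t sum_loop(std::uint64_t n) {
    std::uint64_t s = 0;
    for (std::uint64_t i = 1; i <= n; ++i) s += i;   // O(n) additions
    return s;
}

// "Find a way to do less": use the closed form n*(n+1)/2.
std::uint64_t sum_formula(std::uint64_t n) {
    return n * (n + 1) / 2;                          // O(1), no loop at all
}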

Note that “simple code” does not mean “factored into a thousand tiny functions”. Every function also introduces a bit of cognitive overhead – it's more difficult to reason about more abstract code. At some point, these tiny functions might do so little that not using them would simplify your code.

amon
  • 18
    A really smart DBA once told me "Normalize till it hurts, then denormalize till it doesn't." Seems to me it could be rephrased to "Extract methods until it hurts, then inline until it doesn't." – RubberDuck Sep 10 '17 at 16:08
  • 1
    In addition to cognitive overhead, there is symbolic overhead in the debugger information, and usually overhead in the final binaries is unavoidable. – Frank Hileman Sep 10 '17 at 22:42
  • Regarding smart compilers - they CAN do that, just not always. For example jvm can inline things based on runtime profile with very cheap/free trap for uncommon path or inline polymorphic function for which there is only one implementation of given method/interface and then deoptimize that call to properly polymorphic when new subclass is loaded dynamically on runtime. But yes, there are many languages where such things are not possible and many cases even in jvm, when it is not cost effective or possible in general case. – Artur Biesiadowski Sep 12 '17 at 11:15
21

Almost all adages about tuning code for performance are special cases of Amdahl's law. The short, humorous statement of Amdahl's law is

If one piece of your program takes 5% of runtime, and you optimize that piece so that it now takes zero percent of runtime, the program as a whole will only be 5% faster.

(Optimizing things down to zero percent of runtime is totally possible: when you sit down to optimize a large, complicated program you are quite likely to find that it's spending at least some of its runtime on stuff it doesn't need to do at all.)

This is why people normally say not to worry about function call costs: no matter how expensive they are, normally the program as a whole is only spending a tiny fraction of its runtime on call overhead, so speeding them up doesn't help very much.

But, if there's a trick you can pull that makes all the function calls faster, that trick is probably worth it. Compiler developers spend lots of time optimizing function "prologues" and "epilogues", because that benefits all of the programs compiled with that compiler, even if it's only a tiny bit for each.

And, if you have reason to believe that a program is spending a lot of its runtime just making function calls, then you should start thinking about whether some of those function calls are unnecessary. Here are some rules of thumb for knowing when you should do this:

  • If a function's per-invocation runtime is less than a millisecond, but that function is called hundreds of thousands of times, it should probably be inlined.

  • If a profile of the program shows thousands of functions, and none of them takes more than 0.1% or so of runtime, then function-call overhead is probably significant in aggregate.

  • If you have "lasagna code," in which there are many layers of abstraction that do hardly any work beyond dispatching to the next layer, and all of these layers are implemented with virtual method calls, then there's a good chance the CPU is wasting a lot of time on indirect-branch pipeline stalls. Unfortunately, the only cure for this is to get rid of some layers, which is often very hard.

zwol
  • 7
    Just beware of expensive stuff done deep in nested loops. I've optimized one function and gotten code that runs 10x as fast. That was after the profiler pointed out the culprit. (It was called over and over, in loops from O(n^3) to a small n O(n^6).) – Loren Pechtel Sep 11 '17 at 04:01
  • "Unfortunately, the only cure for this is to get rid of some layers, which is often very hard." -- this depends very much on your language compiler and/or virtual machine technology. If you can modify the code to make it easier for the compiler to inline (e.g. by using `final` classes and methods where applicable in Java, or non-`virtual` methods in C# or C++) then the indirection can be eliminated by the compiler/runtime and you'll see a gain without massive restructuring. As @JorgWMittag points out above, the JVM can even inline in cases where it's not provable that the optimisation is ... – Jules Sep 12 '17 at 09:45
  • ... valid, so it may well be that it's doing it in your code despite the layering anyway. – Jules Sep 12 '17 at 09:47
  • @Jules While it's true that JIT compilers *can* perform speculative optimization, it doesn't mean that such optimizations *are* applied uniformly. Specifically regarding Java, my experience is that developer culture favours layers piled on top of layers leading to extremely deep call stacks. Anecdotally, that contributes to the sluggish, bloated feel of many Java applications. Such highly layered architecture works against the JIT runtime, regardless whether the layers are technically inlineable. JIT isn't a magic bullet that can automatically fix structural problems. – amon Sep 12 '17 at 11:34
  • @amon My experience with "lasagna code" comes from very large C++ applications with a lot of code dating to the 1990s, when deeply nested object hierarchies and COM were the fashion. C++ compilers go to quite heroic efforts to crush out the abstraction penalties in programs like this, and _still_ you might see them spending a significant fraction of wall-clock runtime on indirect-branch pipeline stalls (and another significant chunk on I-cache misses). – zwol Sep 17 '17 at 15:58
18

I will challenge this quote:

There's almost always a tradeoff between writing readable code and writing performant code.

This is a really misleading statement, and a potentially dangerous attitude. There are some specific cases where you have to make a tradeoff, but in general the two factors are independent.

An example of a necessary tradeoff is when you have a simple algorithm versus a more complex but more performant one. A hashtable implementation is clearly more complex than a linked list implementation, but its lookups will be faster, so you might have to trade simplicity (which is a factor in readability) for performance.
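
As a rough C++ sketch of that tradeoff (containers and names chosen only for illustration):

#include <list>
#include <string>
#include <unordered_map>

// Simpler structure, slower lookup: O(n) scan of a linked list.
bool in_list(const std::list<std::string>& xs, const std::string& key) {
    for (const auto& x : xs)
        if (x == key) return true;
    return false;
}

// More complex structure, faster lookup: average O(1) hash-table probe.
bool in_map(const std::unordered_map<std::string, int>& m, const std::string& key) {
    return m.find(key) != m.end();
}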

Regarding function call overhead, turning a recursive algorithm into an iterative one might have a significant benefit, depending on the algorithm and the language. But this is again a very specific scenario, and in general the overhead of function calls will be negligible or optimized away.
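
A minimal sketch of such a conversion (a toy factorial, purely illustrative):

#include <cstdint>

// Recursive version: one call frame per step; deep inputs can exhaust the stack
// unless the compiler happens to eliminate the calls.
std::uint64_t factorial_recursive(std::uint64_t n) {
    return n <= 1 ? 1 : n * factorial_recursive(n - 1);
}

// Iterative version: same result, no per-step call overhead or stack growth.
std::uint64_t factorial_iterative(std::uint64_t n) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 2; i <= n; ++i) result *= i;
    return result;
}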

(Some dynamic languages like Python do have a significant method-call overhead. But if performance becomes an issue you probably shouldn't be using Python in the first place.)

Most principles for readable code - consistent formatting, meaningful identifier names, appropriate and helpful comments, and so on - have no effect on performance. And some, like using enums rather than strings, also have performance benefits.

JacquesB
6

The function call overhead is unimportant in most cases.

However, the bigger gain from inlining is that it lets the compiler optimize the newly combined code after inlining.

For example, if you call a function with a constant argument, the optimizer can constant-fold that argument where it couldn't before inlining the call. If the argument is a function pointer (or lambda), the optimizer can now inline the calls to that lambda as well.

This is a big reason why virtual functions and function pointers are not attractive: you cannot inline them at all unless the actual function pointer has been constant-folded all the way down to the call site.
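
A rough C++ sketch of the difference (names are illustrative): a callback passed as a template parameter is visible to the optimizer and routinely inlined, while a raw function pointer stays an indirect call unless it gets constant-folded to the call site:

// Callback passed as a template parameter: the concrete lambda type is visible
// at the call site, so compilers routinely inline both apply() and the lambda.
template <typename F>
int apply(int x, F f) { return f(x); }

// Callback passed as a raw function pointer: unless the optimizer can prove
// which function the pointer holds, the call stays indirect and un-inlined.
int apply_ptr(int x, int (*f)(int)) { return f(x); }

int twice(int x) { return 2 * x; }

int demo() {
    int a = apply(21, [](int v) { return 2 * v; }); // usually fully inlined
    int b = apply_ptr(21, &twice);                  // inlined only if &twice is constant-folded here
    return a + b;
}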

ratchet freak
5

Assuming performance does matter for your program, and it indeed has lots and lots of calls, the cost still may or may not matter depending on the type of call it is.

If the called function is small, and the compiler is able to inline it, then the cost will be essentially zero. Modern compilers and language implementations have JITs, link-time optimization, and/or module systems designed to maximize the ability to inline functions when it's beneficial.

OTOH, there is a non-obvious cost to function calls: their mere existence may inhibit compiler optimizations before and after the call.

If the compiler can't reason about what the called function does (e.g. it's a virtual/dynamic dispatch or a function in a dynamic library), then it may have to pessimistically assume that the function could have any side effect: throw an exception, modify global state, or change any memory seen through pointers. The compiler may have to save temporary values back to memory and re-read them after the call. It won't be able to reorder instructions around the call, so it may be unable to vectorize loops or hoist redundant computation out of loops.

For example, if you needlessly call a function in each loop iteration:

for(int i=0; i < /* gasp! */ strlen(s); i++) x ^= s[i];

The compiler may know it's a pure function and move it out of the loop (in a terrible case like this example, that even fixes an accidental O(n^2) algorithm to be O(n)):

for(int i=0, end=strlen(s); i < end; i++) x ^= s[i];

And then maybe even rewrite the loop to process 4/8/16 elements at a time using wide/SIMD instructions.

But if you add a call to some opaque code in the loop, even if the call does nothing and is super cheap itself, the compiler has to assume the worst: that the call will access a global variable that points to the same memory as s and change its contents (even if s is const in your function, it can be non-const anywhere else), making the optimization impossible:

for(int i=0; i < strlen(s); i++) {
    x ^= s[i];
    do_nothing();
}
Kornel
3

This old paper might answer your question:

Guy Lewis Steele, Jr.. "Debunking the 'Expensive Procedure Call' Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO". MIT AI Lab. AI Lab Memo AIM-443. October 1977.

Abstract:

Folklore states that GOTO statements are "cheap", while procedure calls are "expensive". This myth is largely a result of poorly designed language implementations. The historical growth of this myth is considered. Both theoretical ideas and an existing implementation are discussed which debunk this myth. It is shown that the unrestricted use of procedure calls permits great stylistic freedom. In particular, any flowchart can be written as a "structured" program without introducing extra variables. The difficulty with the GOTO statement and the procedure call is characterized as a conflict between abstract programming concepts and concrete language constructs.

Alex Vong
  • 12
    I highly doubt a paper that old will answer the question of whether "function call costs still matter in *modern* compilers". – Cody Gray - on strike Sep 09 '17 at 16:58
  • 7
    @CodyGray I think compiler technology should have advanced since 1977. So if function calls can be made cheap in 1977, we should be able to do it now. So the answer is no. Of course, this assumes you are using a decent language implementation that can do stuff like function inlining. – Alex Vong Sep 09 '17 at 18:39
  • 4
    @AlexVong Relying on 1977 compiler optimizations is like relying on commodity prices trends in the stone age. Everything has changed too much. For example, multiplication used to be replaced by memory access as a cheaper operation. Currently, it's more expensive by a huge factor. Virtual method calls are relatively *much more expensive than they used to be* (memory access and branch mispredictions), but oftentimes they can be optimized away and the virtual method call can be even inlined (Java does it all the time), so the cost is exactly zero. There was nothing like this in 1977. – maaartinus Sep 10 '17 at 02:13
  • 1
    the goto is a small cost. most of the cost is pushing arguments onto the stack and getting the return result. and the cost of a function call should be compared to inlining the code of the function. i don't think that changes much with modern compilers. – robert bristow-johnson Sep 10 '17 at 04:48
  • 1
    The paper cited points out that using a subroutine CALL instruction is actually a peephole optimization for two instructions "push return address; GOTO subroutine". If the return address is not needed (tail recursion, tail call), the peephole optimizer eliminates the "push return address" instead, and just generates the GOTO. Arguments need not be pushed onto the stack; all modern compilers support passing arguments in registers. (Of course, if you insist on using a brain-dead x86 processor with no registers, you deserve what you get.) – John R. Strohm Sep 10 '17 at 07:40
  • 3
    As others have pointed out, it's not just changes in compiler technology that have invalidated old research. If compilers had continued to improve while microarchitectures had remained largely unchanged, then the paper's conclusions would still be valid. But that's not happened. If anything, microarchitectures have changed more than compilers. Things that used to be fast are now slow, relatively speaking. – Cody Gray - on strike Sep 10 '17 at 11:59
  • 1
    Thanks for all the information. I find the status quo unsatisfying. Before this conversation, if someone had asked me the OP's question, I would have said this is a solved problem and you can use as many functions as you please. Now I would have to say it does matter, but you should benchmark before doing anything. In my opinion, the function is such a fundamental abstraction that I expect no performance penalty for using it. Compilers often have a benchmarking suite, so that they do not accept changes that slow down the compiler. They consider it a regression... – Alex Vong Sep 11 '17 at 11:14
    ... Maybe microarchitecture designers should have something comparable, so that they reject ideas that penalize the use of function calls. – Alex Vong Sep 11 '17 at 11:22
  • 2
    @AlexVong To be more precise on the CPU changes that make that paper obsolete: Back in 1977, a main memory access was a single CPU cycle. Today, even a simple access of the L1(!) cache has a latency of 3 to 4 cycles. Now, function calls are quite heavy in memory accesses (creation of stack frame, saving of return address, saving of registers for local variables), which easily drives the costs of a single function call to 20 and more cycles. If your function only rearranges its arguments, and perhaps adds another constant argument to pass to a call-through, then that's almost 100% overhead. – cmaster - reinstate monica Sep 11 '17 at 12:54
3
  • In C++, beware of designing function calls that copy arguments; the default is "pass by value". The function call overhead due to saving registers and other stack-frame related work can be overwhelmed by an unintended (and potentially very expensive) copy of an object (see the sketch after this list).

  • There are stack-frame related optimizations that you should investigate before giving up on highly factored code.

  • Most of the time when I have had to deal with a slow program, I found making algorithmic changes yielded far greater speedups than inlining function calls. For example: another engineer redid a parser that filled a map-of-maps structure. As part of that he removed a cached index from one map to a logically associated one. That was a nice code-robustness move; however, it made the program unusable due to a factor-of-100 slowdown caused by performing a hash lookup for all future accesses instead of using the stored index. Profiling showed that most of the time was spent in the hashing function.
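
A rough sketch of the first point (types and names chosen only for illustration):

#include <cstddef>
#include <string>
#include <vector>

// Pass by value: every call copies the whole vector (allocation plus element copies),
// which can dwarf the cost of the call itself.
std::size_t count_empty_by_value(std::vector<std::string> items) {
    std::size_t n = 0;
    for (const auto& s : items)
        if (s.empty()) ++n;
    return n;
}

// Pass by const reference: no copy, same observable behavior for read-only use.
std::size_t count_empty_by_ref(const std::vector<std::string>& items) {
    std::size_t n = 0;
    for (const auto& s : items)
        if (s.empty()) ++n;
    return n;
}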

  • 4
    The first advice is a bit old. Since C++11, _moving_ has been possible. In particular, for functions that need to modify their arguments internally, taking an argument by value and modifying it in-place can be the most efficient choice. – MSalters Sep 11 '17 at 06:45
  • @MSalters: I think you mistook "in particular" with "furthermore" or something. The decision to pass copies or references was there before C++11 (tho I know you know it). – phresnel Sep 12 '17 at 10:36
  • @phresnel: I think I got it right. The particular case I am referring to is the case where you create a temporary in the caller, _move_ it to an argument, and then modify it in the callee. This wasn't possible before C++11, as C++03 can't/won't bind a non-const reference to a temporary.. – MSalters Sep 12 '17 at 12:50
  • @MSalters: Then I have misunderstood your comment upon first reading it. It seemed to me you were implying that before C++11, passing by value was not something one would do if one would want to modify the passed value. – phresnel Sep 12 '17 at 12:54
  • The advent of 'moving' helps most significantly in the return of objects which are more conveniently constructed in the function than outside and passed in by reference. Prior to that, returning an object from a function invoked a copy, often an expensive one. That does not deal with function arguments. I carefully put the word "designing" into the comment, as one must explicitly give the compiler permission to 'move' into function arguments (&& syntax). I've gotten into the habit of 'deleting' copy constructors to identify places where doing so is valuable. – user2543191 Sep 12 '17 at 14:05
3

Remember that the cost of a method call can be significant, depending on the language. There's almost always a tradeoff between writing readable code and writing performant code.

This is, unfortunately, highly dependent on:

  • the compiler toolchain, including the JIT if any,
  • the domain.

First of all, the first law of performance optimization is profile first. There are many domains where the performance of the software part is irrelevant compared to the performance of the whole stack: database calls, network operations, OS operations, ...

This does not mean that the performance of the software is completely irrelevant: even if it does not improve latency, optimizing the software may result in energy savings and hardware savings (or battery savings for mobile apps), which can matter.

However, those can typically NOT be eye-balled, and oftentimes algorithmic improvements trump micro-optimizations by a large margin.

So, before optimizing, you need to understand what you are optimizing for... and whether it is worth it.


Now, with regard to pure software performance, it varies greatly between toolchains.

There are two costs to a function call:

  • the run time cost,
  • the compile time cost.

The run-time cost is rather obvious: in order to perform a function call, a certain amount of work is necessary. Using C on x86 for example, a function call requires (1) spilling registers to the stack, (2) placing arguments in registers or on the stack and performing the call, and afterward (3) restoring the registers from the stack. See this summary of calling conventions for the work involved.

This register spilling/restoration takes a non-trivial amount of time (dozens of CPU cycles).

It is generally expected that this cost will be trivial compared to the actual cost of executing the function; however, some patterns are counter-productive here: getters, functions guarded by a simple condition, etc.

Apart from interpreters, a programmer will therefore hope that their compiler or JIT will optimize out the function calls that are unnecessary, although this hope may sometimes not bear fruit, because optimizers are not magic.

An optimizer may detect that a function call is trivial and inline the call: essentially, copy/pasting the body of the function at the call site. This is not always a good optimization (it may induce bloat), but in general it is worthwhile because inlining exposes context, and the context enables more optimizations.

A typical example is:

void func(bool condition) {
    if (condition) {
        doLotsOfWork();
    }
}

void call() { func(false); }

If func is inlined, then the optimizer will realize that the branch is never taken, and optimize call to void call() {}.

In that sense, function calls, by hiding information from the optimizer (if not yet inlined), may inhibit certain optimizations. Virtual function calls are especially guilty of this, because devirtualization (proving which function ultimately gets called at run time) is not always easy.
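
For example, in C++ marking a class (or method) final gives the optimizer the proof it needs to devirtualize, and then inline, the call (a small illustrative sketch):

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// 'final' guarantees no further override can exist, so a call made through a
// Square (or through a Shape* proven to point at a Square) can be devirtualized
// and then inlined like any ordinary function.
struct Square final : Shape {
    double side = 1.0;
    double area() const override { return side * side; }
};

double area_of(const Square& s) {
    return s.area();   // static type is final: direct call, eligible for inlining
}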


In conclusion, my advice is to write clearly first, avoiding premature algorithmic pessimization (cubic complexity or worse bites quickly), and then only optimize what needs optimizing.

Matthieu M.
2

Yes, a missed branch prediction is more costly on modern hardware than it was decades ago, but compilers have gotten a lot smarter at optimizing this.

As an example, consider Java. At first glance, function call overhead should be particularly dominant in this language:

  • tiny functions are widespread due to the JavaBean convention
  • functions default to virtual, and usually are
  • the unit of compilation is the class; the runtime supports loading new classes at any time, including subclasses that override previously monomorphic methods

Horrified by these practices, the average C programmer would predict that Java must be at least one order of magnitude slower than C. And 20 years ago he would have been right. Modern benchmarks however place idiomatic Java code within a few percent of the equivalent C code. How is that possible?

One reason is that modern JVMs inline function calls as a matter of course. They do so using speculative inlining:

  1. Freshly loaded code executes without optimization. During this stage, for every call site, the JVM keeps track of which methods were actually invoked.
  2. Once code has been identified as a performance hotspot, the runtime uses these statistics to identify the most probable execution path, and inlines that one, prefixing it with a conditional branch in case the speculative optimization does not apply.

That is, the code:

int x = point.getX();

gets rewritten to

if (point.class != Point) GOTO interpreter;
x = point.x;

And of course the runtime is smart enough to hoist this type check out of loops as long as point is not reassigned, or to elide it entirely if the type is known to the calling code.

In summary, if even Java manages automatic method inlining, there is no inherent reason why a compiler could not support automatic inlining, and every reason to do so, because inlining is highly beneficial on modern processors. I can therefore hardly imagine any modern mainstream compiler ignorant of this most basic of optimization strategies, and would presume a compiler capable of this unless proven otherwise.

meriton
  • 4
    “there is no inherent reason why a compiler could not support automatic inlining” – there is. You've talked about JIT compilation, which amounts to self-modifying code (which an OS may prevent because security) and the ability to do automatic profile-guided full-program optimization. An AOT compiler for a language that allows dynamic linking doesn't know enough to devirtualize and inline any call. OTOH: an AOT compiler has time to optimize everything that it can, a JIT compiler only has time to focus on cheap optimizations in hot spots. In most cases, that leaves JIT at a slight disadvantage. – amon Sep 10 '17 at 19:41
  • 2
    Tell me one OS that prevents running Google Chrome "because security" (V8 compiles JavaScript to native code at runtime). Also, wanting to inline AOT is not quite an inherent reason (it is not determined by the language, but the architecture you choose for your compiler), and while dynamic linking does inhibit AOT inlining across compilation units, it does not inhibit inlining *within* compilation units, where most calls take place. In fact, useful inlining is arguably easier in a language that uses dynamic linking less excessively than Java. – meriton Sep 10 '17 at 23:39
  • 4
    Notably, iOS prevents JIT for non-privileged apps. Chrome or Firefox have to use the Apple-provided web view instead of their own engines. Good point though that AOT vs. JIT is an implementation-level, not a language-level choice. – amon Sep 11 '17 at 06:06
  • @meriton Windows 10 S and video game console operating systems also tend to block third-party JIT engines. – Damian Yerrick Sep 11 '17 at 19:49
2

As others say, you should measure your program's performance first, and will probably find no difference in practice.

Still, from a conceptual level I thought I'd clear up a few things that are conflated in your question. Firstly, you ask:

Do function call costs still matter in modern compilers?

Notice the key words "function" and "compilers". Your quote is subtly different:

Remember that the cost of a method call can be significant, depending on the language.

This is talking about methods, in the object oriented sense.

Whilst "function" and "method" are often used interchangably, there are differences when it comes to their cost (which you're asking about) and when it comes to compilation (which is the context you gave).

In particular, we need to know about static dispatch vs dynamic dispatch. I'll ignore optimisations for the moment.

In a language like C, we usually call functions with static dispatch. For example:

int foo(int x) {
  return x + 1;
}

int bar(int y) {
  return foo(y);
}

int main() {
  return bar(42);
}

When the compiler sees the call foo(y), it knows what function that foo name is referring to, so the output program can jump straight to the foo function, which is quite cheap. That's what static dispatch means.

The alternative is dynamic dispatch, where the compiler doesn't know which function is being called. As an example, here's some Haskell code (since the C equivalent would be messy!):

foo x = x + 1

bar f x = f x

main = print (bar foo 42)

Here the bar function is calling its argument f, which could be anything. Hence the compiler can't just compile bar to a fast jump instruction, because it doesn't know where to jump to. Instead, the code we generate for bar will dereference f to find out which function it's pointing to, then jump to it. That's what dynamic dispatch means.

Both of those examples are for functions. You mentioned methods, which can be thought of as a particular style of dynamically-dispatched function. For example, here's some Python:

class A:
  def __init__(self, x):
    self.x = x

  def foo(self):
    return self.x + 1

def bar(y):
  return y.foo()

z = A(42)
bar(z)

The y.foo() call uses dynamic dispatch, since it's looking up the value of the foo property in the y object, and calling whatever it finds; it doesn't know that y will have class A, or that the A class contains a foo method, so we can't just jump straight to it.

OK, that's the basic idea. Note that, all else being equal, static dispatch is faster than dynamic dispatch regardless of whether we compile or interpret: the dereferencing incurs an extra cost either way.

So how does this affect modern, optimising compilers?

The first thing to note is that static dispatch can be optimised more heavily: when we know which function we're jumping to, we can do things like inlining. With dynamic dispatch, we don't know where we're jumping until run time, so there's not much optimisation we can do.

Secondly, it's possible in some languages to infer where some dynamic dispatches will end up jumping to, and hence optimise them into static dispatch. This lets us perform other optimisations like inlining, etc.

In the above Python example such inference is pretty hopeless, since Python allows other code to override classes and properties, so it's difficult to infer much that will hold in all cases.

If our language lets us impose more restrictions, for example by limiting y to class A using an annotation, then we could use that information to infer the target function. In languages with subclassing (which is almost all languages with classes!) that's actually not enough, since y may actually have a different (sub)class, so we'd need extra information like Java's final annotations to know exactly which function will get called.

Haskell isn't an OO language, but we can infer the value of f by inlining bar (which is statically dispatched) into main, substituting foo for f. Since the target of foo in main is statically known, the call becomes statically dispatched, and will probably get inlined and optimised away completely (since these functions are small, the compiler is more likely to inline them; although we can't count on that in general).

Hence the cost comes down to:

  • Does the language dispatch your call statically or dynamically?
  • If it's the latter, does the language allow the implementation to infer the target using other information (e.g. types, classes, annotations, inlining, etc.)?
  • How aggressively can static dispatch (inferred or otherwise) be optimised?

If you're using a "very dynamic" language, with lots of dynamic dispatch and few guarantees available to the compiler, then every call will incur a cost. If you're using a "very static" language, then a mature compiler will produce very fast code. If you're in between, then it can depend on your coding style and how smart the implementation is.

Warbo
  • 1
    I disagree that calling a closure (or some function *pointer*) -as your Haskell example- is dynamic dispatch. [dynamic dispatch](https://en.wikipedia.org/wiki/Dynamic_dispatch) involves some *computation* (e.g. using some [vtable](https://en.wikipedia.org/wiki/Virtual_method_table)) to get that closure, so is more costly than indirect calls. Otherwise, nice answer. – Basile Starynkevitch Dec 01 '17 at 11:46
1

"Remember that the cost of a method call can be significant, depending on the language. There's almost always a tradeoff between writing readable code and writing performant code."

Under what conditions is this quoted statement still valid nowadays given the rich industry of performant modern compilers?

I'm just gonna flat out say never. I believe the quote to be reckless to just throw out there.

Of course I am not speaking the complete truth, but I don't care about being truthful that much. It's like in that Matrix movie -- I forget whether it was 1, 2, or 3 (I didn't really like any but the first one) -- when the Oracle told Keanu Reeves, "I just told you what you needed to hear," or something to that effect. That's what I want to do now.

Programmers don't need to hear this. If they're experienced, with a profiler in their hand, and the quote is somewhat applicable to their compilers, they'll already know it, and they'll have learned it the proper way: by measuring, and by understanding their profiling output and why certain leaf calls are hotspots. If they're not experienced and have never profiled their code, this is the last thing they need to hear: that they should start superstitiously compromising how they write code, down to the point of inlining everything, before even identifying hotspots, in hopes that it'll become more performant.

Anyway, for a more accurate response: it depends. Some of the boatload of conditions are already listed among the fine answers. The possible conditions for just one language are already huge themselves; C++ alone would have to get into dynamic dispatch in virtual calls, when it can be optimized away, and under which compilers and even linkers, and that already warrants a detailed response, let alone trying to tackle the conditions of every possible language and compiler out there. But I'll add on top, "who cares?", because even working in performance-critical areas such as raytracing, the last thing I'll ever start doing up front is hand-inlining methods before I have any measurements.

I do believe some people get overzealous about suggesting you should never do any micro-optimizations prior to measuring. If optimizing for locality of reference counts as a micro-optimization, then I often do begin applying such optimizations right at the beginning, with a data-oriented design mindset, in areas I know for certain will be performance-critical (e.g., raytracing code), because otherwise I know I'll have to rewrite big sections soon, having worked in these domains for years. Optimizing data representation for cache hits can often yield the same kind of performance improvement as an algorithmic improvement, unless we're talking something like quadratic time to linear.
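
To make the locality-of-reference point concrete, here is a rough, illustrative C++ sketch (field names invented): an array-of-structs layout drags cold fields through the cache on every iteration, while a struct-of-arrays layout streams only the data the hot loop touches:

#include <vector>

// Array of structs: iterating over x also pulls y, z and the cold payload into cache.
struct ParticleAoS {
    float x, y, z;
    float payload[16];   // cold data the hot loop never reads
};

float sum_x_aos(const std::vector<ParticleAoS>& ps) {
    float s = 0.0f;
    for (const auto& p : ps) s += p.x;
    return s;
}

// Struct of arrays: the hot loop streams a dense array of x values only.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> payload;   // cold data lives elsewhere
};

float sum_x_soa(const ParticlesSoA& ps) {
    float s = 0.0f;
    for (float v : ps.x) s += v;
    return s;
}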

But I never, ever see a good reason to start inlining before measurements, especially since profilers are decent at revealing what might benefit from inlining, but not at revealing what might benefit from not being inlined (and not inlining can actually make code faster if the uninlined function call is a rare case, improving locality of reference in the icache for hot code and sometimes even allowing optimizers to do a better job for the common-case path of execution).