33

Why not have the compiler take a program like this:

function a(b) { return b^2 };
function c(b) { return a(b) + 5 };

and convert it into a program like this:

function c(b) { return b^2 + 5 };

thereby eliminating the computer's need to remember c(b)'s return address?

I suppose the increased hard disk space and RAM needed to store the program and support its compilation (respectively) is the reason why we use call stacks. Is that correct?

Rhymoid
moonman239
  • 30
    See what happens if you do this on a program with any meaningful size. In particular, functions are called from more than one place. – user253751 Jun 12 '15 at 12:22
  • 10
    Also, sometimes the compiler doesn't know which function is called! Silly example: `window[prompt("Enter function name","")]()` – user253751 Jun 12 '15 at 12:22
  • 27
    How do you implement `function a(b) { if(b>0) return a(b-1); }` without a stack? – pjc50 Jun 12 '15 at 13:45
  • 1
    Not all programs *do* use call stacks. GLSL assumes the conversion on a language level (implementations might not actually apply it); `romcc` is a compiler that compiles C this way. Looking into their limitations will show you why most do. – Alex Celeste Jun 12 '15 at 16:37
  • 8
    Where is the relation to functional programming? – mastov Jun 12 '15 at 18:50
  • 14
    @pjc50: it's tail recursive, so the compiler translates it to a loop with a mutable `b`. But point taken, not all recursive functions can eliminate the recursion, and even when the function can in principle, the compiler might not be smart enough to do so. – Steve Jessop Jun 12 '15 at 19:29
  • 1
    Can someone apply some better tags to this? The only tag on the question, "functional-programming", is clearly not appropriate. – hobbs Jun 13 '15 at 09:24
  • 1
    @smci the edit to the question title should be reversed IMO, the original title was much better and reflected that this was going to be a great question where I could learn some stuff, while the edited title almost leads on to one answer. I would never even have clicked on the title as it stands now. – Pranab Jun 14 '15 at 06:27
  • @Pranab: the original title wording did not correspond to what was being asked. The OP only gave two toy examples which could trivially be inlined; they did not ask about the full generality of ***"Can we always eliminate function calls? and assuming yes, would we still need call stacks?"*** And the original tags were wrong and mismatched the question; I improved them. If you can improve the title, go ahead. I give up – smci Jun 14 '15 at 23:28
  • @DavorŽdralo: I think that's wrong, but I already invited you all to edit/revert it yourselves, not complain at me. The OP originally asked "Why do we use call stacks (at all)?" and provided two toy examples of functions that can trivially be inlined, as if all functions were like that. And the OP tagged it [tag:functional-programming]. But many functions are either a) non-trivial, b) recursive, c) callers of other functions, possibly unpredictably or recursively, or d) (outside functional programming) stateful. The OP seems to be mixing several different things. Go ahead and edit the question. – smci Jun 15 '15 at 21:50

6 Answers

76

This is called "inlining" and many compilers do this as an optimization strategy in cases where it makes sense.

In your particular example, this optimization would save both space and execution time. But if the function were called from multiple places in the program (not uncommon!), inlining it everywhere would increase code size, so the strategy becomes more dubious. (And of course, if a function calls itself directly or indirectly, it is impossible to inline it completely, since the code would then become infinite in size.)

And obviously it is only possible for "private" functions. Functions which are exposed for external callers cannot be optimized away, at least not in languages with dynamic linking.
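
For concreteness, here is a minimal C++ rendering of the question's example (reading `b^2` as squaring); whether the optimizer actually inlines it depends on its own heuristics:

// A sketch of the question's two functions. `a` has internal linkage
// ("private" to this file), so the compiler is free to inline it and
// then drop it entirely.
static int a(int b) { return b * b; }

int c(int b) { return a(b) + 5; }

// With optimization enabled, a compiler will typically emit `c` as if
// it had been written:
//
//     int c(int b) { return b * b + 5; }
//
// i.e. no call, no return address, no stack frame for `a`.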

mcottle
JacquesB
  • C and C++ allow public functions and methods to be inlined by declaring them in the header included by callers. Whether or not this is a good thing to do depends on your willingness to trade having to recompile everything (vs. just installing new dynamic libraries) for the benefits of the extra optimization. Implementations of languages that do JIT compilation could, in theory, do it as well. – Blrfl Jun 12 '15 at 10:32
  • 7
    @Blrfl: Modern compilers actually don't need definitions in the header anymore; they can inline across Translation Units. This does require a decent linker, though. Definitions in header files are a workaround for dumb linkers. – MSalters Jun 12 '15 at 11:21
  • 3
    "Functions which are exposed for external callers cannot be optimized away" - the function has to exist, but any given call site to it (either in your own code, or if they have the source, the external callers') can be inlined. – Random832 Jun 12 '15 at 12:04
  • 1
    @MSalters: Code structured to depend on LTO sacrifices inlining entirely on toolchains that don't support it. In situations where 100% of the code will be built in an LTO-enabled world 100% of the time, then by all means go for it. There is lots of code in the world where that guarantee can't be made, and putting inline-able functions in headers allows essentially-identical optimization on the largest possible number of toolchains in a standards-compliant way. – Blrfl Jun 12 '15 at 12:28
  • 14
    Wow, 28 upvotes for an answer that doesn't even mention the reason why inlining everything is impossible: Recursion. – mastov Jun 12 '15 at 18:45
  • @MSalters: LTO completely precludes any meaningful form of dynamic linking - you can't have shareable code pages, or reasonable program start times, if you're deferring final code generation to load time. – R.. GitHub STOP HELPING ICE Jun 12 '15 at 22:44
  • 3
    @R.. : LTO is LINK time Optimization, not LOAD Time Optimization. – MSalters Jun 13 '15 at 01:07
  • @MSalters: Well they do call it a dynamic *linker*. But what I meant was that if you want to do these kinds of optimizations, you have to give up dynamic linking as we know it, or at least avoid dynamic linking the parts of your program where the optimizations need to happen. – R.. GitHub STOP HELPING ICE Jun 13 '15 at 03:47
  • 1
    @mastov: Why would recursion prevent inlining? Modern compilers actually do inline recursive calls. – MSalters Jun 13 '15 at 23:11
  • 1
    @MSalters: You can inline some recursive calls, but doing so gives you more recursive calls to inline. You can't inline *all* the calls, so pure inlining can't eliminate the need for a call stack. – user2357112 Jun 14 '15 at 06:03
  • You can make all self-recursion iterative if you're okay with an explicit stack. Recursion involving multiple functions is where it gets hard. – user253751 Jun 14 '15 at 08:45
  • 2
    @immibis: But if that explicit stack is introduced by the compiler, then that stack *is* the call stack. – user2357112 Jun 14 '15 at 09:59
  • @user2357112: Not all stacks are call stacks. In particular, a **call** stack contains stack frames for every non-inlined function that's been **called** but hasn't returned yet. – MSalters Jun 14 '15 at 14:16
  • @mastov: Good point – JacquesB Jun 15 '15 at 08:54
51

There are two parts to your question: Why have multiple functions at all (instead of replacing function calls with their definition) and why implement those functions with call stacks instead of statically allocating their data somewhere else?

The first reason is recursion. Not just the "oh, let's make a new function call for every single item in this list" kind, but also the modest kind, where you have maybe two calls of a function active at the same time, with many other functions in between them. You need to put local variables on a stack to support this, and you can't inline recursive functions in general.
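
As a minimal sketch of that modest kind (hypothetical names): two activations of the same function are alive at once, and each needs its own copy of the local variable, which is exactly what a stack frame provides:

#include <iostream>

// Two activations of depth_sum exist at the same time; the outer
// activation's `local` must survive the inner call. A single statically
// allocated slot per function could not provide this.
int depth_sum(int depth) {
    int local = depth * 10;               // per-activation state
    if (depth >= 2) return local;
    return local + depth_sum(depth + 1);  // inner activation gets its own `local`
}

int main() {
    std::cout << depth_sum(1) << '\n';    // prints 30 (10 + 20)
}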

Then there's a problem for libraries: you don't know which functions will be called from where and how often, so a "library" could never really be compiled, only shipped to all clients in some convenient high-level format that is then inlined into the application. Aside from other problems with this, you completely lose dynamic linking with all its advantages.

Additionally, there are many reasons to not inline functions even when you could:

  1. It's not necessarily faster. Setting up the stack frame and tearing it down is maybe a dozen single-cycle instructions; for many large or looping functions, that's not even 0.1% of their execution time.
  2. It may be slower. Duplicating code has costs, e.g., it puts more pressure on the instruction cache.
  3. Some functions are very large and called from many places; inlining them everywhere would increase the binary size far beyond what's reasonable.
  4. Compilers often have a hard time with very large functions. All else being equal, compiling a function of size 2*N takes more than twice as long as compiling a function of size N.
  • 1
    I'm surprised by point 4. What is the reason for this? – JacquesB Jun 12 '15 at 08:46
  • 12
    @JacquesB Many optimization algorithms are quadratic, cubic, or even technically NP-complete. The canonical example is register allocation, which is NP-complete by analogy with graph coloring. (Usually compilers don't attempt an exact solution, but only a couple of very poor heuristics run in linear time.) Many simple, one-pass optimizations need superlinear analysis passes first, such as everything that depends on dominance in control flows (generally n log n time with n basic blocks). –  Jun 12 '15 at 09:07
  • 2
    "You really have two questions here" No, I don't. Just one - why not treat a function call as just a placeholder that the compiler might, say, replace with the called function's code? – moonman239 Jun 12 '15 at 18:47
  • 4
    @moonman239 Then your wording threw me off. Still, your question *can* be decomposed as I do in my answer and I think that's a useful perspective. –  Jun 12 '15 at 19:28
17

Stacks allow us to elegantly bypass the limits imposed by the finite number of registers.

Imagine having exactly 26 globals, "registers" a-z (or even having only the 7 byte-sized registers of the 8080 chip), and every function you write in this app shares this flat list.

A naive start would be to allocate the first few registers to the first function, and knowing that it took only 3, start with "d" for the second function... You run out quickly.

Instead, if you have a metaphorical tape, like the Turing machine's, you could have each function start a "call another function" by saving all the variables it's using and forward()-ing the tape; the callee can then muddle with as many registers as it wants. When the callee is finished, it returns control to the parent function, which knows where to snag the callee's output as needed, and then plays the tape backwards to restore its state.
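
A toy sketch of that metaphor (hypothetical names; a real compiler emits push/pop instructions and a stack pointer rather than a vector, but the bookkeeping is the same):

#include <iostream>
#include <vector>

int r0;                        // one of the shared "registers"
std::vector<int> tape;         // the metaphorical tape (really: a stack)

int square_plus_five() {       // callee: free to clobber r0
    r0 = r0 * r0;
    return r0 + 5;
}

int caller() {
    r0 = 7;                    // caller's live value
    tape.push_back(r0);        // save it before handing the registers over
    r0 = 3;                    // set up the callee's input
    int result = square_plus_five();
    r0 = tape.back();          // "play the tape backwards": restore
    tape.pop_back();
    return r0 + result;        // 7 + 14 = 21
}

int main() { std::cout << caller() << '\n'; }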

Your basic call frame is just that kind of save/restore record; frames are created and dropped by standardized machine-code sequences the compiler puts in around the transitions from one function to another. (It's been a long time since I had to remember my C stack frames, but you can read up on the various ways the duties of who pushes and drops what are divided at X86_calling_conventions.)

(Recursion is awesome, but if you'd ever had to juggle registers without a stack, you'd really appreciate stacks.)


I suppose the increased hard disk space and RAM needed to store the program and support its compilation (respectively) is the reason why we use call stacks. Is that correct?

While we can inline more these days ("more speed" is always good; "fewer KB of assembly" means very little in a world of video streams), the main limitation is the compiler's ability to flatten across certain types of code patterns.

For example, polymorphic objects -- if you don't know the one and only type of object you'll be handed, you can't flatten; you have to look at the object's vtable of features and call through that pointer... trivial to do at runtime, impossible to inline at compile time.

A modern toolchain can happily inline a polymorphically-defined function when it has flattened enough of the caller(s) to know exactly which flavor of object it is dealing with:

class Base {
    public: virtual void act() = 0;
};
class Child1: public Base {
    public: void act() override {}
};
void ActOn(Base* something) {
    something->act();
}
void InlineMe() {
    Child1 thingamabob;
    ActOn(&thingamabob);
}

In the above, the compiler can choose to keep right on statically inlining, from InlineMe through whatever's inside act(), with no need to touch any vtables at runtime.

But any uncertainty about the flavor of the object will leave it as a call to a discrete function, even if some other invocations of the same function are inlined.

xander
11

Cases which that approach cannot handle:

function fib(a) { if(a>2) return fib(a-1)+fib(a-2); else return 1; }

function many(a) { for (i = 1 to a) { b(i); } }

There are languages and platforms with limited or no call stacks. PIC microcontrollers have a hardware call stack limited to between 2 and 32 entries, which creates design constraints.

COBOL bans recursion: https://stackoverflow.com/questions/27806812/in-cobol-is-it-possible-to-recursively-call-a-paragraph

Imposing a ban on recursion does mean that you can represent the entire call graph of the program statically as a DAG. Your compiler could then emit one copy of a function for each place from which it's called, with a fixed jump back instead of a return. No stack required, just more program space, potentially quite a lot of it for complex systems. But for small embedded systems this means you can guarantee not to have a stack overflow at runtime, which would be bad news for your nuclear reactor / jet turbine / car throttle control, etc.

pjc50
  • 12
    Your first example is basic recursion, and you're correct there. But your second example seems to be a for loop calling another function. In-lining the function is different than unrolling a loop; the function can be in-lined without unrolling the loop. Or have I missed some subtle detail? – jpmc26 Jun 13 '15 at 06:50
  • 1
    If your first example is meant to define the Fibonacci series, it is wrong. (It is missing a `fib` call.) – Paŭlo Ebermann Jun 14 '15 at 07:17
  • 1
    While forbidding recursion does mean that the entire call graph can be represented as a DAG, that doesn't mean that one could list out the full list of nested call sequences in a reasonable amount of space. On one project of mine for a microcontroller with 128KB of code space, I made the mistake of asking for a call graph which included all functions that could affect maximum parameter-RAM requirement and that call graph was over a gig. A complete call graph would have been even longer, and that was for a program that fit in 128K of code space. – supercat Jun 21 '15 at 20:03
8

You want function inlining, and most (optimizing) compilers are doing that.

Notice that inlining requires the called function to be known (and it is effective only if that called function is not too big), since conceptually it substitutes the call with a rewrite of the called function's body. So you generally cannot inline an unknown function (e.g. one reached through a function pointer, which includes functions from dynamically linked shared libraries, and is perhaps visible as a virtual method in some vtable; some compilers might still optimize these through devirtualization techniques). And it is not always possible to inline recursive functions (though some clever compilers might use partial evaluation and in some cases manage to inline recursive calls).

Notice also that inlining, even when it is easily possible, is not always effective: you (actually your compiler) could increase the code size so much that CPU caches (or the branch predictor) would work less efficiently, and that would make your program run slower.

I am focusing a bit on functional programming style, since you tagged your question as such.

Notice that you don't need to have any call stack (at least in the machine sense of the "call stack" expression). You could use only the heap.

So, take a look at continuations and read more about continuation-passing style (CPS) and the CPS transformation (intuitively, you could use continuation closures as reified "call frames" allocated on the heap, sort of mimicking a call stack; then you need an efficient garbage collector).
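
As a tiny illustration (a sketch only, not Appel's actual scheme): factorial in continuation-passing style, where the "return address" lives in a heap-allocated closure rather than in a stack frame. Note that C++ does not guarantee tail-call elimination, so the recursive calls below still consume machine stack; the point is only to show where the bookkeeping moves.

#include <functional>
#include <iostream>

// Continuation-passing style: instead of returning, fact_cps invokes the
// continuation k with its result. The pending work "multiply by n" is
// captured in a closure, playing the role of a reified call frame.
void fact_cps(int n, std::function<void(int)> k) {
    if (n <= 1)
        k(1);
    else
        fact_cps(n - 1, [n, k](int r) { k(n * r); });
}

int main() {
    fact_cps(5, [](int result) { std::cout << result << '\n'; });  // prints 120
}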

Andrew Appel wrote a book, *Compiling with Continuations*, and an old paper, *Garbage Collection Can Be Faster than Stack Allocation*. See also A. Kennedy's ICFP 2007 paper *Compiling with Continuations, Continued*.

I also recommend reading Queinnec's Lisp In Small Pieces book, which has several chapters related to continuation & compilation.

Notice also that some languages (e.g. Brainfuck) or abstract machines (e.g. OISC, RAM) don't have any calling facilities but are still Turing-complete, so you don't (in theory) even need any function call mechanism, even if it is extremely convenient. BTW, some old instruction set architectures (e.g. IBM/370) don't even have a hardware call stack or a call instruction that pushes onto one (the IBM/370 had only a Branch and Link machine instruction).

Finally, if your entire program (including all the needed libraries) does not have any recursion, you could store the return address (and the "local" variables, which then effectively become static) of each function in static locations. Old Fortran 77 compilers did that in the early 1980s (so the compiled programs did not use any call stack at that time).

Basile Starynkevitch
  • 2
    It's very debatable whether CPS has no "call stack". It's not on *the stack*, the mystical region of ordinary RAM that has a little bit of hardware support through `%esp` etc., but it still keeps the equivalent bookkeeping on an aptly named spaghetti stack in another region of RAM. The return address, in particular, is essentially encoded in the continuation. And of course continuations are not faster (and it seems to me this is what the OP was getting at) than *making no calls at all* via inlining. –  Jun 12 '15 at 09:19
  • Appel's old papers claimed (and demonstrated with benchmarking) that CPS can be as fast as having a call stack. – Basile Starynkevitch Jun 12 '15 at 09:22
  • I am skeptical of that but regardless that's not what I claimed. –  Jun 12 '15 at 10:55
  • 1
    Actually, this was on a late-1980s-era MIPS workstation. Probably the cache hierarchy on current PCs would make the performance slightly different. There have been several papers analyzing Appel's claims (and indeed, on current machines, stack allocation might be *slightly* faster - by a few percent - than carefully crafted garbage collection) – Basile Starynkevitch Jun 12 '15 at 11:01
  • As delnan remarked, there's still a call stack for non-tail calls with CPS, it's just represented differently. You don't need to go to exotic architectures to find one with no hardware stack support, you just need to step away from x86, e.g. arm has only branch-and-link. – Gilles 'SO- stop being evil' Jun 15 '15 at 07:59
  • 1
    @Gilles: Many newer ARM cores like the Cortex M0 and M3 (and probably others like the M4) have hardware stack support for things like interrupt handling. Further, the Thumb instruction set includes a limited subset of the STM/LDM instructions: STMDB R13 with any combination of R0-R7 with/without LR, and LDMIA R13 with any combination of R0-R7 with/without PC, which effectively treats R13 as a stack pointer. – supercat Jun 21 '15 at 20:07
  • @BasileStarynkevitch: For generation of temporary objects on a GC heap to be anywhere near as fast as stack accesses, the heap space used by youngest generation of GC objects must be small enough to fit in the lowest-level cache. If it isn't, then every cache line worth of continuation that is generated will be a cache line that will end up having to be uselessly flushed to a higher-level cache (or main memory) after it has been abandoned. By contrast, addresses associated with abandoned stack frames will get recycled very quickly, and can thus remain in cache. – supercat Jun 21 '15 at 23:04
8

Inlining (replacing function calls with equivalent functionality) works well as an optimization strategy for small simple functions. The overhead of a function call can be effectively traded off for a small penalty in added program size (or in some cases, no penalty at all).

However, large functions which in turn call other functions could lead to an enormous explosion in program size if everything was inlined.

The whole point of callable functions is to facilitate efficient re-use, not just by the programmer, but by the machine itself, and that includes properties like reasonable memory or on-disk footprint.

For what it's worth: you can have callable functions without a call stack. For example, the IBM System/360. When programming in languages such as FORTRAN on that hardware, the program counter (return address) would be saved into a small section of memory reserved just ahead of the function's entry point. This allows for re-usable functions, but does not allow for recursion or multi-threaded code (an attempt at a recursive or re-entrant call would result in a previously saved return address being overwritten).
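
A small sketch of why that scheme breaks under recursion (hypothetical code, with a static "local" standing in for the statically saved state):

#include <iostream>

// Emulating static allocation of locals: one fixed slot shared by every
// activation of sum_to, instead of a fresh slot per call.
static int saved;

int sum_to(int n) {
    saved = n;                    // fine while only one call is ever active
    if (n <= 1) return 1;
    int rest = sum_to(n - 1);     // the recursive call overwrites `saved`
    return saved + rest;          // reads 1 here instead of this call's n
}

int main() {
    std::cout << sum_to(4) << '\n';   // prints 4, not the expected 10
}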

As explained by other answers, stacks are good things. They facilitate recursion and multi-threaded calls. While any algorithm coded to use recursion could be coded without relying on recursion, the result may be more complex, more difficult to maintain, and may be less efficient. I'm not sure a stack-less architecture could support multi-threading at all.

Zenilogix