I wanted to jump in here among these already-excellent answers and admit that I've taken the ugly approach of actually working backwards to the anti-pattern of changing polymorphic code into `switch` or `if/else` branches with measured gains. But I didn't do this wholesale, only for the most critical paths. It doesn't have to be so black and white.
As a disclaimer, I work in areas like raytracing where correctness is not so difficult to achieve (and is often fuzzy and approximated anyway) while speed is often one of the most competitive qualities sought out. A reduction in render times is often one of the most common user requests, with us constantly scratching our heads and figuring out how to achieve it for the most critical measured paths.
Polymorphic Refactoring of Conditionals
First, it's worth understanding why polymorphism can be preferable, from a maintainability standpoint, to conditional branching (`switch` or a bunch of `if/else` statements). The main benefit here is extensibility.
With polymorphic code, we can introduce a new subtype to our codebase, add instances of it to some polymorphic data structure, and have all the existing polymorphic code still work automagically with no further modifications. If you have a bunch of code scattered throughout a large codebase that resembles the form of, "If this type is 'foo', do that", you might find yourself with a horrible burden of updating 50 disparate sections of code in order to introduce a new type of thing, and still end up missing a few.
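As a small sketch of that extensibility (all names here are invented for illustration), existing polymorphic code keeps working untouched when a new subtype appears:

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical polymorphic base: existing code only knows this interface.
struct Creature
{
    virtual ~Creature() {}
    virtual std::string noise() const = 0;
};

struct Human : Creature
{
    std::string noise() const override { return "hello"; }
};

// A subtype added later: no changes needed anywhere else.
struct Dragon : Creature
{
    std::string noise() const override { return "roar"; }
};

// Existing polymorphic code: works automagically for any future subtype.
std::string process_all(const std::vector<std::unique_ptr<Creature>>& creatures)
{
    std::string out;
    for (const auto& c : creatures)
        out += c->noise() + " ";
    return out;
}
```

If `Dragon` had instead been a tag checked by scattered `if (type == dragon)` code, every one of those sites would need updating.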
The maintainability benefits of polymorphism naturally diminish here if you just have a couple or even one section of your codebase that needs to do such type checks.
Optimization Barrier
I would suggest not looking at this from the standpoint of branching and pipelining so much, and look at it more from the compiler design mindset of optimization barriers. There are ways to improve branch prediction that apply to both cases, like sorting data based on sub-type (if it fits into a sequence).
What differs more between these two strategies is the amount of information the optimizer has in advance. A call to a known function provides a lot more information; an indirect call to a function unknown at compile-time creates an optimization barrier.
When the function being called is known, compilers can obliterate the structure and squash it down to smithereens: inlining calls, eliminating potential aliasing overhead, doing a better job at instruction selection and register allocation, possibly even rearranging loops and other forms of branches, and generating hard-coded miniature LUTs when appropriate (GCC 5.3 recently surprised me by compiling a `switch` statement into a hard-coded LUT of results rather than a jump table).
Some of those benefits get lost when we start introducing compile-time unknowns into the mix, as with the case of an indirect function call, and that's where conditional branching can most likely offer an edge.
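A tiny sketch of the information gap (what a given compiler actually does depends on version and flags, so the comments here are hedged expectations, not guarantees):

```cpp
int square(int x) { return x * x; }

// Direct call: the optimizer sees the callee, so it can inline it and
// typically constant-fold this whole function down to "return 441;".
int direct() { return square(21); }

// Indirect call: if the optimizer can't prove which function fn points
// to, the call usually survives as an opaque jump, blocking inlining
// and the downstream optimizations that inlining enables.
int indirect(int (*fn)(int)) { return fn(21); }
```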
Memory Optimization
Take an example of a video game which consists of processing a sequence of creatures repeatedly in a tight loop. In such a case, we might have some polymorphic container like this:
vector<Creature*> creatures;
Note: for simplicity I avoided `unique_ptr` here.
... where `Creature` is a polymorphic base type. In this case, one of the difficulties with polymorphic containers is that they often want to allocate memory for each subtype separately/individually (ex: using the default throwing `operator new` for each individual creature).
That will often make the first prioritization for optimization (should we need it) memory-based rather than branching. One strategy here is to use a fixed allocator for each sub-type, encouraging a contiguous representation by allocating in large chunks and pooling memory for each sub-type being allocated. With such a strategy, it can definitely help to sort this `creatures` container by sub-type (as well as address), as that not only potentially improves branch prediction but also improves locality of reference (allowing multiple creatures of the same subtype to be accessed from a single cache line prior to eviction).
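A rough sketch of the sorting part (the `type_id` accessor and the subtypes are hypothetical, and the fixed allocator is omitted for brevity):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Creature
{
    virtual ~Creature() {}
    virtual int type_id() const = 0;  // hypothetical cheap type tag
};

struct Human : Creature { int type_id() const override { return 0; } };
struct Orc   : Creature { int type_id() const override { return 1; } };

// Sort by sub-type first, then by address, so creatures of the same
// sub-type (likely pooled contiguously) are visited back to back.
void sort_for_locality(std::vector<Creature*>& creatures)
{
    std::sort(creatures.begin(), creatures.end(),
              [](const Creature* a, const Creature* b)
              {
                  if (a->type_id() != b->type_id())
                      return a->type_id() < b->type_id();
                  return std::less<const Creature*>()(a, b);
              });
}
```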
Partial Devirtualization of Data Structures and Loops
Let's say you went through all these motions and you still desire more speed. It's worth noting that each step we venture here is degrading maintainability, and we'll already be at a somewhat metal-grinding stage with diminishing performance returns. So there needs to be a pretty significant performance demand if we tread into this territory, where we're willing to sacrifice maintainability even further for smaller and smaller performance gains.
Yet the next step to try (and always with a willingness to back out our changes if it doesn't help at all) might be manual devirtualization.
Version control tip: unless you're far more optimization-savvy than me, it can be worth creating a new branch at this point, with a willingness to toss it away if our optimization efforts miss, which may very well happen. For me it's all trial and error after these kinds of points, even with a profiler in hand.
Nevertheless, we don't have to apply this mindset wholesale. Continuing our example, let's say this video game consists mostly of human creatures, by far. In such a case, we can devirtualize only human creatures by hoisting them out and creating a separate data structure just for them.
vector<Human> humans; // common case
vector<Creature*> other_creatures; // additional rare-case creatures
This implies that all the areas in our codebase which need to process creatures need a separate special-case loop for human creatures. Yet that eliminates the dynamic dispatch overhead (or perhaps, more appropriately, optimization barrier) for humans which are, by far, the most common creature type. If these areas are large in number and we can afford it, we might do this:
vector<Human> humans; // common case
vector<Creature*> other_creatures; // additional rare-case creatures
vector<Creature*> creatures; // contains humans and other creatures
... if we can afford this, the less critical paths can stay as they are and simply process all creature types abstractly. The critical paths can process `humans` in one loop and `other_creatures` in a second loop.
We can extend this strategy as needed and potentially squeeze some gains this way, yet it's worth noting how much we're degrading maintainability in the process. Using function templates here can help to generate the code for both humans and creatures without duplicating the logic manually.
Partial Devirtualization of Classes
Something I did years ago which was really gross, and I'm not even sure it's beneficial anymore (this was in the C++03 era), was partial devirtualization of a class. In that case, we were already storing a class ID with each instance for other purposes (accessed through a non-virtual accessor in the base class). There we did something analogous to this (my memory is a little hazy):
switch (obj->type())
{
case id_common_type:
    static_cast<CommonType*>(obj)->non_virtual_do_something();
    break;
...
default:
    obj->virtual_do_something();
    break;
}
... where `virtual_do_something` was implemented to call non-virtual versions in a subclass. It's gross, I know, doing an explicit static downcast to devirtualize a function call. I have no idea how beneficial this is now, as I haven't tried this type of thing for years. With an exposure to data-oriented design, I found the above strategy of splitting up data structures and loops in a hot/cold fashion to be far more useful, opening up more doors for optimization strategies (and far less ugly).
Wholesale Devirtualization
I must admit that I've never gotten this far applying an optimization mindset, so I have no idea of the benefits. I have avoided indirect functions, with foresight, in cases where I knew there was only going to be one central set of conditionals (ex: event processing with only one central place processing events), but I never started off with a polymorphic mindset and optimized all the way up to here.
Theoretically, the immediate benefits here might be a potentially smaller way of identifying a type than a virtual pointer (ex: a single byte if you can commit to the idea that there are 256 unique types or less) in addition to completely obliterating these optimization barriers.
It might also help in some cases to write easier-to-maintain code (versus the optimized manual devirtualization examples above) if you just use one central `switch` statement without having to split up your data structures and loops based on subtype, or if there's an order-dependency in these cases where things have to be processed in a precise order (even if that causes us to branch all over the place). This would be for cases where you don't have too many places that need to do the `switch`.
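A minimal sketch of what such a central `switch` might look like (the types and damage rules are invented for illustration); note that the per-instance type tag can be a single byte, far smaller than a virtual pointer:

```cpp
// Wholesale devirtualization: a 1-byte tag replaces the vptr, and one
// central switch replaces virtual dispatch. Only viable if the set of
// types is fixed and the switch lives in very few places.
enum class CreatureType : unsigned char { human, orc };

struct Creature
{
    CreatureType type;
    int health;
};

int damage_taken(const Creature& c, int hit)
{
    switch (c.type)
    {
    case CreatureType::human: return hit;      // no armor
    case CreatureType::orc:   return hit / 2;  // thick hide
    }
    return hit;
}
```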
I would generally not recommend this even with a very performance-critical mindset unless this is reasonably easy to maintain. "Easy to maintain" would tend to hinge on two dominant factors:
- Not having a real extensibility need (ex: knowing for sure that you have exactly 8 types of things to process, and never any more).
- Not having many places in your code that need to check these types (ex: one central place).
... yet I recommend the above scenario in most cases and iterating towards more efficient solutions by partial devirtualization as needed. It gives you a lot more breathing room to balance extensibility and maintainability needs with performance.
Virtual Functions vs. Function Pointers
To kind of top this off, I noticed here that there was some discussion about virtual functions vs. function pointers. It is true that virtual functions require a little extra work to call, but that doesn't mean they are slower. Counter-intuitively, it may even make them faster.
It's counter-intuitive here because we're used to measuring cost in terms of instructions without paying attention to the dynamics of the memory hierarchy which tend to have a much more significant impact.
If we're comparing a `class` with 20 virtual functions vs. a `struct` which stores 20 function pointers, and both are instantiated multiple times, the memory overhead of each `class` instance in this case is 8 bytes for the virtual pointer on 64-bit machines, while the memory overhead of the `struct` is 160 bytes.
The practical cost there can be a whole lot more compulsory and non-compulsory cache misses with the table of function pointers vs. the class using virtual functions (and possibly page faults at a large enough input scale). That cost tends to dwarf the slightly extra work of indexing a virtual table.
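A quick way to see the per-instance difference (layouts are hypothetical, and exact sizes depend on the target, though the ratio holds on typical 64-bit platforms):

```cpp
// The struct pays for all 20 function pointers in every instance; the
// class pays only for one vptr, since the 20 entries live once in the
// shared vtable rather than in each object.
struct TwentyFnPtrs
{
    void (*fns[20])();
};

class TwentyVirtuals
{
public:
    virtual ~TwentyVirtuals() {}
    virtual void f0() {}
    // ... imagine f1() through f19() here; they add vtable entries,
    // not per-instance bytes.
};
```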
I've also dealt with legacy C codebases (older than I am) where turning such `structs` filled with function pointers, and instantiated numerous times, into classes with virtual functions actually gave significant performance gains (over 100% improvements), simply due to the massive reduction in memory use, the increased cache-friendliness, etc.
On the flip side, when comparisons become more about apples to apples, I've likewise found the opposite mindset of translating from a C++ virtual function mindset to C-style function pointer mindset to be useful in these types of scenarios:
class Functionoid
{
public:
    virtual ~Functionoid() {}
    virtual void operator()() = 0;
};
... where the class was storing a single measly overridable function (or two if we count the virtual destructor). In those cases, it can definitely help in critical paths to turn that into this:
void (*func_ptr)(void* instance_data);
... ideally behind a type-safe interface to hide the dangerous casts to/from `void*`.
In those cases where we're tempted to use a class with a single virtual function, it can often help to use function pointers instead. A big reason isn't even necessarily the reduced cost of calling a function pointer. It's that we no longer face the temptation to allocate each separate functionoid in scattered regions of the heap if we're aggregating them into a persistent structure. This kind of approach can make it easier to avoid heap allocation and memory fragmentation overhead when the instance data is homogeneous and only the behavior varies.
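Here's one hedged sketch of such a type-safe interface (all names invented), using a template "thunk" so the `void*` cast lives in exactly one private place:

```cpp
// A minimal type-safe wrapper over the raw (function pointer, void*
// instance) pair described above.
class Callback
{
public:
    Callback() : instance_(0), fn_(0) {}

    template <class T, void (*Fn)(T&)>
    static Callback make(T& instance)
    {
        Callback cb;
        cb.instance_ = &instance;
        cb.fn_ = &invoke<T, Fn>;
        return cb;
    }

    void operator()() const { fn_(instance_); }

private:
    template <class T, void (*Fn)(T&)>
    static void invoke(void* instance)
    {
        Fn(*static_cast<T*>(instance));  // the one unsafe cast, hidden here
    }

    void* instance_;
    void (*fn_)(void*);
};

// Illustrative usage: instance data is homogeneous, only behavior varies.
struct Player { int hp; };
void heal(Player& p) { p.hp += 5; }
```

Per `Callback`, the storage is just two pointers, and many of them can be packed into a contiguous container without any per-functionoid heap allocation.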
So there are definitely some cases where using function pointers can help, but often I've found it the other way around when comparing a bunch of tables of function pointers to a single vtable, which only requires one pointer to be stored per class instance. That vtable will also often be sitting in one or more L1 cache lines in tight loops.
Conclusion
So anyway, that's my little spin on this topic. I recommend venturing in these areas with caution. Trust measurements, not instinct, and given the way these optimizations often degrade maintainability, only go as far as you can afford (and a wise route would be to err on the side of maintainability).