I wanted to jump in here among these already-excellent answers and admit that I've taken the ugly approach of actually working backwards to the anti-pattern of changing polymorphic code into `switch` or `if/else` branches with measured gains. But I didn't do this wholesale, only for the most critical paths. It doesn't have to be so black and white.
As a disclaimer, I work in areas like raytracing where correctness is not so difficult to achieve (and is often fuzzy and approximated anyway) while speed is often one of the most competitive qualities sought out. A reduction in render times is often one of the most common user requests, with us constantly scratching our heads and figuring out how to achieve it for the most critical measured paths.
Polymorphic Refactoring of Conditionals
First, it's worth understanding why polymorphism can be preferable, from a maintainability standpoint, to conditional branching (`switch` or a bunch of `if/else` statements). The main benefit here is extensibility.
With polymorphic code, we can introduce a new subtype to our codebase, add instances of it to some polymorphic data structure, and have all the existing polymorphic code still work automagically with no further modifications. If you have a bunch of code scattered throughout a large codebase that resembles the form of, "If this type is 'foo', do that", you might find yourself with a horrible burden of updating 50 disparate sections of code in order to introduce a new type of thing, and still end up missing a few.
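As a small sketch of that extensibility (all names here are invented for illustration), existing polymorphic code keeps working untouched when a new subtype appears:

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical polymorphic base: existing code only knows this interface.
struct Creature
{
    virtual ~Creature() {}
    virtual std::string noise() const = 0;
};

struct Human : Creature
{
    std::string noise() const override { return "hello"; }
};

// A subtype added later: no changes needed anywhere else.
struct Dragon : Creature
{
    std::string noise() const override { return "roar"; }
};

// Existing polymorphic code: works automagically for any future subtype.
std::string process_all(const std::vector<std::unique_ptr<Creature>>& creatures)
{
    std::string out;
    for (const auto& c : creatures)
        out += c->noise() + " ";
    return out;
}
```

If `Dragon` had instead been a tag checked by scattered `if (type == dragon)` code, every one of those sites would need updating.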
The maintainability benefits of polymorphism naturally diminish here if you just have a couple or even one section of your codebase that needs to do such type checks.
Optimization Barrier
I would suggest not looking at this from the standpoint of branching and pipelining so much, and look at it more from the compiler design mindset of optimization barriers. There are ways to improve branch prediction that apply to both cases, like sorting data based on sub-type (if it fits into a sequence).
What differs more between these two strategies is the amount of information the optimizer has in advance. A call to a known function provides a lot more information; an indirect call to a function unknown at compile-time creates an optimization barrier.
When the function being called is known, compilers can obliterate the structure and squash it down to smithereens: inlining calls, eliminating potential aliasing overhead, doing a better job at instruction selection and register allocation, possibly even rearranging loops and other forms of branches, and generating hard-coded miniature LUTs when appropriate (GCC 5.3 recently surprised me by compiling a `switch` statement into a hard-coded LUT of results rather than a jump table).
Some of those benefits get lost when we start introducing compile-time unknowns into the mix, as with the case of an indirect function call, and that's where conditional branching can most likely offer an edge.
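A tiny sketch of the information gap (what a given compiler actually does depends on version and flags, so the comments here are hedged expectations, not guarantees):

```cpp
int square(int x) { return x * x; }

// Direct call: the optimizer sees the callee, so it can inline it and
// typically constant-fold this whole function down to "return 441;".
int direct() { return square(21); }

// Indirect call: if the optimizer can't prove which function fn points
// to, the call usually survives as an opaque jump, blocking inlining
// and the downstream optimizations that inlining enables.
int indirect(int (*fn)(int)) { return fn(21); }
```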
Memory Optimization
Take an example of a video game which consists of processing a sequence of creatures repeatedly in a tight loop. In such a case, we might have some polymorphic container like this:
vector<Creature*> creatures;
Note: for simplicity I avoided `unique_ptr` here.
... where `Creature` is a polymorphic base type. In this case, one of the difficulties with polymorphic containers is that they often want to allocate memory for each subtype separately/individually (ex: using the default throwing `operator new` for each individual creature).
That will often make the first prioritization for optimization (should we need it) memory-based rather than branching. One strategy here is to use a fixed allocator for each sub-type, encouraging a contiguous representation by allocating in large chunks and pooling memory for each sub-type being allocated. With such a strategy, it can definitely help to sort this `creatures` container by sub-type (as well as address), as that not only potentially improves branch prediction but also improves locality of reference (allowing multiple creatures of the same subtype to be accessed from a single cache line prior to eviction).
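A rough sketch of the sorting part (the `type_id` accessor and the subtypes are hypothetical, and the fixed allocator is omitted for brevity):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Creature
{
    virtual ~Creature() {}
    virtual int type_id() const = 0;  // hypothetical cheap type tag
};

struct Human : Creature { int type_id() const override { return 0; } };
struct Orc   : Creature { int type_id() const override { return 1; } };

// Sort by sub-type first, then by address, so creatures of the same
// sub-type (likely pooled contiguously) are visited back to back.
void sort_for_locality(std::vector<Creature*>& creatures)
{
    std::sort(creatures.begin(), creatures.end(),
              [](const Creature* a, const Creature* b)
              {
                  if (a->type_id() != b->type_id())
                      return a->type_id() < b->type_id();
                  return std::less<const Creature*>()(a, b);
              });
}
```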
Partial Devirtualization of Data Structures and Loops
Let's say you went through all these motions and you still desire more speed. It's worth noting that each step we venture here is degrading maintainability, and we'll already be at a somewhat metal-grinding stage with diminishing performance returns. So there needs to be a pretty significant performance demand if we tread into this territory, where we're willing to sacrifice maintainability even further for smaller and smaller performance gains.
Yet the next step to try (and always with a willingness to back out our changes if it doesn't help at all) might be manual devirtualization.
Version control tip: unless you're far more optimization-savvy than me, it can be worth creating a new branch at this point, with a willingness to toss it away if our optimization efforts miss, which may very well happen. For me it's all trial and error after these kinds of points, even with a profiler in hand.
Nevertheless, we don't have to apply this mindset wholesale. Continuing our example, let's say this video game consists mostly of human creatures, by far. In such a case, we can devirtualize only human creatures by hoisting them out and creating a separate data structure just for them.
vector<Human> humans; // common case
vector<Creature*> other_creatures; // additional rare-case creatures
This implies that all the areas in our codebase which need to process creatures need a separate special-case loop for human creatures. Yet that eliminates the dynamic dispatch overhead (or perhaps, more appropriately, optimization barrier) for humans which are, by far, the most common creature type. If these areas are large in number and we can afford it, we might do this:
vector<Human> humans; // common case
vector<Creature*> other_creatures; // additional rare-case creatures
vector<Creature*> creatures; // contains humans and other creatures
... if we can afford this, the less critical paths can stay as they are and simply process all creature types abstractly. The critical paths can process `humans` in one loop and `other_creatures` in a second loop.
We can extend this strategy as needed and potentially squeeze some gains this way, yet it's worth noting how much we're degrading maintainability in the process. Using function templates here can help to generate the code for both humans and creatures without duplicating the logic manually.
Partial Devirtualization of Classes
Something I did years ago which was really gross, and I'm not even sure it's beneficial anymore (this was in the C++03 era), was partial devirtualization of a class. In that case, we were already storing a class ID with each instance for other purposes (accessed through a non-virtual accessor in the base class). There we did something analogous to this (my memory is a little hazy):
switch (obj->type())
{
case id_common_type:
    static_cast<CommonType*>(obj)->non_virtual_do_something();
    break;
...
default:
    obj->virtual_do_something();
    break;
}
... where `virtual_do_something` was implemented to call non-virtual versions in a subclass. It's gross, I know, doing an explicit static downcast to devirtualize a function call. I have no idea how beneficial this is now, as I haven't tried this type of thing for years. With an exposure to data-oriented design, I found the above strategy of splitting up data structures and loops in a hot/cold fashion to be far more useful, opening up more doors for optimization strategies (and far less ugly).
Wholesale Devirtualization
I must admit that I've never gotten this far applying an optimization mindset, so I have no idea of the benefits. I have avoided indirect functions, with foresight, in cases where I knew there was only going to be one central set of conditionals (ex: event processing with only one central place processing events), but I never started off with a polymorphic mindset and optimized all the way up to here.
Theoretically, the immediate benefits here might be a potentially smaller way of identifying a type than a virtual pointer (ex: a single byte if you can commit to the idea that there are 256 unique types or less) in addition to completely obliterating these optimization barriers.
It might also help in some cases to write easier-to-maintain code (versus the optimized manual devirtualization examples above) if you just use one central `switch` statement without having to split up your data structures and loops based on subtype, or if there's an order-dependency in these cases where things have to be processed in a precise order (even if that causes us to branch all over the place). This would be for cases where you don't have too many places that need to do the `switch`.
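A minimal sketch of what such a central `switch` might look like (the types and damage rules are invented for illustration); note that the per-instance type tag can be a single byte, far smaller than a virtual pointer:

```cpp
// Wholesale devirtualization: a 1-byte tag replaces the vptr, and one
// central switch replaces virtual dispatch. Only viable if the set of
// types is fixed and the switch lives in very few places.
enum class CreatureType : unsigned char { human, orc };

struct Creature
{
    CreatureType type;
    int health;
};

int damage_taken(const Creature& c, int hit)
{
    switch (c.type)
    {
    case CreatureType::human: return hit;      // no armor
    case CreatureType::orc:   return hit / 2;  // thick hide
    }
    return hit;
}
```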
I would generally not recommend this even with a very performance-critical mindset unless this is reasonably easy to maintain. "Easy to maintain" would tend to hinge on two dominant factors:
- Not having a real extensibility need (ex: knowing for sure that you have exactly 8 types of things to process, and never any more).
- Not having many places in your code that need to check these types (ex: one central place).
... yet I recommend the above scenario in most cases and iterating towards more efficient solutions by partial devirtualization as needed. It gives you a lot more breathing room to balance extensibility and maintainability needs with performance.
Virtual Functions vs. Function Pointers
To kind of top this off, I noticed here that there was some discussion about virtual functions vs. function pointers. It is true that virtual functions require a little extra work to call, but that doesn't mean they are slower. Counter-intuitively, it may even make them faster.
It's counter-intuitive here because we're used to measuring cost in terms of instructions without paying attention to the dynamics of the memory hierarchy which tend to have a much more significant impact.
If we're comparing a `class` with 20 virtual functions vs. a `struct` which stores 20 function pointers, and both are instantiated multiple times, the memory overhead of each `class` instance in this case is 8 bytes for the virtual pointer on 64-bit machines, while the memory overhead of the `struct` is 160 bytes.
The practical cost there can be a whole lot more compulsory and non-compulsory cache misses with the table of function pointers vs. the class using virtual functions (and possibly page faults at a large enough input scale). That cost tends to dwarf the slightly extra work of indexing a virtual table.
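A quick way to see the per-instance difference (layouts are hypothetical, and exact sizes depend on the target, though the ratio holds on typical 64-bit platforms):

```cpp
// The struct pays for all 20 function pointers in every instance; the
// class pays only for one vptr, since the 20 entries live once in the
// shared vtable rather than in each object.
struct TwentyFnPtrs
{
    void (*fns[20])();
};

class TwentyVirtuals
{
public:
    virtual ~TwentyVirtuals() {}
    virtual void f0() {}
    // ... imagine f1() through f19() here; they add vtable entries,
    // not per-instance bytes.
};
```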
I've also dealt with legacy C codebases (older than I am) where turning such `structs` filled with function pointers, and instantiated numerous times, into classes with virtual functions actually gave significant performance gains (over 100% improvements), simply due to the massive reduction in memory use, the increased cache-friendliness, etc.
On the flip side, when comparisons become more about apples to apples, I've likewise found the opposite mindset of translating from a C++ virtual function mindset to C-style function pointer mindset to be useful in these types of scenarios:
class Functionoid
{
public:
    virtual ~Functionoid() {}
    virtual void operator()() = 0;
};
... where the class was storing a single measly overridable function (or two if we count the virtual destructor). In those cases, it can definitely help in critical paths to turn that into this:
void (*func_ptr)(void* instance_data);
... ideally behind a type-safe interface to hide the dangerous casts to/from `void*`.
In those cases where we're tempted to use a class with a single virtual function, it can often help to use function pointers instead. A big reason isn't even necessarily the reduced cost of calling a function pointer. It's that we no longer face the temptation to allocate each separate functionoid in scattered regions of the heap if we're aggregating them into a persistent structure. This kind of approach can make it easier to avoid heap allocation and memory fragmentation overhead when the instance data is homogeneous and only the behavior varies.
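Here's one hedged sketch of such a type-safe interface (all names invented), using a template "thunk" so the `void*` cast lives in exactly one private place:

```cpp
// A minimal type-safe wrapper over the raw (function pointer, void*
// instance) pair described above.
class Callback
{
public:
    Callback() : instance_(0), fn_(0) {}

    template <class T, void (*Fn)(T&)>
    static Callback make(T& instance)
    {
        Callback cb;
        cb.instance_ = &instance;
        cb.fn_ = &invoke<T, Fn>;
        return cb;
    }

    void operator()() const { fn_(instance_); }

private:
    template <class T, void (*Fn)(T&)>
    static void invoke(void* instance)
    {
        Fn(*static_cast<T*>(instance));  // the one unsafe cast, hidden here
    }

    void* instance_;
    void (*fn_)(void*);
};

// Illustrative usage: instance data is homogeneous, only behavior varies.
struct Player { int hp; };
void heal(Player& p) { p.hp += 5; }
```

Per `Callback`, the storage is just two pointers, and many of them can be packed into a contiguous container without any per-functionoid heap allocation.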
So there are definitely some cases where using function pointers can help, but often I've found it the other way around when comparing a bunch of tables of function pointers to a single vtable, which only requires one pointer to be stored per class instance. That vtable will also often be sitting in one or more L1 cache lines in tight loops.
Conclusion
So anyway, that's my little spin on this topic. I recommend venturing in these areas with caution. Trust measurements, not instinct, and given the way these optimizations often degrade maintainability, only go as far as you can afford (and a wise route would be to err on the side of maintainability).