11

Given branch prediction and the effect of compiler optimizations, which of these two versions tends to offer superior performance?

Note that `bRareExceptionPresent` represents an uncommon condition; it is not on the normal logic path.

/* MOST COMMON path must branch around the IF clause */

bool SomeFunction(bool bRareExceptionPresent)
{
  // abort before the function body
  if(bRareExceptionPresent)
  {
     return false;
  }
  // .. function primary body ..
  return true;
}

/* MOST COMMON path does NOT branch */

bool SomeFunction(bool bRareExceptionPresent)
{
  if(!bRareExceptionPresent)
  {
    // .. function primary body ..
  }
  else
  {
    return false;
  }
  return true;
}
dyasta
  • I'm going to go out on a limb here and say there is no difference whatsoever. – Robert Harvey Apr 24 '13 at 15:56
  • This probably depends on the specific CPU that you are compiling for, as they have different pipelining architectures (delay slots vs. no delay slots). The time you have spent thinking about this is likely much more than the time saved when running - profile first, then optimize. –  Apr 24 '13 at 15:57
  • @RobertHarvey I didn't claim that the difference was large. – dyasta Apr 24 '13 at 19:03
  • It's almost certainly premature micro-optimization. – Robert Harvey Apr 24 '13 at 19:03
  • @MichaelT Yep, profiling is indeed the only reliable way to know what's really going on with performance for the code on the target platform, within its context. However, I was curious whether one was generally preferred. – dyasta Apr 24 '13 at 19:03
  • @Robert Harvey While your point is well taken, there is no backing application here. It was my own curiosity, and I think it is OK to ponder this question. – dyasta Apr 24 '13 at 19:46
  • @RobertHarvey: It is premature micro-optimization, *except* in cases where *both* conditions are met: (1) the loop is called billions (not millions) of times; and (2) ironically, when the loop body is tiny in terms of machine code. Condition #2 means that the fraction of time spent on overhead is *not insignificant* compared to time spent on useful work. The good news is that usually, in such situations where both conditions are met, SIMD (vectorization), which is by nature branchless, will solve all the performance issues. – rwong Apr 27 '13 at 06:34
  • Related - [Is micro-optimisation important when coding?](http://programmers.stackexchange.com/questions/99445/is-micro-optimisation-important-when-coding) –  Apr 30 '13 at 22:46
  • Remember that just because *you* write the branch one way, doesn't mean the compiler won't flip it around in the actual code layout. – Sebastian Redl May 14 '15 at 23:10
  • If you will be able to use C++20 there might be hints for branch prediction. See [likely](https://en.cppreference.com/w/cpp/language/attributes/likely) – schoetbi Jul 02 '19 at 07:34
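To illustrate the C++20 attributes mentioned in the last comment, here is a sketch (assuming a C++20 compiler) of the question's early-return version annotated with `[[unlikely]]`; the attribute is only a hint, and the compiler is free to ignore it:

```cpp
// Sketch: the question's early-return version with a C++20 branch hint.
// [[unlikely]] tells the compiler this branch is rarely taken, so it can
// lay out the common path as straight-line code.
bool SomeFunction(bool bRareExceptionPresent)
{
    if (bRareExceptionPresent) [[unlikely]]
    {
        return false;  // rare early exit
    }
    // .. function primary body ..
    return true;
}
```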

2 Answers

10

My understanding is that the first time the CPU encounters a branch, it will predict (if supported) that forward branches are not taken and backwards branches are. The rationale for this is that loops (which typically branch backwards) are assumed to be taken.

On some processors, you can give a hint in the assembly instruction as to which path is the more likely. Details of this escape me at the moment.

Additionally, some C compilers support static branch-prediction hints, so that you can tell the compiler which branch is more likely. In turn, it may reorganize the generated code or use modified instructions to take advantage of this information (or simply ignore it).

__builtin_expect((long)!!(x), 1L)  /* GNU C to indicate that <x> will likely be TRUE */
__builtin_expect((long)!!(x), 0L)  /* GNU C to indicate that <x> will likely be FALSE */
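As a sketch (assuming GCC or Clang), the builtin above is often wrapped in macros; the `likely`/`unlikely` names follow a common convention (used, for example, in the Linux kernel) and are not part of the standard:

```cpp
// Conventional wrapper macros around the GCC/Clang builtin shown above.
// !!(x) normalizes any truthy value to 1 before comparison.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

// The question's early-return version, annotated with the hint:
bool SomeFunction(bool bRareExceptionPresent)
{
    if (unlikely(bRareExceptionPresent))
        return false;          // rare early exit
    // .. function primary body ..
    return true;
}
```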

Hope this helps.

Sparky
  • "My understanding is that the first time the CPU encounters a branch, it will predict (if supported) that forward branches are not taken and backwards branches are." This is a very interesting thought. Do you have any evidence that this is indeed implemented in common architectures? – blubb Apr 24 '13 at 19:33
  • Straight from the horse's mouth: [A forward branch defaults to not taken. A backward branch defaults to taken](http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts). And from the same page: "prefix 0x3E – statically predict a branch as taken". – MSalters Apr 25 '13 at 14:40
  • Is there a platform-agnostic pragma that's equivalent to `__builtin_expect`? – MarcusJ Jun 15 '18 at 22:35
10

In today's world, it doesn't matter much, if at all.

Dynamic branch prediction (something that has been studied for decades; see *An Analysis of Dynamic Branch Prediction Schemes on System Workloads*, published in 1996) is fairly commonplace.

An example of this can be found in the ARM processor. From the ARM Information Center article on Branch Prediction:

To improve the branch prediction accuracy, a combination of static and dynamic techniques is employed.

The question then is "what is dynamic branch prediction in the ARM processor?" Continued reading of Dynamic branch prediction shows that it uses a 2-bit prediction scheme (described in the paper) that builds up information about whether the branch is strongly or weakly taken or not taken.

Over time (and by time I mean a few passes through that block) this builds up information as to which way the code will go.
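The 2-bit scheme the paper describes can be sketched as a saturating counter; this is a minimal model for illustration, not the actual ARM hardware:

```cpp
// Minimal model of a 2-bit saturating-counter branch predictor:
// states 0-1 predict "not taken", states 2-3 predict "taken".
// A single misprediction (e.g. a loop's final exit) does not flip a
// strongly-taken counter all the way over to "not taken".
enum PredictorState {
    STRONG_NOT_TAKEN = 0,
    WEAK_NOT_TAKEN   = 1,
    WEAK_TAKEN       = 2,
    STRONG_TAKEN     = 3
};

bool predict_taken(PredictorState s)
{
    return s >= WEAK_TAKEN;
}

PredictorState update(PredictorState s, bool branch_was_taken)
{
    if (branch_was_taken && s < STRONG_TAKEN)
        return static_cast<PredictorState>(s + 1);
    if (!branch_was_taken && s > STRONG_NOT_TAKEN)
        return static_cast<PredictorState>(s - 1);
    return s;  // already saturated
}
```

After a few taken iterations the counter saturates at STRONG_TAKEN, so the one mispredicted exit branch only weakens it to WEAK_TAKEN, and the next run of the loop is still predicted correctly.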

For static prediction, it looks at the code itself and at which direction the branch goes on the test - backwards to a previous instruction, or forwards to one further along in the code:

The scheme used in the ARM1136JF-S processor predicts that all forward conditional branches are not taken and all backward branches are taken. Around 65% of all branches are preceded by enough non-branch cycles to be completely predicted.

As mentioned by Sparky, this is based on the understanding that loops, more often than not, loop again. A loop branches backwards (there is a branch at the end of the loop that restarts it at the top), and that branch is normally taken.

The danger of trying to second-guess the compiler is that you don't know how the code is actually going to be compiled (and optimized). And for the most part, it doesn't matter. With dynamic prediction, after a couple of passes through the function the predictor will expect the skip over the guard statement for the premature return. If the cost of two flushed pipelines is performance-critical, there are other things to worry about.

The time it takes to read one style over the other is likely of greater importance - make the code clean so that a human can read it, because the compiler is going to do just fine no matter how messy or idealized your code is.

  • A famous stackoverflow question showed that branch prediction *does* matter, even today. – Florian Margaine Apr 26 '13 at 20:07
  • @FlorianMargaine while it does matter, getting into a situation where it really does matter appears to require an understanding of what you are compiling to and how it works (arm vs x86 vs mips ...). Writing code that tries to do this micro-optimization at the start is likely working from mistaken premises and will not achieve the desired effect. –  Apr 26 '13 at 20:27
  • Well of course, let's not quote DK. But I think this question was clearly in the sense of optimization, when you've already gone past the profiling stage. :-) – Florian Margaine Apr 26 '13 at 20:32
  • @FlorianMargaine "MichaelT Yep, profiling is indeed the only reliable way to know what's really going on with performance for the code on the target, platform, within its context. However, I was curious whether one was generally preferred. – 90h 2 days ago" --- My reading of this is a pre-profiling best practice question. –  Apr 26 '13 at 20:53
  • @FlorianMargaine Can you post a link to the question you're speaking of? – dyasta Apr 28 '13 at 19:18
  • @MichaelT Nice answer, and I agree very much with your conclusion. This sort of pre-profiling / abstract optimization can definitely be counter-productive. It ends up being a guessing game, causing one to make design decisions for irrational reasons. Still, I found myself curious ;o – dyasta Apr 28 '13 at 19:19
  • @90h http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array – Florian Margaine Apr 28 '13 at 20:14
  • the branch prediction does matter; trying to defeat the compiler and the runtime and failing matters even more –  May 14 '15 at 22:43
  • @JarrodRoberson just making sure that you realize the 'it' I am referring to is 'attempting to micro-optimize for branch prediction yourself'. The CPU, most times (and unless you profile it to be able to demonstrate otherwise, it's rather pointless), will predict the branches quite well without the programmer's help. The CPU doing dynamic and static branch prediction matters quite a bit, and trying to write code to give it hints yourself is more problematic than writing clean and understandable code from the start. –  May 14 '15 at 23:27
  • ... and if you can suggest a better wording for my leading sentence, I would welcome it. Reading it again, the 'it' could possibly be misinterpreted. –  May 14 '15 at 23:28