Is it possible to restructure code to avoid this copy-paste bug?

Question

I came across quite a subtle bug the other day, where there are two sections of similar code that were supposed to use different variables, but copy-pasting had lead them to use the same variable. I've reduced it to the the example below.

static float module_variable;

float foo() {
    float a = calculate_a();
    float b = calculate_b();

    float c = calculate_c();
    float d = calculate_d();

    float output = get_default();

    if (a > some_threshold) {
         // `a` meets criteria, therefore we should use it
        output += a;
        module_variable = c;
    } else if (b < other_threshold) {
        // `b` meets criteria, therefore we should use it
        output += a;  // *BUG* we should be using `output += b` here
        module_variable = d;
    }

    return output
}

A lint for unused variables almost catches this, but because the else if uses b, the linter won't complain.

It seems like I want a way to only allow the inner block access to variables that were used to get into that block.

This particular bug was in C, but I'd be interested to see if anyone can come up with a different way of structuring this code to avoid bugs like this (either in C or any other language).

A unit test should catch this. A code review should too, if you use meaningful variable names. — Rik D, Jul 21 '21 at 21:34
An aside. Why are you always precomputing c and d, even though they aren't always used? — Peter M, Jul 21 '21 at 21:44
The real function is ~200 lines of math-heavy code, which is why it probably got missed in code review. I agree a unit test could have caught this, if it was well designed. I guess I'm more interested to see if the code could be improved. c & d are both saved as debug outputs regardless of the conditional statement, which is why they are always calculated. — Joel Gibson, Jul 21 '21 at 21:57
"The real function is ~200 lines" <- there is the root cause of your problem. — Philip Kendall, Jul 21 '21 at 22:42
It's very unlikely that there is any systematic solution here in terms of restructuring the code - alternatives which seem to make this specific oversight less likely, would likely be more complicated and liable to other risks. Perhaps the root cause here was capricious copy-and-pasting, but not every error must be attributable to a systemic failing - humans make random errors, too. Code reviews and tests improve error detection, but do not assure the detection of every possible error. — Steve, Jul 22 '21 at 00:00

Brandin · Accepted Answer · 2021-07-22T13:52:26.410

In general, you should try to reduce the scope of your variables. Uncle Bob's Clean Code rule for this is "declare variables close to their usage" and CppCheck typically gives a diagnostic such as "the scope of X could be reduced". Specifically, the Old C practice of declaring all variables at the top usually leads to violations of this rule, especially for large functions.

So as a style rule, I would prefer to avoid "declaration sections" at the top of the function where everything is declared. The only good reason to use that style nowadays is if you need to compile the code on certain legacy compilers that support only C89. Since C99, even plain vanilla C allows declaring the variable anywhere in the function, just like C++ and Java.

In your example, you declare and initialize a, b, c and d at the top of the function, so the scope of those variables is basically the entire function. However, your actual use of each of those is only in a very small portion of your function. If you had limited the scope of those, then your *BUG* line would become a compiler error. Scope can be limited in a large function by adding extra braces:

float foo()
{
    float output = get_default();
    
    {
        float a = calculate_a();
        if (a > some_threshold) {
            output += a;
            module_variable = calculate_c();
            return output;
        }
    }
    {
        float b = calculate_b();
        if (b < other_threshold) {
            output += a; // ERROR.
            module_variable = calculate_d();
            return output;
        }
    }
    // ... Imagine lots of other code that initializes and uses c, d, e, f, etc.
    assert(0); // Should not reach this if I'm using early returns consistently.
}

If you're using C++17 or later, you could use an initializer inside the if, which allows to simplify things a bit:

// Since C++17.
float foo()
{
    float output = get_default();
    
    if (float a = calculate_a(); a > some_threshold) {
        output += a;
        module_variable = calculate_c();
        return output;
    }
    if (float b = calculate_b(); b < other_threshold) {
        output += a; // ERROR.
        module_variable = calculate_d();
        return output;
    }
    // ... Imagine lots of other code that initializes and uses c, d, e, f, etc.
    assert(0);
}

This C++17 feature may look a bit strange at first, but notice it uses the same rule as an initializer inside a C++ and C99 for statement, which also tends to reduce scope of variables and should also be used where possible for the same reason.

c & d are both saved as debug outputs regardless of the conditional statement, which is why they are always calculated

If you always need to calculate everything, and want to avoid recalculating if necessary, and if you are using C++11, then a fancier alternative to the naked braces is to replace them with capturing lambdas that are immediately invoked. Arguably the [...](){...}(); notation is a bit noisy, but it allows some nice compiler diagnostics for a large function:

// Since C++11.
float foo()
{
    float output = get_default();
    float a = calculate_a();
    float b = calculate_b();
    float c = calculate_c();
    float d = calculate_d();
    // ... More calculate statements.
    
    [&a, &c, &output]() {
        if (a > some_threshold) {
            output += a;
            module_variable = c;
        }
    }();
    [&b, &d, &output]() {
        if (b < other_threshold) {
            output += a; // ERROR.
            module_variable = d;
        }
    }();
    // ...
    return output;
}

The specific diagnostic from GCC is:

main.cpp:32:27: error: 'a' is not captured
   32 |                 output += a; // ERROR.
      |

Which means that you either need to update the capture list for that section, or that you didn't mean a, but something else instead.

However, there is a caveat with GCC (at least the version I am using), when using immediately invoked lambda expressions. It's conceivable that you might accidentally forget the () at the end of the lambda expression, e.g. by typing just }; at the end instead of }();:

    [&a, &c, &output]() {
        if (a > some_threshold) {
            output += a;
            module_variable = c;
        }
    }; // OOPS.

That's syntatically valid, but unfortunately it just declares the code to be executed without actually executing it (the code inside the lambda is dead code). Unfornately GCC did not give a warning for that. Anyway, I see that Clang correctly gives a warning about this (expression result unused, -Wunued-value). It's possible that newer versions of GCC eventually do so as well, if the newest version does not already (I don't update my tools so often, so maybe the newest versions of GCC are better).

Finally, the lambda expressions make early returns impossible, so the sense of the short-circuiting behaviour in the "else if ... else if ..." of the original code is lost. If you need that behaviour, you could consider this:

    ...
    bool done = false;
    
    // ...
    [&a, &c, &output, &done]() {
        if (...) {
            ...
            done = true;
        }
    }();
    if (done) return output;
    // ...

Both these solutions presuppose that multiple returns are viable (e.g. that there isn't common code that follows the if-block, and must be executed for every branch), and also presuppose that the "calculate" functions don't share state or depend on the sequence given by the OP (he mentions that c and d are debug values that must be calculated on all paths). But +1 for the interesting approach. — Steve, Jul 22 '21 at 11:45
Well, if everything always needs to be calculated, or if there are other complex interactions or requirements, then of course you'll have to think more carefully. But I've found if you start out with the code to try to minimize scope (something like the above), then you can try to build it with your C++ compiler, and then use the compiler diagnostics to show you where you may need a bigger scope. Possibly you'll have to move some things outside of the braces (to a broader scope), but the end result will hopefully be tighter than declaring everything at the top. — Brandin, Jul 22 '21 at 11:53
@Steve done=true (or using "goto bottom") will also work for the first solution, to avoid multiple returns; although that's arguably uglier than if ... else if ... else if structure of the original. If the original code is really hundreds of lines of math heavy code, maybe a few lines of syntactic noise is a worthy price to pay for better diagnostics. It's hard to judge with this small example. — Brandin, Jul 22 '21 at 14:02
Just limiting scope with extra {} works really well, and imo makes the function clearer too. That if-initializer in C++ is neat! I came up with some other options (moving each if-branch into its own function, or using rust's ownership to consume `a` and `b`), but they were quite detrimental to readability. Thanks for the detailed reply! — Joel Gibson, Jul 22 '21 at 18:59
@gnasher729 I think C++17+ is the only language that has it. As an analogy, it may feel a bit like using a word that only one major language in the world uses. — Brandin, Jul 26 '21 at 07:20

Is it possible to restructure code to avoid this copy-paste bug?

1 Answers1