Minimizing code duplication without using macros or sacrificing speed

Question

Here's a snippet of code:

An inlined function:

inline void rayStep(const glm::vec3 &ray, float &rayLength, const glm::vec3 &distanceFactor, glm::ivec3 &currentVoxelCoordinates, const glm::ivec3 &raySign, const glm::ivec3 &rayPositive, glm::vec3 &positionInVoxel, const int smallestIndex) {
    rayLength += distanceFactor[smallestIndex];
    currentVoxelCoordinates[smallestIndex] += raySign[smallestIndex];
    positionInVoxel += ray * glm::vec3(distanceFactor[smallestIndex]);
    positionInVoxel[smallestIndex] = 1 - rayPositive[smallestIndex];
}

It's usage:

glm::ivec3 distanceFactor = (glm::vec3(rayPositive) - positionInVoxel) / ray;

    if (distanceFactor.x < distanceFactor.y)
    {
        if (distanceFactor.x < distanceFactor.z)
        {
            rayStep(ray, rayLength, distanceFactor, currentVoxelCoordinates, raySign, rayPositive, positionInVoxel, 0);
        }
        else
        {
            rayStep(ray, rayLength, distanceFactor, currentVoxelCoordinates, raySign, rayPositive, positionInVoxel, 2);
        }
    }
    else
    {
        if (distanceFactor.y < distanceFactor.z)
        {
            rayStep(ray, rayLength, distanceFactor, currentVoxelCoordinates, raySign, rayPositive, positionInVoxel, 1);
        }
        else
        {
            rayStep(ray, rayLength, distanceFactor, currentVoxelCoordinates, raySign, rayPositive, positionInVoxel, 2);
        }
    }

I really dislike the way the usage of the function looks like. One way I could fix it is to calculate the index of the smallest component and then just use the body of the function directly in the code:

int smallestIndex = (distanceFactor.x < distanceFactor.y) ? (distanceFactor.x < distanceFactor.z ? 0 : 2) : (distanceFactor.y < distanceFactor.z ? 1 : 2);
rayLength += distanceFactor[smallestIndex];
currentVoxelCoordinates[smallestIndex] += raySign[smallestIndex];
positionInVoxel += ray * glm::vec3(distanceFactor[smallestIndex]);
positionInVoxel[smallestIndex] = 1 - rayPositive[smallestIndex];

This looks much cleaner to me.

So why haven't I done that, if it bothers me so much?

The benefit of the above code is that value of the smallestComponentIndex of the function is know at compile time - the value is given as a constant and the function is inlined. This enables the compiler to do some optimizations which it wouldn't be able to do if the value were unknown during the compile time - which is what happens in the example with double ternary operator.

The performance hit in my example is not small, the code goes from 30ms to about 45ms of execution time - that's 50% increase.

This seems negligible, but this is a part of a simple ray tracer - if I want to scale it to do more complex calculations, I need this part to be as fast as possible, since it's done once per ray intersection. This was run on low resolution with a single ray per pixel, with no light sources taken into account. A simple ray cast, really, hence the runtime of 30ish ms.

Is there any way I can have both the code expression and speed? Some nicer way to express what I want to do, while making sure that value of the smallestComponentIndex is known at compile time?

Possible duplicate of [Is micro-optimisation important when coding?](https://softwareengineering.stackexchange.com/questions/99445/is-micro-optimisation-important-when-coding) — gnat, Jul 04 '19 at 15:34
I clarified a bit. What you see here is the core of performance critical part of the code. This needs to really scale well, since the program will spend most of it's time in these instructions. — Karlovsky120, Jul 04 '19 at 15:41
This isn't code duplication. It's an if-else ladder with function calls in it. You can't make it any more compact than that without wrapping rayStep in another function, and if you do that, you'll just be pushing the complexity somewhere else. — Robert Harvey, Jul 04 '19 at 16:11
`Is there any way I can have both the code expression and speed?` -- These two things are often mutually exclusive; you can't have one without sacrificing the other. Your first example is very readable; I suggest you keep it and the speed it provides. — Robert Harvey, Jul 04 '19 at 16:14
A lot of software engineering is about tradeoffs. As your understanding grows, first you learn about different kinds of desirable and undesirable effects of certain decisions, and how often you have trade one for the other; but your thinking is local, on a case-by-case basis (i.e., you consider a specific code example). But later on, you start thinking more holistically; squeezing out the last bit of performance from some peace of code could cost you a lot of effort, but have a negligible overall effect; you need to be targeted (find bottlenecks), and you need to be able to measure. — Filip Milovanović, Jul 05 '19 at 00:42
I do know that. This chunk of code is executed per ray per voxel in a rudimentary space subdivision ray tracer, so any speed I gain here is multiplied by a factor of several million. — Karlovsky120, Jul 05 '19 at 08:44

D Drmmr · Answer 1 · 2019-07-04T19:24:10.270

3

If you use C++11, you can wrap the duplicated code in a local lambda function. Since a lambda is guaranteed to be inlined, performance won't be affected.

auto step = [&](int index)
{
    rayStep(ray, rayLength, distanceFactor, currentVoxelCoordinates, raySign, rayPositive, positionInVoxel, index);
};
if (d.x  < d.y && d.x  < d.z) {
    step(0);
}
else if (d.y <= d.x && d.y < d.z) {
    step(1);
}
else {
    step(2);
}

edited Jul 04 '19 at 19:24

answered Jul 04 '19 at 19:17

D Drmmr

290
2
6

2

Functions are never guaranteed to be inlined by the standard, whether it makes sense or not, irrespective of use of lambdas. – Deduplicator Jul 04 '19 at 20:09
This adds almost 50% to the runtime of the test code, I'm not sure why. – Karlovsky120 Jul 05 '19 at 08:57

Deduplicator · Accepted Answer · 2019-07-05T09:45:03.940

3

Refactor for maintainability, and only try for more efficiency if needed.

So, extract the index-calculation:

static constexpr int smallestIndex(glm::ivec3 v) noexcept {
    return v.x < v.y && v.x < v.z ? 0 : v.y < v.z ? 1 : 2;
}

A more literal transcription might be optimized differently though:

static constexpr int smallestIndex(glm::ivec3 v) noexcept {
    return v.x < v.y
        ? (v.x < v.z ? 0 : 2)
        : (v.y < v.z ? 1 : 2);
}

And use that to simplify your code:

glm::ivec3 distanceFactor = (glm::vec3(rayPositive) - positionInVoxel) / ray;
rayStep(ray, rayLength, distanceFactor, currentVoxelCoordinates, raySign, rayPositive, positionInVoxel, smallestIndex(distanceFactor));

Most compilers have annotations for forcing inlining, like __attribute__((always_inline)) for gcc. Try that before sacrificing readability, if actually necessary.

edited Jul 05 '19 at 09:45

answered Jul 04 '19 at 20:34

Deduplicator

8,591
5
31
50

I've played with this a bit, but your code runs for 50ms, which is slower than both of the solutions. That doesn't make any sense, given the constexpr keyword. – Karlovsky120 Jul 05 '19 at 08:44
Do you ask for optimization (`-O2`, `-Os` or `-O3`)? Also, did you try the variant with forced inlining? – Deduplicator Jul 05 '19 at 09:35
I'm asking for `/O2 (maximum speed)`, `/Ob2 (inline any suitable)`, `/Oi (enable intrinsic functions)` and `/Ot (favour speed)`. I tried with `__forceinline` (vs c++ compiler), but the performance was the same. The way I have the code now, I get ~40ms with if tree and ~50ms with the above code. – Karlovsky120 Jul 05 '19 at 09:40
Could you try again with the more literal transcription I added? – Deduplicator Jul 05 '19 at 09:45
I'm now measuring an average between 1.6 million rays. The if tree is still fastest at about 530ns, literal code is a bit slower at 620ns while the original constexpr is slowest at about 640ns. – Karlovsky120 Jul 05 '19 at 11:33

Minimizing code duplication without using macros or sacrificing speed

2 Answers2