9

This question got rather a freezing reception at SO, so I decided to delete it there and try here instead. If you think it does not fit here either, please at least leave a comment suggesting how I could find the kind of example I'm after...

Can you give an example, where using C99 VLAs offers a real advantage over something like current standard heap-using C++ RAII mechanisms?

The example I am after should:

  1. Achieve an easily measurable (10% maybe) performance advantage over using heap.
  2. Not have a good workaround, which would not need the whole array at all.
  3. Actually benefit from using dynamic size, instead of fixed maximum size.
  4. Be unlikely to cause stack overflow in normal use scenario.
  5. Be strong enough to tempt a developer needing the performance to include a C99 source file in a C++ project.

Adding some clarification on the context: I mean a VLA as defined by C99 and not included in standard C++: `int array[n]` where `n` is a variable. And I am after an example of a use case where it trumps the alternatives offered by other standards (C90, C++11):

int array[MAXSIZE]; // C stack array with compile time constant size
int *array = calloc(n, sizeof(int)); // C heap array with manual free
int *array = new int[n]; // C++ heap array with manual delete
std::unique_ptr<int[]> array(new int[n]); // C++ heap array with RAII
std::vector<int> array(n); // STL container with preallocated size

Some ideas:

  • Functions taking varargs, which naturally limits item count to something reasonable, yet is without any useful API-level upper limit.
  • Recursive functions, where wasted stack is undesirable
  • Many small allocations and releases, where heap overhead would be bad.
  • Handling multi-dimensional arrays (like arbitrarily sized matrices), where performance is critical, and small functions are expected to get inlined a lot.
  • From comment: concurrent algorithm, where heap allocation has synchronization overhead.

Wikipedia has an example which does not fulfill my criteria, because the practical difference compared to using the heap seems irrelevant, at least without context. It is also non-ideal because, without more context, it seems the item count could very well cause a stack overflow.

Note: I'm specifically after an example code, or suggestion of an algorithm which would benefit from this, for me to implement the example myself.

hyde
  • A bit speculative (since this is a hammer looking for a nail), but perhaps `alloca()` would really outshine `malloc()` in a multithreaded environment because of the [lock contention in the latter](http://stackoverflow.com/q/10706466/478288). But this is a real stretch, since small arrays should just use a fixed size, and large arrays will probably need the heap anyway. – chrisaycock Mar 14 '13 at 13:57
  • @chrisaycock Yes, very much a hammer looking for a nail, but a hammer which actually exists (be it a C99 VLA or the not-actually-in-any-standard `alloca`, which I think are basically the same thing). But that multithreaded thing is good, editing the question to include it! – hyde Mar 14 '13 at 14:06
  • One disadvantage of VLAs is that there's no mechanism for detecting an allocation failure; if there's not enough memory, the behavior is undefined. (The same is true for fixed-size arrays -- and for alloca().) – Keith Thompson Mar 14 '13 at 15:38
  • @KeithThompson Well, there's no guarantee that malloc/new detects allocation failure either, for example see Notes for Linux malloc man page (http://linux.die.net/man/3/malloc). – hyde Mar 14 '13 at 15:42
  • @hyde: And it's debatable whether Linux's `malloc` behavior conforms to the C standard. – Keith Thompson Mar 14 '13 at 18:53
  • The performance benefit of a VLA is often much more than 10%. The creation of a VLA is usually a single processor instruction. (Incrementing the stack pointer.) – Gort the Robot Mar 14 '13 at 21:32

3 Answers

10

I just hacked up a little program that generates a set of random numbers restarting at the same seed each time, to ensure that it's "fair" and "comparable". As it goes along, it figures out the min and max of these values. And when it has generated the set of numbers, it counts how many are above the average of min and max.

For VERY small arrays, it shows a clear benefit with VLAs over `std::vector<>`.

It is not a real problem, but we can easily imagine something where we'd be reading the values from a small file instead of using random numbers, and doing some other, more meaningful counting/min/max calculations with the same sort of code.

For VERY small values of the "number of random numbers" (x) in the relevant functions, the VLA solution wins by a huge margin. As the size grows, the "win" gets smaller, and given sufficient size, the vector solution appears to be MORE efficient - I didn't study that variant too much, as when we start having thousands of elements in a VLA, it's not really what they were meant for...

And I'm sure someone will tell me that there's some way of writing all this code with a bunch of templates and get it to do this without running more than the RDTSC and cout bits at runtime... But I don't think that's really the point.

When running this particular variant, I get about 10% difference between the func1 (VLA) and func2 (std::vector).

count = 9884
func1 time in clocks per iteration 7048685
count = 9884
func2 time in clocks per iteration 7661067
count = 9884
func3 time in clocks per iteration 8971878

This is compiled with: g++ -O3 -Wall -Wextra -std=gnu++0x -o vla vla.cpp

Here's the code:

#include <iostream>
#include <vector>
#include <cstdint>
#include <cstdlib>

using namespace std;

const int SIZE = 1000000;

uint64_t g_val[SIZE];


static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}


int func1(int x)
{
    int v[x];

    int vmax = 0;
    int vmin = x;
    for(int i = 0; i < x; i++)
    {
        v[i] = rand() % x;
        if (v[i] > vmax) 
            vmax = v[i];
        if (v[i] < vmin) 
            vmin = v[i];
    }
    int avg = (vmax + vmin) / 2;
    int count = 0;
    for(int i = 0; i < x; i++)
    {
        if (v[i] > avg)
        {
            count++;
        }
    }
    return count;
}

int func2(int x)
{
    vector<int> v;
    v.resize(x); 

    int vmax = 0;
    int vmin = x;
    for(int i = 0; i < x; i++)
    {
        v[i] = rand() % x;
        if (v[i] > vmax) 
            vmax = v[i];
        if (v[i] < vmin) 
            vmin = v[i];
    }
    int avg = (vmax + vmin) / 2;
    int count = 0;
    for(int i = 0; i < x; i++)
    {
        if (v[i] > avg)
        {
            count++;
        }
    }
    return count;
}    

int func3(int x)
{
    vector<int> v;

    int vmax = 0;
    int vmin = x;
    for(int i = 0; i < x; i++)
    {
        v.push_back(rand() % x);
        if (v[i] > vmax) 
            vmax = v[i];
        if (v[i] < vmin) 
            vmin = v[i];
    }
    int avg = (vmax + vmin) / 2;
    int count = 0;
    for(int i = 0; i < x; i++)
    {
        if (v[i] > avg)
        {
            count++;
        }
    }
    return count;
}    

void runbench(int (*f)(int), const char *name)
{
    srand(41711211);
    uint64_t t = rdtsc();
    int count = 0;
    for(int i = 20; i < 200; i++)
    {
        count += f(i);
    }
    t = rdtsc() - t;
    cout << "count = " << count << endl;
    cout << name << " time in clocks per iteration " << dec << t << endl;
}

struct function
{
    int (*func)(int);
    const char *name;
};


#define FUNC(f) { f, #f }

function funcs[] = 
{
    FUNC(func1),
    FUNC(func2),
    FUNC(func3),
}; 


int main()
{
    for(size_t i = 0; i < sizeof(funcs)/sizeof(funcs[0]); i++)
    {
        runbench(funcs[i].func, funcs[i].name);
    }
}
hyde
Mats Petersson
  • Wow, my system shows a 30% improvement in the VLA version over `std::vector`. – chrisaycock Mar 14 '13 at 14:15
  • Well, try with a size range of about 5-15 instead of 20-200, and you'll probably have a 1000% or more improvement. [Also depends on compiler options - I will edit the above code to show my compiler options on gcc] – Mats Petersson Mar 14 '13 at 14:18
  • I just added a `func3` which uses `v.push_back(rand())` instead of `v[i] = rand();` and removes the need for `resize()`. It takes about 10% longer compared to the one using `resize()`. [Of course, in the process, I found that the use of `v[i]` is a major contributor to the time the function takes - I'm a little surprised about that]. – Mats Petersson Mar 14 '13 at 14:46
  • @MikeBrown Do you know of an actual `std::vector` implementation which would use VLA/`alloca`, or is that just speculation? – hyde Mar 14 '13 at 14:47
  • No, the `v[i]` is not a major contributor - it's just that occasionally on my system, the process switches CPU and the TSC is not synced between the processors, so it will read a different TSC which gives bad results. – Mats Petersson Mar 14 '13 at 15:01
  • He's asking about heap-based data structures like Linked List. The vector uses an Array internally. – Michael Brown Mar 14 '13 at 15:07
  • @hyde see my answer, it's only logical to use a VLA to back a Vector. – Michael Brown Mar 14 '13 at 15:10
  • The vector does indeed use an array internally, but as far as I understand, it has no way to use a VLA. I do believe my example shows that VLAs are useful in some (perhaps even many) cases where the amount of data is small. Even if the vector uses VLAs, it would be after additional effort inside the `vector` implementation. – Mats Petersson Mar 14 '13 at 15:11
  • In theory you could write a version of `vector` that uses `alloca`, but I'd be shocked if anyone has ever actually done that. – Gort the Robot Mar 14 '13 at 22:14
  • @StevenBurnap: How do you do that? As soon as the function RETURNS, the `alloca` memory goes away. Its whole point is that it's on the stack. Or do you know any clever trick to tell the compiler to allocate on the stack and still return to the calling function - perhaps some trick using the builtin function `returnaddress()`? How does the vector know that you're not going to do `return v;` somewhere later on - or does the copy constructor know that it shouldn't use `alloca` but some other constructors can? I don't see this working without a lot of tricking around. – Mats Petersson Mar 14 '13 at 23:22
  • I guess it depends on whether template functions act as functions or inline code with reference to `alloca`. – Gort the Robot Mar 15 '13 at 00:47
  • That `rdtsc` trick is neat, goes into my microbenchmark toolbox definitely, thanks :). – hyde Mar 15 '13 at 07:20
  • I think most relevant non-VLA version would simply use `std::unique_ptr v(new int[x]);` instead of the `int v[x];`. I tested that in a VirtualBox (Ubuntu guest on a Win7 host), and got some pretty crazy results (`rdtsc` problem under VM?), so maybe someone could test that too... – hyde Mar 15 '13 at 07:36
  • Another interesting observation, using `alloca` seemed to be consistently faster than C99 VLA (gcc 4.7.2). This seems a bit strange... – hyde Mar 15 '13 at 08:13
  • Interesting discussion. I will have a look after the weekend now - got to do something else for this weekend. – Mats Petersson Mar 15 '13 at 08:29
  • I haven't forgotten this one. Will look tomorrow - I hope. Have a few GB of photos to sort from my weekend's activities. – Mats Petersson Mar 19 '13 at 19:02
1

The reason to use a VLA is primarily performance. It is a mistake to disregard the wiki example as having only an "irrelevant" difference. I can easily see cases where exactly that code could have a huge difference, for instance, if that function was called in a tight loop, where read_val was an IO function that returned very quickly on some sort of system where speed was critical.

In fact, in most places where VLAs are used in this manner, they don't replace heap calls but instead replace something like:

float vals[256]; /* I hope we never get more! */

The thing about any local declaration is that it is extremely quick. The line float vals[n] generally only requires a couple of processor instructions (maybe just one.) It simply adds the value in n to the stack pointer.

On the other hand, a heap allocation requires walking a data structure to find a free area. The time is probably an order of magnitude longer even in the luckiest case. (I.e. just the act of placing n on the stack and calling malloc is probably 5-10 instructions.) Probably much worse if there's any reasonable amount of data in the heap. It would not surprise me at all to see a case where malloc was 100x to 1000x slower in a real program.

Of course, then you also have some performance impact with the matching free, probably similar in magnitude to the malloc call.

In addition, there's the issue of memory fragmentation. Lots of little allocations tend to fragment the heap. Fragmented heaps both waste memory and increase the time required to allocate memory.

Gort the Robot
  • About the Wikipedia example: it could be a *part* of a good example, but without context, more code around it, it doesn't really *show* any of the 5 things enumerated in my question. Otherwise yes, I agree with your explanation. Though one thing to keep in mind: using VLAs can add a cost to accessing local variables, since with them the offsets of local variables are not necessarily known at compile time, so care must be taken not to replace a one-time heap cost with an inner-loop penalty for every iteration. – hyde Mar 15 '13 at 06:57
  • Um...not sure what you mean. Local variable declarations are a single operation and any mildly optimized compiler will pull the allocation out of an inner loop. There's no particular "cost" in accessing local variables, certainly not one that a VLA will increase. – Gort the Robot Mar 15 '13 at 16:16
  • Concrete example: `int vla[n]; if(test()) { struct LargeStruct s; int i; }`: the stack offset of `s` will not be known at compile time, and it is also doubtful whether the compiler will move the storage of `i` out of the inner scope to a fixed stack offset. So extra machine code is needed because of the indirection, and this can also eat up registers, which is important on PC hardware. If you want example code with compiler assembly output included, please ask a separate question ;) – hyde Mar 15 '13 at 16:30
  • The compiler doesn't have to allocate in the order encountered in the code, and it does not matter if space is allocated and not used. A smart optimizer would allocate space for `s` and `i` when the function is entered, before `test` is called or `vla` is allocated, as the allocations for `s` and `i` have no side effects. (And, in fact, `i` might even be placed in a register, meaning there is no "allocation" at all.) There are no compiler guarantees about the order of allocations on the stack, or even that the stack is used. – Gort the Robot Mar 15 '13 at 17:06
  • (deleted a comment which was wrong due to a stupid mistake) – hyde Mar 15 '13 at 22:41
  • I did a quick test, VLA array vs. fixed-length local array, and the essential difference in an inner loop was `movl %edi, %ecx` with fixed length array vs. `movl -4040(%ebp), %ecx` with VLA. Total 11 instructions in loop, 2 other mem accesses and the extra one in VLA version was extra level of indirection to other of these. I'd say significant memory accessing increase, something *to be aware of*. Total line count difference was about 30 in the .S file, but that doesn't make practical difference. (gcc 4.7.2, 32bit, -O3) – hyde Mar 15 '13 at 22:49
  • Correction to above: at another glance, the extra mem access in VLA version was actually simply running out of registers, and needing to put one more local variable to stack. – hyde Mar 15 '13 at 23:03
0

Regarding VLAs versus a Vector

Did you consider that a vector could take advantage of VLAs itself? Without VLAs, the vector has to pick certain "scales" of arrays, e.g. 10, 100, 10000, for storage, so you end up allocating a 10000-item array to hold 101 items. With VLAs, if you resize to 200, the algorithm might assume that you'll only need 200 and can allocate a 200-item array. Or it can allocate a buffer of, say, n*1.5.

Anyway, I'd argue that if you know how many items you'll need at runtime, a VLA is more performant (as Mats' benchmark demonstrated). What he demonstrated was a simple two-pass iteration. Think of Monte Carlo simulations where random samples are taken repeatedly, or image manipulation (like Photoshop filters) where computations are done on each element multiple times, and quite possibly each computation on each element involves looking at neighbors.

That extra pointer jump from the vector to its internal array adds up.

Answering the main question

But when you talk about using a dynamically allocated structure like a linked list, there is no comparison. An array provides direct access to its elements using pointer arithmetic. Using a linked list you have to walk the nodes to get to a specific element. So the VLA wins hands down in this scenario.

According to this answer, it is architecture-dependent, but in some cases memory access on the stack will be faster due to the stack being resident in the cache. With a large number of elements this may be negated (potentially the cause of the diminishing returns Mats saw in his benchmarks). However, it's worth noting that cache sizes are growing significantly, and you'll potentially see that number grow accordingly.

Michael Brown
  • I'm not sure I understand your reference to linked lists, so I added a section to the question, explaining the context a bit more and adding examples of alternatives I'm thinking of. – hyde Mar 14 '13 at 15:33
  • Why would a `std::vector` need scales of arrays? Why would it need space for 10K elements when it only needs 101? Also, the question never mentions linked lists, so I'm not sure where you got that from. Finally, VLAs in C99 are stack-allocated; they are a standard form of `alloca()`. Anything that requires heap storage (it lives around after the function returns) or a `realloc()` (the array resizes itself) would prohibit VLAs anyway. – chrisaycock Mar 14 '13 at 15:35
  • @chrisaycock C++ lacks a realloc() function for some reason, assuming memory is allocated with new[]. Isn't that the main reason why std::vector must use scales? – Mar 14 '13 at 15:42
  • @Lundin Does C++ scale the vector by powers of ten? I just got the impression that Mike Brown was really confused by the question, given the linked list reference. (He also made an earlier assertion that implied C99 VLAs live on the heap.) – chrisaycock Mar 14 '13 at 15:57
  • @hyde I didn't realize that's what you were talking about. I thought you meant other heap based data structures. Interesting now that you've added this clarification. I'm not enough of a C++ geek to tell you the difference between those. – Michael Brown Mar 14 '13 at 15:58
  • @chrisaycock I never said VLAs live on the heap. I thought hyde was asking about VLAs versus other data structures. And I don't think the C++ vector scales by powers of ten but there is memory overhead using vectors that won't be there when using a vla or a heap-allocated array. The question comes down to heap vs stack performance which is answered [here](http://stackoverflow.com/questions/3455987/comparison-of-access-performance-of-data-in-heap-and-stack) – Michael Brown Mar 14 '13 at 16:05
  • @MikeBrown The question/answer you link to is asking about cache effects of accessing the stack or heap *after* the array has been allocated. The OP here (hyde) is asking about performance differences *during the allocation*. The stack is always faster to allocate on than the heap because it's just a movement of the frame pointer. `std::vector` can't use VLAs because the frame pointer will have been moved back once the constructor returns. – chrisaycock Mar 14 '13 at 16:11
  • @chrisaycock no he asks about performance in general: allocation, accessing, deallocation. – Michael Brown Mar 14 '13 at 16:25
  • @chrisaycock I doubt the standard specifies how vector should scale, though I think I've seen some implementation where it scales exponentially: n, n*2, n*4, n*8 and so on. –  Mar 14 '13 at 22:27