4

Most higher-level (or scripting) languages have data structures that can hold different types of data (numbers, strings, even functions) in the same structure, and they let you add elements freely without declaring a size up front. They are sometimes called lists.

In contrast, lower-level languages (say C++, or better yet C, which is the base language of many scripting language implementations) offer quite the opposite: built-in arrays have a fixed size and can hold only one specific type of data. You could keep writing items past the declared size, but that would be dangerous, because you would be writing outside the memory you allocated.

So how are these data structures found in higher-level languages implemented in the lower-level language they are written in?

The closest thing I managed to write was a class in C++ which allocated a certain amount of memory and, when it was full, allocated a larger block, copied the elements into it and freed the first one, but that doesn't seem efficient at all, and it only allows items of one specific type, added only in a stack-like way (you can only push or pop, not insert an item at an arbitrary index).

user6245072
  • 303
  • 1
  • 7
  • When you say "dynamic array," what do you mean exactly by that? It's kind of a contradiction in terms; arrays are of a fixed size by definition. If you need it to be bigger, copying the original data to a new, larger array is exactly what you do. Not sure what your second paragraph means, other than C and C++ allow you to shoot yourself in the foot; you're probably referring to *bounds checking.* The data structures you refer to in your first paragraph are not arrays. There's no such thing as a sizeless array, AFAIK. – Robert Harvey Jul 05 '16 at 15:03
  • @Robert Harvey How should I call them? Lua calls them tables. Python and Lisp call them lists. I am referring to _these_. – user6245072 Jul 05 '16 at 15:10
  • Those are not arrays. You call them lists and tables. – Robert Harvey Jul 05 '16 at 15:11
  • @Robert Harvey fixing with 'lists'. – user6245072 Jul 05 '16 at 15:12
  • See here: https://en.wikipedia.org/wiki/Linked_list – Robert Harvey Jul 05 '16 at 15:13
  • @Robert Harvey If I understood it, in Linked Lists to search for the `n`th element of the list you must iterate through all elements at index [0..n - 2]. Also since pointers have a size too, those lists can't hold different types of data, which isn't what I am looking for. – user6245072 Jul 05 '16 at 15:20
  • Well, the pointers themselves typically have a *fixed* size. Perhaps you can provide a code example of what you're talking about? I think you're confusing the *list* with the things that it stores. – Robert Harvey Jul 05 '16 at 15:22
  • @Robert Harvey I mean if the first element is an integer and the second element is a function, the first element will have a data section of type integer, but the pointer can't be of type integer because it has to point to a function. – user6245072 Jul 05 '16 at 15:28
  • In C, pointers are just memory addresses (i.e. they are typeless). In C#, you can use `dynamic` or `object` and avoid that whole problem. – Robert Harvey Jul 05 '16 at 15:31
  • @Robert Harvey I've only used C++ yet. – user6245072 Jul 05 '16 at 15:33
  • http://stackoverflow.com/questions/8975526/c-array-of-different-objects-know-how-to-do-it-in-java – Robert Harvey Jul 05 '16 at 15:36
  • 2
    @RobertHarvey I believe ["dynamic array"](https://en.wikipedia.org/wiki/Dynamic_array) is the right term to use. – svick Jul 05 '16 at 16:17
  • @svick: *[shrug]* A number of .NET data structures use this kind of dynamic expansion of an array under the hood, but the term *dynamic array* is never used in those contexts (it being merely an implementation detail). One could expand an ordinary array in this fashion, but that doesn't make the resulting array any more "dynamic" than it was, other than its size has just changed. Perhaps the term is more common in other environments. – Robert Harvey Jul 05 '16 at 16:20
  • @RobertHarvey: it is the first term coming to my mind for this kind of data structure, and since there is also a [Wikipedia](https://en.wikipedia.org/wiki/Dynamic_array) entry for it, it seems to be quite popular. – Doc Brown Jul 05 '16 at 21:06
  • Most often, lists are implemented as double-ended queues (deques) https://en.wikipedia.org/wiki/Double-ended_queue – shawnhcorey Jul 06 '16 at 12:43
  • @shawnhcorey - I don't think "most often" is really true. Deques are a very useful data structure, but they're not as commonly used as perhaps they should be. Java and C# both have "ArrayList" type classes, which works effectively the same way OP describes. C++ has `std::deque`, but most code I've seen uses `std::vector` in preference. [CPython](http://www.laurentluce.com/posts/python-list-implementation/) uses an arraylist-type implementation, as does [Ruby](http://stackoverflow.com/questions/7310015/ruby-array-internals). I'm not sure what languages actually use deques for this...? – Jules Jul 06 '16 at 16:23
  • 2
    @RobertHarvey: I would have no problem with using "array" to mean an associative array, a bit array, a sparse array or a jagged array. The notion that an "array" must be a dense-integer-indexed contiguous block of memory is an artifact of long familiarity with C, not a requirement of the abstract data type. I note that PHP and JavaScript both call sparse ordered maps "arrays". – Eric Lippert Jul 06 '16 at 17:43

5 Answers

9

I guess the kind of "lists" you had in mind are not the "linked lists" mentioned by Robert Harvey, but the kind of arrays which are called `list` in Python, `List` or `ArrayList` in C#, `ArrayList` in Java, or `std::vector` in C++. Another popular term for this kind of data structure is "dynamic array".

These are indeed implemented internally exactly the way you sketched: each of them has a current capacity, and when that limit is reached, it automatically allocates a new, larger block of memory and copies the contents to that new block. A popular, general-purpose reallocation strategy is to grow the block by a constant factor somewhere between 1 and 2. For most real-world programs this works surprisingly efficiently, and if you hit a case where it is not efficient enough, you can often control the capacity manually (for example with `std::vector::reserve`).
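
A minimal sketch of that growth scheme, restricted to `int` elements for brevity (this is just the idea, not how any particular standard library actually implements its container):

```cpp
#include <cstddef>

// A minimal, int-only dynamic array that doubles its capacity when it is full.
// Only a sketch of the growth idea, not a production container.
class IntVector {
public:
    IntVector() = default;
    IntVector(const IntVector&) = delete;            // copying omitted to keep the sketch short
    IntVector& operator=(const IntVector&) = delete;
    ~IntVector() { delete[] data_; }

    void push_back(int value) {
        if (size_ == capacity_) grow();
        data_[size_++] = value;
    }

    int& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }

private:
    void grow() {
        std::size_t newCapacity = capacity_ == 0 ? 8 : capacity_ * 2;  // growth factor 2
        int* newData = new int[newCapacity];
        for (std::size_t i = 0; i < size_; ++i)   // copy the existing elements...
            newData[i] = data_[i];
        delete[] data_;                           // ...and free the old block
        data_ = newData;
        capacity_ = newCapacity;
    }

    int* data_ = nullptr;
    std::size_t size_ = 0;
    std::size_t capacity_ = 0;
};
```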

And yes, these lists typically support only elements of the same data type. But in Java and C#, all data types derive from (or can be boxed to a type derived from) a common base class `object`. This lets the dynamic array internally store a reference to each element, which allows values of different types: the references themselves are all the same size, regardless of the type of the content. Though I am not a Python expert, I am pretty sure the list implementation works internally in a similar manner.

Since you wrote that you have only used C++ so far: a `std::vector` in C++ is bound to one element type, but by storing `void *` or pointers to a common base class of an inheritance hierarchy, you can work around this limitation. With the latter, C++ can track the actual type of each element for you through a mechanism called RTTI; with the former, you have to manage the type information yourself somehow.
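
For illustration, a sketch of the base-class-pointer approach could look like this (the `Value` hierarchy and its members are made up for the example):

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Illustrative value hierarchy: every element type derives from a common base.
struct Value {
    virtual ~Value() = default;
};
struct IntValue : Value {
    int n;
    explicit IntValue(int n) : n(n) {}
};
struct StringValue : Value {
    std::string s;
    explicit StringValue(std::string s) : s(std::move(s)) {}
};

int main() {
    // One vector holding "different types" via pointers to the common base.
    std::vector<std::unique_ptr<Value>> list;
    list.push_back(std::make_unique<IntValue>(42));
    list.push_back(std::make_unique<StringValue>("hello"));

    for (const auto& v : list) {
        // RTTI (dynamic_cast) recovers the concrete type of each element.
        if (auto* i = dynamic_cast<IntValue*>(v.get()))
            std::cout << "int: " << i->n << '\n';
        else if (auto* s = dynamic_cast<StringValue*>(v.get()))
            std::cout << "string: " << s->s << '\n';
    }
}
```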

Doc Brown
  • 199,015
  • 33
  • 367
  • 565
  • Another implementation I thought of was, when the first block is full, allocate another block of the same size and, when accessing the array, if the index is less than the size of the block then access the first block, otherwise access the second block with index = index - size. This allows you to use as much memory as you want without periodically iterating over an increasing number of objects to copy them. But still, it doesn't allow elements of different data types unless you use sizeless pointers, which are unavailable in C / C++. – user6245072 Jul 05 '16 at 16:00
  • @user6245072: that will work (I implemented something similar more than 20 years ago in C++, when the standard lib was not available). However, you asked how it is implemented nowadays in some popular languages. Read my completed answer about how to handle different data types. – Doc Brown Jul 05 '16 at 16:09
  • ... and, you probably do not mean "sizeless lists" or "sizeless pointers", these terms will lead to confusion. You mean "dynamic size lists/arrays" and "pointers to elements of arbitrary size". – Doc Brown Jul 05 '16 at 16:16
  • To be honest, even with this second method, you still have to keep an array of the 'blocks' you have, and you still have to expand that by copying its elements, but this implies copying far fewer objects. – user6245072 Jul 05 '16 at 20:16
  • "A popular, general purpose reallocation strategy is to enlarge the block with a constant factor somewhere between 1 and 2." – That's not just "popular", that's *required* to get an amortized cost of O(1). Without exponential resizing, you don't get the amortization effect, so your amortized cost stays the same as the worst-case cost, namely O(n). Intuitively, resizing is O(n), but if you exponentially resize, you only have to resize every O(n) additions (which are themselves O(1)), and thus can pay off the cost of resizing over time. – Jörg W Mittag Jul 05 '16 at 21:40
  • @Jörg W Mittag are you talking about the method I described in the comment above? – user6245072 Jul 05 '16 at 22:20
  • @JörgWMittag: if you are nitty, allow me to be nitty, too ;-) Having the constant factor "between 1 and 2" is not required, it could be also bigger. And I guess there would be possibilities to vary the factor and still get O(1) as amortized cost. – Doc Brown Jul 06 '16 at 05:55
  • @user6245072 - the list implementation you describe in your comment above was the default list implementation used in Borland's TurboVision and ObjectWindows frameworks. It's an interesting compromise -- you get the speed of an array for small lists but the flexibility of a linked list for larger ones. – Jules Jul 06 '16 at 16:27
7

> The closest thing I managed to write was a class in C++ which allocated a certain amount of memory and, when it was full, allocated a larger block, copied the elements into it and freed the first one, but that doesn't seem efficient at all, and it only allows items of one specific type, added only in a stack-like way (you can only push or pop, not insert an item at an arbitrary index).

First of all, good for you for doing this exercise. It is highly educational to try to figure out how these higher-level data structures are implemented in a low-level language. Let me give you some additional thoughts on this.

  • Try going lower. I notice for example that you take for granted that there is such a thing as a memory allocator, and that it gives you back a block of the size you want, and you can free it when you are done with it. There was a day when those functions did not exist. Someone had to write them for you. Suppose you had functions that could allocate and free memory, but only in 4k blocks. How would you write malloc and free, given only those functions? This is a highly educational exercise; a rough sketch of one way to start appears after this list.

  • Be less vague. You say that your proposed "if it doesn't fit, allocate a new block and copy" strategy is "inefficient". Efficiency is defined as value produced divided by resources consumed. State what value and resources you care about, and then work out the actual numeric efficiency of your implementation.

  • Implement more features. You say that you cannot add an item in the middle of a resize-when-full list, but why not? You can probably figure out how to do so. What is the efficiency of this operation? What about deleting an item from the middle of a list? Your list gets bigger when it gets full; does it get smaller when it gets empty? What's the efficiency of that?

  • Try a different data structure. Can you implement a linked list? What about a binary tree? What about a balanced binary tree? Can you use them as your abstract list data type? Work out the efficiency of each operation in each of these implementation choices.

  • Embrace genericity. Can you use templates to make abstract lists that contain elements of a particular type?
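
To make that first bullet concrete, here is a deliberately simplified sketch of a malloc/free built only on a primitive that hands out fixed 4 KiB blocks. The names `get_block`, `my_malloc`, and `my_free` are made up for illustration; a real allocator would also split large chunks, coalesce freed neighbours, and be thread safe.

```cpp
#include <cstddef>
#include <new>

constexpr std::size_t kBlockSize = 4096;

// Stand-in for the primitive that only hands out 4 KiB blocks.
void* get_block() { return ::operator new(kBlockSize); }

struct Chunk {            // header placed in front of every piece of memory we manage
    std::size_t size;     // usable bytes that follow this header
    Chunk*      next;     // next chunk on the free list (only meaningful while free)
};

Chunk* free_list = nullptr;

void my_free(void* p) {
    if (!p) return;
    Chunk* c = reinterpret_cast<Chunk*>(static_cast<char*>(p) - sizeof(Chunk));
    c->next = free_list;              // push the chunk back onto the free list
    free_list = c;
}

void* my_malloc(std::size_t n) {
    // First fit: reuse any free chunk that is big enough
    // (handed out whole; a real allocator would split it).
    for (Chunk** cur = &free_list; *cur; cur = &(*cur)->next) {
        if ((*cur)->size >= n) {
            Chunk* c = *cur;
            *cur = c->next;           // unlink it from the free list
            return c + 1;             // usable memory starts right after the header
        }
    }
    // Nothing fits: grab a fresh 4 KiB block and carve one chunk out of it.
    if (n > kBlockSize - sizeof(Chunk)) return nullptr;  // oversized requests unsupported here
    char* block = static_cast<char*>(get_block());
    Chunk* c = reinterpret_cast<Chunk*>(block);
    c->size = kBlockSize - sizeof(Chunk);
    return c + 1;
}
```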

Eric Lippert
  • 45,799
  • 22
  • 87
  • 126
4

Arrays, like all data structures, trade one kind of efficiency for another.

The efficiency that an array specializes in is that of rapidly looking up an element by its index. It does that, not by searching through the array for the correct element, but by performing a mathematical calculation. If an array has elements of size 8 bytes and you want to look up the nth element (counting from zero), it will be located at the memory location 8 * n bytes from the beginning of the array.

If you want to add m elements to the end of an array, you can do that by simply allocating the additional memory at the end of the array, if it is free. If that memory is not free, you have to make a new array and copy the old data to it.


Lists work differently from arrays. Lists are data structures that link one element to the next with some kind of pointer. Each element can be stored anywhere in memory, so looking up an element quickly by index is no longer possible. You have to traverse the list, hopping from element to element through the links, until you find the element you want. Some lists have indexes such as b-trees or hashes that make this search process much faster, but it's still not as fast as calculating an index location in an array.
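
The difference in lookup cost can be sketched in a few lines (illustrative only):

```cpp
#include <cstddef>

// Array: the nth element's address is computed directly from the base address.
double array_lookup(const double* a, std::size_t n) {
    return a[n];                          // roughly: *(base + 8 * n)
}

// Singly linked list: reaching the nth element means following n links.
struct Node {
    double value;
    Node*  next;
};

double list_lookup(const Node* head, std::size_t n) {
    const Node* cur = head;
    for (std::size_t i = 0; i < n; ++i)   // O(n) hops instead of one address calculation
        cur = cur->next;
    return cur->value;
}
```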

Robert Harvey
  • 198,589
  • 55
  • 464
  • 673
  • You made me think that before looking on how to _implement_ a data structure, I need to understand _what_ data structure I want to implement. I'll take a look at the documentation of the scripting languages I want to emulate and understand what type of data structure they are, then if necessary ask a new question on how to actually implement them, so for this question I'm accepting your answer. – user6245072 Jul 05 '16 at 15:26
2

To answer your actual question:

> Most higher-level (or scripting) languages have data structures that can hold different types of data (numbers, strings, even functions) in the same structure ... So how are these data structures found in higher-level languages implemented in the lower-level language they are written in?

You already deduced how variable-sized lists are implemented: there's an array somewhere, it is monitored for fullness, and reallocated when full. But what about list elements all of different sizes?

I can certainly describe for you how we implemented just such a data structure in the mid-1990s implementation of JavaScript at Microsoft. The solution is very simple and will be clear to any C developer: we used a discriminated union big enough to hold the largest thing that would go in the list.

To be precise, the discriminated union type was the OLE Automation VARIANT type used by Visual Basic, which is described here: https://en.wikipedia.org/wiki/Variant_type.

Since each VARIANT is 16 bytes, all you need to do is allocate an array of 16-byte structs and double it when it gets full. That means an array of a thousand two-byte integers takes 16K rather than 2K, but who cares?
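
A stripped-down sketch of that idea (this is not the real VARIANT layout, which has many more type tags; it is just the shape of a discriminated union with a fixed-size slot):

```cpp
#include <cstdint>

// A toy tagged ("discriminated") union: one fixed-size slot that can hold
// any of several types, plus a tag saying which one it currently holds.
enum class Tag : std::uint8_t { Empty, Int32, Float64, ObjectPtr };

struct Variant {
    Tag tag = Tag::Empty;
    union {
        std::int32_t i32;
        double       f64;
        void*        obj;   // e.g. a pointer to a heap-allocated string or function object
    };
};

// Every element is the same size, so a plain dynamic array of Variant
// can hold "values of different types".
Variant make_int(std::int32_t v)  { Variant x; x.tag = Tag::Int32;   x.i32 = v; return x; }
Variant make_double(double v)     { Variant x; x.tag = Tag::Float64; x.f64 = v; return x; }
```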

Sadly, this solution is insufficient; we have another problem. In JavaScript, arrays are sparse and associative. That is, arrays can be indexed 0, 1, 5, 1000, "giraffe", if you like. So a C-style dense, integer-indexed array is not going to cut it.

In fact JavaScript arrays used a dynamically grown hash table for lookups, and an associated linked list for enumerations. That way lookups were sublinear but enumerations did not change order as items were added. Indexing was achieved by converting indexes to strings and then hashing the string. To further improve performance in loops, the hash buckets were MRU linked lists.

Implementing such a data structure in C++ is highly educational; give it a shot.
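
For orientation only, the overall shape might look something like the sketch below; the names are made up, `std::unordered_map` stands in for the engine's own hash table, values are plain doubles for brevity, and the MRU buckets and growth policy are left as the exercise:

```cpp
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Rough shape of a sparse, associative "array": a hash table gives fast key
// lookups, and a separate list remembers insertion order so enumeration does
// not change as items are added.
class SparseArray {
public:
    void set(const std::string& key, double value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = value;              // existing key: overwrite in place
        } else {
            order_.emplace_back(key, value);         // new key: append to enumeration order
            index_.emplace(key, std::prev(order_.end()));
        }
    }

    const double* get(const std::string& key) const {
        auto it = index_.find(key);
        return it == index_.end() ? nullptr : &it->second->second;
    }

private:
    std::list<std::pair<std::string, double>> order_;  // enumeration in insertion order
    std::unordered_map<std::string,
                       std::list<std::pair<std::string, double>>::iterator> index_;  // fast lookup
};
```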

Eric Lippert
  • 45,799
  • 22
  • 87
  • 126
0

Mixed types are accommodated by storing pointers to objects.

> The closest thing I managed to write was a class in C++ which allocated a certain amount of memory and, when it was full, allocated a larger block, copied the elements into it and freed the first one, but that doesn't seem efficient at all

It's quite efficient in time. Typically the new size is twice the old size. So suppose the original size is 128: it doubles to 256, then to 512, and then to 1024. The first 128 elements have been copied three times, the next 128 twice, and the next 256 once, i.e. 896 copies in total for 1024 elements. In general, the average number of copies per element is bounded by Sum(n/(2^n)) = 2, which is why appending stays O(1) amortized.

kevin cline
  • 33,608
  • 3
  • 71
  • 142
  • It's not the copy itself. It's the iteration on each single element that scares me. – user6245072 Jul 05 '16 at 22:17
  • You can avoid at least some of the cost of the per-element copy loop if you design the underlying data structure with expansion in mind. Instead of a "List", think in terms of e.g. a "List of Lists". Allocate the initial (empty) list of lists, and the first list in the list of lists. When you've filled up the first list, allocate another one, and cons it onto the list of lists. Your list handlers must know to walk the list of lists, and how to walk the individual lists IN the list of lists. – John R. Strohm Jul 05 '16 at 22:46
  • @user6245072 on average each item is only touched two or three times. – kevin cline Jul 05 '16 at 23:38
  • @kevin cline wouldn't using `realloc` avoid even that iteration? @John R. Strohm but the list of lists still has to be expanded, and if it's an array of arrays instead, you still have to copy something from here to there. It's far fewer items you have to copy, but it's still something. With lists, inserting and removing items isn't a problem, so expanding is trivial. But I'd avoid them as the base structure for a scripting language because of the time it takes to access each element. – user6245072 Jul 06 '16 at 08:21
  • @user6245072 what do you think realloc does? it can expand an allocation only if the adjacent memory is free. Otherwise it finds a new block of memory and copies the data. – kevin cline Jul 06 '16 at 08:48
  • @kevin cline I hoped the copying was achieved in a faster way than by iterating over each element and copying. – user6245072 Jul 06 '16 at 09:00
  • It may be done by a hardware bulk-copy instruction, but the data still has to be copied. The bulk copy is faster than a programmed loop but the time will still increase with the data copied. – kevin cline Jul 06 '16 at 09:06