
The problem

We need to store data in a table-like way, but we have very strict space constraints (~1Mb per table of 10k+ rows). We store data like this:

ID | reviews | factor | score | interval | etc.
---+---------+--------+-------+----------+-----
 1 |     244 |    2.4 |    10 |     4268 | ...

in a simple binary format (a one-dimensional array of bytes, where the index of each row can be calculated simply by knowing the length of each row, which is fixed).

There is only one function that ever reads this data (it gets a row by its index), and only one function that appends a new row (to the end). Removing items from the table will never be required (the table is append-only). Both functions are covered by a decent number of unit tests.
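
Roughly, the storage and the two functions look like this (a simplified Python sketch; the actual field layout and `struct` format string are placeholders):

```python
import struct

# Hypothetical row layout: ID, reviews, factor, score, interval.
# The real format string and field order are placeholders here.
ROW_FORMAT = "<QHfHI"
ROW_SIZE = struct.calcsize(ROW_FORMAT)

def get_row(table, index):
    """Read one row from the flat byte array by its position."""
    return struct.unpack_from(ROW_FORMAT, table, index * ROW_SIZE)

def append_row(table, row):
    """Append one row to the end of the table (a bytearray)."""
    table.extend(struct.pack(ROW_FORMAT, *row))

table = bytearray()
append_row(table, (1, 244, 2.4, 10, 4268))
assert get_row(table, 0)[1] == 244
```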

The problem is the following: we need to be able to quickly go through rows ordered by different columns. In other words, we need the data to be sorted by at least two columns.

A simple solution

To solve this problem, we would implement indexes, which would again be chunks of binary data. Intuitively, I would do this by creating ordered data structures that only list the index of each row in the original table:

factor_index        score_index
------------        -----------
          6                  2
          2                  1
          3                  6
          1                  4
          .                  .

The function that appends a new row to the table would have to be updated to cause the indexes to be updated as well.

EXAMPLE: To get the first item ordered by score, we just look up the first value in the index table for score (2) and get the corresponding row from the original table (the third row if we agree that the table is zero-indexed).
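
A rough sketch of this simple version, reusing `get_row` and `append_row` from the sketch above (the column positions are again placeholders):

```python
import bisect

FACTOR, SCORE = 2, 3   # positions of the columns in an unpacked row

class SimpleTable:
    def __init__(self):
        self.table = bytearray()
        self.count = 0
        self.factor_index = []   # row positions, ordered by factor
        self.score_index = []    # row positions, ordered by score

    def append(self, row):
        append_row(self.table, row)
        pos = self.count
        self.count += 1
        self._insert_sorted(self.factor_index, pos, FACTOR)
        self._insert_sorted(self.score_index, pos, SCORE)

    def _insert_sorted(self, index, pos, column):
        # Keep the index ordered by the column value of the rows it points to.
        keys = [get_row(self.table, p)[column] for p in index]
        where = bisect.bisect(keys, get_row(self.table, pos)[column])
        index.insert(where, pos)

    def nth_by_score(self, n):
        # "First item ordered by score" is simply score_index[0].
        return get_row(self.table, self.score_index[n])
```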

However, it was suggested to me that I take a different approach.

A more complex but presumably safer version

Instead of only storing the indexes, we duplicate the ID field in each index table:

factor_index | ID        score_index | ID
-------------+---        ------------+---
          6  | 46                  2 |  8
          2  |  8                  1 | 14
          3  | 91                  6 | 46
          1  | 14                  4 | 60
          .  |  .                  . |  .

Then keep the original table sorted by ID, and use the indexes only as a starting position for a binary search in the original table.

The function that appends a new record will now have to do a binary search by ID to find where to insert the new row, and cause the indexes to be updated.

EXAMPLE: To get the first item ordered by score, we look up the first row in the index table for score (2, 8) and use the index (2) as the starting position for a binary search in the table. If the data is valid, we don't even need the binary search, because at position 2 we will find the row with ID 8. If, however, the record at position 2 has a different ID, we fall back to the binary search to find the right one, and log the error.

The argument for this approach is that it will work even if the index points to the wrong row in the table.
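
A sketch of the corresponding lookup (simplified: it falls back to a full binary search by ID rather than one seeded at the index position, and reuses `get_row` from above):

```python
import logging

ID = 0   # position of the ID field in an unpacked row

def find_by_id(table, count, row_id):
    """Binary search the main table, which is kept sorted by ID."""
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        mid_id = get_row(table, mid)[ID]
        if mid_id < row_id:
            lo = mid + 1
        elif mid_id > row_id:
            hi = mid
        else:
            return mid
    return None

def nth_by_score(table, count, score_index, n):
    pos, row_id = score_index[n]       # the index stores (position, ID) pairs
    row = get_row(table, pos)
    if row[ID] == row_id:              # happy path: the index is valid
        return row
    # The index pointed at the wrong row: recover via binary search and log it.
    logging.error("index out of sync at %d, expected ID %d", pos, row_id)
    found = find_by_id(table, count, row_id)
    return get_row(table, found) if found is not None else None
```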

I find it hard to believe though that this approach is really better, for the following reasons:

  • It requires a binary search, which can be a new source of bugs.
  • It requires the table to be kept in order, which implies more complex insert (as opposed to a simple append).
  • It does not guard against the main table being out of order: if that happens, the index could even fail to find the record via binary search.
  • It requires writing (and testing) code that is never even meant to be run.
  • It uses more data than what is necessary.

The question

It is a very high priority for our application that the above data is always valid. But does that justify writing more complex data structures and lookup mechanisms to guard against edge cases that may or may not happen? Shouldn't the time and effort instead be put into writing more robust test cases for a simpler version?

Attila O.
    "Applications programming is a race between software engineers, who strive to produce idiot-proof programs, and the universe which strives to produce bigger idiots. So far the Universe is winning." --Rick Cook (*The Wizardry Compiled*) –  Mar 03 '13 at 03:16

3 Answers


If I'm understanding your indexes correctly, they aren't the most efficient way to store them.

You can't sort your table on two keys at once anyway, thus I don't think you should try to sort it at all. Rather, sort your indexes.

With 10k rows, a two-byte value can refer to any entry in your table. Thus build two arrays that are initially seeded with 1..10k (or however many entries are in your table). While these are not pointers in the CPU's sense of the word, use them as pointers anyway: sort both arrays based on the corresponding values in the table.

Insert is handled by simply appending the record and then rebuilding the arrays. Yes, that's a fairly expensive operation, but since you have specified that the arrays aren't all that big and can't grow too much, it shouldn't need to be done too often. Anything you do is inherently at least O(n), while a full resort is only O(n log n), so I would take the latter route. (You might even find it's faster, as it involves a lot less writing to main memory than moving the records around.)

Note that these arrays are simply two-byte values, NOT key-value pairs like you seem to be indicating.
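
In Python, for instance, rebuilding both pointer arrays is one sort each (a rough sketch that reuses the question's `get_row`; the column positions are assumed, and `"H"` holds two-byte values, enough for anything under 65,536 rows):

```python
from array import array

def build_indexes(table, count):
    """Rebuild both sort orders as arrays of two-byte row positions."""
    by_factor = array("H", sorted(range(count), key=lambda p: get_row(table, p)[2]))
    by_score = array("H", sorted(range(count), key=lambda p: get_row(table, p)[3]))
    return by_factor, by_score

# Called after every append; an O(n log n) resort of 10k two-byte entries is cheap.
```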

A couple of other things also come to mind--you seem unusually concerned with the data size. That says to me that either you are transmitting these blocks (in which case the indexes can be omitted, as they can be recreated on the other end) or you have a whole bunch of them in memory at once. In the latter case, if you are using a language that supports weak references, use them--let the indexes of blocks that are not actively being used get garbage collected, and recreate them when they are needed.

Loren Pechtel
  • In your last paragraph, you forget the option “premature optimization”. For 10k rows, there's no need to save anything. – Donal Fellows Mar 03 '13 at 07:32
  • *either you are transmitting these blocks .. or you have a whole bunch of these in memory at once* .. or their application is running on a platform with extremely tight memory constraints (e.g. some embedded system). – Carson63000 Mar 03 '13 at 07:48
  • To justify memory constraints: the entire table and both indexes should fit in one entity in the App Engine Datastore, which sets the limit to 1Mb, but the data compresses quite well, so I apply a quick `zlib` compression. Also, I wish not to decode each row in memory just to find one. But thanks for your answer, you seem to be suggesting to take the simple approach. (And yes, I use two-byte indexes too.) – Attila O. Mar 03 '13 at 11:27
  • @AttilaO.: Note that my approach does **NOT** require saving the indexes, they can be recreated when it's loaded. – Loren Pechtel Mar 04 '13 at 04:00
  • @LorenPechtel we would still save the indexes because most of the time we will just want to load the first `n` rows, without having to go through the rest. – Attila O. Mar 04 '13 at 07:39
  • @AttilaO.: You said there were two sort keys--"first" is not a meaningful concept. – Loren Pechtel Mar 05 '13 at 05:13
  • @LorenPechtel you're right — by "first" I meant first by one specific sort key. It has to work with both sort orders though. – Attila O. Mar 05 '13 at 11:14

If data validity is critical, then any transformation must take the data from a valid state to a valid state. The transformation mechanisms should ensure the validity of the output given valid input. An unsuccessful transformation should fail safely, leaving the data in a valid state.

Unit testing can only ensure that the transformation is valid under each condition you thought of while writing the tests. Building consistency into all data transformation methods ensures that the data remains valid under all possible conditions.

So, if data validity is a high priority, I suggest you build data modification methods that are always valid, and test them thoroughly. Don't trust consistency to coincidence.
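
For example (a rough sketch, with `rebuild_indexes` as a hypothetical helper): construct the new table and indexes completely before committing them, so that a failure part-way through leaves the previous, valid state untouched.

```python
def append_checked(state, row):
    """Take a (table, count, indexes) state from valid to valid."""
    table, count, indexes = state
    new_table = bytearray(table)       # work on copies ...
    append_row(new_table, row)
    new_count = count + 1
    new_indexes = rebuild_indexes(new_table, new_count)   # hypothetical helper
    # ... so an exception anywhere above leaves the old state untouched.
    return new_table, new_count, new_indexes
```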

pcurry

First, one thing one can hold against your arguments:

  • if one wants to use IDs the way you describe, one would surely first implement an ID-to-row index (so ID lookups don't require the original table to be in any particular order, and you don't have to implement a binary search on that table)

But if you think about this for a moment, you will see that it just moves your original problem to a different level: one now has to make sure the ID-to-row index is not out of sync. Of course, the advantage is that you can rebuild this index independently whenever the row order of the original table changes. This makes sense when you have to expect such changes, you know when they happen, and they happen seldom / at specific points in time.
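
Such an ID-to-row index is nothing more than a mapping that can be rebuilt from the table at any time (a minimal sketch, reusing the question's `get_row`):

```python
def build_id_index(table, count):
    """Map each row's ID (field 0) to its current position in the table."""
    return {get_row(table, pos)[0]: pos for pos in range(count)}

def row_for_id(table, id_index, row_id):
    return get_row(table, id_index[row_id])
```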

Nevertheless, is this really worth the hassle? If you are 100% sure that the row number in your original table is a valid and immutable primary key, build the simplest possible solution based on row numbers; everything else is more error-prone and "premature nonsense". Otherwise, use IDs.

Here are some scenarios where you cannot be sure your row number is a valid primary key:

  • rows in a relational database typically cannot be expected to have any particular order
  • in a file-based scenario: the amount of data grows and, because of your space constraints, you have to split it over several physical tables. Then the ID may still be a valid primary key, but the row number will not be.

Make sure you are not in such a situation when going for a row-number based approach.

Doc Brown
  • IDs in my case point to rows in an actual database; they have nothing to do with the in-memory table that I construct, and while they are guaranteed to be unique 63-bit unsigned integers, no ordering is guaranteed. I agree with your point that adding a third index for IDs only moves my original problem to a different level. – Attila O. Mar 03 '13 at 11:34