Finding common prefixes for a set of strings

Question

I am trying to find common prefixes for a sorted set of strings. i.e. if the following strings are given:

AB123456
AB123457
ABCDEFGH
ABCDEFGX1
ABCDEFGY
XXXX

then my function should return three prefixes and their suffixes:

AB12345  6,7
ABCDEFG  H,X1,Y
XXXX     (no suffixes)

Some background information: I'm trying to compress a large amount of sorted strings. A traditional implementation of prefix compression would just store the difference of each string to the previous string. This does not allow fast random inserts or lookups, because all the previous strings have to be decompressed first. That's why I want to find common prefixes. Each string will just store the difference to this common prefix. Then I get fast random access for the cost of a few additional bytes (compared to the traditional implementation).

I'm not yet having a really good idea how to implement this. In my dreams I imagine a window which slides over the input stream, trying to find the best result. This smells like dynamic programming, a topic that I haven't been in touch with since university (long long ago).

If, however, computing the "best" result turns out to be extremely performance intensive I am willing to use the second-best result. Performance is important.

EDIT: After reading the first few answers, I understand that my question maybe was not precise enough. Maybe I can rephrase it a bit:

I am looking for the lowest cost (= minimum space utilization) in a graph. The graph starts with a sorted set of unique strings. (They are not compressed and therefore require the maximum space.) I now want to find common prefixes of the strings, so that the utilized space can be reduced. There should be just one level of prefixes, i.e. no prefix hierarchy (as it would be assumed in a trie).

The data will be serialized directly to disk, and mapped into memory when it's processed. It should not require additional parsing etc. A trie has a couple of disadvantages in this aspect. It has to store the pointers to the child nodes as offset into the serialized memory, each offset is 2 or 4 bytes wide. This is bad for space utilization, and has to be taken into account when calculating the "costs". Second, more important: if you insert or erase items in the middle then all the following offsets have to be updated. That's bad for performance. (cont) — cruppstahl, Nov 10 '14 at 19:42
(cont) My idea is to store all prefixes in a list. Each prefix has an offset to a "bucket" with all its suffixes. This requires less space overhead than a trie, has less cache misses because the data is not spread all over memory, might therefore even allow SIMD access, allows use of binary search for a string, and (in average case) the insert operations are cheap because a new suffix might just fit into its bucket if the bucket is large enough. — cruppstahl, Nov 10 '14 at 19:46
This sounds like an optimization problem: you want to find the smallest set of longest prefixes such that each prefix precedes at most/around N words (N may be tied to the size of your cache, or page size, etc). This way if prefixes are in a hash, you have O(1) look up into a bucket + O(log N) binary search w/in bucket to check if a string was in the original list. I am looking at something similar. How have you solved this? — lynxoid, Sep 22 '20 at 07:00

Mike Dunlavey · Answer 1 · 2014-11-10T13:46:46.103

3

You want to store strings in compressed form (to save space, I guess), but you want fast lookup, is this right? If I were you, I would go for the speed and use a trie (for the first few characters). It has O(log n) lookup, and it will automatically condense common prefixes.

A lot depends on the statistics of the strings, like how many there are, and their typical length.

ADDED: For the strings you gave, the trie would look like this:

- A B - 1 2 3 4 5 - 6 .
  |     |           |
  |     |           7 .
  |     |           
  |     C D E F G - H .
  |                 |
  |                 X 1 .
  |                 |
  |                 Y .
  |
  X X X X .

Each node of the trie contains a little "dictionary" of "words", initially 1 letter long, and each "word" points to a sub-node. If that sub-node contains only one "word" in its own "dictionary", then that "word" can be absorbed into its parent's "word", and that's how you build up the prefixes.

edited Nov 10 '14 at 13:46

answered Nov 10 '14 at 02:20

Mike Dunlavey

12,815
2
35
58

1

You suggest a trie for the first few characters. But for how many characters? How can I determine the length of the prefixes in order to minimize total space utilization? That's the problem i want to solve, and it's independent of the "string statistics". – cruppstahl Nov 10 '14 at 08:10
this algorithm should work with strings of all length and all types. It should even work with binaries (i.e. non-printable characters). I understand that a generic solution is not as efficient as one that is tailored to certain data, but nevertheless i look for a solution that works well in all cases. – cruppstahl Nov 10 '14 at 13:38
@cruppstahl: Check my edit above. That should work for any content. – Mike Dunlavey Nov 10 '14 at 13:51
Thanks for your patience, I appreciate it :) I have added a comment to my question to explain more of the rationale why my trie should only have a depth of 1. – cruppstahl Nov 10 '14 at 19:50
@cruppstahl: Sorry, I understand you want just one level, but I have a hard time understanding why. – Mike Dunlavey Nov 10 '14 at 20:52
I'm writing a database. It's organized as a B-tree, and now I want to add prefix compression for the keys, which can be arbitrary strings of any length. They're sorted. Each node in the B-tree should be compressed. It should provide relatively fast random updates and lookups. The database uses memory mapping, therefore the B-tree must not require additional memory structures at runtime. – cruppstahl Nov 11 '14 at 06:26

score -2 · Answer 2 · answered Nov 10 '14 at 01:53

-2

You should be able to use a substring of the string you want to check unless the prefix length varies. You can use an if statement to make sure the string is long enough to have the prefix.

answered Nov 10 '14 at 01:53

Josh Eblin

1
1

2

That is obvious. The question is: for a sorted set of strings, how can i split this input in prefixes and suffixes to minimize the storage and achieve the best compression? – cruppstahl Nov 10 '14 at 08:01

Finding common prefixes for a set of strings

2 Answers2