I am trying to find common prefixes for a sorted set of strings. i.e. if the following strings are given:
AB123456
AB123457
ABCDEFGH
ABCDEFGX1
ABCDEFGY
XXXX
then my function should return three prefixes and their suffixes:
AB12345 6,7
ABCDEFG H,X1,Y
XXXX (no suffixes)
Some background information: I'm trying to compress a large amount of sorted strings. A traditional implementation of prefix compression would just store the difference of each string to the previous string. This does not allow fast random inserts or lookups, because all the previous strings have to be decompressed first. That's why I want to find common prefixes. Each string will just store the difference to this common prefix. Then I get fast random access for the cost of a few additional bytes (compared to the traditional implementation).
I'm not yet having a really good idea how to implement this. In my dreams I imagine a window which slides over the input stream, trying to find the best result. This smells like dynamic programming, a topic that I haven't been in touch with since university (long long ago).
If, however, computing the "best" result turns out to be extremely performance intensive I am willing to use the second-best result. Performance is important.
EDIT: After reading the first few answers, I understand that my question maybe was not precise enough. Maybe I can rephrase it a bit:
I am looking for the lowest cost (= minimum space utilization) in a graph. The graph starts with a sorted set of unique strings. (They are not compressed and therefore require the maximum space.) I now want to find common prefixes of the strings, so that the utilized space can be reduced. There should be just one level of prefixes, i.e. no prefix hierarchy (as it would be assumed in a trie).