
Given sorted data, the search solution is obvious. Given unsorted data, sensible options are sort then search or just linear search.

This question is about what to do when the data is somewhat sorted but cannot be re-organized (write operations require a sector erase, and sectors won't fit in RAM).

The details:

Data records are logged sequentially with a timestamp. The timestamp clock, however, occasionally resets to the Epoch for a while before it is re-synced, or is simply adjusted backwards due to daylight saving changes, etc.

The search result should be the log record whose timestamp is closest to the time specified. While performing a binary search for a record at a particular timestamp, it is possible to land on a patch of discontiguous log data and pivot in the wrong direction.

Besides a brute-force linear method, is there a heuristic we can take advantage of here? e.g. a Simplex Search immune to local minima?

Update:

We can assume about 95% of the records are in sorted sequence; about 5% (scattered through the log) are the dirty hiccups. The beginning of a hiccup can be identified when the timestamp goes backwards in time and has an earlier stamp than the beginning of the log.
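For illustration, the dirty-record test could look roughly like this (a minimal C sketch; the LogRecord layout, field names, and helper names are assumptions for this example, not the actual record format):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record layout; the real records are fixed-size binary
 * blobs stored on a serial EEPROM. */
typedef struct {
    uint32_t timestamp;   /* seconds since the Epoch */
    uint8_t  payload[28];
} LogRecord;

/* A record belongs to a hiccup patch when its timestamp is earlier
 * than the timestamp of the very first record in the log. */
static bool is_hiccup(const LogRecord *rec, uint32_t log_start_ts)
{
    return rec->timestamp < log_start_ts;
}

/* The beginning of a patch is where time jumps backwards. */
static bool starts_hiccup(const LogRecord *prev, const LogRecord *cur)
{
    return cur->timestamp < prev->timestamp;
}
```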

MandoMando
  • is there a distinctive characteristic of the erroneous timestamps? if yes then just go linearly forward until a "good" record comes up and pivot on that – ratchet freak Apr 09 '13 at 17:00
  • @ratchetfreak the only possible distinction is that they are likely to be older timestamps, but that's not guaranteed. The problem with going linearly through the patch is that it could go on for a long time. – MandoMando Apr 09 '13 at 17:50
  • there is the indexing option also, besides sorting and scanning – CapelliC Apr 09 '13 at 18:01
  • A sample block of log records could maybe help. Also, how is your data organized, text file, database record, fixed format ASCII data, etc? – Jan Doggen Apr 09 '13 at 18:42
  • @JanDoggen the data is in binary records with a fixed record size, stored on a serial EEPROM. A meaningful sample log isn't possible, though I'll update the question with assumptions about 'dirtiness'. – MandoMando Apr 09 '13 at 18:55
  • @CapelliC index is a nice idea. Though creating it and maintaining it falls into the same trap since we can't re-write once we write and an erase operation would delete the entire index. – MandoMando Apr 09 '13 at 18:57
  • why not selectively delete? An index can be compressed on adjacency range, making deletion costless... – CapelliC Apr 09 '13 at 19:17
  • @CapelliC a delete operation would wipe out an entire sector, which is larger than the RAM size, and would then result in fragmentation. To implement it properly, we'd need to scrub and re-write the entire log, and keep doing so to maintain the index. – MandoMando Apr 09 '13 at 19:27
  • Maybe silly: You can't fix the root cause of the clock resetting? i.e. if (currentTimeStamp < lastLogTimeStamp){ force clock resync } writeToLog(...) – Steven Evers Apr 09 '13 at 20:07

1 Answer


Additional assumptions

This answer is based on the following additional assumptions:

  • you can easily determine the timestamp for the beginning of the log, and
  • it is feasible to store the positions of the hiccups (optional)

Search algorithm

The search really splits into two different algorithms. If you search for a log entry with a timestamp after the beginning of the log, you know that it will not be found in the hiccups, so you use the non-hiccup search below. If you search for a timestamp before the beginning of the log, you use the hiccup search instead. If you do not search by timestamp but by some other criterion, you first try the non-hiccup search, due to its 95% coverage, and then try the hiccup search if you failed to find anything.

Optionally, you can speed up the non-hiccup search by a preprocessing step.

Preprocessing (optional)

If possible, pre-analyze your data with a linear search to find out the positions of the hiccup data. This depends entirely on the feasibility of storing these ranges, which might be possible given that they only amount to 5% of your logs (or it might not, in which case performance degrades).

Note that the corresponding data structure should be updated whenever you write new logs, or at least it should be able to tell you up to which point of the log the preprocessing was performed.
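A minimal sketch of such a pass in C (read_timestamp(), HiccupRange, and the fixed-capacity range table are illustrative assumptions, not something given in the question):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

uint32_t read_timestamp(size_t i);   /* hypothetical EEPROM accessor */

/* One contiguous patch of hiccup records, [first, last] inclusive. */
typedef struct {
    size_t first;
    size_t last;
} HiccupRange;

/* Single linear pass over n records that collects every hiccup patch.
 * A record is a hiccup when its timestamp is earlier than that of the
 * first record in the log.  Returns the number of ranges written into
 * `out` (patches beyond max_ranges are silently dropped). */
size_t scan_hiccups(size_t n, HiccupRange *out, size_t max_ranges)
{
    uint32_t log_start_ts = read_timestamp(0);
    size_t count = 0;
    bool in_hiccup = false;

    for (size_t i = 1; i < n; i++) {
        bool dirty = read_timestamp(i) < log_start_ts;
        if (dirty && !in_hiccup && count < max_ranges) {
            out[count].first = i;              /* a new patch starts */
            in_hiccup = true;
        }
        if (in_hiccup)
            out[count].last = dirty ? i : i - 1;
        if (!dirty && in_hiccup) {
            count++;                           /* patch ended at i-1 */
            in_hiccup = false;
        }
    }
    if (in_hiccup)
        count++;                               /* log ends in a patch */
    return count;
}
```

When new records are appended, the same pass can be resumed from the last scanned position so that the table stays up to date.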

Non-hiccup search

Searching the non-hiccup data is possible via a combination of binary and linear search. You perform a normal binary search; however, when your pivot element is timestamped before the beginning of the log, i.e. the pivot element is a hiccup, you need to determine the first non-hiccup log entry before the pivot element and use that as the real pivot element of your binary search.

This first log entry with a timestamp after the beginning of the log is found via a linear search starting from the hiccup pivot element. If you know from preprocessing, or from incremental updates to your cached hiccups, where the relevant pivot element is positioned, you can jump there in constant time. If you had to run the full linear search, you update a data structure to record that these positions are covered by hiccup data, so that next time you can quickly determine the right pivot element.
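A rough C sketch of that combination (read_timestamp() is the same assumed EEPROM accessor as above; for simplicity the function returns the last clean record not after the target rather than the strictly closest one):

```c
#include <stddef.h>
#include <stdint.h>

uint32_t read_timestamp(size_t i);   /* hypothetical EEPROM accessor */

/* Binary search over the ~95% of sorted records, skipping hiccup
 * pivots.  Assumes target_ts >= log_start_ts, so a hiccup record can
 * never be the answer.  Returns an index, or -1 if nothing qualifies. */
long find_non_hiccup(uint32_t target_ts, uint32_t log_start_ts, size_t n)
{
    size_t lo = 0, hi = n;            /* current window is [lo, hi) */
    long best = -1;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        size_t pivot = mid;
        uint32_t ts = read_timestamp(pivot);

        /* Pivot landed in a hiccup patch: walk back linearly until a
         * clean record is found and use it as the real pivot. */
        while (pivot > lo && ts < log_start_ts)
            ts = read_timestamp(--pivot);

        if (ts < log_start_ts) {
            lo = mid + 1;             /* all of [lo, mid] is hiccup data */
        } else if (ts <= target_ts) {
            best = (long)pivot;       /* candidate; keep looking right */
            lo = mid + 1;
        } else {
            hi = pivot;               /* discard [pivot, hi) */
        }
    }
    return best;
}
```

With the cached hiccup ranges available, the backward walk can be replaced by a constant-time jump to the entry just before the patch.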

Hiccup search

This depends on whether you have done the preprocessing. If not, it is a linear search through all your data (and you can do the preprocessing at the same time). Otherwise, your preprocessed data structure can tell you the positions of the hiccup data, and you can search directly through those, i.e. perform a linear search only through the 5% of hiccup data.
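With the range table from the preprocessing sketch, this could look roughly as follows (again assuming the illustrative read_timestamp() accessor and HiccupRange layout from above):

```c
#include <stddef.h>
#include <stdint.h>

uint32_t read_timestamp(size_t i);                    /* hypothetical accessor */

typedef struct { size_t first, last; } HiccupRange;   /* as in the preprocessing sketch */

/* Linear search restricted to the ~5% of records known to be hiccups.
 * Returns the index whose timestamp is closest to target_ts, or -1. */
long find_hiccup(uint32_t target_ts, const HiccupRange *ranges, size_t nranges)
{
    long best = -1;
    uint32_t best_diff = UINT32_MAX;

    for (size_t r = 0; r < nranges; r++) {
        for (size_t i = ranges[r].first; i <= ranges[r].last; i++) {
            uint32_t ts = read_timestamp(i);
            uint32_t diff = ts > target_ts ? ts - target_ts : target_ts - ts;
            if (diff < best_diff) {
                best_diff = diff;
                best = (long)i;
            }
        }
    }
    return best;
}
```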

Performance

For a non-hiccup search with fully preprocessed data you get almost the performance of a default binary search. Instead of a simple comparison at each step, though, you need an additional comparison to determine whether you hit a hiccup element and, if so, an access to your data structure to find the real pivot element. This additional work, however, is slightly alleviated by the fact that you not only rule out half your dataset, but also rule out the hiccup data in that half.

Of course, if you have to resort to linear search, this degrades accordingly.

The hiccup search is a bad case if you do not have information about existing hiccups available and need to search linearly through all the data.

Finally, if you search not for a timestamp but for some other criterion, and no such log entry exists, this is your worst case. In fact, if you have no data structure for the hiccups, it ends up being slower than a plain linear search, as both search runs may linearly scan the same hiccup positions (though it is still O(n)).

If you do have the data structure available and fully preprocessed, the worst-case runtime comes down to O(max(log(M)*log(N), M)), where M is the amount of hiccup data (which is searched linearly), assuming that your data structure can look up the end of the hiccup data following any given hiccup position in O(log(M)).

Frank
  • @Frank thanks for this! I was working on a scheme where, on landing on a hiccup, both possible future pivots would be examined to see if a 1/4 reduction (instead of 1/2) of the data size is possible. I wasn't sure what to do in the case where all three (the current pivot and the two candidates) are hiccups. With your suggestion, a linear walk to the edge of the hiccup patch will work. – MandoMando Apr 10 '13 at 14:04