14

I have a fairly complex search problem that I've managed to reduce to the following description. I've been googling but haven't been able to find an algorithm that seems to fit my problem cleanly. In particular the need to skip arbitrary integers. Maybe someone here can point me to something?

Take a sequence of integers A, for example (1 2 3 4)

Take various sequences of integers and test if any of them match A such that.

  1. A contains all of the integers in the tested sequence
  2. The ordering of the integers in the tested sequence are the same in A
  3. We don't care about any integers in A that are not in the test sequence
  4. We want all matching test sequences, not just the first.

An example

A = (1 2 3 4)
B = (1 3)
C = (1 3 4)
D = (3 1)
E = (1 2 5)

B matches A

C matches A

D does not match A as the ordering is different

E does not match A as it contains an integer not in A

I hope that this explanation is clear enough. The best I've managed to do is to contruct a tree of the test sequences and iterate over A. The need to be able to skip integers leads to a lot of unsuccessful search paths.

Thanks

Reading some suggestions I feel that I have to clarify a couple of points that I left too vague.

  1. Repeated numbers are allowed, in fact this is quite important as it allows a single test sequence to match A is multiple ways

    A = (1234356), B = (236), matches could be either -23---6 or -2--3-6

  2. I expect there to be a very large number of test sequences, in the thousands at least and sequence A will tend to have a max length of maybe 20. Thus simply trying to match each test sequence one by one by iterating becomes extremely inefficient.

Sorry if this wasn't clear.

David Gibson
  • 241
  • 1
  • 4
  • 4
    You sound as if you simply want to detect subsequences (http://en.wikipedia.org/wiki/Subsequence). Is that it? Then try searching for "subsequence algorithm". – Kilian Foth Nov 12 '13 at 12:38
  • Honestly, a few thousand sequences with max length <= 20 does not sound a large number to me. A simple brute-force approach should do the trick. Or do you have thousands of sequences "A", each one to test against thousands of possible subsequences? – Doc Brown Nov 12 '13 at 15:26
  • There is a continual stream of sequences A but they are completely independent of each other. However a delay in processing one will directly delay all others, so speed is important. – David Gibson Nov 12 '13 at 15:53
  • 1
    How large is your alphabet? Do you really have arbitrary integers, or is there a finite range of values such that we could do some precalculations? – Frank Nov 13 '13 at 06:19
  • The possible range of integers is in the 100,000s – David Gibson Nov 13 '13 at 09:29
  • This shold perhaps be a better fit for the ComputerScience.SE site? http://cs.stackexchange.com/ – Avner Shahar-Kashtan Nov 13 '13 at 19:07
  • @AvnerShahar-Kashtan possibly, there is an overlap between the two sites on this aspect of scope. I'd suggest glancing at a (contrived) question that was multiply posted to see what types of answers each site would bring - [P.SE](http://programmers.stackexchange.com/questions/206073/) and [CS.SE](http://cs.stackexchange.com/questions/13420/). It depends on what type of answer one wants to which way it should be posted. Since its posted here, and getting good (if not a great) answer(s) here, it should remain here. –  Nov 13 '13 at 19:30

3 Answers3

18

Hmm, I can think of two possible algorithms: A linear scan through the A sequence, or building a dictionary with constant-time lookup of the indices.

If you are testing many potential subsequences B against a single larger sequence A, I'd suggest you use the variant with the dictionary.

Linear Scan

Description

We maintain a cursor for the sequence A. Then we iterate through all items in the subsequence B. For each item, we move the cursor forward in A until we have found a matching item. If no matching item was found, then B isn't a subsequence.

This always runs in O(seq.size).

Pseudocode

Imperative style:

def subsequence? seq, subseq:
  i = 0
  for item in subseq:
    i++ while i < seq.size and item != seq[i]
    return false if i == seq.size
  return true

Functional style:

let rec subsequence? = function
| _ [] -> true
| [] _ -> false
| cursor::seq item::subseq ->
  if   cursor = item
  then subsequence? seq subseq
  else subsequence? seq item::subseq

Example implementation (Perl):

use strict; use warnings; use signatures; use Test::More;

sub is_subsequence_i ($seq, $subseq) {
  my $i = 0;
  for my $item (@$subseq) {
    $i++ while $i < @$seq and $item != $seq->[$i];
    return 0 if $i == @$seq;
  }
  return 1;
}

sub is_subsequence_f ($seq, $subseq) {
  return 1 if @$subseq == 0;
  return 0 if @$seq == 0;
  my ($cursor, @seq) = @$seq;
  my ($item, @subseq) = @$subseq;
  return is_subsequence_f(\@seq, $cursor == $item ? \@subseq : $subseq);
}

my $A = [1, 2, 3, 4];
my $B = [1, 3];
my $C = [1, 3, 4];
my $D = [3, 1];
my $E = [1, 2, 5];

for my $is_subsequence (\&is_subsequence_i, \&is_subsequence_f) {
  ok $is_subsequence->($A, $B), 'B in A';
  ok $is_subsequence->($A, $C), 'C in A';
  ok ! $is_subsequence->($A, $D), 'D not in A';
  ok ! $is_subsequence->($A, $E), 'E not in A';
  ok $is_subsequence->([1, 2, 3, 4, 3, 5, 6], [2, 3, 6]), 'multiple nums';
}

done_testing;

Dictionary lookup

Description

We map the items of the sequence A to their indices. Then we look up suitable indices for each item in B, skip those indices that are to small, and pick the smallest possible index as a lower limit. When no indices are found, then B isn't a subsequence.

Runs in something like O(subseq.size · k), where k describes how many duplicate numbers there are in seq. Plus an O(seq.size) overhead

The advantage of this solution is that a negative decision can be reached much faster (down to constant time), once you have paid the overhead of building the lookup table.

Pseudocode:

Imperative style:

# preparing the lookup table
dict = {}
for i, x in seq:
  if exists dict[x]:
    dict[x].append(i)
  else:
    dict[x] = [i]

def subsequence? subseq:
  min_index = -1
  for x in subseq:
    if indices = dict[x]:
      suitable_indices = indices.filter(_ > min_index)
      return false if suitable_indices.empty?
      min_index = suitable_indices[0]
    else:
      return false
  return true

Functional style:

let subsequence? subseq =
  let rec subseq-loop = function
  | [] _ -> true
  | x::subseq min-index ->
    match (map (filter (_ > min-index)) data[x])
    | None -> false
    | Some([]) -> false
    | Some(new-min::_) -> subseq-loop subseq new-min
  in
    subseq-loop subseq -1

Example implementation (Perl):

use strict; use warnings; use signatures; use Test::More;

sub build_dict ($seq) {
  my %dict;
  while (my ($i, $x) = each @$seq) {
    push @{ $dict{$x} }, $i;
  }
  return \%dict;
}

sub is_subsequence_i ($seq, $subseq) {
  my $min_index = -1;
  my $dict = build_dict($seq);
  for my $x (@$subseq) {
    my $indices = $dict->{$x} or return 0;
    ($min_index) = grep { $_ > $min_index } @$indices or return 0;
  }
  return 1;
}

sub is_subsequence_f ($seq, $subseq) {
  my $dict = build_dict($seq);
  use feature 'current_sub';
  return sub ($subseq, $min_index) {
    return 1 if @$subseq == 0;
    my ($x, @subseq) = @$subseq;
    my ($new_min) = grep { $_ > $min_index } @{ $dict->{$x} // [] } or return 0;
    __SUB__->(\@subseq, $new_min);
  }->($subseq, -1);
}

my $A = [1, 2, 3, 4];
my $B = [1, 3];
my $C = [1, 3, 4];
my $D = [3, 1];
my $E = [1, 2, 5];

for my $is_subsequence (\&is_subsequence_i, \&is_subsequence_f) {
  ok $is_subsequence->($A, $B), 'B in A';
  ok $is_subsequence->($A, $C), 'C in A';
  ok ! $is_subsequence->($A, $D), 'D not in A';
  ok ! $is_subsequence->($A, $E), 'E not in A';
  ok $is_subsequence->([1, 2, 3, 4, 3, 5, 6], [2, 3, 6]), 'multiple nums';
}

done_testing;

Dictionary Lookup Variant: Encoding as a Finite State Machine

Description

We can further reduce algorithmic complexity down to O(subseq.size) if we trade in more memory. Instead of mapping elements to their indices, we create a graph where each node represents an element at its index. The edges show possible transitions, e.g. the sequence a, b, a would have the edges a@1 → b@2, a@1 → a@3, b@2 → a@3. This graph is equivalent to a finite state machine.

During lookup we maintain a cursor which initially is the first node of the tree. We then walk the edge for each element in the sublist B. If no such edge exists, then B is no sublist. If after all elements the cursor contains a valid node, then B is a sublist.

Pseudocode

Imperative style:

# preparing the graph
graph = {}
for x in seq.reverse:
  next_graph = graph.clone
  next_graph[x] = graph
  graph = next_graph

def subseq? subseq:
  cursor = graph
  for x in subseq:
    cursor = graph[x]
    return false if graph == null
  return true

Functional style:

let subseq? subseq =
  let rec subseq-loop = function
  | [] _ -> true
  | x::subseq graph -> match (graph[x])
    | None -> false
    | Some(next-graph) -> subseq-loop subseq next-graph
  in
    subseq-loop subseq graph

Example implementation (Perl):

use strict; use warnings; use signatures; use Test::More;

sub build_graph ($seq) {
  my $graph = {};
  for (reverse @$seq) {
    $graph = { %$graph, $_ => $graph };
  }
  return $graph;
}

sub is_subsequence_i ($seq, $subseq) {
  my $cursor = build_graph($seq);
  for my $x (@$subseq) {
    $cursor = $cursor->{$x} or return 0;
  }
  return 1;
}

sub is_subsequence_f ($seq, $subseq) {
  my $graph = build_graph($seq);
  use feature 'current_sub';
  return sub ($subseq, $graph) {
    return 1 if @$subseq == 0;
    my ($x, @subseq) = @$subseq;
    my $next_graph = $graph->{$x} or return 0;
    __SUB__->(\@subseq, $next_graph);
  }->($subseq, $graph);
}

my $A = [1, 2, 3, 4];
my $B = [1, 3];
my $C = [1, 3, 4];
my $D = [3, 1];
my $E = [1, 2, 5];

for my $is_subsequence (\&is_subsequence_i, \&is_subsequence_f) {
  ok $is_subsequence->($A, $B), 'B in A';
  ok $is_subsequence->($A, $C), 'C in A';
  ok ! $is_subsequence->($A, $D), 'D not in A';
  ok ! $is_subsequence->($A, $E), 'E not in A';
  ok $is_subsequence->([1, 2, 3, 4, 3, 5, 6], [2, 3, 6]), 'multiple nums';
}

done_testing;
amon
  • 132,749
  • 27
  • 279
  • 375
  • As an aside, have you poked at how `study` works and if the algorithms it applies could have some practical application here? –  Nov 12 '13 at 18:04
  • 1
    @MichaelT I'm not sure I understand … I am an undergrad, but I haven't yet discovered how to actually study . If you are talking about the Perl builtin function: It's a no-op nowadays. The current implementation is just a dozen lines of backwards compatibility. The regex engine employs such heuristics directly, such as searching for constant strings before matching variable-sized patterns. `study` had previously built character-to-position lookup tables, not unlike my second solution. – amon Nov 12 '13 at 19:06
  • updated with even better algorithm – amon Nov 13 '13 at 19:00
  • Elaborating more on that FSM, you could 'compile' all the test sequences into one FSM and then run through the entire sequence. Depending on which state you were in at the end determines which subsequences were matched. Its certainly the thing that one would rather have the computer do rather than do by hand for any non-trivial one. –  Nov 15 '13 at 22:37
  • @MichaelT You are correct that we *could* build a recognizer this way. However, we are already down to *n·O(B)* + initialization cost in *O(f(A))*. Building the trie-like structure of all Bs would take something like *O(n·B)*, with the matching being in *O(A)*. This has a real chance of being cheaper theoretically (building the graph in the 3rd solution can be expensive, but it's just a one-time cost). I think a trie is better suited for *A ≫ n·B*, and has the disadvantage that it can't handle streaming input – all Bs have to be loaded prior to matching. I'll probably update the answer in 6h. – amon Nov 16 '13 at 07:25
6

Here is a practical approach which avoids "the hard work" of implementing your own algorithm, and also avoids to "reinvent the wheel": utilize a regular expression engine for the problem.

Just put all numbers of A into a string, and all numbers of B into a string separated by the regular expression (.*). Add a ^ character at the beginning and $ at the end. Then let your favorite regular expression engine search for all matches. For example, when

A = (1234356), B = (236)

create a reg exp for B like ^(.*)2(.*)3(.*)6(.*)$. Now run a global regexp search. To find out at which positions in A your subsequence matches, just check the length of the first 3 submatches.

If your range of integers leaves 0 to 9, you may consider to encode them by single letters first to make this work, or you have to adapt the idea using a separation character.

Of course, the speed of that approach will depend a lot on the speed of the reg exp engine your are using, but there are highly optimized engines available, and I guess it will be hard to implement a faster algorithm "out of the box".

Doc Brown
  • 199,015
  • 33
  • 367
  • 565
  • One need not go all the way to invoking a regex and its engine. It would be possible to use a simple deterministic finite automata to run it. Its a 'straight line' route through. –  Nov 12 '13 at 17:40
  • @MichaelT: well, I don't have a "generic finite automata" library at hand, and the OP did not tell us about the programming language he uses, but regular expressions are available today for almost every serious programming language "out the box". That should make my suggestion very easy to implement, with much less code than, for example, amon's solution. IMHO the OP should give it a try, if it is too slow for him, he could still try if a more complicated solution will serve him better. – Doc Brown Nov 12 '13 at 17:53
  • You don't need a generic library. All you need is the array of the 'pattern' and a pointer to the index in the array. The index points to the next "looking for" value, and when you read it from the source, increment the index. When you hit the end of the array, you've matched it. If you read the end of the source without reaching the end, you haven't matched it. –  Nov 12 '13 at 17:58
  • @MichaelT: then why don't you post a sketch of that algorithm as an answer? – Doc Brown Nov 12 '13 at 18:00
  • Mostly because its already better answered already - "We maintain a cursor for the sequence A. Then we iterate through all items in the subsequence B. For each item, we move the cursor forward in A until we have found a matching item. If no matching item was found, then B isn't a subsequence." –  Nov 12 '13 at 18:03
0

This algorithm should be quite efficient if getting the length and iterating the sequence is efficient.

  1. Compare the length of both sequences. Store the longer in sequence and the shorter in subsequence
  2. Start at the beginning of both sequences and loop until the end of sequence.
    1. Is the number at the current position of sequence equal to to the number at the current postion of subsequence
    2. If yes, move both positions one further
    3. If not, move only the position of sequence one further
  3. Is the position of subsequence at the end of the sequence
  4. If yes, the two sequences match
  5. If not, the two sequences don't match
MarcDefiant
  • 1,099
  • 2
  • 6
  • 8