24

I know several basic string-matching algorithms such as KMP or Boyer-Moore, but all of those analyze the pattern before searching.However, if one has a single character, there is not much to analyze. So is there any better algorithm than the naive search of comparing every character of the text ?

manlio
  • 4,166
  • 3
  • 23
  • 35
Christian
  • 365
  • 2
  • 3
  • 13
    You can throw SIMD instructions at it, but you won't get anything better than O(n). – CodesInChaos Mar 19 '16 at 12:58
  • 7
    For a single search or multiple searches in the same string ? – Christophe Mar 19 '16 at 18:27
  • KMP is definitely not something I would call a "basic" string-matching algorithm... I'm not even sure it's such a fast either, but it's historically important. If you want something basic try the Z algorithm. – user541686 Mar 19 '16 at 20:39
  • Suppose there was a character position the search algorithm didn't look at. Then it wouldn't be able to distinguish between strings with the needle character in that position, and strings with a different character in that position. – user253751 Mar 19 '16 at 22:00

5 Answers5

30

It being understood that the worst case is O(N), there are some very nice micro-optimizations.

The naive method performs a character comparison and an end-of-text comparison for each character.

Using a sentinel (i.e. a copy of the target character at the end of the text) reduces the number of comparisons to 1 per character.

At the bit twiddling level there is:

#define haszero(v)      ( ((v) - 0x01010101UL) & ~(v) & 0x80808080UL )
#define hasvalue(x, n)  ( haszero((x) ^ (~0UL / 255 * (n))) )

to know if any byte in a word (x) has a specific value (n).

The subexpression v - 0x01010101UL, evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80.

The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80).

By ANDing these two sub-expressions (haszero) the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second (April 27, 1987 by Alan Mycroft).

Now we can XOR the value to test (x) with a word that has been filled with the byte value in which we're interested (n). Because XORing a value with itself results in a zero byte and nonzero otherwise, we can pass the result to haszero.

This is often used in a typical strchr implementation.

(Stephen M Bennet suggested this on December 13, 2009. Further details in the well known Bit Twiddling Hacks).


PS

this code is broken for any combination of 1111's next to a 0

The hack passes the brute force test (just be patient):

#include <iostream>
#include <limits>

bool haszero(std::uint32_t v)
{
  return (v - std::uint32_t(0x01010101)) & ~v & std::uint32_t(0x80808080);
}

bool hasvalue(std::uint32_t x, unsigned char n)
{
  return haszero(x ^ (~std::uint32_t(0) / 255 * n));
}

bool hasvalue_slow(std::uint32_t x, unsigned char n)
{
  for (unsigned i(0); i < 32; i += 8)
    if (((x >> i) & 0xFF) == n)
      return true;

  return false;
}

int main()
{
  const std::uint64_t stop(std::numeric_limits<std::uint32_t>::max());

  for (unsigned c(0); c < 256; ++c)
  {
    std::cout << "Testing " << c << std::endl;

    for (std::uint64_t w(0); w != stop; ++w)
    {
      if (w && w % 100000000 == 0)
        std::cout << w * 100 / stop << "%\r" << std::flush;

      const bool h(hasvalue(w, c));
      const bool hs(hasvalue_slow(w, c));

      if (h != hs)
        std::cerr << "hasvalue(" << w << ',' << c << ") is " << h << '\n';
    }
  }

  return 0;
}

Lots of upvotes for an answer which makes the assumption one chararacter=one byte, which is nowadays not the standard any more

Thank you for the remark.

The answer was meant to be anything but an essay on multi-byte / variable-width encodings :-) (in all fairness that's not my area of expertise and I'm not sure it's what the OP was looking for).

Anyway it seems to me that the above ideas/tricks could somewhat be adapted to MBE (especially self-synchronizing encodings):

  • as noted in Johan's comment the hack can 'easily' be extended to work for double bytes or anything (of course you cannot stretch it too much);
  • a typical function that locates a character in a multibyte character string:
  • the sentinel technique can be used with a little foresight.
manlio
  • 4,166
  • 3
  • 23
  • 35
  • 1
    This is a poor man's version of SIMD operation. – Ruslan Mar 19 '16 at 13:58
  • @Ruslan Absolutely! This is often the case for effective bit twiddling hacks. – manlio Mar 19 '16 at 14:39
  • Indeed, but the reason we still need it is that compilers have never been taught to vectorize code using general purpose registers/instructions, only using simd... :-( – R.. GitHub STOP HELPING ICE Mar 19 '16 at 14:46
  • 2
    Nice answer. From a readability aspect, I don't understand why you write `0x01010101UL` in one line and `~0UL / 255` in the next. It gives the impression that they must be different values, since otherwise, why write it in two different ways? – hvd Mar 19 '16 at 19:08
  • 3
    This is cool because it checks 4 bytes at once, but it requires multiple (8?) instructions, since the `#define`s would expand to `( (((x) ^ (0x01010101UL * (n)))) - 0x01010101UL) & ~((x) ^ (0x01010101UL * (n)))) & 0x80808080UL )`. Wouldn't the single-byte comparison be faster? – Jed Schaaf Mar 19 '16 at 22:36
  • 1
    Note that the `#define haszero(v) ( ((v) - 0x01010101UL) & ~(v) & 0x80808080UL )` does not work it the pattern 0x0101010100 occurs. It will give a false positive for all the ones as well. In text this is rarely a problem of course. – Johan Mar 20 '16 at 00:20
  • @johan, `0x0101010100` is a five-byte constant, this algorithm was apparently written for 32-bit ints. (although, my gcc compiles 'unsigned long' to 64 bits so this code would be incorrect). – sig_seg_v Mar 20 '16 at 07:55
  • @hvd, I feel like `0x01010101UL` reads as 'subtract 1 from every byte', and `~0UL/255 * n` reads as 'factor to multiply to duplicate across multiple bytes'. I had to mentally work out the math to understand it, if we're in base-100 with two 'bytes', it goes `~0` = `9999`, so `9999/(100-1)` = 101. I feel like it makes sense that the constant on the previous line is doing something else, although it is equivalent. – sig_seg_v Mar 20 '16 at 07:57
  • @sig_seg_v I feel like `0x01010101UL` reads just fine as "factor to multiply to duplicate across multiple bytes" too, and plenty of people know that that idea works for any base including decimal. I can calculate decimal 1010101 * 34 as 34343434 in my head without having to have involved any division by 99 anywhere. I'm sure you can too. – hvd Mar 20 '16 at 08:06
  • @sig_seg_v, this code is broken for any combination of 1111's next to a 0. – Johan Mar 20 '16 at 10:54
  • 1
    @DocBrown, the code can easily be made to work for double bytes (i.e. halfwords) or nibbles or anything. (taking into account the caveat I mentioned). – Johan Mar 20 '16 at 10:56
  • @johan, I don't think it's broken, say for `v = 0x0303ff03`, then `v-0x01010101` = `0x0202fe02`, and `~v&0x80808080` = `0x80800080`, so the macro returns 0. I also got a return (`0x80808080`) working out `0x1010100`, meaning the algo seems to work. did you mean something else? – sig_seg_v Mar 20 '16 at 13:11
  • @Johan it seems to work and, as you said, the hack can be adapted to multibyte characters (probably requires [self-synchronizing](https://en.m.wikipedia.org/wiki/Self-synchronizing_code) encoding, eg UTF8, to be effective) – manlio Mar 20 '16 at 18:45
  • @sig_seg_v, yes I know it is broken as per my remarks before, because I've been bitten by that very brokenness. I don't understand what it is you don't 'get' about that. It works except for very specific input, see above. – Johan Mar 20 '16 at 22:14
  • Ok, now you got my upvote, too, ;-) – Doc Brown Mar 21 '16 at 12:33
  • I don't get the remark from Johan. The `haszero` tells you that one byte out of 4 is a zero. You still need to test for each byte in such word afterward to locate where's the 0. So a 1111 sequence will not cause a false positive, since individual byte testing will not detect a zero. – xryl669 Oct 14 '16 at 16:34
  • If you're doing UTF8, you can remove the `& ~v` part, since byte with high bit set are special, and you should handle them differently (and since there no single representation for a unicode char in UTF8). So you'll have to do a `v & 0x80808080` test anyway somewhere to figure this out. – xryl669 Oct 14 '16 at 16:36
  • @Johan, `0x01010101 - 0x01010101 = 0`, `0 & ~0x01010101 = 0`, `0 & 0x80808080 = 0`, thus haszero returns 0 for such input, as it should. For `0x01010100 - 0x01010101 = 0xFFFFFFFF`, `0xFFFFFFFF & 0xFEFEFEFF = 0xFEFEFEFF`, `0xFEFEFEFF & 0x80808080 != 0`, as expected, same for 0x111100 or 0x11111100. Of course, on a 64bits integer machine, this does not work, but is easily extended to 0x0101010101010101 instead... – xryl669 Oct 14 '16 at 17:07
20

Any text search algorithm which searches for every occurence of a single character in a given text has to read each character of the text at least once, that should be obvious. And since this is sufficient for a one-time search, there can be no better algorithm (when thinking in terms of run time order, which is called "linear" or O(N) for this case, where N is the number of characters to search through).

However, for real implementations, there are surely lots of micro-optimizations possible, which do not change the run time order at a whole, but lower the actual run time. And if the goal is not to find every occurence of a single character, but only the first, you can stop at the first occurence, of course. Nevertheless, even for that case, the worst-case is still that the character you are looking for is the last character in the text, so the worst case run time order for this goal is still O(N).

Doc Brown
  • 199,015
  • 33
  • 367
  • 565
8

If your "haystack" is searched more than once, a histogram based approach is going to be extremely fast. After the histogram is built, you only need a pointer lookup to find your answer.

If you only need to know whether the searched pattern is present, a simple counter can help. It can be extended to include the position(s) at which each character is found in the haystack, or the position of the first occurence.

string haystack = "agtuhvrth";
array<int, 256> histogram{0};
for(character: haystack)
     ++histogram[character];

if(histogram['a'])
    // a belongs to haystack
Sam
  • 251
  • 1
  • 5
1

If you need to search for characters in this very same string more than once, then a possible approach is to divide the string into smaller parts, possibly recursively, and to use bloom filters for each of these parts.

Since a bloom filter can tell you for sure if an character is not in the part of the string that's "represented" by the filter, you can skip some parts while searching for characters.

As example: For the following string one could split it into 4 parts (each 11 characters long), and fill for each part a bloom filter (perhaps 4 byte large) with the characters of that part:

The quick brown fox jumps over the lazy dog 
          |          |          |          |

You can speed up your search, e.g. for the character a: Using good hash functions for the bloom filters, they'll tell you that - with high probability - you don't have to search in neither the first, second nor third part. Thus you save yourself from checking 33 characters and instead only have to check 16 bytes (for the 4 bloom filters). This is still O(n), just with a constant (fractional) factor (and in order for this to be effective you'll need to choose bigger parts, to minimize the overhead of calculating the hash functions for the search character).

Using a recursive, tree-like approach should get you near O(log n):

The quick brown fox jumps over the lazy dog 
   |   |   |   |   |   |   |   |---|-X-|   |  (1 Byte)
       |       |       |       |---X---|----  (2 Byte)
               |               |-----X------  (3 Byte)
-------------------------------|-----X------  (4 Byte)
---------------------X---------------------|  (5 Byte)

In this configuration one needs (again, assuming we got lucky and didn't get a false positive from one of the filters) to check

5 + 2*4 + 3 + 2*2 + 2*1 bytes

to get to the final part (where one needs to check 3 characters until finding the a).

Using a good (better as the above) subdivision scheme you should get pretty nice results with that. (Note: Bloom filters at the root of the tree should be larger than close to the leaves, as shown in the example, to get a low false positives probability)

Daniel Jour
  • 683
  • 3
  • 14
1

If the string is going to be searched multiple times (typical "search" problem), solution can be O(1). Solution is to build an index.

E.g :

Map, where Key is the Character and Value is a list of indices for that character in the string.

With this, a single map lookup can provide the answer.

Shamit Verma
  • 759
  • 4
  • 9