
I need to search a long string of binary data (a couple of megabytes' worth of binary digits) for patterns of binary digits. There are about 100 000 different patterns to search for, each a pattern of 1's and 0's at most 10 bits/characters long. I then need to display the most common of all of these patterns.

The binary data is fed in real time into some sort of text file or database. It's essentially a large block of binary digits that I need to search for those roughly 100 000 different patterns in the fastest, most effective way possible. A new digit arrives roughly every 10 seconds, one after the other.

So for example, the following string might come in:

10101110000110001011011111000101011010111101110100000011011101010101010111011100101010101011000101110011011111110001010010000110101011100000110011100101001010011001011010110101101010001010101010100010101010010100101001010101010111010101010010100101010011101010101010101001001010110101

and I need to find which of these 100 000 patterns is the most common, such as 0001100111, 0110011011, 1100011 and so on.

What would be the best way to do this on real-time data being fed to a text file? It does not really need to be exactly real time, but I would rather it be as close to real time as possible. I'd also rather it be a straight console affair so as to keep it as latency-free as possible, but I can adapt, as it doesn't really need to be a terminal either.

I've got about 1 year of programming experience in Bash. Would a simple Bash script be fast enough for 100 000 such patterns? All those thousands of patterns make me fear that a simple Bash script might be a little too slow and outdated. Or can SQL really be faster than a straight Bash script? I've heard about binary matching algorithms, but I don't really know what those are and they seem a little out of my league at the moment; however, I'm willing to get to the bottom of it if it's truly the most effective way. What's the best way to do this?

    10 seconds is a **long** time for a computer. How is a pattern defined? The most common 'pattern' will be a 1 or a 0. – doubleYou Dec 02 '17 at 23:09
  • Also: bash will be slow compared to many other approaches - but it might be fast enough. SQL is not obviously helpful here. Do you have a reason for mentioning it? Finally: what holiday? ;-) – doubleYou Dec 02 '17 at 23:13
  • 2
    Did you try any of the algorithms in the canonical Wikipedia article about [string searching/string matching](https://en.wikipedia.org/wiki/String_searching_algorithm)? And no, your problem has nothing to do with SQL, not even close. – Doc Brown Dec 03 '17 at 00:31
  • I mentioned SQL because I'm a programming neophyte without enough knowledge or time to find out the reality of it, and the internet clearly stated that it was a database thing. Just the musings of a lost neophyte. Looks like a simple Bash array is all I really need. – user289944 Dec 03 '17 at 11:53

1 Answer


Reformulated problem statement

You have some huge binary content and you'll have to scan it for a set of thousands of bit patterns. The main problem is that the bit patterns can happen anywhere in the content, and not just at the start of a byte.

Algorithmic solution

Let's say that each pattern has an id, an initial bit offset of 0, a bit length, and is defined as a sequence of bytes aligned on the first byte's most significant bit. The sequence of bytes shall be long enough to accommodate 7 more bits (e.g. 110110..-........, where the dots represent unused bits in the pattern bytes that will be set to 0, and the dash marks a byte boundary).
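
For concreteness, here is a minimal sketch of such a pattern record in Python; the class and field names are my own illustration, not something prescribed by the answer:

```python
# Illustrative only: one pattern, left-aligned and padded to whole bytes.
from dataclasses import dataclass

@dataclass
class Pattern:
    pattern_id: int   # identifier of the pattern
    bit_offset: int   # 0 for the original pattern, 1..7 for the shifted derivatives
    bit_length: int   # number of significant bits (at most 10 in this question)
    data: bytes       # pattern bits aligned on the first byte's MSB, padded with 0s

# The 6-bit pattern 110110 stored with room for 7 more bits (two bytes):
# 11011000 00000000
p = Pattern(pattern_id=42, bit_offset=0, bit_length=6,
            data=bytes([0b11011000, 0b00000000]))
```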

  1. Build for each pattern the 7 derived patterns obtained by shifting all of the pattern's bits right by 1 to 7 positions (e.g. .110110.-........, ..110110-........, ...11011-0......., etc.), and store all 7 of them, together with the original pattern, in a map that maps each byte sequence to the pattern id.

  2. For every map item:

    1. For every byte offset into the content at which the remaining content is at least as long as the n bytes of the pattern:

      1. Make a copy of the n content bytes, set the first offset bits of the copy to 0, and set the last (8 - ((offset + bit length) % 8)) % 8 bits to 0 as well (i.e. none when the pattern ends exactly on a byte boundary).

      2. Compare the altered copy with the pattern, byte by byte. If every byte is equal, increase the count for the mapped pattern.

  3. Sort the patterns by decreasing pattern count

This algorithm will be very fast: the comparison step is a plain byte-by-byte compare, just as for strings, and the masking step is fast as well, since it boils down to two binary AND operations on the copy.
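
To make the steps above concrete, here is a rough Python sketch of the idea. It is not the answer author's code: the helper names (`make_derivatives`, `count_matches`) are made up, and it assumes the incoming digits have already been collected into a bit string and packed into a `bytes` object.

```python
from collections import defaultdict

def make_derivatives(pattern_id, bits):
    """Yield (shift, pattern_bytes, mask_bytes, pattern_id) for shifts 0..7.

    `bits` is a bit string such as "110110". The pattern is shifted right
    by 0..7 bits and padded with zero bits to a whole number of bytes;
    mask_bytes has 1s exactly where the pattern bits sit.
    """
    for shift in range(8):
        total = shift + len(bits)
        nbytes = (total + 7) // 8
        pad = nbytes * 8 - total                       # trailing unused bits
        value = int(bits, 2) << pad
        mask = ((1 << len(bits)) - 1) << pad
        yield shift, value.to_bytes(nbytes, "big"), mask.to_bytes(nbytes, "big"), pattern_id

def count_matches(content, patterns):
    """Count how often each pattern (dict id -> bit string) occurs in `content`."""
    counts = defaultdict(int)
    derivatives = [d for pid, bits in patterns.items()
                   for d in make_derivatives(pid, bits)]
    for shift, pat, mask, pid in derivatives:
        n = len(pat)
        for start in range(len(content) - n + 1):
            window = content[start:start + n]
            # Clear the bits that do not belong to the pattern (the masking idea
            # from the answer), then compare byte by byte.
            if all((w & m) == p for w, m, p in zip(window, mask, pat)):
                counts[pid] += 1
    return counts

# Usage: pack the incoming digit string into bytes and report the most common pattern.
# (The trailing zero padding could in principle create spurious matches at the very end.)
bit_text = "1010111000011000101101111100010101101011110111010000001101110101"
padded = bit_text + "0" * (-len(bit_text) % 8)               # pad to whole bytes
content = int(padded, 2).to_bytes(len(padded) // 8, "big")
patterns = {1: "0001100111", 2: "0110011011", 3: "1100011"}  # patterns from the question
counts = count_matches(content, patterns)
if counts:
    best = max(counts, key=counts.get)
    print("most common pattern:", patterns[best], "with", counts[best], "occurrences")
else:
    print("none of the patterns occurred in this sample")
```

Note that with 100 000 patterns (so 800 000 derivatives) and megabytes of content, the pure-Python inner loop above will be slow; the same masking-and-compare idea written in a compiled language follows the spirit of the answer more closely.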

Christophe
  • Wow, awesome response!! I think it's a little overkill for my situation and way above my current level of programming. – user289944 Dec 03 '17 at 12:03
  • @user289944 Thanks :-) Don't let it intimidate you: binary operations like AND, OR and SHIFT are not so difficult ;-) – Christophe Dec 03 '17 at 12:11