36

I have the following problem: I have a database containing more than 2 million records. Each record has a string field X and I want to display a list of records for which field X contains a certain string. Each record is about 500 bytes in size.

To make it more concrete: in the GUI of my application I have a text field where I can enter a string. Above the text field I have a table displaying the (first N, e.g. 100) records that match the string in the text field. When I type or delete one character in the text field, the table content must be updated on the fly.

I wonder if there is an efficient way of doing this using appropriate index structures and/or caching. As explained above, I only want to display the first N items that match the query, so for N small enough it should not be a big issue to load the matching items from the database. Besides, caching items in main memory can make retrieval faster.
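
For illustration, this is a minimal sketch of the kind of bounded cache I have in mind, using `LinkedHashMap` in access order; the `Record` class and the numeric ID key are placeholders, not my actual schema:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A record as stored in the DB (placeholder shape, ~500 bytes each).
class Record {
    long id;
    String x; // the searchable string field
}

// Bounded LRU cache: a LinkedHashMap in access order evicts the least
// recently used record once the capacity is exceeded.
class RecordCache extends LinkedHashMap<Long, Record> {
    private final int capacity;

    RecordCache(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, Record> eldest) {
        return size() > capacity;
    }
}
```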

I think the main problem is how to find the matching items quickly, given the pattern string. Can I rely on some DBMS facilities, or do I have to build some in-memory index myself? Any ideas?

EDIT

I have run a first experiment. I have split the records into different text files (at most 200 records per file) and put the files in different directories (I used the content of one data field to determine the directory tree). I end up with about 50000 files in about 40000 directories. I have then run Lucene to index the files. Searching for a string with the Lucene demo program is pretty fast. Splitting and indexing took a few minutes: this is totally acceptable for me because it is a static data set that I want to query.

The next step is to integrate Lucene in the main program and use the hits returned by Lucene to load the relevant records into main memory.
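
Sketching it out, the integration might look roughly like this (the field names, the index path, and the idea of storing the record ID in the index are my assumptions; this is untested code written against a recent Lucene API):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RecordSearch {
    private static final int N = 100; // display at most N hits

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/record-index")); // placeholder path
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Indexing: one Lucene document per record; the stored "id" field
        // lets me load the full 500-byte record from the DB afterwards.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));
            doc.add(new TextField("x", "the searchable field X of the record", Field.Store.NO));
            writer.addDocument(doc);
        }

        // Searching: parse the user's query string and take the top N hits.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("x", analyzer).parse("certain string");
            TopDocs top = searcher.search(query, N);
            for (ScoreDoc hit : top.scoreDocs) {
                String id = searcher.doc(hit.doc).get("id");
                // load the record with this id from the DB (or the cache) here
            }
        }
    }
}
```

Note that this, like the Lucene demo, searches for words; as far as I understand, matching an arbitrary substring would require something like a wildcard query, which is slower.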

Giorgio
  • 2 million records * 500 bytes = 1 GB of data. That's a **lot** of data to search, whichever way you go about it - is each value of X likely to be unique, or will you have many records with the same value of X? – Mark Bannister Nov 09 '11 at 14:14
  • That would also be a **lot** of data to attempt to store in memory as a cache for quick retrieval. That would equate to more than 1GB per user session. – maple_shaft Nov 09 '11 at 14:18
  • My previous comment assumes a web application. Is this a web application? – maple_shaft Nov 09 '11 at 14:19
  • It is a desktop application. Values in the records are not necessarily unique. Also, I am searching for substring not for an exact match. – Giorgio Nov 09 '11 at 14:33
  • @maple_shaft: I would only cache the records that I have accessed recently. If I change the query string and a record still matches, it is still in the cache. – Giorgio Nov 09 '11 at 14:34
  • @Mark Bannister: I will have possibly many records matching the query string (substring). But I can fix a number N, e.g. 100, and only retrieve the first N records. As the user types in more characters, the result set should contain only a few items, or even one item only. – Giorgio Nov 09 '11 at 14:37
  • @Giorgio, you misunderstood my question. I did not ask how many records would match the search criteria; I asked whether each value of X is likely to be unique - i.e. how many **distinct** values of X are likely: e.g. 2 million or 20? – Mark Bannister Nov 09 '11 at 15:53
  • @Mark Bannister: Ok, I thought I had already answered (see my comment above: values in the records are not necessarily unique). But, I expect each distinct value to be repeated very few times, like less than 100. That would make roughly 20 000 distinct values, but probably more, because on average each distinct value should not occur more than 3, 4 times. I hope this helps. – Giorgio Nov 09 '11 at 16:05
  • If your database does not support full text search, there is a way you can do this but it would take me some effort to explain it and it requires coding. If you are interested please let me know. – NoChance Nov 09 '11 at 17:33
  • @Emmad Kareem: Is this some known technique? Maybe a link to some documentation would be sufficient? – Giorgio Nov 11 '11 at 12:59

7 Answers

21

The technology you are looking for is full-text indexing. Most RDBMSs have some sort of built-in full-text capability that could work here, or you could use something like Lucene if you want to get fancier and/or run the index in memory.
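
For instance, since the OP mentions MySQL in the comments, a minimal sketch of the built-in route might look like the following; the table layout, connection URL, and all names are made up for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: MySQL full-text search over field X via JDBC.
// Assumes a table created roughly as:
//   CREATE TABLE records (id BIGINT PRIMARY KEY, x TEXT, FULLTEXT KEY (x));
public class FullTextQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT id, x FROM records " +
                 "WHERE MATCH(x) AGAINST (? IN BOOLEAN MODE) LIMIT 100")) {
            ps.setString(1, "search*"); // trailing * = prefix match in boolean mode
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + ": " + rs.getString("x"));
                }
            }
        }
    }
}
```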

Wyatt Barnett
  • In my opinion the full-text options in any RDBMS are a workaround to make it do something it isn't designed for: "searching in some pile of unstructured, unrelated data". If you're building a search engine, you just don't use an RDBMS. It may work for small data sets but lacks any sort of scaling. Searching through piles of unstructured data is not a nail, so don't use a hammer. Use the right tool for the job. – Pieter B May 08 '15 at 09:41
21

Instead of putting your data inside the DB, you can keep it as a set of documents (text files) stored separately and keep only the links (paths/URLs, etc.) in the DB.

This is essential because a SQL query will, by design, be very slow at both substring search and retrieval.

Your problem is now formulated as having to search the text files that contain the set of strings. There are two possibilities here.

  1. Substring match. Your text blobs are single strings or words (without any whitespace) and you need to search for an arbitrary substring within them. In such cases you need to scan every file to find the files that match best, using an algorithm like the Boyer-Moore algorithm. See this and this for details. This is equivalent to what grep does, because grep uses similar machinery inside. But you may still run 100+ scans (worst case 2 million) before returning; a sketch of such a scan follows this list.

  2. Indexed search. Here you assume that the text contains a set of words and that search is limited to fixed words. In this case, the document is indexed over all possible occurrences of words. This is often called "full-text search". There are a number of algorithms for this and a number of open-source projects that can be used directly. Many of them also support wildcard search, approximate search, etc., such as:
    a. Apache Lucene: http://lucene.apache.org/java/docs/index.html
    b. OpenFTS: http://openfts.sourceforge.net/
    c. Sphinx: http://sphinxsearch.com/
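
For option 1, the sketch below shows the kind of scan grep performs, in the simplified Boyer-Moore-Horspool variant; it is illustrative only, and the class and method names are invented:

```java
import java.util.Arrays;

// Boyer-Moore-Horspool substring search: a simplified Boyer-Moore that
// skips ahead using a "bad character" shift table. Returns the index of
// the first occurrence of pattern in text, or -1 if there is none.
public class Horspool {
    static int indexOf(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        int[] shift = new int[Character.MAX_VALUE + 1];
        Arrays.fill(shift, m);                       // default: skip the whole pattern
        for (int i = 0; i < m - 1; i++)
            shift[pattern.charAt(i)] = m - 1 - i;    // distance to the pattern's last slot
        for (int pos = 0; pos + m <= n; pos += shift[text.charAt(pos + m - 1)]) {
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
            if (j < 0) return pos;                   // full match at pos
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("searching through records", "through")); // prints 10
    }
}
```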

Most likely, if you need "fixed words" as queries, approach two will be very fast and effective.

Dipan Mehta
  • This is an interesting concept but it seems unlikely that a developer can easily search through 1GB of textual data faster and more efficiently than a database engine. Much smarter people than you and I have laboured over query optimizers to do just that and it is a bit naive to think that you can somehow do that more efficiently. – maple_shaft Nov 09 '11 at 17:25
  • @maple_shaft The examples I have given are not RDBMS database engines. They are more like "search engines", if you like. There is a huge conceptual difference between picking a list out of an index (or hash table) versus searching through 1GB of data all over again every time a query fires. So what I am suggesting is not a minor tweak. – Dipan Mehta Nov 09 '11 at 18:02
  • This seems an interesting idea but I wonder how it would work. I would have more than 2 000 000 files, each about half a kilobyte in size. Or are you suggesting to have more than one record per file? What would be the difference wrt a database? – Giorgio Nov 09 '11 at 19:21
  • I'm not convinced that this would necessarily perform any better than, say, SQL fulltext index. – Kirk Broadhurst Nov 10 '11 at 05:08
  • @Giorgio - yes, that is how full-text search engines work. The key difference here is pre-indexed pages vs. an in-memory search (again, for every time a query comes). – Dipan Mehta Nov 10 '11 at 08:33
  • @KirkBroadhurst MySQL's full-text search (see http://dev.mysql.com/doc/refman/4.1/en/fulltext-search.html) is conceptually the same as the examples I have shown with Lucene or Sphinx. The other answers were discussing "in-memory indexes", which are quite unnecessary. – Dipan Mehta Nov 10 '11 at 08:36
  • @Dipan Mehta: I am trying out Lucene and it seems very easy to use. So I could put the records in text files and use Lucene to search through them. Then I just load into main memory the first N records that have matched. When I have more news I will report. – Giorgio Dec 18 '11 at 20:01
  • @Giorgio. Great. – Dipan Mehta Dec 19 '11 at 04:10
8

Have you considered a trie? Basically you build a tree using common prefixes, so all words that start with the same letters are children of the same node. If you're going to support matching on any substring, then you'll have to generate some sort of permuted index and build your trie from that. That may wind up blowing your storage requirements way out, though.
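
A minimal sketch of the permuted-index idea: index every suffix of each value, so that walking from the root along the query string reaches a node whose records all contain that string as a substring, and cap each node at N record IDs since only the first N hits are displayed anyway. Everything here (the `Node` shape, `MAX_HITS`, record IDs as longs) is an assumption for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Trie over all suffixes of each value ("permuted index"): a walk from the
// root along any query string lands on a node whose record IDs all contain
// that string as a substring. Each node keeps at most MAX_HITS IDs.
public class SuffixTrie {
    static final int MAX_HITS = 100;

    static class Node {
        Map<Character, Node> children = new HashMap<>();
        List<Long> recordIds = new ArrayList<>(); // capped at MAX_HITS
    }

    final Node root = new Node();

    void add(long recordId, String value) {
        for (int start = 0; start < value.length(); start++) {   // every suffix
            Node node = root;
            for (int i = start; i < value.length(); i++) {
                node = node.children.computeIfAbsent(value.charAt(i), c -> new Node());
                if (node.recordIds.size() < MAX_HITS && !node.recordIds.contains(recordId))
                    node.recordIds.add(recordId);
            }
        }
    }

    List<Long> search(String substring) {
        Node node = root;
        for (char c : substring.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return List.of();
        }
        return node.recordIds;
    }
}
```

Note that indexing every suffix is exactly where the storage blow-up comes from: for a value of length L the trie holds on the order of L² characters' worth of paths.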

TMN
  • YES! I was thinking about a tree structure and I remembered that there was something similar that might suit me, but I did not remember tries because I have never used them. Regarding the storage requirement: remember that I need to retrieve only the first N entries (e.g. N = 100) because it makes no sense to populate a table with 20000 hits. So each node of the trie would point to at most N entries. Also, I forgot to mention that I need fast access but I do not need fast updates, because the data is only loaded once. The trie idea on a permuted index could really work! – Giorgio Nov 09 '11 at 19:16
  • Good answer, but as you note, a trie is great for matching the *start* of your words but will quickly get complex and very large if matching any substring... – Kirk Broadhurst Nov 10 '11 at 05:10
  • As a first experiment, I have tried to build the set of all sub-strings appearing in the strings I have to search which, if I understand correctly, correspond to the paths of the trie. I got an out-of-memory exception (with 256M of heap for the JVM) at sub-strings of length 6. So I am afraid this solution is not feasible, unless I am doing something wrong. – Giorgio Nov 13 '11 at 16:32
5

I would like to add, on top of Wyatt Barnett's answer, that an RDBMS solution with full-text indexing on the appropriate column will work; but if you want to utilize a local cache of previously fetched records, then you need a plan to utilize those cached records to your advantage.

One option is to collect the unique identifiers of the records that you EXPLICITLY do not want to retrieve from the query and exclude them, possibly with a NOT IN or a NOT EXISTS clause.

A word of caution though: using NOT IN or NOT EXISTS tends not to be cheap and MAY negatively influence your query performance or query plan, depending on what database engine you are utilizing. Run an explain plan on your final query to ensure that all of the indexes on the affected columns are being utilized.

It also doesn't hurt to do a performance comparison between the two approaches to see which is faster. You may be surprised to find that maintaining a local cache and explicitly filtering its records from your query performs worse than a finely tuned query that fetches all records.
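
As a rough illustration of the NOT IN option via JDBC (the placeholder list has to be built dynamically, and the table and column names here are invented):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.List;

// Sketch: exclude already-cached record IDs from the full-text search so
// that only records missing from the local cache are fetched from the DB.
public class CacheAwareQuery {
    static PreparedStatement build(Connection conn, String term,
                                   List<Long> cachedIds) throws Exception {
        // One "?" placeholder per cached ID; skip the clause entirely when
        // the cache is empty, since "NOT IN ()" is not valid SQL.
        String notIn = cachedIds.isEmpty() ? "" :
            " AND id NOT IN (" +
            String.join(",", Collections.nCopies(cachedIds.size(), "?")) + ")";
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id, x FROM records" +
            " WHERE MATCH(x) AGAINST (? IN BOOLEAN MODE)" + notIn +
            " LIMIT 100");
        ps.setString(1, term);
        for (int i = 0; i < cachedIds.size(); i++) {
            ps.setLong(i + 2, cachedIds.get(i)); // JDBC parameters are 1-based
        }
        return ps;
    }
}
```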

maple_shaft
  • maple_shaft and @Wyatt Barnett: Thanks a lot for the suggestions. I will have to do some reading and try out different solutions. Not all databases support full-text indexing; MySQL (which I am currently using) does (http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html). I will try to run some tests and then report here. – Giorgio Nov 11 '11 at 08:31
2

Just in case you missed it: if you use Lucene alongside your database instead of the DB's built-in text search, you will have to be extremely careful when making modifications to your DB. How do you make sure you have atomicity when you have to make changes in both the DB and the external resource (Lucene)? Yes, it can be done, but it will be a lot of work.

In short, you lose the DB's transactional support if you put Lucene into your data schema.
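
A sketch of the failure mode, with the table layout and all names invented for illustration: the DB commit and the Lucene commit are two separate steps, and the second can fail after the first has already succeeded:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Sketch of the dual-write problem: the DB commit (step 2) and the Lucene
// commit (step 4) are separate operations; no single transaction spans both.
public class DualWrite {
    private final Connection conn;
    private final IndexWriter indexWriter;

    DualWrite(Connection conn, IndexWriter indexWriter) {
        this.conn = conn;
        this.indexWriter = indexWriter;
    }

    void updateRecord(long id, String x) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                 conn.prepareStatement("UPDATE records SET x = ? WHERE id = ?")) {
            ps.setString(1, x);          // step 1: write to the DB
            ps.setLong(2, id);
            ps.executeUpdate();
            conn.commit();               // step 2: the DB change is now durable...
        } catch (Exception e) {
            conn.rollback();             // a DB-only failure is still atomic
            throw e;
        }
        try {
            Document doc = new Document();
            doc.add(new StringField("id", Long.toString(id), Field.Store.YES));
            // step 3: ...but the index update can still fail independently
            indexWriter.updateDocument(new Term("id", Long.toString(id)), doc);
            indexWriter.commit();        // step 4
        } catch (Exception e) {
            // DB and index now disagree: no rollback covers both, so the
            // record has to be queued for repair or re-indexing by hand.
        }
    }
}
```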

InformedA
1

Have you considered Sphinx? http://sphinxsearch.com If you can use a third-party tool, this would be ideal for what you're trying to achieve; it's much more efficient at full-text search than any RDBMS I have personally used.

twigg
1

It is somewhat strange that none of the answers presented the term "inverted index", the technology underlying Apache Lucene and similar solutions.

The inverted index is a mapping from words to documents ("record-level inverted index") or even precise word locations within the document ("word-level inverted index").

AND and OR logical operations are trivial to implement. If you have precise word locations, it is possible to look for adjacent words, thus making phrase searches possible.

So, think about an index containing (word, file, location) tuples. When you have e.g. ("inverted", "foo.txt", 123) then you just check whether ("index", "foo.txt", 124) is part of the index to search for the full phrase "inverted index".
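
Here is a toy word-level inverted index along those lines, just to make the structure concrete; it is deliberately naive and not how Lucene actually stores its postings:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy word-level inverted index: word -> document -> word positions.
public class InvertedIndex {
    private final Map<String, Map<String, List<Integer>>> index = new HashMap<>();

    void addDocument(String doc, String text) {
        String[] words = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < words.length; pos++) {
            index.computeIfAbsent(words[pos], w -> new HashMap<>())
                 .computeIfAbsent(doc, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Phrase search: "first second" matches wherever first occurs at some
    // position p and second occurs at p + 1 in the same document.
    List<String> phraseSearch(String first, String second) {
        List<String> hits = new ArrayList<>();
        Map<String, List<Integer>> firstPostings = index.getOrDefault(first, Map.of());
        Map<String, List<Integer>> secondPostings = index.getOrDefault(second, Map.of());
        for (Map.Entry<String, List<Integer>> e : firstPostings.entrySet()) {
            List<Integer> next = secondPostings.get(e.getKey());
            if (next == null) continue;
            for (int p : e.getValue())
                if (next.contains(p + 1)) { hits.add(e.getKey()); break; }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument("foo.txt", "how an inverted index makes phrase search possible");
        System.out.println(idx.phraseSearch("inverted", "index")); // [foo.txt]
    }
}
```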

While I'm not recommending that you reimplement a full-text search engine from scratch, it is useful to know how technologies such as Apache Lucene work.

So, my recommendation is to learn how inverted indexes work and choose a technology using them such as Apache Lucene. Then you at least have a solid understanding of what can be done and what can't be done.

juhist