
I'm trying to implement an autocomplete feature for a search engine. I have a database of (stemmed) words that occur in the documents I have available for users to search.

What I am thinking of doing is:

  • Constructing a graph with words as nodes and weighted edges, where an edge's weight is the number of times those two words have been searched together.

  • Each node also keeps track of the word most commonly typed after it.

  • Whenever a user starts typing a query, this graph is queried; the first word is predicted by string matching and then ranking the matched words by search frequency.

  • When the user starts typing the next word, the graph is visited and the neighbours of the current node are retrieved and shown to the user in order of frequency (edge weight).

This is a rough overview of what I am trying to achieve. I have not implemented it yet; it would require a fair amount of work, and I think it would be quite slow. To be quite honest, I don't think doing that much work knowing that it'll be quite slow is worth it.
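In code, the rough idea would look something like this (just a sketch, nothing here is implemented or tuned; all names are placeholders):

```python
from collections import defaultdict

class WordGraph:
    def __init__(self):
        # edge weights: number of times two words were searched together
        self.edges = defaultdict(lambda: defaultdict(int))
        # standalone search frequency per word, for ranking first-word matches
        self.freq = defaultdict(int)

    def record_query(self, words):
        """Update the graph from one completed search query."""
        for w in words:
            self.freq[w] += 1
        for a, b in zip(words, words[1:]):
            self.edges[a][b] += 1

    def complete_first(self, prefix, limit=5):
        """Predict the first word: string-match, then rank by search frequency."""
        matches = [w for w in self.freq if w.startswith(prefix)]
        return sorted(matches, key=lambda w: -self.freq[w])[:limit]

    def complete_next(self, prev_word, limit=5):
        """Neighbours of the previous word, ordered by edge weight."""
        nbrs = self.edges.get(prev_word, {})
        return sorted(nbrs, key=lambda w: -nbrs[w])[:limit]
```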

If I am on the right track with this conceptual design, are there any optimizations that should be made to this to improve performance? Or perhaps an entirely different design that I should be using instead?

I found only 2 papers on query prediction, [this](http://dl.acm.org/citation.cfm?id=1993040&CFID=561173001&CFTOKEN=66538354) and [this](http://research.microsoft.com/en-us/people/djiang/icde09.pdf). I went through the latter, but it isn't that related to my use case.

Rachel
Ayush Gupta
  • You have 'database' tagged. Is this in a database that you need to query, or is it in an accessible memory structure? Are you encountering any existing performance issues (there are many things that can affect performance, and describing the existing architecture and systems can help in identifying possible places to improve it)? – MichaelT Mar 10 '16 at 19:19
  • 2
    ... In other words, there's not enough information here to make your question reasonably answerable. Any attempt at an answer would be guessing. – Robert Harvey Mar 10 '16 at 19:20
  • Does your autocomplete only consider two words at a time? For example, if I understand you correctly your algorithm will make it so if I type `memory` it will check a list and autocomplete `foam` or `loss` or `use` depending on which one is more commonly searched. But what about if I type two words, like `memory for` - does it try to autocomplete for "memory for" or for the term "for"? – Rachel Mar 10 '16 at 19:32
  • 1
    @MichaelT I removed the tag database. The 'database' I have here is [this](https://xapian.org/docs/admin_notes.html#databases) which contains all the words that I have. I need to construct a graph with these words and then query this graph for the abovementioned information. – Ayush Gupta Mar 10 '16 at 19:34
  • @Rachel words like for, if, it, etc will be skipped. I'm trying to set up a basic and usable autocomplete functionality right now. As for considering multiple words at a time, I guess this algorithm can be extended easily to deal with that. – Ayush Gupta Mar 10 '16 at 19:37
  • Does your algorithm work, as currently written? – Robert Harvey Mar 10 '16 at 19:39
  • 2
    @RobertHarvey I have not implemented it yet. It would require a fair amount of work to implement this, and it would be quite slow. To be quite honest, I don't think doing that much work already knowing that it'll be quite slow is worth it. – Ayush Gupta Mar 10 '16 at 19:46
  • Have you read any research in this area of auto-completion? – Jay Elston Mar 10 '16 at 19:50
  • 2
    @JayElston I found only 2 papers on query prediction, [this](http://dl.acm.org/citation.cfm?id=1993040&CFID=561173001&CFTOKEN=66538354) and [this](http://research.microsoft.com/en-us/people/djiang/icde09.pdf). I went through the latter but it isn't that related to my use-case. Do you know of any papers I that could help me? – Ayush Gupta Mar 10 '16 at 19:58

2 Answers


What you're describing will require tremendous overhead, and calculating the efficiency is a non-polynomial problem (travelling salesman).

At the same time, it should be noted that, if properly implemented, this solution will deliver as intended. However, you can achieve the same result with far less overhead.

You will need:

  • A hash-table dictionary that provides pointers to

  • A priority heap of priority heaps: the top-level heap is for the first word, and subsequent heaps are for the 2nd-, 3rd-, etc. level words.

This will provide fast access and place all penalties on the write functions.
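A one-level sketch of this idea in Python (names are invented; a full version would nest a heap per word level as described above, and each write rebuilds the affected heap so reads stay cheap):

```python
import heapq
from collections import defaultdict

class SuggestionTable:
    def __init__(self):
        # hash table: word -> (next word -> count)
        self.counts = defaultdict(lambda: defaultdict(int))
        # hash table: word -> ready-made max-heap of (-count, next word)
        self.heaps = {}

    def record_pair(self, word, next_word):
        self.counts[word][next_word] += 1
        # write-time penalty: rebuild this word's heap now
        self.heaps[word] = [(-c, w) for w, c in self.counts[word].items()]
        heapq.heapify(self.heaps[word])

    def suggest(self, word, limit=3):
        # read path: one hash lookup, then take the top of a prebuilt heap
        return [w for _, w in heapq.nsmallest(limit, self.heaps.get(word, []))]
```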

Mikeologist

N-Grams can be a solution here.

  1. Process the corpus to prepare an index of N-Grams
  2. Find all possible suffix phrases for the current query

Creating the N-Grams would be slow, but lookups should be O(n log n).
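As a toy illustration of step 1, here is a trivial in-memory word-level N-gram index (names are invented; a real deployment would use Elasticsearch or Solr as in the links in the edit):

```python
from collections import defaultdict

def build_ngram_index(phrases, n=2):
    """Map each word-level (n-1)-gram prefix to the words that continue it."""
    index = defaultdict(set)
    for phrase in phrases:
        words = phrase.split()
        for i in range(len(words) - n + 1):
            key = tuple(words[i:i + n - 1])
            index[key].add(words[i + n - 1])
    return index

index = build_ngram_index(["memory foam mattress", "memory loss causes"])
# sorted(index[("memory",)]) -> ['foam', 'loss']
```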

Edit:

Two examples of this technique:

https://qbox.io/blog/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams

http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx

Shamit Verma