NLP - Queries using semantic wildcards in full text searching, maybe with Lucene?

Question

Let's say I have a big corpus (for example in English or an arbitrary language), and I want to perform some semantic search on it. For example, I have the query:

"Be careful: [art] armada of [sg] is coming to [do sg]!"

And the corpus contains the following sentence:

"Be careful: an armada of alien ships is coming to destroy our planet!"

It can be seen that my query string could contain "semantic placeholders", such as:

[art] - a placeholder for articles (for example a / an in English)

[sg], [do sg] - placeholders for NPs and VPs (subjects and predicates)

I would like to develop a library capable of handling these queries efficiently. I suspect that some kind of POS tagging would be necessary for parsing the text, but I don't want to reimplement an existing full-text search engine just to make this work. How could I integrate this behaviour into a search engine like Lucene?

I know there are SpanQueries which could behave similarly in some cases, but as far as I can see, Lucene doesn't do any semantic processing of stored texts.

Is it possible to implement behavior like this? Or do I have to write my own search engine?

score 2 · Answer 1 · answered Dec 03 '14 at 09:46

I would consider using a standard keyword search with the nouns and verbs from your query to generate a shortlist of possible results, and then using an NLP parser (e.g. Stanford Core NLP) to perform a more detailed analysis of each candidate and keep only the exact matches. Assuming a reasonable corpus size, and that the queries use words that don't generate a large number of matches, this should be adequate.
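A rough sketch of that two-stage idea, under loud assumptions: the inverted index stands in for the keyword search, and the second stage uses a crude regular expression (a three-word article list, "any run of words" for [sg] and [do sg]) where a real system would call an NLP parser. It also uses every literal word of the query as a keyword rather than only nouns and verbs. All names below (`shortlist`, `template_to_regex`, `search`) are made up for illustration.

```python
import re
from collections import defaultdict

# Toy corpus; the last sentence shares the keywords but not the structure.
CORPUS = [
    "Be careful: an armada of alien ships is coming to destroy our planet!",
    "An armada of tourists is coming to visit the museum.",
    "Be careful: the weather is coming to ruin our picnic!",
    "Be careful of the armada: it is coming to town.",
]
QUERY = "Be careful: [art] armada of [sg] is coming to [do sg]!"

PLACEHOLDER = re.compile(r"(\[[^\]]+\])")
# Assumption: crude stand-ins for decisions a real parser would make.
PLACEHOLDER_PATTERNS = {"[art]": r"(?:a|an|the)"}
ANY_PHRASE = r"(?:\w+(?: \w+)*)"  # used for [sg], [do sg], and anything unknown


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(docs):
    """Inverted index: word -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def shortlist(index, query):
    """Stage 1: ids of documents containing every literal word of the query."""
    keywords = tokenize(PLACEHOLDER.sub(" ", query))
    postings = [index.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()


def template_to_regex(query):
    """Stage 2 helper: turn the query template into a regular expression."""
    parts = []
    for chunk in PLACEHOLDER.split(query):
        if PLACEHOLDER.fullmatch(chunk):
            parts.append(PLACEHOLDER_PATTERNS.get(chunk, ANY_PHRASE))
        else:
            parts.append(re.escape(chunk))
    return re.compile("".join(parts), re.IGNORECASE)


def search(query, docs):
    """Run the cheap shortlist first, then the detailed check on survivors."""
    index = build_index(docs)
    pattern = template_to_regex(query)
    candidates = sorted(shortlist(index, query))
    return [docs[i] for i in candidates if pattern.fullmatch(docs[i])]


print(search(QUERY, CORPUS))
```

The point of the split is that the expensive check only runs on the shortlist. Swapping the regex in `search` for a real parser-based check would not change the first stage.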

I think this is an interesting idea — InformedA, Dec 03 '14 at 10:08

score 1 · Answer 2 · answered Jun 06 '14 at 05:34

Since you use the terms NP and VP, I think you are talking about syntax, not semantics. The two are different. You can look at dependency grammar to see how it differs from CFG-style syntactic grammar.

I think if you use wildcards for semantic search, people would consider that a 'hack'.

This is what I would do:

To do semantic search properly, you first need a semantic analyzer (for Lucene, if you think in terms of Lucene). A semantic analyzer in turn requires a semantic parser. This is where you run into a problem, because semantic parsers are currently not good enough.

Assuming you had a good semantic parser, the next step would be to take the semantic graph and store it in the search engine. There are many ways to do this. The most obvious one is to simplify the graph, treat each edge as a predicate, store the predicates, and search on each predicate in an RDF manner. More on RDF: http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/

In your example (dropping the article placeholder, because it is syntactic, not semantic), the search query would be: [X of Y] and [X coming to do Z], where X = armada.
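To make the predicate idea concrete, here is a minimal sketch that assumes the parser has already produced the triples. The triples below are written by hand, the predicate names `of` and `coming_to` are invented for this example, and a real setup would use an RDF store rather than a Python dict. Variables start with `?`, loosely in the style of RDF query languages.

```python
# Assumption: hand-written triples standing in for semantic parser output.
# Each sentence id maps to (subject, predicate, object) edges of its graph.
TRIPLES = {
    0: [("armada", "of", "alien ships"), ("armada", "coming_to", "destroy")],
    1: [("fleet", "of", "cargo ships"), ("fleet", "coming_to", "dock")],
}


def unify(pattern, triple, bindings):
    """Match one pattern against one triple; return extended bindings or None."""
    new_bindings = dict(bindings)
    for want, have in zip(pattern, triple):
        if want.startswith("?"):
            # A variable must keep the value it was first bound to.
            if new_bindings.setdefault(want, have) != have:
                return None
        elif want != have:
            return None
    return new_bindings


def solve(patterns, triples, bindings=None):
    """Yield every set of bindings that satisfies all patterns together."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = unify(first, triple, bindings)
        if extended is not None:
            yield from solve(rest, triples, extended)


def search(patterns, store, fixed=None):
    """Return {sentence id: [bindings, ...]} for sentences with a match."""
    results = {}
    for doc_id, triples in store.items():
        matches = list(solve(patterns, triples, fixed))
        if matches:
            results[doc_id] = matches
    return results


# [X of Y] and [X coming to do Z], where X = armada
QUERY = [("?x", "of", "?y"), ("?x", "coming_to", "?z")]
print(search(QUERY, TRIPLES, fixed={"?x": "armada"}))
```

Because both patterns share `?x`, a sentence only matches when the same entity is the subject of both edges, which is what separates this from two independent keyword hits.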

I disagree that NLP parsers aren't good enough. I've done work with Stanford Core NLP and it gets very good accuracy on its tagging; more than good enough for this kind of application. Now, depending on corpus size you might argue it is too slow/resource intensive, but that's an entirely different problem. — Jules, Dec 03 '14 at 09:38
You are right, Stanford Core NLP has very good tagging accuracy. I was talking about the semantic parser part, at least for the tasks I used it for. Nevertheless, depending on how you use it for this problem, you might get very good results. Stanford Core NLP is a very good tool, and it is much better now. — InformedA, Dec 03 '14 at 10:12

score 0 · Answer 3 · answered Dec 12 '12 at 17:54

Not a Lucene expert but:

You could create your own tokenizer that indexes e.g. [ART] as a token. That way you might be able to do phrase searches like the example above. This won't work if 'armada' is sometimes an [SG] or some other token. You'd need a specialized wildcard search if that is the case.

Perhaps a conventional wildcard search through Lucene will allow you to gather at least a pretty good result set to start with. With a second pass you could parse out the [ART] qualifiers and such to arrive at a final result set.
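A small sketch of the tag-as-token idea, in plain Python rather than Lucene. It assumes a toy article lexicon in place of a real POS tagger, and it only handles single-word placeholders, which is exactly the limitation mentioned above. The tag token is stored at the same position as the word it annotates, much like stacked synonym tokens, so a phrase query can mix literal words and tags. `analyze`, `build_index` and `phrase_search` are hypothetical names.

```python
import re
from collections import defaultdict

ARTICLES = {"a", "an", "the"}  # assumption: toy lexicon instead of a POS tagger


def analyze(text):
    """Yield (position, term) pairs; articles also emit '[ART]' at the same position."""
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        yield position, word
        if word in ARTICLES:
            yield position, "[ART]"


def build_index(docs):
    """Positional index: term -> doc id -> set of positions."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in enumerate(docs):
        for position, term in analyze(text):
            index[term][doc_id].add(position)
    return index


def phrase_search(index, terms):
    """Return sorted ids of docs where the terms occur at consecutive positions."""
    if not terms:
        return []
    hits = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(
                start + offset in index.get(term, {}).get(doc_id, ())
                for offset, term in enumerate(terms)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


DOCS = [
    "Be careful: an armada of alien ships is coming!",
    "The armada of ships left.",
    "Armada of one.",
]
INDEX = build_index(DOCS)
print(phrase_search(INDEX, ["[ART]", "armada", "of"]))
```

A placeholder such as [SG] that can span several words does not fit this scheme, since a phrase query needs to know how many positions each slot covers.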
