1

At the moment, we're using sql server full text search, but it's too inflexible.

The main thing we do is look up names of people from a database based on a search query. The searches need to be fast, and they need to be fuzzy. SQL Full Text Search doesn't really support fuzzy matching especially when combined with the thesaurus option. Therefore I need a better solution.

My research suggests that lucene and solr are widely used enterprise solutions, but my searching suggests these are more designed for indexing things like documents and webpages, or what it refers to as 'unstructured data'.

Our data is very well structured, and therefore I'm unsure if it's suitable for this type of work or if I should be investigating another product. According to the book I have Solr 1.4 Enterprise Search Server it supports all of the above except for prefix matching out of the box, however it states there are performance issues with substring searches.

Do you think that solr/lucene is a good technology to investigate for solving my problem? If not, do you have an alternative?

Any advice is welcome. I am a .NET developer, hence solrnet rather than solr.

NibblyPig
  • 2,995
  • 3
  • 16
  • 19
  • 3
    Your question requires too many details of your situation in order to answer it properly. You would be better off setting up prototyping environments and investigating the various tools you mentioned. That's the only way you'll find which one is easiest for you to use and which one meets your needs. –  Dec 19 '13 at 14:52
  • 1
    I guess my main question is, is it suitable to use solr to query a list of names using fuzzy matching? – NibblyPig Dec 19 '13 at 14:55
  • 3
    Have you looked at solr? Have you read their website? I would think that sort of question (and more) would be handled through their documentation. If you can't figure it out from their documentation, then maybe that's a sign not to use their product. –  Dec 19 '13 at 14:57
  • 1
    I have, yes. It talks mostly about documents, which is not what I am using. I appreciate you're trying to help but it's a question that I would have thought anyone who has used solr extensively would be able to answer with 'yes' or 'no' without even thinking about it. And if the answer is no, well, that will save several hours of work investigating it, and thus my employer money. Sorry to be curt. The product is -huge- though and I am just asking, should I look into this? Or something else? – NibblyPig Dec 19 '13 at 15:04
  • 3
    SLC - No worries about being direct, and allow me to reciprocate the courtesy. The first problem is that your question as written is off-topic for Programmers. It's either too broad, an off-site resource recommendation, or primarily opinion based. Even if we focused on the fuzzy-logic matching, it's still not a good fit for the site as it remains an off-site resource recommendation. I don't see a programming whiteboard type question in your current question. –  Dec 19 '13 at 15:14
  • I disagree, it's the perfect question for this site and it meets with the site's guidelines. However I'll update the question to reflect a bit more clearly what I am asking. – NibblyPig Dec 19 '13 at 15:17
  • 3
    @SLC To be curt, we aren't your paid consulting service, or even paid :) If you want enterprise level answers for your *specific* case you should seek enterprise level consulting – Ampt Dec 19 '13 at 15:18
  • 1
    I understand, but if you don't want to help or are unable to help, why are you posting here? I'm clearly just asking for a quick, yes or no, not a full enterprise consultation service. Are you just trolling my question? – NibblyPig Dec 19 '13 at 15:22
  • @SLC everything you send to solr (for the purpose of indexing) *is* a document, from solr's perspective. it doesn't matter if it's two words or a thousand words long. and if you read lucene's documentation you'll see that it supports fuzzy searches, you just need to decide whether its approach comes close to what you need. you can have solr up and running in no time, and a couple of hours spent on research isn't the end of the world. – devnull Dec 19 '13 at 15:24
  • @SLC The point of SO is to gather knowledge for *future* visitors to use. It has to be specific enough to cover a use case, but broad enough to not be limited to *only* this use case, which is the problem with this one. Who will this benefit down the line? – Ampt Dec 19 '13 at 15:26
  • This isn't SO. The specific topic of the question is, can you use solr not just for indexing large documents/web pages as it appears to have been designed, but for smaller intensive queries such as you'd find in a database column, ie. taking SQL Server full text search a step further. I'd have thought that'd be quite useful to know. In order for me to find out I'd have to literally implement it, which would take a couple of weeks, and measure the performance. And since I know nothing about it that would take a very long time to do that compared to someone saying, I've done that, and it worked. – NibblyPig Dec 19 '13 at 15:32

1 Answers1

4

I have only experience with Solr and Sphinx, so can't really compare too much. And we don't use much 'fuzzy' search. But I worked with Solr a lot and think I know the docs quite well.

First of all, the term 'document' is to be understood in a very technical way. By no means does this limit your search to typical text documents. We use Solr to search products from our database and in this context a document is just a denormalized data stream representing a single product with all the text information normally in related tables directly attached to the product document. So whatever you want to search for, if it has some kind of unique id and textual descriptions (tags, category, assortment, brand...) it qualifies as document.

Ok, back to fuzzy. This kind of search is difficult, because you can't really make much use of indexes here. You would have to compare a string against every single indexed string and calculate some kind of 'distance' value and then select against a maximum distance. Solr offers Fuzzy Search and Proximity Search, but since we don't use it, I can't say how good they perform. But as one can read on the internet from version 4.0 Lucene uses something called Levenshtein Automata which should enable fuzzy search on very large indexes.

Maybe interesting here too us the way Solr builds the index. Before indexing every string goes through filters and tokenizers. There some of the magic happens and you can strongly influence how good the index will be for your kind of data. There are already a lot of default Filters and Tokenizers, but you can even write your own. So maybe there would be ways to improve performance here a bit.

Beyond that there are a few things in which Solr is really great. Mainly facetted search, where you search and count for example how many products are in a given assortment. Just make a list of all assortments and counts, then go on and make the same list in near real time together with a search term. Or select another facet (say brand=CocaCola), a search term (q=light) and again get a list with all assortments and number of products. This cross referencing came out so nice and fast that we actually replaced nearly every single SQL product listing or searching on our web site with Solr queries.

thorsten müller
  • 12,058
  • 4
  • 49
  • 54
  • Thank you for your response that's very good to know. I'm trying to work out on balance if we should invest time in pursuing this or if there is an alternative that is better designed for the specific task of intensive matching of short words. This post really helps, thanks. – NibblyPig Dec 19 '13 at 17:51