Partial name matching in millions of records

Question

We have developed a web based application for name matching. It operates by breaking names into parts and the Soundex value of each part is stored in a database. The Levenshtein distance metric is used to apply percentage matching of sound as well as spelling against a given name.

At runtime, we load all records into memory and apply the Levenshtein distance to all of the Soundex values and the spelling of all the parts of all the names.

This was working fine at first because there were at maximum 20 thousand names, but now one of our clients has 30 million names. Loading this huge list in memory for each request and applying this type of matching is a pathetic approach, using a lot of memory and execution time.

We are looking for suggestions to search database of 30 million records or more in near future with percentage matching of Sound and Spelling.

Core Functionality

End user enters the name to be matched and minimum percentage. We are supposed to show all those names in database for which any part of the name matches with any part of the given name upto the given percentage. Full name is not required to be matched, any part if matches upto the percentage is a success. For example.

Given Name: Helen Hunt
Name in DB: Holly Hunter

Both parts of both names are not matching exactly but upto some extent, let us assume 80%, so if user enter 80% then the name in DB must be shown as matching name.

Are you using SQL Server? I see you tagged it asp.net. Thinking the possibility of a CLR assembly which would prevent the network traffic and let SQL server manage the memory. — RubberChickenLeader, Aug 10 '16 at 18:53
@bjan where are the names stored? are they stored in SQL Server? — RubberChickenLeader, Aug 10 '16 at 20:23
What are you searching for? The top 100 names which match a given query best? — Doc Brown, Aug 10 '16 at 20:33
@DocBrown user enters the name to be matched and percentage and we show all names if any part of the name in db matches with any part of the given name upto the given percentage. — bjan, Aug 11 '16 at 04:36

svidgen · Accepted Answer · 2016-08-11T14:15:37.830

Without knowing the full details of what you need, you probably want to do one of the following:

Use an existing search tool, like Sphinx or Lucene
Perform n-gram approximate matching

I don't fully know what's involved installing and configuration sphinx; but, I'm under the impression you can point it at a database, tell it which fields to index, how to weight the results, and it'll give you an ordered list of matching records back.

For user-facing or mission critical stuff, use an existing search tool.

If you're just feeling academic ... Play with ngrams:

An ngrams lookup table can serve as your initial set of potential matches, and you can use Levenshtein distances to prune and sort the results.

Assuming you want to search people, you might do something like:

_ people _________
personId: int
name: varchar
soundex_name: varchar

_ people_ngrams __
personId: int
ngramId: int

_ ngrams _________
ngramId: int
ngram: char(3)
count: int

You can either periodically rebuild your ngrams or build them on-the-fly. Either way, a simple, naive search algorithm can look like this:

search_ngrams = ngrammify(soundex(search_string));

notable_ngrams = select top 10 *
  from ngrams
  where ngram in (search_ngrams)
  order by count asc;

possible_matches = select top 1000 distinct people.*
  from people_ngrams, people
  where ngramId in (notable_ngrams);

best_matches = top 100 possible_matches
  ordered by Levenshtein_distance(match, soundex(search_string));

Using something pretty similar to this (but with a little more ngram "popularity" tuning, blacklists, whitelists, etc.), I've seen this sort of algorithm fuzzily merge records between data sets in bulk, as well as facilitate custom fuzzy search utilities and ongoing records de-duplication efforts.

Now, in my case, I wasn't matching millions of records, I was looking to select the best possible merges between a two data sets on the order of hundreds of thousands of records each. And, we wanted it to work fairly quickly -- within a few minutes. (Quick, what's 100,000 * 100,000?) And, we were successful.

So, with the right tuning, this sort of thing can be snappy and effective. We were ultimately able produce a merged set on a humble, dated, dual-core machine in a few minutes, with "questionable" merges automatically flagged for manual review. But, it took a lot of time to find the ngram popularity/relevance sweet-spot, and the right string-distance thresholds, and blacklists, and whitelists ... etc.

THAT SAID, you can really get sucked into a hole working on this stuff. For any real-world production-level stuff, you should generally use a well-established tool that's already made and optimized for this kind of searching.

Like Sphinx or Lucene.

I just searched fuzzy on [Sphinx 2.2.11-release reference manual](http://sphinxsearch.com/docs/current.html#extended-syntax) and it looks that it matches exact word while I need to match words partially. Correct me if I am wrong about this. — bjan, Aug 11 '16 at 05:07
@bjan Yeah. Looking at the docs further, I'm not sure Sphinx's fuzzy search is exactly what you're looking for. It can use a [soundex morphology](http://sphinxsearch.com/docs/current.html#conf-morphology). But, based on your recent edit, you *may* want to roll your own ngram + string-distance search. And as I said above, it can take awhile to tweak the algorithm and thresholds to get right; but, it's not infeasible. And, if you need that level of flexibility ... — svidgen, Aug 11 '16 at 14:13
@bjan Oh, I also totally forgot about [Lucene](http://lucene.apache.org/core/). I'm not sure it does what you need either; but, it's pretty darn popular, and worth looking at before you roll your own. [Lucene's docs](http://lucene.apache.org/core/6_1_0/core/org/apache/lucene/search/package-summary.html) mention fuzzy searching and rankings using the Levenshtein string distance. — svidgen, Aug 11 '16 at 14:15

Partial name matching in millions of records

1 Answers1

Linked