We have developed a web based application for name matching. It operates by breaking names into parts and the Soundex value of each part is stored in a database. The Levenshtein distance metric is used to apply percentage matching of sound as well as spelling against a given name.
At runtime, we load all records into memory and apply the Levenshtein distance to all of the Soundex values and the spelling of all the parts of all the names.
This was working fine at first because there were at maximum 20 thousand names, but now one of our clients has 30 million names. Loading this huge list in memory for each request and applying this type of matching is a pathetic approach, using a lot of memory and execution time.
We are looking for suggestions to search database of 30 million records or more in near future with percentage matching of Sound and Spelling.
Core Functionality
End user enters the name to be matched and minimum percentage. We are supposed to show all those names in database for which any part of the name matches with any part of the given name upto the given percentage. Full name is not required to be matched, any part if matches upto the percentage is a success. For example.
Given Name: Helen Hunt
Name in DB: Holly Hunter
Both parts of both names are not matching exactly but upto some extent, let us assume 80%, so if user enter 80% then the name in DB must be shown as matching name.