6

I have written a data cleansing application which, for the most part, works well. It is not designed to handle large volumes of data: nothing more than about half a million rows. So early on in the design process, a decision was made to try to do as much of the work in memory as possible. The thought was that writing to a database or disk would slow things down.

For most of the various cleaning operations the application offers, this has proved true. When it comes to deduplication, however, it is absurdly slow. Running on a fairly powerful server, it takes about 24 hours to dedupe half a million rows of data.

In pseudocode, my algorithm runs along these lines:

List<FileRow> originalData;
List<FileRow> copiedData = originalData.ToList();

foreach (FileRow original in originalData)
{
    foreach (FileRow copy in copiedData)
    {
        // don't compare rows against themselves
        if (original.Id != copy.Id)
        {
            // if it's a perfect match, don't waste time with the slow fuzzy algorithm
            if (original.NameData == copy.NameData)
            {
                original.IsDupe = true;
                break;
            }

            // if it's not a perfect match, try Jaro-Winkler
            if (_fuzzyMatcher.DataMatch(original.NameData, copy.NameData))
            {
                original.IsDupe = true;
                break;
            }
        }
    }
}

Looking at this, it's obvious why it's so slow: where the other operations can work through the file row by row, this one has to go back through the whole file again for every row. So the processing time grows quadratically with the number of rows.

I have also used threading elsewhere to speed things up, but my attempts to thread this have failed. In the real-world code we don't just mark duplicates as "true" but group them, so that all instances of a given match share a unique Id. But the procedure has no way of knowing whether another thread has already found and marked a duplicate, so threading leads to errors in the grouping Id assignment.

To try to improve matters, we added a database-backed cache of common Jaro-Winkler matches to eliminate the need for that relatively slow method. It didn't make a significant difference.

Is there another approach I can try, or improvements I can make to this algorithm to make it faster? Or am I better off giving up trying to do this in memory and writing it to a database to do the job there?

Bob Tway
  • Will the first character of `original.NameData` always match `copy.NameData`'s first character? If so, you could sort `copiedData` then break when you hit the mismatch. – Dan Pichelman Jan 25 '18 at 16:41
  • That's quadratic complexity, no? If so, you can speed things up dramatically with just about any data structure that offers better-than-linear search times (just about any associative container like a dictionary would be a start), perhaps only for a coarse check before you use Jaro–Winkler. Anything logarithmic or better should shave those hours down to minutes or less. – Jan 25 '18 at 17:36
  • I think these commenters overlook that fuzzy matches like a Jaro-Winkler comparison are not easily sped up by some dictionary approach. A dictionary can help you find all exact matches quickly, but it does not reduce the running time for the majority of non-matches you want to pass to the slow fuzzy matcher. So these people haven't really understood the problem. – Doc Brown Jan 25 '18 at 20:19
  • Matt, how big is the allowed deviation of two words using the Jaro-Winkler metric? – Doc Brown Jan 25 '18 at 20:22
  • @docbrown I'm at home now, but I think it's set to about 0.8 similarity for a match. The real code is rather more complex as it checks first name matches but then cascades down to check address lines too. – Bob Tway Jan 25 '18 at 22:12
  • Have you run a profiler on this? Where are the hot spots (the places where the code is spending the most time executing)? – Robert Harvey Jan 25 '18 at 23:18
  • @RobertHarvey I haven't, no. My shop is the kind of place where they're happy to throw £75k staff costs at this project but not a few hundred on a profiler :( – Bob Tway Jan 26 '18 at 08:59
  • @DocBrown It's set to 0.88 similarity. In the actual code it has a three-step fuzzy match call based on surname, title and then forename, backing out of the cascade at any step where it fails to find a match above the 0.88 threshold. – Bob Tway Jan 26 '18 at 09:05
  • @DocBrown I see -- I'm not so used to these algorithms dealing with fuzzy string distance. My original thought was that maybe you could use some kind of data structure to eliminate the lion's share of the candidates before using a string distance algorithm, like one as simple as what Dan pointed out based on a first character. But that naturally only works if the fuzzy matching requires at least the first characters to be identical to adequately satisfy the notion of a match. –  Jan 26 '18 at 10:40
  • I am pretty sure a database won't help you as long as it does not provide some nifty indexing for this kind of fuzzy matching. And if it does, you may be able to find some isolated component which can do the same indexing without a database, in-memory, which will then surely be faster. – Doc Brown Jan 26 '18 at 13:59

3 Answers

5

It seems to me that the only way you can really change the time complexity is to start to turn the Jaro-Winkler algorithm 'inside out' by collecting statistics on each value that are relevant to the algorithm. I'll be honest and admit that I'd never heard of this algorithm prior to reading your post (thanks!). As such, I am going to type out some vague ideas and hope they come together into a coherent approach. Hopefully they will either work or give you some other ideas that do.

So looking at the wiki page it seems there are only three things you need to figure out:

  • length of the strings: s
  • transpositions: t
  • matching characters: m

Getting the length of each string is beyond easy: it's a fixed property of each string, so compute it once and you're done. Transpositions and matches, on the other hand, are specific to each comparison. But we don't necessarily need exact values for these in order to decide whether two strings are a candidate pair. I think you can create tests that will help narrow things down.

The first thing that comes to mind is inspired by a Bloom filter: simply index each string by the characters it contains. Literally take each letter and store a reference to every string that contains it.

Then I can take a string like CAT and find all the other words that contain a 'C', 'A', and 'T'. I'll denote that as [C,A,T]. I can then look for [C,A], [C,T], and [A,T]. I'll presume for a moment that anything containing fewer than two of 'C', 'A', or 'T' would not meet the threshold, i.e. you know that m/s1 can be at most 1/3 for such strings. You can take this a bit further and start calculating an upper-bound value for the comparison. This need not be very precise. For example, for the strings in [C,A,T] you might have these upper bounds:

cattywampus: (3/3 + 3/11 + 1)/3 = 0.758
tack: (3/3 + 3/4 + 1)/3 = 0.917
thwack: (3/6 + 2)/3 = 0.833
tachometer: (3/10 + 2)/3 = 0.767

If your threshold is 0.8, you can eliminate two of these because you know the best they could do is below your minimum. For the two-letter indexes, you know that the first factor m/s1 can never be more than 2/3, and you can do the same analysis. Any string that has none of 'C', 'A', and 'T' has a result of 0, by definition.

I think this is a start towards getting away from quadratic time. You can probably do better with more interesting data structures or heuristics, and I'm sure someone can suggest some. One not-yet-well-formed idea is to extend this to index not only the characters but also their positions.
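A rough C# sketch of such a prefilter might look like this: precompute a character-count map per string, use the multiset overlap as an upper bound on the matching characters m, and only run the real Jaro-Winkler comparison when the optimistic score could still reach the threshold (the 0.4 term is the most generous Winkler prefix bonus, 4 × 0.1, so the bound stays a true upper bound):

using System;
using System.Collections.Generic;
using System.Linq;

// Character counts for one string, computed once up front.
static Dictionary<char, int> CharCounts(string s) =>
    s.GroupBy(c => c).ToDictionary(g => g.Key, g => g.Count());

// Upper bound on the Jaro "matching characters" m: the multiset overlap of the
// two strings' character counts (each character can match at most once).
static int MaxMatches(Dictionary<char, int> a, Dictionary<char, int> b)
{
    int m = 0;
    foreach (var kv in a)
        if (b.TryGetValue(kv.Key, out int n))
            m += Math.Min(kv.Value, n);
    return m;
}

// True if the pair could still reach the threshold, assuming every shared character
// matches, there are no transpositions, and the maximum Winkler prefix bonus applies.
static bool CouldMatch(string s1, string s2,
                       Dictionary<char, int> c1, Dictionary<char, int> c2,
                       double threshold)
{
    int m = MaxMatches(c1, c2);
    if (m == 0) return false; // no shared characters: score is 0 by definition
    double jaroUpper = (m / (double)s1.Length + m / (double)s2.Length + 1.0) / 3.0;
    double jwUpper = jaroUpper + 0.4 * (1.0 - jaroUpper);
    return jwUpper >= threshold;
}

Since CharCounts can be computed once per row, rejecting a non-candidate costs only a handful of integer operations instead of a full Jaro-Winkler run.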

JimmyJames
  • The problem is called: all-pairs similarity search. In the approximate matching literature, this is called filtration. It uses inverted indexing. The idea is to reject all pairs that are too far apart before calculating how close they are. What you are doing is the subset equivalent of the substring partitioning method called q-grams. I was reading about this in order to implement approximate substring search. – Dan D. Jan 26 '18 at 01:50
  • @DanD. You seem well informed. The terms you have provided are probably enough but if you have any good references for these, they are much appreciated. – JimmyJames Jan 26 '18 at 02:27
2

Fun problem!

I did this sort of thing for a data merge and migration over 8 years ago. We were operating on the order of hundreds of thousands of records and were ultimately able to run the merge to produce a full set of spot-checkable results in minutes. (Not on a powerful server, mind you.) We used a Levenshtein distance for our fuzzy matching, but the concept looks about the same.

Essentially, you want to look for indexable heuristics that can limit the number of candidate matches for any given record. Indexable because the lookup itself needs to operate in O(log(n)) if the whole merge needs to operate in O(n*log(n)) time (on average).

There are two types of heuristics we looked for:

  • Other account attributes
  • n-grams (full text indexing, more or less)

Firstly, before things get complicated, consider whether there are other attributes you can group on, either as they are or in some slightly-altered state, to create smaller clusters. For example, is there reliable location data you can index and cluster on?

If you have ZIP code data, for example, and you know it to be reasonably accurate, something like that can "trivialize" the problem right there.
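For example, if the question's FileRow had a (hypothetical) ZipCode property, a blocking pass could look something like this, using the question's originalData and _fuzzyMatcher and only ever handing rows from the same block to the fuzzy matcher:

// Hypothetical blocking step: ZipCode stands in for whatever reliable attribute exists.
var blocks = originalData.GroupBy(row => row.ZipCode);

foreach (var block in blocks)
{
    var rows = block.ToList();
    for (int i = 0; i < rows.Count; i++)
        for (int j = i + 1; j < rows.Count; j++)
            if (_fuzzyMatcher.DataMatch(rows[i].NameData, rows[j].NameData))
                rows[i].IsDupe = rows[j].IsDupe = true;
}

With k reasonably even blocks, that is roughly n²/k comparisons instead of n².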

Otherwise, you need to build a fuzzy full-text index -- or at least sort of.

For this, I wrote a custom trigram index with PHP + MySQL. But, before I explain that solution, here's my disclaimer:

I did this over 8 years ago. And, I dove straight into building my own trigram index and ranking algorithm because I wanted to understand it better. You can probably leverage a simple FULLTEXT index or use an off-the-shelf search engine like Sphinx to achieve the same results, if not better.

That said, here's the basic fun solution:

  1. For each record
    1. Index search-column length (a special optimization for de-dupe jobs!)
    2. Extract n-grams
    3. For each extracted n-gram
      • Add n-gram to a ngram_record table (or collection)
  2. Generate n-gram statistics
    1. For each ngram_record
      • Initialize or update ngram cardinality
    2. Calculate median cardinality
    3. Calculate standard deviation
    4. For each ngram
      • Assign relevance as a function of cardinality against the distribution
  3. For each record (source)
    1. Find top N most relevant ngrams
    2. For each ngram
      • Find records containing the ngram
      • Filter for records with length +/- C of source length
      • Assign relevance = ngram relevance
    3. Group candidates by ID, summing relevance
    4. Order by relevance descending
    5. Perform fuzzy string-distance match against the M most relevant candidates

Note that this solution is functional, but very "immature." There is a lot of room for optimization and enhancement. An additional layer of indexing could be employed, for example, where the search field is broken into word ngrams before breakdown into character ngrams.

Another optimization I made, but didn't fully mature, was assigning relevance back through to the ngram_record relationships. When you search for a given ngram, you'll get better results if you can select records for which the ngram has similar relevance between the "source" and "candidate" records, where relevance is a function of ngram relevance, its frequency in the record, and the length of the record.

There's also a lot of room for tweaking your N, M, and C values above.
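Purely as an illustration of the steps above, a heavily condensed in-memory C# version might look like the following (the original was PHP + MySQL; rows and _fuzzyMatcher are borrowed from the question, and the relevance score is simplified to an inverse document frequency):

using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<string> Trigrams(string s) =>
    Enumerable.Range(0, Math.Max(0, s.Length - 2)).Select(i => s.Substring(i, 3));

// Step 1: inverted index, trigram -> ids of the records containing it.
var index = new Dictionary<string, List<int>>();
for (int id = 0; id < rows.Count; id++)
    foreach (var g in Trigrams(rows[id].NameData).Distinct())
    {
        if (!index.TryGetValue(g, out var ids)) index[g] = ids = new List<int>();
        ids.Add(id);
    }

// Step 2 (simplified): rarer trigrams are more relevant.
double Relevance(string g) => 1.0 / index[g].Count;

// Step 3: gather candidates via the N most relevant trigrams, filter by length,
// rank by summed relevance; only the top M reach the fuzzy matcher.
const int N = 5, M = 20, C = 4;
for (int id = 0; id < rows.Count; id++)
{
    var source = rows[id].NameData;
    var candidates = Trigrams(source).Distinct()
        .OrderByDescending(Relevance)
        .Take(N)
        .SelectMany(g => index[g].Select(other => (other, score: Relevance(g))))
        .Where(c => c.other != id && Math.Abs(rows[c.other].NameData.Length - source.Length) <= C)
        .GroupBy(c => c.other)
        .Select(grp => (other: grp.Key, score: grp.Sum(c => c.score)))
        .OrderByDescending(c => c.score)
        .Take(M);

    foreach (var c in candidates)
        if (_fuzzyMatcher.DataMatch(source, rows[c.other].NameData))
            rows[id].IsDupe = true;
}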

Have fun!


Or use Sphinx. Seriously, if you're not going to have fun, find a fulltext search engine and use that.

And as it turns out, I actually answered a fuzzy search question similarly a couple years ago. See: Partial name matching in millions of records.

svidgen
1

The problem

The main problem is your O(n^2) sequential algorithm. For 500,000 lines, that's 250,000,000,000 iterations. If it takes 24 hours to run, that's roughly 350 nanoseconds per iteration.

Immediate minor improvements

You compare every item of the list with every other, but twice: first you compare a in the outer loop with b in the inner loop, then later, you'll compare b in the outer loop with a in the inner loop.

The Jaro-Winkler metric is symmetric, so it's not necessary to do the comparison twice. In the inner loop, it is sufficient to iterate through the remaining elements. This improvement is not extraordinary: it's still O(n^2), but it will at least cut your execution time roughly in half.
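In terms of the question's code, that change might look like the following; note that both rows of a matching pair now get flagged, and the early break has to go, because a later row's only match partner may be the current outer row:

for (int i = 0; i < originalData.Count; i++)
{
    for (int j = i + 1; j < originalData.Count; j++)   // each unordered pair only once
    {
        FileRow a = originalData[i], b = originalData[j];
        if (a.NameData == b.NameData || _fuzzyMatcher.DataMatch(a.NameData, b.NameData))
        {
            a.IsDupe = true;
            b.IsDupe = true;
            // no break: a later row might match only a, so keep scanning
        }
    }
}

The copiedData list and the Id check also disappear, since a row is never compared against itself.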

As a side remark, even though it will be an insignificant improvement compared to the main issue: depending on the language, you could have some performance issues related to lists (see for example here for C# lists with foreach, or consider memory allocation/deallocation overhead for large lists in C++).

An in-memory solution?

If you just needed to find exact dupes, a much faster way would be:

  • Calculate some hash code when iterating through the lines, and populate a map that relates the hash code to a list of matching lines.
  • After the first pass, you can then iterate through the map, ignore the entries whose list has a single element, and process the lists that have several elements (potential groups to be verified, because the same hash doesn't always guarantee that it's the same value).
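A minimal sketch of that two-pass idea, reusing the question's types (keying the dictionary on the name string itself lets it handle both the hashing and the final equality check):

// Pass 1: bucket rows by NameData (the Dictionary hashes the key and resolves collisions).
var buckets = new Dictionary<string, List<FileRow>>();
foreach (FileRow row in originalData)
{
    if (!buckets.TryGetValue(row.NameData, out var bucket))
        buckets[row.NameData] = bucket = new List<FileRow>();
    bucket.Add(row);
}

// Pass 2: every bucket with more than one entry is a group of exact dupes.
foreach (var bucket in buckets.Values)
    if (bucket.Count > 1)
        foreach (FileRow row in bucket)
            row.IsDupe = true;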

The complexity of this algorithm would be O(2n), i.e. about 1 million iterations in the best case, and O(n^2) in the hypothetical worst case (every line being a duplicate of the same value). The real case depends on the number of duplicate groups and the number of elements in each of those groups, but I expect this to be an order of magnitude faster than your approach.

The fuzzy match could easily be implemented in the same way if the matching could be expressed through a "normalization" function f() defined so that f(record1) == f(record2) means there is a match. This would work, for example, if the fuzzy match were based on some variant of Soundex.
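In that case the sketch above only needs a different key; here Soundex is a hypothetical stand-in for the chosen normalization f(), not an implementation:

// Soundex() is a placeholder for whatever normalization f() is chosen.
var groups = originalData
    .GroupBy(row => Soundex(row.NameData))
    .Where(g => g.Count() > 1);

foreach (var g in groups)
    foreach (FileRow row in g)
        row.IsDupe = true;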

Unfortunately, the Jaro-Winkler distance doesn't meet this requirement, so every line must be compared with every other.

A database solution

Intuitively, I'd say that using a DBMS approach could also do the job, especially if your fuzzy match is a little more complex, and if you work with fields.

Populating a DBMS with half a million lines should in principle take far less than 24 hours if your server is appropriately dimensioned (single-pass bulk upload). A sorted SELECT or a GROUP BY clause would easily find the exact dupes. The same is true for a fuzzy match that has a "normalization" function.

But for fuzzy matches that require explicit comparison, like the Jaro-Winkler, it will not help very much.

Divide et impera variant

If your metric is not applied to the line as a whole but to a set of fields, the DBMS could reduce the number of comparisons by working at field level. The idea is to avoid comparing all the records with each other and instead consider only smaller subsets, for which the combinatorial explosion stays within a reasonable range:

  • For the relevant field, select the unique values. These often form a smaller set.
  • Calculate the metric in this smaller set, to identify the potential groups
  • Ignore pairs of values with insufficient proximity

In the following example, you'd compare only 3 values instead of 5:

George
Melanie  
Georges 
George  
Melanie  

With a threshold of 85%, this would result in:

George  / Georges    97%   (promising)
George  / Melanie    54%   (ignored)
Melanie / Georges    52%   (ignored)

If several fields are involved, you'd process each field individually to identify the potential promising matching subgroups. For example:

George    NEW-YORK
Melanie   WASHINGTON
Georges   NEW IORK
George    OKLAHOMI
Melanie   OKLAHOMA

would add a second list of candidate groups (after removing the non-promising values):

NEW-YORK / NEW IORK
OKLAHOMI / OKLAHOMA 

You'd then select the records whose values for each relevant field all belong to the promising groups: here { George, Georges } and { NEW-YORK, NEW IORK, OKLAHOMI, OKLAHOMA }. The only records returned would be:

George    NEW-YORK
Georges   NEW IORK
George    OKLAHOMI

There are then two strategies:

  • either you run your algorithm on the selected records, if the subset is reduced enough.
  • or you accelerate the matching by looking, for each record, only at the records corresponding to a potential subgroup value (you can imagine this as a kind of tagging with the head of each field's subgroup, at the expense of some space).

The second approach would result in:

  selected values           field group tag
-------------------        ------------------
George    NEW-YORK    ->   George NEW-YORK
Georges   NEW IORK    ->   George NEW-YORK
George    OKLAHOMI    ->   George OKLAHOMA

You then get your groups by selecting with a GROUP BY on the tags, considering of course only the groups HAVING more than one record.
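Expressed in LINQ rather than SQL (a sketch only: FieldTag is a hypothetical function mapping each value to the head of its promising subgroup, and Name, City and GroupId stand in for the real fields):

var dupeGroups = selectedRecords
    .GroupBy(r => (Name: FieldTag(r.Name), City: FieldTag(r.City)))
    .Where(g => g.Count() > 1);            // the HAVING COUNT(*) > 1 part

int groupId = 0;
foreach (var g in dupeGroups)
{
    groupId++;                             // one shared id per duplicate group
    foreach (var r in g)
    {
        r.IsDupe = true;
        r.GroupId = groupId;
    }
}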

Christophe
  • If you can come up with a good idea for this hypothetical function f for the named Jaro-Winkler metrics then this could become a good answer. If not, I am pretty sure this is no step towards a solution. – Doc Brown Jan 25 '18 at 20:19
  • @DocBrown oops, I didn't realize it was the distance between two lines. See edits. – Christophe Jan 25 '18 at 22:08