The problem
The main problem is your O(n^2) sequential algorithm. For 500,000 lines, that's 250,000,000,000 iterations. If it takes 24 hours to run, that means about 350 nanoseconds per iteration.
Immediate minor improvements
You compare every item of the list with every other, but twice: first you compare a in the outer loop with b in the inner loop, then later you'll compare b in the outer loop with a in the inner loop.
The Jaro-Winkler metric is symmetric, so it's not necessary to do the comparison twice. In the inner loop, it is sufficient to iterate through the remaining elements. This improvement is not extraordinary: it's still O(n^2), but it will at least cut your execution time in two.
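As a sketch (shown here in Python, with a hypothetical is_similar() standing in for your Jaro-Winkler threshold test), this simply means starting the inner loop at the element after the current one:

```python
# Sketch only: is_similar() is a placeholder for your Jaro-Winkler threshold test.
def find_similar_pairs(lines, is_similar):
    pairs = []
    for i in range(len(lines)):
        # inner loop starts at i + 1: each pair is compared only once
        for j in range(i + 1, len(lines)):
            if is_similar(lines[i], lines[j]):
                pairs.append((lines[i], lines[j]))
    return pairs
```

This performs n*(n-1)/2 comparisons instead of n^2, hence the factor of two.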
As a side remark, even if the improvement is insignificant compared to the main issue, depending on the language you could have some performance issues related to lists (see for example here for C# lists with foreach, or consider memory allocation/deallocation overhead for large lists in C++).
An in-memory solution?
If you only needed to find exact duplicates, a much faster way would be:
- Calculate some hash code when iterating through the lines, and populate a map that relates the hash code to a list of matching lines.
- After the first pass, you can then iterate through the map, ignore the entries whose list has a single element, and process the lists that have several elements (potential duplicate groups still to be confirmed, because an identical hash doesn't guarantee identical values).
The complexity of this algorithm would be O(2n), i.e. 1 million iterations in the best case, and O(n^2) in the hypothetical worst case (all the lines being duplicates of a single value). The real case depends on the number of duplicate groups and the number of elements in each of these groups, but I expect this to be an order of magnitude faster than your approach.
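Here is a minimal sketch of that idea in Python, assuming the lines are plain strings (the dict plays the role of the map described above):

```python
from collections import defaultdict

def group_exact_duplicates(lines):
    buckets = defaultdict(list)
    for line in lines:                      # first pass: hash code -> matching lines
        buckets[hash(line)].append(line)
    # keep only buckets with several elements; the values inside still need to be
    # confirmed equal, since an identical hash doesn't guarantee identical values
    return [group for group in buckets.values() if len(group) > 1]
```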
The fuzzy match could easily be implemented in the same way, if the matching could be expressed through a "normalization" function f() defined so that f(record1)==f(record2) means there is a match. This would work, for example, if the fuzzy match were based on some variant of soundex.
Unfortunately, the Jaro-Winkler distance doesn't meet this requirement, so every line must be compared with every other.
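For metrics that would admit such a normalization, the only change to the previous sketch is the key of the map; the normalization below is purely illustrative:

```python
from collections import defaultdict

def group_by_normalization(lines, f):
    # f is the hypothetical normalization: f(a) == f(b) means a and b match
    buckets = defaultdict(list)
    for line in lines:
        buckets[f(line)].append(line)
    return [group for group in buckets.values() if len(group) > 1]

# purely illustrative normalization, not a real fuzzy match:
# group_by_normalization(lines, lambda s: s.lower().strip())
```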
A database solution
Intuitively, I'd say that using a DBMS approach could also do the job, especially if your fuzzy match is a little more complex, and if you work with fields.
Populating a DBMS with half a million lines should in principle take far less than 24 hours if your server is appropriately dimensioned (single-pass bulk upload). A sorted SELECT or a GROUP BY clause would easily find the exact duplicates. The same is true for a fuzzy match that has a "normalization" function.
But for fuzzy matches that require explicit comparison, like the Jaro-Winkler, it will not help very much.
Divide et impera variant
If your metric is not applied to the line as a whole but to a set of fields, the DBMS could reduce the number of comparisons by working at field level. The idea is to avoid comparing all the records with each other and instead consider only smaller subsets, for which the combinatorial explosion remains within a reasonable range:
- For the relevant field, select the unique values. These often form a smaller set.
- Calculate the metric within this smaller set, to identify the potential groups
- Ignore pairs with insufficient proximity
In the following example, you'd compare only 3 values instead of 5:
George
Melanie
Georges
George
Melanie
With a threshold of 85%, this would result in:
George / Georges 97% (promising)
George / Melanie 54% (ignored)
Melanie / Georges 52% (ignored)
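A sketch of this step, assuming some similarity(a, b) function returning the Jaro-Winkler score is available (e.g. from a string-metric library):

```python
from itertools import combinations

def promising_pairs(values, similarity, threshold=0.85):
    unique_values = set(values)                # often much smaller than the record set
    return [(a, b) for a, b in combinations(unique_values, 2)
            if similarity(a, b) >= threshold]  # keep only sufficiently close pairs

# e.g. promising_pairs(["George", "Melanie", "Georges", "George", "Melanie"], jw)
# would return [("George", "Georges")], with jw being a Jaro-Winkler function
```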
If several fields are involved, you'd process each field individually to identify the promising matching subgroups. For example:
George NEW-YORK
Melanie WASHINGTON
Georges NEW IORK
George OKLAHOMI
Melanie OKLAHOMA
would add a second list of candidate groups (after removing the non-promising values):
NEW-YORK / NEW IORK
OKLAHOMI / OKLAHOMA
You'd then select the records having all the values of each relevant field in the promising groups: here { George, Georges } and { NEW-YORK, NEW IORK, OKLAHOMI, OKLAHOMA }. The only records returned would be:
George NEW-YORK
Georges NEW IORK
George OKLAHOMI
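As a sketch, this selection could look as follows, with field_candidates being a hypothetical structure holding, per relevant field, the set of values kept in the promising pairs:

```python
def select_candidate_records(records, field_candidates):
    # records: list of dicts, e.g. {"name": "George", "city": "NEW-YORK"}
    # field_candidates: for each relevant field, the set of values that
    # appear in some promising pair for that field
    return [r for r in records
            if all(r[field] in values
                   for field, values in field_candidates.items())]
```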
There are then two strategies:
- either you run your algorithm on the selected records, if the subset is small enough,
- or you accelerate the matching by comparing each record only with those that correspond to the same promising subgroup (you can imagine this as a kind of tagging with the head of each field's subgroup, at the expense of some space).
The second approach would result in:
selected values        field group tag
---------------------  ------------------
George   NEW-YORK  ->  George  NEW-YORK
Georges  NEW IORK  ->  George  NEW-YORK
George   OKLAHOMI  ->  George  OKLAHOMA
You then obtain your groups with a GROUP BY on the tags, of course keeping only the groups HAVING more than one record.
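An in-memory sketch of that second strategy (field_tags is again a hypothetical structure, mapping each promising value of a field to the head of its subgroup):

```python
from collections import defaultdict

def group_by_tags(records, field_tags):
    # field_tags: for each field, a map from a promising value to the "head"
    # of its subgroup, e.g. {"name": {"George": "George", "Georges": "George"}, ...}
    groups = defaultdict(list)
    for r in records:
        tag = tuple(field_tags[field].get(r[field], r[field])
                    for field in field_tags)
        groups[tag].append(r)
    # equivalent of GROUP BY tag HAVING COUNT(*) > 1
    return [group for group in groups.values() if len(group) > 1]
```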