Identifying similar customer data

Question

I have a large database of customers. There's a need to identify customers who have two or more customer IDs.

This comes mostly from people at the front-desk creating new customers instead of searching for their existing IDs or doing a bad data capture, so we end up having this sort of thing:

ID         name        surname        birthday
------------------------------------------------
14         Juan        Vazquez        1955-01-20
78         Juan        Vasquez        1980-06-12
85         Juan        Vazquez        1980-06-12

IDs 78 and 85 almost surely belong to the same person (notice same birthday but a slight difference in surname)

ID 14 definitely is another person due to a big difference in birthdays related to the similar records

Therefore we can assign equivalencies:

MainID      EquivalentID
------------------------
78          85

What solution can you recommend to identify similar records based on rules?

You can never be sure if these are really the same person. You cannot group these just because they have similar names. For the long term I would recommend adjusting the UI for the front desk users so that re-using customer sets is encouraged. This can be done by having the user type the name into a search box and only then displaying a "create $yourquery entry"-Button. Just like StackExchange does with their question title-box. — BlueWizard, Jul 20 '17 at 06:24
I agree you can never be 100% sure it's the same person. However my question is about a method to identify **similar** records based on rules, like spelling , same birthday, same email address or phone, etc. — azathoth, Jul 20 '17 at 15:32
Your front end would also benefit from a "Did you mean..." result set. — Jan Doggen, Jul 30 '17 at 10:36

score 1 · Answer 1 · answered Jul 21 '17 at 13:59

The approach I've seen used for this is Bayesian inference. There are plenty of technical details on the web to find for how to do this but the basic idea is that you start with some estimate of the likelihood of two random records being the same person. Then you look at each attribute you want to use and see if they match. Each one that matches increases the chance that this is the same person. You decide at what threshold you decide they are the same person. Often there are people involved to review/confirm these.

As Paparazzi mentions, the Levenshtein distance is a useful tool. Soundex and other similar algorithms are often also used but I would caution that the names in your system do not all come from the same language, such 'sounds like' algorithms are of limited use.

But any test you can think of that would make you more confident that these are the same person can be used in this approach.

paparazzo · Answer 2 · 2017-07-21T15:18:41.040

0

For name you can use Levenshtein distance. It measures the difference between words.

edited Jul 21 '17 at 15:18

answered Jul 21 '17 at 13:37

paparazzo

1,937
1
14
23

Identifying similar customer data

2 Answers2