Problem Description
I am working on an enterprise data discovery project that scans databases for sensitive information. The basic search unit is called a classifier and covers categories such as 'Social Security Number', 'LastName', 'Driver's License', and 'Credit Card Number'.
Currently, each classifier is an independent item with its own regex pattern, so a search for 'Driver's License' will return any value that matches the pattern ^[A-Z]\d{3}-\d{4}-\d{4}$.
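As a rough sketch of how this independent-classifier matching works today (the classifier set and the scan_value function are hypothetical, but the regex is the one above):

```python
import re

# Each classifier is a standalone regex; a match depends on nothing but the pattern itself.
CLASSIFIERS = {
    "Driver's License": re.compile(r"^[A-Z]\d{3}-\d{4}-\d{4}$"),
    "Zip Code": re.compile(r"^\d{5}$"),
}

def scan_value(value):
    """Return the names of every classifier whose pattern matches this value."""
    return [name for name, pattern in CLASSIFIERS.items() if pattern.match(value)]

print(scan_value("A123-4567-8901"))  # ["Driver's License"]
```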
We want to reduce false positives from this approach by leveraging data from related classifiers. For instance, if both my last name and driver's license number appear in the same record, I should be able to verify that the first letter of my last name matches the first character of the driver's license number instead of only relying on the regex pattern.
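To make that concrete, here is a minimal sketch of the kind of cross-classifier check I have in mind, assuming the record is already split into labeled fields (the field names and the rule itself are only illustrative):

```python
def license_consistent_with_last_name(record):
    """True/False when both fields are present and do/don't agree on the first letter;
    None when either related classifier is absent (no evidence either way)."""
    last_name = record.get("LastName")
    license_no = record.get("Driver's License")
    if not last_name or not license_no:
        return None
    return license_no[0].upper() == last_name[0].upper()

print(license_consistent_with_last_name(
    {"LastName": "Smith", "Driver's License": "S123-4567-8901"}))  # True
```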
Here's another example: suppose I am searching for 'Zip Codes' on a database and a scan marks the following records as matches:
FirstName LastName 123 Fake Street City, Illinois 61234
012345678 01234567 012345678 987654321 1234556 61234
I'd like to assign a higher confidence rating to the first match because it sits near matches for related classifiers such as 'Street Address' and 'State', while the second is probably a false positive from a stream of unrelated digits.
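Purely as an illustration of the scoring I am after: count how many related classifiers also match somewhere in the same record and raise the base score accordingly (the weights, patterns, and related-classifier list below are made up):

```python
import re

RELATED = {"Zip Code": ["Street Address", "State"]}

PATTERNS = {
    "Zip Code": re.compile(r"\b\d{5}\b"),
    "State": re.compile(r"\b(Illinois|California|Texas)\b"),
    "Street Address": re.compile(r"\b\d+\s+\w+\s+(Street|St|Avenue|Ave)\b"),
}

def confidence(classifier, record_text, base=0.5, boost=0.2):
    """Base score for a bare regex hit, plus a boost for every related
    classifier that also matches in the same record."""
    if not PATTERNS[classifier].search(record_text):
        return 0.0
    hits = sum(1 for rel in RELATED.get(classifier, [])
               if PATTERNS[rel].search(record_text))
    return min(1.0, base + boost * hits)

print(confidence("Zip Code", "FirstName LastName 123 Fake Street City, Illinois 61234"))  # 0.9
print(confidence("Zip Code", "012345678 01234567 012345678 987654321 1234556 61234"))     # 0.5
```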
What problem domain does this fall under, and what existing algorithms apply? I've looked through articles on record linkage, semantic matching, information extraction, and related topics, but I can't find research on the exact idea I'm describing.