As with most automation, it can be helpful to settle for some large percentage of automation and then work toward 100%, rather than requiring 100% from the first solution.
Since this is inherently an open-ended problem that could come down to human judgement for any particular input line, I wouldn't expect to have 100% automation short of requiring the data to be submitted in a uniform format, or having an acceptable threshold of misclassifications.
Assuming you can't impose a format (based on your other comments), you might be able to use a decision tree here to handle at least the most-common cases.
For example:
Parse each line into some "unclassified" representation; probably just a random-access container where you can choose a field by its column number.
Pass each instance through a decision tree to classify it:
- If you can at least reliably determine the country, that should tell you what format the postcodes should be in.
- Given the country, you can determine which column has the postcode, based on parsability.
- With those two pieces of information, hopefully you can determine which format is being used.
- If nothing else, flag the line for human classification.
The nice part about this sort of approach is that you can add/change rules to automate more and more of the work, incrementally reducing the human intervention required. In fact, you could even mix hard-coded logic with ML in separate nodes of the decision tree.
If there are ambiguities, you can conditionally execute a branch to see which one works out. If at any point in the decision you don't have any more rules that could make the determination at least as well as a human, then you can always have a human handle that particular case.
The difficult part will be testing it. Ideally, you could use previously-classified data files and their human classifications as test data, but it's often undesirable (maybe even illegal) to use real data in your CI tests. To start with, you might just have humans verify 100% of the automated determinations, then cut back that % as it becomes more reliable.
As far as implementation, You can probably start with some abstract interface that contains a single method to classify an address.
Using Java as an example:
interface Classification { }
interface Classifier {
Classification classify(String[] cols) throws ClassificationException;
}
Then each Classifier
implementation would be the root of a decision tree conditioned on the decision that led to that root. If a Classifier
can only make a partial determination, it can delegate to one or more child Classifier
s; thus, the tree structure. (Note that you really only need the interface if you want nodes to be interchangeable. Otherwise, each node could have it's own input requirements.)
If a branch fails, it can throw
a ClassificationException
that can be caught by a parent Classifier
that knows how to handle it, e.g., by trying a different branch.
Although the example is in Java, you can do something similar in languages that don't have exceptions (perhaps by returning null
, None
, etc.), and the tree structure will also work for purely functional languages.
There are plenty of other approaches out there (such as neural nets and Bayes classifiers), but this one should be quite easy to prototype and iterate on in nearly any language. You could even start by just having it prepend a column with the format guess for each line.