
OVERVIEW

We have multiple data providers who send us information from a large set of data sources. These end sources provide their data in CSV format.

Since the CSV formats are not consistent across sources, we currently have a manual process in place: a human goes through each CSV file we receive, updates the column names to the names understood by our system, and then the CSVs are ingested.

For Example:

Let us assume that an Address object exists in our system and is represented by Line1, Line2, City, State, Country, PostalCode columns.

We would then get CSVs having columns like these:

  1. Address1, Address2, City, County, Country, Zip
  2. Line1, Line2, Line3, City, Country, Zipcode
  3. Address, City, State, Country, Postcode

Which we would map to our representation manually as follows:

  1. Address1-->Line1, Address2-->Line2, City-->City, NULL-->State, Country-->Country, Zip-->PostalCode
  2. Line1-->Line1, Line2+Line3-->Line2, City-->City, NULL-->State, Country-->Country, Zipcode-->PostalCode
  3. Address-->Line1, NULL-->Line2, City-->City, State-->State, Country-->Country, Postcode-->PostalCode

REQUIREMENT

We need a solution which can replace the human intervention entirely, and smartly identify and map the incoming CSV to the format understood by our system. Please suggest a solution or point me in the direction I should take.

Rishabh

4 Answers


You practically have two options:

  • have a tool that takes in a record definition for a file plus a CSV file with data. Basically, for each data provider you define how their records map to your records, and the tool performs the mapping when processing a CSV file. Many years ago I worked with a library called flatworm that allowed you to parse files given some record types. You might find newer/better tools by doing a search.

  • convince the data providers to give you data in the format you expect, not in the format they use. The advantage of this method is that you only have to deal with one record format, and you have less work implementing the solution because the implementation effort is spread across the providers.

Try to go for the second option first. If some providers cannot implement it on their side (for various reasons, such as not having developers, even though they could all at least add the same column headings to help you identify the data, as Ewan mentioned in the comment above), then you can fall back on option one for them.
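For the providers that end up on option one, a per-provider mapping could be as simple as the sketch below. All names here are illustrative, and a real tool (flatworm or a newer equivalent) would add record types and parsing on top of this:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal sketch of option one: each provider gets a hand-maintained
// mapping from their column names to our canonical ones. Columns we
// don't recognise come through as null (the "NULL-->State" case).
public class ColumnMapper {

    // e.g. {"Address1" -> "Line1", "Zip" -> "PostalCode", ...}
    private final Map<String, String> providerToCanonical;

    public ColumnMapper(Map<String, String> providerToCanonical) {
        this.providerToCanonical = providerToCanonical;
    }

    // Remap one CSV header row, preserving column positions.
    public List<String> remapHeader(List<String> header) {
        return header.stream()
                .map(col -> providerToCanonical.getOrDefault(col.trim(), null))
                .collect(Collectors.toList()); // Collectors.toList() permits nulls
    }

    public static void main(String[] args) {
        ColumnMapper mapper = new ColumnMapper(Map.of(
                "Address1", "Line1", "Address2", "Line2",
                "City", "City", "Country", "Country", "Zip", "PostalCode"));
        System.out.println(mapper.remapHeader(
                List.of("Address1", "Address2", "City", "County", "Country", "Zip")));
        // "County" has no mapping, so it comes through as null
    }
}
```

Once a provider's map is written, it is saved and reused for every file from that provider; only new providers (or format changes) need a human again.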

Bogdan
  • Your second point is pretty valid and we wanted that control. Alas, we are not in a position to demand. You and @DocBrown seem to be suggesting the same/similar points. – Rishabh Sep 28 '20 at 12:25

There is nothing wrong with letting a human define the column mapping for each new data provider, as long as the number of different providers stays in the low hundreds. Give a human roughly five minutes per provider to figure out the correct mapping, and defining the mappings for 100 providers takes approximately one day of work. It should be obvious that once a mapping has been determined, it should be saved and reused for that specific provider.

This will usually be far more effective than investing several weeks of work in trying to develop some clever heuristic (see the canonical XKCD).

Of course, the situation changes when data providers occasionally change their data format without actively informing you, so a human needs to check each and every CSV file again. Still, you can start each import with the previously defined mapping and run some automated sanity checks on whether the result looks like a valid address list. Such a sanity check can be implemented mostly independently of the specific provider, so it is probably worth the time.
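Such a sanity check might be sketched as follows, assuming rows have already been remapped to the canonical column order. The 90% threshold and the US-only postcode rule are illustrative assumptions, not prescriptions:

```java
import java.util.List;
import java.util.regex.Pattern;

// Provider-independent sanity check: after applying a saved mapping,
// verify the result still looks like an address list. If it fails, the
// provider has probably changed their format and a human should re-check.
public class AddressSanityCheck {

    // Assumption for illustration: US-style ZIP codes only.
    private static final Pattern US_ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    // rows are already remapped to [Line1, Line2, City, State, Country, PostalCode]
    public static boolean looksValid(List<String[]> rows) {
        long plausible = rows.stream().filter(r ->
                r.length == 6
                && r[0] != null && !r[0].isBlank()            // Line1 present
                && r[5] != null
                && US_ZIP.matcher(r[5].trim()).matches())     // postcode parses
            .count();
        // tolerate a few bad rows, but flag wholesale format changes
        return plausible >= rows.size() * 0.9;
    }
}
```

A failed check stops the import and routes the file back to a human, rather than silently ingesting shifted columns.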

Doc Brown
  • Your suggestion is reasonable and makes sense. It made me realise I need to put effort into making it easier for the human to map the columns, so that if things do break when a data provider changes their format, it is easy and quick to correct the mapping. – Rishabh Sep 28 '20 at 12:16
0

You can mash the address together by joining all the address columns with line breaks, then use a third-party address parser to convert the result to a standard format.

In fact, you might only need to run the address parser on the first few addresses in each file before you can determine the column headings from the parse results.

Concatenated fields will still give you problems, however. Address parsing is generally hard, especially for international addresses. For many use cases you might be better off with just a multiline address field.
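A sketch of the mash-and-reparse step. AddressParser here is a hypothetical stand-in for whatever third-party parser you pick (e.g. a libpostal binding); its interface is assumed, not a real API:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Parsed result as our canonical representation.
record ParsedAddress(String line1, String line2, String city,
                     String state, String country, String postalCode) { }

// Hypothetical stand-in for a third-party address parser; the real
// library you choose will have its own interface.
interface AddressParser {
    ParsedAddress parse(String freeformAddress);
}

class AddressMasher {
    // Join every non-blank column of one CSV row with line breaks, then
    // let the parser figure out which part is which.
    static String mash(String[] csvRow) {
        return Arrays.stream(csvRow)
                .filter(s -> s != null && !s.isBlank())
                .collect(Collectors.joining("\n"));
    }
}
```

Per Ewan's point, you would run `parser.parse(AddressMasher.mash(row))` on only the first few rows, infer the column headings from the results, and then fall back to plain column mapping for the rest of the file.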

Ewan
  • Thanks for your suggestion, but I am worried I may be bringing complications into the solution. – Rishabh Sep 28 '20 at 12:17
0

As with most automation, it can be helpful to settle for some large percentage of automation and then work toward 100%, rather than requiring 100% from the first solution.

Since this is inherently an open-ended problem that could come down to human judgement for any particular input line, I wouldn't expect to have 100% automation short of requiring the data to be submitted in a uniform format, or having an acceptable threshold of misclassifications.


Assuming you can't impose a format (based on your other comments), you might be able to use a decision tree here to handle at least the most-common cases.

For example:

  1. Parse each line into some "unclassified" representation; probably just a random-access container where you can choose a field by its column number.

  2. Pass each instance through a decision tree to classify it:

    1. If you can at least reliably determine the country, that should tell you what format the postcodes should be in.
    2. Given the country, you can determine which column has the postcode, based on parsability.
    3. With those two pieces of information, hopefully you can determine which format is being used.
    4. If nothing else, flag the line for human classification.

The nice part about this sort of approach is that you can add/change rules to automate more and more of the work, incrementally reducing the human intervention required. In fact, you could even mix hard-coded logic with ML in separate nodes of the decision tree.

If there are ambiguities, you can conditionally execute a branch to see which one works out. If at any point in the decision you don't have any more rules that could make the determination at least as well as a human, then you can always have a human handle that particular case.

The difficult part will be testing it. Ideally, you could use previously-classified data files and their human classifications as test data, but it's often undesirable (maybe even illegal) to use real data in your CI tests. To start with, you might just have humans verify 100% of the automated determinations, then cut back that % as it becomes more reliable.


As far as implementation goes, you can probably start with some abstract interface that contains a single method to classify an address.

Using Java as an example:

// Marker for a classification result; concrete Classification types
// carry the decided format.
interface Classification { }

// Thrown when a branch of the tree cannot make a determination.
class ClassificationException extends Exception { }

interface Classifier {
  Classification classify(String[] cols) throws ClassificationException;
}

Then each Classifier implementation would be the root of a decision tree conditioned on the decision that led to that root. If a Classifier can only make a partial determination, it can delegate to one or more child Classifiers; thus the tree structure. (Note that you really only need the interface if you want nodes to be interchangeable. Otherwise, each node could have its own input requirements.)

If a branch fails, it can throw a ClassificationException that can be caught by a parent Classifier that knows how to handle it, e.g., by trying a different branch.
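As a concrete (hypothetical) sketch of that delegation, here is a node that recognises UK addresses by postcode shape and otherwise hands off to a fallback branch. The interfaces are repeated so the snippet is self-contained, and every class name here is illustrative:

```java
import java.util.regex.Pattern;

interface Classification { }

class ClassificationException extends Exception { }

interface Classifier {
    Classification classify(String[] cols) throws ClassificationException;
}

// Result type for addresses identified as UK-formatted.
class UkAddress implements Classification { }

// A branch that decides "this is a UK address" from the postcode shape,
// and otherwise delegates to a fallback node (ultimately a human).
class UkPostcodeClassifier implements Classifier {

    // Simplified UK postcode shape, for illustration only.
    private static final Pattern UK_POSTCODE =
            Pattern.compile("[A-Z]{1,2}\\d[A-Z\\d]? ?\\d[A-Z]{2}");

    private final Classifier fallback; // tried when this branch fails

    UkPostcodeClassifier(Classifier fallback) {
        this.fallback = fallback;
    }

    @Override
    public Classification classify(String[] cols) throws ClassificationException {
        for (String col : cols) {
            if (col != null && UK_POSTCODE.matcher(col.trim()).matches()) {
                return new UkAddress(); // country determined; a child node could refine further
            }
        }
        return fallback.classify(cols); // delegate; the exception bubbles up to the parent
    }
}
```

The terminal fallback can simply throw ClassificationException, which is the "flag the line for human classification" case from the steps above.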

Although the example is in Java, you can do something similar in languages that don't have exceptions (perhaps by returning null, None, etc.), and the tree structure will also work for purely functional languages.


There are plenty of other approaches out there (such as neural nets and Bayes classifiers), but this one should be quite easy to prototype and iterate on in nearly any language. You could even start by just having it prepend a column with the format guess for each line.