14

I have an inventory of products stored in Postgres. I need to be able to take a CSV file and get a list of changes—the things in the CSV file that are different to what is in the database. The CSV file has about 1.6 million rows.

The naive approach is to simply take each row, retrieve that product from the database using the key field, make the comparison, emit the changes (including updating the database), then move on to the next row. However, that many round trips cause the whole process to take a long time (upwards of two minutes). I've tried locally caching the inventory in an off-heap map (using MapDB), which improved the performance a lot, since I only needed to hit the database to write changed data, but I couldn't figure out a way to make that scale. There will be many inventories for different customers. Perhaps some kind of sharding approach would be needed, but then I have to deal with nodes going on- and offline. Maybe Akka Cluster could help here too.
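
To illustrate, the naive loop looks roughly like this; the `products` table, the `sku` key and the `CsvRow` record are simplified stand-ins for the real schema:

```java
import java.math.BigDecimal;
import java.sql.*;
import java.util.List;

// Illustrative only: CsvRow, emitChange() and the products/sku schema are stand-ins.
record CsvRow(String sku, BigDecimal price, int quantity) {}

class NaiveDiff {
    void diff(Connection conn, List<CsvRow> csvRows) throws SQLException {
        try (PreparedStatement select = conn.prepareStatement(
                     "SELECT price, quantity FROM products WHERE sku = ?");
             PreparedStatement update = conn.prepareStatement(
                     "UPDATE products SET price = ?, quantity = ? WHERE sku = ?")) {
            for (CsvRow row : csvRows) {                // ~1.6M iterations, one round trip each
                select.setString(1, row.sku());
                boolean unchanged;
                try (ResultSet rs = select.executeQuery()) {
                    unchanged = rs.next()
                            && rs.getBigDecimal("price").compareTo(row.price()) == 0
                            && rs.getInt("quantity") == row.quantity();
                }
                if (unchanged) continue;
                emitChange(row);                        // publish the difference
                update.setBigDecimal(1, row.price());
                update.setInt(2, row.quantity());
                update.setString(3, row.sku());
                update.executeUpdate();                 // a second round trip for changed rows
            }
        }
    }

    void emitChange(CsvRow row) { /* send the change to downstream consumers */ }
}
```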

Are there some good approaches that I'm overlooking?

Isvara
  • Could you create a temporary table and import the CSV content? -- https://stackoverflow.com/questions/2987433/how-to-import-csv-file-data-into-a-postgresql-table – Ben Cottrell Aug 31 '22 at 06:28
  • How big are each of your rows - or more directly, why can you not just load all ~3M items into memory at once? – Philip Kendall Aug 31 '22 at 06:44
  • If the database contains something which the CSV does not then should it be deleted? If so then the CSV is your master list; the table can possibly be cleared and repopulated. You can achieve this with some sort of temp table rename scheme. – MonkeyZeus Aug 31 '22 at 15:26
  • Can't timestamps filter out rows you don't care about -- i.e., records changed since the last comparison? – BurnsBA Aug 31 '22 at 18:29
  • Is your CSV file sorted by some unique value, such as the ID? If so, you could query all of the rows ordered the same way, then stream the query result and the CSV in parallel, keeping one row of each in memory at a time. – duckbrain Sep 01 '22 at 18:07
  • You could load the csv with the `file_fdw` module and run queries against it like any other table: https://www.postgresql.org/docs/current/file-fdw.html – GammaGames Sep 02 '22 at 14:30
  • @MonkeyZeus In the case of a manual upload like this, the CSV file would be the master list, but it's still worth diffing, because lots of rows will be unchanged, and the changes are being sent out to third parties, both to smaller companies that are likely to have fewer resources and to services like AdWords. – Isvara Sep 03 '22 at 00:13
  • So can you download the Postgres table as a CSV and diff the CSVs? – MonkeyZeus Sep 06 '22 at 12:44
  • @MonkeyZeus Even if I could (it's multiple tables), downloading and parsing a CSV file would just give me the same data as fetching the data directly from the table. – Isvara Sep 07 '22 at 16:33
  • @Isvara I see. I just figured that total time to diff might be lower if you get the table as a CSV. – MonkeyZeus Sep 07 '22 at 19:37

5 Answers

39

Since your goal is to produce a list of changes, not to change the stored records, the way to go is to simply export the database to text and wrangle it once to get it into exactly the same CSV format. Provided you have a functional sorting criterion, the standard diff tool will have no problem processing millions of lines.
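
A rough sketch of how that could be driven from Java, assuming the pgjdbc driver, an illustrative `products(sku, price, quantity)` table and made-up file names; the incoming CSV is assumed to have been normalised to the same column order and sort:

```java
import java.nio.file.*;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// Sketch only: connection URL, column list, sort key and file names are assumptions.
class ExportAndDiff {
    public static void main(String[] args) throws Exception {
        Path dbCsv = Path.of("products-db.csv");
        try (var conn = DriverManager.getConnection("jdbc:postgresql:inventory");
             var out = Files.newBufferedWriter(dbCsv)) {
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            // Export in exactly the column order and sort order the incoming CSV uses.
            copy.copyOut("COPY (SELECT sku, price, quantity FROM products ORDER BY sku) "
                       + "TO STDOUT WITH (FORMAT csv)", out);
        }
        // diff exits with 1 when differences are found; its output is the change list.
        Process diff = new ProcessBuilder("diff", dbCsv.toString(), "incoming-sorted.csv")
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .start();
        diff.waitFor();
    }
}
```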

Kilian Foth
  • I'm not going to shell out to a diff tool. I don't even know of a tool that knows how to diff CSV files—columns can be in any order, quoting is optional in some cases, etc. And yes, the stored records will be updated with the changes. – Isvara Sep 07 '22 at 16:38
  • @isvara Just save the 2 files with the same column order and the same sort, and use one of the many free diff tools out there. e.g. git has one you can use easily. For formatting issues, you can achieve parity by importing then exporting both through the same third party csv tool, e.g. Google sheets. I did this exact thing last week with excellent results. – ogoldberg Sep 07 '22 at 18:45
  • @ogoldberg That's too complicated a solution: read the data from the database, format it into a CSV file that has the same format as the incoming one (CSV isn't a well standardized format), save the two CSV files to local disk, spawn another process to run a diff tool, parse the output of the diff tool to see which data changed, publish the changes and write them back to the database, delete the two CSV files. It might be fine for a one-off manual process, but that's not how to write a program. – Isvara Sep 07 '22 at 19:35
22

Since the round trips seem to be the issue, you could:

  • either opt for a local solution, with the scaling issue you mentioned. (You could still try to split the task across several local nodes, each responsible for a subrange of the index space.)
  • or opt for the db solution: bulk upload your CSV into a temporary table and let the db server work very efficiently on (indexed) tables (a sketch follows this list). The benefit of this approach is that you'd reach the scalability of the db itself. You could fine-tune the approach for any distribution scheme that would already be in place, if it's already a distributed database.
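
A rough sketch of the second option with the pgjdbc `CopyManager`; the `staging` and `products` tables, their columns and the CSV layout are assumptions:

```java
import java.nio.file.*;
import java.sql.*;
import org.postgresql.PGConnection;

// Sketch only: table names, columns and CSV layout are assumptions.
class TempTableDiff {
    void diff(Connection conn, Path csv) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TEMP TABLE staging (sku text PRIMARY KEY, price numeric, quantity int)");
        }
        // Bulk load the ~1.6M rows in a single COPY instead of millions of round trips.
        try (var reader = Files.newBufferedReader(csv)) {
            conn.unwrap(PGConnection.class).getCopyAPI()
                .copyIn("COPY staging FROM STDIN WITH (FORMAT csv, HEADER true)", reader);
        }
        // Let the server compute the diff: new rows plus rows whose values changed.
        String diffSql =
            "SELECT s.sku, s.price, s.quantity, p.sku IS NULL AS is_new "
          + "FROM staging s LEFT JOIN products p ON p.sku = s.sku "
          + "WHERE p.sku IS NULL "
          + "   OR (s.price, s.quantity) IS DISTINCT FROM (p.price, p.quantity)";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(diffSql)) {
            while (rs.next()) {
                // emit the change and/or apply the update here
            }
        }
    }
}
```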

Some more thoughts:

  • if you have many columns/fields to compare, you may consider adding a hash code to each row, in the CSV as well as in the db (updated at every row change). The hash code would be calculated from the fields that are relevant for the comparison. Finding the diff is then reduced to finding the new rows and the existing rows whose hash differs (a sketch follows after this list).
  • ultimately, it would be more efficient to handle the problem at the source, i.e. intercepting the events that cause the CSV to change, or using some kind of last-change timestamp. But of course, this is not always possible.
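
A minimal sketch of such a digest on the Java side; the field list, separator and choice of MD5 are assumptions, and the formatting must match byte-for-byte whatever expression computes the hash in the database (e.g. an md5() over the same fields and separators), otherwise identical rows will look different:

```java
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch only: the fields, the separator and the use of MD5 are all assumptions.
class RowHash {
    static String digest(String sku, BigDecimal price, int quantity) throws Exception {
        // The canonical form must be identical to the one the database builds.
        String canonical = sku + "|" + price.toPlainString() + "|" + quantity;
        byte[] md5 = MessageDigest.getInstance("MD5")
                .digest(canonical.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(md5);   // store/compare this hex string
    }
}
```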
Christophe
  • Unfortunately it's not possible at this point to get those events—they happen in someone else's system. Using a combination of answers, pulling all the digests from the database and comparing those locally might be the way to go. – Isvara Aug 31 '22 at 10:43
  • I'd personally let the DB compute the hashes during the loading of the CSV into a temporary table. Seems faster and makes the process simpler, since in that way it remains an internal detail inside the DB and doesn't require any kind of external change. – GACy20 Aug 31 '22 at 13:25
  • @GACy20 I agree for the temporary db table scenario, since it also allows sharing the same code for calculating the hash. In my answer I was more neutral, since this approach can also be used to accelerate the local scenario. Thanks for making it more explicit. – Christophe Sep 01 '22 at 06:14
9

There are already some good answers, but here's another possibility: if you could sort the CSV file by key and add a similar ORDER BY to the SQL query, you could compare the rows obtained from the query to the rows obtained from the CSV file without loading any of them fully into memory. (If key matches, compare the row. If it doesn't match, you can deduce if it's a removal or insertion from the ordering. Then, move the cursor.)

(Not sure what language or library you want to use, but plain JDBC allows you to fetch the next rows without loading the whole result into memory.)
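
A rough sketch of that merge, reusing an illustrative `CsvRow(sku, price, quantity)` record like the one sketched in the question; it assumes the CSV is sorted with the same collation the database uses for its `ORDER BY` (easiest to guarantee with a byte-wise/`C` collation on both sides):

```java
import java.sql.*;
import java.util.Iterator;

// Sketch only: CsvRow, the emit* callbacks and the products schema are stand-ins.
class SortedMergeDiff {
    void diff(Connection conn, Iterator<CsvRow> csv) throws SQLException {
        conn.setAutoCommit(false);               // pgjdbc only streams when autocommit is off
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT sku, price, quantity FROM products ORDER BY sku")) {
            ps.setFetchSize(10_000);             // fetch in chunks instead of materialising everything
            try (ResultSet rs = ps.executeQuery()) {
                boolean dbHasRow = rs.next();
                CsvRow row = csv.hasNext() ? csv.next() : null;
                while (dbHasRow || row != null) {
                    int cmp = !dbHasRow ? 1
                            : row == null ? -1
                            : rs.getString("sku").compareTo(row.sku());
                    if (cmp < 0) {               // key only in the database: a removal
                        emitRemoval(rs.getString("sku"));
                        dbHasRow = rs.next();
                    } else if (cmp > 0) {        // key only in the CSV: an insertion
                        emitInsertion(row);
                        row = csv.hasNext() ? csv.next() : null;
                    } else {                     // same key: compare the values
                        if (rs.getBigDecimal("price").compareTo(row.price()) != 0
                                || rs.getInt("quantity") != row.quantity()) {
                            emitUpdate(row);
                        }
                        dbHasRow = rs.next();
                        row = csv.hasNext() ? csv.next() : null;
                    }
                }
            }
        }
    }

    void emitRemoval(String sku) { /* ... */ }
    void emitInsertion(CsvRow row) { /* ... */ }
    void emitUpdate(CsvRow row) { /* ... */ }
}
```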

3

For me, the simplest and most effective way would be to load the file into the database and write an SQL script to compare whatever you want.
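
For example, once the CSV has been loaded into a staging table (as in the temp-table sketch above), a single `EXCEPT` query returns every row that is new or changed; the table and column names are illustrative:

```java
import java.sql.*;

// Sketch only: assumes the CSV is already loaded into a staging table with the
// same columns as products (see the temp-table sketch above).
class SqlDiff {
    void printChangedOrNew(Connection conn) throws SQLException {
        String sql = "SELECT sku, price, quantity FROM staging "
                   + "EXCEPT "
                   + "SELECT sku, price, quantity FROM products";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // every row here is either new or differs from the stored version
                System.out.println(rs.getString("sku"));
            }
        }
    }
}
```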

Daljeet
2

To make this scale both with the size of the CSV and the number of products in the database, you could:

  1. Read e.g. 10 000 rows from the CSV into memory and collect their keys.
  2. Fetch the 10 000 corresponding database rows. Depending on the database, it can make sense to create a temporary table with the keys and perform a JOIN against it to fetch the rows. If the CSV file is sorted, you may be able to select keys between the minimum and maximum values in the batch.
  3. Collect the required updates into an array in RAM and execute a single query to write them to the database. In Postgres that's typically INSERT INTO table VALUES (row1),(row2),...,(rowN) ON CONFLICT (key) DO UPDATE SET x = EXCLUDED.x (MySQL's equivalent is ON DUPLICATE KEY UPDATE).

This will reduce the time spent waiting for the database to respond, while still limiting the length of individual queries and the amount of RAM required.
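
A rough sketch of this batched flow against Postgres, reusing an illustrative `CsvRow(sku, price, quantity)` record like the one sketched in the question; the batch size, array type and upsert columns are arbitrary choices:

```java
import java.sql.*;
import java.util.*;

// Sketch only: CsvRow, the products schema and the batch size are stand-ins.
class BatchedDiff {
    static final int BATCH = 10_000;

    void diff(Connection conn, Iterator<CsvRow> csv) throws SQLException {
        while (csv.hasNext()) {
            // 1. Read up to 10 000 CSV rows and collect their keys.
            Map<String, CsvRow> batch = new HashMap<>();
            while (csv.hasNext() && batch.size() < BATCH) {
                CsvRow row = csv.next();
                batch.put(row.sku(), row);
            }

            // 2. Fetch the corresponding database rows in one query.
            Map<String, CsvRow> stored = new HashMap<>();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT sku, price, quantity FROM products WHERE sku = ANY(?)")) {
                ps.setArray(1, conn.createArrayOf("text", batch.keySet().toArray()));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        stored.put(rs.getString("sku"), new CsvRow(rs.getString("sku"),
                                rs.getBigDecimal("price"), rs.getInt("quantity")));
                    }
                }
            }

            // 3. Collect the changed/new rows and write them back in one batched upsert.
            try (PreparedStatement upsert = conn.prepareStatement(
                    "INSERT INTO products (sku, price, quantity) VALUES (?, ?, ?) "
                  + "ON CONFLICT (sku) DO UPDATE SET price = EXCLUDED.price, "
                  + "quantity = EXCLUDED.quantity")) {
                for (CsvRow row : batch.values()) {
                    CsvRow old = stored.get(row.sku());
                    boolean unchanged = old != null
                            && old.price().compareTo(row.price()) == 0
                            && old.quantity() == row.quantity();
                    if (unchanged) continue;
                    // emit the change to downstream consumers here, then queue the write
                    upsert.setString(1, row.sku());
                    upsert.setBigDecimal(2, row.price());
                    upsert.setInt(3, row.quantity());
                    upsert.addBatch();
                }
                upsert.executeBatch();
            }
        }
    }
}
```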

jpa