I am constantly struggling to solve data deduplication problems efficiently (storing data from arbitrary sources in RDBMSes). My main concern is speeding up inserts / batch processing.
Scenario: I read data from different sources, mostly in JSON format (the input is NoSQL-style documents), and I need to process it so that each distinct¹ row is stored only once in the database. The number of rows to insert varies, but it is around 1-10 million per run. Reading is not the bottleneck: I can read as fast as I want, on as many threads as I want.
¹ Distinct usually means that a certain combination of fields in the input document must be unique. For example, if I have to save Products with these fields:
{
"manufacturer": "Xy manufacturer",
"manufacturerReference": "Manufacturer's product id, barcode = identifier",
"name": "..",
"price": ...
}
then each combination of manufacturer and manufacturerReference should be represented only once in the dataset after processing.
My first idea was quickly dropped: simply create unique indexes in the database and handle the duplicate key exceptions. Most of the time this just doesn't work. Databases have length limits on unique index keys (900 bytes in SQL Server, for example), and varchar(255+) columns quickly exceed them. Filling up the logs with exceptions also feels like a sloppy solution.
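For concreteness, this is roughly what that first attempt looks like (a sketch in SQL Server syntax; the table layout and column sizes are made up to match the Products example above):

CREATE TABLE Products (
    id                    bigint IDENTITY PRIMARY KEY,
    manufacturer          nvarchar(255) NOT NULL,
    manufacturerReference nvarchar(255) NOT NULL,
    name                  nvarchar(500) NULL,
    price                 decimal(18, 2) NULL
);

-- the two nvarchar(255) key columns can take up to 1020 bytes together,
-- over the 900-byte key limit, so inserts with long values will fail
CREATE UNIQUE INDEX UX_Products_Manufacturer_Reference
    ON Products (manufacturer, manufacturerReference);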
So, the problem arises when I want to save the data. The naive approach would be to:
- Try to find out whether the insertable is a duplicate with a query like
select 1 from X where manufacturer = $1 and manufacturerReference = $2
- If no existing row was found, do the insert; otherwise return.
This solution can be really slow, so we need indexes on all of the supposed-to-be-unique fields, which can increase the datastore's size dramatically with lots of rows (and defining indexes on big varchar columns doesn't seem appropriate either).
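Roughly, each worker thread runs something like this per incoming document (a sketch in generic SQL; X and the $n placeholders are the same as in the query above, the remaining columns come from the Products example):

begin;

-- step 1: existence check
select 1 from X where manufacturer = $1 and manufacturerReference = $2;

-- step 2: executed by the application only when step 1 returned no row
insert into X (manufacturer, manufacturerReference, name, price)
values ($1, $2, $3, $4);

commit;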
This evolved into another solution, where I use an appropriate hashing algorithm (usually md5) to merge the relevant fields into a single field (fingerprint = md5(manufacturer + manufacturerReference)), which can be stored in a small column (binary(16) for md5). I can then define a unique index on the fingerprint. Inserting rows still works the same way: either I keep swallowing the database errors, or I do an existence check beforehand.
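The schema change is small (again a sketch in SQL Server syntax, continuing the hypothetical Products table from the sketch above; I compute the md5 in the application before inserting, though it could also be computed in the database):

-- 16-byte column holding md5(manufacturer + manufacturerReference)
ALTER TABLE Products ADD fingerprint binary(16) NULL;  -- NOT NULL in practice, once backfilled

-- the unique index now covers a fixed 16-byte key instead of two long varchar columns
CREATE UNIQUE INDEX UX_Products_Fingerprint ON Products (fingerprint);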
Which leads me to another problem: database locking and deadlocks. Doing the selects and inserts in the same transaction from many threads simultaneously is slow and error-prone (as I have experienced). Processing millions of rows should not take hours, which it currently does.
How is this problem usually solved (what is the best practice)? If I had to store the data in a NoSQL store, the problem would be easy (a new insert for the same fingerprint just creates a new version), but the targets are usually MySQL, SQL Server and PostgreSQL.