I am constantly struggling to solve data deduplication problems efficiently (storing data from arbitrary sources in RDBMSes). My main concern is speeding up inserts / batch processing.
Scenario: I read data from different sources, mostly in JSON format (the input is NoSQL-style documents), and I need to process it so that each distinct¹ row is stored only once in the database. The number of rows to insert varies, but it is around 1-10 million per run. Reading is not the bottleneck: I can read as fast as I want, on as many threads as I want.
¹ Distinct usually means that a certain combination of fields in the input document must be unique. For example, if I have to save Products with these fields:
{
"manufacturer": "Xy manufacturer",
"manufacturerReference": "Manufacturer's product id, barcode = identifier",
"name": "..",
"price": ...
}
then each combination of manufacturer and manufacturerReference should be represented only once in the dataset after processing.
My first idea was quickly dropped: simply create unique indexes in the database and handle the duplicate key exceptions. Most of the time this just doesn't work. Databases have length limits on unique index keys (900 bytes in SQL Server, for example), and varchar(255+) columns quickly exceed them. Filling up the logs with exceptions also feels like a sloppy solution.
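For concreteness, this is roughly what that first attempt looks like (a sketch in SQL Server syntax; the table layout and column sizes are made up to match the Products example above):

CREATE TABLE Products (
    id                    bigint IDENTITY PRIMARY KEY,
    manufacturer          nvarchar(255) NOT NULL,
    manufacturerReference nvarchar(255) NOT NULL,
    name                  nvarchar(500) NULL,
    price                 decimal(18, 2) NULL
);

-- the two nvarchar(255) key columns can take up to 1020 bytes together,
-- over the 900-byte key limit, so inserts with long values will fail
CREATE UNIQUE INDEX UX_Products_Manufacturer_Reference
    ON Products (manufacturer, manufacturerReference);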
So, the problem arises when I want to save the data. The naive approach would be to:
- Try to find out whether the insertable is a duplicate with a query like
select 1 from X where manufacturer = $1 and manufacturerReference = $2
- If no existing row was found, do the insert; otherwise return.
This solution can be really slow, so we need indexes on all of the supposed-to-be-unique fields, which can increase the datastore's size dramatically with lots of rows (and defining indexes on big varchar columns doesn't seem appropriate either).
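Roughly, each worker thread runs something like this per incoming document (a sketch in generic SQL; X and the $n placeholders are the same as in the query above, the remaining columns come from the Products example):

begin;

-- step 1: existence check
select 1 from X where manufacturer = $1 and manufacturerReference = $2;

-- step 2: executed by the application only when step 1 returned no row
insert into X (manufacturer, manufacturerReference, name, price)
values ($1, $2, $3, $4);

commit;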
This evolved into another solution, where I use an appropriate hashing algorithm (usually md5) to merge the relevant fields into a single field (fingerprint = md5(manufacturer + manufacturerReference)), which can be stored in a small column (binary(16) for md5). I can then define a unique index on the fingerprint. Inserting rows still works the same way: either I keep swallowing the database errors, or I do an existence check beforehand.
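The schema change is small (again a sketch in SQL Server syntax, continuing the hypothetical Products table from the sketch above; I compute the md5 in the application before inserting, though it could also be computed in the database):

-- 16-byte column holding md5(manufacturer + manufacturerReference)
ALTER TABLE Products ADD fingerprint binary(16) NULL;  -- NOT NULL in practice, once backfilled

-- the unique index now covers a fixed 16-byte key instead of two long varchar columns
CREATE UNIQUE INDEX UX_Products_Fingerprint ON Products (fingerprint);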
Which leads me to another problem: database locking and deadlocks. Doing the selects and inserts in the same transaction from many threads simultaneously is slow and error-prone (as I have experienced). Processing millions of rows should not take hours, which it currently does.
How is this problem usually solved (what is the best practice)? If I had to store the data in a NoSQL store, the problem would be easy (a new insert for the same fingerprint just creates a new version), but the targets are usually MySQL, SQL Server and PostgreSQL.