0

I have an array of records for bids (sample below in JSON). I would like to store this data in a relational database (Postegresql), however, the supplier data is not given with IDs, and there will be some entries that need to be deduplicated. For instance, in the example below the "John Smith and Associates" is listed under several similar names as a supplier.

Should I give each supplier a unique ID (even those with matching names) and then deduplicate after the db is populated, or do this while adding entries to the database?

[ { "Solicitation No": "B2342", "Issuing Organization": "VT Timber Sales", "Award Date": "2017/06/29", "Supplier_details": [{ "Successful Supplier(s)": "John Smith & Associates", "Supplier City": "Georgetown", "Award Total": "$22034.13" }] }, { "Solicitation No": "B2344", "Issuing Organization": "VT Timber Sales", "Award Date": "2017/06/30", "Supplier_details": [{ "Successful Supplier(s)": "John Smith & Assoc", "Supplier City": "Georgetown", "Award Total": "$5034.13" }, { "Successful Supplier(s)": "Some Logging ltd.", "Supplier City": "Georgetown", "Award Total": "$1034.13" }] }, (...)

2 Answers2

2

You should avoid populating database with duplicates whenever possible.

If you have a multi-user system, creating a duplicate with it's own ID, could cause another user to refer by accident to this additional ID before tour cleaning job occurs. In this case it will be much more difficult to clean the mess!

A second argument is that you know that the source can contain duplicates. If you insert duplicates, you later no longer know which one come from unreliable source and which come from other sources (such as manual entry in the db).

Christophe
  • 74,672
  • 10
  • 115
  • 187
  • Do you have a suggestion for a library for deduplicating as entries are added? The previous tool I was considering deduplicates on the DB instead of on entries added https://github.com/dedupeio/dedupe – Ryder Bergerud Jul 26 '17 at 16:19
  • The easiest way is to decide what makes the identity (eg solicitation+supplier name? Sollicitation+supplier name+supplier town? Same and date?) of a record and simply not to insert a record if a query shows that it exists. I wouldn't use a library for such step – Christophe Jul 26 '17 at 16:51
  • 2
    While I agree that you definitely shouldn't insert duplicates into your normative tables, the DBMS can be quite valuable for performing the deduplication. This suggests, at least, inserting the duplicates into a temporary table and deduplicating there is a viable thing to do. If the original JSON is not permanently stored (though I would encourage doing so), more permanent "staging" tables that hold the data before deduplication can be useful. Then, if you realize you made a mistake in the deduplication, you'd have something close to the original data with which to correct it. – Derek Elkins left SE Jul 27 '17 at 00:07
  • @DerekElkins You should put that comment into an answer. – Tulains Córdova Jul 27 '17 at 00:52
  • Indeed, if your system has more complex interfacing requirements (e.g. Aggregating different sources, enriching data, splitting input, verification of business rules, etc...), you could use an ETL like approach with a dedicated staging area (temporary tables) as suggested by @DerekElkins. But it adds a layer of complexity. And from your single example i'm not sure that this would be justified in your specific case. – Christophe Jul 27 '17 at 08:13
0

Most databases have something like MERGE or ON DUPLICATE KEY UPDATE and that's what you could use. Add a unique key spanning all relevant columns and the DB won't ever let you insert a duplicate.

If you're using a decent programming language, then a deduplication in code is pretty easy and (as long as all data fit in memory) also faster than when using the database (since you save yourself communication costs). Still, you should create the table unique key, so that the database checks for duplicates, too (keep your data consistent). This check helps in case of a programming error or concurrent access.

maaartinus
  • 2,633
  • 1
  • 21
  • 29