
So, I have a bunch of calls that are each assigned a UUID1 as they come in throughout the day. At the end of the call, the call is processed and some metrics around that call are generated and stored in RethinkDB/Cassandra.

Each call will generate something that looks like this:

[{
   "company": "foo company",
   "campaign": "bar campaign",
   "hash": "",
   "stat": "talk-time",
   "date": 1467356399,
   "value": 176
 },
 {
   "company": "foo company",
   "campaign": "bar campaign",
   "hash": "",
   "stat": "sale",
   "date": 1467356399,
   "value": 1
 },
 {
   "company": "foo company",
   "campaign": "bar campaign",
   "hash": "",
   "stat": "call-back",
   "date": 1467356399,
   "value": 0
 },
 ...
 ]

I need these stats to be unique within the database. My current solution is to take the UUID of the call that is stored in Postgres and combine it with the stat's fields into a string, which for the first stat above would look like `uuid1_foo-company_bar-campaign_talk-time_1467356399`. I then hash that string with SHA-512 and use the hash as the ID in RethinkDB to guarantee uniqueness.
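
For illustration, a minimal sketch of that derivation in Python; the call UUID and the way fields are folded into the key are placeholders inferred from the example above, not the actual production code:

```python
import hashlib

# Placeholder for the call's UUID1 as stored in Postgres.
call_uuid = "2f3b8e1a-42c6-11e6-beb8-9e71128cae77"

stat = {
    "company": "foo company",
    "campaign": "bar campaign",
    "stat": "talk-time",
    "date": 1467356399,
    "value": 176,
}

# Build the human-readable key, e.g.
# "<uuid1>_foo-company_bar-campaign_talk-time_1467356399"
readable_key = "_".join([
    call_uuid,
    stat["company"].replace(" ", "-"),
    stat["campaign"].replace(" ", "-"),
    stat["stat"],
    str(stat["date"]),
])

# Current approach: SHA-512 the key and use the hex digest as the RethinkDB id.
stat_id = hashlib.sha512(readable_key.encode("utf-8")).hexdigest()
print(len(stat_id))  # 128 -- the hex digest is 128 characters long
```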

The reason these need to be unique and reproducible is that sometimes we have to go back and reprocess all the calls for a given day, and we need to ensure that the stats generated the first time they were processed are kept and not duplicated. If they were duplicated, all our reporting would be incorrect.

Is there a better way using these tools to generate unique stats for a call that can be reprocessed later without inserting duplicate values?

Also, it seems that RethinkDB has a maximum length of 127 characters for the primary key, while SHA-512 hex digests are 128 characters long; hence rethinking this design.

Jared Mackey
  • What's the point of the SHA-512? Since the strings incorporate all the other components, they should already be unique (right)? If there *are* duplicates, then you'll get the duplicates in the SHA-512, too. But if there *aren't* duplicates, then the SHA-512 is adding a very small chance of creating some. Leaving the string in a "human readable" form might make some kinds of debugging easier, too. – Joshua Taylor Jul 05 '16 at 18:21
  • @JoshuaTaylor Changing it from readable to the SHA was just to make the length of the string uniform. Essentially, the human-readable string should be unique within the application. It would be easier to debug in the readable form as well, yes. The only piece that is not incorporated is the `value`, which is left as the only real piece of data. – Jared Mackey Jul 05 '16 at 18:25
  • There are ways of getting a uniform length string that wouldn't add the possibility of duplicates, though. E.g., since the UUID never contains spaces, you could left-pad with spaces until the string is whatever length. (That's just one option, of course.) That won't introduce duplicates like the SHA-512 could, and the string stays human readable. – Joshua Taylor Jul 05 '16 at 18:27
  • Why not just have company name and the other values as separate columns in the postgres table, and uuid as another column, with a unique constraint covering all the columns? – Daenyth Jul 05 '16 at 18:28
  • @JoshuaTaylor I do like that idea and it would be a good possibility if I were able to change the max length of the ID in rethink. The examples above are rather short examples. – Jared Mackey Jul 05 '16 at 18:29
  • @Daenyth Postgres is too slow and too cumbersome for the amount of data we are generating and the massive queries we are running. In fact, our first approach was to do something similar and then aggregate that over to Cassandra and Rethink, but it quickly blew up when trying to aggregate the values. – Jared Mackey Jul 05 '16 at 18:30
  • @Daenyth: Why would you do that, when the UUID is already globally unique? – Robert Harvey Jul 05 '16 at 18:52
  • @electrometro: The UUID is already globally unique. Why don't you just use that, directly? – Robert Harvey Jul 05 '16 at 18:53
  • @RobertHarvey Yes it is unique for the call, but not for all 50 stats it is used to create. Cannot give all 50 stats the same UUID or else they wouldn't be unique. – Jared Mackey Jul 05 '16 at 19:07
  • Then generate a UUID for each stat, or increment a sequence number and combine with the original UUID. – Robert Harvey Jul 05 '16 at 19:37
  • @RobertHarvey that is what this entire question is about, how do I generate a unique identifier for each stat that is reproducible later. Just calling `uuid.uuid4()` is not reproducible later. As in I won't get the same one for the same stat on the same call when reprocessing it a week later. – Jared Mackey Jul 05 '16 at 19:39
  • I'm confused. Every record you create (including the stat records) should have some sort of unique identifer generated for it. Ideally, that identifier you so create is globally unique, so that you only have to refer to it specifically, and nothing else. What do you mean by "reproducible?" – Robert Harvey Jul 05 '16 at 19:40
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/42067/discussion-between-robert-harvey-and-electrometro). – Robert Harvey Jul 05 '16 at 19:42

1 Answer


I see two possibilities for creating a new key (both sketched in code below):

  1. Generate a UUID5, which is based on the SHA-1 hash of a namespace identifier (which is a UUID) and a name (which is a string). Use your original UUID as the namespace and a string that is unique within your record as the name, or

  2. Generate a SHA-512 hash of your entire record, encode it to a base64 representation, and append the first 8 characters of the result to the end of your original UUID.
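
A minimal Python sketch of both options, assuming a placeholder call UUID and stat string (exactly which fields you fold into the name or record is up to you):

```python
import base64
import hashlib
import json
import uuid

call_uuid = uuid.UUID("2f3b8e1a-42c6-11e6-beb8-9e71128cae77")  # placeholder UUID1
stat_string = "foo-company_bar-campaign_talk-time_1467356399"

# Option 1: UUID5, namespaced by the call's UUID. Deterministic, so
# reprocessing the same call later yields the same id again.
stat_id_v5 = uuid.uuid5(call_uuid, stat_string)
assert stat_id_v5 == uuid.uuid5(call_uuid, stat_string)

# Option 2: SHA-512 of the whole record, base64-encoded, truncated and
# appended to the original UUID. sort_keys keeps the serialization stable.
record = {"company": "foo company", "campaign": "bar campaign",
          "stat": "talk-time", "date": 1467356399, "value": 176}
digest = hashlib.sha512(json.dumps(record, sort_keys=True).encode()).digest()
stat_id_sha = f"{call_uuid}_{base64.urlsafe_b64encode(digest)[:8].decode()}"
```

Either way the resulting id fits comfortably under RethinkDB's 127-character primary key limit (36 characters for option 1, 45 for option 2).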

Robert Harvey
  • I went with option 1. I ended up having RethinkDB create a UUID5 from the string that I was previously hashing and then concatenating that UUID5 with the UUID1 of the parent object. – Jared Mackey Jul 11 '16 at 15:54
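
For reference, a rough sketch of that final approach with the RethinkDB Python driver; the table and connection details are assumed, and `r.uuid(string)` derives the id deterministically on the server (SHA-1 based, like UUID5):

```python
import rethinkdb as r  # pre-2.4 driver import; newer versions use rethinkdb.RethinkDB()

conn = r.connect("localhost", 28015)  # connection details assumed

call_uuid = "2f3b8e1a-42c6-11e6-beb8-9e71128cae77"  # parent call's UUID1 (placeholder)
stat_string = "foo-company_bar-campaign_talk-time_1467356399"

# r.uuid(string) is deterministic, so reprocessing the same call regenerates
# the same stat id instead of inserting a duplicate.
stat_uuid = r.uuid(stat_string).run(conn)
stat_id = f"{call_uuid}_{stat_uuid}"  # 36 + 1 + 36 = 73 chars, under the 127 limit
```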