Is using hashes for primary keys a good idea?

Question

The Austrian electronic ID card relies on the so-called sector identifiers. For example a hospital gets to identify a person by getting a sectorId for that person, which is computed roughly as follows:

sha1(personalId + "+" + prefix + sectorId); // prefix is constant and irrelevant

Is that a good idea? I think the possibility of collision, no matter how small, poses a risk.

In hashtables, when there's a collision, you have other means of establishing equality, but with primary keys you can't possibly have two that are identical. That can be circumvented by a composite key, but then the point of a unique sector identifier is lost.

Is it ok to do that and is there a good way to have it that way without it breaking at some point?

Even if this algorithm creates a duplicate, aren't there other checks in a system that would have indexes that do not allow duplicates? If I go to another hospital without my ID card, wouldn't there be other means to lookup an individual besides this number? — JeffO, Aug 19 '15 at 17:26
What is the point in using a hash algorithm at all? `personalId` + `sectorID` will already serve as a unique identifier, and since there is nothing in there like a password which must be hidden, hashing seems of no actual use. What am I missing? Or is the "personID" something secret? — Doc Brown, Aug 19 '15 at 17:39
Why would you trust a UUID which typically (V4) consists of 122 random bits over a 160 bit has? Accidental collisions will obviously be rarer for the latter. — CodesInChaos, Aug 19 '15 at 19:42
@DocBrown I was curious about that myself. So, I found and linkified it above. I lost interest after about 10 seconds, so I'm still not entirely sure I see the point ... but, it has something to do with privacy ... I think. — svidgen, Aug 20 '15 at 14:04
If you pick a better hash then no human on the planet knows how to create even a single collision. Many have tried. — usr, Aug 20 '15 at 18:58

score 9 · Answer 1 · edited May 23 '17 at 11:33

9

This former SO article tells you how to calculate the collision probability. For SHA-1, b is 160. The number of people living in austria is below 10 millions. Even if each living person in austria is registered in a hospital with a unique person/sector ID, that just makes a collision probability of less than 3.5 x 10^-35. I guess that should be small enough for most practical purposes.

edited May 23 '17 at 11:33

Community

1

answered Aug 19 '15 at 14:19

Doc Brown

199,015
33
367
565

1

Well, are you sure that argument will hold much weight with the jury when it's about life and death? – Deduplicator Aug 19 '15 at 14:37
1

@Deduplicator: I guess the chance of getting a collision because of a hardware failure (some bits flipping in RAM or magnetic storage) or a human failure (for example, a typo) will be much higher, independent of what kind of IDs or hashing is used. But of course, a pettifogger might see this different. – Doc Brown Aug 19 '15 at 14:46
My point being that any lawyer most likely is one... ;-) – Deduplicator Aug 19 '15 at 15:08

score 4 · Answer 2 · edited May 23 '17 at 12:40

4

Hashes will inevitably collide if they're smaller than all possible combinations of data.

See this excellent answer: https://softwareengineering.stackexchange.com/a/145633

If primary keys are not supposed to be meaningful (human readable; containing retrievable traits of data), I would just go with GUIDs.

Yes, theoretically they can collide as well, but heat death of the universe is likely to happen first. See https://stackoverflow.com/a/184897

EDIT: addressing @DocBrown's counterpoints to clear things up (and to avoid lengthy discussion in comments)

Generating the identifier out of person id or sector id was not OP's requirement (indeed, he admitted that resorting to GUIDs was what he suggested himself).

I never claimed GUIDs are suitable as an overall replacement for SHA-1, or hashing in general (of course they're not), I'm only saying they could be used in this particular case - for uniquely identifying some entities. As this is what they're for by definition.

It was never a requirement that these identifiers must be reconstructible from the data (which is an advantage of hash functions). Please evaluate my answer within the context of the actual question.

edited May 23 '17 at 12:40

Community

1

answered Aug 19 '15 at 12:12

Konrad Morawski

9,721
4
37
58

@Bozho I think your suggestion is as good as it gets. Using random 128-bit identifiers keeps things simple (big plus already), and you can always prefix these values with something meaningful if you want. The only downside is that the resulting values would be long, but well, you can't have everything. I suppose they wouldn't be normally visible to anyone anyway - not be used as some PINs people are expected to quote over the phone. – Konrad Morawski Aug 19 '15 at 12:19
Why the downvote? If you disagree with my answer, please take an opportunity to point out what it is so I can improve it – Konrad Morawski Aug 19 '15 at 12:21
long values are no problem indeed. (P.S.downvote wans't mine) – Bozho Aug 19 '15 at 12:25
@Bozho I know, I've got +1 / -1 :) Your question was downvoted too – Konrad Morawski Aug 19 '15 at 12:26
2

GUIDs have 128 bit, SHA1 produces 160 bit output. So what makes you believe GUIDs are a better choice over the SHA1 hashes the OP mentioned in his question? – Doc Brown Aug 19 '15 at 14:10
1

@DocBrown I'm admittedly not an expert in the field, but length of output by itself is not an issue, any hash function will still return the same output for the same input (that's sort of the point). If `personalId + "+" + prefix + sectorId` is guaranteed to be unique, then perhaps it could even be used raw, why not, SHA1 doesn't add any extra uniqueness. The problem - as I understand it - is that this formula may not yield unique outputs, especially if the system is expected to function for a long time (maintainability reasons might require eg. adding more sector IDs - caution advised) – Konrad Morawski Aug 19 '15 at 14:23
@DocBrown note that I'm not saying hashing is bad here. I just think that using GUIDs is good enough, and it has an advantage of being braindead simple, since GUIDs are independent of the data which they identify. Generating a GUID could also be faster than hashing, although it depends on implementation details of course. – Konrad Morawski Aug 19 '15 at 14:26
5

I still do not get it how a GUID is of use here. Using GUIDs is no hashing algorithm, a GUID cannot be generated from a personID / sectorID. It might be used as a alternative for the latter if generation of unique personIDs would be a problem otherwise (which I guess is not), but it is not a replacement for something like SHA-1. – Doc Brown Aug 19 '15 at 14:42
Well, I think the goal of the OP is not really clear (see my comment below the question). And where did you get the information from that the OP suggested GUIDs by himself, or that generating the identifier out of person id or sector id was not OP's requirement (that is exactly what I get from the question)? P.S. downvote was not from me. – Doc Brown Aug 20 '15 at 10:51
@DocDrown if I didn't go crazy, I remember he wrote something to this effect in a comment below my question which - however - seems to be deleted now. I don't lose sleep over downvotes :) – Konrad Morawski Aug 20 '15 at 10:54
1

IMHO GUIDs are not solving the problem of the OP. GUIDs are helpful to generate unique identifiers in a **decentralized** manner - the "Austrian base register" is a pretty much centralized institution, they do not have that problem - personalId + sector code is already a unique ID, why to make it more complicated? The interesting question is: why do they apply a hashing? But that is something I expect the OP to tell us. – Doc Brown Aug 20 '15 at 11:00

score 1 · Answer 3 · answered Aug 19 '15 at 22:49

1

Using a Hash or GUID as Primary Key is also bad idea because it causes Index Fragmentation and frequent Page Splits.

answered Aug 19 '15 at 22:49

Gordon Bell

121
4

Is using hashes for primary keys a good idea?

3 Answers3