9

Background

The "all-PK-must-be-surrogates" approach is not present in Codd's Relational Model or any SQL Standard (ANSI, ISO or other).

Canonical books seems to elude this restrictions too.

Oracle's own data dictionary scheme uses natural keys in some tables and surrogate keys in other tables. I mention this because these people must know a thing or two about RDBMS design.

PPDM (Professional Petroleum Data Management Association) recommend the same canonical books do:

Use surrogate keys as primary keys when:

  1. There are no natural or business keys
  2. Natural or business keys are bad ( change often )
  3. The value of natural or business key is not known at the time of inserting record
  4. Multicolumn natural keys ( usually several FK ) exceed three columns, which makes joins too verbose.

Also I have not found canonical source that says natural keys need to be immutable. All I find is that they need to be very estable, i.e need to be changed only in very rare ocassions, if ever.

I mention PPDM because these people must know a thing or two about RDBMS design too.

The origins of the "all-surrogates" approach seems to come from recommendations from some ORM frameworks.

It's true that the approach allows for rapid database modeling by not having to do much business analysis, but at the expense of maintainability and readability of the SQL code. Much prevision is made for something that may or may not happen in the future ( the natural PK changed so we will have to use the RDBMS cascade update funtionality ) at the expense of day-to-day task like having to join more tables in every query and having to write code for importing data between databases, an otherwise very strightfoward procedure (due to the need to avoid PK colisions and having to create stage/equivalence tables beforehand ).

Other argument is that indexes based on integers are faster, but that has to be supported with benchmarks. Obviously, long, varying varchars are not good for PK. But indexes based on short, fix-length varchar are almost as fast as integers.

The questions

- Is there any canonical source that supports the "all-PK-must-be-surrogates" approach ?

- Has Codd's relational model been superceded by a newer relational model ?

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154
  • "authoritative" may be a better term than "canonical." The latter term implies that we're discussing a particular project or named philosophy, rather than a general database design rule. – DougM Jul 11 '13 at 16:34
  • 6
    Well, I don't know a canonical source, but to my experience the "all-PK-must-be-surrogates", to be precise the "the PK should be always an autogenerated field named `TablenameID`" works very, very well. I have seen that working in practice with an enterprise-sized db with more than 500 tables, and since that time I use this for database modeling whenever possible. – Doc Brown Jul 11 '13 at 16:39
  • 2
    Answers: 1) No! 2) Even bigger **NO**! – nvogel Jul 11 '13 at 19:07
  • 5
    Have to give this question a big -1. That surrogate keys are not required by the dbms is an indicator that they are not a 'must'. That said, as others have pointed out, using them consistently is a *smart idea* and helps avoid complications down the road as data changes. – GrandmasterB Jul 11 '13 at 19:58
  • The word "surrogate" is used in more than one sense in this context. In early usage, a natural key was described as a surrogate for an entity. Entities, like persons or airliners, do not really get entered into the database. They aren't data. The natural key identifies an entity and is data. So it can represent the entity inside the database. Provided of course that the enterprise does not mismanage the natural key. Later usage uses the word "surrogate" in the sense that an artificial key is used as a surrogate for the natural key. – Walter Mitty Aug 02 '13 at 04:23
  • @WalterMitty In my question I'm using the later meaning of the term. By the way, do you know any canonical source that supports the "all-PKs-must-be-surrogates-no-matter-what" practice ? – Tulains Córdova Aug 02 '13 at 09:45
  • @user61852, no I do not. My previous readings on this confirms the point that sqlvogel made in his answer, that the concept of primary key is not inherent in the relational model of data. All candidate keys are equivalent when viewed from a logical perspective. There's more, but I don't want to try to pack it all into a comment. – Walter Mitty Aug 02 '13 at 10:55
  • I'll add that in practice, I choose a primary key for every table (with one or two exceptions due to bizarre circumstances). My decision to use a non natural key ("surrogate key" if you like) is usually besed on how much I distrust the originators of the natural keys. In my experience, the people who originate codes will mess things up if Murphy's law permits them to. That's nearly always. – Walter Mitty Aug 02 '13 at 10:59
  • @WalterMitty "I choose a primary key for every table": me too. I also use surrogate keys, just in the 4 cases stated in my question. I'm a little more humble. If a whole body of countries trust a PK originator (who came into existence because of the computer era) then, who I am to nos trust them ? I don't have such grandiose image of myself. Also if a natural key changes once in ten years, that's what cascade updates are built into most moder RDBMS. – Tulains Córdova Aug 02 '13 at 11:47
  • @WalterMitty Any canonical source to support that radical practice ? The mixed (surro when needed, natural when fit) approach is supported by Codd's papers and a whole lot of canonical sources. – Tulains Córdova Aug 02 '13 at 11:51
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/9920/discussion-between-walter-mitty-and-user61852) – Walter Mitty Aug 02 '13 at 12:32
  • Sqlvogel's answer is a correct and complete response to this question. There is a widespread practice of declaring a PK named ID, and using the autonumber feature of the DBMS to populate it with unique values. There is an upside and a downside to this practice. I am prepared to discuss this at some length, complete with war stories, but only in a discussion forum. The Programmer area of SE is not a discussion forum. – Walter Mitty Aug 02 '13 at 12:03
  • See also https://dba.stackexchange.com/q/6108/4484 – Martin Schröder Jun 19 '20 at 12:31

2 Answers2

15

Primary and Foreign Keys do not have to be readable. Their purpose is to maintain the internal relational structure of the database, not to be read by a human.

Naturally, if there is an appropriate natural key that will never change (I claim these are as rare as hen's teeth or four-leaf clovers, but...), you can use that, and some customers will make that one of their requirements.

But why add the additional complexity to a database system, for little appreciable benefit? Primary Surrogate keys are system-generated, guaranteed to be unique, guaranteed to never change, and are the same data type for all tables. They will have the same reliable behavior under all circumstances.

If you're looking for a canonical resource that supports this practice, you won't find one. There are just as many designers on the other side of the aisle that will viciously defend their use of natural, composite keys with clustered indexes as primary keys, and all of the canonical resources say that it is the designer's choice.

See Also
http://en.wikipedia.org/wiki/Surrogate_key

Robert Harvey
  • 198,589
  • 55
  • 464
  • 673
  • I meant readibility of the code. – Tulains Córdova Jul 11 '13 at 16:39
  • 1
    he's not asking if the approach should be used. he's asking who he can reference to support his decision. Do you know of any organization or published work that advocates this design approach? Do you yourself have published work he can cite be beyond this website? – DougM Jul 11 '13 at 16:41
  • @DougM: Look at the last paragraph of my answer. – Robert Harvey Jul 11 '13 at 16:42
  • 1
    +1 for confirming than there's no canonical source that supports the "all-surrogate" practice. – Tulains Córdova Jul 11 '13 at 16:42
  • 1
    @user61852: Readability of the code is not affected in any way by using surrogate keys. – Robert Harvey Jul 11 '13 at 16:42
  • @Robert: yes, I can see that. you say there isn't a source that will end debate. Do you know of one that even agrees with you? – DougM Jul 11 '13 at 16:44
  • @DougM That wasn't the question. – Robert Harvey Jul 11 '13 at 16:46
  • 1
    @user61852: That's a different problem, having to do with normalization. It has nothing to do with surrogate keys, and it certainly doesn't prevent you from executing the SQL you just wrote. Six joins seems unlikely; if you need COUNTRY for that query, just add the FK for the COUNTRY table to the AIRPORT table, and do *one* join. – Robert Harvey Jul 11 '13 at 16:47
  • @RobertHarvey No. It's not about normalization. You understood that because I made a mistake. The query is `SELECT * FROM AIRPORT WHEN **COUNTRY_CODE** = 'FR'`; . `COUNTRY_CODE` is part of a foreign key with the table `PROVINCE_STATE` whose primery key is `(PROV_STATE_CODE, COUNTRY_CODE)`. Natural keys cascade so you can query the table `AIRPORT` by `COUNTRY_CODE`, which is an ISO code know by the users. – Tulains Córdova Jul 11 '13 at 17:34
  • 1
    *"appropriate natural key that will never change"*, Codd's relational model doesn't say PK must be immutable. All I find is that they need to be estable, i.e need to be changed only in very rare ocassions, if ever. I just want to know what canonical source supports that statement. – Tulains Córdova Jul 11 '13 at 17:41
  • As I stated before, I don't think such a canonical resource exists. – Robert Harvey Jul 11 '13 at 17:44
  • @RobertHarvey I would appreciate if you add an edit in your answer saying just that. You only say that "a source to end the debate doesn't exist", but I didn't ask for "a source to end the debate" but for "a canonical source that supports" the practice. – Tulains Córdova Jul 11 '13 at 17:53
  • @user61852: Done. – Robert Harvey Jul 11 '13 at 17:54
  • +1 again. Thanks. I took the liberty to add bold type to the sentence so the readers can see it easier. Hope you don't mind. – Tulains Córdova Jul 11 '13 at 18:00
  • -1 for `if there is an appropriate natural key that will *never* change`. I strongly oppose this as an absolute statement. "Not expected to change" is sufficient criteria for a key. Otherwise, you're doing the equivalent to `const int A = 65; const int B = 66;` just in case the ASCII encodings change some day. I don't expect them to change, so I freely assume that an `A` will always be `65`, without providing for the possibility that I may be wrong. This saves me a great deal of development time and mental effort and improves readability. – Bobson Jul 11 '13 at 18:13
  • 2
    @Bobson: I already stated in my answer that there are dissenting opinions, and I agree with your position in the paragraph below the statement you quoted, so your downvote seems... capricious. – Robert Harvey Jul 11 '13 at 18:16
  • @RobertHarvey - The fact that you discuss it in your answer at all is irrelevant to the answer of "No, there are no sources". – Bobson Jul 11 '13 at 18:17
  • If a value can change then it is by definition not stable, you can't rely on it to be tomorrow what it is today. http://docs.jboss.org/hibernate/core/3.3/reference/en/html/mapping.html#mapping-declaration-naturalid makes the specific point that when Natural values are used as Primary Keys they should be immutable, though natural keys are recommended. – James Snell Jul 11 '13 at 18:17
  • 1
    @Bobson: Well, now you're just trying to be combative. The fact that this (sensible) answer already has two downvotes is merely indicative that the subject is contentious. This argument has been going on at least as long as the One True Brace Style argument, and I suspect it will go on for a little while longer. Hopefully we can resolve it before the sun turns into a Red Giant. :) – Robert Harvey Jul 11 '13 at 18:19
  • @RobertHarvey - Sorry, I'm not usually that argumentative about things - I think I'm off today. In general, I prioritize programmer mindspace and code readability over highly unlikely future changes. But I work on software where rolling new versions simply consists of putting in the work to code, test and deploy it. If I were writing software which had a six-month approval and testing process for any changes (like one FAA contractor I interviewed with), I might feel differently. – Bobson Jul 11 '13 at 18:38
  • 2
    *"Primary and Foreign Keys are not supposed to be readable."* !! Who exactly is "supposing" such a thing, other than Robert? The question is about the relational model which certainly never, ever supposed such a thing. – nvogel Jul 11 '13 at 19:43
  • 4
    @sqlvogel *[sigh]* Make them readable if you wish. Really, it's OK. As long as you can guarantee immutability and uniqueness, you can paint them green for all I care. My point is that readability is really low on the importance scale. – Robert Harvey Jul 11 '13 at 19:45
  • 2
    @RobertHarvey Of course. No objection to hearing your opinions either but your first sentence reads a little too much like you are asserting some implicit property or intention of those things. Again as others already said, "immutability" is not a requirement. *Stability* (a relative term not an absolute one) is a useful or desirable attribute of key; immutability is not and is anyway illusory. – nvogel Jul 11 '13 at 19:51
  • 1
    @sqlvogel: To be absolutely clear, I'm not saying that system-generated keys will never have to change. What I *am* saying is that you should *design your system* so that they will never have to change. – Robert Harvey Jul 11 '13 at 20:07
  • 2
    @RobertHarvey, Perhaps you could, but the natural keys presumably would still need to change and their uniqueness would still need to be enforced and they would still need to be controlled in the same way. Those "problems" aren't going to be solved by adding a surrogate so you've potentially just added complexity by making it necessary to maintain more keys than you had before. *Potentially* that is. In practice it depends on many things, that's why "one rule fits all" is no help in my opinion. – nvogel Jul 11 '13 at 20:18
  • @sqlvogel: Surrogate keys are not supposed to solve those problems. Surrogate keys relieve the developer of the additional problems with the database structure that will occur when you have to solve those problems. They decouple the relational structure from the natural keys. That's all this is about. – Robert Harvey Jul 16 '13 at 14:41
  • In other words, if you have surrogate keys, *it won't matter if you change the natural keys;* the database structure will be unaffected. – Robert Harvey Jul 16 '13 at 14:53
  • 2
    The relational database's *structure* is not affected if you change *any* key's value. But it is not implicit that changing a natural key "won't matter" if you have a surrogate. As you already said: surrogate keys do not solve those problems for the developer or the user. What surrogates do is *add* certain new problems and complexities and whether it is worthwhile doing that or not is highly dependent on the circumstances. The lazy "surrogates everywhere" approach is not a prudent strategy for database design. – nvogel Jul 16 '13 at 16:06
  • 2
    @sqlvogel: I guess we're going to have to agree to disagree on this one. That each table needs a surrogate key seems self-evident to me. – Robert Harvey Jul 16 '13 at 16:14
  • 1
    @RobertHarvey, Self-evidently they are *not* "needed" because many thousands of developers and users successfully use tables that don't have them. The very nature of a surrogate is that it is supposed to be internal plumbing that is orthogonal to any requirements. Natural keys are the business requirements. Surrogates are an optional piece of refactoring. – nvogel Jul 16 '13 at 17:44
  • @sqlvogel: Surrogates are not a refactor. You put them in at the very beginning. There's no point in putting them in after the fact. – Robert Harvey Jul 16 '13 at 17:48
  • 1
    @RobertHarvey, "changing code without modifying its external functional behavior" = what surrogates do. The stage(s) in the database lifecycle at which a surrogates are added to a database design is unimportant. Sometimes they are added at inception, sometimes not. Sometimes they are removed when they have outlived their usefulness. The point is that natural keys define and are defined by external functional behaviour; surrogates do not and are not. That is in a nutshell the difference between a surrogate and a natural key. – nvogel Jul 16 '13 at 19:26
  • 1
    @sqlvogel: Then perhaps I'm using the wrong term. The term that I'm looking for describes a scheme whereby each record in every table gets a unique, immutable serial number of some kind that is system-generated, indexed, and marked as the Primary Key. It is the key that participates in all joins; it is the key that determines the relational structure of the database. It is the key that is divorced from all other business concerns except establishing those relationships between tables. They are added as part of the original design. They never outlive their usefulness. – Robert Harvey Jul 16 '13 at 19:36
  • 1
    "divorced from all other business concerns except establishing those relationships between tables" and therefore it has no function in the business domain. It is not a functional requirement of any kind; it is a purely technical and optional implementation *choice* made by the database designer. – nvogel Jul 16 '13 at 19:50
  • @sqlvogel: I don't disagree with that. Purely technical decisions of this nature are made all the time; that's why non-functional requirements exist. – Robert Harvey Jul 16 '13 at 19:52
  • 1
    @RobertHarvey, I was repeating your own definition. The database designer could choose to establish relationships between tables with a surrogate or without. Since surrogates by your own definition have no other (business) function they can be discarded if the database designer doesn't find them useful. – nvogel Jul 16 '13 at 19:54
  • @sqlvogel: Absolutely. – Robert Harvey Jul 16 '13 at 19:55
  • 1
    @RobertHarvey So it seems we now agree that surrogates are a refactoring (=non functional) and not needed to satisfy functional requirements. They are *sometimes* useful to satisfy non-functional requirements but since they very often come at a cost (in storage, performance and complexity) their application should be evaluated carefully on a case by case basis. "Surrogates everywhere" is a lazy and careless policy. "Surrogates where appropriate" is common sense. – nvogel Jul 16 '13 at 20:03
  • 2
    @sqlvogel: No, that's not what we agreed to. ;) What we agreed to is that it's the designers choice, and that designers disagree on the relative merits of surrogates. The Wikipedia goes over the advantages and disadvantages of surrogate keys in detail; perhaps we can agree on *that*. You can evaluate the merits on a case-by-case basis if you like, but I've yet to see a system where using surrogates as PK's was not ultimately the better choice. – Robert Harvey Jul 16 '13 at 20:04
  • 3
    @RobertHarvey, If you aren't familiar with situations where not using a surrogate is a better option then some people might conclude you aren't best placed to advise on when or when not to use them. I've no intention of commenting on anything written on Witless-pedia. – nvogel Jul 16 '13 at 20:22
  • 1
    Wikipedia is not an authoritative, canonical source. – Tulains Córdova Aug 02 '13 at 12:00
  • @user61852: I clearly stated in my answer that such a canonical source does not exist. – Robert Harvey Aug 02 '13 at 16:35
9

"All PKs are surrogates" is not a very sound strategy at all and certainly not one that you are ever likely to find an "authoritative" source for.

Firstly think about what is meant by "primary key" in this context. In the relational model there are no "primary" keys - meaning no one key which is fundamentally different from any other key of the same table. In principle all keys in a relational database can and do enjoy the same status and have the same features and function, except to the extent that the database designer chooses otherwise. The singling out of any one key in a table with multiple keys is therefore essentially arbitrary (that was the word used by E.F.Codd), subjective and purely psychological (the view of Chris Date, Codd's colleague and collaborator). Unless it is explained what distinction is being drawn between a "primary" key and any other key it is therefore pretty meaningless and of no merit at all to assert that such a key "should" or "must" be anything.

Secondly, the argument has very little to do with indexes, which are a physical storage feature. Keys are a logical matter, not a physical one and there is no absolute reason to assume that the storage considerations of a "primary" key are or should be any different to other keys (see previous paragraph). We might reasonably assume that whatever storage structures are used, the storage overhead will in some measure be greater with a surrogate key than with no such key but as always the best answer here is "it depends". Storage decisions should be made on a a case-by-case bases and blanket rules are of very little help.

Thirdly, from a logical point of view the absolute requirement of a surrogate key makes very little sense. The requirement for a natural key is exactly the same with or without a surrogate. The need for information to be identifiable in the domain of discourse (i.e. with a natural key AKA "business key", "domain key") is the same. Yes, keys may need to be updated but then that's the nature of things sometimes. Adding a surrogate doesn't in itself necessarily make key updates easier to handle and sometimes it can make them harder.

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154
nvogel
  • 604
  • 4
  • 11
  • excellent answer. There's lots more to be said on this score, but making your answer longer would not have made it better. You summarized the essential points well. – Walter Mitty Aug 02 '13 at 11:05