13

Background

I have this tables

+-------------------------+  +------------------------+
|Airport                  |  |Country                 |
|-------------------------|  |------------------------|
|airport_code string (PK) |  |country_code string (PK)|
|address string           |  |name string             |
|name  string             |  +------------------------+
+-------------------------+

+-------------------------+
|Currency                 |
|-------------------------|
|currency_code string (PK)|
|name string              |
+-------------------------+

airport_code is the IATA (International Air Transport Association ) airport code, you can see them in your luggage tags when you travel by plane.

enter image description here

country_code is the ISO 3166-1 A3 standard country code, you can see them in the olympics.

enter image description here

currency_code is the IS0 417 standard 3-chars currency code, you can see them in international currency exchange display boards.

enter image description here

Questions

Are these natural PKs good enough ?

Is using world respected standards, which are accepted by whole industries good enough for PKs ?

Do this tables need surrogates no matter what ?

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154

4 Answers4

16

No, they don't. Those keys are definitely good enough!

They're unique, not rarely going to change, and meaningful, which is a step up over a surrogate key. That's pretty much the definition of a good PK.

The restrictions about PKs being immutable and numeric-integer are not part of the Relational Model (Codd's) or any SQL standard (ANSI or other).

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154
Bobson
  • 4,638
  • 26
  • 24
  • 3
    Primary keys must also be immutable, something IATA airport codes are definitely not. They can be changed at the whim of the IATA. – James Snell Jul 10 '13 at 15:53
  • 3
    @JamesSnell - IATA airport codes are about as immutable as country codes. You're talking about a *change* maybe once every decade, if that. See [here](http://www.airliners.net/aviation-forums/general_aviation/read.main/2725737/) for an discussion of the matter. There's lots of outdated codes that are still in place because they're too much trouble to change. Additionally, that's what a CASCADE update is for. **Mutable primary keys are legitimate, if not great practice.** – Bobson Jul 10 '13 at 16:55
  • @Bobson So which is the better practice, using a natural primary key that _might_ change at an _unknown time_ at the whim of a 3rd party, or a surrogate primary key that you're guaranteed to maintain control over? I think that's what James is getting at. There are tradeoffs either way. – Eric King Jul 10 '13 at 17:14
  • 1
    @EricKing - I'd vote for the natural primary key in this case. If it *has* to change, it's changeable, and by having meaning, it simplifies the data. – Bobson Jul 10 '13 at 18:17
  • 2
    @EricKing These 3rd party, happen to be composed of representatives of all major parties of many industries, then the standards are discussed for years, then voted until a reasonable consensus has been reached. Also they agree upon the mechanisms through which any change or new addition is made. Besides that, the code lists standards are created, not on a whim, but because a need exist for creating a controled, respected, agreed upon, list of codes for something, in order to be able to interoperate worldwide and communicate properly worldwide. – Tulains Córdova Jul 10 '13 at 19:55
  • 2
    @user61852 - You could say these standards are *made* to be primary keys. – Bobson Jul 10 '13 at 19:59
  • 1
    @EricKing Cascade updates have existed for decades in mayor RDBMS. Also, the canonical books don't say PKs must be immutable, only that they must not change often. – Tulains Córdova Jul 10 '13 at 20:00
  • @Bobson Surely, these standards have been developed in the XXI century in order to help computer systems interoperate. It's like the US area codes. They were created in order to be able to automate government systems. Not until computers came to exist were these standards were needed and created, in order to facilitate the interoperability between disparate systems. – Tulains Córdova Jul 10 '13 at 20:02
  • 3
    @Bobson: "There's lots of outdated codes that are still in place because they're too much trouble to change"-> possibly because they are primary keys? – Maciej Jul 10 '13 at 22:20
  • @Maciej - Indeed. Catch 22. But that's how standards work - the standard doesn't change because everyone's already following the standard. New standards could come around (HTML 4 -> 5), but the old ones stay the same. – Bobson Jul 11 '13 at 12:13
3

I think need is a very strong word, and in a strict sense, the tables probably do not need surrogate keys.

However, if it were my database, I would probably add surrogate keys anyway. I may not necessarily want my database design to depend on a bunch of third parties (IATA, ISO), regardless of how stable their standards are. Or, I may not want to depend on a particular standard at all (are there other currency code standards? I don't know). I would probably model my tables with surrogate keys like so:

+-------------------------+  +------------------------+
|Airport                  |  |Country                 |
|-------------------------|  |------------------------|
|airport_id       int (PK)|  |country_id     int (PK) |
|iata_airport_code string |  |iso_country_code string |
|icao_airport_code string |  +------------------------+
|faa_identifier    string |  
|address           string |  
|name              string |  
+-------------------------+

+-------------------------+
|Currency                 |
|-------------------------|
|currency_id int (PK)     |
|iso_currency_code string |
|name string              |
+-------------------------+

In other words, unless those industry standard codes are inherently important to my application, I wouldn't use them as the PK of my tables. They're just labels. Most of my other tables will probably have surrogate keys anyway, and this setup would add consistency to my data model. The cost of 'adding' the surrogate keys is minimal.

Update based on some of the comments:

Without knowing the context of the example tables, it's impossible to know how important things like IATA Airport Codes are to the application using the database. Obviously, if IATA codes are centrally important to and used pervasively throughout the application, it might be the correct decision, after proper analysis, to use the codes as the PK of the table.

However, if the table is just a lookup table that's used in a few corners of the app, the relative importance of the IATA codes may not justify such an prominent spot in the database infrastructure. Sure, you may have to make an additional join in a few queries here and there, but that effort might be trivial in comparison to the effort it would take to do the research to ensure that you fully understand the implications of making the IATA codes the primary key field. In some cases, not only do I not care, but I don't want to have to care about the IATA codes. @James Snell's comment below is a perfect example of something I might not want to have to worry about affecting the PK of my tables.

Also, consistency in design is important. If you have a database with dozens of tables that all have consistently designed surrogate keys, and then a few lookup tables that are using 3rd party codes as PK, that introduces an inconsistency. That's not altogether bad, but it requires extra attention in documentation and such that may not be warranted. They're lookup tables for goodness sake, just using a surrogate key for consistency is perfectly fine.

Update based on further research:

Ok, curiosity bit me and I decided to do some research on IATA airport codes for fun, starting with the links provided in the question.

As it turns out, the IATA codes are not as universal and authoritative as the question makes them out to be. According to this page:

Most countries use four-character ICAO codes, not IATA codes, in their official aeronautical publications.

In addition, IATA codes and ICAO codes are distinct from FAA Identifier codes, which are yet another way to identify airfields.

My point in bringing these up is not to begin a debate about which codes are better or more universal or more authoritative or more comprehensive, but to show exactly why designing your database structure around an arbitrary 3rd party identifier is not something I would choose to do, unless there were a specific business reason to do so.

In this case, I feel my database would be better structured, more stable, and more flexible, by forgoing the IATA codes (or any 3rd party, potentially changeable code) as a primary key candidate and use a surrogate key. By doing so, I can forgo any potential pitfalls that might crop up due to the primary key selection.

Eric King
  • 10,876
  • 3
  • 41
  • 55
  • 1
    So IATA standards are good enough for airlines but not for you ? – Tulains Córdova Jul 10 '13 at 15:32
  • 1
    @user61852 Yes. Unless, as I stated, IATA codes are inherently important to my database/application, as they presumable are to the airline industry. In which case, suddenly IATA codes become pretty useful. Until then, I'll stick with surrogate keys. – Eric King Jul 10 '13 at 15:35
  • 1
    Of course you will have to join all the way up to the airport table when you want to look for baggage from London Heathrow, because you can't do `select * from baggage where airport_code = 'LHR'`, meaning the database is only usable throw the application, which is a very narrow and proprietary approach, specially when the business owner is the one who paid for the database, and therefore owns it. Also you will have to write code to do mundane things like importing data from one database to another to avoid PK colisions. – Tulains Córdova Jul 10 '13 at 15:43
  • 1
    @user61852 Sure. Maybe. Like I said, it depends on the app and the database. What you're describing sounds like an application built around airport operations in one sense or another. In such a case, my design decisions may be different. You seem to be looking for an answer that says 'always use a natural key when one is available'. I don't prescribe to that thinking. I think 'it depends, and surrogate keys are sometimes ok even when natural keys are available". – Eric King Jul 10 '13 at 15:54
  • 2
    IATA codes are not immutable therefore they cannot be considered as PK candidates. Example: the code IDL was in New York, until it got renamed JFK. The IDL code is now in Mississippi. – James Snell Jul 10 '13 at 16:03
  • *"You seem to be looking for an answer that says 'always use a natural key when one is available'."*. But you seem to defend the dogma that says *"all-tables-must-have-surrogates-no-matter-what-and-no-analysis-needed"*. – Tulains Córdova Jul 10 '13 at 16:03
  • @JamesSnell RDBMS support cascading udpdate, and PK need not be immutable, they just need to change very little over time. A column that may, or may not, change once in ten years is a good PK. – Tulains Córdova Jul 10 '13 at 16:05
  • In this case a surrogate key is required - although personally I don't like surrogate keys to be unique at the table level, they should be unique at the database level as that allows for simpler glue tables. – James Snell Jul 10 '13 at 16:06
  • 1
    @JamesSnell That's a good reason I don't generally want to depend on 3rd parties for my database infrastructure decisions. Not that I _think_ their codes are going to change, but that _I don't want to have to care_. In this case surrogate keys allow me to abstract away the 3rd party implementation. – Eric King Jul 10 '13 at 16:07
  • @user61852 - PK's need to be immutable *forever* on the basis that you cannot guarantee that at some point you won't interface with a system which is not under your direct control. – James Snell Jul 10 '13 at 16:07
  • 1
    @user61852 _You seem to defend the dogma that says "all-tables-must-have-surrogates-no-matter-what-and-no-analysis-needed"._ You're being purposely obtuse now. In no way does my answer (or any of my comments) indicate that. As a matter of fact, my answer says that surrogate keys are _not needed_ in the table examples you gave. – Eric King Jul 10 '13 at 16:10
  • Forgive my obtuseness. What would be an example of a table where a surrogate key is not needed ? – Tulains Córdova Jul 10 '13 at 16:17
  • @user61852 When a natural key candidate meets your database design requirements. Sometimes they do, sometimes they don't. And sometimes it's not worth the effort to do the research to figure out if the natural key candidate will work (the IATA codes in this example) at all because doing so would take longer than adding a surrogate key and moving on with life. – Eric King Jul 10 '13 at 16:21
  • @EricKing How do you know if a natural key will work ? – Tulains Córdova Jul 10 '13 at 16:24
  • @EricKing Most of the time you don't research, but the business experts tell you how they do things. – Tulains Córdova Jul 10 '13 at 16:37
  • 1
    @user61852 You will only know if a natural key candidate will work after you've done some research. My point is that _depending on the app and the tables in question_, that research may not even be worth it. Just put in a surrogate key and move on. See my update for clarification. – Eric King Jul 10 '13 at 16:42
  • I read your update. Well, ISO and IATA did the analysis so you don't have to. Besides that, caring is good and researching is always worth it. Also if the client who is paying the development tells you they use IATA codes to do their business, and it works for them, that's good enough real-life test on the key being good. – Tulains Córdova Jul 10 '13 at 16:49
  • 1
    @user61852 There are a lot of assumptions in there... The IATA and ISO couldn't care less how their codes affect my database. Responsible research would still have to be done. And, if the client who is paying me tells me to use the codes, I'll use them. But that has absolutely nothing to do with your question or my answer to it. – Eric King Jul 10 '13 at 16:54
  • @JamesSnell - Why would it possibly matter if you need to *theoretically* interface with something down the road? Your primary key is what it is, and you build the interface with that taken into account. – Bobson Jul 10 '13 at 17:06
  • 2
    @EricKing IATA and ISO care about the codes being stable enough, unique and universaly accepted. That coincides a lot with the interest of a person designing a table. – Tulains Córdova Jul 10 '13 at 18:00
  • 3
    @user61852 - just because these are standard codes does not mean the airline system uses them as PK's (maybe you have more insight here?). Having a cascading update on such a massive scale seems like a very bad idea. – JeffO Jul 11 '13 at 14:12
2

If you take the "I use surrogate keys all the time" approach, you get to bypass this type of concern. That may not be a good thing because it's important to give your data some thought, but it certainly saves a lot of time, engergy and effort. If anyone were to adopt an acception to this rule, the listed examples certainly qualify because it takes an a near "act of congress" to make the change.

Ad hoc queries of a database with these natural keys is certainly helpful. Creating views that do the same thing by including the lookup tables can work just as well. Modern databases do a much better job with this type of stuff to the point where it probably doesn't matter.

There are some cases specific to the US, where standards were drastically changed: Postal code expanded from 5 - 9 digits, State abbreviations to a consistent 2 letters and get rid of the period (Remember when Illinois was Ill.?), and most of the world got to deal with Y2K. If you have a real-time app with data spread all over the world containing billions of records, cascading updates are not the best idea, but shouldn't we all work in places that face such challenges? With that dataset, you could test it for yourself and come up with a more diffinitive answer.

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154
JeffO
  • 36,816
  • 2
  • 57
  • 124
  • +1 Great answer. Most of the time people is very dogmatic on this issue. Many database designers have a giant ego and consider themselves as the owners of the database and the data. Other's see OK that the owner of the data can only use it through an specific application, because he cannot make sense of it. They also prefer to make provisions for something that may or may not happen in the future while making a living hell of things that are done in a dayly basis like importing data and writing queries. Also failing to produce any kind of canonical bibliography that supports their view. – Tulains Córdova Jul 11 '13 at 15:15
  • By the way, the "I use surrogate keys all the time" rule is not in the Relational Model (Codd's) nor any SQL standard. Oracle data dictionary scheme uses natural keys whenever possible and artificial keys in the other instances. PPDM (https://ppdm.org/) also recommends the mixed approach and it uses it in its model. ANSI SQL Standard says nothing about all-surrogates. I think all-surrogates and all-natural are corrosive. Some natural and some surrogate is what relational model teach. – Tulains Córdova Jul 11 '13 at 15:24
1

While having surrogate keys on the fields is fine and there is nothing wrong with that something to consider might be the index page size itself.

Since this is a relational database you'll be doing a lot of joins and having a surrogate key of a numerical type might make it easier on the database to handle i.e. the index page size will be smaller and thus faster to search trough. If this is a small project it won't matter and you'll get by without any issues however the bigger the application gets the more you'll want to reduce bottlenecks.

Having a BIGINT, INT, SMALLINT, TINYINT or whatever integer-like data type might save you some trouble down the road.

Just my 2 cents

UPDATE:

Small project - used by a few, perhaps even a few dozen people. Small scale, demo project, project for personal use, something to add to a portfolio when presenting your skills with no experience, and the like.

Large project - used by thousands, tens of thousands, millions of users daily. Something you'd build for an national / international company with a huge user base.

Usually what happens is a select few of the records get selected often, and the server caches the results for fast access, but every now and then you need to access some less used record, at which point the server would have to dip into the index page. (in the above example with the airport names, people often fly domestic airlines, say Chichago -> Los Angeles, but how often do people fly from Boston -> Zimbabwe)

If VARCHAR is used that means the spacing is not uniform, unless the data is always the same lenght (at which point a CHAR value is more effective). This makes searching the index slower, and with the server already being busy handling thousands and thousands of queries per second now it has to waste time going trough a non-uniform index, and do the same thing again on the joins (which is slower than regular selects on an un-optimized table, take DW as example where there are as few joins as possible to speed up data retrieval). Also if you use UTF that can mess with the database engine as well (I've seen some cases).

Personally, from my own experience, a properly organized index can increase the speed of a join by ~70%, and doing a join on an integer column can speed up the join by as much as around ~25% (depending on the data). As the main tables start to grow and these tables get used on them, would you rather have an integer datatype occupy the column that has a few bytes vs having a VARCHAR / CHAR field that will occupy more space. It comes down to saving on disk space, increasing performance and the overall structure of a relational database.

Also, as James Snell mentioned:

Primary keys must also be immutable, something IATA airport codes are definitely not. They can be changed at the whim of the IATA.

So taking this into consideration, would you rather have to update 1 record that is bound to a number, vs having to update that one record plus all the records in the table on which you join to.

Toni Kostelac
  • 249
  • 1
  • 5
  • It's a valid thought, but the point of these tables is that there are only a finite quantity of records in each table. If you actually meant code size by `small project` and `bigger`, please update to clarify why that would matter. – Bobson Jul 10 '13 at 18:44
  • 1
    The restrictions about PKs being immutable and numeric-integer are not part of the Relational Model (Codd's) or any SQL standard (ANSI or other). – Tulains Córdova Jul 10 '13 at 20:49
  • 4
    Indexes based on **fixed length, short strings (like ISO codes)** are as fast as integers. Indexes based on variable length, long strings are not. – Tulains Córdova Jul 10 '13 at 20:51
  • That is what I stated (see the VARCHAR vs CHAR part above) i haven't had a chance to test a fixed lenght short string vs a numeric integer but I did have a chance to do so with a variable lenght and an integer – Toni Kostelac Jul 10 '13 at 21:04
  • 2
    Join performance is a straw man. Often, using natural keys means you don't need a join in the first place. – Mike Sherrill 'Cat Recall' Jul 12 '13 at 14:20