Are surrogate keys a known anti-pattern?

Question

My workplace's database has a pattern that I've not seen before. Every column that is intended to be a key, whether primary or foreign, ends in _SK. This is shorthand for "surrogate key". It appears to be an informal way to tell the developer that said column is safe to use in joins and won't have any type mismatches or unexpected behavior. For example, our table of dates has many columns that represent the date. DATE_PLAIN is the typical SQL DATE variable that shows data ISO style (e.g. 2022-10-30), DATE_VENDOR puts the date in the style that our vendor uses (e.g. 44300... their epoch is weird), and DATE_SK is always an INT that uses the familiar 20221030 format. By reading these column names, the pattern immediately tells the developer that DATE_SK is the one that you want for joins. Someone who uses either of the other two options in joins will run in to type mismatches and trouble (I learned that the hard way, e.g. our vendor inconsistently stored their dates as INT and DECIMAL).

This strikes me as a remarkably good idea, which raises the question of why I've not seen it before. Is it a known anti-pattern?

`UPDATE sales SET DATE_SK = 20221029 WHERE DATE_PLAIN = '2022-10-30'` Ooops. — Philip Kendall, May 30 '22 at 18:36
@PhilipKendall One hopes that we use 3NF properly and make DATE_SK the only date column in `sales`. Although I admit that we've not done that in practice. I presume it was a performance thing. — J. Mini, May 30 '22 at 18:38
The weird thing is having the same date in multiple formats. A single date format should be sufficient. — JacquesB, May 30 '22 at 20:00
@JacquesB I see nothing weird about this. User-facing products will want `DATE_PLAIN` and anything that comes from our vendor will use `DATE_VENDOR`, but who wants to do arithmetic and other manipulations with SQL's `DATE` type or deal with the inconsistent typing of `DATE_VENDOR` when you can just use the predictable integer that is `DATE_SK`? — J. Mini, May 30 '22 at 21:37
@J.Mini it’s a violation of “single source of truth”. Store data. Not presentations. — candied_orange, May 31 '22 at 01:00
If you want two date columns but one source of truth, some dbms offer computed columns, or use a View. Storing it twice is a bad idea. — John Wu, May 31 '22 at 03:23
@JacquesB I completely agree in general and was going to comment exactly the same. However, small exceptions can be made for _reporting_ databases where the values may be precalculated in order to squeeze every last bit of performance out of the queries themselves. As OP mentions the vendor, it does hint that this table is being used for reporting. That being said, such a reporting database should not be its own source of truth (the real database should be), but the right solution here might be to separate the real and reporting database and keep the multiple formats in the reporting database. — Flater, May 31 '22 at 07:58
@JacquesB it seems as if the vendor is inconsistent about how they do things then something like this is a necessary evil. — emory, May 31 '22 at 15:36
Any column can be used in joins, not just keys. I fail to see the point of adding suffixes to them. — The Impaler, May 31 '22 at 15:38
@TheImpaler While any column *can* be used, there can be significant performance impact from using non-indexed columns. So a name that indicates that it's a reasonable choice can be useful. — Barmar, May 31 '22 at 15:46
One antipattern I see here is using integers for dates as if they were strings. This allows for invalid dates to be encoded. I once had to deal with data like this where you would find Feb 29th for non-leap years. It also requires more storage than a date type. One source of confusion may be that date types aren't stored as ISO strings or any other human readable format. — JimmyJames, May 31 '22 at 16:19
@Barmar True, then the point is not about **keys** anymore, but about **indexing**. Foreign keys are indexed by default in clustered index-based engines, such as MariaDB, MySQL, SQL Server, SAP ASE, but are not in heap-based ones such as Oracle, PostgreSQL, DB2. With this idea an FK in Oracle should not have the `_SK` suffix, while the same FK in MySQL should. — The Impaler, May 31 '22 at 17:18
@TheImpaler While a foreign key might not be indexed by default, you probably should index unless the indexing overhead causes a problem. It's likely to be used in joins, and if you use options like `ON DELETE CASCADE` you want to find the matching rows efficiently. — Barmar, May 31 '22 at 17:35
Concerning the multiple date-oriented columns, that kind of thing should generally be done through a view, not a table. In general, you only want to actually store the piece of data in one single cell, and if you need it correlated and formatted different ways from within the database, you do so through things like views and functions. Redundant data is bad design by default (even if it's formatted differently), and exceptions to this are generally for very different reasons. — Panzercrisis, Jun 01 '22 at 02:13
Why do people think storing multiple views in a append-only table is a bad idea? In data warehousing you often have a DateTime dimension like this, which might contain separate Year, Month, Day, DayOfWeek, Hour, etc properties so you can query loads and loads of data without parsing DateTimes to see if it was a monday morning. — Tvde1, Jun 01 '22 at 11:54
I'm with @Tvde1 - The date are specifically being called out as being part of a calendar table, where this is expected practice. Also, such a table is populated by an automated process, and, when supported by the RDBMS, almost all the columns will be automatically derived in some fashion. — Clockwork-Muse, Jun 01 '22 at 15:51
@Tvde1 Does it make sense to use a surrogate key for the date table or would you use the date itself? I've seen (roughly) the former done and it made queries extremely complex with many joins that were error-prone. — JimmyJames, Jun 01 '22 at 16:59

score 53 · Accepted Answer · answered May 30 '22 at 21:46

53

The use of surrogate keys is by itself not an anti pattern. It is a way to create a stable primary key for an entity, that will never change and not depend on any application data.

But there are three anti-patterns in the practice you describe:

Using a suffix _SK to identify potential surrogate keys is a variant of the Hungarian notation, with all its drawbacks
Calling surrogate key DATE_SK that bears value such as 20220531 is misleading since it is in reality a natural key with a special encoding, so just the contrary of a surrogate key that should be completely unrelated to meaning to the data it refers to.
Keeping several DATEs columns in a same table with different encodings to refer to the same date seems to be a denormalization, with the risk of inconsistencies.

answered May 30 '22 at 21:46

Christophe

74,672
10
115
187

28

Though I agree to point 2 and 3, I don't see "_SK" as a form of Hungarian notation - at least, it does not have "all its drawbacks". Have a look at the [drawbacks section in the Wikipedia article you cited](https://en.wikipedia.org/wiki/Hungarian_notation#Disadvantages), most of them refer to type redundancy of "Systems Hungarian" in statically typed languages, which does not apply here. In fact, going through the list of those 12 disadvantages listed there 1-by-1, I did not find one which clearly applies here. – Doc Brown May 31 '22 at 08:21
4

@DocBrown Thanks for the opportunity to clarify. As said, it is a variant, so the drawbacks have to be transposed. I think for example that "The Hungarian notation is redundant when type-checking is done by the compiler." stays relevant, since type management is done by the SQL engine. Moreover it is also redundant with foreign key constraint and primary key definition if the design is based on sk. Also "It may lead to inconsistency when code is modified or ported. If a variable's type is changed" is very relevant in an SQL context. – Christophe May 31 '22 at 13:11
1

Last but not least, it can be embarrassing when class members in some OO code accessing the DB suddenly get such suffixes whereas the naming conventions might prohibit Hungarian notation. Btw, this phenomenon might propagate and amplify drawbacks. – Christophe May 31 '22 at 13:22
Well, I have used (surrogate) keys of the consistent form "TablenameID", and it leads to datamodels for which several things can be generated. Try this without a suffix. – Doc Brown May 31 '22 at 13:31
@DocBrown I’d think that this is different: you don’t make any claims whether it is surrogate or not: you just refer to the name of a precise member (probably named ID) in the other table. You could as well just call it Tablename and use constraints to find out which field is the related one (or probably know that it’ll be ID), but here we are entering the subjective domain of taste ;-) – Christophe May 31 '22 at 14:04
@TheImpaler ah! could you please tell where you found arguments in favour of natural keys here? – Christophe May 31 '22 at 19:41
@Christophe Unfortunately I cannot refer to any bibliography. I only have the experience on many projects where natural keys have been a source of problems. I must say that the theory is correct, but when implemented the real world hits us with all kinds of unexpected cases -- that well, we should have expected. The bottom line is that natural keys attract totally avoidable, unwelcome problems. – The Impaler May 31 '22 at 19:49
@TheImpaler In fact, I share the same experience and it‘s only a matter of time for the problems to occur with natural keys in larger projects. Nevertheless,the question is not about natural keys. Moreover in small projects, natural keys can simplify queries/reporting and cascading constraints can reduce the drawbacks. This is why I do not criticize or downplay the natural keys in ly answer: it all depends on the context and I believe that software engineers are best placed to make the best decision in the context of their own project ;-) – Christophe May 31 '22 at 20:11
4

also: dates are not surrogate keys to begin with. – user253751 Jun 01 '22 at 13:01
Note that the date columns are specifically stated to be in a "date table", aka a calendar table. Multiple formats (and pulling out multiple date parts to separate columns) is expected and normal practice, since the table is (should be) populated by an automated process, and used for joins (and if the RDBMS supports it, should be derived columns as well). – Clockwork-Muse Jun 01 '22 at 15:48
I personally see 'Hungarian Notation' as a benefit, not a drawback. However, isn't that sort of opinion based, anyway? – JosephDoggie Jun 02 '22 at 17:45

Doc Brown · Answer 2 · 2022-05-31T05:34:37.210

28

There is nothing wrong in using surrogate keys as primary keys in a consistent way throughout a whole database schema, for each table, with a defined type and a column name suffix like "_SK". There is also nothing wrong in using corresponding foreign keys, with corresponding column names to the primary key columns whenever possible (hence also getting the suffix "_SK"). I have seen this working well in practice for larger data models (with different suffixes, but that does not matter).

However, IMHO it is not a good idea to abuse a standard term like "surrogate key" with a specific meaning for a key which does not fit to that definition - that will sooner or later create confusion. The whole point of a surrogate key is that it is completely artifical with zero domain logic encoded, for example a GUID or some raw ID or ordinal number. Your example for DATE_SK which encodes a date is clearly an example of not being a surrogate key, it is quite the opposite of it, a natural or domain key.

So I would recommend to either call those kind of keys differently, or use real surrogate keys instead.

edited May 31 '22 at 05:34

answered May 30 '22 at 20:29

Doc Brown

199,015
33
367
565

1

There's a noteworthy omission in this answer. You say that I'm using the wrong name for this pattern and that you've seen the same pattern with different naming conventions. What are those naming conventions and the correct name for this pattern? At minimum, that would be great info for the interested reader to stick in a search engine. – J. Mini May 30 '22 at 21:39
8

@J.Mini: sorry if I wasn't clear: the exact suffix for the surrogate key columns does not matter. If you would use a a different one like "id" or "key", that would not solve the issue. But when you talk of "surrogate keys": that's a technical term with a specific meaning which is contrary to the content of the `date_sk`. If you would name it just "date_id" that would only partially solve the issue. Better stick to your convention and use real surrogate keys. – Doc Brown May 31 '22 at 05:34
2

@J.Mini What he's saying is that the system labels them "Surrogate keys" when they're actually simply fields that are well enough sanitized to use as keys. – Loren Pechtel May 31 '22 at 06:20
1

@LorenPechtel: right, and I guess interpreting the suffix "SK" as an abbreviation for "sanitized key" instead of "surrogate key" could probably work as well. – Doc Brown May 31 '22 at 06:39
@J.Mini This "pattern" is encoding the fact that there is an index in the name of the field. It doesn't really have anything to do with keys. Of itself it doesn't sound like a good idea to me, nor (as others have mentioned) storing the same data in multiple formats. At minimum there had better be a table constraint ensuring the date columns are all for the same date. – OrangeDog May 31 '22 at 11:35
@DocBrown I was thinking about this question today--and I wonder if the "SK" actually meant "sanitized key" and somewhere along the line the understanding got corrupted? – Loren Pechtel Jun 01 '22 at 03:47
SK might be for superkey, in the sense of a SQL UNIQUE NOT NULL column set, which could be declared a SQL PK. In the relational model a superkey is a unique column set & a CK (candidate key) is a superkey containing no smaller/proper superkey & a PK is some CK (not just some superkey). Superkey & CK don't have definite meanings in SQL. Surrogate key also doesn't have a single meaning, it's a column added to be a CK/PK but sometimes it means a value visible to the business that to the DB is a natural key & sometimes it means a column not visible to the business but just DB users. – philipxy Jun 01 '22 at 05:31
I have seen this quite a few times as well (albeit with _fk, _pk, sk, etc...) but I find that the practice seems to be dying off as newer RDBMSs and Database IDEs come packaged with advanced [Entity Relationship Modelling](https://en.wikipedia.org/wiki/Entity–relationship_model) utilities that makes such practices largely redundant. Setting up proper constraints is generally a better practice. – Jun 01 '22 at 15:11

Philipp · Answer 3 · 2022-06-01T15:11:06.723

First of all, whoever came up with this idea seems to misunderstood what a surrogate key is. A surrogate key is a form of primary key. There are two competing ideas among database administrators about how primary keys should be structured. One philosophy is that of "natural keys". This philosophy says that when your data already has an unique identifier, use it. The other philosophy is that of "surrogate keys". Adherents to this philosophy believe that natural keys are often either not as unique and immutable as you assume them to be, or much longer than required. So you should use an additional ID column as primary key for each table which contains auto-generated values which are guaranteed to be unique, also known as a surrogate key.

Which of those philosophies is correct is besides the point. But fact is that a column which isn't the primary key isn't a surrogate key... at least for this table. It might be the surrogate key of another table. Then it would be a foreign key referencing a surrogate key. But it would not be a surrogate key in itself.

However, a consistent naming convention for columns which contain keys from other tables is actually a good idea. Especially if that column contains a true surrogate key, because a true surrogate key would not allow you to guess what table it could refer to just based on context. I have seen this naming convention before. But usually the name for it is ID. For example, when you have a table customer, that table has a primary key ID containing a surrogate key. When you then have a table order, that table would have a column customer_ID which references customer.ID.

And here we see another advantage of surrogate keys: Because of the synthetic nature of surrogate keys, nobody would ever get the idea to store them in different formats. When your customer.ID is, say, a 8 character hexadecimal string (because someone thought that was a good idea at some time), then it would make zero sense to encode customer_ID as an integer. You just use the format of the column you reference.

What is a bad idea, though, is to have multiple fields which contain the exact same data in different formats. Not only does it increase the size of the data, it also allows inconsistencies. If you do need multiple representations of the same data for convenience, then create a VIEW for the table which projects the one date column onto multiple view columns showing that date in different formats. And when that conversion in real-time eats up too much performance in your particular use-case, then materialize that view.

Are surrogate keys a known anti-pattern?

3 Answers3