235

In my databases, I tend to get into the habit of having an auto-incrementing integer primary key with the name id for every table I make so that I have a unique lookup for any particular row.

Is this considered a bad idea? Are there any drawbacks to doing it this way? Sometimes I'll have multiple indices like id, profile_id, subscriptions where id is the unique identifier, profile_id links to the foreign id of a Profile table, etc.

Or are there scenarios where you don't want to add such a field?

AJJ
  • 2,938
  • 4
  • 14
  • 14
  • 1
    Being able to uniquely reference a row is useful, just having an `id` for the sake of having it on something that's just a link table between profile and subscription isn't necessarily the best way to force this, when there's other ways (for example a unique constraint/index, or the primary key being a composite of the two values) which can have benefits in the storage engine that then help queries perform better. – TZHX Aug 15 '16 at 21:08
  • 68
    Have a look at the [German tank problem](https://en.wikipedia.org/wiki/German_tank_problem) for an example where a plain auto-incrementing identifier is a problem. Of course this only matters if you are using your ids in public. – Bergi Aug 15 '16 at 23:40
  • 2
    @Bergi I can't tell if this is a joke or an analogy for some other form of SQL issue? – AJJ Aug 15 '16 at 23:50
  • 26
    @ArukaJ The point is that it leaks some information about the system. E.g., suppose the database contains user-written posts, each of which gets a sequential id. Say you make four posts, each of which gets an id: at 4am (20), 5am (25), 8pm (100), and 9pm (200). By looking at the ids, you can see that only 5 posts were added between 4am and 5am, while 100 were added between 8pm and 9pm. If you were trying to pick the time for a denial of service attack, that could be valuable information. – Joshua Taylor Aug 16 '16 at 00:24
  • 7
    Also consider the case where you find yourself needing to merge two tables. Easy to do with *globally* unique ids. Hard with auto-incrementing ints. – Rorshark Aug 16 '16 at 00:59
  • 1
    @jamesgifford But wouldn't I likely be merging with that ID as part of the join? Joining tables where for example thistable.profile_id = profile_table.id – AJJ Aug 16 '16 at 01:18
  • Do you mean primary key or clustered index? The two are very different, though they may be defined over the same columns by default. – Luaan Aug 16 '16 at 08:44
  • 1
    Not just as part of a join - perhaps it is part of a data reporting system. E.g., you take your current system which works for a department and replicate it so that there is an independent copy in each department. Then, after that, you discover a need to have a data warehouse collecting information from all departments. At that point, your auto-increment keys collide. Or, consider a system that grows so much you need to shard it. You'd like them to run independently (no temporal interdependency) but auto-increment keys will collide. @jamesgifford is right - the situation is general. – davidbak Aug 16 '16 at 17:12
  • 35
    To everyone complaining about the "german tank problem".... if the only thing keeping someone from accessing data they shouldn't is a key in your URL... you have bigger problems than GUID versus Auto INT. – Matthew Whited Aug 16 '16 at 18:55
  • 12
    @MatthewWhited It's not just about swapping parameters in a URL. Suppose you use a site and you create asset 100 at time `t`, and asset 120 at time `t + 60`. If you can see both of those IDs (100 and 120) in unobfuscated form, you now know the total number of assets which exist, as well as roughly the rate at which they're created. This is information leakage. [This is not purely hypothetical.](http://www.rightscale.com/blog/cloud-industry-insights/amazon-usage-estimates) – Chris Hayes Aug 16 '16 at 19:55
  • And, that information doesnt really give you anything useful other than to know how fast data may be created. But it's just an estimate and it provides nothing useful to anyone other than a blind guess it some random correlation may almost mean something. – Matthew Whited Aug 16 '16 at 20:05
  • 1
    As another note, if you are worried about your competitors knowing how fast you grow then you have a bigger issue of not worrying about your own growth. "we would postulate" is another name for "we have no idea but for giggles we will make something up for marketing reasons" – Matthew Whited Aug 16 '16 at 20:11
  • Regarding the German Tank Problem, I generally use an additional guid for resources where I don't want the ID visible. – GrandmasterB Aug 16 '16 at 20:53
  • 1
    Guids, anyone? lol – Mark Rogers Aug 16 '16 at 21:04
  • 1
    @MatthewWhited It's not just about competitors. It also could be a privacy issue or something else. I'm not saying it often is an important issue, you just should give it a quick thought and then decide that it doesn't matter in your threat scenarios. – Bergi Aug 16 '16 at 23:29
  • 1
    You may want to check my question and the answers on this subject at http://dba.stackexchange.com/questions/50708/do-natural-keys-provide-higher-or-lower-performance-in-sql-server-than-surrogate – Hannah Vernon Aug 17 '16 at 14:10
  • 24
    "Is it good practice to **always**..." No. – brian_o Aug 17 '16 at 16:57
  • 1
    If the table consists only of relations to other entities, there is no need for an autoincrement value. As soon as you add a mutable (primitive) element your record should contain a unique key. Primitive is in Brackets because a primitive can become a relation if you expect records to be indistinguishable from each other if they define the same relations AND primitive values. But that is an issue of design. If you introduce ORM then you might face the problem that even relations need a unique autoincrement id. But that is an configuration and design issue again. – oopexpert Aug 17 '16 at 18:25
  • One advantage to auto-increment Ids that has not been mentioned: provided the Id values are always non-negative, the items can easily be indexed using an array structure. Otherwise, one must use a hashtable or other less efficient structure. – Mr Anderson Aug 19 '16 at 19:55
  • Cross tables: Why would you not simply use a concatenated key? What would the advantage be for an auto inc. int. thence? – The Nate Aug 21 '16 at 04:20
  • See also [Should every table have a single-field surrogate/artificial primary key?](http://dba.stackexchange.com/q/6108) – Martin Schröder Aug 28 '16 at 22:02

17 Answers17

153

It's never a bad idea to have a guaranteed unique row identifier. I guess I shouldn't say never – but let's go with the overwhelming majority of the time it's a good idea.

Theoretical potential downsides include an extra index to maintain and extra storage space used. That's never been enough of a reason to me to not use one.

alexwlchan
  • 107
  • 5
GrandmasterB
  • 37,990
  • 7
  • 78
  • 131
  • 1
    Is it bad practice to just label it "id" as I am currently doing? I do not exaggerate when I say I include this field in every single table. Every `CREATE TABLE` statement leads with column `id INT NOT NULL AUTO_INCREMENT PRIMARY KEY` – AJJ Aug 15 '16 at 20:48
  • 12
    Thats what I do. Most people either use 'id' or 'tablename_id' (such as user_id). The argument isn't typically if the column is needed, but which way to name it. – GrandmasterB Aug 15 '16 at 20:50
  • 113
    Personally I think the table name should imply the rest. `TableName.id` as opposed to `TableName.TableName_id`, because what else is that `id` going to refer to? If I have another id field in the table then I will prefix it with a table name if it's referring to some other table – AJJ Aug 15 '16 at 20:51
  • 1
    I agree, thats why I use 'id'. But other people prefer table_id. Whats most important is that you are consistent within your schema. So don't use 'id' on some tables, and 'table_id' on others. – GrandmasterB Aug 15 '16 at 20:52
  • 10
    @ArukaJ you mentioned you are using SQLite. That's actually a bit of a special case, as it always makes such a column 'under the hood'. So you are not even using any extra space because you get one whether you want it or not. Also, SQLite's rowid is always a 64bit integer. If my understanding of it is correct, if you define an auto-incrementing row, it will be an alias to the internal rowid. So you might was well always do it! See https://www.sqlite.org/autoinc.html – GrandmasterB Aug 15 '16 at 21:28
  • 9
    The one exception I can think of is if you have a unique identifier that is generated some other way, in which case that should be the primary key and an auto-incrementing id is redundant. – HamHamJ Aug 15 '16 at 21:39
  • @GrandmasterB Doesn't the ID max out at 2147483647 and never go negative? – AJJ Aug 15 '16 at 22:10
  • 4
    @GrandmasterB: The current version of SQLite allows creating `WITHOUT ROWID` tables (with an explicit `PRIMARY KEY`) as an optimization. But otherwise, an `INTEGER PRIMARY KEY` column is an alias for the rowid. – dan04 Aug 15 '16 at 23:29
  • 2
    @ArukaJ no, its 64bit. Look up the max value for a 64bit integer and you'll see you'll never have a problem :-) – GrandmasterB Aug 16 '16 at 06:25
  • @dan04 that's interesting, I didn't know they changed that. – GrandmasterB Aug 16 '16 at 06:25
  • 1
    As for an example where you don't need to add an AI id: A linking table, e.g. a table with just `user_id` and `group_id` when a user can have multiple groups. An ID here would add nothing usefull. – Martijn Aug 16 '16 at 12:54
  • @ArukaJ I agree with your theory in your third comment but be aware that if you are using Oracle 10g or 11g (confirmed; not sure about 12c) then your are limited to 30 chars for table names and 30 chars for column names. So as long as you keep your table name to 27 chars or less then you can freely use the foreign key column naming convention of `foreign_tablename_id` – MonkeyZeus Aug 16 '16 at 13:10
  • 2
    @ArukaJ the argument I've heard to use `table_id` even in `table` is for joins -- `ON alias.table_id = alias2.table_id` is more obviously correct than `ON alias.id = alias2.table_id`. It gets even harder to check the second case when you have Oracle joins which appear way down in the `WHERE` clause. – Dan Lyons Aug 16 '16 at 17:45
  • @DanLyons Plus you can use `USING (table_id)` which is simpler and produces only one column in the resulting table. – Doval Aug 16 '16 at 19:08
  • 3
    `That's never been enough of a reason to me to not use one.` for **you** but not in general. This answer is so extremely biased that I don't find it helpful. It's a personal and extreme opinion. – usr Aug 16 '16 at 20:10
  • 2
    @ArukaJ please consider accepting a different answer. This one is the one with the *least* information. Do not just accept the answer that is highest voted. Choose the most useful one. – usr Aug 16 '16 at 20:12
  • 3
    As an argument for using `table_id` to name autoincrement fields: It reduces the amount of disambiguation you need to when writing queries across multiple tables, and it lets you do convenient natural joins in 1:N relations ships. E.g, `select * from user JOIN address USING (user_id)` vs `select user.*, address.id AS address_id FROM user JOIN address ON (user.id = address.user_id)` – Sam Dufel Aug 16 '16 at 20:54
  • 4
    I heavily agree to your first sentence. Nevertheless **this is not an answer to the question**, which is about **autoincrement integer keys**. And having some good and some bad experience with the latter, I would not agree that it is always a good idea to have an autoincrement integer PK. – Doc Brown Aug 17 '16 at 08:00
  • @ArukaJ "because what else is that id going to refer to?" - One of our devs had a wtf moment when we added Indonesian and the two-letter language code was already taken – Izkata Aug 17 '16 at 18:51
  • 1
    Possible exception - When you're using an ORM tool to design your datamodel, and inheritance is involved. Each _top-level_ entity should have an auto-incrementing key in its corresponding database table. With entity _subclasses_, however, using an auto-incrementing key in the corresponding database table is completely superfluous. You have to store the id of the associated top-level entity regardless, so in that case it makes more sense to just use that id as your primary key in each subclass table. – aroth Aug 18 '16 at 02:13
  • How do yo handle BEGIN TRANSACTION; INSERT INTO AutoTable VALUES ('Value1', 'Value2', 'Value3); COMMIT; -> network error? Did it commit or did it not commit? – Joshua Aug 19 '16 at 15:17
112

TL;DR: Use UUID's instead of auto-increment, if you don't already have a unique way of identifying each row.

I disagree with all the answers before. There are many reasons why it is a bad idea to add an auto increment field in all tables.

If you have a table where there are no obvious keys, an auto-increment field seems like a good idea. After all, you don't want to select * from blog where body = '[10000 character string]'. You'd rather select * from blog where id = 42. I'd argue that in most of these cases, what you really want is a unique identifier; not a sequential unique identifier. You probably want to use a universally unique identifier instead.

There are functions in most databases to generate random unique identifiers (uuid in mysql, postgres. newid in mssql). These allow you to generate data into multiple databases, on different machines, at any time, with no network connection between them, and still merge data with zero conflicts. This allows you to more easily setup multiple servers and even data centers, like for example, with microservices.

This also avoids attackers guessing url's to pages they shouldn't have access to. If there's a https://example.com/user/1263 there's probably a https://example.com/user/1262 as well. This could allow automation of a security exploit in the user profile page.

There are also a lot of cases where a uuid column is useless or even harmful. Let's say you have a social network. There is a users table and a friends table. The friends table contains two userid columns and an auto-increment field. You want 3 to be friends with 5, so you insert 3,5 into the database. The database adds an auto-increment id and stores 1,3,5. Somehow, user 3 clicks the "add friend"-button again. You insert 3,5 into the database again, the database adds an auto-increment id and inserts 2,3,5. But now 3 and 5 are friends with each other twice! That's a waste of space, and if you think about it, so is the auto-increment column. All you need to see if a and b are friends is to select for the row with those two values. They are, together, a unique row identifier. (You would probably want to do write some logic to make sure 3,5 and 5,3 are deduplicated.)

There are still cases where sequential id's can be useful, like when building an url-shortener, but mostly (and even with the url shortener) a randomly generated unique id is what you really want to use instead.

Filip Haglund
  • 2,823
  • 3
  • 16
  • 20
  • 33
    The problem with UUIDs is that they take up too much space for the majority of tables. Use the right unique identifier for each table. – Stephen Aug 16 '16 at 06:37
  • 55
    The entire paragraph about uniqueness is moot - uniqueness can be enforced, with or without a primary key. Besides, UUIDs are better on the theoretical side, but awful to use when debugging/performing DBA tasks or otherwise doing anything that is not "resisting to attacks". –  Aug 16 '16 at 07:28
  • 12
    Another scenario when UUIDs are better: implementing an idempotent PUT operation, so that you can safely retry requests without introducing duplicate rows. – yurez Aug 16 '16 at 08:21
  • 23
    On the "URL guessing" point, having a unique ID (sequential or otherwise) does not imply exposing that ID to users of the application. – Dave Sherohman Aug 16 '16 at 12:16
  • 5
    @Stephen: If I remember correctly, the other problem with UUIDs as a PK is that database engines have a lot of trouble making an index on a UUID field. Is this right? (ETA: On reading the next answer, I might have meant "clustered index" instead of just "index.") – Codes with Hammer Aug 16 '16 at 13:45
  • 1
    @CodesWithHammer: Yes Clustered index on UUIDs are ranked on the UUIDs, thus your rows has a non business-logical order in memory. Thus you can have bad performance because you don't benefit of the memory caching of your PC. When using id, you may benefit the fact the rows are sorted by creation date. Imagine a select between two dates as a usecase. – Olivier de Rivoyre Aug 16 '16 at 16:29
  • 2
    One advantage with UUIDs is that they are very very unique. This means that any typo where you join CustomerID=ProjectID will return an empty result instead. This is easier to catch when debugging. – LosManos Aug 16 '16 at 18:49
  • 3
    I can't suggest using UUIDs for everything. Yes, a session ID should not be an autoincrement, but for almost everything else, *if* you have a dedicated ID field created internally to the DB or the application, having it actually be sequential is quite useful. It's kind of a SCN ("system change number"), it helps debugging or generally gives you (as a developer/dba) useful information about which entities were likely created close together. – AnoE Aug 16 '16 at 20:29
  • 3
    It's funny that you use "waste of space" as an example (although you could create exactly the same problem with UUIDs obviously, so rather confusing) and then neglect to mention the fact that UUIDs are an order of magnitude larger than autoincremented values. Yes there are some very, very rare situations where you don't want the user to guess your IDs, *but* an UUID is a **horrible** choice there as well! UUIDs are trivial to guess and the only version that isn't necessarily (4) does not require a cryptographically strong PRNG so you can't rely on that one either. – Voo Aug 18 '16 at 19:37
  • UUIDs are a good solution for sharded databases though, although there exist other systems that can just as well and with other useful properties (such as better performance, or being able to sort by time without needing to fetch anything but the key). Hell, Flickr uses DB ticket servers with auto increment and that works on their scale just as well. – Voo Aug 18 '16 at 19:47
  • 1
    Blindly choosing UUID over an incrementing key is not something I'd advise. Sure, UUIDs have lots of advantages, but they have some drawbacks, too. (Some databases index them poorly, for instance.) Examine your use case. Weight the benefits and drawbacks. Then decide. – jpmc26 Aug 19 '16 at 17:39
  • 11
    Purely from a database point of view, this answer is completely wrong. Using UUIDs instead of auto-incrementing integers grows indexes way too fast, and affects performance and memory consumption adversely. If you are talking from the point of view of web service or web app, there should be a layer in between the database and the front end anyway. Anything else is bad design. Using data as a primary key is even worse. Primary keys should be use only on the data layer, nowhere else. – Drunken Code Monkey Aug 19 '16 at 22:31
  • There is also a huge problem with UUIDs in the case of auditing. In many enterprise databases, deleting rows is a big no-no. Auditors often use the primary key sequence as a starting point to investigate missing or tampered with data. If you need a globally unique key, stick it in an global entities index table. – Drunken Code Monkey Aug 19 '16 at 22:37
  • 2
    @DrunkenCodeMonkey: in most DBMS, gaps in the primary key id does not necessarily indicate deleted rows. In Postgres and Oracle, for example, getting a number from sequences are non-transactional; if you insert a row in a transaction then rollback the transactions, your pk ids will have gaps. Sequences can also be pre-allocated by the database's child processes to speed up ID generation, which can cause gaps when the database server is restarted. MySQL/Maria also does something similar, even though they don't have explicit sequence object. – Lie Ryan Aug 21 '16 at 14:26
  • As much as i abhor auto incrementing keys, because it's the sign of a lazy designer, using UUIDs is just as bad. Just because it's harder to "guess". This is just another demonstration of security through obscurity. Which is ALWAYS a bad idea. If you have a URL problem you should be securing access to the data requested, or limiting the data. The question you need to ask is that does knowing someone's user id bad? In short, no. Knowing the id does nothing other than what you let them do with it. – Rahly Oct 01 '18 at 22:21
  • There is a good point but one should consider that using UUID in unique indexes in PostgreSQL might lead to write amplification due to the fact that's it has a random nature. See this article for the details: https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/ – pensnarik Feb 04 '20 at 10:04
64

Autoincemental keys have mostly advantages.

But some possible drawbacks could be:

  • If you have a business key, you have to add a unique index on that column(s) too in order to enforce business rules.
  • When transfering data between two databases, especially when the data is in more than one table (i.e. master/detail), it's not straight-forward since sequences are not synced between databases, and you'll have to create an equivalence table first using the business key as a match to know which ID from the origin database corresponds with which ID in the target database. That shouldn't be a problem when transfering data from/to isolated tables, though.
  • Many enterprises have ad-hoc, graphical, point-and-click, drag-and-drop reporting tools. Since autoincremental IDs are meaningless, this type of users will find it hard to make sense of the data outside "the app".
  • If you accidentally modify the business key, chances are you will never recover that row because you no longer have something for humans to identify it. That caused a fault in the BitCoin platform once.
  • Some designers add an ID to a join table between two tables, when the PK should simply be composed of the two foreign IDs. Obviously if the join table is between three or more tables, then an autoincremental ID makes sense, but then you have to add an unique key when it applies on the combination of FKs to enforce business rules.

Here's a Wikipedia article section on the disadvantages of surrogate keys.

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154
  • 14
    Blaming the mt.gox flaw on surrogate keys seems rather dubious. The problem was that they included *all* fields in their compound key, even mutable/malleable fields. – CodesInChaos Aug 16 '16 at 08:51
  • 7
    A "social" disadvantage of using auto-increment keys is that sometimes "the business" assumes that there must never be any gaps and demands to know what happened to the missing rows that occur when a failed insert happens (transaction rollback). – Rick Ryker Aug 17 '16 at 14:51
  • @RickRyker That's right. Also **Inadvertent disclosure if a seq key is leaked:** *"By subtracting a previously generated sequential key from a recently generated sequential key, one could learn the number of rows inserted during that time period. This could expose, for example, the number of transactions or new accounts per period."*. – Tulains Córdova Aug 17 '16 at 15:27
  • 5
    Another disadvantage is that if the system grows so large that you have to shard the database, you can no longer use autoincrement to produce a globally unique key. When you get to that point, you may have lots of code relying on that assumption. There are other ways to produce a unique identifier that will keep working if the database is sharded. – kasperd Aug 17 '16 at 20:37
  • @kasperd Actually you can. Check out DB ticket servers for a sample. Flickr is using those successfully on a pretty large scale. Actually has quite some advantages over GUIDs (size, performance, allows sorting by time without having to fetch anything else) and downsides (single point of failure in the simple implementation - you can fix *that* by having multiple dbs that increment by a fixed value != 1, but then you lose sortability). – Voo Aug 18 '16 at 20:11
  • 1
    @Voo It is not guaranteed that your chosen database supports that. And trying to implement it a higher layer than the database itself means you lose some of the guarantees SQL would give you. Finally any centralized assignment of IDs will increase latency if you have a distributed system. – kasperd Aug 20 '16 at 00:11
  • 1
    @Voo Of course regardless of the scale of the system one shouldn't make too many assumptions about the nature of autoincremented IDs. If you have just a single database they are assigned in order, but there is no guarantee they are committed in order. And there can be gap in the sequence because not all transactions are committed. – kasperd Aug 20 '16 at 00:22
  • @Kasperd "has advantages and disadvantages" is rather different to "cannot be used". Nobody said that it's the perfect system, but it's certainly possible and done on very large scale. – Voo Aug 20 '16 at 22:45
21

Just to be contrary, No, you do NOT need to always have a numeric AutoInc PK.

If you analyse your data carefully you often identify natural keys in the data. This is often the case when the data has intrinsic meaning to the business. Sometimes the PKs are artefacts from ancient systems that the business users utilize as a second language to describe attributes of their system. I've seen vehicle VIN numbers used as the primary key of a "Vehicle" table in a fleet management system for example.

However it originated, IF you already have a unique identifier, use it. Don't create a second, meaningless primary key; it's wasteful and may cause errors.

Sometimes you can use an AutoInc PK to generate a customer meaningful value e.g. Policy Numbers. Setting the start value to something sensible and applying business rules about leading zeros etc. This is probably a "best of both worlds" approach.

When you have small numbers of values that are relatively static, use values that make sense to the system user. Why use 1,2,3 when you could use L,C,H where L,H and C represent Life, Car and Home in an insurance "Policy Type" context, or, returning to the VIN example, how about using "TO" for Toyota? All Toyata cars have a VIN that starts "TO" It's one less thing for users to remember, makes it less likely for them to introduce programming and user errors and may even be a usable surrogate for a full description in management reports making the reports simpler to write and maybe quicker to generate.

A further development of this is probably "a bridge too far" and I don't generally recommend it but I'm including it for completeness and you may find a good use for it. That is, use the Description as the Primary Key. For rapidly changing data this is an abomination. For very static data that is reported on All The Time, maybe not. Just mentioning it so it's sitting there as a possibility.

I DO use AutoInc PKs, I just engage my brain and look for better alternatives first. The art of database design is making something meaningful that can be queried quickly. Having too many joins hinders this.

EDIT One other crucial case where you do not need an Autogenerated PK is the case of tables that represent the intersection of two other tables. To stick with the Car analogy, A Car has 0..n Accessorys, Each Accessory can be found on many cars. So to represent this You create a Car_Accessory table containing the PKs from Car and Accessory and other relevant information about the link Dates etc.

What you don't (usually) need is an AutoInc PK on this table - it will only be accessed via the car "tell me what accessories are on this car" or from the Accessory "tell em what cars have this accessory"

mcottle
  • 6,122
  • 2
  • 25
  • 27
  • 4
    > All Toyata cars have a VIN that starts "TO" That just isn't true. They start with "JT" if made in Japan. American-built Toyotas have completely different VINs https://en.wikibooks.org/wiki/Vehicle_Identification_Numbers_(VIN_codes)/Toyota/VIN_Codes – Monty Harder Aug 16 '16 at 15:37
  • Thank god there's a pragmatist answering this question. I thought I was going to have to write this answer. ++ – RubberDuck Aug 16 '16 at 16:32
  • 20
    `Don't create a second, meaningless primary key; it's wasteful and may cause errors.` However, if the way you establish uniqueness for a record is a combination of 6 columns then joining on all 6 all the time is very error prone unto itself. The data naturally do have a PK but you are better off using a `id` column and a unique constraint on those 6 columns. – Brad Aug 16 '16 at 17:51
  • 14
    I admit some of these suggestions take it a bit far for me. Yes, being pragmatic is fine, but I cannot count how often someone *swore* the life of his firstborn that some attribute out of the domain will stay unique for the rest of days. Well, usually that worked well until the second week after going live, when the first duplicates turned up. ;) Using a "description" as a PK is just far out. – AnoE Aug 16 '16 at 20:27
  • 2
    @Monty, my bad, you're right. Fallible memory, it's 20 years since I architected the fleet management systems. No the VIN wasn't the primary key :) I used an AutoInc Asset_ID IIRC which leads to something I forgot. Tables that are the linkers for many-to-many relationships where you link, say, car to accessory (e.g. sunroof) Many cars have many accessories so you need a "Car_Accessory" table which contains Car_ID and Accessory_ID but absolutely does NOT need Car_Accesory_ID as an AutoInc PK. – mcottle Aug 17 '16 at 02:22
  • 3
    @Brad, Nowhere in my answer was I advocating using a six element composite key. If you have one, it may be a surrogate Primary key (and should be enforced) but you don't whack 6 fields in every table that links to it. My message is about pragmatism and not blindly doing something because an ivory tower academic writes about something theoretical. – mcottle Aug 17 '16 at 02:27
  • 1
    You make a good point, but real world examples show that "natural keys" are fewer that we think... We got a network of offices and each office has an office code? And we are sure that management will not change that overnight? Sounds like a "natural key", until there is a merger with other company and some offices have to change their numbers to avoid collision. – SJuan76 Aug 17 '16 at 07:58
  • @SJuan If the company merges you have a remapping exercise whether you have two offices with the Key "NYC" or two offices with the key "1". If you create a second NYC office you create "NYC2" but it's more about retaining backwards compatibility with existing systems and reducing the training load on end users who are used to operating the gnarly old mainframe system from the '70s that uses "meaningful" TLAs as Keys... – mcottle Aug 17 '16 at 08:19
  • 7
    It's truly amazing how few TRULY immutable "natural keys" there are. SSN's? Nope, they can change. It's rare, but it can happen. Usernames? Nope. Eventually someone will have a valid business reason to change. VIN is often a textbook example, but there aren't many others. Even home addresses can change, given street naming changes. – Erik Funkenbusch Aug 17 '16 at 21:08
  • @Brad: Another way to do IDs for six columns and will create the same ID on separate databases is to use a hash of the row data as the ID entry. This works really well on fairly static data, less well if data fields keep changing all the time. It also saves on indexing since there's no need to maintain an index to enforce uniqueness. – Zan Lynx Aug 17 '16 at 23:47
  • 2
    This is a bit of a nit, but in the interest of good writing, does the answer really need to start with "just to be contrary"? At best it's useless, at worst it invalidates the legitimacy of the whole answer. – djechlin Aug 18 '16 at 19:19
  • I would say your post is actually one of the best examples of why picking things you think are unique as primary keys is pretty dangerous: Sure something may be unique now, but what about 10 years from now? 20? If that assumption ever becomes invalid (and one of the first samples you thought would be a great case for using a natural key, already turned out to be wrong!) you'll have lots and lots of refactoring to do. Also there's a performance hit from using anything but a single int primary key, although it's pretty negligible in most cases. – Voo Aug 18 '16 at 19:55
  • @Voo, that was my memory failing not data failing - look at my reply to Monty. VINs are an industry standard which has remained unique for decades. And when I wrote a system that used VINs I didn't use the VIN as the PK. I've seen people use licence plates as a PK - that didn't end well. – mcottle Aug 19 '16 at 01:26
12

Many tables already have a natural unique id. Do not add another unique id column (auto-increment or otherwise) onto these tables. Use the natural unique id instead. If you add another unique id, you essentially have a redundancy (duplication or dependency) in your data. This goes against the principles of normalization. One unique id is dependent on the other for accuracy. This means that they have to be kept perfectly in sync at all times in every system that manages these rows. It's just another fragility in your data integrity that you don't really want to have to manage and validate to long term.

Most tables these days don't really need the very minor performance boost that an additional unique id column would give (and sometimes it even detracts from performance). As a general rule in IT, avoid redundancy like the plague! Resist it everywhere it is suggested to you. It's anathema. And take heed of the quote. Everything should be as simple as possible, but not simpler. Don't have two unique ids where one will suffice, even if the natural one seems less tidy.

Bradley Thomas
  • 5,090
  • 6
  • 17
  • 26
  • 3
    Shouldn't you only use "natural" IDs as primary keys if they're absolutely guaranteed to never change? For instance, you shouldn't use a driver's license number as a primary key, because if a person gets a new driver's license, you'll need to update not only that table but any tables with foreign keys referencing it! – ekolis Aug 19 '16 at 19:17
  • 1
    There are several reasons why driver's license number does not qualify as a natural unique id. Firstly some of them are derived from other data, like birth date and name. They are not guaranteed unique across states. And to take your example, when a person is re-issued a license with the same number, but perhaps an extended expiry, what happens then? They have a different license with the same number. A natural id still has to fulfill the basic properties of a primary key. Driver's license number (at least in the USA) has some shortcomings in this regard. – Bradley Thomas Aug 19 '16 at 19:59
  • 1
    OK, I guess I misunderstood the definition of natural ID then; I thought it was merely an ID defined by the business rules, whether or not it's actually guaranteed to be immutable. – ekolis Aug 20 '16 at 13:06
12

On larger systems, ID is consistency booster, do use it almost anywhere. In this context, individual primary keys are NOT recommended, they are expensive at the bottom line (read why).

Every rule has an exception, so you might not need integer autoincrement ID on staging tables used for export/import and on similar one-way tables or temporary tables. You would also prefer GUID's instead of ID's on distributed systems.

Many answers here suggest that existing unique key should be taken. Well even if it has 150 characters? I don't think so.

Now my main point:

It looks that opponents of autoincrement integer ID are speaking about small databases with up to 20 tables. There they can afford individual approach to each table.

BUT once you have an ERP with 400+ tables, having integer autoincrement ID anywhere (except cases mentioned above) just makes great sense. You do not rely on other unique fields even if they are present and secured for uniqueness.

  • You benefit from universal time-saving, effort-saving, easy-to-remember convention.
  • In most cases you JOIN tables, without need of checking what the keys are.
  • You can have universal code routines working with your integer autoincrement column.
  • You can extend your system with new tables or user plugins not foreseen before simply by referring to ID's of existing tables. They are already there from the beginning, no costs to add them additionally.

On larger systems, it can be worth ignoring minor benefits of those individual primary keys and consistently use integer autoincrement ID in most cases. Using existing unique fields as primary keys is maybe saving some bytes per record but additional storage or indexing time pose no issue in today's database engines. Actually you are losing much more money and resources on wasted time of the developers/maintainers. Today's software should be optimized for time and effort of programmers – what approach with consistent ID's fulfills much better.

miroxlav
  • 672
  • 4
  • 17
  • 2
    From personal experience, I wholeheartedly agree with the second half of your answer. You will need globally unique keys much, much less often than you will need fast and compact indexes. If you do need one, create a GlobalEntities table with an autogenerated ID and a UUID column. Then add an ExGlobalEntityId foreign key to the Customers table for example. Or use a hash of some of the values. – Drunken Code Monkey Aug 21 '16 at 07:40
8

It is not good practice to superfluous designs. I.e. - it is not good practice to always have an auto increment int primary key when one is not needed.

Let's see an example where one is not needed.

You have a table for articles–this has an int primary key id, and a varchar column named title.

You also have a table full of article categories–id int primary key, varchar name.

One row in the Articles table has an id of 5, and a title "How to cook goose with butter". You want to link that article with the following rows in your Categories table: "Fowl" (id: 20), "Goose" (id: 12), "Cooking" (id: 2), "Butter" (id: 9).

Now, you have 2 tables: articles and categories. How do you create the relationship between the two?

You could have a table with 3 columns: id (primary key), article_id (foreign key), category_id (foreign key). But now you have something like:

| id | a_id | c_id | 
| 1  |  5   |   20 | 
| 2  |  5   |   12 | 
| 3  |  5   |    2 | 

A better solution is to have a primary key that is made up of 2 columns.

| a_id | c_id | 
|    5 |   20 | 
|    5 |   12 | 
|    5 |    2 | 

This can be accomplished by doing:

create table articles_categories (
  article_id bigint,
  category_id bigint,
  primary key (article_id, category_id)
) engine=InnoDB;

Another reason not to use an auto increment integer is if you are using UUIDs for your primary key.

UUIDs are by their definition unique, which accomplishes the same thing that using unique integers does. They also have their own added benefits (and cons) over integers. For instance, with a UUID, you know that the unique string you're referring to points to a particular data record; this is useful in cases where you do not have 1 central database, or where applications have the ability to create data records offline (then upload them to the database at a later date).

In the end, you need to not think about primary keys as a thing. You need to think of them as the function they perform. Why do you need primary keys? To be able to uniquely identify specific sets of data from a table using a field that will not be changed in the future. Do you need a particular column called id to do this, or can you base this unique identification off of other (immutable) data?

anw
  • 189
  • 1
  • The table that defines a relationship between two tables (normally a many-to-many relationship) is called an association or intersection. – Clint Pachl Feb 29 '20 at 00:05
7

An auto-incremented (identity) primary key is a good idea except to note that it is meaningless outside of the context of the database and immediate clients of that database. For example, if you transfer and store some of the data in another database, then proceed to write different data to both database tables, the id's will diverge - i.e., data with an id of 42 in one database won't necessarily match the data with an id of 42 in the other.

Given this, if it's necessary to still be able to identify rows uniquely outside of the database (and it frequently is), then you must have a different key for this purpose. A carefully selected business key will do, but you'll often end up in a position of a large number of columns required to guarantee uniqueness. Another technique is to have an Id column as an auto-increment clustered primary-key and another uniqueidentifier (guid) column as a non-clustered unique key, for the purposes of uniquely identifying the row wherever it exists in the world. The reason you still have an auto-incremented key in this case is because it's more efficient to cluster and index the auto-incrementing key than it is to do the same to a guid.

One case where you might not want an auto-incrementing key would be a many-to-many table where the primary key is a compound of the Id columns of two other tables (you could still have an auto-incrementing key here, but I don't see the point of it).

One other question is the datatype of the auto-incremented key. Using an Int32 gives you a large, but relatively limited range of values. Personally I frequently use bigint columns for the Id, in order to practically never need to worry about running out of values.

MatthewToday
  • 197
  • 4
7

Or are there scenarios where you don't want to add such a field?

Sure.

First of all, there are databases that have no autoincrements (e.g., Oracle, which certainly is not one of the smallest contenders around). This should be a first indication that not everybody likes or needs them.

More important, think about what the ID actually is - it is a primary key for your data. If you have a table with a different primary key, then you do not need an ID, and should not have one. For example, a table (EMPLOYEE_ID, TEAM_ID) (where each employee can be in several teams concurrently) has a clearly defined primary key consisting of those two IDs. Adding an autoincrement ID column, which is also be a primary key for this table, would make no sense at all. Now you are lugging 2 primary keys around, and the first word in "primary key" should give you a hint that you really should have only one.

AnoE
  • 5,614
  • 1
  • 13
  • 17
  • 9
    (Not an Oracle user so forgive the question but) doesn't Oracle use Sequence in the same way others use Autoincrement/Identity? Is saying that Oracle doesn't have an Autoincrement data type really just a sematic argument? – Brad Aug 16 '16 at 17:53
  • Well, that was just a small point; the main part is that a running ID is not appropriate for every table, thus getting used to just slapping an auto-ID on every single table might not be the wisest. – AnoE Aug 16 '16 at 19:58
  • there are no two primary keys , there is only one primary key and all rest are called candidate keys if they can serve as primary keys too.. – rahul tyagi Sep 25 '16 at 17:11
7

I usually use an "identity" column (auto-incremennting integer) when defining new tables for "long-lived" data (records I expect to insert once and keep around indefinitely even if they end up "logically deleted" by setting a bit field).

There are a few situations I can think of when you don't want to use them, most of which boil down to scenarios where one table on one instance of the DB cannot be the authoritative source for new ID values:

  • When incremental IDs would be too much information for a potential attacker. Use of an identity column for "public-facing" data services makes you vulnerable to the "German Tank Problem"; if record id 10234 exists, it stands to reason that record 10233, 10232, etc exist, back to at least record 10001, and then it's easy to check for record 1001, 101 and 1 to figure out where your identity column started. V4 GUIDs composed of mainly random data break this incremental behavior by design, so that just because one GUID exists, a GUID created by incrementing or decrementing a byte of the GUID does not necessarily exist, making it harder for an attacker to use a service indtended for single-record retrieval as a dump tool. There are other security measures that can better restrict access, but this helps.
  • In M:M cross-reference tables. This one's kind of a gimme but I've seen it done before. If you have a many-to-many relationship between two tables in your database, the go-to solution is a cross-reference table containing foreign key columns referencing each table's PK. This table's PK should virtually always be a compound key of the two foreign keys, to get the built-in index behavior and to ensure uniqueness of the references.
  • When you plan on inserting and deleting in bulk on this table a lot. Probably the biggest disadvantage to identity columns is the extra hoopla you have to go through when doing an insert of rows from another table or query, where you want to maintain the original table's key values. You have to turn "identity insert" on (however that's done in your DBMS), then manually make sure the keys you're inserting are unique, and then when you're done with the import you have to set the identity counter in the table's metadata to the maximum value present. If this operation happens a lot on this table, consider a different PK scheme.
  • For distributed tables. Identity columns work great for single-instance databases, failover pairs, and other scenarios where one database instance is the sole authority on the entire data schema at any given time. However, there's only so big you can go and still have one computer be fast enough. Replication or transaction log shipping can get you additional read-only copies, but there's a limit to that solution's scale as well. Sooner or later you'll need two or more server instances handling inserts of data and then synchronizing with each other. When that situation comes, you'll want a GUID field instead of an incremental one, because most DBMSes come pre-configured to use a portion of the GUIDs they generate as an instance-specific identifier, then generate the rest of the identifier either randomly or incrementally. In either case, the odds of a collision between two GUID generators are nil, while an identity integer column is a nightmare to manage in this situation (you can go even/odd by offsetting seeds and setting the increment to 2, but if one server sees more activity than the other you're wasting IDs).
  • When you have to enforce uniqueness across multiple tables in the DB. It's common in accounting systems, for instance, to manage the General Ledger (with a row for each credit or debit of every account that has ever occurred, so it gets very big very quickly) as a sequence of tables each representing one calendar month/year. Views can then be created to hook them together for reporting. Logically, this is all one very big table, but chopping it up makes the DB's maintenance jobs easier. However, it presents the problem of how to manage inserts into multiple tables (allowing you to begin logging transactions in the next month while still closing out the last) without ending up with duplicate keys. Again, GUIDs instead of identity integer columns are the go-to solution, as the DBMS is designed to generate these in a truly unique way, so that a single GUID value will be seen once and only once in the entire DBMS.

There are workarounds that allow use of identity columns in these situations, as I've hopefully mentioned, but in most of these, upgrading from the identity integer column to a GUID is simpler and solves the problem more completely.

KeithS
  • 21,994
  • 6
  • 52
  • 79
  • 1
    There are cases you can still need ID in M:N tables (using columns `ID, ID_M, ID_N`) because of attaching properties to instances of your M:N relation. – miroxlav Aug 17 '16 at 07:47
  • V4 GUIDS aren't guaranteed to use a cryptographically strong PNRG so you really shouldn't rely on it for your first example imo (although if your db engine makes stronger promises you might be fine, but that's rather non portable). Otherwise a well reasoned post. – Voo Aug 19 '16 at 06:13
  • 1
    @miroxlav - I would assert that if a table has enough additional metadata regarding the relationship that a separate PK outside the two FKs is a good idea, it's not really a cross-reference table anymore; it's its own entity that happens to reference the two others. – KeithS Aug 19 '16 at 15:10
  • @Voo - You're right, V4 GUIDs aren't guaranteed to be cryptographically random, just unique (like all GUIDs are). However, the tail numbers of U.S. jet fighters aren't generated from cryptographically random seed data/algorithms either. What you're really looking for is a sparsely-populated domain; a V4 GUID has 112 bytes of random data, capable of uniquely identifying 5e33 records. – KeithS Aug 19 '16 at 15:19
  • To put that number in perspective, every man, woman and child on the planet (all 7 billion) could have 741 *trillion* individually catalogued and IDed data points in our DB, and we'd still only be using one GUID value per *billion* available. Big Data, as a global industry, isn't even close to this scale of knowledge. Even given a pattern to the GUID generation, there are other sources of entropy involved, like the order in which data enters the system and is assigned a GUID. – KeithS Aug 19 '16 at 15:28
  • @Keiths It doesn't matter how large the search space is if you can relatively reliably predict the next results of the algorithm after having seen a few results . And that is definitely possible and there have been successful attacks on weak PRNGs in the past in many different cases – Voo Aug 19 '16 at 17:51
5

As other people have made the case for an incrementing primary key I will make one for a GUID:

  • It is guaranteed to be unique
  • You can have one less trip to the database for data in your application. (For a types table for instance you can store the GUID in the application and use that to retrieve the record. If you use an identity you need to query the database by name and I have seen many an application that does this to get the PK and later queries it again to get the full details).
  • It is useful for hiding data. www.domain.com/Article/2 Lets me know you only have two articles whereas www.domain.com/article/b08a91c5-67fc-449f-8a50-ffdf2403444a tells me nothing.
  • You can merge records from different databases easily.
  • MSFT uses GUIDS for identity.

Edit: Duplicate Point

  • 5
    -1. A GUID / UUID is not guaranteed to be unique, and is not 100% unique. A GUID is still a finite-length, so at some point you can risk getting a duplicate, although it is highly unlikely. Your point about less trips to the database is also invalid - why can't you store the primary id in the application, as you can with the GUID key? – Niklas H Aug 16 '16 at 11:51
  • 2
    Jeff Atwood says it much better than I ever could. https://blog.codinghorror.com/primary-keys-ids-versus-guids/ – Three Value Logic Aug 16 '16 at 13:42
  • As for why can't you store the primary id in your application? Because the database creates it. If you run your seeds on an empty database you can assume that the ID will be 1. What if you run the same script on a database with data in it? The ID won't be 1. – Three Value Logic Aug 16 '16 at 13:51
  • You did not say anything about creating IDs in the application - you just wrote "storing". But if it is necessary to create the ID outside the database, then yes, a GUID could be the answer. – Niklas H Aug 16 '16 at 13:54
  • cant believe the number of people who are stuck in the dark ages using auto inc ints – Ewan Aug 16 '16 at 14:02
  • I said store because that is what I meant. They are created in the database (in this scenario). However there are multiple databases (think test, staging, live). If you generate an incremental ID in the test database how do you know that the live database will generate the same key if it has data in it? There are lots of good reasons to pick an identity. I just wanted to throw out a few for guids based on my experiences. – Three Value Logic Aug 16 '16 at 14:03
  • 2
    I would add they scale better. Big data NoSQL databases like Cassandra don't even support auto-increment keys. – Karl Bielefeldt Aug 16 '16 at 14:19
  • @ThreeValueLogic your hyper link is broken, and is cutting off part of the url – Halfwarr Aug 16 '16 at 18:04
  • Why are you exposing your URL with a PK anyway? If you want a useful link then do something like `www.domain.com/article/title-slug`. it say nothing about how much I have in my database and it makes sense for humans trying to remember your blog post. – Matthew Whited Aug 16 '16 at 18:58
  • @MatthewWhited Performance mainly. SQL is a little bit quicker at fetching by an ID or GUID than it is by a title. I actually went with a hybrid system based off of Stack overflow so www.domain.com/arti‌​cle/id/title-slug which seems to work well. For a smaller system I just use the title-slug as you say. – Three Value Logic Aug 17 '16 at 07:43
  • I have no issue with the id. I was really lost on why so many people get their panties in a bunch over it. There is nothing useful with knowing the set of possible PKs. – Matthew Whited Aug 17 '16 at 13:43
  • `/article/title-slug` is guaranteed to be unique -- if it wasn't, your site wouldn't know what to serve. Certainly, it has the possibility of changing over time. People misspell slugs. But you couldn't have a more natural key for articles published. – Ross Presser Aug 19 '16 at 10:02
2

As a principle of good design, every table should have a reliable way to uniquely identify a row. Although that is what a primary key is for, it doesn't always require the existence of a primary key. Adding a primary key to every table is not a bad practice since it provides for unique row identification, but it may be unnecessary.

To maintain reliable relationships between the rows of two or more tables, you need to do it via foreign keys, hence the need for primary keys in at least some tables. Adding a primary key to every table makes it easier to extend your database design when it comes time to add new tables or relationships to existing data. Planning ahead is always a good thing.

As a basic principle (hard rule perhaps), the value of a primary key should never change throughout the life of its row. It's wise to assume that any business data in a row is subject to change over its lifetime, so any business data will be a poor candidate for a primary key. This is why something abstract like an auto-incremented integer is often a good idea. However, auto-incremented integers do have their limitations.

If your data will only have a life within your database, auto-incremented integers are fine. But, as has been mentioned in other answers, if you ever want your data to be shared, synchronized, or otherwise have a life outside your database, auto-incremented integers make poor primary keys. A better choice will be a guid (aka uuid "universally unique id").

Zenilogix
  • 309
  • 1
  • 3
2

The question, and many of the answers, miss the important point that all the natural keys for each table reside solely in the logical schema for the database, and all the surrogate keys for each table reside solely in the physical schema for the database. other answers discuss solely the relative benefits of integer versus GUID surrogate keys, without discussing the reasons why surrogate keys are properly used, and when.

BTW: Let us avoid use of the ill defined and imprecise term primary key. It is an artifact of pre-relational data models that was first co-opted (unwisely) into the relational model, and then co-opted back into the physical domain by various RDBMS vendors. Its use serves only to confuse the semantics.

Note from the relational model that, in order for the database logical schema to be in first normal form, every table must have a user-visible set of fields, known as a natural key, that uniquely identifies each row of the table. In most cases such a natural key is readily identified, but on occasion one must be constructed, whether as a tie breaker field or otherwise. However such a constructed key is always still user visible, and thus always resides in the logical schema of the database.

By contrast any surrogate key on a table resides purely in the physical schema for the database (and thus must always, both for security reasons and for maintenance of database integrity, be entirely invisible to database users). The sole reason for introducing a surrogate key is to address performance issues in the physical maintenance and use of the DB; whether those be joins, replication, multiple hardware sources for data, or other.

Since the sole reason for the introduction of a surrogate key is performance, let us presume that we wish it to be performant. If the performance issue at hand is joins, then we necessarily wish to make our surrogate key as narrow as can be (without getting in the way of the hardware, so short integers and bytes are usually out). Join performance relies on minimal index height, so a 4-byte integer is a natural solution. If your performance issue is insertion rate a 4-byte integer may also be a natural solution (depending on your RDBMS's internals). If your performance issue for a table is replication or multiple data sources than some other surrogate key technology, be it a GUID or a two-part key (Host ID + integer) may be more suitable. I am not personally a favourite of GUIDs but they are convenient.

To sum up, not all tables will require a surrogate key (of any type); they should only be used when deemed necessary for the performance of the table under consideration. Regardless of which common surrogate key technology you prefer, think carefully about the actual needs of the table before making a choice; changing the surrogate key technology choice for a table will be exhausting work. Document the key performance metric for your table so that your successors will understand the choices made.

Special Cases

  1. If your business requirements mandate a sequential numbering of transactions for audit (or other) purposes than that field is not a surrogate key; it is a natural key (with extra requirements). From the documentation an auto-incrementing integer only generates surrogate keys, so find another mechanism to generate it. Obviously some sort of monitor will be necessary, and if you are sourcing your transactions from multiple sites then one site will be special, by virtue of being the designated host site for the monitor.

  2. If your table will never be more than about a hundred rows then index height is irrelevant; every access will be by a table scan. However string comparisons on long strings will still be much more expensive than comparison of a 4-byte integer, and more expensive than comparison of a GUID.

  3. A table of code values keyed by a char(4) code field should be as performant as one with a 4-byte integer. Although I have no proof of this I use the assumption frequently and have never had reason to rue it.

-1

Not only it's not good practice, in fact it's described as an anti-pattern in Bill Karwin's SQL Antipatterns book.

Not every table needs a pseudokey -- a primary key with an arbitrary value, not something that has semantic value for the model --, and there's no reason to always call it id.

Pedro Werneck
  • 238
  • 1
  • 5
-2

This is pretty universal - otherwise you would need to validate that the key is actually unique. This would be done by looking at all of the other keys... which would be time consuming. Having an incremental key gets expensive as your record number nears the key overflow value.

I usually make the pointers more obvious field names like ref_{table} or similar idea.

If its not necessary to externally point to a record then you don't need an id.

Johnny V
  • 113
  • 4
  • Key rollover value? – AJJ Aug 15 '16 at 20:51
  • An unsigned integer has a max value of 4294967295 before adding 1 will roll it over to 0. Remember if you add a record then delete it, the counter is still increased. Make sure you use `unsigned int` for the field type otherwise the limit is half of that number. – Johnny V Aug 15 '16 at 20:53
  • Integer Overflow - https://en.wikipedia.org/wiki/Integer_overflow – Johnny V Aug 15 '16 at 20:54
  • People have tables with more than 4294967295 rows? – AJJ Aug 15 '16 at 20:56
  • 2
    If you add/remove lots of rows the auto increment counter will overflow eventually. – Johnny V Aug 15 '16 at 20:56
  • 1
    How do people handle rollover? What if there are records with a low ID that never get deleted, but you're starting to near the end where some IDs are at the upper end of 4294967295? Can a "re-indexing" be done? – AJJ Aug 15 '16 at 20:57
  • @ArukaJ MySQL supports 64bit integers as id's if you think you'll need them. I assume other DBMS's support it as well. – GrandmasterB Aug 15 '16 at 21:02
  • GrandmasterB is correct, if you expect lots of add/remove of rows then it might be a good idea to use `unsigned long` or whatever equivalent there is in MySQL. Despite this, Integer Overflow is always something you have to be aware of. – Johnny V Aug 15 '16 at 21:05
  • I use SQLite in Android if that matters – AJJ Aug 15 '16 at 21:12
  • I found this - http://www.tutorialspoint.com/sqlite/sqlite_data_types.htm - It appears like you only get INTEGER which would have value range of –2147483648 to 2147483647 – Johnny V Aug 15 '16 at 21:16
  • @JohnnyV: In SQLite, INTEGER is a signed 64-bit type. They're stored on disk in a variable-length format. – dan04 Aug 15 '16 at 23:33
  • @dan04 would have been awesome if the documentation of the types said that. Regardless, good to know. – Johnny V Aug 15 '16 at 23:36
  • @ArukaJ not tables of so many rows, but tables with large gaps. Remembers, while the ids may be incremental, they can also contain (large) gaps, for instance in a situation with multiple synchronized databases. – Pieter B Aug 16 '16 at 09:10
-2

I wouldn't say it should always be done. I've got a table here with no unique key--and it doesn't need one. It's an audit log. There will never be an update, queries will return all changes to what is being logged but that is the best that can reasonably be done it takes a human to define a wrongful change. (If the code could it would have disallowed it in the first place!)

Loren Pechtel
  • 3,371
  • 24
  • 19
-3

An auto increment counter for a primary key is not a good idea. That is because you need to go back to the database to find the next key and increment by one before inserting your data.

That being said I would generally use whatever the database can provide for the primary key rather than having it as part of the application.

By letting the database natively provide it for you it can guarantee the key to be unique for what it needs.

Of course not all databases support it. In which case I generally use a table that stores key buckets and use high and low ranges that are managed in the application. This is the most performant solution I find because you get a range of 10000 numbers and auto increment them on the application instance. Another application instance can pick up another bucket of numbers to work with. You do need a sufficiently large primary key primitive such as a 64-bit long.

UUIDs I don't use as primary keys because the cost of building them and storing them is much higher than incrementing a long value by one. UUIDs still deal with the birthday paradox in that a duplicate can theoretically arise.

Archimedes Trajano
  • 685
  • 1
  • 6
  • 14
  • 3
    No. auto-increment keys means that the incrementation of the key is done automatically by the database. Sometimes (I'm looking at you, Oracle!) you need a sequence + trigger combination to do so, but you **never** need to look up the previously inserted value for the key, add 1, then use it. – SQB Aug 19 '16 at 14:57
  • With some persistence frameworks such as JPA if you want to return the value of the key ithat was created back to the caller you need to load up the record in order to see the key. – Archimedes Trajano May 08 '17 at 14:03