9

I am trying to design the DB schema of my new project and I need some suggestions. I am thinking to have one extra column for some tables which is a UUID. I wanted mainly for tables which contain data which are accessible through REST, so I don't want to expose the primary key. My concern is that when I want to update a table, I need to execute one more query just to take the original primary key.

What do you think?

pik4
  • 365
  • 3
  • 13
  • Why do you need to execute one more query to take the original primary key? – Thomas Owens May 23 '18 at 12:14
  • Some databases have something like a UUID already, it might be a hidden column. Postgres prior to 8.1 and optionally thereafter has OID, for example. This might be suitable for you, although it is likely to change if you have to rebuild a table from a backup so you can't rely on it being persistent. – PhilHibbs May 23 '18 at 12:38
  • @ThomasOwens imagine a scenario that you have the UUID of an order. Then I need to insert a tuple in another table which has a foreign key the orderId. In that case, I need to execute a query to get the original ID from Order table and then to make one more to make the insert. – pik4 May 23 '18 at 13:10
  • @PhilHibbs could you give me more context? what do you mean by OID? – pik4 May 23 '18 at 13:11
  • In that case, why can't you just use the UUID as your primary key, instead of an autoincrementing ID? – Thomas Owens May 23 '18 at 13:14
  • @ThomasOwens from a performance perspective it's not desired to use long id (UUID). So I tried to avoid it. – pik4 May 23 '18 at 13:18
  • @pik4 What's worse - the performance of using a UUID as a primary key or needing to perform two queries to update dependent tables or databases? – Thomas Owens May 23 '18 at 13:21
  • @ThomasOwens I bet on the second! – pik4 May 23 '18 at 13:26
  • 1
    @ThomasOwens UUIDS really are terrible for indexing. It's probably much better to do the two queries. – JimmyJames May 23 '18 at 13:59
  • @pik4 In the case you describe, you can still do a single `INSERT INTO ... SELECT ...` with a `JOIN` – Vincent Savard May 23 '18 at 14:00
  • I've used this approach a couple of times - it works fine. – GrandmasterB May 23 '18 at 16:17
  • 1
    @JimmyJames: Why would UUIDS be terrible for indexing? In most sane database designs, these are simply 128 bit numbers, essentially a long long. – Robert Harvey May 23 '18 at 19:32
  • @RobertHarvey If you have the chops on your chosen database platform, you can probably make it work. But that aside a lot depends on the type of UUID you are using and it depends on what kind of index is being used but to greatly oversimplify I have a basic sequential index, you can have a simple strategy that a pointer for any key in the range 0-9999 is in block 0, 10000-19999 is in block 1. Now take a 128-bit random key. You might have a million keys but they are evenly spread across 340,282,366,920,938,463,463,374,607,431,768,211,455 different values. You need a different approach. – JimmyJames May 23 '18 at 21:34
  • @JimmyJames: I read your comment and thought "I've never heard of people going to this kind of trouble over indexes," so I did a bit of poking around on Google, and decided that a better statement is "UUIDs are really terrible for indexing *if you expect them to have ordering characteristics,* which they don't have." If you give up that expectation, UUIDs are perfectly fine, and are preferable in many cases, especially if you do cross-database operations or don't want two trips to the database for new records. On SQL Server, this amounts to "Use non-clustered indexes for UUIDs." – Robert Harvey May 23 '18 at 23:56
  • What is the performance impact if you use UUID as primary key? – pik4 May 24 '18 at 07:58
  • @RobertHarvey When we had an process that was taking more than a minute to do an insert, I started caring about how 128-bit values work as primary keys. – JimmyJames May 24 '18 at 13:35
  • @JimmyJames: Assertions without explanations don't provide much insight. – Robert Harvey May 24 '18 at 14:55
  • I'm simply explaining why I ended 'going to this kind of trouble over indexes'. We had a major performance issue that we had to solve. It was related to using the internal row UUID as the primary key of a table. The size of the key wasn't the primary issue but in the process I learned about the costs around using them. That's why SQL server introduced [`newsequentialid`](https://docs.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql?view=sql-server-2017) – JimmyJames May 24 '18 at 15:05

5 Answers5

11

The biggest myth when designing applications is that you are only allowed to have one key.

Have multiple keys, go ahead and do it. Your application is allowed to have a different primary key than your database; I know this may sound strange at first, but you get the best of both worlds going this route.

UUIDs have many advantages, but make generally horrible clustered indexes. So, you can just use a int as a PK/clustered index in your DB, but also have another unique index on your external id column. This will may require an extra join for many queries, but that's okay, because in relational DBs, inner joins on auto-incremented int columns are blazing fast.

Another advantage to this dual key approach is the ability to easily geographically distribute your DB. Since UUIDs are globally unique, two geographically separated DBs will not have to slow down from coordinating their keys, as long as you make sure the DB int PK never leaves the DB and is not used by the rest service.

Here's another point, your external Id doesn't even have to be a UUID, there can be other options too.

The external can be an int as well, or a string, or the natural key of the entity; anything. This would make your urls less ugly than a UUID

Don't be afraid to have multiple keys, hiding your DB key from your rest api consumers is a good thing.


The most keys I have ever used has been two, but in the right situation, I can imagine up to four being justified (DB key, application key, business key, customer key)

TheCatWhisperer
  • 5,231
  • 1
  • 22
  • 41
  • 2
    "UUIDs have many advantages, but make generally horrible clustered indexes." I've had real experience with this and depending on the type of UUID, they really are problematic because (IIRC) they tend to be different at the beginning and end but change in the middle leaving you with lots index blocks with a single index. – JimmyJames May 23 '18 at 13:56
  • I would suggest to read the https://uuid.ramsey.dev/en/latest/database.html php library for handling UUID on your database storage. This will help you decide whats the best for you. – Ner Aug 15 '21 at 22:28
8

There is no point having two primary keys.

  1. UUIDs are slow to index.

No they are not. Stop worrying about arcane numbers like pages and fragmentation and do some tests.

https://www.mssqltips.com/sqlservertip/5105/sql-server-performance-comparison-int-versus-guid/

http://byterot.blogspot.com/2013/02/performance-series-guids-vs-identity-rdbms-sql.html

https://www.cybertec-postgresql.com/en/int4-vs-int8-vs-uuid-vs-numeric-performance-on-bigger-joins/

In any case, If you have both and index the UUID then you have the same problem

  1. UUIDs take up more space. well yes, as long as you can get away with those low max number ints.

In any case if you have both its definitely bigger

  1. UUIDs are ugly. Well you wouldn't want to type one, but do you want to type 13243444444431 or Gdih£$%d1 ? if you need a pretty url for SEO or some other reason, then there are better solutions than adding more id columns to your table.

Sequential ints have some major issues which are solved with UUIDs one of which is guessing the next id or how many orders you have taken.

These are real security and commercial concerns and should, on their own be enough to justify not using auto_inc ints at all.

Don't try to patch over the problems with hacky solutions, just replace them with the industry standard.

Ewan
  • 70,664
  • 5
  • 76
  • 161
1

If your only goal is to hide the primary key from your REST interface, then I'd advice against it. Instead, encrypting your keys would be a good solution for this.

1

This is a common problem that I've come across myself many times. I ended up going down the "just use a number for the id" route, meaning it's not quite as ugly to the clients to see those UUIDs.

The only other way, as you already eluded to, is to hide that ID behind some kind of query, which you'll have to feed parameters to in order to retrieve the correct UUID, however, that means you may as well be using those parameters as the primary key for table!

I'd say there's no way to do REST (i.e. state transfer of an model) without being able to identify the instance you're RESTing, so you're going to need a key, whatever form that key takes.

If your concern is due to security, I see the problem as is in your server message validation code, not that the client sees Ids. If a hacker wants to load some data from your table, they won't need to ids to do so. And you ought be validating the requests too, so no dodgy stuff comes through to your business layers anyway, i.e. Authentication and Authorisation of messages.

To come back to your question:

"accessible through REST, so I don't want to expose the primary key"

I think that it's the opposite, you NEED to expose the primary key in order to do REST.

EDIT: As @Murph pointed out, it actually DOES give attackers a better chance if you give Ids to the client - I did not know this. It may be that other answers here (I'm looking at the shortest one saying to encrypt the key) may be more correct to your situation. Edit 2: Seems UUIDs are no safer than ints...

Dan Rayson
  • 182
  • 7
  • 3
    You don't need to expose the primary key, you need to expose a unique identifier - the point here is that if you expose an incrementing integer then its trivially easy to guess valid keys and that creates various possibilities for attack. Using a GUID for the primary key mitigates against guessing of IDs but you're still exposing a meaningful identifier, hence desirability of having a separate lookup id (which might reasonably be changed from time to time if appropriate) – Murph May 23 '18 at 12:55
  • 1
    @Murph UUID is not a security feature. The idea that using a GUID mitigates the ability to guess keys is not necessarily true: ["If you use an Algorithm 1 GUID generator to assign GUIDs to candidates, you'll find that the GUIDs are assigned in numerically ascending order "](https://blogs.msdn.microsoft.com/oldnewthing/20120523-00/?p=7553) – JimmyJames May 23 '18 at 15:19
  • @JimmyJames interesting - but SQL Server (for example) uses type 4 and really what you're saying is that you still have to pay attention to get the result you want. Regardless its better (but clearly not sufficient in and of itself) – Murph May 24 '18 at 07:23
1

The problem with using primary keys as the public identifier for your entity is that youre leaking details of your data storage solution to your clients (and consequently, all the way through your application). This can have security implications, aswell as meaning that changes to your applications backend can be more complicated. UUIDs go some way to solving this problem by providing an alternative identifier that is entirely independent of your database technology.

UUIDs are just one solution to this problem, with the inherent advantage that for all practical purposes they are guaranteed to be unique. However, if you have gone to the effort of using something other than the primary key as your identifier, you might find that using an arbitrary string or just another numerical identifier provides nicer urls.

One other consideration is that if you're using GUIDs/UUIDs as the public identifier for your entity, you will likely need some sort of index for them. Given that they take up substantially more space than integers, your index will be much larger.

richzilla
  • 1,083
  • 7
  • 11
  • 4
    It's important to understand that there are 4 or 5 types of UUIDs depending on how you count them. Three of these use information about the machine or domain to ensure uniqueness. If you use any of these, you can leak far more information such as the MAC address of the computer generating the value. – JimmyJames May 23 '18 at 14:04