2

Here's a scenario where I'm wondering whether to denormalize a relational database (MS SQL).

Text description of the requirement

I have users (stored in a users table).

Users belong to Accounts:

  • An account is created before its users are created
  • Every user must belong to exactly one account (neither no account, nor more than one account)
  • Any account may contain many users
  • A user's assignment to an account is permanent: users are never moved from one account to another

The Users table is predefined so I can't add an "accountId" column to the Users table; so to implement accounts I have:

  • A Users table, with a userId key
  • An Accounts table, with an accountId key
  • A UserAccounts table, with userId and accountId columns (to identify the account associated with each user); the UserAccounts table is constrained:

    • Foreign key relations to Users and Accounts table
    • Uniqueness constraint to ensure that a userId appears no more than once

Because I can't modify the Users table to insert and accountId column in it, I can't ensure that every user has an accountId. And in fact I will insert into the UserAccounts table after inserting into the Users table, though probably in the same transaction.

  • I have a dozen other tables (e.g. Sales) which define various types of thing which belong to users
  • These tables have a userId column (to identify which user each row belongs to)
  • The problem is that I also need to know what account each thing belongs to

Graphic description of the requirement

Here's a reworded version of the same problem (thanks to MichaelT who suggested this rewording).

   +-----------+    +--------------+      +--------------+
   | Users     |    | UserAccount  |      | Account      |
   +-----------+    +--------------+      +--------------+
+--> userId pk <----+ userId (uniq)|   +--> accountId pk |
|  |           |    | accountId    +---+  |              |
|  +-----------+    +--------------+      +--------------+
|                                                       
|                                                       
|  +-----------+                                        
|  | Sales     |                                        
|  +-----------+  (and several other tables like this)  
|  | saleId pk |                                        
+--+ userId fk |                                        
   |           |                                        
   +-----------+                                        

I can't modify the Users table for various reasons: which is why each user's (single) account is defined using the separate UserAccount table, instead of adding an accountId foreign key to the Users table.

This UserAccount table implements a many-to-one relationship between users and accounts (not a many-to-many relationship). It could be defined using either the following keys:

  • userId as the primary key
  • (userId,accountId) as the primary key plus userId as a unique key

The question

Now for the question:

When doing a query against a Sales table that needs to include the Account information, I think there are two ways to implement knowing which account each thing belongs to:

  1. Join the table to the UserAccounts table (to select the accountId for each userId):

    select S.*, UA.accountId
    from Sales S
        join UserAccount UA on S.userId = UA.userId
    where
        S.something = somethingElse
    
  2. Denormalize the Sales table[s], by storing the accountId in it as well as the userId (the accountId can then be retrieved from Sales without a join to UserAccount):

    |  +--------------+                                        
    |  | Sales        |                                        
    |  +--------------+  (and several other tables like this)  
    |  | saleId pk    |                                        
    +--+ userId fk    |                                        
       | accountId fk |                                        
       +--------------+                                        
    

    If I did the latter I could ensure integrity by defining the userId plus accountId pair/combination as a foreign key constraint into the UserAccounts table (to ensure that the pairing of accountId with userId matches the pairing defined in the UserAccounts table).

    CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId)
        REFERENCES UserAccounts (userId,accountId)
    

I don't have vast experience with database design: I learned it from books (by Joe Celko).

I'm inhibited against denormalizing in general, but denormalizing seems appropriate here: why an extra JOIN in every SELECT statement (to get the accountId), when the same could be obtained by storing the accountId in the table with a (compound or composite) foreign key.

Is there a compelling reason to use (or to not use) either of these two possibilities?

If not then I'm tempted to use the 2nd method:

  • because it's simpler to define the select statement (without a join)
  • because it's (presumably) slightly better performance (without a join)
  • because it ensures that a corresponding row (with an accountId) exists in the UserAccounts table
ChrisW
  • 3,387
  • 2
  • 20
  • 27
  • Can you clarify if a User ever would have more than one UserAccount? – RibaldEddie Jan 03 '15 at 17:30
  • @RibaldEddie Thank you for asking: in a word, "no". I edited the question to clarify that. – ChrisW Jan 03 '15 at 17:33
  • 3
    As a general rule I would never de normalize any data until it was clear that it would be absolutely necessary. Performance isn't a consideration in this situation so I would keep the join table. In fact the type of "optimization" you're talking about isn't just a drop in a single level of normal form (say from 4 to 3), it's completely de normalized. Turns your data into rubbish. – RibaldEddie Jan 03 '15 at 17:39
  • 1
    Playing devil's advocate, I might see userId+accountId as a single (composite or compound) key. Why does it "turn my data into rubbish" to do it that way: for what reason (in what use case or scenario) is it sub-optimal? – ChrisW Jan 03 '15 at 17:52
  • Check out the example section here http://en.m.wikipedia.org/wiki/Second_normal_form – RibaldEddie Jan 03 '15 at 18:57
  • I don't see what you're trying to tell me. Wikipedia says something about 'update anomaly' but so far as I can see I'd have relational integrity by defining `FOREIGN KEY (userId, accountId) REFERENCES UserAccount (userId, accountId)` – ChrisW Jan 03 '15 at 21:48
  • 1
    That would have to be the primary key then. – RibaldEddie Jan 03 '15 at 22:11
  • @RibaldEddie is right, UserID+AccointID should be the PK of UserAccount and not just a mere unique constraint. Most RDBMS would only allow a FK to a PK, not any unique index. – Tulains Córdova Jan 04 '15 at 00:27
  • I think is was clear the existence of the table UserAccount was not a design decision but an answer to a prohibition to add a FK no table User. – Tulains Córdova Jan 04 '15 at 00:29
  • Yes I'd prefer to put non-nullable accountId column in the 'Users' table (to define a required many-to-one relationship); but given that I cannot alter the Users table, I implement that relationship with a new 'UserAccounts' table. – ChrisW Jan 04 '15 at 00:35
  • Thanks for the heads-up about primary key. That won't be a problem, because UserAccounts could have userId as primary key; but it could also have userId+accountId as primary key plus userId as a simple 'unique' constraint. Also it seems that (using MS SQL from 2012 onwards) [A foreign key constraint does not have to be linked only to a primary key constraint in another table; it can also be defined to reference the columns of a UNIQUE constraint in another table.](http://msdn.microsoft.com/en-us/library/ms189049(v=sql.110).aspx) – ChrisW Jan 04 '15 at 00:40
  • If UserID could have been he PK of UserAccount, then the relationship between entities User and Account is 1:1, not 1:m. And that's the reason you say "Uniqueness constraint to ensure that a userId appears no more than once" and not "Uniqueness constraint to ensure that the combination of a userId and an accountID appears no more than once". The question needs to be a lot more clear than it is right now because leaves to many asumtions to the reader. – Tulains Córdova Jan 04 '15 at 01:33
  • 1
    This is a basic compound vs. surrogate key situation regardless of the single account/user requirement. – JeffO Jan 04 '15 at 18:11
  • @user61852 It's 1:m in the sense that each user appears once in the table, and each account appears many times. It's a m:1 relationship in the sense that each user has one account and each account has many users. – ChrisW Jan 04 '15 at 18:32
  • @user61852 [What is unclear about this question and how can it be clarified?](http://meta.programmers.stackexchange.com/q/7085/19237) – ChrisW Jan 04 '15 at 18:40
  • 1
    I suggest you edit your question replacing it with @MichaelT sugestion in Meta (copy & paste will do), but indicating in the diagram the PK of the tables (mainly UserAccount). Please note that a table with no PK is a non-conformity regarding relational database principles. – Tulains Córdova Jan 04 '15 at 23:20

3 Answers3

4
  • You should get rid of the UserAccounts table, since the relationship between User and Account is a one-to-many relatioship (given the uniqueness constraint you have on UserAccounts). There's no need for the connection table where the relationship between entities is one-to-many. Such a table is only needed when the relationship is many-to-many.
  • Let's suposse there's a political reason why you can't get rid of UserAccounts or modify User, then...
  • I advice you to go with option 1. The specific denormalization you suggest will bring you headaches. Ask yourself: What if a user changes account?. In that case you willl have to update a lot of tables. Not having to do so is preciselly what RDBMS are for.
  • There are some cases when de-normalization makes sense but not here.
  • On the argument on join simplicity: some old sage said once: "depend on views, not on tables". Create as many views as needed with the tables already joined. Then program against those views, not tables, that way you don't have to worry about that specific join all the time. There should not be any performance issue if you have indexes and FKs. That's what RDBMS are for.

EDIT: the advantage of the view is that you will no longer have to write the join yourself in every query, which, if I understood well, bothers you, and that, if I understood well, is one of your reasons againts not denormalizing. The view already has the join under the hood, so it will be transparent for you that the AccoountID is in another table. Is has the advantages of normalization but not the disadvantages. One of the advantages of views is offering a denormalize appearance but without denormalizing. You only write the join once (in the view) rather that everytime you want to know the AccoountID, since you will be using the view VSales (which already has an AccoountID column) instead of the table Sale (which doesn't) in queries.

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154
  • `Ask yourself: What if a user changes account?` I did ask that explicitly and I was told (by the customer/product manager) that will never happen. I suppose that if (unforeseen situation) a person must change account, they'll do that by becoming a new user in the new account (instead of moving the old user object to a new account). – ChrisW Jan 03 '15 at 17:46
  • `since the relationship between User and Account is a one-to-one relatioship` It's many-to-one. – ChrisW Jan 03 '15 at 17:47
  • 1
    `some old sage said once: "depend on views, not on tables".` Is there any **reason** for applying that old adage in this case? Why is a view better than a table in this scenario? I'm aware that the tendency is to prefer normalization but I could not think of any good reason for doing it in this case, and I feared that applying the adage just because it's old would be an example of [Cargo cult programming](http://en.wikipedia.org/wiki/Cargo_cult_programming). – ChrisW Jan 03 '15 at 17:50
  • @ChrisW I edited the answer to reflect both that the relationship is 1:m and not 1:1 as I thought, but not m:m as an `UserAccount` table would seem to indicate and why the views saves you from having to write the join yourself in every query. – Tulains Córdova Jan 03 '15 at 18:24
  • I agree with almost everything you wrote, but the first paragraph is IMHO useless, since it ignores completely the OP's statement about the fact the User table cannot be changed. And there can be technical or organizational reasons for that, there is not need to assume "political reasons". – Doc Brown Jan 03 '15 at 20:55
  • You said, `The specific denormalization you suggest will bring you headaches`. Can you please identify any potential problem/headache from this specific specific denormalization? – ChrisW Jan 03 '15 at 22:13
  • 1
    @ChrisW I gave you an example of an hypothetical situation in which it's necessary to move an user to a different account becasue they want to delete an account, or **becasue they added an user to the wrong account by error**. That's the type of things that never happen until they happen. Another example is deleting an user. Also I made a mistake, thought the reason for wanting to denormalize was `UserID` but it's `AccountID`, I corrected that in the answer, you can see the edit if you see the edit history. To continue... – Tulains Córdova Jan 04 '15 at 00:00
  • Yes, that's the only hypothetical situation I was able to think of. I'm told it won't happen. If it did happen and if I'd done it the first way, that would make the change easy and cause problems: because old (historical) Sales rows (created when the user was in the old account) would seem to be reassigned to the new account when the user was moved to the new account. STaoring the accountId (with the userId) in the Sales table helps to make it clear that's not allowed: a row in the Sales table is to a user and to an account, and the account cannot change accidentally. – ChrisW Jan 04 '15 at 00:07
  • @ChrisW Having a UserId and a AccountID columns in sales violates 2NF/3NF. Violating 2NF/3NF means you could have update anomalies, meaning you will need to update Sales if a user need to be moved to a different account. The dependence Sales and Account is transitive. – Tulains Córdova Jan 04 '15 at 00:11
  • I won't move users from one account to another ([link](http://programmers.stackexchange.com/questions/267948/effect-of-denormalizing/267952?noredirect=1#comment546440_267952)). I certainly won't move Sales and other such rows from one account to another. I don't think I'll have update anomalies, a) because I have no updates, b) because I have a foreign key relationship ([link](http://programmers.stackexchange.com/questions/267948/effect-of-denormalizing/267952?noredirect=1#comment546484_267948)). – ChrisW Jan 04 '15 at 00:17
  • 1
    @ChrisW Reading your prior-to-last comment I guess I didn't understand the question. Question states _"These tables have a userId column"_ That's why I thought relatioship was with User and not both User and Account and that the whole AccountID-denormalization thing arose from the fact that you weren't allowed to modify the User table. Take from my answer whatever could be useful to you, if any, given I didn´t understand the question but I suggest you improve the question and make some things more evident so you can get better answers. For example mgw854 also misunderstood the answer. – Tulains Córdova Jan 04 '15 at 00:21
  • How can a single account be related to multiple users if you get rid of the UserAccount table? It is true you only need this intermediate table in many-to-many situations, but you don't have the luxury of placing the account id in the user table. – JeffO Jan 04 '15 at 18:08
1

In this particular case, it doesn't sound like denormalizing to me--there is a 1:1 required relationship and the only piece of lookup information is the User key. Adding another table just to store the relationship doesn't add any value and actually works against you (since as you mentioned, you can't require the field that way).

As for your reasons why, I wouldn't worry about simplicity or performance--with proper indexing, you would never even notice the join. The most important aspect is data correctness, and the second way offers that guarantee for you.

mgw854
  • 1,818
  • 10
  • 11
  • If I had been permitted to modify the Users table, then I would have defined a non-nullable accountId column in the Users table (to define each user's account), instead of having a separate UserAccounts table to define that mapping. The question would be then whether (in other tables like Sales) I defined just the userId as a foreign key, or defined the userId+accountId pair -- that (storing the userId+accountId pair in more than one table) does seem to me to be an example of denormalization: isn't it? – ChrisW Jan 03 '15 at 17:40
-1

The point of not denormalizing would be to avoid update anomalies.

Not denormalizing would be good, if and only if:

  • Users are moved from one account to another
  • When a user is moved, all the Sales records for that user should implicitly move with the user to the user's new account

In this case, 'denormalizing' is OK for three reasons:

  • Users are guaranteed to never change accounts
  • Even if a user did change accounts, we wouldn't want the user's historical Sales rows to change
  • userId plus accountId are (or could be) defined together (is it called 'compound' or 'composite'?) as a primary key, so defining both together in the Sales table is not denormalizing (they're two physical columns but kind of one logical column)

The denormalizing is PREFERABLE for three reasons (in order of decreasing importance):

  1. It records the historical accountId at the moment when the Sales row is created. The account (not the user) pays for the sale and owns the thing sold, so the mapping from Sales row to Account row must never change in the future even if the user subsequently changes account (which the user never will)
  2. It guarantees that no Sales are inserted for Users which (by an anomaly) have no associated Account
  3. It makes the SELECT simpler: no need to JOIN Sales to UserAccount (instead that JOIN is defined implicitly and at INSERT time by the compound/composite FOREIGN KEY contraint)

I vaguely remember reading that denormalizing is normal, standard, best practice when dealing with historical data: and this might be an example of that kind of "historical" data where denormalizing is justified.

ChrisW
  • 3,387
  • 2
  • 20
  • 27
  • Given that I'm asking a sincere question and posting this as a possible answer, I hope it's obvious that if you think this answer is wrong, then I'd like you to tell me why/what is wrong about it instead of simply downvoting. – ChrisW Jan 14 '15 at 23:32