30

I have seen many designs in which normalization was not the first consideration in the decision-making phase.

In many cases those designs included more than 30 columns, and the main approach was "put everything in the same place".

From what I remember, normalization is one of the first and most important design considerations, so why is it dropped so easily sometimes?

Edit:

Is it true that good architects and experts choose a denormalized design while non-experienced developers choose the opposite? What are the arguments against starting your design with normalization in mind?

maple_shaft
Yosi Dahari
    because normalized DBs need a lot of joins on even the most trivial queries – ratchet freak Sep 28 '13 at 22:22
  • Well of course... but can't a view be a solution for that? Is this worth the cost of giving up consistency? – Yosi Dahari Sep 28 '13 at 22:24
  • 1
    those joins will still need to happen even hidden by views – ratchet freak Sep 28 '13 at 22:31
  • 1
    And you care about that because? If it's fast enough for you, assuming you have proper indexes, that shouldn't be a problem. And even if it's slower, is it worth the loss of consistency? That's the main question. Everyone knows normalization -> JOINs. – Yosi Dahari Sep 28 '13 at 22:44
  • Could you add an example to your question? Tables can have 3 columns and not be normalized. They can have a dozen and be normalized. The database can be normalized and still be badly designed. – dcaswell Sep 28 '13 at 22:46
  • 33
    Many programmers don't know the basics of the relational model. – mike30 Sep 28 '13 at 22:48
  • I can, but will it help you? You know what a non-normalized table is, and I think (and hope) you know the phenomenon I am describing here. – Yosi Dahari Sep 28 '13 at 22:49
  • @mike30 - Totally agreed. But the more interesting case for me is why architects make those decisions, rather than inexperienced programmers. – Yosi Dahari Sep 29 '13 at 08:19
  • 3
    They ignore it because they don't have to answer to DBAs, BI analysts, or security auditors. – Aaronaught Sep 29 '13 at 13:30

4 Answers

31

What's interesting about this Q&A thread is that there are actually 3 questions. Everybody has answered a different one, and almost nobody has answered the first one:

  1. Why aren't some databases in the wild normalized?
  2. Why/when should a normalized database be denormalized?
  3. In what situations is it harmful or unnecessary to normalize in the first place?

Alert readers will note that these are very different questions, and I'll try to answer each of them separately while avoiding too much detail. By "too much", I mean that I don't think this is the appropriate context in which to be carrying out an extended debate over the merits of various arguments in favour of or against normalization; I'm simply going to explain what those arguments are, maybe list a few caveats, and save the philosophy for more specific questions, if they ever come up.

Also, in this answer I am assuming that "normalization" implies "BCNF, 3NF, or at least 2NF", since that's the level of normalization that designers generally aim to achieve. It's rarer to see 4NF or 5NF designs; although they certainly aren't impossible goals, they concern themselves with the semantics of relationships rather than just their representation, which requires considerably more knowledge about the domain.
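As a minimal illustration of the kind of redundancy these normal forms eliminate, here is a sketch in Python with SQLite (all table and column names are invented for the example): a department's location depends on the department, not on the employee key, so storing it on every employee row is a transitive dependency that a 3NF decomposition removes.

```python
# Sketch of a 3NF violation and its fix; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unnormalized: the location is repeated for every employee in a department.
cur.execute("""CREATE TABLE employee_flat (
    emp_id INTEGER PRIMARY KEY,
    name TEXT,
    department TEXT,
    department_location TEXT)""")
cur.executemany(
    "INSERT INTO employee_flat VALUES (?, ?, ?, ?)",
    [(1, "Ann", "Sales", "London"),
     (2, "Bob", "Sales", "Paris")])  # contradiction: two locations for Sales

# 3NF decomposition: the location is stated exactly once.
cur.execute("""CREATE TABLE department (
    department TEXT PRIMARY KEY,
    location TEXT)""")
cur.execute("""CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,
    name TEXT,
    department TEXT REFERENCES department(department))""")
cur.execute("INSERT INTO department VALUES ('Sales', 'London')")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [(1, "Ann", "Sales"), (2, "Bob", "Sales")])

# The flat design can express the inconsistency; the normalized one cannot.
locations = cur.execute(
    "SELECT DISTINCT department_location FROM employee_flat "
    "WHERE department = 'Sales'").fetchall()
print(len(locations))  # 2 -- two contradictory locations survived
```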

So, onward and upward:

1. Why aren't some databases in the wild normalized?

The answer to this could be "because they shouldn't be", but making that assumption right off the bat is pretty piss-poor detective work. We wouldn't make very much progress as a society if we always operated on the assumption that whatever is, ought to be.

The real reasons that databases don't get normalized in the first place are more complicated. Here are the top 5 that I've come across:

  • The developers who designed it didn't know or didn't understand how to normalize. Strong evidence of this comes in the form of many other accompanying bad design choices, like using varchar columns for everything or having a spaghetti mess of meaningless table and column names. And I assure you, I've seen "real" databases that are every bit as bad as those in the TDWTF articles.

  • The developers who designed it didn't care or were actively against normalization on principle. Note, here I am not talking about instances where a deliberate decision was made not to normalize based on contextual analysis, but rather teams or companies where normalization is more-or-less understood but simply ignored or shunned out of habit. Again, surprisingly common.

  • The software is/was done as a Brownfield project. Many purists ignore this perfectly legitimate business rather than technical reason for not normalizing. Sometimes you don't actually get to design a new database from scratch, you have to bolt on to an existing legacy schema, and attempting to normalize at that point would involve far too much pain. 3NF wasn't invented until 1971, and some systems - especially financial/accounting systems - have their roots even farther back than that!

  • The database was originally normalized, but an accumulation of small changes over a long period of time and/or a widely distributed team introduced subtle forms of duplication and other violations of whatever normal form was originally in place. In other words, the loss of normalization was accidental, and too little time was spent on refactoring.

  • A deliberate business decision was made not to spend any time on business analysis or database design and just "get it done". This is often a false economy and ultimately becomes a mounting form of technical debt, but is sometimes a rational decision, at least based on information that was known at the time - for example, the database may have been intended as a prototype but ended up being promoted to production use due to time constraints or changes in the business environment.

2. Why/when should a normalized database be denormalized?

This discussion often comes up when a database is normalized to start with. Either the performance is poor or there is a lot of duplication in queries (joins), and the team feels, rightly or wrongly, that they've gone as far as they can with the current design. It is important to note that normalization improves performance most of the time, and there are several options to eliminate excess joins when normalization appears to be working against you, many of which are less invasive and risky than simply changing to a denormalized model:

  • Create indexed views that encapsulate the most common problem areas. Modern DBMSes are capable of making them insertable or updatable (e.g. SQL Server INSTEAD OF triggers). This comes at a slight cost to DML statements on the underlying tables/indexes but is generally the first option you should try because it is nearly impossible to screw up and costs almost nothing to maintain. Of course, not every query can be turned into an indexed view - aggregate queries are the most troublesome. Which leads us to the next item...

  • Create denormalized aggregate tables that are automatically updated by triggers. These tables exist in addition to the normalized tables and form a kind of CQRS model. Another CQRS model, more popular these days, is to use pub/sub to update the query models, which gives the benefit of asynchrony, although that may not be suitable in very rare instances where the data cannot be stale.

  • Sometimes, indexed views are not possible, the transaction rates and data volumes are too high to admit triggers with acceptable performance, and the queries must always return realtime data. These situations are rare - I'd hazard a guess that they might apply to things like High-Frequency Trading or law enforcement/intelligence databases - but they can exist. In these cases you really have no option but to denormalize the original tables.
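The second option above (trigger-maintained aggregate tables) can be sketched in a few lines. This is a minimal SQLite example run from Python, with invented table names; a real system would also need UPDATE and DELETE triggers to keep the aggregate honest.

```python
# Hedged sketch: a denormalized aggregate table kept current by a trigger.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
cur.execute("CREATE TABLE customer_totals (customer TEXT PRIMARY KEY, total REAL)")

# The trigger keeps the aggregate in step with every insert on the base table.
cur.execute("""
CREATE TRIGGER orders_ai AFTER INSERT ON orders
BEGIN
    INSERT OR IGNORE INTO customer_totals (customer, total)
    VALUES (NEW.customer, 0);
    UPDATE customer_totals SET total = total + NEW.amount
    WHERE customer = NEW.customer;
END""")

cur.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                [("acme", 10.0), ("acme", 5.0), ("globex", 7.5)])

# Reads hit the narrow aggregate table instead of scanning the orders table.
total = cur.execute(
    "SELECT total FROM customer_totals WHERE customer = 'acme'").fetchone()[0]
print(total)  # 15.0
```

The INSERT OR IGNORE / UPDATE pair is used because it works on any SQLite version; other DBMSes would use their own upsert or MERGE syntax inside the trigger body.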

3. In what situations is it harmful or unnecessary to normalize in the first place?

There are, in fact, several good examples here:

  • If the database is being used only for reporting/analysis. Typically this implies that there is an additional, normalized database being used for OLTP, which is periodically synchronized to the analysis database through ETL or messaging.

  • When enforcing a normalized model would require an unnecessarily complex analysis of the incoming data. An example of this might be a system that needs to store phone numbers collected from several external systems or databases. You could normalize out the calling code and area code, but you'd have to account for all of the different possible formats, invalid phone numbers, vanity numbers (1-800-GET-STUFF), not to mention different locales. It's usually more trouble than it's worth, and phone numbers are usually just shoved into a single field unless you have a specific business need for the area code on its own.

  • When the relational database is primarily there to provide transactional support for an additional, non-relational database. For example, you might be using the relational database as a message queue, or to track the status of a transaction or saga, when the primary data is being stored in Redis or MongoDB or whatever. In other words, the data is "control data". There's usually no point in normalizing data that isn't actually business data.

  • Service-Oriented Architectures that share a physical database. This is a bit of an odd one, but in a true SOA, you will occasionally need to have data physically duplicated because services are not allowed to directly query each other's data. If they happen to be sharing the same physical database, the data will appear not to be normalized - but generally, the data owned by each individual service is still normalized unless one of the other mitigating factors is in place. For example, a Billing service might own the Bill entity, but the Accounting service needs to receive and store the Bill Date and Amount in order to include it in the revenue for that year.

I'm sure there are more reasons that I haven't listed; what I'm getting at, in essence, is that they are quite specific and will be fairly obvious when they come up in practice. OLAP databases are supposed to use star schemas, SOAs are supposed to have some duplication, etc. If you're working with a well-known architecture model that simply doesn't work with normalization, then you don't normalize; generally speaking, the architecture model takes precedence over the data model.

And to answer the very last question:

Is it true that good architects and experts choose a denormalized design while non-experienced developers choose the opposite? What are the arguments against starting your design with normalization in mind?

No, that is complete and utter B.S. It's also B.S. that experts always choose a normalized design. Experts don't just follow a mantra. They research, analyze, discuss, clarify, and iterate, and then they choose whatever approach makes the most sense for their particular situation.

The 3NF or BCNF database is usually a good starting point for analysis because it's been tried and proven successful in tens of thousands of projects all over the world, but then again, so has C. That doesn't mean we automatically use C in every new project. Real-world situations may require some modifications to the model or the use of a different model altogether. You don't know until you're in that situation.

Aaronaught
15

The assumption built into the question and into some of the answers is that normalization is synonymous with good database design. This is in fact often not the case. Normalization is one way of achieving a particular set of design goals, and a requirement if you are relying heavily on the database to enforce "business rules" about the relationships between data elements.

Normalization gives you a few key benefits:

  1. Minimizes the amount of redundant data.
  2. Maximizes the extent to which the database's built in integrity mechanisms (foreign key constraints, uniqueness constraints) can be leveraged to ensure the integrity of the data.
  3. Reduces the number of columns per row, increasing the efficiency of I/O in some cases; wide rows take longer to retrieve.
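As a minimal sketch of benefit 2, here is the database itself rejecting an orphan row via a foreign key constraint. The example uses SQLite from Python with invented shop/sale tables; note that SQLite only enforces foreign keys once the pragma is switched on.

```python
# Sketch: built-in integrity mechanisms guarding the data, no app code needed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in
conn.execute("CREATE TABLE shop (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("""CREATE TABLE sale (
    id INTEGER PRIMARY KEY,
    shop_id INTEGER NOT NULL REFERENCES shop(id),
    amount REAL)""")

conn.execute("INSERT INTO shop (id, name) VALUES (1, 'Main St')")
conn.execute("INSERT INTO sale (shop_id, amount) VALUES (1, 9.99)")  # fine

# A sale pointing at a nonexistent shop is rejected by the database itself.
try:
    conn.execute("INSERT INTO sale (shop_id, amount) VALUES (42, 1.00)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```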

That said, there are plenty of valid reasons to denormalize:

  1. Performance, particularly for analytics, can be crippled by normalization. For analysis against relational databases, denormalized dimensional models are the standard approach.
  2. The benefit of enforcing data integrity inside of the database is starting to decline. As more and more development is focused on the object-oriented middle-tier that is often enforcing business rules, reliance on relational constraints in the database is less important.
  3. As others have mentioned, normalization will complicate the queries required to retrieve relevant data.
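A hedged sketch of what point 1's "denormalized dimensional model" looks like in practice: a tiny star schema with one fact table and two dimension tables (all names invented), queried with a single fan-out join.

```python
# Sketch of a star schema of the kind used for analytics workloads.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Dimensions carry deliberately redundant, query-friendly attributes.
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, "
            "year INTEGER, month INTEGER)")
cur.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, "
            "name TEXT, category TEXT)")
# The fact table holds only foreign keys and numeric measures.
cur.execute("""CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    amount REAL)""")

cur.execute("INSERT INTO dim_date VALUES (20130901, 2013, 9)")
cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(20130901, 1, 100.0), (20130901, 1, 50.0)])

# Analytic queries become a single fan-out join from the fact table.
row = cur.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category""").fetchone()
print(row)  # (2013, 'Hardware', 150.0)
```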

It is not clear that normalization is a sign of good design. In some cases, normalization is an artifact of a time when storage space was at a premium and when much of the responsibility for encoding business rules resided in the database (think about 2-tiered client-server applications with most if not all business logic in stored procedures). It may well be that many projects veer away from normalization based on good architectural decisions rather than a poor grasp of database design principles.

The article by Jeff Atwood referenced in the comments above provides some good detailed discussion - "Maybe Normalizing Isn't Normal".

DemetriKots
  • I was talking about normalization as good database design because it's a major subject in most relational database courses. What I actually wanted explained is why that's not always so. If good architects don't choose a normalized design in many cases, I guess there must be many reasons not to use it. – Yosi Dahari Sep 29 '13 at 17:49
  • 7
    Hi Yosi, I understand your point. Normalization is foundational in really understanding the theory of relational databases and has real application in practice, so it isn't surprising that it is a big topic in courses. Good engineers should understand it and understand when it should be applied. The thing that doesn't seem to be covered in the course work is that selectively denormalizing can yield a lot of benefit and some problems really don't lend themselves to normalized models. – DemetriKots Sep 29 '13 at 17:52
  • 1
    What about data consistency? For example, if you have the shop name in every sale's detail line, then you can potentially have different contradicting descriptions, whereas if the data is normalized, the shop name appears only once (in the shop table) and there's no room for inconsistency. – Tulains Córdova Sep 30 '13 at 15:55
  • @user61852 - That is a good example of a situation in which normalizing makes sense. A counter-example would be thinking about the hierarchy that might exist for a chain of shops. There might be a corporate level, regions, territories, and then individual store locations. You could normalize the hierarchy and end up with three tables and a four table join every time you wanted to get the store record with its full hierarchy, or you could live with the redundancy and risk of inconsistency by collapsing everything into the store record. Both approaches have their place. – DemetriKots Sep 30 '13 at 18:01
  • Wouldn't a view with the joined tables save you the effort of writing the join every time? One of the uses of views is to provide a de-normalized set of rows, making queries trivial. That way you have the advantages of normalization with the ease of use of de-normalization. Violating normal forms without a good reason is paid for with endless hours in the office while everyone else is having family time. – Tulains Córdova Sep 30 '13 at 19:13
  • 1
    I agree. I think normalization gets overused at times by DBAs who have been taught that it is the best design. I've always suggested that the DBAs can normalize the tables in the ETL all they want, but when it comes to the tables the UI references, I need tables that are easy to query without excessive joins. I've run into tables that were so over-normalized that I could barely troubleshoot user issues without spending HOURS on them. – L_7337 Oct 07 '13 at 19:37
  • 1
    Au contraire, analytics is **insanely difficult** if you're unable to *start* from a normalized model. I just had to go through this exercise, and it was hell. Application developers should *never* assume that a denormalized schema is going to be suitable for analytics needs. And as for point #3 against normalization, it's a problem that's almost trivially solved by materialized/indexed views. – Aaronaught Oct 26 '13 at 21:52
  • 1
    And #2 sounds reasonable but strains credulity in practice - I cannot remember seeing a single instance in my 10+ years where constraints were actually thoroughly enforced by the application. More often, developers either incorrectly equate business rules to data integrity or use the fact that ORMs theoretically *can* enforce relational constraints as an excuse not to do it anywhere at all. Maybe I'm just being cynical, but all of my career experience has taught me that statements like "the application will enforce data integrity" are enormous red flags. – Aaronaught Oct 26 '13 at 21:54
11
  1. A lot of developers don't know or care about normalization, data modeling, or database design in general.
  2. For some jobs it's really not important.
  3. Sometimes there's a really good reason to de-normalize, for example to make a particularly difficult workload perform well.
  4. Relational database concepts have recently been less in fashion than they were in the 1990s and 2000s. Developers tend to be influenced by fashion, even if they claim to be very rational. There's no point arguing about taste.

Normalization is also, historically, a territory for near religious argument, so I hesitate to say much more.

joshp
  • I'd add to this that sometimes relational is not actually the correct design for a database; for example, an LDAP directory is hierarchical, some other types may be better served by a flat design. – Maximus Minimus Sep 28 '13 at 23:59
  • 1
    As far as point #4, I'd say that *relational databases* are less in fashion and are starting to be swapped out for nosql varieties, and that's actually a great thing a lot of the time. But I don't see a lot of movers and shakers throwing together non-relational data models using an RDBMS. That's just stupid. – Aaronaught Sep 29 '13 at 13:34
  • @joshp - Thanks, nice summary. Point #3 is the one I'm personally most interested in: why do other factors "beat" the need for normalization? – Yosi Dahari Sep 29 '13 at 15:27
  • @JimmyShelter I agree. Fashion aside, relational is not always the best choice. – joshp Sep 29 '13 at 17:28
  • 4
    @Yosi - The reason some factors can trump normalization is that normalization is a technique to avoid common data consistency problems when data is being inserted, updated and deleted. If data is written once and then only read after that then the C, U, and D of CRUD don't matter any more. In such a case the benefits of normalization are basically meaningless so other competing pressures can take precedence, such as read performance or query simplicity. – Joel Brown Sep 29 '13 at 17:30
  • @Yosi In #3, sometimes a normalized design results in performance hot spots for a particular important workload. Sometimes users (e.g. IT) find a large normalized design difficult to understand and query. In the latter case 'denormalized' views can often help. Two examples. – joshp Sep 30 '13 at 18:43
9

In large projects, and especially those on mainframes, this is not the case. In fact, if you search job sites you will see several positions for data modelers. Also, having many columns in a single table does not by itself go against normalization. Nevertheless, your observation is valid for some projects.

Database design is one of the skills required to build quality systems. Having said that, some developers don't know enough about database design and still get assigned to the task of data modeling and database design. Some projects even skip data modeling. The focus on many projects is mainly on coding and front-end design.

Another factor in poor database design is the fact that normalization is not a trivial topic, especially when it comes to 4NF, 5NF, and so on. Most books I have seen do not explain those forms well; there are usually bad examples and too much theory. This makes the topic less popular than it should be.

Errors in database design are hard to detect unless you look for them or encounter them during testing. Having no standard for database design quality makes errors more likely.

Add to that the fact that some projects don't follow a rigorous development methodology (one that promotes database design); as a result, responsibilities get mixed and tasks get lost between the business analyst, the developers, and the DBAs. Developers talk in OO and UML, while DBAs talk in DD and some in ERDs, and probably many don't get UML or OO. In short, the lack of knowledge, lack of good clear resources, lack of a unified language to describe data, and lack of methodology are all to blame.

NoChance
  • Can you suggest database design quality (not only schema, but also procedures) documents/articles? – Tilak Sep 29 '13 at 02:01
  • "having many columns on a single table does not go against normalization" -Sure.My intention was #entailments. In the question I mentioned #columns just for simplicity, my assumption was that the reader will understand the correlation and by that what I meant – Yosi Dahari Sep 29 '13 at 06:26
  • @Tilak, I am not sure if there is a specific reference to get the best guidelines from but you can collect your list from data modeling and database design literature. Sorry if this does not answer your question. I think that this could be a good subject for a book. – NoChance Sep 29 '13 at 10:44