What's interesting about this Q&A thread is that there are actually 3 questions. Everybody has answered a different one, and almost nobody has answered the first one:
- Why aren't some databases in the wild normalized?
- Why/when should a normalized database be denormalized?
- In what situations is it harmful or unnecessary to normalize in the first place?
Alert readers will note that these are very different questions, and I'll try to answer each of them separately while avoiding too much detail. By "too much", I mean that I don't think this is the appropriate context in which to be carrying out an extended debate over the merits of various arguments in favour of or against normalization; I'm simply going to explain what those arguments are, maybe list a few caveats, and save the philosophy for more specific questions, if they ever come up.
Also, in this answer I am assuming that "normalization" implies "BCNF, 3NF, or at least 2NF", since that's the level of normalization that designers generally aim to achieve. It's rarer to see 4NF or 5NF designs; although they certainly aren't impossible goals, they concern themselves with the semantics of relationships rather than just their representation, which requires considerably more knowledge about the domain.
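To make that assumption concrete, here is a minimal sketch (hypothetical table and column names) of the kind of transitive dependency that a 3NF design removes:

```sql
-- Violates 3NF: City is determined by ZipCode, not by the key (CustomerId).
CREATE TABLE dbo.Customers_Unnormalized (
    CustomerId int PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,
    ZipCode    varchar(10)   NOT NULL,
    City       nvarchar(50)  NOT NULL   -- transitively dependent via ZipCode
);

-- The 3NF version moves the dependent attribute into its own table.
CREATE TABLE dbo.ZipCodes (
    ZipCode varchar(10)  PRIMARY KEY,
    City    nvarchar(50) NOT NULL
);

CREATE TABLE dbo.Customers (
    CustomerId int PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,
    ZipCode    varchar(10)   NOT NULL REFERENCES dbo.ZipCodes (ZipCode)
);
```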
So, onward and upward:
1. Why aren't some databases in the wild normalized?
The answer to this could be "because they shouldn't be", but making that assumption right off the bat is pretty piss-poor detective work. We wouldn't make very much progress as a society if we always operated on the assumption that whatever is, ought to be.
The real reasons that databases don't get normalized in the first place are more complicated. Here are the top 5 that I've come across:
The developers who designed it didn't know or didn't understand how to normalize. Strong evidence of this comes in the form of many other accompanying bad design choices, like using varchar columns for everything or having a spaghetti mess of meaningless table and column names. And I assure you, I've seen "real" databases that are every bit as bad as those in the TDWTF articles.
The developers who designed it didn't care or were actively against normalization on principle. Note, here I am not talking about instances where a deliberate decision was made not to normalize based on contextual analysis, but rather teams or companies where normalization is more-or-less understood but simply ignored or shunned out of habit. Again, surprisingly common.
The software is/was done as a Brownfield project. Many purists ignore this perfectly legitimate business (rather than technical) reason for not normalizing. Sometimes you don't actually get to design a new database from scratch; you have to bolt on to an existing legacy schema, and attempting to normalize at that point would involve far too much pain. 3NF wasn't invented until 1971, and some systems - especially financial/accounting systems - have their roots even further back than that!
The database was originally normalized, but an accumulation of small changes over a long period of time and/or a widely distributed team introduced subtle forms of duplication and other violations of whatever normal form was originally in place. In other words, the loss of normalization was accidental, and too little time was spent on refactoring.
A deliberate business decision was made not to spend any time on business analysis or database design and just "get it done". This is often a false economy and ultimately becomes a mounting form of technical debt, but is sometimes a rational decision, at least based on information that was known at the time - for example, the database may have been intended as a prototype but ended up being promoted to production use due to time constraints or changes in the business environment.
2. Why/when should a normalized database be denormalized?
This discussion often comes up when a database is normalized to start with. Either the performance is poor or there is a lot of duplication in queries (joins), and the team feels, rightly or wrongly, that they've gone as far as they can with the current design. It is important to note that normalization improves performance most of the time, and there are several options to eliminate excess joins when normalization appears to be working against you, many of which are less invasive and risky than simply changing to a denormalized model:
Create indexed views that encapsulate the most common problem areas. Modern DBMSes are capable of making them insertable or updatable (e.g. SQL Server INSTEAD OF triggers). This comes at a slight cost to DML statements on the underlying tables/indexes, but it is generally the first option you should try because it is nearly impossible to screw up and costs almost nothing to maintain. Of course, not every query can be turned into an indexed view - aggregate queries are the most troublesome - which leads us to the next item...
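First, though, a rough sketch of what an indexed view looks like in SQL Server, assuming hypothetical Orders and Customers tables (an illustration of the idea, not a drop-in design):

```sql
-- Hypothetical normalized base tables.
CREATE TABLE dbo.Customers (
    CustomerId   int PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL
);
CREATE TABLE dbo.Orders (
    OrderId    int PRIMARY KEY,
    CustomerId int NOT NULL REFERENCES dbo.Customers (CustomerId),
    OrderDate  datetime2 NOT NULL
);
GO

-- The view must be schema-bound before it can be indexed.
CREATE VIEW dbo.vw_OrderSummary
WITH SCHEMABINDING
AS
    SELECT o.OrderId, o.OrderDate, c.CustomerId, c.CustomerName
    FROM dbo.Orders o
    INNER JOIN dbo.Customers c ON c.CustomerId = o.CustomerId;
GO

-- Materializing the view; SQL Server maintains it automatically as the base tables change.
CREATE UNIQUE CLUSTERED INDEX IX_vw_OrderSummary ON dbo.vw_OrderSummary (OrderId);
```

Once the unique clustered index exists, the engine keeps the materialized result in sync on its own, which is why the ongoing cost is mostly a small tax on writes to the base tables.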
Create denormalized aggregate tables that are automatically updated by triggers. These tables exist in addition to the normalized tables and form a kind of CQRS model. Another CQRS model, more popular these days, is to use pub/sub to update the query models, which gives the benefit of asynchrony, although that may not be suitable in very rare instances where the data cannot be stale.
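A rough sketch of that trigger-maintained approach, again with hypothetical tables (only the INSERT path is handled here; a real trigger would also cover UPDATE and DELETE):

```sql
-- Hypothetical normalized table the application actually writes to.
CREATE TABLE dbo.OrderLines (
    OrderLineId int IDENTITY PRIMARY KEY,
    OrderDate   datetime2 NOT NULL,
    Amount      money     NOT NULL
);

-- Denormalized rollup that exists alongside it (the "query side").
CREATE TABLE dbo.DailySales (
    SalesDate   date  PRIMARY KEY,
    TotalAmount money NOT NULL DEFAULT 0
);
GO

CREATE TRIGGER dbo.trg_OrderLines_DailySales
ON dbo.OrderLines
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Fold the newly inserted rows into the rollup.
    MERGE dbo.DailySales AS tgt
    USING (SELECT CAST(OrderDate AS date) AS SalesDate, SUM(Amount) AS Amount
           FROM inserted
           GROUP BY CAST(OrderDate AS date)) AS src
    ON tgt.SalesDate = src.SalesDate
    WHEN MATCHED THEN
        UPDATE SET TotalAmount = tgt.TotalAmount + src.Amount
    WHEN NOT MATCHED THEN
        INSERT (SalesDate, TotalAmount) VALUES (src.SalesDate, src.Amount);
END;
```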
Sometimes, indexed views are not possible, the transaction rates and data volumes are too high to admit triggers with acceptable performance, and the queries must always return realtime data. These situations are rare - I'd hazard a guess that they might apply to things like High-Frequency Trading or law enforcement/intelligence databases - but they can exist. In these cases you really have no option but to denormalize the original tables.
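When it does come to that, the schema change itself is trivial; the ongoing cost is that every code path which writes the source column now has to keep the copy in sync. A hypothetical before/after, using the same sort of Orders/Customers tables as the earlier sketch:

```sql
-- Normalized: the hot query pays for a join on every execution.
SELECT o.OrderId, c.CustomerName
FROM dbo.Orders o
JOIN dbo.Customers c ON c.CustomerId = o.CustomerId;

-- Denormalized: copy the column onto Orders so the hot query touches one table.
ALTER TABLE dbo.Orders ADD CustomerName nvarchar(100) NULL;

UPDATE o
SET o.CustomerName = c.CustomerName
FROM dbo.Orders o
JOIN dbo.Customers c ON c.CustomerId = o.CustomerId;

-- From now on, anything that renames a customer must also update Orders.
SELECT OrderId, CustomerName
FROM dbo.Orders;
```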
3. In what situations is it harmful or unnecessary to normalize in the first place?
There are, in fact, several good examples here:
If the database is being used only for reporting/analysis. Typically this implies that there is an additional, normalized database being used for OLTP, which is periodically synchronized to the analysis database through ETL or messaging.
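For reference, the reporting side typically ends up as a star schema rather than a normalized model; a minimal sketch with hypothetical names:

```sql
-- Dimension table: intentionally denormalized, one wide row per product.
CREATE TABLE dbo.DimProduct (
    ProductKey  int PRIMARY KEY,
    ProductName nvarchar(100) NOT NULL,
    Category    nvarchar(50)  NOT NULL,   -- would be a separate table in 3NF
    Supplier    nvarchar(100) NOT NULL    -- likewise
);

-- Fact table: one row per sale, loaded periodically by the ETL job.
CREATE TABLE dbo.FactSales (
    SalesDate  date  NOT NULL,
    ProductKey int   NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    Quantity   int   NOT NULL,
    Amount     money NOT NULL
);
```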
When enforcing a normalized model would require an unnecessarily complex analysis of the incoming data. An example of this might be a system that needs to store phone numbers collected from several external systems or databases. You could normalize out the country code and area code, but you'd have to account for all of the different possible formats, invalid phone numbers, vanity numbers (1-800-GET-STUFF), not to mention different locales. It's usually more trouble than it's worth, so phone numbers are just shoved into a single field unless you have a specific business need for the area code on its own.
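In schema terms, the pragmatic choice usually looks something like this (hypothetical table; the commented-out columns are the decomposition you'd normally skip):

```sql
-- Store the number as captured; validation/parsing happens elsewhere, if at all.
CREATE TABLE dbo.ContactPhones (
    ContactId   int         NOT NULL,
    PhoneNumber varchar(50) NOT NULL   -- '+44 20 7946 0958', '1-800-GET-STUFF', ...
);

-- The fully decomposed version you usually don't need:
-- CountryCode varchar(5), AreaCode varchar(10), LocalNumber varchar(20), Extension varchar(10)
```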
When the relational database is primarily there to provide transactional support for an additional, non-relational database. For example, you might be using the relational database as a message queue, or to track the status of a transaction or saga, when the primary data is being stored in Redis or MongoDB or whatever. In other words, the data is "control data". There's usually no point in normalizing data that isn't actually business data.
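As a hypothetical illustration of what I mean by control data:

```sql
-- Control data, not business data: an outbox/queue table whose only job is to make the
-- handoff to the non-relational store transactional. Nothing here is worth normalizing.
CREATE TABLE dbo.OutboxMessages (
    MessageId  bigint IDENTITY PRIMARY KEY,
    SagaId     uniqueidentifier NOT NULL,
    Status     varchar(20)      NOT NULL,   -- 'Pending', 'Dispatched', 'Failed'
    Payload    nvarchar(max)    NOT NULL,   -- serialized document destined for Redis/MongoDB/etc.
    CreatedUtc datetime2        NOT NULL DEFAULT SYSUTCDATETIME()
);
```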
Service-Oriented Architectures that share a physical database. This is a bit of an odd one, but in a true SOA, you will occasionally need to have data physically duplicated because services are not allowed to directly query each other's data. If they happen to be sharing the same physical database, the data will appear not to be normalized - but generally, the data owned by each individual service is still normalized unless one of the other mitigating factors is in place. For example, a Billing service might own the Bill entity, but the Accounting service needs to receive and store the Bill Date and Amount in order to include it in the revenue for that year.
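Sketched out with hypothetical schemas, that duplication is deliberate rather than sloppy:

```sql
CREATE SCHEMA Billing;
GO
CREATE SCHEMA Accounting;
GO

-- The Billing service owns the Bill entity...
CREATE TABLE Billing.Bills (
    BillId     int   PRIMARY KEY,
    CustomerId int   NOT NULL,
    BillDate   date  NOT NULL,
    Amount     money NOT NULL
);

-- ...while Accounting stores its own copy of the fields it needs for revenue reporting,
-- populated from an event/message rather than by querying Billing's tables directly.
CREATE TABLE Accounting.RevenueItems (
    BillId   int   PRIMARY KEY,   -- same identifier, duplicated by design
    BillDate date  NOT NULL,
    Amount   money NOT NULL
);
```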
I'm sure there are more reasons that I haven't listed; what I'm getting at, in essence, is that they are quite specific and will be fairly obvious when they come up in practice. OLAP databases are supposed to use star schemas, SOAs are supposed to have some duplication, etc. If you're working with a well-known architecture model that simply doesn't work with normalization, then you don't normalize; generally speaking, the architecture model takes precedence over the data model.
And to answer the very last question:
> Is it true that good architects and experts choose a denormalized design while non-experienced developers choose the opposite? What are the arguments against starting your design with normalization in mind?
No, that is complete and utter B.S. It's also B.S. that experts always choose a normalized design. Experts don't just follow a mantra. They research, analyze, discuss, clarify, and iterate, and then they choose whatever approach makes the most sense for their particular situation.
The 3NF or BCNF database is usually a good starting point for analysis because it's been tried and proven successful in tens of thousands of projects all over the world, but then again, so has C. That doesn't mean we automatically use C in every new project. Real-world situations may require some modifications to the model or the use of a different model altogether. You don't know until you're in that situation.