52

I have a database with a 1:m relationship.

I have to display a list of parents to the user rapidly on a home screen at startup. Each parent shows a single piece of information: the sum of a particular child field for that parent.

I don't want the expense of making large queries against the child table (essentially running calculations over the entire child table) just to show the parents. At the same time, I feel it's important to show the sum to the user on the home screen.

I have therefore denormalised the sum of all children for a particular parent and added it as a field in the parent table. Each time a CRUD operation is performed on the child table (for a child belonging to a particular parent), I recalculate the sum and write the new value into that parent's field.
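Roughly, the update side of this looks like the following sketch (simplified; the table and column names here are just placeholders, not my real schema):

```python
import sqlite3

def add_child(conn: sqlite3.Connection, parent_id: int, amount: float) -> None:
    """Insert a child row, then recalculate and store the parent's cached sum."""
    with conn:  # run both statements in one transaction
        conn.execute(
            "INSERT INTO child (parent_id, amount) VALUES (?, ?)",
            (parent_id, amount),
        )
        # Recalculate the sum over all of this parent's children and write it
        # back into the denormalised column on the parent row.
        conn.execute(
            "UPDATE parent SET child_amount_sum ="
            " (SELECT COALESCE(SUM(amount), 0) FROM child WHERE parent_id = ?)"
            " WHERE id = ?",
            (parent_id, parent_id),
        )
```

The home screen then only reads the parent table (e.g. `SELECT id, name, child_amount_sum FROM parent`) and never touches the child table.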

Is this an anti-pattern and therefore 'bad' practice? I felt I was doing the right thing by prioritising performance for the UI.

tommytucker7182
  • 27
    Did you measure the performance (both of a SUM query and the impact on the CRUD operations)? I suspect ensuring there are appropriate indexes will give you good enough performance for a page's worth of parent info – Caleth Jun 14 '21 at 16:20
  • I haven't measured anything yet, to be honest. The update expense on a CRUD operation doesn't really bother me, as that will be hidden from the user on a background thread. The software is designed to allow users to add as many parents as they need (and as many children to any parent as necessary) - so the size of the parent list could end up quite large. I read an article a few days ago (I can't remember where) mentioning that this denormalised approach was basically shoddy design... I'm fairly new to software, so I want to understand the engineering best practice where I can. – tommytucker7182 Jun 14 '21 at 16:47
  • 2
    @tommytucker7182 [Here is a good read about the subject.](https://vertabelo.com/blog/denormalization-when-why-and-how/) – T. Sar Jun 14 '21 at 17:03
  • 4
    Fantastic - thanks so much... I'm seeing now that I made performance assumptions that I shouldn't have. I'm currently researching the best procedure to performance-test both options before I go any further. – tommytucker7182 Jun 14 '21 at 17:21
  • 2
    The transformation you describe is not one of the things referred to as denormalization. It can reasonably be said to introduce certain redundancy. Denormalization from higher NFs to lower can reasonably be said to introduce certain redundancy, but you are not introducing that kind of "redundancy". Normalization to higher NFs replaces a table by others that natural join back to it. Denormalizing from higher NFs replaces tables by their natural join. Adding a child count to the child table is denormalization, but adding it to the parent table isn't. (But "denormalization" is often misused in this way.) – philipxy Jun 14 '21 at 21:50
  • 11
    A not so funny read about where this would've been preferred. Although this seems to be NoSQL, the principle remains the same. TL;DR, the website went viral and incurred over 30k USD in a Firebase bill in 48 hours. https://hackernoon.com/how-we-spent-30k-usd-in-firebase-in-less-than-72-hours-307490bd24d – jaskij Jun 15 '21 at 00:15
  • 10
    Don't forget to write a script or program which inserts millions of test records into your database to test average and worst-case scenarios. All software is fast if you test it with 1,000 records. It gets interesting if you test it (early) with 5 or 10 million records. – some_coder Jun 15 '21 at 08:16
  • 4
    Note that instead of performing the update manually on every CRUD operation, you may want to consider using triggers or other mechanisms natively supported by your database server. This way you don't have to worry about forgetting to update the sum. – Michael Mior Jun 15 '21 at 13:37
  • 1
    As well as all the great advice, you may want to investigate columnar/columnstore indexes to speed up the aggregation. – Steve Ford Jun 15 '21 at 21:21
  • 1
    @JanDorniak Thankfully that bill was covered by Google Grants and didn't take the company down. The code had a looping DB access for aggregate data and made billions of DB calls. Since it was a scaling DB, the usage scaled up with volume and incurred a huge bill. On a "normal" server, the site would've gone down, which would've been equally bad. – Nelson Jun 16 '21 at 07:42
  • 4
    The real anti-pattern here seems to be [premature optimization](https://softwareengineering.stackexchange.com/q/80084/73708) – J... Jun 16 '21 at 21:32
  • 1
    I think another interesting solution would be the use of computed columns. You could create a function that selects the sum of the children, then reference that function in the computed column expression. If this works as I think it would, you should be able to access this sum of the children's values as you would any other column. – Todd Jun 17 '21 at 17:43
  • Have you considered adding an index with the required fields? This way, you only need to parse... well... what needs parsing. This is what DB indexes were created for. When you denormalize the originals, you are risking worse performance in future situations, among other things. Indexes let you shape the data however you want, while keeping the original DB spotless. – Nate T Jun 18 '21 at 21:14

6 Answers

94

Is denormalisation for performance reasons an anti-pattern? Not in and of itself - if something is required, then you have to find a way to do it, and it may well be better to denormalise your data than to spend $$$ on a SuperHugeDatabaseInstance to get the speed you need. I've certainly done this in the past when we were calculating summary data for billions of data points.

Is denormalisation for performance reasons before you've measured things and found out if you actually need to do it an anti-pattern? Yes, every time.

Philip Kendall
88

Performance requirements are legitimate requirements, and it's great that you have found a potential way to meet them. Denormalization is a tool, not an anti-pattern. But it likely involves tradeoffs, and you should consider them before committing to this solution.

For example, I'm worried about maintaining consistency when the same information is expressed through multiple fields:

  1. What is your “source of truth”? What happens if the sum in the parent gets out of sync? Can the cached values be recalculated from scratch?
  2. If there is a consistency problem, what would the impact be?
  3. How will you ensure that parents get updated whenever a child changes? Would that be triggers or materialized views in the DB, or is this the responsibility of the applications making queries?
  4. Will you use transactions/locks to ensure that updates are performed atomically? (See the sketch after this list.)
  5. Might transactions/locks lead to lock contention on parents with lots of children, thus reducing performance?
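For questions 4 and 5: if the application maintains the sum itself, the child write and the parent update should at least share one transaction. A minimal sketch of that, assuming an SQLite-style setup with placeholder table and column names:

```python
import sqlite3

def add_child_atomically(conn: sqlite3.Connection, parent_id: int, amount: float) -> None:
    # One transaction around both statements: either the child row and the
    # cached sum both change, or neither does.
    with conn:
        conn.execute(
            "INSERT INTO child (parent_id, amount) VALUES (?, ?)",
            (parent_id, amount),
        )
        # Incremental update: adjust the cached sum rather than recomputing it.
        conn.execute(
            "UPDATE parent SET child_amount_sum = child_amount_sum + ? WHERE id = ?",
            (amount, parent_id),
        )
```

In a database with row-level locking, that `UPDATE` holds a lock on the parent row until commit, which is exactly where the contention in question 5 can come from when one parent has many concurrently written children.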

Many scenarios can sidestep many of these problems, either because they don't have such stringent performance requirements or because the business problem can tolerate eventual consistency. Also, workloads that mostly read a value and rarely modify it are typically easier to optimize.

You might also consider whether this is the most appropriate way to achieve your performance goals:

  1. Is the query just slow because the child table is not properly indexed? Does the query plan (“explain” statement) look reasonable? (See the sketch after this list.)
  2. Is the query really on the critical path, or could it be performed asynchronously?
  3. Is the speed actually acceptable, but the GUI merely feels slow because there are no transition animations?
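On the first point, checking the plan is cheap compared to restructuring the schema. A rough sketch of what that looks like (SQLite syntax; the table and column names are placeholders, and other databases have `EXPLAIN` / `EXPLAIN ANALYZE`):

```python
import sqlite3

conn = sqlite3.connect("app.db")  # assumes a child(parent_id, amount) table already exists

# Without an index on child.parent_id, summing one parent's children
# typically shows up in the plan as a full scan of the child table.
for row in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT COALESCE(SUM(amount), 0) FROM child WHERE parent_id = ?",
    (42,),
):
    print(row)

# An index on the foreign key usually turns that into an index lookup.
conn.execute("CREATE INDEX IF NOT EXISTS idx_child_parent ON child (parent_id)")
```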
amon
  • Thanks for the questions - I have answers to most of them; some I need to reconsider after my performance testing is completed. – tommytucker7182 Jun 15 '21 at 10:11
  • 4
    Answer accepted because of the comprehensive list of questions to consider - thanks! – tommytucker7182 Jun 15 '21 at 10:24
  • 6
    Excellent post. I've found that denormalization schemes almost always become terrible pain points when these questions have not been thoroughly answered. Another thing to consider is whether these computed values should actually take the form of denormalization, rather than being stored separately (perhaps in a store specifically designed for caching). – Matthew Read Jun 16 '21 at 21:27
27

Premature optimization is the root of all evil - most of it, anyway - in computer science. ~Donald Knuth

Denormalizing aggregate data to avoid the aggregate function is not an anti-pattern. The anti-pattern is doing so without first establishing that the aggregate function is infeasibly slow.

First, define "quite large". Hundreds of children? Thousands? Trivial. Modern DBMS software on modern hardware can make short work of even naive implementations of the aggregate at these scales. If by "quite large" you mean hundreds of thousands or millions of rows, then we may be talking about enough raw data for a more custom solution.

A related consideration is multi-level aggregates. If the sum isn't just of children, but children of children, then the Big-O shape of the aggregate function is the number of children raised to the power of the maximum number of "generations" of child data involved in the aggregate. A thousand children of one parent is trivial, but a thousand rows with a thousand children is a million rows that have to be queried. Add one more layer and you're summing a billion rows for just three referential generations. That assumes constant access time for each record, which is rarely true in RDBMSes; best-case is usually logarithmic if indexed, linear if not, so basically add another polynomial degree to the search for tablescans. If all parents reference all children in each generation, this billion-row calculation can be required from a worst-case table of just 3000 rows, making a trivial-looking table a computational nightmare.

Last is the shape of the aggregate function itself. Most "built-ins" like SUM, AVG, STDEV etc. are linear in the number of inputs, but there are "aggregate" functions (defined loosely as algorithms digesting a series into a scalar) with higher computational complexity. Regression analyses tend to be N^2 or worse, and certain specialized scalar calculations can be higher-order polynomial or even NP-hard.

I put aggregate function complexity last because you typically have less control over the calculation you need than over its inputs. However, a common optimization is to reduce the problem from a complete calculation to an incremental one. Calculating the sum of a million random data points is a million addition operations, and there really isn't any way around that. But calculating the sum of a million data points, given the sum of 999,999 of them and the millionth value, is basically constant-time.

This is how that aggregate column is going to help you, if it is really needed. A trigger on insert, update, or delete of a child record that sets the sum field of the parent record(s) to (currentSum - deleted.ValueToSum + inserted.ValueToSum) makes maintaining these sums dramatically less complex than calculating the value from scratch. This improves read speed for the aggregate value with only a small impact on write speed (writes are affected by both triggers and indexes).
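A rough sketch of such triggers (SQLite flavour, where T-SQL's `inserted`/`deleted` become `NEW`/`OLD`; `parent`, `child`, `amount`, and `child_amount_sum` are placeholder names, and the sketch assumes children never move between parents):

```python
import sqlite3

conn = sqlite3.connect("app.db")
conn.executescript("""
-- Keep parent.child_amount_sum incrementally up to date.
CREATE TRIGGER IF NOT EXISTS child_sum_ins AFTER INSERT ON child BEGIN
    UPDATE parent SET child_amount_sum = child_amount_sum + NEW.amount
    WHERE id = NEW.parent_id;
END;

CREATE TRIGGER IF NOT EXISTS child_sum_del AFTER DELETE ON child BEGIN
    UPDATE parent SET child_amount_sum = child_amount_sum - OLD.amount
    WHERE id = OLD.parent_id;
END;

CREATE TRIGGER IF NOT EXISTS child_sum_upd AFTER UPDATE OF amount ON child BEGIN
    UPDATE parent SET child_amount_sum = child_amount_sum - OLD.amount + NEW.amount
    WHERE id = OLD.parent_id;
END;
""")
```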

Again, all of this is contingent on a simple, intuitive, "naive" solution not being good enough. To determine that, you first have to define "good enough", usually "some insignificant fraction of the total load time of the UI view", and test your candidate implementation to show that it definitely doesn't meet the definition. So, start with a computed column defined as a subquery of the record's children (basically a stored subquery giving you the advantage of a cached execution plan). If that isn't fast enough, make sure you are including the value you're summing in an index of the child table on the parent ID. If that's still not fast enough, then consider replacing the computed column with incremental calculation in a trigger or the create/update SP.
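For the indexing step above, a covering index that includes the summed value might look like this (SQLite syntax; placeholder names again):

```python
import sqlite3

conn = sqlite3.connect("app.db")
# A composite index on (parent_id, amount) lets the database answer
# SELECT SUM(amount) ... WHERE parent_id = ? from the index alone,
# without reading the child table's rows.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_child_parent_amount "
    "ON child (parent_id, amount)"
)
```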

KeithS
  • 2
    "Modern DBMS software on modern hardware can make short work of even naive implementations of the aggregate at these scales." It may be that aggregating 1000 children is a fast query, but it's still about 1000 times worse than just loading a single row by primary key. If this query is in a frequently used path, denormalization could decrease the cost of this query by three orders of magnitude. Don't be so quick to assume optimization is premature. – Phil Frost Jun 15 '21 at 20:02
  • 2
    On the other hand, this answer could make a good illustration for Wirth's law. – Phil Frost Jun 15 '21 at 20:04
  • I forgot to say thanks for the incredibly comprehensive answer! Much appreciated. It turns out my DB is quite trivial, so this has given me some much-needed perspective. – tommytucker7182 Jun 16 '21 at 10:42
  • "This is how that aggregate column is going to help you". You're overlooking the read/write ratio. If reads>writes, then an aggregate is a win regardless. – Mooing Duck Jun 17 '21 at 14:29
  • @PhilFrost - I agree with everything you've said; that's why there are a few "never do's" regarding algorithm complexity, and performance/load testing is paramount in performance-critical applications. Re: Wirth's Law, I agree there as well, but the explanation is simple; hardware is cheaper than developer time. As with many creative pursuits, the result's never perfect, you just run out of time to get it there. – KeithS Jul 01 '21 at 20:44
23

Most relational databases have something called materialized views. A materialized view has the database precompute a query and keep the result around for quick responses. The database is then responsible for keeping the view consistent with the underlying data, including any complications around atomicity and transactions. It is basically database-managed caching.

Read the documentation for your database, because this can get complicated.

Using Materialized Views is no more a denormalization antipattern than using indices. But making the application layer ensure data consistency when the database could be doing it: that is an antipattern.
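As a rough sketch of what that can look like (PostgreSQL syntax via psycopg2; the table and column names are placeholders, and other databases handle refreshes differently):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")

with conn, conn.cursor() as cur:
    # Precompute the per-parent sums once and store the result.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS parent_child_sums AS
        SELECT parent_id, SUM(amount) AS child_amount_sum
        FROM child
        GROUP BY parent_id
    """)

with conn, conn.cursor() as cur:
    # In PostgreSQL the view must be refreshed explicitly when data changes;
    # other engines can refresh on commit or on a schedule.
    cur.execute("REFRESH MATERIALIZED VIEW parent_child_sums")
```

The home screen query then joins `parent` against `parent_child_sums` instead of aggregating the child table on every load.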

dspeyer
  • 3
    I wasn't aware of materialized views, so I will look them up. I have a lot of things to do from this one post; it's been great, I've gotten some education from it and a lot of things to read up on and research! Thanks for taking the time to comment. – tommytucker7182 Jun 15 '21 at 09:49
  • 7
    As a caveat, there are substantial differences between DBs. T-SQL has materialized views as presented in this answer, which work as a more flexible index. Postgres has materialized views as well, but they must be refreshed explicitly. Oracle has a wide variety of options for refreshes (e.g. on commit or on a schedule). MySQL doesn't have materialized views but can emulate them using triggers or events. So while materialized views in Oracle and SQL Server can be kept automatically consistent, this feature is more geared towards data warehousing use cases where up-to-date data isn't required. – amon Jun 15 '21 at 17:27
  • Materialized views in SQL Server may not be picked without a hint. They can be useful, but shouldn't be your first try unless you know what you are doing. – jmoreno Jun 16 '21 at 01:20
  • That's an under-appreciated answer that probably best addresses OP's real problem, but maybe you can add an explicit answer to the question as asked? Something along the lines of "It's an anti-pattern if you prefer de-normalization over other mechanisms your database already gives you to handle redundancy in a controlled manner, such as materialised views." – xLeitix Jun 17 '21 at 09:41
4

If you are pulling data from a database to display in an app and hitting up against performance issues of this kind, I would seriously consider a caching layer. It's significantly faster to retrieve a value from a cache than from the database.

At its simplest, the cache would store the children's total against the parent's ID whenever that total is calculated, and a CRUD operation that changes a parent's children would clear the relevant values from the cache. When fetching the data for the UI, first check whether it's in the cache, and if not, run the database sum query.

This solution would open up further performance improvements as you find other things that could easily be cached.
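A bare-bones sketch of that idea, just to make it concrete (an in-process Python dict as the cache; a real application might use something like Redis or memcached instead, and all names here are placeholders):

```python
import sqlite3

# parent_id -> cached sum of that parent's children
_sum_cache: dict[int, float] = {}

def child_sum(conn: sqlite3.Connection, parent_id: int) -> float:
    """Return the cached sum, falling back to the database on a cache miss."""
    if parent_id not in _sum_cache:
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM child WHERE parent_id = ?",
            (parent_id,),
        ).fetchone()
        _sum_cache[parent_id] = row[0]
    return _sum_cache[parent_id]

def on_child_changed(parent_id: int) -> None:
    """Call after any CRUD operation on that parent's children."""
    _sum_cache.pop(parent_id, None)  # invalidate; the next read recomputes
```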

  • 2
    I'd disagree that a separate caching layer is always required. Major databases have their own caching implementations. For example, views in MS SQL Server can be cached in memory, making them just as fast as some extra caching layer. (Plus they would be a little easier to keep in sync.) – SwissCoder Jun 16 '21 at 13:56
  • 2
    I was going to suggest this approach; a local cache for just the child totals makes more sense than denormalizing the tables. Even if DBs have cache policies, it is more transparent for the developer to control their own cache. – lurscher Jun 17 '21 at 06:29
0

On whether denormalization is seen as an anti-pattern or not, I will refrain from commenting.

That said, there are many reasons to denormalize datasets. Optimizing for performance, storage and/or reliability should most often be the goal when doing so. (See: CAP theorem and data storage)

The question of whether or not it is pragmatic to do so depends heavily on how you decide to denormalize the data. Looking at the natural access and query patterns driven by your users, are there logical mappings to the dataset?

For example, if you have millions of products, do these products have some taxonomy like nested categories? Chiefly: does your user-facing navigation already divide cleanly into a sorted structure? Then make the leaf nodes of that structure the primary identity of each record.

Very, very often we are already "sharding" traffic or lookups by some hierarchy that is more convenient for users. Users tend to be humans, just like developers.

It turns out these same groupings are also quite useful for scaling out, by further distributing the categories that use the most resources. After that point, if you need to, you can always normalize within the categories and use widely available algorithms. You end up with a flexible structure that scales based on the needs of your project's domain.