1

We're starting to record history in an about 5 y/o internet system built with ASP.NET Core and SQL Server.
Our current requirement is to save all records from a specific table, plus two fields from two other tables.
We've tried using Temporal Tables, but since in some scenarios the user can change history records - that didn't work.

Here are the main requirements we have:

  1. In some scenarios the users (-admins) can change the history.
  2. Users can save future records. That is - records that will be relevant in the future (they can say a student will leave a boarding school a week from now).
  3. Saving the data is performed from several modules (different services - some code and some SQL)
  4. We will need to record history for many other data points in the future.

So here are the two alternatives we came up with:

  1. Move the extra two columns from their tables to the main table, and just write every change to that table.
    This means all the data will be in one table. Since there are future records retrieving becomes a bit complicated, but once we'll create a view for it, that problem is solved.
  2. Add a history table and a future table, both with only the data that we need to save history for. The current state will be retrieved as it is now.
    This means having a daily job to move future records to current table when they become relevant, and current records to history when they are updated (except for when they are updated because the user made some mistake, in which case we just override the record).

Here are the main factors we think are relevant to making the decision:

  1. Retrieving current data using the first alternative takes about twice as the second alternative (and retrieving current data is 95% of retrievals performed).
  2. Creating a job and some mechanism to move records from current to history is a relatively big complication required in the second alternative, but avoided in the first. It might become a significant source for bugs.
  3. There's a big difference in the time it will take to implement the two alternatives, but considering the impact of the decision, it's negligible.

...And that's where we're stuck.. The first factor is a big plus for the second alternative, and the second a big minus.

From your experience which way is better? Why?

Doc Brown
  • 199,015
  • 33
  • 367
  • 565
Oren A
  • 270
  • 1
  • 13
  • _"1. In some scenarios the users (-admins) can change the history."_ Define "admin". Is the the application administrators (i.e. devs or IT support), or is this an end user with some additional rights in the application itself? What I'm trying to get at here is whether what you call an "admin" can only access the data using the end user application or not. – Flater Aug 10 '21 at 09:01
  • There's a whole lot of "we'll just do that in SQL" pervasive attitude in this answer, and it is unclear to me whether this is a must or just a habit. A solution to this problem would be significantly easier/cleaner when only letting the application access the database, because then you know for a fact that all database operations must pass through the business logic, which means you get a whole range of options on how to tackle this reliably. However, this is where I need concrete context. Are you in a position where moving away from direct SQL operations is possible? – Flater Aug 10 '21 at 09:04
  • 2
    It is also unclear to me how future records are related to the problem of historical change tracking. The only difference between a current record and a future one is an arbitrary datetime column with a particular value, which doesn't force you to treat these differently in terms of change tracking _unless you choose to do so_ (which makes no sense, if the same solution works for both). Unless I'm missing something, the future records are not relevant to the problem of change tracking and/or are trivially avoided using a basic datetime filter even if relevant. – Flater Aug 10 '21 at 09:08
  • @Flater - admins are end users. All users access the data through the application. – Oren A Aug 10 '21 at 09:24
  • 1
    @Flater - We can theoretically move away from SQL, but we find solving everything in SQL simpler. We create a view for retrieving the data once - and we can get it everywhere. As for updates and inserts - since we perform them from code and from SQL, we might need to implement them twice, but we generally prefer to perform everything once, so SQL seems better. Unless I'm missing out on something – Oren A Aug 10 '21 at 09:38
  • 1
    @Flater - I mentioned the need for future records because it complicates a bit retrieving from a single table (you'll need the max date which is earlier than today), and it makes more sense to have another table for it in the second alternative than to have a "not present table". But I agree that it doesn't have an impact on the bottom line. – Oren A Aug 10 '21 at 09:41
  • _"We create a view for retrieving the data once - and we can get it everywhere."_ The same is true for your business logic when it becomes the sole gate that provides access to the data in the database. _"As for updates and inserts - since we perform them from code and from SQL, we might need to implement them twice, but we generally prefer to perform everything once, so SQL seems better."_ I'm suggesting to do it once too, but to do it only in your business logic. This brings many benefits but is defeated when anyone freely circumvents it by running direct queries. – Flater Aug 10 '21 at 11:44
  • Overall, I get the feeling that this dependency on direct SQL execution is a crutch in absence of a good development process, in the same sense that someone with an old rickety car is more inclined to pop the hood and tinker with the engine than someone who buys a pristine high end vehicle. If you never think outside of having a rickety car, then you'll never really understand how you could live without needing to tinker with your car. – Flater Aug 10 '21 at 11:46
  • 2
    @Flater - I would think the fact we have SQL SP that retrieves/inserts/updates data means if we want it to be written once is should be using SQL, as using c# code from SQL is not best practice to say the least (if possible at all).. It's also more convenient to a developer or system analyst who's trying to understand what's the status of a certain entity, or solve a bug, to just run a view rather than debug.. I really don't see why we would want to implement that logic in code. – Oren A Aug 10 '21 at 14:22
  • Your last comment is effectively confirming my "why take my car to the garage if I can pop the hood myself" analogy. I can't claim that you can't possibly fix your car yourself, I can only offer the suggestion that a garage might have better tools at its disposal than you do at home. If you want to stick with using SQL directly at all costs, good luck. I never said that it can't be done, just that you seem to be (in my honest opinion) overly eager to use (and keep using) SQL directly without considering if there are easier avenues. But, as I said, your way is not by any means forbidden. – Flater Aug 10 '21 at 14:28
  • _"I really don't see why we would want to implement that logic in code."_ Sorry if this sounds snippy, but I'd suggest to not pre-emptively avoid that which you don't quite see/understand yet. This leads to "this is how we've always done it" behavior, which is one of the leading causes of technical stagnation in development teams and developers alike. The main benefit of business logic over SQL is that you have a much wider toolkit at your disposal, which can also more easily be incorporated into your application features (e.g. the admins changing the historical entries). – Flater Aug 10 '21 at 14:30
  • Part of the reason I'm trying to exclude direct SQL usage is that it is so very easy for anyone directly accessing the database to override any measure you put in place. In turn, this means you can't rely as much on everything sticking to the predetermined format, which in turn makes it more likely for errors to slip in. Comparatively, if you develop consistent business logic, and everyone passes by this business logic, you can have much more confidence that the rules are being followed at all times and not being circumvented by a rogue overzealous SQL user. Audit fields are a great example. – Flater Aug 10 '21 at 14:34
  • _"It's also more convenient to a developer or system analyst who's trying to understand what's the status of a certain entity, or solve a bug, to just run a view rather than debug."_ If you are this confident about the view being the solution to finding the source of the issue, then any application rendering the output of this view would be equally capable of providing the same solution. Debugging vs viewing SQL data is comparing apples and oranges, as they pull diferent things into focus (code vs data). Secondly, business logic can be rigorously tested. SQL database configuration much less so – Flater Aug 10 '21 at 14:37
  • 1
    It looks a but weired that retrieving current data using the first alternative will slow down your system that much. Do you have added proper indexes for fast retrieval of "current" data? How are "current records" distinguished from historical or future records? – Doc Brown Aug 11 '21 at 14:19
  • 1
    @DocBrown I've consulted with our DBA. I've added indexes. They are distinguished only by their date – Oren A Aug 12 '21 at 06:02
  • @OrenA: "I've added indexes" - right now, after I wrote my comment, or before you asked the question? And are you still experiencing the slowdown? And is this slowdownn really relevant for you? – Doc Brown Aug 12 '21 at 06:12
  • @DocBrown I wouldn't base this huge decision on my SQL skills. I've consulted with the DBA before asking the question, including about the indexes. The data in the question is based on that. It's very relevant, the first page the users encounter is a list based on that retrieval. – Oren A Aug 12 '21 at 06:24
  • 1
    `Retrieving current data using the first alternative takes about twice as the second alternative` The only comparison that matters is the one you do with current timings. If the increase is not meaningful enough, maybe it's acceptable to mix everything in the same table. You mentioned some insights about each solution but you have not mentioned what are the red lines the company is not willing to cross. Some decisions are about the lesser evil rather than the greater good. Another question to make is: What's the growth factor of the data and what's the relative volume of each kind of record? – Laiv Aug 12 '21 at 20:13
  • 1
    When I say red lines, I'm referring to the non-negotiable "non-functional" requirements. It's performance? it's simplicity? it's maintainability? What are you willing to renounce to make it work. I do ask this because I found the 3 factors not to be relevant. Probably because I'm missing something more important than those factors. You argue that #2 is complicated and a significant source of bugs. Well, I could say the same of every single change we do to running code and legacy code already in production. – Laiv Aug 12 '21 at 20:19
  • @Laiv I compared the performance after getting the DB to hold data as we expect it about 10 year from now. I faked history and future records for the comparison. My company haven't put any red lines. They know what we're doing is costly, but the cost will be measured compared to the other solution. If they'll say "we are willing to accept no more than 2 seconds more in the data retrieval" - the decision is a no-brainer. So it's open until we'll have all aspects, and then we'll decide. Obviously if it will take a minute instead of 7 seconds there's not much choice, but we're not there.. – Oren A Aug 15 '21 at 06:31
  • @Laiv There aren't ANY red lines when it comes to non functional requirements, it's a delicate balance and I'm responsible for recommending which way to go. I thought there's like a default way most organizations go for, but I now realize there isn't. The question however still holds. – Oren A Aug 15 '21 at 06:38
  • @Laiv As for my bugs argument - everything CAN be a source for bugs, but you'll agree for example that retrieving some data from your DB is less buggy than receiving it from some API, just because there are less things that may fail. Having a daily job, and some service to move records from current to past have more mechanisms that must work than just writing everything to some table. That was my argument. – Oren A Aug 15 '21 at 06:41
  • Looks like you already chose. – Laiv Aug 15 '21 at 10:24
  • @Laiv - Not at all.. Just clarified my assumptions and comments.. – Oren A Aug 15 '21 at 11:15
  • Related: [Ways to have a history of changes of database entries](https://softwareengineering.stackexchange.com/questions/156220/ways-to-have-a-history-of-changes-of-database-entries) – Doc Brown Aug 18 '21 at 12:12
  • See also [How to version control a record in a database](https://stackoverflow.com/questions/323065/how-to-version-control-a-record-in-a-database) – Doc Brown Aug 18 '21 at 12:14
  • And [How to Store Historical Data](https://stackoverflow.com/questions/3874199/how-to-store-historical-data) – Doc Brown Aug 18 '21 at 12:15

1 Answers1

2

From your experience which way is better? Why?

A single table

Because:

  • Simplicity is more important than performance. Start with simple, then optimize performance. (Simple = less work, less complexity => less maintenance, fewer bugs, easier to understand by future programmers)
  • You have one table structure, so start with one table. That's simple.
  • It seems like you have overlooked a range of simple performance improvement options.

Options for performance enhancing the simple solution:

  • Indexing - as others have suggested
  • View Optimisation - review query plan because you can improve the structure of the SQL itself too
  • Table Partitioning - Use the Database Tools, don't do it yourself.
  • Materialized View - Use the Database Tools, don't do it yourself.
  • Secondary Table - Like a Materialized View, but instead you manually build the materialized table with a batch process.

Remember, for all performance optimisation: measure then improve.


Here are some reasons why you might consider "3 tables":

  • Read/Write duty - both i) separate process groups reading/writing two history vs future; and, ii) at high volume. Separate tables should result in less lock contention. But this should be considered a progression of performance optimisation, after you work through the simpler measures first - start with Simple.
  • Security Partitioning - where data access roles need to be enforced i) in the database; and, ii) at the table level, differently for history vs future. In your situation, you should be doing that with VIEWs, not at the table level.
  • Separate Databases - where the History table is in one SQL Server instance, and the Future table is on another SQL Server instance. Which doesn't seem to be the case here.
Kind Contributor
  • 802
  • 4
  • 12
  • Simplicity is subjective here. Mixing up three different things in the same table leads to give three different meanings and explanations every single time the table is involved. That transalted into code lead to a lot of boiled plate code and unecessary if/else blocks in places where they aren't supposed to be. Not mentioning the burden of transfering this knowledge, easly, among developers. Technically, over the paper, seeme simpler, but in practice could be not. Hard to say with so little knowledge of the applocation and the team in charge. – Laiv Aug 15 '21 at 10:28
  • @Laiv Simplicity is quite objective, as applied to the context of the Question. "This means having a daily job to move future records to current table when they become relevant, and current records to history when they are updated (except for when they are updated because the user made some mistake, in which case we just override the record).". I go with the information I have, and your imagined complexity is quite unlikely given that the original post describes use of a View. – Kind Contributor Aug 16 '21 at 04:35