12

I'm relatively fresh out of college, so most of my familiarity with relational databases comes from my databases course, where anything not in BCNF or 3NF is a travesty. Certainly that's one extreme, but my team at work really seems to take it to the complete opposite one.

In our microservice db schemas, entities rarely have more than a single table. Anything that you would commonly normalize into another table is stored in a json column. If it's later discovered that one of the properties in this json needs to be queried on, a new column is added and the data is stored in both places (yes, in two different columns in the same table).
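
To make the pattern concrete, here's a minimal sketch of what one of these tables ends up looking like (PostgreSQL syntax; the table and property names are invented for illustration):

```sql
-- Hypothetical example of the pattern: the whole entity lives in a json
-- column, and "status" was later promoted to its own column for querying,
-- so the same value is now stored in two places.
CREATE TABLE orders (
    id      BIGINT PRIMARY KEY,
    payload JSONB NOT NULL,  -- e.g. {"status": "shipped", "items": [...]}
    status  TEXT             -- duplicated from payload->>'status'
);

-- Nothing enforces agreement between the two copies, so they can drift:
UPDATE orders SET status = 'cancelled' WHERE id = 42;
-- ...payload->>'status' still says 'shipped'.

-- On PostgreSQL 12+, a generated column (used instead of the plain status
-- column above) derives the value from the json, which removes the drift:
-- status TEXT GENERATED ALWAYS AS (payload->>'status') STORED
```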

In a lot of cases these json columns have a definite advantage. If you never need to query on that data and never have to make a unilateral change to it (which is something you obviously can't predict), they're not a bad idea. Plus, many of our services either see little load or are hosted on machines with an obscene amount of disk space for what they need, so data duplication isn't a huge issue (though it's something I'd generally like to avoid on principle).

Currently we're building a service that matches rules based on a set of conditions they own and then performs a set of actions associated with those rules when the rules are true (i.e. all of their conditions are true). The sub-team most immediately building this service believes there's a substantial benefit to normalizing actions and conditions away from rules in the schema. Obviously these tables maintain foreign key relationships with the rule id. From our perspective, we can avoid duplicating conditions, which ensures they're only evaluated once, and it's easy to find the conditions and rules we need when we need them without pulling every single rule out and doing the searching in memory.
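
To illustrate, here's a minimal sketch of the schema my sub-team has in mind (PostgreSQL syntax; names are invented, and I'm assuming conditions are shared between rules through a join table, since that's what lets us deduplicate them):

```sql
CREATE TABLE rules (
    id   BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

-- Each distinct condition is stored (and thus evaluated) exactly once.
CREATE TABLE conditions (
    id         BIGSERIAL PRIMARY KEY,
    expression TEXT NOT NULL UNIQUE
);

-- Many-to-many link between rules and the conditions they own.
CREATE TABLE rule_conditions (
    rule_id      BIGINT NOT NULL REFERENCES rules (id),
    condition_id BIGINT NOT NULL REFERENCES conditions (id),
    PRIMARY KEY (rule_id, condition_id)
);

-- Actions belong to a single rule.
CREATE TABLE actions (
    id      BIGSERIAL PRIMARY KEY,
    rule_id BIGINT NOT NULL REFERENCES rules (id),
    command TEXT NOT NULL
);
```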

Speaking with one of our principal engineers today, he attempted to push me away from this schema, arguing in every way he could that we don't actually need it and that it will cause performance issues in the future, referencing an old monolith we own that is a design travesty. He referred to what we're doing as "the old way" and flat tables with json as "the new way". He argued that in places where I want atomicity we don't need it, and that instead of queries we should do more things in memory. This is a design principle that many of our services follow now. We don't anticipate that the volume of our data will grow substantially, which should keep our queries quick. What we do anticipate is a lot of time spent in rule evaluation and performing actions. It seems logical to move as much burden as possible to SQL.
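
As a sketch of what "moving the burden to SQL" looks like here, assume the schema above plus a hypothetical condition_results table holding the latest evaluation of each condition. A single query then finds every rule that should fire, without pulling all rules into memory:

```sql
-- Hypothetical: latest evaluation result, one row per condition. Because
-- conditions are deduplicated, each is evaluated once and its result is
-- shared by every rule that references it.
CREATE TABLE condition_results (
    condition_id BIGINT PRIMARY KEY REFERENCES conditions (id),
    satisfied    BOOLEAN NOT NULL
);

-- Rules whose conditions are all true. (Note: rules with zero conditions
-- won't appear, because of the inner join.)
SELECT r.id, r.name
FROM rules r
JOIN rule_conditions rc ON rc.rule_id = r.id
JOIN condition_results cr ON cr.condition_id = rc.condition_id
GROUP BY r.id, r.name
HAVING bool_and(cr.satisfied);
```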

I understand non-relational databases have become more popular in recent years, but even when actively searching for information about the performance implications of foreign key relationships, I don't see much making his case. I suppose they might tend to introduce large transactions, which may cause issues, but that seems like a problem independent of the foreign key itself.

Is this my naivety? Or is there truly something I and my sub-team are missing? I've explicitly not given detailed information on our problem because I'm not necessarily looking for a solution to it. Given that this is a common trend in our larger team, I'm really curious whether they're on to something.

MichaelCook
  • 129
  • 5
  • The answer to your question in the title would be "They are scared because of the old monolith in your company". But the body of your question seems to ask something different altogether, i.e. "Do foreign keys introduce performance problems?" – Christian Hackl Mar 01 '19 at 09:23
  • 3
    I wonder what % of a RDBMS they have built in "app" code – Caleth Mar 01 '19 at 09:28
  • Whether the approach is good or not depends on the kind of application you are building, its needs, and the direction it's going (requirements, architectural constraints) - something we can't really assess here. As for NoSQL - the whole thing was about supporting massive horizontal scalability, and about the recognition that not all applications require the strict constraints of RDBMS. To learn more, use the top 3 answers [here](https://softwareengineering.stackexchange.com/questions/194340/why-are-nosql-databases-more-scalable-than-sql) as a starting point (the 2nd and 3rd go more in depth). – Filip Milovanović Mar 01 '19 at 11:03
  • 2
    If I can offer some non-technical advice: tone it down a bit. You're passing a lot of judgment ("yes, in two different columns in the same table", "design travesty") on work where you had no involvement in the design decisions and doing it from a position of minimal real-world experience. I can't say you're right or wrong because I haven't seen the project, but systems tend to be a series of compromises resulting in the finished product being functional but less-than-conceptually-pure. This will become clearer as your career progresses and making those decisions becomes part of your job. – Blrfl Mar 01 '19 at 16:29
  • @Blrfl Excellently put – Robbie Dee Mar 01 '19 at 16:40

5 Answers

8

The key word in here to understand where your team is coming from is "microservices". It would be worth reading up on that concept first, particularly for the following information:

  • How should data be stored?
  • What are their design principles?
  • How are they designed to scale?

As with any relatively new way to do things (and 5-10 years is relatively new when it comes to software architecture), you'll find that the ideals and reality are a bit different.

One of the ideals is that every microservice should have its own data store. NOTE: I said data store, not database. There are cases where you simply want a search engine, blob storage, or simple caching as opposed to a regular database. Depending on who you talk to, that ideal might even extend to a data store per microservice instance!

Bottom line is that when you are talking about going to internet scale, the safety and familiarity of ACID (Atomicity, Consistency, Isolation and Durability) transactions just don't scale when you have millions of users on one database. With the advent of NoSQL, the paradigm has shifted more towards BASE (Basically Available, Soft state, Eventual consistency). (reference)

There is an impact to changing the pH of how you manage data:

  • Things the database used to take care of for you have to be managed in code now (see the sketch after this list)
  • It's easier to scale by throwing more microservice instances at a problem than it is to add "infinite" resources to a server
  • You increase reliability at the cost of increased complexity
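
To make the first point concrete, here's a sketch (hypothetical table names). In an RDBMS, referential integrity is one line of DDL that the engine then enforces on every write; in a BASE-style store there is no equivalent declaration, so every writer has to perform that check in code, and a missed check surfaces later as inconsistent data:

```sql
-- The entire "enforcement" effort in an RDBMS:
ALTER TABLE child_table
    ADD CONSTRAINT fk_child_parent
    FOREIGN KEY (parent_id) REFERENCES parent_table (id);
```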

I can't answer for the details of your team or how big they intend the solution to get, but typically you don't have to have an all or nothing solution. I'm not going to sit here and judge whether the team is making the right choices. I'm just providing you with some context so you can at least understand where they are coming from.

Berin Loritsch
  • 45,784
  • 7
  • 87
  • 160
  • +1 Great stuff - there are a lot of subtleties around microservices for sure that mean it isn't just a case of swapping out databases. – Robbie Dee Mar 01 '19 at 16:45
  • @RobbieDee, agreed. There's a lot of complexity in that world, and not everyone agrees on the details. – Berin Loritsch Mar 01 '19 at 16:57
  • This should be the answer. The bit about each microservice having its own data store really is the differentiating factor. It makes for a big change in your data storage needs and solutions, and an ACID compliant data store is not as much of a benefit as it used to be. – Greg Burghardt Mar 01 '19 at 17:51
  • 7
    It's a good answer, and I upvoted it. I would only point out that what you refer to as "internet scale" only applies to the *largest* of companies; for the vast majority of corporate databases and websites (I would say 95% of them), "conventional" normalized SQL databases are still perfectly viable. – Robert Harvey Mar 01 '19 at 18:05
  • @RobertHarvey, I agree whole-heartedly. I have read multiple articles about microservices that specify what I wrote about. In our own projects we do use a SQL database with proper normalization and constraints. It would hurt the purist's heart, but the reality is our user base is rather small (hundreds of users) and the database hasn't been a performance problem for us. – Berin Loritsch Mar 01 '19 at 19:34
  • The irony of reading “SQL isn’t internet scale” on a Stack Exchange site is truly mind blowing. You’re currently on one of the busiest websites in the world, and it is backed by a replicated SQL database (plus some odds and ends for searching & caching). – RubberDuck Mar 03 '19 at 19:56
  • @RubberDuck, looking at https://stackexchange.com/performance shows that while Stack Exchange is really important to folks who write software, it still pales in comparison to the top sites. It's still relatively few updates compared to the reads, which are easier to manage. Twitter sends 10x the tweets as Stack Exchange gets page views: http://www.businessofapps.com/data/twitter-statistics/ Don't get me wrong, there's only a handful of sites that have to deal with that level of throughput. – Berin Loritsch Mar 03 '19 at 23:25
3

OK, not being the principal engineer on the project, you really do have to follow his direction.

I would encourage you to work through your own design of the system and prototype it at home so that you understand the tradeoffs. Do this for your own education, and only mention it at work when you can demonstrate working examples.

My experience has been that people claim constraints cause a slowdown in database performance. And yes, they do: those constraints have to be checked. However, an inconsistent database is a far larger problem, and it will cause you to write more SQL and more code to compensate, often increasing the complexity of the system as well as slowing it down.
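
As a sketch of the compensating SQL I mean (hypothetical table names): without a declared foreign key, orphaned rows accumulate silently, and you end up writing sweeper queries to hunt for them after the fact:

```sql
-- Find child rows whose parent no longer exists: a query that a foreign
-- key constraint would have made unnecessary.
SELECT child.id
FROM child_table AS child
LEFT JOIN parent_table AS parent ON parent.id = child.parent_id
WHERE parent.id IS NULL;
```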

3NF, when done appropriately, will make the database faster because more of it can be cached, as less redundant data is being stored. However, in your current job, there may not be a large enough dataset to actually see the performance difference between a normalized database and a non-normalized one.

Robert Baron
  • 1,132
  • 7
  • 10
  • +1 Great idea. And if the volumes are too big for a dev machine, a 1 in N sample can often yield great insights too. – Robbie Dee Mar 01 '19 at 16:43
2

I think they're scared of re-creating the same old "travesty" that was there before, rather than Referential Integrity itself.

He argued that in places where I want atomicity we don't need it ...

If you can make a solid case (a.k.a. Non-Functional Requirement) for needing atomicity, then they'll need a good, solid counter-argument to get out of providing it.
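
For the avoidance of doubt, here's a sketch of what that atomicity buys you (hypothetical names): both writes land or neither does, with no code on your side to roll back half-applied changes.

```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- a failure anywhere before this point undoes both updates
```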

... instead of queries we should do more things in memory. This is a design principle ... We don't anticipate that the volume of our data will grow substantially ...

Let's hope you're right. I would suggest that relying on the data staying "small enough" to remain performant is risky.

Also, what's the rate of change on these Rules? The more duplication you have, the more time (a.k.a. money) you'll be wasting updating the same thing in multiple places.
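
A sketch with hypothetical names: if the condition's text is copied into every rule that uses it, one change fans out across many rows (and you have to find them all); normalized, the same change is a single row.

```sql
-- Denormalized: every rule row carries its own copy of the condition text.
UPDATE rules
SET condition_text = 'amount > 100'
WHERE condition_text = 'amount > 1000';

-- Normalized: one row changes, and every rule that references the
-- condition picks it up immediately.
UPDATE conditions
SET expression = 'amount > 100'
WHERE id = 7;
```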

Phill W.
  • 11,891
  • 4
  • 21
  • 36
1

The key concepts behind RDBMSs are well over 40 years old. Back then storage was very expensive and any sort of redundancy was frowned upon. While the concepts behind RDBMSs are still sound, the idea of denormalisation for performance (to reduce joins) has become commonly accepted in recent decades.

So for an RDBMS of a given size, you typically have your logical design (without redundancy) and your physical design (with redundancy) for performance.

Fast forward to today, when storage is cheap and processors are faster than ever, and some of those design pressures aren't so important. Ultimately it is a judgement call as to whether you care about redundancy and orphan records. For some industries, like banking, data correctness is vital, so it is hard to see how they will ever move away from RDBMSs. For other industries, new players are entering the market all the time, so the choices are myriad.

As for whether your team are uncomfortable with the restrictions that an RDBMS can bring - who knows? Certainly junior developers I see don't have the RDBMS nous that developers of previous generations did, but this is probably more to do with the proliferation of developer technologies and database platforms.

There is no end of technologies a developer can learn and it can be hard to make the right punt for your career. Certainly the days of developers being a jack of all trades are long gone - there is just too much that one can learn.

But - to the question in hand. By your own admission, you don't expect data volumes to grow and the system is performing well. It would be quite a stretch for you to sell the idea of re-engineering things with no perceivable benefit. Perhaps if you could do a proof of concept where a RDBMS approach did reap benefits, that would be a different story.

Robbie Dee
  • 9,717
  • 2
  • 23
  • 53
  • 1
    why is this downvoted? this is a balanced answer. pragmatism +1 – Dirk Boer Mar 01 '19 at 15:41
  • Pragmatism is good, but you must still be careful. Denormalizing data in the name of performance at the start of a project reeks of premature optimization. Not re-engineering an old system that's working is obviously a good, pragmatic choice, but refusing to design a new system up to industry standards in the name of "we've always done the opposite and it works" is far from a good argument. – Vincent Savard Mar 01 '19 at 16:19
  • *Denormalizing data in the name of performance at the start of a project...* Hint: you don't :) – Robbie Dee Mar 01 '19 at 16:47
  • 1
    The value of a RDBMS does not come from disk efficiency. – TehShrike Mar 09 '19 at 21:34
0

It depends on what database you're using.

In a traditional RDBMS, you're right. Duplication of data is an abomination. The columns and their json equivalents are inevitably going to get out of sync because there's nothing to enforce their agreement. Foreign key support is well established and does a great job of describing and enforcing relationships. And atomicity is vital to doing almost anything with data.

In a nosql sort of setup, it is less clear. Since there are no firm relations, enforcing relations becomes less important. That sort of json content with a column index is a lot more common on these systems, because the absence of relations makes it less likely to get out of sync. And atomicity is constrained to a single table, because that's how nosql works.

Which is better depends on what you're actually doing, and what you actually need.

But it sounds as though your coworkers are in a cargo cult. They were bitten by old bad stuff, so now everything needs to be the new shiny thing. In a few years, once they've been bitten by the new shiny thing, they'll hopefully realize that SQL vs noSQL is a set of tradeoffs.

But they won't. Hopefully you will though.

Telastyn
  • 108,850
  • 29
  • 239
  • 365