6

I have about 15 years software engineering experience, writing business software with relational databases. Mainly SQL Server and Oracle. I've always been of the opinion that you should define relations in your database and let your database handle the relational integrity. And in my case, 99.9% of the time I utilize a unique identity column as a primary key.

Now that my opinion is out of the way, I'd like some advice on an application we have at work. It's a third party application that my application has to interface with. Both applications use SQL Server as a database. They have no interface for sending and retrieving data. Because of that, they've given me portions of their database schema and descriptions on how to save the data with SQL queries.

It was clear from the schema sent to me that they were not using relations nor were they using identity values for primary keys in most cases. In following communications with them, they pretty much said that identity columns are the exception, not the norm. They also said they handle all referential integrity in their code.

I'm pretty much convinced they have made a horrendous mistake. For instance, I have to save data to this database and I don't have any of their "referential integrity code" in my app. And if someone had to run some queries directly in the database, that also poses a problem. I think this is a maintenance nightmare.

Can someone make any reasonable arguments to objectively support the decision this vendor made? I'm having a hard time coming up with any good reason.

Christophe
  • 74,672
  • 10
  • 115
  • 187
Dan
  • 299
  • 1
  • 7
  • I rarely use enforced constraints in the db. They can be a PITA to debug. I find it easier to manage in code. – GrandmasterB Dec 13 '19 at 04:00
  • 2
    I'm trying to imagine a scenario where a db constraint is hard to debug. I've had my fair share of relationship issues to debug, but I can't think of any that were difficult to debug. Could you give an example that was hard to debug? – Dan Dec 13 '19 at 04:25
  • Let me guess: they don't use indexes either ? Joke aside, I dared to slightly reword your question in a way to highlight your objective core and removed subjective or non-constructive remarks. Do you agree (if not you may simply revert the changes) ? – Christophe Dec 13 '19 at 07:30
  • 1
    @Christophe, it could have been pertinent to mention that they were AS400 developers (which has been edited out). It could suggest all kinds, but I'd suggest at a minimum it tends to suggest they are experienced developers who have worked in environments where correctness is likely to have been valued highly. – Steve Dec 13 '19 at 13:14
  • @Steve The way it was phrased, AS400 doing ASP.net, suggested some obsolescence, which I think would be unfair. It was hopefully not the intent of the author. removing this remark avoids risks of confusion, and allows everybody to focus on arguments that matter without prior bias. And if it’d somehow be related to AS400 technology, I’d bet that some former AS400 expert would be delighted to explain the rationale ;-) – Christophe Dec 13 '19 at 13:33
  • @Christophe, I only mention it because it suggests the design decisions will have a defensible rationale, and the system will be looked after by those who meet a minimum standard of competence. It rules out the possibility that the OP is likely going to be able to show them a thing or two. I'm not an AS400 developer myself (to make my neutrality clear), but credible reasons for having no constraints are execution speed, and secondly the ability to selectively archive/delete data (for example, to delete an old customer header, but keep detailed sales records which relate to them). – Steve Dec 13 '19 at 14:16
  • 1
    The real problem here is that they want to use a database as an API which has so many issues that "there are no constraints" would be somewhere on page 2. – Voo Dec 13 '19 at 14:37
  • @Dan For example the standard exception when a constraint fails in EF Core is "Foreign key constraint X has been violated". No information about the data since that could constitute a possible information leak. It's hard to find a good solution here: You either pepper your logs with possibly confidential information or you maek debugging these things hard. I guess you could designate which columns/tables contain confidential information, but I don't know any framework that does that. – Voo Dec 13 '19 at 14:39
  • 1
    @Christophe, Steve I never saw the original post but comments mention AS400 (really IBM i now). Way back when the AS400's came from a time where they didn't really have the capability for referential integrity in the database so it was not common. If the application or developers came from this time that could explain the lack of RI. The modern DB2i has all the capabilities but adding those to an old application, or maybe the developers embracing them can sometimes be difficult. – Scott Mildenberger Dec 13 '19 at 14:49
  • @ScottMildenberger I understand and agree. But having started my career at the time of oracle 5.0 and its locking scheme at table level, and having worked with a couple of dbs (some belonging to history now) that supported referential integrity only at the expense of performance constraints, I can reinsure you that it’s not As400 specific. Moreover I know AS400 addicts who use state of the art technologies nowadays, and I know some projects not using referential integrity for other reasons (e.g. document oriented nosql that do not really support it). – Christophe Dec 13 '19 at 18:13
  • @Voo I think you hit the head of the nail here. The issue is indeed not so much the absence of constraints in the db, but the use if the DB as external interface without the constraint enforcement ! – Christophe Dec 13 '19 at 18:16
  • @Christophe I'm not sure whether they have indexes. I haven't had a chance to look at the full schema. Initially we didn't have the app deployed (it's still in active development). So all I had were the definitions of a few tables, but not all of them, and no indexes. I think your edits are fine. I guess I was trying to suggest these developers are trying to use SQL Server with an AS400 mentality. There are other things about their design that also suggest this. – Dan Dec 13 '19 at 18:59
  • And to clarify, I'm not saying all AS400 developers are incapable of using newer tools correctly. I'm just saying these developers don't seem to be. Though that's maybe colored by my opinion that referential integrity in the database is one of the most important things about using a relational database. However, I'm seeing some mixed opinons looking at the comments. It's giving me a lot to think about. – Dan Dec 13 '19 at 19:02
  • @Voo By confidential information, you mean dumping out some of the data you're trying to store in the database into the logs? I'm not sure if I'm understanding the need for that. Most of the constraints I'm working with are relations between tables and I usually only need to know the IDs of the records involved which I don't normally consider that confidential. Once I have those and the name of the constraint that's failing, I usually have enough to go off of. But maybe I'm not understanding your scenario well enough. – Dan Dec 13 '19 at 19:08
  • I'd agree with the opinions that they should have some API. I just don't see them as that capable of delivering. They've missed some deadlines just getting their software deployed by many months after it was supposedly ready. At the end of the day, I'm required to put the data in their system and the only options are to push it into their database or have the users of their app manually enter all the data (which is not going to happen). – Dan Dec 13 '19 at 19:11
  • @Dan Exactly, might not be a problem depending on what data you're dealing with, it depends. IDs are rarely useful if inserts fail since presumably you're using transactions which cause the whole thing to be rolled back leaving no trace in the database. (Or you might be using natural keys). If you just have "constraint x failed on insert into table P" you just don't have much data to go on. – Voo Dec 13 '19 at 19:17
  • @Voo I think maybe the situation I deal with makes it easier to fix. I work very closely with all the users of my application. I contact them as soon as I see something like this. I can usually gather information on exactly what steps they took and I'm almost always able to easily reproduce the problem. – Dan Dec 13 '19 at 19:20
  • Note: The "relations" in "relational database" are the tables, not the foreign keys. So pedantically speaking, your database does have relations. – user253751 Dec 16 '19 at 12:51
  • I guess I learn something new every day. I've never heard anyone say that relation means table, but I did a google search and you're correct. – Dan Dec 16 '19 at 17:19

5 Answers5

13

The reasons for such design decisions are often not technical ones, but organizational ones. I have seen this happen in the real world, in situations of the following kind:

  • At the time when the system is designed, there is only one application which has exclusive access to the database, so referential constraints and ID columns are not that important at the beginning.

  • The developers of the system are well-trained in application development, but less trained in database design.

  • The designers may had some bad experiences with referential integrity constraints in the database, because they usually don't allow easily to implement "exceptions in special cases" - such exceptions are way easier to handle at the application side.

    For example: "this column in table A must be a not-null foreign key ref to the ID column of table B" - and then someone says "oh, except when the isTemplate column in table A is true. To solve such kind of requirements at the DB side one would need to replace the FK constraint by some more or less complex DB trigger.

  • The - normally very good - approach of "let's start with a most simple working solution first and improve later" is interpreted as let's start without relationships in the DB first and add them later (instead of let's start with strict constraints and make them less strict later).

Unfortunately, once a system is build without referential integrity and it first version goes into production, sooner or later it will contain a lot of data which could not have been inserted into the db with some sensible contraints enabled.

Then it becomes hard to add those contraints to the db afterwards. It is often easier for the devs to modify the application instead to make it handle this "low quality" data, so the data can stay unmodified in the database, instead of initiating a "data improvement process", which would be necessary to fix the problems at the DB side, but may require support by users and administrators.

Of course, when a system grows, and it is extended to have more than one application accessing the DB, especially more than one which writes data into it, there is surely a point where it would be better to have a more rigid the DB design. But the later one tries to get there, the harder it becomes.

Doc Brown
  • 199,015
  • 33
  • 367
  • 565
2

This is a trend I'm seeing more and more across many industries - a flatter data structure that is far more sanguine about data redundancy.

Perhaps we shouldn't be too surprised about this considering:

  • Storage is cheap, CPUs are fast and memory is plentiful, meaning data redundancy is far less of a performance issue
  • The rise of LINQ in .NET languages meaning data slicing and dicing is far easier on the client side
  • The more relaxed nature of modern languages to the established dogma of the past e.g. (from the Zen of Python) Flat is better than nested

I would have agreed with the points from Doc Brown's answer in full about five years ago, but that is really only a small part of it. A lot has changed since then - both in the software development and database arenas.

I recall a news story a few years back where the non-relational DB structure of Facebook was mentioned to a top guru in the RDBMS world who scoffed: "That will never work, and it will never scale". As we've seen since - it did and it does.

Robbie Dee
  • 9,717
  • 2
  • 23
  • 53
  • 2
    I don't understand what the low price of storage or CPU speed has to do with the question. – Doc Brown Dec 13 '19 at 13:02
  • 3
    In some respects it's only a return to older pragmatism, rather than academic purity. Highly normal forms, to save all the space of a few integers, often substitute dozens of lines of query code, complex lookups, and storage structures that (without conveying any advantage) bear no relation to commonsense notions or to paper-based analogies. – Steve Dec 13 '19 at 13:27
  • 1
    @DocBrown Simply put - it means that data redundancy is far less of an issue resource wise. The elephant in the room as Steve alludes to is that denormalisation commonly happened anyway. It is just that idea logically extended and augmented by the boon of client (in this case, non-DB) layers and technologies. – Robbie Dee Dec 13 '19 at 13:30
  • 3
    @RobbieDee, sometimes the claimed "redundancy" is an illusion. We've all seen software containing the classic case of a fully-normalised accounting system, where performance eventually grinds down and you can never delete anything without changing the current balance, because the balance history was seen as redundant and a projection of the underlying transaction data. In other cases, redundancy emphasises that maintaining consistency in a business, in the event of subsequent change, may require special handling and compensating activity, not just transparently updating the record. – Steve Dec 13 '19 at 14:29
  • 1
    @Steve Indeed - I've never had a client yet that cared that the logical DB design in the proposal didn't match the physical DB design that eventually shipped after extensive performance tuning. – Robbie Dee Dec 13 '19 at 15:35
1

I often skip referential integrity in the DB i.e. FK constraints

But I always have unique PKs.

So in my case its more like a lack of strict relationships over all. Sure you can have a customer address with no customer - knock yourself out. Everything will work except the bits that don't.

Plus I hide the DB behind an API so any relationships that need enforcing can be enforced.

This works well with distributed data where the constraint can't be enforced atomically in any case and you have to deal with a potential error.

Your vendor has gone a step further by not having IDs, but let's be generous and assume there are unique composite keys. In theory that works just as well.

But then they also allow you direct access bypassing an API. Even if they had strict constraints at the DB level, this is still suspect. You can't enforce all the rules in the DB so you are bound to be able to break it if you really try.

If it's the lack of API that's the cardinal sin in this case, I think you have to put yourself under the magnifying glass for requesting this low level access to third party software.

Rather than being intriniscally bad It sounds like this software simply lacks the features you require.

Robbie Dee
  • 9,717
  • 2
  • 23
  • 53
Ewan
  • 70,664
  • 5
  • 76
  • 161
1

What's the real problem ?

The real problem here, is not the absence of referential integrity and other critical constraints: it's the fact that the database is used as an interface media without ensuring proper validation of the data sent to the application.

Sooner or later, this will lead to inconsistencies:

  • Database constraints could help to prevent such a situation by ensuring a minimum of consistency when the data is added.
  • A better approach would be the application to offer some interface (e.g. API or file based) to aquire the data and sanitize it before using it.

Are there historical arguments that explain absence of referential integrity?

First of all, I must admit that I totally share your opinions. Not only will referential integrity strengthen the reliability of the data, but in addition it will provide additional information to the optimizer for doing its job.

This being said, teams may be mislead by historical arguments against referential integrity:

  • I started to develop with Oracle 5 in 1986, at that time there was a table locking scheme for any update. There was no referential integrity before 1992 with Oracle 7.
  • For a certain time, referential integrity was implemented in a couple of DBMS at the expense of performance (I remember for example one case where extra locks on foreign keys reduced concurrency).
  • Autonumbered columns where vendor specific extensions before it became SQL standard in 2003. This made the use of identify columns a delicate thing (eg, query, update, fail, retry). Therefore natural (meaningful) keys were preferred when ever possible. For example, and order ID was almost always managed as a unique ID, but order line items did rarely have their own unique iD, but a combined ID (order ID + line number was preferred because simpler to implement). Or you had a combined key device id + timestamp instead of a unique id for every measurement recorded by the device. Why does this matter ? Because in such context, using referential integrity made the change of natural keys very difficult (unless your DB supported cascading changes).

So the question is not AS400 or not, but previous experience made in former times

Are there still valid reasons for not having referential integrity ?

These historical reasons shall not lead to think that absence of referential integrity would be a sign of obsolete practice !

Several valid reasons in favor of not using referential integrity:

  • Database independence: document oriented databases (MongoDB etc...) do not provide for referential integrity, since they favor less structured data sets. So if you develop apps that intend to use such DBMS, or that want to ensure database independence, you will need to care for integrity in your application. So you will not really care at the DB level.
  • DDD and value objects: value objects differ from entities in that they do not have an identity. A value object is solely identified by the value of its attributes. Translated into SQL, this means that you do not use a unique ID to identify value object. The referential integrity is then less obvious, so that you'll start without.
Christophe
  • 74,672
  • 10
  • 115
  • 187
  • I understand the MongoDB example and it makes sense. I'm more interested in the context of a product like SQL Server. – Dan Dec 13 '19 at 19:25
  • After reading your example on DDD/value objects, I guess I've always had a viewpoint of storing everything as an entity. Other than the overhead of having an extra unique id column, what is the advantage of those kinds of object over just storing everything as an entity? – Dan Dec 13 '19 at 19:28
  • I still think the developers not able to make the leap from AS400 to SQL Server is relevant. In my opinion, they've made other questionable choices. For instance, there's a table with three datetime columns. All three are optional. I store them as nullable columns. They do not. They have given me default dates to store in situations where there's no value. And that works as long as the default dates are outside the range of valid dates. But I just don't understand the logic of not making them nullable. I suspect they don't know what a nullable column is for. – Dan Dec 13 '19 at 19:34
  • 1
    As said in my introduction: the problem is not the schema, but using an unefnorced schema in combination with having a DB wide open. Doomed to fail IMHO. Not to speak about concurrency issues, when the app uses one order for locking objects and your interface uses another. – Christophe Dec 13 '19 at 19:38
  • About SQL server and MongoDB, the point is that if the team consider porting the app one day to MongoDB or similar DB, they should not rely on DB features that are not supported in the future targets. – Christophe Dec 13 '19 at 19:39
  • The distinction between object and value would deserve its own question but with more concrete example about the kind ot data we are discussing about. Otherwise it would be an abstract discussion. – Christophe Dec 13 '19 at 19:41
  • We probably have a similar background since I'd prefer a nullable column as well. However, null values are handled differently than a dummy value, for example in case of joins. I've for example seen plenty of DB that use a 31/12/9999 date instead of a null date to mean that no end date is foreseen. While I already prescent that 31/12/9999 will be worse than 31/12/1999, I do not worry too much for my present job. And I recognize that this encoding can help to simplify a couple of queries. I'd therefore prefer not to judge too quickly ;-) – Christophe Dec 13 '19 at 19:45
  • Fair enough about Mongo, though I don't see that happening here. To me, it's the perfect kind of app to use SQL Server. – Dan Dec 13 '19 at 19:46
-2

I am going to come out with a little stronger language for those who don't use fk constraints to enforce referential integrity.

Coming from the data warehousing side, in every instance that I've encountered some source system that has no relational integrity defined, usually small systems developed by front end developers who have no understanding of relational design concepts. They aren't making careful and rational decisions as to why they are ignoring RI, they just don't understand it and don't care. I've found tables that have many to many relationships that are clearly unintended and orphaned data in child tables. This mess happens due to either an oversight with the developer or a bug in the code...and bugs happen. By the time this junk data gets to the DW where users can now ask much more precise questions of the data, they discover, to their dismay, they can't because the data itself doesn't allow it due to all the unintended many to many "relations", broken links, bad table structure (which tends to go along with these types of systems) and fk constraint violations. The type of designer who forgoes RI is mutually inclusive of these things and I've never seen an exception.

How would you feel when your bank statement comes back empty and the explanation from the bank is "Everything works except the bits that don't.". Relational database design is established and you need to have fk constraints if you want your data to be clean.

John
  • 1
  • 1
    Bank statements are actually a prime example of "orphaned data". The balance is an orphan of long-gone journal data. Also, the problem of incompetent developers is not the same issue as whether referential integrity is enforced - the need to archive different data at different times is an example of why references may be legitimately left dangling. – Steve Dec 11 '21 at 08:53