34

Having two classes:

public class Parent 
{
    public int Id { get; set; }
    public int ChildId { get; set; }
}

public class Child { ... }

When assigning ChildId to Parent should I check first if it exists in the DB or wait for the DB to throw an exception?

For example (using Entity Framework Core):

NOTE these kinds of checks are ALL OVER THE INTERNET even on official Microsoft's docs: https://docs.microsoft.com/en-us/aspnet/mvc/overview/getting-started/getting-started-with-ef-using-mvc/handling-concurrency-with-the-entity-framework-in-an-asp-net-mvc-application#modify-the-department-controller but there is additional exception handling for SaveChanges

also, note that the main intent of this check was to return friendly message and known HTTP status to the user of the API and not to completely ignore database exceptions. And the only place exception be thrown is inside SaveChanges or SaveChangesAsync call... so there won't be any exception when you call FindAsync or Any. So if child exists but was deleted before SaveChangesAsync then concurrency exception will be thrown.

I did this due to a fact that foreign key violation exception will be much harder to format to display "Child with id {parent.ChildId} could not be found."

public async Task<ActionResult<Parent>> CreateParent(Parent parent)
{
    // is this code redundant?
   // NOTE: its probably better to use Any isntead of FindAsync because FindAsync selects *, and Any selects 1
    var child = await _db.Children.FindAsync(parent.ChildId);
    if (child == null)
       return NotFound($"Child with id {parent.ChildId} could not be found.");

    _db.Parents.Add(parent);    
    await _db.SaveChangesAsync();        

    return parent;
}

versus:

public async Task<ActionResult<Parent>> CreateParent(Parent parent)
{
    _db.Parents.Add(parent);
    await _db.SaveChangesAsync();  // handle exception somewhere globally when child with the specified id doesn't exist...  

    return parent;
}

The second example in Postgres will throw 23503 foreign_key_violation error: https://www.postgresql.org/docs/9.4/static/errcodes-appendix.html

The downside of handling exceptions this way in ORM like EF is that it will work only with a specific database backend. If you ever wanted to switch to SQL server or something else this will not work anymore because the error code will change.

Not formatting the exception properly for the end-user could expose some things you don't want anybody but developers to see.

Related:

https://stackoverflow.com/questions/6171588/preventing-race-condition-of-if-exists-update-else-insert-in-entity-framework

https://stackoverflow.com/questions/4189954/implementing-if-not-exists-insert-using-entity-framework-without-race-conditions

https://stackoverflow.com/questions/308905/should-there-be-a-transaction-for-read-queries

Konrad
  • 1,529
  • 2
  • 17
  • 32
  • 2
    [Sharing your research helps everyone](https://softwareengineering.meta.stackexchange.com/questions/6559/why-is-research-important). Tell us what you've tried and why it didn’t meet your needs. This demonstrates that you’ve taken the time to try to help yourself, it saves us from reiterating obvious answers, and most of all it helps you get a more specific and relevant answer. Also see [ask] – gnat Sep 19 '18 at 10:09
  • 5
    As others have mentioned, there exists the possibility that a record could be inserted or deleted concurrently with your checking for NotFound. For that reason, checking first seems like an unacceptable solution. If you are concerned about writing Postgres-specific exception handling that is not portable to other database backends, try to structure the exception handler in such a way that core functionality can be extended by database-specific classes (SQL, Postgres, etc) – billrichards Sep 19 '18 at 13:16
  • @billrichards seems reasonable :) but what about the case when parent properties depend on child properties? I need to query both then and the code then will not be that different from my 1st example – Konrad Sep 19 '18 at 17:20
  • 3
    Looking through the comments, I need to say this: **stop thinking in platitudes**. "Fail fast" is not an isolated, out of context rule that can or should be followed blindly. It's a rule of thumb. *Always* analyze what you're actually trying to achieve and then consider any technique in light of whether it helps you achieve that goal or not. "Fail fast" helps you prevent unintended side effects. And besides, "fail fast" really means, "fail as soon as you can detect there is a problem." **Both** techniques fail as soon as a problem is detected, so you must look at other considerations. – jpmc26 Sep 19 '18 at 20:59
  • The phrase for the first pattern that for some reason no one mentioned is "race condition". Look up what it means, then always use the second pattern forever. – Jared Smith Sep 19 '18 at 22:48
  • When you first check if for example record already exists then you have an oportunity to return custom message. Also it makes business rules more verbose. When I see `if(!Exists(item))`, I know right away that there must never be a duplicated record. – FCin Sep 20 '18 at 09:14
  • @JaredSmith I know what race condition and locking is. – Konrad Sep 20 '18 at 11:50
  • @JaredSmith but FYI `FindAsync` doesn't throw any exception like that – Konrad Sep 20 '18 at 11:51
  • if something changes concurrently after calling `FindAsync` and data is inconsistent then exception will be thrown by `SaveChangesAsync` – Konrad Sep 20 '18 at 11:52
  • I would be more happy to have an answer with practical example of the problem with code samples where it doesn't work – Konrad Sep 20 '18 at 11:53
  • For example: if you use `Find` but someone adds the record with this specific id then nothing wrong will happen it will just return 404 not found. But if someone deletes it in the meantime or something else then it affects the rest of the operation. – Konrad Sep 20 '18 at 11:56
  • 1
    @Konrad what do exceptions have to do with it? Stop thinking of race conditions as something that lives in your code: it is a property of the universe. Anything, *anything* that touches a resource it doesn't completely control (e.g. direct memory access, shared memory, database, REST API, filesystem, etc. etc.) more than once and expects it to be unchanged has a potential race condition. Heck, we deal with this in C which doesn't even *have* exceptions. Just don't ever branch on the state of a resource you don't control if at least one of the branches messes with the state of that resource. – Jared Smith Sep 20 '18 at 14:05
  • @JaredSmith I think you're confusing me, I know race conditions are not related to exceptions, but you're confusing concepts here of race conditions you deal with in your code vs race conditions in database. – Konrad Sep 20 '18 at 14:08
  • @Konrad no difference. None. Same problem, same solution. Shared resources are shared resources whether its RAM, filesystem, database, etc. – Jared Smith Sep 20 '18 at 14:09
  • @JaredSmith not really because race conditions in database are handled differently in the language of your choice – Konrad Sep 20 '18 at 14:09
  • for example in this case where I'm using EF Core, it will throw an exception when there's a race condition in the database – Konrad Sep 20 '18 at 14:10
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/83440/discussion-between-jared-smith-and-konrad). – Jared Smith Sep 20 '18 at 14:10
  • @Konrad: Your behavior on this question makes it clear that you weren't asking this question in good faith. You wanted a particular answer, to validate an opinion you had formed before asking the question. Wanting a particular answer isn't wrong in itself, but the way you have engaged with everyone who disagrees with you (which appears to be by far the majority) makes it clear that you are more interested in arguing than in learning. That's not very constructive behavior, for the community or for yourself. You can never learn anything until you can admit you have something to learn. – Daniel Pryden Sep 20 '18 at 18:46
  • As a somewhat unrelated aside: many of the problems you describe are symptoms of your decision to use an ORM in the first place. If your primary key were a natural key of some sort, or an autoincremented sequence, then you wouldn't have this problem at all. If your tools are making it hard for you to build a good solution to a problem, then you probably need to carefully consider whether you need different tools. – Daniel Pryden Sep 20 '18 at 18:49
  • @DanielPryden Using ORM or not every choice has some drawbacks you have to deal with. IMO it doesn't matter what you use as a key in this case because it's about queries, so you will have the same or similar problem when not using ORM, and using something like dapper or some micro-orm. – Konrad Sep 20 '18 at 19:02
  • and you can still directly execute queries instead of relying on auto generation. – Konrad Sep 20 '18 at 19:04
  • @Konrad: If you have a set of fields that uniquely identifies a record, then you *should* have a UNIQUE constraint that enforces that. As long as your database is enforcing the integrity of your data (which it should be, that's its job!), then you will have to deal with requests from your application that the database rejects due to violating a constraint. As Mr.Mindor points out below, this is really a not-very-special special case of error handling: no matter what you do, it's *always* possible for the insert to fail, and you need to handle that, regardless. – Daniel Pryden Sep 20 '18 at 19:13
  • 1
    @DanielPryden In my question, I didn't say that I don't want to handle database exceptions(I know that exceptions are inevitable). I think many people misunderstood, I wanted to have friendly error message for my Web API(for end-users to read) like `Child with id {parent.ChildId} could not be found.`. And formatting "Foreign key violation" I think is worse in this case. – Konrad Sep 20 '18 at 19:19
  • And I think this justifies the need for additional `SELECT` query – Konrad Sep 20 '18 at 19:20
  • If the extra query is to provide a readable error message, shouldn't that happen after we know there's an error, not before? I.e. `lock_db { try { add() } catch (db_error) { if (id_not_present()) { return NotFound(); } else { return OtherError(); } }`. I'm not expert enough to turn that into actual code, but that's what I'd expect. – user673679 Sep 21 '18 at 08:28
  • @user673679 database gives you an error like "Foreign key violation", now try and format that to "Child with id {parent.ChildId} could not be found." because you don't want your end-user to know anything about your underlying database. Simply returning `500 Internal Server Error` or `404 Not Found` doesn't tell you anything from the user perspective, so you need to know what exactly went wrong so you can fix that and retry. – Konrad Sep 21 '18 at 08:32
  • Exceptions are meant for the developers, not for end-users so you'd have to format not really meaningful error from the database exception to user-friendly message. – Konrad Sep 21 '18 at 08:36
  • Also, in my 1st code example I'd use `Any` instead of `Find` when I don't need to read anything from the child because ORM will convert it to `SELECT 1` instead of `SELECT *` – Konrad Sep 21 '18 at 08:40
  • 1
    @Konrad. I suspect in the end it depends entirely on usage. If that ID is correct 99% of the time, you'll probably find it faster not to check (because checking is always extra work), and do any error string formatting in the error handling code (which in this case involves exception handling - you can rethrow if it's not the particular error you expect). If the ID is wrong 99% of the time, it'll be faster to do the check. Measure it and see. – user673679 Sep 21 '18 at 09:09
  • 1
    The check is really doing two things: 1) input validation, 2) error presentation. You can merge these two into one if you maintain database state (lock / transaction), or do both separately. – user673679 Sep 21 '18 at 09:12

5 Answers5

116

Checking for uniqueness and then setting is an antipattern; it can always happen that the ID is inserted concurrently between checking time and writing time. Databases are equipped to deal with this problem through mechanisms like constraints and transactions; most programming languages aren't. Therefore, if you value data consistency, leave it to the expert (the database), i.e. do the insert and catch an exception if it occurs.

Kilian Foth
  • 107,706
  • 45
  • 295
  • 310
  • I updated my question. – Konrad Sep 19 '18 at 10:22
  • But isn't it better to fail fast? – Konrad Sep 19 '18 at 10:28
  • I think it depends, because if some Parent properties would depend on child properties then you have to fetch a child first. – Konrad Sep 19 '18 at 10:32
  • 34
    checking and fail is not faster than just "trying" and hope for the best. Former implies 2 operations to be implemented and performed by your system and 2 by the DB, while the latest only implies one of them. Checking is delegated to the DB server. It also implies one less hop into the network and one less task to be attended by the DB. We might think that one more query to the DB is affordable, but we often forget to think in big. Think in high concurrency triggering the query over and over a hundred times. It could dup the whole traffic to the DB. If that matters is up to you to decide. – Laiv Sep 19 '18 at 10:39
  • @Konrad if you *know* that the main query you're running will waste a lot of time/cpu/memory/etc executing before failing, but the check query can execute much faster, *then* it might worthwhile to run the check query and then conditionally the main query **within the same *transaction*** - your programming language's library for your database should expose the ability to start and end a transaction. **However**, remember that premature optimization is often wasted effort, so only do it if you *really know* that it's inefficient. – mtraceur Sep 19 '18 at 16:25
  • @mtraceur that's why I currently don't do it and have separate check queries. – Konrad Sep 19 '18 at 16:42
  • when this will become a problem then I will refactor and handle DB exceptions appropriately – Konrad Sep 19 '18 at 16:42
  • 6
    @Konrad My position is that the default-correct choice is one query that will fail on its own, and it is the *separate query* pre-flighting approach that has the burden of proof to justify itself. As for "become a problem": so you **are** *using transactions* or *otherwise ensuring that you are safe against ToCToU errors*, right? It's not obvious to me from the code posted that you are, but if you are not, then it has already become a problem the way that a ticking bomb becomes a problem long before it actually explodes. – mtraceur Sep 19 '18 at 16:55
  • @mtraceur perhaps you are right, I'm not actually sure if EF Core is doing it implicitly as a part of transaction. The only problem with this 1 query that will fail on its own is properly formatting it to send it back to the client (user friendly message without exposing the details) – Konrad Sep 19 '18 at 17:14
  • like I said before there are situations (not this one though) where you need 2 queries anyway where 1 operation depends on columns from different tables – Konrad Sep 19 '18 at 17:15
  • in web APIs you need to map it to appropriate status code like 404 not found... – Konrad Sep 19 '18 at 17:17
  • @Konrad If 1 operation depends on columns from different tables, why not JOIN them and query them in 1 database query? – Navin Sep 19 '18 at 17:50
  • 4
    @Konrad EF Core is not going to implicitly put both your check and the insert into one transaction, you will have to explicitly request that. Without the transaction, checking first is pointless as the database state can change between the the check and insert anyway. Even with a transaction, you might not prevent the database from changing under your feet. We ran in to an issue a few years ago using EF with Oracle where although the db supports it, Entity was not triggering locking of the read records within a transaction, and only the insert was treated as transactional. – Mr.Mindor Sep 19 '18 at 20:55
  • @Mr.Mindor so how did you solve it? by putting both check and insert in 1 transaction scope? – Konrad Sep 19 '18 at 20:58
  • 3
    "Checking for uniqueness and then setting is an antipattern" I would not say this. It depends strongly on whether you can assume no other modifications are happening and on whether the check produces some more useful result (even just an error message that actually means something to the reader) when it does not exist. With a database handling concurrent web requests, no, you can't guarantee other modifications aren't happening, but there are cases when it's a reasonable assumption. – jpmc26 Sep 19 '18 at 21:06
  • 1
    @Konrad It was pretty significant rework to not rely on state remaining unchanged. The code in question was poorly designed to begin with and made exactly the sort of mistakes that this question touches on.(not acknowledging it needed to run in a concurrent manner) When it was written and originally tested, the developer was the only user and did all the testing manually, so it all worked fine. Once there were just two users in UAT everything got very weird fast. The attempts to get the read lock working in the transaction to work were already a last effort to save it. – Mr.Mindor Sep 19 '18 at 21:19
  • "it can always happen that the ID is inserted concurrently between checking time and writing time" `FindAsync` will always return `null` or entity. Concurrency conflict will happen only when it's deleted or updated between check and save – Konrad Sep 20 '18 at 12:03
  • 1
    you checked it's there, but someone in the same time deleted it so `SaveChanges` will throw an exception – Konrad Sep 20 '18 at 12:04
  • but if you checked it and it isn't there then it will return 404 not found anyway, there will be nothing wrong with it even if someone added row with this specific id shortly after... at least I think so because there's no other way to handle it – Konrad Sep 20 '18 at 12:05
  • https://www.citusdata.com/blog/2018/02/15/when-postgresql-blocks/ – Konrad Sep 20 '18 at 12:09
  • as you can see in the table `SELECT` runs concurrently with `INSERT UPDATE DELETE` – Konrad Sep 20 '18 at 12:10
  • 5
    Checking for uniqueness first does not eliminate the need to handle possible failures. On the other hand, if an action would require performing several operations, checking whether all are likely to succeed before starting any of them is often better than performing actions that may *likely* need to be rolled back. Doing the initial checks may not avoid all situations where a rollback would be necessary, but it could help reduce the frequency of such cases. – supercat Sep 20 '18 at 16:24
  • I don't think it should be taken as a **fact** the statement *"Checking for uniqueness and then setting is an antipattern"* – Yusha May 09 '19 at 16:18
39

I think what you call “fail fast” and what I call it is not the same.

Telling the database to make a change and handling the failure, that is fast. Your way is complicated, slow and not particularly reliable.

That technique of yours is not fail fast, it is “preflighting”. There are sometimes good reasons, but not when you use a database.

gnasher729
  • 42,090
  • 4
  • 59
  • 119
  • 1
    There are cases when you need 2nd query when one class depend on another, so you have no choice in cases like that. – Konrad Sep 19 '18 at 11:01
  • 4
    But not here. And database queries can be quite clever, so I generally doubt the “no choice”. – gnasher729 Sep 19 '18 at 11:04
  • 1
    I think it also depends on the application, if you create it just for a few users then it shouldn't make a difference and code is more readable with 2 queries. – Konrad Sep 19 '18 at 11:06
  • 21
    You are assuming that your DB is storing inconsistent data. In other words, looks like you don't trust in your DB and the consistency of the data. If that were the case, you have a really big problem and your solution is a walkaround. A palliative solution fated to be overruled sooner than later. There could be cases where you are forced to consume a DB out of your control and management. From other applications. In those cases, I would consider such validations. In any case, @gnasher is right, yours is not failing fast or it's not what we understand as fail fast. – Laiv Sep 19 '18 at 11:58
16

This started as a comment but grew too large.

No, as the other answers have stated, this pattern should not be used.*

When dealing with systems that use asynchronous components, there will always be a race condition where the database (or file system, or other async system) may change between the check and the change. A check of this type is simply not a reliable way prevent the the type of error you don't want to handle.
Worse than not being sufficient, at a glance it gives the impression that it should prevent the duplicate record error giving a false sense of security.

You need the error handling anyway.

In comments you've asked what if you need data from multiple sources.
Still No.

The fundamental issue does not go away if what you want to check becomes more complex.

You still need the error handling anyway.

Even if this check were a reliable way to prevent the particular error you are trying to guard against, other errors can still occur. What happens if you lose connection to the database, or it runs out of space, or?

You very probably still need other database related error handling anyway. The handling of this particular error should probably be a small piece of it.

If you need data to determine what to change, you obviously will need to collect it from somewhere. (depending on what tools you are using there are probably better ways than separate queries to collect it) If, in examining the data you collected, you determine you don't need to make the change after all, great, don't make the change. This determination is completely separate from error handling concerns.

You still need the error handling anyway.

I know I'm being repetitive but I feel it is important to make this clear. I've cleaned up this mess before.

It will fail eventually. When it does fail it will be difficult and time consuming to get to the bottom of. Resolving issues that arise from race conditions is hard. They don't happen consistently, so it will be difficult or even impossible to reproduce in isolation. You didn't put in the proper error handling to begin with so you won't likely have much to go on: Maybe an end user's report of some cryptic text (which hey you were trying to prevent from seeing in the first place.) Maybe a stack trace that points back to that function that when you look at it blatantly denies the error should even be possible.

*There may be valid business reasons to perform these exists checks, such as to prevent the application from duplicating expensive work, but it is not a suitable replacement for proper error handling.

Mr.Mindor
  • 309
  • 2
  • 8
5

Rather a confused question, but YES you should check first and not just handle a DB exception.

First of all, in your example you are at the data layer, using EF directly on the database to run SQL. You code is equivalent to running

select * from children where id = x
//if no results, perform logic
insert into parents (blah)

The alternative you are suggesting is:

insert into parents (blah)
//if exception, perform logic

Using the exception to execute conditional logic is slow and universally frowned upon.

You do have a race condition and should use a transaction. But this can be fully done in code.

using (var transaction = new TransactionScope())
{
    var child = await _db.Children.FindAsync(parent.ChildId);
    if (child == null) 
    {
       return NotFound($"Child with id {parent.ChildId} could not be found.");
    }

    _db.Parents.Add(parent);    
    await _db.SaveChangesAsync();        
    transaction.Complete();

    return parent;
}

The key thing is to ask yourself:

"Do you expect this situation to occur?"

If not, then sure, insert away and throw an exception. But just handle the exception like any other error that might occur.

If you do expect it to occur, then it is NOT exceptional and you should check to see if the child exists first, responding with the appropriate friendly message if it doesn't.

Edit - There's a lot of controversy over this it seems. Before you downvote consider:

A. What if there were two FK constraints. Would you advocate parsing the exception message to work out which object was missing?

B. If you have a miss, only one SQL statement is run. It's only hits which incur the extra expense of a second query.

C. Usually Id would be a surrogate key, It's hard to imagine a situation where you know one and you aren't pretty sure it's on the DB. Checking would be odd. But what if its a natural key the user has typed in? That could have a high chance of not being present

Ewan
  • 70,664
  • 5
  • 76
  • 161
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/83680/discussion-on-answer-by-ewan-should-i-check-if-something-exists-in-the-db-and-fa). – maple_shaft Sep 26 '18 at 13:47
  • 1
    This is totally wrong and misleading! It is answers like this that produce bad professionals that I always have to fight against. SELECT never locks a table, so between the SELECT and the INSERT, UPDATE or DELTE, the record may change. So it is poor lousy software deign and an accident waiting to happen in production. – Daniel Lobo Mar 22 '19 at 09:40
  • 1
    @DanielLobo transactionscope fixes that – Ewan Mar 23 '19 at 17:32
  • No it doesn't. Referential constraints do. – Daniel Lobo Mar 23 '19 at 18:12
  • 1
    test it if you dont believe me – Ewan Mar 23 '19 at 18:45
  • 1
    @yusha I have the code right here – Ewan Apr 10 '19 at 16:45
  • @Ewan Checking if exists means adding an extra db request, adding extra overhead. That alone can be agood reason to just let the db handle the error – Juan Perez Jan 26 '23 at 15:36
  • The problem with "just let the db handle it" is that there are a million assumptions behind that approach giving the same result, and a billion around it being faster. – Ewan Jan 27 '23 at 14:42
2

I think a secondary thing to note here - one of the reasons you want this is so that you can format an error message for the user to see.

I would heartily recommend that you:

a) show the end user the same generic error message for every error that occurs.

b) log the actual exception somewhere that only the developers can access (if on a server) or somewhere that can be sent to you by error reporting tools (if client deployed)

c) don't try and format the error exception details that you log unless you can add more useful information. You don't want to have accidentally 'formatted' away the one piece of useful information that you would have been able to use to track an issue down.


In short - exceptions are full of very useful technical information. None of this should be for the end user and you lose this information at your peril.

Paddy
  • 2,623
  • 16
  • 17
  • 2
    "show the end user the same generic error message for every error that occurs." that was the main reason, formatting the exception for end-user looks like a horrible thing to do.. – Konrad Sep 20 '18 at 16:11
  • 1
    In any reasonable database system, you should be able to programmatically find out why something has failed. It shouldn't be necessary to parse an exception message. And more generally: who says that an error message needs to be displayed to the user at all? You can fail the first insert and retry in a loop until you succeed (or up to some limit of retries or time). And in fact, backoff-and-retry is something you're going to want to implement eventually anyway. – Daniel Pryden Sep 20 '18 at 18:40