12

In the API I'm working on there's a bulk delete operation which accepts an array of IDs:

["1000", ..., "2000"]

I was free to implement the delete operation as I saw fit, so I decided to make the whole thing transactional: that is, if a single ID is invalid, the entire request fails. I'll call this the strict mode.

try {
    Savepoint savepoint = conn.setSavepoint();

    for (String id : ids) {
        if (!deleteItem(id)) {
            conn.rollback(savepoint);
            sendHttp400AndBeDoneWithIt();
            return;
        }
    }

    conn.commit();
} catch (SQLException e) {
    conn.rollback();
    throw e;
}

The alternative (implemented elsewhere in our software suite) is to do what we can in the backend, and report failures in an array. That part of the software deals with fewer requests so the response doesn't end up being a gigantic array... in theory.
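A minimal sketch of that permissive alternative, using an in-memory set as a stand-in for the real data store (the class and method names here are mine, not the actual API):

```java
import java.util.*;

class PermissiveDelete {
    // Hypothetical in-memory stand-in for the real data store.
    static Set<String> store = new HashSet<>(Set.of("1000", "1001", "1002"));

    // Delete what we can; collect the IDs that failed instead of aborting.
    static List<String> bulkDelete(List<String> ids) {
        List<String> failed = new ArrayList<>();
        for (String id : ids) {
            if (!store.remove(id)) {  // remove() returns false when absent
                failed.add(id);
            }
        }
        return failed;  // report these back in the response body
    }
}
```

The caller gets back exactly the IDs that couldn't be deleted, so a partial failure is visible rather than silent.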


A recent bug on a resource-poor server made me look at the code again, and now I'm questioning my original decision, though this time I'm motivated more by business needs than by best practices. If, for example, I fail the entire request, the user has to try again; whereas if some of the items get deleted, the user can finish the action and then ask an administrator to do the rest (while I work on fixing the bug!). This would be the permissive mode.

I tried looking online for some guidance on the matter but I've come up empty-handed. So I come to you: What is most expected of bulk operations of this nature? Should I stick with strict mode, or should I be more permissive?

Laiv
rath
    It depends. What is the cost of having something not deleted when it should be? (Cost being defined as bad data, headache, undesired behavior, the time it takes an admin to fix it, etc.) Is that acceptable? If you can live with the consequences of not failing everything, go for it. If it would cause too much of a problem, don't. You know your software and the consequences, so you'll have to make a judgement call. – Becuzz Oct 12 '16 at 14:28
    @Becuzz The cost would be the user noticing one or two leftovers and opening a ticket about that; the current situation is "omg delete is broken". Luckily the user is down the hallway so it's not too much of an issue this time. The point is, I like to do the _correct thing_ whenever possible, and with a 10+ year-old codebase, God knows some things can stand to be done correctly – rath Oct 12 '16 at 14:32
  • I think this also depends on whether you want scalability or not. If you don't intend to have a lot of IDs, it shouldn't matter too much. If you intend to have a million IDs, or, better yet, aren't absolutely sure it won't happen, then you could spend an hour deleting IDs just to have it completely reset due to one invalid ID. – imnota4 Oct 12 '16 at 15:31
    @imnota4 An excellent point I hadn't considered. The UI restricts the request to a maximum of about 250, but the backend has no restriction. May I ask you to repost your comment as an answer? – rath Oct 12 '16 at 15:34
    Permissive mode also makes Admins job easier because they don't need to reproduce the fail with all the stack of id's. It could be also useful to inform in the response the cause of each error. Looking at the cause, It could be possible for the final user to solve it with no "omg delete is broken" tickets. – Laiv Oct 12 '16 at 15:37
  • I would say that it depends on the context and motivation of the client deleting multiple resources. Can you share with us the intent of clients using the bulk delete? Do clients think they are the sole owners of their resources, or are multiple clients independently editing shared resources? – Erik Eidt Oct 12 '16 at 17:17

5 Answers

10

It's okay to do a 'strict' or a 'nice' version of a delete endpoint, but you need to clearly tell the user what happened.

We're doing a delete action with this endpoint. Likely DELETE /resource/bulk/ or something similar. I'm not picky. What matters here is that no matter if you decide to be strict or nice, you need to report back exactly what happened.

For example, an API I worked with had a DELETE /v1/student/ endpoint that accepted bulk IDs. During testing we'd regularly send off the request, get a 200 response, and assume everything was fine, only to find out later that everyone on the list was either still IN the database (set to inactive) or not actually deleted due to an error, which messed up future calls to GET /v1/student because we got back data we weren't expecting.

The solution to this came in a later update that added a body to the response listing the IDs that weren't deleted. This is - to my knowledge - a sort of best practice.

Bottom line: no matter what you do, make sure you provide a way to let the end user know what's going on, and possibly why it's going on. For example, if we picked the strict format, the response could be 400 - DELETE failed: ID 1221 not found. If we picked the 'nice' version, it could be 207 - {"message": "failed, some ids not deleted", "failedIds": [1221, 23432, 1224]}.
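That bottom-line advice can be sketched as a small helper that picks the status line and body from the list of failed IDs (the status codes follow the answer's examples; the helper itself and its formatting are hypothetical):

```java
import java.util.List;

class BulkDeleteResponse {
    // Hypothetical helper: pick a status line and a JSON-ish body
    // based on which IDs failed to delete.
    static String respond(List<String> failedIds) {
        if (failedIds.isEmpty()) {
            return "200 {\"message\":\"all items deleted\"}";
        }
        return "207 {\"message\":\"partial failure\",\"failedIds\":"
                + failedIds + "}";
    }
}
```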

Good luck!

Adam Wells
2

One should be both strict and permissive.

Usually, bulk loads are broken down into two phases:

  • Validation
  • Loading

During the validation phase every record is looked at strictly to make sure it meets the requirements of the data specifications. One can easily inspect tens of thousands of records in just a few seconds. The valid records are placed in a new file to be loaded; the invalid ones are flagged, removed, and usually put in a separate file (a skip file). Notification is then sent out on the records that failed validation so they can be inspected and diagnosed for troubleshooting purposes.

Once the data has been validated, it is then loaded. Usually it is loaded in batches if it is large enough, to avoid long-running transactions and to make recovery easier if there is a failure. Batch size depends on how large the data set is: if one only has a few thousand records, one batch would be OK. Here you can be somewhat permissive with failures, but one may want to set a failed-batch threshold to stop the entire operation. Maybe if [N] batches fail, one would stop the whole operation (if the server was down or something similar). Usually there are no failures at this point because the data has already been validated, but if there is one due to environment issues or the like, just reload the batches that failed. This makes recovery a little easier.
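The load phase with a failed-batch threshold might look roughly like this (a sketch only: the method names, the Predicate standing in for the real loader, and the threshold handling are all illustrative):

```java
import java.util.List;
import java.util.function.Predicate;

class BatchLoader {
    // Load validated records in fixed-size batches; give up once
    // maxFailedBatches batches have failed, on the theory that the
    // environment itself is probably down at that point.
    static int load(List<String> records, int batchSize,
                    Predicate<List<String>> loadBatch, int maxFailedBatches) {
        int failures = 0, loaded = 0;
        for (int i = 0; i < records.size(); i += batchSize) {
            List<String> batch =
                records.subList(i, Math.min(i + batchSize, records.size()));
            if (loadBatch.test(batch)) {
                loaded += batch.size();
            } else if (++failures >= maxFailedBatches) {
                break;  // stop the whole run; failed batches can be reloaded
            }
        }
        return loaded;
    }
}
```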

Jon Raynor
  • I don't validate the IDs against DB values, I just try to delete them and see how that goes, or it would take forever. Aborting after N failures seems a very reasonable suggestion, +1 – rath Oct 13 '16 at 09:48
2

Should a single failure fail a bulk operation?

There isn't a canonical answer to this. The needs of and consequences to the user need to be examined, and the trade-offs assessed. The OP gave some of the required info, but here is how I would proceed:

Question 1: 'What is the consequence to the user if an individual delete fails?'

The answer should drive the rest of design / implemented behavior.

If, as the OP sort of stated, it is simply that the user notices the exception and opens a trouble ticket, but is otherwise unaffected (the non-deleted items do not affect subsequent tasks), then I would go with permissive, with an automatic notification to you.

If the failed deletes need to be resolved before the user can proceed, then strict is clearly preferable.

Giving the user the option (e.g., essentially an ignore-failures flag with either the strict or permissive as default) may be the most user friendly approach.
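That user-selectable option could look something like the following (everything here - the flag name, the fail-fast check, the in-memory store - is a hypothetical sketch, not the OP's API):

```java
import java.util.*;

class DeleteMode {
    // Hypothetical endpoint logic: the caller picks the behavior
    // with a strict flag instead of the server hard-coding one mode.
    static List<String> bulkDelete(Set<String> store, List<String> ids,
                                   boolean strict) {
        if (strict && !store.containsAll(ids)) {
            return ids;  // fail fast: nothing is deleted at all
        }
        List<String> failed = new ArrayList<>();
        for (String id : ids) {
            if (!store.remove(id)) failed.add(id);
        }
        return failed;  // permissive path: report only the leftovers
    }
}
```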

Question 2: 'Would there be any data coherence / consistency problems if subsequent tasks are performed with not-deleted items still in the data store?'

Again, the answer would drive the best design / behavior. Yes -> Strict, No -> Permissive, Maybe -> Strict or User Selected (particularly if the user can be depended upon to accurately determine consequences).

Kristian H
0

I think this depends on whether you want scalability or not. If you don't intend to have a lot of IDs, it shouldn't matter too much. If you intend to have a million IDs, or, better yet, aren't absolutely sure it won't happen, then you could spend an hour deleting IDs just to have it completely reset due to one invalid ID.

imnota4
-1

I'd say one important point here is what it means for a bulk of stuff to be deleted.

Are these IDs somehow logically related, or is it just a convenience/performance batch grouping?

If they're somehow, even loosely, connected, I'd go for strict. If it's just a batch mode (e.g. the user clicks "save" for his last few minutes of work, and only then is the batch transmitted), then I'd go for the permissive version.

As the other answer states: In any case tell the "user" exactly what happened.

Martin Ba