When an application A communicates with an external/third-party system B, there is always a chance that B is down.
Say that A raises an event that should be sent as a message to B via HTTP. What is the best way to guarantee that the message is delivered?
One possibility is of course to have some retry logic to resend the message a few times. But what should we do with the message if delivery fails too many times? Or if A crashes (maybe due to too many messages waiting to be sent)? Then we need a way to persist those messages so that delivery can resume after A has recovered.
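To make the retry part concrete, here is roughly what I have in mind (just a sketch; the endpoint URL, payload shape, and function name are all made up for illustration):

```python
import time
import requests

def send_with_retry(url, payload, max_attempts=3, backoff_seconds=2):
    """Try to POST the message a few times before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(url, json=payload, timeout=10)
            response.raise_for_status()
            return True  # delivered and acknowledged by B
        except requests.RequestException:
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    return False  # caller must decide what to do with the undelivered message
```

The `return False` at the end is exactly my open question: what happens to the message then?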
My first idea was to store all events in a dedicated table in the database and mark them off when they are sent. Then a colleague argued that we can't always rely on the database and that we should instead store the messages locally on the filesystem. But with the latter approach it looks like we'd be implementing a message queue ourselves, and we'd be better off with a real full-fledged message queue (which we currently don't have for this application).
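For the database idea, this is roughly the shape I was picturing (a rough sketch only, using SQLite for brevity; the table and function names are invented):

```python
import sqlite3

conn = sqlite3.connect("app.db")

# Dedicated table for outgoing events: a row is written when the event is
# raised, and sent_at stays NULL until delivery to B is confirmed.
conn.execute("""
    CREATE TABLE IF NOT EXISTS outbox (
        id         INTEGER PRIMARY KEY,
        payload    TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        sent_at    TEXT
    )
""")

def enqueue(payload):
    # Can run in the same local transaction as the business change that
    # produced the event, so the event can't be lost between the two.
    with conn:
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))

def deliver_pending(send):
    # `send` is the actual HTTP delivery, e.g. the send_with_retry sketch above.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        if send(payload):
            with conn:
                conn.execute(
                    "UPDATE outbox SET sent_at = CURRENT_TIMESTAMP WHERE id = ?",
                    (row_id,),
                )
```

A background job would then call `deliver_pending` periodically, passing in the HTTP sender, e.g. `deliver_pending(lambda p: send_with_retry(B_URL, p))` (with `B_URL` being wherever B listens). Unsent messages survive a crash of A because they sit in the table until marked sent.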
The same colleague then argued that even if we have a message queue, we can't be sure that the message is delivered to the queue, so we'd still need to implement a queue on the filesystem. That really seems like overkill to me: it would mean that, to be really sure, we'd have to implement a locally stored message queue for all communication, even between our own microservices.
For context: this is a low-volume system with few messages per day (at most in the hundreds), but they have very high value (they are used for billing), so we don't want to miss any.
Any thoughts?