Unexpected shutdown before a saga completion

Question

Suppose we have some microservices and a saga will run to do a transaction in 6 microservices.

What if the whole system dies(unexpected shutdown), on middle of saga process in the step number 4?(System died, So state is lost)

How severe is your system failure in this scenario? Did you just lose the RAM contents due to an unexpected reboot, or do you have to rebuild the entire system from week-old backups? — Bart van Ingen Schenau, Aug 05 '20 at 13:38
In that case, if it is important that your saga's are an all-or-nothing affair, just make sure that the state isn't completely lost so you can do a recovery-rollback when the servers come back up. — Bart van Ingen Schenau, Aug 05 '20 at 13:55

score 2 · Accepted Answer · answered Aug 05 '20 at 13:58

That’s not the way a saga works:

every involved microservice performs a step, which is locally handled as a transaction.
every completed step shall result in an event to be triggered
the events must be handled in a reliable way for example using an event queue
the update of the local database shall be atomic with the event message. This is key to reliability

If the system fails between two steps, when it’s restarted, the processing just goes on where it left: the event queue is reliably persisted and the next step will be triggered by the event already on the queue.

If the system fails in the middle of the step, when it is restarted, either the step should go on (if the state of the step can be restored) or the step is rolled-back (since it’s managed in a transactional manner). Then it depends on how you have designed your saga and steps for node failure:

if you have control on the rollback, you know that something went wrong and you generate a message that will trigger compensating transactions on the partial steps that were already successfully completed. It’s similar to when a step fails due to some business rules.
if you don’t control the rollback (e.g. the db does it for you because of a missing commit) you’d need some end-to-end monitoring, for example that some time-out causes the failure message for the compensating transactions, unless you have some tighter control on the processing of the events to find out that an event was consumed, but is still not processed (i.e. one of the microservices crashed or lost network connection) and needs to be relaunched.

Of course, this is greatly simplified, because distributed processing is very complex and needs very careful design (e.g. what if you relaunch a step on a new instance, but the old instance managed to recover with the risk of having things processes twice).

thank you very much. I am reading about this. What if system goes down between the commit and the ack? i guess we have to save whole history and rollback or continue after system restore — Amin Shojaei, Aug 12 '20 at 11:55
@AminShojaei As said, this is greatly simplified when I write that the comment and the event message must be done atomically (it’s very tricky algorithms. You could for example read about Kafka’s "deliver exactly once" to get a taste of it). Atomic means that it behaves in a way that it’s either nothing or everything. A system failure in the middle would be processed at restart, and depending on the recoverability of the state, either rollback or make sure that the event is sent. Journalling usually plays an important role for this. — Christophe, Aug 12 '20 at 12:32

Unexpected shutdown before a saga completion

1 Answers1