Error handling in distributed system

Question

This is a common interaction sequence between two distributed components in our Java application:

1  A sends request to B
2      B starts some job J in parallel thread
3      B returns response to A
4  A accepts response
5      Job finishes after some time
6      Job sends information to A
7  A receives the information from the job and updates its state

This is the ideal scenario, assuming everything works. Of course, real life is full of failures. For example, one of the worst cases may be if #6 fails simply because of the network: the job has been executed correctly, but A does not know anything about it.

I am looking for a lightweight approach to managing errors in this system. Note that we have a lot of components, so clustering them all just for the sake of error handling does not make sense. For the same reason, I ruled out any distributed memory/repository that would have to be installed on each component.

My thinking is to keep a single authoritative state on B and never keep a persisted state on A. This means the following:

Before #1, we mark on A that the work unit (i.e. the change) is about to start.
Only B may un-mark this state.
A may fetch info from B at any time to update the state.
No new change on the same unit can be invoked on A while the mark is set.

What do you think? Is there any lightweight way to tame errors in a system of this kind?

This is an old question. Did you find a good solution? ... If so, can you share it? — svidgen, Mar 24 '18 at 17:50

J_H · Answer 1 · 2018-04-18T18:10:49.963

Appending to a persistent log on A should suffice. This copes with reboots and network partitions to achieve eventual consistency, or to signal breakage which prevents such convergence. With amortized group commit it can take less than a single write to persist a log entry.

You suggested making B responsible for unmarking state. I disagree. Only A becomes aware of new work, and only A should be responsible for tracking it and reporting errors such as timeouts. B sends idempotent message(s) to A, and A updates the state, re-querying at intervals as needed.

At step 0, A becomes aware of a new request and logs it. That constitutes an obligation A must later discharge by some deadline - A will continuously perform, and repeat, the subsequent steps until A learns that request processing has completed.
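To make the "log it, then discharge the obligation" idea concrete, here is a minimal sketch of such a log on A. Everything in it is hypothetical: the class name `ObligationLog`, the line format, and the assumption that request ids contain no spaces. It does not force writes to disk or do group commit; a real version would need both.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical append-only obligation log kept by component A. */
class ObligationLog {
    private final Path file;

    ObligationLog(Path file) {
        this.file = file;
    }

    /** Step 0: record the request and its deadline before contacting B. */
    void recordStarted(String requestId, long deadlineMillis) {
        append("STARTED " + requestId + " " + deadlineMillis);
    }

    /** Record that A learned the outcome. Repeating this call is harmless. */
    void recordCompleted(String requestId) {
        append("COMPLETED " + requestId);
    }

    /** Replays the log (e.g. after a reboot): request id -> deadline for open obligations. */
    Map<String, Long> openObligations() {
        Map<String, Long> open = new LinkedHashMap<>();
        if (!Files.exists(file)) {
            return open;
        }
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String[] parts = line.split(" ");
                if (parts[0].equals("STARTED")) {
                    open.put(parts[1], Long.parseLong(parts[2]));
                } else if (parts[0].equals("COMPLETED")) {
                    open.remove(parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return open;
    }

    /** Open obligations whose deadline has passed; A should re-query B or report an error. */
    List<String> overdue(long nowMillis) {
        List<String> late = new ArrayList<>();
        for (Map.Entry<String, Long> e : openObligations().entrySet()) {
            if (e.getValue() < nowMillis) {
                late.add(e.getKey());
            }
        }
        return late;
    }

    private void append(String entry) {
        try {
            // Appends one line; durability (fsync, group commit) is deliberately left out of this sketch.
            Files.write(file, (entry + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After a restart, A would call `openObligations()` to find out what it still owes, and periodically call `overdue(now)` to decide which requests to re-query on B or flag as timed out.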

Some requests will take longer than others. Estimates of processing time will become available on A and on B, perhaps revised as processing continues. Such estimates may be fed back to A so that it seldom produces false-positive timeouts. Think of it as a keep-alive message that says "still working, still working".
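A rough sketch of how A might fold those "still working" messages into its timeouts. The class `ProgressTracker` and its method names are made up for illustration. Taking the maximum of the old and new deadline is my own choice: it keeps duplicated or reordered messages harmless.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical deadline tracker on A, fed by estimates from B. */
class ProgressTracker {
    private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

    /** A starts waiting for a request with an initial processing-time estimate. */
    void expect(String requestId, long nowMillis, long estimateMillis) {
        deadlines.put(requestId, nowMillis + estimateMillis);
    }

    /** "Still working" from B with a revised estimate of the remaining time. */
    void stillWorking(String requestId, long nowMillis, long remainingMillis) {
        // Only extends deadlines of known requests and never shortens them,
        // so replayed or out-of-order messages have no harmful effect.
        deadlines.computeIfPresent(requestId,
                (id, old) -> Math.max(old, nowMillis + remainingMillis));
    }

    /** A learned that the request finished. */
    void completed(String requestId) {
        deadlines.remove(requestId);
    }

    /** True if the request is still open and its deadline has passed. */
    boolean timedOut(String requestId, long nowMillis) {
        Long deadline = deadlines.get(requestId);
        return deadline != null && nowMillis > deadline;
    }
}
```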

score 1 · Answer 2 · answered Jan 23 '18 at 09:49

Adopt a pull instead of push strategy. Make each part pull changes from the others and update its own records.

A logs things B should do to a queue
B pulls from A's queue and does the work
B logs things it has done to a queue
A pulls from B's queue to know what the job result was

(I'm using the word queue, but you may substitute log or topic.)

You can either bake the queue into the services, or you can have a separate message broker. An implementation baked into a service can be as simple as GET /jobrequests?from=<timestamp> (with B keeping track of the latest processed job request's timestamp).
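For illustration, here is a small in-memory sketch of that baked-in variant. `JobRequestLog.since(from)` stands in for whatever A's handler for GET /jobrequests?from=<timestamp> would return, and `JobPoller` plays B. All class and method names are invented, there is no real HTTP involved, and treating `from` as exclusive is an assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical job request as A would expose it. */
final class JobRequest {
    final long timestamp;
    final String payload;

    JobRequest(long timestamp, String payload) {
        this.timestamp = timestamp;
        this.payload = payload;
    }
}

/** A's side: the data behind a handler for GET /jobrequests?from=<timestamp>. */
class JobRequestLog {
    private final List<JobRequest> entries = new ArrayList<>();

    synchronized void append(JobRequest request) {
        entries.add(request);
    }

    /** Returns requests with a timestamp strictly greater than 'from' (assumed convention). */
    synchronized List<JobRequest> since(long from) {
        List<JobRequest> result = new ArrayList<>();
        for (JobRequest r : entries) {
            if (r.timestamp > from) {
                result.add(r);
            }
        }
        return result;
    }
}

/** B's side: remembers the timestamp of the latest processed job request. */
class JobPoller {
    private final JobRequestLog source; // stands in for the HTTP call to A
    private long lastProcessed = Long.MIN_VALUE;
    final List<String> processed = new ArrayList<>();

    JobPoller(JobRequestLog source) {
        this.source = source;
    }

    /** Pulls everything newer than the last processed request; returns the batch size. */
    int pollOnce() {
        List<JobRequest> batch = source.since(lastProcessed);
        for (JobRequest r : batch) {
            processed.add(r.payload); // placeholder for the real work
            lastProcessed = Math.max(lastProcessed, r.timestamp);
        }
        return batch.size();
    }
}
```

One caveat with plain timestamps: a request that arrives later with a timestamp equal to `lastProcessed` would be skipped by this sketch. A monotonically increasing sequence number avoids that.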

A tricky part of such an architecture is to decide on at-least-once vs at-most-once semantics. Concretely: if B pulls an item from the queue and then crashes while performing it, what should happen? There are two possibilities, and which is most appropriate depends on your use case:

At-least-once: B commits its position in the queue only after completing an action, so there is a risk of performing an action twice. If you design actions to be idempotent, you can achieve effectively exactly-once behavior with this approach. (I use Kafka for this scenario.)
At-most-once: B consumes every queue item only once. If it crashes while executing an item, that item will never be executed.
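The at-least-once case with idempotent actions can be sketched roughly as follows. This is a toy, in-memory simulation, not Kafka client code: `AtLeastOnceConsumer`, the simulated crash, and deduplication by item id are all assumptions made for the example. In a real service the committed offset and the set of applied ids would have to be persisted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical consumer on B that commits its queue position only after the work is done. */
class AtLeastOnceConsumer {
    private final List<String> queue;                     // item ids, in queue order
    private int committedOffset = 0;                      // would be persisted in practice
    private final Set<String> applied = new HashSet<>();  // ids already applied; also persisted
    final List<String> effects = new ArrayList<>();       // visible side effects
    int deliveries = 0;                                   // how often any item was handed to apply()

    AtLeastOnceConsumer(List<String> queue) {
        this.queue = queue;
    }

    /**
     * Processes items starting at the committed offset. A positive crashAfterActions
     * simulates a crash right after that many actions, before the last commit.
     * Pass 0 for a run without a crash.
     */
    void run(int crashAfterActions) {
        int done = 0;
        for (int offset = committedOffset; offset < queue.size(); offset++) {
            apply(queue.get(offset));
            done++;
            if (done == crashAfterActions) {
                return; // "crash": the action happened, but its offset was never committed
            }
            committedOffset = offset + 1; // commit only after the action completed
        }
    }

    /** Idempotent action: applying the same id again changes nothing. */
    private void apply(String id) {
        deliveries++;
        if (applied.add(id)) {
            effects.add(id);
        }
    }
}
```

With a queue of a, b, c, calling `run(2)` and then `run(0)` delivers b twice, yet `effects` contains each item once. Dropping the `applied` check would turn the redelivery into a duplicate effect, and committing before `apply` would turn the design into at-most-once.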

Benefits of this approach:

Queue-consuming services don't need to be up for queue push to occur. This means that you are free to restart B while A is doing work or restart A while B is doing work. Redundant hosting of background services is only necessary to ensure overall response time, not reliable operation.
The pace of pulling of queue items can be controlled by the consumer, which allows you to temporarily buffer load peaks in the queue.
