1

I have multiple instances of the same worker that process long-running tasks. Usually those tasks last about 30 minutes to 5 hours. Tasks are stored in RabbitMQ. Workers are deployed as a Kubernetes single-container Deployment with multiple replicas.

The problem is deploying new changes. I see two strategies here: interrupting current processing, or deploying new workers and letting the existing ones exit on their own.

I chose the first strategy because it lets me deploy new changes quickly. After a deploy finishes, I can be sure that all workers use the same codebase. But there are downsides: I need to handle the exit signal, restart task processing, restore state, check whether I should insert or update records, and so on.
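
For illustration, a rough sketch of the signal handling I have in mind (a Python worker using pika; the queue name "tasks", process_task(), and TaskAborted are placeholders, not my real code):

```python
import signal
import pika

# Placeholders: the "tasks" queue, process_task() and TaskAborted stand in
# for the real long-running job and its abort path.

class TaskAborted(Exception):
    """Raised by the task when it notices a shutdown request."""

shutdown_requested = False

def on_sigterm(signum, frame):
    # Kubernetes sends SIGTERM before killing the pod.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, on_sigterm)

def process_task(body):
    # The real 30 min - 5 h job; it periodically checks the flag,
    # saves whatever state it can, and aborts.
    if shutdown_requested:
        raise TaskAborted()

def on_message(channel, method, properties, body):
    try:
        process_task(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except TaskAborted:
        # Return the message to the queue so a worker running the new code
        # picks it up, then stop consuming and let the process exit.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        channel.stop_consuming()

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
channel = connection.channel()
channel.basic_qos(prefetch_count=1)  # hold at most one long task at a time
channel.basic_consume(queue="tasks", on_message_callback=on_message)
channel.start_consuming()
connection.close()
```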

So my question is, could I say that interrupting current processing to deploy new changes is a best-in-class solution? Are there other approaches here?

pprishchepa
  • 119
  • 2
  • How do your users like it if their job that was started 4h 59 min ago gets killed for a minor software update (e.g. fixing a typo in a comment)? Would you like the provider of your CI pipeline to deploy updates in that way? – Bart van Ingen Schenau Aug 11 '20 at 09:56
  • If task processing returns an unexpected error (as most errors are), the task message will be re-queued using a progressive delay mechanism. So I will have a few hours to fix such an accidental bug. – pprishchepa Aug 11 '20 at 11:07
  • You should interrupt existing workers so they finish their current task and *then* exit. That way, there is no save-and-restore - but you also don't have old workers getting new tasks from the work queue. Or, you can interrupt existing workers so they exit immediately, and restart running tasks from the beginning. – user253751 Aug 11 '20 at 14:22
  • @PavelPrischepa, I think you misunderstood my comment. I meant that you kill the job to roll out a new version. – Bart van Ingen Schenau Aug 11 '20 at 19:32

1 Answer

3

In general, unless the workers contain buggy code which may damage data, it is best to just let them run to completion and then exit, while new tasks are started by new workers (rolling deployment).
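
As a rough sketch of that drain behavior (assuming a Python worker using pika; the "tasks" queue and process_task() are placeholders): on SIGTERM the worker stops taking new messages, finishes the message it currently holds, and exits. On the Kubernetes side, terminationGracePeriodSeconds must be set longer than the longest task so the pod isn't killed mid-job.

```python
import signal
import pika

# Placeholders: the "tasks" queue and process_task() stand in for the real job.

def process_task(body):
    pass  # the real 30 min - 5 h job goes here

def main():
    connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
    channel = connection.channel()
    channel.basic_qos(prefetch_count=1)  # hold at most one long task

    def on_sigterm(signum, frame):
        # Rolling update in progress: stop fetching new work, but let the
        # task in flight finish; start_consuming() returns once the
        # current callback has completed and been acknowledged.
        channel.stop_consuming()

    signal.signal(signal.SIGTERM, on_sigterm)

    def on_message(ch, method, properties, body):
        process_task(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="tasks", on_message_callback=on_message)
    channel.start_consuming()
    connection.close()

if __name__ == "__main__":
    main()
```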

If the reason for deploying new worker code is extended functionality, there shouldn't be any job in the queues yet that would benefit from running with a new worker.

If you have just performance improvements, letting old workers continue should not hurt unless their performance is so bad that you'd wait days to have a job finished by an old worker.

If you have genuine bug fixes, you might have a case for stopping old workers, but in that case you'll probably want to start the jobs from the beginning and not from a saved intermediate state that might already contain buggy data.

Hans-Martin Mosner
  • 14,638
  • 1
  • 27
  • 35
  • I think that with an automatic CI/CD pipeline it's hard to handle multiple deployment variants depending on the type of committed modification (bug fix, performance improvement, ...); it should be a generic pipeline. Also bear in mind that task processing could be interrupted by a problem on the hosting provider's side, e.g. a VPS hardware issue, an availability zone going down, and so on, and the task would then be re-scheduled to another Kubernetes node. Thus, the variant with stopping/starting workers looks like it covers all the cases. – pprishchepa Aug 11 '20 at 11:39
  • Supporting idempotency requires writing code to avoid errors like inserting a DB record that already exists, etc. And as there are plenty of such cases, it raises a concern for me. – pprishchepa Aug 11 '20 at 11:41
  • Which would be the beauty of a transaction. Everything lands, or none of it does. – Kain0_0 Aug 11 '20 at 23:31
  • 1
    @Kain0_0 That's true, but 5-hour transactions are not generally advisable :-) – Hans-Martin Mosner Aug 12 '20 at 04:39
  • Then how about optimistic locking? – Kain0_0 Aug 12 '20 at 05:13