
I'm working on scaling an application that currently polls a MySQL database to send asynchronous jobs to a background processing system.

A state machine is used to keep track of each entity's progress through a workflow. For example, we might have three states:

  • Scheduled
  • Processing
  • Complete

My plan is to add a message queue to broker the jobs sent to the background processing system. Application A would insert a new entity, then push a message onto the queue. Application B would consume these messages and route them to the correct background processing job.

def do_work(entity)
  # Precondition check
  raise "Wrong State" unless entity.scheduled?

  # Update state to processing
  entity.processing

  # ... do work

  # Update state to complete
  entity.complete
end

Given a job like the one above, I am having a hard time determining how it would be possible to recover from an error between the processing and complete transitions (for example, a process crash).

The job processor would retry the job, but the entity is now stuck in the processing state, so the precondition check would fail.
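One way to tolerate the retry might be to accept the "processing" state as well, assuming the work itself is idempotent. A sketch against the same entity API as above:

```ruby
# Sketch of a retry-tolerant do_work, assuming the same hypothetical
# entity API (scheduled?, processing?, processing, complete). A retried
# job that crashed mid-flight is accepted in the "processing" state
# instead of raising, provided the work is safe to run a second time.
def do_work(entity)
  unless entity.scheduled? || entity.processing?
    raise "Wrong State"
  end

  # Only transition on the first attempt.
  entity.processing unless entity.processing?

  # ... do work (must be idempotent, since it may run twice)

  entity.complete
end
```

That only makes the retry safe, though; something still has to notice the crash and trigger the retry.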

How could I handle this case without resorting back to polling the entity table looking for stale jobs?

Edit

There are two different concepts at play here: the data itself, and the state of the data as it relates to job execution (scheduled/processing) plus the state of the workflow (complete).

Both of these pieces of information live in the same table, so as the data grows, polling becomes inefficient (reads, updates, and inserts are constantly happening).

So maybe a solution is to have Application B move active jobs into a separate datastore. Then, when the "clean up" task that @Ӎσᶎ refers to runs, the dataset it has to scan is much smaller.
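As an illustration, the move might look like this. The `db` handle (responding to `execute`) and the table names are made up, not a real API:

```ruby
# Hypothetical archival step: copy finished rows into an archive table,
# then remove them from the active table, so any clean-up pass only
# scans jobs that are still in flight.
def archive_completed(db)
  db.execute(
    "INSERT INTO entities_archive SELECT * FROM entities WHERE state = 'complete'"
  )
  db.execute("DELETE FROM entities WHERE state = 'complete'")
end
```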

Ultimately, it seems there is no way to avoid polling the database to ensure the data is in a correct state.

Karl
  • Can't the job processor clean up before re-trying the job? – Elliott Frisch Mar 26 '14 at 02:49
  • One way or another you need that clean up step. What if something crashes/goes offline/catches fire? You may only do it once a day, but you have to do it. – Móż Mar 26 '14 at 02:57
  • @Ӎσᶎ To play devil's advocate, why not always use db polling? It handles the use case plus the edge case. My hope was to abandon polling the entity table altogether. – Karl Mar 26 '14 at 12:45
  • @ElliottFrisch That seems like a reasonable option. I'll need to find out if that is possible with this job processor. – Karl Mar 26 '14 at 12:48
  • @Karl: you mean roll the cleanup into the poll? Depends how long the cleanup takes, it may not be something you want to add to every single poll. If polling is just "grab first unowned job from pending job" but cleanup is "for each entry in the pending jobs table and the completed jobs table do something" that seems unnecessary. – Móż Mar 26 '14 at 21:31
  • @Ӎσᶎ Take a look at my edits, does that make things clearer? – Karl Mar 27 '14 at 02:50
  • @Karl: yes, and your suggested solution is how most systems do it. Sometimes by a periodic "archive out old records" operation (or, if you must "data_table_2014_03_27" etc), but always by removing processed records. – Móż Mar 27 '14 at 02:52
  • @Karl is there anything I can add to make you think my answer is worth upvoting or accepting? – Móż Mar 28 '14 at 04:13

1 Answer


I'm going to consider this more in the form of a risk management plan than a program design, because that's where the program design has to start.

It looks as though you have two communication channels - the database and a message queue. We'll assume the database is accurate, since if it fails it will normally require manual intervention. But everything else can fail, even in unlikely circumstances (fires!)

The normal flow seems to be:

process a: 
     (1) create job
     (2) add to database
     (3) send message
process b: 
     (4) message received 
     (5) get job from database 
     (6) process job 
     (7) mark complete in database
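
A sketch of that flow, with `db` and `queue` objects standing in for whatever you actually use (their methods are assumptions, not a real API):

```ruby
# Process A: create the job (1), persist it (2), then notify (3).
def enqueue_job(db, queue, payload)
  job_id = db.insert(payload.merge("state" => "scheduled"))
  queue.publish(job_id)
  job_id
end

# Process B: receive (4), load (5), process (6), mark complete (7).
# The actual work is passed in as a block since it is application-specific.
def handle_message(db, queue)
  job_id = queue.receive
  job = db.find(job_id)
  yield job
  db.update(job_id, "state" => "complete")
end
```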

Each of these steps can fail. "create job" might generate an invalid job (one that can't be completed, say), or might fail to retrieve required information. So you'll probably want some kind of error reporting mechanism (no, really :)

Note that one of those might be resolvable by retrying (data not available), but the other will probably need an error detection mechanism: the "clean up" task. And the cleanup needs to categorise problems as fixable or not, ideally by having the code to fix them built in. That way its output is normal process flow plus error reports, just like every other part of the program.
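For instance, a clean-up pass might categorise stale jobs like this (the field names and thresholds are illustrative only):

```ruby
# Jobs stuck in "processing" past a deadline are either reset for a
# retry (fixable) or marked failed and collected for an error report.
def clean_up(jobs, max_attempts: 3, stale_after: 3600, now: Time.now)
  reports = []
  jobs.each do |job|
    next unless job[:state] == "processing"
    next unless now - job[:updated_at] > stale_after
    if job[:attempts] < max_attempts
      job[:state] = "scheduled"   # fixable: requeue for another attempt
      job[:attempts] += 1
    else
      job[:state] = "failed"      # not fixable here: report it
      reports << job
    end
  end
  reports
end
```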

I suggest going through your code and at least mentally noting the hierarchy of error processing you already have, from "if this string can be turned into an integer do..." right up to "if there's an unhandled exception somewhere in the program". The "clean up" code is more closely aligned to this than it is to normal execution. It's akin to the heap checking that tools like valgrind do - you run it periodically to make sure that your program state is correct.

One way or another you're going to end up with a thread or process somewhere that schedules jobs: either right at the start (your question talks about scheduling jobs rather than just queuing them), or purely to run the clean up task. If the latter, you might be able to run it as a separate executable via cron. (Under Windows, I suggest not getting involved with the equivalent; my personal memory is that it's a PITA to use programmatically. One of the jobs of "clean up" is to make sure the scheduled task exists.)

You might be able to reduce the frequency of the cleanup task by triggering it manually from the steps above when there are certain types of errors (i.e., if processing a job takes longer than allowed, abandon it and call clean up). If you can make it lightweight enough you might be able to get away with running it every time step 5 above runs, but I suggest not doing that even if you can initially. Or at least test it with several years' worth of data in all your tables.
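As a sketch of the "takes longer than allowed" trigger, using Ruby's stdlib Timeout (the surrounding job handling is assumed):

```ruby
require "timeout"

# Run a job's work under a deadline; if it overruns, abandon it so the
# caller can reset its state and invoke the clean up task.
def process_with_deadline(job, seconds)
  Timeout.timeout(seconds) { yield job }
  :ok
rescue Timeout::Error
  :abandoned
end
```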

But regardless, I suggest running it at program start up then daily or so, just to catch things you haven't thought of yet. Like, what happens when your program gets shut down? How fast does it exit when the computer is shut down, and can it leave a job "paused" to be resumed or does it have time to reset the job to "ready to be run"? What if the database task to reset the job times out because the database is in the process of shutting down too?

(Edit: moving data out of the "queue" table once it is processed will speed things up, and more so over time as the number of processed jobs grows. Depending on the amount of data and the number of failures the cleanup has to deal with, it may be possible to fold that into either the "get job" or "mark complete" steps. But that's only for really simple jobs.)

Móż