I'm working on scaling an application that currently polls a MySQL database in order to send async jobs to a background processing system.
A state machine is used to track each entity's progress through the workflow. For example, we might have three states (a simplified sketch of the model follows the list):
- Scheduled
- Processing
- Complete
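Roughly, the model looks something like this. This is a simplified sketch, assuming an ActiveRecord-backed table with the workflow state stored in a state column on the same row as the data; the method names are kept consistent with the job shown further down.

# Simplified entity model: workflow state lives in a `state` column
# on the same table that holds the entity's data.
class Entity < ActiveRecord::Base
  def scheduled?
    state == "scheduled"
  end

  # Transition helpers: each one just persists the new state to the row.
  def processing
    update!(state: "processing")
  end

  def complete
    update!(state: "complete")
  end
end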
My plan is to add a message queue to broker the jobs sent to the background processing system. Application A would insert a new entity and then push a message onto the queue; Application B would consume these messages and route them to the correct background processing job.
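As a rough sketch of the producer side (the schedule_work name and the queue.publish call are made up for illustration, building on the Entity model above):

# Application A: insert the entity, then push a message onto the queue.
# `queue` stands in for whatever broker client ends up being used
# (RabbitMQ, SQS, etc.); the publish call is illustrative only.
def schedule_work(attributes, queue)
  entity = Entity.create!(attributes.merge(state: "scheduled"))
  queue.publish(entity_id: entity.id)
  entity
end

On the consumer side, Application B's job currently looks like this: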
def do_work(entity)
  # Precondition check
  raise "Wrong State" unless entity.scheduled?

  # Update state to processing
  entity.processing

  # ... do work

  # Update state to complete
  entity.complete
end
Given a job like the one above, I'm having a hard time seeing how to recover from an error that occurs between the processing and complete transitions, for example a process crash. The job processor would retry the job, but by then the entity is in an inconsistent state, so the retry would fail.
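Concretely, the failure mode looks like this (reusing the names from above):

# First attempt: the process crashes after the state moves to processing
do_work(entity)   # entity.processing runs, then the process dies before entity.complete

# Retry by the job processor: the precondition check now fails
do_work(entity)   # raises "Wrong State" because the entity is stuck in processing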
How could I handle this case without falling back to polling the entity table for stale jobs?
Edit
There are two different concepts at play here: the data itself, and the state of that data as it relates both to job execution (scheduled/processing) and to the workflow (complete).
Both pieces of information live in the same table, so as the data grows, polling becomes inefficient (reads, updates, and inserts are constantly happening against it).
So maybe a solution is to have Application B move active jobs into a separate datastore. Then, when the "clean up" task that @Ӎσᶎ refers to runs, the dataset it has to scan should be much smaller.
Ultimately, it seems like there is no way to avoid polling the database to ensure the data is in a correct state.
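For reference, this is roughly the kind of clean-up sweep I have in mind over that smaller active-jobs datastore. The ActiveJobRecord model, the staleness threshold, and the re-enqueue call are all assumptions for illustration:

# Hypothetical periodic clean-up task over the (much smaller) active-jobs
# datastore: find jobs stuck in "processing" for too long and re-enqueue them.
STALE_AFTER = 15 * 60 # seconds; arbitrary threshold for illustration

def sweep_stale_jobs(queue)
  ActiveJobRecord
    .where(state: "processing")
    .where("updated_at < ?", Time.now - STALE_AFTER)
    .find_each do |job|
      # Reset the entity so do_work's precondition passes again,
      # then push it back onto the queue for another attempt.
      Entity.find(job.entity_id).update!(state: "scheduled")
      queue.publish(entity_id: job.entity_id)
    end
end

Even then, the sweep is still a form of polling, just over a much smaller dataset, which is what leads me to the conclusion above.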