If you had a task that you wanted to run only once, at a regular interval, on a cluster of servers, what would be the best way of achieving this? A cluster in this case means 2 or more identical servers with distributed sessions, sitting behind a load balancer.
Use case: You have a task that is expensive to run and should be run only once every X hours. The job could, for instance, iterate over a bunch of records and update their status.
- Worst case scenario is that having the job run twice invalidates your data.
- Best case scenario is that the job utilises resources on all your servers.
Requirements Summary:
- The job must still run even if one of the nodes is down.
- The job must only be run once per schedule.
- If multiple jobs are scheduled at the same time or at overlapping times, the running jobs must be distributed equally between the servers.
- The machines must have the same code base and be synchronised via NTP.
- The configuration may differ from node to node, via environment variables.
- The job has to start on time or within a given interval of the assigned time (say, 5 minutes).
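To make the "once per schedule" and "start on time" requirements concrete, here is a minimal sketch that states them as checks over a log of executions. Everything in it is hypothetical (the `ScheduleCheck` class, the `Run` fields, the job and node names); it only shows how I would verify a candidate solution, not how to build one.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: checks an execution log against two of the requirements.
final class ScheduleCheck {

    // One recorded execution: the job, its scheduled slot, the node that ran it,
    // and the moment it actually started.
    static final class Run {
        final String job;
        final Instant scheduledAt;
        final String node;
        final Instant startedAt;

        Run(String job, Instant scheduledAt, String node, Instant startedAt) {
            this.job = job;
            this.scheduledAt = scheduledAt;
            this.node = node;
            this.startedAt = startedAt;
        }
    }

    // "The job must only be run once per schedule": no (job, slot) pair may
    // appear twice in the log, regardless of which node ran it.
    static boolean ranAtMostOncePerSlot(List<Run> log) {
        Set<String> seen = new HashSet<>();
        for (Run run : log) {
            String slot = run.job + "@" + run.scheduledAt;
            if (!seen.add(slot)) {
                return false; // a second node (or the same one) ran this slot again
            }
        }
        return true;
    }

    // "The job has to start on time or within a given interval": every run must
    // start at or after its slot, and no later than slot + tolerance.
    static boolean startedOnTime(List<Run> log, Duration tolerance) {
        for (Run run : log) {
            Duration delay = Duration.between(run.scheduledAt, run.startedAt);
            if (delay.isNegative() || delay.compareTo(tolerance) > 0) {
                return false;
            }
        }
        return true;
    }
}
```

A log in which `node-1` and `node-2` both record a run of the same job for the same slot fails the first check, which is exactly the double execution I need to rule out.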
Possible solutions
- Set one node as the master node. This doesn't work, as it violates requirement 1 above.
- Make a request that the load balancer balances to kick off the job. Unfortunately, this has the side effect that if you have multiple jobs running at the same time, they may all be run by the same machine.
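The load balancer problem can be illustrated with a small sketch. It assumes a plain round-robin balancer that handles job kick-off requests and ordinary traffic in the same rotation; the `RoundRobinDemo` class and the `job:` prefix are made up for the example, and a real balancer may use a different policy.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a round-robin balancer: it balances requests, not jobs.
final class RoundRobinDemo {

    // Returns how many job kick-offs each node received. Requests starting with
    // "job:" are kick-offs; everything else is ordinary traffic that still
    // consumes a turn in the rotation.
    static int[] jobsPerNode(int nodes, List<String> requests) {
        int[] jobs = new int[nodes];
        int next = 0;
        for (String request : requests) {
            int target = next;
            next = (next + 1) % nodes;
            if (request.startsWith("job:")) {
                jobs[target]++;
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        // One ordinary request arrives between the two kick-offs.
        int[] jobs = jobsPerNode(2, Arrays.asList("job:A", "GET /home", "job:B"));
        System.out.println(Arrays.toString(jobs)); // prints [2, 0]
    }
}
```

Both jobs end up on the first node while the second node sits idle, which violates the equal distribution requirement.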
This would have to run in Java, in a servlet container. However, I'm not looking for help with coding the jobs themselves.
Surely this is a solved problem with a known best solution.
Related question: https://stackoverflow.com/questions/5949038/schedule-job-executes-twice-on-cluster
This isn't a duplicate, as the solutions there are insufficient as per those 5 requirements given above. The most upvoted solution suffers from a race condition, and the second solution violates requirement 3.