I have a microservice-based system running on bare-metal Kubernetes. The key aspects are:

  • download data from a datasource nightly and add to a database
  • get any new data from the database and run an ML algorithm to get predictions for the new data points

I'm doing this using cronjobs at the moment, e.g.

  • Cronjob 1 - Python container scrapes a datasource nightly at 01:00 and adds any new data to a database
  • Cronjob 2 - Python container checks the database for any new data and runs the ML model to get predictions on the new data points

At the moment, I just leave 30 minutes between the jobs, but is there a better way of triggering Cronjob 2 to run after Cronjob 1 completes (only if new data is available)?

Ideally, I also want to be able to run each Cronjob independently.

wokiwiv
  • If they must run sequentially, why are they not scheduled as a single sequential job? – Flater Jun 23 '21 at 15:56
  • As I want to be able to run them independently too – wokiwiv Jun 23 '21 at 16:25
  • So why not have both? One does not preclude the other. I don't mean copy/paste the job logic, just have a job that calls both methods and jobs that only call one method. – Flater Jun 24 '21 at 07:52

1 Answer

There is nothing specific to Kubernetes in this answer.

One approach is to let job 2 ask whether job 1 is complete, and therefore whether job 2 can begin. Job 2 periodically polls to see if it can start; if not, it sleeps briefly and then checks again. Once it can start, it does its data processing and probably uses the same mechanism that job 1 used to note its completion, so that a job 3 can depend on data from job 2.
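The polling loop described above could be sketched roughly as follows. This is only an illustration: `wait_until`, its parameters, and the idea of passing the completion check in as a callable are assumptions of mine, and how job 1 actually records its completion (a database row, a file, etc.) is left open.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or the timeout expires.

    is_ready is a hypothetical callable that checks whatever marker
    job 1 writes when it finishes. Returns True if the upstream job
    completed in time, False if we gave up waiting.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        # Not ready yet: sleep briefly, then check again.
        sleep(poll_seconds)
```

Job 2 would call something like `wait_until(job1_is_complete)` before its own processing and skip the run if it returns `False`. The timeout is there so that a failed job 1 does not leave job 2 polling forever.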

One approach that I have used successfully is to implement a "high-water mark" for each job. This is a timestamp associated with each job that indicates how far along it is with its data processing. A job can depend on the high-water marks of other jobs, so there is limited coupling between job implementations.

Before a job can set its mark to a new value, it needs to know that it will not receive any more data from before that new high-water mark.

Let's say that all the high-water marks are at noon today. Job 1 receives a data file that covers the period from noon to 1pm and starts processing it and writing results. Job 2 sees that its own mark is at noon, as is job 1's mark. Since they are the same, job 2 knows that it is caught up and has no processing to do; it sleeps briefly and will check again. Eventually, job 1 finishes processing the data and updates its mark to 1pm (since it now knows that it has completed all data that will arrive up to 1pm). Job 2 wakes up, sees the difference in marks, and starts querying/processing data between its current mark (noon) and job 1's mark (1pm). Job 2 processes the data and updates its mark to 1pm.
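The noon-to-1pm walkthrough can be written down as a small sketch. Everything here is illustrative: the marks live in a plain dictionary rather than a shared store, and `run_downstream_once` and the `process` callback are names I made up for the example.

```python
from datetime import datetime
from typing import Callable, Dict

Marks = Dict[str, datetime]


def run_downstream_once(
    marks: Marks,
    job: str,
    upstream: str,
    process: Callable[[datetime, datetime], None],
) -> bool:
    """Do one check-and-process pass for a downstream job.

    Returns False if the job is already caught up with its upstream,
    True if it processed a window and advanced its own mark.
    """
    own_mark = marks[job]
    upstream_mark = marks[upstream]
    if upstream_mark <= own_mark:
        # Marks are equal (or we are ahead): nothing to do.
        return False
    # Process only the data between our mark and the upstream mark.
    process(own_mark, upstream_mark)
    # Advance our mark only after processing has finished.
    marks[job] = upstream_mark
    return True
```

With both marks at noon, a call returns `False`. After job 1 moves its mark to 1pm, the next call processes the noon-to-1pm window and sets job 2's mark to 1pm. Because the mark is only updated after `process` returns, a crash partway through means the same window is retried on the next pass.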

These can either be long-running jobs that poll, process, and sleep as appropriate, or, if you love cron (or similar scheduling), you can just schedule them to run every minute or every five minutes and have them bail out if there is no work to be done.
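For the cron-style variant, the marks need to survive between runs. Here is a minimal sketch that keeps them in a database table and bails out early when there is nothing to do. The table name `job_marks`, the helper names, and the use of SQLite are assumptions for illustration only; the same idea should carry over to whatever database the jobs already share.

```python
import sqlite3
from datetime import datetime
from typing import Callable, Optional


def init_store(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table holding one mark per job."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS job_marks "
            "(job TEXT PRIMARY KEY, mark TEXT NOT NULL)"
        )


def get_mark(conn: sqlite3.Connection, job: str) -> Optional[datetime]:
    row = conn.execute(
        "SELECT mark FROM job_marks WHERE job = ?", (job,)
    ).fetchone()
    return datetime.fromisoformat(row[0]) if row else None


def set_mark(conn: sqlite3.Connection, job: str, mark: datetime) -> None:
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO job_marks (job, mark) VALUES (?, ?)",
            (job, mark.isoformat()),
        )


def run_if_behind(
    conn: sqlite3.Connection,
    job: str,
    upstream: str,
    process: Callable[[Optional[datetime], datetime], None],
) -> bool:
    """One scheduled run: bail out unless the upstream mark is ahead."""
    own = get_mark(conn, job)
    target = get_mark(conn, upstream)
    if target is None or (own is not None and own >= target):
        return False  # no work to be done
    # own is None on the very first run: process everything up to target.
    process(own, target)
    set_mark(conn, job, target)
    return True
```

Each scheduled run would open a connection, call `run_if_behind` once, and exit. Since each job only reads the upstream mark and writes its own, job 1 and job 2 can still be triggered independently.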

Rob