I'm unsure about the best way to keep two data sources synchronized in a distributed system. In my case, a service checks a repository for expired jobs; when a job has expired, it is removed from the repository and enqueued on a distributed queue (the example is in Python, but it should be easy to follow):
```python
def check_expired_jobs(self):
    jobs = self._job_repository.all()
    for job in [job for job in jobs if job.has_expired()]:
        self._job_queue.enqueue(job.crawl_task)
        self._job_repository.delete(job)
```
My concern is that a lot can go wrong here, since both the queue and the repository are remote data sources. If the enqueue succeeds but the repository deletion then fails for whatever reason, I end up in an inconsistent state: the job is still in the repository, so the next sweep will enqueue it again. This isn't the first time I've run into this kind of problem, and I'd like to tackle it properly this time.
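One mitigation I've sketched (class and method names here are hypothetical, mirroring my snippet above) is to keep the enqueue-then-delete order and simply accept at-least-once delivery: if the delete fails, the job stays in the repository and gets re-enqueued on the next run, so the queue consumer has to handle duplicate tasks idempotently (e.g. by de-duplicating on a job id). Is this a reasonable direction?

```python
import logging

logger = logging.getLogger(__name__)


class ExpiredJobChecker:
    """Sketch: enqueue first, delete second, tolerate duplicates.

    A failure between the two remote calls leaves the job in the
    repository, so the next sweep retries it. This gives at-least-once
    delivery; the consumer side must be idempotent.
    """

    def __init__(self, job_repository, job_queue):
        self._job_repository = job_repository
        self._job_queue = job_queue

    def check_expired_jobs(self):
        for job in self._job_repository.all():
            if not job.has_expired():
                continue
            try:
                # Enqueue before deleting: a crash between the two steps
                # duplicates the task rather than losing it.
                self._job_queue.enqueue(job.crawl_task)
                self._job_repository.delete(job)
            except Exception:
                # Leave the job in place; the next sweep will retry it.
                logger.exception("failed to hand off job %r", job)
```

The trade-off is that the inconsistency window still exists, but it is now benign by design instead of being a silent bug.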
What would be the best practice to keep several data sources/repositories in sync?