
Once you create separate components that need to communicate with each other, you enter the realm of distributed systems, where you have to assume that errors can originate at any step in the process. You throw try-catch blocks out the window and have to develop robust error-handling alternatives yourself.

We have two systems, both with REST APIs. Both systems have GUIs that users can use to add or update information. When information is added to one system, it must be propagated to the other. We have integration software (the middleman) that polls on a minute-by-minute basis, picks up adds/edits, and translates them from one system to the other. Each invocation keeps track of the timestamp of the last successful run; we keep one timestamp for each direction of communication. In this way, if any part of the system fails, we can resume right where we left off once the issues are corrected.
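
In pseudo-Python, the middleman loop looks roughly like this (the endpoint names, payload shapes, and file-based timestamp store here are illustrative stand-ins, not our actual implementation):

```python
import time
from datetime import datetime, timezone

import requests

SYSTEM_A = "https://system-a.example"  # stand-in base URLs
SYSTEM_B = "https://system-b.example"
POLL_INTERVAL = 60  # seconds: the minute-by-minute schedule

def load_last_success(direction):
    """Read the persisted timestamp of the last successful run for one direction."""
    try:
        with open(f"last_success_{direction}.txt") as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01T00:00:00+00:00"

def save_last_success(direction, timestamp):
    with open(f"last_success_{direction}.txt", "w") as f:
        f.write(timestamp)

def sync(source, target, direction):
    since = load_last_success(direction)
    cutoff = datetime.now(timezone.utc).isoformat()
    # Ask the source system for everything added/edited since the last success.
    resp = requests.get(f"{source}/changes", params={"since": since}, timeout=30)
    resp.raise_for_status()
    for change in resp.json():
        # Translate and push each add/edit into the target system.
        requests.post(f"{target}/records", json=change, timeout=30).raise_for_status()
    # Advance the watermark only after the whole batch went through.
    save_last_success(direction, cutoff)

while True:
    for source, target, direction in [(SYSTEM_A, SYSTEM_B, "a_to_b"),
                                      (SYSTEM_B, SYSTEM_A, "b_to_a")]:
        try:
            sync(source, target, direction)
        except requests.RequestException:
            pass  # timestamp untouched: the next run resumes where we left off
    time.sleep(POLL_INTERVAL)
```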

I have heard bad things about poll-based approaches, namely that they run regardless of whether there is actually work to do. I have heard that push-based approaches are more efficient because they are triggered on demand.

I am trying to understand how a push-based approach might have worked. If either system attempts to push an add/edit, we have to assume that it could fail because the other system is down. It would seem to me that either system would need to maintain its own outgoing queue in order to resume once the issue with the other system is corrected.
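
Roughly, I imagine each system would need something like this (the target URL is a stand-in, and a real outbox would have to survive restarts):

```python
from collections import deque

import requests

OTHER_SYSTEM = "https://system-b.example/records"  # stand-in URL
outbox = deque()  # in-memory here; a real outbox would need to be persistent

def push(change):
    outbox.append(change)  # queue first, so a failed push is never skipped
    drain()

def drain():
    while outbox:
        change = outbox[0]
        try:
            requests.post(OTHER_SYSTEM, json=change, timeout=5).raise_for_status()
        except requests.RequestException:
            return  # other system is down: keep the backlog, retry later
        outbox.popleft()  # dequeue only after confirmed delivery
```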

It seems to me that using a push approach eliminates the middleman but heaps more responsibility on each system to manage its messages to the other system. That does not look like a clean separation of concerns: now both systems have to take on middleman responsibilities.

I don't see how you would redesign the middleman for a push-based architecture. You run the risk of losing messages if the middleman itself fails.

Is there a fault-tolerant architecture that could be used to manage system interactions without polling? I'm trying to understand whether we missed a better alternative when we designed and implemented our poll-based middleman. The software does the job, but there's some latency.

Mario T. Lanza
  • Look at database replication techniques. You're trying to reinvent the wheel for something that is actually quite hard to implement. – Pieter B Aug 01 '14 at 12:51
  • It's not possible to do direct database updates since the submission of a transaction produces internal side effects within each system. We have to stick with the public API. – Mario T. Lanza Aug 01 '14 at 12:55
  • http://en.wikipedia.org/wiki/Message_queue http://en.wikipedia.org/wiki/Two_Generals%27_Problem – Den Aug 01 '14 at 13:04

3 Answers


From your question and the comments on other answers, I feel like you're working on two things: 1) eliminating the middleman, and 2) converting your poll mechanism to a push mechanism. Those can be viewed separately. Let's start with the poll-to-push change.

In general, I'd say yes, pushing is superior to polling. If I understand your infrastructure correctly, you have a middleman that polls in both directions looking for stuff to sync between the systems involved. I'd imagine there's already some kind of queuing in place: if your middleman has polled some events from A while B is down, it keeps those events until B is up again and then re-transmits them. Assuming I got this right, switching to a push mechanism would be pretty simple, and the benefit is clear: instead of the middleman fetching information (which may or may not be there), A (and B, of course) could just push their events into the middleman. It would use its existing queues and either push to B immediately or whenever it sees fit.

Whether this is better for your situation depends on a lot of things, mainly the number of events expected and the maximum delay allowed between syncs. Currently, any number of events will be synced with a delay of at most one minute. With pushing, the delay may decrease heavily, but on the other hand the load may increase if lots of events (that previously would have been handled in one batch) now result in many singular pushes. All of this can and must be handled by the middleman.
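
To make that concrete, here's a minimal sketch of the push-to-middleman idea; Flask, the route name, and the queue layout are my assumptions, not part of your setup:

```python
import queue

from flask import Flask, request

app = Flask(__name__)
inbound = {"a": queue.Queue(), "b": queue.Queue()}  # the middleman's existing queues

@app.route("/events/<source>", methods=["POST"])
def receive(source):
    # A pushes with source="a", B with source="b". The middleman queues the
    # event and forwards it to the other system immediately or whenever it sees fit.
    inbound[source].put(request.get_json())
    return "", 202  # accepted: the sender can forget about the event now

if __name__ == "__main__":
    app.run(port=8080)
```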

If you now try to eliminate the middleman, all of that queuing must reside within systems A and B. Whether that's feasible depends highly on what A and B currently do; it may be overstepping their responsibilities. Also, the currently very constant overhead of syncing might vary as described before. The middleman may even be good for load reduction: its mechanism for forwarding incoming events could be set up to transmit them in batches. What you gain then is systems A and B that can raise events, push them out, and immediately forget about them. That's nice and simple for them. The middleman could then, depending on various settings like maximum delay, maximum number of events, etc., sync those events into the respective other system.
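
And a minimal sketch of such a batching forwarder; the limits and endpoint are illustrative, and retry handling is omitted for brevity:

```python
import queue
import time

import requests

MAX_BATCH = 100   # flush after this many events ...
MAX_DELAY = 10.0  # ... or after this many seconds, whichever comes first

def forward(inbound: queue.Queue, target_url: str):
    while True:
        batch = [inbound.get()]  # block until at least one event exists
        deadline = time.monotonic() + MAX_DELAY
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(inbound.get(timeout=remaining))
            except queue.Empty:
                break
        # One request per batch keeps the load on the target system low.
        # (Retries on failure are omitted for brevity; a real middleman
        # would re-queue the batch instead of dropping it.)
        requests.post(f"{target_url}/records/batch", json=batch).raise_for_status()
```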

I'm sure you see what I'm getting at and it may very well not be what you're looking for. But maybe it helps if you think about it this way. Let me know if I can clarify things.

jhr

If the middleman is currently pulling data based on the time of the previous successful run, then, to cut out the middleman, each individual system could do the same in a push architecture. Yes, each system would need to keep track of this, but it's only a single timestamp, not a full-blown message queue.
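
A rough sketch of what I mean, with hypothetical endpoint and helper names:

```python
from datetime import datetime, timezone

import requests

last_success = "1970-01-01T00:00:00+00:00"  # would be persisted in practice

def on_local_change(fetch_changes_since, target_url):
    """fetch_changes_since and target_url are hypothetical stand-ins."""
    global last_success
    cutoff = datetime.now(timezone.utc).isoformat()
    # Includes any backlog left over from previously failed pushes.
    changes = fetch_changes_since(last_success)
    try:
        for change in changes:
            requests.post(f"{target_url}/records", json=change, timeout=5).raise_for_status()
    except requests.RequestException:
        return  # leave last_success alone; the next push retries the backlog
    last_success = cutoff
```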

eldon111
  • Once the first error occurs, there is a message that has to be sent at some later time. Given this, wouldn't each system have to poll in order to process that latent message? I was wondering whether polling was avoidable or necessary in a system where any given component could fail. – Mario T. Lanza Aug 01 '14 at 13:19
  • If one app tries to push to the other, and receives an error response, it can just try again later (maybe next time it has something to push, or on a timer) and just use the last successful timestamp to determine if there is any other data to push. No polling necessary. – eldon111 Aug 01 '14 at 13:25
  • It was the timer I was thinking of. I figured that if the system had a timer running, it would be a form of polling (e.g. anything to do now? How about now?). But when I thought about it, I realized that each system could switch the timer on when an error occurs and off when the queue is emptied (sketched below). – Mario T. Lanza Aug 01 '14 at 13:32
  • True, and when the timer is running, the app might be polling itself, but that would be better than constantly communicating over a network. – eldon111 Aug 01 '14 at 13:37
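
A minimal sketch of that error-triggered timer (names are illustrative); it runs only while a backlog exists, so nothing polls once the systems are in sync:

```python
import threading

RETRY_INTERVAL = 60.0  # seconds between retries while a backlog exists

def push_with_retry(drain_backlog):
    """drain_backlog() is a stand-in that returns True once everything is delivered."""
    if drain_backlog():
        return  # success: no timer running, nothing polls
    # Failure: switch the timer on; it switches itself off once the queue empties.
    threading.Timer(RETRY_INTERVAL, push_with_retry, args=(drain_backlog,)).start()
```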

It would seem to me that either system would need to maintain its own outgoing queue in order to resume once the issue with the other system is corrected.

Yes, queueing is a good solution for such a scenario; that's what queues are made for: queueing up messages for their consumers. But I don't quite get why each system would have to maintain its own outgoing queue.

A simple setup would be that you have two services, A and B, both of which have an incoming-message queue. Those queues act like mailboxes: A sends messages to B's inbox and vice versa. The queues do what they do best: queue those messages.

Messages arrive at the consumer of such a queue under two conditions: a) on start, the service registers as a consumer, learns that there are messages in the "mailbox", and pops them one by one until the "inbox" is empty; and b) every time a new message arrives, the service gets a notification and pops unread messages until the inbox is empty.
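
For illustration, this is roughly what that consumer looks like with RabbitMQ via pika; the broker choice and queue name are my assumptions, since no particular queueing product is implied here:

```python
import pika  # pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="b_inbox", durable=True)  # B's mailbox

def apply_change(body):
    """Stand-in for translating the message and updating B's state."""
    print("applying", body)

def on_message(ch, method, properties, body):
    apply_change(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)  # remove only after success

channel.basic_consume(queue="b_inbox", on_message_callback=on_message)
channel.start_consuming()  # drains the existing backlog, then waits for new messages
```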

You decouple your systems with two message queues. You send messages in a fire-and-forget manner; the question of whether the recipient is alive doesn't bother the sender. A queue would do what your current middleware does, but in an easier way.
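
And the matching fire-and-forget sender; again RabbitMQ/pika is just one possible infrastructure, and the message reuses the UPDATECUSTOMER example from the comments below:

```python
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="b_inbox", durable=True)

message = {"message": "UPDATECUSTOMER", "entity": {"id": "123", "lastName": "Sixpack"}}
channel.basic_publish(
    exchange="",
    routing_key="b_inbox",
    body=json.dumps(message),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message to disk
)
connection.close()  # A neither knows nor cares when B reads this
```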

The point of failure shifts from the direct recipient, e.g. service A, to the queueing infrastructure, which has to be built resilient and redundant, i.e. fault-tolerant.

Each invocation keeps track of the timestamp of the last successful run; we keep one timestamp for each direction of communication. In this way, if any part of the system fails, we can resume right where we left off once the issues are corrected.

The queues in the above scenario act as FIFO buffers, so they would preserve message order.

And since it is a distributed system, you have to deal with CAP. Simply put: the problem that you have two out-of-sync services, both of which could get queried simultaneously.

Thomas Junk
  • Service A has an event that B cares about. It goes to send that event/message to B's inbox, but the inbox itself is part of the system. The inbox resides on a separate machine, which can itself fail. If it weren't on a separate machine, we'd have an outbox, and we're back to polling (is the inbox available now?) in order to move messages from the outbox to the consumer's inbox. How does the queuing infrastructure guarantee availability so that the publishers and subscribers can go about their business of sending and reading messages? – Mario T. Lanza Aug 01 '14 at 16:55
  • No. The inboxes are part of a separate part of your infrastructure. That an inbox is available is, as said, a problem of your queueing infrastructure. You could think of multiple redundant queues: queues backing up queues. You cannot eliminate system failure, but you can minimize its consequences. – Thomas Junk Aug 01 '14 at 17:06
  • An infrastructure of queues backing up queues (QBUQ) seems more complex than polling, where no backup is necessary, assuming we're content with the latency of a restart if our middleman fails. The problem with QBUQ is that if the two backups fail, you'll lose transactions. Plus, with several queues it seems you'd have trouble maintaining the chronology of messages. I understand the architecture could work; I'm just trying to understand whether the added complexity is worth it. The polling approach minimizes infrastructure. – Mario T. Lanza Aug 01 '14 at 17:27
  • No change comes without cost. And yes, it may be that complexity increases. If your middleware crashes, there is no synchronization; that's a no-brainer. The _upside_ is that you update when necessary instead of putting your DB under constant fire, which takes load off the server. Unless you have the need, why change the infrastructure at all? – Thomas Junk Aug 01 '14 at 20:11
  • @Thomas Junk In your queue suggestion, what if the message was received correctly from A to B, but something goes wrong at a different level inside B (SQL, for example)? You cannot use a fire-and-forget approach; you need to think about how to confirm the synchronization back to A. The thing that comes to the top of my mind is doing one round of confirmation. This will not solve the problem completely: things can still be improperly synchronized, and that will be fixed either periodically or when a system-wide read request arrives. There you do a signature comparison for the data from both A and B. – InformedA Aug 02 '14 at 11:35
  • .. then if you see a mismatch, you issue a recovery request and return an error for the current request. The recovery might also not be completely perfect, and this goes on. But that's OK, because the chance of failure goes down each time. This is the typical solution of the form `it is not a problem until it is a problem`, and we will `get it better next time`. – InformedA Aug 02 '14 at 11:40
  • @randomA This is not somehow _transactional_: **A** sends only to **B** and is not interested in _feedback_. But on the other hand, where is the difference from the current solution? **A** transmits state to **B**; **B** crashes; **A** retransmits at a later point in time. So the "message" would be the same as the one waiting in the queue. The only difference is that **A** knows that **B** knows **A's** state. But a) why should **A** care? And b) if it is necessary, why not send _confirmation messages_? As long as **A** has no confirmation, **A** can be sure that **B** will answer in the future. – Thomas Junk Aug 02 '14 at 11:47
  • @Thomas Junk In my suggestion, A sends B a message and A is interested in one confirmation feedback from B. In your current solution, A has to transmit ALL of its state to B; if not, then how does A know which part of the state was received and synchronized, and which part was not synchronized because of an error somewhere? – InformedA Aug 02 '14 at 11:56
  • »A has to transmit ALL of its states to B« No, only _changes_: e.g. `{message:"UPDATECUSTOMER", entity:{id:"123", lastName:"Sixpack"}}`. »A sends B message and A is interested in one confirmation feedback« We are talking about _syncing_ state, so why should **A** care about B's state at all? And how does **A** react if it has no confirmation? – Thomas Junk Aug 02 '14 at 12:24
  • By the way, I see no sense in elaborating further, since the OP seems not interested in answers. – Thomas Junk Aug 02 '14 at 12:26
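
For what it's worth, a minimal sketch of the signature comparison InformedA describes above: each side computes a digest over its records, and a mismatch triggers a recovery sync (all names are hypothetical):

```python
import hashlib
import json

def signature(records):
    """Order-independent digest over a system's records."""
    canonical = json.dumps(sorted(records, key=lambda r: r["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def check_in_sync(records_a, records_b):
    # A mismatch means A and B have drifted apart: issue a recovery request
    # and return an error for the current request, as described above.
    return signature(records_a) == signature(records_b)
```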