6

In the past, our system had an external data provider (call it the source) sending regular heartbeats to a Java application (call it the client). If the heartbeat failed, the client shut itself down (to avoid serving stale data in a critical application). This was straightforward and highly reliable because data and heartbeat used the same channel.

Since then we have moved to a distributed system: the Java client is broken down into several microservices, and data flows partly through Kafka queues between services.

The important thing -- the system at the far end of the chain (call it the destination) should still reliably get a heartbeat.

If we keep sending the heartbeat via a separate channel, then a failure in one of the microservices or Kafka queues will disrupt the data flow to the destination while the heartbeat continues to flow without interruption -- defeating the whole purpose of having a heartbeat.

One solution I am thinking about is to push heartbeats through all of the services and Kafka queues so that they take the same path as the data itself. In any case, what are the best patterns/design criteria for reimplementing the heartbeat in such a distributed system?
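For illustration, a minimal sketch of that in-band idea, assuming Kafka and Java (the topic name "prices", the key "heartbeat", and the one-second schedule are placeholders, not our real setup): the source-facing service publishes a heartbeat record on the same topic the data travels on, so a heartbeat can only reach the destination if the data path itself is working.

    // In-band heartbeat sketch: heartbeats share the data topic, so they only
    // arrive downstream if the same path the data takes is healthy.
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class InBandHeartbeatProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            // Every second, emit a heartbeat record on the data topic ("prices" is a placeholder).
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                String payload = "{\"type\":\"HEARTBEAT\",\"ts\":" + System.currentTimeMillis() + "}";
                producer.send(new ProducerRecord<>("prices", "heartbeat", payload));
            }, 0, 1, TimeUnit.SECONDS);
        }
    }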

senseiwu
  • kafka is not a message queue – Ewan Apr 06 '18 at 10:06
  • We use it as one, but what difference would it have made to the problem on hand if it were rabbit or something else? – senseiwu Apr 06 '18 at 10:08
  • You mentioned that the "heartbeats" originate from outside the system. Does that mean the outside system is checking to see if your system is still up, or is it announcing to your system that the outside system is still up? It's the difference between asking "Are you there?" and saying "I am here". – Greg Burghardt Apr 06 '18 at 16:33
  • I'm confused. What direction is the data flowing? *To* or *away* from the system generating the heartbeat? ... And what is the consequence of not relying on a heartbeat at all? – svidgen Apr 06 '18 at 17:12
  • @svidgen as mentioned in the original post, not receiving the heartbeat reliably will lead the destination system to shut itself down – senseiwu Apr 06 '18 at 17:15
  • @GregBurghardt it is actually doing both. The outside system is checking whether we are alive - if not, it will stop sending new messages. We, on the other hand, will shut ourselves down if the external system is not sending a heartbeat – senseiwu Apr 06 '18 at 17:20
  • @zencv ... Yeah. You said that. But, that's not what I'm asking. I'm not asking what happens if the heartbeat goes away. I'm asking if you can design the heartbeat out of the whole damn thing. ... Which system is sending the heartbeat? Why? What's the point of it? ... Why aren't you just sending data and retrying if it fails? ... etc. – svidgen Apr 06 '18 at 19:41
  • @svidgen no, we are dealing with prices which change every second. If you offer prices which are old (= a few seconds or even milliseconds later), then we lose money. The heartbeat is meant as a safety mechanism. If we lose the heartbeat, we accept the uncertainty and shut our system off instead of risking losses – senseiwu Apr 06 '18 at 22:47
  • Shuts *what* system down??? And how does that save you money exactly? ... Recommended reading: http://xyproblem.info/ – svidgen Apr 06 '18 at 23:55
  • @svidgen shutting the *service* down (not the system) is standard practice in this domain if we do not get real-time prices. Think of it as a domain-specific requirement rather than a technical choice. How this is implemented is the only technical part. Thanks for the link – senseiwu Apr 07 '18 at 07:40

3 Answers

5

Your solution is the obvious one. When each service receives a heartbeat from one of its sources, it notes the source and time; when that service would send a heartbeat (to its sinks), it first checks that all its sources are alive.

If you have optional sources, the "are all my sources alive?" check becomes trickier, but you have presumably already dealt with that in how the service handles data; the heartbeat just has to match that approach.

If ServiceA can send data to any of 3 instances of ServiceB, it has to send heartbeats to all 3 instances.

If ServiceC receives data from any of 3 instances of ServiceD, it has seen a recent heartbeat from its D source if any of the ServiceD instances sent one.
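A minimal sketch of that bookkeeping, assuming a simple in-memory relay (the source names, the 5-second timeout and forwardHeartbeat() are placeholders): each incoming heartbeat updates a per-source timestamp, and the service only emits its own downstream heartbeat when every logical source has been heard from recently (any instance of a source counts).

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class HeartbeatRelay {
        // Logical upstream sources this service depends on (placeholder names).
        private static final Set<String> SOURCES = Set.of("serviceD", "externalSource");
        private static final Duration TIMEOUT = Duration.ofSeconds(5); // assumed staleness threshold

        private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

        /** Call when a heartbeat arrives; any instance of a source reports under its logical name. */
        public void onHeartbeat(String logicalSource) {
            lastSeen.put(logicalSource, Instant.now());
        }

        /** Call on this service's own heartbeat schedule: relay only if ALL sources are alive. */
        public void tick() {
            Instant cutoff = Instant.now().minus(TIMEOUT);
            boolean allAlive = SOURCES.stream()
                    .allMatch(s -> lastSeen.getOrDefault(s, Instant.MIN).isAfter(cutoff));
            if (allAlive) {
                forwardHeartbeat();
            }
        }

        private void forwardHeartbeat() {
            // Placeholder: publish a heartbeat to this service's sinks (e.g. its outgoing Kafka topic).
        }
    }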

Caleth
  • actually it gets more tricky due to having several instances of each service. Data could end up flowing through one instance while the heartbeat flows through another, which by bad luck dies at that moment -- so a false-positive heartbeat failure will be reported – senseiwu Apr 06 '18 at 10:20
  • Could your services not send their own unique heartbeat messages? That way you know where it has come from, regardless of how it got to the destination – richzilla Apr 06 '18 at 13:56
1

OK, so as I understand it you have this:

DataSource - pushes occasional messages to Clients

Client - Listens for datasource messages

Problem: Because the DataSource sends messages intermittently, if it dies the clients are left unaware and continue displaying the old and now invalid data.

Old Solution:

DataSource - pushes occasional messages to Clients, 
    PLUS a regular small 'heartbeat' message

Client - Listens for DataSource messages and the 'heartbeat'. 
    If the heartbeat isn't received X seconds after the last one, 
    we know the DataSource has died and can take action.

New Situation:

DataSource - pushes occasional messages to intermediate clients,

Load Balanced MicroService(1) - listens for datasource messages
    and pushes messages to next in chain

Load Balanced MicroService(n) - listens for MicroService(n-1) 
    and pushes messages to next in chain

Client - Listens for MicroService(last) messages, but the
    heartbeat is lost in the ether

Solution:

The MicroServices should behave like the old client and report when their datasource has failed to their listeners.

But while a message will be processed by a single microservice in a load balanced group, the heartbeat must be processed by all of them. So the heartbeat should use fanout routing while the messages should use a worker queue.
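Since the question's transport is Kafka rather than a broker with fanout exchanges, the rough equivalent is consumer groups; a sketch under that assumption (topic and group names are made up): all instances share one group for the data topic, so each record is processed once, while each instance uses a unique group for the heartbeat topic, so every instance sees every heartbeat.

    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.util.List;
    import java.util.Properties;
    import java.util.UUID;

    public class ConsumerWiring {
        static KafkaConsumer<String, String> dataConsumer() {
            Properties props = base();
            props.put("group.id", "pricing-workers"); // shared group: each record goes to ONE instance
            KafkaConsumer<String, String> c = new KafkaConsumer<>(props);
            c.subscribe(List.of("prices"));
            return c;
        }

        static KafkaConsumer<String, String> heartbeatConsumer() {
            Properties props = base();
            props.put("group.id", "heartbeat-" + UUID.randomUUID()); // unique group: EVERY instance sees every heartbeat
            KafkaConsumer<String, String> c = new KafkaConsumer<>(props);
            c.subscribe(List.of("heartbeat"));
            return c;
        }

        private static Properties base() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            return props;
        }
    }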

However, it's hard to continue this pattern down the chain, as each worker process would publish its own heartbeat.

I would suggest a more advanced form of routing where you have a routing service which hides the workers from the rest of the world:

[Diagram: a router service sitting in front of a pool of workers, hiding them from the incoming and outgoing queues]

Here your router worker listens to the incoming queues and doles out tasks to a pool of workers. It receives the completed work and passes it on, hiding the individual workers. It can cope with workers that die or take too long to complete work, fire up new workers when under load, etc.

In your case it can also handle the heartbeat, ensuring that the downstream heartbeat is representative of the messages it is sending out.
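A rough sketch of that router, with assumed names throughout: it hands work round-robin to whichever workers have acknowledged recently, and only emits the downstream heartbeat when the upstream heartbeat is fresh and at least one worker is still responsive, so the heartbeat stays representative of actual message flow.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    public class RouterWorker {
        private final List<String> workers; // worker ids, e.g. queue names or endpoints (assumed)
        private final Map<String, Instant> workerLastAck = new ConcurrentHashMap<>();
        private final AtomicInteger next = new AtomicInteger();
        private volatile Instant upstreamHeartbeat = Instant.MIN;
        private static final Duration TIMEOUT = Duration.ofSeconds(5); // assumption

        public RouterWorker(List<String> workers) { this.workers = workers; }

        /** Pick a live worker, round-robin, for the next piece of work. */
        public String route() {
            for (int i = 0; i < workers.size(); i++) {
                String w = workers.get(Math.floorMod(next.getAndIncrement(), workers.size()));
                if (isAlive(w)) return w;
            }
            throw new IllegalStateException("no live workers");
        }

        public void onWorkerAck(String workerId) { workerLastAck.put(workerId, Instant.now()); }
        public void onUpstreamHeartbeat()        { upstreamHeartbeat = Instant.now(); }

        /** Downstream heartbeat: upstream must be alive AND at least one worker responsive. */
        public boolean shouldEmitDownstreamHeartbeat() {
            boolean upstreamAlive = upstreamHeartbeat.isAfter(Instant.now().minus(TIMEOUT));
            boolean anyWorkerAlive = workers.stream().anyMatch(this::isAlive);
            return upstreamAlive && anyWorkerAlive;
        }

        private boolean isAlive(String workerId) {
            return workerLastAck.getOrDefault(workerId, Instant.MIN).isAfter(Instant.now().minus(TIMEOUT));
        }
    }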

Ewan
0

A "heartbeat" is the solving the wrong problem.

The consumer of the micro services needs to guard against serving stale data when any one of the micro services goes down.

In fact, a heartbeat, even in your current setup, is not truly solving the problem.

If the database goes down, a "heartbeat" that doesn't connect to the database will report that the application is still up. I ran into this several years ago. Worse yet, you can't assume that each micro service connects to the same database.

Every single call to a micro service needs error handling for any catastrophic problem that can occur from the point of making the call (the source) through all of the resources used by the micro service. You obviously can't tell whether the database behind a micro service is down when you need to call it, but some sort of HTTP error response will come back (4xx or 5xx). And when responses don't come back at all, applications consuming micro services need sensible timeouts around the calls.
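To make the timeout-and-error-handling point concrete, a sketch with the JDK's HttpClient (the endpoint and the timeout values are assumptions): any 4xx/5xx status, timeout, or connection failure is treated as "the data may be stale", so the caller can stop serving rather than serve old prices.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.Optional;

    public class GuardedClient {
        private final HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2)) // fail fast on connection problems
                .build();

        /** Returns the latest payload, or empty if the dependency looks unhealthy. */
        public Optional<String> fetchLatest() {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://pricing-service/internal/prices/latest")) // assumed endpoint
                    .timeout(Duration.ofSeconds(1)) // per-request timeout
                    .GET()
                    .build();
            try {
                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() >= 400) { // 4xx/5xx: dependency in trouble
                    return Optional.empty();
                }
                return Optional.of(response.body());
            } catch (Exception e) { // timeouts, connection refused, interrupts...
                return Optional.empty();
            }
        }
    }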

That last piece of the puzzle is good server monitoring of the entire technology ecosystem, and a well defined and efficient means of informing the people responsible for maintaining the consumers of micro services of any problems.

Welcome to service oriented/micro service architecture. Things work great when they work, but when chaos reigns, it pours.

Greg Burghardt
  • for the kind of things you are mentioning, we have health checks. LBs rely on those health checks. Heartbeats originate from outside the system, but have to be propagated through the entire system to the destination. They serve a different purpose than a plain health check (usually to let upstream systems take some action) – senseiwu Apr 06 '18 at 16:29