
Background

We have a ton of GPS devices in vehicles. These devices need to communicate with our system, which means we need to parse the devices' messages before we can process them.

The following describes my idea for a scalable, fail-resistant architecture.

Architecture

The devices are black boxes that send us buffers of data via TCP every X seconds. A typical message follows this trajectory:

  1. Each device communicates with a RabbitMQ server (R1) that has a persistent queue. This way, if something fails, no messages are lost.

  2. R1 sends the messages to the Message Parser Cluster (Hamster Cluster): a fancy name for a bunch of hamsters that each receive a data buffer as input and output a JSON object we can understand.

  3. The Hamster Cluster then sends the parsed messages to another Persistent Queue (R2), which has the same properties as R1.

  4. R2 then sends these messages to the Data Processing Cluster (Monkey Cluster), which does the real work. We are inside our system now, and the path ends. (A sketch of a Hamster worker follows this list.)
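As an illustration of steps 2 and 3, here is a minimal sketch of one Hamster worker, assuming Python with the pika client. The hostnames, queue names, and the `parse_buffer` wire format are hypothetical stand-ins for the real device protocol:

```python
import json

import pika

# Hypothetical queue names; substitute your real topology.
RAW_QUEUE = "raw_device_messages"   # the persistent queue on R1
PARSED_QUEUE = "parsed_messages"    # the persistent queue on R2

def parse_buffer(body: bytes) -> dict:
    """Placeholder for the real device protocol; assumes a
    'device_id,lat,lon,timestamp' payload purely for illustration."""
    device_id, lat, lon, ts = body.decode().split(",")
    return {"device_id": device_id, "lat": float(lat),
            "lon": float(lon), "timestamp": ts}

# One connection per broker: consume from R1, publish to R2.
r1 = pika.BlockingConnection(pika.ConnectionParameters("r1.example.com"))
r2 = pika.BlockingConnection(pika.ConnectionParameters("r2.example.com"))
consume_ch = r1.channel()
publish_ch = r2.channel()
consume_ch.queue_declare(queue=RAW_QUEUE, durable=True)
publish_ch.queue_declare(queue=PARSED_QUEUE, durable=True)

def on_message(channel, method, properties, body):
    message = parse_buffer(body)
    # delivery_mode=2 marks the message persistent on R2's disk.
    publish_ch.basic_publish(
        exchange="",
        routing_key=PARSED_QUEUE,
        body=json.dumps(message),
        properties=pika.BasicProperties(delivery_mode=2),
    )
    # Ack to R1 only after the parsed copy is safely in R2; if a hamster
    # dies mid-parse, R1 simply redelivers the unacked message.
    channel.basic_ack(delivery_tag=method.delivery_tag)

consume_ch.basic_consume(queue=RAW_QUEUE, on_message_callback=on_message)
consume_ch.start_consuming()
```

Because a hamster acks R1 only after the re-publish succeeds, hamsters can be added or killed freely; the worst case is a redelivered duplicate, so the monkeys should process messages idempotently.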

[architecture design diagram]

Objectives

The main objectives here are to have an architecture design with the following properties:

  1. Scalable. We must be able to recruit more Hamsters and Monkeys if any of our Clusters is dying (no one likes dead animals!).

  2. Fail-proof. If any given machine in the system fails, the service must not go down.

Problems/Questions

  1. As designed, this system has 2 critical failure points: R1 and R2. Should these machines fail, the whole system goes down. Is there a way to avoid having these two critical failure points?

  2. My coworkers made the case for NGINX, which also supports TCP/UDP connections. I understand that using it, we would have load balancing between the machines of each cluster. However, replacing the Rabbit servers with NGINX servers would still leave 2 critical points of failure. Could this be avoided?

Flame_Phoenix
  • 155
  • 1
  • 7
  • Are you searching for a general architecture solution or a system/networks one? On the network side, the simple answer would be to have each Rabbit server doubled by a fail-over, given that R2 and the monkeys won't be bothered by messages arriving out of order at R2 or the Data Processing Cluster. If ordering is required, you will necessarily need a single point gathering the data, and therefore a failure point. Also, if your architecture is behind a router, that router will also be a single failure point. Other solutions would involve a cache on the device and hamster side. – Walfrat Apr 23 '18 at 11:32
  • Given your description I would go the system/network solution route. The single entry point is R1, and we don't really care about order at any given step, provided the time difference is not abysmal (not greater than X minutes, for example; each message is parsed rather quickly, so I don't see this happening). Would you care to elaborate on your cache suggestion? I fail to see how it would help. – Flame_Phoenix Apr 23 '18 at 11:44
  • Well, the cache solution would simply store data locally and push it later, for the case where you still have a single point of failure; that is, supposing your requirements say you shouldn't lose any data, or should at least keep a history recording one reading every X minutes. – Walfrat Apr 23 '18 at 11:56
  • RabbitMQ is normally deployed in [clusters](https://www.rabbitmq.com/clustering.html) of 3 or more nodes for high availability. The single point of failure I can see is the ingress point / network connection between the devices and R1. I guess you could either expose multiple nodes in the cluster, or maybe use a [load balancer](https://insidethecpu.com/2014/11/17/load-balancing-a-rabbitmq-cluster/). – Justin Apr 23 '18 at 12:16
  • Still, here we have the same issue. If I use the load balancer, then the load balancer becomes the critical point of failure. If I create a Rabbit cluster, then the master node (exposed to the outside world) becomes the point of failure. – Flame_Phoenix Apr 23 '18 at 12:20
  • DNS load-balance 2 load balancers. – Ewan Apr 23 '18 at 15:32
  • Or have the client know about multiple endpoints. – Ewan Apr 23 '18 at 15:33
  • I would like to know more strategies for having the client know about different endpoints. Do you have any literature on this, or on the DNS suggestion? – Flame_Phoenix Apr 24 '18 at 07:35
  • Besides: you say `R1 sends` and `R2 then sends`. Do you mean data is pulled from R1/R2? – Thomas Junk Apr 24 '18 at 09:09
  • @Flame_Phoenix: What Ewan is suggesting is this. Have code in your device switching to R1.1 in case R1 is down. R1.1 does the same thing as R1. You could also have your hamsters know to switch to R2.1 in case R2 is down. – Jonathan van de Veen May 07 '19 at 13:03
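To make the "multiple endpoints" suggestion from the last few comments concrete, here is a minimal sketch, assuming Python with the pika client and hypothetical hostnames. pika supports this natively by accepting a sequence of connection parameters:

```python
import pika

# Hypothetical fail-over pair for R1; r1-standby does the same job as
# r1-primary. The same pattern lets the hamsters fail over to R2.1.
R1_ENDPOINTS = [
    pika.ConnectionParameters("r1-primary.example.com"),
    pika.ConnectionParameters("r1-standby.example.com"),
]

# BlockingConnection tries each endpoint in order until one accepts,
# so client-side fail-over needs no extra code.
connection = pika.BlockingConnection(R1_ENDPOINTS)
```

The DNS variant is the same idea one level up: publish several A records for a single name (or a low-TTL record you repoint on failure), so even a client that knows only one hostname can reach a surviving endpoint.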

2 Answers


Since you are already using RabbitMQ, consider that RabbitMQ can be made highly available. It can also be made quite durable.

From the RabbitMQ team themselves:

For high availability we recommend clustering multiple RabbitMQ brokers within a single data center or cloud region and the use of replicated queues. Clustering requires highly reliable, low latency links between brokers. Clustering over a WAN is highly discouraged.

Disaster recovery typically requires asynchronous replication of data between data centers. RabbitMQ does not currently support this feature but other message routing features can be leveraged to offer a partial solution.

RabbitMQ does have support for replication of schema (the exchanges, queues, bindings, users, permissions, policies etc) which allows for a secondary cluster in a different data center to be an empty mirror of the primary cluster. A fail-over using only schema synchronisation would provide availability but with a data loss window.

You will have to decide for yourself whether the durability option proposed there (disaster recovery: the ability to not lose data, or to minimize data loss, if the datacenter housing your instances is destroyed) meets your needs. Without knowing your exact situation, I would caution that it is easy to fall into the trap of overengineering these things: an entire modern datacenter going down is quite uncommon, and the associated disaster (likely involving loss of power) may well take other parts of your service offline anyway.

Here is the RabbitMQ team blog post I pulled the above quotes from. It is quite detailed and a good starting point for hardening your RabbitMQ infrastructure.
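As a concrete starting point for the clustering advice above, here is a minimal sketch, assuming Python with the pika client, hypothetical node names, and RabbitMQ 3.8+ (quorum queues); on older versions the equivalent is a classic mirrored queue configured through a policy:

```python
import pika

# Hypothetical three-node RabbitMQ cluster; the client tries each node
# in turn, so no single broker is a point of failure for connecting.
CLUSTER = [
    pika.ConnectionParameters(host)
    for host in ("rabbit-1.example.com",
                 "rabbit-2.example.com",
                 "rabbit-3.example.com")
]

connection = pika.BlockingConnection(CLUSTER)
channel = connection.channel()

# 'x-queue-type': 'quorum' asks for a Raft-replicated queue, so the
# queue's contents survive the loss of any single node.
channel.queue_declare(
    queue="raw_device_messages",   # hypothetical name for R1's queue
    durable=True,
    arguments={"x-queue-type": "quorum"},
)
```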

Zach
  • 161
  • 4

Use SQLite as your persisted outbound queue.

You already have a conceptual outbound queue on each device; it is currently implemented in memory (not persisted).

SQLite is very, very reliable. It has been battle-tested on billions of devices, and the software has achieved safety-critical certifications. It can also run on practically any device. If you have limited disk space, you can quite easily run an SQL query to implement whatever data-trimming policy you design.

The internet itself is quite unreliable, particularly from the perspective of IoT devices. So the only way to avoid the biggest point of failure (prolonged internet unavailability) is to persist the outbound queue.
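A minimal sketch of such a persisted outbound queue, assuming Python's built-in sqlite3 module; `send` stands in for whatever uplink call the device actually makes:

```python
import sqlite3

db = sqlite3.connect("outbound.db")
db.execute("""CREATE TABLE IF NOT EXISTS outbox (
                  id INTEGER PRIMARY KEY AUTOINCREMENT,
                  created_at TEXT DEFAULT CURRENT_TIMESTAMP,
                  payload BLOB NOT NULL)""")

def enqueue(payload: bytes) -> None:
    # Called for every reading; the commit is what makes it durable.
    with db:
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))

def flush(send) -> None:
    # Drain the outbox oldest-first. A row is deleted only after `send`
    # returns, so a crash or network failure mid-flush just means the
    # remaining rows are retried on the next flush.
    rows = db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        send(payload)               # hypothetical uplink; raises on failure
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))

def trim(max_rows: int = 100_000) -> None:
    # Example data-trimming policy for limited disk: keep the newest rows.
    with db:
        db.execute("""DELETE FROM outbox WHERE id NOT IN
                      (SELECT id FROM outbox ORDER BY id DESC LIMIT ?)""",
                   (max_rows,))
```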


Even better: if you can connect directly to the devices and query the SQLite database, then, because the queue is already persisted on the device, there is no need for an intermediate queue (RabbitMQ) at all. (My company has plans to do this for our software, and to open-source the tooling.)

Kind Contributor
  • 802
  • 4
  • 12