
I'm designing the architecture of a platform to introduce a message broker in an existing data collection web application.

This web application is currently used to upload data from Excel files, which are then validated, processed and inserted into a database. Every time a new set of data is inserted, we should run one or more scripts, depending on the nature of the data, to perform a deep analysis of it. Think of these scripts as microservices.

These scripts are written by data scientists or other programmers, so they may not use the same technology as the main application, and some of them are hosted on another machine. Some of these scripts can run for 2 to 3 hours to produce their results.

For these reasons, I think that introducing a message broker such as RabbitMQ, SQS or Kafka could add value by letting these applications talk to each other and synchronize their running processes.

The messages should be two-way: the main application should notify the "scripts" that there is new data available to be processed, and the "scripts" should notify the main application when they have finished (so I can also run sequential operations).

I foresee around 200 messages exchanged per day, with a peak of 400 messages/day, split across ~10 different queues/topics.

The software architect objects to having only one message broker. He says I need one broker for each direction: one for the messages produced by the main application and another for the messages produced by the "scripts" and microservices. In addition, he wants to put a REST API in front of the queue to manage message inserts, so that the "scripts"/microservices call an API rather than the message broker directly.

I think that having two message brokers is really a waste of resources considering the amount of data that I have to manage. What do you think?

In addition, masking AMQP with HTTP makes no sense to me, as we get the worst of both. What do you think?

Giox
    Lots of questions. 1. Two brokers sounds wrong, but the term can mean more than one thing; more detail is needed. 2. A REST service to mask AMQP is pretty standard, as some languages/systems don't have AMQP clients, but everything can do HTTP these days. – Ewan Jan 30 '23 at 17:56
    Overuse of bold style does make your question less readable, did you know? – Doc Brown Jan 30 '23 at 18:46
    ... and questions which end with "what do you think?" are often too opinionated and/or not focussed enough to be a good fit for this site. – Doc Brown Jan 30 '23 at 18:57

2 Answers


If you adopt Kafka, you'll find its ability to replay the log, during development or after a production incident, is very useful. If you instead choose RabbitMQ or SQS, pay careful attention to message reordering and duplicate delivery.

having two message brokers is really a waste of resources

Agreed, given your message volumes. But the bigger issue is gratuitous complexity. It affects the cost of training staff, and of developing, documenting, and debugging a distributed system.

At some point you will see a server which is running message broker software go through an unscheduled reboot, perhaps due to a building power failure. With all messages (main -> scripts and scripts -> main) appearing in the same distributed persistent Kafka log, it's pretty easy to reason about tasks successfully completed and ones that need to be retried. With messages split between disjoint message brokers it becomes much harder to work out what happened and what should be done now.
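
To illustrate that reasoning, here is a minimal sketch that scans one combined log and lists the tasks that were started but never reported done. The `Event` type, its field names, and the `"start"`/`"done"` message kinds are assumptions made up for this example. They are not part of Kafka or of your application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One entry in the shared log. All field names are illustrative."""
    offset: int    # position in the log
    kind: str      # "start" (main -> scripts) or "done" (scripts -> main)
    task_id: str   # hypothetical identifier for one analysis request


def tasks_to_retry(log: list[Event]) -> list[str]:
    """Return ids of tasks that were started but never reported done."""
    started: dict[str, int] = {}
    finished: set[str] = set()
    for event in sorted(log, key=lambda e: e.offset):
        if event.kind == "start":
            # Keep the first start only, so a replayed start is not counted twice.
            started.setdefault(event.task_id, event.offset)
        elif event.kind == "done":
            finished.add(event.task_id)
    return [task_id for task_id in started if task_id not in finished]


log = [
    Event(0, "start", "dataset-17"),
    Event(1, "start", "dataset-18"),
    Event(2, "done", "dataset-17"),
    Event(3, "start", "dataset-19"),
]
print(tasks_to_retry(log))
```

With two separate brokers, the "start" and "done" events would live in two logs with unrelated offsets, so this single pass would need an extra merge step and some shared notion of ordering.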


script could run for 2-3 hrs to produce results

You will see script restarts and related server reboots. Recommend you arrange for each script to follow this protocol:

  1. receive "start work on X" message
  2. check central RDBMS to verify this is a new "X" request
  3. record a "task in progress" row in the DB, perhaps as the tuple (host, script, X, timestamp, status)
  4. think about X and persist the results, perhaps in S3
  5. update that row's status="done", maybe with an S3 pathname

If servers reboot during the multi-hour processing time, at least you'll know what the state of the world is.

What to do if step 2 notices a replayed request is up to you to define. Maybe the script checkpointed partial results. Maybe DB table cleanup must happen before restarting. Maybe it's acceptable to have multiple servers working on the same analysis, and then there's some resolution protocol for choosing which one wins. Write down the details, and be sure to test that they're implemented.
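
The five steps above could look roughly like this. It is a sketch under assumptions: SQLite stands in for your central RDBMS, the `task` table and the `handle_start_message` function are hypothetical names, and the replay policy (skip the duplicate) is just one of the options mentioned above.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS task (
    x          TEXT PRIMARY KEY,   -- the request identifier "X"
    host       TEXT NOT NULL,
    script     TEXT NOT NULL,
    started_at REAL NOT NULL,
    status     TEXT NOT NULL,      -- 'in progress' or 'done'
    result     TEXT                -- e.g. an S3 pathname once finished
)
"""


def handle_start_message(db, x, host, script, analyse):
    """Run steps 2-5 for one "start work on X" message.

    Returns True if the work ran, False if X was already recorded.
    """
    try:
        # Steps 2 and 3 combined: the primary key makes the INSERT fail when
        # a row for X already exists, so check and claim happen atomically.
        with db:
            db.execute(
                "INSERT INTO task (x, host, script, started_at, status) "
                "VALUES (?, ?, ?, ?, 'in progress')",
                (x, host, script, time.time()),
            )
    except sqlite3.IntegrityError:
        return False  # replayed request; skipping it is an assumed policy

    result_path = analyse(x)  # step 4: the long-running analysis

    with db:  # step 5: record completion and where the results live
        db.execute(
            "UPDATE task SET status = 'done', result = ? WHERE x = ?",
            (result_path, x),
        )
    return True


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)


def fake_analyse(x):
    # Placeholder for the real analysis; the bucket name is made up.
    return f"s3://example-bucket/results/{x}.parquet"


print(handle_start_message(db, "X-1", "host-a", "deep_analysis", fake_analyse))
print(handle_start_message(db, "X-1", "host-b", "deep_analysis", fake_analyse))
```

A row left at 'in progress' after a reboot tells you exactly which analysis was interrupted, which is the point of step 3.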


Recommend the folks writing your analysis scripts adopt an approach like MLflow. But that's orthogonal to any message broker details.

J_H

Reading through your post, my first feeling is that you seem to want to use this technology very much, but don't necessarily have a good use case. For example, Azure Service Bus has a default timeout of 30 seconds, sometimes increased to 5 minutes. You're talking about several hours.

  • Kafka is extremely complex; it takes a lot of effort to learn, set up and maintain.

  • Long running jobs are not necessarily a good match for a message broker, especially one with a very low number of messages per day. Ideally you want to slice up jobs into a large number of small tasks that can be processed in parallel, or retried cheaply.

  • You want two-way communication, and while MQs are capable of this, ideally you want messages to flow with minimal requirement for any kind of syncing or cross talk. Ideally, you want the publisher to 'fire and forget'.
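
As a rough illustration of the second point, slicing a job into small tasks means a failure only costs one small retry rather than hours of work. The function names and the retry limit below are assumptions for the example, not something from your system.

```python
def split_into_chunks(rows, chunk_size):
    """Slice one large job into small, independently retryable pieces."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def run_with_retries(chunks, process, max_attempts=3):
    """Process each chunk, retrying only the chunk that failed."""
    results = []
    for chunk in chunks:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(process(chunk))
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up on this chunk after the last attempt


    return results


calls = {"n": 0}


def flaky_sum(chunk):
    # Simulates one transient failure on the second call.
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated transient failure")
    return sum(chunk)


chunks = split_into_chunks(list(range(10)), 4)
print(run_with_retries(chunks, flaky_sum))
```

Whether your 2 to 3 hour analyses can actually be split this way depends on the scripts themselves.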

My advice would be to carefully consider your current architecture, and look at pain points, and then see if some simpler queuing technology could help you. E.g. could you just implement a single, non-brokered queue for some of your jobs?

That would be a lot easier to do as a proof of concept. It would retain many of the advantages you want from a queue, but with lower complexity, a faster feedback cycle, and lower cost, so you can see whether this is really a path you want to follow.
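
One possible reading of a "non-brokered queue" is simply a jobs table in the database you already have, polled by the scripts. The sketch below assumes that reading. SQLite stands in for your real database, and the `job` table and the three function names are made up for illustration.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Create the jobs table if needed and return the connection."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS job ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return db


def enqueue(db, payload):
    """Main application side: add a job and return its id."""
    with db:
        cur = db.execute("INSERT INTO job (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next(db):
    """Script side: claim the oldest pending job, or return None."""
    with db:
        row = db.execute(
            "SELECT id, payload FROM job WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard means only one worker can win the claim.
        cur = db.execute(
            "UPDATE job SET status = 'running' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        if cur.rowcount != 1:
            return None  # another worker claimed it first; poll again later
    return row


def mark_done(db, job_id):
    """Script side: report completion so the main application can see it."""
    with db:
        db.execute("UPDATE job SET status = 'done' WHERE id = ?", (job_id,))


db = open_queue()
enqueue(db, "analyse upload 42")
job = claim_next(db)
print(job)
mark_done(db, job[0])
print(claim_next(db))
```

The status column gives you the "finished" notification for free: the main application can check for rows marked 'done' without a second channel.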

Hope this helps.

Phil S