
If you can have one event with multiple "things to process" or one event per thing to process, what is considered the best practice? Bearing in mind, the former option may involve persisting the payload in a DB (because it's too big), whilst the latter may mean 1000s more events, but I'm unsure what the real disadvantages of having 1000s are.

This is in the context of AWS using SQS queues, EventBridge, and Lambdas (possibly with Step Functions depending on the approach). These physical resources place constraints on the event size, and may also affect the best choice in other ways.

Example: A service called 'File Splitter' takes large files and splits them into paragraphs. Each paragraph needs to be read by a downstream service called 'Paragraph Processor' that will do something with it (e.g., add up the words).

Options:

  1. The File Splitter service persists the paragraphs in a DB and then publishes a single 'ProcessParagraphs' event that includes a key to the paragraphs in the DB. Then, the Paragraph Processor handles that event by reaching out to the File Splitter and asking for the paragraphs from the DB. Or

  2. The File Splitter service publishes one 'ProcessParagraph' (singular) event per paragraph, each containing the paragraph to be processed. Each event is under the 256 KB SQS message size limit. The Paragraph Processor then handles each event individually, possibly in batches of up to 10,000 as allowed when Lambda consumes from SQS. A rough sketch of both options follows below.
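
To make the two options concrete, here is a rough boto3 sketch. The queue URL and the `persist_paragraphs` helper are placeholders rather than real infrastructure; this is only meant to show the shape of each option.

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/paragraphs"  # placeholder


def persist_paragraphs(file_id: str, paragraphs: list[str]) -> None:
    """Stand-in for the File Splitter's DB write."""


def publish_single_event(file_id: str, paragraphs: list[str]) -> None:
    # Option 1: persist the paragraphs, then publish one event carrying only a key.
    persist_paragraphs(file_id, paragraphs)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"type": "ProcessParagraphs", "fileId": file_id}),
    )


def publish_event_per_paragraph(file_id: str, paragraphs: list[str]) -> None:
    # Option 2: one event per paragraph, payload inline.
    # SendMessageBatch accepts at most 10 entries per call.
    for start in range(0, len(paragraphs), 10):
        entries = [
            {
                "Id": str(start + offset),
                "MessageBody": json.dumps(
                    {
                        "type": "ProcessParagraph",
                        "fileId": file_id,
                        "index": start + offset,
                        "text": text,
                    }
                ),
            }
            for offset, text in enumerate(paragraphs[start : start + 10])
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
```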

  • The "best practice" would be to try both and measure/analyze either solutions for parameters that are important to you, like performance, latency, resource usage, dubugability, maintainability, extensibility, etc .. But I guess that is not what you are asking. Nor something that is always possible due to dev effort constraints. – Euphoric Apr 20 '23 at 15:03
  • I think I'm asking what could be the disadvantages of having 1000s of events instead of fewer. Because 1000s of events seems to me better: each failed event will represent one thing that failed to process, instead of multiple, and retrying that event will be retrying one thing only. Moreover, because the payload is smaller it can be included in the event, no need to cross service boundaries and pull the payload from a database (even if abstracted). – Dr. Thomas C. King Apr 20 '23 at 15:23
  • The issue with answering your question is that it is highly context-dependent. Either can work given the technologies used, business use-cases, performance expectations, etc. And even if you try to explain your context and expectations in high detail, no one here has time to investigate and experiment to give you an authoritative answer. – Euphoric Apr 20 '23 at 15:32
  • Best practices are absolutely not context-independent. Food seasoned to perfection should not be served to heart patients. The best practice on best practices is to stop quoting best practices and think about what you're doing. – candied_orange Apr 20 '23 at 16:14
  • The attitude that concerns me here is the assumption that "more events are better". Prove this before you assume it. This is a good way to form a bottleneck for no good reason. Hell, I'm still in doubt that your paragraphs should be in the DB rather than just on the file system. CLOBs aren't what DBs do best. – candied_orange Apr 20 '23 at 16:27
  • I'm not sure where there is an assumption. And it's normative, so not provable. But I gave a list of reasons why having more events seems better. This question is about the tradeoffs. Bottlenecks might be one, but I'm unclear on why because more events increases parallelisation. – Dr. Thomas C. King Apr 20 '23 at 16:30

1 Answer


Setting aside the term 'best practice', there are some considerations.

First off, when it comes to messaging, forget any ideas you might have about 'guaranteed' delivery etc. It's very common for messages to be 'lost' or mishandled in such systems. Almost every time I've worked with someone who is new to messaging, they fail to understand that getting a message from a queue is a non-safe operation. That is, unlike reading a row from a DB, reading a message from a queue/topic modifies the state of the queue/topic, at least from the consumer's perspective. If you fail to process the message successfully but commit the read, that message will be 'gone'. It's very common for this to be the cause of bugs.
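
To make that concrete, here is a minimal boto3 polling sketch; the queue URL and `handle_paragraph` are placeholders. The point is that receiving a message only hides it for the visibility timeout; it is the explicit delete that 'commits' the read, so it should only happen after processing succeeds.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/paragraphs"  # placeholder


def handle_paragraph(body: str) -> None:
    """Stand-in for the real processing, e.g. adding up the words."""
    print(len(body.split()))


def poll_once() -> None:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        try:
            handle_paragraph(msg["Body"])
        except Exception:
            # Do NOT delete on failure: once the visibility timeout expires the
            # message reappears and, if configured, eventually lands in a DLQ.
            continue
        # Only a successful handler 'commits' the read by deleting the message.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```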

For that reason, I have become very wary of using queues as the primary means of tracking work to be done, and using them as the primary storage mechanism for that work is just a terrible idea for anything that matters.

The fact that you are using a DB for storage means that the latter is not a relevant issue here. But the former is a concern: you should track the processing status of each paragraph in the DB, if you are not already doing so.

This brings me to the question of your DB design. Is there a reason you don't store the paragraphs independently in the DB? That is, why not split the paragraphs in the DB and store them as separate rows? That would allow you to track their status independently and in the case of a failure, allow you to only reprocess the failed rows. It also makes it much easier to track the processing in a distributed design.
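
For illustration only, this is roughly what I mean, sketched with sqlite3 so it runs anywhere; the table and column names are made up and the real schema is up to you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real DB
conn.execute(
    """
    CREATE TABLE paragraph (
        file_id TEXT NOT NULL,
        idx     INTEGER NOT NULL,
        content TEXT NOT NULL,
        status  TEXT NOT NULL DEFAULT 'PENDING',  -- PENDING / DONE / FAILED
        PRIMARY KEY (file_id, idx)
    )
    """
)


def store_paragraphs(file_id: str, paragraphs: list[str]) -> None:
    conn.executemany(
        "INSERT INTO paragraph (file_id, idx, content) VALUES (?, ?, ?)",
        [(file_id, i, p) for i, p in enumerate(paragraphs)],
    )


def mark_done(file_id: str, idx: int) -> None:
    conn.execute(
        "UPDATE paragraph SET status = 'DONE' WHERE file_id = ? AND idx = ?",
        (file_id, idx),
    )


def unfinished(file_id: str) -> list[int]:
    # Only these rows need to be re-published/re-processed after a failure.
    rows = conn.execute(
        "SELECT idx FROM paragraph WHERE file_id = ? AND status != 'DONE'",
        (file_id,),
    )
    return [r[0] for r in rows]
```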

And that brings us to the next point: what is messaging good for, then? The main value of messaging on common contemporary (64-bit) systems is the ease of distributing work in a very flexible way. Queues work well with pull models where you can increase and decrease the number of workers as needed to keep up with load. They also handle spikes in load well, such that you can accept more load than you can process for short periods of time without failures.
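
In your setup, a Lambda consumer wired to the queue is effectively that pull model: the event source mapping does the polling and scales concurrent invocations with queue depth. Here's a sketch of a handler, assuming the mapping is configured to report partial batch failures; the message body shape matches the hypothetical per-paragraph event above.

```python
import json


def handler(event, context):
    """Lambda handler for an SQS event source with ReportBatchItemFailures enabled."""
    failures = []
    for record in event["Records"]:
        try:
            paragraph = json.loads(record["body"])
            word_count = len(paragraph["text"].split())  # the 'add up the words' example
            # ... persist/emit word_count and mark the DB row as done ...
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    # Failed messages become visible again and are retried; the rest are deleted.
    return {"batchItemFailures": failures}
```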

My recommendation would be to split the paragraphs in the DB, which would get you out of LOB territory based on your description of the paragraph size. Then I would probably publish an event per paragraph. Whether the event contains the content of the paragraph or just a reference to where it can be retrieved is a little nuanced. On one hand, it's possible that the read from the queue involves just as much system and network overhead as retrieving from a DB. The assumption that the queue brings the message content 'closer' to the client is not necessarily true, but there may be something to that in a multi-region implementation. If you have a high number of workers, you might have contention on reads from the DB, which would indicate putting the content right on the queue. You should read a little about how SQS manages messages and distributes them, and also understand what your DB can handle. Then, do some profiling and monitor the performance once you have a working solution.
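
If you do put the content on the queue, one way to handle the size question is to decide per message: inline the paragraph while it comfortably fits, and fall back to a DB reference for the rare oversized one. The threshold and the reference format below are made up for illustration.

```python
import json

# Leave headroom under the 256 KB SQS maximum (message attributes also count toward it).
INLINE_LIMIT_BYTES = 200 * 1024


def build_message(file_id: str, idx: int, text: str) -> dict:
    inline = {"type": "ProcessParagraph", "fileId": file_id, "index": idx, "text": text}
    if len(json.dumps(inline).encode("utf-8")) <= INLINE_LIMIT_BYTES:
        return inline
    # Claim-check fallback: the consumer fetches the content from the paragraph row.
    return {"type": "ProcessParagraph", "fileId": file_id, "index": idx, "textRef": "db"}
```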

JimmyJames
    One note on LOBs: this is DB dependent. If you are using Oracle, for example, I believe the limit for in-row storage is 4K bytes. If you exceed that, off-table storage will happen automatically. While it is not specific to the question, LOBs tend to be problematic. Check the documentation on your DB platform. – JimmyJames Apr 21 '23 at 14:57