Event sourcing, replaying and versioning

Question

I am designing a system that uses Event Sourcing, CQRS and microservices. I am led to understand this isn't an uncommon pattern. A key feature of the service needs to be the ability to rehydrate/restore from a system of record. Microservices will produce commands and queries on a MQ (Kafka). Other microservices will respond (events). Commands and queries will be persisted on S3 for the purpose of auditing and restoring.

The current thought process was that, for the purposes of restoring the system, we could extract the event log from S3 and simply feed it back into Kafka.

However, this fails to acknowledge changes in both producers and consumers over time. Versioning at the command/query level seems to go some way toward solving the problem but I can't wrap my head around versioning consumers such that I could enforce that when a command, during a restore, is received and processed, it's the exact same [version of the] code that's performing the processing as it was the first time the command was received.

Are there any patterns I can use to solve this? Is anyone aware of other systems that advertise this feature?

EDIT: Adding an example.

A 'buyer' sends a 'question' to a 'seller' on my auction site. The flow looks as follows:

    UI -> Web App: POST /question {:text text :to seller-id :from user-id}
    Web App -> MQ: SEND {:command send-question :args [text seller-id user-id]}
    MQ -< Audit: <command + args appended to log in S3>
    MQ -< Questions service:
      - Record question in DB
      - Email seller 'You have a question'

Now, as a result of a new business requirement, I adjust the 'Questions service' consumer, to persist a count of all unread questions. The DB schema is changed. We have had no notion of whether or not a question was read by the seller, until now. The last line becomes:

    MQ -< Questions service:
      - Record question in DB
      - Email seller 'You have a question'
      - Increment 'unread questions count'

Two commands are issued, one before the change, one after the change. The 'unread questions count' equals 1.

The system crashes. We restore by replaying the commands through the new code. At the end of the restore, our 'unread questions count' equals 2. Even though, in this contrived example, the result is not a catastrophe, the state that has been restored is not what it previously was.

The question seems to mix a couple of concerns. Event sourcing is an architectural strategy to deal with systems where a lot of changes to the underlying data are taking place. Versioning of DTOs and backups and data is another matter altogether. Event sourcing is not specifically designed to get you bare-metal restore capability - for that, you need to have a specific strategy in place. — theMayer, Feb 15 '16 at 18:49
Maybe Event upcasting can solve your problem: http://blog.trifork.com/2012/04/17/refactoring-in-an-event-sourced-world-upcasting-in-axon-2/ — Songo, Feb 15 '16 at 21:32

theMayer · Accepted Answer · 2020-06-05T13:41:17.177

First, it is important to understand and be able to leverage the difference between Commands and Events.

As this question succinctly points out, Commands are things we would like to happen, and Events are things that have already happened. A command does not necessarily result in a significant event in the system, but it usually does. For example, a send message command may be rejected, in which case no event happens (typically an error would not be considered an event in this sense, though we may still choose to log it in a diagnostic log). Now, if the send message command is accepted, the message sent event occurs, and event details could describe the sender, the receiver, and the content.
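
To make the distinction concrete, here is a minimal sketch in Python. The names `SendMessage`, `MessageSent` and `handle_send_message`, and the rejection rule, are made up for illustration and are not taken from the question. The point is only that a command may be rejected, and that only accepted commands produce an event that gets recorded.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class SendMessage:
    """Command: something we would like to happen. It may be rejected."""
    sender: str
    receiver: str
    content: str


@dataclass(frozen=True)
class MessageSent:
    """Event: something that has already happened."""
    sender: str
    receiver: str
    content: str


def handle_send_message(cmd: SendMessage, blocked: Set[str]) -> Optional[MessageSent]:
    """Return an event if the command is accepted, or None if it is rejected."""
    # Hypothetical validation rule, for illustration only.
    if not cmd.content or cmd.receiver in blocked:
        return None  # rejected: nothing happened, so there is no event to record
    return MessageSent(cmd.sender, cmd.receiver, cmd.content)


event_log: List[MessageSent] = []
commands = [
    SendMessage("buyer-1", "seller-9", "Is this still available?"),
    SendMessage("buyer-2", "seller-7", ""),  # empty message, will be rejected
]
for cmd in commands:
    event = handle_send_message(cmd, blocked=set())
    if event is not None:
        event_log.append(event)  # only events go into the system of record
```

Here two commands are issued but only one event ends up in `event_log`; the rejected command could still be written to a diagnostic or audit log if desired.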

When we talk about the system state, we are actually discussing not a culmination of commands, but of events. Only events reflect changes of state in the system. To draw from a life example, suppose I go to the local Publix supermarket and buy a Florida lottery ticket. The command was "Buy Ticket" and the event was "Ticket issued." My next command then is to the lottery to draw my numbers for the PowerBall. The lottery is going to ignore my command (but I have no knowledge), and the event "PowerBall numbers chosen" takes place irrespective of my wishes. If my numbers match, the event "Jackpot won" happens to me (and I think my command was heard). If not, I realize my command was ignored.

From a historical perspective, the lottery is only interested in a subset of events. The lottery only cares that (a) a ticket was issued, (b) the numbers were chosen, and (c) the jackpot was won. Those are the items of interest. The act of purchasing the ticket, wanting to win, etc. are all irrelevant, as is what I do with my ticket after I lose. While the real world does change for mundane events, we only need to record those events which are significant to our system.

In theory, under an event-sourcing technique, a stream of events may be replayed from the beginning of time to arrive at the current state. This relies upon the assumption that the underlying system conditions are constant and deterministic. However, these assumptions are not valid in many systems. The data associated with an event, as well as the types of events we are interested in, may change as our computer software evolves. In addition, it can be computationally expensive to re-compute the current state in response to every query. For this reason, snapshots of the system state are often taken to represent known points in time, and only the events recorded after the snapshot need to be applied on top of it.
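
As a rough sketch of the snapshot idea, reusing the lottery example: the event shape `(sequence number, kind)`, the `Snapshot` class and the `restore` function below are assumptions made for illustration, not a prescribed design. Restoring from a snapshot should give the same state as a full replay, provided the snapshot was taken correctly.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# (sequence number, event kind); a hypothetical, simplified event shape
Event = Tuple[int, str]


@dataclass(frozen=True)
class Snapshot:
    last_seq: int        # sequence number of the last event already folded in
    tickets_issued: int  # state as of last_seq


def restore(snapshot: Snapshot, events: Iterable[Event]) -> Snapshot:
    """Start from the snapshot and apply only the events recorded after it."""
    last_seq, tickets = snapshot.last_seq, snapshot.tickets_issued
    for seq, kind in events:
        if seq <= snapshot.last_seq:
            continue  # already reflected in the snapshot
        if kind == "ticket_issued":
            tickets += 1
        last_seq = seq
    return Snapshot(last_seq, tickets)


log: List[Event] = [
    (1, "ticket_issued"),
    (2, "ticket_issued"),
    (3, "numbers_chosen"),
    (4, "ticket_issued"),
]

from_scratch = restore(Snapshot(0, 0), log)   # full replay from the beginning
from_snapshot = restore(Snapshot(2, 2), log)  # snapshot taken after event 2
```

Both calls arrive at the same state, but the second one only has to process the events after sequence number 2.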

While it is still possible to replay an event stream across multiple versions, the amount of human effort involved in doing so is likely to be cost-prohibitive. Unless there is a justifiable reason to design that capability into the system, you are better off building your system to utilize snapshots.

Example in Question

In the example given in the question, the architecture is not truly event-based; it is command-based. Replaying commands creates the system state. This is an anti-pattern and should be fixed. Instead, the primary events are:

Buyer asks question
Seller responds to question

Each of these events can be "replayed" to give the current state. For example, in the act of asking a question, the system behavior might be to email the seller and increment the unanswered question counter. This behavior can be changed; however, the fact that the question was asked does not. Similarly, the system might decrement the unanswered question counter when the seller responds. This behavior is changeable, but the fact that the seller responded is not.

Most event-sourcing systems would dynamically compute the count of unanswered questions by replaying the specific event stream in response to a query.
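
A minimal sketch of that computation, assuming hypothetical `QuestionAsked` and `QuestionAnswered` event types that each carry a question id and a seller id (these shapes are not given in the question):

```python
from dataclasses import dataclass
from typing import Iterable, Set, Union


@dataclass(frozen=True)
class QuestionAsked:
    question_id: str
    seller_id: str


@dataclass(frozen=True)
class QuestionAnswered:
    question_id: str
    seller_id: str


Event = Union[QuestionAsked, QuestionAnswered]


def unanswered_count(events: Iterable[Event], seller_id: str) -> int:
    """Derive the count from the event stream instead of storing a counter."""
    open_questions: Set[str] = set()
    for event in events:
        if event.seller_id != seller_id:
            continue
        if isinstance(event, QuestionAsked):
            open_questions.add(event.question_id)
        elif isinstance(event, QuestionAnswered):
            open_questions.discard(event.question_id)
    return len(open_questions)


stream = [
    QuestionAsked("q1", "seller-9"),
    QuestionAsked("q2", "seller-9"),
    QuestionAnswered("q1", "seller-9"),
    QuestionAsked("q3", "seller-7"),
]
count = unanswered_count(stream, "seller-9")
```

Because the counter is derived rather than stored, changing how it is computed later does not require touching the recorded events; the same stream is simply replayed through the new logic.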

In this lottery example, you say that "Only events can cause a change of state in the system", but if the event is "ticket issued" (and presumably the event includes some details like timestamp, buyer_id, ticket_id), how do you record the reference number of the ticket if there isn't some other system of record producing the ids? Is there a traditional CRUD system that needs to produce a ticket first before the event source can record the fact as past tense? — Homan, Jun 01 '16 at 06:40
The action of issuing the ticket *is* the event in this case. The data associated with the event is what is being described as the event in your question, which is useful but technically incorrect. Furthermore, events typically represent a [holarchy](https://en.wikipedia.org/wiki/Holarchy) of details, where they can be composed and decomposed relatively limitlessly in each direction. — theMayer, Jun 01 '16 at 16:46
I guess what I was thinking about was this: In the CRUD world, especially Rails, it's common to have auto-incrementing ids for the primary keys of tables. We create records without knowing the ids, the DB hands me back the ticket id. Now moving into Event Sourcing world, from what I've read, the event is 'realized' before it is persisted in DB, and it requires an aggregate id. So rather than getting the id back after persistence from DB it sounds like the unique id must already be known so it can be described as a whole. That seems like we should always be making uuid and not auto-ids. — Homan, Nov 21 '16 at 07:40
Well, the UUID was developed for this purpose (a single auto-counter in a database represents a single point of contention/failure and should be avoided). However, let's ask a more fundamental question - why do we need a numerical increasing integer value for every single record in a database? Is that not a totally man-made contrivance? — theMayer, Nov 23 '16 at 00:50

score 5 · Answer 2 · answered Feb 15 '16 at 18:24

Commands and queries will be persisted on S3 for purpose of auditing and restoring.

For auditing, sure. For restoring? That's weird, and likely to cause you headaches.

If you are going to be event sourcing, you want to be rehydrating state from events (things that happened in the past) not commands. This saves you from most of the problems associated with changes to command implementation -- you only need to deal with the persisted state changes.

Versioning is still a concern. In particular, you want to make sure that your persisted events are as supple as possible (DTO representations, rather than direct serializations of the concepts in your domain). When reading events from the store, you have an opportunity to update them as necessary prior to applying them to the rehydrating state.
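
One way to picture "update them as necessary" (sometimes called upcasting, as a comment on the question mentions) is the sketch below. It assumes events are stored as plain dictionaries with a `type` and a `version` field, and it invents a schema change in which `to`/`from` are renamed to `seller_id`/`buyer_id`; none of these details come from the question.

```python
from typing import Callable, Dict, List, Tuple

Upcaster = Callable[[dict], dict]


def _question_asked_v1_to_v2(event: dict) -> dict:
    """Hypothetical change: v2 renames 'to'/'from' to 'seller_id'/'buyer_id'."""
    return {
        "type": event["type"],
        "version": 2,
        "text": event["text"],
        "seller_id": event["to"],
        "buyer_id": event["from"],
    }


# One upcaster per (event type, stored version); each lifts an event by one version.
UPCASTERS: Dict[Tuple[str, int], Upcaster] = {
    ("question_asked", 1): _question_asked_v1_to_v2,
}


def upcast(event: dict) -> dict:
    """Apply upcasters one version at a time until none is registered."""
    while (event["type"], event["version"]) in UPCASTERS:
        event = UPCASTERS[(event["type"], event["version"])](event)
    return event


stored: List[dict] = [
    {"type": "question_asked", "version": 1, "text": "Still available?",
     "to": "seller-9", "from": "buyer-1"},
    {"type": "question_asked", "version": 2, "text": "Ships abroad?",
     "seller_id": "seller-9", "buyer_id": "buyer-2"},
]
current = [upcast(e) for e in stored]
```

The stored events are never rewritten; old versions are translated on read, so the code that rehydrates state only has to understand the latest shape.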

answered Feb 15 '16 at 18:24

VoiceOfUnreason


Ok, so I think your advice is to worry less about restoring from commands and more about from events? For example, if I get a _command_ along the lines of "add 10 beans" then I should subsequently issue and store an _event_ which says "10 beans were added. new total: 40"? – Antony Woods Feb 16 '16 at 14:03
Yes, that's right. Each change of state in your event sourced entity is represented by one or more events; to rehydrate, you replay all of those events in order. – VoiceOfUnreason Feb 16 '16 at 14:28
I didn't accept this answer but I want to thank you for your contribution, as it was vital in amending my understanding. I chose rmayer06's answer just because it was more direct, rounded and more useful to someone just accessing this question for a quick answer. – Antony Woods Feb 17 '16 at 14:40
