Microservices without data duplication

Question

I’m finding it hard to avoid data duplication or a shared database for even the simplest microservices design, which makes me think I’m missing something. Here’s a basic example of the problem I’m facing. Assume someone is using a web application to manage an inventory. They would need two services: an inventory service that manages the items and the quantity in stock, and a users service that manages the users’ data. If we want an audit of who stocked the inventory, we could store the user’s ID in the inventory service’s database as a "last stocked by" value.

Using the application we may want to see all the items that are running low, and a list of who stocked them last time so we can ask them to restock it again. Using the architecture described above, a request would be made to the inventory service to retrieve the item details of all items where the quantity is less than 5. This would return a list including the user IDs. Then a separate request would be made to the users service to get the user name and contact details for the list of user IDs obtained from the inventory service.

This seems awfully inefficient, and it doesn’t take many more services before we’re making multiple requests to different services’ APIs, which in turn make multiple database queries. An alternative is to replicate the users’ details in the inventory data. When a user changes their contact details, we would then need to replicate the change through all other services. But this doesn’t seem to fit with the bounded context idea of microservices. We could also use a single database and share it between the different services, but then we would have all the problems of an integration database.

What’s the correct/best way to implement this?

Welcome to the paradox of micro-services. That which would appear to make things simpler can actually make things more complex. — Robert Harvey, Mar 19 '18 at 15:04
The "correct" way is the same as it's always been: figure out a way of doing things that best suits your specific objectives. — Robert Harvey, Mar 19 '18 at 15:05
@RobertHarvey That's always the case but I'm trying to understand the textbook microservices way. Once I understand how it should work in an ideal world I'll happily change it to fit my use case. — Geraint Anderson, Mar 19 '18 at 15:29
But you're framing your question in terms of efficiency, which is a non-functional software requirement. The way you solve the efficiency problem is by asking the database directly. — Robert Harvey, Mar 19 '18 at 15:36
I was about to write a question exactly like yours. I still don't see advantages in MSA for reasonably simple web applications. I think in many cases modularity could be achieved without making things so complex. — Glasnhost, Oct 26 '18 at 17:49

Maurits Moeys · Answer 1 · 2019-01-08T20:41:52.483

31

I’m finding it hard to avoid data duplication....

According to the Microsoft ebook on microservice architecture, there is nothing wrong with data duplication. Basically, duplicating data increases the decoupling between the services and therefore strengthens their roles as a single authority. A relevant passage:

And finally (and this is where most of the issues arise when building microservices), if your initial microservice needs data that's originally owned by other microservices, do not rely on making synchronous requests for that data. Instead, replicate or propagate that data (only the attributes you need) into the initial service's database by using eventual consistency (typically by using integration events...
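
As a rough illustration of what that passage describes, here is a minimal in-memory sketch of the inventory service keeping its own copy of a few user attributes, updated from integration events. Everything in it is an assumption made for the example (the `UserEvent` shape, the `InventoryUserReplica` class, the field names); a real system would receive the events from a message broker and persist the copy in the inventory service's own database.

```python
from dataclasses import dataclass


@dataclass
class UserEvent:
    """Hypothetical integration event published by the users service."""
    kind: str                # "created", "updated" or "deleted"
    user_id: int
    name: str = ""
    email: str = ""
    password_hash: str = ""  # carried by the event, deliberately not copied


class InventoryUserReplica:
    """Local copy of only the user attributes the inventory service needs."""

    def __init__(self):
        self._users = {}

    def handle(self, event: UserEvent) -> None:
        if event.kind in ("created", "updated"):
            # Store name and email only; everything else stays with the owner.
            self._users[event.user_id] = {"name": event.name, "email": event.email}
        elif event.kind == "deleted":
            self._users.pop(event.user_id, None)

    def contact_for(self, user_id: int):
        """Answer from local data, with no call to the users service."""
        return self._users.get(user_id)


replica = InventoryUserReplica()
replica.handle(UserEvent("created", 7, name="Ann", email="ann@example.com",
                         password_hash="not-copied"))
replica.handle(UserEvent("updated", 7, name="Ann", email="ann@new.example.com"))
print(replica.contact_for(7))
```

The copy is only eventually consistent: between the moment the users service changes a record and the moment the event is handled, the inventory service still answers with the old values.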

edited Jan 08 '19 at 20:41

answered Jan 08 '19 at 18:03

Maurits Moeys

I completely disagree. It makes it harder to maintain. It makes you implement transactions among microservices when something has to be added, updated or removed. In case you want to prevent a single point of failure you can use request or any other type of caching. – Alan Sereb Sep 20 '19 at 21:04
@AlanSereb It's harder to maintain, but the point is that sometimes you have no other choice. For example, what if you need a FK between objects living in two databases? The only way to ensure consistency when querying a local DB is to replicate the data. Take a look at: https://stackoverflow.com/a/4452586/2255491 – David Dahan Oct 26 '19 at 14:59
I agree. Another great approach is to take the event sourcing route. And have all mutations be executed via the event pipeline – Alan Sereb Oct 27 '19 at 02:55
@AlanSereb Can you design your microservices so that transactions aren't needed? Example: If the products service adds a new product, the search service has no choice but to replicate it in the search index. There is no need to ask the search service whether it's okay to add a new product. If the search service has a problem, that's a bug in the search service. – user253751 Aug 28 '20 at 11:13
@AlanSereb, the keyword(s) in the quoted Microsoft's ebook is "integration events". That, basically, is event sourcing. https://devblogs.microsoft.com/cesardelatorre/domain-events-vs-integration-events-in-domain-driven-design-and-microservices-architectures/ – aiapatag Nov 11 '20 at 16:52
@Maurits Moeys, what if we have millions of items from another microservice that we need to work with? It'd be really time-consuming to update them all the time if you decided to store them, instead of requesting them on demand. – Greg Eremeev May 24 '21 at 21:19

candied_orange · Accepted Answer · 2018-03-19T16:16:10.980

17

I completely missed where you're being required to duplicate.

A central principle of micro services is for the service to be the single authority. That means inventory and user management can be completely separate. I'd design the user management so that it doesn't even know the inventory system exists.

But I'd design the inventory system so that it never stores anything about users other than a user ID. That takes care of your problem of propagating user info changes.

As for things that need both inventory info and user info, such as logs, audits, and printouts: they don't get updated as info changes. They are a record of what was. Again, you don't propagate change.

So in every case, when you want the latest user info you ask the user info service.

edited Mar 19 '18 at 16:16

answered Mar 19 '18 at 15:37

candied_orange

@Geraint: Can you be more specific about what kind of duplication is occurring in your system? – Robert Harvey Mar 19 '18 at 15:38
Thanks. The duplication referred to copying the users contact details to the inventory service but you have addressed that (i.e. it's not required). It seems counter-intuitive to move from a single relational database where I could get the inventory data and the user data with a join to making two distinct API calls where the second can't begin until the first has returned the results. But I guess that's part of the evaluation as to whether I use microservices or something else. – Geraint Anderson Mar 19 '18 at 16:03
It's the same trick the DB would use if it managed both. You don't copy user info into the inventory table. You give it a foreign key. The user ID is doing the same job across services. Just make it unique. – candied_orange Mar 19 '18 at 16:06
`It seems counter-intuitive to move from a single relational database where I could get the inventory data and the user data with a join` Keep in mind that "ideally" there's one store per service (or more!). So there's no such thing as a "join" across "boundaries". The reason is simple: a shared DB creates coupling among services. Unlike what @CandiedOrange suggests, I think we can duplicate a minimum of data from one service to another. I'm referring to data which is unlikely to change. If this duplication improves efficiency and performance (and both are required), the "pros" would probably offset the "cons" – Laiv Mar 19 '18 at 16:17
@GeraintAnderson I mean, if you need efficiency (which is by definition a non-functional requirement), there are ways to do that. I.e. request pages of data from the Inventory Service (like 10 elements), take each page and use that page to request data from the User Service, and aggregate at the end. That way you keep your boundaries while leveraging the parallelism of independent services. Even then, don't bother until you've identified it as a real bottleneck of the application that must be resolved - waiting an extra 1/2 second on a 1-second overnight job doesn't matter to anyone. – Delioth Jan 08 '19 at 21:14
What if I need to replicate the behavior of JOIN+WHERE? Say I want to show all inventories managed by users from a specific country. If there is no data duplication, I need to call the users API to get those users, then extract the inventories for them. And what if I have thousands of users? – Igor Apr 24 '20 at 00:53
@igor Hmm, paging? – candied_orange Apr 24 '20 at 00:59
@candied_orange but you will paginate users, not inventories. And sorting can be by some column from inventory. So pagination can work only if you sort by related table – Igor Apr 25 '20 at 01:44
@Igor ah then you send to the inventory service the userIds, the inventory column to sort by, and the page number of that sorted list you wish to receive. – candied_orange Apr 25 '20 at 02:09
@candied_orange, Yes, but this is where I started from. It will work well when I have tens or hundreds of users. But what if there are thousands? – Igor Apr 30 '20 at 10:58
@Igor if there is a reason it won’t scale with the right resources behind it I don’t see it. – candied_orange Apr 30 '20 at 11:02
@candied_orange you said "send to the inventory service the userIds". So my app first needs to make an API request to get the userIds assigned to the selected area, then make an API request to get the inventories. The 2nd request can be paginated, but the first cannot. Imagine I have thousands of users. Is it OK to send that many IDs with every request? – Igor May 01 '20 at 13:54
@Igor no idea why users can’t be paged. Maybe you’re thinking of a multi element sorting situation. – candied_orange May 02 '20 at 01:46
@candied_orange I don't understand how you can paginate users. You need to show inventories (paginated), but the first 10 users may have 0 elements and the second 10 users may have 500. The only way I see here is some custom "load more" logic on the frontend that paginates users, then all inventory pages for that page of users, then the next page of users, and so on. But you can't count the total number of elements, and it looks too complicated compared to data duplication and sync. – Igor May 04 '20 at 07:36
@igor remember why you're paging. You didn't want to get everything at once. So you decide how much you can take at once. You don't need to know how much there is. Just how much you can take. Keep taking as you can until you've had all you need. Works on the user service and the inventory service. – candied_orange May 04 '20 at 07:40
@candied_orange Sorry to revive this, but Igor is completely correct, and people viewing these comments should know that. You cannot paginate by users in this situation. Pagination has to occur after sorting, otherwise the query result is incorrect. Yet the user query must occur before the inventory query to perform the filter-by-user, and the sorting must occur in the inventory query to perform the sort-by-inventory. Thus the paging must occur in the inventory query, after sorting. This would, then, require sending the list of potentially tens of thousands of user IDs to the inventory query. – Alexander Guyer Dec 15 '20 at 17:08
In other words, @Igor has, in fact, come up with a very reasonable type of query which completely defeats the shared-nothing datastore-per-service paradigm. This is exactly the problem with such bold "academic" paradigms; they fail to recognize the simple practical problems which they cannot solve. That is also why Maurits Moeys's answer is the better one. This answer overgeneralizes and states "in every case" you should ask the user service. But as proven, sometimes data duplication is necessary if you want any sort of reasonable performance, and that shouldn't prevent strong consistency. – Alexander Guyer Dec 15 '20 at 17:15
@AlexanderGuyer: Your (and Igor's) issue is one of performance, but **microservices do not focus on performance**. Microservices tend to trade away some (per-request) performance in return for simpler individual codebases, independent service lifecycles, and the ability to scale your services with less hassle (which yields total load performance improvements, but not per-request performance). What you're doing here is judging a fish by its ability to climb a tree. If performance is the main priority above all else, then microservices aren't for you. – Flater Apr 14 '22 at 11:33

Odalrick · Answer 3 · 2019-03-06T07:26:37.270

a request would be made to the inventory service to retrieve the item details of all items where the quantity is less than 5. This would return a list including the user IDs. Then a separate request would be made to the users service to get the user name and contact details for the list of user IDs obtained from the inventory service.

Indeed, yes.

Granted, in a monolith you could have an Inventory-model that you query for the relevant items, feed that into a User-model and get the same data.

Or you could take it further: if you have them in the same relational database, you write some SQL, the database takes the inventory table and the user table, does some magic, and you get the data you are after.

Regardless of how you do it, somewhere there will be code that essentially fetches a list of user ids from the inventory system, feeds them into the user system and compiles a list of data.
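
A minimal sketch of that "fetch ids, feed them in, compile a list" code, assuming plain Python functions as stand-ins for the two service calls. The data, the function names and the field names are all invented for illustration; in a real deployment `low_stock_items` and `users_by_ids` would each be one HTTP request.

```python
# In-memory stand-ins for the two services' data (made up for the example).
INVENTORY = [
    {"item": "bolts", "quantity": 3, "last_stocked_by": 1},
    {"item": "nuts", "quantity": 40, "last_stocked_by": 2},
    {"item": "washers", "quantity": 1, "last_stocked_by": 2},
    {"item": "screws", "quantity": 2, "last_stocked_by": 1},
]
USERS = {
    1: {"name": "Ann", "email": "ann@example.com"},
    2: {"name": "Bob", "email": "bob@example.com"},
}


def low_stock_items(threshold):
    """Stand-in for one request to the inventory service."""
    return [row for row in INVENTORY if row["quantity"] < threshold]


def users_by_ids(user_ids):
    """Stand-in for one batched request to the users service."""
    return {uid: USERS[uid] for uid in user_ids if uid in USERS}


def low_stock_report(threshold=5):
    items = low_stock_items(threshold)
    # De-duplicate the ids so each user is requested once, in a single call.
    user_ids = {row["last_stocked_by"] for row in items}
    users = users_by_ids(user_ids)
    # The "join": attach user details to each inventory row by user id.
    return [
        {
            "item": row["item"],
            "quantity": row["quantity"],
            "stocked_by": users.get(row["last_stocked_by"], {}).get("name"),
            "email": users.get(row["last_stocked_by"], {}).get("email"),
        }
        for row in items
    ]


for line in low_stock_report():
    print(line)
```

Note that this is two requests in total regardless of how many items are low, because the user ids are sent as one batch rather than one lookup per item.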

The question you need to answer is about performance and maintenance and other "soft" qualities.

The main benefit of microservices is scaling. If you have ten thousand users on one machine and it is a bit sluggish, you can add another machine and the system becomes twice as fast. Add eight more and it's ten times as fast. (Linear scaling is probably optimistic, but it is the ideal and not that unreasonable to hope for.)

And this is per service. If the inventory system is the bottleneck (it is used for more than reports about users), you can add more machines to just that service. The machines can also be specialised; this service needs a lot of memory, that service does heavy calculations and needs more CPU.

If you don't need the scaling, there is one other benefit of microservices: they are modular. Of course, monolithic apps can also be modular, and you have a normalised database and... but in practice the walls between modules are like glass walls in the best case, and lines in the sand in the worst. Microservices are separated by solid steel.

If your user system literally catches fire, that won't affect your inventory system in the slightest. You won't be able to print pretty reports about who stocked what, but customers will be able to place orders safe in the knowledge that the stocked items are there.

And you don't duplicate data in microservices, any more than you do in a relational database(*). In a relational database you can do a join, and the equivalent is to merge the lists in code as described.

You could also add a view; the equivalent is to add a new service that does the merge for you. That would result in three requests: one to the new service, which then makes the original two. Relational databases have fancy stuff that optimises views; with services, that has to be implemented at the service level. You don't get it "for free".

Caching is different from data duplication in that if two values mismatch you know which one is wrong. It is often used in microservices to bring availability up at the expense of consistency (CAP theorem). Since relational databases completely butcher availability on the altar of consistency it is less common in them. I'd say there is nothing inherent about microservices that makes caching easier, but in practice caching is a primary concern and that makes caching easier in microservices.
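
To make the "you know which one is wrong" point concrete, here is a small sketch of a cache used for availability rather than speed. It is only an illustration under assumptions: `CachedUserLookup`, `UserServiceDown` and the `fetch` callable are made-up names, and the users service is simulated with a dictionary and an up/down flag.

```python
class UserServiceDown(Exception):
    """Raised by the (simulated) users service call when it is unreachable."""


class CachedUserLookup:
    """Ask the users service first; fall back to the last known answer."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: user_id -> dict, may raise UserServiceDown
        self._cache = {}

    def get(self, user_id):
        try:
            fresh = self._fetch(user_id)
        except UserServiceDown:
            # Availability over consistency: serve a possibly stale copy.
            return self._cache.get(user_id)
        # On any mismatch the cache is the wrong one, so just overwrite it.
        self._cache[user_id] = fresh
        return fresh


# Simulated users service: the single place where user data is updated.
source = {1: {"name": "Ann", "phone": "555-0100"}}
state = {"up": True}


def fetch(user_id):
    if not state["up"]:
        raise UserServiceDown()
    return dict(source[user_id])


lookup = CachedUserLookup(fetch)
print(lookup.get(1))              # fresh answer, also stored in the cache
source[1]["phone"] = "555-0199"   # the owner changes the data
state["up"] = False
print(lookup.get(1))              # service down: stale copy is served
state["up"] = True
print(lookup.get(1))              # service back: cache is overwritten
```

There is never a merge step: the cached value is only ever replaced by what the owning service says, which is the difference from having two places that can both be updated.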

(*) If it makes sense to duplicate data in a microservice swarm then it probably would make sense in the equivalent relational database too.

I really liked your answer until the "don't duplicate data in microservices" part. I think there are cases where data duplication is the right approach. It improves fault tolerance and autonomy. If the user service went down, the inventory service can still display a list of low inventory with who stocked them last. — Peter Pompeii, Feb 16 '19 at 19:35
@peterpompeii I'd call that caching, not data duplication. Data duplication is when you have two places to update for one datum; caching is when there is one place and automatic propagation to the other places. Also, I said no more than in a relational database. If it makes sense in a relational database to duplicate data, it makes sense in a microservice. I think we agree and that part could be clearer, but I only have a phone right now so won't update the text right now. — Odalrick, Feb 18 '19 at 01:00
@PeterPompeii Hope the added section about caching addresses some of your concerns. — Odalrick, Mar 06 '19 at 07:28
@Odalrick what you described sounds like data replication. Replication and caching are *both* forms of duplicating data. Replication is when a copy is guaranteed to always have all the needed data. Caching is on-demand. Caching can have a miss. Caching for availability does not make as much sense as caching for performance. TL;DR if you are storing a complete copy of something with enough consistency guarantees that you never need to check for misses, then it's not a cache. — Brandon, Apr 25 '19 at 21:04
@Brandon I see your point, but it's not what I was thinking of. I'm talking about caching the result of calculations, like map-reduce in CouchDb. You cannot recreate the full data from them. Replication is almost something you do instead of microservices; get a big fat database instead of splitting the database into less coupled services. — Odalrick, Apr 26 '19 at 06:58
Replication is definitely not an alternative to or in opposition to micro services. Unless you mean using replication *across* services, in which case, yes, that would be an anti-pattern. Using replication within a service's database to help make that service scale better or more fault tolerant can make a lot of sense. — Brandon, Apr 26 '19 at 15:06
@Brandon Exactly replication is within a service, and thus not anything to do with microservices as such; equally applicable to a monolith. It can often make sense to cache the responses of other services though; for instance aggregating statistics daily which is why I mentioned caching. — Odalrick, Apr 26 '19 at 15:18
@Brandon Another difference between replication and caching is how you know which data is wrong when there is a difference. Replication defines some rules on how to merge the data. Caching on the other hand is _always_: the cache is wrong. — Odalrick, Apr 26 '19 at 15:24
For me, the best benefits are isolating developers' tasks and code, distributing the system, and tracking service performance. — Ali.Mojtehedy, Jul 02 '21 at 21:53

score 0 · Answer 4 · answered Apr 14 '22 at 08:28

I think the inventory service does not need all of the user info. Since it does need some user data, it should consume the events (create, update, delete) from the user service and maintain only the required user data in its own database. That way there will be data duplication; however, your services won't be tightly coupled.

score 0 · Answer 5 · answered Apr 15 '22 at 20:31

This is indeed super inefficient.

Therefore splitting up your monolith into microservices shouldn’t be taken lightly.

It is always going to be a trade-off so you have to make sure the trade-off is worth it.

On a very large project that has been running for 8 years, I’ve repeatedly found myself splitting off a microservice and then figuring out later that I should actually merge it with other microservices for better maintenance and performance.

More microservices definitely doesn’t mean, by definition, that the system will be easier to scale or maintain.
