3

Assume there is a large corporation with multiple software systems that all use the same CRM system via a REST API. All systems have their own databases, as does the CRM system.

Now, system A only holds references to CRM customers in the form of customer IDs, but no other customer data.
System A's database holds 10,000 customers, CRM holds 1,000,000 customers.

There's a requirement for system A to allow searching customers by name or partial name and a role held in system A. If CRM and system A were in the same database this would be trivial, a simple inner join. But since both systems have separate databases, this is not an option.

Option 1 would be that there's a REST service that accepts 10,000 customer IDs as input parameters (as well as the searched name), does the inner join against its own database, and returns the result set. In our case this is not possible as the REST service accepts at most 100 customer IDs and returns at most 300 customers.
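For illustration, Option 1 would amount to a single call like the sketch below (the endpoint and parameter names are made up); the 100-ID and 300-customer limits are what rule it out:

```python
import requests

CRM_SEARCH_URL = "https://crm.example.com/api/customers/search"  # hypothetical endpoint

def option_1_search(partial_name, system_a_customer_ids):
    # One call handing the CRM all of system A's customer IDs plus the name;
    # in reality the service accepts at most 100 IDs and returns at most 300
    # customers, so this is not possible with 10,000 customers.
    resp = requests.post(CRM_SEARCH_URL, json={
        "name": partial_name,
        "customerIds": list(system_a_customer_ids),
    })
    resp.raise_for_status()
    return resp.json()  # customers matching the name among system A's customers
```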

Option 2 would be that the CRM system keeps track of which customers are in system A and can then limit name searches to those customers.

Option 3 would be that system A replicates some of the CRM data (names) in its own database, but that comes with its own problems. How do you propagate changes in CRM to system A?

Any other options? I'm not too familiar with how MDM systems would handle this. It seems to me that with the current microservices craze similar problems are all the more prevalent.

Jan Doggen
Rubio
  • [There's a similar question on Stack Overflow](https://stackoverflow.com/questions/57738802/master-data-management-strategies-in-microservices-paradigm) – COME FROM Aug 20 '20 at 12:12

3 Answers

1

If the current data layout does not support a new requirement, you always have a number of options. Which one is most suitable in your case is something that only you (or your organization) can decide.

In general, the path of least resistance seems to be to use the APIs that are available and work around their limitations. So your system A might query the CRM system for all customer IDs matching some partial name (which could be a lot, but usually returning a huge number of IDs is easier than passing a huge number of them as parameters) and then intersect this list with the list of customers having some role in A. This could work, but is probably suboptimal. However, you would be able to implement the solution completely within the boundaries of system A.
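A rough sketch of that workaround, assuming a hypothetical CRM endpoint and a plain DB-API cursor into system A's database (all names are made up):

```python
import requests

CRM_BASE = "https://crm.example.com/api"  # hypothetical endpoint

def search_by_name_and_role(partial_name, role, cursor):
    # Ask the CRM for the IDs of all customers whose name matches the partial
    # name; the result list may be large, but that is the CRM's problem.
    resp = requests.get(f"{CRM_BASE}/customers",
                        params={"name": partial_name, "fields": "id"})
    resp.raise_for_status()
    crm_ids = {c["id"] for c in resp.json()}

    # IDs of customers that hold the given role in system A's own database.
    cursor.execute("SELECT customer_id FROM customer_roles WHERE role = %s", (role,))
    local_ids = {row[0] for row in cursor.fetchall()}

    # The intersection is the result set; names can then be fetched as needed.
    return crm_ids & local_ids
```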

A more comprehensive solution might be to create a requirement for the CRM system (which may or may not be possible at all) to store additional information such as the roles each customer has in the various other systems. This has the advantage of being queryable within the CRM system and possibly supporting additional reporting functionality that requires cross-system customer role information. However, it would also require all systems to publish their customer role information to the CRM.
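The publishing side could be as small as a call like this; the endpoint and its shape are purely illustrative:

```python
import requests

CRM_BASE = "https://crm.example.com/api"  # hypothetical endpoint

def publish_role_to_crm(customer_id, system_name, role):
    # Each system pushes its role assignments to the CRM so that they become
    # queryable there (and usable for cross-system reporting).
    resp = requests.put(
        f"{CRM_BASE}/customers/{customer_id}/external-roles/{system_name}",
        json={"role": role},
    )
    resp.raise_for_status()
```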

Your third option should also be considered, of course, but it looks like it could yield relatively little gain for the additional effort required, so to me it feels like the least attractive solution.

I'm currently working with a system which implements the second approach, and in my opinion it feels natural and sound. However, in this case the design was implemented a long time ago, so there's no pressure to change the architecture just to support a new use case. In your case, that may be harder, so the first option could actually be preferable.

Hans-Martin Mosner
  • One thing I neglected to mention is that the CRM is a legacy system and they don't want to do much new development. One other solution I came across is using SQL/MED. However, the CRM is DB2 and system A is PostgreSQL. I'm not sure if I can define a DB2 table as a foreign table in PostgreSQL. – Rubio Aug 20 '20 at 11:54
  • This answer is not too bad, but IMHO we do not have enough information to know whether Option 3 would require more or less effort than Option 2 in this specific case, so I would not draw any conclusions like it is done here. Both approaches require propagating information from one of the systems to the other. The effort in each case will heavily depend on the existing APIs and the maintainability of each of the systems. – Doc Brown Aug 20 '20 at 13:13
  • @Rubio _"I'm not sure if I can define a DB2 table as a foreign table in PostgreSQL"_ I think there's a postgres foreign data wrapper for DB2: https://pgxn.org/dist/db2_fdw/doc/db2_fdw.html – πάντα ῥεῖ Aug 20 '20 at 20:27
1

> There's a requirement for system A to allow searching customers by name or partial name and a role held in system A. If CRM and system A were in the same database this would be trivial, a simple inner join. But since both systems have separate databases, this is not an option.

I'm getting a little sceptical about the _there's a requirement for system A_ part here. From a user's perspective, is there really a system A, system CRM, etc.?

If not, does the user of your software really care about where in the backend the application is searching for a name?

CRM systems often store role information along with the customer information. So maybe the easiest solution - having the role information in the CRM - is the best. With that the search API would be provided by the CRM.

But if the role information is too specific to service A and does not really reflect the role of a customer in the context of a CRM, I would suggest the following (a sketch of the resulting local search follows after the list):

  • Define the Customer domain model in the context of service A with all data that is required in service A (external customerId, customer name, role, etc)
  • Update your customer data in your service A with data from the CRM either
    • via some caching mechanism (if this is applicable in terms of performance)
    • or by letting the CRM publish changes to customers which service A can listen to in order to update its own data set (e.g. by using some messaging infrastructure)
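Once service A holds that data in its own database, the search itself becomes a plain local query. A minimal sketch, assuming illustrative table and column names and a PostgreSQL database in system A:

```python
import psycopg2

def search_customers(partial_name, role):
    # With names and roles both held locally, the search is a plain join in
    # service A's own database; table and column names are illustrative only.
    conn = psycopg2.connect("dbname=system_a")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT c.external_customer_id, c.name
            FROM customers c
            JOIN customer_roles r ON r.customer_id = c.id
            WHERE c.name ILIKE %s AND r.role = %s
            """,
            (f"%{partial_name}%", role),
        )
        return cur.fetchall()
```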

If you go with this second approach, you of course have to be aware that you might sometimes search on outdated data, which may or may not be a problem depending on your business requirements. Although "outdated" can really just mean seconds or less, depending on the implementation.


Update

Bottom line: I would suggest one of the two approaches described above, depending on your current situation. Either publish changes from the CRM so that service A can build its own projection of the data for search capabilities, if some data (for instance the role information) is too specific to service A. Or extend the CRM data so that everything is available in the CRM system, and provide the search functionality via the CRM system, if the searched data/parameters fit into the model of the CRM.

But if there is no way you can implement changes in the CRM system, you could also look into the change data capture (CDC) pattern. With that you can utilize the transaction log of the database (which most modern database systems provide) to synchronize data changes in the background, selecting only what you need. This would not require changing anything in the CRM system's implementation or the CRM database. There are also tools that let you set up this kind of synchronization from one database to another without implementing any custom code. You could look, for instance, into Debezium in combination with some message streaming service (e.g. Kafka). See also https://developers.redhat.com/blog/2020/04/14/capture-database-changes-with-debezium-apache-kafka-connectors/
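Consuming such a change stream in service A could look roughly like the sketch below. The topic name, column names and what you do with each change are placeholders and depend on the actual connector configuration; Debezium wraps each row change in a before/after envelope with an "op" code:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Placeholder topic name; Debezium derives it from the connector and table names.
consumer = KafkaConsumer(
    "crm-server.CRMDB.CUSTOMER",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for record in consumer:
    if record.value is None:                 # tombstone message after a delete
        continue
    change = record.value["payload"]
    if change["op"] in ("c", "u", "r"):      # create, update, snapshot read
        row = change["after"]
        print("upsert into system A:", row["ID"], row["NAME"])   # replace with a local upsert
    elif change["op"] == "d":
        print("delete from system A:", change["before"]["ID"])   # replace with a local delete
```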

Andreas Hütter
  • _does the user of your software really care about where in the backend the application is searching._ Well, no. I don't quite understand what the user's understanding has to do with anything. Isn't what you call _role_ stored in the CRM my option 2? The alternative you provide is my option 3. The problem is the CRM system doesn't provide any mechanism to propagate changes (publish/subscribe or similar). – Rubio Aug 26 '20 at 17:08
  • As I understand it, you have no way to implement any changes in your CRM, so I updated my answer accordingly to address this. – Andreas Hütter Aug 26 '20 at 20:33
  • I did briefly consider CDC but initially thought it too difficult. However, I found some resources about replicating data between DB2 and PostgreSQL using CDC ([link](https://www.ibm.com/support/producthub/db2/docs/content/SSTRGZ_11.4.0/com.ibm.cdcdoc.postgresql.doc/concepts/overview.html?cp=SSEPGG_11.5.0)). I was leaning towards option 3 but I think you're onto something here. Will have to investigate. (Debezium doesn't have a DB2 connector.) – Rubio Aug 27 '20 at 05:58
  • Would be great if you can share your insights here later on! – Andreas Hütter Aug 27 '20 at 06:28
0

What you have described a need for is ETL (extract, transform, load). You need to keep the data synchronized between the two systems, so you can create a process that is responsible for doing that on a schedule.
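For example, such a scheduled job could look roughly like the sketch below; the CRM endpoint, batch size and table layout are assumptions for illustration:

```python
import psycopg2
import requests

CRM_BASE = "https://crm.example.com/api"  # hypothetical endpoint

def sync_customer_names():
    # Runs on a schedule (cron, a batch framework, etc.): pull the current names
    # for the customers system A references and update the local copies, in
    # batches of 100 to stay within the CRM service's input limit.
    conn = psycopg2.connect("dbname=system_a")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT external_customer_id FROM customers")
        ids = [row[0] for row in cur.fetchall()]

        for i in range(0, len(ids), 100):
            batch = ids[i:i + 100]
            resp = requests.get(f"{CRM_BASE}/customers",
                                params={"ids": ",".join(map(str, batch))})
            resp.raise_for_status()
            for customer in resp.json():
                cur.execute(
                    "UPDATE customers SET name = %s WHERE external_customer_id = %s",
                    (customer["name"], customer["id"]),
                )
```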

Michael Brown
  • And we would ETL what, the entire CRM? Please elaborate. – Rubio Aug 24 '20 at 05:46
  • “Option 3 would be that system A replicates some of the CRM data (names) in its own database, but that comes with its own problems. How do you propagate changes in CRM to system A?” You ETL the elements of data you need from the CRM to the other app – Michael Brown Aug 24 '20 at 15:40
  • System A has to interact with the CRM system anyway by calling its services. Whenever system A retrieves a customer's data from the CRM system, it checks if the customer exists in the replica, inserts the customer if not, and updates any data that is not up to date. In addition to this, some type of mass update is needed to ensure that the replica doesn't become stale. Whether that's done by an ETL process or some type of batch job is more of an implementation detail. – Rubio Aug 25 '20 at 07:10
  • You asked a question: "How can I synchronize the data from my CRM to System A?" I gave a solution and you say that's an implementation detail? Tell me then, why did you ask for an implementation detail if you didn't want one? Figure out what you want, then come back and ask a question. – Michael Brown Sep 03 '20 at 03:56
  • What I meant is you can implement synchronization in multiple ways, ETL being one of them. The core solution being that customer data is replicated and has to be somehow kept in sync with the master data. – Rubio Sep 04 '20 at 07:43