
Let's assume I have access to two REST endpoints, /persons and /products. If I need to retrieve 100K persons with their respective products, what is the best way to do that?

How can I join them efficiently? By implementing a temporary table in the DB, or by joining them in memory? If I do it in memory, I find myself doing the RDBMS's work. When the join condition is only on some ID key that can be acceptable, but with more complex join conditions, doing the join over web services seems very inefficient.

Join with WS (a sketch of this flow follows the list):

  1. HTTP GET 100K persons in JSON format
  2. HTTP GET all the products filtered by the person ID
  3. Create some array structure of products indexed by person ID
  4. Loop over the persons, adding to each person its products where person.id = product.person_id
  5. Use the data
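
Below is a minimal sketch of that in-memory join, assuming Java 16+ with Jackson 2.12+ on the classpath; the endpoint URLs and the Person/Product field names are hypothetical placeholders, not part of the original question:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public class InMemoryJoin {

    // Hypothetical DTOs; the real payload shapes will differ.
    record Person(long id, String name) {}
    record Product(long id, long personId, String label) {}

    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final ObjectMapper JSON = new ObjectMapper();

    // Steps 1-2: GET a JSON array and deserialize it.
    static <T> List<T> fetch(String url, TypeReference<List<T>> type) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String body = HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
        return JSON.readValue(body, type);
    }

    public static void main(String[] args) throws Exception {
        List<Person> persons =
                fetch("http://example.com/persons", new TypeReference<List<Person>>() {});
        List<Product> products =
                fetch("http://example.com/products", new TypeReference<List<Product>>() {});

        // Step 3: index the products by person ID.
        Map<Long, List<Product>> productsByPerson =
                products.stream().collect(Collectors.groupingBy(Product::personId));

        // Step 4: the "join" is now one hash lookup per person, O(persons + products).
        for (Person p : persons) {
            List<Product> owned = productsByPerson.getOrDefault(p.id(), List.of());
            // Step 5: use the data.
            System.out.println(p.name() + " owns " + owned.size() + " products");
        }
    }
}
```

The hash index keeps the join itself linear, but the whole dataset still sits in memory, which is exactly the objection raised below.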

Join with some table dump (the join query is sketched after the list):

  1. FTP get persons.sql
  2. FTP get (linked) products.sql
  3. Load them into temporary tables in the DB (doing the same with JSON or XML requires transformations)
  4. Run SQL query
  5. Use the data
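
Step 4 of the dump variant is then plain SQL. A minimal sketch, assuming hypothetical column layouts for the two dumps:

```sql
-- Hypothetical schemas; the real persons.sql / products.sql dumps define these.
CREATE TEMPORARY TABLE persons  (id BIGINT PRIMARY KEY, name VARCHAR(255));
CREATE TEMPORARY TABLE products (id BIGINT PRIMARY KEY, person_id BIGINT, label VARCHAR(255));

-- Step 4: the join (plus the ordering/grouping the WS approach
-- would otherwise have to re-implement by hand).
SELECT p.id, p.name, pr.id AS product_id, pr.label
FROM persons p
LEFT JOIN products pr ON pr.person_id = p.id
ORDER BY p.id;
```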

So in the simple WS scenario, the disadvantages are the massive use of memory and the complexity of re-implementing RDBMS work (the join, not to mention ordering and grouping if needed), and in this case there are only two entities.

(The idea behind this reflection is to study the possibility of completely removing EAI in the enterprise in favour of an SOA/ESB architecture.)

Michael
  • To the people who simply put -1 without any comment: what is your problem? – Michael Jul 14 '17 at 10:49
  • See [Why is “Is it possible to…” a poorly worded question?](https://softwareengineering.meta.stackexchange.com/a/7274/31260) – gnat Jul 14 '17 at 11:08
  • I don't understand the question. Can you read it back to yourself, correct the spelling and grammar, add paragraphs, etc.? Also, what's the technical problem you are running into with the simplistic approach? Can you give some pseudo code? – Ewan Jul 14 '17 at 12:10
  • Do you need to retrieve 100K users and the products just in time? Any chance to perform ETL periodically and dump the data into a temp table? – Laiv Jul 14 '17 at 12:56
  • @Laiv Thanks for your comment. Yes, I need all the data in order to produce a file for another program. I do think the ETL/EAI way is more efficient, but if the goal is a microservices-only architecture, then there is no ETL or other file transfer. – Michael Jul 14 '17 at 13:01
  • Forget in-memory approaches or databases. Look for data streaming. I don't know what kind of stack you are planning to use, but in the case of Java we have frameworks such as [Spring Cloud Data Flow](http://cloud.spring.io/spring-cloud-dataflow). [Take a look at the implementation with Mesos](http://cloud.spring.io/spring-cloud-dataflow-server-mesos). You could also implement your own "data flow" on Kafka or similar. – Laiv Jul 14 '17 at 13:08
  • @Laiv Seems interesting, even if I don't see how the join is performed without an index available; in the streaming way you don't have all the data available all the time (I didn't know this way of doing it, maybe it is a solution). It also looks very complex. – Michael Jul 14 '17 at 13:18

1 Answer


So 100K Persons, each with many Products, to be brought down over a web service seems like a lot. But you have to remember that web service calls can be made asynchronously.

If the data is held in a scalable, NoSQL way, then the task is amenable to a different approach than the standard "query the DB and loop through" one.

We can split the task up amongst many async workers, each perhaps taking a single Person, getting its list of Products, applying a chunk of business logic, and writing back to some aggregate (sketched below).

Sure, it's not going to be as quick as running an SQL query joining the tables directly... at 100K rows. But when you get to 100 million rows it might be faster, or at least run to completion without locking up the rest of your application.
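
A minimal sketch of that fan-out, assuming Java 16+; the endpoint URL, the pool size, and the "business logic" are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FanOutJoin {

    static final HttpClient HTTP = HttpClient.newHttpClient();
    // Bounded pool: many requests in flight, but never 100K sockets at once.
    static final ExecutorService POOL = Executors.newFixedThreadPool(50);

    public static void main(String[] args) {
        // Placeholder IDs; in reality these would be paged out of /persons.
        List<Long> personIds = List.of(1L, 2L, 3L);
        Map<Long, String> aggregate = new ConcurrentHashMap<>();

        List<CompletableFuture<Void>> tasks = personIds.stream()
                .map(id -> CompletableFuture
                        .supplyAsync(() -> fetchProducts(id), POOL) // one cheap GET per person
                        .thenAccept(json -> aggregate.put(id, process(json))))
                .toList();

        // Wait for every worker to finish, then use the aggregate.
        CompletableFuture.allOf(tasks.toArray(CompletableFuture[]::new)).join();
        System.out.println(aggregate);
        POOL.shutdown();
    }

    static String fetchProducts(long personId) {
        try {
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://example.com/products?person_id=" + personId)).GET().build();
            return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static String process(String productsJson) {
        // Placeholder for the "chunk of business logic" from the answer.
        return productsJson.length() + " bytes of products";
    }
}
```

The bounded pool is the knob for "how many at the same time"; each worker is independent, so a slow or failed request for one Person doesn't block the rest.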

Ewan
  • OK, interesting, but that means you're doing 100K more HTTP requests. I know there is no magical solution; my opinion is that, if possible, each use case should get a custom web service answering its specific requirements. So in my example, I would ask the producers to make a new endpoint with the products embedded in the person entity. Of course, in a real case that is rarely possible or even efficient, because the entities can be stored in so many different ways, and even running an SQL join query is not always possible. – Michael Jul 14 '17 at 13:50
  • Yes, more requests than you would expect from a traditional approach. But it's not necessarily a bad thing if each request is cheap and they are done at the same time. – Ewan Jul 14 '17 at 14:05
  • 'Done at the same time': you mean parallelizing the queries? Yeah, that can be a good approach, doing chunked calls of 1000 queries, but the provider needs a good HTTP server and of course has to allow you to do that. – Michael Jul 14 '17 at 14:10
  • Not really, no. I mean a completely async, 'offline'-based approach: "We can split the task up amongst many async workers". – Ewan Jul 14 '17 at 14:19
  • Just remember that HTTP is not the best protocol for low-latency communications. [It is not even the only IPC protocol](https://www.nginx.com/blog/building-microservices-inter-process-communication/) :-). Nevertheless, I'm sure that the task to be performed here is a good candidate for stream processing. Later, if you need to expose the processed data through a RESTful API via HTTP, the best you can do is paginated requests. – Laiv Jul 14 '17 at 14:32
  • Stream processing might be another solution, but the OP states "assume I have access to two rest endpoints". If you are in a REST/HTTP SOA environment, then the async, queue-based approach is the way to go. You don't have to be low latency, because you don't care what order all the individual bits get done in, or how many you are doing at the same time. – Ewan Jul 14 '17 at 14:36
  • Your suggestion is to run 100K HTTP requests; I don't think that is the best way to do it. My first question was more about how to merge multiple JSON or XML objects. – Michael Jul 17 '17 at 07:21