Design pattern for streaming data?

Question

I have a use case where I'm asked to read an XML document that is a list of data, break it up into data sub-elements, and transform those sub-elements (in order) into another document format (like a flat file or JSON array).

I could solve the problem using a typical synchronous flow, first parsing the entire XML document into relevant (Java) objects, and then processing all the objects. This would ensure the order of the output is the same as the order of the input.

However, I've been told there is a design pattern which fits this use case. My guess is that it's one of the concurrency patterns, and so my feeling is that it could be implemented with a Queue instead of a Collection. The XML parser would take each set of data, parse it, and push it to the Queue, while another thread (or pool of threads) would pop elements off the queue and process them to the output file.

I've not implemented this before so I have a number of questions, but first I'd like to know if I'm on the right track?

An additional use case is that the design should be able to handle multiple (shorter) XML as input from a web service. Each XML will contain one set of data, and there is no requirement around what order the documents need to be in the output, as long as the sub-elements from each set of data are in the right order.

(Edit) I'm not asking how to choose a design pattern in general. I'm asking which design pattern applies to this very specific use case.

Possible duplicate of [Choosing the right Design Pattern](https://softwareengineering.stackexchange.com/questions/227868/choosing-the-right-design-pattern) — gnat, May 23 '17 at 17:49
@gnat how is this a duplicate? I'm not asking how to choose a design pattern *in general*. I'm asking which design pattern applies to this *very specific* use case. — Andrew, May 23 '17 at 17:55
You ask for "a design pattern which fits this use case even better", but you do not tell us what you mean by "better". My favorite definition of "better" is "simpler", because "simpler" often means "easier to understand", "easier to implement without bugs", "less code to maintain". I totally fail to see how a concurrent approach could be simpler than your initial solution. So voting to close as "unclear". — Doc Brown, May 23 '17 at 21:26
@DocBrown if I remove the word "better" and just ask for a design pattern that fits, will that make the question acceptable? — Andrew, May 23 '17 at 21:54
If you have a list of items that needs to be processed in parallel, then it's trivially parallelizable. If you want to run one operation after the other, that's called pipelining. These two forms of concurrency are not mutually exclusive; the optimally efficient setup depends on your particular data set and machine specification. — gardenhead, May 24 '17 at 03:22
@Andrew: no, quite the opposite. I recommend you remove the words "design pattern" from the question, define clearly what you are looking for, what your **goals** are, why your current solution does not suffice for your needs ("someone else has said there is a pattern" is not a sensible reason) and ask for a solution to a problem you have. Maybe someone can then describe a solution, maybe there is a common name for it (so it might be called a "pattern"), but tell us clearly what you are after. In the current form, your question leads to nothing but a guessing game about a name. — Doc Brown, May 24 '17 at 05:41
Why not open the document, iterate through your elements and spit out lines or JSON text elements to your output file as you go? And be done with it? Multiple threads and an intermediate medium like a queue seem awfully inefficient for such a trivial task. I would call this pattern the not-messing-about pattern. — Martin Maat, May 24 '17 at 07:54

score 1 · Answer 1 · answered May 24 '17 at 03:44

What you've described is called a fork-join queue or the fork-join model.

From Wikipedia:

The fork–join model is a way of setting up and executing parallel programs, such that execution branches off in parallel at designated points in the program, to "join" (merge) at a subsequent point and resume sequential execution.

You can implement it either with an explicit queue or without one. An explicit queue has the advantage that you can use a persistent queue and distribute the load over different machines.
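A minimal sketch of the explicit-queue variant, assuming an in-memory `java.util.concurrent.BlockingQueue` on one JVM rather than a persistent or distributed queue. The class name `QueueSketch`, the `POISON` end marker and the `transform` method are hypothetical stand-ins, and the XML parsing step is replaced by a ready-made list of strings. It uses a single consumer thread, because with several consumers the output order would no longer be guaranteed without extra bookkeeping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSketch {
    // Hypothetical end-of-input marker; compared by identity, so it cannot clash with real data.
    private static final String POISON = new String("END");

    // Stand-in for the real sub-element transformation (assumption).
    static String transform(String element) {
        return "{\"value\":\"" + element + "\"}";
    }

    static List<String> process(List<String> parsedElements) throws InterruptedException {
        // Bounded queue: the producer blocks when the consumer falls behind.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        List<String> output = new ArrayList<>();

        // Single consumer: takes elements in FIFO order, so output order matches input order.
        Thread consumer = new Thread(() -> {
            try {
                for (String e = queue.take(); e != POISON; e = queue.take()) {
                    output.add(transform(e));
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producer: in the real use case this loop would be driven by the XML parser.
        for (String e : parsedElements) {
            queue.put(e);
        }
        queue.put(POISON);

        // join() makes the consumer's writes to 'output' visible to this thread.
        consumer.join();
        return output;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(process(Arrays.asList("a", "b", "c")));
    }
}
```

The bounded queue is a deliberate choice here: it keeps memory use flat for large documents, since the parser cannot run arbitrarily far ahead of the writer.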

In Java (on a single JVM, without an explicit queue) you can use either the Java Fork/Join framework or, since Java 8, parallel Streams for that.
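As a rough illustration of the parallel Streams option, assuming the sub-elements have already been parsed into a `List<String>`: the class name `ParallelStreamSketch` and the `transform` method are hypothetical. Because a `List` is an ordered source, `collect(Collectors.toList())` returns the results in encounter order even though `map` may run on several threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelStreamSketch {
    // Stand-in for the real sub-element transformation (assumption).
    static String transform(String element) {
        return "{\"value\":\"" + element + "\"}";
    }

    static List<String> process(List<String> parsedElements) {
        return parsedElements.parallelStream()          // work is split across the common ForkJoinPool
                .map(ParallelStreamSketch::transform)   // may execute in any order, on any worker thread
                .collect(Collectors.toList());          // result list follows the input (encounter) order
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList("a", "b", "c")));
    }
}
```

If the results are written directly to a file instead of collected, `forEachOrdered` rather than `forEach` would be needed to keep the input order.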
