*Designing Data-Intensive Applications* says:
Batch processing systems (offline systems) Chapter 10
A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. Jobs often take a while (from a few minutes to several days), so there normally isn't a user waiting for the job to finish. Instead, batch jobs are often scheduled to run periodically (for example, once a day). The primary performance measure of a batch job is usually throughput (the time it takes to crunch through an input dataset of a certain size). We discuss batch processing in Chapter 10.
Stream processing systems (near-real-time systems) Chapter 11
Stream processing is somewhere between online and offline/batch processing (so it is sometimes called near-real-time or nearline processing). Like a batch processing system, a stream processor consumes inputs and produces outputs (rather than responding to requests). However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data. This difference allows stream processing systems to have lower latency than the equivalent batch systems. As stream processing builds upon batch processing, we discuss it in Chapter 11.
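To make the contrast concrete for myself, here is a toy shell sketch (`access.log` is a hypothetical log file): the batch job runs over a complete, fixed input and terminates, while the stream job consumes events as they are appended and never finishes on its own.

```bash
# Batch: the input is a fixed, complete file; the job reads all of it,
# produces its output, and exits.
wc -l access.log

# Stream: tail -f follows the file and emits new lines as they are
# appended, so grep sees each event shortly after it happens and the
# pipeline keeps running indefinitely.
tail -f access.log | grep "ERROR"
```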
In the summary of Chapter 10, Batch Processing, the book says:
In this chapter we explored the topic of batch processing. We started by looking at Unix tools such as `awk`, `grep`, and `sort`, and we saw how the design philosophy of those tools is carried forward into MapReduce and more recent dataflow engines. Some of those design principles are that inputs are immutable, outputs are intended to become the input to another (as yet unknown) program, and complex problems are solved by composing small tools that "do one thing well."

In the Unix world, the uniform interface that allows one program to be composed with another is files and pipes; in MapReduce, that interface is a distributed filesystem. We saw that dataflow engines add their own pipe-like data transport mechanisms to avoid materializing intermediate state to the distributed filesystem, but the initial input and final output of a job is still usually HDFS.
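As a concrete instance of that composition philosophy, this is roughly the kind of pipeline I understand the chapter to be describing (the log format, and the URL being the 7th field, are my assumptions):

```bash
# Find the five most-requested URLs in a web server access log,
# assuming the requested path is the 7th whitespace-separated field.
cat access.log |
  awk '{print $7}' |   # extract the URL field from each line
  sort |               # bring identical URLs next to each other
  uniq -c |            # collapse runs of identical lines and count them
  sort -r -n |         # order by count, highest first
  head -n 5            # keep only the top five
```

Each stage does one thing well, and the pipe is the uniform interface that connects them.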
What is the distinction between batch processing and stream processing systems, and how are they related? Specifically:
1. Is it correct that the difference between batch and stream processing systems is that
   - a batch processing system must read in its entire input before it can start to process the input, whereas
   - a stream processing system can process part of the input, without reading all of the input?
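To make part 1 concrete, here is a small experiment I sketched (run in an interactive shell; the 5-second sleep is only there to make the timing visible):

```bash
# grep emits each matching line as soon as it arrives:
( echo "match 1"; sleep 5; echo "match 2" ) | grep match
# "match 1" appears immediately; "match 2" follows about 5 seconds later.

# sort cannot emit anything until it has seen all of its input
# (i.e. until EOF on the pipe):
( echo b; sleep 5; echo a ) | sort
# nothing appears for 5 seconds, then "a" and "b" are printed together.
```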
2. Assuming the answers to part 1 are yes:
   - How does a batch processing system know what size its input has? How does it know it has read all of its input?
   - How does a stream processing system know how much of the input it needs to read before it starts processing?
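My current guess for the batch half of this, stated as a sketch so it can be corrected: a batch tool does not know its input size in advance; it simply reads until the file or pipe signals end-of-file.

```bash
# wc never learns the size of its input up front; it counts until a
# read() on stdin returns 0 bytes (end-of-file), which happens here
# when printf exits and the write end of the pipe is closed.
printf 'a\nb\nc\n' | wc -l    # prints 3 only after the pipe is closed
```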
3. Use programs and pipes in bash-like shells as an example: `commandB | commandA`.
   - When `commandA` is `awk`, `grep`, or `sort`, is it a batch processing system, and why?
   - When `commandA` is `cat`, is it a batch or a stream processing system, and why? (A small experiment below tries to make this concrete.)
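For the `cat` case, this is the experiment I have in mind (`yes` produces an endless stream of `y` lines):

```bash
# cat passes lines through as they arrive, so this prints three lines
# and exits (head closes the pipe; cat and yes are killed by SIGPIPE):
yes | cat | head -n 3

# sort would have to consume its entire (here: infinite) input before
# emitting anything, so this never produces output; interrupt it with
# Ctrl-C, as sort will keep buffering to temporary files:
yes | sort | head -n 3
```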
4. Do batch processing systems, stream processing systems, or both fit into the producer-consumer pattern?
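By producer-consumer I mean something like the following sketch (the function name is made up), where the pipe acts as a bounded, blocking buffer between the two sides:

```bash
# A hypothetical producer that emits one event per second...
produce() {
  local i=0
  while true; do
    echo "event $i"
    i=$((i + 1))
    sleep 1
  done
}

# ...connected by a pipe to a consumer; the pipe blocks the producer
# when its buffer is full and blocks the consumer when it is empty.
produce | while read -r line; do
  echo "consumed: $line"
done
```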
Thanks.