*Designing Data-Intensive Applications* says:
Batch processing systems (offline systems) Chapter 10
A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. Jobs often take a while (from a few minutes to several days), so there normally isn't a user waiting for the job to finish. Instead, batch jobs are often scheduled to run periodically (for example, once a day). The primary performance measure of a batch job is usually throughput (the time it takes to crunch through an input dataset of a certain size). We discuss batch processing in Chapter 10.
Stream processing systems (near-real-time systems) Chapter 11
Stream processing is somewhere between online and offline/batch processing (so it is sometimes called near-real-time or nearline processing). Like a batch processing system, a stream processor consumes inputs and produces outputs (rather than responding to requests). However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data. This difference allows stream processing systems to have lower latency than the equivalent batch systems. As stream processing builds upon batch processing, we discuss it in Chapter 11.
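To make the contrast concrete for myself, here is a toy shell sketch (`access.log` is a hypothetical log file): the batch job runs over a complete, fixed input and terminates, while the stream job consumes events as they are appended and never finishes on its own.

```bash
# Batch: the input is a fixed, complete file; the job reads all of it,
# produces its output, and exits.
wc -l access.log

# Stream: tail -f follows the file and emits new lines as they are
# appended, so grep sees each event shortly after it happens and the
# pipeline keeps running indefinitely.
tail -f access.log | grep "ERROR"
```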
In the summary of Chapter 10, Batch Processing, the book says:
In this chapter we explored the topic of batch processing. We started by looking at Unix tools such as `awk`, `grep`, and `sort`, and we saw how the design philosophy of those tools is carried forward into MapReduce and more recent dataflow engines. Some of those design principles are that inputs are immutable, outputs are intended to become the input to another (as yet unknown) program, and complex problems are solved by composing small tools that "do one thing well."

In the Unix world, the uniform interface that allows one program to be composed with another is files and pipes; in MapReduce, that interface is a distributed filesystem. We saw that dataflow engines add their own pipe-like data transport mechanisms to avoid materializing intermediate state to the distributed filesystem, but the initial input and final output of a job is still usually HDFS.
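As a concrete instance of that composition philosophy, this is roughly the kind of pipeline I understand the chapter to be describing (the log format, and the URL being the 7th field, are my assumptions):

```bash
# Find the five most-requested URLs in a web server access log,
# assuming the requested path is the 7th whitespace-separated field.
cat access.log |
  awk '{print $7}' |   # extract the URL field from each line
  sort |               # bring identical URLs next to each other
  uniq -c |            # collapse runs of identical lines and count them
  sort -r -n |         # order by count, highest first
  head -n 5            # keep only the top five
```

Each stage does one thing well, and the pipe is the uniform interface that connects them.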
What is the distinction between batch processing and stream processing systems, and how are they related? Specifically:
1. Is it correct that the difference between batch and stream processing systems is that
   - a batch processing system must read in its entire input before it can start to process the input, whereas
   - a stream processing system can process part of the input, without reading all of the input?
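To make part 1 concrete, here is a small experiment I sketched (run in an interactive shell; the 5-second sleep is only there to make the timing visible):

```bash
# grep emits each matching line as soon as it arrives:
( echo "match 1"; sleep 5; echo "match 2" ) | grep match
# "match 1" appears immediately; "match 2" follows about 5 seconds later.

# sort cannot emit anything until it has seen all of its input
# (i.e. until EOF on the pipe):
( echo b; sleep 5; echo a ) | sort
# nothing appears for 5 seconds, then "a" and "b" are printed together.
```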
2. Assuming the answers to part 1 are yes:
   - How does a batch processing system know what size its input has? How does it know it has read all of its input?
   - How does a stream processing system know how much of the input it needs to read before it starts processing?
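My current guess for the batch half of this, stated as a sketch so it can be corrected: a batch tool does not know its input size in advance; it simply reads until the file or pipe signals end-of-file.

```bash
# wc never learns the size of its input up front; it counts until a
# read() on stdin returns 0 bytes (end-of-file), which happens here
# when printf exits and the write end of the pipe is closed.
printf 'a\nb\nc\n' | wc -l    # prints 3 only after the pipe is closed
```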
3. Use programs and pipes in bash-like shells as an example: `commandB | commandA`.
   - When `commandA` is `awk`, `grep`, or `sort`, is it a batch processing system, and why?
   - When `commandA` is `cat`, is it a batch or a stream processing system, and why? (A small experiment below tries to make this concrete.)
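For the `cat` case, this is the experiment I have in mind (`yes` produces an endless stream of `y` lines):

```bash
# cat passes lines through as they arrive, so this prints three lines
# and exits (head closes the pipe; cat and yes are killed by SIGPIPE):
yes | cat | head -n 3

# sort would have to consume its entire (here: infinite) input before
# emitting anything, so this never produces output; interrupt it with
# Ctrl-C, as sort will keep buffering to temporary files:
yes | sort | head -n 3
```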
4. Do batch processing systems, stream processing systems, or both fit into the producer-consumer pattern?
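By producer-consumer I mean something like the following sketch (the function name is made up), where the pipe acts as a bounded, blocking buffer between the two sides:

```bash
# A hypothetical producer that emits one event per second...
produce() {
  local i=0
  while true; do
    echo "event $i"
    i=$((i + 1))
    sleep 1
  done
}

# ...connected by a pipe to a consumer; the pipe blocks the producer
# when its buffer is full and blocks the consumer when it is empty.
produce | while read -r line; do
  echo "consumed: $line"
done
```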
Thanks.