Why choose stream processing systems over batch processing
Stream processing is highly beneficial if the events you wish to track happen frequently and close together in time. Here, data flows through the system continuously, with no compulsory time limit on when output must be produced. Because the flow is almost instant, the system does not need to store large amounts of data. In this article, we will look at how stream processing systems can outperform batch processing systems.
Stream processing is a specific model of processing in which the same flow-control logic is applied to every event that occurs within the time period covered by one invocation of the application. Jay Kreps, a co-creator of Apache Kafka, helped popularize the model, and Kafka Streams is one widely used implementation built on top of Kafka.
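As a minimal sketch of this model in Python (the in-memory event source and `process` handler below are hypothetical stand-ins for a real broker and real application logic), the same handler runs once per event as it arrives:

```python
# A minimal event-at-a-time processing loop (illustrative sketch).
# The event source and handler below are hypothetical stand-ins.

def event_source():
    """Yield events as they arrive; in practice this would read
    from a message broker such as Kafka."""
    for i in range(5):
        yield {"id": i, "value": i * 10}

def process(event):
    """The same logic is applied to every event in the stream."""
    return {"id": event["id"], "doubled": event["value"] * 2}

for event in event_source():
    print(process(event))
```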
The advantage of this approach over traditional batch processing lies in its ability to perform operations such as joins and reductions on streams without materializing full intermediate results in memory or in databases, which can mean large cost savings for companies with high data volumes flowing through their systems. Batch processes are good at consistency (“all updates completed successfully”), but they have problems with throughput because of serialization, and with latency because of resource contention when too many requests run in parallel. Stream processing, on the other hand, scales with system resources and can process an effectively unbounded number of events in a given time frame without compromising consistency.
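To make the streaming reduce concrete, here is a hedged sketch of a running sum that keeps only constant-size state per event instead of materializing intermediate results; the event shape is an assumption for illustration:

```python
# Incremental (streaming) reduce: only the running aggregate is kept
# in memory, never the full set of intermediate results.

def streaming_sum(events):
    total = 0
    for event in events:
        total += event["value"]   # constant-size state, updated per event
        yield total               # emit the running result immediately

# Each output is available as soon as its input event arrives,
# with no batch materialization step.
for running_total in streaming_sum({"value": v} for v in [3, 1, 4]):
    print(running_total)          # 3, 4, 8
```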
Processing individual records is not sufficient for many use cases; it is often necessary to perform operations such as aggregation or grouping over continuous data streams. The term “stream processing” is sometimes used interchangeably with stream analytics (or just “analytics”), which generally refers to tools that analyze data streams, including semi-supervised and unsupervised learning on them.
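A grouping operation over a continuous stream can likewise be maintained incrementally; this sketch (with an assumed “key” field in each event) emits an updated per-key count as each event arrives:

```python
# Per-key grouping over a continuous stream: a running count is
# maintained for each key and updated incrementally per event.
from collections import defaultdict

def grouped_counts(events):
    counts = defaultdict(int)
    for event in events:
        counts[event["key"]] += 1
        yield event["key"], counts[event["key"]]  # emit updated count

for key, count in grouped_counts({"key": k} for k in "abab"):
    print(key, count)   # a 1, b 1, a 2, b 2
```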
In stream processing, data moves through various stages at high speed (capture, filtering, transformation, and so on) according to requirements such as windowing or event counting, and metrics and actions are derived from these stages. In batch processing, intermediate results are aggregated and written to temporary storage such as the Hadoop Distributed File System (HDFS). In stream processing, the data is processed directly and the results are immediately available. From a performance perspective, therefore, a batch system processes a bounded set of records per run, while a stream processing system can process millions of records continuously as they arrive.
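For instance, windowing can be sketched as a tumbling (fixed-size, non-overlapping) window that counts events; the window length and the “timestamp” field are assumptions for illustration:

```python
# Tumbling-window event count: events are grouped into fixed,
# non-overlapping time windows and a count is emitted per window.

def tumbling_window_counts(events, window_seconds=60):
    """events: iterable of dicts with a 'timestamp' key (epoch seconds)."""
    current_window = None
    count = 0
    for event in events:
        window = int(event["timestamp"] // window_seconds)
        if current_window is None:
            current_window = window
        if window != current_window:
            yield (current_window * window_seconds, count)  # close old window
            current_window, count = window, 0
        count += 1
    if current_window is not None:
        yield (current_window * window_seconds, count)      # flush last window

events = [{"timestamp": t} for t in (0, 10, 59, 61, 130)]
for window_start, n in tumbling_window_counts(events):
    print(window_start, n)   # (0, 3), (60, 1), (120, 1)
```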
In real-time analytics, applications need to be resilient to both planned and unplanned maintenance operations on production instances. It’s also important that changes or additions to application architectures do not require rewriting these applications every time they’re modified. Real-time streaming architectures should also provide low-latency access to data for event-based business processes, typically in less than 100 milliseconds (ms), which often leads to the adoption of NoSQL databases that can store and index streaming data at high speed.
Stream processing systems are composed of three main layers:
To achieve scalability, fault tolerance, massive parallelism, and near real-time response times, a stream processing system usually uses cluster computing. This allows it to scale out on an n-node commodity cluster by running an instance of the processing logic on each node. Cluster computing supports the stream parallelism model, in which events arriving in parallel are processed independently.
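A common way to realize this parallelism is to partition events by key, so that each instance processes an independent slice of the stream; a minimal sketch (the instance count is an arbitrary example):

```python
# Key-based partitioning: events with the same key are always routed to
# the same instance, so n instances can each process an independent
# slice of the stream in parallel.
import zlib

N_INSTANCES = 4  # hypothetical cluster size

def partition_for(event_key: str, n: int = N_INSTANCES) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash(),
    # which is randomized per interpreter run.
    return zlib.crc32(event_key.encode("utf-8")) % n

for user in ("alice", "bob", "carol", "alice"):
    print(user, "-> instance", partition_for(user))
```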
The processing element layer consists of nodes that perform event registration (input), event shipping (transport), and event consumption (output). The “event” is the base abstraction level for all kinds of operations that happen within the system, from input records to output metrics, and the element layer is where they all start. The event-processing nodes are connected in a pub-sub network that allows events to be published by one node and consumed by others. Events are sent from producers (publishers) to consumers via this network, which implements the messaging protocol for communication, including message reliability and guaranteed delivery.
The primary responsibility of publisher nodes is to get events into the network.
Publishers publish events through an ad hoc bus with a directed acyclic graph (DAG) topology over the pub-sub network, using Kafka topics, so-called streamlets. These can also be shared with other applications or used as backup streams if needed. A single Kafka topic can handle hundreds of thousands of records per second at high throughput rates, which allows messages to be sent as fast as possible and ensures the system can handle a large number of producers.
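With the kafka-python client, for example, publishing to a topic looks roughly like this; the broker address and topic name are placeholders:

```python
# Publishing events to a Kafka topic with the kafka-python client.
# Broker address and topic name below are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# send() is asynchronous; records are batched for throughput.
producer.send("orders", {"id": 1, "amount": 9.99})
producer.flush()  # block until buffered records are delivered
```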
Consumer nodes sit at the receiving end of the pub-sub network and act as sinks.
The primary responsibility of sink nodes is to consume data from publishers and index it into a searchable archive, database, or storage system (e.g., the Hadoop Distributed File System, HDFS). It’s important here that the events are indexed in near real-time rather than stored in batch mode, because they might contain valuable business information about the current state of an application. For this reason, stream processing systems usually use built-in high-availability mechanisms such as Kafka replication with automatic failover, where a single node can act as both producer and consumer by running separate producer and consumer clients.
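A minimal sink along these lines, again with kafka-python, might consume a topic and append records to durable storage; a local file stands in for HDFS or a search index here, and the topic, broker, and group names are placeholders:

```python
# A sink: consume events from a topic and append them to durable
# storage (a local file stands in for HDFS / a search index here).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                          # placeholder topic
    bootstrap_servers="localhost:9092",
    group_id="archive-sink",           # consumer group enables failover
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

with open("archive.jsonl", "a") as archive:
    for message in consumer:           # blocks, yielding records as they arrive
        archive.write(json.dumps(message.value) + "\n")
```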
The presentational layer processes incoming streams of data from sensors, applications, machines, and services. Stream processing systems can treat this input as an event stream with a single type of event, or as a set of complex events with customized schemas to hold varying payloads. For example, it could be a tuple stream in which each record is represented by at least three values (a “timestamp”, an “id”, and one main attribute), or it could be an XML document with multiple attributes in an arbitrary format. Stock exchange event records are a common example. Other examples include real-time data feeds for sensor or machine-status monitoring, such as temperature, pressure, or humidity data, which could be recorded as a tuple stream with three attributes: timestamp, id, value.
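Such a three-attribute tuple stream can be modeled directly, for example:

```python
# A minimal schema for the tuple stream described above:
# (timestamp, id, value), e.g. a temperature sensor reading.
from typing import NamedTuple

class SensorEvent(NamedTuple):
    timestamp: float  # epoch seconds
    id: str           # sensor identifier
    value: float      # main attribute, e.g. temperature in Celsius

event = SensorEvent(timestamp=1625788800.0, id="sensor-42", value=21.5)
print(event)
```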
The output of a stream processing system is usually a time-series data set containing metrics derived from processing multiple events. It can therefore be visualized as an aggregated snapshot of recent changes in some behavior of the application being monitored. For example, it might show (on average) how many users are currently streaming videos from an online service, or compare the total number of bytes processed by servers per second against the maximum throughput capacity during peak hours. Stream processing systems typically support aggregating related events into so-called micro-batches to reduce storage and memory bottlenecks at high event rates on commodity hardware. These batches are later transferred via Kafka topic replication/sharding over low-latency networks to storage systems such as the Hadoop Distributed File System (HDFS).
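A simplified sketch of micro-batching: events are buffered and flushed when either a size or a time threshold is reached (both thresholds here are arbitrary examples):

```python
# Micro-batching: buffer events and flush when either a size or a
# time threshold is reached, trading a little latency for throughput.
import time

def micro_batches(events, max_size=100, max_wait_seconds=1.0):
    batch, deadline = [], time.monotonic() + max_wait_seconds
    for event in events:
        batch.append(event)
        # Note: the time check runs only when an event arrives; a
        # production version would also flush from a background timer.
        if len(batch) >= max_size or time.monotonic() >= deadline:
            yield batch
            batch, deadline = [], time.monotonic() + max_wait_seconds
    if batch:
        yield batch  # flush whatever is left at end of stream

# Usage: each yielded batch would be written to storage in one shot.
for batch in micro_batches(range(250), max_size=100):
    print(len(batch))  # 100, 100, 50
```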
In contrast with stream processing, batch processing is a way of saving and aggregating data periodically; for instance, orders could be stored in some kind of database after they have been received. With this approach, information from the past may remain in the system even when it is no longer relevant. This typically costs more resources than stream processing because of the additional disk space required. The following points compare traditional batch processing with stream processing:
Stream processors are newer and more complex than batch processors which means that they tend to be less mature at present. There are also fewer tools, libraries, and frameworks available for stream processing.
Comparisons like these are, however, not always fair, since batch processors work on finite, bounded data sets, whereas stream processors have to cope with an unbounded number of events, because input arrives continuously over time.
A disadvantage of streaming (as opposed to polling) is that in order to receive messages from the source, one must keep track of when new ones arrive and maintain long-lived connections, re-establishing them whenever they drop. This can be a problem for the reasons below.
Streaming ties up network bandwidth and long-lived connections, while polling ties up server resources on both sides (client and server) with repeated requests. Streaming therefore tends to be more resource-intensive on the client side than polling, but less intensive on the server side. Some languages and protocols may also require a client to implement a separate network transport protocol or handle multiple concurrent connections.
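The contrast can be sketched with Python’s requests library; the URL is a placeholder and error handling is omitted for brevity:

```python
# Polling vs. streaming, sketched with the requests library.
# The URL is a placeholder; error handling is omitted for brevity.
import time
import requests

URL = "https://example.com/events"

def handle(data):
    print(data)

def poll(interval_seconds=5):
    """Polling: a fresh request (and connection setup) per check."""
    while True:
        response = requests.get(URL)
        handle(response.json())
        time.sleep(interval_seconds)

def stream():
    """Streaming: one long-lived connection, records as they arrive."""
    with requests.get(URL, stream=True) as response:
        for line in response.iter_lines():
            if line:
                handle(line)
```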
Originally published at https://protonautoml.com on July 9, 2021.