Data Partitioning in the context of distributed stream processing refers to the strategy used to divide a data stream into multiple substreams (partitions) and distribute these partitions across the available parallel instances (tasks or sub-tasks) of downstream operators within a Dataflow Graph.
The primary goal of partitioning in streaming is scalability. By processing partitions in parallel across multiple workers (CPU cores or nodes), the system can handle much higher data volumes than a single instance could manage. Correct partitioning is also essential for the correctness of stateful operations.
Processing unbounded data streams often requires handling high throughput and performing stateful computations (like joins or aggregations). A single processing instance quickly becomes a bottleneck. Partitioning addresses this by:
For stateful operations like 'GROUP BY' aggregations or joins based on a key, hash partitioning (or key-based partitioning) is critical:
Incorrect partitioning for stateful operations leads to incorrect results.
RisingWave automatically handles data partitioning as part of executing distributed Streaming SQL queries:
This automatic handling of partitioning and shuffling allows users to write declarative SQL queries without needing to manually manage the complexities of distributed data routing for stateful operations.