
Scalability in Streaming Systems

Scalability in Streaming Systems refers to the ability of a stream processing system to gracefully handle increasing workloads by adding more resources, without a corresponding degradation in performance and without requiring a fundamental architectural overhaul. A scalable streaming system can adapt to growing data volumes, higher data velocities, increased query complexity, and a larger number of concurrent users or queries.

Dimensions of Scalability

Streaming systems need to scale along several dimensions:

  1. Data Volume/Velocity (Throughput):
    • The ability to ingest and process a higher rate of incoming events (events/second or bytes/second).
    • Achieved by distributing data ingestion and processing across multiple nodes or cores.
  2. State Size:
    • The ability to manage and access increasingly large amounts of state required for stateful operations (e.g., windowed aggregations, joins); see the sketch after this list for how state size and throughput are typically tracked.
    • Often involves scalable state backends, potentially decoupled from compute resources (like RisingWave's Hummock on cloud storage).
  3. Computational Complexity:
    • The ability to handle more complex queries or a larger number of concurrent queries without significant performance drops.
    • Requires efficient query planning, optimization, and parallel execution.
  4. Number of Streams/Pipelines:
    • The ability to support a growing number of independent data sources, sinks, and processing pipelines.
  5. Query Latency:
    • Maintaining low latency for query results even as the other dimensions scale. For streaming systems serving pre-computed results (like RisingWave's Materialized Views), this means keeping the views fresh and answering queries against them quickly.
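
Throughput and state size are usually the first two dimensions to watch. The sketch below is a minimal, single-process Python illustration (not tied to any particular framework; all names are illustrative) of a keyed tumbling-window count that exposes both numbers, which a scalable system must keep improving as load and key cardinality grow.

```python
# Illustrative sketch only: a keyed tumbling-window count that reports
# throughput (events/second) and state size (live window entries).
import time
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window length

state = defaultdict(int)   # (key, window_start) -> running count
processed = 0
started = time.monotonic()

def on_event(key: str, event_ts: float) -> None:
    """Assign the event to its window and update the per-key count."""
    global processed
    window_start = int(event_ts // WINDOW_SECONDS) * WINDOW_SECONDS
    state[(key, window_start)] += 1
    processed += 1

def metrics() -> dict:
    """The two quantities that must keep scaling: throughput and state size."""
    elapsed = max(time.monotonic() - started, 1e-9)
    return {
        "throughput_events_per_sec": processed / elapsed,
        "state_size_entries": len(state),
    }

# Feed synthetic events for 100 distinct keys, then inspect the metrics.
for i in range(10_000):
    on_event(f"user_{i % 100}", time.time())
print(metrics())
```

In a distributed deployment the same two numbers are tracked per operator instance and per node; scaling out means both can keep growing without any single instance becoming the limit.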

Key Architectural Approaches to Scalability

  • Horizontal Scaling (Scale-Out):
    • Adding more machines (nodes) to a cluster to distribute the workload.
    • This is the most common and generally preferred approach for large-scale streaming systems as it can offer near-linear scalability if designed well.
    • Requires mechanisms for data partitioning and load balancing.
  • Vertical Scaling (Scale-Up):
    • Increasing the resources (CPU, RAM, storage) of existing machines.
    • Can be simpler initially but often hits physical limits and becomes cost-prohibitive beyond a certain point.
  • Separation of Compute and Storage:
    • Decoupling the resources used for processing (compute) from the resources used for storing state (storage).
    • Allows independent scaling of each based on demand. For example, if state size grows much faster than compute needs, storage can be scaled independently.
    • RisingWave employs this with its Hummock state store on cloud object storage, allowing Compute Nodes to be scaled based on processing load while Hummock manages scalable state persistence.
  • Data Partitioning:
    • Dividing incoming data streams into multiple partitions based on a key (e.g., user ID, device ID).
    • Each partition can then be processed by a parallel instance of an operator, enabling distributed processing.
    • Crucial for stateful operations, since it ensures that all events for the same key are processed (and their state kept) together; see the sketch after this list.
  • Parallelism:
    • Executing different parts of a dataflow graph (or multiple instances of the same operator on different data partitions) concurrently across multiple cores or nodes.
  • Elasticity:
    • The ability to dynamically add or remove resources in response to workload changes, often automatically. This helps optimize resource utilization and cost.
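
The partitioning and parallelism points above can be made concrete with a small sketch. The Python below (illustrative names, not any framework's API) hash-partitions a keyed stream so that every event for a given key is routed to the same parallel operator instance and its slice of state; adding instances is what horizontal scaling looks like at this level.

```python
# Illustrative sketch only: hash-partitioning a keyed stream across
# parallel operator instances, each owning its own slice of state.
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed parallelism; scaling out raises this number

def partition_for(key: str) -> int:
    """Stable hash of the partition key (e.g., user ID or device ID)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# One state map per parallel instance; in a real cluster these live on
# different nodes (or in a shared state store such as Hummock).
per_partition_state = [defaultdict(int) for _ in range(NUM_PARTITIONS)]

def process(key: str, value: int) -> None:
    """Route the event to its partition and update that partition's state."""
    p = partition_for(key)
    per_partition_state[p][key] += value

# All events for "device_42" land in the same partition, so exactly one
# operator instance maintains its running sum.
for v in (1, 2, 3):
    process("device_42", v)
p = partition_for("device_42")
print(p, per_partition_state[p]["device_42"])
```

Note that changing NUM_PARTITIONS reassigns keys, which is why production systems pair repartitioning with state migration or add a level of indirection (many fixed shards mapped onto a variable number of workers).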

Scalability in RisingWave

RisingWave is designed with scalability as a core principle:

  • Distributed Architecture: Consists of multiple node types (Meta, Compute, Compactor, Frontend) that can be scaled.
  • Horizontal Scaling of Compute Nodes: Users can add more Compute Nodes to increase processing capacity for streaming jobs. RisingWave's scheduler distributes the work (fragments of the dataflow graph) across available Compute Nodes.
  • Separation of Storage and Compute:
    • Compute Nodes: Execute the streaming operators and cache recently accessed state locally; they can be scaled much like stateless workers because durable state is offloaded to the state store.
    • State Store (Hummock): Manages the durable persistence of large streaming states on scalable cloud object storage (like S3). This allows state to grow very large without being limited by the local disk capacity of Compute Nodes. Compute and storage can be scaled independently.
  • Data Sharding in State Store: Hummock shards state data for better distribution and parallel access.
  • Parallelism: Streaming SQL queries are compiled into parallel dataflow graphs whose fragments run concurrently across Compute Nodes (see the sketch after this list).
  • Efficient State Management: Hummock's LSM-tree based architecture is optimized for high write throughput and efficient state lookups, crucial for scalable stateful stream processing.
  • Cloud-Native Design: Leverages the scalability and elasticity of cloud infrastructure (e.g., Kubernetes for orchestration, cloud object storage for Hummock).
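
As a concrete illustration of the user-facing side of this architecture, the sketch below connects to RisingWave over its PostgreSQL-compatible interface and defines a materialized view that RisingWave maintains as a parallel, stateful dataflow. The connection parameters, Kafka topic, and column names are assumptions for the example, not part of any particular deployment.

```python
# Illustrative sketch: defining a streaming job in RisingWave via its
# PostgreSQL-compatible interface. Host/port/user/database, the Kafka
# topic, and the schema below are assumptions; adjust to your setup.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True

with conn.cursor() as cur:
    # Ingest a stream from Kafka (connector properties are placeholders).
    cur.execute("""
        CREATE SOURCE IF NOT EXISTS events (
            user_id INT,
            amount DOUBLE PRECISION,
            ts TIMESTAMP
        ) WITH (
            connector = 'kafka',
            topic = 'events',
            properties.bootstrap.server = 'localhost:9092'
        ) FORMAT PLAIN ENCODE JSON
    """)

    # A stateful, keyed aggregation maintained continuously; its dataflow
    # fragments are spread across the available Compute Nodes, and its
    # state is persisted in Hummock on object storage.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS spend_per_user AS
        SELECT user_id, SUM(amount) AS total_spend, COUNT(*) AS orders
        FROM events
        GROUP BY user_id
    """)

    # Serving query: reads the pre-computed, continuously fresh result.
    cur.execute("SELECT * FROM spend_per_user ORDER BY total_spend DESC LIMIT 10")
    print(cur.fetchall())
```

Because the view's state lives in the state store rather than on any single Compute Node, adding Compute Nodes increases processing capacity without requiring the state to fit on local disks.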

By combining these features, RisingWave aims to provide a highly scalable platform for demanding real-time stream processing workloads, allowing users to start small and grow their deployments as their data and processing needs increase.

Challenges in Achieving Scalability

  • State Management: Efficiently managing and accessing distributed state is often the biggest challenge.
  • Skewed Data: Uneven data distribution across partitions can lead to hotspots and limit overall scalability (see the sketch after this list).
  • Stragglers: Slow-running tasks or nodes can become bottlenecks.
  • Network Bottlenecks: High data volumes can saturate network links between nodes.
  • Coordination Overhead: Managing a distributed system introduces coordination costs.
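
Skew in particular is easy to observe before it becomes a hotspot. The sketch below (illustrative names, not a real monitoring API) compares the hottest partition's load to the mean; a ratio far above 1.0 is a signal to repartition, salt the hot key, or split the aggregation into partial and final stages.

```python
# Illustrative sketch: detecting key skew across partitions, a common
# cause of hotspots that limit horizontal scalability.
from collections import Counter

def skew_ratio(partition_counts: Counter) -> float:
    """Ratio of the hottest partition's load to the mean (~1.0 = balanced)."""
    if not partition_counts:
        return 1.0
    mean = sum(partition_counts.values()) / len(partition_counts)
    return max(partition_counts.values()) / mean if mean else 1.0

# Example: partition 0 receives most of the traffic (a hot key).
counts = Counter({0: 90_000, 1: 3_000, 2: 4_000, 3: 3_000})
print(f"skew ratio: {skew_ratio(counts):.1f}")  # far above 1.0 -> hotspot
```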
