Disaggregated Storage in Stream Processing: Why S3 Changes Everything

Disaggregated storage separates compute from state in stream processing — storing all streaming state (aggregations, join buffers, window data) in object storage like S3 instead of local disks. This architectural pattern, adopted by RisingWave and Flink 2.0, fundamentally changes how streaming systems scale, recover from failures, and manage costs.

Traditional vs Disaggregated Architecture

Traditional (Coupled Compute + State)

Compute Node 1: [Processing] + [RocksDB State on Local Disk]
Compute Node 2: [Processing] + [RocksDB State on Local Disk]
Compute Node 3: [Processing] + [RocksDB State on Local Disk]

State lives on each node's local disk. If a node dies, its state is lost and must be restored from a remote checkpoint.

Disaggregated (Separated)

Compute Node 1: [Processing] + [In-Memory Cache]
Compute Node 2: [Processing] + [In-Memory Cache]
Compute Node 3: [Processing] + [In-Memory Cache]
              ↕ Read/Write ↕
         [S3 — All State Stored Here]

State lives in S3. Compute nodes are relatively stateless. If a node dies, a replacement node reads state from S3 immediately.
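The recovery difference can be sketched in a few lines. This is a toy model, not any real system's API: a plain dict stands in for S3, and the point is that the "object store" outlives any individual compute node.

```python
# Toy model of disaggregated recovery: the object store (a dict standing
# in for S3) outlives any compute node, so a replacement node needs no
# state download -- it just points at the shared store.

object_store = {}  # stands in for an S3 bucket


class ComputeNode:
    def __init__(self, store):
        self.store = store   # shared, durable state
        self.cache = {}      # node-local, lost on failure

    def write(self, key, value):
        self.cache[key] = value
        self.store[key] = value  # an async upload in a real system

    def read(self, key):
        if key in self.cache:        # hot path: local cache
            return self.cache[key]
        return self.store.get(key)   # cold path: fetch from the store


node1 = ComputeNode(object_store)
node1.write("count:user42", 7)
del node1  # the node dies; its local cache is gone

node2 = ComputeNode(object_store)  # replacement starts immediately
print(node2.read("count:user42"))  # state survived: 7
```

In the traditional architecture, the `del node1` step would have destroyed the only full copy of that node's state, forcing the replacement to download and replay a checkpoint before serving.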

Why Disaggregated Storage Matters

| Benefit | Traditional | Disaggregated |
| --- | --- | --- |
| Recovery time | Minutes to hours (download checkpoint) | Seconds (state already in S3) |
| Scaling | Redistribute state across nodes (slow) | Add compute, point at same S3 (fast) |
| Disk failure | State lost, must restore | No impact (S3 offers 11 nines durability) |
| Cost | Over-provision for peak state | S3 at $0.023/GB/month |
| Checkpoint time | Proportional to state size | Near-constant (~3-4 seconds) |
| State size limit | Limited by local disk | Effectively unlimited (S3 is elastic) |
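The cost row is easy to sanity-check. The S3 price is the table's own figure; the SSD price of $0.08/GB-month is an assumption for illustration (roughly provisioned cloud block storage list price), not a quoted number.

```python
# Back-of-the-envelope cost for 1 TB of streaming state. S3 price is the
# $0.023/GB-month figure from the table; the SSD figure ($0.08/GB-month)
# is an illustrative assumption for provisioned cloud block storage.
state_gb = 1024

s3_monthly = state_gb * 0.023    # ~$23.55/month
ssd_monthly = state_gb * 0.08    # ~$81.92/month, per replica

# Local-disk designs typically over-provision for peak state and often
# replicate it across nodes, so the real gap is wider than this ~3.5x.
print(f"S3:  ${s3_monthly:.2f}/month")
print(f"SSD: ${ssd_monthly:.2f}/month")
```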

RisingWave: Built on S3 from Day One

RisingWave's Hummock storage engine is an LSM-tree designed specifically for streaming on S3:

  • Mem-tables buffer writes in memory
  • SSTables (Sorted String Tables) are flushed to S3
  • Local cache serves reads, keeping S3 off the hot read path during normal queries
  • Compactor nodes merge SSTables in the background
  • 1-second checkpoints are feasible because state is already durable in S3
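The write path described above can be sketched as a miniature LSM tree. Everything here is illustrative (the names, the flush threshold, the linear-scan reads); it is not Hummock's actual API, just the shape of the memtable/SSTable/compactor flow.

```python
# Minimal LSM write path: writes buffer in a memtable, full memtables
# flush to immutable sorted runs in an "object store" (a list standing in
# for S3), and a compactor merges runs in the background.

MEMTABLE_LIMIT = 3   # illustrative flush threshold

memtable = {}
sstables = []        # newest run first; each run is a sorted (key, value) list


def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()


def flush():
    global memtable
    sstables.insert(0, sorted(memtable.items()))  # "upload SSTable to S3"
    memtable = {}


def get(key):
    if key in memtable:              # hot: still in memory
        return memtable[key]
    for run in sstables:             # newest run wins
        for k, v in run:
            if k == key:
                return v
    return None


def compact():
    """Merge all runs into one, keeping the newest value per key."""
    global sstables
    merged = {}
    for run in reversed(sstables):   # oldest first, newer values overwrite
        merged.update(dict(run))
    sstables = [sorted(merged.items())]


put("a", 1); put("b", 2); put("a", 3)   # third call overwrites "a" in place
put("c", 4)                             # hits the limit, triggers a flush
compact()
print(get("a"))  # → 3 (the newest value survives compaction)
```

Because every flushed SSTable is already durable in S3, a checkpoint does not need to copy this data anywhere, which is what makes the 1-second cadence feasible.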

Flink 2.0 introduced the ForSt ("For Streaming") state backend:

  • State stored in S3/HDFS instead of local RocksDB
  • 40x faster recovery compared to Flink 1.x
  • Checkpoint duration: 3-4 seconds regardless of state size (vs minutes in 1.x)
  • Backward compatible with existing Flink applications

Performance Impact

Common concern: "Doesn't reading from S3 add latency?"

In practice, no — because of caching:

  • Hot state is cached in memory on compute nodes
  • Warm state is cached on local SSD
  • Cold state is fetched from S3 (rare, only during recovery or cold start)

During normal processing, all reads hit local cache. S3 is only involved during writes (async) and recovery.
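The three-tier read path can be sketched as a cache-promotion lookup. Plain dicts stand in for each tier here; a real system would bound the caches and evict entries (e.g. with an LRU policy), and the key name is made up for the example.

```python
# Tiered read path: check memory, then local SSD, then S3, promoting the
# value back up on the way. Dicts stand in for each storage tier.

memory_cache = {}                 # hot tier
ssd_cache = {}                    # warm tier
s3 = {"win:09:00": 42}            # cold, durable tier

def read(key):
    if key in memory_cache:
        return memory_cache[key]  # no I/O at all
    if key in ssd_cache:
        value = ssd_cache[key]    # local disk read
    elif key in s3:
        value = s3[key]           # remote fetch: rare (recovery, cold start)
        ssd_cache[key] = value    # promote to the warm tier
    else:
        return None
    memory_cache[key] = value     # promote to the hot tier
    return value

value = read("win:09:00")         # first read falls through to S3
print(value)                      # every later read is pure memory
```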

When Disaggregated Storage Matters Most

  • Large state (>10 GB): Local disk becomes a bottleneck
  • Elastic workloads: Scale compute up/down without state migration
  • High availability: Seconds-level recovery vs minutes
  • Cost-sensitive: S3 storage is 10-100x cheaper than SSD
  • Multi-tenant: Share storage across workloads

Frequently Asked Questions

Does S3-based state add latency to stream processing?

Not during normal processing. Hot state is cached in compute node memory. S3 is only involved during async checkpoint writes and recovery. RisingWave achieves sub-100ms end-to-end processing latency despite storing all state on S3.

Is disaggregated storage the future of stream processing?

Yes. Both RisingWave (S3-native from day one) and Flink 2.0 (ForSt backend) have adopted disaggregated storage. The benefits — elastic scaling, fast recovery, lower cost, unlimited state — outweigh the added complexity of remote storage management.

How does disaggregated storage affect checkpointing?

Dramatically. In traditional systems, checkpoints must copy state from local disk to remote storage — time proportional to state size. In disaggregated systems, state is already in S3, so checkpoints just commit metadata — completing in 3-4 seconds regardless of state size.
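The metadata-only commit is worth seeing concretely. This sketch assumes the state files were already uploaded by earlier flushes; the manifest format and names are invented for illustration (real systems also track epochs, watermarks, and log positions).

```python
# Why disaggregated checkpoints are near-constant time: the state files
# already live in the object store, so a checkpoint only writes a small
# manifest naming the files that form the consistent snapshot.

import json

s3 = {                                  # files already uploaded by flushes
    "sst/0001.sst": b"x" * 10_000_000,  # 10 MB of state, already durable
    "sst/0002.sst": b"y" * 10_000_000,
}

def checkpoint(epoch):
    manifest = {
        "epoch": epoch,
        "files": sorted(k for k in s3 if k.startswith("sst/")),
    }
    s3[f"manifest/{epoch}.json"] = json.dumps(manifest).encode()
    return len(s3[f"manifest/{epoch}.json"])  # bytes written at checkpoint time

written = checkpoint(epoch=42)
print(written)  # tens of bytes committed, despite 20 MB of live state
```

A traditional checkpoint would instead have to copy all 20 MB (or 20 TB) out of local disk at this moment, which is exactly the time-proportional-to-state-size cost the FAQ describes.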
