Disaggregated Storage in Stream Processing: Why S3 Changes Everything

Disaggregated storage separates compute from state in stream processing — storing all streaming state (aggregations, join buffers, window data) in object storage like S3 instead of local disks. This architectural pattern, adopted by RisingWave and Flink 2.0, fundamentally changes how streaming systems scale, recover from failures, and manage costs.

Traditional vs Disaggregated Architecture

Traditional (Coupled Compute + State)

Compute Node 1: [Processing] + [RocksDB State on Local Disk]
Compute Node 2: [Processing] + [RocksDB State on Local Disk]
Compute Node 3: [Processing] + [RocksDB State on Local Disk]

State lives on each node's local disk. If a node dies, its state is lost and must be restored from a remote checkpoint.

Disaggregated (Separated)

Compute Node 1: [Processing] + [In-Memory Cache]
Compute Node 2: [Processing] + [In-Memory Cache]
Compute Node 3: [Processing] + [In-Memory Cache]
              ↕ Read/Write ↕
         [S3 — All State Stored Here]

State lives in S3. Compute nodes are relatively stateless. If a node dies, a replacement node reads state from S3 immediately.
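The recovery difference can be sketched in a few lines. This is a toy model, not any real system's API: a plain dict stands in for S3, and the point is that the "object store" outlives any individual compute node.

```python
# Toy model of disaggregated recovery: the object store (a dict standing
# in for S3) outlives any compute node, so a replacement node needs no
# state download -- it just points at the shared store.

object_store = {}  # stands in for an S3 bucket


class ComputeNode:
    def __init__(self, store):
        self.store = store   # shared, durable state
        self.cache = {}      # node-local, lost on failure

    def write(self, key, value):
        self.cache[key] = value
        self.store[key] = value  # an async upload in a real system

    def read(self, key):
        if key in self.cache:        # hot path: local cache
            return self.cache[key]
        return self.store.get(key)   # cold path: fetch from the store


node1 = ComputeNode(object_store)
node1.write("count:user42", 7)
del node1  # the node dies; its local cache is gone

node2 = ComputeNode(object_store)  # replacement starts immediately
print(node2.read("count:user42"))  # state survived: 7
```

In the traditional architecture, the `del node1` step would have destroyed the only full copy of that node's state, forcing the replacement to download and replay a checkpoint before serving.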

Why Disaggregated Storage Matters

| Benefit | Traditional | Disaggregated |
| --- | --- | --- |
| Recovery time | Minutes to hours (download checkpoint) | Seconds (state already in S3) |
| Scaling | Redistribute state across nodes (slow) | Add compute, point at same S3 (fast) |
| Disk failure | State lost, must restore | No impact (S3 offers 11 nines durability) |
| Cost | Over-provision for peak state | S3 at $0.023/GB/month |
| Checkpoint time | Proportional to state size | Near-constant (~3-4 seconds) |
| State size limit | Limited by local disk | Effectively unlimited (S3 is elastic) |
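The cost row is easy to sanity-check. The S3 price is the table's own figure; the SSD price of $0.08/GB-month is an assumption for illustration (roughly provisioned cloud block storage list price), not a quoted number.

```python
# Back-of-the-envelope cost for 1 TB of streaming state. S3 price is the
# $0.023/GB-month figure from the table; the SSD figure ($0.08/GB-month)
# is an illustrative assumption for provisioned cloud block storage.
state_gb = 1024

s3_monthly = state_gb * 0.023    # ~$23.55/month
ssd_monthly = state_gb * 0.08    # ~$81.92/month, per replica

# Local-disk designs typically over-provision for peak state and often
# replicate it across nodes, so the real gap is wider than this ~3.5x.
print(f"S3:  ${s3_monthly:.2f}/month")
print(f"SSD: ${ssd_monthly:.2f}/month")
```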

RisingWave: Built on S3 from Day One

RisingWave's Hummock storage engine is an LSM-tree designed specifically for streaming on S3:

  • Mem-tables buffer writes in memory
  • SSTables (Sorted String Tables) are flushed to S3
  • Local cache serves reads, keeping S3 off the hot read path during normal queries
  • Compactor nodes merge SSTables in the background
  • 1-second checkpoints are feasible because state is already durable in S3
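The write path described above can be sketched as a miniature LSM tree. Everything here is illustrative (the names, the flush threshold, the linear-scan reads); it is not Hummock's actual API, just the shape of the memtable/SSTable/compactor flow.

```python
# Minimal LSM write path: writes buffer in a memtable, full memtables
# flush to immutable sorted runs in an "object store" (a list standing in
# for S3), and a compactor merges runs in the background.

MEMTABLE_LIMIT = 3   # illustrative flush threshold

memtable = {}
sstables = []        # newest run first; each run is a sorted (key, value) list


def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()


def flush():
    global memtable
    sstables.insert(0, sorted(memtable.items()))  # "upload SSTable to S3"
    memtable = {}


def get(key):
    if key in memtable:              # hot: still in memory
        return memtable[key]
    for run in sstables:             # newest run wins
        for k, v in run:
            if k == key:
                return v
    return None


def compact():
    """Merge all runs into one, keeping the newest value per key."""
    global sstables
    merged = {}
    for run in reversed(sstables):   # oldest first, newer values overwrite
        merged.update(dict(run))
    sstables = [sorted(merged.items())]


put("a", 1); put("b", 2); put("a", 3)   # third call overwrites "a" in place
put("c", 4)                             # hits the limit, triggers a flush
compact()
print(get("a"))  # → 3 (the newest value survives compaction)
```

Because every flushed SSTable is already durable in S3, a checkpoint does not need to copy this data anywhere, which is what makes the 1-second cadence feasible.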

Flink 2.0 introduced the ForSt ("For Streaming") state backend:

  • State stored in S3/HDFS instead of local RocksDB
  • 40x faster recovery compared to Flink 1.x
  • Checkpoint duration: 3-4 seconds regardless of state size (vs minutes in 1.x)
  • Backward compatible with existing Flink applications

Performance Impact

Common concern: "Doesn't reading from S3 add latency?"

In practice, no — because of caching:

  • Hot state is cached in memory on compute nodes
  • Warm state is cached on local SSD
  • Cold state is fetched from S3 (rare, only during recovery or cold start)

During normal processing, all reads hit local cache. S3 is only involved during writes (async) and recovery.
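The three-tier read path can be sketched as a cache-promotion lookup. Plain dicts stand in for each tier here; a real system would bound the caches and evict entries (e.g. with an LRU policy), and the key name is made up for the example.

```python
# Tiered read path: check memory, then local SSD, then S3, promoting the
# value back up on the way. Dicts stand in for each storage tier.

memory_cache = {}                 # hot tier
ssd_cache = {}                    # warm tier
s3 = {"win:09:00": 42}            # cold, durable tier

def read(key):
    if key in memory_cache:
        return memory_cache[key]  # no I/O at all
    if key in ssd_cache:
        value = ssd_cache[key]    # local disk read
    elif key in s3:
        value = s3[key]           # remote fetch: rare (recovery, cold start)
        ssd_cache[key] = value    # promote to the warm tier
    else:
        return None
    memory_cache[key] = value     # promote to the hot tier
    return value

value = read("win:09:00")         # first read falls through to S3
print(value)                      # every later read is pure memory
```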

When Disaggregated Storage Matters Most

  • Large state (>10 GB): Local disk becomes a bottleneck
  • Elastic workloads: Scale compute up/down without state migration
  • High availability: Seconds-level recovery vs minutes
  • Cost-sensitive: S3 storage is 10-100x cheaper than SSD
  • Multi-tenant: Share storage across workloads

Frequently Asked Questions

Does S3-based state add latency to stream processing?

Not during normal processing. Hot state is cached in compute node memory. S3 is only involved during async checkpoint writes and recovery. RisingWave achieves sub-100ms end-to-end processing latency despite storing all state on S3.

Is disaggregated storage the future of stream processing?

Yes. Both RisingWave (S3-native from day one) and Flink 2.0 (ForSt backend) have adopted disaggregated storage. The benefits — elastic scaling, fast recovery, lower cost, unlimited state — outweigh the added complexity of remote storage management.

How does disaggregated storage affect checkpointing?

Dramatically. In traditional systems, checkpoints must copy state from local disk to remote storage — time proportional to state size. In disaggregated systems, state is already in S3, so checkpoints just commit metadata — completing in 3-4 seconds regardless of state size.
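The metadata-only commit is worth seeing concretely. This sketch assumes the state files were already uploaded by earlier flushes; the manifest format and names are invented for illustration (real systems also track epochs, watermarks, and log positions).

```python
# Why disaggregated checkpoints are near-constant time: the state files
# already live in the object store, so a checkpoint only writes a small
# manifest naming the files that form the consistent snapshot.

import json

s3 = {                                  # files already uploaded by flushes
    "sst/0001.sst": b"x" * 10_000_000,  # 10 MB of state, already durable
    "sst/0002.sst": b"y" * 10_000_000,
}

def checkpoint(epoch):
    manifest = {
        "epoch": epoch,
        "files": sorted(k for k in s3 if k.startswith("sst/")),
    }
    s3[f"manifest/{epoch}.json"] = json.dumps(manifest).encode()
    return len(s3[f"manifest/{epoch}.json"])  # bytes written at checkpoint time

written = checkpoint(epoch=42)
print(written)  # tens of bytes committed, despite 20 MB of live state
```

A traditional checkpoint would instead have to copy all 20 MB (or 20 TB) out of local disk at this moment, which is exactly the time-proportional-to-state-size cost the FAQ describes.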
