Disaster Recovery for Streaming Pipelines

Disaster recovery (DR) for streaming pipelines keeps processing running through infrastructure failures, region outages, and data corruption. When state lives in S3, as it does in RisingWave, DR is fundamentally simpler: state survives any compute failure, and recovery takes seconds rather than the minutes to hours that local-state systems need.

DR Strategies

Strategy	RTO	RPO	Cost	Complexity
S3 state + auto-recovery	Seconds	1 second	Low	Low
Cross-region S3 replication	Minutes	Minutes	Medium	Medium
Active-standby cluster	Seconds	1 second	2x compute	Medium
Active-active multi-region	Zero	Near-zero	2x+ everything	High

RisingWave DR Advantage

With state on S3:

Node failure: A new node reads state from S3 in seconds
AZ failure: Redeploy in another AZ against the same S3 bucket
Region failure: S3 cross-region replication makes the state available in the backup region
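As a rough illustration of the region-failure case, the sketch below builds an S3 cross-region replication rule as a plain Python dictionary. The role ARN, bucket names, rule ID, and the helper name build_replication_config are all hypothetical; the dictionary is shaped for boto3's put_bucket_replication call, which is shown only in a comment so the block runs without AWS credentials. S3 replication requires versioning on both buckets and is asynchronous, so the backup region can lag the primary.

```python
def build_replication_config(role_arn: str, backup_bucket_arn: str, prefix: str = "") -> dict:
    """Build a replication configuration for the bucket that holds streaming state.

    All argument values are supplied by the caller; nothing here is specific
    to RisingWave. An empty prefix replicates every object in the bucket.
    """
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": "replicate-streaming-state",  # hypothetical rule name
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": prefix},
                # Assumption: delete markers are not replicated; revisit this
                # choice against your retention and cleanup requirements.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": backup_bucket_arn},
            }
        ],
    }


config = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/hypothetical-replication-role",
    backup_bucket_arn="arn:aws:s3:::hypothetical-state-backup",
)

# Applying it would look roughly like this (requires boto3, credentials,
# and versioning enabled on both the source and destination buckets):
#   import boto3
#   boto3.client("s3").put_bucket_replication(
#       Bucket="hypothetical-state-primary",
#       ReplicationConfiguration=config,
#   )
```

Keeping the configuration as data makes it easy to review and to check in tests before anything touches a real bucket.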

Compare to local-state systems (Flink 1.x, Kafka Streams):

Node failure: Download the checkpoint from remote storage (minutes)
AZ failure: Restore from the last remote checkpoint (minutes to hours)
Region failure: Requires separately maintained checkpoint storage

Runbook

Detection: Monitor barrier latency and throughput, and alert on anomalies
Assessment: Determine whether the failure is at node, AZ, or region scope
Recovery:
- Node: Kubernetes restarts the pod automatically, and the new pod reads state from S3 (seconds)
- AZ: Reschedule to a healthy AZ (minutes)
- Region: Deploy in the backup region using the state copied there by S3 cross-region replication
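The assessment and recovery steps can be sketched as a small triage function. This is a minimal illustration under stated assumptions, not a RisingWave or Kubernetes API: the NodeStatus record, the Scope enum, and the rule "an AZ is down when every node in it is unhealthy" are all hypothetical, and in practice the health data would come from your monitoring system.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    HEALTHY = "healthy"
    NODE = "node"
    AZ = "az"
    REGION = "region"


@dataclass(frozen=True)
class NodeStatus:
    name: str
    az: str
    healthy: bool


# Recovery actions taken from the runbook, keyed by failure scope.
RECOVERY = {
    Scope.HEALTHY: "No action",
    Scope.NODE: "Let Kubernetes restart the pod; it reads state from S3",
    Scope.AZ: "Reschedule to a healthy AZ",
    Scope.REGION: "Deploy in the backup region with replicated S3 state",
}


def assess(nodes: list[NodeStatus]) -> Scope:
    """Classify a failure as node, AZ, or region scope (simplified heuristic)."""
    unhealthy = [n for n in nodes if not n.healthy]
    if not unhealthy:
        return Scope.HEALTHY
    if len(unhealthy) == len(nodes):
        # Every node in the region is down.
        return Scope.REGION
    # Assumption: an AZ counts as failed when all of its nodes are unhealthy.
    # With a single node per AZ this cannot tell a node failure from an AZ failure.
    azs = {n.az for n in nodes}
    failed_azs = {
        az for az in azs if all(not n.healthy for n in nodes if n.az == az)
    }
    return Scope.AZ if failed_azs else Scope.NODE


cluster = [
    NodeStatus("compute-0", "az-a", healthy=False),
    NodeStatus("compute-1", "az-a", healthy=False),
    NodeStatus("compute-2", "az-b", healthy=True),
]
scope = assess(cluster)
print(scope.value, "->", RECOVERY[scope])
```

Encoding the decision as code keeps the runbook testable, so the classification rule can be exercised before an incident rather than during one.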

Frequently Asked Questions

What is RTO and RPO for streaming?

RTO (Recovery Time Objective) is the maximum acceptable time until processing resumes. RPO (Recovery Point Objective) is the maximum acceptable amount of recent data, measured in time, that can be lost. RisingWave achieves an RTO of seconds and an RPO of 1 second thanks to S3 state and 1-second checkpoints.
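To make the RPO arithmetic concrete, here is a simplified model: the worst-case loss window is roughly the checkpoint interval, plus any replication lag when recovering from a replicated copy of the state. The function names and the 300-second lag are hypothetical values chosen for illustration, not measurements.

```python
def worst_case_rpo_seconds(checkpoint_interval_s: float,
                           replication_lag_s: float = 0.0) -> float:
    """Estimate the worst-case loss window in seconds.

    Simplified model: work since the last durable checkpoint can be lost,
    and a replicated copy may additionally trail the primary by the
    replication lag.
    """
    return checkpoint_interval_s + replication_lag_s


def meets_objective(estimated_s: float, objective_s: float) -> bool:
    """Return True when the estimate is within the stated objective."""
    return estimated_s <= objective_s


# Same-region recovery with 1-second checkpoints.
same_region = worst_case_rpo_seconds(checkpoint_interval_s=1.0)

# Cross-region recovery with a hypothetical 300-second replication lag.
cross_region = worst_case_rpo_seconds(checkpoint_interval_s=1.0,
                                      replication_lag_s=300.0)

print(same_region, meets_objective(same_region, objective_s=1.0))
print(cross_region, meets_objective(cross_region, objective_s=60.0))
```

Under these assumptions the same-region case stays at one second, while the cross-region case lands in the range of minutes, which is why the two strategies carry different RPO values.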

Do I need multi-region DR for streaming?

It depends on your SLA. For most workloads, a single-region deployment across multiple AZs with S3 state provides sufficient DR. Multi-region is needed only when the SLA demands zero downtime during a full region outage.