Disaster Recovery for Streaming Pipelines
Disaster recovery (DR) for streaming pipelines keeps processing running through infrastructure failures, region outages, and data corruption. When state lives in S3, as it does in RisingWave, DR is fundamentally simpler: state survives any compute failure, and recovery takes seconds instead of hours.
DR Strategies
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| S3 state + auto-recovery | Seconds | 1 second | Low | Low |
| Cross-region S3 replication | Minutes | Minutes | Medium | Medium |
| Active-standby cluster | Seconds | 1 second | 2x compute | Medium |
| Active-active multi-region | Zero | Near-zero | 2x+ everything | High |
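As a rough illustration of how the table can be used, the sketch below encodes the four rows and filters them by the worst RTO and RPO a workload can tolerate. The `Strategy` class, the `TIME_RANK` ordering, and the `candidates` helper are hypothetical names introduced here; the ranking treats "Zero" and "Near-zero" as one class and "Seconds" and "1 second" as another, which is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto: str
    rpo: str
    cost: str
    complexity: str


# Rows copied from the DR Strategies table.
STRATEGIES = [
    Strategy("S3 state + auto-recovery", "Seconds", "1 second", "Low", "Low"),
    Strategy("Cross-region S3 replication", "Minutes", "Minutes", "Medium", "Medium"),
    Strategy("Active-standby cluster", "Seconds", "1 second", "2x compute", "Medium"),
    Strategy("Active-active multi-region", "Zero", "Near-zero", "2x+ everything", "High"),
]

# Assumed ordinal ranking of the qualitative values (lower is better).
TIME_RANK = {"Zero": 0, "Near-zero": 0, "Seconds": 1, "1 second": 1, "Minutes": 2}


def candidates(max_rto: str, max_rpo: str) -> list[str]:
    """Return the strategies whose RTO and RPO are no worse than the given limits."""
    return [
        s.name
        for s in STRATEGIES
        if TIME_RANK[s.rto] <= TIME_RANK[max_rto]
        and TIME_RANK[s.rpo] <= TIME_RANK[max_rpo]
    ]


if __name__ == "__main__":
    print(candidates("Seconds", "1 second"))
```

The filter only compares RTO and RPO. It does not model which failure scope (node, AZ, or region) each strategy covers, so the result is a shortlist to weigh against cost and complexity rather than a final answer.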
RisingWave DR Advantage
With state on S3:
- Node failure: New node reads state from S3 in seconds
- AZ failure: Redeploy in another AZ, same S3 bucket
- Region failure: S3 cross-region replication provides state in backup region
Compare this with local-state systems (Flink 1.x, Kafka Streams):
- Node failure: Download the checkpoint from remote storage (minutes)
- AZ failure: Restore from the last remote checkpoint (minutes to hours)
- Region failure: Requires separately maintained checkpoint storage
Runbook
- Detection: Monitor barrier latency and throughput (alert on anomalies)
- Assessment: Is it a node, AZ, or region failure?
- Recovery:
  - Node: Kubernetes auto-restarts the pod, which reads state from S3 (seconds)
  - AZ: Reschedule to a healthy AZ (minutes)
  - Region: Deploy in the backup region with S3 cross-region replicated state
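The detection and assessment steps of the runbook can be sketched in a few lines. This is a minimal illustration, not RisingWave tooling: `barrier_latency_alert`, `classify_failure`, the node dictionaries, and the 5x-median threshold are all assumptions chosen for the example, and the classification is a heuristic over a single region's node inventory.

```python
from statistics import median


def barrier_latency_alert(samples_ms, factor=5.0, min_samples=10):
    """Alert when the newest sample exceeds `factor` times the median of earlier ones.

    The threshold and minimum history length are illustrative defaults.
    """
    if len(samples_ms) < min_samples + 1:
        return False  # not enough history to form a baseline
    *history, latest = samples_ms
    return latest > factor * median(history)


def classify_failure(nodes):
    """Classify an outage as 'none', 'node', 'az', or 'region'.

    `nodes` is a list of dicts with 'az' and 'healthy' keys covering one region.
    Heuristic: every node down -> region; some AZ with no healthy node -> az;
    otherwise -> node. A single-AZ deployment cannot tell 'az' from 'region'.
    """
    if all(n["healthy"] for n in nodes):
        return "none"
    if not any(n["healthy"] for n in nodes):
        return "region"
    healthy_by_az = {}
    for n in nodes:
        healthy_by_az.setdefault(n["az"], []).append(n["healthy"])
    if any(not any(states) for states in healthy_by_az.values()):
        return "az"
    return "node"


# Recovery actions, taken from the runbook.
RECOVERY = {
    "node": "Kubernetes restarts the pod, which reads state from S3",
    "az": "Reschedule to a healthy AZ",
    "region": "Deploy in the backup region with cross-region replicated S3 state",
}

if __name__ == "__main__":
    nodes = [
        {"az": "a", "healthy": False},
        {"az": "a", "healthy": False},
        {"az": "b", "healthy": True},
    ]
    scope = classify_failure(nodes)
    print(scope, "->", RECOVERY[scope])
```

In practice the latency samples would come from whatever metrics system already scrapes the cluster, and the node inventory from the orchestrator. Both are left as plain Python values here so the logic stays self-contained.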
Frequently Asked Questions
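What are RTO and RPO for streaming?
RTO (Recovery Time Objective) is the maximum acceptable time until processing resumes after a failure. RPO (Recovery Point Objective) is the maximum amount of recent data or progress that can be lost, expressed as a time window. RisingWave achieves an RTO of seconds and an RPO of about 1 second, thanks to S3-backed state and 1-second checkpoints.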
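To make the two numbers concrete, here is a simplified back-of-the-envelope model. It assumes the worst-case RPO equals the checkpoint interval (ignoring the time a checkpoint takes to complete) and that RTO is the sum of detection, restart, and state-load time. The function names and the component timings in the usage lines are hypothetical, not measured values.

```python
def worst_case_rpo_s(checkpoint_interval_s: float) -> float:
    """Progress made since the last completed checkpoint is what may be lost.

    Simplification: ignores how long a checkpoint takes to complete.
    """
    return checkpoint_interval_s


def estimated_rto_s(detect_s: float, restart_s: float, state_load_s: float) -> float:
    """Time until processing resumes: detect the failure, restart, load state."""
    return detect_s + restart_s + state_load_s


if __name__ == "__main__":
    # 1-second checkpoints, as described above.
    print("worst-case RPO (s):", worst_case_rpo_s(1.0))
    # Hypothetical component timings for a node restart.
    print("estimated RTO (s):", estimated_rto_s(5.0, 10.0, 3.0))
```

The model shows why the checkpoint interval bounds RPO directly, while RTO depends mostly on how quickly the failure is detected and how quickly the replacement node can start reading state.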
Do I need multi-region DR for streaming?
It depends on your SLA. For most workloads, a single-region deployment with multi-AZ scheduling and S3 state provides sufficient DR. Multi-region is needed only when the SLA requires zero downtime during a full region outage.

