Fault Tolerance in Data Streaming: How Modern Platforms Handle Failures (2026)
Every streaming system will eventually face failures — node crashes, network partitions, out-of-memory kills, disk corruption. What separates production-grade streaming platforms is how they recover: how much data is lost (Recovery Point Objective, RPO) and how long recovery takes (Recovery Time Objective, RTO). In 2026, disaggregated state storage on S3 has emerged as the key architectural innovation, with RisingWave achieving seconds-level recovery and a one-second RPO, and Flink 2.0 delivering roughly 40x faster recovery through its new ForSt state backend.
This guide provides a deep technical comparison of fault tolerance mechanisms across Apache Flink, Kafka Streams, RisingWave, and Spark Structured Streaming.
How Streaming Fault Tolerance Works
Streaming systems must solve two fundamental problems during failures:
- State preservation — Aggregations, join state, window buffers, and counters must survive node crashes
- Processing guarantees — Ensuring exactly-once semantics so events are neither lost nor processed twice
The two dominant approaches are checkpointing (periodic snapshots of state) and changelog replication (continuous logging of state changes).
| Approach | How It Works | RPO | Recovery Time | Used By |
| --- | --- | --- | --- | --- |
| Checkpointing | Periodic consistent snapshots of all operator state | = Checkpoint interval | Snapshot restore + replay | Flink, RisingWave, Spark |
| Changelog | Every state mutation written to a durable log (Kafka topic) | Near-zero (every write logged) | Replay log from last offset | Kafka Streams |
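To make the two recovery models concrete, here is a toy sketch in Java. It is not any platform's real API: the same counter state is recovered once from a periodic snapshot (the checkpointing model) and once by replaying a log of every state change (the changelog model).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: recover the same counter state two ways after a crash,
// from a periodic snapshot (checkpointing) or from a log of every change (changelog).
public class RecoveryModels {
    public static void main(String[] args) {
        Map<String, Long> state = new HashMap<>();                   // in-memory, lost on crash
        Map<String, Long> lastSnapshot = new HashMap<>();            // stands in for checkpoint storage
        List<Map.Entry<String, Long>> changelog = new ArrayList<>(); // stands in for a durable log

        String[] events = {"a", "a", "b", "a"};
        for (int i = 0; i < events.length; i++) {
            long next = state.merge(events[i], 1L, Long::sum);
            changelog.add(Map.entry(events[i], next));               // changelog: record the key's new value
            if (i == 1) lastSnapshot = new HashMap<>(state);         // checkpoint: snapshot taken after event 1
        }

        // Crash: `state` is gone.
        // Recovery path 1: restore the snapshot, then replay events after it from
        // the source (the replay window equals the checkpoint interval).
        Map<String, Long> fromCheckpoint = new HashMap<>(lastSnapshot);
        for (int i = 2; i < events.length; i++) {
            fromCheckpoint.merge(events[i], 1L, Long::sum);
        }

        // Recovery path 2: rebuild entirely from the changelog (near-zero RPO,
        // but replay time grows with the amount of logged state).
        Map<String, Long> fromChangelog = new HashMap<>();
        for (Map.Entry<String, Long> entry : changelog) {
            fromChangelog.put(entry.getKey(), entry.getValue());
        }

        System.out.println(fromCheckpoint);  // {a=3, b=1}
        System.out.println(fromChangelog);   // {a=3, b=1}
    }
}
```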
Platform-by-Platform Comparison
Apache Flink: The Checkpoint Pioneer
Flink implements the Asynchronous Barrier Snapshots (ABS) algorithm, a variant of Chandy-Lamport distributed snapshots. The JobManager injects checkpoint barriers into data streams; when all operators receive the barrier, they snapshot their state to durable storage.
State Backend Options:
| Backend | Storage | Best For | Incremental Checkpoints |
| --- | --- | --- | --- |
| HashMapStateBackend | Java heap (in-memory) | Small state (<1 GB) | No |
| EmbeddedRocksDBStateBackend | Local disk (RocksDB) | Large state (multi-GB) | Yes |
| ForStStateBackend (Flink 2.0) | Remote S3/HDFS | Cloud-native, elastic | Yes |
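For reference, here is a minimal sketch of how checkpointing and a state backend are wired together in the Flink 1.x DataStream API; the bucket path and interval are placeholders, and class locations can differ slightly across Flink versions.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 30 seconds with exactly-once barrier semantics.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep completed checkpoints in durable remote storage (placeholder bucket).
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");

        // RocksDB backend with incremental checkpoints, suited to multi-GB state.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Sources, transformations, and sinks would be defined here, then:
        // env.execute("checkpointed-job");
    }
}
```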
Flink 1.x Recovery (Traditional):
- Checkpoint interval: Typically 30 seconds to several minutes in production
- Recovery time: Minutes for multi-gigabyte state (proportional to state size)
- Process: Stop all operators → download state from checkpoint storage → restart from checkpoint offset
- Large state (>100 GB): Recovery can take 30+ minutes
Flink 2.0 Recovery (Disaggregated State): The ForStStateBackend in Flink 2.0 is a game-changer:
- Checkpoint duration: 3-4 seconds regardless of state size (94% reduction vs 1.x)
- Recovery time: Under 10 seconds (40x faster than 1.x)
- State size independent: Same recovery time whether state is 1 GB or 1 TB
- Cost savings: ~50% due to independent compute/storage scaling
Exactly-Once Semantics: Flink achieves end-to-end exactly-once through a two-phase commit protocol aligned with checkpointing. Sources such as Kafka record their read offsets in the checkpoint, while transactional sinks (Kafka, databases) pre-commit output during the checkpoint and finalize the commit only after the checkpoint completes.
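As an example, end-to-end exactly-once with the Kafka connector is enabled by choosing the EXACTLY_ONCE delivery guarantee on the sink, which ties Kafka transactions to the checkpoint cycle. This is a sketch; the brokers, topic, and prefix below are placeholders.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceKafkaSink {
    // Transactional Kafka sink: output is pre-committed when a checkpoint starts
    // and committed only after the checkpoint completes (two-phase commit).
    static KafkaSink<String> build() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")
                .setRecordSerializer(
                        KafkaRecordSerializationSchema.builder()
                                .setTopic("enriched-events")
                                .setValueSerializationSchema(new SimpleStringSchema())
                                .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("my-pipeline") // must be unique per job
                .build();
    }
}
```

In practice the Kafka transaction timeout must exceed the checkpoint interval, and downstream consumers should read with isolation.level=read_committed so they only see committed output.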
Kafka Streams: Changelog-Based Recovery
Kafka Streams takes a fundamentally different approach: every state mutation is written to a compacted Kafka changelog topic. The changelog is the source of truth; local RocksDB state is a cache.
Architecture:
State Change → Write to local RocksDB + Write to Kafka changelog topic
Recovery → Read changelog topic from last committed offset → Rebuild RocksDB
Recovery Characteristics:
- Small state (<1 GB): Seconds to rebuild from changelog
- Medium state (1-10 GB): Minutes
- Large state (>100 GB): Hours for full changelog replay
- This is the critical weakness: recovery time scales linearly with state size
Standby Replicas (Mitigation):
Setting num.standby.replicas=1 creates shadow copies of state on other instances. When the active instance fails, the standby promotes almost instantly (seconds). However, this doubles resource consumption.
Exactly-Once:
Kafka Streams supports exactly-once via processing.guarantee=exactly_once_v2 (requires brokers on Kafka 2.5 or newer). This coordinates transactional producers and consumers within the Kafka ecosystem, so writes to output and changelog topics commit atomically with the consumed offsets.
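A minimal configuration sketch that combines exactly-once v2 with one standby replica; the application id, brokers, and topics are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        // Exactly-once v2: output and changelog writes commit atomically with consumed offsets.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        // One warm standby per state store: failover in seconds, at roughly 2x the state footprint.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

        // Trivial pass-through topology; real applications add aggregations whose
        // state stores are backed by compacted changelog topics by default.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-events").to("output-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```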
RisingWave: Disaggregated-First Architecture
RisingWave was built from day one with disaggregated state on S3. Its Hummock storage engine is an LSM-tree optimized for streaming workloads, with all persistent state stored in object storage.
Architecture:
Streaming Operators → In-memory state (mem-tables)
→ Flush to SSTables (Sorted String Tables)
→ Upload to S3
→ Global epoch-based consistency
Checkpoint Mechanism:
- Checkpoint interval: 1 second (default, configurable)
- The Meta node injects barriers into the streaming dataflow
- When all operators complete the barrier, state is committed to S3
- Asynchronous checkpointing: "almost invisible to foreground tasks"
Recovery Characteristics:
- Recovery time: Seconds (not proportional to state size)
- RPO: ~1 second (at most 1 second of data loss)
- No local state to restore — compute nodes are relatively stateless
- New compute nodes fetch state from S3 on demand
- Elastic scaling without service interruption
Why Disaggregated State Enables Faster Recovery:
| Aspect | Local State (RocksDB) | Disaggregated (S3) |
| --- | --- | --- |
| Recovery process | Download full checkpoint from remote → rebuild local state | Point compute at existing S3 state |
| Recovery time | Proportional to state size | Near-constant (seconds) |
| Disk failure impact | State lost, must restore from checkpoint | No impact (state in S3) |
| Durability | Depends on local disk + checkpoint frequency | 11 nines (S3 durability) |
| Scaling | Must redistribute state across nodes | Compute scales independently |
Exactly-Once: RisingWave provides exactly-once semantics through barrier-based checkpointing with epoch-based consistency. All reads within a checkpoint boundary return consistent results.
Spark Structured Streaming: Micro-Batch Recovery
Spark Structured Streaming uses a micro-batch model with Write-Ahead Logs (WAL) stored in HDFS or S3.
Recovery Mechanism:
- Checkpoint directory stores: stream offsets, committed batch IDs, state snapshots
- On failure: restart driver → launch new executors → replay from last committed batch offset
- The _spark_metadata directory tracks completed batches, preventing duplicate processing
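A minimal sketch of a checkpointed query in the Java API; the brokers, topic, and bucket paths are placeholders, and the Kafka source assumes the spark-sql-kafka connector is on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class CheckpointedQuery {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("fault-tolerant-stream")
                .getOrCreate();

        // Kafka source (placeholder brokers and topic).
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load();

        // The checkpointLocation directory stores offsets, the commit log, and state
        // snapshots; after a driver restart, the query resumes from the last
        // committed micro-batch instead of reprocessing everything.
        StreamingQuery query = events.writeStream()
                .format("parquet")
                .option("path", "s3://my-bucket/output/events")
                .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
                .trigger(Trigger.ProcessingTime("5 seconds"))
                .start();

        query.awaitTermination();
    }
}
```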
Recovery Characteristics:
- Recovery time: Seconds to minutes (cluster restart + state restore)
- RPO: One micro-batch duration (typically 1-10 seconds)
- Limited to micro-batch latency (not true real-time)
Real-World Failure Scenarios
Node Crash During Processing
| System | What Happens | Recovery Time |
| --- | --- | --- |
| Flink 1.x | Job stops, operators reset to latest checkpoint, replay from checkpoint offset | Minutes |
| Flink 2.0 | Same process, but ForSt backend recovers in under 10 seconds | Seconds |
| Kafka Streams | Instance restarts, replays changelog topic | Seconds (small state) to hours (large state) |
| RisingWave | Operators restart, fetch state from S3 checkpoint | Seconds |
| Spark SS | Driver restarts, replays from last committed batch | Seconds to minutes |
Disk Failure (Local State Loss)
This scenario is where architecture matters most:
- Flink (RocksDB): Local state destroyed. Must download full checkpoint from remote storage. Recovery time proportional to state size.
- Kafka Streams: Local RocksDB cache destroyed. Must replay entire changelog topic. Can take hours for large state.
- RisingWave: No impact. State lives in S3. Compute node restarts and fetches state on demand. Recovery in seconds.
- Flink 2.0 (ForSt): Minimal impact. State already in S3. Recovery under 10 seconds.
Out-of-Memory Kill
OOM kills (exit code 137) are instantaneous — no graceful shutdown:
- All in-flight data and uncommitted state is lost
- Recovery point is the last successful checkpoint
- Data loss = events processed since last checkpoint
- Flink: Up to 30-120 seconds of data (typical checkpoint interval)
- RisingWave: Up to 1 second of data (1-second checkpoint interval)
- Kafka Streams: Near-zero (changelog captures every write)
Network Partition
When the cluster splits into isolated groups:
- Flink/RisingWave: Meta node (JobManager/Meta service) coordinates using quorum consensus. Minority partition fences itself. Majority partition continues processing.
- Kafka Streams: Depends on Kafka broker availability. If consumer can't reach coordinator, rebalance triggers after timeout.
The Checkpoint Interval Trade-Off
Checkpoint interval directly determines your Recovery Point Objective (maximum data loss during failure):
| System | Default Interval | RPO | Checkpoint Overhead |
| --- | --- | --- | --- |
| RisingWave | 1 second | ~1 second | Low (async, S3-native) |
| Flink | 30 sec - several minutes | = interval | Medium-High (state size dependent) |
| Flink 2.0 (ForSt) | Configurable (shorter possible) | = interval | Low (3-4 sec constant) |
| Spark SS | 1 micro-batch | = batch duration | Medium |
| Kafka Streams | Continuous | Near-zero | High (every write to changelog) |
The trade-off: more frequent checkpoints reduce data loss but increase processing overhead. RisingWave's S3-native architecture makes 1-second checkpoints feasible without significant overhead, while Flink 1.x's local-state model makes sub-minute checkpoints expensive for large state.
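A quick worked example makes the trade-off concrete (the ingest rate below is a hypothetical figure for illustration, not a benchmark):

$$ \text{worst-case replay} \approx \text{ingest rate} \times \text{checkpoint interval} $$

At 10,000 events per second, a 60-second interval means up to 600,000 events must be re-read from the source after a crash, while a 1-second interval caps the replay at roughly 10,000 events; the cost of the shorter interval is more frequent snapshot work, which stays cheap only when checkpoints are incremental and asynchronous.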
How to Choose Based on Fault Tolerance Requirements
Choose RisingWave if:
- You need the lowest RPO (1-second data loss window)
- Fast, consistent recovery time regardless of state size matters
- You want to eliminate local disk as a failure point
- SQL-based pipelines are sufficient for your workload
Choose Flink 2.0 (ForSt) if:
- You need disaggregated state with the broadest feature set
- You require complex event processing (MATCH_RECOGNIZE)
- You have an existing Flink platform and want to upgrade
Choose Kafka Streams with standby replicas if:
- Near-zero RPO matters most (changelog captures every write)
- You accept higher resource cost for instant standby failover
- Your workload is Kafka-native
Choose Spark Structured Streaming if:
- Seconds-level RPO is acceptable
- You're already in the Spark ecosystem
- Batch + streaming unification matters more than low latency
Frequently Asked Questions
What is the difference between RPO and RTO in streaming?
Recovery Point Objective (RPO) is the maximum amount of data you can afford to lose during a failure — determined by the checkpoint interval. Recovery Time Objective (RTO) is how long it takes to resume processing after a failure. RisingWave achieves ~1 second RPO and seconds-level RTO. Flink 2.0 with ForSt achieves configurable RPO and under 10 seconds RTO.
Which streaming platform has the best fault tolerance?
It depends on your priority. For lowest data loss (RPO), Kafka Streams' continuous changelog approach captures every state mutation. For fastest recovery (RTO), RisingWave and Flink 2.0 with disaggregated state recover in seconds regardless of state size. For the broadest feature set with good fault tolerance, Flink 2.0 leads.
How does disaggregated state improve fault tolerance?
Disaggregated state stores all streaming state in object storage (S3) instead of local disks. This eliminates disk failure as a risk, makes recovery time independent of state size (no need to download and rebuild local state), and enables elastic scaling. Both RisingWave and Flink 2.0 (ForSt backend) use this approach.
Does exactly-once processing guarantee no data loss?
Exactly-once semantics guarantee that each event is processed exactly once — no duplicates and no skipped events. However, during a failure, events processed since the last checkpoint may need to be reprocessed (they are replayed from the source). The checkpoint interval determines how much reprocessing occurs. Exactly-once ensures the final result is correct, not that processing is never interrupted.
What happens to state when a streaming node runs out of memory?
An out-of-memory (OOM) kill terminates the process immediately (no graceful shutdown). All in-memory state since the last checkpoint is lost. The system recovers from the last checkpoint, replaying events from the source. With RisingWave's 1-second checkpoint interval, at most 1 second of data needs to be reprocessed. With Flink's typical intervals, this window can be 30 seconds to several minutes.

