Fault Tolerance in Data Streaming: How Modern Platforms Handle Failures (2026)

Every streaming system will eventually face failures — node crashes, network partitions, out-of-memory kills, disk corruption. What separates production-grade streaming platforms is how they recover: how much data is lost (Recovery Point Objective) and how long recovery takes (Recovery Time Objective). In 2026, disaggregated state storage on S3 has emerged as the key architectural innovation, with RisingWave achieving seconds-level recovery with 1-second RPO and Flink 2.0 delivering 40x faster recovery through its new ForSt state backend.

This guide provides a deep technical comparison of fault tolerance mechanisms across Apache Flink, Kafka Streams, RisingWave, and Spark Structured Streaming.

How Streaming Fault Tolerance Works

Streaming systems must solve two fundamental problems during failures:

  1. State preservation — Aggregations, join state, window buffers, and counters must survive node crashes
  2. Processing guarantees — Ensuring exactly-once semantics so events are neither lost nor processed twice

The two dominant approaches are checkpointing (periodic snapshots of state) and changelog replication (continuous logging of state changes).

| Approach | How It Works | RPO | Recovery Time | Used By |
| --- | --- | --- | --- | --- |
| Checkpointing | Periodic consistent snapshots of all operator state | = Checkpoint interval | Snapshot restore + replay | Flink, RisingWave, Spark |
| Changelog | Every state mutation written to a durable log (Kafka topic) | Near-zero (every write logged) | Replay log from last offset | Kafka Streams |

Platform-by-Platform Comparison

Apache Flink: Checkpoint-Based Recovery

Flink implements the Asynchronous Barrier Snapshots (ABS) algorithm, a variant of Chandy-Lamport distributed snapshots. The JobManager injects checkpoint barriers into the data streams; when an operator has received the barrier on all of its inputs, it snapshots its state to durable storage.
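
The barrier-alignment step can be sketched in miniature. The following is an illustrative single-operator model in Python; the names (`Operator`, `on_event`, `on_barrier`) are invented for this sketch and are not Flink's API:

```python
# Simplified sketch of barrier alignment for ONE operator with two input
# channels: events from a channel that has already delivered the barrier
# are buffered until every channel has, then the state is snapshotted and
# the buffered events are released. Illustrative only, not Flink code.

class Operator:
    def __init__(self, num_inputs):
        self.state = {}              # running word counts
        self.barriers_seen = set()   # channels that delivered the barrier
        self.buffered = []           # events held during alignment
        self.num_inputs = num_inputs
        self.snapshots = []          # stand-in for durable storage

    def on_event(self, channel, word):
        if channel in self.barriers_seen:
            # Channel is past the barrier: hold its events until alignment.
            self.buffered.append(word)
        else:
            self.state[word] = self.state.get(word, 0) + 1

    def on_barrier(self, channel, checkpoint_id):
        self.barriers_seen.add(channel)
        if len(self.barriers_seen) == self.num_inputs:
            # All inputs aligned: snapshot, then release buffered events.
            self.snapshots.append((checkpoint_id, dict(self.state)))
            self.barriers_seen.clear()
            for word in self.buffered:
                self.state[word] = self.state.get(word, 0) + 1
            self.buffered.clear()

op = Operator(num_inputs=2)
op.on_event(0, "a"); op.on_event(1, "a")
op.on_barrier(0, checkpoint_id=1)
op.on_event(0, "b")                # past the barrier on channel 0: buffered
op.on_event(1, "a")                # channel 1 not yet aligned: applied
op.on_barrier(1, checkpoint_id=1)
print(op.snapshots[0])             # (1, {'a': 3}) -- excludes buffered "b"
print(op.state)                    # {'a': 3, 'b': 1}
```

The key property the sketch shows: the snapshot contains exactly the effect of every event before the barrier on every channel, and nothing after it, which is what makes the global snapshot consistent.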

State Backend Options:

| Backend | Storage | Best For | Incremental Checkpoints |
| --- | --- | --- | --- |
| HashMapStateBackend | Java heap (in-memory) | Small state (<1 GB) | No |
| EmbeddedRocksDBStateBackend | Local disk (RocksDB) | Large state (multi-GB) | Yes |
| ForStStateBackend (Flink 2.0) | Remote S3/HDFS | Cloud-native, elastic | Yes |

Flink 1.x Recovery (Traditional):

  • Checkpoint interval: Typically 30 seconds to several minutes in production
  • Recovery time: Minutes for multi-gigabyte state (proportional to state size)
  • Process: Stop all operators → download state from checkpoint storage → restart from checkpoint offset
  • Large state (>100 GB): Recovery can take 30+ minutes

Flink 2.0 Recovery (Disaggregated State): The ForStStateBackend in Flink 2.0 fundamentally changes the recovery profile:

  • Checkpoint duration: 3-4 seconds regardless of state size (94% reduction vs 1.x)
  • Recovery time: Under 10 seconds (40x faster than 1.x)
  • State size independent: Same recovery time whether state is 1 GB or 1 TB
  • Cost savings: ~50% due to independent compute/storage scaling

Exactly-Once Semantics: Flink achieves end-to-end exactly-once through a two-phase commit protocol. Sources (Kafka) and sinks (databases, Kafka) participate in a distributed transaction coordinated by the checkpoint.
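
The sink side of that protocol can be sketched as follows. This is a deliberately simplified, hypothetical model with invented names, not Flink's actual sink API: writes accumulate in an open transaction, the checkpoint barrier triggers pre-commit, and the checkpoint-complete notification triggers the final commit.

```python
# Minimal two-phase-commit sink sketch (hypothetical API, not Flink's).
# Phase 1 (pre-commit) runs on the checkpoint barrier; phase 2 (commit)
# runs when the coordinator confirms the checkpoint completed.

class TwoPhaseCommitSink:
    def __init__(self):
        self.open_txn = []     # writes in the current transaction
        self.pending = {}      # checkpoint_id -> pre-committed transaction
        self.committed = []    # durably visible output

    def write(self, record):
        self.open_txn.append(record)

    def pre_commit(self, checkpoint_id):
        # Phase 1: flush and park the transaction with this checkpoint.
        self.pending[checkpoint_id] = self.open_txn
        self.open_txn = []

    def commit(self, checkpoint_id):
        # Phase 2: make the transaction visible atomically.
        self.committed.extend(self.pending.pop(checkpoint_id))

    def abort(self, checkpoint_id):
        # On failure before commit, the pre-committed transaction rolls
        # back; its records are replayed from the source after recovery.
        self.pending.pop(checkpoint_id, None)

sink = TwoPhaseCommitSink()
sink.write("e1"); sink.write("e2")
sink.pre_commit(checkpoint_id=7)   # barrier arrives at the sink
sink.write("e3")                   # belongs to the next checkpoint
sink.commit(checkpoint_id=7)       # coordinator confirms checkpoint 7
print(sink.committed)              # ['e1', 'e2']
```

Because output only becomes visible after the checkpoint that produced it is durable, a crash between pre-commit and commit never leaks partial results: the transaction is either committed on recovery or aborted and replayed.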

Kafka Streams: Changelog-Based Recovery

Kafka Streams takes a fundamentally different approach: every state mutation is written to a compacted Kafka changelog topic. The changelog is the source of truth; local RocksDB state is a cache.

Architecture:

State Change → Write to local RocksDB + Write to Kafka changelog topic
Recovery    → Read changelog topic from last committed offset → Rebuild RocksDB
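
Conceptually, the rebuild step is a replay of key-value updates in order: later writes overwrite earlier ones, and tombstones delete keys. A minimal Python sketch of that idea (illustrative only, not Kafka Streams code):

```python
# Changelog-based recovery in miniature: the changelog is an ordered log
# of (key, value) updates; recovery rebuilds the store by replaying it.
# A value of None models a Kafka tombstone (key deleted).

changelog = [
    ("user:1", 5),
    ("user:2", 3),
    ("user:1", 8),      # later write wins
    ("user:3", 1),
    ("user:3", None),   # tombstone: key removed
]

def rebuild_store(log):
    store = {}
    for key, value in log:
        if value is None:
            store.pop(key, None)   # tombstone removes the key
        else:
            store[key] = value
    return store

print(rebuild_store(changelog))    # {'user:1': 8, 'user:2': 3}
```

The linear scan over the log is exactly why recovery time scales with state size: a store with billions of keys means billions of records to replay, even though compaction bounds the log to roughly one record per live key.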

Recovery Characteristics:

  • Small state (<1 GB): Seconds to rebuild from changelog
  • Medium state (1-10 GB): Minutes
  • Large state (>100 GB): Hours for full changelog replay
  • This is the critical weakness: recovery time scales linearly with state size

Standby Replicas (Mitigation): Setting num.standby.replicas=1 creates shadow copies of state on other instances. When the active instance fails, the standby promotes almost instantly (seconds). However, this doubles resource consumption.

Exactly-Once: Kafka Streams supports exactly-once via processing.guarantee=exactly_once_v2 (Kafka 2.5+). This coordinates transactional producers and consumers within the Kafka ecosystem.

RisingWave: Disaggregated-First Architecture

RisingWave was built from day one with disaggregated state on S3. Its Hummock storage engine is an LSM-tree optimized for streaming workloads, with all persistent state stored in object storage.

Architecture:

Streaming Operators → In-memory state (mem-tables)
                    → Flush to SSTables (Sorted String Tables)
                    → Upload to S3
                    → Global epoch-based consistency

Checkpoint Mechanism:

  • Checkpoint interval: 1 second (default, configurable)
  • The Meta node injects barriers into the streaming dataflow
  • When all operators complete the barrier, state is committed to S3
  • Asynchronous checkpointing: "almost invisible to foreground tasks"
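
A toy model of epoch-based commit, with a dict standing in for object storage (illustrative only; the names are invented and this is not RisingWave's Hummock API):

```python
# Epoch-based commit in miniature: writes are tagged with the open epoch;
# a barrier closes the epoch and commits its batch of writes to "S3" as
# one atomic unit. Recovery reads committed epochs only, so a new compute
# node needs no local state restore.

class EpochStore:
    def __init__(self):
        self.epoch = 0
        self.uncommitted = []   # writes in the open epoch (lost on crash)
        self.s3 = {}            # "object storage": epoch -> committed batch

    def write(self, key, value):
        self.uncommitted.append((key, value))

    def barrier(self):
        # Close the epoch; commit its writes atomically.
        self.s3[self.epoch] = list(self.uncommitted)
        self.uncommitted.clear()
        self.epoch += 1

    def recover(self):
        # Replay committed epochs in order; no local disk involved.
        state = {}
        for e in sorted(self.s3):
            for key, value in self.s3[e]:
                state[key] = value
        return state

store = EpochStore()
store.write("clicks", 10)
store.barrier()                 # epoch 0 committed
store.write("clicks", 25)       # lost if the node dies before next barrier
print(store.recover())          # {'clicks': 10}
```

The sketch makes the RPO story concrete: after a crash, only the writes of the open (uncommitted) epoch are lost, which with a 1-second barrier interval is at most about one second of input.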

Recovery Characteristics:

  • Recovery time: Seconds (not proportional to state size)
  • RPO: ~1 second (at most 1 second of data loss)
  • No local state to restore — compute nodes are relatively stateless
  • New compute nodes fetch state from S3 on demand
  • Elastic scaling without service interruption

Why Disaggregated State Enables Faster Recovery:

| Aspect | Local State (RocksDB) | Disaggregated (S3) |
| --- | --- | --- |
| Recovery process | Download full checkpoint from remote → rebuild local state | Point compute at existing S3 state |
| Recovery time | Proportional to state size | Near-constant (seconds) |
| Disk failure impact | State lost, must restore from checkpoint | No impact (state in S3) |
| Durability | Depends on local disk + checkpoint frequency | 11 nines (S3 durability) |
| Scaling | Must redistribute state across nodes | Compute scales independently |

Exactly-Once: RisingWave provides exactly-once semantics through barrier-based checkpointing with epoch-based consistency. All reads within a checkpoint boundary return consistent results.

Spark Structured Streaming: Micro-Batch Recovery

Spark Structured Streaming uses a micro-batch model with Write-Ahead Logs (WAL) stored in HDFS or S3.

Recovery Mechanism:

  • Checkpoint directory stores: stream offsets, committed batch IDs, state snapshots
  • On failure: restart driver → launch new executors → replay from last committed batch offset
  • _spark_metadata directory tracks completed batches, preventing duplicate processing

Recovery Characteristics:

  • Recovery time: Seconds to minutes (cluster restart + state restore)
  • RPO: One micro-batch duration (typically 1-10 seconds)
  • Limited to micro-batch latency (not true real-time)
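
The offset-log/commit-log handshake behind that recovery can be sketched as follows (conceptual model with invented names, not Spark's internal API): offsets are logged before a batch runs, completion is logged after, and any batch planned but not committed is re-executed over the exact same offset range.

```python
# Micro-batch recovery in miniature: a write-ahead offset log records each
# batch's source range BEFORE processing; a commit log records completion
# AFTER. On restart, planned-but-uncommitted batches replay deterministically.

offset_log = {}     # batch_id -> (start_offset, end_offset)
commit_log = set()  # batch_ids fully processed

def run_batch(batch_id, start, end, process):
    offset_log[batch_id] = (start, end)   # WAL entry first
    process(start, end)
    commit_log.add(batch_id)              # then mark durable completion

def recover():
    # Batches present in the offset log but absent from the commit log
    # must be re-run with the same offset range.
    return [b for b in offset_log if b not in commit_log]

run_batch(0, 0, 100, lambda s, e: None)   # batch 0 completes normally
offset_log[1] = (100, 200)                # crash after planning batch 1
print(recover())                          # [1]
```

Re-running the identical offset range is what prevents duplicates: the batch is deterministic over its range, so replaying it produces the same output the failed attempt would have.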

Real-World Failure Scenarios

Node Crash During Processing

| System | What Happens | Recovery Time |
| --- | --- | --- |
| Flink 1.x | Job stops, operators reset to latest checkpoint, replay from checkpoint offset | Minutes |
| Flink 2.0 | Same process, but ForSt backend recovers in under 10 seconds | Seconds |
| Kafka Streams | Instance restarts, replays changelog topic | Seconds (small state) to hours (large state) |
| RisingWave | Operators restart, fetch state from S3 checkpoint | Seconds |
| Spark SS | Driver restarts, replays from last committed batch | Seconds to minutes |

Disk Failure (Local State Loss)

This scenario is where architecture matters most:

  • Flink (RocksDB): Local state destroyed. Must download full checkpoint from remote storage. Recovery time proportional to state size.
  • Kafka Streams: Local RocksDB cache destroyed. Must replay entire changelog topic. Can take hours for large state.
  • RisingWave: No impact. State lives in S3. Compute node restarts and fetches state on demand. Recovery in seconds.
  • Flink 2.0 (ForSt): Minimal impact. State already in S3. Recovery under 10 seconds.

Out-of-Memory Kill

OOM kills (exit code 137) are instantaneous — no graceful shutdown:

  • All in-flight data and uncommitted state are lost
  • Recovery point is the last successful checkpoint
  • Data loss = events processed since last checkpoint
    • Flink: Up to 30-120 seconds of data (typical checkpoint interval)
    • RisingWave: Up to 1 second of data (1-second checkpoint interval)
    • Kafka Streams: Near-zero (changelog captures every write)
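
The replay window is simple arithmetic: the worst-case number of events to reprocess is the event rate times the checkpoint interval. With illustrative (not benchmarked) numbers:

```python
# Worst-case replay window after an OOM kill: everything that arrived
# since the last completed checkpoint must be reprocessed from the source.
# The rate and intervals below are illustrative examples only.

def events_to_replay(event_rate_per_sec, checkpoint_interval_sec):
    return event_rate_per_sec * checkpoint_interval_sec

rate = 50_000  # events/sec, example workload
print(events_to_replay(rate, 60))   # 60 s checkpoint interval -> 3,000,000
print(events_to_replay(rate, 1))    # 1 s checkpoint interval  -> 50,000
```

At the same ingest rate, shrinking the checkpoint interval from one minute to one second shrinks the worst-case reprocessing window by the same factor of 60.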

Network Partition

When the cluster splits into isolated groups:

  • Flink/RisingWave: Meta node (JobManager/Meta service) coordinates using quorum consensus. Minority partition fences itself. Majority partition continues processing.
  • Kafka Streams: Depends on Kafka broker availability. If consumer can't reach coordinator, rebalance triggers after timeout.

The Checkpoint Interval Trade-Off

Checkpoint interval directly determines your Recovery Point Objective (maximum data loss during failure):

| System | Default Interval | RPO | Checkpoint Overhead |
| --- | --- | --- | --- |
| RisingWave | 1 second | ~1 second | Low (async, S3-native) |
| Flink | 30 sec - several minutes | = interval | Medium-High (state size dependent) |
| Flink 2.0 (ForSt) | Configurable (shorter possible) | = interval | Low (3-4 sec constant) |
| Spark SS | 1 micro-batch | = batch duration | Medium |
| Kafka Streams | Continuous | Near-zero | High (every write to changelog) |

The trade-off: more frequent checkpoints reduce data loss but increase processing overhead. RisingWave's S3-native architecture makes 1-second checkpoints feasible without significant overhead, while Flink 1.x's local-state model makes sub-minute checkpoints expensive for large state.
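
A back-of-the-envelope model of that overhead, using illustrative figures rather than benchmarks: if each checkpoint imposes roughly `c` seconds of foreground cost and runs every `i` seconds, the overhead fraction is about `c / i`.

```python
# Rough overhead model: checkpoint cost divided by checkpoint interval.
# All figures below are illustrative assumptions, not measurements.

def overhead(cost_sec, interval_sec):
    return cost_sec / interval_sec

# Large local state: snapshot cost grows with state size, so short
# intervals become expensive.
print(f"{overhead(6.0, 30):.0%}")   # 6 s of cost every 30 s  -> 20%
# Near-constant-cost checkpoints (state already in S3) keep even
# 1-second intervals cheap.
print(f"{overhead(0.05, 1):.0%}")   # 50 ms of cost every 1 s -> 5%
```

The model is crude (it ignores asynchronous overlap and network effects), but it captures why constant-cost checkpointing is what makes aggressive, sub-second intervals practical.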

How to Choose Based on Fault Tolerance Requirements

Choose RisingWave if:

  • You need the lowest RPO (1-second data loss window)
  • Fast, consistent recovery time regardless of state size matters
  • You want to eliminate local disk as a failure point
  • SQL-based pipelines are sufficient for your workload

Choose Flink 2.0 (ForSt) if:

  • You need disaggregated state with the broadest feature set
  • You require complex event processing (MATCH_RECOGNIZE)
  • You have an existing Flink platform and want to upgrade

Choose Kafka Streams with standby replicas if:

  • Near-zero RPO matters most (changelog captures every write)
  • You accept higher resource cost for instant standby failover
  • Your workload is Kafka-native

Choose Spark Structured Streaming if:

  • Seconds-level RPO is acceptable
  • You're already in the Spark ecosystem
  • Batch + streaming unification matters more than low latency

Frequently Asked Questions

What is the difference between RPO and RTO in streaming?

Recovery Point Objective (RPO) is the maximum amount of data you can afford to lose during a failure — determined by the checkpoint interval. Recovery Time Objective (RTO) is how long it takes to resume processing after a failure. RisingWave achieves ~1 second RPO and seconds-level RTO. Flink 2.0 with ForSt achieves configurable RPO and under 10 seconds RTO.

Which streaming platform has the best fault tolerance?

It depends on your priority. For lowest data loss (RPO), Kafka Streams' continuous changelog approach captures every state mutation. For fastest recovery (RTO), RisingWave and Flink 2.0 with disaggregated state recover in seconds regardless of state size. For the broadest feature set with good fault tolerance, Flink 2.0 leads.

How does disaggregated state improve fault tolerance?

Disaggregated state stores all streaming state in object storage (S3) instead of local disks. This eliminates disk failure as a risk, makes recovery time independent of state size (no need to download and rebuild local state), and enables elastic scaling. Both RisingWave and Flink 2.0 (ForSt backend) use this approach.

Does exactly-once processing guarantee no data loss?

Exactly-once semantics guarantee that each event affects the final result exactly once — no duplicates and no skipped events. However, during a failure, events processed since the last checkpoint may need to be reprocessed (they are replayed from the source). The checkpoint interval determines how much reprocessing occurs. Exactly-once ensures the final result is correct, not that processing is never interrupted.

What happens to state when a streaming node runs out of memory?

An out-of-memory (OOM) kill terminates the process immediately (no graceful shutdown). All in-memory state since the last checkpoint is lost. The system recovers from the last checkpoint, replaying events from the source. With RisingWave's 1-second checkpoint interval, at most 1 second of data needs to be reprocessed. With Flink's typical intervals, this window can be 30 seconds to several minutes.
