Fault Tolerance in Data Streaming: How Modern Platforms Handle Failures (2026)

Every streaming system will eventually face failures — node crashes, network partitions, out-of-memory kills, disk corruption. What separates production-grade streaming platforms is how they recover: how much data is lost (Recovery Point Objective) and how long recovery takes (Recovery Time Objective). In 2026, disaggregated state storage on S3 has emerged as the key architectural innovation, with RisingWave achieving seconds-level recovery with 1-second RPO and Flink 2.0 delivering 40x faster recovery through its new ForSt state backend.

This guide provides a deep technical comparison of fault tolerance mechanisms across Apache Flink, Kafka Streams, RisingWave, and Spark Structured Streaming.

How Streaming Fault Tolerance Works

Streaming systems must solve two fundamental problems during failures:

  1. State preservation — Aggregations, join state, window buffers, and counters must survive node crashes
  2. Processing guarantees — Ensuring exactly-once semantics so events are neither lost nor processed twice

The two dominant approaches are checkpointing (periodic snapshots of state) and changelog replication (continuous logging of state changes).

| Approach | How It Works | RPO | Recovery Time | Used By |
| --- | --- | --- | --- | --- |
| Checkpointing | Periodic consistent snapshots of all operator state | = Checkpoint interval | Snapshot restore + replay | Flink, RisingWave, Spark |
| Changelog | Every state mutation written to a durable log (Kafka topic) | Near-zero (every write logged) | Replay log from last offset | Kafka Streams |

Platform-by-Platform Comparison

Apache Flink: Checkpoint-Based Recovery

Flink implements the Asynchronous Barrier Snapshots (ABS) algorithm, a variant of Chandy-Lamport distributed snapshots. The JobManager injects checkpoint barriers into the data streams; when an operator has received the barrier on all of its inputs, it snapshots its state to durable storage.
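
The barrier-alignment step can be sketched in miniature. The following is an illustrative single-operator model in Python; the names (`Operator`, `on_event`, `on_barrier`) are invented for this sketch and are not Flink's API:

```python
# Simplified sketch of barrier alignment for ONE operator with two input
# channels: events from a channel that has already delivered the barrier
# are buffered until every channel has, then the state is snapshotted and
# the buffered events are released. Illustrative only, not Flink code.

class Operator:
    def __init__(self, num_inputs):
        self.state = {}              # running word counts
        self.barriers_seen = set()   # channels that delivered the barrier
        self.buffered = []           # events held during alignment
        self.num_inputs = num_inputs
        self.snapshots = []          # stand-in for durable storage

    def on_event(self, channel, word):
        if channel in self.barriers_seen:
            # Channel is past the barrier: hold its events until alignment.
            self.buffered.append(word)
        else:
            self.state[word] = self.state.get(word, 0) + 1

    def on_barrier(self, channel, checkpoint_id):
        self.barriers_seen.add(channel)
        if len(self.barriers_seen) == self.num_inputs:
            # All inputs aligned: snapshot, then release buffered events.
            self.snapshots.append((checkpoint_id, dict(self.state)))
            self.barriers_seen.clear()
            for word in self.buffered:
                self.state[word] = self.state.get(word, 0) + 1
            self.buffered.clear()

op = Operator(num_inputs=2)
op.on_event(0, "a"); op.on_event(1, "a")
op.on_barrier(0, checkpoint_id=1)
op.on_event(0, "b")                # past the barrier on channel 0: buffered
op.on_event(1, "a")                # channel 1 not yet aligned: applied
op.on_barrier(1, checkpoint_id=1)
print(op.snapshots[0])             # (1, {'a': 3}) -- excludes buffered "b"
print(op.state)                    # {'a': 3, 'b': 1}
```

The key property the sketch shows: the snapshot contains exactly the effect of every event before the barrier on every channel, and nothing after it, which is what makes the global snapshot consistent.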

State Backend Options:

| Backend | Storage | Best For | Incremental Checkpoints |
| --- | --- | --- | --- |
| HashMapStateBackend | Java heap (in-memory) | Small state (<1 GB) | No |
| EmbeddedRocksDBStateBackend | Local disk (RocksDB) | Large state (multi-GB) | Yes |
| ForStStateBackend (Flink 2.0) | Remote S3/HDFS | Cloud-native, elastic | Yes |

Flink 1.x Recovery (Traditional):

  • Checkpoint interval: Typically 30 seconds to several minutes in production
  • Recovery time: Minutes for multi-gigabyte state (proportional to state size)
  • Process: Stop all operators → download state from checkpoint storage → restart from checkpoint offset
  • Large state (>100 GB): Recovery can take 30+ minutes

Flink 2.0 Recovery (Disaggregated State): The ForStStateBackend in Flink 2.0 fundamentally changes the recovery profile:

  • Checkpoint duration: 3-4 seconds regardless of state size (94% reduction vs 1.x)
  • Recovery time: Under 10 seconds (40x faster than 1.x)
  • State size independent: Same recovery time whether state is 1 GB or 1 TB
  • Cost savings: ~50% due to independent compute/storage scaling

Exactly-Once Semantics: Flink achieves end-to-end exactly-once through a two-phase commit protocol. Sources (Kafka) and sinks (databases, Kafka) participate in a distributed transaction coordinated by the checkpoint.
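
The sink side of that protocol can be sketched as follows. This is a deliberately simplified, hypothetical model with invented names, not Flink's actual sink API: writes accumulate in an open transaction, the checkpoint barrier triggers pre-commit, and the checkpoint-complete notification triggers the final commit.

```python
# Minimal two-phase-commit sink sketch (hypothetical API, not Flink's).
# Phase 1 (pre-commit) runs on the checkpoint barrier; phase 2 (commit)
# runs when the coordinator confirms the checkpoint completed.

class TwoPhaseCommitSink:
    def __init__(self):
        self.open_txn = []     # writes in the current transaction
        self.pending = {}      # checkpoint_id -> pre-committed transaction
        self.committed = []    # durably visible output

    def write(self, record):
        self.open_txn.append(record)

    def pre_commit(self, checkpoint_id):
        # Phase 1: flush and park the transaction with this checkpoint.
        self.pending[checkpoint_id] = self.open_txn
        self.open_txn = []

    def commit(self, checkpoint_id):
        # Phase 2: make the transaction visible atomically.
        self.committed.extend(self.pending.pop(checkpoint_id))

    def abort(self, checkpoint_id):
        # On failure before commit, the pre-committed transaction rolls
        # back; its records are replayed from the source after recovery.
        self.pending.pop(checkpoint_id, None)

sink = TwoPhaseCommitSink()
sink.write("e1"); sink.write("e2")
sink.pre_commit(checkpoint_id=7)   # barrier arrives at the sink
sink.write("e3")                   # belongs to the next checkpoint
sink.commit(checkpoint_id=7)       # coordinator confirms checkpoint 7
print(sink.committed)              # ['e1', 'e2']
```

Because output only becomes visible after the checkpoint that produced it is durable, a crash between pre-commit and commit never leaks partial results: the transaction is either committed on recovery or aborted and replayed.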

Kafka Streams: Changelog-Based Recovery

Kafka Streams takes a fundamentally different approach: every state mutation is written to a compacted Kafka changelog topic. The changelog is the source of truth; local RocksDB state is a cache.

Architecture:

State Change → Write to local RocksDB + Write to Kafka changelog topic
Recovery    → Read changelog topic from last committed offset → Rebuild RocksDB
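
Conceptually, the rebuild step is a replay of key-value updates in order: later writes overwrite earlier ones, and tombstones delete keys. A minimal Python sketch of that idea (illustrative only, not Kafka Streams code):

```python
# Changelog-based recovery in miniature: the changelog is an ordered log
# of (key, value) updates; recovery rebuilds the store by replaying it.
# A value of None models a Kafka tombstone (key deleted).

changelog = [
    ("user:1", 5),
    ("user:2", 3),
    ("user:1", 8),      # later write wins
    ("user:3", 1),
    ("user:3", None),   # tombstone: key removed
]

def rebuild_store(log):
    store = {}
    for key, value in log:
        if value is None:
            store.pop(key, None)   # tombstone removes the key
        else:
            store[key] = value
    return store

print(rebuild_store(changelog))    # {'user:1': 8, 'user:2': 3}
```

The linear scan over the log is exactly why recovery time scales with state size: a store with billions of keys means billions of records to replay, even though compaction bounds the log to roughly one record per live key.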

Recovery Characteristics:

  • Small state (<1 GB): Seconds to rebuild from changelog
  • Medium state (1-10 GB): Minutes
  • Large state (>100 GB): Hours for full changelog replay
  • This is the critical weakness: recovery time scales linearly with state size

Standby Replicas (Mitigation): Setting num.standby.replicas=1 creates shadow copies of state on other instances. When the active instance fails, the standby promotes almost instantly (seconds). However, this doubles resource consumption.

Exactly-Once: Kafka Streams supports exactly-once via processing.guarantee=exactly_once_v2 (Kafka 2.5+). This coordinates transactional producers and consumers within the Kafka ecosystem.

RisingWave: Disaggregated-First Architecture

RisingWave was built from day one with disaggregated state on S3. Its Hummock storage engine is an LSM-tree optimized for streaming workloads, with all persistent state stored in object storage.

Architecture:

Streaming Operators → In-memory state (mem-tables)
                    → Flush to SSTables (Sorted String Tables)
                    → Upload to S3
                    → Global epoch-based consistency

Checkpoint Mechanism:

  • Checkpoint interval: 1 second (default, configurable)
  • The Meta node injects barriers into the streaming dataflow
  • When all operators complete the barrier, state is committed to S3
  • Asynchronous checkpointing: "almost invisible to foreground tasks"
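
A toy model of epoch-based commit, with a dict standing in for object storage (illustrative only; the names are invented and this is not RisingWave's Hummock API):

```python
# Epoch-based commit in miniature: writes are tagged with the open epoch;
# a barrier closes the epoch and commits its batch of writes to "S3" as
# one atomic unit. Recovery reads committed epochs only, so a new compute
# node needs no local state restore.

class EpochStore:
    def __init__(self):
        self.epoch = 0
        self.uncommitted = []   # writes in the open epoch (lost on crash)
        self.s3 = {}            # "object storage": epoch -> committed batch

    def write(self, key, value):
        self.uncommitted.append((key, value))

    def barrier(self):
        # Close the epoch; commit its writes atomically.
        self.s3[self.epoch] = list(self.uncommitted)
        self.uncommitted.clear()
        self.epoch += 1

    def recover(self):
        # Replay committed epochs in order; no local disk involved.
        state = {}
        for e in sorted(self.s3):
            for key, value in self.s3[e]:
                state[key] = value
        return state

store = EpochStore()
store.write("clicks", 10)
store.barrier()                 # epoch 0 committed
store.write("clicks", 25)       # lost if the node dies before next barrier
print(store.recover())          # {'clicks': 10}
```

The sketch makes the RPO story concrete: after a crash, only the writes of the open (uncommitted) epoch are lost, which with a 1-second barrier interval is at most about one second of input.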

Recovery Characteristics:

  • Recovery time: Seconds (not proportional to state size)
  • RPO: ~1 second (at most 1 second of data loss)
  • No local state to restore — compute nodes are relatively stateless
  • New compute nodes fetch state from S3 on demand
  • Elastic scaling without service interruption

Why Disaggregated State Enables Faster Recovery:

| Aspect | Local State (RocksDB) | Disaggregated (S3) |
| --- | --- | --- |
| Recovery process | Download full checkpoint from remote → rebuild local state | Point compute at existing S3 state |
| Recovery time | Proportional to state size | Near-constant (seconds) |
| Disk failure impact | State lost, must restore from checkpoint | No impact (state in S3) |
| Durability | Depends on local disk + checkpoint frequency | 11 nines (S3 durability) |
| Scaling | Must redistribute state across nodes | Compute scales independently |

Exactly-Once: RisingWave provides exactly-once semantics through barrier-based checkpointing with epoch-based consistency. All reads within a checkpoint boundary return consistent results.

Spark Structured Streaming: Micro-Batch Recovery

Spark Structured Streaming uses a micro-batch model with Write-Ahead Logs (WAL) stored in HDFS or S3.

Recovery Mechanism:

  • Checkpoint directory stores: stream offsets, committed batch IDs, state snapshots
  • On failure: restart driver → launch new executors → replay from last committed batch offset
  • _spark_metadata directory tracks completed batches, preventing duplicate processing

Recovery Characteristics:

  • Recovery time: Seconds to minutes (cluster restart + state restore)
  • RPO: One micro-batch duration (typically 1-10 seconds)
  • Limited to micro-batch latency (not true real-time)
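
The offset-log/commit-log handshake behind that recovery can be sketched as follows (conceptual model with invented names, not Spark's internal API): offsets are logged before a batch runs, completion is logged after, and any batch planned but not committed is re-executed over the exact same offset range.

```python
# Micro-batch recovery in miniature: a write-ahead offset log records each
# batch's source range BEFORE processing; a commit log records completion
# AFTER. On restart, planned-but-uncommitted batches replay deterministically.

offset_log = {}     # batch_id -> (start_offset, end_offset)
commit_log = set()  # batch_ids fully processed

def run_batch(batch_id, start, end, process):
    offset_log[batch_id] = (start, end)   # WAL entry first
    process(start, end)
    commit_log.add(batch_id)              # then mark durable completion

def recover():
    # Batches present in the offset log but absent from the commit log
    # must be re-run with the same offset range.
    return [b for b in offset_log if b not in commit_log]

run_batch(0, 0, 100, lambda s, e: None)   # batch 0 completes normally
offset_log[1] = (100, 200)                # crash after planning batch 1
print(recover())                          # [1]
```

Re-running the identical offset range is what prevents duplicates: the batch is deterministic over its range, so replaying it produces the same output the failed attempt would have.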

Real-World Failure Scenarios

Node Crash During Processing

| System | What Happens | Recovery Time |
| --- | --- | --- |
| Flink 1.x | Job stops, operators reset to latest checkpoint, replay from checkpoint offset | Minutes |
| Flink 2.0 | Same process, but ForSt backend recovers in under 10 seconds | Seconds |
| Kafka Streams | Instance restarts, replays changelog topic | Seconds (small state) to hours (large state) |
| RisingWave | Operators restart, fetch state from S3 checkpoint | Seconds |
| Spark SS | Driver restarts, replays from last committed batch | Seconds to minutes |

Disk Failure (Local State Loss)

This scenario is where architecture matters most:

  • Flink (RocksDB): Local state destroyed. Must download full checkpoint from remote storage. Recovery time proportional to state size.
  • Kafka Streams: Local RocksDB cache destroyed. Must replay entire changelog topic. Can take hours for large state.
  • RisingWave: No impact. State lives in S3. Compute node restarts and fetches state on demand. Recovery in seconds.
  • Flink 2.0 (ForSt): Minimal impact. State already in S3. Recovery under 10 seconds.

Out-of-Memory Kill

OOM kills (exit code 137) are instantaneous — no graceful shutdown:

  • All in-flight data and uncommitted state are lost
  • Recovery point is the last successful checkpoint
  • Data loss = events processed since last checkpoint
    • Flink: Up to 30-120 seconds of data (typical checkpoint interval)
    • RisingWave: Up to 1 second of data (1-second checkpoint interval)
    • Kafka Streams: Near-zero (changelog captures every write)
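
The replay window is simple arithmetic: the worst-case number of events to reprocess is the event rate times the checkpoint interval. With illustrative (not benchmarked) numbers:

```python
# Worst-case replay window after an OOM kill: everything that arrived
# since the last completed checkpoint must be reprocessed from the source.
# The rate and intervals below are illustrative examples only.

def events_to_replay(event_rate_per_sec, checkpoint_interval_sec):
    return event_rate_per_sec * checkpoint_interval_sec

rate = 50_000  # events/sec, example workload
print(events_to_replay(rate, 60))   # 60 s checkpoint interval -> 3,000,000
print(events_to_replay(rate, 1))    # 1 s checkpoint interval  -> 50,000
```

At the same ingest rate, shrinking the checkpoint interval from one minute to one second shrinks the worst-case reprocessing window by the same factor of 60.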

Network Partition

When the cluster splits into isolated groups:

  • Flink/RisingWave: Meta node (JobManager/Meta service) coordinates using quorum consensus. Minority partition fences itself. Majority partition continues processing.
  • Kafka Streams: Depends on Kafka broker availability. If consumer can't reach coordinator, rebalance triggers after timeout.

The Checkpoint Interval Trade-Off

Checkpoint interval directly determines your Recovery Point Objective (maximum data loss during failure):

| System | Default Interval | RPO | Checkpoint Overhead |
| --- | --- | --- | --- |
| RisingWave | 1 second | ~1 second | Low (async, S3-native) |
| Flink | 30 sec - several minutes | = interval | Medium-High (state size dependent) |
| Flink 2.0 (ForSt) | Configurable (shorter possible) | = interval | Low (3-4 sec constant) |
| Spark SS | 1 micro-batch | = batch duration | Medium |
| Kafka Streams | Continuous | Near-zero | High (every write to changelog) |

The trade-off: more frequent checkpoints reduce data loss but increase processing overhead. RisingWave's S3-native architecture makes 1-second checkpoints feasible without significant overhead, while Flink 1.x's local-state model makes sub-minute checkpoints expensive for large state.
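
A back-of-the-envelope model of that overhead, using illustrative figures rather than benchmarks: if each checkpoint imposes roughly `c` seconds of foreground cost and runs every `i` seconds, the overhead fraction is about `c / i`.

```python
# Rough overhead model: checkpoint cost divided by checkpoint interval.
# All figures below are illustrative assumptions, not measurements.

def overhead(cost_sec, interval_sec):
    return cost_sec / interval_sec

# Large local state: snapshot cost grows with state size, so short
# intervals become expensive.
print(f"{overhead(6.0, 30):.0%}")   # 6 s of cost every 30 s  -> 20%
# Near-constant-cost checkpoints (state already in S3) keep even
# 1-second intervals cheap.
print(f"{overhead(0.05, 1):.0%}")   # 50 ms of cost every 1 s -> 5%
```

The model is crude (it ignores asynchronous overlap and network effects), but it captures why constant-cost checkpointing is what makes aggressive, sub-second intervals practical.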

How to Choose Based on Fault Tolerance Requirements

Choose RisingWave if:

  • You need the lowest RPO (1-second data loss window)
  • Fast, consistent recovery time regardless of state size matters
  • You want to eliminate local disk as a failure point
  • SQL-based pipelines are sufficient for your workload

Choose Flink 2.0 (ForSt) if:

  • You need disaggregated state with the broadest feature set
  • You require complex event processing (MATCH_RECOGNIZE)
  • You have an existing Flink platform and want to upgrade

Choose Kafka Streams with standby replicas if:

  • Near-zero RPO matters most (changelog captures every write)
  • You accept higher resource cost for instant standby failover
  • Your workload is Kafka-native

Choose Spark Structured Streaming if:

  • Seconds-level RPO is acceptable
  • You're already in the Spark ecosystem
  • Batch + streaming unification matters more than low latency

Frequently Asked Questions

What is the difference between RPO and RTO in streaming?

Recovery Point Objective (RPO) is the maximum amount of data you can afford to lose during a failure — determined by the checkpoint interval. Recovery Time Objective (RTO) is how long it takes to resume processing after a failure. RisingWave achieves ~1 second RPO and seconds-level RTO. Flink 2.0 with ForSt achieves configurable RPO and under 10 seconds RTO.

Which streaming platform has the best fault tolerance?

It depends on your priority. For lowest data loss (RPO), Kafka Streams' continuous changelog approach captures every state mutation. For fastest recovery (RTO), RisingWave and Flink 2.0 with disaggregated state recover in seconds regardless of state size. For the broadest feature set with good fault tolerance, Flink 2.0 leads.

How does disaggregated state improve fault tolerance?

Disaggregated state stores all streaming state in object storage (S3) instead of local disks. This eliminates disk failure as a risk, makes recovery time independent of state size (no need to download and rebuild local state), and enables elastic scaling. Both RisingWave and Flink 2.0 (ForSt backend) use this approach.

Does exactly-once processing guarantee no data loss?

Exactly-once semantics guarantee that each event affects the final result exactly once — no duplicates and no skipped events. However, during a failure, events processed since the last checkpoint may need to be reprocessed (they are replayed from the source). The checkpoint interval determines how much reprocessing occurs. Exactly-once ensures the final result is correct, not that processing is never interrupted.

What happens to state when a streaming node runs out of memory?

An out-of-memory (OOM) kill terminates the process immediately (no graceful shutdown). All in-memory state since the last checkpoint is lost. The system recovers from the last checkpoint, replaying events from the source. With RisingWave's 1-second checkpoint interval, at most 1 second of data needs to be reprocessed. With Flink's typical intervals, this window can be 30 seconds to several minutes.
