Your stream processing job goes down at 2 AM. A TaskManager OOMs. A Kubernetes node is evicted. A network partition isolates half the cluster. Whatever the cause, you are now staring at a recovery time that could stretch from seconds to tens of minutes, depending on the architecture you chose when you built the system.
Benchmark comparisons of Apache Flink recovery time versus RisingWave rarely surface in vendor marketing, but they reveal one of the most significant operational differences between the two systems. For Flink, recovery duration scales with state size: a 50 GB job recovers quickly, but a 2 TB job may take 10-30 minutes to restore state from a checkpoint before processing resumes. For RisingWave, recovery is stateless by design. Compute nodes restart and reconnect to state that already lives in S3. The job resumes in seconds whether you have 10 GB or 10 TB of streaming state.
This article explains why the gap exists, walks through each system's recovery sequence step by step, provides realistic benchmark estimates for large-state workloads, and shows you how to observe recovery behavior using SQL against a running RisingWave instance.
Why Recovery Time Matters for Stream Processing Engineers
Recovery time is not just an operations metric. It has direct product consequences.
A streaming job that takes 20 minutes to recover produces 20 minutes of stale results downstream. Real-time dashboards freeze. Fraud detection rules stop firing. CDC pipelines stall, causing consumers to see inconsistent data. For SLAs measured in seconds, a 20-minute recovery window is not a minor inconvenience - it is a P0 incident.
The standard advice is to reduce checkpoint intervals to minimize replay distance. But shorter intervals do not help recovery time when the bottleneck is state restore, not replay. Understanding the exact sequence of events during recovery is necessary to diagnose and fix this problem correctly.
How Apache Flink Recovery Works
Flink's recovery mechanism is well-documented and battle-tested. It is also tightly coupled to the state backend configuration you chose when you deployed the job.
Step 1: Failure Detection
The JobManager monitors TaskManagers via heartbeats. When a TaskManager's heartbeat times out (heartbeat.timeout, default 50 seconds), the JobManager declares it dead and initiates recovery. On Kubernetes, a more abrupt failure (e.g., an OOM kill) can be detected faster because the pod disappears immediately.
Step 2: Job Cancellation and Resource Release
The JobManager cancels all running tasks. Other TaskManagers finish their current processing cycle, release network buffers, and signal readiness for restart. This takes a few seconds.
Step 3: Resource Allocation
New TaskManager pods are scheduled. Kubernetes must find a node with sufficient capacity, pull the container image if it is not cached, and start the JVM. On a warm cluster this takes 30-60 seconds. On a cold cluster (scaled to zero), this step alone can take several minutes.
Step 4: State Download from Checkpoint Storage
This is where recovery time diverges sharply between small-state and large-state jobs.
Flink's classic state backend is RocksDB. State is stored on local SSD on each TaskManager and checkpointed periodically to remote storage (S3 or HDFS). When the TaskManager restarts, it must download the checkpoint data from remote storage back to local disk.
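A classic deployment of this kind is typically wired up with a handful of configuration options. As a rough sketch, shown here as Flink SQL client SET statements (the same keys can live in flink-conf.yaml; the bucket path is a placeholder):

-- Classic Flink 1.x setup: RocksDB on local disk, checkpoints shipped to S3.
SET 'state.backend' = 'rocksdb';               -- working state on local SSD
SET 'state.backend.incremental' = 'true';      -- upload only changed SST files
SET 'state.checkpoints.dir' = 's3://example-bucket/flink-checkpoints';  -- remote checkpoint storage
SET 'execution.checkpointing.interval' = '30s';  -- checkpoint every 30 seconds

Every option above reinforces the same coupling: the working copy of state is local, and S3 only holds periodic snapshots that must be pulled back down after a failure.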
For a job with 100 GB of checkpointed state distributed across 10 TaskManagers, each TaskManager downloads 10 GB. At a typical S3 download throughput of 1-2 GB/s per TaskManager (with parallelism limited by network bandwidth), this step takes roughly 5-10 seconds per TaskManager. That sounds fast - but it assumes optimal conditions.
In practice, S3 download throughput per TaskManager is limited by:
- Number of parallel S3 GET requests the TaskManager issues
- Network interface capacity on the EC2 instance type
- S3 request rate limits if the checkpoint is split into many small files
For a 1 TB job, download time scales accordingly: 100 GB per TaskManager at 2 GB/s is 50 seconds, but at 500 MB/s (a realistic sustained rate including RocksDB rebuilding overhead) it is closer to 3-4 minutes.
Step 5: RocksDB Rebuild
Downloading SST files to local disk is not enough. RocksDB must build its internal index structures (block index, bloom filters) before the state backend is queryable. For large state, this initialization adds 30-60 seconds of overhead.
Step 6: Source Offset Seek and Replay
Once state is restored, Flink replays events from the last checkpointed Kafka offset. With a 30-second checkpoint interval, this means replaying up to 30 seconds of events. With a 5-minute checkpoint interval (common for large-state jobs to reduce checkpoint overhead), replay covers 5 minutes of backlog.
Total Flink Recovery Timeline
| Step | Small State (50 GB) | Medium State (500 GB) | Large State (2 TB) |
|---|---|---|---|
| Failure detection | 30-50 seconds | 30-50 seconds | 30-50 seconds |
| Resource allocation | 30-60 seconds | 30-60 seconds | 30-60 seconds |
| State download | 5-15 seconds | 50-120 seconds | 200-600 seconds |
| RocksDB rebuild | 10-20 seconds | 30-60 seconds | 60-120 seconds |
| Replay from offset | 5-60 seconds | 5-60 seconds | 5-60 seconds |
| Total estimate | 80-205 seconds | 145-350 seconds | 325-890 seconds |
These are conservative estimates under favorable conditions (warm cluster, uncongested S3, single failure). Multiple simultaneous TaskManager failures compound the issue because all TaskManagers download state concurrently, saturating shared network bandwidth.
Flink 2.0 ForSt Backend: A Step Forward
Flink 2.0 (released March 2025) introduced ForSt, a disaggregated state backend that stores state directly in object storage rather than local disk. This eliminates the state download step during recovery: compute nodes read state directly from S3 on demand.
ForSt benchmark results published by the Apache Flink community show recovery times under 10 seconds for jobs with hundreds of gigabytes of state. This is a genuine architectural improvement that moves Flink closer to RisingWave's stateless compute model.
However, as of early 2026, ForSt is an opt-in feature with a different access pattern than the classic RocksDB backend. Most production deployments continue to use the RocksDB state backend because ForSt's remote-first access pattern requires re-tuning jobs that were written around fast local reads. Migrating from RocksDB to ForSt requires a savepoint migration and a cluster restart.
How RisingWave Recovery Works
RisingWave's recovery sequence is fundamentally different because the architecture treats state and compute as separate concerns from the start.
The Hummock Storage Model
RisingWave uses Hummock, a purpose-built LSM-tree storage engine that writes all state directly to S3-compatible object storage. Local memory and SSD on compute nodes serve only as a read-through cache. The authoritative state always lives in S3.
This means there is no concept of "checkpoint state to remote storage." State is already remote. Checkpointing in RisingWave records epoch metadata - a lightweight snapshot of which S3 files constitute a consistent view of the streaming state at a given point in time. This metadata is small (kilobytes) and is recorded in milliseconds regardless of total state size.
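You can inspect the parameters that drive this behavior on a running cluster. A minimal check, assuming the system parameter names documented for recent RisingWave versions:

-- Cluster-wide streaming parameters (names may differ slightly across versions).
SHOW PARAMETERS;
-- Among the rows, look for:
--   barrier_interval_ms   | 1000   -- a barrier flows through the graph every second
--   checkpoint_frequency  | 1      -- every barrier commits a checkpoint epoch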
Step 1: Failure Detection
RisingWave's meta service monitors compute nodes. Node failure triggers immediate rescheduling. Detection latency is typically under 10 seconds.
Step 2: Stateless Restart
A replacement compute node starts. There is no JVM to initialize (RisingWave is written in Rust), which eliminates JVM startup overhead and garbage collection warm-up.
Step 3: Epoch Metadata Load
The new compute node reads the last consistent epoch metadata from the meta service. This is a lightweight operation: a few kilobytes of metadata describing which S3 files contain the current state. This takes milliseconds.
Step 4: Cache Population on Demand
The compute node begins processing. When an operator needs state that is not in its local cache, it fetches the relevant SSTable from S3. This happens transparently as part of normal processing. There is no "restore state before resuming" phase - the node starts processing immediately and fetches state lazily.
Step 5: Resume from Checkpoint Offset
RisingWave replays events from the Kafka offset stored in the last epoch metadata. The default checkpoint interval is 1 second, so replay covers at most 1 second of events in the common case.
Total RisingWave Recovery Timeline
| Step | Small State (50 GB) | Medium State (500 GB) | Large State (2 TB) |
|---|---|---|---|
| Failure detection | 5-10 seconds | 5-10 seconds | 5-10 seconds |
| Node restart (Rust, no JVM) | 2-5 seconds | 2-5 seconds | 2-5 seconds |
| Epoch metadata load | < 1 second | < 1 second | < 1 second |
| State download | None - S3 native | None - S3 native | None - S3 native |
| Resume processing | 2-5 seconds | 2-5 seconds | 2-5 seconds |
| Total estimate | 9-21 seconds | 9-21 seconds | 9-21 seconds |
The critical observation is that recovery time does not scale with state size. Whether your job has 50 GB or 2 TB of streaming state, the compute node restarts in under 20 seconds. The state was never local; it never needs to be downloaded.
Benchmark Comparison
The following table summarizes recovery time estimates across state sizes, based on the architectural analysis above and published benchmarks from the Flink and RisingWave communities.
| State Size | Flink (RocksDB, classic) | Flink 2.0 (ForSt) | RisingWave |
|---|---|---|---|
| 10 GB | 80-120 seconds | 10-20 seconds | 10-20 seconds |
| 100 GB | 120-180 seconds | 10-20 seconds | 10-20 seconds |
| 500 GB | 150-300 seconds | 10-20 seconds | 10-20 seconds |
| 2 TB | 300-900 seconds | 10-20 seconds | 10-20 seconds |
| 10 TB | 900+ seconds | 10-20 seconds | 10-20 seconds |
RisingWave's recovery time is constant across all state sizes. Flink with classic RocksDB backend scales linearly with state. Flink 2.0 with ForSt approaches RisingWave's performance, but requires opting in to the new backend and re-tuning the deployment.
Note: Recovery times include failure detection, resource allocation, and state initialization. They assume a warm Kubernetes cluster with pre-pulled images. Cold cluster starts add 60-120 seconds for both systems.
Observing RisingWave Checkpoint Behavior
You can directly observe RisingWave's epoch-based checkpointing and understand the recovery model by running the following SQL against a live instance. The examples below use table names prefixed with rec_ and are verified against RisingWave 2.8.0.
Set Up a Sample Streaming Pipeline
Start by creating a table that simulates an order event stream and a materialized view that aggregates over it:
CREATE TABLE rec_orders (
order_id BIGINT PRIMARY KEY,
customer_id BIGINT,
amount DECIMAL,
region VARCHAR,
status VARCHAR,
event_time TIMESTAMPTZ
);
INSERT INTO rec_orders VALUES
(1, 1001, 250.00, 'us-east', 'completed', '2026-04-01 08:00:00+00'),
(2, 1002, 89.99, 'us-west', 'completed', '2026-04-01 08:00:05+00'),
(3, 1003, 1200.00, 'eu-west', 'completed', '2026-04-01 08:00:10+00'),
(4, 1001, 45.50, 'us-east', 'completed', '2026-04-01 08:00:15+00'),
(5, 1004, 320.00, 'ap-east', 'completed', '2026-04-01 08:00:20+00'),
(6, 1005, 780.00, 'eu-west', 'returned', '2026-04-01 08:00:25+00'),
(7, 1002, 99.00, 'us-west', 'completed', '2026-04-01 08:00:30+00'),
(8, 1006, 2500.00, 'us-east', 'completed', '2026-04-01 08:00:35+00'),
(9, 1007, 60.00, 'ap-east', 'completed', '2026-04-01 08:00:40+00'),
(10, 1003, 450.00, 'eu-west', 'completed', '2026-04-01 08:00:45+00');
Create a Materialized View
This materialized view is the "state" that would need to be recovered after a node failure. In RisingWave, it is continuously persisted to S3, not to local disk.
CREATE MATERIALIZED VIEW rec_regional_stats AS
SELECT
region,
COUNT(*) AS total_orders,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value,
COUNT(*) FILTER (WHERE status = 'returned') AS returns
FROM rec_orders
GROUP BY region;
Query the view immediately after creation:
SELECT * FROM rec_regional_stats ORDER BY region;
Expected output:
region | total_orders | total_revenue | avg_order_value | returns
---------+--------------+---------------+-------------------------------+---------
ap-east | 2 | 380.00 | 190.00 | 0
eu-west | 3 | 2430.00 | 810.00 | 1
us-east | 3 | 2795.50 | 931.8333333333333333333333333 | 0
us-west | 2 | 188.99 | 94.4950 | 0
(4 rows)
Chain a Second Materialized View
One of RisingWave's architectural strengths is cascading materialized views - views that read from other views. Each layer is independently persisted to S3, so recovery at any tier does not require recomputing upstream results.
CREATE MATERIALIZED VIEW rec_high_value_regions AS
SELECT
region,
total_revenue,
avg_order_value
FROM rec_regional_stats
WHERE total_revenue > 500;
SELECT * FROM rec_high_value_regions ORDER BY total_revenue DESC;
Expected output:
region | total_revenue | avg_order_value
---------+---------------+-------------------------------
us-east | 2795.50 | 931.8333333333333333333333333
eu-west | 2430.00 | 810.00
(2 rows)
If the compute node hosting rec_high_value_regions restarts, recovery does not replay the full rec_orders history. The materialized view state (rec_regional_stats filtered output) is already persisted in S3 as part of Hummock's epoch snapshot. The new node loads the epoch metadata and resumes.
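To see what actually gets persisted, you can list the internal state tables that back these views. A quick check, assuming the SHOW INTERNAL TABLES command available in recent RisingWave versions:

-- Internal state tables backing the materialized views; these live in Hummock (S3).
SHOW INTERNAL TABLES;
-- Expect entries such as __internal_rec_regional_stats_<id>_hashaggstate_<id>.
-- These are the objects a replacement compute node reads lazily after recovery.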
Windowed Aggregations and Recovery
Windowed aggregations are among the heaviest state users in stream processing. In Flink, window state grows proportionally to window width and event rate, and it all needs to be checkpointed and downloaded during recovery.
In RisingWave, window state is stored in Hummock the same way as all other state - persisted to S3 on every barrier, recovered by reading epoch metadata:
CREATE MATERIALIZED VIEW rec_order_windows AS
SELECT
window_start,
window_end,
region,
COUNT(*) AS orders_in_window,
SUM(amount) AS window_revenue
FROM TUMBLE(rec_orders, event_time, INTERVAL '15 seconds')
GROUP BY window_start, window_end, region;
SELECT window_start, window_end, region, orders_in_window, window_revenue
FROM rec_order_windows
ORDER BY window_start, region;
Expected output:
window_start | window_end | region | orders_in_window | window_revenue
---------------------------+---------------------------+---------+------------------+----------------
2026-04-01 08:00:00+00:00 | 2026-04-01 08:00:15+00:00 | eu-west | 1 | 1200.00
2026-04-01 08:00:00+00:00 | 2026-04-01 08:00:15+00:00 | us-east | 1 | 250.00
2026-04-01 08:00:00+00:00 | 2026-04-01 08:00:15+00:00 | us-west | 1 | 89.99
2026-04-01 08:00:15+00:00 | 2026-04-01 08:00:30+00:00 | ap-east | 1 | 320.00
2026-04-01 08:00:15+00:00 | 2026-04-01 08:00:30+00:00 | eu-west | 1 | 780.00
2026-04-01 08:00:15+00:00 | 2026-04-01 08:00:30+00:00 | us-east | 1 | 45.50
2026-04-01 08:00:30+00:00 | 2026-04-01 08:00:45+00:00 | ap-east | 1 | 60.00
2026-04-01 08:00:30+00:00 | 2026-04-01 08:00:45+00:00 | us-east | 1 | 2500.00
2026-04-01 08:00:30+00:00 | 2026-04-01 08:00:45+00:00 | us-west | 1 | 99.00
2026-04-01 08:00:45+00:00 | 2026-04-01 08:01:00+00:00 | eu-west | 1 | 450.00
(10 rows)
The window state for all 10 rows is already persisted in S3 as part of the epoch snapshot. After a recovery event, a new compute node reads this state directly. No replay of rec_orders history required.
The Root Cause: Coupled vs. Decoupled State
The recovery time difference between Flink (classic RocksDB) and RisingWave is not a tuning problem. It is an architectural consequence.
Flink's Coupled Model
In Flink's classic architecture, compute and storage are coupled. State lives on the TaskManager's local disk (in RocksDB). Checkpointing copies that local state to remote storage. Recovery downloads that remote state back to local disk. This round-trip is unavoidable.
The checkpoint serves two purposes: it is both the disaster recovery artifact and the only persistent copy of the state. If the local disk is gone (node eviction, SSD failure), the checkpoint is the only recovery path, and it must be fully downloaded before the node can process again.
RisingWave's Decoupled Model
RisingWave separates these concerns from the start. S3 is not the backup - it is the primary store. Local memory is a cache in front of S3. When a compute node dies, the state is unaffected because it never lived there in the first place.
This is similar to how Snowflake or Redshift Serverless recover instantly: there is no local state to restore because compute is truly stateless. RisingWave applies the same architecture to streaming workloads.
The RisingWave vs Apache Flink comparison covers this architectural difference in detail, including how it affects not just recovery but also checkpoint overhead, state management complexity, and cost.
Operational Implications
Checkpoint Interval Strategy
For Flink (classic RocksDB), there is a tension between:
- Short checkpoint intervals (e.g., 30 seconds): Lower data loss on failure, but more frequent checkpoint overhead. For large state jobs, checkpointing itself adds latency to processing.
- Long checkpoint intervals (e.g., 5 minutes): Less checkpoint overhead, but more data to replay on recovery, and longer recovery windows because the checkpoint itself is a full state snapshot.
For RisingWave, checkpoint intervals default to 1 second. Each checkpoint is a lightweight epoch metadata record, not a full state copy. There is no tension: low-overhead checkpoints and minimal replay distance are both achievable simultaneously.
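If a workload benefits from a different trade-off, the interval is just a system parameter. A hedged example, assuming the checkpoint_frequency and barrier_interval_ms parameters documented for recent RisingWave versions:

-- Commit a checkpoint every 10 barriers instead of every barrier.
-- With the default barrier_interval_ms of 1000, the effective checkpoint
-- interval becomes roughly 10 seconds. Verify parameter names on your version.
ALTER SYSTEM SET checkpoint_frequency = 10;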
Resource Sizing and State Growth
In Flink with the RocksDB backend, state growth has direct cost implications: you must provision local SSDs on every TaskManager large enough to hold the maximum expected state size per partition. Underestimate, and jobs fail due to disk exhaustion. Overestimate, and you pay for idle SSD capacity.
In RisingWave, state lives in S3. You pay for what you store at roughly $0.023/GB/month, and there is no ceiling except your S3 account limits. Compute nodes do not need oversized local storage. The total cost comparison between Flink and RisingWave shows that storage cost differences alone can be significant for large-state deployments.
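To see where that storage actually goes, the system catalog exposes per-state-table statistics. A sketch, assuming the rw_catalog.rw_table_stats view and its columns as documented for recent RisingWave versions:

-- Rough per-table state footprint in Hummock (view and column names assumed; verify on your version).
SELECT id, total_key_count, total_key_size, total_value_size
FROM rw_catalog.rw_table_stats
ORDER BY total_value_size DESC
LIMIT 10;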
Upgrade and Savepoint Management
Flink upgrades require creating a savepoint, stopping the job, deploying the new version, and restoring from the savepoint. The savepoint process is similar to a full checkpoint and takes time proportional to state size. For a 2 TB job, creating a savepoint before an upgrade adds 10-30 minutes of planned downtime to every release.
RisingWave upgrades use rolling restarts of stateless compute pods. Because state is external, the upgrade process does not require savepoint creation or state migration. You upgrade the binary; the state in S3 remains unchanged and is read by the new version on startup.
What Flink 2.0 ForSt Changes
Flink 2.0's ForSt backend deserves a fair assessment. It represents the Flink community recognizing the limitations of the coupled-storage model and moving toward cloud-native architecture.
With ForSt, state changes are streamed directly to remote object storage, and recovery no longer requires a full state download. Published results show recovery times of 5-10 seconds for jobs with hundreds of gigabytes of state, which is competitive with RisingWave's recovery time.
The gap that remains is operational: ForSt requires migration from the classic RocksDB backend (via savepoints), re-tuning of access patterns optimized for local memory to remote-first access, and confidence that ForSt's behavior is stable in your specific workload. Teams running Flink 1.x in production will not see ForSt benefits until they complete this migration.
For new deployments starting in 2026, ForSt is a compelling option that brings Flink's recovery behavior much closer to RisingWave's. For teams already running RisingWave, the recovery advantage is built-in and requires no migration path.
Full Comparison Table
| Dimension | Flink (RocksDB) | Flink 2.0 (ForSt) | RisingWave |
|---|---|---|---|
| State storage | Local SSD (RocksDB) | Object storage (S3) | Object storage (S3 via Hummock) |
| Recovery scales with state | Yes (linear) | No (constant) | No (constant) |
| Recovery time (100 GB) | 2-4 minutes | 10-20 seconds | 10-20 seconds |
| Recovery time (2 TB) | 10-30 minutes | 10-20 seconds | 10-20 seconds |
| Checkpoint overhead | High (full state copy) | Low (streaming writes) | Very low (epoch metadata) |
| Default checkpoint interval | 30 seconds to minutes | 30 seconds to minutes | 1 second |
| Local SSD required | Yes | No | No (cache only) |
| JVM startup overhead | Yes | Yes | No (Rust) |
| Migration required for fast recovery | Yes (ForSt migration) | N/A (native) | N/A (built-in) |
| SQL interface | Flink SQL (ANSI-like) | Flink SQL (ANSI-like) | PostgreSQL-compatible |
| Serving built in | No | No | Yes |
FAQ
What is the typical Apache Flink recovery time for a large-state job?
Flink recovery time with the classic RocksDB state backend scales linearly with state size. A job with 100 GB of state typically recovers in 2-4 minutes, while a 2 TB job can take 10-30 minutes. The bottleneck is downloading checkpoint data from remote storage back to local SSD on each TaskManager, then rebuilding RocksDB index structures. Flink 2.0's ForSt backend eliminates the download step and recovers in 10-20 seconds regardless of state size, but requires migration from the classic backend.
How does RisingWave achieve constant recovery time regardless of state size?
RisingWave recovery time is constant because compute nodes are stateless. All state is stored in S3-compatible object storage via the Hummock storage engine. There is no local state to download during recovery. When a compute node restarts, it reads lightweight epoch metadata (kilobytes), connects to the existing S3 state, and resumes processing with lazy cache population. The recovery sequence is the same whether the job has 10 GB or 10 TB of streaming state.
Does RisingWave support exactly-once processing during recovery?
Yes. RisingWave's recovery preserves exactly-once semantics through epoch-based checkpointing. Each epoch snapshot captures a consistent point-in-time view of all materialized view state and the corresponding source offsets. After recovery, the system replays events from the last committed epoch offset (typically covering 1 second of events, given the default 1-second checkpoint interval). This guarantees that each event's effect appears exactly once in the output. See Exactly-Once Processing in Stream Processing for a deep dive on the mechanics.
When should I choose Flink over RisingWave for production streaming jobs?
Flink is the stronger choice when your workload requires complex event processing (CEP) using MATCH_RECOGNIZE, custom Java or Scala operators via the DataStream API, or connectors not yet supported by RisingWave. Flink's connector ecosystem (100+ connectors) is broader than RisingWave's current library. If your team has deep Flink expertise and established operational playbooks, and your state is under 100 GB, the operational cost of migration may outweigh the recovery time benefit. For new greenfield projects or migrations from Flink SQL pipelines, RisingWave's constant-time recovery, PostgreSQL-compatible interface, and built-in serving layer offer a compelling alternative.
Conclusion
The Apache Flink recovery time vs RisingWave benchmark comparison reveals a structural difference, not a tuning gap:
- Flink (classic RocksDB) recovery time scales linearly with state size - minutes to tens of minutes for terabyte-scale jobs.
- Flink 2.0 (ForSt) eliminates the download step and achieves constant-time recovery, but requires migrating to the new backend.
- RisingWave recovery is always constant, typically under 20 seconds, because compute nodes are stateless and all state already lives in S3.
For stream processing engineers evaluating platforms for large-state workloads - joins over wide windows, high-cardinality aggregations, multi-stage CDC pipelines - recovery time is a first-class architectural concern, not an afterthought. The system you choose determines whether a 2 AM node failure is a 30-second blip or a 30-minute incident.
RisingWave's Hummock architecture gives you constant-time recovery, 1-second checkpoint intervals, and no local disk management, without any additional configuration. Those are not tuning parameters; they are properties of the storage model.
Ready to test recovery behavior for yourself? Try RisingWave Cloud free - no credit card required. Sign up
Join our Slack community to ask questions and connect with other stream processing developers.

