Flink State Management vs RisingWave: How They Compare

Flink state management requires operators to configure and tune the RocksDB state backend, set checkpoint intervals, manage state TTL policies, and allocate local SSD storage on every TaskManager. RisingWave eliminates this entire operational surface by persisting all state to S3-compatible object storage automatically, using epoch-based snapshots, and exposing state only through standard SQL materialized views. The result is the same stateful streaming capability with none of the RocksDB configuration overhead.

This article compares the two systems across four dimensions: state backend architecture, checkpointing and recovery, state TTL, and the operational day-two experience. It targets stream processing engineers and platform engineers who already understand Flink internals and want a concrete picture of what changes when evaluating RisingWave.

What Is State Management in Stream Processing?

Every stateful streaming system must answer two questions: where does intermediate state live while a job is running, and how is that state persisted so the job can recover after a failure? The answers have direct consequences for hardware requirements, operational complexity, and recovery time objectives.

Flink and RisingWave give different answers to both questions. Flink delegates the first question to a pluggable state backend and the second to a periodic checkpointing protocol. RisingWave integrates both into a single storage layer built on S3.

Flink ships with two production-relevant state backends: the HashMapStateBackend, which keeps all state in the JVM heap, and the EmbeddedRocksDBStateBackend, which spills state to disk via the RocksDB embedded key-value store. For any job with more than a few gigabytes of state, RocksDB is the practical choice because the heap backend triggers garbage collection pauses that translate directly into processing latency spikes.

The RocksDB configuration surface

Choosing RocksDB does not end the configuration work. It begins it. The following options appear in real production Flink configurations:

EmbeddedRocksDBStateBackend rocksdb = new EmbeddedRocksDBStateBackend(true); // true = incremental checkpoints

// Column family options applied to every state operator
rocksdb.setRocksDBOptions(new RocksDBOptionsFactory() {
    @Override
    public DBOptions createDBOptions(DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        return currentOptions
            .setMaxOpenFiles(5000)
            .setMaxBackgroundJobs(4);
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        return currentOptions
            .setWriteBufferSize(64 * 1024 * 1024)         // 64 MB write buffer
            .setMaxWriteBufferNumber(3)
            .setMinWriteBufferNumberToMerge(1)
            .setLevel0FileNumCompactionTrigger(4)
            .setLevel0SlowdownWritesTrigger(20)
            .setLevel0StopWritesTrigger(36)
            .setMaxBytesForLevelBase(256 * 1024 * 1024)   // 256 MB L1 target
            .setTargetFileSizeBase(64 * 1024 * 1024)
            .setCompressionType(CompressionType.LZ4_COMPRESSION);
    }
});

env.setStateBackend(rocksdb);

This is a representative, not exhaustive, configuration. In practice, Flink teams maintain these options per-job or in a shared properties file and revisit them whenever job parallelism changes, state size grows, or they observe compaction stalls.

JVM heap sizing alongside RocksDB

RocksDB holds its data in native (off-heap) memory and on local disk, but the JVM heap still matters. Each TaskManager runs multiple operators, and Flink allocates heap for network buffers, serialization, and operator metadata. A misconfigured heap-to-off-heap ratio results in either an OutOfMemoryError from heap pressure or an OutOfDirectMemoryError from native memory exhaustion. The canonical Flink sizing approach for a TaskManager with RocksDB is to set taskmanager.memory.managed.fraction and taskmanager.memory.process.size so that enough managed memory remains for RocksDB block caches and write buffers while the heap stays at a manageable size.

# flink-conf.yaml (Flink 1.x) / config.yaml (Flink 2.x)
taskmanager.memory.process.size: 8192m
taskmanager.memory.managed.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 512m
taskmanager.memory.jvm-overhead.max: 1024m
taskmanager.memory.jvm-overhead.fraction: 0.1

Changing job parallelism or adding new stateful operators often requires re-tuning these values across all TaskManagers in the cluster.

RisingWave State Architecture: Disaggregated Storage on S3

RisingWave does not use a local state backend. All state lives in S3-compatible object storage, structured as a Log-Structured Merge (LSM) tree called Hummock. Compute nodes -- the workers that run streaming operators -- are stateless. They hold in-memory caches of hot data but write all mutations to S3 through a shared storage layer.

This architecture has three operational consequences:

  1. No local SSD provisioning. You do not size or attach local storage to compute nodes.
  2. No RocksDB configuration. There is no per-column-family or per-job tuning surface.
  3. Compute scales independently of state. Adding or removing compute nodes does not require rebalancing local state files.

State in RisingWave is defined implicitly by the SQL you write. When you create a materialized view, RisingWave maintains the incremental state required to answer queries against that view. You do not declare a state backend, an operator state type, or a key serializer.

-- RisingWave: state is the implicit backing store for this materialized view.
-- No state backend configuration required.
CREATE MATERIALIZED VIEW fstate_order_totals AS
SELECT
    user_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_amount
FROM fstate_orders
GROUP BY user_id;

This materialized view maintains running per-user counts and sums. The state that backs this aggregation lives in S3 via Hummock. RisingWave decides the storage format, compaction strategy, and cache policy automatically.
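Because the view is a regular relation, the state it maintains is directly readable over the PostgreSQL protocol. A minimal sketch (the key value 42 is an arbitrary illustration):

```sql
-- Point lookup into the aggregation state backing the view.
SELECT order_count, total_amount
FROM fstate_order_totals
WHERE user_id = 42;
```

There is no savepoint export or state processor API involved; the read path is an ordinary SQL query.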

Flink Checkpointing: Barriers and Incremental Snapshots

Flink's checkpointing protocol inserts barrier messages into the data stream. When an operator receives a barrier from all of its input partitions, it snapshots its current state and acknowledges the checkpoint to the JobManager. The JobManager waits for all operators to acknowledge before marking the checkpoint complete.

With the RocksDB backend and incremental checkpointing enabled, Flink uploads only the RocksDB SST files that changed since the last checkpoint rather than the full state. This reduces checkpoint upload time, but it also creates a dependency chain: to restore from a given checkpoint, Flink must apply all incremental deltas back to the last full checkpoint.

// Checkpoint configuration in a Flink job
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setCheckpointInterval(60_000);          // every 60 seconds
checkpointConfig.setCheckpointTimeout(120_000);          // fail if not complete in 2 min
checkpointConfig.setMinPauseBetweenCheckpoints(10_000);  // at least 10s gap
checkpointConfig.setMaxConcurrentCheckpoints(1);
checkpointConfig.setExternalizedCheckpointCleanup(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
);

// Enable incremental checkpoints on the RocksDB backend
EmbeddedRocksDBStateBackend rocksdb = new EmbeddedRocksDBStateBackend(true);
env.setStateBackend(rocksdb);

Checkpoint duration is a function of state size, network bandwidth to checkpoint storage, and the number of concurrent checkpoint operations. A job with 100 GB of RocksDB state can take several minutes to complete a full checkpoint, during which barrier alignment introduces back-pressure across the topology. Unaligned checkpoints (introduced in Flink 1.11) mitigate the back-pressure issue but add implementation complexity and increase checkpoint sizes.

Flink retains a configurable number of completed checkpoints on disk. Storage cost grows linearly with retention count multiplied by checkpoint size. Incremental checkpointing helps, but a cluster with many jobs, each holding tens of gigabytes of state, accumulates significant checkpoint storage.

RisingWave epoch-based snapshots

RisingWave uses a barrier protocol similar in structure to Flink's, but the output is different. At each epoch boundary, RisingWave flushes all in-memory state changes to Hummock (the S3 LSM), producing an immutable snapshot. Snapshots are cheap to create because the underlying storage is already S3 -- there is no separate upload step.

Recovery means replaying from the most recent committed epoch rather than reconstructing local RocksDB files from a checkpoint archive. Because Hummock is shared storage, all compute nodes read the same state on startup. There is no need to distribute checkpoint archives to individual workers.

RisingWave does not expose checkpoint configuration to users. There is no interval to set, no timeout to tune, and no concurrent checkpoint limit. The epoch interval is an internal parameter managed by the meta service based on cluster load.
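The one user-facing touchpoint is the FLUSH statement, which forces the current epoch to commit before returning. It is useful in tests and scripts when a subsequent read must observe all previously ingested data; it is a consistency barrier, not a tuning knob:

```sql
-- Force the in-flight epoch to commit so the next read sees all prior writes.
FLUSH;
```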

State TTL: Expiring Old State

Long-running streaming jobs accumulate state for keys that are no longer active. A session tracking job retains state for users who last visited months ago. Flink provides a StateTtlConfig API to mark individual state descriptors with a TTL:

StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(24))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .cleanupInBackground()                // compaction-driven cleanup in RocksDB
    .build();

ValueStateDescriptor<UserSession> sessionDescriptor =
    new ValueStateDescriptor<>("user-session", UserSession.class);
sessionDescriptor.enableTimeToLive(ttlConfig);

ValueState<UserSession> sessionState = getRuntimeContext().getState(sessionDescriptor);

TTL cleanup in Flink is not instantaneous. Expired entries remain in RocksDB until a compaction pass or an explicit read eviction removes them. The cleanupInBackground() option triggers cleanup during RocksDB compactions, but compaction timing is non-deterministic. Large keyspaces with long TTLs can carry gigabytes of logically expired state that bloats checkpoint sizes and RocksDB disk usage until compaction catches up.

Flink's TTL is per-state descriptor, so jobs with multiple stateful operators require separate TTL configurations for each. Coordinating TTL policies across a complex topology -- ensuring that all state for a given key expires at roughly the same wall-clock time -- requires careful design.

RisingWave: windowed aggregation as natural TTL

RisingWave does not implement a TTL API analogous to Flink's StateTtlConfig. Instead, the preferred approach for time-bounded state is to write windowed materialized views. A tumbling window aggregation retains state only for the current and recent windows, automatically releasing state for past windows once they close:

-- RisingWave: tumbling window over user events.
-- State for closed windows is released automatically.
CREATE MATERIALIZED VIEW fstate_session_counts AS
SELECT
    window_start,
    window_end,
    user_id,
    COUNT(*) AS event_count
FROM TUMBLE(fstate_user_events, event_time, INTERVAL '1' HOUR)
GROUP BY window_start, window_end, user_id;

For running aggregations that should logically expire old state, the trade-off is explicit: RisingWave does not purge rows from a materialized view unless you model the view to produce finite output. For use cases that genuinely require row-level expiry on unbounded state, the pattern is to keep the view itself unbounded and filter to recent data with time predicates in a downstream query.
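A sketch of that downstream-filter pattern, reusing the windowed view from above (the 24-hour horizon is an arbitrary choice):

```sql
-- Serve only windows that closed in the last day; older rows stay in the
-- view but are excluded at query time.
SELECT window_start, user_id, event_count
FROM fstate_session_counts
WHERE window_end >= NOW() - INTERVAL '24' HOUR;
```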

RisingWave's Hummock storage does implement compaction and garbage collection of superseded data internally, but this is transparent to the user and not configurable per-view.

Running-Balance State: A Concrete Example

A running account balance is a prototypical stateful streaming use case. Each transaction updates the balance for an account. The state is the current balance per account ID.

Flink (Java)

In Flink, you implement this with a KeyedProcessFunction that holds a ValueState<Double> per account:

public class BalanceFunction extends KeyedProcessFunction<Long, Transaction, AccountBalance> {

    private ValueState<Double> balanceState;

    @Override
    public void open(Configuration config) {
        StateTtlConfig ttlConfig = StateTtlConfig
            .newBuilder(Time.days(90))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .build();

        ValueStateDescriptor<Double> descriptor =
            new ValueStateDescriptor<>("balance", Double.class);
        descriptor.enableTimeToLive(ttlConfig);

        balanceState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void processElement(Transaction txn, Context ctx,
                               Collector<AccountBalance> out) throws Exception {
        Double current = balanceState.value();
        double newBalance = (current == null ? 0.0 : current) + txn.getAmount();
        balanceState.update(newBalance);
        out.collect(new AccountBalance(txn.getAccountId(), newBalance, txn.getTxnTime()));
    }
}

This requires: a Java project with Flink dependencies, a state descriptor, an explicit TTL policy, serialization configuration for Double, and a deployment pipeline to submit the JAR to a Flink cluster.

RisingWave (SQL)

The same logic in RisingWave is a single SQL statement:

-- Running account balance, always current, served via PostgreSQL protocol.
CREATE MATERIALIZED VIEW fstate_user_balances AS
SELECT
    account_id,
    SUM(amount)   AS running_balance,
    COUNT(*)      AS txn_count,
    MAX(txn_time) AS last_txn_time
FROM fstate_transactions
GROUP BY account_id;

After inserting transactions, the materialized view reflects the current balance within milliseconds:

SELECT account_id, running_balance, txn_count
FROM fstate_user_balances
ORDER BY account_id;
 account_id | running_balance | txn_count
------------+-----------------+-----------
        201 |          350.00 |         2
        202 |         1000.00 |         1

The state backing this view -- the per-account running sum -- lives in Hummock on S3. No TTL policy, no state descriptor, no serialization configuration, and no JAR deployment.

Operational Comparison: Day-Two Experience

The deepest differences between Flink and RisingWave state management appear not at initial deployment but during ongoing operations.

Adding a new streaming job

In Flink, adding a new stateful job means configuring its state backend, sizing its TaskManager allocation, deciding whether it shares a cluster with existing jobs or runs in isolation (job mode vs. session mode), and initializing its checkpoint storage location. If the new job joins data from an existing stateful operator, you must decide whether to use connected streams or a Flink Table API lateral join.

In RisingWave, you create a new materialized view with a CREATE MATERIALIZED VIEW statement. RisingWave schedules the streaming computation across available compute nodes automatically. The new view can reference existing tables or other materialized views directly in SQL. No cluster reconfiguration is required.
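For example, a new streaming job can build directly on the order-totals view defined earlier; the 1000 threshold and the view name are illustrative:

```sql
-- A new streaming job is just another view, stacked on existing state.
CREATE MATERIALIZED VIEW fstate_high_value_users AS
SELECT user_id, total_amount
FROM fstate_order_totals
WHERE total_amount > 1000;
```

RisingWave incrementally maintains the new view from the changes flowing through the upstream view, with no cluster reconfiguration.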

Changing job parallelism

Scaling a Flink job requires stopping the job (with a savepoint), modifying the parallelism setting, restoring from the savepoint, and verifying that the restored state aligns with the new key distribution. Incremental checkpointing complicates this further because the restore chain must be valid at the new parallelism.

RisingWave's compute nodes are stateless and can be added or removed while the system is running. State lives in S3 and is not tied to any specific compute node. Scaling compute does not require a restart or a savepoint.

Recovering from a failed node

When a Flink TaskManager fails, the JobManager detects the failure and restarts the entire job from the most recent checkpoint. Recovery time is the sum of: checkpoint download time (proportional to checkpoint size), RocksDB reconstruction time (deserializing SST files), and replay time (re-processing events since the last checkpoint).

When a RisingWave compute node fails, the meta service reschedules its actors on surviving nodes. Because state is in S3, there is no checkpoint download step. Recovery involves warming the in-memory cache from S3, which is typically faster than reconstructing a RocksDB instance from a checkpoint archive.

Feature Comparison Table

Dimension | Apache Flink | RisingWave
State backend | HashMapStateBackend or EmbeddedRocksDBStateBackend | Hummock (S3 LSM, no configuration)
State persistence location | Local SSD (RocksDB) + periodic S3 checkpoint | S3 directly (Hummock)
State configuration | Per-job, per-operator, per-state-descriptor | None exposed to users
JVM heap tuning | Required (managed memory, JVM overhead fractions) | Not applicable (Rust runtime)
RocksDB tuning | Required for production (write buffer, compaction, compression) | Not applicable
Checkpointing | User-configured interval, timeout, concurrency limit | Automatic epoch-based, no configuration
Incremental checkpoints | Available, requires RocksDB backend | Implicit in Hummock design
State TTL | StateTtlConfig per state descriptor | Windowed views; no TTL API
Scaling without restart | No (requires savepoint + restore) | Yes (stateless compute nodes)
Recovery granularity | Full job restart from checkpoint | Per-actor reschedule from shared S3 state
State access from external systems | Not directly (needs separate serving layer) | Direct PostgreSQL query
Programming model | Java/Scala DataStream API or SQL | SQL only
Local SSD required | Yes (for RocksDB) | No

When Flink's Control Justifies the Complexity

Flink's state management is mature, well-documented, and battle-tested at scale. It is the right choice when:

Your team already maintains Flink expertise and has established runbooks for RocksDB tuning and checkpoint management. The cost of the existing knowledge base outweighs the operational overhead.

You need fine-grained control over state serialization, compaction, or TTL policies that go beyond what SQL expressions can capture. Complex event processing patterns using CEP or low-level timer management are better expressed in the DataStream API.

Your streaming jobs have extremely high throughput requirements (millions of events per second per task) and you need to tune every layer of the stack for maximum efficiency.

You are using Flink's ecosystem integrations -- Flink ML, PyFlink, or vendor-specific connectors -- that have no equivalent in RisingWave.

When RisingWave Removes Unnecessary Complexity

RisingWave is the better choice when the Flink operational overhead is not serving a genuine technical requirement:

Your team does not have dedicated platform engineers to maintain JVM tuning, RocksDB configuration, and checkpoint lifecycle management. The overhead is cost without benefit.

Your streaming logic is expressible in SQL. If you can write the transformation as a GROUP BY, a windowed aggregation, or a join over two streams, RisingWave handles the state management transparently and exposes the results through a standard PostgreSQL interface.

You want to query the results of stream processing directly from your application without building a separate serving layer. RisingWave's materialized view system serves fresh results over the PostgreSQL protocol, so any tool that speaks PostgreSQL -- psql, Grafana, Metabase, Tableau -- can query it directly.

You need to scale compute up and down in response to load without coordinating savepoints. The stateless compute architecture makes horizontal scaling a routine operation.

For teams considering a migration, the migration guide from Flink to RisingWave maps Flink concepts to their RisingWave equivalents step by step.

FAQ

Does RisingWave support exactly-once semantics the way Flink does?

Yes. RisingWave's epoch-based snapshot protocol provides exactly-once guarantees for stateful computations. Each epoch commits atomically to Hummock. If a compute node fails mid-epoch, the system rolls back to the last committed epoch and replays from the input source. The guarantee holds end-to-end when the input source supports offset-based replay, as Kafka does. For a deeper treatment of exactly-once semantics across systems, see the article on exactly-once processing in stream processing.

What happens to RisingWave state when I drop a materialized view?

Dropping a materialized view removes the corresponding state from Hummock. RisingWave's garbage collector reclaims the S3 objects during the next compaction cycle. This is equivalent to cancelling a Flink job with ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION and then manually deleting the checkpoint directory -- except RisingWave handles it automatically.
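The cleanup is triggered by the same DDL you would use in any PostgreSQL database:

```sql
-- Dropping the view schedules its backing Hummock state for garbage collection.
DROP MATERIALIZED VIEW fstate_order_totals;
```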

Can I query RisingWave state the way I query a database table?

Yes, and this is one of the most significant operational differences. In Flink, the state backing a streaming job is internal and not queryable from outside the job. To expose results, you must write to a sink (a database, a Kafka topic, or a REST endpoint) and then query the sink. RisingWave materialized views are first-class database objects. You query them with a standard SELECT statement over the PostgreSQL wire protocol, with no intermediate sink required.

How does RisingWave handle state for long-running joins between two streams?

A stream-stream join in RisingWave is expressed as a regular SQL join over two sources or materialized views. RisingWave maintains the join state -- the buffered rows from each side -- in Hummock. For unbounded streams, RisingWave requires a time-bounded join condition (an interval join) to prevent unbounded state growth, similar to Flink's interval join requirement. This is standard behavior for any stream processing system that provides join semantics with bounded state. The operational comparison between Flink and RisingWave covers join state and other operational topics in detail.
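A minimal sketch of such an interval join; the fstate_shipments source and all column names here are hypothetical, invented for illustration:

```sql
-- Shipment must arrive within 1 hour of the order, so the system can
-- discard buffered join state once the time bound passes.
CREATE MATERIALIZED VIEW fstate_order_shipments AS
SELECT o.order_id, o.order_time, s.ship_time
FROM fstate_orders o
JOIN fstate_shipments s
  ON o.order_id = s.order_id
 AND s.ship_time BETWEEN o.order_time
                     AND o.order_time + INTERVAL '1' HOUR;
```

The BETWEEN bound is what makes the join state finite: rows on either side become unreachable once event time advances past the window.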

Does eliminating RocksDB tuning come with a performance trade-off?

Not for the vast majority of workloads. RisingWave's Hummock storage is implemented in Rust and uses an LSM design similar to RocksDB, but optimized for S3 as the primary storage medium rather than local disk. For workloads requiring low-latency point lookups on hot state, RisingWave uses in-memory block caches backed by the compute node's RAM. The Flink vs RisingWave total cost of ownership analysis includes benchmark context across representative workload types.
