Why RisingWave Is Easier to Operate Than Apache Flink

RisingWave is easier to operate than Apache Flink because it removes the four largest sources of Flink operational overhead: JVM tuning, JobManager and TaskManager topology management, checkpoint configuration, and manual state backend management. Instead, RisingWave accepts PostgreSQL-compatible SQL, stores all state automatically in S3-compatible object storage, and scales compute independently from storage without operator intervention.

This is not a capabilities comparison. Both systems can process streaming data at scale. This is an operational comparison: what does your on-call rotation look like at 2 AM when something goes wrong? What does a new team member need to learn before they can deploy a streaming job? What happens when your traffic triples overnight?

The answers to those questions are very different depending on which system you run.

What Makes Flink Hard to Operate

Before looking at RisingWave, it helps to name precisely what makes Flink hard to operate. There are four distinct problem areas, each of which requires ongoing specialist attention.

1. JVM Tuning

Flink runs on the Java Virtual Machine. The JVM's garbage collector must periodically pause all application threads to reclaim memory. In a stream processing context, these pauses directly translate to latency spikes and potential backpressure cascades.

A production Flink deployment requires configuring the JVM garbage collector explicitly. A typical flink-conf.yaml includes entries like:

# Flink JVM GC configuration (flink-conf.yaml)
env.java.opts.taskmanager: >
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=20
  -XX:G1HeapRegionSize=16m
  -XX:+ParallelRefProcEnabled
  -XX:+DisableExplicitGC
  -XX:+AlwaysPreTouch
  -Xss4m
taskmanager.memory.process.size: 8192m
taskmanager.memory.flink.size: 6144m
taskmanager.memory.managed.fraction: 0.4
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.jvm-overhead.min: 512m
taskmanager.memory.jvm-overhead.max: 1024m
taskmanager.memory.jvm-overhead.fraction: 0.1

Get any of these wrong and you will see one of two failure modes: OutOfMemoryError crashes (under-allocated) or wasted compute spend (over-allocated). The tricky part is that the right values depend on your specific job topology, state access patterns, and event throughput, all of which can change as your workload evolves.

Modern JVM collectors like ZGC and Shenandoah reduce stop-the-world pause times, but they come with higher CPU overhead and their own configuration surface. You trade one tuning problem for another.
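
To make the trade-off concrete, here is a sketch of what switching a TaskManager to ZGC might look like. This is illustrative only: the flags assume a recent JDK (17 or later), and the right heap targets depend entirely on your workload.

```yaml
# Illustrative sketch: moving Flink TaskManagers to ZGC (flink-conf.yaml)
# Assumes JDK 17+; heap targets here are placeholders, not recommendations
env.java.opts.taskmanager: >
  -XX:+UseZGC
  -XX:SoftMaxHeapSize=5g
taskmanager.memory.process.size: 8192m
```

The G1 pause-time knobs go away, but you now reason about ZGC's soft heap ceiling and its extra CPU cost instead. The tuning surface moves; it does not disappear.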

2. JobManager and TaskManager Topology

Flink uses a two-tier cluster architecture: a JobManager (the coordinator) and one or more TaskManagers (the workers). Each component has independent resource requirements, failure modes, and scaling behavior.

A production-grade Flink cluster requires:

  • JobManager high availability: At least two JobManager processes (active/standby) backed by ZooKeeper or etcd for leader election. Without this, a single JobManager failure stops all running jobs.
  • TaskManager sizing: Each TaskManager runs a fixed number of task slots. You must pre-calculate the slot count based on your job parallelism requirements, then size the TaskManager heap accordingly.
  • ZooKeeper ensemble: For HA mode, Flink requires a ZooKeeper quorum (minimum 3 nodes) to coordinate leader election and store job metadata.
  • Parallelism planning: Each streaming operator in a Flink job runs at a configured parallelism level. Changing parallelism after deployment requires a stateful restart with a savepoint.
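
As a sketch of what the HA portion alone involves, a minimal ZooKeeper-backed configuration looks roughly like the following. The key names match Flink's documented HA options, but the quorum addresses, bucket path, and cluster ID are placeholders.

```yaml
# Illustrative sketch: Flink JobManager HA via ZooKeeper (flink-conf.yaml)
# Quorum hosts, storage path, and cluster-id below are placeholders
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: s3://my-bucket/flink-ha
high-availability.cluster-id: /my-flink-cluster
```

Note that this fragment presumes the ZooKeeper ensemble itself is already provisioned, monitored, and upgraded, which is its own operational workstream.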

The operational consequence of this topology is that scaling Flink is not elastic. If you need to handle a traffic spike, you must provision new TaskManagers (with correctly sized slots and memory), register them with the JobManager, and potentially restart jobs to redistribute work across the expanded pool. This process takes minutes to hours depending on your automation maturity and whether savepoints are required.

3. Checkpoint Configuration

Flink achieves fault tolerance through periodic checkpoints: consistent snapshots of operator state written to durable storage. When a job fails and restarts, it resumes from the last completed checkpoint rather than reprocessing the entire stream from the beginning.

Checkpoint configuration involves several interacting parameters:

# Flink checkpoint configuration (flink-conf.yaml)
execution.checkpointing.interval: 60000          # ms between checkpoints
execution.checkpointing.min-pause: 30000          # min gap between checkpoints
execution.checkpointing.timeout: 600000           # checkpoint must complete in 10m
execution.checkpointing.max-concurrent-checkpoints: 1
execution.checkpointing.mode: EXACTLY_ONCE
state.checkpoints.dir: s3://my-bucket/checkpoints
state.savepoints.dir: s3://my-bucket/savepoints
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION

The interactions between these parameters are non-obvious. If your checkpoint interval is too short relative to checkpoint completion time, checkpoints queue up and backpressure builds. If the timeout is too aggressive, large-state jobs fail checkpoints constantly. If you forget to set RETAIN_ON_CANCELLATION, a cancelled job loses its recovery point.

Checkpoint failures at 3 AM are among the most common Flink on-call incidents. They require inspecting checkpoint logs, identifying which operator failed to complete the barrier, and often correlating that with memory pressure or network saturation.

4. State Backend Management

Flink keeps job state in a configured state backend. Three backends are available: heap (in-memory, lost on restart), RocksDB (local disk, survives restarts), and the newer ForSt (fork of RocksDB, introduced in Flink 2.0).

For production workloads, RocksDB is the default recommendation. But RocksDB introduces its own operational layer:

  • Local SSD provisioning: Each TaskManager needs local SSD storage for RocksDB. A job maintaining 100 GB of keyed state needs 100 GB of local SSD per TaskManager replica.
  • RocksDB tuning: Block cache size, write buffer size, compaction strategy, and bloom filter parameters all affect throughput and memory use. Default settings are rarely optimal for high-throughput streaming.
  • Incremental checkpoints: Enabling incremental checkpoints reduces checkpoint size but adds complexity to recovery (Flink must reconstruct full state from a chain of incremental snapshots).
  • State migration: Changing key schema or operator state structure between deployments requires careful schema evolution planning or a full state reset.
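
A minimal sketch of the RocksDB-related configuration gives a sense of this extra layer. The option names below are real Flink configuration keys, but the values (and the local SSD mount point) are placeholders; production values require benchmarking against your actual state size and access pattern.

```yaml
# Illustrative sketch: RocksDB state backend settings (flink-conf.yaml)
# Values are placeholders; tune against your own workload
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.memory.managed: true
state.backend.rocksdb.localdir: /mnt/ssd/rocksdb
state.checkpoints.dir: s3://my-bucket/checkpoints
```

Behind each of these keys sits a deeper RocksDB tuning surface (block cache, write buffers, compaction) reachable through further options when the defaults fall short.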

The net effect is that operating Flink state backends requires expertise in both Flink's state model and RocksDB internals, two separate knowledge domains.

How RisingWave Approaches Each Problem

RisingWave is a PostgreSQL-compatible streaming database written in Rust. It eliminates each of the four Flink operational challenges through architectural choices rather than better defaults.

No JVM, No GC Pauses

RisingWave is written in Rust. Rust uses a compile-time ownership model to manage memory without a garbage collector. There are no stop-the-world pauses, no heap sizing parameters, no GC algorithm selection, and no JVM metaspace limits.

From an operational standpoint, this means your on-call rotation never pages for a GC pause causing latency spikes. Memory usage is predictable and bounded by Rust's ownership semantics rather than by GC heuristics.

You also eliminate the entire classpath management problem. Flink jobs frequently fail due to dependency conflicts between user JAR files and Flink's own libraries. Because RisingWave uses SQL as its only interface, there are no user-submitted JAR files and no classpath to manage.

SQL Replaces Topology Management

In RisingWave, you do not deploy jobs or manage worker topology. You connect with any PostgreSQL client and write SQL.

Creating a streaming data source:

-- Connect with psql, then create a source from Kafka
CREATE SOURCE ops_orders_source (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL,
    region      VARCHAR,
    order_time  TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

Creating a continuously maintained aggregation:

-- This materialized view updates automatically as new orders arrive
CREATE MATERIALIZED VIEW ops_order_revenue_mv AS
SELECT
    customer_id,
    region,
    COUNT(*)        AS order_count,
    SUM(amount)     AS total_revenue,
    AVG(amount)     AS avg_order_value
FROM ops_orders_source
GROUP BY customer_id, region;

Querying the current result at any time:

-- Reads from continuously maintained state, returns instantly
SELECT * FROM ops_order_revenue_mv
WHERE region = 'us-east'
ORDER BY total_revenue DESC;

There is no JobManager to configure, no TaskManager slot count to calculate, no ZooKeeper to provision, and no parallelism to set per operator. RisingWave handles all of that internally.

When you need to add a new streaming computation, you add a new CREATE MATERIALIZED VIEW statement. When you need to remove one, you DROP MATERIALIZED VIEW. The operational model is identical to managing a PostgreSQL schema.
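
Using the revenue view defined above, the replace-a-computation workflow is two statements rather than a savepoint cycle (the simplified column list here is illustrative):

```sql
-- Replacing a streaming computation: drop the old view, create the new one
DROP MATERIALIZED VIEW ops_order_revenue_mv;

CREATE MATERIALIZED VIEW ops_order_revenue_mv AS
SELECT
    customer_id,
    region,
    SUM(amount) AS total_revenue
FROM ops_orders_source
GROUP BY customer_id, region;
```

The new view begins maintaining itself immediately; no JAR rebuild, savepoint, or job restart is involved.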

Disaggregated Storage Replaces Checkpoint Management

RisingWave uses a disaggregated compute-storage architecture. All persistent state lives in S3-compatible object storage. Compute nodes are stateless and can be restarted, replaced, or scaled without data loss.

This design eliminates checkpoint configuration entirely. You do not configure checkpoint intervals, checkpoint modes, checkpoint timeouts, or checkpoint retention policies. The system continuously persists state to object storage as part of its normal operation.

When a compute node fails, RisingWave restarts it and resumes processing from the last persisted state automatically. There is no manual recovery procedure, no checkpoint log to inspect, and no savepoint to take before a planned upgrade.

You also eliminate local SSD provisioning. Because state lives in S3 rather than on local disks, compute nodes do not require any attached block storage for state. This significantly reduces per-node cost and eliminates the provisioning step that precedes Flink TaskManager scaling.

The following illustrates the difference in what a high-value order alert pipeline looks like in each system:

-- RisingWave: deploy an alert filter as a single SQL statement
-- No checkpoint config, no state backend, no topology to manage
CREATE MATERIALIZED VIEW ops_order_alerts_mv AS
SELECT
    customer_id,
    order_id,
    amount,
    region,
    order_time
FROM ops_orders_source
WHERE amount > 1000.00;

-- Query alerts directly from any PostgreSQL client
SELECT * FROM ops_order_alerts_mv ORDER BY order_time DESC LIMIT 50;

In Flink, the equivalent pipeline requires a compiled JAR, a deployment to the cluster, checkpoint configuration, and a separate serving database to make results queryable.

Auto-Scaling Replaces Manual Capacity Planning

Because RisingWave's compute nodes are stateless (all state in S3), adding or removing compute capacity does not require data migration or job restarts. You scale the compute pool and RisingWave redistributes work automatically.

On RisingWave Cloud, scaling is a single API call or a click in the console. On Kubernetes with the RisingWave Operator, it is a change to the replica count in the custom resource. The Kubernetes deployment guide covers this in detail.
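
As a sketch, scaling via the Kubernetes operator is an edit to the custom resource. The exact field layout varies across operator versions, so treat the structure below as illustrative rather than a copy-paste manifest.

```yaml
# Illustrative sketch: scaling RisingWave compute via the Kubernetes Operator
# Field names may differ by operator version; consult the CRD reference
apiVersion: risingwave.risingwavelabs.com/v1alpha1
kind: RisingWave
metadata:
  name: my-risingwave
spec:
  components:
    compute:
      nodeGroups:
        - name: default
          replicas: 4   # bump this to scale; state stays in S3
```

Because the compute nodes are stateless, applying the change does not trigger state migration or a job restart.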

Contrast this with Flink: scaling TaskManagers requires provisioning new nodes, registering them with the JobManager, and for stateful jobs, taking a savepoint and restarting with increased parallelism. That process can take 30 to 60 minutes for large jobs and introduces a recovery window during which the job is not processing.

Side-by-Side Operational Comparison

| Operational task | Apache Flink | RisingWave |
| --- | --- | --- |
| Deploy a new streaming job | Compile JAR, configure parallelism, submit to JobManager | CREATE MATERIALIZED VIEW |
| Change aggregation logic | Take savepoint, update JAR, restart with savepoint | DROP MATERIALIZED VIEW, then CREATE MATERIALIZED VIEW |
| Scale up for traffic spike | Provision TaskManagers; restart job with savepoint if parallelism changes | Scale compute pool (stateless); no job restart |
| Recover from node failure | Job restarts from last checkpoint; checkpoint gap is unprocessed window | Auto-recovery from S3 state; no manual intervention |
| Add a new data source | Configure connector, write DataStream or Table API code, compile | CREATE SOURCE with SQL |
| Query current aggregation results | Write to external DB; query that DB | SELECT from materialized view |
| Tune memory | Set heap, managed memory, network buffer fractions per TaskManager | No JVM; no memory tuning required |
| Configure fault tolerance | Set checkpoint interval, mode, timeout, storage path, retention | Nothing; state persists to S3 automatically |
| Upgrade to new version | Rolling restart with savepoints; verify state compatibility | Standard rolling update via Helm or operator |
| On-call skills required | JVM internals, RocksDB tuning, Flink checkpoint protocol | SQL, standard database operations |

A Day-One vs Day-365 Perspective

The operational gap between Flink and RisingWave widens over time.

On day one, both systems are manageable. Flink has good documentation, and a skilled engineer can stand up a cluster in a few hours. RisingWave is even faster to start: install the binary, connect with psql, write SQL.

By day 365, the Flink cluster has accumulated operational debt: tuned GC flags that nobody remembers writing, a ZooKeeper version that is two major releases behind, checkpoint sizes that have grown 10x as state accumulates, and a team that rotates new engineers in but cannot safely transfer the accumulated configuration knowledge. Every planned upgrade is a high-risk event.

RisingWave's operational surface does not accumulate complexity in the same way. New materialized views are additive. State management is handled by the storage layer. Upgrades follow a standard database upgrade procedure. The system that runs 365 days in looks operationally similar to the system on day 1.

This is not to say RisingWave has zero operational complexity. Large deployments need monitoring, capacity planning, and query optimization. The total cost of ownership comparison covers those trade-offs in depth. But the complexity curve is fundamentally flatter.

Windowed Aggregations Without Operator Configuration

One concrete example of where the operational difference becomes tangible is windowed aggregation. In Flink, every windowed operator has configurable state TTL, state backend settings, and parallelism that must be tuned per job. In RisingWave, you write SQL:

-- 5-minute tumbling window: click counts by event type
-- Verified against RisingWave 2.8
CREATE MATERIALIZED VIEW ops_click_counts_mv AS
SELECT
    window_start,
    window_end,
    event_type,
    COUNT(*) AS event_count
FROM TUMBLE(ops_user_events, event_time, INTERVAL '5 minutes')
GROUP BY window_start, window_end, event_type;

This single statement creates a continuously maintained, windowed aggregation. You query it with a normal SELECT. There is no window operator state backend to configure, no TTL to set, and no parallelism to plan.
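
Reading from the windowed view is an ordinary query. For instance (the 'click' filter value is an assumption for illustration):

```sql
-- Most recent 5-minute windows for one event type
SELECT window_start, event_count
FROM ops_click_counts_mv
WHERE event_type = 'click'
ORDER BY window_start DESC
LIMIT 12;
```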

The equivalent Flink DataStream API code requires defining a windowed stream, configuring the window assigner, defining an aggregate function, and handling state backend configuration separately.

For teams that want to see the full syntax comparison, Flink SQL vs RisingWave SQL covers the differences in sources, windowing, joins, and sinks.

Where Flink Still Has the Edge

RisingWave is a better operational choice for most new streaming workloads, but Flink retains real advantages in specific scenarios.

Complex event processing (CEP): Flink's MATCH_RECOGNIZE and pattern matching operators are mature and well-tested. RisingWave does not yet have a direct equivalent for multi-event pattern detection.

Custom processing logic in Java or Python: If your pipeline requires business logic that cannot be expressed in SQL, Flink's DataStream API gives you full programming language expressiveness. RisingWave supports user-defined functions (UDFs) in Python and Java, but complex stateful logic is better expressed in Flink's native API.

Large existing Flink investment: If you have a team of Flink specialists, a mature CI/CD pipeline for JAR deployment, and years of tuned configurations, the operational cost of that system is already paid. Migration makes sense when you are building new pipelines or when operational incidents become frequent enough to justify the transition investment. The migration guide from Flink to RisingWave covers the evaluation criteria in detail.

Flink 2.0 disaggregated state: Flink 2.0 introduced disaggregated state management, which addresses some of the local SSD and state backend complexity. If you are on Flink 2.0 with ForSt, the operational gap for state management specifically is narrower. JVM tuning and topology management remain.

FAQ

Does RisingWave support exactly-once processing?

Yes. RisingWave provides exactly-once semantics for sources and sinks that support it, including Kafka. The guarantee is built into the storage layer rather than requiring checkpoint configuration. You do not specify a processing mode; exactly-once is the default behavior for supported connectors.
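
As a sketch of what this looks like in practice, delivering the alert view defined earlier to a Kafka topic is a single statement with no delivery-mode or checkpoint parameters (the sink name and topic here are assumptions):

```sql
-- Deliver alert rows to Kafka; delivery guarantees are handled by the
-- system, with no checkpoint or processing-mode configuration to set
CREATE SINK ops_order_alerts_sink FROM ops_order_alerts_mv
WITH (
    connector = 'kafka',
    topic = 'order-alerts',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;
```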

How does RisingWave handle backpressure?

RisingWave handles backpressure internally within its compute layer. When a downstream operator falls behind, upstream operators slow their processing rate automatically. Unlike Flink, there are no per-operator network buffers or backpressure settings to tune. You monitor overall throughput and latency through the standard Prometheus metrics the system exposes.

Can RisingWave replace Flink for all use cases?

Not all. RisingWave is a better operational fit for SQL-expressible streaming workloads, including aggregations, joins, CDC processing, and windowed analytics. Use cases that require custom stateful operator logic written in Java, complex event pattern matching across long event sequences, or deep integration with Flink's existing connector ecosystem may still be better served by Flink. The decision depends on whether your processing logic fits the SQL model.

What happens to RisingWave state if S3 goes down?

RisingWave compute nodes pause processing when they cannot write to object storage. They do not lose data or corrupt state. When S3 availability is restored, processing resumes from the last successfully persisted state. This behavior is similar to how a PostgreSQL database handles a storage failure: the system stops accepting writes rather than risking data corruption.

Is RisingWave production-ready?

Yes. RisingWave is an open-source project under the Apache 2.0 license, available at github.com/risingwavelabs/risingwave. RisingWave Cloud is a managed service used in production by organizations processing billions of events per day. The system has been deployed in production since 2022.
