Flink Cluster Management vs Serverless Streaming SQL

Managing Flink cluster infrastructure -- YARN resource scheduling, Kubernetes TaskManager pods, parallelism tuning, and checkpoint configuration -- adds a significant operational layer that is separate from writing streaming logic. RisingWave removes that layer: streaming SQL runs from a single binary or a standard Helm chart, with automatic state management, no JVM configuration, and compute that scales without job restarts.

This article compares the two approaches head-to-head from the perspective of a platform engineer responsible for keeping streaming pipelines healthy in production.

Apache Flink is a mature, capable stream processing framework. Its operational complexity is not a bug -- it reflects genuine distributed systems challenges. But understanding precisely where that complexity lives helps you make informed architectural decisions.

Flink's operational surface has four major areas: cluster deployment and mode selection, TaskManager sizing and parallelism, checkpointing and state backend configuration, and failure recovery. Each area requires ongoing attention from platform engineers who understand both distributed systems and JVM internals.

Flink supports three deployment targets: standalone, YARN, and Kubernetes. Each has a different operational profile.

Standalone Mode

Standalone is the simplest deployment. You provision VMs, install the Flink distribution, configure flink-conf.yaml, and start JobManager and TaskManager processes manually. It works for development and fixed-capacity workloads.

The operational downside is that standalone offers no automatic resource recovery. If a TaskManager VM fails, you restart it manually or write your own watchdog scripts. Scaling requires SSH and process restarts. Most production teams graduate out of standalone within the first few months.
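A minimal watchdog along those lines might look like the sketch below; the process name and install path are assumptions, and a production setup would use systemd or supervisord rather than cron plus a script like this:

```shell
# Check whether a TaskManager process is alive; restart it if not.
# "TaskManagerRunner" is Flink's TaskManager entry-point class, and
# /opt/flink is an assumed install path.
STATUS=$(pgrep -f "TaskManagerRunner" > /dev/null && echo up || echo down)
echo "taskmanager: $STATUS"

# Only attempt a restart if the Flink scripts are actually installed here.
if [ "$STATUS" = "down" ] && [ -x /opt/flink/bin/taskmanager.sh ]; then
  /opt/flink/bin/taskmanager.sh start
fi
```

Every team running standalone mode ends up with some variant of this, which is exactly the kind of undifferentiated glue that the later deployment modes try to absorb.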

YARN Mode

YARN (Yet Another Resource Negotiator) is the Hadoop-era resource manager, still common in organizations running on-premises Hadoop clusters or EMR. Flink on YARN integrates reasonably well, but introduces YARN-specific operational concerns.

A typical flink run invocation on YARN:

# Submit a Flink job to YARN
flink run \
  -m yarn-cluster \
  -p 8 \
  -yjm 2048m \
  -ytm 4096m \
  -yn 4 \
  -yqu streaming-jobs \
  -c com.example.StreamingJob \
  my-streaming-job.jar

Each flag is a tuning decision: -p 8 sets parallelism, -yjm sets JobManager heap, -ytm sets TaskManager heap, -yn sets TaskManager count, -yqu routes to a YARN queue. Getting these right requires knowing your job's state access patterns, event throughput, and the current load on other YARN applications sharing the cluster.
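The flags also have to be mutually consistent. A back-of-envelope check is shown below; SLOTS_PER_TM stands in for taskmanager.numberOfTaskSlots, which is an assumption here since the flink run flags above do not set it:

```shell
# Sanity-check the resource math implied by the flink run flags above.
JM_MB=2048      # -yjm: JobManager heap
TM_MB=4096      # -ytm: per-TaskManager heap
TM_COUNT=4      # -yn:  number of TaskManagers
SLOTS_PER_TM=2  # assumed taskmanager.numberOfTaskSlots
PARALLELISM=8   # -p:   requested job parallelism

TOTAL_MB=$(( JM_MB + TM_COUNT * TM_MB ))
TOTAL_SLOTS=$(( TM_COUNT * SLOTS_PER_TM ))

echo "YARN memory requested: ${TOTAL_MB} MB"
echo "task slots available:  ${TOTAL_SLOTS} (parallelism requested: ${PARALLELISM})"
```

If the available slots fall below the requested parallelism, the job cannot be fully scheduled and sits waiting for resources, so -p, -yn, and the slot count have to be sized together rather than tuned in isolation.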

YARN introduces additional failure scenarios: YARN NodeManager failures, ResourceManager HA transitions, queue capacity fights when multiple teams share a cluster, and version skew between Flink and Hadoop libraries.

Kubernetes Mode

Kubernetes has become the preferred target for new Flink deployments. The Flink Kubernetes Operator provides a CRD-based deployment model that feels more cloud-native.

A minimal FlinkDeployment manifest looks like this:

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: streaming-job
spec:
  image: flink:1.18
  flinkVersion: v1_18
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    state.backend: rocksdb
    state.backend.rocksdb.memory.managed: "true"
    execution.checkpointing.interval: "60s"
    execution.checkpointing.timeout: "10m"
    execution.checkpointing.min-pause: "30s"
    state.checkpoints.dir: s3://my-bucket/flink-checkpoints
    state.savepoints.dir: s3://my-bucket/flink-savepoints
    high-availability: kubernetes
    high-availability.storageDir: s3://my-bucket/ha
    kubernetes.rest-service.exposed.type: LoadBalancer
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
  job:
    jarURI: local:///opt/flink/usrlib/streaming-job.jar
    parallelism: 8
    upgradeMode: savepoint

This manifest is reasonable but it is not the whole story. You also need: the Flink Kubernetes Operator installed (its own Helm release to manage), RBAC roles for the flink service account, an S3 IAM role with checkpoint bucket access, Prometheus ServiceMonitor resources for metrics, a horizontal pod autoscaler strategy (Flink does not autoscale natively), and a savepoint workflow for controlled restarts.
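The RBAC piece alone is non-trivial. A minimal sketch for the flink service account follows; the namespace and the exact verb list are assumptions -- check the Flink Kubernetes HA documentation for the permissions your version actually needs:

```yaml
# Assumed namespace "streaming"; Flink needs pod and configmap access
# (configmaps hold the Kubernetes HA leader metadata).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink
  namespace: streaming
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flink-role-binding
  namespace: streaming
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: flink
subjects:
  - kind: ServiceAccount
    name: flink
    namespace: streaming
```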

The Kubernetes Operator improves on raw YAML management, but platform engineers still own the full stack: operator upgrades, CRD migrations, image management, and the operational runbooks for each failure mode.

TaskManager Sizing and Parallelism

This is where the most subtle operational work lives in Flink.

Memory Configuration

Flink TaskManagers have a complex memory model. The total process memory divides into multiple pools, each with a separate configuration knob:

# Typical TaskManager memory configuration in flink-conf.yaml
taskmanager.memory.process.size: 8192m        # Total container memory
taskmanager.memory.flink.size: 6144m          # Flink-managed portion
taskmanager.memory.managed.fraction: 0.4      # RocksDB / sort buffers
taskmanager.memory.network.fraction: 0.1      # Network shuffle buffers
taskmanager.memory.jvm-overhead.fraction: 0.1 # JVM metaspace, code cache
taskmanager.memory.jvm-overhead.min: 512m
taskmanager.memory.jvm-overhead.max: 1024m
taskmanager.numberOfTaskSlots: 2

# JVM GC tuning
env.java.opts.taskmanager: >
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=20
  -XX:G1HeapRegionSize=16m
  -XX:+ParallelRefProcEnabled
  -XX:+DisableExplicitGC
  -Xss4m

Getting these numbers wrong produces one of two outcomes: OutOfMemoryError crashes (under-allocated) or inflated cloud bills (over-allocated). The right values depend on job topology, state size, and event throughput -- all of which change as your workload evolves. When a new job is added or existing jobs are modified, the sizing exercise starts over.
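The arithmetic behind the config above can be sketched as follows. This is a simplified model using integer math; the real Flink derivation also carves out framework heap, task off-heap, and JVM metaspace:

```shell
# Rough derivation of the memory pools from the flink-conf.yaml values above.
PROCESS_MB=8192
FLINK_MB=6144
MANAGED_MB=$(( FLINK_MB * 40 / 100 ))    # managed.fraction 0.4 -> RocksDB / sort
NETWORK_MB=$(( FLINK_MB * 10 / 100 ))    # network.fraction 0.1 -> shuffle buffers
OVERHEAD_MB=$(( PROCESS_MB * 10 / 100 )) # jvm-overhead.fraction 0.1 (clamped 512m-1024m)
HEAP_MB=$(( FLINK_MB - MANAGED_MB - NETWORK_MB ))  # roughly what remains for JVM heap

echo "managed=${MANAGED_MB}m network=${NETWORK_MB}m overhead=${OVERHEAD_MB}m heap~${HEAP_MB}m"
```

The point of the exercise: a 8 GB container leaves only roughly 3 GB of actual JVM heap for user code once the managed, network, and overhead pools take their cut, which is why under-sizing shows up as OutOfMemoryError long before the container looks full.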

Parallelism Tuning

Each operator in a Flink job runs at a configured parallelism. Higher parallelism means more parallel instances processing data, but it also means more network shuffle overhead and more state shards to checkpoint.

The standard guidance is to set parallelism equal to the number of Kafka partitions for source operators, then scale downstream operators proportionally. In practice, you iterate: run the job, observe backpressure in the Flink UI, adjust, restart. Each restart on a stateful job requires either a clean start (reprocessing from the beginning) or a savepoint restore.

Changing parallelism after a job is running requires taking a savepoint first:

# Take a savepoint before changing parallelism
flink savepoint <job_id> s3://my-bucket/flink-savepoints

# Cancel the job
flink cancel <job_id>

# Restart with new parallelism from savepoint
flink run \
  -s s3://my-bucket/flink-savepoints/savepoint-abc123 \
  -p 16 \
  my-streaming-job.jar

This is not an emergency procedure -- it is a routine operational task. Any parallelism change, state schema migration, or Flink version upgrade follows the same savepoint dance. On a busy cluster with large state, savepoints can take 10-30 minutes and checkpoint storage costs accumulate.
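Teams usually end up scripting this sequence. The sketch below simulates the CLI output so the parsing step can be shown without a live cluster; the commented lines mark where the real flink commands go, and newer Flink versions also offer flink stop, which combines the savepoint and cancel steps:

```shell
# Scripted savepoint dance (sketch). On success, `flink savepoint` prints a
# line of the form "Savepoint completed. Path: <path>"; simulated here.
SAVEPOINT_OUTPUT='Savepoint completed. Path: s3://my-bucket/flink-savepoints/savepoint-abc123'
# In production: SAVEPOINT_OUTPUT=$(flink savepoint "$JOB_ID" s3://my-bucket/flink-savepoints)

# Extract the savepoint path from the CLI output.
SP_PATH=$(echo "$SAVEPOINT_OUTPUT" | grep -o 's3://[^ ]*')
echo "restoring from: $SP_PATH"

# Then: flink cancel "$JOB_ID"
#       flink run -s "$SP_PATH" -p 16 my-streaming-job.jar
```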

Checkpointing Configuration

Checkpointing is Flink's mechanism for fault tolerance. The JobManager periodically injects checkpoint barriers into the event stream. When all operators have processed a barrier, they snapshot their state to durable storage. On failure, Flink restores from the last successful checkpoint.

The operational challenge is that checkpointing has a direct impact on latency and throughput, and the configuration space is large:

# Checkpoint configuration in flink-conf.yaml
execution.checkpointing.interval: 60s          # How often to checkpoint
execution.checkpointing.timeout: 600s          # Fail if checkpoint takes longer
execution.checkpointing.min-pause: 30s         # Gap between checkpoints
execution.checkpointing.max-concurrent-checkpoints: 1
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION

# State backend for large state
state.backend: rocksdb
state.backend.rocksdb.memory.managed: true
state.backend.rocksdb.memory.fixed-per-slot: 512mb
state.backend.rocksdb.block.cache-size: 128mb
state.backend.rocksdb.write-buffer-size: 64mb

# Checkpoint storage
state.checkpoints.dir: s3://my-bucket/flink-checkpoints
state.checkpoints.num-retained: 3

When checkpoints fail -- due to slow operators, large state, or network issues -- the job falls behind and latency increases. Diagnosing a stuck checkpoint requires navigating the Flink Web UI, reading TaskManager logs, and correlating metrics across multiple components. It is the most common 2 AM incident for Flink platform teams.
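The same statistics are exposed by the Flink REST API at GET /jobs/&lt;job_id&gt;/checkpoints, which is easier to alert on than clicking through the UI. The sketch below runs against a simulated response fragment; the counts field follows the documented API shape:

```shell
# Alert on failed checkpoints using the Flink REST API response.
STATS='{"counts":{"completed":412,"failed":3,"in_progress":1}}'
# In production: STATS=$(curl -s "http://jobmanager:8081/jobs/$JOB_ID/checkpoints")

# Pull the failed-checkpoint count out of the JSON.
FAILED=$(echo "$STATS" | grep -o '"failed":[0-9]*' | cut -d: -f2)
if [ "$FAILED" -gt 0 ]; then
  echo "ALERT: $FAILED failed checkpoints"
fi
```

Wiring a check like this into Prometheus or a pager is table stakes for a Flink platform team; the point of the comparison is that the check has to exist at all.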

The RisingWave Approach: SQL Without the Cluster

RisingWave is a PostgreSQL-compatible streaming database. You write SQL; it handles state, checkpointing, and parallelism automatically.

Docker or Helm -- No Cluster Config Required

For development, a single command starts a full RisingWave instance:

docker run -it --pull=always \
  -p 4566:4566 \
  -p 5691:5691 \
  risingwavelabs/risingwave:latest \
  playground

For production on Kubernetes, the official Helm chart deploys all components with sensible defaults. A minimal production deployment:

helm repo add risingwave https://risingwavelabs.github.io/helm-charts/
helm repo update

helm install risingwave risingwave/risingwave \
  --namespace risingwave \
  --create-namespace \
  --set metaStore.etcd.enabled=true \
  --set stateStore.s3.enabled=true \
  --set stateStore.s3.bucket=my-risingwave-state \
  --set stateStore.s3.region=us-east-1

No JVM configuration. No parallelism decisions. No checkpoint interval tuning. You connect with psql over port 4566 and write SQL.

Streaming SQL Instead of Java Topology

In Flink, a purchase revenue aggregation job requires writing a Java (or Python) DataStream API job, configuring the serialization schema, setting windowing semantics, wiring source and sink connectors, and then submitting the JAR. Here is the equivalent in RisingWave SQL, verified against a live RisingWave instance:

-- Create the input table (backed by a Kafka source in production)
CREATE TABLE fcluster_events (
    event_id  BIGINT,
    user_id   BIGINT,
    event_type VARCHAR,
    amount    NUMERIC,
    ts        TIMESTAMPTZ
);

-- Define incremental aggregation as a materialized view
-- RisingWave updates this automatically as new events arrive
CREATE MATERIALIZED VIEW fcluster_revenue_per_user AS
SELECT
    user_id,
    COUNT(*) FILTER (WHERE event_type = 'purchase') AS purchase_count,
    SUM(amount)  FILTER (WHERE event_type = 'purchase') AS total_revenue,
    MAX(ts)      AS last_event_ts
FROM fcluster_events
GROUP BY user_id;

Query the result at any time with standard SQL -- no separate serving database required:

SELECT user_id, total_revenue
FROM fcluster_revenue_per_user
WHERE total_revenue > 50
ORDER BY total_revenue DESC;
 user_id | total_revenue
---------+---------------
     102 |         99.00
     101 |         89.49
(2 rows)

For windowed aggregations, RisingWave provides TUMBLE, HOP, and SESSION window functions that map directly to SQL:

CREATE MATERIALIZED VIEW fcluster_purchase_volume_1min AS
SELECT
    window_start,
    window_end,
    COUNT(*)    AS event_count,
    SUM(amount) AS window_revenue
FROM TUMBLE(fcluster_events, ts, INTERVAL '1 minute')
WHERE event_type = 'purchase'
GROUP BY window_start, window_end;

-- Query the windowed view
SELECT * FROM fcluster_purchase_volume_1min;
       window_start        |        window_end         | event_count | window_revenue
---------------------------+---------------------------+-------------+----------------
 2026-04-02 07:27:00+00:00 | 2026-04-02 07:28:00+00:00 |           3 |         188.49
(1 row)

RisingWave maintains these materialized views incrementally. You do not configure checkpoint intervals, parallelism, or state backends -- the system handles it.

Auto-Scaling Without Job Restarts

In Flink, scaling is a manual, disruptive process (savepoint, cancel, restart with new parallelism). In RisingWave, you adjust compute resources at the cluster level. On RisingWave Cloud, scaling is handled automatically. Self-hosted, you scale the Kubernetes StatefulSet for compute nodes and RisingWave rebalances work without downtime. Your SQL and materialized view definitions remain unchanged.

Head-to-Head Comparison

 Dimension                       | Flink (Kubernetes mode)                      | RisingWave
---------------------------------+----------------------------------------------+------------------------------------
 Deployment unit                 | JobManager + TaskManager pods + Operator CRD | Single Helm chart
 Language for jobs               | Java, Scala, Python, Flink SQL               | SQL (PostgreSQL-compatible)
 JVM tuning required             | Yes (GC, heap, memory pools)                 | No (written in Rust)
 Parallelism configuration       | Manual per-operator setting                  | Automatic
 Checkpointing config            | Manual (interval, timeout, backend)          | Automatic
 State backend choice            | In-memory, RocksDB (manual selection)        | Built-in, S3-backed
 Scaling without restarts        | No (savepoint required)                      | Yes
 Serving layer needed            | Yes (Kafka/DB for query access)              | No (PostgreSQL interface built-in)
 High availability setup         | JobManager HA + ZooKeeper/etcd               | etcd only, auto-managed
 Operator expertise needed       | JVM + Flink internals + K8s                  | SQL + K8s basics
 Failure recovery                | Manual checkpoint restore                    | Automatic
 Time to first query (local dev) | 30+ minutes                                  | Under 5 minutes (Docker)

Operational Incident Scenarios

To make the difference concrete, here is how each system responds to three common production incidents.

Scenario 1: Traffic spikes 3x overnight. Flink: CPU and memory pressure on TaskManagers. If OOM threshold is hit, TaskManagers die, Flink restores from last checkpoint. You wake up, check the Flink UI, confirm the recovery, then plan a scale-out (savepoint, cancel, redeploy with more TaskManagers). RisingWave: backpressure is handled automatically. On RisingWave Cloud, compute scales out. Self-hosted, you run kubectl scale statefulset risingwave-compute --replicas=6. No job restart.

Scenario 2: A new streaming job is added to the cluster. Flink: calculate resource requirements, check available TaskManager slots, possibly add TaskManagers (or steal slots from other jobs), submit the JAR. RisingWave: write a CREATE MATERIALIZED VIEW statement. The query planner integrates it with existing streaming operators automatically.

Scenario 3: A bug in the stream processing logic needs a fix. Flink: fix the code, build a new JAR, take a savepoint of the running job, deploy the new JAR from savepoint (if state schema is compatible), or restart from scratch. RisingWave: issue a DROP MATERIALIZED VIEW and recreate it with the corrected SQL. Stateful restart is automatic.

Flink is the right choice when:

  • Your team already has deep Flink expertise and has automated the operational runbook
  • You need fine-grained control over per-operator resource allocation for extreme throughput requirements
  • Your existing data platform is built around Hadoop/YARN or a managed service like Amazon Managed Service for Apache Flink
  • You process complex CEP (complex event processing) patterns that map naturally to Flink's pattern matching API

For teams starting new streaming projects, Flink's operational surface represents a significant upfront investment before any business value ships.

For a deeper look at the cost side of this comparison, see Flink vs RisingWave: Total Cost of Ownership, which breaks down infrastructure, operations, and development costs for a medium-scale deployment.

If you are evaluating whether to migrate an existing Flink deployment, Migrating from Apache Flink to RisingWave covers the SQL translation patterns and migration strategy in detail.

Platform engineers evaluating Kubernetes deployment specifically will find Running a Streaming Database on Kubernetes: A Complete Guide useful for the RisingWave Helm chart configuration and production hardening steps.

FAQ

Is RisingWave really "serverless"?

RisingWave is not serverless in the AWS Lambda sense -- you are still deploying a cluster, either self-hosted or on RisingWave Cloud. The "serverless" framing refers to the operational model: you do not configure parallelism, checkpoint intervals, JVM settings, or state backends. The system manages those concerns internally, similar to how a managed database hides storage and replication details. On RisingWave Cloud, compute scales automatically, which gets closer to the serverless experience.

Can RisingWave handle the same throughput as Flink?

RisingWave uses the Nexmark benchmark suite, the same benchmark used by the Flink community, to measure performance. In published results, RisingWave outperforms Flink on 22 of 27 Nexmark queries. Flink has advantages on specific workloads where fine-grained parallelism control and custom state serialization matter. For most standard streaming SQL workloads -- aggregations, joins, windowed computations -- RisingWave is competitive or faster.

What happens to my Flink state when I migrate to RisingWave?

Flink state is not directly portable to RisingWave. The practical migration path is to run both systems in parallel during a cutover window: Flink continues processing until RisingWave's materialized views have caught up with the backfill window, then traffic switches. For most workloads the catch-up period is hours, not days. The migration guide covers this in detail.

Does RisingWave support exactly-once semantics?

Yes. RisingWave provides exactly-once semantics for materialized views by default. State is persisted to S3-compatible object storage with transactional consistency. Unlike Flink, you do not need to configure this explicitly -- it is the only mode the system offers.
