5 Reasons Teams Switch from Flink to Streaming SQL


Your team adopted Apache Flink two years ago. It worked. Sort of. The Java pipeline that took three engineers six weeks to build is running in production, processing clickstream data through a chain of keyed windows. But the Flink cluster needs a dedicated platform engineer to babysit. Checkpoints fail at 3 AM. JVM garbage collection pauses cause backpressure cascades. And every time the business wants a new streaming metric, your data team files a ticket and waits weeks for Java development.

This story repeats across hundreds of engineering organizations. Flink is a powerful stream processing framework, battle-tested at companies like Alibaba and Uber. But power comes with complexity, and many teams are discovering that the complexity tax outweighs the benefits for their use cases. A growing number of data engineering teams are making the switch from Flink to streaming SQL - specifically to RisingWave, a PostgreSQL-compatible streaming database that replaces Flink clusters with standard SQL.

This article breaks down the five concrete reasons behind this shift, with real code comparisons showing what changes when you move from Flink's Java ecosystem to streaming SQL with RisingWave.

Reason 1: No More JVM Cluster Operations

Operating a Flink cluster in production is a full-time job. A minimal deployment requires a JobManager for coordination, multiple TaskManagers for parallel processing, ZooKeeper for high availability, and a state backend (typically RocksDB) with checkpointing to S3 or HDFS. Each component demands its own configuration, monitoring, and capacity planning.

Consider what your platform team manages daily:

  • JVM memory tuning across JobManagers and TaskManagers - heap size, managed memory, network buffers, and metaspace all need separate configuration
  • Garbage collection optimization - long GC pauses cause checkpoint timeouts, which cause job restarts, which cause reprocessing delays
  • RocksDB state backend tuning - block cache sizes, write buffer counts, compaction settings, and bloom filter configuration per job
  • TaskManager slot allocation - over-provision and you waste money, under-provision and jobs compete for resources
  • Checkpoint coordination - barrier alignment, timeout configuration, and minimum pause intervals between checkpoints

When a TaskManager fails in a session cluster, every job running on that node goes down. The recovery process triggers all affected jobs to restore state simultaneously, potentially saturating your storage layer and creating a cascade of failures.

RisingWave: A Single System to Manage

RisingWave eliminates this entire operational layer. It is a streaming database, not a processing framework. You deploy a single system (or use RisingWave Cloud) and connect with any PostgreSQL client. There is no JVM to tune, no RocksDB to configure, and no TaskManager topology to design.

The architecture separates compute, storage, and metadata into independent layers:

  • Compute nodes handle stream processing and batch queries
  • Compactor nodes manage background storage optimization
  • Object storage (S3, GCS, MinIO) persists all state automatically
  • Meta service coordinates scheduling

Each layer scales independently. Need more processing power? Add compute nodes without touching storage. State growing? Object storage handles it automatically with no disk capacity planning.

Written in Rust with no JVM dependency, RisingWave avoids garbage collection pauses entirely. There is no stop-the-world event that can stall your streaming pipeline.

Reason 2: SQL Instead of Java

The gap between Flink's Java API and RisingWave's SQL interface is not incremental - it is an order-of-magnitude difference in complexity for equivalent streaming logic.

Here is a real example from the Flink repository showing a sliding window aggregation:

public class GroupedProcessingTimeWindowExample {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Synthetic source: emits (key, 1) pairs across a fixed key space.
        final long numElementsPerParallel = 20_000_000L;
        GeneratorFunction<Long, Tuple2<Long, Long>> generatorFunction =
            index -> new Tuple2<>(index % 100, 1L);

        DataGeneratorSource<Tuple2<Long, Long>> generatorSource =
            new DataGeneratorSource<>(
                generatorFunction,
                numElementsPerParallel * env.getParallelism(),
                Types.TUPLE(Types.LONG, Types.LONG));

        DataStream<Tuple2<Long, Long>> stream = env.fromSource(
            generatorSource,
            WatermarkStrategy.noWatermarks(),
            "Data Generator");

        stream
            .keyBy(value -> value.f0)
            .window(SlidingProcessingTimeWindows.of(
                Duration.ofMillis(2500),
                Duration.ofMillis(500)))
            .reduce(new SummingReducer())
            .sinkTo(new DiscardingSink<>());

        env.execute();
    }

    private static class SummingReducer
            implements ReduceFunction<Tuple2<Long, Long>> {
        @Override
        public Tuple2<Long, Long> reduce(
                Tuple2<Long, Long> value1,
                Tuple2<Long, Long> value2) {
            return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
        }
    }
}

This is a minimal example - it does not include imports, error handling, serialization setup, connector configuration, or checkpoint settings. A production Flink job typically runs 200-500 lines of Java before it does anything useful.

RisingWave: The Same Logic in SQL

A comparable streaming aggregation in RisingWave, here a one-minute tumbling-window funnel over clickstream data:

CREATE MATERIALIZED VIEW mv_funnel_per_minute AS
SELECT
    window_start,
    window_end,
    page_url,
    COUNT(*) FILTER (WHERE action = 'view') AS views,
    COUNT(*) FILTER (WHERE action = 'add_to_cart') AS add_to_carts,
    COUNT(*) FILTER (WHERE action = 'purchase') AS purchases
FROM TUMBLE(clickstream, event_time, INTERVAL '1 MINUTE')
GROUP BY window_start, window_end, page_url;

That is the complete pipeline definition. No build system, no deployment configuration, no serialization code. The materialized view continuously updates as new data arrives, and you query the results with a standard SELECT:

SELECT * FROM mv_funnel_per_minute ORDER BY window_start DESC LIMIT 5;
       window_start        |        window_end         |     page_url     | views | add_to_carts | purchases
---------------------------+---------------------------+------------------+-------+--------------+-----------
 2026-04-01 10:02:00+00:00 | 2026-04-01 10:03:00+00:00 | /products/tablet |     0 |            0 |         1
 2026-04-01 10:02:00+00:00 | 2026-04-01 10:03:00+00:00 | /products/phone  |     0 |            1 |         0
 2026-04-01 10:01:00+00:00 | 2026-04-01 10:02:00+00:00 | /products/tablet |     0 |            1 |         0
 2026-04-01 10:01:00+00:00 | 2026-04-01 10:02:00+00:00 | /products/phone  |     1 |            0 |         1
 2026-04-01 10:01:00+00:00 | 2026-04-01 10:02:00+00:00 | /products/laptop |     0 |            0 |         1

Verified on RisingWave 2.8.0
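The Flink example above used a sliding window rather than a tumbling one. RisingWave expresses sliding windows with HOP() instead of TUMBLE(). A sketch of the same funnel over one-minute windows that slide every 30 seconds, assuming the same clickstream source:

```sql
-- Sliding (hopping) window: each event lands in every window that covers it.
-- HOP(source, time_column, slide_interval, window_size)
CREATE MATERIALIZED VIEW mv_funnel_sliding AS
SELECT
    window_start,
    window_end,
    page_url,
    COUNT(*) FILTER (WHERE action = 'view') AS views,
    COUNT(*) FILTER (WHERE action = 'purchase') AS purchases
FROM HOP(clickstream, event_time, INTERVAL '30 SECONDS', INTERVAL '1 MINUTE')
GROUP BY window_start, window_end, page_url;
```

With these intervals each event contributes to two overlapping windows, mirroring what SlidingProcessingTimeWindows does in the Java example.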

What This Means for Your Team

The SQL interface changes who can build streaming pipelines. Instead of a small group of Java engineers writing Flink jobs, any analyst or data engineer who knows SQL can create and modify streaming logic. Deployment is instant: run the CREATE MATERIALIZED VIEW statement and the pipeline starts processing. No CI/CD pipeline for JARs, no Flink cluster restarts, no job graph uploads.

Flink does offer Flink SQL, but it runs on top of the same JVM cluster infrastructure. You still need JobManagers, TaskManagers, and the full operational stack described in Reason 1. Flink SQL is a query layer over a framework. RisingWave SQL is the native interface of a database.

Reason 3: Seconds to Recover, Not Minutes

Failure recovery is where the architectural difference between Flink and RisingWave becomes most visible.

When a Flink TaskManager fails, the recovery process follows these steps:

  1. The JobManager detects the failure (heartbeat timeout, typically 10-50 seconds)
  2. The job restarts from the last successful checkpoint
  3. Each operator restores its state from the checkpoint on remote storage (S3/HDFS) to local RocksDB
  4. Once all operators have restored, processing resumes from the checkpoint position

Step 3 is the bottleneck. State restoration time is proportional to state size. A Flink job with 100 GB of state (common for windowed joins or large aggregations) can take 5-15 minutes to restore, during which no data is processed. And because checkpoints are periodic - production intervals of one to several minutes are typical - you may also need to reprocess everything that arrived since the last successful checkpoint.

Organizations often increase checkpoint frequency to reduce data reprocessing, but more frequent checkpoints add load to the storage backend and increase checkpoint duration, creating a tension between recovery speed and steady-state performance.

RisingWave: State Lives in Object Storage

RisingWave takes a fundamentally different approach. All state is persisted in object storage (S3, GCS, Azure Blob) through Hummock, a purpose-built LSM-tree storage engine. There is no local state to restore because state was never local in the first place.

When a compute node fails:

  1. The meta service detects the failure
  2. A replacement node reads state directly from object storage
  3. Processing resumes

RisingWave checkpoints every 1 second by default; Flink leaves checkpointing disabled out of the box, and production deployments typically configure intervals of one to several minutes. Combined with the shared-storage architecture, this means recovery completes in seconds regardless of state size. A pipeline with 500 GB of state recovers just as fast as one with 5 GB because neither requires downloading state to local disk.
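The checkpoint cadence is a cluster-level setting rather than per-job code. As a sketch - the parameter names below are taken from RisingWave's documented system parameters and should be checked against your version:

```sql
-- Barriers are injected every barrier_interval_ms; a checkpoint commits
-- every checkpoint_frequency barriers. These values reproduce the
-- roughly once-per-second default.
ALTER SYSTEM SET barrier_interval_ms = 1000;
ALTER SYSTEM SET checkpoint_frequency = 1;
```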

The Business Impact

The difference between 10-minute and 5-second recovery is not just a technical metric. For a real-time fraud detection system processing 50,000 transactions per second, a 10-minute outage means 30 million transactions go unscreened. For a live pricing engine, it means 10 minutes of stale prices. Faster recovery translates directly to higher system reliability and lower business risk.

Reason 4: Lower Infrastructure Cost

Flink's compute-storage-coupled architecture creates a specific cost problem: you provision resources for peak state size on every node, even when compute demand is low.

A typical production Flink deployment includes:

| Component | Purpose | Cost driver |
|-----------|---------|-------------|
| JobManager (HA pair) | Job coordination | Memory for job graph metadata |
| TaskManagers (N nodes) | Processing + local state | CPU, memory, AND local SSD for RocksDB |
| ZooKeeper (3-node ensemble) | Leader election, HA | Always-on, rarely utilized |
| S3/HDFS | Checkpoint storage | Proportional to state size × checkpoint count |
| Kafka | Message queuing | Separate cluster, often over-provisioned |
| Serving database (PostgreSQL/Redis) | Query results | Separate cluster for downstream consumers |

The last two rows are critical. Flink is a processing engine, not a database. To serve streaming results to applications, you need an external database. To ingest streaming data, you typically need Kafka. Your "Flink deployment" is actually Flink + Kafka + PostgreSQL/Redis, each with its own scaling, monitoring, and cost.

RisingWave's Consolidated Architecture

RisingWave collapses the serving layer into the streaming engine. Materialized views are queryable directly with sub-20ms p99 latency through the PostgreSQL wire protocol. You do not need a separate database to serve results.

The disaggregated architecture also optimizes compute costs:

  • Compute nodes scale based on processing throughput - no local storage overhead
  • Object storage (S3) costs $0.023/GB/month - substantially cheaper per gigabyte than provisioned SSD storage on compute instances
  • Compactor nodes share compute resources and scale independently of the streaming workload

For many workloads, eliminating the separate Kafka cluster, the serving database, and the SSD-heavy TaskManager instances reduces total infrastructure cost by 40-60% compared to an equivalent Flink deployment.

A Concrete Example

Consider a pipeline that processes e-commerce clickstream data: 100,000 events/second, 50 GB of windowed state, results served to a dashboard.

Flink stack: 6x TaskManagers (m5.2xlarge with local SSD), 2x JobManagers, 3x ZooKeeper, 3x Kafka brokers, 1x PostgreSQL RDS for serving. Estimated monthly cost: ~$8,000-12,000.

RisingWave: 4x compute nodes (m5.xlarge), S3 for state storage, no separate serving database. Estimated monthly cost: ~$3,000-5,000.

The savings come from three places: no local SSD requirements, no separate serving database, and the Rust runtime's lower memory footprint compared to JVM processes.

Reason 5: Built-in Serving Layer

This is the architectural advantage that compounds all the others. RisingWave is not just a stream processor - it is a streaming database with a built-in serving layer.

Every Flink deployment faces the same question: "How do applications read the results?" Flink processes data but does not serve it. The standard pattern is:

  1. Flink processes streaming data
  2. Flink writes results to a sink (Kafka topic, database table, or file)
  3. A separate serving system (PostgreSQL, Redis, Elasticsearch) indexes the results
  4. Applications query the serving system

Each hop adds latency, operational complexity, and potential failure points. The sink connector needs monitoring. The serving database needs its own scaling strategy. Data consistency between the processing and serving layers requires careful coordination.

RisingWave: Process and Serve in One System

In RisingWave, materialized views are both the processing definition and the serving layer. When you create a materialized view, RisingWave continuously computes the result and stores it for instant querying:

-- Define the streaming pipeline AND the serving endpoint in one statement
CREATE MATERIALIZED VIEW mv_top_products_by_region AS
SELECT * FROM (
    SELECT
        region,
        category,
        COUNT(*) AS order_count,
        SUM(amount) AS total_revenue,
        ROW_NUMBER() OVER (
            PARTITION BY region ORDER BY SUM(amount) DESC
        ) AS rank
    FROM order_events
    GROUP BY region, category
) ranked
WHERE rank <= 3;

Applications query this view using any PostgreSQL client:

SELECT * FROM mv_top_products_by_region
WHERE region = 'us-east'
ORDER BY rank;
 region  |  category   | order_count | total_revenue | rank
---------+-------------+-------------+---------------+------
 us-east | electronics |           1 |        299.99 |    1
 us-east | home        |           2 |        289.98 |    2

Verified on RisingWave 2.8.0

The result updates continuously as new order events arrive. There is no ETL pipeline between processing and serving. Your Grafana dashboard, your Python application, and your Node.js API all connect directly to RisingWave using standard PostgreSQL drivers (JDBC, psycopg2, pg).

What This Enables

The built-in serving layer makes patterns possible that are painful with Flink:

  • Real-time dashboards connect directly to materialized views - no intermediate database
  • API backends query streaming results with sub-20ms latency using connection pooling
  • Cascading materialized views build complex pipelines where one view feeds another, all queryable at every stage
  • Ad-hoc exploration lets analysts run SQL queries against live streaming state without affecting the pipeline
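Cascading views are ordinary DDL: one materialized view selects from another, and every stage stays queryable. A sketch that rolls the per-minute funnel view from Reason 2 up to hourly totals (column names assume that mv_funnel_per_minute definition):

```sql
-- A second materialized view reads from the first; RisingWave maintains
-- both incrementally, and both can be queried at any time.
CREATE MATERIALIZED VIEW mv_funnel_per_hour AS
SELECT
    date_trunc('hour', window_start) AS hour_start,
    page_url,
    SUM(views) AS views,
    SUM(purchases) AS purchases
FROM mv_funnel_per_minute
GROUP BY date_trunc('hour', window_start), page_url;
```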

Here is a comparison table for the patterns most teams implement first when evaluating an alternative to Flink:

| Pattern | Flink | RisingWave |
|---------|-------|------------|
| Windowed aggregation | Java/Scala DataStream API or Flink SQL on JVM cluster | CREATE MATERIALIZED VIEW with TUMBLE() / HOP() |
| Stream-stream join | Custom Java with CoProcessFunction or Flink SQL temporal join | Standard SQL JOIN in a materialized view |
| Top-N ranking | Custom ProcessFunction with MapState | ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) |
| CDC ingestion | Flink CDC connector (separate project, version compatibility issues) | Built-in CDC source with CREATE SOURCE |
| Result serving | External database (PostgreSQL, Redis, Elasticsearch) | Direct SELECT on materialized views |
| Connector management | JAR dependencies, version conflicts, class loading issues | SQL CREATE SOURCE / CREATE SINK statements |
| Deployment | JAR upload, job graph submission, cluster restart for config changes | SQL DDL statements via psql |
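For instance, the stream-stream join pattern reduces to a plain JOIN inside a materialized view. A minimal sketch, assuming order_events and user_profiles sources with the columns shown:

```sql
-- RisingWave maintains the join state internally; no CoProcessFunction,
-- timers, or state descriptors are needed.
CREATE MATERIALIZED VIEW mv_orders_enriched AS
SELECT
    o.order_id,
    o.amount,
    u.region
FROM order_events AS o
JOIN user_profiles AS u ON o.user_id = u.user_id;
```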

When Flink Is Still the Right Choice

Flink remains a strong choice for specific workloads:

  • Complex event processing (CEP) using MATCH_RECOGNIZE - Flink's pattern matching capabilities are more mature
  • Custom stateful operators that require fine-grained control over state access patterns and timers
  • Extremely high throughput (millions of events per second per node) where Flink's DataStream API allows hand-optimized serialization
  • Existing investment in Flink infrastructure with dedicated platform teams and extensive custom connectors

If your team has deep Flink expertise, a dedicated platform team, and workloads that genuinely require custom Java operators, Flink's power justifies its complexity. But for the majority of streaming use cases - aggregations, joins, filtering, enrichment, and serving - SQL covers the requirements with a fraction of the operational cost.

How Does State Management Differ Between Flink and RisingWave?

Flink stores state locally on each TaskManager using RocksDB, then periodically checkpoints that state to remote storage. This means every TaskManager needs local SSDs sized for peak state, and recovery requires downloading the full state back to local disk. RisingWave stores all state in object storage from the start. There is no local state to manage, no RocksDB to tune, and no checkpoint-restore cycle that scales with state size. The Hummock storage engine, purpose-built for streaming workloads, handles compaction and garbage collection transparently.

Does RisingWave Work with Existing Kafka Deployments?

Yes. RisingWave natively connects to Apache Kafka as both a source and a sink. You define Kafka ingestion with a CREATE SOURCE statement specifying the broker address, topic, and format. The difference is that RisingWave processes and serves the results within the same system, so you do not need a separate database downstream. For teams running Kafka + Flink + PostgreSQL today, RisingWave replaces the Flink and PostgreSQL layers while keeping Kafka as the message transport.
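A minimal sketch of that Kafka ingestion - the broker address, topic name, and columns are placeholders to adapt to your deployment:

```sql
-- Ingest JSON events from a Kafka topic as a streaming source.
CREATE SOURCE clickstream (
    user_id BIGINT,
    page_url VARCHAR,
    action VARCHAR,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'clickstream-events',
    properties.bootstrap.server = 'broker1:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
```

Materialized views defined over this source begin processing immediately; no connector JARs or classpath configuration are involved.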

How Steep Is the Learning Curve for Teams Coming from Flink?

Teams proficient in SQL can be productive with RisingWave within a day. The PostgreSQL-compatible interface means existing database tools, BI platforms, and client libraries work without modification. Teams moving from Flink typically report that rewriting their Java pipelines as materialized views takes 1-2 weeks, with the SQL versions being 10-20x shorter. The RisingWave quickstart guide walks through the core concepts in under 15 minutes.

How Does RisingWave Handle Exactly-Once Semantics Without a JVM Runtime?

RisingWave provides exactly-once processing guarantees through its barrier-based checkpoint mechanism, similar in concept to the Chandy-Lamport-style barrier snapshotting that Flink uses. Barriers flow through the dataflow graph, and when all operators have processed a barrier, the checkpoint commits atomically to object storage. The difference is frequency and storage model: RisingWave checkpoints every 1 second by default, while Flink leaves checkpointing disabled by default and many production deployments use 1-5 minute intervals. Because state already resides in object storage, checkpointing does not require uploading large state snapshots, making frequent checkpoints practically free.

Conclusion

The shift from Flink to streaming SQL is not about Flink being a bad technology. Flink is a powerful, mature framework with capabilities that no other system matches in certain areas. The shift is about matching tool complexity to problem complexity.

Most streaming use cases - real-time dashboards, CDC pipelines, event-driven aggregations, streaming ETL, and live analytics - do not require custom Java operators or fine-grained state management APIs. They require SQL, reliability, and low operational overhead.

RisingWave addresses the five pain points that drive teams away from Flink:

  • Operations: One system instead of a JVM cluster with ZooKeeper, RocksDB, and external databases
  • Accessibility: PostgreSQL-compatible SQL instead of Java/Scala development
  • Recovery: Seconds instead of minutes, regardless of state size
  • Cost: Disaggregated storage on S3 instead of SSD-backed TaskManagers plus a separate serving database
  • Serving: Built-in query serving with sub-20ms latency instead of a separate database layer

Ready to try this yourself? Get started with RisingWave Cloud free - no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
