If you have ever spent a Friday night tuning JVM garbage collection flags instead of building features, you already know the pain. Apache Flink is a powerful stream processing framework, but it carries the full weight of the Java Virtual Machine: unpredictable GC pauses, complex memory model tuning, dependency conflicts, and operational overhead that scales with your cluster. For teams that want real-time data processing without becoming JVM specialists, a new generation of stream processors built in Rust and Python offers a compelling exit ramp.
This article breaks down the specific problems that JVM-based stream processing introduces, then evaluates three non-JVM alternatives: RisingWave (Rust + SQL), Arroyo (Rust), and Bytewax (Python). Each takes a different approach to eliminating the JVM from your streaming pipeline. We will compare their architectures, trade-offs, and maturity levels so you can make an informed decision.
The JVM Tax on Stream Processing
Every JVM-based stream processor, whether it is Flink, Kafka Streams, or Spark Structured Streaming, inherits the same set of operational challenges. These are not bugs. They are fundamental properties of running on the Java Virtual Machine.
Garbage Collection Pauses
The JVM's garbage collector must periodically stop application threads to reclaim memory. In a stream processing context, these "stop-the-world" pauses directly translate to latency spikes. A Flink TaskManager processing thousands of events per second can experience GC pauses ranging from tens of milliseconds (with G1GC) to multiple seconds (with older collectors on large heaps).
Modern collectors like ZGC and Shenandoah reduce pause times to single-digit milliseconds, but they come with their own trade-offs: higher CPU overhead, increased memory footprint, and configuration complexity. You end up trading one tuning problem for another. According to Uber's engineering blog on JVM tuning, large-scale services routinely require dedicated performance engineering effort just to keep GC behavior under control.
The core issue is that stream processing workloads create enormous amounts of short-lived objects (event records, intermediate aggregations, serialization buffers) that put constant pressure on the garbage collector. This is the worst-case workload profile for any generational GC.
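The pressure is easy to see in miniature. Below is a plain-Python sketch (the event shape and allocation count are illustrative, not Flink internals): each event materializes a decoded string, a parsed record, and a serialization buffer, all of which become garbage moments later.

```python
import json

def process_event(raw: bytes, totals: dict) -> int:
    """Handle one event; return how many short-lived objects it created."""
    text = raw.decode("utf-8")                 # allocation 1: decoded string
    record = json.loads(text)                  # allocation 2: parsed record
    key = record["customer_id"]
    totals[key] = totals.get(key, 0) + record["amount"]
    _summary = json.dumps({key: totals[key]})  # allocation 3: output buffer
    return 3  # all three are garbage as soon as this function returns

totals: dict = {}
events = [b'{"customer_id": "c1", "amount": 10}'] * 1000
short_lived = sum(process_event(e, totals) for e in events)
print(short_lived, totals["c1"])
```

A thousand events yield three thousand temporary objects; at hundreds of thousands of events per second, a generational collector is scanning and reclaiming this churn continuously.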
Memory Model Complexity
Flink implements its own memory management layer on top of the JVM, dividing memory into framework heap, task heap, managed memory, network buffers, and JVM metaspace. Configuring this correctly requires understanding both Flink's memory model and the JVM's memory model simultaneously.
A typical Flink flink-conf.yaml includes settings like:
taskmanager.memory.process.size: 4096m
taskmanager.memory.flink.size: 3072m
taskmanager.memory.managed.fraction: 0.4
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.jvm-overhead.fraction: 0.1
Get any of these wrong, and you face either OutOfMemoryError crashes or wasted resources. The double-layered memory model (Flink on top of JVM) means that debugging memory issues requires profiling at two levels simultaneously. Container deployments add a third layer: you must ensure the JVM respects cgroup memory limits, which historically has been a source of OOM kills in Kubernetes.
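The arithmetic behind those fractions can be sketched in a few lines. This is a deliberately simplified model of how Flink derives its memory pools from the settings above; real Flink additionally applies per-pool min/max clamps and off-heap pools, so treat the numbers as approximations.

```python
def flink_memory_breakdown(process_mb: int, flink_mb: int,
                           managed_fraction: float, network_fraction: float,
                           jvm_overhead_fraction: float) -> dict:
    """Simplified view of Flink's TaskManager memory derivation.

    Real Flink clamps each pool between configurable min/max bounds;
    this sketch only applies the fractions from flink-conf.yaml.
    """
    managed = flink_mb * managed_fraction       # RocksDB / batch buffers
    network = flink_mb * network_fraction       # shuffle buffers
    jvm_overhead = process_mb * jvm_overhead_fraction
    # What remains inside flink.size is split between framework and task heap.
    heap = flink_mb - managed - network
    return {
        "managed_mb": managed,
        "network_mb": network,
        "jvm_overhead_mb": jvm_overhead,
        "heap_mb": heap,
    }

breakdown = flink_memory_breakdown(4096, 3072, 0.4, 0.1, 0.1)
print(breakdown)
```

Even this toy version shows the coupling: changing one fraction silently resizes the heap, which is why a single misjudged setting can swing a TaskManager from OOM crashes to wasted gigabytes.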
Classpath and Dependency Conflicts
Java's classpath mechanism is notoriously fragile. Flink jobs frequently encounter version conflicts between user code dependencies and Flink's own libraries. Common examples include conflicting versions of Guava, Jackson, Netty, or SLF4J bindings. Flink provides classloader isolation through its child-first loading strategy, but this does not solve every conflict, and debugging ClassNotFoundException or NoSuchMethodError in production is time-consuming.
The connector ecosystem compounds this problem. Adding a Kafka connector, an Iceberg sink, and an Avro serializer to a single Flink job can trigger transitive dependency conflicts that require hours of Maven exclusion rules to resolve.
Operational Overhead
Running a Flink cluster in production means managing:
- JVM version compatibility across your fleet
- Heap dump analysis when memory issues arise (multi-gigabyte .hprof files)
- JMX monitoring configuration for GC metrics
- Rolling restarts that account for JVM warm-up time (JIT compilation)
- Container resource limits tuned for JVM overhead
Each of these is solvable, but collectively they represent a significant operational tax that has nothing to do with your actual stream processing logic.
Non-JVM Alternatives: A New Generation
The stream processing landscape has shifted. Several projects now offer real-time data processing without any JVM dependency. Here are the three most notable.
RisingWave: A Streaming Database Built in Rust
RisingWave is a distributed SQL streaming database written entirely in Rust. Unlike Flink, which is a processing framework that requires you to build and deploy jobs, RisingWave is a database. You connect to it with any PostgreSQL client, write SQL, and the system continuously maintains your query results as materialized views.
Architecture. RisingWave uses a decoupled compute-storage architecture. Compute nodes handle stream processing, while persistent state is stored in object storage (S3, GCS, or MinIO). This design eliminates the need to provision local SSDs for state backends, a major cost driver in Flink deployments.
Why no JVM matters here. Rust gives RisingWave predictable memory usage with zero garbage collection pauses. Rust's ownership model determines every allocation's lifetime at compile time, so memory is freed deterministically at runtime the moment its owner goes out of scope. There is no GC tuning, no heap sizing, and no stop-the-world pauses. For a workload that creates millions of temporary objects per second, this is a fundamental advantage.
SQL-first approach. RisingWave uses PostgreSQL-compatible SQL as its only interface. Creating a streaming pipeline is as simple as:
-- Create a source from Kafka
CREATE SOURCE orders (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL,
order_time TIMESTAMP
) WITH (
connector = 'kafka',
topic = 'orders',
properties.bootstrap.server = 'localhost:9092'
) FORMAT PLAIN ENCODE JSON;
-- Create a materialized view that continuously updates
CREATE MATERIALIZED VIEW revenue_per_customer AS
SELECT
customer_id,
COUNT(*) AS total_orders,
SUM(amount) AS total_revenue
FROM orders
GROUP BY customer_id;
-- Query the result at any time
SELECT * FROM revenue_per_customer WHERE total_revenue > 1000;
There is no job packaging, no JAR submission, no classpath configuration. You write SQL and the system handles execution, state management, and fault tolerance.
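Because the wire protocol is PostgreSQL's, any Postgres driver can read the materialized view. A minimal sketch using psycopg2 (an assumption: psycopg2 is installed, RisingWave is listening on its default port 4566 with the default `root` user and `dev` database, and the view above exists):

```python
TOP_CUSTOMERS_QUERY = """
    SELECT customer_id, total_revenue
    FROM revenue_per_customer
    WHERE total_revenue > 1000
    ORDER BY total_revenue DESC
"""

def fetch_top_customers(host: str = "localhost", port: int = 4566):
    """Read the continuously maintained view like any Postgres table."""
    import psycopg2  # assumes psycopg2-binary is installed
    with psycopg2.connect(host=host, port=port,
                          user="root", dbname="dev") as conn:
        with conn.cursor() as cur:
            cur.execute(TOP_CUSTOMERS_QUERY)
            return cur.fetchall()
```

The same connection string works for psql, DBeaver, or an application's existing Postgres client library.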
Maturity. RisingWave has been in production since 2022, with hundreds of deployments across companies of various sizes. It supports a wide range of connectors (Kafka, Pulsar, Kinesis, PostgreSQL CDC, MySQL CDC, S3, Iceberg) and advanced SQL features including temporal joins, window functions, and user-defined functions in Python and Java. The project has over 7,000 GitHub stars and an active open-source community.
Arroyo: Rust-Native Stream Processing Engine
Arroyo is a distributed stream processing engine written in Rust, designed for high-throughput, low-latency workloads. It supports SQL as a primary interface but also offers a Rust API for custom operators.
Architecture. Arroyo compiles SQL queries into optimized Rust dataflow programs using Apache Arrow as its in-memory data format. This means data stays in a columnar format throughout the pipeline, avoiding the serialization overhead common in JVM-based systems. Arroyo is designed for serverless operation, with support for scaling to zero and rapid startup times.
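The columnar idea behind Arrow can be illustrated without pyarrow: aggregating one field touches a single contiguous array instead of walking every row object. This is a plain-Python sketch of the layout difference, not Arroyo's actual internals.

```python
# Row-oriented: one object per event, as a JVM engine's POJOs are laid out.
rows = [{"customer_id": i % 3, "amount": float(i)} for i in range(6)]
row_total = sum(r["amount"] for r in rows)  # visits 6 objects, field by field

# Column-oriented: one contiguous array per field, as Arrow stores batches.
columns = {
    "customer_id": [r["customer_id"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
col_total = sum(columns["amount"])  # scans a single array

print(row_total, col_total)
```

Both totals are identical, but the columnar scan is cache-friendly and needs no per-event deserialization, which is where much of Arroyo's throughput advantage comes from.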
Strengths. Arroyo excels at event-time processing with advanced windowing support. Its SQL layer covers most common streaming patterns, and benchmarks published by the Arroyo project report throughput exceeding Flink's by 5x or more on comparable hardware. The single-binary deployment model makes it straightforward to run locally or in containers.
Considerations. Arroyo was acquired by Cloudflare to power serverless stream processing on the Cloudflare Developer Platform. The engine remains open source and self-hostable, but the project's primary development focus is now on Cloudflare integration. The connector ecosystem is smaller than RisingWave's or Flink's, and there is no built-in state serving (you cannot query results directly without an external sink).
Bytewax: Python-Native Stream Processing
Bytewax takes a different approach. It is a Python framework backed by a Rust-based execution engine built on Timely Dataflow. The goal is to bring stream processing to Python developers without requiring them to learn Java, Scala, or SQL.
Architecture. You define dataflows in Python using a functional API. Bytewax handles distributed execution, state management, and fault tolerance under the hood, with the Rust runtime providing performance that pure Python cannot match.
A simple Bytewax pipeline looks like:
import json

import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.connectors.kafka import KafkaSource
from bytewax.connectors.stdio import StdOutSink

flow = Dataflow("order_processing")
# bytewax >= 0.17 uses the operators API; each step gets a unique step id
inp = op.input("orders", flow, KafkaSource(["localhost:9092"], ["orders"]))
parsed = op.map("parse", inp, lambda msg: json.loads(msg.value))
big = op.filter("big_orders", parsed, lambda order: order["amount"] > 100)
op.output("out", big, StdOutSink())
Strengths. Bytewax integrates directly with the Python data ecosystem: NumPy, pandas, scikit-learn, and PyTorch. This makes it a natural fit for ML feature pipelines and data science workloads where the processing logic is already written in Python. Memory consumption is reported to be 7 to 25 times lower than Flink for equivalent workloads.
Considerations. Bytewax's development pace has slowed. The last open-source release (v0.21.1) shipped in November 2024, and key deployment tools like the waxctl CLI have been archived. The project is not SQL-first, so teams looking for a declarative query interface would need to layer something on top. The connector ecosystem is limited compared to RisingWave or Flink.
Comparison: Which Alternative Fits Your Use Case?
| Capability | RisingWave | Arroyo | Bytewax |
| --- | --- | --- | --- |
| Language | Rust | Rust | Rust + Python |
| Interface | PostgreSQL SQL | SQL + Rust API | Python API |
| State management | Built-in (object storage) | Built-in (checkpointing) | Built-in (recovery) |
| Query results directly | Yes (it is a database) | No (needs external sink) | No (needs external sink) |
| Connector ecosystem | Large (Kafka, Pulsar, CDC, S3, Iceberg) | Medium | Small |
| Deployment | Distributed cluster or Cloud | Single binary or K8s | Python process or K8s |
| Best for | General stream processing, analytics | High-throughput event processing | ML pipelines, Python teams |
| PostgreSQL compatible | Yes | No | No |
| Active development | Yes (frequent releases) | Yes (Cloudflare-focused) | Slow (last release Nov 2024) |
When to Choose RisingWave
RisingWave is the strongest choice if you want a complete streaming platform that replaces both your stream processor and your serving layer. Because it is a database, you do not need to sink results into PostgreSQL or Redis for querying. You write SQL to define your pipelines, and you query results with the same SQL interface. This eliminates an entire category of infrastructure.
It is also the right choice if your team already knows SQL but does not know Java or Rust. The PostgreSQL compatibility means existing tools (psql, DBeaver, any PostgreSQL driver) work out of the box. The RisingWave documentation covers migration from Flink SQL in detail, and most Flink SQL jobs can be ported directly.
When to Choose Arroyo
Arroyo fits teams that need raw throughput on event-time windowed computations and are comfortable operating a newer system with a smaller community. If you are building on Cloudflare's platform, Arroyo is the native choice. If you need custom Rust operators for performance-critical logic, Arroyo's dual SQL/Rust interface gives you that flexibility.
When to Choose Bytewax
Bytewax is the right pick for Python-centric data science teams building ML feature pipelines where the processing logic involves pandas, scikit-learn, or PyTorch. The Python API feels natural for data scientists. However, the uncertain development trajectory means you should evaluate whether the project's pace of updates meets your needs for production use.
Migrating from Flink: What to Expect
If you are currently running Flink and considering a move to a non-JVM alternative, here is what the migration typically involves.
From Flink SQL to RisingWave SQL
The syntax differences between Flink SQL and RisingWave SQL are minimal for most use cases. RisingWave uses standard PostgreSQL syntax, so common patterns translate directly:
| Flink SQL | RisingWave SQL |
| --- | --- |
| CREATE TABLE ... WITH ('connector' = 'kafka', ...) | CREATE SOURCE ... WITH (connector = 'kafka', ...) |
| INSERT INTO sink_table SELECT ... | CREATE MATERIALIZED VIEW ... AS SELECT ... or CREATE SINK ... |
| TUMBLE(order_time, INTERVAL '1' HOUR) | TUMBLE(order_time, INTERVAL '1 hour') |
| GROUP BY TUMBLE(...) | GROUP BY window_start, window_end |
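As a concrete illustration of the TUMBLE rows above, a one-hour tumbling revenue aggregation might look like this in each dialect (sketches only; connector setup and exact column types omitted):

```sql
-- Flink SQL: TUMBLE as a group-window function in GROUP BY
SELECT customer_id, SUM(amount) AS revenue
FROM orders
GROUP BY customer_id, TUMBLE(order_time, INTERVAL '1' HOUR);

-- RisingWave: TUMBLE as a table function in the FROM clause
SELECT window_start, window_end, customer_id, SUM(amount) AS revenue
FROM TUMBLE(orders, order_time, INTERVAL '1 hour')
GROUP BY window_start, window_end, customer_id;
```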
The most significant conceptual difference is that RisingWave uses materialized views instead of Flink's INSERT INTO pattern. A materialized view in RisingWave is a continuously updated query result that you can read at any time with a standard SELECT statement.
From Flink Java/Scala to Non-JVM
If your Flink pipelines are written in Java or Scala using the DataStream API, migration requires rewriting the processing logic. For RisingWave, this means translating your pipeline into SQL. For Arroyo, you can use SQL or Rust. For Bytewax, you rewrite in Python.
The good news is that most stream processing logic (filters, aggregations, joins, windows) maps cleanly to SQL, which is more concise than equivalent Java code. A 200-line Flink Java job often becomes 10-20 lines of SQL.
What does "SQL-first stream processing" mean?
SQL-first stream processing means that SQL is the primary and often only interface for defining streaming pipelines. Instead of writing imperative code in Java or Python that describes how to process each event, you write declarative SQL queries that describe what results you want. The system figures out how to compute and maintain those results incrementally as new data arrives. RisingWave is a SQL-first streaming database where you use PostgreSQL-compatible SQL to create sources, define transformations as materialized views, and query results, all without writing a single line of application code.
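What "maintain those results incrementally" means can be sketched in a few lines of Python: rather than rescanning all history, the engine keeps per-key state and applies an O(1) delta for each arriving event. This is a toy model of incremental view maintenance, not RisingWave's implementation.

```python
class RevenuePerCustomer:
    """Toy incremental materialized view: COUNT(*) and SUM(amount) per key."""

    def __init__(self):
        self.state = {}  # customer_id -> (total_orders, total_revenue)

    def apply(self, event: dict) -> None:
        """Apply one event as a delta -- no rescan of past events."""
        orders, revenue = self.state.get(event["customer_id"], (0, 0.0))
        self.state[event["customer_id"]] = (orders + 1,
                                            revenue + event["amount"])

    def query(self, customer_id):
        """Read the current result, like SELECT against the view."""
        return self.state.get(customer_id)

view = RevenuePerCustomer()
for e in [{"customer_id": "c1", "amount": 700.0},
          {"customer_id": "c1", "amount": 400.0},
          {"customer_id": "c2", "amount": 50.0}]:
    view.apply(e)

print(view.query("c1"))
```

A SQL-first engine derives this kind of delta program automatically from the `CREATE MATERIALIZED VIEW` statement, along with its persistence and fault tolerance.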
Can RisingWave replace Apache Flink for all use cases?
RisingWave covers approximately 80% of common stream processing use cases, particularly those involving SQL-expressible transformations: filtering, aggregation, joins, windowing, and CDC processing. For complex event processing patterns that require custom stateful operators written in Java, or for batch-and-stream unified processing, Flink still has an edge. However, for the majority of teams building real-time dashboards, monitoring systems, feature pipelines, or event-driven microservices, RisingWave provides equivalent functionality with dramatically less operational complexity. The Nexmark benchmark results show RisingWave outperforming Flink in 22 out of 27 standard streaming queries.
How do Rust-based stream processors handle memory differently from JVM-based ones?
Rust uses an ownership-and-borrowing memory model that enforces memory safety at compile time without a garbage collector. When a Rust program allocates memory for an event record, that memory is freed deterministically when the owning variable goes out of scope. There are no GC pauses, no heap tuning parameters, and no risk of stop-the-world freezes under load. This makes Rust-based processors like RisingWave and Arroyo particularly well-suited for latency-sensitive workloads where consistent sub-millisecond processing times matter. The trade-off is that Rust has a steeper learning curve for contributors, but as an end user interacting through SQL, this is invisible to you.
Is Bytewax production-ready for large-scale stream processing?
Bytewax offers a compelling Python-native approach to stream processing, but its production readiness requires careful evaluation as of early 2026. The last open-source release (v0.21.1) was in November 2024, and key operational tools have been archived. For small to medium-scale ML feature pipelines where your team is Python-centric, Bytewax can be a good fit. For large-scale, mission-critical streaming workloads, RisingWave or Arroyo offer more active development, broader connector ecosystems, and proven production track records at scale.
Conclusion
The JVM has been the default runtime for stream processing for over a decade, but it is no longer the only option. GC pauses, memory model complexity, classpath conflicts, and operational overhead are real costs that non-JVM alternatives eliminate at the architectural level.
Here are the key takeaways:
- RisingWave is the most mature SQL-first alternative. It is a full streaming database (not just a processing engine), supports PostgreSQL-compatible SQL, stores state in object storage, and lets you query results directly. If you want to replace Flink and simplify your stack, start here.
- Arroyo offers excellent raw performance in Rust with a SQL interface, but its focus has shifted toward Cloudflare's platform. It is best for high-throughput event processing where you need the option of custom Rust operators.
- Bytewax serves Python-centric data science teams, but its development pace has slowed. Evaluate carefully before committing to production use.
- For most teams, the migration from Flink SQL to RisingWave SQL is straightforward, with minimal syntax changes and significant operational simplification.
The stream processing ecosystem is moving toward simpler, more accessible tools. SQL is winning as the universal interface, and Rust is proving itself as the systems language for data infrastructure. The JVM is not going away, but for new streaming projects, it is no longer the default choice.
Ready to try stream processing without the JVM? Get started with RisingWave in 5 minutes. Quickstart →
Join our Slack community to ask questions and connect with other stream processing developers.

