RisingWave vs Arroyo: Rust-Based Stream Processors Compared

RisingWave and Arroyo are both open-source, Rust-based stream processors licensed under Apache 2.0. RisingWave is a PostgreSQL-compatible streaming database that exposes a full SQL interface for defining and querying continuously updated materialized views. Arroyo is a distributed stream processing engine that supports SQL pipelines and a REST-based pipeline API, with its cloud offering now embedded in Cloudflare's developer platform. If you are choosing between two Rust-built systems that avoid the JVM tax and take SQL seriously, this comparison gives you the technical detail to decide.

Why Both Systems Chose Rust

Rust is increasingly the foundation of choice for new data infrastructure. Its memory safety guarantees eliminate entire classes of bugs that plague C++ systems while delivering performance that matches or exceeds JVM-based alternatives without garbage-collection pauses.

RisingWave's codebase is 92% Rust. Arroyo's is 85% Rust. Both teams made that choice deliberately: stream processing involves high-throughput event ingestion, complex stateful operations, and latency requirements where GC pauses are unacceptable. Rust delivers the performance of C with the safety guarantees that make large engineering teams productive.

The shared language choice does not mean the two systems are similar. Their architecture, SQL models, deployment postures, and target users diverge significantly.

At a Glance: Feature Comparison

Dimension | RisingWave | Arroyo
Primary language | Rust (92%) | Rust (85%)
License | Apache 2.0 | Apache 2.0
SQL dialect | PostgreSQL-compatible | Apache DataFusion-based
Pipeline definition | SQL only (CREATE MATERIALIZED VIEW) | SQL + REST pipeline API
Query interface | Any PostgreSQL client (psql, JDBC, psycopg2) | Web UI + REST API
Stateful operations | Materialized views, windowed aggregations, joins | Windows, joins, aggregations, async UDFs
State backend | Hummock (LSM-tree on S3) | Checkpointing to object storage
Kafka integration | Native source and sink | Native connector, Confluent partner
Exactly-once | Yes | Yes
Cloud offering | RisingWave Cloud (managed) | Cloudflare Pipelines (beta)
GitHub stars (April 2026) | ~8.9k | ~4.9k
Latest stable release | v2.8.1 (March 2026) | v0.15.0 (December 2025)
Production readiness | Generally available | Pre-1.0
UDFs | SQL, Python | Python, Rust, async

Architecture: Streaming Database vs Processing Engine

The architectural split between RisingWave and Arroyo is the most consequential difference.

RisingWave: A Streaming Database

RisingWave is a streaming database. It has a PostgreSQL-compatible wire protocol, a query planner, and a built-in serving layer. You connect with psql and write SQL. Results are stored as materialized views that update continuously as new data arrives, and you query those views exactly as you would query any table.

The storage architecture decouples compute from state. RisingWave's Hummock storage engine writes all materialized view state as sorted SSTables to S3-compatible object storage. Compute nodes are stateless: they process events, push state to Hummock, and serve queries by reading from Hummock. This means you can add compute nodes without rebalancing state, and you can scale storage independently by simply writing more to S3.

The system includes four independent layers: compute nodes, compactor nodes (background SSTable merging), meta service (checkpoint coordination), and object storage. Each scales independently. This design is covered in detail in "What is a streaming database".

Arroyo: A Processing Engine with SQL

Arroyo is a distributed stream processing engine, not a database. It executes stateful streaming pipelines defined in SQL or via its REST API. It does not have a PostgreSQL wire protocol or a persistent serving layer you can query with a SQL client.

Arroyo's SQL engine is built on Apache Arrow and Apache DataFusion, a move the team made in version 0.10.0 (April 2024) that delivered a 3x performance improvement and allowed distribution as a single binary. The system uses the Dataflow model: pipelines compile to DAGs of operators that execute in parallel across worker nodes. State is checkpointed to object storage for fault tolerance.

Arroyo's web UI lets you write and run SQL pipelines interactively, monitor throughput and latency, and inspect state. For programmatic pipeline management, a REST API lets you deploy, start, stop, and query pipelines without touching the UI.

SQL Support

Both systems take SQL seriously, but they serve different SQL needs.

RisingWave: PostgreSQL Dialect, Full Query Interface

RisingWave's SQL is PostgreSQL-compatible. You write DDL to define sources and sinks, then define streaming pipelines as materialized views. Any PostgreSQL client connects natively.

Creating a streaming aggregation over Kafka data:

-- Define the Kafka source
CREATE TABLE arroyo_orders (
    order_id    BIGINT,
    user_id     BIGINT,
    product_id  BIGINT,
    amount      DOUBLE PRECISION,
    region      VARCHAR,
    status      VARCHAR,
    order_ts    TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

-- Define the continuously updating aggregation
CREATE MATERIALIZED VIEW arroyo_regional_revenue AS
SELECT
    region,
    COUNT(*)          AS order_count,
    SUM(amount)       AS total_revenue,
    AVG(amount)       AS avg_order_value
FROM arroyo_orders
GROUP BY region;

Querying the result is a plain SELECT:

SELECT * FROM arroyo_regional_revenue
ORDER BY total_revenue DESC
LIMIT 10;

The view is always current. RisingWave maintains it incrementally: when a new row arrives in arroyo_orders, the aggregation updates in place rather than re-scanning all historical rows. This is the defining property of incremental materialized views.

RisingWave also supports cascading materialized views, where one view reads from another, enabling multi-stage pipelines entirely in SQL:

-- First-stage: flag high-value users
CREATE MATERIALIZED VIEW arroyo_high_value_users AS
SELECT
    user_id,
    COUNT(*)   AS total_orders,
    SUM(amount) AS lifetime_value
FROM arroyo_orders
GROUP BY user_id
HAVING SUM(amount) > 1000;

-- Second-stage: join with clickstream for engagement signals
CREATE MATERIALIZED VIEW arroyo_user_activity_pipeline AS
SELECT
    o.user_id,
    o.total_orders,
    o.lifetime_value,
    c.recent_page_views
FROM arroyo_high_value_users o
JOIN (
    SELECT user_id, COUNT(*) AS recent_page_views
    FROM arroyo_clickstream
    GROUP BY user_id
) c ON o.user_id = c.user_id;

This two-stage pipeline runs entirely in SQL with no custom code, no DAG configuration, and no separate serving infrastructure.

Arroyo: DataFusion Dialect, Pipeline-Oriented

Arroyo's SQL is based on Apache DataFusion and follows ANSI SQL conventions. It supports over 300 functions including window aggregations, tumbling/sliding/session windows, joins, and user-defined functions. The interface is primarily a pipeline: you submit a SQL query that defines the transformation, and Arroyo runs it as a continuous job.

An equivalent Arroyo pipeline might look like:

-- Arroyo SQL (submitted via web UI or REST API)
SELECT
    region,
    count(*)   AS order_count,
    sum(amount) AS total_revenue,
    avg(amount) AS avg_order_value
FROM orders
GROUP BY region;

The key difference is the consumption model. In Arroyo, the results are pushed to a sink (Kafka topic, S3, database). You do not query Arroyo the way you query a database; you configure where the output goes and consume it from there.
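To make the push-to-sink model concrete, here is a minimal sketch of how the same aggregation might be routed to a Kafka topic in Arroyo SQL. The table name regional_revenue_out and the topic are hypothetical, and the connector option names (bootstrap_servers, type, format) are assumptions that should be checked against the Arroyo Kafka connector documentation for your version. Because a non-windowed GROUP BY produces updates rather than appends, the sketch assumes an update-capable format such as Debezium-style JSON.

```sql
-- Sketch only: names and connector options are assumptions, not verified config.
-- The sink table describes where results are written, not a queryable table.
CREATE TABLE regional_revenue_out (
    region          TEXT,
    order_count     BIGINT,
    total_revenue   DOUBLE,
    avg_order_value DOUBLE
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'broker:9092',
    topic = 'regional-revenue-output',
    type = 'sink',
    format = 'debezium_json'
);

-- The pipeline itself: results are continuously pushed into the sink.
INSERT INTO regional_revenue_out
SELECT
    region,
    count(*)    AS order_count,
    sum(amount) AS total_revenue,
    avg(amount) AS avg_order_value
FROM orders
GROUP BY region;
```

Downstream consumers then read regional-revenue-output from Kafka; there is no SELECT against Arroyo itself.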

Arroyo's REST pipeline API lets you manage pipelines programmatically:

curl -X POST http://arroyo-api:8080/api/v1/pipelines \
  -H "Content-Type: application/json" \
  -d '{
    "name": "regional-revenue",
    "query": "SELECT region, count(*), sum(amount) FROM orders GROUP BY region",
    "parallelism": 4
  }'

This API-first approach makes Arroyo well-suited for platforms and tooling that manage pipelines programmatically. But it means results are always pushed to an external system, not queried in place.

Key SQL difference: RisingWave lets you SELECT from materialized views using any PostgreSQL client at any time, making the streaming database also serve as the query endpoint. Arroyo requires an external sink for output consumption.

Stateful Operations

Stateful stream processing is where the engineering gets interesting. Both systems support windowed aggregations, joins, and exactly-once semantics, but their state models differ.

RisingWave: Hummock on S3

RisingWave's state lives in Hummock, its purpose-built LSM-tree storage engine backed by object storage. Every materialized view's state is stored as SSTables on S3 (or GCS or Azure Blob). This has three practical consequences:

First, there is no local disk to provision or tune. State scales with your object storage, which is effectively unlimited.

Second, checkpointing is cheap. Because state is already in object storage, a checkpoint does not require uploading a large, separate snapshot over the network, and recovery amounts to reading from the last consistent SSTable set in Hummock.

Third, compute and state are decoupled. Adding a compute node does not require rebalancing any local state. Each node reads from and writes to Hummock independently.

Windowing in RisingWave uses the TUMBLE and HOP functions. The results are maintained incrementally:

-- Tumbling window: event counts per page, per 5-minute window
CREATE MATERIALIZED VIEW arroyo_tumble_page_views AS
SELECT
    window_start,
    window_end,
    page,
    COUNT(*) AS view_count
FROM TUMBLE(arroyo_clickstream, ts, INTERVAL '5' MINUTE)
GROUP BY window_start, window_end, page;

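By default, this materialized view emits updated counts for each window as events arrive. When a watermark is defined on ts, RisingWave can treat a window as closed once the watermark advances past window_end, which lets it finalize the result and clean up that window's state. The details of how watermarks drive window closure are covered in "Watermarks in stream processing".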
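The HOP function mentioned above follows the same pattern as TUMBLE, with one extra argument. As a minimal sketch, the view below counts page views over 5-minute windows that advance every minute, so each event contributes to five overlapping windows. The view name arroyo_hop_page_views is hypothetical; the argument order (hop size first, then window size) is the one RisingWave documents for HOP.

```sql
-- Hopping (sliding) window: 5-minute windows that start every 1 minute.
-- Arguments: HOP(table, time_column, hop_size, window_size)
CREATE MATERIALIZED VIEW arroyo_hop_page_views AS
SELECT
    window_start,
    window_end,
    page,
    COUNT(*) AS view_count
FROM HOP(arroyo_clickstream, ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE)
GROUP BY window_start, window_end, page;
```

Hopping windows multiply state roughly by window_size / hop_size, so a small hop over a long window is worth sizing before deploying.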

Arroyo: Checkpointed State

Arroyo manages state through its Dataflow runtime. Each pipeline operator holds state in memory and periodically checkpoints to object storage for fault tolerance. Recovery restores from the last successful checkpoint.

Arroyo supports tumbling, sliding, and session windows in its SQL dialect. Async UDFs allow pipeline stages to call external services (databases, ML endpoints) as part of the stateful computation, a capability RisingWave does not currently offer in the same form.
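For comparison with the RisingWave TUMBLE example, here is a sketch of how windows are typically written in Arroyo SQL, where the window is a function in the SELECT list rather than a table function in FROM. The clickstream table is assumed to be defined elsewhere as a source with an event-time column; the window sizes are illustrative.

```sql
-- Tumbling window: page views per page, per 5-minute window.
SELECT
    page,
    tumble(interval '5 minute') AS window,
    count(*) AS view_count
FROM clickstream
GROUP BY window, page;

-- Session window: events per user, where a session ends
-- after 30 seconds without activity from that user.
SELECT
    user_id,
    session(interval '30 second') AS window,
    count(*) AS event_count
FROM clickstream
GROUP BY window, user_id;
```

As with any Arroyo pipeline, each query would be submitted as its own job with a sink configured for the output.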

For exactly-once processing, both systems implement barrier-based checkpointing protocols inspired by the Chandy-Lamport algorithm. Arroyo pairs this with transactional Kafka sinks for end-to-end exactly-once delivery.

Kafka Integration

Kafka is the dominant streaming transport for production systems, so integration quality matters.

RisingWave Kafka Integration

RisingWave treats Kafka as a first-class connector. You define a Kafka source as a table and consume it with SQL:

CREATE TABLE arroyo_clickstream (
    user_id    BIGINT,
    page       VARCHAR,
    event_type VARCHAR,
    ts         TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'clickstream',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

RisingWave supports Avro, Protobuf, JSON, CSV, and Parquet encodings natively. For sinks, you write to Kafka topics using CREATE SINK:

CREATE SINK arroyo_revenue_sink
FROM arroyo_regional_revenue
WITH (
    connector = 'kafka',
    topic = 'regional-revenue-output',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

Consumer group offsets are committed transactionally with checkpoint barriers, ensuring exactly-once delivery across the source-to-sink pipeline.

Arroyo Kafka Integration

Arroyo also ships with a native Kafka connector and is a Confluent Connect partner. You define the Kafka source in Arroyo's SQL DDL, and Arroyo manages offset tracking and consumer group coordination.

Arroyo's Kafka integration includes transactional producer support for exactly-once sink semantics. Its Confluent partnership means it works alongside Confluent Schema Registry for Avro and Protobuf schema management.
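As a rough counterpart to the RisingWave source definition above, a Kafka source in Arroyo's SQL DDL might look like the following sketch. The option names (bootstrap_servers, type, format, 'source.offset', event_time_field) are assumptions based on Arroyo's Kafka connector and should be verified against the documentation for the version you run; the column list mirrors the orders schema used earlier in this article.

```sql
-- Sketch only: connector option names are assumptions to verify.
CREATE TABLE orders (
    order_id   BIGINT,
    user_id    BIGINT,
    product_id BIGINT,
    amount     DOUBLE,
    region     TEXT,
    status     TEXT,
    order_ts   TIMESTAMP
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'broker:9092',
    topic = 'orders',
    type = 'source',
    format = 'json',
    'source.offset' = 'earliest',    -- start from the beginning of the topic
    event_time_field = 'order_ts'    -- use the payload timestamp as event time
);
```

The structure is similar to RisingWave's CREATE TABLE ... WITH (...), but the table here exists only as a pipeline input; it cannot be queried ad hoc.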

Both systems handle Kafka well at the connector level. The difference is in what happens after Kafka: RisingWave turns Kafka events into queryable materialized views; Arroyo turns them into pipeline outputs pushed to sinks.

Production Readiness

This is the most significant practical distinction between the two systems.

RisingWave: Generally Available

RisingWave v1.0 shipped in September 2023. As of March 2026, the project is at v2.8.1 with 14,000+ commits, 114 releases, and production deployments at companies across financial services, e-commerce, and data infrastructure. RisingWave Cloud, the managed offering, provides a fully operational cluster with monitoring, backup, and auto-scaling.

The project has a stable API surface, migration guides between versions, and documented operational runbooks. The community on Slack and GitHub is active, with 8.9k GitHub stars.

Arroyo: Pre-1.0, Cloudflare-Backed

Arroyo is at v0.15.0 as of December 2025. It has not shipped a 1.0 release. This matters: pre-1.0 versioning signals that the team reserves the right to make breaking changes to APIs, configuration formats, and SQL syntax between releases.

The April 2025 acquisition by Cloudflare changed Arroyo's trajectory significantly. The Arroyo engine powers Cloudflare Pipelines, a serverless stream processing product in beta. The open-source engine remains Apache 2.0 and self-hostable. But the primary development and product focus has shifted toward Cloudflare's platform, which means the roadmap is driven by Cloudflare's requirements rather than the general-purpose stream processing market.

The community has ~4.9k GitHub stars and an active Discord. Development velocity was strong through 2024 but slowed after the Cloudflare acquisition, with one release (v0.15.0) in the second half of 2025.

Practical guidance: If you are deploying to production today and need a stable API, RisingWave's GA status and version history make it the lower-risk choice. If you are building on Cloudflare's developer platform, Arroyo's integration with Cloudflare Queues, Workers, and R2 may make it the natural fit.

Open Source and Cloud Offerings

Aspect | RisingWave | Arroyo
License | Apache 2.0 | Apache 2.0
Self-hostable | Yes (Docker, Kubernetes, binary) | Yes (Docker, Kubernetes, binary)
Managed cloud | RisingWave Cloud | Cloudflare Pipelines (beta)
Cloud pricing model | Compute + storage credits | Serverless (Cloudflare pricing)
Vendor lock-in risk | Low (open core, portable state) | Low for self-hosted; Cloudflare platform for managed

Both systems are genuinely open-source under Apache 2.0. Neither imposes a source-available or BSL license on core features.

RisingWave Cloud is a fully managed deployment that handles provisioning, upgrades, and monitoring. State lives in your own S3 bucket or RisingWave's managed storage depending on configuration. The separation of compute and storage means you can migrate data out of RisingWave Cloud by accessing the Hummock SSTables directly.

Arroyo's managed path is Cloudflare Pipelines, which is in beta as of early 2026. It targets serverless workloads on Cloudflare's network and integrates tightly with Cloudflare's ecosystem (R2, Queues, Workers). For teams already committed to Cloudflare's platform, this is compelling. For teams with a multi-cloud or on-premises requirement, the self-hosted Arroyo engine is the only option.

Community and Ecosystem

Community size matters less as a raw number than for what it correlates with: connector availability, answered Stack Overflow questions, third-party integrations, and the likelihood that your edge case has already been solved by someone else.

RisingWave has approximately 8.9k GitHub stars, 14,000+ commits, 752 forks, and an active Slack community. The project ships regular release notes, maintains comprehensive documentation at docs.risingwave.com, and has connector integrations with Kafka, Kinesis, MySQL, PostgreSQL, MongoDB, Redis, S3, and more.

Arroyo has approximately 4.9k GitHub stars, 819 commits, and an active Discord. Connector coverage includes Kafka, Kinesis, MQTT, RabbitMQ, Postgres, MySQL, Redis, and Iceberg. The documentation was maintained actively through early 2025 and continues to be updated.

The gap in GitHub stars and commit volume reflects a gap in community maturity. That does not make Arroyo a poor choice, but it does mean you are more likely to hit an undocumented edge case and less likely to find existing community-sourced solutions.

When to Choose RisingWave

Choose RisingWave when:

  • Your team wants to query streaming results with a PostgreSQL client using standard SQL, without routing output to a separate serving store.
  • You need a generally available system with a stable API and version history for a production deployment.
  • Your workload involves complex multi-stage pipelines best expressed as cascading materialized views.
  • You want disaggregated storage on S3 with independent compute and storage scaling.
  • You need a managed cloud offering that is production-ready today.

When to Consider Arroyo

Consider Arroyo when:

  • You are building on Cloudflare's developer platform and want native integration with R2, Queues, and Workers.
  • Your pipeline outputs always go to an external sink and you do not need to query the streaming state directly.
  • Your team wants async UDFs that call external services as part of the pipeline logic.
  • You prefer a REST API-first approach to pipeline management over a SQL DDL interface.
  • You are building a prototype or early-stage product and the pre-1.0 API instability is acceptable.

FAQ

Can I use standard SQL tools like psql or Tableau with both systems?

With RisingWave, yes. RisingWave implements the PostgreSQL wire protocol, so any PostgreSQL-compatible client connects natively. You can use psql, JDBC, Python psycopg2, Tableau, and DBeaver without any modification. Arroyo does not expose a PostgreSQL wire protocol; you interact with it via its web UI or REST API, and you consume output by reading from the configured sink (Kafka, S3, database).

How do RisingWave and Arroyo handle late-arriving events?

Both systems use watermarks to track event-time progress and handle late data. RisingWave lets you define watermarks on source tables and configure allowed lateness; events arriving after the watermark threshold are dropped or handled by a configurable policy. Arroyo uses its Dataflow model's watermark propagation to manage late events similarly. The mechanics differ at the implementation level, but the user-visible behavior (configuring a watermark column and a lateness allowance) is conceptually similar in both systems.
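On the RisingWave side, a watermark is declared alongside the columns of a source. The sketch below assumes a 5-second lateness allowance and uses hypothetical names (arroyo_clickstream_wm, arroyo_page_views_final); the connection settings are copied from the clickstream example earlier in this article. The EMIT ON WINDOW CLOSE clause asks RisingWave to output each window once, after the watermark has passed window_end, instead of emitting incremental updates.

```sql
-- Source with a watermark: events more than 5 seconds behind the
-- largest ts seen so far are considered late.
CREATE SOURCE arroyo_clickstream_wm (
    user_id    BIGINT,
    page       VARCHAR,
    event_type VARCHAR,
    ts         TIMESTAMPTZ,
    WATERMARK FOR ts AS ts - INTERVAL '5 SECOND'
) WITH (
    connector = 'kafka',
    topic = 'clickstream',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

-- Final (not incrementally updated) counts per 5-minute window.
CREATE MATERIALIZED VIEW arroyo_page_views_final AS
SELECT
    window_start,
    window_end,
    page,
    COUNT(*) AS view_count
FROM TUMBLE(arroyo_clickstream_wm, ts, INTERVAL '5' MINUTE)
GROUP BY window_start, window_end, page
EMIT ON WINDOW CLOSE;
```

A larger offset in the watermark expression tolerates more out-of-order data at the cost of delaying final results by the same amount.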

Is Arroyo still a viable open-source project after the Cloudflare acquisition?

Yes, with caveats. Arroyo remains Apache 2.0 and self-hostable. Cloudflare committed to keeping the engine open-source. However, development priorities are now driven by Cloudflare's platform needs, and the release cadence slowed in the second half of 2025. Teams that want a community-driven roadmap independent of a single cloud vendor's priorities should factor this into their evaluation.

Does RisingWave support UDFs like Arroyo does?

RisingWave supports user-defined functions written in Python, Java, and Rust via an external UDF server mechanism. Arroyo supports UDFs written in Python and Rust, including async UDFs that make network calls during pipeline execution. Arroyo's async UDF model is particularly useful for pipelines that need to enrich streaming records with database lookups or ML model inferences inline. RisingWave's UDF approach requires spinning up a separate UDF server process, which is more operationally involved but allows arbitrary computation.
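For the external UDF server path described above, registration on the RisingWave side is a single DDL statement that points at the running server. This is a sketch under stated assumptions: the function name normalize_region, its signature, and the address http://localhost:8815 are hypothetical, and a UDF server exposing a function with that name must already be running at that address.

```sql
-- Register a function served by an external UDF server (hypothetical name and URL).
CREATE FUNCTION normalize_region(varchar) RETURNS varchar
AS normalize_region
USING LINK 'http://localhost:8815';

-- Once registered, it can be used like any built-in function,
-- including inside a materialized view definition.
SELECT
    normalize_region(region) AS region_code,
    COUNT(*)                 AS order_count
FROM arroyo_orders
GROUP BY normalize_region(region);
```

The operational cost is the extra process: the UDF server has to be deployed, monitored, and kept reachable from every compute node that evaluates the function.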

Summary

RisingWave and Arroyo are both Rust-built, Apache 2.0 stream processors that take SQL seriously. Their shared language and license make them appear more similar than they are.

RisingWave is a streaming database. You write SQL, results are queryable at any time from any PostgreSQL client, and state lives in cloud-native object storage. It is generally available, has a stable API, and has the larger community. It is the right choice for teams that want a production-ready system where streaming results are first-class queryable objects.

Arroyo is a stream processing engine. SQL defines pipelines whose outputs are pushed to sinks. It is pre-1.0, backed by Cloudflare, and best suited for teams building on Cloudflare's platform or those who prefer a pipeline API model over a database query model.

If you are evaluating stream processors more broadly, the streaming database landscape guide covers additional systems in the space. For a deeper look at windowing, which both systems support, the post on windowing in stream processing covers tumbling, hopping, and session windows with executable SQL examples.

To try RisingWave, start with the quickstart or connect to a free RisingWave Cloud cluster at cloud.risingwave.com. The Kafka connectors, windowed aggregations, and cascading materialized views shown in this article all run without modification on any RisingWave instance.
