Every data team eventually hits the same question: should we bolt streaming onto our existing Spark platform, or adopt a purpose-built streaming engine? Databricks Structured Streaming and RisingWave represent two fundamentally different answers to that question. One extends a batch-oriented framework to handle streams. The other starts from streaming-first principles and wraps everything in PostgreSQL-compatible SQL.
This comparison breaks down the architectural trade-offs, latency profiles, SQL dialects, cost structures, and operational models so you can decide which approach fits your streaming workloads. We aim to be fair and balanced: both platforms have legitimate strengths, and the right choice depends on what you are building.
Architecture: True Streaming vs Micro-Batch
The most consequential difference between these two systems lives at the architecture level. It shapes everything that follows: latency, cost, operational complexity, and developer experience.
Databricks Structured Streaming
Databricks Structured Streaming is built on Apache Spark. By default, it processes data in micro-batches: the engine collects incoming records into small batches, then processes each batch using Spark's distributed execution engine. Each micro-batch goes through the full Spark lifecycle of planning, scheduling, execution, and checkpointing.
This design inherits Spark's strengths (mature distributed computing, broad connector ecosystem, unified batch and streaming API) but also its constraints. Each micro-batch carries fixed overhead costs: writing log files to durable object storage before and after execution, uploading state updates, and the latency of logical and physical planning. When batches are small, these fixed costs dominate processing time.
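The overhead math can be made concrete with a toy model. This is an illustrative sketch only; the overhead, batch-interval, and per-record figures below are assumptions chosen for illustration, not measured Databricks numbers:

```python
# Toy model of micro-batch overhead: every batch pays a fixed cost
# (planning, checkpoint/log writes) no matter how many records it holds.

def effective_latency_ms(batch_interval_ms: float, fixed_overhead_ms: float) -> float:
    """Worst-case event-to-output time: an event that arrives just after a
    batch closes waits a full interval, then pays the per-batch overhead."""
    return batch_interval_ms + fixed_overhead_ms

def overhead_fraction(records_per_batch: int, fixed_overhead_ms: float,
                      per_record_us: float) -> float:
    """Share of batch wall time spent on fixed overhead rather than on records."""
    record_time_ms = records_per_batch * per_record_us / 1000.0
    return fixed_overhead_ms / (fixed_overhead_ms + record_time_ms)

# Large batches amortize the fixed cost; small batches do not.
print(round(overhead_fraction(1_000_000, 500.0, 1.0), 2))  # 0.33
print(round(overhead_fraction(1_000, 500.0, 1.0), 2))      # 1.0
print(effective_latency_ms(1000.0, 500.0))                 # 1500.0
```

Shrinking the batch interval reduces the waiting term but makes the fixed overhead dominate each batch, which is exactly why very small micro-batches stop paying off.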
In 2025, Databricks introduced Real-Time Mode to address these limitations. Real-Time Mode replaces the micro-batch loop with continuous data flow, pipeline scheduling (stages run simultaneously instead of sequentially), and streaming shuffle (data passes between tasks immediately rather than through disk-based intermediate storage). Databricks reports sub-300ms p99 latencies for a broad set of stateless and stateful queries in Real-Time Mode.
Real-Time Mode is a significant step forward, but it requires Databricks Runtime 16.4 LTS or later, dedicated compute resources, and is only available on Databricks' managed platform. It also remains a fundamentally Spark-based system, with the Spark execution model and Spark SQL dialect underneath.
RisingWave
RisingWave is a streaming database built from the ground up in Rust. There is no batch execution layer underneath. Data flows continuously from sources (Kafka, Pulsar, CDC, S3, and others) through a dataflow graph maintained by incrementally updated materialized views. When a new event arrives, RisingWave propagates changes through the dataflow immediately, without waiting to accumulate a batch.
This architecture delivers sub-100ms end-to-end latency by default, not as a special mode that requires specific runtime versions or dedicated clusters. The compute and storage layers are decoupled: compute nodes handle stream processing, while state is persisted to object storage (S3, GCS, or Azure Blob Storage) for durability and cost efficiency. This separation enables independent scaling of compute and storage.
Because RisingWave is PostgreSQL wire-compatible, you connect using psql, JDBC, or any PostgreSQL client library. There is no separate API, no SDK to learn, and no cluster manager to configure beyond the database itself.
The Fundamental Trade-Off
Databricks gives you one platform for batch, streaming, ML, and analytics. If you already run Spark workloads on Databricks, adding streaming to the same platform simplifies your infrastructure footprint. The cost is architectural compromise: streaming is layered onto a batch foundation.
RisingWave gives you a purpose-built streaming engine with the lowest possible latency and the simplest possible interface (SQL over a PostgreSQL connection). The cost is that it does one thing: stream processing. For batch ETL, ML training, or ad-hoc analytics on historical data, you still need another system.
Latency: Milliseconds vs Seconds
Latency is often the deciding factor in choosing a streaming platform. Here is where these two systems land in practice.
Databricks Latency Profile
In default micro-batch mode, Databricks Structured Streaming typically delivers latencies measured in seconds to tens of seconds. The Databricks documentation recommends micro-batch mode for "analytical processing, ETL pipelines, data transformations, and medallion architecture implementations where latency requirements are measured in seconds or minutes."
With Real-Time Mode enabled, Databricks achieves sub-300ms p99 latencies. This is a substantial improvement, but Real-Time Mode comes with constraints:
- Requires dedicated compute (no serverless, no shared clusters)
- Available only on Databricks Runtime 16.4 LTS and later
- Higher resource consumption compared to micro-batch mode
- Limited to Databricks' managed cloud environment
RisingWave Latency Profile
RisingWave delivers sub-100ms end-to-end processing latency as its default operating mode. Materialized views are updated incrementally as events arrive, and point queries against those materialized views return results with 10-20ms p99 latency.
This latency is available out of the box, on any deployment model (self-hosted, Kubernetes, or RisingWave Cloud), without special configuration flags or dedicated runtime versions.
When Latency Matters
For use cases like real-time dashboards refreshing every few seconds, Databricks micro-batch mode may be sufficient. For fraud detection, live pricing engines, real-time personalization, or operational alerting where every 100ms counts, RisingWave's consistently low latency is a material advantage.
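Whether "every 100ms counts" is ultimately a tail-latency question: a pipeline can look fine on average and still blow its budget at p99. A minimal sketch of checking a latency SLO (the sample values are invented for illustration):

```python
import math

def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latency measurements."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

def meets_slo(latencies_ms, budget_ms):
    return p99(latencies_ms) <= budget_ms

# 98 fast events and 2 slow stragglers: the mean is ~45 ms,
# but the p99 shows the 100 ms budget is blown.
samples = [40.0] * 98 + [250.0, 300.0]
print(p99(samples))             # 250.0
print(meets_slo(samples, 100))  # False
```

This is why the platforms above quote p99 figures rather than averages: the tail, not the mean, determines whether a fraud check or pricing decision lands in time.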
SQL Dialect: PostgreSQL vs Spark SQL
The SQL dialect you write determines your developer experience, the pool of engineers who can work with your streaming pipelines, and how easily you can integrate with existing tools.
Spark SQL on Databricks
Databricks uses Spark SQL, which is based on the Hive SQL dialect with extensions for distributed processing. Spark SQL supports a broad range of operations: window functions, complex aggregations, UDFs in Python/Scala/Java, and schema evolution for semi-structured data formats like JSON and Parquet.
However, Spark SQL diverges from standard SQL in several ways:
- Type system: Uses Spark-specific types (StringType, IntegerType, StructType) rather than standard SQL types
- Function library: Many Spark-specific functions (e.g., explode(), posexplode(), from_json()) that do not exist in standard SQL
- Streaming-specific syntax: Requires readStream/writeStream API calls in Python/Scala to define streaming pipelines, with SQL used primarily for transformations within those pipelines
- State management: Stateful operations require understanding Spark's watermarking and output mode concepts (append, update, complete)
Writing a streaming pipeline on Databricks typically means writing Python or Scala code that orchestrates Spark DataFrame operations, with SQL embedded for the transformation logic.
PostgreSQL-Compatible SQL on RisingWave
RisingWave implements the PostgreSQL wire protocol and a PostgreSQL-compatible SQL dialect. You define sources, sinks, and streaming transformations entirely in SQL:
```sql
-- Define a source from Kafka
CREATE SOURCE orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL,
    order_time TIMESTAMP WITH TIME ZONE
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

-- Create a materialized view for real-time aggregation
CREATE MATERIALIZED VIEW revenue_per_customer AS
SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM orders
GROUP BY customer_id;

-- Query the materialized view like a regular table
SELECT * FROM revenue_per_customer
WHERE total_revenue > 1000
ORDER BY total_revenue DESC;
```
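The output side is plain SQL as well. Below is a sketch of streaming the aggregate back out to Kafka; the topic name and broker address are placeholders, and the exact sink options (such as where the primary key is declared) may vary by RisingWave version, so check the CREATE SINK documentation:

```sql
-- Emit changes to revenue_per_customer as an upsert stream keyed by customer_id
CREATE SINK revenue_to_kafka
FROM revenue_per_customer
WITH (
    connector = 'kafka',
    topic = 'revenue_per_customer',
    properties.bootstrap.server = 'kafka:9092',
    primary_key = 'customer_id'
) FORMAT UPSERT ENCODE JSON;
```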
Key advantages of the PostgreSQL-compatible approach:
- Familiar syntax: Any engineer who knows PostgreSQL (or MySQL, or standard SQL) can write streaming pipelines without learning a new API
- Standard tooling: Connect with psql, DBeaver, DataGrip, Metabase, Superset, SQLAlchemy, or any PostgreSQL-compatible client
- No orchestration code: No Python/Scala wrappers needed. The entire pipeline is defined in SQL DDL and DML
- Standard types: BIGINT, VARCHAR, TIMESTAMP WITH TIME ZONE, JSONB - the same types you use in PostgreSQL
The Developer Experience Gap
Consider what it takes to create a real-time aggregation on each platform:
Databricks (Python + Spark SQL):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, count, avg, col, from_json
from pyspark.sql.types import StructType, StructField, LongType, DecimalType, TimestampType

spark = SparkSession.builder.appName("OrderAgg").getOrCreate()

# Schema must be declared explicitly as Spark type objects
schema = StructType([
    StructField("order_id", LongType()),
    StructField("customer_id", LongType()),
    StructField("amount", DecimalType(10, 2)),
    StructField("order_time", TimestampType())
])

orders = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
    .withWatermark("order_time", "10 minutes"))

revenue = (orders
    .groupBy("customer_id")
    .agg(
        count("*").alias("total_orders"),
        sum("amount").alias("total_revenue"),
        avg("amount").alias("avg_order_value")
    ))

# The Delta sink does not support "update" output mode, so this
# non-windowed aggregation must be written in "complete" mode.
query = (revenue.writeStream
    .outputMode("complete")
    .format("delta")
    .option("checkpointLocation", "/checkpoints/revenue")
    .toTable("revenue_per_customer"))

query.awaitTermination()
```
RisingWave (Pure SQL):
```sql
CREATE SOURCE orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL,
    order_time TIMESTAMP WITH TIME ZONE
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

CREATE MATERIALIZED VIEW revenue_per_customer AS
SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM orders
GROUP BY customer_id;
```
Both achieve the same result. The RisingWave version is pure SQL, requires no SDK, no schema definition objects, no output mode configuration, and no checkpoint management. For teams where SQL is the common language, this difference in complexity compounds across dozens of streaming pipelines.
Feature Comparison Table
| Feature | RisingWave | Databricks Structured Streaming |
|---|---|---|
| Processing Model | Continuous, event-driven | Micro-batch (default) or Real-Time Mode |
| Default Latency | Sub-100ms end-to-end | Seconds to tens of seconds (micro-batch) |
| Best-Case Latency | Sub-100ms | Sub-300ms (Real-Time Mode) |
| SQL Dialect | PostgreSQL-compatible | Spark SQL (Hive-derived) |
| Client Protocol | PostgreSQL wire protocol | Spark Connect, JDBC, ODBC |
| Pipeline Definition | Pure SQL (DDL/DML) | Python/Scala + SQL |
| Materialized Views | Incrementally maintained, always fresh | Scheduled refresh (Delta Live Tables) |
| State Management | Automatic, transparent | Manual watermarking and output modes |
| Connector Ecosystem | Kafka, Pulsar, Kinesis, CDC, S3, Iceberg | Kafka, Kinesis, Event Hubs, Auto Loader, Delta |
| Deployment Options | Self-hosted, Kubernetes, Cloud (SaaS) | Databricks managed cloud only |
| Batch Processing | Not the primary use case | Full batch + streaming unified |
| ML Integration | Via downstream systems | Native MLflow, ML Runtime |
| Language | Rust | Scala/Java (Spark core) |
| Open Source | Yes (Apache 2.0) | Spark is open source; Databricks platform is proprietary |
| Iceberg Integration | Native sink and catalog | Native via Delta-Iceberg interop |
Cost: Purpose-Built vs Platform Tax
Cost structure is where the architectural differences become financially concrete.
Databricks Cost Model
Databricks charges based on Databricks Units (DBUs), which scale with cluster size and runtime type. For streaming workloads:
- Micro-batch mode: Can use autoscaling to some extent, but Databricks' autoscaling is based on job queue depth, which does not align well with streaming workloads that run continuously. In practice, most streaming jobs run on fixed-size clusters.
- Real-Time Mode: Requires dedicated compute with no serverless option. This means you pay for always-on clusters sized for peak throughput, even during quiet periods.
- Platform overhead: Running streaming on Databricks means paying the Databricks platform markup on top of cloud compute costs. If streaming is your only workload, you are paying for a full lakehouse platform to run what is essentially a stream processor.
For organizations already paying for Databricks for batch ETL, ML, and analytics, adding streaming workloads to the same platform can be cost-effective because you are amortizing the platform cost across multiple use cases. But if streaming is the primary or only workload, the platform overhead is harder to justify.
RisingWave Cost Model
RisingWave's decoupled compute-storage architecture means you pay separately for compute (CPU/memory for stream processing) and storage (object storage for state). This separation has direct cost implications:
- Compute scales independently: Scale compute up for throughput, down during quiet periods, without affecting stored state
- Storage is cheap: State lives in object storage (S3, GCS) at object storage prices, not on fast local SSDs attached to compute nodes
- No platform tax: Self-hosted RisingWave is open source (Apache 2.0). RisingWave Cloud charges only for the streaming resources you use
- Right-sized for streaming: You are not paying for batch processing, ML runtimes, or notebook environments you do not need
For streaming-only workloads, a purpose-built system like RisingWave typically costs significantly less than running equivalent pipelines on a general-purpose platform. The savings come from three places: no platform markup, efficient resource utilization (Rust vs JVM), and storage costs (object storage vs always-on cluster storage).
A Practical Cost Scenario
Consider a streaming workload that processes 50,000 events per second with five materialized views performing aggregations and joins.
On Databricks, you need an always-on cluster (likely 4-8 nodes with Real-Time Mode for sub-second latency) plus DBU costs. On RisingWave, a 2-4 node cluster handles this throughput with state offloaded to S3. The RisingWave deployment uses fewer compute nodes (Rust efficiency vs JVM overhead) and cheaper storage, resulting in lower total cost of ownership for this streaming-focused workload.
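A back-of-envelope model makes the comparison concrete. Every number below (node counts, hourly rates, the platform fee, the storage price) is a hypothetical assumption for illustration, not a vendor quote:

```python
def monthly_cost(nodes, node_hourly_usd, hours=730,
                 platform_hourly_usd=0.0, storage_usd=0.0):
    """Cloud compute + optional per-node platform fee + storage, per month."""
    compute = nodes * node_hourly_usd * hours
    platform_fee = nodes * platform_hourly_usd * hours
    return compute + platform_fee + storage_usd

# Always-on 6-node cluster sized for peak, plus a per-node-hour platform fee
platform_style = monthly_cost(6, 1.00, platform_hourly_usd=0.40)

# Smaller 3-node cluster, with 500 GB of state in object storage (~$0.023/GB-month)
streaming_db_style = monthly_cost(3, 1.00, storage_usd=500 * 0.023)

print(round(platform_style))      # 6132
print(round(streaming_db_style))  # 2202
```

The point is the shape of the formula, not the specific numbers: the object-storage term is small and flat, so the gap is driven almost entirely by node count and the per-node platform fee.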
The calculus changes if you also run batch ETL, ML training, and ad-hoc SQL analytics on the same data. In that case, Databricks' unified platform may reduce total infrastructure complexity even if the per-workload streaming cost is higher.
Operational Model: Database vs Platform
How you deploy, monitor, and maintain a streaming system matters as much as its features.
Operating Databricks Streaming
Databricks is a managed platform, which means Databricks handles cluster provisioning, Spark version upgrades, and infrastructure maintenance. Your operational responsibilities include:
- Cluster configuration: Choosing instance types, cluster sizes, and autoscaling policies
- Checkpoint management: Ensuring checkpoint locations are configured correctly and cleaned up
- Watermark tuning: Setting appropriate watermarks for stateful operations to balance latency and memory
- Pipeline orchestration: Managing dependencies between streaming jobs, often using Databricks Workflows or external orchestrators
- Monitoring: Using Spark UI, Ganglia metrics, or Databricks-specific monitoring for streaming query progress
The operational model assumes familiarity with Spark concepts: executors, partitions, shuffle, the Catalyst optimizer, and the Spark memory model. Debugging performance issues often requires understanding Spark internals.
Operating RisingWave
RisingWave operates like a database. You connect, run SQL, and the system handles the rest. Operationally:
- Deployment: Deploy via Docker, Kubernetes (Helm chart), or RisingWave Cloud (fully managed)
- No checkpoint management: State management and recovery are handled internally, transparently
- No watermark tuning: RisingWave manages event-time processing automatically through its materialized view semantics
- Monitoring: Standard database metrics (query latency, throughput, memory usage) exposed via Prometheus-compatible endpoints, plus a built-in dashboard
- Scaling: Add or remove compute nodes; state in object storage is unaffected
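Event-time logic follows the same pattern: it is expressed inside the materialized view itself rather than tuned as a job parameter. A sketch using RisingWave's TUMBLE table function over the orders source from the earlier example (column names come from that example; verify the exact syntax against the current RisingWave docs):

```sql
-- Per-minute revenue bucketed by event time (order_time), not arrival time
CREATE MATERIALIZED VIEW revenue_per_minute AS
SELECT
    window_start,
    SUM(amount) AS minute_revenue,
    COUNT(*) AS orders_in_minute
FROM TUMBLE(orders, order_time, INTERVAL '1 minute')
GROUP BY window_start;
```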
The operational model assumes familiarity with database concepts: connections, queries, schemas, and indexes. For teams with strong SQL and database skills (which describes most data teams), this is a lower learning curve than mastering Spark internals.
Failure Recovery
Both systems handle failures, but differently:
- Databricks: Recovers from checkpoint state in durable storage. Recovery time depends on checkpoint size and cluster restart time. Spark's fault tolerance is mature and well-tested.
- RisingWave: Recovers from state persisted in object storage (S3/GCS). The decoupled architecture means compute nodes can restart and reload state without data loss. Recovery time is typically seconds to low minutes depending on state size.
When to Choose Databricks Streaming
Databricks Structured Streaming is the stronger choice when:
- You already run Databricks for batch ETL, ML, and analytics, and want to add streaming without introducing a new system
- Unified batch and streaming is a priority, and you want one platform for both, even if streaming latency is seconds rather than milliseconds
- Your team knows Spark and is productive with the PySpark/Scala API
- Latency requirements are relaxed (seconds to minutes is acceptable)
- ML integration is critical, and you want streaming data to feed directly into MLflow models on the same platform
- Delta Lake is your storage layer, and you want native streaming writes to Delta tables
When to Choose RisingWave
RisingWave is the stronger choice when:
- Sub-100ms latency is a requirement, not a nice-to-have
- SQL is your team's primary language, and you want to define streaming pipelines without Python/Scala orchestration code
- Operational simplicity matters, and you want a system that operates like a database rather than a distributed computing platform
- Streaming is the primary workload, and you do not want to pay for a full lakehouse platform
- PostgreSQL ecosystem integration is valuable (connecting BI tools, application backends, or existing PostgreSQL-based workflows)
- Self-hosted or multi-cloud deployment is needed, since RisingWave runs anywhere (Docker, Kubernetes, or managed cloud)
- Cost efficiency for streaming is important, and you want to avoid the platform overhead of a general-purpose data platform
What Is the Main Difference Between RisingWave and Databricks Streaming?
The main difference is the processing architecture. RisingWave is a purpose-built streaming database that processes events continuously as they arrive, delivering sub-100ms latency with a PostgreSQL-compatible SQL interface. Databricks Structured Streaming is built on Apache Spark and processes data in micro-batches by default (with Real-Time Mode available for sub-300ms latency). RisingWave is designed exclusively for streaming, while Databricks is a unified platform covering batch, streaming, ML, and analytics.
Can Databricks Real-Time Mode Match RisingWave's Latency?
Databricks Real-Time Mode achieves sub-300ms p99 latencies, which is closer to but still higher than RisingWave's sub-100ms default. More importantly, Real-Time Mode requires dedicated compute resources, Databricks Runtime 16.4 LTS or later, and is only available on the Databricks managed platform. RisingWave delivers its latency profile on any deployment model without special configuration.
Is RisingWave a Replacement for Databricks?
No. RisingWave and Databricks serve different primary use cases. RisingWave replaces the streaming component of your architecture with a more efficient, lower-latency, SQL-native alternative. Databricks covers batch processing, ML training, ad-hoc analytics, and data warehousing in addition to streaming. Many organizations use both: RisingWave for real-time stream processing and materialized views, with results flowing into Databricks (via Apache Iceberg integration) for historical analysis and ML workloads.
How Does the SQL Experience Compare Between the Two Platforms?
RisingWave uses PostgreSQL-compatible SQL, so you can connect with psql, JDBC drivers, or any PostgreSQL client and define entire streaming pipelines in pure SQL. Databricks uses Spark SQL, which requires Python or Scala wrapper code to define streaming sources, sinks, and execution parameters. The transformation logic within Databricks pipelines uses SQL, but the pipeline orchestration requires a programming language. For teams that prefer a pure SQL workflow, RisingWave offers a significantly simpler developer experience.
Conclusion
RisingWave and Databricks Structured Streaming are not interchangeable tools competing for the same slot in your architecture. They reflect different philosophies about how to handle streaming data.
Key takeaways:
- Architecture: RisingWave is streaming-native with continuous processing. Databricks layers streaming onto a batch framework (with Real-Time Mode closing the latency gap).
- Latency: RisingWave delivers sub-100ms by default. Databricks achieves sub-300ms with Real-Time Mode or seconds-level with micro-batch.
- SQL: RisingWave speaks PostgreSQL. Databricks speaks Spark SQL wrapped in Python/Scala.
- Cost: For streaming-only workloads, RisingWave's purpose-built architecture and Rust efficiency translate to lower infrastructure costs. For organizations already paying for the Databricks platform, adding streaming has lower marginal cost.
- Operations: RisingWave operates like a database. Databricks operates like a distributed computing platform.
Choose Databricks when you need a unified platform and your streaming latency requirements are measured in seconds. Choose RisingWave when streaming is a first-class workload, sub-100ms latency matters, and you want the simplicity of pure SQL on a PostgreSQL-compatible interface.
Ready to try streaming SQL? Try RisingWave Cloud free - no credit card required.
Join our Slack community to ask questions and connect with other stream processing developers.

