RisingWave vs Spark Structured Streaming for Real-Time Analytics

The Streaming Decision That Shapes Your Stack

Your team needs real-time analytics. Dashboards that update in seconds, not hours. Fraud scores computed before transactions settle. Inventory counts that reflect what is happening on the warehouse floor right now.

You have narrowed the field to two options: Apache Spark Structured Streaming, the streaming extension of the most widely deployed big data framework, and RisingWave, a purpose-built streaming database that speaks PostgreSQL. Both can ingest Kafka topics, apply transformations, and produce results. But they approach the problem from fundamentally different starting points, and that difference shapes everything from latency to hiring costs.

This RisingWave vs Spark Structured Streaming comparison walks through the technical tradeoffs across six dimensions: processing model, latency, SQL dialect, state management, operational complexity, and cost. The goal is a fair, practical guide so you can pick the right tool for your workload.

Processing Model: Micro-Batch vs. True Streaming

The deepest architectural difference between these two systems is how they process data.

Spark Structured Streaming: Batch DNA

Spark Structured Streaming treats a data stream as an unbounded table and processes it using the same Spark SQL engine that powers batch jobs. Under the hood, the default execution mode is micro-batching: the engine collects incoming events into small batches, applies transformations, writes results, and then starts the next batch.

This design has a clear advantage: you reuse Spark's mature optimizer (Catalyst), its code generation (Tungsten), and its massive connector ecosystem. If your organization already runs Spark for batch ETL or machine learning, Structured Streaming is a natural extension.

Spark 4.x introduced a Real-Time Mode that processes events as they arrive rather than collecting them into batches. This is a significant step forward, but it currently requires Databricks-specific infrastructure, does not support all stateful operations, and disables autoscaling. For most open-source Spark deployments, micro-batch remains the primary execution model.

RisingWave: Streaming-Native

RisingWave is a streaming database designed from the ground up around continuous, incremental computation. When a new event arrives, it propagates through the dataflow graph immediately. There is no batching interval, no trigger configuration, and no "wait for the next micro-batch" delay.

Every streaming pipeline is expressed as a CREATE MATERIALIZED VIEW statement. The database continuously and incrementally maintains the result of that query, so a SELECT against the materialized view always returns the latest state. This is fundamentally different from Spark's model, where results are periodically flushed to an output sink.

The consequence: RisingWave processes each event individually as it arrives, while Spark Structured Streaming (in its default mode) processes events in discrete chunks on a configurable trigger interval.
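The scheduling difference can be sketched with a toy model in plain Python (illustrative only, not how either engine is implemented): a micro-batch processor holds events until the trigger fires, so the trigger interval sets a latency floor, while an event-at-a-time processor emits an updated result the moment each event arrives.

```python
# Toy model of micro-batch vs. event-at-a-time processing.
# Events are (timestamp_ms, value) pairs; micro_batch emits the sum of each
# batch at trigger boundaries, event_at_a_time emits a running total per event.

def micro_batch(events, trigger_interval_ms=10_000):
    """Collect events until the trigger fires, then process the batch."""
    results, batch = [], []
    next_trigger = trigger_interval_ms
    for ts, value in events:
        while ts >= next_trigger:            # fire any elapsed triggers
            if batch:
                results.append((next_trigger, sum(v for _, v in batch)))
                batch = []
            next_trigger += trigger_interval_ms
        batch.append((ts, value))
    if batch:                                # flush the final partial batch
        results.append((next_trigger, sum(v for _, v in batch)))
    return results

def event_at_a_time(events):
    """Update the running aggregate as each event arrives."""
    results, total = [], 0
    for ts, value in events:
        total += value
        results.append((ts, total))          # result available immediately
    return results

events = [(1_000, 5), (4_000, 3), (12_000, 7)]
print(micro_batch(events))      # results appear only at trigger boundaries
print(event_at_a_time(events))  # results appear at each event's timestamp
```

In the micro-batch run, the event at t=1 s is not visible in any output until the 10-second trigger fires; in the event-at-a-time run it is visible immediately.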

Latency: Seconds vs. Sub-Second

Latency is often the deciding factor in choosing a streaming engine.

Spark Structured Streaming Latency

In micro-batch mode, Spark's end-to-end latency is bounded below by the trigger interval plus processing time. The documentation states that latencies as low as 100 milliseconds are achievable, but in practice, production workloads with stateful aggregations, shuffles, and sink commits typically see 500 ms to several seconds of end-to-end latency. Research benchmarks place typical micro-batch latency in the 510-570 ms range for simple workloads.

The older Continuous Processing mode (introduced in Spark 2.3) can achieve 1 ms latency but only provides at-least-once guarantees and supports a very limited set of operations: no aggregations, no streaming joins, no windowing.

Spark's new Real-Time Mode targets single-digit millisecond p99 latencies for simple transformations, but reports show p99 latencies ranging from a few milliseconds to roughly 300 ms depending on transformation complexity. It remains a Databricks-specific feature as of early 2026.

RisingWave Latency

RisingWave delivers sub-second end-to-end latency for most streaming queries, with typical p99 query latency of 10-20 ms when reading from materialized views. Because events propagate through the dataflow immediately rather than waiting for a batch boundary, latency is determined by computation complexity rather than scheduling overhead.

For use cases like real-time dashboards, alerting, and fraud detection where every second counts, the latency difference between "always under a second" and "typically 500 ms to several seconds with periodic spikes" is meaningful.

SQL Dialect and Developer Experience

How you write and deploy streaming logic matters as much as raw performance.

Spark Structured Streaming: DataFrame API First

Spark Structured Streaming's primary interface is the DataFrame/Dataset API in Scala, Java, Python, or R. While Spark SQL exists and can express many streaming queries, the full power of Structured Streaming requires the programmatic API:

# Spark Structured Streaming - Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, count, sum as _sum

spark = SparkSession.builder.appName("orders").getOrCreate()

orders = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "orders") \
    .load()

order_stats = orders \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window(col("event_time"), "5 minutes"),
        col("region")
    ) \
    .agg(
        _sum("amount").alias("total_revenue"),
        count("*").alias("order_count")
    )

query = order_stats.writeStream \
    .outputMode("update") \
    .format("console") \
    .trigger(processingTime="10 seconds") \
    .start()

query.awaitTermination()  # keep the driver alive while the query runs

This code requires understanding Spark sessions, read/write stream configuration, watermarking, output modes (append, update, complete), and trigger semantics. You need to configure a Spark cluster (driver + executors), manage dependencies, and deploy JARs or Python packages.

RisingWave: PostgreSQL-Compatible SQL

RisingWave uses PostgreSQL-compatible SQL as its only interface. The equivalent of the Spark pipeline above looks like this:

-- RisingWave - Standard SQL
CREATE SOURCE orders_source (
    order_id BIGINT,
    region VARCHAR,
    amount DECIMAL,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'localhost:9092'
) FORMAT PLAIN ENCODE JSON;

CREATE MATERIALIZED VIEW order_stats AS
SELECT
    window_start,
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM TUMBLE(orders_source, event_time, INTERVAL '5 minutes')
GROUP BY window_start, region;

No cluster configuration. No output modes. No trigger intervals. You connect with psql, any JDBC driver, or any PostgreSQL client library and write standard SQL. The materialized view is continuously updated, and you query it with SELECT * FROM order_stats to get the latest results.
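Querying the view from application code looks like any other Postgres query. Here is a minimal sketch using psycopg2; the connection details (localhost, port 4566, database dev, user root) are common local-deployment defaults and are assumptions, not guaranteed to match your setup.

```python
# Read the latest results of a continuously maintained materialized view.
# Connection details below are illustrative defaults; adjust for your deployment.
QUERY = "SELECT region, order_count, total_revenue FROM order_stats"

def fetch_order_stats(dsn="host=localhost port=4566 dbname=dev user=root"):
    import psycopg2  # any PostgreSQL client library works here
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)  # returns the view's current state
            return cur.fetchall()

if __name__ == "__main__":
    for region, order_count, total_revenue in fetch_order_stats():
        print(region, order_count, total_revenue)
```

No polling pipeline or cache invalidation is needed: the database maintains the result incrementally, and each SELECT sees the latest state.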

This difference has a direct impact on team velocity. Engineers who know SQL can build and maintain streaming pipelines without learning Spark internals. The hiring pool for "SQL developers" is orders of magnitude larger than the pool for "Spark Structured Streaming engineers."

State Management: JVM Heap vs. Object Storage

Stateful operations like aggregations, joins, and windowed computations require the engine to maintain intermediate state. How each system handles that state has major implications for reliability and scale.

Spark's State Model

Spark Structured Streaming stores state in the JVM heap by default (using the HDFS-backed state store). For production workloads, Databricks recommends the RocksDB state store provider, which stores state on local disk to avoid garbage collection pressure.

The challenge emerges at scale. When state grows beyond tens of gigabytes:

  • JVM garbage collection becomes a bottleneck. With the default state store, GC pauses of 10 seconds to over a minute have been observed for large heap sizes (>32 GB).
  • Checkpointing overhead increases linearly with state size. Spark periodically snapshots all state to durable storage (HDFS or S3). Large checkpoints slow down recovery.
  • State cleanup requires explicit watermarking configuration. Without proper watermarks, state grows unboundedly and eventually causes out-of-memory failures.
  • Schema evolution in stateful queries is restricted. Changing a streaming query's schema often requires discarding checkpoint data and reprocessing from scratch.

RisingWave's State Model

RisingWave persists all state natively in object storage (S3, GCS, Azure Blob) through Hummock, a purpose-built LSM-tree storage engine. There is no JVM, no garbage collection, and no local disk to manage.

Key differences:

  • No state size limits tied to local resources. State scales with object storage, which is practically unlimited and costs a fraction of attached SSDs.
  • Continuous persistence. State changes stream to object storage incrementally rather than through periodic full snapshots. This means recovery is fast (seconds, not minutes) because there is less data to replay.
  • Automatic state cleanup. RisingWave manages watermarks and state expiration internally for windowed operations.
  • Schema evolution is supported through ALTER MATERIALIZED VIEW for compatible changes.

For workloads with large state (hundreds of gigabytes to terabytes of join or aggregation state), this architectural difference is often the deciding factor.
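The write-amplification argument behind incremental persistence can be made concrete with a toy calculation (illustrative numbers, not a benchmark): when only a small fraction of state changes between checkpoints, a full-snapshot scheme rewrites everything, while an incremental scheme writes only the deltas.

```python
# Toy comparison of checkpoint write volume: full snapshots vs. incremental.

def full_snapshot_writes(state_gb, checkpoints):
    """Periodic full snapshots rewrite the entire state every interval."""
    return state_gb * checkpoints

def incremental_writes(state_gb, checkpoints, change_fraction):
    """Incremental persistence writes only the keys that changed."""
    return state_gb * change_fraction * checkpoints

# 100 GB of state, 10 checkpoint intervals, 1% of keys change per interval
full = full_snapshot_writes(100, 10)
incr = incremental_writes(100, 10, 0.01)
print(f"full snapshots: {full} GB written; incremental: {incr} GB written")
```

Less data written per interval also means less data to replay on recovery, which is where the seconds-versus-minutes difference comes from.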

Operational Complexity: What It Takes to Run in Production

Getting a streaming system running in a demo is one thing. Keeping it stable in production at 3 AM is another.

Running Spark Structured Streaming in Production

A production Spark Structured Streaming deployment requires:

  1. Cluster management: A Spark cluster with driver and executor nodes, sized for your peak throughput. You must configure executor memory, cores, and parallelism. Autoscaling does not work well with streaming workloads because Spark treats each micro-batch as a short job.
  2. JVM tuning: Garbage collector selection (G1GC, ZGC), heap sizes, off-heap memory, and direct memory buffers all require tuning for your specific workload.
  3. Checkpoint management: You must configure checkpoint locations, monitor checkpoint sizes, and handle checkpoint corruption. Upgrading Spark versions sometimes requires checkpoint migration.
  4. Monitoring and alerting: Spark exposes metrics through the Spark UI, but you need additional tooling (Prometheus, Grafana) for production alerting on processing lag, batch duration, and state size.
  5. Failure recovery: When a streaming job fails, you must diagnose whether the failure is a transient issue (OOM, network partition) or a permanent one (schema change, bad data). Recovery involves restarting from the last checkpoint, which can take minutes to hours depending on state size.
  6. Dependency management: Spark jobs bring a large dependency tree (Hadoop, Hive, Jackson, Guava) that can cause version conflicts.

Running RisingWave in Production

A production RisingWave deployment requires:

  1. Database deployment: Deploy RisingWave as a single binary, via Docker, or on Kubernetes using the Helm chart. No separate cluster manager needed.
  2. Object storage configuration: Point RisingWave at an S3 bucket (or compatible storage) for state persistence. No local disk provisioning.
  3. Standard database monitoring: RisingWave exposes a Prometheus endpoint with streaming-specific metrics (barrier latency, throughput, materialized view lag). It also supports EXPLAIN for streaming plans.
  4. PostgreSQL-compatible tooling: Existing database tools (pgAdmin, DBeaver, Grafana's PostgreSQL plugin) work out of the box for monitoring and querying.
  5. Recovery: RisingWave recovers automatically from compute node failures by rescheduling work and reading state from object storage. Recovery time is typically seconds.

The operational gap is widest for teams without deep Spark expertise. If your team already has Spark specialists, the operational burden of Structured Streaming is manageable. If you are building a new team or your engineers are primarily SQL-focused, RisingWave's PostgreSQL-compatible interface significantly reduces the operational surface area.

Cost: Always-On Clusters vs. Elastic Streaming

Spark Structured Streaming Cost Profile

Spark Structured Streaming requires continuously running clusters. The cost components are:

  • Compute: Always-on executor nodes sized for peak throughput. Since autoscaling is ineffective for streaming, you pay for peak capacity 24/7.
  • Storage: Checkpoint data in HDFS or S3. State store data on local SSDs if using RocksDB.
  • Managed service premium: On Databricks, streaming workloads run on dedicated clusters at standard DBU rates. EMR Serverless offers a consumption-based model but adds scheduling latency.
  • Engineering time: JVM tuning, checkpoint management, and dependency resolution require specialized expertise.

For a typical mid-scale streaming workload (10,000-50,000 events/second), expect to run 3-5 large instances (e.g., r5.2xlarge on AWS) continuously, plus a driver node.

RisingWave Cost Profile

RisingWave's decoupled architecture changes the cost equation:

  • Compute: Scale compute nodes independently based on throughput needs. Add nodes for peak traffic, remove them during off-hours.
  • Storage: All state goes to object storage at roughly $0.023/GB/month (S3 standard), a fraction of the cost of attached SSDs or EBS volumes.
  • No JVM overhead: RisingWave is written in Rust, resulting in lower memory overhead and no GC pauses. The same workload typically requires fewer compute resources.
  • RisingWave Cloud: A fully managed option with pay-per-use pricing, eliminating operational costs. Sign up for free to test your workload.

For the same 10,000-50,000 events/second workload, RisingWave typically requires fewer and smaller compute nodes because there is no JVM overhead and state does not consume local resources.
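A back-of-the-envelope comparison under the quoted S3 price makes the storage gap concrete. The gp3 EBS price of $0.08/GB/month is an assumption for illustration (prices vary by region and change over time):

```python
# Monthly cost of keeping 1 TB of streaming state on each storage tier.
STATE_GB = 1_024      # 1 TB of state
S3_PER_GB = 0.023     # S3 standard, quoted above
EBS_PER_GB = 0.08     # gp3 EBS, assumed for illustration

s3_monthly = STATE_GB * S3_PER_GB
ebs_monthly = STATE_GB * EBS_PER_GB
print(f"S3:  ${s3_monthly:.2f}/month")
print(f"EBS: ${ebs_monthly:.2f}/month ({ebs_monthly / s3_monthly:.1f}x)")
```

Note that block storage must also be provisioned for peak state size and replicated per node, so the gap in practice is wider than the raw per-gigabyte ratio.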

Feature Comparison Table

| Dimension | Spark Structured Streaming | RisingWave |
| --- | --- | --- |
| Processing model | Micro-batch (default), Real-Time Mode (Databricks only) | True continuous streaming |
| End-to-end latency | 500 ms - seconds (micro-batch); ~1-300 ms (Real-Time Mode) | Sub-second; 10-20 ms p99 for MV reads |
| Primary interface | DataFrame API (Scala/Java/Python/R) | PostgreSQL-compatible SQL |
| SQL compatibility | Spark SQL dialect | PostgreSQL dialect |
| Client connectivity | Spark-specific drivers/sessions | Any PostgreSQL client (psql, JDBC, psycopg2) |
| State storage | JVM heap / RocksDB on local disk | S3-compatible object storage (Hummock) |
| State size limit | Bounded by local disk/memory | Bounded by object storage (practically unlimited) |
| Fault tolerance | Exactly-once (micro-batch); at-least-once (continuous) | Exactly-once |
| Recovery time | Minutes to hours (checkpoint restore) | Seconds (state on object storage) |
| Runtime | JVM (Java/Scala) | Native (Rust) |
| Cluster management | Driver + executors; YARN/Mesos/K8s | Single binary or K8s Helm chart |
| Autoscaling | Ineffective for streaming | Supported (decoupled compute/storage) |
| Built-in serving | No (requires external sink/database) | Yes (query materialized views directly) |
| Batch and stream unification | Yes (same DataFrame API) | Streaming-first; supports batch queries on tables |
| ML integration | Native (MLlib, pandas UDFs) | Via external systems (export to data lake) |
| Connector ecosystem | Very large (200+ connectors) | Growing (Kafka, Pulsar, Kinesis, S3, Iceberg, PostgreSQL CDC, MySQL CDC, and more) |
| Open source | Yes (Apache 2.0) | Yes (Apache 2.0) |
| Managed cloud option | Databricks, EMR, Dataproc, HDInsight | RisingWave Cloud |

When to Choose Spark Structured Streaming

Spark Structured Streaming is the better choice when:

  • You already run a Spark platform. If your organization has invested in Spark for batch ETL, ML training, and ad hoc analytics, adding Structured Streaming to existing clusters is the path of least resistance. The unified batch-streaming API reduces code duplication.
  • Seconds-level latency is acceptable. If your use case tolerates 1-5 second delays (daily aggregation dashboards, hourly reports that update more frequently), micro-batch latency is not a bottleneck.
  • You need tight ML integration. Spark's MLlib, pandas UDFs, and integration with MLflow make it the natural choice if your streaming pipeline feeds directly into model training or scoring.
  • You need the broadest connector ecosystem. Spark's 200+ connectors cover nearly every data source and sink. RisingWave's connector library is growing but not yet as extensive.

When to Choose RisingWave

RisingWave is the better choice when:

  • Sub-second latency matters. Real-time alerting, fraud detection, live dashboards, and operational analytics all benefit from RisingWave's continuous processing model.
  • Your team knows SQL, not Spark. If your engineers are database developers, data analysts, or backend engineers who work in SQL daily, RisingWave lets them build streaming pipelines without learning a new paradigm.
  • State management is a concern. If your streaming workload involves large join state, high-cardinality aggregations, or long time windows, RisingWave's object storage-backed state eliminates the JVM memory and GC challenges that plague Spark at scale.
  • You want built-in serving. RisingWave can serve materialized view results directly with low-latency queries, eliminating the need for a separate serving database (Redis, PostgreSQL, Cassandra) downstream of your streaming pipeline.
  • Operational simplicity is a priority. Smaller teams and organizations without dedicated Spark platform engineers benefit from RisingWave's simpler deployment model and PostgreSQL-compatible tooling.

Can You Use Both?

Yes. Many organizations run Spark for batch analytics and historical processing while using RisingWave for the real-time layer. A common pattern:

  1. Kafka collects events from applications.
  2. RisingWave ingests from Kafka, maintains real-time materialized views, and serves live dashboards and APIs.
  3. RisingWave sinks to Apache Iceberg for long-term storage via its native Iceberg integration.
  4. Spark reads from Iceberg for large-scale batch analytics, ML training, and historical queries.

This architecture gives you sub-second real-time analytics through RisingWave and petabyte-scale batch analytics through Spark, without forcing either tool into a role it was not designed for.
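Step 3 above is a single statement in RisingWave. The sketch below is indicative only: the exact WITH parameters depend on your Iceberg catalog and warehouse setup, a real sink also needs catalog and credential settings, and the bucket and schema names here are made up for illustration.

-- Sketch of an Iceberg sink; parameter values are illustrative and a
-- real deployment also needs catalog/credential configuration.
CREATE SINK orders_to_iceberg AS
SELECT * FROM orders_source
WITH (
    connector = 'iceberg',
    type = 'append-only',
    warehouse.path = 's3a://analytics-lake/warehouse',
    database.name = 'analytics',
    table.name = 'orders'
);

Spark jobs then read the analytics.orders Iceberg table directly, with no bespoke hand-off code between the two systems.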

What is the difference between micro-batch and true streaming processing?

Micro-batch processing collects incoming events into small batches and processes each batch as a unit. True streaming (also called continuous or event-at-a-time processing) processes each event individually as it arrives. The practical difference is latency: micro-batch adds scheduling and batching overhead that creates a latency floor (typically hundreds of milliseconds to seconds), while true streaming can deliver results in milliseconds. Spark Structured Streaming defaults to micro-batch; RisingWave uses true streaming natively.

How does RisingWave compare to Spark Structured Streaming for large-scale state?

RisingWave handles large state more gracefully because it persists state directly to object storage (S3) rather than keeping it in JVM memory or on local disks. Spark Structured Streaming stores state in the JVM heap (default) or RocksDB (recommended for production), and both approaches hit scaling limits: JVM garbage collection pauses grow with heap size, and local disk capacity bounds RocksDB state. RisingWave's approach means state can scale to terabytes without impacting compute performance.

Can I migrate from Spark Structured Streaming to RisingWave?

Migration is straightforward for SQL-heavy workloads. If your Spark Structured Streaming pipelines are written in Spark SQL, you can often translate them to RisingWave's PostgreSQL dialect with minor syntax adjustments. Pipelines written against the DataFrame API in Python or Scala require rewriting in SQL, but RisingWave's SQL expressiveness (joins, windows, aggregations, UDFs) covers most common patterns. Start by running both systems in parallel on the same Kafka topics and comparing results before cutting over.

Is Spark Structured Streaming still relevant in 2026?

Absolutely. Spark Structured Streaming remains the right choice for organizations deeply invested in the Spark ecosystem, particularly those that need unified batch-streaming on a single platform with tight ML integration. Its new Real-Time Mode on Databricks narrows the latency gap significantly. The question is not whether Spark is relevant but whether your specific use case is better served by a batch-first system with streaming bolted on, or a streaming-native system with serving built in.

Conclusion

The RisingWave vs Spark Structured Streaming decision comes down to your starting point and your latency requirements.

  • Spark Structured Streaming is the pragmatic choice for Spark-native organizations where seconds-level latency is acceptable and ML integration is critical. Its micro-batch model, while not the fastest, is reliable and benefits from the largest ecosystem in big data.
  • RisingWave is purpose-built for real-time analytics with sub-second latency, PostgreSQL compatibility, and operational simplicity. Its streaming-native architecture eliminates the JVM tuning, checkpoint management, and serving layer complexity that comes with running Spark Structured Streaming in production.

Both are open source under the Apache 2.0 license. Both are production-ready. The best choice depends on whether you are extending an existing Spark platform or building a new real-time analytics stack from scratch.


Ready to try RisingWave for your streaming workloads? Get started with RisingWave Cloud for free, no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
