RisingWave can replace most Spark batch ETL jobs with continuously updated materialized views that produce the same output — but in seconds instead of hours. The migration path is straightforward: translate your Spark SQL transformations into RisingWave CREATE MATERIALIZED VIEW statements, point the output to the same Apache Iceberg tables, and decommission the Spark cluster.
Why Replace Spark Batch Jobs?
Spark batch jobs are the workhorse of data engineering, but they come with well-known pain points:
- Latency: Hourly or daily jobs mean data is always stale by definition
- Cost: Spark clusters are expensive, especially when jobs run for 30–60 minutes every hour
- Operational overhead: Job scheduling (Airflow), cluster management, dependency hell, and on-call burden
- Failure recovery: Failed Spark jobs require manual reruns; partial outputs create correctness issues
RisingWave's streaming model eliminates these problems: the "job" runs continuously as a persistent materialized view, updates incrementally as new data arrives, and recovers automatically from failures.
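The incremental model is the key difference: instead of rescanning the full input on every run, the engine applies each change event to the stored aggregate. A minimal Python sketch of the idea (an illustration of incremental aggregation with retractions, not RisingWave's actual implementation):

```python
# Minimal sketch of incremental aggregation, the idea behind a streaming
# materialized view: each change event updates the stored aggregate in
# O(1) instead of triggering a full rescan. Illustrative only.

def apply_event(state, event):
    """Apply one insert/delete change event to a running SUM/COUNT."""
    key = (event["day"], event["category"])
    revenue, count = state.get(key, (0.0, 0))
    sign = 1 if event["op"] == "insert" else -1  # deletes retract their contribution
    state[key] = (revenue + sign * event["amount"], count + sign)
    return state

state = {}
events = [
    {"op": "insert", "day": "2024-06-01", "category": "toys", "amount": 20.0},
    {"op": "insert", "day": "2024-06-01", "category": "toys", "amount": 5.0},
    {"op": "delete", "day": "2024-06-01", "category": "toys", "amount": 5.0},
]
for e in events:
    apply_event(state, e)

print(state)  # {('2024-06-01', 'toys'): (20.0, 1)}
```

A deleted or updated row simply flows through as a retraction followed by a new insert, which is why CDC sources compose naturally with this model.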
When to Replace vs. Keep Spark
| Workload Type | Replace with RisingWave? | Reason |
|---|---|---|
| Incremental aggregations (sums, counts, averages) | Yes | Native streaming aggregation |
| Rolling window calculations | Yes | TUMBLE/HOP/SESSION windows |
| CDC-based dimension updates | Yes | Native CDC connectors |
| Full historical reprocessing | No | Spark excels at one-time batch scans |
| ML model training | No | Spark MLlib has no equivalent |
| Complex shuffles across TBs of history | Hybrid | Use RisingWave for incremental, Spark for backfill |
| Time-series aggregations | Yes | Window functions purpose-built for this |
Migration Pattern
Step 1: Identify the Spark Job Logic
Take a typical Spark batch job that computes daily revenue by product category:
```python
# Existing PySpark batch job (runs hourly via Airflow)
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, count, date_trunc, sum

spark = SparkSession.builder.appName("revenue_agg").getOrCreate()

orders = spark.read.format("iceberg").load("s3://warehouse/raw/orders")
products = spark.read.format("iceberg").load("s3://warehouse/raw/products")

result = (
    orders
    .join(products, "product_id")
    .filter(col("order_date") >= "2024-01-01")
    .filter(col("status") != "cancelled")  # exclude cancelled orders
    .groupBy(date_trunc("day", col("order_date")).alias("day"), "category")
    .agg(
        sum(col("quantity") * col("unit_price")).alias("revenue"),
        count("order_id").alias("order_count"),
        avg(col("unit_price")).alias("avg_unit_price"),
    )
)

result.write.format("iceberg").mode("overwrite") \
    .save("s3://warehouse/analytics/daily_revenue_by_category")
```
Step 2: Translate to RisingWave
```sql
-- Source: orders from PostgreSQL CDC
CREATE SOURCE orders_cdc (
    order_id BIGINT PRIMARY KEY,
    product_id BIGINT,
    quantity INT,
    unit_price NUMERIC(12, 2),
    order_date TIMESTAMPTZ,
    status VARCHAR
)
WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres.internal',
    port = '5432',
    username = 'cdc_user',
    password = 'secret',
    database.name = 'ecommerce',
    table.name = 'orders'
)
FORMAT DEBEZIUM ENCODE JSON;

-- Reference: product catalog
CREATE TABLE product_catalog (
    product_id BIGINT PRIMARY KEY,
    sku VARCHAR,
    name VARCHAR,
    category VARCHAR
);

-- Continuously updated daily revenue aggregation
CREATE MATERIALIZED VIEW daily_revenue_by_category AS
SELECT
    DATE_TRUNC('day', o.order_date) AS day,
    p.category,
    SUM(o.quantity * o.unit_price) AS revenue,
    COUNT(o.order_id) AS order_count,
    AVG(o.unit_price) AS avg_unit_price
FROM orders_cdc o
JOIN product_catalog p ON o.product_id = p.product_id
WHERE o.status != 'cancelled'
GROUP BY DATE_TRUNC('day', o.order_date), p.category;

-- Sink to Iceberg — same table the Spark job was writing
CREATE SINK daily_revenue_sink AS
SELECT * FROM daily_revenue_by_category
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'day,category',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'daily_revenue_by_category'
);
```
The Spark job ran hourly and took 20 minutes to complete. RisingWave updates the aggregation within seconds of each new order, continuously, without a cluster to manage.
Step 3: Validate Output Parity
Before decommissioning Spark, run both pipelines in parallel and compare outputs. The query below reads the RisingWave materialized view; in RisingWave v2.8+ you can also query the Iceberg table directly from RisingWave and diff the two result sets:

```sql
-- Validation query against the materialized view; run the
-- equivalent query against the Iceberg table and compare
SELECT
    mv.day,
    mv.category,
    mv.revenue AS mv_revenue,
    mv.order_count AS mv_orders
FROM daily_revenue_by_category mv
WHERE mv.day = CURRENT_DATE - INTERVAL '1 DAY'
ORDER BY mv_revenue DESC
LIMIT 20;
Compare this against the Spark output for the same day. When they match consistently, decommission the Spark job.
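The comparison itself is easy to script. A sketch of a parity check in Python, assuming you have already fetched both result sets (e.g., via a Trino or Postgres client) as lists of dicts; all names here are illustrative:

```python
# Sketch of an output-parity check between the Spark-written table and
# the RisingWave materialized view. Assumes both result sets were
# fetched separately as lists of dicts keyed by (day, category).

def diff_outputs(spark_rows, rw_rows, rel_tol=1e-9):
    """Return (key, reason) pairs where revenue or order_count disagree."""
    def index(rows):
        return {(r["day"], r["category"]): r for r in rows}
    a, b = index(spark_rows), index(rw_rows)
    mismatches = []
    for key in a.keys() | b.keys():
        ra, rb = a.get(key), b.get(key)
        if ra is None or rb is None:
            mismatches.append((key, "missing on one side"))
        elif (abs(ra["revenue"] - rb["revenue"]) > rel_tol * max(abs(ra["revenue"]), 1)
              or ra["order_count"] != rb["order_count"]):
            mismatches.append((key, "value mismatch"))
    return mismatches

spark_out = [{"day": "2024-06-01", "category": "toys", "revenue": 120.0, "order_count": 3}]
rw_out = [{"day": "2024-06-01", "category": "toys", "revenue": 120.0, "order_count": 3}]
print(diff_outputs(spark_out, rw_out))  # []
```

A relative tolerance on revenue is deliberate: NUMERIC-to-float conversions on either side can introduce rounding noise that is not a real discrepancy.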
Cost Comparison
| Dimension | Spark Batch (hourly) | RisingWave Streaming |
|---|---|---|
| Cluster cost | ~$2/hour × 2 hours/day × 30 days = $120/month | ~$0.30/hour × 24h × 30 = $216/month* |
| Data freshness | Up to 60 minutes stale | 30–60 seconds |
| Failure recovery | Manual Airflow rerun | Automatic |
| Engineering overhead | High (job scheduling, tuning) | Low (SQL only) |
| Backfill capability | Full historical | Requires initial load |
*RisingWave is always-on but uses significantly less compute than a Spark cluster. For many workloads, the total cost is lower because the processing is incremental rather than full table scans.
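The table's figures work out as follows (illustrative rates from the table, not actual cloud pricing):

```python
# Reproduce the cost table's arithmetic (illustrative rates, not
# real pricing). Spark bills ~2 hours of cluster time per day;
# RisingWave is always-on but at a much smaller hourly rate.
spark_rate, spark_hours_per_day = 2.00, 2  # $/hr, billed hrs/day
rw_rate_cents = 30                         # $0.30/hr, in cents to avoid float drift

spark_monthly = spark_rate * spark_hours_per_day * 30
rw_monthly = rw_rate_cents * 24 * 30 / 100  # always-on, 24h/day

print(spark_monthly, rw_monthly)  # 120.0 216.0
```

The crossover depends entirely on how many hours per day your Spark cluster actually bills: at the 20-minutes-per-hourly-run figure quoted above (~8 billed hours/day), the same arithmetic puts Spark at $480/month, well above the always-on streaming cost.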
Handling the Initial Backfill
RisingWave processes new events going forward. To backfill historical data into the Iceberg table, run a one-time Spark or Trino job before switching over:
```sql
-- In Trino: one-time backfill of historical data
INSERT INTO iceberg.analytics.daily_revenue_by_category
SELECT
    DATE_TRUNC('day', o.order_date) AS day,
    p.category,
    SUM(o.quantity * o.unit_price) AS revenue,
    COUNT(o.order_id) AS order_count,
    AVG(o.unit_price) AS avg_unit_price
FROM iceberg.raw.orders o
JOIN iceberg.raw.products p ON o.product_id = p.product_id
WHERE o.status != 'cancelled'
  AND o.order_date < '2024-01-01' -- historical period
GROUP BY DATE_TRUNC('day', o.order_date), p.category;
```
After the backfill completes, RisingWave takes over for all new data. The Iceberg table contains both historical (from Trino/Spark) and real-time (from RisingWave) data seamlessly.
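The main cutover pitfall is the boundary itself: a gap loses events, an overlap double-counts them. A small sketch of a boundary check, assuming you can read the backfill cutoff and the earliest event timestamp the CDC stream captured (names and timestamps are illustrative):

```python
# Sketch of a cutover boundary check: the backfill covers everything
# strictly before `cutoff`, so CDC must cover everything at or after it.
from datetime import datetime

def check_cutover(backfill_cutoff, earliest_cdc_event):
    """Return 'ok', 'gap' (events lost), or 'overlap' (double-counted)."""
    cutoff = datetime.fromisoformat(backfill_cutoff)
    first = datetime.fromisoformat(earliest_cdc_event)
    if first > cutoff:
        return "gap"      # events between cutoff and CDC start are missing
    if first < cutoff:
        return "overlap"  # CDC replays rows the backfill already wrote
    return "ok"

print(check_cutover("2024-01-01T00:00:00", "2024-01-01T00:00:00"))  # ok
```

In practice the safe order is: start the CDC source first, note its start position, then backfill everything strictly before that position.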
FAQ
Q: Can RisingWave handle the same scale as Spark batch jobs? A: RisingWave scales horizontally by adding compute nodes. For most aggregation workloads that fit in RisingWave's incremental model, it handles scale comparable to or exceeding typical Spark batch jobs — with much lower latency.
Q: What about Spark jobs that do full historical scans for slowly changing dimensions? A: SCD updates are well-handled by RisingWave's CDC connectors. For true full-history reprocessing (e.g., model retraining), keep Spark for those specific jobs and use RisingWave for incremental updates.
Q: Does RisingWave support complex joins like Spark's broadcast joins? A: RisingWave supports standard equi-joins, including joins on streaming tables and static reference tables. For large-scale shuffled joins on historical data, Spark or Trino may be more appropriate.
Q: How do I migrate Spark Structured Streaming jobs? A: Spark Structured Streaming jobs are the closest match to RisingWave's model. Most Structured Streaming logic translates directly to RisingWave SQL with minor syntax changes (e.g., withWatermark becomes a WATERMARK FOR clause on the source or window definition).
Q: What is the migration risk? A: The main risk is semantic differences in aggregation behavior during the transition window. Run both pipelines in parallel for at least one full business cycle (day/week) and validate outputs before decommissioning Spark.
Get Started
Migrate your Spark batch jobs to real-time with RisingWave and Iceberg: translate the SQL, run both pipelines in parallel to validate, then decommission the batch cluster.