Replacing Apache Spark Batch Jobs with RisingWave and Iceberg

RisingWave can replace most Spark batch ETL jobs with continuously updated materialized views that produce the same output — but in seconds instead of hours. The migration path is straightforward: translate your Spark SQL transformations into RisingWave CREATE MATERIALIZED VIEW statements, point the output to the same Apache Iceberg tables, and decommission the Spark cluster.

Why Replace Spark Batch Jobs?

Spark batch jobs are the workhorse of data engineering, but they come with well-known pain points:

  • Latency: Hourly or daily jobs mean data is always stale by definition
  • Cost: Spark clusters are expensive, especially when jobs run for 30–60 minutes every hour
  • Operational overhead: Job scheduling (Airflow), cluster management, dependency hell, and on-call burden
  • Failure recovery: Failed Spark jobs require manual reruns; partial outputs create correctness issues

RisingWave's streaming model eliminates these problems: the "job" runs continuously as a persistent materialized view, updates incrementally as new data arrives, and recovers automatically from failures.

When to Replace vs. Keep Spark

| Workload Type | Replace with RisingWave? | Reason |
|---|---|---|
| Incremental aggregations (sums, counts, averages) | Yes | Native streaming aggregation |
| Rolling window calculations | Yes | TUMBLE/HOP/SESSION windows |
| CDC-based dimension updates | Yes | Native CDC connectors |
| Full historical reprocessing | No | Spark excels at one-time batch scans |
| ML model training | No | Spark MLlib has no equivalent |
| Complex shuffles across TBs of history | Hybrid | Use RisingWave for incremental, Spark for backfill |
| Time-series aggregations | Yes | Window functions purpose-built for this |
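For the window workloads above, RisingWave's TUMBLE table function covers the common fixed-window case. A minimal sketch, assuming a denormalized `orders` table that already carries a `category` column and an `order_date` timestamp (HOP and SESSION follow the same pattern):

```sql
-- Hourly revenue per category over non-overlapping 1-hour windows.
-- TUMBLE(...) assigns each row to exactly one window and exposes
-- window_start / window_end columns for grouping.
CREATE MATERIALIZED VIEW hourly_revenue AS
SELECT
    window_start,
    category,
    SUM(quantity * unit_price) AS revenue
FROM TUMBLE(orders, order_date, INTERVAL '1 HOUR')
GROUP BY window_start, category;
```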

Migration Pattern

Step 1: Identify the Spark Job Logic

Take a typical Spark batch job that computes daily revenue by product category:

# Existing PySpark batch job (runs hourly via Airflow)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, count, date_trunc

spark = SparkSession.builder.appName("revenue_agg").getOrCreate()

orders = spark.read.format("iceberg").load("s3://warehouse/raw/orders")
products = spark.read.format("iceberg").load("s3://warehouse/raw/products")

result = (orders
    .join(products, "product_id")
    .filter(col("order_date") >= "2024-01-01")
    .groupBy(date_trunc("day", col("order_date")).alias("day"), "category")
    .agg(
        sum(col("quantity") * col("unit_price")).alias("revenue"),
        count("order_id").alias("order_count")
    )
)

result.write.format("iceberg").mode("overwrite") \
    .save("s3://warehouse/analytics/daily_revenue_by_category")

Step 2: Translate to RisingWave

-- Orders from PostgreSQL CDC. RisingWave requires CREATE TABLE (not
-- CREATE SOURCE) for CDC connectors, since the snapshot and ongoing
-- updates must be persisted for downstream joins and aggregations.
CREATE TABLE orders_cdc (
    order_id    BIGINT PRIMARY KEY,
    product_id  BIGINT,
    quantity    INT,
    unit_price  NUMERIC(12, 2),
    order_date  TIMESTAMPTZ,
    status      VARCHAR
)
WITH (
    connector      = 'postgres-cdc',
    hostname       = 'postgres.internal',
    port           = '5432',
    username       = 'cdc_user',
    password       = 'secret',
    database.name  = 'ecommerce',
    table.name     = 'orders'
);

-- Reference: product catalog
CREATE TABLE product_catalog (
    product_id BIGINT PRIMARY KEY,
    sku        VARCHAR,
    name       VARCHAR,
    category   VARCHAR
);

-- Continuously updated daily revenue aggregation
CREATE MATERIALIZED VIEW daily_revenue_by_category AS
SELECT
    DATE_TRUNC('day', o.order_date)   AS day,
    p.category,
    SUM(o.quantity * o.unit_price)    AS revenue,
    COUNT(o.order_id)                 AS order_count,
    AVG(o.unit_price)                 AS avg_unit_price
FROM orders_cdc o
JOIN product_catalog p ON o.product_id = p.product_id
WHERE o.status != 'cancelled'
GROUP BY DATE_TRUNC('day', o.order_date), p.category;

-- Sink to Iceberg — same table the Spark job was writing
CREATE SINK daily_revenue_sink AS
SELECT * FROM daily_revenue_by_category
WITH (
    connector      = 'iceberg',
    type           = 'upsert',
    primary_key    = 'day,category',
    catalog.type   = 'rest',
    catalog.uri    = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://warehouse',
    s3.region      = 'us-east-1',
    database.name  = 'analytics',
    table.name     = 'daily_revenue_by_category'
);

The Spark job ran hourly and took 20 minutes to complete. RisingWave updates the aggregation within seconds of each new order, continuously, without a cluster to manage.

Step 3: Validate Output Parity

Before decommissioning Spark, run both pipelines in parallel and compare outputs for the same days. Query the materialized view in RisingWave and compare it against the Iceberg table the Spark job produced (RisingWave v2.8+ can also read Iceberg tables directly):

-- Query the materialized view for yesterday's totals,
-- then compare against the Iceberg table Spark wrote
SELECT
    mv.day,
    mv.category,
    mv.revenue          AS mv_revenue,
    mv.order_count      AS mv_orders
FROM daily_revenue_by_category mv
WHERE mv.day = CURRENT_DATE - INTERVAL '1 DAY'
ORDER BY revenue DESC
LIMIT 20;

Compare this against the Spark output for the same day. When they match consistently, decommission the Spark job.
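The comparison itself can be automated. A minimal sketch (the function name and row shape are illustrative, not part of any RisingWave API): fetch both result sets with any SQL client, then diff them keyed on `(day, category)`, allowing a small relative tolerance on revenue to absorb rounding differences between engines.

```python
from decimal import Decimal

def diff_outputs(spark_rows, rw_rows, rel_tol=Decimal("0.0001")):
    """Compare two pipeline outputs keyed on (day, category).

    Each row is (day, category, revenue, order_count). Returns a list of
    human-readable mismatch descriptions; empty means the outputs agree.
    """
    def index(rows):
        return {(day, cat): (rev, cnt) for day, cat, rev, cnt in rows}

    spark, rw = index(spark_rows), index(rw_rows)
    mismatches = []
    for key in sorted(spark.keys() | rw.keys()):
        if key not in spark:
            mismatches.append(f"{key}: only in RisingWave")
        elif key not in rw:
            mismatches.append(f"{key}: only in Spark")
        else:
            (s_rev, s_cnt), (r_rev, r_cnt) = spark[key], rw[key]
            # Tolerate tiny relative revenue drift; counts must match exactly.
            tol = abs(s_rev) * rel_tol
            if abs(s_rev - r_rev) > tol or s_cnt != r_cnt:
                mismatches.append(
                    f"{key}: spark={s_rev}/{s_cnt} rw={r_rev}/{r_cnt}"
                )
    return mismatches

# Example rows as they might come back from each pipeline:
spark_out = [("2024-06-01", "toys", Decimal("120.50"), 7)]
rw_out = [("2024-06-01", "toys", Decimal("120.50"), 7),
          ("2024-06-01", "books", Decimal("55.00"), 3)]
print(diff_outputs(spark_out, rw_out))
# → ["('2024-06-01', 'books'): only in RisingWave"]
```

Running this daily during the parallel period gives a concrete "consistently matching" signal before you decommission the Spark job.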

Cost Comparison

| Dimension | Spark Batch (hourly) | RisingWave Streaming |
|---|---|---|
| Cluster cost | ~$2/hour × 2 hours/day × 30 days = $120/month | ~$0.30/hour × 24 h × 30 days = $216/month* |
| Data freshness | Up to 60 minutes stale | 30–60 seconds |
| Failure recovery | Manual Airflow rerun | Automatic |
| Engineering overhead | High (job scheduling, tuning) | Low (SQL only) |
| Backfill capability | Full historical | Requires initial load |

*RisingWave is always-on but uses significantly less compute than a Spark cluster. For many workloads, the total cost is lower because the processing is incremental rather than full table scans.

Handling the Initial Backfill

RisingWave processes new events going forward. To backfill historical data into the Iceberg table, run a one-time Spark or Trino job before switching over:

-- In Trino: one-time backfill of historical data
INSERT INTO iceberg.analytics.daily_revenue_by_category
SELECT
    DATE_TRUNC('day', o.order_date) AS day,
    p.category,
    SUM(o.quantity * o.unit_price)  AS revenue,
    COUNT(o.order_id)               AS order_count,
    AVG(o.unit_price)               AS avg_unit_price
FROM iceberg.raw.orders o
JOIN iceberg.raw.products p ON o.product_id = p.product_id
WHERE o.status != 'cancelled'
  AND o.order_date < '2024-01-01'  -- historical period
GROUP BY DATE_TRUNC('day', o.order_date), p.category;

After the backfill completes, RisingWave takes over for all new data. The Iceberg table contains both historical (from Trino/Spark) and real-time (from RisingWave) data seamlessly.
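One detail worth guarding at cutover is the boundary between the two writers: if the streaming side ever emits aggregates for days it only partially observed, the upsert sink will overwrite complete backfilled rows with partial totals. A sketch of one way to keep the two periods disjoint, assuming the same 2024-01-01 cutoff used in the Trino backfill (the view name is illustrative):

```sql
-- Restrict the streaming aggregation to the post-backfill period so
-- RisingWave never overwrites days that Trino/Spark already wrote.
CREATE MATERIALIZED VIEW daily_revenue_streaming AS
SELECT
    DATE_TRUNC('day', o.order_date) AS day,
    p.category,
    SUM(o.quantity * o.unit_price)  AS revenue,
    COUNT(o.order_id)               AS order_count,
    AVG(o.unit_price)               AS avg_unit_price
FROM orders_cdc o
JOIN product_catalog p ON o.product_id = p.product_id
WHERE o.status != 'cancelled'
  AND o.order_date >= '2024-01-01'  -- matches the backfill cutoff
GROUP BY DATE_TRUNC('day', o.order_date), p.category;
```

Because the sink upserts on `(day, category)`, a disjoint date predicate on each side guarantees the two writers never touch the same keys.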

FAQ

Q: Can RisingWave handle the same scale as Spark batch jobs? A: RisingWave scales horizontally by adding compute nodes. For most aggregation workloads that fit in RisingWave's incremental model, it handles scale comparable to or exceeding typical Spark batch jobs — with much lower latency.

Q: What about Spark jobs that do full historical scans for slowly changing dimensions? A: SCD updates are well-handled by RisingWave's CDC connectors. For true full-history reprocessing (e.g., model retraining), keep Spark for those specific jobs and use RisingWave for incremental updates.

Q: Does RisingWave support complex joins like Spark's broadcast joins? A: RisingWave supports standard equi-joins, including joins on streaming tables and static reference tables. For large-scale shuffled joins on historical data, Spark or Trino may be more appropriate.

Q: How do I migrate Spark Structured Streaming jobs? A: Spark Structured Streaming jobs are the closest match to RisingWave's model. Most Structured Streaming logic translates directly to RisingWave SQL with minor syntax changes (e.g., watermark becomes WATERMARK in window definitions).
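As a sketch of that watermark translation (assuming a hypothetical Kafka-backed `events` source; the topic and field names are illustrative), RisingWave declares the watermark in the source schema rather than on the query:

```sql
CREATE SOURCE events (
    event_id   BIGINT,
    event_time TIMESTAMPTZ,
    -- Roughly equivalent to Spark's withWatermark("event_time", "5 seconds"):
    -- rows arriving more than 5 seconds behind the maximum observed
    -- event_time are treated as late for window computations.
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    connector = 'kafka',
    topic     = 'events',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;
```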

Q: What is the migration risk? A: The main risk is semantic differences in aggregation behavior during the transition window. Run both pipelines in parallel for at least one full business cycle (day/week) and validate outputs before decommissioning Spark.

Get Started

Migrate your Spark batch jobs to real-time with RisingWave and Iceberg.
