Building a Real-Time Lakehouse with RisingWave and Apache Iceberg

TL;DR

A real-time lakehouse combines open-format storage (Apache Iceberg on S3) with a streaming SQL engine (RisingWave) to keep analytical tables continuously fresh. Instead of running hourly batch ETL jobs, RisingWave writes incremental updates directly to Iceberg every 60 seconds, and any query engine (Trino, Spark, Athena) reads them immediately.

What Is a Real-Time Lakehouse?

The lakehouse architecture was defined by Databricks as a data management system that "combines the best elements of data lakes and data warehouses." Concretely, a lakehouse stores raw and processed data in open formats on low-cost cloud object storage (S3, GCS, ADLS), while layering warehouse-grade capabilities on top: ACID transactions, schema enforcement, time travel, and efficient query execution through open table formats like Apache Iceberg.

The three-tier evolution looks like this:

Architecture	Storage	ACID	Freshness	Cost
Data warehouse	Proprietary columnar	Yes	Batch (hours)	High
Data lake	Raw files (Parquet, JSON)	No	Near-real-time	Low
Lakehouse	Parquet + Iceberg metadata	Yes	Depends on pipeline	Low

The "depends on pipeline" caveat for lakehouses is where most teams get stuck. A standard lakehouse pipeline still relies on batch ETL: a Spark job reads production databases every hour, transforms the data, and overwrites Iceberg partitions. Data is never older than one hour at best, but the engineering overhead and operational cost of running Spark clusters on a schedule is significant.

A real-time lakehouse closes this gap. Instead of batch jobs, a streaming SQL engine like RisingWave continuously reads change events from operational systems, maintains materialized views over those streams, and writes the results to Iceberg in small incremental commits. The result is Iceberg tables that reflect production data within seconds to minutes, not hours.

Architecture: RisingWave + Apache Iceberg

Here is the end-to-end architecture for a real-time lakehouse built on RisingWave and Apache Iceberg:

flowchart LR
    subgraph Sources["Operational Sources"]
        PG["PostgreSQL\n(CDC)"]
        MYSQL["MySQL\n(CDC)"]
        K["Apache Kafka\n(Event Streams)"]
    end

    subgraph RW["RisingWave (Streaming Layer)"]
        SRC["Sources &\nCDC Ingestion"]
        MV["Materialized Views\n(Streaming SQL)"]
        SINK["Iceberg Sink\n(incremental commits)"]
    end

    subgraph Lake["Lakehouse Storage (S3)"]
        ICE["Apache Iceberg\nTables"]
        META["Iceberg Catalog\n(REST / Glue)"]
    end

    subgraph Query["Query Engines"]
        TRINO["Trino / Athena"]
        SPARK["Apache Spark"]
        DUCK["DuckDB"]
    end

    PG -->|row-level changes| SRC
    MYSQL -->|row-level changes| SRC
    K -->|event records| SRC
    SRC --> MV
    MV --> SINK
    SINK -->|Parquet + metadata| ICE
    ICE --> META
    ICE --> TRINO
    ICE --> SPARK
    ICE --> DUCK

Each component has a distinct role:

RisingWave is a PostgreSQL-compatible streaming database. It ingests CDC events from operational databases and event records from Kafka, maintains stateful streaming aggregations as materialized views, and writes the results to Iceberg as they change. RisingWave is open source (Apache 2.0) and built in Rust with disaggregated storage on S3.
Apache Iceberg is the open table format that sits on top of your object storage. It wraps Parquet files with a metadata layer that provides ACID transactions, partition evolution, schema evolution, and time travel. Any engine that speaks Iceberg can read the tables RisingWave writes.
The Iceberg catalog (REST, AWS Glue, or Hive Metastore) tracks which metadata files correspond to which tables. RisingWave commits new snapshots to this catalog after each checkpoint.
Query engines (Trino, Spark, Athena, DuckDB) read the Iceberg tables for ad-hoc analytics, BI dashboards, and ML feature pipelines. They see fresh data on every query because RisingWave has already committed the latest streaming results.

Setting It Up: Step by Step

Step 1: Create a Source

For Kafka-based event streams, RisingWave creates a source that maps the topic schema:

CREATE SOURCE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL,
    region      VARCHAR,
    created_at  TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

For PostgreSQL CDC (change data capture), RisingWave connects directly to the Postgres replication slot:

CREATE SOURCE pg_orders WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres',
    port = '5432',
    username = 'replicator',
    password = 'secret',
    database.name = 'ecommerce',
    schema.name = 'public',
    table.name = 'orders'
);

RisingWave supports native CDC from PostgreSQL, MySQL, MongoDB, and other sources. See CDC source documentation for the full list of supported connectors.

Step 2: Create a Materialized View

A materialized view defines the transformation you want to keep continuously up to date. RisingWave maintains this incrementally as new events arrive, without re-processing the full table:

CREATE MATERIALIZED VIEW daily_sales_lakehouse AS
SELECT
    DATE_TRUNC('day', created_at) AS sale_date,
    region,
    COUNT(*)        AS order_count,
    SUM(amount)     AS total_sales,
    AVG(amount)     AS avg_order_value
FROM orders
GROUP BY DATE_TRUNC('day', created_at), region;

This is standard SQL. RisingWave handles the incremental maintenance automatically: as new orders arrive, only the affected day/region combinations are recalculated.

Step 3: Create an Iceberg Sink

The Iceberg sink writes the materialized view results to your data lake. Here is a complete example using AWS Glue as the catalog:

CREATE SINK lakehouse_daily_sales
FROM daily_sales_lakehouse
WITH (
    connector               = 'iceberg',
    type                    = 'append-only',
    force_append_only       = 'true',
    warehouse.path          = 's3://my-data-lake/warehouse',
    database.name           = 'analytics',
    table.name              = 'daily_sales',
    create_table_if_not_exists = 'true',
    catalog.type            = 'glue',
    catalog.name            = 'my_glue_catalog',
    s3.access.key           = 'YOUR_ACCESS_KEY',
    s3.secret.key           = 'YOUR_SECRET_KEY',
    s3.region               = 'us-east-1',
    commit_checkpoint_interval = '60'
);

Key parameters explained:

force_append_only = 'true' - Required when the source is a streaming aggregation (which produces retract messages). This converts the stream to append-only for Iceberg.
create_table_if_not_exists = 'true' - RisingWave creates the Iceberg table automatically on the first commit.
commit_checkpoint_interval = '60' - RisingWave commits a new Iceberg snapshot every 60 seconds.
catalog.type = 'glue' - Use AWS Glue as the Iceberg catalog. Other options: 'rest', 'hive', 'storage'.

For upsert workloads (when you need to update existing rows rather than append), use type = 'upsert' with a primary_key:

CREATE SINK lakehouse_customers_upsert
FROM customer_profiles_mv
WITH (
    connector         = 'iceberg',
    type              = 'upsert',
    primary_key       = 'customer_id',
    write_mode        = 'merge-on-read',
    warehouse.path    = 's3://my-data-lake/warehouse',
    database.name     = 'analytics',
    table.name        = 'customer_profiles',
    catalog.type      = 'glue',
    catalog.name      = 'my_glue_catalog',
    s3.access.key     = 'YOUR_ACCESS_KEY',
    s3.secret.key     = 'YOUR_SECRET_KEY',
    s3.region         = 'us-east-1'
);

The merge-on-read write mode is the default for upsert sinks. It writes delete files alongside data files, which is more write-efficient. For read-heavy workloads where query latency matters more than write throughput, use write_mode = 'copy-on-write'. For deeper background on these trade-offs, see Apache Iceberg: A Complete Guide.

Query the Lakehouse

Once RisingWave is writing to Iceberg, any query engine can read those tables. Here is a Trino example:

-- Register the catalog in trino.properties first, then:
SELECT
    sale_date,
    region,
    total_sales,
    order_count,
    avg_order_value
FROM glue.analytics.daily_sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '7' DAY
ORDER BY total_sales DESC;

The same table is queryable from Spark:

df = spark.read.format("iceberg").load("glue.analytics.daily_sales")
df.filter("sale_date >= '2026-01-01'").show()

And from DuckDB for local analytics:

-- Using DuckDB's Iceberg extension
SELECT * FROM iceberg_scan('s3://my-data-lake/warehouse/analytics/daily_sales');

Because Iceberg commits are atomic, every query sees a consistent snapshot. There are no partial reads while RisingWave is in the middle of writing a checkpoint. For a detailed look at Iceberg catalog types and how to connect different query engines, see Apache Iceberg Catalogs Explained.

Why Batch ETL Falls Short

Before explaining when to choose this architecture, it helps to understand the specific failure modes of batch ETL:

Data freshness latency. A Spark batch job that runs every hour produces data that is at minimum 60 minutes stale at the moment it completes. For operational use cases (fraud detection, inventory management, real-time pricing), an hour of latency is not acceptable.

Resource spikes. Batch jobs process hours of accumulated data all at once. This creates predictable but large compute spikes. Clusters must be sized for peak load, which means paying for capacity that sits idle most of the time.

Operational complexity. Scheduling, monitoring, and debugging batch pipelines requires additional tooling (Airflow, step functions, etc.). When a batch job fails mid-run, you need logic to resume from checkpoints or re-process from scratch.

No incremental processing. Most batch systems re-read entire source tables or large partitions on every run. This gets expensive as data volumes grow.

RisingWave's incremental approach addresses all four: it processes only new events (not full table scans), smooths the compute profile into a continuous low-level workload, and maintains exactly-once delivery via its checkpoint mechanism.

When to Choose This Architecture

This architecture is a strong fit when:

Your analytics use cases require data fresher than 15 minutes.
You have operational databases (Postgres, MySQL) and need their changes reflected in your lakehouse without batch ETL jobs.
You want a single SQL interface that handles both streaming transformations and lakehouse writes.
Your query engines (Spark, Trino, Athena) already read Iceberg, and you want to feed them fresher data.
You need schema evolution support - Iceberg handles column adds/renames gracefully, and RisingWave propagates them via auto.schema.change = 'true'.

It may be overkill when:

Hourly data freshness is acceptable and your current batch pipeline is stable.
Your data volumes are small enough that a scheduled dbt + Spark job is cheaper to operate.
You are not yet using Iceberg and would need to migrate query engines as well.

For teams already on Iceberg, adding RisingWave as the streaming write layer is generally low-friction: the table format, catalog, and query engines remain unchanged. For the broader streaming vs. batch trade-off discussion, see Streaming Payments vs. Batch.

Key Takeaways

A real-time lakehouse writes streaming results to open-format storage continuously, replacing hourly batch jobs with incremental commits that land in seconds to minutes.
RisingWave connects to operational databases via CDC and event streams via Kafka, maintains streaming aggregations as materialized views, and writes to Apache Iceberg through a native sink connector.
Apache Iceberg provides the ACID, schema evolution, and time-travel semantics that make the lakehouse queryable by Spark, Trino, Athena, and DuckDB.
RisingWave's commit_checkpoint_interval controls the commit frequency. The default is 60 seconds, which is appropriate for most analytics workloads.
Upsert sinks require primary_key and support both merge-on-read (write-efficient) and copy-on-write (read-efficient) modes.

FAQ

Q: Does RisingWave support all Iceberg catalog types?

RisingWave supports REST, AWS Glue, Hive Metastore, and storage-based (no external catalog) catalogs. REST catalog is the most portable option for cloud-agnostic deployments. AWS Glue is the simplest option for teams already on AWS. See Apache Iceberg REST Catalog: The Complete Guide for setup instructions.

Q: What happens to Iceberg data if RisingWave restarts?

RisingWave uses checkpointing to maintain exactly-once delivery to Iceberg sinks. On restart, it resumes from the last committed checkpoint and replays only uncommitted messages. Partial Iceberg snapshots from an interrupted commit are rolled back automatically.

Q: Can I query the Iceberg tables RisingWave writes from within RisingWave itself?

Yes. RisingWave 2.8 replaced its batch execution engine with Apache DataFusion, making it capable of running analytical queries over Iceberg tables directly via SQL. You can use RisingWave as both the writer and the reader, which is useful for enrichment pipelines where a streaming job needs to join against historical data stored in Iceberg.

Q: How does RisingWave handle late-arriving events in Iceberg partitioned tables?

RisingWave writes to the Iceberg partition determined by the event's timestamp, not the arrival time. If you partition by day(created_at), a late event will update the correct historical partition. Iceberg's ACID semantics ensure that concurrent readers see a consistent view before and after the update.

Want to try this? RisingWave is open source under Apache 2.0. Start with the quickstart guide or connect with the team in the RisingWave Slack community.