A streaming lakehouse combines real-time stream processing with open-format data lake storage. RisingWave handles the streaming SQL layer — ingesting from Kafka and databases, applying continuous transformations — while Apache Iceberg provides ACID-compliant, query-engine-agnostic storage on S3 or compatible object storage. Together, they eliminate the trade-off between latency and reliability.
Why Build a Streaming Lakehouse?
Traditional data architectures force a choice: fast but fragile (Lambda architecture with separate batch and stream paths) or reliable but slow (batch pipelines into a data warehouse). The streaming lakehouse pattern resolves this tension.
The core insight is that open table formats like Apache Iceberg bring data warehouse features — ACID transactions, schema evolution, efficient querying — to object storage. Add a streaming SQL engine like RisingWave, and you get a system where data flows continuously from operational sources into reliable, queryable storage — at any scale, with no proprietary lock-in.
Architecture Overview
A RisingWave + Iceberg streaming lakehouse has four layers:
Ingestion Layer — Kafka topics, Postgres/MySQL CDC streams, or other event sources feed data into RisingWave as sources.
Processing Layer — RisingWave materializes views on top of sources, handling joins, aggregations, filtering, and enrichment in real time using incremental computation.
Storage Layer — RisingWave sinks transformed data into Apache Iceberg tables on S3. Each sink commit creates an Iceberg snapshot, enabling downstream systems to read incrementally.
Query Layer — BI tools, Spark, Trino, Athena, or RisingWave itself (v2.8+) query Iceberg tables for analytics.
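As a mental model, the four layers compose like a simple pipeline. The Python below is purely illustrative (a toy, not an actual implementation): events flow in, a transformation filters and reshapes them, results accumulate in storage, and a query runs over what has been stored:

```python
def run_pipeline(events, transform, query):
    """Toy model of the four layers: ingest -> process -> store -> query."""
    store = []                    # stands in for Iceberg tables on S3
    for event in events:          # ingestion layer (Kafka topics, CDC streams)
        row = transform(event)    # processing layer (materialized views)
        if row is not None:       # filtered-out events never reach storage
            store.append(row)     # storage layer (each append ~ a sink commit)
    return query(store)           # query layer (Trino, Spark, Athena, ...)
```

The real system differs in the obvious ways (the layers run continuously and concurrently, and storage is durable), but the direction of data flow is the same.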
Setting Up the Streaming Lakehouse: Step by Step
Step 1: Ingest from Kafka
Start by defining a RisingWave source for your Kafka topic. This creates a logical stream of events:
CREATE SOURCE orders_stream (
    order_id BIGINT,
    customer_id BIGINT,
    product_id BIGINT,
    quantity INT,
    unit_price NUMERIC(10,2),
    status VARCHAR,
    created_at TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
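Any Kafka client can produce to this topic; what matters is that each message is a JSON object whose fields match the source schema. A minimal Python sketch of the payload shape — the field names come from the source definition above, everything else (values, helper name) is illustrative:

```python
import json
from datetime import datetime, timezone

def make_order_event(order_id, customer_id, product_id, quantity, unit_price, status):
    """Serialize one order as a JSON payload matching the orders_stream schema."""
    return json.dumps({
        "order_id": order_id,
        "customer_id": customer_id,
        "product_id": product_id,
        "quantity": quantity,
        "unit_price": unit_price,                             # NUMERIC(10,2) as a JSON number
        "status": status,
        "created_at": datetime.now(timezone.utc).isoformat()  # TIMESTAMPTZ as ISO 8601
    })

payload = make_order_event(1, 42, 7, 2, 19.99, "confirmed")
```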
Step 2: Build Materialized Views
Materialized views are the transformation engine. RisingWave maintains them incrementally — as new events arrive, only the affected results are updated:
CREATE MATERIALIZED VIEW order_revenue_by_product AS
SELECT
    product_id,
    window_start AS hour_start,
    window_end AS hour_end,
    COUNT(*) AS order_count,
    SUM(quantity * unit_price) AS total_revenue,
    AVG(unit_price) AS avg_unit_price
FROM TUMBLE(orders_stream, created_at, INTERVAL '1 HOUR')
WHERE status != 'cancelled'
GROUP BY product_id, window_start, window_end;
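To build intuition for what TUMBLE does, here is a small Python sketch (not RisingWave code) that assigns events to non-overlapping one-hour buckets and aggregates them the way the view above does:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def tumble_start(ts):
    """Align a timestamp to the start of its one-hour tumbling window."""
    return ts - timedelta(minutes=ts.minute, seconds=ts.second,
                          microseconds=ts.microsecond)

def aggregate(events):
    """Aggregate (product_id, ts, quantity, unit_price, status) rows
    per product per window, mirroring order_revenue_by_product."""
    buckets = defaultdict(lambda: {"order_count": 0, "total_revenue": 0.0})
    for product_id, ts, quantity, unit_price, status in events:
        if status == "cancelled":           # mirrors WHERE status != 'cancelled'
            continue
        key = (product_id, tumble_start(ts))
        buckets[key]["order_count"] += 1
        buckets[key]["total_revenue"] += quantity * unit_price
    return buckets
```

The important difference is that RisingWave does this incrementally: an arriving event updates only its own (product, window) bucket instead of recomputing the whole result.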
Step 3: Sink to Iceberg
Sink your materialized view to an Iceberg table with continuous writes:
CREATE SINK revenue_to_iceberg AS
SELECT * FROM order_revenue_by_product
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'product_id,hour_start',  -- required for upsert sinks
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest-catalog:8181',
    warehouse.path = 's3://my-lakehouse/warehouse',
    s3.region = 'us-east-1',
    database.name = 'commerce',
    table.name = 'order_revenue_hourly'
);
Every micro-batch committed by RisingWave becomes a new Iceberg snapshot, immediately available to downstream query engines.
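Downstream readers can consume these snapshots incrementally by remembering the last snapshot they processed. The bookkeeping is simple enough to sketch in a few lines of Python (illustrative only; a real reader would list snapshots through a library such as PyIceberg or its query engine):

```python
def new_snapshots(snapshot_log, last_processed_id=None):
    """Return snapshot IDs committed after last_processed_id, oldest first.

    snapshot_log: snapshot IDs in commit order, as exposed by the Iceberg
    table metadata. If last_processed_id is None, return everything.
    """
    if last_processed_id is None:
        return list(snapshot_log)
    try:
        idx = snapshot_log.index(last_processed_id)
    except ValueError:
        # The last processed snapshot has expired (e.g. after snapshot
        # cleanup); fall back to a full re-read.
        return list(snapshot_log)
    return snapshot_log[idx + 1:]
```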
Step 4: Enrich with CDC Data
A powerful pattern in the streaming lakehouse is enriching streaming events with database CDC data. Here's how to join order events with customer profiles from Postgres:
CREATE TABLE customers_cdc (
    customer_id BIGINT PRIMARY KEY,
    customer_name VARCHAR,
    customer_segment VARCHAR,
    country VARCHAR
)
WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres',
    port = '5432',
    username = 'replicator',
    password = 'secret',
    database.name = 'commerce_db',
    schema.name = 'public',
    table.name = 'customers'
);
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT
    o.order_id,
    o.product_id,
    o.quantity,
    o.unit_price,
    c.customer_name,
    c.customer_segment,
    c.country,
    o.created_at
FROM orders_stream o
JOIN customers_cdc FOR SYSTEM_TIME AS OF PROCTIME() c
    ON o.customer_id = c.customer_id;
The FOR SYSTEM_TIME AS OF PROCTIME() clause performs a process-time temporal join: each order event is matched against the latest customer record available at the moment the event is processed, so results stay correct even as customer data changes.
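Conceptually, the join keeps the newest version of each customer row and probes it as order events arrive. A minimal Python sketch of that mechanic (illustrative, with a single interleaved stream standing in for the two inputs):

```python
def temporal_join(interleaved):
    """Process an interleaved stream of CDC updates and order events.

    interleaved: list of ("customer", row_dict) or ("order", row_dict)
    items in processing order. Each order is enriched with the customer
    row as it exists at that moment, mimicking
    FOR SYSTEM_TIME AS OF PROCTIME().
    """
    customers = {}  # latest version of each customer row, keyed by customer_id
    enriched = []
    for kind, row in interleaved:
        if kind == "customer":
            customers[row["customer_id"]] = row      # CDC upsert
        else:
            c = customers.get(row["customer_id"])
            if c is not None:                        # inner join: drop unmatched
                enriched.append({**row, "customer_segment": c["customer_segment"]})
    return enriched
```

Note that an order is enriched with whatever customer version was current when it was processed; later customer updates do not retroactively change already-emitted rows.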
Comparison: Streaming Lakehouse Architectures
| Architecture | Latency | Complexity | Cost | Flexibility |
| --- | --- | --- | --- | --- |
| Lambda (batch + stream) | Medium (batch path) | Very High | High | Low |
| Kappa (stream only) | Low | Medium | Medium | Medium |
| Data Warehouse Only | High (hours) | Low | High | Low |
| RisingWave + Iceberg | Low (seconds) | Low | Low | High |
| Flink + Iceberg | Low | High | Medium | Medium |
RisingWave + Iceberg offers the best combination of low latency, operational simplicity, and cost efficiency. Unlike Flink-based pipelines, RisingWave uses SQL as the primary interface — no Java/Scala code, no job management overhead.
Catalog Configuration Options
The Iceberg sink connector in RisingWave supports multiple catalog backends:
-- REST Catalog (recommended for cloud-native deployments)
CREATE SINK my_sink AS SELECT * FROM my_mv
WITH (
    connector = 'iceberg',
    type = 'append-only',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://bucket/warehouse',
    s3.region = 'us-east-1',
    database.name = 'my_db',
    table.name = 'my_table'
);
For AWS Glue, replace catalog.type = 'rest' with catalog.type = 'glue' and provide your AWS credentials. For Hive Metastore, use catalog.type = 'hive' and specify the HMS URI.
Monitoring and Observability
A production streaming lakehouse needs observability at every layer:
RisingWave metrics — Monitor source lag (how far behind real-time), materialized view refresh latency, and sink commit frequency through RisingWave's built-in metrics endpoint (compatible with Prometheus).
Iceberg metrics — Track snapshot count, file count, and total data size. Many catalogs expose these through REST APIs or cloud monitoring.
End-to-end latency — Instrument your pipeline by comparing event timestamps in the source with snapshot commit times in Iceberg.
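The latency computation itself is a subtraction plus a percentile; a small Python sketch (the helper names and the nearest-rank percentile method are illustrative choices, not part of any RisingWave or Iceberg API):

```python
import math
from datetime import datetime, timezone

def e2e_latency_seconds(event_ts, snapshot_committed_at):
    """Seconds from event creation to the Iceberg snapshot commit that
    made it queryable. Both datetimes must be timezone-aware."""
    return (snapshot_committed_at - event_ts).total_seconds()

def p95(latencies):
    """p95 over a sample of latencies (nearest-rank method)."""
    ordered = sorted(latencies)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]
```

Tracking a percentile rather than the mean matters here: commit intervals make latency bursty, and the tail is what downstream consumers actually experience.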
Common Patterns
Fan-out sinks — A single materialized view can feed multiple Iceberg sinks. Use this to write the same data to different tables partitioned by different keys (e.g., by region and by product category) without duplicating computation.
Tiered storage — Keep hot data in RisingWave materialized views for sub-second queries, and cold data in Iceberg on S3 for historical analysis. Query both layers using a federated query engine like Trino.
Compaction scheduling — Streaming writes to Iceberg create many small files. Schedule compaction jobs (using Spark or Iceberg's own rewrite API) during off-peak hours to maintain query performance.
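The core of any compaction planner is grouping small files into rewrite groups near a target size. A simplified Python sketch of that idea (a greedy first-fit pass; real implementations such as Iceberg's rewrite_data_files action are considerably more sophisticated):

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group data file sizes (MB) into compaction groups of
    roughly target_mb each, largest files first."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            groups.append(current)           # close the full group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Each group would then be rewritten into a single larger file, reducing the per-query planning and open-file overhead that many small streaming commits create.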
FAQ
Q: How does RisingWave handle backpressure when writing to Iceberg? RisingWave buffers micro-batches internally and applies backpressure upstream when the sink falls behind. The Iceberg sink commits atomically — partial batches are never written, so downstream readers always see consistent snapshots.
Q: Can I query Iceberg tables with RisingWave and Athena simultaneously? Yes. Apache Iceberg is a multi-reader, multi-writer format. You can have Athena running analytical queries on your Iceberg tables while RisingWave simultaneously streams new data into them. Each reader sees a consistent snapshot.
Q: What's the minimum latency I can achieve with this architecture? RisingWave commits to Iceberg on a configurable interval. With default settings, end-to-end latency from Kafka event to Iceberg snapshot is typically under 30 seconds. For most analytics use cases, this is well within acceptable bounds.
Q: Do I need a separate catalog service? For production deployments, yes. A REST catalog like Apache Polaris, Nessie, or Tabular provides consistency guarantees and multi-engine compatibility. For development and testing, you can use a JDBC catalog backed by a simple Postgres database.
Q: Can I use RisingWave + Iceberg for CDC pipelines? Absolutely. This is one of the strongest use cases. RisingWave ingests CDC events from Postgres or MySQL, applies transformations, and sinks the results to Iceberg with upsert semantics — maintaining a complete, up-to-date copy of your operational database in your lakehouse.
Start Building
The RisingWave + Apache Iceberg streaming lakehouse gives you a production-grade architecture with remarkably low operational complexity. You write SQL, RisingWave handles the streaming semantics, and Iceberg provides the reliable, open storage layer.
Get started today with the RisingWave quickstart guide, or join the RisingWave Slack community to connect with engineers who have deployed this architecture in production.

