Real-Time Data Lakehouse: Apache Iceberg + Streaming SQL

Introduction

The data lakehouse promised a single platform for both data warehousing and data lake workloads: structured and unstructured data, SQL analytics and machine learning, all on open file formats. Apache Iceberg delivered on the storage side, giving teams ACID transactions, schema evolution, and time travel on Parquet files in S3. But most lakehouse deployments still rely on batch ingestion. Data arrives in hourly or daily loads, and "real-time" means "the last batch ran 30 minutes ago."

A real-time data lakehouse changes this equation. Instead of scheduled batch jobs that periodically refresh tables, a streaming engine continuously ingests events, maintains incremental materialized views, and commits fresh results to Iceberg tables. Query engines like Trino, Spark, and DuckDB see updated data within seconds of it being produced, not hours.

This article shows you how to build a real-time data lakehouse using Apache Iceberg and streaming SQL with RisingWave. You will learn the architecture, see working SQL examples, and understand what differentiates a streaming lakehouse from a traditional batch-oriented one.

What Makes a Lakehouse "Real-Time"?

A traditional data lakehouse has three layers: ingestion (batch ETL), storage (Iceberg tables on S3), and query (Trino, Spark, or a BI tool). Data moves through these layers in scheduled intervals. The freshness of any query result depends on when the last batch job completed.

A real-time data lakehouse replaces the batch ingestion layer with streaming ingestion and adds a continuous processing layer between sources and storage. This changes the system in three fundamental ways.

Continuous ingestion instead of scheduled loads

In a batch lakehouse, an orchestrator (Airflow, Dagster, or a cron job) triggers a Spark or dbt job every N minutes. The job reads new files or database snapshots, transforms them, and writes Parquet files to S3.

In a streaming lakehouse, a streaming engine maintains persistent connections to data sources. It consumes Kafka topics, CDC streams, and S3 notifications as events arrive. There is no scheduling, no job startup overhead, and no gap between batches.

Incremental materialized views instead of full recomputation

Batch ETL recomputes transformations from scratch on each run, or maintains complex incremental logic with watermarks and merge statements. A streaming engine maintains incremental materialized views that update automatically as new data arrives. When a new order event comes in, only the affected aggregation row is updated, not the entire table.
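
As a minimal sketch of what this looks like in streaming SQL (using the order_events source defined in Step 1 below), the engine keeps one row per shipping country and touches only the affected row when a new event arrives:

-- Each incoming order event updates only the row for its shipping_country;
-- no other groups are recomputed.
CREATE MATERIALIZED VIEW orders_by_country AS
SELECT
    shipping_country,
    COUNT(*) AS order_count,
    SUM(quantity * unit_price) AS gross_revenue
FROM order_events
GROUP BY shipping_country;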

Sub-minute commit latency instead of hourly freshness

Streaming engines commit processed data to Iceberg tables at configurable intervals, typically every 30 to 60 seconds. Query engines see fresh data within a minute of it being produced at the source. This latency is configurable: you can trade commit frequency for write efficiency based on your freshness requirements.

How Does the Streaming Lakehouse Architecture Work?

The reference architecture for a real-time data lakehouse with RisingWave and Apache Iceberg has four layers:

Source layer: Kafka topics (clickstream, application events, IoT telemetry), CDC connectors (PostgreSQL, MySQL), and object storage (S3 file notifications) produce continuous event streams.

Stream processing layer: RisingWave consumes events from all sources, applies SQL transformations (filters, joins, aggregations, window functions), and maintains stateful computations in memory backed by S3. Materialized views define the transformation logic and are updated incrementally.

Storage layer: Iceberg tables on S3 store the processed results. RisingWave's Iceberg sink commits new data through Iceberg's transactional metadata protocol, ensuring exactly-once delivery. The built-in compaction service merges small files automatically.

Query layer: Trino, Spark SQL, DuckDB, StarRocks, or any Iceberg-compatible query engine reads the tables directly. Because Iceberg uses an open format, there is no vendor lock-in at the query layer. Multiple engines can read the same tables concurrently.

This separation of stream processing and query serving is deliberate. RisingWave is optimized for continuous ingestion and stateful transformation. Trino is optimized for ad-hoc analytical queries. Each system does what it does best, and Iceberg tables serve as the interoperability boundary.

How Do You Build a Real-Time Lakehouse With RisingWave and Iceberg?

Let's build a real-time e-commerce analytics lakehouse. You have three data sources: order events from Kafka, plus a product catalog and customer profiles captured from PostgreSQL via CDC. The goal is to maintain continuously updated Iceberg tables for revenue analytics, product performance, and customer segmentation.

Step 1: Define your streaming sources

-- Order events from Kafka
CREATE SOURCE order_events (
    order_id BIGINT,
    customer_id INT,
    product_id INT,
    quantity INT,
    unit_price DECIMAL,
    discount_pct DECIMAL,
    order_time TIMESTAMPTZ,
    payment_method VARCHAR,
    shipping_country VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'ecommerce.orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

-- Product catalog via CDC
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name VARCHAR,
    category VARCHAR,
    subcategory VARCHAR,
    brand VARCHAR,
    cost_price DECIMAL,
    list_price DECIMAL
) WITH (
    connector = 'postgres-cdc',
    hostname = 'prod-db.internal',
    port = '5432',
    username = 'cdc_user',
    password = '${PG_PASSWORD}',
    database.name = 'ecommerce',
    table.name = 'public.products'
);

-- Customer profiles via CDC
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    email VARCHAR,
    signup_date DATE,
    lifetime_value DECIMAL,
    segment VARCHAR,
    country VARCHAR
) WITH (
    connector = 'postgres-cdc',
    hostname = 'prod-db.internal',
    port = '5432',
    username = 'cdc_user',
    password = '${PG_PASSWORD}',
    database.name = 'ecommerce',
    table.name = 'public.customers'
);

Step 2: Create incremental materialized views

Define the analytics transformations as materialized views. RisingWave updates these incrementally as new events arrive.

-- Revenue by product category, updated continuously
CREATE MATERIALIZED VIEW revenue_by_category AS
SELECT
    p.category,
    p.subcategory,
    date_trunc('hour', o.order_time) AS order_hour,
    COUNT(*) AS order_count,
    SUM(o.quantity) AS units_sold,
    SUM(o.quantity * o.unit_price * (1 - o.discount_pct / 100)) AS net_revenue,
    SUM(o.quantity * (o.unit_price - p.cost_price)) AS gross_profit,
    AVG(o.quantity * o.unit_price) AS avg_order_value
FROM order_events o
JOIN products p ON o.product_id = p.product_id
GROUP BY p.category, p.subcategory, date_trunc('hour', o.order_time);

-- Customer segment performance
CREATE MATERIALIZED VIEW segment_metrics AS
SELECT
    c.segment,
    c.country,
    date_trunc('day', o.order_time) AS order_date,
    COUNT(DISTINCT o.customer_id) AS active_customers,
    COUNT(*) AS total_orders,
    SUM(o.quantity * o.unit_price * (1 - o.discount_pct / 100)) AS segment_revenue,
    AVG(o.quantity * o.unit_price * (1 - o.discount_pct / 100)) AS avg_order_value
FROM order_events o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.segment, c.country, date_trunc('day', o.order_time);

-- Top products by revenue (rolling 24-hour window)
CREATE MATERIALIZED VIEW top_products_24h AS
SELECT
    p.product_id,
    p.name AS product_name,
    p.category,
    p.brand,
    COUNT(*) AS order_count,
    SUM(o.quantity) AS units_sold,
    SUM(o.quantity * o.unit_price * (1 - o.discount_pct / 100)) AS revenue_24h
FROM order_events o
JOIN products p ON o.product_id = p.product_id
WHERE o.order_time > now() - INTERVAL '24 hours'
GROUP BY p.product_id, p.name, p.category, p.brand;

These materialized views join Kafka events with CDC tables. When a product's category changes in PostgreSQL, the revenue_by_category view updates automatically. When a new order arrives in Kafka, all three views update incrementally.

Step 3: Sink to Iceberg

Create an Iceberg sink for each materialized view:

CREATE SINK revenue_iceberg FROM revenue_by_category
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'category, subcategory, order_hour',
    database.name = 'analytics',
    table.name = 'revenue_by_category',
    catalog.type = 'glue',
    catalog.name = 'lakehouse_catalog',
    warehouse.path = 's3://company-lakehouse/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    create_table_if_not_exists = 'true'
);

CREATE SINK segment_iceberg FROM segment_metrics
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'segment, country, order_date',
    database.name = 'analytics',
    table.name = 'segment_metrics',
    catalog.type = 'glue',
    catalog.name = 'lakehouse_catalog',
    warehouse.path = 's3://company-lakehouse/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    create_table_if_not_exists = 'true'
);

CREATE SINK top_products_iceberg FROM top_products_24h
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'product_id',
    database.name = 'analytics',
    table.name = 'top_products_24h',
    catalog.type = 'glue',
    catalog.name = 'lakehouse_catalog',
    warehouse.path = 's3://company-lakehouse/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    create_table_if_not_exists = 'true'
);
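
By default, each sink commits to Iceberg at RisingWave's checkpoint cadence. If your freshness requirements allow it, you can trade commit frequency for fewer, larger snapshots with the commit_checkpoint_interval setting (also referenced in the FAQ below). The following is an illustrative variant of the first sink, used instead of it rather than alongside it; the sink name and the value of 5 are assumptions for the example:

-- Illustrative variant: commit to Iceberg every 5 checkpoints instead of
-- every checkpoint, producing fewer, larger snapshots.
CREATE SINK revenue_iceberg_batched FROM revenue_by_category
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'category, subcategory, order_hour',
    database.name = 'analytics',
    table.name = 'revenue_by_category',
    catalog.type = 'glue',
    catalog.name = 'lakehouse_catalog',
    warehouse.path = 's3://company-lakehouse/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    create_table_if_not_exists = 'true',
    commit_checkpoint_interval = 5
);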

Step 4: Query the real-time lakehouse

From Trino, Spark, or any Iceberg-compatible engine:

-- In Trino: Which categories are trending this hour?
SELECT category, subcategory, net_revenue, order_count
FROM iceberg.analytics.revenue_by_category
WHERE order_hour >= date_trunc('hour', current_timestamp)
ORDER BY net_revenue DESC;

-- In Spark SQL: Customer segment performance this week
SELECT segment, country,
    SUM(total_orders) AS weekly_orders,
    SUM(segment_revenue) AS weekly_revenue
FROM iceberg.analytics.segment_metrics
WHERE order_date >= current_date - INTERVAL '7' DAY
GROUP BY segment, country
ORDER BY weekly_revenue DESC;

The data in these tables is typically fresh within a minute of the original event, governed by the sink's commit interval. No batch jobs, no scheduling, no gap between "data produced" and "data queryable."

How Does a Streaming Lakehouse Compare to a Batch Lakehouse?

The differences between batch and streaming lakehouse architectures go beyond latency. They affect cost, complexity, and what kinds of questions you can answer.

| Dimension | Batch lakehouse | Streaming lakehouse |
| --- | --- | --- |
| Data freshness | Minutes to hours | Seconds to one minute |
| Ingestion model | Scheduled jobs (Spark, dbt) | Continuous (streaming engine) |
| Transformation | Full recompute or incremental merge | Incremental materialized views |
| State management | External (checkpoints, watermarks) | Built-in (managed by engine) |
| Small file handling | External compaction jobs | Built-in compaction (with RisingWave) |
| Resource usage | Bursty (peak provisioning) | Steady (continuous processing) |
| Failure recovery | Re-run failed batch | Automatic checkpoint recovery |
| CDC handling | Snapshot + merge | Continuous log-based CDC |
| Operational overhead | Orchestrator + compute cluster | Single streaming engine |
| Multi-source joins | Complex (staging tables) | Native (SQL joins across sources) |

The streaming lakehouse does not eliminate the need for batch processing entirely. Historical backfills, schema migrations, and large-scale reprocessing still benefit from batch engines. The goal is to make streaming the default ingestion path, and use batch only for exceptional scenarios.
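
For example, a historical backfill can still run through a batch engine writing into the same Iceberg table the streaming sink maintains. The sketch below uses Spark SQL with an Iceberg catalog named iceberg; the orders_archive table (a hypothetical denormalized historical export that already carries product cost) and the cutoff date are assumptions, not part of the pipeline above:

-- Hypothetical Spark SQL backfill into the Iceberg table maintained by the
-- streaming sink. MERGE avoids duplicating hours the stream has already written.
MERGE INTO iceberg.analytics.revenue_by_category t
USING (
    SELECT
        category,
        subcategory,
        date_trunc('hour', order_time) AS order_hour,
        COUNT(*) AS order_count,
        SUM(quantity) AS units_sold,
        SUM(quantity * unit_price * (1 - discount_pct / 100)) AS net_revenue,
        SUM(quantity * (unit_price - cost_price)) AS gross_profit,
        AVG(quantity * unit_price) AS avg_order_value
    FROM iceberg.archive.orders_archive          -- hypothetical historical table
    WHERE order_time < TIMESTAMP '2024-01-01 00:00:00'  -- illustrative cutoff
    GROUP BY category, subcategory, date_trunc('hour', order_time)
) s
ON t.category = s.category
   AND t.subcategory = s.subcategory
   AND t.order_hour = s.order_hour
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;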

What Use Cases Benefit Most From a Streaming Lakehouse?

Operational analytics

Dashboards that show current business state (active users, revenue this hour, inventory levels) need sub-minute data freshness. A streaming lakehouse delivers this without the complexity of maintaining a separate real-time serving layer.

Customer 360

Joining real-time behavior (clickstream, app events) with historical profile data (CRM, transaction history) requires a system that can handle both streaming and dimensional joins. Materialized views that join Kafka events with CDC tables produce continuously updated customer profiles in Iceberg.

IoT and telemetry

Sensor data arrives continuously and in high volumes. Streaming ingestion avoids the small-file explosion that batch micro-batching creates, and incremental aggregations reduce the data volume before it hits storage.
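
As a sketch of this pre-aggregation (the sensor_readings source, its columns, and the one-minute window are assumptions for illustration), a tumbling-window view reduces raw readings to one row per device per minute before anything is sunk to Iceberg:

-- Downsample raw telemetry to one row per device per minute
CREATE MATERIALIZED VIEW device_metrics_1min AS
SELECT
    device_id,
    window_start,
    COUNT(*) AS reading_count,
    AVG(temperature) AS avg_temperature,
    MAX(temperature) AS max_temperature
FROM TUMBLE(sensor_readings, reading_time, INTERVAL '1 minute')
GROUP BY device_id, window_start;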

Machine learning feature pipelines

ML models need fresh features for inference. A streaming lakehouse computes features as events arrive, stores them in Iceberg for training, and serves them through query engines for real-time inference. This eliminates the train-serve skew that batch feature computation introduces.
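
A sketch of such a feature view, reusing the order_events and customers definitions from Step 1 (the feature names and the 7-day window are illustrative assumptions):

-- Per-customer behavioral features, updated as orders arrive
CREATE MATERIALIZED VIEW customer_features AS
SELECT
    c.customer_id,
    c.segment,
    COUNT(*) AS orders_7d,
    SUM(o.quantity * o.unit_price * (1 - o.discount_pct / 100)) AS spend_7d,
    AVG(o.quantity) AS avg_items_per_order_7d,
    MAX(o.order_time) AS last_order_at
FROM order_events o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_time > now() - INTERVAL '7 days'
GROUP BY c.customer_id, c.segment;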

FAQ

What is a real-time data lakehouse?

A real-time data lakehouse is a data architecture that combines the open storage format of a lakehouse (typically Apache Iceberg on S3) with continuous streaming ingestion and incremental processing. Unlike traditional batch lakehouses where data refreshes on a schedule, a streaming lakehouse delivers sub-minute data freshness for analytics, dashboards, and machine learning workloads.

How is a streaming lakehouse different from Lambda architecture?

Lambda architecture maintains two separate paths: a batch layer for completeness and a speed layer for freshness. A streaming lakehouse unifies both paths. The streaming engine provides exactly-once guarantees and persistent state, eliminating the need for a separate batch layer to "correct" streaming results. You maintain one pipeline, not two.

Can I query streaming lakehouse tables with existing tools?

Yes. Because the streaming lakehouse stores data in Apache Iceberg format on S3, any Iceberg-compatible query engine can read the tables. This includes Trino, Spark SQL, DuckDB, StarRocks, Snowflake (via Iceberg integration), and BigQuery (via BigLake). The streaming engine (RisingWave) handles ingestion and transformation; your existing query tools handle analytics.

What latency can I expect with a streaming lakehouse?

With RisingWave and Apache Iceberg, end-to-end latency from source event to queryable data is typically 30 to 60 seconds, controlled by the commit_checkpoint_interval setting. The processing itself happens with millisecond latency; the dominant factor is the Iceberg commit interval, which batches writes for efficiency.

Conclusion

The real-time data lakehouse is not a rearchitecture of your entire data stack. It is an upgrade to your ingestion and processing layer that makes your existing Iceberg tables and query engines work with fresh data instead of stale snapshots.

Key takeaways:

  • Replace scheduled batch jobs with streaming ingestion to reduce data latency from hours to seconds.
  • Use incremental materialized views to avoid full recomputation and reduce processing costs.
  • RisingWave provides the SQL streaming layer that connects sources (Kafka, CDC, S3) to Iceberg tables with exactly-once guarantees and built-in compaction.
  • Your existing query engines continue to work since Iceberg's open format ensures interoperability with Trino, Spark, DuckDB, and others.
  • The streaming lakehouse unifies batch and streaming into a single pipeline, eliminating the complexity of Lambda architecture.

If you are already running an Iceberg-based lakehouse, adding RisingWave as a streaming database gives you real-time freshness without changing your storage layer or query tools. Learn more about the RisingWave lakehouse architecture or explore the documentation to get started.


Ready to build a real-time lakehouse? Try RisingWave Cloud free, with no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
