A Brief History of the Lakehouse
To understand where lakehouse architecture stands in 2026, it helps to trace how it got here.
The data lake era (2010s) began when S3 made object storage cheap enough to store everything. Teams dropped raw files (JSON, CSV, Parquet) into S3 and queried them with Hive or Presto. The proposition was attractive: no schema required, any file format, infinite scale, pennies per gigabyte. The problem was that querying raw files was slow, unreliable, and operationally painful. There were no transactions. A failed write left partial data. Schema changes required manual coordination. A Hive partition scan on 100,000 small files took minutes just to plan the query.
The data warehouse era ran in parallel. Snowflake, Google BigQuery, and Amazon Redshift offered fast SQL queries, automatic scaling, and fully managed infrastructure. Teams paid a premium to escape the operational burden of the data lake. But warehouses had their own problems: proprietary formats locked data in, copying data into the warehouse added latency and cost, and warehouse compute was expensive for large-scale scans.
The lakehouse (2020-2023) resolved this tension. Apache Iceberg (and Delta Lake, Hudi) added a metadata layer to object storage that provided ACID transactions, time travel, schema evolution, and partition evolution. Query engines (Trino, Spark, DuckDB) read this metadata to skip irrelevant files, merge delete records, and plan efficient scans. Suddenly, raw Parquet files in S3 could be queried with near-warehouse performance, at data-lake cost, using any engine.
The real-time lakehouse (2024-present) added the streaming dimension. A well-built Iceberg lakehouse in 2023 was a reliable, open, query-engine-independent platform for analytics. But it was still fundamentally a batch system. Data arrived in hourly or daily loads. The question "what happened in the last 5 minutes" could not be answered.
As AI agents, operational dashboards, and event-driven applications demanded fresher data, teams added streaming layers. The real-time lakehouse of 2026 is not a new product category -- it is the combination of the 2023 lakehouse with a streaming SQL engine sitting in front of it.
The Freshness Problem the Lakehouse Did Not Solve
Even a perfectly built Iceberg lakehouse with optimized Spark jobs and a REST Catalog reflects data from 15 to 60 minutes ago. This is not a failure of Iceberg -- it is inherent to batch architecture. A Spark job must be triggered, a cluster must start, the job must plan and execute, and the commit must complete before query engines see fresh data.
For many analytical workloads, this is fine. A weekly revenue report does not need second-level freshness. A monthly cohort analysis is unaffected by 30-minute lag.
But a growing set of use cases broke the assumption that batch freshness was acceptable:
- Fraud detection that needs to act on a transaction before it clears
- Inventory systems where a product sold out in the last 30 seconds
- Recommendation engines that personalize based on the current session
- Operational dashboards where a metric spike needs to trigger an alert within seconds
- AI agents that query data as part of a live reasoning loop
These use cases required a hot tier -- a system that reflects the state of the world within seconds, not minutes. And that hot tier needed to coexist with the lakehouse, not replace it.
The Three-Tier Architecture of the 2026 Lakehouse
The mature lakehouse architecture in 2026 separates data into three tiers based on freshness, query pattern, and cost profile.
Hot Tier: Real-time serving with RisingWave
The hot tier handles sub-second queries against continuously updated data. RisingWave is the primary choice here: it ingests from Kafka, Kinesis, Pulsar, and native CDC sources, processes events continuously using SQL, and maintains incremental materialized views that update in milliseconds as new events arrive.
RisingWave exposes a PostgreSQL-compatible wire protocol. Any application using a PostgreSQL client library connects directly and queries materialized views as if they were regular database tables. The results are always current.
The hot tier is appropriate for:
- Operational dashboards that display metrics updated every few seconds
- Application APIs that need current inventory counts, balance totals, or session state
- AI agent data access where the agent queries the database as part of a reasoning step
- Fraud and anomaly detection where decisions must be made before a transaction completes
Redis is sometimes used in this tier for key-value serving at extremely high throughput, but it does not run streaming SQL or maintain incremental aggregations. Redis and RisingWave serve different access patterns: RisingWave for SQL queries over streaming state, Redis for high-throughput key lookup on pre-computed values.
Warm Tier: Recent history on Iceberg
The warm tier stores processed results from the hot tier as Iceberg tables, queryable by analytical engines. RisingWave writes to Iceberg via its sink connector every 30-60 seconds. Trino, DuckDB, AWS Athena, and BigQuery query these tables for analysis spanning hours, days, or weeks.
The warm tier is appropriate for:
- Analytical queries that aggregate across hours or days of data
- BI tool queries (Metabase, Superset, Looker) that run on a schedule or on user demand
- Data science notebooks running exploratory analysis
- Cross-table joins that would be too expensive to maintain as hot-tier materialized views
Trino is the most commonly deployed query engine for the warm tier in self-managed deployments. DuckDB has become popular for single-machine or notebook-based analysis on smaller datasets (under a few terabytes). AWS Athena provides a serverless option for AWS deployments. BigQuery supports external Iceberg tables for teams already on Google Cloud.
Cold Tier: Compressed long-term history
The cold tier holds historical data going back months or years. Storage is optimized for cost: aggressive Zstd compression on Parquet files, lifecycle policies that move data to cheaper S3 storage classes (S3 Glacier Instant Retrieval for data accessed occasionally, S3 Glacier Deep Archive for data accessed rarely), and coarse-grained partitioning that suits the time-range queries typical of historical analysis.
The cold tier is the same Iceberg table format as the warm tier -- the distinction is operational. Warm-tier data is in standard S3 with frequent compaction runs. Cold-tier data has been compacted into large files, compressed aggressively, and transitioned to cheaper storage.
Iceberg's partition evolution makes this transition seamless. You can change the partitioning strategy of an Iceberg table (for example, from hour-level to month-level partitioning for data older than 90 days) without rewriting existing data.
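As a concrete sketch of that transition (assuming a Spark SQL session with the Iceberg extensions loaded, an Iceberg catalog named analytics, and the hourly_order_volume table introduced later in this article, originally partitioned by hours(window_start)), both the compression settings and the partition spec are plain DDL changes:
-- Switch new Parquet writes to aggressive Zstd compression
ALTER TABLE analytics.revenue.hourly_order_volume SET TBLPROPERTIES (
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '12'
);
-- Coarsen partitioning from hourly to monthly; existing files keep their
-- original layout and only newly written data follows the new spec
ALTER TABLE analytics.revenue.hourly_order_volume
  REPLACE PARTITION FIELD hours(window_start) WITH months(window_start);
Old files stay in the old layout and queries plan across both specs transparently, which is why the transition does not require an expensive one-shot rewrite.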
Tool Selection at Each Tier
The following table summarizes tool choices at each tier, with honest notes on when each applies.
Hot tier:
| Tool | Role | When to use |
| --- | --- | --- |
| RisingWave | Streaming SQL, materialized views, CDC ingestion | Primary hot-tier engine for SQL-based streaming |
| Kafka / Kinesis / Pulsar | Event streaming backbone | When you have event-producing services |
| Redis | High-throughput key-value cache | Pre-computed lookups at very high QPS, not for SQL aggregations |
Warm tier:
| Tool | Role | When to use |
| --- | --- | --- |
| Apache Iceberg | Table format (metadata + Parquet files) | Always -- the foundation of the warm tier |
| Trino | Distributed SQL query engine | Self-managed, large-scale analytical queries |
| DuckDB | In-process analytical engine | Single-machine analysis, notebooks, smaller datasets |
| AWS Athena | Serverless Iceberg query | AWS deployments that want no query infrastructure |
| BigQuery | Managed query engine (external Iceberg tables) | GCP deployments |
| Snowflake | Managed data warehouse (reads Iceberg) | Teams with existing Snowflake contracts |
Cold tier:
| Tool | Role | When to use |
| --- | --- | --- |
| Apache Iceberg | Table format (same as warm tier) | Always |
| Parquet + Zstd | File format and compression | Default for analytical workloads |
| S3 Intelligent-Tiering / Glacier | Object storage tier | Cost optimization for infrequently accessed data |
| PyIceberg / Spark | Compaction and maintenance | Required to merge small files and expire snapshots |
The Data Flow
The data flow through all three tiers:
Operational Systems (PostgreSQL, MySQL, MongoDB, SQL Server)
| CDC (native RisingWave connectors)
v
Event Streams (Kafka, Kinesis, Pulsar)
| streaming consumption
v
RisingWave (hot tier)
| incremental materialized views
| PostgreSQL wire protocol -> applications, dashboards, AI agents
| Iceberg sink (every 30-60 seconds)
v
Apache Iceberg on S3 (warm tier)
| Trino / DuckDB / Athena / BigQuery queries
| compaction (Spark / PyIceberg, scheduled)
| snapshot expiration (scheduled)
| lifecycle transition (S3 policies)
v
Apache Iceberg on S3 Glacier (cold tier)
| long-term historical queries (Trino, Athena)
Each arrow represents a continuous or scheduled process. The CDC connection and Kafka consumption are persistent (no scheduling). The Iceberg sink writes continuously. Compaction, snapshot expiration, and storage lifecycle transitions are scheduled maintenance jobs.
Setting Up the Flow from RisingWave to Iceberg
Here is a complete example that builds the hot-to-warm data flow for an e-commerce platform.
Define sources
-- Kafka source for order events
CREATE SOURCE order_events (
order_id BIGINT,
customer_id INT,
merchant_id INT,
product_id INT,
amount DECIMAL,
currency VARCHAR,
order_time TIMESTAMPTZ,
status VARCHAR
) WITH (
connector = 'kafka',
topic = 'orders.v1',
properties.bootstrap.server = 'kafka:9092',
scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
-- PostgreSQL CDC for merchant data
CREATE TABLE merchants (
merchant_id INT PRIMARY KEY,
merchant_name VARCHAR,
country VARCHAR,
tier VARCHAR
) WITH (
connector = 'postgres-cdc',
hostname = 'postgres.internal',
port = '5432',
username = 'replicator',
password = '${SECRET}',
database.name = 'commerce',
schema.name = 'public',
table.name = 'merchants'
);
Build real-time materialized views (hot tier)
-- Per-merchant daily revenue, updated continuously
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
DATE_TRUNC('day', order_time) AS day,
merchant_id,
SUM(amount) AS total_revenue,
COUNT(*) AS order_count,
AVG(amount) AS avg_order_value
FROM order_events
WHERE status = 'completed'
GROUP BY DATE_TRUNC('day', order_time), merchant_id;
-- Per-merchant hourly order volume using tumbling windows
CREATE MATERIALIZED VIEW hourly_order_volume AS
SELECT
window_start,
window_end,
o.merchant_id,
m.merchant_name,
m.country,
COUNT(*) AS order_count,
SUM(o.amount) AS revenue
FROM TUMBLE(order_events, order_time, INTERVAL '1 hour') o
JOIN merchants m ON o.merchant_id = m.merchant_id
WHERE o.status = 'completed'
GROUP BY window_start, window_end, o.merchant_id, m.merchant_name, m.country;
Applications and dashboards query these materialized views directly via PostgreSQL wire protocol. The results reflect events processed within the last few seconds.
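A minimal sketch of what that looks like from the application side -- any PostgreSQL client pointed at RisingWave's frontend (port 4566 by default) can issue ordinary SQL against the view:
-- Today's top merchants by revenue, read straight from the hot tier
SELECT merchant_id, total_revenue, order_count, avg_order_value
FROM daily_revenue
WHERE day = DATE_TRUNC('day', NOW())
ORDER BY total_revenue DESC
LIMIT 10;
The same query run a few seconds later reflects any newly completed orders, with no refresh step in between.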
Sink to Iceberg (warm tier)
-- Sink daily revenue to Iceberg for historical analysis
CREATE SINK daily_revenue_iceberg FROM daily_revenue
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'day,merchant_id',
catalog.type = 'storage',
warehouse.path = 's3://analytics/warehouse',
s3.region = 'us-east-1',
database.name = 'revenue',
table.name = 'daily_revenue'
);
-- Sink hourly window results as append-only
CREATE SINK hourly_volume_iceberg FROM hourly_order_volume
WITH (
connector = 'iceberg',
type = 'append-only',
catalog.type = 'storage',
warehouse.path = 's3://analytics/warehouse',
s3.region = 'us-east-1',
database.name = 'revenue',
table.name = 'hourly_order_volume'
);
Query the warm tier with Trino
Once RisingWave has committed data to Iceberg, Trino can query it:
-- Trino: last 7 days of revenue by merchant country
SELECT
  country,
  SUM(revenue) AS revenue_7d,
  SUM(order_count) AS orders_7d,
  SUM(revenue) / SUM(order_count) AS avg_order_7d
FROM revenue.hourly_order_volume
WHERE window_start >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY country
ORDER BY revenue_7d DESC;
And with DuckDB directly on S3:
-- DuckDB: install and load the Iceberg and httpfs (S3) extensions
INSTALL iceberg;
INSTALL httpfs;
LOAD iceberg;
LOAD httpfs;
-- S3 credentials must be configured (for example via CREATE SECRET or
-- environment variables) before scanning s3:// paths
-- Query the Iceberg table directly from S3
SELECT
  day,
  SUM(total_revenue) AS daily_total
FROM iceberg_scan('s3://analytics/warehouse/revenue/daily_revenue/')
WHERE day >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
Governance in 2026: Catalog and Access Control
A multi-engine lakehouse -- where RisingWave, Trino, DuckDB, and Spark all read the same Iceberg tables -- requires centralized governance. In 2026, this means a REST Catalog with role-based access control.
Apache Polaris (the open-source REST Catalog) provides namespace management, table-level permissions, and credential vending (generating short-lived cloud credentials scoped to specific tables). A data engineer granted read access to revenue.* cannot see tables in customer_profiles.*. The catalog enforces this at the credential level.
Connecting RisingWave to a REST Catalog:
-- REST Catalog sink
CREATE SINK daily_revenue_polaris FROM daily_revenue
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'day,merchant_id',
catalog.type = 'rest',
catalog.uri = 'https://polaris.your-company.com/api/catalog',
catalog.credential = 'client-id:client-secret',
warehouse = 'analytics',
database.name = 'revenue',
table.name = 'daily_revenue'
);
With a REST Catalog in place, every engine (Trino, DuckDB, Spark, Snowflake) connects to the same catalog endpoint and receives credentials scoped to the tables they are authorized to access. Table discovery, schema information, and access control are consistent across all engines.
Unity Catalog (Databricks) provides similar functionality and implements the REST Catalog spec, meaning non-Databricks engines can read Unity-catalogued Iceberg tables. This is particularly useful for teams that use Databricks for some workloads but want to serve data from Trino or DuckDB.
Nessie adds a branching layer on top of the catalog. Teams can create a branch, test a new version of a pipeline, validate output, and merge to main -- exactly like git branching but for data tables. This is valuable during schema migrations, significant pipeline changes, or ML feature engineering experiments where you want to preview output before it becomes visible to consumers.
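A minimal sketch of that workflow, assuming Spark is configured with Nessie's SQL extensions and a Nessie-backed catalog named nessie (exact commands vary by Nessie version):
-- Create an isolated branch of the catalog state from main
CREATE BRANCH IF NOT EXISTS pipeline_v2 IN nessie FROM main;
-- Point this session at the branch; writes land only on pipeline_v2
USE REFERENCE pipeline_v2 IN nessie;
-- ... run the modified pipeline and validate its output tables ...
-- Publish the validated state to consumers reading main
MERGE BRANCH pipeline_v2 INTO main IN nessie;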
What Is Still Hard
Honest assessment of the remaining challenges in 2026 real-time lakehouse deployments:
Compaction management
Streaming writes produce many small Parquet files. Without compaction, query performance degrades over time as engines scan thousands of tiny files. Compaction is not automatic in most open-source Iceberg deployments -- you must schedule it.
A basic Spark compaction job:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("iceberg-compaction") \
.config("spark.sql.extensions",
"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.config("spark.sql.catalog.analytics", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.analytics.type", "rest") \
.config("spark.sql.catalog.analytics.uri", "https://polaris.your-company.com/api/catalog") \
.getOrCreate()
# Compact files in the daily_revenue table
spark.sql("""
CALL analytics.system.rewrite_data_files(
table => 'revenue.daily_revenue',
strategy => 'binpack',
options => map(
'target-file-size-bytes', '268435456',
'min-input-files', '5'
)
)
""")
# Expire old snapshots
spark.sql("""
CALL analytics.system.expire_snapshots(
table => 'revenue.daily_revenue',
older_than => TIMESTAMP '2026-04-01 00:00:00',
retain_last => 10
)
""")
You need this job running on a schedule for every actively written table. Managed services (Tabular, Snowflake-managed Iceberg, AWS Lake Formation) handle this automatically, which is a real operational advantage over fully self-managed deployments.
Exactly-once semantics end-to-end
Exactly-once delivery from a Kafka event to an Iceberg table requires the streaming engine to implement two-phase commit or an equivalent coordination mechanism. RisingWave implements exactly-once Iceberg sinks using Iceberg's transactional commit protocol: it writes data files to a staging location and commits them atomically, ensuring that a failure during a commit does not result in partial data appearing in the table.
However, exactly-once semantics at the source are separate. If your Kafka source does not provide stable offsets or your CDC connector restarts, you may see duplicate events processed. RisingWave's checkpointing handles recovery from its own failures, but you should verify the behavior of each source connector in your deployment.
Schema evolution coordination
When an upstream schema changes -- a new column added to a PostgreSQL table, a Kafka message format updated -- the change must propagate through the streaming pipeline to the Iceberg table without breaking downstream consumers.
In practice this requires:
- Evolving the Iceberg table schema first (so existing data files remain valid)
- Updating the RisingWave source definition to include the new field
- Updating any materialized views that reference the field
- Notifying downstream consumers (Trino queries, BI tools) that the schema changed
This coordination is manual in most deployments. Data contracts and schema registries (Confluent Schema Registry, AWS Glue Schema Registry) help by enforcing that producers register schema changes before publishing. But the operational workflow of propagating a schema change across a live pipeline requires attention.
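A sketch of the ordering for the first two steps, using a hypothetical discount field (the Iceberg DDL runs through Spark SQL; whether ALTER SOURCE ... ADD COLUMN is available depends on your RisingWave version and source type, so verify before relying on it):
-- 1. Evolve the Iceberg table first so existing data files stay readable
ALTER TABLE analytics.revenue.daily_revenue ADD COLUMN discount_total DECIMAL(10, 2);
-- 2. Then add the field to the RisingWave source definition
ALTER SOURCE order_events ADD COLUMN discount DECIMAL;
-- 3. Materialized views that should expose the field are recreated (or new
--    views created alongside), and downstream consumers are notified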
The Cost Model
How does a real-time lakehouse compare in cost to traditional approaches?
vs. Traditional data warehouse (Snowflake, BigQuery):
A dedicated data warehouse charges for compute (query processing) and storage separately. Iceberg on S3 costs a fraction of warehouse storage per gigabyte. Query compute with Trino or Athena is pay-per-query (Athena) or based on cluster size (Trino). For teams with mixed query patterns -- some heavy scans, some light ad-hoc queries -- Iceberg with Trino is typically cheaper than a dedicated warehouse at scale.
The real-time layer (RisingWave) adds compute cost for continuous processing, but replaces some of the operational overhead of running batch ETL jobs (Airflow, Spark clusters for ETL).
vs. Traditional data lake with batch ETL:
Batch ETL on Spark or dbt requires clusters sized for peak load, running on a schedule. A streaming engine like RisingWave processes events continuously at lower per-event cost and eliminates the scheduling overhead. The primary new cost is the streaming compute (RisingWave nodes or managed RisingWave Cloud).
vs. Fully managed streaming + warehouse (Confluent + Snowflake):
Fully managed services reduce operational burden but increase per-unit cost. The open-source lakehouse stack (RisingWave + Iceberg + Trino) requires more operational expertise but significantly lower licensing costs for teams at scale.
The right choice depends on team size, operational maturity, and data volume. For teams under 10 engineers or under 1 TB/day of data, managed services often make sense. For larger teams with data platform engineers, the open-source stack is worth the operational investment.
What a 2026 Lakehouse Does Not Need
It is worth noting what the 2026 lakehouse architecture does not require, given some confusion about necessary components.
You do not need a separate data warehouse. If your query engines (Trino, DuckDB) are fast enough for your analytical workloads on Iceberg, you do not need to copy data into Snowflake or BigQuery. Many teams run both (Iceberg + Snowflake), paying for two copies of data. If your Iceberg queries meet your performance requirements, that second copy is unnecessary.
You do not need a separate real-time OLAP database. ClickHouse, Apache Druid, and Pinot are excellent systems, but they require another data copy and another system to operate. RisingWave's materialized views handle most real-time analytical query patterns with sub-second latency and without a separate system.
You do not need a message queue between every component. Some architectures pass data through Kafka between every system. For streaming-to-Iceberg, RisingWave reads from Kafka (or CDC directly) and writes to Iceberg without an intermediate queue.
Summary
The data lakehouse architecture in 2026 is not a single product. It is a set of interoperating open components: a streaming SQL engine for the hot tier, Apache Iceberg as the table format for the warm and cold tiers, query engines for analytical access, and a REST Catalog for governance.
RisingWave fills the hot-tier role with PostgreSQL-compatible streaming SQL, native CDC connectors, and an Iceberg sink that continuously feeds the warm tier. Trino and DuckDB provide fast analytical queries on Iceberg without a dedicated warehouse. Polaris or Unity Catalog provides governance across all engines.
The honest part: you will need to operate compaction jobs, manage snapshot expiration, coordinate schema evolution, and think carefully about exactly-once guarantees at each integration point. These are not difficult problems, but they are real ones. The architecture that handles them well is not the one with the fewest moving parts -- it is the one where each component does one thing reliably and the integrations between them are clearly understood.
For most teams building in 2026, the combination of RisingWave for streaming, Iceberg for storage, and Trino or DuckDB for analytics covers the majority of data access patterns at a fraction of the cost of a fully managed stack, with full control over the data and no vendor lock-in at any layer.

