A Brief History of the Lakehouse
To understand where lakehouse architecture stands in 2026, it helps to trace how it got here.
The data lake era (2010s) began when S3 made object storage cheap enough to store everything. Teams dropped raw files (JSON, CSV, Parquet) into S3 and queried them with Hive or Presto. The proposition was attractive: no schema required, any file format, infinite scale, pennies per gigabyte. The problem was that querying raw files was slow, unreliable, and operationally painful. There were no transactions. A failed write left partial data. Schema changes required manual coordination. A Hive partition scan on 100,000 small files took minutes just to plan the query.
The data warehouse era ran in parallel. Snowflake, Google BigQuery, and Amazon Redshift offered fast SQL queries, automatic scaling, and fully managed infrastructure. Teams paid a premium to escape the operational burden of the data lake. But warehouses had their own problems: proprietary formats locked data in, copying data into the warehouse added latency and cost, and warehouse compute was expensive for large-scale scans.
The lakehouse (2020-2023) resolved this tension. Apache Iceberg (and Delta Lake, Hudi) added a metadata layer to object storage that provided ACID transactions, time travel, schema evolution, and partition evolution. Query engines (Trino, Spark, DuckDB) read this metadata to skip irrelevant files, merge delete records, and plan efficient scans. Suddenly, raw Parquet files in S3 could be queried with near-warehouse performance, at data-lake cost, using any engine.
The real-time lakehouse (2024-present) added the streaming dimension. A well-built Iceberg lakehouse in 2023 was a reliable, open, query-engine-independent platform for analytics. But it was still fundamentally a batch system. Data arrived in hourly or daily loads. The question "what happened in the last 5 minutes" could not be answered.
As AI agents, operational dashboards, and event-driven applications demanded fresher data, teams added streaming layers. The real-time lakehouse of 2026 is not a new product category -- it is the combination of the 2023 lakehouse with a streaming SQL engine sitting in front of it.
The Freshness Problem the Lakehouse Did Not Solve
Even a perfectly built Iceberg lakehouse with optimized Spark jobs and a REST Catalog reflects data from 15 to 60 minutes ago. This is not a failure of Iceberg -- it is inherent to batch architecture. A Spark job must be triggered, a cluster must start, the job must plan and execute, and the commit must complete before query engines see fresh data.
For many analytical workloads, this is fine. A weekly revenue report does not need second-level freshness. A monthly cohort analysis is unaffected by 30-minute lag.
But a growing set of use cases broke the assumption that batch freshness was acceptable:
- Fraud detection that needs to act on a transaction before it clears
- Inventory systems where a product sold out in the last 30 seconds
- Recommendation engines that personalize based on the current session
- Operational dashboards where a metric spike needs to trigger an alert within seconds
- AI agents that query data as part of a live reasoning loop
These use cases required a hot tier -- a system that reflects the state of the world within seconds, not minutes. And that hot tier needed to coexist with the lakehouse, not replace it.
The Three-Tier Architecture of the 2026 Lakehouse
The mature lakehouse architecture in 2026 separates data into three tiers based on freshness, query pattern, and cost profile.
Hot Tier: Real-time serving with RisingWave
The hot tier handles sub-second queries against continuously updated data. RisingWave is the primary choice here: it ingests from Kafka, Kinesis, Pulsar, and native CDC sources, processes events continuously using SQL, and maintains incremental materialized views that update in milliseconds as new events arrive.
RisingWave exposes a PostgreSQL-compatible wire protocol. Any application using a PostgreSQL client library connects directly and queries materialized views as if they were regular database tables. The results are always current.
The hot tier is appropriate for:
- Operational dashboards that display metrics updated every few seconds
- Application APIs that need current inventory counts, balance totals, or session state
- AI agent data access where the agent queries the database as part of a reasoning step
- Fraud and anomaly detection where decisions must be made before a transaction completes
Redis is sometimes used in this tier for key-value serving at extremely high throughput, but it does not run streaming SQL or maintain incremental aggregations. Redis and RisingWave serve different access patterns: RisingWave for SQL queries over streaming state, Redis for high-throughput key lookup on pre-computed values.
Warm Tier: Recent history on Iceberg
The warm tier stores processed results from the hot tier as Iceberg tables, queryable by analytical engines. RisingWave writes to Iceberg via its sink connector every 30-60 seconds. Trino, DuckDB, AWS Athena, and BigQuery query these tables for analysis spanning hours, days, or weeks.
The warm tier is appropriate for:
- Analytical queries that aggregate across hours or days of data
- BI tool queries (Metabase, Superset, Looker) that run on a schedule or on user demand
- Data science notebooks running exploratory analysis
- Cross-table joins that would be too expensive to maintain as hot-tier materialized views
Trino is the most commonly deployed query engine for the warm tier in self-managed deployments. DuckDB has become popular for single-machine or notebook-based analysis on smaller datasets (under a few terabytes). AWS Athena provides a serverless option for AWS deployments. BigQuery supports external Iceberg tables for teams already on Google Cloud.
Cold Tier: Compressed long-term history
The cold tier holds historical data going back months or years. Storage is optimized for cost: aggressive Zstd compression on Parquet files, lifecycle policies that move data to cheaper S3 storage classes (S3 Glacier Instant Retrieval for data accessed occasionally, S3 Glacier Deep Archive for data accessed rarely), and coarse-grained partitioning that suits the time-range queries typical of historical analysis.
The cold tier is the same Iceberg table format as the warm tier -- the distinction is operational. Warm-tier data is in standard S3 with frequent compaction runs. Cold-tier data has been compacted into large files, compressed aggressively, and transitioned to cheaper storage.
Iceberg's partition evolution makes this transition seamless. You can change the partitioning strategy of an Iceberg table (for example, from hour-level to month-level partitioning for data older than 90 days) without rewriting existing data.
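As a concrete sketch of that transition (assuming a Spark SQL session with the Iceberg extensions loaded, an Iceberg catalog named analytics, and the hourly_order_volume table introduced later in this article, originally partitioned by hours(window_start)), both the compression settings and the partition spec are plain DDL changes:
-- Switch new Parquet writes to aggressive Zstd compression
ALTER TABLE analytics.revenue.hourly_order_volume SET TBLPROPERTIES (
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '12'
);
-- Coarsen partitioning from hourly to monthly; existing files keep their
-- original layout and only newly written data follows the new spec
ALTER TABLE analytics.revenue.hourly_order_volume
  REPLACE PARTITION FIELD hours(window_start) WITH months(window_start);
Old files stay in the old layout and queries plan across both specs transparently, which is why the transition does not require an expensive one-shot rewrite.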
Tool Selection at Each Tier
The following table summarizes tool choices at each tier, with honest notes on when each applies.
Hot tier:
| Tool | Role | When to use |
| --- | --- | --- |
| RisingWave | Streaming SQL, materialized views, CDC ingestion | Primary hot-tier engine for SQL-based streaming |
| Kafka / Kinesis / Pulsar | Event streaming backbone | When you have event-producing services |
| Redis | High-throughput key-value cache | Pre-computed lookups at very high QPS, not for SQL aggregations |
Warm tier:
| Tool | Role | When to use |
| --- | --- | --- |
| Apache Iceberg | Table format (metadata + Parquet files) | Always -- the foundation of the warm tier |
| Trino | Distributed SQL query engine | Self-managed, large-scale analytical queries |
| DuckDB | In-process analytical engine | Single-machine analysis, notebooks, smaller datasets |
| AWS Athena | Serverless Iceberg query | AWS deployments that want no query infrastructure |
| BigQuery | Managed query engine (external Iceberg tables) | GCP deployments |
| Snowflake | Managed data warehouse (reads Iceberg) | Teams with existing Snowflake contracts |
Cold tier:
| Tool | Role | When to use |
| --- | --- | --- |
| Apache Iceberg | Table format (same as warm tier) | Always |
| Parquet + Zstd | File format and compression | Default for analytical workloads |
| S3 Intelligent-Tiering / Glacier | Object storage tier | Cost optimization for infrequently accessed data |
| PyIceberg / Spark | Compaction and maintenance | Required to merge small files and expire snapshots |
The Data Flow
The data flow through all three tiers:
Operational Systems (PostgreSQL, MySQL, MongoDB, SQL Server)
| CDC (native RisingWave connectors)
v
Event Streams (Kafka, Kinesis, Pulsar)
| streaming consumption
v
RisingWave (hot tier)
| incremental materialized views
| PostgreSQL wire protocol -> applications, dashboards, AI agents
| Iceberg sink (every 30-60 seconds)
v
Apache Iceberg on S3 (warm tier)
| Trino / DuckDB / Athena / BigQuery queries
| compaction (Spark / PyIceberg, scheduled)
| snapshot expiration (scheduled)
| lifecycle transition (S3 policies)
v
Apache Iceberg on S3 Glacier (cold tier)
| long-term historical queries (Trino, Athena)
Each arrow represents a continuous or scheduled process. The CDC connection and Kafka consumption are persistent (no scheduling). The Iceberg sink writes continuously. Compaction, snapshot expiration, and storage lifecycle transitions are scheduled maintenance jobs.
Setting Up the Flow from RisingWave to Iceberg
Here is a complete example that builds the hot-to-warm data flow for an e-commerce platform.
Define sources
-- Kafka source for order events
CREATE SOURCE order_events (
order_id BIGINT,
customer_id INT,
merchant_id INT,
product_id INT,
amount DECIMAL,
currency VARCHAR,
order_time TIMESTAMPTZ,
status VARCHAR
) WITH (
connector = 'kafka',
topic = 'orders.v1',
properties.bootstrap.server = 'kafka:9092',
scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
-- PostgreSQL CDC for merchant data
CREATE TABLE merchants (
merchant_id INT PRIMARY KEY,
merchant_name VARCHAR,
country VARCHAR,
tier VARCHAR
) WITH (
connector = 'postgres-cdc',
hostname = 'postgres.internal',
port = '5432',
username = 'replicator',
password = '${SECRET}',
database.name = 'commerce',
schema.name = 'public',
table.name = 'merchants'
);
Build real-time materialized views (hot tier)
-- Per-merchant daily revenue, updated continuously
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
DATE_TRUNC('day', order_time) AS day,
merchant_id,
SUM(amount) AS total_revenue,
COUNT(*) AS order_count,
AVG(amount) AS avg_order_value
FROM order_events
WHERE status = 'completed'
GROUP BY DATE_TRUNC('day', order_time), merchant_id;
-- Per-merchant hourly order volume using tumbling windows
CREATE MATERIALIZED VIEW hourly_order_volume AS
SELECT
window_start,
window_end,
o.merchant_id,
m.merchant_name,
m.country,
COUNT(*) AS order_count,
SUM(o.amount) AS revenue
FROM TUMBLE(order_events, order_time, INTERVAL '1 hour') o
JOIN merchants m ON o.merchant_id = m.merchant_id
WHERE o.status = 'completed'
GROUP BY window_start, window_end, o.merchant_id, m.merchant_name, m.country;
Applications and dashboards query these materialized views directly via PostgreSQL wire protocol. The results reflect events processed within the last few seconds.
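A minimal sketch of what that looks like from the application side -- any PostgreSQL client pointed at RisingWave's frontend (port 4566 by default) can issue ordinary SQL against the view:
-- Today's top merchants by revenue, read straight from the hot tier
SELECT merchant_id, total_revenue, order_count, avg_order_value
FROM daily_revenue
WHERE day = DATE_TRUNC('day', NOW())
ORDER BY total_revenue DESC
LIMIT 10;
The same query run a few seconds later reflects any newly completed orders, with no refresh step in between.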
Sink to Iceberg (warm tier)
-- Sink daily revenue to Iceberg for historical analysis
CREATE SINK daily_revenue_iceberg FROM daily_revenue
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'day,merchant_id',
catalog.type = 'storage',
warehouse.path = 's3://analytics/warehouse',
s3.region = 'us-east-1',
database.name = 'revenue',
table.name = 'daily_revenue'
);
-- Sink hourly window results as append-only
CREATE SINK hourly_volume_iceberg FROM hourly_order_volume
WITH (
connector = 'iceberg',
type = 'append-only',
catalog.type = 'storage',
warehouse.path = 's3://analytics/warehouse',
s3.region = 'us-east-1',
database.name = 'revenue',
table.name = 'hourly_order_volume'
);
Query the warm tier with Trino
Once RisingWave has committed data to Iceberg, Trino can query it:
-- Trino: last 7 days of revenue by merchant country
SELECT
  country,
  SUM(revenue) AS revenue_7d,
  SUM(order_count) AS orders_7d,
  SUM(revenue) / SUM(order_count) AS avg_order_7d
FROM revenue.hourly_order_volume
WHERE window_start >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY country
ORDER BY revenue_7d DESC;
And with DuckDB directly on S3:
-- DuckDB: install and load the Iceberg and httpfs (S3) extensions
INSTALL iceberg;
INSTALL httpfs;
LOAD iceberg;
LOAD httpfs;
-- S3 credentials must be configured (for example via CREATE SECRET or
-- environment variables) before scanning s3:// paths
-- Query the Iceberg table directly from S3
SELECT
  day,
  SUM(total_revenue) AS daily_total
FROM iceberg_scan('s3://analytics/warehouse/revenue/daily_revenue/')
WHERE day >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
Governance in 2026: Catalog and Access Control
A multi-engine lakehouse -- where RisingWave, Trino, DuckDB, and Spark all read the same Iceberg tables -- requires centralized governance. In 2026, this means a REST Catalog with role-based access control.
Apache Polaris (the open-source REST Catalog) provides namespace management, table-level permissions, and credential vending (generating short-lived cloud credentials scoped to specific tables). A data engineer granted read access to revenue.* cannot see tables in customer_profiles.*. The catalog enforces this at the credential level.
Connecting RisingWave to a REST Catalog:
-- REST Catalog sink
CREATE SINK daily_revenue_polaris FROM daily_revenue
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'day,merchant_id',
catalog.type = 'rest',
catalog.uri = 'https://polaris.your-company.com/api/catalog',
catalog.credential = 'client-id:client-secret',
warehouse = 'analytics',
database.name = 'revenue',
table.name = 'daily_revenue'
);
With a REST Catalog in place, every engine (Trino, DuckDB, Spark, Snowflake) connects to the same catalog endpoint and receives credentials scoped to the tables they are authorized to access. Table discovery, schema information, and access control are consistent across all engines.
Unity Catalog (Databricks) provides similar functionality and implements the REST Catalog spec, meaning non-Databricks engines can read Unity-catalogued Iceberg tables. This is particularly useful for teams that use Databricks for some workloads but want to serve data from Trino or DuckDB.
Nessie adds a branching layer on top of the catalog. Teams can create a branch, test a new version of a pipeline, validate output, and merge to main -- exactly like git branching but for data tables. This is valuable during schema migrations, significant pipeline changes, or ML feature engineering experiments where you want to preview output before it becomes visible to consumers.
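A minimal sketch of that workflow, assuming Spark is configured with Nessie's SQL extensions and a Nessie-backed catalog named nessie (exact commands vary by Nessie version):
-- Create an isolated branch of the catalog state from main
CREATE BRANCH IF NOT EXISTS pipeline_v2 IN nessie FROM main;
-- Point this session at the branch; writes land only on pipeline_v2
USE REFERENCE pipeline_v2 IN nessie;
-- ... run the modified pipeline and validate its output tables ...
-- Publish the validated state to consumers reading main
MERGE BRANCH pipeline_v2 INTO main IN nessie;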
What Is Still Hard
Honest assessment of the remaining challenges in 2026 real-time lakehouse deployments:
Compaction management
Streaming writes produce many small Parquet files. Without compaction, query performance degrades over time as engines scan thousands of tiny files. Compaction is not automatic in most open-source Iceberg deployments -- you must schedule it.
A basic Spark compaction job:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("iceberg-compaction") \
.config("spark.sql.extensions",
"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.config("spark.sql.catalog.analytics", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.analytics.type", "rest") \
.config("spark.sql.catalog.analytics.uri", "https://polaris.your-company.com/api/catalog") \
.getOrCreate()
# Compact files in the daily_revenue table
spark.sql("""
CALL analytics.system.rewrite_data_files(
table => 'revenue.daily_revenue',
strategy => 'binpack',
options => map(
'target-file-size-bytes', '268435456',
'min-input-files', '5'
)
)
""")
# Expire old snapshots
spark.sql("""
CALL analytics.system.expire_snapshots(
table => 'revenue.daily_revenue',
older_than => TIMESTAMP '2026-04-01 00:00:00',
retain_last => 10
)
""")
You need this job running on a schedule for every actively written table. Managed services (Tabular, Snowflake-managed Iceberg, AWS Lake Formation) handle this automatically, which is a real operational advantage over fully self-managed deployments.
Exactly-once semantics end-to-end
Exactly-once delivery from a Kafka event to an Iceberg table requires the streaming engine to implement two-phase commit or an equivalent coordination mechanism. RisingWave implements exactly-once Iceberg sinks using Iceberg's transactional commit protocol: it writes data files to a staging location and commits them atomically, ensuring that a failure during a commit does not result in partial data appearing in the table.
However, exactly-once semantics at the source are separate. If your Kafka source does not provide stable offsets or your CDC connector restarts, you may see duplicate events processed. RisingWave's checkpointing handles recovery from its own failures, but you should verify the behavior of each source connector in your deployment.
Schema evolution coordination
When an upstream schema changes -- a new column added to a PostgreSQL table, a Kafka message format updated -- the change must propagate through the streaming pipeline to the Iceberg table without breaking downstream consumers.
In practice this requires:
- Evolving the Iceberg table schema first (so existing data files remain valid)
- Updating the RisingWave source definition to include the new field
- Updating any materialized views that reference the field
- Notifying downstream consumers (Trino queries, BI tools) that the schema changed
This coordination is manual in most deployments. Data contracts and schema registries (Confluent Schema Registry, AWS Glue Schema Registry) help by enforcing that producers register schema changes before publishing. But the operational workflow of propagating a schema change across a live pipeline requires attention.
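A sketch of the ordering for the first two steps, using a hypothetical discount field (the Iceberg DDL runs through Spark SQL; whether ALTER SOURCE ... ADD COLUMN is available depends on your RisingWave version and source type, so verify before relying on it):
-- 1. Evolve the Iceberg table first so existing data files stay readable
ALTER TABLE analytics.revenue.daily_revenue ADD COLUMN discount_total DECIMAL(10, 2);
-- 2. Then add the field to the RisingWave source definition
ALTER SOURCE order_events ADD COLUMN discount DECIMAL;
-- 3. Materialized views that should expose the field are recreated (or new
--    views created alongside), and downstream consumers are notified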
The Cost Model
How does a real-time lakehouse compare in cost to traditional approaches?
vs. Traditional data warehouse (Snowflake, BigQuery):
A dedicated data warehouse charges for compute (query processing) and storage separately. Iceberg on S3 costs a fraction of warehouse storage per gigabyte. Query compute with Trino or Athena is pay-per-query (Athena) or based on cluster size (Trino). For teams with mixed query patterns -- some heavy scans, some light ad-hoc queries -- Iceberg with Trino is typically cheaper than a dedicated warehouse at scale.
The real-time layer (RisingWave) adds compute cost for continuous processing, but replaces some of the operational overhead of running batch ETL jobs (Airflow, Spark clusters for ETL).
vs. Traditional data lake with batch ETL:
Batch ETL on Spark or dbt requires clusters sized for peak load, running on a schedule. A streaming engine like RisingWave processes events continuously at lower per-event cost and eliminates the scheduling overhead. The primary new cost is the streaming compute (RisingWave nodes or managed RisingWave Cloud).
vs. Fully managed streaming + warehouse (Confluent + Snowflake):
Fully managed services reduce operational burden but increase per-unit cost. The open-source lakehouse stack (RisingWave + Iceberg + Trino) requires more operational expertise but significantly lower licensing costs for teams at scale.
The right choice depends on team size, operational maturity, and data volume. For teams under 10 engineers or under 1 TB/day of data, managed services often make sense. For larger teams with data platform engineers, the open-source stack is worth the operational investment.
What a 2026 Lakehouse Does Not Need
It is worth noting what the 2026 lakehouse architecture does not require, given some confusion about necessary components.
You do not need a separate data warehouse. If your query engines (Trino, DuckDB) are fast enough for your analytical workloads on Iceberg, you do not need to copy data into Snowflake or BigQuery. Many teams run both (Iceberg + Snowflake), paying for two copies of data. If your Iceberg queries meet your performance requirements, that second copy is unnecessary.
You do not need a separate real-time OLAP database. ClickHouse, Apache Druid, and Pinot are excellent systems, but they require another data copy and another system to operate. RisingWave's materialized views handle most real-time analytical query patterns with sub-second latency and without a separate system.
You do not need a message queue between every component. Some architectures pass data through Kafka between every system. For streaming-to-Iceberg, RisingWave reads from Kafka (or CDC directly) and writes to Iceberg without an intermediate queue.
Summary
The data lakehouse architecture in 2026 is not a single product. It is a set of interoperating open components: a streaming SQL engine for the hot tier, Apache Iceberg as the table format for the warm and cold tiers, query engines for analytical access, and a REST Catalog for governance.
RisingWave fills the hot-tier role with PostgreSQL-compatible streaming SQL, native CDC connectors, and an Iceberg sink that continuously feeds the warm tier. Trino and DuckDB provide fast analytical queries on Iceberg without a dedicated warehouse. Polaris or Unity Catalog provides governance across all engines.
The honest part: you will need to operate compaction jobs, manage snapshot expiration, coordinate schema evolution, and think carefully about exactly-once guarantees at each integration point. These are not difficult problems, but they are real ones. The architecture that handles them well is not the one with the fewest moving parts -- it is the one where each component does one thing reliably and the integrations between them are clearly understood.
For most teams building in 2026, the combination of RisingWave for streaming, Iceberg for storage, and Trino or DuckDB for analytics covers the majority of data access patterns at a fraction of the cost of a fully managed stack, with full control over the data and no vendor lock-in at any layer.

