Building a Data Lakehouse: Architecture Guide (2026)
A data lakehouse combines the low-cost storage of a data lake (for example, Parquet files on S3) with the reliability and performance of a data warehouse (ACID transactions, schema enforcement, SQL queries). In 2026, the standard lakehouse stack is object storage + Apache Iceberg as the table format + a query engine (Trino, Spark, or DuckDB).
Lakehouse Architecture
┌─────────────────────────────────────────────────────┐
│ Query Engines │
│ Trino │ Spark │ DuckDB │ Snowflake │ BigQuery │
├─────────────────────────────────────────────────────┤
│ Apache Iceberg (Table Format) │
│ ACID │ Schema Evolution │ Partition Evolution │
├─────────────────────────────────────────────────────┤
│ Object Storage (S3 / GCS / ADLS) │
│ Parquet Files │ Metadata Files │ Manifest Files │
└─────────────────────────────────────────────────────┘
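To make the layering concrete, here is a toy model of how a table format tracks data files through metadata. It is a simplified sketch, not the Iceberg API: `ToyTable`, `Snapshot`, `Manifest`, and the file paths are hypothetical names, and the real format has more layers (such as manifest lists) and more fields than shown.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Manifest:
    """Lists the Parquet data files added by one commit."""
    data_files: tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the manifests visible at that point."""
    snapshot_id: int
    manifests: tuple


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table's metadata file."""
    snapshots: dict = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def append(self, new_files):
        """Commit new data files by publishing a new snapshot."""
        parent = self.snapshots.get(self.current_snapshot_id)
        inherited = parent.manifests if parent else ()
        new_id = len(self.snapshots) + 1
        snapshot = Snapshot(new_id, inherited + (Manifest(tuple(new_files)),))
        self.snapshots[new_id] = snapshot
        # The commit is a single pointer swap: readers see the old snapshot
        # or the new one, never a half-written mix.
        self.current_snapshot_id = new_id
        return new_id

    def plan_scan(self, snapshot_id=None):
        """Return the data files a query engine would read for a snapshot."""
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        if sid is None:
            return []
        return [f for m in self.snapshots[sid].manifests for f in m.data_files]


table = ToyTable()
first = table.append(["s3://example-bucket/orders/a.parquet"])  # hypothetical path
table.append(["s3://example-bucket/orders/b.parquet"])          # hypothetical path
latest_files = table.plan_scan()        # files visible in the current snapshot
older_files = table.plan_scan(first)    # files visible in the first snapshot
```

Because every engine resolves files through the same metadata, any of the query engines in the diagram can read a consistent version of the table, and older snapshots remain readable.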
Adding Real-Time with a Streaming Layer
Kafka / CDC ──→ RisingWave ──→ Iceberg Tables ──→ Query Engines
                     │
                Materialized Views (real-time serving)
RisingWave provides the streaming layer. It ingests from Kafka or CDC sources, transforms the data with SQL, serves real-time queries over its PostgreSQL-compatible interface, and sinks results to Iceberg for historical analytics.
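The data flow can be illustrated with a small simulation. This is a sketch of the pattern only, not RisingWave's implementation or API: `ToyStreamingLayer`, its method names, and the event fields are hypothetical. Each event updates an in-memory aggregate (standing in for a materialized view) and is also buffered and flushed in batches (standing in for appends to an Iceberg table).

```python
from collections import defaultdict


class ToyStreamingLayer:
    """Hypothetical model: one incremental view plus a batched lake sink."""

    def __init__(self, flush_every=3):
        self.view = defaultdict(float)  # "materialized view": revenue per customer
        self.buffer = []                # rows waiting to be written to the lake
        self.lake_batches = []          # stand-in for files appended to a table
        self.flush_every = flush_every

    def ingest(self, event):
        # Incremental maintenance: only the affected key is updated,
        # so the view is fresh without recomputing from scratch.
        self.view[event["customer"]] += event["amount"]
        self.buffer.append(event)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        # Write buffered rows to the lake as one batch (one commit).
        if self.buffer:
            self.lake_batches.append(list(self.buffer))
            self.buffer.clear()

    def query_view(self, customer):
        # Real-time serving path: read the already-computed aggregate.
        return self.view.get(customer, 0.0)


layer = ToyStreamingLayer(flush_every=2)
for event in [
    {"customer": "alice", "amount": 10.0},
    {"customer": "bob", "amount": 5.0},
    {"customer": "alice", "amount": 2.5},
]:
    layer.ingest(event)
```

After these three events, `layer.query_view("alice")` reflects both of her orders immediately, while only the first two events have been flushed to the lake as a batch. That split mirrors the two paths in the diagram: low-latency serving from the view, and historical analytics from the table.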
Lakehouse vs Data Warehouse
| Aspect | Data Warehouse | Data Lakehouse |
|---|---|---|
| Storage | Proprietary | Open (S3 + Parquet) |
| Cost | High | Low (S3 pricing) |
| Format lock-in | Yes | No (open formats) |
| Multi-engine | No | Yes |
| Real-time | Limited | Via streaming layer |
| ML/AI workloads | Limited | Native (Python, Spark) |
Frequently Asked Questions
Is a lakehouse cheaper than a data warehouse?
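Usually, for storage: the saving is typically estimated at 5-10x, though the real ratio depends on your warehouse vendor and pricing plan. S3 storage costs roughly $0.023/GB/month (about $23 per TB per month), whereas proprietary warehouse storage is priced by the vendor. Compute costs depend on the query engine you choose.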
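A quick back-of-the-envelope calculation shows how the storage figure scales. The S3 price is the one quoted above; the warehouse price passed to `savings_ratio` is a hypothetical placeholder, so substitute your vendor's actual rate before drawing conclusions.

```python
S3_PRICE_PER_GB_MONTH = 0.023  # figure quoted in the text


def monthly_storage_cost(terabytes, price_per_gb_month=S3_PRICE_PER_GB_MONTH,
                         gb_per_tb=1000):
    """Monthly storage cost in dollars for a given volume in TB."""
    return terabytes * gb_per_tb * price_per_gb_month


def savings_ratio(warehouse_price_per_gb_month,
                  lake_price_per_gb_month=S3_PRICE_PER_GB_MONTH):
    """How many times more expensive warehouse storage is per GB."""
    return warehouse_price_per_gb_month / lake_price_per_gb_month


# 50 TB on object storage at the quoted price: about $1,150 per month.
lake_cost = monthly_storage_cost(50)

# Hypothetical warehouse rate of $0.115/GB/month, chosen only to illustrate
# the low end of the 5-10x range; it is not a real vendor price.
ratio = savings_ratio(0.115)
```

This covers storage only. Compute is billed separately and can dominate the total, so compare both before estimating overall savings.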
Can I add real-time to an existing lakehouse?
Yes. Add RisingWave as a streaming layer that sinks to your existing Iceberg tables. Your query engines and dashboards need no changes; they continue reading from Iceberg.

