Streaming Lakehouse Architecture: Real-Time + Historical Analytics

A streaming lakehouse combines real-time serving (sub-second queries) with historical analytics (scans over petabytes of data) in a single architecture. The pattern: a streaming database (RisingWave) serves real-time queries from materialized views (MVs) and also sinks the same data to Apache Iceberg for long-term analytical queries.

Architecture

Sources (Kafka, CDC) ──→ RisingWave
                              │
                    ┌─────────┴─────────┐
                    ↓                   ↓
           Materialized Views     Iceberg Sink
           (real-time serving)    (historical storage)
                    │                   │
                    ↓                   ↓
              Applications         Trino / Spark
              (sub-100ms)          (analytical queries)

Why Both?

Need                 | Streaming MVs                | Iceberg
---------------------+------------------------------+-------------
Latest 5-min metrics | ✅ Sub-100ms                 | ❌ Delayed
Last 30 days trend   | ⚠️ Expensive to maintain     | ✅ Efficient
Ad-hoc exploration   | ⚠️ Pre-defined queries only  | ✅ Flexible
ML training data     | ❌ Wrong tool                | ✅ Perfect

Streaming MVs are optimal for known, high-frequency queries with strict freshness requirements. Iceberg is optimal for flexible, historical analysis over large datasets.
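The trade-off in the table can be written down as a small routing rule. This is only a sketch: `QueryProfile`, `choose_layer`, and the 5-minute default window are hypothetical names and values chosen for illustration, not part of RisingWave or Iceberg.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class QueryProfile:
    """Hypothetical description of a query; fields are illustrative only."""
    lookback: timedelta  # how far back in time the query reads
    predefined: bool     # True if the query shape is known ahead of time


def choose_layer(query: QueryProfile,
                 mv_window: timedelta = timedelta(minutes=5)) -> str:
    """Pick a serving layer for a query.

    mv_window is an assumed value matching the 5-minute example in the text,
    not a RisingWave default.
    """
    # Known query over recent data: a materialized view already holds the answer.
    if query.predefined and query.lookback <= mv_window:
        return "materialized_view"
    # Long lookbacks and ad-hoc query shapes go to the historical table.
    return "iceberg"
```

Under these assumptions, a predefined 5-minute dashboard query is routed to the materialized view, while a 30-day trend or any ad-hoc query is routed to Iceberg.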

Implementation

-- Real-time serving layer
CREATE MATERIALIZED VIEW live_metrics AS
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM orders_stream
WHERE order_time > NOW() - INTERVAL '5 minutes'
GROUP BY region;

-- Historical analytics layer (same data, different destination)
CREATE SINK orders_to_iceberg AS SELECT * FROM orders_stream
WITH (connector='iceberg', type='append-only', ...);

Both the materialized view and the sink are fed from the same streaming source. The view serves real-time queries; the sink stores history.
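To make the fan-out concrete, here is a toy in-memory model of the same idea. It is not how RisingWave works internally: `DualPathPipeline` is a hypothetical class, the list stands in for the Iceberg table, and the aggregate is recomputed on read rather than maintained incrementally. It also assumes events arrive in `order_time` order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class DualPathPipeline:
    """Toy stand-in: one ingest path feeds a windowed aggregate and an append-only log."""

    def __init__(self, window: timedelta = timedelta(minutes=5)):
        self.window = window
        self.recent = deque()  # events still inside the serving window
        self.history = []      # append-only log, standing in for the Iceberg table

    def ingest(self, region: str, amount: float, order_time: datetime) -> None:
        event = (order_time, region, amount)
        self.recent.append(event)   # real-time path
        self.history.append(event)  # historical path (never evicted)

    def live_metrics(self, now: datetime) -> dict:
        # Keep only events with order_time > now - window, like the WHERE clause.
        cutoff = now - self.window
        while self.recent and self.recent[0][0] <= cutoff:
            self.recent.popleft()
        # Aggregate per region, like the GROUP BY.
        metrics = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
        for _, region, amount in self.recent:
            metrics[region]["orders"] += 1
            metrics[region]["revenue"] += amount
        return dict(metrics)
```

Every ingested event lands in both structures. Old events drop out of `live_metrics` once they leave the window, but they remain in `history` for later analysis.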

Frequently Asked Questions

Do I need both streaming MVs and Iceberg?

Not always. If you only need real-time metrics, MVs alone are sufficient. If you only need historical analytics, Iceberg alone works. The streaming lakehouse pattern is for workloads requiring both — which is increasingly common.

How much does a streaming lakehouse cost?

Compute: a RisingWave cluster (roughly $100-500/month for moderate workloads). Storage: S3 at $0.023/GB/month. Query engines: Trino (self-hosted) or DuckDB, both open source. Depending on the workload, the total can be 5-10x cheaper than a traditional data warehouse.
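As a rough back-of-the-envelope check, the figures above can be combined in a few lines. The function name and the example inputs are hypothetical; the sketch also ignores S3 request and data-transfer charges and any cost of running the query engine unless one is passed in.

```python
S3_USD_PER_GB_MONTH = 0.023  # storage rate quoted in the text


def monthly_cost_usd(compute_usd: float, stored_gb: float,
                     query_engine_usd: float = 0.0) -> float:
    """Estimate monthly cost as compute + S3 storage + optional query-engine cost."""
    storage_usd = stored_gb * S3_USD_PER_GB_MONTH
    return compute_usd + storage_usd + query_engine_usd
```

For example, assuming a $300/month cluster and 2,000 GB stored, the estimate is 300 + 2,000 × 0.023, or about $346 per month.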
