Building a Real-Time Analytics Layer on Apache Iceberg

Introduction

Apache Iceberg gives you reliable, queryable storage on object storage. But Iceberg alone is not a real-time analytics engine. Queries against Iceberg tables, even well-compacted ones, hit object storage with its inherent latency (typically 50 to 200 ms per file open). For dashboards that need sub-second response times or operational analytics that drives automated decisions, that is too slow.

The solution is a layered architecture: use Iceberg as your cold and warm storage for historical and recent data, and add a hot analytics layer that serves real-time queries from incrementally maintained materialized views. This pattern gives you the best of both worlds: the durability and cost-efficiency of Iceberg for petabyte-scale history, and the low-latency query performance of an in-memory streaming engine for the latest data.

In this guide, we build this architecture using RisingWave as the real-time analytics layer and Apache Iceberg as the lakehouse storage. We compare this approach to traditional OLAP solutions (Snowflake, BigQuery, ClickHouse) and show you how to query both hot and cold data from a single interface.

Why Can't Apache Iceberg Alone Serve Real-Time Analytics?

Iceberg is an excellent table format. It provides ACID transactions, schema evolution, time travel, and partition pruning on object storage. But it was designed as a storage format, not a query engine. The query engine (Spark, Trino, StarRocks) sits on top and handles execution.

The performance constraints come from the storage layer:

Object storage latency

Every Iceberg query starts by reading metadata (manifest list, manifests) and then the data files themselves. On S3, each file open incurs 50 to 200 ms of latency. A query that touches 100 files and opens them sequentially spends 5 to 20 seconds on I/O overhead before any computation begins.

No incremental computation

Iceberg stores data. It does not maintain aggregations. If your dashboard shows "total revenue in the last hour," every dashboard refresh re-scans the relevant data files and recomputes the aggregation from scratch. A streaming engine with materialized views computes the answer incrementally: each new event updates the running total, and the dashboard reads the pre-computed result.
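The contrast can be made concrete with a toy Python sketch (illustrative only, not RisingWave's actual engine) of re-scan versus incremental maintenance:

```python
# Toy contrast: full re-scan vs. incremental maintenance of a
# "total revenue" aggregate. (Illustrative only.)

events = []          # everything stored (what the table holds)
running_total = 0.0  # pre-computed answer (what a materialized view holds)

def rescan_total():
    # Batch approach: every refresh re-reads and re-aggregates all events.
    return sum(e["amount"] for e in events)

def ingest(amount):
    # Streaming approach: each event updates the stored result once.
    global running_total
    events.append({"amount": amount})
    running_total += amount

for amt in [10.0, 25.5, 4.5]:
    ingest(amt)

# Both agree, but the incremental answer costs O(1) per event to maintain
# and O(1) to read, while the re-scan costs O(n) on every refresh.
assert rescan_total() == running_total == 40.0
```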

Commit latency

Streaming pipelines that write to Iceberg typically commit every 30 seconds to 2 minutes. This means the latest data is 30 seconds to 2 minutes old by the time it appears in Iceberg. For many analytical use cases, this is fine. For operational analytics (fraud detection, inventory alerts, SLO monitoring), it is not fresh enough.

The real-time gap

The gap between "data arrives" and "data is queryable in Iceberg" is the commit interval plus query planning time plus query execution time, typically 1 to 5 minutes end-to-end. A hot analytics layer closes this gap to sub-second.
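A quick back-of-envelope calculation with the figures from this section (all values rough assumptions) shows where the time goes:

```python
# Back-of-envelope freshness budget for querying Iceberg directly,
# using the rough latency figures from this section.

commit_interval_s = 60   # sink commits every ~30 s to 2 min
planning_s = 2           # metadata/manifest reads on object storage
file_open_ms = 100       # per-file open latency on S3 (~50-200 ms)
files_touched = 100

io_overhead_s = files_touched * file_open_ms / 1000
end_to_end_s = commit_interval_s + planning_s + io_overhead_s

print(end_to_end_s)  # 72.0 -> over a minute before any real compute
```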

What Does a Hot-Warm-Cold Analytics Architecture Look Like?

The layered analytics architecture separates data by temperature, matching storage and compute characteristics to access patterns:

Hot layer: RisingWave materialized views

The hot layer stores the most recent data (typically minutes to hours) in incrementally maintained materialized views. Queries against the hot layer return in milliseconds because results are pre-computed and stored in memory or fast local storage.

Warm layer: Recent Iceberg data

The warm layer contains recently committed Iceberg data (hours to days). This data is queryable through any Iceberg-compatible engine. Queries are slower than the hot layer (seconds) but cover a larger time range.

Cold layer: Historical Iceberg data

The cold layer contains older Iceberg data (weeks to years) that is accessed less frequently. This data benefits from compaction, sort-order optimization, and potentially cheaper storage tiers (S3 Glacier, Azure Cool).

Data flow

Kafka / CDC Sources
       |
       v
  RisingWave
       |
       +------> Materialized Views (HOT: sub-second queries)
       |
       +------> Iceberg Sink (WARM/COLD: Parquet on object storage)
       |
       v
  Iceberg Tables
       |
       +------> Recent partitions (WARM: seconds latency)
       +------> Historical partitions (COLD: archival, batch queries)

How Do You Build the Hot Layer with RisingWave?

The hot layer is where real-time analytics happens. RisingWave ingests streaming data, maintains materialized views incrementally, and serves queries with PostgreSQL-compatible SQL.

Step 1: Ingest streaming data

Connect RisingWave to your data sources. This example uses Kafka for clickstream events:

-- Create a source connected to Kafka
CREATE SOURCE clickstream (
    event_id BIGINT,
    user_id INT,
    page_url VARCHAR,
    event_type VARCHAR,
    session_id VARCHAR,
    event_time TIMESTAMP,
    device_type VARCHAR,
    country VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'clickstream_events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

Step 2: Create materialized views for real-time metrics

Materialized views in RisingWave are continuously updated as new events arrive. They are not periodic refreshes like traditional database materialized views. Each event updates the view within milliseconds.

-- Real-time page view counts per URL, updated continuously
CREATE MATERIALIZED VIEW page_views_live AS
SELECT
    page_url,
    COUNT(*) AS total_views,
    COUNT(DISTINCT user_id) AS unique_visitors,
    COUNT(DISTINCT session_id) AS sessions,
    MAX(event_time) AS last_seen
FROM clickstream
WHERE event_type = 'page_view'
GROUP BY page_url;

-- Real-time conversion funnel, sliding 1-hour window
CREATE MATERIALIZED VIEW conversion_funnel_1h AS
SELECT
    COUNT(DISTINCT CASE WHEN event_type = 'page_view' THEN user_id END) AS visitors,
    COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart' THEN user_id END) AS added_to_cart,
    COUNT(DISTINCT CASE WHEN event_type = 'checkout_start' THEN user_id END) AS started_checkout,
    COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) AS purchased
FROM clickstream
WHERE event_time > NOW() - INTERVAL '1 hour';

Step 3: Query the hot layer

Because RisingWave is PostgreSQL-compatible, you can connect any BI tool (Grafana, Metabase, Superset, Tableau) or application using a standard PostgreSQL driver:

-- Dashboard query: sub-second response
SELECT * FROM page_views_live ORDER BY total_views DESC LIMIT 20;

-- Conversion funnel: instant response
SELECT * FROM conversion_funnel_1h;

These queries read from pre-computed results. No scanning, no aggregation at query time. Response times are typically 1 to 10 ms.
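For example, a dashboard backend can poll the hot layer with a few lines of Python. The host, port (4566 is RisingWave's default), database, and user in the commented usage are assumptions for a local setup:

```python
# Minimal hot-layer client sketch. Any PostgreSQL driver works;
# the connection parameters below are assumptions for a local setup.

TOP_PAGES_SQL = (
    "SELECT page_url, total_views FROM page_views_live "
    "ORDER BY total_views DESC LIMIT %s"
)

def fetch_top_pages(conn, limit=20):
    # Reads the pre-computed ranking; no scan or aggregation happens here.
    with conn.cursor() as cur:
        cur.execute(TOP_PAGES_SQL, (limit,))
        return cur.fetchall()

# Usage (assumes psycopg2 installed and RisingWave on its default port):
# import psycopg2
# conn = psycopg2.connect(host="localhost", port=4566, dbname="dev", user="root")
# print(fetch_top_pages(conn))
```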

How Do You Sink Data to the Iceberg Cold Layer?

While the hot layer serves real-time queries, you simultaneously sink the same data to Iceberg for historical analytics, compliance, and batch processing.

Create the Iceberg sink

-- Sink raw clickstream data to Iceberg for historical storage
CREATE SINK clickstream_to_iceberg FROM clickstream WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'https://your-catalog-endpoint',
    warehouse = 'analytics_warehouse',
    database.name = 'clickstream',
    table.name = 'events',
    commit_checkpoint_interval = 120
);

You can also sink materialized view results to Iceberg, creating pre-aggregated tables that are cheaper to query:

-- Sink hourly aggregations to Iceberg
CREATE MATERIALIZED VIEW hourly_page_stats AS
SELECT
    page_url,
    window_start,
    window_end,
    COUNT(*) AS view_count,
    COUNT(DISTINCT user_id) AS unique_users
FROM TUMBLE(clickstream, event_time, INTERVAL '1 hour')
GROUP BY page_url, window_start, window_end;

CREATE SINK hourly_stats_iceberg FROM hourly_page_stats WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'https://your-catalog-endpoint',
    warehouse = 'analytics_warehouse',
    database.name = 'clickstream',
    table.name = 'hourly_page_stats',
    commit_checkpoint_interval = 300
);

For detailed Iceberg sink configuration, see the RisingWave Iceberg documentation.

How Does This Compare to Traditional OLAP Approaches?

Most organizations today use one of these approaches for analytics. Here is how they compare to the layered Iceberg architecture:

Snowflake / BigQuery (cloud data warehouse)

| Dimension | Snowflake / BigQuery | RisingWave + Iceberg |
|---|---|---|
| Data freshness | Minutes to hours (micro-batch or scheduled loads) | Sub-second (hot layer), minutes (Iceberg) |
| Query latency | Seconds to minutes (compute spin-up) | Milliseconds (hot), seconds (warm Iceberg) |
| Storage cost | $23/TB/month (Snowflake compressed) | ~$2.30/TB/month (S3 standard) |
| Compute cost | Pay-per-query (can spike) | Predictable (streaming compute) |
| Vendor lock-in | High (proprietary format) | Low (open Iceberg format) |
| Historical queries | Good | Good (Iceberg time travel) |

The cost difference is dramatic. Iceberg on object storage costs roughly one-tenth of cloud data warehouse storage. For organizations with petabytes of historical data, this adds up to millions of dollars annually.

ClickHouse (real-time OLAP)

| Dimension | ClickHouse | RisingWave + Iceberg |
|---|---|---|
| Query latency | Milliseconds (columnar engine) | Milliseconds (hot), seconds (warm) |
| Streaming ingestion | Limited (requires Kafka Engine or materialized views) | Native (SQL-based streaming) |
| Incremental computation | Limited (materialized views are insert-only) | Full (joins, aggregations, windows) |
| Historical storage | Expensive (local disks or S3 with performance trade-offs) | Cost-effective (Iceberg on S3) |
| Multi-engine access | ClickHouse only | Any Iceberg-compatible engine |

ClickHouse is excellent for ad-hoc analytical queries on recent data. But it is not designed for complex streaming transformations (multi-way joins, temporal joins, windowed aggregations), and storing years of history in ClickHouse is expensive. The layered approach lets you use each tool for what it does best.

Apache Druid (real-time OLAP)

| Dimension | Druid | RisingWave + Iceberg |
|---|---|---|
| Query latency | Sub-second (pre-aggregated segments) | Milliseconds (materialized views) |
| Streaming ingestion | Yes (Kafka indexing) | Yes (SQL-based) |
| SQL support | Limited (Druid SQL) | Full PostgreSQL compatibility |
| Operational complexity | High (multiple node types) | Lower (single streaming engine + Iceberg) |
| Historical storage | Deep storage (S3) with limited query performance | Iceberg with full query capabilities |

How Do You Query Both Hot and Cold Data Together?

One of the most powerful aspects of this architecture is the ability to query across time horizons. The hot layer has the last few hours with sub-second freshness. Iceberg has everything else. You can query both.

Option 1: Federated queries via Trino

Trino can query both RisingWave (via PostgreSQL connector) and Iceberg (via Iceberg connector) in a single query:

-- In Trino: combine hot (RisingWave) and cold (Iceberg) data
SELECT page_url, SUM(view_count) AS total_views
FROM (
    -- Hot data from RisingWave materialized view
    SELECT page_url, total_views AS view_count
    FROM risingwave.public.page_views_live

    UNION ALL

    -- Cold data from Iceberg
    SELECT page_url, view_count
    FROM iceberg.clickstream.hourly_page_stats
    WHERE window_end < NOW() - INTERVAL '1' HOUR
) combined
GROUP BY page_url
ORDER BY total_views DESC;

Option 2: Application-level merge

Your application queries both layers and merges results. This is simpler to implement and avoids the complexity of federated query engines:

import duckdb           # embedded engine with an Iceberg reader (iceberg extension)
import pandas as pd
import psycopg2         # RisingWave is PostgreSQL-compatible

# Hot layer: latest real-time metrics from RisingWave (default port 4566)
rw_conn = psycopg2.connect(host="risingwave", port=4566, dbname="dev", user="root")
hot_data = pd.read_sql("SELECT page_url, total_views FROM page_views_live", rw_conn)

# Cold layer: historical data from Iceberg via DuckDB
duck = duckdb.connect()
duck.install_extension("iceberg")
duck.load_extension("iceberg")
cold_data = duck.sql("""
    SELECT page_url, SUM(view_count) AS total_views
    FROM iceberg_scan('s3://warehouse/clickstream/hourly_page_stats')
    WHERE window_end < NOW() - INTERVAL '1 hour'
    GROUP BY page_url
""").df()

# Merge: sum hot and cold counts per page
combined = (
    pd.concat([hot_data, cold_data])
    .groupby("page_url", as_index=False)["total_views"]
    .sum()
)

Option 3: RisingWave Iceberg source

RisingWave can also read from Iceberg tables as a source, enabling you to create materialized views that combine streaming data with historical Iceberg data:

-- Create an Iceberg source for historical data
CREATE SOURCE historical_stats WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'https://your-catalog-endpoint',
    warehouse = 'analytics_warehouse',
    database.name = 'clickstream',
    table.name = 'hourly_page_stats'
);
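As a hedged sketch, once historical_stats exists you could define a materialized view that unions live and historical windows, issued here through a PostgreSQL driver. The view name and the one-day cutoff are assumptions for illustration; the time predicates exist to avoid double-counting windows present in both tables, and UNION ALL plus NOW()-based filter support should be checked against your RisingWave version:

```python
# Hypothetical combined view: recent windows from the live aggregation,
# older windows from the Iceberg-backed source. Names and cutoff are
# assumptions for illustration.

COMBINED_MV_DDL = """
CREATE MATERIALIZED VIEW page_stats_all_time AS
SELECT page_url, window_end, view_count FROM hourly_page_stats
WHERE window_end >  NOW() - INTERVAL '1 day'
UNION ALL
SELECT page_url, window_end, view_count FROM historical_stats
WHERE window_end <= NOW() - INTERVAL '1 day'
"""

def create_combined_view(conn):
    # Issue the DDL over any PostgreSQL-compatible connection.
    with conn.cursor() as cur:
        cur.execute(COMBINED_MV_DDL)
    conn.commit()

# Usage (assumes psycopg2 and a running RisingWave instance):
# import psycopg2
# conn = psycopg2.connect(host="localhost", port=4566, dbname="dev", user="root")
# create_combined_view(conn)
```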

For more on this pattern, see the RisingWave Iceberg source documentation.

What Are the Operational Considerations?

Storage tiering

Configure Iceberg tables with lifecycle policies that move older data to cheaper storage:

  • Hot (0-24 hours): S3 Standard or equivalent. Frequently accessed for backfill or recent queries.
  • Warm (1-30 days): S3 Standard-Infrequent Access. Still queryable but accessed less often.
  • Cold (30+ days): S3 Glacier Instant Retrieval. For compliance, audit, and rare historical analysis.
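These tiers can be expressed as an S3 lifecycle configuration. The sketch below builds the rule document in Python; the bucket name and prefix are placeholders, and note that S3 requires objects to spend at least 30 days in Standard before a Standard-IA transition, so the warm cutoff is set to 30 days here rather than 1:

```python
# Illustrative S3 lifecycle rules for the Iceberg data prefix. Bucket
# and prefix are placeholders. Note: S3 enforces >= 30 days in Standard
# before a Standard-IA transition, hence the 30-day warm cutoff.

LIFECYCLE_RULES = {
    "Rules": [
        {
            "ID": "iceberg-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "clickstream/"},
            "Transitions": [
                # Warm tier: Standard-Infrequent Access
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # Cold tier: Glacier Instant Retrieval (still queryable)
                {"Days": 60, "StorageClass": "GLACIER_IR"},
            ],
        }
    ]
}

# Apply with boto3 (not executed here):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="warehouse", LifecycleConfiguration=LIFECYCLE_RULES
# )
```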

Compaction strategy

The warm and cold layers need regular compaction to maintain query performance. As discussed in our Iceberg compaction guide, RisingWave handles this automatically. If you use a different ingestion tool, schedule Spark compaction jobs as described in the Iceberg maintenance documentation.

Schema management

When your streaming schema evolves, both the hot layer (RisingWave materialized views) and the cold layer (Iceberg tables) need to stay in sync. Iceberg's schema evolution handles the cold layer transparently. For the hot layer, update your RisingWave source and materialized view definitions. The RisingWave schema change documentation covers this process.

Cost estimation

For a typical streaming analytics deployment processing 10,000 events/sec:

| Component | Monthly cost estimate |
|---|---|
| RisingWave Cloud (hot layer) | $200-500 (depending on materialized view complexity) |
| S3 storage (Iceberg, 10 TB) | $230 |
| Compaction compute (if not using RisingWave) | $100-300 |
| Query engine (Trino/StarRocks for cold queries) | $200-500 |
| Total | $730-1,530 |

Compare this to a cloud data warehouse storing the same volume: $2,000-5,000/month for Snowflake or BigQuery with similar query patterns. The savings come from Iceberg's object storage pricing and RisingWave's efficient incremental computation.
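The storage arithmetic behind that comparison is straightforward:

```python
# Storage-only comparison using the per-TB prices quoted above.
tb = 10
warehouse_per_tb_month = 23.00  # Snowflake compressed storage, $/TB/month
s3_per_tb_month = 2.30          # S3 Standard, $/TB/month

warehouse_annual = tb * warehouse_per_tb_month * 12
s3_annual = tb * s3_per_tb_month * 12

print(warehouse_annual)  # 2760.0
assert round(warehouse_annual / s3_annual) == 10  # roughly 10x cheaper
```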

FAQ

What is a real-time analytics layer on Apache Iceberg?

A real-time analytics layer is a streaming compute engine that sits on top of Iceberg storage, serving pre-computed query results with sub-second latency. RisingWave serves as this layer by maintaining incrementally updated materialized views while sinking data to Iceberg for historical storage and multi-engine access.

Can Apache Iceberg serve real-time queries by itself?

No. Iceberg is a storage format, not a query engine. Queries against Iceberg tables go through an engine (Spark, Trino, StarRocks) and hit object storage, which adds latency. For sub-second query response times, you need a hot analytics layer like RisingWave that pre-computes results from streaming data.

How does this approach compare to using Snowflake or BigQuery?

The layered approach (RisingWave + Iceberg) provides lower data freshness latency (sub-second vs minutes), lower storage cost (10x cheaper on object storage), and no vendor lock-in (open Iceberg format). Snowflake and BigQuery offer simpler operations and broader SQL feature coverage for ad-hoc analytics. The right choice depends on your latency requirements and data volumes.

What query engines can access the Iceberg layer?

Any Iceberg-compatible engine can query the cold and warm layers: Spark, Trino, StarRocks, DuckDB, Snowflake, BigQuery, Redshift Spectrum, and RisingWave. This multi-engine flexibility is one of the primary advantages of using Iceberg as the storage layer instead of a proprietary format.

Conclusion

Building a real-time analytics layer on Apache Iceberg gives you the best of both worlds: sub-second query performance for operational dashboards and cost-effective, petabyte-scale storage for historical analysis. Here are the key takeaways:

  • Use a hot-warm-cold architecture: RisingWave materialized views for real-time (hot), recent Iceberg partitions for interactive analytics (warm), and historical Iceberg data for batch and compliance (cold).
  • Iceberg is storage, not compute: Pair it with a streaming engine like RisingWave to close the real-time gap.
  • Cost advantage is significant: Iceberg on object storage costs 10x less than cloud data warehouses for the same data volume.
  • Open format prevents lock-in: Any Iceberg-compatible engine can query your data. You are not locked into a single vendor.
  • Start with materialized views: Identify your top 5 dashboard queries and build materialized views for them. This delivers immediate value while Iceberg handles the long tail of ad-hoc queries.

Ready to try this yourself? Build a real-time analytics layer on Iceberg with RisingWave. Try RisingWave Cloud free, no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
