Real-Time Analytics on Apache Iceberg Tables with RisingWave

RisingWave enables real-time analytics on Apache Iceberg tables by continuously ingesting data from Kafka and databases into Iceberg via streaming sinks, and — in v2.8+ — directly querying Iceberg tables as sources. This means your dashboards and applications can query results that are seconds old, not hours old, while data remains in open Iceberg format accessible to any query engine.

The Real-Time Analytics Gap in Traditional Lakehouses

Traditional data lakehouse architectures have a latency problem. The typical pipeline looks like this:

  1. Events arrive in Kafka
  2. A batch Spark job runs every hour, reading from Kafka
  3. Spark writes to Iceberg on S3
  4. Analysts query Iceberg with Athena or Trino

The result: data is 1–2 hours stale. For dashboards tracking sales, fraud, user engagement, or operations, this is often unacceptably slow.

RisingWave fills this gap. It sits between your event sources and Iceberg storage, maintaining continuously updated materialized views that are committed to Iceberg every few seconds. Query engines always see fresh data — not because they're querying Kafka, but because Iceberg itself is kept current.

Architecture: Continuous Freshness

Kafka Events / CDC
        │
        ▼
   RisingWave
   ┌─────────────────────────────┐
   │  Materialized Views         │
   │  (incremental computation)  │
   └─────────┬───────────────────┘
             │ Commit every N seconds
             ▼
   Apache Iceberg on S3
        │
   ┌────┴──────────┬──────────────┐
   ▼               ▼              ▼
  Athena         Trino       RisingWave
  (batch)        (batch)     (live queries)

With this architecture, Athena and Trino queries always see data that is at most one checkpoint interval old (typically 10–60 seconds). Queries issued to RisingWave itself are fresher still, since they read directly from in-memory materialized view state, before data is even committed to Iceberg.
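The commit cadence is tunable. As a sketch (parameter names reflect recent RisingWave releases; verify against your version's documentation):

```sql
-- Checkpoint every 5 barriers (barriers are emitted roughly once per
-- second by default), so state is checkpointed about every 5 seconds.
ALTER SYSTEM SET checkpoint_frequency = 5;

-- On the sink side, commit_checkpoint_interval (set in the sink's WITH
-- clause) controls how many checkpoints elapse between Iceberg commits:
--   commit_checkpoint_interval = 1   -- commit on every checkpoint
```

Shorter intervals mean fresher data but more Iceberg snapshots and small files, so pair aggressive settings with regular compaction.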

Building Real-Time Analytics: Step by Step

Define the Streaming Source

CREATE SOURCE app_events (
    event_id VARCHAR,
    user_id BIGINT,
    session_id VARCHAR,
    event_name VARCHAR,
    page_path VARCHAR,
    revenue NUMERIC(10,2),
    device_type VARCHAR,
    geo_country VARCHAR,
    event_time TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'app-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

Build Multi-Granularity Materialized Views

Real-time analytics requires metrics at multiple time granularities. Build views for each:

-- Real-time 5-minute rolling metrics
CREATE MATERIALIZED VIEW realtime_5min_metrics AS
SELECT
    geo_country,
    device_type,
    window_start,
    window_end,
    COUNT(DISTINCT user_id) AS active_users,
    COUNT(DISTINCT session_id) AS active_sessions,
    COUNT(*) FILTER (WHERE event_name = 'purchase') AS purchase_events,
    SUM(revenue) FILTER (WHERE event_name = 'purchase') AS revenue_5min,
    COUNT(*) AS total_events
FROM TUMBLE(app_events, event_time, INTERVAL '5 MINUTES')
GROUP BY geo_country, device_type, window_start, window_end;

-- Hourly aggregation for trend analysis
CREATE MATERIALIZED VIEW hourly_analytics AS
SELECT
    geo_country,
    device_type,
    DATE_TRUNC('hour', event_time) AS hour_bucket,
    COUNT(DISTINCT user_id) AS unique_users,
    COUNT(DISTINCT session_id) AS unique_sessions,
    SUM(revenue) AS total_revenue,
    COUNT(*) FILTER (WHERE event_name = 'page_view') AS page_views,
    COUNT(*) FILTER (WHERE event_name = 'add_to_cart') AS add_to_cart_events,
    COUNT(*) FILTER (WHERE event_name = 'purchase') AS purchases,
    AVG(revenue) FILTER (WHERE event_name = 'purchase') AS avg_order_value
FROM app_events
GROUP BY geo_country, device_type, DATE_TRUNC('hour', event_time);
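Materialized views can also be queried ad hoc over the Postgres protocol. For example, today's revenue leaders, assuming the hourly_analytics view above (note that now() is allowed in batch queries like this one):

```sql
SELECT geo_country, SUM(total_revenue) AS revenue_today
FROM hourly_analytics
WHERE hour_bucket >= date_trunc('day', now())
GROUP BY geo_country
ORDER BY revenue_today DESC
LIMIT 10;
```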

Sink All Metrics to Iceberg

CREATE SINK realtime_metrics_to_iceberg AS
SELECT * FROM realtime_5min_metrics
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'geo_country,device_type,window_start',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://analytics-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'realtime_5min_metrics'
);

CREATE SINK hourly_analytics_to_iceberg AS
SELECT * FROM hourly_analytics
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'geo_country,device_type,hour_bucket',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://analytics-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'hourly_analytics'
);

Note that upsert sinks require a primary_key so RisingWave can emit updates and deletes, not just appends; use the grouping columns of the underlying view.
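After creating the sinks, confirm they are registered and running:

```sql
-- List all sinks in the current database
SHOW SINKS;
```

If a sink is missing or stalled, check the connector configuration (catalog URI, S3 credentials) before debugging downstream.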

Query Live Analytics with Sliding Windows

For real-time dashboards, use HOP() windows to compute rolling metrics:

-- 1-hour rolling active users, updated every 5 minutes
CREATE MATERIALIZED VIEW rolling_1hr_active_users AS
SELECT
    geo_country,
    window_start,
    window_end,
    COUNT(DISTINCT user_id) AS active_users_1hr,
    SUM(revenue) AS rolling_revenue_1hr
FROM HOP(app_events, event_time, INTERVAL '5 MINUTES', INTERVAL '1 HOUR')
GROUP BY geo_country, window_start, window_end;

HOP() windows slide forward every 5 minutes while looking back 1 hour — perfect for "last hour" metrics on live dashboards.

Connecting BI Tools to Iceberg Analytics

Once data is in Iceberg, connect your favorite BI tools:

BI Tool           | Connection Method        | Freshness
------------------|--------------------------|----------
Amazon QuickSight | Athena → Iceberg         | 10–60s
Tableau           | Trino/Athena connector   | 10–60s
Looker            | BigQuery Iceberg tables  | 10–60s
Apache Superset   | Trino or Hive connector  | 10–60s
Metabase          | Trino connector          | 10–60s
Grafana           | Athena plugin            | 10–60s

For truly sub-second query freshness, connect BI tools directly to RisingWave using its PostgreSQL-compatible wire protocol. Any tool that supports Postgres can query RisingWave's materialized views directly — before the data is even committed to Iceberg.
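Because RisingWave speaks the Postgres wire protocol (default frontend port 4566), a dashboard can query a live view directly. A sketch, assuming the rolling_1hr_active_users view defined earlier:

```sql
-- Issued through any Postgres client or driver, e.g.:
--   psql -h <risingwave-host> -p 4566 -d dev -U root
SELECT geo_country, active_users_1hr, rolling_revenue_1hr
FROM rolling_1hr_active_users
WHERE window_end = (SELECT MAX(window_end) FROM rolling_1hr_active_users)
ORDER BY active_users_1hr DESC;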

Advanced Pattern: Lakehouse Query + Streaming Join

RisingWave v2.8+ enables a powerful pattern: join live streaming data with historical Iceberg tables in a single query:

-- Register historical Iceberg data as a source
CREATE SOURCE iceberg_hourly_historical
WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://analytics-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'hourly_analytics'
);

-- Compare today's live data against the same hour last week
SELECT
    live.geo_country,
    live.hour_bucket,
    live.total_revenue AS current_revenue,
    hist.total_revenue AS last_week_revenue,
    ROUND(
        100.0 * (live.total_revenue - hist.total_revenue) / NULLIF(hist.total_revenue, 0),
        1
    ) AS revenue_growth_pct
FROM hourly_analytics live
JOIN iceberg_hourly_historical hist
    ON live.geo_country = hist.geo_country
    AND hist.hour_bucket = live.hour_bucket - INTERVAL '7 DAYS'
WHERE live.total_revenue > 0
ORDER BY revenue_growth_pct DESC;

This query computes week-over-week revenue growth by country, using live streaming data for current values and historical Iceberg data for comparison — with no data movement required.

Performance Optimization

Iceberg Table Partitioning for Analytics

Partition Iceberg tables by the dimensions most commonly used in analytics queries:

-- When creating the Iceberg table (using Spark or catalog API)
-- Partition by date and country for analytics queries
CREATE TABLE analytics.hourly_analytics (
    geo_country STRING,
    device_type STRING,
    hour_bucket TIMESTAMP,
    unique_users BIGINT,
    ...
) USING iceberg
PARTITIONED BY (DATE(hour_bucket), geo_country);

With this partitioning, a query filtered to WHERE geo_country = 'US' AND DATE(hour_bucket) = '2026-04-03' scans only the relevant partition files, not the entire table.
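For example, a Trino or Athena query shaped like the one below prunes to a single day's partitions under the layout above (Iceberg's transform-aware pruning maps the timestamp range onto the DATE(hour_bucket) partition):

```sql
SELECT device_type, SUM(total_revenue) AS revenue
FROM analytics.hourly_analytics
WHERE geo_country = 'US'
  AND hour_bucket >= TIMESTAMP '2026-04-03 00:00:00'
  AND hour_bucket <  TIMESTAMP '2026-04-04 00:00:00'
GROUP BY device_type;
```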

Materialized View Indexing

For frequently queried dimensions, consider creating secondary materialized views that pre-aggregate:

-- Note: streaming materialized views cannot reference CURRENT_DATE;
-- use a temporal filter with now() instead. ORDER BY in a streaming
-- view requires a LIMIT (Top-N), so sort at query time.
CREATE MATERIALIZED VIEW top_countries_revenue AS
SELECT
    geo_country,
    SUM(total_revenue) AS revenue_30d,
    MAX(hour_bucket) AS last_updated
FROM hourly_analytics
WHERE hour_bucket > now() - INTERVAL '30 days'
GROUP BY geo_country;

Comparison: Real-Time Analytics Approaches

Approach                       | Data Freshness | Query Performance | Operational Complexity | Open Format
-------------------------------|----------------|-------------------|------------------------|------------
RisingWave → Iceberg + Athena  | 10–60 seconds  | High              | Low                    | Yes
Kafka → Flink → Iceberg + Trino| 10–30 seconds  | High              | High                   | Yes
Kafka → OLAP DB (Druid, Pinot) | < 1 second     | Very High         | Very High              | No
Batch ETL → Data Warehouse     | 1–24 hours     | Very High         | Medium                 | No
RisingWave direct query        | < 1 second     | High              | Low                    | Yes (via sink)

FAQ

Q: Can I use RisingWave as the query engine for Iceberg analytics? Yes. In v2.8+, register your Iceberg tables as sources in RisingWave and query them with standard SQL. You can also join Iceberg sources with live streaming materialized views.

Q: How do I handle time zones in analytics queries? RisingWave stores timestamps as TIMESTAMPTZ (UTC). Use AT TIME ZONE to convert for display. For Iceberg tables, store timestamps in UTC and convert in the query layer.
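For instance, converting a UTC bucket for display (standard Postgres-style syntax, supported by RisingWave):

```sql
SELECT
    hour_bucket,                                          -- stored in UTC
    hour_bucket AT TIME ZONE 'America/New_York' AS hour_et
FROM hourly_analytics
LIMIT 10;
```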

Q: What's the typical end-to-end latency from event to queryable result in Iceberg? With RisingWave's default settings, events are visible in Iceberg within 10–60 seconds of arrival in Kafka. With tuned checkpoint intervals, this can be reduced to under 10 seconds.

Q: Can I run anomaly detection in real time on top of these metrics? Yes. Define a materialized view that computes standard deviation over a historical window, then filter for values beyond N standard deviations. RisingWave evaluates this continuously, triggering whenever anomalies appear.
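A minimal sketch of that pattern, assuming the realtime_5min_metrics view from earlier (the view names and the 3-sigma threshold are illustrative):

```sql
-- Per-country mean and standard deviation of 5-minute revenue
CREATE MATERIALIZED VIEW country_revenue_stats AS
SELECT
    geo_country,
    AVG(revenue_5min) AS mean_rev,
    STDDEV_SAMP(revenue_5min) AS stddev_rev
FROM realtime_5min_metrics
GROUP BY geo_country;

-- Windows whose revenue exceeds mean + 3 standard deviations
CREATE MATERIALIZED VIEW revenue_anomalies AS
SELECT m.geo_country, m.window_start, m.revenue_5min
FROM realtime_5min_metrics m
JOIN country_revenue_stats s ON m.geo_country = s.geo_country
WHERE m.revenue_5min > s.mean_rev + 3 * s.stddev_rev;
```

Both views update continuously, so new rows appear in revenue_anomalies within seconds of an anomalous window closing.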

Q: How many concurrent queries can Iceberg handle? Iceberg itself has no query engine — it's a storage format. The query engine (Athena, Trino, RisingWave) determines concurrency. Athena is serverless and scales to high concurrency, subject to per-account service quotas; Trino scales with cluster size.

Start Building Real-Time Analytics

Real-time analytics on Apache Iceberg with RisingWave combines the freshness of streaming with the openness and scale of a data lakehouse. Your dashboards get live data; your data lake stays in open format.

Follow the RisingWave getting started guide to set up your first real-time analytics pipeline. Join the RisingWave Slack to share your analytics use cases and get help from the community.
