In RisingWave v2.8+, you can query Apache Iceberg tables directly using streaming SQL by registering them as sources. This enables lakehouse-style queries — joining Iceberg tables with live Kafka streams, reading historical snapshots, and building materialized views on top of Iceberg data — all in standard SQL without moving data out of the lake.
Why Query Iceberg with Streaming SQL?
Most Iceberg query engines (Athena, Trino, Spark SQL) are designed for batch analytics: you submit a query, it scans files, returns results. This is excellent for ad-hoc analysis and large batch jobs, but it has limitations for operational use cases:
- No continuous queries — You can't say "alert me when this metric crosses a threshold"
- No streaming joins — You can't join live Kafka data with historical Iceberg tables
- No incremental computation — Each query rescans the full dataset
RisingWave brings streaming semantics to Iceberg queries. You can register an Iceberg table as a source, build a materialized view on it, and RisingWave will incrementally update query results as the Iceberg table receives new snapshots.
Registering Iceberg Tables as Sources
-- Register an existing Iceberg table as a RisingWave source
CREATE SOURCE orders_iceberg
WITH (
connector = 'iceberg',
catalog.type = 'rest',
catalog.uri = 'http://iceberg-catalog:8181',
warehouse.path = 's3://data-lake/warehouse',
s3.region = 'us-east-1',
database.name = 'commerce',
table.name = 'orders'
);
Once registered, orders_iceberg behaves like any other RisingWave source. You can query it directly, join it with other sources, or build materialized views on top of it.
-- Simple query of Iceberg table
SELECT
shipping_country,
COUNT(*) AS order_count,
SUM(total_amount) AS revenue
FROM orders_iceberg
WHERE created_at >= '2026-01-01'
GROUP BY shipping_country
ORDER BY revenue DESC
LIMIT 20;
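You can also build a materialized view directly on the Iceberg source, so the results stay up to date as new snapshots land. A minimal sketch reusing the columns from the query above (the view name is illustrative):

-- Incrementally maintained revenue rollup over the Iceberg source
CREATE MATERIALIZED VIEW revenue_by_country AS
SELECT
    shipping_country,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue
FROM orders_iceberg
WHERE created_at >= '2026-01-01'
GROUP BY shipping_country;

Unlike the ad-hoc query, this view is maintained incrementally: as new snapshots are committed to the Iceberg table, only the new data is processed, and repeated reads of the view never rescan the source.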
Time Travel Queries
Apache Iceberg's snapshot-based architecture enables time travel. Query the state of a table as it was at a specific snapshot or timestamp:
-- In Spark or Trino, time travel with Iceberg:
-- SELECT * FROM orders FOR SYSTEM_VERSION AS OF 123456789;
-- SELECT * FROM orders FOR TIMESTAMP AS OF '2026-01-15 00:00:00';
RisingWave reads from the table's current snapshot by default. For point-in-time analysis, you can look up historical snapshot IDs through your catalog's REST API and query those snapshots with a batch engine. For ongoing historical comparisons, use RisingWave's Iceberg sink to maintain a continuously updated materialized view, and query the historical Iceberg data separately.
Joining Live Streams with Historical Iceberg Data
This is one of the most powerful patterns enabled by RisingWave's Iceberg source support. You can enrich live streaming events with historical context stored in Iceberg:
-- Live source: current Kafka events
CREATE SOURCE live_events (
user_id BIGINT,
event_type VARCHAR,
page_url VARCHAR,
event_time TIMESTAMPTZ
)
WITH (
connector = 'kafka',
topic = 'live-events',
properties.bootstrap.server = 'kafka:9092',
scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
-- Historical source: user profiles stored in Iceberg
CREATE SOURCE user_profiles_iceberg
WITH (
connector = 'iceberg',
catalog.type = 'rest',
catalog.uri = 'http://iceberg-catalog:8181',
warehouse.path = 's3://data-lake/warehouse',
s3.region = 'us-east-1',
database.name = 'users',
table.name = 'user_profiles'
);
-- Join live events with historical user profiles
CREATE MATERIALIZED VIEW enriched_live_events AS
SELECT
e.user_id,
e.event_type,
e.page_url,
e.event_time,
p.user_segment,
p.account_age_days,
p.lifetime_value,
p.preferred_language
FROM live_events e
LEFT JOIN user_profiles_iceberg p ON e.user_id = p.user_id;
This materialized view is updated continuously as new events arrive in Kafka, using the user profile data from Iceberg for enrichment.
Incremental Reads from Iceberg
RisingWave can read Iceberg tables incrementally by monitoring new snapshots. When a new snapshot is committed to the Iceberg table (by RisingWave's own sink or by another engine like Spark), RisingWave reads only the new files — not the entire table.
This makes RisingWave an efficient consumer of Iceberg tables written by batch systems:
-- Iceberg table written by nightly Spark jobs
CREATE SOURCE batch_processed_features
WITH (
connector = 'iceberg',
catalog.type = 'rest',
catalog.uri = 'http://iceberg-catalog:8181',
warehouse.path = 's3://ml-lake/warehouse',
s3.region = 'us-east-1',
database.name = 'ml_features',
table.name = 'user_features'
);
-- Enriched real-time scoring input
CREATE MATERIALIZED VIEW scoring_input AS
SELECT
e.user_id,
e.session_id,
e.event_type,
f.feature_vector_json,
f.risk_score,
f.last_model_update
FROM live_events e
JOIN batch_processed_features f ON e.user_id = f.user_id;
Query Patterns for Iceberg Analytics
Pattern 1: Historical Period Aggregation
Use standard SQL aggregations to answer "what happened during a specific period?" questions:
SELECT
shipping_country,
COUNT(*) AS orders,
SUM(total_amount) AS revenue
FROM orders_iceberg
WHERE created_at BETWEEN '2026-03-01' AND '2026-03-31'
GROUP BY shipping_country;
Pattern 2: Incremental Aggregation with Tumble Windows
Build time-windowed aggregations on Iceberg sources:
CREATE MATERIALIZED VIEW iceberg_daily_revenue AS
SELECT
shipping_country,
window_start::DATE AS date,
COUNT(*) AS orders,
SUM(total_amount) AS revenue
FROM TUMBLE(orders_iceberg, created_at, INTERVAL '1 DAY')
GROUP BY shipping_country, window_start;
Pattern 3: Cross-Layer Lakehouse Analytics
Compare real-time streaming data with historical Iceberg data:
CREATE SOURCE historical_revenue_iceberg
WITH (
connector = 'iceberg',
catalog.type = 'rest',
catalog.uri = 'http://iceberg-catalog:8181',
warehouse.path = 's3://data-lake/warehouse',
s3.region = 'us-east-1',
database.name = 'analytics',
table.name = 'daily_revenue'
);
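The comparison query that follows reads from an hourly_analytics materialized view over a live stream, which is assumed to already exist. A minimal sketch of how such a view might be defined, using a hypothetical live orders topic (source name, topic, and columns are illustrative, not part of the original setup):

-- Assumed live orders stream feeding the hourly rollup
CREATE SOURCE live_orders (
    order_id BIGINT,
    shipping_country VARCHAR,
    total_amount DECIMAL,
    created_at TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'live-orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

-- Hourly revenue per country, updated continuously
CREATE MATERIALIZED VIEW hourly_analytics AS
SELECT
    shipping_country,
    window_start,
    SUM(total_amount) AS total_revenue
FROM TUMBLE(live_orders, created_at, INTERVAL '1 HOUR')
GROUP BY shipping_country, window_start;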
-- Compare today's revenue so far to the same day last week
SELECT
live.shipping_country,
live.total_revenue AS current_revenue,
hist.total_revenue AS last_week_revenue,
ROUND(100.0 * (live.total_revenue - hist.total_revenue) / NULLIF(hist.total_revenue, 0), 1) AS wow_growth_pct
FROM hourly_analytics live
JOIN historical_revenue_iceberg hist
ON live.shipping_country = hist.shipping_country
AND hist.date = CURRENT_DATE - INTERVAL '7 DAYS'
ORDER BY wow_growth_pct DESC;
Query Engine Comparison for Iceberg
| Query Engine | Streaming Joins | Incremental MV | Sub-second Latency | SQL Standard | Open Source |
| --- | --- | --- | --- | --- | --- |
| RisingWave | Yes | Yes | Yes (in-memory) | Yes | Yes |
| Apache Flink | Yes | Yes | Yes | Limited | Yes |
| Trino | No | No | No | Yes | Yes |
| Amazon Athena | No | No | No | Yes | No |
| Apache Spark | Limited (micro-batch) | No | No | Yes | Yes |
| DuckDB | No | No | Near (local) | Yes | Yes |
RisingWave is uniquely positioned as both a streaming SQL engine and an Iceberg query engine — supporting continuous materialized views on top of Iceberg sources.
Performance Tips for Iceberg Queries in RisingWave
- Predicate pushdown — Always include partition key filters in your WHERE clauses. RisingWave pushes predicates down to Iceberg's manifest scan, avoiding reading irrelevant files.
- Column projection — Only SELECT the columns you need. RisingWave projects columns at the Parquet file level, reducing I/O.
- Limit initial scan scope — When building a materialized view on a large Iceberg table, add a date filter to avoid scanning years of history on startup.
- Snapshot polling interval — RisingWave checks for new Iceberg snapshots periodically. For near-real-time freshness, configure a short polling interval.
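The first three tips can be combined in a single view definition. A hedged sketch over the ML features table from earlier (the cutoff date is illustrative, and this assumes last_model_update aligns with the table's partitioning):

-- Project only the needed columns and constrain the partition key
-- so the initial backfill doesn't scan years of history
CREATE MATERIALIZED VIEW recent_feature_snapshot AS
SELECT
    user_id,
    risk_score
FROM batch_processed_features
WHERE last_model_update >= '2026-01-01';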
Common Use Cases
| Use Case | Iceberg Source | Live Source | Pattern |
| --- | --- | --- | --- |
| Feature enrichment for ML | User features (Iceberg) | Live events (Kafka) | Join |
| Anomaly detection | Historical baselines (Iceberg) | Live metrics (Kafka) | Join + filter |
| Trend comparison | Historical data (Iceberg) | Current data (MV) | Join |
| Backfill processing | Historical raw events (Iceberg) | None | Source scan |
| Cross-system reporting | Multiple Iceberg tables | Kafka streams | Multi-join |
FAQ
Q: Can RisingWave write to and read from the same Iceberg table simultaneously? Yes. You can have a sink writing to an Iceberg table while a separate source reads from it. RisingWave reads from committed snapshots, so there's no interference between readers and writers.
Q: How does RisingWave detect new snapshots in an Iceberg table written by Spark? RisingWave polls the Iceberg catalog for new snapshots. When Spark commits a new snapshot, RisingWave detects it within the polling interval and reads the new data files.
Q: Are there any Iceberg features not supported by RisingWave's source connector? RisingWave reads standard Parquet-format Iceberg tables. Row-level delete files (positional deletes) are applied during read. ORC and Avro data files may have limited support depending on your RisingWave version.
Q: Can I use schema evolution with RisingWave's Iceberg source? Yes. When you evolve the Iceberg table schema (add columns), RisingWave will see the new columns on its next scan. Your downstream materialized views will need to be updated to reference the new columns.
Q: What catalogs does RisingWave's Iceberg source support? The Iceberg source connector supports REST catalogs (the recommended approach), AWS Glue, and Hive Metastore — the same catalogs supported by the Iceberg sink.
Start Querying Your Iceberg Lake
RisingWave's Iceberg source support transforms your data lake from a passive storage layer into an active component of your streaming analytics architecture. Historical data and live streams can be queried together, enriched together, and analyzed together.
Get started with the RisingWave documentation and explore the Iceberg source connector. Join the community on Slack to share what you're building.

