Introduction
You ran a dashboard query yesterday and got one number. Today, the same query returns something different. What changed? When did it change? Without a way to look back in time, answering these questions means digging through logs, re-running batch jobs, or simply guessing.
Apache Iceberg solves this with time travel: the ability to query your data as it existed at any previous point in time. Every write to an Iceberg table creates an immutable snapshot, and each snapshot captures the full state of the table at that moment. You can query any of these snapshots using standard SQL, which makes debugging, auditing, and reproducibility straightforward.
But here is the catch. Most Iceberg pipelines rely on batch jobs that run every few hours or once a day. That means your snapshots are spaced hours apart, and you cannot time-travel to a moment between two batch runs. This is where streaming changes the game. By writing to Iceberg continuously with a streaming SQL engine like RisingWave, you get frequent snapshots that capture granular state changes, giving you a detailed historical record you can query at any time.
What Is Iceberg Time Travel and Why Does It Matter?
Iceberg time travel is the ability to query a table as it existed at a specific point in the past. Unlike traditional databases that overwrite data in place, Iceberg maintains a chain of immutable snapshots. Each snapshot points to a set of data files and manifest files that describe the exact contents of the table when that snapshot was created.
This design gives you several practical capabilities:
- Debugging data issues: Query what the data looked like before and after a suspected bug was introduced
- Audit compliance: Prove what data existed at the time a regulatory report was generated
- Reproducible analytics: Re-run a machine learning training job with the exact same dataset used last week
- Safe rollbacks: If a bad write corrupts your table, roll back to the last known good snapshot (see the sketch just below)
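That last capability maps directly onto an Iceberg maintenance procedure. As a hedged Spark SQL sketch (the catalog name, table name, and snapshot ID are placeholders; real IDs come from the table's snapshots metadata table, shown later in this post):
-- Roll the table back to a known-good snapshot
CALL catalog.system.rollback_to_snapshot(
table => 'ecommerce.enriched_orders',
snapshot_id => 4512389076234
);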
The key insight is that the value of time travel depends directly on how often you create snapshots. Batch pipelines that run daily give you daily granularity. Streaming pipelines that write continuously give you snapshots every minute or even every few seconds.
How Do Iceberg Snapshots Work Under the Hood?
To understand time travel, you need to understand how Iceberg manages table state. An Iceberg table consists of three layers:
The metadata layer
At the top is a metadata file (JSON) that tracks the current schema, partition spec, sort order, and a pointer to the current snapshot. Every time the table changes, a new metadata file is created.
The snapshot layer
Each snapshot has a unique snapshot ID, a timestamp, and a pointer to a manifest list. The manifest list points to one or more manifest files, and each manifest file tracks a set of data files along with column-level statistics like min/max values and null counts.
The data layer
The actual data lives in Parquet, ORC, or Avro files stored in your object storage (S3, GCS, Azure Blob). These files are never modified in place. New writes create new files, and deletes create delete files that mark records for removal.
When you execute a time travel query, the query engine:
- Looks up the snapshot for your requested timestamp or snapshot ID
- Reads the manifest list for that snapshot
- Uses manifest-level and file-level statistics to prune irrelevant files
- Reads only the data files needed to answer your query
Because old snapshots reference old data files that are never modified, time travel queries return exactly the data that existed at that point in time.
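You can inspect each of these layers directly: Iceberg exposes them as metadata tables that engines like Spark can query. A hedged sketch, using the enriched_orders table that is built later in this post:
-- Snapshot layer: one row per immutable snapshot
SELECT snapshot_id, committed_at, manifest_list
FROM ecommerce.enriched_orders.snapshots;
-- Manifest layer: data file counts tracked per manifest
SELECT path, added_data_files_count, existing_data_files_count
FROM ecommerce.enriched_orders.manifests;
-- Data layer: per-file statistics the engine uses for pruning
SELECT file_path, record_count, file_size_in_bytes
FROM ecommerce.enriched_orders.files;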
Why Does Batch Ingestion Limit Time Travel Granularity?
With batch pipelines, data arrives in large chunks at scheduled intervals. A typical setup might look like this:
- An Airflow DAG runs every 4 hours
- It reads new records from a source system
- It writes them to an Iceberg table in a single commit
This creates one snapshot every 4 hours. If something went wrong at 2:15 PM, and your last batch ran at noon, you cannot see what the data looked like at 2:15 PM. You can only see noon and 4 PM.
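To make "a single commit" concrete, the write step in such a DAG is typically one INSERT per run, along these lines (a hedged Spark SQL sketch; the table names are hypothetical):
-- One batch run = one commit = one snapshot for the whole 4-hour window
INSERT INTO ecommerce.orders
SELECT * FROM staging_orders_batch; -- everything accumulated since the previous run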
There are also operational limitations with batch pipelines:
- Data freshness: The lakehouse is always hours behind the source system
- Coarse commits: Each batch writes thousands or millions of records in one transaction, so a single snapshot lumps together everything that happened during the batch window
- Recovery gaps: If you need to roll back, you lose all data since the last batch run
How Does Streaming SQL Create Fine-Grained Snapshots?
A streaming SQL engine like RisingWave writes to Iceberg continuously rather than in scheduled batches. Here is how the flow works:
- RisingWave ingests data from sources like Kafka, PostgreSQL CDC, or MySQL CDC
- It processes the data using standard SQL (joins, aggregations, filters, transformations)
- It commits results to Iceberg at regular intervals (configurable, often every 60 seconds)
Each commit creates a new Iceberg snapshot. This means you get snapshot granularity measured in seconds or minutes, not hours.
Setting up a streaming sink to Iceberg
Here is a concrete example. Suppose you have an orders stream coming from Kafka and you want to maintain a continuously updated Iceberg table:
-- Create a source to ingest from Kafka
CREATE SOURCE orders_stream (
order_id BIGINT,
customer_id BIGINT,
product_id INT,
quantity INT,
total_amount DECIMAL,
order_status VARCHAR,
created_at TIMESTAMPTZ
) WITH (
connector = 'kafka',
topic = 'ecommerce.orders',
properties.bootstrap.server = 'broker:9092',
scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
-- Create a materialized view that enriches orders
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT
o.order_id,
o.customer_id,
o.product_id,
o.quantity,
o.total_amount,
o.order_status,
o.created_at,
CASE
WHEN o.total_amount > 500 THEN 'high_value'
WHEN o.total_amount > 100 THEN 'medium_value'
ELSE 'standard'
END AS order_tier
FROM orders_stream o;
-- Sink enriched orders to Iceberg continuously
CREATE SINK orders_to_iceberg FROM enriched_orders
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'order_id',
warehouse.path = 's3://my-lakehouse/warehouse',
database.name = 'ecommerce',
table.name = 'enriched_orders',
catalog.type = 'rest',
catalog.uri = 'http://rest-catalog:8181', -- placeholder REST catalog endpoint
catalog.name = 'lakehouse_catalog',
create_table_if_not_exists = 'true',
commit_checkpoint_interval = '60',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
s3.region = 'us-west-2'
);
With commit_checkpoint_interval set to 60, RisingWave commits a new snapshot to the Iceberg table every 60 checkpoints, which works out to roughly once a minute at the default one-second checkpoint interval. Over the course of a day, that gives you about 1,440 snapshots instead of the 4-6 you would get from a batch pipeline.
Adjusting commit frequency
You can tune the commit interval based on your needs:
- Every 10 seconds: Maximum time travel granularity, but creates many small files (requires regular compaction)
- Every 60 seconds: Good balance between granularity and file management
- Every 300 seconds: Fewer snapshots but still far more granular than hourly batch jobs
RisingWave also supports commit_checkpoint_size_threshold_mb to trigger early commits when data volume is high, ensuring large bursts of data create snapshots promptly.
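Whichever interval you choose, you can verify the actual cadence from the snapshot history itself. A hedged sketch in Spark SQL (3.3 or later, where subtracting two timestamps yields an interval):
-- Measure the gap between consecutive commits
SELECT
committed_at,
committed_at - LAG(committed_at) OVER (ORDER BY committed_at) AS gap_since_previous
FROM ecommerce.enriched_orders.snapshots
ORDER BY committed_at DESC
LIMIT 10;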
How Do You Query Historical State with Time Travel SQL?
Once your streaming pipeline creates frequent snapshots, you can query any of them using time travel syntax. The exact SQL depends on which query engine you use to read the Iceberg table.
Querying with Spark SQL
-- Query data as it existed at a specific timestamp
SELECT order_id, customer_id, total_amount, order_status
FROM ecommerce.enriched_orders
TIMESTAMP AS OF '2026-03-28 14:30:00'
WHERE order_status = 'completed';
-- Query a specific snapshot by ID
SELECT COUNT(*) AS total_orders, SUM(total_amount) AS revenue
FROM ecommerce.enriched_orders
VERSION AS OF 4512389076234;
Querying with Trino
-- Trino uses FOR TIMESTAMP AS OF and expects a timestamp with time zone
SELECT order_id, customer_id, total_amount
FROM ecommerce.enriched_orders
FOR TIMESTAMP AS OF TIMESTAMP '2026-03-28 14:30:00 UTC';
-- Query by snapshot ID in Trino
SELECT *
FROM ecommerce.enriched_orders
FOR VERSION AS OF 4512389076234;
Querying with DuckDB
-- DuckDB reads Iceberg tables through the iceberg extension
SELECT order_id, total_amount, order_tier
FROM iceberg_scan('s3://my-lakehouse/warehouse/ecommerce/enriched_orders',
snapshot_from_timestamp = TIMESTAMP '2026-03-28 14:30:00');
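Because DuckDB needs a concrete timestamp or snapshot ID to travel to, the iceberg extension's iceberg_snapshots table function is handy for finding one (a hedged sketch; column names follow the extension's documented output):
-- List available snapshots so you can pick one to query
SELECT snapshot_id, timestamp_ms
FROM iceberg_snapshots('s3://my-lakehouse/warehouse/ecommerce/enriched_orders');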
Comparing data across time points
One of the most powerful time travel patterns is comparing data between two points in time. This helps you identify exactly what changed:
-- Find orders whose status changed between two snapshots
WITH before AS (
SELECT order_id, order_status
FROM ecommerce.enriched_orders
TIMESTAMP AS OF '2026-03-28 12:00:00'
),
after AS (
SELECT order_id, order_status
FROM ecommerce.enriched_orders
TIMESTAMP AS OF '2026-03-28 14:00:00'
)
SELECT
a.order_id,
b.order_status AS status_before,
a.order_status AS status_after
FROM after a
JOIN before b ON a.order_id = b.order_id
WHERE a.order_status != b.order_status;
Inspecting snapshot history
You can view all available snapshots using Iceberg's metadata tables:
-- List all snapshots for a table (Spark SQL)
SELECT
snapshot_id,
committed_at,
operation,
summary
FROM ecommerce.enriched_orders.snapshots
ORDER BY committed_at DESC
LIMIT 20;
With streaming ingestion creating snapshots every minute, this query returns a detailed timeline of your table's evolution.
How Do You Build a Streaming Time Travel Pipeline End to End?
Here is a complete pipeline that demonstrates the full flow: ingesting CDC data from PostgreSQL, transforming it with SQL, sinking it to Iceberg, and querying historical state.
Step 1: Ingest CDC data from PostgreSQL
-- Create a CDC source from PostgreSQL
CREATE SOURCE pg_ecommerce WITH (
connector = 'postgres-cdc',
hostname = 'prod-db.example.com',
port = '5432',
username = 'cdc_reader',
password = '${PG_PASSWORD}',
database.name = 'ecommerce',
schema.name = 'public',
slot.name = 'rw_lakehouse_slot'
);
-- Create tables that mirror upstream PostgreSQL tables
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY,
customer_id BIGINT,
total_amount DECIMAL,
order_status VARCHAR,
created_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ
) FROM pg_ecommerce TABLE 'public.orders';
CREATE TABLE customers (
customer_id BIGINT PRIMARY KEY,
name VARCHAR,
email VARCHAR,
segment VARCHAR,
created_at TIMESTAMPTZ
) FROM pg_ecommerce TABLE 'public.customers';
Step 2: Create a materialized view with transformations
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT
c.customer_id,
c.name,
c.segment,
COUNT(o.order_id) AS total_orders,
SUM(o.total_amount) AS total_spent,
MAX(o.created_at) AS last_order_at,
CASE
WHEN SUM(o.total_amount) > 10000 THEN 'vip'
WHEN SUM(o.total_amount) > 1000 THEN 'regular'
ELSE 'new'
END AS customer_tier
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.segment;
Step 3: Sink to Iceberg
CREATE SINK customer_summary_iceberg FROM customer_order_summary
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'customer_id',
warehouse.path = 's3://analytics-lake/warehouse',
database.name = 'analytics',
table.name = 'customer_order_summary',
catalog.type = 'rest',
catalog.uri = 'http://rest-catalog:8181', -- placeholder REST catalog endpoint
catalog.name = 'analytics_catalog',
create_table_if_not_exists = 'true',
commit_checkpoint_interval = '60',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
s3.region = 'us-east-1'
);
Step 4: Query historical customer data
Now analysts can answer questions like "What tier was customer X in last Tuesday at 3 PM?" using standard time travel queries:
-- Check a customer's tier at a specific point in time
SELECT customer_id, name, customer_tier, total_spent
FROM analytics.customer_order_summary
TIMESTAMP AS OF '2026-03-24 15:00:00'
WHERE customer_id = 42;
This entire pipeline runs continuously. Every database change in PostgreSQL flows through RisingWave, gets transformed, and appears in Iceberg within about a minute, creating a new queryable snapshot each time.
What About Snapshot Lifecycle Management?
Frequent snapshots are powerful, but they accumulate over time. A pipeline creating snapshots every minute generates over 500,000 snapshots per year. You need a strategy for managing them.
Snapshot expiration
Run Iceberg's expire_snapshots maintenance procedure on a schedule to prune snapshots you no longer need:
-- In Spark SQL, expire snapshots older than 7 days
CALL catalog.system.expire_snapshots(
table => 'analytics.customer_order_summary',
older_than => TIMESTAMP '2026-03-22 00:00:00',
retain_last => 100
);
Compaction
Streaming writes create many small files. Regular compaction merges them into larger, more efficient files:
-- Rewrite small data files into optimally sized files
CALL catalog.system.rewrite_data_files(
table => 'analytics.customer_order_summary',
strategy => 'sort',
sort_order => 'customer_id ASC'
);
RisingWave's managed lakehouse offering, Open Lake, handles compaction automatically so you do not need to manage it manually.
Tagging important snapshots
Before expiring old snapshots, tag the ones you need to keep for compliance or reproducibility:
-- Tag a snapshot for quarterly audit
ALTER TABLE analytics.customer_order_summary
CREATE TAG `q1-2026-audit` AS OF VERSION 8923456789;
Tagged snapshots survive expiration policies, ensuring you always have access to critical historical states.
FAQ
What is Iceberg time travel?
Iceberg time travel is the ability to query an Apache Iceberg table as it existed at a previous point in time. It works by maintaining immutable snapshots of the table state. Each write operation creates a new snapshot with a unique ID and timestamp, and query engines can read any historical snapshot using SQL clauses like TIMESTAMP AS OF or VERSION AS OF.
How does streaming improve time travel compared to batch processing?
Streaming creates snapshots far more frequently than batch processing. A batch pipeline that runs every 4 hours creates 6 snapshots per day, while a streaming pipeline committing every 60 seconds creates 1,440 snapshots per day. This gives you much finer-grained historical visibility, letting you query data as it existed at nearly any moment rather than only at batch boundaries.
Do time travel queries affect the performance of my streaming pipeline?
No. Time travel queries read historical snapshots, which are stored as immutable files in object storage. These read operations are completely independent of the streaming write path. Your streaming pipeline continues committing new data while analysts query historical snapshots concurrently, with no interference between the two workloads.
How long can I keep historical snapshots for time travel?
You can keep snapshots as long as your storage budget allows. Most teams configure snapshot expiration policies that retain between 7 and 90 days of history, balancing storage costs against time travel needs. For compliance or auditing purposes, you can tag specific snapshots to preserve them indefinitely even after general expiration runs.
Conclusion
Iceberg time travel is only as useful as the granularity of your snapshots. Batch pipelines leave you with coarse-grained snapshots spaced hours apart, while streaming SQL pipelines give you minute-level or even second-level granularity.
Key takeaways:
- Iceberg time travel queries work by reading immutable historical snapshots
- Streaming ingestion with RisingWave creates frequent snapshots automatically, giving you fine-grained time travel
- You can query any historical snapshot using standard SQL in Spark, Trino, DuckDB, or other engines
- Combine CDC ingestion, SQL transformations, and Iceberg sinking in a single streaming pipeline
- Use snapshot expiration, compaction, and tagging to manage snapshot lifecycle at scale
Ready to build a streaming lakehouse with fine-grained time travel? Try RisingWave Cloud free, no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

