Apache Iceberg for AI/ML: How Real-Time Ingestion Changes Model Training Pipelines

Apache Iceberg solves the two hardest problems in ML data infrastructure: keeping training data current and making experiments reproducible. Pair it with RisingWave's streaming SQL engine and your feature tables update continuously from live event streams, no batch ETL required, while every Iceberg snapshot serves as a time-stamped, reusable training dataset.

Why Training Data Freshness Matters

Every ML model is a snapshot of the world at the moment its training data was collected. The moment that world changes, the model starts making predictions about a reality that no longer exists. Researchers call this model aging, and a 2022 peer-reviewed study in Scientific Reports ("Temporal quality degradation in AI models," Vela et al.) found degradation in 91 out of 128 model-dataset pairs tested, even in environments with minimal concept drift. The problem is not just data drift; it is the fundamental mismatch between when the world was observed and when the model is asked to predict.

In practice, stale training data produces silent failures. A fraud detection model trained on T-7 behavior patterns will not recognize this week's attack vector. A recommendation engine trained on last month's purchases cannot reflect the new product line launched on Tuesday. A credit-risk model trained on pre-recession data will misprice loans issued during a downturn.

The traditional response to staleness is more frequent batch ETL: run the pipeline nightly instead of weekly, or hourly instead of nightly. But batch ETL has a hard floor. Even an hourly job means your training features are always at least 60 minutes stale, plus pipeline latency, plus the time your training job took to complete. For behavior-dependent predictions, an hour of lag is often the difference between a relevant recommendation and a missed sale.

The Role of Apache Iceberg in ML Pipelines

Apache Iceberg provides the storage layer that makes a modern ML data platform tractable at scale. Its three most valuable properties for ML workloads are:

Schema evolution without pipeline rewrites. ML feature sets change constantly. New signals get added, old ones get deprecated, data types get refined. In traditional Hive-style tables, schema changes break downstream jobs. Iceberg handles ADD COLUMN, DROP COLUMN, and type promotions transactionally, so your training pipeline reads the schema it expects even while the table evolves underneath it.

Partition pruning and hidden partitioning. Training datasets can be enormous. Iceberg's hidden partitioning means you write WHERE feature_timestamp >= '2025-01-01' and Iceberg automatically prunes partitions, without your query needing to encode the physical partition scheme. This makes training data extraction fast without requiring ML engineers to understand storage layout.

Time travel for reproducible experiments. Every write to an Iceberg table creates a new snapshot with a snapshot ID and a timestamp. You can query any historical snapshot using standard SQL. This is the capability that makes Iceberg the right offline feature store format for ML: it ties your model artifact to an exact, reproducible slice of training data.

Iceberg Time-Travel Syntax

The official Apache Iceberg documentation defines time-travel queries using Spark SQL in two forms. To query by snapshot ID:

-- Query using a specific snapshot ID (exact Iceberg syntax)
SELECT * FROM prod.ml.user_features VERSION AS OF 'training_churn_model_v3';

-- Query using a snapshot ID number
SELECT * FROM prod.ml.user_features VERSION AS OF 8516748920627623855;

To query by timestamp:

-- Query the state of the table at a specific point in time
SELECT * FROM prod.ml.user_features
FOR SYSTEM_TIME AS OF '2025-11-15 00:00:00.000';

You can also tag a snapshot before a training run, which stores a named reference to a specific snapshot:

-- Tag the current snapshot before starting a training run
ALTER TABLE prod.ml.user_features
CREATE TAG training_churn_v4
RETAIN 730 DAYS;

-- Later, reproduce the exact training dataset
SELECT * FROM prod.ml.user_features VERSION AS OF 'training_churn_v4';

This VERSION AS OF syntax gives you the exact rows, with the exact schema, that existed when the tag was created. Not an approximation, not a backup: the actual data, queryable from PyIceberg, Spark, Trino, or any other Iceberg-compatible engine.

How RisingWave Keeps Iceberg Tables Fresh

RisingWave is a PostgreSQL-compatible streaming database that consumes events from Kafka, Kinesis, CDC feeds, and other sources, maintains the results as incrementally updated materialized views, and writes those results to downstream sinks including Apache Iceberg.

The key difference from a batch ETL approach is that the computation is incremental. When a new event arrives, RisingWave updates only the affected rows in the materialized view, not the entire table. The result propagates to the Iceberg sink within seconds, not hours.

flowchart LR
    A[Kafka Topics\nUser Events] --> B[RisingWave\nStreaming SQL]
    C[PostgreSQL CDC\nOrders / Profiles] --> B
    B --> D[Materialized Views\nML Features]
    D --> E[Apache Iceberg\nFeature Tables]
    E --> F[ML Training\nPython / Spark]
    E --> G[Online Serving\nFeature Lookups]

Defining the Feature Computation in RisingWave

First, create a table to receive events from your source. In production this would be a Kafka source using CREATE SOURCE, but the SQL logic is identical:

CREATE TABLE user_events (
    user_id        BIGINT,
    event_type     VARCHAR,
    item_id        BIGINT,
    session_duration_seconds INTEGER,
    ts             TIMESTAMPTZ
);

Next, define the feature computation as a materialized view. RisingWave maintains this view incrementally as new events arrive:

CREATE MATERIALIZED VIEW user_ml_features AS
SELECT
    user_id,
    COUNT(*)                                                      AS total_events,
    COUNT(*) FILTER (WHERE event_type = 'purchase')               AS purchase_count,
    COUNT(*) FILTER (WHERE event_type = 'view')                   AS view_count,
    ROUND(
        COUNT(*) FILTER (WHERE event_type = 'purchase')::DECIMAL /
        NULLIF(COUNT(*) FILTER (WHERE event_type = 'view'), 0) * 100,
        2
    )                                                             AS conversion_rate_pct,
    AVG(session_duration_seconds) FILTER (WHERE event_type = 'view') AS avg_session_duration,
    COUNT(DISTINCT item_id)                                       AS unique_items_interacted,
    MAX(ts)                                                       AS last_activity
FROM user_events
GROUP BY user_id;

Sample output from a running instance:

 user_id | total_events | purchase_count | view_count | conversion_rate_pct | avg_session_duration | unique_items_interacted
---------+--------------+----------------+------------+---------------------+----------------------+------------------------
       1 |            2 |              1 |          1 |                 100 |                   45 |                       1
       2 |            2 |              0 |          2 |                   0 |                   75 |                       2
       3 |            1 |              1 |          0 |                     |                      |                       1

Sinking the Materialized View to Iceberg

Once the materialized view is defined, create a sink to write the results continuously to an Iceberg table:

CREATE SINK user_ml_features_iceberg
FROM user_ml_features
WITH (
    connector                   = 'iceberg',
    type                        = 'upsert',
    primary_key                 = 'user_id',
    warehouse.path              = 's3://your-bucket/warehouse',
    database.name               = 'ml',
    table.name                  = 'user_ml_features',
    catalog.type                = 'glue',
    s3.region                   = 'us-east-1',
    create_table_if_not_exists  = 'true',
    commit_checkpoint_interval  = '60'
);

With type = 'upsert' and primary_key = 'user_id', RisingWave applies updates and deletes to the Iceberg table rather than appending new rows. The commit_checkpoint_interval controls how often new Iceberg snapshots are created, defaulting to 60 seconds. Each checkpoint becomes a queryable snapshot for time travel.

For details on all connector parameters, see the RisingWave Iceberg sink documentation.

Reproducible Experiments with Iceberg Time-Travel

The combination of RisingWave's continuous writes and Iceberg's snapshot model gives you a built-in reproducibility layer that requires no extra infrastructure.

Every 60 seconds (or at whatever interval you configure), RisingWave commits a new Iceberg snapshot. That snapshot captures the exact state of your feature table at that moment. When you start a training run, record the snapshot ID or create a named tag:

-- Tag the snapshot at the start of each training run
ALTER TABLE prod.ml.user_ml_features
CREATE TAG training_v1_2025_11_15
RETAIN 365 DAYS;

Store training_v1_2025_11_15 in your model registry alongside the model artifact. When you need to reproduce that training run, or audit why a model made a particular decision, query the exact data it was trained on:

-- Reproduce the training dataset exactly
SELECT * FROM prod.ml.user_ml_features
VERSION AS OF 'training_v1_2025_11_15';

This eliminates the most common cause of irreproducible ML experiments: the training data is no longer available because the pipeline overwrote it. With Iceberg, your feature tables retain their full history for as long as you configure retention.

You can also audit data at an arbitrary past timestamp without needing a named tag:

-- What did the feature table look like two weeks ago?
SELECT * FROM prod.ml.user_ml_features
FOR SYSTEM_TIME AS OF '2025-11-01 00:00:00.000';

This is particularly useful for debugging model regressions. If a model's performance dropped on November 10th, compare its training data snapshot to the current state of the feature table to understand what changed.

Putting It Together: An End-to-End ML Training Pipeline

Here is what the full pipeline looks like in practice:

Step 1: Events flow into RisingWave. User interactions, transactions, and behavioral signals arrive from Kafka topics or CDC feeds. RisingWave ingests them with sub-second latency.

Step 2: RisingWave maintains feature views. Materialized views compute per-user, per-item, and aggregate features incrementally. The view definitions live in SQL, are version-controlled, and are easy to modify as the feature set evolves.

Step 3: RisingWave sinks to Iceberg every 60 seconds. Each checkpoint creates a new Iceberg snapshot. The feature table is always at most 60 seconds stale, compared to hours with nightly batch ETL.

Step 4: Training jobs read from Iceberg. Python training jobs using PyIceberg or Spark query the feature table. They can read the latest snapshot for training on current data, or a specific tagged snapshot for reproducibility.

Step 5: Snapshot IDs are recorded with model artifacts. Your MLflow or model registry entry includes the Iceberg snapshot tag used for training, making every model auditable and reproducible.

-- Efficient training data extraction: read only the last 30 days
-- Iceberg's hidden partitioning prunes to the relevant files automatically
SELECT
    user_id,
    total_events,
    purchase_count,
    view_count,
    conversion_rate_pct,
    avg_session_duration,
    unique_items_interacted
FROM prod.ml.user_ml_features
FOR SYSTEM_TIME AS OF '2025-11-15 00:00:00'
WHERE last_activity >= CURRENT_TIMESTAMP - INTERVAL '30 days';

This query reads only the partitions that contain data from the past 30 days, making large-scale feature extraction orders of magnitude faster than a full table scan.

Internal Links

For a deeper look at the streaming SQL patterns used for feature engineering, see Real-Time Feature Engineering for Machine Learning Pipelines and Real-Time Feature Store in 2026: Beyond Batch ML Pipelines.

For the technical details of writing streaming data to Iceberg, see Streaming Writes to Apache Iceberg with RisingWave.

Key Takeaways

Model aging is real. A peer-reviewed study (Vela et al., 2022) found quality degradation in 91% of model-dataset pairs tested. Fresher training data directly reduces this risk.
Iceberg provides the storage primitives ML needs. Schema evolution, time travel, and efficient partition pruning make it the right offline feature store format.
RisingWave eliminates the batch ETL floor. Instead of hourly or nightly jobs, feature tables update within seconds of source events.
Snapshot-based reproducibility is automatic. Every RisingWave checkpoint creates an Iceberg snapshot. Tag it before training, and your experiment is reproducible indefinitely.
The stack is SQL-first. Feature definitions, sink configuration, and time-travel queries are all standard SQL, accessible to anyone who can write a SELECT statement.

FAQ

Can I use RisingWave with an existing Iceberg catalog like AWS Glue or Hive Metastore?

Yes. The RisingWave Iceberg sink supports catalog.type = 'glue' for AWS Glue, catalog.type = 'hive' for Hive Metastore, and catalog.type = 'rest' for REST catalogs including Polaris and Nessie. The feature table writes to whatever catalog your training infrastructure already uses.

How does Iceberg time travel interact with schema evolution?

Iceberg tracks schema changes as part of its snapshot metadata. When you query a historical snapshot using VERSION AS OF or FOR SYSTEM_TIME AS OF, the query returns the schema that existed at that snapshot, not the current schema. This means you can reproduce training runs even after adding or removing feature columns.

What is the minimum latency from a source event to an updated Iceberg snapshot?

With RisingWave's streaming ingestion and a commit_checkpoint_interval of 60 seconds, the end-to-end latency from event to queryable Iceberg snapshot is typically under 90 seconds. The feature view itself updates within milliseconds of the event arriving; the latency is dominated by the Iceberg commit interval, which is configurable down to a few seconds for freshness-critical use cases.

Building a real-time ML training pipeline with RisingWave and Apache Iceberg takes a few hours to configure and eliminates days of batch ETL maintenance. Get started with RisingWave for free, or explore the Apache Iceberg sink documentation to connect your existing Iceberg tables.