Feature Backfilling: How to Generate Historical Features for Model Retraining

Feature backfilling is one of the most unglamorous problems in ML engineering — and one of the most consequential. When a model underperforms, the instinct is to improve the model. But more often, the fix requires better training data. And getting better training data means regenerating historical features you didn't compute at the time. That's backfilling.

This article covers the core problem, the pitfalls of the naive approach, and how a streaming database that retains raw event history changes the math.

The Backfill Problem: Three Scenarios Where You Need Historical Features

Feature backfilling is necessary in three recurring situations. Each is common enough that if you're running any kind of ML system in production, you've likely hit at least one of them.

Scenario 1: Launching a new feature

Your fraud detection model has been running for a year using transaction count and average spend as features. You hypothesize that the ratio of failed-to-successful transactions over a 7-day window would be a strong signal. The problem: you have no historical values for this feature. Your pipeline never computed it. If you want to train a new model version that uses this feature, you need to reconstruct it for every user, for every day, going back at least 90 days.
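As a sketch, the hypothesized feature might be expressed as a sliding-window aggregation like the following. The transactions table, its status column, and the HOP windowing are illustrative assumptions, not an actual schema:

-- Sketch: 7-day failed-to-successful transaction ratio, recomputed daily.
-- transactions(user_id, status, txn_time) is a hypothetical schema;
-- HOP produces sliding windows (7-day width, advancing 1 day at a time).
SELECT
    user_id,
    window_end AS feature_date,
    COUNT(*) FILTER (WHERE status = 'failed')::float
        / NULLIF(COUNT(*) FILTER (WHERE status = 'success'), 0)
        AS failed_to_success_ratio_7d
FROM HOP(transactions, txn_time, INTERVAL '1 day', INTERVAL '7 days')
GROUP BY user_id, window_end;

The NULLIF guard avoids division by zero for users with no successful transactions in the window — exactly the kind of edge case a backfill must handle identically to the live pipeline.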

Scenario 2: Fixing a feature computation bug

You discover that your user_lifetime_value feature had an off-by-one error in the time window — it was computing a 29-day window instead of 30 days because of a timezone boundary bug. The model has been trained on incorrect values for months. You fix the bug, but now your live feature pipeline computes values that differ from the historical training data. You need to regenerate historical features with the corrected logic and retrain.
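A minimal sketch of the kind of fix involved, assuming a hypothetical events table with UTC timestamps — the buggy version truncated the day boundary in a local timezone, effectively shortening the window at the edge:

-- Buggy: truncating in local time shifts the window boundary,
-- yielding an effective 29-day window for events near midnight UTC.
-- WHERE event_time >= date_trunc('day', now() AT TIME ZONE 'America/New_York') - INTERVAL '30 days'

-- Fixed: compute the 30-day boundary consistently in UTC.
-- events(user_id, amount, event_time) is an illustrative schema.
SELECT
    user_id,
    SUM(amount) AS user_lifetime_value
FROM events
WHERE event_time >= date_trunc('day', now() AT TIME ZONE 'UTC') - INTERVAL '30 days'
GROUP BY user_id;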

Scenario 3: Improving feature granularity

Your recommendation model was trained on daily-aggregated features. After analysis, you believe hourly aggregations would better capture recency effects. Rebuilding the historical dataset at hourly granularity requires reprocessing months of raw events — a fundamentally different computation than what your existing pipeline produced.
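Switching the grain is typically a one-line change in the aggregation key — but it still requires reprocessing raw events, because hourly values cannot be derived from daily rollups. A sketch, reusing the raw_events schema from the backfill examples in this article:

-- Same feature logic, recomputed at hourly instead of daily grain.
SELECT
    user_id,
    date_trunc('hour', event_time) AS feature_hour,
    COUNT(*) AS hourly_active_events,
    SUM(amount) AS hourly_spend
FROM raw_events
GROUP BY user_id, date_trunc('hour', event_time);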

In all three cases, you're asking the same question: given raw event history, can I reconstruct what a feature value would have been at a specific past point in time, using the logic I want to use today?

Why Backfill Consistency Matters: Training-Serving Skew

Before discussing solutions, it's worth being precise about why consistency between your backfill and your live pipeline matters. This is the training-serving skew problem, and it's more damaging than most teams expect.

When your model is trained on features computed one way and served features computed a slightly different way, the model is implicitly optimized for a distribution that doesn't match what it sees in production. The effect ranges from subtle degradation to catastrophic failure depending on how different the feature distributions are.

The dangerous version is when the inconsistency is invisible. If you backfill features using a batch Spark job and serve features from a streaming pipeline, both might look correct in isolation. But differences in handling late events, timezone offsets, null semantics, or aggregation windows can produce systematically different values. Your model trains on one distribution and is evaluated on another.

The standard ML practitioner advice is to use the same feature computation code for training and serving. In practice, this is hard. Your online feature store computes things one way. Your offline pipeline computes things another way. The gap between them is a source of persistent, hard-to-debug model underperformance.

A streaming database sidesteps this by making the same SQL query the source of truth for both historical backfill and live serving. There's no "online" and "offline" pipeline — just one query evaluated over different time windows.

The Naive Approach: Batch ETL Backfill and Its Drawbacks

The most common backfill approach is to write a batch job — usually Spark or SQL on a data warehouse — that scans the raw event history and recomputes the feature values.

The general pattern looks like this:

-- Typical data warehouse backfill query
INSERT INTO user_features_historical
SELECT
    user_id,
    DATE(event_time) AS feature_date,
    COUNT(*) AS daily_active_events,
    SUM(amount) AS daily_spend,
    COUNT(DISTINCT session_id) AS daily_sessions
FROM events_raw_archive
-- Half-open range: BETWEEN on a timestamp column would cut off
-- everything after midnight on the last day
WHERE event_time >= '2024-01-01' AND event_time < '2024-04-01'
GROUP BY user_id, DATE(event_time);

This works, but it has several problems that compound at scale.

It's slow. A 90-day backfill over a high-volume events table means scanning billions of rows. Even with partitioning and columnar formats, a full historical backfill for a large user base can take hours or days. During that time, you can't iterate quickly on the feature definition.

It's expensive. Full table scans in data warehouses are billed per byte scanned. A 90-day backfill might cost hundreds or thousands of dollars in query costs alone. If you need to iterate on the feature definition — which you will — you pay that cost multiple times.

It drifts from your live pipeline. The most serious problem: your batch backfill almost certainly doesn't compute exactly the same values as your streaming pipeline. Differences in how each system handles late-arriving events, NULL values, timestamp precision, or window boundaries mean the training distribution diverges from serving. You've introduced training-serving skew as a structural property of your architecture.

It's hard to do point-in-time correctly. For temporal join correctness — ensuring that each training example only uses feature values available at the time the label was generated — you need point-in-time snapshots, not just aggregates. Getting this right in a batch SQL job requires careful self-joins or window functions that are easy to get wrong.

None of these problems are fatal on their own. Many teams ship production ML systems with batch backfill pipelines. But each problem adds friction, increases cost, and introduces subtle correctness risks.

The Streaming Database Approach: Raw Events + Same SQL

A streaming database like RisingWave maintains materialized views over append-only event streams. The key property for backfilling: if you configure it to retain raw events rather than discarding them after ingestion, you can rerun the same query over any historical time window.

This has a different shape than a data warehouse. In a warehouse, raw events might live in cold storage and require expensive full-scan queries. In a streaming database, raw events live in the same system that powers your live feature pipeline — and the SQL dialect, timestamp semantics, NULL handling, and window semantics are identical.

The workflow looks like this:

  1. Your live feature pipeline is a materialized view (MV) defined in SQL over a stream of raw events.
  2. Raw events are retained for a configurable retention window (e.g., 30, 90, or 365 days).
  3. When you need a backfill, you create a separate, time-bounded MV using the same SQL logic, with an explicit WHERE event_time BETWEEN ... clause.
  4. The backfill MV materializes over the retained event history.
  5. You export the result to your training data store (S3, a feature store, etc.) and drop the backfill MV.

The critical guarantee: because the same SQL engine processes both the live stream and the historical backfill, the computation is byte-for-byte consistent. No pipeline drift. No reimplementation in a different system. No hidden semantic differences.

Practical Backfill Workflow in RisingWave

Here is a concrete backfill workflow. The scenario: you have a live user_features materialized view and you need to backfill 90 days of historical features to train a new model version.

First, understand your existing live feature pipeline:

-- Your existing live feature MV
CREATE MATERIALIZED VIEW user_features AS
SELECT
    user_id,
    date_trunc('day', event_time) AS feature_date,
    COUNT(*) AS daily_active_events,
    SUM(amount) AS daily_spend,
    COUNT(DISTINCT session_id) AS daily_sessions
FROM raw_events
GROUP BY user_id, date_trunc('day', event_time);

Now create a time-bounded backfill MV using the same logic:

-- Step 1: Create a one-time backfill MV with explicit time bounds
CREATE MATERIALIZED VIEW user_features_backfill_2024q1 AS
SELECT
    user_id,
    date_trunc('day', event_time) AS feature_date,
    COUNT(*) AS daily_active_events,
    SUM(amount) AS daily_spend,
    COUNT(DISTINCT session_id) AS daily_sessions
FROM raw_events
-- Half-open range: BETWEEN on a timestamp column would cut off
-- everything after midnight on the last day
WHERE event_time >= '2024-01-01' AND event_time < '2024-04-01'
GROUP BY user_id, date_trunc('day', event_time);

-- Step 2: Export to S3 for training
CREATE SINK backfill_to_s3
FROM user_features_backfill_2024q1
WITH (
    connector = 's3',
    s3.region = 'us-east-1',
    s3.bucket_name = 'ml-training-data',
    s3.path = 'features/user_features/backfill_2024q1/',
    s3.file_type = 'parquet'
);

After the sink completes, clean up the backfill resources:

-- Step 3: Drop the sink and backfill MV after export completes
DROP SINK backfill_to_s3;
DROP MATERIALIZED VIEW user_features_backfill_2024q1;

The backfill MV is ephemeral — it exists only long enough to compute and export the historical features. Because it uses the exact same SQL as your live pipeline (differing only in the time bounds), there is no opportunity for semantic drift between training and serving features.

Time-Windowed Backfill: Point-in-Time Snapshots for Training

The previous example generates daily aggregates, which is sufficient for many use cases. But temporal correctness for model training requires point-in-time snapshots — for each training example, you need the feature values that were available at the time the label was generated, not aggregates over the entire historical period.

For example, if you're training a churn prediction model, each training example has a label (churned or not) attached to a specific date. The features for that example should be the state of the world on that date — not features computed using information that wasn't yet available.

Here's how to generate point-in-time correct features in RisingWave:

-- Generate point-in-time feature snapshots for a training set
-- training_labels contains (user_id, label_date, churned)
CREATE MATERIALIZED VIEW training_features_pit AS
SELECT
    t.user_id,
    t.label_date,
    t.churned,
    COALESCE(f.daily_active_events, 0) AS events_on_label_date,
    -- 7-day trailing window available at label_date
    COALESCE(SUM(f_trail.daily_active_events), 0) AS events_7d,
    COALESCE(SUM(f_trail.daily_spend), 0) AS spend_7d
FROM training_labels t
-- LEFT JOINs keep training examples for users with no activity on or
-- before the label date — often exactly the churners the model must see
LEFT JOIN user_features_backfill_2024q1 f
    ON t.user_id = f.user_id
    AND f.feature_date = t.label_date
LEFT JOIN user_features_backfill_2024q1 f_trail
    ON t.user_id = f_trail.user_id
    AND f_trail.feature_date BETWEEN t.label_date - INTERVAL '6 days' AND t.label_date
GROUP BY t.user_id, t.label_date, t.churned, f.daily_active_events;

This pattern ensures temporal correctness: each training row only incorporates feature values from events that occurred on or before the label date. The join structure prevents future leakage, which is one of the most common and hard-to-detect bugs in ML training pipelines.

Cost Considerations: Raw Retention vs. Pre-Aggregation

Retaining raw events for 90-day or longer backfill windows costs storage. For high-volume event streams, this is a real consideration. Here's the tradeoff framework:

Retain raw events when:

  • You expect to add or change features frequently (new model versions, feature experiments)
  • Feature definitions are not stable — the exact aggregation logic may change
  • You need point-in-time correctness and can't pre-determine the grain in advance
  • Event volume is moderate (millions per day rather than hundreds of billions)

Pre-aggregate and discard raw events when:

  • Feature definitions are stable and unlikely to change
  • Event volume makes raw retention prohibitively expensive
  • You're willing to accept that new features can only be backfilled to the pre-aggregation date, not further back
  • The granularity of your pre-aggregates (e.g., hourly buckets) is fine-grained enough to recompute any feature you'd want

A common middle ground: retain raw events for a rolling 30-day window (sufficient for most short-window features) and export monthly pre-aggregates to cheaper object storage for longer-horizon lookbacks. When you need a feature backfill beyond 30 days, you pay the query cost to reprocess from the pre-aggregates rather than raw events.
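One way to sketch that middle ground: a compact hourly rollup MV that is continuously exported to object storage, so it survives after raw events age out of the 30-day window. The sink parameters mirror the S3 sink shown earlier and are illustrative:

-- Compact hourly rollup, retained indefinitely in cheap object storage.
CREATE MATERIALIZED VIEW user_features_hourly AS
SELECT
    user_id,
    date_trunc('hour', event_time) AS feature_hour,
    COUNT(*) AS hourly_events,
    SUM(amount) AS hourly_spend
FROM raw_events
GROUP BY user_id, date_trunc('hour', event_time);

-- Continuously export the rollup; connector parameters are illustrative.
CREATE SINK hourly_rollup_to_s3
FROM user_features_hourly
WITH (
    connector = 's3',
    s3.region = 'us-east-1',
    s3.bucket_name = 'ml-training-data',
    s3.path = 'features/user_features_hourly/',
    s3.file_type = 'parquet'
);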

This is a meaningful cost difference. Pre-aggregated hourly data for a large user base might be 100x smaller than raw events, making 180-day backfills from pre-aggregates economically feasible even when raw retention is not.

There is no universally correct answer here. The right retention policy depends on your feature iteration velocity, event volume, and tolerance for limited backfill depth. The key is making this a deliberate architectural decision rather than discovering the constraint when you need a backfill.

FAQ

How does this compare to using a lakehouse (Iceberg/Delta Lake) for backfill?

A lakehouse with time-travel support (Iceberg snapshots, Delta Lake history) lets you query the state of a table at a past point in time — useful for reconstructing what data existed at a given moment. But this is different from recomputing features from raw events using the same streaming semantics. The streaming database approach gives you consistent semantics between your live pipeline and your backfill. The lakehouse approach solves a different problem (point-in-time data snapshots) and typically requires a separate compute engine (Spark, Trino) for the backfill compute.

What's the backfill throughput? Will it saturate my streaming pipeline?

In RisingWave, the backfill MV is processed separately from your live streaming pipeline. Backfill jobs read from retained event history without competing directly with live ingestion. For large backfills, you can control resource allocation to avoid impacting live query latency.

Can I backfill features that require joining multiple event streams?

Yes. If your live feature pipeline joins events from multiple sources (e.g., purchases joined with page_views), your backfill MV can perform the same join over the retained histories of both streams. The consistency guarantee extends to multi-stream features as long as both streams retain the relevant historical data.
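A hedged sketch of what a multi-stream backfill MV might look like, assuming hypothetical purchases and page_views streams that both retain history for the backfill window:

-- Time-bounded backfill over a join of two retained streams.
-- Stream and column names are illustrative, not a real schema.
CREATE MATERIALIZED VIEW purchase_view_features_backfill AS
SELECT
    p.user_id,
    date_trunc('day', p.purchase_time) AS feature_date,
    COUNT(DISTINCT p.order_id) AS daily_purchases,
    COUNT(DISTINCT v.page_id) AS pages_viewed_before_purchase
FROM purchases p
JOIN page_views v
    ON p.user_id = v.user_id
    -- interval join: only views in the hour preceding each purchase
    AND v.view_time BETWEEN p.purchase_time - INTERVAL '1 hour' AND p.purchase_time
WHERE p.purchase_time >= '2024-01-01' AND p.purchase_time < '2024-04-01'
GROUP BY p.user_id, date_trunc('day', p.purchase_time);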

What happens if my feature logic changes between the backfill period and today?

If you change a feature definition, the backfill should use the new definition, not the historical one. The purpose of backfill is to generate training data that matches what your live pipeline will compute going forward. If you used the old definition, you'd train a model on features that differ from what it will receive in serving — which is training-serving skew.

How do I validate that my backfill is consistent with my live pipeline?

Run the backfill query on a recent time window where you also have live pipeline output. Compare the results row by row. Any discrepancy indicates a consistency issue to resolve before using the backfill for training. This validation step is cheap and should be standard practice before kicking off a full historical backfill.
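One cheap way to run that comparison, assuming the live MV and the backfill MV overlap on a recent day — any rows returned indicate a discrepancy to investigate:

-- Symmetric difference between backfill and live output for one day.
-- An empty result means the two pipelines agree on this window.
(SELECT * FROM user_features_backfill_2024q1 WHERE feature_date = '2024-03-30'
 EXCEPT
 SELECT * FROM user_features WHERE feature_date = '2024-03-30')
UNION ALL
(SELECT * FROM user_features WHERE feature_date = '2024-03-30'
 EXCEPT
 SELECT * FROM user_features_backfill_2024q1 WHERE feature_date = '2024-03-30');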


Feature backfilling is genuinely hard. There's no approach that eliminates all the tradeoffs between storage cost, recompute cost, latency, and consistency. But the consistency problem — training-serving skew from divergent feature pipelines — is the most consequential one to solve, because it silently degrades model performance in ways that are difficult to attribute.

A streaming database that retains raw event history and lets you express both live and historical feature computation in the same SQL dialect addresses the consistency problem at the architectural level. The tradeoffs shift to storage cost and retention policy — decisions you can make deliberately, rather than discovering constraints when a model is underperforming and you need to diagnose why.
