Feature Freshness Matters for ML Models

Your fraud model catches 94% of fraudulent transactions in offline evaluation. In production, it catches 71%. The model did not change. The features did.

This is the feature freshness problem. Most ML teams spend months tuning model architectures, hyperparameters, and training data, then deploy the model into a serving environment where its input features are hours or days old. The model was trained on fresh data but runs on stale data. The resulting performance gap is silent, systematic, and surprisingly large.

In this post, we break down the feature freshness gap, show how it degrades model performance across three common domains, and explain why real-time feature computation is the most direct fix.

What Is Feature Freshness?

Feature freshness measures how recently a feature value was computed relative to when the model uses it for prediction. A feature value that reflects the state of the world five seconds ago is fresh. One that reflects the state from six hours ago is stale.

In a typical ML system, two separate pipelines produce features:

  • Training pipeline: Reads raw event data from a data warehouse, computes features as of each training example's timestamp, and writes feature vectors to a training dataset. These features are point-in-time correct – they reflect reality at the moment the label was generated.
  • Serving pipeline: Reads pre-aggregated values from a feature store that gets refreshed on a schedule (hourly, daily, or worse). When the model requests features at inference time, it receives whatever values were last written by the batch job.

The feature freshness gap is the latency between when an event happens in the real world and when that event is reflected in the feature values the model sees at serving time. If your batch pipeline runs every six hours, your average feature freshness gap is three hours. Your worst case is nearly six.

This gap matters because the model learned statistical relationships on fresh features but must make predictions using stale ones. The distribution shift between the two is a form of training-serving skew – one of the most common and hardest-to-debug failure modes in production ML.

How Stale Features Degrade Model Performance

The impact of feature staleness varies by domain, but the pattern is consistent: the more time-sensitive the prediction task, the more damage stale features cause. Here are three examples.

Fraud detection

A fraud model relies on velocity features – the number of transactions in the last 5 minutes, the number of distinct merchants in the last hour, the ratio of current transaction amount to the user's rolling average. These features are inherently temporal. A burst of three transactions in 90 seconds is a strong fraud signal. But if your feature store refreshes hourly, the model sees one transaction (from the last batch run), not three. The velocity signal is invisible.
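
As a rough sketch, here is what such velocity features could look like as a streaming SQL view (the schema – merchant_id, amount, event_time – is illustrative, not a prescribed one):

-- Hypothetical velocity features over the last hour, maintained incrementally
CREATE MATERIALIZED VIEW txn_velocity_1h AS
SELECT
    user_id,
    COUNT(*) AS txn_count_1h,
    COUNT(DISTINCT merchant_id) AS distinct_merchants_1h,
    AVG(amount) AS avg_amount_1h
FROM transactions
WHERE event_time > NOW() - INTERVAL '1 hour'
GROUP BY user_id;
-- A 5-minute variant is the same query with INTERVAL '5 minutes'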

The consequence: the model evaluates each transaction in isolation, stripped of the temporal context it was trained on. Recall drops – the model misses fraud because it lacks the features that distinguish a fraudulent burst from normal spending.

Recommendations

A recommendation model trained on session-level behavior learns that "user viewed items A and B, then purchased C" is a useful pattern. At serving time, the model needs to know what the user has viewed in the current session. If features lag by 30 minutes, the model cannot see the items the user browsed three minutes ago. It recommends items the user has already viewed – or already purchased.
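
A minimal sketch of a session-level feature, assuming a page_views stream with user_id, item_id, and event_time columns:

-- Hypothetical session features: items the user viewed in the last 30 minutes
CREATE MATERIALIZED VIEW session_views AS
SELECT
    user_id,
    item_id,
    MAX(event_time) AS last_viewed_at
FROM page_views
WHERE event_time > NOW() - INTERVAL '30 minutes'
GROUP BY user_id, item_id;

The serving layer can then filter candidate recommendations against this view to exclude items the user has just seen.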

Stale recommendation features do not just reduce click-through rates. They actively harm user experience by making the product feel unresponsive. The model is technically running, but it is operating on a version of the user that no longer exists.

Dynamic pricing

A pricing model uses demand signals (page views, add-to-cart events, conversion rates) and supply signals (inventory levels, competitor prices) to set optimal prices. If demand spikes at 10 AM but the feature store does not refresh until 2 PM, the model underprices during the surge and overprices after it subsides.

In a real-time pricing engine, features update with every new event, so the model sees the demand spike as it forms. With batch features, the model is always reacting to yesterday's market.
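
As an illustration, demand-side features over a hypothetical product_events stream (the event_type values are placeholders):

-- Hypothetical demand signals per product, updated as events arrive
CREATE MATERIALIZED VIEW demand_signals AS
SELECT
    product_id,
    COUNT(*) FILTER (WHERE event_type = 'page_view') AS views_15m,
    COUNT(*) FILTER (WHERE event_type = 'add_to_cart') AS carts_15m
FROM product_events
WHERE event_time > NOW() - INTERVAL '15 minutes'
GROUP BY product_id;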

Training-Serving Skew: The Root Cause

The feature freshness gap is a specific instance of training-serving skew – the broader problem where the data distribution a model encounters in production differs from what it saw during training. Google's Rules of Machine Learning identifies this as one of the top causes of degraded model quality in production systems.

Training-serving skew from stale features has two components:

Temporal skew. The model was trained on features computed at the exact timestamp of each event. In production, features were computed at the last batch refresh. A feature like avg_transaction_amount_last_30_min might be 45 minutes old by the time the model reads it. The model learned the relationship between a 30-minute average and fraud likelihood, but it is receiving an average over a window that closed 45 minutes ago – a snapshot of the user from 75 to 45 minutes in the past. The semantics of the feature have silently changed.
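
One way to make temporal skew visible is to measure it, assuming your batch feature store records when each row was computed (a computed_at column, as in the batch example later in this post):

-- Quantify staleness at read time in a batch feature store
SELECT
    AVG(NOW() - computed_at) AS avg_staleness,
    MAX(NOW() - computed_at) AS worst_case_staleness
FROM feature_store;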

Aggregation skew. Training pipelines often compute features by replaying raw event logs with exact point-in-time correctness. Serving pipelines read from pre-aggregated tables where edge cases – late-arriving events, out-of-order data, null handling – are resolved differently. The two pipelines nominally compute the same feature but produce subtly different values. This is the dual-pipeline problem, and it is notoriously difficult to detect through monitoring alone.

The combination of temporal and aggregation skew means the model operates on a shifted feature distribution. The shift is usually small enough to avoid triggering alert thresholds but large enough to erode precision, recall, or calibration over time.

Why Batch Refresh Is Not the Answer

The instinctive reaction to feature staleness is to increase batch frequency. Run the pipeline every 15 minutes instead of every hour. This helps marginally but introduces its own problems:

  • Cost scales linearly. Each batch run recomputes every feature for every entity. Going from hourly to every 15 minutes quadruples your compute cost.
  • Diminishing returns. Moving from 6-hour to 1-hour freshness recovers significant model quality. Moving from 15 minutes to 5 minutes recovers much less – but the cost increase is the same.
  • Operational fragility. More frequent batch jobs mean more overlapping runs, more contention for warehouse resources, and more opportunities for failures. A 15-minute batch that occasionally takes 20 minutes creates cascading delays.
  • You still have the dual-pipeline problem. No matter how frequently the batch runs, you are still maintaining two separate code paths for feature computation – one for training, one for serving.

The fundamental issue is that batch processing is the wrong abstraction for features that need to be fresh. It conflates "how often do we compute" with "how fresh are the results." A streaming approach decouples these by computing incrementally: each new event updates only the features it affects, so freshness tracks the event stream, not a batch schedule.

Solving Freshness with Real-Time Feature Computation

Real-time feature computation eliminates the freshness gap by processing events as they arrive and maintaining feature values incrementally. Instead of recomputing transaction_count_last_30_min for every user every batch cycle, the system updates the count for a specific user the moment a new transaction for that user appears.

This approach solves both components of training-serving skew:

  • Temporal skew disappears because features in the serving path reflect events within seconds, matching the point-in-time semantics the model was trained on.
  • Aggregation skew disappears because the same computation – expressed once as a SQL query – produces both the training features (via historical backfill) and the serving features (via the live materialized view). One definition, zero divergence.

A streaming database like RisingWave is designed for exactly this pattern. You define features as SQL queries over streaming sources, and the engine maintains the results as materialized views that update incrementally with each new event. No separate batch pipeline. No feature store refresh cycle. No dual code paths.
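
To make the examples below self-contained, here is roughly how the transactions stream could be declared as a source (the Kafka topic and broker address are placeholders):

-- Hypothetical Kafka source feeding the feature views
CREATE SOURCE transactions (
    txn_id      VARCHAR,
    user_id     VARCHAR,
    merchant_id VARCHAR,
    amount      DOUBLE PRECISION,
    event_time  TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'transactions',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;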

Fresh vs. Stale: A SQL Example

Consider a fraud detection feature: the number of transactions per user in the last 30 minutes. In a batch system, this feature is computed periodically and written to a feature store table:

-- Batch approach: runs every hour via Airflow
-- Feature is 0-60 minutes stale by the time the model reads it
INSERT INTO feature_store (user_id, txn_count_30m, computed_at)
SELECT
    user_id,
    COUNT(*) AS txn_count_30m,
    NOW() AS computed_at
FROM transactions
WHERE event_time > NOW() - INTERVAL '30 minutes'
GROUP BY user_id;

Every hour, this query scans the transactions table, computes the counts, and writes a fresh snapshot to the feature store. Between runs, the feature decays. A user with zero transactions at 2:00 PM could have five transactions by 2:45 PM, and the model would still see zero until the 3:00 PM batch.

With RisingWave, the same feature is defined as a continuously maintained materialized view:

-- Streaming approach: always fresh
-- Feature updates within seconds of each new transaction
CREATE MATERIALIZED VIEW user_txn_features AS
SELECT
    user_id,
    COUNT(*) AS txn_count_30m,
    SUM(amount) AS total_amount_30m,
    COUNT(DISTINCT merchant_id) AS distinct_merchants_30m
FROM transactions
WHERE event_time > NOW() - INTERVAL '30 minutes'
GROUP BY user_id;

This view is always current. When a new transaction arrives, RisingWave incrementally updates the count for that user – it does not rescan the entire table. The model queries the materialized view at inference time and receives features that reflect reality within seconds, not hours. This is the principle behind incremental view maintenance, where only the changed rows are reprocessed rather than the entire dataset.

At inference time, your model serving layer issues a simple query:

SELECT txn_count_30m, total_amount_30m, distinct_merchants_30m
FROM user_txn_features
WHERE user_id = 'u_12345';

The result is always fresh. No batch schedule to manage. No staleness window to account for. The same SQL definition can be used for both training data generation (by backfilling against historical data) and live serving (via the materialized view), eliminating the dual-pipeline problem entirely.

When Does Feature Freshness Matter Most?

Not every ML use case needs sub-second feature freshness. The value of fresh features depends on how quickly the underlying signal changes and how sensitive the model's predictions are to that change.

High value: Fraud detection, abuse prevention, real-time bidding, session-based recommendations, dynamic pricing, anomaly detection. These tasks involve fast-changing signals where minutes of staleness translate directly to lost accuracy or revenue.

Medium value: Credit scoring, churn prediction, supply chain forecasting. These tasks benefit from features that are hours-fresh rather than days-fresh, but sub-second latency offers diminishing returns.

Lower value: Monthly reporting models, long-term customer segmentation, propensity models that operate on demographic features. These features change slowly, and daily batch updates are sufficient.

The decision framework is simple: if the events that generate your features change faster than your batch pipeline runs, your features are stale, and your model quality is suffering. The wider that gap, the more accuracy you are leaving on the table.

FAQ

How much does feature staleness actually cost in model accuracy?

It depends on the use case and feature importance. Research from Google and other large-scale ML practitioners suggests that feature freshness is among the top contributors to training-serving skew. In fraud detection, teams have reported 10-25% improvements in precision after moving velocity features from hourly batch to real-time computation. For recommendation systems, click-through rate improvements of 5-15% are common when session-level features update in real time.

Can I use RisingWave as a feature store?

Yes. RisingWave's materialized views serve as an online feature layer. You define features as SQL queries, and the database keeps them current via incremental view maintenance. Your model serving layer queries materialized views with standard PostgreSQL-compatible SQL to retrieve feature vectors at inference time. For a deeper walkthrough, see How to Build Real-Time Feature Pipelines with Streaming SQL.

Do I need to rewrite my training pipeline?

Not necessarily. The ideal setup uses the same SQL definitions for both training and serving. For training, you can backfill features by running the same queries against historical data. For serving, the materialized view maintains the live version. This single-definition approach eliminates the dual-pipeline problem that causes aggregation skew.
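
As a sketch, a point-in-time-correct backfill of txn_count_30m might look like this, assuming a labels table with one row per training example (table and column names are illustrative):

-- Hypothetical backfill: compute the feature as of each label's timestamp
SELECT
    l.label_id,
    l.user_id,
    COUNT(t.txn_id) AS txn_count_30m
FROM labels l
LEFT JOIN historical_transactions t
    ON  t.user_id = l.user_id
    AND t.event_time > l.label_time - INTERVAL '30 minutes'
    AND t.event_time <= l.label_time
GROUP BY l.label_id, l.user_id;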

How does this compare to a dedicated feature platform like Feast or Tecton?

Feature platforms manage feature metadata, storage, and serving infrastructure. RisingWave complements them by handling the computation layer – the part that transforms raw events into feature values in real time. You can use RisingWave to compute features and sink the results into a feature store if your architecture requires one, or you can query RisingWave's materialized views directly for online serving.
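
For example, a sketch of streaming the feature view from earlier into Kafka for a downstream feature platform to consume (topic and broker are placeholders):

-- Hypothetical sink: emit feature updates as an upsert stream
CREATE SINK features_to_kafka FROM user_txn_features
WITH (
    connector = 'kafka',
    topic = 'user-features',
    properties.bootstrap.server = 'kafka:9092',
    primary_key = 'user_id'
) FORMAT UPSERT ENCODE JSON;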

What if only some of my features need real-time freshness?

That is the common case. Most production models mix slowly-changing features (demographics, historical aggregates) with fast-changing ones (session behavior, velocity metrics). You can compute the time-sensitive features in RisingWave while leaving the rest in your existing batch pipeline. The model serving layer joins both at inference time.
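
For instance, the inference-time query might join the live view with a slowly-changing table (batch_user_features here is a hypothetical table synced from your warehouse):

-- Combine fast streaming features with slow batch features at inference time
SELECT
    f.txn_count_30m,
    f.distinct_merchants_30m,
    b.account_age_days,
    b.credit_score
FROM user_txn_features f
JOIN batch_user_features b ON b.user_id = f.user_id
WHERE f.user_id = 'u_12345';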

Conclusion

Feature freshness is not a nice-to-have. It is a first-order determinant of model quality for any prediction task that depends on recent behavior. The feature freshness gap – the difference between training-time and serving-time feature latency – is a silent, systemic source of model degradation that monitoring dashboards rarely surface.

The fix is architectural: replace batch feature computation with continuous, incremental computation that keeps features fresh as events arrive. A streaming database like RisingWave lets you express features in SQL, maintain them as materialized views, and serve them with sub-second freshness – all without building separate training and serving pipelines.

Ready to eliminate feature staleness? Try RisingWave Cloud free, no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
