Your fraud model was performing well in Q4. By February, it's letting 15% more fraud through. Nobody noticed until a business review. The root cause? Seasonal spending shifts in January quietly changed the distribution of transaction amounts, purchase categories, and merchant types -- all features the model trained on. The model was never retrained. The monitoring job ran nightly, flagged a warning nobody acted on, and quietly moved on.
This is feature drift. It is one of the most common causes of silent model degradation in production, and it is surprisingly hard to catch with traditional batch monitoring.
This article shows how to build always-on feature distribution monitoring using streaming SQL and materialized views, so you detect drift in minutes rather than after a batch job finishes hours later.
What Feature Drift Is and Why It Matters
A machine learning model is a function fitted to a training distribution. When you deploy it, you are implicitly assuming that production data will look like training data. Feature drift is what happens when that assumption breaks.
The defining characteristic of feature drift is that it is silent. The model does not throw an exception. It does not return an error code. It just starts returning predictions that are increasingly miscalibrated, and your business metrics degrade slowly enough that attribution is non-trivial.
Consider a practical example. You train a fraud detection model in Q4 2024. Your training data includes features like:
- transaction_amount (mean: $85, stddev: $42)
- merchant_category (top categories: grocery, gas, retail)
- days_since_last_transaction (mean: 3.2 days)
- user_age (mean: 38, null rate: 2%)
In Q1, a few things shift. Post-holiday spending drops transaction amounts significantly. Users pay off credit cards rather than making purchases, changing transaction frequency. An upstream data pipeline change starts leaving user_age null for new account sign-ups, pushing the null rate from 2% to 40%.
Your model sees inputs it was never trained on. Its calibration breaks. Precision drops. Your risk team notices elevated chargebacks in late February. The investigation takes two weeks.
The entire incident window -- from first drift to detection to remediation -- was six weeks. With real-time feature monitoring, that window collapses to minutes.
What You Actually Need to Monitor
Not all features drift the same way. You need statistics that capture different failure modes:
Mean and standard deviation. The most common signal. If transaction_amount mean shifts from $85 to $40, something changed. A z-score measures how many standard deviations the current mean is from the baseline.
Null rate. Pipeline breaks almost always manifest as null rate spikes before anything else. If a feature that was 2% null is now 40% null, a data contract was violated. This is often the earliest available signal of an upstream incident.
Cardinality. For categorical features, the number of distinct values is a proxy for distribution coverage. If merchant_category drops from 47 distinct values to 8, either traffic changed dramatically or an encoding step broke.
Quantiles (p50, p95, p99). Mean is sensitive to outliers. Monitoring quantiles separately catches distribution shape changes that mean alone misses. A feature can have a stable mean while its tail behavior shifts significantly.
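To make the quantile point concrete, here is a minimal Python sketch (the sample values are invented) where the mean stays flat while p99 roughly doubles:

```python
import statistics

# Hypothetical samples: the mean is identical, but the tail has shifted.
baseline = [80] * 95 + [180] * 5   # mean 85.0
current = [70] * 95 + [370] * 5    # mean 85.0, heavier tail

def p99(values):
    # statistics.quantiles(n=100) returns 99 cut points; index 98 is p99
    return statistics.quantiles(values, n=100)[98]

print(statistics.fmean(baseline), statistics.fmean(current))  # 85.0 85.0
print(p99(baseline), p99(current))                            # 180.0 370.0
```

Mean-only monitoring would report nothing here; the p99 comparison flags the tail shift immediately.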
For each of these, you need two things: a baseline (what the distribution looked like at training time) and a current window (what it looks like right now). The drift signal is the delta between them.
How Streaming SQL Enables Continuous Monitoring
Traditional monitoring architectures run periodic batch jobs. A Spark job reads the last 24 hours of feature logs, computes statistics, writes results to a table, and a dashboard reads from that table. The lag between when drift starts and when you detect it is at least one batch interval, often longer when you account for job queue time, retries, and alert propagation.
A streaming database like RisingWave changes this model. Instead of running a job on a schedule, you define materialized views that the database maintains continuously as new data arrives. Every time a new row is inserted into feature_log, the aggregate statistics update automatically. There is no scheduler, no batch job, no pipeline to maintain.
This matters because:
- Null rate spikes are detectable within the first few minutes of an upstream pipeline break, not after 23 hours of bad data have accumulated.
- Drift that starts and reverses within a batch window is completely invisible to batch monitoring but clearly visible in a streaming window.
- Alerting can be event-driven rather than polling-based. A Kafka sink fires the moment a threshold is crossed, not the next time someone runs a query.
The tradeoff is that streaming statistical estimates are computed over a sliding time window rather than a static historical snapshot, so they follow recent traffic. For detecting distribution shift against a fixed baseline, that recency is exactly what you want.
SQL Implementation
Assume you have a feature logging table that captures model inference inputs. A common pattern is an unpivoted log table with one row per feature per prediction:
-- Source table: one row per feature value per prediction event
CREATE TABLE feature_log_unpivoted (
prediction_id VARCHAR,
feature_name VARCHAR,
value VARCHAR, -- stored as text, cast at query time
logged_at TIMESTAMPTZ
);
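On the serving side, something has to produce those rows. A minimal Python sketch of the logging step -- the helper name and payload are illustrative, not from any particular framework:

```python
from datetime import datetime, timezone

def unpivot_features(prediction_id, features):
    """Flatten one prediction's feature dict into rows shaped like
    feature_log_unpivoted: (prediction_id, feature_name, value, logged_at).
    Values become text to match the VARCHAR column; None stays NULL."""
    logged_at = datetime.now(timezone.utc)
    return [
        (prediction_id, name, None if v is None else str(v), logged_at)
        for name, v in features.items()
    ]

rows = unpivot_features("pred-123", {
    "transaction_amount": 42.5,
    "merchant_category": "grocery",
    "user_age": None,  # logged as SQL NULL, which feeds the null-rate stats
})
# Three rows, one per feature, ready to insert into feature_log_unpivoted.
```

The key detail is that missing values are logged as real NULLs rather than dropped, so the null-rate statistics downstream see them.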
Step 1: Baseline Statistics
The baseline represents the expected distribution -- typically computed over your training window. In a streaming system, you can lock this to a historical time range:
-- Baseline statistics (computed from training window)
CREATE MATERIALIZED VIEW feature_baseline AS
SELECT
feature_name,
AVG(value::float) AS baseline_mean,
STDDEV(value::float) AS baseline_stddev,
COUNT(*) FILTER (WHERE value IS NULL)::float / COUNT(*) AS null_rate
FROM feature_log_unpivoted
WHERE logged_at BETWEEN '2024-10-01' AND '2024-12-31' -- the Q4 2024 training window
  AND feature_name <> 'merchant_category' -- numeric features only; the ::float cast fails on categorical values
GROUP BY feature_name;
In practice, you may prefer to seed this from a separate table populated by your training pipeline rather than computing it from the live log. That gives you exact alignment with training data and avoids the cost of aggregating months of history continuously.
Step 2: Current Window Statistics
This view computes the live distribution over a rolling one-hour window:
-- Current window stats (last 1 hour, updated continuously)
CREATE MATERIALIZED VIEW feature_current_stats AS
SELECT
feature_name,
AVG(value::float) AS current_mean,
STDDEV(value::float) AS current_stddev,
COUNT(*) FILTER (WHERE value IS NULL)::float / COUNT(*) AS null_rate,
COUNT(*) AS sample_count
FROM feature_log_unpivoted
WHERE logged_at >= NOW() - INTERVAL '1 hour'
  AND feature_name <> 'merchant_category' -- numeric features only; the ::float cast fails on categorical values
GROUP BY feature_name;
RisingWave maintains this view incrementally. As old rows age out of the one-hour window and new rows arrive, the aggregates update without recomputing from scratch.
The window size is a design choice. One hour gives you fast detection with enough statistical power for high-traffic features. For lower-traffic features, a six-hour or 24-hour window may be necessary to get stable estimates.
Step 3: Drift Detection
Join the two views and apply your thresholds:
-- Drift detection
CREATE MATERIALIZED VIEW feature_drift_alerts AS
SELECT
c.feature_name,
ABS(c.current_mean - b.baseline_mean) / NULLIF(b.baseline_stddev, 0) AS z_score,
c.null_rate - b.null_rate AS null_rate_delta,
CURRENT_TIMESTAMP AS detected_at
FROM feature_current_stats c
JOIN feature_baseline b ON c.feature_name = b.feature_name
WHERE
ABS(c.current_mean - b.baseline_mean) / NULLIF(b.baseline_stddev, 0) > 3
OR ABS(c.null_rate - b.null_rate) > 0.1;
The NULLIF(b.baseline_stddev, 0) guard prevents division-by-zero for constant features. A z-score above 3 means the current mean is more than three standard deviations from baseline -- a signal that would occur by chance less than 0.3% of the time under a normal distribution.
The ABS(c.null_rate - b.null_rate) > 0.1 condition fires when null rate increases by more than 10 percentage points. This is your pipeline break detector.
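The same predicate is easy to mirror in plain Python for unit-testing thresholds before baking them into the view (the dict shape here is illustrative):

```python
def drift_alert(baseline, current, z_threshold=3.0, null_delta_threshold=0.10):
    """Mirror of the feature_drift_alerts predicate.
    baseline/current carry 'mean', 'stddev', 'null_rate'."""
    stddev = baseline["stddev"]
    # NULLIF(baseline_stddev, 0) equivalent: constant features get no z-score
    z = abs(current["mean"] - baseline["mean"]) / stddev if stddev else None
    null_delta = current["null_rate"] - baseline["null_rate"]
    fires = (z is not None and z > z_threshold) or abs(null_delta) > null_delta_threshold
    return z, null_delta, fires

# A null-rate spike fires even when the mean barely moves:
z, delta, fires = drift_alert(
    {"mean": 38.2, "stddev": 12.4, "null_rate": 0.02},
    {"mean": 37.9, "stddev": 12.1, "null_rate": 0.41},
)
# z is about 0.024, delta is about 0.39, fires is True
```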
Alerting: Routing Drift Signals Downstream
A materialized view that nobody reads is not a monitoring system. You need to route alerts to where on-call engineers will see them.
Kafka sink for PagerDuty or Slack integration:
CREATE SINK feature_drift_sink
FROM feature_drift_alerts
WITH (
connector = 'kafka',
properties.bootstrap.server = 'kafka:9092',
topic = 'feature-drift-alerts',
type = 'append-only',
force_append_only = true
);
A downstream consumer reads from this topic and routes to PagerDuty, Slack, or your incident management system. Because the sink is driven by view updates, alerts fire within seconds of drift being detected.
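The routing logic in that consumer can be a small pure function. A sketch -- the severity policy and destinations are illustrative; only the JSON fields match the alert view:

```python
import json

def route_alert(payload: str):
    """Decide where a drift alert goes. Illustrative policy:
    null-rate breaks page the on-call; everything else goes to chat."""
    alert = json.loads(payload)
    if abs(alert.get("null_rate_delta", 0.0)) > 0.10:
        return ("pagerduty", alert["feature_name"])
    return ("slack", alert["feature_name"])

msg = '{"feature_name": "user_age", "z_score": 0.024, "null_rate_delta": 0.39}'
# route_alert(msg) returns ("pagerduty", "user_age")
```

Keeping the policy as a pure function over the alert payload makes it trivial to test, independent of the Kafka consumer wiring.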
Webhook sink for direct HTTP delivery:
CREATE SINK feature_drift_webhook
FROM feature_drift_alerts
WITH (
connector = 'webhook',
url = 'https://your-alerting-service.internal/feature-drift',
type = 'append-only'
);
One operational consideration: drift alerts can be noisy during model retraining periods or intentional feature changes, so plan to suppress them around known rollouts. Separately, add a sample_count filter to drop alerts when the current window has insufficient data, which prevents false positives during low-traffic periods:
-- Add to the WHERE clause of feature_drift_alerts
AND c.sample_count >= 100
Z-Score vs. PSI: Which to Use
The implementation above uses z-score for mean shift and absolute delta for null rate. A common alternative in MLOps is the Population Stability Index (PSI).
Z-score approach measures shift in the mean relative to training variance. It is:
- Easy to implement in SQL
- Interpretable (3-sigma = very unlikely under no-change hypothesis)
- Sensitive to mean shifts
- Blind to distribution shape changes that preserve the mean
PSI compares the full distribution by bucketing values and measuring how much the bucket proportions changed:
PSI = SUM((current_pct - baseline_pct) * LN(current_pct / baseline_pct))
PSI thresholds are typically: PSI < 0.1 (no significant shift), 0.1 to 0.25 (moderate shift, worth investigating), > 0.25 (major shift, retrain likely needed).
PSI captures shape changes that z-score misses, but it requires pre-defining buckets, is harder to implement in SQL without a bucketing step, and is less interpretable to engineers unfamiliar with information theory.
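For reference, PSI is straightforward outside SQL. A minimal Python sketch with pre-defined bucket edges (the epsilon guard for empty buckets is a common convention, not part of the formula):

```python
import math

def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index over pre-defined bucket edges.
    Buckets are [edges[i], edges[i+1]); eps avoids log(0) for empty buckets."""
    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        return [max(c / len(values), eps) for c in counts]
    b, c = proportions(baseline), proportions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

stable = psi([10, 20, 30, 40] * 25, [10, 20, 30, 40] * 25, [0, 15, 25, 35, 50])
shifted = psi([10, 20, 30, 40] * 25, [40] * 100, [0, 15, 25, 35, 50])
# stable is 0.0; shifted lands far above the 0.25 "major shift" threshold
```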
Practical recommendation: Start with z-score for continuous features and null-rate delta for pipeline integrity. Add PSI for your highest-impact features once you have the simpler system working. A streaming database can compute both, but the simpler system is the one you will actually maintain.
Practical Example: Detecting a Null Rate Spike
Walk through a concrete scenario: user_age starts getting 40% null rate due to a schema change in an upstream user profile service.
Your baseline view shows:
feature_name | baseline_mean | baseline_stddev | null_rate
user_age | 38.2 | 12.4 | 0.02
A deployment at 14:32 UTC breaks the user profile join. By 14:38, the current window shows:
feature_name | current_mean | current_stddev | null_rate | sample_count
user_age | 37.9 | 12.1 | 0.41 | 847
The z-score for the mean is (38.2 - 37.9) / 12.4 = 0.024 -- well below the threshold. Mean-based monitoring would miss this entirely.
The null rate delta is 0.41 - 0.02 = 0.39 -- far above the 0.10 threshold. The drift alert fires at 14:39, seven minutes after the deployment. Your on-call engineer gets a PagerDuty alert with:
{
"feature_name": "user_age",
"z_score": 0.024,
"null_rate_delta": 0.39,
"detected_at": "2025-02-15T14:39:12Z"
}
The engineer rolls back the deployment at 14:52. Total exposure: 20 minutes. Without streaming monitoring, the nightly batch job would have run at 02:00 UTC the next day, flagging an alert that gets triaged the following morning. Exposure: over 12 hours.
Mean monitoring alone would have missed this incident entirely. Null rate monitoring is not a secondary concern -- for pipeline-driven features, it is often your most reliable early warning signal.
Limitations to Be Honest About
Streaming statistical monitoring has real constraints worth naming:
Small windows have high variance. A one-hour window for a feature that sees 50 predictions per hour gives you poor statistical estimates. Set a minimum sample_count threshold and consider longer windows for low-traffic features.
Correlated drift is hard to detect. Z-score tests each feature independently. A model that uses 20 correlated features can degrade significantly from small correlated shifts in all of them, none of which individually crosses your z-score threshold. Multivariate drift detection requires different tooling.
Baseline staleness. If your baseline was computed from a training window that is months old, the z-scores become meaningless as the world changes legitimately over time. Periodically updating the baseline (or using a longer rolling baseline window) is necessary for long-running models.
No ground truth. Feature drift tells you the inputs changed. It does not tell you the model's predictions are wrong -- that requires outcome monitoring (comparing predictions to labels). Feature drift is a leading indicator, not a confirmation of degraded performance.
Streaming SQL feature monitoring is a necessary component of MLOps, not a sufficient one. Use it alongside outcome monitoring, prediction distribution monitoring, and business metric tracking.
FAQ
Can I use this with any ML framework?
Yes. The monitoring layer operates on your feature log table, which is decoupled from your model serving infrastructure. Any system that logs inference inputs to a Kafka topic or database table can feed into this architecture.
What if my features are categorical, not numeric?
For categorical features, replace mean/stddev monitoring with cardinality monitoring and top-N value share. Track how many distinct values the feature takes in the current window versus baseline, and monitor whether the top categories have shifted. SQL COUNT(DISTINCT value) and conditional aggregates handle this.
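A sketch of the top-N share check in Python (the category values are invented):

```python
from collections import Counter

def top_share(baseline_values, current_values, n=3):
    """Share of traffic covered by the baseline's top-n categories.
    A large drop means the category mix shifted, even if cardinality is stable."""
    top = [cat for cat, _ in Counter(baseline_values).most_common(n)]
    def share(values):
        counts = Counter(values)
        return sum(counts[c] for c in top) / len(values)
    return share(baseline_values), share(current_values)

base = ["grocery"] * 60 + ["gas"] * 25 + ["retail"] * 15
curr = ["grocery"] * 20 + ["gas"] * 10 + ["crypto_kiosk"] * 70
# top_share(base, curr) returns (1.0, 0.3): the baseline's top categories
# now cover only 30% of current traffic -- a strong categorical drift signal
```

The SQL equivalent is a conditional aggregate over the baseline's known top categories, joined against the current window the same way as the numeric views.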
How do I handle features that legitimately change over time?
Update the baseline periodically to reflect the current expected distribution. One approach is to maintain the baseline as a separate table populated by your model retraining pipeline. Each retraining updates the baseline, and your drift detection automatically recalibrates.
What window size should I use?
Start with one hour. If you see too many false positives from normal variance, increase to four or six hours. If your traffic is high enough (thousands of predictions per hour), you can go as low as 15 minutes for fast detection of acute failures like null rate spikes.
Is there a performance cost to maintaining these views?
The views add continuous aggregation work proportional to your prediction throughput. For most ML serving workloads, this is negligible. RisingWave is designed for incremental maintenance -- updates are processed as diffs, not full recomputes.
Should I monitor all features or just the most important ones?
Start with your model's top-10 features by SHAP value or feature importance score. Add null rate monitoring for all features. Expand coverage iteratively based on what you find.
Feature drift is not a data science problem -- it is an operational problem. The tools to catch it early exist today. A streaming database, a few materialized views, and a Kafka sink are all you need to move your detection window from hours to minutes. The cost of not doing this shows up in your business metrics, usually at the worst possible time.

