Feature pipelines have a reputation for failing quietly. Unlike a web service that throws a 500 and wakes up an on-call engineer, a broken feature pipeline often just silently serves degraded data -- stale values, dropped records, shifted distributions -- while your model keeps returning confident-looking predictions. Users notice something is wrong before your monitoring does.
This article is about closing that gap. We will look at the three main failure modes for feature pipelines and build a SQL-based observability stack that catches each one continuously, before the model's performance metrics start moving.
The Three Silent Failure Modes
Before writing any monitoring queries, it helps to understand exactly how feature pipelines break.
Failure Mode 1: Freshness Lag
Your pipeline runs. Data flows in. But somewhere in the chain -- a Kafka consumer lag spike, a batch job that takes longer than its schedule, a backpressure event in the stream processor -- the output features stop updating while the pipeline itself continues running. Infrastructure monitors show green. CPU and memory look fine. But the model is now scoring requests against features computed 20 minutes ago for a use case where the SLA is 60 seconds.
Real example: a fraud detection model uses a rolling 5-minute transaction velocity feature. If the feature pipeline falls 10 minutes behind, that velocity count is zero for every new request -- the model sees the world as if no transactions have happened recently, and suppresses fraud scores for active fraudsters.
Failure Mode 2: Data Incompleteness
Records get dropped. A schema change upstream causes a deserialization error. A new region starts sending events in a slightly different format. A join drops rows because a reference table is missing keys. The null rate for user_credit_score quietly climbs from 0.3% to 18%, and no one notices because the model falls back to the population mean and keeps returning predictions.
This failure mode is insidious because it starts small. A 2% null rate is often just noise; a 15% null rate is a pipeline break in disguise.
Failure Mode 3: Distribution Drift
The pipeline is running. Records are arriving. Null rates are normal. But the distribution of feature values has shifted -- the mean of session_duration_seconds has dropped by 40%, or a new category value "enterprise_v2" appears in account_tier that the model has never seen during training. The model receives valid-looking inputs but makes predictions using patterns that no longer hold.
Infrastructure monitoring cannot catch any of these. You need data-aware monitoring that runs continuously against your feature values, not just your system metrics.
Freshness Monitoring: Measuring Feature Lag
The core idea is simple: log when each feature group was last computed, then continuously query how stale those timestamps are.
Start by creating a log table that your feature computation jobs write to:
CREATE TABLE feature_computation_log (
feature_group VARCHAR,
computed_at TIMESTAMPTZ,
record_count BIGINT,
source_lag_ms BIGINT -- optional: upstream Kafka lag at time of computation
);
Then build a materialized view that continuously measures staleness:
-- Freshness monitor: how stale is each feature?
CREATE MATERIALIZED VIEW feature_freshness_monitor AS
SELECT
feature_group,
MAX(computed_at) AS last_updated,
EXTRACT(EPOCH FROM (NOW() - MAX(computed_at))) AS staleness_seconds,
CASE
WHEN EXTRACT(EPOCH FROM (NOW() - MAX(computed_at))) > 300 THEN 'STALE'
WHEN EXTRACT(EPOCH FROM (NOW() - MAX(computed_at))) > 60 THEN 'DEGRADED'
ELSE 'HEALTHY'
END AS status
FROM feature_computation_log
GROUP BY feature_group;
Because this is a materialized view in RisingWave, it updates continuously as new rows arrive in feature_computation_log. You can query it at any time and get a live answer without scanning the full history.
Setting Freshness Thresholds
The right thresholds depend on your use case. A good starting framework:
| Use Case | DEGRADED Threshold | STALE Threshold |
| --- | --- | --- |
| Fraud detection | 30s | 120s |
| Recommendation ranking | 5m | 15m |
| Credit scoring | 30m | 2h |
| Churn prediction | 4h | 24h |
Start conservative (wider thresholds) and tighten as you learn your pipeline's normal behavior. Alert on STALE; page on STALE that persists for more than 2x the threshold.
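To make the framework concrete, here is an illustrative Python helper that encodes the thresholds from the table above; the use-case names and the config dict are examples for this sketch, not a standard, and in practice these values would live in your alerting configuration rather than in code:

```python
# Illustrative per-use-case freshness thresholds, in seconds,
# mirroring the table above. Names and values are examples only.
FRESHNESS_THRESHOLDS = {
    "fraud_detection":        {"degraded": 30,       "stale": 120},
    "recommendation_ranking": {"degraded": 5 * 60,   "stale": 15 * 60},
    "credit_scoring":         {"degraded": 30 * 60,  "stale": 2 * 3600},
    "churn_prediction":       {"degraded": 4 * 3600, "stale": 24 * 3600},
}

def freshness_status(use_case: str, staleness_seconds: float) -> str:
    """Classify staleness the same way the SQL CASE expression does."""
    t = FRESHNESS_THRESHOLDS[use_case]
    if staleness_seconds > t["stale"]:
        return "STALE"
    if staleness_seconds > t["degraded"]:
        return "DEGRADED"
    return "HEALTHY"
```

The same three-state classification appears in the SQL view; keeping a copy of the thresholds in application code is mainly useful for alert-routing logic outside the database.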
Per-Feature SLA Tracking
If different features within a group have different SLAs, track them individually:
CREATE MATERIALIZED VIEW feature_sla_violations AS
SELECT
f.feature_group,
f.feature_name,
f.sla_seconds,
fm.staleness_seconds,
fm.staleness_seconds - f.sla_seconds AS sla_breach_seconds
FROM feature_sla_config f
JOIN feature_freshness_monitor fm ON f.feature_group = fm.feature_group
WHERE fm.staleness_seconds > f.sla_seconds;
Completeness Monitoring: Null Rates, Row Counts, and Coverage
Freshness tells you when features were computed. Completeness tells you whether the computation produced useful output.
-- Completeness: null rate by feature
CREATE MATERIALIZED VIEW feature_completeness AS
SELECT
feature_name,
COUNT(*) AS total_rows,
COUNT(*) FILTER (WHERE value IS NULL) AS null_count,
COUNT(*) FILTER (WHERE value IS NULL)::float / NULLIF(COUNT(*), 0) AS null_rate,
CASE WHEN COUNT(*) FILTER (WHERE value IS NULL)::float / NULLIF(COUNT(*), 0) > 0.05
THEN 'ALERT' ELSE 'OK' END AS completeness_status
FROM feature_serving_log
WHERE logged_at >= NOW() - INTERVAL '1 hour'
GROUP BY feature_name;
The NULLIF(COUNT(*), 0) guard prevents division-by-zero when a feature group stops receiving data entirely -- itself a signal worth alerting on.
Row Count Monitoring
A sudden drop in row count is often the first indicator of an upstream schema break or a dead consumer. Compare the current window against a historical baseline:
CREATE MATERIALIZED VIEW feature_volume_monitor AS
SELECT
feature_group,
DATE_TRUNC('hour', logged_at) AS hour_bucket,
COUNT(*) AS rows_in_hour,
AVG(COUNT(*)) OVER (
PARTITION BY feature_group
ORDER BY DATE_TRUNC('hour', logged_at)
ROWS BETWEEN 24 PRECEDING AND 1 PRECEDING
) AS avg_rows_previous_24h,
COUNT(*) / NULLIF(
AVG(COUNT(*)) OVER (
PARTITION BY feature_group
ORDER BY DATE_TRUNC('hour', logged_at)
ROWS BETWEEN 24 PRECEDING AND 1 PRECEDING
), 0
) AS volume_ratio
FROM feature_serving_log
GROUP BY feature_group, DATE_TRUNC('hour', logged_at);
A volume_ratio below 0.7 (30% drop from the 24-hour average) warrants investigation. Below 0.3 is a pipeline break.
Distinct Value Coverage
For categorical features, track how many unique values appear in a window versus your expected cardinality:
CREATE MATERIALIZED VIEW feature_cardinality_monitor AS
SELECT
feature_name,
COUNT(DISTINCT value) AS distinct_values_last_hour,
MAX(expected_cardinality) AS expected_cardinality,
COUNT(DISTINCT value)::float / NULLIF(MAX(expected_cardinality), 0) AS coverage_ratio
FROM feature_serving_log
JOIN feature_metadata USING (feature_name)
WHERE logged_at >= NOW() - INTERVAL '1 hour'
GROUP BY feature_name;
Low coverage can indicate that certain user segments or product categories have stopped generating events.
Distribution Monitoring: Mean Shift, Variance, and Category Drift
Distribution monitoring is the hardest of the three because "normal" is inherently relative. The goal is to detect when current values look meaningfully different from a recent baseline.
Numeric Feature Statistics
CREATE MATERIALIZED VIEW feature_distribution_stats AS
SELECT
feature_name,
AVG(value::float) AS current_mean,
STDDEV(value::float) AS current_stddev,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value::float) AS current_median,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value::float) AS current_p95
FROM feature_serving_log
WHERE logged_at >= NOW() - INTERVAL '1 hour'
AND value IS NOT NULL
GROUP BY feature_name;
Compare against a baseline window (e.g., the same hour yesterday or a rolling 7-day average) to flag significant shifts:
CREATE MATERIALIZED VIEW feature_mean_shift_monitor AS
SELECT
curr.feature_name,
curr.current_mean,
base.baseline_mean,
ABS(curr.current_mean - base.baseline_mean) / NULLIF(base.baseline_stddev, 0) AS z_score,
CASE
WHEN ABS(curr.current_mean - base.baseline_mean) / NULLIF(base.baseline_stddev, 0) > 3 THEN 'ALERT'
WHEN ABS(curr.current_mean - base.baseline_mean) / NULLIF(base.baseline_stddev, 0) > 2 THEN 'WARN'
ELSE 'OK'
END AS drift_status
FROM feature_distribution_stats curr
JOIN feature_baseline_stats base USING (feature_name);
The z-score approach normalizes the shift by historical variance, so a 10-point shift on a feature with stddev 100 triggers differently than the same shift on a feature with stddev 2.
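That claim is easy to sanity-check with the numbers from the sentence above; this is a quick illustrative calculation, not part of the pipeline:

```python
def mean_shift_z(current_mean: float, baseline_mean: float,
                 baseline_stddev: float) -> float:
    """Same formula as the view: |shift| normalized by historical stddev."""
    return abs(current_mean - baseline_mean) / baseline_stddev

# A 10-point shift is unremarkable when the feature normally varies by 100...
print(mean_shift_z(110, 100, 100))  # 0.1 -- well below the WARN threshold of 2
# ...but the identical shift is a strong drift signal when stddev is 2.
print(mean_shift_z(110, 100, 2))    # 5.0 -- above the ALERT threshold of 3
```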
Categorical Feature Drift: New Values
New category values are a specific risk because they often map to out-of-vocabulary embeddings or cause model fallbacks. Detect them proactively:
CREATE MATERIALIZED VIEW feature_new_categories AS
SELECT
s.feature_name,
s.value AS new_category_value,
COUNT(*) AS occurrences_last_hour
FROM feature_serving_log s
LEFT JOIN feature_known_categories k
ON s.feature_name = k.feature_name AND s.value = k.known_value
WHERE s.logged_at >= NOW() - INTERVAL '1 hour'
AND k.known_value IS NULL
AND s.value IS NOT NULL
GROUP BY s.feature_name, s.value
HAVING COUNT(*) > 10;
The HAVING COUNT(*) > 10 filter reduces noise from typos and one-off values. Tune this threshold based on your event volume.
Building a Monitoring Dashboard
All the materialized views above can be exposed to Prometheus via a simple exporter that queries them on a scrape interval. The recommended metrics to export:
Freshness metrics:
- feature_staleness_seconds{feature_group="..."} -- current staleness
- feature_sla_breach_seconds{feature_group="..."} -- seconds past SLA (0 if healthy)
Completeness metrics:
- feature_null_rate{feature_name="..."} -- current null rate (0.0-1.0)
- feature_volume_ratio{feature_group="..."} -- current volume vs. 24h baseline
Distribution metrics:
- feature_mean_zscore{feature_name="..."} -- z-score of current mean vs. baseline
- feature_new_category_count{feature_name="..."} -- count of unseen categories in last hour
A Grafana dashboard built on these metrics should have three panels per feature group: one red/green status panel for freshness, one sparkline for null rate over time, and one heatmap for mean drift over the past 24 hours.
For high-level alerting, a single feature_health_score metric -- computed as the fraction of features currently in HEALTHY state -- gives leadership a single number without requiring them to understand the individual signals.
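As a sketch of the exporter's rendering step -- in Python, producing the Prometheus text exposition format from rows fetched out of the monitoring views. The function and its inputs are illustrative; wiring it to your database client and an HTTP /metrics endpoint is left out:

```python
def render_prometheus(metric: str, label: str, rows) -> str:
    """Render (label_value, metric_value) pairs as Prometheus text format.

    rows: iterable of (label_value, value) tuples, e.g. the result of
    SELECT feature_group, staleness_seconds FROM feature_freshness_monitor.
    """
    lines = [f"# TYPE {metric} gauge"]
    for label_value, value in rows:
        lines.append(f'{metric}{{{label}="{label_value}"}} {float(value)}')
    return "\n".join(lines) + "\n"
```

On each scrape, the exporter re-queries the views and returns the rendered text, so the metrics are always as fresh as the materialized views themselves.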
Alert Fatigue: Setting Sensible Thresholds
The fastest way to kill an observability program is to generate alerts that engineers learn to ignore. A few principles that help:
Differentiate warn from alert. Send WARN to Slack. Send ALERT to PagerDuty. Engineers should be able to glance at Slack warnings during business hours without escalating.
Use rate-of-change, not just absolute value. A null rate of 5% that has been stable for a week is different from a null rate that jumped from 0.5% to 5% in the last 30 minutes. Alert on the jump, not the level.
Require persistence before escalating. A freshness SLA breach that self-resolves in under 2 minutes is probably a routine processing hiccup. Alert only when the breach persists for longer than one pipeline cycle.
Suppress correlated alerts. If 15 features share the same upstream Kafka topic and that topic goes down, you do not want 15 simultaneous pages. Group alerts by pipeline/source and deduplicate at the alertmanager level.
Build alert inhibition rules. If infrastructure_kafka_consumer_lag is already alerting, suppress all freshness alerts. The root cause is known; the derived alerts add no information.
A reasonable starting configuration for most ML platforms:
| Signal | Warn Threshold | Alert Threshold | Alert Duration |
| --- | --- | --- | --- |
| Freshness lag | 2x SLA | 5x SLA | 5 minutes |
| Null rate | 2x baseline | 5x baseline or >10% absolute | 10 minutes |
| Volume ratio | <0.8 | <0.5 | 10 minutes |
| Mean z-score | >2 | >3 | 15 minutes |
| New categories | >5 in 1h | >50 in 1h | immediate |
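The "Alert Duration" column implies stateful evaluation: a breach escalates only after it has persisted. A minimal sketch of that logic, with an illustrative class name -- in production this usually lives in your alertmanager's `for:` clause rather than in custom code:

```python
import time

class PersistentAlert:
    """Escalate only after a condition has held continuously for min_duration_s."""

    def __init__(self, min_duration_s):
        self.min_duration_s = min_duration_s
        self.breach_started = None  # wall-clock time the current breach began

    def evaluate(self, breached, now=None):
        now = time.time() if now is None else now
        if not breached:
            self.breach_started = None      # condition cleared; reset the clock
            return "OK"
        if self.breach_started is None:
            self.breach_started = now       # breach just started
        if now - self.breach_started >= self.min_duration_s:
            return "ALERT"                  # persisted long enough to page
        return "PENDING"                    # breached, but not yet persistent
```

A freshness breach evaluated through `PersistentAlert(300)` pages only once the SLA has been violated for five continuous minutes, matching the first row of the table.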
Runbooks: What to Do When Each Alarm Fires
Monitoring is only useful if engineers know what to do when it fires. These runbooks should live next to your alerting rules.
Freshness Alert Runbook
- Check feature_freshness_monitor to confirm which feature groups are stale and by how much.
- Check upstream Kafka consumer lag for the affected pipeline. If lag is growing, the stream processor is falling behind -- check for resource contention or a backpressure event.
- Check the feature computation job logs for errors. A deserialization failure will cause the job to skip records silently but still appear healthy in infrastructure monitoring.
- If the pipeline is healthy but the log table is not being updated, check whether the write path to feature_computation_log itself is broken (connection pool exhaustion is a common culprit).
- If the staleness exceeds 10x SLA, consider serving cached features from the previous computation and flagging affected predictions as degraded in your model serving layer.
Completeness Alert Runbook
- Query feature_completeness to identify which features have elevated null rates and when the rate started rising.
- Check whether the null rate increase correlates with a deployment, schema migration, or upstream data source change.
- Pull a sample of null rows from feature_serving_log and examine the raw source event to determine whether the upstream field is missing or the parsing logic is failing.
- If the null rate is above 20% for a feature that directly influences model output, consider disabling that feature and retraining with it excluded, or falling back to the population mean with explicit flagging.
Distribution Drift Alert Runbook
- Query feature_mean_shift_monitor to see which features have elevated z-scores and in which direction the shift occurred.
- Check whether the shift corresponds to a real-world event (e.g., a product launch, a regional promotion, a seasonality pattern) before treating it as a bug.
- If the shift is unexplained, check the upstream data source for schema changes or new data sources being merged in.
- For new category alerts, run a query against feature_new_categories to see which values are new and how often they appear. Decide whether to add them to the known-values list or investigate the upstream source.
- If drift is confirmed and unexplained, trigger a model performance evaluation against held-out data from the affected period before assuming the model is still reliable.
FAQ
Do I need RisingWave specifically to run these queries?
The SQL patterns work in any system that supports materialized views with continuous updates -- RisingWave, Materialize, or Apache Flink SQL. The advantage of RisingWave is that the materialized views update incrementally in real time without requiring you to write explicit streaming logic. In a traditional database or data warehouse, you would run these as scheduled batch queries, which introduces a monitoring lag of its own.
What if my feature store does not have a serving log?
Add one. Instrument your feature serving layer to write a record to a Kafka topic every time features are requested and returned. This log is the foundation of all completeness and distribution monitoring. The overhead is low -- you can sample at 1-10% if volume is a concern -- and the observability payoff is significant.
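If you do sample, prefer a deterministic hash of a stable key over random sampling, so that every feature logged for a given request is either all kept or all skipped and the sampled rows stay joinable across services. A sketch, where the key choice and rate are assumptions:

```python
import hashlib

def should_log(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically keep ~sample_rate of requests by hashing the request id.

    Hash-based sampling keeps the decision consistent everywhere the same
    request id is seen: all of a request's features are logged, or none are.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    # Interpret the first 8 bytes as a fraction in [0, 1) and threshold it.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the decision depends only on the id, a request sampled in the serving layer is also sampled in any downstream consumer that applies the same function.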
How do I handle features with known seasonal patterns?
Compare against the same time window from the previous week rather than a rolling average. A session_duration_seconds distribution on a Sunday night will naturally differ from Monday morning. Building day-of-week and hour-of-day baselines prevents seasonal variation from generating constant drift alerts.
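In practice a seasonal baseline just means keying baseline statistics by (day-of-week, hour) instead of keeping one global number. An illustrative sketch, with assumed class and method names:

```python
from collections import defaultdict
from datetime import datetime

def baseline_key(ts: datetime) -> tuple:
    """Bucket timestamps by (day-of-week, hour-of-day)."""
    return (ts.weekday(), ts.hour)

class SeasonalBaseline:
    """Track a per-(weekday, hour) running mean instead of one global baseline."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, ts: datetime, value: float) -> None:
        key = baseline_key(ts)
        self.sums[key] += value
        self.counts[key] += 1

    def mean(self, ts: datetime) -> float:
        key = baseline_key(ts)
        return self.sums[key] / self.counts[key]
```

With this keying, a Sunday-22:00 observation is compared only against the mean of past Sunday-22:00 windows, so weekly seasonality never looks like drift.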
What is the right cadence for updating baseline statistics?
For most features, recalculate baselines weekly. For fast-moving systems, daily. Avoid recalculating baselines automatically when a drift alert is active -- you do not want an alert to silently update its own baseline and resolve itself.
Should I monitor features at training time or serving time?
Both, ideally. Training-serving skew -- where the feature distribution at serving time differs from the distribution the model was trained on -- is a category of drift that serving-time monitoring alone cannot catch. Periodically compare your serving distribution against the training dataset distribution to detect skew before it becomes a performance problem.
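One common way to quantify that training-vs-serving comparison is the Population Stability Index (PSI) over matched histogram bins; this is a general drift metric, not something specific to the stack above. A sketch:

```python
import math

def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    expected/actual: raw counts per bin (same binning for both), e.g. the
    training-set histogram vs. the last day's serving histogram.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major.
    """
    e_total, a_total = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_frac = max(e / e_total, eps)  # clamp to avoid log(0)
        a_frac = max(a / a_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score
```

Running this periodically with the training histogram as `expected` and a rolling serving-window histogram as `actual` gives a single skew number you can export alongside the other metrics.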
Feature pipeline observability is not glamorous work, but it is the difference between catching a model degradation in minutes and discovering it a week later from a user complaint. The three failure modes -- freshness, completeness, and drift -- each require different monitoring approaches, but all three can be implemented as SQL queries running continuously against your feature data. Build the monitoring stack once, tune the thresholds over a few weeks, and you will spend far less time debugging silent failures in production.

