The Online/Offline Feature Store Problem: How a Streaming Database Solves It

Every feature store tutorial eventually explains "offline" and "online" stores as if the distinction is natural and obvious. It is not. It is an architectural compromise that accumulated over years, and it has a specific cost: your model will silently underperform in production compared to how it behaved in evaluation. That cost has a name — training-serving skew — and most ML teams discover it only after deploying a model that seemed ready.

This article explains where the skew comes from, why the traditional dual-store architecture makes it almost inevitable, and how computing features exactly once in a streaming database eliminates it at the architectural level.

The Dual-Computation Architecture and Why It Exists

Traditional feature stores emerged from a practical constraint: different parts of the ML lifecycle need features at different scales and with different access patterns.

Training needs historical features. A fraud model trained on six months of transaction data needs to compute features like "sum of transaction amounts in the last 7 days" for every timestamp in that dataset. This is a batch workload — you scan billions of rows, aggregate by user, and write the results to a file. Spark, BigQuery, or dbt handle this well. The output lands in an "offline store," typically Parquet files in S3 or a data warehouse table.

Serving needs current features. When a transaction arrives in production, your model needs those same features computed for right now, for that specific user, in under 100ms. Batch jobs cannot help here. So you run a separate streaming pipeline — Flink, Kafka Streams, or a custom consumer — that maintains aggregates in a low-latency key-value store like Redis or DynamoDB. This is the "online store."

The problem is that these two computation paths are separate code, running on separate systems, written and maintained by separate teams. They are supposed to compute the same features. They usually almost do.

Training-Serving Skew: How It Silently Kills Model Accuracy

Training-serving skew is the difference between the features your model was trained on and the features it sees in production. When the same feature logic is implemented twice, subtle divergences are nearly unavoidable.

Consider a seemingly simple feature: the number of transactions a user made in the last 30 days. In your offline pipeline, "30 days" is computed relative to each row's timestamp, and your Spark job handles this with a window function over a sorted partition. In your online pipeline, your Kafka Streams application maintains a sliding window using event time with a specific watermark strategy. If the watermark handling differs from how the batch job defines "last 30 days," the features diverge.

Now multiply this across dozens of features, each with its own edge case: how do you handle null values? What timezone does "day" mean? What happens when a user has no activity for a period? Each decision made slightly differently in two codebases produces a small feature mismatch. Individually, these mismatches are hard to detect. Collectively, they erode model performance.

A concrete example: imagine a fraud detection model trained over six months of historical data. The offline pipeline computes a "velocity" feature — transactions per hour in a rolling 24-hour window — using BigQuery's RANGE-based window framing. The online pipeline computes the same feature in a Flink job using event-time processing with a 10-second watermark. The logic is close, but not identical. The watermark causes late events to be dropped in the online pipeline that were counted in the offline pipeline.
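
To make the divergence tangible, here is roughly what the offline side of that velocity feature might look like as a warehouse-style window query. This is a PostgreSQL-flavored sketch for illustration, not the actual BigQuery job:

-- Offline (batch) definition: transaction velocity over a rolling 24-hour
-- window, computed relative to each row's own event time. The online Flink
-- job expresses the "same" window with event-time state and a watermark,
-- which is exactly where the two definitions can drift apart.
SELECT
    user_id,
    event_time,
    COUNT(*) OVER (
        PARTITION BY user_id
        ORDER BY event_time
        RANGE BETWEEN INTERVAL '24 hours' PRECEDING AND CURRENT ROW
    ) / 24.0 AS velocity_tx_per_hour
FROM transactions;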

In evaluation against the historical dataset, the model performs well. AUC looks good. You deploy. In production, the velocity feature is computed slightly differently, and the model's fraud probability scores drift from what the training distribution would predict. False negative rates creep up. You spend weeks debugging before discovering the mismatch — if you discover it at all.

Skew is especially treacherous because it degrades gradually. The model does not fail catastrophically. It just quietly underperforms, and you attribute the gap to "distribution shift" or "data quality issues" rather than a structural flaw in how features are computed.
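
If you suspect skew, one way to confirm it is to log the feature values actually used at serving time and periodically compare them against the offline pipeline's recomputation for the same entities and timestamps. A minimal sketch, assuming hypothetical tables online_feature_log and offline_features keyed by user, feature name, and an as_of timestamp:

-- Skew audit (illustrative): how often, and by how much, do the two
-- pipelines disagree on the same feature for the same user and time?
SELECT
    o.feature_name,
    COUNT(*) FILTER (WHERE o.value IS DISTINCT FROM b.value) AS mismatches,
    COUNT(*) AS total,
    AVG(ABS(o.value - b.value)) AS mean_abs_diff
FROM online_feature_log o
JOIN offline_features b
  ON  b.user_id = o.user_id
  AND b.feature_name = o.feature_name
  AND b.as_of = o.as_of
GROUP BY o.feature_name
ORDER BY mismatches DESC;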

The Single-Computation Alternative

The root cause is clear: you are computing the same feature logic twice. The solution is equally clear: compute it once.

A streaming database changes the architecture fundamentally. Instead of a batch pipeline and a streaming pipeline producing separate feature stores, you define feature logic once as a materialized view. The streaming database continuously maintains that view as new events arrive. The same view serves both use cases:

  • Online serving: query the materialized view directly via SQL. Latency is in the tens of milliseconds for point lookups.
  • Offline training data: export snapshots of the materialized view to S3 as Parquet files using a sink connector. The snapshot reflects exactly the same computation the online queries use.

One code path. One definition of "30-day window." One place to fix a bug or change logic. Training data and serving data are guaranteed to be consistent because they come from the same materialized view.

Architecture: Kafka to RisingWave

The architecture is straightforward to reason about. Events flow from Kafka into RisingWave. RisingWave continuously maintains materialized views that represent your feature logic. Those views are queryable via the PostgreSQL wire protocol for online serving. A sink connector exports view snapshots to S3 in Parquet format, which your training pipelines read.

Kafka Topics (raw events)
        |
        v
   RisingWave
   - Sources (Kafka consumers)
   - Materialized Views (feature logic)
        |
        +-----> PostgreSQL protocol ---> Online serving (model inference)
        |
        +-----> S3 Sink (Parquet) ------> Offline training data

No separate Spark batch job for historical features. No Redis to maintain separately. No Flink cluster to manage. The materialized view is the single source of truth for both paths.

SQL Walkthrough

Let us make this concrete with a fraud detection feature set.

Step 1: Define the Kafka Source

CREATE SOURCE transactions (
    user_id     VARCHAR,
    amount      NUMERIC,
    merchant_id VARCHAR,
    country     VARCHAR,
    event_time  TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'transactions',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE JSON;

Step 2: Create Feature Materialized Views

-- Rolling 24-hour transaction velocity and spend per user
CREATE MATERIALIZED VIEW user_transaction_features AS
WITH tx_with_prev_country AS (
    -- Window functions cannot appear inside an aggregate FILTER clause,
    -- so compute each transaction's previous country per user first.
    SELECT
        *,
        LAG(country) OVER (
            PARTITION BY user_id ORDER BY event_time
        ) AS prev_country
    FROM transactions
)
SELECT
    user_id,
    COUNT(*) FILTER (
        WHERE event_time > NOW() - INTERVAL '24 hours'
    ) AS tx_count_24h,
    SUM(amount) FILTER (
        WHERE event_time > NOW() - INTERVAL '24 hours'
    ) AS tx_amount_24h,
    COUNT(DISTINCT merchant_id) FILTER (
        WHERE event_time > NOW() - INTERVAL '7 days'
    ) AS unique_merchants_7d,
    COUNT(*) FILTER (
        WHERE country != prev_country
          AND event_time > NOW() - INTERVAL '1 hour'
    ) AS country_changes_1h,
    MAX(event_time) AS last_seen
FROM tx_with_prev_country
GROUP BY user_id;

-- Merchant-level baseline statistics (for anomaly detection)
CREATE MATERIALIZED VIEW merchant_baseline_features AS
SELECT
    merchant_id,
    AVG(amount) AS avg_tx_amount,
    STDDEV(amount) AS stddev_tx_amount,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95_tx_amount,
    COUNT(*) AS total_tx_count
FROM transactions
WHERE event_time > NOW() - INTERVAL '30 days'
GROUP BY merchant_id;

These views update continuously as new events arrive. There is no schedule, no batch job, no cron.
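
You can verify this with an ad-hoc query at any time; the view reflects everything ingested so far, with no refresh step in between:

-- Ad-hoc sanity check: the most recently active users and their features
SELECT user_id, tx_count_24h, tx_amount_24h, last_seen
FROM user_transaction_features
ORDER BY last_seen DESC
LIMIT 10;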

Step 3: Online Serving

At inference time, your application queries these views directly over the PostgreSQL wire protocol:

-- Fetch features for a single user (called at inference time)
-- Parameters: $1 = user_id, $2 = merchant_id, $3 = amount of the transaction being scored
SELECT
    u.tx_count_24h,
    u.tx_amount_24h,
    u.unique_merchants_7d,
    u.country_changes_1h,
    m.avg_tx_amount,
    m.stddev_tx_amount,
    m.p95_tx_amount,
    -- Derived feature: how anomalous is this transaction vs. merchant baseline?
    ($3::NUMERIC - m.avg_tx_amount) / NULLIF(m.stddev_tx_amount, 0)
        AS amount_z_score
FROM user_transaction_features u
CROSS JOIN merchant_baseline_features m
WHERE u.user_id = $1
  AND m.merchant_id = $2;

Point lookups on a materialized view's primary key return in milliseconds. For typical fraud scoring workloads (hundreds to low thousands of requests per second), this is fast enough.

Step 4: Exporting to S3 for Training

When you want to build a training dataset, you export a snapshot of the feature view to S3:

CREATE SINK user_features_snapshot
FROM user_transaction_features
WITH (
    connector = 's3',
    s3.bucket_name = 'ml-training-data',
    s3.path = 'features/user_transaction/snapshot_20260401/',
    s3.credentials.access = '...',
    s3.credentials.secret = '...',
    type = 'append-only'
)
FORMAT PLAIN ENCODE PARQUET;

The exported Parquet files contain exactly the same feature values that online serving would return for those users at that point in time. The computation is identical because it is the same computation -- not a reimplementation.

Your training pipeline reads from S3 as usual:

import pandas as pd

features = pd.read_parquet(
    "s3://ml-training-data/features/user_transaction/snapshot_20260401/"
)
labels = pd.read_parquet("s3://ml-training-data/labels/fraud_labels_april2026/")

training_data = features.merge(labels, on="user_id")

Handling Backfill: Historical Features for Retraining

Backfill is the hard problem in any feature store migration. If your model needs six months of historical features and you just switched to a streaming database, you cannot simply query the materialized view -- it only reflects current and recent state.

RisingWave supports two backfill strategies.

The first is historical table import. If you have historical event data in S3 or a database, you can define a batch source that reads it and create a separate materialized view for the historical period. You then union the historical view with the live view for training exports:

-- Historical backfill source (one-time scan)
CREATE SOURCE transactions_historical (
    user_id     VARCHAR,
    amount      NUMERIC,
    merchant_id VARCHAR,
    country     VARCHAR,
    event_time  TIMESTAMPTZ
)
WITH (
    connector = 's3',
    s3.bucket_name = 'raw-events',
    s3.path = 'transactions/2025/',
    s3.credentials.access = '...',
    s3.credentials.secret = '...'
)
FORMAT PLAIN ENCODE PARQUET;

-- Compute historical features using the same logic
CREATE MATERIALIZED VIEW user_transaction_features_historical AS
SELECT
    user_id,
    DATE_TRUNC('day', event_time) AS feature_date,
    COUNT(*) AS tx_count_24h,
    SUM(amount) AS tx_amount_24h
    -- same logic, windowed per day for point-in-time correctness
FROM transactions_historical
GROUP BY user_id, DATE_TRUNC('day', event_time);
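
With the historical view in place, the union itself can be expressed once and exported like any other view. A minimal sketch, reusing the simplified column set from the historical view above; the live rows take their feature date from last_seen:

-- Combine backfilled and live features for training exports (illustrative)
CREATE MATERIALIZED VIEW user_features_combined AS
SELECT user_id, feature_date, tx_count_24h, tx_amount_24h
FROM user_transaction_features_historical
UNION ALL
SELECT
    user_id,
    DATE_TRUNC('day', last_seen) AS feature_date,
    tx_count_24h,
    tx_amount_24h
FROM user_transaction_features;

The S3 sink from Step 4 can then read from user_features_combined instead of the live view alone.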

The second strategy, when historical data is unavailable, is to start capturing feature snapshots at regular intervals, for example by scheduling the S3 export from Step 4 to run daily. Over time, you accumulate a history of snapshots that can serve as training data for future model versions.

Neither approach is as seamless as having maintained a streaming database from the beginning, but both are workable migration paths.

Honest Limitations

A streaming database is not always the right answer for feature serving, and it is worth being specific about when it is not.

Very high QPS serving (sub-1ms latency requirements): If your model scoring path must return features in under 1 millisecond at 50,000+ requests per second, a streaming database query will not meet that bar. SQL query overhead, even for indexed point lookups, adds latency. In these cases, you still want Redis or a purpose-built key-value store as a cache layer. The architecture becomes: RisingWave computes features, a Redis sink writes them to Redis, and inference reads from Redis. You still get the single-computation guarantee, with Redis acting as a materialized cache rather than a separately computed store.
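
As a sketch, that cache layer can itself be a sink, so Redis holds values computed by the same materialized view rather than a second implementation. The connector parameters below are illustrative; check the Redis sink documentation for your RisingWave version:

-- Illustrative only: push the feature view into Redis for sub-millisecond reads
CREATE SINK user_features_to_redis
FROM user_transaction_features
WITH (
    connector = 'redis',
    redis.url = 'redis://redis-cache:6379/',
    primary_key = 'user_id'
) FORMAT PLAIN ENCODE JSON;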

Pre-built feature reuse across teams: Large organizations often invest in feature platforms that allow teams to discover and share features. Feast and Tecton provide feature registries, lineage tracking, and permission models. A streaming database excels at the computation layer but does not replace the organizational layer. If your team is already using a feature platform, RisingWave can plug in as the computation backend while the platform handles discovery and governance.

Very complex feature logic: Some features are easier to express in Python than SQL. Rolling window models, complex sequential patterns, or features requiring external lookups are manageable in SQL but may be more natural to write in user-defined functions or custom stream processors. RisingWave supports UDFs in Python and Java, which helps, but very computation-heavy feature logic may still warrant a dedicated streaming application.

These are real tradeoffs, not reasons to avoid the architecture. For the majority of ML teams dealing with training-serving skew, the streaming database approach removes a class of bugs without introducing significant complexity.

FAQ

Won't querying a materialized view add latency compared to Redis?

For most workloads, yes -- a SQL query against RisingWave will take 10-50ms versus sub-millisecond Redis reads. Whether that matters depends on your total inference budget. If your model inference itself takes 100ms, 20ms for feature retrieval is acceptable. If you need end-to-end latency under 10ms, you will want Redis as a cache layer in front of RisingWave.

How does RisingWave handle late-arriving events?

RisingWave uses watermarks to handle event-time processing, consistent with how systems like Flink manage late data. You define a watermark policy on your source, and materialized views respect it when computing event-time windows. The key difference from the dual-pipeline architecture: both your online queries and your exported training data use the same watermark policy, so the handling of late events is identical in both contexts.
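
As a sketch, the transactions source from the walkthrough could declare its watermark like this; the five-second bound is illustrative, not a recommendation:

CREATE SOURCE transactions (
    user_id     VARCHAR,
    amount      NUMERIC,
    merchant_id VARCHAR,
    country     VARCHAR,
    event_time  TIMESTAMPTZ,
    -- events arriving more than 5 seconds behind the watermark are treated as late
    WATERMARK FOR event_time AS event_time - INTERVAL '5 seconds'
)
WITH (
    connector = 'kafka',
    topic = 'transactions',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE JSON;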

What happens if RisingWave goes down? Do we lose features for serving?

RisingWave is designed for high availability with fault tolerance and state recovery. On restart after failure, it resumes processing from its last checkpoint and catches up. During downtime, queries against the materialized view will fail -- there is no automatic fallback to Redis or another store unless you build one. For mission-critical inference paths, you should account for this in your availability design, either by maintaining a Redis cache layer or designing your inference path to fall back to default feature values.

Can we use this with existing Feast or Tecton deployments?

Yes, partially. You can use RisingWave to compute and maintain features, then write them to Feast's online store (Redis) and offline store (S3) via sinks. This gives you the single-computation guarantee while keeping the Feast registry and SDK. It requires configuring RisingWave sinks to write in formats Feast expects, but it is architecturally straightforward.

Is this approach production-ready?

RisingWave is in production use at multiple companies for real-time analytics and feature serving workloads. The pattern described here -- materialized views for online serving, S3 sinks for offline export -- is a standard use case. The limitations described above are real, but for most ML teams, this approach is significantly simpler to operate and reason about than maintaining separate batch and streaming feature pipelines.


Training-serving skew is a systems problem, not a modeling problem. It exists because feature logic is written and run in two separate places. The fix is architectural: compute features once, derive both your training data and your serving data from the same source. A streaming database makes this possible without requiring you to build and maintain a custom feature store. The SQL is approachable, the architecture is debuggable, and the skew goes away by construction.
