Real-Time Feature Store for Machine Learning: A Streaming Database Approach
Feature stores fail in production because they separate feature computation from feature serving. A streaming database like RisingWave collapses that gap by computing features incrementally as events arrive, then serving them directly over a standard PostgreSQL interface — no separate refresh job, no stale cache.
The Problem: Feature Stores Are a Duct-Tape Solution
Every ML team hits the same wall. Training a model requires aggregated, cleaned, joined features across multiple tables and time windows. Serving that model in production requires the same features, but at millisecond latency and with current data.
The standard answer is a feature store. But most feature stores are architecturally split: a batch layer recomputes features on a schedule (hourly, daily), and a serving layer (usually Redis) caches the last computed result. This creates a structural freshness ceiling.
If your batch job runs hourly, your features are up to 60 minutes stale. For fraud detection, that means a fraudster has a 60-minute window. For personalization, it means serving recommendations based on yesterday's behavior. For dynamic pricing, it means leaving money on the table.
Why Existing Approaches Fall Short
Feast is a popular open-source feature store. It handles feature registration, serving, and point-in-time joins well. But Feast is a serving layer — it does not compute features. You still need a Spark job or dbt run to materialize features into its online store.
Tecton solves more of the pipeline but is proprietary, expensive, and still relies on Spark-based pipelines, with feature freshness typically in the seconds-to-minutes range.
Redis + batch jobs is what most teams actually run in practice. Redis is fast for lookups but has no compute. Batch jobs have compute but no freshness. You are always trading one for the other.
The root cause: these tools treat feature computation and feature serving as separate concerns. A streaming database treats them as one.
How a Streaming Database Solves This
RisingWave is a PostgreSQL-compatible streaming database. It is open source (Apache 2.0), written in Rust, and uses S3 as its storage layer. The key capability is incremental materialized views: SQL views that update automatically as upstream data changes, with results queryable over a standard PostgreSQL wire protocol.
Instead of "compute features in Spark, write to Redis, serve from Redis," the architecture becomes:
- Events stream into RisingWave via Kafka or CDC connectors
- SQL materialized views define feature transformations — once, declaratively
- ML serving layer queries materialized views directly, like any PostgreSQL table
There is no batch job. There is no cache invalidation. Features update within seconds of the underlying event.
Ingesting Raw Events
```sql
-- Ingest clickstream events from Kafka
CREATE SOURCE user_events (
    user_id BIGINT,
    event_type VARCHAR,
    item_id BIGINT,
    timestamp TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'user-events',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;
```
```sql
-- Ingest transactions via CDC from Postgres.
-- A CDC-backed table needs an explicit schema and primary key.
CREATE TABLE transactions (
    transaction_id BIGINT PRIMARY KEY,
    user_id BIGINT,
    amount DECIMAL,
    label INT,
    created_at TIMESTAMPTZ
) WITH (
    connector = 'postgres-cdc',
    hostname = 'db.internal',
    port = '5432',
    username = 'cdc_user',
    password = '...',
    database.name = 'prod',
    schema.name = 'public',
    table.name = 'transactions'
);
```
Defining Features as Materialized Views
```sql
-- Feature: rolling 7-day purchase count per user
CREATE MATERIALIZED VIEW user_purchase_count_7d AS
SELECT
    user_id,
    COUNT(*) AS purchase_count_7d,
    SUM(amount) AS purchase_value_7d,
    MAX(created_at) AS last_purchase_at
FROM transactions
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY user_id;
```
```sql
-- Feature: session-level click sequence (last 30 minutes)
CREATE MATERIALIZED VIEW user_session_features AS
SELECT
    user_id,
    COUNT(*) FILTER (WHERE event_type = 'click') AS clicks_30m,
    COUNT(*) FILTER (WHERE event_type = 'view') AS views_30m,
    COUNT(DISTINCT item_id) AS unique_items_30m,
    MAX(timestamp) AS last_event_at
FROM user_events
WHERE timestamp > NOW() - INTERVAL '30 minutes'
GROUP BY user_id;
```
```sql
-- Feature: combined feature vector for serving
CREATE MATERIALIZED VIEW ml_feature_vector AS
SELECT
    s.user_id,
    s.clicks_30m,
    s.views_30m,
    s.unique_items_30m,
    COALESCE(p.purchase_count_7d, 0) AS purchase_count_7d,
    COALESCE(p.purchase_value_7d, 0.0) AS purchase_value_7d,
    p.last_purchase_at
FROM user_session_features s
LEFT JOIN user_purchase_count_7d p USING (user_id);
```
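Point lookups by user_id can be accelerated further with a secondary index on the materialized view — a small addition worth sketching here (the index name is illustrative):

```sql
-- Index on the serving key, so point lookups hit the index
-- instead of scanning the whole materialized view.
CREATE INDEX idx_feature_vector_user_id ON ml_feature_vector (user_id);
```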
Serving Features to the Model
Because RisingWave exposes a PostgreSQL interface, serving is a standard parameterized query:
```sql
-- Serving query from ML inference service
SELECT
    clicks_30m,
    views_30m,
    unique_items_30m,
    purchase_count_7d,
    purchase_value_7d
FROM ml_feature_vector
WHERE user_id = $1;
```
This is the same query you would write against any PostgreSQL database. No SDK, no feature store client library, no cache warming job.
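As a concrete sketch of the client side, the snippet below shows how an inference service might assemble a feature vector from that query. Since RisingWave speaks the PostgreSQL wire protocol, any Postgres driver works; the driver call is shown as a comment to keep the snippet standalone, and the zero-default policy for unseen users is an assumption, not a RisingWave behavior.

```python
# Columns in serving order; must match the ml_feature_vector view.
FEATURE_COLUMNS = [
    "clicks_30m",
    "views_30m",
    "unique_items_30m",
    "purchase_count_7d",
    "purchase_value_7d",
]

# %s is the psycopg2-style placeholder; other drivers use $1 or ?.
SERVING_QUERY = (
    f"SELECT {', '.join(FEATURE_COLUMNS)} "
    "FROM ml_feature_vector WHERE user_id = %s"
)

def row_to_features(row):
    """Map a fetched row to a feature dict.

    A user with no recent activity has no row in the view; default their
    features to zero so the model always receives a complete vector.
    """
    if row is None:
        return {col: 0 for col in FEATURE_COLUMNS}
    return dict(zip(FEATURE_COLUMNS, row))

# In the inference service (with any Postgres driver):
#   with conn.cursor() as cur:
#       cur.execute(SERVING_QUERY, (user_id,))
#       features = row_to_features(cur.fetchone())
```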
Comparison: Feature Store Architectures
| Approach | Freshness | Serving Latency | Compute Model | Operational Complexity |
| --- | --- | --- | --- | --- |
| Batch + Redis | Minutes to hours | <5ms | Periodic full recompute | High (2 systems) |
| Feast + Spark | Minutes | <5ms | Microbatch | High (3+ systems) |
| Tecton | Seconds–minutes | <10ms | Streaming Spark | Very high (managed) |
| RisingWave (streaming DB) | Seconds | <20ms | Incremental SQL | Low (1 system) |
The latency trade-off is real: Redis will win a pure latency contest. But for most ML applications, sub-20ms feature serving is more than fast enough, and the operational simplicity of maintaining one system instead of three is significant.
Point-in-Time Correctness for Training
Training/serving skew is one of the most common bugs in ML systems. It happens when training features are computed differently (or at different times) than serving features.
With a streaming database, the same SQL feature definitions serve both training and serving. For historical training data, run that logic as a batch query over a retained event log, anchoring each feature window to the training example's own timestamp:
```sql
-- Generate training dataset with point-in-time feature values
-- (using a historical event log with explicit timestamps)
SELECT
    t.transaction_id,
    t.label,
    COUNT(e.event_type) FILTER (
        WHERE e.event_type = 'click'
          AND e.timestamp BETWEEN t.created_at - INTERVAL '7 days' AND t.created_at
    ) AS clicks_pre_transaction
FROM transactions t
LEFT JOIN user_events e ON e.user_id = t.user_id
GROUP BY t.transaction_id, t.label, t.created_at;
```
The feature logic is the same SQL. The only difference is whether you run it incrementally (serving) or over a historical window (training).
Operational Considerations
RisingWave persists materialized view state to S3. This means storage scales cheaply and independently of compute. Compute nodes are stateless and can be scaled horizontally without re-ingesting data.
For teams already running Kafka, the integration is direct — RisingWave sources connect to Kafka topics with a single CREATE SOURCE statement. For teams on relational databases, CDC connectors for PostgreSQL, MySQL, and MongoDB capture row-level changes as a stream.
The result is a feature store that requires no external cache, no batch orchestration, and no feature store framework — just SQL and a streaming database.
FAQ
Can RisingWave replace Redis entirely for feature serving? For most ML serving workloads, yes. RisingWave serves materialized views from its storage engine with hot state cached in memory, and indexed point lookups return in low tens of milliseconds. If your model requires sub-millisecond feature lookup (e.g., high-frequency trading), Redis remains faster. For recommendation, fraud, and personalization use cases, RisingWave is sufficient and eliminates an entire system.
How does RisingWave handle late-arriving events in feature windows? RisingWave supports watermark-based processing for handling late data. You can define watermark strategies per source, and window aggregations will correctly handle out-of-order events within the configured lateness tolerance.
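As an illustrative variant of the earlier Kafka source, a watermark is declared alongside the schema (the 5-second lateness tolerance is an assumed value, not a recommendation):

```sql
-- Same clickstream source, now with a watermark: events arriving more than
-- 5 seconds behind the maximum observed timestamp are treated as late.
CREATE SOURCE user_events (
    user_id BIGINT,
    event_type VARCHAR,
    item_id BIGINT,
    timestamp TIMESTAMPTZ,
    WATERMARK FOR timestamp AS timestamp - INTERVAL '5 seconds'
) WITH (
    connector = 'kafka',
    topic = 'user-events',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;
```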
Does RisingWave support point-in-time joins for training data generation? RisingWave supports temporal joins and you can query historical materialized view state using its changelog. For full point-in-time join support across arbitrary historical ranges, combining RisingWave with a historical event log (in S3 or a data warehouse) is the recommended pattern.
What happens if RisingWave falls behind on ingestion? RisingWave processes streams incrementally and maintains backpressure. If a source produces faster than the system processes, it buffers in Kafka (which acts as the durable queue). Features reflect the latest processed state, and the system catches up without data loss.
Is RisingWave suitable for online learning / streaming model updates? RisingWave is designed for feature computation, not model training. However, it can emit feature aggregates to a downstream system (via Kafka sink) that triggers online learning updates, making it a natural upstream component in a streaming ML architecture.
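A sketch of such a sink, pushing changes to the feature vector into a Kafka topic as an upsert stream (the topic name is an assumption):

```sql
-- Emit every change to the feature vector, keyed by user_id, so a
-- downstream online-learning job can consume feature updates.
CREATE SINK feature_updates FROM ml_feature_vector
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'kafka:9092',
    topic = 'feature-updates',
    primary_key = 'user_id'
) FORMAT UPSERT ENCODE JSON;
```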

