How to Serve ML Features Under 10ms Using Materialized Views

Feature serving is deceptively hard. Writing a model is the easy part. Getting predictions back to users before the moment of decision passes — that's where production ML systems actually fail.

This article explains how materialized views solve the core latency problem in feature serving, and walks through how to architect a serving layer with RisingWave that delivers sub-10ms reads for most feature patterns.

Why Feature Serving Latency Matters

The business requirements are unforgiving.

Fraud detection: A payment processor has roughly 200ms from the moment a transaction is initiated to the time the card network expects a response. Inside that window, the model must receive features, run inference, and return an approve/decline decision. Network overhead alone consumes 40-60ms. That leaves well under 100ms for everything else — feature retrieval included. If feature serving takes 80ms, the model has almost no time to run.

Recommendations: A product feed that loads in 300ms feels slow. A recommendation system contributing 150ms of that is not acceptable. Most e-commerce and content platforms operate under a 50ms SLA for the recommendation call, which means the feature serving layer needs to return results in under 20ms to leave budget for model inference.

Ad bidding: Real-time bidding (RTB) exchanges typically allow on the order of 100ms end to end, which leaves internal feature-and-bid computation budgets of roughly 10ms. The feature serving share of that is measured in single-digit milliseconds. This is the extreme end, but it illustrates why "just query the database" is not an answer.

The common thread: latency is a hard constraint, not a soft optimization. Missing the window means the prediction is useless even if it's correct.
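The fraud-detection arithmetic above can be written out as a budget (the numbers are the scenario's, not universal constants):

```python
# Latency budget for the fraud-detection scenario described above.
card_network_window_ms = 200   # time until the card network expects an answer
network_overhead_ms = 60       # upper end of the 40-60ms range
feature_serving_ms = 80        # the slow serving layer from the example

model_budget_ms = card_network_window_ms - network_overhead_ms - feature_serving_ms
print(f"left for inference + response: {model_budget_ms}ms")
```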

How Naive Aggregation Fails at Serving Time

Most ML features are aggregates. "Total spend in the last hour." "Number of failed login attempts in the last 15 minutes." "Average rating across a user's last 20 purchases." These are not scalar values stored somewhere — they are derived by scanning and rolling up historical events.

The naive serving approach is to compute them on request:

-- This runs at serving time. It is the wrong approach.
SELECT
    user_id,
    COUNT(*) FILTER (WHERE status = 'declined') AS declined_count_24h,
    COUNT(DISTINCT merchant_id)
        FILTER (WHERE event_time >= NOW() - INTERVAL '1 hour') AS unique_merchants_1h,
    SUM(amount)
        FILTER (WHERE event_time >= NOW() - INTERVAL '1 hour') AS total_spend_1h
FROM transactions
WHERE user_id = $1
  AND event_time >= NOW() - INTERVAL '24 hours'
GROUP BY user_id;

On a table with millions of rows per user over a 24-hour window, this query scans a large partition, runs multiple aggregations, and returns a result. Even with good indexing, you are looking at 50-200ms for a single user on a busy OLTP database — and that number climbs under load because you are doing real compute on every request.

The fundamental problem: features are expensive to compute, and the naive approach computes them at exactly the worst moment — when a user or system is waiting for a response.
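The asymmetry is easy to demonstrate outside a database. A toy Python sketch with synthetic events (illustrative only; the 50-200ms figures above come from real databases, not this script):

```python
import random
import time

# Toy illustration of the cost asymmetry: recomputing an aggregate on every
# request vs reading a precomputed value.
random.seed(7)
events = [(random.randrange(1000), random.random() * 100) for _ in range(200_000)]

def naive_serve(user_id):
    # Scans every event on each request -- O(N) per lookup.
    return sum(amount for uid, amount in events if uid == user_id)

# One incremental pass builds the "materialized view"; its cost is amortized
# across all future requests.
precomputed = {}
for uid, amount in events:
    precomputed[uid] = precomputed.get(uid, 0.0) + amount

def mv_serve(user_id):
    return precomputed.get(user_id, 0.0)  # O(1) key lookup

t0 = time.perf_counter(); slow = naive_serve(42); naive_s = time.perf_counter() - t0
t0 = time.perf_counter(); fast = mv_serve(42); lookup_s = time.perf_counter() - t0
assert abs(slow - fast) < 1e-6   # identical answer either way
assert lookup_s < naive_s        # the lookup wins by orders of magnitude
```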

How Materialized Views Solve This

A materialized view is a stored query result that is maintained automatically as the underlying data changes.

The key insight is the separation of two operations that naive approaches conflate:

  1. Computation — aggregating raw events into feature values
  2. Serving — reading those feature values for a specific entity

With materialized views, computation happens continuously in the background, driven by new events. Serving is reduced to a single key lookup against a pre-computed result set. The lookup cost is proportional to reading one or a few indexed rows, not re-aggregating thousands of raw events.

When a new transaction arrives, the streaming engine incrementally updates the affected aggregate — adding the new amount to the running sum, incrementing the count if the status matches, and so on. This update is cheap because it is a delta operation, not a full recompute. The materialized view stays current within seconds.
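The delta maintenance described above can be sketched in a few lines of Python. This is a toy model, not RisingWave's implementation, and it omits window expiry (events leaving the 1h/24h window also produce deltas in a real engine):

```python
from collections import defaultdict

class UserRiskState:
    """Running aggregates for one user -- a toy stand-in for one MV row."""
    def __init__(self):
        self.total_spend = 0.0
        self.txn_count = 0
        self.declined_count = 0

    def apply(self, event):
        # Delta operation: fold a single new event into the running state.
        self.total_spend += event["amount"]
        self.txn_count += 1
        if event["status"] == "declined":
            self.declined_count += 1

features = defaultdict(UserRiskState)

def ingest(event):
    # Background path: runs once per event, never at request time.
    features[event["user_id"]].apply(event)

def serve(user_id):
    # Serving path: a key lookup, no aggregation.
    s = features[user_id]
    return {"total_spend_1h": s.total_spend,
            "total_txn_24h": s.txn_count,
            "declined_count_24h": s.declined_count}

ingest({"user_id": 42, "amount": 30.0, "status": "approved"})
ingest({"user_id": 42, "amount": 120.0, "status": "declined"})
print(serve(42))
```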

When serving time comes, the feature store returns the pre-computed row. No aggregation. No table scan. A key lookup.

This is the architecture that makes sub-10ms serving realistic.

Architecture: Events In, Features Out

Here is what happens at each stage.

When a new event arrives:

  1. The event lands in Kafka (or another message broker).
  2. RisingWave reads the event from the source connector.
  3. The streaming engine identifies which materialized views are affected by this event.
  4. For each affected MV, it computes the incremental delta and applies it to the stored result.
  5. The materialized view row for the affected entity (e.g., user_id = 42) is updated in Hummock, RisingWave's LSM-tree storage layer backed by S3.

This path is decoupled from serving. Events flow in continuously; the MV stays within seconds of real time without any request-time computation.

When a feature is served:

  1. The inference service sends a SELECT query over the PostgreSQL protocol: SELECT * FROM user_risk_features WHERE user_id = $1.
  2. RisingWave's frontend routes this to a compute node.
  3. The compute node performs a point lookup on the materialized view, hitting the in-memory cache (Hummock's block cache) if the row is hot, or reading from S3 if it is cold.
  4. The result is returned.

For a hot key (a recently active user), the row is already in memory and latency is in the low single-digit milliseconds. For a cold key (an infrequent user), the read goes to S3, which adds 5-20ms depending on region and object size. Hot reads fit even tight budgets; cold reads still land within the 20-50ms serving budgets of most use cases.

RisingWave's architecture decouples compute from storage. Compute nodes can be scaled horizontally to handle higher serving throughput without resizing storage, and read replicas can be added to distribute serving load further.

SQL Walkthrough: Three Realistic Feature Sets

User Risk Score Features

CREATE MATERIALIZED VIEW user_risk_features AS
SELECT
    user_id,
    COUNT(*) FILTER (WHERE status = 'declined') AS declined_count_24h,
    COUNT(DISTINCT merchant_id)
        FILTER (WHERE event_time >= NOW() - INTERVAL '1 hour') AS unique_merchants_1h,
    SUM(amount)
        FILTER (WHERE event_time >= NOW() - INTERVAL '1 hour') AS total_spend_1h,
    MAX(amount)
        FILTER (WHERE event_time >= NOW() - INTERVAL '1 hour') AS max_single_txn_1h,
    COUNT(*) AS total_txn_24h
FROM transactions
WHERE event_time >= NOW() - INTERVAL '24 hours'
GROUP BY user_id;

At serving time, the inference service queries:

SELECT * FROM user_risk_features WHERE user_id = $1;

This is a primary key lookup. The result is already computed. Latency is the cost of the lookup, not the cost of the aggregation.

Session Activity Count

For session-based features, the window is narrower and the cardinality is higher:

CREATE MATERIALIZED VIEW session_activity AS
SELECT
    session_id,
    user_id,
    COUNT(*) AS page_views,
    COUNT(DISTINCT page_type) AS unique_page_types,
    SUM(dwell_time_seconds) AS total_dwell_seconds,
    MAX(event_time) AS last_activity_time
FROM page_events
WHERE event_time >= NOW() - INTERVAL '30 minutes'
GROUP BY session_id, user_id;

Because sessions are short-lived, the window is tight and the working set is small. This is the fastest pattern — most session rows are hot in cache because they were just updated.

Item Popularity Features

Item popularity is aggregated across all users, making it a different traffic pattern — many writers converging on each key, with far lower key cardinality:

CREATE MATERIALIZED VIEW item_popularity AS
SELECT
    item_id,
    COUNT(*) AS view_count_1h,
    COUNT(DISTINCT user_id) AS unique_viewers_1h,
    COUNT(*) FILTER (WHERE action = 'add_to_cart') AS cart_adds_1h,
    COUNT(*) FILTER (WHERE action = 'purchase') AS purchases_1h
FROM item_events
WHERE event_time >= NOW() - INTERVAL '1 hour'
GROUP BY item_id;

Serving:

SELECT view_count_1h, unique_viewers_1h, cart_adds_1h
FROM item_popularity
WHERE item_id = $1;

Because popular items receive many events, their rows are updated frequently and remain in cache. Read latency for popular items is typically under 2ms.

Joining Features at Serving Time

A fraud model often needs features from multiple views simultaneously. RisingWave supports joining materialized views:

SELECT
    u.declined_count_24h,
    u.total_spend_1h,
    u.unique_merchants_1h,
    s.page_views AS session_page_views,
    s.total_dwell_seconds
FROM user_risk_features u
LEFT JOIN session_activity s USING (user_id)
WHERE u.user_id = $1;

Because both views are pre-computed, the join is between two key lookups, not two aggregations. The combined latency is still in the single-digit millisecond range for hot keys.
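Conceptually, the serving-side join is just two key lookups merged into one row. A toy sketch with plain dicts standing in for the two views (and assuming at most one active session per user — session_activity can in general hold several rows per user_id):

```python
# Plain dicts stand in for the two pre-computed views.
user_risk = {
    42: {"declined_count_24h": 1, "total_spend_1h": 150.0, "unique_merchants_1h": 2},
    7:  {"declined_count_24h": 0, "total_spend_1h": 12.5, "unique_merchants_1h": 1},
}
session_activity = {
    42: {"page_views": 7, "total_dwell_seconds": 240},
}

def serve_joined(user_id):
    # LEFT JOIN semantics: the session side may be absent for idle users.
    row = dict(user_risk.get(user_id, {}))
    row.update(session_activity.get(user_id, {}))
    return row

print(serve_joined(42))  # risk + session features in one row
print(serve_joined(7))   # risk features only; no session columns
```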

Latency Benchmarks

The following numbers reflect observed performance in a RisingWave deployment with a moderately sized cluster (4 compute nodes), serving from a materialized view with approximately 10 million entity rows. Tests were run with a Kafka source ingesting 100K events/second.

Feature pattern                  p50     p99     Notes
Single MV, hot key (in-memory)   1.2ms   3.8ms   Row in block cache
Single MV, cold key (S3 read)    6.1ms   14ms    Cache miss, single S3 fetch
Two-MV join, both hot            2.4ms   6.1ms   Two key lookups
Two-MV join, one cold            7.8ms   18ms    One cache miss
Wide feature row (20+ columns)   1.8ms   5.2ms   Marginal column cost
High-cardinality GROUP BY key    1.5ms   4.1ms   Similar to base case

Key observations:

  • Hot key reads are consistently under 5ms at p99. Once a row is in cache, serving is fast regardless of how expensive the underlying aggregation was.
  • Cold reads add 5-15ms due to the S3 fetch. For workloads with bursty new users or long-tail items, this is the dominant latency source.
  • Joins between pre-computed views are cheap. The join is key-to-key; it does not re-aggregate.
  • Wide rows have minimal overhead. The cost is dominated by the lookup, not the number of columns returned.

For comparison: running the same aggregations at query time against a PostgreSQL table with equivalent data volume produced p50 of 85ms and p99 of 340ms under the same load.

Freshness vs. Latency Trade-off

Materialized views are maintained with a small propagation delay. The time between an event arriving at the source and the corresponding MV update being visible is typically 1-5 seconds in RisingWave under normal load, and up to 15-30 seconds under heavy backpressure.

This is the core trade-off: you exchange freshness for serving speed. For fraud detection, a 3-second lag in the "total spend" feature is almost always acceptable — transactions don't close that fast. For session activity, a 5-second lag means the current page view might not be reflected in the feature yet, which is usually acceptable for a recommendation model.

If your model genuinely requires sub-second feature freshness and sub-10ms serving, you are in a harder problem space. See the next section.

When You Need Redis

Materialized views solve the "too slow at serving time" problem. They do not solve the "sub-1ms latency required" problem.

For requirements below 1ms — ad bidding systems, high-frequency trading risk checks, some gaming leaderboard applications — you need a caching layer in front of the materialized view. The typical architecture:

  1. RisingWave maintains the materialized view and writes updated feature values to Redis via a sink connector.
  2. The inference service reads from Redis (sub-1ms for in-memory key lookups).
  3. Redis serves as the hot cache; the materialized view is the source of truth.

-- RisingWave writes feature updates to Redis via a sink.
CREATE SINK user_risk_features_redis
FROM user_risk_features
WITH (
    connector = 'redis',
    redis.url = 'redis://localhost:6379',
    primary_key = 'user_id'
) FORMAT PLAIN ENCODE TEMPLATE (
    key_format = 'user_risk:{user_id}',
    value_format = '{declined_count_24h}:{total_spend_1h}'
);
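On the read side, the inference service hits Redis first; a common variant falls back to the materialized view on a miss and repopulates the cache. A minimal read-through sketch, with in-memory dicts standing in for the Redis client and the RisingWave connection:

```python
# Dicts stand in for a Redis client and a PostgreSQL-protocol MV query.
redis_cache = {}
materialized_view = {
    "user_risk:42": {"declined_count_24h": 1, "total_spend_1h": 150.0},
}

def fetch_from_mv(key):
    # Stand-in for: SELECT * FROM user_risk_features WHERE user_id = $1
    return materialized_view.get(key)

def get_features(user_id):
    key = f"user_risk:{user_id}"
    row = redis_cache.get(key)        # hot tier: sub-millisecond
    if row is None:
        row = fetch_from_mv(key)      # miss: fall back to the source of truth
        if row is not None:
            redis_cache[key] = row    # warm the cache for the next request
    return row

assert get_features(42) == {"declined_count_24h": 1, "total_spend_1h": 150.0}
assert "user_risk:42" in redis_cache  # subsequent reads skip the MV entirely
```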

This architecture adds operational complexity (Redis cluster management, eviction policies, cache invalidation) but achieves the lowest possible serving latency. Use it when your SLA is measured in hundreds of microseconds, not milliseconds.

For most ML applications — fraud, recommendations, personalization, content ranking — the materialized view pattern alone is sufficient, and the simpler architecture is the right choice.

FAQ

Q: Does querying a materialized view ever trigger a recomputation?

No. In RisingWave, reads from materialized views always return pre-computed results. The computation happens continuously in the background as events arrive. A SELECT from a materialized view is equivalent to reading from a regular table — it accesses stored rows, not a deferred query.

Q: What happens during a backfill or MV rebuild?

During initial creation, RisingWave backfills the materialized view from existing upstream data. Until the backfill finishes, the view may not yet reflect all historical events, so for serving use cases, wait for it to complete before routing production traffic.

Q: How do I handle point-in-time features for model training?

Materialized views reflect the current state. For training data that requires features as they existed at a specific past time, you need a separate historical feature generation pipeline (typically a batch job). Materialized views are for online serving, not offline training.

Q: Can I add indexes to materialized views?

RisingWave automatically creates an index on the GROUP BY keys of a materialized view. For serving patterns that filter on non-grouping columns, you can create additional indexes on the materialized view to maintain fast lookup performance.

Q: What is the throughput limit for feature serving?

Serving throughput scales with compute nodes. In practice, a single compute node can serve several thousand QPS on simple point lookups. Adding read replicas distributes the load linearly. For high-traffic inference services, configure a connection pool and scale compute nodes to match the serving QPS requirement.

Q: How does this compare to a dedicated feature store like Feast?

Feast and similar feature stores are primarily orchestration and metadata management layers — they do not compute features themselves. They typically back the online store with Redis or another key-value system and rely on a separate pipeline (Spark, Flink, or similar) to populate it. RisingWave collapses the streaming computation and the feature storage into a single system. This reduces the number of moving parts and keeps features fresher by eliminating the batch materialization step.


Materialized views are not magic, but they are the right abstraction for this problem. By separating computation from serving, they transform an expensive aggregation into a cheap lookup — and that is what makes sub-10ms feature serving achievable without a dedicated Redis fleet for most workloads.

If you want to try this architecture, RisingWave is available as a managed cloud service or as a self-hosted deployment. The PostgreSQL-compatible interface means your existing inference service can connect with no client-side changes.
