Every production ML system eventually hits the same wall: getting fresh features to models fast enough to matter. You trained a fraud detection model on features computed from the last 30 minutes of user behavior, but in production, your feature store refreshes hourly. The model that performed brilliantly in offline evaluation now misses obvious fraud patterns because it is working with stale data.
A feature store is a centralized system that manages the computation, storage, and serving of machine learning features. It bridges the gap between raw data and model inputs, providing consistent feature values for both training and real-time inference. Tools like Feast and Tecton have popularized this pattern, but most implementations still rely on a patchwork of batch pipelines, stream processors, key-value stores, and orchestrators.
What if your feature store could compute features from live streams, store the results, and serve them to models through a standard PostgreSQL connection, all in one system? That is exactly what a streaming database like RisingWave enables. In this guide, you will build a real-time feature store using streaming SQL, compare this approach against traditional feature store architectures, and see how materialized views provide point-in-time correctness without the operational complexity of dedicated feature store platforms.
The Feature Store Problem
Feature stores exist because ML teams kept solving the same problems over and over. Data scientists would write feature logic in Python notebooks during training, then engineers would rewrite that logic in Java or Scala for production serving. Features would drift between training and serving environments. Teams would duplicate feature computation across projects.
A well-designed feature store solves three core problems:
- Feature computation: Transforming raw data into ML-ready features (aggregations, encodings, joins)
- Feature storage: Maintaining both historical features for training (offline store) and current features for inference (online store)
- Feature serving: Delivering features to models with low latency at high throughput
The traditional architecture splits these responsibilities across multiple systems. Batch pipelines (Airflow + Spark) compute features and write them to an offline store (data warehouse). A separate streaming pipeline (Flink or Kafka Streams) computes real-time features and writes them to an online store (Redis or DynamoDB). A serving layer sits in front, routing requests and handling fallbacks.
This architecture works, but it carries a cost. You maintain two separate computation paths for batch and streaming features. You manage synchronization between offline and online stores. You operate a serving layer that adds latency and failure modes. And every new feature requires changes in multiple systems.
Why Streaming SQL Changes the Equation
A streaming database collapses multiple layers of the feature store stack into a single system. Instead of separate compute, storage, and serving components, you get all three from one SQL interface.
RisingWave is a streaming database that ingests data from sources like Kafka, computes transformations incrementally, and stores the results in materialized views that you query using standard PostgreSQL protocol. For feature store workloads, this means:
- Feature computation happens through SQL materialized views. Define your feature logic as a CREATE MATERIALIZED VIEW statement, and RisingWave continuously updates the results as new data arrives.
- Feature storage is handled automatically. Materialized views persist their results, serving as both the computation engine and the storage layer.
- Feature serving uses the PostgreSQL wire protocol. Any tool that connects to PostgreSQL, including Python's psycopg2, SQLAlchemy, JDBC, or any ML serving framework, can query features directly.
This eliminates the dual-write problem where batch and streaming pipelines must produce identical results. There is one definition of each feature, written in SQL, running continuously.
Building Feature Tables as Streaming Sources
In a production deployment, your event data typically flows from Kafka or another message broker. RisingWave can ingest directly from these sources using CREATE SOURCE statements. For this tutorial, we use tables that simulate streaming input, letting you follow along on any RisingWave instance.
All SQL in this article is verified against RisingWave 2.8.0.
-- Raw event stream: user browsing behavior
CREATE TABLE user_events (
user_id INT,
event_type VARCHAR,
page_category VARCHAR,
session_id VARCHAR,
event_timestamp TIMESTAMPTZ
);
-- Raw event stream: purchase transactions
CREATE TABLE purchase_events (
user_id INT,
product_id INT,
product_category VARCHAR,
amount DECIMAL,
purchase_timestamp TIMESTAMPTZ
);
Insert sample data to represent a realistic event stream:
INSERT INTO user_events VALUES
(1, 'page_view', 'electronics', 'sess_001', '2026-04-01 10:00:00+00'),
(1, 'page_view', 'electronics', 'sess_001', '2026-04-01 10:05:00+00'),
(1, 'add_to_cart', 'electronics', 'sess_001', '2026-04-01 10:10:00+00'),
(1, 'page_view', 'clothing', 'sess_002', '2026-04-01 14:00:00+00'),
(2, 'page_view', 'electronics', 'sess_003', '2026-04-01 09:00:00+00'),
(2, 'page_view', 'electronics', 'sess_003', '2026-04-01 09:15:00+00'),
(2, 'page_view', 'books', 'sess_003', '2026-04-01 09:30:00+00'),
(2, 'add_to_cart', 'books', 'sess_003', '2026-04-01 09:35:00+00'),
(2, 'purchase', 'books', 'sess_003', '2026-04-01 09:40:00+00'),
(3, 'page_view', 'clothing', 'sess_004', '2026-04-01 11:00:00+00'),
(3, 'page_view', 'clothing', 'sess_004', '2026-04-01 11:10:00+00'),
(3, 'page_view', 'shoes', 'sess_004', '2026-04-01 11:20:00+00'),
(3, 'add_to_cart', 'shoes', 'sess_004', '2026-04-01 11:25:00+00');
INSERT INTO purchase_events VALUES
(1, 101, 'electronics', 299.99, '2026-03-28 15:00:00+00'),
(1, 102, 'electronics', 49.99, '2026-03-30 12:00:00+00'),
(1, 103, 'clothing', 79.99, '2026-04-01 10:15:00+00'),
(2, 201, 'books', 24.99, '2026-03-25 09:00:00+00'),
(2, 202, 'books', 19.99, '2026-03-29 14:00:00+00'),
(2, 203, 'electronics', 599.99, '2026-04-01 09:45:00+00'),
(3, 301, 'clothing', 129.99, '2026-03-27 16:00:00+00'),
(3, 302, 'shoes', 89.99, '2026-04-01 11:30:00+00');
In production, you would replace these tables with Kafka sources:
CREATE SOURCE user_events_stream (
user_id INT,
event_type VARCHAR,
page_category VARCHAR,
session_id VARCHAR,
event_timestamp TIMESTAMPTZ
) WITH (
connector = 'kafka',
topic = 'user-events',
properties.bootstrap.server = 'kafka:9092',
scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
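The source above expects JSON-encoded events on the `user-events` topic. The producer side might look like the sketch below. The `encode_user_event` helper is hypothetical, and only builds the payload bytes, so it works with any Kafka client (`kafka-python`, `confluent-kafka`, etc.):

```python
import json
from datetime import datetime, timezone

def encode_user_event(user_id: int, event_type: str, page_category: str,
                      session_id: str, event_timestamp: datetime) -> bytes:
    """Serialize one event into the JSON shape the user_events_stream
    source declares (field names must match the column names)."""
    payload = {
        "user_id": user_id,
        "event_type": event_type,
        "page_category": page_category,
        "session_id": session_id,
        # TIMESTAMPTZ columns accept ISO-8601 strings with a UTC offset
        "event_timestamp": event_timestamp.isoformat(),
    }
    return json.dumps(payload).encode("utf-8")

# A real producer would then do something like:
#   producer.send("user-events", encode_user_event(...))
msg = encode_user_event(1, "page_view", "electronics", "sess_001",
                        datetime(2026, 4, 1, 10, 0, tzinfo=timezone.utc))
```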
Computing Features with Materialized Views
The core of a streaming feature store is the feature computation layer. In RisingWave, each feature group becomes a materialized view that continuously updates as new events arrive.
User Activity Features
This materialized view computes behavioral signals from browsing events. These features feed recommendation models and churn prediction:
CREATE MATERIALIZED VIEW user_activity_features AS
SELECT
user_id,
COUNT(*) AS total_events,
COUNT(DISTINCT session_id) AS total_sessions,
COUNT(DISTINCT page_category) AS categories_browsed,
COUNT(*) FILTER (WHERE event_type = 'add_to_cart') AS cart_additions,
COUNT(*) FILTER (WHERE event_type = 'purchase') AS purchases,
MAX(event_timestamp) AS last_active_at
FROM user_events
GROUP BY user_id;
Query the results:
SELECT * FROM user_activity_features ORDER BY user_id;
user_id | total_events | total_sessions | categories_browsed | cart_additions | purchases | last_active_at
---------+--------------+----------------+--------------------+----------------+-----------+---------------------------
1 | 4 | 2 | 2 | 1 | 0 | 2026-04-01 14:00:00+00:00
2 | 5 | 1 | 2 | 1 | 1 | 2026-04-01 09:40:00+00:00
3 | 4 | 1 | 2 | 1 | 0 | 2026-04-01 11:25:00+00:00
Notice the FILTER clause. RisingWave supports standard SQL conditional aggregation, which makes computing event-type-specific counts clean and readable. Each feature in this view updates the instant a new event arrives, not on a batch schedule.
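If you want to sanity-check the view logic offline, the aggregation is easy to mirror in plain Python. This is a hypothetical validation helper, not part of RisingWave; it reproduces the counts for user 2 from the sample rows above:

```python
from collections import namedtuple

Event = namedtuple("Event", "user_id event_type page_category session_id")

# User 2's events from the sample data above
events = [
    Event(2, "page_view", "electronics", "sess_003"),
    Event(2, "page_view", "electronics", "sess_003"),
    Event(2, "page_view", "books", "sess_003"),
    Event(2, "add_to_cart", "books", "sess_003"),
    Event(2, "purchase", "books", "sess_003"),
]

def activity_features(rows):
    """Mirror of user_activity_features for a single user's rows."""
    return {
        "total_events": len(rows),
        "total_sessions": len({r.session_id for r in rows}),
        "categories_browsed": len({r.page_category for r in rows}),
        # Equivalents of COUNT(*) FILTER (WHERE event_type = '...')
        "cart_additions": sum(r.event_type == "add_to_cart" for r in rows),
        "purchases": sum(r.event_type == "purchase" for r in rows),
    }

print(activity_features(events))
# {'total_events': 5, 'total_sessions': 1, 'categories_browsed': 2,
#  'cart_additions': 1, 'purchases': 1}
```

The output matches the user 2 row in the query result above, which is a quick way to catch drift between prototype feature code and the SQL definition.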
User Purchase Features
Spending patterns form a separate feature group. Keeping them in their own materialized view allows independent evolution and avoids recomputing browsing features when purchase logic changes:
CREATE MATERIALIZED VIEW user_purchase_features AS
SELECT
user_id,
COUNT(*) AS total_purchases,
SUM(amount) AS total_spent,
AVG(amount) AS avg_order_value,
MAX(amount) AS max_order_value,
COUNT(DISTINCT product_category) AS categories_purchased,
MAX(purchase_timestamp) AS last_purchase_at
FROM purchase_events
GROUP BY user_id;
SELECT * FROM user_purchase_features ORDER BY user_id;
user_id | total_purchases | total_spent | avg_order_value | max_order_value | categories_purchased | last_purchase_at
---------+-----------------+-------------+-----------------+-----------------+----------------------+---------------------------
1 | 3 | 429.97 | 143.32 | 299.99 | 2 | 2026-04-01 10:15:00+00:00
2 | 3 | 644.97 | 214.99 | 599.99 | 2 | 2026-04-01 09:45:00+00:00
3 | 2 | 219.98 | 109.99 | 129.99 | 2 | 2026-04-01 11:30:00+00:00
Product Popularity Features
Product-level features help with item ranking and catalog optimization. This view aggregates purchase signals per product:
CREATE MATERIALIZED VIEW product_popularity_features AS
SELECT
product_id,
product_category,
COUNT(*) AS times_purchased,
COUNT(DISTINCT user_id) AS unique_buyers,
AVG(amount) AS avg_sale_price,
SUM(amount) AS total_revenue
FROM purchase_events
GROUP BY product_id, product_category;
SELECT * FROM product_popularity_features ORDER BY product_id;
product_id | product_category | times_purchased | unique_buyers | avg_sale_price | total_revenue
------------+------------------+-----------------+---------------+----------------+---------------
101 | electronics | 1 | 1 | 299.99 | 299.99
102 | electronics | 1 | 1 | 49.99 | 49.99
103 | clothing | 1 | 1 | 79.99 | 79.99
201 | books | 1 | 1 | 24.99 | 24.99
202 | books | 1 | 1 | 19.99 | 19.99
203 | electronics | 1 | 1 | 599.99 | 599.99
301 | clothing | 1 | 1 | 129.99 | 129.99
302 | shoes | 1 | 1 | 89.99 | 89.99
Combining Feature Groups with Streaming Joins
Real ML models consume features from multiple sources. A recommendation model needs both browsing behavior and purchase history. In a traditional feature store, combining feature groups requires application-level joins or a serving layer that fetches from multiple stores. In RisingWave, you join materialized views directly:
CREATE MATERIALIZED VIEW combined_recommendation_features AS
SELECT
a.user_id,
a.total_events,
a.total_sessions,
a.categories_browsed,
a.cart_additions,
COALESCE(p.total_purchases, 0) AS total_purchases,
COALESCE(p.total_spent, 0) AS total_spent,
COALESCE(p.avg_order_value, 0) AS avg_order_value,
CASE
WHEN a.total_events > 0 THEN
COALESCE(p.total_purchases, 0)::DECIMAL / a.total_events
ELSE 0
END AS conversion_rate,
a.last_active_at,
p.last_purchase_at
FROM user_activity_features a
LEFT JOIN user_purchase_features p ON a.user_id = p.user_id;
SELECT * FROM combined_recommendation_features ORDER BY user_id;
user_id | total_events | total_sessions | categories_browsed | cart_additions | total_purchases | total_spent | avg_order_value | conversion_rate | last_active_at | last_purchase_at
---------+--------------+----------------+--------------------+----------------+-----------------+-------------+-----------------+-----------------+---------------------------+---------------------------
1 | 4 | 2 | 2 | 1 | 3 | 429.97 | 143.32 | 0.75 | 2026-04-01 14:00:00+00:00 | 2026-04-01 10:15:00+00:00
2 | 5 | 1 | 2 | 1 | 3 | 644.97 | 214.99 | 0.60 | 2026-04-01 09:40:00+00:00 | 2026-04-01 09:45:00+00:00
3 | 4 | 1 | 2 | 1 | 2 | 219.98 | 109.99 | 0.50 | 2026-04-01 11:25:00+00:00 | 2026-04-01 11:30:00+00:00
This combined view updates automatically when either the activity or purchase features change. The LEFT JOIN ensures users who have browsed but not purchased still appear in the feature set, with purchase metrics defaulting to zero via COALESCE.
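The COALESCE-plus-ratio pattern shows up in most composite feature views, so it is worth internalizing. A small Python sketch of the same logic (a hypothetical helper mirroring the view for one user, where `None` stands for the NULLs a LEFT JOIN produces):

```python
from decimal import Decimal
from typing import Optional

def conversion_features(total_events: int,
                        total_purchases: Optional[int]) -> dict:
    """Mirror the LEFT JOIN + COALESCE logic: purchase metrics default
    to zero for users who browsed but never bought anything."""
    purchases = total_purchases if total_purchases is not None else 0  # COALESCE
    rate = (Decimal(purchases) / total_events) if total_events > 0 else Decimal(0)
    return {"total_purchases": purchases, "conversion_rate": rate}

# User 1 from the output above: 4 events, 3 purchases
print(conversion_features(4, 3))     # conversion_rate = 0.75
# A browse-only user whose join row is all NULLs on the purchase side
print(conversion_features(6, None))  # conversion_rate = 0
```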
Serving Features to ML Models
The key advantage of using RisingWave as a feature store is the serving layer: it speaks PostgreSQL wire protocol. Every language, framework, and ORM that connects to PostgreSQL works out of the box. No special SDKs. No gRPC endpoints. No feature server to deploy and operate.
Python Feature Retrieval
Here is how a model serving application fetches features at inference time:
import psycopg2
# Connect to RisingWave using standard PostgreSQL driver
conn = psycopg2.connect(
host="localhost",
port=4566,
user="root",
dbname="dev"
)
def get_user_features(user_id: int) -> dict:
"""Fetch real-time features for a single user."""
cursor = conn.cursor()
cursor.execute("""
SELECT user_id, total_events, total_sessions,
categories_browsed, cart_additions,
total_purchases, total_spent,
avg_order_value, conversion_rate
FROM combined_recommendation_features
WHERE user_id = %s
""", (user_id,))
row = cursor.fetchone()
if row is None:
return {}
columns = [desc[0] for desc in cursor.description]
return dict(zip(columns, row))
# Fetch features for user 2
features = get_user_features(2)
print(features)
# {'user_id': 2, 'total_events': 5, 'total_sessions': 1,
# 'categories_browsed': 2, 'cart_additions': 1,
# 'total_purchases': 3, 'total_spent': Decimal('644.97'),
# 'avg_order_value': Decimal('214.99'), 'conversion_rate': Decimal('0.60')}
Point lookups on materialized views in RisingWave return results at 10-20ms p99 latency. That is fast enough for online inference in most ML serving scenarios.
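Most models consume feature vectors in a fixed column order, while `get_user_features` returns a dict keyed by name. A small adapter keeps the serving path deterministic; the `FEATURE_ORDER` list below is a hypothetical example of the order a model might have been trained with:

```python
# Fixed feature order the model was trained with (hypothetical list)
FEATURE_ORDER = [
    "total_events", "total_sessions", "categories_browsed",
    "cart_additions", "total_purchases", "total_spent",
    "avg_order_value", "conversion_rate",
]

def to_model_input(features: dict) -> list:
    """Map a feature dict to an ordered numeric vector; missing keys
    (e.g. an unseen user returning {}) default to 0."""
    return [float(features.get(name, 0)) for name in FEATURE_ORDER]

vector = to_model_input({"total_events": 5, "total_sessions": 1,
                         "categories_browsed": 2, "cart_additions": 1,
                         "total_purchases": 3, "total_spent": 644.97,
                         "avg_order_value": 214.99, "conversion_rate": 0.60})
print(vector[:3])  # [5.0, 1.0, 2.0]
```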
Batch Feature Retrieval for Training
Training pipelines need historical feature snapshots. The same materialized views serve batch reads:
import pandas as pd
import psycopg2
conn = psycopg2.connect(
host="localhost", port=4566,
user="root", dbname="dev"
)
# Load entire feature set for model training
df = pd.read_sql("""
SELECT * FROM combined_recommendation_features
""", conn)
print(df.shape) # (3, 11) - all users with all features
This eliminates the offline/online store split. One system serves both training and inference reads.
Point-in-Time Correctness
Point-in-time correctness is a critical requirement for feature stores. When a model requests features for a prediction, it should receive the feature values that reflect all events up to that moment, and no events beyond it. Getting this wrong introduces data leakage during training and inconsistent predictions during serving.
Traditional feature stores handle this through versioned snapshots or event-time lookups against a time-travel-capable store. This adds significant infrastructure complexity.
RisingWave's materialized views provide point-in-time correctness by design. The streaming engine processes events in order and maintains transactional consistency across materialized views. When you query a view, you see a consistent snapshot that reflects all processed events up to that point. There is no window where some features have updated while others lag behind, as can happen with systems that write to multiple external stores independently.
For streaming sources with event-time semantics, RisingWave handles watermarks and late-arriving data correctly. Features computed from windowed aggregations use event timestamps, not processing timestamps, so your features reflect when events actually happened rather than when they were processed.
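To make the point-in-time rule concrete: when constructing a training example, features must count only events at or before the label timestamp. RisingWave's consistent snapshots give you this on the serving side; for offline training-set construction, the rule looks like this hypothetical sketch:

```python
from datetime import datetime, timezone

def features_as_of(events: list, cutoff: datetime) -> dict:
    """Count (timestamp, event_type) pairs up to and including the
    cutoff; anything later would leak future information into the
    training example."""
    visible = [(ts, etype) for ts, etype in events if ts <= cutoff]
    return {
        "total_events": len(visible),
        "cart_additions": sum(e == "add_to_cart" for _, e in visible),
    }

utc = timezone.utc
events = [
    (datetime(2026, 4, 1, 10, 0, tzinfo=utc), "page_view"),
    (datetime(2026, 4, 1, 10, 10, tzinfo=utc), "add_to_cart"),
    (datetime(2026, 4, 1, 14, 0, tzinfo=utc), "page_view"),  # after the label
]
# Label observed at 10:30: the 14:00 event must not be visible
print(features_as_of(events, datetime(2026, 4, 1, 10, 30, tzinfo=utc)))
# {'total_events': 2, 'cart_additions': 1}
```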
Comparison: RisingWave vs. Traditional Feature Stores
How does the streaming SQL approach compare with dedicated feature store platforms? Here is a breakdown across the dimensions that matter most for ML teams.
| Capability | Feast | Tecton | RisingWave as Feature Store |
| --- | --- | --- | --- |
| Feature computation | External (Spark, Airflow, dbt) | Built-in (Python SDK, SQL) | Built-in (SQL materialized views) |
| Real-time features | Requires external stream processor | Native streaming support | Native streaming via MVs |
| Feature storage | Pluggable offline/online stores | Managed online + offline | Integrated (MV state storage) |
| Feature serving | Python SDK or REST API | REST/gRPC API | PostgreSQL wire protocol |
| Feature definition language | Python SDK with decorators | Python SDK or SQL | Standard SQL |
| Point-in-time correctness | Manual entity timestamps | Built-in with Tecton SDK | Built-in via streaming semantics |
| Deployment model | Self-hosted (open source) | Managed SaaS | Self-hosted or RisingWave Cloud |
| Learning curve | Moderate (Feast SDK + infra) | Moderate (Tecton SDK) | Low (SQL + PostgreSQL client) |
| Cost | Infra cost for compute + stores | Enterprise SaaS pricing | Single system cost |
When Feast or Tecton Makes More Sense
Feast and Tecton are the right choice when your organization already has an established ML platform with existing feature definitions, or when you need a feature registry with built-in governance, versioning, and collaboration features. Tecton excels in environments that require managed infrastructure with enterprise support. Feast is ideal when you want an open-source registry to catalog features computed by existing Spark or Airflow pipelines.
When RisingWave Fits Better
RisingWave as a feature store fits best when:
- Real-time features are primary, not an afterthought bolted onto a batch pipeline
- Your team knows SQL and does not want to learn a feature-store-specific SDK
- You want to reduce infrastructure, replacing separate stream processors, key-value stores, and serving layers with one system
- Low-latency serving matters, and you want sub-20ms feature lookups without operating Redis or DynamoDB
- You are already using RisingWave for other streaming workloads like dashboards, alerting, or real-time analytics
The streaming SQL approach is not a complete replacement for every feature store use case. If you need extensive feature versioning, A/B testing of feature sets, or a web UI for feature discovery, a dedicated platform like Feast or Tecton provides those capabilities out of the box. But for the core compute-store-serve pipeline, a streaming database handles it with far less operational overhead.
Production Architecture
In a production deployment, the full architecture looks like this:
graph LR
A[Application Events] --> B[Kafka]
C[Database CDC] --> B
B --> D[RisingWave]
D --> E[User Feature MVs]
D --> F[Product Feature MVs]
D --> G[Session Feature MVs]
E --> H[ML Inference Service<br/>via PostgreSQL]
F --> H
G --> H
E --> I[Training Pipeline<br/>via PostgreSQL]
F --> I
G --> I
D --> J[Downstream Sink<br/>Kafka / Iceberg / S3]
Event data flows from applications and databases into Kafka. RisingWave ingests from Kafka topics, computes features through layered materialized views, and serves the results through its PostgreSQL interface. The ML inference service queries features using a standard database connection. Training pipelines read the same views for batch feature extraction.
If you need features materialized into an external system, RisingWave supports sinks to Kafka, Apache Iceberg, PostgreSQL, and other destinations. This lets you push feature updates to a downstream key-value store if your serving requirements demand it, while still keeping the single-source-of-truth computation in SQL.
Scaling Considerations
RisingWave stores materialized view state in cloud object storage (S3, GCS, Azure Blob), which means state size is not limited by memory. For feature stores with millions of entities and hundreds of features, this is critical. Compute scales horizontally by adding streaming nodes, and the storage layer scales independently.
For serving throughput, a single RisingWave instance handles thousands of point-lookup queries per second. For higher throughput, you can add read replicas or cache hot feature vectors in your application layer.
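Caching hot feature vectors in the application layer is straightforward. Here is a minimal TTL cache sketch (hypothetical, standard-library only; a real deployment might reach for `cachetools` or a shared cache instead):

```python
import time

class FeatureCache:
    """Tiny TTL cache for feature dicts keyed by entity id. Trades a
    bounded staleness window (ttl seconds) for fewer database reads."""

    def __init__(self, ttl: float = 2.0):
        self.ttl = ttl
        self._store = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                 # fresh hit: skip the database
        value = loader(key)                 # miss or expired: reload
        self._store[key] = (now + self.ttl, value)
        return value

cache = FeatureCache(ttl=2.0)
# loader would normally be get_user_features; a stub keeps this runnable
features = cache.get(2, lambda uid: {"user_id": uid, "total_events": 5})
print(features["total_events"])  # 5
```

The TTL bounds how stale a served feature can be, so pick it against your model's freshness requirement rather than defaulting to a large value.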
What is a real-time feature store?
A real-time feature store is a system that computes, stores, and serves machine learning features from live data streams with low latency. Unlike batch feature stores that refresh on a schedule (hourly or daily), a real-time feature store updates feature values continuously as new events arrive, typically within seconds. This ensures that ML models always receive the freshest possible features during inference, which is critical for use cases like fraud detection, dynamic pricing, and personalized recommendations.
How does RisingWave compare to Feast for real-time features?
RisingWave handles feature computation, storage, and serving in a single system using SQL materialized views, while Feast is a feature registry that coordinates between external compute engines and storage backends. For real-time features, Feast requires you to operate a separate stream processor like Apache Flink or Spark Streaming, plus an online store like Redis. RisingWave eliminates this multi-system complexity by computing features incrementally in materialized views and serving them through its built-in PostgreSQL interface, reducing both operational overhead and end-to-end latency.
Can I use RisingWave as both a feature store and a stream processor?
Yes. RisingWave functions as both a stream processor and a feature serving layer simultaneously. You define streaming transformations as materialized views, which RisingWave incrementally maintains and makes queryable through PostgreSQL wire protocol. This dual capability means you can run the same system for real-time analytics dashboards, event-driven alerting, and ML feature serving without deploying separate infrastructure for each use case. The RisingWave documentation covers the full range of supported streaming operations.
When should I use a dedicated feature store instead of a streaming database?
A dedicated feature store like Feast or Tecton is the better choice when you need advanced feature management capabilities: feature versioning and lineage tracking, a feature registry with web UI for discovery, built-in support for A/B testing different feature sets, or governance controls for compliance requirements. These platforms also shine when your features are primarily batch-computed and real-time features are a small addition. If your primary need is computing and serving real-time features with minimal infrastructure, a streaming database approach offers a simpler architecture.
Conclusion
Building a real-time feature store does not require assembling a complex stack of batch pipelines, stream processors, online stores, and serving layers. A streaming database approach consolidates these responsibilities into SQL materialized views that compute, store, and serve features in one system.
Key takeaways:
- Materialized views are feature tables. Define feature computation as SQL, and RisingWave incrementally maintains the results as new events arrive.
- PostgreSQL protocol eliminates the serving layer. Any tool that connects to PostgreSQL can fetch features, no custom SDK or API gateway required.
- Streaming joins combine feature groups. Join materialized views to create composite features that update automatically across data sources.
- Point-in-time correctness comes built in. The streaming engine maintains transactional consistency across views, avoiding the partial-update bugs common in multi-store architectures.
- Start simple, scale later. Begin with a few feature views, add more as your model requirements grow, and leverage sinks to push features to external stores if needed.
Ready to build your own real-time feature store? Try RisingWave Cloud free, no credit card required.
Join our Slack community to ask questions and connect with other stream processing developers.

