Build a Real-Time Feature Store with Streaming SQL

Feature stores were supposed to solve the hardest part of production machine learning: getting the right features to the right model at the right time. Instead, most teams end up managing a sprawling stack of batch jobs, caching layers, and synchronization logic just to keep feature values reasonably fresh.

The typical architecture looks like this: a Spark job computes features every hour, writes them to an offline store (S3, BigQuery, or a data warehouse), and a separate process copies a subset of those features into an online store (Redis, DynamoDB) for low-latency serving. Feast orchestrates the plumbing. Tecton adds a managed compute layer on top. Both require you to maintain separate offline and online pipelines, reconcile their outputs, and debug discrepancies when the numbers diverge.

What if you could skip all of that? What if the feature computation, the storage, and the serving layer were the same system – and features updated themselves in milliseconds instead of hours?

That is exactly what materialized views in RisingWave enable. You define features as SQL queries over live event streams, and the database incrementally maintains the results. Your model serving layer reads fresh features over a standard PostgreSQL connection. No batch jobs, no Redis cache, no synchronization logic.

This post walks through building a lightweight real-time feature store using nothing but streaming SQL.

What Is a Feature Store?

A feature store is a data system purpose-built for machine learning. It serves two functions:

  1. Offline store: Provides historical feature values for model training. Training pipelines need point-in-time correct features – the feature values as they existed at the moment each training example occurred, not the current values.

  2. Online store: Provides current feature values for model inference. When your fraud detection model receives a transaction, it needs the user's transaction count over the last 30 minutes right now, not from the last batch run.
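
Point-in-time correctness can be illustrated with a toy as-of lookup: for each training example, take the most recent feature value at or before the example's timestamp. This is a conceptual sketch, not Feast's or RisingWave's implementation; the function and data names are illustrative.

```python
# Toy point-in-time lookup: given a sorted history of (timestamp, value)
# pairs for one feature, return the value that was in effect at time ts.
from bisect import bisect_right

def as_of(feature_history, ts):
    """Return the feature value as it existed at time ts (at-or-before semantics)."""
    times = [t for t, _ in feature_history]
    i = bisect_right(times, ts)          # count of records with timestamp <= ts
    return feature_history[i - 1][1] if i else None

# Feature 'txn_count_1h' for one user, recorded at three points in time.
history = [(100, 2), (200, 5), (300, 9)]

print(as_of(history, 250))  # value in effect at t=250 -> 5
print(as_of(history, 50))   # before any record exists -> None
```

A training pipeline applies this lookup per training example, so each label is paired with the feature values the model would actually have seen at inference time.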

The challenge is keeping these two stores consistent. The offline store is typically a data warehouse (Snowflake, BigQuery) or a lakehouse (Apache Iceberg, Delta Lake), optimized for large scans. The online store is typically a key-value store (Redis, DynamoDB), optimized for low-latency point lookups. Different storage engines, different update mechanisms, different failure modes.

Traditional feature stores like Feast and Tecton sit between these two stores and manage the data flow. But they do not compute features themselves. They rely on external compute engines – Spark, Flink, Airflow DAGs – to produce feature values. The feature store is a coordination layer, not a computation engine.

Why Real-Time Feature Freshness Matters

Batch feature computation works fine when the world changes slowly. Customer demographics, product catalogs, and account settings do not shift minute by minute. A daily refresh is adequate.

But many high-value ML use cases depend on signals that change constantly:

  • Fraud detection: A user's transaction velocity in the last 5 minutes is a stronger fraud signal than their average spending over the past month. A batch pipeline that refreshes hourly will miss the burst of rapid-fire transactions that indicate a compromised account.
  • Real-time recommendations: A user who just added running shoes to their cart should see related accessories now, not stale recommendations based on last week's browsing history.
  • Dynamic pricing: Ride-sharing surge pricing, airline seat pricing, and e-commerce flash sales all depend on demand signals that shift within seconds. A pricing model fed hour-old features will consistently underprice surges and overprice lulls.
  • Risk scoring: Credit risk models that incorporate real-time account activity – deposits, withdrawals, unusual login patterns – outperform those that rely on daily snapshots.

In each of these cases, the freshness gap between when an event happens and when the model sees it as a feature directly impacts model accuracy and business outcomes.

Building a Real-Time Feature Store with RisingWave

RisingWave is a streaming database that ingests event streams, computes continuous queries, and serves results through a PostgreSQL-compatible interface. The core primitive is the materialized view: a SQL query whose results are incrementally maintained as new data arrives.
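
Conceptually, incremental maintenance means each arriving event updates a running aggregate in place rather than triggering a recomputation over all history. A toy sketch of the idea (not RisingWave internals; class and method names are illustrative):

```python
# Toy model of incremental view maintenance: each event updates per-key
# aggregates in O(1), instead of rescanning all events on every refresh.
from collections import defaultdict

class IncrementalCounts:
    """Maintains COUNT(*) and SUM(amount) per user as events arrive."""
    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def apply(self, user_id: str, amount: float) -> None:
        # One event in, one constant-time state update.
        self.count[user_id] += 1
        self.total[user_id] += amount

    def lookup(self, user_id: str):
        # Point lookup, analogous to querying the materialized view.
        return self.count[user_id], self.total[user_id]

mv = IncrementalCounts()
mv.apply("u1", 25.0)
mv.apply("u1", 40.0)
print(mv.lookup("u1"))  # (2, 65.0)
```

A batch pipeline recomputes such aggregates from scratch on a schedule; a materialized view carries the state forward, which is why the results stay fresh at event speed.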

This combination of continuous computation and low-latency serving is precisely what a feature store needs. Here is how to build one.

Step 1: Define Your Event Sources

Connect to the event streams that produce raw data for your features. This example uses Kafka topics for user transactions and product interactions:

CREATE SOURCE user_transactions (
    transaction_id VARCHAR,
    user_id VARCHAR,
    amount DECIMAL,
    merchant_category VARCHAR,
    payment_method VARCHAR,
    is_international BOOLEAN,
    event_time TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'user_transactions',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

CREATE SOURCE product_interactions (
    interaction_id VARCHAR,
    user_id VARCHAR,
    product_id VARCHAR,
    interaction_type VARCHAR,  -- 'view', 'add_to_cart', 'purchase', 'review'
    session_id VARCHAR,
    event_time TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'product_interactions',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

-- Upsert ingestion requires a primary key, so the catalog is a table, not a source
CREATE TABLE product_catalog (
    product_id VARCHAR PRIMARY KEY,
    category VARCHAR,
    price DECIMAL,
    brand VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'product_catalog',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'latest'
) FORMAT UPSERT ENCODE JSON;

Step 2: Create User Feature Materialized Views

Define features as materialized views. RisingWave incrementally updates these as each event arrives, so the feature values are always fresh.

User spending features – aggregations that capture recent transaction behavior:

CREATE MATERIALIZED VIEW user_spending_features AS
SELECT
    user_id,
    COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '5 minutes')
        AS txn_count_5min,
    COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
        AS txn_count_1h,
    COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '24 hours')
        AS txn_count_24h,
    COALESCE(SUM(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour'), 0)
        AS total_spend_1h,
    COALESCE(AVG(amount) FILTER (WHERE event_time > NOW() - INTERVAL '24 hours'), 0)
        AS avg_txn_amount_24h,
    COALESCE(MAX(amount) FILTER (WHERE event_time > NOW() - INTERVAL '24 hours'), 0)
        AS max_txn_amount_24h,
    COUNT(DISTINCT merchant_category)
        FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
        AS distinct_merchants_1h,
    COUNT(*) FILTER (
        WHERE is_international = TRUE
        AND event_time > NOW() - INTERVAL '24 hours'
    ) AS international_txn_count_24h
FROM user_transactions
GROUP BY user_id;

User engagement features – behavioral signals from product interactions:

CREATE MATERIALIZED VIEW user_engagement_features AS
SELECT
    user_id,
    COUNT(*) FILTER (
        WHERE interaction_type = 'view'
        AND event_time > NOW() - INTERVAL '1 hour'
    ) AS product_views_1h,
    COUNT(*) FILTER (
        WHERE interaction_type = 'add_to_cart'
        AND event_time > NOW() - INTERVAL '1 hour'
    ) AS add_to_cart_1h,
    COUNT(*) FILTER (
        WHERE interaction_type = 'purchase'
        AND event_time > NOW() - INTERVAL '24 hours'
    ) AS purchases_24h,
    COUNT(DISTINCT product_id)
        FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
        AS distinct_products_viewed_1h,
    COUNT(DISTINCT session_id)
        FILTER (WHERE event_time > NOW() - INTERVAL '24 hours')
        AS sessions_24h
FROM product_interactions
GROUP BY user_id;

Step 3: Create Product Feature Materialized Views

Features are not limited to users. Product-level features power recommendation models, demand forecasting, and inventory optimization.

Product popularity features – real-time demand signals per product:

CREATE MATERIALIZED VIEW product_popularity_features AS
SELECT
    p.product_id,
    p.category,
    p.price,
    p.brand,
    COUNT(pi.interaction_id) FILTER (
        WHERE pi.interaction_type = 'view'
        AND pi.event_time > NOW() - INTERVAL '1 hour'
    ) AS views_1h,
    COUNT(pi.interaction_id) FILTER (
        WHERE pi.interaction_type = 'purchase'
        AND pi.event_time > NOW() - INTERVAL '1 hour'
    ) AS purchases_1h,
    COUNT(pi.interaction_id) FILTER (
        WHERE pi.interaction_type = 'add_to_cart'
        AND pi.event_time > NOW() - INTERVAL '1 hour'
    ) AS cart_adds_1h,
    COUNT(DISTINCT pi.user_id)
        FILTER (WHERE pi.event_time > NOW() - INTERVAL '1 hour')
        AS unique_users_1h
FROM product_catalog p
LEFT JOIN product_interactions pi ON p.product_id = pi.product_id
GROUP BY p.product_id, p.category, p.price, p.brand;

Step 4: Serve Features via PostgreSQL-Compatible Queries

Because RisingWave speaks the PostgreSQL wire protocol, your model serving layer connects with any PostgreSQL client library – psycopg2 in Python, pgx in Go, JDBC in Java. No custom SDK, no proprietary API.

Fetch user features for a fraud detection model at inference time:

SELECT
    s.txn_count_5min,
    s.txn_count_1h,
    s.total_spend_1h,
    s.avg_txn_amount_24h,
    s.distinct_merchants_1h,
    s.international_txn_count_24h,
    e.product_views_1h,
    e.add_to_cart_1h,
    e.sessions_24h
FROM user_spending_features s
JOIN user_engagement_features e ON s.user_id = e.user_id
WHERE s.user_id = 'user_8837';

This query returns in single-digit milliseconds. It reads pre-computed results from the materialized views, not raw event data. There is no aggregation at query time – the work was already done incrementally as each event arrived.
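
In application code, the serving path is an ordinary parameterized query plus a vectorizer that fixes the column order the model was trained on. A minimal sketch (helper names and column ordering are illustrative; any PostgreSQL driver such as psycopg2 or asyncpg can execute the query against RisingWave):

```python
# Illustrative serving-side helpers: build the fraud-feature query above as a
# parameterized statement, and turn one result row into a model input vector.

FEATURE_COLUMNS = [
    "txn_count_5min", "txn_count_1h", "total_spend_1h",
    "avg_txn_amount_24h", "distinct_merchants_1h",
    "international_txn_count_24h", "product_views_1h",
    "add_to_cart_1h", "sessions_24h",
]

# First six columns come from user_spending_features (alias s),
# the rest from user_engagement_features (alias e).
_SPENDING = set(FEATURE_COLUMNS[:6])

def build_feature_query() -> str:
    """Parameterized version of the fraud-feature JOIN query."""
    cols = ", ".join(
        ("s." if c in _SPENDING else "e.") + c for c in FEATURE_COLUMNS
    )
    return (
        f"SELECT {cols} "
        "FROM user_spending_features s "
        "JOIN user_engagement_features e ON s.user_id = e.user_id "
        "WHERE s.user_id = %s"
    )

def row_to_vector(row) -> list:
    """Convert one result row into the model's input vector, NULL-safe."""
    return [float(v) if v is not None else 0.0 for v in row]
```

Usage with a driver would look like `cur.execute(build_feature_query(), ("user_8837",))` followed by `row_to_vector(cur.fetchone())`; keeping the column list in one place ensures the serving vector always matches the training schema.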

Fetch product features for a recommendation model:

SELECT
    product_id,
    category,
    views_1h,
    purchases_1h,
    cart_adds_1h,
    unique_users_1h
FROM product_popularity_features
WHERE category = 'electronics'
ORDER BY purchases_1h DESC
LIMIT 20;

Step 5: Sink Features for Offline Training

A feature store needs to serve both online inference and offline training. RisingWave handles this by sinking materialized view results to a data lakehouse for historical storage:

CREATE SINK user_features_to_iceberg
FROM user_spending_features
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'user_id',
    warehouse.path = 's3://ml-warehouse/features',
    database.name = 'feature_store',
    table.name = 'user_spending_features',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181'
);

Your training pipeline reads from the Iceberg table. Your serving pipeline reads from the materialized view. Both use the same SQL definition for feature computation, which eliminates training-serving skew by design.

RisingWave vs. Feast vs. Tecton: A Direct Comparison

The table below compares using RisingWave as a lightweight feature store against two widely adopted alternatives. The comparison focuses on the real-time feature serving use case – not the full MLOps lifecycle.

Capability                   | Feast                                  | Tecton                                 | RisingWave
Feature computation          | External (Spark, Airflow)              | Built-in (managed Spark/Flink)         | Built-in (streaming SQL)
Online store                 | Redis, DynamoDB, Postgres              | Managed key-value store                | Materialized views (built-in)
Offline store                | S3, BigQuery, Redshift                 | Managed Spark tables                   | Sink to Iceberg, Postgres, S3
Online feature freshness     | Depends on batch schedule              | Seconds to minutes (stream transforms) | Milliseconds to seconds
Feature definition language  | Python (Feast SDK)                     | Python (Tecton SDK)                    | SQL
Serving interface            | REST API or Python SDK                 | REST API or Python SDK                 | PostgreSQL wire protocol
Infrastructure required      | Redis/DynamoDB + Spark + orchestrator  | Managed (SaaS)                         | Single RisingWave cluster
Feature registry/catalog     | Built-in                               | Built-in                               | Not included (pair with Feast)
Training-serving consistency | Point-in-time joins via SDK            | Managed by platform                    | Same SQL for both paths
Incremental computation      | No (full recompute per batch)          | Yes (for stream features)              | Yes (all materialized views)
Pricing model                | Open source                            | SaaS per feature read                  | Open source + RisingWave Cloud

When to use each

Choose Feast when you need a feature registry and metadata catalog, your features are primarily batch-computed, and you already have Spark or Airflow in your stack. Feast is the right choice for teams that want governance and discovery over feature definitions across many models.

Choose Tecton when you want a fully managed platform and are willing to pay for it. Tecton handles both batch and real-time feature computation in a single SaaS product, which reduces operational burden for teams without dedicated ML infrastructure engineers.

Choose RisingWave when feature freshness is your primary concern, you want to define features in SQL without managing additional infrastructure, and you are comfortable pairing it with Feast or a simple metadata store for feature discovery. RisingWave excels as the computation and serving layer – the engine that keeps features fresh – while leaving catalog and governance to purpose-built tools.

For many teams, the practical architecture is RisingWave for real-time computation and serving, plus Feast for feature registration and point-in-time training joins. This gives you millisecond-fresh features online and a well-governed offline store for training, without the complexity of managing Spark, Flink, Redis, and an orchestrator.

FAQ

What is a real-time feature store?

A real-time feature store is a data system that computes and serves machine learning features with sub-second freshness. Unlike batch feature stores that refresh features on hourly or daily schedules, a real-time feature store continuously updates feature values as new events arrive. This is critical for ML models that act on rapidly changing signals, such as fraud detection, dynamic pricing, and real-time personalization.

Can RisingWave fully replace Feast or Tecton?

RisingWave replaces the computation and online serving components of a feature store, but it does not include a feature registry or metadata catalog. If your team needs feature discovery, access control, and point-in-time join utilities for training, you should pair RisingWave with Feast or a similar registry. For teams that only need real-time feature serving and are comfortable managing feature definitions as SQL files in version control, RisingWave can serve as a standalone lightweight feature store.

How do I prevent training-serving skew with this approach?

Training-serving skew happens when the code that computes features for training differs from the code that computes features for serving. With RisingWave, you define each feature as a single SQL materialized view. The live materialized view serves online inference. A sink writes the same view's output to an Iceberg table for offline training. Since both paths share the same SQL definition, there is no skew.

What throughput and latency can I expect?

RisingWave scales horizontally. A three-node cluster can ingest tens of thousands of events per second while maintaining feature freshness under 200 milliseconds. For higher throughput, add compute nodes. Feature serving queries against materialized views typically return in 1 to 5 milliseconds for point lookups by primary key.

Do I need Kafka to use this approach?

Kafka is the most common event source, but RisingWave also supports ingestion from Amazon Kinesis, Apache Pulsar, and several CDC connectors for databases like PostgreSQL and MySQL. You can also insert data directly into RisingWave tables if your application does not use a message broker.

Conclusion

Traditional feature stores solve an important problem – organizing and serving ML features in production – but they introduce substantial infrastructure complexity. Maintaining separate offline and online stores, running batch compute pipelines, and reconciling the two paths creates operational overhead that many teams struggle with.

RisingWave offers a simpler alternative for the real-time serving side of this problem. Materialized views continuously compute feature values from live event streams with millisecond freshness. The PostgreSQL-compatible interface means your serving layer needs nothing more than a standard database connection. And sinking results to Iceberg or PostgreSQL gives you offline storage for training without building a separate pipeline.

The key trade-off is clear: RisingWave gives you fast, simple, SQL-native feature computation and serving. For feature governance, discovery, and catalog management, pair it with a purpose-built registry like Feast. Together, they form a feature store architecture that is both fresh and well-organized – without the Spark jobs, Redis clusters, and Airflow DAGs that most teams dread maintaining.


Ready to build your own real-time feature store? Get started with RisingWave in 5 minutes. Quickstart →

Have questions about feature engineering with streaming SQL? Join the RisingWave Slack community to connect with other ML and data engineers.
