You Don't Need a Feature Store: Use a Streaming Database

Let me be upfront: dedicated feature stores like Feast, Tecton, and Hopsworks are good tools. If your organization has 50 data scientists, a dedicated ML platform team, and hundreds of features shared across dozens of models, you should probably use one. This article is not for you.

This article is for the rest of us — the team running two or three ML models in production, drowning in infrastructure complexity before the actual ML work even begins. The team that just wants to compute rolling aggregates over user events and serve them at inference time without standing up a distributed system.

Here's the uncomfortable truth: most teams adopting feature stores spend more time managing infrastructure than building features.

The Feature Store Stack Is Three Systems Pretending to Be One

Open any feature store architecture diagram and you'll see the same thing: an offline store for training data (usually a data warehouse or S3 + Hive), an online store for low-latency serving (usually Redis or DynamoDB), and a feature computation layer in the middle (usually Spark, Flink, or dbt) that keeps both stores synchronized.

That's three systems, plus an orchestration layer to coordinate them, plus a feature registry to track metadata, plus monitoring for pipeline lag between the offline and online stores.

Each system has its own failure modes. Spark jobs fail silently. Redis runs out of memory. The orchestration pipeline falls behind during traffic spikes. And the worst failure of all: the offline store and online store drift out of sync, which means your model trains on data that looks nothing like what it sees in production.

This problem has a name: training-serving skew. It's endemic to any architecture where you compute features twice — once for training (in batch) and once for serving (in real time) — using different code paths.

The conventional solution is to write your feature logic once in a DSL (Python functions in Feast, for example) and let the platform translate it to Spark for offline computation and to streaming for online computation. This works until it doesn't — and when it breaks, you're debugging across two runtime environments with different semantics.

There's a simpler model. Let's talk about it.

What a Feature Store Actually Does

Before replacing something, you need to understand what it actually does. A feature store has four jobs:

1. Feature computation. Raw events (clicks, transactions, sensor readings) are transformed into ML-ready features — usually aggregations over time windows. "User's average transaction amount over the last 7 days." "Number of failed login attempts in the last hour."

2. Online serving. At inference time, your model needs feature values for a specific entity (a user, a device, a transaction) in milliseconds. The online store — typically Redis — is a key-value cache that answers these lookups fast.

3. Offline training data. When you're training a new model version, you need a dataset with point-in-time correct feature values: "what was the user's 7-day transaction average at the time they made this purchase, not today?" Getting this wrong — a classic mistake — introduces data leakage and makes your model look better in training than it performs in production.

4. Feature registry. A catalog of what features exist, how they're computed, who owns them, and where they're stored. This matters a lot at scale; it matters less when you have ten features.

A streaming database can handle jobs 1 through 3 directly, and job 4 partially.

How a Streaming Database Maps to Each Role

A streaming database like RisingWave ingests streams of raw events, computes queries over them incrementally, and materializes the results in a way that's immediately queryable via standard SQL. It's not a stream processor you write jobs for — it's a database you write queries against, and those queries are always up to date.

Here's how that maps to feature store jobs:

Feature computation via materialized views. You define your features as SQL materialized views. RisingWave evaluates them incrementally as new events arrive — not in batch, not on a schedule, continuously. When a new transaction comes in, the 7-day rolling aggregate for that user updates within seconds.

Online serving via PostgreSQL protocol. RisingWave speaks the PostgreSQL wire protocol, which means any Postgres client can query it. Your model serving API connects the same way it connects to a Postgres database, runs SELECT against a materialized view, and gets current feature values. Latency is in the single-digit milliseconds for point lookups.

Offline training data via SQL + temporal queries. RisingWave supports FOR SYSTEM_TIME AS OF syntax for temporal joins, which is the mechanism for point-in-time correct feature extraction. More on this below.

The key insight is that there's only one computation path. Your features are defined in SQL, computed by the streaming database, and served from the same system. There's no Spark-to-Redis sync to manage. There's no offline/online consistency problem because there's no offline/online split.

Let's Look at the SQL

Here's a concrete example: a fraud detection model that uses rolling transaction aggregates as features.

Step 1: Ingest raw events from Kafka.

CREATE SOURCE transactions (
    user_id       BIGINT,
    amount        NUMERIC,
    merchant_id   VARCHAR,
    event_time    TIMESTAMP WITH TIME ZONE
)
WITH (
    connector = 'kafka',
    topic = 'raw_transactions',
    properties.bootstrap.server = 'kafka:9092'
)
FORMAT PLAIN ENCODE JSON;

This is a source, not a table — RisingWave reads from Kafka continuously. No batch jobs, no watermark management in application code.

Step 2: Define features as materialized views.

-- Rolling 7-day aggregates per user
CREATE MATERIALIZED VIEW user_transaction_features AS
SELECT
    user_id,
    COUNT(*)                          AS txn_count_7d,
    SUM(amount)                       AS txn_volume_7d,
    AVG(amount)                       AS txn_avg_7d,
    MAX(amount)                       AS txn_max_7d,
    STDDEV(amount)                    AS txn_stddev_7d,
    COUNT(DISTINCT merchant_id)       AS unique_merchants_7d,
    MAX(event_time)                   AS last_updated
FROM transactions
WHERE event_time >= NOW() - INTERVAL '7 days'
GROUP BY user_id;

This view is maintained incrementally. When a new transaction lands in Kafka, the relevant rows in this view update automatically. You don't schedule a Spark job — you just query the view.

Step 3: Hourly velocity features using tumbling windows.

CREATE MATERIALIZED VIEW user_hourly_velocity AS
SELECT
    user_id,
    window_start,
    window_end,
    COUNT(*)    AS txn_count,
    SUM(amount) AS txn_volume
FROM TUMBLE(transactions, event_time, INTERVAL '1 hour')
GROUP BY user_id, window_start, window_end;

Step 4: Serve features via your model API.

Your inference service runs a query like this:

SELECT
    txn_count_7d,
    txn_volume_7d,
    txn_avg_7d,
    txn_stddev_7d,
    unique_merchants_7d
FROM user_transaction_features
WHERE user_id = $1;

This is a standard Postgres query. Any language with a Postgres driver works. No custom SDK, no feature store client library.
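To make that concrete, here's a minimal sketch of the serving-side lookup in Python. It assumes `conn` is any DB-API 2.0 connection — for example one created with `psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")`, where 4566 is RisingWave's default frontend port and the credentials are placeholders for your deployment:

```python
# A feature lookup is just a parameterized SELECT against the
# materialized view -- no feature store SDK involved.
FEATURE_QUERY = """
    SELECT txn_count_7d, txn_volume_7d, txn_avg_7d,
           txn_stddev_7d, unique_merchants_7d
    FROM user_transaction_features
    WHERE user_id = %s
"""

def fetch_features(conn, user_id):
    """Return one user's feature vector as a list, or None for an
    unseen user (no row in the materialized view yet)."""
    with conn.cursor() as cur:
        cur.execute(FEATURE_QUERY, (user_id,))
        row = cur.fetchone()
    return list(row) if row is not None else None
```

The function works with any Postgres driver that follows DB-API 2.0 (psycopg2, psycopg, pg8000), which is the whole point: your inference service's "feature store client" is a database driver you already have.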

Step 5: (Optional) Sink features to Redis for ultra-low-latency serving.

If your serving latency requirement is sub-millisecond — which is rare but real — you can sink the materialized view to Redis:

CREATE SINK features_to_redis
FROM user_transaction_features
WITH (
    primary_key = 'user_id',
    connector = 'redis',
    redis.url = 'redis://localhost:6379'
) FORMAT PLAIN ENCODE JSON (force_append_only='true');

RisingWave handles the write propagation. You still define features once in SQL; you've just added a read-side cache.

Point-in-Time Correctness for Training Data

This is the part where most "simple" solutions fall apart. When you're building a training dataset for a fraud model, you can't just join your labeled examples to today's feature values — that leaks future information. A transaction labeled "fraud" in your dataset might show a high txn_count_7d only because you computed the feature after subsequent fraudulent transactions had already landed.

You need to know: what was the user's transaction count at the exact moment the transaction occurred?

RisingWave supports this via temporal joins with FOR SYSTEM_TIME AS OF:

-- Build a training dataset with point-in-time correct features
SELECT
    t.transaction_id,
    t.user_id,
    t.amount,
    t.event_time,
    f.txn_count_7d,
    f.txn_volume_7d,
    f.txn_avg_7d,
    f.txn_stddev_7d,
    t.is_fraud  -- label
FROM labeled_transactions t
LEFT JOIN user_transaction_features FOR SYSTEM_TIME AS OF t.event_time f
    ON t.user_id = f.user_id;

The FOR SYSTEM_TIME AS OF clause tells RisingWave to join against the state of user_transaction_features as it existed at t.event_time, not as it exists today. This is how you get training-serving consistency without a separate offline computation pipeline.

It's worth noting that this requires RisingWave to have historical data retained — you'll want to configure appropriate retention policies. And for very large historical datasets, you may still want to export to a data warehouse for training data storage. But the feature computation logic stays in one place.
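If you do export training data, the extraction can stay thin: run the temporal join and stream it to a file. Here's a hedged sketch assuming `conn` is a DB-API 2.0 connection to RisingWave; the query mirrors the temporal join above, and the output path and table names come from the article's examples (CSV shown for simplicity — in practice you might write Parquet via pandas/pyarrow instead):

```python
import csv

# The same point-in-time join used above; feature logic still lives
# in the user_transaction_features materialized view, not here.
TRAINING_QUERY = """
    SELECT t.transaction_id, t.user_id, t.amount, t.event_time,
           f.txn_count_7d, f.txn_volume_7d, f.txn_avg_7d,
           f.txn_stddev_7d, t.is_fraud
    FROM labeled_transactions t
    LEFT JOIN user_transaction_features FOR SYSTEM_TIME AS OF t.event_time f
        ON t.user_id = f.user_id
"""

def export_training_set(conn, path):
    """Stream the point-in-time training set into a CSV file.
    Returns the number of data rows written (excluding the header)."""
    with conn.cursor() as cur:
        cur.execute(TRAINING_QUERY)
        header = [col[0] for col in cur.description]
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            count = 0
            for row in cur:   # stream rows; avoids loading all in memory
                writer.writerow(row)
                count += 1
    return count
```

Because the export is just a query, retraining against fresh labels is a re-run of this script — there's no separate offline pipeline to keep in sync with the feature definitions.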

The Single-System Advantage

Let me be concrete about what you stop managing when you collapse offline store, online store, and computation into one system.

You no longer manage sync lag. There's no pipeline keeping Redis in sync with your data warehouse, which means there's no monitoring for sync lag, no alerting when that pipeline falls behind during traffic spikes, and no incidents at 2am because Redis has stale feature values.

You no longer have two implementations of the same logic. The aggregation you wrote for Spark is the aggregation you wrote for the streaming layer. They're the same SQL query. Feature logic drift — where the Spark job and the streaming job compute slightly different things due to implementation differences — goes away.

You no longer need a feature computation DSL. If your team knows SQL, they can define features. No Feast Python SDK, no Tecton feature configuration language, no "how do I express a custom aggregation in this framework?"

The operational surface area shrinks significantly. You're running one stateful system instead of three.

When You Still Need a Dedicated Feature Store

I said upfront I'd be honest about tradeoffs. Here's when this approach doesn't work:

Very high serving QPS (>1M/sec). RisingWave's PostgreSQL serving layer is fast, but it's a database, not a dedicated key-value store. If you're a large-scale recommendation system serving millions of requests per second, the latency and throughput characteristics of a dedicated key-value store like Redis or DynamoDB matter. You probably need a dedicated online store. In that case, you can still use RisingWave to compute features and sink them to your online store — but you've introduced the sync step back into the picture.

Python-first teams with complex feature logic. If your feature engineering involves custom Python functions, NumPy operations, or libraries like scikit-learn's preprocessing steps, SQL is a constraint. Some feature transformations express naturally in Python and awkwardly in SQL. Dedicated feature platforms with Python-native feature definition (Tecton, Feast) are a better fit if your team primarily thinks in Python.

Large-scale feature sharing across many teams. The feature registry problem becomes real when you have many teams, many models, and a need to reuse features across all of them. Dedicated feature stores have mature tooling for feature discovery, lineage, access control, and documentation. If you're an ML platform team serving dozens of model teams, you need that infrastructure. RisingWave has no equivalent of Feast's feature repository or Hopsworks's feature store UI.

Regulatory requirements for offline data audit. Some industries require long-term, immutable storage of the exact feature values used in model decisions. A dedicated data lake or warehouse with append-only semantics is a better fit for this than a streaming database's materialized views.

The honest framing: RisingWave as a feature store works best for teams with 2-10 ML models in production, serving QPS in the tens of thousands or below, and data engineers who are more SQL-fluent than Python-fluent.

FAQ

Can I use RisingWave with my existing ML framework?

Yes. RisingWave exposes a PostgreSQL wire protocol, so any ML serving framework that can run a SQL query can use it. Scikit-learn, XGBoost, LightGBM, and TensorFlow Serving don't care where their feature data comes from — they care that they get a vector of numbers at inference time. Your serving code calls a Postgres query instead of a feature store SDK call.
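The glue between the query result and the model is correspondingly small. A hedged sketch, assuming a scikit-learn-style classifier with `predict_proba` (the feature names come from the article's example; the model itself is a placeholder):

```python
# The serving contract: a fixed feature order the model was trained
# on. The query row comes back as a dict; the model wants a vector.
FEATURE_ORDER = ["txn_count_7d", "txn_volume_7d", "txn_avg_7d",
                 "txn_stddev_7d", "unique_merchants_7d"]

def to_vector(row, default=0.0):
    """Order a feature dict into the model's expected input vector.
    Missing features (e.g. a brand-new user with no transactions yet)
    fall back to a default value."""
    return [float(row.get(name, default)) for name in FEATURE_ORDER]

def score(model, row):
    """Run any predict_proba-style classifier on one feature row and
    return the positive-class probability."""
    return model.predict_proba([to_vector(row)])[0][1]
```

Whether the model is XGBoost, LightGBM, or scikit-learn, the hand-off is the same: a row of numbers in, a score out.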

What happens if RisingWave goes down? Do I lose my features?

RisingWave is designed for S3-native storage — state is checkpointed to object storage, not kept only in memory. If the service restarts, it recovers from checkpoint and replays from Kafka to catch up to the current stream position. The failure mode is increased latency during recovery, not data loss. For high-availability requirements, RisingWave supports multi-node deployment.

Is this actually cheaper than a dedicated feature store?

It depends on your scale. At small to medium scale — a few GB of feature state, serving QPS in the thousands — running a single RisingWave cluster is significantly cheaper than running Spark (for batch computation), Redis (for online serving), and a data warehouse (for offline storage) separately. At large scale, the economics flip because you're running a general-purpose streaming database where a specialized key-value store would be more cost-efficient per request.

How fresh are the features? Is there lag?

RisingWave maintains materialized views incrementally, so feature updates happen within seconds of the underlying event. For the tumbling window example above, hourly features finalize at the end of each hour window. For the rolling 7-day aggregates, every new event updates the relevant rows. The end-to-end latency from a Kafka message arriving to a feature value being queryable is typically under 5 seconds in a properly configured deployment — comparable to real-time feature platforms, much better than batch pipelines.


If you're spending more time managing feature store infrastructure than building features, that's a signal to question the architecture, not to hire more DevOps engineers.

RisingWave is open source (Apache 2.0) and written in Rust. You can run it locally with Docker in five minutes. The SQL you write for development is the SQL you run in production.

That's a different kind of simplicity than most MLOps tooling offers — and sometimes simpler is the right call.
