Building an Open-Source Feature Store: RisingWave + Redis + S3

You bought the managed feature platform. You ran through the sales pitch, saw the polished UI, and signed the contract. Then production arrived. Your costs tripled. Your ML team couldn't debug why a feature value disagreed with what was logged during training. The vendor's support SLA meant two business days for a P1 incident.

If that sounds familiar, this article is for you.

The managed feature platform market is real, but the pitch oversells what you actually need. A feature store is fundamentally three things: a computation engine, a low-latency serving layer, and a training data archive. None of those require proprietary infrastructure. What they require is good architecture.

This guide walks through a production-ready open-source alternative built on RisingWave, Redis, and S3. It covers the architecture decisions, the SQL, the operational gotchas, and a realistic cost breakdown for a mid-scale deployment.


Why Teams Build Their Own

Cost Scales Against You

Managed feature platforms typically charge on some combination of compute, feature reads, and stored feature values. The pricing sounds reasonable at prototype scale. At 100K features per second across 1,000 models, the bill looks very different.

Tecton, Feast Cloud, and similar platforms can cost $80K-$200K per year for a mid-sized ML organization. That is not a niche complaint -- that is standard pricing. The problem is that feature computation and serving are largely commodity infrastructure problems. You are paying a premium for glue and a UI.

Black-Box Abstractions Break at 2 AM

When a model's behavior degrades in production, the first question is always: "Did the features change?" If your feature store is a black box, answering that question requires opening a support ticket. Latency spikes, silent feature drift, and TTL misconfiguration are all harder to debug when you do not own the stack.

Open-source components expose their internals. RisingWave has an EXPLAIN command and system tables. Redis has DEBUG OBJECT and a rich observability ecosystem. S3 objects are just files. You debug these the same way you debug everything else.

PostgreSQL Compatibility as a Superpower

RisingWave speaks the PostgreSQL wire protocol. That means any ML framework, serving API, or BI tool that can talk to Postgres can read features directly from RisingWave -- no custom SDK, no client library, no vendor-specific connector. FastAPI, SQLAlchemy, Feast (with a custom registry), dbt, Metabase, Grafana -- all of them work out of the box.

That is a meaningful architectural advantage. Your feature definitions live in SQL. Your feature values are queryable with standard tooling. The abstraction is transparent.


Architecture Overview

The three-layer design maps directly to the three access patterns that every feature store needs to support:

Kafka / CDC sources
       |
       v
RisingWave (feature computation layer)
       |
       |-- PostgreSQL protocol --> ML serving API (online, sub-10ms)
       |
       v
  Redis cache (hot feature serving, <1ms)
       |
       v
  S3 / Parquet (training data archive, point-in-time snapshots)

RisingWave handles the hard part: continuous incremental computation over streaming data. It ingests from Kafka topics and CDC sources, maintains materialized views that represent your feature definitions, and exposes them over the PostgreSQL protocol. It is the source of truth for feature values.

Redis serves as the hot cache for the most latency-sensitive lookups. Not all models need sub-millisecond feature serving, but the ones that do (fraud detection, real-time recommendations, dynamic pricing) require it. RisingWave populates Redis via its native Redis sink, triggered on every materialized view update.

S3 with Parquet files handles training data. Point-in-time correct feature snapshots are written from RisingWave to S3 via scheduled sink jobs. Your training pipeline reads directly from S3 without touching the serving infrastructure.

The total ops burden for this stack: you run RisingWave (either self-hosted or via RisingWave Cloud), Redis (ElastiCache or a self-managed cluster), and S3 (any object store). You do not run a separate feature computation engine, a feature serving tier, or a proprietary registry service.


Layer 1: Feature Computation in RisingWave

Ingesting from Kafka

Feature computation starts at the source. RisingWave ingests from Kafka using a CREATE SOURCE statement. The source definition describes the schema and the Kafka connection:

CREATE SOURCE user_events (
    user_id     BIGINT,
    event_type  VARCHAR,
    item_id     BIGINT,
    session_id  VARCHAR,
    ts          TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'user-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE JSON;

For CDC sources (Postgres, MySQL, MongoDB), the connector is postgres-cdc or mysql-cdc. RisingWave handles the offset tracking and exactly-once semantics internally. You do not manage consumer groups for feature computation.

Defining Features as Materialized Views

A feature in this architecture is a materialized view. The view definition is the feature definition. RisingWave maintains it incrementally -- when new events arrive, only the affected rows are recomputed.

Here is a realistic example: a set of user engagement features computed over a sliding 1-hour window.

CREATE MATERIALIZED VIEW user_features_1h AS
SELECT
    user_id,
    COUNT(*)                                          AS event_count_1h,
    COUNT(DISTINCT session_id)                        AS session_count_1h,
    COUNT(*) FILTER (WHERE event_type = 'purchase')  AS purchase_count_1h,
    COUNT(*) FILTER (WHERE event_type = 'view')      AS view_count_1h,
    MAX(ts)                                           AS last_event_ts
FROM user_events
WHERE ts > NOW() - INTERVAL '1 hour'
GROUP BY user_id;

The key property is that this view stays current without a batch job. When a user makes a purchase, the purchase_count_1h for that user updates immediately. At serving time, you read SELECT * FROM user_features_1h WHERE user_id = $1 and get fresh values.

For features that join multiple sources (for example, user features joined with item catalog features):

CREATE MATERIALIZED VIEW user_item_affinity AS
SELECT
    e.user_id,
    i.category,
    COUNT(*)              AS category_views_24h,
    SUM(i.price)          AS category_spend_24h
FROM user_events e
JOIN item_catalog i ON e.item_id = i.item_id
WHERE e.ts > NOW() - INTERVAL '24 hours'
  AND e.event_type IN ('view', 'purchase')
GROUP BY e.user_id, i.category;

Handling CDC Sources

When your features derive from operational databases rather than Kafka, use CDC connectors. RisingWave treats CDC as a first-class source:

CREATE SOURCE orders_cdc
WITH (
    connector = 'postgres-cdc',
    hostname = 'rds.example.com',
    port = '5432',
    username = 'replication_user',
    password = '...',
    database.name = 'production',
    schema.name = 'public',
    table.name = 'orders',
    slot.name = 'risingwave_slot'
);

CREATE MATERIALIZED VIEW order_value_features AS
SELECT
    customer_id,
    COUNT(*)                     AS total_orders,
    SUM(order_value)             AS lifetime_value,
    AVG(order_value)             AS avg_order_value,
    MAX(created_at)              AS last_order_ts
FROM orders_cdc
GROUP BY customer_id;

Changes to the orders table replicate through the CDC slot and update the materialized view in near real-time. Your ML serving layer always sees a current view of customer order history.


Layer 2: Hot Serving Cache with Redis

When to Use Redis

Not every feature lookup needs Redis. If your model serving latency budget is 50ms or more, RisingWave over the PostgreSQL protocol is fast enough -- a well-indexed materialized view lookup runs in 2-5ms for most workloads.

Use Redis when:

  • You need consistent sub-millisecond p99 latency
  • Your serving tier is doing many parallel lookups per request (20+ features per inference call)
  • You have burst traffic patterns that would cause connection pooling pressure on RisingWave

For fraud detection and real-time bidding, Redis is non-negotiable. For batch-adjacent recommendation serving, it is optional.
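The second criterion above -- many parallel lookups per request -- is where pipelining pays off: 20+ feature reads collapse into a single round trip. A minimal sketch, with an in-memory stand-in for redis.Redis so it is self-contained; the real client exposes the same pipeline()/hgetall()/execute() shape:

```python
# InMemoryRedis is a stand-in for redis.Redis so this runs without a server;
# swap in a real client and fetch_feature_hashes works unchanged.
class InMemoryRedis:
    def __init__(self, data):
        self.data = data

    def pipeline(self):
        return _Pipeline(self)


class _Pipeline:
    def __init__(self, client):
        self.client = client
        self.keys = []

    def hgetall(self, key):
        self.keys.append(key)  # queue the command, send nothing yet
        return self

    def execute(self):
        # One round trip returns all queued results in order
        return [self.client.data.get(k, {}) for k in self.keys]


def fetch_feature_hashes(client, keys):
    """Fetch N feature hashes in one round trip instead of N."""
    pipe = client.pipeline()
    for key in keys:
        pipe.hgetall(key)
    return dict(zip(keys, pipe.execute()))


client = InMemoryRedis({
    "features:user:42": {"event_count_1h": "7", "purchase_count_1h": "1"},
    "features:user:43": {"event_count_1h": "2", "purchase_count_1h": "0"},
})
features = fetch_feature_hashes(client, ["features:user:42", "features:user:43"])
```

With the real client, the per-request latency becomes one network round trip plus server time, which is what keeps p99 sub-millisecond at high lookup fan-out.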

Populating Redis from RisingWave

RisingWave has a native Redis sink. Define a sink that pushes materialized view changes directly to Redis as they happen:

CREATE SINK user_features_to_redis
FROM user_features_1h
WITH (
    connector = 'redis',
    primary_key = 'user_id',
    redis.url = 'redis://redis:6379'
);

RisingWave serializes the row as a Redis hash or JSON string (configurable), keyed by user_id. Every time a user's features update in the materialized view, the corresponding Redis key updates within seconds.

TTL Strategy

Set Redis TTL to match your feature freshness requirements plus a safety buffer. For 1-hour window features, a 2-hour TTL means a user who goes inactive has their features expire from cache but can be re-served from RisingWave on cache miss.

# Serving layer: Redis-first with RisingWave fallback.
# Assumes redis_client was created with decode_responses=True and
# pg_conn is a synchronous Postgres-protocol connection with a
# fetchrow(query, *params) helper.
def get_user_features(user_id: int) -> dict:
    key = f"features:user:{user_id}"
    cached = redis_client.hgetall(key)
    if cached:
        return cached

    # Cache miss: query RisingWave directly
    row = pg_conn.fetchrow(
        "SELECT * FROM user_features_1h WHERE user_id = $1",
        user_id
    )
    if row is None:
        return {}
    features = dict(row)
    redis_client.hset(key, mapping=features)
    redis_client.expire(key, 7200)  # 2-hour TTL
    return features

This pattern avoids cold-start problems on Redis restarts and keeps Redis size bounded without manual eviction management.


Layer 3: Training Data Archive on S3

Why S3, Not a Data Warehouse

Training data has different access characteristics than serving data. It is accessed in large bulk reads, rarely, by training jobs that can tolerate higher latency. S3 with columnar Parquet files is the cheapest and most portable format for this workload. It avoids coupling your training pipeline to your serving infrastructure.

Point-in-time correctness is the hard part. A naive approach -- joining your current feature values with historical labels -- produces feature leakage. Your training set needs features as they appeared at the time of the label event, not as they appear today.

RisingWave handles this via periodic sink jobs. You schedule a daily sink of your materialized views to S3, partitioned by date:

CREATE SINK user_features_daily_snapshot
FROM user_features_1h
WITH (
    connector = 's3',
    s3.region = 'us-east-1',
    s3.bucket_name = 'ml-feature-store',
    s3.path = 'snapshots/user_features_1h/date={snapshot_date}/',
    s3.credentials.access = '...',
    s3.credentials.secret = '...'
)
FORMAT PLAIN ENCODE PARQUET;

For point-in-time joins during training, your training pipeline reads the snapshot that corresponds to the label timestamp:

import pyarrow.dataset as ds

def get_training_features(label_df, feature_name, label_ts_col):
    """Point-in-time correct feature join."""
    snapshots = ds.dataset(
        f"s3://ml-feature-store/snapshots/{feature_name}/",
        format="parquet",
        partitioning="hive"
    )
    # Read the snapshot closest to (but not after) each label timestamp
    # Implementation depends on your label distribution and snapshot frequency
    ...
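Under the assumption of daily snapshots named by ISO date, the "closest but not after" selection reduces to a binary search over the partition names. A pure-Python sketch (function name illustrative):

```python
# Choose, for each label timestamp, the latest snapshot partition that is
# not after it -- the core of a point-in-time correct join.
# Assumes daily snapshots with ISO "YYYY-MM-DD" partition names, which
# sort lexicographically in date order.
from bisect import bisect_right

def snapshot_for_label(snapshot_dates, label_date):
    """Return the latest snapshot date <= label_date, or None if none exists."""
    dates = sorted(snapshot_dates)
    idx = bisect_right(dates, label_date)
    return dates[idx - 1] if idx > 0 else None

snapshots = ["2025-01-14", "2025-01-15", "2025-01-16"]
print(snapshot_for_label(snapshots, "2025-01-15"))  # -> 2025-01-15
print(snapshot_for_label(snapshots, "2025-01-17"))  # -> 2025-01-16
print(snapshot_for_label(snapshots, "2025-01-10"))  # -> None
```

Returning None for labels older than the first snapshot is deliberate: serving today's features for an old label is exactly the leakage this layer exists to prevent, so those rows should be dropped, not approximated.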

Partitioning Strategy

Partition by date at the top level, and consider sub-partitioning by a high-cardinality dimension if your training jobs always filter on it (for example, by region or product_line). Over-partitioning creates small files and hurts read performance. The right partition granularity is the one that aligns with your training job's typical filter predicates.

s3://ml-feature-store/
  snapshots/
    user_features_1h/
      date=2025-01-15/
        part-00001.parquet
        part-00002.parquet
      date=2025-01-16/
        ...
    order_value_features/
      date=2025-01-15/
        ...

Feature Registry: Managing Feature Definitions

A feature registry answers the question: "What features exist, what do they mean, and where do they come from?" Managed platforms bundle this as a proprietary UI. You have two options for the open-source path.

Simple: YAML + Git

For most teams, a YAML file per feature group in a Git repository is sufficient. Each file describes the feature group, the materialized view that backs it, the owner, and the freshness SLA:

# features/user_engagement.yaml
name: user_features_1h
description: "User engagement features computed over a 1-hour rolling window"
owner: ml-platform@example.com
freshness_sla_seconds: 30
materialized_view: user_features_1h
features:
  - name: event_count_1h
    dtype: int64
    description: "Total events in last 1 hour"
  - name: purchase_count_1h
    dtype: int64
    description: "Purchase events in last 1 hour"
  - name: session_count_1h
    dtype: int64
    description: "Distinct sessions in last 1 hour"

CI validates that every YAML definition has a corresponding materialized view in RisingWave. The Git history is your audit trail. No additional infrastructure required.
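The validation itself is a set difference. A sketch, assuming the declared names have already been parsed out of the YAML files and the actual names collected by running SHOW MATERIALIZED VIEWS over the Postgres connection; both sides are literals here so the check is self-contained:

```python
# CI check: every registered feature group must be backed by a real
# materialized view. In CI, `actual` comes from SHOW MATERIALIZED VIEWS
# against RisingWave; the literals below stand in for those results.
def validate_registry(declared_views, actual_views):
    """Return the set of declared views with no backing materialized view."""
    return set(declared_views) - set(actual_views)

declared = ["user_features_1h", "order_value_features", "user_item_affinity"]
actual = ["user_features_1h", "order_value_features"]

missing = validate_registry(declared, actual)
if missing:
    print(f"registry drift: no materialized view for {sorted(missing)}")
```

Failing the build on a non-empty difference is usually enough; the reverse check (views with no YAML entry) catches undocumented features and is worth adding once the registry is established.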

Advanced: dbt-Style Catalog

If your organization already uses dbt, you can model feature definitions as dbt models with custom metadata. The dbt manifest becomes your feature catalog, and dbt's lineage graph shows upstream data sources. This approach works well when your data team already lives in dbt and you want feature definitions to co-exist with analytics models.


Cost Estimate: 100K Features/Second, 1,000 Models

A representative mid-scale deployment: 100,000 feature updates per second, serving 1,000 models with a mix of latency requirements, 90 days of training data retention.

Component                Configuration                               Monthly Cost
RisingWave               16-core cluster, 64GB RAM (2 nodes)         ~$1,200
Redis                    ElastiCache r7g.xlarge (2 nodes, Multi-AZ)  ~$600
S3 + Parquet             5TB storage, typical egress                 ~$120
Kafka (if not existing)  MSK 3-broker cluster                        ~$800
Total                                                                ~$2,720/month

Compare that to managed feature platform pricing for the same scale: $8,000-$18,000 per month is typical, depending on vendor and contract. The open-source stack lands at roughly 80% less at this scale. The gap widens as you scale up, because your Kafka and object storage costs grow sublinearly while managed platform pricing grows per-feature-read.
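As a quick arithmetic check of the numbers above, taking the midpoint of the quoted managed-platform range:

```python
# Line items from the table above, in USD per month
monthly = {
    "risingwave": 1200,
    "redis": 600,
    "s3": 120,
    "kafka": 800,
}
total = sum(monthly.values())            # 2720
managed_midpoint = (8000 + 18000) / 2    # 13000
savings = 1 - total / managed_midpoint   # ~0.79, i.e. roughly 80% less
```

The "roughly 80%" figure holds at the midpoint; at the low end of the managed range ($8K/month) the open-source stack is still about 66% cheaper.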

The honest caveat: this does not account for engineering time to build and maintain the stack. If your team has zero streaming infrastructure experience, the first three months have a real learning cost. Budget for it. For teams that already operate Kafka and Redis, the marginal cost is low.


Operational Considerations

Failover

RisingWave replicates state across nodes. If a compute node fails, it recovers from its checkpoint and replays from Kafka from the last consistent offset. For most workloads, recovery time is under 60 seconds.

Redis in ElastiCache Multi-AZ mode fails over automatically -- the endpoint does not change, and reconnect logic in your serving layer handles the brief interruption. S3 is durable by construction.

The serving layer should implement a circuit breaker for the Redis-to-RisingWave fallback. If RisingWave is recovering, you want to serve stale features from Redis rather than returning errors. Stale features are usually better than no features.
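A minimal sketch of that circuit breaker, with illustrative thresholds: after N consecutive fallback failures it opens for a cooldown period, during which the serving layer returns whatever Redis has rather than hitting a recovering RisingWave.

```python
# Illustrative circuit breaker for the Redis-to-RisingWave fallback path.
# Threshold and cooldown values are examples, not recommendations.
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """True if the fallback call may be attempted."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: permit one trial call after the cooldown
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker(threshold=3, cooldown=60.0)
for _ in range(3):
    breaker.record_failure()
# Breaker is now open: serve stale Redis values instead of erroring
```

The serving path wraps the RisingWave query in `if breaker.allow():` and records the outcome; when the breaker is open, the stale Redis value is the response.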

Backfill

When you add a new feature that requires historical data, you need to backfill the materialized view. If your Kafka topics retain enough history (7-30 days is common), RisingWave can backfill by replaying from the beginning of the topic. For longer historical windows, load historical data from your data warehouse into RisingWave via a batch source, then switch to streaming.

Backfill of the S3 training archive requires a separate historical export from your source-of-truth systems. Build that tooling the first time you add a major feature group -- you will need it again.

Schema Evolution

Adding a column to a materialized view requires dropping and recreating it. That causes a brief period where the feature is unavailable from RisingWave. For zero-downtime schema changes:

  1. Create a new materialized view with the updated schema (e.g., user_features_1h_v2).
  2. Let it backfill while the old view continues serving.
  3. Switch serving traffic to the new view once it is caught up.
  4. Drop the old view.

This is the same blue-green pattern you use for database migrations. It is manual, but it is predictable.
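The swap can be scripted. A sketch that generates the two bracketing statements; the catch-up wait and the serving-traffic switch happen outside SQL, and all names here are illustrative:

```python
# Generate the blue-green statements for a materialized view swap.
# Steps 2 and 3 (catch-up wait, serving switch) live in tooling/config,
# not in SQL, so only the bracketing DDL is produced here.
def blue_green_statements(view, definition_sql):
    new_view = f"{view}_v2"
    return [
        f"CREATE MATERIALIZED VIEW {new_view} AS {definition_sql};",
        # ...after backfill completes and serving points at the new view:
        f"DROP MATERIALIZED VIEW {view};",
    ]

stmts = blue_green_statements(
    "user_features_1h",
    "SELECT user_id, COUNT(*) AS event_count_1h FROM user_events GROUP BY user_id",
)
for s in stmts:
    print(s)
```

Keeping the generator trivial is the point: the DDL is predictable, and the operational care goes into verifying the new view has caught up before the switch.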


FAQ

Do I need all three layers?

No. Start with just RisingWave. The PostgreSQL protocol is sufficient for most models. Add Redis when you hit latency requirements that RisingWave cannot meet. Add S3 when you have training jobs that need historical snapshots.

How does this compare to Feast?

Feast is a feature registry and serving abstraction layer. It needs a backend for computation and storage -- you configure it to use an offline store (like BigQuery or Redshift) and an online store (like Redis or DynamoDB). This architecture replaces Feast's offline store with RisingWave and keeps Redis for online serving. You can run Feast on top of this stack if you want its registry and SDK, or use YAML + Git if you want something simpler.

What if my models need features from multiple entity types in a single request?

Build a feature vector assembly step in your serving API. Query Redis (or RisingWave) for each entity type in parallel, then merge. For N entity types, your serving latency is the max of N parallel lookups, not the sum. Most serving layers already do this.
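A sketch of that assembly step, with stub fetchers standing in for the Redis or RisingWave reads (all names illustrative):

```python
# Assemble one feature vector from multiple entity types with parallel
# lookups: latency is ~max of the N fetches, not their sum.
from concurrent.futures import ThreadPoolExecutor

def fetch_user(user_id):
    # Stand-in for a Redis/RisingWave lookup keyed by user
    return {"user:event_count_1h": 7}

def fetch_item(item_id):
    # Stand-in for a Redis/RisingWave lookup keyed by item
    return {"item:category_views_24h": 3}

def assemble_features(lookups):
    """lookups: list of (fetcher, key) pairs, one per entity type."""
    with ThreadPoolExecutor(max_workers=len(lookups)) as pool:
        futures = [pool.submit(fn, key) for fn, key in lookups]
        vector = {}
        for fut in futures:
            vector.update(fut.result())
        return vector

vector = assemble_features([(fetch_user, 42), (fetch_item, 1001)])
# vector == {"user:event_count_1h": 7, "item:category_views_24h": 3}
```

Prefixing each feature name with its entity type, as the stubs do, avoids key collisions when the per-entity dicts are merged.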

Is RisingWave production-ready?

Yes. RisingWave is used in production at companies including Naver, Trendyol, and others at substantial scale. It has been Generally Available since 2023. The PostgreSQL wire protocol compatibility means your existing connection poolers, monitoring agents, and query tools work without modification.

What is the realistic team size to operate this?

One experienced data/platform engineer can operate this stack in a support role alongside other responsibilities. Full ownership and active development of new features requires closer to 0.5 FTE. That is substantially less than managing a proprietary platform that generates support tickets, debugging sessions, and contract negotiations.


The managed feature platform pitch is compelling because it abstracts the hard parts. The problem is that the abstractions do not disappear -- they just move behind a vendor wall where you cannot see or fix them. This stack keeps the abstractions visible, the costs predictable, and the debugging tools familiar.

If your team can run Kafka and Postgres, you can run this. The SQL is not exotic. The operations are not novel. The cost savings are real.

Start with RisingWave and a single materialized view. Add Redis when your latency requirements demand it. Add S3 when your training pipeline needs historical snapshots. The architecture grows with your actual needs, not with a vendor's upsell roadmap.
