Real-Time Feature Engineering for ML with SQL (2026)

Real-Time Feature Engineering for ML with SQL (2026)

Real-Time Feature Engineering for ML with SQL (2026)

Real-time feature engineering computes machine learning features from streaming data as events happen — not hours later in a batch pipeline. A streaming database like RisingWave lets you define feature transformations as SQL materialized views that update in real time, serving fresh features to ML models via standard PostgreSQL connections. This eliminates the need for dedicated feature store infrastructure like Feast or Tecton for many use cases.

This guide shows how to build a real-time feature pipeline using SQL, without writing Python feature engineering code or managing a separate feature store.

The Feature Freshness Problem

Traditional ML feature pipelines look like this:

Raw data → Batch ETL (hourly/daily) → Feature store → Model inference

The problem is feature staleness. A fraud detection model using features computed from last night's batch run can't see that a user just made three purchases in three different countries in the last five minutes. A recommendation model using yesterday's user behavior can't personalize based on what the user clicked thirty seconds ago.

Real-time features close this gap:

Event streams → Streaming SQL → Real-time features → Model inference

Feature Engineering with SQL Materialized Views

In RisingWave, every real-time feature is a SQL materialized view. The view updates automatically as new events arrive, and models query it via PostgreSQL protocol.

User Behavioral Features

-- Features for recommendation / personalization models
CREATE MATERIALIZED VIEW user_behavior_features AS
SELECT
  user_id,
  -- Session features
  COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '5 minutes') as clicks_5min,
  COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour') as clicks_1h,
  COUNT(DISTINCT page_category) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour') as categories_1h,
  -- Engagement features
  AVG(duration_seconds) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour') as avg_duration_1h,
  MAX(event_time) as last_active,
  -- Purchase features
  COUNT(*) FILTER (WHERE event_type = 'purchase' AND event_time > NOW() - INTERVAL '24 hours') as purchases_24h,
  SUM(amount) FILTER (WHERE event_type = 'purchase' AND event_time > NOW() - INTERVAL '30 days') as spend_30d
FROM user_events
GROUP BY user_id;

Fraud Detection Features

-- Features for transaction fraud model
CREATE MATERIALIZED VIEW fraud_features AS
SELECT
  user_id,
  -- Velocity features
  COUNT(*) FILTER (WHERE txn_time > NOW() - INTERVAL '1 minute') as txns_1min,
  COUNT(*) FILTER (WHERE txn_time > NOW() - INTERVAL '1 hour') as txns_1h,
  COUNT(DISTINCT merchant_id) FILTER (WHERE txn_time > NOW() - INTERVAL '1 hour') as unique_merchants_1h,
  COUNT(DISTINCT country) FILTER (WHERE txn_time > NOW() - INTERVAL '1 hour') as unique_countries_1h,
  -- Amount features
  SUM(amount) FILTER (WHERE txn_time > NOW() - INTERVAL '1 hour') as total_amount_1h,
  MAX(amount) FILTER (WHERE txn_time > NOW() - INTERVAL '24 hours') as max_txn_24h,
  AVG(amount) FILTER (WHERE txn_time > NOW() - INTERVAL '30 days') as avg_txn_30d,
  -- Ratio features
  CASE WHEN AVG(amount) FILTER (WHERE txn_time > NOW() - INTERVAL '30 days') > 0
    THEN MAX(amount) FILTER (WHERE txn_time > NOW() - INTERVAL '1 hour')
         / AVG(amount) FILTER (WHERE txn_time > NOW() - INTERVAL '30 days')
    ELSE 0
  END as amount_ratio
FROM transactions
GROUP BY user_id;

Serving Features to Models

import psycopg2
import numpy as np

def get_features(user_id: str) -> np.ndarray:
    conn = psycopg2.connect(host="risingwave", port=4566, dbname="dev")
    cursor = conn.cursor()
    cursor.execute("""
        SELECT clicks_5min, clicks_1h, categories_1h, avg_duration_1h,
               purchases_24h, spend_30d
        FROM user_behavior_features
        WHERE user_id = %s
    """, (user_id,))
    row = cursor.fetchone()
    conn.close()
    return np.array(row) if row else np.zeros(6)

# Use in model inference
features = get_features("user_123")
prediction = model.predict(features.reshape(1, -1))

Streaming SQL vs Traditional Feature Stores

AspectRisingWave (Streaming SQL)FeastTecton
Feature definitionSQL materialized viewsPython SDKPython SDK
Real-time computationBuilt-in (streaming engine)Requires external stream processorBuilt-in
ServingPostgreSQL protocolREST API / gRPCREST API
Batch + streamingStreaming-first, batch via SQLBatch-first, streaming add-onBoth
InfrastructureSingle systemRedis + offline store + orchestratorManaged platform
Learning curveSQLPython + Feast conceptsPython + Tecton concepts
CostOpen source (Apache 2.0)Open sourceCommercial
Best forSQL-native teams, simpler architecturesComplex ML platform teamsEnterprise ML platforms

When to Use Streaming SQL for Features

Use RisingWave when:

  • Your features are expressible in SQL (aggregations, joins, window functions)
  • You want a single system for feature computation and serving
  • Your team knows SQL better than Python feature engineering
  • You don't need a full ML platform (feature registry, experiment tracking, etc.)

Use a dedicated feature store when:

  • You need feature versioning and lineage tracking
  • You have a large ML platform team managing hundreds of models
  • You need tight integration with ML training pipelines
  • You need features from non-SQL sources (image processing, NLP embeddings)

Real-Time Feature Patterns

Sliding Window Aggregations

CREATE MATERIALIZED VIEW rolling_features AS
SELECT
  entity_id,
  COUNT(*) FILTER (WHERE ts > NOW() - INTERVAL '5 minutes') as count_5m,
  COUNT(*) FILTER (WHERE ts > NOW() - INTERVAL '1 hour') as count_1h,
  COUNT(*) FILTER (WHERE ts > NOW() - INTERVAL '24 hours') as count_24h
FROM events
GROUP BY entity_id;

Cross-Stream Feature Joins

-- Join user events with product catalog for richer features
CREATE MATERIALIZED VIEW enriched_features AS
SELECT
  ue.user_id,
  p.category,
  COUNT(*) as views_per_category,
  AVG(p.price) as avg_viewed_price
FROM user_events ue
JOIN products p ON ue.product_id = p.product_id
WHERE ue.event_time > NOW() - INTERVAL '1 hour'
GROUP BY ue.user_id, p.category;

Feature Freshness Monitoring

-- Monitor feature staleness
CREATE MATERIALIZED VIEW feature_freshness AS
SELECT
  'user_behavior_features' as feature_table,
  MAX(last_active) as newest_feature,
  NOW() - MAX(last_active) as staleness,
  COUNT(*) as total_entities
FROM user_behavior_features;

Frequently Asked Questions

What is real-time feature engineering?

Real-time feature engineering computes machine learning features from streaming data as events arrive, rather than in batch jobs that run hours later. This gives ML models access to the freshest possible signals — critical for use cases like fraud detection, recommendations, and dynamic pricing where feature freshness directly impacts model accuracy.

Can SQL replace Python for feature engineering?

For aggregation-based features (counts, sums, averages, ratios, window functions), SQL is not only sufficient but often simpler and more maintainable than equivalent Python code. SQL features defined as materialized views update automatically without orchestration. For features requiring complex transformations (NLP, image processing, custom algorithms), Python remains necessary.

How does RisingWave compare to Feast for real-time features?

RisingWave is a streaming database that computes and serves features via SQL materialized views and PostgreSQL protocol. Feast is a feature store that manages feature definitions, storage, and serving. RisingWave is simpler (single system, SQL-only) but doesn't provide feature registry, versioning, or training dataset generation. Use RisingWave for SQL-expressible real-time features; use Feast when you need full feature store capabilities.

What latency can I expect for real-time features?

RisingWave updates materialized views within milliseconds of new data arriving, with point query latency of 10-20ms p99. This means features are available for model inference within sub-second of the triggering event — fast enough for real-time fraud detection, personalization, and dynamic pricing use cases.

Best-in-Class Event Streaming
for Agents, Apps, and Analytics
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.