Feature engineering determines whether a machine learning model works in production. You can have the best architecture, the largest training set, and the most sophisticated hyperparameter tuning -- but if your features are stale, incomplete, or inconsistent between training and serving, the model will underperform.
Most teams build feature pipelines as batch jobs. A Spark or Airflow DAG runs every hour, reads from the data warehouse, computes aggregates, and writes results to a feature store. This works until it doesn't. Fraud patterns shift within seconds. Recommendation relevance decays within minutes. Pricing signals go cold within the hour. By the time your batch pipeline refreshes, the opportunity window has closed.
Real-time feature engineering solves this by computing features continuously from live event streams, delivering fresh values to your models within milliseconds of each source event. This guide explains what real-time feature engineering is, why it matters, and how streaming SQL in RisingWave makes it practical without requiring you to learn new frameworks or languages.
What Is Real-Time Feature Engineering?
Feature engineering is the process of transforming raw data into numerical inputs that a machine learning model can consume. A raw event like "user X purchased item Y for $42.50 at 14:03:07" becomes a set of features: transaction count in the last 5 minutes, average spend in the last 24 hours, number of distinct merchants in the last hour.
In batch feature engineering, these transformations run on a schedule. A job computes all features from a snapshot of historical data, writes the results, and then waits until the next scheduled run.
Real-time feature engineering removes the schedule. Instead of reading from a warehouse on a timer, it reads from event streams (Kafka, Kinesis, Redpanda) and updates feature values incrementally as each event arrives. The result is a feature store that reflects the state of the world right now, not the state from the last batch run.
Three properties that define real-time feature engineering
- Continuous computation: Features update with every incoming event, not on a cron schedule. There is no "next run" to wait for.
- Incremental processing: When a new event arrives, only the affected feature rows are recomputed. The system does not re-scan the entire dataset. This is what makes sub-second latency achievable at scale.
- Stream-native: The pipeline reads directly from event streams and produces results that are immediately queryable. There is no intermediate landing zone or staging table.
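The contrast between full recomputation and incremental maintenance can be sketched in a few lines of Python. This is an illustration of the idea, not RisingWave internals; event shape and names are made up:

```python
# Sketch: full recomputation vs. incremental maintenance of a
# per-user transaction count.
from collections import defaultdict

events = [
    {"user_id": "u1", "amount": 42.50},
    {"user_id": "u2", "amount": 10.00},
    {"user_id": "u1", "amount": 5.25},
]

# Batch style: every run re-scans the entire dataset.
def recompute_all(events):
    counts = defaultdict(int)
    for e in events:          # O(total events) per run
        counts[e["user_id"]] += 1
    return counts

# Streaming style: one arriving event updates one feature row.
feature_state = defaultdict(int)

def on_event(e):              # O(1) per event
    feature_state[e["user_id"]] += 1

for e in events:
    on_event(e)

# Both approaches agree on the result; only the cost per update differs.
assert dict(feature_state) == dict(recompute_all(events))
```

The batch function pays the full scan cost on every run; the streaming handler touches only the one row the event affects, which is why per-event latency stays flat as history grows.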
Why Batch Feature Pipelines Break in Production
Batch feature pipelines are familiar and well-understood, but they introduce three systemic problems that get worse as your ML system matures.
The freshness problem
A batch pipeline that runs every hour produces features that are, on average, 30 minutes old at serving time. For some use cases, this is fine -- a content recommendation model can tolerate hourly refreshes. But for fraud detection, dynamic pricing, real-time personalization, and anomaly detection, stale features directly reduce model accuracy.
Consider a fraud model. It was trained on features like "transaction count in the last 5 minutes" computed from raw event logs. But in production, the feature store only refreshes hourly. The model receives a txn_count_5min value that was accurate 45 minutes ago. A burst of fraudulent transactions that started 3 minutes ago is invisible to the model.
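A toy timeline makes the gap concrete (the minute values are taken from the example above):

```python
# Toy illustration of feature staleness: the feature store last refreshed
# 45 minutes ago, so a fraud burst that began 3 minutes ago is invisible.
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
last_refresh = now - timedelta(minutes=45)
burst_events = [now - timedelta(minutes=m) for m in (1, 2, 3)]

# What the model sees: txn_count_5min as of the last batch refresh.
served_txn_count_5min = sum(
    1 for t in burst_events
    if last_refresh - timedelta(minutes=5) < t <= last_refresh
)

# What a real-time pipeline would see right now.
fresh_txn_count_5min = sum(
    1 for t in burst_events if t > now - timedelta(minutes=5)
)

assert served_txn_count_5min == 0   # burst not visible to the model
assert fresh_txn_count_5min == 3    # burst fully visible
```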
Training-serving skew
This is the most insidious failure mode in production ML. Training-serving skew happens when the code that computes features for training differs from the code that computes features for serving, even subtly.
In a typical batch architecture, training features are computed by a PySpark job reading from a data lake, while serving features come from a separate pipeline writing to Redis. The two code paths use different join semantics, different null handling, different window boundaries. The model sees a different feature distribution in production than it saw during training, and its predictions degrade silently -- no errors, no crashes, just worse outcomes.
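A minimal sketch of how two "identical" feature definitions diverge. The null-handling difference here is invented for illustration, but it is a typical culprit:

```python
# Sketch of training-serving skew: two implementations of "average spend"
# that differ only in how they treat missing amounts.
amounts = [100.0, None, 50.0, None]  # raw amounts, some missing

# Training path (e.g., a warehouse job): drops nulls before averaging.
valid = [a for a in amounts if a is not None]
train_avg = sum(valid) / len(valid)            # averages over 2 values

# Serving path (e.g., a Redis writer): fills missing amounts with 0.0.
filled = [a if a is not None else 0.0 for a in amounts]
serve_avg = sum(filled) / len(filled)          # averages over 4 values

# Same raw data, same feature name, different distributions.
assert train_avg != serve_avg
```

Neither path throws an error, and both produce plausible numbers, which is exactly why this failure mode degrades models silently.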
Operational overhead
Each batch feature pipeline is a DAG node. It needs scheduling, monitoring, alerting, backfill logic for failed runs, and dependency management. At 10 features, this is manageable. At 200 features across 15 models, it is a full-time job. Late-arriving data triggers reruns. Schema changes cascade through downstream jobs. Backfills compete with production runs for cluster resources.
How Streaming SQL Makes Real-Time Features Practical
The traditional answer to real-time feature engineering has been to write custom Flink or Spark Structured Streaming applications in Java, Scala, or Python. This works, but it introduces a new tech stack, a new deployment model, and a new skillset requirement for your team.
Streaming SQL offers a different path. If you can write a GROUP BY query, you can build a real-time feature pipeline. RisingWave is a streaming database that lets you define features as SQL queries over live streams, maintained continuously as materialized views.
Connect to your event stream
First, declare a source that connects to your event stream:
CREATE SOURCE transactions (
txn_id VARCHAR,
user_id VARCHAR,
amount DECIMAL,
merchant_category VARCHAR,
ip_address VARCHAR,
event_time TIMESTAMP
) WITH (
connector = 'kafka',
topic = 'transactions',
properties.bootstrap.server = 'broker:9092',
scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
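For reference, an event matching this schema could be serialized as JSON like the following Python sketch. The field names mirror the columns declared above; actually sending the bytes to the `transactions` topic with a Kafka client (kafka-python, confluent-kafka, etc.) is left out:

```python
# Build a JSON payload matching the CREATE SOURCE schema above.
import json
from datetime import datetime, timezone

event = {
    "txn_id": "txn_0001",
    "user_id": "user_8832",
    "amount": "42.50",  # sending DECIMAL as a string avoids float rounding
    "merchant_category": "electronics",
    "ip_address": "203.0.113.7",
    "event_time": datetime.now(timezone.utc).isoformat(),
}

payload = json.dumps(event).encode("utf-8")  # bytes for a Kafka producer

# Round-trip check: the payload carries exactly the declared columns.
assert set(json.loads(payload)) == {
    "txn_id", "user_id", "amount",
    "merchant_category", "ip_address", "event_time",
}
```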
Define features as materialized views
Each materialized view continuously computes a group of related features. RisingWave incrementally maintains these views -- when a new event arrives, only the affected rows are updated.
User transaction velocity features:
CREATE MATERIALIZED VIEW user_txn_velocity AS
SELECT
user_id,
COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '5 minutes')
AS txn_count_5min,
COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS txn_count_1h,
SUM(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS total_spend_1h,
AVG(amount) FILTER (WHERE event_time > NOW() - INTERVAL '24 hours')
AS avg_txn_amount_24h,
COUNT(DISTINCT merchant_category)
FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS distinct_merchants_1h,
COUNT(DISTINCT ip_address)
FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS distinct_ips_1h
FROM transactions
GROUP BY user_id;
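To make the window semantics concrete, here is an illustrative Python equivalent of what this view maintains for a single user, over made-up transactions. RisingWave computes this incrementally per event; the sketch simply recomputes from a small list:

```python
# Illustrative recomputation of a few user_txn_velocity features for one
# user. Tuples: (event_time, amount, merchant_category, ip_address).
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
txns = [
    (now - timedelta(minutes=2),  20.0, "grocery",     "10.0.0.1"),
    (now - timedelta(minutes=30), 15.0, "electronics", "10.0.0.1"),
    (now - timedelta(hours=3),    80.0, "travel",      "10.0.0.2"),
]

def within(t, delta):
    """Mirrors the FILTER (WHERE event_time > NOW() - INTERVAL ...) clauses."""
    return t > now - delta

txn_count_5min = sum(1 for t, *_ in txns if within(t, timedelta(minutes=5)))
txn_count_1h   = sum(1 for t, *_ in txns if within(t, timedelta(hours=1)))
total_spend_1h = sum(a for t, a, *_ in txns if within(t, timedelta(hours=1)))
distinct_merchants_1h = len(
    {m for t, _, m, _ in txns if within(t, timedelta(hours=1))}
)

assert (txn_count_5min, txn_count_1h, total_spend_1h) == (1, 2, 35.0)
assert distinct_merchants_1h == 2
```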
Merchant-level aggregation features:
CREATE MATERIALIZED VIEW merchant_risk_features AS
SELECT
merchant_category,
COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '10 minutes')
AS txn_volume_10min,
AVG(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS avg_txn_amount_1h,
COUNT(DISTINCT user_id)
FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS unique_users_1h
FROM transactions
GROUP BY merchant_category;
Cross-entity features via streaming joins:
CREATE MATERIALIZED VIEW user_latest_merchant AS
SELECT user_id, merchant_category
FROM (
    SELECT user_id, merchant_category,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
    FROM transactions
) ranked
WHERE rn = 1;
CREATE MATERIALIZED VIEW user_merchant_features AS
SELECT
    u.user_id,
    u.txn_count_5min,
    u.avg_txn_amount_24h,
    m.txn_volume_10min AS merchant_volume_10min,
    m.avg_txn_amount_1h AS merchant_avg_amount_1h
FROM user_txn_velocity u
JOIN user_latest_merchant lm ON u.user_id = lm.user_id
JOIN merchant_risk_features m
    ON lm.merchant_category = m.merchant_category;
The intermediate user_latest_merchant view keeps each user's most recent merchant_category (a streaming top-1 per user), giving the final join a real key. RisingWave maintains all three views, including the streaming joins, incrementally -- no custom code needed.
Serve features at inference time
Because RisingWave speaks the PostgreSQL wire protocol, your inference service can read features with a standard SQL query over any PostgreSQL client library:
SELECT
txn_count_5min,
txn_count_1h,
total_spend_1h,
avg_txn_amount_24h,
distinct_merchants_1h,
distinct_ips_1h
FROM user_txn_velocity
WHERE user_id = 'user_8832';
This query returns in single-digit milliseconds because it reads from a pre-computed materialized view, not from raw events. No special SDK required -- any language with a PostgreSQL driver works.
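A feature-fetch helper for an inference service might look like the following sketch. It assumes any DB-API-style connection (for example, one from psycopg2.connect pointed at RisingWave); the view and column names come from the example above, and the helper itself is hypothetical:

```python
# Sketch: fetch a user's features from the user_txn_velocity view over a
# DB-API connection. Works with any PostgreSQL driver that supports %s
# placeholders (e.g., psycopg2).
FEATURE_COLS = [
    "txn_count_5min", "txn_count_1h", "total_spend_1h",
    "avg_txn_amount_24h", "distinct_merchants_1h", "distinct_ips_1h",
]

def fetch_user_features(conn, user_id):
    sql = (
        f"SELECT {', '.join(FEATURE_COLS)} "
        "FROM user_txn_velocity WHERE user_id = %s"
    )
    with conn.cursor() as cur:
        cur.execute(sql, (user_id,))
        row = cur.fetchone()
    if row is None:
        return None  # cold-start user: no events seen yet
    return dict(zip(FEATURE_COLS, row))
```

The None return for unseen users matters in practice: a brand-new user has no row in the view, and the model needs explicit default values rather than a silent failure.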
Sink features to external systems
If your serving infrastructure already reads from Redis, PostgreSQL, or an Apache Iceberg table, you can push features downstream:
CREATE SINK txn_features_to_iceberg
FROM user_txn_velocity
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'user_id',
warehouse.path = 's3://feature-warehouse/features',
database.name = 'ml_features',
table.name = 'user_txn_velocity',
catalog.type = 'rest',
catalog.uri = 'http://iceberg-catalog:8181'
);
This pattern unifies training and serving. Your training pipeline reads historical features from the Iceberg table, while your serving pipeline reads live features from the materialized view. Both use identical computation logic defined in the same SQL, eliminating training-serving skew by design.
Real-Time vs. Batch Feature Engineering: A Comparison
The choice between batch and real-time feature engineering is not binary. Most production ML systems use both. The question is which features need real-time freshness and which can tolerate staleness.
| Dimension | Batch Feature Engineering | Real-Time Feature Engineering |
| --- | --- | --- |
| Freshness | Minutes to hours | Milliseconds to seconds |
| Compute model | Full recomputation on schedule | Incremental per event |
| Typical tools | Spark, Airflow, dbt | Flink, Spark Streaming, RisingWave |
| Feature definition | Python/Scala transformations | SQL materialized views (in RisingWave) |
| Training-serving consistency | Requires careful engineering | Same SQL for both paths |
| Orchestration | DAGs, schedulers, backfill logic | None -- always running |
| Best for | Slowly changing features, large historical aggregates | Time-sensitive signals, velocity features, session features |
Use real-time for features where freshness directly affects model accuracy: transaction velocity, session behavior, real-time pricing signals, inventory levels, error rates.
Use batch for features where staleness is acceptable: demographic attributes, lifetime aggregates, features derived from data that updates daily (credit scores, account age).
FAQ
What types of ML features benefit most from real-time computation?
Features that capture recent behavior and change rapidly benefit the most. Examples include transaction counts over short windows (last 5 minutes, last hour), session-level aggregates (pages viewed in current session), rolling ratios (error rate in the last 10 minutes), and velocity metrics (rate of change in spending). If the feature value can shift meaningfully between batch runs, it is a candidate for real-time computation.
How does real-time feature engineering eliminate training-serving skew?
Training-serving skew arises when training and serving use different code to compute the same feature. With streaming SQL, you define each feature once as a materialized view. The live view serves real-time inference, and a sink to Iceberg or another storage layer provides the historical data for training. Since both paths execute the same SQL definition, the feature distributions seen during training match what the model encounters in production.
Do I need to replace my existing feature store to use real-time features?
No. A streaming SQL engine like RisingWave replaces the feature computation layer, not the feature store. You can sink computed features from RisingWave into Feast, Tecton, or any other feature store. RisingWave handles the real-time computation; the feature store handles discovery, governance, and access control.
How does RisingWave compare to Apache Flink for feature engineering?
Both can compute features from event streams in real time. The key differences are developer experience and operational complexity. RisingWave uses standard SQL and manages state internally -- you define features as materialized views, and the system handles incremental maintenance, checkpointing, and recovery. Flink applications are typically written in Java, Scala, or Python; Flink SQL exists, but you still manage state backends (such as RocksDB) and configure checkpointing yourself. For aggregate and window-based features -- which cover the majority of ML feature engineering use cases -- streaming SQL in a streaming database is simpler to write, test, and maintain.
Conclusion
Real-time feature engineering is the practice of computing ML features continuously from live event streams, replacing scheduled batch jobs with always-on pipelines that deliver sub-second freshness. Here are the key takeaways:
- Stale features degrade model accuracy: For time-sensitive use cases like fraud detection, dynamic pricing, and personalization, the gap between batch refreshes directly translates to worse predictions.
- Training-serving skew is a systemic risk in batch architectures: Maintaining separate training and serving code paths for features creates silent model degradation that is difficult to detect and debug.
- Streaming SQL lowers the barrier to real-time features: Instead of learning Flink or writing custom streaming applications, you can define features as SQL materialized views that update incrementally with each event.
- You do not need to choose one or the other: Use real-time feature engineering for time-sensitive signals and batch for slowly changing attributes. The two approaches complement each other.
- RisingWave makes this practical with PostgreSQL compatibility: Define features in SQL, query them over a standard PostgreSQL connection, and sink them to your existing feature store or data lake.
Get started with RisingWave in 5 minutes. Quickstart →
Join our Slack community to ask questions and connect with other data and ML engineers building real-time feature pipelines.

