Real-Time vs Batch Feature Engineering

Your fraud detection model runs on features computed six hours ago. Your recommendation engine serves suggestions based on yesterday's browsing session. Your pricing model still thinks demand is low because the batch job has not caught up to the lunch rush.

None of these are hypothetical. They are the direct result of batch feature engineering applied to problems that need real-time data. But the reverse mistake is just as expensive: building a streaming pipeline for features that change once a week, burning engineering hours and cloud spend on freshness that no model will ever notice.

This post breaks down when batch feature engineering is the right choice, when you genuinely need real-time, and how to evaluate the tradeoffs between them. We will also look at how RisingWave – a PostgreSQL-compatible streaming database – makes real-time feature engineering as straightforward as writing a batch SQL query.

How Batch Feature Engineering Works

Batch feature engineering is the traditional approach. A scheduled job – typically Spark, dbt, or a warehouse-native query – reads from a data warehouse or data lake, computes aggregated features, and writes the results to a feature store or serving table. This runs on a fixed cadence: hourly, daily, or weekly.

A Typical Batch Feature Query

Here is what a batch feature pipeline looks like in practice. This query computes per-user spending features from a transaction history table, meant to run as a scheduled dbt model or Airflow task:

-- Scheduled to run every 6 hours via Airflow
-- (the target table is truncated or swapped before each run,
-- so the insert fully replaces the previous feature snapshot)
INSERT INTO feature_store.user_spending_features
SELECT
    user_id,
    COUNT(*) AS txn_count_7d,
    SUM(amount) AS total_spend_7d,
    AVG(amount) AS avg_txn_amount_7d,
    MAX(amount) AS max_txn_amount_7d,
    COUNT(DISTINCT merchant_id) AS distinct_merchants_7d,
    CURRENT_TIMESTAMP AS computed_at
FROM warehouse.transactions
WHERE event_time > CURRENT_TIMESTAMP - INTERVAL '7 days'
GROUP BY user_id;

Every six hours, this job scans the entire seven-day window of transactions, recomputes every user's features from scratch, and overwrites the feature store. Between runs, the features are static.

Strengths of Batch

Batch feature engineering earned its dominance for good reasons:

  • Simplicity: SQL against a warehouse is the most widely understood data pattern in the industry. Every data engineer knows how to write, test, and debug it.
  • Mature tooling: Airflow, dbt, Great Expectations, and the entire modern data stack are built around batch. Monitoring, testing, and lineage tracking are well-solved.
  • Cost efficiency for stable features: If a feature like "total lifetime purchases" or "account age in days" only changes meaningfully over hours or days, recomputing it once per batch cycle wastes no resources on unnecessary updates.
  • Easy backfills: When you add a new feature or fix a bug, you can reprocess historical data in a single run against the warehouse.

Where Batch Breaks Down

The batch model has three structural limitations that surface as your ML system matures:

1. Freshness decay. A feature computed at 2:00 AM is 22 hours stale by midnight. For use cases where user behavior shifts within minutes – fraud patterns, session-based recommendations, real-time pricing – that staleness directly degrades model accuracy.

2. Training-serving skew. Training pipelines often compute features from raw event logs using one code path (a Spark job), while serving pipelines read pre-aggregated rows from the feature store (a different code path). Subtle differences in join logic, null handling, or window boundaries mean the model sees different feature distributions in production than during training.
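To make the skew concrete, here is a hedged sketch of two individually reasonable queries that silently disagree. Both are hypothetical, not from any specific pipeline: the training path uses an inclusive window boundary and treats null amounts as zero, while the serving path uses an exclusive boundary and lets SUM drop nulls.

```sql
-- Training path (hypothetical Spark SQL job):
-- inclusive 7-day boundary, nulls counted as zero
SELECT user_id, SUM(COALESCE(amount, 0)) AS total_spend_7d
FROM raw_events
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
GROUP BY user_id;

-- Serving path (hypothetical feature-store job):
-- exclusive boundary, null amounts silently ignored by SUM
SELECT user_id, SUM(amount) AS total_spend_7d
FROM warehouse.transactions
WHERE event_time > CURRENT_TIMESTAMP - INTERVAL '7 days'
GROUP BY user_id;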

3. Scaling wall. As your feature set grows, batch jobs take longer. A job that scanned 10 million rows in 2022 now scans 500 million. Adding compute makes the job finish sooner, but the cost curve steepens. And the freshness interval cannot shrink below the job duration – you cannot run a 45-minute job every 30 minutes.

How Real-Time Feature Engineering Works

Real-time feature engineering replaces the scheduled job with a continuous process. Instead of scanning a warehouse on a timer, the system reads from an event stream (Kafka, Kinesis, Pulsar) and updates feature values incrementally as each event arrives. The feature store always reflects the most recent data, not the last batch run.

A Streaming Feature Query in RisingWave

Here is the same set of user spending features, but defined as a materialized view in RisingWave:

-- Continuously updated, no scheduler needed
CREATE MATERIALIZED VIEW user_spending_features AS
SELECT
    user_id,
    COUNT(*) AS txn_count_7d,
    SUM(amount) AS total_spend_7d,
    AVG(amount) AS avg_txn_amount_7d,
    MAX(amount) AS max_txn_amount_7d,
    COUNT(DISTINCT merchant_id) AS distinct_merchants_7d
FROM transactions_stream
WHERE event_time > NOW() - INTERVAL '7 days'
GROUP BY user_id;

The SQL is nearly identical to the batch version. The difference is in how it executes. RisingWave reads from the transactions_stream source (connected to Kafka), and incrementally maintains the materialized view. When a new transaction arrives, only that user's feature row is recomputed – not the entire table. The result is queryable via any PostgreSQL client with sub-second freshness.

No Airflow DAG. No cron schedule. No full-table scan every six hours.
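The transactions_stream source referenced in the view is declared once, up front. A minimal sketch, assuming a JSON-encoded Kafka topic named transactions and a broker at kafka:9092 (topic name, broker address, and column list are all placeholders):

```sql
-- One-time source definition connecting RisingWave to Kafka
CREATE SOURCE transactions_stream (
    user_id VARCHAR,
    merchant_id VARCHAR,
    amount DECIMAL,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'transactions',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;
```

After this one statement, any materialized view defined over transactions_stream updates continuously as events arrive on the topic.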

Strengths of Real-Time

  • Sub-second freshness: Features reflect the latest event, not the last batch window. This is the difference between catching fraud in progress and reviewing it the next morning.
  • No training-serving skew: The same SQL definition produces both training data (via a historical query) and serving data (via the live materialized view). One definition, two uses, zero drift.
  • No orchestration overhead: Materialized views are always up to date. You do not manage schedulers, retries, SLAs, or backfill jobs for the streaming path.
  • Efficient incremental computation: Instead of scanning the full window on every run, a streaming database processes only the delta – the new events since the last update. This is far cheaper per event than repeated full scans.

Challenges of Real-Time

Real-time feature engineering is not without cost:

  • Infrastructure complexity: You need a message broker (Kafka), a stream processor (RisingWave, Flink), and potentially a separate serving layer. This is more moving parts than a warehouse-only setup.
  • Debugging is harder: When a batch job produces wrong results, you re-run it with logging. When a streaming pipeline produces wrong results, the bad events have already flowed downstream. You need observability, dead-letter queues, and replay capabilities.
  • State management: Aggregations over time windows require the system to maintain state – the running count, sum, and distinct set for every user. This state must be durable, consistent, and recoverable after failures. Mature streaming databases like RisingWave handle this internally, but it is still a consideration when sizing your deployment.
  • Late-arriving data: Events do not always arrive in order. A transaction from 10 seconds ago might land in Kafka after a transaction from 5 seconds ago. Your streaming engine must handle out-of-order events correctly, or your time-windowed features will be wrong.
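Streaming engines typically bound out-of-order tolerance with a watermark. In RisingWave a watermark is declared on the source itself; here is a sketch of a source definition that tolerates events arriving up to five seconds late (the five-second interval is an assumption you would tune per workload):

```sql
-- Events more than 5 seconds behind the watermark are treated as late
CREATE SOURCE transactions_stream (
    user_id VARCHAR,
    amount DECIMAL,
    event_time TIMESTAMPTZ,
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    connector = 'kafka',
    topic = 'transactions',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;
```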

Comparison Table

| Dimension | Batch | Real-Time |
| --- | --- | --- |
| Feature freshness | Minutes to hours (depends on schedule) | Sub-second to seconds |
| Compute model | Full scan on each run | Incremental per event |
| Infrastructure | Warehouse + scheduler (Airflow/dbt) | Stream broker + stream processor |
| Operational complexity | DAG management, backfills, SLA monitoring | State management, late data handling |
| Cost at scale | Grows with data volume per run | Grows with event throughput |
| Training-serving skew | High risk (separate code paths) | Low risk (single SQL definition) |
| Backfill ease | Simple – reprocess from warehouse | Requires replay from stream or warehouse |
| Best for | Slowly changing features, historical aggregates | Session features, velocity signals, real-time scoring |

Decision Framework: When to Use Which

The choice between batch and real-time is not about which is better. It is about which matches the temporal dynamics of your specific features.

Use Batch When:

  • The feature changes slowly. Lifetime value, account age, historical purchase count, demographic attributes. These features shift on the scale of days or weeks. A nightly batch is more than fresh enough.
  • Your model is not latency-sensitive. A weekly churn prediction model, a monthly segmentation refresh, or an offline recommendation training pipeline all work perfectly with batch features.
  • You are early in your ML journey. If you are building your first feature store, start with batch. Get the feature definitions right, prove the model works, then migrate the latency-sensitive features to streaming.

Use Real-Time When:

  • Freshness directly affects model accuracy. Fraud detection, dynamic pricing, session-based recommendations, real-time bidding. In these domains, a feature that is even five minutes stale can cause measurable degradation in precision or revenue.
  • You need velocity or recency signals. "Number of login attempts in the last 60 seconds," "time since last purchase," "items added to cart in this session" – these features are meaningless if computed from a six-hour-old snapshot.
  • Training-serving skew is costing you. If your team spends significant effort debugging discrepancies between training features and serving features, unifying both behind a single streaming SQL definition eliminates the problem at its source.
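Velocity features like these map directly onto small windowed aggregations. A hedged sketch of the login-attempt counter as a RisingWave materialized view, assuming a login_events_stream source with user_id and event_time columns (both names are placeholders):

```sql
-- Rolling 60-second login counter, maintained incrementally
CREATE MATERIALIZED VIEW login_velocity AS
SELECT
    user_id,
    COUNT(*) AS login_attempts_60s
FROM login_events_stream
WHERE event_time > NOW() - INTERVAL '60 seconds'
GROUP BY user_id;
```

The NOW()-based WHERE clause acts as a temporal filter: rows age out of the window automatically, so the count never reflects a stale snapshot.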

The Middle Ground: Near-Real-Time Micro-Batching

Some teams run batch jobs on very short intervals – every 5 or 15 minutes – as a compromise. This works for moderate freshness requirements but inherits all the operational complexity of batch (schedulers, full scans, failure handling) while delivering none of the efficiency benefits of true incremental processing. As your data volume grows, micro-batch intervals hit a floor: you cannot shrink the interval below the job's runtime.

The Hybrid Approach: Batch + Real-Time Together

In practice, most production ML systems end up with both. The question is how to combine them cleanly.

A common architecture uses batch features for stable, slowly changing dimensions and real-time features for fast-moving signals. Both feed into the same feature store, and the model consumes them together at inference time.

With RisingWave, the hybrid approach becomes especially clean because both batch-style and streaming features use the same SQL dialect. You can define a materialized view for real-time features and run a scheduled query for batch features, all against the same system.

Example: Hybrid Feature Set for Fraud Detection

-- Real-time features (materialized view, always fresh)
CREATE MATERIALIZED VIEW fraud_realtime_features AS
SELECT
    user_id,
    COUNT(*) FILTER (
        WHERE event_time > NOW() - INTERVAL '10 minutes'
    ) AS txn_count_10min,
    COUNT(DISTINCT transaction_country) FILTER (
        WHERE event_time > NOW() - INTERVAL '1 hour'
    ) AS distinct_countries_1h,
    MAX(amount) FILTER (
        WHERE event_time > NOW() - INTERVAL '1 hour'
    ) AS max_amount_1h
FROM transactions_stream
GROUP BY user_id;

-- Batch features (computed daily, stored in warehouse)
-- These change slowly and do not benefit from sub-second freshness
-- total_lifetime_transactions, account_age_days,
-- avg_monthly_spend, preferred_merchant_category

At inference time, the scoring service queries both:

SELECT
    r.txn_count_10min,
    r.distinct_countries_1h,
    r.max_amount_1h,
    b.total_lifetime_transactions,
    b.account_age_days,
    b.avg_monthly_spend
FROM fraud_realtime_features r
JOIN batch_user_features b ON r.user_id = b.user_id
WHERE r.user_id = '12345';

This join combines sub-second velocity signals with stable historical context. The real-time features catch the burst of suspicious activity happening right now. The batch features provide the baseline that helps distinguish a high-spending frequent traveler from a compromised account.

Why RisingWave for Real-Time Feature Engineering

Several tools can do real-time feature engineering. Apache Flink, Kafka Streams, and Spark Structured Streaming all support it. But each requires you to learn a new programming model, manage clusters, and write features in Java, Scala, or a framework-specific DSL.

RisingWave takes a different approach: it is a PostgreSQL-compatible streaming database. You define features as SQL materialized views, and the engine handles incremental computation, state management, fault tolerance, and exactly-once processing internally. This means:

  • Any SQL user can build streaming features. Data analysts and ML engineers who already know SQL can write real-time feature pipelines without learning Flink's DataStream API or Kafka Streams' topology builder.
  • Your existing tools still work. psql, DBeaver, any PostgreSQL driver, any BI tool – they all connect to RisingWave and query materialized views like regular tables.
  • You can start simple and scale. A single materialized view in RisingWave replaces an Airflow DAG + Spark job + feature store writer. You can go from prototype to production with the same SQL definition.

If you want to see this in action with a complete end-to-end pipeline, check out our guide on building real-time feature pipelines with streaming SQL.

Frequently Asked Questions

Can I migrate from batch to real-time incrementally?

Yes. Start by identifying the features where staleness has the highest business cost – usually velocity features, session features, and anything used in fraud or pricing models. Migrate those to materialized views in RisingWave while keeping the rest on your existing batch pipeline. You do not have to go all-in on streaming from day one.

Does real-time feature engineering replace the feature store?

Not necessarily. A feature store provides versioning, point-in-time lookups for training, and a central registry. RisingWave can act as the computation layer that feeds fresh features into your feature store (via a sink to your store's backing database or API), or you can query RisingWave's materialized views directly as a lightweight serving layer.
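Feeding an existing feature store from RisingWave amounts to a sink definition. A sketch, assuming the store is backed by a PostgreSQL table keyed on user_id (the JDBC URL, credentials, and table name are all placeholders):

```sql
-- Push fresh feature rows into the feature store's backing table
CREATE SINK user_features_sink FROM user_spending_features
WITH (
    connector = 'jdbc',
    jdbc.url = 'jdbc:postgresql://feature-store:5432/features?user=rw&password=secret',
    table.name = 'user_spending_features',
    type = 'upsert',
    primary_key = 'user_id'
);
```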

How do I handle backfills for streaming features?

RisingWave supports creating sources from both streaming systems (Kafka) and batch-oriented stores (S3, PostgreSQL via CDC). For backfills, you can replay historical data through Kafka or connect directly to a historical data source to bootstrap your materialized views with past data before switching to live stream consumption.
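For warehouse-side backfills, a CDC-backed table lets RisingWave ingest the historical rows first and then stay in sync with ongoing changes. A sketch using the postgres-cdc connector (all connection values and column names are placeholders):

```sql
-- Historical transactions table, bootstrapped via CDC snapshot
CREATE TABLE transactions_history (
    txn_id VARCHAR PRIMARY KEY,
    user_id VARCHAR,
    amount DECIMAL,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'postgres-cdc',
    hostname = 'warehouse-host',
    port = '5432',
    username = 'rw_user',
    password = 'secret',
    database.name = 'analytics',
    schema.name = 'public',
    table.name = 'transactions'
);
```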

What about exactly-once processing?

RisingWave provides exactly-once semantics through internal checkpointing and state management. Each event is processed exactly once, even through failures and restarts. This is critical for feature correctness – you do not want a transaction counted twice in a velocity feature because a node restarted.

Is real-time feature engineering more expensive than batch?

It depends on the workload. Real-time is more efficient per event because it processes incrementally rather than scanning the full window every run. But it requires always-on compute, whereas batch jobs only consume resources during their scheduled window. For high-throughput, freshness-sensitive use cases, streaming is often cheaper per feature-update than repeated full scans. For low-volume, infrequently changing features, batch remains the cost-effective choice.

Conclusion

Batch and real-time feature engineering are not competing approaches – they are complementary tools for different temporal requirements. Batch excels at stable, slowly evolving features where simplicity and cost efficiency matter most. Real-time excels at fast-moving signals where freshness directly affects model accuracy and business outcomes.

The key is matching the approach to the feature. Audit your feature set: which features lose value when they are stale? Those are your candidates for real-time. Which features barely change between batch runs? Keep those on the schedule.

If the infrastructure complexity of real-time has held you back, RisingWave removes that barrier. Standard SQL, PostgreSQL compatibility, and incremental materialized views mean you can build real-time feature pipelines with the same skills and tools you already use for batch.

Ready to try it? Start with the RisingWave quickstart guide and have your first streaming materialized view running in minutes. Questions or want to discuss your architecture? Join the RisingWave Slack community – the engineering team is active and happy to help.
