Window Functions for ML Feature Engineering

Your fraud detection model needs the number of transactions a user made in the last 10 minutes. Your recommendation engine needs average session duration per user, updated continuously. Your dynamic pricing model needs the rolling 95th percentile of order values per product category.

In a batch pipeline, you would compute these features with Spark window functions on a schedule -- every hour, every six hours, overnight. The model trains on fresh features but serves on stale ones. By the time the batch job finishes, the world has moved on.

Streaming window functions eliminate this gap. They compute the same aggregations -- rolling sums, session-level statistics, time-bucketed counts -- continuously, on every incoming event. In a streaming database like RisingWave, you define these computations as SQL queries wrapped in materialized views. The features update within milliseconds of each new event, and your model reads them over a standard PostgreSQL connection.

This post walks through every streaming window type available for feature engineering: tumble windows, hop windows, session windows, and over (analytical) windows. Each section includes a concrete SQL example you can run against real-world data patterns.

Why Streaming Windows Matter for Feature Engineering

Most ML features are aggregations over time. "Count of logins in the last hour," "average transaction amount in the last 7 days," "number of distinct devices seen in the last 30 minutes" -- these are all window computations. The only question is whether you compute them in batch or in real time.

Batch window functions recompute the entire dataset on each run. If you have 100 million transactions and want the rolling 1-hour count per user, Spark scans all recent rows, groups them, and writes the result to a feature store. This works, but it introduces latency and operational overhead.

Streaming window functions take a fundamentally different approach. Instead of reprocessing data, RisingWave maintains window state incrementally. When a new transaction arrives, only the affected windows are updated. This means:

  • Sub-second freshness. Features reflect the most recent event, not the most recent batch run.
  • No scheduling. No Airflow DAGs, no cron jobs. The materialized view is always current.
  • Lower compute cost. Incremental updates are far cheaper than full recomputation for large datasets.
  • No training-serving skew. The same SQL query defines both the training feature (via historical backfill) and the serving feature (via the live materialized view).

Tumble Windows: Fixed-Interval Feature Buckets

A tumble window divides a stream into fixed-size, non-overlapping intervals. Every event belongs to exactly one window. This is the simplest window type and maps directly to "events per minute," "revenue per hour," or "errors per 5-second interval."

When to use tumble windows for features

  • Time-bucketed counts (transactions per minute per user)
  • Periodic aggregates that feed a model expecting fixed-cadence inputs
  • Binning event streams into uniform intervals for downstream joins

Example: transaction velocity features

A fraud model needs to know how many transactions each user makes per minute and the total amount spent in that minute. These velocity features are strong fraud signals.

CREATE SOURCE transactions (
    txn_id BIGINT,
    user_id VARCHAR,
    amount DECIMAL,
    merchant_category VARCHAR,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'transactions',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

CREATE MATERIALIZED VIEW user_txn_velocity_1min AS
SELECT
    user_id,
    window_start,
    window_end,
    COUNT(*) AS txn_count_1min,
    SUM(amount) AS total_amount_1min,
    AVG(amount) AS avg_amount_1min,
    MAX(amount) AS max_amount_1min,
    COUNT(DISTINCT merchant_category) AS distinct_merchants_1min
FROM TUMBLE(transactions, event_time, INTERVAL '1 MINUTE')
GROUP BY user_id, window_start, window_end;

This materialized view continuously emits one row per user per minute. Your fraud model can query it directly at inference time to get the latest velocity features for any user.
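For example, the serving layer can issue a point lookup against the view over the PostgreSQL interface. This is a minimal sketch of that pattern; the user id is a placeholder:

-- Fetch the most recent 1-minute velocity features for one user
-- ('user_42' is an illustrative id).
SELECT
    txn_count_1min,
    total_amount_1min,
    avg_amount_1min,
    max_amount_1min,
    distinct_merchants_1min
FROM user_txn_velocity_1min
WHERE user_id = 'user_42'
ORDER BY window_start DESC
LIMIT 1;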

The TUMBLE() function in RisingWave takes three arguments: the source table, the time column, and the window size. It automatically appends window_start and window_end columns that you can use in GROUP BY and downstream joins. See the time window functions documentation for the full syntax reference.

Hop Windows: Overlapping Rolling Aggregations

A hop window has a fixed size but advances by a smaller step, creating overlapping windows. A 10-minute window that hops every 1 minute produces a new result every minute, each covering the last 10 minutes of data. This is the streaming equivalent of a "rolling" or "sliding" aggregation.

When to use hop windows for features

  • Rolling averages, sums, or counts (e.g., "average latency over the last 5 minutes, updated every 30 seconds")
  • Smoothed metrics that reduce noise for model inputs
  • Features where you need overlapping coverage to avoid boundary effects

Example: rolling page view statistics for a recommendation model

A recommendation engine needs to know how active a user has been recently. A 10-minute rolling window updated every minute captures recent engagement without the sharp edges of a tumble window.

CREATE SOURCE page_views (
    user_id VARCHAR,
    page_url VARCHAR,
    referrer VARCHAR,
    device_type VARCHAR,
    viewed_at TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'page_views',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

CREATE MATERIALIZED VIEW user_engagement_rolling AS
SELECT
    user_id,
    window_start,
    window_end,
    COUNT(*) AS page_views_10min,
    COUNT(DISTINCT page_url) AS distinct_pages_10min,
    COUNT(DISTINCT device_type) AS distinct_devices_10min
FROM HOP(page_views, viewed_at, INTERVAL '1 MINUTE', INTERVAL '10 MINUTES')
GROUP BY user_id, window_start, window_end;

The HOP() function takes four arguments: the source, the time column, the hop size (how often a new window starts), and the window size (how far back each window looks). A single page view event will appear in up to 10 windows (10 minutes / 1 minute hop). Refer to the hop window documentation for more detail.

The trade-off is state size. Each event belongs to multiple windows, so hop windows consume more memory than tumble windows. If you need a 24-hour rolling window updated every minute, that is 1,440 overlapping windows per event -- consider whether a tumble window with a larger interval or an over window (covered below) might be more efficient.
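If the lookback genuinely needs to be long, one option is to accept a coarser update cadence. As a rough sketch (the view name and intervals are illustrative), a 24-hour window that hops every hour keeps each event in 24 windows instead of 1,440:

-- Sketch: 24-hour rolling count, refreshed hourly rather than every minute.
CREATE MATERIALIZED VIEW user_views_24h_hourly AS
SELECT
    user_id,
    window_start,
    window_end,
    COUNT(*) AS page_views_24h
FROM HOP(page_views, viewed_at, INTERVAL '1 HOUR', INTERVAL '24 HOURS')
GROUP BY user_id, window_start, window_end;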

Session Windows: Activity-Based Feature Groups

Session windows group events by periods of activity separated by idle gaps. Unlike tumble and hop windows, session windows have variable sizes -- a session lasts as long as events keep arriving within the gap threshold. When activity stops for longer than the gap, the session closes and a new one begins with the next event.

When to use session windows for features

  • Sessionization features (session duration, events per session, time between sessions)
  • User engagement modeling where activity bursts matter more than clock intervals
  • IoT device communication patterns where reading bursts indicate state changes

Example: e-commerce session features

A churn prediction model benefits from session-level features: how long users browse, how many products they view per session, and how sessions compare to each other.

CREATE MATERIALIZED VIEW user_session_features AS
SELECT
    user_id,
    window_start AS session_start,
    window_end AS session_end,
    window_end - window_start AS session_duration,
    COUNT(*) AS pages_per_session,
    COUNT(DISTINCT page_url) AS distinct_pages_per_session
FROM SESSION(page_views, viewed_at, INTERVAL '30 MINUTES')
GROUP BY user_id, window_start, window_end;

A 30-minute gap threshold means that if a user is inactive for more than 30 minutes, their next page view starts a new session. This is the same logic used by Google Analytics and most web analytics platforms.

Session windows are powerful for feature engineering because they capture natural user behavior boundaries. A user who views 50 pages in a 3-minute burst followed by 2 hours of inactivity is very different from a user who views 50 pages spread across 2 hours -- and session windows make that distinction visible to your model.
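Building on the view above, here is a hedged sketch of one more session-level feature -- time since the previous session, computed with LAG over the closed sessions. The view name is illustrative:

CREATE MATERIALIZED VIEW session_gap_features AS
SELECT
    user_id,
    session_start,
    session_duration,
    pages_per_session,
    -- Gap between this session's start and the previous session's end
    session_start - LAG(session_end) OVER (
        PARTITION BY user_id ORDER BY session_start
    ) AS time_since_prev_session
FROM user_session_features;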

Over Windows: Row-Level Running Aggregates

The window types above (tumble, hop, session) are time-based group windows that collapse multiple rows into one aggregated row per window. Over windows -- also called analytical or OVER clause windows -- are different. They compute a value for each row based on a sliding frame of surrounding rows, without collapsing anything.

This makes over windows ideal for features that need per-event context: "how does this transaction compare to the user's average?" or "what rank is this reading within the sensor's recent history?"

When to use over windows for features

  • Running statistics (cumulative sum, moving average per entity)
  • Relative features (current value vs. historical average, z-score)
  • Sequential features (previous event value, time since last event)
  • Ranking and deduplication within partitions

Example: per-transaction deviation features for fraud detection

A fraud model is more accurate when it knows not just the transaction amount, but how that amount compares to the user's recent spending pattern. Over windows compute this per-row context efficiently.

CREATE MATERIALIZED VIEW txn_deviation_features AS
SELECT
    txn_id,
    user_id,
    amount,
    event_time,
    AVG(amount) OVER w AS avg_amount_last_20,
    STDDEV_POP(amount) OVER w AS stddev_amount_last_20,
    (amount - AVG(amount) OVER w) / NULLIF(STDDEV_POP(amount) OVER w, 0)
        AS z_score,
    COUNT(*) OVER w AS txn_count_last_20,
    LAG(amount) OVER (
        PARTITION BY user_id ORDER BY event_time
    ) AS prev_txn_amount,
    event_time - LAG(event_time) OVER (
        PARTITION BY user_id ORDER BY event_time
    ) AS time_since_last_txn
FROM transactions
WINDOW w AS (
    PARTITION BY user_id
    ORDER BY event_time
    ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
);

This single materialized view computes six features for every transaction: the rolling average, standard deviation, z-score, count, previous amount, and inter-event time. Each feature updates incrementally as new transactions arrive.

The z-score feature is particularly useful -- a transaction with a z-score above 3 means the amount is more than 3 standard deviations above the user's recent average, which is a strong anomaly signal. See the window function reference for the full list of supported analytical functions.
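One way to consume this at serving time is to materialize only the outliers. A minimal sketch follows; the view name and the threshold of 3 are illustrative choices, not part of the pipeline above:

-- Sketch: keep only transactions far outside the user's recent pattern.
CREATE MATERIALIZED VIEW suspicious_txns AS
SELECT
    txn_id,
    user_id,
    amount,
    z_score,
    time_since_last_txn
FROM txn_deviation_features
WHERE z_score > 3;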

Combining Windows for Rich Feature Sets

Real-world ML models consume dozens or hundreds of features. The most effective feature engineering combines multiple window types to capture different temporal perspectives of the same entity.

Example: multi-resolution transaction features

Build a feature view that combines tumble-window velocity features, over-window deviation features, and cross-window comparisons:

-- Step 1: 5-minute tumble window for short-term velocity
CREATE MATERIALIZED VIEW txn_velocity_5min AS
SELECT
    user_id,
    window_start,
    window_end,
    COUNT(*) AS txn_count_5min,
    SUM(amount) AS total_amount_5min,
    MAX(amount) AS max_amount_5min
FROM TUMBLE(transactions, event_time, INTERVAL '5 MINUTES')
GROUP BY user_id, window_start, window_end;

-- Step 2: 1-hour tumble window for medium-term baseline
CREATE MATERIALIZED VIEW txn_velocity_1hr AS
SELECT
    user_id,
    window_start,
    window_end,
    COUNT(*) AS txn_count_1hr,
    SUM(amount) AS total_amount_1hr,
    AVG(amount) AS avg_amount_1hr
FROM TUMBLE(transactions, event_time, INTERVAL '1 HOUR')
GROUP BY user_id, window_start, window_end;

-- Step 3: Compare short-term velocity to long-term baseline
CREATE MATERIALIZED VIEW txn_feature_comparison AS
SELECT
    v5.user_id,
    v5.window_start AS feature_time,
    v5.txn_count_5min,
    v5.total_amount_5min,
    v1.txn_count_1hr,
    v1.avg_amount_1hr,
    v5.txn_count_5min::FLOAT / NULLIF(v1.txn_count_1hr, 0)
        AS velocity_ratio,
    v5.max_amount_5min / NULLIF(v1.avg_amount_1hr, 0)
        AS max_to_avg_ratio
FROM txn_velocity_5min v5
JOIN txn_velocity_1hr v1
    ON v5.user_id = v1.user_id
    AND v5.window_start >= v1.window_start
    AND v5.window_start < v1.window_end;

The velocity_ratio feature tells the model whether the user's recent 5-minute transaction rate is unusually high compared to their hourly baseline. If transactions were spread evenly across the hour, the 5-minute count would be about one-twelfth of the hourly count, so a ratio well above ~0.08 means the user is transacting faster than their hourly baseline -- a useful signal for both fraud detection and real-time personalization.

Performance Tips for Window-Based Feature Pipelines

When deploying streaming window functions for production ML features, keep these guidelines in mind:

  1. Right-size your windows. Larger windows hold more state in memory. A 7-day hop window with a 1-minute hop is expensive. Ask whether your model actually benefits from a 7-day lookback or whether 1 hour captures enough signal.

  2. Use watermarks for event-time processing. Define watermarks on your sources to handle late-arriving events and allow RisingWave to garbage-collect expired window state. Without watermarks, state grows unbounded. A watermarked source definition is sketched after this list.

  3. Prefer tumble over hop when overlap is unnecessary. Tumble windows are cheaper because each event belongs to exactly one window. Only use hop windows when your model genuinely needs overlapping coverage.

  4. Materialize intermediate results. If multiple feature views read from the same source with different window sizes, consider materializing a coarser intermediate view first, then building finer-grained features on top of it.

  5. Partition wisely. Over windows partitioned by high-cardinality keys (e.g., user_id with millions of users) distribute state well. Partitioning by low-cardinality keys (e.g., country) can create hotspots.
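To illustrate point 2, here is the transactions source from earlier, sketched as it would look with a watermark. The 5-second delay is an assumption about how late your events can arrive; tune it to your data and check the RisingWave watermark documentation for the exact syntax supported by your version:

CREATE SOURCE transactions (
    txn_id BIGINT,
    user_id VARCHAR,
    amount DECIMAL,
    merchant_category VARCHAR,
    event_time TIMESTAMPTZ,
    -- Events arriving more than 5 seconds behind the watermark may be
    -- dropped, which lets expired window state be cleaned up.
    WATERMARK FOR event_time AS event_time - INTERVAL '5 SECONDS'
) WITH (
    connector = 'kafka',
    topic = 'transactions',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;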

FAQ

How do streaming window functions compare to batch window functions?

Batch window functions in Spark or PostgreSQL reprocess the full dataset on every run. Streaming window functions in RisingWave maintain state incrementally -- only the affected windows update when a new event arrives. The SQL syntax is similar, but the execution model is fundamentally different. Streaming gives you sub-second freshness at lower compute cost, while batch gives you access to the full historical dataset in each computation.

Can I backfill historical features using the same window queries?

Yes. You can ingest historical data into RisingWave and let the materialized views process it as if it were a live stream. This produces the same features your model will see in production, eliminating training-serving skew. For large-scale historical backfills, you can also sink the materialized view results to an Apache Iceberg table for offline training.
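As a rough sketch of that offline path, a sink from one of the feature views to Iceberg might look like the following. The connector parameters shown are illustrative placeholders -- catalog, warehouse, and credential settings depend on your storage setup, so consult the Iceberg sink documentation for the full option list:

-- Sketch: write the deviation features to an Iceberg table for offline training.
CREATE SINK txn_features_offline
FROM txn_deviation_features
WITH (
    connector = 'iceberg',
    type = 'append-only',
    force_append_only = 'true',
    -- Placeholder catalog/warehouse settings; S3 credential options omitted.
    catalog.type = 'storage',
    warehouse.path = 's3://ml-feature-warehouse/wh',
    database.name = 'features',
    table.name = 'txn_deviation_features'
);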

What happens when window state grows too large?

RisingWave spills window state to disk when memory is insufficient, so you will not hit out-of-memory errors. However, large windows (long durations or high-cardinality partitions) do increase storage and slow down recovery after failures. Use watermarks to bound the maximum window lifetime, and monitor state size through RisingWave's built-in system tables.

How do I serve these features to my ML model at inference time?

RisingWave exposes materialized views through a PostgreSQL-compatible interface. Your inference service connects with any PostgreSQL client library (psycopg2, JDBC, etc.) and runs a SELECT query to fetch the latest features for a given entity. Latency is typically single-digit milliseconds because the features are pre-computed and stored, not calculated on the fly.

Conclusion

Streaming window functions give ML engineers the same expressive power as batch window functions -- rolling aggregations, sessionization, time-bucketed statistics, running comparisons -- but with sub-second freshness and zero scheduling overhead. Here are the key takeaways:

  • Tumble windows produce clean, non-overlapping time buckets for velocity and throughput features.
  • Hop windows provide smoothed, overlapping rolling aggregations that reduce boundary noise.
  • Session windows capture natural activity patterns for engagement and behavioral features.
  • Over windows compute per-event context features like z-scores, running averages, and inter-event intervals without collapsing rows.
  • Combining window types across multiple materialized views creates rich, multi-resolution feature sets that give models both short-term signals and long-term baselines.

Every SQL example in this post runs directly in RisingWave. Define your sources, create your materialized views, and your features stay fresh automatically.


Ready to build real-time features? Get started with RisingWave in 5 minutes with the Quickstart.

Have questions about window functions or feature engineering? Join our Slack community to connect with the engineering team and other stream processing developers.
