Introduction
Fraud detection models are only as good as the features they consume. A model trained on real-time signals but served hour-old aggregations will miss the card that just fired six transactions in ninety seconds. Real-time feature engineering closes that gap by computing fraud signals – transaction velocity, spending aggregations, device fingerprints, geolocation anomalies – continuously, as events arrive.
The challenge has always been infrastructure. Building real-time feature pipelines traditionally meant writing stateful applications in Java or Scala on top of Flink or Kafka Streams, then maintaining a separate serving layer. Streaming SQL changes that equation. With RisingWave, you define fraud features as materialized views in standard SQL. The engine handles incremental computation, state management, and low-latency serving – no custom code, no orchestration.
This post walks through concrete SQL examples for the most common fraud feature categories and shows how to compose them into a single feature vector your model can query at inference time.
Why Batch Features Miss Fraud
Most production fraud systems still compute features on a schedule. A Spark job runs every hour, aggregates transaction counts and amounts per user, and writes the results to a feature store. The model reads from that store at scoring time.
This architecture has three problems that compound in fraud use cases:
The freshness gap is the attack window. Fraudsters operate in bursts. A compromised card is typically drained within minutes, not hours. If your velocity features update hourly, the model sees zero transactions for a card that has already been used twelve times. The fraud happens entirely inside the blind spot between batch runs.
Training-serving skew distorts risk scores. Training pipelines often compute features from raw event logs with precise timestamps. Serving pipelines read pre-aggregated values from a feature store snapshot. Subtle differences in window boundaries, null handling, or join semantics between these two paths cause the model to see different distributions in production than during training. For fraud models, where the decision threshold is tuned to basis-point precision, even small skew increases false positives or lets fraud through.
Late-arriving events break batch windows. Payment events can arrive out of order due to network retries, processor delays, or timezone conversions. A batch pipeline that cuts off at the hour boundary will miss events that logically belong to the previous window but arrived a few seconds late. Streaming engines handle late arrivals natively through watermarks and event-time processing.
A streaming approach eliminates all three problems. Features update within milliseconds of each source event. The same SQL defines both training and serving features. And the engine's built-in watermark handling accounts for late data automatically.
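To see the blind spot in numbers, here is a small Python sketch (illustrative only – the timestamps and the six-transaction burst are invented to match the scenario above) comparing what an hourly snapshot and a sliding five-minute window each report:

```python
from datetime import datetime, timedelta

# A compromised card fires six transactions in ninety seconds.
burst = [datetime(2025, 6, 10, 10, 0, 0) + timedelta(seconds=15 * i)
         for i in range(6)]

# The hourly batch job last ran at 10:00:00, so its snapshot predates the burst.
snapshot_time = datetime(2025, 6, 10, 10, 0, 0)
batch_count = sum(1 for t in burst if t < snapshot_time)

# A streaming engine evaluates the trailing window at scoring time instead.
scoring_time = datetime(2025, 6, 10, 10, 2, 0)
streaming_count = sum(1 for t in burst if t > scoring_time - timedelta(minutes=5))

print(batch_count)      # 0 -- the batch feature store still sees nothing
print(streaming_count)  # 6 -- the streaming view already sees the burst
```

The entire attack fits inside the gap between two batch runs; only the trailing window catches it.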
Setting Up the Transaction Stream
All SQL in this post is compatible with RisingWave v2.2+. You can follow along on RisingWave Cloud or a local instance.
Create the Transaction Source
Connect RisingWave to a Kafka topic carrying raw transaction events:
CREATE SOURCE raw_transactions (
transaction_id VARCHAR,
user_id VARCHAR,
amount DECIMAL,
currency VARCHAR,
merchant_id VARCHAR,
merchant_category VARCHAR,
card_id VARCHAR,
device_id VARCHAR,
ip_address VARCHAR,
latitude DOUBLE PRECISION,
longitude DOUBLE PRECISION,
country_code VARCHAR,
event_time TIMESTAMP
)
WITH (
connector = 'kafka',
topic = 'payments.transactions',
properties.bootstrap.server = 'broker:9092',
scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
For local testing without Kafka, use a table instead:
CREATE TABLE raw_transactions (
transaction_id VARCHAR PRIMARY KEY,
user_id VARCHAR NOT NULL,
amount DECIMAL NOT NULL,
currency VARCHAR,
merchant_id VARCHAR,
merchant_category VARCHAR,
card_id VARCHAR,
device_id VARCHAR,
ip_address VARCHAR,
latitude DOUBLE PRECISION,
longitude DOUBLE PRECISION,
country_code VARCHAR,
event_time TIMESTAMP NOT NULL
);
Seed Test Data
Insert sample data that includes both normal behavior and a fraud pattern:
INSERT INTO raw_transactions VALUES
-- Normal user: steady, single-device, single-country
('txn_001', 'user_A', 14.50, 'USD', 'merch_coffee', 'food',
'card_A1', 'dev_A1', '10.0.0.1', 40.7128, -74.0060, 'US',
'2025-06-10 09:00:00'),
('txn_002', 'user_A', 62.00, 'USD', 'merch_grocery', 'grocery',
'card_A1', 'dev_A1', '10.0.0.1', 40.7128, -74.0060, 'US',
'2025-06-10 09:25:00'),
('txn_003', 'user_A', 9.99, 'USD', 'merch_news', 'digital',
'card_A1', 'dev_A1', '10.0.0.1', 40.7128, -74.0060, 'US',
'2025-06-10 10:10:00'),
-- Fraud pattern: rapid burst, multiple devices, country-hopping
('txn_101', 'user_B', 1.00, 'USD', 'merch_test', 'other',
'card_B1', 'dev_B1', '185.22.0.9', 51.5074, -0.1278, 'GB',
'2025-06-10 10:00:00'),
('txn_102', 'user_B', 4200.00, 'USD', 'merch_electronics', 'electronics',
'card_B1', 'dev_B2', '91.45.3.7', 48.8566, 2.3522, 'FR',
'2025-06-10 10:00:45'),
('txn_103', 'user_B', 3800.00, 'USD', 'merch_jewelry', 'luxury',
'card_B1', 'dev_B3', '102.22.44.5', 6.5244, 3.3792, 'NG',
'2025-06-10 10:01:20'),
('txn_104', 'user_B', 5500.00, 'USD', 'merch_wire', 'transfer',
'card_B1', 'dev_B2', '91.45.3.7', 48.8566, 2.3522, 'FR',
'2025-06-10 10:02:00'),
('txn_105', 'user_B', 2900.00, 'USD', 'merch_gift', 'retail',
'card_B1', 'dev_B4', '177.54.12.8', 55.7558, 37.6173, 'RU',
'2025-06-10 10:02:30');
user_B shows a textbook fraud signature: a $1 test charge followed by high-value purchases across four countries on four devices in under three minutes. user_A shows normal behavior across a single device in one location.
Feature Category 1: Transaction Velocity
Velocity features measure how fast a user or card is transacting. Fraudsters operate in bursts – the faster they drain a compromised card, the less likely the legitimate cardholder notices in time.
CREATE MATERIALIZED VIEW mv_velocity_features AS
SELECT
user_id,
card_id,
COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '5 minutes')
AS txn_count_5min,
COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '30 minutes')
AS txn_count_30min,
COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS txn_count_1h,
COUNT(*) FILTER (WHERE event_time > NOW() - INTERVAL '24 hours')
AS txn_count_24h,
MAX(event_time) AS last_txn_time,
-- Recency of activity: time since this card's most recent transaction
NOW() - MAX(event_time) AS time_since_last_txn
FROM raw_transactions
GROUP BY user_id, card_id;
Query the result:
SELECT user_id, card_id, txn_count_5min, txn_count_30min, txn_count_1h
FROM mv_velocity_features;
Expected output (illustrative – because the view compares event_time with NOW(), shift the seed timestamps relative to your current clock to reproduce these values):
user_id | card_id | txn_count_5min | txn_count_30min | txn_count_1h
---------+---------+----------------+-----------------+--------------
user_A | card_A1 | 0 | 1 | 2
user_B | card_B1 | 5 | 5 | 5
user_B has five transactions in five minutes. That single feature – txn_count_5min = 5 – is already a strong fraud signal. Legitimate cardholders rarely exceed two transactions in a five-minute window.
The FILTER (WHERE ...) clause is key here. It lets you compute multiple time windows in a single materialized view without needing separate views or self-joins. RisingWave evaluates these filters incrementally as new events arrive. For more on how materialized views work under the hood, see the documentation.
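To sanity-check the window logic offline, you can emulate the multi-window counts in plain Python. This is a sketch over hardcoded event times, not how RisingWave actually executes the view:

```python
from datetime import datetime, timedelta

def window_counts(event_times, now, windows):
    """Count events in each trailing window, mirroring COUNT(*) FILTER (...)."""
    return {label: sum(1 for t in event_times if t > now - delta)
            for label, delta in windows.items()}

# user_B's burst from the seed data
events = [datetime(2025, 6, 10, 10, 0, 0), datetime(2025, 6, 10, 10, 0, 45),
          datetime(2025, 6, 10, 10, 1, 20), datetime(2025, 6, 10, 10, 2, 0),
          datetime(2025, 6, 10, 10, 2, 30)]

counts = window_counts(events, datetime(2025, 6, 10, 10, 3, 0),
                       {"5min": timedelta(minutes=5), "1h": timedelta(hours=1)})
print(counts)  # {'5min': 5, '1h': 5}
```

All five events land inside both windows, matching the user_B row above.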
Feature Category 2: Amount Aggregations
Amount features capture spending patterns and detect deviations from a user's normal behavior. A $5,000 transaction is not inherently suspicious – unless the user's average spend is $35.
CREATE MATERIALIZED VIEW mv_amount_features AS
SELECT
user_id,
card_id,
SUM(amount) FILTER (WHERE event_time > NOW() - INTERVAL '5 minutes')
AS total_amount_5min,
SUM(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS total_amount_1h,
AVG(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS avg_amount_1h,
MAX(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS max_amount_1h,
MIN(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS min_amount_1h,
-- Standard deviation detects unusual spread in transaction sizes
STDDEV_SAMP(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS stddev_amount_1h,
-- Ratio of max to average: spikes indicate anomalies
CASE
WHEN AVG(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour') > 0
THEN MAX(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
/ AVG(amount) FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
ELSE 0
END AS max_to_avg_ratio_1h
FROM raw_transactions
GROUP BY user_id, card_id;
Query:
SELECT user_id, total_amount_5min, avg_amount_1h,
ROUND(stddev_amount_1h, 2) AS stddev, ROUND(max_to_avg_ratio_1h, 2) AS spike_ratio
FROM mv_amount_features;
Expected output:
user_id | total_amount_5min | avg_amount_1h | stddev | spike_ratio
---------+-------------------+---------------+---------+-------------
user_A | 0.00 | 28.83 | 28.81 | 2.15
user_B | 16401.00 | 3280.20 | 2058.00 | 1.68
user_B spent $16,401 in five minutes. Combined with the high standard deviation (indicating a mix of the $1 test charge and large purchases), these features give a model strong signal. The max_to_avg_ratio is useful for catching single large outlier transactions against a background of small ones.
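You can reproduce the arithmetic behind these features in a few lines of Python, using user_B's seeded amounts (`statistics.stdev` computes the sample standard deviation, matching SQL's STDDEV_SAMP):

```python
import statistics

# user_B's amounts inside the one-hour window, from the seed data
amounts = [1.00, 4200.00, 3800.00, 5500.00, 2900.00]

avg = sum(amounts) / len(amounts)
stddev = statistics.stdev(amounts)  # sample stddev, like STDDEV_SAMP
spike_ratio = max(amounts) / avg if avg > 0 else 0

print(round(avg, 2))          # 3280.2
print(round(spike_ratio, 2))  # 1.68
```

The $1 test charge drags the minimum far below the mean, which is exactly what inflates the standard deviation.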
Feature Category 3: Device Fingerprinting
Device features track how many distinct devices and IP addresses a user employs. A legitimate user typically transacts from one or two devices. A compromised card used by a fraud ring will show transactions from many devices in a short period.
CREATE MATERIALIZED VIEW mv_device_features AS
SELECT
user_id,
card_id,
COUNT(DISTINCT device_id)
FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS distinct_devices_1h,
COUNT(DISTINCT device_id)
FILTER (WHERE event_time > NOW() - INTERVAL '24 hours')
AS distinct_devices_24h,
COUNT(DISTINCT ip_address)
FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS distinct_ips_1h,
-- Flag: more than 2 devices in an hour is unusual
CASE
WHEN COUNT(DISTINCT device_id)
FILTER (WHERE event_time > NOW() - INTERVAL '1 hour') > 2
THEN TRUE ELSE FALSE
END AS multi_device_flag
FROM raw_transactions
GROUP BY user_id, card_id;
Expected output:
user_id | distinct_devices_1h | distinct_devices_24h | distinct_ips_1h | multi_device_flag
---------+---------------------+----------------------+-----------------+-------------------
user_A | 1 | 1 | 1 | f
user_B | 4 | 4 | 4 | t
Four distinct devices in one hour is a clear anomaly. Even users who own multiple devices rarely switch between four of them within sixty minutes for payment transactions.
Feature Category 4: Geolocation Anomalies
Geographic features detect impossible travel – transactions from countries that are physically too far apart given the elapsed time. This is one of the strongest fraud indicators for card-not-present fraud.
First, create a view that pairs each transaction with the user's previous transaction:
CREATE MATERIALIZED VIEW mv_geo_pairs AS
SELECT
transaction_id,
user_id,
card_id,
country_code,
latitude,
longitude,
event_time,
LAG(country_code) OVER (
PARTITION BY user_id, card_id ORDER BY event_time
) AS prev_country,
LAG(latitude) OVER (
PARTITION BY user_id, card_id ORDER BY event_time
) AS prev_lat,
LAG(longitude) OVER (
PARTITION BY user_id, card_id ORDER BY event_time
) AS prev_lon,
LAG(event_time) OVER (
PARTITION BY user_id, card_id ORDER BY event_time
) AS prev_event_time
FROM raw_transactions;
Then, build a feature view that computes distance and speed:
CREATE MATERIALIZED VIEW mv_geo_anomaly_features AS
SELECT
user_id,
card_id,
COUNT(*) FILTER (
WHERE prev_country IS NOT NULL
AND country_code <> prev_country
AND event_time - prev_event_time < INTERVAL '10 minutes'
) AS cross_country_txns_10min,
COUNT(DISTINCT country_code)
FILTER (WHERE event_time > NOW() - INTERVAL '1 hour')
AS distinct_countries_1h,
-- Flag impossible travel: different country within 5 minutes
BOOL_OR(
prev_country IS NOT NULL
AND country_code <> prev_country
AND event_time - prev_event_time < INTERVAL '5 minutes'
) AS impossible_travel_flag
FROM mv_geo_pairs
GROUP BY user_id, card_id;
Expected output:
user_id | cross_country_txns_10min | distinct_countries_1h | impossible_travel_flag
---------+--------------------------+-----------------------+------------------------
user_A | 0 | 1 | f
user_B | 4 | 4 | t
user_B has four cross-country transactions in under ten minutes and hits the impossible travel flag. No human can fly from London to Paris to Lagos to Moscow in 2.5 minutes.
For more on using window functions like LAG() in streaming contexts, see the streaming SQL window functions tutorial on the RisingWave blog.
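The impossible_travel_flag above uses a coarse country-change heuristic. Since the stream also carries latitude and longitude, you can score physical implausibility directly with the haversine great-circle distance. A Python sketch, separate from the RisingWave pipeline, checking the Paris-to-Lagos hop from the seed data:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Paris (txn_102) to Lagos (txn_103), 35 seconds apart in the seed data
dist_km = haversine_km(48.8566, 2.3522, 6.5244, 3.3792)
speed_kmh = dist_km / (35 / 3600)
print(round(dist_km))    # about 4,700 km
print(speed_kmh > 1000)  # True -- far beyond any commercial flight
```

A production version could threshold on implied speed (say, 900 km/h) rather than a fixed country-change rule, which avoids false positives near land borders.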
Putting It All Together: The Unified Feature Vector
Individual feature categories are useful, but your fraud model needs a single row per user at scoring time. Create a unified feature view that joins all feature materialized views:
CREATE MATERIALIZED VIEW mv_fraud_feature_vector AS
SELECT
v.user_id,
v.card_id,
-- Velocity
v.txn_count_5min,
v.txn_count_30min,
v.txn_count_1h,
v.txn_count_24h,
-- Amounts
a.total_amount_5min,
a.total_amount_1h,
a.avg_amount_1h,
a.max_amount_1h,
a.stddev_amount_1h,
a.max_to_avg_ratio_1h,
-- Device
d.distinct_devices_1h,
d.distinct_devices_24h,
d.distinct_ips_1h,
d.multi_device_flag,
-- Geolocation
g.cross_country_txns_10min,
g.distinct_countries_1h,
g.impossible_travel_flag
FROM mv_velocity_features v
JOIN mv_amount_features a ON v.user_id = a.user_id AND v.card_id = a.card_id
JOIN mv_device_features d ON v.user_id = d.user_id AND v.card_id = d.card_id
JOIN mv_geo_anomaly_features g ON v.user_id = g.user_id AND v.card_id = g.card_id;
Your inference service fetches the complete feature vector with a single query:
SELECT * FROM mv_fraud_feature_vector WHERE user_id = 'user_B' AND card_id = 'card_B1';
Because RisingWave is PostgreSQL-compatible, your model serving layer connects using any PostgreSQL driver – psycopg2 in Python, pgx in Go, JDBC in Java. The query reads pre-computed results from the materialized view, so it returns in single-digit milliseconds regardless of how many raw events the system has processed.
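A minimal Python sketch of that lookup, assuming psycopg2 and RisingWave's local defaults (port 4566, user root, database dev – adjust for your deployment). The helper itself works against any DB-API cursor:

```python
# In production the connection comes from any PostgreSQL driver, e.g.:
#   import psycopg2
#   conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")

FEATURE_QUERY = (
    "SELECT * FROM mv_fraud_feature_vector "
    "WHERE user_id = %s AND card_id = %s"
)

def fetch_feature_vector(cursor, user_id, card_id):
    """Run the lookup and return the row as a column-name -> value dict."""
    cursor.execute(FEATURE_QUERY, (user_id, card_id))
    row = cursor.fetchone()
    if row is None:
        return None  # no transactions seen yet for this (user, card) pair
    columns = [desc[0] for desc in cursor.description]
    return dict(zip(columns, row))

# Usage: features = fetch_feature_vector(conn.cursor(), "user_B", "card_B1")
```

Returning a dict keyed by column name keeps the scoring service insensitive to column order if you later add features to the view.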
Sinking Features to External Systems
If your inference service reads from Redis or a dedicated feature store rather than querying RisingWave directly, use a sink to push features downstream:
CREATE SINK fraud_features_to_kafka
FROM mv_fraud_feature_vector
WITH (
connector = 'kafka',
topic = 'fraud.features',
properties.bootstrap.server = 'broker:9092',
type = 'upsert',
primary_key = 'user_id, card_id'
)
FORMAT UPSERT ENCODE JSON;
Every time a feature value changes, RisingWave emits an updated record to the Kafka topic. A downstream consumer can write these to Redis, PostgreSQL, or any store your scoring service prefers.
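The consumer side depends on your Kafka client and target store, but the record-to-store mapping is simple. A sketch of the transform in Python (the `features:{user_id}:{card_id}` key format is an invented convention, and only a subset of feature fields is shown):

```python
import json

def to_redis_hash(record_json):
    """Map one upsert record from the fraud.features topic to a Redis hash.

    The 'features:{user_id}:{card_id}' key format is an invented convention;
    use whatever your scoring service expects.
    """
    record = json.loads(record_json)
    key = f"features:{record.pop('user_id')}:{record.pop('card_id')}"
    # Redis hashes store flat string fields, so stringify the remaining values.
    return key, {k: str(v) for k, v in record.items()}

key, fields = to_redis_hash(
    '{"user_id": "user_B", "card_id": "card_B1", "txn_count_5min": 5}'
)
print(key)     # features:user_B:card_B1
print(fields)  # {'txn_count_5min': '5'}
```

A real consumer loop would subscribe to fraud.features with a client such as confluent-kafka and write each transformed record to Redis with hset(key, mapping=fields).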
FAQ
How does this differ from computing features in Flink or Kafka Streams?
The features themselves are the same – velocity counts, amount aggregations, device cardinality. The difference is how you define and maintain them. In Flink, each feature is a Java or Scala application with custom state management and deployment. In RisingWave, each feature is a SQL materialized view with the same incremental computation semantics but significantly less operational overhead. If your team already knows SQL, the ramp-up time drops from weeks to hours.
Can RisingWave handle late-arriving events?
Yes. RisingWave supports event-time processing with watermarks, so late-arriving transactions are incorporated into the correct time windows. Features that use NOW()-based filters evaluate against processing time, which is appropriate for serving. For training pipelines where you need point-in-time correctness, use event-time windows with the TUMBLE or HOP window functions.
What throughput can I expect?
RisingWave scales horizontally. A single node handles tens of thousands of events per second for typical fraud feature workloads. For high-volume payment processors, you can scale out compute nodes to handle millions of events per second. The RisingWave GitHub repository includes benchmarks and deployment guides for various scale profiles.
How do I backfill historical features for model training?
Load historical transaction data into a RisingWave table, and the materialized views will compute features over that data just as they would over live streams. You can then export the resulting feature table to your training environment. This ensures your training features are computed with the exact same SQL as your serving features – eliminating training-serving skew by definition.
Conclusion
Fraud detection is a latency-sensitive problem where the gap between event and feature computation is the gap between catching fraud and missing it. Batch feature pipelines were a reasonable trade-off when real-time computation required specialized distributed systems expertise. Streaming SQL removes that trade-off.
With RisingWave, you define fraud features – velocity, amounts, device fingerprints, geolocation anomalies – as SQL materialized views. The engine incrementally maintains them as each transaction arrives. Your model gets fresh features in milliseconds, not hours. And because the same SQL defines both training and serving features, you eliminate the skew that quietly degrades model performance over time.
The SQL examples in this post are production-ready starting points. Adapt the thresholds and time windows to your transaction patterns, plug the unified feature view into your scoring service, and you have a real-time fraud feature pipeline with no custom code to maintain.
Try RisingWave Cloud free – no credit card required. Sign up →
Join the RisingWave Community on Slack to ask questions and share what you build.

