Introduction
Fraud detection models are only as good as the features they consume. A model trained on real-time signals but served hour-old aggregations will miss the card that just fired six transactions in ninety seconds. Real-time feature engineering closes that gap by computing fraud signals -- transaction velocity, spending aggregations, device fingerprints, geolocation anomalies -- continuously, as events arrive.
The challenge has always been infrastructure. Building real-time feature pipelines traditionally meant writing stateful applications in Java or Scala on top of Flink or Kafka Streams, then maintaining a separate serving layer. Streaming SQL changes that equation. With RisingWave, you define fraud features as materialized views in standard SQL. The engine handles incremental computation, state management, and low-latency serving -- no custom code, no orchestration.
This post walks through concrete SQL examples for the most common fraud feature categories and shows how to compose them into a single feature vector your model can query at inference time.
Why Batch Features Miss Fraud
Most production fraud systems still compute features on a schedule. A Spark job runs every hour, aggregates transaction counts and amounts per user, and writes the results to a feature store. The model reads from that store at scoring time.
This architecture has three problems that compound in fraud use cases:
The freshness gap is the attack window. Fraudsters operate in bursts. A compromised card is typically drained within minutes, not hours. If your velocity features update hourly, the model sees zero transactions for a card that has already been used twelve times. The fraud happens entirely inside the blind spot between batch runs.
Training-serving skew distorts risk scores. Training pipelines often compute features from raw event logs with precise timestamps. Serving pipelines read pre-aggregated values from a feature store snapshot. Subtle differences in window boundaries, null handling, or join semantics between these two paths cause the model to see different distributions in production than during training. For fraud models, where the decision threshold is tuned to basis-point precision, even small skew increases false positives or lets fraud through.
Late-arriving events break batch windows. Payment events can arrive out of order due to network retries, processor delays, or timezone conversions. A batch pipeline that cuts off at the hour boundary will miss events that logically belong to the previous window but arrived a few seconds late. Streaming engines handle late arrivals natively through watermarks and event-time processing.
A streaming approach eliminates all three problems. Features update within milliseconds of each source event. The same SQL defines both training and serving features. And the engine's built-in watermark handling accounts for late data automatically.
Setting Up the Transaction Stream
All SQL in this post is compatible with RisingWave v2.8+. You can follow along on RisingWave Cloud or a local instance.
Create the Transaction Source
Connect RisingWave to a Kafka topic carrying raw transaction events:
CREATE SOURCE raw_transactions (
transaction_id VARCHAR,
user_id VARCHAR,
amount DECIMAL,
merchant_category VARCHAR,
card_id VARCHAR,
device_id VARCHAR,
ip_address VARCHAR,
latitude DOUBLE PRECISION,
longitude DOUBLE PRECISION,
country_code VARCHAR,
event_time TIMESTAMPTZ
)
WITH (
connector = 'kafka',
topic = 'payments.transactions',
properties.bootstrap.server = 'broker:9092',
scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
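The source decodes each Kafka message as a flat JSON object whose keys match the column names above. If you want to generate test traffic, a hedged Python sketch could look like the following (the kafka-python package and the string encoding of DECIMAL are assumptions for illustration, not requirements of the connector):

```python
import json
from datetime import datetime, timezone

def make_txn_event(transaction_id, user_id, amount, merchant_category,
                   card_id, device_id, ip_address, latitude, longitude,
                   country_code, event_time=None):
    """One transaction event in the flat JSON shape the source columns expect."""
    ts = event_time or datetime.now(timezone.utc)
    return {
        "transaction_id": transaction_id,
        "user_id": user_id,
        "amount": str(amount),         # send DECIMAL as a string to avoid float drift
        "merchant_category": merchant_category,
        "card_id": card_id,
        "device_id": device_id,
        "ip_address": ip_address,
        "latitude": latitude,
        "longitude": longitude,
        "country_code": country_code,
        "event_time": ts.isoformat(),  # TIMESTAMPTZ parses ISO-8601
    }

def send_event(event, bootstrap="broker:9092", topic="payments.transactions"):
    """Publish one event; requires `pip install kafka-python` and a live broker."""
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(topic, event)
    producer.flush()
```

Calling send_event(make_txn_event(...)) publishes one record to the topic the source reads.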
For local testing without Kafka, use a table instead:
CREATE TABLE raw_transactions (
transaction_id VARCHAR PRIMARY KEY,
user_id VARCHAR NOT NULL,
amount DECIMAL NOT NULL,
merchant_category VARCHAR,
card_id VARCHAR,
device_id VARCHAR,
ip_address VARCHAR,
latitude DOUBLE PRECISION,
longitude DOUBLE PRECISION,
country_code VARCHAR,
event_time TIMESTAMPTZ NOT NULL
);
Seed Test Data
Insert sample data that includes both normal behavior and a fraud pattern:
INSERT INTO raw_transactions VALUES
-- Normal user: steady, single-device, single-country
('txn_001', 'user_A', 14.50, 'food',
'card_A1', 'dev_A1', '10.0.0.1', 40.7128, -74.0060, 'US',
NOW() - INTERVAL '2 hours'),
('txn_002', 'user_A', 62.00, 'grocery',
'card_A1', 'dev_A1', '10.0.0.1', 40.7128, -74.0060, 'US',
NOW() - INTERVAL '25 minutes'),
('txn_003', 'user_A', 9.99, 'digital',
'card_A1', 'dev_A1', '10.0.0.1', 40.7128, -74.0060, 'US',
NOW() - INTERVAL '3 minutes'),
-- Fraud pattern: rapid burst, multiple devices, country-hopping
('txn_101', 'user_B', 1.00, 'other',
'card_B1', 'dev_B1', '185.22.0.9', 51.5074, -0.1278, 'GB',
NOW() - INTERVAL '4 minutes'),
('txn_102', 'user_B', 4200.00, 'electronics',
'card_B1', 'dev_B2', '91.45.3.7', 48.8566, 2.3522, 'FR',
NOW() - INTERVAL '3 minutes 15 seconds'),
('txn_103', 'user_B', 3800.00, 'luxury',
'card_B1', 'dev_B3', '102.22.44.5', 6.5244, 3.3792, 'NG',
NOW() - INTERVAL '2 minutes 40 seconds'),
('txn_104', 'user_B', 5500.00, 'transfer',
'card_B1', 'dev_B2', '91.45.3.7', 48.8566, 2.3522, 'FR',
NOW() - INTERVAL '2 minutes'),
('txn_105', 'user_B', 2900.00, 'retail',
'card_B1', 'dev_B4', '177.54.12.8', 55.7558, 37.6173, 'RU',
NOW() - INTERVAL '1 minute 30 seconds');
user_B shows a textbook fraud signature: a $1 test charge followed by high-value purchases across four countries on four devices in under three minutes. user_A shows normal behavior across a single device in one location.
Feature Category 1: Transaction Velocity
Velocity features measure how fast a user or card is transacting. Fraudsters operate in bursts -- the faster they drain a compromised card, the less likely the legitimate cardholder notices in time.
In RisingWave, the NOW() function is only allowed in WHERE, HAVING, and ON clauses for streaming queries. This means you create separate materialized views for each time window, with NOW() in the WHERE clause acting as a temporal filter that automatically expires old rows:
CREATE MATERIALIZED VIEW mv_txn_velocity_5min AS
SELECT
user_id,
card_id,
COUNT(*) AS txn_count_5min,
SUM(amount) AS total_amount_5min
FROM raw_transactions
WHERE event_time > NOW() - INTERVAL '5 minutes'
GROUP BY user_id, card_id;
CREATE MATERIALIZED VIEW mv_txn_velocity_1h AS
SELECT
user_id,
card_id,
COUNT(*) AS txn_count_1h,
SUM(amount) AS total_amount_1h,
AVG(amount) AS avg_amount_1h,
MAX(amount) AS max_amount_1h,
MIN(amount) AS min_amount_1h,
STDDEV_SAMP(amount) AS stddev_amount_1h,
MAX(event_time) AS last_txn_time
FROM raw_transactions
WHERE event_time > NOW() - INTERVAL '1 hour'
GROUP BY user_id, card_id;
CREATE MATERIALIZED VIEW mv_txn_velocity_24h AS
SELECT
user_id,
card_id,
COUNT(*) AS txn_count_24h,
SUM(amount) AS total_amount_24h,
AVG(amount) AS avg_amount_24h
FROM raw_transactions
WHERE event_time > NOW() - INTERVAL '24 hours'
GROUP BY user_id, card_id;
Query the 5-minute velocity:
SELECT user_id, card_id, txn_count_5min, total_amount_5min
FROM mv_txn_velocity_5min;
Expected output:
user_id | card_id | txn_count_5min | total_amount_5min
---------+---------+----------------+-------------------
user_A | card_A1 | 1 | 9.99
user_B | card_B1 | 5 | 16401.00
user_B has five transactions totaling $16,401 in five minutes. That single feature -- txn_count_5min = 5 -- is already a strong fraud signal: legitimate cardholders rarely exceed two transactions in a five-minute window.
Each materialized view uses a temporal filter in the WHERE clause. RisingWave automatically adds new events as they arrive and removes events that fall outside the time window -- no manual cleanup or TTL configuration needed. For more on how materialized views work under the hood, see the documentation.
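To make the temporal-filter semantics concrete, here is a small Python sketch (illustration only, not RisingWave code) that replays the seed rows through the same trailing 5-minute filter the materialized view applies:

```python
from collections import defaultdict

# (user_id, card_id, amount, minutes_ago) mirroring the seed rows above
SEED = [
    ("user_A", "card_A1", 14.50, 120.0),
    ("user_A", "card_A1", 62.00, 25.0),
    ("user_A", "card_A1", 9.99, 3.0),
    ("user_B", "card_B1", 1.00, 4.0),
    ("user_B", "card_B1", 4200.00, 3.25),          # 3m15s
    ("user_B", "card_B1", 3800.00, 2.0 + 40 / 60),  # 2m40s
    ("user_B", "card_B1", 5500.00, 2.0),
    ("user_B", "card_B1", 2900.00, 1.5),
]

def velocity(rows, window_minutes):
    """COUNT(*) and SUM(amount) per (user, card) over a trailing window."""
    agg = defaultdict(lambda: [0, 0.0])
    for user, card, amount, minutes_ago in rows:
        if minutes_ago < window_minutes:  # event_time > NOW() - window
            agg[(user, card)][0] += 1
            agg[(user, card)][1] += amount
    return {k: (c, round(s, 2)) for k, (c, s) in agg.items()}

print(velocity(SEED, 5))
# {('user_A', 'card_A1'): (1, 9.99), ('user_B', 'card_B1'): (5, 16401.0)}
```

The result matches the expected output of mv_txn_velocity_5min; the engine maintains the same aggregates incrementally instead of rescanning the rows.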
Feature Category 2: Device Fingerprinting
Device features track how many distinct devices and IP addresses a user employs. A legitimate user typically transacts from one or two devices. A compromised card used by a fraud ring will show transactions from many devices in a short period.
CREATE MATERIALIZED VIEW mv_device_features AS
SELECT
user_id,
card_id,
COUNT(DISTINCT device_id) AS distinct_devices_1h,
COUNT(DISTINCT ip_address) AS distinct_ips_1h,
COUNT(DISTINCT device_id) > 2 AS multi_device_flag
FROM raw_transactions
WHERE event_time > NOW() - INTERVAL '1 hour'
GROUP BY user_id, card_id;
Expected output:
user_id | distinct_devices_1h | distinct_ips_1h | multi_device_flag
---------+---------------------+-----------------+-------------------
user_A | 1 | 1 | f
user_B | 4 | 4 | t
Four distinct devices in one hour is a clear anomaly. Even users who own multiple devices rarely make payments from four of them within a single hour.
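The same cardinality logic, replayed in Python over the in-window seed rows (illustration only; the 2-device threshold mirrors the view above):

```python
from collections import defaultdict

# (user_id, device_id, ip_address) for seed rows inside the 1-hour window
EVENTS = [
    ("user_A", "dev_A1", "10.0.0.1"),
    ("user_A", "dev_A1", "10.0.0.1"),
    ("user_B", "dev_B1", "185.22.0.9"),
    ("user_B", "dev_B2", "91.45.3.7"),
    ("user_B", "dev_B3", "102.22.44.5"),
    ("user_B", "dev_B2", "91.45.3.7"),
    ("user_B", "dev_B4", "177.54.12.8"),
]

def device_features(events, max_devices=2):
    """COUNT(DISTINCT ...) per user, plus the multi-device flag."""
    devices, ips = defaultdict(set), defaultdict(set)
    for user, device, ip in events:
        devices[user].add(device)
        ips[user].add(ip)
    return {u: (len(devices[u]), len(ips[u]), len(devices[u]) > max_devices)
            for u in devices}

print(device_features(EVENTS))
# {'user_A': (1, 1, False), 'user_B': (4, 4, True)}
```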
Feature Category 3: Geolocation Anomalies
Geographic features detect impossible travel -- transactions from countries that are physically too far apart given the elapsed time. This is one of the strongest fraud indicators for card-not-present fraud.
First, create a view that pairs each transaction with the user's previous transaction:
CREATE MATERIALIZED VIEW mv_geo_pairs AS
SELECT
transaction_id,
user_id,
card_id,
country_code,
latitude,
longitude,
event_time,
LAG(country_code) OVER (
PARTITION BY user_id, card_id ORDER BY event_time
) AS prev_country,
LAG(latitude) OVER (
PARTITION BY user_id, card_id ORDER BY event_time
) AS prev_lat,
LAG(longitude) OVER (
PARTITION BY user_id, card_id ORDER BY event_time
) AS prev_lon,
LAG(event_time) OVER (
PARTITION BY user_id, card_id ORDER BY event_time
) AS prev_event_time
FROM raw_transactions;
Then, build a feature view that counts cross-country transactions and flags impossible travel, using a country change within a short elapsed time as a simple proxy for distance:
CREATE MATERIALIZED VIEW mv_geo_anomaly_features AS
SELECT
user_id,
card_id,
COUNT(*) FILTER (
WHERE prev_country IS NOT NULL
AND country_code <> prev_country
AND event_time - prev_event_time < INTERVAL '10 minutes'
) AS cross_country_txns_10min,
BOOL_OR(
prev_country IS NOT NULL
AND country_code <> prev_country
AND event_time - prev_event_time < INTERVAL '5 minutes'
) AS impossible_travel_flag
FROM mv_geo_pairs
GROUP BY user_id, card_id;
For distinct country counts within a time window, create a separate view:
CREATE MATERIALIZED VIEW mv_geo_countries_1h AS
SELECT
user_id,
card_id,
COUNT(DISTINCT country_code) AS distinct_countries_1h
FROM raw_transactions
WHERE event_time > NOW() - INTERVAL '1 hour'
GROUP BY user_id, card_id;
Expected output from mv_geo_anomaly_features:
user_id | cross_country_txns_10min | impossible_travel_flag
---------+--------------------------+------------------------
user_A | 0 | f
user_B | 4 | t
user_B has four cross-country transactions in under ten minutes and hits the impossible travel flag. No human can fly from London to Paris to Lagos to Moscow in 2.5 minutes.
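The view above treats a country change within five minutes as a cheap proxy. Since the source also carries latitude and longitude, you can tighten the check with a great-circle distance. A hedged Python sketch of the idea (the 900 km/h airliner ceiling is an assumed threshold you would tune):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0
AIRLINER_KMH = 900.0  # assumed ceiling for plausible travel speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def impossible_travel(lat1, lon1, lat2, lon2, elapsed_seconds):
    """True when the implied speed exceeds the airliner ceiling."""
    if elapsed_seconds <= 0:
        return True
    speed = haversine_km(lat1, lon1, lat2, lon2) / (elapsed_seconds / 3600.0)
    return speed > AIRLINER_KMH

# txn_102 -> txn_103: Paris to Lagos, 35 seconds apart
print(impossible_travel(48.8566, 2.3522, 6.5244, 3.3792, 35))  # True
```

The same formula can be expressed directly in the SQL view using the lat/lon columns from mv_geo_pairs if you prefer to keep the check inside the engine.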
Putting It All Together: The Unified Feature Vector
Individual feature categories are useful, but your fraud model needs a single row per user at scoring time. Create a unified feature view that joins all feature materialized views:
CREATE MATERIALIZED VIEW mv_fraud_feature_vector AS
SELECT
v5.user_id,
v5.card_id,
-- Velocity
v5.txn_count_5min,
v1h.txn_count_1h,
v24.txn_count_24h,
-- Amounts
v5.total_amount_5min,
v1h.total_amount_1h,
v1h.avg_amount_1h,
v1h.max_amount_1h,
v1h.stddev_amount_1h,
-- Device
d.distinct_devices_1h,
d.distinct_ips_1h,
d.multi_device_flag,
-- Geolocation
g.cross_country_txns_10min,
gc.distinct_countries_1h,
g.impossible_travel_flag
FROM mv_txn_velocity_5min v5
JOIN mv_txn_velocity_1h v1h
ON v5.user_id = v1h.user_id AND v5.card_id = v1h.card_id
JOIN mv_txn_velocity_24h v24
ON v5.user_id = v24.user_id AND v5.card_id = v24.card_id
JOIN mv_device_features d
ON v5.user_id = d.user_id AND v5.card_id = d.card_id
JOIN mv_geo_anomaly_features g
ON v5.user_id = g.user_id AND v5.card_id = g.card_id
JOIN mv_geo_countries_1h gc
ON v5.user_id = gc.user_id AND v5.card_id = gc.card_id;
Because these are inner joins, rows appear only for user-card pairs with activity in the last five minutes -- typically the population you are scoring in real time; use LEFT JOINs from the 24-hour view if you need rows for less recent activity. Your inference service fetches the complete feature vector with a single query:
SELECT * FROM mv_fraud_feature_vector
WHERE user_id = 'user_B' AND card_id = 'card_B1';
Because RisingWave is PostgreSQL-compatible, your model serving layer connects using any PostgreSQL driver -- psycopg2 in Python, pgx in Go, JDBC in Java. The query reads pre-computed results from the materialized view, so it returns in single-digit milliseconds regardless of how many raw events the system has processed.
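A hedged sketch of the serving-side lookup in Python (psycopg2 and the default local RisingWave endpoint are assumptions; any PostgreSQL driver works the same way):

```python
FEATURE_SQL = """
    SELECT * FROM mv_fraud_feature_vector
    WHERE user_id = %s AND card_id = %s
"""

def fetch_feature_vector(conn, user_id, card_id):
    """Return the feature row as a column-name dict, or None if absent."""
    with conn.cursor() as cur:
        cur.execute(FEATURE_SQL, (user_id, card_id))
        row = cur.fetchone()
        if row is None:
            return None
        cols = [desc[0] for desc in cur.description]
        return dict(zip(cols, row))

# Example wiring (requires `pip install psycopg2-binary` and a running instance):
#
#   import psycopg2
#   conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
#   fetch_feature_vector(conn, "user_B", "card_B1")
```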
Sinking Features to External Systems
If your inference service reads from Redis or a dedicated feature store rather than querying RisingWave directly, use a sink to push features downstream:
CREATE SINK fraud_features_to_kafka
FROM mv_fraud_feature_vector
WITH (
connector = 'kafka',
topic = 'fraud.features',
properties.bootstrap.server = 'broker:9092',
type = 'upsert',
primary_key = 'user_id, card_id'
)
FORMAT UPSERT ENCODE JSON;
Every time a feature value changes, RisingWave emits an updated record to the Kafka topic. A downstream consumer can write these to Redis, PostgreSQL, or any store your scoring service prefers.
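A downstream consumer can be sketched in a few lines of Python (the kafka-python and redis packages, and the `fraud:features:` key scheme, are assumptions for illustration):

```python
import json

def feature_key(record):
    """Key one feature row by the sink's primary key (user_id, card_id)."""
    return f"fraud:features:{record['user_id']}:{record['card_id']}"

def to_redis_mapping(raw_value):
    """Turn one upsert message value into a (key, field-map) pair for HSET."""
    record = json.loads(raw_value)
    return feature_key(record), {k: str(v) for k, v in record.items()}

# Example consumer loop (requires `pip install kafka-python redis` and
# running services); a None value is a tombstone for a deleted key:
#
#   from kafka import KafkaConsumer
#   import redis
#   r = redis.Redis()
#   for msg in KafkaConsumer("fraud.features", bootstrap_servers="broker:9092"):
#       if msg.value is not None:
#           key, fields = to_redis_mapping(msg.value)
#           r.hset(key, mapping=fields)
```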
FAQ
How does this differ from computing features in Flink or Kafka Streams?
The features themselves are the same -- velocity counts, amount aggregations, device cardinality. The difference is how you define and maintain them. In Flink, each feature is a Java or Scala application with custom state management and deployment. In RisingWave, each feature is a SQL materialized view with the same incremental computation semantics but significantly less operational overhead. If your team already knows SQL, the ramp-up time drops from weeks to hours.
Can RisingWave handle late-arriving events?
Yes. RisingWave supports event-time processing with watermarks, so late-arriving transactions are incorporated into the correct time windows. Features that use NOW()-based temporal filters evaluate against processing time, which is appropriate for serving. For training pipelines where you need point-in-time correctness, use event-time windows with the TUMBLE or HOP window functions.
What throughput can I expect?
RisingWave scales horizontally. A single node handles tens of thousands of events per second for typical fraud feature workloads. For high-volume payment processors, you can scale out compute nodes to handle millions of events per second. The RisingWave GitHub repository includes benchmarks and deployment guides for various scale profiles.
How do I backfill historical features for model training?
Load historical transaction data into a RisingWave table, and the materialized views will compute features over that data just as they would over live streams. You can then export the resulting feature table to your training environment. This ensures your training features are computed with the exact same SQL as your serving features -- eliminating training-serving skew by definition.
Conclusion
Fraud detection is a latency-sensitive problem where the gap between event and feature computation is the gap between catching fraud and missing it. Batch feature pipelines were a reasonable trade-off when real-time computation required specialized distributed systems expertise. Streaming SQL removes that trade-off.
With RisingWave, you define fraud features -- velocity, amounts, device fingerprints, geolocation anomalies -- as SQL materialized views. The engine incrementally maintains them as each transaction arrives. Your model gets fresh features in milliseconds, not hours. And because the same SQL defines both training and serving features, you eliminate the skew that quietly degrades model performance over time.
The SQL examples in this post are production-ready starting points. Adapt the thresholds and time windows to your transaction patterns, plug the unified feature view into your scoring service, and you have a real-time fraud feature pipeline with no custom code to maintain.
Try RisingWave Cloud free -- no credit card required. Sign up here.
Join the RisingWave Community on Slack to ask questions and share what you build.

