Fraud detection is a latency game. When a cardholder swipes at a point of sale, the issuing bank has roughly 100-200 milliseconds to approve or decline before the terminal times out. In that window, your model needs features — velocity counts, behavioral baselines, peer comparisons — computed from data that is seconds old, not minutes.
The engineering question is: what kind of pipeline do you use to produce those features?
Two dominant approaches exist: Python-based pipelines (custom Kafka consumers, Python UDFs in Flink, pandas-heavy batch jobs promoted to near-real-time) and SQL-based pipelines (materialized views over a streaming engine like RisingWave). This article walks through both using the same concrete feature set, then makes an honest case for when each is the right call.
What Fraud Detection Features Actually Look Like
Before writing any code, it helps to understand the three categories of features that fraud models rely on most.
Velocity features measure how fast activity is happening for a given entity. How many transactions has this card made in the last hour? How much total spend? Across how many distinct merchants and geographies? Sudden spikes in velocity are among the strongest fraud signals because legitimate cardholders have remarkably consistent behavior patterns over short windows.
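To make the idea concrete, here is a minimal, in-memory sketch of those four velocity features over a sliding one-hour window. The event shape (timestamp, amount, merchant_id, country) is an assumption for illustration, not a fixed schema:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)

def velocity_features(events, now):
    """Compute 1-hour velocity features from (timestamp, amount, merchant_id, country) tuples."""
    # Keep only events inside the sliding window ending at `now`.
    recent = [e for e in events if now - e[0] <= WINDOW]
    return {
        "txn_count_1h": len(recent),
        "spend_1h": sum(e[1] for e in recent),
        "unique_merchants_1h": len({e[2] for e in recent}),
        "countries_1h": len({e[3] for e in recent}),
    }
```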
Peer comparison features compare a single transaction against a statistical baseline. Is this $400 charge at a gas station unusual relative to other transactions at that same station? A z-score against the merchant's rolling average amount is a simple but effective anomaly signal. The key is that the baseline must be continuously updated — a snapshot from yesterday misses intraday patterns.
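The z-score itself is a one-liner; the guard against a zero or missing standard deviation (a new or low-volume merchant) mirrors the NULLIF pattern you would use in SQL. A minimal sketch:

```python
def amount_z_score(amount, merchant_avg, merchant_stddev):
    """How many standard deviations this amount sits from the merchant's rolling mean."""
    if not merchant_stddev:
        return 0.0  # no usable baseline: treat as not anomalous
    return (amount - merchant_avg) / merchant_stddev
```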
Behavioral anomaly features detect departures from a cardholder's own history. Has this card ever transacted in this country before? Has it used this merchant category in the last 30 days? These features require longer lookback windows (days to months) and are typically precomputed offline, then joined in at scoring time.
For this article, we focus on the first two categories because they illustrate the real-time pipeline tradeoffs most clearly.
The Python Approach
A typical Python-based pipeline looks like this: a Kafka consumer reads raw transaction events, applies feature logic in Python, and writes results to a feature store (Redis, Feast, or similar). Alternatively, Flink provides Python UDF support, letting you embed Python logic inside a Flink topology.
Here is a skeleton of what the velocity feature computation looks like in a Flink Python pipeline:
# Flink Python UDF approach — pseudocode/skeleton
from pyflink.common import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import ProcessWindowFunction
from pyflink.datastream.window import TumblingEventTimeWindows
import redis

class VelocityAggregator(ProcessWindowFunction):
    def __init__(self):
        self.redis_client = None  # initialized lazily per task

    def open(self, context):
        # Each parallel task creates its own connection — connection pooling
        # must be managed manually. Misconfigure this and you flood Redis.
        self.redis_client = redis.Redis(host="redis-host", port=6379)

    def process(self, key, context, elements):
        card_number = key
        elements_list = list(elements)
        txn_count = len(elements_list)
        total_spend = sum(e.amount for e in elements_list)
        unique_merchants = len(set(e.merchant_id for e in elements_list))
        unique_countries = len(set(e.country for e in elements_list))
        features = {
            "txn_count_1h": txn_count,
            "spend_1h": total_spend,
            "unique_merchants_1h": unique_merchants,
            "countries_1h": unique_countries,
        }
        # Write to feature store — what happens if Redis is down?
        # Retry logic, dead letter queues, and alerting are all your problem.
        self.redis_client.hset(f"velocity:{card_number}", mapping=features)
        yield features

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(8)

transactions = (
    env
    .add_source(kafka_source("transactions"))
    .assign_timestamps_and_watermarks(watermark_strategy)
    .key_by(lambda e: e.card_number)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .process(VelocityAggregator())
)

# Peer comparison requires a separate job, or a complex join topology
# that you'll need to manage state for independently.
merchant_stats = (
    env
    .add_source(kafka_source("transactions"))
    .key_by(lambda e: e.merchant_id)
    .window(TumblingEventTimeWindows.of(Time.days(7)))
    .aggregate(MerchantStatsAggregator())  # another class to write and test
)

# Joining velocity + merchant_stats + writing anomaly scores is another
# topology, another set of state backends to configure, another failure mode.
env.execute("fraud_feature_pipeline")
This is a simplified skeleton. A production version adds: watermark handling for late-arriving events, exactly-once sink configuration, checkpoint interval tuning, state backend selection (RocksDB vs. heap), operator chaining decisions, savepoint management for upgrades, and custom metrics for lag monitoring.
The peer comparison join is particularly painful in Python. You are joining a stream against a continuously updated aggregate, which requires either: (a) maintaining the aggregate in external state (Redis) and looking it up on each event, or (b) building a stream-stream join with appropriate state TTLs to prevent unbounded state growth. Option (a) introduces an external dependency that can desynchronize. Option (b) is notoriously tricky to get right and even harder to debug when it goes wrong.
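To illustrate option (a), here is a sketch of per-event enrichment against external state. The store interface is an assumption: any dict-like client (a Redis hash read, for instance) fits the same shape, and the `None` branch is exactly the desynchronization risk described above.

```python
def enrich_with_baseline(event, baseline_store):
    """Attach a merchant z-score by looking up the baseline in external state."""
    stats = baseline_store.get(event["merchant_id"])
    if stats is None:
        # Baseline missing or stale: the aggregate job and the enrichment
        # job can drift apart, and the model sees a null feature.
        return {**event, "amount_z_score": None}
    std = stats["stddev"] or 1.0  # avoid division by zero for new merchants
    return {**event, "amount_z_score": (event["amount"] - stats["avg"]) / std}
```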
The SQL Approach
Here is the same feature set expressed as SQL materialized views in RisingWave:
-- Step 1: Velocity features per card over the last hour
CREATE MATERIALIZED VIEW txn_velocity AS
SELECT
    card_number,
    COUNT(*) AS txn_count_1h,
    SUM(amount) AS spend_1h,
    COUNT(DISTINCT merchant_id) AS unique_merchants_1h,
    COUNT(DISTINCT country) AS countries_1h
FROM transactions
WHERE txn_time >= NOW() - INTERVAL '1 hour'
GROUP BY card_number;

-- Step 2: Merchant baseline stats over the last 7 days
CREATE MATERIALIZED VIEW merchant_avg_stats AS
SELECT
    merchant_id,
    merchant_category,
    AVG(amount) AS avg_txn_amount,
    STDDEV(amount) AS stddev_txn_amount
FROM transactions
WHERE txn_time >= NOW() - INTERVAL '7 days'
GROUP BY merchant_id, merchant_category;

-- Step 3: Anomaly score — join velocity with merchant baseline
CREATE MATERIALIZED VIEW txn_anomaly_score AS
SELECT
    t.txn_id,
    t.card_number,
    t.amount,
    m.avg_txn_amount,
    (t.amount - m.avg_txn_amount) / NULLIF(m.stddev_txn_amount, 0) AS amount_z_score
FROM transactions t
JOIN merchant_avg_stats m ON t.merchant_id = m.merchant_id
WHERE t.txn_time >= NOW() - INTERVAL '1 hour';
That's the complete pipeline: three DDL statements. RisingWave maintains these views incrementally — as new transactions arrive, only the affected aggregates update. There is no separate job deployment, no state backend to configure, no checkpoint interval to tune.
Your scoring service queries txn_velocity and txn_anomaly_score directly via a standard PostgreSQL-compatible connection. Latency from transaction event to queryable feature is consistently under 200ms for moderate event volumes.
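A sketch of what that read path might look like from the scoring side. The cursor is whatever your PostgreSQL driver provides (psycopg, asyncpg, JDBC); the query itself is the only part that matters, and the zero-filled fallback for cards with no recent activity is a design choice, not a RisingWave behavior:

```python
VELOCITY_SQL = """
    SELECT txn_count_1h, spend_1h, unique_merchants_1h, countries_1h
    FROM txn_velocity
    WHERE card_number = %s
"""

def fetch_velocity(cursor, card_number):
    """Read precomputed velocity features for one card from the materialized view."""
    cursor.execute(VELOCITY_SQL, (card_number,))
    row = cursor.fetchone()
    if row is None:
        # No transactions in the last hour: treat as zero velocity.
        return {"txn_count_1h": 0, "spend_1h": 0.0,
                "unique_merchants_1h": 0, "countries_1h": 0}
    keys = ("txn_count_1h", "spend_1h", "unique_merchants_1h", "countries_1h")
    return dict(zip(keys, row))
```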
To extend this with behavioral history — for example, whether the card has used this merchant category in the last 30 days — you add another materialized view:
CREATE MATERIALIZED VIEW card_category_history AS
SELECT
    card_number,
    merchant_category,
    COUNT(*) AS category_txn_count_30d,
    MAX(txn_time) AS last_seen_category
FROM transactions
WHERE txn_time >= NOW() - INTERVAL '30 days'
GROUP BY card_number, merchant_category;
Adding a feature in SQL is a DDL statement. Adding a feature in a Python pipeline means modifying application code, deploying a new version, ensuring backward compatibility with the state backend, and testing the new operator in isolation before production.
Direct Comparison
Development Speed
SQL wins for anything expressible as aggregations, window functions, or joins. An experienced data engineer can write and test these three materialized views in under an hour. The equivalent Flink Python topology requires a full day or more — environment setup, topology design, testing watermark behavior, validating state correctness, deploying to a Flink cluster.
Python wins for features that require custom logic: graph traversal, ML model inference in the pipeline, string parsing with complex rules. No SQL dialect can do those things.
Debuggability
SQL is dramatically easier to debug. When a feature looks wrong, you query the materialized view directly, check intermediate results, and trace backward through the SQL. The logic is transparent and inspectable at every step.
With a Python Flink pipeline, debugging requires: attaching to a running job, reading operator logs, understanding Flink's internal state representation, and often reproducing the issue locally — which requires replicating the Kafka topic state, checkpoint state, and watermark position. Production post-mortems routinely take hours just to identify which operator introduced the problem.
Performance
Both approaches can achieve sub-200ms feature freshness. The difference is in operational consistency. RisingWave's incremental view maintenance is designed around exactly this use case: low-latency, high-throughput aggregation over sliding windows. It handles backpressure, memory pressure, and state management internally.
Python-based pipelines can be tuned to similar latency targets, but they require sustained operational investment. A Flink Python UDF that processes 50K events/second may behave differently at 200K events/second due to GIL contention and serialization overhead — issues that don't affect a SQL engine with native operators.
Maintenance Overhead
SQL pipelines are nearly maintenance-free once deployed. There are no library versions to pin, no Python environment to manage, no operator compatibility matrix to check across Flink versions. Schema changes in the upstream table propagate to views automatically.
Python pipelines accumulate technical debt. The pandas version that worked in 2023 may conflict with the Flink Python client in 2025. The Redis client library needs upgrading. The watermark strategy needs revisiting after a schema change. This overhead is real and compounds over time.
Hiring and Onboarding
Any data engineer reads SQL. The materialized view definitions above are self-documenting — a new team member can understand the fraud feature set in minutes. Python Flink pipelines require Flink expertise, which is a specialty skill that narrows your hiring pool and extends onboarding time significantly.
When Python Is the Right Choice
Python wins in specific, well-defined scenarios:
In-pipeline model inference. If your fraud feature pipeline needs to run an ML model mid-stream — a graph neural network for account relationship analysis, or an embedding model for merchant categorization — SQL cannot help you there. Python (or Rust, or Java) is the right tool. The model is the pipeline.
Graph features. Fraud rings often manifest as connected components in a transaction graph: accounts that share devices, phone numbers, or addresses. Detecting these requires traversing edges in a graph structure. SQL can express simple self-joins, but complex graph traversal (multi-hop BFS/DFS) is not SQL's domain.
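A sketch of why this sits naturally in Python: finding connected components (candidate fraud rings) among accounts that share a device is a few lines of BFS over an adjacency map, but awkward to express as SQL self-joins once the hop count is unbounded.

```python
from collections import defaultdict, deque

def fraud_rings(shared_device_edges):
    """Return connected components from a list of (account_a, account_b) edges."""
    adj = defaultdict(set)
    for a, b in shared_device_edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, rings = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        while queue:  # plain BFS over the component
            node = queue.popleft()
            if node in component:
                continue
            component.add(node)
            queue.extend(adj[node] - component)
        seen |= component
        rings.append(component)
    return rings
```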
Complex pattern matching. If your fraud signal requires detecting a specific sequence of events — a password reset followed by an address change followed by a high-value transaction — SQL window functions can handle some of this, but complex CEP (Complex Event Processing) patterns are more naturally expressed in Flink's MATCH_RECOGNIZE or Python with explicit state machines.
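The explicit-state-machine style looks like this in a sketch. The event shape and the $1,000 threshold are illustrative assumptions; a production version would also bound the time between steps.

```python
SEQUENCE = ("password_reset", "address_change", "high_value_txn")

def detect_takeover(events, high_value=1000.0):
    """Return True if the account-takeover sequence appears in order."""
    state = 0  # index of the next step we are waiting for
    for e in events:
        etype = e["type"]
        if etype == "txn" and e.get("amount", 0) >= high_value:
            etype = "high_value_txn"
        if etype == SEQUENCE[state]:
            state += 1
            if state == len(SEQUENCE):
                return True  # full pattern matched
    return False
```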
Custom serialization and protocol parsing. If your event source is binary-encoded (Avro, Protobuf with custom extensions, proprietary formats), Python gives you the full deserialization ecosystem.
When SQL Wins
SQL wins for the vast majority of fraud feature engineering:
Velocity checks. Transaction counts, spend totals, and distinct entity counts over rolling windows are pure aggregation. SQL handles these with a single GROUP BY plus a time filter. Any Python implementation is strictly more code for the same result.
Peer comparison baselines. Statistical baselines (mean, standard deviation, percentiles) over merchant or category windows are SQL aggregations. The join between a transaction and its merchant baseline is a SQL join.
Rule-based signals. Amount thresholds, geographic restrictions, time-of-day patterns, merchant category filters — all expressible as SQL predicates. These are also the signals most often subject to regulatory audit. SQL is easier to audit than Python.
Joining with reference data. Blocked card lists, sanctioned merchant lists, BIN tables, IP geolocation databases — joining transaction events with static reference tables is standard SQL. In Python, this requires either broadcasting the reference table to all workers (expensive) or maintaining it in an external store (an additional dependency).
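For contrast, the broadcast variant in Python reduces to set membership checked per event. A sketch, with the list names assumed for illustration:

```python
BLOCKED_CARDS = {"4000000000000002"}
SANCTIONED_MERCHANTS = {"m_sanctioned_1"}

def reference_flags(event):
    """Flag an event against in-memory copies of the reference lists."""
    return {
        "card_blocked": event["card_number"] in BLOCKED_CARDS,
        "merchant_sanctioned": event["merchant_id"] in SANCTIONED_MERCHANTS,
    }
```

The catch is keeping those in-memory copies current on every worker, which is precisely the problem a SQL join against a reference table avoids.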
Feature serving. SQL materialized views are directly queryable. No separate feature store, no cache warming logic, no staleness check. The view is always current.
The Hybrid Approach
The architecture that works well in practice is not purely SQL or purely Python — it uses each where it fits.
SQL for feature computation. Use RisingWave materialized views for all aggregation-based features: velocity, peer baselines, behavioral history. These are always fresh, always queryable, and require almost no operational overhead.
Python for model serving. Your scoring service is a Python application (FastAPI, Flask, or similar). It receives an authorization request, queries the RisingWave materialized views for precomputed features, optionally computes a few request-time features (relative to the incoming transaction), calls the fraud model, and returns a score.
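That scoring flow can be sketched as a plain function; the feature fetch and the model are stand-ins here (in production, a view query and a real fraud model), and `amount_vs_avg` is one example of a request-time feature:

```python
def score_transaction(txn, fetch_features, model):
    """Score one authorization request from precomputed + request-time features."""
    features = fetch_features(txn["card_number"], txn["merchant_id"])
    # Request-time features are computed relative to the incoming transaction.
    features["amount"] = txn["amount"]
    features["amount_vs_avg"] = (
        txn["amount"] / features["avg_txn_amount"]
        if features.get("avg_txn_amount") else None
    )
    return model(features)
```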
Python for offline feature engineering. Historical feature computation for model training runs in a batch Python environment (Spark, pandas) against historical data. This is separate from the production pipeline and does not affect latency.
Python for graph and embedding features. If your model uses graph-based or embedding-based features, run a separate Python service that updates these on a slower cadence (minutes, not milliseconds). Store results in RisingWave or Redis and join them in at scoring time.
This split keeps the fast path simple and SQL-native, while leaving Python in roles where its flexibility genuinely adds value.
FAQ
Can RisingWave handle the transaction volumes of a large payment processor?
RisingWave scales horizontally. Production deployments handle millions of events per second. For context, even large card networks operate at tens of thousands of transactions per second during peak hours — well within range. The constraint is usually not throughput but state size: a 30-day rolling window over millions of active cards requires substantial memory. RisingWave spills state to object storage to handle this.
What happens if RisingWave is down when we need to score a transaction?
The same thing that happens if your Flink cluster is down: you fall back to degraded scoring (rules only, no ML features) or reject the transaction. This is not a RisingWave-specific risk. The mitigation is standard: deploy with high availability, use replicated storage, and implement a fallback scoring path in your authorization service.
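A sketch of that fallback path in the authorization service, with the exception type and scoring functions assumed for illustration:

```python
def score_with_fallback(txn, fetch_features, model, rules_only_score):
    """Degrade to rules-only scoring if the feature store is unreachable."""
    try:
        features = fetch_features(txn)
    except ConnectionError:
        # Feature store down: answer with rules only rather than failing the auth.
        return rules_only_score(txn), "degraded"
    return model(features), "full"
```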
We already have a Flink cluster. Is it worth migrating?
Partially. If you have significant investment in Flink Java/Scala pipelines with good operational practices, migrating for migration's sake has real costs. The more productive path is to use RisingWave alongside Flink: route aggregation-heavy feature jobs to RisingWave, keep any existing Flink jobs that are stable and complex enough to justify the operational overhead. Evaluate migration job by job based on maintenance burden.
Our fraud features include some ML-generated embeddings. How do we integrate those?
Pre-compute embeddings offline (or in a background Python service) and store them in RisingWave as a table. Your authorization flow joins the incoming transaction against the embedding table at scoring time. The embedding table updates on whatever cadence your ML pipeline runs — hourly, daily, or as an online update. This is a common pattern in production fraud systems.
How do we handle schema changes in the transactions table?
Add columns to the source table, update the relevant materialized views, and drop/recreate any views that need to reference the new column. RisingWave supports ALTER TABLE for source tables. Downstream views that don't reference the changed column are unaffected. This is significantly easier than the Flink equivalent, which often requires a full savepoint restore cycle.
The core argument here is not that SQL is always better. It is that SQL is the right tool for a specific, important subset of fraud feature engineering — the aggregation-heavy, join-heavy, rule-based subset that constitutes the majority of features in most production systems. Use it for what it does well. Keep Python for what requires flexibility that SQL genuinely cannot provide.

