Insurance Claims Fraud Detection with Streaming SQL

Insurance fraud detection with streaming SQL means analyzing claims in real time as they arrive—checking behavioral patterns, cross-referencing policy data, and flagging anomalies within seconds of submission—rather than running fraud scoring as a nightly batch job after claims have already entered the payment queue.

Why Streaming Beats Batch for Fraud Detection

Insurance fraud costs the industry more than $80 billion per year in the US alone. A significant portion involves staged accidents, duplicate claims, and provider fraud schemes that are detectable from patterns—but only if you look at the data quickly enough. Batch fraud detection catches fraud after payment. Streaming fraud detection catches it before.

The pattern-matching that catches fraud is fundamentally a streaming aggregation problem: how many claims has this claimant filed in the last 30 days? Has this provider billed for the same procedure on the same patient multiple times this week? Is this claim arriving suspiciously fast after a policy was purchased? These questions are answered by maintaining incrementally updated materialized views over claims streams.

Ingesting Claims and Policy Data

-- Real-time claims stream from the claims intake system
CREATE SOURCE insurance_claims (
    claim_id            VARCHAR,
    policy_id           VARCHAR,
    claimant_id         VARCHAR,
    provider_id         VARCHAR,
    claim_type          VARCHAR,    -- 'AUTO', 'HEALTH', 'PROPERTY', 'LIFE'
    claimed_amount      DECIMAL(18,2),
    incident_date       DATE,
    submission_time     TIMESTAMPTZ,
    ip_address          VARCHAR,
    device_fingerprint  VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'claims-intake',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

-- Policy reference data (CDC from policy management system)
CREATE SOURCE policy_data (
    policy_id           VARCHAR,
    claimant_id         VARCHAR,
    policy_type         VARCHAR,
    inception_date      DATE,
    coverage_limit      DECIMAL(18,2),
    is_active           BOOLEAN,
    underwriter_id      VARCHAR
) WITH (
    connector = 'postgres-cdc',
    hostname = 'policy-db.internal',
    port = '5432',
    username = 'rw_reader',
    password = 'secret',            -- placeholder; supply via a secrets manager in production
    database.name = 'policies',
    schema.name = 'public',
    table.name = 'policies'
) FORMAT DEBEZIUM ENCODE JSON;

Building Fraud Indicator Views

The key insight is that fraud rarely appears in a single claim—it appears in patterns across claims. Materialized views maintain those patterns continuously:

-- Claimant behavior profile: rolling 30-day claim history
CREATE MATERIALIZED VIEW claimant_claim_profile AS
SELECT
    claimant_id,
    COUNT(*)                                            AS claims_last_30d,
    SUM(claimed_amount)                                 AS total_claimed_30d,
    AVG(claimed_amount)                                 AS avg_claim_amount,
    COUNT(DISTINCT claim_type)                          AS claim_type_diversity,
    COUNT(DISTINCT provider_id)                         AS distinct_providers,
    MAX(submission_time)                                AS most_recent_claim,
    MIN(submission_time)                                AS first_claim_30d
FROM insurance_claims
WHERE submission_time >= NOW() - INTERVAL '30 days'  -- temporal filter: rows older than 30 days age out of the view
GROUP BY claimant_id;
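
Materialized views are queryable at any time with ordinary SQL over the Postgres wire protocol. For example, an adjuster tool could look up a claimant's rolling profile (the claimant_id value here is illustrative):

-- Point lookup against the continuously maintained profile
SELECT claims_last_30d, total_claimed_30d, distinct_providers
FROM claimant_claim_profile
WHERE claimant_id = 'CLM-1042';  -- hypothetical claimant ID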

-- Provider anomaly detection: same procedure, multiple patients in short window
CREATE MATERIALIZED VIEW provider_billing_anomalies AS
SELECT
    window_start,
    window_end,
    provider_id,
    claim_type,
    COUNT(*)                                AS claims_in_window,
    COUNT(DISTINCT claimant_id)             AS unique_claimants,
    SUM(claimed_amount)                     AS total_billed,
    AVG(claimed_amount)                     AS avg_claim
FROM TUMBLE(insurance_claims, submission_time, INTERVAL '1 HOUR')
GROUP BY window_start, window_end, provider_id, claim_type
HAVING COUNT(*) > 10;

-- Real-time fraud score combining multiple signals
CREATE MATERIALIZED VIEW claim_fraud_signals AS
SELECT
    c.claim_id,
    c.claimant_id,
    c.policy_id,
    c.claimed_amount,
    c.submission_time,
    -- Signal 1: High claim frequency
    CASE WHEN cp.claims_last_30d > 5 THEN 1 ELSE 0 END          AS high_frequency_flag,
    -- Signal 2: Claim amount far above claimant average
    CASE WHEN c.claimed_amount > cp.avg_claim_amount * 3 THEN 1 ELSE 0 END AS high_amount_flag,
    -- Signal 3: Policy very new at time of claim
    CASE WHEN c.incident_date <= p.inception_date + INTERVAL '30 days' THEN 1 ELSE 0 END AS new_policy_flag,
    -- Signal 4: Multiple providers in window
    CASE WHEN cp.distinct_providers > 4 THEN 1 ELSE 0 END        AS multi_provider_flag,
    cp.claims_last_30d,
    cp.total_claimed_30d,
    p.inception_date
FROM insurance_claims c
LEFT JOIN claimant_claim_profile cp ON c.claimant_id = cp.claimant_id
LEFT JOIN policy_data p ON c.policy_id = p.policy_id;
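
The signal view can also be queried ad hoc to rank the riskiest recent claims, summing the flags into a score on the fly:

-- Rank claims by number of fraud signals triggered
SELECT
    claim_id,
    claimant_id,
    claimed_amount,
    (high_frequency_flag + high_amount_flag + new_policy_flag + multi_provider_flag) AS fraud_score
FROM claim_fraud_signals
ORDER BY fraud_score DESC, claimed_amount DESC
LIMIT 20;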

Routing High-Risk Claims

Once fraud signals are computed, sink the results to a fraud review queue:

-- Sink high-risk claims to Kafka for adjuster review workflow
CREATE SINK fraud_review_queue AS
SELECT
    claim_id,
    claimant_id,
    policy_id,
    claimed_amount,
    (high_frequency_flag + high_amount_flag + new_policy_flag + multi_provider_flag) AS fraud_score,
    submission_time
FROM claim_fraud_signals
WHERE (high_frequency_flag + high_amount_flag + new_policy_flag + multi_provider_flag) >= 2
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'kafka:9092',
    topic = 'fraud-review-queue'
) FORMAT PLAIN ENCODE JSON;

Windowed Pattern: Duplicate Claim Detection

-- Detect potential duplicate claims within 7-day tumbling windows
CREATE MATERIALIZED VIEW potential_duplicates AS
SELECT
    window_start,
    window_end,
    claimant_id,
    claim_type,
    COUNT(*)            AS claim_count,
    MIN(claim_id)       AS first_claim_id,
    MAX(claim_id)       AS latest_claim_id,
    MIN(claimed_amount) AS min_amount,
    MAX(claimed_amount) AS max_amount
FROM TUMBLE(insurance_claims, submission_time, INTERVAL '7 DAYS')
GROUP BY window_start, window_end, claimant_id, claim_type
HAVING COUNT(*) > 1;
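
TUMBLE partitions time into fixed, non-overlapping 7-day buckets, so two claims filed a day apart can land in adjacent windows and escape detection. A HOP (sliding) window closes that gap at the cost of each claim appearing in multiple overlapping windows. A sketch of the same view with 7-day windows advancing by 1 day:

-- Sliding-window variant: each claim is evaluated in seven overlapping windows
CREATE MATERIALIZED VIEW potential_duplicates_sliding AS
SELECT
    window_start,
    window_end,
    claimant_id,
    claim_type,
    COUNT(*) AS claim_count
FROM HOP(insurance_claims, submission_time, INTERVAL '1 DAY', INTERVAL '7 DAYS')
GROUP BY window_start, window_end, claimant_id, claim_type
HAVING COUNT(*) > 1;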

Comparison: Batch vs. Streaming Fraud Detection

Dimension                 | Batch Fraud Scoring                       | Streaming with RisingWave
--------------------------|-------------------------------------------|-----------------------------------
Detection timing          | After payment processing                  | Before payment, at submission
Pattern lookback          | Static snapshot at job run                | Rolling window, always current
False negative rate       | High (patterns missed before job)         | Lower (real-time cross-claim signals)
Infrastructure            | ML pipeline + data warehouse + scheduler  | Streaming SQL + materialized views
Adjuster workflow latency | Hours to days                             | Seconds
Coverage                  | Subset of high-value claims               | All claims, all the time

FAQ

Q: Can RisingWave integrate with machine learning fraud models? A: Yes. Use RisingWave to compute streaming features (claim frequency, amount deviation, provider patterns) and sink them to Kafka. An ML inference service consumes from Kafka, scores claims, and publishes results back to another Kafka topic that RisingWave can ingest as a source. The materialized views handle feature engineering; the ML model handles scoring.
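
A sketch of the feature-sink half of that loop, reusing the views defined earlier (the 'fraud-features' topic name is illustrative):

-- Ship engineered features to Kafka for an external ML scoring service
CREATE SINK fraud_features AS
SELECT
    c.claim_id,
    c.claimed_amount,
    cp.claims_last_30d,
    cp.total_claimed_30d,
    cp.avg_claim_amount,
    cp.distinct_providers
FROM insurance_claims c
LEFT JOIN claimant_claim_profile cp ON c.claimant_id = cp.claimant_id
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'kafka:9092',
    topic = 'fraud-features'    -- illustrative topic name
) FORMAT PLAIN ENCODE JSON;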

Q: How do we handle claims that span multiple systems (auto, health, life)? A: Create separate Kafka topics per claims system and ingest each as a separate source, then union them in a view (assuming the sources share a compatible schema):

CREATE MATERIALIZED VIEW all_claims AS
SELECT * FROM auto_claims
UNION ALL SELECT * FROM health_claims
UNION ALL SELECT * FROM life_claims;

Fraud signals can then run across all claim types.

Q: What about organized fraud rings that coordinate across multiple claimants? A: Build a graph-based signal by counting shared attributes (same IP address, same device fingerprint, same phone number) across claimants within a time window using materialized views. This requires enriching claims with device/contact data from the intake system, but the aggregation pattern is the same.
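
Since the claims source above already carries device_fingerprint, one such ring signal can be sketched directly: count how many distinct claimants submit from the same device in a window.

-- Ring indicator: many claimants sharing one device fingerprint in a week
CREATE MATERIALIZED VIEW shared_device_signals AS
SELECT
    window_start,
    device_fingerprint,
    COUNT(DISTINCT claimant_id) AS claimants_on_device,
    COUNT(*)                    AS claims_on_device
FROM TUMBLE(insurance_claims, submission_time, INTERVAL '7 DAYS')
GROUP BY window_start, device_fingerprint
HAVING COUNT(DISTINCT claimant_id) > 2;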

Q: How does RisingWave handle schema changes when the claims intake format changes? A: RisingWave supports ALTER SOURCE to update schema definitions. For Kafka sources using JSON encoding, adding new nullable fields is backward compatible. For CDC sources, schema changes are handled by the Debezium connector automatically.
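
For example, if the intake system starts emitting a new nullable field, it can be added to the source in place (referral_code is a hypothetical field used for illustration):

-- Add a newly introduced nullable field to the claims source
ALTER SOURCE insurance_claims ADD COLUMN referral_code VARCHAR;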

Q: Is streaming fraud detection compliant with GDPR and CCPA? A: RisingWave itself is a data processing engine—compliance depends on how you configure data retention and access. Use Kafka topic retention policies to control how long raw claims data is stored, implement column-level access controls on sensitive fields, and use the audit log from your Postgres-compatible interface for compliance reporting.

