Building a UEBA System Without ML: Streaming Behavioral Baselines with SQL

A SQL-based UEBA system computes rolling behavioral baselines for users and entities using window aggregations, then flags deviations in real time. Unlike ML-based UEBA, it requires no model training, no feature pipelines, and no retraining cycles. It detects volume and rate anomalies immediately and is fully auditable by any analyst who can read SQL.

What is UEBA and why do most teams struggle to deploy it?

User and Entity Behavior Analytics (UEBA) is a security discipline that establishes normal behavioral patterns for users, devices, and applications, then raises alerts when observed behavior deviates significantly from those patterns.

Most commercial UEBA products are built on machine learning. That sounds powerful, but it creates real friction for security teams:

Training data requirements. ML models need weeks or months of labeled, clean historical activity data before they produce reliable baselines. New environments, new applications, or organizational changes require re-collection.

Feature engineering overhead. Raw event logs (login timestamps, query counts, file access records) must be transformed into structured feature vectors. This work requires data engineering expertise that many security teams do not have.

Retraining cycles. User behavior shifts over time. An employee who changes roles, joins a new project, or starts working remotely will look anomalous to a stale model. Keeping models current requires scheduled retraining pipelines, drift monitoring, and validation workflows.

Black-box alerts. When an ML model fires an alert, explaining why to an analyst is hard. Interpretability tools help, but the output is still probabilistic. Tuning false positive rates without understanding the underlying features is difficult.

The result: many organizations buy UEBA products, struggle through a 90-day tuning period, and end up either drowning in false positives or turning sensitivity so low the system misses real threats.

SQL-based baselines skip this complexity. The baseline logic is written in plain SQL, visible to anyone, and starts producing useful results as soon as you have a few days of activity data.

How SQL rolling baselines detect behavioral anomalies

A SQL baseline works through three steps:

  1. Compute a rolling average of activity over a trailing window (7 days, 30 days, or whatever fits your environment).
  2. Observe the current rate for the same user or entity in a recent short window (last hour, last 15 minutes).
  3. Calculate a deviation ratio: current rate divided by baseline. Flag anything above a threshold (for example, 3x normal).

This is straightforward statistics, not machine learning. The math is transparent: if a user normally runs 40 database queries per day and today they ran 800, the deviation ratio is 20x. You do not need a model to know that is worth investigating.
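
Expressed as SQL, the entire detection predicate is one arithmetic expression. A minimal sketch, using placeholder names for the views built later in this post:

-- deviation ratio: short-window rate divided by the trailing baseline rate
SELECT
    user_id,
    actions_last_hour / NULLIF(avg_daily_actions / 24.0, 0) AS deviation_ratio
FROM current_vs_baseline  -- placeholder for the join of current activity and baseline
WHERE actions_last_hour / NULLIF(avg_daily_actions / 24.0, 0) > 3.0;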

The tradeoff is precision. SQL baselines are calibrated to volume and rate. They work well when the anomaly is "too much" or "too fast" or "at the wrong time." They do not capture subtle behavioral shifts like "this user's sequence of accessed tables matches the pattern of a previous data exfiltration incident." Detecting patterns like that requires feature vectors that are impractical to build in pure SQL. ML models excel there.

What anomaly patterns work well with SQL baselines?

SQL baselines are well-suited for several concrete threat patterns:

Volume spikes. A user who normally runs 50 queries per day running 2,000 is suspicious regardless of what those queries contain. Same logic applies to login attempts, file reads, API calls, and privilege escalations. Volume anomalies are the clearest signal SQL baselines handle.

Login rate anomalies. Too many failed logins in a short window, or successful logins at a rate far above the user's baseline, are classic indicators of credential stuffing or account takeover. A rolling average of daily logins compared against a 1-hour window catches these quickly.
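
A sketch of a failed-login burst detector in RisingWave. It assumes the activity stream defined later in this post tags authentication failures with an action_type value such as 'login_failed'; both that value and the threshold are illustrative:

CREATE MATERIALIZED VIEW failed_login_bursts AS
SELECT
    user_id,
    COUNT(*) AS failures_last_hour
FROM user_activity
WHERE action_type = 'login_failed'
  AND event_time >= NOW() - INTERVAL '1 hour'
GROUP BY user_id
-- threshold chosen for illustration; tune to your environment
HAVING COUNT(*) > 10;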

Data access volume. A user downloading 10x their normal data volume in a single session is a meaningful insider threat signal. SQL can compute this from bytes-transferred fields in access logs.

Time-of-day anomalies. Activity outside a user's established active hours is easy to flag with SQL. If a user's historical activity is concentrated between 09:00 and 18:00, a login at 02:30 is a deviation worth surfacing. This is a simple filter condition, not a learned pattern.
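
A fixed-hours version of this check is a single predicate. A sketch, with an illustrative 07:00-20:00 active window (per-user active hours would instead come from an aggregate over each user's history):

-- hours are interpreted in the session time zone
SELECT user_id, event_time, action_type, resource
FROM user_activity
WHERE EXTRACT(HOUR FROM event_time) NOT BETWEEN 7 AND 19;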

New resource access. A first-ever access to a sensitive table or a high-privilege system account is a distinct event type. SQL can detect this with a simple anti-join against a history table of previously accessed resources.
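
A sketch of that anti-join, assuming a hypothetical resource_access_history table with one row per previously seen (user_id, resource) pair:

SELECT a.user_id, a.resource, a.event_time
FROM user_activity a
WHERE NOT EXISTS (
    -- hypothetical, separately maintained history of prior accesses
    SELECT 1
    FROM resource_access_history h
    WHERE h.user_id = a.user_id
      AND h.resource = a.resource
);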

None of these require ML. They require clean event data and a streaming database that can maintain rolling aggregations.

Step-by-step: building per-user baselines in RisingWave

RisingWave ingests event streams from Kafka and maintains incrementally updated materialized views. Each view is a SQL query; RisingWave keeps its results current as new events arrive. No batch jobs, no manual refreshes.

Step 1: Define the activity source

CREATE SOURCE user_activity (
    event_time  TIMESTAMPTZ,
    user_id     VARCHAR,
    action_type VARCHAR,
    resource    VARCHAR,
    bytes_read  BIGINT
)
WITH (
    connector = 'kafka',
    topic = 'user-activity-events',
    properties.bootstrap.server = 'kafka:9092'
)
FORMAT PLAIN ENCODE JSON;

Step 2: Compute the 30-day rolling baseline

This materialized view computes each user's average daily action count over a trailing 30-day window. It needs two levels of aggregation: the inner query buckets events into per-user daily counts, and the outer query averages those counts (a single GROUP BY cannot express "average of daily counts," and nesting aggregates like STDDEV(COUNT(*)) is invalid SQL). RisingWave updates the view continuously as new events arrive; the NOW()-based predicate is a temporal filter that keeps only the trailing 30 days in scope.

CREATE MATERIALIZED VIEW user_daily_baseline AS
SELECT
    user_id,
    AVG(daily_actions)                  AS avg_daily_actions,
    STDDEV_SAMP(daily_actions)          AS stddev_daily_actions,
    SUM(daily_bytes)::FLOAT
        / NULLIF(SUM(daily_actions), 0) AS avg_bytes_per_action
FROM (
    SELECT
        user_id,
        DATE_TRUNC('day', event_time) AS activity_day,
        COUNT(*)                      AS daily_actions,
        SUM(bytes_read)               AS daily_bytes
    FROM user_activity
    WHERE event_time >= NOW() - INTERVAL '30 days'
    GROUP BY user_id, DATE_TRUNC('day', event_time)
) per_day
GROUP BY user_id;

Step 3: Observe current activity rate

CREATE MATERIALIZED VIEW user_current_activity AS
SELECT
    user_id,
    COUNT(*)       AS actions_last_hour,
    SUM(bytes_read) AS bytes_last_hour
FROM user_activity
WHERE event_time >= NOW() - INTERVAL '1 hour'
GROUP BY user_id;

Step 4: Build the anomaly detection view

This view joins current activity against the baseline and computes a deviation ratio. Any user whose hourly rate exceeds 5x their expected hourly rate (daily average divided by 24) appears in the results, as does any user whose last-hour byte volume exceeds 100x their average bytes per action.

CREATE MATERIALIZED VIEW ueba_anomalies AS
SELECT
    c.user_id,
    c.actions_last_hour,
    b.avg_daily_actions / 24.0          AS expected_hourly_actions,
    c.actions_last_hour
        / NULLIF(b.avg_daily_actions / 24.0, 0)
                                        AS deviation_ratio,
    c.bytes_last_hour,
    b.avg_bytes_per_action
FROM user_current_activity c
JOIN user_daily_baseline b
    ON c.user_id = b.user_id
WHERE
    c.actions_last_hour
        / NULLIF(b.avg_daily_actions / 24.0, 0) > 5.0
    OR c.bytes_last_hour > b.avg_bytes_per_action * 100;

Note that NOW() cannot appear in the SELECT list of a streaming query in RisingWave (it is only valid in temporal filter conditions), so alert timestamps are best attached downstream at the sink or in the SIEM.

An analyst can query ueba_anomalies at any time and get the current list of flagged users. The deviation ratio is self-explanatory: a value of 12 means this user is acting at 12x their normal rate right now. No model interpretation needed.
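
For example, to list the most extreme deviations first:

SELECT user_id, deviation_ratio, actions_last_hour, expected_hourly_actions
FROM ueba_anomalies
ORDER BY deviation_ratio DESC
LIMIT 20;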

You can also push these results to an alerting sink (Kafka, webhook, SIEM) using RisingWave's sink connectors, so anomalies trigger downstream workflows automatically.
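
A minimal Kafka sink sketch; the topic name and broker address are illustrative. Because the anomaly view is updated in place rather than append-only, this sketch forces each change to be emitted as an append-only JSON record:

CREATE SINK ueba_alerts FROM ueba_anomalies
WITH (
    connector = 'kafka',
    topic = 'ueba-alerts',
    properties.bootstrap.server = 'kafka:9092'
)
FORMAT PLAIN ENCODE JSON (
    force_append_only = 'true'
);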

How does this compare to ML-based UEBA?

Being direct about the tradeoffs helps you make the right architectural choice.

Where SQL baselines win:

  • Immediate value. A SQL baseline view starts producing results the same day you deploy it, once you have a few days of activity data.
  • Interpretability. Every alert comes with a plain deviation ratio and raw counts. Analysts understand what triggered the alert without needing data science expertise.
  • Easy tuning. Adjusting a false positive rate means changing a threshold constant in a SQL view. Redeploying takes seconds.
  • No infrastructure tax. No model serving layer, no feature store, no training pipelines. The entire system is SQL views on top of a streaming database.

Where ML-based UEBA wins:

  • Complex behavioral patterns. A trained model can learn that a specific sequence of resource accesses, combined with timing patterns and peer-group comparisons, is suspicious even when no individual metric is obviously anomalous. SQL baselines cannot capture this.
  • Feature-rich anomaly detection. Models can incorporate dozens of behavioral dimensions simultaneously. A SQL baseline is effectively a one-dimensional or two-dimensional comparison. Adding more dimensions makes the SQL increasingly complex and harder to maintain.
  • Adaptation to gradual drift. A user who slowly shifts their working hours over six months will not trigger a SQL threshold, but a model with a recency-weighted feature window can detect the drift. SQL baselines require manual threshold updates to account for legitimate behavioral changes.

The practical recommendation: start with SQL baselines to get detection coverage quickly, catch the clear-cut volumetric threats, and demonstrate value to stakeholders. Use ML models for the hard cases where volume-based heuristics produce too many false positives or miss sophisticated attackers who keep their volume low.

How does RisingWave compare to Apache Flink?

Apache Flink is the most common alternative for streaming analytics at this layer. It is capable and battle-tested, so the comparison is worth making directly.

Flink's DataStream API is JVM-centric: building a behavioral baseline pipeline means writing stateful operators in Java or Scala, managing keyed state, handling watermarks, and wiring up source and sink connectors. A team without JVM expertise faces a significant ramp-up before shipping anything.

Flink SQL exists and handles some of this, but complex stateful queries that maintain rolling aggregates per user across multiple time windows often require custom user-defined functions or fall back to the DataStream API for performance reasons.

RisingWave's model is SQL-native from the beginning. Every baseline and anomaly detection query shown above is standard SQL that runs as an incrementally maintained materialized view. There is no application code to maintain, no deployment artifact to build, and no operator topology to reason about. A security analyst with SQL skills can read, understand, and modify the detection logic directly.

For teams that already have deep Flink expertise and existing Flink infrastructure, staying on Flink makes sense. For teams starting from scratch, or for security teams that want to own their detection logic without depending on engineering support, RisingWave's SQL-first model significantly reduces the barrier to entry.

Frequently Asked Questions

Can SQL baselines detect insider threats, or only external attacks?

SQL baselines are well-suited for detecting insider threats, specifically the kind that involve abnormal access volume or off-hours activity. An employee exfiltrating data by querying far more records than usual will appear in the deviation ratio view. However, a sophisticated insider who stays within their normal volume thresholds while slowly moving sensitive data will not be caught by rate-based baselines alone. Supplementing with access-pattern analysis or peer-group comparisons (which require more complex SQL or ML) addresses this gap.

How much historical data do I need before the baselines are reliable?

A minimum of 7 days of activity data is enough to start getting useful signals, though baselines based on fewer than 14 days will have more variance. For users who only access systems a few times per week, a 30-day window produces more stable averages. The deviation ratio threshold can be set conservatively (10x) while the baseline is still accumulating data, then tightened as confidence grows.

How does RisingWave handle users who have no baseline yet?

The anomaly detection view uses a JOIN, so users with no history in the baseline view will not appear in results. You can handle new users separately by building a second view that flags any activity from users with fewer than 7 days of history, treating all of their activity as requiring review until a baseline is established.
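
One way to sketch that second view: derive each user's first-seen timestamp, then route every event that falls within the user's first 7 days into a review queue. The view names are illustrative:

CREATE MATERIALIZED VIEW user_first_seen AS
SELECT user_id, MIN(event_time) AS first_seen
FROM user_activity
GROUP BY user_id;

CREATE MATERIALIZED VIEW new_user_review_queue AS
SELECT a.user_id, a.event_time, a.action_type, a.resource
FROM user_activity a
JOIN user_first_seen f ON a.user_id = f.user_id
WHERE a.event_time - f.first_seen < INTERVAL '7 days';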

What is the latency between an anomalous event and an alert?

RisingWave processes events as they arrive from Kafka with sub-second latency. The materialized views are updated incrementally, so the anomaly detection view reflects new activity within seconds. End-to-end latency from event ingestion to an alert appearing in the output sink is typically under 10 seconds, depending on Kafka consumer lag and downstream sink configuration.


SQL-based UEBA will not replace a mature ML security platform. But for teams that need behavioral anomaly detection running in production this week, rather than after a three-month ML pipeline project, it is a practical and immediately valuable starting point.

RisingWave makes this accessible to any team with SQL skills. Define your activity source, write your baseline views, and your detection logic is live. Start with volume and rate baselines, cover the obvious threats, and build from there.

Get started with RisingWave or explore the documentation to learn more about streaming materialized views and window functions.
