Building a Real-Time Alerting System on Streaming Data

Your monitoring dashboard turns red at 3 AM. By the time your on-call engineer wakes up, checks the dashboard, and correlates logs across three services, the outage has been running for 47 minutes. Customers have noticed. Twitter has noticed.

The core problem is not that your systems fail. They always will. The problem is the gap between when a failure happens and when someone knows about it. Traditional alerting stacks bolt together a metrics collector, a time-series database, a rules engine, and a notification service. Each layer adds latency, each integration adds fragility, and each tool requires its own configuration language.

A streaming database like RisingWave collapses this stack into SQL. You define alert conditions as materialized views, and those views update continuously as new data arrives. When a metric crosses a threshold or deviates from its baseline, the alert row appears in milliseconds. Sink that view to Kafka, and a lightweight consumer delivers it to Slack, PagerDuty, or any webhook endpoint.

In this guide, you will build a complete real-time alerting system using RisingWave. You will create streaming sources for infrastructure metrics and application events, define threshold-based alerts for known failure modes, build anomaly-based alerts using statistical deviation, and route everything to your notification channels. All in SQL. All verified against RisingWave 2.8.0.

Why Streaming SQL Replaces Traditional Alerting Pipelines

A typical alerting pipeline involves at least four moving parts: Prometheus scrapes metrics every 15-30 seconds, stores them in a time-series database, evaluates rules on a fixed schedule, and sends notifications through Alertmanager. That is four processes, four configuration formats, and a minimum alerting latency of 30-60 seconds even for critical issues.

Streaming SQL reduces this to two concepts: sources (where data comes in) and materialized views (where alert logic lives). RisingWave evaluates your alert conditions incrementally, the moment new data arrives, not on a polling schedule. This means:

  • Sub-second alert latency for threshold violations, because materialized views update as each event is processed
  • One language for all alert logic, from simple threshold checks to complex statistical anomaly detection
  • Composable alerts, where one materialized view can feed into another, enabling multi-stage alerting pipelines
  • Built-in state management, so windowed aggregations and baselines are maintained automatically without external storage
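The difference between polling and incremental evaluation is easy to see in miniature. Here is a short Python sketch (illustrative only, not RisingWave code) of an average that is updated once per arriving event, so the aggregate is always current without ever rescanning the window:

```python
class RunningAverage:
    """Incrementally maintained average, updated once per event.

    A polling system rescans the whole window on a schedule; an
    incremental system does O(1) work per arriving event, and the
    aggregate is correct the instant the event is processed.
    """

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # always up to date, no rescan


avg = RunningAverage()
for cpu in [92.5, 88.3, 95.1]:
    current = avg.update(cpu)
# current reflects all three samples the moment the last one arrives
```

This is, in spirit, what a materialized view does for every aggregate in your alert queries.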

Traditional vs. Streaming SQL Alerting

Aspect                   | Traditional (Prometheus + Alertmanager) | Streaming SQL (RisingWave)
-------------------------+-----------------------------------------+---------------------------
Alert evaluation         | Polling (every 15-60s)                  | Continuous (sub-second)
Rule language            | PromQL / custom DSL                     | Standard SQL
State management         | External TSDB                           | Built-in
Multi-signal correlation | Requires custom glue code               | SQL JOINs across streams
Backfill and replay      | Not supported                           | Replay from Kafka offsets
Operational overhead     | 4+ services to manage                   | Single database

Setting Up Streaming Sources for Metrics and Events

Every alerting system starts with data ingestion. In production, your metrics and application events typically flow through Apache Kafka. RisingWave connects directly to Kafka topics using CREATE SOURCE.

Infrastructure Metrics Source

This source ingests server-level metrics like CPU usage, memory utilization, and disk I/O:

CREATE SOURCE server_metrics (
    metric_id VARCHAR,
    host_id VARCHAR,
    service_name VARCHAR,
    metric_name VARCHAR,
    metric_value DOUBLE PRECISION,
    ts TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'server_metrics',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;

Application Events Source

This source captures HTTP requests, error logs, and business events from your application layer:

CREATE SOURCE application_events (
    event_id VARCHAR,
    service_name VARCHAR,
    event_type VARCHAR,
    status_code INT,
    response_time_ms DOUBLE PRECISION,
    user_id VARCHAR,
    endpoint VARCHAR,
    ts TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'application_events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
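Upstream, something must publish JSON records matching these schemas. As a rough sketch, here is how a producer might build a server_metrics record in Python. The make_metric helper is ours, not part of RisingWave, and the commented publish step assumes the kafka-python package:

```python
import json
from datetime import datetime, timezone

def make_metric(metric_id, host_id, service_name, metric_name, metric_value):
    """Build one JSON record matching the server_metrics source schema above."""
    return json.dumps({
        "metric_id": metric_id,
        "host_id": host_id,
        "service_name": service_name,
        "metric_name": metric_name,
        "metric_value": metric_value,
        # RisingWave parses ISO-8601 strings into TIMESTAMPTZ
        "ts": datetime.now(timezone.utc).isoformat(),
    })

# With kafka-python (an assumption, not prescribed by this guide):
#   producer = KafkaProducer(bootstrap_servers="kafka:9092")
#   producer.send("server_metrics",
#                 make_metric("m001", "host-1", "api-gateway", "cpu_usage", 92.5).encode())
print(make_metric("m001", "host-1", "api-gateway", "cpu_usage", 92.5))
```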

For development and testing without Kafka, you can use CREATE TABLE instead of CREATE SOURCE and insert data directly. All the SQL examples below work with either approach. Here is the table-based equivalent:

CREATE TABLE server_metrics (
    metric_id VARCHAR,
    host_id VARCHAR,
    service_name VARCHAR,
    metric_name VARCHAR,
    metric_value DOUBLE PRECISION,
    ts TIMESTAMPTZ
);

CREATE TABLE application_events (
    event_id VARCHAR,
    service_name VARCHAR,
    event_type VARCHAR,
    status_code INT,
    response_time_ms DOUBLE PRECISION,
    user_id VARCHAR,
    endpoint VARCHAR,
    ts TIMESTAMPTZ
);

Building Threshold-Based Alerts with Materialized Views

Threshold-based alerts are the workhorses of any alerting system. They fire when a metric crosses a known boundary: CPU above 85%, error rate above 5%, disk usage above 90%. In RisingWave, each alert rule is a materialized view with a HAVING or WHERE clause that encodes the threshold.

Alert 1: High CPU Usage

This materialized view computes average CPU usage per host in 1-minute tumbling windows and surfaces only the windows where the average exceeds 85%:

CREATE MATERIALIZED VIEW high_cpu_alerts AS
SELECT
    host_id,
    service_name,
    AVG(metric_value) AS avg_cpu,
    MAX(metric_value) AS peak_cpu,
    COUNT(*) AS sample_count,
    window_start,
    window_end
FROM TUMBLE(
    server_metrics,
    ts,
    INTERVAL '1' MINUTE
)
WHERE metric_name = 'cpu_usage'
GROUP BY host_id, service_name, window_start, window_end
HAVING AVG(metric_value) > 85;

The TUMBLE function creates fixed, non-overlapping 1-minute windows. RisingWave evaluates this query incrementally: as each new cpu_usage metric arrives, it updates the running average for the current window. The moment the average crosses 85%, a new row appears in high_cpu_alerts.
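The window assignment itself is simple arithmetic: each timestamp belongs to exactly one fixed bucket. A Python sketch of the mapping (our illustration of the concept, not RisingWave's internals):

```python
from datetime import datetime, timedelta, timezone

def tumble_window(ts: datetime, size: timedelta):
    """Return (window_start, window_end) of the fixed window containing ts."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    n = (ts - epoch) // size            # which window number ts falls into
    start = epoch + n * size
    return start, start + size

# An event at 10:00:25 lands in the [10:00:00, 10:01:00) window
ts = datetime(2026, 4, 1, 10, 0, 25, tzinfo=timezone.utc)
start, end = tumble_window(ts, timedelta(minutes=1))
```

Because the windows are non-overlapping, every event contributes to exactly one row per group in the view.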

Let us verify with sample data (tested on RisingWave 2.8.0):

INSERT INTO server_metrics VALUES
    ('m001', 'host-1', 'api-gateway', 'cpu_usage', 92.5, '2026-04-01 10:00:05+00'),
    ('m002', 'host-1', 'api-gateway', 'cpu_usage', 88.3, '2026-04-01 10:00:15+00'),
    ('m003', 'host-1', 'api-gateway', 'cpu_usage', 95.1, '2026-04-01 10:00:25+00'),
    ('m004', 'host-1', 'api-gateway', 'cpu_usage', 91.0, '2026-04-01 10:00:35+00'),
    ('m005', 'host-2', 'payments',    'cpu_usage', 45.2, '2026-04-01 10:00:10+00'),
    ('m006', 'host-2', 'payments',    'cpu_usage', 42.8, '2026-04-01 10:00:20+00'),
    ('m007', 'host-3', 'search',      'cpu_usage', 97.2, '2026-04-01 10:00:08+00'),
    ('m008', 'host-3', 'search',      'cpu_usage', 94.6, '2026-04-01 10:00:18+00'),
    ('m009', 'host-3', 'search',      'cpu_usage', 89.1, '2026-04-01 10:00:28+00');
SELECT * FROM high_cpu_alerts ORDER BY avg_cpu DESC;
 host_id | service_name |      avg_cpu      | peak_cpu | sample_count |       window_start        |        window_end
---------+--------------+-------------------+----------+--------------+---------------------------+---------------------------
 host-3  | search       | 93.63333333333333 |     97.2 |            3 | 2026-04-01 10:00:00+00:00 | 2026-04-01 10:01:00+00:00
 host-1  | api-gateway  |            91.725 |     95.1 |            4 | 2026-04-01 10:00:00+00:00 | 2026-04-01 10:01:00+00:00

Notice that host-2 (payments) does not appear because its average CPU was 44%, well below the 85% threshold. The alert system only surfaces the hosts that need attention.
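You can sanity-check the expected output by redoing the same aggregation in plain Python over the nine inserted rows:

```python
from collections import defaultdict

# (host_id, service_name) -> cpu_usage samples, copied from the INSERT above
samples = defaultdict(list)
rows = [
    ("host-1", "api-gateway", 92.5), ("host-1", "api-gateway", 88.3),
    ("host-1", "api-gateway", 95.1), ("host-1", "api-gateway", 91.0),
    ("host-2", "payments", 45.2), ("host-2", "payments", 42.8),
    ("host-3", "search", 97.2), ("host-3", "search", 94.6),
    ("host-3", "search", 89.1),
]
for host, service, value in rows:
    samples[(host, service)].append(value)

# Same predicate as HAVING AVG(metric_value) > 85
alerts = {
    key: sum(vals) / len(vals)
    for key, vals in samples.items()
    if sum(vals) / len(vals) > 85
}
# host-1 averages 91.725, host-3 about 93.63; host-2 (44.0) is filtered out
```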

Alert 2: High Error Rate

This alert monitors HTTP 5xx error rates per service and endpoint over 5-minute windows:

CREATE MATERIALIZED VIEW error_rate_alerts AS
SELECT
    service_name,
    endpoint,
    COUNT(*) FILTER (WHERE status_code >= 500) AS error_count,
    COUNT(*) AS total_requests,
    COUNT(*) FILTER (WHERE status_code >= 500)::DOUBLE PRECISION
        / COUNT(*)::DOUBLE PRECISION * 100 AS error_rate_pct,
    window_start,
    window_end
FROM TUMBLE(
    application_events,
    ts,
    INTERVAL '5' MINUTE
)
GROUP BY service_name, endpoint, window_start, window_end
HAVING COUNT(*) FILTER (WHERE status_code >= 500)::DOUBLE PRECISION
    / COUNT(*)::DOUBLE PRECISION * 100 > 5.0;

The FILTER (WHERE status_code >= 500) clause is a standard SQL conditional aggregate that counts only the error responses. The HAVING clause fires the alert when the error rate exceeds 5%.

Insert test data to verify:

INSERT INTO application_events VALUES
    ('e001', 'checkout', 'http_request', 200, 120.5, 'u100', '/api/checkout', '2026-04-01 10:00:01+00'),
    ('e002', 'checkout', 'http_request', 200, 135.2, 'u101', '/api/checkout', '2026-04-01 10:00:05+00'),
    ('e003', 'checkout', 'http_request', 500, 5200.0, 'u102', '/api/checkout', '2026-04-01 10:00:10+00'),
    ('e004', 'checkout', 'http_request', 500, 4800.3, 'u103', '/api/checkout', '2026-04-01 10:00:15+00'),
    ('e005', 'checkout', 'http_request', 502, 6100.0, 'u104', '/api/checkout', '2026-04-01 10:00:20+00'),
    ('e006', 'checkout', 'http_request', 200, 145.0, 'u105', '/api/checkout', '2026-04-01 10:00:25+00'),
    ('e007', 'checkout', 'http_request', 500, 4900.5, 'u106', '/api/checkout', '2026-04-01 10:00:30+00'),
    ('e008', 'checkout', 'http_request', 200, 110.0, 'u107', '/api/checkout', '2026-04-01 10:00:35+00'),
    ('e009', 'checkout', 'http_request', 200, 128.7, 'u108', '/api/checkout', '2026-04-01 10:00:40+00'),
    ('e010', 'checkout', 'http_request', 200, 115.3, 'u109', '/api/checkout', '2026-04-01 10:00:45+00'),
    ('e011', 'search',   'http_request', 200, 45.2,  'u110', '/api/search',   '2026-04-01 10:00:02+00'),
    ('e012', 'search',   'http_request', 200, 52.1,  'u111', '/api/search',   '2026-04-01 10:00:12+00'),
    ('e013', 'search',   'http_request', 200, 48.6,  'u112', '/api/search',   '2026-04-01 10:00:22+00');
SELECT service_name, endpoint, error_count, total_requests,
       ROUND(error_rate_pct::numeric, 1) AS error_rate_pct,
       window_start, window_end
FROM error_rate_alerts;
 service_name |   endpoint    | error_count | total_requests | error_rate_pct |       window_start        |        window_end
--------------+---------------+-------------+----------------+----------------+---------------------------+---------------------------
 checkout     | /api/checkout |           4 |             10 |           40.0 | 2026-04-01 10:00:00+00:00 | 2026-04-01 10:05:00+00:00

The checkout service fired an alert: 4 out of 10 requests returned 5xx errors, a 40% error rate. The search service did not fire because all its requests returned 200.
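The same arithmetic in plain Python confirms the figure, with a generator expression standing in for the SQL FILTER clause:

```python
# status_code values from the ten checkout rows inserted above
status_codes = [200, 200, 500, 500, 502, 200, 500, 200, 200, 200]

errors = sum(1 for code in status_codes if code >= 500)   # FILTER (WHERE status_code >= 500)
error_rate_pct = errors / len(status_codes) * 100
# errors == 4, error_rate_pct == 40.0, well above the 5% HAVING threshold
```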

Building Anomaly-Based Alerts with Statistical Deviation

Threshold-based alerts work well when you know what "bad" looks like. But what about situations where the boundary is not fixed? A checkout endpoint that normally responds in 120ms suddenly averaging 800ms is a problem, even though 800ms might be perfectly normal for a report-generation endpoint.

Anomaly-based alerts compare current behavior against a learned baseline. In RisingWave, you can build this with two materialized views: one that computes the baseline statistics, and another that compares real-time values against that baseline.

Step 1: Compute the Baseline

This materialized view calculates average and standard deviation of response times over 15-minute windows, establishing what "normal" looks like for each service and endpoint:

CREATE MATERIALIZED VIEW response_time_baseline AS
SELECT
    service_name,
    endpoint,
    AVG(response_time_ms) AS avg_response_time,
    STDDEV_POP(response_time_ms) AS stddev_response_time,
    COUNT(*) AS sample_count,
    window_start,
    window_end
FROM TUMBLE(
    application_events,
    ts,
    INTERVAL '15' MINUTE
)
GROUP BY service_name, endpoint, window_start, window_end;

Step 2: Detect Deviations

This materialized view compares 1-minute averages against the 15-minute baseline. When the current average exceeds the baseline mean by more than 3 standard deviations, it surfaces as an anomaly:

CREATE MATERIALIZED VIEW latency_spike_alerts AS
SELECT
    recent.service_name,
    recent.endpoint,
    recent.avg_response_time AS current_avg_ms,
    baseline.avg_response_time AS baseline_avg_ms,
    baseline.stddev_response_time AS baseline_stddev,
    recent.window_start,
    recent.window_end
FROM (
    SELECT
        service_name,
        endpoint,
        AVG(response_time_ms) AS avg_response_time,
        window_start,
        window_end
    FROM TUMBLE(
        application_events,
        ts,
        INTERVAL '1' MINUTE
    )
    GROUP BY service_name, endpoint, window_start, window_end
) recent
JOIN response_time_baseline baseline
ON recent.service_name = baseline.service_name
   AND recent.endpoint = baseline.endpoint
   AND recent.window_start >= baseline.window_start
   AND recent.window_start < baseline.window_end
WHERE recent.avg_response_time > baseline.avg_response_time
    + 3 * baseline.stddev_response_time
  AND baseline.stddev_response_time > 0;

The 3-sigma threshold follows the standard statistical rule: in a normal distribution, 99.7% of values fall within 3 standard deviations of the mean. Anything outside that range is statistically unusual. The stddev_response_time > 0 guard filters out degenerate baselines in which every sample is identical; without it, any deviation at all, however tiny, would exceed the mean plus zero standard deviations and register as a spike.

This pattern is powerful because it adapts automatically. A search endpoint with a 50ms baseline and 3ms standard deviation would alert at around 59ms. A batch processing endpoint with a 2000ms baseline and 500ms standard deviation would not alert until it exceeds 3500ms. No manual threshold tuning required.
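The detection rule reduces to a few lines. A Python sketch of the same logic, using the population standard deviation to match STDDEV_POP:

```python
import statistics

def is_latency_spike(current_avg_ms, baseline_samples, sigmas=3):
    """Flag current_avg_ms if it exceeds the baseline mean by `sigmas` stddevs."""
    mean = statistics.fmean(baseline_samples)
    stddev = statistics.pstdev(baseline_samples)  # population stddev, like STDDEV_POP
    if stddev == 0:
        return False  # degenerate baseline: all samples identical (same guard as the view)
    return current_avg_ms > mean + sigmas * stddev

baseline = [48.0, 50.0, 52.0]       # a quiet search endpoint
# mean = 50.0, pstdev is about 1.63, so the alert line sits near 54.9 ms:
# is_latency_spike(120.0, baseline) fires; is_latency_spike(51.0, baseline) does not
```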

Creating a Unified Alert Stream

With multiple alert materialized views in place, you need a single stream that aggregates all alerts for downstream consumption. A UNION ALL materialized view brings everything together with a consistent schema:

CREATE MATERIALIZED VIEW unified_alerts AS
SELECT
    'high_cpu' AS alert_type,
    'WARNING' AS severity,
    host_id AS source,
    service_name,
    'CPU usage at ' || ROUND(avg_cpu::numeric, 1)
        || '% (threshold: 85%)' AS alert_message,
    window_start AS detected_at
FROM high_cpu_alerts

UNION ALL

SELECT
    'high_error_rate' AS alert_type,
    'CRITICAL' AS severity,
    endpoint AS source,
    service_name,
    'Error rate at ' || ROUND(error_rate_pct::numeric, 1)
        || '% (' || error_count || '/' || total_requests
        || ' requests failed)' AS alert_message,
    window_start AS detected_at
FROM error_rate_alerts;

Query the unified view:

SELECT alert_type, severity, source, service_name, alert_message, detected_at
FROM unified_alerts
ORDER BY severity, detected_at;
   alert_type    | severity |    source     | service_name |                alert_message                |        detected_at
-----------------+----------+---------------+--------------+---------------------------------------------+---------------------------
 high_error_rate | CRITICAL | /api/checkout | checkout     | Error rate at 40.0% (4/10 requests failed)  | 2026-04-01 10:00:00+00:00
 high_cpu        | WARNING  | host-3        | search       | CPU usage at 93.6% (threshold: 85%)         | 2026-04-01 10:00:00+00:00
 high_cpu        | WARNING  | host-1        | api-gateway  | CPU usage at 91.7% (threshold: 85%)         | 2026-04-01 10:00:00+00:00

CRITICAL alerts sort first, giving your on-call engineer immediate visibility into the most urgent problems.

Sinking Alerts to Slack and PagerDuty

With unified_alerts producing a clean, structured alert stream, the next step is routing those alerts to your notification channels. RisingWave's Kafka sink connector pushes alert rows to a Kafka topic, and a lightweight downstream consumer delivers them to Slack, PagerDuty, or any HTTP webhook.

Step 1: Create a Kafka Sink

CREATE SINK alerts_to_kafka FROM unified_alerts
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'kafka:9092',
    topic = 'alerts'
)
FORMAT PLAIN ENCODE JSON;

Every time a new alert row appears in unified_alerts, RisingWave pushes a JSON message to the alerts Kafka topic. The message looks like this:

{
  "alert_type": "high_error_rate",
  "severity": "CRITICAL",
  "source": "/api/checkout",
  "service_name": "checkout",
  "alert_message": "Error rate at 40% (4/10 requests failed)",
  "detected_at": "2026-04-01T10:00:00Z"
}

Step 2: Route to Slack with a Kafka Consumer

A small Python consumer reads from the alerts topic and posts to a Slack webhook. This is the only custom code in the entire pipeline:

import json
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'alerts',
    bootstrap_servers='kafka:9092',
    group_id='slack-alerter',  # commit offsets so a restarted consumer resumes where it left off
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

SEVERITY_EMOJI = {
    "CRITICAL": ":red_circle:",
    "WARNING": ":warning:",
    "INFO": ":information_source:"
}

for message in consumer:
    alert = message.value
    emoji = SEVERITY_EMOJI.get(alert["severity"], ":bell:")
    slack_payload = {
        "text": f"{emoji} *{alert['alert_type'].upper()}* - {alert['severity']}\n"
                f"Service: `{alert['service_name']}` | Source: `{alert['source']}`\n"
                f"{alert['alert_message']}\n"
                f"Detected: {alert['detected_at']}"
    }
    requests.post(SLACK_WEBHOOK_URL, json=slack_payload)
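The consumer above posts each alert once and ignores the HTTP response, which is fine for a demo. For production delivery you may want a small retry wrapper; this is a sketch of ours, not a RisingWave or Slack API feature:

```python
import time

def post_with_retry(send, payload, max_attempts=3, base_delay=0.5):
    """Call send(payload) with exponential backoff between attempts.

    `send` should raise on failure, e.g. a wrapper around requests.post
    that calls response.raise_for_status().
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller log or dead-letter the alert
            time.sleep(base_delay * 2 ** attempt)

# In the consumer loop, for example:
#   post_with_retry(
#       lambda p: requests.post(SLACK_WEBHOOK_URL, json=p).raise_for_status(),
#       slack_payload)
```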

Step 3: Route to PagerDuty for On-Call Escalation

For critical alerts that need on-call escalation, route to PagerDuty's Events API v2:

import json
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'alerts',
    bootstrap_servers='kafka:9092',
    group_id='pagerduty-alerter',  # separate group so Slack and PagerDuty each see every alert
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

PAGERDUTY_ROUTING_KEY = "YOUR_PAGERDUTY_INTEGRATION_KEY"

SEVERITY_MAP = {
    "CRITICAL": "critical",
    "WARNING": "warning",
    "INFO": "info"
}

for message in consumer:
    alert = message.value
    if alert["severity"] != "CRITICAL":
        continue  # Only page on-call for critical alerts

    pd_payload = {
        "routing_key": PAGERDUTY_ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": f"[{alert['service_name']}] {alert['alert_message']}",
            "severity": SEVERITY_MAP.get(alert["severity"], "warning"),
            "source": alert["source"],
            "component": alert["service_name"],
            "group": alert["alert_type"],
            "custom_details": alert
        }
    }
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json=pd_payload
    )

This separation of concerns is intentional. RisingWave handles the hard part: continuous evaluation of alert conditions over streaming data with exactly the right windowing and aggregation semantics. The Kafka consumers are stateless and trivial to deploy, scale, and replace.

Architecture Overview

graph LR
    A[Kafka: server_metrics] --> B[RisingWave]
    C[Kafka: application_events] --> B
    B --> D[high_cpu_alerts MV]
    B --> E[error_rate_alerts MV]
    B --> F[latency_spike_alerts MV]
    D --> G[unified_alerts MV]
    E --> G
    F --> G
    G --> H[Kafka Sink: alerts topic]
    H --> I[Slack Consumer]
    H --> J[PagerDuty Consumer]

What Is a Real-Time Alerting System on Streaming Data?

A real-time alerting system on streaming data is a pipeline that continuously evaluates conditions over incoming event streams and triggers notifications the moment those conditions are met. Unlike batch-based alerting, which runs queries on a schedule (every minute, every 5 minutes), streaming alerting processes each event as it arrives, reducing detection latency from minutes to milliseconds. RisingWave implements this pattern using materialized views: SQL queries that are maintained incrementally and always reflect the latest state of the data.

How Does Anomaly Detection Work Without Predefined Thresholds?

Anomaly detection without fixed thresholds uses statistical baselines computed from recent data. The system calculates the mean and standard deviation of a metric over a sliding or tumbling window, then flags new data points that deviate significantly (typically 3 or more standard deviations) from that baseline. In RisingWave, you implement this with two materialized views: one that maintains the baseline statistics and another that joins real-time values against the baseline and filters for outliers. This approach adapts automatically as normal operating conditions change.

When Should You Use Threshold Alerts vs. Anomaly Alerts?

Use threshold-based alerts when you have well-understood operational boundaries, such as "CPU must stay below 85%" or "error rate must stay below 5%." These are simple, predictable, and easy to debug. Use anomaly-based alerts for metrics where "normal" varies over time or across different contexts, such as response latency that differs by endpoint, time of day, or traffic volume. In practice, most production systems use both: thresholds for known failure modes and anomalies for catching unknown unknowns.

Can RisingWave Replace Prometheus for Alerting?

RisingWave can replace the alerting rules layer of a Prometheus-based stack, particularly when you need sub-second alert latency, complex multi-signal correlation, or SQL-based rule management. Prometheus remains a strong choice for scraping and storing infrastructure metrics. A common hybrid architecture uses Prometheus for metrics collection, publishes those metrics to Kafka via remote write, and lets RisingWave handle the alerting logic via streaming SQL. This gives you the best of both worlds: Prometheus's battle-tested collection and RisingWave's expressive, low-latency alerting.

Conclusion

Building a real-time alerting system on streaming data does not require a sprawling microservices architecture. With RisingWave, the entire alert logic lives in SQL:

  • Threshold alerts use materialized views with HAVING clauses to catch known failure modes like high CPU or elevated error rates
  • Anomaly alerts use paired materialized views, one for baseline statistics and one for deviation detection, to catch unexpected behavior without manual threshold tuning
  • A unified alert stream combines all alert types into a single, consistently structured materialized view
  • Kafka sinks push alerts to a topic where lightweight consumers route them to Slack, PagerDuty, or any notification endpoint
  • Incremental computation means alerts fire in milliseconds, not on a polling schedule

The entire pipeline from metric ingestion to Slack notification can be built and modified with SQL. No custom stream processing code, no DAG definitions, no separate rules engine.


Ready to build your own alerting system? Try RisingWave Cloud free - no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
