Every production outage starts the same way: someone checks a dashboard too late, discovers the error rate crossed a threshold 20 minutes ago, and scrambles to respond. Traditional SLO monitoring relies on periodic batch queries against log stores or time-series databases, creating blind spots between query intervals. When your error budget burns through in minutes, a 5-minute polling interval is the difference between a near-miss and a customer-facing incident.
Streaming SQL changes this equation. Instead of polling metrics on a schedule, you define your Service Level Indicators (SLIs) as continuous queries that update the moment new data arrives. Materialized views maintain running calculations of error budgets, burn rates, and SLO compliance percentages, all in real time. When a threshold is breached, the system knows within seconds, not minutes.
In this guide, you will build a complete SLO monitoring pipeline using RisingWave, a streaming database that lets you write standard SQL to process event streams continuously. By the end, you will have real-time SLI tracking, error budget calculation, burn rate alerting, and an SLO compliance dashboard, all defined in SQL.
What Are SLOs, SLIs, and Error Budgets?
Before writing any SQL, let's establish clear definitions for the core SRE concepts this pipeline will track.
A Service Level Indicator (SLI) is a quantitative measure of a specific aspect of service quality. Common SLIs include request latency (percentage of requests served under 200ms), availability (percentage of successful responses), and throughput (requests per second). SLIs are the raw measurements your pipeline ingests.
A Service Level Objective (SLO) is a target value or range for an SLI. For example, "99.9% of requests should complete in under 200ms" is an SLO. SLOs define what "good enough" looks like for your service.
An error budget is the complement of your SLO target. If your SLO is 99.9% availability, your error budget is 0.1%, which translates to roughly 43 minutes of allowed downtime per 30-day window. Error budgets give engineering teams a concrete, measurable balance between shipping features and maintaining reliability.
Burn rate measures how fast you are consuming your error budget relative to the SLO window. A burn rate of 1.0 means you are consuming the budget at exactly the expected rate. A burn rate of 10.0 means you are burning through your error budget 10 times faster than sustainable, and you will exhaust it in 3 days instead of 30.
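To make the arithmetic concrete, here is a throwaway query (runnable against RisingWave or any PostgreSQL-compatible database) that computes the allowed downtime for a 99.9% target over 30 days and the exhaustion time implied by a burn rate of 10.0; both inputs are illustrative:
-- Illustrative arithmetic: allowed downtime and time to budget exhaustion
SELECT
(100.0 - 99.9) / 100.0 * 30 * 24 * 60 AS allowed_downtime_minutes, -- ~43.2 minutes
30.0 / 10.0 AS days_to_exhaustion_at_burn_rate_10; -- 3 days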
Why Batch Monitoring Falls Short
Most SLO monitoring systems work by running queries against Prometheus, Datadog, or a data warehouse on a fixed schedule. This approach has three problems:
- Detection delay: If you poll every 5 minutes and an incident starts at minute 1, you will not know until minute 5. For services with tight error budgets, this delay can consume a significant portion of the budget before anyone responds.
- Aggregation loss: Batch queries typically compute averages over the polling interval, which can hide short spikes. A 30-second latency spike inside a 5-minute window gets averaged away.
- Scaling cost: Running complex window queries against large time-series stores every few minutes puts significant load on your monitoring infrastructure, especially when you track dozens of SLOs across hundreds of services.
Streaming SQL solves all three: the pipeline processes each event as it arrives, maintains incremental aggregations that preserve spike visibility, and offloads the continuous computation to a system designed for exactly this workload.
How Does the Streaming SLO Pipeline Work?
The architecture has four layers: ingestion, SLI computation, error budget tracking, and alerting. Here is how the data flows through each layer.
Layer 1 - Ingestion: Application services emit request-level events (timestamp, service name, endpoint, response status, latency) to Apache Kafka. Each event represents a single request/response pair.
Layer 2 - SLI Computation: RisingWave ingests the Kafka topic and computes SLIs using materialized views with tumbling windows. One view calculates availability (ratio of successful to total requests), another calculates latency compliance (ratio of fast requests to total requests).
Layer 3 - Error Budget Tracking: A second layer of materialized views joins the SLI results with SLO target definitions stored in a reference table. These views calculate remaining error budget, current burn rate, and projected budget exhaustion time.
Layer 4 - Alerting: RisingWave sinks alert events to a downstream Kafka topic or PostgreSQL table, which an alerting service (PagerDuty, OpsGenie, or a custom webhook consumer) reads from. Multi-window burn rate alerts use the Google SRE methodology: fast burn (2% budget in 1 hour) and slow burn (5% budget in 6 hours).
This layered approach means each concern is isolated in its own SQL definition, making the pipeline easy to understand, test, and modify.
How Do You Set Up the Data Ingestion Layer?
The foundation of the pipeline is a Kafka topic carrying request-level telemetry from your services. Each message contains the essential fields for SLI calculation.
Define the Kafka Source
First, create a source in RisingWave that connects to your Kafka topic:
-- Connect to the request telemetry Kafka topic
CREATE SOURCE request_events (
event_id VARCHAR,
service_name VARCHAR,
endpoint VARCHAR,
http_status INT,
latency_ms INT,
event_time TIMESTAMPTZ
) WITH (
connector = 'kafka',
topic = 'service.request.events',
properties.bootstrap.server = 'kafka:9092',
scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
This source definition tells RisingWave to continuously consume JSON messages from the service.request.events topic. The scan.startup.mode = 'latest' setting means it starts processing from the most recent offset, which is appropriate for a monitoring pipeline that cares about current state rather than historical replay.
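Before building anything on top of the source, it is worth confirming that events are actually arriving. One simple check is a throwaway materialized view that counts events per service (the view name here is arbitrary):
-- Sanity check: count incoming events per service
CREATE MATERIALIZED VIEW ingest_sanity_check AS
SELECT
service_name,
COUNT(*) AS events_seen,
MAX(event_time) AS latest_event_time
FROM request_events
GROUP BY service_name;
Query it a few times with SELECT * FROM ingest_sanity_check; and the counts should climb as traffic flows in. Drop the view once you are satisfied.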
Define the SLO Targets Table
SLO targets change infrequently, so store them in a regular table that your SRE team can update via SQL:
-- SLO target definitions, updated by the SRE team
CREATE TABLE slo_targets (
service_name VARCHAR,
slo_name VARCHAR,
sli_type VARCHAR, -- 'availability' or 'latency'
target_percentage DOUBLE, -- e.g., 99.9
latency_threshold_ms INT, -- only for latency SLOs, e.g., 200
window_days INT, -- SLO window, e.g., 30
PRIMARY KEY (service_name, slo_name)
);
-- Insert SLO definitions for your services
INSERT INTO slo_targets VALUES
('payment-service', 'availability', 'availability', 99.9, NULL, 30),
('payment-service', 'latency-p99', 'latency', 99.0, 200, 30),
('auth-service', 'availability', 'availability', 99.95, NULL, 30),
('auth-service', 'latency-p99', 'latency', 95.0, 500, 30),
('order-service', 'availability', 'availability', 99.9, NULL, 30),
('order-service', 'latency-p99', 'latency', 99.0, 300, 30);
Storing SLO targets in a table rather than hardcoding them into the pipeline lets you add new services or adjust targets without redeploying any streaming jobs. RisingWave's materialized views automatically reflect any changes to the underlying table.
How Do You Compute SLIs with Materialized Views?
With the data flowing in and targets defined, the next step is to compute SLIs continuously. Each materialized view performs an incremental aggregation that updates as new events arrive.
Availability SLI
The availability SLI measures the percentage of requests that returned a successful response (HTTP status < 500):
-- Compute availability SLI per service, per 1-minute window
CREATE MATERIALIZED VIEW sli_availability AS
SELECT
service_name,
window_start,
window_end,
COUNT(*) AS total_requests,
COUNT(*) FILTER (WHERE http_status < 500) AS good_requests,
(COUNT(*) FILTER (WHERE http_status < 500))::DOUBLE / COUNT(*)::DOUBLE * 100.0
AS availability_pct
FROM TUMBLE(
request_events,
event_time,
INTERVAL '1 minute'
)
GROUP BY service_name, window_start, window_end;
This materialized view uses a 1-minute tumbling window to group events. Every minute, RisingWave emits a new row per service with the availability percentage. The FILTER clause cleanly separates good requests (any status below 500) from server errors.
Latency SLI
The latency SLI measures the percentage of requests that completed within the acceptable threshold. Since different services have different latency targets, this view joins against the SLO targets table:
-- Compute latency SLI per service, per 1-minute window
CREATE MATERIALIZED VIEW sli_latency AS
SELECT
re.service_name,
re.window_start,
re.window_end,
st.latency_threshold_ms,
COUNT(*) AS total_requests,
COUNT(*) FILTER (WHERE re.latency_ms <= st.latency_threshold_ms)
AS fast_requests,
(COUNT(*) FILTER (WHERE re.latency_ms <= st.latency_threshold_ms))::DOUBLE
/ COUNT(*)::DOUBLE * 100.0
AS latency_compliance_pct
FROM TUMBLE(
request_events,
event_time,
INTERVAL '1 minute'
) AS re
JOIN slo_targets st
ON re.service_name = st.service_name
AND st.sli_type = 'latency'
GROUP BY
re.service_name,
re.window_start,
re.window_end,
st.latency_threshold_ms;
Because this view joins against the slo_targets table, adjusting a service's latency threshold is a single UPDATE statement. The materialized view recomputes automatically with the new threshold value.
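For example, relaxing the payment service's threshold from 200 ms to 250 ms (an illustrative change) looks like this:
-- Adjust a latency threshold; dependent materialized views update automatically
UPDATE slo_targets
SET latency_threshold_ms = 250
WHERE service_name = 'payment-service'
AND slo_name = 'latency-p99';
No redeployment or restart of the streaming jobs is required.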
Rolling 30-Day SLI
Individual 1-minute windows are useful for detecting spikes, but SLOs are typically measured over longer periods (7 or 30 days). Create a rolling window view that aggregates the minute-level SLIs:
-- Rolling 30-day availability across all 1-minute windows
CREATE MATERIALIZED VIEW sli_availability_30d AS
SELECT
service_name,
SUM(good_requests) AS total_good,
SUM(total_requests) AS total_all,
(SUM(good_requests))::DOUBLE / (SUM(total_requests))::DOUBLE * 100.0
AS rolling_availability_pct
FROM sli_availability
WHERE window_start >= NOW() - INTERVAL '30 days'
GROUP BY service_name;
This view is built on top of the sli_availability view, creating a layered computation. RisingWave handles this efficiently through incremental view maintenance, where only the changed portions of the result are recomputed rather than the entire 30-day window.
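If your team also tracks a shorter 7-day objective, the same layering applies. Here is a sketch of a companion view over the shorter window:
-- Rolling 7-day availability, layered on the same minute-level view
CREATE MATERIALIZED VIEW sli_availability_7d AS
SELECT
service_name,
SUM(good_requests) AS total_good,
SUM(total_requests) AS total_all,
(SUM(good_requests))::DOUBLE / (SUM(total_requests))::DOUBLE * 100.0
AS rolling_availability_pct
FROM sli_availability
WHERE window_start >= NOW() - INTERVAL '7 days'
GROUP BY service_name;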
How Do You Track Error Budgets and Burn Rates?
Error budget tracking turns raw SLI data into actionable reliability signals. The key question shifts from "what is our availability right now?" to "how much room do we have left before we breach our SLO?"
Error Budget Remaining
This materialized view calculates the remaining error budget for each service and SLO:
-- Calculate remaining error budget per service
CREATE MATERIALIZED VIEW error_budget_status AS
SELECT
st.service_name,
st.slo_name,
st.target_percentage,
a.rolling_availability_pct AS current_sli,
-- Error budget = allowed bad request percentage
(100.0 - st.target_percentage) AS total_error_budget_pct,
-- Consumed budget = how much of the budget has been used
CASE
WHEN a.rolling_availability_pct >= st.target_percentage
THEN (100.0 - a.rolling_availability_pct) / (100.0 - st.target_percentage) * 100.0
ELSE 100.0 + ((st.target_percentage - a.rolling_availability_pct)
/ (100.0 - st.target_percentage) * 100.0)
END AS budget_consumed_pct,
-- Remaining budget percentage
CASE
WHEN a.rolling_availability_pct >= st.target_percentage
THEN 100.0 - ((100.0 - a.rolling_availability_pct)
/ (100.0 - st.target_percentage) * 100.0)
ELSE 0.0
END AS budget_remaining_pct
FROM slo_targets st
JOIN sli_availability_30d a
ON st.service_name = a.service_name
WHERE st.sli_type = 'availability';
When budget_remaining_pct hits zero, the service has exhausted its error budget and is in SLO violation. When it is above zero, the team has room to ship features and take calculated risks.
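Because this SLO is request-based, it can also help to express the remaining budget as a count of failed requests the service can still absorb. Here is a sketch of such an ad-hoc query (not part of the core pipeline); a negative result means the budget is already spent:
-- Remaining budget expressed as additional failed requests the service can absorb
SELECT
st.service_name,
a.total_all * (100.0 - st.target_percentage) / 100.0
- (a.total_all - a.total_good) AS failed_requests_remaining
FROM slo_targets st
JOIN sli_availability_30d a
ON st.service_name = a.service_name
WHERE st.sli_type = 'availability';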
Burn Rate Calculation
Burn rate is the speed at which the error budget is being consumed. The Google SRE Workbook recommends using multi-window burn rate alerts to balance sensitivity with false positive rates. Here is a view that computes burn rates over short and long windows:
-- Compute burn rate over the last 1 hour and last 6 hours
CREATE MATERIALIZED VIEW burn_rate AS
SELECT
sa.service_name,
st.slo_name,
st.target_percentage,
-- 1-hour error rate
(SELECT
(1.0 - SUM(good_requests)::DOUBLE / NULLIF(SUM(total_requests), 0)::DOUBLE) * 100.0
FROM sli_availability
WHERE service_name = sa.service_name
AND window_start >= NOW() - INTERVAL '1 hour'
) AS error_rate_1h,
-- 6-hour error rate
(SELECT
(1.0 - SUM(good_requests)::DOUBLE / NULLIF(SUM(total_requests), 0)::DOUBLE) * 100.0
FROM sli_availability
WHERE service_name = sa.service_name
AND window_start >= NOW() - INTERVAL '6 hours'
) AS error_rate_6h,
-- Burn rate = actual error rate / allowed error rate
-- For 1-hour window
(SELECT
(1.0 - SUM(good_requests)::DOUBLE / NULLIF(SUM(total_requests), 0)::DOUBLE)
/ ((100.0 - st.target_percentage) / 100.0)
FROM sli_availability
WHERE service_name = sa.service_name
AND window_start >= NOW() - INTERVAL '1 hour'
) AS burn_rate_1h,
-- For 6-hour window
(SELECT
(1.0 - SUM(good_requests)::DOUBLE / NULLIF(SUM(total_requests), 0)::DOUBLE)
/ ((100.0 - st.target_percentage) / 100.0)
FROM sli_availability
WHERE service_name = sa.service_name
AND window_start >= NOW() - INTERVAL '6 hours'
) AS burn_rate_6h
FROM sli_availability_30d sa
JOIN slo_targets st
ON sa.service_name = st.service_name
AND st.sli_type = 'availability';
A burn_rate_1h of 14.4 means the service is burning through its 30-day error budget 14.4 times faster than allowed. At that rate, the entire budget would be exhausted in about 2 days. A burn_rate_6h of 6.0 means 5% of the budget will be consumed in 6 hours.
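Layer 3 also promised a projected budget exhaustion time. One way to surface it is an optional view like the sketch below, which assumes the current 6-hour burn rate holds and ignores budget already spent (the view name is illustrative):
-- Optional projection: days until the 30-day budget runs out at the current 6-hour burn rate
CREATE MATERIALIZED VIEW budget_exhaustion_projection AS
SELECT
service_name,
slo_name,
burn_rate_6h,
CASE
WHEN burn_rate_6h > 1.0 THEN 30.0 / burn_rate_6h
ELSE NULL
END AS projected_days_to_exhaustion
FROM burn_rate;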
How Do You Set Up Real-Time Alerts?
With burn rates computed continuously, the final step is to generate alert events when thresholds are breached. Following the multi-window approach from the Google SRE Workbook, an alert fires only when both the short (1-hour) and long (6-hour) windows exceed their burn rate thresholds, which filters out brief spikes without slowing detection of sustained burns.
Create the Alert Sink
First, create a materialized view that filters for alertable conditions, then sink it to a Kafka topic for downstream consumption:
-- Generate alert events when burn rate thresholds are breached
CREATE MATERIALIZED VIEW slo_alerts AS
SELECT
service_name,
slo_name,
target_percentage,
burn_rate_1h,
burn_rate_6h,
error_rate_1h,
error_rate_6h,
CASE
WHEN burn_rate_1h > 14.4 AND burn_rate_6h > 6.0 THEN 'CRITICAL'
WHEN burn_rate_1h > 6.0 AND burn_rate_6h > 3.0 THEN 'WARNING'
WHEN burn_rate_1h > 3.0 AND burn_rate_6h > 1.0 THEN 'INFO'
ELSE 'OK'
END AS alert_severity
FROM burn_rate
WHERE burn_rate_1h > 3.0 AND burn_rate_6h > 1.0;
The severity levels follow Google's multi-window burn rate methodology:
- CRITICAL (burn rate > 14.4x short, > 6x long): Consumes 2% of the 30-day budget in 1 hour. Page the on-call engineer immediately.
- WARNING (burn rate > 6x short, > 3x long): Consumes 5% of the budget in 6 hours. Create a ticket for investigation.
- INFO (burn rate > 3x short, > 1x long): Elevated burn rate but not urgent. Log for review.
Sink Alerts to Kafka
To deliver alerts to your incident management system, create a sink that writes alert events to a Kafka topic:
-- Deliver alert events to a Kafka topic
CREATE SINK slo_alert_sink FROM slo_alerts
WITH (
connector = 'kafka',
topic = 'slo.alerts',
properties.bootstrap.server = 'kafka:9092',
primary_key = 'service_name,slo_name'
) FORMAT UPSERT ENCODE JSON;
Using the upsert format ensures that each service/SLO combination maintains a single current alert state. When the burn rate drops back below the threshold, the alert row disappears from the materialized view and a delete is emitted downstream, which the consumer can use to clear the incident.
A lightweight consumer service reads from the slo.alerts topic and routes to PagerDuty, Slack, or OpsGenie based on the alert_severity field.
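If you would rather deliver alerts through a database table than through Kafka (the PostgreSQL option mentioned in Layer 4), a JDBC sink can write the same view into Postgres. The sketch below uses placeholder connection details, and parameter names can differ between RisingWave releases, so check the sink documentation for your version:
-- Alternative delivery: upsert alert rows into a PostgreSQL table via the JDBC sink
CREATE SINK slo_alert_pg_sink FROM slo_alerts
WITH (
connector = 'jdbc',
jdbc.url = 'jdbc:postgresql://postgres:5432/alerting?user=alert_writer&password=changeme',
table.name = 'slo_alerts',
type = 'upsert',
primary_key = 'service_name,slo_name'
);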
How Does This Compare to Traditional SLO Tools?
The streaming SQL approach differs from traditional SLO monitoring tools in several important ways.
| Aspect | Traditional (Prometheus/Datadog) | Streaming SQL (RisingWave) |
| --- | --- | --- |
| Update frequency | 15s-5min polling intervals | Continuous, sub-second |
| Query model | Pull-based PromQL/DQL queries | Push-based materialized views |
| Spike detection | Averaged across scrape intervals | Every event is evaluated |
| Custom SLIs | Requires metric relabeling | Standard SQL joins and filters |
| Multi-service correlation | Complex federation queries | Simple SQL joins across sources |
| Historical analysis | Separate long-term storage needed | Built-in with time window queries |
| Operational cost | Scales with number of SLOs x scrape rate | Scales with event throughput |
Traditional tools like Prometheus are excellent at metrics collection and short-term storage. The streaming SQL approach complements them by handling the continuous computation layer: joining multiple data sources, computing complex SLI formulas, and maintaining rolling window calculations that would be expensive as repeated PromQL queries.
For teams already using Kafka for application telemetry, adding RisingWave as the SLO computation engine requires no changes to the existing instrumentation. You keep your existing metrics pipeline and add a parallel path for real-time SLO tracking.
What About the Dashboard Layer?
Because RisingWave exposes a PostgreSQL-compatible interface, any BI tool that connects to PostgreSQL can query the materialized views directly. Grafana, Metabase, and Superset all work out of the box.
Query the Dashboard Views
Connect Grafana to RisingWave using the PostgreSQL data source and run queries like:
-- Dashboard query: current error budget status for all services
SELECT
service_name,
slo_name,
target_percentage,
ROUND(current_sli::NUMERIC, 3) AS current_sli,
ROUND(budget_remaining_pct::NUMERIC, 1) AS budget_remaining_pct,
CASE
WHEN budget_remaining_pct > 50 THEN 'healthy'
WHEN budget_remaining_pct > 20 THEN 'caution'
WHEN budget_remaining_pct > 0 THEN 'danger'
ELSE 'violated'
END AS status
FROM error_budget_status
ORDER BY budget_remaining_pct ASC;
Expected output:
service_name | slo_name | target_percentage | current_sli | budget_remaining_pct | status
------------------+--------------+-------------------+-------------+----------------------+---------
order-service | availability | 99.9 | 99.943 | 43.0 | caution
auth-service | availability | 99.95 | 99.981 | 62.0 | healthy
payment-service | availability | 99.9 | 99.972 | 72.0 | healthy
This query hits the pre-computed materialized view, so it returns instantly regardless of how much raw event data has been processed. There is no expensive scan over billions of rows each time the dashboard refreshes.
Real-Time Burn Rate Panel
For a Grafana time-series panel showing burn rate trends:
-- Burn rate over time for a specific service
SELECT
window_start AS time,
(1.0 - good_requests::DOUBLE / NULLIF(total_requests, 0)::DOUBLE)
/ ((100.0 - 99.9) / 100.0) AS burn_rate
FROM sli_availability
WHERE service_name = 'payment-service'
AND window_start >= NOW() - INTERVAL '24 hours'
ORDER BY window_start;
This query returns per-minute burn rate data points for the last 24 hours, giving the on-call engineer a clear picture of when burn rate spikes occurred and whether they are trending up or down.
FAQ
What is the advantage of streaming SQL over Prometheus for SLO monitoring?
Streaming SQL processes every event as it arrives, eliminating the polling gaps inherent in Prometheus's scrape-interval model. This means latency spikes or error bursts that last only a few seconds are captured and reflected in SLI calculations immediately, rather than being averaged away across a 15-second or longer scrape window. Additionally, streaming SQL makes it straightforward to join telemetry data from multiple services and data sources using standard SQL, which is significantly simpler than Prometheus federation or recording rules.
How does RisingWave handle high-throughput telemetry data for SLO monitoring?
RisingWave is designed for high-throughput event processing and can handle millions of events per second. Its incremental view maintenance engine ensures that materialized views are updated efficiently, processing only the delta of new events rather than recomputing entire aggregations. For SLO monitoring workloads, this means the system scales with your request volume without requiring proportionally more compute resources.
Can I use this approach alongside my existing Prometheus and Grafana setup?
Yes. The streaming SQL SLO pipeline is complementary, not a replacement. You keep Prometheus for infrastructure metrics collection and standard alerting. The streaming pipeline adds real-time SLI computation, error budget tracking, and burn rate alerting on top of your application-level telemetry flowing through Kafka. Grafana connects to RisingWave as an additional PostgreSQL data source, so your SLO dashboards sit alongside your existing infrastructure dashboards.
What happens if the streaming pipeline goes down?
RisingWave provides exactly-once processing semantics with persistent checkpointing. If the system restarts, it resumes from the last checkpoint and replays any missed Kafka events, ensuring no data is lost and SLI calculations remain accurate. The SLO targets table and materialized view definitions are stored durably, so the pipeline recovers automatically without manual intervention.
Conclusion
Here are the key takeaways from building a real-time SLO monitoring pipeline with streaming SQL:
- SLIs as materialized views eliminate the polling delay of traditional monitoring, catching error budget burns within seconds instead of minutes.
- Layered SQL definitions (source, SLI computation, error budget, alerting) keep each concern isolated and independently testable.
- Multi-window burn rate alerts following the Google SRE methodology reduce false positives while maintaining fast detection of real incidents.
- PostgreSQL compatibility means any existing BI tool or dashboard connects to your SLO data without additional integration work.
- SQL-first configuration allows SRE teams to add new services, adjust SLO targets, and modify alert thresholds with simple INSERT and UPDATE statements rather than code deployments.
Ready to try this yourself? Try RisingWave Cloud free - no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

