How to Monitor Debezium Pipeline Health with Streaming SQL

A Debezium pipeline that is silently lagging is worse than one that has stopped—you get stale data without any visible alarm. Effective monitoring combines JMX metrics from Kafka Connect, consumer group lag from Kafka, and streaming SQL queries in RisingWave to give you a complete, real-time picture of pipeline health.

Why Monitoring Debezium Pipelines Is Non-Trivial

Debezium sits at the intersection of three systems: the source database, Kafka, and the downstream consumer. A problem can originate anywhere:

  • The database's CDC log fills up because Debezium hasn't consumed it fast enough.
  • The Kafka topic grows unbounded because downstream consumers are slow.
  • A schema change breaks deserialization and events start routing to the Dead Letter Queue (DLQ).
  • Network timeouts cause the connector to restart repeatedly, creating gaps in the event stream.

None of these failures are immediately obvious from a "connector status: RUNNING" indicator. You need metric-level visibility.

Key Metrics to Track

Connector-Level (JMX)

Debezium exposes metrics via JMX under the MBean prefix debezium.{connector_type}:

  • Replication lag (SecondsBehindSource): seconds between an event occurring in the source database and Debezium processing it
  • Snapshot progress (RemainingTableCount): tables still waiting to be snapshotted
  • Total events (TotalNumberOfEventsSeen): cumulative count of events processed
  • Error count (NumberOfErroneousEvents): events that failed processing
  • Connected (Connected): whether the connector currently holds a connection to the source
  • Queue headroom (QueueRemainingCapacity): remaining capacity of the connector's internal queue

Kafka Consumer Lag

Consumer group lag measures how many messages your downstream consumer (e.g., RisingWave) is behind the Kafka topic head:

kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --group risingwave-consumer-group

Watch for LAG growing on any partition—this means RisingWave is falling behind the Debezium-produced stream.
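The lag the tool reports is simply the log-end offset minus the consumer group's committed offset, per partition. A minimal sketch of that computation, with hypothetical offset maps standing in for values you would fetch from a Kafka client:

```python
# Sketch: compute per-partition consumer lag the same way
# `kafka-consumer-groups.sh --describe` does: log-end-offset minus
# committed offset. The offset dicts here are illustrative inputs; in
# practice you would fetch them from a Kafka admin client for the group.

def compute_lag(log_end_offsets, committed_offsets):
    """Return {partition: lag} for every partition in the topic."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # A partition with no committed offset has never been read:
        # treat the whole log as lag.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = end_offset - committed
    return lag

# Partition 2 is 1500 messages behind: the consumer is falling back there.
end = {0: 1000, 1: 2500, 2: 4000}
committed = {0: 1000, 1: 2490, 2: 2500}
print(compute_lag(end, committed))  # {0: 0, 1: 10, 2: 1500}
```

Alerting on the maximum lag across partitions (rather than the sum) catches the common failure mode where a single hot partition stalls while the rest keep up.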

Dead Letter Queue (DLQ)

Configure a DLQ in your connector to capture events that fail deserialization:

{
  "errors.tolerance": "all",
  "errors.deadletterqueue.topic.name": "dlq.orders-cdc",
  "errors.deadletterqueue.context.headers.enable": "true",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true"
}

Monitor DLQ topic lag—any non-zero message count is a signal that something is broken in the event structure.
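With errors.deadletterqueue.context.headers.enable set to true, Kafka Connect attaches __connect.errors.* headers to each DLQ record describing where and why it failed. A sketch of a consumer-side helper that extracts them (the sample record is illustrative):

```python
# Sketch: summarize why an event landed in the DLQ by reading the error
# context headers that Kafka Connect attaches when
# errors.deadletterqueue.context.headers.enable = true. The header names
# below are the ones Kafka Connect uses; the sample values are made up.

ERROR_HEADERS = (
    "__connect.errors.topic",
    "__connect.errors.partition",
    "__connect.errors.offset",
    "__connect.errors.exception.class.name",
    "__connect.errors.exception.message",
)

def summarize_dlq_headers(headers):
    """headers: iterable of (key, bytes) pairs from one DLQ record."""
    summary = {}
    for key, value in headers:
        if key in ERROR_HEADERS:
            summary[key] = value.decode("utf-8", errors="replace")
    return summary

sample = [
    ("__connect.errors.topic", b"orders-cdc"),
    ("__connect.errors.exception.class.name",
     b"org.apache.kafka.connect.errors.DataException"),
    ("some.other.header", b"ignored"),
]
print(summarize_dlq_headers(sample))
```

Grouping DLQ records by exception class name quickly separates a one-off poison message from a schema change that is failing every event.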

Step-by-Step Tutorial

Step 1: Expose JMX Metrics from Kafka Connect

Add JMX options to your Kafka Connect worker configuration or Docker environment:

KAFKA_JMX_PORT=9999
KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Djava.rmi.server.hostname=kafka-connect"

Then scrape with Prometheus JMX Exporter by mounting the agent JAR and a config file:

# jmx_exporter_config.yaml
rules:
  - pattern: 'debezium\.(.*)<type=connector-metrics, context=(.*), server=(.*), task=(.*), partition=(.*)><>(.+)'
    name: debezium_$1_$6
    labels:
      context: "$2"
      server: "$3"

Step 2: Ingest Monitoring Metrics into RisingWave

Create a Kafka topic for connector heartbeats, then ingest it:

-- Heartbeat topic source (Debezium heartbeat.interval.ms feature)
CREATE SOURCE debezium_heartbeats (
    server_name  VARCHAR,
    ts_ms        BIGINT
) WITH (
    connector = 'kafka',
    topic = 'debezium-heartbeats',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

-- Connector lag view: difference between heartbeat time and current time
CREATE MATERIALIZED VIEW connector_lag_seconds AS
SELECT
    server_name,
    MAX(ts_ms) / 1000.0             AS last_event_epoch_s,
    EXTRACT(EPOCH FROM NOW())        AS current_epoch_s,
    EXTRACT(EPOCH FROM NOW())
      - MAX(ts_ms) / 1000.0          AS lag_seconds
FROM debezium_heartbeats
GROUP BY server_name;
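The view's lag formula (current epoch seconds minus the newest heartbeat timestamp) is worth sanity-checking outside SQL before you hang alerts off it. A plain-Python mirror of the same computation, with made-up epoch-millisecond heartbeat values:

```python
# Sketch: the same lag computation as connector_lag_seconds, as a plain
# function -- useful for unit-testing alert thresholds before wiring them
# into streaming SQL. ts_ms values mimic Debezium heartbeat timestamps
# (epoch milliseconds); now_s defaults to wall-clock time.
import time

def lag_seconds(heartbeat_ts_ms, now_s=None):
    """Seconds elapsed since the newest heartbeat for one server."""
    if now_s is None:
        now_s = time.time()
    return now_s - max(heartbeat_ts_ms) / 1000.0

# Newest heartbeat at t=1_700_000_050s, "now" at t=1_700_000_110s -> 60s lag.
print(lag_seconds([1_700_000_000_000, 1_700_000_050_000],
                  now_s=1_700_000_110.0))  # 60.0
```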

Step 3: Detect Stalled Pipelines with Streaming SQL

-- Alert when any pipeline hasn't produced a heartbeat in 60 seconds
CREATE MATERIALIZED VIEW stalled_pipelines AS
SELECT
    server_name,
    lag_seconds,
    'STALLED' AS alert_level
FROM connector_lag_seconds
WHERE lag_seconds > 60;

-- Track event throughput per minute from the CDC source
CREATE MATERIALIZED VIEW cdc_throughput AS
SELECT
    DATE_TRUNC('minute', TO_TIMESTAMP(ts_ms / 1000.0)) AS window_start,
    server_name,
    COUNT(*) AS events_per_minute
FROM debezium_heartbeats
GROUP BY 1, 2;

Step 4: Build a Health Dashboard

Wire connector_lag_seconds and stalled_pipelines to your dashboard tool (Grafana via the RisingWave PostgreSQL data source, Metabase, or Superset). Set alert thresholds:

  • lag_seconds > 30 → Warning
  • lag_seconds > 120 → Critical (page on-call)
  • DLQ topic message count > 0 → Investigate immediately
  • QueueRemainingCapacity < 100 → Connector backpressure risk
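To keep Grafana annotations and pager rules in agreement, it can help to encode these thresholds once. A sketch of an alert-routing function using the cutoffs above (the alert strings are illustrative):

```python
# Sketch: the dashboard thresholds above, encoded as one routing function
# so the same cutoffs drive both dashboard colors and paging rules.
# Threshold values come from the list above; message text is made up.

def classify(lag_seconds, dlq_messages=0, queue_remaining_capacity=None):
    alerts = []
    if lag_seconds > 120:
        alerts.append("CRITICAL: page on-call")   # lag_seconds > 120
    elif lag_seconds > 30:
        alerts.append("WARNING: lag rising")      # lag_seconds > 30
    if dlq_messages > 0:
        alerts.append("INVESTIGATE: DLQ non-empty")
    if queue_remaining_capacity is not None and queue_remaining_capacity < 100:
        alerts.append("WARNING: connector backpressure risk")
    return alerts or ["OK"]

print(classify(15))                                  # ['OK']
print(classify(45, dlq_messages=3))
print(classify(180, queue_remaining_capacity=50))
```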

Comparison Table

  • Kafka consumer group lag: partition-level granularity; detects lag; does not detect schema errors; low setup effort
  • Debezium JMX metrics: connector/task-level granularity; detects lag; detects schema errors (via error count); medium setup effort
  • DLQ monitoring: event-level granularity; no lag detection; detects schema errors; medium setup effort
  • Debezium heartbeats + RisingWave SQL: end-to-end granularity; detects lag; does not detect schema errors; medium setup effort
  • Full OpenTelemetry pipeline: all levels; detects lag; detects schema errors; high setup effort

FAQ

Q: What is SecondsBehindSource and when is it inaccurate?
A: SecondsBehindSource measures the age of the last event Debezium processed relative to wall clock time. It can appear artificially high on quiet tables because no events flow, so the "last event" timestamp drifts. Use heartbeat.interval.ms (e.g., 10000) to emit periodic heartbeat events that keep this metric fresh even when the source table is idle.

Q: How do I detect when the Debezium connector has switched from snapshot to streaming mode?
A: Watch the SnapshotCompleted JMX attribute (boolean) and the streaming value in the connector status API: GET /connectors/{name}/status. You can also instrument this via the Debezium notification feature in newer versions, which emits lifecycle events to a configurable Kafka topic.
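A sketch of polling that status endpoint and reducing the response to a single healthy/unhealthy signal. The Kafka Connect host and port are assumptions; parse_connector_status is a pure helper you can test against a captured response:

```python
# Sketch: poll Kafka Connect's REST API (GET /connectors/{name}/status) and
# report whether the connector and all of its tasks are RUNNING. The base
# URL below is an assumed example address.
import json
from urllib.request import urlopen

def parse_connector_status(payload):
    """Return (healthy, states) from a /status response body."""
    states = [payload["connector"]["state"]]
    states += [t["state"] for t in payload.get("tasks", [])]
    return all(s == "RUNNING" for s in states), states

def check(name, base_url="http://kafka-connect:8083"):  # assumed address
    with urlopen(f"{base_url}/connectors/{name}/status") as resp:
        return parse_connector_status(json.load(resp))

sample = {"name": "orders-cdc",
          "connector": {"state": "RUNNING"},
          "tasks": [{"id": 0, "state": "FAILED"}]}
print(parse_connector_status(sample))  # (False, ['RUNNING', 'FAILED'])
```

Note that a RUNNING connector with a FAILED task still serves no data for that task's tables, which is exactly the "status says RUNNING, pipeline is dead" trap described earlier.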

Q: Should I monitor the DLQ even if errors.tolerance is set to none?
A: Yes—even with errors.tolerance: none (which stops the connector on first error), a brief window exists where events might be written before the connector halts. More importantly, setting up DLQ monitoring before you need it means you have the tooling ready when a schema change breaks deserialization in production.

Key Takeaways

  • SecondsBehindSource is the primary lag signal, but it needs heartbeat events to stay accurate on quiet tables.
  • Dead Letter Queue monitoring catches silent data corruption that lag metrics won't reveal.
  • RisingWave's streaming SQL can turn raw heartbeat and metric events into continuously updated alert views without batch jobs.
  • Consumer group lag from Kafka reflects downstream (RisingWave) performance, while JMX metrics reflect upstream (source DB) performance—monitor both.
  • Automate alerting thresholds: a pipeline that's been stalled for more than two minutes in production deserves an immediate page.
