A Debezium pipeline that is silently lagging is worse than one that has stopped—you get stale data without any visible alarm. Effective monitoring combines JMX metrics from Kafka Connect, consumer group lag from Kafka, and streaming SQL queries in RisingWave to give you a complete, real-time picture of pipeline health.
## Why Monitoring Debezium Pipelines Is Non-Trivial
Debezium sits at the intersection of three systems: the source database, Kafka, and the downstream consumer. A problem can originate anywhere:
- The database's CDC log fills up because Debezium hasn't consumed it fast enough.
- The Kafka topic grows unbounded because downstream consumers are slow.
- A schema change breaks deserialization and events start routing to the Dead Letter Queue (DLQ).
- Network timeouts cause the connector to restart repeatedly, creating gaps in the event stream.
None of these failures are immediately obvious from a "connector status: RUNNING" indicator. You need metric-level visibility.
## Key Metrics to Track

### Connector-Level (JMX)
Debezium exposes metrics via JMX under MBeans named `debezium.<connector_type>:type=connector-metrics,context=<context>,server=<server_name>` (for example, `debezium.mysql:type=connector-metrics,context=streaming,server=dbserver1`):
| Metric | MBean Attribute | Meaning |
| --- | --- | --- |
| Replication lag | `MilliSecondsBehindSource` | Milliseconds between a source DB event and Debezium processing it |
| Snapshot progress | `RemainingTableCount` | Tables left to snapshot |
| Total events | `TotalNumberOfEventsSeen` | Cumulative events processed |
| Error count | `NumberOfErroneousEvents` | Events that failed processing |
| Connected | `Connected` | Whether the connector is connected to the source |
| Queue remaining capacity | `QueueRemainingCapacity` | Connector internal queue headroom |
### Kafka Consumer Lag
Consumer group lag measures how many messages your downstream consumer (e.g., RisingWave) is behind the Kafka topic head:
```bash
kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --group risingwave-consumer-group
```
Watch for LAG growing on any partition—this means RisingWave is falling behind the Debezium-produced stream.
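The lag figure that command prints is just log-end-offset minus committed offset per partition. A minimal sketch of that arithmetic as a pure function, so you can wrap whatever Kafka client you already use (the function name is our own, not part of any library):

```python
# Sketch: compute per-partition consumer lag the same way
# `kafka-consumer-groups.sh --describe` reports it.

def partition_lag(end_offsets, committed_offsets):
    """Return {partition: lag}, lag = log-end-offset - committed offset.

    end_offsets / committed_offsets: {partition_number: offset}.
    A partition with no committed offset is reported with its full backlog.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(0, end - committed)
    return lag


# Example: partition 2 is 150 messages behind the topic head.
print(partition_lag({0: 1000, 1: 500, 2: 800}, {0: 1000, 1: 500, 2: 650}))
# → {0: 0, 1: 0, 2: 150}
```

Feeding this into a gauge metric per partition makes "LAG growing on any partition" an alertable condition rather than something you eyeball in a terminal.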
### Dead Letter Queue (DLQ)
Configure a DLQ in your connector to capture events that fail deserialization:
```json
{
  "errors.tolerance": "all",
  "errors.deadletterqueue.topic.name": "dlq.orders-cdc",
  "errors.deadletterqueue.context.headers.enable": "true",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true"
}
```
Monitor DLQ topic lag—any non-zero message count is a signal that something is broken in the event structure.
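With context headers enabled as above, every DLQ record carries `__connect.errors.*` headers describing where and why it failed. A sketch of a triage helper that extracts the fields an on-call engineer needs (the function is illustrative, not a library API; header values arrive as bytes in most Kafka clients):

```python
# Sketch: summarize Kafka Connect DLQ error-context headers
# (populated when errors.deadletterqueue.context.headers.enable=true).

def summarize_dlq_record(headers):
    """Pull the key failure-context fields from a DLQ record's headers.

    headers: {header_name: bytes}, as most Kafka clients expose them.
    """
    wanted = {
        "__connect.errors.topic": "source_topic",
        "__connect.errors.connector.name": "connector",
        "__connect.errors.exception.class.name": "exception",
        "__connect.errors.exception.message": "message",
    }
    return {
        label: headers[name].decode("utf-8", errors="replace")
        for name, label in wanted.items()
        if name in headers
    }


record_headers = {
    "__connect.errors.topic": b"orders-cdc",
    "__connect.errors.connector.name": b"orders-connector",
    "__connect.errors.exception.class.name": b"org.apache.kafka.connect.errors.DataException",
    "__connect.errors.exception.message": b"Failed to deserialize data for topic orders-cdc",
}
print(summarize_dlq_record(record_headers))
```

Logging this summary for each DLQ arrival turns "something is in the DLQ" into "this connector, this topic, this exception."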
## Step-by-Step Tutorial

### Step 1: Expose JMX Metrics from Kafka Connect
Add JMX options to your Kafka Connect worker configuration or Docker environment:
```bash
KAFKA_JMX_PORT=9999
KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Djava.rmi.server.hostname=kafka-connect"
```
Then scrape with Prometheus JMX Exporter by mounting the agent JAR and a config file:
```yaml
# jmx_exporter_config.yaml
rules:
  - pattern: 'debezium\.(.*)<type=connector-metrics, context=(.*), server=(.*), task=(.*), partition=(.*)><>(.+)'
    name: debezium_$1_$6
    labels:
      context: "$2"
      server: "$3"
```
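Once Prometheus is scraping the exporter, you can alert on lag directly. A minimal sketch of an alert rule, assuming the rewrite rules produce a metric named `debezium_mysql_MilliSecondsBehindSource` for a MySQL connector; verify the actual name against your exporter's `/metrics` endpoint first, since it depends on your `name:` pattern and connector type:

```yaml
# prometheus-alerts.yaml (sketch; confirm the metric name before deploying)
groups:
  - name: debezium-pipeline
    rules:
      - alert: DebeziumReplicationLagHigh
        expr: debezium_mysql_MilliSecondsBehindSource > 30000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Debezium server {{ $labels.server }} is more than 30s behind the source"
```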
### Step 2: Ingest Monitoring Metrics into RisingWave
Create a Kafka topic for connector heartbeats, then ingest it:
```sql
-- Heartbeat topic source (Debezium heartbeat.interval.ms feature)
CREATE SOURCE debezium_heartbeats (
    server_name VARCHAR,
    ts_ms BIGINT
) WITH (
    connector = 'kafka',
    topic = 'debezium-heartbeats',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
```
```sql
-- Connector lag view: difference between heartbeat time and current time
CREATE MATERIALIZED VIEW connector_lag_seconds AS
SELECT
    server_name,
    MAX(ts_ms) / 1000.0 AS last_event_epoch_s,
    EXTRACT(EPOCH FROM NOW()) AS current_epoch_s,
    EXTRACT(EPOCH FROM NOW())
        - MAX(ts_ms) / 1000.0 AS lag_seconds
FROM debezium_heartbeats
GROUP BY server_name;
```
### Step 3: Detect Stalled Pipelines with Streaming SQL
```sql
-- Alert when any pipeline hasn't produced a heartbeat in 60 seconds
CREATE MATERIALIZED VIEW stalled_pipelines AS
SELECT
    server_name,
    lag_seconds,
    'STALLED' AS alert_level
FROM connector_lag_seconds
WHERE lag_seconds > 60;
```
```sql
-- Track event throughput per minute from the CDC source
CREATE MATERIALIZED VIEW cdc_throughput AS
SELECT
    DATE_TRUNC('minute', TO_TIMESTAMP(ts_ms / 1000.0)) AS window_start,
    server_name,
    COUNT(*) AS events_per_minute
FROM debezium_heartbeats
GROUP BY 1, 2;
```
### Step 4: Build a Health Dashboard
Wire connector_lag_seconds and stalled_pipelines to your dashboard tool (Grafana via the RisingWave PostgreSQL data source, Metabase, or Superset). Set alert thresholds:
- `lag_seconds > 30` → Warning
- `lag_seconds > 120` → Critical (page on-call)
- DLQ topic message count > 0 → Investigate immediately
- `QueueRemainingCapacity < 100` → Connector backpressure risk
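The thresholds above can also live in code instead of scattered dashboard panel settings. A sketch of a single classification function (names and priority ordering are our choices, not a standard):

```python
# Sketch: encode the dashboard thresholds as one function, most severe first.

def classify(lag_seconds, dlq_messages, queue_remaining_capacity):
    """Map raw pipeline health metrics to a single alert level."""
    if lag_seconds > 120:
        return "CRITICAL"      # page on-call
    if dlq_messages > 0:
        return "INVESTIGATE"   # broken event structure
    if queue_remaining_capacity < 100:
        return "BACKPRESSURE"  # connector internal queue nearly full
    if lag_seconds > 30:
        return "WARNING"
    return "OK"


print(classify(lag_seconds=45, dlq_messages=0, queue_remaining_capacity=8000))
# → WARNING
```

Keeping the ordering explicit (critical lag outranks DLQ, which outranks backpressure) avoids the common dashboard failure mode where two panels fire contradictory alerts for the same incident.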
## Comparison Table
| Monitoring Approach | Granularity | Lag Detection | Schema Error Detection | Setup Effort |
| --- | --- | --- | --- | --- |
| Kafka consumer group lag | Partition-level | Yes | No | Low |
| Debezium JMX metrics | Connector/task level | Yes | Yes (error count) | Medium |
| DLQ monitoring | Event level | No | Yes | Medium |
| Debezium heartbeats + RisingWave SQL | End-to-end | Yes | No | Medium |
| Full OpenTelemetry pipeline | All levels | Yes | Yes | High |
## FAQ
**Q: What is `MilliSecondsBehindSource` and when is it inaccurate?**

`MilliSecondsBehindSource` measures the age of the last event Debezium processed relative to the wall clock. It can appear artificially high on quiet tables because no events flow, so the "last event" timestamp drifts. Use `heartbeat.interval.ms` (e.g., `10000`) to emit periodic heartbeat events that keep this metric fresh even when the source table is idle.
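Enabling heartbeats is a small connector-config change. A sketch of the relevant properties; note that Debezium publishes heartbeats to a topic named `<heartbeat.topics.prefix>.<server_name>` (the default prefix is `__debezium-heartbeat`), so align the heartbeat source topic in Step 2 with whatever name actually results:

```json
{
  "heartbeat.interval.ms": "10000",
  "heartbeat.topics.prefix": "debezium-heartbeats"
}
```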
**Q: How do I detect when the Debezium connector has switched from snapshot to streaming mode?**

Watch the `SnapshotCompleted` JMX attribute (a boolean on the snapshot metrics MBean) alongside the task states from the connector status API: `GET /connectors/{name}/status`. You can also instrument this via the Debezium notification feature in newer versions, which emits lifecycle events to a configurable Kafka topic.
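Combining the two signals is straightforward once you have the status payload. A sketch, where the payload shape matches `GET /connectors/{name}/status` and the `snapshot_completed` flag is assumed to come from polling the `SnapshotCompleted` JMX attribute separately (the function itself is our own helper, not a Connect API):

```python
# Sketch: derive a coarse connector phase from the Connect REST status
# payload plus the SnapshotCompleted JMX flag.
import json

def connector_phase(status_json, snapshot_completed):
    """Return 'unhealthy', 'snapshotting', or 'streaming'."""
    status = json.loads(status_json)
    task_states = [t["state"] for t in status.get("tasks", [])]
    if status["connector"]["state"] != "RUNNING" or "FAILED" in task_states:
        return "unhealthy"
    return "streaming" if snapshot_completed else "snapshotting"


payload = '{"name":"orders-connector","connector":{"state":"RUNNING"},"tasks":[{"id":0,"state":"RUNNING"}]}'
print(connector_phase(payload, snapshot_completed=True))
# → streaming
```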
**Q: Should I monitor the DLQ even if errors.tolerance is set to none?**

Yes, keep the tooling ready. With `errors.tolerance: none` the connector stops on the first error and Kafka Connect routes nothing to the DLQ, so the topic stays empty in that mode. But the moment you switch to `errors.tolerance: all` during an incident (say, a schema change breaks deserialization in production), you want DLQ monitoring already wired up rather than built under pressure.
## Key Takeaways

- `MilliSecondsBehindSource` is the primary lag signal, but it needs heartbeat events to stay accurate on quiet tables.
- Dead Letter Queue monitoring catches silent data corruption that lag metrics won't reveal.
- RisingWave's streaming SQL can turn raw heartbeat and metric events into continuously updated alert views without batch jobs.
- Consumer group lag from Kafka reflects downstream (RisingWave) performance, while JMX metrics reflect upstream (source DB) performance—monitor both.
- Automate alerting thresholds: a pipeline that's been stalled for more than two minutes in production deserves an immediate page.

