A Debezium connector can show RUNNING in the API while silently falling behind by hours. Connector status is a liveness check, not a health check. Real observability requires tracking lag metrics, replication slot growth, error rates, and snapshot progress — and knowing which alert thresholds actually matter in production.
The Three Layers of CDC Observability
Layer 1: Connector health. Is the connector process alive and tasks running? This is the minimum — necessary but not sufficient.
Layer 2: Throughput and lag. Is the connector keeping up with the source database's write rate? A running connector with growing lag is a silent failure.
Layer 3: Source database health. Is the replication slot growing? Is the binlog about to rotate? These database-side conditions will eventually kill the connector even if it's currently healthy.
Monitoring only Layer 1 is how teams get paged at 2am with "connector has been running fine for weeks, why is it suddenly 6 hours behind?"
Debezium JMX Metrics
Debezium exposes metrics via JMX. Enable them in the Kafka Connect worker JVM settings:
# In the Connect startup script or environment
KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=9010 \
-Dcom.sun.management.jmxremote.local.only=false \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false"
Key JMX MBeans for a PostgreSQL connector:
debezium.postgres:type=connector-metrics,context=streaming,server=shop
MilliSecondsBehindSource # Lag from database WAL in milliseconds
NumberOfCommittedTransactions # Committed transactions processed
TotalNumberOfEventsSeen # Total events read from WAL
NumberOfEventsFiltered # Events dropped by include/exclude lists
LastEvent # Last change event read from the source
debezium.postgres:type=connector-metrics,context=snapshot,server=shop
TotalTableCount # Tables to snapshot
RemainingTableCount # Tables not yet snapshotted
RowsScanned # Map of table → rows scanned
SnapshotDurationInSeconds # Total snapshot time so far
Prometheus Integration
Use the JMX Exporter to scrape Debezium JMX metrics into Prometheus:
# jmx_exporter_config.yml
rules:
  - pattern: 'debezium\.(\w+)<type=connector-metrics, context=(\w+), server=(\w+)><>(\w+)'
    name: debezium_$1_$4
    labels:
      context: "$2"
      server: "$3"
    type: GAUGE
# Run JMX Exporter as a Java agent alongside Kafka Connect
KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=8080:/opt/jmx_exporter/config.yml"
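Once the agent is attached, a quick way to confirm the exporter is actually serving Debezium metrics (port 8080 as configured above):
# Verify the JMX Exporter endpoint is up and exposing Debezium metrics
curl -s http://localhost:8080/metrics | grep '^debezium_'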
Then in Prometheus:
# prometheus.yml scrape config
scrape_configs:
  - job_name: 'debezium'
    static_configs:
      - targets: ['debezium-host:8080']
    metrics_path: '/metrics'
Key Metrics and Alert Thresholds
| Metric | Alert Threshold | Severity | Notes |
|---|---|---|---|
| debezium_postgres_MilliSecondsBehindSource | > 60,000 (1 min) | Warning | > 300,000 (5 min): Critical |
| debezium_postgres_NumberOfEventsFiltered / TotalNumberOfEventsSeen | > 50% | Warning | May indicate misconfigured include lists |
| Kafka consumer group lag (kafka_consumergroup_lag) | > 100,000 | Warning | Downstream consumers falling behind |
| PostgreSQL pg_replication_slots pg_wal_lsn_diff | > 10GB | Warning | WAL accumulation risk |
| MySQL Seconds_Behind_Master equivalent | > 60 seconds | Warning | Binlog replication lag |
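As a starting point, the lag thresholds from the table can be encoded as Prometheus alerting rules. This is a sketch: the group and alert names are placeholders, and the for: durations should match your own tolerance for transient spikes.
# alert_rules.yml (illustrative; names and durations are placeholders)
groups:
  - name: debezium-cdc
    rules:
      - alert: DebeziumLagWarning
        expr: debezium_postgres_MilliSecondsBehindSource > 60000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Debezium connector {{ $labels.server }} is over 1 minute behind the source"
      - alert: DebeziumLagCritical
        expr: debezium_postgres_MilliSecondsBehindSource > 300000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Debezium connector {{ $labels.server }} is over 5 minutes behind the source"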
Grafana Dashboard Query Examples
# Seconds behind source (PostgreSQL)
debezium_postgres_MilliSecondsBehindSource{server="shop"} / 1000
# Event throughput (events per second)
rate(debezium_postgres_TotalNumberOfEventsSeen{server="shop"}[5m])
# Snapshot progress (% complete)
100 * (1 - (debezium_postgres_RemainingTableCount / debezium_postgres_TotalTableCount))
Silent Failure Modes
1. Connector Paused Without Error
The connector status API reports RUNNING, but MilliSecondsBehindSource stops updating. This happens when:
- The source table has no new writes (legitimate) — distinguish by checking if the timestamp stopped updating vs. lag increasing.
- The internal task thread is deadlocked.
Alert when MilliSecondsBehindSource has not changed for more than 5 minutes while the source database is actively receiving writes.
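One way to express the "counter stopped moving" half of that condition in PromQL, using the metric names produced by the exporter rules above (a sketch; pair it with a source-side write-rate check so quiet tables don't page anyone):
# Stall check: the event counter has not moved in the last 10 minutes
(debezium_postgres_TotalNumberOfEventsSeen{server="shop"} - debezium_postgres_TotalNumberOfEventsSeen{server="shop"} offset 10m) == 0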
2. PostgreSQL Replication Slot Bloat
If the connector is slow or paused, the PostgreSQL replication slot prevents WAL segments from being recycled. Disk fills up. Database performance degrades. This is the most severe production failure mode for PostgreSQL CDC.
-- Monitor from PostgreSQL
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_retained,
now() - pg_last_xact_replay_timestamp() AS replication_delay
FROM pg_replication_slots;
Alert when wal_retained exceeds 10GB or your disk capacity threshold. If you cannot resolve the connector lag quickly, drop the slot as an emergency measure — accept that you will need to resnapshot.
-- Emergency: drop a stuck replication slot
SELECT pg_drop_replication_slot('debezium');
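To track retained WAL continuously in Prometheus instead of checking by hand, one option is a custom query for postgres_exporter. The namespace and file layout below are illustrative rather than a built-in exporter default; the resulting metric would be exposed as pg_replication_slot_retained_wal_bytes.
# queries.yaml for postgres_exporter (illustrative custom query)
pg_replication_slot:
  query: "SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes FROM pg_replication_slots"
  metrics:
    - slot_name:
        usage: "LABEL"
        description: "Replication slot name"
    - retained_wal_bytes:
        usage: "GAUGE"
        description: "WAL retained by the slot, in bytes"
Alert when retained_wal_bytes crosses roughly 1e10 (about 10GB) or whatever fits your disk capacity threshold.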
3. MySQL Binlog Rotation
MySQL rotates binlog files based on max_binlog_size and expire_logs_days. If the connector is offline and the binlog file referenced in its stored offset is rotated away, the connector cannot resume.
-- Check current binlog files
SHOW BINARY LOGS;
-- Check expire setting
SHOW VARIABLES LIKE 'expire_logs_days';
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
If a connector has been offline for more than 50% of your binlog retention window (expire_logs_days or binlog_expire_logs_seconds), page the on-call to restart it before the binlog file it needs is purged.
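To see exactly which binlog file the connector will try to resume from, read its stored offset from the Connect offsets topic and compare the file field in the offset JSON against SHOW BINARY LOGS. connect-offsets is the common default topic name; check offset.storage.topic in your worker config.
# Inspect stored connector offsets (binlog file and position are in the message value)
kafka-console-consumer.sh \
  --bootstrap-server kafka:9092 \
  --topic connect-offsets \
  --from-beginning \
  --property print.key=true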
4. Connector Silently Dropping Events
If NumberOfEventsFiltered grows unexpectedly, events are being dropped by include/exclude configuration. This is not an error — Debezium does not alert on filtered events — but it can cause data gaps in downstream systems.
# Check current connector include list
curl -s http://localhost:8083/connectors/orders-connector/config | jq '.["table.include.list"]'
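The filtered-to-seen ratio from the alert table can be computed directly in PromQL (a sketch using the metric names produced by the exporter rules above):
# Fraction of events dropped by include/exclude filters over the last 15 minutes
rate(debezium_postgres_NumberOfEventsFiltered{server="shop"}[15m]) / rate(debezium_postgres_TotalNumberOfEventsSeen{server="shop"}[15m])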
Kafka Consumer Group Lag
Debezium connector lag (WAL position) is distinct from Kafka consumer group lag (how far consumers are behind the Kafka topic). Both must be monitored.
# Check Kafka consumer group lag
kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--group orders-analytics-consumer \
--describe
This shows CURRENT-OFFSET, LOG-END-OFFSET, and LAG per partition. Alert when aggregate lag exceeds your SLA tolerance.
In Prometheus via the Kafka exporter:
# Total consumer group lag
sum(kafka_consumergroup_lag{group="orders-analytics-consumer"}) by (topic)
CDC Health Checklist
Run through this checklist when investigating a suspected CDC incident:
Connector layer:
- [ ] GET /connectors/{name}/status — all tasks in RUNNING state
- [ ] MilliSecondsBehindSource < 60,000 ms
- [ ] TotalNumberOfEventsSeen increasing over time
- [ ] No recent task restarts in connector logs

Kafka layer:
- [ ] connect-offsets topic has replication factor ≥ 3
- [ ] Schema history topic exists and is accessible
- [ ] Consumer group lag within acceptable threshold

PostgreSQL source:
- [ ] Replication slot exists: SELECT * FROM pg_replication_slots WHERE slot_name = 'debezium'
- [ ] Slot is active: active = true
- [ ] WAL retained < 10GB
- [ ] No long-running transactions blocking WAL cleanup: SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction'

MySQL source:
- [ ] Binlog file referenced in offset still exists in SHOW BINARY LOGS
- [ ] expire_logs_days or binlog_expire_logs_seconds provides adequate retention window
- [ ] SHOW SLAVE STATUS shows no replication errors (if using replica as CDC source)
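The connector-layer items can be checked in one call against the Kafka Connect REST API; the connector name and host below are placeholders carried over from the earlier examples.
# Connector and task states at a glance
curl -s http://localhost:8083/connectors/orders-connector/status | jq '{connector: .connector.state, tasks: [.tasks[].state]}'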
How RisingWave Monitoring Compares
RisingWave exposes CDC and streaming metrics through its system catalog and a Prometheus endpoint, without requiring JMX setup.
-- Source ingestion status
SELECT source_name, source_type, connection_params
FROM rw_sources;
-- Materialized view freshness (approximate)
SELECT name, definition
FROM rw_materialized_views
WHERE name = 'orders_summary';
-- Streaming job metrics
SELECT fragment_id, node, throughput_bytes, throughput_rows
FROM rw_streaming_jobs;
For Prometheus, RisingWave exposes a /metrics endpoint on the meta node (default port 1250) and each compute node. Key metrics:
# CDC source read throughput
risingwave_source_output_rows_counts{source_name="orders"}
# Barrier (checkpoint) latency
risingwave_meta_barrier_duration_seconds
# Materialized view write throughput
risingwave_stream_actor_output_row_count{fragment_type="mview"}
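A scrape config for these endpoints might look like the following. The hostnames are placeholders, the meta-node port follows the default mentioned above, and compute-node metrics ports vary by deployment, so treat those targets as an assumption to verify against your own setup.
# prometheus.yml scrape config for RisingWave (hostnames are placeholders)
scrape_configs:
  - job_name: 'risingwave'
    static_configs:
      - targets: ['risingwave-meta:1250']
      # add compute-node targets here at their configured metrics ports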
The main operational advantage: there is no replication slot to monitor or binlog rotation to track. RisingWave manages the WAL position internally, and the equivalent of "slot lag" is surfaced as checkpoint lag in its own metrics — a single metric rather than a database-side and connector-side pair.
FAQ
How do I detect if a connector has quietly stopped processing new events?
Alert on MilliSecondsBehindSource exceeding your threshold AND cross-check that the source database is receiving writes. If lag is 0 but no events are flowing, the source table is quiet (fine) or the connector is stuck (not fine). Track TotalNumberOfEventsSeen as a monotonically increasing counter — if it stops increasing for 10 minutes during a write-active period, the connector is stalled.
Is there a way to get end-to-end latency measurement, from database write to Kafka delivery?
Debezium includes a ts_ms field in every event payload — the millisecond timestamp when the event was processed by the connector. The database write timestamp is in source.ts_ms. Subtracting these gives connector processing latency. For end-to-end including Kafka consumer processing, compare source.ts_ms to your consumer's processing timestamp.
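For a quick spot check from the command line, the two timestamps can be subtracted directly from captured events. This sketch assumes the JSON converter with schemas enabled, so each event has a payload wrapper, and a file events.json holding one event per line (both assumptions; adjust to your setup).
# Connector processing latency in milliseconds, per event
jq '.payload.ts_ms - .payload.source.ts_ms' events.json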
What is the best alerting threshold for MilliSecondsBehindSource?
It depends on your SLA. For real-time dashboards requiring sub-second freshness: alert at 5,000 ms (5 sec). For operational reporting with minute-level freshness: alert at 120,000 ms (2 min). For daily batch-equivalent workloads: alert at 3,600,000 ms (1 hour). Never use a single threshold for all pipelines — different tables have different freshness requirements.
Should I monitor the replication slot even when the connector is running normally?
Yes. The replication slot can grow during normal operation if the connector is slower than the write rate. Slot growth is independent of connector RUNNING state. A connector that is running but consistently lagged will accumulate gigabytes of retained WAL. Monitor pg_wal_lsn_diff as a continuous metric, not just during incidents.
How do I integrate Debezium metrics into an existing Datadog or New Relic setup?
Both Datadog and New Relic support JMX metric collection via their agents. Configure the agent to collect from the same JMX port (9010 in the example above) using the Debezium MBean patterns. Alternatively, deploy the JMX Prometheus Exporter and use the Prometheus-compatible integration in Datadog/New Relic, which gives you the full PromQL-based alerting as described above.

