Debezium is an open-source distributed platform for change data capture (CDC). It monitors your databases and streams every row-level change — inserts, updates, and deletes — into Kafka topics as structured events, enabling downstream systems to react in real time without polling.
What Is Change Data Capture?
Change Data Capture is the practice of tracking changes made to data in a database and delivering those changes to other systems. Instead of periodically querying a database for new or modified rows, CDC captures changes at the source — typically by reading the database's internal transaction log — and propagates them to consumers immediately.
Traditional ETL pipelines copy data in batches, introducing latency of minutes, hours, or even days. CDC eliminates this lag by treating every database write as an event. This makes CDC foundational for:
- Real-time analytics: Keeping dashboards and reports up to date as transactions happen
- Data synchronization: Keeping multiple databases or services in sync
- Event-driven microservices: Triggering business logic in response to database changes
- Audit trails: Recording every change with a timestamp and before/after snapshot
- Cache invalidation: Flushing caches precisely when underlying data changes
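The difference between polling and log-based capture shows up most clearly with deletes. A toy Python sketch (table contents and column names are invented for illustration): diffing two polled snapshots can only infer that a row vanished, while a CDC log delivers the delete as an explicit event carrying the row's final state.

```python
# Why polling struggles with deletes: comparing two snapshots of a table
# reveals that a row vanished, but a CDC log records the delete itself,
# including the deleted row's contents. (Toy data; names are invented.)

snapshot_t0 = {1: "pending", 2: "pending"}
snapshot_t1 = {1: "shipped"}            # row 2 was deleted between polls

# Poll-based diff: we can infer row 2 is gone, but only if we kept the
# previous snapshot, and we never see the deleted row's values.
inferred_deletes = set(snapshot_t0) - set(snapshot_t1)

# Log-based CDC: the delete arrives as an explicit event with the old row.
cdc_events = [
    {"op": "u", "before": {"id": 1, "status": "pending"},
                "after":  {"id": 1, "status": "shipped"}},
    {"op": "d", "before": {"id": 2, "status": "pending"}, "after": None},
]
deleted_rows = [e["before"] for e in cdc_events if e["op"] == "d"]

assert inferred_deletes == {2}
assert deleted_rows == [{"id": 2, "status": "pending"}]
```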
How Debezium Works
Debezium runs as a set of Kafka Connect source connectors. Each connector is tailored to a specific database engine and reads from that engine's native change log:
- PostgreSQL: Debezium reads from a logical replication slot, using the `pgoutput` or `decoderbufs` output plugin
- MySQL: Debezium reads from the binlog, which records all committed transactions
- MongoDB: Debezium uses MongoDB change streams (built on the oplog) to capture changes from the replica set
- Oracle: Debezium reads from the redo log via LogMiner
- SQL Server: Debezium reads from the SQL Server CDC tables backed by the transaction log
When a change occurs, Debezium emits a structured event to a Kafka topic. Each event contains:
- `before`: The row state before the change (null for inserts)
- `after`: The row state after the change (null for deletes)
- `op`: The operation type: `c` (create/insert), `u` (update), `d` (delete), `r` (read/snapshot)
- `source`: Metadata including the connector name, database name, table name, and `ts_ms` (event timestamp)
This envelope format makes it straightforward for consumers to apply changes, build audit logs, or reconstruct current state.
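Consuming the envelope takes only a few lines. A minimal Python sketch (the sample event and the `describe` helper are invented for illustration; the field layout follows the envelope described above):

```python
import json

# A sample Debezium change event for an UPDATE on public.orders
# (payload values are invented for illustration).
event = json.loads("""
{
  "before": {"order_id": 1001, "status": "pending", "amount": 25.0},
  "after":  {"order_id": 1001, "status": "shipped", "amount": 25.0},
  "op": "u",
  "source": {"connector": "postgresql", "db": "shop",
             "table": "orders", "ts_ms": 1700000000000}
}
""")

def describe(evt):
    """Summarize a Debezium envelope as (operation, key, row)."""
    ops = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}
    op = ops[evt["op"]]
    # For deletes only "before" is populated; otherwise use "after".
    row = evt["after"] if evt["after"] is not None else evt["before"]
    return op, row["order_id"], row

op, key, row = describe(event)
print(op, key, row["status"])  # update 1001 shipped
```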
How Debezium Integrates with Kafka
Debezium is built on top of Kafka Connect, the integration framework included with Apache Kafka. You deploy Debezium connectors inside a Kafka Connect worker (a JVM process). Connectors are configured with JSON submitted to the Connect REST API and run continuously, tailing the database log and writing events to Kafka.
A typical Debezium deployment involves:
- Kafka Connect worker: One or more workers forming a distributed cluster
- Debezium connector: Configured per database, reads the transaction log
- Kafka topics: One topic per monitored table by default (e.g., `dbserver1.public.orders`)
- Consumers: Applications, stream processors, or sink connectors that read from those topics
Kafka acts as the durable buffer between the database and all downstream consumers. A change captured once by Debezium can be consumed by analytics pipelines, search indexes, notification services, and data warehouses simultaneously — the fan-out pattern.
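The fan-out pattern boils down to one durable log with independently tracked consumer positions. A self-contained in-memory sketch (class names are illustrative; real deployments use Kafka topics and consumer groups):

```python
# Minimal in-memory sketch of the fan-out pattern: one durable log,
# many consumers, each tracking its own offset independently.

class Log:
    """Stand-in for a Kafka topic: a durable, append-only event buffer."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

class Consumer:
    """Stand-in for a Kafka consumer with its own committed offset."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.events[self.offset:]
        self.offset = len(self.log.events)
        return batch

log = Log()
analytics, search_index = Consumer(log), Consumer(log)

log.append({"op": "c", "after": {"order_id": 1}})
assert analytics.poll() == search_index.poll()  # both see the same change

log.append({"op": "u", "after": {"order_id": 1}})
# A lagging consumer simply resumes from its own offset later.
assert len(search_index.poll()) == 1
```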
Step-by-Step Tutorial
Step 1: Deploy Debezium with Kafka Connect
Start a Kafka Connect worker with the Debezium PostgreSQL connector plugin on the classpath, then register a connector:
{
"name": "orders-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "debezium",
"database.password": "debezium",
"database.dbname": "shop",
"database.server.name": "dbserver1",
"slot.name": "debezium_orders",
"plugin.name": "pgoutput",
"table.include.list": "public.orders",
"topic.prefix": "dbserver1"
}
}
Post this configuration to the Kafka Connect REST API:
curl -X POST http://kafka-connect:8083/connectors \
-H "Content-Type: application/json" \
-d @connector-config.json
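The same registration can be done programmatically. A hedged sketch using only the Python standard library (the endpoint and configuration mirror the example above; the commented-out call requires a running Kafka Connect cluster):

```python
import json
import urllib.request

# Connector configuration mirroring the JSON example above.
config = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "debezium",
        "database.dbname": "shop",
        "slot.name": "debezium_orders",
        "plugin.name": "pgoutput",
        "table.include.list": "public.orders",
        "topic.prefix": "dbserver1",
    },
}

def register(connect_url, cfg):
    """POST the connector config to the Kafka Connect REST API."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(cfg).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# register("http://kafka-connect:8083", config)  # needs a live cluster
```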
Step 2: Connect to RisingWave
RisingWave is a PostgreSQL-compatible streaming database that can consume Debezium events directly from Kafka. Because a Debezium topic carries updates and deletes, define a table with a primary key over the topic so RisingWave can apply them (the column list below assumes a simple orders schema):
CREATE TABLE orders_cdc (
  order_id INT PRIMARY KEY,
  status VARCHAR,
  amount NUMERIC,
  ts_ms BIGINT
)
WITH (
  connector = 'kafka',
  topic = 'dbserver1.public.orders',
  properties.bootstrap.server = 'kafka:9092',
  scan.startup.mode = 'earliest'
) FORMAT DEBEZIUM ENCODE JSON;
RisingWave understands the Debezium envelope format, automatically applying the before/after/op fields to maintain a consistent materialized state.
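Conceptually, folding the envelope into current state is a keyed upsert/delete. A Python sketch of the idea (an illustration only, not RisingWave's actual implementation; the `order_id` key is an assumption):

```python
def apply_event(state, event):
    """Fold one Debezium envelope into a dict keyed by primary key.

    Inserts, updates, and snapshot reads upsert the 'after' image;
    deletes remove the key using the 'before' image.
    """
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        state[row["order_id"]] = row
    elif op == "d":
        state.pop(event["before"]["order_id"], None)
    return state

state = {}
apply_event(state, {"op": "c", "after": {"order_id": 1, "status": "pending"}})
apply_event(state, {"op": "u", "after": {"order_id": 1, "status": "shipped"}})
apply_event(state, {"op": "d", "before": {"order_id": 1}, "after": None})
print(state)  # {}
```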
Step 3: Build a Materialized View
Once the source is defined, you can build real-time aggregations or transformations as materialized views:
CREATE MATERIALIZED VIEW order_summary AS
SELECT
DATE_TRUNC('hour', TO_TIMESTAMP(ts_ms / 1000)) AS hour,
status,
COUNT(*) AS order_count,
SUM(amount) AS total_amount
FROM orders_cdc
GROUP BY 1, 2;
This view updates incrementally as new CDC events arrive — no batch jobs required.
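The incremental maintenance behind such a view amounts to adjusting per-group counters as events arrive, retracting the `before` image and adding the `after` image, instead of rescanning the table. A simplified Python illustration (field names, amounts, and the hour bucketing mirror the view above but are assumptions):

```python
from collections import defaultdict

# Incrementally maintained aggregates: (hour, status) -> [count, total_amount]
summary = defaultdict(lambda: [0, 0.0])

def on_event(event):
    """Adjust group aggregates for one CDC event; no re-aggregation."""
    def adjust(row, sign):
        if row is None:
            return
        key = (row["ts_ms"] // 3_600_000, row["status"])  # hour bucket, status
        summary[key][0] += sign
        summary[key][1] += sign * row["amount"]
    adjust(event.get("before"), -1)   # retract the old row, if any
    adjust(event.get("after"), +1)    # add the new row, if any

on_event({"after": {"ts_ms": 0, "status": "pending", "amount": 10.0}})
on_event({"before": {"ts_ms": 0, "status": "pending", "amount": 10.0},
          "after":  {"ts_ms": 0, "status": "shipped", "amount": 10.0}})
print(dict(summary))
```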
Step 4: Sink to Downstream Systems
Stream the results to a downstream data store or analytics platform:
CREATE SINK order_summary_sink
FROM order_summary
WITH (
connector = 'kafka',
topic = 'order-summary-output',
properties.bootstrap.server = 'kafka:9092'
) FORMAT DEBEZIUM ENCODE JSON;
Comparison Table
| Feature | Debezium + Kafka | Native DB Polling |
|---|---|---|
| Latency | Milliseconds | Minutes to hours |
| Source impact | Log-based (minimal) | Query load on DB |
| Captures deletes | Yes | Difficult |
| Schema evolution support | Yes (with Schema Registry) | Manual |
| Fan-out (multiple consumers) | Built-in via Kafka | Must re-query per consumer |
| Operational complexity | Medium-high | Low |
FAQ
Does Debezium work with managed databases like Amazon RDS?
Yes. Debezium supports Amazon RDS for PostgreSQL and MySQL, as well as Amazon Aurora. You need to enable logical replication (rds.logical_replication=1 for PostgreSQL) and create a replication user with the appropriate privileges.
What happens if Debezium goes down?
Debezium stores its last-read offset (log position) in Kafka. When it restarts, it resumes from where it left off. For PostgreSQL, the replication slot ensures the database retains WAL segments until Debezium confirms it has processed them.
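The recovery behavior can be sketched in a few lines: persist the last processed position, and on restart continue from it. A toy Python simulation (a dict stands in for the Kafka-backed offset store; the crash is simulated with an early break):

```python
# Sketch of offset-based recovery: the last processed log position is
# persisted, and on restart processing resumes from it. Debezium keeps
# this offset in a Kafka topic; here a plain dict stands in for it.

log = [f"change-{i}" for i in range(10)]   # stand-in for the DB's change log
offsets = {}                               # durable offset store

def run(name, stop_after=None):
    """Process the log from the stored offset, checkpointing as we go."""
    start = offsets.get(name, 0)
    processed = []
    for pos in range(start, len(log)):
        if stop_after is not None and len(processed) == stop_after:
            break                          # simulate a crash mid-stream
        processed.append(log[pos])
        offsets[name] = pos + 1            # commit offset after processing
    return processed

first = run("orders-connector", stop_after=4)   # "crashes" after 4 events
second = run("orders-connector")                # restart: resumes, no replay
assert first + second == log
```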
How is RisingWave's native CDC different from Debezium?
RisingWave also offers native CDC connectors (connector = 'postgres-cdc' and connector = 'mysql-cdc') that connect directly to the database without requiring Kafka or Debezium. Native CDC is simpler to operate for single-consumer scenarios. Debezium with Kafka is better when you need to fan out changes to many consumers simultaneously.
Key Takeaways
- Debezium captures database changes from transaction logs with millisecond latency
- It runs as Kafka Connect source connectors, emitting structured events with `before`, `after`, and `op` fields
- RisingWave consumes Debezium events from Kafka via `FORMAT DEBEZIUM ENCODE JSON`
- Materialized views in RisingWave update incrementally as CDC events arrive
- Choose Debezium + Kafka for fan-out architectures; choose RisingWave's native CDC for simpler, direct pipelines

