Debezium is an open-source distributed platform for change data capture (CDC). It monitors your databases and streams every row-level change — inserts, updates, and deletes — into Kafka topics as structured events, enabling downstream systems to react in real time without polling.
What Is Change Data Capture?
Change Data Capture is the practice of tracking changes made to data in a database and delivering those changes to other systems. Instead of periodically querying a database for new or modified rows, CDC captures changes at the source — typically by reading the database's internal transaction log — and propagates them to consumers immediately.
Traditional ETL pipelines copy data in batches, introducing latency of minutes, hours, or even days. CDC eliminates this lag by treating every database write as an event. This makes CDC foundational for:
- Real-time analytics: Keeping dashboards and reports up to date as transactions happen
- Data synchronization: Keeping multiple databases or services in sync
- Event-driven microservices: Triggering business logic in response to database changes
- Audit trails: Recording every change with a timestamp and before/after snapshot
- Cache invalidation: Flushing caches precisely when underlying data changes
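The difference between polling and log-based capture shows up most clearly with deletes. A toy Python sketch (table contents and column names are invented for illustration): diffing two polled snapshots can only infer that a row vanished, while a CDC log delivers the delete as an explicit event carrying the row's final state.

```python
# Why polling struggles with deletes: comparing two snapshots of a table
# reveals that a row vanished, but a CDC log records the delete itself,
# including the deleted row's contents. (Toy data; names are invented.)

snapshot_t0 = {1: "pending", 2: "pending"}
snapshot_t1 = {1: "shipped"}            # row 2 was deleted between polls

# Poll-based diff: we can infer row 2 is gone, but only if we kept the
# previous snapshot, and we never see the deleted row's values.
inferred_deletes = set(snapshot_t0) - set(snapshot_t1)

# Log-based CDC: the delete arrives as an explicit event with the old row.
cdc_events = [
    {"op": "u", "before": {"id": 1, "status": "pending"},
                "after":  {"id": 1, "status": "shipped"}},
    {"op": "d", "before": {"id": 2, "status": "pending"}, "after": None},
]
deleted_rows = [e["before"] for e in cdc_events if e["op"] == "d"]

assert inferred_deletes == {2}
assert deleted_rows == [{"id": 2, "status": "pending"}]
```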
How Debezium Works
Debezium runs as a set of Kafka Connect source connectors. Each connector is tailored to a specific database engine and reads from that engine's native change log:
- PostgreSQL: Debezium reads from a logical replication slot, using the `pgoutput` or `decoderbufs` output plugin
- MySQL: Debezium reads from the binlog, which records all committed transactions
- MongoDB: Debezium uses MongoDB change streams (built on the oplog) to capture changes from the replica set
- Oracle: Debezium reads from the redo log via LogMiner
- SQL Server: Debezium reads from the SQL Server CDC tables backed by the transaction log
When a change occurs, Debezium emits a structured event to a Kafka topic. Each event contains:
- `before`: The row state before the change (null for inserts)
- `after`: The row state after the change (null for deletes)
- `op`: The operation type: `c` (create/insert), `u` (update), `d` (delete), `r` (read/snapshot)
- `source`: Metadata including the connector name, database name, table name, and `ts_ms` (event timestamp)
This envelope format makes it straightforward for consumers to apply changes, build audit logs, or reconstruct current state.
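Consuming the envelope takes only a few lines. A minimal Python sketch (the sample event and the `describe` helper are invented for illustration; the field layout follows the envelope described above):

```python
import json

# A sample Debezium change event for an UPDATE on public.orders
# (payload values are invented for illustration).
event = json.loads("""
{
  "before": {"order_id": 1001, "status": "pending", "amount": 25.0},
  "after":  {"order_id": 1001, "status": "shipped", "amount": 25.0},
  "op": "u",
  "source": {"connector": "postgresql", "db": "shop",
             "table": "orders", "ts_ms": 1700000000000}
}
""")

def describe(evt):
    """Summarize a Debezium envelope as (operation, key, row)."""
    ops = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}
    op = ops[evt["op"]]
    # For deletes only "before" is populated; otherwise use "after".
    row = evt["after"] if evt["after"] is not None else evt["before"]
    return op, row["order_id"], row

op, key, row = describe(event)
print(op, key, row["status"])  # update 1001 shipped
```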
How Debezium Integrates with Kafka
Debezium is built on top of Kafka Connect, the integration framework included with Apache Kafka. You deploy Debezium connectors inside a Kafka Connect worker (a JVM process). Connectors are configured with JSON submitted to the Connect REST API and run continuously, tailing the database log and writing events to Kafka.
A typical Debezium deployment involves:
- Kafka Connect worker: One or more workers forming a distributed cluster
- Debezium connector: Configured per database, reads the transaction log
- Kafka topics: One topic per monitored table by default (e.g., `dbserver1.public.orders`)
- Consumers: Applications, stream processors, or sink connectors that read from those topics
Kafka acts as the durable buffer between the database and all downstream consumers. A change captured once by Debezium can be consumed by analytics pipelines, search indexes, notification services, and data warehouses simultaneously — the fan-out pattern.
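The fan-out pattern boils down to one durable log with independently tracked consumer positions. A self-contained in-memory sketch (class names are illustrative; real deployments use Kafka topics and consumer groups):

```python
# Minimal in-memory sketch of the fan-out pattern: one durable log,
# many consumers, each tracking its own offset independently.

class Log:
    """Stand-in for a Kafka topic: a durable, append-only event buffer."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

class Consumer:
    """Stand-in for a Kafka consumer with its own committed offset."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.events[self.offset:]
        self.offset = len(self.log.events)
        return batch

log = Log()
analytics, search_index = Consumer(log), Consumer(log)

log.append({"op": "c", "after": {"order_id": 1}})
assert analytics.poll() == search_index.poll()  # both see the same change

log.append({"op": "u", "after": {"order_id": 1}})
# A lagging consumer simply resumes from its own offset later.
assert len(search_index.poll()) == 1
```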
Step-by-Step Tutorial
Step 1: Deploy Debezium with Kafka Connect
Start a Kafka Connect worker with the Debezium PostgreSQL connector plugin on the classpath, then register a connector:
{
"name": "orders-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "debezium",
"database.password": "debezium",
"database.dbname": "shop",
"database.server.name": "dbserver1",
"slot.name": "debezium_orders",
"plugin.name": "pgoutput",
"table.include.list": "public.orders",
"topic.prefix": "dbserver1"
}
}
Post this configuration to the Kafka Connect REST API:
curl -X POST http://kafka-connect:8083/connectors \
-H "Content-Type: application/json" \
-d @connector-config.json
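The same registration can be done programmatically. A hedged sketch using only the Python standard library (the endpoint and configuration mirror the example above; the commented-out call requires a running Kafka Connect cluster):

```python
import json
import urllib.request

# Connector configuration mirroring the JSON example above.
config = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "debezium",
        "database.dbname": "shop",
        "slot.name": "debezium_orders",
        "plugin.name": "pgoutput",
        "table.include.list": "public.orders",
        "topic.prefix": "dbserver1",
    },
}

def register(connect_url, cfg):
    """POST the connector config to the Kafka Connect REST API."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(cfg).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# register("http://kafka-connect:8083", config)  # needs a live cluster
```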
Step 2: Connect to RisingWave
RisingWave is a PostgreSQL-compatible streaming database that can consume Debezium events directly from Kafka. Because a Debezium topic carries updates and deletes, define a table with a primary key over the topic so RisingWave can apply them (the column list below assumes a simple orders schema):
CREATE TABLE orders_cdc (
  order_id INT PRIMARY KEY,
  status VARCHAR,
  amount NUMERIC,
  ts_ms BIGINT
)
WITH (
  connector = 'kafka',
  topic = 'dbserver1.public.orders',
  properties.bootstrap.server = 'kafka:9092',
  scan.startup.mode = 'earliest'
) FORMAT DEBEZIUM ENCODE JSON;
RisingWave understands the Debezium envelope format, automatically applying the before/after/op fields to maintain a consistent materialized state.
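Conceptually, folding the envelope into current state is a keyed upsert/delete. A Python sketch of the idea (an illustration only, not RisingWave's actual implementation; the `order_id` key is an assumption):

```python
def apply_event(state, event):
    """Fold one Debezium envelope into a dict keyed by primary key.

    Inserts, updates, and snapshot reads upsert the 'after' image;
    deletes remove the key using the 'before' image.
    """
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        state[row["order_id"]] = row
    elif op == "d":
        state.pop(event["before"]["order_id"], None)
    return state

state = {}
apply_event(state, {"op": "c", "after": {"order_id": 1, "status": "pending"}})
apply_event(state, {"op": "u", "after": {"order_id": 1, "status": "shipped"}})
apply_event(state, {"op": "d", "before": {"order_id": 1}, "after": None})
print(state)  # {}
```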
Step 3: Build a Materialized View
Once the source is defined, you can build real-time aggregations or transformations as materialized views:
CREATE MATERIALIZED VIEW order_summary AS
SELECT
DATE_TRUNC('hour', TO_TIMESTAMP(ts_ms / 1000)) AS hour,
status,
COUNT(*) AS order_count,
SUM(amount) AS total_amount
FROM orders_cdc
GROUP BY 1, 2;
This view updates incrementally as new CDC events arrive — no batch jobs required.
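The incremental maintenance behind such a view amounts to adjusting per-group counters as events arrive, retracting the `before` image and adding the `after` image, instead of rescanning the table. A simplified Python illustration (field names, amounts, and the hour bucketing mirror the view above but are assumptions):

```python
from collections import defaultdict

# Incrementally maintained aggregates: (hour, status) -> [count, total_amount]
summary = defaultdict(lambda: [0, 0.0])

def on_event(event):
    """Adjust group aggregates for one CDC event; no re-aggregation."""
    def adjust(row, sign):
        if row is None:
            return
        key = (row["ts_ms"] // 3_600_000, row["status"])  # hour bucket, status
        summary[key][0] += sign
        summary[key][1] += sign * row["amount"]
    adjust(event.get("before"), -1)   # retract the old row, if any
    adjust(event.get("after"), +1)    # add the new row, if any

on_event({"after": {"ts_ms": 0, "status": "pending", "amount": 10.0}})
on_event({"before": {"ts_ms": 0, "status": "pending", "amount": 10.0},
          "after":  {"ts_ms": 0, "status": "shipped", "amount": 10.0}})
print(dict(summary))
```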
Step 4: Sink to Downstream Systems
Stream the results to a downstream data store or analytics platform:
CREATE SINK order_summary_sink
FROM order_summary
WITH (
connector = 'kafka',
topic = 'order-summary-output',
properties.bootstrap.server = 'kafka:9092'
) FORMAT DEBEZIUM ENCODE JSON;
Comparison Table
| Feature | Debezium + Kafka | Native DB Polling |
|---|---|---|
| Latency | Milliseconds | Minutes to hours |
| Source impact | Log-based (minimal) | Query load on DB |
| Captures deletes | Yes | Difficult |
| Schema evolution support | Yes (with Schema Registry) | Manual |
| Fan-out (multiple consumers) | Built-in via Kafka | Must re-query per consumer |
| Operational complexity | Medium-high | Low |
FAQ
Does Debezium work with managed databases like Amazon RDS?
Yes. Debezium supports Amazon RDS for PostgreSQL and MySQL, as well as Amazon Aurora. You need to enable logical replication (rds.logical_replication=1 for PostgreSQL) and create a replication user with the appropriate privileges.
What happens if Debezium goes down?
Debezium stores its last-read offset (log position) in Kafka. When it restarts, it resumes from where it left off. For PostgreSQL, the replication slot ensures the database retains WAL segments until Debezium confirms it has processed them.
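The recovery behavior can be sketched in a few lines: persist the last processed position, and on restart continue from it. A toy Python simulation (a dict stands in for the Kafka-backed offset store; the crash is simulated with an early break):

```python
# Sketch of offset-based recovery: the last processed log position is
# persisted, and on restart processing resumes from it. Debezium keeps
# this offset in a Kafka topic; here a plain dict stands in for it.

log = [f"change-{i}" for i in range(10)]   # stand-in for the DB's change log
offsets = {}                               # durable offset store

def run(name, stop_after=None):
    """Process the log from the stored offset, checkpointing as we go."""
    start = offsets.get(name, 0)
    processed = []
    for pos in range(start, len(log)):
        if stop_after is not None and len(processed) == stop_after:
            break                          # simulate a crash mid-stream
        processed.append(log[pos])
        offsets[name] = pos + 1            # commit offset after processing
    return processed

first = run("orders-connector", stop_after=4)   # "crashes" after 4 events
second = run("orders-connector")                # restart: resumes, no replay
assert first + second == log
```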
How is RisingWave's native CDC different from Debezium?
RisingWave also offers native CDC connectors (connector = 'postgres-cdc' and connector = 'mysql-cdc') that connect directly to the database without requiring Kafka or Debezium. Native CDC is simpler to operate for single-consumer scenarios. Debezium with Kafka is better when you need to fan out changes to many consumers simultaneously.
Key Takeaways
- Debezium captures database changes from transaction logs with millisecond latency
- It runs as Kafka Connect source connectors, emitting structured events with `before`, `after`, and `op` fields
- RisingWave consumes Debezium events from Kafka via `FORMAT DEBEZIUM ENCODE JSON`
- Materialized views in RisingWave update incrementally as CDC events arrive
- Choose Debezium + Kafka for fan-out architectures; choose RisingWave's native CDC for simpler, direct pipelines

