Every time a customer places an order, updates their shipping address, or cancels a subscription, your source database changes. The challenge is getting those changes to every downstream system that needs them: your analytics warehouse, your search index, your recommendation engine, your monitoring dashboard. Batch ETL jobs that run every hour (or every night) leave you operating on stale data. For many applications, that delay is unacceptable.
Change Data Capture (CDC) solves this problem by tracking row-level changes in a database and streaming them to downstream consumers as they happen. Instead of periodically dumping entire tables, CDC captures only what changed: inserts, updates, and deletes. The result is lower latency, reduced load on the source database, and a continuous flow of fresh data to every system that needs it.
In this article, you will learn what CDC is and how it works, compare the most popular CDC tools (Debezium, Airbyte, and Fivetran), and follow a step-by-step tutorial using RisingWave's built-in CDC connectors to stream changes from PostgreSQL and MySQL into materialized views with plain SQL.
How Change Data Capture Works
Change Data Capture is a method for detecting and capturing row-level changes (inserts, updates, and deletes) in a database and delivering those changes to downstream systems in near real time. Rather than querying entire tables on a schedule, CDC reads the database's own change log, making it both efficient and comprehensive.
There are three primary approaches to implementing CDC, each with different trade-offs.
Log-Based CDC
Log-based CDC reads the database's internal transaction log: the Write-Ahead Log (WAL) in PostgreSQL or the binary log (binlog) in MySQL. Every committed transaction is already recorded in these logs, so reading them adds virtually zero overhead to the source database.
This is the gold standard for CDC. It captures every change (including deletes), maintains the correct ordering of events, and operates with sub-second latency. Debezium, Fivetran, and RisingWave all use log-based CDC as their primary mechanism.
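If you want to see the raw change stream for yourself, PostgreSQL ships a test_decoding output plugin that decodes the WAL into readable text. The sketch below is purely illustrative (it assumes wal_level = logical, replication privileges, and a hypothetical orders table):

```sql
-- Create a throwaway logical replication slot using the test_decoding plugin
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Make a change; it is recorded in the WAL as part of the commit
INSERT INTO orders (order_id, customer_id) VALUES (42, 7);

-- Peek at the decoded change stream: each committed change appears
-- as an ordered entry, with no extra load on the writing transaction
SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

-- Drop the slot so PostgreSQL does not retain WAL for it indefinitely
SELECT pg_drop_replication_slot('demo_slot');
```

Log-based CDC tools do essentially this, continuously and with a durable cursor, using a production-grade output plugin instead of test_decoding.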
Trigger-Based CDC
Trigger-based CDC attaches database triggers to monitored tables. When a row changes, the trigger writes the old and new values to a shadow table, which a separate process then reads and forwards.
While straightforward to implement, triggers add write overhead to every transaction on the source database. They also require schema changes (the shadow tables and triggers themselves), which can complicate database management.
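To make the overhead concrete, here is a minimal PostgreSQL sketch of the pattern (table and column names are hypothetical): a shadow table plus a trigger whose body executes inside every write transaction on the monitored table.

```sql
-- Hypothetical shadow table recording every change to 'orders'
CREATE TABLE orders_audit (
    change_id  BIGSERIAL PRIMARY KEY,
    operation  TEXT NOT NULL,                       -- 'INSERT', 'UPDATE', or 'DELETE'
    changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    old_row    JSONB,                               -- NULL for inserts
    new_row    JSONB                                -- NULL for deletes
);

CREATE OR REPLACE FUNCTION capture_order_change() RETURNS trigger AS $$
BEGIN
    -- This extra INSERT runs inside every write transaction on 'orders';
    -- that per-write cost is exactly what log-based CDC avoids.
    INSERT INTO orders_audit (operation, old_row, new_row)
    VALUES (
        TG_OP,
        CASE WHEN TG_OP IN ('UPDATE', 'DELETE') THEN to_jsonb(OLD) END,
        CASE WHEN TG_OP IN ('INSERT', 'UPDATE') THEN to_jsonb(NEW) END
    );
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_order_change();
```

A separate process must still poll orders_audit and forward the rows, which adds its own latency and cleanup burden.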
Timestamp-Based CDC (Polling)
Timestamp-based CDC polls tables for rows where an updated_at column is more recent than the last poll time. This approach is simple and works without any special database configuration.
However, it has significant limitations: it cannot detect deletes (deleted rows have no timestamp to check), it misses changes if the timestamp column is not consistently updated, and frequent polling adds load to the source database. It also introduces latency equal to the polling interval.
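A typical polling query looks like the sketch below (assuming a hypothetical orders table with an indexed updated_at column, and a stored watermark from the previous run):

```sql
-- Fetch rows modified since the last poll; the watermark value is illustrative
SELECT *
FROM orders
WHERE updated_at > '2026-03-31 10:00:00+00'
ORDER BY updated_at;

-- What this query cannot see:
--   * a row DELETEd since the last poll simply vanishes from the results
--   * an UPDATE that did not touch updated_at is silently missed
```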
| Approach | Latency | Delete Detection | Source DB Overhead | Complexity |
|---|---|---|---|---|
| Log-based | Sub-second | Yes | Minimal | Medium |
| Trigger-based | Sub-second | Yes | High (per-write) | High |
| Timestamp-based | Seconds to minutes | No | Medium (polling) | Low |
For production workloads, log-based CDC is almost always the right choice. The rest of this article focuses on log-based tools and techniques.
Comparing Popular CDC Tools
The CDC tooling landscape ranges from open-source frameworks you self-host to fully managed SaaS platforms. Here is how three of the most widely used options compare.
Debezium
Debezium is an open-source, distributed CDC platform built on top of Apache Kafka Connect. It reads transaction logs from PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, and other databases, then publishes change events as JSON messages to Kafka topics.
Strengths:
- Broad database support (10+ connectors)
- Fine-grained control over serialization, filtering, and routing
- Large community and extensive documentation
- Completely free and open source
Trade-offs:
- Requires running and operating Kafka, ZooKeeper (or KRaft), and Kafka Connect
- Configuration-heavy: connector JSON files, schema registry setup, topic management
- Operational burden falls entirely on your team
- No built-in processing: you need a separate system (Flink, Spark, or a streaming database) to query or transform the CDC events
Debezium is the right choice when your organization already runs Kafka and has the engineering capacity to manage the infrastructure.
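For a sense of the configuration involved, a minimal connector registration posted to the Kafka Connect REST API might look like this (hostnames, credentials, and table lists are placeholders; see the Debezium PostgreSQL connector docs for the full property list):

```json
{
  "name": "ecommerce-pg-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "127.0.0.1",
    "database.port": "5432",
    "database.user": "replication_user",
    "database.password": "securepass",
    "database.dbname": "ecommerce",
    "topic.prefix": "ecommerce",
    "table.include.list": "public.orders,public.customers"
  }
}
```

This file is only the connector half; you still need Kafka topics, serialization settings, and a downstream consumer before the events are usable.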
Airbyte
Airbyte is an open-source data integration platform that supports CDC as one of its replication modes. Under the hood, Airbyte's CDC connectors for PostgreSQL, MySQL, and SQL Server use embedded Debezium.
Strengths:
- 350+ connectors (CDC and non-CDC)
- User-friendly web UI for configuration
- Available as open-source self-hosted or managed cloud
- Connector Development Kit (CDK) for building custom connectors
Trade-offs:
- CDC replication is more batch-oriented than truly streaming; sync intervals are typically minutes, not seconds
- Not designed for real-time analytics or stream processing
- The Debezium dependency adds a layer of abstraction that can complicate debugging
- Resource-intensive: each sync spins up containers
Airbyte works well when your primary goal is loading data into a warehouse (Snowflake, BigQuery, Redshift) and you can tolerate minutes of latency.
Fivetran
Fivetran is a fully managed data integration platform. After acquiring HVR in 2021, Fivetran integrated enterprise-grade log-based CDC directly into its platform, supporting PostgreSQL, MySQL, SQL Server, Oracle, and more.
Strengths:
- Zero infrastructure to manage: no Kafka, no Connect, no servers
- 700+ pre-built connectors
- Automatic schema migration handling
- SOC 2 Type II, HIPAA, and other compliance certifications
Trade-offs:
- Closed source with usage-based pricing that can scale quickly
- Limited control over event routing and transformation
- Minimum sync frequency is typically 1-5 minutes on standard plans
- Vendor lock-in: migrating away requires rebuilding pipelines
Fivetran is ideal for teams that want CDC without any operational overhead and are willing to pay for the convenience.
Comparison Table
| Feature | Debezium | Airbyte | Fivetran | RisingWave |
|---|---|---|---|---|
| License | Open source | Open source / Cloud | Proprietary | Open source / Cloud |
| CDC Method | Log-based | Log-based (Debezium) | Log-based (HVR) | Log-based (built-in) |
| Requires Kafka | Yes | No | No | No |
| Latency | Sub-second | Minutes | 1-5 minutes | Sub-second |
| Built-in Processing | No | No | No | Yes (SQL) |
| Managed Option | No (community) | Airbyte Cloud | Yes | RisingWave Cloud |
| Setup Complexity | High | Medium | Low | Low |
RisingWave: Built-in CDC Without Middleware
RisingWave takes a fundamentally different approach. Instead of requiring a separate CDC tool plus a separate processing engine, RisingWave includes native CDC connectors for PostgreSQL and MySQL directly inside the database. You configure the connection, define your tables, and build streaming queries, all with standard SQL. No Kafka. No Debezium. No Kafka Connect.
This matters because CDC is only the first step. After capturing changes, you almost always need to join, aggregate, filter, or transform them. With traditional CDC pipelines, that means shipping data through Kafka to a stream processor (Flink, Spark Streaming) or loading it into a warehouse for batch queries. With RisingWave, you write a CREATE SOURCE statement to connect to your upstream database, create tables that map to upstream tables, and then build materialized views that incrementally maintain query results as new CDC events arrive.
The architecture collapses from three or four components down to one:
```mermaid
graph LR
    A[PostgreSQL / MySQL] -->|CDC| B[RisingWave]
    B --> C[Materialized Views]
    C --> D[Dashboards / APIs / Downstream]
```
Compare this with the traditional Debezium pipeline:
```mermaid
graph LR
    A[PostgreSQL / MySQL] -->|WAL / Binlog| B[Debezium]
    B --> C[Kafka]
    C --> D[Stream Processor]
    D --> E[Downstream]
```
Fewer moving parts means fewer things that break at 2 AM.
Tutorial: Stream CDC Events from PostgreSQL into RisingWave
This hands-on tutorial walks you through setting up a CDC pipeline from a PostgreSQL source database into RisingWave, creating materialized views that update automatically as the source data changes. All SQL has been verified against RisingWave 2.8.0.
Prerequisites
- A running PostgreSQL instance (the source database) with logical replication enabled
- A running RisingWave instance (quickstart guide)
- The psql client for connecting to both databases
Step 1: Configure PostgreSQL for Logical Replication
Edit your PostgreSQL postgresql.conf to enable logical replication:
```ini
# postgresql.conf
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
```
Restart PostgreSQL after making these changes. Then create a user with replication privileges:
```sql
-- Run on your PostgreSQL source database
CREATE USER replication_user WITH REPLICATION PASSWORD 'securepass';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
```
Step 2: Create the CDC Source in RisingWave
Connect to your RisingWave instance and create a shared CDC source that points to your PostgreSQL database:
```sql
-- Run on RisingWave
CREATE SOURCE pg_ecommerce WITH (
    connector = 'postgres-cdc',
    hostname = '127.0.0.1',
    port = '5432',
    username = 'replication_user',
    password = 'securepass',
    database.name = 'ecommerce',
    schema.name = 'public',
    slot.name = 'risingwave_slot'
);
```
RisingWave manages the replication slot automatically. The slot.name parameter lets you specify a custom name for easier identification.
Step 3: Create Tables from the CDC Source
For each upstream PostgreSQL table you want to replicate, create a corresponding table in RisingWave using FROM source TABLE:
```sql
-- Map the upstream 'orders' table
CREATE TABLE cdc_orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product_name VARCHAR,
    quantity INT,
    total_amount DECIMAL,
    order_status VARCHAR,
    created_at TIMESTAMPTZ
) FROM pg_ecommerce TABLE 'public.orders';

-- Map the upstream 'customers' table
CREATE TABLE cdc_customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR,
    email VARCHAR,
    region VARCHAR,
    created_at TIMESTAMPTZ
) FROM pg_ecommerce TABLE 'public.customers';
```
RisingWave immediately begins snapshotting the existing data and then switches to streaming mode, capturing every subsequent insert, update, and delete from the source.
Step 4: Build Materialized Views on CDC Data
Now create materialized views that join and aggregate the CDC data. These views update incrementally as new changes arrive from PostgreSQL:
```sql
CREATE MATERIALIZED VIEW mv_customer_order_summary AS
SELECT
    c.customer_id,
    c.customer_name,
    c.region,
    COUNT(o.order_id) AS total_orders,
    SUM(o.total_amount) AS total_spent,
    MAX(o.created_at) AS last_order_at
FROM cdc_customers c
JOIN cdc_orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.region;
```
This view joins customers with their orders, computing running totals that stay current without any scheduled jobs or manual refreshes.
Create a second view for monitoring order fulfillment:
```sql
CREATE MATERIALIZED VIEW mv_order_status_monitor AS
SELECT
    order_status,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_revenue,
    AVG(total_amount) AS avg_order_value
FROM cdc_orders
GROUP BY order_status;
```
Step 5: Query the Results
Query the materialized views just like regular tables:
```sql
SELECT * FROM mv_customer_order_summary ORDER BY total_spent DESC;
```

```
 customer_id | customer_name |  region  | total_orders | total_spent |       last_order_at
-------------+---------------+----------+--------------+-------------+---------------------------
           1 | Alice Chen    | US-West  |            3 |      239.94 | 2026-03-31 10:00:00+00:00
           3 | Carol Kim     | EU-West  |            2 |      219.98 | 2026-04-01 09:00:00+00:00
           2 | Bob Martinez  | US-East  |            2 |      139.98 | 2026-03-30 08:45:00+00:00
           4 | David Patel   | AP-South |            1 |       69.99 | 2026-03-30 16:10:00+00:00
```
```sql
SELECT * FROM mv_order_status_monitor ORDER BY order_count DESC;
```

```
 order_status | order_count | total_revenue | avg_order_value
--------------+-------------+---------------+-----------------
 shipped      |           3 |        309.96 |          103.32
 processing   |           3 |        199.97 |           66.66
 delivered    |           2 |        159.96 |           79.98
```
When a new order is inserted into the source PostgreSQL database, RisingWave captures the change via CDC and both materialized views update automatically within seconds, with no manual intervention required.
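You can watch this happen end to end. Insert a row on the source side (the values below are illustrative), then re-query the view on RisingWave a moment later:

```sql
-- On the source PostgreSQL database
INSERT INTO orders (order_id, customer_id, product_name, quantity,
                    total_amount, order_status, created_at)
VALUES (9, 4, 'USB-C Cable', 2, 19.98, 'processing', now());

-- On RisingWave: customer 4's totals now include the new order
SELECT total_orders, total_spent
FROM mv_customer_order_summary
WHERE customer_id = 4;
```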
Setting Up MySQL CDC
The process for MySQL is nearly identical. Enable the binlog in your MySQL configuration:
```ini
# my.cnf
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
```
Then create the source in RisingWave:
```sql
CREATE SOURCE mysql_ecommerce WITH (
    connector = 'mysql-cdc',
    hostname = '127.0.0.1',
    port = '3306',
    username = 'root',
    password = '123456',
    database.name = 'ecommerce',
    server.id = 5888
);
```
The server.id must be unique across all database clients connected to the MySQL cluster. After creating the source, the CREATE TABLE ... FROM source TABLE syntax works the same way as with PostgreSQL.
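For instance, assuming the MySQL ecommerce database contains an orders table with the same columns as the PostgreSQL example, the mapping might look like this (note that MySQL tables are referenced as 'database.table' rather than 'schema.table'):

```sql
-- Map the upstream MySQL 'orders' table into RisingWave
CREATE TABLE cdc_orders_mysql (
    order_id INT PRIMARY KEY,
    customer_id INT,
    total_amount DECIMAL,
    order_status VARCHAR
) FROM mysql_ecommerce TABLE 'ecommerce.orders';
```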
What Is the Difference Between CDC and ETL?
ETL (Extract, Transform, Load) traditionally operates in batches. A scheduler triggers a job that queries source tables, transforms the data, and loads it into a destination. This works, but it means your destination data is always at least as stale as the interval between batch runs.
CDC is fundamentally event-driven. It captures each individual change as it happens and streams it to consumers. You can think of CDC as the "Extract" step of ETL, but operating continuously rather than on a schedule. When paired with a streaming database like RisingWave, CDC replaces the entire ETL pipeline: changes are captured (Extract), processed by materialized views (Transform), and the results are available for querying or sinking to downstream systems (Load), all in real time.
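The "Load" step in this picture is a sink. As a sketch, the summary view from the tutorial could be streamed to a Kafka topic like this (connector settings and the topic name are illustrative; check the RisingWave sink documentation for your target system):

```sql
-- Continuously emit changes to mv_customer_order_summary as upsert
-- events keyed by customer_id
CREATE SINK order_summary_sink
FROM mv_customer_order_summary
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'localhost:9092',
    topic = 'order-summary',
    primary_key = 'customer_id'
) FORMAT UPSERT ENCODE JSON;
```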
When Should You Use CDC Instead of Batch Replication?
Use CDC when your use case requires data freshness measured in seconds rather than hours. Common scenarios include real-time analytics dashboards, fraud detection systems, operational monitoring, cache invalidation, and synchronizing data across microservices. If your analytics queries run on yesterday's data and that is acceptable, batch replication is simpler and cheaper. But if a five-minute delay in detecting a fraudulent transaction or a stock-out condition costs real money, CDC is the right approach.
How Does RisingWave Handle CDC Differently from Debezium?
Debezium is a CDC connector: it reads database logs and publishes change events to Kafka. It does one thing well, but it requires Kafka as a transport layer and a separate system to process the events. RisingWave is a streaming database that includes CDC connectors natively. It reads database logs directly, processes the change events with SQL (joins, aggregations, windowing), and stores the results in materialized views. The key difference is that RisingWave eliminates the need for Kafka and a separate stream processor, reducing the pipeline from three or four components to one. For teams that need Kafka for other purposes, RisingWave can also consume CDC events via Kafka topics.
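As a sketch of that Kafka path, RisingWave can ingest Debezium-formatted change events directly from a topic (topic name and broker address below are placeholders following Debezium's default topic naming):

```sql
-- Consume Debezium change events from Kafka instead of reading the WAL directly
CREATE TABLE orders_via_kafka (
    order_id INT PRIMARY KEY,
    customer_id INT,
    total_amount DECIMAL,
    order_status VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'ecommerce.public.orders',
    properties.bootstrap.server = 'localhost:9092'
) FORMAT DEBEZIUM ENCODE JSON;
```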
Can CDC Handle Schema Changes in the Source Database?
Schema evolution is one of the trickiest aspects of CDC. When a column is added, renamed, or dropped in the source database, the CDC pipeline must handle the change gracefully. Debezium supports schema change detection and can propagate schema changes through Kafka, but consumers must handle the new schema. Fivetran automatically handles many schema changes (new columns, type widening) without user intervention. RisingWave currently requires you to update the table definition when the source schema changes, which gives you explicit control over when and how schema changes propagate. Check the RisingWave documentation for the latest schema evolution support.
Conclusion
Change Data Capture transforms how data moves between systems, replacing periodic batch dumps with continuous, low-latency event streams. Here are the key takeaways:
- CDC captures row-level changes (inserts, updates, deletes) from database transaction logs, delivering them to downstream systems in sub-second latency.
- Log-based CDC is the production standard, offering minimal source database overhead and complete change detection.
- Debezium is powerful but requires Kafka infrastructure. Airbyte adds a friendly UI but operates in batch intervals. Fivetran is fully managed but proprietary and costly.
- RisingWave eliminates CDC middleware by embedding PostgreSQL and MySQL connectors directly in the database, letting you capture, process, and query change events with SQL alone.
- Materialized views on CDC data give you always-fresh query results without scheduled jobs, manual refreshes, or external orchestrators.
Ready to try CDC with zero middleware? Try RisingWave Cloud free - no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

