CDC Stream Processing: The Complete Guide (2026)

Change Data Capture (CDC) is a method for tracking row-level changes — inserts, updates, and deletes — in a database and streaming those changes to downstream systems in real time. In 2026, CDC is the backbone of real-time data pipelines, powering use cases from database replication to real-time analytics to AI agent context. The most common CDC tools are Debezium, Flink CDC, and RisingWave (which supports native CDC without middleware).

This guide covers how CDC works, the tools available, architecture patterns, and how to build CDC pipelines with SQL.

What Is Change Data Capture?

Change Data Capture captures every data modification event from a database's transaction log (WAL in PostgreSQL, binlog in MySQL) and produces a stream of change events. Each event contains:

  • Operation type: INSERT, UPDATE, or DELETE
  • Before state: The row values before the change (for updates and deletes)
  • After state: The row values after the change (for inserts and updates)
  • Metadata: Timestamp, transaction ID, source table
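To make this concrete, here is what a change event for an UPDATE to an orders row might look like, shown in the Debezium-style envelope that most log-based CDC tools emit some variant of (field values are illustrative):

```json
{
  "op": "u",
  "before": { "order_id": 42, "status": "pending",   "amount": 99.50 },
  "after":  { "order_id": 42, "status": "completed", "amount": 99.50 },
  "source": { "table": "orders", "txId": 5871 },
  "ts_ms": 1767225600123
}
```

The `op` field encodes the operation (`c` for insert, `u` for update, `d` for delete), and consumers diff `before` against `after` to see exactly which columns changed.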

Unlike batch ETL (which periodically queries entire tables), CDC captures changes as they happen — providing sub-second latency between a write in the source database and its availability in downstream systems.

Log-Based CDC vs. Query-Based CDC

| Approach | How It Works | Latency | Impact on Source |
|---|---|---|---|
| Log-based | Reads the database transaction log (WAL/binlog) | Sub-second | Minimal (reads log, not tables) |
| Query-based | Periodically queries tables for changes (timestamps, checksums) | Minutes to hours | High (full table scans) |
| Trigger-based | Database triggers capture changes to shadow tables | Sub-second | High (triggers on every write) |

Log-based CDC is the standard in 2026 because it provides real-time capture with minimal impact on the source database.

CDC Tools Compared

| Tool | Type | Deployment | Source DBs | Destination | Learning Curve |
|---|---|---|---|---|---|
| Debezium | CDC connector | Kafka Connect (requires Kafka) | PostgreSQL, MySQL, MongoDB, SQL Server, Oracle | Kafka topics | Medium |
| Flink CDC | CDC + processing | Flink cluster | PostgreSQL, MySQL, MongoDB, Oracle | Flink sinks | High |
| RisingWave | CDC + processing + serving | Standalone | PostgreSQL, MySQL | Materialized views (queryable) | Low (SQL) |
| Striim | CDC platform | Managed / self-hosted | 100+ sources | Multiple | Medium |
| Debezium Server | Standalone CDC | No Kafka required | Same as Debezium | HTTP, Pub/Sub, Kinesis | Medium |

Architecture Patterns

Pattern 1: Debezium + Kafka + Flink (Traditional)

Source DB → Debezium → Kafka → Flink → Target DB/Warehouse

Pros: Battle-tested, flexible, supports many sources and sinks
Cons: Three separate systems to deploy, manage, and monitor; high operational overhead

Pattern 2: Flink CDC

Source DB → Flink CDC → Flink Processing → Target

Pros: Eliminates Kafka for CDC ingestion
Cons: Still requires a Flink cluster, Java expertise, and state management

Pattern 3: RisingWave Native CDC (Simplest)

Source DB → RisingWave → Query results directly / Sink to target

Pros: Single system, SQL-only, no middleware, built-in serving
Cons: Fewer source database types than Debezium

Building a CDC Pipeline with RisingWave

RisingWave supports native CDC from PostgreSQL and MySQL — no Debezium, no Kafka, no middleware required.

Step 1: Configure the Source Database

For PostgreSQL, enable logical replication by setting wal_level = logical in postgresql.conf (this requires a server restart), then create a publication for the tables you want to capture:

CREATE PUBLICATION my_publication FOR TABLE orders, customers;
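The connection in the next step authenticates as a dedicated replication user. A minimal sketch of creating one in PostgreSQL (the user name and password are placeholders matching the example below; your deployment may need additional grants):

```sql
-- Create a user that is allowed to open replication connections
CREATE USER replication_user WITH REPLICATION LOGIN PASSWORD 'password';

-- Allow the user to read the published tables for the initial snapshot
GRANT SELECT ON orders, customers TO replication_user;
```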

Step 2: Create a CDC Source in RisingWave

CREATE SOURCE orders_cdc WITH (
  connector = 'postgres-cdc',
  hostname = 'postgres-host',
  port = '5432',
  username = 'replication_user',
  password = 'password',
  database.name = 'mydb',
  slot.name = 'risingwave_slot',
  publication.name = 'my_publication'
);

Step 3: Create CDC Tables

CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  amount DECIMAL,
  status VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
) FROM orders_cdc TABLE 'public.orders';

Step 4: Build Real-Time Views Over CDC Data

-- Real-time order analytics that updates with every change
CREATE MATERIALIZED VIEW order_stats AS
SELECT
  status,
  COUNT(*) as order_count,
  SUM(amount) as total_amount,
  AVG(amount) as avg_amount,
  MAX(updated_at) as last_update
FROM orders
GROUP BY status;

-- Real-time customer lifetime value
CREATE MATERIALIZED VIEW customer_ltv AS
SELECT
  c.customer_id,
  c.name,
  COUNT(o.order_id) as total_orders,
  SUM(o.amount) as lifetime_value,
  MAX(o.created_at) as last_order
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.name;

Step 5: Query or Sink the Results

-- Query directly
SELECT * FROM customer_ltv WHERE lifetime_value > 10000;

-- Or sink to another system
CREATE SINK ltv_to_iceberg AS
SELECT * FROM customer_ltv
WITH (
  connector = 'iceberg',
  type = 'upsert',
  primary_key = 'customer_id',
  ...
);

CDC Use Cases

Database Replication and Migration

Replicate data from operational PostgreSQL/MySQL databases to analytical systems in real time. CDC ensures the target is always in sync without impacting source performance.
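As one possible sketch of this pattern in RisingWave, a CDC-backed table can be continuously written to a target PostgreSQL database through a JDBC sink (the URL, credentials, and table name are placeholders; check your RisingWave version's sink documentation for exact parameter names):

```sql
-- Continuously replicate the orders table into a target analytics database
CREATE SINK orders_replica FROM orders
WITH (
  connector = 'jdbc',
  jdbc.url = 'jdbc:postgresql://target-host:5432/analytics',
  user = 'sink_user',
  password = 'password',
  table.name = 'orders',
  type = 'upsert',
  primary_key = 'order_id'
);
```

Because the sink is upsert-typed, updates and deletes captured from the source are applied to the target rather than appended.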

Real-Time Analytics Dashboards

Build dashboards that reflect changes within seconds. Connect Grafana or Metabase to materialized views built over CDC data for always-current metrics.

Event-Driven Microservices

Capture database changes as events that trigger downstream microservice actions — order fulfillment, notification sending, inventory updates — without coupling services to the source database.
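For example, in RisingWave the captured changes can be forwarded to a Kafka topic that microservices already consume, keeping them decoupled from the source database (broker address and topic name are placeholders):

```sql
-- Publish every change to the orders table as an event on a Kafka topic
CREATE SINK order_change_events FROM orders
WITH (
  connector = 'kafka',
  properties.bootstrap.server = 'broker:9092',
  topic = 'order-changes',
  primary_key = 'order_id'
) FORMAT UPSERT ENCODE JSON;
```

Downstream services subscribe to the topic and react to events (fulfillment, notifications, inventory) without ever touching the operational database.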

Data Lake Ingestion

Stream CDC data into Apache Iceberg, Delta Lake, or other lakehouse formats for real-time data lake ingestion without batch ETL jobs.

AI Agent Context

Keep AI agent context fresh by ingesting CDC from operational databases into streaming materialized views that agents can query in real time.

Debezium vs RisingWave for CDC

| Aspect | Debezium + Kafka + Flink | RisingWave Native CDC |
|---|---|---|
| Components | 3 systems (Debezium, Kafka, Flink) | 1 system |
| Languages | Java (Kafka Connect, Flink) | SQL only |
| Processing | Flink Java/SQL | SQL materialized views |
| Serving | Separate database required | Built-in (PostgreSQL protocol) |
| Source DBs | 10+ databases | PostgreSQL, MySQL |
| Setup time | Days | Minutes |
| Operational overhead | High (3 systems to monitor) | Low (1 system) |

Choose Debezium when you need CDC from databases RisingWave doesn't support (MongoDB, SQL Server, Oracle) or when you need Kafka as a central event bus.

Choose RisingWave when your sources are PostgreSQL or MySQL, you want SQL-only development, and you want processing + serving in a single system.

Frequently Asked Questions

What is Change Data Capture (CDC)?

Change Data Capture is a method for tracking row-level changes (inserts, updates, deletes) in a database by reading the database's transaction log. CDC produces a real-time stream of change events that downstream systems can consume, enabling real-time data replication, analytics, and event-driven architectures without impacting source database performance.

Do I need Kafka for CDC?

No. While traditional CDC pipelines use Debezium with Kafka, modern tools like RisingWave and Flink CDC can ingest CDC streams directly from source databases without Kafka as an intermediary. This significantly simplifies the architecture — from three systems (Debezium + Kafka + processor) down to one.

Which databases support CDC?

Most major databases support log-based CDC: PostgreSQL (logical replication), MySQL (binlog), MongoDB (change streams), SQL Server (CT/CDC), Oracle (LogMiner/XStream), and others. RisingWave currently supports native CDC from PostgreSQL and MySQL. Debezium supports the broadest range of source databases.
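To check whether a PostgreSQL instance is ready for log-based CDC, two standard queries are useful (these are stock PostgreSQL commands, not tool-specific):

```sql
-- Should return 'logical'; any other value means logical decoding is disabled
SHOW wal_level;

-- List replication slots, e.g. the slot a CDC tool holds open to read the WAL
SELECT slot_name, plugin, active FROM pg_replication_slots;
```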

How does CDC compare to batch ETL?

CDC captures changes in real time (sub-second latency) by reading the database transaction log, with minimal impact on source performance. Batch ETL periodically queries entire tables (minutes to hours latency) with significant impact on source database load. CDC is more efficient for keeping downstream systems in sync because it processes only changes, not full table scans.

Can I use CDC for real-time analytics?

Yes. CDC streams can feed into streaming databases like RisingWave, where you define materialized views that continuously compute analytics over the changing data. This gives you real-time dashboards and metrics that update within seconds of any change in the source database, without building batch ETL pipelines.
