CDC Stream Processing: The Complete Guide (2026)
Change Data Capture (CDC) is a method for tracking row-level changes — inserts, updates, and deletes — in a database and streaming those changes to downstream systems in real time. In 2026, CDC is the backbone of real-time data pipelines, powering use cases from database replication to real-time analytics to AI agent context. The most common CDC tools are Debezium, Flink CDC, and RisingWave (which supports native CDC without middleware).
This guide covers how CDC works, the tools available, architecture patterns, and how to build CDC pipelines with SQL.
What Is Change Data Capture?
Change Data Capture captures every data modification event from a database's transaction log (WAL in PostgreSQL, binlog in MySQL) and produces a stream of change events. Each event contains:
- Operation type: INSERT, UPDATE, or DELETE
- Before state: The row values before the change (for updates and deletes)
- After state: The row values after the change (for inserts and updates)
- Metadata: Timestamp, transaction ID, source table
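For illustration, an UPDATE event in a Debezium-style envelope might look like the following. Field names vary by tool and all values here are hypothetical:

```json
{
  "op": "u",
  "before": { "order_id": 42, "status": "pending", "amount": 99.50 },
  "after": { "order_id": 42, "status": "completed", "amount": 99.50 },
  "source": { "table": "orders", "ts_ms": 1767225600000, "txId": 5871 }
}
```

Downstream consumers use `op` to decide how to apply the event, and the `before`/`after` pair to compute diffs or maintain upsert state.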
Unlike batch ETL (which periodically queries entire tables), CDC captures changes as they happen — providing sub-second latency between a write in the source database and its availability in downstream systems.
Log-Based CDC vs. Query-Based CDC
| Approach | How It Works | Latency | Impact on Source |
| --- | --- | --- | --- |
| Log-based | Reads the database transaction log (WAL/binlog) | Sub-second | Minimal (reads log, not tables) |
| Query-based | Periodically queries tables for changes (timestamps, checksums) | Minutes to hours | High (full table scans) |
| Trigger-based | Database triggers capture changes to shadow tables | Sub-second | High (triggers on every write) |
Log-based CDC is the standard in 2026 because it provides real-time capture with minimal impact on the source database.
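For contrast, query-based CDC usually amounts to a polling query like this sketch (assuming an indexed `updated_at` column and a `:last_sync_ts` high-water mark tracked by the poller):

```sql
-- Run every N minutes; :last_sync_ts is the previous run's high-water mark
SELECT *
FROM orders
WHERE updated_at > :last_sync_ts
ORDER BY updated_at;
-- Limitations: hard-deleted rows never match this predicate, so deletes are
-- missed, and rows updated twice between polls surface only their final state.
```

These gaps, plus the repeated scan load on the source, are why the log-based approach dominates.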
CDC Tools Compared
| Tool | Type | Deployment | Source DBs | Destination | Learning Curve |
| --- | --- | --- | --- | --- | --- |
| Debezium | CDC connector | Kafka Connect (requires Kafka) | PostgreSQL, MySQL, MongoDB, SQL Server, Oracle | Kafka topics | Medium |
| Flink CDC | CDC + processing | Flink cluster | PostgreSQL, MySQL, MongoDB, Oracle | Flink sinks | High |
| RisingWave | CDC + processing + serving | Standalone | PostgreSQL, MySQL | Materialized views (queryable) | Low (SQL) |
| Striim | CDC platform | Managed / self-hosted | 100+ sources | Multiple | Medium |
| Debezium Server | Standalone CDC | No Kafka required | Same as Debezium | HTTP, Pub/Sub, Kinesis | Medium |
Architecture Patterns
Pattern 1: Traditional CDC Pipeline (Debezium + Kafka + Flink)
Source DB → Debezium → Kafka → Flink → Target DB/Warehouse
Pros: Battle-tested, flexible, supports many sources and sinks.
Cons: Three separate systems to deploy, manage, and monitor; high operational overhead.
Pattern 2: Flink CDC (Simplified)
Source DB → Flink CDC → Flink Processing → Target
Pros: Eliminates Kafka for CDC ingestion.
Cons: Still requires a Flink cluster, Java expertise, and state management.
Pattern 3: RisingWave Native CDC (Simplest)
Source DB → RisingWave → Query results directly / Sink to target
Pros: Single system, SQL-only, no middleware, built-in serving.
Cons: Fewer supported source database types than Debezium.
Building a CDC Pipeline with RisingWave
RisingWave supports native CDC from PostgreSQL and MySQL — no Debezium, no Kafka, no middleware required.
Step 1: Configure the Source Database
For PostgreSQL, enable logical replication:
```sql
-- In postgresql.conf:
--   wal_level = logical

-- Create a publication for the tables you want to capture
CREATE PUBLICATION my_publication FOR TABLE orders, customers;
```
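The source database also needs a user with replication privileges. A minimal sketch (user and password here are assumptions, chosen to match the credentials used in Step 2):

```sql
-- Replication privilege lets RisingWave read the WAL;
-- SELECT is needed for the initial table snapshot.
CREATE USER replication_user WITH REPLICATION PASSWORD 'password';
GRANT SELECT ON orders, customers TO replication_user;
```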
Step 2: Create a CDC Source in RisingWave
```sql
CREATE SOURCE orders_cdc WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres-host',
    port = '5432',
    username = 'replication_user',
    password = 'password',
    database.name = 'mydb',
    slot.name = 'risingwave_slot',
    publication.name = 'my_publication'
);
```
Step 3: Create CDC Tables
```sql
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    amount DECIMAL,
    status VARCHAR,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
) FROM orders_cdc TABLE 'public.orders';
```
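The `customer_ltv` view in Step 4 also joins against a `customers` table, which can be created from the same CDC source. The column list below is an assumption for illustration; it must match the upstream table's schema:

```sql
-- Sketch: second CDC table from the same source (columns assumed)
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR,
    email VARCHAR
) FROM orders_cdc TABLE 'public.customers';
```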
Step 4: Build Real-Time Views Over CDC Data
```sql
-- Real-time order analytics that updates with every change
CREATE MATERIALIZED VIEW order_stats AS
SELECT
    status,
    COUNT(*) AS order_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount,
    MAX(updated_at) AS last_update
FROM orders
GROUP BY status;

-- Real-time customer lifetime value
CREATE MATERIALIZED VIEW customer_ltv AS
SELECT
    c.customer_id,
    c.name,
    COUNT(o.order_id) AS total_orders,
    SUM(o.amount) AS lifetime_value,
    MAX(o.created_at) AS last_order
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.name;
```
Step 5: Query or Sink the Results
```sql
-- Query directly
SELECT * FROM customer_ltv WHERE lifetime_value > 10000;

-- Or sink to another system
CREATE SINK ltv_to_iceberg AS
SELECT * FROM customer_ltv
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'customer_id',
    ...
);
```
CDC Use Cases
Database Replication and Migration
Replicate data from operational PostgreSQL/MySQL databases to analytical systems in real time. CDC ensures the target is always in sync without impacting source performance.
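A replication pipeline can be closed out with a sink back to a relational target. A sketch using a JDBC-style sink (connection details, target database, and exact parameter names are assumptions; check your RisingWave version's sink options):

```sql
-- Sketch: replicate the orders CDC table into a target PostgreSQL database
CREATE SINK orders_replica FROM orders WITH (
    connector = 'jdbc',
    jdbc.url = 'jdbc:postgresql://target-host:5432/analytics',
    table.name = 'orders',
    type = 'upsert',
    primary_key = 'order_id'
);
```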
Real-Time Analytics Dashboards
Build dashboards that reflect changes within seconds. Connect Grafana or Metabase to materialized views built over CDC data for always-current metrics.
Event-Driven Microservices
Capture database changes as events that trigger downstream microservice actions — order fulfillment, notification sending, inventory updates — without coupling services to the source database.
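Changes can be published as events for services to subscribe to; a sketch of a Kafka sink over the `orders` CDC table (broker address and topic name are assumptions):

```sql
-- Sketch: emit each change to a Kafka topic as JSON
CREATE SINK order_events FROM orders WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'broker:9092',
    topic = 'order-changes'
) FORMAT PLAIN ENCODE JSON;
```

Services such as fulfillment or notifications then consume `order-changes` rather than polling the operational database.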
Data Lake Ingestion
Stream CDC data into Apache Iceberg, Delta Lake, or other lakehouse formats for real-time data lake ingestion without batch ETL jobs.
AI Agent Context
Keep AI agent context fresh by ingesting CDC from operational databases into streaming materialized views that agents can query in real time.
Debezium vs RisingWave for CDC
| Aspect | Debezium + Kafka + Flink | RisingWave Native CDC |
| --- | --- | --- |
| Components | 3 systems (Debezium, Kafka, Flink) | 1 system |
| Languages | Java (Kafka Connect, Flink) | SQL only |
| Processing | Flink Java/SQL | SQL materialized views |
| Serving | Separate database required | Built-in (PostgreSQL protocol) |
| Source DBs | 10+ databases | PostgreSQL, MySQL |
| Setup time | Days | Minutes |
| Operational overhead | High (3 systems to monitor) | Low (1 system) |
Choose Debezium when you need CDC from databases RisingWave doesn't support (MongoDB, SQL Server, Oracle) or when you need Kafka as a central event bus.
Choose RisingWave when your sources are PostgreSQL or MySQL, you want SQL-only development, and you want processing + serving in a single system.
Frequently Asked Questions
What is Change Data Capture (CDC)?
Change Data Capture is a method for tracking row-level changes (inserts, updates, deletes) in a database by reading the database's transaction log. CDC produces a real-time stream of change events that downstream systems can consume, enabling real-time data replication, analytics, and event-driven architectures without impacting source database performance.
Do I need Kafka for CDC?
No. While traditional CDC pipelines use Debezium with Kafka, modern tools like RisingWave and Flink CDC can ingest CDC streams directly from source databases without Kafka as an intermediary. This significantly simplifies the architecture — from three systems (Debezium + Kafka + processor) down to one.
Which databases support CDC?
Most major databases support log-based CDC: PostgreSQL (logical replication), MySQL (binlog), MongoDB (change streams), SQL Server (CT/CDC), Oracle (LogMiner/XStream), and others. RisingWave currently supports native CDC from PostgreSQL and MySQL. Debezium supports the broadest range of source databases.
How does CDC compare to batch ETL?
CDC captures changes in real time (sub-second latency) by reading the database transaction log, with minimal impact on source performance. Batch ETL periodically queries entire tables (minutes to hours latency) with significant impact on source database load. CDC is more efficient for keeping downstream systems in sync because it processes only changes, not full table scans.
Can I use CDC for real-time analytics?
Yes. CDC streams can feed into streaming databases like RisingWave, where you define materialized views that continuously compute analytics over the changing data. This gives you real-time dashboards and metrics that update within seconds of any change in the source database, without building batch ETL pipelines.

