CDC Stream Processing: The Complete Guide (2026)

Change Data Capture (CDC) is a method for tracking row-level changes — inserts, updates, and deletes — in a database and streaming those changes to downstream systems in real time. In 2026, CDC is the backbone of real-time data pipelines, powering use cases from database replication to real-time analytics to AI agent context. The most common CDC tools are Debezium, Flink CDC, and RisingWave (which supports native CDC without middleware).

This guide covers how CDC works, the tools available, architecture patterns, and how to build CDC pipelines with SQL.

What Is Change Data Capture?

Change Data Capture captures every data modification event from a database's transaction log (WAL in PostgreSQL, binlog in MySQL) and produces a stream of change events. Each event contains:

  • Operation type: INSERT, UPDATE, or DELETE
  • Before state: The row values before the change (for updates and deletes)
  • After state: The row values after the change (for inserts and updates)
  • Metadata: Timestamp, transaction ID, source table
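To make this concrete, here is what a change event for an UPDATE to an orders row might look like, shown in the Debezium-style envelope that most log-based CDC tools emit some variant of (field values are illustrative):

```json
{
  "op": "u",
  "before": { "order_id": 42, "status": "pending",   "amount": 99.50 },
  "after":  { "order_id": 42, "status": "completed", "amount": 99.50 },
  "source": { "table": "orders", "txId": 5871 },
  "ts_ms": 1767225600123
}
```

The `op` field encodes the operation (`c` for insert, `u` for update, `d` for delete), and consumers diff `before` against `after` to see exactly which columns changed.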

Unlike batch ETL (which periodically queries entire tables), CDC captures changes as they happen — providing sub-second latency between a write in the source database and its availability in downstream systems.

Log-Based CDC vs. Query-Based CDC

| Approach | How It Works | Latency | Impact on Source |
|---|---|---|---|
| Log-based | Reads the database transaction log (WAL/binlog) | Sub-second | Minimal (reads log, not tables) |
| Query-based | Periodically queries tables for changes (timestamps, checksums) | Minutes to hours | High (full table scans) |
| Trigger-based | Database triggers capture changes to shadow tables | Sub-second | High (triggers on every write) |

Log-based CDC is the standard in 2026 because it provides real-time capture with minimal impact on the source database.

CDC Tools Compared

| Tool | Type | Deployment | Source DBs | Destination | Learning Curve |
|---|---|---|---|---|---|
| Debezium | CDC connector | Kafka Connect (requires Kafka) | PostgreSQL, MySQL, MongoDB, SQL Server, Oracle | Kafka topics | Medium |
| Flink CDC | CDC + processing | Flink cluster | PostgreSQL, MySQL, MongoDB, Oracle | Flink sinks | High |
| RisingWave | CDC + processing + serving | Standalone | PostgreSQL, MySQL | Materialized views (queryable) | Low (SQL) |
| Striim | CDC platform | Managed / self-hosted | 100+ sources | Multiple | Medium |
| Debezium Server | Standalone CDC | No Kafka required | Same as Debezium | HTTP, Pub/Sub, Kinesis | Medium |

Architecture Patterns

Pattern 1: Debezium + Kafka + Flink (Traditional)

Source DB → Debezium → Kafka → Flink → Target DB/Warehouse

Pros: Battle-tested, flexible, supports many sources and sinks
Cons: Three separate systems to deploy, manage, and monitor; high operational overhead

Pattern 2: Flink CDC

Source DB → Flink CDC → Flink Processing → Target

Pros: Eliminates Kafka for CDC ingestion
Cons: Still requires a Flink cluster, Java expertise, and state management

Pattern 3: RisingWave Native CDC (Simplest)

Source DB → RisingWave → Query results directly / Sink to target

Pros: Single system, SQL-only, no middleware, built-in serving
Cons: Fewer source database types than Debezium

Building a CDC Pipeline with RisingWave

RisingWave supports native CDC from PostgreSQL and MySQL — no Debezium, no Kafka, no middleware required.

Step 1: Configure the Source Database

For PostgreSQL, enable logical replication by setting wal_level = logical in postgresql.conf (this requires a server restart), then create a publication for the tables you want to capture:

CREATE PUBLICATION my_publication FOR TABLE orders, customers;
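The connection in the next step authenticates as a dedicated replication user. A minimal sketch of creating one in PostgreSQL (the user name and password are placeholders matching the example below; your deployment may need additional grants):

```sql
-- Create a user that is allowed to open replication connections
CREATE USER replication_user WITH REPLICATION LOGIN PASSWORD 'password';

-- Allow the user to read the published tables for the initial snapshot
GRANT SELECT ON orders, customers TO replication_user;
```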

Step 2: Create a CDC Source in RisingWave

CREATE SOURCE orders_cdc WITH (
  connector = 'postgres-cdc',
  hostname = 'postgres-host',
  port = '5432',
  username = 'replication_user',
  password = 'password',
  database.name = 'mydb',
  slot.name = 'risingwave_slot',
  publication.name = 'my_publication'
);

Step 3: Create CDC Tables

CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  amount DECIMAL,
  status VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
) FROM orders_cdc TABLE 'public.orders';

Step 4: Build Real-Time Views Over CDC Data

-- Real-time order analytics that updates with every change
CREATE MATERIALIZED VIEW order_stats AS
SELECT
  status,
  COUNT(*) as order_count,
  SUM(amount) as total_amount,
  AVG(amount) as avg_amount,
  MAX(updated_at) as last_update
FROM orders
GROUP BY status;

-- Real-time customer lifetime value
CREATE MATERIALIZED VIEW customer_ltv AS
SELECT
  c.customer_id,
  c.name,
  COUNT(o.order_id) as total_orders,
  SUM(o.amount) as lifetime_value,
  MAX(o.created_at) as last_order
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.name;

Step 5: Query or Sink the Results

-- Query directly
SELECT * FROM customer_ltv WHERE lifetime_value > 10000;

-- Or sink to another system
CREATE SINK ltv_to_iceberg AS
SELECT * FROM customer_ltv
WITH (
  connector = 'iceberg',
  type = 'upsert',
  primary_key = 'customer_id',
  ...
);

CDC Use Cases

Database Replication and Migration

Replicate data from operational PostgreSQL/MySQL databases to analytical systems in real time. CDC ensures the target is always in sync without impacting source performance.
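As one possible sketch of this pattern in RisingWave, a CDC-backed table can be continuously written to a target PostgreSQL database through a JDBC sink (the URL, credentials, and table name are placeholders; check your RisingWave version's sink documentation for exact parameter names):

```sql
-- Continuously replicate the orders table into a target analytics database
CREATE SINK orders_replica FROM orders
WITH (
  connector = 'jdbc',
  jdbc.url = 'jdbc:postgresql://target-host:5432/analytics',
  user = 'sink_user',
  password = 'password',
  table.name = 'orders',
  type = 'upsert',
  primary_key = 'order_id'
);
```

Because the sink is upsert-typed, updates and deletes captured from the source are applied to the target rather than appended.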

Real-Time Analytics Dashboards

Build dashboards that reflect changes within seconds. Connect Grafana or Metabase to materialized views built over CDC data for always-current metrics.

Event-Driven Microservices

Capture database changes as events that trigger downstream microservice actions — order fulfillment, notification sending, inventory updates — without coupling services to the source database.
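For example, in RisingWave the captured changes can be forwarded to a Kafka topic that microservices already consume, keeping them decoupled from the source database (broker address and topic name are placeholders):

```sql
-- Publish every change to the orders table as an event on a Kafka topic
CREATE SINK order_change_events FROM orders
WITH (
  connector = 'kafka',
  properties.bootstrap.server = 'broker:9092',
  topic = 'order-changes',
  primary_key = 'order_id'
) FORMAT UPSERT ENCODE JSON;
```

Downstream services subscribe to the topic and react to events (fulfillment, notifications, inventory) without ever touching the operational database.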

Data Lake Ingestion

Stream CDC data into Apache Iceberg, Delta Lake, or other lakehouse formats for real-time data lake ingestion without batch ETL jobs.

AI Agent Context

Keep AI agent context fresh by ingesting CDC from operational databases into streaming materialized views that agents can query in real time.

Debezium vs RisingWave for CDC

| Aspect | Debezium + Kafka + Flink | RisingWave Native CDC |
|---|---|---|
| Components | 3 systems (Debezium, Kafka, Flink) | 1 system |
| Languages | Java (Kafka Connect, Flink) | SQL only |
| Processing | Flink Java/SQL | SQL materialized views |
| Serving | Separate database required | Built-in (PostgreSQL protocol) |
| Source DBs | 10+ databases | PostgreSQL, MySQL |
| Setup time | Days | Minutes |
| Operational overhead | High (3 systems to monitor) | Low (1 system) |

Choose Debezium when you need CDC from databases RisingWave doesn't support (MongoDB, SQL Server, Oracle) or when you need Kafka as a central event bus.

Choose RisingWave when your sources are PostgreSQL or MySQL, you want SQL-only development, and you want processing + serving in a single system.

Frequently Asked Questions

What is Change Data Capture (CDC)?

Change Data Capture is a method for tracking row-level changes (inserts, updates, deletes) in a database by reading the database's transaction log. CDC produces a real-time stream of change events that downstream systems can consume, enabling real-time data replication, analytics, and event-driven architectures without impacting source database performance.

Do I need Kafka for CDC?

No. While traditional CDC pipelines use Debezium with Kafka, modern tools like RisingWave and Flink CDC can ingest CDC streams directly from source databases without Kafka as an intermediary. This significantly simplifies the architecture — from three systems (Debezium + Kafka + processor) down to one.

Which databases support CDC?

Most major databases support log-based CDC: PostgreSQL (logical replication), MySQL (binlog), MongoDB (change streams), SQL Server (CT/CDC), Oracle (LogMiner/XStream), and others. RisingWave currently supports native CDC from PostgreSQL and MySQL. Debezium supports the broadest range of source databases.
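To check whether a PostgreSQL instance is ready for log-based CDC, two standard queries are useful (these are stock PostgreSQL commands, not tool-specific):

```sql
-- Should return 'logical'; any other value means logical decoding is disabled
SHOW wal_level;

-- List replication slots, e.g. the slot a CDC tool holds open to read the WAL
SELECT slot_name, plugin, active FROM pg_replication_slots;
```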

How does CDC compare to batch ETL?

CDC captures changes in real time (sub-second latency) by reading the database transaction log, with minimal impact on source performance. Batch ETL periodically queries entire tables (minutes to hours latency) with significant impact on source database load. CDC is more efficient for keeping downstream systems in sync because it processes only changes, not full table scans.

Can I use CDC for real-time analytics?

Yes. CDC streams can feed into streaming databases like RisingWave, where you define materialized views that continuously compute analytics over the changing data. This gives you real-time dashboards and metrics that update within seconds of any change in the source database, without building batch ETL pipelines.
