CDC Architecture Patterns: From Debezium to Streaming Databases

Change Data Capture has evolved from a niche database trick into a foundational pattern for real-time data systems. Four patterns dominate production deployments in 2026: log-based CDC with a message broker, embedded CDC in a streaming database, managed CDC services, and hybrid architectures that combine multiple patterns. Each solves a different problem. Choosing the wrong pattern costs months of rework.

What All Four Patterns Share

Every CDC architecture solves the same core problem: reading database changes from a transaction log without polling, then making those changes available to downstream systems.

The differences are in what happens next — how changes are routed, transformed, stored, and queried — and what infrastructure you must operate to get there.


Pattern 1: Log-Based CDC with Debezium (The Traditional Pattern)

Architecture Description

Source Database (PostgreSQL / MySQL / SQL Server)
    ↓  [replication slot / binlog]
Debezium (Kafka Connect plugin)
    ↓  [Kafka topics, one per table]
Kafka Broker Cluster
    ├─→ Consumer A: Elasticsearch indexer
    ├─→ Consumer B: Data lake loader (S3/Iceberg)
    ├─→ Consumer C: Stream processor (Flink/Spark)
    └─→ Consumer D: Notification service

Debezium reads the database replication log and publishes change events to Kafka topics. Each downstream system is an independent Kafka consumer.

How It Works

Debezium runs as a Kafka Connect plugin. The connector configuration specifies the source database, credentials, and which tables to capture. Debezium creates a replication slot (PostgreSQL) or reads the binlog (MySQL) and produces structured JSON or Avro events to Kafka.

{
  "op": "u",
  "before": { "id": 123, "status": "pending" },
  "after": { "id": 123, "status": "shipped" },
  "source": {
    "db": "ecommerce",
    "table": "orders",
    "ts_ms": 1711929600000
  }
}

The op field indicates the operation: c (create), u (update), d (delete), r (snapshot read).
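
For reference, a minimal connector configuration posted to the Kafka Connect REST API might look like the following sketch (hostnames, credentials, and table names are illustrative, and your Debezium version may require additional properties):

{
  "name": "ecommerce-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "ecommerce",
    "topic.prefix": "ecommerce",
    "table.include.list": "public.orders",
    "slot.name": "debezium_slot"
  }
}

With this configuration, change events for public.orders land on a topic named ecommerce.public.orders, one topic per captured table.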

Use Cases

  • Multi-system fan-out: same change event delivered to search, analytics, and the data lake independently
  • Event replay: Kafka's log retention allows consumers to reprocess events from any point in time
  • Decoupled teams: different teams own different consumers with no coordination required

Trade-offs

  • Operational components: Kafka brokers + Kafka Connect workers + Debezium config + schema registry
  • Latency: 100ms–5s (Kafka poll interval dependent)
  • Fan-out: unlimited consumers
  • Query capability: none (Debezium only produces events)
  • Schema evolution: managed via schema registry (Avro) or manual (JSON)
  • Maintenance burden: high; each component has its own failure modes

When to use: Multiple independent consumers need the same CDC events. Teams are already operating Kafka. Event replay capability is required.

When not to use: Analytics is the only consumer. You do not operate Kafka today. Sub-second latency is required.


Pattern 2: Embedded CDC in a Streaming Database (The Simplified Pattern)

Architecture Description

Source Database (PostgreSQL / MySQL)
    ↓  [replication slot / binlog]
RisingWave (embedded Debezium engine + SQL processor + query serving)
    ↓  [PostgreSQL wire protocol]
BI Tools / Dashboards / Applications

A streaming database like RisingWave embeds the CDC capture layer, the stream processing layer, and the query serving layer in a single system.

How It Works

RisingWave uses the Debezium Embedded Engine — the same Java library that powers Debezium Standalone — to read the database replication log. This is not a reimplementation; it is the same battle-tested library, running in-process within RisingWave instead of as a Kafka Connect plugin.

Change events flow directly into RisingWave's incremental computation engine. Materialized views defined in SQL update automatically as new events arrive.

-- Declare the CDC source
CREATE SOURCE pg_source WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres.internal',
    port = '5432',
    username = 'replicator',
    password = 'secret',
    database.name = 'ecommerce',
    slot.name = 'rw_slot'
);

-- Declare a table backed by the CDC stream
CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT,
    status VARCHAR,
    total NUMERIC,
    created_at TIMESTAMPTZ
) FROM pg_source TABLE 'public.orders';

-- Materialized view: always current, no manual refresh
CREATE MATERIALIZED VIEW hourly_revenue AS
SELECT
    DATE_TRUNC('hour', created_at) AS hour,
    SUM(total) AS revenue,
    COUNT(*) AS orders
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('hour', created_at);
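
Because RisingWave speaks the PostgreSQL wire protocol, downstream clients query the view like an ordinary table. For example, a dashboard might run (column names as defined above):

-- Revenue for the most recent 24 hours, always current
SELECT hour, revenue, orders
FROM hourly_revenue
ORDER BY hour DESC
LIMIT 24;

No refresh step is involved: the view was updated incrementally as each change event arrived.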

Use Cases

  • Real-time analytics: dashboards and reports that must reflect the current state of the database within seconds
  • SQL-based stream processing: teams that want to write transformations in SQL, not Java or Python
  • Simplified operations: small teams that cannot afford to operate Kafka, Kafka Connect, and Flink separately

Trade-offs

  • Operational components: RisingWave only
  • Latency: milliseconds to seconds
  • Fan-out: RisingWave is one logical consumer
  • Query capability: full SQL, PostgreSQL-compatible
  • Schema evolution: additive changes automatic; destructive changes require source refresh
  • Maintenance burden: low (single system)

When to use: Analytics or real-time query serving is the primary CDC use case. No fan-out to multiple independent consumers is needed.

When not to use: Multiple downstream systems need independent access to CDC events. Kafka is already in your stack and teams depend on it.


Pattern 3: Managed CDC (The Outsourced Pattern)

Architecture Description

Source Database
    ↓  [managed connector]
Managed CDC Service (Fivetran / Airbyte Cloud / AWS DMS)
    ↓  [scheduled sync]
Data Warehouse (Snowflake / BigQuery / Redshift)
    ↓
BI Tools

Managed CDC services abstract away all infrastructure. You configure credentials, choose a destination, and the service handles log reading, schema evolution, and delivery. Fivetran uses proprietary connectors; Airbyte uses the Debezium Embedded Engine under the hood. Neither requires you to manage replication slots or consumer lag.

Use Cases

  • Warehouse replication on a schedule (5–60 minute intervals)
  • Teams without streaming infrastructure expertise
  • Compliance copies of production data into a governed warehouse

Trade-offs

  • Operational components: none (managed)
  • Latency: minutes to hours
  • Fan-out: single warehouse destination per connector
  • Query capability: warehouse SQL after sync
  • Maintenance burden: near-zero; vendor dependency

When to use: Destination is a data warehouse. Batch latency is acceptable. No streaming infrastructure expertise on the team.

When not to use: Sub-second latency is required. You need real-time serving, not periodic warehouse loads.


Pattern 4: Hybrid CDC (Debezium + Streaming Database)

Architecture Description

Source Database
    ↓  [replication slot]
Debezium (Kafka Connect)
    ↓
Kafka
    ├─→ RisingWave  ← SQL analytics + real-time serving
    ├─→ Elasticsearch ← full-text search
    ├─→ S3 / Iceberg ← data lake
    └─→ Notification Service ← alerts

The hybrid pattern uses Debezium and Kafka for fan-out, while RisingWave subscribes as one consumer to provide SQL analytics. You get the fan-out of Pattern 1 and the SQL query capability of Pattern 2.

How It Works

Debezium publishes to Kafka. RisingWave reads from Kafka as a consumer with FORMAT DEBEZIUM ENCODE JSON or FORMAT DEBEZIUM ENCODE AVRO.

-- RisingWave consuming Debezium-formatted changes from Kafka.
-- A changelog stream with a primary key is declared as a table,
-- not a source, so updates and deletes can be materialized.
CREATE TABLE orders_from_kafka (
    id BIGINT,
    customer_id BIGINT,
    status VARCHAR,
    total NUMERIC,
    created_at TIMESTAMPTZ,
    PRIMARY KEY (id)
)
WITH (
    connector = 'kafka',
    topic = 'ecommerce.public.orders',
    properties.bootstrap.server = 'kafka.internal:9092',
    scan.startup.mode = 'earliest'
)
FORMAT DEBEZIUM ENCODE JSON;

-- Materialized views work the same as with direct CDC
CREATE MATERIALIZED VIEW order_funnel AS
SELECT
    status,
    COUNT(*) AS count,
    SUM(total) AS value
FROM orders_from_kafka
GROUP BY status;

RisingWave does not hold a replication slot. Debezium holds the slot. RisingWave is just another Kafka consumer.

Use Cases

  • Large organizations with diverse consumers of the same CDC stream
  • Architectures where some consumers are non-SQL (Elasticsearch, S3 sink) and one consumer needs SQL analytics
  • Teams migrating to a streaming database incrementally without disrupting existing consumers

Trade-offs

  • Operational components: Kafka + Kafka Connect + Debezium + RisingWave
  • Latency: Kafka adds 100ms–2s over direct CDC
  • Fan-out: unlimited via Kafka
  • Query capability: full SQL via RisingWave
  • Schema evolution: managed via schema registry; RisingWave reads schema automatically
  • Maintenance burden: high; multiple systems to operate

When to use: Multiple independent consumers exist AND one of them needs SQL analytics. Teams already operate Kafka.

When not to use: Operational complexity is a primary concern. Analytics is the only consumer (use Pattern 2 instead).


Pattern Selection Guide

Do you need multiple independent consumers of CDC events?
    │
    ├─ YES → Do you need SQL analytics as one of those consumers?
    │            ├─ YES → Pattern 4 (Debezium + Kafka + RisingWave)
    │            └─ NO  → Pattern 1 (Debezium + Kafka only)
    │
    └─ NO  → Is the destination a data warehouse with batch latency acceptable?
                 ├─ YES → Pattern 3 (Managed CDC: Fivetran or Airbyte)
                 └─ NO  → Pattern 2 (Embedded CDC: RisingWave direct)

The Evolution of CDC

The trend in 2026 is toward Pattern 2 for analytics use cases: teams are unwilling to operate Kafka, Kafka Connect, and a stream processor when all they want is a SQL query that reflects the current state of a production database.

Pattern 1 is not disappearing. Fan-out to multiple consumers is a genuine requirement, and it calls for a broker. But the scope of that requirement is smaller than many teams assume: in practice, many pipelines built on Debezium + Kafka serve a single analytics consumer.


FAQ

Q: Is the Debezium Embedded Engine in RisingWave the same code as Debezium Standalone? Yes. The Debezium Embedded Engine is a Java library that RisingWave (and Airbyte) embed directly. It is the same log-reading code that powers Debezium Standalone. The difference is that in standalone mode it runs inside a Kafka Connect worker; in embedded mode it runs inside the host application. The reliability of log capture is equivalent.

Q: Which pattern is the most production-proven? Pattern 1 (Debezium + Kafka) has the longest production history and the largest community. Pattern 2 (embedded CDC in streaming databases) has matured significantly since 2023 and is now production-proven at scale. Pattern 3 (managed CDC) is fully mature for warehouse use cases. Pattern 4 (hybrid) is simply a combination of patterns 1 and 2.

Q: Can I start with Pattern 1 and migrate to Pattern 2 later? Yes. If your Kafka topics have only one analytics consumer, Pattern 2 is a viable simplification. The migration involves translating Flink or Spark jobs to RisingWave materialized views and running both systems in parallel for validation before decommissioning Kafka.
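
As a sketch of that translation, a tumbling-window aggregation that would otherwise be a Flink job becomes a single materialized view. The table and column names below are hypothetical; TUMBLE is RisingWave's time-window table function, which emits window_start and window_end columns:

CREATE MATERIALIZED VIEW orders_per_hour AS
SELECT
    window_start,
    COUNT(*) AS orders
FROM TUMBLE(orders, created_at, INTERVAL '1 hour')
GROUP BY window_start;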

Q: What about Kafka Streams and ksqlDB? Both are Pattern 1 variants — they still require Kafka as the underlying broker. RisingWave replaces the broker and the processor together for the single-consumer analytics case; ksqlDB and Kafka Streams do not.

Q: How do streaming databases handle late-arriving events? RisingWave uses watermark-based event time processing. Out-of-order events that arrive before the watermark advances are processed correctly. True late arrivals may not update time-windowed aggregations depending on the window definition — comparable behavior to Flink's late data handling.
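
As an illustration, the watermark is declared on the event-time column when the source is created. The 5-second bound, names, and connector settings below are illustrative, and watermarks apply to append-only streams:

CREATE SOURCE clicks (
    user_id BIGINT,
    url VARCHAR,
    event_time TIMESTAMPTZ,
    WATERMARK FOR event_time AS event_time - INTERVAL '5 seconds'
) WITH (
    connector = 'kafka',
    topic = 'clicks',
    properties.bootstrap.server = 'kafka.internal:9092'
) FORMAT PLAIN ENCODE JSON;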
