CDC Architecture Patterns: From Debezium to Streaming Databases
Change Data Capture has evolved from a niche database trick into a foundational pattern for real-time data systems. Four patterns dominate production deployments in 2026: log-based CDC with a message broker, embedded CDC in a streaming database, managed CDC services, and hybrid architectures that combine multiple patterns. Each solves a different problem. Choosing the wrong pattern costs months of rework.
What All Four Patterns Share
Every CDC architecture solves the same core problem: reading database changes from a transaction log without polling, then making those changes available to downstream systems.
The differences are in what happens next — how changes are routed, transformed, stored, and queried — and what infrastructure you must operate to get there.
Pattern 1: Log-Based CDC with Debezium (The Traditional Pattern)
Architecture Description
Source Database (PostgreSQL / MySQL / SQL Server)
↓ [replication slot / binlog]
Debezium (Kafka Connect plugin)
↓ [Kafka topics, one per table]
Kafka Broker Cluster
├─→ Consumer A: Elasticsearch indexer
├─→ Consumer B: Data lake loader (S3/Iceberg)
├─→ Consumer C: Stream processor (Flink/Spark)
└─→ Consumer D: Notification service
Debezium reads the database replication log and publishes change events to Kafka topics. Each downstream system is an independent Kafka consumer.
How It Works
Debezium runs as a Kafka Connect plugin. The connector configuration specifies the source database, credentials, and which tables to capture. Debezium creates a replication slot (PostgreSQL) or reads the binlog (MySQL) and produces structured JSON or Avro events to Kafka.
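As an illustration, a minimal PostgreSQL connector configuration might look like the following; hostnames, credentials, and the topic prefix are placeholders, and the property names assume a Debezium 2.x setup:

```json
{
  "name": "ecommerce-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "secret",
    "database.dbname": "ecommerce",
    "topic.prefix": "ecommerce",
    "table.include.list": "public.orders",
    "slot.name": "dbz_slot",
    "plugin.name": "pgoutput"
  }
}
```

Once the connector is registered via the Kafka Connect REST API, each committed row change becomes an event: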
{
"op": "u",
"before": { "id": 123, "status": "pending" },
"after": { "id": 123, "status": "shipped" },
"source": {
"db": "ecommerce",
"table": "orders",
"ts_ms": 1711929600000
}
}
The op field indicates the operation: c (create), u (update), d (delete), r (snapshot read).
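To make the envelope concrete, here is a minimal, hypothetical consumer-side sketch in Python that materializes a stream of such events into a local dict keyed by id (the Kafka plumbing is omitted; only the op-handling logic is shown):

```python
def apply_event(state: dict, event: dict) -> None:
    """Apply one Debezium change event to an in-memory copy of the table."""
    op = event["op"]
    if op in ("c", "u", "r"):
        # create, update, and snapshot read all carry the full row in "after"
        row = event["after"]
        state[row["id"]] = row
    elif op == "d":
        # deletes carry the old row in "before"; "after" is null
        del state[event["before"]["id"]]

orders: dict[int, dict] = {}
apply_event(orders, {"op": "c", "before": None,
                     "after": {"id": 123, "status": "pending"}})
apply_event(orders, {"op": "u", "before": {"id": 123, "status": "pending"},
                     "after": {"id": 123, "status": "shipped"}})
```

This is the core of what every stateful consumer does, whether it writes to a search index or a data lake.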
Use Cases
- Multi-system fan-out: same change event delivered to search, analytics, and the data lake independently
- Event replay: Kafka's log retention allows consumers to reprocess events from any point in time
- Decoupled teams: different teams own different consumers with no coordination required
Trade-offs
| Dimension | Value |
| --- | --- |
| Operational components | Kafka brokers + Kafka Connect workers + Debezium config + schema registry |
| Latency | 100ms–5s (Kafka poll interval dependent) |
| Fan-out | Unlimited consumers |
| Query capability | None (Debezium only produces events) |
| Schema evolution | Managed via schema registry (Avro) or manual (JSON) |
| Maintenance burden | High — each component has its own failure modes |
When to use: Multiple independent consumers need the same CDC events. Teams are already operating Kafka. Event replay capability is required.
When not to use: Analytics is the only consumer. You do not operate Kafka today. Sub-second latency is required.
Pattern 2: Embedded CDC in a Streaming Database (The Simplified Pattern)
Architecture Description
Source Database (PostgreSQL / MySQL)
↓ [replication slot / binlog]
RisingWave (embedded Debezium engine + SQL processor + query serving)
↓ [PostgreSQL wire protocol]
BI Tools / Dashboards / Applications
A streaming database like RisingWave embeds the CDC capture layer, the stream processing layer, and the query serving layer in a single system.
How It Works
RisingWave uses the Debezium Embedded Engine — the same Java library that powers Debezium Standalone — to read the database replication log. This is not a reimplementation; it is the same battle-tested library, running in-process within RisingWave instead of as a Kafka Connect plugin.
Change events flow directly into RisingWave's incremental computation engine. Materialized views defined in SQL update automatically as new events arrive.
-- Declare the CDC source
CREATE SOURCE pg_source WITH (
connector = 'postgres-cdc',
hostname = 'postgres.internal',
port = '5432',
username = 'replicator',
password = 'secret',
database.name = 'ecommerce',
slot.name = 'rw_slot'
);
-- Declare a table backed by the CDC stream
CREATE TABLE orders (
id BIGINT PRIMARY KEY,
customer_id BIGINT,
status VARCHAR,
total NUMERIC,
created_at TIMESTAMPTZ
) FROM pg_source TABLE 'public.orders';
-- Materialized view: always current, no manual refresh
CREATE MATERIALIZED VIEW hourly_revenue AS
SELECT
DATE_TRUNC('hour', created_at) AS hour,
SUM(total) AS revenue,
COUNT(*) AS orders
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('hour', created_at);
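Incremental maintenance means the view is updated by deltas rather than recomputed from scratch. A simplified Python analogue of the hourly_revenue view above (a toy sketch that handles only inserts, not updates or deletes) makes the idea concrete:

```python
from collections import defaultdict
from datetime import datetime

# hour -> [revenue, order_count]; a stand-in for the materialized view state
hourly_revenue = defaultdict(lambda: [0.0, 0])

def on_insert(order: dict) -> None:
    """Fold one new order into the aggregate without recomputing the view."""
    if order["status"] != "completed":
        return  # filtered out by the WHERE clause
    hour = order["created_at"].replace(minute=0, second=0, microsecond=0)
    bucket = hourly_revenue[hour]
    bucket[0] += order["total"]   # SUM(total)
    bucket[1] += 1                # COUNT(*)

on_insert({"status": "completed", "total": 40.0,
           "created_at": datetime(2026, 3, 1, 10, 15)})
on_insert({"status": "completed", "total": 60.0,
           "created_at": datetime(2026, 3, 1, 10, 45)})
```

RisingWave's engine generalizes this to arbitrary SQL, including updates, deletes, and joins, which is exactly the bookkeeping you avoid writing by hand.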
Use Cases
- Real-time analytics: dashboards and reports that must reflect the current state of the database within seconds
- SQL-based stream processing: teams that want to write transformations in SQL, not Java or Python
- Simplified operations: small teams that cannot afford to operate Kafka, Kafka Connect, and Flink separately
Trade-offs
| Dimension | Value |
| --- | --- |
| Operational components | RisingWave only |
| Latency | Milliseconds to seconds |
| Fan-out | RisingWave is one logical consumer |
| Query capability | Full SQL, PostgreSQL-compatible |
| Schema evolution | Additive changes automatic; destructive changes require source refresh |
| Maintenance burden | Low — single system |
When to use: Analytics or real-time query serving is the primary CDC use case. No fan-out to multiple independent consumers is needed.
When not to use: Multiple downstream systems need independent access to CDC events. Kafka is already in your stack and teams depend on it.
Pattern 3: Managed CDC (The Outsourced Pattern)
Architecture Description
Source Database
↓ [managed connector]
Managed CDC Service (Fivetran / Airbyte Cloud / AWS DMS)
↓ [scheduled sync]
Data Warehouse (Snowflake / BigQuery / Redshift)
↓
BI Tools
Managed CDC services abstract away all infrastructure. You configure credentials, choose a destination, and the service handles log reading, schema evolution, and delivery. Fivetran uses proprietary connectors; Airbyte uses the Debezium Embedded Engine under the hood. Neither requires you to manage replication slots or consumer lag.
Use Cases
- Warehouse replication on a schedule (5–60 minute intervals)
- Teams without streaming infrastructure expertise
- Compliance copies of production data into a governed warehouse
Trade-offs
| Dimension | Value |
| --- | --- |
| Operational components | None (managed) |
| Latency | Minutes to hours |
| Fan-out | Single warehouse destination per connector |
| Query capability | Warehouse SQL after sync |
| Maintenance burden | Near-zero; vendor dependency |
When to use: Destination is a data warehouse. Batch latency is acceptable. No streaming infrastructure expertise on the team.
When not to use: Sub-second latency is required. You need real-time serving, not periodic warehouse loads.
Pattern 4: Hybrid CDC (Debezium + Streaming Database)
Architecture Description
Source Database
↓ [replication slot]
Debezium (Kafka Connect)
↓
Kafka
├─→ RisingWave ← SQL analytics + real-time serving
├─→ Elasticsearch ← full-text search
├─→ S3 / Iceberg ← data lake
└─→ Notification Service ← alerts
The hybrid pattern uses Debezium and Kafka for fan-out, while RisingWave subscribes as one consumer to provide SQL analytics. You get the fan-out of Pattern 1 and the SQL query capability of Pattern 2.
How It Works
Debezium publishes to Kafka. RisingWave reads from Kafka as a consumer with FORMAT DEBEZIUM ENCODE JSON or FORMAT DEBEZIUM ENCODE AVRO.
-- RisingWave reading from Kafka (Debezium format)
CREATE SOURCE orders_from_kafka (
id BIGINT,
customer_id BIGINT,
status VARCHAR,
total NUMERIC,
created_at TIMESTAMPTZ,
PRIMARY KEY (id)
)
WITH (
connector = 'kafka',
topic = 'ecommerce.public.orders',
properties.bootstrap.server = 'kafka.internal:9092',
scan.startup.mode = 'earliest'
)
FORMAT DEBEZIUM ENCODE JSON;
-- Materialized views work the same as with direct CDC
CREATE MATERIALIZED VIEW order_funnel AS
SELECT
status,
COUNT(*) AS count,
SUM(total) AS value
FROM orders_from_kafka
GROUP BY status;
RisingWave does not hold a replication slot. Debezium holds the slot. RisingWave is just another Kafka consumer.
Use Cases
- Large organizations with diverse consumers of the same CDC stream
- Architectures where some consumers are non-SQL (Elasticsearch, S3 sink) and one consumer needs SQL analytics
- Teams migrating to a streaming database incrementally without disrupting existing consumers
Trade-offs
| Dimension | Value |
| --- | --- |
| Operational components | Kafka + Kafka Connect + Debezium + RisingWave |
| Latency | Kafka adds 100ms–2s over direct CDC |
| Fan-out | Unlimited via Kafka |
| Query capability | Full SQL via RisingWave |
| Schema evolution | Managed via schema registry; RisingWave reads schema automatically |
| Maintenance burden | High — multiple systems to operate |
When to use: Multiple independent consumers exist AND one of them needs SQL analytics. Teams already operate Kafka.
When not to use: Operational complexity is a primary concern. Analytics is the only consumer (use Pattern 2 instead).
Pattern Selection Guide
Do you need multiple independent consumers of CDC events?
│
├─ YES → Do you need SQL analytics as one of those consumers?
│ ├─ YES → Pattern 4 (Debezium + Kafka + RisingWave)
│ └─ NO → Pattern 1 (Debezium + Kafka only)
│
└─ NO → Is the destination a data warehouse with batch latency acceptable?
├─ YES → Pattern 3 (Managed CDC: Fivetran or Airbyte)
└─ NO → Pattern 2 (Embedded CDC: RisingWave direct)
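The same decision tree can be written as a small illustrative helper (the function name and parameters are this article's invention, not any library's API):

```python
def choose_pattern(multi_consumer: bool, needs_sql: bool,
                   warehouse_batch_ok: bool) -> int:
    """Map the selection guide above onto a pattern number (1-4)."""
    if multi_consumer:
        # fan-out requires a broker; SQL analytics adds a streaming database
        return 4 if needs_sql else 1
    # single consumer: warehouse-plus-batch tolerance points at managed CDC
    return 3 if warehouse_batch_ok else 2
```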
The Evolution of CDC
The trend in 2026 is toward Pattern 2 for analytics use cases: teams are unwilling to operate Kafka, Kafka Connect, and a stream processor when all they want is a SQL query that reflects the current state of a production database.
Pattern 1 is not disappearing. Fan-out to multiple consumers is a real requirement, and it genuinely needs a broker. But the scope of that requirement is smaller than many teams assume: in practice, many pipelines built on Debezium + Kafka serve only a single analytics consumer.
FAQ
Q: Is the Debezium Embedded Engine in RisingWave the same code as Debezium Standalone? Yes. The Debezium Embedded Engine is a Java library that RisingWave (and Airbyte) embed directly. It is the same log-reading code that powers Debezium Standalone. The difference is that in standalone mode it runs inside a Kafka Connect worker; in embedded mode it runs inside the host application. The reliability of log capture is equivalent.
Q: Which pattern is the most production-proven? Pattern 1 (Debezium + Kafka) has the longest production history and the largest community. Pattern 2 (embedded CDC in streaming databases) has matured significantly since 2023 and is now production-proven at scale. Pattern 3 (managed CDC) is fully mature for warehouse use cases. Pattern 4 (hybrid) is simply a combination of patterns 1 and 2.
Q: Can I start with Pattern 1 and migrate to Pattern 2 later? Yes. If your Kafka topics have only one analytics consumer, Pattern 2 is a viable simplification. The migration involves translating Flink or Spark jobs to RisingWave materialized views and running both systems in parallel for validation before decommissioning Kafka.
Q: What about Kafka Streams and ksqlDB? Both are Pattern 1 variants — they still require Kafka as the underlying broker. RisingWave replaces the broker and the processor together for the single-consumer analytics case; ksqlDB and Kafka Streams do not.
Q: How do streaming databases handle late-arriving events? RisingWave uses watermark-based event time processing. Out-of-order events that arrive before the watermark advances are processed correctly. True late arrivals may not update time-windowed aggregations depending on the window definition — comparable behavior to Flink's late data handling.
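As a rough illustration of the watermark idea (a generic sketch, not RisingWave's implementation), assume a fixed allowed delay: the watermark trails the maximum event timestamp seen so far, and events that fall behind it are flagged as late:

```python
class WatermarkTracker:
    """Toy event-time tracker: watermark = max timestamp seen - allowed delay."""

    def __init__(self, delay: int = 5):
        self.delay = delay   # seconds of out-of-orderness tolerated
        self.max_ts = 0

    def process(self, ts: int) -> bool:
        """Return True if the event arrives before the watermark passes it."""
        on_time = ts >= self.max_ts - self.delay
        self.max_ts = max(self.max_ts, ts)
        return on_time

w = WatermarkTracker()
results = [w.process(ts) for ts in [10, 12, 8, 11, 2]]
```

Here the event at timestamp 8 is out of order but within the allowed delay, while the event at timestamp 2 arrives after the watermark has moved past it and would miss any closed window.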