How RisingWave CDC Works Under the Hood: Debezium Embedded Engine

How RisingWave CDC Works Under the Hood: Debezium Embedded Engine

How RisingWave CDC Works Under the Hood: Debezium Embedded Engine

RisingWave's CDC connectors are powered by the Debezium Embedded Engine — the same battle-tested change data capture engine used in production at thousands of companies, embedded directly into RisingWave without requiring Kafka or Kafka Connect. When you run a CREATE SOURCE statement pointing at PostgreSQL or MySQL, Debezium's engine is doing the log-reading work internally.


What Is the Debezium Embedded Engine?

Debezium is best known as a Kafka Connect plugin: you deploy it alongside Kafka, configure a connector, and it streams database change events into Kafka topics. That's Debezium Standalone (or Debezium Server).

But Debezium also ships as a Java library called the Debezium Embedded Engine. It exposes the same log-reading logic — PostgreSQL replication slots, MySQL binlog parsing, Oracle LogMiner — as an embeddable component that any JVM or native application can link against.

The Embedded Engine is not a lightweight wrapper. It is the same core that powers the Kafka Connect version. The difference is purely architectural: instead of writing events to Kafka topics, the Embedded Engine delivers events to a callback in the host process.

Airbyte uses the Debezium Embedded Engine for its database sources. RisingWave does the same.


How RisingWave Embeds It

RisingWave is written in Rust. Its CDC connectors call into the Debezium Embedded Engine (Java) via a JNI bridge. This means:

  1. The Debezium Embedded Engine runs in a dedicated JVM process managed by RisingWave.
  2. Change events flow from Debezium's callback into RisingWave's streaming execution engine.
  3. RisingWave checkpoints its consumption position, backed by object storage (S3, GCS, or compatible).

From the user's perspective, this is invisible. You write SQL. The Debezium machinery runs behind the scenes.


What Happens When You Run CREATE SOURCE

Here is a minimal example connecting RisingWave to PostgreSQL via CDC:

CREATE SOURCE orders_cdc WITH (
    connector = 'postgres-cdc',
    hostname  = 'db.internal',
    port      = '5432',
    username  = 'replication_user',
    password  = 'secret',
    database.name = 'shop',
    slot.name = 'risingwave_slot',
    publication.name = 'risingwave_pub'
);

When this statement executes, RisingWave performs the following steps internally:

Step 1: Snapshot. The Debezium Embedded Engine reads a consistent snapshot of the targeted tables using a repeatable-read transaction. Every existing row is ingested as an insert event.

Step 2: Replication slot attachment. For PostgreSQL, RisingWave uses a logical replication slot (created with pgoutput or wal2json). The slot begins buffering WAL entries from the snapshot LSN onward, so no changes are missed during the snapshot phase.

Step 3: Streaming. After the snapshot completes, Debezium switches to streaming mode. It reads from the replication slot continuously, decodes WAL entries into structured change events (INSERT, UPDATE, DELETE), and forwards them into RisingWave's streaming engine.

Step 4: Checkpointing. RisingWave periodically checkpoints the current LSN to object storage. On restart, it resumes from the last checkpoint. No changes are lost, and no duplicate processing occurs under normal conditions.

You can then build materialized views directly on top of the CDC source:

CREATE MATERIALIZED VIEW order_summary AS
SELECT
    customer_id,
    COUNT(*)        AS order_count,
    SUM(total_amt)  AS lifetime_value
FROM orders_cdc
GROUP BY customer_id;

This view is kept incrementally up to date as changes arrive from the database.


Why the Debezium Embedded Engine, Specifically?

RisingWave could have implemented its own binlog/WAL reader from scratch. The reason it didn't comes down to the scope of the problem.

Database log formats are complex, version-sensitive, and full of edge cases. PostgreSQL's WAL format changes between major versions. MySQL's binlog format has quirks depending on the binlog_row_image setting. Oracle's LogMiner has known performance traps. Debezium has accumulated years of fixes for all of these.

By embedding Debezium's engine, RisingWave inherits:

  • Connector maturity. Support for PostgreSQL, MySQL, MongoDB, Oracle, SQL Server, and more, each with production-hardened logic.
  • Schema evolution handling. Debezium tracks DDL changes and adapts the event schema accordingly.
  • Exactly-once semantics. Combined with RisingWave's checkpoint mechanism, the system provides end-to-end consistency guarantees.
  • Active maintenance. The Debezium project is actively maintained by Red Hat and a large open-source community. RisingWave benefits from upstream improvements automatically.

Reliability Characteristics

Because Debezium Embedded Engine is the same code as Debezium Standalone, the reliability profile is identical at the connector layer:

PropertyBehavior
Snapshot consistencyRepeatable-read snapshot, no dirty reads
WAL lag toleranceReplication slot buffers changes during slow consumers
Schema change handlingDDL events propagated and applied
Restart recoveryResumes from last committed offset/LSN
BackpressureEmbedded engine pauses when downstream is slow

One practical difference from Debezium Standalone: there is no Kafka topic acting as a durable buffer between the connector and the consumer. In RisingWave's architecture, the checkpoint to object storage serves this role. The tradeoff is a simpler operational footprint at the cost of Kafka's fan-out capabilities.


PostgreSQL Prerequisites

Before creating a CDC source in RisingWave, the source database needs a small amount of configuration:

-- On the source PostgreSQL instance
ALTER SYSTEM SET wal_level = logical;

-- Create a publication (RisingWave will read from this)
CREATE PUBLICATION risingwave_pub FOR TABLE orders, customers, products;

-- Grant replication privilege to the connector user
ALTER ROLE replication_user REPLICATION LOGIN;

The replication slot itself is created automatically by RisingWave when the CREATE SOURCE statement runs.


What This Means If You Already Use Debezium

If your organization already runs Debezium with Kafka, RisingWave does not replace that pipeline. The two can coexist. A common pattern is:

  • Debezium Standalone writes change events to Kafka topics (for multiple downstream consumers, event archiving, audit trails).
  • RisingWave subscribes to those Kafka topics directly using its Kafka source connector.
  • RisingWave handles the analytics and materialized view layer on top of Kafka.

In this architecture, you are not using RisingWave's built-in CDC connector at all — you are using its Kafka source. The Debezium Embedded Engine only comes into play when you connect RisingWave directly to the source database, bypassing Kafka.


FAQ

Does RisingWave's CDC connector work with managed PostgreSQL (RDS, Aurora, Cloud SQL)? Yes. Logical replication is available on Amazon RDS for PostgreSQL, Aurora PostgreSQL, Google Cloud SQL, and Azure Database for PostgreSQL. Each has slightly different steps to enable it, but the RisingWave CREATE SOURCE syntax is the same.

Does the Debezium Embedded Engine run in the same process as RisingWave? RisingWave is written in Rust and calls into Debezium's Java library via JNI. The JVM runs as a subprocess managed by RisingWave's CDC worker nodes. Users do not interact with it directly.

What happens if the replication slot falls too far behind? PostgreSQL will hold WAL segments until the slot's LSN advances. If RisingWave is down for an extended period, WAL can accumulate and fill disk. Best practice is to monitor pg_replication_slots and set max_slot_wal_keep_size to prevent unbounded growth.

Can I use RisingWave CDC alongside Debezium Standalone on the same table? You can, but each requires its own replication slot. Multiple slots consume WAL independently, which increases WAL retention pressure on the source. Monitor carefully in high-write environments.

Is the Debezium Embedded Engine open source? Yes. Debezium is Apache 2.0 licensed. RisingWave is also Apache 2.0 licensed.

Best-in-Class Event Streaming
for Agents, Apps, and Analytics
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.