PostgreSQL CDC Without Kafka: RisingWave vs Debezium + Kafka Connect


PostgreSQL CDC without Kafka is possible and increasingly practical. RisingWave connects directly to PostgreSQL's logical replication stream using the Debezium Embedded Engine, eliminating the need for Kafka brokers, Kafka Connect workers, and connector configuration. If you only need one downstream consumer of your change data, removing Kafka simplifies the stack significantly.


The Standard Debezium + Kafka Stack

Most PostgreSQL CDC tutorials describe this architecture:

PostgreSQL (WAL/logical replication)
    ↓
Debezium PostgreSQL Connector (Kafka Connect worker)
    ↓
Kafka broker (topic: server.schema.table)
    ↓
Consumer application / stream processor / data warehouse

This stack is mature and battle-tested. It handles high throughput, supports multiple independent consumers, and retains events for replay. It also requires operating:

  • One or more Kafka brokers (with ZooKeeper or KRaft)
  • Kafka Connect workers
  • Connector configuration and management
  • Topic retention and compaction policies
  • Network connectivity between all components

For many teams, this is appropriate. For teams that need CDC primarily for analytics, materialized views, or real-time queries — with a single downstream consumer — it is significant overhead.


The Full Kafka Setup for PostgreSQL CDC

Here is what the Debezium + Kafka path looks like end to end.

Step 1: Configure PostgreSQL

-- postgresql.conf
-- wal_level = logical   (requires restart)

-- In psql
CREATE PUBLICATION debezium_pub FOR TABLE orders, customers, products;
ALTER ROLE debezium_user REPLICATION LOGIN;
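Before moving on, it is worth confirming the configuration took effect. A quick sanity check, run in the same psql session (using the publication and role names from above):

```sql
-- Confirm logical decoding is enabled (requires the restart noted above)
SHOW wal_level;  -- should return 'logical'

-- Confirm the publication covers the intended tables
SELECT * FROM pg_publication_tables WHERE pubname = 'debezium_pub';

-- Confirm the role can open a replication connection
SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'debezium_user';
```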

Step 2: Deploy Kafka and Kafka Connect

# docker-compose.yml (simplified)
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092

  kafka-connect:
    image: debezium/connect:2.5
    depends_on: [kafka]
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_status

Step 3: Register the Debezium Connector

curl -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "pg-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "postgres",
      "database.port": "5432",
      "database.user": "debezium_user",
      "database.password": "secret",
      "database.dbname": "shop",
      "table.include.list": "public.orders,public.customers,public.products",
      "plugin.name": "pgoutput",
      "slot.name": "debezium_slot",
      "publication.name": "debezium_pub",
      "topic.prefix": "shop_server"
    }
  }'

Step 4: Build a Consumer

With events flowing into Kafka, you now need a consumer. For analytics, this might be a Spark Structured Streaming job, a Flink application, ksqlDB, or a custom application that reads from the topic and writes to a data warehouse.

That consumer is a separate piece of infrastructure to build, deploy, and maintain.
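As a concrete sketch of that consumer step, here is what a minimal ksqlDB version might look like. The names are illustrative, the topic follows Debezium's <topic.prefix>.<schema>.<table> convention, and this assumes the connector emits plain JSON with schemas disabled, so the envelope fields op and after sit at the top level:

```sql
-- Declare a stream over the Debezium change topic for orders
CREATE STREAM orders_cdc (
    op VARCHAR,
    after STRUCT<id BIGINT, customer_id BIGINT, status VARCHAR>
) WITH (
    KAFKA_TOPIC = 'shop_server.public.orders',
    VALUE_FORMAT = 'JSON'
);

-- Persistent query: filter to completed orders (creates and updates)
CREATE STREAM completed_orders AS
    SELECT after->id AS order_id, after->customer_id AS customer_id
    FROM orders_cdc
    WHERE op IN ('c', 'u') AND after->status = 'completed';
```

Even in this minimal form, the consumer must understand Debezium's event envelope (op codes, before/after images) before it can do anything useful with the data.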


The RisingWave Path (No Kafka)

RisingWave connects to PostgreSQL directly. It uses the Debezium Embedded Engine internally, so it does the same log reading that Debezium would do — without the Kafka layer.

Step 1: Configure PostgreSQL (Same as Before)

ALTER SYSTEM SET wal_level = logical;
-- (requires PostgreSQL restart)

CREATE PUBLICATION risingwave_pub FOR TABLE orders, customers, products;
ALTER ROLE risingwave_user REPLICATION LOGIN;

Step 2: Create a Source in RisingWave

CREATE SOURCE pg_shop WITH (
    connector           = 'postgres-cdc',
    hostname            = 'postgres',
    port                = '5432',
    username            = 'risingwave_user',
    password            = 'secret',
    database.name       = 'shop',
    slot.name           = 'risingwave_slot',
    publication.name    = 'risingwave_pub'
);

RisingWave creates the replication slot automatically and begins the snapshot process.
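You can confirm the slot from the PostgreSQL side (slot name as configured above):

```sql
SELECT slot_name, plugin, active, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'risingwave_slot';
```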

Step 3: Create Tables from the Source

CREATE TABLE orders (
    id          BIGINT PRIMARY KEY,
    customer_id BIGINT,
    total_amt   DECIMAL(10, 2),
    status      VARCHAR,
    created_at  TIMESTAMPTZ
) FROM pg_shop TABLE 'public.orders';

CREATE TABLE customers (
    id     BIGINT PRIMARY KEY,
    email  VARCHAR,
    region VARCHAR,
    status VARCHAR
) FROM pg_shop TABLE 'public.customers';

Step 4: Write SQL Queries

-- Continuously maintained aggregation
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT
    c.region,
    SUM(o.total_amt)             AS total_revenue,
    COUNT(*)                     AS order_count,
    COUNT(DISTINCT o.customer_id) AS unique_customers
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'completed'
GROUP BY c.region;

-- Query it like any table
SELECT * FROM revenue_by_region ORDER BY total_revenue DESC;

This view reflects the current state of the PostgreSQL tables. As rows are inserted, updated, or deleted in PostgreSQL, revenue_by_region updates automatically — typically within milliseconds.
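A quick way to see this in action: change a row in PostgreSQL, then re-query the view in RisingWave (the id and values are illustrative):

```sql
-- In PostgreSQL
UPDATE orders SET status = 'completed' WHERE id = 42;

-- In RisingWave, moments later: the region for that order's customer
-- now reflects the additional completed order
SELECT * FROM revenue_by_region ORDER BY total_revenue DESC;
```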


Architecture Comparison

|                              | Debezium + Kafka                                                | RisingWave (built-in CDC) |
|------------------------------|-----------------------------------------------------------------|---------------------------|
| Components to operate        | PostgreSQL, Kafka, ZooKeeper/KRaft, Kafka Connect, consumer app | PostgreSQL, RisingWave    |
| Config files / APIs          | docker-compose, connector JSON, consumer code                   | SQL                       |
| Fan-out to N consumers       | Yes                                                             | No                        |
| SQL analytics built in       | No (needs a separate processor)                                 | Yes                       |
| Materialized views           | No (needs a separate processor)                                 | Yes                       |
| Event replay                 | Yes (Kafka retention)                                           | No                        |
| End-to-end latency           | 50–500 ms (typical)                                             | Under 100 ms (typical)    |
| Operational expertise needed | Kafka administration                                            | SQL                       |

When Removing Kafka Makes Sense

Removing Kafka is the right call when all of the following are true:

  1. There is one downstream consumer of the CDC stream. Kafka's fan-out value is zero if only one system reads the topic.
  2. The destination is a SQL system. RisingWave's entire interface is SQL. If you want to query CDC data, write materialized views, or join multiple tables, SQL is more expressive than managing Kafka consumer code.
  3. Your team does not already operate Kafka. If Kafka is not already in the stack, standing it up for a single CDC pipeline is significant infrastructure investment.
  4. Replay requirements are limited. If you don't need to re-consume historical events for new consumers, Kafka's retention model adds no value.

Keeping Kafka makes sense when:

  • Multiple downstream systems read from the same change stream.
  • Some downstream consumers are not SQL-based (search engines, cache invalidation, notification services).
  • Long-term event retention and replay are required.
  • Kafka is already in the infrastructure and marginal cost is low.

Managed PostgreSQL Considerations

The CREATE SOURCE syntax above works identically for managed PostgreSQL offerings, with minor prerequisite differences:

Amazon RDS for PostgreSQL:

-- Logical replication on RDS is enabled through the DB parameter group,
-- not through SQL: set rds.logical_replication = 1 and reboot the instance.
-- Verify afterwards:
SHOW wal_level;  -- should return 'logical'

Amazon Aurora PostgreSQL:

-- Aurora requires a custom parameter group
-- aurora_logical_replication = 1 (cluster parameter)
-- wal_level is set automatically when logical replication is enabled

Google Cloud SQL for PostgreSQL:

-- Enable: cloudsql.logical_decoding = on (instance flag)
-- Create publication as normal

Once the source database is configured, the RisingWave CREATE SOURCE statement is identical regardless of provider.


FAQ

Do I need to install anything extra to use RisingWave's PostgreSQL CDC connector? No. The CDC connector is built into RisingWave. No separate installation, plugin, or worker process is required beyond RisingWave itself.

How does RisingWave handle the initial snapshot for large tables? RisingWave (via the Debezium Embedded Engine) performs a consistent snapshot using a repeatable-read transaction. For very large tables, this can take minutes. During the snapshot, new changes accumulate in the WAL, held back by the replication slot, and are applied once the snapshot completes, so nothing is missed.

What is the WAL retention risk with RisingWave's replication slot? PostgreSQL holds WAL segments until the replication slot's LSN advances. If RisingWave stops consuming (e.g., is offline), WAL accumulates. Monitor pg_replication_slots.confirmed_flush_lsn and set max_slot_wal_keep_size in postgresql.conf to cap WAL retention.
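The retained-WAL check and the safety cap look like this (the threshold value is illustrative; max_slot_wal_keep_size requires PostgreSQL 13 or later):

```sql
-- How much WAL each slot is holding back
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn))
           AS retained_wal
FROM pg_replication_slots;

-- Cap retention so a stalled slot cannot fill the disk
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
```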

Can I run Debezium and RisingWave CDC against the same PostgreSQL instance simultaneously? Yes. Each requires its own replication slot. Both receive the full change stream independently. Keep an eye on total WAL retention, as each slot holds back WAL on its own.

Does RisingWave support TRUNCATE events from PostgreSQL CDC? TRUNCATE is handled as a special case. PostgreSQL's logical replication does emit truncate events (in PostgreSQL 11+). RisingWave processes them by truncating the corresponding internal table state.
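Whether TRUNCATE events are published at all is controlled on the PostgreSQL side; publications created without an explicit publish list (as in the examples above) include them by default:

```sql
-- Check which operations the publication forwards
SELECT pubname, pubinsert, pubupdate, pubdelete, pubtruncate
FROM pg_publication
WHERE pubname = 'risingwave_pub';
```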
