Debezium + RisingWave: When to Use Both Together
When you need Kafka fan-out to multiple systems AND real-time SQL analytics, use Debezium and RisingWave together. Debezium captures changes and publishes them to Kafka, which handles the fan-out; RisingWave subscribes to Kafka as one of the consumers and provides the SQL analytics layer. This is not an either/or choice: the two tools are complementary.
Why Both?
Some architectures genuinely need multiple consumers of the same CDC stream. A single PostgreSQL change event might need to update Elasticsearch for search, feed a data lake for historical analysis, trigger a notification service, and power real-time dashboards.
Kafka is the right broker for this fan-out. Debezium is the right producer. And RisingWave is the right analytics consumer — it reads from Kafka and serves incremental SQL results without you operating a separate stream processor.
The Architecture
PostgreSQL
↓
Debezium (Kafka Connect)
↓
Kafka
├─→ RisingWave (real-time SQL analytics)
├─→ Elasticsearch (full-text search)
├─→ S3 / Data Lake (Iceberg, Delta Lake)
└─→ Notification Service (alerts, webhooks)
Each consumer reads from the same Kafka topics at its own pace. Kafka's consumer group mechanism ensures each downstream system gets every event independently.
RisingWave is one of those consumers. It reads change events, maintains materialized views, and serves query results to BI tools and applications over the PostgreSQL wire protocol.
Setting Up Debezium (Kafka Connect Side)
Deploy the Debezium PostgreSQL connector with standard Kafka Connect config:
{
  "name": "pg-ecommerce-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "ecommerce",
    "table.include.list": "public.orders,public.customers,public.products",
    "plugin.name": "pgoutput",
    "topic.prefix": "ecommerce",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  }
}
This creates Kafka topics like ecommerce.public.orders, ecommerce.public.customers, and ecommerce.public.products.
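Assuming the config above is saved as pg-ecommerce-connector.json and a Kafka Connect worker is reachable at the placeholder host connect.internal:8083, registering it is a POST to the Connect REST API:

```shell
# Register the connector with the Kafka Connect worker (host is a placeholder)
curl -s -X POST http://connect.internal:8083/connectors \
  -H "Content-Type: application/json" \
  -d @pg-ecommerce-connector.json

# Check that the connector and its task report RUNNING
curl -s http://connect.internal:8083/connectors/pg-ecommerce-connector/status
```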
Connecting RisingWave to Kafka
RisingWave reads from these Kafka topics and understands Debezium's JSON envelope natively. Because Debezium streams carry updates and deletes, the data is ingested with CREATE TABLE rather than CREATE SOURCE, so that RisingWave can materialize the changelog against the primary key.
-- Create a table backed by the orders topic
CREATE TABLE orders_kafka_source (
    id BIGINT,
    customer_id BIGINT,
    status VARCHAR,
    total NUMERIC,
    created_at TIMESTAMPTZ,
    PRIMARY KEY (id)
)
WITH (
    connector = 'kafka',
    topic = 'ecommerce.public.orders',
    properties.bootstrap.server = 'kafka.internal:9092',
    scan.startup.mode = 'earliest'
)
FORMAT DEBEZIUM ENCODE JSON;
The FORMAT DEBEZIUM ENCODE JSON clause tells RisingWave to interpret the Debezium envelope (before/after/op fields) automatically.
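For reference, the value of a Debezium update event carries the old and new row images plus an operation code. A simplified, illustrative envelope (real events include more source metadata):

```json
{
  "before": { "id": 1001, "customer_id": 42, "status": "pending",   "total": 99.50 },
  "after":  { "id": 1001, "customer_id": 42, "status": "completed", "total": 99.50 },
  "op": "u",
  "ts_ms": 1700000000000,
  "source": { "table": "orders" }
}
```

RisingWave applies the after image for op values c, u, and r (snapshot reads), and removes the row for d.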
For Avro with a schema registry, the registry URL goes in the ENCODE clause:
CREATE TABLE orders_kafka_source (...)
WITH (
    connector = 'kafka',
    topic = 'ecommerce.public.orders',
    properties.bootstrap.server = 'kafka.internal:9092'
)
FORMAT DEBEZIUM ENCODE AVRO (
    schema.registry = 'http://schema-registry.internal:8081'
);
Building Analytics with Materialized Views
Once ingestion from Kafka is defined, create materialized views for your analytics workloads.
Real-time revenue dashboard:
CREATE MATERIALIZED VIEW revenue_by_day AS
SELECT
    DATE_TRUNC('day', created_at) AS day,
    COUNT(*) AS order_count,
    SUM(total) AS revenue,
    AVG(total) AS avg_order_value
FROM orders_kafka_source
WHERE status = 'completed'
GROUP BY DATE_TRUNC('day', created_at);
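The view is queried like an ordinary table; for example, a dashboard might fetch the last week of figures:

```sql
-- Most recent 7 days of revenue, newest first
SELECT day, order_count, revenue, avg_order_value
FROM revenue_by_day
ORDER BY day DESC
LIMIT 7;
```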
Customer activity over the last 24 hours:
CREATE MATERIALIZED VIEW active_customers_24h AS
SELECT
    customer_id,
    COUNT(*) AS orders_placed,
    SUM(total) AS spend
FROM orders_kafka_source
WHERE created_at >= NOW() - INTERVAL '24 hours'
GROUP BY customer_id;
Multi-topic join (customers + orders):
CREATE TABLE customers_kafka_source (
    id BIGINT,
    name VARCHAR,
    email VARCHAR,
    tier VARCHAR,
    PRIMARY KEY (id)
)
WITH (
    connector = 'kafka',
    topic = 'ecommerce.public.customers',
    properties.bootstrap.server = 'kafka.internal:9092',
    scan.startup.mode = 'earliest'
)
FORMAT DEBEZIUM ENCODE JSON;
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT
    c.id AS customer_id,
    c.name,
    c.tier,
    COUNT(o.id) AS lifetime_orders,
    SUM(o.total) AS lifetime_value
FROM customers_kafka_source c
LEFT JOIN orders_kafka_source o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.tier;
Your BI tool connects to RisingWave on port 4566 (PostgreSQL protocol) and queries these views with standard SQL. The results are always current — no refresh needed, no cache invalidation.
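For example, with psql (the hostname is a placeholder; dev and root are RisingWave's default database and user):

```shell
psql -h risingwave.internal -p 4566 -d dev -U root \
  -c "SELECT name, tier, lifetime_value
      FROM customer_order_summary
      ORDER BY lifetime_value DESC
      LIMIT 10;"
```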
What Each System Owns
| System | Responsibility |
| --- | --- |
| Debezium | Log capture, Kafka topic production, schema evolution via registry |
| Kafka | Event durability, fan-out to all consumers, replay capability |
| RisingWave | Incremental SQL processing, materialized view maintenance, query serving |
| Elasticsearch | Full-text search over current entity state |
| Data Lake | Historical storage, batch analytics, ML feature pipelines |
No single component is overloaded. Each does what it does well.
When to Use This Architecture vs. Direct CDC
This architecture adds Kafka to the mix. That is real operational complexity. Use it only when you have multiple consumers that genuinely need independent access to CDC events.
Use Debezium + Kafka + RisingWave when:
- You have 3+ downstream consumers of CDC events
- Some consumers need long-term event replay (e.g., reprocessing for a data lake)
- Some consumers are non-SQL systems (Elasticsearch, notification services)
- You need to decouple source database connection limits from consumer count
Use RisingWave direct CDC (no Kafka) when:
- Analytics is the only CDC consumer
- You want the simplest possible architecture
- Operational overhead of Kafka is not justified
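For comparison, a sketch of the direct, Kafka-free alternative using RisingWave's native postgres-cdc connector (the credentials are placeholders):

```sql
-- Direct CDC: RisingWave holds its own replication slot, no Kafka in between.
CREATE TABLE orders_direct (
    id BIGINT,
    customer_id BIGINT,
    status VARCHAR,
    total NUMERIC,
    created_at TIMESTAMPTZ,
    PRIMARY KEY (id)
)
WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres.internal',
    port = '5432',
    username = 'rw_cdc',   -- placeholder
    password = 'secret',   -- placeholder
    database.name = 'ecommerce',
    schema.name = 'public',
    table.name = 'orders'
);
```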
Offset Management
When RisingWave reads from Kafka, it tracks its consumption offsets in its own checkpointed state rather than relying on Kafka's committed consumer-group offsets. This means:
- If RisingWave restarts, it resumes from its last checkpoint.
- RisingWave's progress does not affect other Kafka consumers, and theirs does not affect RisingWave.
- To replay from an earlier position, you can declare a new Kafka-backed table with a different scan.startup.mode without disturbing other consumers.
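To replay from a specific point in time, a fresh Kafka-backed table can start from a timestamp; a sketch, assuming RisingWave's scan.startup.timestamp.millis option (the epoch value is illustrative):

```sql
CREATE TABLE orders_replay (
    id BIGINT,
    customer_id BIGINT,
    status VARCHAR,
    total NUMERIC,
    created_at TIMESTAMPTZ,
    PRIMARY KEY (id)
)
WITH (
    connector = 'kafka',
    topic = 'ecommerce.public.orders',
    properties.bootstrap.server = 'kafka.internal:9092',
    scan.startup.mode = 'timestamp',
    scan.startup.timestamp.millis = '1700000000000'  -- illustrative epoch millis
)
FORMAT DEBEZIUM ENCODE JSON;
```

Note that starting mid-stream with Debezium data misses earlier row state, so this suits targeted replays better than full-state rebuilds.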
Check RisingWave's Kafka consumer lag:
SELECT source_name, partition, consumer_offset, latest_offset,
(latest_offset - consumer_offset) AS lag
FROM rw_kafka_source_stats;
FAQ
Q: Does RisingWave consume directly from the PostgreSQL replication slot when used with Debezium? No. In this architecture, Debezium holds the replication slot and publishes to Kafka. RisingWave reads from Kafka as a consumer. This is important: do not also connect RisingWave's native CDC connector to the same PostgreSQL instance while Debezium is active, or you will create two replication slots.
Q: What Kafka message format should I use? JSON is simpler to set up. Avro with schema registry is more robust for production because it enforces schema evolution rules and reduces message size. RisingWave supports both.
Q: Can RisingWave fall behind on Kafka topic consumption? Yes, under heavy load, consumer lag can accumulate. Monitor the lag metric above and size RisingWave resources accordingly. RisingWave processes Kafka messages incrementally and will catch up after a backpressure event.
Q: What happens if Debezium restarts and replays events?
Debezium may produce duplicate events after a restart before finding its committed offset. RisingWave handles Debezium's op field (create/update/delete) correctly and applies idempotent upserts for c and u operations, so duplicate events are handled safely.
Q: How does this compare to Confluent Platform for analytics? Confluent Platform with ksqlDB provides similar SQL-over-Kafka analytics. RisingWave's advantages are PostgreSQL compatibility (broader BI tool support), stronger JOIN semantics across multiple topics, and the ability to serve low-latency point queries over materialized views — not just aggregations.

