Real-Time Search Indexing with CDC: Debezium to Elasticsearch vs RisingWave

Keeping a search index synchronized with your operational database is one of the most common CDC use cases. The traditional route is Debezium → Kafka → Kafka Connect Elasticsearch Sink. RisingWave offers a shorter path: CDC source → materialized view → Elasticsearch sink — with SQL transformations in the middle.


Why Search Indexes Need CDC

Full re-index jobs that run nightly are no longer acceptable for most search experiences. Users expect results to reflect changes within seconds — a product going out of stock, a document being edited, a customer record being updated.

Polling the source database for changes is expensive and slow. CDC captures every committed change at the transaction log level, giving you sub-second propagation without added load on the primary.
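Log-based capture requires the source database to expose its transaction log. For PostgreSQL that means logical replication; a minimal setup sketch (the role and publication names are illustrative — Debezium with pgoutput can also create the publication itself):

```sql
-- postgresql.conf (restart required):
--   wal_level = logical

-- Let the capture role stream the WAL
ALTER ROLE debezium WITH REPLICATION;

-- Publication scoping replication to the table being indexed
CREATE PUBLICATION products_pub FOR TABLE public.products;
```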


The Classic Pattern: Debezium → Kafka → Elasticsearch

This is the most widely deployed architecture for real-time search indexing.

PostgreSQL (WAL)
     │
     ▼
Debezium (Kafka Connect Source)
     │
     ▼
Kafka Topic: postgres.public.products
     │
     ▼
Kafka Connect Elasticsearch Sink
     │
     ▼
Elasticsearch Index: products

Debezium Source Connector Configuration

{
  "name": "postgres-products-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "shop",
    "table.include.list": "public.products",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_products",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.delete.handling.mode": "rewrite"
  }
}
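Connector configs like this are deployed through the Kafka Connect REST API. A sketch, assuming Connect listens on connect:8083 and the JSON above is saved as postgres-products-source.json:

```shell
curl -s -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d @postgres-products-source.json
```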

Kafka Connect Elasticsearch Sink Configuration

{
  "name": "elasticsearch-products-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "topics": "postgres.public.products",
    "connection.url": "http://elasticsearch:9200",
    "type.name": "_doc",
    "key.ignore": "false",
    "schema.ignore": "true",
    "behavior.on.null.values": "delete",
    "transforms": "castKey",
    "transforms.castKey.type": "org.apache.kafka.connect.transforms.Cast$Key",
    "transforms.castKey.spec": "string"
  }
}

This gets changes into Elasticsearch in near-real-time. Deletes are handled via tombstone records. Kafka acts as a buffer, providing replay and fan-out to additional consumers.
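With the unwrap SMT configured above, consumers see a flat row document rather than the full Debezium before/after envelope; rewrite mode adds a __deleted marker field. A sketch of one message value (field values are illustrative):

```json
{
  "id": 1,
  "name": "Mug",
  "price": 12.50,
  "in_stock": true,
  "__deleted": "false"
}
```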


The RisingWave Pattern: CDC Source → SQL → Elasticsearch Sink

RisingWave embeds the Debezium engine internally. You connect directly to PostgreSQL's WAL without deploying a separate Debezium cluster or Kafka.

Step 1: Create the CDC Source Table

CREATE TABLE products (
    id          BIGINT PRIMARY KEY,
    name        VARCHAR,
    description VARCHAR,
    price       NUMERIC(10,2),
    category    VARCHAR,
    in_stock    BOOLEAN,
    updated_at  TIMESTAMPTZ
) WITH (
    connector = 'postgres-cdc',
    hostname  = 'postgres',
    port      = '5432',
    username  = 'rwuser',
    password  = 'secret',
    database.name = 'shop',
    schema.name   = 'public',
    table.name    = 'products'
);

Step 2: Create a Materialized View for the Search Document

This is where RisingWave diverges from raw row replication. You can reshape the data into exactly the document structure Elasticsearch expects — using SQL.

CREATE MATERIALIZED VIEW products_search_doc AS
SELECT
    id,
    name,
    description,
    price,
    category,
    in_stock,
    updated_at,
    -- Denormalize a search-friendly price tier
    CASE
        WHEN price < 25  THEN 'budget'
        WHEN price < 100 THEN 'mid-range'
        ELSE 'premium'
    END AS price_tier,
    -- Full-text field combining name and description
    name || ' ' || COALESCE(description, '') AS search_text
FROM products
WHERE in_stock = true OR updated_at > NOW() - INTERVAL '7 days';
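To make the view's behavior concrete, here is the same transformation mirrored in Python (a hypothetical helper for illustration, not RisingWave code): it builds the search document for one row, or returns None when the WHERE clause would drop the row from the index.

```python
from datetime import datetime, timedelta, timezone

def to_search_doc(row, now=None):
    """Mirror of products_search_doc: build the ES document for one
    product row, or return None when the WHERE clause filters it out."""
    now = now or datetime.now(timezone.utc)
    recently_updated = row["updated_at"] > now - timedelta(days=7)
    if not (row["in_stock"] or recently_updated):
        return None
    if row["price"] < 25:
        tier = "budget"
    elif row["price"] < 100:
        tier = "mid-range"
    else:
        tier = "premium"
    return {
        **row,
        "price_tier": tier,
        # name || ' ' || COALESCE(description, '')
        "search_text": row["name"] + " " + (row["description"] or ""),
    }
```

A row that goes out of stock and has not been updated for a week maps to None, which the sink turns into a delete of the corresponding document.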

Step 3: Write to Elasticsearch via Sink

CREATE SINK products_to_elasticsearch
FROM products_search_doc
WITH (
    connector     = 'elasticsearch',
    index         = 'products',
    url           = 'http://elasticsearch:9200',
    primary_key   = 'id'
);

Every insert, update, or delete on the products table flows through the materialized view and lands in Elasticsearch within seconds — no Kafka, no separate connector cluster.


Handling Deletes

Both patterns handle deletes, but differently.

With Debezium + Kafka Connect, you configure behavior.on.null.values=delete on the sink connector. Debezium emits a tombstone record (null value) for deleted rows, and the Elasticsearch connector issues a DELETE request.
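The sink's decision rule can be sketched in a few lines of Python (a hypothetical helper, not actual connector code): a record with a null value is a tombstone and becomes a DELETE by document id; anything else is indexed.

```python
def es_action(key, value):
    """Map a Kafka record to the Elasticsearch request the sink issues
    when behavior.on.null.values=delete is configured."""
    if value is None:
        # Debezium tombstone: the row was deleted upstream
        return ("DELETE", key)
    # Insert or update: index (upsert) the document under _id = key
    return ("INDEX", key)
```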

With RisingWave, deletes in the source table are automatically propagated through the materialized view. The Elasticsearch sink issues a DELETE request when a row disappears from the view's result set.


Comparison Table

Dimension              | Debezium + Kafka + ES Connector           | RisingWave + ES Sink
-----------------------|-------------------------------------------|-------------------------------
Infrastructure         | Kafka, Schema Registry, Connect cluster   | RisingWave only
Index transformation   | SMTs (limited) or stream processor        | Full SQL
Multiple ES indexes    | Fan-out via separate sink connectors      | One sink per materialized view
Replay / reprocessing  | Kafka retention enables replay            | Re-create materialized view
Multi-consumer fan-out | Yes, Kafka serves many downstream systems | No native Kafka fan-out
Enrichment with joins  | Requires Kafka Streams or Flink           | Native SQL JOINs
Operational complexity | High                                      | Low

When to Use Debezium + Kafka

Use the Debezium → Kafka → Elasticsearch pattern when:

  • Multiple downstream systems consume the same change stream (one team reads the Kafka topic for audit, another for analytics).
  • You need complex routing — different document types going to different indexes based on event content.
  • Your organization already operates a Kafka platform and wants consistency.
  • You need long-term event replay capability that outlasts any single system's lifecycle.

When RisingWave Is the Better Fit

Use RisingWave when:

  • Elasticsearch is the primary or only consumer of the CDC stream.
  • You need SQL-level transformations before indexing — joining with other tables, computing derived fields, filtering rows.
  • You want to reduce infrastructure footprint — fewer moving parts, fewer deployment units.
  • The team is more comfortable with SQL than with Kafka connector configuration.

Hybrid Architecture

These approaches are not mutually exclusive. Many teams run both:

  • Debezium publishes changes to Kafka for microservices and audit systems.
  • RisingWave reads from Kafka (or directly from PostgreSQL) and maintains materialized views that feed Elasticsearch.

This gives you the fan-out benefits of Kafka without burdening Elasticsearch with raw Debezium event shapes.
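In the hybrid setup, RisingWave can consume the Debezium topic instead of tailing the WAL itself. A sketch using RisingWave's Kafka connector with the Debezium format (the broker address is illustrative; the topic name matches the example above):

```sql
CREATE TABLE products (
    id       BIGINT PRIMARY KEY,
    name     VARCHAR,
    price    NUMERIC(10,2),
    in_stock BOOLEAN
) WITH (
    connector = 'kafka',
    topic = 'postgres.public.products',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT DEBEZIUM ENCODE JSON;
```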


FAQ

Q: Does RisingWave's Elasticsearch sink support index aliases and routing? Yes. You can configure the index parameter to point to an alias, and routing keys can be derived from materialized view columns. Check the RisingWave sink documentation for the full parameter list.

Q: What happens if Elasticsearch is temporarily unavailable? With Debezium + Kafka, messages accumulate in Kafka and are delivered when Elasticsearch recovers. With RisingWave, the sink will retry and buffer internally, but Kafka provides stronger durability guarantees for this scenario if Elasticsearch outages are frequent.

Q: Can RisingWave write to multiple Elasticsearch indexes? Yes — create one sink per materialized view. Each view can filter or transform data differently and target a different index. There is no single-connector fan-out like Kafka, but the SQL model makes multi-index writes straightforward.

Q: How does schema evolution work? With Debezium + Schema Registry, schema changes are tracked automatically. With RisingWave, you ALTER TABLE the CDC table to pick up a new column; existing materialized views are incrementally maintained against the column set they were defined with, so to surface the new field in the search document you recreate the view (and its sink).

Q: Is there latency overhead from the materialized view layer? Materialized views in RisingWave are incrementally maintained — they do not re-execute from scratch on every change. Latency from WAL capture to Elasticsearch is typically under one second under normal load.
