CDC for Data Lakehouse: Debezium vs RisingWave to Apache Iceberg

Syncing an operational database into a data lakehouse on S3 using Apache Iceberg is one of the most valuable CDC use cases for analytics teams. The traditional path routes changes through Debezium, Kafka, and Flink before writing to Iceberg. RisingWave collapses that into one system and three SQL statements: a CDC source, a transformation, and an Iceberg sink.


Why Iceberg for CDC Landing

Apache Iceberg solves the hardest problem in lake-based CDC: how do you apply row-level updates and deletes to immutable Parquet files? Iceberg's table format supports MERGE semantics, hidden partitioning, and time travel — making it suitable as a landing zone for continuous CDC streams.

Without a format like Iceberg, every update from CDC requires rewriting entire partitions. Iceberg's merge-on-read (MOR) and copy-on-write (COW) modes give you control over this trade-off.
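
For a concrete picture, here is a minimal Spark SQL sketch of an Iceberg table set up for CDC landing: hidden partitioning on the timestamp plus MOR deletes. The prod catalog and the column list are assumptions that mirror the examples later in this post.

CREATE TABLE prod.analytics.orders_enriched (
    id          BIGINT,
    customer_id BIGINT,
    status      STRING,
    total       DECIMAL(10,2),
    created_at  TIMESTAMP
)
USING iceberg
-- Hidden partitioning: queries filter on created_at directly;
-- there is no physical partition column to maintain
PARTITIONED BY (days(created_at))
TBLPROPERTIES (
    'format-version'    = '2',              -- required for row-level deletes
    'write.delete.mode' = 'merge-on-read'
);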


The Traditional Pattern: Debezium + Kafka + Flink

This is the production-standard architecture at companies running large-scale lakehouse ingestion.

PostgreSQL (WAL)
     │
     ▼
Debezium Connector (Kafka Connect)
     │
     ▼
Kafka Topic: cdc.public.orders
     │
     ▼
Apache Flink (Flink CDC + Iceberg connector)
     │
     ▼
Apache Iceberg Table on S3

Debezium Source Configuration

{
  "name": "postgres-orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "warehouse",
    "topic.prefix": "cdc",
    "table.include.list": "public.orders",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_orders",
    "tombstones.on.delete": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}

Flink Job: Kafka to Iceberg

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpointing drives exactly-once delivery into Iceberg
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// Read from the Kafka topic containing Debezium events.
// Note: Flink rejects PRIMARY KEY declarations on the 'kafka' connector
// with Debezium formats; the key is declared on the sink table instead.
tableEnv.executeSql("""
    CREATE TABLE kafka_orders (
        id BIGINT,
        customer_id BIGINT,
        status STRING,
        total DECIMAL(10,2),
        created_at TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'cdc.public.orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'iceberg-ingest',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-avro-confluent',
        'debezium-avro-confluent.schema-registry.url' = 'http://schema-registry:8081'
    )
""");

// Write to Iceberg. Format v2 plus upsert mode lets CDC updates and
// deletes apply as row-level changes keyed on the primary key.
tableEnv.executeSql("""
    CREATE TABLE iceberg_orders (
        id BIGINT,
        customer_id BIGINT,
        status STRING,
        total DECIMAL(10,2),
        created_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'iceberg',
        'catalog-name' = 'prod',
        'catalog-type' = 'rest',
        'uri' = 'http://iceberg-rest:8181',
        'warehouse' = 's3://my-lakehouse/warehouse',
        'format-version' = '2',
        'write.upsert.enabled' = 'true'
    )
""");

tableEnv.executeSql(
    "INSERT INTO iceberg_orders SELECT * FROM kafka_orders"
);

This pipeline is powerful but requires running and tuning Flink jobs, managing Kafka consumer groups, and handling checkpointing for exactly-once semantics.


The RisingWave Pattern: Direct CDC to Iceberg

RisingWave uses the Debezium Embedded Engine internally — the same capture logic, no separate deployment. You define the CDC source as a table and sink directly to Iceberg.

Step 1: Create the CDC Source Table

CREATE TABLE orders (
    id          BIGINT PRIMARY KEY,
    customer_id BIGINT,
    status      VARCHAR,
    total       NUMERIC(10,2),
    created_at  TIMESTAMPTZ
) WITH (
    connector = 'postgres-cdc',
    hostname  = 'postgres',
    port      = '5432',
    username  = 'rwuser',
    password  = 'secret',
    database.name = 'warehouse',
    schema.name   = 'public',
    table.name    = 'orders'
);

Step 2: Transform with a Materialized View

This is the critical advantage over raw Debezium pipelines. You can enrich, filter, and aggregate before the data ever reaches Iceberg.

CREATE MATERIALIZED VIEW orders_enriched AS
SELECT
    o.id,
    o.customer_id,
    c.region,
    c.tier          AS customer_tier,
    o.status,
    o.total,
    -- Compute derived metrics inline
    CASE
        WHEN o.total > 1000 THEN 'high_value'
        WHEN o.total > 100  THEN 'standard'
        ELSE 'low_value'
    END AS order_tier,
    DATE_TRUNC('day', o.created_at) AS order_date,
    o.created_at
FROM orders o
JOIN customers c ON o.customer_id = c.id;
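
The join above assumes a customers table that is also ingested via CDC. A minimal sketch of that table, with a column set that is an assumption matching the view:

CREATE TABLE customers (
    id     BIGINT PRIMARY KEY,
    region VARCHAR,
    tier   VARCHAR
) WITH (
    connector = 'postgres-cdc',
    hostname  = 'postgres',
    port      = '5432',
    username  = 'rwuser',
    password  = 'secret',
    database.name = 'warehouse',
    schema.name   = 'public',
    table.name    = 'customers'
);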

Step 3: Sink to Iceberg

CREATE SINK orders_to_iceberg
FROM orders_enriched
WITH (
    connector            = 'iceberg',
    type                 = 'upsert',
    catalog.name         = 'prod',
    catalog.type         = 'rest',
    catalog.uri          = 'http://iceberg-rest:8181',
    warehouse.path       = 's3://my-lakehouse/warehouse',
    database.name        = 'analytics',
    table.name           = 'orders_enriched',
    primary_key          = 'id'
);

Transformation and Iceberg write happen in a single declarative SQL pipeline. No Flink job. No Kafka consumer group.


Merge-on-Read vs Copy-on-Write for CDC Updates

Iceberg supports two write modes, and the choice matters significantly for CDC workloads.

Copy-on-Write (COW): Every update or delete rewrites the affected data files entirely. Reads are fast because there are no delete files to merge. Writes are expensive — each CDC update causes a file rewrite. Use COW when updates are rare and read performance is critical.

Merge-on-Read (MOR): Updates and deletes write small positional delete files alongside data files. Reads must merge these at query time. Writes are cheap. For high-frequency CDC streams (hundreds of updates per second), MOR is the right choice.

Configure this in your Iceberg table properties:

-- Set on the existing table, e.g. via Spark SQL against the same catalog
ALTER TABLE analytics.orders_enriched
SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
);

Run compaction regularly (via Spark or Flink maintenance jobs) to merge delete files and restore read performance over time.
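
As a sketch, compaction with Iceberg's built-in Spark procedures looks like this (catalog, database, and table names follow the earlier examples):

-- Rewrite small data files into larger ones
CALL prod.system.rewrite_data_files(table => 'analytics.orders_enriched');

-- Fold accumulated positional delete files back into data files
CALL prod.system.rewrite_position_delete_files(table => 'analytics.orders_enriched');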


Comparison Table

| Dimension | Debezium + Kafka + Flink + Iceberg | RisingWave CDC + Iceberg Sink |
|---|---|---|
| Components to operate | Kafka, Connect, Flink, Iceberg catalog | RisingWave, Iceberg catalog |
| Transformation layer | Flink SQL / Java | Standard SQL materialized views |
| Join enrichment before write | Yes (Flink joins) | Yes (SQL JOINs, simpler syntax) |
| Exactly-once delivery | Yes (Flink checkpointing) | Yes (RisingWave transactional sink) |
| Schema evolution | Schema Registry + Avro | ALTER TABLE in RisingWave |
| Multi-table fan-out | Multiple Flink jobs | One sink per materialized view |
| Re-processing / backfill | Replay from Kafka | Recreate materialized view |
| Latency to Iceberg | Seconds (Flink checkpoint interval) | Seconds (configurable commit interval) |

When the Traditional Stack Is the Better Choice

Choose the traditional stack when:

  • Multiple downstream systems (not just Iceberg) consume the CDC stream from Kafka.
  • The Flink team already manages jobs at scale and prefers Java/Scala for complex transformations.
  • You need fine-grained control over Flink watermarks, late data handling, and windowed aggregations before writing to Iceberg.
  • You require extremely long retention of raw change events for compliance or audit replay.

When RisingWave Is the Simpler Choice

Choose RisingWave when:

  • Iceberg is the primary destination and you want to reduce infrastructure complexity.
  • Enrichment or transformation before Iceberg write is necessary but manageable in SQL.
  • The team prefers SQL over distributed job management.
  • You are building a new pipeline and don't have an existing Kafka investment.

FAQ

Q: Does RisingWave support Iceberg V2 (row-level deletes)? Yes. RisingWave's Iceberg sink writes Iceberg V2 format, which supports row-level positional deletes required for upsert and delete CDC operations.

Q: Which Iceberg catalogs does RisingWave support? RisingWave supports REST catalog, AWS Glue, and storage-based catalogs (using S3 directly). Check the RisingWave Iceberg sink documentation for the current catalog list.

Q: How do I handle CDC schema changes (new columns added to PostgreSQL)? With Debezium + Schema Registry, new columns are detected automatically if schema evolution is configured. With RisingWave, run ALTER TABLE on the source definition; new columns then flow through to the materialized view if included in the SELECT.
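
Following that answer, the RisingWave side of a schema change looks roughly like this. The discount column is hypothetical, and ALTER TABLE support on CDC tables depends on your RisingWave version:

-- Add the new column to the CDC table definition
ALTER TABLE orders ADD COLUMN discount NUMERIC;

-- Materialized views are not altered in place; recreate the view
-- (and its sink) with the new column included in the SELECT.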

Q: What is the Iceberg commit interval in RisingWave? The sink commits on a configurable interval (default is aligned with RisingWave's checkpoint interval, typically 10–60 seconds). This controls Iceberg file granularity — shorter intervals mean more small files and more frequent compaction needs.
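
As a hedged sketch, RisingWave's Iceberg sink exposes a commit_checkpoint_interval parameter that commits once every N checkpoints; verify the parameter name against your version's sink documentation:

CREATE SINK orders_to_iceberg_batched
FROM orders_enriched
WITH (
    connector      = 'iceberg',
    type           = 'upsert',
    catalog.name   = 'prod',
    catalog.type   = 'rest',
    catalog.uri    = 'http://iceberg-rest:8181',
    warehouse.path = 's3://my-lakehouse/warehouse',
    database.name  = 'analytics',
    table.name     = 'orders_enriched',
    primary_key    = 'id',
    -- commit every 6 checkpoints: fewer, larger Iceberg snapshots
    commit_checkpoint_interval = 6
);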

Q: Can RisingWave write partitioned Iceberg tables? Yes. Iceberg hidden partitioning is defined at the Iceberg table level. RisingWave writes data and Iceberg handles partition assignment automatically based on the column values.
