CDC for Data Lakehouse: Debezium vs RisingWave to Apache Iceberg
Syncing an operational database into a data lakehouse on S3 using Apache Iceberg is one of the most valuable CDC use cases for analytics teams. The traditional path routes changes through Debezium, Kafka, and Flink before writing to Iceberg. RisingWave collapses that into a single SQL pipeline: CDC source → transformation → Iceberg sink.
Why Iceberg for CDC Landing
Apache Iceberg solves the hardest problem in lake-based CDC: how do you apply row-level updates and deletes to immutable Parquet files? Iceberg's table format supports MERGE semantics, hidden partitioning, and time travel — making it suitable as a landing zone for continuous CDC streams.
Without a format like Iceberg, every update from CDC requires rewriting entire partitions. Iceberg's merge-on-read (MOR) and copy-on-write (COW) modes give you control over this trade-off.
The Classic Pattern: Debezium → Kafka → Flink → Iceberg
This is the production-standard architecture at companies running large-scale lakehouse ingestion.
PostgreSQL (WAL)
│
▼
Debezium Connector (Kafka Connect)
│
▼
Kafka Topic: cdc.public.orders
│
▼
Apache Flink (Kafka connector + Iceberg sink)
│
▼
Apache Iceberg Table on S3
Debezium Source Configuration
{
  "name": "postgres-orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "warehouse",
    "topic.prefix": "cdc",
    "table.include.list": "public.orders",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_orders",
    "tombstones.on.delete": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
Flink Job: Kafka to Iceberg (Java)
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// Read from the Kafka topic containing Debezium change events
tableEnv.executeSql("""
    CREATE TABLE kafka_orders (
        id BIGINT,
        customer_id BIGINT,
        status STRING,
        total DECIMAL(10,2),
        created_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'cdc.public.orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'iceberg-writer',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-avro-confluent',
        'debezium-avro-confluent.schema-registry.url' = 'http://schema-registry:8081'
    )
    """);

// Write to Iceberg; format-version 2 is needed so the CDC stream's
// row-level updates and deletes can be written as delete files
tableEnv.executeSql("""
    CREATE TABLE iceberg_orders (
        id BIGINT,
        customer_id BIGINT,
        status STRING,
        total DECIMAL(10,2),
        created_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'iceberg',
        'catalog-name' = 'prod',
        'catalog-type' = 'rest',
        'uri' = 'http://iceberg-rest:8181',
        'warehouse' = 's3://my-lakehouse/warehouse',
        'format-version' = '2'
    )
    """);

tableEnv.executeSql("INSERT INTO iceberg_orders SELECT * FROM kafka_orders");
This pipeline is powerful but requires running and tuning Flink jobs, managing Kafka consumer groups, and handling checkpointing for exactly-once semantics.
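Much of that tuning lives in the checkpoint configuration, since the Flink Iceberg sink only commits files on a successful checkpoint. A minimal, illustrative pass in the Flink SQL client (the interval and paths are examples, not recommendations):
-- Illustrative checkpoint tuning; adjust values to your workload.
SET 'execution.checkpointing.interval' = '30 s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
SET 'execution.checkpointing.min-pause' = '10 s';
SET 'state.checkpoints.dir' = 's3://my-lakehouse/flink-checkpoints';
The checkpoint interval effectively becomes your Iceberg commit interval, which is why it reappears in the latency row of the comparison table below.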
The RisingWave Pattern: Direct CDC to Iceberg
RisingWave uses the Debezium Embedded Engine internally — the same capture logic, no separate deployment. You define the CDC source as a table and sink directly to Iceberg.
Step 1: Create the CDC Source Table
CREATE TABLE orders (
id BIGINT PRIMARY KEY,
customer_id BIGINT,
status VARCHAR,
total NUMERIC(10,2),
created_at TIMESTAMPTZ
) WITH (
connector = 'postgres-cdc',
hostname = 'postgres',
port = '5432',
username = 'rwuser',
password = 'secret',
database.name = 'warehouse',
schema.name = 'public',
table.name = 'orders'
);
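The enrichment in Step 2 joins this stream against a customers table captured from the same database. A minimal sketch of that second CDC table, with its columns inferred from the join below (adjust to your real schema):
-- Second CDC table used by the join in Step 2; column list inferred from the materialized view.
CREATE TABLE customers (
    id BIGINT PRIMARY KEY,
    region VARCHAR,
    tier VARCHAR
) WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres',
    port = '5432',
    username = 'rwuser',
    password = 'secret',
    database.name = 'warehouse',
    schema.name = 'public',
    table.name = 'customers'
);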
Step 2: Transform with a Materialized View
This is the critical advantage over raw Debezium pipelines. You can enrich, filter, and aggregate before the data ever reaches Iceberg.
CREATE MATERIALIZED VIEW orders_enriched AS
SELECT
o.id,
o.customer_id,
c.region,
c.tier AS customer_tier,
o.status,
o.total,
-- Compute derived metrics inline
CASE
WHEN o.total > 1000 THEN 'high_value'
WHEN o.total > 100 THEN 'standard'
ELSE 'low_value'
END AS order_tier,
DATE_TRUNC('day', o.created_at) AS order_date,
o.created_at
FROM orders o
JOIN customers c ON o.customer_id = c.id;
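Because the view is maintained incrementally, you can sanity-check the enriched stream with plain SQL before attaching a sink, for example:
-- Ad-hoc check on the continuously maintained view.
SELECT order_tier, COUNT(*) AS orders, SUM(total) AS revenue
FROM orders_enriched
GROUP BY order_tier;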
Step 3: Sink to Iceberg
CREATE SINK orders_to_iceberg
FROM orders_enriched
WITH (
connector = 'iceberg',
type = 'upsert',
catalog.name = 'prod',
catalog.type = 'rest',
catalog.uri = 'http://iceberg-rest:8181',
warehouse.path = 's3://my-lakehouse/warehouse',
database.name = 'analytics',
table.name = 'orders_enriched',
primary_key = 'id'
);
Transformation and Iceberg write happen in a single declarative SQL pipeline. No Flink job. No Kafka consumer group.
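Once the sink starts committing, any Iceberg-aware engine can read the table. A quick read-side check in Spark SQL, assuming the REST catalog above is registered in Spark under the name prod:
-- Spark SQL; assumes the REST catalog is configured in Spark as 'prod'.
SELECT order_date, order_tier, COUNT(*) AS orders
FROM prod.analytics.orders_enriched
GROUP BY order_date, order_tier
ORDER BY order_date DESC;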
Merge-on-Read vs Copy-on-Write for CDC Updates
Iceberg supports two write modes, and the choice matters significantly for CDC workloads.
Copy-on-Write (COW): Every update or delete rewrites the affected data files entirely. Reads are fast because there are no delete files to merge. Writes are expensive — each CDC update causes a file rewrite. Use COW when updates are rare and read performance is critical.
Merge-on-Read (MOR): Updates and deletes write small positional delete files alongside data files. Reads must merge these at query time. Writes are cheap. For high-frequency CDC streams (hundreds of updates per second), MOR is the right choice.
Configure this in your Iceberg table properties:
-- Spark SQL against the Iceberg catalog; the same properties can also be set at table creation
ALTER TABLE analytics.orders_enriched
SET TBLPROPERTIES (
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read'
);
Run compaction regularly (via Spark or Flink maintenance jobs) to merge delete files and restore read performance over time.
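A sketch of such a maintenance job in Spark SQL, using Iceberg's built-in procedures (the catalog name prod, the threshold, and the timestamp are illustrative):
-- Compact small data files and fold accumulated position deletes back into data files.
CALL prod.system.rewrite_data_files(
    table => 'analytics.orders_enriched',
    options => map('min-input-files', '5')
);
CALL prod.system.rewrite_position_delete_files(
    table => 'analytics.orders_enriched'
);
-- Expire old snapshots so metadata and storage do not grow without bound.
CALL prod.system.expire_snapshots(
    table => 'analytics.orders_enriched',
    older_than => TIMESTAMP '2024-01-01 00:00:00'
);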
Comparison Table
| Dimension | Debezium + Kafka + Flink + Iceberg | RisingWave CDC + Iceberg Sink |
| --- | --- | --- |
| Components to operate | Kafka, Connect, Flink, Iceberg catalog | RisingWave, Iceberg catalog |
| Transformation layer | Flink SQL / Java | Standard SQL materialized views |
| Join enrichment before write | Yes (Flink joins) | Yes (SQL JOINs, simpler syntax) |
| Exactly-once delivery | Yes (Flink checkpointing) | Yes (RisingWave transactional sink) |
| Schema evolution | Schema Registry + Avro | ALTER TABLE in RisingWave |
| Multi-table fan-out | Multiple Flink jobs | One sink per materialized view |
| Re-processing / backfill | Replay from Kafka | Recreate materialized view |
| Latency to Iceberg | Seconds (Flink checkpoint interval) | Seconds (configurable commit interval) |
When to Use the Debezium + Flink Path
Choose the traditional stack when:
- Multiple downstream systems (not just Iceberg) consume the CDC stream from Kafka.
- The Flink team already manages jobs at scale and prefers Java/Scala for complex transformations.
- You need fine-grained control over Flink watermarks, late data handling, and windowed aggregations before writing to Iceberg.
- You require extremely long retention of raw change events for compliance or audit replay.
When RisingWave Is the Simpler Choice
Choose RisingWave when:
- Iceberg is the primary destination and you want to reduce infrastructure complexity.
- Enrichment or transformation before Iceberg write is necessary but manageable in SQL.
- The team prefers SQL over distributed job management.
- You are building a new pipeline and don't have an existing Kafka investment.
FAQ
Q: Does RisingWave support Iceberg V2 (row-level deletes)? Yes. RisingWave's Iceberg sink writes Iceberg V2 format, which supports row-level positional deletes required for upsert and delete CDC operations.
Q: Which Iceberg catalogs does RisingWave support? RisingWave supports REST catalog, AWS Glue, and storage-based catalogs (using S3 directly). Check the RisingWave Iceberg sink documentation for the current catalog list.
Q: How do I handle CDC schema changes (new columns added to PostgreSQL)? With Debezium + Schema Registry, new columns are detected automatically if schema evolution is configured. With RisingWave, run ALTER TABLE on the source definition; new columns then flow through to the materialized view if included in the SELECT.
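A sketch of that flow, using a hypothetical discount column added to the upstream PostgreSQL table:
-- 'discount' is a hypothetical column; run this after it exists in PostgreSQL.
ALTER TABLE orders ADD COLUMN discount NUMERIC(10,2);
-- Then recreate the materialized view (or create a new one) that selects 'discount'
-- so the column reaches the Iceberg sink.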
Q: What is the Iceberg commit interval in RisingWave? The sink commits on a configurable interval (default is aligned with RisingWave's checkpoint interval, typically 10–60 seconds). This controls Iceberg file granularity — shorter intervals mean more small files and more frequent compaction needs.
Q: Can RisingWave write partitioned Iceberg tables? Yes. Iceberg hidden partitioning is defined at the Iceberg table level. RisingWave writes data and Iceberg handles partition assignment automatically based on the column values.
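For example, daily hidden partitioning on the order timestamp is declared once on the Iceberg side (Spark SQL shown; catalog name prod assumed), and RisingWave then writes without any partition-related sink configuration:
-- Iceberg-level DDL; the RisingWave sink needs no knowledge of the partition spec.
CREATE TABLE prod.analytics.orders_enriched (
    id BIGINT,
    customer_id BIGINT,
    region STRING,
    customer_tier STRING,
    status STRING,
    total DECIMAL(10,2),
    order_tier STRING,
    order_date TIMESTAMP,
    created_at TIMESTAMP
) USING iceberg
PARTITIONED BY (days(created_at))
TBLPROPERTIES ('format-version' = '2');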

