Real-Time Data Lake Ingestion: Kafka to Iceberg

Real-time data lake ingestion streams events from Kafka directly into Apache Iceberg tables, eliminating batch ETL jobs. The three main approaches are Apache Flink, Spark Structured Streaming, and RisingWave — each with different trade-offs in complexity, latency, and operational overhead.

Ingestion Approaches

| Approach | Latency | Complexity | Auto-Compaction |
|---|---|---|---|
| RisingWave | Sub-second | Low (SQL) | ✅ Yes |
| Flink | Sub-second | High (Java/SQL) | ❌ Separate job |
| Spark Structured Streaming | Seconds to minutes | Medium (PySpark) | ❌ Separate job |
| Batch (Airflow) | Hours | Low | N/A |
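For comparison, the Flink route expresses the same pipeline as two table definitions plus an `INSERT INTO` job. The sketch below is illustrative only: the topic, schema, broker address, and catalog options are assumptions, and the exact connector option names depend on your Flink and Iceberg versions.

-- Hypothetical Flink SQL pipeline: Kafka source table → Iceberg sink table.
CREATE TABLE kafka_events (
  user_id BIGINT,
  event_type STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset'
);

CREATE TABLE lake_events (
  user_id BIGINT,
  event_type STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'iceberg',
  'catalog-name' = 'rest_catalog',
  'catalog-type' = 'rest',
  'uri' = 'http://rest-catalog:8181',
  'warehouse' = 's3://lake/warehouse'
);

-- This statement submits a long-running streaming job.
INSERT INTO lake_events
SELECT user_id, event_type, ts FROM kafka_events;

Note that this covers ingestion only; compaction of the resulting small files is a separate job, as the table above indicates.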

SQL-Based Ingestion with RisingWave

-- Kafka → Transform → Iceberg in 3 SQL statements
CREATE SOURCE events (...) WITH (connector='kafka', topic='events', ...);
CREATE MATERIALIZED VIEW enriched AS SELECT ..., CASE ... END as category FROM events;
CREATE SINK to_lake AS SELECT * FROM enriched
WITH (connector='iceberg', type='append-only', catalog.type='rest', ...);

No Java, no Spark cluster, automatic compaction.
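To make the elided pieces concrete, here is a hypothetical expanded version for a clickstream topic with JSON events. The schema, broker address, catalog URI, warehouse path, and the `CASE` categorization are all invented for illustration; option names should be checked against the RisingWave documentation for your release.

-- Assumed: JSON events on a Kafka topic named 'events'.
CREATE SOURCE events (
  user_id BIGINT,
  url VARCHAR,
  event_time TIMESTAMP
) WITH (
  connector = 'kafka',
  topic = 'events',
  properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

-- Incremental transformation, maintained continuously.
CREATE MATERIALIZED VIEW enriched AS
SELECT
  user_id,
  url,
  event_time,
  CASE WHEN url LIKE '/checkout%' THEN 'purchase' ELSE 'browse' END AS category
FROM events;

-- Append-only sink into an Iceberg table via a REST catalog.
CREATE SINK to_lake AS SELECT * FROM enriched
WITH (
  connector = 'iceberg',
  type = 'append-only',
  catalog.type = 'rest',
  catalog.uri = 'http://rest-catalog:8181',
  warehouse.path = 's3://lake/warehouse',
  database.name = 'demo',
  table.name = 'enriched_events'
);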

Frequently Asked Questions

Which tool is best for Kafka-to-Iceberg ingestion?

Use RisingWave for simplicity (pure SQL, built-in compaction), Flink for maximum flexibility and scale, and Spark Structured Streaming for teams already invested in the Spark ecosystem.

How do I handle the small files problem?

RisingWave compacts automatically. Flink and Spark require separate compaction jobs using Iceberg's rewriteDataFiles action.
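For the Flink and Spark paths, compaction is typically scheduled as a periodic Spark job calling Iceberg's `rewrite_data_files` procedure. A minimal sketch, where the catalog name `lake`, the table identifier, and the target file size are assumptions:

-- Compact small files in an Iceberg table from Spark SQL.
CALL lake.system.rewrite_data_files(
  table => 'demo.enriched_events',
  options => map('target-file-size-bytes', '536870912')  -- 512 MB target
);

Running this on a schedule (e.g., from Airflow) keeps file counts manageable, at the cost of operating one more job alongside the ingestion pipeline.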
