Real-Time Data Lake Ingestion: Kafka to Iceberg

Real-time data lake ingestion streams events from Kafka directly into Apache Iceberg tables, eliminating batch ETL jobs. The three main approaches are Apache Flink, Spark Structured Streaming, and RisingWave — each with different trade-offs in complexity, latency, and operational overhead.

Ingestion Approaches

| Approach | Latency | Complexity | Auto-Compaction |
|---|---|---|---|
| RisingWave | Sub-second | Low (SQL) | ✅ Yes |
| Flink | Sub-second | High (Java/SQL) | ❌ Separate job |
| Spark Structured Streaming | Seconds to minutes | Medium (PySpark) | ❌ Separate job |
| Batch (Airflow) | Hours | Low | N/A |
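For comparison, the Flink route expresses the same pipeline as two table definitions plus an `INSERT INTO` job. The sketch below is illustrative only: the topic, schema, broker address, and catalog options are assumptions, and the exact connector option names depend on your Flink and Iceberg versions.

-- Hypothetical Flink SQL pipeline: Kafka source table → Iceberg sink table.
CREATE TABLE kafka_events (
  user_id BIGINT,
  event_type STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset'
);

CREATE TABLE lake_events (
  user_id BIGINT,
  event_type STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'iceberg',
  'catalog-name' = 'rest_catalog',
  'catalog-type' = 'rest',
  'uri' = 'http://rest-catalog:8181',
  'warehouse' = 's3://lake/warehouse'
);

-- This statement submits a long-running streaming job.
INSERT INTO lake_events
SELECT user_id, event_type, ts FROM kafka_events;

Note that this covers ingestion only; compaction of the resulting small files is a separate job, as the table above indicates.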

SQL-Based Ingestion with RisingWave

-- Kafka → Transform → Iceberg in 3 SQL statements
CREATE SOURCE events (...) WITH (connector='kafka', topic='events', ...);
CREATE MATERIALIZED VIEW enriched AS SELECT ..., CASE ... END as category FROM events;
CREATE SINK to_lake AS SELECT * FROM enriched
WITH (connector='iceberg', type='append-only', catalog.type='rest', ...);

No Java, no Spark cluster, automatic compaction.
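To make the elided pieces concrete, here is a hypothetical expanded version for a clickstream topic with JSON events. The schema, broker address, catalog URI, warehouse path, and the `CASE` categorization are all invented for illustration; option names should be checked against the RisingWave documentation for your release.

-- Assumed: JSON events on a Kafka topic named 'events'.
CREATE SOURCE events (
  user_id BIGINT,
  url VARCHAR,
  event_time TIMESTAMP
) WITH (
  connector = 'kafka',
  topic = 'events',
  properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

-- Incremental transformation, maintained continuously.
CREATE MATERIALIZED VIEW enriched AS
SELECT
  user_id,
  url,
  event_time,
  CASE WHEN url LIKE '/checkout%' THEN 'purchase' ELSE 'browse' END AS category
FROM events;

-- Append-only sink into an Iceberg table via a REST catalog.
CREATE SINK to_lake AS SELECT * FROM enriched
WITH (
  connector = 'iceberg',
  type = 'append-only',
  catalog.type = 'rest',
  catalog.uri = 'http://rest-catalog:8181',
  warehouse.path = 's3://lake/warehouse',
  database.name = 'demo',
  table.name = 'enriched_events'
);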

Frequently Asked Questions

Which tool is best for Kafka-to-Iceberg ingestion?

Use RisingWave for simplicity (pure SQL, built-in compaction), Flink for maximum flexibility and scale, and Spark Structured Streaming for teams already invested in the Spark ecosystem.

How do I handle the small files problem?

RisingWave compacts automatically. Flink and Spark require separate compaction jobs using Iceberg's rewriteDataFiles action.
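For the Flink and Spark paths, compaction is typically scheduled as a periodic Spark job calling Iceberg's `rewrite_data_files` procedure. A minimal sketch, where the catalog name `lake`, the table identifier, and the target file size are assumptions:

-- Compact small files in an Iceberg table from Spark SQL.
CALL lake.system.rewrite_data_files(
  table => 'demo.enriched_events',
  options => map('target-file-size-bytes', '536870912')  -- 512 MB target
);

Running this on a schedule (e.g., from Airflow) keeps file counts manageable, at the cost of operating one more job alongside the ingestion pipeline.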
