Real-Time Data Lake Ingestion: Kafka to Iceberg
Real-time data lake ingestion streams events from Kafka directly into Apache Iceberg tables, eliminating batch ETL jobs. The three main approaches are Apache Flink, Spark Structured Streaming, and RisingWave — each with different trade-offs in complexity, latency, and operational overhead.
Ingestion Approaches
| Approach | Latency | Complexity | Auto-Compaction |
|---|---|---|---|
| RisingWave | Sub-second | Low (SQL) | ✅ Yes |
| Flink | Sub-second | High (Java/SQL) | ❌ Separate job |
| Spark Structured Streaming | Seconds to minutes | Medium (PySpark) | ❌ Separate job |
| Batch (Airflow) | Hours | Low | N/A |
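For teams leaning toward Flink, the equivalent pipeline in Flink SQL looks roughly like the sketch below. The schema, topic, broker address, catalog URI, and table names are all placeholders, and the exact connector options depend on the Flink and Iceberg versions in use.

-- Flink SQL sketch: Kafka source table (schema and connection details are illustrative)
CREATE TABLE kafka_events (
  user_id BIGINT,
  event_type STRING,
  event_ts TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Register an Iceberg catalog (a REST catalog here; URI and warehouse path are placeholders)
CREATE CATALOG lake WITH (
  'type' = 'iceberg',
  'catalog-type' = 'rest',
  'uri' = 'http://rest-catalog:8181',
  'warehouse' = 's3://warehouse/'
);

-- Continuous INSERT job streaming Kafka rows into an existing Iceberg table
INSERT INTO lake.db.events
SELECT user_id, event_type, event_ts FROM kafka_events;

Iceberg commits happen on Flink checkpoints, so checkpointing must be enabled for rows to become visible in the table; compaction is not part of this job and still needs a separate maintenance task (see the FAQ below).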
SQL-Based Ingestion with RisingWave
-- Kafka → Transform → Iceberg in 3 SQL statements
CREATE SOURCE events (...) WITH (connector='kafka', topic='events', ...);
CREATE MATERIALIZED VIEW enriched AS SELECT ..., CASE ... END as category FROM events;
CREATE SINK to_lake AS SELECT * FROM enriched
WITH (connector='iceberg', type='append-only', catalog.type='rest', ...);
No Java, no Spark cluster, automatic compaction.
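Fleshed out, the pipeline might look like the sketch below. The column list, topic, broker address, catalog URI, warehouse path, and table names are placeholders, and sink parameters (credentials in particular, which are omitted here) vary somewhat across RisingWave versions.

-- 1. Kafka source: schema, topic, and broker address are illustrative
CREATE SOURCE events (
  user_id BIGINT,
  event_type VARCHAR,
  event_ts TIMESTAMP
) WITH (
  connector = 'kafka',
  topic = 'events',
  properties.bootstrap.server = 'kafka:9092',
  scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

-- 2. Incrementally maintained transformation
CREATE MATERIALIZED VIEW enriched AS
SELECT
  user_id,
  event_type,
  event_ts,
  CASE WHEN event_type = 'purchase' THEN 'revenue' ELSE 'engagement' END AS category
FROM events;

-- 3. Append-only Iceberg sink through a REST catalog (URI, warehouse, and names are placeholders;
--    object-store credentials omitted)
CREATE SINK to_lake AS SELECT * FROM enriched
WITH (
  connector = 'iceberg',
  type = 'append-only',
  catalog.type = 'rest',
  catalog.uri = 'http://rest-catalog:8181',
  warehouse.path = 's3://warehouse/',
  database.name = 'lake',
  table.name = 'events_enriched'
);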
Frequently Asked Questions
Which tool is best for Kafka-to-Iceberg ingestion?
RisingWave for simplicity (SQL, auto-compaction). Flink for maximum flexibility and scale. Spark for teams already in the Spark ecosystem.
How do I handle the small files problem?
RisingWave compacts automatically. Flink and Spark require separate compaction jobs using Iceberg's rewriteDataFiles action.
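For Spark, the rewriteDataFiles action is also exposed as the rewrite_data_files stored procedure in Spark SQL. A minimal sketch, with placeholder catalog and table names and an assumed ~512 MB target file size, looks like this:

-- Compact small files in an Iceberg table via Spark SQL (names and sizes are placeholders)
CALL my_catalog.system.rewrite_data_files(
  table => 'lake.events_enriched',
  options => map('target-file-size-bytes', '536870912')
);

In practice this runs on a schedule (for example from Airflow or a cron-style job) rather than as a one-off call.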

