Streaming to Apache Iceberg: Building a Real-Time Lakehouse (2026)
Apache Iceberg is the leading open table format for data lakehouses, but getting streaming data into Iceberg tables in real time remains challenging. The three main approaches are Apache Flink (complex, Java-heavy), Spark Structured Streaming (micro-batch latency), and RisingWave (SQL-only, built-in compaction). RisingWave offers the simplest path to real-time Iceberg ingestion — define a SQL materialized view, create an Iceberg sink, and data flows continuously with automatic file compaction.
This guide covers real-time Iceberg ingestion patterns, the technical challenges involved, and how to build a streaming lakehouse with SQL.
Why Stream Data into Iceberg?
Apache Iceberg provides ACID transactions, schema evolution, partition evolution, and time travel on top of object storage (S3, GCS, ADLS). It's the open standard that Snowflake, Databricks, BigQuery, Trino, and Spark all support.
But Iceberg is a storage format, not a processing engine. To get real-time data into Iceberg, you need a streaming engine that:
- Ingests events from Kafka, CDC, or other sources
- Transforms and aggregates the data
- Writes Parquet files in Iceberg format
- Commits to an Iceberg catalog
- Handles compaction to prevent the small files problem
The Small Files Problem
The biggest challenge with streaming into Iceberg is the small files problem. Streaming engines write data at checkpoint intervals (e.g., every 1 minute), creating many small Parquet files:
- A Flink job with 1-minute checkpoints produces ~1,440 files per day per partition
- Each file may be only 1-10 MB
- Thousands of small files degrade query performance due to metadata overhead and excessive object storage requests
Solutions:
- Compaction: Periodically merge small files into larger ones (target: 256-512 MB per file)
- Longer checkpoint intervals: Trade freshness for fewer files
- Automatic compaction: Some tools compact files automatically
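Target file sizes are usually controlled through Iceberg's `write.target-file-size-bytes` table property (512 MB by default). As a sketch of how it might be tuned via Spark SQL — the table name here is a placeholder:

```sql
-- Tune the target data file size used by writes and compaction (Spark SQL).
-- 'catalog.db.events' is a placeholder table name.
ALTER TABLE catalog.db.events SET TBLPROPERTIES (
    'write.target-file-size-bytes' = '536870912'  -- 512 MB
);
```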
Streaming into Iceberg: Three Approaches
1. Apache Flink + Iceberg Sink
// Flink Iceberg sink (classic FlinkSink API, Java)
FlinkSink.forRowData(dataStream)
    .table(icebergTable)
    .tableLoader(tableLoader)
    .writeParallelism(4)
    .append();
Pros: Battle-tested at scale, Flink 2.0 + Iceberg 1.10 full integration, dynamic sink supports schema evolution
Cons: Requires a Flink cluster, Java expertise, manual compaction setup
2. Spark Structured Streaming
df.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://checkpoints/") \
    .toTable("catalog.db.table")
Pros: Familiar Spark API, unified batch + streaming
Cons: Micro-batch latency (seconds to minutes), requires a Spark cluster
3. RisingWave Iceberg Sink (SQL-Only)
CREATE SINK orders_to_iceberg AS
SELECT
    order_id,
    customer_id,
    region,
    amount,
    order_time
FROM orders_stream
WITH (
    connector = 'iceberg',
    type = 'append-only',
    warehouse.path = 's3://my-lakehouse/warehouse',
    s3.endpoint = 'https://s3.amazonaws.com',
    s3.region = 'us-east-1',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest:8181',
    database.name = 'analytics',
    table.name = 'orders'
);
Pros: Pure SQL, no Java, automatic compaction, Rust-based performance
Cons: Fewer source types than Flink
RisingWave + Iceberg: Deep Dive
Supported Write Modes
RisingWave supports two Iceberg write modes:
| Mode | How It Works | Best For |
| --- | --- | --- |
| Merge-on-Read (default) | Writes updates/deletes to delta files, merged at query time | High-throughput ingestion, low write latency |
| Copy-on-Write | Rewrites data files to apply changes | Read-optimized analytics, less frequent updates |
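The append-only sinks shown in this guide map naturally onto Merge-on-Read. For streams that update or delete rows, RisingWave's `upsert` sink type with a declared primary key is the usual pattern. A sketch, with the table and key names assumed for illustration:

```sql
-- Hypothetical upsert sink: updates/deletes are written as Iceberg deltas.
CREATE SINK orders_upsert_to_iceberg AS
SELECT order_id, customer_id, amount FROM orders_stream
WITH (
    connector = 'iceberg',
    type = 'upsert',            -- emits updates/deletes, not just appends
    primary_key = 'order_id',   -- required for upsert sinks
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest:8181',
    warehouse.path = 's3://my-lakehouse/warehouse',
    database.name = 'analytics',
    table.name = 'orders_latest'
);
```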
Supported Catalogs
| Catalog | Description | Configuration |
| --- | --- | --- |
| REST Catalog | Recommended; works with Polaris, Lakekeeper, Tabular | catalog.type = 'rest' |
| Hive Catalog | Traditional Hive Metastore | catalog.type = 'hive' |
| JDBC Catalog | PostgreSQL/MySQL-backed catalog | catalog.type = 'jdbc' |
| Storage Catalog | File-system-based (S3) | catalog.type = 'storage' |
| AWS S3 Tables | AWS-native managed catalog | catalog.type = 's3_tables' |
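For quick experiments without running a catalog service, the storage catalog keeps table metadata directly under the warehouse path. A minimal sketch, with the bucket path and stream name assumed:

```sql
-- Storage catalog: no catalog service; metadata lives under warehouse.path.
CREATE SINK events_to_iceberg AS
SELECT * FROM events_stream        -- assumed source name
WITH (
    connector = 'iceberg',
    type = 'append-only',
    catalog.type = 'storage',
    warehouse.path = 's3://my-lakehouse/warehouse',  -- assumed bucket
    database.name = 'analytics',
    table.name = 'events'
);
```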
Automatic Compaction
RisingWave automatically compacts Parquet files in Iceberg, solving the small files problem without manual maintenance. This is a significant advantage over Flink and Spark, where you need to set up separate compaction jobs using Iceberg's rewriteDataFiles action.
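For comparison, with Flink or Spark the equivalent maintenance is typically a scheduled call to Iceberg's `rewrite_data_files` procedure (Spark SQL shown; the catalog and table names are placeholders):

```sql
-- Scheduled compaction job in Spark SQL (not needed with RisingWave).
CALL catalog.system.rewrite_data_files(
    table => 'analytics.clickstream',
    options => map('target-file-size-bytes', '536870912')  -- ~512 MB
);
```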
Architecture: Streaming Lakehouse with RisingWave
Source DB ──CDC──┐
Kafka ───────────┤
                 ▼
        RisingWave (SQL processing)
          │                    │
          ▼                    ▼ Iceberg sink
  Materialized views     S3 (Iceberg tables)
  (real-time serving)          │
                               ▼
                    Trino / Spark / DuckDB
                    (analytical queries)
This architecture provides:
- Real-time serving via RisingWave materialized views (sub-second)
- Historical analytics via Iceberg tables queried by Trino, Spark, or DuckDB
- Open format — No vendor lock-in, data in Parquet on S3
- Unified governance — Iceberg catalog provides schema management and access control
End-to-End Example: Real-Time Clickstream to Iceberg
Step 1: Ingest from Kafka
CREATE SOURCE clickstream (
    user_id INT,
    page_url VARCHAR,
    referrer VARCHAR,
    duration_seconds INT,
    event_time TIMESTAMP WITH TIME ZONE
) WITH (
    connector = 'kafka',
    topic = 'clickstream',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;
Step 2: Transform with SQL
CREATE MATERIALIZED VIEW clickstream_enriched AS
SELECT
    user_id,
    page_url,
    referrer,
    duration_seconds,
    event_time,
    CASE
        WHEN duration_seconds > 60 THEN 'engaged'
        WHEN duration_seconds > 10 THEN 'browsing'
        ELSE 'bounce'
    END AS engagement_level,
    EXTRACT(HOUR FROM event_time) AS hour_of_day
FROM clickstream;
Step 3: Sink to Iceberg
CREATE SINK clickstream_to_lakehouse AS
SELECT * FROM clickstream_enriched
WITH (
    connector = 'iceberg',
    type = 'append-only',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest:8181',
    warehouse.path = 's3://lakehouse/warehouse',
    database.name = 'analytics',
    table.name = 'clickstream'
);
Step 4: Query with Trino or DuckDB
-- In Trino
SELECT
    engagement_level,
    COUNT(*) AS sessions,
    AVG(duration_seconds) AS avg_duration
FROM iceberg.analytics.clickstream
WHERE event_time > CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY engagement_level;
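The same table can also be read locally with DuckDB's iceberg extension. A sketch, assuming the table's S3 path and that S3 credentials are already configured in DuckDB:

```sql
-- In DuckDB (iceberg extension; the table path is an assumption)
INSTALL iceberg;
LOAD iceberg;
SELECT engagement_level, COUNT(*) AS sessions
FROM iceberg_scan('s3://lakehouse/warehouse/analytics/clickstream')
GROUP BY engagement_level;
```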
Frequently Asked Questions
How do I get streaming data into Apache Iceberg?
You need a streaming engine that ingests events, processes them, and writes Parquet files in Iceberg format. The three main options are Apache Flink (Java, complex), Spark Structured Streaming (micro-batch), and RisingWave (SQL-only, automatic compaction). RisingWave provides the simplest path — define a SQL sink and data flows continuously into Iceberg tables.
What is the small files problem in Iceberg?
Streaming engines write data at checkpoint intervals, creating many small Parquet files (1-10 MB each). Thousands of small files degrade query performance due to metadata overhead and excessive object storage API calls. The solution is compaction — periodically merging small files into larger ones (target 256-512 MB). RisingWave handles compaction automatically; Flink and Spark require separate compaction jobs.
Which Iceberg catalogs does RisingWave support?
RisingWave supports REST catalog (recommended — works with Polaris, Lakekeeper, Tabular), Hive catalog, JDBC catalog, Storage catalog (S3), and AWS S3 Tables catalog. The REST catalog is the most flexible and is the Iceberg community's recommended approach.
Can I use RisingWave for both real-time serving and Iceberg storage?
Yes. RisingWave can simultaneously serve real-time queries via materialized views (sub-second freshness, PostgreSQL protocol) and sink data to Iceberg for long-term storage and historical analytics. This gives you the best of both worlds — real-time serving for operational use cases and Iceberg for analytical queries with Trino, Spark, or DuckDB.

