Streaming to Apache Iceberg: Building a Real-Time Lakehouse (2026)

Apache Iceberg is the leading open table format for data lakehouses, but getting streaming data into Iceberg tables in real time remains challenging. The three main approaches are Apache Flink (complex, Java-heavy), Spark Structured Streaming (micro-batch latency), and RisingWave (SQL-only, built-in compaction). RisingWave offers the simplest path to real-time Iceberg ingestion — define a SQL materialized view, create an Iceberg sink, and data flows continuously with automatic file compaction.

This guide covers real-time Iceberg ingestion patterns, the technical challenges involved, and how to build a streaming lakehouse with SQL.

Why Stream Data into Iceberg?

Apache Iceberg provides ACID transactions, schema evolution, partition evolution, and time travel on top of object storage (S3, GCS, ADLS). It's the open standard that Snowflake, Databricks, BigQuery, Trino, and Spark all support.

But Iceberg is a storage format, not a processing engine. To get real-time data into Iceberg, you need a streaming engine that:

  1. Ingests events from Kafka, CDC, or other sources
  2. Transforms and aggregates the data
  3. Writes Parquet files in Iceberg format
  4. Commits to an Iceberg catalog
  5. Handles compaction to prevent the small files problem

The Small Files Problem

The biggest challenge with streaming into Iceberg is the small files problem. Streaming engines write data at checkpoint intervals (e.g., every 1 minute), creating many small Parquet files:

  • A Flink job with 1-minute checkpoints produces ~1,440 files per day per partition
  • Each file may be only 1-10 MB
  • Thousands of small files degrade query performance due to metadata overhead and excessive object storage requests
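A quick back-of-the-envelope calculation makes the scale concrete (the checkpoint interval and partition count below are illustrative):

```python
# Estimate how many data files a streaming job creates per day.
# Assumes one data file per partition per checkpoint (illustrative).
checkpoint_interval_minutes = 1
partitions = 8

files_per_partition_per_day = (24 * 60) // checkpoint_interval_minutes
total_files_per_day = files_per_partition_per_day * partitions

print(files_per_partition_per_day)  # 1440
print(total_files_per_day)          # 11520
```

At 1-10 MB per file, a few days of output can mean tens of thousands of object-store requests per query, which is why compaction matters.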

Solutions:

  • Compaction: Periodically merge small files into larger ones (target: 256-512 MB per file)
  • Longer checkpoint intervals: Trade freshness for fewer files
  • Automatic compaction: Some tools compact files automatically

Streaming into Iceberg: Three Approaches

1. Apache Flink

// Flink Iceberg sink (Java)
FlinkSink.forRowData(dataStream)
    .table(icebergTable)
    .tableLoader(tableLoader)
    .writeParallelism(4)
    .append();

Pros: Battle-tested at scale, Flink 2.0 + Iceberg 1.10 full integration, dynamic sink supports schema evolution
Cons: Requires Flink cluster, Java expertise, manual compaction setup

2. Spark Structured Streaming

df.writeStream \
  .format("iceberg") \
  .outputMode("append") \
  .option("checkpointLocation", "s3://checkpoints/") \
  .toTable("catalog.db.table")

Pros: Familiar Spark API, unified batch + streaming
Cons: Micro-batch latency (seconds to minutes), requires Spark cluster

3. RisingWave Iceberg Sink (SQL-Only)

CREATE SINK orders_to_iceberg AS
SELECT
  order_id,
  customer_id,
  region,
  amount,
  order_time
FROM orders_stream
WITH (
  connector = 'iceberg',
  type = 'append-only',
  warehouse.path = 's3://my-lakehouse/warehouse',
  s3.endpoint = 'https://s3.amazonaws.com',
  s3.region = 'us-east-1',
  catalog.type = 'rest',
  catalog.uri = 'http://iceberg-rest:8181',
  database.name = 'analytics',
  table.name = 'orders'
);

Pros: Pure SQL, no Java, automatic compaction, Rust-based performance
Cons: Fewer source types than Flink

RisingWave + Iceberg: Deep Dive

Supported Write Modes

RisingWave supports two Iceberg write modes:

  • Merge-on-Read (default): writes updates and deletes to delta files that are merged at query time. Best for high-throughput ingestion and low write latency.
  • Copy-on-Write: rewrites data files to apply changes. Best for read-optimized analytics with less frequent updates.

Supported Catalogs

  • REST Catalog (catalog.type = 'rest'): recommended; works with Polaris, Lakekeeper, and Tabular
  • Hive Catalog (catalog.type = 'hive'): traditional Hive Metastore
  • JDBC Catalog (catalog.type = 'jdbc'): PostgreSQL/MySQL-backed catalog
  • Storage Catalog (catalog.type = 'storage'): file-system-based, on S3
  • AWS S3 Tables (catalog.type = 's3_tables'): AWS-native managed catalog
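As an example of swapping catalogs, a JDBC-backed sink might look like the sketch below; the catalog.jdbc.* option names and the connection URI are assumptions to verify against the RisingWave documentation:

CREATE SINK orders_to_iceberg_jdbc AS
SELECT * FROM orders_stream
WITH (
  connector = 'iceberg',
  type = 'append-only',
  warehouse.path = 's3://my-lakehouse/warehouse',
  catalog.type = 'jdbc',
  catalog.uri = 'jdbc:postgresql://postgres:5432/iceberg_catalog',  -- assumption
  catalog.jdbc.user = 'iceberg',      -- assumption
  catalog.jdbc.password = 'secret',   -- assumption
  database.name = 'analytics',
  table.name = 'orders'
);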

Automatic Compaction

RisingWave automatically compacts Parquet files in Iceberg, solving the small files problem without manual maintenance. This is a significant advantage over Flink and Spark, where you need to set up separate compaction jobs using Iceberg's rewriteDataFiles action.
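For comparison, this is the kind of scheduled maintenance a Flink or Spark pipeline needs: a sketch of Iceberg's rewrite_data_files procedure invoked from Spark SQL (the catalog and table names are illustrative):

-- Run periodically from Spark to merge small files
-- into ~512 MB targets (536870912 bytes)
CALL my_catalog.system.rewrite_data_files(
  table => 'analytics.orders',
  options => map('target-file-size-bytes', '536870912')
);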

Architecture: Streaming Lakehouse with RisingWave

Source DB ──CDC──→  RisingWave  ──Iceberg Sink──→  S3 (Iceberg Tables)
Kafka    ──────→  (SQL Processing)                      ↓
                        │                          Trino / Spark / DuckDB
                  Materialized Views                (Analytical Queries)
                  (Real-time serving)

This architecture provides:

  • Real-time serving via RisingWave materialized views (sub-second)
  • Historical analytics via Iceberg tables queried by Trino, Spark, or DuckDB
  • Open format — No vendor lock-in, data in Parquet on S3
  • Unified governance — Iceberg catalog provides schema management and access control
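The real-time serving half of this architecture is plain SQL over the PostgreSQL wire protocol, so any psql-compatible client can query a materialized view directly (the view name below is illustrative):

-- Served from RisingWave's incrementally maintained state,
-- not by scanning Iceberg files
SELECT *
FROM orders_by_region_mv   -- illustrative view name
ORDER BY revenue DESC
LIMIT 10;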

End-to-End Example: Real-Time Clickstream to Iceberg

Step 1: Ingest from Kafka

CREATE SOURCE clickstream (
  user_id INT,
  page_url VARCHAR,
  referrer VARCHAR,
  duration_seconds INT,
  event_time TIMESTAMP WITH TIME ZONE
) WITH (
  connector = 'kafka',
  topic = 'clickstream',
  properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

Step 2: Transform with SQL

CREATE MATERIALIZED VIEW clickstream_enriched AS
SELECT
  user_id,
  page_url,
  referrer,
  duration_seconds,
  event_time,
  CASE
    WHEN duration_seconds > 60 THEN 'engaged'
    WHEN duration_seconds > 10 THEN 'browsing'
    ELSE 'bounce'
  END as engagement_level,
  EXTRACT(HOUR FROM event_time) as hour_of_day
FROM clickstream;

Step 3: Sink to Iceberg

CREATE SINK clickstream_to_lakehouse AS
SELECT * FROM clickstream_enriched
WITH (
  connector = 'iceberg',
  type = 'append-only',
  catalog.type = 'rest',
  catalog.uri = 'http://iceberg-rest:8181',
  warehouse.path = 's3://lakehouse/warehouse',
  database.name = 'analytics',
  table.name = 'clickstream'
);

Step 4: Query with Trino or DuckDB

-- In Trino
SELECT engagement_level, COUNT(*) as sessions, AVG(duration_seconds) as avg_duration
FROM iceberg.analytics.clickstream
WHERE event_time > CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY engagement_level;
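The same table is also readable from DuckDB through its iceberg extension; this is a sketch that assumes direct access to the table's metadata, with a scan path mirroring the warehouse layout used above:

-- In DuckDB
INSTALL iceberg;
LOAD iceberg;

SELECT engagement_level, COUNT(*) AS sessions
FROM iceberg_scan('s3://lakehouse/warehouse/analytics/clickstream')
GROUP BY engagement_level;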

Frequently Asked Questions

How do I get streaming data into Apache Iceberg?

You need a streaming engine that ingests events, processes them, and writes Parquet files in Iceberg format. The three main options are Apache Flink (Java, complex), Spark Structured Streaming (micro-batch), and RisingWave (SQL-only, automatic compaction). RisingWave provides the simplest path — define a SQL sink and data flows continuously into Iceberg tables.

What is the small files problem in Iceberg?

Streaming engines write data at checkpoint intervals, creating many small Parquet files (1-10 MB each). Thousands of small files degrade query performance due to metadata overhead and excessive object storage API calls. The solution is compaction — periodically merging small files into larger ones (target 256-512 MB). RisingWave handles compaction automatically; Flink and Spark require separate compaction jobs.

Which Iceberg catalogs does RisingWave support?

RisingWave supports REST catalog (recommended — works with Polaris, Lakekeeper, Tabular), Hive catalog, JDBC catalog, Storage catalog (S3), and AWS S3 Tables catalog. The REST catalog is the most flexible and is the Iceberg community's recommended approach.

Can I use RisingWave for both real-time serving and Iceberg storage?

Yes. RisingWave can simultaneously serve real-time queries via materialized views (sub-second freshness, PostgreSQL protocol) and sink data to Iceberg for long-term storage and historical analytics. This gives you the best of both worlds — real-time serving for operational use cases and Iceberg for analytical queries with Trino, Spark, or DuckDB.
