What Is Apache Iceberg? A Complete Guide (2026)

Apache Iceberg is an open table format for huge analytical datasets on object storage (S3, GCS, ADLS). It provides ACID transactions, schema evolution, partition evolution, and time travel on top of Parquet files — capabilities that raw Parquet or Hive tables lack. Created at Netflix, Iceberg is now the industry standard supported by Spark, Flink, Trino, Snowflake, BigQuery, and DuckDB.

Why Iceberg Matters

Traditional data lake files (Parquet on S3) have no transactions, no schema management, and no way to handle concurrent reads/writes safely. Iceberg adds a metadata layer that provides database-like guarantees:

| Capability | Raw Parquet | Hive Table | Apache Iceberg |
| --- | --- | --- | --- |
| ACID transactions | ❌ | ❌ | ✅ |
| Schema evolution | Manual | Limited | ✅ Full (add/drop/rename) |
| Partition evolution | Requires rewrite | Requires rewrite | ✅ In-place |
| Time travel | ❌ | ❌ | ✅ Snapshot-based |
| Concurrent writes | Unsafe | Limited | ✅ Optimistic concurrency |
| Column statistics | Per-file only | Per-file only | ✅ Manifest-level |
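The schema- and partition-evolution rows above correspond to plain DDL statements. A sketch in Spark SQL with the Iceberg extensions enabled (the table name `demo.db.events` and its columns are placeholders for illustration):

```sql
-- Schema evolution: metadata-only operations, no data files are rewritten
ALTER TABLE demo.db.events ADD COLUMN device_type STRING;
ALTER TABLE demo.db.events RENAME COLUMN device_type TO device;
ALTER TABLE demo.db.events DROP COLUMN device;

-- Partition evolution: change the partition spec in place; existing files
-- keep their old layout, and new writes follow the new spec
ALTER TABLE demo.db.events ADD PARTITION FIELD days(event_ts);
```

Because both kinds of evolution only touch metadata, they complete in seconds even on petabyte-scale tables.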

How Iceberg Works

Iceberg stores data in Parquet files on object storage. A hierarchical metadata structure tracks which files belong to which table:

metadata.json → manifest list → manifest files → data files (Parquet)

Each write creates a new snapshot. Readers see a consistent view at any snapshot. Writers use optimistic concurrency — if two writers conflict, one retries.
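Snapshots are also what make time travel queryable. A sketch in Spark SQL (the table name, timestamp, and snapshot ID are placeholders):

```sql
-- Read the table as of a past point in time
SELECT * FROM demo.db.events TIMESTAMP AS OF '2026-01-01 00:00:00';

-- Or pin a specific snapshot by its ID
SELECT * FROM demo.db.events VERSION AS OF 1234567890123456789;
```

Either form resolves to one entry in the snapshot log inside `metadata.json`, so the query reads exactly the data files that snapshot referenced.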

Streaming Data into Iceberg

Stream processors (Flink, RisingWave) can write continuously to Iceberg tables:

-- RisingWave: Stream Kafka data to Iceberg
CREATE SINK events_to_iceberg AS SELECT * FROM events_stream
WITH (connector = 'iceberg', type = 'append-only', catalog.type = 'rest', ...);

RisingWave handles automatic compaction, solving the small files problem that plagues streaming ingestion.

Frequently Asked Questions

Is Apache Iceberg a database?

No. Iceberg is a table format — a specification for organizing data files and metadata on object storage. You still need a query engine (Trino, Spark, DuckDB) to read data and a stream processor (Flink, RisingWave) or batch tool (Spark) to write data.
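To make the "table format, not a database" distinction concrete, here is how a standalone engine can read an Iceberg table straight from object storage. A sketch using DuckDB's iceberg extension (the S3 path is a placeholder):

```sql
INSTALL iceberg;
LOAD iceberg;

-- Point the engine at the table location; DuckDB plans the scan from
-- Iceberg's own metadata and manifests -- no database server involved
SELECT count(*) FROM iceberg_scan('s3://my-bucket/warehouse/db/events');
```

Any number of engines can read the same table this way at the same time, because all coordination lives in the table's metadata rather than in a server process.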

Why is Iceberg winning over Delta Lake and Hudi?

Iceberg has the broadest multi-engine support (Spark, Flink, Trino, Snowflake, BigQuery, DuckDB), vendor-neutral governance (Apache Foundation), and unique features like partition evolution. Delta Lake is stronger in Databricks environments; Hudi is stronger for streaming CDC ingestion.

Does RisingWave support Apache Iceberg?

Yes. RisingWave has native Iceberg sink support with 5 catalog types (REST, Hive, JDBC, Storage, S3 Tables), automatic compaction, and both Merge-on-Read and Copy-on-Write modes.
