What Is Apache Iceberg? A Complete Guide (2026)

Apache Iceberg is an open table format for huge analytical datasets on object storage (S3, GCS, ADLS). It provides ACID transactions, schema evolution, partition evolution, and time travel on top of Parquet files — capabilities that raw Parquet or Hive tables lack. Created at Netflix, Iceberg is now the industry standard supported by Spark, Flink, Trino, Snowflake, BigQuery, and DuckDB.

Why Iceberg Matters

Traditional data lake files (Parquet on S3) have no transactions, no schema management, and no way to handle concurrent reads/writes safely. Iceberg adds a metadata layer that provides database-like guarantees:

| Capability | Raw Parquet | Hive Table | Apache Iceberg |
| --- | --- | --- | --- |
| ACID transactions | ❌ | ❌ | ✅ |
| Schema evolution | Manual | Limited | ✅ Full (add/drop/rename) |
| Partition evolution | Requires rewrite | Requires rewrite | ✅ In-place |
| Time travel | ❌ | ❌ | ✅ Snapshot-based |
| Concurrent writes | Unsafe | Limited | ✅ Optimistic concurrency |
| Column statistics | Per-file only | Per-file only | ✅ Manifest-level |
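The schema- and partition-evolution rows above correspond to plain DDL statements. A sketch in Spark SQL with the Iceberg extensions enabled (the table name `demo.db.events` and its columns are placeholders for illustration):

```sql
-- Schema evolution: metadata-only operations, no data files are rewritten
ALTER TABLE demo.db.events ADD COLUMN device_type STRING;
ALTER TABLE demo.db.events RENAME COLUMN device_type TO device;
ALTER TABLE demo.db.events DROP COLUMN device;

-- Partition evolution: change the partition spec in place; existing files
-- keep their old layout, and new writes follow the new spec
ALTER TABLE demo.db.events ADD PARTITION FIELD days(event_ts);
```

Because both kinds of evolution only touch metadata, they complete in seconds even on petabyte-scale tables.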

How Iceberg Works

Iceberg stores data in Parquet files on object storage. A hierarchical metadata structure tracks which files belong to which table:

metadata.json → manifest list → manifest files → data files (Parquet)

Each write creates a new snapshot. Readers see a consistent view at any snapshot. Writers use optimistic concurrency — if two writers conflict, one retries.
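Snapshots are also what make time travel queryable. A sketch in Spark SQL (the table name, timestamp, and snapshot ID are placeholders):

```sql
-- Read the table as of a past point in time
SELECT * FROM demo.db.events TIMESTAMP AS OF '2026-01-01 00:00:00';

-- Or pin a specific snapshot by its ID
SELECT * FROM demo.db.events VERSION AS OF 1234567890123456789;
```

Either form resolves to one entry in the snapshot log inside `metadata.json`, so the query reads exactly the data files that snapshot referenced.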

Streaming Data into Iceberg

Stream processors (Flink, RisingWave) can write continuously to Iceberg tables:

-- RisingWave: Stream Kafka data to Iceberg
CREATE SINK events_to_iceberg AS SELECT * FROM events_stream
WITH (connector = 'iceberg', type = 'append-only', catalog.type = 'rest', ...);

RisingWave handles automatic compaction, solving the small files problem that plagues streaming ingestion.

Frequently Asked Questions

Is Apache Iceberg a database?

No. Iceberg is a table format — a specification for organizing data files and metadata on object storage. You still need a query engine (Trino, Spark, DuckDB) to read data and a stream processor (Flink, RisingWave) or batch tool (Spark) to write data.
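To make the "table format, not a database" distinction concrete, here is how a standalone engine can read an Iceberg table straight from object storage. A sketch using DuckDB's iceberg extension (the S3 path is a placeholder):

```sql
INSTALL iceberg;
LOAD iceberg;

-- Point the engine at the table location; DuckDB plans the scan from
-- Iceberg's own metadata and manifests -- no database server involved
SELECT count(*) FROM iceberg_scan('s3://my-bucket/warehouse/db/events');
```

Any number of engines can read the same table this way at the same time, because all coordination lives in the table's metadata rather than in a server process.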

Why is Iceberg winning over Delta Lake and Hudi?

Iceberg has the broadest multi-engine support (Spark, Flink, Trino, Snowflake, BigQuery, DuckDB), vendor-neutral governance (Apache Foundation), and unique features like partition evolution. Delta Lake is stronger in Databricks environments; Hudi is stronger for streaming CDC ingestion.

Does RisingWave support Apache Iceberg?

Yes. RisingWave has native Iceberg sink support with 5 catalog types (REST, Hive, JDBC, Storage, S3 Tables), automatic compaction, and both Merge-on-Read and Copy-on-Write modes.
