Apache Iceberg: A Complete Guide for Data Engineers

Apache Iceberg is an open table format for large analytic datasets. It brings ACID transactions, schema evolution, time travel, and partition evolution to data lakes stored on object storage like S3, GCS, or ADLS — enabling reliable, high-performance analytics without vendor lock-in.

What Is Apache Iceberg?

Apache Iceberg was originally developed at Netflix to solve reliability and performance problems at petabyte scale. Donated to the Apache Software Foundation in 2018, it has since become a de facto standard for open table formats, adopted by companies such as Apple, LinkedIn, and Adobe, along with many others.

Unlike traditional Hive-style partitioning that relies on directory structures, Iceberg tracks data at the file level through a tree of metadata. This design solves long-standing problems in data lake architectures: unsafe concurrent writes, slow partition discovery, and the inability to change table schemas without full rewrites.

Core Architecture: How Iceberg Stores Data

Understanding Iceberg requires understanding its three-layer metadata architecture:

Catalog Layer — The entry point. A catalog (REST, Hive Metastore, AWS Glue, Nessie) stores pointers to the current metadata file for each table.

Metadata Layer — A chain of JSON metadata files that track the current schema, partition spec, sort order, and a list of snapshots.

Data Layer — Parquet, ORC, or Avro files organized into manifests. Each snapshot points to a manifest list, which references manifest files, which reference the actual data files.

This design enables several powerful capabilities:

  • Snapshot isolation — Each write creates a new snapshot. Readers always see a consistent view.
  • Incremental reads — Consumers can read only files added since a known snapshot.
  • Hidden partitioning — Partition transforms (bucket, truncate, year, month, day, hour) are tracked in metadata, not in the file path.
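
These layers are easiest to see in miniature. The sketch below is plain Python, with made-up file names and dataclasses standing in for Iceberg's real JSON and Avro metadata files. It walks the same tree a reader walks — catalog, metadata file, current snapshot, manifest list, manifests, data files — and shows how an incremental read falls out of comparing two snapshots:

```python
from dataclasses import dataclass

@dataclass
class Manifest:
    data_files: list          # paths of Parquet/ORC/Avro files

@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list       # the manifests this snapshot references

@dataclass
class TableMetadata:
    snapshots: list
    current_snapshot_id: int

catalog = {}                  # catalog layer: table name -> current metadata

def current_files(table_name):
    """Walk catalog -> metadata -> current snapshot -> manifests -> data files."""
    meta = catalog[table_name]
    snap = next(s for s in meta.snapshots
                if s.snapshot_id == meta.current_snapshot_id)
    return [f for m in snap.manifest_list for f in m.data_files]

def incremental_files(table_name, since_snapshot_id):
    """An incremental read: files in the current snapshot but not in an older one."""
    meta = catalog[table_name]
    old = next(s for s in meta.snapshots if s.snapshot_id == since_snapshot_id)
    seen = {f for m in old.manifest_list for f in m.data_files}
    return [f for f in current_files(table_name) if f not in seen]

# Two snapshots: the second appends one new data file.
s1 = Snapshot(1, [Manifest(["a.parquet"])])
s2 = Snapshot(2, [Manifest(["a.parquet"]), Manifest(["b.parquet"])])
catalog["db.events"] = TableMetadata(snapshots=[s1, s2], current_snapshot_id=2)

print(current_files("db.events"))         # ['a.parquet', 'b.parquet']
print(incremental_files("db.events", 1))  # ['b.parquet'] -- only the new file
```

Because every snapshot is an immutable tree, a reader pinned to snapshot 1 keeps seeing a consistent view even while snapshot 2 is being committed.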

Key Features Every Data Engineer Should Know

ACID Transactions

Iceberg provides serializable isolation for writes and snapshot isolation for reads. Multiple writers can operate on the same table concurrently without corrupting data. Optimistic concurrency control detects conflicts and retries as needed.
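
A minimal sketch of how optimistic concurrency plays out, with a locked compare-and-swap standing in for the catalog's atomic pointer update. The class and function names here are illustrative, not Iceberg APIs:

```python
import threading

class CatalogPointer:
    """The catalog's single mutable cell: the current snapshot ID."""
    def __init__(self):
        self.current = 0
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        """A commit succeeds only if no other writer committed in between."""
        with self._lock:
            if self.current != expected:
                return False          # conflict: someone else committed first
            self.current = new
            return True

def optimistic_write(pointer):
    while True:
        base = pointer.current                    # 1. read the current snapshot
        new_snapshot = base + 1                   # 2. stage new metadata on top
        if pointer.compare_and_swap(base, new_snapshot):
            return                                # 3. swap succeeded: committed
        # 4. conflict detected: re-read the base and retry

pointer = CatalogPointer()
writers = [threading.Thread(target=optimistic_write, args=(pointer,))
           for _ in range(8)]
for w in writers: w.start()
for w in writers: w.join()
print(pointer.current)  # 8 -- every writer's commit lands despite the races
```

No writer ever blocks another for long: losers simply rebase on the new snapshot and retry, which is why concurrent writers cannot corrupt the table.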

Schema Evolution

Add, drop, rename, or reorder columns without rewriting data. Iceberg tracks column IDs internally, so existing files remain readable even after schema changes. This is a fundamental improvement over Hive tables, where schema changes often required rewriting the entire dataset.
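
The column-ID mechanism can be sketched in a few lines. The names below are hypothetical; real Iceberg stores the name-to-ID mapping in its schema metadata:

```python
# Files store values keyed by immutable column IDs; the schema maps names to IDs.
schema_v1 = {"user_id": 1, "email": 2}

# A row written under schema v1 -- keyed by column ID, not by name or position.
row_on_disk = {1: 42, 2: "a@example.com"}

# Rename email -> contact_email and add a new column: only metadata changes.
# The new column gets a fresh ID (3) that no existing file uses.
schema_v2 = {"user_id": 1, "contact_email": 2, "country": 3}

def project(row, schema):
    """Read a row under any schema version; IDs missing from the file read as null."""
    return {name: row.get(col_id) for name, col_id in schema.items()}

print(project(row_on_disk, schema_v2))
# {'user_id': 42, 'contact_email': 'a@example.com', 'country': None}
```

The file on disk never changes: the old row is still fully readable under the new schema, and the added column simply surfaces as null for data written before it existed.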

Time Travel and Rollback

Query historical snapshots using AS OF syntax. Roll back a table to any previous snapshot in case of bad writes or accidental deletes.

Partition Evolution

Change the partitioning strategy of a table without rewriting existing data. Old files remain queryable under the old partition spec, while new files use the updated spec — completely transparent to query engines.
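
One way to picture this: each data file remembers the partition spec it was written under, and the planner applies each file's own transform when pruning. A toy sketch with illustrative transforms and file names:

```python
from datetime import date

# Two generations of partition spec for the same table.
spec_month = lambda d: (d.year, d.month)          # original spec
spec_day   = lambda d: (d.year, d.month, d.day)   # evolved spec

# Each file carries the transform it was written under.
files = [
    {"path": "old.parquet", "transform": spec_month, "partition": (2024, 1)},
    {"path": "new.parquet", "transform": spec_day,   "partition": (2024, 2, 15)},
]

def prune(files, query_date):
    """Plan a scan for one date, pruning each file under its own spec."""
    return [f["path"] for f in files
            if f["transform"](query_date) == f["partition"]]

print(prune(files, date(2024, 1, 10)))   # ['old.parquet'] -- matched via old spec
print(prune(files, date(2024, 2, 15)))   # ['new.parquet'] -- matched via new spec
```

Old and new files coexist in one table, each pruned correctly, with no rewrite.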

Row-Level Deletes and Updates

Iceberg supports both copy-on-write (COW) and merge-on-read (MOR) strategies for row-level operations, making it suitable for CDC workloads where individual rows must be updated or deleted.
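
The trade-off between the two strategies can be sketched as a toy row-level delete (illustrative names only, not Iceberg code):

```python
data_file = [{"id": 1}, {"id": 2}, {"id": 3}]

def cow_delete(data_file, key):
    """Copy-on-write: rewrite the whole file without the deleted row."""
    return [row for row in data_file if row["id"] != key]

def mor_delete(delete_file, key):
    """Merge-on-read: append a tombstone; the data file is left untouched."""
    delete_file.append(key)

def mor_read(data_file, delete_file):
    """MOR readers merge tombstones with data at query time."""
    return [row for row in data_file if row["id"] not in delete_file]

clean = cow_delete(data_file, 2)    # high write cost, cheap reads
tombstones = []
mor_delete(tombstones, 2)           # cheap write, extra merge work per read
print(clean)                        # [{'id': 1}, {'id': 3}]
print(mor_read(data_file, tombstones) == clean)  # True: same logical result
```

COW pays the cost at write time, MOR at read time — which is why streaming CDC pipelines, with their constant small writes, usually favor MOR plus periodic compaction.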

Comparison: Iceberg vs. Hive Tables vs. Delta Lake

| Feature | Apache Iceberg | Hive Tables | Delta Lake |
| --- | --- | --- | --- |
| ACID Transactions | Yes (serializable) | Limited | Yes (serializable) |
| Schema Evolution | Full (add/drop/rename/reorder) | Limited (add/rename) | Full |
| Partition Evolution | Yes (without rewrite) | No | Partial |
| Time Travel | Yes | No | Yes |
| Hidden Partitioning | Yes | No | No |
| Multi-Engine Support | Excellent (open spec) | Yes | Good (primarily Spark) |
| Incremental Reads | Yes (snapshot diff) | No | Yes (via change feed) |
| Catalog Options | REST, Hive, Glue, Nessie, JDBC | Hive Metastore | Delta catalog |

Integrating Iceberg with RisingWave

RisingWave is a cloud-native streaming SQL database that natively integrates with Apache Iceberg. You can stream data into Iceberg tables from Kafka, Postgres CDC, MySQL CDC, and other sources — and in RisingWave v2.8+, you can also query Iceberg tables directly using SQL.

Streaming Data into Iceberg

First, create a source to ingest events:

CREATE SOURCE user_events (
    user_id BIGINT,
    event_type VARCHAR,
    payload JSONB,
    event_time TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'user-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

Then build a materialized view to aggregate or transform the data:

CREATE MATERIALIZED VIEW hourly_event_counts AS
SELECT
    user_id,
    event_type,
    window_start,
    window_end,
    COUNT(*) AS event_count
FROM TUMBLE(user_events, event_time, INTERVAL '1 HOUR')
GROUP BY user_id, event_type, window_start, window_end;

Finally, sink the results into an Iceberg table:

CREATE SINK events_to_iceberg AS
SELECT * FROM hourly_event_counts
WITH (
    connector = 'iceberg',
    type = 'upsert',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://my-data-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'hourly_event_counts'
);

Querying Iceberg Tables Directly (v2.8+)

In RisingWave v2.8 and later, you can register Iceberg tables as sources and query them with standard SQL, enabling lakehouse-style analytics alongside streaming workloads:

-- Register an existing Iceberg table as a source
CREATE SOURCE iceberg_orders
WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://my-data-lake/warehouse',
    database.name = 'commerce',
    table.name = 'orders'
);

-- Join streaming data with historical Iceberg data
SELECT
    s.user_id,
    s.event_count AS realtime_events,
    h.total_orders AS historical_orders
FROM hourly_event_counts s
JOIN iceberg_orders h ON s.user_id = h.user_id;

Iceberg Catalogs: Which One Should You Use?

| Catalog | Best For | Notes |
| --- | --- | --- |
| REST Catalog | Cloud-native deployments | Vendor-neutral; works with Tabular, Nessie, Polaris |
| AWS Glue | AWS deployments | Fully managed; integrates with Athena and EMR |
| Hive Metastore | Legacy Hadoop environments | Widely supported but requires HMS infrastructure |
| Nessie | Multi-table transactions, Git-like branching | Open source; excellent for data versioning |
| JDBC | Simple setups | Good for testing; not recommended for production |

Best Practices for Data Engineers

Compaction — Iceberg tables accumulate small files over time, especially with streaming writes. Run periodic compaction jobs (using Spark or Flink) to merge small files into larger ones, improving query performance.

Snapshot expiration — Old snapshots retain references to deleted data files. Run expireSnapshots on a schedule to reclaim storage.
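
The bookkeeping behind expiration is simple to sketch: a data file may be deleted from storage only when no surviving snapshot references it. Illustrative Python, not the Iceberg API:

```python
def expire_snapshots(snapshots, keep_ids):
    """Drop old snapshots, then find files no surviving snapshot references."""
    live = [s for s in snapshots if s["id"] in keep_ids]
    expired = [s for s in snapshots if s["id"] not in keep_ids]
    still_referenced = {f for s in live for f in s["files"]}
    orphaned = {f for s in expired for f in s["files"]} - still_referenced
    return live, sorted(orphaned)   # orphaned files are safe to delete

snapshots = [
    {"id": 1, "files": ["a.parquet", "b.parquet"]},
    {"id": 2, "files": ["a.parquet", "c.parquet"]},  # b was rewritten into c
]
live, orphaned = expire_snapshots(snapshots, keep_ids={2})
print(orphaned)  # ['b.parquet'] -- reclaimable; 'a.parquet' is still live
```

Note the consequence: expiring snapshots also shortens your time-travel window, so pick a retention period that balances storage cost against how far back you need to query.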

Partitioning strategy — Choose partition transforms that match your query patterns. For time-series data, partitioning by day or hour is common. Avoid high-cardinality partitions.
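
For intuition, here are hypothetical stand-ins for the common transforms. Note that real Iceberg buckets with a 32-bit Murmur3 hash; the md5 below is purely illustrative:

```python
import hashlib
from datetime import datetime

def bucket(n, value):
    """Hash a value into one of n buckets (real Iceberg uses Murmur3, not md5)."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % n

def truncate(width, value):
    """Truncate a string to a fixed prefix width."""
    return value[:width]

def day(ts):
    """Project a timestamp down to its calendar day."""
    return ts.date()

# The user writes raw values; the engine derives partition values via the
# transform, so queries never reference the physical layout (hidden partitioning).
print(bucket(16, 123456) == bucket(16, 123456))  # True: deterministic placement
print(truncate(3, "user-9f3a"))                  # 'use'
print(day(datetime(2024, 2, 15, 13, 45)))        # 2024-02-15
```

Because `bucket` is deterministic, the same key always lands in the same partition — useful for spreading high-cardinality keys evenly without creating one partition per key.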

Primary key design — For upsert workloads (such as CDC), define a clear primary key on your Iceberg table. This allows RisingWave's Iceberg sink to efficiently apply updates and deletes.

FAQ

Q: Can I use Apache Iceberg without Spark? Yes. Iceberg is a fully open specification. You can read and write Iceberg tables using Flink, Trino, Presto, DuckDB, RisingWave, StarRocks, Snowflake, and many other engines. Spark is not required.

Q: Does Iceberg support streaming writes? Yes. Iceberg supports incremental appends, upserts, and deletes through streaming systems like RisingWave and Apache Flink. Each micro-batch commit creates a new snapshot, enabling downstream consumers to read incremental changes.

Q: What's the difference between copy-on-write and merge-on-read? Copy-on-write (COW) rewrites entire data files when rows are updated or deleted, producing clean files but at higher write cost. Merge-on-read (MOR) appends delete files alongside data files, reducing write amplification but requiring more work at read time. For streaming CDC workloads, MOR is usually preferred.

Q: How does Iceberg handle late-arriving data? Iceberg itself is storage-agnostic about data ordering. You can append late data at any time. Query engines can filter by event time using predicate pushdown. For time-windowed aggregations, use RisingWave's TUMBLE() or HOP() windows to handle late arrivals within configurable bounds.

Q: Is Apache Iceberg free to use? Apache Iceberg is fully open-source under the Apache 2.0 license. There is no licensing cost. You pay only for the storage and compute you use.

Getting Started

Apache Iceberg combined with RisingWave gives you a complete, open-source streaming lakehouse: real-time ingestion, continuous SQL transformations, and durable storage in a vendor-neutral open format.

Ready to build your first Iceberg pipeline? Start with the RisingWave documentation for a hands-on quickstart, or join the community on Slack to ask questions and share what you're building.
