Apache Iceberg, Delta Lake, and Apache Hudi are all open table formats that bring ACID transactions and reliable analytics to data lakes. Iceberg offers the broadest multi-engine support and most flexible metadata design. Delta Lake has the deepest Spark integration. Hudi excels at record-level upserts and incremental processing. Your choice depends on your engine ecosystem and workload patterns.
Why Open Table Formats Matter
Before open table formats, data lakes were collections of raw files — Parquet, ORC, CSV — with no guarantees about consistency, no support for row-level updates, and no way to safely run concurrent writers. This led to the "data swamp" problem: data that was technically stored but practically unreliable.
Open table formats solve this by adding a metadata layer on top of object storage files. This metadata layer enables ACID transactions, schema enforcement, time travel, and efficient predicate pushdown — turning a file system into something that behaves like a database.
Three formats have emerged as the dominant options: Apache Iceberg (originally from Netflix), Delta Lake (originally from Databricks), and Apache Hudi (originally from Uber). All three are open-source and production-ready, but they make different architectural choices that affect their strengths and limitations.
Architecture Comparison
Apache Iceberg
Iceberg uses a hierarchical metadata structure:
- Catalog → points to current metadata file
- Metadata file (JSON) → schema, partition spec, snapshot list
- Manifest list → list of manifest files for each snapshot
- Manifest files → list of data files with statistics
- Data files (Parquet/ORC/Avro)
This architecture enables O(1) snapshot creation (only metadata is updated, not all files), fine-grained partition pruning without listing directories, and complete decoupling of the catalog from storage.
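The hierarchy can be illustrated with a toy model. All class names here are illustrative, not the actual Iceberg library; the point is that a commit touches only metadata objects, never existing data files:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataFile:
    path: str
    record_count: int

@dataclass
class Manifest:              # lists data files plus per-file statistics
    data_files: List[DataFile]

@dataclass
class Snapshot:              # a manifest list: one entry per manifest
    manifests: List[Manifest]

@dataclass
class TableMetadata:         # the JSON metadata file the catalog points to
    schema: dict
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, new_files: List[DataFile]) -> Snapshot:
        # O(1) in data volume: reuse prior manifests, append one new one.
        prev = self.snapshots[-1].manifests if self.snapshots else []
        snap = Snapshot(manifests=prev + [Manifest(new_files)])
        self.snapshots.append(snap)
        return snap

meta = TableMetadata(schema={"id": "long"})
meta.commit([DataFile("s3://lake/a.parquet", 100)])
meta.commit([DataFile("s3://lake/b.parquet", 50)])
# Two snapshots now exist; no data file was rewritten between commits.
total = sum(f.record_count
            for m in meta.snapshots[-1].manifests for f in m.data_files)
print(len(meta.snapshots), total)
```

Each snapshot shares the earlier manifests by reference, which is why time travel to any snapshot is cheap.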
Delta Lake
Delta Lake uses a transaction log stored in the _delta_log directory:
- Delta log (JSON/Parquet checkpoint files) → ordered list of transactions
- Data files (Parquet only)
The log is append-only. Each transaction appends a new JSON entry listing added and removed files. Periodically, checkpoints compact the log into Parquet files for faster reads. This design is simpler but tightly couples the catalog to the storage path.
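The replay logic can be sketched in a few lines. The JSON shape below is simplified, not the real Delta protocol schema, but the mechanism is the same: the live file set is whatever survives an in-order replay of adds and removes:

```python
import json

# A toy _delta_log: each committed transaction is one JSON entry
# listing the files it added or removed (illustrative schema).
log = [
    json.dumps({"add": ["part-000.parquet"]}),
    json.dumps({"add": ["part-001.parquet"]}),
    json.dumps({"add": ["part-002.parquet"], "remove": ["part-000.parquet"]}),
]

def replay(entries):
    """Reconstruct the live file set by replaying the log in order."""
    live = set()
    for raw in entries:
        txn = json.loads(raw)
        live |= set(txn.get("add", []))
        live -= set(txn.get("remove", []))
    return live

# This replayed state is exactly what a checkpoint file materializes
# so that readers don't have to scan the whole JSON log.
print(sorted(replay(log)))
```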
Apache Hudi
Hudi stores metadata in a hidden .hoodie directory alongside data files:
- Timeline → ordered log of commits, compactions, and clean operations
- File groups → base files + delta log files (for MOR tables)
- Metadata table → optional auxiliary table with bloom filters and column stats
Hudi natively supports two table types: Copy-on-Write (COW) for read-optimized workloads and Merge-on-Read (MOR) for write-optimized workloads. This explicit duality is unique to Hudi.
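The MOR read path can be sketched as a key-wise merge of a columnar base file with row-oriented delta appends, newest write winning. This is a simplified model, not Hudi's actual file-group implementation:

```python
# Merge-on-read sketch: base rows merged with delta-log upserts at
# query time (illustrative only).
base = {1: {"price": 10}, 2: {"price": 20}}           # base file contents
delta_log = [(2, {"price": 25}), (3, {"price": 30})]  # appended upserts

def read_mor(base_rows, deltas):
    """Produce the query-time view: replay appends in commit order."""
    merged = dict(base_rows)
    for key, row in deltas:
        merged[key] = row  # upsert: later writes shadow earlier ones
    return merged

snapshot = read_mor(base, delta_log)
print(snapshot[2]["price"], len(snapshot))
```

Compaction is this same merge run offline: it folds the delta log into a new base file so subsequent reads pay no merge cost, which is the COW/MOR trade-off in miniature.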
Feature Comparison
| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| ACID Transactions | Yes | Yes | Yes |
| Schema Evolution | Full (add/drop/rename/reorder) | Full (add/rename; drop/reorder via column mapping) | Full |
| Partition Evolution | Yes (no rewrite) | No | Partial |
| Hidden Partitioning | Yes | No | No |
| Time Travel | Yes | Yes | Yes |
| Row-Level Deletes | Yes (COW + MOR) | Yes (COW + deletion vectors) | Yes (COW + MOR) |
| Multi-Engine Support | Excellent | Good | Good |
| Incremental Reads | Yes (snapshot diff) | Yes (change data feed) | Native (incremental queries) |
| Branching / Versioning | Yes (native branches and tags; also via Nessie) | No | Limited |
| File Formats | Parquet, ORC, Avro | Parquet only | Parquet, ORC |
| Open Specification | Fully open | Open (some tooling BSL-licensed) | Fully open |
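Hidden partitioning deserves a concrete illustration: Iceberg derives partition values by applying a declared transform (such as `days(updated_at)`) to a regular column, so writers and readers never reference the partition column directly. The sketch below models the idea; it is not PyIceberg's API:

```python
from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    """Iceberg-style days transform: whole days since the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

rows = [
    {"id": 1, "updated_at": datetime(2024, 5, 1, 9, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 5, 1, 23, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 5, 2, 0, tzinfo=timezone.utc)},
]

# Writers bucket rows by the derived value; queries filter on
# updated_at itself and pruning happens automatically.
partitions = {}
for row in rows:
    partitions.setdefault(days_transform(row["updated_at"]), []).append(row["id"])

print({k: v for k, v in sorted(partitions.items())})
```

Because the transform lives in table metadata rather than in the data, switching from daily to hourly partitioning later (partition evolution) changes only the spec, not the existing files.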
Engine Support Comparison
| Query Engine | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| Apache Spark | Excellent | Excellent (native) | Excellent |
| Apache Flink | Excellent | Good | Good |
| Trino / Presto | Excellent | Good | Good |
| Amazon Athena | Excellent | Good | Limited |
| Snowflake | Excellent (native Iceberg tables) | Limited (external tables) | Limited |
| Google BigQuery | Excellent | Good | No |
| DuckDB | Excellent | Good | Limited |
| RisingWave | Excellent (v2.8+) | No | No |
| StarRocks | Excellent | Good | Good |
Iceberg has the broadest query engine support because it is a fully open specification — any engine can implement it independently. Delta Lake's tooling is partially under the Business Source License, which has slowed some integrations. Hudi has strong Spark support but fewer integrations with non-Spark engines.
Performance Characteristics
Write Performance
Iceberg COW: Rewrites affected data files on every update. Higher write amplification, but produces clean, unmodified files that are fast to read.
Delta Lake: Uses COW by default. Deletion vectors (a merge-on-read-style feature) reduce rewrite cost for updates and deletes, and Liquid Clustering improves data layout for upsert-heavy tables.
Hudi MOR: Appends delta log files alongside base files. Very fast writes with low amplification, but reads require merging base + delta at query time.
For high-frequency streaming writes (like Kafka-to-Iceberg pipelines), Iceberg in merge-on-read mode (delete files / deletion vectors) or Hudi MOR provides the best write throughput.
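The amplification difference is easy to quantify with back-of-envelope numbers (the file and row sizes below are made up for illustration):

```python
# Updating a single 1 KB row that lives in a 100 MB Parquet file.
file_size_mb = 100.0
row_size_mb = 0.001

# COW must rewrite the whole affected data file to produce a clean copy.
cow_bytes_written_mb = file_size_mb

# MOR appends only the changed row to a delta log / delete file.
mor_bytes_written_mb = row_size_mb

amplification = cow_bytes_written_mb / mor_bytes_written_mb
print(f"COW writes {cow_bytes_written_mb} MB, MOR writes "
      f"{mor_bytes_written_mb} MB ({amplification:.0f}x difference)")
```

The flip side, as the read-performance section below notes, is that MOR defers this cost to query time until compaction runs.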
Read Performance
Iceberg: Efficient partition and column pruning through manifest statistics. Predicate pushdown is deeply integrated into the manifest scan.
Delta Lake: Good partition pruning. Data skipping using column statistics in the transaction log.
Hudi MOR: Read-time merge can be slower unless a compaction has recently run. COW tables read as fast as raw Parquet.
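File-level data skipping, as done by Iceberg manifests and Delta checkpoints, boils down to comparing each file's min/max column statistics against the query predicate. A minimal model of that pruning step:

```python
# Each entry carries per-file min/max stats, as stored in Iceberg
# manifest files or Delta checkpoint stats (illustrative model).
files = [
    {"path": "a.parquet", "min_id": 1,   "max_id": 100},
    {"path": "b.parquet", "min_id": 101, "max_id": 200},
    {"path": "c.parquet", "min_id": 201, "max_id": 300},
]

def prune(entries, lo, hi):
    """Keep only files whose [min, max] range overlaps predicate [lo, hi];
    every other file provably contains no matching rows."""
    return [f["path"] for f in entries
            if f["max_id"] >= lo and f["min_id"] <= hi]

# For WHERE id BETWEEN 150 AND 160, only one file needs to be opened.
print(prune(files, 150, 160))
```

This is why statistics quality (and clustering of values within files) matters as much as the format itself for scan performance.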
Ecosystem and Community
| Dimension | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| Governance | Apache Software Foundation | Linux Foundation | Apache Software Foundation |
| Original Creator | Netflix | Databricks | Uber |
| Primary Backer | Broad (Apple, Netflix, AWS, etc.) | Databricks | Onehouse |
| GitHub Stars (approx.) | 7,000+ | 7,000+ | 5,000+ |
| Open Catalog Integrations | Broad (Nessie, Polaris, REST catalogs) | Unity Catalog | Hive Metastore-based |
When to Choose Each Format
Choose Apache Iceberg when:
- You need broad multi-engine support (Athena, Trino, Snowflake, RisingWave, Flink)
- You require partition evolution without data rewriting
- You want a fully open specification with no vendor lock-in
- You're building a cloud-native lakehouse with a REST catalog
Choose Delta Lake when:
- Your primary engine is Apache Spark or Databricks
- You want deep integration with the Databricks platform
- Your team has strong Spark expertise
Choose Apache Hudi when:
- Your workload is predominantly record-level upserts at high frequency
- You need native incremental pull APIs for downstream consumers
- You're running on Hadoop/HDFS or need the Hudi-specific clustering features
Using Apache Iceberg with RisingWave
RisingWave has native Iceberg integration for both writing and reading:
```sql
-- 1. Ingest the Kafka topic as a streaming source.
CREATE SOURCE product_events (
    product_id BIGINT,
    event_type VARCHAR,
    price NUMERIC(10, 2),
    inventory_count INT,
    updated_at TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'product-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

-- 2. Maintain per-product state incrementally.
CREATE MATERIALIZED VIEW product_current_state AS
SELECT
    product_id,
    MAX(updated_at) AS last_updated,
    SUM(CASE WHEN event_type = 'sale' THEN 1 ELSE 0 END) AS sales_count,
    MAX(price) AS current_price,
    MAX(inventory_count) AS latest_inventory
FROM product_events
GROUP BY product_id;

-- 3. Continuously upsert that state into an Iceberg table.
CREATE SINK product_state_to_iceberg AS
SELECT * FROM product_current_state
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'product_id',  -- required for upsert sinks
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://my-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'commerce',
    table.name = 'product_state'
);
```
RisingWave does not currently have native Delta Lake or Hudi connectors, making Iceberg the clear choice for RisingWave-based pipelines.
FAQ
Q: Can I migrate from Delta Lake to Iceberg?
Yes. Apache Iceberg ships a Delta Lake migration module, and Delta Lake's UniForm feature can write Iceberg-compatible metadata alongside Delta metadata. Either way, only the metadata layer is rewritten; the underlying Parquet data files are preserved.
Q: Which format does AWS prefer?
AWS has invested heavily in Apache Iceberg. AWS Glue, Athena, EMR, and Lake Formation all have native Iceberg support, and Amazon S3 Tables is built on Iceberg. AWS also participates in the Iceberg REST catalog ecosystem.
Q: Is Delta Lake truly open-source?
The core Delta Lake file format specification is open, but some of Databricks' proprietary Delta Lake tooling (like Delta Sharing server) uses the Business Source License, which restricts commercial use. The open-source delta-io project is under the Apache 2.0 license.
Q: Does format choice affect storage costs?
All three formats store data as Parquet (or ORC), so raw storage costs are similar. Operational costs can differ: MOR tables (Hudi MOR, Iceberg with deletion vectors) require less write I/O but more read I/O. Compaction frequency affects both performance and storage costs.
Q: Which format will "win" long-term?
Iceberg's adoption trajectory is the strongest across cloud providers and query engines. However, all three formats are mature and production-ready. If you're starting fresh, Iceberg is the safest choice for broad compatibility.
Ready to Build with Iceberg?
Apache Iceberg's combination of open design, broad engine support, and powerful features makes it the leading choice for new data lakehouse projects — especially when combined with RisingWave for real-time streaming.
Start exploring with the RisingWave documentation, or ask your questions in the RisingWave Slack community.