Apache Iceberg, Delta Lake, and Apache Hudi are all open table formats that bring ACID transactions and reliable analytics to data lakes. Iceberg offers the broadest multi-engine support and most flexible metadata design. Delta Lake has the deepest Spark integration. Hudi excels at record-level upserts and incremental processing. Your choice depends on your engine ecosystem and workload patterns.
Why Open Table Formats Matter
Before open table formats, data lakes were collections of raw files — Parquet, ORC, CSV — with no guarantees about consistency, no support for row-level updates, and no way to safely run concurrent writers. This led to the "data swamp" problem: data that was technically stored but practically unreliable.
Open table formats solve this by adding a metadata layer on top of object storage files. This metadata layer enables ACID transactions, schema enforcement, time travel, and efficient predicate pushdown — turning a file system into something that behaves like a database.
Three formats have emerged as the dominant options: Apache Iceberg (originally from Netflix), Delta Lake (originally from Databricks), and Apache Hudi (originally from Uber). All three are open-source and production-ready, but they make different architectural choices that affect their strengths and limitations.
Architecture Comparison
Apache Iceberg
Iceberg uses a hierarchical metadata structure:
- Catalog → points to current metadata file
- Metadata file (JSON) → schema, partition spec, snapshot list
- Manifest list → list of manifest files for each snapshot
- Manifest files → list of data files with statistics
- Data files (Parquet/ORC/Avro)
This architecture enables O(1) snapshot creation (only metadata is updated, not all files), fine-grained partition pruning without listing directories, and complete decoupling of the catalog from storage.
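The hierarchy can be illustrated with a toy model. All class names here are illustrative, not the actual Iceberg library; the point is that a commit touches only metadata objects, never existing data files:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataFile:
    path: str
    record_count: int

@dataclass
class Manifest:              # lists data files plus per-file statistics
    data_files: List[DataFile]

@dataclass
class Snapshot:              # a manifest list: one entry per manifest
    manifests: List[Manifest]

@dataclass
class TableMetadata:         # the JSON metadata file the catalog points to
    schema: dict
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, new_files: List[DataFile]) -> Snapshot:
        # O(1) in data volume: reuse prior manifests, append one new one.
        prev = self.snapshots[-1].manifests if self.snapshots else []
        snap = Snapshot(manifests=prev + [Manifest(new_files)])
        self.snapshots.append(snap)
        return snap

meta = TableMetadata(schema={"id": "long"})
meta.commit([DataFile("s3://lake/a.parquet", 100)])
meta.commit([DataFile("s3://lake/b.parquet", 50)])
# Two snapshots now exist; no data file was rewritten between commits.
total = sum(f.record_count
            for m in meta.snapshots[-1].manifests for f in m.data_files)
print(len(meta.snapshots), total)
```

Each snapshot shares the earlier manifests by reference, which is why time travel to any snapshot is cheap.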
Delta Lake
Delta Lake uses a transaction log stored in the _delta_log directory:
- Delta log (JSON/Parquet checkpoint files) → ordered list of transactions
- Data files (Parquet only)
The log is append-only. Each transaction appends a new JSON entry listing added and removed files. Periodically, checkpoints compact the log into Parquet files for faster reads. This design is simpler but tightly couples the catalog to the storage path.
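The replay logic can be sketched in a few lines. The JSON shape below is simplified, not the real Delta protocol schema, but the mechanism is the same: the live file set is whatever survives an in-order replay of adds and removes:

```python
import json

# A toy _delta_log: each committed transaction is one JSON entry
# listing the files it added or removed (illustrative schema).
log = [
    json.dumps({"add": ["part-000.parquet"]}),
    json.dumps({"add": ["part-001.parquet"]}),
    json.dumps({"add": ["part-002.parquet"], "remove": ["part-000.parquet"]}),
]

def replay(entries):
    """Reconstruct the live file set by replaying the log in order."""
    live = set()
    for raw in entries:
        txn = json.loads(raw)
        live |= set(txn.get("add", []))
        live -= set(txn.get("remove", []))
    return live

# This replayed state is exactly what a checkpoint file materializes
# so that readers don't have to scan the whole JSON log.
print(sorted(replay(log)))
```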
Apache Hudi
Hudi stores metadata in a hidden .hoodie directory alongside data files:
- Timeline → ordered log of commits, compactions, and clean operations
- File groups → base files + delta log files (for MOR tables)
- Metadata table → optional auxiliary table with bloom filters and column stats
Hudi natively supports two table types: Copy-on-Write (COW) for read-optimized workloads and Merge-on-Read (MOR) for write-optimized workloads. This explicit duality is unique to Hudi.
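The MOR read path can be sketched as a key-wise merge of a columnar base file with row-oriented delta appends, newest write winning. This is a simplified model, not Hudi's actual file-group implementation:

```python
# Merge-on-read sketch: base rows merged with delta-log upserts at
# query time (illustrative only).
base = {1: {"price": 10}, 2: {"price": 20}}           # base file contents
delta_log = [(2, {"price": 25}), (3, {"price": 30})]  # appended upserts

def read_mor(base_rows, deltas):
    """Produce the query-time view: replay appends in commit order."""
    merged = dict(base_rows)
    for key, row in deltas:
        merged[key] = row  # upsert: later writes shadow earlier ones
    return merged

snapshot = read_mor(base, delta_log)
print(snapshot[2]["price"], len(snapshot))
```

Compaction is this same merge run offline: it folds the delta log into a new base file so subsequent reads pay no merge cost, which is the COW/MOR trade-off in miniature.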
Feature Comparison
| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| ACID Transactions | Yes | Yes | Yes |
| Schema Evolution | Full (add/drop/rename/reorder) | Full (add/rename; drop/reorder via column mapping) | Full |
| Partition Evolution | Yes (no rewrite) | No | Partial |
| Hidden Partitioning | Yes | No | No |
| Time Travel | Yes | Yes | Yes |
| Row-Level Deletes | Yes (COW + MOR) | Yes (COW + deletion vectors) | Yes (COW + MOR) |
| Multi-Engine Support | Excellent | Good | Good |
| Incremental Reads | Yes (snapshot diff) | Yes (change data feed) | Native (incremental queries) |
| Branching / Versioning | Yes (native branches and tags; also via Nessie) | No | Limited |
| File Formats | Parquet, ORC, Avro | Parquet only | Parquet, ORC |
| Open Specification | Fully open | Open (some tooling BSL-licensed) | Fully open |
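Hidden partitioning deserves a concrete illustration: Iceberg derives partition values by applying a declared transform (such as `days(updated_at)`) to a regular column, so writers and readers never reference the partition column directly. The sketch below models the idea; it is not PyIceberg's API:

```python
from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    """Iceberg-style days transform: whole days since the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

rows = [
    {"id": 1, "updated_at": datetime(2024, 5, 1, 9, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 5, 1, 23, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 5, 2, 0, tzinfo=timezone.utc)},
]

# Writers bucket rows by the derived value; queries filter on
# updated_at itself and pruning happens automatically.
partitions = {}
for row in rows:
    partitions.setdefault(days_transform(row["updated_at"]), []).append(row["id"])

print({k: v for k, v in sorted(partitions.items())})
```

Because the transform lives in table metadata rather than in the data, switching from daily to hourly partitioning later (partition evolution) changes only the spec, not the existing files.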
Engine Support Comparison
| Query Engine | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| Apache Spark | Excellent | Excellent (native) | Excellent |
| Apache Flink | Excellent | Good | Good |
| Trino / Presto | Excellent | Good | Good |
| Amazon Athena | Excellent | Good | Limited |
| Snowflake | Excellent (native Iceberg tables) | Limited (external tables) | Limited |
| Google BigQuery | Excellent | Good | No |
| DuckDB | Excellent | Good | Limited |
| RisingWave | Excellent (v2.8+) | No | No |
| StarRocks | Excellent | Good | Good |
Iceberg has the broadest query engine support because it is a fully open specification — any engine can implement it independently. Delta Lake's tooling is partially under the Business Source License, which has slowed some integrations. Hudi has strong Spark support but fewer integrations with non-Spark engines.
Performance Characteristics
Write Performance
Iceberg COW: Rewrites affected data files on every update. Higher write amplification, but produces clean, unmodified files that are fast to read.
Delta Lake: Uses COW by default. Deletion vectors (a merge-on-read-style feature) reduce rewrite cost for updates and deletes, and Liquid Clustering improves data layout for upsert-heavy tables.
Hudi MOR: Appends delta log files alongside base files. Very fast writes with low amplification, but reads require merging base + delta at query time.
For high-frequency streaming writes (like Kafka-to-Iceberg pipelines), Iceberg in merge-on-read mode (delete files / deletion vectors) or Hudi MOR provides the best write throughput.
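The amplification difference is easy to quantify with back-of-envelope numbers (the file and row sizes below are made up for illustration):

```python
# Updating a single 1 KB row that lives in a 100 MB Parquet file.
file_size_mb = 100.0
row_size_mb = 0.001

# COW must rewrite the whole affected data file to produce a clean copy.
cow_bytes_written_mb = file_size_mb

# MOR appends only the changed row to a delta log / delete file.
mor_bytes_written_mb = row_size_mb

amplification = cow_bytes_written_mb / mor_bytes_written_mb
print(f"COW writes {cow_bytes_written_mb} MB, MOR writes "
      f"{mor_bytes_written_mb} MB ({amplification:.0f}x difference)")
```

The flip side, as the read-performance section below notes, is that MOR defers this cost to query time until compaction runs.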
Read Performance
Iceberg: Efficient partition and column pruning through manifest statistics. Predicate pushdown is deeply integrated into the manifest scan.
Delta Lake: Good partition pruning. Data skipping using column statistics in the transaction log.
Hudi MOR: Read-time merge can be slower unless a compaction has recently run. COW tables read as fast as raw Parquet.
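File-level data skipping, as done by Iceberg manifests and Delta checkpoints, boils down to comparing each file's min/max column statistics against the query predicate. A minimal model of that pruning step:

```python
# Each entry carries per-file min/max stats, as stored in Iceberg
# manifest files or Delta checkpoint stats (illustrative model).
files = [
    {"path": "a.parquet", "min_id": 1,   "max_id": 100},
    {"path": "b.parquet", "min_id": 101, "max_id": 200},
    {"path": "c.parquet", "min_id": 201, "max_id": 300},
]

def prune(entries, lo, hi):
    """Keep only files whose [min, max] range overlaps predicate [lo, hi];
    every other file provably contains no matching rows."""
    return [f["path"] for f in entries
            if f["max_id"] >= lo and f["min_id"] <= hi]

# For WHERE id BETWEEN 150 AND 160, only one file needs to be opened.
print(prune(files, 150, 160))
```

This is why statistics quality (and clustering of values within files) matters as much as the format itself for scan performance.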
Ecosystem and Community
| Dimension | Apache Iceberg | Delta Lake | Apache Hudi |
| --- | --- | --- | --- |
| Governance | Apache Software Foundation | Linux Foundation | Apache Software Foundation |
| Original Creator | Netflix | Databricks | Uber |
| Primary Backer | Broad (Apple, Netflix, AWS, etc.) | Databricks | Onehouse |
| GitHub Stars (approx.) | 7,000+ | 7,000+ | 5,000+ |
| Open Catalog Integrations | Broad (Nessie, Polaris, REST catalogs) | Unity Catalog | Hive Metastore-based |
When to Choose Each Format
Choose Apache Iceberg when:
- You need broad multi-engine support (Athena, Trino, Snowflake, RisingWave, Flink)
- You require partition evolution without data rewriting
- You want a fully open specification with no vendor lock-in
- You're building a cloud-native lakehouse with a REST catalog
Choose Delta Lake when:
- Your primary engine is Apache Spark or Databricks
- You want deep integration with the Databricks platform
- Your team has strong Spark expertise
Choose Apache Hudi when:
- Your workload is predominantly record-level upserts at high frequency
- You need native incremental pull APIs for downstream consumers
- You're running on Hadoop/HDFS or need the Hudi-specific clustering features
Using Apache Iceberg with RisingWave
RisingWave has native Iceberg integration for both writing and reading:
```sql
-- 1. Ingest the Kafka topic as a streaming source.
CREATE SOURCE product_events (
    product_id BIGINT,
    event_type VARCHAR,
    price NUMERIC(10, 2),
    inventory_count INT,
    updated_at TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'product-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

-- 2. Maintain per-product state incrementally.
CREATE MATERIALIZED VIEW product_current_state AS
SELECT
    product_id,
    MAX(updated_at) AS last_updated,
    SUM(CASE WHEN event_type = 'sale' THEN 1 ELSE 0 END) AS sales_count,
    MAX(price) AS current_price,
    MAX(inventory_count) AS latest_inventory
FROM product_events
GROUP BY product_id;

-- 3. Continuously upsert that state into an Iceberg table.
CREATE SINK product_state_to_iceberg AS
SELECT * FROM product_current_state
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'product_id',  -- required for upsert sinks
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://my-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'commerce',
    table.name = 'product_state'
);
```
RisingWave does not currently have native Delta Lake or Hudi connectors, making Iceberg the clear choice for RisingWave-based pipelines.
FAQ
Q: Can I migrate from Delta Lake to Iceberg?
Yes. Apache Iceberg ships a Delta Lake migration module, and Delta Lake's UniForm feature can write Iceberg-compatible metadata alongside Delta metadata. Either way, only the metadata layer is rewritten; the underlying Parquet data files are preserved.
Q: Which format does AWS prefer?
AWS has invested heavily in Apache Iceberg. AWS Glue, Athena, EMR, and Lake Formation all have native Iceberg support, and Amazon S3 Tables is built on Iceberg. AWS also participates in the Iceberg REST catalog ecosystem.
Q: Is Delta Lake truly open-source?
The core Delta Lake file format specification is open, but some of Databricks' proprietary Delta Lake tooling (like Delta Sharing server) uses the Business Source License, which restricts commercial use. The open-source delta-io project is under the Apache 2.0 license.
Q: Does format choice affect storage costs?
All three formats store data as Parquet (or ORC), so raw storage costs are similar. Operational costs can differ: MOR tables (Hudi MOR, Iceberg with deletion vectors) require less write I/O but more read I/O. Compaction frequency affects both performance and storage costs.
Q: Which format will "win" long-term?
Iceberg's adoption trajectory is the strongest across cloud providers and query engines. However, all three formats are mature and production-ready. If you're starting fresh, Iceberg is the safest choice for broad compatibility.
Ready to Build with Iceberg?
Apache Iceberg's combination of open design, broad engine support, and powerful features makes it the leading choice for new data lakehouse projects — especially when combined with RisingWave for real-time streaming.
Start exploring with the RisingWave documentation, or ask your questions in the RisingWave Slack community.