Iceberg Copy-on-Write vs Merge-on-Read

Apache Iceberg gives you two distinct strategies for applying updates and deletes to a table: Copy-on-Write (COW) rewrites affected data files immediately so reads always scan clean files; Merge-on-Read (MOR) appends small delete files and defers the merge work to query time. Which one you choose has a direct, measurable impact on both ingestion throughput and query latency.

1. Copy-on-Write: Clean Reads, Expensive Writes

When a DELETE or UPDATE statement arrives against a COW table, Iceberg identifies every data file that contains at least one affected row and rewrites those files in full. The new files exclude the deleted rows or include the updated values. The commit creates a new snapshot that references the rewritten files in place of the old ones.

The result is a table where every data file represents the current state of the data. No reader ever needs to reconcile a base file with a delete file. A full table scan reads exactly the rows that exist today.

This cleanliness has a cost. If a 500 MB Parquet file contains a single row that needs updating, Iceberg rewrites all 500 MB to produce a new file with that one row changed. Write amplification scales with file size, not change size. For tables with a low update rate and large files, that amplification is tolerable. For tables receiving a continuous stream of CDC events, it becomes a bottleneck almost immediately.
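To put a rough number on that amplification, here is a back-of-the-envelope sketch in Python. It is a toy calculation, not Iceberg code: the function name and the 200-byte row size are assumptions chosen for illustration.

```python
def cow_write_amplification(file_size_bytes: int, changed_bytes: int) -> float:
    """Bytes rewritten per byte of logical change when a whole file is rewritten."""
    if changed_bytes <= 0:
        raise ValueError("changed_bytes must be positive")
    return file_size_bytes / changed_bytes


# One row of an assumed 200 bytes updated inside a 500 MB data file.
file_size = 500 * 1024 * 1024
row_size = 200  # assumption, real row sizes vary widely

amplification = cow_write_amplification(file_size, row_size)
print(f"{amplification:,.0f}x")  # prints 2,621,440x
```

The ratio depends only on the file size and the size of the change, which is why larger target file sizes make COW updates more expensive.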

COW is also sensitive to concurrent writes. Because a single update may rewrite many large files, each commit stays in flight longer, and the longer it is in flight, the more likely it is to conflict with another writer touching the same files.

2. Merge-on-Read: Fast Writes, Deferred Merge Cost

MOR separates the act of recording a change from the act of applying it. When a delete or update arrives, Iceberg writes a small delete file instead of rewriting the base data file. For an update, the new row values go into a new, small data file alongside the delete file.

Iceberg v2 supports two types of delete files:

Positional delete files record a list of (file path, row position) pairs. They are precise and compact. A reader can skip rows at exact offsets without scanning for matching values.

Equality delete files record column values that identify rows to remove. They are more flexible for cases where the physical row position is unknown at write time, which is common in streaming CDC workloads where the writer does not track internal row offsets.
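The difference between the two delete types is easiest to see in a toy model. The sketch below represents a data file as a Python dict and applies each kind of delete at read time. The file path, column names, and function names are hypothetical, and the model ignores sequence numbers, which real Iceberg readers use to decide which delete files apply to which data files.

```python
def apply_positional_deletes(data_file, deletes):
    """Drop rows whose (file path, row position) pair appears in `deletes`."""
    path = data_file["path"]
    return [
        row
        for pos, row in enumerate(data_file["rows"])
        if (path, pos) not in deletes
    ]


def apply_equality_deletes(data_file, column, deleted_values):
    """Drop every row whose `column` value appears in `deleted_values`."""
    return [row for row in data_file["rows"] if row[column] not in deleted_values]


# Hypothetical data file: the row position is the index in the list.
data_file = {
    "path": "s3://bucket/orders/file-a.parquet",
    "rows": [
        {"order_id": 1, "status": "new"},
        {"order_id": 2, "status": "paid"},
        {"order_id": 3, "status": "new"},
    ],
}

by_position = apply_positional_deletes(
    data_file, {("s3://bucket/orders/file-a.parquet", 1)}
)
by_value = apply_equality_deletes(data_file, "order_id", {2})
print(by_position == by_value)  # prints True
```

Both calls remove the same row. The positional version needs only the row index, while the equality version has to compare a column value for every row it reads.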

Reads on a MOR table must merge base files with all outstanding delete files at query time. If a single base file has accumulated 50 small delete files since its last compaction, scanning it means opening and cross-referencing 51 files. Read amplification grows with the number of uncompacted delete files, not with the number of changed rows.
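A small worked example makes the point about file count versus row count concrete. This is a deliberately simple worst-case model (every outstanding delete file is opened once per scan), and the function name is an assumption.

```python
def files_opened(num_data_files: int, num_delete_files: int) -> int:
    """Worst case: the scan opens every data file and every outstanding delete file."""
    return num_data_files + num_delete_files


# The same 50 deleted rows, recorded two different ways against one base file.
one_row_per_delete_file = files_opened(num_data_files=1, num_delete_files=50)
all_rows_in_one_delete_file = files_opened(num_data_files=1, num_delete_files=1)

print(one_row_per_delete_file, all_rows_in_one_delete_file)  # prints 51 2
```

The number of changed rows is identical in both cases. Only the number of delete files differs, and that is what the reader pays for.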

The write path stays fast. Writing a delete file for a thousand updated rows produces a file measured in kilobytes, not gigabytes. Ingestion throughput on a high-churn CDC table can be an order of magnitude higher under MOR than under COW.
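The gap between the two write paths can be estimated with simple arithmetic. The sketch below is a rough model under stated assumptions: 100 bytes per delete entry is a guess at an uncompressed (file path, position) record, and the model counts only delete entries, not the small data file that an UPDATE also appends.

```python
MB = 1024 * 1024


def bytes_written_cow(touched_file_sizes):
    """COW rewrites every touched data file in full."""
    return sum(touched_file_sizes)


def bytes_written_mor(changed_rows, bytes_per_delete_entry=100):
    """MOR writes one delete entry per changed row (entry size is an assumption)."""
    return changed_rows * bytes_per_delete_entry


# 1,000 changed rows scattered across ten 500 MB data files.
cow = bytes_written_cow([500 * MB] * 10)
mor = bytes_written_mor(1_000)

print(cow // MB, "MB vs", mor // 1024, "KB")  # prints 5000 MB vs 97 KB
```

How much of that gap turns into ingestion throughput depends on the engine and the storage layer, so treat the output as an illustration of scale rather than a benchmark.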

3. Comparison Table

| Dimension | Copy-on-Write | Merge-on-Read |
|---|---|---|
| Write amplification | High (full file rewrite) | Low (small delete file) |
| Read performance | Excellent (no merge needed) | Degrades with delete file count |
| Compaction need | Low (files are always clean) | High (must compact delete files regularly) |
| Concurrent write sensitivity | High | Low |
| Best use case | Low-update reporting tables | CDC, high-churn, write-heavy ingestion |
| Iceberg delete file type | None | Positional or equality deletes |

4. How to Configure in Practice

The write mode is set at the table level via Iceberg table properties. You can set it at table creation or alter it afterward.

-- Create a table with MOR for high-frequency CDC ingestion
CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    status STRING,
    updated_at TIMESTAMP
)
USING iceberg
TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
);

-- Create a reporting table optimized for reads
CREATE TABLE orders_daily_snapshot (
    order_id BIGINT,
    customer_id BIGINT,
    status STRING,
    snapshot_date DATE
)
USING iceberg
TBLPROPERTIES (
    'write.delete.mode' = 'copy-on-write',
    'write.update.mode' = 'copy-on-write',
    'write.merge.mode'  = 'copy-on-write'
);

Note that write.delete.mode, write.update.mode, and write.merge.mode are independent properties. A common pattern is to set write.delete.mode and write.update.mode to MOR for streaming CDC, while setting write.merge.mode to COW for batch MERGE INTO operations that already scan large ranges.
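Changing the mode on an existing table uses the same three properties through ALTER TABLE ... SET TBLPROPERTIES. The helper below is a minimal sketch that renders such a statement for the mixed pattern just described. The function name is hypothetical, and the generated SQL assumes a Spark SQL session with an Iceberg catalog configured.

```python
VALID_MODES = {"copy-on-write", "merge-on-read"}


def alter_write_modes_sql(table: str, *, delete: str, update: str, merge: str) -> str:
    """Render an ALTER TABLE statement that sets the three Iceberg write modes."""
    modes = {
        "write.delete.mode": delete,
        "write.update.mode": update,
        "write.merge.mode": merge,
    }
    for prop, mode in modes.items():
        if mode not in VALID_MODES:
            raise ValueError(f"{prop}: unknown mode {mode!r}")
    props = ",\n".join(f"    '{prop}' = '{mode}'" for prop, mode in modes.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n{props}\n);"


# MOR for streaming deletes and updates, COW for batch MERGE INTO.
print(
    alter_write_modes_sql(
        "orders",
        delete="merge-on-read",
        update="merge-on-read",
        merge="copy-on-write",
    )
)
```

Validating the mode strings before rendering catches typos early, so a misspelled mode never reaches the table properties.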

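Each commit records what it changed in its snapshot summary. After a MOR write, the summary reports the delete files that were added:

-- Check per-snapshot file counts via the Iceberg snapshots metadata table.
-- The counters live in the summary map, not in top-level columns.
SELECT snapshot_id, committed_at, operation,
       summary['added-data-files']   AS added_data_files,
       summary['added-delete-files'] AS added_delete_files,
       summary['total-delete-files'] AS total_delete_files
FROM orders.snapshots
ORDER BY committed_at DESC
LIMIT 10;

On a healthy MOR table, compaction runs appear periodically as snapshots with operation = 'replace' and a spike in added_data_files, and total_delete_files stays bounded instead of climbing without limit.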
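The same check can be scripted once the snapshot rows have been fetched into a client. The sketch below assumes the latest snapshot's summary has been loaded as a dict of strings. `total-delete-files` is a standard Iceberg snapshot summary key, while the function name and the threshold of 100 are hypothetical and should be tuned per table.

```python
def needs_compaction(latest_summary: dict, max_delete_files: int = 100) -> bool:
    """True when the outstanding delete files exceed the (hypothetical) threshold."""
    # Summary values are strings, and a counter may be absent; treat missing as 0.
    total = int(latest_summary.get("total-delete-files", "0"))
    return total > max_delete_files


# Example summaries, shaped like the `summary` column of the snapshots table.
after_streaming = {"added-delete-files": "4", "total-delete-files": "240"}
after_compaction = {"added-data-files": "3", "total-delete-files": "0"}

print(needs_compaction(after_streaming))   # prints True
print(needs_compaction(after_compaction))  # prints False
```

A check like this can gate a scheduled compaction job so it only runs when delete files have actually piled up.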

5. Ingestion Branches: Combining Both Modes

Iceberg supports named branches (named references to snapshots), which make it possible to isolate write traffic from read traffic at the metadata level. An ingestion branch pattern works like this:

The streaming ingestion job writes to a dedicated write branch using MOR. Small delete files accumulate there without touching the main branch that readers query.

A periodic compaction job reads the write branch, merges base files and delete files, rewrites the result as clean Parquet files, and fast-forwards the main branch to point at the compacted snapshot.

Readers on the main branch always see COW-quality files. Writers on the write branch never stall waiting for a full file rewrite. The two paths are decoupled.
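The mechanics of the pattern can be sketched as a toy state machine. This is not the Iceberg API: the class, its methods, and the one-file-per-commit accounting are all assumptions made to show how the two branch pointers move.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: int
    delete_files: int


class BranchedTable:
    """Toy table with two branch refs, 'main' for readers and 'write' for ingestion."""

    def __init__(self):
        first = Snapshot(1, data_files=1, delete_files=0)
        self.refs = {"main": first, "write": first}
        self._next_id = 2

    def _new_snapshot(self, data_files, delete_files):
        snap = Snapshot(self._next_id, data_files, delete_files)
        self._next_id += 1
        return snap

    def mor_commit(self):
        """Streaming write: adds one data file and one delete file on 'write' only."""
        head = self.refs["write"]
        self.refs["write"] = self._new_snapshot(
            head.data_files + 1, head.delete_files + 1
        )

    def compact_and_publish(self):
        """Rewrite 'write' into clean files, then fast-forward 'main' to the result."""
        clean = self._new_snapshot(data_files=1, delete_files=0)
        self.refs["write"] = clean
        self.refs["main"] = clean


table = BranchedTable()
for _ in range(3):
    table.mor_commit()

print(table.refs["main"].delete_files, table.refs["write"].delete_files)  # prints 0 3
table.compact_and_publish()
print(table.refs["main"].delete_files, table.refs["write"].delete_files)  # prints 0 0
```

Readers of main never observe a snapshot with delete files, no matter how many MOR commits land on the write branch between compactions.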

This pattern requires the catalog and query engine to support branch-aware reads and writes. Catalogs that support the Iceberg REST spec (Polaris, Unity, Nessie, and others) expose branch-level snapshot operations. Not every engine exposes branch targeting in SQL yet, so check your toolchain before designing around it.

6. How Flink Handles This

In upsert mode, Apache Flink's Iceberg sink writes equality deletes. Every UPSERT operation produces an equality delete entry alongside the updated row, which lands in a new data file. This is MOR by definition.

Flink does not compact delete files automatically. Users must schedule a separate Spark or Flink batch job running RewriteDataFiles to collapse delete files into clean base files. Until compaction runs, read latency on a high-churn Flink-fed table grows with the number of delete files that have accumulated.

The operational consequence is two pipelines: one for ingestion, one for maintenance. They must be kept in sync. If the compaction job falls behind, query performance degrades and storage costs grow as delete files accumulate.

7. How RisingWave Handles This

RisingWave, a PostgreSQL-compatible streaming database written in Rust, treats Iceberg as a first-class sink with built-in write mode control and automatic compaction.

For CDC and upsert workloads, RisingWave defaults to MOR using equality deletes:

CREATE SINK orders_sink FROM orders_cdc WITH (
    connector        = 'iceberg',
    type             = 'upsert',
    primary_key      = 'order_id',
    warehouse.path   = 's3://datalake/warehouse',
    database.name    = 'ods',
    table.name       = 'orders',
    write_mode       = 'merge-on-read'
);

RisingWave also supports COW via an ingestion branch pattern. With write_mode = 'copy-on-write', the sink stages writes to an Iceberg ingestion branch. A background compactor merges those writes and advances the main branch, so readers always see clean COW files without contending with in-flight writes.

Built-in compaction means no external Spark cluster. The same system that writes also maintains the table. Snapshot expiration and orphan file cleanup run on the same loop, so the table stays healthy without a separate maintenance pipeline.

8. Decision Matrix

Use this table to pick a starting point. Real workloads often combine both modes across different tables.

| Scenario | Recommended Mode | Reason |
|---|---|---|
| High-frequency CDC from OLTP database | MOR (equality deletes) | Write throughput matters more than read latency |
| Low-update reporting or audit table | COW | Reads dominate, writes are infrequent |
| Real-time dashboard fed by streaming | MOR + regular compaction | Fresh writes, compaction keeps reads fast |
| Batch ETL with large `MERGE INTO` | COW or MOR, depending on update rate | Batch rewrites tolerate COW cost |
| Mixed read/write with SLA on both | MOR on write branch, COW on main | Ingestion branches decouple the two |

The compaction budget matters. MOR tables with no compaction schedule will degrade. If you cannot commit to running compaction regularly, COW is the safer default for tables where read performance is observable by end users.

Update rate is the other axis. Tables that receive more than a few percent of rows updated per hour tip toward MOR. Tables that are mostly appended with occasional corrections can afford COW.
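The two axes above can be folded into a small starting-point helper. This is a sketch of the heuristic, not a rule: the function name is hypothetical, and the 3% threshold is only one reading of "a few percent" that should be replaced with a value measured on your own workload.

```python
def recommend_mode(
    updated_fraction_per_hour: float,
    compaction_scheduled: bool,
    threshold: float = 0.03,  # assumption, stands in for "a few percent"
) -> str:
    """Suggest a starting write mode from update rate and compaction budget."""
    if not compaction_scheduled:
        # Without regular compaction, MOR read performance degrades over time.
        return "copy-on-write"
    if updated_fraction_per_hour > threshold:
        return "merge-on-read"
    return "copy-on-write"


print(recommend_mode(0.10, compaction_scheduled=True))   # prints merge-on-read
print(recommend_mode(0.10, compaction_scheduled=False))  # prints copy-on-write
print(recommend_mode(0.01, compaction_scheduled=True))   # prints copy-on-write
```

The compaction check comes first on purpose: a high update rate only favors MOR when something is in place to keep delete files under control.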

FAQ

Does Iceberg's choice of delete mode affect query engines like Trino or Spark?

Yes. Both engines support reading positional and equality delete files, but equality delete handling can be slower because the engine must join the delete file against every scanned data file by column value rather than skipping by position. Positional deletes are faster to apply at read time. If you control both the writer and the reader, prefer positional deletes for MOR tables when your writer can track row positions.

Can I change the delete mode on an existing table?

You can alter the table properties at any time. The change applies to new writes only. Data files and delete files already written remain as-is. A table in the middle of a COW-to-MOR transition will have a mix of clean data files and new delete files until the next compaction rewrites everything.

What happens if compaction never runs on a MOR table?

Delete file count grows with every update. Readers must merge more files on every scan. Eventually, the query engine's open-file limits or memory limits constrain the scan, and query latency climbs. Storage costs also grow because delete files are not removed until a snapshot expiration job runs after compaction.

How does MOR interact with Iceberg's partition pruning?

Data file pruning still works normally. Iceberg skips data files whose partition values do not overlap the query predicate. However, delete files scoped to a partition range must still be loaded and applied to every matching data file. This means partition pruning reduces the number of base files scanned but does not always reduce the number of delete files loaded.

Is there a performance difference between equality deletes and positional deletes for streaming CDC?

Equality deletes are almost always used in streaming CDC because the writer (Flink, RisingWave, or similar) does not track internal row positions in Iceberg data files at write time. Positional deletes require knowledge of the exact (file path, row index) of the target row, which is only available when the writer does a read-before-write to locate the row first. The extra read-before-write is expensive at high ingestion rates, which is why equality deletes dominate in streaming CDC workloads despite their higher read cost.
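The cost of read-before-write shows up clearly in a toy model. The sketch below locates a row by scanning in-memory "data files" and counts the rows it had to read, then contrasts that with building an equality delete entry. The file names and function names are hypothetical, and a real writer would use file statistics or an index rather than a full scan, but some form of lookup is still required.

```python
def locate_row(data_files, key_column, key):
    """Read-before-write: scan until the key is found.

    Returns ((file_path, row_position), rows_scanned), or (None, rows_scanned).
    """
    rows_scanned = 0
    for data_file in data_files:
        for pos, row in enumerate(data_file["rows"]):
            rows_scanned += 1
            if row[key_column] == key:
                return (data_file["path"], pos), rows_scanned
    return None, rows_scanned


def equality_delete_entry(key_column, key):
    """No lookup needed: the writer records the key value it already has."""
    return {key_column: key}


files = [
    {"path": "file-a.parquet", "rows": [{"order_id": i} for i in range(1000)]},
    {"path": "file-b.parquet", "rows": [{"order_id": i} for i in range(1000, 2000)]},
]

position, scanned = locate_row(files, "order_id", 1500)
print(position, scanned)                        # prints ('file-b.parquet', 500) 1501
print(equality_delete_entry("order_id", 1500))  # prints {'order_id': 1500}
```

The equality entry costs nothing to produce at write time, and the lookup work it skipped is exactly what the reader later pays when it matches values against data files.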