Iceberg Snapshot Expiration and GC Best Practices

Apache Iceberg snapshot expiration is the process of removing stale snapshot metadata and unreferenced data files from an Iceberg table. Without it, every table write permanently accumulates metadata and orphaned files — turning a feature (time travel) into a storage crisis in production.

1. How Snapshots Accumulate

Every write operation in Apache Iceberg produces a new snapshot. Each INSERT, UPDATE, DELETE, or compaction run adds a new snapshot entry to the table's metadata.json file. Each snapshot contains a pointer to a manifest list file (.avro), which in turn points to one or more manifest files, which list the actual data files on disk.

Iceberg retains all past snapshots by default. This is intentional — it enables time travel queries and snapshot isolation for concurrent readers. But at scale, the cost compounds fast. A table with 1,000 commits per day at 10MB of metadata per commit generates 10GB of metadata daily with no GC in place.
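The arithmetic behind that estimate is easy to rerun with your own numbers. The sketch below is plain Java, not an Iceberg API: the class and method names are made up for illustration, and it assumes decimal units (1 GB = 1,000 MB) to match the figures above.

```java
// Hypothetical helper, not part of Iceberg: estimates how much metadata piles up without GC.
final class MetadataGrowth {
    static final long BYTES_PER_MB = 1_000_000L;
    static final long BYTES_PER_GB = 1_000_000_000L;

    /** Bytes of new metadata written per day. */
    static long dailyBytes(long commitsPerDay, long metadataBytesPerCommit) {
        return Math.multiplyExact(commitsPerDay, metadataBytesPerCommit);
    }

    /** Bytes retained after the given number of days when nothing is ever expired. */
    static long retainedBytes(long commitsPerDay, long metadataBytesPerCommit, int days) {
        return Math.multiplyExact(dailyBytes(commitsPerDay, metadataBytesPerCommit), (long) days);
    }

    public static void main(String[] args) {
        long daily = dailyBytes(1_000, 10 * BYTES_PER_MB);
        long month = retainedBytes(1_000, 10 * BYTES_PER_MB, 30);
        System.out.println("per day: " + daily / BYTES_PER_GB + " GB");
        System.out.println("after 30 days without GC: " + month / BYTES_PER_GB + " GB");
    }
}
```

At the rate quoted above, a month without expiration leaves roughly 300 GB of metadata behind.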

The metadata layer alone can become a bottleneck. The table's metadata.json lists every retained snapshot, so it grows with each commit and takes longer to load and parse before a query can be planned. Storage costs rise even if the data itself is compacted, because the old manifest files and manifest lists remain.

2. What Gets Left Behind

Understanding what actually accumulates helps you target cleanup correctly.

Snapshot entries in metadata.json: The snapshot log grows with every commit. Old entries are never pruned automatically.

Manifest list files: Each snapshot has its own .avro manifest list. When a snapshot is no longer needed, its manifest list file becomes orphaned.

Manifest files: Manifests track data file metadata (path, partition, record count, column stats). Manifests can be shared across snapshots, so expiration must account for which manifests are still referenced by surviving snapshots.

Replaced data files: When a compaction rewrites 100 small files into 1 large file, the 100 old files are no longer active, but they remain on disk for as long as any retained snapshot still references them.

This is a two-phase problem. Snapshot expiration removes expired metadata together with the data files that only the expired snapshots referenced. Orphan file cleanup removes files that no table metadata references at all, such as leftovers from failed writes.
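To see why shared manifests complicate cleanup, here is a toy in-memory model of the reference chain. It is a sketch under simplifying assumptions (a snapshot holds its manifests directly, with no manifest list file in between), and every type and method name is invented for illustration rather than taken from the Iceberg API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model, not Iceberg classes: a snapshot lists manifests, a manifest lists data files.
final class ReachabilityModel {
    record Manifest(String path, Set<String> dataFiles) {}
    record Snapshot(long id, List<Manifest> manifests) {}

    /** Data files referenced by at least one of the given snapshots. */
    static Set<String> referencedDataFiles(List<Snapshot> snapshots) {
        Set<String> files = new HashSet<>();
        for (Snapshot s : snapshots) {
            for (Manifest m : s.manifests()) {
                files.addAll(m.dataFiles());
            }
        }
        return files;
    }

    /** Data files that only the expired snapshots referenced. */
    static Set<String> unreferencedAfterExpiry(List<Snapshot> expired, List<Snapshot> retained) {
        Set<String> candidates = referencedDataFiles(expired);
        candidates.removeAll(referencedDataFiles(retained));
        return candidates;
    }

    public static void main(String[] args) {
        Manifest shared = new Manifest("m-shared.avro", Set.of("a.parquet"));
        Manifest old = new Manifest("m-old.avro", Set.of("small-1.parquet", "small-2.parquet"));
        Manifest compacted = new Manifest("m-new.avro", Set.of("big-1.parquet"));

        Snapshot beforeCompaction = new Snapshot(1, List.of(shared, old));
        Snapshot afterCompaction = new Snapshot(2, List.of(shared, compacted));

        // a.parquet survives: the retained snapshot still reaches it through the shared manifest.
        System.out.println(unreferencedAfterExpiry(List.of(beforeCompaction), List.of(afterCompaction)));
    }
}
```

Only the two small files become unreferenced. Anything reachable from a surviving snapshot, even through a manifest that an expired snapshot also used, has to stay.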

3. expire_snapshots: What It Does and What It Does Not

The expire_snapshots procedure is the primary maintenance operation for Iceberg tables.

What it does:

Removes snapshot entries older than the specified older_than timestamp from the snapshot log
Deletes orphaned manifest list files (.avro) that are no longer referenced by any live snapshot
Deletes orphaned manifest files that are no longer referenced by any live snapshot

What it does NOT do:

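It does not remove files that the table's metadata never referenced. Data files replaced by compaction are deleted by expire_snapshots only once every snapshot that references them has expired. Files left behind by failed or aborted writes are invisible to it and need a separate orphan file cleanup.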

In Spark SQL, you run snapshot expiration like this:

CALL system.expire_snapshots(
  table => 'catalog.db.events',
  older_than => TIMESTAMP '2026-04-16 00:00:00',
  retain_last => 2
);

The retain_last parameter is the per-call form of min_snapshots_to_keep. It ensures that even if all snapshots are older than the threshold, the most recent N snapshots are preserved.

In Java using the Iceberg API:

table.expireSnapshots()
     .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
     .retainLast(2)
     .commit();

The .commit() call is what actually executes the deletion. Without it, the operation is prepared but not applied.

4. min_snapshots_to_keep and Why It Matters

min_snapshots_to_keep acts as a safety floor. Even if every snapshot in a table is older than your expiry threshold, Iceberg will keep the N most recent snapshots when this value is set.

This setting is critical for two reasons. First, concurrent readers: a Spark job that opened a snapshot reference at T=0 and is still scanning at T=5min could fail if that snapshot gets expired mid-scan. Keeping recent snapshots gives in-flight readers time to complete before their snapshot disappears. Second, time travel and rollback: you almost always want at least one snapshot before the latest to roll back from a bad write.

The recommended minimum is 2. Setting it to 1 means a bad write leaves you with no rollback target. Setting it to 0 is dangerous in any production environment.
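The interaction between the age threshold and the floor can be sketched in a few lines. This is a simplified model, not Iceberg's implementation: it ignores branches, tags, and snapshot ancestry, and the class, record, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of the retention rule; ignores branches, tags, and snapshot ancestry.
final class RetentionRule {
    record Snapshot(long id, long committedAtMillis) {}

    /** Snapshots that are older than the cutoff and not among the newest retainLast. */
    static List<Snapshot> selectExpired(List<Snapshot> all, long olderThanMillis, int retainLast) {
        List<Snapshot> newestFirst = new ArrayList<>(all);
        newestFirst.sort(Comparator.comparingLong(Snapshot::committedAtMillis).reversed());

        List<Snapshot> expired = new ArrayList<>();
        for (int i = 0; i < newestFirst.size(); i++) {
            Snapshot s = newestFirst.get(i);
            boolean protectedByFloor = i < retainLast;
            boolean pastCutoff = s.committedAtMillis() < olderThanMillis;
            if (pastCutoff && !protectedByFloor) {
                expired.add(s);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        List<Snapshot> snapshots = List.of(
            new Snapshot(1, 1_000L), new Snapshot(2, 2_000L), new Snapshot(3, 3_000L));
        // Every snapshot is older than the cutoff, but the floor of 2 protects ids 3 and 2.
        System.out.println(selectExpired(snapshots, 10_000L, 2));
    }
}
```

With a floor of 2, only the oldest snapshot is selected even though all three are past the cutoff.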

You can also set this as a table property so it applies to all expiration operations automatically:

ALTER TABLE catalog.db.events
SET TBLPROPERTIES (
  'history.expire.min-snapshots-to-keep' = '2',
  'history.expire.max-snapshot-age-ms' = '604800000'  -- 7 days in ms
);

With max-snapshot-age-ms set at the table level, any expire_snapshots call that omits older_than falls back to the 7-day threshold. The property supplies the default that automated GC workflows inherit; it does not schedule expiration by itself.
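Millisecond values are easy to mistype, so it can help to derive them from a readable duration. The helper below is a hypothetical sketch that renders the statement shown above as a string; the property names are the ones from that statement, and everything else (class name, method name, string-building approach) is an assumption for illustration.

```java
import java.time.Duration;

// Hypothetical helper: renders the ALTER TABLE statement from readable settings.
final class ExpiryProperties {
    static String alterTableSql(String table, int minSnapshotsToKeep, Duration maxSnapshotAge) {
        return "ALTER TABLE " + table + " SET TBLPROPERTIES (\n"
            + "  'history.expire.min-snapshots-to-keep' = '" + minSnapshotsToKeep + "',\n"
            + "  'history.expire.max-snapshot-age-ms' = '" + maxSnapshotAge.toMillis() + "'\n"
            + ")";
    }

    public static void main(String[] args) {
        // Duration.ofDays(7) is rendered as 604800000 milliseconds.
        System.out.println(alterTableSql("catalog.db.events", 2, Duration.ofDays(7)));
    }
}
```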

5. Orphan File Cleanup: The Second Phase

Orphan files are data files that exist in the table's data directory but are not referenced by any snapshot. They accumulate from:

Failed or partially committed writes (job died after writing files but before committing the snapshot)
Compaction rewrites whose old files were not deleted when their snapshots expired (for example, because the delete step failed)
Manual operations gone wrong

Orphan file cleanup is a separate action from snapshot expiration. In Spark:

CALL system.remove_orphan_files(
  table => 'catalog.db.events',
  older_than => TIMESTAMP '2026-04-20 00:00:00',
  dry_run => false
);

The default retention period for orphan cleanup is 3 days. Files newer than the retention threshold are never deleted, even if they appear unreferenced. This protects in-progress writes that haven't committed yet.

Set the retention threshold to at least 3 days (72 hours). Setting it too low risks deleting files that belong to an in-flight write that hasn't committed its snapshot yet, which corrupts the table.
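The retention rule is easier to reason about as a set difference with an age filter. The sketch below is a toy model with hypothetical names: a real run lists object storage and reads table metadata, whereas here both inputs are passed in directly.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of orphan detection; not the Iceberg implementation.
final class OrphanFinder {
    /**
     * @param filesOnDisk     path -> last-modified time in epoch millis
     * @param referenced      paths reachable from any snapshot's metadata
     * @param olderThanMillis only files modified before this instant may be removed
     */
    static Set<String> findOrphans(Map<String, Long> filesOnDisk, Set<String> referenced, long olderThanMillis) {
        Set<String> orphans = new TreeSet<>();
        for (Map.Entry<String, Long> e : filesOnDisk.entrySet()) {
            boolean unreferenced = !referenced.contains(e.getKey());
            boolean oldEnough = e.getValue() < olderThanMillis;
            if (unreferenced && oldEnough) {
                orphans.add(e.getKey());
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        long dayMillis = 24L * 60 * 60 * 1000;
        long now = 10 * dayMillis;                 // pretend "now" is day 10
        long threeDaysAgo = now - 3 * dayMillis;
        Map<String, Long> onDisk = Map.of(
            "live.parquet", 0L,                    // referenced by a snapshot
            "failed-write.parquet", 0L,            // unreferenced and old
            "in-flight.parquet", now);             // unreferenced but just written
        System.out.println(findOrphans(onDisk, Set.of("live.parquet"), threeDaysAgo));
    }
}
```

The in-flight file is unreferenced too, but it is younger than the threshold, so only the failed write's leftover is selected.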

Run snapshot expiration before orphan cleanup. Orphan cleanup only removes files that no snapshot references, so if it runs first, files still held by a soon-to-be-expired snapshot are skipped and stay on disk until a later pass.

6. GC Loop Design for Production

A reliable GC loop separates the two phases and runs them on different schedules.

Daily: snapshot expiration

-- Run after every batch pipeline completes, or daily as a cron
CALL system.expire_snapshots(
  table => 'catalog.db.events',
  older_than => TIMESTAMP '2026-04-16 00:00:00',  -- 7 days ago
  retain_last => 2
);

Weekly: orphan file cleanup

-- Run weekly, retention at 3 days minimum
CALL system.remove_orphan_files(
  table => 'catalog.db.events',
  older_than => TIMESTAMP '2026-04-20 00:00:00',  -- 3 days ago
  dry_run => false
);

Always run snapshot expiration first, then orphan cleanup. This ensures that files referenced by newly expired snapshots are included in the orphan detection pass.
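The hard-coded timestamps above go stale the day after you write them, so a scheduled job usually computes them at run time. The following sketch builds both statements from a clock reading. It assumes UTC (your Spark session time zone may interpret the literal differently), and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: renders the two GC statements with cutoffs relative to "now".
final class GcStatements {
    private static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats (now - age) as a timestamp literal body, assuming UTC. */
    static String cutoff(Instant now, Duration age) {
        return LocalDateTime.ofInstant(now.minus(age), ZoneOffset.UTC).format(TS);
    }

    static String expireSnapshots(String table, Instant now, Duration maxAge, int retainLast) {
        return "CALL system.expire_snapshots(table => '" + table + "', older_than => TIMESTAMP '"
            + cutoff(now, maxAge) + "', retain_last => " + retainLast + ")";
    }

    static String removeOrphanFiles(String table, Instant now, Duration retention) {
        return "CALL system.remove_orphan_files(table => '" + table + "', older_than => TIMESTAMP '"
            + cutoff(now, retention) + "', dry_run => false)";
    }

    public static void main(String[] args) {
        // A fixed instant keeps the example reproducible; a real job would use Instant.now().
        Instant now = Instant.parse("2026-04-23T00:00:00Z");
        System.out.println(expireSnapshots("catalog.db.events", now, Duration.ofDays(7), 2));
        System.out.println(removeOrphanFiles("catalog.db.events", now, Duration.ofDays(3)));
    }
}
```

Run in that order, the two statements reproduce the daily and weekly calls above with cutoffs of 7 and 3 days before the chosen instant.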

For tables with high write frequency (thousands of commits per day), run snapshot expiration after every N commits rather than on a fixed schedule. Table-level properties supply the default thresholds for those runs; the trigger itself has to come from your Spark or Flink job or from a scheduler.

Monitor three metrics to know if GC is keeping up:

total-data-files from snapshots metadata table
total-delete-files (for tables using merge-on-read)
S3/GCS storage bytes for the table prefix

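If total-data-files keeps growing from one compaction run to the next, maintenance is not keeping up with write frequency. If storage bytes keep growing while total-data-files stays flat, expiration or orphan cleanup is the part falling behind.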
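One way to turn that rule of thumb into an alert is to sample the metric on a schedule and flag sustained growth. The check below is a minimal sketch with a hypothetical name. It assumes you have already collected the samples yourself (for example, one total-data-files reading per day), and it treats "every sample larger than the last" as the signal, which is a deliberately crude definition.

```java
// Hypothetical alerting helper: flags a metric that rose on every consecutive sample.
final class GrowthCheck {
    /** True when there are at least two samples and each one is larger than the one before. */
    static boolean growsOnEverySample(long[] samples) {
        if (samples.length < 2) {
            return false;
        }
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] healthy = {1200, 1450, 900, 1100};   // drops after a maintenance run
        long[] runaway = {1200, 1450, 1800, 2300};  // never drops
        System.out.println("healthy flagged: " + growsOnEverySample(healthy));
        System.out.println("runaway flagged: " + growsOnEverySample(runaway));
    }
}
```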

7. RisingWave: Built-In GC for Iceberg Sinks

If you are writing to Iceberg through RisingWave, snapshot management is handled automatically alongside the sink operation. RisingWave is an open source, PostgreSQL-compatible streaming database that supports native Iceberg sinks.

RisingWave's Iceberg sink runs GC as part of its commit cycle. You configure the behavior with sink properties:

CREATE SINK events_sink FROM events_mv
WITH (
  connector = 'iceberg',
  type = 'upsert',
  database.name = 'prod',
  table.name = 'events',
  catalog.type = 'glue',
  s3.bucket = 'my-data-lake',
  s3.region = 'us-east-1',
  -- GC settings
  commit.interval.ms = '60000',
  snapshot.expiration.enabled = 'true',
  min.snapshots.to.keep = '2',
  max.snapshot.age.ms = '604800000'
);

The built-in GC means you do not need a separate Spark job or Flink action just to keep the table clean. For teams running real-time pipelines that write continuously to Iceberg, this significantly reduces operational overhead.

RisingWave also handles the ordering guarantee: snapshot expiration runs after the current commit is stable, so orphan files from the current write cycle are never at risk.

8. The Risk of Aggressive Expiration

Expiring snapshots too aggressively breaks concurrent readers. If a Spark job opens a snapshot at 10:00 AM, scans for 20 minutes, and your GC job deletes that snapshot at 10:15 AM, the reader fails mid-scan.

The failure mode is a FileNotFoundException or a "snapshot not found" error, depending on the catalog and reader implementation. This is a hard failure, not a graceful degradation.

Three mitigations:

Always set min_snapshots_to_keep = 2 or higher. This guarantees the most recent snapshots survive even if they're outside the age window.
Use a longer expiry window than your longest-running query. If your Spark jobs run for up to 2 hours, set snapshot expiry to at least 24 hours to leave a safe buffer.
Implement reader timeout handling. Design your pipelines to retry on snapshot-not-found errors and re-read from the latest available snapshot.
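The third mitigation can be sketched as a small retry wrapper. SnapshotExpiredException here is a hypothetical stand-in for whatever error your engine raises when a pinned snapshot is gone, and the supplier is assumed to resolve the latest snapshot afresh on every call.

```java
import java.util.function.Supplier;

// Sketch of reader-side retry; the exception type is a placeholder, not a real Iceberg class.
final class SnapshotRetry {
    static final class SnapshotExpiredException extends RuntimeException {
        SnapshotExpiredException(String message) {
            super(message);
        }
    }

    /** Runs the scan and re-runs it on a snapshot-gone error, up to maxAttempts attempts in total. */
    static <T> T withRetry(Supplier<T> scanLatestSnapshot, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SnapshotExpiredException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return scanLatestSnapshot.get();
            } catch (SnapshotExpiredException e) {
                last = e;  // remember the failure and try again from the latest snapshot
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withRetry(() -> {
            if (calls[0]++ == 0) {
                throw new SnapshotExpiredException("snapshot 42 not found");
            }
            return "scanned latest snapshot";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A retry only helps if re-reading from a newer snapshot is acceptable for the job. A pipeline that needs a consistent view as of a fixed point in time should fail loudly instead.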

GC Configuration Checklist

Setting	Recommended Value	Notes
`history.expire.max-snapshot-age-ms`	604800000 (7 days)	Set at table level for auto expiry
`history.expire.min-snapshots-to-keep`	2	Never set to 0 in production
Orphan file retention	72 hours (3 days)	Protects in-flight writes
Snapshot expiration frequency	Daily or after N commits	Daily for batch; per-N for streaming
Orphan cleanup frequency	Weekly	Less frequent than snapshot expiry
Expiry before orphan cleanup	Always	Order matters
Monitor `total-data-files`	Yes	Signals if GC is keeping up
Concurrent reader buffer	Expiry window > max query time	Prevents mid-scan failures

FAQ

Q: Does expire_snapshots delete data files?

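Partly. expire_snapshots removes snapshot metadata entries, the manifest lists and manifests that no retained snapshot references, and the data files that only expired snapshots referenced. It does not touch files that were never committed to the table, such as output from failed writes; those are handled by the orphan file cleanup procedure (remove_orphan_files). Skipping orphan cleanup means those stray files keep adding to your storage cost even after snapshot expiration runs.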

Q: What happens if I run expire_snapshots too aggressively?

Active readers that opened a snapshot reference before expiry will fail mid-scan with a "snapshot not found" or FileNotFoundException error. Set min_snapshots_to_keep to at least 2 and make the expiry window longer than your longest-running query to prevent this.

Q: How do I know if my GC loop is working?

Query the snapshots metadata table and check total-data-files and total-delete-files. Watch the S3 storage bytes for your table prefix over time. If storage keeps growing despite regular compaction and GC runs, orphan file cleanup is likely not running or is too conservative.

Q: Can I automate GC without running Spark?

Yes. Flink provides ExpireSnapshotsAction and DeleteOrphanFilesAction as part of Flink-Iceberg. RisingWave's Iceberg sink runs GC natively without requiring a separate compute engine. For catalog-managed options, some Iceberg REST catalogs and services like AWS Glue or Nessie also support automated GC policies.

Q: What is the right value for min_snapshots_to_keep?

Two is a safe minimum for most tables. It gives you one rollback target and keeps the latest snapshot available for concurrent readers. For tables with frequent schema changes or high-stakes writes, consider 3 to 5. For append-only event tables with no rollback requirements, 2 is sufficient.