Iceberg Compaction: Built-in vs Scheduled Batch Jobs

Apache Iceberg compaction is the process of merging many small Parquet files into fewer, larger ones. It is a necessary maintenance task for any Iceberg table that receives continuous writes, and skipping it leads to read queries that scan hundreds or thousands of tiny files, each with its own metadata overhead and I/O cost.

Why Iceberg Tables Need Compaction

Streaming ingestion writes frequently, often committing rows every few seconds or minutes. Each commit produces at least one new Parquet file. Over time, a partition that receives steady inserts can accumulate thousands of files, each holding only a few megabytes or less.

Iceberg reads are file-parallel: after pruning with manifest statistics, a query engine still opens every data file that may contain matching rows. With 5,000 small files in the scanned partitions, that means 5,000 file-open operations, 5,000 footer reads, and 5,000 separate scans. Query latency climbs with file count, not just data volume.
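To make the file-count effect concrete, here is a back-of-the-envelope model. The per-file overhead, throughput, and parallelism figures are illustrative assumptions, not measurements from any particular engine.

```python
def estimated_scan_seconds(file_count, total_bytes,
                           per_file_overhead_s=0.05,
                           throughput_bytes_per_s=200 * 1024 * 1024,
                           parallelism=16):
    """Rough scan-time estimate: fixed per-file cost plus bytes over throughput.

    All default values are hypothetical and chosen only for illustration.
    """
    overhead = file_count * per_file_overhead_s        # open + footer read per file
    transfer = total_bytes / throughput_bytes_per_s    # time to move the actual data
    return (overhead + transfer) / parallelism


total = 10 * 1024**3  # the same 10 GiB of data in both layouts
small_layout = estimated_scan_seconds(5000, total)
compacted_layout = estimated_scan_seconds(20, total)
print(f"5,000 files: {small_layout:.1f}s, 20 files: {compacted_layout:.1f}s")
```

In this toy model the data volume is identical in both layouts, so the whole difference comes from the per-file term.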

Compaction fixes this by reading the small files, merging their rows, and writing new, larger files. The compaction commit replaces the old files in the table's manifests, so the new snapshot no longer references them. The physical files are removed later, once the snapshots that still reference them fall outside the retention window and are expired.

What Compaction Actually Does Internally

When a compaction job runs, it selects a set of small files that belong to the same partition and fall below a minimum size threshold. It reads the selected files, merges the rows, and writes one or more output files sized close to the configured target-file-size-bytes.
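The selection step can be sketched as a simple bin-packing pass. This is a minimal first-fit-decreasing illustration under assumed inputs (a plain list of file sizes for one partition); the function name is hypothetical, and Iceberg's actual bin-pack strategy differs in its details.

```python
def plan_compaction(file_sizes, target_size, min_size):
    """Group small files into bins whose total size stays at or under target_size.

    file_sizes: sizes in bytes of the data files in ONE partition.
    Returns a list of groups; each group would be rewritten into one output file.
    """
    # Only files below the minimum size are candidates for rewriting.
    candidates = sorted((s for s in file_sizes if s < min_size), reverse=True)
    groups = []
    for size in candidates:
        # First-fit: put the file into the first group that still has room.
        for group in groups:
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            groups.append([size])
    # A group of one file gains nothing from being rewritten.
    return [g for g in groups if len(g) > 1]


MB = 1024 * 1024
sizes = [4 * MB] * 100 + [600 * MB]   # 100 small files and one already-large file
plan = plan_compaction(sizes, target_size=512 * MB, min_size=32 * MB)
print(len(plan), "output file(s) from", sum(len(g) for g in plan), "inputs")
```

The 600 MB file is left alone because it is already above the minimum size, while the 100 small files fit into a single 400 MB output.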

After writing the new files, the compaction job commits a new snapshot to the Iceberg catalog. This snapshot is a replace operation: it references the new large files and drops the old small files from the current manifests. No delete files are written for the old data. The table's data is unchanged; only the physical layout improves.

The old files remain on object storage until the snapshot expires and a GC sweep removes them. For S3-backed tables, this means compaction does not immediately reduce storage costs, but it can reduce your query bill significantly from the first read after compaction completes.

Approach 1: Spark-Based Scheduled Compaction

The most widely deployed compaction strategy today is a scheduled Spark job using the Iceberg RewriteDataFiles action.

from pyspark.sql import SparkSession

# The SQL extensions are required for CALL procedures.
# The Iceberg catalog itself must also be configured (spark.sql.catalog.*).
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# SparkActions is a JVM API and cannot be imported from Python. From PySpark,
# call the rewrite_data_files procedure, which wraps the RewriteDataFiles action.
# "my_catalog" is a placeholder for your configured Iceberg catalog name.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map(
            'target-file-size-bytes', '536870912',  -- 512 MB
            'min-file-size-bytes', '33554432',      -- 32 MB
            'max-file-size-bytes', '805306368'      -- 768 MB
        )
    )
""")

You schedule this job via Airflow, a cron expression, or any other batch scheduler. A typical schedule runs every 30 minutes to a few hours, depending on ingestion rate.

Pros:

  • Battle-tested in production at scale
  • Fine-grained control over file size targets, concurrency, and partition filters
  • Supports advanced layout strategies like Z-ordering and sort-order rewrites

Cons:

  • Requires a running Spark cluster, which carries cost even when compaction is idle
  • Scheduling lag means files accumulate between runs, degrading read latency in that window
  • Compaction can fall behind ingestion if the write rate exceeds what the Spark cluster can process in the scheduling interval
  • Adds a separate job to monitor, alert on, and maintain

The operational cost of Spark compaction is often underestimated. You need a cluster, a scheduler, retry logic, observability, and someone to respond when the job fails or falls behind.

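Approach 2: Flink-Based Compaction

Apache Flink offers its own compaction integration through the Actions API in the iceberg-flink module.

import org.apache.iceberg.flink.actions.Actions;

// env is the StreamExecutionEnvironment; table is an already-loaded Iceberg Table
Actions.forTable(env, table)
    .rewriteDataFiles()
    .targetSizeInBytes(512L * 1024 * 1024)  // 512 MB target output size
    .execute();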

This approach integrates more naturally into a Flink-first stack, but it still runs as a separate job. You can chain it into your Flink pipeline as a periodic trigger, but it competes for TaskManager resources with your streaming workload and adds complexity to checkpoint coordination.

The trade-offs are similar to Spark: separate operational surface, scheduling lag, and resource contention. The main advantage over Spark is that teams already running Flink can avoid introducing a second processing framework.

Approach 3: Built-in Compaction in RisingWave

RisingWave is a PostgreSQL-compatible streaming database written in Rust with disaggregated storage on S3. Its Iceberg sink includes built-in compaction triggered automatically after each batch of commits.

You configure compaction behavior in the CREATE SINK statement:

CREATE SINK iceberg_events_sink
FROM events_mv
WITH (
    connector = 'iceberg',
    type = 'append-only',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest:8181',
    warehouse.path = 's3://data-lake/warehouse',
    database.name = 'analytics',
    table.name = 'events',
    -- Compaction settings
    compaction_min_file_num = 10,
    target_file_size_bytes = 536870912  -- 512MB
);

compaction_min_file_num sets the minimum number of files that must accumulate in a partition before compaction triggers. Once that threshold is crossed after a commit, RisingWave schedules a compaction task internally.

target_file_size_bytes controls the target output size for merged files, defaulting to 512MB.

Compaction runs asynchronously inside the RisingWave engine. It does not block ongoing ingestion. Your streaming query continues committing new data while compaction works on the existing small files in the background.
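The file-count trigger described above can be illustrated with a few lines of Python. This is a hypothetical sketch of the idea only; the function name and input shape are assumptions, and it is not RisingWave's internal logic.

```python
from collections import Counter


def partitions_to_compact(committed_files, min_file_num):
    """Return partitions whose file count has reached min_file_num.

    committed_files: iterable of (partition, file_path) pairs.
    Hypothetical illustration, not RisingWave's implementation.
    """
    counts = Counter(partition for partition, _ in committed_files)
    return sorted(p for p, n in counts.items() if n >= min_file_num)


files = [("2024-06-01", f"a{i}.parquet") for i in range(12)]
files += [("2024-06-02", f"b{i}.parquet") for i in range(3)]
print(partitions_to_compact(files, min_file_num=10))
```

Only the partition with 12 files qualifies; the one with 3 files keeps accumulating until it reaches the threshold.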

Pros:

  • No separate cluster required
  • No scheduling infrastructure to build or maintain
  • Compaction triggers automatically based on file count, not a time-based schedule
  • Zero additional operational surface: one system to monitor instead of two

Cons:

  • Less control over advanced layout strategies like Z-ordering
  • Compaction resources share the RisingWave cluster rather than running in isolation

Performance Comparison: Lag and Latency

The core performance difference between batch and built-in compaction is lag.

With Spark compaction on a 30-minute schedule, files accumulate for up to 30 minutes before compaction runs. During that window, reads see small files. After the Spark job completes, reads see the merged layout. The lag is predictable but unavoidable.

With built-in compaction in RisingWave, compaction triggers as soon as the file count crosses compaction_min_file_num. If you configure a threshold of 10 files, compaction begins after the tenth file is committed. For a table ingesting at 1 file per second, that is a 10-second lag rather than a 30-minute one.
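The arithmetic behind that comparison is simple enough to write down. The sketch below assumes a constant commit rate of one file per commit, which real workloads rarely have; the function names are illustrative.

```python
def wait_until_trigger_seconds(threshold_files, files_per_second):
    """Time for a partition to accumulate enough files to trigger compaction."""
    return threshold_files / files_per_second


def files_between_runs(interval_seconds, files_per_second):
    """Small files that pile up between two scheduled compaction runs."""
    return int(interval_seconds * files_per_second)


builtin_wait = wait_until_trigger_seconds(threshold_files=10, files_per_second=1.0)
scheduled_backlog = files_between_runs(interval_seconds=30 * 60, files_per_second=1.0)
print(f"file-count trigger fires after {builtin_wait:.0f}s")
print(f"30-minute schedule accumulates up to {scheduled_backlog} files per run")
```

At one file per second, a 30-minute schedule lets up to 1,800 small files accumulate before each run, while a threshold of 10 caps the backlog at roughly 10 files plus whatever arrives while compaction is in progress.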

| Dimension | Spark Batch Compaction | RisingWave Built-in |
|---|---|---|
| Compaction lag | Minutes to hours (schedule-dependent) | Seconds to minutes (file-count-dependent) |
| Cluster required | Yes, separate Spark cluster | No, runs inside RisingWave |
| Scheduling overhead | Airflow / cron / manual | None |
| Monitoring surface | Separate job + alerts | Single system |
| Z-ordering support | Yes | No |
| Bloom filter optimization | Yes (with Spark) | No |
| Operational complexity | High | Low |
| Cost when idle | Cluster still running | Zero |
| Fine-grained control | High | Moderate |

When to Still Use Spark Compaction

Built-in compaction handles the common case well: merge small files, stay close to target size, keep read latency low. But there are workloads where Spark compaction remains the right choice.

Z-ordering. If your queries frequently filter on multiple columns and you want to co-locate related rows to maximize data skipping, Z-ordering rewrites the physical sort order of files. RisingWave's built-in compaction does not support Z-ordering today.

Bloom filters. Spark can write Parquet bloom filter statistics during compaction, improving point-lookup performance on high-cardinality columns. This requires explicit configuration during the compaction rewrite.

Very high volume tables. At petabyte scale with complex partition strategies, dedicated Spark clusters with tuned concurrency settings may outperform built-in compaction in raw throughput.

Regulatory isolation. Some organizations require data processing to happen in dedicated, isolated compute. A standalone Spark compaction job satisfies that constraint more cleanly than shared-resource compaction.

For most streaming-to-Iceberg pipelines ingesting at rates below a few GB per second, built-in compaction eliminates significant operational overhead, and the trade-offs are limited to the cases listed above.

Configuration Reference

RisingWave built-in compaction options

-- Trigger compaction once this many files accumulate in a partition
compaction_min_file_num = 10

-- Target output file size after merging
target_file_size_bytes = 536870912  -- 512MB (default)

-- Example: aggressive compaction for high-write-rate tables
compaction_min_file_num = 5
target_file_size_bytes = 134217728  -- 128MB

Lower compaction_min_file_num values trigger compaction sooner and keep file counts lower, at the cost of more frequent compaction passes. A value between 5 and 20 covers most use cases.

Spark compaction configuration

from datetime import date, timedelta

yesterday = (date.today() - timedelta(days=1)).isoformat()

# "my_catalog" is a placeholder for your configured Iceberg catalog name.
spark.sql(f"""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        where => 'event_date >= "{yesterday}"',  -- compact only recent partitions
        options => map(
            'target-file-size-bytes', '536870912',
            'min-file-size-bytes', '33554432',
            'max-concurrent-file-group-rewrites', '5',
            'partial-progress.enabled', 'true',
            'partial-progress.max-commits', '10'
        )
    )
""")

partial-progress.enabled allows the job to commit incrementally, so a failure mid-run does not discard all completed work. For large tables, this is strongly recommended.

FAQ

Does compaction change the data in my Iceberg table?

No. Compaction is a pure layout optimization. It reads existing rows, rewrites them into larger files, and updates the table's snapshot metadata to point to the new files. Row values, partition assignments, and schema are unchanged. Query results are identical before and after compaction.

Can I run compaction while ingestion is active?

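Yes, for both Spark and built-in compaction. Iceberg uses optimistic concurrency: compaction commits a new snapshot, and append commits from ingestion do not conflict with it. If a concurrent commit deletes or updates rows in the files being rewritten, the compaction commit can fail validation and has to be retried. In-flight reads see a consistent snapshot regardless of whether compaction is running.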

How do I know if my table needs compaction?

Query the Iceberg metadata tables to check average file sizes. A healthy table typically has average file sizes between 128MB and 1GB. If your average is below 32MB and file counts are in the thousands, compaction will improve query performance.

-- Check file count and average size per partition (Spark SQL)
SELECT
    partition,
    count(*) AS file_count,
    avg(file_size_in_bytes) / 1048576 AS avg_size_mb
FROM db.events.files
GROUP BY partition
ORDER BY file_count DESC
LIMIT 20;
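If you export the file sizes for a partition, the rule of thumb above can be applied programmatically. This is a minimal sketch; the function name is hypothetical, and the 32 MB and 1,000-file cutoffs simply restate the heuristic from this section rather than any official threshold.

```python
MB = 1024 * 1024


def needs_compaction(file_sizes, small_avg_bytes=32 * MB, many_files=1000):
    """Apply the heuristic: small average file size AND a large file count.

    file_sizes: list of data file sizes in bytes for one table or partition.
    """
    if not file_sizes:
        return False
    avg = sum(file_sizes) / len(file_sizes)
    return avg < small_avg_bytes and len(file_sizes) >= many_files


fragmented = [4 * MB] * 5000      # thousands of 4 MB files
healthy = [256 * MB] * 40         # a few dozen well-sized files
print(needs_compaction(fragmented), needs_compaction(healthy))
```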

What happens to old files after compaction?

Old files remain on object storage until the snapshot that references them expires. Snapshot expiration is controlled by history.expire.min-snapshots-to-keep and history.expire.max-snapshot-age-ms. After a snapshot expires, running expireSnapshots() (Spark) or the equivalent catalog maintenance API removes the unreferenced files.
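How the two retention settings interact can be sketched as follows. This is a simplified model that assumes a linear snapshot history with unique timestamps; the function name is hypothetical, and Iceberg's real expiration logic also accounts for branches and tags.

```python
def expirable_snapshots(snapshot_timestamps_ms, now_ms, max_age_ms, min_to_keep):
    """Return timestamps of snapshots that may be expired.

    A snapshot is expirable when it is older than max_age_ms and is not among
    the min_to_keep most recent snapshots. Simplified, linear-history model.
    """
    newest_first = sorted(snapshot_timestamps_ms, reverse=True)
    protected = set(newest_first[:min_to_keep])   # always retained
    cutoff = now_ms - max_age_ms
    return sorted(ts for ts in newest_first if ts < cutoff and ts not in protected)


DAY_MS = 24 * 60 * 60 * 1000
snapshots = [1 * DAY_MS, 2 * DAY_MS, 3 * DAY_MS, 9 * DAY_MS]
old = expirable_snapshots(snapshots, now_ms=10 * DAY_MS,
                          max_age_ms=5 * DAY_MS, min_to_keep=1)
print([ts // DAY_MS for ts in old])
```

Files written before compaction become deletable only once every snapshot that references them appears in this expirable set and has actually been expired.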

Can I use both Spark compaction and RisingWave built-in compaction on the same table?

Yes, but it requires coordination. RisingWave's built-in compaction commits snapshots through the Iceberg catalog, just like Spark. Running both simultaneously can cause compaction operations to rewrite the same files twice, wasting resources without additional benefit. The recommended approach is to pick one and disable the other: if you rely on RisingWave for continuous ingestion with low compaction lag, turn off the Spark compaction schedule for that table. Reserve Spark for periodic Z-order rewrites on a less frequent cadence (weekly or monthly) if advanced layout is needed.
