The Iceberg Small File Problem: How to Solve It Systematically

The Iceberg Small File Problem: How to Solve It Systematically

Streaming ingestion into Apache Iceberg creates a compounding problem: every commit cycle produces new Parquet files, and without automatic merging, those files accumulate. The Iceberg small file problem degrades query performance, inflates metadata, and drives up cloud storage costs — all without any obvious warning signs until the damage is done.

What the Small File Problem Actually Is

Apache Iceberg stores data as immutable Parquet files. Every time a writer commits a batch of rows, Iceberg creates one or more new files and registers them in its metadata layer (manifest files, manifest lists, and the table snapshot chain).

In batch pipelines, commits happen infrequently — maybe once per hour. Each commit produces large, well-sized files in the 128MB to 512MB range.

In streaming pipelines, commits happen constantly. A pipeline committing every 30 or 60 seconds with modest throughput produces files that are kilobytes or low megabytes in size. These are small files — and Iceberg does not merge them on write.

Why Small Files Hurt

Object store LIST operations. Iceberg query engines must read the manifest files to discover which data files to scan. When there are thousands of tiny files per partition, the manifest grows large and the object store has to serve many LIST and HEAD requests before a single row is returned.

JVM and file handle overhead. Engines like Spark and Trino open a separate file handle, apply predicate pushdown, and allocate column readers for each file. Opening 50,000 files versus 500 files introduces significant JVM overhead that shows up as slower first-byte latency and higher memory usage.

Iceberg metadata bloat. Each data file is tracked in a manifest entry. As snapshots accumulate, the manifest list grows. Snapshot expiration helps, but between expiration runs the metadata layer inflates proportionally to the number of files written.

S3 request costs. AWS charges per GET and LIST request. A table with 50,000 small files costs materially more to query than a table with 500 well-sized files, even if the total bytes are identical.

The Root Cause in Streaming

Streaming writers commit on a fixed checkpoint interval. RisingWave, Flink, and Kafka Connect each have their own knob for this, but the problem is structural: throughput at commit time determines file size.

If your streaming pipeline produces 5 MB/s of data and commits every 30 seconds, each commit writes 150 MB total. Spread across 10 partitions, that is 15 MB per partition per commit — already on the small side. Spread across 50 partitions, it is 3 MB per partition per commit. Sub-MB files appear quickly when partitioning is high or throughput is uneven.

Quantifying the Problem

Consider a table with 10 partitions, committed every 60 seconds.

  • Files per hour per partition: 60
  • Files per day per partition: 1,440
  • Total files across 10 partitions after 24 hours: 14,400

After one week, that table contains over 100,000 data files. Iceberg's manifest files grow proportionally. Query planning time alone can reach several seconds before any data is actually read.

At 50 partitions and 30-second commits, the same math produces over 1,000,000 files per week. Object stores begin throttling LIST requests at these scales.

Solution 1: Increase the Commit Interval

The simplest lever is to commit less frequently. Fewer commits means fewer, larger files.

In RisingWave:

CREATE SINK my_sink
FROM my_source
WITH (
    connector = 'iceberg',
    type = 'append-only',
    ...
    commit_checkpoint_interval = 10
);

commit_checkpoint_interval = 10 means the sink commits to Iceberg every 10 RisingWave checkpoints. If checkpoints fire every 60 seconds, that is one Iceberg commit every 10 minutes.

In Flink:

FlinkSink.forRowType(rowType)
    .tableLoader(tableLoader)
    .flinkWriterConfig(
        IcebergSinkOptions.WRITE_INTERVAL.key(), "600s"
    )
    .build();

Trade-off: Longer commit intervals mean higher end-to-end latency. Data written at T=0 is not visible to readers until T=interval. For pipelines where freshness is critical (under 1 minute), this approach alone is insufficient. Use it in combination with compaction.

Recommended starting point: 5-10 minutes for analytics workloads tolerating some latency. Keep it at 30-60 seconds only if your throughput per partition exceeds 64 MB per commit interval.

Solution 2: Spark-Based Compaction Jobs

Spark's SparkActions.rewriteDataFiles() is the most widely documented approach for compaction. It reads small files, rewrites them into larger target-sized files, and updates the Iceberg snapshot atomically.

from pyspark.sql import SparkSession
from pyiceberg.catalog import load_catalog

spark = SparkSession.builder \
    .config("spark.sql.extensions", 
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .getOrCreate()

spark.sql("""
CALL catalog.system.rewrite_data_files(
    table => 'db.my_table',
    strategy => 'binpack',
    options => map(
        'target-file-size-bytes', '134217728',
        'min-file-size-bytes', '67108864',
        'max-concurrent-file-group-rewrites', '5'
    )
)
""")

The binpack strategy groups small files into bins that approach the target size (128 MB in this example). The sort strategy additionally reorders rows by a sort key, which improves compression and predicate pushdown but costs more CPU.

Schedule via Airflow:

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

compact_iceberg = SparkSubmitOperator(
    task_id='compact_iceberg_table',
    application='/scripts/compact_table.py',
    conf={
        'spark.executor.memory': '4g',
        'spark.executor.cores': '2',
    },
    schedule_interval='@hourly',
    dag=dag,
)

Downsides of this approach:

  • Requires a running Spark cluster, which adds infrastructure cost even when not actively compacting.
  • Scheduling lag: data written 55 minutes ago may not be compacted yet when an hourly job runs.
  • During the rewrite, readers may observe both the old small files and the new large files in flight, depending on isolation level.
  • Operational burden: you own the job, the cluster, and the failure handling.

Apache Flink provides an equivalent compaction action for environments already running Flink clusters.

import org.apache.iceberg.flink.actions.Actions;

Actions.forTable(table)
    .rewriteDataFiles()
    .targetSizeInBytes(128 * 1024 * 1024) // 128 MB
    .execute();

This can be wrapped in a periodic Flink job:

StreamExecutionEnvironment env = 
    StreamExecutionEnvironment.getExecutionEnvironment();
env.setRestartStrategy(
    RestartStrategies.fixedDelayRestart(3, Time.minutes(1)));

// Run compaction on a keyed timer or external trigger
env.execute("iceberg-compaction");

Downsides: The Flink rewrite action shares most of the same operational overhead as Spark compaction. You need a dedicated job, checkpoint storage, and a way to handle job failures. It also requires careful coordination to avoid compacting while the streaming ingest job is writing to the same partitions.

Solution 4: Built-In Compaction in RisingWave

RisingWave's Iceberg sink includes a built-in compaction loop that runs automatically after each commit cycle. There is no external job, no separate cluster, and no scheduling configuration required.

After RisingWave commits a batch of small files to Iceberg, it identifies files below a configurable size threshold within the same partition, groups them, rewrites them into target-sized files, and updates the snapshot — all within the same process that handles ingestion.

Configuration:

CREATE SINK my_sink
FROM my_source
WITH (
    connector = 'iceberg',
    type = 'append-only',
    catalog.type = 'rest',
    catalog.uri = 'http://catalog:8181',
    warehouse.path = 's3://my-bucket/warehouse',
    database.name = 'analytics',
    table.name = 'events',
    commit_checkpoint_interval = 5,
    compaction.enabled = 'true',
    compaction.target_file_size_bytes = '134217728'
);

Because compaction runs inline with ingestion, files are merged before they accumulate. A table that would have 14,400 files after 24 hours with no compaction stabilizes at a much lower file count — typically close to the number of files that fit the data volume at the target file size.

What happens under the hood:

  1. RisingWave writes N small files to the partition as usual.
  2. After the snapshot is committed, the compaction loop scans the partition for files below the merge threshold.
  3. Qualifying files are read, concatenated, and rewritten as a single larger file.
  4. A new Iceberg snapshot is committed that replaces the small files with the compacted file.
  5. The old small files become eligible for expiration on the next snapshot expiration cycle.

This approach eliminates the main sources of operational overhead: no Spark cluster to manage, no scheduling gap between ingest and compaction, no risk of compaction running while the partition is also being written by a separate job.

Comparison: External Compaction vs. Built-In Compaction

DimensionSpark/Flink CompactionRisingWave Built-In
Infrastructure requiredSeparate Spark or Flink clusterNone (runs inside RisingWave)
SchedulingAirflow or cronAutomatic, per commit cycle
Compaction lagMinutes to hoursSeconds (same cycle)
Operational burdenHigh (job management, failure handling)Low (one config flag)
Conflict riskHigh (concurrent writes possible)None (same process)
CostCluster compute costIncluded in RisingWave process
Works with existing clustersYesRequires RisingWave as sink

If you are already using RisingWave for streaming ingestion, built-in compaction is the lower-cost, lower-risk path. If you are using Flink or Kafka Connect for ingestion, external compaction is the only option, and Spark's rewriteDataFiles is the most mature implementation.

Monitoring File Count

Proactive monitoring prevents the problem from compounding undetected.

Query file count per partition using Iceberg metadata tables:

SELECT
    partition,
    COUNT(*) AS file_count,
    SUM(file_size_in_bytes) / 1024 / 1024 AS total_mb,
    AVG(file_size_in_bytes) / 1024 / 1024 AS avg_file_size_mb
FROM my_catalog.db.my_table.files
GROUP BY partition
ORDER BY file_count DESC;

Set an alert threshold. A partition with over 1,000 files is a signal that compaction is not keeping up. At 10,000 files, query planning time becomes measurable. At 100,000 files, queries can time out.

Track snapshot count as a leading indicator:

SELECT COUNT(*) AS snapshot_count
FROM my_catalog.db.my_table.snapshots
WHERE committed_at > NOW() - INTERVAL '24 hours';

High snapshot counts correlate with high file counts when compaction is not running.

Prometheus integration: If you are running RisingWave, the built-in metrics endpoint exposes Iceberg sink metrics including files written per checkpoint, compaction runs, and compaction file reduction ratio. Wire these to Grafana and set alerts at your chosen thresholds.

Best Practices Summary

ParameterRecommendation
Commit interval (streaming)5-10 minutes for analytics, 30-60 seconds only for sub-minute freshness requirements
Target file size128 MB minimum, 256-512 MB for large partitions
Compaction strategybinpack for throughput, sort for query performance on filtered columns
Storage formatParquet with Zstd compression
Snapshot expirationRun daily, retain 7 days minimum
File count alert threshold1,000 files per partition
Monitoring cadenceCheck file counts hourly, review weekly

FAQ

Q: Can I run compaction while streaming ingestion is actively writing to the same table?

Yes, but with care. Iceberg's optimistic concurrency model allows concurrent writers, but a compaction job that rewrites files in a partition while a streaming writer commits to the same partition can produce conflicts. The compaction commit may fail and retry. RisingWave's built-in compaction avoids this because compaction and ingestion are coordinated within the same process.

Q: How often should I run external compaction?

For a table receiving continuous streaming data, run compaction at least hourly. If your commit interval is 60 seconds and you have 10 partitions, you accumulate 600 files per hour. Hourly compaction keeps this manageable. For higher-throughput tables, run every 15-30 minutes.

Q: Does compaction affect read availability?

No. Iceberg's snapshot isolation guarantees that readers see a consistent view of the table at a specific snapshot. Compaction commits a new snapshot atomically. Readers on the old snapshot continue reading uninterrupted. Readers that pick up the new snapshot see compacted files. There is no read downtime.

Q: What is the right target file size for compaction?

128 MB is a widely used default and works well for most workloads. For partitions that are queried with highly selective filters, larger files (256-512 MB) reduce the number of row groups scanned per file. For partitions with many small queries, 128 MB keeps individual file reads fast. Avoid going below 64 MB or above 1 GB.

Q: Does the Iceberg small file problem affect Delta Lake or Hudi the same way?

Yes. All open table formats that use immutable file semantics accumulate small files from frequent streaming commits. Delta Lake has OPTIMIZE (manual or automatic with Databricks). Hudi has clustering and compaction built into its writer model, though the behavior differs by table type (Copy-on-Write vs. Merge-on-Read). The root cause and monitoring approach are the same across formats.

Best-in-Class Event Streaming
for Agents, Apps, and Analytics
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.