Iceberg Compaction: Solving the Small Files Problem

Iceberg compaction merges many small data files into fewer large files, improving query performance and reducing metadata overhead. Streaming ingestion is the main source of the small files problem: thousands of tiny Parquet files (1-10 MB each) that degrade reads. After compaction, the typical target file size is 256-512 MB.

The Small Files Problem

Streaming writes (Flink, RisingWave) create a new Parquet file at each checkpoint interval:

  • 1-minute checkpoints × 24 hours = 1,440 files per day per partition
  • Each file may be only 1-10 MB
  • Query engines must open every file, read metadata, and scan — causing slow queries
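To gauge how severe the problem is on an existing table, Iceberg's `files` metadata table can be queried for per-partition file counts and sizes. A minimal sketch in Spark SQL, assuming a table named `db.events` (the name is a placeholder):

```sql
-- Count files and average file size per partition.
-- Column names come from Iceberg's standard `files` metadata table.
SELECT
  partition,
  COUNT(*)                      AS file_count,
  AVG(file_size_in_bytes) / 1e6 AS avg_file_mb
FROM db.events.files
GROUP BY partition
ORDER BY file_count DESC;
```

Partitions with thousands of files averaging a few megabytes each are prime compaction candidates.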

Compaction Approaches

| Approach      | How                             | Automation          | Used By      |
|---------------|---------------------------------|---------------------|--------------|
| Manual        | Run rewriteDataFiles action     | Scheduled (Airflow) | Flink, Spark |
| Automatic     | Built into the streaming engine | No maintenance      | RisingWave   |
| Service-based | Managed compaction service      | External            | AWS, Tabular |
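For the manual approach, Iceberg exposes the rewriteDataFiles action as a Spark procedure named `rewrite_data_files`. A sketch of a scheduled compaction call, where the catalog and table names are placeholders:

```sql
-- Merge small files into ~512 MB files with Iceberg's
-- rewrite_data_files Spark procedure.
CALL my_catalog.system.rewrite_data_files(
  table   => 'db.events',
  options => map(
    'target-file-size-bytes', '536870912',  -- 512 MB target
    'min-input-files', '5'                  -- skip already-compact partitions
  )
);
```

This is the statement an Airflow DAG would typically run on a schedule against a streaming table.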

RisingWave Automatic Compaction

RisingWave automatically compacts Parquet files in Iceberg — no separate compaction jobs, no Airflow DAGs, no maintenance:

-- Just create the sink. Compaction happens automatically.
CREATE SINK to_iceberg AS SELECT * FROM events
WITH (connector = 'iceberg', type = 'append-only', ...);

Best Practices

  1. Target file size: 256-512 MB (files <128 MB create excessive overhead)
  2. Compaction frequency: Every 1-4 hours for streaming tables
  3. Monitor file count: Alert when file count per partition exceeds thresholds
  4. Sort order: Compact with sort order matching common query patterns
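Practice 4 can be applied with the same `rewrite_data_files` procedure by selecting the sort strategy. A sketch, assuming queries commonly filter on an `event_time` column (the column name is illustrative):

```sql
-- Rewrite files sorted on a frequently filtered column so that
-- per-file min/max statistics let query engines skip whole files.
CALL my_catalog.system.rewrite_data_files(
  table      => 'db.events',
  strategy   => 'sort',
  sort_order => 'event_time DESC NULLS LAST'
);
```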

Frequently Asked Questions

How often should I compact Iceberg tables?

For streaming tables, compact every 1-4 hours. RisingWave handles this automatically. For batch tables, compact after each large write operation.

Does compaction block queries?

No. Iceberg's snapshot isolation means compaction creates new files in a new snapshot while existing queries continue reading the old snapshot. There is no downtime.
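One consequence of this design: the old small files stay referenced by earlier snapshots, so compaction alone does not reclaim storage. Expiring old snapshots after compaction removes those references and deletes the orphaned files. A sketch using Iceberg's `expire_snapshots` Spark procedure (the timestamp and retention count are illustrative):

```sql
-- Expire snapshots older than a cutoff, keeping at least 10;
-- this deletes the pre-compaction small files they referenced.
CALL my_catalog.system.expire_snapshots(
  table       => 'db.events',
  older_than  => TIMESTAMP '2024-01-01 00:00:00',
  retain_last => 10
);
```

Keep the retention window longer than your longest-running query or time-travel requirement.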
