Iceberg Compaction: Solving the Small Files Problem

Iceberg compaction merges many small data files into fewer large files, improving query performance and reducing metadata overhead. Streaming ingestion is the main source of the small files problem: thousands of tiny Parquet files (1-10 MB each) that degrade reads. After compaction, the typical target file size is 256-512 MB.

The Small Files Problem

Streaming writes (Flink, RisingWave) create a new Parquet file at each checkpoint interval:

  • 1-minute checkpoints × 24 hours = 1,440 files per day per partition
  • Each file may be only 1-10 MB
  • Query engines must open every file, read metadata, and scan — causing slow queries
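To gauge how severe the problem is on an existing table, Iceberg's `files` metadata table can be queried for per-partition file counts and sizes. A minimal sketch in Spark SQL, assuming a table named `db.events` (the name is a placeholder):

```sql
-- Count files and average file size per partition.
-- Column names come from Iceberg's standard `files` metadata table.
SELECT
  partition,
  COUNT(*)                      AS file_count,
  AVG(file_size_in_bytes) / 1e6 AS avg_file_mb
FROM db.events.files
GROUP BY partition
ORDER BY file_count DESC;
```

Partitions with thousands of files averaging a few megabytes each are prime compaction candidates.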

Compaction Approaches

| Approach      | How                             | Automation          | Used By      |
|---------------|---------------------------------|---------------------|--------------|
| Manual        | Run rewriteDataFiles action     | Scheduled (Airflow) | Flink, Spark |
| Automatic     | Built into the streaming engine | No maintenance      | RisingWave   |
| Service-based | Managed compaction service      | External            | AWS, Tabular |
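For the manual approach, Iceberg exposes the rewriteDataFiles action as a Spark procedure named `rewrite_data_files`. A sketch of a scheduled compaction call, where the catalog and table names are placeholders:

```sql
-- Merge small files into ~512 MB files with Iceberg's
-- rewrite_data_files Spark procedure.
CALL my_catalog.system.rewrite_data_files(
  table   => 'db.events',
  options => map(
    'target-file-size-bytes', '536870912',  -- 512 MB target
    'min-input-files', '5'                  -- skip already-compact partitions
  )
);
```

This is the statement an Airflow DAG would typically run on a schedule against a streaming table.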

RisingWave Automatic Compaction

RisingWave automatically compacts Parquet files in Iceberg — no separate compaction jobs, no Airflow DAGs, no maintenance:

-- Just create the sink. Compaction happens automatically.
CREATE SINK to_iceberg AS SELECT * FROM events
WITH (connector = 'iceberg', type = 'append-only', ...);

Best Practices

  1. Target file size: 256-512 MB (files <128 MB create excessive overhead)
  2. Compaction frequency: Every 1-4 hours for streaming tables
  3. Monitor file count: Alert when file count per partition exceeds thresholds
  4. Sort order: Compact with sort order matching common query patterns
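Practice 4 can be applied with the same `rewrite_data_files` procedure by selecting the sort strategy. A sketch, assuming queries commonly filter on an `event_time` column (the column name is illustrative):

```sql
-- Rewrite files sorted on a frequently filtered column so that
-- per-file min/max statistics let query engines skip whole files.
CALL my_catalog.system.rewrite_data_files(
  table      => 'db.events',
  strategy   => 'sort',
  sort_order => 'event_time DESC NULLS LAST'
);
```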

Frequently Asked Questions

How often should I compact Iceberg tables?

For streaming tables, compact every 1-4 hours. RisingWave handles this automatically. For batch tables, compact after each large write operation.

Does compaction block queries?

No. Iceberg's snapshot isolation means compaction creates new files in a new snapshot while existing queries continue reading the old snapshot. There is no downtime.
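One consequence of this design: the old small files stay referenced by earlier snapshots, so compaction alone does not reclaim storage. Expiring old snapshots after compaction removes those references and deletes the orphaned files. A sketch using Iceberg's `expire_snapshots` Spark procedure (the timestamp and retention count are illustrative):

```sql
-- Expire snapshots older than a cutoff, keeping at least 10;
-- this deletes the pre-compaction small files they referenced.
CALL my_catalog.system.expire_snapshots(
  table       => 'db.events',
  older_than  => TIMESTAMP '2024-01-01 00:00:00',
  retain_last => 10
);
```

Keep the retention window longer than your longest-running query or time-travel requirement.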
