# Iceberg Compaction: Solving the Small Files Problem
Iceberg compaction merges many small data files into fewer large ones, improving query performance and reducing metadata overhead. Streaming ingestion is the usual cause of the small files problem: thousands of tiny Parquet files (1-10 MB each) that degrade read performance. A common target after compaction is 256-512 MB per file.
## The Small Files Problem
Streaming writes (Flink, RisingWave) create a new Parquet file at each checkpoint interval:
- 1-minute checkpoints × 24 hours = 1,440 files per day per partition
- Each file may be only 1-10 MB
- Query engines must open every file, read its footer metadata, and scan it, which makes both planning and execution slow
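You can gauge how bad the problem is by querying Iceberg's `files` metadata table, which exposes one row per data file. A minimal Spark SQL sketch, assuming an illustrative table name `db.events` and a 128 MB "small file" threshold:

```sql
-- Count undersized data files per partition using the "files" metadata table.
-- db.events is a hypothetical table name; 134217728 bytes = 128 MB.
SELECT
  partition,
  COUNT(*)                      AS small_file_count,
  AVG(file_size_in_bytes) / 1e6 AS avg_size_mb
FROM db.events.files
WHERE file_size_in_bytes < 134217728
GROUP BY partition
ORDER BY small_file_count DESC;
```

A query like this is also a reasonable basis for the file-count alerting recommended under Best Practices below.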
## Compaction Approaches
| Approach | Mechanism | Automation | Used By |
|----------|-----------|------------|---------|
| Manual | Run the `rewriteDataFiles` action | Scheduled externally (e.g., Airflow) | Flink, Spark |
| Automatic | Built into the streaming engine | No maintenance required | RisingWave |
| Service-based | Managed compaction service | Handled by the service | AWS, Tabular |
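For the manual approach, Iceberg also exposes the `rewriteDataFiles` action as a Spark stored procedure, which is convenient to invoke from a scheduled job. A sketch, assuming illustrative catalog (`my_catalog`) and table (`db.events`) names:

```sql
-- Binpack many small files into ~512 MB outputs.
-- my_catalog and db.events are assumptions; substitute your own names.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'binpack',
  options => map('target-file-size-bytes', '536870912')  -- 512 MB
);
```

A scheduler such as Airflow would typically run this call on a fixed interval per table.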
## RisingWave Automatic Compaction
RisingWave automatically compacts the Parquet files it writes to Iceberg, with no separate compaction jobs, no Airflow DAGs, and no ongoing maintenance:
```sql
-- Just create the sink. Compaction happens automatically.
CREATE SINK to_iceberg AS SELECT * FROM events
WITH (connector = 'iceberg', type = 'append-only', ...);
```
## Best Practices
- Target file size: 256-512 MB (files <128 MB create excessive overhead)
- Compaction frequency: Every 1-4 hours for streaming tables
- Monitor file count: Alert when file count per partition exceeds thresholds
- Sort order: Compact with sort order matching common query patterns
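The sort-order recommendation maps to the `sort` strategy of the same Spark procedure. A sketch, again with illustrative catalog, table, and column names:

```sql
-- Rewrite files sorted by event_time so queries filtering on time
-- can prune more files. Names here are illustrative.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'sort',
  sort_order => 'event_time DESC NULLS LAST'
);
```

Sorting during compaction costs more than binpacking, so it pays off mainly when the sort column matches frequent query predicates.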
## Frequently Asked Questions
### How often should I compact Iceberg tables?
For streaming tables, compact every 1-4 hours. RisingWave handles this automatically. For batch tables, compact after each large write operation.
### Does compaction block queries?
No. Iceberg's snapshot isolation means compaction creates new files in a new snapshot while existing queries continue reading the old snapshot. There is no downtime.
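One consequence of snapshot isolation is that the compacted-away files are not deleted immediately: they stay referenced by older snapshots until those snapshots expire. Space is reclaimed with the `expire_snapshots` procedure; a sketch with illustrative names and retention values:

```sql
-- Expire snapshots older than 7 days (keeping at least 5), which lets
-- Iceberg physically delete data files referenced only by those snapshots.
-- my_catalog/db.events and the retention values are illustrative.
CALL my_catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => current_timestamp() - INTERVAL 7 DAYS,
  retain_last => 5
);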