Data Lake Compaction is an essential maintenance and optimization process performed on tables within Data Lakes and Data Lakehouses, particularly those managed by Open Table Formats like Apache Iceberg, Delta Lake, and Apache Hudi. It involves rewriting many small data files in a table into fewer, larger ones.
The primary goal of compaction is to mitigate the "small files problem" often encountered in big data systems, thereby improving query performance and reducing metadata management overhead.
Streaming ingestion, frequent updates, or certain batch processing patterns can lead to the accumulation of numerous small files in data lake tables. This causes several issues:

- Query engines must open, read, and close many files, so per-file I/O and scheduling overhead dominates scan time.
- Table metadata (manifests, transaction logs, file listings) grows, making query planning slower and more memory-intensive.
- Columnar compression and data skipping work poorly on tiny files, inflating storage and scan costs.
- Request counts against object storage, and therefore costs, increase.
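With Apache Iceberg, for example, the extent of the problem can be inspected through the table's `files` metadata table. The sketch below assumes a Spark session configured with an Iceberg catalog named `lake` and a table `lake.db.events`; both names, and the 32 MB "small file" cutoff, are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("small-files-report").getOrCreate()

# One row per live data file in the current table snapshot.
files = spark.table("lake.db.events.files")

report = files.agg(
    F.count("*").alias("file_count"),
    F.avg("file_size_in_bytes").alias("avg_file_size_bytes"),
    F.sum(
        F.when(F.col("file_size_in_bytes") < 32 * 1024 * 1024, 1).otherwise(0)
    ).alias("files_under_32mb"),
)
report.show(truncate=False)
```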
Compaction processes typically perform the following steps, often as a background job:

1. Select candidate files, for example data files below a size threshold, optionally scoped to specific partitions.
2. Read the records from those files.
3. Rewrite the records into fewer, larger files at a target size (commonly hundreds of megabytes).
4. Commit the change atomically as a new table snapshot or transaction that references the new files instead of the old ones, so readers and concurrent writers are not blocked.
5. Leave the replaced files in place for later cleanup (e.g., snapshot expiration or vacuum), preserving time travel and in-flight queries.
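In Apache Iceberg, these steps are bundled into the built-in `rewrite_data_files` Spark procedure, so a basic compaction job can be as small as the sketch below. The catalog name `lake`, the table `db.events`, and the size thresholds are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-events").getOrCreate()

# Bin-pack small files into ~512 MB outputs; the procedure commits a new
# snapshot and returns counts of rewritten and added data files.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'binpack',
        options => map(
            'target-file-size-bytes', '536870912',
            'min-input-files', '5'
        )
    )
""").show()
```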
Compaction Strategies:

- Bin-packing: combines small files into larger ones of a target size with minimal data reorganization; it is the cheapest and most common strategy.
- Sort-based compaction: rewrites files ordered by one or more columns, improving clustering and data skipping at the cost of a more expensive rewrite.
- Format-specific variants: for example, Delta Lake's OPTIMIZE with Z-ordering or Apache Hudi's clustering, which co-locate related data across files.
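As a rough illustration of the second strategy, Iceberg's `rewrite_data_files` procedure accepts `strategy => 'sort'` together with a sort order. The sketch below reuses the placeholder catalog and table from the previous examples; the sort columns are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-compact-events").getOrCreate()

# Sort-based compaction: rewrite files ordered by the given columns to improve
# clustering and data skipping for queries that filter on them.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'sort',
        sort_order => 'event_date DESC NULLS LAST, user_id ASC NULLS FIRST'
    )
""").show()
```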
Triggering Compaction:
Compaction can be triggered manually, scheduled periodically (e.g., nightly), or run automatically by the engine or a table maintenance service based on heuristics, such as the number or total size of small files accumulated in a partition.
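A simple heuristic trigger might count undersized files from the table's metadata and compact only when a threshold is crossed. The sketch below is a hypothetical example along those lines (placeholder catalog, table, and thresholds again); in practice it would be run on a schedule by cron or a workflow orchestrator.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

SMALL_FILE_BYTES = 32 * 1024 * 1024   # files below ~32 MB count as "small"
MAX_SMALL_FILES = 100                 # compact once this many accumulate

spark = SparkSession.builder.appName("compaction-trigger").getOrCreate()

# Count undersized data files via the Iceberg "files" metadata table.
small_files = (
    spark.table("lake.db.events.files")
    .filter(F.col("file_size_in_bytes") < SMALL_FILE_BYTES)
    .count()
)

# Only pay the cost of a rewrite when enough small files have piled up.
if small_files > MAX_SMALL_FILES:
    spark.sql(
        "CALL lake.system.rewrite_data_files(table => 'db.events', strategy => 'binpack')"
    ).show()
```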
When using RisingWave to sink data into an Iceberg table in a Streaming Lakehouse architecture, compaction becomes important: the sink commits new data continuously, and each frequent commit tends to add small data files and a new snapshot. Without regular compaction (and related maintenance such as snapshot expiration), query performance and metadata size degrade for the downstream engines reading the table.
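In this setup, compaction is usually paired with the other routine Iceberg maintenance procedures. The sketch below shows one possible periodic maintenance job, run by an external engine such as Spark against the table the RisingWave sink writes to; the catalog name, table name, and retention setting are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

TABLE = "db.events_from_risingwave"   # placeholder: the table the sink writes to

# 1. Compact the small files produced by frequent streaming commits.
spark.sql(
    f"CALL lake.system.rewrite_data_files(table => '{TABLE}', strategy => 'binpack')"
).show()

# 2. Expire old snapshots so table metadata and storage do not grow unbounded.
spark.sql(
    f"CALL lake.system.expire_snapshots(table => '{TABLE}', retain_last => 20)"
).show()

# 3. Delete files that are no longer referenced by any snapshot.
spark.sql(
    f"CALL lake.system.remove_orphan_files(table => '{TABLE}')"
).show()
```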