Why Your Iceberg Ingestion Pipeline Breaks Under Scale (And How to Fix It)

Why Your Iceberg Ingestion Pipeline Breaks Under Scale (And How to Fix It)

·

10 min read

TL;DR

The three most common Iceberg ingestion pipeline failures at scale are small file explosion from frequent checkpoints, commit conflicts from concurrent writers, and compaction debt that steadily degrades query performance. These are not configuration issues you can tune away. They are structural problems with how Flink and Spark interact with Iceberg. RisingWave's streaming SQL architecture addresses all three at the engine level.


Problem 1: Small File Explosion from Checkpoint-Driven Commits

If you run a Flink-to-Iceberg pipeline, you have encountered this. Every Flink checkpoint creates new data files in Iceberg, regardless of whether any data was actually written during that interval.

The Apache Iceberg GitHub tracker has documented this extensively. Issue #2404, filed against the Flink stream writer, describes exactly this behavior: "empty files will be created every 3 seconds, even when no data is written." With a checkpoint interval of three seconds and moderate write parallelism, a single pipeline can produce tens of thousands of small files per hour, most of them empty or containing only a handful of rows.

Issue #8510 describes the same root cause from a different angle: a user with a five-minute checkpoint interval was generating so many small files that they were forced to schedule a rewrite_data_files Spark job just to keep the table functional.

The checkpoint interval creates a hard tradeoff. If you want low latency, you set short checkpoints. But short checkpoints produce many small files. If you want fewer small files, you increase the checkpoint interval, but then your downstream consumers see data that is minutes old. The Flink documentation on checkpointing recommends production checkpoint intervals of 10 to 15 minutes as a starting point, which means your Iceberg table can be up to 15 minutes behind your source.

There is no checkpoint interval that makes both problems disappear. You are choosing which problem to have.

The small files compound into a metadata problem over time. Iceberg tracks every data file in manifest files and manifest lists. More small files means larger manifests. Larger manifests means slower query planning. Research from Starburst shows that query performance on TPC-DS benchmarks degraded 1.5x when just three percent of data was modified through small-file inserts and deletions. At scale, this effect is not linear.

Problem 2: Commit Conflicts from Concurrent Writers

Iceberg uses optimistic concurrency control. Writers assume they are the only ones modifying the table and only check for conflicts at commit time. This design works well for batch workloads with infrequent writes. It breaks down when you add streaming ingestion.

In a typical modern data stack, an Iceberg table has several concurrent writers:

  • A Flink streaming pipeline writing new events
  • A Spark job running rewrite_data_files for compaction
  • A CDC connector syncing database changes
  • Snapshot expiration jobs cleaning up old metadata

Research from Ryft documents what happens when these writers collide: "Pipeline failures: Streaming or CDC pipelines encountering repeated conflicts fail entirely, stopping data writes and disrupting downstream systems." More damaging is the wasted compute: "a single failed compaction can burn thousands of cluster-hours."

Metadata conflicts can be retried automatically. Data conflicts, where two writers modify the same partition, cannot. The entire operation must restart from scratch.

The feedback loop is brutal. Compaction jobs run longer because there are more small files. Longer jobs have wider conflict windows. Wider conflict windows produce more failures. Failed compactions leave more small files. The table gets slower and the situation gets worse over time.

The recommended mitigation from the community, avoid compacting actively-written partitions, requires operational discipline that is difficult to maintain. You need to know which partitions are currently being written to and configure compaction to skip them. This is not something you configure once. It requires ongoing coordination between your streaming pipeline and your maintenance jobs.

Problem 3: The Mandatory Compaction Tax

Every Iceberg pipeline that writes frequently needs a compaction job running alongside it. This is not an edge case or a best practice. It is a hard requirement.

The Apache Iceberg documentation on Spark procedures provides the rewrite_data_files procedure specifically because compaction is expected to be a regular operational task. The most common mistake in streaming Iceberg architectures is deploying the stream processor without the compaction service. Query performance degrades within days.

The compaction burden has several dimensions:

Compute cost. Running rewrite_data_files requires Spark, which requires a cluster. You are paying for this cluster continuously. For high-throughput pipelines writing every one to five minutes, compaction needs to run every 30 to 60 minutes to stay ahead of the small file problem.

Operational complexity. Someone needs to own the compaction job. It needs to be scheduled, monitored, and tuned. When it fails, someone needs to respond. When it falls behind, query performance degrades and the failure loop from Problem 2 kicks in.

Timing conflicts. Compaction and ingestion compete for the same partitions. Running compaction too aggressively creates the commit conflicts described in Problem 2. Running it too conservatively lets the small file problem grow.

The Quesma blog's 2025 analysis of Apache Iceberg production limitations names the compaction requirement as a top-level practical constraint: metadata bloat, small files, and catalog conflicts only show up at scale, and they all require ongoing manual intervention to manage.

This is not a problem you solve once during setup. It is ongoing operational work that scales with your data volume.


How RisingWave Solves These Problems

RisingWave approaches Iceberg ingestion differently at the architectural level. Instead of coupling commit frequency to checkpoint frequency, RisingWave separates these concerns explicitly.

Decoupling Commit Frequency from Checkpoint Frequency

RisingWave's Iceberg sink uses a configurable commit_checkpoint_interval that controls how many checkpoints pass before data is committed to Iceberg. The default is a 60-second commit interval. A larger value, say 300 seconds, means RisingWave buffers five minutes of data before writing it to Iceberg as a single, larger file.

More importantly, RisingWave also supports commit_checkpoint_size_threshold_mb (default: 128MB), which triggers commits based on data volume rather than time alone. When a batch of buffered data reaches the size threshold, RisingWave commits it to Iceberg as a single file of meaningful size, regardless of whether the time interval has elapsed.

This means your data freshness and your file sizes are no longer in direct conflict. You can have two-minute data freshness while still writing files that are large enough to not require immediate compaction.

Exactly-Once Semantics Without the Conflict Surface

RisingWave provides exactly-once semantics by default. It does not rely on the same optimistic concurrency model that creates conflict windows in Flink-to-Iceberg pipelines. The commit process is managed through RisingWave's internal checkpoint mechanism, which coordinates writes before committing.

This does not eliminate the need for compaction entirely. But it removes the primary source of mid-operation conflicts: RisingWave is writing larger, less frequent files, which means the compaction job runs less often and has a smaller conflict window when it does run.

SQL-First Transformation Before Write

Because RisingWave is a streaming SQL database, you can transform data before it reaches Iceberg using materialized views. The Iceberg sink receives pre-aggregated, pre-filtered data rather than raw event streams. Fewer rows means fewer files. Fewer files means less compaction pressure.


A Working Example

This example demonstrates the full flow: ingest raw events, compute a real-time summary using a materialized view, and sink that summary to Iceberg.

-- Step 1: Create the source table for raw events
CREATE TABLE raw_events (
    event_id BIGINT,
    event_type VARCHAR,
    user_id BIGINT,
    payload VARCHAR,
    ts TIMESTAMPTZ
);

-- Step 2: Insert sample data (in production, this comes from Kafka or CDC)
INSERT INTO raw_events VALUES
    (1, 'click', 100, '{"page": "home"}', NOW()),
    (2, 'purchase', 101, '{"amount": 99.99}', NOW()),
    (3, 'click', 100, '{"page": "product"}', NOW());

-- Step 3: Create a materialized view that pre-aggregates data
-- RisingWave maintains this incrementally as new events arrive
CREATE MATERIALIZED VIEW events_summary AS
SELECT
    event_type,
    COUNT(*) AS event_count,
    COUNT(DISTINCT user_id) AS unique_users
FROM raw_events
GROUP BY event_type;

-- Step 4: Verify the materialized view is populated
SELECT * FROM events_summary ORDER BY event_count DESC;
-- Result:
-- event_type | event_count | unique_users
-- click      |           2 |           1
-- purchase   |           1 |           1

-- Step 5: Sink the aggregated view to Iceberg
-- This writes compact, pre-aggregated files instead of raw event rows
CREATE SINK iceberg_events_summary
FROM events_summary
WITH (
    connector = 'iceberg',
    type = 'append-only',
    warehouse.path = 's3://your-bucket/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'events_summary',
    catalog.type = 'glue',
    commit_checkpoint_interval = '5'  -- commit every 5 checkpoints
);

The materialized view in Step 3 does the heavy lifting. Instead of sinking millions of raw click events to Iceberg (each producing small files), you sink a continuously-maintained aggregation. The Iceberg table receives compact summary rows at defined intervals rather than a constant stream of tiny files.

For CDC use cases, RisingWave also supports upsert mode sinks for Iceberg, which allows primary-key-based updates to Iceberg tables without separate delete file management.


Key Takeaways

  1. Small files are not a configuration problem. They are structural. Flink checkpoints create files on a timer. RisingWave commits based on both time and size thresholds, decoupling data freshness from file size.

  2. Commit conflicts scale with concurrency. The more writers and maintenance jobs touching an Iceberg table, the more failures you will see. Reducing write frequency and file count reduces the conflict surface.

  3. Compaction is mandatory for Flink/Spark pipelines, optional for RisingWave pipelines. You still want to run compaction on any active Iceberg table. But with RisingWave writing larger, less frequent files, you can run it on a much more relaxed schedule.

  4. Materialized views reduce Iceberg write pressure. Pre-aggregating data before it reaches the sink means fewer rows, fewer files, and less compaction debt over time.

  5. The three problems compound. Small files slow down queries. Slow queries create longer-running compaction jobs. Longer compaction jobs hit more conflicts. Conflicts leave more small files. This loop does not fix itself.


FAQ

Q: Can I fix the small file problem in Flink by increasing the checkpoint interval?

You can reduce it. With a 15-minute checkpoint interval, you produce 300x fewer files than with a 3-second interval. But you also accept 15 minutes of latency. The fundamental tradeoff remains: Flink checkpoint interval directly controls both your data freshness and your file production rate. RisingWave decouples these with separate commit_checkpoint_interval and commit_checkpoint_size_threshold_mb controls.

Q: Does RisingWave eliminate the need for Iceberg compaction entirely?

No. Any table that receives ongoing writes will accumulate files over time and benefit from periodic compaction. The difference is that RisingWave writes larger files at lower frequency, which means compaction runs less often, finishes faster, and creates narrower conflict windows. You still need compaction, but you need it much less urgently.

Q: What Iceberg catalog types does RisingWave support?

RisingWave supports AWS Glue, REST catalog, and Hive metastore as catalog backends for Iceberg sinks. For a full list of connector options and authentication parameters, see the RisingWave Iceberg sink documentation.


If you are running a Flink-to-Iceberg pipeline and spending significant time managing compaction, tuning checkpoint intervals, or debugging commit conflicts, the architecture you are working around may be the root cause. RisingWave's streaming SQL approach to Iceberg ingestion is worth evaluating as a replacement. The RisingWave GitHub repository has quickstart guides to get a local environment running in minutes.

Best-in-Class Event Streaming
for Agents, Apps, and Analytics
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.