Running Apache Iceberg in production with streaming workloads exposes failure modes that batch-only deployments never encounter: small file accumulation at high ingest rates, catalog contention during concurrent writes, and snapshot metadata bloat after months of operation. The teams that succeed treat Iceberg operations—compaction, expiration, and catalog health—as first-class production concerns, not afterthoughts.
Lesson 1: Small Files Are Your First Production Crisis
Every RisingWave checkpoint (default: 60 seconds) commits at least one Parquet file per active Iceberg partition. With 100 active partitions and a 60-second interval, you accumulate 144,000 files per day. Trino and Spark queries that touch these partitions spend more time on S3 list operations than actual data reads.
The fix is compaction, and it must be scheduled, not reactive. Step 1: set up a compaction job that runs every 15–30 minutes for hot (recent) partitions:
-- Example: Compact the last 2 hours of data in Spark
CALL system.rewrite_data_files(
table => 'payments.raw_events',
strategy => 'binpack',
options => map(
'target-file-size-bytes', '268435456',
'min-input-files', '5'
),
where => 'event_time >= current_timestamp - interval 2 hours'
);
Target file size of 128–256 MB balances scan performance with write overhead. The min-input-files = 5 threshold prevents compaction from running when there are already few files (no-op guard).
Step 2: Tune RisingWave Checkpoint Interval
For sinks where latency requirements allow, increase the checkpoint interval to reduce file creation rate:
CREATE SINK payments_iceberg AS
SELECT * FROM payments_summary
WITH (
connector = 'iceberg',
type = 'append-only',
catalog.type = 'rest',
catalog.uri = 'http://iceberg-catalog:8181',
warehouse.path = 's3://prod-lakehouse/warehouse',
s3.region = 'us-east-1',
database.name = 'payments',
table.name = 'hourly_summary',
commit_checkpoint_interval = 5
);
Setting commit_checkpoint_interval = 5 means the sink only commits to Iceberg every 5 RisingWave checkpoints; with the default 60-second checkpoint, that is one Iceberg commit roughly every 5 minutes, cutting the file creation rate by 5x.
Lesson 2: Snapshot Metadata Bloat Is Slow and Invisible
Iceberg's metadata grows with every snapshot. After 30 days of streaming writes at 60-second intervals, the metadata file references 43,200 snapshots. Reading this metadata file on every query adds latency that grows linearly with table age.
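To make this growth visible before it hurts, you can query Iceberg's snapshots metadata table from Spark, shown here against the payments.raw_events table used in the earlier examples:

```sql
-- Count the snapshots currently referenced by table metadata;
-- a number in the tens of thousands signals expiration is overdue
SELECT count(*) AS snapshot_count
FROM payments.raw_events.snapshots;
```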
Schedule snapshot expiration to keep metadata manageable:
-- Expire snapshots older than 7 days, keeping at least 1
-- (set the timestamp literal to "now minus 7 days" each run)
CALL system.expire_snapshots(
table => 'payments.raw_events',
older_than => TIMESTAMP '2026-03-27 00:00:00',
retain_last => 1
);
For most production workloads, a 7-day retention window balances time travel capability with metadata performance. Regulated industries (financial, healthcare) often require 90 days—factor catalog response time degradation into your SLA planning.
Lesson 3: Catalog Choice Has Operational Consequences
| Catalog | Concurrency Control | Managed Option | Failure Mode | Best For |
|---|---|---|---|---|
| REST (Polaris/Nessie) | Optimistic locking | Yes (managed Polaris) | Service downtime | Production, multi-engine |
| Hive Metastore | Table-level locks | Yes (AWS Glue) | HMS outage | Legacy Hive ecosystems |
| AWS Glue (REST adapter) | Optimistic locking | Fully managed | Regional outage | AWS-native deployments |
| JDBC (PostgreSQL) | Row-level locks | DIY | DB outage | Simple deployments |
| Hadoop (file-based) | None | No | Corruption risk | Dev/test only |
For production streaming workloads with RisingWave, the REST catalog (Polaris or AWS Glue with REST adapter) is the recommended choice. It supports concurrent writes from multiple RisingWave nodes without table-level locking.
Lesson 4: Build for Failure Recovery
RisingWave uses exactly-once semantics with its checkpoint protocol. If a RisingWave node crashes mid-checkpoint, the in-progress Iceberg write is rolled back and retried from the last committed checkpoint. However, the Iceberg catalog must be available for this rollback to succeed.
Build your operational runbook around these failure scenarios:
Catalog outage: RisingWave sinks will buffer in memory and eventually apply backpressure to upstream sources. Restore the catalog within the configured sink.iceberg.catalog.connect.timeout window (default: 30 seconds) to prevent OOM.
S3 throttling: High-throughput streaming can trigger S3 rate limiting (3,500 PUT/second per prefix). Distribute files across multiple S3 prefixes by using hash-based partition keys.
Schema mismatch: If the Iceberg table schema diverges from the RisingWave sink schema (e.g., someone manually added a column to the Iceberg table), the sink will fail. Always evolve schemas through RisingWave's DDL, not manually in the catalog.
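For the S3 throttling scenario above, one way to spread writes across prefixes is an Iceberg bucket partition transform, sketched here in Spark SQL. The column name account_id is a hypothetical high-cardinality key; substitute one from your own schema:

```sql
-- Hash rows into 16 buckets so data files land under 16 distinct
-- partition paths (and therefore S3 prefixes) instead of one
ALTER TABLE payments.raw_events
ADD PARTITION FIELD bucket(16, account_id);
```

The new spec applies only to data written after the change; existing files keep their old layout.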
Lesson 5: Monitor These Three Metrics
-- Check materialized view lag (proxy for sink freshness)
SELECT * FROM rw_streaming_parallelism;
-- Check sink output rate
SELECT sink_name, sink_row_count, sink_throughput_bytes
FROM rw_sinks;
Snapshot age: Alert if the latest Iceberg snapshot is older than 2× your expected checkpoint interval. This indicates a sink is stalled.
File count per partition: Alert if any partition exceeds 1,000 files. Trigger an emergency compaction run.
Catalog response time: Alert if the Iceberg catalog P99 response time exceeds 500ms. Catalog slowness cascades into sink checkpoint failures.
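The first two alerts can be backed by queries against Iceberg's metadata tables; a sketch in Spark SQL, assuming the table from the earlier examples:

```sql
-- Snapshot age: when was the last successful commit?
SELECT max(committed_at) AS latest_snapshot_at
FROM payments.raw_events.snapshots;

-- File count per partition: flag partitions over the 1,000-file threshold
SELECT partition, count(*) AS file_count
FROM payments.raw_events.files
GROUP BY partition
HAVING count(*) > 1000;
```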
Lesson 6: Schema Evolution Without Downtime
Adding columns to Iceberg tables while RisingWave is running requires coordination:
1. Add the column to the Iceberg table (metadata-only operation, no downtime)
2. Update the RisingWave materialized view to include the new column
3. Drop and recreate the sink to pick up the new schema
The sink recreation causes a brief gap (one checkpoint interval) but does not require stopping ingestion. Historical Iceberg data will show null for the new column.
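The three steps might look like the following sketch; discount_code is a hypothetical new column, and the sink options are abbreviated (reuse the full WITH clause from your existing sink definition):

```sql
-- Step 1 (Spark): add the column to the Iceberg table, metadata-only
ALTER TABLE payments.raw_events ADD COLUMN discount_code STRING;

-- Steps 2-3 (RisingWave): after updating the materialized view,
-- drop and recreate the sink so it picks up the new schema
DROP SINK payments_iceberg;
CREATE SINK payments_iceberg AS
SELECT * FROM payments_summary
WITH (connector = 'iceberg' /* ...same options as before... */);
```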
FAQ
Q: What is the maximum sustained ingest rate RisingWave can deliver to Iceberg?
A: In benchmarks on AWS (8 RisingWave compute nodes, S3 us-east-1), RisingWave sustains 500 MB/second of Parquet data written to Iceberg before hitting S3 throttling limits. Horizontal scaling of RisingWave nodes increases throughput proportionally.
Q: How do I handle a corrupted Iceberg snapshot?
A: Use Iceberg's rollback_to_snapshot procedure to revert to the last known-good snapshot. Then investigate the cause (most commonly: an in-flight RisingWave write that was interrupted by catalog failure).
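In Spark, that looks like the following; the snapshot id is a placeholder you would take from the snapshots metadata table:

```sql
-- Find candidate snapshots, newest first
SELECT snapshot_id, committed_at, operation
FROM payments.raw_events.snapshots
ORDER BY committed_at DESC;

-- Revert to the last known-good snapshot (placeholder id)
CALL system.rollback_to_snapshot(
  table => 'payments.raw_events',
  snapshot_id => 1234567890123456789
);
```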
Q: Should I run compaction in the same cluster as RisingWave?
A: No. Run compaction in a separate Spark or Trino job to avoid resource contention with the streaming pipeline. Schedule compaction during off-peak hours if the streaming workload is not 24/7.
Q: How do I test disaster recovery for Iceberg sinks?
A: Simulate a catalog outage in staging, verify that RisingWave buffers correctly and resumes without data loss after catalog recovery. Test S3 permissions separately to ensure the sink can write new files after a credential rotation.
Q: What is the cost of running Iceberg compaction continuously?
A: Continuous compaction on a large table consumes significant Spark or Trino compute. For cost optimization, use targeted compaction (specific partitions, recent time ranges) rather than full-table rewrites. Budget 10–20% of your total pipeline compute cost for compaction operations.
Get Started
Apply these production lessons to your streaming Iceberg pipeline using the RisingWave documentation. Get peer support from teams running Iceberg in production on the RisingWave Slack.

