Monitoring Iceberg Table Health in Streaming Pipelines

Introduction

Streaming into Apache Iceberg feels straightforward at first. You configure a sink, data flows, and Iceberg commits snapshots. Then, three weeks later, your Trino queries take 10x longer. The cause: 40,000 small files scattered across partitions because nobody was watching the commit rate or file sizes.

Iceberg tables in streaming pipelines need continuous monitoring. Unlike batch pipelines that run once and produce predictable file layouts, streaming writers commit frequently and accumulate metadata at a pace that can degrade query performance if left unchecked. Snapshot counts grow, manifest lists lengthen, and small files multiply with every commit cycle.

This guide covers what to monitor when streaming to Iceberg, how to query Iceberg's built-in metadata tables for health indicators, and how to build automated monitoring using RisingWave materialized views. All SQL examples target RisingWave v2.3 and standard Iceberg metadata table queries compatible with Spark and Trino.

What Should You Monitor in Streaming Iceberg Tables?

Iceberg tables expose rich metadata that tells you exactly how healthy your table is. Here are the key metrics to track in streaming pipelines.

File Count and File Sizes

Every streaming commit adds new data files. If your commit interval is 60 seconds and each commit writes one file into the current hourly partition, the table accumulates 1,440 files per day, 60 in each of the day's 24 partitions. After a month, that is over 43,000 files.

Small files hurt query performance because query engines must open, read metadata from, and coordinate across thousands of files instead of scanning a few large ones. The overhead is not linear: doubling the file count more than doubles planning time because manifest scanning also slows down.

What to track:

  • Total data file count per partition
  • Average file size (target: 128 MB to 512 MB)
  • Number of files below a minimum threshold (e.g., under 10 MB)

Snapshot Count and Growth Rate

Each streaming commit creates a new snapshot. Snapshots are cheap individually, but they accumulate. A table with 60-second commits generates 1,440 snapshots per day, or over 43,000 per month.

Excessive snapshots slow down table operations because Iceberg must track the snapshot chain for time travel and rollback. They also increase the size of the metadata.json file, which every query must read.

What to track:

  • Total snapshot count
  • Snapshot growth rate (snapshots per hour)
  • Age of the oldest retained snapshot
  • metadata.json file size

Data Freshness

Data freshness measures the delay between when an event occurs and when it appears in the Iceberg table. In a streaming pipeline, this is the sum of:

  1. Source-to-RisingWave latency (Kafka consumer lag)
  2. RisingWave processing time
  3. Commit interval (time between Iceberg commits)

A sudden increase in this delay indicates a problem somewhere in the pipeline, whether it is backpressure in RisingWave, a slow commit, or upstream source lag.

What to track:

  • Time since the last Iceberg commit (snapshot timestamp vs. current time)
  • Maximum event time lag (newest event time in the table vs. current time)
  • Commit duration (how long each commit takes)

Commit Rate and Failures

The commit rate tells you how reliably your streaming pipeline is writing to Iceberg. A healthy pipeline has a steady commit rate matching your configured interval. Gaps in the commit timeline indicate failures or backpressure.

What to track:

  • Commits per hour (should match expected rate based on your interval)
  • Failed commits (retries exhausted)
  • Commit latency (time from flush to confirmed commit)

Compaction Lag

Compaction is the process of merging small files into larger ones. In streaming pipelines, compaction must run continuously to keep up with the rate of new small files. Compaction lag is the backlog that accumulates when small files are created faster than compaction merges them.

What to track:

  • Ratio of small files to total files
  • Time since the last compaction
  • Partition-level file counts (some partitions may need compaction more urgently)

How Do You Query Iceberg Metadata Tables?

Iceberg exposes several metadata tables that you can query with any compatible engine (Spark, Trino, Flink). These tables provide the raw data for monitoring.

The Snapshots Table

The snapshots table lists every snapshot in the table's history:

SELECT
    snapshot_id,
    committed_at,
    operation,
    summary
FROM analytics.orders.snapshots
ORDER BY committed_at DESC
LIMIT 10;

Expected output:

 snapshot_id     | committed_at             | operation | summary
-----------------+--------------------------+-----------+------------------------------------------
 8847236984312   | 2026-03-29 14:05:00.000  | append    | {added-files-size=134217728, ...}
 8847236984311   | 2026-03-29 14:04:00.000  | append    | {added-files-size=67108864, ...}
 8847236984310   | 2026-03-29 14:03:00.000  | append    | {added-files-size=142606336, ...}

The summary field contains key metrics per snapshot: files added, files removed, records added, and total size. Parsing this field gives you per-commit health data.
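Since summary is a map of string keys to string values, you can pull individual metrics out with the map subscript operator and cast them. A quick sketch, using standard Iceberg summary keys (the syntax works in Spark 3+ and Trino; in Trino, prefer element_at(summary, ...) if a key might be absent for some operation types):

SELECT
    committed_at,
    CAST(summary['added-data-files'] AS INT)    AS files_added,
    CAST(summary['added-records'] AS BIGINT)    AS records_added,
    CAST(summary['added-files-size'] AS BIGINT) AS bytes_added
FROM analytics.orders.snapshots
ORDER BY committed_at DESC
LIMIT 10;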

The Manifests Table

The manifests table shows all manifest files and their statistics:

SELECT
    path,
    length,
    partition_spec_id,
    added_snapshot_id,
    added_data_files_count,
    existing_data_files_count,
    deleted_data_files_count
FROM analytics.orders.manifests;

This tells you how many data files each manifest tracks. A manifest with thousands of entries indicates that compaction is overdue.
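To surface manifests that have grown past a healthy size, aggregate the per-manifest counts. A sketch, with an illustrative 1,000-file threshold:

SELECT
    path,
    added_data_files_count + existing_data_files_count AS tracked_data_files
FROM analytics.orders.manifests
WHERE added_data_files_count + existing_data_files_count > 1000
ORDER BY tracked_data_files DESC;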

The Files Table

The files table provides per-file statistics:

SELECT
    file_path,
    file_format,
    record_count,
    file_size_in_bytes,
    partition
FROM analytics.orders.files
WHERE file_size_in_bytes < 10485760  -- Files under 10 MB
ORDER BY file_size_in_bytes ASC;

This query identifies small files that need compaction. In a healthy streaming table, fewer than 10% of files should be under 10 MB.
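To turn that 10% guideline into a per-partition check, compute the small-file share directly (the aggregate FILTER clause works in Spark 3+ and Trino):

SELECT
    partition,
    COUNT(*) AS total_files,
    COUNT(*) FILTER (WHERE file_size_in_bytes < 10485760) AS small_files,
    100.0 * COUNT(*) FILTER (WHERE file_size_in_bytes < 10485760) / COUNT(*) AS small_file_pct
FROM analytics.orders.files
GROUP BY partition
ORDER BY small_file_pct DESC;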

The History Table

The history table tracks which snapshot was current at each point in time:

SELECT
    made_current_at,
    snapshot_id,
    parent_id,
    is_current_ancestor
FROM analytics.orders.history
ORDER BY made_current_at DESC
LIMIT 20;

Gaps in the made_current_at timeline indicate periods where no commits occurred, which could signal pipeline failures.
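A window function makes those gaps explicit. A sketch that computes the time between consecutive commits (timestamp subtraction returns an interval in both Spark and Trino):

SELECT
    made_current_at,
    snapshot_id,
    made_current_at - LAG(made_current_at) OVER (ORDER BY made_current_at) AS gap_since_previous
FROM analytics.orders.history
ORDER BY made_current_at DESC
LIMIT 20;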

How Can You Build Monitoring Dashboards with RisingWave?

RisingWave can ingest Iceberg metadata and compute health metrics as materialized views. This approach gives you continuously updated monitoring without running periodic batch queries.

Ingesting Iceberg Metadata into RisingWave

RisingWave supports reading from Iceberg tables as sources. You can create a source pointing to your Iceberg table's metadata to build monitoring on top of it.

For pipeline-level monitoring, create tables in RisingWave to track commit events emitted by your streaming sinks:

-- Track commit events from your streaming pipeline
CREATE TABLE iceberg_commit_log (
    table_name VARCHAR,
    snapshot_id BIGINT,
    committed_at TIMESTAMP,
    records_added INT,
    files_added INT,
    bytes_added BIGINT,
    commit_duration_ms INT
);

Populate this table by logging commit metadata from your sink operations or by periodically syncing from Iceberg metadata tables.
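As a minimal sketch of the sync approach, the following query (run in Trino against the Iceberg catalog) shapes snapshot metadata into the iceberg_commit_log schema. Note that commit_duration_ms is not exposed in Iceberg metadata, so it is left NULL here and must come from your sink's own logs:

-- Run against Iceberg in Trino; the sync job inserts the results into
-- iceberg_commit_log in RisingWave
SELECT
    'analytics.orders'                          AS table_name,
    snapshot_id,
    committed_at,
    CAST(summary['added-records'] AS INT)       AS records_added,
    CAST(summary['added-data-files'] AS INT)    AS files_added,
    CAST(summary['added-files-size'] AS BIGINT) AS bytes_added,
    CAST(NULL AS INT)                           AS commit_duration_ms
FROM analytics.orders.snapshots
WHERE committed_at > now() - INTERVAL '1' HOUR;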

Materialized View: Data Freshness Monitor

CREATE MATERIALIZED VIEW data_freshness_monitor AS
SELECT
    table_name,
    MAX(committed_at) AS last_commit_time,
    NOW() - MAX(committed_at) AS time_since_last_commit,
    COUNT(*) FILTER (
        WHERE committed_at > NOW() - INTERVAL '1 hour'
    ) AS commits_last_hour,
    AVG(commit_duration_ms) FILTER (
        WHERE committed_at > NOW() - INTERVAL '1 hour'
    ) AS avg_commit_duration_ms
FROM iceberg_commit_log
GROUP BY table_name;

This materialized view continuously computes the time since the last commit and the commit rate per table. A time_since_last_commit value exceeding 2x your configured commit interval is a strong signal that something is wrong.
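Pointing an alert query or dashboard at the view is then straightforward. Assuming a 60-second commit interval, flag anything past the 2x mark:

SELECT table_name, last_commit_time, time_since_last_commit
FROM data_freshness_monitor
WHERE time_since_last_commit > INTERVAL '2 minutes';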

Materialized View: File Health Tracker

CREATE MATERIALIZED VIEW file_health_tracker AS
SELECT
    table_name,
    SUM(files_added) AS total_files_created,
    SUM(bytes_added) AS total_bytes_written,
    CASE
        WHEN SUM(files_added) > 0
        THEN SUM(bytes_added) / SUM(files_added)
        ELSE 0
    END AS avg_file_size_bytes,
    SUM(files_added) FILTER (
        WHERE committed_at > NOW() - INTERVAL '1 hour'
    ) AS files_created_last_hour,
    SUM(records_added) FILTER (
        WHERE committed_at > NOW() - INTERVAL '1 hour'
    ) AS records_added_last_hour
FROM iceberg_commit_log
GROUP BY table_name;

This view tracks file creation rates and average file sizes. If avg_file_size_bytes drops below 10 MB, your commit interval may be too aggressive or your throughput too low for the current partitioning scheme.
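A direct query against the view surfaces the offenders, using the same 10 MB small-file threshold as earlier:

SELECT table_name, avg_file_size_bytes, files_created_last_hour
FROM file_health_tracker
WHERE avg_file_size_bytes < 10485760;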

Materialized View: Commit Anomaly Detection

CREATE MATERIALIZED VIEW commit_anomalies AS
SELECT
    table_name,
    committed_at,
    records_added,
    files_added,
    commit_duration_ms,
    CASE
        WHEN commit_duration_ms > 30000 THEN 'SLOW_COMMIT'
        WHEN records_added = 0 THEN 'EMPTY_COMMIT'
        WHEN files_added > 100 THEN 'FILE_EXPLOSION'
        ELSE 'NORMAL'
    END AS anomaly_type
FROM iceberg_commit_log
WHERE committed_at > NOW() - INTERVAL '24 hours'
  AND (
      commit_duration_ms > 30000
      OR records_added = 0
      OR files_added > 100
  );

This view flags three types of anomalies:

  • SLOW_COMMIT: Commits taking over 30 seconds, indicating storage or catalog pressure
  • EMPTY_COMMIT: Commits with zero records, suggesting upstream pipeline stalls
  • FILE_EXPLOSION: Commits creating over 100 files, pointing to partition skew or misconfigured write settings

Alerting on Health Metrics

Once you have these materialized views, you can sink alerts to downstream systems. For example, sink anomalies to a Kafka topic for your alerting pipeline:

CREATE SINK commit_anomaly_alerts FROM commit_anomalies
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'kafka:9092',
    topic = 'iceberg-health-alerts',
    type = 'append-only',
    force_append_only = 'true'
);

This creates a real-time alerting pipeline: RisingWave detects anomalies in Iceberg commit patterns and pushes alerts to Kafka, where your incident management system (PagerDuty, Slack, or a custom consumer) picks them up.

For more on building streaming pipelines with RisingWave, see the streaming to Iceberg guide.

What Are Key Thresholds for Iceberg Table Health?

Based on production experience, here are recommended monitoring thresholds for streaming Iceberg tables:

Metric                 | Healthy              | Warning         | Critical
-----------------------+----------------------+-----------------+---------------
Time since last commit | < 2x commit interval | 2x-5x interval  | > 5x interval
Average file size      | 128 MB - 512 MB      | 32 MB - 128 MB  | < 32 MB
Files under 10 MB      | < 5% of total        | 5% - 20%        | > 20%
Snapshot count         | < 1,000              | 1,000 - 10,000  | > 10,000
Commit duration        | < 10 seconds         | 10 - 30 seconds | > 30 seconds
Empty commits          | < 1% of total        | 1% - 5%         | > 5%
Manifest file count    | < 100                | 100 - 500       | > 500

These thresholds vary based on your specific workload. A high-throughput pipeline (millions of records per second) naturally has larger file counts and more snapshots. Calibrate these numbers against your own baseline after a week of stable operation.

How Do You Fix Common Iceberg Health Issues?

Too Many Small Files

Cause: Commit interval too short relative to throughput, or too many partitions for the data volume.

Fix: Increase commit_checkpoint_interval in your RisingWave sink configuration. The default is 60, counted in checkpoints, which is roughly 60 seconds at RisingWave's default one-second checkpoint frequency; for lower-throughput tables, committing every 300-600 seconds produces healthier file sizes. Also consider reducing partition granularity (switch from hourly to daily partitions).
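For reference, a sink created with a longer commit interval might look like the sketch below. The catalog and storage settings are placeholders for your environment (real deployments also need credentials):

CREATE SINK orders_iceberg_sink FROM orders
WITH (
    connector = 'iceberg',
    type = 'append-only',
    force_append_only = 'true',
    warehouse.path = 's3://warehouse',  -- placeholder
    catalog.type = 'storage',           -- placeholder
    database.name = 'analytics',        -- placeholder
    table.name = 'orders',              -- placeholder
    commit_checkpoint_interval = 300    -- ~300 seconds at 1s checkpoints
);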

Run compaction to clean up existing small files:

-- In Spark
CALL catalog.system.rewrite_data_files('analytics.orders');

Snapshot Accumulation

Cause: No snapshot expiration configured.

Fix: Configure snapshot retention. In Spark:

CALL catalog.system.expire_snapshots(
    table => 'analytics.orders',
    older_than => TIMESTAMP '2026-03-22 00:00:00',
    retain_last => 100
);

For streaming tables, schedule snapshot expiration to run every few hours. Retain at least the last 100 snapshots for debugging, but expire anything older than 7 days unless you have specific time-travel requirements.

Growing Metadata File Size

Cause: Each commit adds to the metadata.json file. Without cleanup, it grows indefinitely.

Fix: Enable metadata file cleanup alongside snapshot expiration. Iceberg can rewrite the metadata file to include only active snapshots. Configure write.metadata.previous-versions-max in your table properties to limit the number of retained metadata versions.
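In Spark SQL, both settings can be applied together. The companion property write.metadata.delete-after-commit.enabled must be on for old metadata files to actually be deleted; the retention count below is illustrative:

ALTER TABLE analytics.orders SET TBLPROPERTIES (
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '20'
);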

Commit Failures

Cause: Catalog contention (multiple writers to the same table), storage throttling, or network issues.

Fix: Check your catalog logs for conflict errors. If multiple pipelines write to the same Iceberg table, consider partitioning the writes so each pipeline targets different partitions. In RisingWave, increase commit_retry_num (default: 8) for tables with frequent conflicts.

For more information on data freshness monitoring and SLA tracking, see the RisingWave data freshness guide.

FAQ

What metadata tables does Iceberg expose for monitoring?

Iceberg exposes several metadata tables accessible by appending the metadata table name to the table identifier (for example, analytics.orders.snapshots): snapshots (commit history and per-commit statistics), manifests (manifest file details and data file counts), files (per-file statistics including size, record count, and partition values), history (snapshot timeline), and partitions (partition-level statistics). These tables are queryable from Spark, Trino, and other compatible engines.

How often should you run compaction on streaming Iceberg tables?

Run compaction at least once per hour for high-throughput streaming tables that commit every 60 seconds. Lower-throughput tables with longer commit intervals may only need compaction every 4 to 8 hours. The trigger should be based on the ratio of small files (under 10 MB) to total files: when it exceeds 20%, compaction is overdue.

What causes small files in streaming Iceberg pipelines?

Small files accumulate when the commit interval is short relative to the data throughput per partition. A 60-second commit interval on a table with 50 active partitions can create 50 new files per minute, regardless of how little data each file contains. The fix is to increase the commit interval, reduce partition count, or run regular compaction.

How can RisingWave help monitor Iceberg table health?

RisingWave can ingest Iceberg commit metadata and compute health metrics as continuously-updated materialized views. You can track data freshness, commit rates, file creation patterns, and anomalies in real time, then sink alerts to Kafka or other destinations for automated incident response.

Conclusion

Monitoring Iceberg table health is not optional for streaming pipelines. The continuous nature of streaming writes creates unique challenges: small files accumulate, snapshots grow, and metadata bloats unless actively managed. Here are the key takeaways:

  • Track five core metrics: file count and sizes, snapshot count, data freshness, commit rate, and compaction lag.
  • Use Iceberg metadata tables: Query snapshots, manifests, files, and history tables directly for health data.
  • Automate with materialized views: Use RisingWave to build continuously-updated monitoring dashboards on Iceberg metadata.
  • Set clear thresholds: Establish warning and critical thresholds for each metric, calibrated to your workload.
  • Run regular maintenance: Schedule compaction and snapshot expiration as part of your pipeline operations, not as an afterthought.

Proactive monitoring turns Iceberg table management from a firefighting exercise into a routine operational practice.

Ready to build monitored streaming pipelines to Iceberg? Get started with RisingWave in 5 minutes with the Quickstart.

Join our Slack community to ask questions and connect with other stream processing developers.
