Apache Iceberg in 2026: Streaming Integration, Catalogs, and What Changed

Why Apache Iceberg Won

Three open table formats competed for the lakehouse standard over the past several years: Apache Iceberg, Delta Lake (Databricks), and Apache Hudi (Uber). By 2026, Iceberg emerged as the format that every major data platform supports, reads, and writes.

The technical reasons are well-understood. Iceberg provides ACID transactions on object storage (S3, GCS, Azure Blob) without a centralized lock service. Schema evolution adds or removes columns without rewriting data files. Partition evolution lets you change how data is partitioned as query patterns evolve, again without rewriting data. Time travel lets you query any past snapshot by timestamp or snapshot ID.

But the decisive factor was governance. Delta Lake is controlled by Databricks. Its open source version lags the proprietary version, and Databricks makes strategic decisions about its roadmap. Apache Iceberg is governed by the Apache Software Foundation under the Apache 2.0 license, with contributors from Snowflake, AWS, Apple, Netflix, LinkedIn, and dozens of other organizations.

That open governance produced the ecosystem result visible today. Snowflake reads and writes Iceberg. Google BigQuery supports Iceberg tables as external tables and managed tables. AWS Athena queries Iceberg natively. Databricks itself supports Iceberg alongside Delta Lake because customers demanded it. Trino, DuckDB, Dremio, StarRocks, and RisingWave all work with Iceberg. No other format matches this level of engine support.

What Changed in 2026 Specifically

Several things shifted between 2023 and 2026 that matter for streaming integrations.

The REST Catalog became the standard

In earlier Iceberg deployments, teams chose between Hive Metastore (complex, stateful), AWS Glue (managed but AWS-specific), and file-system-based catalogs. The Iceberg REST Catalog specification, formalized in 2023, gave the ecosystem a single vendor-neutral API for catalog operations: listing namespaces, creating tables, committing new snapshots, and reading table metadata.
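
To make that concrete, here is a brief sketch of the catalog API from Python using PyIceberg. The endpoint and credential below are placeholders, and the namespace and table names simply match the examples used later in this article.

from pyiceberg.catalog import load_catalog

# Connect to any server that implements the REST Catalog spec.
# The URI and credential values are placeholders.
catalog = load_catalog(
    "rest",
    **{
        "uri": "https://catalog.example.com/api/catalog",
        "credential": "client-id:client-secret",
    },
)

# Catalog operations defined by the spec: discover namespaces and tables,
# then load a table's metadata.
print(catalog.list_namespaces())          # e.g. [("analytics",)]
print(catalog.list_tables("analytics"))   # e.g. [("analytics", "hourly_orders")]

table = catalog.load_table("analytics.hourly_orders")
print(table.schema())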

Polaris Catalog (donated by Snowflake to the Apache Foundation) implements the REST Catalog spec and has become the reference open-source implementation. Unity Catalog (Databricks) and Tabular (the company founded by Iceberg's creators, now part of Databricks) also implement the REST spec. In 2026, if you are building a new deployment, REST Catalog is the default choice. It decouples your table format from any specific cloud vendor and lets any engine with REST Catalog support discover and access your tables.

Iceberg v2 row-level deletes matured

Iceberg v2 introduced positional and equality delete files, which allow row-level deletes and updates without rewriting entire data files. This is the mechanism Flink and Spark use to implement MERGE INTO and CDC-style upsert operations on Iceberg tables. In 2023, support was present but query engine handling of delete files was inconsistent. By 2026, Trino, Spark, DuckDB, and other engines handle v2 delete files correctly and efficiently. Streaming upsert pipelines are now reliable end-to-end.
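
As a sketch of what that looks like in practice, the Spark snippet below runs a MERGE INTO against an Iceberg table. The catalog configuration mirrors the compaction example later in this article, orders_staging stands in for a hypothetical view of corrected rows, and the delete-file behavior assumes the table's write.merge.mode is set to merge-on-read.

from pyspark.sql import SparkSession

# Catalog setup mirrors the compaction example later in this article.
spark = (
    SparkSession.builder
    .appName("iceberg-merge-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.analytics", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.analytics.type", "rest")
    .config("spark.sql.catalog.analytics.uri", "https://catalog.example.com")
    .config("spark.sql.defaultCatalog", "analytics")
    .getOrCreate()
)

# With write.merge.mode = 'merge-on-read', matched rows are recorded as
# v2 delete files plus new data files instead of rewriting every affected
# data file. "orders_staging" is a hypothetical view of corrected rows.
spark.sql("""
    MERGE INTO analytics.hourly_orders AS t
    USING orders_staging AS s
    ON t.merchant_id = s.merchant_id AND t.window_start = s.window_start
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")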

RisingWave added stable Iceberg sink support

RisingWave's Iceberg sink, available from the v2.x release line, supports both append-only and upsert modes with exactly-once semantics. The connector works with storage-based catalogs, AWS Glue, Hive Metastore, and REST catalogs. This makes RisingWave a practical choice for teams that want to run streaming SQL and sink results to Iceberg without managing a separate Flink cluster.

Query engines closed the performance gap

DuckDB and Trino made significant query performance improvements on Iceberg in 2024-2025. Vectorized Parquet reading, improved metadata pruning, and better handling of v2 delete files mean that querying a well-maintained Iceberg table with Trino or DuckDB now delivers performance in the same order of magnitude as a dedicated data warehouse for most analytical query shapes. For small-to-medium datasets (under a few terabytes), DuckDB running locally can query Iceberg on S3 with sub-second response times.
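
As an illustration, the Python sketch below uses DuckDB's iceberg and httpfs extensions to query a table directly on S3. The warehouse path is a placeholder, and credentials are assumed to come from the standard AWS credential chain.

import duckdb

con = duckdb.connect()

# Extensions for Iceberg metadata, S3 access, and AWS credential discovery.
for ext in ("iceberg", "httpfs", "aws"):
    con.execute(f"INSTALL {ext}")
    con.execute(f"LOAD {ext}")

# Pick up credentials from the standard AWS credential chain.
con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain, REGION 'us-east-1')")

# iceberg_scan reads the table's current snapshot from its warehouse location.
# If the table has no version-hint file, point it at a specific metadata JSON.
top_merchants = con.sql("""
    SELECT merchant_id, SUM(total_revenue) AS revenue
    FROM iceberg_scan('s3://my-data-lake/warehouse/analytics/hourly_orders')
    GROUP BY merchant_id
    ORDER BY revenue DESC
    LIMIT 10
""").df()
print(top_merchants)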

The Streaming-to-Iceberg Pattern

The foundational pattern for streaming lakehouse ingestion in 2026 is:

Operational DB (PostgreSQL, MySQL)
    → CDC (native connector)
    → RisingWave
    → Iceberg sink
    → Data lake (S3)

Event streams (Kafka, Kinesis, Pulsar)
    → RisingWave
    → Iceberg sink
    → Data lake (S3)

RisingWave sits in the middle as the streaming SQL layer. It ingests from multiple sources simultaneously, applies SQL transformations and aggregations, and writes processed results to Iceberg tables. Downstream, analytical engines (Trino, DuckDB, Athena, Spark) query the Iceberg tables for historical and ad-hoc analysis.

This separation matters. RisingWave is optimized for continuous stateful processing: it maintains incremental materialized views that update with sub-second latency as new data arrives. Trino is optimized for scanning large volumes of data in parallel. Iceberg tables serve as the boundary between these two worlds, giving each system the interface it needs.

Creating an Iceberg Sink from RisingWave

The following examples assume you have a materialized view in RisingWave that you want to persist to Iceberg. This is the most common pattern: process and aggregate in RisingWave, write the results to Iceberg for historical querying.

Step 1: Create a streaming source

-- Kafka source for order events
CREATE SOURCE order_events (
    order_id        BIGINT,
    customer_id     INT,
    merchant_id     INT,
    amount          DECIMAL,
    currency        VARCHAR,
    order_time      TIMESTAMPTZ,
    status          VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'orders.v1',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

Step 2: Define a materialized view

-- Rolling hourly order summary per merchant
CREATE MATERIALIZED VIEW hourly_order_summary AS
SELECT
    window_start,
    window_end,
    merchant_id,
    COUNT(*)             AS order_count,
    SUM(amount)          AS total_revenue,
    AVG(amount)          AS avg_order_value
FROM TUMBLE(order_events, order_time, INTERVAL '1 hour')
GROUP BY window_start, window_end, merchant_id;

Step 3: Sink to Iceberg

-- Create an Iceberg sink from the materialized view
CREATE SINK hourly_orders_sink
FROM hourly_order_summary
WITH (
    connector = 'iceberg',
    type = 'upsert',
    -- upsert sinks need a primary key; the window plus merchant identify a row
    primary_key = 'window_start,merchant_id',
    catalog.type = 'storage',
    warehouse.path = 's3://my-data-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'hourly_orders'
);

The type = 'upsert' setting uses Iceberg v2 equality deletes, keyed on the sink's primary_key, to handle updates to existing rows when a window is revised. For append-only data (raw event logs, audit trails), use type = 'append-only' instead, which writes only new rows and is faster.

Using a REST Catalog

For production deployments, a REST Catalog provides better governance and engine interoperability than the storage catalog:

CREATE SINK hourly_orders_rest_sink
FROM hourly_order_summary
WITH (
    connector = 'iceberg',
    type = 'append-only',
    catalog.type = 'rest',
    catalog.uri = 'https://your-polaris-catalog.example.com/api/catalog',
    catalog.credential = 'client-id:client-secret',
    warehouse = 'analytics_warehouse',
    database.name = 'analytics',
    table.name = 'hourly_orders'
);

Using AWS Glue Catalog

CREATE SINK hourly_orders_glue_sink
FROM hourly_order_summary
WITH (
    connector = 'iceberg',
    type = 'append-only',
    catalog.type = 'glue',
    catalog.id = '123456789012',
    warehouse.path = 's3://my-data-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'hourly_orders'
);

Iceberg Catalog Options in 2026

Catalog choice affects governance, discoverability, and query engine compatibility. Here is how the main options compare.

REST Catalog (Polaris, Unity Catalog, Tabular)

The REST Catalog specification defines a standard HTTP API for Iceberg catalog operations. Any engine implementing the spec can talk to any REST Catalog server.

Polaris (Apache): Open source, donated by Snowflake to the Apache Software Foundation. Runs on-premises or on any cloud. Provides role-based access control, namespace management, and credential vending (short-lived credentials scoped to specific tables). The reference implementation for the open REST Catalog spec.

Unity Catalog (Databricks): Supports Iceberg tables alongside Delta Lake tables. Best choice if you are already in the Databricks ecosystem. Implements the REST Catalog spec, so non-Databricks engines can read your Unity-catalogued Iceberg tables.

When to use: All new deployments. The REST API is the lingua franca of Iceberg in 2026.

AWS Glue Catalog

AWS Glue is a managed metadata catalog that integrates natively with AWS services. Athena, EMR, Glue ETL jobs, and RisingWave running in AWS all discover Iceberg tables through Glue with no additional infrastructure.

When to use: AWS-native deployments where you want zero-management catalog infrastructure and tight Athena integration.

Limitation: Vendor-specific. If you need engines outside AWS to access the same catalog, you need a REST layer in front of Glue or a separate catalog sync.

Hive Metastore

The original Iceberg catalog, inherited from the Hive ecosystem. Stateful, requires a running Thrift server, operationally complex.

When to use: Legacy environments where Hive Metastore is already deployed and teams are not ready to migrate.

Recommendation: Do not start new projects on Hive Metastore. Migrate to REST Catalog when the opportunity arises.
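
One low-friction migration path, assuming you know the table's current metadata file location and the target namespace already exists, is to register the table with the REST catalog via PyIceberg. The URI and paths below are placeholders; the data and metadata files stay where they are on S3.

from pyiceberg.catalog import load_catalog

rest_catalog = load_catalog("rest", **{"uri": "https://catalog.example.com"})

# Registers the existing table under the REST catalog by pointing it at the
# table's latest metadata file (placeholder path shown here).
rest_catalog.register_table(
    "analytics.hourly_orders",
    metadata_location=(
        "s3://my-data-lake/warehouse/analytics/hourly_orders/"
        "metadata/00042-3f2a9c.metadata.json"
    ),
)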

Nessie

Nessie provides Git-like branching semantics for data. You can create a branch, apply changes (new files, schema evolution, partition changes), and merge back to main. If the experiment goes wrong, you discard the branch.

When to use: Teams doing frequent schema experiments, ML feature engineering iterations, or blue-green data pipeline deployments. Nessie adds overhead for simple ingestion pipelines that do not need branching semantics.

Time Travel in the Streaming Context

Iceberg stores a snapshot for every commit. Each commit from a streaming engine (RisingWave writing to Iceberg every 60 seconds, for example) creates a new snapshot. The metadata layer records which Parquet files belong to which snapshot and when the snapshot was created.

Time travel lets you query the state of a table at any past point. With Trino:

-- Query hourly_orders as it was 24 hours ago
SELECT *
FROM analytics.hourly_orders
FOR TIMESTAMP AS OF (CURRENT_TIMESTAMP - INTERVAL '24' HOUR);

With DuckDB via PyIceberg:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("rest", **{"uri": "https://catalog.example.com"})
table = catalog.load_table("analytics.hourly_orders")

# Scan a specific snapshot
scan = table.scan(snapshot_id=8270631234567890)
df = scan.to_pandas()

With Spark:

-- Query by snapshot timestamp
SELECT *
FROM analytics.hourly_orders
TIMESTAMP AS OF '2026-04-01 00:00:00';

Note that RisingWave reads Iceberg as a batch source (current state), not by snapshot. Time travel queries should go through Trino, Spark, or DuckDB.

Snapshot retention: With streaming writes creating a new snapshot every 60 seconds, you accumulate 1,440 snapshots per day per table. Iceberg's snapshot expiration procedures remove old snapshots and their orphaned data files. Run snapshot expiration on a schedule:

# Using PyIceberg
from datetime import datetime, timedelta

from pyiceberg.catalog import load_catalog

catalog = load_catalog("rest", **{"uri": "https://catalog.example.com"})
table = catalog.load_table("analytics.hourly_orders")

# Expire snapshots older than 7 days
table.expire_snapshots().expire_older_than(
    datetime.now() - timedelta(days=7)
).commit()

Schema Evolution in Streaming Pipelines

Iceberg handles schema changes without rewriting data files. Columns can be added, renamed, reordered, or widened (e.g., INT to BIGINT). Existing files are read with their original schema; the query engine projects the new schema on top using column IDs rather than column names.
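
As a sketch of what a table-side change looks like, the following PyIceberg snippet adds a column as a metadata-only commit; the column name and catalog URI are hypothetical.

from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("rest", **{"uri": "https://catalog.example.com"})
table = catalog.load_table("analytics.hourly_orders")

# A metadata-only commit: no data files are rewritten. Existing files return
# NULL for the new column because resolution happens by column ID, not name.
with table.update_schema() as update:
    update.add_column("payment_method", StringType())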

In a streaming pipeline, schema changes propagate as follows.

Adding a column upstream: If your PostgreSQL source table gains a new column and your CDC connector forwards it, RisingWave will surface the new field if your source definition includes it. You then need to update the downstream materialized view and the Iceberg table schema. Iceberg's schema evolution handles the table-side change. RisingWave CDC is forward-compatible: unknown columns in the CDC event are ignored if not declared in the source definition.

Removing a column: Iceberg marks the column as deleted in metadata. Existing data files still contain the column's bytes, but query engines do not surface it. Upstream, you should update your RisingWave source and materialized view definitions to remove the column reference.

Best practice for streaming schema evolution: Always add columns rather than remove or rename them when possible. This minimizes coordination between your streaming pipeline and downstream consumers. When removal is necessary, coordinate with all consumers, evolve the Iceberg schema first, then update the streaming pipeline.

Compaction: The Operational Requirement You Cannot Skip

Every commit from a streaming engine writes a new set of Parquet files to Iceberg. A streaming pipeline committing every 60 seconds produces at least 1,440 new Parquet files per day per table. Over a week, a single active table accumulates well over 10,000 files. Query engines scanning 10,000 small files for a single query suffer significant overhead: file open latency, metadata parsing, and S3 API request costs.

Compaction merges small files into larger ones (target size typically 128 MB to 512 MB) without affecting table availability. Iceberg compaction is a background operation; reads and writes continue during compaction.

With Spark, using the rewrite_data_files procedure:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("iceberg-compaction") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.analytics", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.analytics.type", "rest") \
    .config("spark.sql.catalog.analytics.uri", "https://catalog.example.com") \
    .getOrCreate()

spark.sql("""
    CALL analytics.system.rewrite_data_files(
        table => 'analytics.hourly_orders',
        strategy => 'binpack',
        options => map(
            'target-file-size-bytes', '134217728',
            'min-input-files', '5'
        )
    )
""")

With PyIceberg's table-level compaction API (check your PyIceberg release; this entry point only appears in recent versions):

from pyiceberg.catalog import load_catalog

catalog = load_catalog("rest", **{"uri": "https://catalog.example.com"})
table = catalog.load_table("analytics.hourly_orders")
table.rewrite_data_files()

When to run compaction: Schedule it as a recurring job, separate from your streaming pipeline. A reasonable frequency is once per hour for active tables and once per day for less active ones. Monitor file counts using your catalog's metadata APIs or by querying the files metadata table in Spark:

SELECT COUNT(*) AS file_count, AVG(file_size_in_bytes) / 1024 / 1024 AS avg_size_mb
FROM analytics.hourly_orders.files;

If average file size drops below 10 MB, your compaction schedule is too infrequent.

The Complete 2026 Streaming Lakehouse Architecture

Putting all of the above together, a production streaming lakehouse in 2026 looks like this:

Ingest layer:

  • PostgreSQL and MySQL operational databases connect to RisingWave via native CDC connectors
  • Kafka, Kinesis, and Pulsar topics connect to RisingWave as streaming sources
  • MongoDB and SQL Server changes are captured via RisingWave's CDC connectors

Real-time serving layer (hot path):

  • RisingWave materialized views serve sub-second query results to operational dashboards, AI agents, and application APIs
  • RisingWave exposes a PostgreSQL-compatible wire protocol, so any PostgreSQL client library connects directly
  • Use this layer for queries that need up-to-the-second freshness: fraud checks, live inventory, session-level personalization

Historical analytics layer (warm path):

  • RisingWave Iceberg sinks write processed results to Iceberg tables on S3 every 30-60 seconds
  • Trino or DuckDB query the Iceberg tables for analytical queries across hours, days, or months of data
  • AWS Athena or BigQuery can query the same tables for managed, serverless query execution
  • Snowflake can read Iceberg tables through external table definitions

Governance layer:

  • Polaris Catalog or Unity Catalog manages table discovery, access control, and credential vending
  • All engines (RisingWave, Trino, DuckDB, Spark, Snowflake) connect to the same catalog endpoint

Maintenance layer:

  • Scheduled Spark or PyIceberg jobs run compaction to merge small files
  • Snapshot expiration removes old snapshots and their data files
  • Metrics from Iceberg's metadata tables feed observability dashboards (file counts, snapshot lag, table growth rate)

This architecture gives you three capabilities simultaneously: sub-second operational data access via RisingWave, seconds-to-minutes historical analytics via Trino or DuckDB on Iceberg, and long-term storage with full time travel and schema evolution support. The entire stack uses open formats and open governance, which means no engine lock-in at any layer.

Choosing Between RisingWave and Apache Flink

Both RisingWave and Apache Flink write to Iceberg, and both support exactly-once semantics. The choice depends on your team's existing skills and your transformation requirements.

Choose RisingWave when:

  • Your team knows SQL and PostgreSQL but not Java or Scala
  • You want to manage the streaming layer without a JVM-based cluster
  • Your transformations are expressible in SQL (filters, joins, aggregations, window functions, temporal joins)
  • You want to serve results directly to applications via PostgreSQL wire protocol in addition to sinking to Iceberg

Choose Apache Flink when:

  • Your team has existing Flink expertise and a running Flink cluster
  • You need custom operators that go beyond SQL (complex state machines, external API calls inside the stream processor, custom partitioning logic)
  • You need the full Flink DataStream API for non-SQL transformations

For most teams in 2026, RisingWave covers the transformation requirements with SQL, and avoiding JVM cluster management is a meaningful operational simplification.

Summary

Apache Iceberg in 2026 is not a choice you need to justify. Every major analytical engine supports it, REST Catalog makes multi-engine access straightforward, and Iceberg v2 row-level deletes make streaming upsert pipelines reliable.

The practical work is in the operational details: scheduling compaction, managing snapshot expiration, coordinating schema evolution between streaming producers and batch consumers, and choosing a catalog that fits your cloud environment. None of these are insurmountable, but they are real work that does not disappear just because the table format is open source.

RisingWave simplifies the streaming-to-Iceberg pipeline by giving you SQL for the entire flow: define sources, write materialized views, and create sinks with CREATE SINK. The result is a streaming lakehouse that reflects operational data in near real time, serves sub-second queries directly from RisingWave, and provides the full historical analytics capability of Iceberg for everything else.
