Apache Iceberg v2 vs v3: What Changed and Why It Matters

Apache Iceberg v3 introduces native variant and geometry types, row-level encryption, default column values, multi-argument transforms, and binary deletion vectors, while remaining backward compatible with v2 readers for tables that don't use the new features. For RisingWave users, the most immediately useful v3 additions are default column values (safer schema evolution) and deletion vectors (cheaper reads on upserted tables).

Background: What Are Iceberg Format Versions?

Apache Iceberg's spec is versioned independently of any software release. A format version governs the structure of metadata files (manifests, manifest lists, table metadata JSON) and the semantics of operations. Format v1 introduced the foundational snapshot model. Format v2 added row-level deletes, which enable efficient upserts without full file rewrites. Format v3 is the latest version of the spec, adopted in 2025.

Engines like Trino, Spark, Flink, and RisingWave each declare which format versions they can read and write. A v2 writer (like RisingWave's current stable release) can write to a v2 table; v3 tables require v3-capable engines for any v3-specific features.
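To see which format version a given table uses, most engines expose it as a table property. The following is a minimal sketch in Spark SQL with the Iceberg runtime; the catalog name demo and the table db.orders are hypothetical placeholders, and other engines may expose the version differently.

```sql
-- Hypothetical names: catalog `demo`, table `db.orders`.
-- Iceberg's Spark integration surfaces the spec version as the
-- reserved `format-version` table property.
SHOW TBLPROPERTIES demo.db.orders ('format-version');
```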

Key Changes in Iceberg v3

1. Default Column Values

The most impactful change for streaming pipelines. In v2, adding a NOT NULL column to an existing table required a backfill — you had to rewrite every existing data file to include the new column. In v3, you can specify a default value for a new column:

-- Trino/Spark-style syntax for adding a column with a default to a v3 table
-- (illustrative; exact syntax and support vary by engine and version)
ALTER TABLE my_table ADD COLUMN discount_pct DOUBLE NOT NULL DEFAULT 0.0;

Existing data files return 0.0 for discount_pct without any rewrite. RisingWave pipelines writing to the table only need to include the new column going forward — old records are implicitly handled.

2. New Data Types: Variant and Geometry

Variant is a schema-less JSON/semi-structured type stored efficiently in Parquet. Instead of flattening nested JSON into dozens of columns, you can store the entire payload as a single Variant column and query into it with path expressions. This is analogous to Snowflake's VARIANT or BigQuery's JSON type.
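As a rough illustration of querying into a Variant column, here is a sketch in Spark SQL. It assumes Spark 4.0 or later (which provides the parse_json and variant_get functions) together with an Iceberg release that supports v3 variant columns; the table demo.db.events and its payload column are hypothetical.

```sql
-- Assumptions: Spark 4.0+, an Iceberg runtime with v3 variant support,
-- and hypothetical names demo.db.events / payload.
CREATE TABLE demo.db.events (
    event_id BIGINT,
    payload  VARIANT
)
USING iceberg
TBLPROPERTIES ('format-version' = '3');

-- parse_json converts a JSON string into a VARIANT value.
INSERT INTO demo.db.events
SELECT 1, parse_json('{"user": {"id": "u-42"}, "amount": 19.99}');

-- Extract typed fields by path instead of flattening the payload into columns.
SELECT
    event_id,
    variant_get(payload, '$.user.id', 'string') AS user_id,
    variant_get(payload, '$.amount', 'double')  AS amount
FROM demo.db.events;
```

Only the fields a query touches need to be named, so new keys in the payload do not require a schema change.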

Geometry enables native geospatial data storage (points, polygons, lines) with WKB encoding and CRS metadata. Geospatial analytics that previously required PostGIS or custom serialization can now use native Iceberg columns.

3. Row-Level Encryption

v3 adds an encryption spec that allows individual columns or rows to be encrypted at rest within the Parquet file, with key management integrated via the catalog. This is distinct from S3-side encryption (SSE-S3/SSE-KMS) — it encrypts the data before it reaches object storage, enabling column-level access control without separate masking infrastructure.

4. Multi-Argument Transforms

v2 partition transforms (bucket[N], truncate[W], year, month, day, hour) each take a single column. v3 introduces multi-argument transforms that can derive partition values from combinations of columns. This enables more precise data skipping for compound partition keys.

5. Binary Deletion Vectors

v3 replaces positional delete files with deletion vectors: compact bitmaps, stored in Puffin files, that mark the deleted row positions of a single data file. A reader loads one bitmap per data file and tests each row position against it, instead of merging sorted position-delete files at read time. Deletes are therefore cheaper to apply, and there are fewer small delete files to track in table metadata. High-frequency upsert workloads benefit the most.
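One way to check which kind of delete files a table has accumulated is to query Iceberg's delete_files metadata table. This is a sketch in Spark SQL with a hypothetical table name. On a v3 table written with deletion vectors, the expectation (an assumption to verify against your engine) is that they appear as position deletes with the PUFFIN file format.

```sql
-- Hypothetical table demo.db.orders; `delete_files` is an Iceberg metadata table.
-- content: 1 = position deletes, 2 = equality deletes.
SELECT
    content,
    file_format,
    COUNT(*)          AS delete_file_count,
    SUM(record_count) AS delete_records
FROM demo.db.orders.delete_files
GROUP BY content, file_format;
```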

v2 vs. v3 Feature Comparison

Feature | v2 | v3
Row-level deletes | Equality + position delete files | Deletion vectors (faster)
Default column values | Not supported | Supported
Variant type | Not supported | Supported
Geometry type | Not supported | Supported
Row-level encryption | Not supported | Supported
Multi-argument transforms | Not supported | Supported
Backward compatibility | v1 readable | v2 readable (for base features)
Engine support (2025) | Broad | Growing

What This Means for RisingWave Users

Today: RisingWave writes Iceberg v2 tables by default. The upsert mode uses v2 equality delete files, which are well-supported by all major query engines (Trino, Spark, Athena, DuckDB).

Migration path: When RisingWave adds v3 support, the upgrade path will be straightforward — Iceberg's spec allows in-place format version upgrades via a single metadata operation. Existing data files remain unchanged.
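For reference, in engines that already support v3, the in-place upgrade is typically a table property change. The sketch below uses Spark SQL with a hypothetical table name and assumes an Iceberg runtime that can write v3. The upgrade is one-way, so confirm that every reader and writer of the table supports v3 first.

```sql
-- Hypothetical table demo.db.orders.
-- Only table metadata changes; existing data files are left as they are.
-- A format version cannot be lowered again once it has been raised.
ALTER TABLE demo.db.orders SET TBLPROPERTIES ('format-version' = '3');
```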

Default values (v3 benefit): The default column values feature is particularly valuable for RisingWave pipelines because it means you can add columns to downstream Iceberg tables without restarting or reconfiguring your sinks. The new column simply starts appearing in new writes, while historical rows return the default.

Deletion vectors (v3 benefit): For high-velocity CDC pipelines (e.g., replicating a busy PostgreSQL orders table), deletion vectors will significantly reduce the read-time merge overhead when querying upserted Iceberg tables.

Working with v2 Tables in RisingWave Today

RisingWave's current v2 sink is production-ready for the vast majority of use cases:

-- Create a standard v2 upsert pipeline
CREATE SOURCE inventory_cdc (
    item_id      BIGINT PRIMARY KEY,
    warehouse_id BIGINT,
    quantity     INT,
    last_updated TIMESTAMPTZ
)
WITH (
    connector      = 'postgres-cdc',
    hostname       = 'postgres.internal',
    port           = '5432',
    username       = 'cdc_user',
    password       = 'secret',
    database.name  = 'inventory',
    table.name     = 'stock_levels'
)
FORMAT DEBEZIUM ENCODE JSON;

CREATE MATERIALIZED VIEW inventory_current AS
SELECT
    item_id,
    warehouse_id,
    quantity,
    last_updated,
    CASE WHEN quantity = 0 THEN 'OUT_OF_STOCK'
         WHEN quantity < 10 THEN 'LOW_STOCK'
         ELSE 'IN_STOCK'
    END AS stock_status
FROM inventory_cdc;

CREATE SINK inventory_iceberg_sink AS
SELECT * FROM inventory_current
WITH (
    connector      = 'iceberg',
    type           = 'upsert',
    primary_key    = 'item_id,warehouse_id',
    catalog.type   = 'rest',
    catalog.uri    = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://my-warehouse/data',
    s3.region      = 'us-east-1',
    database.name  = 'supply_chain',
    table.name     = 'inventory'
);

This pipeline works today with v2 semantics — exactly-once upserts with equality delete files. When v3 is available, the deletion vectors will make this faster without any change to the SQL.

FAQ

Q: Should I upgrade existing Iceberg tables to v3 today? A: Only if you need a v3-specific feature (variant type, encryption, defaults). For most workloads, v2 is stable and has broader engine support. Check your query engine's v3 compatibility before upgrading.

Q: Is v3 backward compatible? Can v2 engines read v3 tables? A: v2 engines can read v3 tables that only use v2 features. If a v3 table uses deletion vectors, variant columns, or other v3-only features, v2 engines will fail. The version is recorded in the format-version field of the table metadata, so confirm how each of your engines handles a table marked as v3 before upgrading.

Q: When will RisingWave support Iceberg v3? A: Check the RisingWave changelog and GitHub roadmap for the latest status. The community is actively tracking spec v3 adoption. Join the Slack for real-time updates.

Q: Does v3 change the catalog protocol? A: No. The REST catalog API (Iceberg REST Catalog spec) is independent of the table format version. v2 and v3 tables are managed through the same catalog endpoints.

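Q: How do deletion vectors compare to Hudi's merge-on-read? A: Both are merge-on-read strategies. Iceberg v3 deletion vectors record deleted row positions as compact bitmaps kept in Puffin files next to the unchanged data files, while Hudi's merge-on-read tables append changes to delta log files that are merged with base files at read time. Applying a bitmap is generally cheaper than replaying a log, but the actual difference depends on the workload and the engine.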
