Introduction
Your product team just added a `referral_source` column to the `users` table. Your data warehouse team renamed `txn_amt` to `transaction_amount` for clarity. A partner integration now sends prices as decimals instead of integers. In a batch world, you handle these changes during the next scheduled ETL run. In a streaming world, the pipeline never stops, so the schema change arrives while data is actively flowing.
Schema evolution is one of the hardest problems in streaming data engineering. A mismatch between the schema your pipeline expects and the schema it actually receives can cause silent data loss, pipeline crashes, or corrupt downstream tables. Apache Iceberg was designed to handle schema changes safely through its ID-based column tracking system, and it supports adding, dropping, renaming, and reordering columns without rewriting existing data files.
This article explains how Iceberg schema evolution works, how it compares to Delta Lake's approach, and how RisingWave handles schema changes automatically when sinking data to Iceberg tables in a streaming pipeline.
How Does Iceberg Schema Evolution Work?
Apache Iceberg takes a fundamentally different approach to schema management than most data formats. Instead of identifying columns by name or position, Iceberg assigns each column a unique integer ID when it is first created. This ID never changes, even if the column is renamed or moved to a different position in the schema.
ID-based column tracking
When you create an Iceberg table with columns `(id INT, name STRING, email STRING)`, Iceberg internally assigns IDs like this:

| Column Name | Column ID | Type |
|---|---|---|
| id | 1 | INT |
| name | 2 | STRING |
| email | 3 | STRING |
If you later rename `email` to `email_address`, the column ID stays as 3. Old data files still reference column ID 3, and query engines know that column ID 3 now maps to `email_address`. No data rewrite is needed.
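The mechanics can be pictured with a small sketch. This is plain Python for illustration, not Iceberg's actual implementation: data files key values by column ID, the schema maps names to IDs, and a rename only touches that mapping.

```python
# Sketch of ID-based column resolution (illustrative, not Iceberg internals).
# Data files store values keyed by column ID; the schema maps names to IDs.
data_file = {1: 42, 2: "Ada", 3: "ada@example.com"}  # column_id -> value
schema = {"id": 1, "name": 2, "email": 3}            # current name -> column ID

def read_column(data_file, schema, name):
    """Resolve a column name to its ID, then fetch the value by that ID."""
    return data_file[schema[name]]

print(read_column(data_file, schema, "email"))       # reads via ID 3

# Rename email -> email_address: only the metadata mapping changes.
schema = {"id": 1, "name": 2, "email_address": 3}

# The same data file is readable under the new name; ID 3 is unchanged.
print(read_column(data_file, schema, "email_address"))
```

Because old data files are never touched, the rename is a metadata-only operation, which is what makes it safe for historical queries.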
Supported schema changes
Iceberg supports these schema evolution operations:
- Add columns: New columns are appended with a new unique ID. Existing data files return `NULL` for the new column.
- Drop columns: The column ID is retired. Old data files still contain the data, but query engines skip it.
- Rename columns: The column name changes but the ID stays the same. All existing data remains readable.
- Reorder columns: Column positions change, but IDs remain fixed. Query engines map by ID, not position.
- Type promotion: Certain safe type widening is allowed, such as `INT` to `BIGINT`, `FLOAT` to `DOUBLE`, or `DECIMAL(p, s)` to `DECIMAL(p', s)` where `p' > p`.
What schema changes are NOT allowed?
Iceberg prohibits changes that could cause data corruption:
- Narrowing types (e.g., `BIGINT` to `INT`) is not allowed because values could overflow
- Changing between incompatible types (e.g., `STRING` to `INT`) is blocked
- Reusing a dropped column's ID is forbidden to prevent data misinterpretation
These restrictions exist to guarantee that every historical query returns correct results, which is critical for both time travel and audit compliance.
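These rules amount to a widening-only check. The sketch below encodes them in illustrative Python; it is not Iceberg's API, just a compact restatement of the allowed and forbidden promotions listed above.

```python
# Illustrative check for Iceberg-style type promotions (not the real API).
# Only widening changes are allowed; narrowing and incompatible changes fail.
SAFE_PROMOTIONS = {
    ("int", "bigint"),
    ("float", "double"),
}

def parse_decimal(t):
    """Extract (precision, scale) from a string like 'decimal(12, 2)'."""
    p, s = t[t.index("(") + 1 : t.index(")")].split(",")
    return int(p), int(s)

def is_safe_promotion(old, new):
    """Return True if changing a column from `old` to `new` is a safe widening."""
    if old == new:
        return True
    if (old, new) in SAFE_PROMOTIONS:
        return True
    # decimal(p, s) -> decimal(p', s) with p' > p and the same scale
    if old.startswith("decimal") and new.startswith("decimal"):
        (p1, s1), (p2, s2) = parse_decimal(old), parse_decimal(new)
        return p2 > p1 and s1 == s2
    return False

print(is_safe_promotion("int", "bigint"))                     # True: widening
print(is_safe_promotion("bigint", "int"))                     # False: narrowing
print(is_safe_promotion("decimal(10, 2)", "decimal(12, 2)"))  # True: p grows
print(is_safe_promotion("string", "int"))                     # False: incompatible
```

Note that the decimal rule only widens precision; changing the scale is rejected, since that would reinterpret stored values.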
How Does Iceberg Schema Evolution Compare to Delta Lake?
Both Iceberg and Delta Lake support schema evolution, but they differ in important ways.
| Feature | Apache Iceberg | Delta Lake |
|---|---|---|
| Column identification | By unique integer ID | By column name |
| Column rename | Safe (ID unchanged) | Requires column mapping mode |
| Column reorder | Native support | Not natively supported |
| Type promotion | INT to BIGINT, FLOAT to DOUBLE | INT to BIGINT, FLOAT to DOUBLE |
| Partition evolution | In-place, no rewrite | Requires table recreation |
| Nested schema evolution | Full support (struct, list, map) | Limited support |
| Historical correctness | Guaranteed by ID mapping | Can break with renames |
The most significant difference is in column identification. Because Delta Lake uses column names, renaming a column can break the connection between old data files and the current schema unless you enable column mapping mode (available since Delta Lake 2.0). Iceberg avoids this problem entirely through its ID-based system.
Partition evolution
Iceberg also supports partition evolution without rewriting data. You can change from daily to hourly partitioning, and Iceberg handles the mixed partition layouts transparently. With Delta Lake, changing the partition scheme requires creating a new table and migrating data.
This is particularly relevant for streaming pipelines where partition schemes may need to evolve as data volumes grow.
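One way to picture how mixed layouts coexist: each data file records which partition spec it was written under, and scan planning applies the right spec per file. A simplified sketch in Python (illustrative only, not Iceberg internals; file names and spec IDs are invented):

```python
# Each data file remembers the partition spec it was written under (spec_id),
# so old daily-partitioned files and new hourly files coexist in one table.
files = [
    {"path": "f1.parquet", "spec_id": 0, "partition": {"day": "2024-06-01"}},
    {"path": "f2.parquet", "spec_id": 1, "partition": {"hour": "2024-06-02-09"}},
]

def files_for_day(files, day):
    """Plan a scan for one calendar day across both partition layouts."""
    matches = []
    for f in files:
        if f["spec_id"] == 0 and f["partition"]["day"] == day:
            matches.append(f["path"])          # old daily layout
        elif f["spec_id"] == 1 and f["partition"]["hour"].startswith(day):
            matches.append(f["path"])          # new hourly layout
    return matches

print(files_for_day(files, "2024-06-01"))  # matches the daily file
print(files_for_day(files, "2024-06-02"))  # matches the hourly file
```

The key point is that no file is rewritten when the spec changes; the planner simply evaluates each file against the spec it was written with.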
What Happens When Source Schemas Change in Streaming Pipelines?
In a streaming pipeline, schema changes at the source system create a cascade of challenges:
The column addition scenario
Consider this sequence of events:
- Your streaming pipeline ingests order data with columns `(order_id, customer_id, amount, created_at)`
- The product team adds a `discount_code` column to the orders table in PostgreSQL
- New records now include `discount_code`, but in-flight records in your Kafka topic may not
Without proper handling, the pipeline might:
- Crash because it encounters an unexpected column
- Silently drop the new column
- Write records with misaligned columns
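A common defensive pattern, sketched here as hypothetical consumer-side Python (not a RisingWave API): treat the expected column list as the contract and fill missing fields with NULL rather than crashing or misaligning.

```python
# Hypothetical consumer-side normalization: project every record onto the
# expected column list, filling missing fields with None (NULL).
EXPECTED_COLUMNS = ["order_id", "customer_id", "amount", "created_at", "discount_code"]

def normalize(record):
    """Align a record to the expected schema; missing columns become None."""
    return {col: record.get(col) for col in EXPECTED_COLUMNS}

# In-flight record produced before the schema change (no discount_code yet)
old = {"order_id": 1, "customer_id": 7, "amount": 1999, "created_at": "2024-06-01"}
# Record produced after the upstream ALTER TABLE
new = {"order_id": 2, "customer_id": 8, "amount": 2500,
       "created_at": "2024-06-02", "discount_code": "SUMMER10"}

print(normalize(old)["discount_code"])  # None
print(normalize(new)["discount_code"])  # SUMMER10
```

This mirrors what Iceberg does on the read side for old data files: the new column simply reads as NULL.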
The type change scenario
- Your pipeline processes `price` as an integer (cents)
- The upstream system changes `price` to a decimal with two fractional digits
- In-flight records have integer prices while new records have decimal prices
How RisingWave handles schema changes
RisingWave provides automatic schema change propagation when you enable the `auto.schema.change` parameter on CDC sources. Here is how to set it up:
```sql
-- Create a PostgreSQL CDC source with auto schema change enabled
CREATE SOURCE pg_orders WITH (
    connector = 'postgres-cdc',
    hostname = 'prod-db.example.com',
    port = '5432',
    username = 'cdc_user',
    password = '${PG_PASSWORD}',
    database.name = 'ecommerce',
    schema.name = 'public',
    auto.schema.change = 'true'
);

-- Create a table with automatic schema mapping
CREATE TABLE orders (*) FROM pg_orders TABLE 'public.orders';
```
When the upstream PostgreSQL table changes (e.g., `ALTER TABLE orders ADD COLUMN discount_code VARCHAR`), RisingWave detects the DDL change through the WAL and propagates it downstream automatically. The `(*)` syntax maps all columns from the source, so new columns appear in RisingWave without manual intervention.
How Do You Build a Schema-Resilient Streaming Pipeline to Iceberg?
Here is a complete example that handles schema evolution from source to lakehouse.
Step 1: Set up the CDC source with auto schema change
```sql
CREATE SOURCE pg_source WITH (
    connector = 'postgres-cdc',
    hostname = 'db.example.com',
    port = '5432',
    username = 'cdc_reader',
    password = '${PG_PASSWORD}',
    database.name = 'product_db',
    schema.name = 'public',
    auto.schema.change = 'true'
);
```
Step 2: Create tables that mirror upstream schemas
```sql
-- Using (*) for automatic schema mapping
CREATE TABLE products (*) FROM pg_source TABLE 'public.products';
CREATE TABLE inventory (*) FROM pg_source TABLE 'public.inventory';
```
When you use `(*)`, RisingWave automatically maps all columns from the upstream table. If the upstream schema changes, RisingWave propagates `ADD COLUMN` and `DROP COLUMN` operations to the downstream table.
Step 3: Create a materialized view for transformations
```sql
CREATE MATERIALIZED VIEW product_inventory AS
SELECT
    p.product_id,
    p.name AS product_name,
    p.category,
    p.price,
    i.warehouse_id,
    i.quantity_on_hand,
    i.last_restocked_at,
    CASE
        WHEN i.quantity_on_hand = 0 THEN 'out_of_stock'
        WHEN i.quantity_on_hand < 10 THEN 'low_stock'
        ELSE 'in_stock'
    END AS stock_status
FROM products p
JOIN inventory i ON p.product_id = i.product_id;
```
Step 4: Sink to Iceberg with upsert mode
```sql
CREATE SINK product_inventory_iceberg FROM product_inventory
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'product_id,warehouse_id',
    warehouse.path = 's3://data-lake/warehouse',
    database.name = 'retail',
    table.name = 'product_inventory',
    catalog.type = 'rest',
    catalog.name = 'retail_catalog',
    create_table_if_not_exists = 'true',
    commit_checkpoint_interval = '60',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    s3.region = 'us-east-1'
);
```
What happens when a column is added upstream
Suppose the product team adds a `brand` column to the `products` table:

```sql
-- Executed on the upstream PostgreSQL database
ALTER TABLE products ADD COLUMN brand VARCHAR(100);
```
With `auto.schema.change = 'true'`, here is the sequence:
- PostgreSQL writes the DDL change to its WAL
- RisingWave's CDC connector detects the DDL event
- RisingWave adds the `brand` column to the `products` table in RisingWave
- New records flowing through the pipeline now include the `brand` column
- The Iceberg sink detects the new column and adds it to the Iceberg table schema
- Existing data files in Iceberg return `NULL` for `brand` (no rewrite needed)
The pipeline never stops. No manual intervention is required.
What Are Best Practices for Schema Evolution in Streaming Iceberg Pipelines?
Managing schema evolution well requires both tooling and process discipline.
Use additive changes whenever possible
Adding new columns is the safest schema change. It never breaks existing queries or downstream consumers. Dropping or renaming columns carries more risk because downstream systems may depend on those column names.
```sql
-- Safe: Adding a new column
ALTER TABLE orders ADD COLUMN loyalty_points INT;

-- Riskier: Dropping a column that downstream reports might use
ALTER TABLE orders DROP COLUMN legacy_status;
```
Set up schema change monitoring
Track schema changes as they happen. In RisingWave, you can monitor CDC events to detect when upstream schemas change:
```sql
-- Monitor schema changes by checking table metadata
DESCRIBE products;
```
Consider setting up alerts when column counts change or when new columns appear in your pipeline.
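A minimal drift detector can diff successive schema snapshots. The sketch below is hypothetical Python: actually fetching the column list (e.g., by running `DESCRIBE` through your database driver) is left out, and the hardcoded lists stand in for two snapshots.

```python
# Sketch of schema drift detection: compare successive column snapshots
# and report additions/removals. Snapshot fetching is omitted; these
# hardcoded lists stand in for the output of DESCRIBE at two points in time.
def diff_schema(previous, current):
    """Return (added, removed) column names between two snapshots."""
    prev, curr = set(previous), set(current)
    return sorted(curr - prev), sorted(prev - curr)

before = ["product_id", "name", "category", "price"]
after = ["product_id", "name", "category", "price", "brand"]

added, removed = diff_schema(before, after)
if added or removed:
    print(f"schema change detected: +{added} -{removed}")
```

Running such a check on a schedule and routing the output to your alerting system gives you an audit trail of upstream DDL events.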
Test schema changes in isolation
Before applying schema changes to production tables, test them in a staging environment. Iceberg's branching feature lets you create isolated branches for testing:
```sql
-- Create a branch to test schema changes (in Spark SQL)
ALTER TABLE retail.product_inventory CREATE BRANCH schema_test;

-- Write test data with the new schema to the branch
INSERT INTO retail.product_inventory.branch_schema_test
SELECT *, 'test_brand' AS brand FROM retail.product_inventory LIMIT 100;

-- Validate the results before merging
SELECT * FROM retail.product_inventory.branch_schema_test LIMIT 10;
```
Handle type promotions explicitly
When upstream systems change column types, verify that the promotion is compatible with Iceberg's rules. Safe promotions include:
- `INT` to `BIGINT`
- `FLOAT` to `DOUBLE`
- `DECIMAL(p, s)` to `DECIMAL(p', s)` where `p' > p`
If the upstream system makes an incompatible type change (e.g., `INT` to `VARCHAR`), you need to handle it with an explicit transformation in your materialized view:
```sql
CREATE MATERIALIZED VIEW orders_transformed AS
SELECT
    order_id,
    customer_id,
    -- Cast the price to a consistent type regardless of upstream changes
    CAST(price AS DECIMAL(12, 2)) AS price,
    order_status,
    created_at
FROM orders;
```
Plan for nested schema evolution
If your data includes nested structures (JSON columns, structs), Iceberg supports evolution at every nesting level. You can add fields inside a struct, change nested field types, or add new map key-value pairs, all without breaking existing data.
```sql
-- Iceberg supports adding fields to nested structs
ALTER TABLE events ADD COLUMN metadata.source_region STRING;
```
This is a significant advantage over formats that flatten nested data during ingestion.
How Does RisingWave Compare to Other Streaming Engines for Schema Evolution?
Different streaming engines handle schema evolution differently when writing to Iceberg.
| Feature | RisingWave | Apache Flink | Kafka Connect |
|---|---|---|---|
| Auto schema detection | Yes (auto.schema.change) | Yes (Flink 2.0 dynamic sink) | Via Schema Registry |
| CDC DDL propagation | ADD/DROP COLUMN | ADD COLUMN | Requires reconfiguration |
| SQL-based pipeline | Yes (native SQL) | Yes (Flink SQL) | No (configuration-based) |
| Iceberg sink | Built-in | Built-in | Via connector plugin |
| Automatic column mapping | Yes (*) syntax | Partial | No |
RisingWave's advantage for schema evolution is its tight integration between CDC sources and the Iceberg sink. DDL changes detected at the source propagate through the pipeline to the Iceberg table automatically. With Flink, the new dynamic sink in Flink 2.0 provides similar capabilities, but it requires more configuration and operational overhead.
FAQ
What is schema evolution in Apache Iceberg?
Schema evolution in Apache Iceberg is the ability to modify a table's schema (add, drop, rename, or reorder columns and promote types) without rewriting existing data files. Iceberg achieves this through its ID-based column tracking system, where each column is assigned a unique integer ID that persists through all schema changes. This design ensures backward compatibility and correct historical queries.
Can I rename columns in Iceberg without breaking my streaming pipeline?
Yes. Iceberg's ID-based column tracking means renaming a column does not affect data files or existing queries that reference the column by ID. The rename is recorded in the table metadata, and query engines resolve the new name using the same underlying column ID. Your streaming pipeline continues writing data without interruption.
How does Iceberg schema evolution differ from Delta Lake?
The primary difference is column identification. Iceberg uses unique integer IDs for each column, making renames and reorders inherently safe. Delta Lake traditionally identifies columns by name, which means renames can break the connection between old data files and the current schema unless you enable column mapping mode. Iceberg also supports in-place partition evolution, while Delta Lake requires table recreation for partition changes.
Does RisingWave automatically handle upstream schema changes when sinking to Iceberg?
Yes, when you configure a CDC source with `auto.schema.change = 'true'` and use the `(*)` syntax for table creation, RisingWave detects DDL changes (such as `ADD COLUMN` and `DROP COLUMN`) from the upstream database and propagates them through the pipeline to the Iceberg sink. The pipeline does not need to be stopped or reconfigured.
Conclusion
Schema evolution is inevitable in any production data system. The question is whether your streaming pipeline handles it gracefully or breaks when it happens.
Key takeaways:
- Iceberg's ID-based column tracking makes schema changes safe without data rewrites
- Adding columns is the safest change; dropping and renaming carry more downstream risk
- RisingWave's `auto.schema.change` parameter and `(*)` syntax propagate DDL changes from CDC sources through to Iceberg automatically
- Test schema changes in isolation using Iceberg branches before applying them to production
- For type changes, use explicit casts in materialized views to ensure consistency across the pipeline
- For type changes, use explicit casts in materialized views to ensure consistency across the pipeline
Ready to build a schema-resilient streaming lakehouse? Try RisingWave Cloud free, no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

