Introduction
Your product team just added a `referral_source` column to the `users` table. Your data warehouse team renamed `txn_amt` to `transaction_amount` for clarity. A partner integration now sends prices as decimals instead of integers. In a batch world, you handle these changes during the next scheduled ETL run. In a streaming world, the pipeline never stops, so the schema change arrives while data is actively flowing.
Schema evolution is one of the hardest problems in streaming data engineering. A mismatch between the schema your pipeline expects and the schema it actually receives can cause silent data loss, pipeline crashes, or corrupt downstream tables. Apache Iceberg was designed to handle schema changes safely through its ID-based column tracking system, and it supports adding, dropping, renaming, and reordering columns without rewriting existing data files.
This article explains how Iceberg schema evolution works, how it compares to Delta Lake's approach, and how RisingWave handles schema changes automatically when sinking data to Iceberg tables in a streaming pipeline.
How Does Iceberg Schema Evolution Work?
Apache Iceberg takes a fundamentally different approach to schema management than most data formats. Instead of identifying columns by name or position, Iceberg assigns each column a unique integer ID when it is first created. This ID never changes, even if the column is renamed or moved to a different position in the schema.
ID-based column tracking
When you create an Iceberg table with columns `(id INT, name STRING, email STRING)`, Iceberg internally assigns IDs like this:

| Column Name | Column ID | Type |
|---|---|---|
| id | 1 | INT |
| name | 2 | STRING |
| email | 3 | STRING |
If you later rename `email` to `email_address`, the column ID stays as 3. Old data files still reference column ID 3, and query engines know that column ID 3 now maps to `email_address`. No data rewrite is needed.
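The mechanics can be pictured with a small sketch. This is plain Python for illustration, not Iceberg's actual implementation: data files key values by column ID, the schema maps names to IDs, and a rename only touches that mapping.

```python
# Sketch of ID-based column resolution (illustrative, not Iceberg internals).
# Data files store values keyed by column ID; the schema maps names to IDs.
data_file = {1: 42, 2: "Ada", 3: "ada@example.com"}  # column_id -> value
schema = {"id": 1, "name": 2, "email": 3}            # current name -> column ID

def read_column(data_file, schema, name):
    """Resolve a column name to its ID, then fetch the value by that ID."""
    return data_file[schema[name]]

print(read_column(data_file, schema, "email"))       # reads via ID 3

# Rename email -> email_address: only the metadata mapping changes.
schema = {"id": 1, "name": 2, "email_address": 3}

# The same data file is readable under the new name; ID 3 is unchanged.
print(read_column(data_file, schema, "email_address"))
```

Because old data files are never touched, the rename is a metadata-only operation, which is what makes it safe for historical queries.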
Supported schema changes
Iceberg supports these schema evolution operations:
- Add columns: New columns are appended with a new unique ID. Existing data files return `NULL` for the new column.
- Drop columns: The column ID is retired. Old data files still contain the data, but query engines skip it.
- Rename columns: The column name changes but the ID stays the same. All existing data remains readable.
- Reorder columns: Column positions change, but IDs remain fixed. Query engines map by ID, not position.
- Type promotion: Certain safe type widening is allowed, such as `INT` to `BIGINT`, `FLOAT` to `DOUBLE`, or `DECIMAL(p, s)` to `DECIMAL(p', s)` where `p' > p`.
What schema changes are NOT allowed?
Iceberg prohibits changes that could cause data corruption:
- Narrowing types (e.g., `BIGINT` to `INT`) is not allowed because values could overflow
- Changing between incompatible types (e.g., `STRING` to `INT`) is blocked
- Reusing a dropped column's ID is forbidden to prevent data misinterpretation
These restrictions exist to guarantee that every historical query returns correct results, which is critical for both time travel and audit compliance.
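These rules amount to a widening-only check. The sketch below encodes them in illustrative Python; it is not Iceberg's API, just a compact restatement of the allowed and forbidden promotions listed above.

```python
# Illustrative check for Iceberg-style type promotions (not the real API).
# Only widening changes are allowed; narrowing and incompatible changes fail.
SAFE_PROMOTIONS = {
    ("int", "bigint"),
    ("float", "double"),
}

def parse_decimal(t):
    """Extract (precision, scale) from a string like 'decimal(12, 2)'."""
    p, s = t[t.index("(") + 1 : t.index(")")].split(",")
    return int(p), int(s)

def is_safe_promotion(old, new):
    """Return True if changing a column from `old` to `new` is a safe widening."""
    if old == new:
        return True
    if (old, new) in SAFE_PROMOTIONS:
        return True
    # decimal(p, s) -> decimal(p', s) with p' > p and the same scale
    if old.startswith("decimal") and new.startswith("decimal"):
        (p1, s1), (p2, s2) = parse_decimal(old), parse_decimal(new)
        return p2 > p1 and s1 == s2
    return False

print(is_safe_promotion("int", "bigint"))                     # True: widening
print(is_safe_promotion("bigint", "int"))                     # False: narrowing
print(is_safe_promotion("decimal(10, 2)", "decimal(12, 2)"))  # True: p grows
print(is_safe_promotion("string", "int"))                     # False: incompatible
```

Note that the decimal rule only widens precision; changing the scale is rejected, since that would reinterpret stored values.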
How Does Iceberg Schema Evolution Compare to Delta Lake?
Both Iceberg and Delta Lake support schema evolution, but they differ in important ways.
| Feature | Apache Iceberg | Delta Lake |
|---|---|---|
| Column identification | By unique integer ID | By column name |
| Column rename | Safe (ID unchanged) | Requires column mapping mode |
| Column reorder | Native support | Not natively supported |
| Type promotion | INT to BIGINT, FLOAT to DOUBLE | INT to BIGINT, FLOAT to DOUBLE |
| Partition evolution | In-place, no rewrite | Requires table recreation |
| Nested schema evolution | Full support (struct, list, map) | Limited support |
| Historical correctness | Guaranteed by ID mapping | Can break with renames |
The most significant difference is in column identification. Because Delta Lake uses column names, renaming a column can break the connection between old data files and the current schema unless you enable column mapping mode (available since Delta Lake 2.0). Iceberg avoids this problem entirely through its ID-based system.
Partition evolution
Iceberg also supports partition evolution without rewriting data. You can change from daily to hourly partitioning, and Iceberg handles the mixed partition layouts transparently. With Delta Lake, changing the partition scheme requires creating a new table and migrating data.
This is particularly relevant for streaming pipelines where partition schemes may need to evolve as data volumes grow.
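One way to picture how mixed layouts coexist: each data file records which partition spec it was written under, and scan planning applies the right spec per file. A simplified sketch in Python (illustrative only, not Iceberg internals; file names and spec IDs are invented):

```python
# Each data file remembers the partition spec it was written under (spec_id),
# so old daily-partitioned files and new hourly files coexist in one table.
files = [
    {"path": "f1.parquet", "spec_id": 0, "partition": {"day": "2024-06-01"}},
    {"path": "f2.parquet", "spec_id": 1, "partition": {"hour": "2024-06-02-09"}},
]

def files_for_day(files, day):
    """Plan a scan for one calendar day across both partition layouts."""
    matches = []
    for f in files:
        if f["spec_id"] == 0 and f["partition"]["day"] == day:
            matches.append(f["path"])          # old daily layout
        elif f["spec_id"] == 1 and f["partition"]["hour"].startswith(day):
            matches.append(f["path"])          # new hourly layout
    return matches

print(files_for_day(files, "2024-06-01"))  # matches the daily file
print(files_for_day(files, "2024-06-02"))  # matches the hourly file
```

The key point is that no file is rewritten when the spec changes; the planner simply evaluates each file against the spec it was written with.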
What Happens When Source Schemas Change in Streaming Pipelines?
In a streaming pipeline, schema changes at the source system create a cascade of challenges:
The column addition scenario
Consider this sequence of events:
- Your streaming pipeline ingests order data with columns `(order_id, customer_id, amount, created_at)`
- The product team adds a `discount_code` column to the orders table in PostgreSQL
- New records now include `discount_code`, but in-flight records in your Kafka topic may not
Without proper handling, the pipeline might:
- Crash because it encounters an unexpected column
- Silently drop the new column
- Write records with misaligned columns
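A common defensive pattern, sketched here as hypothetical consumer-side Python (not a RisingWave API): treat the expected column list as the contract and fill missing fields with NULL rather than crashing or misaligning.

```python
# Hypothetical consumer-side normalization: project every record onto the
# expected column list, filling missing fields with None (NULL).
EXPECTED_COLUMNS = ["order_id", "customer_id", "amount", "created_at", "discount_code"]

def normalize(record):
    """Align a record to the expected schema; missing columns become None."""
    return {col: record.get(col) for col in EXPECTED_COLUMNS}

# In-flight record produced before the schema change (no discount_code yet)
old = {"order_id": 1, "customer_id": 7, "amount": 1999, "created_at": "2024-06-01"}
# Record produced after the upstream ALTER TABLE
new = {"order_id": 2, "customer_id": 8, "amount": 2500,
       "created_at": "2024-06-02", "discount_code": "SUMMER10"}

print(normalize(old)["discount_code"])  # None
print(normalize(new)["discount_code"])  # SUMMER10
```

This mirrors what Iceberg does on the read side for old data files: the new column simply reads as NULL.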
The type change scenario
- Your pipeline processes `price` as an integer (cents)
- The upstream system changes `price` to a decimal with two fractional digits
- In-flight records have integer prices while new records have decimal prices
How RisingWave handles schema changes
RisingWave provides automatic schema change propagation when you enable the `auto.schema.change` parameter on CDC sources. Here is how to set it up:
```sql
-- Create a PostgreSQL CDC source with auto schema change enabled
CREATE SOURCE pg_orders WITH (
    connector = 'postgres-cdc',
    hostname = 'prod-db.example.com',
    port = '5432',
    username = 'cdc_user',
    password = '${PG_PASSWORD}',
    database.name = 'ecommerce',
    schema.name = 'public',
    auto.schema.change = 'true'
);

-- Create a table with automatic schema mapping
CREATE TABLE orders (*) FROM pg_orders TABLE 'public.orders';
```
When the upstream PostgreSQL table changes (e.g., `ALTER TABLE orders ADD COLUMN discount_code VARCHAR`), RisingWave detects the DDL change through the WAL and propagates it downstream automatically. The `(*)` syntax maps all columns from the source, so new columns appear in RisingWave without manual intervention.
How Do You Build a Schema-Resilient Streaming Pipeline to Iceberg?
Here is a complete example that handles schema evolution from source to lakehouse.
Step 1: Set up the CDC source with auto schema change
```sql
CREATE SOURCE pg_source WITH (
    connector = 'postgres-cdc',
    hostname = 'db.example.com',
    port = '5432',
    username = 'cdc_reader',
    password = '${PG_PASSWORD}',
    database.name = 'product_db',
    schema.name = 'public',
    auto.schema.change = 'true'
);
```
Step 2: Create tables that mirror upstream schemas
```sql
-- Using (*) for automatic schema mapping
CREATE TABLE products (*) FROM pg_source TABLE 'public.products';
CREATE TABLE inventory (*) FROM pg_source TABLE 'public.inventory';
```
When you use `(*)`, RisingWave automatically maps all columns from the upstream table. If the upstream schema changes, RisingWave propagates `ADD COLUMN` and `DROP COLUMN` operations to the downstream table.
Step 3: Create a materialized view for transformations
```sql
CREATE MATERIALIZED VIEW product_inventory AS
SELECT
    p.product_id,
    p.name AS product_name,
    p.category,
    p.price,
    i.warehouse_id,
    i.quantity_on_hand,
    i.last_restocked_at,
    CASE
        WHEN i.quantity_on_hand = 0 THEN 'out_of_stock'
        WHEN i.quantity_on_hand < 10 THEN 'low_stock'
        ELSE 'in_stock'
    END AS stock_status
FROM products p
JOIN inventory i ON p.product_id = i.product_id;
```
Step 4: Sink to Iceberg with upsert mode
```sql
CREATE SINK product_inventory_iceberg FROM product_inventory
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'product_id,warehouse_id',
    warehouse.path = 's3://data-lake/warehouse',
    database.name = 'retail',
    table.name = 'product_inventory',
    catalog.type = 'rest',
    catalog.name = 'retail_catalog',
    create_table_if_not_exists = 'true',
    commit_checkpoint_interval = '60',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    s3.region = 'us-east-1'
);
```
What happens when a column is added upstream
Suppose the product team adds a `brand` column to the `products` table:

```sql
-- Executed on the upstream PostgreSQL database
ALTER TABLE products ADD COLUMN brand VARCHAR(100);
```
With `auto.schema.change = 'true'`, here is the sequence:
- PostgreSQL writes the DDL change to its WAL
- RisingWave's CDC connector detects the DDL event
- RisingWave adds the `brand` column to the `products` table in RisingWave
- New records flowing through the pipeline now include the `brand` column
- The Iceberg sink detects the new column and adds it to the Iceberg table schema
- Existing data files in Iceberg return `NULL` for `brand` (no rewrite needed)
The pipeline never stops. No manual intervention is required.
What Are Best Practices for Schema Evolution in Streaming Iceberg Pipelines?
Managing schema evolution well requires both tooling and process discipline.
Use additive changes whenever possible
Adding new columns is the safest schema change. It never breaks existing queries or downstream consumers. Dropping or renaming columns carries more risk because downstream systems may depend on those column names.
```sql
-- Safe: Adding a new column
ALTER TABLE orders ADD COLUMN loyalty_points INT;

-- Riskier: Dropping a column that downstream reports might use
ALTER TABLE orders DROP COLUMN legacy_status;
```
Set up schema change monitoring
Track schema changes as they happen. In RisingWave, you can monitor CDC events to detect when upstream schemas change:
```sql
-- Monitor schema changes by checking table metadata
DESCRIBE products;
```
Consider setting up alerts when column counts change or when new columns appear in your pipeline.
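A minimal drift detector can diff successive schema snapshots. The sketch below is hypothetical Python: actually fetching the column list (e.g., by running `DESCRIBE` through your database driver) is left out, and the hardcoded lists stand in for two snapshots.

```python
# Sketch of schema drift detection: compare successive column snapshots
# and report additions/removals. Snapshot fetching is omitted; these
# hardcoded lists stand in for the output of DESCRIBE at two points in time.
def diff_schema(previous, current):
    """Return (added, removed) column names between two snapshots."""
    prev, curr = set(previous), set(current)
    return sorted(curr - prev), sorted(prev - curr)

before = ["product_id", "name", "category", "price"]
after = ["product_id", "name", "category", "price", "brand"]

added, removed = diff_schema(before, after)
if added or removed:
    print(f"schema change detected: +{added} -{removed}")
```

Running such a check on a schedule and routing the output to your alerting system gives you an audit trail of upstream DDL events.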
Test schema changes in isolation
Before applying schema changes to production tables, test them in a staging environment. Iceberg's branching feature lets you create isolated branches for testing:
```sql
-- Create a branch to test schema changes (in Spark SQL)
ALTER TABLE retail.product_inventory CREATE BRANCH schema_test;

-- Write test data with the new schema to the branch
INSERT INTO retail.product_inventory.branch_schema_test
SELECT *, 'test_brand' AS brand FROM retail.product_inventory LIMIT 100;

-- Validate the results before merging
SELECT * FROM retail.product_inventory.branch_schema_test LIMIT 10;
```
Handle type promotions explicitly
When upstream systems change column types, verify that the promotion is compatible with Iceberg's rules. Safe promotions include:
- `INT` to `BIGINT`
- `FLOAT` to `DOUBLE`
- `DECIMAL(p, s)` to `DECIMAL(p', s)` where `p' > p`
If the upstream system makes an incompatible type change (e.g., `INT` to `VARCHAR`), you need to handle it with an explicit transformation in your materialized view:
```sql
CREATE MATERIALIZED VIEW orders_transformed AS
SELECT
    order_id,
    customer_id,
    -- Cast the price to a consistent type regardless of upstream changes
    CAST(price AS DECIMAL(12, 2)) AS price,
    order_status,
    created_at
FROM orders;
```
Plan for nested schema evolution
If your data includes nested structures (JSON columns, structs), Iceberg supports evolution at every nesting level. You can add fields inside a struct, change nested field types, or add new map key-value pairs, all without breaking existing data.
```sql
-- Iceberg supports adding fields to nested structs
ALTER TABLE events ADD COLUMN metadata.source_region STRING;
```
This is a significant advantage over formats that flatten nested data during ingestion.
How Does RisingWave Compare to Other Streaming Engines for Schema Evolution?
Different streaming engines handle schema evolution differently when writing to Iceberg.
| Feature | RisingWave | Apache Flink | Kafka Connect |
|---|---|---|---|
| Auto schema detection | Yes (auto.schema.change) | Yes (Flink 2.0 dynamic sink) | Via Schema Registry |
| CDC DDL propagation | ADD/DROP COLUMN | ADD COLUMN | Requires reconfiguration |
| SQL-based pipeline | Yes (native SQL) | Yes (Flink SQL) | No (configuration-based) |
| Iceberg sink | Built-in | Built-in | Via connector plugin |
| Automatic column mapping | Yes (*) syntax | Partial | No |
RisingWave's advantage for schema evolution is its tight integration between CDC sources and the Iceberg sink. DDL changes detected at the source propagate through the pipeline to the Iceberg table automatically. With Flink, the new dynamic sink in Flink 2.0 provides similar capabilities, but it requires more configuration and operational overhead.
FAQ
What is schema evolution in Apache Iceberg?
Schema evolution in Apache Iceberg is the ability to modify a table's schema (add, drop, rename, or reorder columns and promote types) without rewriting existing data files. Iceberg achieves this through its ID-based column tracking system, where each column is assigned a unique integer ID that persists through all schema changes. This design ensures backward compatibility and correct historical queries.
Can I rename columns in Iceberg without breaking my streaming pipeline?
Yes. Iceberg's ID-based column tracking means renaming a column does not affect data files or existing queries that reference the column by ID. The rename is recorded in the table metadata, and query engines resolve the new name using the same underlying column ID. Your streaming pipeline continues writing data without interruption.
How does Iceberg schema evolution differ from Delta Lake?
The primary difference is column identification. Iceberg uses unique integer IDs for each column, making renames and reorders inherently safe. Delta Lake traditionally identifies columns by name, which means renames can break the connection between old data files and the current schema unless you enable column mapping mode. Iceberg also supports in-place partition evolution, while Delta Lake requires table recreation for partition changes.
Does RisingWave automatically handle upstream schema changes when sinking to Iceberg?
Yes, when you configure a CDC source with `auto.schema.change = 'true'` and use the `(*)` syntax for table creation, RisingWave detects DDL changes (such as `ADD COLUMN` and `DROP COLUMN`) from the upstream database and propagates them through the pipeline to the Iceberg sink. The pipeline does not need to be stopped or reconfigured.
Conclusion
Schema evolution is inevitable in any production data system. The question is whether your streaming pipeline handles it gracefully or breaks when it happens.
Key takeaways:
- Iceberg's ID-based column tracking makes schema changes safe without data rewrites
- Adding columns is the safest change; dropping and renaming carry more downstream risk
- RisingWave's `auto.schema.change` parameter and `(*)` syntax propagate DDL changes from CDC sources through to Iceberg automatically
- Test schema changes in isolation using Iceberg branches before applying them to production
- For type changes, use explicit casts in materialized views to ensure consistency across the pipeline
- For type changes, use explicit casts in materialized views to ensure consistency across the pipeline
Ready to build a schema-resilient streaming lakehouse? Try RisingWave Cloud free, no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

