Schema Evolution in Data Lakehouses refers to the capability of safely and efficiently modifying the structure (schema) of tables stored in a data lake, even after data has been written and is actively being queried. This is a critical feature provided by open table formats like Apache Iceberg, Apache Hudi, and Delta Lake, which underpin the Lakehouse architecture.
Without robust schema evolution, changes to data structure in a traditional data lake often involve costly and disruptive rewrites of entire datasets.
Imagine an Iceberg table orders with columns order_id, customer_id, and order_date.
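For concreteness, the table could be created as follows (a minimal sketch in Spark SQL, assuming an Iceberg catalog is already configured; the seeded row is illustrative):

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE
) USING iceberg;

-- One row written under the original three-column schema.
INSERT INTO orders VALUES (1, 100, DATE '2024-01-15');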
Add Column:
ALTER TABLE orders ADD COLUMN order_total DECIMAL(10,2);
Iceberg updates its table metadata to include order_total; existing data files are not rewritten. Queries read order_total as NULL for rows written before the change (or as a specified default, if one is defined).
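Continuing the sketch above, a quick check illustrates this behavior (the inserted values are illustrative):

-- New writes can populate the column; rows written before the ALTER read back as NULL.
INSERT INTO orders VALUES (2, 101, DATE '2024-02-01', 59.90);
SELECT order_id, order_total FROM orders;
-- order_id 1 -> NULL (pre-ALTER row), order_id 2 -> 59.90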
Rename Column:
ALTER TABLE orders RENAME COLUMN order_date TO placement_date;
Iceberg updates only its metadata: columns are tracked by internal field IDs, so no data files are rewritten, and queries using placement_date resolve to the same underlying data.
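A short illustration, continuing the sketch above:

-- Reads the same data, including files written while the column was still named order_date.
SELECT order_id, placement_date FROM orders WHERE placement_date >= DATE '2024-01-01';

-- The old name no longer resolves:
-- SELECT order_date FROM orders;   -- fails with an unresolved-column error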
While RisingWave itself manages schemas for its internal state and materialized views, when it acts as a sink to a Lakehouse table (e.g., an Iceberg table), it relies on the target table format's schema evolution capabilities. If the schema of a RisingWave materialized view (source) changes, and this view is being sunk to an Iceberg table, careful coordination is needed: the Iceberg table's schema typically has to be evolved first so that the sink's output continues to match the target table's columns and types, as sketched below.
This ensures that data continues to flow correctly and that the Lakehouse table maintains its integrity.
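A hedged sketch of that coordination for a newly added column. The RisingWave sink definition below is illustrative only: connector option names and required catalog settings vary by RisingWave version and deployment, so consult the Iceberg sink documentation for your setup.

-- 1. Evolve the target Iceberg table first (engine-side, e.g. in Spark SQL).
ALTER TABLE orders ADD COLUMN discount DECIMAL(10,2);

-- 2. Then adjust the RisingWave side so the sink's output matches the new schema,
--    for example by recreating the sink from the updated materialized view.
--    (orders_mv, analytics, and the sink name are hypothetical placeholders;
--    catalog and object-store options are omitted.)
CREATE SINK orders_iceberg_sink FROM orders_mv
WITH (
    connector = 'iceberg',
    type = 'append-only',
    database.name = 'analytics',
    table.name = 'orders'
);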