Schema Evolution in Data Lakehouses

Schema evolution in data lakehouses is the ability to safely and efficiently modify the structure (schema) of tables stored in a data lake, even after data has been written and is actively being queried. It is a critical feature of open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake, which underpin the lakehouse architecture.

Without robust schema evolution, changes to data structure in a traditional data lake often involve costly and disruptive rewrites of entire datasets.

Core Principles and Benefits

Safe Changes: Allows for common schema modifications without breaking existing data or queries. Supported operations typically include:
- Adding new columns: New columns can be added. Existing rows read the new column as NULL or, where the format supports it, as a specified default value.
- Dropping columns: Columns can be removed.
- Renaming columns: Existing columns can be renamed.
- Reordering columns: The logical order of columns can be changed.
- Changing data types (promotion): Widening a data type (e.g., INT to LONG, FLOAT to DOUBLE) is often supported. Type changes that might truncate or misinterpret data (e.g., STRING to INT) are usually disallowed or require explicit casting.
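
The distinction between a safe promotion and an unsafe type change can be sketched as a simple lookup. This is a minimal illustration only: the `ALLOWED_PROMOTIONS` set and the `is_safe_type_change` helper are hypothetical names, and the set contains just the two widenings named above, whereas each table format defines its own list.

```python
# Hypothetical widening rules, limited to the promotions named in the text.
# Real table formats define their own lists; this set is illustrative only.
ALLOWED_PROMOTIONS = {
    ("int", "long"),
    ("float", "double"),
}


def is_safe_type_change(old_type: str, new_type: str) -> bool:
    """Return True if every value of old_type fits in new_type unchanged."""
    if old_type == new_type:
        return True
    return (old_type, new_type) in ALLOWED_PROMOTIONS


print(is_safe_type_change("int", "long"))    # widening
print(is_safe_type_change("long", "int"))    # narrowing could truncate
print(is_safe_type_change("string", "int"))  # could misinterpret data
```

Only the first call reports a safe change; the other two are the kind of change a table format would reject or require an explicit cast for.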
No Data Rewrite (Often): Many schema changes (such as adding a column, or renaming a column when the format tracks columns by ID rather than by name) can be applied by updating only the table's metadata, without rewriting any of the underlying data files. This makes schema evolution fast and cost-effective.
Atomic Metadata Operations: Schema changes are typically atomic operations on the table's metadata, ensuring consistency.
Compatibility: Table formats manage schema versions, allowing older queries or applications using a previous schema version to still function correctly or to be gracefully informed of schema changes.
Decoupling from Physical Layout: The logical schema perceived by users is decoupled from the physical schema of the underlying data files, enabling this flexibility.
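
To make the metadata-only and atomic aspects concrete, here is a minimal sketch of a table whose metadata keeps every schema version and a pointer to the current one. The `Column`, `TableMetadata`, and `add_column` names are hypothetical, and the model is heavily simplified; it is not the metadata layout of any particular table format.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Column:
    field_id: int  # stable identifier, never reused
    name: str
    type: str


@dataclass(frozen=True)
class TableMetadata:
    schemas: Tuple[Tuple[Column, ...], ...]  # every schema version, oldest first
    current_schema_id: int                   # index into `schemas`
    data_files: Tuple[str, ...]              # paths of the data files

    def current_schema(self) -> Tuple[Column, ...]:
        return self.schemas[self.current_schema_id]


def add_column(meta: TableMetadata, name: str, type_: str) -> TableMetadata:
    """Return new metadata with one more schema version; data files are untouched."""
    # Take the next ID across all versions so IDs of dropped columns are not reused.
    next_id = max(c.field_id for schema in meta.schemas for c in schema) + 1
    new_schema = meta.current_schema() + (Column(next_id, name, type_),)
    return replace(
        meta,
        schemas=meta.schemas + (new_schema,),
        current_schema_id=len(meta.schemas),
    )


v0 = TableMetadata(
    schemas=((Column(1, "order_id", "long"), Column(2, "customer_id", "long")),),
    current_schema_id=0,
    data_files=("data/file-0001.parquet",),
)
v1 = add_column(v0, "order_total", "decimal(10,2)")

print([c.name for c in v1.current_schema()])
print(v1.data_files == v0.data_files)
```

Because the metadata objects are immutable, a reader holding `v0` keeps seeing the old schema, while a reader that picks up `v1` sees the new one. Publishing the change amounts to swapping which metadata object is current, which is what makes it behave as a single atomic step in this sketch.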

Why is it Important for Data Lakehouses?

Agility: Business requirements and data sources change. Schema evolution allows data teams to adapt their data models quickly.
Long-Term Data Management: Ensures that data stored in the lakehouse remains usable and adaptable over many years, despite evolving data needs.
Reduced Maintenance Overhead: Avoids complex and error-prone manual processes for updating table structures.
Data Governance: Table formats provide a clear history of schema changes, aiding in auditing and understanding data lineage.

Example (Conceptual with Apache Iceberg)

Imagine an Iceberg table orders with columns order_id, customer_id, order_date.

Add Column:
```
ALTER TABLE orders ADD COLUMN order_total DECIMAL(10,2);
```
Iceberg updates its metadata to include order_total. Existing data files are not immediately rewritten; new queries will see order_total as NULL for old rows (or a specified default).

Rename Column:
```
ALTER TABLE orders RENAME COLUMN order_date TO placement_date;
```
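Iceberg updates only its metadata. Because Iceberg tracks columns by ID rather than by name, existing data files remain readable without being rewritten. Queries must now refer to the column as placement_date; the old name order_date no longer resolves.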
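
The effect of both changes on data that was written before them can be sketched in a few lines. This is a simplified model, not Iceberg's reader: the field IDs, the sample row, and the `read_rows` helper are assumptions made for illustration. It relies only on the fact that columns are matched by ID, so a file written under the old schema can be projected through the current one.

```python
from typing import Any, Dict, List

# One row from a data file written under the ORIGINAL schema.
# Values are keyed by field ID: 1 = order_id, 2 = customer_id, 3 = order_date.
OLD_FILE_ROWS: List[Dict[int, Any]] = [
    {1: 1001, 2: 42, 3: "2024-01-15"},
]

# Current schema after ADD COLUMN order_total (new ID 4)
# and RENAME COLUMN order_date TO placement_date (ID 3 keeps its ID).
CURRENT_SCHEMA: Dict[int, str] = {
    1: "order_id",
    2: "customer_id",
    3: "placement_date",
    4: "order_total",
}


def read_rows(file_rows: List[Dict[int, Any]], schema: Dict[int, str]) -> List[Dict[str, Any]]:
    """Project stored rows through the schema; IDs absent from the file read as None."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in file_rows
    ]


print(read_rows(OLD_FILE_ROWS, CURRENT_SCHEMA))
# [{'order_id': 1001, 'customer_id': 42, 'placement_date': '2024-01-15', 'order_total': None}]
```

The old file is never modified: the renamed column is found through its unchanged ID, and the added column comes back as None (NULL) because the file contains no value for ID 4.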

RisingWave and Schema Evolution

RisingWave manages schemas for its own internal state and materialized views. When it acts as a sink to a lakehouse table (e.g., an Iceberg table), however, it relies on the target table format's schema evolution capabilities. If the schema of a RisingWave materialized view changes while that view is being sunk to an Iceberg table, careful coordination is needed:

The schema of the target Iceberg table must be evolved first to accommodate the changes (e.g., add a new column in Iceberg).
Then, the RisingWave sink can be updated or recreated to map to the new target schema.

This ensures that data continues to flow correctly and that the Lakehouse table maintains its integrity.
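
The ordering above suggests a simple pre-flight check before the sink is updated or recreated: confirm that the target table already has every column the view emits. The sketch below is not a RisingWave or Iceberg API. The `schema_mismatches` helper is hypothetical, schemas are represented as plain name-to-type dictionaries, and requiring exactly equal type names is a deliberate simplification.

```python
from typing import Dict, List


def schema_mismatches(view_schema: Dict[str, str], target_schema: Dict[str, str]) -> List[str]:
    """List columns the view emits that the target lacks or declares with another type."""
    problems: List[str] = []
    for name, type_ in view_schema.items():
        if name not in target_schema:
            problems.append(f"{name}: missing in target")
        elif target_schema[name] != type_:
            problems.append(f"{name}: view has {type_}, target has {target_schema[name]}")
    return problems


# Hypothetical schemas: the view has gained order_total, the target has not yet.
view = {"order_id": "bigint", "customer_id": "bigint", "order_total": "decimal"}
target_before = {"order_id": "bigint", "customer_id": "bigint"}
target_after = {**target_before, "order_total": "decimal"}

print(schema_mismatches(view, target_before))  # ['order_total: missing in target']
print(schema_mismatches(view, target_after))   # []
```

An empty result indicates that the target table has been evolved first and that the sink can be mapped to the new schema. Extra columns that exist only in the target are ignored here, on the assumption that they are nullable.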
