Open Table Format
An Open Table Format is a specification that defines how large collections of data files are organized and managed as tables, typically within a data lake environment (e.g., on cloud object storage like AWS S3, Google Cloud Storage, or Azure Data Lake Storage). These formats bring database-like features and reliability to data lakes, which traditionally store data in various raw file formats (like Parquet, ORC, Avro) without a higher-level table structure.
Key examples of Open Table Formats include:
- Apache Iceberg
- Apache Hudi
- Delta Lake
Core Problems Addressed
Traditional data lakes often suffer from:
- Lack of ACID Transactions: Making concurrent reads and writes reliable is difficult, leading to potential data corruption or inconsistent views.
- Schema Enforcement & Evolution Challenges: Enforcing schemas across many files and evolving schemas over time without breaking pipelines can be complex.
- Difficult Data Management: Operations like updates, deletes, and merges (upserts) on specific rows are inefficient or unsupported directly on raw data files.
- Inconsistent Query Results: Different query engines might interpret data or list files differently, leading to varying results.
- "Small File Problem": Accumulation of many small files can degrade query performance significantly due to metadata overhead.
- Limited Performance Optimizations: Features like data skipping, indexing, and efficient partitioning are harder to implement on raw files.
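The "small file problem" above is what compaction (a feature of all three formats) addresses: many tiny files are rewritten into fewer, larger ones so that query engines pay less per-file metadata and open/close overhead. A minimal greedy sketch of the grouping step is shown below; the `compact` helper and the 512 KiB target are hypothetical, not part of any real table format's API.

```python
def compact(file_sizes, target_bytes):
    """Greedily group small files into batches of roughly target_bytes each.

    Each returned batch would then be rewritten as one larger file, cutting
    per-file metadata and file-open overhead at query time.
    """
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        # Start a new batch once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Eight 100 KB files collapse into two rewrite groups at a 512 KB target.
small_files = [100_000] * 8  # sizes in bytes
print(len(compact(small_files, 512_000)))
```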
Key Features & Benefits
Open Table Formats address these issues by providing:
- Transactional Guarantees (ACID): They enable atomic, consistent, isolated, and durable operations on tables in the data lake, preventing data corruption from concurrent writes or failed jobs.
- Schema Management:
  - Schema Enforcement: Ensure data written to a table conforms to its defined schema.
  - Schema Evolution: Support safe and robust changes to the table schema over time (e.g., adding, dropping, renaming columns, changing types) without rewriting all data files.
- Time Travel & Versioning: Allow users to query previous versions or "snapshots" of a table, enabling reproducibility, auditing, and rollback capabilities.
- Unified Batch and Stream Processing: Designed for use by both batch engines (such as Spark, Trino, and Presto) and stream processors (such as Flink and Spark Streaming); streaming databases like RisingWave can also write to them as sinks.
- Performance Optimizations:
  - Data Partitioning: Efficiently organize data based on partition keys to prune irrelevant data during queries. Some formats support hidden partitioning to simplify management.
  - Data Compaction: Tools and strategies to merge small files into larger, more optimal ones.
  - Data Skipping/Indexing: Store metadata (like min/max values per column per file) to allow query engines to skip reading files that don't contain relevant data.
  - Z-Ordering/Data Clustering: Optimize data layout for common query patterns.
- Rich Data Management Operations: Support for row-level updates, deletes, and merges (upserts) directly on data lake tables.
- Openness and Interoperability: As open formats, they aim to be compatible with a wide range of query engines and data processing tools, preventing vendor lock-in.
- Scalable Metadata Management: Efficiently manage metadata for tables containing petabytes of data and billions of files.
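Data skipping from the list above can be illustrated with a toy sketch: each data file carries min/max statistics for a column, and the engine consults only that metadata to decide which files to open. The `DataFile` structure and `prune_files` helper here are hypothetical illustrations, not any real format's API.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    """Hypothetical per-file metadata, as a table format might record it."""
    path: str
    min_value: int  # minimum of an indexed column within this file
    max_value: int  # maximum of the same column within this file

def prune_files(files, lower, upper):
    """Keep only files whose [min, max] range overlaps [lower, upper].

    The skipped files are never opened or read at all.
    """
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]

files = [
    DataFile("part-000.parquet", min_value=1,   max_value=100),
    DataFile("part-001.parquet", min_value=101, max_value=200),
    DataFile("part-002.parquet", min_value=201, max_value=300),
]

# A query filtering on values 150..250 only needs two of the three files.
to_read = prune_files(files, 150, 250)
print([f.path for f in to_read])  # part-001 and part-002 survive pruning
```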
How They Work (General Concept)
Open Table Formats typically work by maintaining a layer of metadata on top of the raw data files (e.g., Parquet, ORC files stored in object storage). This metadata tracks:
- The current schema of the table.
- A list of data files that constitute the current version (snapshot) of the table.
- Statistics about the data within each file (for performance optimizations).
- Transaction logs or manifest files that record changes to the table state.
When a write operation occurs (e.g., new data ingestion, update, delete), the table format updates its metadata to reflect the changes and commits a new snapshot. Query engines then consult this metadata layer to understand the table's structure and identify the correct data files to read for a given query.
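The commit flow described above can be sketched as a toy metadata layer (hypothetical names, far simpler than any real format): each write appends an immutable snapshot listing the table's data files, readers resolve the latest snapshot, and time travel is simply reading an older one.

```python
class Table:
    """Toy metadata layer: an append-only list of immutable snapshots."""

    def __init__(self, schema):
        self.schema = schema   # current table schema
        self.snapshots = []    # each entry is (snapshot_id, tuple_of_files)

    def commit(self, files):
        """Atomically record a new table state; old snapshots stay readable."""
        snapshot_id = len(self.snapshots)
        self.snapshots.append((snapshot_id, tuple(files)))
        return snapshot_id

    def scan(self, snapshot_id=None):
        """Return the data files a query engine should read.

        With no snapshot_id, read the latest state; with one, time-travel.
        """
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            snapshot_id = self.snapshots[-1][0]
        return self.snapshots[snapshot_id][1]

t = Table(schema={"id": "int", "name": "string"})
s0 = t.commit(["a.parquet"])               # first ingestion commit
s1 = t.commit(["a.parquet", "b.parquet"])  # append more data
print(t.scan())    # latest snapshot sees both files
print(t.scan(s0))  # time travel sees only the first file
```

Real formats add much more (manifest files, column statistics, conflict detection for concurrent writers), but the core idea is the same: queries never list raw files directly, they read whatever file list the chosen snapshot records.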
Role in Modern Data Architectures
Open Table Formats are fundamental to the Data Lakehouse architecture, which combines the cost-effectiveness and flexibility of data lakes with the reliability and performance of data warehouses.
For a streaming system like RisingWave:
- Sink Target: RisingWave can use Open Table Formats (particularly Apache Iceberg) as a sink, continuously processing streaming data and writing the results into an Iceberg table in the data lake. The fresh, processed data then becomes available to batch analytics, BI tools, and any other consumers that can query Iceberg tables.
- Streaming Lakehouse: This capability is key to building a Streaming Lakehouse, where real-time data is ingested, processed, and made available with low latency directly on the lake, unifying batch and stream processing paradigms.