Open Table Format

An Open Table Format is a specification that defines how large collections of data files are organized and managed as tables, typically within a data lake environment (e.g., on cloud object storage like AWS S3, Google Cloud Storage, or Azure Data Lake Storage). These formats bring database-like features and reliability to data lakes, which traditionally store data in various raw file formats (like Parquet, ORC, Avro) without a higher-level table structure.

Key examples of Open Table Formats include:

  • Apache Iceberg
  • Apache Hudi
  • Delta Lake

Core Problems Addressed

Traditional data lakes often suffer from:

  • Lack of ACID Transactions: Making concurrent reads and writes reliable is difficult, leading to potential data corruption or inconsistent views.
  • Schema Enforcement & Evolution Challenges: Enforcing schemas across many files and evolving schemas over time without breaking pipelines can be complex.
  • Difficult Data Management: Operations like updates, deletes, and merges (upserts) on specific rows are inefficient or unsupported directly on raw data files.
  • Inconsistent Query Results: Different query engines might interpret data or list files differently, leading to varying results.
  • "Small File Problem": Accumulation of many small files can degrade query performance significantly due to metadata overhead.
  • Limited Performance Optimizations: Features like data skipping, indexing, and efficient partitioning are harder to implement on raw files.

Key Features & Benefits

Open Table Formats address these issues by providing:

  1. Transactional Guarantees (ACID): They enable atomic, consistent, isolated, and durable operations on tables in the data lake, preventing data corruption from concurrent writes or failed jobs.
  2. Schema Management:
    • Schema Enforcement: Ensure data written to a table conforms to its defined schema.
    • Schema Evolution: Support safe and robust changes to the table schema over time (e.g., adding, dropping, renaming columns, changing types) without rewriting all data files.
  3. Time Travel & Versioning: Allow users to query previous versions or "snapshots" of a table, enabling reproducibility, auditing, and rollback capabilities (see the first sketch after this list).
  4. Unified Batch and Stream Processing: Designed to be used by both batch processing engines (like Spark, Trino, and Presto) and stream processing engines (like Flink and Spark Streaming, and increasingly by streaming databases such as RisingWave, which write to them as sinks).
  5. Performance Optimizations:
    • Data Partitioning: Efficiently organize data based on partition keys to prune irrelevant data during queries. Some formats support hidden partitioning to simplify management (see the second sketch after this list).
    • Data Compaction: Tools and strategies to merge small files into larger, more optimal ones.
    • Data Skipping/Indexing: Store metadata (like min/max values per column per file) to allow query engines to skip reading files that don't contain relevant data.
    • Z-Ordering/Data Clustering: Optimize data layout for common query patterns.
  6. Rich Data Management Operations: Support for row-level updates, deletes, and merges (upserts) directly on data lake tables.
  7. Openness and Interoperability: As open formats, they aim to be compatible with a wide range of query engines and data processing tools, preventing vendor lock-in.
  8. Scalable Metadata Management: Efficiently manage metadata for tables containing petabytes of data and billions of files.
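
Here is a minimal sketch of several of these features in action, using Spark SQL with Apache Iceberg. It assumes a Spark session with the Iceberg runtime on its classpath; the `demo` catalog, `db.events` table, `updates` view, and warehouse path are all hypothetical placeholders:

```python
# Minimal sketch: schema evolution, row-level MERGE, and time travel on an
# Apache Iceberg table via Spark SQL. Catalog and table names and the
# warehouse path are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-table-format-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (country STRING)")

# Row-level upsert (MERGE) directly on the lake table; `updates` is assumed
# to be an existing table or temp view with a compatible schema.
spark.sql("""
    MERGE INTO demo.db.events t
    USING updates u
    ON t.event_id = u.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```

Each write above commits atomically as a new table snapshot, so concurrent readers never observe a half-applied change.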
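A second sketch, continuing in the same hypothetical session, shows hidden partitioning at table creation and a compaction pass that merges small files (`rewrite_data_files` is Iceberg's built-in Spark maintenance procedure):

```python
# Hidden partitioning: the table is physically partitioned by day(ts), but
# queries simply filter on ts; readers never reference the partition layout.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events_by_day (
        event_id BIGINT,
        ts       TIMESTAMP,
        country  STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Compaction: merge many small data files into fewer, larger ones to reduce
# metadata overhead and improve scan performance.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events_by_day')")
```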

How They Work (General Concept)

Open Table Formats typically work by maintaining a layer of metadata on top of the raw data files (e.g., Parquet or ORC files stored in object storage). This metadata tracks:

  • The current schema of the table.
  • A list of data files that constitute the current version (snapshot) of the table.
  • Statistics about the data within each file (for performance optimizations).
  • Transaction logs or manifest files that record changes to the table state.

When a write operation occurs (e.g., new data ingestion, update, delete), the table format updates its metadata to reflect the changes and commits a new snapshot. Query engines then consult this metadata layer to understand the table's structure and identify the correct data files to read for a given query.
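
As an illustration, here is a minimal sketch of inspecting that metadata layer with the PyIceberg library (assuming `pip install pyiceberg`, a catalog configured in `~/.pyiceberg.yaml`, and a hypothetical `db.events` table):

```python
# Minimal sketch: reading an Iceberg table's metadata layer with PyIceberg.
# The catalog name and table identifier are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # connection details come from config
table = catalog.load_table("db.events")

print(table.schema())             # the table's current schema
print(table.current_snapshot())   # the snapshot the table currently points at

# The snapshot's manifests resolve to the concrete data files a query reads,
# along with per-file statistics used for data skipping.
for task in table.scan().plan_files():
    print(task.file.file_path, task.file.record_count)

# The snapshot log records each committed change to the table state.
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)
```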

Role in Modern Data Architectures

Open Table Formats are fundamental to the Data Lakehouse architecture, which combines the cost-effectiveness and flexibility of data lakes with the reliability and performance of data warehouses.

For a streaming system like RisingWave:

  • Sink Target: RisingWave can use Open Table Formats (particularly Apache Iceberg) as a sink. This means RisingWave can continuously process streaming data and write the results into an Iceberg table in a data lake (see the sketch below). This allows the fresh, processed data to be available for batch analytics, BI tools, or other data consumers that can query Iceberg tables.
  • Streaming Lakehouse: This capability is key to building a Streaming Lakehouse, where real-time data is ingested, processed, and made available with low latency directly on the lake, unifying batch and stream processing paradigms.
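
Here is a minimal sketch of this sink pattern, issued over RisingWave's PostgreSQL-compatible protocol. The materialized view name and the Iceberg sink options shown are illustrative placeholders; consult RisingWave's documentation for the exact parameter names:

```python
# Hypothetical sketch: create an Iceberg sink from a RisingWave materialized
# view. Connection defaults (port 4566, user root, database dev) match a
# local RisingWave instance; all sink options below are illustrative.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        CREATE SINK events_iceberg_sink FROM processed_events
        WITH (
            connector = 'iceberg',
            type = 'upsert',
            primary_key = 'event_id',
            catalog.type = 'storage',
            warehouse.path = 's3a://my-bucket/warehouse',
            database.name = 'db',
            table.name = 'events'
        );
    """)
```

Once such a sink is running, downstream engines like Spark or Trino can query the Iceberg table and see RisingWave's continuously refreshed results.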
