Introducing Iceberg Table Engine in RisingWave: Manage Streaming Data in Iceberg with SQL

Apache Iceberg has rapidly become the open standard for table formats in modern data lakes. Its powerful features—schema evolution, hidden partitioning, time travel, and ACID compliance—enable scalable, reliable, and vendor-neutral data architectures.

However, Iceberg defines the storage format, leaving the complexities of data ingestion and processing, especially for real-time streams, to separate systems. While query engines like Trino or Athena excel with static datasets, they aren't designed for continuous, low-latency ingestion and transformation of streaming data into Iceberg. This often forces engineers to integrate multiple complex tools, increasing operational overhead and fragility.

The Challenges of Streaming into Iceberg

Based on our work with engineering teams, we see recurring challenges when building real-time pipelines that sink to Iceberg:

1. Real-time ETL is Often Necessary, Yet Complex

Raw streaming data frequently requires transformation before landing in Iceberg. Compliance mandates may require filtering sensitive fields (PII masking for GDPR), and business logic often necessitates enriching events by joining them with reference data. Implementing this continuous ETL robustly using traditional batch tools or even specialized streaming systems can be complex and resource-intensive.
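For concreteness, a transformation like this can be expressed as a single streaming SQL query. The following is a minimal sketch; table and column names such as raw_events and users are hypothetical, and the syntax follows RisingWave's Postgres-style SQL:

-- Hypothetical streaming ETL: mask PII and enrich events with reference data.
CREATE MATERIALIZED VIEW masked_enriched_events AS
SELECT
    e.event_id,
    e.event_time,
    -- GDPR-style masking: hide the local part of the email address
    regexp_replace(e.email, '^[^@]+', '***') AS masked_email,
    u.plan_tier,  -- enrichment via a join with reference data
    e.payload
FROM raw_events AS e
JOIN users AS u ON e.user_id = u.id
WHERE e.event_type <> 'internal_heartbeat';  -- filter noise before it lands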

2. Handling Change Data Capture (CDC) Correctly

Propagating changes from operational databases (Postgres, MySQL, etc.) via CDC into Iceberg requires careful handling. Since Iceberg tables typically lack primary key enforcement at the storage layer, updates and deletes must often be implemented via equality deletes: deleting the old row version and inserting the new one. A streaming ingest engine must reliably do all of the following (a minimal ingestion sketch follows the list):

  • Interpret the CDC format (e.g., Debezium).

  • Apply changes deterministically.

  • Guarantee exactly-once processing to prevent data loss or duplication, especially during failures or scaling events.
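For concreteness, here is a minimal sketch of what ingesting such a stream can look like in RisingWave SQL, using its Debezium-over-Kafka syntax (topic, broker, and columns are hypothetical). The declared primary key is what allows updates and deletes to be applied downstream as equality deletes:

-- Hypothetical Debezium ingestion: change events from a Kafka topic.
CREATE TABLE orders_cdc (
    order_id INT PRIMARY KEY,  -- primary key drives update/delete semantics
    status VARCHAR,
    amount DECIMAL
) WITH (
    connector = 'kafka',
    topic = 'dbserver1.public.orders',
    properties.bootstrap.server = 'broker:9092'
) FORMAT DEBEZIUM ENCODE JSON;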

3. The Need for Low-Latency Data Visibility

Engineers and analysts need to query incoming data quickly—for validation, debugging, quality assurance, or real-time monitoring. Traditional pipelines often introduce significant latency between data arrival and its queryability in Iceberg, due to micro-batching intervals or asynchronous compaction processes.


Limitations of Common Approaches

Two frequent approaches for streaming into Iceberg involve Kafka or Flink, each with trade-offs:

  • Kafka: Excellent for data transport, but lacks native processing capabilities. Handling transformations, joins, or CDC requires integrating other tools (Kafka Streams, ksqlDB, custom applications), adding layers of complexity.

  • Flink: A powerful stream processor capable of ETL and writing to Iceberg. However, it often presents a steep learning curve (requiring Java/Scala for complex jobs), its SQL dialect differs from standard SQL, and managing its state for complex operations like large multi-way joins can be challenging.

Operating these systems reliably at scale demands significant engineering investment. There's a need for a solution that simplifies streaming ingestion and management for Iceberg, offering a more integrated and SQL-native experience.


Introducing RisingWave's Iceberg Table Engine

RisingWave's Iceberg table engine, available in the upcoming v2.3 release, is designed to address these challenges directly. It allows you to treat Iceberg tables much like native tables within RisingWave, using standard SQL for definition, ingestion, and transformation. Built upon RisingWave's distributed stream processing engine, it provides a unified platform for handling real-time data destined for Iceberg.

Key Capabilities

  • Unified Table Management with SQL: Create and manage Iceberg tables directly within RisingWave using standard SQL DDL (CREATE TABLE ... ENGINE = iceberg), just like any other table. Perform INSERT, UPDATE, DELETE operations using familiar SQL DML; see the sketch after this list.

  • Native CDC Ingestion and Handling: Connect directly to CDC sources (e.g., Postgres, MySQL via Debezium). RisingWave automatically interprets change events and correctly applies UPDATE and DELETE operations as equality deletes when writing to Iceberg, ensuring data consistency. Exactly-once processing semantics are maintained via RisingWave's fault-tolerance mechanisms.

  • Real-Time ETL with SQL: Define complex streaming transformations (filters, joins, aggregations) using standard SQL. RisingWave executes these as incrementally updated streaming jobs, continuously processing incoming data and sinking the results into your target Iceberg tables.

  • Low-Latency Queryability: Query the state of your data within RisingWave tables or materialized views with low latency, allowing immediate validation and inspection even before data is committed and compacted in the underlying Iceberg files.

  • Native Iceberg Format Integration: The engine writes data using the official Iceberg libraries, ensuring full compatibility. It supports Iceberg features like partitioning and schema evolution. Tables are stored in standard Iceberg format (e.g., on S3) and remain queryable by any Iceberg-compatible engine (Trino, Spark, Athena, DuckDB, etc.) through a shared catalog (AWS Glue, AWS S3 Tables, REST, JDBC). Compaction can be handled by external tools as needed.
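Taken together, these capabilities keep the whole lifecycle in SQL. Below is a minimal sketch; table and column names are hypothetical, and it assumes an engine connection has been configured as in the setup example that follows:

-- Standard DDL creates an Iceberg-backed table.
CREATE TABLE page_views (
    view_id BIGINT PRIMARY KEY,
    url VARCHAR,
    viewed_at TIMESTAMP
) ENGINE = iceberg;

-- Familiar DML works against it.
INSERT INTO page_views VALUES (1, '/home', TIMESTAMP '2025-01-01 10:00:00');
UPDATE page_views SET url = '/landing' WHERE view_id = 1;

-- Streaming ETL: an incrementally maintained hourly rollup.
CREATE MATERIALIZED VIEW hourly_views AS
SELECT date_trunc('hour', viewed_at) AS hour, count(*) AS views
FROM page_views
GROUP BY 1;

-- Low-latency inspection: query the current state immediately.
SELECT * FROM hourly_views ORDER BY hour DESC LIMIT 10;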


Example Architecture: Streaming Postgres CDC to Iceberg via RisingWave

A typical setup involves:

  1. Source: A Postgres database emitting CDC events in Debezium format.

  2. Ingestion & Storage: RisingWave alone ingests the CDC stream and writes it to Iceberg tables; a few SQL commands and some basic configuration are all that's needed.

-- Create a connection for the table engine
CREATE CONNECTION ...

-- Point the table engine at the connection for the current session...
SET iceberg_engine_connection = 'public.conn';
-- ...or set it cluster-wide.
ALTER SYSTEM SET iceberg_engine_connection = 'public.conn';

-- Create a source
CREATE SOURCE pg_source WITH (
    connector='postgres-cdc',
    hostname='localhost',
    port='5432',
    username='your_user',
    password='your_password',
    database.name='your_database',
    schema.name='public' -- Optional, defaults to 'public'
);

-- Create an Iceberg table from a table in the source
CREATE TABLE my_table (
    id INT PRIMARY KEY,
    name VARCHAR
)
FROM pg_source TABLE 'public.my_upstream_table'
ENGINE = iceberg;

This architecture delivers real-time data ingestion and storage to Iceberg without the typical complexity of managing separate Kafka, Flink, or custom code deployments.
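Because the result is standard Iceberg, any compatible engine can read it. As one hypothetical check, the table could be scanned from DuckDB with its iceberg extension (the S3 path is illustrative, and credential setup is omitted):

-- Read the RisingWave-written Iceberg table from DuckDB.
INSTALL iceberg;
LOAD iceberg;
SELECT count(*)
FROM iceberg_scan('s3://my-bucket/warehouse/public/my_table');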


Summary

Apache Iceberg provides a robust foundation for data lakes, but integrating real-time streams remains a challenge. RisingWave's Iceberg table engine simplifies this by offering:

  • SQL-based management of Iceberg tables within a streaming context.

  • Built-in handling for CDC streams and equality deletes.

  • Real-time ETL capabilities using standard SQL.

  • Low-latency visibility into streaming data pipelines.

  • Full compatibility with the Iceberg ecosystem.

If you're building streaming data pipelines targeting Iceberg and seek a more integrated, efficient, and SQL-friendly approach, explore the RisingWave Iceberg table engine, coming soon in version 2.3.
