Introduction
Your data lake runs on Apache Iceberg. Your dashboards refresh every hour. But your business moves in seconds. A customer abandons a cart, a sensor triggers an alert, a transaction looks fraudulent. By the time your hourly batch job catches it, the moment has passed.
This is the core tension in modern data architectures: Iceberg gives you reliable, open table formats with schema evolution and time travel, but it was designed for batch workloads. Streaming data into Iceberg closes that gap, turning your lakehouse from a historical archive into a system that reflects what is happening right now.
In this guide, you will learn why batch loading into Iceberg falls short for time-sensitive workloads, how streaming ingestion changes the architecture, and how to use RisingWave as your streaming engine to continuously process and sink data into Iceberg tables with SQL.
Why Is Batch Loading Into Iceberg Not Enough?
Batch loading remains the default pattern for most Iceberg deployments. A scheduled Spark job runs every hour (or every 15 minutes if you are ambitious), reads new data from a staging area, and appends it to Iceberg tables. This works, but it introduces several problems that compound as your data volumes and freshness requirements grow.
Latency floors
Even with aggressive scheduling, batch jobs impose a latency floor. An event that arrives just after a batch cut-off must wait for the next run, so worst-case freshness equals the batch interval plus job startup time plus processing time. A 15-minute Spark job that takes 3 minutes to initialize and 5 minutes to process can leave data up to 23 minutes stale. For fraud detection, operational monitoring, or real-time personalization, that is too slow.
Small file problems
Frequent batch jobs produce many small Parquet files. Iceberg's metadata layer handles this better than raw Hive tables, but query engines still suffer when scanning thousands of tiny files. You end up needing a separate compaction process to merge these files, adding operational complexity.
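Iceberg ships maintenance procedures for exactly this; many teams end up scheduling something like the following Spark SQL call. This is a hedged sketch with placeholder catalog and table names:
-- Spark SQL: rewrite small data files into larger ones
-- (my_catalog and analytics.events are placeholders)
CALL my_catalog.system.rewrite_data_files(
  table => 'analytics.events',
  options => map('target-file-size-bytes', '536870912')  -- target ~512 MB files
);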
Resource waste
Batch jobs provision compute for peak throughput, then sit idle between runs. A Spark cluster sized for a 15-minute batch window might do useful work for 5 minutes and sit idle for the other 10. Streaming processes use compute continuously and can scale with actual throughput.
No incremental state
Batch jobs recompute aggregations from scratch or maintain complex watermark logic to process only new data. Streaming engines maintain state incrementally: when a new event arrives, they update the running count, sum, or join result without reprocessing historical data.
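The contrast shows up directly in the SQL. A schematic comparison, with illustrative table and column names (the batch statement is Spark-style SQL):
-- Batch: an external scheduler re-runs a full recompute every interval
INSERT OVERWRITE TABLE daily_page_totals
SELECT page_url, COUNT(*) AS views
FROM events
GROUP BY page_url;

-- Streaming: declared once, then updated incrementally per event
CREATE MATERIALIZED VIEW daily_page_totals AS
SELECT page_url, COUNT(*) AS views
FROM events
GROUP BY page_url;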
How Does Streaming Into Iceberg Work?
Streaming into Iceberg means continuously writing data from event sources (Kafka topics, CDC streams, S3 notifications) into Iceberg tables with low latency. Instead of accumulating data and writing it in bulk, a streaming engine processes each event as it arrives, applies transformations, and commits results to Iceberg at regular intervals.
The architecture follows a clear pattern:
1. Sources produce events continuously: Kafka topics carrying clickstream data, CDC connectors capturing database changes, object storage notifications for file arrivals.
2. A streaming engine consumes these events, applies SQL transformations (filtering, joining, aggregating), and maintains state for operations like windowed counts or temporal joins.
3. An Iceberg sink commits processed data to Iceberg tables. The engine batches writes into Parquet files and commits them through Iceberg's transactional metadata layer, providing exactly-once delivery guarantees.
4. Query engines (Trino, Spark, DuckDB, StarRocks) read the Iceberg tables. Because Iceberg uses open metadata and Parquet files, any compatible engine can query the data immediately after a commit.
This architecture decouples the streaming engine from the query engine. RisingWave handles the continuous ingestion and transformation. Your existing analytical tools handle the querying. The Iceberg table format sits in between as the interoperability layer.
What Are the Options for Streaming Data Into Iceberg?
Three approaches dominate the landscape for streaming data into Apache Iceberg: Kafka Connect with an Iceberg sink connector, Apache Flink with its native Iceberg integration, and RisingWave with its built-in Iceberg sink. Each makes different tradeoffs between simplicity, power, and operational cost.
Kafka Connect + Iceberg sink connector
Kafka Connect provides the simplest path. You deploy a connector, configure the Iceberg table parameters, and data flows from Kafka topics into Iceberg tables. The Tabular Iceberg sink connector and similar connectors handle schema mapping and commit coordination.
Strengths: Minimal code, easy deployment, integrates with existing Kafka infrastructure.
Limitations: No support for complex transformations. You cannot join streams, compute aggregations, or filter based on state. Single Message Transforms (SMTs) handle simple field mapping but nothing stateful. If your data needs enrichment or aggregation before landing in Iceberg, you need a separate processing layer.
Apache Flink + Iceberg
Flink provides a full-featured stream processing framework with native Iceberg support through the Iceberg Flink connector. You can define complex streaming pipelines that read from Kafka, apply stateful transformations, and write to Iceberg tables.
Strengths: Powerful stateful processing, exactly-once guarantees, mature ecosystem, supports both Java/Scala and Flink SQL.
Limitations: Operational complexity is significant. Flink clusters require JVM tuning, checkpoint configuration, and careful state backend management. Flink SQL covers many use cases, but complex logic often requires dropping into Java. Cluster sizing and resource management demand dedicated platform expertise.
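For a sense of the programming model, here is a rough Flink SQL sketch based on the Iceberg Flink connector documentation. Metastore URI, warehouse path, and table names are placeholders, and the Kafka-backed table is assumed to be defined elsewhere:
-- Flink SQL: register an Iceberg catalog, then stream into it
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://metastore:9083',          -- placeholder metastore URI
  'warehouse' = 's3://my-data-lake/warehouse'
);

-- Runs as a continuous streaming job
INSERT INTO iceberg_catalog.analytics.events
SELECT user_id, page_url, event_time
FROM kafka_events;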
RisingWave + Iceberg
RisingWave takes a SQL-first approach. You define sources, transformations, and sinks entirely in PostgreSQL-compatible SQL. The engine handles state management, checkpointing, and exactly-once delivery internally, using cloud object storage (S3) instead of local SSDs for state.
Strengths: Pure SQL interface, no JVM overhead, built-in state management on S3, exactly-once Iceberg sink, automatic small-file compaction with the Iceberg Table Engine.
Limitations: Younger ecosystem compared to Flink. Best suited for SQL-expressible workloads rather than arbitrary custom operators.
Comparison table
| Dimension | Kafka Connect | Apache Flink | RisingWave |
| --- | --- | --- | --- |
| Language | Configuration (JSON) | Java/Scala/SQL | PostgreSQL SQL |
| Transformations | SMTs only (stateless) | Full stateful processing | Full stateful processing |
| Joins | Not supported | Supported | Supported |
| Aggregations | Not supported | Supported | Supported |
| Exactly-once | Connector-dependent | Yes | Yes |
| State backend | N/A | RocksDB (local SSD) | S3 object storage |
| Operational complexity | Low | High | Low |
| Cluster management | Kafka Connect workers | JobManager + TaskManagers | Single binary or cloud |
| Iceberg catalog support | Limited | Hive, Glue, REST | Glue, REST, JDBC |
| Small file compaction | External process needed | External process needed | Built-in (Table Engine) |
How Do You Stream Data Into Iceberg With RisingWave?
Let's walk through a concrete example. You have clickstream events flowing through Kafka, and you want to aggregate them into per-page, per-hour metrics stored in Iceberg for your analytics team to query with Trino.
Step 1: Create a Kafka source
Connect RisingWave to your Kafka topic:
CREATE SOURCE clickstream_events (
user_id VARCHAR,
page_url VARCHAR,
action VARCHAR,
session_id VARCHAR,
event_time TIMESTAMPTZ
) WITH (
connector = 'kafka',
topic = 'web.clickstream',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
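Note that a CREATE SOURCE does not persist the events; it only makes the stream available to downstream views. If you also want the raw events queryable on their own, RisingWave's CREATE TABLE accepts the same connector settings and stores what it ingests. A hedged variant:
-- Variant: persist raw events so they can be queried directly
CREATE TABLE clickstream_events_raw (
    user_id VARCHAR,
    page_url VARCHAR,
    action VARCHAR,
    session_id VARCHAR,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'web.clickstream',
    properties.bootstrap.server = 'kafka-broker:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;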
Step 2: Create a materialized view for aggregation
Define the transformation as a continuously maintained materialized view:
CREATE MATERIALIZED VIEW hourly_page_metrics AS
SELECT
page_url,
date_trunc('hour', event_time) AS event_hour,
COUNT(*) AS total_events,
COUNT(DISTINCT user_id) AS unique_visitors,
COUNT(DISTINCT session_id) AS unique_sessions,
COUNT(*) FILTER (WHERE action = 'click') AS click_count,
COUNT(*) FILTER (WHERE action = 'scroll') AS scroll_count
FROM clickstream_events
GROUP BY page_url, date_trunc('hour', event_time);
RisingWave maintains this aggregation incrementally. Each new event updates only the affected row in the result, without reprocessing the entire dataset.
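Because the view is continuously maintained, you can sanity-check it at any point with an ordinary query from psql or any PostgreSQL client:
-- Ad hoc check inside RisingWave: latest hourly metrics
SELECT page_url, event_hour, total_events, unique_visitors
FROM hourly_page_metrics
ORDER BY event_hour DESC, total_events DESC
LIMIT 10;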
Step 3: Sink to Iceberg
Create a sink that continuously writes the materialized view results to an Iceberg table:
CREATE SINK clickstream_to_iceberg FROM hourly_page_metrics
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'page_url, event_hour',
database.name = 'analytics',
table.name = 'hourly_page_metrics',
catalog.type = 'glue',
catalog.name = 'my_catalog',
warehouse.path = 's3://my-data-lake/warehouse',
s3.region = 'us-east-1',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
create_table_if_not_exists = 'true'
);
The type = 'upsert' setting means RisingWave updates existing rows when the aggregation changes (for example, when a new click event arrives for a page-hour combination that already has a row). The primary_key defines which columns identify a unique row.
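For workloads that only ever add rows (immutable event logs, for instance), an append-only sink avoids Iceberg equality deletes entirely. A hedged sketch; force_append_only is needed here only because the aggregation itself emits updates, which this mode drops:
-- Append-only variant: updates from the view are dropped, rows written once
CREATE SINK clickstream_append_log FROM hourly_page_metrics
WITH (
    connector = 'iceberg',
    type = 'append-only',
    force_append_only = 'true',
    database.name = 'analytics',
    table.name = 'hourly_page_metrics_log',
    catalog.type = 'glue',
    catalog.name = 'my_catalog',
    warehouse.path = 's3://my-data-lake/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    create_table_if_not_exists = 'true'
);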
Step 4: Query from any engine
Your Iceberg table is now continuously updated. Query it from Trino, Spark, or DuckDB:
-- In Trino
SELECT page_url, unique_visitors, click_count
FROM iceberg.analytics.hourly_page_metrics
WHERE event_hour >= current_timestamp - INTERVAL '24' HOUR
ORDER BY unique_visitors DESC
LIMIT 20;
The end-to-end latency from Kafka event to queryable Iceberg row is governed by the sink's commit_checkpoint_interval setting, which controls how many RisingWave checkpoints elapse between Iceberg commits; with default settings, commits land roughly once a minute.
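If you want fewer, larger files at the cost of higher latency (or the reverse), set the interval explicitly when creating the sink. An illustrative variant of the sink above with just that one parameter added (you would drop and recreate the sink to apply it):
-- Same sink definition, committing every 120 checkpoints instead
-- (fewer Iceberg snapshots and larger files, but higher latency)
CREATE SINK clickstream_to_iceberg FROM hourly_page_metrics
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'page_url, event_hour',
    database.name = 'analytics',
    table.name = 'hourly_page_metrics',
    catalog.type = 'glue',
    catalog.name = 'my_catalog',
    warehouse.path = 's3://my-data-lake/warehouse',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    commit_checkpoint_interval = 120
);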
How Does the Iceberg Table Engine Simplify the Architecture?
RisingWave's Iceberg Table Engine takes integration a step further. Instead of managing separate sources, materialized views, and sinks, you create tables that are backed directly by Iceberg storage:
-- Set up the Iceberg connection
CREATE CONNECTION iceberg_conn WITH (
type = 'iceberg',
warehouse.path = 's3://my-data-lake/warehouse',
catalog.type = 'glue',
catalog.name = 'my_catalog',
s3.region = 'us-east-1',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}'
);
SET iceberg_engine_connection = 'public.iceberg_conn';
-- Create a CDC source
CREATE SOURCE pg_source WITH (
connector = 'postgres-cdc',
hostname = 'prod-db.internal',
port = '5432',
username = 'replication_user',
password = '${PG_PASSWORD}',
database.name = 'production'
);
-- Create an Iceberg-backed table from the CDC source
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
product_id INT,
quantity INT,
total_amount DECIMAL,
order_status VARCHAR,
created_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ
)
FROM pg_source TABLE 'public.orders'
ENGINE = iceberg;
With the Table Engine, every INSERT, UPDATE, and DELETE from the upstream PostgreSQL database is automatically captured, transformed into Iceberg-compatible operations, and committed to the Iceberg table. The built-in compaction service merges small files and resolves equality deletes automatically, so downstream query engines get clean, optimized Parquet files.
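The Iceberg-backed table behaves like any other RisingWave table for reads, while its storage lives in your lakehouse; a quick illustrative check:
-- Query the Iceberg-backed table directly in RisingWave
SELECT order_status, COUNT(*) AS orders, SUM(total_amount) AS revenue
FROM orders
GROUP BY order_status;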
What About CDC and Multi-Source Streaming?
Real-world architectures rarely have a single data source. You might need to join CDC streams from your PostgreSQL database with Kafka events from your application layer, then land the enriched result in Iceberg. RisingWave handles this with standard SQL joins across sources.
Here is an example that enriches order events with customer data from a CDC stream:
-- CDC source for customer data
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR,
email VARCHAR,
tier VARCHAR,
region VARCHAR
) WITH (
connector = 'postgres-cdc',
hostname = 'prod-db.internal',
port = '5432',
username = 'replication_user',
password = '${PG_PASSWORD}',
database.name = 'production',
table.name = 'public.customers'
);
-- Kafka source for order events
CREATE SOURCE order_events (
order_id INT,
customer_id INT,
product_id INT,
amount DECIMAL,
order_time TIMESTAMPTZ
) WITH (
connector = 'kafka',
topic = 'orders.placed',
properties.bootstrap.server = 'kafka-broker:9092',
scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;
-- Join and aggregate, then sink to Iceberg
CREATE MATERIALIZED VIEW orders_by_region AS
SELECT
c.region,
c.tier AS customer_tier,
date_trunc('hour', o.order_time) AS order_hour,
COUNT(*) AS order_count,
SUM(o.amount) AS total_revenue,
COUNT(DISTINCT o.customer_id) AS unique_customers
FROM order_events o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region, c.tier, date_trunc('hour', o.order_time);
CREATE SINK regional_orders_to_iceberg FROM orders_by_region
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'region, customer_tier, order_hour',
database.name = 'analytics',
table.name = 'orders_by_region',
catalog.type = 'glue',
catalog.name = 'my_catalog',
warehouse.path = 's3://my-data-lake/warehouse',
s3.region = 'us-east-1',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
create_table_if_not_exists = 'true'
);
This pipeline joins a CDC stream with a Kafka stream, computes hourly regional aggregations, and continuously writes the results to Iceberg. Each component uses standard SQL, and RisingWave handles state management, exactly-once delivery, and checkpoint coordination internally.
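As in the clickstream example, the resulting table is immediately readable from any Iceberg-compatible engine. For instance, in Trino (assuming the catalog is mounted as iceberg):
-- In Trino: top regions by revenue over the last day
SELECT region, customer_tier, SUM(total_revenue) AS revenue
FROM iceberg.analytics.orders_by_region
WHERE order_hour >= current_timestamp - INTERVAL '24' HOUR
GROUP BY region, customer_tier
ORDER BY revenue DESC;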
FAQ
What is Apache Iceberg streaming?
Apache Iceberg streaming refers to continuously ingesting and processing real-time data into Iceberg tables, as opposed to traditional batch loading. A streaming engine (such as RisingWave, Flink, or Kafka Connect) consumes events from sources like Kafka or CDC streams and writes them to Iceberg tables with low latency, typically under 60 seconds.
Can RisingWave guarantee exactly-once delivery to Iceberg?
Yes. RisingWave's Iceberg sink supports exactly-once semantics through its is_exactly_once option. When enabled, the sink coordinates RisingWave checkpoints with Iceberg's transactional commit protocol, ensuring that each record is written to the Iceberg table exactly once, even across failures and restarts.
How does streaming to Iceberg handle small files?
Frequent commits from streaming engines naturally produce many small Parquet files, which can degrade query performance. RisingWave addresses this with its built-in compaction service (part of the Iceberg Table Engine), which continuously merges small files and resolves equality deletes. Without built-in compaction, you need an external process (such as a scheduled Spark job) to compact files periodically.
When should I use Kafka Connect instead of RisingWave for Iceberg ingestion?
Use Kafka Connect when you need simple, direct data movement from Kafka topics to Iceberg tables without any transformation, joining, or aggregation. Kafka Connect is the right choice for straightforward schema-mapped ingestion where the data is already in its final form. Choose RisingWave when you need to filter, join, aggregate, or enrich data before it lands in Iceberg.
Conclusion
Streaming data into Apache Iceberg transforms your lakehouse from a batch-oriented archive into a system that reflects your business in near real-time. Here are the key takeaways:
- Batch loading imposes latency floors that make Iceberg unsuitable for time-sensitive workloads like fraud detection, operational monitoring, and real-time personalization.
- Streaming engines bridge the gap by continuously processing events and committing results to Iceberg tables with sub-minute latency.
- RisingWave offers a SQL-first approach that eliminates JVM complexity and uses cloud object storage for state, reducing both operational burden and infrastructure cost.
- The Iceberg Table Engine simplifies the architecture further by providing built-in CDC handling and automatic compaction.
- Multiple sources can be joined in SQL before sinking to Iceberg, enabling enriched, pre-aggregated data in your lakehouse.
For teams already using Iceberg, adding a streaming database like RisingWave is the fastest path to a real-time lakehouse architecture.
Ready to stream data into Iceberg with SQL? Try RisingWave Cloud free, with no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

