Introduction
Getting data from Kafka into Apache Iceberg should not require three tools, two programming languages, and a team of data engineers to maintain. Yet that is the reality for most organizations today. The typical pipeline looks like this: Kafka Connect moves raw events from Kafka topics into a staging area, Apache Spark or Flink runs batch or micro-batch transformations, and a separate job writes the results to Iceberg tables. Each component has its own configuration, failure modes, and operational burden.
RisingWave collapses this entire pipeline into SQL. You create a Kafka source, define transformations as materialized views, and sink the results directly to Apache Iceberg, all within a single system using standard SQL statements. No Java code, no Spark clusters, no DAG orchestrators.
This guide walks you through building a complete Kafka-to-Iceberg pipeline with RisingWave. You will see how to ingest from Kafka, transform data with SQL, and deliver results to Iceberg tables. We compare this approach to Kafka Connect, Apache Flink, and Confluent Tableflow so you can evaluate the tradeoffs. All SQL examples are tested on RisingWave v2.3.
Why Is Getting Data from Kafka to Iceberg So Hard?
The Kafka-to-Iceberg path has become one of the most common data engineering patterns, but the tooling options each come with significant friction.
Kafka Connect + Iceberg Sink
Kafka Connect can write directly to Iceberg using sink connectors. The setup is straightforward for simple use cases: configure the connector, point it at your Iceberg catalog, and data flows.
The limitation is transformations. Kafka Connect is a data mover, not a data processor. If you need to join streams, aggregate events, filter based on business logic, or reshape data before writing to Iceberg, you need another tool. Single Message Transforms (SMTs) handle simple field-level changes, but anything beyond that requires chaining Kafka Connect with a separate processing framework.
Apache Flink + Iceberg Connector
Apache Flink supports Iceberg through its Table API and SQL. Flink SQL can ingest from Kafka, perform transformations, and write to Iceberg tables. This gives you full transformation capabilities.
The challenge is operational complexity. Flink requires JVM expertise for configuration tuning, checkpoint management, and state backend setup. Flink SQL is powerful but has syntax differences from standard SQL that create a learning curve. Managing Flink clusters, dealing with job failures, and tuning parallelism adds operational overhead that many teams want to avoid.
Confluent Tableflow
Confluent's Tableflow product writes Kafka topics directly to Iceberg tables as a managed service. It reduces operational burden but locks you into the Confluent ecosystem. Transformation capabilities are limited compared to a full stream processing engine, and you pay Confluent pricing rather than running open-source software.
The SQL Pipeline Alternative
RisingWave offers a different model. Instead of assembling multiple tools, you write SQL:
- CREATE SOURCE connects to Kafka.
- CREATE MATERIALIZED VIEW defines your transformations.
- CREATE SINK writes the results to Iceberg.
Three SQL statements replace an entire pipeline architecture. Let's build one.
How to Build a Kafka-to-Iceberg SQL Pipeline
We will build a pipeline for an e-commerce platform that ingests order events from Kafka, enriches them with product data, computes real-time aggregations, and lands the results in Iceberg tables for analytics.
Step 1: Create a Kafka Source
Connect RisingWave to your Kafka topic containing order events:
CREATE SOURCE kafka_orders (
order_id BIGINT,
customer_id BIGINT,
product_id VARCHAR,
quantity INT,
unit_price NUMERIC,
order_status VARCHAR,
event_time TIMESTAMPTZ
)
WITH (
connector = 'kafka',
topic = 'ecommerce.orders',
properties.bootstrap.server = 'broker1:9092,broker2:9092',
scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
This source continuously consumes events from the ecommerce.orders topic. The scan.startup.mode = 'earliest' setting ensures RisingWave reads all existing messages, not just new ones.
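Sources in RisingWave are not materialized by default, so a quick way to confirm events are flowing is to materialize the stream and query it. A minimal sketch (the view name orders_check is arbitrary):

```sql
-- Materialize the raw stream so it can be queried ad hoc
CREATE MATERIALIZED VIEW orders_check AS
SELECT * FROM kafka_orders;

-- Inspect a handful of recent events
SELECT order_id, order_status, event_time
FROM orders_check
ORDER BY event_time DESC
LIMIT 5;
```

Drop the view once you have verified ingestion; the pipeline views in the next step serve the real workload.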
For a second stream, create a source for product catalog updates:
CREATE TABLE products (
product_id VARCHAR PRIMARY KEY,
product_name VARCHAR,
category VARCHAR,
brand VARCHAR,
updated_at TIMESTAMPTZ
)
WITH (
connector = 'kafka',
topic = 'ecommerce.products',
properties.bootstrap.server = 'broker1:9092,broker2:9092',
scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
Note that we use CREATE TABLE (not CREATE SOURCE) for the products topic because it has a primary key and receives upsert updates. RisingWave materializes the latest state of each product, allowing you to join it with the orders stream. For more details on Kafka source configuration, see the RisingWave Kafka documentation.
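If your products topic is log-compacted, keyed by product ID, and uses null payloads (tombstones) to signal deletes, the upsert format may fit better than plain JSON. A sketch under that assumption (adjust topic and broker names to your setup):

```sql
CREATE TABLE products (
    product_id VARCHAR PRIMARY KEY,
    product_name VARCHAR,
    category VARCHAR,
    brand VARCHAR,
    updated_at TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'ecommerce.products',
    properties.bootstrap.server = 'broker1:9092,broker2:9092'
) FORMAT UPSERT ENCODE JSON;
```

With FORMAT UPSERT, RisingWave treats a record with a key and a null payload as a delete for that key, keeping the table in sync with the compacted topic.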
Step 2: Define Transformations as Materialized Views
Now transform the raw events with SQL. This materialized view joins orders with products, calculates total amounts, and filters for completed orders:
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT
o.order_id,
o.customer_id,
o.product_id,
p.product_name,
p.category,
p.brand,
o.quantity,
o.unit_price,
o.quantity * o.unit_price AS total_amount,
o.order_status,
o.event_time
FROM kafka_orders o
LEFT JOIN products p ON o.product_id = p.product_id;
Create an aggregation view for real-time sales analytics:
CREATE MATERIALIZED VIEW hourly_sales AS
SELECT
window_start,
window_end,
category,
brand,
COUNT(*) AS order_count,
SUM(total_amount) AS total_revenue,
AVG(total_amount) AS avg_order_value,
SUM(quantity) AS total_units_sold
FROM TUMBLE(enriched_orders, event_time, INTERVAL '1 hour')
GROUP BY window_start, window_end, category, brand;
Both materialized views update continuously as new events arrive from Kafka. You can query them directly for real-time results:
SELECT category, SUM(total_revenue) AS revenue
FROM hourly_sales
WHERE window_start >= NOW() - INTERVAL '24 hours'
GROUP BY category
ORDER BY revenue DESC
LIMIT 10;
Expected output:
category | revenue
---------------+------------
Electronics | 487293.50
Clothing | 312847.20
Home & Garden | 198534.80
Sports | 156721.40
Books | 89432.10
Step 3: Sink to Apache Iceberg
Now deliver the transformed data to Iceberg tables for long-term storage and analytics:
CREATE SINK enriched_orders_sink FROM enriched_orders
WITH (
connector = 'iceberg',
type = 'append-only',
warehouse.path = 's3://data-lakehouse/warehouse',
database.name = 'ecommerce',
table.name = 'enriched_orders',
catalog.type = 'glue',
catalog.name = 'analytics_catalog',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
s3.region = 'us-west-2',
create_table_if_not_exists = 'true'
);
For the aggregation table, use upsert mode so that each hour-category-brand combination is updated in place:
CREATE SINK hourly_sales_sink FROM hourly_sales
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'window_start,window_end,category,brand',
warehouse.path = 's3://data-lakehouse/warehouse',
database.name = 'analytics',
table.name = 'hourly_sales',
catalog.type = 'glue',
catalog.name = 'analytics_catalog',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
s3.region = 'us-west-2',
create_table_if_not_exists = 'true'
);
That's the complete pipeline: three kinds of SQL statements (CREATE SOURCE, CREATE MATERIALIZED VIEW, CREATE SINK) cover ingestion, transformation, and delivery. For more on Iceberg sink options, see the RisingWave Iceberg sink documentation.
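Because the sinks write standard Iceberg tables, any Iceberg-aware engine can read the results. For example, in Spark SQL, assuming a Spark session configured against the same Glue catalog under the name analytics_catalog (the date literal is illustrative):

```sql
SELECT category, SUM(total_revenue) AS revenue
FROM analytics_catalog.analytics.hourly_sales
WHERE window_start >= DATE '2024-01-01'
GROUP BY category
ORDER BY revenue DESC;
```

The same tables are equally queryable from Trino, DuckDB, or any other engine with Iceberg support; no RisingWave-specific reader is involved.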
How Does This Compare to Traditional Approaches?
Here is a side-by-side comparison of the four main approaches for building Kafka-to-Iceberg pipelines:
| Capability | RisingWave SQL | Kafka Connect | Apache Flink SQL | Confluent Tableflow |
| Ingestion | SQL (CREATE SOURCE) | Connector config (JSON) | SQL or Java | Managed (UI/API) |
| Transformations | Full SQL (joins, aggregations, windows) | SMTs only (field-level) | Full SQL or Java/Scala | Limited |
| Delivery to Iceberg | SQL (CREATE SINK) | Iceberg sink connector | Iceberg connector | Automatic |
| Languages needed | SQL only | JSON config | SQL or Java/Scala | UI/API |
| Stream-table joins | Yes (native) | No | Yes | No |
| Exactly-once semantics | Yes | Depends on connector | Yes | Yes |
| Schema evolution | Automatic (from Iceberg metadata) | Connector-dependent | Manual | Automatic |
| Operational overhead | Low (managed or single binary) | Medium (Connect workers + monitoring) | High (JobManager, TaskManagers, checkpoint and state tuning) | Low (managed) |
| Open source | Yes | Yes | Yes | No |
When to Choose Each Approach
RisingWave: Choose when you want SQL-only development, need transformations (joins, aggregations), and want low operational overhead. Ideal for teams without JVM expertise.
Kafka Connect: Choose when you need simple data movement without transformations. Good for raw event archival to Iceberg where no processing is required.
Apache Flink: Choose when you need advanced stateful processing (complex event processing, custom operators) and have a team comfortable with JVM operations. See the RisingWave vs Flink comparison for a detailed breakdown.
Confluent Tableflow: Choose when you are already deep in the Confluent ecosystem and want zero operational overhead, but do not need complex transformations.
What About Exactly-Once Delivery?
A critical requirement for any production pipeline is exactly-once semantics. You need to guarantee that every Kafka event lands in Iceberg exactly once, even through failures and restarts.
RisingWave provides exactly-once delivery to Iceberg through its checkpoint-based approach:
- RisingWave periodically takes consistent snapshots (checkpoints) of its processing state, including Kafka consumer offsets.
- When writing to Iceberg, RisingWave commits data in atomic batches aligned with checkpoints.
- If a failure occurs, RisingWave restores from the last successful checkpoint and replays from the corresponding Kafka offset.
- Iceberg's ACID transaction support ensures that partial writes are never visible to downstream readers.
You can configure the commit interval to balance latency and write efficiency:
CREATE SINK my_sink FROM my_mv
WITH (
connector = 'iceberg',
type = 'append-only',
commit_checkpoint_interval = 30,
-- ... other parameters
);
The commit_checkpoint_interval setting controls how many checkpoints pass between Iceberg commits. A lower value means fresher data in Iceberg but more frequent small writes; a higher value batches more data per commit, which is better for query performance on the Iceberg side. As a rough illustration, if checkpoints are taken about once per second (RisingWave's default), commit_checkpoint_interval = 30 yields an Iceberg commit approximately every 30 seconds.
FAQ
Can RisingWave replace both Kafka Connect and Spark for Iceberg pipelines?
Yes, for most use cases. RisingWave handles ingestion (replacing Kafka Connect), transformation (replacing Spark), and delivery to Iceberg in a single SQL-based system. The main exception is heavy batch processing workloads like large historical backfills, where Spark's batch processing model may be more appropriate.
What Iceberg catalog types does RisingWave support?
RisingWave supports AWS Glue, REST catalogs, and storage-based catalogs for Iceberg. You specify the catalog type in the CREATE SINK statement using the catalog.type parameter. This means you can integrate with existing Iceberg infrastructure without changing your catalog setup.
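Switching catalog types only changes the catalog parameters in the sink definition. A sketch for a REST catalog (the URI is a placeholder, and your catalog service may require additional credential parameters):

```sql
CREATE SINK enriched_orders_rest FROM enriched_orders
WITH (
    connector = 'iceberg',
    type = 'append-only',
    warehouse.path = 's3://data-lakehouse/warehouse',
    database.name = 'ecommerce',
    table.name = 'enriched_orders',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest:8181'
);
```

Everything upstream of the sink (sources and materialized views) is unaffected by the catalog choice.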
How does the SQL pipeline handle schema changes in Kafka topics?
When the schema of your Kafka topic evolves (new fields, type changes), you can update the RisingWave source definition to reflect the changes. Iceberg's schema evolution support means the sink table can also accommodate new columns. For breaking changes, you may need to recreate the pipeline, but additive changes (new optional fields) can often be handled smoothly.
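For an additive change, such as a new optional field appearing in the order events, the source can typically be extended in place rather than recreated. A sketch (the column name discount_code is illustrative):

```sql
ALTER SOURCE kafka_orders ADD COLUMN discount_code VARCHAR;
```

Downstream materialized views and sinks that should expose the new field still need to be recreated to select it, since their definitions are fixed at creation time.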
What throughput can a RisingWave Kafka-to-Iceberg pipeline handle?
RisingWave can process millions of events per second from Kafka, depending on the complexity of your transformations and cluster size. For simple pass-through pipelines, throughput is typically limited by Kafka consumer bandwidth. For complex joins and aggregations, you can scale RisingWave horizontally by adding compute nodes.
Conclusion
Key takeaways:
- Traditional Kafka-to-Iceberg pipelines require multiple tools (Kafka Connect + Spark/Flink), multiple languages, and significant operational overhead. RisingWave reduces this to three SQL statements.
- RisingWave supports full SQL transformations between Kafka and Iceberg, including joins, aggregations, and window functions, capabilities that Kafka Connect and Confluent Tableflow lack.
- Exactly-once delivery is built in through RisingWave's checkpoint mechanism and Iceberg's ACID transactions.
- You can query intermediate results (materialized views) directly in RisingWave for real-time analytics while simultaneously sinking to Iceberg for historical analysis.
- The SQL-only approach lowers the barrier to entry for data teams that know SQL but not Java or Scala.
For a comprehensive walkthrough of the streaming ETL architecture that powers this pipeline, visit the RisingWave documentation.
Ready to build your Kafka-to-Iceberg pipeline in SQL? Try RisingWave Cloud free, no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

