The Simplest Way to Ingest, Transform, and Query Streaming Data in Apache Iceberg

Apache Iceberg is becoming the de facto standard for organizing data in modern lakehouses. It gives you the open table format you need—ACID guarantees, schema evolution, time travel, and support across engines like Trino, Spark, and Flink. But it doesn’t tell you how to get data into those tables. That part is up to you.

And that’s where most teams run into trouble.

You want to build a real-time pipeline into Iceberg? You’re now writing Flink jobs in Java, wrangling Kafka consumers, tuning checkpoint intervals, and dealing with yet another glue job to publish to Iceberg. Sure, it works. But it’s fragile, verbose, and hard to debug when something goes wrong at 3 a.m.

It doesn’t have to be this complicated.

This post explains how RisingWave can simplify the whole thing. In one system, and using only SQL, you can ingest from Kafka, transform on the fly, and write to Apache Iceberg with second-level freshness—no JVM, no DAGs, no Airflow. Just code that’s easy to write, read, and run.


The Problem: Iceberg Needs a Real-Time Processing Layer

Iceberg is a table format. That’s it. It stores metadata and tracks snapshots, but it doesn’t dictate how your data gets written in or queried out. And it certainly doesn’t help with real-time.

A typical architecture to feed data into Iceberg might look like:

  • Kafka as your event stream

  • Flink for transformation

  • A Flink sink to write to Iceberg

  • Spark or Trino for downstream queries

Each of these systems is powerful. Together, they’re a pain. You’ll need to:

  • Keep schemas in sync across Kafka, Flink, and Iceberg

  • Handle backpressure and failure recovery

  • Deal with state, checkpoints, and operational tuning

  • Write custom logic to keep jobs from silently failing

This might be okay if you’re a big platform team. But if you want something that works out of the box and doesn’t require babysitting, this stack is overkill.


A Simpler Alternative: RisingWave + Iceberg

RisingWave is a streaming SQL engine that looks like a database but acts like Flink under the hood. It supports real-time ingestion, incremental computation, and—critically—direct integration with Apache Iceberg.

With RisingWave, you can:

  1. Ingest from Kafka, Iceberg, or any supported source

  2. Transform data on the fly using SQL

  3. Output directly into Apache Iceberg in near real time

  4. Or even treat Iceberg as the native storage engine for your tables

You don’t write pipelines in Java. You don’t stitch together Flink operators. You just define sources, views, and sinks—all in SQL.


A Concrete Example

Let’s say your app sends order events to Kafka. You want to:

  • Clean up the data

  • Track sales metrics per region

  • Store the results in Iceberg for others to query via Trino

Here’s how that looks in RisingWave:

1. Ingest from Kafka

CREATE SOURCE orders_kafka (
    user_id INT,
    product_id VARCHAR,
    order_id VARCHAR,
    order_amount DOUBLE PRECISION,
    region VARCHAR,
    order_time TIMESTAMP
) WITH (
    connector='kafka',
    topic='user_activity',
    properties.bootstrap.server='broker1:9092,broker2:9092'
) FORMAT PLAIN ENCODE JSON;

Or load from a historical Iceberg table in S3:

CREATE SOURCE orders_iceberg
WITH (
  connector = 'iceberg',
  warehouse_path = 's3://your-bucket/warehouse',
  database_name = 'sales',
  table_name = 'orders',
  s3.endpoint = 'http://your-s3-endpoint:port',
  s3.access.key = 'YOUR_ACCESS_KEY',
  s3.secret.key = 'YOUR_SECRET_KEY'
);

RisingWave treats both as streaming sources. New records are picked up automatically.

2. Transform with SQL

Create a live view of sales totals per region:

CREATE MATERIALIZED VIEW regional_sales AS
SELECT region, SUM(order_amount) AS total_sales
FROM orders_kafka
GROUP BY region;

The view is updated as new events arrive. No batch jobs. No refresh schedules. No orchestration.
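
Because a materialized view is queryable like any regular table, you can inspect the live results with a plain SELECT at any time (shown here purely as an illustration):

SELECT region, total_sales
FROM regional_sales
ORDER BY total_sales DESC
LIMIT 10;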

Want hourly windows?

CREATE MATERIALIZED VIEW hourly_sales AS
SELECT window_start, region,
  COUNT(*) AS orders,
  SUM(order_amount) AS revenue
FROM TUMBLE(orders_kafka, order_time, INTERVAL '1 hour')
GROUP BY window_start, region;
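
You can query the windowed view the same way. For example, to pull the last few hours of revenue per region (again, just an illustrative query):

SELECT window_start, region, orders, revenue
FROM hourly_sales
WHERE window_start >= NOW() - INTERVAL '6 hours'
ORDER BY window_start DESC, revenue DESC;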

3. Write to Iceberg

Once you’ve got your real-time view, write it to Iceberg:

CREATE SINK regional_sales_sink
FROM regional_sales
WITH (
    connector = 'iceberg',
    type = 'append-only',
    force_append_only = true,
    catalog.type = 'hive',
    catalog.uri = 'thrift://metastore:9083',
    warehouse.path = 's3://icebergdata/demo',
    s3.endpoint = 'http://minio-0:9301',
    s3.access.key = 'xxxxxxxxxx',
    s3.secret.key = 'xxxxxxxxxx',
    s3.region = 'ap-southeast-1',
    catalog.name = 'demo',
    database_name = 'sales',
    table_name = 'regional_sales'
);

RisingWave takes care of everything else: checkpointing, file generation, snapshot commits, and Iceberg metadata updates.
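
Downstream engines see a normal Iceberg table. Assuming, for example, a Trino catalog named iceberg that points at the same Hive metastore and warehouse (the catalog name is an assumption, not something RisingWave creates for you), the fresh results are one query away:

-- Run in Trino
SELECT region, total_sales
FROM iceberg.sales.regional_sales
ORDER BY total_sales DESC;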


Bonus: Iceberg as the Table Engine

You can even skip the sink altogether and define a RisingWave table that stores directly into Iceberg:

CREATE TABLE user_events (
  user_id BIGINT,
  event_time TIMESTAMP,
  event_type TEXT
) WITH (commit_checkpoint_interval = 1) 
ENGINE = iceberg;

Note that before creating an Iceberg table directly in RisingWave, you need to create a connection for the Iceberg table engine and configure RisingWave to use that connection.
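
Here is a minimal sketch of that setup, assuming an S3-backed warehouse with a storage catalog; the connection name, paths, and credentials are placeholders, and the exact catalog parameters depend on your deployment:

CREATE CONNECTION iceberg_conn WITH (
    type = 'iceberg',
    warehouse.path = 's3://icebergdata/demo',
    s3.endpoint = 'http://minio-0:9301',
    s3.access.key = 'xxxxxxxxxx',
    s3.secret.key = 'xxxxxxxxxx',
    s3.region = 'ap-southeast-1',
    catalog.type = 'storage'
);

-- Tell RisingWave to use this connection for tables created with ENGINE = iceberg
SET iceberg_engine_connection = 'public.iceberg_conn';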

That’s it. RisingWave will manage the writes, and the table will show up as a valid Iceberg table in your S3 bucket. Queryable by any external engine.

You can also stream data straight into it:

CREATE TABLE enriched_events (
  ...
) WITH (
    connector = 'kafka',
    topic = 'page_views_topic',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest',
    commit_checkpoint_interval = 120
)
FORMAT PLAIN ENCODE JSON
ENGINE = iceberg;
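
Once events start flowing, the table behaves like any other RisingWave table, so a quick sanity check is just a query:

-- Confirm that events are arriving
SELECT COUNT(*) FROM enriched_events;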

Why Engineers Actually Like This

Most engineers don’t want to babysit infrastructure. They want a pipeline that ingests data, transforms it, and lands it in Iceberg—without needing to learn Flink internals or write a hundred lines of boilerplate.

RisingWave fits naturally into the developer workflow: you write SQL, you test it, and it runs. There’s no need to wire up separate pipeline orchestration tools or worry about how state is managed under the hood. The system takes care of exactly-once delivery, snapshotting, and schema enforcement without exposing you to that complexity.

And because it’s SQL-based, everything is transparent. You can see what each view is doing, trace how data flows through the system, and make changes without breaking five downstream jobs. It’s not just easier to use—it’s easier to trust.

The end result? Your Iceberg tables stay fresh, your pipelines stay sane, and your team doesn’t spend its time debugging Flink exceptions.


Where It Fits

RisingWave with Iceberg works well in any use case where fresh, queryable data matters—but you don’t want to overcomplicate your stack. It’s already being used in:

  • Real-time ingestion paths that go directly from Kafka into Iceberg with second-level latency

  • Low-latency dashboards powered by SQL materialized views, without a batch layer in between

  • Machine learning pipelines where Iceberg acts as the durable store for continuously updating features

  • Change data capture setups, where updates from operational databases are incrementally transformed and persisted (see the sketch after this list)

  • Cost-efficient analytics where Trino or Spark query clean, up-to-date data that was streamed in without a separate ETL framework
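
For the CDC case, RisingWave can ingest changes directly from an operational database such as Postgres via its postgres-cdc connector. A minimal sketch, with placeholder connection details and a hypothetical orders table:

CREATE TABLE orders_cdc (
    order_id INT PRIMARY KEY,
    user_id INT,
    order_amount DOUBLE PRECISION,
    region VARCHAR,
    updated_at TIMESTAMP
) WITH (
    connector = 'postgres-cdc',
    hostname = 'your-postgres-host',
    port = '5432',
    username = 'your_user',
    password = 'your_password',
    database.name = 'app_db',
    schema.name = 'public',
    table.name = 'orders'
);

From there, the same materialized views and Iceberg sinks shown above apply unchanged.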

Whether you're a data team looking to simplify ETL or a backend team pushing events into the lake, this model removes a ton of overhead.


Try It

RisingWave is open source and easy to self-host. You can also try the fully managed cloud version.

Stop stitching together big data systems. Just write SQL. And make Iceberg actually useful for real-time.
