Getting Started: RisingWave to Apache Iceberg in 10 Minutes

You can have RisingWave streaming data into Apache Iceberg in under 10 minutes. Spin up RisingWave and an Iceberg REST catalog with Docker Compose, connect a Kafka source, define a materialized view, and create an Iceberg sink—your first lakehouse pipeline is live with a handful of SQL statements.

What You Will Build

This tutorial walks through a complete end-to-end pipeline:

  1. A Kafka topic producing simulated user events
  2. A RisingWave source reading those events
  3. A materialized view computing real-time session metrics
  4. An Iceberg sink writing results to local MinIO (S3-compatible)
  5. A direct query against the Iceberg table from RisingWave v2.8

By the end, you will have a working streaming lakehouse you can extend with your own data.

Prerequisites

  • Docker and Docker Compose installed
  • psql CLI (or any PostgreSQL client)
  • 10 minutes

Step 1: Start the Stack

Create a docker-compose.yml with RisingWave, Kafka, MinIO, and an Iceberg REST catalog (we use Project Nessie here, which exposes an Iceberg REST-compatible API).
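A minimal compose file might look like the following sketch. The image tags, the Nessie image name, and the port mapping are assumptions—pin versions and adjust ports to match your environment:

```yaml
services:
  risingwave:
    image: risingwavelabs/risingwave:latest   # assumed tag; pin a version in practice
    ports:
      - "4566:4566"                           # SQL (psql) port
  kafka:
    image: bitnami/kafka:latest               # assumed image
    ports:
      - "9092:9092"
  minio:
    image: minio/minio:latest
    command: server /data
    ports:
      - "9000:9000"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
  iceberg-catalog:
    image: ghcr.io/projectnessie/nessie:latest  # assumed image
    ports:
      - "8181:19120"                            # expose Nessie on the URI used later
```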

docker compose up -d risingwave kafka minio iceberg-catalog

Once the stack is healthy (check with docker compose ps), connect to RisingWave:

psql -h localhost -p 4566 -d dev -U root

Step 2: Create a Kafka Source

CREATE SOURCE user_events (
    user_id       VARCHAR,
    session_id    VARCHAR,
    event_type    VARCHAR,
    page          VARCHAR,
    ts            TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'user.events',
    properties.bootstrap.server = 'localhost:9092',
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE JSON;

Verify the source is reading data:

SELECT * FROM user_events LIMIT 5;
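If you need sample data, a small Python generator can emit JSON events matching this schema. How you pipe them into the user.events topic (for example with kcat or a Kafka client library—both assumptions here, not part of the stack above) is up to you:

```python
import json
import random
import uuid
from datetime import datetime, timezone

PAGES = ["/home", "/pricing", "/docs", "/blog"]
EVENT_TYPES = ["page_view", "click", "scroll"]

def make_event(user_id: str, session_id: str) -> str:
    """Build one JSON event matching the user_events source schema."""
    event = {
        "user_id": user_id,
        "session_id": session_id,
        "event_type": random.choice(EVENT_TYPES),
        "page": random.choice(PAGES),
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

if __name__ == "__main__":
    session = str(uuid.uuid4())
    for _ in range(5):
        # Pipe stdout into Kafka, e.g.:
        #   python gen.py | kcat -P -b localhost:9092 -t user.events
        print(make_event("user-42", session))
```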

Step 3: Build a Session Metrics View

RisingWave's SESSION window function groups events into sessions, closing a session whenever the gap between consecutive events exceeds the specified interval. This is ideal for user-behavior analytics.

CREATE MATERIALIZED VIEW user_session_metrics AS
SELECT
    user_id,
    session_id,
    window_start                               AS session_start,
    window_end                                 AS session_end,
    COUNT(*)                                   AS event_count,
    COUNT(DISTINCT page)                       AS unique_pages,
    EXTRACT(EPOCH FROM (window_end - window_start)) AS session_duration_sec
FROM SESSION(user_events, ts, INTERVAL '30 MINUTES')
GROUP BY user_id, session_id, window_start, window_end;

Check that the view is producing results:

SELECT * FROM user_session_metrics LIMIT 10;
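Conceptually, the SESSION window above closes a session whenever consecutive events are more than 30 minutes apart. A minimal Python sketch of that gap-based grouping (illustration only, not RisingWave's implementation):

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)

def sessionize(timestamps: list[datetime]) -> list[list[datetime]]:
    """Split event times into sessions separated by gaps longer than GAP."""
    sessions: list[list[datetime]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= GAP:
            sessions[-1].append(ts)  # within the gap: continue the current session
        else:
            sessions.append([ts])    # gap exceeded (or first event): start a new session
    return sessions

events = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 10),
    datetime(2024, 1, 1, 10, 0),  # 50-minute gap -> new session
]
print([len(s) for s in sessionize(events)])  # [2, 1]
```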

Step 4: Create the Iceberg Sink

CREATE SINK user_sessions_to_iceberg AS
SELECT * FROM user_session_metrics
WITH (
    connector       = 'iceberg',
    type            = 'append-only',
    catalog.type    = 'rest',
    catalog.uri     = 'http://localhost:8181',
    warehouse.path  = 's3://lakehouse/warehouse',
    s3.region       = 'us-east-1',
    s3.endpoint     = 'http://localhost:9000',
    s3.access.key   = 'minioadmin',
    s3.secret.key   = 'minioadmin',
    database.name   = 'analytics',
    table.name      = 'user_sessions'
);

RisingWave automatically creates the Iceberg table if it does not exist. Wait 60 seconds for the first checkpoint to commit.

Step 5: Query the Iceberg Table

In RisingWave v2.8, you can query Iceberg tables directly:

SELECT
    DATE_TRUNC('hour', session_start) AS hour,
    COUNT(*)                          AS session_count,
    AVG(session_duration_sec)         AS avg_duration,
    AVG(unique_pages)                 AS avg_pages_per_session
FROM iceberg_scan(
    's3://lakehouse/warehouse',
    'analytics',
    'user_sessions'
)
GROUP BY DATE_TRUNC('hour', session_start)
ORDER BY hour DESC
LIMIT 24;

You now have a live hourly report backed by Iceberg storage.
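The DATE_TRUNC grouping above is just an hourly bucket-and-average. The same logic in plain Python, over hypothetical session rows, looks like this:

```python
from collections import defaultdict
from datetime import datetime

def hourly_report(sessions: list[dict]) -> dict:
    """Bucket sessions by hour and average duration, mirroring the DATE_TRUNC query."""
    buckets: dict[datetime, list[float]] = defaultdict(list)
    for s in sessions:
        hour = s["session_start"].replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(s["session_duration_sec"])
    return {hour: sum(durs) / len(durs) for hour, durs in buckets.items()}

rows = [
    {"session_start": datetime(2024, 1, 1, 9, 5), "session_duration_sec": 120.0},
    {"session_start": datetime(2024, 1, 1, 9, 40), "session_duration_sec": 240.0},
]
# Both rows fall in the 09:00 bucket, averaging 180.0 seconds.
print(hourly_report(rows))
```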

Understanding What Just Happened

Step                     | Component       | Role
------------------------ | --------------- | -----------------------------------------------
CREATE SOURCE            | RisingWave      | Reads the Kafka topic in real time
CREATE MATERIALIZED VIEW | RisingWave      | Continuously computes session aggregations
CREATE SINK              | RisingWave      | Commits results to Iceberg every checkpoint
Iceberg REST Catalog     | Nessie/Polaris  | Tracks table metadata and snapshots
MinIO / S3               | Object storage  | Stores Parquet data files durably
iceberg_scan()           | RisingWave v2.8 | Queries Iceberg tables directly from SQL

Extending the Tutorial

Once the basic pipeline works, you can:

Add schema evolution: New columns in user_events are propagated through RisingWave automatically. Add matching columns to your materialized view and Iceberg handles the schema migration.

Enable upsert: Change type = 'append-only' to type = 'upsert' and add a primary key to the sink to keep a current-state table rather than an event log.
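A sketch of the upsert variant: the primary_key sink option and type = 'upsert' follow the same WITH-clause style as the append-only sink, but verify the exact option names and column list against the RisingWave docs for your version:

```sql
CREATE SINK user_sessions_upsert AS
SELECT * FROM user_session_metrics
WITH (
    connector     = 'iceberg',
    type          = 'upsert',
    primary_key   = 'user_id,session_id',
    -- ...same catalog and S3 options as the append-only sink above...
    database.name = 'analytics',
    table.name    = 'user_sessions_current'
);
```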

Connect a BI tool: Point Trino, DuckDB, or Apache Spark at your MinIO bucket with the Nessie catalog URI to query the same Iceberg data from your preferred analytics tool.

Add more sources: Bring in a Postgres CDC source alongside your Kafka source. Join them in a materialized view for enriched analytics.
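A sketch of such a CDC source, assuming RisingWave's postgres-cdc connector with placeholder credentials and a hypothetical users table; check the connector reference for the exact options:

```sql
CREATE TABLE users (
    user_id VARCHAR PRIMARY KEY,
    plan    VARCHAR,
    country VARCHAR
) WITH (
    connector     = 'postgres-cdc',
    hostname      = 'localhost',
    port          = '5432',
    username      = 'postgres',
    password      = 'postgres',
    database.name = 'app',
    schema.name   = 'public',
    table.name    = 'users'
);
```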

Common Mistakes to Avoid

Wrong catalog URI: The most common issue is a network misconfiguration between RisingWave and the catalog service. Confirm connectivity with curl http://localhost:8181/v1/config from inside the RisingWave container.

Missing S3 credentials: For production S3, always pass s3.region, s3.access.key, and s3.secret.key (or use IAM roles via instance profile). Omitting these causes silent authentication failures.

Small file accumulation: Each checkpoint creates new Parquet files. Schedule Iceberg's rewriteDataFiles procedure daily to compact files for efficient querying.
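Compaction is typically run from an external engine such as Spark. With Iceberg's Spark procedures it looks roughly like this (the catalog name spark_catalog is an assumption; substitute your configured catalog):

```sql
-- Run from Spark SQL with the Iceberg extensions enabled
CALL spark_catalog.system.rewrite_data_files(
    table => 'analytics.user_sessions'
);
```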

FAQ

Q: Do I need to create the Iceberg table manually before creating the sink? A: No. RisingWave automatically creates the Iceberg table based on the schema of your SELECT statement when the sink is created.

Q: What Iceberg catalog backends does RisingWave support? A: RisingWave supports the Iceberg REST catalog spec (compatible with Nessie, Polaris, Tabular, and AWS Glue with the REST adapter), as well as Hive Metastore and JDBC catalogs.

Q: How do I monitor sink lag? A: Use SHOW SINK BACKPRESSURE in RisingWave to check if the sink is keeping up with the materialized view. You can also monitor Iceberg snapshot creation frequency in the catalog.

Q: Can I write to multiple Iceberg tables from one RisingWave deployment? A: Yes. Create multiple CREATE SINK statements, each pointing to a different table.name. RisingWave manages each sink independently.

Q: What is the minimum Iceberg version supported? A: RisingWave supports Iceberg v2 format, which includes row-level deletes required for upsert semantics. Iceberg v1 tables are supported for append-only sinks.

Get Started

Follow the complete RisingWave documentation for production deployment guides, configuration references, and connector details. Ask questions and share your pipeline on the RisingWave Slack community.
