A real-time data lakehouse on AWS combines Apache Iceberg's open table format with RisingWave's stream processing to deliver seconds-level data freshness directly into S3-backed tables. RisingWave ingests events from Kafka or CDC sources, applies continuous SQL transforms, and sinks results to Iceberg via a REST catalog — no Spark required.
Why Build a Lakehouse on AWS?
Traditional data warehouses are expensive and rigid. Data lakes on S3 are cheap but stale. The lakehouse pattern bridges these two worlds: store data in open formats on S3 while making it immediately queryable by engines like Trino, Athena, and Spark. Apache Iceberg has emerged as the dominant table format for this architecture because it provides ACID transactions, schema evolution, and partition pruning on top of ordinary object storage.
The missing piece has always been the ingestion layer. Batch pipelines (Spark, Glue) introduce hour-long delays. Kafka Streams and Flink can write to Iceberg but require JVM infrastructure and complex operational overhead. RisingWave fills this gap: it is a cloud-native streaming database that speaks standard SQL and writes natively to Iceberg via the iceberg sink connector.
Architecture Overview
A typical AWS real-time lakehouse with RisingWave looks like this:
| Layer | Technology | Role |
| --- | --- | --- |
| Event ingestion | Amazon MSK (Kafka) | Raw event stream |
| Stream processor | RisingWave on EKS | Continuous SQL transforms |
| Table format | Apache Iceberg | ACID lakehouse tables |
| Object storage | Amazon S3 | Durable, cheap storage |
| Catalog | Iceberg REST Catalog | Metadata management |
| Query engine | Amazon Athena / Trino | Ad-hoc analytics |
RisingWave sits in the middle — consuming from Kafka, materializing views, and writing cleaned, aggregated data to Iceberg on S3. Downstream tools query Iceberg directly; they never touch RisingWave.
Step 1: Create a Kafka Source
Start by connecting RisingWave to your Amazon MSK cluster. The CREATE SOURCE statement registers the Kafka topic as a streaming input:
```sql
CREATE SOURCE orders_raw (
    order_id BIGINT,
    customer_id BIGINT,
    product_sku VARCHAR,
    quantity INT,
    unit_price NUMERIC(12, 2),
    event_time TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'b-1.msk-cluster.kafka.us-east-1.amazonaws.com:9092',
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE JSON;
```
This source is now live — as events arrive in the MSK topic, RisingWave makes them available for downstream materialized views.
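Note that a plain `SOURCE` in RisingWave is streaming-only: its rows feed materialized views but cannot be batch-queried directly. If you also want the raw events persisted and queryable inside RisingWave itself, a sketch using `CREATE TABLE` with the same connector options (same hypothetical broker as above) looks like this:

```sql
-- Variant: persist raw events in RisingWave so ad-hoc
-- SELECTs work against them directly.
CREATE TABLE orders_raw_tbl (
    order_id BIGINT,
    customer_id BIGINT,
    product_sku VARCHAR,
    quantity INT,
    unit_price NUMERIC(12, 2),
    event_time TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'b-1.msk-cluster.kafka.us-east-1.amazonaws.com:9092',
    scan.startup.mode = 'earliest'
)
FORMAT PLAIN ENCODE JSON;
```

The trade-off is storage: a table keeps a full copy of the ingested data inside RisingWave's state store, while a source does not.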
Step 2: Build a Materialized View
Materialized views in RisingWave are continuously updated as new data arrives. Here we compute per-minute order revenue using a tumbling window:
```sql
CREATE MATERIALIZED VIEW orders_per_minute AS
SELECT
    window_start,
    window_end,
    product_sku,
    COUNT(*) AS order_count,
    SUM(quantity * unit_price) AS revenue,
    AVG(unit_price) AS avg_price
FROM TUMBLE(orders_raw, event_time, INTERVAL '1 MINUTE')
GROUP BY
    window_start,
    window_end,
    product_sku;
```
TUMBLE() is a native RisingWave window function that assigns each event to a fixed-duration, non-overlapping window. The result is a continuously refreshed table of per-minute revenue by SKU — updated within seconds of each new event.
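When overlapping windows fit the use case better — say, a rolling five-minute revenue figure refreshed every minute — RisingWave's HOP() function is the sliding-window counterpart. A sketch built on the same source:

```sql
-- Sliding windows: 5-minute windows advancing every 1 minute,
-- so each event contributes to five overlapping windows.
CREATE MATERIALIZED VIEW orders_sliding AS
SELECT
    window_start,
    window_end,
    product_sku,
    SUM(quantity * unit_price) AS revenue
FROM HOP(orders_raw, event_time, INTERVAL '1 MINUTE', INTERVAL '5 MINUTES')
GROUP BY
    window_start,
    window_end,
    product_sku;
```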
Step 3: Sink to Apache Iceberg on S3
With the materialized view defined, sink it to Iceberg using the iceberg connector:
```sql
CREATE SINK orders_lakehouse_sink AS
SELECT * FROM orders_per_minute
WITH (
    connector = 'iceberg',
    type = 'append-only',
    -- the windowed aggregate emits updates as late events arrive;
    -- force_append_only converts them to inserts (use type = 'upsert'
    -- with a primary_key instead if you need one row per window/SKU)
    force_append_only = 'true',
    -- let RisingWave create the Iceberg table on first commit
    create_table_if_not_exists = 'true',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog.internal:8181',
    warehouse.path = 's3://my-lakehouse-bucket/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'orders_per_minute'
);
```
With create_table_if_not_exists enabled, RisingWave creates the Iceberg table on first commit, writes Parquet data files to S3, and commits new snapshots to the catalog. Downstream Athena queries see the new data as soon as each snapshot commits.
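As a quick end-to-end check, an Athena query against the committed table might look like this (assuming the analytics database is visible to Athena through your catalog):

```sql
-- Athena: top SKUs by revenue over the last hour of committed data
SELECT product_sku,
       SUM(revenue) AS total_revenue
FROM analytics.orders_per_minute
WHERE window_start >= current_timestamp - INTERVAL '1' HOUR
GROUP BY product_sku
ORDER BY total_revenue DESC
LIMIT 10;
```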
Querying Iceberg Tables in RisingWave
Starting in RisingWave v2.8, you can query Iceberg tables directly using lakehouse queries — no separate query engine needed for validation and debugging:
```sql
-- Direct lakehouse query (RisingWave v2.8+)
SELECT product_sku, SUM(revenue) AS total_revenue
FROM iceberg_scan(
    's3://my-lakehouse-bucket/warehouse/analytics/orders_per_minute',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog.internal:8181'
)
WHERE window_start >= NOW() - INTERVAL '1 HOUR'
GROUP BY product_sku
ORDER BY total_revenue DESC
LIMIT 10;
```
This is particularly useful for data quality checks — you can compare the live materialized view against the committed Iceberg snapshot without leaving RisingWave.
Performance Considerations
Commit frequency: By default, RisingWave buffers records and commits to Iceberg on a configurable interval (typically 30–60 seconds). Lower intervals mean fresher data but more small files. Use Iceberg's compaction service (or scheduled Spark jobs) to merge small files periodically.
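In the sink definition, this interval maps to the commit_checkpoint_interval option, which commits one Iceberg snapshot every N RisingWave checkpoints. A hypothetical tuning of the Step 3 sink (with RisingWave's default 1 s checkpoint interval, 30 checkpoints is roughly one snapshot every 30 s):

```sql
-- Commit every 30 checkpoints: fewer, larger snapshots and
-- fewer small files, at the cost of ~30 s extra latency.
CREATE SINK orders_sink_tuned AS
SELECT * FROM orders_per_minute
WITH (
    connector = 'iceberg',
    type = 'append-only',
    force_append_only = 'true',
    commit_checkpoint_interval = 30,
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog.internal:8181',
    warehouse.path = 's3://my-lakehouse-bucket/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'orders_per_minute'
);
```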
Partitioning: Define your Iceberg table partitions to match your query patterns. For time-series data, partition by day or hour. RisingWave respects existing partition specs when writing.
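One way to control the partition spec is to pre-create the table from a query engine before the sink first writes to it. A sketch in Athena's Iceberg DDL (Spark SQL is analogous; bucket and names match the earlier examples):

```sql
-- Pre-create the target table with a daily partition transform;
-- RisingWave then writes into the existing partition layout.
CREATE TABLE analytics.orders_per_minute (
    window_start TIMESTAMP,
    window_end   TIMESTAMP,
    product_sku  STRING,
    order_count  BIGINT,
    revenue      DECIMAL(18, 2),
    avg_price    DECIMAL(12, 2)
)
PARTITIONED BY (day(window_start))
LOCATION 's3://my-lakehouse-bucket/warehouse/analytics/orders_per_minute'
TBLPROPERTIES ('table_type' = 'ICEBERG');
```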
Catalog choice: AWS Glue Catalog is a popular managed option that integrates with Athena out of the box. RisingWave's catalog.type = 'rest' supports any REST-compatible catalog, including the Tabular REST catalog adapter for Glue.
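RisingWave can also talk to Glue directly via catalog.type = 'glue', skipping the REST adapter entirely. A sketch of the Step 3 sink pointed at Glue (assumes the pod's IAM role has the necessary Glue permissions on the catalog):

```sql
-- Same sink, but with AWS Glue Data Catalog as the Iceberg catalog
CREATE SINK orders_glue_sink AS
SELECT * FROM orders_per_minute
WITH (
    connector = 'iceberg',
    type = 'append-only',
    force_append_only = 'true',
    catalog.type = 'glue',
    warehouse.path = 's3://my-lakehouse-bucket/warehouse',
    s3.region = 'us-east-1',
    database.name = 'analytics',
    table.name = 'orders_per_minute'
);
```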
Cost Comparison: Traditional vs. Lakehouse
| Approach | Storage | Compute | Freshness | Flexibility |
| --- | --- | --- | --- | --- |
| Redshift (warehouse) | High $/TB | Always-on cluster | Minutes–hours | Low (proprietary) |
| S3 + Glue batch | Low $/TB | On-demand | Hours | Medium |
| RisingWave + Iceberg | Low $/TB | Streaming cluster only | Seconds | High (open format) |
| Kafka + Flink + Iceberg | Low $/TB | Always-on JVM cluster | Seconds | High |
RisingWave + Iceberg delivers real-time freshness at batch-storage prices, with a dramatically simpler operational footprint than Flink-based alternatives.
FAQ
Q: Do I need to run a separate Iceberg catalog service on AWS?
A: Not necessarily. AWS Glue Data Catalog supports Iceberg natively and works with any REST catalog adapter. You can also run the open-source Iceberg REST catalog as a lightweight container on ECS or EKS.
Q: How does RisingWave handle S3 credentials?
A: RisingWave reads AWS credentials from the standard chain: environment variables, EC2 instance profile, or EKS IRSA. Set AWS_REGION and ensure the pod's IAM role has s3:PutObject and s3:GetObject on your bucket.
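A minimal IRSA policy sketch for the warehouse prefix (bucket name from the examples above; Iceberg maintenance also needs list and delete permissions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-lakehouse-bucket/warehouse/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-lakehouse-bucket"
    }
  ]
}
```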
Q: Can I query the Iceberg tables with Amazon Athena at the same time RisingWave is writing?
A: Yes. Iceberg's snapshot isolation ensures readers always see a consistent view. Athena queries read committed snapshots; in-flight RisingWave writes do not affect them.
Q: What is the minimum RisingWave version needed for Iceberg sinks?
A: The iceberg sink connector has been available since RisingWave v1.7. The direct lakehouse query feature (v2.8+) is not required for sinking.
Q: How do I handle schema changes in the Kafka topic?
A: RisingWave's CREATE SOURCE can use Schema Registry for Avro/Protobuf topics. Schema changes are propagated through materialized views automatically, and RisingWave can perform schema evolution on the downstream Iceberg table.
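For a registry-backed topic, the column list can be inferred from the registered schema rather than declared by hand. A sketch (the registry URL is hypothetical):

```sql
-- Avro source wired to a schema registry; columns are
-- derived from the registered schema, not listed inline.
CREATE SOURCE orders_avro
WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'b-1.msk-cluster.kafka.us-east-1.amazonaws.com:9092'
)
FORMAT PLAIN ENCODE AVRO (
    schema.registry = 'http://schema-registry.internal:8081'
);
```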
Get Started
Ready to build your real-time lakehouse on AWS? Start with the official RisingWave and Apache Iceberg documentation, and join the RisingWave community to share what you build.