A real-time data lakehouse with RisingWave and Apache Iceberg on S3 gives you sub-minute data freshness, full SQL analytics, and object storage costs — without a proprietary data warehouse. RisingWave streams data continuously into Iceberg tables on S3, where any query engine (Athena, Trino, Spark, or RisingWave itself) can read results immediately.
The Case for Object Storage as Your Analytics Foundation
Amazon S3 stores data at roughly $0.023 per GB per month — about $23.55 per TB. Managed data warehouses may quote comparable headline storage rates, but their storage is coupled to proprietary compute and per-query pricing, so the effective cost per TB of queryable data is typically several times higher. Decoupling cheap object storage from interchangeable query engines makes S3 an attractive foundation for analytics at scale — if you can solve the reliability and performance challenges that plagued early data lakes.
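As a rough sanity check, the storage math looks like this (a sketch assuming on-demand S3 Standard pricing; request, transfer, and API fees are excluded and matter at high query volume):

```python
def monthly_storage_cost_usd(size_tb: float, price_per_gb_month: float = 0.023) -> float:
    """Approximate monthly cost of keeping size_tb in S3 Standard.

    Ignores request/egress fees and assumes the $0.023/GB/month on-demand rate.
    """
    return size_tb * 1024 * price_per_gb_month

# Storing 50 TB of Iceberg data in S3 Standard:
print(f"${monthly_storage_cost_usd(50):,.2f}/month")
```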
Apache Iceberg solves those challenges. It brings ACID transactions, efficient metadata management, and partition pruning to S3-backed storage. Combined with RisingWave's streaming SQL engine, you get a system that keeps Iceberg tables fresh in real time, making them suitable for operational analytics, not just batch reporting.
Architecture: What Goes Where
        Kafka / CDC Sources
                 │
                 ▼
             RisingWave
          (Streaming SQL)
                 │
         Materialized Views
                 │
                 ▼
        Apache Iceberg on S3
                 │
        ┌────────┴────────┐
        ▼                 ▼
  Amazon Athena       RisingWave
 (Batch Queries)    (Live Queries)
RisingWave sits in the hot path, ingesting events and maintaining continuously updated materialized views. It writes to Iceberg as quickly as its checkpoint interval allows — typically every 10–60 seconds in practice.
Apache Iceberg on S3 is the durable storage layer. Once data is committed here, it's safe, queryable by any engine, and retained indefinitely at S3 pricing.
Query engines (Athena, Trino, Spark, or RisingWave itself in v2.8+) serve analytical workloads from the Iceberg tables. They always see a consistent snapshot, even while RisingWave is actively writing.
Building the Pipeline: End-to-End Example
Define Your Kafka Source
CREATE SOURCE transactions (
    txn_id VARCHAR,
    account_id BIGINT,
    merchant_id BIGINT,
    amount NUMERIC(12,2),
    currency VARCHAR(3),
    txn_type VARCHAR,
    status VARCHAR,
    txn_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'financial-transactions',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
Build Real-Time Aggregations
CREATE MATERIALIZED VIEW account_daily_summary AS
SELECT
    account_id,
    window_start::DATE AS txn_date,
    COUNT(*) AS txn_count,
    SUM(amount) FILTER (WHERE txn_type = 'credit') AS total_credits,
    SUM(amount) FILTER (WHERE txn_type = 'debit') AS total_debits,
    COUNT(*) FILTER (WHERE status = 'failed') AS failed_count,
    MAX(amount) AS max_single_txn,
    SUM(amount) AS gross_volume
FROM TUMBLE(transactions, txn_time, INTERVAL '1 DAY')
GROUP BY account_id, window_start;
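RisingWave materialized views are queryable like ordinary tables, so you can verify results as soon as events flow (the account id below is illustrative):

```sql
SELECT txn_date, txn_count, gross_volume
FROM account_daily_summary
WHERE account_id = 42
ORDER BY txn_date DESC
LIMIT 7;
```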
Sink to Iceberg on S3
CREATE SINK account_summary_to_s3_iceberg AS
SELECT * FROM account_daily_summary
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'account_id,txn_date',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-rest-catalog:8181',
    warehouse.path = 's3://my-analytics-lake/iceberg-warehouse',
    s3.region = 'us-east-1',
    s3.access.key = 'YOUR_ACCESS_KEY',
    s3.secret.key = 'YOUR_SECRET_KEY',
    database.name = 'finance',
    table.name = 'account_daily_summary'
);
For AWS deployments, prefer IAM roles over access keys. When running RisingWave on EC2 or EKS, attach an IAM role with S3 read/write permissions and omit the key parameters.
S3 Configuration and Cost Optimization
Bucket Structure
Organize your S3 bucket to make lifecycle policies and cost management easier:
s3://my-analytics-lake/
├── iceberg-warehouse/
│   ├── finance/
│   │   ├── account_daily_summary/
│   │   │   ├── metadata/
│   │   │   └── data/
│   │   └── transactions_raw/
│   └── marketing/
│       └── campaign_events/
└── compaction-temp/
Storage Classes and Lifecycle Policies
Iceberg data files from streaming writes are typically accessed heavily in the first 30 days, then infrequently. Use S3 lifecycle policies to transition older files to cheaper storage classes:
| Age | Recommended Storage Class | Cost (approx.) |
|-----|---------------------------|----------------|
| 0–30 days | S3 Standard | $0.023/GB/month |
| 30–90 days | S3 Standard-IA | $0.0125/GB/month |
| 90–365 days | S3 Glacier Instant Retrieval | $0.004/GB/month |
| 365+ days | S3 Glacier Deep Archive | $0.00099/GB/month |
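These tiers can be expressed as an S3 lifecycle configuration (a sketch — the rule ID and prefix are placeholders; note that Deep Archive objects cannot be read by query engines without a restore, so only let that tier apply to partitions you no longer query):

```json
{
  "Rules": [
    {
      "ID": "tier-iceberg-data",
      "Status": "Enabled",
      "Filter": { "Prefix": "iceberg-warehouse/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```

Apply it with `aws s3api put-bucket-lifecycle-configuration --bucket my-analytics-lake --lifecycle-configuration file://lifecycle.json`.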
Compaction for Query Performance
Streaming writes produce many small Parquet files — typically 1–10 MB each. Iceberg queries scan all files matching the partition filter, so an accumulation of small files degrades performance. Run compaction periodically:
# Using PyIceberg for compaction (run as a scheduled job).
# Note: rewrite_data_files requires a PyIceberg release that ships
# compaction support; check your version's table API before relying on it.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("rest", uri="http://iceberg-catalog:8181")
table = catalog.load_table("finance.account_daily_summary")

# Rewrite small files into larger ones (target 128 MB)
table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
Alternatively, use Spark's RewriteDataFiles action in a scheduled EMR job.
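In Spark SQL, the same operation is exposed as the `rewrite_data_files` procedure (the catalog name `my_catalog` is a placeholder for your configured Iceberg catalog):

```sql
CALL my_catalog.system.rewrite_data_files(
    table => 'finance.account_daily_summary',
    options => map('target-file-size-bytes', '134217728')
);
```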
Querying Your Lakehouse
From Amazon Athena
Once data is in Iceberg on S3, Athena can query it immediately using Iceberg's native support:
-- In Athena (Iceberg table registered in Glue catalog)
SELECT
    account_id,
    SUM(gross_volume) AS total_volume,
    SUM(failed_count) AS total_failures
FROM finance.account_daily_summary
WHERE txn_date >= DATE '2026-01-01'
GROUP BY account_id
ORDER BY total_volume DESC
LIMIT 100;
From RisingWave (v2.8+ Lakehouse Query)
In RisingWave v2.8, you can query Iceberg tables directly alongside streaming data:
CREATE SOURCE iceberg_account_summary
WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://my-analytics-lake/iceberg-warehouse',
    database.name = 'finance',
    table.name = 'account_daily_summary'
);
-- Join historical Iceberg data with live streaming data
SELECT
    live.account_id,
    live.total_debits AS todays_debits,
    hist.gross_volume AS yesterday_volume,
    live.total_debits / NULLIF(hist.gross_volume, 0) AS day_over_day_ratio
FROM account_daily_summary live
JOIN iceberg_account_summary hist
    ON live.account_id = hist.account_id
    AND hist.txn_date = CURRENT_DATE - INTERVAL '1 DAY'
WHERE live.txn_date = CURRENT_DATE;
Comparison: Real-Time Lakehouse Options on S3
| Solution | Real-Time Latency | Query Performance | Operational Complexity | Cost |
|----------|-------------------|-------------------|------------------------|------|
| RisingWave + Iceberg | Seconds | High (with compaction) | Low | Low |
| Flink + Iceberg | Seconds | High | High | Medium |
| Spark Streaming + Iceberg | Minutes | High | High | Medium |
| Kinesis Firehose + Parquet | Minutes | Medium | Low | Medium |
| Redshift + S3 | Hours (COPY) | Very High | Medium | High |
IAM Permissions Required
For RisingWave to write to Iceberg on S3, your IAM policy needs:
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject",
    "s3:ListBucket",
    "s3:GetBucketLocation"
  ],
  "Resource": [
    "arn:aws:s3:::my-analytics-lake",
    "arn:aws:s3:::my-analytics-lake/*"
  ]
}
If you're using AWS Glue as your Iceberg catalog, also add Glue permissions: glue:GetDatabase, glue:GetTable, glue:CreateTable, glue:UpdateTable.
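As a sketch, the corresponding Glue policy statement looks like this (the account ID and resource ARNs are placeholders — scope them to your own account, region, and database):

```json
{
  "Effect": "Allow",
  "Action": [
    "glue:GetDatabase",
    "glue:GetTable",
    "glue:CreateTable",
    "glue:UpdateTable"
  ],
  "Resource": [
    "arn:aws:glue:us-east-1:123456789012:catalog",
    "arn:aws:glue:us-east-1:123456789012:database/finance",
    "arn:aws:glue:us-east-1:123456789012:table/finance/*"
  ]
}
```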
FAQ
Q: Can I use MinIO instead of Amazon S3?
Yes. MinIO implements the S3 API. Set s3.endpoint = 'http://minio:9000' in your sink configuration. This is ideal for on-premise or hybrid deployments.
Q: How much does this architecture cost compared to Snowflake?
At scale, RisingWave on Kubernetes + S3 storage costs 60–80% less than Snowflake for equivalent workloads. The primary costs are compute (RisingWave nodes) and storage (S3). There are no per-query charges.
Q: Is data in Iceberg on S3 encrypted?
S3 server-side encryption (SSE-S3 or SSE-KMS) encrypts data at rest automatically. Enable encryption on your S3 bucket, and all Iceberg files will be encrypted without any changes to your pipeline.
Q: What happens if S3 has an outage?
RisingWave buffers internally during S3 outages and resumes writing when connectivity is restored. Kafka offsets are not advanced until a successful Iceberg commit, preventing data loss.
Q: Can multiple teams share the same Iceberg warehouse on S3?
Yes. Use separate databases within the Iceberg catalog to namespace tables by team. Apply S3 bucket policies and IAM roles to enforce access control per team.
Start Building Your Real-Time Lakehouse
The combination of RisingWave, Apache Iceberg, and Amazon S3 gives you a production-grade real-time analytics platform at a fraction of the cost of proprietary alternatives.
Follow the RisingWave get-started guide to spin up your first pipeline in minutes. Share your architecture and questions in the RisingWave Slack community.

