One-line deployment with a REST catalog, automatic compaction, and an integrated ingestion engine, interoperable with any Iceberg-compatible query engine. Finally, you can run Iceberg just like Postgres.
Why We Built This
We started adopting Apache Iceberg internally almost three years ago. At that time, our goal was simple: store analytical data efficiently while avoiding vendor lock-in. But the ecosystem was still immature. There were barely any good catalog services, and we had to manage compaction ourselves by maintaining a Spark cluster.
Three years later, as a streaming lakehouse vendor, we are excited to see how far the Iceberg ecosystem has come. Nearly every company we talk to has either adopted Iceberg or is evaluating it. Yet one major frustration remains: onboarding is still too complicated. To run Iceberg in production, users need to handle the following components:
Catalog services
Compaction services
Ingestion engine (ETL/ELT, batch/streaming)
Interoperability across all query engines
Among current offerings, Amazon S3 Tables might be the only product that provides a seamless experience from onboarding to scaling. However, it is a black box that cannot be self-hosted and is hard to integrate with existing ecosystems. Snowflake-managed Iceberg tables are also restricted because you must be a Snowflake customer, and advanced features such as equality deletes are not fully supported.
We experienced the same pain. So we decided to make Iceberg truly open, self-hostable, and easy to adopt for everyone.
In the rest of this article, I will show how we achieved that with RisingWave and Lakekeeper, and how you can deploy the full stack using Docker or Kubernetes.
How It Works
We are the creators of RisingWave. Historically, RisingWave has been known for its stream processing capabilities that allow users to build real-time pipelines to continuously ingest, transform, and deliver data to downstream systems.
In early 2025, we released the Iceberg Table Engine, a feature that allows users to directly create tables stored in the Iceberg format. At that time, RisingWave offered a built-in JDBC catalog, but users still needed to host their own REST catalogs and perform compaction using external engines. It was not convenient.
To fix this, we introduced two major improvements:
Lakekeeper as the default REST catalog. Lakekeeper is one of the most popular open-source REST catalogs for Iceberg and has been adopted by many enterprises. RisingWave now includes built-in integration with Lakekeeper so that users can connect their own query engines to the same Iceberg tables without extra setup.
Lightweight compaction engine inside RisingWave. Traditional compaction tools such as Spark are difficult to maintain and not efficient for this specific purpose. We built a new Iceberg compaction engine based on DataFusion, a popular query engine in the data community, optimized for compaction efficiency.
With these two components combined, users can now run a fully managed Iceberg environment with a single command.
Note: this is not just a demo. It is a production-ready product that can serve as the backbone of your data infrastructure.
Getting Started
Let’s go through a simple, end-to-end setup using Docker on your local machine.
You will deploy the stack, create an Iceberg table in RisingWave, and query the same table from DuckDB.
Installation & Environment Setup
Deploy Lakekeeper for a streaming lakehouse using Docker (local testing) or Helm (production).
Prerequisites
- Docker Desktop (running), or a Kubernetes cluster with kubectl and Helm installed.
- The PostgreSQL psql client (to run SQL against RisingWave).
Option A — Local (Docker)
Use this for quick local runs.
- Clone the RisingWave repository and switch to the docker directory
git clone https://github.com/risingwavelabs/risingwave.git
cd risingwave/docker
- Launch the environment
This spins up all required containers (a RisingWave cluster plus the Lakekeeper REST catalog) and prepares the environment for creating Iceberg-native tables in RisingWave with the Lakekeeper catalog.
docker compose -f docker-compose-with-lakekeeper.yml up -d
- What you get
A running RisingWave cluster + the Lakekeeper REST catalog
A default warehouse: risingwave-warehouse
Lakekeeper UI at http://localhost:8181/ui/warehouse
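To confirm everything is up before moving on, you can list the containers started by the Compose file:
docker compose -f docker-compose-with-lakekeeper.yml ps
All services should report a running state.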
Option B — Production (Kubernetes/Helm)
Use this for a durable, production-style setup.
Ensure you have a working Kubernetes cluster, kubectl, and Helm installed. Then follow the RisingWave documentation to deploy Lakekeeper via Helm (same components, production-ready parameters).
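For reference, a minimal Helm sketch is shown below. The repository URL is RisingWave's public chart repo; the release name and values.yaml are placeholders, and the values needed to enable Lakekeeper are described in the RisingWave docs.
helm repo add risingwavelabs https://risingwavelabs.github.io/helm-charts/
helm repo update
helm install risingwave risingwavelabs/risingwave -f values.yaml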
That’s it: Docker for local development, Helm for cluster environments.
Step 1: Lakekeeper bootstrap
After deploying Lakekeeper using the above Docker Compose setup, initialize it once using the management API:
curl -X POST http://127.0.0.1:8181/management/v1/bootstrap \
-H 'Content-Type: application/json' \
-d '{"accept-terms-of-use": true}'
Provision a warehouse backed by your S3-compatible storage (here we use MinIO):
curl -X POST http://127.0.0.1:8181/management/v1/warehouse \
-H 'Content-Type: application/json' \
-d '{
"warehouse-name": "risingwave-warehouse",
"delete-profile": { "type": "hard" },
"storage-credential": {
"type": "s3",
"credential-type": "access-key",
"aws-access-key-id": "hummockadmin",
"aws-secret-access-key": "hummockadmin"
},
"storage-profile": {
"type": "s3",
"bucket": "hummock001",
"region": "us-east-1",
"flavor": "s3-compat",
"endpoint": "http://minio-0:9301",
"path-style-access": true,
"sts-enabled": false,
"key-prefix": "risingwave-lakekeeper"
}
}'
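To confirm the warehouse was registered, open the Lakekeeper UI at http://localhost:8181/ui/warehouse, or list warehouses through the management API (the exact parameters may differ between Lakekeeper versions):
curl http://127.0.0.1:8181/management/v1/warehouse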
Step 2: Connect RisingWave to the REST catalog
Open a SQL shell to RisingWave:
psql -h localhost -p 4566 -d dev -U root
Create a connection pointing to the Lakekeeper REST catalog:
CREATE CONNECTION lakekeeper_catalog_conn
WITH (
type = 'iceberg',
catalog.type = 'rest',
catalog.uri = 'http://lakekeeper:8181/catalog/',
warehouse.path = 'risingwave-warehouse',
s3.access.key = 'hummockadmin',
s3.secret.key = 'hummockadmin',
s3.path.style.access = 'true',
s3.endpoint = 'http://minio-0:9301',
s3.region = 'us-east-1'
);
Set it as the default Iceberg connection:
SET iceberg_engine_connection = 'public.lakekeeper_catalog_conn';
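As a quick sanity check, you can list the connection you just created (assuming your RisingWave version supports SHOW CONNECTIONS):
SHOW CONNECTIONS;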
Step 3: Create a Native Iceberg Table
Here is an example table for banking transactions. RisingWave will write Iceberg files to your object store and register them in the REST catalog automatically.
CREATE TABLE bank_financial_transactions (
txn_id INT PRIMARY KEY, -- unique transaction id
account_id INT, -- customer/account identifier
txn_timestamp TIMESTAMP, -- when the transaction occurred
amount DECIMAL, -- money amount
currency VARCHAR, -- e.g., USD
txn_type VARCHAR, -- CREDIT or DEBIT
merchant VARCHAR, -- merchant/payee if applicable
category VARCHAR, -- e.g., utilities, groceries
txn_status VARCHAR -- PENDING, POSTED, REVERSED
)
WITH (commit_checkpoint_interval = 1)
ENGINE = iceberg;
Commit interval note
The approximate commit time ≈ barrier_interval_ms × checkpoint_frequency × commit_checkpoint_interval. Tune this to balance latency against file sizes.
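For example, if barrier_interval_ms = 1000 and checkpoint_frequency = 10, then commit_checkpoint_interval = 1 yields a commit roughly every 1000 ms × 10 × 1 = 10 seconds; raising it to 6 stretches that to about one minute and produces fewer, larger files per commit.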
Step 4: Insert and Query Data
Insert five sample transactions for account 50123, then flush:
INSERT INTO bank_financial_transactions (
txn_id, account_id, txn_timestamp, amount, currency, txn_type, merchant, category, txn_status
) VALUES
(1001, 50123, '2025-10-03 09:15:23', 3250.00, 'USD', 'CREDIT', 'Acme Corp Payroll', 'income', 'POSTED'),
(1002, 50123, '2025-10-04 12:48:10', 86.37, 'USD', 'DEBIT', 'Whole Foods Market', 'groceries', 'POSTED'),
(1003, 50123, '2025-10-05 08:21:44', 54.19, 'USD', 'DEBIT', 'Starbucks', 'dining', 'POSTED'),
(1004, 50123, '2025-10-06 18:02:05', 62.50, 'USD', 'DEBIT', 'Chevron', 'fuel', 'POSTED'),
(1005, 50123, '2025-10-07 21:36:59', 129.99, 'USD', 'DEBIT', 'Amazon.com', 'shopping', 'PENDING');
FLUSH;
Query the table:
SELECT * FROM bank_financial_transactions;
txn_id | account_id | txn_timestamp | amount | currency | txn_type | merchant | category | txn_status
-------+------------+------------------------+--------+----------+----------+-----------------------+------------+-----------
1001 | 50123 | 2025-10-03 09:15:23 | 3250.00| USD | CREDIT | Acme Corp Payroll | income | POSTED
1002 | 50123 | 2025-10-04 12:48:10 | 86.37| USD | DEBIT | Whole Foods Market | groceries | POSTED
1003 | 50123 | 2025-10-05 08:21:44 | 54.19| USD | DEBIT | Starbucks | dining | POSTED
1004 | 50123 | 2025-10-06 18:02:05 | 62.50| USD | DEBIT | Chevron | fuel | POSTED
1005 | 50123 | 2025-10-07 21:36:59 | 129.99| USD | DEBIT | Amazon.com | shopping | PENDING
(5 rows)
You have just ingested and queried an Iceberg table as easily as a regular database table.
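In a real deployment you would rarely INSERT by hand: the same table can be fed continuously by RisingWave's streaming ingestion. The sketch below is illustrative only; it assumes a hypothetical Kafka topic bank_txns with JSON messages matching the table's columns and a broker reachable at kafka:9092. Adjust names and connection properties to your environment.
-- Hypothetical Kafka source; topic, broker, and startup mode are placeholders
CREATE SOURCE kafka_bank_txns (
    txn_id INT,
    account_id INT,
    txn_timestamp TIMESTAMP,
    amount DECIMAL,
    currency VARCHAR,
    txn_type VARCHAR,
    merchant VARCHAR,
    category VARCHAR,
    txn_status VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'bank_txns',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

-- Continuously route the stream into the Iceberg table created above
CREATE SINK bank_txns_to_iceberg INTO bank_financial_transactions
AS SELECT * FROM kafka_bank_txns;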
Step 5: Query from DuckDB
Install and launch DuckDB, attach the same REST catalog, and query the same table.
You will see identical results, confirming interoperability across open engines.
Install DuckDB CLI:
curl https://install.duckdb.org | sh
Launch it:
~/.duckdb/cli/latest/duckdb
Add a host-level DNS override so DuckDB can resolve minio-0 to 127.0.0.1. This lets DuckDB running outside Compose access MinIO at http://minio-0:9301 without changing endpoints.
echo "127.0.0.1 minio-0" | sudo tee -a /etc/hosts
Configure S3 settings and attach the same REST catalog:
SET s3_region = 'us-east-1';
SET s3_endpoint = 'http://minio-0:9301';
SET s3_access_key_id = 'hummockadmin';
SET s3_secret_access_key = 'hummockadmin';
SET s3_url_style = 'path';
SET s3_use_ssl = false;
ATTACH 'risingwave-warehouse' AS lakekeeper_catalog (
TYPE ICEBERG,
ENDPOINT 'http://127.0.0.1:8181/catalog/',
AUTHORIZATION_TYPE 'none'
);
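Optionally, list what the attached catalog exposes before querying:
SHOW ALL TABLES;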
Query the exact same table in DuckDB:
SELECT * FROM lakekeeper_catalog.public.bank_financial_transactions;
┌─────────┬────────────┬──────────────────────┬─────────┬──────────┬──────────┬──────────────────────┬───────────┬────────────┐
│ txn_id │ account_id │ txn_timestamp │ amount │ currency │ txn_type │ merchant │ category │ txn_status │
│ int32 │ int32 │ timestamp │ decimal │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────┼────────────┼──────────────────────┼─────────┼──────────┼──────────┼──────────────────────┼───────────┼────────────┤
│ 1001 │ 50123 │ 2025-10-03 09:15:23 │ 3250.00│ USD │ CREDIT │ Acme Corp Payroll │ income │ POSTED │
│ 1002 │ 50123 │ 2025-10-04 12:48:10 │ 86.37│ USD │ DEBIT │ Whole Foods Market │ groceries │ POSTED │
│ 1003 │ 50123 │ 2025-10-05 08:21:44 │ 54.19│ USD │ DEBIT │ Starbucks │ dining │ POSTED │
│ 1004 │ 50123 │ 2025-10-06 18:02:05 │ 62.50│ USD │ DEBIT │ Chevron │ fuel │ POSTED │
│ 1005 │ 50123 │ 2025-10-07 21:36:59 │ 129.99│ USD │ DEBIT │ Amazon.com │ shopping │ PENDING │
└─────────┴────────────┴──────────────────────┴─────────┴──────────┴──────────┴──────────────────────┴───────────┴────────────┘
Result: DuckDB reads the table registered in Lakekeeper and written by RisingWave. That’s the power of an open table format and an open REST catalog.
Advanced features you get out of the box
Time travel
Query past states of the table:
-- By timestamp
SELECT *
FROM bank_financial_transactions
FOR SYSTEM_TIME AS OF TIMESTAMPTZ '2025-10-07 22:00:00';
-- By snapshot ID
SELECT *
FROM bank_financial_transactions
FOR SYSTEM_VERSION AS OF 1234567890;
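The snapshot ID above is a placeholder; real snapshot IDs come from the table's Iceberg snapshot history, which you can look up with any engine or tool that exposes Iceberg table metadata.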
Partitioning strategies
Best practice is to choose a partition key that is a prefix of the primary key. If you want to partition by value_date and account_id, define the PK accordingly:
-- Partitioned variant of the table
CREATE TABLE bank_financial_transactions_p (
value_date DATE,
account_id INT,
txn_id INT,
txn_timestamp TIMESTAMP,
amount DECIMAL,
currency VARCHAR,
txn_type VARCHAR,
channel VARCHAR,
merchant VARCHAR,
description VARCHAR,
category VARCHAR,
txn_status VARCHAR,
counterparty_acct VARCHAR,
reference_id VARCHAR,
PRIMARY KEY (value_date, account_id, txn_id)
)
WITH (
commit_checkpoint_interval = 1,
partition_by = 'value_date, bucket(16, account_id)'
)
ENGINE = iceberg;
Rule: The partition key must be a prefix of the primary key.
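With this layout, filters on the leading partition columns can prune data files. For example, a query like the following (illustrative) only needs to touch the partitions for that date and account bucket:
SELECT txn_id, amount, txn_status
FROM bank_financial_transactions_p
WHERE value_date = DATE '2025-10-07'
  AND account_id = 50123;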
Native compaction controls
Enable per table to keep file counts under control in streaming:
CREATE TABLE bank_financial_transactions_compacted (
txn_id INT PRIMARY KEY,
account_id INT,
txn_timestamp TIMESTAMP,
value_date DATE,
amount DECIMAL,
currency VARCHAR,
txn_type VARCHAR,
channel VARCHAR,
merchant VARCHAR,
description VARCHAR,
category VARCHAR,
txn_status VARCHAR,
counterparty_acct VARCHAR,
reference_id VARCHAR
)
WITH (
enable_compaction = true, -- run background compaction for this table
compaction_interval_sec = 3600, -- compact roughly once per hour
enable_snapshot_expiration = true,
snapshot_expiration_max_age_millis = 604800000, -- expire snapshots older than 7 days
snapshot_expiration_retain_last = 5, -- but always keep the last 5 snapshots
snapshot_expiration_clear_expired_files = true, -- remove data files no longer referenced
snapshot_expiration_clear_expired_meta_data = true -- remove metadata for expired snapshots
)
ENGINE = iceberg;
Compaction requires a dedicated Iceberg compactor.
Configuration, limits, and best practices
Commit intervals
Tune commit cadence when creating the table (example shown earlier).
Approximate commit time:
time = barrier_interval_ms × checkpoint_frequency × commit_checkpoint_interval
Limitations
Limited DDL: Some schema changes may require recreating the table.
Single writer: For tables created with this engine, only RisingWave should write to them to ensure consistency.
Best practices
Use the hosted catalog for the simplest setup, or Lakekeeper REST as shown here.
Tune commit intervals to balance file size and latency.
In AWS, consider S3 Tables for native compaction; use Lakekeeper as an open source REST alternative.
Design partitioning early.
Monitor small file growth and enable compaction as needed.
Conclusion
If you want Iceberg to feel like a database without sacrificing openness, RisingWave delivers: open Iceberg tables managed by RisingWave and cataloged by Lakekeeper (REST), SQL-first ingestion, automatic compaction, time travel, and engine portability. Query the same data from DuckDB, Spark, or Trino with no copies.