A Self-Hosted Iceberg Lakehouse in Minutes with RisingWave

Setting up an Apache Iceberg stack can be complex. You need a query engine to write data, a catalog to manage metadata (like Nessie or a JDBC service), and an object store for the table files. Juggling these separate components requires significant setup and maintenance. RisingWave simplifies this entire process. With its built-in hosted catalog, you can create a fully functional, open, and high-performance streaming lakehouse without any external catalog services.

This guide demonstrates how to use RisingWave to provision a hosted JDBC Iceberg catalog with a single command. You will create a native Iceberg table, stream data into it, and query it from both RisingWave and Apache Spark, proving its interoperability and openness.

Overview, Prerequisites & Quickstart

This demo uses an Iceberg setup from our Awesome Stream Processing repository.

Architecture

RisingWave acts as both the streaming query engine and the Iceberg catalog, MinIO stores the table files, and Spark later connects as an external reader. The whole stack runs locally via Docker Compose.

Prerequisites

To follow along, you will need:

  • Docker and Docker Compose, to launch the demo stack

  • A psql client (or any PostgreSQL-compatible client), to connect to RisingWave

  • Apache Spark (optional), for the interoperability check in Step 4

Quickstart

First, clone the repository and launch the demo stack.

# Clone the repo
git clone https://github.com/risingwavelabs/awesome-stream-processing.git

# Navigate into the Iceberg quickstart demo
cd awesome-stream-processing/07-iceberg-demos/streaming_iceberg_quickstart

# Launch demo stack
docker compose up -d

This command starts a standalone RisingWave instance (localhost:4566), PostgreSQL (localhost:5432), and a MinIO object store (localhost:9301).

Step 1: Create a Hosted Catalog Connection

Connect to your RisingWave instance using psql.

psql -h localhost -p 4566 -d dev -U root

Next, create a connection. By setting hosted_catalog = 'true', you tell RisingWave to manage all the Iceberg metadata internally, acting as a fully compliant Iceberg catalog. No Glue, Nessie, or external Postgres is required.

CREATE CONNECTION my_iceberg_connection
WITH (
    type                 = 'iceberg',
    warehouse.path       = 's3://icebergdata/demo',
    s3.access.key        = 'hummockadmin',
    s3.secret.key        = 'hummockadmin',
    s3.region            = 'us-east-1',
    s3.endpoint          = 'http://minio-0:9301',
    s3.path.style.access = 'true',
    hosted_catalog       = 'true'            -- 👈 one flag, no extra services!
);
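
To confirm the catalog connection was created, you can list the connections in the current database:

SHOW CONNECTIONS;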

Step 2: Create a Native Iceberg Table

Activate the connection for your current session. All subsequent Iceberg operations will use these settings.

SET iceberg_engine_connection = 'public.my_iceberg_connection';
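
To verify the setting took effect, Postgres-style inspection of session variables should echo it back:

SHOW iceberg_engine_connection;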

Now, create a native Iceberg table using the ENGINE = iceberg setting. RisingWave will manage this table's structure and data, storing the files in the MinIO path defined in your connection.

CREATE TABLE crypto_trades (
  trade_id  BIGINT PRIMARY KEY,
  symbol    VARCHAR,
  price     DOUBLE,
  quantity  DOUBLE,
  side      VARCHAR,     -- e.g., 'BUY' or 'SELL'
  exchange  VARCHAR,     -- e.g., 'binance', 'coinbase'
  trade_ts  TIMESTAMP
)
WITH (commit_checkpoint_interval = 1)  -- low-latency commits
ENGINE = iceberg;

The table is now ready for streaming inserts.
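
In a real pipeline, you would typically feed this table from a streaming source rather than manual inserts. As a minimal sketch (assuming a Kafka broker at kafka:9092 and a JSON-encoded trades topic, neither of which is part of this demo stack), you could define a source and continuously sink it into the table:

-- Hypothetical Kafka source; the broker address and topic are assumptions
CREATE SOURCE trades_src (
  trade_id  BIGINT,
  symbol    VARCHAR,
  price     DOUBLE,
  quantity  DOUBLE,
  side      VARCHAR,
  exchange  VARCHAR,
  trade_ts  TIMESTAMP
) WITH (
  connector = 'kafka',
  topic = 'trades',
  properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

-- Continuously route the stream into the Iceberg table
CREATE SINK trades_to_iceberg INTO crypto_trades
AS SELECT * FROM trades_src;

For this guide, manual inserts are enough to demonstrate the flow.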

Step 3: Stream Data and Query in RisingWave

Insert a couple of records into the table.

INSERT INTO crypto_trades
VALUES
  (1000001, 'BTCUSDT', 57321.25, 0.005, 'BUY',  'binance', NOW()),
  (1000002, 'ETHUSDT',  2578.10, 0.250, 'SELL', 'coinbase', NOW());

Query the table to verify the commit.

SELECT * FROM crypto_trades;

You’ll see an output like this:

 trade_id | symbol  |  price   | quantity | side | exchange |        trade_ts
----------+---------+----------+----------+------+----------+----------------------------
  1000001 | BTCUSDT | 57321.25 |   0.005  | BUY  | binance  | 2025-07-17 15:04:56.123
  1000002 | ETHUSDT |  2578.10 |   0.250  | SELL | coinbase | 2025-07-17 15:04:56.456
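
Since RisingWave speaks standard SQL, ad-hoc analytics on the same table work as expected, for example, per-symbol totals:

SELECT
  symbol,
  SUM(quantity) AS total_qty,
  AVG(price)    AS avg_price
FROM crypto_trades
GROUP BY symbol;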

Because the table is in the open Iceberg format, you can immediately query it from external engines.

Step 4: Query the Table from Apache Spark

To demonstrate interoperability, let's query the same table from an external engine like Apache Spark.

First, ensure you have Apache Spark installed. If you don't, you can download it from the official Spark website and follow their installation guide.

Once Spark is available in your environment, launch the Spark SQL shell with the packages and configuration needed to connect to RisingWave's hosted catalog.

spark-sql \
  --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.2,org.apache.iceberg:iceberg-aws-bundle:1.9.2,org.postgresql:postgresql:42.7.4" \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.defaultCatalog=dev \
  --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
  --conf spark.sql.catalog.dev.uri=jdbc:postgresql://127.0.0.1:4566/dev \
  --conf spark.sql.catalog.dev.jdbc.user=postgres \
  --conf spark.sql.catalog.dev.jdbc.password=123 \
  --conf spark.sql.catalog.dev.warehouse=s3://icebergdata/demo \
  --conf spark.sql.catalog.dev.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.dev.s3.endpoint=http://127.0.0.1:9301 \
  --conf spark.sql.catalog.dev.s3.region=us-east-1 \
  --conf spark.sql.catalog.dev.s3.path-style-access=true \
  --conf spark.sql.catalog.dev.s3.access-key-id=hummockadmin \
  --conf spark.sql.catalog.dev.s3.secret-access-key=hummockadmin

Then run a query to view the data you inserted into the Iceberg table in RisingWave:

SELECT * FROM dev.public.crypto_trades;
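
Because this is a standard Iceberg table, Spark's built-in Iceberg metadata tables should work as well. For example, to inspect the table's commit history:

SELECT snapshot_id, committed_at, operation
FROM dev.public.crypto_trades.snapshots;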

Conclusion

This guide showed how RisingWave streamlines the creation of a streaming lakehouse. By setting hosted_catalog = true, you get a fully functional Iceberg catalog without deploying or managing external services. You created a native Iceberg table, streamed data into it, and queried it from both RisingWave and Spark. This demonstrates a seamless, end-to-end workflow where RisingWave handles table creation, data ingestion, and metadata management, all while keeping your data in an open format, accessible to any engine in the ecosystem.

Get Started with RisingWave

  • Try RisingWave Today: Get RisingWave up and running and try this demo yourself.

  • Talk to Our Experts: Have a complex use case or want to see a personalized demo? Contact us to discuss how RisingWave can address your specific challenges.

  • Join Our Community: Connect with fellow developers, ask questions, and share your experiences in our vibrant Slack community.
