At RisingWave, our goal is to simplify the process of building real-time data applications. A key part of this is enabling users to build modern, open data architectures. That’s why we developed the Iceberg Table Engine, which allows you to stream data directly into tables using the open Apache Iceberg format. This is a powerful way to build a streaming lakehouse where your data is immediately available for both real-time and batch analytics.
However, using any Iceberg engine traditionally requires a first, crucial step: setting up and configuring an Iceberg catalog. This catalog is responsible for managing the table metadata. While flexible, this often means provisioning and managing a separate service like AWS Glue, a dedicated PostgreSQL database for the JDBC catalog, or a REST service. This adds an extra layer of configuration and operational overhead before you can even write your first line of data.
To simplify this process, we're excited to introduce the Hosted Iceberg Catalog, available in RisingWave since v2.4.
Our Solution: The New Hosted Iceberg Catalog
The Hosted Iceberg Catalog is a built-in option that lets you use RisingWave's own metadata store as a fully functional Iceberg catalog. You don't need to set up anything externally.
The difference is best shown with code. Previously, to connect to an external JDBC catalog, your connection setup would look something like this:
The Old Way: Connecting to an External Catalog
CREATE CONNECTION external_jdbc_conn WITH (
    type = 'iceberg',
    warehouse.path = 's3://hummock001/iceberg-data',
    s3.access.key = '...',
    s3.secret.key = '...',
    catalog.type = 'jdbc',
    catalog.uri = 'jdbc:postgresql://external-postgres:5432/iceberg_meta', -- External DB
    catalog.jdbc.user = 'user',
    catalog.jdbc.password = 'password',
    catalog.name = 'dev'
);
Now, with the hosted catalog, the setup is radically simpler. You just need to tell RisingWave to manage the catalog for you by adding a single parameter:
CREATE CONNECTION my_hosted_catalog_conn WITH (
    type = 'iceberg',
    warehouse.path = 's3://your/warehouse/path',
    s3.access.key = 'xxxxx',
    s3.secret.key = 'yyyyy',
    s3.endpoint = 'your_s3_endpoint',
    hosted_catalog = true -- This is all it takes!
);
That’s it. With hosted_catalog = true, RisingWave handles the catalog setup internally, allowing you to get started with the Iceberg engine in minutes.
How It Works Under the Hood
When you enable the hosted catalog, RisingWave uses its internal PostgreSQL-based metastore to manage Iceberg's metadata. It exposes two system views, iceberg_tables and iceberg_namespace_properties, which contain the necessary catalog information.
SELECT * FROM iceberg_tables;

 catalog_name | table_namespace |    table_name    | metadata_location                                                                                                             | previous_metadata_location | iceberg_type
--------------+-----------------+------------------+-------------------------------------------------------------------------------------------------------------------------------+----------------------------+--------------
 dev          | public          | t_hosted_catalog | s3://hummock001/iceberg_connection/public/t_hosted_catalog/metadata/00000-e267a2ad-daf1-4d05-b7bd-75ff30e17629.metadata.json |                            |
(1 row)
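The companion view can be inspected the same way. As a sketch, the column layout below follows the standard Iceberg JDBC catalog tables; the actual rows depend on which namespace properties you have set:

SELECT * FROM iceberg_namespace_properties;
-- Columns, per the Iceberg JDBC catalog layout:
-- catalog_name | namespace | property_key | property_value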
Most importantly, this implementation follows the standard Iceberg JDBC Catalog protocol. This means that even though the catalog is "hosted" by RisingWave for your convenience, it's not a proprietary or closed format. It’s an open, standard catalog that external tools can access using JDBC or other SQL drivers.
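As a concrete illustration of that openness, the lookup an external JDBC-catalog client performs to resolve a table boils down to a plain SQL query against the view above. The values here mirror the earlier example and are illustrative:

-- Resolve a table's current metadata file, as an Iceberg
-- JDBC catalog client would
SELECT metadata_location
FROM iceberg_tables
WHERE catalog_name = 'dev'
  AND table_namespace = 'public'
  AND table_name = 't_hosted_catalog';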
Getting Started: A Practical Example
Let's walk through the three simple steps to create and populate your first table using the hosted catalog.
Step 1: Create the Connection
First, create the connection with the hosted_catalog = true parameter.
CREATE CONNECTION my_hosted_catalog_conn WITH (
    type = 'iceberg',
    warehouse.path = 's3://your/warehouse/path',
    s3.access.key = 'xxxxx',
    s3.secret.key = 'yyyyy',
    s3.endpoint = 'your_s3_endpoint',
    hosted_catalog = true
);
Step 2: Set the Active Connection
Next, tell your session to use this new connection for the Iceberg engine.
SET iceberg_engine_connection = 'public.my_hosted_catalog_conn';
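To sanity-check the setting before moving on, a Postgres-style SHOW should echo it back (assuming your client supports it; output formatting varies):

SHOW iceberg_engine_connection;
-- Expected: public.my_hosted_catalog_conn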
Step 3: Create and Populate Your Table
Now you can create a table with ENGINE = iceberg and insert data into it, just like any other table in RisingWave.
CREATE TABLE t_hosted_catalog (id INT PRIMARY KEY, name VARCHAR)
ENGINE = iceberg;
INSERT INTO t_hosted_catalog VALUES (1, 'RisingWave');
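You can read the row straight back to confirm the write. Note that the Iceberg engine commits data to the lake at checkpoint intervals, so a freshly inserted row can take a moment to appear:

SELECT * FROM t_hosted_catalog;
--  id |    name
-- ----+------------
--   1 | RisingWave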
Interoperability: Connecting with External Tools like Spark
A common question is whether you can still read these tables with other tools. The answer is yes. Since our hosted catalog is a standard JDBC catalog, any tool that can speak to a PostgreSQL database and understands the Iceberg format can access your data.
For external tools to find the table, it’s helpful to understand the schema mapping:
Iceberg      | RisingWave
-------------|--------------
catalog name | database name
namespace    | schema
table        | table
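A quick way to see this mapping for yourself is to create a table in a different schema and watch the namespace column change. A minimal sketch, where the analytics schema and t_demo table are hypothetical names:

CREATE SCHEMA analytics;
CREATE TABLE analytics.t_demo (id INT PRIMARY KEY) ENGINE = iceberg;

SELECT table_namespace, table_name FROM iceberg_tables;
-- The new table appears with table_namespace = 'analytics'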
To show this in action, here’s how you can configure a Spark session to read the t_hosted_catalog table we just created (assuming it lives in a RisingWave database named dev).
Spark SQL Configuration:
# Launch Spark SQL with necessary packages and configurations
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.postgresql:postgresql:42.7.4 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
  --conf spark.sql.catalog.dev.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.dev.warehouse=s3://your/warehouse/path \
  --conf spark.sql.catalog.dev.uri="jdbc:postgresql://risingwave-hostname:4566/dev" \
  --conf spark.sql.catalog.dev.jdbc.user="your_rw_user" \
  --conf spark.sql.catalog.dev.jdbc.password="your_rw_password" \
  --conf spark.sql.catalog.dev.s3.endpoint=your_s3_endpoint
Once connected, you can query the table directly from Spark:
spark-sql> SELECT * FROM dev.public.t_hosted_catalog;
+---+----------+
| id|      name|
+---+----------+
|  1|RisingWave|
+---+----------+
This demonstrates that your data is not locked in; it remains open and accessible to the broader data ecosystem.
Summary and Next Steps
We hope the Hosted Iceberg Catalog makes it significantly easier for you to build a streaming lakehouse. This feature helps by:
Simplifying Setup: Get started with the Iceberg engine without configuring an external catalog.
Reducing Operational Overhead: No need to provision, manage, or pay for a separate catalog service.
Ensuring Openness: The catalog is a standard JDBC implementation, ensuring compatibility with tools like Spark, Trino, and Flink.
Creating an Integrated Workflow: Manage table creation, data streaming, and catalog metadata all within RisingWave.
You can learn more in our official documentation. We'd love to hear your feedback on this new feature. Please join our community on Slack or visit our GitHub repository to share your thoughts.