Iceberg Catalog Comparison: Hive Metastore vs AWS Glue vs REST vs Nessie


Introduction

You have decided to build on Apache Iceberg. The table format is settled. But now comes a decision that will shape your entire lakehouse architecture: which catalog should you use?

The Iceberg catalog is the metadata backbone of your lakehouse. It tracks table locations, manages schema versions, handles concurrent writes, and determines which query engines can access your data. Pick the wrong one and you face migration headaches, vendor lock-in, or operational complexity that slows your team down.

In this guide, we compare the four most widely adopted Iceberg catalog options: Hive Metastore, AWS Glue, REST Catalog, and Project Nessie. We evaluate each across features, scalability, cloud support, and streaming compatibility so you can make an informed choice. We also cover how RisingWave's hosted Iceberg catalog fits into this landscape for teams that want a managed streaming lakehouse experience.

What Does an Iceberg Catalog Actually Do?

An Iceberg catalog is a service that maps table names to their current metadata locations. Every Iceberg table has a metadata pointer, a file in object storage that describes the table's schema, partition spec, sort order, and the list of data files (via manifest lists and manifests). The catalog's job is to manage that pointer.

When a writer commits new data, the catalog atomically updates the metadata pointer. When a reader queries a table, the catalog resolves the table name to the latest metadata file. This indirection is what gives Iceberg its key properties: ACID transactions, schema evolution, and time travel.
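The pointer-swap described above can be sketched in a few lines. This is a toy in-memory stand-in for a catalog (a real catalog does this in a database or behind a REST API), but the compare-and-swap logic is the essence of how Iceberg catalogs provide atomic commits:

```python
# Minimal sketch of the catalog's core job: an atomic compare-and-swap
# of each table's metadata pointer. The dict is a stand-in for a real
# catalog's backing store.

class TablePointerCatalog:
    def __init__(self):
        self._pointers = {}  # table name -> current metadata file path

    def load(self, table):
        """Resolve a table name to its latest metadata file."""
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer only if it still matches what the writer read.
        This is optimistic concurrency control: if a concurrent commit
        moved the pointer first, this commit fails and must retry."""
        if self._pointers.get(table) != expected:
            return False  # someone else committed; rebase and retry
        self._pointers[table] = new
        return True

catalog = TablePointerCatalog()
catalog.commit("db.events", None, "s3://bucket/db/events/metadata/v1.json")

# Two writers both read v1, then race to commit.
base = catalog.load("db.events")
assert catalog.commit("db.events", base, "s3://bucket/db/events/metadata/v2.json")
assert not catalog.commit("db.events", base, "s3://bucket/db/events/metadata/v2b.json")
```

The second writer's failed commit is not data loss: the writer re-reads the new pointer, reapplies its changes on top, and retries. Every catalog in this comparison implements some version of this loop.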

The choice of catalog affects three critical dimensions:

  • Concurrency and consistency: How does the catalog handle simultaneous writers? Does it support optimistic concurrency control?
  • Multi-engine compatibility: Can Spark, Trino, Flink, RisingWave, and other engines all use the same catalog?
  • Operational overhead: Do you need to deploy and manage infrastructure, or is the catalog fully managed?

How Does Hive Metastore Work as an Iceberg Catalog?

Hive Metastore (HMS) is the oldest and most widely deployed option. Originally built for Hive tables, it was extended to support Iceberg through a HiveCatalog implementation that stores Iceberg metadata pointers in the HMS database.

Strengths

  • Broad ecosystem compatibility: Nearly every data processing engine supports HMS. Spark, Trino, Flink, Hive, and Impala all integrate out of the box.
  • Mature and battle-tested: HMS has been in production at thousands of organizations for over a decade. The operational patterns are well understood.
  • On-premises friendly: If your data stays in your own data centers, HMS runs on any infrastructure you control.

Limitations

  • Single-table transactions: The HMS API handles commits one table at a time. You cannot atomically commit changes across multiple tables, which limits advanced data pipeline patterns.
  • Infrastructure overhead: HMS requires a relational database (typically MySQL or PostgreSQL) and a Thrift service. You need to manage availability, backups, and upgrades.
  • No built-in versioning: HMS has no concept of branches or tags. You cannot create isolated environments for testing schema changes before promoting them to production.
  • Scaling bottlenecks: As the number of tables and partitions grows, the backing database becomes a bottleneck. Organizations with tens of thousands of tables often hit performance limits.

Best for

Teams with existing Hadoop or Hive infrastructure who need broad engine compatibility and are comfortable managing the underlying services.
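As a reference point, wiring an engine like Spark to an HMS-backed Iceberg catalog takes only a few configuration properties. The property names below follow the Iceberg Spark runtime conventions; the catalog name `hms`, the Thrift URI, and the warehouse path are placeholders for your environment:

```python
# Sketch of the Spark properties for an HMS-backed Iceberg catalog.
# The thrift:// URI and warehouse path are placeholders.
hms_catalog_conf = {
    "spark.sql.catalog.hms": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.hms.type": "hive",
    "spark.sql.catalog.hms.uri": "thrift://metastore-host:9083",
    # Location where new tables' data and metadata files land:
    "spark.sql.catalog.hms.warehouse": "s3://my-bucket/warehouse",
}

# With a SparkSession built from these settings, tables are addressed
# as hms.<database>.<table>, e.g. SELECT * FROM hms.analytics.page_views.
```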

How Does AWS Glue Work as an Iceberg Catalog?

AWS Glue Data Catalog is a fully managed metadata service that supports Iceberg tables natively. It stores Iceberg metadata pointers and integrates with the broader AWS ecosystem through IAM, Lake Formation, and Athena.

Strengths

  • Fully managed: No infrastructure to deploy or maintain. AWS handles availability, scaling, and backups.
  • Deep AWS integration: IAM-based access control, Lake Formation for fine-grained permissions, native integration with Athena, EMR, Redshift Spectrum, and AWS Glue ETL.
  • Serverless scaling: The catalog scales automatically with your workload. No capacity planning required.
  • Market leader: According to the 2025 State of the Apache Iceberg Ecosystem survey, AWS Glue leads catalog adoption at 39.3%.

Limitations

  • AWS lock-in: Glue is only available on AWS. If you run a multi-cloud or hybrid architecture, you need a separate catalog for non-AWS environments.
  • Single-table transactions: Like HMS, Glue does not support atomic multi-table commits.
  • Limited engine support outside AWS: While Spark and Trino can use Glue as an Iceberg catalog, not all engines support it natively. Engines running outside AWS need additional configuration and network access.
  • Eventual consistency edge cases: In rare scenarios, Glue's eventually consistent read model can cause stale reads after rapid successive writes.

Best for

AWS-native teams running Iceberg on S3 who want zero operational overhead and tight integration with the AWS analytics stack.

How Does the REST Catalog Work for Iceberg?

The Iceberg REST Catalog specification, introduced in Iceberg 0.14.0, defines a standard HTTP-based API for catalog operations. Any catalog that implements this spec can work with any engine that supports REST, creating a vendor-neutral interoperability layer.

Strengths

  • Vendor neutral: The REST spec decouples engines from catalog implementations. You can swap the backing catalog without changing your query engine configuration.
  • Multi-engine support: Any engine that implements the REST client (Spark, Trino, Flink, RisingWave, StarRocks, and others) can connect to any REST-compatible catalog.
  • Modern architecture: HTTP-based communication works naturally with cloud-native infrastructure, service meshes, and API gateways. No Thrift dependencies.
  • Extensible: The REST spec supports custom endpoints for features like access control, table maintenance, and metrics reporting.
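The spec's commit protocol is itself a good illustration of optimistic concurrency: the client sends its metadata updates together with *requirements* (for example, `assert-ref-snapshot-id`, asserting which snapshot it based its work on), and the server rejects the commit with a conflict if any requirement no longer holds. A simplified sketch of the server-side check (real implementations validate several requirement types):

```python
# Simplified sketch of the requirement check in the Iceberg REST commit
# protocol. We model only "assert-ref-snapshot-id" on a single branch;
# real servers support more requirement types and richer table state.

def check_requirements(table_state, requirements):
    """Return True if every client requirement still holds."""
    for req in requirements:
        if req["type"] == "assert-ref-snapshot-id":
            if table_state["refs"].get(req["ref"]) != req["snapshot-id"]:
                return False
    return True

def commit(table_state, requirements, new_snapshot_id):
    if not check_requirements(table_state, requirements):
        # Maps to an HTTP 409 Conflict; the client must refresh and retry.
        raise RuntimeError("conflict: requirement failed")
    table_state["refs"]["main"] = new_snapshot_id
    return table_state

state = {"refs": {"main": 100}}
# The client read snapshot 100, so its requirement passes and the commit lands.
commit(state, [{"type": "assert-ref-snapshot-id", "ref": "main", "snapshot-id": 100}], 101)
assert state["refs"]["main"] == 101
```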

Implementations

Multiple vendors offer REST catalog implementations:

  • Apache Polaris (formerly Snowflake Polaris Catalog): Open-source REST catalog with fine-grained access control.
  • Lakekeeper: Rust-based open-source REST catalog focused on performance.
  • RisingWave Open Lake: Managed REST catalog with built-in streaming ingestion and automatic compaction. More on this below.
  • Tabular (now part of Databricks): Managed REST catalog with governance features.

Limitations

  • Newer ecosystem: While growing rapidly, REST catalog adoption is still behind HMS and Glue. Some older engine versions may not support it.
  • Self-hosted complexity: If you choose an open-source REST implementation, you still need to deploy and operate the service yourself (unless you choose a managed option).

Best for

New deployments that want engine flexibility, multi-cloud support, and a future-proof catalog architecture. The Apache Iceberg community recommends REST for new projects.

How Does Project Nessie Work as an Iceberg Catalog?

Project Nessie brings Git-like semantics to the Iceberg catalog. It supports branches, tags, and merge operations at the catalog level, enabling workflows that are impossible with other catalog types.

Strengths

  • Git-like branching: Create isolated branches for development, testing, or ETL. Merge changes back to main when validated. This is transformative for data pipeline development.
  • Multi-table transactions: A single Nessie commit can atomically modify multiple tables. This enables consistent cross-table operations that HMS and Glue cannot provide.
  • Audit trail: Every catalog change is versioned. You can review the full history of metadata changes, roll back to any point, and understand who changed what.
  • Growing adoption: Nessie adoption reached 28.6% in the 2025 ecosystem survey, reflecting strong momentum.

Limitations

  • Operational complexity: Nessie requires a backing store (RocksDB, DynamoDB, or a relational database) and introduces concepts (branches, merges, conflict resolution) that your team needs to learn.
  • Engine compatibility: While major engines support Nessie, the integration depth varies. Some engines may not fully leverage branching features.
  • Merge conflicts: When multiple branches modify the same table, conflict resolution can be complex. Teams need clear branching policies.

Best for

Teams that need multi-table transactions, isolated development environments, or strict audit requirements. Particularly valuable for regulated industries where change tracking is mandatory.
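Nessie's branch-and-merge model is easiest to see in miniature. In the toy sketch below, each commit snapshots the whole catalog (every table's metadata pointer), which is exactly what makes multi-table atomic commits and isolated branches possible; the fast-forward merge is a deliberate simplification of Nessie's real conflict detection:

```python
# Toy sketch of Nessie-style catalog versioning: each branch points to a
# commit, and each commit snapshots all table pointers, so one commit
# can atomically change several tables.

class VersionedCatalog:
    def __init__(self):
        self.commits = {0: {}}      # commit id -> {table: metadata pointer}
        self.branches = {"main": 0}
        self._next = 1

    def create_branch(self, name, from_branch="main"):
        self.branches[name] = self.branches[from_branch]

    def commit(self, branch, changes):
        """Atomically apply changes to several tables on one branch."""
        new_state = dict(self.commits[self.branches[branch]])
        new_state.update(changes)
        self.commits[self._next] = new_state
        self.branches[branch] = self._next
        self._next += 1

    def merge(self, src, dst="main"):
        # Toy fast-forward merge; real Nessie detects conflicting changes.
        self.branches[dst] = self.branches[src]

cat = VersionedCatalog()
cat.create_branch("etl-dev")
# Multi-table atomic commit on the dev branch; main is untouched until merge.
cat.commit("etl-dev", {"orders": "v2.json", "order_items": "v5.json"})
assert cat.commits[cat.branches["main"]] == {}
cat.merge("etl-dev")
assert cat.commits[cat.branches["main"]]["orders"] == "v2.json"
```

Readers on `main` never observe the half-finished state of the `etl-dev` branch: they see either none of the two-table change or all of it.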

Which Catalog Should You Choose for Streaming Workloads?

Streaming workloads place unique demands on the Iceberg catalog. High-frequency commits, low-latency requirements, and continuous operation mean the catalog is under constant pressure. Here is how each option handles these demands:

| Feature | Hive Metastore | AWS Glue | REST Catalog | Project Nessie |
| --- | --- | --- | --- | --- |
| Commit frequency | Moderate (Thrift overhead) | High (managed scaling) | High (HTTP-native) | High (optimized for frequent commits) |
| Multi-table atomicity | No | No | Depends on implementation | Yes |
| Streaming engine support | Flink, Spark Streaming | Flink, Spark Streaming (AWS) | Flink, Spark, RisingWave | Flink, Spark |
| Concurrency handling | Pessimistic locking | Optimistic (managed) | Optimistic (spec-defined) | Optimistic with branching |
| Latency overhead | Medium (Thrift RPC) | Low (managed) | Low (HTTP) | Low (HTTP) |
| Managed option | No (self-hosted) | Yes (AWS-managed) | Yes (multiple vendors) | Partially (Dremio Cloud) |
| Multi-cloud | Yes (self-hosted) | No (AWS only) | Yes | Yes |
| Branch isolation | No | No | Depends on implementation | Yes (native) |
| Automatic compaction | No | No | Depends on implementation | No |
| Cost | Infrastructure costs | Pay-per-request | Varies by vendor | Infrastructure costs |

For streaming workloads specifically, the key considerations are:

  1. Commit frequency: Streaming writers commit every few seconds or minutes. The catalog must handle this pace without becoming a bottleneck.
  2. Concurrency: Multiple streaming jobs may write to different tables simultaneously. The catalog must support concurrent commits without excessive lock contention.
  3. Automatic maintenance: Streaming creates many small files. The catalog (or associated services) should support automatic compaction to prevent query performance degradation.
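Point 3 is worth making concrete. A streaming writer committing every second or two leaves behind a stream of tiny data files, and compaction typically bin-packs them into files near a target size. A minimal sketch of that grouping step (the target size and greedy strategy here are illustrative; real compactors also consider partitions and delete files):

```python
# Sketch of why automatic compaction matters for streaming: frequent
# commits produce many small files, and a bin-pack compaction groups
# them into rewritten files near a target size. Sizes are in MB.

def bin_pack(file_sizes, target=128):
    """Greedily group small files so each rewritten file is near `target`."""
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# A burst of streaming commits: one hundred 2 MB files (200 MB total).
small_files = [2] * 100
groups = bin_pack(small_files)
assert len(groups) == 2              # 100 tiny files become 2 rewrites
assert [sum(g) for g in groups] == [128, 72]
```

Without this maintenance, every query plans and opens hundreds of tiny files; with it, the same data is read from a handful of well-sized ones.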

How Does RisingWave's Managed Iceberg Catalog Fit In?

RisingWave Open Lake takes a different approach by combining the catalog with the streaming engine and table maintenance in a single managed service. Instead of choosing a catalog and then integrating it with a separate streaming engine and a separate compaction service, RisingWave provides all three together.

Here is what this looks like in practice:

-- Create an Iceberg table directly in RisingWave
CREATE TABLE page_views (
    user_id INT,
    page_url VARCHAR,
    view_time TIMESTAMP,
    session_id VARCHAR
) WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'https://your-risingwave-cloud-endpoint',
    warehouse = 'my_warehouse',
    database.name = 'analytics',
    table.name = 'page_views'
);
-- Sink results into the Iceberg table from a materialized view
-- (page_views_mv is assumed to be defined over your Kafka source)
CREATE SINK page_views_sink FROM page_views_mv WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'https://your-risingwave-cloud-endpoint',
    warehouse = 'my_warehouse',
    database.name = 'analytics',
    table.name = 'page_views'
);

RisingWave's catalog is REST-compatible, so any engine that supports the Iceberg REST spec (Spark, Trino, DuckDB, StarRocks) can query tables managed by RisingWave. The key differentiator is that automatic compaction runs as a built-in background process, not a separate batch job you need to schedule and monitor.

For more details on setting up a streaming Iceberg pipeline, see the RisingWave Iceberg streaming documentation and the lakehouse architecture guide.

What Are Common Migration Paths Between Catalogs?

Organizations rarely start with the ideal catalog. Here are the most common migration patterns:

HMS to REST Catalog

This is the most common migration. Teams start with HMS because it is what they know, then move to REST for better multi-engine support and reduced operational overhead. The migration involves:

  1. Deploying a REST catalog implementation alongside HMS
  2. Registering existing Iceberg tables in the REST catalog (metadata files are already in object storage)
  3. Updating engine configurations to point to the REST catalog
  4. Decommissioning HMS after validation
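Step 2 is the heart of the migration, and it is cheaper than it sounds: because Iceberg metadata already lives in object storage, "migrating" a table means re-registering its current metadata pointer with the new catalog, with no data movement at all. The sketch below uses stand-in dicts for both catalogs; with PyIceberg, the equivalent operation against a real REST catalog is `catalog.register_table(identifier, metadata_location)`:

```python
# Sketch of re-registering tables during an HMS -> REST migration.
# Both catalogs are stand-in dicts mapping table name -> metadata pointer;
# the metadata paths are placeholders.

hms = {
    "analytics.page_views": "s3://lake/analytics/page_views/metadata/00012.json",
    "analytics.sessions": "s3://lake/analytics/sessions/metadata/00007.json",
}
rest = {}

for table, metadata_location in hms.items():
    # No data or metadata files move; only the pointer is re-registered.
    rest[table] = metadata_location

assert rest == hms  # the REST catalog now resolves the same pointers
```

Because both catalogs resolve the same pointers during the validation window, engines can be cut over one at a time before HMS is decommissioned.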

Glue to REST Catalog (multi-cloud expansion)

Teams that started on AWS and now need multi-cloud access often add a REST catalog layer. Some REST implementations (like Apache Polaris) can use Glue as a backing store, providing REST API access to Glue-managed tables.

Any catalog to Nessie

Moving to Nessie typically happens when teams need multi-table transactions or branching capabilities. The migration requires re-registering tables in Nessie and updating all engine configurations.

FAQ

What is an Apache Iceberg catalog?

An Apache Iceberg catalog is a service that manages the mapping between table names and their metadata file locations in object storage. It handles concurrent access, ensures atomic commits, and provides the entry point for query engines to discover and read Iceberg tables.

Which Iceberg catalog is best for streaming?

For streaming workloads, the REST Catalog offers the best combination of low latency, broad engine support, and vendor neutrality. If you want a fully managed experience with built-in compaction, RisingWave's managed catalog integrates streaming ingestion directly with catalog management.

Can I use multiple Iceberg catalogs at the same time?

Yes. Most query engines support configuring multiple catalogs simultaneously. For example, you can query tables from both a Glue catalog and a REST catalog in the same Spark session. This is common during migrations or in multi-cloud environments.

How does Nessie differ from other Iceberg catalogs?

Nessie provides Git-like version control for your data lakehouse, supporting branches, tags, and multi-table atomic commits. Other catalogs (HMS, Glue, basic REST) only support single-table commits and do not provide branching or history capabilities.

Conclusion

Choosing an Iceberg catalog is a foundational decision for your lakehouse architecture. Here are the key takeaways:

  • Hive Metastore is reliable and broadly compatible but adds operational overhead and lacks modern features like branching or multi-table transactions.
  • AWS Glue is the easiest option for AWS-native teams but creates cloud lock-in and does not support multi-table atomicity.
  • REST Catalog is the recommended choice for new deployments due to its vendor neutrality, broad engine support, and modern HTTP-based architecture.
  • Project Nessie is the most feature-rich option, offering Git-like branching and multi-table transactions, at the cost of additional complexity.
  • For streaming workloads, prioritize commit frequency handling, concurrency support, and automatic compaction when evaluating catalogs.

Ready to try this yourself? RisingWave provides a managed Iceberg catalog with built-in streaming ingestion and automatic compaction. Try RisingWave Cloud free, no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
