The Evolution of RisingWave's Meta Store

The Evolution of RisingWave's Meta Store

So, what role does the meta store play in RisingWave clusters? What are challenges using etcd as a meta store? How does migrating from etcd to a SQL backend address these challenges and provide significant benefits? This article will guide you through the evolution of the RisingWave meta store.

Metadata storage in RisingWave

Below is a simplified architecture diagram of RisingWave. It showcases various components like the frontend, compute nodes, compactor, and meta server.

!(https://risingwave9.wpcomstaging.com/wp-content/uploads/2024/10/截屏2024-10-25-下午3.39.02.png)

RisingWave is designed for the Cloud. To fully leverage the elastic resources provided by cloud platforms, its cloud-native architecture primarily consists of stateless nodes. The persistent state of the cluster is stored in a shared storage system across nodes, divided into two parts:

  • Persistent data storage: At the bottom of the diagram, this includes data for tables and materialized views, as well as state storage for stream processing. This data is organized in an LSM tree format and stored in an Object Storage Service (OSS), such as AWS S3.

  • Metadata storage (meta store): The right side of the diagram is the highlight of this article. It stores all catalog and user information, metadata, and structure of the LSM tree, as well as the topology of all streaming tasks and their distribution across compute nodes.

The meta server node directly reads and writes the metadata store, acting as the brain of the RisingWave cluster. It communicates with other nodes to manage various functions, including:

  • Metadata management: Creating, modifying, and deleting different objects. For objects tied to streaming tasks, the meta server interacts with compute nodes to manage those tasks.

  • Scheduling: Overseeing computations on the compute nodes, including migration and scaling operations.

  • Checkpoint: Initiating and managing the cluster's checkpoints.

  • LSM tree management: Scheduling and publishing compaction tasks.

Using etcd as metadata store

Initially, we used etcd for metadata storage. Due to the ease of maintenance and compatibility, metadata was encoded in protobuf. As a proven metadata system, etcd effectively met RisingWave's needs for high availability, persistence, and transaction handling.

However, as RisingWave evolved and user clusters grew, we began to encounter several challenges.

High development complexity

Etcd, as a key-value store, uses a KV model that works well for simple metadata needs like catalogs and permission management. However, it struggles with complex scenarios. For instance, it doesn't natively support indexing multiple fields in the same record, such as making efficient querying by both table name and ID cumbersome. Moreover, etcd lacks data processing capabilities, requiring multiple queries for a single request.

For example, maintaining and altering stream computation graphs often requires checking for circular dependencies among tasks. Implementing this directly in etcd would necessitate repeated queries, resulting in unacceptable latency.

To tackle this, we implemented a full metadata cache in the memory of the meta server. This allowed for direct memory access when reading metadata and facilitated various indexing operations. However, this approach increased memory resource demands on the meta server and added complexity. Developers had to manually maintain the cache's consistency with etcd, handling updates or rollbacks in memory after meta store's transaction commits. Many bugs related to the meta server stemmed from inconsistencies or contention issues in the cached metadata.

Operational challenges in the Cloud

Etcd is primarily designed for bare-metal deployments, and cloud providers do not offer managed etcd services. Thus, RisingWave Cloud had to find solutions for deploying etcd in the cloud. After a brief attempt to develop an etcd operator, we opted to use Helm to deploy etcd on managed Kubernetes services like Amazon EKS.

In practice, maintaining a stateful distributed storage system posed significant operational challenges. While many engineers were adept at resolving RisingWave's online issues, they often felt overwhelmed when dealing with etcd. Common issues included out-of-memory (OOM) errors after heavy requests and delayed defragmentation. Resolving these issues typically required deep investigation and parameter tuning, compounded by the relatively inferior disk performance of cloud environments compared to local disks. These challenges prompted us to explore better options for cloud deployment.

Data volume limitations

Etcd was not designed for large-scale data storage, which is evident in its use of a single Raft group without sharding. This design limits its optimization goals and industrial applications to small data scales. Currently, its default storage limit is 2GB, with a recommended maximum of 8GB (see here).

Initially, we estimated that the metadata for a single RisingWave cluster would not exceed the gigabyte range. However, we found that some users could easily create and manage thousands of materialized views using scripts. This, combined with the complexity and concurrency of SQL wihch further amplifies the metadata size, resulted in unexpectedly large metadata sizes, with millions of Actors (the smallest scheduling unit in RisingWave) emerging sooner than anticipated. In RisingWave Cloud, under high-frequency metadata updates like scaling out, the internal LSM tree of etcd quickly consumed the available 4GB disk space. Many user clusters' metadata continued to grow, prompting us to think ahead.

On the other hand, for most users, their metadata scale was quite small, leading to resource wastage as each cluster occupied an independent etcd node. The lack of scalability in etcd also limited the potential for multi-tenant deployments for metadata storage.

Additional issues

We also encountered other critical or trivial issues that adversely affected user experience. For example:

  • After recovering from disk exhaustion, partially committed transactions led to dirty metadata. The exact cause remains unclear.

  • Protobuf encoding made bugging and trouble shooting difficult. Users needed to write scripts to locate the required information. Manual corrections for dirty data also required scripts, complicating operations.

  • Etcd is designed to handle smaller key-value pairs, with default RPC size limits of 1MB (see here). As metadata grows in complexity, the size of individual metadata entries continues to increase, especially with features like UDFs.

Using SQL backend as metadata store

To address these challenges, we considered using a SQL backend with an OLTP system for metadata storage in RisingWave clusters. This approach not only meets our needs for high availability, persistence, and transaction handling but also resolves the previously mentioned issues. Specifically, using a SQL backend offers the following benefits:

  • Reduced development complexity: With a SQL backend, you can create indexes for metadata tables and express previously complex queries in SQL. This eliminates the need for caching in the meta server. For instance, the circular dependency check mentioned earlier can be easily implemented with a recursive CTE statement in SQL, with all computations occurring directly in the database acting as the meta store.

  • Easier Cloud deployment: RisingWave Cloud can take advantage of managed database services from cloud vendors. Services like RDS are core offerings to ensure robust support.

  • Increased data volume limits: Modern relational databases typically support terabyte-scale data, significantly expanding RisingWave's application scope. For example, AWS RDS for PostgreSQL supports up to 128 vCPUs and 4096 GiB of RAM, with 64 TiB of storage (see here).

  • Multi-tenant shared metadata storage: A SQL backend allows for shared metadata storage for multiple tenants. Since metadata volume and concurrency are generally limited, this approach could result in a light load on OLTP databases, allowing multiple RisingWave clusters to share a single database. However, it also increases the impact of potential storage failures, so we are conducting further tests to weigh the pros and cons.

The development journey wasn't easy. We started from version 1.3 and worked through to version 1.8, where we implemented and tested all features before the public release. This led to version 2.1, which officially dropped support for etcd. Throughout this process, we continuously developed features on the meta server while ensuring stability and performance. Transitioning from etcd to a SQL backend was a significant challenge, like changing an engine mid-flight. After much effort, we successfully made the switch. Given that the SQL backend is still a relatively new feature, some issues may occur, but we encourage feedback to help enhance RisingWave, our popular streaming database system.

The Modern Backbone for Your
Event-Driven Infrastructure
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.