Applying Deterministic Simulation: The RisingWave Story (Part 2 of 2)
Delve into the world of deterministic simulation as we introduce Madsim, a cutting-edge testing framework built on Rust's asynchronous programming ecosystem. Join us in this article as we unravel the fascinating utilization of deterministic simulation in RisingWave, a cloud-native distributed database.
In our previous article, we introduced the concept of deterministic simulation and discussed the development of Madsim, a deterministic testing framework based on Rust's asynchronous programming ecosystem. In this article, we will explore the application of deterministic simulation in RisingWave, a cloud-native distributed database, and how we have used Madsim to set up deterministic tests to improve the reliability of our system. We will also share our reflections on this journey, highlighting the benefits and challenges we faced while implementing this technique.
Madsim x RisingWave
Madsim and RisingWave work together to perform deterministic testing on the entire system. An analogy from the sci-fi animation Rick and Morty illustrates this well. In one episode, aliens build a realistic world simulator to deceive Rick and extract information from him; Rick sees through the scheme and escapes from the virtual world. In our system, RisingWave plays Rick, Madsim is the alien simulator, and external services such as Etcd and S3 are Morty, who is also simulated by the aliens to keep the deception convincing.
In RisingWave, we have implemented four types of deterministic testing:
- Unit testing
- End-to-end testing
- Recovery testing
- Scaling testing
These tests have been integrated into the continuous integration (CI) process and are automatically run after each pull request that modifies the code, ensuring continuous monitoring of the code quality and system reliability.
Unit testing is the most fundamental testing method: each test case covers a relatively small module and amount of code, with fairly simple testing logic. Although unit tests are unlikely to expose deep concurrency issues, we still made them deterministic, and in doing so found several bugs caused by oversights in the test logic itself. While deterministic unit testing did not surface more profound bugs, it proved very helpful for the initial development and verification of Madsim, as it is simple and easy to integrate.
Deterministic simulation has other uses in unit testing as well. For example, testing timeout handling is difficult in a normal environment because it requires genuinely waiting for a long time; in the simulator, the long wait completes instantly. Similarly, logic that depends on Etcd normally requires starting a real Etcd instance, but with the Etcd simulator the test runs entirely in the simulated environment, which validates the Etcd simulator at the same time.
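The idea of completing a long wait instantly can be sketched with a tiny virtual clock: pending timers sit in a priority queue, and instead of really sleeping, the clock jumps straight to the next deadline. This is a simplified illustration of the concept, not Madsim's actual implementation:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// A minimal virtual clock: timers fire in deadline order, and the
/// clock jumps straight to each deadline instead of really sleeping.
struct VirtualClock {
    now_ms: u64,
    timers: BinaryHeap<Reverse<(u64, u32)>>, // (deadline in ms, timer id)
}

impl VirtualClock {
    fn new() -> Self {
        Self { now_ms: 0, timers: BinaryHeap::new() }
    }

    /// Schedule a timer to fire after `delay_ms` of *virtual* time.
    fn sleep(&mut self, id: u32, delay_ms: u64) {
        self.timers.push(Reverse((self.now_ms + delay_ms, id)));
    }

    /// Advance the clock to the next deadline and return the timer id.
    fn tick(&mut self) -> Option<u32> {
        let Reverse((deadline, id)) = self.timers.pop()?;
        self.now_ms = deadline; // the "wait" completes instantly
        Some(id)
    }
}

fn main() {
    let mut clock = VirtualClock::new();
    clock.sleep(1, 30_000);    // e.g. a 30-second RPC timeout
    clock.sleep(2, 3_600_000); // e.g. an hour-long retention timer
    while let Some(id) = clock.tick() {
        println!("timer {} fired at t={} ms", id, clock.now_ms);
    }
}
```

Both timers fire in well under a millisecond of real time, which is why timeout paths that would be impractical to exercise in a normal environment become cheap to test in the simulator.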
End-to-end testing is the main application scenario for deterministic simulation. It covers various system components with complex code logic and is the scenario most prone to concurrency errors.
To understand the end-to-end test in RisingWave, let's first look at the system architecture diagram.
The solid colored blocks represent the core components of RisingWave, with each block representing a process that may be deployed on the same or different nodes, communicating with each other via gRPC. The surrounding icons represent external services that interact with RisingWave. Users can establish a connection with the frontend through a Postgres client and execute SQL commands. The frontend then forwards the commands to multiple compute nodes, which, coordinated by the meta service, work together to complete computing tasks and may read and write data to S3 during the process.
In traditional end-to-end testing, we deploy Etcd and MinIO (to provide the S3 service) on a node, then start several meta, frontend, compute, and compactor processes. Next, we use sqllogictest to read pre-written test scripts, send commands to the RisingWave cluster, and verify the execution results. Completing a full round of testing on CI takes about 8 minutes.
In deterministic simulation testing, all the processes and services run in a single-threaded environment, and the external services the system depends on are replaced with corresponding simulators. We use the API provided by Madsim to create virtual nodes, set their IP addresses, and start the relevant tasks. After establishing a virtual cluster, we call the sqllogictest library functions on the "client node" to read the same test scripts from the real file system and run the same tests. This way, a full round of testing takes only about 2 minutes.
To maximize the impact of concurrency, sqllogictest supports parallel execution of different test scripts (similar to `make -j`), and each script randomly connects to a different frontend with an independent session. Internal processing logic then executes in an unpredictable order, occasionally causing errors. When that happens, we can re-execute with the same random seed as the previous run to obtain identical results. We can then open the log to investigate and locate the source of the bug, and if the logs are not detailed enough, we can modify the code to output new debugging information at any time without affecting the reproducibility of the bug.
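The seed-driven reproducibility described above rests on one rule: every source of randomness is drawn from a single seeded generator, so replaying the seed replays the run. A toy illustration with a hand-rolled xorshift64 generator (any seedable PRNG would do; this is not Madsim's actual scheduler):

```rust
/// A tiny seedable xorshift64 PRNG: the only source of "randomness"
/// in a deterministic simulation, so one seed fixes every decision.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        Self { state: seed.max(1) } // xorshift state must be non-zero
    }

    fn next(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Pick which of `n` runnable tasks executes next.
    fn pick(&mut self, n: usize) -> usize {
        (self.next() % n as u64) as usize
    }
}

/// Produce the task interleaving chosen for a given seed.
fn schedule(seed: u64, tasks: usize, steps: usize) -> Vec<usize> {
    let mut rng = XorShift64::new(seed);
    (0..steps).map(|_| rng.pick(tasks)).collect()
}

fn main() {
    // The same seed always yields the same task interleaving...
    assert_eq!(schedule(42, 4, 8), schedule(42, 4, 8));
    // ...while a different seed explores a different one.
    assert_ne!(schedule(42, 4, 8), schedule(43, 4, 8));
}
```

Because the interleaving is a pure function of the seed, adding log statements or debugging code between runs does not perturb the schedule, which is what makes "re-execute with the same seed" a reliable reproduction strategy.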
Madsim uses the tracing library's span mechanism to add necessary context information, such as the node, task, and RPC request, to each log. This achieves the effect of distributed tracing in a real system. By reading this log, developers can understand the behavior and relationships of each node in the system, improving the efficiency of problem diagnosis.
Through deterministic end-to-end testing, we identified and fixed several bugs caused by concurrency, including panics, deadlocks, and calculation errors.
While deterministic simulation is useful in normal end-to-end testing, it truly shines when it comes to abnormal scenarios.
As a distributed system in the cloud, RisingWave must tolerate a variety of exceptional situations, such as node and network failures. When a node goes down, k8s quickly spins up new services, and the kernel begins the process of recovery. However, if other nodes in the cluster fail simultaneously or new requests from users arrive, these events can interweave in unpredictable ways, leading to extremely difficult situations that developers could not have anticipated during design and coding.
To thoroughly test RisingWave's reliability under various abnormal conditions, we periodically kill random nodes on top of the end-to-end tests, then bring them back up and observe whether the system can still return correct results to users. Usually, when the system experiences an internal failure, the client receives an error response; it then waits and retries several times with exponential backoff. If it fails to obtain the correct result for a long time, the test is considered failed. However, we have not yet injected failures during data modification, because these operations are neither atomic nor idempotent, and retrying them may leave the data in an undesired state.
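The client-side retry behavior described above can be sketched as a generic backoff loop; the attempt limit and base delay below are illustrative values, not RisingWave's actual settings:

```rust
use std::time::Duration;

/// Retry `op` with exponential backoff: wait base, 2*base, 4*base, ...
/// between attempts, and give up after `max_attempts` failures.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    // Persistent failure: in the test, this means failure.
                    return Err(e);
                }
                // Delay doubles each round: base * 2^(attempt - 1).
                let delay = base_delay * 2u32.pow(attempt - 1);
                std::thread::sleep(delay);
            }
        }
    }
}

fn main() {
    // Simulate a node that recovers after two failed attempts.
    let mut failures_left = 2;
    let result = retry_with_backoff(5, Duration::from_millis(10), || {
        if failures_left > 0 {
            failures_left -= 1;
            Err("node recovering")
        } else {
            Ok(42)
        }
    });
    assert_eq!(result, Ok(42));
}
```

In the simulator, the sleeps between retries are virtual time, so even a long backoff sequence adds almost nothing to real test duration.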
After establishing the exceptional recovery testing, our developers went through a rather painful debugging period of two months, as new bugs were always discovered. For example, there were panics caused by missing error handling, assertions based on erroneous assumptions, and result inaccuracies due to concurrency issues (all of which were recorded in this issue). Fortunately, once these problems were discovered, they could be reliably reproduced, enabling us to quickly identify and fix them. This greatly helped us improve the reliability of the system.
In addition to recovery, another scenario that is prone to errors is the scaling of the cluster. As one of the key features of cloud-native systems, RisingWave can elastically scale the cluster based on the workload, adding nodes to increase computing power during heavy loads and reducing nodes to save costs during light loads. Whenever the cluster configuration changes, the meta service needs to rebalance the computing and data of each node. If user requests and recovery are overlaid during the task migration process, it poses a great challenge to the stability and correctness of the system.
During the cluster scaling tests, we create a Nexmark data source and set up a query pipeline, then randomly schedule task shards between different nodes, and finally check whether the output is consistent with a run in which no scheduling occurred. This test also helped us discover a large number of issues, which fall roughly into the following categories:
- Issues with the recovery of compute operator states and cache invalidation: During normal end-to-end testing, many code paths that pull data from object storage are not covered due to the small amount of data. However, when shards are migrated, each node needs to synchronize data with object storage and clear its own state cache. Improper handling can cause serious data correctness issues.
- Issues with building complex data flow graphs on the meta service: When multiple shards are migrated simultaneously, the meta service will build a more complex data flow graph. In the simulator, we can easily send dense scheduling requests to the meta service in a short time to simulate extreme scenarios and greatly improve code coverage.
- Issues with data source migration: When a new data flow graph is deployed to a compute node, the existing data flows will be recombined, which can easily lead to data loss or duplication.
These are just the issues encountered during pure migration. In the future, we plan to combine cluster scaling with failure recovery to further test the system's stability under extreme conditions.
The above describes the application of several deterministic tests in RisingWave. However, testing alone is not enough; it is important to integrate the tests into CI so that every modification is tested continuously and new bugs are kept out of the system.

Deterministic simulation has the advantage of being fast, and our end-to-end testing on RisingWave confirmed this: simulation tests run approximately 4-5 times faster than real execution, which lets us run more tests in the same amount of time and increases the probability of detecting bugs. RisingWave's CI runs in a container environment with 16 CPU cores, so we execute each deterministic test 16 times in parallel with different seeds to make the most of the available computing power. We also try to minimize execution time so that each PR can complete all tests within 20 minutes. In addition to PRs, we schedule daily tests on the main branch with no time constraints, allowing us to run more iterations and cover more scenarios.
Challenges and Limitations
As deterministic testing becomes more widely used, we are encountering new challenges and limitations while reaping its benefits. Below are some of the challenges we have faced:
- Continuous increase in testing time: The addition of new test cases and the growing complexity of testing logic have led to a slow but steady increase in the execution time of deterministic testing, ultimately exceeding our time limit. Injected errors also cause a significant amount of time to be spent in error handling and recovery logic. To tackle this, we introduced a probability of triggering each exception, which allows us to trade testing intensity for execution speed when time is tight.
- Need to improve bug detection efficiency: Increasing the number of tests is not always the best way to uncover more bugs. Other smarter methods, such as using more intelligent task scheduling strategies and combined fuzzing, may be required to find more deeply hidden problems. This requires us to have more communication and cooperation with experts in the field of software testing.
- Limited to Rust language projects: RisingWave has introduced more external data source connectors that may use other languages or depend on external processes. It is costly and less rewarding to develop simulators for each of them. Currently, we only maintain a simulator for Kafka data sources. However, Facebook's deterministic execution framework, Hermit, may provide a solution for various connectors by using a system-level approach to control the execution order of any process regardless of programming language.
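To illustrate the first point above, trading testing intensity for speed by firing each injection point only with some probability can be sketched as follows. The LCG constants and the 0.1 probability are illustrative choices; the key property is that the decision stream stays seeded and therefore reproducible:

```rust
/// Decide per injection point whether to actually trigger the fault,
/// using a seeded LCG so the decision stream is still reproducible.
struct FaultInjector {
    state: u64,
    /// Probability (0.0..=1.0) that a given injection point fires.
    probability: f64,
}

impl FaultInjector {
    fn new(seed: u64, probability: f64) -> Self {
        // Scramble the seed so nearby seeds diverge quickly.
        let state = seed.wrapping_mul(2862933555777941757).wrapping_add(1);
        Self { state, probability }
    }

    /// Advance the LCG and map its top 53 bits to [0, 1).
    fn next_unit(&mut self) -> f64 {
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.state >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Returns true if this injection point should fire a fault.
    fn should_inject(&mut self) -> bool {
        self.next_unit() < self.probability
    }
}

fn main() {
    // At probability 0.1, roughly one in ten injection points fires,
    // cutting time spent in recovery paths while keeping determinism.
    let mut injector = FaultInjector::new(42, 0.1);
    let fired = (0..10_000).filter(|_| injector.should_inject()).count();
    println!("fired {fired} of 10000 injection points");
}
```

Tuning the probability down speeds up a run at the cost of covering fewer failure paths per run; the lost coverage can be recovered by running more seeds when time allows.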
In this article, we have introduced a testing technique called “deterministic simulation” and its application in RisingWave.
Deterministic simulation involves running the system on a single-threaded simulator, which allows bugs to be reliably reproduced and makes it more efficient to discover and solve problems. We developed the Madsim framework based on the asynchronous programming ecosystem of the Rust language and implemented various scenarios of deterministic testing for RisingWave. These tests have helped us uncover many potential issues, particularly in recovery and scaling, improving RisingWave’s reliability in extreme scenarios.
We recognize the immense value and potential of this technique and acknowledge that none of this would have been possible without the foundation laid by Rust community developers. As a result, we are happy to contribute back to the community by completely decoupling Madsim from RisingWave, making it suitable for any Rust language project. If you are building a distributed system in Rust, feel free to use Madsim to add deterministic testing to your project and share your valuable experience in system testing. Let’s work together to eliminate concurrent bugs in our code!