Deterministic Simulation: A New Era of Distributed System Testing (Part 1 of 2)

Deterministic simulation is a powerful technique for system testing that can improve the stability and reliability of distributed systems. It allows for the components of an entire system to be run on a single-threaded simulator, ensuring deterministic execution and consistent reproduction of bugs. One of the most incredible benefits of this technique is the ability to simulate long durations of real-world behavior quickly, making it an efficient tool for discovering potential issues in the system.

Many distributed systems have successfully leveraged deterministic simulation, with FoundationDB serving as a well-known example. RisingWave, a cloud-native distributed database, similarly requires strict adherence to correctness and reliability standards. Therefore, we have implemented this technique for RisingWave and developed a deterministic testing framework called Madsim. This framework is used for continuous end-to-end simulation testing of RisingWave, enabling our developers to detect and eliminate many potential concurrency and recovery issues in the system at an early stage.

In this article, we will discuss the background and principles of deterministic simulation, introduce our deterministic testing framework Madsim, and share our experience applying deterministic testing to RisingWave.

Concurrency Bugs

Distributed systems running in the real world face numerous uncertainties, including delayed, lost, or unordered messages, unpredictable thread execution orders, and random number outcomes. Concurrency issues may only occur under a specific execution order, making them difficult to identify and eliminate. These issues pose a significant risk to the system's stability and can cause bugs to disappear and reappear, adding to the debugging challenge.

Both academia and industry have proposed different solutions to this problem. The academic approach tends to rely on formal verification or exhaustive enumeration of all possible states to ensure system correctness — this method is only effective for small systems and can't handle the combinatorial explosion of possibilities in larger systems.

On the other hand, the industry focuses on extensive testing to cover as many possibilities as possible, with Chaos Engineering being a well-known technique. Chaos Engineering can effectively validate system behavior in a real environment but cannot control uncertainty or guarantee 100% problem reproduction, which makes development and debugging more difficult.

Deterministic Simulation

Principles and Advantages

Deterministic simulation offers a solution that bridges the gap between the two methods, focusing on eradicating uncertainty within a system, which enables stable replication of any problem. Let's examine the various sources of uncertainty we previously mentioned and discuss how to eliminate them:

When generating random numbers in a single thread, fixing the initial random seed is sufficient.
Regarding the scheduling order of multiple threads, we can control their execution order by running them all on a single thread.
We can create a comprehensive simulator to replicate their behavior deterministically for external factors, such as the operating system and network, which may result in random behavior. Additionally, we can introduce errors and randomness in a controlled manner to achieve deterministic chaos engineering.

The core principles of deterministic simulation are described above. Recognizing that time is also a simulated variable in this context is crucial. Each event within the system is assigned a deterministic occurrence time, and the simulator processes these events chronologically. Due to the system time being a series of discrete points on a continuous timeline, this method is also referred to as discrete-event simulation.

Discrete-event simulation can generate a time-acceleration effect: if no other events take place between two events, even if they are years apart, the simulator can instantly jump to the subsequent moment on the timeline. This feature enables the simulation of a system's behavior over an extended period in a relatively short amount of time, with the only limitation being the computing power of a single CPU. As a result, this technique is highly effective for I/O-intensive applications but less so for compute-intensive applications.

In conclusion, deterministic simulation boasts two primary advantages: determinism and rapid execution, significantly enhancing our efficiency in identifying and resolving issues. However, despite these benefits, deterministic simulation has yet to gain widespread adoption, primarily due to its significant drawbacks: high intrusiveness and engineering complexity.

Challenges and Opportunities

To ensure that all sources of uncertainty are eliminated, simulators must be constructed for all external environments that the system depends on. This is a significant amount of work, as any missed sources of randomness can undermine the determinism of the entire system. Third-party libraries with internal randomness cannot be used, and all code on all nodes must be executed on a single thread. This can be achieved by designing state machines and event loops or by introducing coroutines, but there are significant challenges in either case.

The FoundationDB team began building their deterministic simulator a decade ago when C++ lacked language-level support for coroutines, and there were no modern system programming languages like Rust available. They created their own Flow language, which patched C++ with an Actor-based, stackless coroutine syntax, and then built a simulator and database on top of it. Due to their close coupling, no other projects have been able to reuse FoundationDB's testing framework.

Now, with the advent of Rust and its built-in coroutine support, it is possible to develop a mature ecosystem for asynchronous programming. However, there has yet to emerge a widely-used and tested deterministic testing framework like Tokio.

While some attempts have been made in the community, such as the Tokio community's "turmoil" project, it is still experimental. Therefore, there is a need for an open and general deterministic testing framework that can be incubated based on the practical needs of RisingWave, which would be valuable for both themselves and the community.

The Madsim Project

Introducing the Madsim project, a deterministic simulator for distributed systems based on Rust's asynchronous programming ecosystem. While similar to an asynchronous runtime like Tokio, Madsim differs in that all behaviors are deterministic, and the environment, such as the network, is simulated.

The internal structure of Madsim is illustrated in the following figure:

An overview of Madsim architecture.

Taking a bottom-up approach to the Madsim architecture, it consists of four modules:

Global Pseudo-Random Number Generator: At the core of deterministic simulation is a global, deterministic random number generator. This generator produces all random factors in the system, including random numbers, delays, and errors introduced by the simulator. With the same random seed, the same execution sequence can be produced.
Timer: The timer serves as the source of all events in the system. Essentially, it is a priority queue sorted by timestamp. During execution, the simulator constantly takes out the nearest timer from the queue, triggers the event, and wakes up the task. The task also creates new timer events while executing, forming an event loop.
Task Scheduler: The task scheduler maintains a queue of all ready tasks. As a runtime for simulating distributed systems, tasks may come from different logical nodes but are all scheduled in a unified queue for execution. The queue is designed to be first-in-random-out (FIRO) to explore potential concurrency issues.
Environment Simulator: Various simulators, mainly network simulators and disk simulators, can be constructed based on the asynchronous execution environment. These simulators use memory storage and communicate through channels within the process, resulting in faster execution than in an actual environment.

To make Madsim accessible for developers, its functions have been wrapped into an API that is completely compatible with Tokio's. Existing projects can use Madsim without modifying any code by only adjusting a few configurations. Madsim intercepts some critical std APIs by overloading libc functions, such as gettimeofday, clock_gettime, getrandom, and sysconf , to provide deterministic time, random numbers, and system configuration in the simulation environment. This prevents uncertainty leaks caused by accessing the real world through std and third-party libraries. For third-party libraries that cannot be made deterministic using this approach, manual patching and the [patch] feature in Cargo.toml are used to achieve code replacement. All of this enables seamless migration and out-of-the-box use of existing projects.

Add the following lines to your Cargo.toml:

[dependencies]
madsim = "0.2"

If your project depends on the following crates, replace them by our simulators:

[dependencies]
tokio = { version = "0.2", package = "madsim-tokio" }
tonic = { version = "0.2", package = "madsim-tonic" }
etcd-client = { version = "0.2", package = "madsim-etcd-client" }
[dev-dependencies]
tonic-build = { version = "0.2", package = “madsim-tonic-build" }

If your dependency graph includes the following crates, replace them by our patched version:

[patch.crates-io]
quanta = { git = "https://github.com/madsim-rs/quanta.git", rev = "a819877" }
getrandom = { git = "“https://github.com/madsim-rs/getrandom.git", rev = "cc95ee3" }
tokio-retry = { git = "https://github.com/madsim-rs/rust-tokio-retry.git", rev = "95e2fd3" }
tokio-postgres = { git = "https://github.com/madsim-rs/rust-postgres.git", rev = "1b392f1" }

To run your code on the simulator, enable the config madsim:

RUSTFLAGS="--cfg madsim" cargo test

As a cloud-native distributed database, RisingWave relies on gRPC, etcd, S3, and Kafka for internal communication, control, storage, and consuming/producing streaming data. These services require messages passing over the network, and without them, RisingWave cannot function as a complete system. Building on the core functionality of madsim, deterministic simulations of various third-party services over the network simulator have been implemented and wrapped in interfaces provided by community libraries such as tonic, etcd-client, rust-rdkafka, and aws-sdk-s3.

The network simulator and supported protocols in Madsim

The network simulator serves as the common foundation for these services. It implements network layer topology based on IP addresses, simulates packet routing, delay, and loss on the virtual network, and provides reliable transmission and datagram transmission as communication primitives. For upper-layer services, they only need to use these primitives to construct an RPC service and implement the semantics of the service interface.

Notably, the simulators for Etcd, Kafka, and S3 are simpler than their real versions, as they only need to be executed sequentially in a single thread without considering concurrency, consensus, or fault tolerance. The messages they send don't even need to be serialized and can be directly transmitted in memory as object pointers.

Madsim provides extra control interfaces at each layer that can be used to control simulation behavior, obtain system information, or manually inject errors, such as blocking or killing nodes and controlling network link connectivity. These API definitions can be found in the madsim API documentation.

Deterministic simulation is a powerful technique for system testing that can improve the stability and reliability of distributed systems. This article discusses the principles and advantages of deterministic simulation, introduces the testing framework Madsim, and explains how it works. The next article will explore its application in RisingWave, along with our experiences and reflections.