Background


In the article "Hummock: A Storage Engine Designed for Stream Processing", we introduced Hummock's design philosophy and basic architecture. This article discusses the significant improvements and optimizations made to Hummock in recent versions.


Fast Compaction


Initially, Hummock's data file format was inspired by the design of RocksDB's SSTable. Later, to improve query performance, Hummock replaced the original Bloom filter with an xor filter (RocksDB similarly introduced the Ribbon filter, which builds on ideas related to the xor filter). However, an xor filter temporarily consumes a large amount of memory during construction. To reduce this memory usage, Hummock revamped its data storage format to build one filter per block instead of a single global filter per file. This lowered peak memory usage, enabling RisingWave to run in environments with lower specifications.
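To illustrate the idea (this is a minimal sketch, not Hummock's actual code), the snippet below builds a filter per block as an SSTable is written, using the open-source `xorf` crate; the `SstableBuilder`, `Block`, and `block_capacity` names are hypothetical. Filter-construction memory is bounded by one block's key hashes rather than the whole file's.

```rust
use xorf::{Filter, Xor8};

struct Block {
    data: Vec<u8>, // encoded key-value payload (encoding details omitted)
    filter: Xor8,  // per-block filter, built from this block's keys only
}

impl Block {
    /// A reader can skip fetching/decoding this block if the filter rules the key out.
    fn may_contain(&self, key: &[u8]) -> bool {
        self.filter.contains(&hash(key))
    }
}

struct SstableBuilder {
    blocks: Vec<Block>,
    pending_hashes: Vec<u64>, // key hashes of the block under construction
    pending_data: Vec<u8>,
    block_capacity: usize,
}

impl SstableBuilder {
    fn add(&mut self, key: &[u8], value: &[u8]) {
        self.pending_hashes.push(hash(key));
        self.pending_data.extend_from_slice(key); // sketch: real blocks encode lengths too
        self.pending_data.extend_from_slice(value);
        if self.pending_data.len() >= self.block_capacity {
            self.seal_block();
        }
    }

    /// Peak memory is bounded by one block's key hashes, not the whole file's.
    fn seal_block(&mut self) {
        // Keys within a block are unique, so xor filter construction is safe.
        let hashes = std::mem::take(&mut self.pending_hashes);
        self.blocks.push(Block {
            data: std::mem::take(&mut self.pending_data),
            filter: Xor8::from(&hashes),
        });
    }
}

fn hash(key: &[u8]) -> u64 {
    use std::hash::{Hash, Hasher};
    let mut h = std::collections::hash_map::DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}
```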

In comparison to BTree-like data structures, an LSM tree achieves higher write throughput by relying on sequential writes and background reorganization. However, this comes at the cost of write amplification: the same data may be compacted repeatedly. For a more detailed discussion, see the RocksDB Tuning Guide.

For data at L1 (Level 1, as described in the RocksDB Tuning Guide) and below, moving a file from an upper level down often means it overlaps with multiple files at the lower level. A single compaction task must therefore include more than one file, which leads to write amplification. However, even when an upper-level file overlaps several lower-level files, it does not follow that every block of the upper-level file overlaps a block at the lower level; that would only hold if writes were strictly uniformly distributed across the key range.

Therefore, if our compaction algorithm operates at block granularity instead of key granularity, blocks that do not overlap with any other file in the task can skip the overhead of decompression and recompression and be copied directly into the new file.

Below is the traditional compaction algorithm:

[Figure: Traditional compaction algorithm]

Here is our improved algorithm:

[Figure: Improved compaction algorithm]

Compared with the traditional algorithm, it is clear that in the improved algorithm, blocks 1 and 3 of SST2, whose keys do not overlap with the data in SST1, can be copied directly to the output file, avoiding the extra decoding and compression costs.
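Conceptually, the improved algorithm only needs each block's key range to choose between the two paths. Below is a minimal sketch under assumed types (`RawBlock`, `BlockMeta`, and `Output` are illustrative stand-ins, not Hummock's real interfaces):

```rust
struct BlockMeta {
    smallest_key: Vec<u8>,
    largest_key: Vec<u8>,
}

struct RawBlock {
    meta: BlockMeta,
    compressed: Vec<u8>, // on-disk bytes, still compressed
}

trait Output {
    /// Classic key-by-key merge: must decompress, merge, and re-encode.
    fn merge_block(&mut self, block: RawBlock);
    /// Append the compressed bytes verbatim, skipping (de)compression.
    fn copy_raw(&mut self, block: RawBlock);
}

/// Does block `a` overlap the key range of any block in `others`?
fn overlaps(a: &BlockMeta, others: &[BlockMeta]) -> bool {
    others
        .iter()
        .any(|b| a.smallest_key <= b.largest_key && b.smallest_key <= a.largest_key)
}

/// Compact an upper-level file's blocks against the block metadata of the
/// lower-level files included in the same task.
fn compact(upper: Vec<RawBlock>, lower_metas: &[BlockMeta], out: &mut dyn Output) {
    for block in upper {
        if overlaps(&block.meta, lower_metas) {
            out.merge_block(block); // overlapping keys: fall back to merging
        } else {
            out.copy_raw(block); // no overlap: direct copy
        }
    }
}
```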

Our tests across various read and write workloads showed that this optimization saves 20% to 50% of compactor CPU resources. For write-heavy, high-throughput scenarios, it significantly reduces the cost of compaction. This feature is currently still under testing and is not yet enabled by default.


IO Prefetch


Like RocksDB, Hummock organizes its stored data at the block level. When a query reads data, it fetches blocks one by one; if a block is not in the cache, an IO operation reads it from remote storage (e.g., S3 in AWS environments). Developers familiar with S3 know that its random-read latency is extremely high, making it better suited to sequential reads of large objects. As a result, when users run OLAP-style queries (e.g., select count(*) from mv;), Hummock has to issue IO operations one by one, significantly increasing query response times.

A similar scenario occurs when a user creates another materialized view on a table whose data has already been imported: all historical data must be traversed, so the operation is limited by the same block-by-block read pattern.

Therefore, for batch queries in RisingWave, when the underlying storage finds that the current block is not in the cache, it reads multiple blocks within the current range at once. This maximizes utilization of S3's bandwidth and also reduces IO cost: S3 charges per request, and a single ranged (streaming) read counts as one request. In simple tests, this optimization sped up some OLAP queries by as much as 8 times. That is an extreme case, however; many queries are bottlenecked on complex computation, so the actual improvement depends on the query type and business scenario.
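As an illustration of the idea (with hypothetical types and names, not Hummock's actual interfaces), the sketch below fetches a run of consecutive blocks with one ranged read on a cache miss; `PREFETCH_BLOCKS` is an assumed tuning knob:

```rust
use std::collections::HashMap;
use std::ops::Range;

const PREFETCH_BLOCKS: usize = 16; // assumed knob, not a real Hummock constant

struct BlockMeta { offset: usize, len: usize }
struct SstableMeta { id: u64, block_metas: Vec<BlockMeta> }

trait ObjectStore {
    /// One ranged GET against remote storage; S3 bills this as a single request.
    fn read_range(&self, object_id: u64, range: Range<usize>) -> Vec<u8>;
}

#[derive(Default)]
struct BlockCache { blocks: HashMap<(u64, usize), Vec<u8>> }

impl BlockCache {
    fn get(&self, sst: u64, idx: usize) -> Option<Vec<u8>> {
        self.blocks.get(&(sst, idx)).cloned()
    }
    fn insert(&mut self, sst: u64, idx: usize, data: Vec<u8>) {
        self.blocks.insert((sst, idx), data);
    }
}

/// On a cache miss during a batch scan, fetch the current block and the next
/// few in one ranged read instead of issuing one request per block.
fn read_block_prefetch(
    cache: &mut BlockCache,
    store: &dyn ObjectStore,
    sst: &SstableMeta,
    idx: usize,
) -> Vec<u8> {
    if let Some(block) = cache.get(sst.id, idx) {
        return block;
    }
    let end = (idx + PREFETCH_BLOCKS).min(sst.block_metas.len());
    let start_off = sst.block_metas[idx].offset;
    let last = &sst.block_metas[end - 1];
    let bytes = store.read_range(sst.id, start_off..last.offset + last.len);
    // Populate the cache with every block covered by the single request.
    for i in idx..end {
        let m = &sst.block_metas[i];
        let rel = m.offset - start_off;
        cache.insert(sst.id, i, bytes[rel..rel + m.len].to_vec());
    }
    cache.get(sst.id, idx).unwrap()
}
```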

Conclusion

In addition to query performance, cost is a key concern for databases. For storage backends like AWS S3, beyond the monthly charge for the volume of stored data, IO operations are billed per request. Therefore, besides raw performance, an important direction in optimizing the storage system is how to exploit S3's high-throughput, high-latency characteristics to improve efficiency and reduce cost. In the future, we will choose IO strategies more precisely based on the query plan to further improve processing speed, and we will use local disks as a secondary cache to increase the cache hit rate and reduce the frequency of access to S3.


Wei Liu

Software Engineer
