RisingWave Labs Blog

We all know the appeal of cloud object storage like Amazon S3: durable, scalable, and seemingly cost-effective. Many modern data systems, including our own RisingWave, leverage S3 as primary storage for persisting state, tables, and materialized views. But as many discover, S3 is cheap... until you start reading from it frequently.

The hidden costs of relying solely on object storage for active data processing are latency and access charges. Fetching data from S3 can introduce delays that hurt real-time performance, and AWS charges for every read operation (GET request). For high-throughput, real-time systems like RisingWave, these delays and charges add up quickly, and they show up in both your performance and your cloud bill.

RisingWave already uses local memory as a high-speed cache. But memory is finite and expensive. What happens when your streaming job's working set – the data needed for complex joins, aggregations, or windowing – exceeds available RAM? Or when frequent queries hit tables and MVs whose data isn't memory-resident?

You trigger a cache miss. The system has to retrieve data from S3, introducing performance-killing latency and racking up those access costs.

Smart Caching with Elastic Disk Cache in RisingWave v2.3

To bridge the gap between fast-but-limited memory and cheap-but-slow S3, we're thrilled to announce that Elastic Disk Cache is now available with the release of RisingWave v2.3!

Elastic Disk Cache is a powerful innovation that introduces a new caching tier using local disk storage (like high-performance NVMe SSDs or AWS EBS volumes) right where your compute nodes run.

Think of it as extending your cache capacity significantly. Data that's frequently accessed but doesn't fit into memory can now reside on fast local disk instead of requiring a slow and costly round trip to S3. This creates an intelligent tiered storage architecture: Memory -> Disk -> S3.
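To make the Memory -> Disk -> S3 lookup order concrete, here is a minimal sketch of a read-through tiered cache. It is a toy model, not RisingWave's implementation: the `TieredCache` class, the slot counts, the LRU policy, and the dictionaries standing in for local disk and S3 are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredCache:
    """Toy read-through cache: memory -> disk -> object store (hypothetical)."""

    def __init__(self, object_store, memory_slots, disk_slots):
        self.object_store = object_store   # dict standing in for S3
        self.memory = OrderedDict()        # fastest, smallest tier
        self.disk = OrderedDict()          # dict standing in for local NVMe/EBS
        self.memory_slots = memory_slots
        self.disk_slots = disk_slots
        self.object_store_gets = 0         # each one would be a billed GET

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)   # mark as most recently used
            return self.memory[key]
        if key in self.disk:
            value = self.disk.pop(key)     # disk hit: promote to memory
        else:
            value = self.object_store[key] # miss in both tiers: remote read
            self.object_store_gets += 1
        self._put_memory(key, value)
        return value

    def _put_memory(self, key, value):
        self.memory[key] = value
        if len(self.memory) > self.memory_slots:
            old_key, old_value = self.memory.popitem(last=False)
            self._put_disk(old_key, old_value)  # demote instead of dropping

    def _put_disk(self, key, value):
        self.disk[key] = value
        if len(self.disk) > self.disk_slots:
            self.disk.popitem(last=False)  # evicted data must be re-read remotely


def count_gets(disk_slots):
    """Scan a 6-block working set three times and count remote reads."""
    store = {f"block-{i}": i for i in range(6)}
    cache = TieredCache(store, memory_slots=2, disk_slots=disk_slots)
    for _ in range(3):
        for key in store:
            cache.get(key)
    return cache.object_store_gets


print("remote GETs without disk tier:", count_gets(disk_slots=0))
print("remote GETs with disk tier:   ", count_gets(disk_slots=4))
```

In this toy run the working set (6 blocks) is larger than memory (2 slots), so without a disk tier every access goes to the object store. Once the disk tier is large enough to hold the overflow, only the first scan pays for remote reads.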

See the Difference: Real-World Benchmark Results

To quantify the benefits, we put Elastic Disk Cache through its paces in a demanding benchmark scenario, comparing RisingWave's performance and S3 usage with and without the cache enabled. The difference was dramatic, especially in reducing the load on S3.

Cost

Without Elastic Disk Cache:

Over 48 hours, the S3 GET cost was $24.60, and the S3 PUT cost was $1.88.

With Elastic Disk Cache:

Over 48 hours, the S3 GET cost was $6.14, and the S3 PUT cost was $2.22.

Enabling Elastic Disk Cache resulted in a 75% reduction in S3 GET costs over the 48-hour period.
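The arithmetic behind that figure is easy to check. The short sketch below recomputes it from the dollar amounts reported above; the `percent_reduction` helper and the variable names are ours, and the combined GET + PUT figure is derived from those same numbers rather than reported separately in the benchmark.

```python
def percent_reduction(before, after):
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# 48-hour S3 request costs in USD, taken from the benchmark results above.
get_without, put_without = 24.60, 1.88
get_with, put_with = 6.14, 2.22

get_saving = percent_reduction(get_without, get_with)
total_saving = percent_reduction(get_without + put_without,
                                 get_with + put_with)

print(f"GET cost reduction:       {get_saving:.1f}%")
print(f"GET + PUT cost reduction: {total_saving:.1f}%")
```

PUT costs were slightly higher with the cache enabled ($2.22 versus $1.88), so the combined request cost falls by about 68% rather than the 75% seen for GETs alone.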

Read performance

Without Elastic Disk Cache:

Over a 48-hour period, S3 reads averaged 120 IOPS/node, and the average S3 read throughput was 360 MB/s. The S3 read P99 latency gradually increased to over 5 seconds.

With Elastic Disk Cache:

Enabling Elastic Disk Cache over a 48-hour period demonstrated significant improvements:

- Average S3 read IOPS decreased to 30 IOPS/node, a 75% reduction, indicating that most reads were served from the cache.
- Average S3 read throughput dropped to 20 MB/s, a 94.4% reduction.
- S3 read P99 latency stabilized at around 900ms, an improvement of over 82% compared to the >5s latency observed without the cache.
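For readers who want to verify these percentages, the sketch below recomputes them from the before/after values reported in this section. The dictionary layout and names are illustrative. The latency baseline is taken as exactly 5,000 ms, which is an assumption: the benchmark only reports "over 5 seconds", so the real improvement is at least this large.

```python
# (without cache, with cache) pairs from the read-performance results above.
read_metrics = {
    "S3 read IOPS per node": (120, 30),
    "S3 read throughput (MB/s)": (360, 20),
    "S3 read P99 latency (ms)": (5000, 900),  # baseline is a lower bound (>5 s)
}

reductions = {
    name: (before - after) / before * 100
    for name, (before, after) in read_metrics.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower with Elastic Disk Cache")
```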

Freshness

Without Elastic Disk Cache:

Average barrier latency progressively increased, and there was a growing accumulation of barriers. This resulted in deteriorating freshness.

With Elastic Disk Cache:

Both average barrier latency and barrier accumulation remained stable. Consequently, freshness stabilized at a consistently low (and therefore desirable) level.

Benchmark Note: Interpreting Barrier Latency
The ~30-second barrier latency shown with Elastic Disk Cache reflects a demanding I/O stress test. In such scenarios, systems often experience escalating latency due to frequent main-memory cache misses, forcing slow reads from object storage. This benchmark highlights how Elastic Disk Cache effectively mitigates this I/O bottleneck by serving reads from faster local disk, thus maintaining stable performance under pressure. Crucially, for typical operational workloads, RisingWave consistently achieves sub-second barrier latency, ensuring optimal data freshness.

Resource utilization

Without Elastic Disk Cache:

Around the 30-hour mark, the working set exceeded the capacity of the memory cache. Consequently, the workload began to shift from CPU-bound to I/O-bound, and CPU utilization started to decline significantly.

With Elastic Disk Cache enabled:

Throughout the 48-hour test, the working set consistently remained within the combined capacity of the memory cache and the Elastic Disk Cache. As a result, the workload remained CPU-bound, and the CPUs were fully utilized.

Summary

Enabling Elastic Disk Cache demonstrated substantial and multifaceted improvements:

Drastically Reduced S3 Read Load:
- S3 read throughput (data volume from S3) decreased by approximately 94.4%.
- S3 read IOPS (GET requests to S3) were reduced by around 75%.
Significantly Improved Read Latency:
- S3 read P99 latency stabilized around 900ms, an improvement of over 82% from the >5 seconds observed without the cache.
Substantial Cost Savings:
- S3 GET costs were reduced by around 75% over the 48-hour test period.
Enhanced System Stability and Efficiency:
- Data freshness remained consistently good, with stable barrier latency and no excessive barrier accumulation.
- The workload remained CPU-bound, preventing the system from becoming I/O-bound and ensuring efficient resource utilization.

How We Tested This

This benchmark was conducted under the following conditions to simulate a continuous, high-throughput streaming analytics environment:

RisingWave Version: 2.3
Compute Node:
- Three compute nodes were used.
- Each node was configured with 8 CPU cores and 16GB RAM (8c16g).
Workload – NEXMARK Queries:
- We ran a comprehensive suite of 26 distinct NEXMARK queries:
  
  q0, q1, q2, q3, q4, q5, q6-group-top1, q7, q8, q9, q10, q12, q14, q15, q16, q17, q18, q19, q20, q21, q22, q101, q102, q103, q104, q105
- Each query created 8 materialized views, resulting in a total of 208 materialized views (26 queries x 8 MVs/query).
- All 208 MVs were continuously running and updating after creation.
Event Ingestion Rate: The system processed incoming data at a sustained rate of 10,000 events per second.
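To give a sense of scale, the setup above can be turned into a few totals. This is a back-of-the-envelope sketch: the constant names are ours, and the event total assumes the 10,000 events/second rate held for the full 48-hour run.

```python
# NEXMARK queries listed in the benchmark setup above.
QUERIES = [
    "q0", "q1", "q2", "q3", "q4", "q5", "q6-group-top1", "q7", "q8", "q9",
    "q10", "q12", "q14", "q15", "q16", "q17", "q18", "q19", "q20", "q21",
    "q22", "q101", "q102", "q103", "q104", "q105",
]
MVS_PER_QUERY = 8
COMPUTE_NODES = 3
CORES_PER_NODE = 8
RAM_GB_PER_NODE = 16
EVENTS_PER_SECOND = 10_000
TEST_HOURS = 48  # assumption: ingestion rate was constant for the whole run

total_mvs = len(QUERIES) * MVS_PER_QUERY
total_cores = COMPUTE_NODES * CORES_PER_NODE
total_ram_gb = COMPUTE_NODES * RAM_GB_PER_NODE
total_events = EVENTS_PER_SECOND * TEST_HOURS * 3600

print(f"{len(QUERIES)} queries x {MVS_PER_QUERY} MVs = {total_mvs} materialized views")
print(f"cluster: {total_cores} cores, {total_ram_gb} GB RAM")
print(f"events ingested over {TEST_HOURS} h: {total_events:,}")
```

Under that assumption, the cluster's 24 cores and 48 GB of RAM kept 208 materialized views up to date across roughly 1.7 billion ingested events.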

Why Elastic Disk Cache is a Game-Changer

By intelligently managing data across these tiers, Elastic Disk Cache delivers substantial benefits, especially for performance-sensitive workloads:

Drastic S3 Cost Reduction: As the benchmarks show, by serving most reads from the local disk cache, you dramatically reduce the number of GET requests and the volume of data read from S3. This directly translates to significantly lower S3 bills: in the benchmark above, S3 GET costs fell by about 75%.
Massively Improved Performance: Avoid the S3 latency penalty! Accessing local SSDs/EBS is orders of magnitude faster than fetching from S3. Fewer S3 operations mean lower overall latency for:
- Large-state streaming jobs: Complex operations involving joins, time windows, and aggregations remain fast even when the state size is large.
- Data Serving: Queries against RisingWave tables and materialized views return results much quicker when data is served from the disk cache instead of S3.
Faster, Smoother Scaling: When you add new compute nodes, warming up the cache is critical. Pulling potentially terabytes of state from S3 can be slow and costly. Elastic Disk Cache allows new nodes to quickly populate their cache from local disk, significantly reducing scale-up time and minimizing performance dips during scaling.
Accelerated Failure Recovery: Similar to scaling, nodes recovering from failure can restore their state much faster by leveraging the local disk cache, reducing downtime.
Enhanced Performance in Rate-Limited Environments: If you operate in an environment with S3 rate limits (e.g., private cloud, or simply hitting API limits), the disk cache significantly reduces your dependency on S3, ensuring more consistent performance and smoother scaling without hitting bottlenecks.

Who Needs Elastic Disk Cache?

This feature is particularly impactful if you are:

Running complex streaming jobs with large state sizes.
Using RisingWave to serve low-latency queries on materialized views or tables.
Experiencing high S3 costs due to frequent data access.
Needing faster scaling or recovery times.
Operating in environments where object storage access might be throttled.

How to Use Elastic Disk Cache

Elastic Disk Cache is enabled by default on RisingWave Cloud deployments running v2.3 or later. No configuration is needed. Self-hosted deployments require RisingWave v2.3 or later and a Premium Edition license. You can try Premium features if your deployment contains no more than 4 cores.

To enable and configure Elastic Disk Cache on your self-hosted compute nodes, you’ll need to plan cache resources and modify the node-specific configuration file. For details, see the documentation topic.

Get Started Today!

Ready to stop letting S3 latency and access costs bottleneck your stream processing? Upgrade to RisingWave v2.3 and explore how Elastic Disk Cache can enhance your performance and reduce your cloud spend.

Check out our documentation for detailed instructions on enabling and configuring Elastic Disk Cache: Elastic disk cache.

Start caching smart with RisingWave!
