Diagnosing Data Skew in RisingWave with Vnode Key Statistics

Data skew in streaming pipelines causes certain workers to process far more data than others, leading to uneven throughput, increased latency, and hard-to-diagnose bottlenecks. RisingWave now supports optional vnode key statistics for materialized views, which give operators direct visibility into per-vnode key distribution. With that visibility they can identify skewed partitions, pinpoint hot vnodes, and take corrective action before skew degrades production performance.

The Data Skew Problem in Streaming

Streaming systems distribute computation across workers by partitioning data. In an ideal world, each worker handles roughly equal load. In practice, certain key values appear far more frequently than others, sending a disproportionate share of events to a small number of workers.

This is data skew. A few workers become hot, their queues grow, and overall pipeline latency climbs even though most workers sit largely idle.

The harder problem is detection. Skew rarely announces itself with a clear error. It shows up as subtle throughput imbalance, occasional lag spikes, or one node's CPU pinned at 100% while its neighbors sit at 20%. Without per-partition statistics, operators are left guessing.

How RisingWave Partitions Streaming Computation

RisingWave uses a concept called vnodes (virtual nodes) to partition streaming computation. When a materialized view is created, RisingWave distributes its state across a fixed number of vnodes, each assigned to a physical worker.

Vnodes are the unit of parallelism inside RisingWave. A streaming fragment (a piece of the dataflow graph) processes data for a specific set of vnodes. The assignment of vnodes to workers determines which machine handles which portion of the keyspace.

When keys are evenly distributed across vnodes, workers share the load fairly. When a popular key hashes to one vnode, or a family of popular keys hashes to a small set of vnodes, those vnodes absorb a disproportionate share of writes, reads, and state changes.
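The key-to-vnode-to-worker mapping described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: CRC32 stands in for whatever hash function RisingWave actually uses, the contiguous-range placement in `worker_of` is hypothetical, and the vnode count of 256 is taken from the default quoted in the FAQ of this article.

```python
import zlib
from collections import Counter

VNODE_COUNT = 256  # default quoted in this article; treated as a constant here


def vnode_of(key: str, vnode_count: int = VNODE_COUNT) -> int:
    """Map a key to a vnode. CRC32 is a stand-in, not RisingWave's hash."""
    return zlib.crc32(key.encode("utf-8")) % vnode_count


def worker_of(vnode: int, worker_count: int, vnode_count: int = VNODE_COUNT) -> int:
    """Assign contiguous vnode ranges to workers (a hypothetical placement)."""
    return vnode * worker_count // vnode_count


def events_per_worker(keys, worker_count: int) -> Counter:
    """Count how many events each worker would receive for a stream of keys."""
    load = Counter()
    for key in keys:
        load[worker_of(vnode_of(key), worker_count)] += 1
    return load


# 10,000 events over 10,000 distinct keys: load should come out roughly even.
uniform = events_per_worker((f"user-{i}" for i in range(10_000)), worker_count=4)

# 10,000 events where 70% carry one key: a single worker absorbs the hot key.
skewed_keys = ["user-0"] * 7_000 + [f"user-{i}" for i in range(1, 3_001)]
skewed = events_per_worker(skewed_keys, worker_count=4)

print("uniform:", sorted(uniform.items()))
print("skewed: ", sorted(skewed.items()))
```

Because every event for `user-0` maps to the same vnode, the worker that owns that vnode receives at least 7,000 of the 10,000 events no matter how many workers are added.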

Vnode Key Statistics: What the Feature Does

Starting with the changes introduced in PRs #25290 and #25320, RisingWave can optionally track per-vnode key distribution statistics for materialized views. When enabled, RisingWave collects counts of how many distinct keys each vnode has processed, making skew directly observable.

This is opt-in behavior. It is set per materialized view, either explicitly when the view is created or through a session or system default that new views inherit. Enabling it adds a small amount of bookkeeping overhead per vnode, which is why it is not on by default.

The stats are particularly valuable in large deployments where a materialized view joins or aggregates over high-cardinality keys, and where uneven throughput across streaming fragments is suspected but not confirmed.

Enabling Vnode Key Stats

To enable vnode key statistics for a materialized view, set the vnode_key_stats option when creating the view:

CREATE MATERIALIZED VIEW order_summary
WITH (vnode_key_stats = true)
AS
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;

If you have an existing materialized view and want to enable stats, you will need to recreate it with the option set. There is no ALTER MATERIALIZED VIEW path for this option today.

To make stats the default for new materialized views instead of setting the option on each one, you can set the configuration at the session or system level. The example below sets it for the current session only:

-- Enable for the current session
SET vnode_key_stats = true;

-- Create the view; it inherits the session setting
CREATE MATERIALIZED VIEW revenue_by_region
AS
SELECT region, SUM(revenue) AS total_revenue FROM transactions GROUP BY region;

Reading the Stats

Once enabled, per-vnode key counts surface through RisingWave's internal metrics and system tables. You can query the distribution to understand how evenly keys are spread.

A healthy distribution shows similar key counts across all vnodes. A skewed distribution shows a small number of vnodes with key counts orders of magnitude higher than the median. That gap tells you which vnodes are hot and gives you a concrete starting point for investigation.
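One way to turn "orders of magnitude higher than the median" into a number is to compare each vnode's key count against the median. The sketch below assumes you have already exported the per-vnode counts into a plain mapping of vnode id to key count. The function name `summarize_skew`, the `hot_factor` threshold of 10, and the sample numbers are all hypothetical and not part of RisingWave.

```python
from statistics import median


def summarize_skew(key_counts: dict[int, int], hot_factor: float = 10.0) -> dict:
    """Summarize a per-vnode key count mapping.

    key_counts maps vnode id -> number of keys. hot_factor is a hypothetical
    threshold: a vnode is flagged when its count exceeds hot_factor * median.
    """
    if not key_counts:
        raise ValueError("key_counts must not be empty")
    counts = list(key_counts.values())
    med = median(counts)
    peak = max(counts)
    # Guard against a zero median (for example, when most vnodes are empty).
    baseline = med if med > 0 else 1
    hot = sorted(
        (v for v, c in key_counts.items() if c > hot_factor * baseline),
        key=lambda v: key_counts[v],
        reverse=True,  # hottest vnode first
    )
    return {
        "median": med,
        "max": peak,
        "max_over_median": peak / baseline,
        "hot_vnodes": hot,
    }


# Made-up numbers: 254 vnodes at 1,000 keys each, two vnodes far above that.
sample = {v: 1_000 for v in range(256)}
sample[17] = 250_000
sample[203] = 40_000

report = summarize_skew(sample)
print(report)
```

With these sample numbers the report flags vnodes 17 and 203, and the max-over-median ratio is 250. The median is used rather than the mean because a handful of hot vnodes would pull the mean upward and mask the gap.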

If you identify a hot vnode, the next step is to look at the keys that land there. Common causes include a dominant tenant in a multi-tenant system, a single popular product or user driving a high volume of events, or an application bug generating duplicate keys at abnormal rates.

Acting on Skew Data

Knowing which vnodes are hot points you toward solutions. The most common remediation strategies are:

Repartition with a composite key. If customer_id is skewed, consider grouping by (customer_id, event_date) or another secondary attribute that spreads load more evenly across the keyspace.

Salt the key. For aggregation use cases, add a random salt to the group-by key in an upstream view, aggregate with the salt, then merge in a second view. This artificially widens the keyspace and distributes load at the cost of a two-stage aggregation.

Revisit the upstream source. Sometimes skew originates in a Kafka topic that was partitioned by a skewed key. Repartitioning the topic or adding RisingWave upstream logic to shuffle keys before they reach the materialized view can break the skew at the source.
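The salting strategy can be checked on paper before it is written as two materialized views. The sketch below models the two stages in plain Python. It is an illustration, not RisingWave code: `SALT_BUCKETS` is a hypothetical fan-out, and the salt is derived deterministically from a per-row id instead of being random, so the example is reproducible and a given row always lands in the same bucket.

```python
from collections import defaultdict

SALT_BUCKETS = 8  # hypothetical fan-out; tune per workload


def salt_of(row_id: int) -> int:
    """Derive a deterministic salt from a per-row id (stand-in for a random salt)."""
    return row_id % SALT_BUCKETS


def stage_one(rows):
    """First stage: partial aggregates keyed by (customer_id, salt)."""
    partial = defaultdict(lambda: [0, 0.0])  # [order_count, total_amount]
    for row_id, customer_id, amount in rows:
        acc = partial[(customer_id, salt_of(row_id))]
        acc[0] += 1
        acc[1] += amount
    return partial


def stage_two(partial):
    """Second stage: merge partials back into one row per customer_id."""
    final = defaultdict(lambda: [0, 0.0])
    for (customer_id, _salt), (count, total) in partial.items():
        final[customer_id][0] += count
        final[customer_id][1] += total
    return {customer: tuple(acc) for customer, acc in final.items()}


# One dominant customer with 1,000 orders plus ten customers with one order each.
rows = [(i, "big-customer", 2.0) for i in range(1_000)]
rows += [(1_000 + i, f"small-{i}", 5.0) for i in range(10)]

partial = stage_one(rows)
result = stage_two(partial)

big_groups = [key for key in partial if key[0] == "big-customer"]
print(len(big_groups), result["big-customer"])
```

The dominant customer's 1,000 rows are split across 8 first-stage groups, each of which can hash to a different vnode, while the second stage only has to merge 8 small partial rows for that customer. This works for aggregates that can be merged from partials, such as COUNT and SUM. Aggregates that cannot be combined this way need a different approach.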

Vnode key stats do not fix skew automatically. They give you the evidence you need to make an informed decision about which of these approaches is appropriate for your workload.

FAQ

Does enabling vnode key stats affect query performance? There is a small overhead for tracking key counts per vnode, but it is designed to be lightweight. For most workloads, the impact is negligible. If you are running an extremely write-heavy pipeline, benchmark with and without the option before enabling it in production.

Can I enable vnode key stats on a running materialized view? Not currently. The option must be set at creation time. You will need to drop and recreate the view to change this setting on an existing materialized view.

How many vnodes does RisingWave use by default? The default vnode count is 256 per table or materialized view. This is configurable at cluster level. More vnodes provide finer-grained distribution but increase metadata overhead.

What if all my vnodes show roughly equal key counts but I still see uneven CPU usage? Key count is one dimension of skew, but row size and computation cost per key also matter. A vnode with fewer keys but very wide rows or expensive aggregation functions can still become a hot spot. Use vnode key stats alongside RisingWave's CPU and memory metrics for a complete picture.

Is this feature available in RisingWave Cloud? Vnode key stats is a core engine feature and will be available in RisingWave Cloud once the underlying release is deployed. Check the RisingWave Cloud release notes for the specific version that includes this support.