From Slow Backfills to Scalable Pipelines: RisingWave’s Performance Optimization

In the world of streaming databases, speed is usually associated with how fast we can process new events. But anyone managing a production data pipeline knows that handling historical data—creating new Materialized Views (MVs) based on massive existing tables—is just as critical.

With the release of RisingWave v2.7.0, we are tackling one of the most resource-intensive challenges in streaming: Backfill Performance.

If you have ever stared at a progress bar while creating a new MV on a large dataset, this update is for you. Here is how we are making backfilling faster, more predictable, and more efficient.

The Pain of Poor Locality

Without optimization, backfilling large datasets often suffers from poor data locality.

When the physical layout of the source data (for example, ordered by time) does not match how a new materialized view needs to process it (such as grouping by user_id), the execution engine is forced into inefficient access patterns.

This leads to:

  1. Excessive random I/O, as data is repeatedly read from scattered locations on storage

  2. High memory consumption, due to low cache reuse and large in-flight intermediate states

  3. Unpredictable backfill time, slowing down deployments and iteration
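The effect of scan order on cache reuse can be illustrated with a toy model. The sketch below is not RisingWave code: it assumes a hypothetical fixed-size LRU cache of per-group state and made-up data (1,000 rows for 100 users), and counts how often a group's state is missing from the cache when rows arrive in time order versus grouped by key.

```python
from collections import OrderedDict

def count_state_misses(keys, cache_size):
    """Count how often a group's state is absent from a fixed-size LRU cache."""
    cache = OrderedDict()
    misses = 0
    for key in keys:
        if key in cache:
            cache.move_to_end(key)         # mark as most recently used
        else:
            misses += 1                    # state would have to be fetched from storage
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used group
    return misses

# Made-up workload: time order cycles through all 100 users, 10 times over.
time_ordered = [i % 100 for i in range(1000)]
# The same rows, read in grouping-key order instead.
key_ordered = sorted(time_ordered)

misses_time = count_state_misses(time_ordered, cache_size=10)  # 1000
misses_key = count_state_misses(key_ordered, cache_size=10)    # 100
```

In this model every time-ordered row misses the cache, while the key-ordered scan misses only once per user. Real workloads are less extreme, but the direction of the effect is the same.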

In RisingWave v2.7.0, we address this problem with two features: Index Selection and Locality Backfill.

The Foundation: Index Selection

Backfilling starts with reading existing data, and the order in which that data is read has a significant impact on everything that follows.

Starting in RisingWave v2.7.0, the optimizer can automatically select the most suitable index during backfill, based on the query’s grouping, join, or partition keys.

Instead of performing a generic table scan, RisingWave chooses an index that aligns with how the data will be processed downstream.

Why index selection matters

Scanning data in a locality-friendly order allows RisingWave to:

  • Improve cache efficiency for the downstream operators

  • Reduce random I/O for the downstream operators

  • Feed aggregations and joins with better-ordered data

  • Speed up backfill execution without changing query logic

All of this happens automatically—no hints or query rewrites required.

Example

CREATE TABLE t (a INT, b INT, c INT);
CREATE INDEX idx_a ON t(a);
CREATE INDEX idx_b ON t(b);

SELECT COUNT(*) FROM t GROUP BY b;

Since the query groups by b, RisingWave automatically uses idx_b to scan the table in grouping order, improving backfill performance.
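The matching rule in this example can be sketched as a toy function. This is an illustration only, not RisingWave's optimizer: `choose_index` is a hypothetical helper, and it assumes an index is suitable when its leading columns are exactly the grouping keys.

```python
def choose_index(indexes, group_keys):
    """Return the name of an index whose leading columns are exactly the
    grouping keys (in any order), or None to fall back to a table scan."""
    wanted = set(group_keys)
    if not wanted:
        return None                       # no grouping keys, nothing to align with
    for name, columns in indexes.items():
        prefix = columns[:len(wanted)]    # leading columns of the index
        if len(prefix) == len(wanted) and set(prefix) == wanted:
            return name
    return None

# Indexes from the example above, as index name -> indexed columns.
indexes = {"idx_a": ["a"], "idx_b": ["b"]}

chosen = choose_index(indexes, ["b"])     # "idx_b"
fallback = choose_index(indexes, ["c"])   # None: no index on c
```

A real optimizer would also weigh join and partition keys and the cost of each candidate. The sketch only shows why idx_b, rather than idx_a, fits a GROUP BY b.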

Index selection is enabled by default and is controlled by the following setting:

enable_index_selection = true

The Extension: Locality Backfill

Index selection optimizes how data is read. But backfills don’t stop at table scans.

Complex materialized views often involve joins, aggregations, window functions, or TopN operators—each of which can introduce costly data reshuffling if locality is lost along the way.

Locality backfill extends locality awareness across the entire backfill pipeline.

When enabled, RisingWave preserves data locality between operators, allowing execution stages that benefit from ordered or partitioned input to run more efficiently.

How it works

With locality backfill enabled, the optimizer automatically inserts LocalityProvider operators into the backfill plan. These operators ensure that downstream stages receive data in an execution-friendly layout, reducing network traffic and intermediate state.

This is especially beneficial for:

  • Multi-join analytical views

  • High-cardinality groupings

  • Group TopN and window functions

  • Large historical backfills where shuffle cost dominates runtime

Note: Locality backfill is a premium feature for complex queries. A query plan that uses more than five LocalityProvider operators is considered complex.

Example

Consider the following materialized view, adapted from a TPC-H–style analytical query:

SET enable_locality_backfill = true;

EXPLAIN
CREATE MATERIALIZED VIEW q18 AS
SELECT
  c_name,
  c_custkey,
  o_orderkey,
  o_orderdate,
  o_totalprice,
  SUM(l_quantity) AS quantity
FROM
  customer,
  orders,
  lineitem
WHERE
  o_orderkey IN (
    SELECT
      l_orderkey
    FROM
      lineitem
    GROUP BY
      l_orderkey
    HAVING
      SUM(l_quantity) > 1
  )
  AND c_custkey = o_custkey
  AND o_orderkey = l_orderkey
GROUP BY
  c_name,
  c_custkey,
  o_orderkey,
  o_orderdate,
  o_totalprice
ORDER BY
  o_totalprice DESC,
  o_orderdate
LIMIT 100;

This query combines several locality-sensitive operations:

  • A subquery aggregation on lineitem (GROUP BY l_orderkey)

  • Multiple joins between customer, orders, and lineitem

  • A global aggregation followed by ORDER BY + LIMIT (TopN)

Without locality backfill, each of these stages can discard the locality established by the stage before it, so operators end up processing data in an order that is far from optimal for them.

What locality backfill changes

With locality backfill enabled, RisingWave tries to carry forward useful physical properties—such as partitioning by join keys or ordering needed by TopN—across the backfill pipeline:

  • Aggregated lineitem data stays colocated by l_orderkey, reducing join reshuffles

  • The final TopN can operate on more localized, partially ordered inputs
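Why colocation reduces reshuffling can be shown with a small counting model. The sketch below is an assumption-laden illustration, not RisingWave's distribution scheme: it uses simple modulo partitioning across a hypothetical four workers and made-up lineitem rows, and counts how many rows would have to move to another worker before an operator keyed on l_orderkey can process them.

```python
def rows_to_reshuffle(rows, current_key, required_key, n_workers):
    """Count rows whose worker changes when data partitioned by current_key
    must be repartitioned by required_key (modulo partitioning assumed)."""
    moved = 0
    for row in rows:
        if row[current_key] % n_workers != row[required_key] % n_workers:
            moved += 1
    return moved

# Made-up lineitem rows; the key values are illustrative only.
rows = [{"l_orderkey": k, "l_partkey": k * 7 + 3} for k in range(1000)]

# Data already partitioned by the key the aggregation and join need.
colocated = rows_to_reshuffle(rows, "l_orderkey", "l_orderkey", n_workers=4)  # 0

# Data partitioned by an unrelated column must be moved first.
mismatched = rows_to_reshuffle(rows, "l_partkey", "l_orderkey", n_workers=4)
```

When the upstream layout already matches the downstream key, no rows cross worker boundaries. Any mismatch turns into network traffic and extra buffering.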

The result is:

  • Lower memory overhead during backfill

  • Faster and more predictable backfill times for complex materialized views

-- Enable locality backfill for the current session
SET enable_locality_backfill=true;

-- Create your complex MV
CREATE MATERIALIZED VIEW complex_mv AS ...

Locality backfill is especially effective for join-heavy, aggregation-heavy analytical MVs—exactly the kind of queries where backfill used to be the most expensive.

What this unlocks

Together, index selection and locality backfill transform how RisingWave handles historical data:

  • Materialized views are created faster and more predictably

  • Large backfills scale better with data size and query complexity

  • Advanced analytical pipelines become easier to build and evolve

  • Backfills stop being a bottleneck for iteration and experimentation

Instead of treating backfill as a necessary cost, RisingWave now treats it as a first-class optimization target.

Ready to Accelerate Your Pipeline?

Backfilling historical data shouldn't be the reason your project stalls. With index selection and locality backfill, RisingWave v2.7.0 makes backfills faster and more predictable.
