By the RisingWave Engineering Team
Introduction
Every e-commerce enrichment pipeline joins a high-volume event stream against dimension tables: product catalog, pricing tiers, user segments, geographic data. The event stream changes by the second. The dimension tables change maybe once a day, or once a week.
A common pattern is to stream the dimension data via CDC alongside everything else. It works, but it is expensive. You are paying the full cost of a streaming join state for data that barely moves.
This guide covers when to replace CDC-streamed dimension tables with Apache Iceberg snapshot lookups in RisingWave, how to set them up, how to trigger refreshes from your orchestration layer, and how to reason through the staleness tradeoff for each dimension type.
The short answer: if your dimension changes at most a few times per day, the Iceberg snapshot approach is almost always the right call.
The Core Decision: CDC Stream or Iceberg Snapshot?
Before choosing, ask one question: how much does it cost your business if this dimension value is 4 hours stale?
For product categories and pricing tiers, the answer is usually "not much." A product that launched this morning will have a NULL category in enrichment results until the next refresh. That is a minor gap in analytics data. For fraud rules or compliance flags that change throughout the day, stale values carry real risk.
The decision tree:
```mermaid
flowchart TD
    A[Dimension table update frequency?] --> B{Multiple times per hour}
    A --> C{A few times per day or less}
    B --> D[CDC stream - staleness too costly]
    C --> E{Stale values have business-critical impact?}
    E -->|Yes - fraud rules, compliance| D
    E -->|No - product catalog, segments| F[Iceberg snapshot - preferred]
    F --> G[Schedule refresh after upstream pipeline completes]
```
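The same decision tree can be encoded as a small helper function. This is purely illustrative — the function name and thresholds are ours, not part of any RisingWave API:

```python
def choose_dimension_strategy(changes_per_day: float, staleness_is_critical: bool) -> str:
    """Encode the decision tree above (illustrative thresholds).

    changes_per_day: how often the dimension table is updated.
    staleness_is_critical: True for fraud rules, compliance flags, etc.
    """
    # Multiple updates per hour: snapshot staleness compounds too fast.
    if changes_per_day > 24:
        return "cdc_stream"
    # Even a slowly changing table needs CDC if stale values carry real risk.
    if staleness_is_critical:
        return "cdc_stream"
    # A few changes per day or less, low staleness cost: snapshot wins.
    return "iceberg_snapshot"

# Product catalog: rebuilt nightly, analytics-only impact.
print(choose_dimension_strategy(1, False))   # iceberg_snapshot
# Fraud rules: updated throughout the day, high impact.
print(choose_dimension_strategy(10, True))   # cdc_stream
```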
Side-by-side comparison
| Concern | Iceberg Snapshot | CDC Stream |
| --- | --- | --- |
| Cluster cost | Low - table loaded once, used as lookup | High - every dimension change propagates through all downstream joins |
| State size | Bounded to dimension table size | Requires full hash-join state on both sides |
| Freshness | Stale until next refresh | Near-real-time |
| Operational complexity | Simple - one ALTER TABLE ... REFRESH call | Needs its own source, connector, and error handling |
| Best fit | Product catalog, pricing tiers, user cohort definitions | Fraud rules, compliance flags, small tables with frequent changes |
Setting Up an Iceberg Dimension Table in RisingWave
RisingWave supports Iceberg as a source connector. You create the table once, and it stays static until you explicitly trigger a reload. There is no continuous polling, no replication slot, and no ongoing resource consumption beyond the initial load.
Here is a product catalog table backed by an Iceberg snapshot in AWS Glue:
```sql
-- Create an Iceberg-backed dimension table
-- No refresh_interval_sec - this table refreshes on demand only
CREATE TABLE product_catalog (
    product_id BIGINT,
    category VARCHAR,
    brand VARCHAR,
    supplier_id BIGINT,
    price_usd DECIMAL,
    PRIMARY KEY (product_id)
) WITH (
    connector = 'iceberg',
    catalog.type = 'glue',
    database.name = 'prod_catalog',
    table.name = 'products',
    refresh_mode = 'FULL_RELOAD'
);
```
For a REST catalog, replace `catalog.type = 'glue'` with `catalog.type = 'rest'` and add `catalog.uri = 'http://iceberg-catalog:8181'`. See the RisingWave Iceberg source documentation for the full list of supported catalog types and connection options.
Using the dimension table in a temporal join
Once the table is loaded, you can use it in a temporal join to enrich a live event stream:
```sql
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT
    o.order_id,
    o.customer_id,
    o.quantity,
    o.order_time,
    p.category,
    p.brand,
    p.price_usd
FROM orders o
JOIN product_catalog FOR SYSTEM_TIME AS OF PROCTIME() p
    ON o.product_id = p.product_id;
```
The FOR SYSTEM_TIME AS OF PROCTIME() clause tells RisingWave to look up the current state of product_catalog at the moment each order event is processed. RisingWave only maintains state for the streaming side (orders). The catalog side is read as a point-in-time lookup, which is why state cost is bounded regardless of order volume.
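That state asymmetry can be illustrated outside SQL. Here is a rough Python model of the lookup semantics (our own sketch, not RisingWave internals): the dimension side is just a table read at processing time, and only the stream side flows through the join.

```python
# Rough model of a temporal join: the dimension side is a plain lookup
# table; each stream event reads whatever it holds at that moment.
catalog = {101: {"category": "shoes", "brand": "Acme", "price_usd": 59.0}}

def enrich(order: dict) -> dict:
    # FOR SYSTEM_TIME AS OF PROCTIME(): read the catalog as it is *now*.
    dim = catalog.get(order["product_id"], {})
    return {**order,
            "category": dim.get("category"),   # None (NULL-like) when the
            "brand": dim.get("brand")}         # product is not loaded yet

print(enrich({"order_id": 1, "product_id": 101}))
print(enrich({"order_id": 2, "product_id": 999}))  # unknown product -> None fields
```

No per-event state accumulates for the catalog side, which mirrors why the join's state cost stays bounded regardless of order volume.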
Why Polling Intervals Are Usually Wrong
The most common mistake is setting a short refresh_interval_sec on the Iceberg source:
```sql
-- Do NOT do this for slowly-changing dimensions
refresh_interval_sec = 600  -- every 10 minutes
```
Every full reload does the following in sequence: rescans the Iceberg table from object storage, rebuilds the entire table state inside RisingWave, triggers downstream updates in all temporal joins backed by this table, and competes with your streaming workloads for cluster resources.
For a product catalog that changes once a day, you are doing this 144 times per day for no reason.
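The waste is simple arithmetic — with a fixed polling interval, the reload count per day is constant no matter how often the data actually changes:

```python
SECONDS_PER_DAY = 86_400
refresh_interval_sec = 600              # the 10-minute polling interval above

reloads_per_day = SECONDS_PER_DAY // refresh_interval_sec
useful_reloads = 1                      # the catalog changes once a day
print(reloads_per_day)                  # 144
print(reloads_per_day - useful_reloads) # 143 full reloads that change nothing
```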
The right approach is to refresh on demand, triggered by your orchestration layer immediately after the upstream process that updates the Iceberg table completes.
Triggering Refreshes from Your Orchestration Layer
After your upstream pipeline completes a new catalog snapshot, fire one SQL statement:
```sql
-- Run this after the upstream rebuild finishes
ALTER TABLE product_catalog REFRESH;
```
This works from any orchestration system. In Airflow, wire it as a PostgresOperator task that runs immediately after the Iceberg snapshot job completes:
```python
rebuild_catalog_iceberg >> refresh_product_catalog >> downstream_tasks
```
If your dimension tables are managed by dbt, a dbt run-operation refresh_iceberg_tables macro that calls this statement for each table achieves the same result. The principle is the same regardless of tool: upstream pipeline writes new Iceberg snapshot, orchestration signals RisingWave to reload, enrichment immediately uses the updated values.
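Whatever the orchestrator, the refresh task reduces to running one statement per table against RisingWave's Postgres-compatible endpoint. A minimal sketch with the database call injected, so the same function works with a psycopg2 cursor, an Airflow hook, or a dbt adapter (the function name is ours):

```python
def refresh_iceberg_tables(execute, tables):
    """Issue one on-demand reload per dimension table.

    execute: any callable that runs a SQL string against RisingWave,
             e.g. cursor.execute from psycopg2 or an Airflow hook's run().
    """
    for table in tables:
        # Naive quoting for the sketch; use proper identifier
        # quoting/parameterization in production.
        execute(f'ALTER TABLE "{table}" REFRESH;')

# Example with a stand-in executor that just records the statements:
sent = []
refresh_iceberg_tables(sent.append, ["product_catalog", "pricing_tiers"])
print(sent[0])  # ALTER TABLE "product_catalog" REFRESH;
```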
What Staleness Actually Means in Practice
Staleness is not an abstract concern. Its impact is specific to each dimension type.
Product catalog
A new product added to the catalog will produce NULL values for category and brand in enriched orders until the next refresh. This affects only newly created products and only for the gap between when they were added and when REFRESH runs. For analytics, this is acceptable: a nightly reconciliation batch using a fresh snapshot corrects the historical record. For a real-time recommendation engine that must categorize new products the moment they launch, use CDC.
Pricing tiers
A price change will not propagate to enriched events until REFRESH runs. If pricing tiers are updated once a day as part of a scheduled repricing job, the risk is low: trigger a refresh immediately after the job completes and the stale window is minutes. If pricing changes multiple times per hour, use CDC.
User segments
Segment redefinition (moving a user from "casual" to "loyal" based on a weekly cohort job) is a perfect fit for Iceberg snapshots. The assignments change once a week, RisingWave refreshes once, and every subsequent event picks up the new segment. For real-time personalization where membership changes based on live behavior, the snapshot approach is too slow.
Summary by dimension type
| Dimension | Change frequency | Staleness impact | Recommendation |
| --- | --- | --- | --- |
| Product categories | Rarely (new products) | Low - analytics gap only | Iceberg snapshot |
| Pricing tiers (scheduled) | Once per day | Low - trigger refresh post-job | Iceberg snapshot |
| Pricing tiers (dynamic) | Multiple per hour | Medium - depends on pipeline use | CDC |
| User cohort segments | Weekly | Low - acceptable for analytics | Iceberg snapshot |
| Geographic/lookup tables | Rarely | Very low | Iceberg snapshot |
| Fraud rules | Throughout the day | High - directly affects detection | CDC |
| Compliance flags | Throughout the day | High - regulatory exposure | CDC |
When CDC Is Worth the Cost
CDC streaming for dimension tables makes sense under three conditions.
First, the dimension changes frequently. If a table is updated multiple times per hour, snapshot staleness compounds quickly and a polling-based refresh still costs more than it saves. A CDC stream keeps pace without periodic full reloads.
Second, stale values have a significant business impact. Fraud rule updates, compliance flag changes, or security policy tables fall here. If an event processed against a 2-hour-old fraud model creates a false negative, the cost of that mistake exceeds the ongoing cost of the CDC stream.
Third, the dimension table is small. A small table means the state overhead for a full hash-join is manageable. For a 500-row fraud rules table, CDC state is negligible and the freshness benefit is clear.
When none of these conditions hold, the CDC stream is overhead, not infrastructure.
The Reconciliation Strategy for Analytics Pipelines
Most analytics pipelines can tolerate some inaccuracy in real time as long as the historical record is correct. The standard pattern:
- Streaming layer (RisingWave + Iceberg snapshot): enriches events in near-real-time with periodic dimension refreshes. Results are fast and approximately correct.
- Nightly batch (Spark, dbt, Trino against Iceberg): re-joins raw events against the authoritative dimension snapshot at the time of each event. Overwrites the streaming results with fully accurate values.
This lets you serve low-latency dashboards during the day from the streaming layer, while the authoritative reporting data is produced nightly. Approximately correct intraday results, corrected overnight, are sufficient for the majority of analytics use cases.
See the Apache Iceberg time travel documentation for how to query snapshots at a specific point in time during the nightly reconciliation.
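The reconciliation step itself is just a re-join keyed on event time. A toy sketch of the idea in plain Python rather than Spark or Trino — `snapshot_for` stands in for an Iceberg time-travel query, and the data is invented:

```python
# Dimension snapshots keyed by the time they became current (toy data).
snapshots = [
    (0,  {101: "toys"}),    # category before the catalog rebuild
    (50, {101: "games"}),   # category after the rebuild committed at t=50
]

def snapshot_for(event_time):
    """Stand-in for an Iceberg time-travel query: the latest snapshot
    whose commit time is <= the event's own time."""
    current = {}
    for ts, table in snapshots:
        if ts <= event_time:
            current = table
    return current

def reconcile(raw_events):
    # Re-enrich each raw event against the dimension values that were
    # authoritative at the event's time, overwriting streaming output.
    return [{**e, "category": snapshot_for(e["t"]).get(e["product_id"])}
            for e in raw_events]

fixed = reconcile([{"t": 10, "product_id": 101}, {"t": 60, "product_id": 101}])
print([e["category"] for e in fixed])  # ['toys', 'games']
```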
FAQ
Does RisingWave support Iceberg time travel for dimension lookups?
Not directly in temporal joins. The FOR SYSTEM_TIME AS OF PROCTIME() clause uses the current table state at processing time. If you need to join events against the dimension values that were current at the time of the event (rather than processing time), you need a batch reconciliation step, or a more complex stateful approach.
What happens to in-flight joins during a REFRESH?
RisingWave performs the full reload atomically. Events that arrive during the reload are buffered. Once the new snapshot is in place, the buffered events are processed using the updated dimension data. You will not get a mix of old and new dimension values for a single event.
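A simplified model of that atomicity guarantee (our own sketch, not RisingWave internals): lookups block for the instant the table reference is swapped, so work that arrives mid-reload can only ever observe the complete old version or the complete new one.

```python
import threading

class RefreshableLookup:
    """Toy model of an atomically refreshed dimension table.

    Readers that arrive during a refresh wait on the lock and then
    run against the fully swapped-in new snapshot, never a mix.
    """
    def __init__(self, data):
        self._data = data
        self._lock = threading.Lock()

    def lookup(self, key):
        with self._lock:          # readers never see a half-built table
            return self._data.get(key)

    def refresh(self, new_data):
        with self._lock:          # one-step swap: old dict -> new dict
            self._data = new_data

table = RefreshableLookup({101: "toys"})
table.refresh({101: "games", 102: "books"})
print(table.lookup(101))  # games
```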
Can I use Iceberg snapshots for CDC-sourced dimension data, and how large can the table be?
Yes, and it is a common pattern. Your CDC pipeline writes dimension changes to an Iceberg table (via a RisingWave sink or Flink), and an orchestration job triggers ALTER TABLE ... REFRESH in RisingWave after each Iceberg commit. This gives you CDC-level freshness at the source while still using the snapshot pattern in RisingWave for cost efficiency.
For table size, what matters is file count and object storage latency more than row count. A well-compacted Iceberg table of 50 million rows in a small number of Parquet files loads faster than a fragmented 5-million-row table spread across thousands of files. Run compaction on your dimension tables before refreshing. See the Iceberg table maintenance documentation for compaction best practices.
Should I use a temporal join or a regular join with the Iceberg dimension table?
Use a temporal join (FOR SYSTEM_TIME AS OF PROCTIME()). A regular join maintains state for both sides, which means RisingWave holds the full product catalog in state on the join side indefinitely. The temporal join reads the lookup table directly at processing time, which is cheaper and semantically correct for the "enrich this event with current dimension data" pattern.
Conclusion
The decision between CDC and Iceberg snapshots for dimension tables comes down to update frequency and staleness risk. For the majority of slowly-changing reference data in e-commerce pipelines, product catalogs, pricing tiers, and user cohort segments, the Iceberg snapshot approach is significantly cheaper to operate, simpler to reason about, and close enough in freshness for analytical workloads.
The key practices:
- Use `refresh_mode = 'FULL_RELOAD'` with no polling interval for on-demand control
- Trigger `ALTER TABLE ... REFRESH` from your orchestration layer immediately after the upstream snapshot completes
- Use a temporal join (`FOR SYSTEM_TIME AS OF PROCTIME()`) to minimize join state cost
- Reserve CDC for dimension tables that change frequently or where stale values create direct business risk
- Use a nightly batch reconciliation pass for analytics pipelines that need an authoritative historical record
This pattern keeps RisingWave resource usage predictable and proportional to what actually changes, rather than to what could theoretically change.
Ready to try this yourself? Get started with RisingWave in 5 minutes with the Quickstart guide.
Join our Slack community to ask questions and connect with other stream processing developers.

