By the RisingWave Engineering Team
Introduction
Every e-commerce enrichment pipeline joins a high-volume event stream against dimension tables: product catalog, pricing tiers, user segments, geographic data. The event stream changes by the second. The dimension tables change maybe once a day, or once a week.
A common pattern is to stream the dimension data via CDC alongside everything else. It works, but it is expensive. You are paying the full cost of a streaming join state for data that barely moves.
This guide covers when to replace CDC-streamed dimension tables with Apache Iceberg snapshot lookups in RisingWave, how to set them up, how to trigger refreshes from your orchestration layer, and how to reason through the staleness tradeoff for each dimension type.
The short answer: if your dimension changes at most a few times per day, the Iceberg snapshot approach is almost always the right call.
The Core Decision: CDC Stream or Iceberg Snapshot?
Before choosing, ask one question: how much does it cost your business if this dimension value is 4 hours stale?
For product categories and pricing tiers, the answer is usually "not much." A product that launched this morning will have a NULL category in enrichment results until the next refresh. That is a minor gap in analytics data. For fraud rules or compliance flags that change throughout the day, stale values carry real risk.
The decision tree:
```mermaid
flowchart TD
    A[Dimension table update frequency?] --> B{Multiple times per hour}
    A --> C{A few times per day or less}
    B --> D[CDC stream - staleness too costly]
    C --> E{Stale values have business-critical impact?}
    E -->|Yes - fraud rules, compliance| D
    E -->|No - product catalog, segments| F[Iceberg snapshot - preferred]
    F --> G[Schedule refresh after upstream pipeline completes]
```
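The same decision tree can be encoded as a small helper function. This is purely illustrative — the function name and thresholds are ours, not part of any RisingWave API:

```python
def choose_dimension_strategy(changes_per_day: float, staleness_is_critical: bool) -> str:
    """Encode the decision tree above (illustrative thresholds).

    changes_per_day: how often the dimension table is updated.
    staleness_is_critical: True for fraud rules, compliance flags, etc.
    """
    # Multiple updates per hour: snapshot staleness compounds too fast.
    if changes_per_day > 24:
        return "cdc_stream"
    # Even a slowly changing table needs CDC if stale values carry real risk.
    if staleness_is_critical:
        return "cdc_stream"
    # A few changes per day or less, low staleness cost: snapshot wins.
    return "iceberg_snapshot"

# Product catalog: rebuilt nightly, analytics-only impact.
print(choose_dimension_strategy(1, False))   # iceberg_snapshot
# Fraud rules: updated throughout the day, high impact.
print(choose_dimension_strategy(10, True))   # cdc_stream
```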
Side-by-side comparison
| Concern | Iceberg Snapshot | CDC Stream |
| --- | --- | --- |
| Cluster cost | Low - table loaded once, used as lookup | High - every dimension change propagates through all downstream joins |
| State size | Bounded to dimension table size | Requires full hash-join state on both sides |
| Freshness | Stale until next refresh | Near-real-time |
| Operational complexity | Simple - one ALTER TABLE ... REFRESH call | Needs its own source, connector, and error handling |
| Best fit | Product catalog, pricing tiers, user cohort definitions | Fraud rules, compliance flags, small tables with frequent changes |
Setting Up an Iceberg Dimension Table in RisingWave
RisingWave supports Iceberg as a source connector. You create the table once, and it stays static until you explicitly trigger a reload. There is no continuous polling, no replication slot, and no ongoing resource consumption beyond the initial load.
Here is a product catalog table backed by an Iceberg snapshot in AWS Glue:
```sql
-- Create an Iceberg-backed dimension table
-- No refresh_interval_sec - this table refreshes on demand only
CREATE TABLE product_catalog (
    product_id BIGINT,
    category VARCHAR,
    brand VARCHAR,
    supplier_id BIGINT,
    price_usd DECIMAL,
    PRIMARY KEY (product_id)
) WITH (
    connector = 'iceberg',
    catalog.type = 'glue',
    database.name = 'prod_catalog',
    table.name = 'products',
    refresh_mode = 'FULL_RELOAD'
);
```
For a REST catalog, replace `catalog.type = 'glue'` with `catalog.type = 'rest'` and add `catalog.uri = 'http://iceberg-catalog:8181'`. See the RisingWave Iceberg source documentation for the full list of supported catalog types and connection options.
Using the dimension table in a temporal join
Once the table is loaded, you can use it in a temporal join to enrich a live event stream:
```sql
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT
    o.order_id,
    o.customer_id,
    o.quantity,
    o.order_time,
    p.category,
    p.brand,
    p.price_usd
FROM orders o
JOIN product_catalog FOR SYSTEM_TIME AS OF PROCTIME() p
    ON o.product_id = p.product_id;
```
The FOR SYSTEM_TIME AS OF PROCTIME() clause tells RisingWave to look up the current state of product_catalog at the moment each order event is processed. RisingWave only maintains state for the streaming side (orders). The catalog side is read as a point-in-time lookup, which is why state cost is bounded regardless of order volume.
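That state asymmetry can be illustrated outside SQL. Here is a rough Python model of the lookup semantics (our own sketch, not RisingWave internals): the dimension side is just a table read at processing time, and only the stream side flows through the join.

```python
# Rough model of a temporal join: the dimension side is a plain lookup
# table; each stream event reads whatever it holds at that moment.
catalog = {101: {"category": "shoes", "brand": "Acme", "price_usd": 59.0}}

def enrich(order: dict) -> dict:
    # FOR SYSTEM_TIME AS OF PROCTIME(): read the catalog as it is *now*.
    dim = catalog.get(order["product_id"], {})
    return {**order,
            "category": dim.get("category"),   # None (NULL-like) when the
            "brand": dim.get("brand")}         # product is not loaded yet

print(enrich({"order_id": 1, "product_id": 101}))
print(enrich({"order_id": 2, "product_id": 999}))  # unknown product -> None fields
```

No per-event state accumulates for the catalog side, which mirrors why the join's state cost stays bounded regardless of order volume.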
Why Polling Intervals Are Usually Wrong
The most common mistake is setting a short refresh_interval_sec on the Iceberg source:
```sql
-- Do NOT do this for slowly-changing dimensions
refresh_interval_sec = 600  -- every 10 minutes
```
Every full reload does the following in sequence: rescans the Iceberg table from object storage, rebuilds the entire table state inside RisingWave, triggers downstream updates in all temporal joins backed by this table, and competes with your streaming workloads for cluster resources.
For a product catalog that changes once a day, you are doing this 144 times per day for no reason.
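The waste is simple arithmetic — with a fixed polling interval, the reload count per day is constant no matter how often the data actually changes:

```python
SECONDS_PER_DAY = 86_400
refresh_interval_sec = 600              # the 10-minute polling interval above

reloads_per_day = SECONDS_PER_DAY // refresh_interval_sec
useful_reloads = 1                      # the catalog changes once a day
print(reloads_per_day)                  # 144
print(reloads_per_day - useful_reloads) # 143 full reloads that change nothing
```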
The right approach is to refresh on demand, triggered by your orchestration layer immediately after the upstream process that updates the Iceberg table completes.
Triggering Refreshes from Your Orchestration Layer
After your upstream pipeline completes a new catalog snapshot, fire one SQL statement:
```sql
-- Run this after the upstream rebuild finishes
ALTER TABLE product_catalog REFRESH;
```
This works from any orchestration system. In Airflow, wire it as a PostgresOperator task that runs immediately after the Iceberg snapshot job completes:
```python
rebuild_catalog_iceberg >> refresh_product_catalog >> downstream_tasks
```
If your dimension tables are managed by dbt, a dbt run-operation refresh_iceberg_tables macro that calls this statement for each table achieves the same result. The principle is the same regardless of tool: upstream pipeline writes new Iceberg snapshot, orchestration signals RisingWave to reload, enrichment immediately uses the updated values.
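Whatever the orchestrator, the refresh task reduces to running one statement per table against RisingWave's Postgres-compatible endpoint. A minimal sketch with the database call injected, so the same function works with a psycopg2 cursor, an Airflow hook, or a dbt adapter (the function name is ours):

```python
def refresh_iceberg_tables(execute, tables):
    """Issue one on-demand reload per dimension table.

    execute: any callable that runs a SQL string against RisingWave,
             e.g. cursor.execute from psycopg2 or an Airflow hook's run().
    """
    for table in tables:
        # Naive quoting for the sketch; use proper identifier
        # quoting/parameterization in production.
        execute(f'ALTER TABLE "{table}" REFRESH;')

# Example with a stand-in executor that just records the statements:
sent = []
refresh_iceberg_tables(sent.append, ["product_catalog", "pricing_tiers"])
print(sent[0])  # ALTER TABLE "product_catalog" REFRESH;
```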
What Staleness Actually Means in Practice
Staleness is not an abstract concern. Its impact is specific to each dimension type.
Product catalog
A new product added to the catalog will produce NULL values for category and brand in enriched orders until the next refresh. This affects only newly created products and only for the gap between when they were added and when REFRESH runs. For analytics, this is acceptable: a nightly reconciliation batch using a fresh snapshot corrects the historical record. For a real-time recommendation engine that must categorize new products the moment they launch, use CDC.
Pricing tiers
A price change will not propagate to enriched events until REFRESH runs. If pricing tiers are updated once a day as part of a scheduled repricing job, the risk is low: trigger a refresh immediately after the job completes and the stale window is minutes. If pricing changes multiple times per hour, use CDC.
User segments
Segment redefinition (moving a user from "casual" to "loyal" based on a weekly cohort job) is a perfect fit for Iceberg snapshots. The assignments change once a week, RisingWave refreshes once, and every subsequent event picks up the new segment. For real-time personalization where membership changes based on live behavior, the snapshot approach is too slow.
Summary by dimension type
| Dimension | Change frequency | Staleness impact | Recommendation |
| --- | --- | --- | --- |
| Product categories | Rarely (new products) | Low - analytics gap only | Iceberg snapshot |
| Pricing tiers (scheduled) | Once per day | Low - trigger refresh post-job | Iceberg snapshot |
| Pricing tiers (dynamic) | Multiple per hour | Medium - depends on pipeline use | CDC |
| User cohort segments | Weekly | Low - acceptable for analytics | Iceberg snapshot |
| Geographic/lookup tables | Rarely | Very low | Iceberg snapshot |
| Fraud rules | Throughout the day | High - directly affects detection | CDC |
| Compliance flags | Throughout the day | High - regulatory exposure | CDC |
When CDC Is Worth the Cost
CDC streaming for dimension tables makes sense under three conditions.
First, the dimension changes frequently. If a table is updated multiple times per hour, snapshot staleness compounds quickly and a polling-based refresh still costs more than it saves. A CDC stream keeps pace without periodic full reloads.
Second, stale values have a significant business impact. Fraud rule updates, compliance flag changes, or security policy tables fall here. If an event processed against a 2-hour-old fraud model creates a false negative, the cost of that mistake exceeds the ongoing cost of the CDC stream.
Third, the dimension table is small. A small table means the state overhead for a full hash-join is manageable. For a 500-row fraud rules table, CDC state is negligible and the freshness benefit is clear.
When none of these conditions hold, the CDC stream is overhead, not infrastructure.
The Reconciliation Strategy for Analytics Pipelines
Most analytics pipelines can tolerate some inaccuracy in real time as long as the historical record is correct. The standard pattern:
- Streaming layer (RisingWave + Iceberg snapshot): enriches events in near-real-time with periodic dimension refreshes. Results are fast and approximately correct.
- Nightly batch (Spark, dbt, Trino against Iceberg): re-joins raw events against the authoritative dimension snapshot at the time of each event. Overwrites the streaming results with fully accurate values.
This lets you serve low-latency dashboards during the day from the streaming layer, while the authoritative reporting data is produced nightly. Approximately correct intraday results, corrected overnight, are sufficient for the majority of analytics use cases.
See the Apache Iceberg time travel documentation for how to query snapshots at a specific point in time during the nightly reconciliation.
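The reconciliation step itself is just a re-join keyed on event time. A toy sketch of the idea in plain Python rather than Spark or Trino — `snapshot_for` stands in for an Iceberg time-travel query, and the data is invented:

```python
# Dimension snapshots keyed by the time they became current (toy data).
snapshots = [
    (0,  {101: "toys"}),    # category before the catalog rebuild
    (50, {101: "games"}),   # category after the rebuild committed at t=50
]

def snapshot_for(event_time):
    """Stand-in for an Iceberg time-travel query: the latest snapshot
    whose commit time is <= the event's own time."""
    current = {}
    for ts, table in snapshots:
        if ts <= event_time:
            current = table
    return current

def reconcile(raw_events):
    # Re-enrich each raw event against the dimension values that were
    # authoritative at the event's time, overwriting streaming output.
    return [{**e, "category": snapshot_for(e["t"]).get(e["product_id"])}
            for e in raw_events]

fixed = reconcile([{"t": 10, "product_id": 101}, {"t": 60, "product_id": 101}])
print([e["category"] for e in fixed])  # ['toys', 'games']
```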
FAQ
Does RisingWave support Iceberg time travel for dimension lookups?
Not directly in temporal joins. The FOR SYSTEM_TIME AS OF PROCTIME() clause uses the current table state at processing time. If you need to join events against the dimension values that were current at the time of the event (rather than processing time), you need a batch reconciliation step, or a more complex stateful approach.
What happens to in-flight joins during a REFRESH?
RisingWave performs the full reload atomically. Events that arrive during the reload are buffered. Once the new snapshot is in place, the buffered events are processed using the updated dimension data. You will not get a mix of old and new dimension values for a single event.
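A simplified model of that atomicity guarantee (our own sketch, not RisingWave internals): lookups block for the instant the table reference is swapped, so work that arrives mid-reload can only ever observe the complete old version or the complete new one.

```python
import threading

class RefreshableLookup:
    """Toy model of an atomically refreshed dimension table.

    Readers that arrive during a refresh wait on the lock and then
    run against the fully swapped-in new snapshot, never a mix.
    """
    def __init__(self, data):
        self._data = data
        self._lock = threading.Lock()

    def lookup(self, key):
        with self._lock:          # readers never see a half-built table
            return self._data.get(key)

    def refresh(self, new_data):
        with self._lock:          # one-step swap: old dict -> new dict
            self._data = new_data

table = RefreshableLookup({101: "toys"})
table.refresh({101: "games", 102: "books"})
print(table.lookup(101))  # games
```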
Can I use Iceberg snapshots for CDC-sourced dimension data, and how large can the table be?
Yes, and it is a common pattern. Your CDC pipeline writes dimension changes to an Iceberg table (via a RisingWave sink or Flink), and an orchestration job triggers ALTER TABLE ... REFRESH in RisingWave after each Iceberg commit. This gives you CDC-level freshness at the source while still using the snapshot pattern in RisingWave for cost efficiency.
For table size, what matters is file count and object storage latency more than row count. A well-compacted Iceberg table of 50 million rows in a small number of Parquet files loads faster than a fragmented 5-million-row table spread across thousands of files. Run compaction on your dimension tables before refreshing. See the Iceberg table maintenance documentation for compaction best practices.
Should I use a temporal join or a regular join with the Iceberg dimension table?
Use a temporal join (FOR SYSTEM_TIME AS OF PROCTIME()). A regular join maintains state for both sides, which means RisingWave holds the full product catalog in state on the join side indefinitely. The temporal join reads the lookup table directly at processing time, which is cheaper and semantically correct for the "enrich this event with current dimension data" pattern.
Conclusion
The decision between CDC and Iceberg snapshots for dimension tables comes down to update frequency and staleness risk. For the majority of slowly-changing reference data in e-commerce pipelines, product catalogs, pricing tiers, and user cohort segments, the Iceberg snapshot approach is significantly cheaper to operate, simpler to reason about, and close enough in freshness for analytical workloads.
The key practices:
- Use `refresh_mode = 'FULL_RELOAD'` with no polling interval for on-demand control
- Trigger `ALTER TABLE ... REFRESH` from your orchestration layer immediately after the upstream snapshot completes
- Use a temporal join (`FOR SYSTEM_TIME AS OF PROCTIME()`) to minimize join state cost
- Reserve CDC for dimension tables that change frequently or where stale values create direct business risk
- Use a nightly batch reconciliation pass for analytics pipelines that need an authoritative historical record
This pattern keeps RisingWave resource usage predictable and proportional to what actually changes, rather than to what could theoretically change.
Ready to try this yourself? Get started with RisingWave in 5 minutes with the Quickstart guide.
Join our Slack community to ask questions and connect with other stream processing developers.

