How to Build a Medallion Architecture with RisingWave and Iceberg

A medallion architecture with RisingWave and Iceberg uses three tiers: Bronze (raw data landed in Iceberg via RisingWave sinks), Silver (cleaned and joined data in Iceberg, computed by RisingWave materialized views), and Gold (business-ready aggregations in Iceberg for BI and analytics). Each tier runs continuously, with seconds of latency between source events and Gold layer availability.

What Is a Medallion Architecture?

The medallion architecture (also called the multi-hop architecture) organizes data into progressively refined layers:

Bronze — Raw, unmodified data from source systems. The landing zone. Immutable once written.
Silver — Cleaned, validated, and joined data. Conformed to consistent schemas. Still at event or entity grain.
Gold — Business-level aggregations and metrics. Optimized for consumption by BI tools, dashboards, and ML models.

Originally popularized by Databricks for batch pipelines, the medallion pattern maps naturally to streaming architectures. RisingWave's materialized views act as the transformation engine between tiers, while Iceberg on S3 provides the durable, queryable storage at each layer.

Why Streaming Medallion Beats Batch Medallion

Traditional batch-based medallion pipelines run on schedules: hourly or daily jobs that read Bronze, produce Silver, then produce Gold. This introduces multi-hour latency and complex dependency management (if Bronze is delayed, Silver and Gold cascade).

A streaming medallion pipeline using RisingWave is different:

Continuous — Transformations run as events arrive, not on a schedule
Incremental — RisingWave only processes changed data, not full table scans
Low-latency — Events flow from Bronze to Gold in seconds, not hours
Self-healing — If a source recovers from an outage, RisingWave automatically catches up

Setting Up Each Layer

Bronze Layer: Raw Data Landing

The Bronze layer ingests raw events directly from Kafka and lands them unmodified into Iceberg. Preserve everything — even malformed records — to enable reprocessing.

-- Bronze: Raw Kafka events → Iceberg
CREATE SOURCE raw_orders_kafka (
    order_id VARCHAR,
    customer_id VARCHAR,
    items JSONB,
    raw_total VARCHAR,
    shipping_address JSONB,
    status VARCHAR,
    source_system VARCHAR,
    kafka_timestamp TIMESTAMPTZ
)
WITH (
    connector = 'kafka',
    topic = 'raw-orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

CREATE SINK bronze_orders_sink AS
SELECT * FROM raw_orders_kafka
WITH (
    connector = 'iceberg',
    type = 'append-only',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://data-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'bronze',
    table.name = 'orders'
);

Use append-only for Bronze — raw events are immutable history. Never overwrite Bronze data.

Silver Layer: Cleaned and Conformed Data

The Silver layer reads from Bronze sources (via Kafka, not Iceberg directly), applies data quality rules, parses nested fields, and joins with reference data:

-- Reference table for customer data (from CDC)
CREATE SOURCE customers_postgres
WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres-primary',
    port = '5432',
    username = 'replicator',
    password = 'replicator-password',
    database.name = 'commerce',
    schema.name = 'public',
    table.name = 'customers'
);

-- Silver: Cleaned and enriched orders
CREATE MATERIALIZED VIEW silver_orders AS
SELECT
    order_id,
    customer_id::BIGINT AS customer_id,
    -- Parse total safely with fallback
    CASE
        WHEN raw_total ~ '^\d+(\.\d+)?$' THEN raw_total::NUMERIC
        ELSE NULL
    END AS order_total,
    COALESCE(
        (shipping_address->>'country')::VARCHAR,
        'UNKNOWN'
    ) AS shipping_country,
    status,
    source_system,
    kafka_timestamp AS order_time,
    c.customer_name,
    c.customer_segment,
    c.email_domain
FROM raw_orders_kafka o
LEFT JOIN customers_postgres FOR SYSTEM_TIME AS OF PROCTIME() c
    ON o.customer_id::BIGINT = c.customer_id
WHERE order_id IS NOT NULL;  -- Filter out malformed records

CREATE SINK silver_orders_sink AS
SELECT * FROM silver_orders
WITH (
    connector = 'iceberg',
    type = 'upsert',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://data-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'silver',
    table.name = 'orders'
);

Gold Layer: Business Metrics

The Gold layer produces the aggregated, business-ready metrics that analysts and dashboards consume:

CREATE MATERIALIZED VIEW gold_daily_revenue AS
SELECT
    shipping_country,
    customer_segment,
    window_start::DATE AS date,
    COUNT(*) AS order_count,
    SUM(order_total) AS total_revenue,
    AVG(order_total) AS avg_order_value,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM TUMBLE(silver_orders, order_time, INTERVAL '1 DAY')
WHERE order_total IS NOT NULL
GROUP BY shipping_country, customer_segment, window_start;

CREATE SINK gold_revenue_sink AS
SELECT * FROM gold_daily_revenue
WITH (
    connector = 'iceberg',
    type = 'upsert',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://data-lake/warehouse',
    s3.region = 'us-east-1',
    database.name = 'gold',
    table.name = 'daily_revenue'
);

Architecture Overview

Layer	Storage	Transformation	Sink Type	Latency
Bronze	Iceberg (bronze DB)	None (raw)	append-only	< 30s
Silver	Iceberg (silver DB)	Clean, parse, join	upsert	< 60s
Gold	Iceberg (gold DB)	Aggregate, window	upsert	< 90s

Querying Across Layers

In RisingWave v2.8+, you can query Iceberg tables directly, enabling cross-layer analysis:

-- Query Gold layer Iceberg table from RisingWave
CREATE SOURCE gold_revenue_iceberg
WITH (
    connector = 'iceberg',
    catalog.type = 'rest',
    catalog.uri = 'http://iceberg-catalog:8181',
    warehouse.path = 's3://data-lake/warehouse',
    database.name = 'gold',
    table.name = 'daily_revenue'
);

-- Compare today's live data with historical Gold data
SELECT
    live.shipping_country,
    live.total_revenue AS todays_revenue,
    hist.total_revenue AS last_week_revenue,
    ROUND(100.0 * (live.total_revenue - hist.total_revenue) / NULLIF(hist.total_revenue, 0), 1) AS pct_change
FROM gold_daily_revenue live
JOIN gold_revenue_iceberg hist
    ON live.shipping_country = hist.shipping_country
    AND hist.date = CURRENT_DATE - INTERVAL '7 DAYS'
ORDER BY pct_change DESC;

Data Quality Patterns

Bronze-to-Silver Quality Rules

Apply quality checks in the Silver materialized view using SQL filters and CASE expressions:

Null handling — Use COALESCE to replace nulls with defaults
Type coercion — Use regex checks before casting to avoid runtime errors
Deduplication — Use MAX(event_time) grouped by entity ID to keep latest state
Range validation — Use WHERE amount BETWEEN 0 AND 1000000 to reject outliers

Silver-to-Gold Quality Rules

Referential integrity — Only aggregate rows with valid foreign keys (use INNER JOIN instead of LEFT JOIN in Gold views)
Completeness thresholds — Alert if Gold row counts drop more than 10% from previous period

Medallion vs. Other Data Organization Patterns

Pattern	Layers	Latency	Complexity	Best For
Medallion (Bronze/Silver/Gold)	3	Seconds (streaming)	Medium	General-purpose analytics
Lambda (batch + stream)	2 parallel	Minutes (batch)	Very High	Legacy systems
Single-layer raw	1	Hours (batch)	Low	Simple reporting
Domain-oriented (data mesh)	Variable	Variable	High	Large organizations
Kappa (stream only)	1	Seconds	Low	Event-driven systems

Operational Considerations

Compaction scheduling — Each Iceberg layer accumulates small files from streaming commits. Schedule compaction jobs for each layer (Bronze: daily, Silver: every 4 hours, Gold: every hour) since Gold receives the most queries.

Snapshot retention — Bronze retains snapshots for 90+ days (for reprocessing). Silver: 30 days. Gold: 7 days (data is regeneratable from Silver).

Monitoring — Track record counts at each layer boundary. Significant drops indicate upstream issues. Track schema drift at Bronze to catch unexpected source changes early.

FAQ

Q: Can I rebuild Silver from Bronze if I need to fix a bug? Yes. This is one of the main benefits of the medallion pattern. Delete the Silver Iceberg table, fix your Silver materialized view, and replay from the Bronze source in Kafka or reprocess from Bronze Iceberg files.

Q: Should Gold tables be served directly to BI tools? Yes. Gold tables in Iceberg are ideal for BI tools. Connect Tableau, Looker, or Power BI via Athena, Trino, or another Iceberg-compatible query engine. The aggregated data is already in the shape analysts need.

Q: How many Gold tables should I create? Create one Gold table per business domain or reporting area. Avoid creating a single monolithic Gold table — it becomes a bottleneck and mixes concerns. Typical setups have 5–20 Gold tables.

Q: Can I use RisingWave for all three layers, or do I need Spark? RisingWave handles all three layers entirely through SQL — no Spark required for the pipeline. You may still want Spark for periodic compaction of Iceberg files, but this is operational maintenance, not part of the pipeline itself.

Q: What if a Silver transformation fails? RisingWave applies backpressure upstream and buffers data until the issue is resolved. Bronze data in Kafka is retained (based on your Kafka retention policy), ensuring no data is lost.

Build Your Streaming Medallion Architecture

The streaming medallion architecture with RisingWave and Iceberg gives you fresh, reliable, multi-layer data — without the operational overhead of scheduled batch jobs or the complexity of Lambda-style dual pipelines.

Get started with the RisingWave documentation and join the RisingWave Slack community to see real-world medallion architecture examples from the community.