Building a Data Vault on Apache Iceberg with RisingWave

Building a Data Vault on Apache Iceberg with RisingWave delivers the auditability and flexibility of Data Vault 2.0 with the real-time freshness of streaming CDC. RisingWave captures database changes continuously, computes hub, link, and satellite structures as materialized views, and sinks them to Iceberg tables that preserve full history without hard deletes—the foundation of a compliant, evolvable enterprise data warehouse.

What Is Data Vault 2.0?

Data Vault 2.0 is a data modeling methodology designed for auditability, scalability, and adaptability. It organizes data into three entity types:

  • Hubs: Unique business keys (e.g., Customer ID, Order ID)
  • Links: Relationships between hubs (e.g., Customer placed Order)
  • Satellites: Descriptive attributes and their change history

Every record includes metadata columns: load_date (when the record was loaded), record_source (which system it came from), and optionally a hash key for deduplication. No records are ever deleted—changes are tracked as new satellite records with new load_date values.
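As an illustration, a hub table carrying the standard metadata columns might be declared like this (a sketch in generic SQL; the column names follow common Data Vault conventions rather than a fixed standard):

```sql
-- Hypothetical hub table showing the standard Data Vault metadata columns
CREATE TABLE hub_customer (
    h_customer_hk VARCHAR     NOT NULL,  -- MD5 hash of the business key, used for joins and dedup
    customer_bk   VARCHAR     NOT NULL,  -- the business key itself (e.g. Customer ID)
    load_date     TIMESTAMPTZ NOT NULL,  -- when this record was loaded
    record_source VARCHAR     NOT NULL   -- which source system it came from
);
```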

Iceberg is an ideal storage layer for Data Vault because it preserves full history, supports schema evolution, and never removes rows unless you explicitly delete them or run a snapshot-expiration procedure.

Streaming Data Vault Architecture

Traditional Data Vault implementations run nightly batch loads. RisingWave enables a real-time Data Vault where CDC events flow from source databases into Iceberg hubs, links, and satellites with sub-minute latency.

Step 1: Connect to Source Systems via CDC

CREATE SOURCE customers_cdc
WITH (
    connector     = 'postgres-cdc',
    hostname      = 'crm-db.internal',
    port          = '5432',
    username      = 'cdc_reader',
    password      = 'secret',
    database.name = 'crm',
    schema.name   = 'public',
    table.name    = 'customers',
    slot.name     = 'rw_customers_slot'
);

CREATE SOURCE orders_cdc
WITH (
    connector     = 'mysql-cdc',
    hostname      = 'orders-db.internal',
    port          = '3306',
    username      = 'cdc_reader',
    password      = 'secret',
    database.name = 'orders',
    table.name    = 'orders'
);

Step 2: Build Hub Materialized Views

A Hub contains only the business key and metadata—no descriptive attributes. RisingWave computes distinct business keys as they arrive.

CREATE MATERIALIZED VIEW hub_customer AS
SELECT DISTINCT
    MD5(CAST(customer_id AS VARCHAR))       AS h_customer_hk,
    customer_id                              AS customer_bk,
    CURRENT_TIMESTAMP                        AS load_date,
    'CRM_POSTGRES'                           AS record_source
FROM customers_cdc;

CREATE MATERIALIZED VIEW hub_order AS
SELECT DISTINCT
    MD5(CAST(order_id AS VARCHAR))          AS h_order_hk,
    order_id                                 AS order_bk,
    CURRENT_TIMESTAMP                        AS load_date,
    'ORDERS_MYSQL'                           AS record_source
FROM orders_cdc;

Step 3: Build Link Materialized Views

Links capture relationships between hubs. The customer-order link records which customer placed which order.

CREATE MATERIALIZED VIEW link_customer_order AS
SELECT
    MD5(CONCAT(
        CAST(o.customer_id AS VARCHAR), '|',
        CAST(o.order_id AS VARCHAR)
    ))                                       AS l_customer_order_hk,
    MD5(CAST(o.customer_id AS VARCHAR))      AS h_customer_hk,
    MD5(CAST(o.order_id AS VARCHAR))         AS h_order_hk,
    CURRENT_TIMESTAMP                        AS load_date,
    'ORDERS_MYSQL'                           AS record_source
FROM orders_cdc AS o;

Step 4: Build Satellite Materialized Views

Satellites hold descriptive attributes with full change history. Each record represents the state of an entity at a point in time.

CREATE MATERIALIZED VIEW sat_customer_details AS
SELECT
    MD5(CAST(customer_id AS VARCHAR))       AS h_customer_hk,
    customer_id,
    first_name,
    last_name,
    email,
    country,
    customer_segment,
    CURRENT_TIMESTAMP                        AS load_date,
    CAST(NULL AS TIMESTAMPTZ)                AS load_end_date, -- open until superseded by a newer record
    'CRM_POSTGRES'                           AS record_source,
    MD5(CONCAT(
        COALESCE(first_name,''), '|',
        COALESCE(last_name,''), '|',
        COALESCE(email,''), '|',
        COALESCE(country,''), '|',
        COALESCE(customer_segment,'')
    ))                                       AS hash_diff
FROM customers_cdc;

Step 5: Sink to Iceberg

CREATE SINK vault_hub_customer AS
SELECT * FROM hub_customer
WITH (
    connector       = 'iceberg',
    type            = 'append-only',
    catalog.type    = 'rest',
    catalog.uri     = 'http://iceberg-catalog:8181',
    warehouse.path  = 's3://data-vault/warehouse',
    s3.region       = 'us-east-1',
    database.name   = 'vault',
    table.name      = 'hub_customer'
);

CREATE SINK vault_sat_customer AS
SELECT * FROM sat_customer_details
WITH (
    connector       = 'iceberg',
    type            = 'append-only',
    catalog.type    = 'rest',
    catalog.uri     = 'http://iceberg-catalog:8181',
    warehouse.path  = 's3://data-vault/warehouse',
    s3.region       = 'us-east-1',
    database.name   = 'vault',
    table.name      = 'sat_customer_details'
);

Satellites use append-only because Data Vault never modifies historical records—new states are appended with a new load_date.
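The link view is sunk the same way; a sketch reusing the same (assumed) catalog settings as the sinks above:

```sql
CREATE SINK vault_link_customer_order AS
SELECT * FROM link_customer_order
WITH (
    connector       = 'iceberg',
    type            = 'append-only',
    catalog.type    = 'rest',
    catalog.uri     = 'http://iceberg-catalog:8181',
    warehouse.path  = 's3://data-vault/warehouse',
    s3.region       = 'us-east-1',
    database.name   = 'vault',
    table.name      = 'link_customer_order'
);
```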

Data Vault Layer Comparison

Layer         | Entity               | Update Strategy           | Iceberg Sink Type | Typical Size
Hub           | Business keys        | Append only (dedup)       | append-only       | Small
Link          | Relationships        | Append only               | append-only       | Medium
Satellite     | Attributes + history | Append only (SCD Type 2)  | append-only       | Large
Point-in-Time | Query helper         | Upsert                    | upsert            | Medium
Bridge        | Multi-active         | Append only               | append-only       | Variable
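A Point-in-Time table is the one vault structure above that calls for upsert: it tracks, per hub key, the load_date of the latest satellite record, and each new satellite version overwrites the previous PIT row. A minimal sketch (the pit_customer view and the primary_key choice are illustrative, not part of the original setup):

```sql
CREATE MATERIALIZED VIEW pit_customer AS
SELECT
    h_customer_hk,
    MAX(load_date) AS sat_customer_load_date  -- latest satellite state per key
FROM sat_customer_details
GROUP BY h_customer_hk;

CREATE SINK vault_pit_customer AS
SELECT * FROM pit_customer
WITH (
    connector       = 'iceberg',
    type            = 'upsert',               -- PIT rows are updated in place
    primary_key     = 'h_customer_hk',
    catalog.type    = 'rest',
    catalog.uri     = 'http://iceberg-catalog:8181',
    warehouse.path  = 's3://data-vault/warehouse',
    s3.region       = 'us-east-1',
    database.name   = 'vault',
    table.name      = 'pit_customer'
);
```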

Querying the Data Vault

Data Vault structures require joins across hubs, links, and satellites. To read the current state from an SCD Type 2 satellite, select the latest record per hash key using RisingWave's Top-N pattern (a ROW_NUMBER subquery filtered to rank 1):

-- Get current customer details (latest satellite record per key)
SELECT
    h.customer_bk,
    s.first_name,
    s.last_name,
    s.email,
    s.country,
    s.customer_segment
FROM hub_customer AS h
JOIN (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY h_customer_hk
            ORDER BY load_date DESC
        ) AS rn
    FROM sat_customer_details
) AS s
  ON h.h_customer_hk = s.h_customer_hk
WHERE s.rn = 1;

Why Iceberg Is Ideal for Data Vault

Iceberg's append-only semantics align perfectly with Data Vault's prohibition on hard deletes. Schema evolution lets you add satellite columns without rebuilding history. Time travel lets you reconstruct the Data Vault state at any historical point for audits or regulatory inquiries.
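For example, a query engine that reads Iceberg, such as Trino, can reconstruct a vault table as of a past instant; a sketch in Trino SQL, where the catalog name and timestamp are placeholders:

```sql
-- Trino time travel: read hub_customer as it existed at a given moment
SELECT *
FROM iceberg.vault.hub_customer
FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC';
```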

Combined with RisingWave's CDC capabilities, you get a real-time Data Vault that ingests changes from PostgreSQL and MySQL within seconds—eliminating the traditional overnight batch window.

FAQ

Q: Does Data Vault require a specific Iceberg catalog? A: No. Any Iceberg REST-compatible catalog works: Polaris, Nessie, Tabular, AWS Glue with REST adapter. The catalog tracks table metadata but does not constrain Data Vault logic.

Q: How do I handle GDPR right-to-erasure requests with an append-only Data Vault? A: Use Iceberg's row-level delete feature to surgically remove specific rows by primary key without rewriting entire partitions. This is one of Iceberg v2's key advantages for regulated industries.
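Concretely, an engine that supports Iceberg v2 row-level deletes (e.g. Spark SQL) can target the affected keys directly; a sketch, where the customer ID is a placeholder:

```sql
-- Erase one customer's satellite history by hash key (placeholder ID)
DELETE FROM vault.sat_customer_details
WHERE h_customer_hk = MD5('12345');
```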

Q: Should I use hash keys or sequence keys for Data Vault? A: Hash keys (MD5 or SHA-1 of business key) are preferred for distributed streaming systems. They enable parallel loading without a central sequence generator, which is essential for RisingWave's multi-node architecture.

Q: How do I build Information Marts on top of the Data Vault? A: Create additional RisingWave materialized views that join hubs, links, and satellites into dimensional or flat structures. Sink these to separate Iceberg tables optimized for analytical query patterns.
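For instance, a flat customer-order mart could be sketched as a materialized view over the structures built earlier (illustrative only; joining the satellite as-is yields one row per order per customer state version, so a real mart would typically filter to the current satellite record first):

```sql
CREATE MATERIALIZED VIEW mart_customer_orders AS
SELECT
    hc.customer_bk,
    ho.order_bk,
    s.country,
    s.customer_segment,
    s.load_date AS customer_state_as_of
FROM link_customer_order AS l
JOIN hub_customer AS hc
  ON l.h_customer_hk = hc.h_customer_hk
JOIN hub_order AS ho
  ON l.h_order_hk = ho.h_order_hk
JOIN sat_customer_details AS s
  ON l.h_customer_hk = s.h_customer_hk;
```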

Q: What is the performance impact of MD5 hashing in materialized views? A: Negligible. MD5 is a CPU-bound operation that completes in nanoseconds per record. The throughput bottleneck in streaming Data Vault is almost always I/O (Kafka reads, Iceberg writes), not hash computation.

Get Started

Start building your streaming Data Vault with the RisingWave documentation. Connect with other data engineers building Iceberg-based architectures on the RisingWave Slack.
