Building a Fintech Data Lakehouse with RisingWave and Apache Iceberg

A fintech data lakehouse with RisingWave and Apache Iceberg means using RisingWave to process streaming financial events in real time, then sinking the results as Iceberg tables on object storage—giving you sub-second operational queries on RisingWave and full historical analytics on Iceberg, without duplicating your data pipeline.

Why Fintech Needs a Lakehouse Architecture

Fintech companies face a bifurcated data problem. On one side: real-time operational needs—fraud scores, live balances, active KYC flags—that require sub-second query latency. On the other side: historical analytics, regulatory reporting, and ML feature stores that require querying months or years of data efficiently.

Traditional architectures solve this with two separate systems: a streaming database for real-time and a data warehouse for historical. This creates the classic data engineering headache: dual pipelines, data inconsistency, and high operational overhead.

The lakehouse pattern—pioneered by Delta Lake and Apache Iceberg—proposes a single storage layer (object storage like S3) that supports both operational and analytical workloads. RisingWave's native Iceberg sink connects the streaming and batch worlds: event data flows through RisingWave's materialized views and is then written to Iceberg tables that Spark, Trino, Snowflake, or any other Iceberg-compatible engine can query.

Setting Up the Streaming Layer

-- Payment events from core banking CDC
CREATE SOURCE payment_events (
    payment_id          VARCHAR,
    sender_account      VARCHAR,
    receiver_account    VARCHAR,
    amount              DECIMAL(18,2),
    currency            VARCHAR(3),
    payment_type        VARCHAR,    -- 'ACH', 'WIRE', 'RTP', 'CARD'
    status              VARCHAR,    -- 'INITIATED', 'SETTLED', 'FAILED', 'RETURNED'
    initiated_at        TIMESTAMPTZ,
    settled_at          TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'payment-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

-- Fraud scores from ML inference service
CREATE SOURCE fraud_scores (
    payment_id          VARCHAR,
    fraud_probability   DECIMAL(5,4),
    model_version       VARCHAR,
    scored_at           TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'fraud-scores',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'latest'
) FORMAT PLAIN ENCODE JSON;

Computing Materialized Views for Real-Time Serving

-- Real-time payment analytics: hourly settlement metrics
CREATE MATERIALIZED VIEW hourly_payment_metrics AS
SELECT
    window_start,
    window_end,
    payment_type,
    currency,
    COUNT(*)                                                AS total_payments,
    COUNT(*) FILTER (WHERE status = 'SETTLED')              AS settled_count,
    COUNT(*) FILTER (WHERE status = 'FAILED')               AS failed_count,
    COUNT(*) FILTER (WHERE status = 'RETURNED')             AS returned_count,
    SUM(amount) FILTER (WHERE status = 'SETTLED')           AS settled_volume,
    AVG(amount)                                             AS avg_payment_amount,
    AVG(EXTRACT(EPOCH FROM (settled_at - initiated_at)))    AS avg_settlement_seconds
FROM TUMBLE(payment_events, initiated_at, INTERVAL '1 HOUR')
GROUP BY window_start, window_end, payment_type, currency;

-- Enriched payment stream joining fraud scores
CREATE MATERIALIZED VIEW enriched_payments AS
SELECT
    pe.payment_id,
    pe.sender_account,
    pe.receiver_account,
    pe.amount,
    pe.currency,
    pe.payment_type,
    pe.status,
    pe.initiated_at,
    pe.settled_at,
    fs.fraud_probability,
    fs.model_version,
    CASE
        WHEN fs.fraud_probability >= 0.8 THEN 'HIGH'
        WHEN fs.fraud_probability >= 0.4 THEN 'MEDIUM'
        ELSE 'LOW'
    END AS fraud_risk_tier
FROM payment_events pe
LEFT JOIN fraud_scores fs ON pe.payment_id = fs.payment_id;
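These views are queryable over the Postgres wire protocol, so a fraud-ops dashboard can hit them directly for sub-second lookups. A sketch of the kind of serving query such a dashboard might run (the filter values are illustrative):

-- Surface high-risk in-flight payments for the fraud-ops dashboard
SELECT payment_id, amount, currency, fraud_probability
FROM enriched_payments
WHERE fraud_risk_tier = 'HIGH'
  AND status = 'INITIATED'
ORDER BY fraud_probability DESC
LIMIT 100;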

Sinking to Apache Iceberg

This is where the lakehouse architecture connects. RisingWave writes materialized view results directly to Iceberg tables on S3:

-- Sink enriched payments to Iceberg for historical analytics
CREATE SINK enriched_payments_iceberg
FROM enriched_payments
WITH (
    connector = 'iceberg',
    -- The LEFT JOIN re-emits a row when its fraud score arrives,
    -- so this changelog is not append-only; sink it as an upsert
    type = 'upsert',
    primary_key = 'payment_id',
    warehouse.path = 's3://fintech-lakehouse/warehouse',
    catalog.name = 'fintech',
    catalog.type = 'storage',
    database.name = 'payments',
    table.name = 'enriched_payments'
);

-- Sink hourly metrics to Iceberg for BI and reporting
CREATE SINK hourly_metrics_iceberg
FROM hourly_payment_metrics
WITH (
    connector = 'iceberg',
    -- Windowed aggregates keep updating as events arrive within the window,
    -- so this sink must also be an upsert keyed on the grouping columns
    type = 'upsert',
    primary_key = 'window_start,window_end,payment_type,currency',
    warehouse.path = 's3://fintech-lakehouse/warehouse',
    catalog.name = 'fintech',
    catalog.type = 'storage',
    database.name = 'payments',
    table.name = 'hourly_payment_metrics'
);

Once data lands in Iceberg, it is immediately queryable by Spark, Trino, Snowflake External Tables, Databricks, and any other Iceberg-compatible engine—without any ETL job or data copy.
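As a sketch, an analyst could point Trino or Spark SQL at the same warehouse—assuming the catalog is registered in that engine under the name fintech—and run historical aggregates the streaming layer never needs to hold:

-- Trino / Spark SQL: monthly settled volume by payment type, from Iceberg
SELECT
    date_trunc('month', initiated_at) AS month,
    payment_type,
    SUM(amount) AS settled_volume
FROM fintech.payments.enriched_payments
WHERE status = 'SETTLED'
GROUP BY 1, 2
ORDER BY 1, 2;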

Regulatory Reporting View

-- Daily payment summary for regulatory reporting (queryable in RisingWave)
CREATE MATERIALIZED VIEW daily_regulatory_summary AS
SELECT
    window_start::DATE                  AS report_date,
    payment_type,
    currency,
    COUNT(*)                            AS total_count,
    SUM(amount)                         AS total_amount,
    COUNT(*) FILTER (WHERE fraud_risk_tier = 'HIGH') AS high_risk_count,
    SUM(amount) FILTER (WHERE fraud_risk_tier = 'HIGH') AS high_risk_amount
FROM TUMBLE(enriched_payments, initiated_at, INTERVAL '1 DAY')
GROUP BY window_start, payment_type, currency;
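The same sink pattern archives this summary to Iceberg so auditors can query any historical reporting date long after it leaves the streaming layer; the table name below is illustrative:

-- Sink the regulatory summary to Iceberg for long-term audit retention
CREATE SINK daily_regulatory_summary_iceberg
FROM daily_regulatory_summary
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'report_date,payment_type,currency',
    warehouse.path = 's3://fintech-lakehouse/warehouse',
    catalog.name = 'fintech',
    catalog.type = 'storage',
    database.name = 'payments',
    table.name = 'daily_regulatory_summary'
);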

Lakehouse Architecture Comparison

| Dimension | Traditional (Streaming DB + Warehouse) | Lakehouse (RisingWave + Iceberg) |
|---|---|---|
| Pipeline count | Two separate pipelines | Single streaming pipeline |
| Real-time latency | Sub-second (streaming DB) | Sub-second (RisingWave views) |
| Historical query engine | Data warehouse (proprietary) | Any Iceberg engine (open) |
| Data consistency | Risk of divergence | Single source of truth |
| Storage cost | High (duplicated storage) | Low (S3 object storage) |
| Schema evolution | Requires migration in both systems | Iceberg handles schema evolution natively |
| Operational overhead | Two systems to manage | One streaming layer + open storage |

FAQ

Q: What Iceberg catalog types does RisingWave support? A: RisingWave supports storage-based catalogs (metadata files kept directly under the warehouse path) as well as REST, Hive Metastore, AWS Glue, and JDBC catalog configurations. For production deployments, AWS Glue or a REST catalog service (like Nessie or Polaris) is recommended for better metadata management.
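Switching to a REST catalog is mostly a matter of swapping the catalog options on the sink; a sketch, where the catalog URI is a placeholder for your catalog service:

-- Same sink shape, registered through a REST catalog (URI is illustrative)
CREATE SINK enriched_payments_iceberg_rest
FROM enriched_payments
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'payment_id',
    warehouse.path = 's3://fintech-lakehouse/warehouse',
    catalog.name = 'fintech',
    catalog.type = 'rest',
    catalog.uri = 'http://rest-catalog:8181',
    database.name = 'payments',
    table.name = 'enriched_payments'
);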

Q: How does schema evolution work when we add new fields to payment events? A: Iceberg natively supports adding nullable columns without rewriting existing data. On the RisingWave side, update the source schema and the materialized view definition. New columns appear in Iceberg with null values for historical rows. This is one of Iceberg's major advantages over Parquet-on-S3 approaches.
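For instance, adding a settlement-network field on the Iceberg side is standard DDL issued from any writing engine; the column name here is hypothetical:

-- Spark SQL: add a nullable column (existing rows read back as NULL)
ALTER TABLE fintech.payments.enriched_payments
ADD COLUMN settlement_network STRING;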

Q: Can we use RisingWave to backfill Iceberg from Kafka history? A: Yes. Set scan.startup.mode = 'earliest' on your Kafka source to replay from the beginning of the topic's retention window. RisingWave will process all historical events, update its materialized views, and sink everything to Iceberg in one operation.

Q: How does this handle GDPR right-to-erasure on Iceberg data? A: Iceberg's format version 2 supports row-level deletes. Use Apache Spark or Trino to issue targeted DELETE statements against the Iceberg tables for specific customer IDs; the engine rewrites only the affected data files, not the entire table.
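A targeted erasure might look like the following (the account ID is a placeholder); only data files containing matching rows get rewritten:

-- Spark SQL / Trino: erase all payments involving one customer account
DELETE FROM fintech.payments.enriched_payments
WHERE sender_account = 'acct-0000'
   OR receiver_account = 'acct-0000';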

Q: What is the typical write latency from RisingWave to Iceberg? A: RisingWave batches Iceberg writes and commits new snapshots periodically (configurable, typically every 60 seconds). This means Iceberg tables are slightly behind RisingWave's materialized views. For real-time queries, read from RisingWave; for historical analytics, read from Iceberg.
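If your RisingWave version exposes it, the commit cadence is tunable per sink via commit_checkpoint_interval (the number of checkpoints between Iceberg commits)—treat the exact option name and default as an assumption to verify against your version's docs:

-- Commit to Iceberg less often to produce fewer, larger snapshots
CREATE SINK enriched_payments_iceberg_batched
FROM enriched_payments
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'payment_id',
    warehouse.path = 's3://fintech-lakehouse/warehouse',
    catalog.name = 'fintech',
    catalog.type = 'storage',
    database.name = 'payments',
    table.name = 'enriched_payments',
    commit_checkpoint_interval = 10  -- assumption: commit every 10 checkpoints
);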

