Streaming ETL for Core Banking Systems with RisingWave

Streaming ETL for core banking replaces nightly batch extracts with continuous change data capture (CDC) pipelines that propagate every account update, transaction, and ledger change to downstream systems within seconds of commit, eliminating data latency across the entire banking data stack.

The Batch ETL Problem in Banking

Core banking systems — Temenos T24, FIS Profile, Finacle, Flexcube — are the operational heart of every bank. But they were designed as transaction processors, not analytical data platforms. For decades, banks have relied on nightly batch ETL to extract data from core systems and load it into data warehouses, reporting systems, and downstream applications.

The consequences are well-known to anyone who has worked in banking technology:

  • Data latency: Reports and dashboards are always T-1
  • Settlement reconciliation: Batch windows create multi-hour gaps in payment status visibility
  • Regulatory reporting: End-of-day snapshots miss intraday position changes
  • Downstream system sync: Customer service tools show yesterday's account state

Streaming ETL with CDC eliminates these gaps. RisingWave connects directly to core banking databases using PostgreSQL and MySQL CDC connectors, captures every committed change, and continuously propagates transformed data to downstream systems.

CDC Architecture

Core Banking DB (PostgreSQL/MySQL)
        ↓ (CDC via logical replication / binlog)
   RisingWave Source
        ↓
Materialized Views (transform, enrich, aggregate)
        ↓
Sinks: Kafka | JDBC | Iceberg
        ↓
Data Warehouse | Reporting | Downstream Apps

The key advantage over traditional ETL tools: no staging tables, no scheduled jobs, no watermark complexity. RisingWave ingests changes as a continuous stream and maintains transformed views incrementally.

Setting Up CDC from Core Banking

-- Ingest account changes from PostgreSQL-based core banking.
-- CDC tables in RisingWave declare their columns and a primary key
-- matching the upstream table, so the data is directly queryable.
CREATE TABLE core_accounts (
    account_id VARCHAR,
    customer_id VARCHAR,
    account_type VARCHAR,
    account_status VARCHAR,
    currency VARCHAR,
    PRIMARY KEY (account_id)
) WITH (
    connector = 'postgres-cdc',
    hostname = 'core-banking-db.internal',
    port = '5432',
    username = 'rw_cdc_user',
    password = 'secret',
    database.name = 'corebank',
    schema.name = 'accounts',
    table.name = 'accounts',
    slot.name = 'rw_accounts_slot'
);
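
Before this source can connect, the upstream PostgreSQL instance has to be prepared for logical replication. A sketch of the typical prep, run by a DBA on the core banking database (role and slot names match the example above; adjust to your environment):

```sql
-- Logical replication must be enabled (requires a restart):
--   ALTER SYSTEM SET wal_level = 'logical';

-- A dedicated replication role with read access to the replicated schema
CREATE USER rw_cdc_user WITH REPLICATION LOGIN PASSWORD 'secret';
GRANT USAGE ON SCHEMA accounts TO rw_cdc_user;
GRANT SELECT ON ALL TABLES IN SCHEMA accounts TO rw_cdc_user;

-- Optionally pre-create the replication slot RisingWave will consume
SELECT pg_create_logical_replication_slot('rw_accounts_slot', 'pgoutput');
```

Pre-creating the slot is optional; the point is that the replication role needs both REPLICATION and SELECT privileges, and that enabling wal_level = 'logical' is a restart-requiring change best scheduled ahead of time.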

-- Ingest the transaction ledger from a MySQL-based transaction system.
-- Columns and primary key again mirror the upstream table.
CREATE TABLE ledger_transactions (
    transaction_id VARCHAR,
    account_id VARCHAR,
    amount DECIMAL,
    txn_type VARCHAR,
    txn_time TIMESTAMP,
    PRIMARY KEY (transaction_id)
) WITH (
    connector = 'mysql-cdc',
    hostname = 'txn-ledger-db.internal',
    port = '3306',
    username = 'rw_cdc_user',
    password = 'secret',
    database.name = 'ledger',
    table.name = 'transactions'
);

Once these are created, RisingWave begins consuming the replication stream (logical replication for PostgreSQL, the binlog for MySQL), capturing every INSERT, UPDATE, and DELETE from the source tables.
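
A quick way to confirm that changes are flowing is a small materialized view over the ingested data; this sketch assumes only the account_status column defined above:

```sql
-- Sanity check: row counts per account status, maintained continuously.
-- If CDC is flowing, these counts move as the core system commits changes.
CREATE MATERIALIZED VIEW cdc_ingest_check AS
SELECT account_status, COUNT(*) AS row_count
FROM core_accounts
GROUP BY account_status;
```

Querying this view repeatedly during the initial snapshot phase is also a convenient way to watch the historical backfill progress.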

Building Streaming Transformations

With raw CDC data flowing in, create materialized views for the downstream use cases:

-- Real-time account balance summary for customer service
CREATE MATERIALIZED VIEW account_balance_summary AS
SELECT
    a.account_id,
    a.customer_id,
    a.account_type,
    a.account_status,
    a.currency,
    -- COALESCE so accounts with no transactions yet report 0, not NULL
    COALESCE(SUM(t.amount) FILTER (WHERE t.txn_type = 'credit'), 0) AS total_credits,
    COALESCE(SUM(t.amount) FILTER (WHERE t.txn_type = 'debit'), 0)  AS total_debits,
    COALESCE(SUM(t.amount) FILTER (WHERE t.txn_type = 'credit'), 0)
        - COALESCE(SUM(t.amount) FILTER (WHERE t.txn_type = 'debit'), 0) AS running_balance,
    MAX(t.txn_time)               AS last_transaction_time,
    COUNT(t.transaction_id)       AS transaction_count
FROM core_accounts a
LEFT JOIN ledger_transactions t
    ON a.account_id = t.account_id
GROUP BY
    a.account_id,
    a.customer_id,
    a.account_type,
    a.account_status,
    a.currency;

This view is continuously maintained. When a new debit posts in the ledger, the running_balance for that account updates incrementally, typically within a second, without any batch job or scheduled refresh.
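
A customer-service lookup then becomes an ordinary point query against the view (the customer ID shown is hypothetical):

```sql
-- Current balances for one customer, served from the incrementally
-- maintained view rather than the production core banking database.
SELECT account_id, currency, running_balance, last_transaction_time
FROM account_balance_summary
WHERE customer_id = 'CUST-1001';  -- hypothetical customer ID
```

Because the query hits RisingWave rather than the core system, customer-facing tools get fresh balances without adding read load to the production database.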

Streaming to the Data Warehouse

Propagate transformed data to downstream systems via multiple sinks:

-- Sink account summaries to the data warehouse via JDBC.
-- The summary view updates in place, so the sink is upsert-typed
-- and keyed on the view's grouping key.
CREATE SINK account_summary_dw_sink
FROM account_balance_summary
WITH (
    connector = 'jdbc',
    jdbc.url = 'jdbc:postgresql://datawarehouse.internal:5432/analytics',
    table.name = 'account_balance_summary',
    user = 'sink_user',
    password = 'secret',
    type = 'upsert',
    primary_key = 'account_id'
);

-- Sink to Kafka for downstream microservices.
-- The view emits updates, so use FORMAT UPSERT keyed on account_id.
CREATE SINK account_updates_kafka_sink
FROM account_balance_summary
WITH (
    connector = 'kafka',
    topic = 'banking.account.updates',
    properties.bootstrap.server = 'broker:9092',
    primary_key = 'account_id'
)
FORMAT UPSERT ENCODE JSON;

-- Archive to Iceberg for long-term storage.
-- force_append_only converts the view's in-place updates into
-- append-only records; data is written as Parquet files.
CREATE SINK account_history_iceberg_sink
FROM account_balance_summary
WITH (
    connector = 'iceberg',
    type = 'append-only',
    force_append_only = 'true',
    catalog.type = 'storage',
    warehouse.path = 's3://bank-data-lake/account-history',
    s3.region = 'us-east-1',
    database.name = 'banking',
    table.name = 'account_balance_history'
);
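
Downstream consumers often need only a slice of the summary. An intermediate materialized view can narrow what a sink emits; a sketch, assuming 'active' is a valid account_status value in the core system:

```sql
-- Narrow the stream before sinking: only active accounts.
-- ('active' is an assumed status value; substitute your core
-- system's actual status codes.)
CREATE MATERIALIZED VIEW active_account_summary AS
SELECT *
FROM account_balance_summary
WHERE account_status = 'active';
```

Any of the sinks above can then read FROM active_account_summary instead, so closed or dormant accounts never reach downstream topics and tables.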

Batch ETL vs. Streaming ETL for Core Banking

| Dimension | Nightly Batch ETL | Streaming ETL (RisingWave + CDC) |
| --- | --- | --- |
| Data latency | 8–12 hours | Seconds |
| Account state in downstream apps | T-1 | Near-real-time |
| Settlement reconciliation | End of day | Continuous |
| Intraday regulatory reporting | Not possible | Supported |
| ETL complexity | High (transformations in code) | SQL-only |
| Impact of source DB changes | Full pipeline redeploy | Modify materialized view |
| Data warehouse freshness | Daily | Always current |

FAQ

Q: Does CDC from core banking databases impact production performance? A: CDC reads from the database's replication log (the write-ahead log via logical replication in PostgreSQL, the binlog in MySQL) rather than querying production tables. The overhead on the source database is typically small: low single-digit CPU percentage and minimal additional I/O.

Q: How do we handle schema changes in the core banking database? A: When source table schemas change, the CDC source in RisingWave needs to be updated to reflect the new columns or types. RisingWave supports adding columns to existing CDC sources, and schema evolution is a planned area of improvement.

Q: Can we do initial historical load in addition to ongoing CDC? A: Yes. RisingWave CDC sources perform a full snapshot of the source table on first startup, followed by ongoing change streaming. This snapshot-then-stream approach ensures the materialized views are fully initialized before the streaming phase begins.

Q: How do we handle rollbacks or failed transactions in the source DB? A: CDC is based on the database's committed transaction log. Rolled-back transactions never appear in the CDC stream, so RisingWave only sees committed changes — the same guarantee as any other database replication.

Q: What is the recovery behavior after a RisingWave restart? A: RisingWave maintains CDC offsets (log sequence numbers) as part of its checkpoint state. On restart, it resumes from the last checkpoint, processing any changes that occurred during the downtime before catching up to real time.

Get Started

Modernize your core banking data pipelines with continuous streaming ETL.
