MySQL to Iceberg: Stream Database Changes to Your Lakehouse

Introduction

MySQL is the most widely deployed open-source relational database in the world, and most analytics teams have at least one critical MySQL instance they need to replicate to a data lake. The traditional approach involves scheduled mysqldump exports or batch SELECT queries that extract data every few hours. These batch exports create the same problems every data team knows well: stale analytics, heavy load on the production database, and missed deletes.

MySQL's binary log (binlog) offers a better mechanism. The binlog records every data modification that happens in the database, and CDC (Change Data Capture) tools can read it in real time to stream changes downstream. By connecting MySQL's binlog to RisingWave and sinking the results to Apache Iceberg, you can build a lakehouse pipeline that stays within seconds of production, handles inserts, updates, and deletes correctly, and requires no batch orchestration.

This tutorial shows you how to build a complete MySQL-to-Iceberg pipeline using RisingWave. If you have worked through our PostgreSQL to Iceberg guide, the structure is similar, but the underlying CDC mechanism is different, and this guide covers the MySQL-specific configuration you need.

How Does MySQL CDC Differ from PostgreSQL CDC?

Both MySQL and PostgreSQL support Change Data Capture, but they use fundamentally different replication mechanisms.

MySQL binlog vs PostgreSQL WAL

| Aspect                   | MySQL (binlog)            | PostgreSQL (WAL)           |
|--------------------------|---------------------------|----------------------------|
| Log type                 | Binary log (binlog)       | Write-Ahead Log (WAL)      |
| Log format options       | ROW, STATEMENT, MIXED     | Logical, physical          |
| Required format for CDC  | ROW                       | Logical                    |
| Server ID requirement    | Yes (unique per consumer) | No                         |
| Replication slot         | No                        | Yes                        |
| Publication/subscription | No                        | Yes (logical replication)  |
| GTID support             | Optional but recommended  | N/A                        |
| Compatible versions      | MySQL 5.7, 8.0, 8.4       | PostgreSQL 10-17           |

The most important practical difference is the server ID. MySQL requires each binlog consumer to register with a unique server.id. If two CDC consumers share the same server ID, they conflict and one is disconnected. PostgreSQL instead uses named replication slots, which the server uses to track each consumer's position and retain the WAL that consumer still needs.
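Before assigning a server.id to a new pipeline, you can check which IDs are already in use. The queries below are a quick way to see the server's own ID and any currently connected binlog consumers, which appear as Binlog Dump threads:

```sql
-- The server's own ID; consumers must not reuse it
SHOW VARIABLES LIKE 'server_id';

-- Connected binlog consumers show up with Command = 'Binlog Dump'
SHOW PROCESSLIST;
```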

Binlog format requirements

MySQL's binlog supports three formats, and CDC requires row-based logging:

  • ROW format (required for CDC): Records the actual row data that changed. This is the only format that gives you the before and after values for each row.
  • STATEMENT format: Records the SQL statements that caused changes. Not suitable for CDC because the same statement can produce different results on different replicas.
  • MIXED format: Switches between statement-based and row-based logging per statement. Not reliable for CDC because some changes are still logged as statements.
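If a running server is currently on STATEMENT or MIXED, you can switch it without editing my.cnf. This is a sketch for MySQL 8.0+, where SET PERSIST also survives restarts; note that sessions already open keep their old format, and on production servers a config change plus a restart is the safer route:

```sql
-- Switch to row-based logging with full row images (MySQL 8.0+)
SET PERSIST binlog_format = 'ROW';
SET PERSIST binlog_row_image = 'FULL';
```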

What Do You Need Before Starting?

MySQL prerequisites

Configure your MySQL instance for row-based binlog replication:

-- Check current binlog settings
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';

Update my.cnf (or my.ini on Windows) if needed:

[mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
expire_logs_days = 7   # deprecated on MySQL 8.0+; prefer binlog_expire_logs_seconds = 604800

Restart MySQL after changing these settings.

Create a dedicated CDC user with the required permissions:

-- Create a CDC user with replication privileges
CREATE USER 'cdc_reader'@'%' IDENTIFIED BY 'your_secure_password';

-- Grant replication permissions
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
ON *.* TO 'cdc_reader'@'%';

-- Grant read access to the target database
GRANT SELECT ON ecommerce.* TO 'cdc_reader'@'%';

FLUSH PRIVILEGES;

For AWS RDS MySQL, enable automated backups (which enables binlog) and set binlog_format to ROW in the parameter group.

RisingWave setup

Start a RisingWave instance for testing:

docker run -it --pull=always -p 4566:4566 -p 5691:5691 risingwavelabs/risingwave:latest single_node

Or use RisingWave Cloud for a managed deployment.

Iceberg storage

Create an S3 bucket for your Iceberg warehouse:

aws s3 mb s3://my-mysql-lakehouse

How Do You Build the MySQL to Iceberg Pipeline?

Here is the step-by-step pipeline.

Step 1: Create the MySQL CDC source

Connect RisingWave to MySQL using the MySQL CDC connector:

CREATE SOURCE mysql_ecommerce WITH (
    connector = 'mysql-cdc',
    hostname = '192.168.1.200',
    port = '3306',
    username = 'cdc_reader',
    password = 'your_secure_password',
    database.name = 'ecommerce',
    server.id = '5001'
);

Key differences from PostgreSQL CDC:

  • The connector is 'mysql-cdc' instead of 'postgres-cdc'
  • You must specify a unique server.id (any integer that is not used by another binlog consumer)
  • There is no schema.name parameter because MySQL does not have schemas within databases (the database itself serves as the namespace)
  • There is no slot.name because MySQL does not use replication slots

Step 2: Create tables from the MySQL source

For each upstream MySQL table, create a corresponding table in RisingWave:

-- Orders table
CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,
    customer_id BIGINT,
    product_id INT,
    quantity INT,
    unit_price DECIMAL(10, 2),
    total_amount DECIMAL(12, 2),
    order_status VARCHAR,
    shipping_method VARCHAR,
    created_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ
) FROM mysql_ecommerce TABLE 'ecommerce.orders';

-- Customers table
CREATE TABLE customers (
    customer_id BIGINT PRIMARY KEY,
    first_name VARCHAR,
    last_name VARCHAR,
    email VARCHAR,
    phone VARCHAR,
    city VARCHAR,
    country VARCHAR,
    registered_at TIMESTAMPTZ
) FROM mysql_ecommerce TABLE 'ecommerce.customers';

-- Products table
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR,
    category VARCHAR,
    subcategory VARCHAR,
    price DECIMAL(10, 2),
    cost DECIMAL(10, 2),
    supplier_id INT
) FROM mysql_ecommerce TABLE 'ecommerce.products';

Notice the table reference format: in MySQL CDC, you specify 'database.table' (e.g., 'ecommerce.orders'), while PostgreSQL uses 'schema.table' (e.g., 'public.orders').

When these tables are created, RisingWave:

  1. Takes an initial snapshot of the existing data via a SELECT query
  2. Records the current binlog position
  3. Starts streaming binlog events from that position forward
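A quick sanity check once the snapshot finishes is to compare row counts between the two systems. Run the first query in RisingWave and the second in MySQL; the counts may lag briefly while the snapshot is still in progress:

```sql
-- In RisingWave: rows ingested so far
SELECT COUNT(*) FROM orders;

-- In MySQL: the source of truth
SELECT COUNT(*) FROM ecommerce.orders;
```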

Step 3: Build transformations with materialized views

Create SQL transformations that enrich and aggregate the raw CDC data:

-- Revenue analytics by customer segment
CREATE MATERIALIZED VIEW customer_revenue AS
SELECT
    c.customer_id,
    c.first_name || ' ' || c.last_name AS full_name,
    c.country,
    c.city,
    COUNT(o.order_id) AS total_orders,
    SUM(o.total_amount) AS lifetime_revenue,
    AVG(o.total_amount) AS avg_order_value,
    MAX(o.created_at) AS most_recent_order,
    CASE
        WHEN COUNT(o.order_id) >= 20 THEN 'champion'
        WHEN COUNT(o.order_id) >= 10 THEN 'loyal'
        WHEN COUNT(o.order_id) >= 3 THEN 'regular'
        ELSE 'new'
    END AS customer_segment
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
    AND o.order_status IN ('completed', 'shipped')
GROUP BY c.customer_id, c.first_name, c.last_name, c.country, c.city;
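Because the view is maintained incrementally, you can query it at any point and the results already reflect the latest binlog events. For example, to see how customers are distributed across segments:

```sql
-- Segment distribution, updated continuously as orders arrive
SELECT customer_segment, COUNT(*) AS customers
FROM customer_revenue
GROUP BY customer_segment
ORDER BY customers DESC;
```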
-- Product margin analysis
CREATE MATERIALIZED VIEW product_margins AS
SELECT
    p.product_id,
    p.product_name,
    p.category,
    p.subcategory,
    p.price AS list_price,
    p.cost,
    p.price - p.cost AS margin_per_unit,
    CASE
        WHEN p.cost > 0 THEN ROUND(((p.price - p.cost) / p.cost) * 100, 1)
        ELSE 0
    END AS margin_percentage,
    COUNT(o.order_id) AS units_sold,
    SUM(o.quantity * (p.price - p.cost)) AS total_margin
FROM products p
LEFT JOIN orders o ON p.product_id = o.product_id
    AND o.order_status = 'completed'
GROUP BY p.product_id, p.product_name, p.category, p.subcategory, p.price, p.cost;
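To spot-check the margin math before sinking the view downstream, a quick query over the highest-margin products is enough:

```sql
-- Top products by total margin
SELECT product_name, list_price, cost, margin_percentage, units_sold
FROM product_margins
ORDER BY total_margin DESC
LIMIT 10;
```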
-- Denormalized order facts for the data lake
CREATE MATERIALIZED VIEW order_facts AS
SELECT
    o.order_id,
    o.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    c.country,
    c.city,
    o.product_id,
    p.product_name,
    p.category,
    p.subcategory,
    o.quantity,
    o.unit_price,
    o.total_amount,
    p.cost * o.quantity AS total_cost,
    o.total_amount - (p.cost * o.quantity) AS order_margin,
    o.order_status,
    o.shipping_method,
    o.created_at,
    o.updated_at
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;

Step 4: Sink to Iceberg

Create Iceberg sinks for each materialized view:

-- Sink customer revenue analytics
CREATE SINK customer_revenue_iceberg FROM customer_revenue
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'customer_id',
    warehouse.path = 's3://my-mysql-lakehouse/warehouse',
    database.name = 'analytics',
    table.name = 'customer_revenue',
    catalog.type = 'rest',
    catalog.name = 'lakehouse_catalog',
    create_table_if_not_exists = 'true',
    commit_checkpoint_interval = '60',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    s3.region = 'us-west-2'
);

-- Sink product margins
CREATE SINK product_margins_iceberg FROM product_margins
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'product_id',
    warehouse.path = 's3://my-mysql-lakehouse/warehouse',
    database.name = 'analytics',
    table.name = 'product_margins',
    catalog.type = 'rest',
    catalog.name = 'lakehouse_catalog',
    create_table_if_not_exists = 'true',
    commit_checkpoint_interval = '60',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    s3.region = 'us-west-2'
);

-- Sink order facts
CREATE SINK order_facts_iceberg FROM order_facts
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'order_id',
    warehouse.path = 's3://my-mysql-lakehouse/warehouse',
    database.name = 'analytics',
    table.name = 'order_facts',
    catalog.type = 'rest',
    catalog.name = 'lakehouse_catalog',
    create_table_if_not_exists = 'true',
    commit_checkpoint_interval = '60',
    s3.access.key = '${AWS_ACCESS_KEY}',
    s3.secret.key = '${AWS_SECRET_KEY}',
    s3.region = 'us-west-2'
);

Step 5: Verify the pipeline

Confirm everything is working:

-- Check active sinks
SHOW SINKS;

-- Check the source
SHOW SOURCES;

-- Verify data is flowing
SELECT * FROM customer_revenue ORDER BY lifetime_revenue DESC LIMIT 5;

Expected output:

 customer_id |   full_name    | country | city      | total_orders | lifetime_revenue | avg_order_value |   most_recent_order    | customer_segment
-------------+----------------+---------+-----------+--------------+------------------+-----------------+------------------------+-----------------
         147 | Sarah Kim      | US      | Seattle   |           28 |         14520.00 |          518.57 | 2026-03-29 09:15:00+00 | champion
          83 | James Liu      | CA      | Toronto   |           15 |          8340.00 |          556.00 | 2026-03-28 22:30:00+00 | loyal
         201 | Maria Garcia   | MX      | CDMX      |           12 |          6780.00 |          565.00 | 2026-03-29 11:45:00+00 | loyal
          42 | Alex Mueller   | DE      | Berlin    |            8 |          4120.00 |          515.00 | 2026-03-27 16:20:00+00 | regular
         310 | Yuki Tanaka    | JP      | Tokyo     |            4 |          2890.00 |          722.50 | 2026-03-29 03:10:00+00 | regular

How Do You Handle MySQL-Specific CDC Challenges?

MySQL CDC has a few operational considerations that differ from PostgreSQL.

Server ID conflicts

Every binlog consumer needs a unique server.id. If you run multiple RisingWave instances or other CDC tools (Debezium, Maxwell) reading from the same MySQL server, each must have a different server ID:

-- First RisingWave pipeline
CREATE SOURCE pipeline_1 WITH (
    connector = 'mysql-cdc',
    hostname = 'mysql-prod',
    port = '3306',
    username = 'cdc_reader',
    password = 'password',
    database.name = 'ecommerce',
    server.id = '5001'
);

-- Second RisingWave pipeline (different database)
CREATE SOURCE pipeline_2 WITH (
    connector = 'mysql-cdc',
    hostname = 'mysql-prod',
    port = '3306',
    username = 'cdc_reader',
    password = 'password',
    database.name = 'inventory',
    server.id = '5002'
);

Binlog retention

MySQL automatically purges binlog files based on expire_logs_days (or binlog_expire_logs_seconds in MySQL 8.0+). If RisingWave is offline longer than the retention period, the binlog events are lost and you need to re-snapshot:

-- Check binlog retention on MySQL
SHOW VARIABLES LIKE 'expire_logs_days';
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';

Set a retention period that gives you enough buffer for planned maintenance windows.
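For example, to keep roughly seven days of binlog on MySQL 8.0+ (adjust the value to your own maintenance windows; SET PERSIST keeps the setting across restarts):

```sql
-- 7 days of binlog retention, expressed in seconds
SET PERSIST binlog_expire_logs_seconds = 604800;
```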

Auto schema change support

Like the PostgreSQL connector, MySQL CDC in RisingWave supports automatic schema change propagation:

CREATE SOURCE mysql_with_schema_change WITH (
    connector = 'mysql-cdc',
    hostname = 'mysql-prod',
    port = '3306',
    username = 'cdc_reader',
    password = 'password',
    database.name = 'ecommerce',
    server.id = '5001',
    auto.schema.change = 'true'
);

-- Use (*) for automatic column mapping
CREATE TABLE orders (*) FROM mysql_with_schema_change TABLE 'ecommerce.orders';

When someone runs ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50) on MySQL, RisingWave detects the DDL change in the binlog and propagates it downstream.
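For example, after this DDL runs on MySQL, the new column should appear on the RisingWave side shortly afterward (exact DESCRIBE output varies by RisingWave version):

```sql
-- On MySQL
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50);

-- On RisingWave: confirm the column propagated
DESCRIBE orders;
```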

MariaDB and TiDB compatibility

RisingWave's MySQL CDC connector is compatible with MariaDB and TiDB, which use MySQL-compatible binlog formats. The setup is identical: point the connector at your MariaDB or TiDB host.

How Do You Query the Iceberg Tables?

Once data flows into Iceberg, any compatible query engine can read it.

Trino query example

-- Top customers by revenue and segment
SELECT
    customer_name,
    country,
    customer_segment,
    lifetime_revenue,
    total_orders
FROM analytics.customer_revenue
WHERE customer_segment IN ('champion', 'loyal')
ORDER BY lifetime_revenue DESC
LIMIT 20;

Spark SQL query example

-- Product category margin analysis
SELECT
    category,
    COUNT(*) AS products,
    ROUND(AVG(margin_percentage), 1) AS avg_margin_pct,
    SUM(total_margin) AS total_category_margin,
    SUM(units_sold) AS total_units
FROM analytics.product_margins
GROUP BY category
ORDER BY total_category_margin DESC;

Time travel with streaming-created snapshots

Because the streaming pipeline creates Iceberg snapshots every 60 seconds, you get fine-grained time travel:

-- What did customer revenue look like yesterday at 3 PM?
SELECT customer_name, lifetime_revenue, customer_segment
FROM analytics.customer_revenue
TIMESTAMP AS OF '2026-03-28 15:00:00'
WHERE customer_id = 147;
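To find a snapshot to travel to, you can list the commits the sink has produced through Iceberg's metadata tables. In Spark this looks roughly like the following (column names follow Iceberg's snapshots metadata table):

```sql
-- Recent Iceberg snapshots for the table
SELECT committed_at, snapshot_id, operation
FROM analytics.customer_revenue.snapshots
ORDER BY committed_at DESC
LIMIT 10;
```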

How Does This Compare to a Debezium-Based Setup?

Many teams currently use Debezium with Kafka Connect to stream MySQL changes to a data lake. Here is how the RisingWave approach compares:

| Aspect                 | Debezium + Kafka + Spark                                 | RisingWave                 |
|------------------------|----------------------------------------------------------|----------------------------|
| Components             | MySQL, Debezium, Kafka, Schema Registry, Spark, Iceberg  | MySQL, RisingWave, Iceberg |
| Configuration          | Debezium connector JSON + Spark job config               | SQL statements             |
| Transformation         | Spark (Scala/Python) or ksqlDB                           | SQL materialized views     |
| Latency                | Seconds to minutes (Kafka) + batch interval (Spark)      | Seconds                    |
| Schema evolution       | Schema Registry + manual connector updates               | auto.schema.change         |
| Operational complexity | High (multiple systems to monitor)                       | Low (single system)        |

The key advantage of RisingWave's approach is the elimination of Kafka as a middle layer. RisingWave reads the binlog directly, processes the data with SQL, and writes to Iceberg, all in one system. This removes the operational burden of managing Kafka clusters, Debezium connectors, and Schema Registry.

FAQ

Do I need Kafka or Debezium to stream MySQL changes to Iceberg with RisingWave?

No. RisingWave connects directly to MySQL's binlog replication protocol without requiring Kafka, Debezium, or any external message broker. This simplifies infrastructure and reduces operational overhead. RisingWave handles snapshotting, binlog parsing, transformation, and Iceberg writing in a single system.

What MySQL versions are compatible with RisingWave's CDC connector?

RisingWave's MySQL CDC connector supports MySQL 5.7, 8.0, and 8.4. It is also compatible with MySQL-protocol-compatible databases like MariaDB and TiDB. The binlog must be configured with ROW format and FULL row image for CDC to work correctly.

How does the MySQL CDC pipeline handle deletes?

When a row is deleted in MySQL, the binlog records the delete event with the full row data. RisingWave processes this delete and propagates it downstream. If your Iceberg sink uses type = 'upsert', deletes are applied to the Iceberg table, keeping it synchronized with the source. If you want to preserve delete history, you can use an append-only pattern with a change_type column.

What happens if the MySQL binlog is purged before RisingWave reads it?

If binlog files are purged while RisingWave is offline (due to expire_logs_days or manual PURGE BINARY LOGS), the CDC connector cannot resume from its last position. In this case, you need to drop and recreate the source and tables in RisingWave, which triggers a fresh snapshot of the current data followed by streaming from the current binlog position.
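The recovery sequence looks roughly like this, using the object names from this tutorial; drop order matters because sinks and materialized views depend on the tables, which depend on the source:

```sql
-- Drop dependents first (repeat for every sink, view, and table)
DROP SINK order_facts_iceberg;
DROP MATERIALIZED VIEW order_facts;
DROP TABLE orders;
DROP SOURCE mysql_ecommerce;

-- Recreating the source and tables triggers a fresh snapshot,
-- then streaming resumes from the current binlog position
CREATE SOURCE mysql_ecommerce WITH (
    connector = 'mysql-cdc',
    hostname = '192.168.1.200',
    port = '3306',
    username = 'cdc_reader',
    password = 'your_secure_password',
    database.name = 'ecommerce',
    server.id = '5001'
);
```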

Conclusion

Building a real-time lakehouse from MySQL does not require a complex stack of Debezium, Kafka, and Spark. A streaming SQL engine like RisingWave reads the binlog directly, transforms the data with standard SQL, and writes to Iceberg continuously.

Key takeaways:

  • MySQL CDC reads the binlog to capture every insert, update, and delete in real time
  • The main difference from PostgreSQL CDC is the server.id requirement and binlog format configuration
  • RisingWave connects directly to MySQL without Kafka or Debezium
  • SQL materialized views transform and enrich CDC data before it lands in Iceberg
  • The Iceberg sink creates snapshots every minute, enabling fine-grained time travel and analytics
  • The same pipeline pattern works with MariaDB and TiDB

Ready to stream MySQL to your Iceberg lakehouse? Try RisingWave Cloud free, no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
