Introduction
MySQL is one of the most widely deployed open-source relational databases in the world, and most analytics teams have at least one critical MySQL instance they need to replicate to a data lake. The traditional approach involves scheduled mysqldump exports or batch SELECT queries that extract data every few hours. These batch exports create the same problems every data team knows well: stale analytics, heavy load on the production database, and missed deletes.
MySQL's binary log (binlog) offers a better mechanism. The binlog records every data modification that happens in the database, and CDC (Change Data Capture) tools can read it in real time to stream changes downstream. By connecting MySQL's binlog to RisingWave and sinking the results to Apache Iceberg, you can build a lakehouse pipeline that stays within seconds of production, handles inserts, updates, and deletes correctly, and requires no batch orchestration.
This tutorial shows you how to build a complete MySQL-to-Iceberg pipeline using RisingWave. If you have worked through our PostgreSQL to Iceberg guide, the structure is similar, but the underlying CDC mechanism is different, and this guide covers the MySQL-specific configuration you need.
How Does MySQL CDC Differ from PostgreSQL CDC?
Both MySQL and PostgreSQL support Change Data Capture, but they use fundamentally different replication mechanisms.
MySQL binlog vs PostgreSQL WAL
| Aspect | MySQL (binlog) | PostgreSQL (WAL) |
| --- | --- | --- |
| Log type | Binary log (binlog) | Write-Ahead Log (WAL) |
| Log format options | ROW, STATEMENT, MIXED | Logical, Physical |
| Required format for CDC | ROW | Logical |
| Server ID requirement | Yes (unique per consumer) | No |
| Replication slot | No | Yes |
| Publication/subscription | No | Yes (logical replication) |
| GTID support | Optional but recommended | N/A |
| Compatible versions | MySQL 5.7, 8.0, 8.4 | PostgreSQL 10-17 |
The most important practical difference is the server ID. MySQL requires each binlog consumer to register with a unique server.id. If two CDC consumers share the same server ID, they will conflict and one will be disconnected. PostgreSQL uses replication slots instead, which are named and managed by the server.
binlog format requirements
MySQL's binlog supports three formats, and CDC requires row-based logging:
- ROW format (required for CDC): Records the actual row data that changed. This is the only format that gives you the before and after values for each row.
- STATEMENT format: Records the SQL statements that caused changes. Not suitable for CDC because the same statement can produce different results on different replicas.
- MIXED format: Combination of both. Not reliable for CDC.
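To make the non-determinism concrete, here is a small Python sketch (purely illustrative, not MySQL internals): replaying a statement that calls a non-deterministic function re-evaluates it on the replica, while replaying a row event simply copies the values the source actually wrote.

```python
import random

# Source executes: UPDATE orders SET token = RAND() WHERE id = 1.
# STATEMENT replication re-runs the SQL text on the replica, so a
# non-deterministic function is re-evaluated and can diverge.
def apply_statement(row, rng):
    return {**row, "token": rng.random()}

# ROW replication ships the after-image, so the replica copies it verbatim.
def apply_row_event(row, after_image):
    return {**row, **after_image}

source = apply_statement({"id": 1, "token": None}, random.Random(1))
replica_stmt = apply_statement({"id": 1, "token": None}, random.Random(2))
replica_row = apply_row_event({"id": 1, "token": None}, {"token": source["token"]})

print(replica_row == source)   # True: ROW stays in sync
print(replica_stmt == source)  # False: STATEMENT diverged
```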
What Do You Need Before Starting?
MySQL prerequisites
Configure your MySQL instance for row-based binlog replication:
-- Check current binlog settings
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
Update my.cnf (or my.ini on Windows) if needed:
[mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
expire_logs_days = 7
Restart MySQL after changing these settings. Note that on MySQL 8.0 and later, expire_logs_days is deprecated in favor of binlog_expire_logs_seconds (for example, 604800 for seven days).
Create a dedicated CDC user with the required permissions:
-- Create a CDC user with replication privileges
CREATE USER 'cdc_reader'@'%' IDENTIFIED BY 'your_secure_password';
-- Grant replication permissions
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
ON *.* TO 'cdc_reader'@'%';
-- Grant read access to the target database
GRANT SELECT ON ecommerce.* TO 'cdc_reader'@'%';
FLUSH PRIVILEGES;
For AWS RDS MySQL, enable automated backups (which enables binlog) and set binlog_format to ROW in the parameter group.
RisingWave setup
Start a RisingWave instance for testing:
docker run -it --pull=always -p 4566:4566 -p 5691:5691 risingwavelabs/risingwave:latest single_node
Or use RisingWave Cloud for a managed deployment.
Iceberg storage
Create an S3 bucket for your Iceberg warehouse:
aws s3 mb s3://my-mysql-lakehouse
How Do You Build the MySQL to Iceberg Pipeline?
Here is the step-by-step pipeline.
Step 1: Create the MySQL CDC source
Connect RisingWave to MySQL using the MySQL CDC connector:
CREATE SOURCE mysql_ecommerce WITH (
connector = 'mysql-cdc',
hostname = '192.168.1.200',
port = '3306',
username = 'cdc_reader',
password = 'your_secure_password',
database.name = 'ecommerce',
server.id = '5001'
);
Key differences from PostgreSQL CDC:
- The connector is 'mysql-cdc' instead of 'postgres-cdc'
- You must specify a unique server.id (any integer that is not used by another binlog consumer)
- There is no schema.name parameter because MySQL does not have schemas within databases (the database itself serves as the namespace)
- There is no slot.name because MySQL does not use replication slots
Step 2: Create tables from the MySQL source
For each upstream MySQL table, create a corresponding table in RisingWave:
-- Orders table
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY,
customer_id BIGINT,
product_id INT,
quantity INT,
unit_price DECIMAL(10, 2),
total_amount DECIMAL(12, 2),
order_status VARCHAR,
shipping_method VARCHAR,
created_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ
) FROM mysql_ecommerce TABLE 'ecommerce.orders';
-- Customers table
CREATE TABLE customers (
customer_id BIGINT PRIMARY KEY,
first_name VARCHAR,
last_name VARCHAR,
email VARCHAR,
phone VARCHAR,
city VARCHAR,
country VARCHAR,
registered_at TIMESTAMPTZ
) FROM mysql_ecommerce TABLE 'ecommerce.customers';
-- Products table
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR,
category VARCHAR,
subcategory VARCHAR,
price DECIMAL(10, 2),
cost DECIMAL(10, 2),
supplier_id INT
) FROM mysql_ecommerce TABLE 'ecommerce.products';
Notice the table reference format: in MySQL CDC, you specify 'database.table' (e.g., 'ecommerce.orders'), while PostgreSQL uses 'schema.table' (e.g., 'public.orders').
When these tables are created, RisingWave:
- Takes an initial snapshot of the existing data via a SELECT query
- Records the current binlog position
- Starts streaming binlog events from that position forward
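The snapshot-then-stream handoff can be sketched in a few lines of Python. This is a conceptual model with made-up binlog positions, not RisingWave's implementation: events recorded before the snapshot position are skipped because the snapshot already contains their effects.

```python
# Toy model of CDC bootstrap: snapshot the table, record the binlog
# position, then apply only events at or after that position.
def bootstrap(snapshot_rows, binlog, snapshot_pos):
    state = {r["id"]: r for r in snapshot_rows}   # 1. initial snapshot
    for pos, op, row in binlog:                   # 3. stream events forward
        if pos < snapshot_pos:                    # 2. already in the snapshot
            continue
        if op == "delete":
            state.pop(row["id"], None)
        else:                                     # insert or update
            state[row["id"]] = row
    return state

snapshot = [{"id": 1, "status": "pending"}]
binlog = [
    (100, "insert", {"id": 1, "status": "pending"}),    # pre-snapshot, skipped
    (205, "update", {"id": 1, "status": "completed"}),
    (210, "insert", {"id": 2, "status": "pending"}),
]
state = bootstrap(snapshot, binlog, snapshot_pos=200)
print(state[1]["status"])  # completed
print(len(state))          # 2
```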
Step 3: Build transformations with materialized views
Create SQL transformations that enrich and aggregate the raw CDC data:
-- Revenue analytics by customer segment
CREATE MATERIALIZED VIEW customer_revenue AS
SELECT
c.customer_id,
c.first_name || ' ' || c.last_name AS full_name,
c.country,
c.city,
COUNT(o.order_id) AS total_orders,
SUM(o.total_amount) AS lifetime_revenue,
AVG(o.total_amount) AS avg_order_value,
MAX(o.created_at) AS most_recent_order,
CASE
WHEN COUNT(o.order_id) >= 20 THEN 'champion'
WHEN COUNT(o.order_id) >= 10 THEN 'loyal'
WHEN COUNT(o.order_id) >= 3 THEN 'regular'
ELSE 'new'
END AS customer_segment
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
AND o.order_status IN ('completed', 'shipped')
GROUP BY c.customer_id, c.first_name, c.last_name, c.country, c.city;
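The CASE expression's segmentation thresholds, restated as a plain Python function in case you want to sanity-check the boundaries (illustrative only):

```python
# Same thresholds as the CASE expression in customer_revenue.
def customer_segment(total_orders: int) -> str:
    if total_orders >= 20:
        return "champion"
    if total_orders >= 10:
        return "loyal"
    if total_orders >= 3:
        return "regular"
    return "new"

print(customer_segment(28))  # champion
print(customer_segment(2))   # new
```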
-- Product margin analysis
CREATE MATERIALIZED VIEW product_margins AS
SELECT
p.product_id,
p.product_name,
p.category,
p.subcategory,
p.price AS list_price,
p.cost,
p.price - p.cost AS margin_per_unit,
CASE
WHEN p.cost > 0 THEN ROUND(((p.price - p.cost) / p.cost) * 100, 1)
ELSE 0
END AS margin_percentage,
COUNT(o.order_id) AS units_sold,
SUM(o.quantity * (p.price - p.cost)) AS total_margin
FROM products p
LEFT JOIN orders o ON p.product_id = o.product_id
AND o.order_status = 'completed'
GROUP BY p.product_id, p.product_name, p.category, p.subcategory, p.price, p.cost;
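The margin_percentage logic, extracted into Python to verify the zero-cost guard and the rounding behavior (illustrative only):

```python
# Markup over cost, guarded against division by zero, rounded to one
# decimal place, mirroring the CASE expression in product_margins.
def margin_percentage(price: float, cost: float) -> float:
    if cost <= 0:
        return 0.0
    return round((price - cost) / cost * 100, 1)

print(margin_percentage(50.0, 30.0))  # 66.7
print(margin_percentage(10.0, 0.0))   # 0.0
```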
-- Denormalized order facts for the data lake
CREATE MATERIALIZED VIEW order_facts AS
SELECT
o.order_id,
o.customer_id,
c.first_name || ' ' || c.last_name AS customer_name,
c.country,
c.city,
o.product_id,
p.product_name,
p.category,
p.subcategory,
o.quantity,
o.unit_price,
o.total_amount,
p.cost * o.quantity AS total_cost,
o.total_amount - (p.cost * o.quantity) AS order_margin,
o.order_status,
o.shipping_method,
o.created_at,
o.updated_at
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;
Step 4: Sink to Iceberg
Create Iceberg sinks for each materialized view:
-- Sink customer revenue analytics
CREATE SINK customer_revenue_iceberg FROM customer_revenue
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'customer_id',
warehouse.path = 's3://my-mysql-lakehouse/warehouse',
database.name = 'analytics',
table.name = 'customer_revenue',
catalog.type = 'rest',
catalog.name = 'lakehouse_catalog',
create_table_if_not_exists = 'true',
commit_checkpoint_interval = '60',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
s3.region = 'us-west-2'
);
-- Sink product margins
CREATE SINK product_margins_iceberg FROM product_margins
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'product_id',
warehouse.path = 's3://my-mysql-lakehouse/warehouse',
database.name = 'analytics',
table.name = 'product_margins',
catalog.type = 'rest',
catalog.name = 'lakehouse_catalog',
create_table_if_not_exists = 'true',
commit_checkpoint_interval = '60',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
s3.region = 'us-west-2'
);
-- Sink order facts
CREATE SINK order_facts_iceberg FROM order_facts
WITH (
connector = 'iceberg',
type = 'upsert',
primary_key = 'order_id',
warehouse.path = 's3://my-mysql-lakehouse/warehouse',
database.name = 'analytics',
table.name = 'order_facts',
catalog.type = 'rest',
catalog.name = 'lakehouse_catalog',
create_table_if_not_exists = 'true',
commit_checkpoint_interval = '60',
s3.access.key = '${AWS_ACCESS_KEY}',
s3.secret.key = '${AWS_SECRET_KEY}',
s3.region = 'us-west-2'
);
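A quick sanity check on commit cadence: assuming a one-second system checkpoint interval (an assumption here; check your cluster's configured barrier interval), commit_checkpoint_interval = '60' yields roughly one Iceberg snapshot per minute.

```python
# Back-of-envelope commit cadence for the sinks above.
checkpoint_interval_s = 1          # assumed checkpoint cadence, not verified
commit_checkpoint_interval = 60    # from the sink definitions

seconds_per_commit = checkpoint_interval_s * commit_checkpoint_interval
snapshots_per_day = 24 * 60 * 60 // seconds_per_commit
print(seconds_per_commit)  # 60
print(snapshots_per_day)   # 1440
```

Keep the snapshot count in mind when planning Iceberg maintenance: more frequent commits mean more metadata and more small files to compact.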
Step 5: Verify the pipeline
Confirm everything is working:
-- Check active sinks
SHOW SINKS;
-- Check the source
SHOW SOURCES;
-- Verify data is flowing
SELECT * FROM customer_revenue ORDER BY lifetime_revenue DESC LIMIT 5;
Expected output:
customer_id | full_name | country | city | total_orders | lifetime_revenue | avg_order_value | most_recent_order | customer_segment
-------------+----------------+---------+-----------+--------------+------------------+-----------------+------------------------+-----------------
147 | Sarah Kim | US | Seattle | 28 | 14520.00 | 518.57 | 2026-03-29 09:15:00+00 | champion
83 | James Liu | CA | Toronto | 15 | 8340.00 | 556.00 | 2026-03-28 22:30:00+00 | loyal
201 | Maria Garcia | MX | CDMX | 12 | 6780.00 | 565.00 | 2026-03-29 11:45:00+00 | loyal
42 | Alex Mueller | DE | Berlin | 8 | 4120.00 | 515.00 | 2026-03-27 16:20:00+00 | regular
310 | Yuki Tanaka | JP | Tokyo | 4 | 2890.00 | 722.50 | 2026-03-29 03:10:00+00 | regular
How Do You Handle MySQL-Specific CDC Challenges?
MySQL CDC has a few operational considerations that differ from PostgreSQL.
Server ID conflicts
Every binlog consumer needs a unique server.id. If you run multiple RisingWave instances or other CDC tools (Debezium, Maxwell) reading from the same MySQL server, each must have a different server ID:
-- First RisingWave pipeline
CREATE SOURCE pipeline_1 WITH (
connector = 'mysql-cdc',
hostname = 'mysql-prod',
port = '3306',
username = 'cdc_reader',
password = 'password',
database.name = 'ecommerce',
server.id = '5001'
);
-- Second RisingWave pipeline (different database)
CREATE SOURCE pipeline_2 WITH (
connector = 'mysql-cdc',
hostname = 'mysql-prod',
port = '3306',
username = 'cdc_reader',
password = 'password',
database.name = 'inventory',
server.id = '5002'
);
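If you manage many pipelines, a tiny allocator helps avoid collisions. This sketch assumes you track used IDs in your own deployment inventory, since MySQL offers no single registry of consumer server IDs:

```python
# Pick a server.id that no other binlog consumer is using.
# used_ids comes from your own inventory of CDC deployments.
def next_server_id(used_ids, start=5001, end=6000):
    for candidate in range(start, end + 1):
        if candidate not in used_ids:
            return candidate
    raise RuntimeError("no free server.id in range")

used = {5001, 5002}  # pipeline_1 and pipeline_2 above
print(next_server_id(used))  # 5003
```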
Binlog retention
MySQL automatically purges binlog files based on expire_logs_days (or binlog_expire_logs_seconds in MySQL 8.0+). If RisingWave is offline longer than the retention period, the binlog events are lost and you need to re-snapshot:
-- Check binlog retention on MySQL
SHOW VARIABLES LIKE 'expire_logs_days';
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
Set a retention period that gives you enough buffer for planned maintenance windows.
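One simple sizing rule, expressed in Python (the safety factor is a suggestion, not an official guideline): cover your longest expected outage with headroom, and feed the result into binlog_expire_logs_seconds.

```python
# Retention sizing: longest planned downtime, doubled for safety.
def retention_seconds(max_downtime_hours: float, safety_factor: float = 2.0) -> int:
    return int(max_downtime_hours * 3600 * safety_factor)

# A 24-hour maintenance window, doubled: two days of binlog retention.
print(retention_seconds(24))  # 172800
```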
Auto schema change support
Like the PostgreSQL connector, MySQL CDC in RisingWave supports automatic schema change propagation:
CREATE SOURCE mysql_with_schema_change WITH (
connector = 'mysql-cdc',
hostname = 'mysql-prod',
port = '3306',
username = 'cdc_reader',
password = 'password',
database.name = 'ecommerce',
server.id = '5001',
auto.schema.change = 'true'
);
-- Use (*) for automatic column mapping
CREATE TABLE orders (*) FROM mysql_with_schema_change TABLE 'ecommerce.orders';
When someone runs ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50) on MySQL, RisingWave detects the DDL change in the binlog and propagates it downstream.
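Conceptually, the connector applies the DDL event to its copy of the table schema. This toy Python version (a simplification, not RisingWave's actual DDL handling) shows the idea for an ADD COLUMN statement:

```python
import re

# Toy DDL propagation: parse an ADD COLUMN statement from the binlog
# and extend the downstream schema accordingly.
def apply_add_column(schema: dict, ddl: str) -> dict:
    m = re.match(r"ALTER TABLE (\w+) ADD COLUMN (\w+) (\w+)", ddl, re.IGNORECASE)
    if not m:
        return schema
    _table, column, col_type = m.groups()
    return {**schema, column: col_type}

orders_schema = {"order_id": "BIGINT", "total_amount": "DECIMAL"}
orders_schema = apply_add_column(
    orders_schema, "ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50)"
)
print("discount_code" in orders_schema)  # True
```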
MariaDB and TiDB compatibility
RisingWave's MySQL CDC connector is compatible with MariaDB and TiDB, which use MySQL-compatible binlog formats. The setup is the same; just point the connector at your MariaDB or TiDB host.
How Do You Query the Iceberg Tables?
Once data flows into Iceberg, any compatible query engine can read it.
Trino query example
-- Top customers by revenue and segment
SELECT
full_name,
country,
customer_segment,
lifetime_revenue,
total_orders
FROM analytics.customer_revenue
WHERE customer_segment IN ('champion', 'loyal')
ORDER BY lifetime_revenue DESC
LIMIT 20;
Spark SQL query example
-- Product category margin analysis
SELECT
category,
COUNT(*) AS products,
ROUND(AVG(margin_percentage), 1) AS avg_margin_pct,
SUM(total_margin) AS total_category_margin,
SUM(units_sold) AS total_units
FROM analytics.product_margins
GROUP BY category
ORDER BY total_category_margin DESC;
Time travel with streaming-created snapshots
Because the streaming pipeline creates Iceberg snapshots every 60 seconds, you get fine-grained time travel. The query below uses Spark SQL's time travel syntax; in Trino, the equivalent clause is FOR TIMESTAMP AS OF TIMESTAMP '2026-03-28 15:00:00':
-- What did customer revenue look like yesterday at 3 PM?
SELECT full_name, lifetime_revenue, customer_segment
FROM analytics.customer_revenue
TIMESTAMP AS OF '2026-03-28 15:00:00'
WHERE customer_id = 147;
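Under the hood, TIMESTAMP AS OF resolves to the latest snapshot committed at or before the requested time. A small Python sketch with made-up snapshot IDs:

```python
# Resolve a time-travel query to a snapshot: take the newest snapshot
# whose commit time is at or before the requested timestamp.
def snapshot_as_of(snapshots, ts):
    eligible = [(t, sid) for t, sid in snapshots if t <= ts]
    if not eligible:
        raise ValueError("no snapshot at or before requested time")
    return max(eligible)[1]

snapshots = [
    ("2026-03-28 14:59:10", "snap-101"),
    ("2026-03-28 15:00:05", "snap-102"),  # committed just after 3 PM
    ("2026-03-28 15:01:05", "snap-103"),
]
print(snapshot_as_of(snapshots, "2026-03-28 15:00:00"))  # snap-101
```

This is why a one-minute commit cadence matters: the answer to "as of 3 PM" can only be as precise as the snapshot interval.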
How Does This Compare to a Debezium-Based Setup?
Many teams currently use Debezium with Kafka Connect to stream MySQL changes to a data lake. Here is how the RisingWave approach compares:
| Aspect | Debezium + Kafka + Spark | RisingWave |
| --- | --- | --- |
| Components | MySQL, Debezium, Kafka, Schema Registry, Spark, Iceberg | MySQL, RisingWave, Iceberg |
| Configuration | Debezium connector JSON + Spark job config | SQL statements |
| Transformation | Spark (Scala/Python) or ksqlDB | SQL materialized views |
| Latency | Seconds to minutes (Kafka) + batch interval (Spark) | Seconds |
| Schema evolution | Schema Registry + manual connector updates | auto.schema.change |
| Operational complexity | High (multiple systems to monitor) | Low (single system) |
The key advantage of RisingWave's approach is the elimination of Kafka as a middle layer. RisingWave reads the binlog directly, processes the data with SQL, and writes to Iceberg, all in one system. This removes the operational burden of managing Kafka clusters, Debezium connectors, and Schema Registry.
FAQ
Do I need Kafka or Debezium to stream MySQL changes to Iceberg with RisingWave?
No. RisingWave connects directly to MySQL's binlog replication protocol without requiring Kafka, Debezium, or any external message broker. This simplifies infrastructure and reduces operational overhead. RisingWave handles snapshotting, binlog parsing, transformation, and Iceberg writing in a single system.
What MySQL versions are compatible with RisingWave's CDC connector?
RisingWave's MySQL CDC connector supports MySQL 5.7, 8.0, and 8.4. It is also compatible with MySQL-protocol-compatible databases like MariaDB and TiDB. The binlog must be configured with ROW format and FULL row image for CDC to work correctly.
How does the MySQL CDC pipeline handle deletes?
When a row is deleted in MySQL, the binlog records the delete event with the full row data. RisingWave processes this delete and propagates it downstream. If your Iceberg sink uses type = 'upsert', deletes are applied to the Iceberg table, keeping it synchronized with the source. If you want to preserve delete history, you can use an append-only pattern with a change_type column.
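The upsert semantics are easy to model: inserts and updates write by primary key, deletes remove the key, so the target mirrors the source. A minimal Python sketch (illustrative, not the sink's actual code):

```python
# Apply a CDC changelog to a keyed target table, upsert-style.
def apply_changes(target: dict, changes):
    for op, row in changes:
        if op == "delete":
            target.pop(row["order_id"], None)
        else:  # insert or update
            target[row["order_id"]] = row
    return target

target = {}
changes = [
    ("insert", {"order_id": 1, "status": "pending"}),
    ("update", {"order_id": 1, "status": "completed"}),
    ("insert", {"order_id": 2, "status": "pending"}),
    ("delete", {"order_id": 2}),
]
apply_changes(target, changes)
print(sorted(target))       # [1]
print(target[1]["status"])  # completed
```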
What happens if the MySQL binlog is purged before RisingWave reads it?
If binlog files are purged while RisingWave is offline (due to expire_logs_days or manual PURGE BINARY LOGS), the CDC connector cannot resume from its last position. In this case, you need to drop and recreate the source and tables in RisingWave, which triggers a fresh snapshot of the current data followed by streaming from the current binlog position.
Conclusion
Building a real-time lakehouse from MySQL does not require a complex stack of Debezium, Kafka, and Spark. A streaming SQL engine like RisingWave reads the binlog directly, transforms the data with standard SQL, and writes to Iceberg continuously.
Key takeaways:
- MySQL CDC reads the binlog to capture every insert, update, and delete in real time
- The main difference from PostgreSQL CDC is the server.id requirement and binlog format configuration
- RisingWave connects directly to MySQL without Kafka or Debezium
- SQL materialized views transform and enrich CDC data before it lands in Iceberg
- The Iceberg sink creates snapshots every minute, enabling fine-grained time travel and analytics
- The same pipeline pattern works with MariaDB and TiDB
Ready to stream MySQL to your Iceberg lakehouse? Try RisingWave Cloud free, no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

