Debezium MySQL CDC: Building Real-Time Data Pipelines

Debezium's MySQL connector reads from MySQL's binary log (binlog) to capture every insert, update, and delete in real time. Paired with Kafka and RisingWave — a PostgreSQL-compatible streaming database — you can build low-latency data pipelines that react to MySQL changes the moment they are committed.

Why CDC for MySQL?

MySQL is one of the most widely deployed relational databases in the world, powering everything from e-commerce platforms to SaaS applications. However, MySQL is optimized for OLTP, not analytics. Running analytical queries directly against a production MySQL instance risks degrading application performance.

The MySQL binlog offers a non-intrusive way to observe changes: it records every committed transaction for replication purposes, and Debezium reads it just like a replica would — without touching the application tables with additional queries.

Use cases that benefit from MySQL CDC:

  • Order processing: Stream new orders to fulfillment and analytics systems simultaneously
  • Inventory management: Propagate stock-level changes to downstream services in real time
  • User activity tracking: Capture user record changes for personalization pipelines
  • Multi-database synchronization: Keep MySQL and secondary stores (Redis, Elasticsearch, PostgreSQL) consistent

How the Debezium MySQL Connector Works

MySQL's binlog records every committed change to the database. Debezium connects to MySQL as a replica, requesting a binlog stream starting from a specific position (log file + offset).

The connector performs two phases:

  1. Snapshot: Reads the current state of all monitored tables and emits op=r (read) events for each row
  2. Streaming: Tails the binlog continuously, emitting op=c (create), op=u (update), and op=d (delete) events as they occur

Each Debezium MySQL event contains before (for updates and deletes), after (for inserts and updates), op, and a source block with ts_ms, file (binlog filename), pos (binlog position), gtid (if GTIDs are enabled), and server_id.
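A concrete sketch of that envelope, and how a consumer typically branches on it (field values here are illustrative, and the `describe` helper is ours, not part of Debezium):

```python
import json

# A trimmed Debezium MySQL update event (op=u); values are illustrative.
event = {
    "before": {"id": 1001, "status": "pending", "quantity": 2},
    "after":  {"id": 1001, "status": "shipped", "quantity": 2},
    "op": "u",
    "source": {
        "ts_ms": 1700000000000,
        "file": "mysql-bin.000003",
        "pos": 4571,
        "gtid": None,       # populated only when GTIDs are enabled
        "server_id": 1,
    },
}

def describe(evt):
    """Map a Debezium op code to the action a consumer would take."""
    op = evt["op"]
    if op in ("r", "c", "u"):     # snapshot read, insert, update
        return ("upsert", evt["after"])
    if op == "d":                 # delete: only the before-image remains
        return ("delete", evt["before"])
    raise ValueError(f"unknown op: {op}")

print(describe(event))  # ('upsert', {'id': 1001, 'status': 'shipped', 'quantity': 2})
```

Note that deletes carry only a `before` block: the row no longer exists, so there is no after-image to emit.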

Step-by-Step Tutorial

Step 1: Prepare MySQL

Enable binary logging in my.cnf:

[mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
binlog_expire_logs_seconds = 604800  # 7 days; expire_logs_days is deprecated in MySQL 8.0

Create a Debezium user with the required privileges:

CREATE USER 'debezium'@'%' IDENTIFIED BY 'secret';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'debezium'@'%';
FLUSH PRIVILEGES;
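Before registering the connector, it is worth verifying the server actually has these settings. A minimal pre-flight check might look like this — `variables` stands in for the result of `SHOW VARIABLES` fetched with any MySQL client, and the helper name is our own, not part of Debezium:

```python
# Settings the Debezium MySQL connector depends on (see my.cnf above).
REQUIRED = {
    "log_bin": "ON",
    "binlog_format": "ROW",
    "binlog_row_image": "FULL",
}

def check_binlog_config(variables: dict) -> list:
    """Return a list of (name, expected, actual) mismatches."""
    return [
        (name, expected, variables.get(name))
        for name, expected in REQUIRED.items()
        if variables.get(name) != expected
    ]

# Example: a server still on statement-based logging fails the check.
problems = check_binlog_config(
    {"log_bin": "ON", "binlog_format": "STATEMENT", "binlog_row_image": "FULL"}
)
print(problems)  # [('binlog_format', 'ROW', 'STATEMENT')]
```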

Step 2: Deploy the Debezium MySQL Connector

Save the following configuration as mysql-connector.json, then register it with Kafka Connect:

{
  "name": "mysql-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "184054",
    "topic.prefix": "mysql1",
    "database.include.list": "shop",
    "table.include.list": "shop.orders,shop.products",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.mysql1",
    "include.schema.changes": "true"
  }
}
curl -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d @mysql-connector.json

Step 3: Connect RisingWave to the Kafka Topic

RisingWave is a PostgreSQL-compatible streaming database that understands the Debezium event format natively:

CREATE SOURCE orders_mysql
WITH (
    connector = 'kafka',
    topic = 'mysql1.shop.orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT DEBEZIUM ENCODE JSON;

The FORMAT DEBEZIUM ENCODE JSON clause tells RisingWave to apply the Debezium upsert/delete semantics, so the source behaves like a live table reflecting the current state of shop.orders.
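Those semantics amount to maintaining keyed state: upsert on `r`/`c`/`u`, delete on `d`. The sketch below illustrates the contract — it is not RisingWave's implementation, just the behavior FORMAT DEBEZIUM implies:

```python
# Simplified illustration of Debezium upsert/delete semantics: keep the
# current state of a table keyed by primary key.
def apply_event(state: dict, event: dict) -> None:
    op = event["op"]
    if op in ("r", "c", "u"):            # snapshot read, insert, update
        row = event["after"]
        state[row["id"]] = row           # upsert by primary key
    elif op == "d":                      # delete: drop the key
        state.pop(event["before"]["id"], None)

state = {}
events = [
    {"op": "c", "after": {"id": 1, "status": "pending"}},
    {"op": "u", "before": {"id": 1, "status": "pending"},
                "after":  {"id": 1, "status": "shipped"}},
    {"op": "c", "after": {"id": 2, "status": "pending"}},
    {"op": "d", "before": {"id": 1, "status": "shipped"}},
]
for e in events:
    apply_event(state, e)
print(state)  # {2: {'id': 2, 'status': 'pending'}}
```

After replaying the four events, row 1 has been inserted, updated, and deleted, so only row 2 survives — exactly what a query against the source should see.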

Step 4: Build a Real-Time Pipeline with Materialized Views

Compute a running leaderboard of top products by revenue:

CREATE MATERIALIZED VIEW product_revenue AS
SELECT
    product_id,
    SUM(quantity * unit_price) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_buyers,
    MAX(order_time) AS last_sale_at
FROM orders_mysql
WHERE status != 'cancelled'
GROUP BY product_id;
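The view is maintained incrementally: each CDC event adjusts the aggregates rather than triggering a rescan. For SUM, the update rule looks roughly like the toy sketch below (the status filter is omitted for brevity, and RisingWave's operators are far more general). Note that the update case is exactly why binlog_row_image = FULL matters: it needs the before-image to subtract the old contribution.

```python
# Toy sketch of incremental SUM(quantity * unit_price) per product_id.
from collections import defaultdict

revenue = defaultdict(float)

def contribution(row):
    return row["quantity"] * row["unit_price"]

def on_event(event):
    op = event["op"]
    if op in ("r", "c"):   # snapshot read or insert: add the new row
        revenue[event["after"]["product_id"]] += contribution(event["after"])
    elif op == "u":        # update: retract the before-image, add the after-image
        revenue[event["before"]["product_id"]] -= contribution(event["before"])
        revenue[event["after"]["product_id"]] += contribution(event["after"])
    elif op == "d":        # delete: retract the row
        revenue[event["before"]["product_id"]] -= contribution(event["before"])

on_event({"op": "c", "after": {"product_id": 7, "quantity": 2, "unit_price": 10.0}})
on_event({"op": "u",
          "before": {"product_id": 7, "quantity": 2, "unit_price": 10.0},
          "after":  {"product_id": 7, "quantity": 3, "unit_price": 10.0}})
print(dict(revenue))  # {7: 30.0}
```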

Join with a product dimension source for enriched output (this assumes a products_mysql source created the same way as orders_mysql, from the mysql1.shop.products topic):

CREATE MATERIALIZED VIEW enriched_product_revenue AS
SELECT
    pr.product_id,
    p.name AS product_name,
    p.category,
    pr.total_revenue,
    pr.unique_buyers
FROM product_revenue pr
JOIN products_mysql p ON pr.product_id = p.id;

Step 5: Sink Results Downstream

Write the enriched view to a downstream Kafka topic or database:

CREATE SINK revenue_sink
FROM enriched_product_revenue
WITH (
    connector = 'kafka',
    topic = 'product-revenue-live',
    properties.bootstrap.server = 'kafka:9092',
    primary_key = 'product_id'
) FORMAT UPSERT ENCODE JSON;

Comparison Table

Feature                 | Debezium MySQL CDC          | Direct Polling
------------------------|-----------------------------|----------------------------
Captures deletes        | Yes (op=d)                  | No (requires soft deletes)
Latency                 | Milliseconds                | Seconds to minutes
Impact on MySQL         | Minimal (reads binlog)      | Query load on tables
Schema change tracking  | Yes (schema history topic)  | Manual
Initial snapshot        | Automatic                   | Manual
GTID support            | Yes                         | N/A

FAQ

Do I need to use GTIDs with Debezium MySQL? GTIDs (Global Transaction Identifiers) are recommended but not required. With GTIDs enabled, Debezium can more reliably handle MySQL failovers, as it can resume from the correct position on a new primary without manual intervention.

What binlog_row_image setting is required? You must set binlog_row_image = FULL. This ensures that MySQL writes the complete before-image and after-image of each changed row to the binlog. With MINIMAL or NOBLOB, Debezium cannot reconstruct the full before state.

How does RisingWave's native MySQL CDC differ from Debezium? RisingWave includes a built-in connector = 'mysql-cdc' that connects directly to MySQL without needing Kafka or Debezium. It's simpler to set up and suitable when RisingWave is the only consumer of the change stream. Debezium + Kafka is better when multiple systems need to consume the same change events.

Key Takeaways

  • Debezium reads MySQL binlog as a replica, capturing all committed changes with no application-level impact
  • Set binlog_format = ROW and binlog_row_image = FULL before deploying the connector
  • RisingWave ingests Debezium MySQL events using FORMAT DEBEZIUM ENCODE JSON on a Kafka source
  • Materialized views update incrementally with each CDC event, enabling real-time aggregations
  • For simpler pipelines, RisingWave's native mysql-cdc connector eliminates the Kafka dependency
