Debezium Initial Snapshot vs Streaming: Best Practices

When Debezium first connects to a database, it must decide how much existing data to capture before it starts streaming live changes. The wrong snapshot configuration can lock production tables, produce duplicate data, or miss rows entirely. Understanding the snapshot.mode options and snapshot.isolation.mode settings is essential before you deploy to production.

The Two Phases of a Debezium Connector

Every Debezium connector operates in two distinct phases:

  1. Snapshot phase: A one-time, point-in-time copy of existing rows, using a consistent read (or a series of reads depending on isolation mode).
  2. Streaming phase: Continuous capture of changes from the transaction log (WAL, redo log, CDC change tables), starting from the offset established at the end of the snapshot.

The transition between phases is critical. If there is a gap between the snapshot's high-water mark and the streaming start position, you lose changes. Debezium handles this internally, but your configuration choices determine whether the transition is clean.
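The handoff can be pictured with a toy model (hypothetical offsets and shapes, not Debezium's actual implementation): the snapshot records the log position it was consistent with, and streaming replays only entries after that high-water mark.

```python
# Toy model of the snapshot-to-streaming handoff (illustrative only,
# not Debezium's internal algorithm): the snapshot notes the log
# position it was consistent with, and streaming resumes strictly
# after that high-water mark, so nothing is lost or double-applied.

def handoff(snapshot_rows, log, high_water_mark):
    """Emit snapshot rows as op 'r', then only log events past the mark."""
    events = [("r", row) for row in snapshot_rows]
    for offset, op, row in log:
        # Changes at or before the mark are already reflected
        # in the snapshot copy, so they are skipped.
        if offset > high_water_mark:
            events.append((op, row))
    return events

# The update at offset 5 is already in the snapshot; only the
# offset-7 update is streamed afterward.
events = handoff(
    snapshot_rows=[{"id": 1}, {"id": 2}],
    log=[(5, "u", {"id": 1}), (7, "u", {"id": 2})],
    high_water_mark=5,
)
```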

Snapshot Modes Explained

initial (default)

Performs a full snapshot of all tables the connector is configured to capture, then switches to streaming. Use this when:

  • The downstream system is empty and needs to be bootstrapped.
  • You want a complete, current copy of every row before tracking changes.

Risk: On large tables, the snapshot can take hours and hold read locks (depending on snapshot.isolation.mode).

initial_only

Performs the snapshot and then stops—does not switch to streaming mode. Useful for one-time data migrations.
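A migration connector reuses the same configuration shape as any other; only the snapshot mode differs. A trimmed sketch (the connector name and table list are illustrative, and connection settings are omitted):

```json
{
  "name": "postgres-orders-migration",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "snapshot.mode": "initial_only",
    "table.include.list": "public.orders"
  }
}
```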

schema_only

Captures the schema of all tables but no data rows, then immediately starts streaming. Use this when:

  • The downstream system already has the data (loaded via another mechanism).
  • You only care about changes going forward.

Risk: Any rows that existed before the connector started are not in the downstream system unless they're subsequently updated or you load them separately.

never

Skips the snapshot entirely and starts streaming from the current log position. Use only when you have already performed a snapshot via another method (e.g., a pg_dump restore) and you know the exact log offset at which to start.
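With PostgreSQL, for example, one common recipe is to create the logical replication slot before taking the dump, so the WAL the connector will need is retained from that point on (the slot name here is illustrative):

```sql
-- Create the slot first; PostgreSQL retains WAL from this point,
-- so no changes are lost while the dump runs.
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');

-- Note the current position for your records.
SELECT pg_current_wal_lsn();

-- Then run pg_dump, restore it downstream, and start the connector
-- with snapshot.mode = 'never' and slot.name = 'debezium'. Changes
-- made while the dump runs are replayed from the slot; with a keyed
-- upsert sink the replay is idempotent.
```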

always

Performs a fresh snapshot every time the connector starts. Rarely appropriate for production—it creates massive data duplication and load on the source database. Only useful for small reference tables that should be fully reloaded on each restart.

when_needed

Performs a snapshot only when the connector detects that it cannot resume from its stored offset (e.g., the WAL has been purged). A sensible default for production deployments where you want automatic recovery from log retention gaps.

Snapshot Isolation Modes

The snapshot.isolation.mode setting controls the transaction isolation level the connector uses during the snapshot read. Which values are available, and how each maps to locking behavior, depends on the connector.

| Mode             | Database Support  | Behavior                                 | Risk                                                 |
|------------------|-------------------|------------------------------------------|------------------------------------------------------|
| read_committed   | All               | Multiple reads, no consistent snapshot   | Possible duplicate/missing rows on large tables      |
| repeatable_read  | MySQL, PostgreSQL | Consistent snapshot, row-level locks     | Lock contention on high-write tables                 |
| serializable     | All               | Strongest consistency, table-level locks | High lock contention, not recommended for production |
| read_uncommitted | MySQL only        | Fastest, no locks                        | Dirty reads possible                                 |
| snapshot         | SQL Server only   | Uses database snapshot feature           | Requires snapshot isolation enabled on DB            |
| exclusive        | Oracle            | Table-level exclusive lock               | Blocks all writes during snapshot                    |

For most production deployments, read_committed combined with a single-read strategy (small tables) or chunked reads (large tables, tuned via snapshot.fetch.size) is the right choice.

Step-by-Step Tutorial

Step 1: Choose the Right Snapshot Mode for Your Scenario

{
  "name": "postgres-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "shop",
    "database.server.name": "pgserver1",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "snapshot.mode": "initial",
    "snapshot.isolation.mode": "read_committed",
    "snapshot.fetch.size": "10240",
    "snapshot.max.threads": "1"
  }
}

For large tables (>100M rows), consider setting snapshot.fetch.size higher (32768 or more) to reduce round trips, and snapshot.max.threads > 1 to parallelize the snapshot across tables.
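For instance, a large-table variant of the snapshot settings above might look like the following fragment (the values are starting points to benchmark against your own hardware, not universal recommendations):

```json
{
  "snapshot.fetch.size": "32768",
  "snapshot.max.threads": "4"
}
```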

Step 2: Connect RisingWave to Stream CDC Events

-- For Debezium → Kafka → RisingWave pipeline:
CREATE SOURCE orders_cdc (
    id          BIGINT,
    customer_id BIGINT,
    total       NUMERIC,
    status      VARCHAR,
    updated_at  TIMESTAMPTZ,
    _op         VARCHAR  -- debezium op field: c/u/d/r
) WITH (
    connector = 'kafka',
    topic = 'pgserver1.public.orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT DEBEZIUM ENCODE JSON;

scan.startup.mode = 'earliest' ensures RisingWave reads from the start of the topic, including snapshot events. Snapshot rows are emitted as op: r (read) events in the Debezium envelope.
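A minimal sketch of how a consumer might tell snapshot rows from live changes by inspecting the op field, assuming the simplified payload shape shown (real envelopes nest these fields under a "payload" key when schemas are enabled):

```python
import json

# Classify Debezium envelope events by their "op" field:
# "r" marks rows emitted during the snapshot phase;
# "c"/"u"/"d" are live changes captured during streaming.

SNAPSHOT_OPS = {"r"}
STREAM_OPS = {"c": "insert", "u": "update", "d": "delete"}

def classify(raw: str) -> str:
    op = json.loads(raw).get("op")
    if op in SNAPSHOT_OPS:
        return "snapshot"
    if op in STREAM_OPS:
        return STREAM_OPS[op]
    raise ValueError(f"unknown op: {op!r}")

print(classify('{"op": "r", "after": {"id": 1}}'))  # snapshot
print(classify('{"op": "u", "after": {"id": 1}}'))  # update
```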

Step 3: Monitor Snapshot Progress

During a snapshot, Debezium exposes progress metrics via JMX:

-- If you're ingesting Debezium metrics into RisingWave:
CREATE MATERIALIZED VIEW snapshot_progress AS
SELECT
    server_name,
    remaining_table_count,
    total_table_count,
    ROUND(100.0 * (total_table_count - remaining_table_count) / NULLIF(total_table_count, 0), 1) AS pct_complete
FROM debezium_snapshot_metrics;

Step 4: Validate Snapshot Completeness

After the snapshot completes, verify row counts match between source and destination:

-- RisingWave: count snapshot rows (op = 'r')
SELECT COUNT(*) AS snapshot_rows
FROM orders_cdc
WHERE _op = 'r';

-- Compare against source PostgreSQL count
-- (run this on PostgreSQL directly)
-- SELECT COUNT(*) FROM orders;

Discrepancies suggest a snapshot interruption or isolation issue. In that case, reset the connector offsets and retrigger the snapshot.

Comparison Table

| Snapshot Mode | Use Case                        | Data Completeness       | Load on Source    | Recovery Safe           |
|---------------|---------------------------------|-------------------------|-------------------|-------------------------|
| initial       | First-time bootstrap            | Full history + changes  | Medium (one-time) | Yes                     |
| schema_only   | Forward-only tracking           | Changes only            | Low               | Yes                     |
| never         | Manual data load + known offset | Depends on prep         | None              | Only if prep is correct |
| always        | Small reference table reloads   | Full on each restart    | High              | No (duplication)        |
| when_needed   | Auto-recovery from log gaps     | Full on gap detection   | Medium            | Yes                     |
| initial_only  | One-time data migration         | Full, no changes after  | Medium            | N/A                     |

FAQ

Q: What happens if a snapshot is interrupted midway? Debezium does not perform a partial snapshot. If interrupted, it will restart the snapshot from the beginning on next connector start (for initial mode). There is no partial-snapshot resume in most connectors, though the PostgreSQL connector has an incremental snapshot feature that allows resumable, chunk-based snapshots.

Q: Can I trigger a re-snapshot of specific tables without restarting the full connector? Yes, with the incremental snapshot feature (available in the PostgreSQL, MySQL, and several other connectors). Insert a signal record into the signaling table the connector watches:

INSERT INTO debezium_signal (id, type, data)
VALUES ('unique-id', 'execute-snapshot', '{"data-collections": ["public.orders"]}');
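For the signal to be picked up, the connector must be told which table to watch and, optionally, how large each snapshot chunk should be. A sketch of the relevant properties (table name illustrative):

```json
{
  "signal.data.collection": "public.debezium_signal",
  "incremental.snapshot.chunk.size": "1024"
}
```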

Q: Does schema_only snapshot protect against schema evolution issues? Yes. Even in schema_only mode, Debezium records the schema information it needs to correctly interpret future change events whose structure may evolve over time. Connectors that maintain a schema history Kafka topic (such as MySQL, SQL Server, and Oracle) populate it regardless of snapshot mode; the PostgreSQL connector instead reads schema information from the database itself.

Key Takeaways

  • Use initial for bootstrapping new downstream systems; use schema_only when the data already exists downstream.
  • read_committed isolation is usually sufficient for production snapshots and avoids lock contention.
  • Monitor RemainingTableCount via JMX during long snapshots to estimate completion time.
  • Snapshot rows arrive as op: r events in the Debezium envelope; RisingWave's FORMAT DEBEZIUM treats them as inserts.
  • For very large tables, prefer the incremental snapshot feature over a full initial snapshot to minimize source database impact.
