Debezium Initial Snapshot vs Streaming: Best Practices

When Debezium first connects to a database, it must decide how much existing data to capture before it starts streaming live changes. The wrong snapshot configuration can lock production tables, produce duplicate data, or miss rows entirely. Understanding the snapshot.mode options and snapshot.isolation.mode settings is essential before you deploy to production.

The Two Phases of a Debezium Connector

Every Debezium connector operates in two distinct phases:

  1. Snapshot phase: A one-time, point-in-time copy of existing rows, using a consistent read (or a series of reads depending on isolation mode).
  2. Streaming phase: Continuous capture of changes from the transaction log (WAL, redo log, CDC change tables), starting from the offset established at the end of the snapshot.

The transition between phases is critical. If there is a gap between the snapshot's high-water mark and the streaming start position, you lose changes. Debezium handles this internally, but your configuration choices determine whether the transition is clean.
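The handoff can be pictured with a toy model (hypothetical offsets and shapes, not Debezium's actual implementation): the snapshot records the log position it was consistent with, and streaming replays only entries after that high-water mark.

```python
# Toy model of the snapshot-to-streaming handoff (illustrative only,
# not Debezium's internal algorithm): the snapshot notes the log
# position it was consistent with, and streaming resumes strictly
# after that high-water mark, so nothing is lost or double-applied.

def handoff(snapshot_rows, log, high_water_mark):
    """Emit snapshot rows as op 'r', then only log events past the mark."""
    events = [("r", row) for row in snapshot_rows]
    for offset, op, row in log:
        # Changes at or before the mark are already reflected
        # in the snapshot copy, so they are skipped.
        if offset > high_water_mark:
            events.append((op, row))
    return events

# The update at offset 5 is already in the snapshot; only the
# offset-7 update is streamed afterward.
events = handoff(
    snapshot_rows=[{"id": 1}, {"id": 2}],
    log=[(5, "u", {"id": 1}), (7, "u", {"id": 2})],
    high_water_mark=5,
)
```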

Snapshot Modes Explained

initial (default)

Performs a full snapshot of all tables the connector is configured to capture, then switches to streaming. Use this when:

  • The downstream system is empty and needs to be bootstrapped.
  • You want a complete, current copy of every row before tracking changes.

Risk: On large tables, the snapshot can take hours and hold read locks (depending on snapshot.isolation.mode).

initial_only

Performs the snapshot and then stops—does not switch to streaming mode. Useful for one-time data migrations.
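A migration connector reuses the same configuration shape as any other; only the snapshot mode differs. A trimmed sketch (the connector name and table list are illustrative, and connection settings are omitted):

```json
{
  "name": "postgres-orders-migration",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "snapshot.mode": "initial_only",
    "table.include.list": "public.orders"
  }
}
```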

schema_only

Captures the schema of all tables but no data rows, then immediately starts streaming. Use this when:

  • The downstream system already has the data (loaded via another mechanism).
  • You only care about changes going forward.

Risk: Any rows that existed before the connector started are not in the downstream system unless they're subsequently updated or you load them separately.

never

Skips the snapshot entirely and starts streaming from the current log position. Use only when you have already performed a snapshot via another method (e.g., a pg_dump restore) and you know the exact log offset at which to start.
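With PostgreSQL, for example, one common recipe is to create the logical replication slot before taking the dump, so the WAL the connector will need is retained from that point on (the slot name here is illustrative):

```sql
-- Create the slot first; PostgreSQL retains WAL from this point,
-- so no changes are lost while the dump runs.
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');

-- Note the current position for your records.
SELECT pg_current_wal_lsn();

-- Then run pg_dump, restore it downstream, and start the connector
-- with snapshot.mode = 'never' and slot.name = 'debezium'. Changes
-- made while the dump runs are replayed from the slot; with a keyed
-- upsert sink the replay is idempotent.
```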

always

Performs a fresh snapshot every time the connector starts. Rarely appropriate for production—it creates massive data duplication and load on the source database. Only useful for small reference tables that should be fully reloaded on each restart.

when_needed

Performs a snapshot only when the connector detects that it cannot resume from its stored offset (e.g., the WAL has been purged). A sensible default for production deployments where you want automatic recovery from log retention gaps.

Snapshot Isolation Modes

The snapshot.isolation.mode setting controls the transaction isolation level the connector uses during the snapshot read. Which values are available, and how each maps to locking behavior, depends on the connector.

| Mode             | Database Support  | Behavior                                 | Risk                                                 |
|------------------|-------------------|------------------------------------------|------------------------------------------------------|
| read_committed   | All               | Multiple reads, no consistent snapshot   | Possible duplicate/missing rows on large tables      |
| repeatable_read  | MySQL, PostgreSQL | Consistent snapshot, row-level locks     | Lock contention on high-write tables                 |
| serializable     | All               | Strongest consistency, table-level locks | High lock contention, not recommended for production |
| read_uncommitted | MySQL only        | Fastest, no locks                        | Dirty reads possible                                 |
| snapshot         | SQL Server only   | Uses database snapshot feature           | Requires snapshot isolation enabled on DB            |
| exclusive        | Oracle            | Table-level exclusive lock               | Blocks all writes during snapshot                    |

For most production deployments, read_committed combined with a single-read strategy (small tables) or chunked reads (large tables, tuned via snapshot.fetch.size) is the right choice.

Step-by-Step Tutorial

Step 1: Choose the Right Snapshot Mode for Your Scenario

{
  "name": "postgres-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "shop",
    "database.server.name": "pgserver1",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "snapshot.mode": "initial",
    "snapshot.isolation.mode": "read_committed",
    "snapshot.fetch.size": "10240",
    "snapshot.max.threads": "1"
  }
}

For large tables (>100M rows), consider setting snapshot.fetch.size higher (32768 or more) to reduce round trips, and snapshot.max.threads > 1 to parallelize the snapshot across tables.
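For instance, a large-table variant of the snapshot settings above might look like the following fragment (the values are starting points to benchmark against your own hardware, not universal recommendations):

```json
{
  "snapshot.fetch.size": "32768",
  "snapshot.max.threads": "4"
}
```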

Step 2: Connect RisingWave to Stream CDC Events

-- For Debezium → Kafka → RisingWave pipeline:
CREATE SOURCE orders_cdc (
    id          BIGINT,
    customer_id BIGINT,
    total       NUMERIC,
    status      VARCHAR,
    updated_at  TIMESTAMPTZ,
    _op         VARCHAR  -- debezium op field: c/u/d/r
) WITH (
    connector = 'kafka',
    topic = 'pgserver1.public.orders',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT DEBEZIUM ENCODE JSON;

scan.startup.mode = 'earliest' ensures RisingWave reads from the start of the topic, including snapshot events. Snapshot rows are emitted as op: r (read) events in the Debezium envelope.
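A minimal sketch of how a consumer might tell snapshot rows from live changes by inspecting the op field, assuming the simplified payload shape shown (real envelopes nest these fields under a "payload" key when schemas are enabled):

```python
import json

# Classify Debezium envelope events by their "op" field:
# "r" marks rows emitted during the snapshot phase;
# "c"/"u"/"d" are live changes captured during streaming.

SNAPSHOT_OPS = {"r"}
STREAM_OPS = {"c": "insert", "u": "update", "d": "delete"}

def classify(raw: str) -> str:
    op = json.loads(raw).get("op")
    if op in SNAPSHOT_OPS:
        return "snapshot"
    if op in STREAM_OPS:
        return STREAM_OPS[op]
    raise ValueError(f"unknown op: {op!r}")

print(classify('{"op": "r", "after": {"id": 1}}'))  # snapshot
print(classify('{"op": "u", "after": {"id": 1}}'))  # update
```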

Step 3: Monitor Snapshot Progress

During a snapshot, Debezium exposes progress metrics via JMX:

-- If you're ingesting Debezium metrics into RisingWave:
CREATE MATERIALIZED VIEW snapshot_progress AS
SELECT
    server_name,
    remaining_table_count,
    total_table_count,
    ROUND(100.0 * (total_table_count - remaining_table_count) / NULLIF(total_table_count, 0), 1) AS pct_complete
FROM debezium_snapshot_metrics;

Step 4: Validate Snapshot Completeness

After the snapshot completes, verify row counts match between source and destination:

-- RisingWave: count snapshot rows (op = 'r')
SELECT COUNT(*) AS snapshot_rows
FROM orders_cdc
WHERE _op = 'r';

-- Compare against source PostgreSQL count
-- (run this on PostgreSQL directly)
-- SELECT COUNT(*) FROM orders;

Discrepancies suggest a snapshot interruption or isolation issue. In that case, reset the connector offsets and retrigger the snapshot.

Comparison Table

| Snapshot Mode | Use Case                        | Data Completeness       | Load on Source    | Recovery Safe           |
|---------------|---------------------------------|-------------------------|-------------------|-------------------------|
| initial       | First-time bootstrap            | Full history + changes  | Medium (one-time) | Yes                     |
| schema_only   | Forward-only tracking           | Changes only            | Low               | Yes                     |
| never         | Manual data load + known offset | Depends on prep         | None              | Only if prep is correct |
| always        | Small reference table reloads   | Full on each restart    | High              | No (duplication)        |
| when_needed   | Auto-recovery from log gaps     | Full on gap detection   | Medium            | Yes                     |
| initial_only  | One-time data migration         | Full, no changes after  | Medium            | N/A                     |

FAQ

Q: What happens if a snapshot is interrupted midway? Debezium does not perform a partial snapshot. If interrupted, it will restart the snapshot from the beginning on next connector start (for initial mode). There is no partial-snapshot resume in most connectors, though the PostgreSQL connector has an incremental snapshot feature that allows resumable, chunk-based snapshots.

Q: Can I trigger a re-snapshot of specific tables without restarting the full connector? Yes, with the incremental snapshot feature (available in the PostgreSQL, MySQL, and several other connectors). Insert a signal record into the signaling table the connector watches:

INSERT INTO debezium_signal (id, type, data)
VALUES ('unique-id', 'execute-snapshot', '{"data-collections": ["public.orders"]}');
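For the signal to be picked up, the connector must be told which table to watch and, optionally, how large each snapshot chunk should be. A sketch of the relevant properties (table name illustrative):

```json
{
  "signal.data.collection": "public.debezium_signal",
  "incremental.snapshot.chunk.size": "1024"
}
```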

Q: Does schema_only snapshot protect against schema evolution issues? Yes. Even in schema_only mode, Debezium records the schema information it needs to correctly interpret future change events whose structure may evolve over time. Connectors that maintain a schema history Kafka topic (such as MySQL, SQL Server, and Oracle) populate it regardless of snapshot mode; the PostgreSQL connector instead reads schema information from the database itself.

Key Takeaways

  • Use initial for bootstrapping new downstream systems; use schema_only when the data already exists downstream.
  • read_committed isolation is usually sufficient for production snapshots and avoids lock contention.
  • Monitor RemainingTableCount via JMX during long snapshots to estimate completion time.
  • Snapshot rows arrive as op: r events in the Debezium envelope; RisingWave's FORMAT DEBEZIUM treats them as inserts.
  • For very large tables, prefer the incremental snapshot feature over a full initial snapshot to minimize source database impact.
