A PostgreSQL replication slot is a server-side cursor that marks how far a consumer has read the WAL (Write-Ahead Log). The database keeps all WAL segments from that position forward — forever — until the consumer advances the slot. This is the underlying mechanism for both Debezium and RisingWave's CDC, and mismanaging slots is one of the most common causes of production outages in CDC pipelines.
What a Replication Slot Actually Is
Think of a replication slot as a bookmark. PostgreSQL's normal WAL rotation deletes old segments once they're no longer needed for crash recovery. A replication slot tells PostgreSQL: "a consumer is reading from LSN X — don't delete anything before that."
There are two types:
- Physical replication slots: used by streaming replicas. They retain raw WAL bytes.
- Logical replication slots: used by Debezium and logical decoding tools. They decode WAL into row-level change records.
Debezium exclusively uses logical replication slots with the pgoutput or wal2json output plugin.
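To get a feel for the mechanism, you can create a throwaway logical slot by hand. A minimal sketch, assuming wal_level = logical is already set, superuser or REPLICATION privilege, and the test_decoding plugin that ships in contrib (easier to read than pgoutput); the slot name is hypothetical:
-- Create a throwaway logical slot (hypothetical name, test_decoding contrib plugin)
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');
-- After changing a row in any table, peek at the decoded WAL without consuming it
SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);
-- Drop it when done so it doesn't retain WAL indefinitely
SELECT pg_drop_replication_slot('demo_slot');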
You can see active slots with:
SELECT
slot_name,
plugin,
slot_type,
database,
active,
active_pid,
restart_lsn,
confirmed_flush_lsn,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS wal_retained
FROM pg_replication_slots;
The wal_retained column is your primary health metric. If this grows past a few GB on a busy system, you have a lag problem.
Why Replication Slots Are Dangerous
The danger is simple: a replication slot that stops consuming will cause disk exhaustion.
PostgreSQL cannot delete WAL segments older than the slot's restart_lsn. If your Debezium connector goes down, the slot stays open, and WAL accumulates at the rate your database is being written to.
A system generating 1 GB/hour of WAL, with a connector down for 24 hours, accumulates 24 GB of retained WAL. On a database server with 50 GB of free disk, that fills the volume in roughly two days, at which point PostgreSQL itself stops accepting writes.
This is not a theoretical concern. It is one of the most common causes of production incidents in CDC-heavy systems.
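A quick way to see how much space WAL is actually occupying right now (PostgreSQL 10+, requires superuser or membership in the pg_monitor role):
-- Total size of the pg_wal directory on disk
SELECT pg_size_pretty(sum(size)) AS pg_wal_size FROM pg_ls_waldir();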
Monitoring Slot Health
Use these queries to build monitoring and alerting:
-- Core slot health query
SELECT
s.slot_name,
s.active,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn)
) AS wal_retained,
pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) AS wal_retained_bytes,
r.replay_lag AS replication_delay -- time-based lag, populated only while a consumer is attached
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid;
-- Alert threshold: raise alarm if any slot has > 5GB of retained WAL
SELECT slot_name, wal_retained_bytes
FROM (
SELECT
slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS wal_retained_bytes
FROM pg_replication_slots
) sub
WHERE wal_retained_bytes > 5368709120; -- 5 GB
-- Check inactive slots: the slot exists but no consumer is attached
SELECT
slot_name,
active,
active_pid,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
) AS unconsumed_wal
FROM pg_replication_slots
WHERE NOT active;
A robust monitoring setup should alert on:
- wal_retained_bytes > 2 GB — warning
- wal_retained_bytes > 10 GB — critical
- active = false for more than 5 minutes — warning
- Disk usage > 70% on the PostgreSQL data volume
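A single query can fold the WAL thresholds above into a status label for an external check to scrape; a minimal sketch with the 2 GB and 10 GB thresholds hard-coded:
-- Classify each slot against the warning/critical thresholds above
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_retained,
CASE
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 10737418240 THEN 'critical' -- 10 GB
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 2147483648 THEN 'warning' -- 2 GB
ELSE 'ok'
END AS alert_level
FROM pg_replication_slots;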
The max_replication_slots Limit
PostgreSQL defaults to max_replication_slots = 10. This includes both physical and logical slots.
Each Debezium connector creates one slot. RisingWave's postgres-cdc source also creates one slot per source. If you have 8 active Debezium connectors and 2 physical replicas, you've exhausted the default limit.
Check and adjust:
-- Check current limit
SHOW max_replication_slots;
-- Check how many are in use
SELECT count(*) FROM pg_replication_slots;
In postgresql.conf:
max_replication_slots = 20
wal_level = logical
max_wal_senders = 20
Restart is required to change max_replication_slots. On managed databases (RDS, Cloud SQL, AlloyDB), this is a parameter group change that may require a maintenance window.
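On self-managed servers, the same change can be made with ALTER SYSTEM instead of editing postgresql.conf by hand; a restart is still required for these parameters:
-- Writes to postgresql.auto.conf; takes effect only after a server restart
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 20;
ALTER SYSTEM SET max_wal_senders = 20;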
Slot Lag: What Causes It and How to Recover
Slot lag builds up when the consumer cannot keep up with the production rate of WAL. Common causes:
- Debezium connector is down (crash, restart, Kafka Connect worker failure)
- Debezium is running but Kafka is a bottleneck (topic full, producer back-pressure)
- RisingWave checkpoint failure or backfill in progress
- Initial snapshot in progress (slot created but not yet consuming streaming events)
Emergency runbook when slot lag is critical:
Step 1: Verify the slot and consumer state.
SELECT slot_name, active, pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS wal_retained
FROM pg_replication_slots;
Step 2: If the consumer is running and active, the slot will drain on its own. Monitor the rate of change in wal_retained. If it's decreasing, wait.
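From psql, the \watch meta-command is a convenient way to eyeball the drain rate; a sketch that re-runs the check every 30 seconds (note there is no trailing semicolon before \watch):
-- wal_retained should shrink from one iteration to the next if the consumer is catching up
SELECT now() AS checked_at,
slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS wal_retained
FROM pg_replication_slots
\watch 30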
Step 3: If the consumer is not running and won't restart quickly, and disk is at risk, you must decide: drop the slot or free disk space.
-- DESTRUCTIVE: drops the slot, consumer must re-snapshot
SELECT pg_drop_replication_slot('debezium_slot_name');
Dropping the slot means Debezium loses its position. It will need to perform a new initial snapshot before streaming resumes. Plan for this: a dropped slot + re-snapshot can add significant load to the database.
Step 4: After dropping, restart Debezium with snapshot.mode=initial (default) to rebuild state.
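Once the connector is back up it recreates the slot and begins the snapshot; a quick sanity check that a consumer is attached again:
-- The slot should reappear with active = true and an attached PID
SELECT slot_name, plugin, active, active_pid
FROM pg_replication_slots
WHERE slot_type = 'logical';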
Preventing Slot Accumulation
Set max_slot_wal_keep_size (PostgreSQL 13+): This parameter caps how much WAL a slot can retain. Once exceeded, the slot is automatically invalidated rather than causing disk exhaustion.
max_slot_wal_keep_size = 10GB
When the slot is invalidated, it appears as:
SELECT slot_name, wal_status FROM pg_replication_slots;
-- wal_status = 'lost' means the slot has been invalidated and can no longer be used
Your consumer needs to handle this by re-snapshotting. This is a tradeoff: better to lose CDC position than to lose the entire database to a disk-full event.
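PostgreSQL 13+ also exposes safe_wal_size in pg_replication_slots, which reports how much more WAL can be written before a slot crosses max_slot_wal_keep_size; it is a useful early-warning metric before invalidation actually happens:
-- Headroom before invalidation; safe_wal_size is NULL when no limit is configured
SELECT slot_name, wal_status, pg_size_pretty(safe_wal_size) AS headroom
FROM pg_replication_slots;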
Use wal_sender_timeout: this terminates replication connections that have stopped responding, so a hung consumer does not keep the slot marked active and block a restarted consumer from reattaching. It does not release the slot itself or the WAL it retains.
wal_sender_timeout = 60s
Name your slots explicitly: Debezium's PostgreSQL connector takes its slot name from the slot.name property (the default is simply debezium). Setting it explicitly, alongside the connector's logical name, makes monitoring and incident response faster.
database.server.name=prod-orders
slot.name=debezium_prod_orders
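With a convention in place, a simple query can flag logical slots that fall outside it (the debezium_ prefix here is just the example convention from above):
-- Flag logical slots outside the expected naming scheme (\_ matches a literal underscore)
SELECT slot_name, plugin, active
FROM pg_replication_slots
WHERE slot_type = 'logical'
AND slot_name NOT LIKE 'debezium\_%';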
How RisingWave Manages Slots
RisingWave creates and manages its own replication slot per CDC source. The slot name is derived from the source name. You can see it in pg_replication_slots like any other slot.
-- After creating a RisingWave CDC source, check slot in PostgreSQL
SELECT slot_name, active, confirmed_flush_lsn
FROM pg_replication_slots
WHERE plugin = 'pgoutput';
The same operational rules apply: if RisingWave's CDC source stalls (backfill stuck, checkpoint failure), the slot will accumulate WAL. Monitor the same wal_retained metric regardless of whether the consumer is Debezium or RisingWave.
One operational advantage with RisingWave: if a CDC source is dropped (DROP SOURCE), RisingWave automatically drops the replication slot. With standalone Debezium, dropping the connector does not automatically drop the slot, leaving orphaned slots that continue to accumulate WAL.
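A minimal sketch of that cleanup behaviour, using a hypothetical source name:
-- In RisingWave: dropping the CDC source also releases its replication slot
DROP SOURCE pg_orders_cdc;
-- Back in PostgreSQL: the corresponding slot should no longer be listed
SELECT slot_name, active FROM pg_replication_slots WHERE plugin = 'pgoutput';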
FAQ
How many replication slots can I safely create?
The practical limit is determined by your WAL generation rate and disk capacity. A common guideline: never have more than 5-10 active logical replication slots unless you've specifically sized your disk for it. More slots = more risk surface if any one consumer falls behind.
Will PostgreSQL keep WAL forever if a slot exists?
Yes, without max_slot_wal_keep_size. PostgreSQL will retain all WAL from the slot's restart_lsn forward indefinitely. This is why orphaned slots (created but never consumed) cause disk exhaustion even on idle systems — there's always some WAL being generated.
Does a replication slot affect database performance?
A healthy, actively consumed slot has negligible impact. A lagging slot hurts indirectly: beyond the retained WAL filling the disk, the slot pins an old transaction horizon (visible as xmin and catalog_xmin in pg_replication_slots), which prevents vacuum from removing dead tuples newer than that horizon, leading over time to table bloat and slower queries.
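You can see the horizons a slot is pinning directly in pg_replication_slots:
-- xmin / catalog_xmin are the oldest transactions each slot forces the server to keep
SELECT slot_name, slot_type, xmin, catalog_xmin
FROM pg_replication_slots;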
Can I have two consumers on one slot?
No. A logical replication slot can only have one active consumer at a time. This is by design: advancing the slot's confirmed LSN is destructive — once advanced, those WAL records can be deleted. Two consumers would cause data loss.
What's the difference between restart_lsn and confirmed_flush_lsn?
confirmed_flush_lsn is the LSN up to which the consumer has confirmed it has processed and persisted the data. restart_lsn is the LSN from which the server must retain WAL in case the consumer needs to reconnect. restart_lsn is typically slightly behind confirmed_flush_lsn as a safety margin. For monitoring consumer lag, use confirmed_flush_lsn.
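A quick way to see both positions and the gap between them:
-- The gap between the two LSNs is the server-side safety margin
SELECT
slot_name,
restart_lsn,
confirmed_flush_lsn,
pg_size_pretty(pg_wal_lsn_diff(confirmed_flush_lsn, restart_lsn)) AS safety_margin
FROM pg_replication_slots
WHERE slot_type = 'logical';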

