FORMAT PLAIN vs FORMAT DEBEZIUM: How to Unlock Append-Only CDC Streams in RisingWave

FORMAT PLAIN vs FORMAT DEBEZIUM: How to Unlock Append-Only CDC Streams in RisingWave

·

8 min read

Most engineers who connect RisingWave to a Debezium CDC topic reach for FORMAT DEBEZIUM by default. The name matches the tool, so the choice feels obvious. But for a surprisingly common case, it is the wrong choice, and using it silently trades away three of the most useful features in the streaming SQL toolbox.

This post explains the trade-off, shows the pattern that avoids it, and walks through a complete e-commerce order pipeline that becomes possible once you make the switch.

What FORMAT DEBEZIUM Actually Creates

When you declare a Kafka source with FORMAT DEBEZIUM, RisingWave treats the topic as a changelog that mirrors the current state of the upstream table. Every insert, update, and delete event from the Debezium topic is applied to maintain a live replica with one row per primary key.

This is powerful when you genuinely need a replica. If upstream rows are updated or deleted after being written, FORMAT DEBEZIUM keeps your RisingWave table in sync.

But the result is a mutable, upsert-style table. RisingWave cannot declare it APPEND ONLY, because by definition it is not: rows can be overwritten or retracted as CDC events arrive.

This matters because APPEND ONLY is the gate to three capabilities:

  • Watermarks: Watermarks require the engine to guarantee that no row older than the watermark will ever appear. A mutable table cannot provide that guarantee, so watermarks are not allowed.
  • retention_seconds: This parameter caps how long the source table retains data, bounding its size over time. Without APPEND ONLY, you cannot set it.
  • Process-time temporal joins: The FOR SYSTEM_TIME AS OF PROCTIME() join pattern requires the left side to be append-only. Without it, you fall back to a regular stateful join, which grows state unboundedly for dimension lookups.

If your upstream table is truly insert-only, you are paying the full price of FORMAT DEBEZIUM without getting any of its benefits.

The Insert-Only Case Is More Common Than You Think

Consider order events in an e-commerce system. A customer places an order, the record is written once, and the orders table is never touched again. Downstream systems read from it, but the source rows stay fixed.

The Debezium topic for this table still contains a full envelope with op, before, after, and source fields. Every event will have op = 'c'. There are no updates, no deletes.

FORMAT DEBEZIUM would correctly process this topic, but it produces an upsert mirror and discards all the append-only capabilities. You would be paying for generality you do not need.

The FORMAT PLAIN Pattern

The alternative is to read the topic as raw structured data, then filter to insert events in a downstream materialized view that you declare APPEND ONLY.

-- Step 1: Raw source
-- FORMAT PLAIN reads the Debezium envelope as a structured column.
-- No CDC semantics are applied; every message becomes one row.
CREATE TABLE orders_source (
  payload JSONB
) WITH (
  connector = 'kafka',
  topic = 'orders',
  properties.bootstrap.server = 'kafka:9092',
  scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
-- Step 2: Append-only stream
-- Filter to op='c' to include only insert events.
-- PROCTIME() stamps each row with the processing time, enabling temporal joins.
CREATE MATERIALIZED VIEW orders_append_only AS
SELECT
  (payload->'after'->>'id')::BIGINT          AS order_id,
  (payload->'after'->>'user_id')::BIGINT     AS user_id,
  (payload->'after'->>'product_id')::BIGINT  AS product_id,
  (payload->'after'->>'amount')::DECIMAL     AS amount,
  (payload->'after'->>'created_at')::TIMESTAMPTZ AS created_at,
  PROCTIME() AS proc_time
FROM orders_source
WHERE payload->>'op' = 'c'
EMIT CHANGES
APPEND ONLY;
-- Step 3: Temporal join against a product catalog
-- This join is only possible because orders_append_only is APPEND ONLY.
-- RisingWave looks up each product at the moment the order is processed,
-- without retaining full state for the catalog side.
CREATE MATERIALIZED VIEW orders_enriched AS
SELECT
  o.order_id,
  o.user_id,
  o.amount,
  o.created_at,
  p.category,
  p.brand
FROM orders_append_only AS o
LEFT JOIN product_catalog_source FOR SYSTEM_TIME AS OF PROCTIME() AS p
  ON o.product_id = p.product_id;

The orders_enriched view enriches every new order with product metadata at processing time. Because orders_append_only is append-only, RisingWave uses the process-time temporal join path: state is maintained only for the left side, and each arriving row triggers a point lookup into the catalog. The catalog side is never materialized into join state.

What the Pipeline Looks Like End to End

graph LR
    A[PostgreSQL orders table] -->|Debezium CDC| B[Kafka topic: orders]
    B -->|FORMAT PLAIN ENCODE JSON| C[orders_source\nraw Debezium envelope]
    C -->|WHERE op = 'c'\nAPPEND ONLY| D[orders_append_only\nclean event stream]
    D -->|temporal join\nFOR SYSTEM_TIME AS OF PROCTIME| E[orders_enriched\nenriched orders MV]
    F[product_catalog_source\ndimension table] --> E
    D -->|event-time windows\nwatermark-driven aggregation| G[hourly_order_metrics\nwindowed aggregates]

Unlocking Watermarks and Event-Time Processing

With an append-only materialized view, you can add a watermark definition and drive event-time window aggregations:

-- Tumbling window aggregation over event time
-- Requires watermark, which requires APPEND ONLY upstream
CREATE MATERIALIZED VIEW hourly_order_metrics AS
SELECT
  window_start,
  window_end,
  COUNT(*)            AS order_count,
  SUM(amount)         AS total_revenue,
  AVG(amount)         AS avg_order_value
FROM TUMBLE(orders_append_only, created_at, INTERVAL '1 hour')
GROUP BY window_start, window_end;

Watermarks tell RisingWave when a window is complete and can be safely emitted. Without them, time-based window aggregations cannot make forward progress. See the RisingWave watermark documentation for configuration options.

Bounding Source Table Size with retention_seconds

A Kafka-backed source table in RisingWave will grow without bound unless you constrain it. The retention_seconds parameter caps how long rows are kept in the source table's internal state:

-- Retain only the last 7 days of raw order events
CREATE TABLE orders_source (
  payload JSONB
) WITH (
  connector = 'kafka',
  topic = 'orders',
  properties.bootstrap.server = 'kafka:9092',
  scan.startup.mode = 'earliest',
  retention_seconds = '604800'
) FORMAT PLAIN ENCODE JSON;

This does not affect Kafka topic retention. It controls how long RisingWave's internal state for the table is kept. Without APPEND ONLY semantics, the parameter is unavailable. With it, memory usage stays predictable for long-running pipelines.

For source configuration details, see the Kafka source connector documentation.

When FORMAT DEBEZIUM Is Still the Right Choice

The pattern above works only when the upstream table is genuinely insert-only. If any of the following are true, FORMAT DEBEZIUM is the correct tool:

  • The upstream table receives updates after rows are written (for example, an order status field that changes from pending to shipped to delivered)
  • Rows can be deleted (for example, a soft-delete replaced by a hard delete)
  • You need a live replica that reflects the current state of the upstream table

If you use FORMAT PLAIN with WHERE op = 'c' on a table that does have updates, those update events (op = 'u') are silently dropped. The filter does not error; it discards changes. The resulting materialized view will contain stale data.

Before adopting the pattern, verify the distribution of op values in the topic. If you see only 'c', the pattern is safe. If 'u' or 'd' events appear in volume, stay with FORMAT DEBEZIUM.

Comparing the Two Approaches

FORMAT DEBEZIUMFORMAT PLAIN + filter
Models upstream stateYes (upsert mirror)No (raw event stream)
APPEND ONLYNot allowedSupported
WatermarksNot availableAvailable
retention_secondsNot availableAvailable
Process-time temporal joinNot availableAvailable
Handles updates/deletesYesNo (they are filtered out)
Best forTables with mutationsInsert-only tables

FAQ

What is FORMAT PLAIN ENCODE AVRO in RisingWave?

FORMAT PLAIN instructs RisingWave to read each Kafka message as a plain record without applying CDC changelog semantics. ENCODE AVRO (or ENCODE JSON) specifies the serialization format. Together they tell RisingWave to produce one new row per message, passing the full Debezium envelope through as structured data. No upsert logic is applied. For Avro-encoded topics with a Confluent-compatible schema registry, add FORMAT PLAIN ENCODE AVRO (schema.registry = 'http://schema-registry:8081') to the source definition.

What does APPEND ONLY do in a materialized view?

Declaring a materialized view APPEND ONLY tells RisingWave that rows will only ever be inserted, never updated or deleted. This unlocks watermark support, retention_seconds configuration, and the ability to use the view as the left side of a process-time temporal join. See the RisingWave materialized view documentation for the full syntax.

What is a process-time temporal join?

A process-time temporal join uses FOR SYSTEM_TIME AS OF PROCTIME() to look up the current value of a dimension table at the moment each event is processed. Because it only maintains state for the stream side, it is significantly more efficient than a regular stateful join for slowly changing dimension lookups. The left side must be append-only.

How do I verify that my upstream table is insert-only?

Query the raw source table for the distribution of op values after creating it with FORMAT PLAIN:

SELECT DISTINCT payload->>'op' AS op_type
FROM orders_source
LIMIT 1000;

If this returns only 'c' (and possibly 'r' for snapshot reads), the table is insert-only and the pattern is safe. If you see 'u' or 'd', the upstream table has mutations and FORMAT DEBEZIUM is the correct choice.

Conclusion

FORMAT DEBEZIUM is the right tool when you need to mirror a mutable upstream table. But for insert-only sources, it is unnecessarily restrictive. By reading the Debezium envelope as plain structured data and filtering to op = 'c', you get a true append-only stream that supports watermarks, bounded source retention, and process-time temporal joins.

For e-commerce order events, activity logs, audit trails, and any table where rows are written once and never modified, this pattern gives you a much more capable foundation than a Debezium upsert mirror.

Key takeaways:

  • FORMAT DEBEZIUM creates a mutable upsert mirror, which cannot be APPEND ONLY
  • Without APPEND ONLY, you lose watermarks, retention_seconds, and temporal join support
  • FORMAT PLAIN reads the raw Debezium envelope as structured data without applying CDC semantics
  • Filtering to WHERE op = 'c' in a downstream APPEND ONLY materialized view gives you a clean insert stream
  • This pattern is safe only when the upstream table is genuinely insert-only; updates and deletes will be silently filtered if they exist

Want to try this yourself? Get started with RisingWave in 5 minutes. Quickstart

Give us a star on GitHub and join the RisingWave Slack community to connect with engineers building on streaming SQL.

Best-in-Class Event Streaming
for Agents, Apps, and Analytics
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.