Stream Processing Without a Message Queue: Do You Really Need Kafka?

Stream Processing Without a Message Queue: Do You Really Need Kafka?

TL;DR: Many real-time systems reach for Apache Kafka by default, but a streaming database like RisingWave can ingest directly from Postgres or MySQL via CDC, from S3 files, from synthetic data generators, and from HTTP sources without a broker in the middle. Kafka still earns its keep when you have many independent consumers or strict replay requirements, but for analytics, dashboards, and feature pipelines you can often skip it entirely.

Do I need Kafka for real-time data processing?

No, you do not need Kafka for most real-time data processing workloads. A modern streaming database can pull change events directly from your operational database via log-based CDC, read append-only event files from object storage, and emit results to sinks without a separate message broker. You only need Kafka when multiple downstream systems must independently consume the same firehose of events at high throughput, or when you need long retention for replay.

Kafka is excellent at what it does. It is a durable, partitioned, replicated commit log that fans out one source of truth to many consumers. The problem is that teams adopt Kafka as a reflex. Every "real-time" architecture diagram begins with a Kafka cluster in the center, even when the data path is Postgres -> dashboard and the only consumer is a single analytics job. That reflex carries real costs: a broker fleet to operate, Schema Registry to wrangle, Kafka Connect connectors to babysit, and a JVM stack to staff.

This post walks through three direct ingestion patterns that let you delete Kafka from your diagram entirely, and then identifies the cases where you should put it back in.

The Kafka-as-default anti-pattern

The Kafka-as-default anti-pattern shows up when a team adds a broker between their operational database and their analytics layer purely because every tutorial does. The data flows Postgres -> Debezium -> Kafka Connect -> Kafka -> Flink -> warehouse, and each arrow is a piece of infrastructure that can break, requires upgrades, and consumes engineering hours. If the only thing reading those Kafka topics is one downstream consumer, the broker is dead weight.

A typical pipeline looks like this:

  1. Debezium reads the Postgres write-ahead log
  2. Kafka Connect serializes those changes into Kafka topics
  3. Schema Registry tracks the Avro or Protobuf schemas
  4. A Flink job reads the topics, runs aggregations, and writes back to Kafka
  5. A second Kafka Connect sink moves the results to a warehouse or dashboard

That stack has five distributed systems for what is essentially one query: "show me the running total of orders by region." None of them are simple to operate, and each one introduces an independent failure mode. Kafka rebalancing, Connect worker restarts, schema evolution mistakes, Flink checkpoint stalls. Confluent's own documentation acknowledges that log-based CDC with Debezium is the recommended path, but does not address whether the Kafka layer is actually buying you anything in single-consumer cases.

Pattern 1: Direct CDC ingestion from Postgres

Direct CDC ingestion means RisingWave connects to your Postgres replication slot and consumes the write-ahead log itself, without Kafka or Debezium in between. The streaming database becomes the CDC reader, the stream processor, and the materialized view server in one process. You get every insert, update, and delete from your source tables, transformed in real time, with a single piece of infrastructure to operate.

Here is the DDL pattern. RisingWave reads directly from the Postgres logical replication slot:

CREATE SOURCE pg_cdc WITH (
  connector = 'postgres-cdc',
  hostname = 'localhost',
  port = '5432',
  username = 'postgres',
  password = 'password',
  database.name = 'mydb',
  schema.name = 'public',
  slot.name = 'rw_slot'
);

Once the source exists, you create tables that map to specific upstream tables and materialized views that maintain incremental aggregates over them:

CREATE TABLE orders (
  order_id BIGINT PRIMARY KEY,
  user_id INT,
  amount DECIMAL,
  status VARCHAR,
  created_at TIMESTAMPTZ
) FROM pg_cdc TABLE 'public.orders';

CREATE MATERIALIZED VIEW revenue_by_status AS
SELECT status, COUNT(*) AS order_count, SUM(amount) AS total_revenue
FROM orders
GROUP BY status;

The materialized view updates incrementally as Postgres commits new rows. There is no Kafka topic, no Schema Registry, no Connect worker. The Postgres replication slot is the buffer, and the streaming database is the consumer.

When this pattern works well: a single streaming database is the only consumer of the change stream, you want sub-second freshness on aggregates, and you do not need to replay history beyond what Postgres retains in its WAL.

Pattern 2: Generating synthetic streams with datagen

The datagen connector lets RisingWave produce its own synthetic event stream entirely in-database, with no external system at all. This is useful for prototyping pipelines, load testing your downstream sinks, and validating that your SQL logic is correct before you point it at production data. The connector is built in, so there is no separate process to run.

The following SQL was tested against RisingWave 2.8.0 on localhost:4566 and produces 100 synthetic order events per second:

CREATE SOURCE orders_datagen (
  order_id BIGINT,
  user_id INT,
  product_id INT,
  amount INT,
  event_time TIMESTAMPTZ
) WITH (
  connector = 'datagen',
  fields.order_id.kind = 'sequence',
  fields.order_id.start = '1',
  fields.user_id.kind = 'random',
  fields.user_id.min = '1',
  fields.user_id.max = '1000',
  fields.product_id.kind = 'random',
  fields.product_id.min = '1',
  fields.product_id.max = '100',
  fields.amount.kind = 'random',
  fields.amount.min = '10',
  fields.amount.max = '1000',
  fields.event_time.kind = 'random',
  fields.event_time.max_past = '5s',
  fields.event_time.max_past_mode = 'relative',
  datagen.rows.per.second = '100'
) FORMAT PLAIN ENCODE JSON;

SELECT order_id, user_id, amount, event_time FROM orders_datagen LIMIT 3;

A sample result from a live run:

 order_id | user_id | amount |            event_time
----------+---------+--------+----------------------------------
        1 |     802 |    804 | 2026-05-22 21:03:06.749183+00:00
        2 |     825 |    827 | 2026-05-22 21:03:05.533265+00:00
        3 |      82 |     90 | 2026-05-22 21:03:08.919268+00:00

You can layer materialized views on top of orders_datagen exactly as you would on a Kafka source. This is the fastest way to demo a streaming pipeline to a stakeholder, because there is no Kafka cluster to spin up first.

Pattern 3: S3 and file sources for batch backfill

Object storage is the third common source that does not need a message broker. When your data is already landing in S3 as JSON or Parquet files, RisingWave can read those files directly and feed them into the same materialized views that handle live data. This is the right pattern for historical backfills, batch refreshes of slowly changing dimensions, and feeding data into streaming joins from a static reference table.

CREATE SOURCE order_archive (
  order_id BIGINT,
  amount DECIMAL,
  event_time TIMESTAMPTZ
) WITH (
  connector = 's3',
  s3.region_name = 'us-east-1',
  s3.bucket_name = 'orders-archive',
  s3.credentials.access = 'AKIA...',
  s3.credentials.secret = '...',
  match_pattern = 'orders/*.json'
) FORMAT PLAIN ENCODE JSON;

The same SQL that processes live CDC data can read these files. When backfill and live ingestion share the same query, you eliminate an entire class of "the batch numbers do not match the streaming numbers" bugs.

When Kafka is actually the right answer

Kafka is worth its operational weight when you have specific requirements that a streaming database alone cannot meet. The most common ones are:

Many independent consumers. If five separate systems need to read the same event stream at full throughput, with independent offsets and replay positions, Kafka is exactly the tool you want. The broker holds the canonical log; each consumer maintains its own cursor. A streaming database can be one of those consumers, but it should not be the only buffer.

Long retention for replay. Kafka topics can hold weeks or months of events on cheap storage. If a downstream system needs to rebuild its state from scratch, it replays from the topic. Postgres WAL retention is measured in hours unless you tune it aggressively.

Cross-team isolation. Teams that own the producers should not have to coordinate every consumer's schedule. A Kafka topic is a stable contract between them.

Throughput spikes that exceed your sink. If your event rate spikes 100x for short periods and your downstream cannot keep up, Kafka absorbs the burst. A direct CDC connection backs pressure straight into your operational database, which is usually unacceptable.

If none of those apply, you can ship without Kafka and add it later if and when a second consumer appears. The decision is reversible, and the simpler architecture is faster to debug.

A migration checklist

If you are looking at an existing Kafka-centric pipeline and wondering whether you can simplify it, ask:

  1. How many consumers are there? One real consumer means Kafka is overkill.
  2. What is the replay window? Hours of Postgres WAL is often enough.
  3. What is the schema discipline? Direct CDC sidesteps Schema Registry entirely.
  4. What is the operational cost? Count engineer hours, not just AWS bill.
  5. What is the latency budget? Direct CDC removes one network hop.

When the answer to most of these is "we are over-engineered," replacing the broker with a streaming database that ingests directly from your operational stores can collapse a five-system pipeline into one.

Key takeaways

  • Kafka is a powerful commit log, not a default requirement for real-time pipelines.
  • A streaming database can ingest directly from Postgres CDC, MySQL CDC, S3 files, HTTP sources, and synthetic generators.
  • Single-consumer analytics paths are the most common case where Kafka adds operational cost without architectural value.
  • Keep Kafka in the picture when you have many independent consumers, long replay windows, or producer-consumer isolation requirements.
  • The decision is reversible. Start simple and add a broker the day you actually need one.

Want to try RisingWave's direct ingestion patterns? Download it on GitHub or spin up a free cluster on RisingWave Cloud.

Best-in-Class Event Streaming
for Agents, Apps, and Analytics
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.