Streaming Joins: How to Join Real-Time Data Streams

Streaming joins combine data from two or more continuously flowing streams in real time. Unlike batch joins that operate on complete, bounded datasets, streaming joins must handle data arriving at different times, in different orders, from different sources. RisingWave, Apache Flink, and ksqlDB support streaming joins, but they differ significantly in complexity and capability.

Types of Streaming Joins

Stream-Stream Join

Join two event streams based on a time window:

-- Join clicks with purchases within 30 minutes
CREATE MATERIALIZED VIEW click_to_purchase AS
SELECT c.user_id, c.page_url, p.product_id, p.amount,
       p.purchase_time - c.click_time as conversion_time
FROM clicks c
JOIN purchases p ON c.user_id = p.user_id
WHERE p.purchase_time BETWEEN c.click_time AND c.click_time + INTERVAL '30 MINUTES';

Stream-Table Join (Enrichment)

Join a stream with a relatively static table for enrichment:

-- Enrich order events with customer data
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT o.order_id, o.amount, o.order_time,
       c.name, c.tier, c.country
FROM order_events o
JOIN customers c ON o.customer_id = c.customer_id;

Temporal Join

Join a stream with a versioned table at the event's timestamp:

-- Join trades with the exchange rate at the time of the trade
CREATE MATERIALIZED VIEW trades_in_usd AS
SELECT t.trade_id, t.amount, t.currency, t.trade_time,
       t.amount * r.rate as amount_usd
FROM trades t
JOIN exchange_rates FOR SYSTEM_TIME AS OF t.trade_time r
ON t.currency = r.currency;

Join Challenges in Streaming

Challenge	Description	Solution
Unbounded state	Join buffers grow indefinitely	Time-bounded windows, TTL
Late data	One side of the join arrives late	Watermarks, allowed lateness
Out-of-order	Events arrive in wrong order	Event-time processing
Multi-way joins	Joining 3+ streams	Cascading materialized views

Platform Comparison

Feature	RisingWave	Flink	ksqlDB
Stream-stream join	✅	✅	✅ (manual repartition)
Stream-table join	✅	✅	✅
Temporal join	✅	✅	❌
Multi-way joins (5+)	✅ (efficient)	⚠️ (state-heavy)	❌
Cascading joins	✅ (via cascading MVs)	✅	❌
State backend	S3 (managed)	RocksDB / S3	RocksDB + Kafka

RisingWave handles multi-stream joins across 10+ streams efficiently — a workload where Flink often encounters state management issues.

Frequently Asked Questions

How do streaming joins differ from batch joins?

Batch joins operate on complete, finite datasets — both sides are fully available. Streaming joins operate on continuously arriving data — one side may arrive before the other. This requires buffering state, handling late data, and deciding how long to wait for matching events.

Can I join a Kafka stream with a database table?

Yes. In RisingWave, you can join a Kafka source with a CDC table from PostgreSQL/MySQL. The CDC table continuously reflects the current state of the database, and the join updates in real time as either side changes.

Why do multi-way streaming joins fail in some systems?

Each join in a multi-way join multiplies the state that must be maintained. In systems using local state (RocksDB), large join state can exhaust memory or disk. RisingWave handles this better because state is stored on S3, removing local storage limits.