Streaming Joins: How to Join Real-Time Data Streams

Streaming Joins: How to Join Real-Time Data Streams

Streaming Joins: How to Join Real-Time Data Streams

Streaming joins combine data from two or more continuously flowing streams in real time. Unlike batch joins that operate on complete, bounded datasets, streaming joins must handle data arriving at different times, in different orders, from different sources. RisingWave, Apache Flink, and ksqlDB support streaming joins, but they differ significantly in complexity and capability.

Types of Streaming Joins

Stream-Stream Join

Join two event streams based on a time window:

-- Join clicks with purchases within 30 minutes
CREATE MATERIALIZED VIEW click_to_purchase AS
SELECT c.user_id, c.page_url, p.product_id, p.amount,
       p.purchase_time - c.click_time as conversion_time
FROM clicks c
JOIN purchases p ON c.user_id = p.user_id
WHERE p.purchase_time BETWEEN c.click_time AND c.click_time + INTERVAL '30 MINUTES';

Stream-Table Join (Enrichment)

Join a stream with a relatively static table for enrichment:

-- Enrich order events with customer data
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT o.order_id, o.amount, o.order_time,
       c.name, c.tier, c.country
FROM order_events o
JOIN customers c ON o.customer_id = c.customer_id;

Temporal Join

Join a stream with a versioned table at the event's timestamp:

-- Join trades with the exchange rate at the time of the trade
CREATE MATERIALIZED VIEW trades_in_usd AS
SELECT t.trade_id, t.amount, t.currency, t.trade_time,
       t.amount * r.rate as amount_usd
FROM trades t
JOIN exchange_rates FOR SYSTEM_TIME AS OF t.trade_time r
ON t.currency = r.currency;

Join Challenges in Streaming

ChallengeDescriptionSolution
Unbounded stateJoin buffers grow indefinitelyTime-bounded windows, TTL
Late dataOne side of the join arrives lateWatermarks, allowed lateness
Out-of-orderEvents arrive in wrong orderEvent-time processing
Multi-way joinsJoining 3+ streamsCascading materialized views

Platform Comparison

FeatureRisingWaveFlinkksqlDB
Stream-stream join✅ (manual repartition)
Stream-table join
Temporal join
Multi-way joins (5+)✅ (efficient)⚠️ (state-heavy)
Cascading joins✅ (via cascading MVs)
State backendS3 (managed)RocksDB / S3RocksDB + Kafka

RisingWave handles multi-stream joins across 10+ streams efficiently — a workload where Flink often encounters state management issues.

Frequently Asked Questions

How do streaming joins differ from batch joins?

Batch joins operate on complete, finite datasets — both sides are fully available. Streaming joins operate on continuously arriving data — one side may arrive before the other. This requires buffering state, handling late data, and deciding how long to wait for matching events.

Can I join a Kafka stream with a database table?

Yes. In RisingWave, you can join a Kafka source with a CDC table from PostgreSQL/MySQL. The CDC table continuously reflects the current state of the database, and the join updates in real time as either side changes.

Why do multi-way streaming joins fail in some systems?

Each join in a multi-way join multiplies the state that must be maintained. In systems using local state (RocksDB), large join state can exhaust memory or disk. RisingWave handles this better because state is stored on S3, removing local storage limits.

Best-in-Class Event Streaming
for Agents, Apps, and Analytics
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.