Streaming Joins: How to Join Real-Time Data Streams
Streaming joins combine data from two or more continuously flowing streams in real time. Unlike batch joins that operate on complete, bounded datasets, streaming joins must handle data arriving at different times, in different orders, from different sources. RisingWave, Apache Flink, and ksqlDB support streaming joins, but they differ significantly in complexity and capability.
Types of Streaming Joins
Stream-Stream Join
Join two event streams based on a time window:
-- Join clicks with purchases within 30 minutes
CREATE MATERIALIZED VIEW click_to_purchase AS
SELECT c.user_id, c.page_url, p.product_id, p.amount,
p.purchase_time - c.click_time as conversion_time
FROM clicks c
JOIN purchases p ON c.user_id = p.user_id
WHERE p.purchase_time BETWEEN c.click_time AND c.click_time + INTERVAL '30 MINUTES';
Stream-Table Join (Enrichment)
Join a stream with a relatively static table for enrichment:
-- Enrich order events with customer data
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT o.order_id, o.amount, o.order_time,
c.name, c.tier, c.country
FROM order_events o
JOIN customers c ON o.customer_id = c.customer_id;
Temporal Join
Join a stream with a versioned table at the event's timestamp:
-- Join trades with the exchange rate at the time of the trade
CREATE MATERIALIZED VIEW trades_in_usd AS
SELECT t.trade_id, t.amount, t.currency, t.trade_time,
t.amount * r.rate as amount_usd
FROM trades t
JOIN exchange_rates FOR SYSTEM_TIME AS OF t.trade_time r
ON t.currency = r.currency;
Join Challenges in Streaming
| Challenge | Description | Solution |
| Unbounded state | Join buffers grow indefinitely | Time-bounded windows, TTL |
| Late data | One side of the join arrives late | Watermarks, allowed lateness |
| Out-of-order | Events arrive in wrong order | Event-time processing |
| Multi-way joins | Joining 3+ streams | Cascading materialized views |
Platform Comparison
| Feature | RisingWave | Flink | ksqlDB |
| Stream-stream join | ✅ | ✅ | ✅ (manual repartition) |
| Stream-table join | ✅ | ✅ | ✅ |
| Temporal join | ✅ | ✅ | ❌ |
| Multi-way joins (5+) | ✅ (efficient) | ⚠️ (state-heavy) | ❌ |
| Cascading joins | ✅ (via cascading MVs) | ✅ | ❌ |
| State backend | S3 (managed) | RocksDB / S3 | RocksDB + Kafka |
RisingWave handles multi-stream joins across 10+ streams efficiently — a workload where Flink often encounters state management issues.
Frequently Asked Questions
How do streaming joins differ from batch joins?
Batch joins operate on complete, finite datasets — both sides are fully available. Streaming joins operate on continuously arriving data — one side may arrive before the other. This requires buffering state, handling late data, and deciding how long to wait for matching events.
Can I join a Kafka stream with a database table?
Yes. In RisingWave, you can join a Kafka source with a CDC table from PostgreSQL/MySQL. The CDC table continuously reflects the current state of the database, and the join updates in real time as either side changes.
Why do multi-way streaming joins fail in some systems?
Each join in a multi-way join multiplies the state that must be maintained. In systems using local state (RocksDB), large join state can exhaust memory or disk. RisingWave handles this better because state is stored on S3, removing local storage limits.

