Tutorial: Window Joins in RisingWave

Imagine you are running a busy online store. You have a constant stream of orders coming in, and a separate stream of payments being processed. You need to match each order to its payment to confirm the sale.

In streaming data processing, doing a regular join (without time limits) can cause problems. As more data arrives, the system has to keep a list of every past record forever to match with future ones. This "unbounded state" makes the system slower and heavier over time.

Window Joins solve this. They chop the data from two streams into fixed time windows. The system only joins rows that fall into the same time window. This keeps the process fast and efficient because the system doesn't need to remember everything forever—just what is in the current window.

This blog post will explain the two main types of Window Joins in RisingWave and show you how to use them with a step-by-step e-commerce demo.

What Are Window Joins?

RisingWave supports two main ways to slice time for these joins:

Tumbling Window: These are fixed-size time blocks that do not overlap. Think of them as back-to-back hour-long buckets. Every event falls into exactly one bucket.
Hopping Window: These windows also have a fixed size, but they can overlap. You define how long the window is (size) and how often it starts a new one (slide).

A Step-by-Step Demo: E-Commerce Order Matching

We will analyze an e-commerce platform with four types of data: user profiles, a product catalog, orders, and payments.

Our goal is to join the Order stream with the Payment stream to see which orders have been paid for within a specific time frame.

Step 1: Create Sample Tables and Data

First, start RisingWave (see the installation guide for details). We will set up our static reference data: User Profiles and the Product Catalog.

Run the following queries to create the tables and add some data:

-- 1. Static user profile table (reference data)
CREATE TABLE users ( 
        user_id INT PRIMARY KEY, 
        membership_level VARCHAR, -- Possible values: basic, silver, gold 
        registration_date DATE
);

INSERT INTO users (user_id, membership_level, registration_date) VALUES
(101, 'gold', '2022-01-15'),
(102, 'silver', '2022-05-20'),
(103, 'basic', '2023-02-10'),
(104, 'basic', '2023-03-05'),
(105, 'silver', '2022-11-30');

-- 2. Static product catalog table (reference data)
CREATE TABLE products ( 
        product_id INT PRIMARY KEY, 
        category VARCHAR, 
        price FLOAT
);

INSERT INTO products (product_id, category, price) VALUES
(201, 'electronics', 599.99),
(202, 'clothing', 89.99),
(203, 'home', 129.99),
(204, 'electronics', 299.99),
(205, 'clothing', 59.99);

Step 2: Set Up Streaming Data

Next, we create the tables for our "streaming" data: Orders and Payments. In a real system, this data would flow in continuously. Here, we will insert them with timestamps to simulate the stream.

-- 3. Orders table (streaming-like data with timestamps)
CREATE TABLE orders ( 
        order_id INT, 
        user_id INT, 
        product_id INT, 
        order_time TIMESTAMP
);

INSERT INTO orders (order_id, user_id, product_id, order_time) VALUES
(1, 101, 201, '2023-06-01 09:15:00'),
(2, 102, 202, '2023-06-01 09:30:00'),
(3, 103, 203, '2023-06-01 09:45:00'),
(4, 101, 204, '2023-06-01 10:05:00'),
(5, 104, 205, '2023-06-01 10:20:00'),
(6, 105, 202, '2023-06-01 10:50:00'),
(7, 103, 204, '2023-06-01 11:15:00'),
(8, 102, 201, '2023-06-01 11:40:00');

-- 4. Payments table (streaming-like data with timestamps)
CREATE TABLE payments ( 
        payment_id INT, 
        order_id INT, 
        amount FLOAT, 
        payment_time TIMESTAMP
);

INSERT INTO payments (payment_id, order_id, amount, payment_time) VALUES
(101, 1, 599.99, '2023-06-01 09:17:00'),
(102, 2, 89.99, '2023-06-01 09:32:00'),
(103, 3, 129.99, '2023-06-01 09:50:00'),
(104, 4, 299.99, '2023-06-01 10:08:00'),
(105, 6, 89.99, '2023-06-01 10:55:00'),
(106, 8, 599.99, '2023-06-01 11:42:00');

Step 3: Perform a Tumbling Window Join

Now we will join the orders and payments using 1-hour non-overlapping windows. This query calculates how many orders get paid within the same hour they were made. This is useful for finding specific issues in your payment flow.

We use the TUMBLE function to create the windows.

CREATE MATERIALIZED VIEW tumbling_window_join AS
SELECT
    o.window_start, -- use col generated by window function
    COUNT(DISTINCT o.order_id) AS total_orders,
    COUNT(DISTINCT p.order_id) AS paid_orders
FROM TUMBLE(orders, order_time, INTERVAL '1 hour') o -- create window1
LEFT JOIN TUMBLE(payments, payment_time, INTERVAL '1 hour') p -- create window2
    ON o.window_start = p.window_start
    AND o.order_id = p.order_id
GROUP BY o.window_start;

Let's look at the results:

SELECT * FROM tumbling_window_join
ORDER BY window_start;

Output:

        window_start  | total_orders | paid_orders
---------------------+--------------+-------------
 2023-06-01 09:00:00 |            3 |           3
 2023-06-01 10:00:00 |            3 |           2
 2023-06-01 11:00:00 |            2 |           1

You can see that in the 10:00 hour, there were 3 orders but only 2 were paid.

Step 4: Perform a Hopping Window Join

Next, we will use a Hopping Window. We will look at 2-hour windows that slide (hop) every 1 hour. This helps us track trends that might cross over standard hour lines.

We use the HOP function here.

CREATE MATERIALIZED VIEW hopping_window_join AS
SELECT
    o.window_start, -- use col generated by window function
    o.window_end,   -- use col generated by window function
    p2.category,
    COUNT(DISTINCT o.order_id) AS total_orders,
    COUNT(DISTINCT p.order_id) AS paid_orders,
    SUM(p.amount) AS total_paid_amount
FROM HOP(orders, order_time, INTERVAL '1 hour', INTERVAL '2 hours') o -- create window1
LEFT JOIN products p2 ON o.product_id = p2.product_id
LEFT JOIN HOP(payments, payment_time, INTERVAL '1 hour', INTERVAL '2 hours') p -- create window2
    ON o.window_start = p.window_start
    AND o.order_id = p.order_id
GROUP BY o.window_start, o.window_end, p2.category;

Let's check the results:

SELECT * FROM hopping_window_join
ORDER BY window_start, category;

Output:

    window_start     |     window_end      |  category   | total_orders | paid_orders | total_paid_amount
---------------------+---------------------+-------------+--------------+-------------+-------------------
 2023-06-01 08:00:00 | 2023-06-01 10:00:00 | clothing    |            1 |           1 |             89.99
 2023-06-01 08:00:00 | 2023-06-01 10:00:00 | electronics |            1 |           1 |            599.99
 2023-06-01 08:00:00 | 2023-06-01 10:00:00 | home        |            1 |           1 |            129.99
 2023-06-01 09:00:00 | 2023-06-01 11:00:00 | clothing    |            3 |           2 |            179.98
 2023-06-01 09:00:00 | 2023-06-01 11:00:00 | electronics |            2 |           2 |            899.98
 ...

This view gives you a broader look at sales performance by category over overlapping time periods.

Conclusion

Window Joins are a powerful tool for handling real-time data streams without slowing down your system. By grouping data into manageable time windows, you can join high-volume streams like orders and payments efficiently.

Whether you use Tumbling Windows for clear, separate intervals or Hopping Windows to spot overlapping trends, RisingWave makes it easy to write these joins using standard SQL.

Get Started with RisingWave

Try RisingWave Today:
- Download the open-sourced version of RisingWave to deploy on your own infrastructure.
- Get started quickly with RisingWave Cloud for a fully managed experience.
Talk to Our Experts: Have a complex use case or want to see a personalized demo? Contact us to discuss how RisingWave can address your specific challenges.
Join Our Community: Connect with fellow developers, ask questions, and share your experiences in our vibrant Slack community.

If you’d like to see a personalized demo or discuss how this could work for your use case, please contact our sales team.