{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is 'Frequently Bought Together' and how does it work?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Frequently Bought Together is a recommendation widget that surfaces products commonly purchased in the same order as the item a shopper is viewing. It is built on a co-purchase matrix: for every pair of products that appear together in at least one order, a co-occurrence count is maintained. At display time, the system looks up the highest-scoring pairs for the current product and returns them as recommendations."
}
},
{
"@type": "Question",
"name": "How often should a co-purchase matrix update?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Ideally in real time or near-real time. Batch updates that run overnight cause the matrix to miss viral product moments, seasonal spikes, and same-day promotions. A streaming approach using CDC and materialized views can reflect new purchases within seconds of the transaction being committed to your database."
}
},
{
"@type": "Question",
"name": "Can I build real-time product recommendations with SQL alone?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Yes. The core logic of Frequently Bought Together is a self-join on orders grouped by order ID, which is standard SQL. When that SQL runs as a materialized view in a streaming database like RisingWave, the join result updates incrementally with every new purchase instead of requiring a nightly batch recompute."
}
},
{
"@type": "Question",
"name": "What is affinity score normalization and why does it matter?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Affinity normalization divides the raw co-purchase count by the individual popularity of the co-purchased product. Without normalization, universally popular items (like batteries or phone cases) appear as top recommendations for almost every product, even when the pairing is coincidental. Normalization surfaces genuinely complementary items rather than just popular ones."
}
},
{
"@type": "Question",
"name": "How does RisingWave connect to an existing PostgreSQL order database?",
"acceptedAnswer": {
"@type": "Answer",
"text": "RisingWave uses Change Data Capture (CDC) with the postgres-cdc connector to subscribe to the PostgreSQL logical replication stream. This means your existing transactional database remains the source of truth, and RisingWave mirrors the data in real time without any application-level changes or dual-write logic."
}
}
]
}
Scroll down to the bottom of any Amazon product page and you will find a row of three products under the heading "Frequently bought together," with a single button to add all three to your cart. That widget is often cited as one of the highest-revenue features Amazon has ever shipped. It works because it catches a shopper at the precise moment of highest purchase intent, while their wallet is already open, and shows them what other people in the same mental state decided to buy alongside this item.
The recommendation is only valuable if it is accurate. And accuracy depends on freshness.
The Freshness Problem
Here is the scenario. A product goes viral on social media at 2pm on a Tuesday. Within the hour, 400 people buy it. Many of them pair it with a phone mount, a travel organizer, or a specific brand of cable. By 3pm, a clear complementary purchase pattern has emerged.
If your co-purchase matrix runs as a nightly batch job, your store has no idea about this pattern until 2am. For roughly twelve hours, the Frequently Bought Together section for that product shows recommendations built from last week's data, computed before the product started trending. The cross-sell revenue evaporates.
The fix is not to run the batch job more often. Running a full co-purchase matrix computation every hour is expensive and still leaves a 60-minute gap. The fix is to maintain the matrix as a streaming materialized view that updates the moment each purchase is recorded.
Architecture: Two Layers, Honest About Trade-offs
Before the SQL, it is worth being clear about how the data flows.
Your product catalog and order history live in PostgreSQL. That is the right tool for transactional data: ACID guarantees, relational integrity, foreign key constraints between orders and line items. You do not move this data out of PostgreSQL.
RisingWave sits alongside your PostgreSQL instance and subscribes to it via Change Data Capture (CDC). Every INSERT or UPDATE to your orders table flows into RisingWave's streaming engine in real time. RisingWave maintains the co-purchase matrix as a materialized view, recomputing only the rows affected by each incoming event.
Your application reads recommendations from RisingWave over a standard PostgreSQL connection (port 4566). From the application's perspective, it is just another PostgreSQL query. The freshness is a property of how the view is maintained, not how it is queried.
This split is deliberate. PostgreSQL handles writes. RisingWave handles the read-heavy analytical computation. You get the durability and consistency of a relational database alongside the real-time computation of a streaming engine.
Setting Up the Pipeline
Ingesting Orders via CDC
First, create a CDC source pointing at your PostgreSQL instance. You will need logical replication enabled on the PostgreSQL side and a publication created for the orders table.
-- Ingest orders from PostgreSQL via CDC
CREATE SOURCE orders_source
WITH (
connector = 'postgres-cdc',
hostname = 'localhost',
port = '5432',
username = 'postgres',
password = '${POSTGRES_PASSWORD}',
database.name = 'ecommerce',
schema.name = 'public',
publication.name = 'orders_pub'
);
CREATE TABLE order_items (
order_id TEXT,
user_id TEXT,
item_id TEXT,
item_name TEXT,
category TEXT,
price DECIMAL,
ordered_at TIMESTAMPTZ,
PRIMARY KEY (order_id, item_id) -- CDC tables must declare the upstream primary key
) FROM orders_source TABLE 'public.order_items';
Every row in order_items represents one line item in one order. An order with three products generates three rows with the same order_id. This shared order ID is the foundation of the co-purchase logic.
Building the Co-Purchase Matrix
The core recommendation signal is a self-join on order_id. For every pair of products that appear in the same order, increment a counter.
-- Co-purchased item pairs (the core of "Frequently Bought Together")
CREATE MATERIALIZED VIEW co_purchased_items AS
SELECT
a.item_id AS item_a,
b.item_id AS item_b,
a.item_name AS name_a,
b.item_name AS name_b,
COUNT(*) AS purchase_pair_count
FROM order_items a
JOIN order_items b
ON a.order_id = b.order_id
AND a.item_id < b.item_id -- canonical ordering prevents duplicates
GROUP BY a.item_id, b.item_id, a.item_name, b.item_name;
The a.item_id < b.item_id condition is essential. Without it, every pair (A, B) would also appear as (B, A), doubling your counts and requiring deduplication downstream. By enforcing lexicographic ordering, each pair appears exactly once in canonical form.
This materialized view updates incrementally. When a new order arrives with items X, Y, and Z, RisingWave computes the three new pairs (X,Y), (X,Z), (Y,Z) and increments their counts. It does not recompute the entire matrix.
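The per-order pair expansion is easy to picture in plain Python. This is an illustrative sketch only (RisingWave performs the equivalent expansion inside its streaming engine); sorting the items first reproduces the a.item_id &lt; b.item_id canonical ordering from the view.

```python
from itertools import combinations

def order_pairs(item_ids):
    """Expand one order's line items into canonical (item_a, item_b) pairs.

    Sorting first guarantees item_a < item_b, mirroring the
    a.item_id < b.item_id condition in the materialized view.
    """
    return list(combinations(sorted(item_ids), 2))

# An order containing X, Y, and Z yields exactly three pairs.
print(order_pairs(["X", "Y", "Z"]))  # [('X', 'Y'), ('X', 'Z'), ('Y', 'Z')]
```

Note that an out-of-order input like ["B", "A"] still produces the single canonical pair ("A", "B"), which is why the view never double-counts.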
Normalizing by Item Popularity
The raw co-purchase count has a bias problem. Popular items like charging cables or screen protectors appear in millions of orders. They will have high co-purchase counts with almost every other product, not because they are genuinely complementary but because they are everywhere.
Affinity normalization fixes this by dividing the co-purchase count by how often the recommended item appears in orders overall. If two items are bought together 100 times, but one of them appears in 10,000 orders total, the affinity score is 0.01. If they are bought together 100 times and the item appears in 200 orders total, the score is 0.50, which correctly reflects a strong complementary relationship.
-- Normalize by individual item popularity to avoid recommending always-popular items
CREATE MATERIALIZED VIEW co_purchase_affinity AS
SELECT
cp.item_a,
cp.item_b,
cp.name_b,
cp.purchase_pair_count,
item_counts.total_orders AS item_b_popularity,
cp.purchase_pair_count::FLOAT / NULLIF(item_counts.total_orders, 0) AS affinity_score
FROM co_purchased_items cp
JOIN (
SELECT item_id, COUNT(DISTINCT order_id) AS total_orders
FROM order_items
GROUP BY item_id
) item_counts ON cp.item_b = item_counts.item_id;
The NULLIF(item_counts.total_orders, 0) guard prevents division-by-zero for items that somehow appear in the co-purchase matrix without individual order history, which can happen during initial data loads.
Category-Level Cross-Sell Signals
Beyond item-level pairs, category affinity tells you which product categories tend to be bought together. This is useful for new products with limited order history, or for recommendation fallbacks when item-level data is sparse.
-- Category-level cross-sell signals
CREATE MATERIALIZED VIEW category_affinity AS
SELECT
a.category AS source_category,
b.category AS target_category,
COUNT(*) AS co_purchase_count
FROM order_items a
JOIN order_items b
ON a.order_id = b.order_id
AND a.category != b.category
AND a.category < b.category
GROUP BY a.category, b.category;
If cameras are frequently bought alongside camera bags and memory cards, a new camera lens with no order history can immediately inherit recommendations from its parent category.
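One way to wire that fallback at serving time, sketched in Python. All names and the min_item_recs threshold are hypothetical; the two dictionaries stand in for query results from co_purchase_affinity and category_affinity.

```python
def recommend(item_id, item_recs, category_recs, category_of, min_item_recs=3):
    """Item-level recommendations when history exists, category fallback otherwise.

    item_recs:     {item_id: [(recommended_item, score), ...]}  (item-level affinity)
    category_recs: {category: [(target_category, count), ...]}  (category affinity)
    category_of:   {item_id: category}
    """
    recs = item_recs.get(item_id, [])
    if len(recs) >= min_item_recs:
        return ("item", recs[:min_item_recs])
    # Sparse or missing order history: inherit the parent category's signal.
    return ("category", category_recs.get(category_of.get(item_id), []))

category_recs = {"cameras": [("camera-bags", 40), ("memory-cards", 35)]}
# A brand-new lens with no co-purchase history falls back to category affinity.
print(recommend("lens-1", {}, category_recs, {"lens-1": "cameras"}))
```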
Serving Recommendations at Page Load
All the computation above happens continuously in the background. At page load time, your application fires a simple read query.
Top Items Frequently Bought With Product X
-- Top 3 items frequently bought with item X, scored by affinity
SELECT item_b AS recommended_item_id, name_b, affinity_score
FROM co_purchase_affinity
WHERE item_a = $1
ORDER BY affinity_score DESC
LIMIT 3;
This returns in milliseconds. The co_purchase_affinity view is already maintained; the query is just a lookup with a filter and sort.
Personalized: Exclude What the User Already Owns
A more sophisticated version excludes items the current user has already purchased. Someone who bought product X two weeks ago should not be shown a recommendation to buy it again.
-- Also bought by users who bought X but not yet bought by current user
SELECT cp.item_b, cp.affinity_score
FROM co_purchase_affinity cp
LEFT JOIN order_items oi
ON oi.item_id = cp.item_b AND oi.user_id = $2
WHERE cp.item_a = $1
AND oi.item_id IS NULL -- exclude items user already owns
ORDER BY cp.affinity_score DESC
LIMIT 3;
The LEFT JOIN on order_items with a NULL check on oi.item_id is the standard SQL anti-join pattern. It stays fast as long as order_items can be looked up efficiently by user_id and item_id (a CREATE INDEX on those columns helps), and it personalizes the recommendation output without requiring a separate user history service.
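The anti-join semantics, restated as plain Python over in-memory rows to make the filtering explicit (item names are made up for illustration):

```python
def exclude_owned(recommendations, owned_item_ids):
    """Anti-join: keep only recommendations the user does not already own."""
    owned = set(owned_item_ids)
    return [(item, score) for item, score in recommendations if item not in owned]

recs = [("phone-mount", 0.5), ("cable-x", 0.4), ("case-z", 0.3)]
# The user already bought cable-x, so it is dropped from the result.
print(exclude_owned(recs, ["cable-x"]))  # [('phone-mount', 0.5), ('case-z', 0.3)]
```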
Detecting Viral Product Pairs in Real Time
The affinity matrix gives you the all-time co-purchase relationship. But you also want to detect pairs that are trending right now, which is the viral product scenario from the introduction.
Use TUMBLE windows to compute hourly co-purchase counts and surface pairs with a sudden spike in activity.
-- Hourly co-purchase counts to detect trending pairs
CREATE MATERIALIZED VIEW co_purchase_hourly AS
SELECT
a.item_id AS item_a,
b.item_id AS item_b,
COUNT(*) AS pair_count,
a.window_start AS window_start,
a.window_end AS window_end
FROM TUMBLE(order_items, ordered_at, INTERVAL '1' HOUR) a
JOIN TUMBLE(order_items, ordered_at, INTERVAL '1' HOUR) b
ON a.order_id = b.order_id
AND a.window_start = b.window_start -- window join: both sides bucketed identically
AND a.item_id < b.item_id
GROUP BY a.item_id, b.item_id, a.window_start, a.window_end;
Each row in co_purchase_hourly represents the count of a specific item pair within a specific one-hour window. To find pairs that are spiking today compared to their historical baseline, query the most recent complete window and compare it against the preceding windows.
-- Pairs with a 3x spike in the most recent hour vs. prior 6-hour average
SELECT
current_window.item_a,
current_window.item_b,
current_window.pair_count AS current_hour_count,
AVG(prior_windows.pair_count) AS baseline_avg,
current_window.pair_count::FLOAT / NULLIF(AVG(prior_windows.pair_count), 0) AS spike_ratio
FROM co_purchase_hourly current_window
JOIN co_purchase_hourly prior_windows
ON current_window.item_a = prior_windows.item_a
AND current_window.item_b = prior_windows.item_b
AND prior_windows.window_end < current_window.window_start
AND prior_windows.window_start > current_window.window_start - INTERVAL '7' HOUR
WHERE current_window.window_end = (SELECT MAX(window_end) FROM co_purchase_hourly)
GROUP BY current_window.item_a, current_window.item_b, current_window.pair_count
HAVING current_window.pair_count::FLOAT / NULLIF(AVG(prior_windows.pair_count), 0) > 3.0
ORDER BY spike_ratio DESC;
A spike_ratio above 3.0 means the pair is being purchased together at three times its normal rate. These are your viral candidates, and they surface within an hour of the trend starting.
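The spike arithmetic itself is just a ratio against a short rolling baseline. A sketch (the baseline here is the mean of the six prior hourly counts, matching the 7-hour lookback window in the SQL):

```python
def spike_ratio(current_hour_count, prior_hour_counts):
    """Current hour's pair count relative to its recent hourly average."""
    if not prior_hour_counts:
        return None  # no baseline yet; nothing meaningful to compare against
    baseline = sum(prior_hour_counts) / len(prior_hour_counts)
    if baseline == 0:
        return None  # mirrors the NULLIF guard in the SQL
    return current_hour_count / baseline

# A pair averaging 4 co-purchases/hour that just hit 20 is a 5x spike.
print(spike_ratio(20, [3, 4, 5, 4, 4, 4]))  # 5.0
```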
The Python Integration
Your e-commerce backend queries RisingWave over a standard psycopg2 connection. The connection parameters look exactly like connecting to PostgreSQL, with port 4566 instead of 5432.
import psycopg2
conn = psycopg2.connect(
host="localhost", port=4566,
database="dev", user="root", password=""
)
def get_frequently_bought_together(item_id, user_id=None, limit=3):
with conn.cursor() as cur:
cur.execute("""
SELECT cp.item_b, cp.name_b, cp.affinity_score
FROM co_purchase_affinity cp
LEFT JOIN order_items oi
ON oi.item_id = cp.item_b AND oi.user_id = %s
WHERE cp.item_a = %s
AND oi.item_id IS NULL
ORDER BY cp.affinity_score DESC
LIMIT %s
""", (user_id, item_id, limit))
return cur.fetchall()
You can call this function from any product page handler. If user_id is None, the LEFT JOIN returns no rows to exclude, and the function falls back to pure affinity-based recommendations.
For production deployments, use a connection pool such as psycopg2.pool.ThreadedConnectionPool. RisingWave handles concurrent reads efficiently, and each materialized view query runs against pre-computed state rather than re-executing the underlying aggregation.
The Latency Chain
Understanding end-to-end latency helps set realistic expectations. Here is what happens when a user completes a purchase:
1. The transaction commits to PostgreSQL. The row appears in order_items in the application database.
2. PostgreSQL's logical replication slot emits a WAL event. This typically happens within 10-50 milliseconds of the commit.
3. RisingWave's CDC connector receives the event and processes it through the streaming engine. The materialized views for co_purchased_items and co_purchase_affinity update incrementally. This step takes milliseconds to a few seconds depending on load.
4. The next product page request queries co_purchase_affinity and receives the updated recommendation set.
The total end-to-end latency from purchase to recommendation update is typically under 10 seconds in a production deployment. Compare this to a nightly batch job where the same signal would take up to 24 hours to propagate.
For the viral product scenario: if 50 people buy product X with product Y in the same hour, the affinity score between X and Y will reflect all 50 purchases by the time the 51st customer is on product X's page. There is no batch boundary to wait for.
Batch vs. Streaming: A Direct Comparison
The following table compares a traditional nightly batch co-purchase pipeline against the streaming approach described in this article.
| Dimension | Nightly Batch | Streaming with RisingWave |
| --- | --- | --- |
| Update latency | Up to 24 hours | Under 10 seconds |
| Accuracy for viral products | Misses same-day trends entirely | Reflects trends within minutes |
| Accuracy for seasonal spikes | Catches it in the next batch cycle (hours late) | Captures it as it develops |
| Accuracy for slow-moving items | Good (enough data accumulates overnight) | Good (same data, just fresher) |
| Infrastructure complexity | Spark or dbt + Airflow + scheduler + data warehouse | PostgreSQL CDC + RisingWave |
| SQL vs. code | dbt SQL or Spark DataFrame | Standard SQL materialized views |
| Cost model | Batch compute spikes during job runs | Continuous low-level compute |
| Personalization (exclude owned items) | Requires join with user history at serving time | Native LEFT JOIN in read query |
The streaming approach does not just improve latency. It simplifies the infrastructure stack. A nightly batch pipeline typically involves a job scheduler, a distributed compute cluster, a data warehouse for the intermediate results, and a separate cache layer for serving. The streaming approach collapses this into a single streaming database that handles both the computation and the serving.
Extending the Pipeline
Once you have the co-purchase foundation running, several extensions are straightforward.
Segment-specific recommendations. Add a GROUP BY on customer segment or geographic region to build segment-aware co-purchase matrices. A product pair that is popular with enterprise buyers may be different from the pair popular with individual consumers.
Price-range filtering. Extend the serving query to filter co_purchase_affinity by price range. A customer looking at a budget item probably does not want to be shown a premium complement as the first recommendation.
Bundle detection. When three or more products have high mutual affinity scores, they are bundle candidates. A query across co_purchase_affinity that finds triangular relationships (A-B, B-C, A-C all with high affinity) surfaces natural product bundles automatically.
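Triangle detection over the pair list can be sketched as follows. This is a hypothetical helper, not part of the pipeline above; it assumes pairs arrive in canonical (item_a &lt; item_b) form, which the co_purchased_items view guarantees, and the 0.2 threshold is arbitrary.

```python
from itertools import combinations

def find_bundles(pair_scores, threshold=0.2):
    """Find item triples where all three pairwise affinities clear a threshold.

    pair_scores: {(item_a, item_b): affinity_score} with item_a < item_b.
    """
    strong = {pair for pair, score in pair_scores.items() if score >= threshold}
    items = sorted({item for pair in strong for item in pair})
    bundles = []
    for a, b, c in combinations(items, 3):  # a < b < c, matching canonical form
        if {(a, b), (a, c), (b, c)} <= strong:  # all three edges must be strong
            bundles.append((a, b, c))
    return bundles

pairs = {
    ("camera", "lens"): 0.6,
    ("camera", "tripod"): 0.3,
    ("lens", "tripod"): 0.25,
    ("camera", "stickers"): 0.9,  # a strong pair, but it closes no triangle
}
print(find_bundles(pairs))  # [('camera', 'lens', 'tripod')]
```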
Inventory awareness. JOIN the co_purchase_affinity serving query against an inventory view (also maintained via CDC) to exclude out-of-stock items from recommendations. This can be another materialized view that RisingWave maintains from your inventory table CDC stream.
Getting Started
To run this pipeline locally, you need:
- PostgreSQL with logical replication enabled (wal_level = logical) and a publication on your orders table
- RisingWave running locally via Docker (docker run -d -p 4566:4566 risingwavelabs/risingwave:latest)
- The SQL above, executed in order against RisingWave's psql endpoint
The RisingWave CDC quickstart walks through the PostgreSQL logical replication setup in detail. Once you have the source connected, the materialized views in this article are copy-paste ready.
Ready to build real-time Frequently Bought Together? Sign up for RisingWave Cloud for a free managed instance, or follow the quickstart guide to run locally. Join the Slack community if you run into questions.

