Semantic Product Recommendations with Vector Search and Streaming SQL



A user just bought a "linen shirt for summer travel." They are clearly preparing for a warm-weather trip and packing light. Your recommendation engine should immediately surface "lightweight travel pants," "quick-dry travel socks," and "packable rain jacket." These items do not share a category with the shirt. They have not been co-purchased with it enough times to register in a "Frequently Bought Together" signal. And no other user behavior connects them in a collaborative filter.

But they are obviously related. Anyone reading those product descriptions would see the connection instantly. The phrase "travel," the quality of being lightweight, the implicit use case of a warm-weather trip: these shared characteristics live in the meaning of the descriptions, not in the metadata taxonomy.

This is the gap that vector search fills for recommendation systems. By representing product descriptions as points in a high-dimensional vector space, we can find items that are semantically related even when they share no categories, no tags, and no co-purchase history. And by maintaining those embeddings as materialized views in RisingWave, we keep them fresh as the catalog changes: no batch jobs, no re-indexing schedules, no lag.

What Vector Search Does for Recommendations

Traditional recommendation approaches have predictable blind spots.

Category filtering is the simplest approach: recommend items in the same category. It works well within tightly defined categories (other linen shirts) but completely misses cross-category relevance. Travel pants are not shirts, so they never appear.

Collaborative filtering looks at what users with similar behavior bought or viewed. It is powerful when behavioral data is dense. But it requires many users to interact with items before relationships emerge. A newly added product has no purchase history, so it is invisible to collaborative filtering until enough users interact with it. And even with dense data, collaborative filtering will miss the "linen shirt to travel pants" connection unless many users happened to buy both.

Keyword and tag matching finds items with similar labels. But synonyms break it: "breathable" and "lightweight" are semantically equivalent in a travel context, but exact-match systems treat them as unrelated. Tag completeness also varies across catalog teams: one item might be tagged "travel" and another might not be, even if the description says the same thing.

Vector search solves all three problems. Each product description is encoded as a high-dimensional vector (an embedding) where direction and distance carry semantic meaning. Items with similar meanings end up close together in this space regardless of category, purchase history, or tag overlap. "Linen shirt for summer travel" and "lightweight travel pants" end up near each other because the embedding model has learned, from billions of text examples, that these phrases describe related concepts.

Finding recommendations then becomes a nearest-neighbor search: given the vector for item A, find the items whose vectors are closest to it.
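At its core, "find the closest vectors" is just a ranking by similarity. Here is a minimal brute-force sketch in plain Python, with toy 3-dimensional vectors standing in for real 1536-dimensional embeddings (the item names and values are illustrative, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(seed, catalog, k=2):
    # Rank every catalog item by cosine similarity to the seed vector.
    scored = [
        (item_id, cosine_similarity(seed, vec))
        for item_id, vec in catalog.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings": travel items point in a similar direction,
# the cookware item points somewhere else entirely.
catalog = {
    "travel_pants":   [0.9, 0.8, 0.1],
    "quick_dry_sock": [0.7, 0.6, 0.3],
    "cast_iron_pan":  [0.1, 0.0, 0.9],
}
linen_shirt = [0.85, 0.75, 0.15]
```

An HNSW index does the same ranking approximately, without scanning every item, which is what makes the SQL queries later in this post fast at catalog scale.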

The Streaming Angle: Why Embeddings Go Stale

Most vector search implementations work like this: a data engineer writes a batch job that reads the product catalog, calls an embedding API, and writes the results to a vector database. The job runs nightly or weekly. When a product description changes, the embedding stays wrong until the next batch run.

For a catalog that changes frequently (new products launched daily, descriptions revised based on SEO testing, items going out of stock), this staleness is a real problem. A product whose description was updated to emphasize a new use case will not reflect that change in recommendations until the batch job catches up.

RisingWave takes a different approach. Embeddings are maintained as a materialized view defined over the product table. When a product's description changes in the source database, the CDC (change data capture) event flows into RisingWave, and the materialized view recomputes the embedding for only that row. No batch job. No re-indexing schedule. No lag window.

This means your vector search always reflects the current state of your catalog.

Setting Up the Product Source

We will start by connecting RisingWave to a PostgreSQL product catalog via CDC:

-- Products table (synced from PostgreSQL via CDC)
CREATE SOURCE products_source
WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres',
    port = '5432',
    username = 'postgres',
    password = '${POSTGRES_PASSWORD}',
    database.name = 'store',
    schema.name = 'public',
    publication.name = 'products_pub'
);

CREATE TABLE products (
    item_id     TEXT PRIMARY KEY,
    title       TEXT,
    description TEXT,
    category    TEXT,
    brand       TEXT,
    tags        TEXT[],
    in_stock    BOOLEAN,
    updated_at  TIMESTAMPTZ
) FROM products_source TABLE 'public.products';

This creates a CDC-backed table that reflects your PostgreSQL product catalog in real time. Every INSERT, UPDATE, and DELETE in the source products table propagates to RisingWave within milliseconds.

The Embedding Materialized View

With the product table in place, we define the materialized view that generates and maintains embeddings:

-- Materialized view that auto-embeds product descriptions
-- When a product description changes via CDC, its embedding updates automatically
CREATE MATERIALIZED VIEW product_embeddings AS
SELECT
    item_id,
    title,
    category,
    brand,
    in_stock,
    openai_embedding(
        '${OPENAI_API_KEY}',
        'text-embedding-3-small',
        -- COALESCE guards against a NULL field nulling out the whole
        -- concatenated string, since || propagates NULL
        COALESCE(title, '') || '. ' || COALESCE(description, '')
            || '. Brand: ' || COALESCE(brand, '')
            || '. Category: ' || COALESCE(category, '')
    )::vector(1536) AS embedding
FROM products
WHERE in_stock = true;

-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX product_embedding_idx 
ON product_embeddings 
USING hnsw (embedding vector_cosine_ops);

A few things worth noting here.

The input string combines title, description, brand, and category into a single text. This is intentional. Product titles are often too short to carry full semantic meaning ("Travel Pants"). Full descriptions add context ("moisture-wicking, quick-dry fabric, articulated knees for mobility"). Brand and category ground the embedding in product space rather than general language space. The combination produces better embeddings than title alone.

The ::vector(1536) cast converts the REAL[] array returned by openai_embedding into the vector type that RisingWave uses for similarity search. The dimension (1536) must match the model's output dimension.

The HNSW index with vector_cosine_ops enables approximate nearest neighbor search using cosine similarity. For text embeddings, cosine similarity is generally preferred over L2 distance because it measures the angle between vectors (semantic direction) rather than absolute distance, which can be influenced by vector magnitude. The index makes nearest-neighbor queries fast even with hundreds of thousands of products.
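To make the cosine-vs-L2 point concrete, here is a small plain-Python sketch showing that scaling a vector changes its L2 distance but leaves its cosine distance untouched (toy vectors, for illustration only):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: depends only on the angle between vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def l2_distance(a, b):
    # Euclidean distance: sensitive to vector magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]  # same direction, double the magnitude
```

Here `cosine_distance(v, scaled)` is essentially zero (same semantic direction), while `l2_distance(v, scaled)` is large: exactly the behavior you want when embedding magnitudes vary for reasons unrelated to meaning.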

Finding Semantically Similar Items

With embeddings maintained and indexed, a "More like this" recommendation query becomes straightforward:

-- Items semantically similar to item X
SELECT
    candidate.item_id,
    candidate.title,
    candidate.category,
    1 - (candidate.embedding <=> seed.embedding) AS similarity_score
FROM product_embeddings candidate
CROSS JOIN (
    SELECT embedding FROM product_embeddings WHERE item_id = $1
) seed
WHERE candidate.item_id != $1
  AND candidate.in_stock = true
ORDER BY candidate.embedding <=> seed.embedding
LIMIT 20;

The <=> operator computes cosine distance (lower is more similar). 1 - distance converts it to a similarity score between 0 and 1. The subquery fetches the seed item's embedding once and crosses it with all candidates, letting the HNSW index apply efficiently.

This query will find travel pants, quick-dry socks, and packable jackets as neighbors of a linen travel shirt, because their descriptions share semantic features in the embedding space regardless of their category assignments in the taxonomy.

User Intent Embeddings

Vector similarity is not limited to item-to-item recommendations. You can embed a user's search query or the description of whatever they are currently viewing and find the nearest items:

import os

import psycopg2
import openai

conn = psycopg2.connect(
    host="localhost", port=4566,
    database="dev", user="root", password=""
)

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
openai_client = openai.Client(api_key=OPENAI_API_KEY)

def find_similar_by_query(search_query: str, limit: int = 10):
    """Find products semantically similar to a free-text query."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=search_query
    )
    # psycopg2 would send a Python list as ARRAY[...]; the vector type
    # expects the '[x, y, ...]' text format, so serialize it explicitly.
    query_embedding = str(response.data[0].embedding)

    with conn.cursor() as cur:
        cur.execute("""
            SELECT item_id, title, category,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM product_embeddings
            WHERE in_stock = true
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, limit))
        return cur.fetchall()


def find_similar_to_item(item_id: str, limit: int = 10):
    """Find products semantically similar to a given product."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT
                c.item_id, c.title, c.category,
                1 - (c.embedding <=> s.embedding) AS similarity
            FROM product_embeddings c
            CROSS JOIN (
                SELECT embedding FROM product_embeddings WHERE item_id = %s
            ) s
            WHERE c.item_id != %s
            ORDER BY c.embedding <=> s.embedding
            LIMIT %s
        """, (item_id, item_id, limit))
        return cur.fetchall()

find_similar_by_query is especially useful for search-based discovery. A user who types "something comfortable for a beach vacation" has expressed intent that does not map cleanly to any category. Embedding that query and finding nearest neighbors in product space will surface swimwear, sandals, linen cover-ups, and beach towels: items that match the intent even though no keyword overlap exists.

find_similar_to_item powers the "More like this" widget on product detail pages. Both functions share the same RisingWave connection, so the application layer stays simple.

Hybrid Recommendations: Semantic + Behavioral Signals

Vector similarity is powerful, but it operates on description text alone. It cannot know that users who buy item A almost always buy item B, even if their descriptions are not particularly similar (think: phone and phone case). For the best recommendation quality, combine semantic similarity with behavioral co-purchase signals.

Assuming you have a co_purchased_items view tracking items frequently bought together (from behavioral data), the hybrid query is:

-- Hybrid: semantic similarity (50%) + co-purchase signal (50%)
-- The co-purchase count is capped and rescaled into [0, 1] so both
-- signals are on the same scale; the cap (100 here) is catalog-specific
SELECT
    semantic.item_id,
    semantic.title,
    semantic.similarity_score * 0.5 +
        LEAST(COALESCE(cp.purchase_pair_count, 0) / 100.0, 1.0) * 0.5 AS hybrid_score
FROM (
    -- Semantic similarity to the seed item
    SELECT
        c.item_id,
        c.title,
        1 - (c.embedding <=> s.embedding) AS similarity_score
    FROM product_embeddings c
    CROSS JOIN (SELECT embedding FROM product_embeddings WHERE item_id = $1) s
    WHERE c.item_id != $1
    ORDER BY c.embedding <=> s.embedding
    LIMIT 50
) semantic
LEFT JOIN co_purchased_items cp 
    ON cp.item_a = $1 AND cp.item_b = semantic.item_id
ORDER BY hybrid_score DESC
LIMIT 10;

The inner query pulls the top 50 semantic candidates; the outer query scores each by combining semantic similarity with co-purchase frequency. Items with strong semantic overlap get promoted further if they also have behavioral confirmation. Items that are semantically similar but have never been co-purchased still appear (via the semantic score alone), which is critical for new items that have no purchase history yet.

The 50/50 split is a starting point. If your catalog is text-rich and behavioral data is sparse (common in B2B catalogs or newer stores), weight semantic similarity more heavily. If your catalog has many similar-sounding items that serve different purposes, weight co-purchase data more heavily. A/B testing will reveal the right balance.
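The same blending logic can be sketched in the application layer. One important detail: similarity scores live in [0, 1] but raw co-purchase counts are unbounded, so the count should be capped and rescaled before weighting (the cap of 100 below is a hypothetical, catalog-specific choice, as are the item names):

```python
def hybrid_score(similarity, co_purchase_count,
                 semantic_weight=0.5, behavioral_weight=0.5,
                 count_cap=100):
    # Cap and rescale the unbounded co-purchase count into [0, 1]
    # so both signals contribute on the same scale.
    behavioral = min(co_purchase_count, count_cap) / count_cap
    return semantic_weight * similarity + behavioral_weight * behavioral

# (item_id, semantic similarity to seed, co-purchase count with seed)
candidates = [
    ("travel_pants", 0.92, 40),  # strong on both signals
    ("phone_case",   0.30, 95),  # weak semantic, strong behavioral
    ("new_jacket",   0.88, 0),   # brand-new item, no purchase history yet
]
ranked = sorted(
    candidates,
    key=lambda c: hybrid_score(c[1], c[2]),
    reverse=True,
)
```

Note how the new item with zero purchase history still ranks on its semantic score alone, while the behaviorally confirmed item is promoted past items that merely sound similar.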

Choosing an Embedding Model

Not all embedding models are equally suited for product catalog search. Here is a practical comparison:

Model                  | Dimensions | Speed  | Cost   | Best For
text-embedding-3-small | 1536       | Fast   | Low    | Product titles, short descriptions, real-time use
text-embedding-3-large | 3072       | Slower | Higher | Long detailed descriptions, high-precision search
text-embedding-ada-002 | 1536       | Fast   | Low    | Legacy; prefer 3-small for new projects

For most product catalog use cases, text-embedding-3-small is the right choice. It is cheap enough to embed an entire catalog without significant cost, fast enough to use in real-time CDC pipelines, and accurate enough for product title and description similarity.

text-embedding-3-large is worth considering if your product descriptions are long (500+ words) and contain technical specifications where nuance matters. Think industrial equipment, pharmaceutical products, or specialized B2B catalogs where "compatible with model X but not model Y" in a description must be correctly understood.

One practical note: the embedding model version should match between the indexing step and the query step. If you embed product descriptions with text-embedding-3-small, embed user queries with the same model. Mixing models produces meaningless similarity scores.
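A lightweight way to enforce this is to pin the model name and dimension in one shared module and validate against it on both paths. A minimal sketch (constants and function names are illustrative, not a RisingWave or OpenAI API):

```python
# Shared by the indexing pipeline and the query path so the two
# can never silently drift onto different models.
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536

def check_embedding(vector, model_used):
    # Refuse to index or query with a mismatched model or dimension.
    if model_used != EMBEDDING_MODEL:
        raise ValueError(f"expected {EMBEDDING_MODEL}, got {model_used}")
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dims, got {len(vector)}")
    return vector
```

The check costs nothing at runtime and turns a subtle ranking-quality bug (meaningless similarity scores) into a loud, immediate error.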

The Streaming Advantage in Practice

Let us walk through what happens when a product description changes, to make the streaming advantage concrete.

Batch approach:

  1. Product description updated in PostgreSQL at 2:00 PM
  2. Nightly embedding job scheduled for 2:00 AM runs
  3. New embedding written to vector database at 2:15 AM
  4. Recommendations reflect the updated description 12+ hours after the change

RisingWave streaming approach:

  1. Product description updated in PostgreSQL at 2:00 PM
  2. CDC event arrives in RisingWave within milliseconds
  3. Materialized view recomputes embedding for that row, calling OpenAI API
  4. HNSW index updated with new embedding
  5. Recommendations reflect the updated description within seconds

For most catalog updates, the 12-hour lag in the batch approach is invisible to users. But consider cases where the lag matters:

  • An item is discontinued and its description changes to indicate a replacement product. With batch, users see the old item in "similar items" for 12 more hours after it goes out of stock.
  • A product description is updated to correct a factual error. Recommendations based on the wrong description persist until the next batch run.
  • A new product is added with a description that positions it against existing inventory. It should immediately appear in "similar items" for those existing products.

The streaming approach handles all three cases correctly without any additional infrastructure.

When to Use Which Recommendation Approach

Different recommendation signals serve different purposes on a product page. Here is a practical guide:

Co-purchase signals ("Frequently Bought Together"): Best for accessory bundling and complementary items where the relationship is behavioral, not descriptive. Works only for items with purchase history. Built from event stream data with a co-occurrence materialized view.

Collaborative filtering ("Users like you also liked"): Best for discovery across your catalog based on user similarity. Requires dense interaction data and a model training step. The feature inputs (user category profiles, item interaction counts) can be maintained as streaming materialized views, even if the model itself is trained offline.

Vector similarity ("More like this" and search discovery): Best for new items with no behavioral history, cross-category recommendations based on semantic relationships, and search-based discovery from free-text queries. Works immediately for any item with a description.

Category filtering: Best as a fallback when no other signals are available, or for highly browsable categories (like apparel) where users genuinely want to see alternatives within the same product type.

In practice, a production recommendation system uses all four, routing users to the appropriate strategy based on context. The product detail page might show vector similarity in the "More like this" section and co-purchase signals in the "Frequently Bought Together" section. The search results page might blend vector similarity (semantic matches) with collaborative filtering (what engaged users have clicked on) in the ranking score.
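The routing logic described above can be sketched as a simple dispatch function. This is an illustrative shape only; the context keys and strategy names are hypothetical, not a RisingWave API:

```python
def choose_strategy(context):
    """Pick a recommendation signal based on page context.

    Keys and strategy names are illustrative placeholders.
    """
    if context.get("query"):
        return "vector_similarity"       # free-text search discovery
    if context.get("widget") == "frequently_bought_together":
        return "co_purchase"             # behavioral bundling signal
    if context.get("has_dense_user_history"):
        return "collaborative_filtering" # dense interaction data available
    if context.get("item_id"):
        return "vector_similarity"       # "More like this" on a detail page
    return "category_filtering"          # fallback when no signal exists
```

In a real system the router would also handle blending (e.g. combining scores on the search results page) rather than picking a single winner, but the fallback ordering is the core idea.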

Architecture Summary

The complete semantic recommendation pipeline has three layers:

Source and ingestion. The product catalog lives in PostgreSQL. A postgres-cdc connector feeds changes into RisingWave's products table in real time. No Debezium, no additional message bus: CDC is handled directly by RisingWave.

Embedding and indexing. The product_embeddings materialized view maintains embeddings for all in-stock products. RisingWave calls the OpenAI embedding API incrementally as rows are created or updated. The HNSW index provides fast approximate nearest-neighbor search over the embedding column.

Serving. Your application queries recommendations via a standard psycopg2 connection on port 4566. Both item-to-item similarity and query-to-item similarity queries hit the HNSW index, returning results in milliseconds. The Python functions shown earlier are the complete serving layer: no separate vector database client, no additional SDK.

This replaces what typically requires: a separate vector database (Pinecone, Weaviate, or Qdrant), a batch embedding pipeline, a re-indexing scheduler, and CDC infrastructure to detect catalog changes. RisingWave handles all of it in a single system that speaks standard SQL.

Getting Started

If you want to try this pipeline, RisingWave runs locally with Docker. You will need an OpenAI API key for the embedding function. For a managed deployment, RisingWave Cloud provides a free tier with no cluster management.

A practical starting point: load a sample of 1,000 products, create the materialized view, and run some similarity queries against items in different categories. The cross-category recommendations that emerge (items semantically related across taxonomy boundaries) are typically where recommendation quality surprises users most.

Once the base pipeline is working, extensions worth building:

  • User session embeddings: embed a running summary of what the user has viewed in this session and find items nearest to that evolving intent vector
  • Diversity controls: filter out items with similarity scores above 0.98 to avoid trivial duplicates (same item, different color) dominating the results
  • Freshness boosting: multiply similarity scores by a recency factor to surface newer items in semantically similar results
  • Negative feedback: when a user dismisses a recommendation, store the item's embedding and filter out near-neighbors of dismissed items in future queries

All of these extensions layer onto the same materialized view and HNSW index without requiring changes to the core pipeline.
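As one example of how thin these extensions are, the diversity control is a few lines of post-processing on the ranked results. A sketch, assuming (item_id, similarity) pairs from the queries shown earlier and the 0.98 threshold suggested above:

```python
def diversify(ranked, max_similarity=0.98):
    """Drop near-duplicates (e.g. the same item in another color) from a
    similarity-ranked list of (item_id, similarity) pairs."""
    return [
        (item_id, score)
        for item_id, score in ranked
        if score <= max_similarity  # threshold is a tunable starting point
    ]
```

Because this runs on the query results rather than inside the pipeline, the threshold can be tuned per widget without touching the materialized view.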


Ready to build semantic recommendations? Sign up for RisingWave Cloud, free with no credit card required. Or follow the Quickstart to run locally.

Join the RisingWave Slack community to ask questions and see what others are building with vector search and streaming SQL.
