{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is collaborative filtering?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Collaborative filtering is a recommendation technique that uses collective user behavior to make predictions. Item-based collaborative filtering finds items that are frequently co-interacted with and recommends similar items. User-based collaborative filtering finds users with similar interaction histories and recommends what those similar users liked."
}
},
{
"@type": "Question",
"name": "What is real-time collaborative filtering?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Real-time collaborative filtering computes and updates similarity signals continuously as events arrive, rather than in nightly batch jobs. When a user buys an item, the co-purchase matrix updates immediately. When a new item accumulates views, its co-occurrence relationships with other items form within minutes rather than waiting until the next training run."
}
},
{
"@type": "Question",
"name": "How does item-based collaborative filtering differ from user-based?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Item-based CF builds a matrix of item-to-item similarity based on which items users interact with together. User-based CF builds a matrix of user-to-user similarity based on overlapping interaction histories. Item-based typically scales better because item catalogs are smaller and more stable than user bases, and item relationships change more slowly."
}
},
{
"@type": "Question",
"name": "Can you implement collaborative filtering in SQL without training a model?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Yes. Co-occurrence-based collaborative filtering can be expressed entirely in SQL using self-joins on session or user identifiers. Streaming databases like RisingWave maintain these materialized views incrementally, so similarity scores update as new events arrive without retraining. This gives you strong baseline recommendations without a training pipeline."
}
},
{
"@type": "Question",
"name": "What are the scalability limits of SQL-based collaborative filtering?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Item-based CF scales to catalogs of hundreds of thousands of items because the co-occurrence matrix grows with O(items squared) but can be pruned aggressively. User-based CF is harder to scale because the user similarity matrix grows with O(users squared) and user counts are usually much larger than item counts. For tens of millions of users, user-based CF in pure SQL requires additional approximation techniques."
}
}
]
}
The Table That Updates While You Sleep -- and the One That Does Not
Walk into any data team that runs collaborative filtering recommendations and ask them when "users who bought this also bought that" was last updated. The honest answer is usually: sometime last night.
The intuition behind collaborative filtering is immediate and powerful. If users A, B, and C all bought a camping tent and a sleeping bag in close succession, the sleeping bag is a strong recommendation for any new user who just put a camping tent in their cart. The insight is obvious. The implementation question is: how recently is "also bought" measured?
In most production systems the answer is: the last 24 hours of transactions that completed before 2am. Which means the user who bought a tent at 11pm and is browsing your site at 9am the next morning will not benefit from the purchases of the hundreds of people who also bought tents overnight. Those co-purchase signals exist in your database. They are simply invisible to the recommendation layer because the batch job has not run yet.
Real-time collaborative filtering fixes this by replacing the nightly batch run with continuously maintained materialized views. The co-purchase matrix updates the moment a new purchase lands. A user browsing at 9am sees recommendation signals built from purchases that happened at 8:59am.
This article walks through the complete SQL implementation of both item-based and user-based collaborative filtering in RisingWave, a PostgreSQL-compatible streaming database. Every query in this article is runnable. All views update within milliseconds of each new event.
The Two Flavors of Collaborative Filtering
Collaborative filtering comes in two forms. Understanding the difference matters for choosing which one to implement first.
Item-based collaborative filtering asks: which items are frequently interacted with together? If item A and item B appear in many of the same sessions, or are co-purchased by many of the same users, they have a relationship. When a user engages with item A, recommend item B.
The similarity matrix here is item x item. For a catalog of 100,000 products, that is up to 10 billion potential pairs, but in practice the matrix is extremely sparse -- most pairs have zero co-occurrences -- and you only need to store the non-zero ones.
User-based collaborative filtering asks: which users have similar taste profiles? If user X and user Y have both bought items 1, 2, and 3, they are similar. Whatever user Y bought next (item 4) is a good recommendation for user X.
The similarity matrix here is user x user. For an application with 10 million users, that is 100 trillion potential pairs. User-based CF is fundamentally harder to scale, which is why most large production systems use item-based CF as the primary approach and either abandon user-based or approximate it aggressively.
We will implement both, starting with item-based.
Setting Up the Event Stream
Both approaches need the same input: a stream of user interaction events. In production these flow through Kafka from your application layer.
CREATE TABLE user_events (
    event_id TEXT,
    user_id TEXT,
    item_id TEXT,
    event_type TEXT,   -- 'view', 'add_to_cart', 'purchase', 'like'
    category TEXT,
    session_id TEXT,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'user-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FROM user_events_source FORMAT PLAIN ENCODE JSON;
For local development without Kafka, create the table directly:
CREATE TABLE user_events (
event_id TEXT,
user_id TEXT,
item_id TEXT,
event_type TEXT,
category TEXT,
session_id TEXT,
event_time TIMESTAMPTZ
);
-- Sample data representing two sessions across three users
INSERT INTO user_events VALUES
('e01', 'u1', 'tent_pro', 'view', 'camping', 's1', NOW() - INTERVAL '2 hours'),
('e02', 'u1', 'sleeping_bag','view', 'camping', 's1', NOW() - INTERVAL '2 hours'),
('e03', 'u1', 'tent_pro', 'purchase', 'camping', 's1', NOW() - INTERVAL '2 hours'),
('e04', 'u1', 'sleeping_bag','purchase', 'camping', 's1', NOW() - INTERVAL '2 hours'),
('e05', 'u2', 'tent_pro', 'view', 'camping', 's2', NOW() - INTERVAL '1 hour'),
('e06', 'u2', 'headlamp', 'view', 'camping', 's2', NOW() - INTERVAL '1 hour'),
('e07', 'u2', 'tent_pro', 'purchase', 'camping', 's2', NOW() - INTERVAL '1 hour'),
('e08', 'u2', 'headlamp', 'purchase', 'camping', 's2', NOW() - INTERVAL '1 hour'),
('e09', 'u3', 'sleeping_bag','view', 'camping', 's3', NOW() - INTERVAL '30 minutes'),
('e10', 'u3', 'headlamp', 'view', 'camping', 's3', NOW() - INTERVAL '30 minutes'),
('e11', 'u3', 'sleeping_bag','purchase', 'camping', 's3', NOW() - INTERVAL '30 minutes');
With this data, tent_pro and sleeping_bag are co-purchased in u1's session, and tent_pro and headlamp are co-purchased in u2's session. sleeping_bag and headlamp are co-viewed by user u3 but not co-purchased: u3 buys only the sleeping bag. The item similarity views we build next capture all three relationships, with the co-view matrix covering all three pairs and the co-purchase matrix covering the first two.
Item-Based Collaborative Filtering
The Co-View Matrix
Session co-views are the fastest-building signal. Items that appear in the same browsing session have a demonstrated behavioral connection, regardless of whether the user completed a purchase.
CREATE MATERIALIZED VIEW item_coview_matrix AS
SELECT
a.item_id AS item_a,
b.item_id AS item_b,
COUNT(DISTINCT a.session_id) AS session_cooccurrence,
COUNT(DISTINCT a.user_id) AS user_cooccurrence
FROM user_events a
JOIN user_events b
ON a.session_id = b.session_id
AND a.item_id < b.item_id -- prevents (A,B) and (B,A) duplicates
GROUP BY a.item_id, b.item_id;
Two details matter here. The a.item_id < b.item_id condition is not just an optimization: it is necessary for correctness. Without it, every pair would appear twice in the output -- once as (tent_pro, sleeping_bag) and once as (sleeping_bag, tent_pro) -- and all downstream aggregations would double-count.
The distinction between session_cooccurrence and user_cooccurrence is also meaningful. An item pair with session_cooccurrence = 100 and user_cooccurrence = 80 was viewed together by 80 different people in 100 sessions -- a strong, diverse signal. An item pair with session_cooccurrence = 100 and user_cooccurrence = 2 was generated by two users who each had very long sessions -- much noisier, and potentially driven by unusual behavior rather than a real product relationship.
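If you want to act on that distinction directly, a serving-side guard can require co-occurrence breadth before trusting a pair. A minimal sketch; the threshold of 5 users is an illustrative value, not a recommendation:

```sql
-- Strongest co-view pairs, restricted to pairs seen by at least
-- 5 distinct users so a few long single-user sessions cannot
-- dominate the signal.
SELECT item_a, item_b, session_cooccurrence, user_cooccurrence
FROM item_coview_matrix
WHERE user_cooccurrence >= 5
ORDER BY session_cooccurrence DESC
LIMIT 20;
```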
The Co-Purchase Matrix
Purchases are a stronger signal than views because they represent a completed transaction decision. Co-purchases within the same session are the strongest item similarity signal available without an explicit rating.
CREATE MATERIALIZED VIEW co_purchased AS
SELECT
a.item_id AS item_a,
b.item_id AS item_b,
COUNT(*) AS purchase_pair_count,
COUNT(DISTINCT a.user_id) AS unique_buyers
FROM user_events a
JOIN user_events b
ON a.session_id = b.session_id
AND a.item_id < b.item_id
AND a.event_type = 'purchase'
AND b.event_type = 'purchase'
GROUP BY a.item_id, b.item_id;
After inserting the sample data above, querying this view should return two rows (pairs are stored with item_a < item_b):
- (headlamp, tent_pro): purchase_pair_count = 1, unique_buyers = 1
- (sleeping_bag, tent_pro): purchase_pair_count = 1, unique_buyers = 1
The (headlamp, sleeping_bag) pair is absent: u3 viewed both items but purchased only the sleeping bag, and this view requires purchase events on both sides of the pair.
As more users make purchases, these counts grow and the similarity picture sharpens.
Combined Item Similarity Score
The most useful similarity signal combines co-view breadth (many different users saw these items together) with co-purchase strength (users who committed both purchases).
CREATE MATERIALIZED VIEW item_similarity AS
SELECT
COALESCE(cv.item_a, cp.item_a) AS item_a,
COALESCE(cv.item_b, cp.item_b) AS item_b,
COALESCE(cv.session_cooccurrence, 0)
+ COALESCE(cp.purchase_pair_count, 0) * 3 AS similarity_score,
COALESCE(cv.user_cooccurrence, 0) AS coview_users,
COALESCE(cp.unique_buyers, 0) AS copurchase_users
FROM item_coview_matrix cv
FULL OUTER JOIN co_purchased cp
ON cv.item_a = cp.item_a AND cv.item_b = cp.item_b;
The weight of 3 on purchase_pair_count reflects the signal strength difference: a purchase is roughly three times as strong an affinity signal as a co-view. You can tune this weight based on your conversion funnel. For a marketplace with low purchase rates, you might weight purchases even higher. For a content platform where "purchases" are subscriptions, you might weight plays and saves instead.
The FULL OUTER JOIN ensures that item pairs that only appear in one signal (only co-viewed, or only co-purchased) still appear in the result with a zero for the missing signal. This is important: a pair that has been co-purchased even once but never explicitly co-viewed (because users added directly from search) still deserves a non-zero similarity score.
Serving Item-Based Recommendations
-- "Customers who viewed this also viewed..."
-- Used on product detail pages. Because each pair is stored once
-- with item_a < item_b, the lookup must check both columns.
SELECT
    CASE WHEN item_a = $1 THEN item_b ELSE item_a END AS related_item_id,
    similarity_score,
    coview_users,
    copurchase_users
FROM item_similarity
WHERE item_a = $1 OR item_b = $1   -- $1 = the item currently being viewed
ORDER BY similarity_score DESC
LIMIT 10;
This query touches a single indexed view and returns in single-digit milliseconds regardless of total catalog size. There is no on-the-fly computation: the similarity score for every item pair was computed the last time a new purchase or view event arrived.
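To keep that lookup fast as the catalog grows, you can create secondary indexes on the materialized view. A sketch; the index names are illustrative:

```sql
-- One index per lookup column, since a given item can appear on
-- either side of a stored pair.
CREATE INDEX idx_item_similarity_a ON item_similarity (item_a);
CREATE INDEX idx_item_similarity_b ON item_similarity (item_b);
```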
What "Real Time" Means for Item Similarity
Consider the concrete sequence of events: user u4 views tent_pro and sleeping_bag in a new session, then purchases both.
When the second of u4's purchase events arrives, the co_purchased row for (sleeping_bag, tent_pro) increments: before u4's purchases this pair had purchase_pair_count = 1 (from u1), and afterward it has purchase_pair_count = 2. The item_similarity score for this pair increases immediately.
If u5 is browsing tent_pro at that exact moment and queries for related items, they will see sleeping_bag with a similarity score that already incorporates u4's purchase -- an event that happened seconds ago. In a batch system, u4's purchase would not affect recommendations until the next morning's run.
This matters most for new items and viral trends. A product that gets co-purchased with an existing bestseller dozens of times in a single afternoon will not surface as a related item in any batch system until the next day. In a streaming system, it surfaces within minutes of the pattern emerging.
User-Based Collaborative Filtering
User-based CF requires two steps: compute each user's interaction profile, then find users whose profiles overlap meaningfully.
User Interaction Profiles
A user profile is their weighted history of item interactions. Purchases count more than cart additions, which count more than views.
CREATE MATERIALIZED VIEW user_item_interactions AS
SELECT
user_id,
item_id,
SUM(CASE
WHEN event_type = 'purchase' THEN 10
WHEN event_type = 'add_to_cart' THEN 5
WHEN event_type = 'like' THEN 3
WHEN event_type = 'view' THEN 1
ELSE 0
END) AS score,
MAX(event_time) AS last_interaction
FROM user_events
GROUP BY user_id, item_id;
This view gives you every (user, item) pair with a weighted engagement score. A user who has both viewed and purchased an item has a score of 11 for that item. A user who only viewed it once has a score of 1. These scores are the basis for user similarity computation.
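As a quick sanity check, the profile view can be queried directly; $1 is a bind parameter, as in the serving queries elsewhere in this article:

```sql
-- One user's strongest affinities, straight from the profile view.
SELECT item_id, score, last_interaction
FROM user_item_interactions
WHERE user_id = $1
ORDER BY score DESC
LIMIT 10;
```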
User Similarity
User similarity is computed by finding pairs of users who have both engaged with the same items at a meaningful level.
CREATE MATERIALIZED VIEW user_similarity AS
SELECT
a.user_id AS user_a,
b.user_id AS user_b,
COUNT(DISTINCT a.item_id) AS shared_items,
SUM(LEAST(a.score, b.score)) AS overlap_score,
MAX(GREATEST(a.last_interaction, b.last_interaction)) AS last_active
FROM user_item_interactions a
JOIN user_item_interactions b
ON a.item_id = b.item_id
AND a.user_id < b.user_id -- prevents (A,B) and (B,A) duplicates
AND a.score >= 5 -- only significant interactions
AND b.score >= 5 -- (view-only users add noise)
GROUP BY a.user_id, b.user_id
HAVING COUNT(DISTINCT a.item_id) >= 3; -- at least 3 shared meaningful items
Walk through the design choices:
The score >= 5 filter on both sides eliminates casual view-only overlaps. A user who briefly viewed an item and a user who purchased it are not meaningfully "similar" on that item. Requiring at least a 5-point score (which corresponds to an add-to-cart or a purchase) ensures that shared items represent real affinities.
The LEAST(a.score, b.score) for overlap_score computes a conservative similarity measure. If user A has score 10 (purchased) and user B has score 1 (viewed) for the same item, LEAST returns 1, not 10. This prevents a very casual interaction by one user from inflating the similarity just because the other user had a strong interaction.
The HAVING COUNT(DISTINCT a.item_id) >= 3 threshold prevents spurious high-similarity matches between users who happen to have interacted with the same single popular item. Requiring three shared items provides a much stronger basis for claiming users have similar tastes.
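Because each pair is stored once under user_a < user_b, a per-user neighbor lookup has to check both columns. A sketch:

```sql
-- Nearest neighbors of user $1. The CASE expression unfolds the
-- deduplicated (user_a < user_b) storage back into a per-user view.
SELECT
    CASE WHEN user_a = $1 THEN user_b ELSE user_a END AS neighbor,
    shared_items,
    overlap_score
FROM user_similarity
WHERE user_a = $1 OR user_b = $1
ORDER BY overlap_score DESC
LIMIT 20;
```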
Category Affinity for User Similarity
A complementary view for user similarity based on category behavior rather than individual items. This is more stable than item-level similarity because categories change more slowly than item catalogs:
CREATE MATERIALIZED VIEW user_category_profile AS
SELECT
user_id,
category,
SUM(CASE
    WHEN event_type = 'purchase' THEN 10
    WHEN event_type = 'add_to_cart' THEN 5
    WHEN event_type = 'like' THEN 3
    WHEN event_type = 'view' THEN 1
    ELSE 0
END) AS category_score,
COUNT(DISTINCT item_id) AS distinct_items_in_category
FROM user_events
GROUP BY user_id, category;
Two users who both have high category_score in "camping" but low scores in all other categories are likely to have overlapping item interests within that category. Your serving layer can use this profile to focus user-similarity lookups within a category, reducing the search space substantially.
Serving User-Based Recommendations
-- "Users like you also liked..."
-- Returns items that similar users engaged with, that the current user
-- has not seen. Pairs are stored once with user_a < user_b, so the
-- neighbor lookup must check both columns.
SELECT
    ub.item_id,
    SUM(us.overlap_score) AS neighbor_affinity,
    COUNT(DISTINCT ub.user_id) AS supporting_neighbors
FROM user_similarity us
JOIN user_item_interactions ub
    ON ub.user_id = CASE WHEN us.user_a = $1 THEN us.user_b ELSE us.user_a END
    AND ub.score >= 5          -- only items the neighbor engaged with meaningfully
LEFT JOIN user_item_interactions ua
    ON ua.user_id = $1
    AND ua.item_id = ub.item_id
WHERE (us.user_a = $1 OR us.user_b = $1)
    AND ua.item_id IS NULL     -- exclude items the current user already knows
GROUP BY ub.item_id
ORDER BY neighbor_affinity DESC
LIMIT 10;
The LEFT JOIN with WHERE ua.item_id IS NULL is the suppression filter. It removes any item from the results that the current user has already interacted with, at any level. This is more aggressive than filtering on purchases alone -- it also removes items the user has viewed, which prevents recommending something they have already considered and passed on.
supporting_neighbors tells you how many similar users liked this item. An item with neighbor_affinity = 500 backed by 10 similar users is a stronger recommendation than one with the same score backed by 2 users who happen to have very high overlap scores with the current user.
Why Item-Based Scales Better Than User-Based
This is not theoretical. Let us put concrete numbers to the difference between O(items squared) and O(users squared).
A mid-size e-commerce site might have 200,000 items and 5 million registered users.
For item-based CF, the co-occurrence matrix has at most 200,000 x 200,000 = 40 billion potential pairs, but in practice only a small fraction of items are ever co-viewed. If the average item co-occurs with 500 others (a generous estimate), the materialized co-occurrence table has about 50 million rows, since each pair is stored once. That is manageable.
For user-based CF, the similarity matrix has at most 5 million x 5 million = 25 trillion potential pairs. Even with the aggressive threshold (3+ shared items, score >= 5 on both sides), the number of pairs that clear the bar can easily reach hundreds of millions for an active platform. Storing and querying that table is expensive.
There are three practical strategies to make user-based CF tractable at scale:
1. Cluster users into cohorts before computing similarity. Users in the "camping enthusiasts" cohort only get compared to other camping enthusiasts, not to every user on the platform. You can maintain category cohort assignments as a materialized view:
CREATE MATERIALIZED VIEW user_primary_category AS
SELECT DISTINCT ON (user_id)
user_id,
category AS primary_category,
category_score
FROM user_category_profile
ORDER BY user_id, category_score DESC;
Then join user_primary_category for both sides of the user similarity self-join and require the two primary categories to match, limiting comparisons to the same cohort.
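A sketch of that cohort-constrained variant, reusing user_primary_category; the view name user_similarity_cohort is illustrative:

```sql
CREATE MATERIALIZED VIEW user_similarity_cohort AS
SELECT
    a.user_id AS user_a,
    b.user_id AS user_b,
    COUNT(DISTINCT a.item_id) AS shared_items,
    SUM(LEAST(a.score, b.score)) AS overlap_score
FROM user_item_interactions a
JOIN user_item_interactions b
    ON a.item_id = b.item_id
    AND a.user_id < b.user_id
    AND a.score >= 5
    AND b.score >= 5
JOIN user_primary_category pa
    ON pa.user_id = a.user_id
JOIN user_primary_category pb
    ON pb.user_id = b.user_id
    AND pb.primary_category = pa.primary_category   -- same-cohort only
GROUP BY a.user_id, b.user_id
HAVING COUNT(DISTINCT a.item_id) >= 3;
```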
2. Use item-based CF as the primary signal, user-based as re-ranking. Generate the initial recommendation candidates from item-based CF (fast, low latency, small result set), then re-rank them using user similarity scores (more expensive but applied to a small set). This hybrid approach is used by most large recommendation systems.
3. Accept approximation. Instead of computing exact user similarity over all users, maintain similarity only for "active" users (those with at least one interaction in the last 30 days). Inactive users fall back to item-based or trending recommendations. This can cut the effective user population by 90%+ for most platforms.
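Strategy 3 maps naturally onto a temporal filter, which RisingWave supports in streaming WHERE clauses. A sketch; the view name and the 30-day window are illustrative:

```sql
-- Users with at least one event in the last 30 days. The NOW()-based
-- temporal filter keeps the view sliding: rows age out automatically.
CREATE MATERIALIZED VIEW active_users AS
SELECT DISTINCT user_id
FROM user_events
WHERE event_time > NOW() - INTERVAL '30 days';
```

Joining active_users into the base of the user similarity computation restricts the expensive self-join to the active population.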
Combining Both Approaches
Item-based and user-based signals are complementary. Item-based tells you "this item belongs with those items." User-based tells you "this user is like those users, who liked these items." Combining them gives you a more robust ranking:
-- Hybrid recommendation score: item-based signal + user-based signal.
-- Both source views store each pair once (item_a < item_b,
-- user_a < user_b), so both lookups check both columns.
SELECT
    candidate_item,
    SUM(item_score) AS item_based_signal,
    SUM(user_score) AS user_based_signal,
    SUM(item_score) * 0.6 + SUM(user_score) * 0.4 AS hybrid_score
FROM (
    -- Item-based candidates: items similar to anything the user engaged with
    SELECT
        CASE WHEN uii.item_id = isim.item_a
             THEN isim.item_b ELSE isim.item_a END AS candidate_item,
        isim.similarity_score AS item_score,
        0 AS user_score
    FROM item_similarity isim
    JOIN user_item_interactions uii
        ON uii.item_id IN (isim.item_a, isim.item_b)
        AND uii.user_id = $1
        AND uii.score >= 3
    UNION ALL
    -- User-based candidates: items the user's neighbors engaged with
    SELECT
        ub.item_id AS candidate_item,
        0 AS item_score,
        us.overlap_score AS user_score
    FROM user_similarity us
    JOIN user_item_interactions ub
        ON ub.user_id = CASE WHEN us.user_a = $1
                             THEN us.user_b ELSE us.user_a END
        AND ub.score >= 5
    WHERE us.user_a = $1 OR us.user_b = $1
) combined
GROUP BY candidate_item
ORDER BY hybrid_score DESC
LIMIT 20;
The weights (0.6 item-based, 0.4 user-based) are a starting point. In practice you would A/B test different weightings and optimize against your conversion metric. RisingWave does not need to change for this: the materialized views remain the same, and only the serving query changes.
Monitoring Similarity Quality
A production collaborative filtering system needs observability into the quality of its similarity signals. These views are useful for monitoring:
-- Distribution of similarity scores (check for skew or degenerate cases).
-- Note: ordered-set aggregates such as PERCENTILE_CONT may not be
-- supported in streaming materialized views in every RisingWave version;
-- if not, run the median as an ad-hoc batch query instead.
CREATE MATERIALIZED VIEW item_similarity_stats AS
SELECT
    COUNT(*) AS total_pairs,
    AVG(similarity_score) AS avg_score,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY similarity_score) AS median_score,
    MAX(similarity_score) AS max_score,
    COUNT(*) FILTER (WHERE copurchase_users >= 10) AS high_confidence_pairs
FROM item_similarity;
-- Items with no similarity relationships (catalog gaps).
-- Assumes an items catalog table with (item_id, category) columns.
CREATE MATERIALIZED VIEW items_without_similarity AS
SELECT i.item_id, i.category
FROM items i
LEFT JOIN item_similarity s
    ON i.item_id = s.item_a OR i.item_id = s.item_b
WHERE s.item_a IS NULL;
-- Recent co-purchase rate (health check for the pipeline).
-- TUMBLE is applied to the base table on both sides of the self-join,
-- and pairs are only counted within the same hourly window.
CREATE MATERIALIZED VIEW copurchase_rate AS
SELECT
    a.window_start,
    a.window_end,
    COUNT(*) AS co_purchase_pairs_formed
FROM TUMBLE(user_events, event_time, INTERVAL '1' HOUR) a
JOIN TUMBLE(user_events, event_time, INTERVAL '1' HOUR) b
    ON a.session_id = b.session_id
    AND a.window_start = b.window_start
    AND a.item_id < b.item_id
    AND a.event_type = 'purchase'
    AND b.event_type = 'purchase'
GROUP BY a.window_start, a.window_end;
If items_without_similarity grows unexpectedly, you may have new items in the catalog that have not been purchased yet -- a cold-start issue that requires a fallback strategy. If copurchase_rate drops to zero, your event pipeline may have a problem upstream.
What SQL Collaborative Filtering Cannot Do
Being honest about limitations is part of building systems that last.
Matrix factorization and neural collaborative filtering (NCF) are not expressible in pure SQL. These approaches learn latent embeddings for users and items that capture patterns the co-occurrence matrix cannot represent. A user who buys budget products in five categories and premium products in one specific category reveals a nuanced preference structure that co-occurrence statistics will flatten into noise. Neural CF models can learn this. SQL cannot.
Temporal dynamics are hard to model explicitly. A user's preferences evolve over time. The camping gear they bought three years ago is less predictive of today's interests than the fitness equipment they browsed last week. You can add recency weights to the SQL (exponential decay on event_time), but capturing the full dynamics of preference drift is better handled by a model with explicit time modeling.
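Recency weighting is straightforward to apply at serving time, where NOW() can be used freely in ad-hoc queries. A sketch with an illustrative 30-day half-life:

```sql
-- Recency-weighted engagement for one user: an interaction loses
-- half its weight every 30 days (86400 seconds per day).
SELECT
    item_id,
    score * EXP(-LN(2) * EXTRACT(EPOCH FROM (NOW() - last_interaction))
                / (30 * 86400)) AS decayed_score
FROM user_item_interactions
WHERE user_id = $1
ORDER BY decayed_score DESC;
```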
Implicit negative feedback is lost. When a user views an item 10 times without buying, that might mean they are interested but uncertain, or it might mean they keep seeing it in recommendations and ignoring it. SQL cannot distinguish these cases without explicit dismiss events.
For teams that want deeper personalization, the right architecture is a hybrid: SQL collaborative filtering produces fast, always-fresh candidate sets, and a trained ranking model (LightGBM, neural networks, or NCF) reranks those candidates using richer signals including the streaming features computed in RisingWave. The SQL layer handles freshness; the ML layer handles complexity.
RisingWave supports vector columns (vector(n) type, <=> cosine distance operator, HNSW indexes) for teams that want to store learned item embeddings alongside streaming signals and serve hybrid recommendations from a single system.
Frequently Asked Questions
How quickly does the item similarity matrix update after a new purchase?
Within milliseconds. When a new purchase event arrives in user_events, RisingWave's streaming engine propagates it through the operator graph. The co_purchased view updates first (it reads directly from user_events). The item_similarity view updates next (it reads from co_purchased and item_coview_matrix). By the time your application issues the next recommendation query, the new purchase is already reflected in the similarity score.
Is this approach suitable for a content platform (articles, videos) rather than e-commerce?
Yes. Replace "purchase" with "complete" (finished reading/watching) and "add_to_cart" with "save" or "bookmark." The co-occurrence signals work on any interaction events. On a content platform, co-view within a session (two articles read in the same session) is often a stronger signal than co-completion because completion rates are low. Adjust event weights accordingly.
How do I avoid popularity bias in collaborative filtering?
Popularity bias occurs when highly popular items dominate all co-occurrence matrices simply because they appear in so many sessions. Techniques to reduce it: (1) normalize similarity_score by individual item popularity (coview_count / SQRT(views_a * views_b) is a cosine-style normalization, sometimes called the Ochiai coefficient), (2) add a WHERE item_b NOT IN (SELECT item_id FROM top_100_items) filter to force diversity, or (3) cap the contribution of any single popular item to the similarity score with LEAST(coview_count, 50).
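Technique (1) can be sketched against the views built earlier; the item_popularity view introduced here is illustrative:

```sql
-- Per-item popularity, measured as distinct sessions.
CREATE MATERIALIZED VIEW item_popularity AS
SELECT item_id, COUNT(DISTINCT session_id) AS session_count
FROM user_events
GROUP BY item_id;

-- Normalized co-view strength: popular items no longer dominate
-- purely by appearing in many sessions.
SELECT
    cv.item_a,
    cv.item_b,
    cv.session_cooccurrence::FLOAT
        / SQRT(pa.session_count::FLOAT * pb.session_count::FLOAT) AS normalized_score
FROM item_coview_matrix cv
JOIN item_popularity pa ON pa.item_id = cv.item_a
JOIN item_popularity pb ON pb.item_id = cv.item_b
ORDER BY normalized_score DESC
LIMIT 20;
```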
Can I use this for a new item that has zero interactions?
No. Zero-interaction items cannot form co-occurrence relationships. You need a cold-start fallback: use content-based similarity (items with similar descriptions, same category, same price range) to bootstrap an item until it accumulates real co-occurrence data. Once a new item reaches a minimum of interactions (typically 10 to 50 depending on catalog size), it begins to appear in co-occurrence matrices and can take over from the content-based fallback.
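A minimal content-based fallback might look like this, assuming the items(item_id, category) catalog table referenced in the monitoring section:

```sql
-- Cold-start fallback for item $1: best-selling items from the same
-- category, assuming an items(item_id, category) catalog table.
SELECT e.item_id, COUNT(DISTINCT e.user_id) AS buyers
FROM user_events e
JOIN items i ON i.item_id = e.item_id
WHERE i.category = (SELECT category FROM items WHERE item_id = $1)
  AND e.event_type = 'purchase'
  AND e.item_id <> $1
GROUP BY e.item_id
ORDER BY buyers DESC
LIMIT 10;
```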
How many workers does RisingWave need to maintain these views for a large catalog?
It depends on your event rate. At 10,000 events per second, a three-node RisingWave cluster (each node with 4 vCPUs and 16GB RAM) can comfortably maintain all views described in this article. The most expensive view is user_similarity because it involves a self-join. Adding the score >= 5 filter and the HAVING threshold dramatically reduces the number of rows that enter the join, which is why those guards are important not just for recommendation quality but for compute efficiency.

