{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "Can you build a content recommendation engine without machine learning?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Yes. For most content platforms, three SQL-based signals cover the majority of recommendation scenarios: trending content (what is being read right now), tag similarity (articles about topics the user has engaged with), and reading history exclusion (do not recommend what the user already read). These heuristics update continuously and outperform a stale batch ML model for freshness, which is the dominant quality factor for time-sensitive content."
}
},
{
"@type": "Question",
"name": "What is the main limitation of batch ML models for content recommendations?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Batch ML models are trained on historical data and cannot know about content published after their training cutoff. For news, blogs, or any platform where freshness matters, a model trained last week will never surface an article published today in its top recommendations, regardless of how relevant the article is. SQL-based streaming recommendations update within seconds of a new article being published and read."
}
},
{
"@type": "Question",
"name": "How does tag-based content recommendation work?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Each article carries a set of topic tags. When a user reads an article, their engagement with its tags is recorded and weighted by reading depth (how much of the article they read). A user tag affinity profile accumulates these weighted scores over time. At recommendation time, articles whose tags overlap with the user's highest-affinity tags are ranked above articles on unfamiliar topics."
}
},
{
"@type": "Question",
"name": "When should you add machine learning on top of SQL heuristics?",
"acceptedAnswer": {
"@type": "Answer",
"text": "ML adds the most value for cold-start problems (new users with no reading history), deep taste modeling (distinguishing between subtly different topic preferences), and cross-format recommendations (matching video or audio content to a reader's text-based history). For these cases, ML can be layered on top of the SQL heuristics as a re-ranking step, while the heuristics continue to handle freshness and trend detection."
}
},
{
"@type": "Question",
"name": "How does RisingWave handle reading events at scale?",
"acceptedAnswer": {
"@type": "Answer",
"text": "RisingWave ingests reading events from Kafka and maintains materialized views incrementally. Each new event updates only the affected rows in the materialized views rather than recomputing from scratch. RisingWave distributes computation across multiple nodes and checkpoints state to S3-compatible storage, so the system scales horizontally and recovers from failures without reprocessing the full event history."
}
}
]
}
A major news story breaks at 9am. A CEO resigns, a regulatory decision lands, or a new product ships. By 9:15am, thousands of readers are on your platform searching for coverage. Articles about the story are being published in real time. Readers who came for unrelated content stumble across the breaking story and stay to read more.
At 9:30am, a reader finishes your coverage and looks at the recommended articles below the fold. The recommendation model, trained on last week's data, suggests three articles about entirely different topics. The breaking story does not appear at all. The model has no concept of any article published in the past seven days.
That is not a model quality problem. It is a freshness problem. And it cannot be fixed by training a better model on the same stale data.
Why Batch ML Breaks for Content
Machine learning recommendation models are powerful for stable content: movies, products, long-form reference material. The user's taste in comedy films does not change by the hour. A good collaborative filtering model trained on months of viewing history will surface relevant recommendations for a user whose preferences are stable.
Content is different. The editorial calendar is unpredictable. An article published this morning may be the most relevant piece on your platform for a large segment of your audience by noon, but your model trained last week has never seen it. The model is not wrong about what it knows. It is simply blind to everything that happened after training.
This blindness compounds with coverage lag. Even if you retrain daily, a model trained this morning cannot see what was published this afternoon. The gap between content creation and model awareness is structural, not tuning-related.
Three signals reliably close this gap without requiring any ML infrastructure:
Trending captures what is actually being read right now, across all users. An article getting 500 reads in its first hour is almost certainly relevant to a large fraction of your audience, regardless of any individual user's historical preferences.
Tag similarity captures topical relevance. If a user has read five articles tagged "central banking" with high reading depth, an article tagged "Federal Reserve" or "interest rates" is a reasonable suggestion, even if the specific article is brand new.
Reading history exclusion is the simplest personalization signal: do not recommend what the user already read. This alone, combined with trending, produces a surprisingly good personalized feed.
None of these signals require training. They are computed from live data in real time, and they degrade gracefully when a user is new.
The Complete SQL Pipeline
Ingesting Reading Events
Reading events flow from your application into Kafka. Each event captures what the user read, for how long, and how much of the article they consumed.
-- Ingest reading events from Kafka
CREATE TABLE reading_events (
    event_id TEXT,
    user_id TEXT,
    article_id TEXT,
    session_id TEXT,
    read_pct INT, -- percentage of article read (0-100)
    dwell_time INT, -- seconds spent on article
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'reading-events',
    properties.bootstrap.server = 'kafka:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
The read_pct field is important. A user who reads 90% of an article has expressed strong interest in its topics. A user who opens an article and leaves after five seconds has expressed almost nothing. Weighting recommendations by reading depth prevents clickbait articles from accumulating false affinity signals.
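As a sketch of the event payload the frontend might emit (the helper name and ID formats are illustrative assumptions; the field names mirror the reading_events table above):

```python
import json
import uuid
from datetime import datetime, timezone

def make_reading_event(user_id, article_id, session_id, read_pct, dwell_time):
    """Build one reading event matching the reading_events schema."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "article_id": article_id,
        "session_id": session_id,
        "read_pct": read_pct,      # 0-100, re-sent as the user scrolls deeper
        "dwell_time": dwell_time,  # seconds on the page so far
        "event_time": datetime.now(timezone.utc).isoformat(),
    }

# Serialized payload a producer would publish to the reading-events topic
payload = json.dumps(make_reading_event("u42", "a123", "s1", 90, 240))
```

Sending the same event_id with updated read_pct and dwell_time as the user scrolls lets the MAX aggregations downstream keep the deepest engagement per article.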
Syncing the Article Catalog
Articles are created in your CMS and need to be visible to RisingWave for tag joins. Sync them via CDC from your CMS database.
-- Article catalog (synced from CMS via CDC);
-- hostname, credentials, and database name below are placeholders
CREATE SOURCE article_source WITH (
    connector = 'postgres-cdc',
    hostname = 'cms-db',
    port = '5432',
    username = 'cdc_user',
    password = 'cdc_password',
    database.name = 'cms'
);

CREATE TABLE articles (
    article_id TEXT PRIMARY KEY, -- CDC tables require a primary key
    title TEXT,
    author TEXT,
    tags TEXT[], -- array of tag strings
    published_at TIMESTAMPTZ
) FROM article_source TABLE 'public.articles';
The tags field uses a TEXT array. PostgreSQL and RisingWave both support array types natively, and the array overlap operator (&&) used later for tag matching works the same way in both.
Trending Articles by the Hour
The trending view is the foundation of the real-time recommendation system. It counts unique readers and computes an engagement score for each article within each one-hour tumbling window.
-- Trending articles by the hour
CREATE MATERIALIZED VIEW trending_articles AS
SELECT
    article_id,
    COUNT(DISTINCT user_id) AS unique_readers,
    AVG(read_pct) AS avg_read_pct,
    AVG(dwell_time) AS avg_dwell_seconds,
    COUNT(DISTINCT user_id) * (AVG(read_pct) / 100.0) AS engagement_score,
    window_start,
    window_end
FROM TUMBLE(reading_events, event_time, INTERVAL '1' HOUR)
GROUP BY article_id, window_start, window_end;
The engagement_score formula combines breadth (how many unique readers) with depth (how much of the article they actually read). An article with 1,000 readers who each read 20% of it scores the same as an article with 200 readers who each read 100% of it. Both represent 200 effective full reads.
This composite score deprioritizes articles that get many superficial visits (likely clickbait headlines) in favor of articles that hold attention.
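The formula can be checked in a few lines of Python (a standalone sketch, not part of the pipeline):

```python
def engagement_score(events):
    """Mirror of the trending_articles formula:
    COUNT(DISTINCT user_id) * (AVG(read_pct) / 100.0)."""
    unique_readers = {e["user_id"] for e in events}
    avg_read_pct = sum(e["read_pct"] for e in events) / len(events)
    return len(unique_readers) * (avg_read_pct / 100.0)

# 1,000 shallow readers at 20% vs 200 deep readers at 100%
shallow = [{"user_id": f"u{i}", "read_pct": 20} for i in range(1000)]
deep = [{"user_id": f"u{i}", "read_pct": 100} for i in range(200)]

print(engagement_score(shallow))  # 200.0
print(engagement_score(deep))     # 200.0
```

Both audiences produce 200 effective full reads, so both articles score identically, exactly as described above.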
Per-User Reading History
The reading history view tracks every article each user has meaningfully engaged with. "Meaningfully" here means at least 30% of the article was read, not just clicked and immediately bounced.
-- Per-user reading history (articles they've read at least 30%)
CREATE MATERIALIZED VIEW user_read_history AS
SELECT
    user_id,
    article_id,
    MAX(read_pct) AS max_read_pct,
    MAX(dwell_time) AS max_dwell_time,
    COUNT(*) AS visit_count,
    MAX(event_time) AS last_read_at
FROM reading_events
WHERE read_pct >= 30 -- drop open-and-bounce events
GROUP BY user_id, article_id;
This view is used in two ways downstream. First, it provides the exclusion list for recommendation queries: do not recommend articles the user has already read. Second, the max_read_pct and visit_count fields serve as signals for how strongly the user engaged with a topic, which feeds into tag affinity scoring.
User Tag Affinity
Tag affinity is the personalization layer. It answers the question: which topics has this user demonstrated interest in, and how strong is that interest?
-- Tags a user has engaged with (their topic profile)
CREATE MATERIALIZED VIEW user_tag_affinity AS
SELECT
    re.user_id,
    t.tag,
    SUM(CASE
        WHEN re.read_pct >= 80 THEN 5
        WHEN re.read_pct >= 50 THEN 3
        WHEN re.read_pct >= 20 THEN 1
        ELSE 0
    END) AS tag_score
FROM reading_events re
JOIN articles a ON re.article_id = a.article_id
JOIN LATERAL unnest(a.tags) AS t(tag) ON true
GROUP BY re.user_id, t.tag;
The unnest(a.tags) call expands the tag array into individual rows, so each reading event contributes to the score of every tag on the article. The tiered scoring (5 for near-complete reads, 3 for majority reads, 1 for partial reads) ensures that deep reading has proportionally more influence on the user's taste profile than casual browsing.
A user who reads five articles about "distributed systems" all the way through will have a tag_score of 25 for that tag. A user who opened three distributed systems articles and left after reading a quarter of each will have a score of 3. The system treats these users differently at recommendation time.
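The tiered accumulation can be sketched in Python to reproduce those numbers (function names are illustrative):

```python
from collections import defaultdict

def tag_points(read_pct):
    """Same tiers as the CASE expression in user_tag_affinity."""
    if read_pct >= 80:
        return 5
    if read_pct >= 50:
        return 3
    if read_pct >= 20:
        return 1
    return 0

def build_affinity(events, article_tags):
    """Accumulate one user's per-tag scores, mirroring the SQL SUM."""
    affinity = defaultdict(int)
    for e in events:
        for tag in article_tags[e["article_id"]]:
            affinity[tag] += tag_points(e["read_pct"])
    return dict(affinity)

tags = {f"a{i}": ["distributed systems"] for i in range(5)}
deep = [{"article_id": f"a{i}", "read_pct": 95} for i in range(5)]
casual = [{"article_id": f"a{i}", "read_pct": 25} for i in range(3)]

print(build_affinity(deep, tags))    # {'distributed systems': 25}
print(build_affinity(casual, tags))  # {'distributed systems': 3}
```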
Serving Personalized Recommendations
With the materialized views in place, the recommendation query combines all three signals: trending articles, matching the user's tag interests, excluding what they have already read.
-- Personalized feed: trending articles the user hasn't already read,
-- boosted by how well their tags match the user's interests
SELECT
    t.article_id,
    a.title,
    a.published_at,
    t.engagement_score,
    COALESCE(SUM(uta.tag_score), 0) AS relevance_score
FROM trending_articles t
JOIN articles a ON t.article_id = a.article_id
JOIN LATERAL unnest(a.tags) AS tag_ref(tag) ON true
LEFT JOIN user_tag_affinity uta
    ON uta.user_id = $1 AND uta.tag = tag_ref.tag
LEFT JOIN user_read_history urh
    ON urh.user_id = $1 AND urh.article_id = t.article_id
WHERE urh.article_id IS NULL -- exclude already-read articles
    AND t.window_end = (SELECT MAX(window_end) FROM trending_articles)
GROUP BY t.article_id, a.title, a.published_at, t.engagement_score
ORDER BY (t.engagement_score * 0.4 + COALESCE(SUM(uta.tag_score), 0) * 0.6) DESC
LIMIT 20;
The final ORDER BY blends two signals. Engagement score (40% weight) reflects platform-wide popularity right now. Relevance score (60% weight) reflects how well the article's tags match the user's reading history. Giving more weight to personalization than pure popularity produces a feed that feels both fresh and relevant.
The window_end = (SELECT MAX(window_end) FROM trending_articles) clause ensures you are ranking from the most recent complete hour window, not an older historical window. Adjust the weight split based on your platform's emphasis: a breaking news site might weight engagement higher; a professional learning platform might weight relevance higher.
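The blend itself is simple enough to sketch outside SQL; the candidate dictionaries and scores below are illustrative:

```python
def blended_rank(candidates, w_engagement=0.4, w_relevance=0.6):
    """Sort candidates by the same weighted blend as the ORDER BY clause."""
    return sorted(
        candidates,
        key=lambda c: (c["engagement_score"] * w_engagement
                       + c["relevance_score"] * w_relevance),
        reverse=True,
    )

candidates = [
    # very popular, but barely matches this user's tags
    {"article_id": "a1", "engagement_score": 90.0, "relevance_score": 2.0},
    # moderately trending, strong tag match
    {"article_id": "a2", "engagement_score": 30.0, "relevance_score": 50.0},
]
print([c["article_id"] for c in blended_rank(candidates)])  # ['a2', 'a1']
```

When every candidate's relevance score is zero, the ordering reduces to engagement alone, which is the platform-wide trending feed.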
More Like This
Beyond the personalized feed, a "More like this" section below each article suggests other articles on related topics. This is pure tag matching, no user history required.
-- Articles sharing the most tags with a given article
SELECT
    b.article_id,
    b.title,
    COUNT(*) AS shared_tag_count
FROM articles a
JOIN articles b
    ON a.article_id != b.article_id
    AND b.tags && a.tags -- array overlap operator, cheap pre-filter
JOIN LATERAL unnest(a.tags) AS a_tag(tag) ON true
JOIN LATERAL unnest(b.tags) AS b_tag(tag) ON b_tag.tag = a_tag.tag
WHERE a.article_id = $1
GROUP BY b.article_id, b.title
ORDER BY shared_tag_count DESC
LIMIT 5;
The && operator (array overlap) returns true if any element of the left array appears in the right array, which cheaply narrows the join to articles with at least one tag in common. Unnesting both tag arrays and joining on equal tags then yields one row per shared tag, so COUNT(*) is the number of tags the two articles share. An article sharing four tags with the current article ranks above one sharing two.
This query can be converted to a materialized view for performance if your article catalog is large. The view would precompute shared tag counts for all article pairs. For catalogs under a few hundred thousand articles, the on-demand join executes fast enough for serving.
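In memory, the same shared-tag ranking looks like this (a sketch in which set intersection stands in for the unnest-and-join):

```python
def more_like_this(current_id, catalog, limit=5):
    """Rank other articles by how many tags they share with current_id."""
    current_tags = set(catalog[current_id])
    scored = [
        (len(current_tags & set(tags)), article_id)
        for article_id, tags in catalog.items()
        if article_id != current_id and current_tags & set(tags)
    ]
    # Most shared tags first; tie-break on article_id for determinism
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [article_id for _, article_id in scored[:limit]]

catalog = {
    "a1": ["ai", "chips", "policy", "trade"],
    "a2": ["ai", "chips", "policy", "startups"],  # shares 3 tags with a1
    "a3": ["ai", "sports"],                       # shares 1 tag
    "a4": ["cooking"],                            # shares none, excluded
}
print(more_like_this("a1", catalog))  # ['a2', 'a3']
```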
How a Breaking Story Gets Recommended
The latency chain from publication to recommendation is worth tracing explicitly, because it illustrates why streaming matters for content.
- An editor publishes a new article in your CMS at 9:02am. The article is inserted into your CMS database with a set of topic tags.
- RisingWave's CDC connector picks up the insert within seconds. The article is now in the articles table.
- Readers begin finding the article through search or direct links. Reading events flow into Kafka and are consumed by RisingWave.
- By 9:15am, 300 readers have opened the article and spent an average of four minutes on it. The trending_articles materialized view has updated incrementally with each batch of Kafka messages.
- A user who has previously read heavily about the same topic and has not yet seen this article receives it as the top recommendation in their personalized feed.
The total time from publication to appearing in recommendations: roughly 15 minutes, constrained mainly by how quickly the first readers discover the article. There is no batch window, no nightly job, no retraining cycle. The recommendation system is aware of the article as soon as the first readers interact with it.
For comparison, a model trained weekly would not surface this article in recommendations until the following week's training run, at which point the breaking story has likely faded from relevance.
The New User Problem and Fallback Strategy
New users with no reading history have no tag affinity profile. The user_tag_affinity join in the personalized query finds no matching tags, so the relevance score is zero for every candidate article.
The graceful fallback is pure trending. When SUM(uta.tag_score) is zero for all candidates, the final ORDER BY reduces to sorting by engagement_score alone, which produces the platform-wide trending feed. This is reasonable behavior: a new reader gets the most popular articles right now, which is a better first experience than a cold empty state.
As the user reads more articles, their tag affinity profile builds. After reading five or ten articles, personalization kicks in automatically. No onboarding survey, no explicit preference selection, no model training cycle required.
A slightly more sophisticated fallback uses the user's first session to infer interests from the articles they arrived at directly (via search or sharing), even before building a reading history. Storing session-level entry articles and extracting their tags gives enough signal for light personalization on the first visit.
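That first-session idea can be sketched as follows (the function and the notion of "entry articles" are assumptions about your session tracking, not part of the pipeline above):

```python
def first_session_profile(entry_article_ids, article_tags):
    """Seed a light tag profile from the articles a brand-new user
    arrived at directly, before any reading history exists."""
    profile = {}
    for article_id in entry_article_ids:
        for tag in article_tags.get(article_id, []):
            profile[tag] = profile.get(tag, 0) + 1
    return profile

article_tags = {
    "a1": ["ai", "chips"],
    "a2": ["chips", "supply chain"],
}
print(first_session_profile(["a1", "a2"], article_tags))
# {'ai': 1, 'chips': 2, 'supply chain': 1}
```

The resulting counts can stand in for tag_score in the personalized query until real reading history accumulates.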
What Medium and Substack Actually Do
Most successful content recommendation systems are not pure ML. They are blends of exactly the signals described in this article.
Medium's recommendation engine uses a combination of topic tags, reading history, and engagement signals (claps, follows) to surface articles. The trending feed is a core feature, not a fallback. The "More from this author" and "Related topics" sections are tag-based. ML models handle the ranking layer and cold-start cases, but the underlying signals are all computable from structured data.
Substack's discovery is primarily tag-based and trending-based. Publications declare their topics; readers' engagement history is tracked; new content from known topics surfaces in recommendations. The recommendation system is relatively simple and effective precisely because the platform is topic-structured.
Both platforms benefit from a key property of content: articles have explicit, human-assigned topic metadata. This structure makes SQL-based tag matching high-quality in a way that is harder to achieve for unstructured recommendations like movies or songs, where the metadata is less explicit and ML fills in the gaps.
When to Add Machine Learning
SQL heuristics cover roughly 80% of content recommendation scenarios well. ML adds genuine value in three situations:
Cold start for new users. If a user arrived from a specific campaign or referral link, ML can infer probable interests from the acquisition source. SQL has no concept of this signal; a model trained on acquisition-to-engagement patterns can use it.
Deep taste modeling. Two users who both read climate articles may have very different interests within that topic: one follows financial impacts, the other follows technology solutions. Tag overlap cannot distinguish them. A model with sufficient training data can learn these subtopic preferences.
Cross-format recommendations. Matching a user's text reading history to relevant podcasts, videos, or newsletters requires understanding semantic similarity across format boundaries. SQL tag matching only works when the same tag vocabulary is applied consistently across formats.
For these cases, ML works best as a re-ranking layer on top of the SQL pipeline. The heuristics generate a candidate set of 50-100 articles quickly and cheaply. The ML model re-ranks this candidate set based on deeper preference signals. This architecture gives you the freshness benefits of streaming SQL alongside the precision benefits of ML, without requiring the model to handle trending or freshness directly.
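A minimal sketch of that two-stage shape, with both the candidate fetcher and the model as stand-in callables (nothing here is a real RisingWave or ML API):

```python
def recommend(user_id, fetch_candidates, rerank_model, limit=20):
    """Stage 1: cheap SQL heuristics produce a candidate set (e.g. the
    personalized-feed query with LIMIT 100). Stage 2: an ML model
    re-ranks only that small set."""
    candidates = fetch_candidates(user_id)
    scored = sorted(
        candidates,
        key=lambda c: rerank_model(user_id, c),
        reverse=True,
    )
    return scored[:limit]

# Stand-ins for demonstration only
def fake_candidates(user_id):
    return [{"article_id": "a1"}, {"article_id": "a2"}, {"article_id": "a3"}]

def fake_model(user_id, candidate):
    return {"a1": 0.2, "a2": 0.9, "a3": 0.5}[candidate["article_id"]]

print([c["article_id"] for c in recommend("u1", fake_candidates, fake_model, limit=2)])
# ['a2', 'a3']
```

Because the model only ever sees articles the heuristics already surfaced, freshness and trend detection stay with the streaming layer while the model focuses on preference precision.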
Architecture Overview
The full pipeline uses three infrastructure components:
Kafka receives reading events from your application. Your frontend sends an event whenever a user opens an article, and periodically updates the read_pct and dwell_time as the user progresses through the article.
RisingWave consumes Kafka events and the article CDC stream, maintains all four materialized views continuously, and serves recommendation queries over a PostgreSQL-compatible interface.
Your application connects to RisingWave on port 4566 using any PostgreSQL client library and fires recommendation queries on page load.
There is no separate batch scheduler, no training pipeline for the heuristic system, and no feature store. The materialized views are the feature store. They are always up to date because they update incrementally with each event.
Getting Started
To run this pipeline, you need RisingWave connected to your Kafka cluster and your CMS database. A minimal local setup:
# Start RisingWave locally
docker run -d -p 4566:4566 risingwavelabs/risingwave:latest
# Connect and create the pipeline
psql -h localhost -p 4566 -d dev -U root
Apply the SQL from the sections above in order: sources and tables first, then the materialized views. Each view reads only from the base tables (reading_events and articles), so the views themselves can be created in any order.
The recommendation query in the "Serving" section is ready to use immediately after the views are created. Early results will be sparse if your article catalog is small, but the structure is correct and scales to millions of articles and readers without schema changes.
Want to try this on your content platform? Sign up for RisingWave Cloud for a free managed instance, or run locally with the quickstart guide. Join the Slack community to share what you are building or ask questions.

