Social media sentiment analysis in real time with streaming SQL means processing every tweet, post, or comment the moment it arrives and making aggregated brand sentiment scores immediately queryable. Instead of running overnight batch jobs, you define continuous SQL queries over a live stream and let the database maintain the results incrementally. With RisingWave, a PostgreSQL-compatible streaming database, you can build this pipeline entirely in SQL: ingest from Kafka, join against a sentiment lookup table, aggregate with tumbling windows, and alert on negative spikes, with no custom application code required.
Why Real-Time Sentiment Matters
Most social listening tools today work on a 15-minute to 24-hour delay. A PR crisis that starts at 9:00 AM might not appear in your dashboard until 9:30 AM or later. By then, the offending tweet may have been retweeted thousands of times.
The core problem is architecture. Traditional sentiment pipelines collect posts in a data lake, run a scheduled batch job (usually Spark or a Python script on Airflow) to score them, write results to a warehouse, and refresh the dashboard on a cron schedule. Every step adds latency.
Streaming SQL collapses that architecture into a single system:
| Aspect | Batch Sentiment Pipeline | Streaming SQL Pipeline |
|--------|--------------------------|------------------------|
| Detection latency | 15 min to 24 hours | Under 1 second |
| Data freshness | Stale until next batch | Always current |
| Infrastructure | Kafka + Spark + warehouse + scheduler | One streaming database |
| Alerting | Polling a warehouse table | Materialized view triggers instantly |
| Rule changes | Redeploy pipeline and wait for next run | Update SQL, effective immediately |
RisingWave is a PostgreSQL-compatible streaming database that maintains materialized views incrementally. Every time a new social post arrives, only the affected aggregations are recomputed. The full dataset is never re-scanned. This is what makes sub-second sentiment dashboards possible at scale.
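The incremental model is easy to picture outside of SQL. The sketch below (plain Python for illustration, not RisingWave internals) maintains a per-brand running sum and count, so each new post costs constant work no matter how much history has accumulated:

```python
from collections import defaultdict

class IncrementalAvg:
    """Toy model of incremental view maintenance: each event updates
    only the affected brand's running totals; history is never rescanned."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def on_post(self, brand, score):
        # O(1) per event, independent of total history size
        self.sums[brand] += score
        self.counts[brand] += 1

    def avg(self, brand):
        return self.sums[brand] / self.counts[brand]

agg = IncrementalAvg()
for brand, score in [("BrandX", 0.88), ("BrandX", -0.75), ("BrandY", 0.87)]:
    agg.on_post(brand, score)
print(round(agg.avg("BrandX"), 3))  # 0.065
```

A batch pipeline recomputes the average over all posts on every run; the streaming model folds each arrival into existing state, which is why query latency stays flat as data grows.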
Pipeline Overview
The complete pipeline consists of four layers:
graph LR
A[Twitter / Reddit / Instagram] -->|Social Posts| B[Apache Kafka]
B -->|CREATE SOURCE| C[RisingWave]
C -->|Lookup Join| D[senti_word_scores table]
C -->|Materialized Views| E[Brand Trend Windows]
E -->|CREATE SINK| F[Kafka: sentiment-alerts]
E -->|Query| G[Dashboard / API]
- Ingest: A Kafka topic receives posts from social media platform connectors.
- Score: Upstream enrichment (a UDF, ML model, or rule engine) computes a sentiment score per post and writes it into the stream alongside the post content.
- Aggregate: RisingWave materialized views compute brand-level sentiment trends using tumbling windows.
- Alert: A spike detection view filters windows with severe negative sentiment and feeds a Kafka sink for downstream notification.
For this tutorial, we simulate the upstream scoring step by including a sentiment_score column in the posts table. In production, this score comes from a Python UDF, an external ML endpoint, or a VADER/TextBlob preprocessing service that enriches the Kafka message before it reaches RisingWave.
All SQL in this guide was verified against RisingWave 2.8.0. You can run RisingWave locally with brew install risingwave and connect with any PostgreSQL client.
Step 1: Define the Sentiment Word Lookup Table
A simple and effective approach for rule-based sentiment scoring is to maintain a word-score dictionary. Each word in the lexicon has a score between -1.0 (maximally negative) and +1.0 (maximally positive). Upstream enrichment queries this table to average the scores of words found in each post.
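To make the scoring rule concrete, here is a minimal Python sketch of the averaging an upstream enrichment service might perform against such a lexicon. The tokenization is deliberately naive; a production service would use VADER or similar:

```python
# Small excerpt of the lexicon; in practice this mirrors senti_word_scores.
LEXICON = {"love": 0.9, "amazing": 0.9, "fast": 0.5, "terrible": -0.9, "slow": -0.5}

def score_post(text, lexicon=LEXICON):
    """Average the lexicon scores of recognized words; 0.0 if none match."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

print(round(score_post("I love the new phone, it is amazing and fast!"), 3))  # 0.767
```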
CREATE TABLE senti_word_scores (
word VARCHAR PRIMARY KEY,
score DOUBLE PRECISION -- range: -1.0 to +1.0
);
Populate the lexicon with a representative set of terms:
INSERT INTO senti_word_scores VALUES
('love', 0.9),
('amazing', 0.9),
('excellent', 0.85),
('awesome', 0.85),
('happy', 0.8),
('great', 0.8),
('reliable', 0.7),
('good', 0.6),
('fast', 0.5),
('works', 0.3),
('ok', 0.1),
('issue', -0.3),
('problem', -0.4),
('slow', -0.5),
('bug', -0.5),
('annoying', -0.6),
('bad', -0.7),
('frustrating', -0.7),
('disappointed', -0.65),
('broken', -0.8),
('crash', -0.85),
('worst', -0.95),
('awful', -0.9),
('terrible', -0.9),
('hate', -0.9);
Verify the top positive and negative entries:
SELECT word, score FROM senti_word_scores ORDER BY score DESC LIMIT 6;
word | score
-----------+-------
love | 0.9
amazing | 0.9
excellent | 0.85
awesome | 0.85
happy | 0.8
great | 0.8
(6 rows)
SELECT word, score FROM senti_word_scores ORDER BY score ASC LIMIT 6;
word | score
----------+-------
worst | -0.95
awful | -0.9
terrible | -0.9
hate | -0.9
crash | -0.85
broken | -0.8
(6 rows)
This table is a regular RisingWave table, so you can update it at any time with INSERT, UPDATE, or DELETE. Changes take effect on the next post scored, with no pipeline restarts required.
Step 2: Create the Social Posts Source
In production, you connect RisingWave to the Kafka topic where your social media connector publishes posts. The DDL below includes the sentiment_score column, which your upstream enrichment service populates before writing to Kafka.
CREATE SOURCE senti_posts_stream (
post_id VARCHAR,
brand VARCHAR,
content VARCHAR,
platform VARCHAR, -- 'twitter', 'reddit', 'instagram'
author VARCHAR,
sentiment_score DOUBLE PRECISION,
posted_at TIMESTAMP
)
WITH (
connector = 'kafka',
topic = 'social-posts-enriched',
properties.bootstrap.server = 'kafka:9092',
scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
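Each Kafka message must be a flat JSON object whose keys match the source columns. A hypothetical enriched record (serialized with Python here purely for illustration) would look like this:

```python
import json

# Hypothetical record matching the senti_posts_stream schema; the upstream
# scoring service fills in sentiment_score before producing to Kafka.
record = {
    "post_id": "p001",
    "brand": "BrandX",
    "content": "I love the new BrandX phone, it is amazing and fast!",
    "platform": "twitter",
    "author": "alice",
    "sentiment_score": 0.88,
    "posted_at": "2026-04-01 09:00:00",
}
print(json.dumps(record))
```

FORMAT PLAIN ENCODE JSON decodes each message as a standalone JSON object; keys absent from a message typically surface as NULL in the corresponding column.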
For this tutorial, use a regular table to insert data directly without a Kafka cluster:
CREATE TABLE senti_posts (
post_id VARCHAR,
brand VARCHAR,
content VARCHAR,
platform VARCHAR,
author VARCHAR,
sentiment_score DOUBLE PRECISION,
posted_at TIMESTAMP
);
Insert a realistic sample dataset covering two brands across multiple platforms:
INSERT INTO senti_posts VALUES
('p001','BrandX','I love the new BrandX phone, it is amazing and fast!', 'twitter', 'alice', 0.88,'2026-04-01 09:00:00'),
('p002','BrandX','BrandX customer support is terrible and slow', 'twitter', 'bob', -0.75,'2026-04-01 09:01:00'),
('p003','BrandX','The BrandX app is awesome, works great every day', 'reddit', 'carol', 0.78,'2026-04-01 09:02:00'),
('p004','BrandY','BrandY is the worst, broken features and annoying bugs', 'twitter', 'dave', -0.85,'2026-04-01 09:00:30'),
('p005','BrandY','Really love the BrandY design, excellent quality', 'instagram','eve', 0.87,'2026-04-01 09:01:30'),
('p006','BrandX','Frustrated with BrandX, crashed again this morning', 'twitter', 'frank',-0.78,'2026-04-01 09:03:00'),
('p007','BrandY','BrandY is good but a bit slow sometimes', 'reddit', 'grace',-0.05,'2026-04-01 09:02:30'),
('p008','BrandX','BrandX is reliable and I am happy with my purchase', 'instagram','henry', 0.75,'2026-04-01 09:04:00'),
('p009','BrandY','BrandY app crash ruined my morning, awful experience', 'twitter', 'iris', -0.87,'2026-04-01 09:03:30'),
('p010','BrandX','ok experience with BrandX nothing special', 'twitter', 'jake', 0.10,'2026-04-01 09:05:00');
Step 3: Label Every Post with a Sentiment Category
The first materialized view enriches each post with a human-readable sentiment label. This layer builds on the raw score and makes downstream views more readable.
CREATE MATERIALIZED VIEW senti_post_scores AS
SELECT
post_id,
brand,
platform,
author,
content,
posted_at,
sentiment_score,
CASE
WHEN sentiment_score > 0.2 THEN 'positive'
WHEN sentiment_score < -0.2 THEN 'negative'
ELSE 'neutral'
END AS sentiment_label
FROM senti_posts;
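The CASE expression is the entire classification rule: scores within ±0.2 of zero are treated as neutral. An equivalent Python function, useful for testing threshold choices offline before changing the view:

```python
def sentiment_label(score):
    """Mirror of the SQL CASE: a +/-0.2 dead band around neutral."""
    if score > 0.2:
        return "positive"
    if score < -0.2:
        return "negative"
    return "neutral"  # includes the boundary values 0.2 and -0.2

print([sentiment_label(s) for s in (0.88, -0.75, 0.10)])  # ['positive', 'negative', 'neutral']
```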
Query the view to confirm the labels:
SELECT post_id, brand, platform,
ROUND(sentiment_score::numeric, 2) AS score,
sentiment_label
FROM senti_post_scores
ORDER BY posted_at;
post_id | brand | platform | score | sentiment_label
---------+--------+-----------+-------+-----------------
p001 | BrandX | twitter | 0.88 | positive
p004 | BrandY | twitter | -0.85 | negative
p002 | BrandX | twitter | -0.75 | negative
p005 | BrandY | instagram | 0.87 | positive
p003 | BrandX | reddit | 0.78 | positive
p007 | BrandY | reddit | -0.05 | neutral
p006 | BrandX | twitter | -0.78 | negative
p009 | BrandY | twitter | -0.87 | negative
p008 | BrandX | instagram | 0.75 | positive
p010 | BrandX | twitter | 0.10 | neutral
(10 rows)
RisingWave maintains this view incrementally. Every new post that arrives from Kafka is immediately reflected here, with the sentiment label applied, without scanning historical data.
Step 4: Aggregate Brand Sentiment with Tumbling Windows
Brand sentiment dashboards need time-bucketed aggregations: how did BrandX perform in the 9:00-9:05 AM window versus the 9:05-9:10 AM window? Tumbling windows in RisingWave split the stream into non-overlapping, fixed-size intervals using the TUMBLE function.
CREATE MATERIALIZED VIEW senti_brand_trend_5min AS
SELECT
brand,
window_start,
window_end,
COUNT(*) AS post_count,
ROUND(AVG(sentiment_score)::numeric, 3) AS avg_sentiment,
ROUND(MAX(sentiment_score)::numeric, 3) AS max_sentiment,
ROUND(MIN(sentiment_score)::numeric, 3) AS min_sentiment,
COUNT(*) FILTER (WHERE sentiment_label = 'positive') AS positive_count,
COUNT(*) FILTER (WHERE sentiment_label = 'negative') AS negative_count,
COUNT(*) FILTER (WHERE sentiment_label = 'neutral') AS neutral_count
FROM TUMBLE(senti_post_scores, posted_at, INTERVAL '5 minutes')
GROUP BY brand, window_start, window_end;
TUMBLE(senti_post_scores, posted_at, INTERVAL '5 minutes') partitions posts into consecutive 5-minute buckets based on posted_at. Because the buckets never overlap, each post contributes to exactly one window, which keeps incremental maintenance cheap and makes windows directly comparable across periods.
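The bucket assignment is plain arithmetic: window_start is posted_at rounded down to the nearest 5-minute boundary. A Python sketch of the same math (an illustration of the semantics, not RisingWave's implementation):

```python
from datetime import datetime, timedelta

def tumble_window(ts, size=timedelta(minutes=5)):
    """Return the [window_start, window_end) bucket containing ts."""
    epoch = datetime(1970, 1, 1)
    n = (ts - epoch) // size          # number of whole windows before ts
    start = epoch + n * size
    return start, start + size

start, end = tumble_window(datetime(2026, 4, 1, 9, 3, 30))
print(start, end)  # 2026-04-01 09:00:00 2026-04-01 09:05:00
```

Note that window_start is inclusive and window_end is exclusive, so a post at exactly 09:05:00 lands in the second window, matching the sample output above.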
Query the brand trend:
SELECT brand, window_start, window_end, post_count,
avg_sentiment, positive_count, negative_count, neutral_count
FROM senti_brand_trend_5min
ORDER BY window_start, brand;
brand | window_start | window_end | post_count | avg_sentiment | positive_count | negative_count | neutral_count
--------+---------------------+---------------------+------------+---------------+----------------+----------------+--------------
BrandX | 2026-04-01 09:00:00 | 2026-04-01 09:05:00 | 5 | 0.176 | 3 | 2 | 0
BrandY | 2026-04-01 09:00:00 | 2026-04-01 09:05:00 | 4 | -0.225 | 1 | 2 | 1
BrandX | 2026-04-01 09:05:00 | 2026-04-01 09:10:00 | 1 | 0.100 | 0 | 0 | 1
(3 rows)
BrandX averaged +0.176 in the first window (mixed but slightly positive), while BrandY averaged -0.225 (more negative than positive). BrandX's second window has a single neutral post, which represents a quiet period before more data arrives.
Step 5: Build an Overall Brand Health Summary
Beyond windowed trends, product and marketing teams need a single number that summarizes how a brand is performing overall. This rolling summary view aggregates all posts without time bucketing.
CREATE MATERIALIZED VIEW senti_brand_summary AS
SELECT
brand,
COUNT(*) AS total_posts,
ROUND(AVG(sentiment_score)::numeric, 3) AS avg_sentiment,
COUNT(*) FILTER (WHERE sentiment_label = 'positive') AS positive_posts,
COUNT(*) FILTER (WHERE sentiment_label = 'negative') AS negative_posts,
COUNT(*) FILTER (WHERE sentiment_label = 'neutral') AS neutral_posts,
ROUND(
100.0 * COUNT(*) FILTER (WHERE sentiment_label = 'positive') / COUNT(*)
, 1) AS positive_pct
FROM senti_post_scores
GROUP BY brand;
SELECT brand, total_posts, avg_sentiment,
positive_posts, negative_posts, neutral_posts, positive_pct
FROM senti_brand_summary
ORDER BY avg_sentiment DESC;
brand | total_posts | avg_sentiment | positive_posts | negative_posts | neutral_posts | positive_pct
--------+-------------+---------------+----------------+----------------+---------------+-------------
BrandX | 6 | 0.163 | 3 | 2 | 1 | 50.0
BrandY | 4 | -0.225 | 1 | 2 | 1 | 25.0
(2 rows)
BrandX has 50% of posts classified as positive and an average score of +0.163. BrandY is at 25% positive with an average score of -0.225. These numbers update automatically every time a new post arrives, with no scheduled refresh jobs.
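You can sanity-check BrandX's row by hand; the six sample scores reproduce both numbers:

```python
# BrandX sentiment_score values from posts p001, p002, p003, p006, p008, p010
scores = [0.88, -0.75, 0.78, -0.78, 0.75, 0.10]

avg_sentiment = round(sum(scores) / len(scores), 3)
positive_pct = round(100.0 * sum(s > 0.2 for s in scores) / len(scores), 1)
print(avg_sentiment, positive_pct)  # 0.163 50.0
```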
Step 6: Sentiment Breakdown by Platform
Different platforms have different user demographics and tones. Twitter/X skews more negative than Instagram. Understanding per-platform sentiment helps prioritize where to focus your response team.
CREATE MATERIALIZED VIEW senti_platform_breakdown AS
SELECT
brand,
platform,
COUNT(*) AS post_count,
ROUND(AVG(sentiment_score)::numeric, 3) AS avg_sentiment,
COUNT(*) FILTER (WHERE sentiment_label = 'positive') AS positive_count,
COUNT(*) FILTER (WHERE sentiment_label = 'negative') AS negative_count
FROM senti_post_scores
GROUP BY brand, platform;
SELECT brand, platform, post_count, avg_sentiment,
positive_count, negative_count
FROM senti_platform_breakdown
ORDER BY brand, platform;
brand | platform | post_count | avg_sentiment | positive_count | negative_count
--------+-----------+------------+---------------+----------------+----------------
BrandX | instagram | 1 | 0.750 | 1 | 0
BrandX | reddit | 1 | 0.780 | 1 | 0
BrandX | twitter | 4 | -0.138 | 1 | 2
BrandY | instagram | 1 | 0.870 | 1 | 0
BrandY | reddit | 1 | -0.050 | 0 | 0
BrandY | twitter | 2 | -0.860 | 0 | 2
(6 rows)
The pattern is immediately clear: both brands perform significantly better on Instagram than on Twitter. BrandY's Twitter average is -0.860, which is a severe signal. On Instagram, the same brand scores +0.870. This kind of per-channel insight guides where community managers should focus their attention.
Step 7: Detect Sentiment Spikes and Alert
The most actionable output of any sentiment pipeline is an alert: tell me the moment a brand is taking a beating on social media, not after the fact. The spike detection view builds on senti_brand_trend_5min and flags windows where average sentiment drops below a critical threshold.
First, simulate a PR crisis by adding a burst of negative posts for BrandY:
INSERT INTO senti_posts VALUES
('p011','BrandY','BrandY service is completely broken, worst experience ever', 'twitter','mark', -0.92,'2026-04-01 09:10:00'),
('p012','BrandY','Cannot believe how bad BrandY has gotten, total disaster', 'twitter','nina', -0.88,'2026-04-01 09:11:00'),
('p013','BrandY','BrandY crashed my data, I hate this product', 'reddit', 'oscar',-0.90,'2026-04-01 09:12:00');
Now create the spike alert view:
CREATE MATERIALIZED VIEW senti_spike_alerts AS
SELECT
brand,
window_start,
window_end,
post_count,
avg_sentiment,
negative_count,
'negative_spike' AS alert_type
FROM senti_brand_trend_5min
WHERE avg_sentiment < -0.5
AND negative_count >= 2;
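The WHERE clause is a deliberately simple two-condition rule: a window must be both severe (average below -0.5) and backed by volume (at least two negative posts), so a single angry post cannot page anyone. The same predicate in Python, handy for replaying historical windows while tuning thresholds:

```python
def is_spike(avg_sentiment, negative_count,
             avg_threshold=-0.5, min_negatives=2):
    """Mirror of the senti_spike_alerts filter."""
    return avg_sentiment < avg_threshold and negative_count >= min_negatives

print(is_spike(-0.90, 3))   # True: the BrandY crisis window
print(is_spike(-0.90, 1))   # False: one post is not a trend
print(is_spike(-0.225, 2))  # False: not severe enough
```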
Query the alerts:
SELECT brand, window_start, window_end,
post_count, avg_sentiment, negative_count, alert_type
FROM senti_spike_alerts;
brand | window_start | window_end | post_count | avg_sentiment | negative_count | alert_type
--------+---------------------+---------------------+------------+---------------+----------------+----------------
BrandY | 2026-04-01 09:10:00 | 2026-04-01 09:15:00 | 3 | -0.900 | 3 | negative_spike
(1 row)
Three negative posts in the 09:10-09:15 window pushed BrandY's average to -0.90. The alert materialized view detected this immediately. At the same time, the brand summary updated:
SELECT brand, total_posts, avg_sentiment, positive_pct
FROM senti_brand_summary
ORDER BY avg_sentiment DESC;
brand | total_posts | avg_sentiment | positive_pct
--------+-------------+---------------+-------------
BrandX | 6 | 0.163 | 50.0
BrandY | 7 | -0.514 | 14.3
(2 rows)
BrandY's overall average dropped from -0.225 to -0.514 and its positive percentage fell from 25% to 14.3%. All of this happened the moment those three rows were inserted, with no recomputation of historical data.
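The new average follows directly from the old running totals, which is essentially the work the incremental engine performs on insert; a quick check:

```python
# BrandY before the burst: 4 posts summing to -0.90 (avg -0.225)
total, count = -0.90, 4
for s in (-0.92, -0.88, -0.90):   # the three crisis posts
    total += s
    count += 1
print(round(total / count, 3))  # -0.514
```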
Step 8: Push Alerts to Kafka
Detection is only useful if it drives action. RisingWave's CREATE SINK pushes every new alert row to a Kafka topic the moment senti_spike_alerts receives a new record.
CREATE SINK senti_alerts_sink
FROM senti_spike_alerts
WITH (
connector = 'kafka',
topic = 'sentiment-alerts',
properties.bootstrap.server = 'kafka:9092',
primary_key = 'brand,window_start'
)
FORMAT UPSERT ENCODE JSON;
Downstream consumers subscribed to sentiment-alerts receive a JSON message for each spike. Your notification service can post to Slack, page an on-call team, or trigger a crisis response workflow automatically.
You can also sink directly to PostgreSQL for integration with a BI tool or an operational dashboard:
CREATE SINK senti_alerts_pg
FROM senti_spike_alerts
WITH (
connector = 'jdbc',
jdbc.url = 'jdbc:postgresql://postgres:5432/social_db',
table.name = 'sentiment_alerts',
type = 'upsert',
primary_key = 'brand,window_start'
);
Scaling to a Real-World Vocabulary
The 25-word lexicon in this tutorial is a starting point. A production deployment typically uses one of these approaches:
- VADER or SentimentIntensityAnalyzer: Run Python enrichment before Kafka using VADER, which handles punctuation, capitalization, and slang. The compound score maps directly to the -1.0 to +1.0 scale used here.
- ML model as a UDF: RisingWave supports Python UDFs that can call a model endpoint. A RoBERTa-based sentiment classifier achieves much higher accuracy than lexicon matching on informal text.
- Expanded lexicon table: Load a full AFINN or NRC lexicon (thousands of words) into senti_word_scores. Since this is a regular RisingWave table, you update it with INSERT or COPY without touching any materialized view.
The SQL pipeline you built today stays the same regardless of which scoring mechanism you use. The only thing that changes is what populates sentiment_score in the incoming Kafka messages.
Production Considerations
Window size selection: 5-minute tumbling windows balance freshness with statistical significance. For brands with very high post volumes (millions per hour), 1-minute windows provide finer granularity. For lower-volume brands, 15-minute or 1-hour windows reduce noise. Start wider and shrink as you calibrate your alert thresholds.
Alert threshold tuning: The -0.5 threshold and minimum of 2 negative posts used in senti_spike_alerts are conservative starting points. Review historical data to find the score and count combination that best separates genuine crises from normal noise. To change the rule, drop and re-create the materialized view; the new thresholds apply as soon as the view is rebuilt.
Handling multiple languages: Sentiment scoring is language-specific. A simple extension is to add a language column to senti_posts and maintain separate word score tables per language. Use a JOIN that matches on both language and word. See RisingWave source connectors for options to route posts by language from your Kafka topics.
Backfill and historical analysis: RisingWave can replay Kafka topics from the earliest offset by setting scan.startup.mode = 'earliest' in the source DDL. This lets you retroactively build trend views over months of historical social data using the same SQL you write for real-time processing.
What Is Social Media Sentiment Analysis with Streaming SQL?
Social media sentiment analysis with streaming SQL is the practice of running continuous SQL queries over a stream of social posts to compute sentiment scores and aggregate brand sentiment trends as new posts arrive. A streaming database like RisingWave incrementally maintains materialized view results so that brand health dashboards and spike alerts update in under a second, compared to hourly or daily batch pipelines that score posts in bulk.
Frequently Asked Questions
How does RisingWave handle the nested-loop join required to match every post word against a sentiment lexicon?
RisingWave does not support streaming nested-loop joins, which is why this pipeline includes the sentiment score in the Kafka message itself rather than performing the word-level join inside RisingWave. The scoring step belongs in an upstream enrichment service (a Python VADER script, a Faust consumer, or a cloud function) that reads each post, computes the score, and writes the enriched record back to Kafka. RisingWave then handles the aggregation, windowing, and alerting, which are the tasks it excels at. This separation of concerns is also good architecture: it keeps model logic decoupled from the serving layer.
Can I use RisingWave's Python UDF to score posts directly inside the database?
Yes. RisingWave supports Python UDFs that run in an external Python process. You can write a UDF that calls SentimentIntensityAnalyzer from VADER or invokes a Hugging Face model endpoint, then call that UDF in a materialized view definition. This approach works well for moderate throughput. For very high-volume streams (millions of posts per minute), pre-scoring in Kafka Streams or Faust before the data reaches RisingWave is more efficient, because it scales the scoring layer independently from the SQL layer.
How do tumbling windows in RisingWave differ from session windows for sentiment analysis?
Tumbling windows (TUMBLE) divide the stream into fixed, non-overlapping intervals (e.g., every 5 minutes). They are ideal for brand sentiment dashboards where you want clean time buckets to compare. Session windows group events that occur close together in time, separated by gaps of inactivity. For sentiment analysis, tumbling windows are the standard choice because brand monitoring cares about wall-clock periods ("what happened between 9:00 and 9:05 AM"), not user session gaps. RisingWave also supports hopping windows if you need overlapping intervals for smoother trend lines.
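The practical difference comes down to how many windows contain a given event: exactly one for tumbling, size divided by hop for hopping. A small illustration of the assignment rule (plain Python, not RisingWave's implementation):

```python
from datetime import datetime, timedelta

def windows_containing(ts, size, hop):
    """Starts of all windows [s, s + size) on a grid advancing by `hop`
    that contain ts. With hop == size this is a tumbling window."""
    epoch = datetime(1970, 1, 1)
    s = epoch + ((ts - epoch) // hop) * hop   # latest grid start <= ts
    starts = []
    while s + size > ts:                      # window still covers ts
        starts.append(s)
        s -= hop
    return sorted(starts)

ts = datetime(2026, 4, 1, 9, 3, 30)
print(len(windows_containing(ts, timedelta(minutes=5), timedelta(minutes=5))))  # 1
print(len(windows_containing(ts, timedelta(minutes=5), timedelta(minutes=1))))  # 5
```

With a 5-minute window hopping every minute, each post contributes to five overlapping windows, which is what smooths the trend line at the cost of more state.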
How does this pipeline compare to building the same system with Apache Flink?
Apache Flink can build the same pipeline, but requires a Java or Scala application, a JVM cluster to operate, and a separate serving layer to make results queryable. With RisingWave, the results of every materialized view are immediately queryable via any PostgreSQL client, with no additional serving layer. Teams that already know SQL can be productive in hours rather than days. For a detailed comparison of the two approaches, see Apache Flink vs RisingWave.
Conclusion
You have built a complete real-time social media sentiment analysis pipeline using RisingWave streaming SQL. Here is what each component does:
- senti_word_scores: a live sentiment lexicon you can update without restarting anything
- senti_post_scores: labels every incoming post with a sentiment category in real time
- senti_brand_trend_5min: aggregates brand sentiment into 5-minute tumbling windows
- senti_brand_summary: provides a rolling overall health score per brand
- senti_platform_breakdown: surfaces per-channel sentiment differences
- senti_spike_alerts: detects crisis windows and feeds downstream alerting
Every SQL statement in this post was tested against RisingWave 2.8.0 and produces the outputs shown. The entire system requires zero application code: all ingestion, scoring, aggregation, windowing, and alerting logic lives in SQL.
The real advantage of this architecture is iteration speed. When a new brand launches a product, you add a few rows to senti_posts. When you discover a new slang term that predicts negative sentiment, you add it to senti_word_scores. When the marketing team wants a 1-minute window instead of 5, you drop and re-create senti_brand_trend_5min. None of these changes require pipeline redeployment or cluster restarts.
Ready to try this yourself? Get started with RisingWave in 5 minutes: Quickstart
Join our Slack community to share what you build and get help from the RisingWave team.

