Real-Time Clickstream Analytics: How E-Commerce Teams Understand Shopper Behavior Live
Google Analytics tells you what happened yesterday. Your merchandising team needs to know what is happening right now: which products are spiking, where users are dropping out of the funnel, and which sessions are about to abandon. Streaming SQL on raw clickstream events — captured to Kafka and processed continuously — gives you that live visibility without custom application code or expensive analytics platforms.
The Problem with Delayed Clickstream Analytics
Standard web analytics platforms aggregate data in reporting windows. By the time a dashboard reflects a broken checkout flow or a 404 on your top landing page, thousands of sessions have already been lost.
The delay is not just a UI problem. It is a data pipeline problem. Most analytics stacks send events to a cloud collector, batch-load them into a data warehouse, run transformation jobs, and serve results through a BI layer. Each step adds latency. The whole chain typically takes 30 minutes to 4 hours to produce a usable number.
For e-commerce operations running flash sales, launching new products, or running paid campaigns, 30-minute delayed feedback is not analytics — it is a post-mortem.
What Raw Clickstream Enables That Aggregated Analytics Cannot
When you control your raw clickstream — events sent directly from your frontend to Kafka — you can:
- Reconstruct sessions in real time, calculating time-on-page and navigation sequences
- Detect funnel drop-off as it happens, not after the session closes
- Identify cart abandonment within seconds of the signal, enabling real-time intervention
- Measure live conversion rates per product, per campaign, per traffic source
- Detect anomalies like sudden traffic spikes, bot sessions, or a payment gateway going down
None of this requires a data science team. It requires a streaming database that can process the event stream with SQL.
Clickstream Event Schema
The foundation is a well-structured event schema. Every user interaction fires a JSON event to Kafka:
CREATE SOURCE clickstream_events (
event_id VARCHAR,
session_id VARCHAR,
user_id VARCHAR, -- NULL for anonymous sessions
event_type VARCHAR, -- 'page_view', 'product_view', 'add_to_cart',
-- 'checkout_start', 'purchase', 'search'
page_url VARCHAR,
product_id VARCHAR, -- NULL for non-product pages
category_id VARCHAR,
search_query VARCHAR, -- populated for search events
traffic_source VARCHAR, -- 'organic', 'paid_search', 'email', 'direct'
campaign_id VARCHAR,
device_type VARCHAR, -- 'mobile', 'desktop', 'tablet'
event_time TIMESTAMPTZ
)
WITH (
connector = 'kafka',
topic = 'clickstream',
properties.bootstrap.server = 'kafka:9092',
scan.startup.mode = 'latest'
)
FORMAT PLAIN ENCODE JSON;
The examples here use RisingWave, a PostgreSQL-compatible streaming database: open source under Apache 2.0, written in Rust, with durable state kept in S3. Once this source is defined, you write SQL materialized views to derive every analytics metric you need.
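Before building real metrics, it helps to confirm the source is actually ingesting. A minimal sketch (the view name event_type_counts is illustrative, not part of the schema above):

```sql
-- Hypothetical smoke-test view: counts events by type as they arrive.
CREATE MATERIALIZED VIEW event_type_counts AS
SELECT event_type, COUNT(*) AS event_count
FROM clickstream_events
GROUP BY event_type;

-- Query it from any PostgreSQL client:
SELECT * FROM event_type_counts ORDER BY event_count DESC;
```

If the counts grow as you click around a staging site, the pipeline is wired correctly.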
Live Funnel Metrics
The most critical e-commerce metric is funnel conversion: how many sessions that start product discovery make it to purchase. Track it as a live tumbling-window aggregate:
CREATE MATERIALIZED VIEW funnel_metrics_10min AS
SELECT
WINDOW_START,
WINDOW_END,
traffic_source,
device_type,
COUNT(DISTINCT session_id)
FILTER (WHERE event_type = 'page_view') AS sessions,
COUNT(DISTINCT session_id)
FILTER (WHERE event_type = 'product_view') AS product_views,
COUNT(DISTINCT session_id)
FILTER (WHERE event_type = 'add_to_cart') AS add_to_carts,
COUNT(DISTINCT session_id)
FILTER (WHERE event_type = 'checkout_start') AS checkout_starts,
COUNT(DISTINCT session_id)
FILTER (WHERE event_type = 'purchase') AS purchases,
ROUND(
COUNT(DISTINCT session_id) FILTER (WHERE event_type = 'purchase')::NUMERIC
/ NULLIF(COUNT(DISTINCT session_id) FILTER (WHERE event_type = 'page_view'), 0) * 100,
2
) AS overall_conversion_rate
FROM TUMBLE(
clickstream_events,
event_time,
INTERVAL '10 MINUTES'
)
GROUP BY WINDOW_START, WINDOW_END, traffic_source, device_type;
Your live dashboard queries funnel_metrics_10min and gets a conversion rate that reflects the last 10 minutes of actual shopper behavior, broken down by traffic source and device type.
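A dashboard query against this view might look like the following sketch; the time filter is an assumption that depends on how your BI tool refreshes:

```sql
-- Hypothetical dashboard query: funnel numbers for recent windows,
-- broken down by traffic source and device type.
SELECT window_start, traffic_source, device_type,
       sessions, add_to_carts, purchases, overall_conversion_rate
FROM funnel_metrics_10min
WHERE window_start > NOW() - INTERVAL '30 MINUTES'
ORDER BY window_start DESC, traffic_source, device_type;
```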
Session Reconstruction
Reconstructing sessions — understanding the navigation sequence and time-on-page within each visit — needs no custom application code. A per-session aggregation builds each session profile incrementally as events arrive:
CREATE MATERIALIZED VIEW session_summary AS
SELECT
session_id,
user_id,
traffic_source,
device_type,
MIN(event_time) AS session_start,
MAX(event_time) AS last_event_time,
EXTRACT(EPOCH FROM (MAX(event_time) - MIN(event_time))) AS session_duration_seconds,
COUNT(*) AS total_events,
COUNT(*) FILTER (WHERE event_type = 'page_view') AS page_views,
COUNT(*) FILTER (WHERE event_type = 'product_view') AS product_views,
COUNT(*) FILTER (WHERE event_type = 'add_to_cart') AS add_to_carts,
BOOL_OR(event_type = 'purchase') AS converted,
ARRAY_AGG(product_id ORDER BY event_time)
FILTER (WHERE product_id IS NOT NULL) AS viewed_products
FROM clickstream_events
GROUP BY session_id, user_id, traffic_source, device_type;
This view builds up incrementally as each event arrives. By the time a session ends, session_summary contains the complete session profile without any post-processing step.
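For example, an operations query over this view can surface engaged sessions that have not yet converted; the thresholds below are illustrative, not recommendations:

```sql
-- Hypothetical query: engaged, non-converting sessions, e.g. candidates
-- for live chat or an on-site offer.
SELECT session_id, traffic_source, device_type,
       session_duration_seconds, product_views, viewed_products
FROM session_summary
WHERE converted = FALSE
  AND product_views >= 3
  AND session_duration_seconds > 300
ORDER BY session_duration_seconds DESC
LIMIT 20;
```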
Real-Time Abandonment Detection
Cart abandonment detection is traditionally a batch job that runs hours after the session closes. With streaming SQL, you can identify sessions that added to cart but have not progressed in the last 15 minutes:
CREATE MATERIALIZED VIEW cart_abandonment_candidates AS
SELECT
s.session_id,
s.user_id,
s.last_event_time,
s.add_to_carts,
s.converted
FROM session_summary s
WHERE s.add_to_carts > 0
AND s.converted = FALSE
-- Express the idle window as bounds on last_event_time: RisingWave
-- accepts NOW() in materialized views only inside WHERE-clause
-- temporal filters, not in the SELECT list
AND s.last_event_time < NOW() - INTERVAL '15 MINUTES'
AND s.last_event_time > NOW() - INTERVAL '60 MINUTES';
Pair this with a RisingWave sink that pushes these session IDs to your marketing automation platform:
CREATE SINK cart_abandonment_alerts
FROM cart_abandonment_candidates
WITH (
connector = 'kafka',
topic = 'cart_abandonment_triggers',
properties.bootstrap.server = 'kafka:9092'
)
FORMAT PLAIN ENCODE JSON;
Your email platform subscribes to cart_abandonment_triggers and fires a recovery email within minutes of the inactivity signal — not hours later when a nightly job processes abandoned carts.
Live Product Performance Dashboard
Track how each product is performing right now — views, adds-to-cart, and purchases in the last hour:
CREATE MATERIALIZED VIEW product_performance_1h AS
SELECT
product_id,
category_id,
COUNT(*) FILTER (WHERE event_type = 'product_view') AS views_1h,
COUNT(*) FILTER (WHERE event_type = 'add_to_cart') AS add_to_carts_1h,
COUNT(*) FILTER (WHERE event_type = 'purchase') AS purchases_1h,
ROUND(
COUNT(*) FILTER (WHERE event_type = 'purchase')::NUMERIC
/ NULLIF(COUNT(*) FILTER (WHERE event_type = 'product_view'), 0) * 100,
2
) AS view_to_purchase_rate,
WINDOW_START,
WINDOW_END
FROM TUMBLE(
clickstream_events,
event_time,
INTERVAL '1 HOUR'
)
WHERE product_id IS NOT NULL
GROUP BY product_id, category_id, WINDOW_START, WINDOW_END;
Merchandising teams use this to spot underperforming products that need better placement and high-converting products that deserve more prominent positioning — in real time, during a sale, not the morning after.
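A merchandising query over this view might rank the current hour's movers (a sketch; the ranking criterion is an assumption):

```sql
-- Hypothetical merchandising query: current-hour top products by
-- add-to-cart volume, with their conversion efficiency.
SELECT product_id, category_id, views_1h, add_to_carts_1h,
       purchases_1h, view_to_purchase_rate
FROM product_performance_1h
WHERE window_start > NOW() - INTERVAL '1 HOUR'
ORDER BY add_to_carts_1h DESC
LIMIT 10;
```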
Search Query Analytics
Understanding what shoppers search for is a directional signal for inventory gaps and merchandising opportunities:
CREATE MATERIALIZED VIEW search_analytics_1h AS
SELECT
LOWER(search_query) AS search_term,
COUNT(*) AS search_count,
COUNT(DISTINCT session_id) AS unique_sessions,
WINDOW_START,
WINDOW_END
FROM TUMBLE(
clickstream_events,
event_time,
INTERVAL '1 HOUR'
)
WHERE event_type = 'search'
AND search_query IS NOT NULL
GROUP BY LOWER(search_query), WINDOW_START, WINDOW_END
ORDER BY search_count DESC;
A search term that jumps from 5 to 500 queries per hour during a news cycle is an inventory signal, not just a search analytics metric.
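That kind of jump can be detected mechanically by comparing adjacent windows; a sketch, with the 5x ratio as an assumed threshold:

```sql
-- Hypothetical spike detector: terms whose current-hour volume is at
-- least 5x the previous hour, via a self-join on adjacent windows.
SELECT cur.search_term,
       prev.search_count AS prev_hour,
       cur.search_count AS this_hour
FROM search_analytics_1h cur
JOIN search_analytics_1h prev
  ON cur.search_term = prev.search_term
 AND prev.window_end = cur.window_start
WHERE cur.search_count >= 5 * prev.search_count
ORDER BY cur.search_count DESC;
```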
Anomaly Detection: Traffic Spikes and Drops
Sudden traffic changes — a campaign going viral, a bot attack, or a service outage — show up in clickstream before they show up anywhere else:
CREATE MATERIALIZED VIEW traffic_anomaly_detection AS
WITH current_window AS (
SELECT
COUNT(DISTINCT session_id) AS current_sessions,
WINDOW_START,
WINDOW_END
FROM TUMBLE(clickstream_events, event_time, INTERVAL '5 MINUTES')
GROUP BY WINDOW_START, WINDOW_END
),
baseline AS (
SELECT AVG(current_sessions) AS avg_sessions
FROM current_window
WHERE WINDOW_END < NOW() - INTERVAL '5 MINUTES'
AND WINDOW_END > NOW() - INTERVAL '2 HOURS'
)
SELECT
cw.WINDOW_START,
cw.WINDOW_END,
cw.current_sessions,
b.avg_sessions,
ROUND((cw.current_sessions - b.avg_sessions) / NULLIF(b.avg_sessions, 0) * 100, 1) AS pct_deviation
FROM current_window cw
CROSS JOIN baseline b
ORDER BY cw.WINDOW_START DESC;
Sessions more than 50% above or below the rolling 2-hour baseline trigger an alert. This kind of real-time anomaly detection catches payment gateway outages (sudden conversion drop) and viral social traffic (sudden session spike) within minutes.
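The 50% threshold can live in the alert query itself rather than in application code (a sketch):

```sql
-- Hypothetical alert query: windows deviating more than 50% from
-- the rolling baseline, newest first.
SELECT window_start, window_end, current_sessions,
       avg_sessions, pct_deviation
FROM traffic_anomaly_detection
WHERE ABS(pct_deviation) > 50
ORDER BY window_start DESC;
```

An alerting job polls this query and pages on any new row.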
Comparison: Analytics Platform Approaches
| Dimension | Google Analytics 4 | Data Warehouse + dbt | Streaming SQL (RisingWave) |
| --- | --- | --- | --- |
| Data freshness | 24–48 hours | 30 min–4 hours | Seconds |
| Raw event access | No | Yes | Yes |
| Session reconstruction | Pre-built (limited) | Post-processing job | Live materialized view |
| Abandonment detection | Batch (daily) | Batch (hourly at best) | Real-time |
| Custom funnel definitions | Limited | Flexible | Full SQL flexibility |
| Infrastructure | SaaS | Cloud DW + orchestrator | Streaming database |
| Cost at scale | Per-event pricing | Warehouse compute | Storage on S3 |
Getting Events into Kafka
Your frontend sends events via a lightweight client library that posts to an ingest API, which writes to Kafka. A minimal setup:
- Frontend: A 1KB JavaScript snippet that fires events on page load, product view, add-to-cart, and checkout steps
- Ingest API: A stateless HTTP endpoint (NGINX + a thin Go/Node service) that validates and writes to Kafka
- Kafka: A single topic clickstream, partitioned by session_id to preserve per-session ordering
This architecture handles hundreds of thousands of events per second. The Kafka topic is your durable event log. RisingWave reads from it continuously and maintains all the materialized views described above.
Frequently Asked Questions
Q: How does this handle anonymous users who later log in?
Track a session_id cookie for all visitors from first touch. When a user logs in, emit a user_identified event that maps session_id to user_id. A materialized view that joins session_summary on this mapping can retroactively associate the pre-login portion of the session with the identified user.
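The mapping described above can be sketched as a materialized view; the user_identified event type is the one proposed in the answer, not part of the base schema:

```sql
-- Hypothetical identity map: one row per (session_id, user_id) pair,
-- derived from user_identified events.
CREATE MATERIALIZED VIEW session_user_map AS
SELECT DISTINCT session_id, user_id
FROM clickstream_events
WHERE event_type = 'user_identified'
  AND user_id IS NOT NULL;
```

Joining session_summary to this map attributes the anonymous pre-login portion of a session to the identified user.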
Q: Does this replace Google Analytics?
It is complementary. Google Analytics excels at acquisition attribution and benchmarking against industry data. Streaming SQL on raw clickstream excels at live operational metrics, custom funnel definitions, and real-time interventions like abandonment recovery. Most teams run both.
Q: How do we handle bot traffic?
Add a bot filtering step in your ingest API before events reach Kafka — filter known bot user agents and apply rate limiting per IP. You can also create a bot_sessions materialized view in RisingWave to identify sessions with anomalously high event rates (e.g., more than 200 events per minute) and exclude them from aggregations.
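The bot_sessions view mentioned above might be sketched as follows, using the 200-events-per-minute threshold from the answer:

```sql
-- Hypothetical bot detector: sessions exceeding 200 events in any
-- one-minute window.
CREATE MATERIALIZED VIEW bot_sessions AS
SELECT session_id,
       window_start,
       COUNT(*) AS events_per_minute
FROM TUMBLE(clickstream_events, event_time, INTERVAL '1 MINUTE')
GROUP BY session_id, window_start
HAVING COUNT(*) > 200;
```

Downstream views can anti-join against bot_sessions to keep their aggregates clean.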
Q: What is the storage cost for raw clickstream events?
Kafka retains the raw event stream, so cost depends on your retention window and event volume. At 10,000 events per minute with a 72-hour retention window, you are storing roughly 43 million events, manageable with Kafka's time-based retention and tiered storage. RisingWave's materialized view state lives in S3, which is significantly cheaper than keeping a long Kafka retention window.
Q: How long does it take to set up this pipeline?
A working pipeline — Kafka source, basic funnel metrics materialized view, and a Postgres-compatible endpoint your BI tool can query — takes a few hours to stand up. The harder work is agreeing on event schema and ensuring your frontend fires events consistently across all pages and devices.