Real-Time Data Quality: Validation in Streaming Pipelines
Data quality validation in streaming pipelines catches issues — null values, out-of-range numbers, duplicate records, schema violations — as data flows through the system, not hours later in a batch quality check. Streaming SQL materialized views continuously monitor quality metrics and flag violations in real time.
Quality Checks as Streaming SQL
-- Null rate monitoring
CREATE MATERIALIZED VIEW null_rates AS
SELECT source_name,
COUNT(*) as total,
COUNT(*) FILTER (WHERE user_id IS NULL)::DECIMAL / COUNT(*) as null_rate_user,
COUNT(*) FILTER (WHERE amount IS NULL)::DECIMAL / COUNT(*) as null_rate_amount
FROM events WHERE ts > NOW()-INTERVAL '5 minutes' GROUP BY source_name;
-- Volume anomaly detection
CREATE MATERIALIZED VIEW volume_anomaly AS
SELECT source_name,
COUNT(*) FILTER (WHERE ts > NOW()-INTERVAL '5 minutes') as current_5min,
COUNT(*) FILTER (WHERE ts BETWEEN NOW()-INTERVAL '65 minutes' AND NOW()-INTERVAL '60 minutes') as baseline_5min
FROM events GROUP BY source_name
HAVING COUNT(*) FILTER (WHERE ts > NOW()-INTERVAL '5 minutes') <
COUNT(*) FILTER (WHERE ts BETWEEN NOW()-INTERVAL '65 minutes' AND NOW()-INTERVAL '60 minutes') * 0.5;
-- Duplicate detection
CREATE MATERIALIZED VIEW duplicates AS
SELECT event_id, COUNT(*) as occurrences
FROM events WHERE ts > NOW()-INTERVAL '1 hour'
GROUP BY event_id HAVING COUNT(*) > 1;
Quality Dimensions
| Dimension | What to Check | SQL Pattern |
| Completeness | Null rates per column | COUNT FILTER (WHERE col IS NULL) |
| Freshness | Time since last event | NOW() - MAX(ts) |
| Volume | Events per period vs baseline | COUNT in windows |
| Uniqueness | Duplicate detection | GROUP BY id HAVING COUNT > 1 |
| Validity | Range and format checks | WHERE amount < 0 OR email NOT LIKE '%@%' |
Frequently Asked Questions
Why validate data quality in streaming vs batch?
Batch validation catches issues hours after they occur — by then, bad data has propagated downstream. Streaming validation catches issues in real time, enabling immediate remediation before downstream impact.
What data quality tools work with streaming?
RisingWave (SQL-based quality views), Monte Carlo (managed observability), Great Expectations (can query RisingWave via PostgreSQL), and Soda Core (SQL-based checks).

