Real-Time Data Quality: Validation in Streaming Pipelines

Real-Time Data Quality: Validation in Streaming Pipelines

Real-Time Data Quality: Validation in Streaming Pipelines

Data quality validation in streaming pipelines catches issues — null values, out-of-range numbers, duplicate records, schema violations — as data flows through the system, not hours later in a batch quality check. Streaming SQL materialized views continuously monitor quality metrics and flag violations in real time.

Quality Checks as Streaming SQL

-- Null rate monitoring
CREATE MATERIALIZED VIEW null_rates AS
SELECT source_name,
  COUNT(*) as total,
  COUNT(*) FILTER (WHERE user_id IS NULL)::DECIMAL / COUNT(*) as null_rate_user,
  COUNT(*) FILTER (WHERE amount IS NULL)::DECIMAL / COUNT(*) as null_rate_amount
FROM events WHERE ts > NOW()-INTERVAL '5 minutes' GROUP BY source_name;

-- Volume anomaly detection
CREATE MATERIALIZED VIEW volume_anomaly AS
SELECT source_name,
  COUNT(*) FILTER (WHERE ts > NOW()-INTERVAL '5 minutes') as current_5min,
  COUNT(*) FILTER (WHERE ts BETWEEN NOW()-INTERVAL '65 minutes' AND NOW()-INTERVAL '60 minutes') as baseline_5min
FROM events GROUP BY source_name
HAVING COUNT(*) FILTER (WHERE ts > NOW()-INTERVAL '5 minutes') <
  COUNT(*) FILTER (WHERE ts BETWEEN NOW()-INTERVAL '65 minutes' AND NOW()-INTERVAL '60 minutes') * 0.5;

-- Duplicate detection
CREATE MATERIALIZED VIEW duplicates AS
SELECT event_id, COUNT(*) as occurrences
FROM events WHERE ts > NOW()-INTERVAL '1 hour'
GROUP BY event_id HAVING COUNT(*) > 1;

Quality Dimensions

DimensionWhat to CheckSQL Pattern
CompletenessNull rates per columnCOUNT FILTER (WHERE col IS NULL)
FreshnessTime since last eventNOW() - MAX(ts)
VolumeEvents per period vs baselineCOUNT in windows
UniquenessDuplicate detectionGROUP BY id HAVING COUNT > 1
ValidityRange and format checksWHERE amount < 0 OR email NOT LIKE '%@%'

Frequently Asked Questions

Why validate data quality in streaming vs batch?

Batch validation catches issues hours after they occur — by then, bad data has propagated downstream. Streaming validation catches issues in real time, enabling immediate remediation before downstream impact.

What data quality tools work with streaming?

RisingWave (SQL-based quality views), Monte Carlo (managed observability), Great Expectations (can query RisingWave via PostgreSQL), and Soda Core (SQL-based checks).

Best-in-Class Event Streaming
for Agents, Apps, and Analytics
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.