Watermarks in Stream Processing: How to Handle Late Data

Watermarks are timestamps that track the progress of event time in a stream, telling the system "all events with timestamps before this watermark have likely arrived." Watermarks enable stream processors to handle out-of-order and late-arriving data correctly — a problem that doesn't exist in batch processing. Flink, RisingWave, and Spark Structured Streaming all use watermarks to manage event-time processing.

Why Watermarks Are Needed

Events in real-world systems arrive out of order:

A mobile app sends an event at 10:00:01, but network delay means it arrives at 10:00:05
A batch of IoT sensor readings arrives 30 seconds after the events occurred
A user goes offline, and events are buffered and sent when they reconnect

Without watermarks, the system wouldn't know when to close a time window and emit results. Should it wait forever for potentially late events? Or emit immediately and risk missing late data?

Watermarks solve this: they define a threshold that says "I'm confident all events before this time have arrived. Process them now."

How Watermarks Work

Event stream:  e(10:01) e(10:03) e(10:02) e(10:04) e(10:01) ← late!
Watermark:     W(10:00)          W(10:02)          W(10:03)

When the watermark advances past a window's end time, the window closes and results are emitted. Events arriving after the watermark are "late."

Watermark Strategies

Strategy	How It Works	Trade-off
Bounded out-of-orderness	Watermark = max event time - fixed delay	Simple, handles most late data
Periodic	Emit watermark every N milliseconds	Low overhead
Punctuated	Emit watermark based on special events	Event-driven control
Idle source	Advance watermark when source is idle	Prevents stalls

Handling Late Events

When an event arrives after the watermark has passed:

Drop: Ignore late events (simplest, may lose data)
Allow lateness: Keep windows open for an additional period
Side output: Route late events to a separate stream for manual handling
Update results: Recompute and re-emit window results with the late data

Watermarks in RisingWave

CREATE SOURCE sensor_data (
  sensor_id INT,
  temperature DECIMAL,
  reading_time TIMESTAMP WITH TIME ZONE,
  -- Define watermark: accept events up to 10 seconds late
  WATERMARK FOR reading_time AS reading_time - INTERVAL '10 SECONDS'
) WITH (
  connector = 'kafka',
  topic = 'sensors',
  properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

Common Watermark Problems

Idle Sources

If one Kafka partition stops producing events, its watermark stops advancing, blocking the entire system. Solution: configure idle timeout so the system advances the watermark for idle sources.

Too Aggressive

Setting a small watermark delay (e.g., 1 second) drops many legitimate late events. Balance between freshness and completeness.

Too Conservative

Setting a large watermark delay (e.g., 1 hour) means results are delayed by an hour. You get completeness but lose real-time responsiveness.

Frequently Asked Questions

What is a watermark in stream processing?

A watermark is a timestamp that tells the stream processor "all events with timestamps earlier than this have likely arrived." It enables the system to close time windows and emit results without waiting indefinitely for late data. Watermarks balance between emitting results quickly (freshness) and waiting for late events (completeness).

How do I choose the right watermark delay?

Analyze your data's typical lateness. If 99% of events arrive within 5 seconds, a 5-second watermark delay is reasonable. For mobile/IoT data with higher latency, use longer delays (30 seconds to minutes). Start conservative and tighten based on observed late event rates.

Can I handle late data without watermarks?

Yes, but with trade-offs. You can process on processing time (when events arrive, not when they occurred), which ignores event ordering entirely. This works for use cases where event-time accuracy isn't critical, like log counting.