Watermarks in Stream Processing: How to Handle Late Data
Watermarks are timestamps that track the progress of event time in a stream, telling the system "all events with timestamps before this watermark have likely arrived." Watermarks enable stream processors to handle out-of-order and late-arriving data correctly — a problem that doesn't exist in batch processing. Flink, RisingWave, and Spark Structured Streaming all use watermarks to manage event-time processing.
Why Watermarks Are Needed
Events in real-world systems arrive out of order:
- A mobile app sends an event at 10:00:01, but network delay means it arrives at 10:00:05
- A batch of IoT sensor readings arrives 30 seconds after the events occurred
- A user goes offline, and events are buffered and sent when they reconnect
Without watermarks, the system wouldn't know when to close a time window and emit results. Should it wait forever for potentially late events? Or emit immediately and risk missing late data?
Watermarks solve this: they define a threshold that says "I'm confident all events before this time have arrived. Process them now."
How Watermarks Work
Event stream: e(10:01) e(10:03) e(10:02) e(10:04) e(10:01) ← late!
Watermark: W(10:00) W(10:02) W(10:03)
When the watermark advances past a window's end time, the window closes and results are emitted. Events arriving after the watermark are "late."
Watermark Strategies
| Strategy | How It Works | Trade-off |
| Bounded out-of-orderness | Watermark = max event time - fixed delay | Simple, handles most late data |
| Periodic | Emit watermark every N milliseconds | Low overhead |
| Punctuated | Emit watermark based on special events | Event-driven control |
| Idle source | Advance watermark when source is idle | Prevents stalls |
Handling Late Events
When an event arrives after the watermark has passed:
- Drop: Ignore late events (simplest, may lose data)
- Allow lateness: Keep windows open for an additional period
- Side output: Route late events to a separate stream for manual handling
- Update results: Recompute and re-emit window results with the late data
Watermarks in RisingWave
CREATE SOURCE sensor_data (
sensor_id INT,
temperature DECIMAL,
reading_time TIMESTAMP WITH TIME ZONE,
-- Define watermark: accept events up to 10 seconds late
WATERMARK FOR reading_time AS reading_time - INTERVAL '10 SECONDS'
) WITH (
connector = 'kafka',
topic = 'sensors',
properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;
Common Watermark Problems
Idle Sources
If one Kafka partition stops producing events, its watermark stops advancing, blocking the entire system. Solution: configure idle timeout so the system advances the watermark for idle sources.
Too Aggressive
Setting a small watermark delay (e.g., 1 second) drops many legitimate late events. Balance between freshness and completeness.
Too Conservative
Setting a large watermark delay (e.g., 1 hour) means results are delayed by an hour. You get completeness but lose real-time responsiveness.
Frequently Asked Questions
What is a watermark in stream processing?
A watermark is a timestamp that tells the stream processor "all events with timestamps earlier than this have likely arrived." It enables the system to close time windows and emit results without waiting indefinitely for late data. Watermarks balance between emitting results quickly (freshness) and waiting for late events (completeness).
How do I choose the right watermark delay?
Analyze your data's typical lateness. If 99% of events arrive within 5 seconds, a 5-second watermark delay is reasonable. For mobile/IoT data with higher latency, use longer delays (30 seconds to minutes). Start conservative and tighten based on observed late event rates.
Can I handle late data without watermarks?
Yes, but with trade-offs. You can process on processing time (when events arrive, not when they occurred), which ignores event ordering entirely. This works for use cases where event-time accuracy isn't critical, like log counting.

