Late Data (or tardy events) refers to data records or events in a stream processing system that arrive after their associated Event Time has already passed a certain threshold, typically defined by a Watermark. In systems that process data based on Event Time (when the event actually occurred), late data poses a challenge because the system might have already considered a time window closed or a computation complete for the period to which the late event belongs.
Handling late data correctly is crucial for maintaining data accuracy and completeness in event-time processing, especially for time-windowed aggregations.
Late data can occur for various reasons:
When a stream processing system uses Event Time and Watermarks:
If not handled, late data can lead to:
Stream processing systems employ various strategies to deal with late data:
Dropping Late Data:
Allowed Lateness (Grace Period):
Side Output / Dead Letter Queue:
Updating Previously Emitted Results:
RisingWave, being an event-time-oriented stream processor, primarily uses Watermarks to manage the progress of event time and trigger window computations.
The specifics of how late data affects results in RisingWave often depend on whether you are querying a materialized view (which is continuously updated) or consuming from a sink designed for final window outputs.
Understanding the characteristics of your data streams and the acceptable trade-offs for your use case is crucial when configuring how late data should be managed.