
Late Data

Late Data (or tardy events) refers to records or events in a stream processing system that arrive after the system's Watermark has already advanced past their associated Event Time. In systems that process data based on Event Time (when the event actually occurred), late data poses a challenge because the system may have already considered a time window closed, or a computation complete, for the period to which the late event belongs.

Handling late data correctly is crucial for maintaining data accuracy and completeness in event-time processing, especially for time-windowed aggregations.

Why Does Late Data Occur?

Late data can occur for various reasons:

  1. Network Latency: Delays in transmitting data from the source to the processing system.
  2. Distributed Sources: Events originating from different devices or servers with unsynchronized clocks or varying network conditions.
  3. Source System Delays: The source system itself might generate or emit data with a delay.
  4. Mobile or IoT Devices: Intermittently connected devices might buffer data and send it in bursts when connectivity is restored, causing older events to arrive late.
  5. Upstream Failures and Retries: Failures in upstream components or message queues can lead to delayed delivery of messages.
  6. Clock Skew: Differences in clock synchronization between the event-generating systems and the processing system.

The Challenge of Late Data

When a stream processing system uses Event Time and Watermarks:

  • A Watermark is a heuristic that indicates the system believes all (or most) events up to a certain event time have been received.
  • Once the watermark advances past the end of a time window W, the system may consider window W closed and trigger computations or emit results for that window.
  • If an event with an event time belonging to window W arrives after the watermark has passed W's end time (and W has been closed), this event is considered "late."
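For example, suppose the watermark is defined as the maximum observed event time minus 5 seconds and a window covers 12:00:00 to 12:05:00. An event timestamped 12:04:58 that arrives only after the watermark has advanced past 12:05:00 belongs to a window the system may already have finalized, so it is treated as late.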

If not handled, late data can lead to:

  • Inaccurate Results: Aggregations or analyses for the window would be incomplete because they didn't include the late event.
  • Inconsistent Views: Different queries or downstream systems might see different results depending on whether they processed the late data.

Strategies for Handling Late Data

Stream processing systems employ various strategies to deal with late data:

  1. Dropping Late Data:

    • The simplest approach: ignore any data that arrives after its corresponding window has been processed and closed by the watermark.
    • Pro: Easy to implement, predictable system behavior.
    • Con: Leads to data loss and potentially inaccurate results if late data is significant.
  2. Allowed Lateness (Grace Period):

    • The system keeps window state active for an additional "allowed lateness" period (or grace period) after the watermark passes the window's end.
    • Late events arriving within this grace period can still be incorporated into the window's computation, and the system can emit updated/refined results for that window.
    • Pro: Improves accuracy by including more late data.
    • Con: Increases state management overhead (windows stay active longer) and can lead to multiple outputs/updates for the same window. Downstream systems must be able to handle these updates.
  3. Side Output / Dead Letter Queue:

    • Late data (especially data arriving after the allowed lateness period) can be routed to a separate "side output" stream or a dead letter queue.
    • This data can then be processed separately, logged for analysis, or manually reconciled (see the sketch after this list).
    • Pro: Prevents data loss while keeping the main pipeline's results timely based on watermarks.
    • Con: Requires additional logic and infrastructure to handle the side output.
  4. Updating Previously Emitted Results:

    • Some systems or configurations allow for retracting previously emitted results for a window and issuing new, corrected results when very late data arrives. This is complex and requires downstream systems to handle retractions and updates correctly. Materialized views in systems like RisingWave inherently manage updates.
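
As an illustration of the side-output idea, here is a minimal sketch in RisingWave-style SQL. It assumes a hypothetical source (user_events_raw) that carries both the original event timestamp (event_time) and an ingestion timestamp (ingested_at); rows whose event time lags ingestion by more than a chosen bound are routed to a separate view for later analysis or reconciliation instead of being silently dropped.

    -- Hypothetical source columns: event_time (when the event occurred)
    -- and ingested_at (when the record reached the pipeline).
    CREATE MATERIALIZED VIEW late_arrivals AS
    SELECT *
    FROM user_events_raw
    WHERE ingested_at - event_time > INTERVAL '10 minutes';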

Late Data Handling in RisingWave

RisingWave, being an event-time-oriented stream processor, primarily uses Watermarks to manage the progress of event time and trigger window computations.

  • Watermark Definition: Users define watermark strategies on their sources (e.g., WATERMARK FOR event_time_column AS event_time_column - INTERVAL '5' SECOND). This tells RisingWave how to generate watermarks based on observed event times, effectively defining how much out-of-orderness to expect.
  • Window Closing: When the watermark passes the end of a window, RisingWave will typically finalize computations for that window based on the data received so far.
  • Allowed Lateness in Materialized Views: For windowed aggregations maintained as materialized views, RisingWave's incremental computation model naturally handles updates. If a late event arrives while the state for its window is still being maintained (for example, because the watermark has only just passed or a grace period applies), the materialized view can simply be updated to include it. The exact behavior depends on the windowing strategy and internal optimizations; data arriving very late, after a window's state has been finalized or cleaned up, may effectively be excluded from that window's result.
  • EMIT ON WINDOW CLOSE: For sinks or consumers that require a single, final output per window (unlike continuously updating materialized views), RisingWave's EMIT ON WINDOW CLOSE emit mode fires a window's result only once the watermark passes the window's end. The amount of lateness tolerated before that point is effectively governed by the out-of-orderness allowance built into the watermark definition; events arriving after the window's result has been emitted do not revise it. A minimal sketch of both emit behaviors follows this list.
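
The following sketch ties the pieces above together: a source whose watermark tolerates five seconds of out-of-orderness, a continuously updated windowed aggregation, and a final-result variant using EMIT ON WINDOW CLOSE. The connector properties, topic, and column names are placeholders, and the exact syntax may vary across RisingWave versions.

    -- Source with an event-time watermark: events up to 5 seconds
    -- out of order are still treated as on time.
    CREATE SOURCE user_events (
        user_id INT,
        action VARCHAR,
        event_time TIMESTAMP,
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        connector = 'kafka',
        topic = 'user_events',                      -- placeholder topic
        properties.bootstrap.server = 'broker:9092' -- placeholder broker
    ) FORMAT PLAIN ENCODE JSON;

    -- Continuously updated view: a late event that can still be applied
    -- to a maintained window simply updates the aggregate in place.
    CREATE MATERIALIZED VIEW actions_per_minute AS
    SELECT window_start, COUNT(*) AS action_count
    FROM TUMBLE(user_events, event_time, INTERVAL '1 minute')
    GROUP BY window_start;

    -- Final-result variant: each window is emitted once, after the
    -- watermark passes its end; later stragglers do not revise it.
    CREATE MATERIALIZED VIEW actions_per_minute_final AS
    SELECT window_start, COUNT(*) AS action_count
    FROM TUMBLE(user_events, event_time, INTERVAL '1 minute')
    GROUP BY window_start
    EMIT ON WINDOW CLOSE;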

The specifics of how late data affects results in RisingWave often depend on whether you are querying a materialized view (which is continuously updated) or consuming from a sink designed for final window outputs.

Considerations

  • Defining "Late": The definition of "late" is relative to the watermark strategy and any allowed lateness configuration.
  • Trade-offs: There's a trade-off between accuracy (accommodating more late data), latency (waiting longer for late data delays results), and resource consumption (keeping window state longer).
  • Downstream Impact: How late data is handled can impact downstream consumers, especially if they receive multiple updates for the same window.

Understanding the characteristics of your data streams and the acceptable trade-offs for your use case is crucial when configuring how late data should be managed.
