How RisingWave Handles Late-Arriving Data and Watermarks

TL;DR: RisingWave models late-arriving data with a watermark expression declared right in the CREATE TABLE or CREATE SOURCE DDL. The watermark advances as event time progresses, windows close when the watermark crosses their end boundary, and rows that arrive after that point are dropped by default. The model collapses what takes a multi-line Flink WatermarkStrategy into a single SQL clause.

How does RisingWave handle late-arriving events in streaming?

RisingWave handles late-arriving events using a watermark declared in DDL. You write WATERMARK FOR event_time AS event_time - INTERVAL 'N seconds' on a table or source, and that expression tells the engine how much out-of-order tolerance to allow. The engine emits a watermark for each row equal to the expression's value, takes the running maximum, and uses it as the closing threshold for any time-based window. Rows whose event time is below the current watermark are dropped from window aggregations by default, and the windows they would have updated are already finalized.

This post walks through the semantics, shows the exact SQL pattern, demonstrates what happens when late rows show up, and compares the model to Apache Flink's watermark API.

What a watermark actually means

A watermark is a promise to the streaming engine that you do not expect to see any more events with timestamps earlier than a given value. Once the engine accepts that promise, it can close time windows, materialize their results, and free the state they consumed. Without watermarks, a streaming engine would need to keep every open window forever, because it could never be sure a late event would not arrive.

The promise is probabilistic, not absolute. You configure how much out-of-order tolerance you want, and the engine balances completeness against latency. A 5-second tolerance means windows close 5 seconds after event time crosses the boundary; if you set it to 1 hour, you wait an hour but catch more late stragglers. The right value depends on the source and the cost of dropping late data.

Declaring a watermark in RisingWave

The watermark clause in RisingWave lives inside the CREATE TABLE or CREATE SOURCE statement and takes a simple expression. The following table was created and tested against RisingWave 2.8.0:

CREATE TABLE orders_wm (
  order_id BIGINT,
  amount DECIMAL,
  event_time TIMESTAMPTZ,
  WATERMARK FOR event_time AS event_time - INTERVAL '5 seconds'
) APPEND ONLY;

The expression event_time - INTERVAL '5 seconds' says: for every row I see, compute a watermark that trails the row's timestamp by 5 seconds. The engine maintains the running maximum of those values. If a row arrives with event_time = 10:00:30, the candidate watermark is 10:00:25. If the running maximum was previously 10:00:20, it advances. If the next row arrives with event_time = 10:00:10, the candidate is 10:00:05, which is less than the current max, so the watermark does not move.

The column must be a TIMESTAMP or TIMESTAMPTZ and the expression must be monotonically related to that column. In practice that means column - INTERVAL '...' or column - some_constant.

Watermarks meet tumbling windows

When a watermarked table feeds a tumbling window, the windows finalize as the watermark crosses their window_end. A row with event_time below the watermark cannot update any window. Here is a concrete example using the watermarked table from the previous section:

INSERT INTO orders_wm VALUES
  (1, 100.0, '2026-05-22 10:00:00+00'),
  (2, 200.0, '2026-05-22 10:00:30+00'),
  (3, 150.0, '2026-05-22 10:01:00+00'),
  (4, 300.0, '2026-05-22 10:01:30+00'),
  (5, 250.0, '2026-05-22 10:02:00+00');

CREATE MATERIALIZED VIEW orders_per_minute AS
SELECT window_start, window_end, COUNT(*) AS num_orders, SUM(amount) AS total_amount
FROM TUMBLE(orders_wm, event_time, INTERVAL '1 minute')
GROUP BY window_start, window_end;

SELECT * FROM orders_per_minute ORDER BY window_start;

The materialized view returns:

       window_start        |        window_end         | num_orders | total_amount
---------------------------+---------------------------+------------+--------------
 2026-05-22 10:00:00+00:00 | 2026-05-22 10:01:00+00:00 |          2 |        300.0
 2026-05-22 10:01:00+00:00 | 2026-05-22 10:02:00+00:00 |          2 |        450.0
 2026-05-22 10:02:00+00:00 | 2026-05-22 10:03:00+00:00 |          1 |        250.0

Three tumbling windows of one minute each, each fed by the events whose event_time falls inside the window's range. The watermark trails the latest event by 5 seconds, so once we have observed event_time = 10:02:00, the watermark sits at 10:01:55, which means the 10:00:00 to 10:01:00 window is already considered closed.

What happens to late data

Late data, in RisingWave's default model, is dropped from window aggregations. "Late" means the event's timestamp falls below the current watermark when the engine processes the event. If your watermark is at 10:01:55 and a row arrives with event_time = 10:00:45, the engine will not update the 10:00:00 to 10:01:00 window, because it has already been finalized.

This default is the right behavior for most analytical workloads. The whole point of a watermark is that closed windows stay closed and downstream consumers can trust the numbers they see. If you want to handle late data without dropping it, you have two main options:

Set a more lenient watermark. If you observe that real events frequently arrive 30 seconds late, declare event_time - INTERVAL '30 seconds'. The cost is that windows take 30 seconds longer to finalize.
Route late events to a side stream. You can declare a second materialized view over the same source that does not use windowed aggregation, and join the two later if you need to amend completed windows.

The key insight is that "late" is a property of your watermark configuration, not of the data. Make the watermark generous enough to cover the realistic out-of-order distribution, and most events will be on time by definition.

How this compares to Apache Flink

Apache Flink's watermark API is more flexible than RisingWave's DDL clause, and correspondingly more verbose. In Flink you declare watermarks programmatically using a WatermarkStrategy, typically with forBoundedOutOfOrderness:

WatermarkStrategy
  .<Order>forBoundedOutOfOrderness(Duration.ofSeconds(5))
  .withTimestampAssigner((order, ts) -> order.getEventTime());

You then attach the strategy with assignTimestampsAndWatermarks on a DataStream. The model gives you fine-grained control: custom watermark generators, per-partition watermarks, periodic versus punctuated emission, and idle source handling. It also gives you the entire surface area of building, packaging, and deploying a Java application to run a one-line streaming query.

RisingWave trades flexibility for ergonomics. The WATERMARK FOR ... AS ... clause covers the bounded-out-of-orderness case directly, which is what teams reach for the vast majority of the time. The expression is part of the DDL, which means any tooling that understands your schema understands your watermark. There is no separate codebase to deploy.

For teams whose pipelines are dominated by "events arrive within N seconds of their timestamp," the SQL form is shorter, easier to review, and easier to evolve.

When to widen the watermark

There is a recurring question that comes up the first time a team hooks watermarks to a production source: "what value should I pick?" There is no universal answer, but there is a procedure that works.

Start by measuring. Collect a sample of your real event stream and compute the distribution of processing_time - event_time for each event. Plot that distribution. The 99th percentile of that delta is a reasonable first watermark tolerance, because it covers nearly all of the natural jitter without making windows wait excessively for the long tail.

Then sanity-check against business cost. If your dashboards refresh every minute, a 5-second watermark is invisible. If your alerting fires per row, a 30-second watermark adds 30 seconds to the time-to-detect. Match the watermark to the slowest acceptable feedback loop, not the fastest possible one.

Finally, monitor what you dropped. Run a side query that counts rows whose event_time was behind the current watermark when they arrived. If that number trends upward, the upstream is getting more out-of-order over time, and you need to widen the watermark or fix the source.

Watermark debugging tips

A few things tend to go wrong when teams adopt watermarks for the first time:

Watermark stuck behind a stale partition. If one input partition stops producing events, its contribution to the watermark stalls, and the global watermark cannot advance. RisingWave handles this with per-source progress tracking; verify that all upstream partitions are active.
Wall-clock skew. Watermarks are based on event time, not processing time. If your producers' clocks drift, your watermark drifts with them. Standardize on UTC and NTP.
Late data silently dropped. Add a COUNT(*) materialized view over the raw source and compare it to the windowed aggregate over time. A growing gap means rows are being dropped.
Over-tolerant watermarks. A watermark of INTERVAL '1 hour' makes windows close an hour late. Pick the smallest value that covers your real out-of-order distribution, not the largest value that feels safe.

Key takeaways

A watermark is a promise that no more events with earlier timestamps will arrive, and it is what lets the engine close windows.
RisingWave declares watermarks in DDL with WATERMARK FOR event_time AS event_time - INTERVAL 'N seconds'.
Late data is dropped from window aggregations by default; tune your watermark expression to control the tradeoff.
Apache Flink offers more flexibility with WatermarkStrategy, but RisingWave's DDL form covers the common bounded-out-of-orderness case with one line.
Monitor for stuck watermarks, clock skew, and silent drops as your pipeline evolves.

To try the SQL above, download RisingWave and run it locally, or spin up a free cluster on RisingWave Cloud.