If you have ever lost data during a system failure or found duplicate records polluting your analytics, you already know why delivery guarantees matter. In stream processing, the question of "how many times does each record get processed?" is not academic. It determines whether your real-time dashboards show accurate numbers, whether your fraud detection system fires once per suspicious transaction or floods your team with duplicates, and whether your billing pipeline charges customers correctly.
Exactly-once processing is the gold standard, but it is also the most misunderstood guarantee in distributed systems. Some engineers argue it is impossible. Others claim every modern system supports it. The truth is more nuanced: exactly-once semantics are achievable, but only through careful coordination of checkpointing, state management, and recovery mechanisms. In this article, you will learn what exactly-once really means, how it differs from weaker guarantees, and how systems like RisingWave and Apache Flink implement it using barrier-based checkpointing.
The Three Delivery Guarantees Explained
Before diving into exactly-once, you need a clear understanding of all three delivery guarantees and why each exists.
At-Most-Once: Fire and Forget
At-most-once is the simplest guarantee. The system sends each record downstream without tracking whether it was successfully processed. If a failure occurs, the record is lost. No retries, no recovery.
This approach has near-zero overhead. There are no acknowledgments to wait for, no state to persist, no coordination between operators. For use cases where occasional data loss is acceptable (think telemetry sampling or non-critical logging), at-most-once can be the right choice.
The failure mode is straightforward: records disappear. If a processing node crashes mid-computation, whatever it was working on is gone.
```
Source --[record A]--> Processor --[crash]--> (A is lost forever)
```
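The failure mode can be sketched in a few lines of Python (illustrative only; `at_most_once_send` and `fail_at` are made-up names for the simulation):

```python
def at_most_once_send(records, fail_at=None):
    """Deliver each record once, with no acks and no retries.
    A crash at position `fail_at` simply drops that record."""
    delivered = []
    for i, record in enumerate(records):
        if i == fail_at:      # node crashes mid-computation
            continue          # the record is lost forever
        delivered.append(record)
    return delivered

assert at_most_once_send(["A", "B", "C"]) == ["A", "B", "C"]
assert at_most_once_send(["A", "B", "C"], fail_at=1) == ["A", "C"]  # B is gone
```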
At-Least-Once: Retry Until Confirmed
At-least-once adds retry logic. If the system does not receive confirmation that a record was processed, it resends the record. This guarantees that no data is lost, but introduces a new problem: duplicates.
Consider a processor that receives record A, writes the result to a sink, and then crashes before sending an acknowledgment back to the source. The source, not having received confirmation, resends record A. The processor handles it again, writing a duplicate result to the sink.
```
Source --[record A]--> Processor --[writes result]--> Sink
Processor --[crash before ack]-->
Source --[record A]--> Processor --[writes result]--> Sink (duplicate!)
```
At-least-once is common in systems that prioritize data completeness over precision. Many early streaming architectures relied on downstream deduplication (using unique IDs or upsert semantics) to compensate for the duplicates introduced by at-least-once delivery.
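This duplicate-then-dedupe pattern can be sketched in Python (the `deliver_at_least_once` simulation and its parameters are hypothetical; real systems track acks per partition):

```python
def deliver_at_least_once(records, ack_fails_at=None):
    """Redeliver any record whose ack was lost; duplicates reach the sink."""
    sink = []
    for i, (rec_id, value) in enumerate(records):
        sink.append((rec_id, value))          # first delivery
        if i == ack_fails_at:                 # ack lost -> source resends
            sink.append((rec_id, value))      # duplicate write
    return sink

def dedupe_by_id(sink):
    """Downstream compensation: upsert on the unique record ID."""
    return dict(sink)  # later writes with the same ID overwrite earlier ones

raw = deliver_at_least_once([(1, "A"), (2, "B")], ack_fails_at=0)
assert raw == [(1, "A"), (1, "A"), (2, "B")]   # record 1 delivered twice
assert dedupe_by_id(raw) == {1: "A", 2: "B"}   # upsert restores correctness
```

The dedup step works only because each record carries a stable unique ID, which is exactly the burden exactly-once processing removes from downstream consumers.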
Exactly-Once: Process Once, Effect Once
Exactly-once means each record is processed one time, and its effects appear exactly once in the output. This does not mean the system physically processes each record only once. In practice, records may be re-processed during recovery. The key insight is that the system's observable state reflects each record exactly once.
How? Through a combination of:
- Replayable sources that can rewind to a specific offset (like Kafka)
- Consistent state snapshots that capture the processing pipeline's state at a precise point
- Atomic commits that ensure state changes and output writes are visible together or not at all
When a failure occurs, the system restores its last consistent snapshot, rewinds the source to the corresponding offset, and replays records. Because the state is rolled back to before those records were processed, the replay produces the same results without duplication.
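The rollback-and-replay idea can be reduced to a toy Python simulation (all names are illustrative; a real engine checkpoints distributed operator state, not a single counter):

```python
def run_pipeline(source, checkpoint_every=2, crash_after=None):
    """Sum a stream with periodic checkpoints; on crash, restore and replay."""
    state = 0
    checkpoint = (0, 0)                       # (state, source offset) together
    processed = 0
    try:
        for i in range(len(source)):
            if crash_after is not None and processed == crash_after:
                raise RuntimeError("node failure")
            state += source[i]
            processed += 1
            if (i + 1) % checkpoint_every == 0:
                checkpoint = (state, i + 1)   # atomically persist state + offset
    except RuntimeError:
        state, offset = checkpoint            # roll back to the last snapshot
        for i in range(offset, len(source)):  # rewind the source and replay
            state += source[i]
    return state

stream = [10, 20, 30, 40, 50]
# With or without a mid-stream crash, the observable result is identical
assert run_pipeline(stream) == run_pipeline(stream, crash_after=3) == 150
```

The essential detail is that the checkpoint captures state and source offset as one atomic pair, so a replay always resumes from a consistent point.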
How Checkpointing Powers Exactly-Once
Checkpointing is the mechanism that makes exactly-once possible. A checkpoint is a consistent snapshot of the entire processing pipeline's state at a specific point in the data stream. If a failure occurs, the system rolls back to the last completed checkpoint and resumes processing from there.
The challenge is taking this snapshot without stopping the pipeline. In a distributed system with multiple operators running in parallel across different nodes, you cannot simply pause everything, save state, and resume. The pipeline would stall, latency would spike, and throughput would drop to zero during every checkpoint.
The Naive Approach: Stop-the-World
The simplest checkpointing strategy is stop-the-world: pause all operators, save their state, then resume. This guarantees consistency because nothing is in-flight during the snapshot. But it is impractical for production systems that need low-latency, continuous processing.
Barrier-Based Checkpointing: Snapshots Without Stopping
Barrier-based checkpointing solves this by injecting special markers, called barriers, into the data stream. These barriers flow through the pipeline alongside regular data records, dividing the stream into epochs. Each barrier marks the boundary between two epochs.
Here is how it works:
1. A coordinator injects barriers into the source operators. Each barrier carries a checkpoint ID (epoch number).
2. Barriers flow downstream with the data. They follow the same paths as regular records and never overtake them. This preserves ordering.
3. When an operator receives a barrier on all its input channels, it snapshots its state. This state corresponds to all records processed before the barrier and none after it.
4. The operator forwards the barrier downstream. The next operator in the pipeline repeats the process.
5. When all terminal operators (sinks) have received the barrier, the checkpoint is complete. The coordinator marks it as finalized.
```mermaid
graph LR
subgraph "Epoch N"
A1[Record 1] --> A2[Record 2] --> A3[Record 3]
end
A3 --> B["|Barrier N|"]
subgraph "Epoch N+1"
B --> C1[Record 4] --> C2[Record 5]
end
style B fill:#ff6b6b,stroke:#333,stroke-width:2px,color:#fff
```
The critical property is barrier alignment. When an operator has multiple input channels (for example, a join operator receiving from two upstream sources), it must wait for the barrier to arrive on all channels before taking its snapshot. Records arriving on channels that have already delivered the barrier are buffered until all channels have caught up.
This alignment ensures the snapshot represents a consistent cut across the entire pipeline: every record before the barrier is reflected in the state, and no record after the barrier is included.
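Barrier alignment can be illustrated with a small Python simulation of a two-input operator (the round-robin interleaving and the `BARRIER` sentinel are assumptions of the sketch, not RisingWave internals):

```python
BARRIER = object()  # sentinel marker injected into each channel

def align_and_snapshot(channel_a, channel_b):
    """Interleave two input channels; records arriving after a channel's
    barrier are buffered until the other channel's barrier arrives."""
    state, buffered = [], []
    barrier_seen = {"a": False, "b": False}
    streams = {"a": list(channel_a), "b": list(channel_b)}
    while not all(barrier_seen.values()):
        for name in ("a", "b"):          # alternate to simulate interleaving
            if not streams[name]:
                continue
            msg = streams[name].pop(0)
            if msg is BARRIER:
                barrier_seen[name] = True
            elif barrier_seen[name]:
                buffered.append(msg)     # channel already aligned: buffer
            else:
                state.append(msg)        # pre-barrier record: apply to state
    snapshot = tuple(state)              # consistent cut across both inputs
    state.extend(buffered)               # resume: apply buffered records
    return snapshot, state

# Channel a's barrier arrives first, so a4 is buffered and excluded
snap, final = align_and_snapshot(
    ["a1", "a2", BARRIER, "a4"], ["b1", "b2", "b3", BARRIER])
assert snap == ("a1", "b1", "a2", "b2", "b3")            # a4 not in snapshot
assert final == ["a1", "b1", "a2", "b2", "b3", "a4"]     # a4 applied after
```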
Recovery from a Checkpoint
When a failure occurs, the recovery process is:
1. Identify the last completed checkpoint. This is the most recent checkpoint where all operators successfully persisted their state.
2. Restore operator state. Each operator loads its state from the checkpoint.
3. Rewind the source. The source operator resets to the offset recorded in the checkpoint.
4. Resume processing. The pipeline continues from the checkpoint, reprocessing records that arrived after it.
Because the state is rolled back to before those records were processed, and the source replays them in the same order, the system produces identical results. The net effect is exactly-once processing.
How RisingWave Achieves Exactly-Once
RisingWave implements a Chandy-Lamport style consistent snapshot algorithm adapted for its streaming SQL engine. The approach shares conceptual foundations with Apache Flink's Asynchronous Barrier Snapshots (ABS) but differs in several important ways due to RisingWave's architecture as a streaming database with persistent state.
Barriers and Epochs in RisingWave
In RisingWave, the Meta Node acts as the checkpoint coordinator. It periodically injects barriers into the data stream, with a default interval of one second. Each barrier defines an epoch boundary.
Here is the checkpoint flow:
1. The Meta Node sends barrier markers to all source operators on the Streaming Nodes.
2. Each operator processes all records belonging to the current epoch, updates its internal state, and then propagates the barrier downstream.
3. Operators asynchronously flush their state to shared cloud storage (S3, GCS, or Azure Blob via OpenDAL). This is a key distinction: the checkpoint does not block foreground processing.
4. Terminal operators (sinks and materialized view writers) return checkpoint acknowledgments to the Meta Node.
5. Once the Meta Node collects acknowledgments from all terminal operators, the checkpoint is finalized as a consistent global snapshot.
```mermaid
sequenceDiagram
participant MN as Meta Node
participant S1 as Source 1
participant Op as Join Operator
participant MV as Materialized View
participant Store as Object Storage (S3)
MN->>S1: Inject Barrier (Epoch N)
S1->>Op: Forward data + Barrier N
Op->>Op: Snapshot state
Op-->>Store: Async flush state
Op->>MV: Forward Barrier N
MV->>MV: Snapshot state
MV-->>Store: Async flush state
MV->>MN: Checkpoint N complete
MN->>MN: Finalize checkpoint N
```
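The acknowledgment-driven finalization in the last step can be sketched in Python (`finalize_checkpoint` and the operator names are hypothetical, for illustration only):

```python
def finalize_checkpoint(epoch, terminal_operators, acks_received):
    """A checkpoint is finalized only once every terminal operator has acked."""
    missing = set(terminal_operators) - set(acks_received)
    return {"epoch": epoch,
            "finalized": not missing,
            "waiting_on": sorted(missing)}

sinks = ["kafka_sink", "mv_writer_1", "mv_writer_2"]
partial = finalize_checkpoint(7, sinks, ["mv_writer_1", "kafka_sink"])
complete = finalize_checkpoint(7, sinks, sinks)
assert partial["finalized"] is False
assert partial["waiting_on"] == ["mv_writer_2"]   # still awaiting this ack
assert complete["finalized"] is True
```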
Asynchronous State Persistence
One of RisingWave's key optimizations is fully asynchronous checkpointing. When an operator takes a snapshot, it does not wait for the state to be written to object storage before continuing to process the next epoch's records. Instead, the state is gathered in a shared buffer and written as SST (Sorted String Table) files in the background.
This design makes the checkpoint process almost invisible to foreground tasks. Unlike systems where checkpointing causes periodic latency spikes, RisingWave maintains stable processing latency even during state persistence.
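A toy Python sketch of the hand-off, assuming a background thread and an in-memory dict standing in for object storage (the shared-buffer and SST details are abstracted away):

```python
import queue
import threading

durable_store = {}          # stands in for object storage (S3/GCS)
flush_queue = queue.Queue() # stands in for the shared buffer

def background_flusher():
    """Drain snapshots from the shared buffer and persist them."""
    while True:
        item = flush_queue.get()
        if item is None:                 # shutdown signal
            break
        epoch, snapshot = item
        durable_store[epoch] = snapshot  # background upload (e.g. SST files)

flusher = threading.Thread(target=background_flusher)
flusher.start()

state = 0
for epoch, batch in enumerate([[1, 2], [3, 4], [5]]):
    state += sum(batch)
    flush_queue.put((epoch, state))      # hand off the snapshot; do not wait
    # foreground processing continues immediately with the next epoch

flush_queue.put(None)
flusher.join()
assert durable_store == {0: 3, 1: 10, 2: 15}
```

The foreground loop never blocks on storage I/O; it only enqueues the snapshot and moves on, which is the property that keeps checkpoint latency off the hot path.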
Epoch-Based Batching for Efficiency
RisingWave introduces the concept of an epoch to enable batching within consistency boundaries. Multiple changes on a source table within one epoch are grouped together, allowing operators to batch their processing within a single logical transaction. This is equivalent to enlarging the write transaction size, which reduces the per-record overhead of maintaining consistency.
The epoch mechanism also guarantees snapshot isolation for queries. Changes in upstream tables and their corresponding changes in downstream materialized views are atomically committed in the same write transaction. This means any query always sees a consistent, point-in-time view of the data, even across multiple materialized views.
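A minimal sketch of epoch-based atomic commits, using a list and a running sum as stand-ins for a source table and its downstream materialized view:

```python
def commit_epoch(table, mv_sum, epoch_changes):
    """Apply all of an epoch's changes to the table and its downstream
    aggregate in one atomic commit, so readers see a consistent cut."""
    new_table = table + epoch_changes
    new_mv = mv_sum + sum(epoch_changes)
    return new_table, new_mv       # both new versions become visible together

table, mv = [], 0
for epoch_changes in [[10, 20], [30]]:   # many changes, one commit per epoch
    table, mv = commit_epoch(table, mv, epoch_changes)
assert table == [10, 20, 30] and mv == 60  # table and MV are never out of sync
```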
LRU-Based State Caching
RisingWave implements LRU (Least Recently Used) caching for operator state. When state is small enough to fit in memory, all state lives in memory for maximum performance. As state grows beyond available memory, less frequently accessed entries are evicted to object storage and fetched on demand. This adaptive behavior allows the system to handle both small-state and large-state workloads efficiently, without requiring manual tuning.
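The caching behavior can be sketched with Python's `OrderedDict` (the `StateCache` class below is illustrative, not RisingWave's actual implementation):

```python
from collections import OrderedDict

class StateCache:
    """Hot state entries stay in memory; cold entries spill to a slower store."""
    def __init__(self, capacity, remote):
        self.capacity, self.remote = capacity, remote
        self.mem = OrderedDict()         # ordering tracks recency of use

    def put(self, key, value):
        self.mem[key] = value
        self.mem.move_to_end(key)        # mark as most recently used
        while len(self.mem) > self.capacity:
            cold_key, cold_val = self.mem.popitem(last=False)  # evict LRU
            self.remote[cold_key] = cold_val                   # spill out

    def get(self, key):
        if key not in self.mem:          # miss: fetch from remote on demand
            self.put(key, self.remote.pop(key))
        self.mem.move_to_end(key)
        return self.mem[key]

remote = {}                              # stands in for object storage
cache = StateCache(capacity=2, remote=remote)
cache.put("a", 1); cache.put("b", 2); cache.put("c", 3)
assert "a" in remote                     # least recently used entry spilled
assert cache.get("a") == 1               # fetched back on demand, re-cached
```

When state fits within `capacity`, everything stays in memory; only under pressure do entries spill, which matches the adaptive small-state/large-state behavior described above.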
RisingWave vs. Flink: Checkpointing Compared
Both RisingWave and Apache Flink implement barrier-based checkpointing derived from the Chandy-Lamport algorithm, but their implementations diverge in meaningful ways.
Architectural Differences
| Aspect | RisingWave | Apache Flink |
|---|---|---|
| State storage | Shared cloud object storage (S3/GCS) | Local RocksDB per task, with remote backup |
| Barrier interval | Default 1 second | Configurable (checkpointing is disabled by default) |
| State persistence | Fully async to object storage | Async local snapshots, sync upload to checkpoint store |
| Recovery model | Global rollback from object storage | Global rollback from distributed checkpoint store |
| Query capability | Direct SQL queries on checkpointed state | No built-in query layer for state |
| Channel state | Not stored (uses rewindable sources) | Stored during unaligned checkpoints |
The Chandy-Lamport Foundation
The original Chandy-Lamport algorithm, published in 1985 by K. Mani Chandy and Leslie Lamport, captures both process states (local variables, counters) and channel states (messages in transit between processes). It uses marker messages that propagate through a strongly connected graph of processes.
Flink's Asynchronous Barrier Snapshots (ABS) adapts this algorithm for stream processing with several modifications:
- Directed acyclic graph (DAG) topology instead of strongly connected graphs. Streaming jobs have a clear direction from sources to sinks.
- Centralized coordinator that initiates checkpoints, unlike Chandy-Lamport where any process can initiate a snapshot.
- Aligned barriers that require buffering records on faster channels until all channels deliver the barrier.
RisingWave follows a similar adaptation but takes it further by:
- Eliminating channel state storage. Because sources are rewindable (Kafka offsets can be reset), there is no need to capture in-flight messages. On recovery, the source simply rewinds and replays.
- Using shared cloud storage as the single source of truth. Every operator's state is persisted to the same object store, simplifying recovery coordination.
- Treating checkpointed state as queryable tables. This is unique to RisingWave's streaming database model, where internal state is treated as a logical table that users can query directly.
Barrier Alignment: Exactly-Once vs. At-Least-Once
Barrier alignment is what separates exactly-once from at-least-once in barrier-based checkpointing. When alignment is enabled, an operator with multiple inputs waits for the barrier on all channels before snapshotting. Records from channels that have delivered the barrier are buffered.
Flink offers two ways to reduce alignment overhead. The first is an at-least-once checkpointing mode that skips alignment entirely, avoiding the buffering but downgrading the guarantee. The second is unaligned checkpoints, a separate feature that lets barriers overtake buffered records while still preserving exactly-once semantics, at the cost of persisting in-flight channel data as part of the checkpoint. RisingWave uses aligned barriers by default to maintain exactly-once semantics.
```mermaid
graph TD
subgraph "Aligned Barrier - Exactly-Once"
I1[Input Channel 1<br/>Records + Barrier] --> |Barrier arrives first| J[Join Operator<br/>Buffers Ch1 records]
I2[Input Channel 2<br/>Records + Barrier] --> |Barrier arrives second| J
J --> |Snapshot after both barriers| S[State Snapshot]
end
```
End-to-End Exactly-Once
Achieving exactly-once within the processing engine is necessary but not sufficient. For end-to-end exactly-once, the guarantee must extend from source to sink.
RisingWave handles this through its built-in storage layer. Because materialized views are the primary output mechanism, and their updates are committed atomically as part of the checkpoint, the "sink" is inherently transactional. When RisingWave writes to external sinks (PostgreSQL, Kafka, Iceberg), it coordinates these writes with the checkpoint cycle to maintain end-to-end consistency.
Flink achieves end-to-end exactly-once through a two-phase commit protocol. When a sink operator receives a checkpoint barrier, it pre-commits its buffered data (opens a transaction in the external system but does not finalize it). Only when the checkpoint completes does the coordinator signal all sinks to commit their transactions. This requires the external sink to support transactions (for example, Kafka transactions or database two-phase commit).
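The two-phase commit pattern can be sketched as a simplified, single-threaded Python model (`TwoPhaseCommitSink` is a made-up class for illustration, not Flink's actual sink API):

```python
class TwoPhaseCommitSink:
    """Buffer writes per checkpoint; pre-commit on barrier, then commit only
    when the coordinator signals that the checkpoint completed."""
    def __init__(self):
        self.buffer = []      # writes accumulated since the last barrier
        self.pending = {}     # checkpoint_id -> pre-committed (open) txn
        self.committed = []   # visible in the external system

    def write(self, record):
        self.buffer.append(record)

    def on_barrier(self, checkpoint_id):
        # Phase 1: open a transaction in the external system, don't finalize
        self.pending[checkpoint_id] = self.buffer
        self.buffer = []

    def on_checkpoint_complete(self, checkpoint_id):
        # Phase 2: all operators snapshotted, coordinator says commit
        self.committed.extend(self.pending.pop(checkpoint_id))

sink = TwoPhaseCommitSink()
sink.write("order-1"); sink.write("order-2")
sink.on_barrier(checkpoint_id=1)
assert sink.committed == []                        # pre-committed, not visible
sink.on_checkpoint_complete(1)
assert sink.committed == ["order-1", "order-2"]    # atomically published
```

If a crash happens between the two phases, the open transaction is aborted or recovered on restart, which is why the external system itself must support transactions.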
Practical Example: Exactly-Once Aggregation in RisingWave
To make this concrete, consider a real-time order analytics pipeline. You want to track revenue per product category, updated continuously as new orders arrive. With exactly-once processing, every order is counted once in the aggregate, even if the system restarts mid-computation.
```sql
-- Create a source table for incoming orders
CREATE TABLE orders (
    order_id INT,
    product_category VARCHAR,
    amount DECIMAL,
    order_time TIMESTAMP
);

-- Create a materialized view for real-time revenue by category
CREATE MATERIALIZED VIEW revenue_by_category AS
SELECT
    product_category,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM orders
GROUP BY product_category;
```
When you insert orders:
```sql
INSERT INTO orders VALUES
    (1, 'Electronics', 299.99, '2026-04-01 10:00:00'),
    (2, 'Electronics', 149.50, '2026-04-01 10:01:00'),
    (3, 'Clothing', 59.99, '2026-04-01 10:02:00'),
    (4, 'Clothing', 89.99, '2026-04-01 10:03:00'),
    (5, 'Electronics', 499.00, '2026-04-01 10:04:00');
```
Query the materialized view:
```sql
SELECT * FROM revenue_by_category ORDER BY total_revenue DESC;
```
Expected output:
```
 product_category | order_count | total_revenue
------------------+-------------+---------------
 Electronics      |           3 |        948.49
 Clothing         |           2 |        149.98
```
If a checkpoint occurs after processing orders 1-3 and the system crashes while processing orders 4-5, recovery works as follows:
- RisingWave restores the materialized view state from the last checkpoint (reflecting orders 1-3).
- The source rewinds to the offset after order 3.
- Orders 4-5 are replayed and processed again.
- The final result is identical, with no duplicates and no missing data.
This is exactly-once in action: the observable output reflects each order precisely once, regardless of failures during processing.
What Is the Difference Between At-Least-Once and Exactly-Once Processing?
At-least-once processing guarantees that every record is processed, but it may process some records more than once. This leads to duplicate results in sinks and aggregations. Exactly-once processing guarantees that the observable effect of each record appears once and only once. The key mechanism is combining replayable sources with consistent state snapshots: on failure, the system rolls back to a checkpoint and replays records, but because state is also rolled back, the replay produces the same result without duplication. At-least-once is simpler to implement and has lower overhead, while exactly-once requires coordinated checkpointing and transactional output but provides correct results without downstream deduplication.
How Does Barrier-Based Checkpointing Work in Stream Processing?
Barrier-based checkpointing works by injecting special marker messages (barriers) into the data stream that flow alongside regular records. A central coordinator periodically emits barriers into source operators. As barriers propagate through the processing graph, each operator snapshots its state when it has received the barrier on all input channels. This creates a globally consistent snapshot without pausing the pipeline. The barriers divide the stream into epochs, and each completed set of snapshots forms a checkpoint that the system can restore to in case of failure. This mechanism is derived from the Chandy-Lamport distributed snapshot algorithm published in 1985.
Can Exactly-Once Processing Handle Failures Across Distributed Nodes?
Yes. Exactly-once processing is specifically designed for distributed environments. When a node fails, the checkpoint coordinator detects the failure and initiates a global rollback to the last completed checkpoint. All operators across all nodes restore their state from the checkpoint, sources rewind to the recorded offsets, and processing resumes. Because the checkpoint represents a consistent cut across the entire distributed pipeline, the recovered state is globally consistent. In RisingWave, the Meta Node orchestrates this recovery, rebuilding the streaming pipeline on available nodes and restoring state from shared object storage.
Does Exactly-Once Processing Add Latency?
The latency impact depends on the implementation. Barrier alignment, where operators buffer records while waiting for barriers on all input channels, can add latency during checkpoints, especially when data is skewed across channels. However, modern implementations minimize this impact. RisingWave's asynchronous checkpointing flushes state to object storage in the background without blocking data processing, keeping the performance overhead minimal. The 1-second default barrier interval balances checkpoint frequency against overhead, and the async persistence ensures foreground latency remains stable.
Conclusion
Exactly-once processing is not a myth, but it is not magic either. It is an engineering solution built on three pillars: replayable sources, consistent distributed snapshots, and atomic state commits. Here are the key takeaways:
- At-most-once loses data, at-least-once creates duplicates, and exactly-once provides correct results by combining state rollback with source replay.
- Barrier-based checkpointing enables consistent snapshots without stopping the pipeline, using markers that flow with the data stream.
- RisingWave implements exactly-once with async barrier checkpointing, flushing state to shared cloud storage without blocking processing, and treating checkpointed state as queryable SQL tables.
- The Chandy-Lamport algorithm provides the theoretical foundation, adapted by both RisingWave and Flink for streaming DAG topologies.
- End-to-end exactly-once requires coordination beyond the processing engine, extending to sources and sinks through transactional commits.
Ready to build exactly-once streaming pipelines with SQL? Try RisingWave Cloud free - no credit card required. Sign up here.
Join our Slack community to ask questions and connect with other stream processing developers.

