Exactly-Once Semantics in Stream Processing: How It Works

Exactly-once semantics means every event in a stream is processed exactly one time — no duplicates, no data loss — even when failures occur. This is the strongest processing guarantee in stream processing and is critical for financial transactions, billing systems, and any workload where duplicate or missing events cause business impact. Apache Flink, RisingWave, and Kafka Streams all support exactly-once semantics through different mechanisms.

The Three Processing Guarantees

| Guarantee     | Description               | Duplicates? | Data Loss? | Complexity |
|---------------|---------------------------|-------------|------------|------------|
| At-most-once  | Fire and forget           | No          | Possible   | Lowest     |
| At-least-once | Retry on failure          | Possible    | No         | Medium     |
| Exactly-once  | Each event processed once | No          | No         | Highest    |

How Exactly-Once Works

Checkpoint-Based (Flink, RisingWave)

The system periodically snapshots all operator state. On failure, it restores from the last checkpoint and replays events from the source:

  1. Checkpoint: Save a consistent snapshot of all state to durable storage (e.g., S3)
  2. Failure: Node crashes, losing in-memory state
  3. Recovery: Restore state from checkpoint, reset source offsets to checkpoint position
  4. Replay: Reprocess events from checkpoint offset — but state ensures results are identical

  • Flink: Uses Chandy-Lamport-style distributed snapshots with two-phase commit for end-to-end exactly-once
  • RisingWave: Barrier-based checkpointing with 1-second intervals, state stored on S3
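The checkpoint-restore-replay loop above can be sketched in a few lines of Python. This is a toy simulation, not any engine's actual API: the key idea is that operator state and the source offset are snapshotted together, so after a crash both are restored and replay produces results identical to a failure-free run.

```python
# Toy sketch of checkpoint-based exactly-once recovery (illustrative,
# not any engine's real API). State and source offset are snapshotted
# together; on failure we restore both and replay from the offset.

import copy

events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def run_with_failure(crash_at):
    state, i = {}, 0                       # operator state + source offset
    checkpoint = (copy.deepcopy(state), i) # last durable snapshot
    processed = 0
    while i < len(events):
        if crash_at is not None and processed == crash_at:
            # Failure: in-memory state is lost. Restore the last
            # checkpoint and reset the source offset to its position.
            state, i = copy.deepcopy(checkpoint[0]), checkpoint[1]
            crash_at = None                # only crash once
            continue
        key, amount = events[i]
        state[key] = state.get(key, 0) + amount
        i += 1
        processed += 1
        if i % 2 == 0:                     # "periodic" checkpoint every 2 events
            checkpoint = (copy.deepcopy(state), i)
    return state

# Replaying from the checkpointed offset yields exactly the same result
# as a failure-free run: nothing lost, nothing double-counted.
assert run_with_failure(crash_at=3) == run_with_failure(crash_at=None)
```

The crucial detail is that the state snapshot and the source offset are saved atomically as one unit; snapshotting them separately would reintroduce duplicates or loss.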

Changelog-Based (Kafka Streams)

Every state mutation is written to a Kafka changelog topic. On failure, replay the changelog:

  1. Processing: Each state change written to both local RocksDB and Kafka changelog
  2. Failure: Node crashes
  3. Recovery: New node replays changelog from last committed offset
  4. Result: State rebuilt identically

Combined with Kafka's transactional producers (processing.guarantee=exactly_once_v2), this provides end-to-end exactly-once within the Kafka ecosystem.
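The changelog pattern can be sketched as follows. This is a toy model: a dict stands in for the local RocksDB store and a list stands in for the Kafka changelog topic.

```python
# Toy sketch of changelog-based state recovery (illustrative; real
# Kafka Streams writes to a compacted changelog topic, not a list).

local_state = {}   # stands in for the local RocksDB store
changelog = []     # stands in for the Kafka changelog topic

def put(key, value):
    # Every state mutation goes to BOTH the local store and the changelog.
    local_state[key] = value
    changelog.append((key, value))

put("orders:42", 1)
put("orders:42", 2)   # later versions overwrite earlier ones
put("orders:99", 7)

local_state = None    # node crashes: local state is gone

# Recovery: a new node replays the changelog from the beginning
# (in practice, from the last committed offset) to rebuild state.
recovered = {}
for key, value in changelog:
    recovered[key] = value
```

Because the changelog is an ordered log of every mutation, replaying it deterministically rebuilds the exact state the failed node held.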

The Cost of Exactly-Once

Exactly-once isn't free:

  • Latency overhead: Checkpoint coordination adds milliseconds to processing
  • Storage cost: Checkpoints/changelogs consume storage
  • Throughput impact: Transactional writes reduce peak throughput by 10-30%

For many workloads (logging, metrics, clickstream), at-least-once with downstream deduplication is sufficient and cheaper.
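One common shape for that downstream deduplication is sketched below. The event IDs and the in-memory seen-set are illustrative; production systems bound the set (e.g., by time window) or push the check into the sink itself.

```python
# Toy sketch of downstream deduplication under at-least-once delivery:
# each event carries a unique ID, and the consumer skips IDs it has
# already processed, so retried redeliveries have no effect.

seen = set()
total = 0

def consume(event_id, value):
    global total
    if event_id in seen:
        return            # duplicate redelivery: ignore
    seen.add(event_id)
    total += value

# At-least-once delivery may redeliver event 2 after a retry.
for event_id, value in [(1, 10), (2, 20), (2, 20), (3, 5)]:
    consume(event_id, value)

# Duplicates did not inflate the sum: 10 + 20 + 5 = 35.
```

This trades the coordination cost of exactly-once for a small amount of per-consumer bookkeeping, which is often the cheaper deal for high-volume, low-stakes streams.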

When You Need Exactly-Once

  • Financial transactions (payments, trades)
  • Billing and metering
  • Inventory management
  • Any system where duplicates cause monetary impact

When At-Least-Once Is Fine

  • Log aggregation (duplicate log lines are harmless)
  • Metrics collection (idempotent counters handle duplicates)
  • Clickstream analytics (slight overcounting is acceptable)

Frequently Asked Questions

Is exactly-once semantics really possible?

Yes, within the scope of the processing system. Flink, RisingWave, and Kafka Streams achieve exactly-once through coordinated checkpointing and transactional writes. End-to-end exactly-once (including external sinks) requires the sink to support transactions or idempotent writes.

Which stream processor has the best exactly-once support?

Apache Flink has the most mature end-to-end exactly-once implementation, using two-phase commit across sources and sinks. RisingWave provides exactly-once with 1-second checkpoints and S3 state. Kafka Streams supports exactly-once within the Kafka ecosystem.
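The two-phase commit pattern can be sketched as follows (illustrative Python, not Flink's actual API; Flink's real analogue is its transactional sink interface): writes buffer in an open transaction, the transaction is prepared when the checkpoint barrier arrives, and the final commit runs only once the checkpoint is durable everywhere.

```python
# Toy sketch of two-phase commit for a transactional sink (illustrative;
# names like write/pre_commit/commit are hypothetical, not Flink's API).

committed, pending = [], []

def write(record):
    pending.append(record)     # phase 0: buffer inside an open transaction

def pre_commit():
    # Phase 1: prepare/flush the transaction so it can survive a failure.
    return list(pending)

def commit(prepared):
    # Phase 2: runs only after the checkpoint completes everywhere.
    committed.extend(prepared)
    pending.clear()

write("a")
write("b")
txn = pre_commit()             # checkpoint barrier reaches the sink
commit(txn)                    # checkpoint-complete notification arrives
```

If a failure happens between pre-commit and commit, the prepared transaction is either committed or aborted during recovery in line with the checkpoint outcome, which is what makes the sink-side result exactly-once.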

Does exactly-once affect performance?

Yes. Exactly-once adds overhead from checkpoint coordination, transactional writes, and state snapshots. Throughput typically decreases 10-30% compared to at-least-once. For most workloads, this trade-off is worth the correctness guarantee.
