RocksDB in Stream Processing: How State Backends Work

RocksDB in Stream Processing: How State Backends Work

What Is Kafka Connect? Connectors, Transforms, and Converters

RocksDB is an embeddable key-value store used as the state backend in Apache Flink and Kafka Streams. It stores streaming operator state (aggregation results, join buffers, window data) on local disk, enabling processing of state larger than memory.

How RocksDB Works in Streaming

Streaming Operator
    ↓ State reads/writes
RocksDB (LSM Tree)
    ↓ Compaction
Local SSD / Disk
    ↓ Checkpoint
Remote Storage (S3)

LSM Tree Architecture

RocksDB uses a Log-Structured Merge Tree:

  1. Memtable: In-memory write buffer (fast writes)
  2. SSTable: Immutable sorted files on disk (fast reads with bloom filters)
  3. Compaction: Background process merging SSTables to reduce read amplification

RocksDB Challenges in Streaming

ChallengeImpactMitigation
Tuning complexity30+ configuration parametersExpert knowledge required
Local disk dependencyDisk failure = state lossCheckpoints to S3
Recovery timeDownload checkpoint → rebuildMinutes for large state
Memory managementBlock cache competes with JVM heapCareful sizing

The S3 Alternative

RisingWave's Hummock and Flink 2.0's ForSt replace local RocksDB with S3-based state:

  • No local disk failures
  • Seconds-level recovery (state already in S3)
  • No RocksDB tuning required
  • Elastic scaling without state migration

Frequently Asked Questions

Why is RocksDB used in stream processing?

RocksDB provides fast key-value access with support for state larger than memory (spills to disk). Its LSM tree architecture is optimized for write-heavy workloads common in streaming.

Is RocksDB being replaced?

For cloud-native streaming, yes. Flink 2.0 (ForSt) and RisingWave (Hummock) store state on S3 instead of local RocksDB. This eliminates tuning complexity and enables faster recovery.

Best-in-Class Event Streaming
for Agents, Apps, and Analytics
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.