The Architecture of a Distributed Streaming Database

Most databases are built for data at rest. You load rows into tables, index them, and query when you need answers. But modern applications generate continuous streams of events: clicks, transactions, sensor readings, log entries. By the time a traditional database has ingested and indexed a batch, the window for action has already closed.

A streaming database flips this model. Instead of waiting for queries, it processes data the moment it arrives, maintaining always-fresh results through incrementally updated materialized views. The architectural challenge is significant: how do you build a system that handles unbounded, continuously arriving data while providing the consistency guarantees and SQL interface that developers expect from a relational database?

This article breaks down the architecture of RisingWave, a distributed streaming database built from the ground up for cloud-native environments. You will learn how its four core components work together, why separating compute from storage matters for streaming workloads, and how an LSM-tree state store designed specifically for streaming achieves sub-second latency on top of object storage like S3.

The Four Pillars: Frontend, Streaming Engine, State Store, and Meta Service

RisingWave uses a disaggregated architecture with four distinct node types, each scaling independently. The nodes communicate via gRPC, and all persistent state lives in object storage (S3, GCS, or Azure Blob). No component depends on local disk for durability.

graph TB
    Client["PostgreSQL Client<br/>(psql, JDBC, any PG driver)"]

    subgraph Frontend["Frontend Nodes (Port 4566)"]
        Parser["SQL Parser"]
        Binder["Binder & Type Inference"]
        Optimizer["Query Optimizer"]
        Planner["Physical Planner<br/>(Batch or Streaming)"]
    end

    subgraph Meta["Meta Service (Port 5690)"]
        Catalog["Catalog Controller"]
        Barrier["Global Barrier Manager"]
        StreamMgr["Global Stream Manager"]
        HummockMgr["Hummock Manager"]
    end

    subgraph Compute["Compute Nodes (Port 5688)"]
        StreamExec["Streaming Actors"]
        BatchExec["Batch Executor"]
        StateImpl["State Store Client"]
        MemMgr["Memory Manager"]
    end

    subgraph Compactor["Compactor Nodes (Port 6660)"]
        Merge["SSTable Merge & GC"]
    end

    S3["Object Storage<br/>(S3 / GCS / MinIO)"]

    Client --> Frontend
    Frontend -->|DDL| Meta
    Frontend -->|SELECT| Compute
    Meta -->|Schedule actors| Compute
    Meta -->|Compaction tasks| Compactor
    Compute -->|Read/Write| S3
    Compactor -->|Merge SSTables| S3
    Compute -->|Barrier ack| Meta

Let's examine each component in detail.

Frontend: PostgreSQL Protocol Compatibility

The frontend node is the SQL gateway. It implements the PostgreSQL wire protocol, which means any PostgreSQL client, driver, or tool works out of the box: psql, JDBC, Python's psycopg2, Go's pgx, and hundreds of BI tools that speak PostgreSQL.

When a SQL statement arrives, the frontend runs a multi-stage pipeline:

  1. Parsing - The risingwave_sqlparser crate converts raw SQL text into an abstract syntax tree (AST). RisingWave supports the PostgreSQL dialect with streaming extensions like CREATE MATERIALIZED VIEW and CREATE SOURCE.

  2. Binding and type inference - The Binder resolves table names, column references, and function calls against the system catalog. It infers column types and validates that the query is semantically correct.

  3. Logical planning - The optimizer generates a logical plan and applies rule-based and cost-based optimizations: predicate pushdown, join reordering, projection pruning, and common subexpression elimination.

  4. Physical planning - The planner converts the logical plan into either a batch execution plan (for ad-hoc SELECT queries) or a streaming execution plan (for CREATE MATERIALIZED VIEW and CREATE SINK statements). This is where the paths diverge: batch plans go to compute nodes for immediate execution, while streaming plans go to the meta service for long-running deployment.

The frontend is stateless. You can run multiple frontend nodes behind a load balancer to handle more concurrent connections. Each frontend coordinates schema changes (DDL) through the meta service via a CatalogWriter interface, ensuring that CREATE TABLE or ALTER operations are applied consistently across the cluster.

Streaming Engine: Actors, Barriers, and Dataflow Graphs

The streaming engine is the core of RisingWave's real-time processing. It uses an actor-based dataflow model that draws from both the dataflow programming paradigm and the actor concurrency model.

Dataflow Graphs and Fragments

When you create a materialized view, the frontend compiles the SQL into a streaming execution plan, which the meta service's GlobalStreamManager converts into a directed acyclic graph (DAG) of operators: Source, Filter, Project, HashJoin, HashAgg, Materialize, and others. Each operator is implemented as an actor, a lightweight async task running on a Tokio runtime inside a compute node.

Related actors are grouped into fragments. A fragment is the unit of scheduling: the meta service assigns each fragment to a compute node, and all actors within a fragment execute on the same node. This locality reduces network overhead for operators that need to exchange data frequently.

graph LR
    subgraph Fragment1["Fragment 1 (Compute Node A)"]
        Source1["Source Actor<br/>(Kafka)"]
        Filter1["Filter Actor<br/>(amount > 100)"]
    end

    subgraph Fragment2["Fragment 2 (Compute Node B)"]
        Agg["HashAgg Actor<br/>(SUM by category)"]
        Mat["Materialize Actor"]
    end

    Source1 -->|Data chunks| Filter1
    Filter1 -->|Shuffle by key| Agg
    Agg -->|Updates| Mat

Actors exchange data through channels as data chunks - columnar batches of rows, similar to Apache Arrow record batches. This columnar format enables vectorized execution within operators.
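To make the data chunk idea concrete, here is a minimal Python sketch (illustrative only; RisingWave's actual chunks are Rust types built on Arrow-style arrays) of a columnar batch and a vectorized-style filter over it:

```python
# Toy sketch of a columnar data chunk: each column is a flat list, and
# operators work column-at-a-time instead of row-at-a-time.

class DataChunk:
    """A batch of rows stored column-by-column, like an Arrow record batch."""
    def __init__(self, columns: dict[str, list]):
        self.columns = columns
        self.len = len(next(iter(columns.values())))

def filter_chunk(chunk: DataChunk, column: str, predicate) -> DataChunk:
    # Compute a selection bitmap over one column, then apply it to every
    # column in the chunk -- the shape of a vectorized filter.
    bitmap = [predicate(v) for v in chunk.columns[column]]
    return DataChunk({
        name: [v for v, keep in zip(vals, bitmap) if keep]
        for name, vals in chunk.columns.items()
    })

chunk = DataChunk({"order_id": [1, 2, 3], "amount": [50, 150, 300]})
big = filter_chunk(chunk, "amount", lambda a: a > 100)
print(big.columns["order_id"])  # [2, 3] -- rows whose amount exceeds 100
```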

The Barrier Mechanism: Distributed Checkpointing

Consistency in a distributed streaming system is hard. RisingWave solves this with barriers, a technique inspired by the Chandy-Lamport distributed snapshot algorithm (also used by Apache Flink, though RisingWave's implementation is tightly integrated with its storage layer).

Here is how it works:

  1. The meta service's GlobalBarrierManager periodically injects barrier messages into all source actors. Each barrier carries an epoch number, a monotonically increasing logical timestamp.

  2. When an actor receives a barrier on one of its input channels, it stops processing data from that channel and waits until barriers arrive on all input channels (for multi-input operators like joins).

  3. Once all barriers are aligned, the actor flushes its in-memory state changes to the state store, tagged with the current epoch. It then forwards the barrier downstream.

  4. When all sink/materialize actors at the bottom of the DAG acknowledge the barrier, the meta service marks that epoch as committed. The checkpoint is now durable.

This mechanism provides exactly-once semantics for state updates. If a compute node crashes, the system rolls back to the last committed epoch and replays from source offsets associated with that epoch. With default settings, RisingWave checkpoints every 10 seconds, meaning recovery requires replaying at most 10 seconds of data.

The barrier also serves as a coordination point for schema changes and dynamic reconfiguration. When the meta service needs to rescale a fragment (add or remove actors), it piggybacks the reconfiguration command on a barrier, ensuring that the topology change happens at a consistent point in the data stream.
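The alignment step (2) and checkpoint step (3) can be sketched in a few lines of Python. This is a toy simulation, not RisingWave's actual actor code; the Barrier type and align function are illustrative names:

```python
# Minimal sketch of barrier alignment for a multi-input operator: each input
# is drained up to its barrier, then all barriers must carry the same epoch
# before the operator flushes state and forwards the barrier downstream.

from dataclasses import dataclass

@dataclass(frozen=True)
class Barrier:
    epoch: int

def align(inputs: dict[str, list]) -> tuple[list, int]:
    """Consume messages from each input until its barrier arrives.

    Returns the data processed before the barriers and the aligned epoch.
    Data queued *behind* a barrier must wait for the next epoch.
    """
    processed, epochs = [], []
    for name, queue in inputs.items():
        while queue:
            msg = queue.pop(0)
            if isinstance(msg, Barrier):
                epochs.append(msg.epoch)
                break  # stop reading this input; wait for the other inputs
            processed.append(msg)
    assert len(set(epochs)) == 1, "all inputs must deliver the same epoch"
    return processed, epochs[0]

left = ["a1", "a2", Barrier(7), "a3"]   # a3 belongs to the next epoch
right = ["b1", Barrier(7)]
data, epoch = align({"left": left, "right": right})
print(data, epoch)   # ['a1', 'a2', 'b1'] 7
print(left)          # ['a3'] stays queued until epoch 8
```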

State Store: Hummock, an LSM-Tree Built for Streaming

Every streaming operator needs persistent state. A join operator stores both sides of the join. An aggregation operator stores partial results. A deduplication operator stores seen keys. In RisingWave, all operator state is managed by Hummock, a custom LSM-tree (Log-Structured Merge-tree) storage engine designed specifically for streaming workloads.

Why Not RocksDB?

Many streaming systems (including Apache Flink) use RocksDB as their local state backend. RisingWave's team evaluated RocksDB and rejected it for several reasons:

  • Cloud migration complexity - RocksDB is designed for local SSDs. Retrofitting it to use S3 as primary storage requires significant engineering and sacrifices performance guarantees.
  • Compaction interference - RocksDB compaction runs on the same node as computation, causing CPU and I/O spikes that affect streaming latency.
  • No epoch awareness - RocksDB has no concept of streaming epochs or barriers. Integrating it with a distributed checkpoint protocol requires an additional coordination layer.
  • Generic overhead - RocksDB supports features (transactions, column families, merge operators) that a streaming state store does not need, adding unnecessary complexity.

Hummock is purpose-built to avoid these issues.

Hummock's Architecture

Hummock organizes data in a classic LSM-tree structure, but every layer is designed around cloud object storage:

Write path:

  1. Streaming operators write key-value pairs to an in-memory MemTable during processing.
  2. When a barrier arrives, the operator's state changes are flushed to a Shared Buffer, a per-node write buffer that collects state from all local actors.
  3. A background task serializes the Shared Buffer contents into SSTable (Sorted String Table) files and uploads them to S3. This happens asynchronously, so the upload does not block data processing.
  4. Once the upload completes, the compute node notifies the meta service's HummockManager, which atomically updates the HummockVersion metadata to include the new files.
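The hand-offs in these four steps can be sketched as follows (an illustrative Python model with made-up names; the real path is asynchronous Rust code):

```python
# Sketch of the write path: operator state moves MemTable -> Shared Buffer at
# a barrier, then a background task "uploads" SSTables and the resulting
# files are committed into the version metadata.

class WritePath:
    def __init__(self):
        self.memtable = {}        # step 1: in-memory writes
        self.shared_buffer = []   # step 2: per-epoch flushed state
        self.s3_objects = []      # step 3: uploaded SSTable files
        self.hummock_version = [] # step 4: files committed by the meta service

    def write(self, key, value):
        self.memtable[key] = value

    def on_barrier(self, epoch: int):
        # Flush this epoch's state changes into the shared buffer.
        self.shared_buffer.append((epoch, dict(self.memtable)))
        self.memtable.clear()

    def upload(self):
        # Normally an async background task; synchronous here for clarity.
        for epoch, _state in self.shared_buffer:
            sst = f"sst-epoch-{epoch}"
            self.s3_objects.append(sst)       # upload to object storage
            self.hummock_version.append(sst)  # meta service commits the file
        self.shared_buffer.clear()

wp = WritePath()
wp.write("k", 1)
wp.on_barrier(epoch=7)
wp.upload()
print(wp.hummock_version)  # ['sst-epoch-7']
```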

Read path:

  1. An operator issues a get(key, epoch) or iter(prefix, epoch) call.
  2. Hummock first checks the local MemTable (uncommitted writes from the current epoch).
  3. Then it checks the Shared Buffer (committed but not yet uploaded data).
  4. Then it checks the block cache (recently accessed SSTable blocks, held in memory).
  5. Then it checks the Foyer disk cache (SSTable blocks cached on local SSD/EBS).
  6. Finally, if no cache hit, it fetches the relevant SSTable blocks from S3.

In practice, the caching layers (memory and local disk) handle 80-90% of reads, so S3 latency rarely impacts streaming performance.
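The read path is a first-hit-wins walk down the tiers. A toy sketch (plain dicts standing in for the real MemTable, shared buffer, caches, and S3-resident SSTables):

```python
# Illustrative sketch of Hummock's tiered read path: check each tier in
# order; the first tier containing the key wins, which also means newer
# tiers shadow older values further down.

def tiered_get(key, tiers):
    """Return (value, tier_name) for the first tier containing `key`."""
    for name, store in tiers:
        if key in store:
            return store[key], name
    return None, "not found"

tiers = [
    ("memtable",      {"k1": "v1-uncommitted"}),
    ("shared_buffer", {"k2": "v2"}),
    ("block_cache",   {"k3": "v3"}),
    ("disk_cache",    {"k4": "v4"}),
    ("s3",            {"k1": "v1-old", "k5": "v5"}),
]

print(tiered_get("k1", tiers))  # memtable shadows the older value in S3
print(tiered_get("k5", tiers))  # only S3 has it: the slow path
```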

graph TB
    Op["Streaming Operator<br/>get(key, epoch)"]

    MemTable["MemTable<br/>(in-memory, current epoch)"]
    SharedBuf["Shared Buffer<br/>(committed, pending upload)"]
    BlockCache["Block Cache<br/>(40% of storage memory)"]
    DiskCache["Foyer Disk Cache<br/>(local SSD/EBS)"]
    S3Store["S3 Object Storage<br/>(immutable SSTables)"]

    Op --> MemTable
    MemTable -->|miss| SharedBuf
    SharedBuf -->|miss| BlockCache
    BlockCache -->|miss| DiskCache
    DiskCache -->|miss| S3Store

    style MemTable fill:#2d6a2d,color:#fff
    style SharedBuf fill:#2d6a4f,color:#fff
    style BlockCache fill:#1b4332,color:#fff
    style DiskCache fill:#40916c,color:#fff
    style S3Store fill:#52b788,color:#fff

Epoch-Based MVCC

Hummock uses the streaming engine's epoch as its MVCC (Multi-Version Concurrency Control) version. Every key-value pair is tagged with the epoch in which it was written. When a batch query executes a SELECT, the frontend obtains the most recently committed epoch from the meta service and passes it to all compute nodes, ensuring a consistent snapshot across the entire cluster.
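The snapshot-read behavior can be sketched as a tiny versioned key-value store (a conceptual model, not Hummock's actual key encoding):

```python
# Minimal sketch of epoch-based MVCC: every write is tagged with its epoch,
# and a read at epoch E returns the newest value written at or before E.

class MvccStore:
    def __init__(self):
        self.versions = {}  # key -> list of (epoch, value), epochs ascending

    def put(self, key, epoch, value):
        # Assumes puts arrive in increasing epoch order, as barriers guarantee.
        self.versions.setdefault(key, []).append((epoch, value))

    def get(self, key, epoch):
        best = None
        for e, v in self.versions.get(key, []):
            if e <= epoch:
                best = v       # a candidate visible at this snapshot
            else:
                break          # written after the snapshot: invisible
        return best

store = MvccStore()
store.put("total:42", epoch=100, value=500)
store.put("total:42", epoch=101, value=650)
print(store.get("total:42", epoch=100))  # 500 -- snapshot at epoch 100
print(store.get("total:42", epoch=101))  # 650 -- snapshot at epoch 101
```

A batch SELECT that pins epoch 100 across all compute nodes therefore sees the same consistent value everywhere, even while epoch 101 is being written.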

This epoch-based versioning also enables schema-aware bloom filters. Instead of building bloom filters on entire keys, Hummock can create filters on specific key prefixes relevant to each operator. For example, a join operator's bloom filter might target only the join column, skipping irrelevant SSTable blocks and reducing I/O.

Remote Compaction

In a traditional LSM-tree, compaction (merging and reorganizing SSTable files) runs on the same machine as reads and writes. This creates resource contention that can spike latency. Hummock separates compaction into dedicated compactor nodes.

The meta service's HummockManager schedules compaction tasks, and compactor nodes pull SSTable files from S3, merge them, write new files back to S3, and report completion. This design has three benefits:

  • Compaction never steals CPU or memory from streaming operators
  • Compactor nodes can scale independently based on write volume
  • In RisingWave Cloud, compaction runs as a serverless shared service across tenants

Hummock uses a tiered compaction strategy: L0 contains overlapping files grouped by checkpoint, which are compacted into non-overlapping L1 files. L1 through L6 follow a standard leveled compaction policy, with each level roughly 10x larger than the previous. The sub-level organization at L0 enables selective compaction that limits write amplification.
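The 10x fanout compounds quickly. A back-of-the-envelope sketch (the 512 MiB L1 size here is an assumed figure for illustration, not RisingWave's default):

```python
# With each level ~10x the previous, total capacity grows geometrically while
# a point read touches at most one file per non-overlapping level.

def level_capacities(l1_bytes: int, levels: int = 6, fanout: int = 10):
    return [l1_bytes * fanout**i for i in range(levels)]

caps = level_capacities(l1_bytes=512 * 2**20)  # assume a 512 MiB L1
for i, c in enumerate(caps, start=1):
    print(f"L{i}: {c / 2**30:g} GiB")
# L1: 0.5 GiB, L2: 5 GiB, L3: 50 GiB, ... L6: 50000 GiB
```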

Meta Service: The Cluster's Brain

The meta service is the central coordinator. It does not sit in the data path (streaming data flows directly between compute nodes), but it manages everything else:

  • Catalog management - Stores all schema metadata (tables, sources, sinks, materialized views, user-defined functions) in a PostgreSQL or SQLite meta store. Every frontend caches the catalog locally and receives updates when schema changes occur.

  • Barrier management - The GlobalBarrierManager drives the checkpoint protocol. It injects barriers, collects acknowledgments, and advances the committed epoch. It also embeds reconfiguration commands in barriers for online scaling.

  • Stream graph scheduling - The GlobalStreamManager converts logical streaming plans into physical actor assignments. When you run CREATE MATERIALIZED VIEW, it determines how to partition the dataflow, assigns fragments to compute nodes, and starts the actors.

  • Storage version management - The HummockManager maintains the current HummockVersion, tracks which SSTable files belong to which version, schedules compaction, and garbage-collects obsolete files.

  • Cluster membership - Tracks which frontend, compute, and compactor nodes are alive, handles node failures, and triggers actor migration when a compute node goes down.

The meta service is the only component that maintains hard state (in its meta store). All other nodes are effectively stateless: they can be replaced or rescaled without data loss because all durable state resides in S3.

How Compute and Storage Decoupling Changes the Game

The traditional approach to streaming state management couples storage with compute. Apache Flink, for example, runs RocksDB on the same node as the streaming operators. This creates several operational challenges:

| Concern | Coupled (e.g., Flink + RocksDB) | Decoupled (RisingWave + Hummock on S3) |
| --- | --- | --- |
| Scaling compute | Must rebalance state across nodes (redistributing RocksDB data) | Add compute nodes instantly; they pull state from S3 on demand |
| Recovery after failure | Restore from remote checkpoint, rebuild local RocksDB | Restart node, read committed epoch from S3 (no local state to rebuild) |
| Storage cost | Attached SSDs on every compute node (expensive, often underutilized) | S3 costs ~$0.023/GB/month, shared across all nodes |
| Compaction impact | Competes for CPU/IO with streaming operators | Runs on separate compactor nodes |
| Multi-tenancy | Each tenant needs dedicated disk | Shared object storage with logical isolation |

The decoupled model means that compute nodes in RisingWave are nearly stateless. They hold in-memory caches and the current epoch's uncommitted writes, but nothing that cannot be reconstructed from S3 and the meta service. This dramatically simplifies operations: scaling, failover, and upgrades become faster because there is no local state to migrate.

The tradeoff is latency. S3 access typically takes 100-300ms, far too slow for streaming state lookups. RisingWave mitigates this through its multi-tier caching architecture:

  • Memory block cache: Allocated 40% of storage memory budget. Hot SSTable blocks stay here.
  • Foyer disk cache: Uses local SSD or EBS as a second-level cache. Absorbs cache evictions from memory.
  • Prefetching: The query planner identifies which SSTable blocks a scan will need and issues prefetch requests ahead of actual access.
  • 4MB block reads: Instead of reading individual rows from S3, Hummock fetches data in 4MB blocks with sparse indexes mapping key ranges to S3 offsets. This minimizes the number of S3 API calls.
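The last point, sparse indexes over large blocks, can be sketched with a binary search (a conceptual model; the index entry layout is illustrative, not Hummock's actual SSTable format):

```python
# Sketch of a sparse index: one entry per 4 MB block recording the block's
# first key and its byte offset in the SSTable object. A point lookup does a
# single binary search, then one ranged GET for the whole block.

import bisect

class SparseIndex:
    def __init__(self, entries):
        # entries: sorted list of (first_key, byte_offset), one per block
        self.first_keys = [k for k, _ in entries]
        self.offsets = [off for _, off in entries]

    def block_offset(self, key) -> int:
        """Byte offset of the only block that may contain `key`."""
        i = bisect.bisect_right(self.first_keys, key) - 1
        return self.offsets[max(i, 0)]

BLOCK = 4 * 2**20  # 4 MB blocks
idx = SparseIndex([("apple", 0), ("mango", BLOCK), ("peach", 2 * BLOCK)])
print(idx.block_offset("banana"))  # falls in the first block (offset 0)
print(idx.block_offset("orange"))  # falls in the second block
```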

With these optimizations, the system achieves sub-100ms processing latency for streaming workloads while storing all persistent state in object storage.

Putting It All Together: A Query's Journey

To make the architecture concrete, let's trace what happens when you run:

CREATE MATERIALIZED VIEW order_totals AS
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;

  1. Frontend parses the SQL, binds orders to a Kafka source defined earlier, optimizes the plan, and generates a streaming physical plan with three operators: Source (reading from Kafka), HashAgg (grouping by customer_id), and Materialize (writing results).

  2. Meta service receives the streaming plan. The GlobalStreamManager creates fragments, assigns actors to compute nodes based on available resources, and starts the actors.

  3. Compute nodes begin running. The Source actor reads messages from Kafka and emits data chunks. The HashAgg actor maintains a hash map of partial sums, backed by a Hummock StateTable. The Materialize actor writes final results to another StateTable.

  4. Barriers flow every 10 seconds. When a barrier passes through the HashAgg actor, it flushes its updated partial sums to the Shared Buffer. The background uploader serializes these as SSTables and pushes them to S3. Once all actors acknowledge, the epoch is committed.

  5. A user queries the materialized view with SELECT * FROM order_totals WHERE customer_id = 42. The frontend routes this as a batch query to a compute node, which reads from the Materialize actor's StateTable. The read hits the block cache (if the data is hot) or fetches from S3 (if cold), using the latest committed epoch for a consistent snapshot.

  6. Compactor nodes periodically merge SSTable files in the background, reducing read amplification and reclaiming space from deleted keys, without affecting streaming latency.

What Happens When Materialized Views Have Complex Joins?

Joins in streaming are fundamentally different from batch joins. In a batch system, both sides of a join are finite and available. In a streaming system, data arrives continuously, and you need to maintain state for both sides to handle late-arriving records on either input.

RisingWave implements streaming joins by maintaining two-sided state: each side of the join is stored in its own Hummock StateTable. When a new record arrives on the left side, the actor probes the right side's state to find matches, emits joined results, and stores the new left record for future matching against right-side arrivals.
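A minimal sketch of this two-sided state (a toy symmetric hash join in Python, not RisingWave's Rust executor; real state lives in Hummock StateTables rather than in-memory dicts):

```python
# Each side of the join keeps its own state keyed by the join key. A newly
# arrived row probes the opposite side for matches, then is stored so that
# future arrivals on the other side can match against it.

from collections import defaultdict

class StreamingHashJoin:
    def __init__(self):
        self.left_state = defaultdict(list)   # join_key -> left rows
        self.right_state = defaultdict(list)  # join_key -> right rows

    def on_left(self, key, row):
        self.left_state[key].append(row)      # store for future matches
        return [(row, r) for r in self.right_state[key]]

    def on_right(self, key, row):
        self.right_state[key].append(row)
        return [(l, row) for l in self.left_state[key]]

join = StreamingHashJoin()
print(join.on_left(42, "order#1"))   # no right rows yet -> []
print(join.on_right(42, "cust#42"))  # matches the stored left row
print(join.on_left(42, "order#2"))   # matches the stored right row
```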

The epoch-based MVCC and schema-aware bloom filters are critical here. The bloom filter for a join's left-side state can be built specifically on the join key columns, enabling the actor to skip SSTable blocks that cannot contain matches. This optimization reduces I/O significantly for high-cardinality joins.

Because join state can grow large (it persists until records expire via a watermark or TTL), the decoupled storage model is especially valuable. The state scales with data volume, not with compute cluster size, and the cost of storing it in S3 is a fraction of what attached SSDs would cost.

How Does RisingWave Handle Failure Recovery?

Failure recovery in a distributed streaming system must be fast and correct. RisingWave's approach relies on the combination of committed epochs, S3 durability, and source replayability.

When a compute node crashes:

  1. The meta service detects the failure (via heartbeat timeout) and identifies the affected fragments and actors.
  2. It reassigns the affected fragments to surviving compute nodes (or waits for a replacement node).
  3. The new actors load their operator state from S3, reading the SSTables associated with the last committed epoch.
  4. Source actors rewind to the Kafka offsets (or other source positions) associated with the last committed epoch.
  5. Processing resumes from the committed epoch. At most 10 seconds of data (one checkpoint interval) needs reprocessing.
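The rewind in steps 3-5 hinges on the checkpoint recording source offsets alongside the state version. A sketch under assumed names (the checkpoint record layout and offsets below are illustrative):

```python
# Epoch-based recovery: each committed checkpoint pairs a state version with
# the source offsets captured at that epoch, so a restarted actor loads the
# state, rewinds the source, and replays only the uncommitted tail.

checkpoints = {
    100: {"state_version": "v100", "kafka_offset": 5_000},
    101: {"state_version": "v101", "kafka_offset": 5_800},
}

def recover(committed_epoch: int, crash_offset: int):
    """Return (state to load, offset to resume from, messages to replay)."""
    ckpt = checkpoints[committed_epoch]
    replay = crash_offset - ckpt["kafka_offset"]
    return ckpt["state_version"], ckpt["kafka_offset"], replay

# Crash while reading offset 6_250; the last committed epoch is 101.
state, resume_at, replayed = recover(101, 6_250)
print(state, resume_at, replayed)  # v101 5800 450 -> replay 450 messages
```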

Because compute nodes hold no critical local state, there is nothing to "recover" from local disk. The S3 state is always consistent and durable. This makes RisingWave's recovery significantly simpler than systems that must restore a local RocksDB instance from a remote checkpoint.

The meta service itself stores its metadata in an external PostgreSQL or SQLite database, so meta node failures are handled by standard database HA mechanisms (replicas, failover).

When Should You Choose a Distributed Streaming Database Over a Stream Processor?

A stream processor like Apache Flink or Kafka Streams processes events and produces output, but it does not serve queries directly. You typically need a separate serving layer (Redis, PostgreSQL, Elasticsearch) to store and query the results.

A streaming database like RisingWave combines processing and serving. It ingests streams, maintains materialized views, and serves queries over those views through a standard SQL interface. This eliminates an entire category of infrastructure:

  • No separate serving database to deploy and manage
  • No ETL pipeline between the processor and the serving layer
  • No consistency issues from the delay between processing and serving
  • No impedance mismatch between processing APIs and query APIs

Choose a distributed streaming database when your use case requires both continuous processing and low-latency queries on the results: real-time dashboards, operational analytics, feature stores for ML, event-driven microservices, or any scenario where freshness and queryability matter together.

Choose a standalone stream processor when you only need to transform and route events (ETL between Kafka topics, for example) without serving queries, or when you need custom non-SQL processing logic.

How Does RisingWave Compare to Other Streaming Architectures?

Several systems tackle the streaming database problem with different architectural choices:

  • Apache Flink + serving DB - Flink is a powerful stream processor but requires a separate database (PostgreSQL, Redis) for serving query results. State is coupled with compute via RocksDB, making scaling and recovery more complex.

  • Materialize - Another streaming SQL database. Materialize uses a single-node Timely Dataflow engine with a differential dataflow computation model. RisingWave's actor-based distributed model is designed for horizontal scaling across many nodes.

  • ksqlDB - Built on Kafka Streams, ksqlDB provides SQL over Kafka topics. It is tightly coupled to the Kafka ecosystem and uses RocksDB for local state, inheriting the scaling and recovery challenges of coupled storage.

RisingWave's distinguishing architectural choices are: PostgreSQL wire compatibility (no new protocol to learn), fully decoupled compute and storage (S3 as the single source of truth), and remote compaction (no resource contention on compute nodes). For a deeper comparison, see The Case for Distributed SQL Streaming Databases.

Conclusion

The architecture of a distributed streaming database requires careful coordination between four components: a SQL frontend that speaks a familiar protocol, a streaming engine that processes data with exactly-once guarantees, a state store that provides fast reads despite living on object storage, and a meta service that orchestrates the entire system.

Key takeaways:

  • PostgreSQL compatibility at the frontend eliminates the need for new SDKs, drivers, or query languages. Anything that speaks PostgreSQL works.
  • Actor-based dataflow with barrier-driven checkpointing provides exactly-once processing semantics without sacrificing throughput.
  • Hummock's LSM-tree on S3 decouples storage from compute, enabling independent scaling, cheaper storage, and simpler operations.
  • Remote compaction on dedicated nodes ensures that background maintenance never impacts streaming latency.
  • Epoch-based MVCC unifies the streaming and serving layers, giving batch queries consistent snapshots without stopping the stream.

The result is a system where you can write SQL to define real-time transformations and query the results instantly, all within a single, operationally simple deployment.


Ready to explore RisingWave's architecture yourself? Get started with RisingWave in 5 minutes via the Quickstart guide.

Join our Slack community to ask questions and connect with other stream processing developers.
