If you have worked with traditional databases, you know the pattern: data lands in tables, and you run queries against it when you need answers. The database does all the heavy lifting at read time, scanning rows, joining tables, and computing aggregates on every request. This works fine until your data volume grows and your users expect fresh results every second instead of every hour.
A streaming database flips this model. Instead of computing results when you ask for them, it computes results as data arrives. The query runs once, and the database keeps the answer up to date continuously. No cron jobs, no batch refreshes, no stale dashboards.
But how does this actually work? What architectural decisions make continuous computation possible without reprocessing everything from scratch? This article breaks down the internal architecture of a streaming database, explains each layer, and shows you the concepts in action with verified SQL examples running on RisingWave v2.8.0.
What Makes a Streaming Database Different from a Traditional Database?
A traditional relational database stores data on write and computes results on read. When you run SELECT region, SUM(amount) FROM orders GROUP BY region, the database scans the orders table, groups the rows, and calculates the sums. Run the same query a minute later with new data, and it does all that work again from scratch.
A streaming database inverts this. You define the query once as a materialized view, and the database computes the result incrementally as each new row arrives. It does not rescan the entire table. It takes the new row, determines how it affects the existing result, and updates only the relevant parts.
This is called incremental view maintenance (IVM), and it is the core computation model behind every streaming database. The database treats your query as a standing program rather than a one-shot request.
Here is what that looks like in practice. The following SQL was verified against RisingWave v2.8.0.
Setting up the data
CREATE TABLE orders (
order_id INT,
customer_id INT,
product_name VARCHAR,
amount DECIMAL,
region VARCHAR,
order_time TIMESTAMP
);
CREATE TABLE customers (
customer_id INT,
customer_name VARCHAR,
signup_date DATE,
tier VARCHAR
);
INSERT INTO orders VALUES
(1, 101, 'Laptop', 1299.99, 'us-east', '2025-06-01 10:00:00'),
(2, 102, 'Keyboard', 79.99, 'us-west', '2025-06-01 10:05:00'),
(3, 101, 'Monitor', 449.99, 'us-east', '2025-06-01 10:10:00'),
(4, 103, 'Mouse', 29.99, 'eu-west', '2025-06-01 10:15:00'),
(5, 104, 'Headphones', 199.99, 'us-west', '2025-06-01 10:20:00'),
(6, 102, 'Webcam', 89.99, 'us-west', '2025-06-01 10:25:00'),
(7, 105, 'Laptop', 1299.99, 'eu-west', '2025-06-01 10:30:00'),
(8, 103, 'Tablet', 599.99, 'eu-west', '2025-06-01 10:35:00');
INSERT INTO customers VALUES
(101, 'Alice', '2024-01-15', 'gold'),
(102, 'Bob', '2024-03-20', 'silver'),
(103, 'Charlie', '2024-06-10', 'bronze'),
(104, 'Diana', '2024-09-01', 'silver'),
(105, 'Eve', '2025-01-05', 'gold');
Defining the materialized view
CREATE MATERIALIZED VIEW order_metrics AS
SELECT
region,
COUNT(*) AS total_orders,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM orders
GROUP BY region;
Query the result:
SELECT * FROM order_metrics ORDER BY total_revenue DESC;
region | total_orders | total_revenue | avg_order_value
---------+--------------+---------------+----------------
eu-west | 3 | 1929.97 | 643.32
us-east | 2 | 1749.98 | 874.99
us-west | 3 | 369.97 | 123.32
Now insert a new order:
INSERT INTO orders VALUES
(9, 101, 'Server', 4999.99, 'us-east', '2025-06-01 11:00:00');
Query the materialized view again, with no manual refresh needed:
SELECT * FROM order_metrics ORDER BY total_revenue DESC;
region | total_orders | total_revenue | avg_order_value
---------+--------------+---------------+----------------
us-east | 3 | 6749.97 | 2249.99
eu-west | 3 | 1929.97 | 643.32
us-west | 3 | 369.97 | 123.32
The us-east row updated from $1,749.98 to $6,749.97 without rescanning the entire table. The database processed only the delta: one new row worth $4,999.99 added to the existing aggregate.
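The arithmetic behind that update is small enough to sketch. Per group, the engine only needs the running count and sum (the average is derived from them), so applying a delta is constant-time work per affected group. A minimal sketch, using the us-east numbers from above:

```python
from decimal import Decimal

# Running state for the us-east group before order 9 arrives:
# the engine stores count and sum; avg is derived, not stored.
count, total = 2, Decimal("1749.98")

# Apply the single-row delta instead of rescanning the table.
delta = Decimal("4999.99")  # amount of the new 'Server' order
count += 1
total += delta
avg = total / count

print(count, total, avg)  # 3 6749.97 2249.99
```

No other group's state is touched, which is why the eu-west and us-west rows in the result were unchanged.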
How Does the Computation Engine Process Data Incrementally?
The incremental computation engine is the most important architectural component in a streaming database. When you create a materialized view, the database does not just store a query string. It builds a dataflow graph: a directed acyclic graph (DAG) of operators that mirrors your SQL query.
For the order_metrics view above, the dataflow graph looks roughly like this:
- Source operator reads changes from the orders table
- Project operator selects the relevant columns (region, amount)
- Hash aggregate operator groups by region and maintains running counts, sums, and averages
- Materialize operator stores the final result and serves queries
Each operator in the graph is stateful. The hash aggregate operator, for example, maintains a hash map keyed by region with the current count, sum, and average for each group. When a new row arrives, the operator looks up the key, updates the running totals, and emits the change downstream.
This is fundamentally different from batch processing, where the aggregation operator would scan all rows and build the hash map from scratch every time. In a streaming database, the operator maintains its state across events and only processes the change.
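To make that contrast concrete, here is a toy Python sketch of a stateful hash-aggregate operator (illustrative only, not RisingWave's implementation). It keeps per-group count and sum in a dict, consumes a changelog of `+1` inserts and `-1` deletes, and emits only the group a change touches:

```python
class HashAggregate:
    """Toy stateful operator: maintains COUNT/SUM/AVG per group key."""

    def __init__(self):
        self.state = {}  # group key -> [count, sum]

    def apply(self, diff, region, amount):
        """Process one changelog entry: diff is +1 (insert) or -1 (delete)."""
        count, total = self.state.get(region, [0, 0.0])
        count += diff
        total += diff * amount
        if count == 0:
            self.state.pop(region, None)  # group vanished: retract it
            return (region, None)
        self.state[region] = [count, total]
        # Emit only the updated group, not the whole result set.
        return (region, {"count": count, "sum": round(total, 2),
                         "avg": round(total / count, 2)})

agg = HashAggregate()
agg.apply(+1, "us-east", 1299.99)
agg.apply(+1, "us-east", 449.99)
print(agg.apply(+1, "us-east", 4999.99))
# -> ('us-east', {'count': 3, 'sum': 6749.97, 'avg': 2249.99})
```

The `-1` path matters: upstream deletes and updates arrive as retractions, and the same state lets the operator subtract a row's contribution without a rescan.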
Streaming joins work the same way
Joins in a streaming database are also maintained incrementally. When you join two tables in a materialized view, the database maintains a hash table for each side of the join. When a new row arrives on one side, the engine probes the hash table of the other side to find matching rows and emits the joined result.
CREATE MATERIALIZED VIEW customer_spend_tier AS
SELECT
c.customer_name,
c.tier,
COUNT(o.order_id) AS order_count,
SUM(o.amount) AS total_spent
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.customer_name, c.tier;
SELECT * FROM customer_spend_tier ORDER BY total_spent DESC;
customer_name | tier | order_count | total_spent
---------------+--------+-------------+------------
Alice | gold | 3 | 6749.97
Eve | gold | 1 | 1299.99
Charlie | bronze | 2 | 629.98
Diana | silver | 1 | 199.99
Bob | silver | 2 | 169.98
This view joins orders with customers, groups by customer name and tier, and computes the totals. Internally, the database maintains hash tables for both sides of the join. When a new order arrives, it looks up the customer in the customers hash table, produces the joined row, and feeds it to the aggregate operator. The entire pipeline, from source to join to aggregate to materialized result, runs incrementally.
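This mechanism is often called a symmetric hash join: each side buffers its rows in a hash table keyed by the join key, and a new row on either side probes the opposite table. A toy sketch (illustrative only, not engine code):

```python
from collections import defaultdict

class StreamingHashJoin:
    """Toy symmetric hash join on a single equi-join key."""

    def __init__(self):
        self.left = defaultdict(list)   # key -> buffered left-side rows
        self.right = defaultdict(list)  # key -> buffered right-side rows

    def on_left(self, key, row):
        """A new left-side row: buffer it, then probe the right side."""
        self.left[key].append(row)
        return [(row, r) for r in self.right[key]]

    def on_right(self, key, row):
        """A new right-side row: buffer it, then probe the left side."""
        self.right[key].append(row)
        return [(l, row) for l in self.left[key]]

join = StreamingHashJoin()
join.on_right(101, {"customer_name": "Alice", "tier": "gold"})
out = join.on_left(101, {"order_id": 1, "amount": 1299.99})
print(out)
# -> [({'order_id': 1, 'amount': 1299.99}, {'customer_name': 'Alice', 'tier': 'gold'})]
```

Buffering both sides is what lets the join emit results no matter which side a change arrives on; it is also why join state is the largest consumer of storage in most streaming pipelines.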
Multi-way joins across three tables
Streaming databases handle complex multi-table joins the same way. Each join stage maintains its own state, and changes propagate through the graph:
CREATE TABLE shipments (
shipment_id INT,
order_id INT,
shipped_at TIMESTAMP,
delivered_at TIMESTAMP,
carrier VARCHAR
);
INSERT INTO shipments VALUES
(1001, 1, '2025-06-01 14:00:00', '2025-06-03 09:00:00', 'FedEx'),
(1002, 2, '2025-06-01 15:00:00', '2025-06-02 12:00:00', 'UPS'),
(1003, 3, '2025-06-01 16:00:00', NULL, 'FedEx'),
(1004, 4, '2025-06-02 09:00:00', '2025-06-05 14:00:00', 'DHL'),
(1005, 5, '2025-06-01 18:00:00', '2025-06-02 10:00:00', 'UPS');
CREATE MATERIALIZED VIEW late_shipment_alerts AS
SELECT
o.order_id,
c.customer_name,
o.product_name,
s.carrier,
s.shipped_at,
s.delivered_at
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN shipments s ON o.order_id = s.order_id
WHERE s.delivered_at IS NULL
OR s.delivered_at - s.shipped_at > INTERVAL '2 days';
SELECT * FROM late_shipment_alerts ORDER BY order_id;
order_id | customer_name | product_name | carrier | shipped_at | delivered_at
----------+---------------+--------------+---------+---------------------+--------------------
3 | Alice | Monitor | FedEx | 2025-06-01 16:00:00 |
4 | Charlie | Mouse | DHL | 2025-06-02 09:00:00 | 2025-06-05 14:00:00
This view joins three tables and filters for shipments that are either undelivered or took more than two days to arrive. When a new shipment record arrives, the engine probes the join state built from orders and customers to produce the joined row, then applies the filter. If the shipment was delivered on time, it never appears in the materialized view.
How Does a Streaming Database Manage State and Storage?
Every operator in the dataflow graph maintains state: the hash aggregate keeps running totals, the join operator keeps hash tables for both sides, and the materialized view operator keeps the final result. This state can grow large. A join between two tables with millions of rows requires maintaining millions of entries in the join hash tables.
Streaming databases solve this with a tiered storage architecture. RisingWave, for example, uses a design with three tiers:
- In-memory buffer: Hot state lives in memory for fast access. Recent updates to hash tables and aggregates are served from here.
- Local disk cache: Warm state spills to local SSD using an LSM-tree structure (similar to RocksDB). This gives the engine access to state that does not fit in memory without hitting remote storage.
- Object storage: Cold state is persisted to S3, GCS, or Azure Blob Storage. This is the durable, persistent layer. All state is eventually checkpointed here.
This separation of compute and storage is a key architectural choice. Compute nodes can scale independently of storage. If you need more processing power, add compute nodes. If you need more state capacity, the object storage layer scales automatically.
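A highly simplified sketch of the tiering idea follows. The real engine uses an LSM tree with asynchronous uploads and block caches; here the "disk" and "object store" tiers are just dicts, and eviction is oldest-inserted-first:

```python
class TieredStateStore:
    """Toy tiered store: hot dict, spill-over 'disk', durable 'object store'."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = {}        # hot tier
        self.disk = {}          # stand-in for the local LSM-tree cache
        self.object_store = {}  # stand-in for S3/GCS: written at checkpoints

    def put(self, key, value):
        self.memory[key] = value
        if len(self.memory) > self.memory_capacity:
            # Evict the oldest-inserted entry to the disk tier.
            cold_key = next(iter(self.memory))
            self.disk[cold_key] = self.memory.pop(cold_key)

    def get(self, key):
        # Read path: memory first, then local disk, then object storage.
        for tier in (self.memory, self.disk, self.object_store):
            if key in tier:
                return tier[key]
        return None

    def checkpoint(self):
        # Persist a consistent copy of all state to durable storage.
        self.object_store.update(self.disk)
        self.object_store.update(self.memory)

s = TieredStateStore(memory_capacity=2)
for i, key in enumerate(["us-east", "us-west", "eu-west"]):
    s.put(key, {"count": i})
print(sorted(s.memory), sorted(s.disk))
# -> ['eu-west', 'us-west'] ['us-east']
```

The read path is the key property: an operator asks for state by key and does not care which tier answers, so working-set size, not total state size, determines how much memory a compute node needs.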
Checkpointing: ensuring consistency
State persistence happens through checkpointing. The meta node periodically injects barrier markers into the data streams (by default, every second in RisingWave). When a barrier reaches an operator, the operator flushes its current state to durable storage. This creates a consistent snapshot of the entire dataflow graph.
If a node fails, the system restores state from the most recent checkpoint and replays events from that point forward. This gives you exactly-once semantics: every event affects the persisted state exactly once, even in the face of failures.
The checkpointing mechanism is based on the Chandy-Lamport algorithm for distributed snapshots. Barriers flow through the dataflow graph alongside regular data. When a multi-input operator (like a join) receives barriers from all its inputs, it knows that all upstream state is consistent and can safely checkpoint its own state.
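The alignment rule for a multi-input operator can be sketched in a few lines. This toy version assumes two inputs with in-order delivery per input: data on an input that has already seen the barrier is buffered until the other input catches up, and the checkpoint fires once all barriers have arrived:

```python
class BarrierAligner:
    """Toy barrier alignment for a multi-input operator (e.g. a join)."""

    def __init__(self, num_inputs, on_checkpoint):
        self.num_inputs = num_inputs
        self.on_checkpoint = on_checkpoint  # called once all barriers arrive
        self.arrived = set()                # inputs whose barrier we have seen
        self.buffered = []                  # data held back on aligned inputs

    def on_event(self, input_id, event):
        if event == "BARRIER":
            self.arrived.add(input_id)
            if len(self.arrived) == self.num_inputs:
                # All upstream state up to this barrier is consistent:
                # safe to snapshot, then release the buffered events.
                self.on_checkpoint()
                released, self.buffered = self.buffered, []
                self.arrived.clear()
                return released
            return []
        if input_id in self.arrived:
            # This input is past the barrier; hold its data until alignment.
            self.buffered.append((input_id, event))
            return []
        return [(input_id, event)]  # before the barrier: process immediately

checkpoints = []
aligner = BarrierAligner(2, lambda: checkpoints.append("snapshot"))
aligner.on_event(0, "BARRIER")        # input 0 reaches the barrier first
aligner.on_event(0, {"row": 1})       # held back: arrived after the barrier
out = aligner.on_event(1, "BARRIER")  # alignment complete -> checkpoint
print(checkpoints, out)
# -> ['snapshot'] [(0, {'row': 1})]
```

The buffering is what guarantees the snapshot is consistent: no post-barrier event from one input can contaminate state that the checkpoint claims reflects only pre-barrier data.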
How Do Cascading Materialized Views Work?
One of the most powerful features of a streaming database is the ability to build materialized views on top of other materialized views. This creates a pipeline of incremental computations where each layer refines the data further.
Consider this example. The order_metrics view computes raw revenue by region. You can build a classification layer on top of it:
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT
region,
total_revenue,
CASE
WHEN total_revenue > 5000 THEN 'high'
WHEN total_revenue > 1000 THEN 'medium'
ELSE 'low'
END AS revenue_tier
FROM order_metrics;
SELECT * FROM revenue_by_region ORDER BY total_revenue DESC;
region | total_revenue | revenue_tier
---------+---------------+-------------
us-east | 6749.97 | high
eu-west | 1929.97 | medium
us-west | 369.97 | low
Now insert a high-value order in us-west:
INSERT INTO orders VALUES
(10, 104, 'Desktop', 2199.99, 'us-west', '2025-06-01 12:00:00');
Both materialized views update automatically, in a chain:
SELECT * FROM revenue_by_region ORDER BY total_revenue DESC;
region | total_revenue | revenue_tier
---------+---------------+-------------
us-east | 6749.97 | high
us-west | 2569.96 | medium
eu-west | 1929.97 | medium
The us-west region moved from low to medium without you touching either materialized view. The change propagated through two layers of computation: first order_metrics updated its sum, then revenue_by_region re-evaluated the CASE expression for the changed row.
In practice, teams build three or four layers of cascading views to separate concerns. A common pattern is:
- Layer 1: Raw ingestion and normalization (clean data from sources)
- Layer 2: Business logic (joins, aggregations, filters)
- Layer 3: Serving layer (formatted output for dashboards or APIs)
This is similar to the medallion architecture (bronze, silver, gold), but every layer updates incrementally in real time.
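The cascade itself is just operators wired output-to-input. A toy sketch of the two-layer chain above, where a region's updated sum flows from the aggregation layer straight into the classification layer (illustrative only; the rounding to two decimals stands in for DECIMAL arithmetic):

```python
def classify(total):
    """Layer 2: the CASE expression from revenue_by_region."""
    if total > 5000:
        return "high"
    if total > 1000:
        return "medium"
    return "low"

class RevenuePipeline:
    """Layer 1 (running sums per region) feeding layer 2 (tier labels)."""

    def __init__(self):
        self.totals = {}  # layer-1 state (order_metrics)
        self.tiers = {}   # layer-2 materialized result (revenue_by_region)

    def on_order(self, region, amount):
        # Layer 1 applies the delta to its running sum...
        total = round(self.totals.get(region, 0.0) + amount, 2)
        self.totals[region] = total
        # ...and the change propagates downstream to layer 2, which
        # re-evaluates the CASE expression for the changed row only.
        self.tiers[region] = classify(total)
        return region, total, self.tiers[region]

p = RevenuePipeline()
p.on_order("us-west", 369.97)
print(p.on_order("us-west", 2199.99))
# -> ('us-west', 2569.96, 'medium')
```

Because only changed rows flow between layers, adding a layer adds per-change work proportional to that layer's logic, not a full recomputation of everything beneath it.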
What Does the Node Architecture Look Like?
A production streaming database is a distributed system with specialized node types. RisingWave uses four types of nodes, each handling a distinct responsibility:
Serving nodes
Serving nodes handle client connections using the PostgreSQL wire protocol. This means you can connect with psql, any PostgreSQL driver, or tools like DBeaver and Grafana without special connectors. The serving node parses SQL, optimizes queries, and routes them appropriately. For ad-hoc SELECT queries against materialized views, the serving node reads directly from the storage layer. For CREATE MATERIALIZED VIEW statements, it generates a streaming execution plan and hands it off to the streaming nodes.
Streaming nodes
Streaming nodes run the actual dataflow graph. Each streaming node hosts a set of operators (sources, joins, aggregates, materialized view writers) and processes data continuously. Operators are distributed across streaming nodes using consistent hashing, so the load is spread evenly. When you scale out by adding more streaming nodes, the system redistributes operators automatically.
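One common variant of this distribution scheme uses virtual nodes: a stable hash maps each key to one of a fixed number of virtual partitions, and partitions are spread across physical nodes, so scaling out moves only some partitions rather than rehashing every key. A toy sketch (the partition count and node names are illustrative):

```python
import hashlib

NUM_VNODES = 16  # illustrative; real systems use e.g. 256 virtual partitions

def vnode_of(key):
    """Map a group key to a virtual node via a stable hash."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_VNODES

def assign_vnodes(nodes):
    """Spread the virtual nodes across the physical streaming nodes."""
    return {v: nodes[v % len(nodes)] for v in range(NUM_VNODES)}

mapping = assign_vnodes(["node-1", "node-2"])
owner = mapping[vnode_of("us-east")]  # which node holds us-east's state

# Scaling out reassigns only some vnodes; keys on unmoved vnodes stay put.
mapping = assign_vnodes(["node-1", "node-2", "node-3"])
```

Because `vnode_of` depends only on the key, every operator in the graph agrees on where a given group's state lives without consulting a central directory on the hot path.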
Meta node
The meta node is the coordinator. It manages metadata (which tables exist, which materialized views are running), orchestrates barrier injection for checkpointing, handles streaming job lifecycle (creating, pausing, dropping materialized views), and coordinates failure recovery. In a production deployment, the meta node runs with its own persistence layer for fault tolerance.
Compactor nodes
Because state is persisted using an LSM-tree structure, compaction is necessary to merge small files into larger ones and remove deleted entries. Compactor nodes handle this background work without affecting the performance of streaming or serving nodes. This separation means that compaction, which can be CPU-intensive, does not compete with query serving or stream processing for resources.
How Does a Streaming Database Connect to External Systems?
A streaming database does not exist in isolation. It ingests data from upstream systems and serves results to downstream consumers. The connection points fall into two categories.
Sources: where data comes in
Streaming databases typically support these ingestion patterns:
- Message brokers: Apache Kafka, Apache Pulsar, Amazon Kinesis, and Redpanda are common sources. The database acts as a consumer, reading events from topics and feeding them into the dataflow graph.
- CDC connectors: Change Data Capture from PostgreSQL, MySQL, and other databases. The database reads the write-ahead log (WAL) of the source database and ingests row-level changes. RisingWave supports direct CDC ingestion from PostgreSQL and MySQL without needing Debezium.
- Direct writes: You can INSERT data directly using SQL, as shown in the examples above. This is useful for prototyping and for systems that can make direct database connections.
Sinks: where results go
Materialized view results can be pushed to downstream systems:
- Databases: PostgreSQL, MySQL, or other OLTP databases for serving layers
- Data lakes: Apache Iceberg, Delta Lake, or Hudi for long-term analytical storage
- Message brokers: Back to Kafka or other brokers for further downstream processing
- Search engines: Elasticsearch or OpenSearch for full-text search use cases
The combination of sources and sinks makes the streaming database a processing hub in your data architecture. Data flows in from operational systems, gets transformed and enriched through materialized views, and flows out to wherever it needs to serve the business.
FAQ
What is a streaming database?
A streaming database is a database system that processes data continuously as it arrives rather than on-demand when queries are executed. It uses incremental view maintenance to keep query results up to date in real time without full table scans. You define queries as materialized views, and the database maintains the results automatically.
How is a streaming database different from a stream processing framework like Apache Flink?
A streaming database provides a full database experience with a SQL interface, a storage layer, and PostgreSQL-compatible connections. Stream processing frameworks like Apache Flink require you to write Java or Python programs, manage state backends, and connect separate serving layers. A streaming database combines the processing engine, state management, and serving layer into one system.
When should I use a streaming database instead of a batch pipeline?
Use a streaming database when your use case requires sub-second data freshness for analytics, alerting, or operational dashboards. Batch pipelines (Spark, dbt, Airflow) are better when hourly or daily latency is acceptable and your queries span very large historical datasets. Many teams use both: a streaming database for real-time views and a batch pipeline for historical analysis.
Does a streaming database replace my existing OLTP or OLAP database?
No. A streaming database complements your existing stack. Your OLTP database (PostgreSQL, MySQL) handles transactional workloads. Your OLAP database (Snowflake, BigQuery, ClickHouse) handles heavy analytical queries. The streaming database sits between them, ingesting changes from the OLTP layer, computing real-time transformations, and serving fresh results or pushing them to your OLAP or serving layer.
Conclusion
A streaming database is not a black box. It is a carefully layered system with distinct architectural components:
- Incremental computation engine that builds a dataflow graph from your SQL and processes only the changes, not the full dataset
- Stateful operators that maintain hash tables, aggregates, and join state across events
- Tiered storage with in-memory buffers, local disk cache, and object storage for durability
- Barrier-based checkpointing that creates consistent snapshots for fault tolerance
- Cascading materialized views that let you build multi-layer pipelines where every layer updates automatically
- Specialized nodes (serving, streaming, meta, compactor) that separate concerns and scale independently
Understanding these internals helps you make better decisions about when a streaming database fits your architecture and how to design your materialized views for optimal performance. The SQL interface means you do not need to learn a new programming model. If you know SQL, you already know how to use a streaming database.
Ready to try this yourself? Get started with RisingWave in 5 minutes with our quickstart guide.
Join our Slack community to ask questions and connect with other stream processing developers.

