{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the real-time data stack?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The real-time data stack is a set of tools that process and serve data with sub-second latency. It replaces the batch-oriented modern data stack (Fivetran, Snowflake, dbt, Looker) with a streaming-first architecture: CDC or Kafka at the ingestion layer, a streaming database for processing, a serving layer queryable in milliseconds, and optionally an Iceberg sink for historical storage. RisingWave is a streaming database that spans the processing and serving layers simultaneously."
      }
    },
    {
      "@type": "Question",
      "name": "What is the difference between the modern data stack and the real-time data stack?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The modern data stack (roughly 2018-2023) was built around batch loading: Fivetran or Airbyte extracted data, Snowflake or BigQuery stored it, dbt transformed it, and Looker visualized it. End-to-end latency was measured in hours. The real-time data stack replaces batch loading with CDC or event streaming (Kafka, Pulsar), replaces the warehouse as the processing layer with a streaming database like RisingWave, and serves results with millisecond latency instead of hours."
      }
    },
    {
      "@type": "Question",
      "name": "What tools make up the real-time data stack in 2026?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A typical real-time data stack in 2026 includes: Apache Kafka or Redpanda for event streaming, Debezium or native CDC connectors for change data capture, RisingWave as the streaming SQL processing and serving layer, Apache Iceberg on S3 for long-term analytical storage, and optionally ClickHouse for historical analytics queries. RisingWave connects to all these components as a first-class integration."
      }
    },
    {
      "@type": "Question",
      "name": "Where does RisingWave fit in the real-time data stack?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "RisingWave spans multiple layers of the real-time data stack simultaneously. It acts as the CDC processor (with built-in connectors for PostgreSQL, MySQL, MongoDB, and SQL Server), the streaming database (maintaining incremental materialized views over Kafka events and CDC streams), and the serving layer (queryable via PostgreSQL wire protocol on port 4566). It also writes to Apache Iceberg as a sink, connecting the streaming and storage layers."
      }
    },
    {
      "@type": "Question",
      "name": "Do I still need Kafka if I use RisingWave?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "It depends on your architecture. RisingWave reads from Kafka, Kinesis, and Pulsar as streaming sources, so if you already have Kafka, RisingWave integrates naturally. For CDC use cases from PostgreSQL, MySQL, MongoDB, or SQL Server, RisingWave's built-in CDC connectors let you skip Kafka entirely and go directly from the source database to RisingWave. Kafka remains valuable when you need a durable event bus shared by multiple consumers beyond RisingWave."
      }
    }
  ]
}
Three years ago, the term "modern data stack" meant Fivetran loading data into Snowflake, dbt transforming it, and Looker displaying it. The pipeline was batch. End-to-end latency was measured in hours. That was acceptable when the primary consumer of analytics was a human analyst refreshing a dashboard once a day.
That assumption no longer holds in 2026. AI agents query data every few seconds and make decisions based on what they find. LLMs are expensive to run, so pre-computing results and serving them instantly is preferable to running inference over stale snapshots. Apache Iceberg has standardized how streaming data lands in a lakehouse. The infrastructure for sub-second pipelines has matured to the point where it is no longer a specialist's game.
This article documents what the real-time data stack looks like in 2026: which tools belong at each layer, how they connect, and where the genuine trade-offs are.
What Changed from the Modern Data Stack
The modern data stack served its era well. The components were best-in-class for their time, the SQL-everywhere philosophy worked, and managed services reduced operational burden significantly. But the architecture had a structural constraint baked in from the beginning: it was designed for batch.
How the modern data stack worked (2018-2023):
- Fivetran or Airbyte extracted data from source systems on a schedule (every 15 minutes to every 24 hours)
- Snowflake, BigQuery, or Redshift stored the raw data
- dbt ran transformation models on a schedule to produce clean tables
- Looker, Metabase, or Tableau read from those clean tables
The best-case end-to-end latency was around 15 minutes. Typical pipelines ran in 1-4 hour cycles. For business intelligence, that was fine. For operational decisions, customer-facing features, or AI agents, it is not.
What the real-time data stack looks like (2024+):
- CDC connectors or event producers push changes into Kafka or directly into a streaming database the moment they happen
- A streaming database like RisingWave maintains continuously updated materialized views over those streams
- Applications, AI agents, and dashboards query the streaming database directly, getting answers that reflect the current state of the data
- Processed results are optionally written to Apache Iceberg for long-term storage and historical analytics
End-to-end latency drops from hours to under one second. The architecture is not more complex -- it is differently composed. The ETL pipeline disappears and is replaced by continuous view maintenance.
Three forces drove this transition:
AI agents require fresh data. An AI agent that answers customer questions, routes support tickets, or triggers automated actions needs to know what happened in the last few seconds, not the last few hours. Agents deployed in production in 2026 are a real consumer of real-time data, not a hypothetical one.
LLM costs encourage pre-computation. Running an LLM over raw data at query time is expensive. Pre-computing structured results in a streaming database and serving them at query time is much cheaper. The streaming database handles the continuous computation so the LLM only handles the reasoning.
Apache Iceberg standardized lakehouse storage. The debate between Delta Lake, Hudi, and Iceberg resolved largely in Iceberg's favor for new projects. With a standard format for long-term storage, streaming tools that write to Iceberg with exactly-once semantics can feed a lakehouse without a separate batch pipeline.
The Layers of the Real-Time Data Stack
Every layer of the stack has a clear function and a set of concrete tools that fill it. Here is how each layer works and which tools to consider at each.
Ingestion Layer: Event Streaming
The ingestion layer receives events from producers and makes them available to consumers. In the real-time data stack, this is a persistent, replayable event log.
Apache Kafka (open source from Apache Software Foundation, managed via Confluent Cloud or Amazon MSK) is the most widely deployed event streaming platform in 2026. Its partition model, consumer group semantics, and broad ecosystem support make it the default choice for organizations building a shared event bus.
Redpanda is a Kafka-compatible alternative written in C++ without JVM overhead. It offers lower operational complexity for organizations that do not want to manage JVM tuning and, because it speaks the Kafka protocol, works as a drop-in replacement with existing Kafka clients and consumers, including RisingWave.
Apache Pulsar is the third major option, with a multi-tenancy model that separates compute from storage. It is commonly used in organizations where topic isolation and geographic replication are primary concerns. RisingWave reads from Pulsar natively.
For most new projects, Kafka or Redpanda is the right choice. Pulsar is worth considering if multi-tenancy or cross-region replication is a hard requirement.
CDC Layer: Change Data Capture
The CDC layer captures row-level changes from source databases and makes them available as a stream. There are two approaches: routing through Kafka, or capturing directly into the streaming database.
Debezium is the most widely used CDC tool. It reads the binary replication log from PostgreSQL, MySQL, MongoDB, SQL Server, and other databases, and publishes change events to Kafka topics. Downstream consumers, including RisingWave, then read from those Kafka topics. Debezium is the right choice when you need a single CDC feed consumed by multiple downstream systems.
RisingWave built-in CDC connects directly to PostgreSQL, MySQL, MongoDB, and SQL Server without Kafka as an intermediary. The CDC pipeline is defined entirely in SQL. For use cases where RisingWave is the only consumer of CDC data, this removes Debezium and the associated Kafka topics from the architecture. The pipeline is simpler and has fewer moving parts.
The decision rule is straightforward: if multiple systems need to consume the same CDC stream, use Debezium and Kafka for fan-out. If RisingWave is the only destination, use the built-in CDC connectors and eliminate Kafka from that path.
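As a concrete sketch of the fan-out path, RisingWave can consume Debezium-formatted change events straight from a Kafka topic and materialize them as a table. The topic name and columns below are illustrative, not taken from any specific deployment:
-- Debezium changelog consumed from Kafka (topic and columns are illustrative)
CREATE TABLE orders_from_debezium (
    order_id BIGINT PRIMARY KEY,
    customer_id BIGINT,
    amount DECIMAL,
    updated_at TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'dbserver1.public.orders',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT DEBEZIUM ENCODE JSON;
The FORMAT DEBEZIUM clause tells RisingWave to interpret each Kafka message as an insert, update, or delete rather than as an append-only event.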
Processing Layer: Stream Computation
The processing layer consumes events from Kafka or CDC sources, applies transformations and aggregations, and maintains the results.
RisingWave is a streaming database (Apache 2.0) that maintains incremental materialized views. You write SQL to define the computation. As new events arrive, RisingWave updates only the rows affected by those events, not the entire dataset. The results are immediately queryable via PostgreSQL wire protocol. No separate serving layer is needed.
Apache Flink is the most powerful stream processor available. For complex stateful event processing, custom CEP (complex event processing) patterns, or workloads that require the Java DataStream API, Flink remains the right choice. Flink SQL handles most common patterns well. The key limitation is that Flink is not a database: results go to a sink, and you need a separate serving layer to query them.
The processing layer choice drives significant downstream consequences. Flink requires a downstream serving database (ClickHouse, PostgreSQL, Redis) to make results queryable. RisingWave eliminates that requirement by combining processing and serving. For SQL-expressible workloads, RisingWave reduces the architecture by one component.
Serving Layer: Query Interface
The serving layer is where applications, dashboards, and AI agents query the results of stream processing.
RisingWave serves materialized view results over the PostgreSQL wire protocol on port 4566. Any PostgreSQL client, driver, or tool connects to RisingWave and queries the current state of a materialized view with a standard SELECT statement. Because the view is maintained incrementally, the query returns immediately without triggering computation.
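In practice, connecting looks exactly like connecting to PostgreSQL. A minimal sketch, assuming a default self-hosted deployment (port 4566, database dev, user root; managed deployments have their own endpoints) and a materialized view like the active_users_per_page example built later in this article:
-- Connect with any PostgreSQL client, for example:
--   psql -h localhost -p 4566 -d dev -U root
-- Then query the view like an ordinary table:
SELECT page, active_users
FROM active_users_per_page
ORDER BY active_users DESC
LIMIT 10;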
Redis is the right choice for hot key-value lookups where the access pattern is a single-key read (get user profile, get session data, get feature vector). Redis does not support complex SQL queries, so it complements rather than replaces a streaming database.
ClickHouse is an OLAP database optimized for analytical queries over historical data. It does not maintain continuously updated views; it is efficient at scanning large datasets for aggregate queries. ClickHouse fits well in hybrid architectures where RisingWave handles the real-time serving and ClickHouse handles historical analytics over data that has been written to it by RisingWave or a Kafka consumer.
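When RisingWave feeds ClickHouse directly, a sink handles the handoff. A rough sketch, assuming the revenue_by_category view built later in this article and a reachable ClickHouse instance; the exact connector parameter names should be checked against the RisingWave sink documentation for your version:
-- Sketch: push processed rows from RisingWave into ClickHouse for historical analytics
CREATE SINK revenue_to_clickhouse
FROM revenue_by_category
WITH (
    connector = 'clickhouse',
    type = 'append-only',
    force_append_only = 'true',  -- emit inserts only; updates to the view are not replayed
    clickhouse.url = 'http://clickhouse:8123',
    clickhouse.user = 'default',
    clickhouse.password = '${CLICKHOUSE_PASSWORD}',
    clickhouse.database = 'analytics',
    clickhouse.table = 'revenue_by_category'
);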
Storage Layer: Long-Term Analytical Storage
The storage layer persists processed data for long-term retention, historical analysis, and re-processing.
Apache Iceberg is the standard table format for the streaming lakehouse in 2026. It supports schema evolution, time travel (query past snapshots), hidden partitioning, and atomic commits. RisingWave writes to Iceberg as a sink with exactly-once semantics, so every record that passes through the streaming layer lands in the lakehouse exactly once.
Amazon S3, Google Cloud Storage, and Azure Blob Storage are the object storage backends that Iceberg sits on top of. The cost is low (roughly $0.023 per GB per month on S3 standard), and the durability is high.
Data in Iceberg is queryable by engines like Apache Spark, Trino, and DuckDB for historical analysis. The streaming layer writes to Iceberg; the analytical layer reads from it. The two concerns are cleanly separated.
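For example, an ad-hoc historical query from DuckDB might look like the following. The bucket path and column names are hypothetical, and the credential setup depends on your catalog and object store:
-- DuckDB: query an Iceberg table on S3 (path is illustrative)
INSTALL iceberg; LOAD iceberg;
INSTALL httpfs; LOAD httpfs;
-- assumes S3 credentials are already configured (e.g. via CREATE SECRET)
SELECT category, SUM(revenue) AS total_revenue
FROM iceberg_scan('s3://analytics-lake/warehouse/store/revenue_by_category')
GROUP BY category
ORDER BY total_revenue DESC;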
AI and Agent Layer
The AI layer is where LLMs and autonomous agents interact with the real-time data stack.
MCP servers (Model Context Protocol) allow AI agents to query structured data sources using a standard protocol. RisingWave has an official MCP server (risingwavelabs/risingwave-mcp) that exposes materialized views to agents. An agent can query the current state of any materialized view, filter by conditions, and use the result in its reasoning or actions.
Vector support is required for semantic search and retrieval-augmented generation (RAG) workloads. RisingWave has a built-in vector(n) type, an HNSW index for approximate nearest-neighbor search, and an openai_embedding() function that converts text to embeddings at query or ingestion time. For stacks that already use RisingWave, adding a separate vector database for embedding search is unnecessary.
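As an illustrative sketch of what that looks like in SQL: the table, columns, and the pgvector-style distance operator below are assumptions, so check the RisingWave docs for the exact index DDL and distance syntax in your version:
-- Embeddings maintained continuously over a streaming table (schema is illustrative)
CREATE MATERIALIZED VIEW ticket_embeddings AS
SELECT
    ticket_id,
    subject,
    openai_embedding(subject || ' ' || body) AS embedding
FROM support_tickets;

-- Nearest-neighbor lookup for RAG or agent context
SELECT ticket_id, subject
FROM ticket_embeddings
ORDER BY embedding <-> openai_embedding('customer cannot reset password')
LIMIT 5;
Because the view is maintained incrementally, new tickets are embedded as they arrive rather than in a separate batch job.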
Architecture Diagram
The components of the real-time data stack connect in a clear flow:
                        REAL-TIME DATA STACK (2026)

 Source Systems                          Processing & Serving
+------------+           +------------+        +---------------------+
| PostgreSQL | --CDC---> |            |        |                     |
| MySQL      | --CDC---> | RisingWave | -----> | Materialized Views  | ---> Applications
| MongoDB    | --CDC---> |            |        | (PostgreSQL API)    | ---> AI Agents
+------------+           |            |        |                     | ---> Dashboards
                         |            |        +---------------------+
 Ingestion               |            |                   |
+------------+           |            |                   |
| Kafka      | --------> |            |                   |
| Redpanda   | --------> |            |                   v
| Pulsar     | --------> |            |        +---------------------+
+------------+           +------------+        | Apache Iceberg      |
                                                | (S3/GCS/ABS)        |
                                                +---------------------+
                                                           |
                                                           v
                                                 Historical Analytics
                                                (Spark, Trino, DuckDB)
RisingWave sits at the center of the diagram: it reads from source databases via CDC, reads from event streaming systems like Kafka, maintains materialized views, serves those views over PostgreSQL protocol, and writes to Iceberg for long-term storage.
Three Reference Architectures
Architecture 1: Pure Streaming
Best for teams building real-time applications with a single source of truth in the streaming database.
Stack: Kafka + RisingWave + Application
Data flow:
- Producers write events to Kafka topics (user actions, transactions, sensor readings)
- RisingWave reads from Kafka and maintains materialized views (per-user aggregates, rolling window metrics, real-time leaderboards)
- Applications connect to RisingWave via the PostgreSQL protocol and query the views directly
When to use it: When your primary use case is serving real-time query results to applications or dashboards and you do not need historical analytics beyond what fits in RisingWave's storage.
Trade-offs: No historical data layer. Long-term retention requires either keeping the data in RisingWave's own storage or adding an Iceberg sink.
Example SQL:
-- Kafka source
CREATE SOURCE user_events (
    user_id BIGINT,
    event_type VARCHAR,
    page VARCHAR,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'user_events',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

-- Real-time aggregation
CREATE MATERIALIZED VIEW active_users_per_page AS
SELECT
    page,
    COUNT(DISTINCT user_id) AS active_users,
    window_start,
    window_end
FROM TUMBLE(user_events, event_time, INTERVAL '1' MINUTE)
GROUP BY page, window_start, window_end;

-- Application queries
SELECT page, active_users
FROM active_users_per_page
WHERE window_end > NOW() - INTERVAL '5 minutes'
ORDER BY active_users DESC;
Architecture 2: Hybrid Real-Time and Historical
Best for teams that need sub-second latency for live data alongside deep historical analytics over months or years.
Stack: Kafka + RisingWave (real-time) + Iceberg (storage) + ClickHouse (historical analytics)
Data flow:
- Kafka receives events from producers
- RisingWave reads from Kafka and maintains real-time materialized views for application serving
- RisingWave also sinks processed data to Apache Iceberg on S3 with exactly-once semantics
- ClickHouse reads from Iceberg (or receives data via a separate Kafka consumer) for historical analytical queries
When to use it: When you need millisecond-latency serving for live data AND efficient analytical queries over 12+ months of history that would be expensive to keep in RisingWave.
Trade-offs: More components to operate. ClickHouse adds an additional system with its own operational surface.
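A sketch of the Iceberg sink step, reusing the active_users_per_page view from Architecture 1. The bucket path is hypothetical, and the catalog and credential parameters vary with catalog type and RisingWave version, so treat the WITH clause as a template rather than a copy-paste configuration:
-- Sketch: continuously write view changes to an Iceberg table on S3
CREATE SINK active_users_to_iceberg
FROM active_users_per_page
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'page,window_start,window_end',
    warehouse.path = 's3://analytics-lake/warehouse',
    database.name = 'store',
    table.name = 'active_users_per_page',
    catalog.type = 'storage',
    s3.region = 'us-east-1',
    s3.access.key = '${AWS_ACCESS_KEY_ID}',
    s3.secret.key = '${AWS_SECRET_ACCESS_KEY}'
);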
Architecture 3: AI-Native
Best for teams building AI agents or LLM-powered applications that need real-time context.
Stack: CDC + RisingWave + MCP server + AI Agent
Data flow:
- RisingWave connects directly to PostgreSQL or MySQL via built-in CDC, no Kafka required
- Materialized views compute real-time features: customer LTV, recent order history, product availability, embeddings of recent support tickets
- The risingwavelabs/risingwave-mcp server exposes those views to AI agents via the Model Context Protocol
- AI agents query RisingWave for current context before reasoning or taking action
When to use it: When the primary consumer of real-time data is an AI agent rather than a human-facing dashboard. Also useful when you want to eliminate the vector database by using RisingWave's built-in vector support.
Example: real-time customer context for an AI agent:
-- Direct CDC from PostgreSQL, no Kafka needed
CREATE SOURCE pg_source WITH (
    connector = 'postgres-cdc',
    hostname = 'prod-postgres.internal',
    port = '5432',
    username = 'risingwave',
    password = '${POSTGRES_PASSWORD}',
    database.name = 'ecommerce',
    schema.name = 'public',
    publication.name = 'risingwave_pub'
);

CREATE TABLE customers (...) FROM pg_source TABLE 'public.customers';
CREATE TABLE orders (...) FROM pg_source TABLE 'public.orders';
CREATE TABLE support_tickets (...) FROM pg_source TABLE 'public.support_tickets';
-- Real-time customer profile for AI agent context
CREATE MATERIALIZED VIEW customer_context AS
SELECT
    c.customer_id,
    c.name,
    c.email,
    COALESCE(o.total_orders, 0) AS total_orders,
    COALESCE(o.lifetime_value, 0) AS lifetime_value,
    o.last_order_at,
    COALESCE(st.open_tickets, 0) AS open_tickets
FROM customers c
-- Pre-aggregating per customer keeps the two LEFT JOINs from fanning out and inflating the counts
LEFT JOIN (
    SELECT customer_id,
           COUNT(order_id) AS total_orders,
           SUM(amount) AS lifetime_value,
           MAX(created_at) AS last_order_at
    FROM orders
    GROUP BY customer_id
) o ON c.customer_id = o.customer_id
LEFT JOIN (
    SELECT customer_id,
           COUNT(ticket_id) FILTER (WHERE status = 'open') AS open_tickets
    FROM support_tickets
    GROUP BY customer_id
) st ON c.customer_id = st.customer_id;
The AI agent queries customer_context via the MCP server before each interaction, getting a view that reflects all CDC changes in real time.
Why RisingWave Spans Multiple Layers
Most tools in the data stack occupy a single layer. Kafka is an ingestion and event bus system. ClickHouse is a serving and analytics system. Debezium is a CDC system. Each does one thing and requires a handoff to the next component.
RisingWave is designed differently. It occupies three layers simultaneously:
- CDC layer: Built-in connectors for PostgreSQL, MySQL, MongoDB, and SQL Server. No Debezium, no Kafka required for database change capture.
- Processing layer: Streaming SQL with incremental materialized views. Aggregations, joins, window functions, and CDC processing all happen inside RisingWave.
- Serving layer: PostgreSQL wire protocol on port 4566. Any application that connects to PostgreSQL connects to RisingWave the same way.
It also connects to the storage layer via Iceberg sinks. In practice, RisingWave can replace three components in a typical data stack (a CDC system, a stream processor, and a serving database) with a single system.
This matters not just for simplicity but for correctness. When CDC processing, stream computation, and serving all happen inside one system, there are fewer consistency boundaries to reason about. An event that arrives via CDC is reflected in materialized view queries at the next read, within milliseconds. There is no transfer between systems where data can get lost, duplicated, or delayed.
Tools Not to Use in 2026
Honest recommendations include knowing what to avoid.
Do not use Spark Structured Streaming for sub-second latency requirements. Spark Structured Streaming uses a micro-batch execution model. Each micro-batch is a small Spark job. Even with aggressive configuration, latency is measured in seconds, not milliseconds. If your use case requires sub-second freshness (live dashboards, real-time fraud detection, AI agent context), Spark Structured Streaming cannot deliver it. For batch-adjacent streaming with latency requirements above five seconds, Spark remains a valid choice, particularly if you already have Spark infrastructure.
Do not use Snowflake as a real-time serving layer. Snowflake is an excellent data warehouse for analytical queries over historical data. Dynamic tables in Snowflake refresh on a schedule measured in minutes, not seconds. Connecting an application to Snowflake and expecting fresh data is a category error. Snowflake belongs in the historical analytics path after data lands in Iceberg, not in the real-time serving path.
Do not add a separate vector database if you are already on RisingWave. Pinecone, Weaviate, and similar vector databases are well-built for pure vector search workloads. But if your stack already uses RisingWave for streaming SQL, adding a separate vector database for embedding search duplicates infrastructure and introduces a synchronization problem: keeping the vector database in sync with the streaming data in RisingWave. RisingWave's built-in vector(n) type, HNSW index, and openai_embedding() function handle semantic search over streaming data without an additional system.
Do not use ksqlDB as a general-purpose streaming database. ksqlDB is useful for basic transformations over Kafka topics within the Confluent ecosystem. It does not support arbitrary SQL sources, its join capabilities are limited, and the full feature set requires a Confluent Cloud subscription. For anything beyond simple Kafka topic transformations, a proper streaming database is more appropriate.
A Complete Real-Time Stack in SQL
Here is a miniature real-time data stack defined entirely in SQL, combining CDC, Kafka, joins, and materialized view serving:
-- Step 1: CDC source from PostgreSQL (products catalog)
CREATE SOURCE pg_source WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres.internal',
    port = '5432',
    username = 'rw',
    password = '${POSTGRES_PASSWORD}',
    database.name = 'store',
    schema.name = 'public',
    publication.name = 'rw_pub'
);

CREATE TABLE products (
    product_id BIGINT PRIMARY KEY,
    name VARCHAR,
    category VARCHAR,
    price DECIMAL
) FROM pg_source TABLE 'public.products';

-- Step 2: Kafka source for real-time purchase events
CREATE SOURCE purchases_stream (
    purchase_id BIGINT,
    product_id BIGINT,
    user_id BIGINT,
    quantity INT,
    purchase_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'purchases',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

-- Step 3: Materialized view joining CDC table with Kafka stream
CREATE MATERIALIZED VIEW purchase_enriched AS
SELECT
    ps.purchase_id,
    ps.user_id,
    ps.quantity,
    ps.purchase_time,
    p.name AS product_name,
    p.category,
    p.price,
    ps.quantity * p.price AS total_amount
FROM purchases_stream ps
JOIN products p ON ps.product_id = p.product_id;

-- Step 4: Real-time revenue by category (tumble window)
CREATE MATERIALIZED VIEW revenue_by_category AS
SELECT
    category,
    SUM(total_amount) AS revenue,
    COUNT(*) AS purchase_count,
    window_start,
    window_end
FROM TUMBLE(purchase_enriched, purchase_time, INTERVAL '5' MINUTE)
GROUP BY category, window_start, window_end;

-- Step 5: Applications query directly -- always fresh, millisecond latency
SELECT category, revenue, purchase_count
FROM revenue_by_category
WHERE window_end > NOW() - INTERVAL '10 minutes'
ORDER BY revenue DESC;
This stack has three components total: PostgreSQL (the source), Kafka (the event bus), and RisingWave (processing plus serving). There is no separate transformation layer, no serving database, and no batch pipeline.
Cost Considerations
A realistic mid-scale real-time data stack (processing roughly 100,000 events per second, maintaining 50 materialized views) has the following approximate costs in 2026:
Event streaming (Kafka): Confluent Cloud charges based on throughput and storage. For 100,000 events per second at an average of 1KB per event, expect roughly $2,000-5,000 per month on Confluent Cloud, depending on retention settings and regional pricing. Self-managed Kafka on cloud VMs with 3-6 broker nodes runs $800-2,000 per month in compute, plus storage.
Streaming database (RisingWave): RisingWave is open source (Apache 2.0). Self-managed on Kubernetes, a mid-scale deployment uses 4-8 compute nodes with 16 CPU cores and 64GB RAM each. Cloud compute cost is roughly $2,000-4,000 per month depending on cloud provider and region. RisingWave Cloud (managed) is available with per-resource pricing.
Storage (Apache Iceberg on S3): S3 Standard pricing is approximately $0.023 per GB per month. At 10TB of processed data, that is roughly $230 per month for storage. Data transfer and request costs add another 10-20%.
Historical analytics (ClickHouse, optional): ClickHouse Cloud starts at roughly $500-1,500 per month for mid-scale deployments. Self-managed ClickHouse on cloud VMs is comparable to RisingWave compute costs.
Total for a mid-scale hybrid architecture: $5,000-12,000 per month, depending on whether you use managed or self-managed services and whether you include a ClickHouse layer. The batch-oriented modern data stack at similar scale on Snowflake could run $8,000-20,000 per month for warehouse compute alone, not counting the ETL tools.
The real-time data stack is not necessarily more expensive than the batch alternative. The cost profile is different: more compute for continuous processing, less compute for large batch scans.
Conclusion
The real-time data stack in 2026 is not a futuristic concept. It is a set of production-ready tools with clear roles, honest trade-offs, and a growing track record at scale. The batch-first assumptions of the modern data stack made sense when the primary consumers of data were humans refreshing dashboards. Those assumptions do not hold for AI agents, real-time applications, or any use case where decisions happen at the speed of events.
The architecture is: CDC or Kafka at the ingestion layer, RisingWave at the processing and serving layer, Apache Iceberg at the storage layer, and optionally ClickHouse for deep historical analytics. RisingWave's position at the center of the stack, spanning CDC, stream processing, and serving, reduces the number of components that need to be operated, deployed, and coordinated.
The complexity is not gone. Understanding window semantics, watermarks, state management, and exactly-once delivery still requires real knowledge. But the tools have matured enough that this knowledge is learnable and applicable without building custom infrastructure from scratch.

