{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is RAG architecture in 2026?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "RAG (Retrieval-Augmented Generation) architecture in 2026 is the infrastructure that retrieves relevant context from a knowledge base and injects it into an LLM prompt at query time. The dominant evolution in 2026 is moving from batch re-indexing (nightly or hourly) to streaming re-indexing, where only changed documents are re-embedded as soon as they change. This produces fresher retrieval results at a fraction of the embedding API cost of full nightly re-indexing."
      }
    },
    {
      "@type": "Question",
      "name": "What is streaming RAG?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Streaming RAG is a retrieval-augmented generation architecture where the embedding pipeline is driven by document change events rather than a schedule. When a document is created or updated in the source database, CDC (Change Data Capture) captures that change and sends it to a streaming database like RisingWave. RisingWave re-embeds only the changed document and updates the vector index. The rest of the corpus is untouched. The result is sub-second embedding freshness with embedding API costs proportional to the change rate, not the corpus size."
      }
    },
    {
      "@type": "Question",
      "name": "How do I keep RAG embeddings fresh automatically?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The most reliable approach is to define embeddings as a materialized view in RisingWave over a CDC source table. RisingWave connects to your PostgreSQL (or MySQL, MongoDB, SQL Server) database via CDC, captures every INSERT and UPDATE, and applies the openai_embedding() function to compute new embeddings for changed rows. The materialized view stays continuously updated. You do not write any scheduling logic or polling code."
      }
    },
    {
      "@type": "Question",
      "name": "Is nightly RAG re-indexing acceptable in 2026?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "For static content that changes rarely, nightly re-indexing is acceptable. For any knowledge base that changes daily, including product documentation, support articles, pricing information, policy documents, and internal wikis, nightly re-indexing is a design failure in 2026. During the gap between runs, the LLM answers questions based on outdated context and generates responses that are factually incorrect relative to the current state of the knowledge base."
      }
    },
    {
      "@type": "Question",
      "name": "What is the cost difference between batch RAG and streaming RAG?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Batch RAG costs are proportional to the full corpus size, paid on every re-indexing run. If you have 50,000 documents and re-index nightly, you pay for 50,000 embedding API calls per day regardless of how many documents changed. Streaming RAG costs are proportional to the change rate. If 500 documents change per day, you pay for 500 embedding API calls. For a corpus where 1% of documents change daily, streaming RAG is approximately 100x cheaper on embedding API costs."
      }
    }
  ]
}
What RAG Is and Why It Became the Dominant LLM Pattern
Retrieval-Augmented Generation (RAG) emerged as the dominant approach for grounding LLM responses in specific, controlled knowledge because it solved the core problem with raw language models: their knowledge is frozen at training time.
Fine-tuning a model on your specific domain is expensive, slow, and produces a model that is stale within weeks as your knowledge base changes. RAG takes a different approach: keep the base model unchanged, and at query time retrieve the most relevant content from your knowledge base and inject it into the prompt. The model reasons over fresh context it has never been trained on.
This approach scales well. Adding new knowledge means adding documents to the vector index, not retraining a model. The LLM generalizes over whatever context you provide, so a single base model can handle many different knowledge domains. And because retrieval is decoupled from the model, you can swap models without rebuilding your knowledge base.
The pattern became standard across support automation, internal knowledge management, product documentation, legal research, and any application where the LLM needs to know about a specific, frequently-changing body of content.
But the standard implementation of RAG carries an assumption that most teams accept without examining: that the vector index can be batch-refreshed on a schedule and still produce useful results. In 2026, that assumption is increasingly untenable.
The Standard RAG Architecture and Its Problem
The standard RAG pipeline works like this:
- A batch job runs on a schedule (nightly, hourly, or every few hours)
- The job fetches all documents from the source database
- Each document is chunked into smaller segments
- Each chunk is passed to an embedding model to produce a vector
- The vectors are written to a vector index
- At query time, the user's question is embedded and used to retrieve the most similar chunks
- The retrieved chunks are injected into the LLM prompt
This architecture is straightforward to implement and works well for static content. The problem emerges when documents change between re-indexing runs.
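The batch loop above can be sketched in a few lines of Python. Everything here is an illustrative stand-in, not a real pipeline: fetch_all_documents, embed_chunk, and index represent the source database, the embedding API client, and the vector index.

```python
# Hypothetical sketch of the nightly batch re-index loop described above.
# fetch_all_documents, embed_chunk, and index stand in for the real
# database, embedding API, and vector index clients.

def chunk(text, size=500):
    # Step 3: naive fixed-size chunking by characters
    return [text[i:i + size] for i in range(0, len(text), size)]

def batch_reindex(fetch_all_documents, embed_chunk, index):
    """Re-embed the ENTIRE corpus, changed or not. Returns API call count."""
    calls = 0
    for doc in fetch_all_documents():                     # step 2: full fetch
        for n, seg in enumerate(chunk(doc["content"])):   # step 3: chunking
            index[(doc["doc_id"], n)] = embed_chunk(seg)  # steps 4-5: embed, write
            calls += 1
    return calls  # proportional to corpus size, not to how much changed

docs = [{"doc_id": "a", "content": "x" * 1200}]
print(batch_reindex(lambda: docs, lambda seg: [0.0], {}))  # 3 chunks embedded
```

The return value makes the cost structure visible: every run pays for the full corpus, regardless of what actually changed.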
The Staleness Gap
Consider a company running a nightly RAG re-index at midnight. An editor updates a refund policy at 2 PM. For the next ten hours, every customer support agent query about refund policies retrieves the old version of that document and generates an answer based on the old policy. The LLM does not know the policy changed. It confidently answers based on stale context.
This is not a fringe scenario. Product documentation is updated after releases. Pricing changes throughout the day. Support articles are edited when the product changes. Legal and compliance documents are revised as regulations evolve. Internal wikis reflect decisions that were made last week, not the current state of projects.
The staleness gap is the time between when a document changes and when that change is reflected in the vector index. In a nightly batch architecture, the staleness gap is up to 24 hours. In an hourly batch architecture, it is up to 60 minutes. No matter how short you make the interval, batch re-indexing always has a gap.
The Cost Problem with Full Re-Indexing
Batch RAG architectures have a cost structure that does not match the access pattern of document updates.
Embedding API calls are priced per token. If you have 50,000 documents averaging 2,000 words each, a full re-index costs roughly 50,000 embedding calls. If you run this nightly, you pay for 50,000 calls every day. The cost scales with corpus size, not with how many documents actually changed.
In a typical knowledge base, the change rate is far lower than the corpus size. If 500 documents change per day in a 50,000-document corpus, a nightly full re-index calls the embedding API 100 times more than necessary. The 49,500 documents that did not change are re-embedded at full cost.
For small corpora this is acceptable. For large enterprise knowledge bases with hundreds of thousands of documents, the embedding cost of nightly full re-indexing is a significant line item, and most of that cost is pure waste.
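The arithmetic is easy to verify. The token count and per-token price below are illustrative assumptions for the calculation, not quoted OpenAI rates:

```python
# Back-of-envelope batch vs streaming embedding cost.
# Token count and price are assumptions for illustration, not quoted rates.
CORPUS_SIZE = 50_000        # documents in the knowledge base
DAILY_CHANGES = 500         # documents changed per day (1% change rate)
TOKENS_PER_DOC = 2_600      # roughly 2,000 words
PRICE_PER_TOKEN = 0.02 / 1_000_000  # assumed $/token for a small embedding model

def daily_cost(embedding_calls):
    return embedding_calls * TOKENS_PER_DOC * PRICE_PER_TOKEN

batch = daily_cost(CORPUS_SIZE)        # full nightly re-index
streaming = daily_cost(DAILY_CHANGES)  # only changed documents
print(f"batch ${batch:.2f}/day vs streaming ${streaming:.3f}/day "
      f"({batch / streaming:.0f}x)")
```

Whatever the actual per-token price, it cancels out of the ratio: the cost gap is simply corpus size divided by daily change rate.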
What Changed in 2026
Three specific developments converged to make streaming RAG practical in 2026.
Streaming databases added native embedding functions. Two years ago, if you wanted to compute embeddings inside a database query, you needed to call an external service from application code. In 2026, RisingWave includes openai_embedding(), a built-in function that calls the OpenAI embedding API and returns a vector. You can reference it directly in a CREATE MATERIALIZED VIEW statement. This means the embedding step can live inside the database layer, executed automatically when documents change, without any external orchestration.
CDC became low-friction. Change Data Capture, the mechanism for capturing database changes as they happen, used to require significant infrastructure: a Kafka cluster, a Debezium deployment, connector configuration, schema registry management. RisingWave in 2026 connects directly to the PostgreSQL write-ahead log without any of that intermediate infrastructure. A single CREATE SOURCE statement opens a CDC connection. This made the "stream document changes into an embedding pipeline" pattern trivially easy to implement.
The cost and freshness case became impossible to ignore. As more teams ran nightly RAG pipelines in production and measured their actual behavior, the combination of stale embeddings causing quality degradation and redundant embedding calls inflating costs became a documented problem, not a theoretical concern. The streaming RAG pattern emerged as the architectural response.
The Streaming RAG Architecture
The streaming RAG architecture replaces the batch re-indexing loop with a continuous pipeline driven by document change events.
PostgreSQL (documents) --> CDC --> RisingWave --> openai_embedding() in MV --> vector search
When a document is created or updated in PostgreSQL, the change flows through CDC into RisingWave within milliseconds. RisingWave applies the openai_embedding() function to the changed document and updates the materialized view that holds the embeddings. Only the changed document is re-embedded. The rest of the corpus is untouched.
At query time, the agent or application embeds the user's question and runs a vector similarity query against the materialized view directly in RisingWave. Because RisingWave uses the PostgreSQL wire protocol, no specialized vector database client is needed.
This architecture has four properties that the batch approach lacks:
- Event-driven freshness: embeddings update when documents change, not on a schedule
- Change-proportional cost: embedding API calls scale with the number of changed documents, not corpus size
- No orchestration layer: the streaming pipeline is defined in SQL and managed by RisingWave, with no cron jobs, Airflow DAGs, or custom scripts
- Unified serving: the same system stores embeddings and serves similarity queries, eliminating the synchronization problem of maintaining a separate vector database
Complete SQL Implementation
Step 1: Prepare PostgreSQL for CDC
CDC from PostgreSQL requires logical replication. Set wal_level = logical in postgresql.conf, then create the documents table and enable publication:
-- In PostgreSQL: enable CDC publication
ALTER SYSTEM SET wal_level = 'logical';
-- Restart PostgreSQL after this change
-- Create the documents table
CREATE TABLE documents (
    doc_id TEXT PRIMARY KEY,
    title TEXT,
    content TEXT,
    category TEXT,
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Create publication for RisingWave CDC
CREATE PUBLICATION risingwave_docs_pub FOR TABLE documents;
Step 2: Connect RisingWave to PostgreSQL via CDC
In RisingWave, create a CDC source that reads from the PostgreSQL write-ahead log:
-- In RisingWave: create the CDC source
CREATE SOURCE docs_source WITH (
    connector = 'postgres-cdc',
    hostname = 'localhost',
    port = '5432',
    username = 'postgres',
    password = '${PG_PASSWORD}',
    database.name = 'knowledge_base',
    schema.name = 'public',
    publication.name = 'risingwave_docs_pub'
);
-- Declare the table structure in RisingWave
CREATE TABLE documents (
    doc_id TEXT PRIMARY KEY,
    title TEXT,
    content TEXT,
    category TEXT,
    updated_at TIMESTAMPTZ
) FROM docs_source TABLE 'public.documents';
Step 3: Create the Embedding Materialized View
This materialized view calls openai_embedding() for every document and stores the result as a vector(1536). When a document changes in PostgreSQL, CDC propagates the change to the documents table in RisingWave, and RisingWave updates the materialized view for that specific row.
-- In RisingWave: materialized view that auto-embeds changed documents
CREATE MATERIALIZED VIEW document_embeddings AS
SELECT
    doc_id,
    title,
    category,
    updated_at,
    openai_embedding(
        '${OPENAI_API_KEY}',
        'text-embedding-3-small',
        title || '. ' || content
    )::vector(1536) AS embedding
FROM documents;
Note that the embedding text combines the title and the content. For longer documents, you will embed chunks rather than full documents (see the chunking section below).
Step 4: Create the HNSW Index
HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor algorithm that enables fast similarity search. Create an HNSW index on the embedding column:
-- In RisingWave: create HNSW index for fast cosine similarity search
CREATE INDEX doc_embedding_idx ON document_embeddings
USING hnsw (embedding vector_cosine_ops);
With this index in place, vector similarity queries scan a small candidate set rather than the full embedding table, producing millisecond query times even for large corpora.
Step 5: Document the Materialized View for Agent Discoverability
If AI agents will query this materialized view through MCP, add comments that describe what it contains and how to use it:
COMMENT ON MATERIALIZED VIEW document_embeddings IS
'Embedding vectors for all knowledge base documents.
Query using cosine similarity (<=>). Filter by category for domain-specific retrieval.
Embeddings update within seconds of document changes in PostgreSQL.';
COMMENT ON COLUMN document_embeddings.embedding IS
'text-embedding-3-small 1536-dimensional vector. Use <=> for cosine similarity.';
Retrieval at Query Time
At query time, you embed the user's question and run a vector similarity query against the materialized view. Because RisingWave uses the PostgreSQL wire protocol, you can use any PostgreSQL client library.
import os

import openai
import psycopg2

# RisingWave speaks the PostgreSQL wire protocol, so a standard client works
rw_conn = psycopg2.connect(
    host="localhost",
    port=4566,
    database="dev",
    user="root",
    password=""
)
openai_client = openai.Client(api_key=os.environ["OPENAI_API_KEY"])

def retrieve(query: str, category: str | None = None, top_k: int = 5):
    # Embed the query using the same model as the documents
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    # Render the embedding as a '[...]' literal so the ::vector cast applies;
    # passing the raw Python list would produce an ARRAY literal instead
    query_embedding = str(response.data[0].embedding)
    # Join back to the documents table so each retrieved row carries the
    # content text the LLM will reason over
    if category:
        sql = """
            SELECT
                e.doc_id,
                e.title,
                d.content,
                1 - (e.embedding <=> %s::vector) AS similarity
            FROM document_embeddings e
            JOIN documents d ON d.doc_id = e.doc_id
            WHERE e.category = %s
            ORDER BY e.embedding <=> %s::vector
            LIMIT %s
        """
        params = (query_embedding, category, query_embedding, top_k)
    else:
        sql = """
            SELECT
                e.doc_id,
                e.title,
                d.content,
                1 - (e.embedding <=> %s::vector) AS similarity
            FROM document_embeddings e
            JOIN documents d ON d.doc_id = e.doc_id
            ORDER BY e.embedding <=> %s::vector
            LIMIT %s
        """
        params = (query_embedding, query_embedding, top_k)
    with rw_conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()

def generate_answer(question: str, category: str | None = None):
    results = retrieve(question, category=category)
    # The context must include the document content, not just titles:
    # this is what the LLM actually reasons over
    context_blocks = "\n\n".join(
        f"Document: {title}\nSimilarity: {similarity:.3f}\n{content}"
        for _, title, content, similarity in results
    )
    # Pass context_blocks to your LLM of choice alongside the question
    return context_blocks, [doc_id for doc_id, _, _, _ in results]
The <=> operator computes cosine distance. Subtracting from 1 converts it to cosine similarity, where higher values indicate more similar documents. The ORDER BY embedding <=> %s::vector LIMIT %s pattern uses the HNSW index for approximate nearest neighbor retrieval.
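The distance-to-similarity conversion can be verified with plain Python, independent of any database:

```python
import math

def cosine_distance(a, b):
    """What the <=> operator computes: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms

query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # distance 0.0 -> similarity 1.0
orthogonal = [0.0, 3.0]       # distance 1.0 -> similarity 0.0

for doc in (same_direction, orthogonal):
    dist = cosine_distance(query, doc)
    print(f"distance={dist:.1f} similarity={1 - dist:.1f}")
```

Note that cosine distance ignores vector magnitude: a document pointing in the same direction as the query scores similarity 1.0 regardless of length, which is why it is the standard metric for normalized text embeddings.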
Chunking Strategies in 2026
Documents longer than the embedding model's context window must be split into chunks. Each chunk is embedded separately, and retrieval operates at the chunk level. This affects how you structure the source tables and the materialized view.
Fixed-Size Chunking
The simplest approach splits documents into overlapping fixed-size segments. Store chunks in a separate table in PostgreSQL:
-- In PostgreSQL: store document chunks
CREATE TABLE document_chunks (
    chunk_id TEXT PRIMARY KEY,  -- e.g. "{doc_id}_chunk_{n}"
    doc_id TEXT NOT NULL REFERENCES documents(doc_id),
    chunk_index INT NOT NULL,
    chunk_text TEXT NOT NULL,
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE PUBLICATION risingwave_chunks_pub FOR TABLE document_chunks;
In your application, when a document is updated, recompute its chunks and replace the old chunk rows. RisingWave CDC captures those row-level changes and updates only the affected chunk embeddings.
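That application-side step might look like the following sketch. chunk_with_overlap and chunk_rows are hypothetical helpers; the actual DELETE/INSERT (or upsert) of the rows into PostgreSQL is left to your data access layer:

```python
# Illustrative application-side chunking for one updated document.
# The write to PostgreSQL is omitted; deterministic chunk_ids of the form
# "{doc_id}_chunk_{n}" let an upsert on chunk_id replace old chunk rows.

def chunk_with_overlap(text, size=800, overlap=100):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def chunk_rows(doc_id, content):
    """Rows to write to document_chunks for one updated document."""
    return [
        {"chunk_id": f"{doc_id}_chunk_{n}", "doc_id": doc_id,
         "chunk_index": n, "chunk_text": seg}
        for n, seg in enumerate(chunk_with_overlap(content))
    ]

rows = chunk_rows("doc-42", "x" * 2000)
print(len(rows), rows[0]["chunk_id"])  # 3 doc-42_chunk_0
```

Because the chunk_ids are deterministic, replacing a document's chunks produces row-level UPDATEs (and DELETEs for chunks that no longer exist), which is exactly what CDC propagates to RisingWave.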
In RisingWave:
-- CDC source and table for chunks
CREATE SOURCE chunks_source WITH (
    connector = 'postgres-cdc',
    hostname = 'localhost',
    port = '5432',
    username = 'postgres',
    password = '${PG_PASSWORD}',
    database.name = 'knowledge_base',
    schema.name = 'public',
    publication.name = 'risingwave_chunks_pub'
);

CREATE TABLE document_chunks (
    chunk_id TEXT PRIMARY KEY,
    doc_id TEXT,
    chunk_index INT,
    chunk_text TEXT,
    updated_at TIMESTAMPTZ
) FROM chunks_source TABLE 'public.document_chunks';
-- Materialized view with per-chunk embeddings
CREATE MATERIALIZED VIEW chunk_embeddings AS
SELECT
    c.chunk_id,
    c.doc_id,
    c.chunk_index,
    d.title,
    d.category,
    openai_embedding(
        '${OPENAI_API_KEY}',
        'text-embedding-3-small',
        c.chunk_text
    )::vector(1536) AS embedding,
    c.updated_at
FROM document_chunks c
JOIN documents d ON d.doc_id = c.doc_id;
CREATE INDEX chunk_embedding_idx ON chunk_embeddings
USING hnsw (embedding vector_cosine_ops);
This structure means a single document update in PostgreSQL triggers re-embedding only for the chunks of that document. Other documents are unaffected.
Semantic Chunking
Semantic chunking splits documents at natural boundaries: paragraphs, sections, or headings. For Markdown or HTML content, these boundaries are explicit. For plain text, boundary detection requires sentence parsing.
The streaming RAG approach works the same way for semantic chunks: store chunks in PostgreSQL, let CDC capture changes, and let RisingWave re-embed only the changed chunks. The chunking logic lives in the application layer that writes to PostgreSQL.
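For Markdown content, the boundary split can be as simple as a regular expression. The semantic_chunks function below is an illustrative sketch that splits at headings and blank lines, not a production parser:

```python
import re

def semantic_chunks(text):
    """Split Markdown-ish text at headings and blank lines (illustrative)."""
    parts = re.split(r"\n(?=#)|\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = ("# Refunds\nFull refund within 30 days.\n\n"
       "Store credit after 30 days.\n"
       "# Shipping\nShips in 2 days.")
print(len(semantic_chunks(doc)))  # 3
```

Each resulting chunk is a coherent unit (a heading with its first paragraph, or a standalone paragraph), which tends to retrieve better than fixed-size windows that cut sentences in half.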
Handling Multi-Modal RAG Honestly
Multi-modal RAG involves retrieving context across content types: text documents, images, tables, and code. The question of what RisingWave handles and what requires external processing is worth addressing directly.
RisingWave's openai_embedding() function calls the OpenAI embedding API with a text input and returns a vector(1536). This handles text embedding natively. OpenAI's text-embedding-3 models accept only text, so images cannot be embedded through openai_embedding(); image embeddings must be computed externally with a multimodal embedding model (a CLIP-family model, for example) and inserted as pre-computed vectors.
For pre-computed embeddings (for example, embeddings you computed outside RisingWave using a custom model), you can insert them directly as vector(n) values. RisingWave stores and indexes them exactly the same way as embeddings computed by openai_embedding().
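Inserting pre-computed vectors is mostly a formatting question. Here is a small hypothetical helper, written under the assumption that RisingWave accepts the pgvector-style '[x,y,...]' text form for a ::vector cast (verify against your RisingWave version):

```python
# Hypothetical helper for inserting externally computed embeddings.
# Assumes the target accepts the pgvector-style '[x,y,...]' text form.

def vector_literal(embedding):
    """Render a float list as a vector literal string for a ::vector cast."""
    return "[" + ",".join(f"{x:g}" for x in embedding) + "]"

print(vector_literal([0.25, -0.5, 1.0]))  # [0.25,-0.5,1]
# Used as a query parameter, e.g.:
#   INSERT INTO image_embeddings (doc_id, embedding) VALUES (%s, %s::vector)
```

Once stored, these vectors are indexed and queried exactly like the ones computed by openai_embedding().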
The practical multi-modal pattern is:
- Text chunks: embed inline using openai_embedding() in the materialized view
- Images: pre-compute embeddings externally with a multimodal model and insert the vectors directly
- Tables and structured data: convert to a text representation and embed as text
Evaluating Streaming RAG: What to Measure
A RAG system that retrieves stale context may score well on standard metrics that evaluate the reasoning step but miss the freshness problem entirely. For streaming RAG, you need to measure freshness explicitly.
Standard RAG metrics (apply to both batch and streaming):
- Context recall: what fraction of the gold-standard relevant documents are in the retrieved set?
- Faithfulness: does the LLM's answer stay within the bounds of the retrieved context?
- Answer relevance: does the answer address the user's actual question?
Freshness-specific metrics for streaming RAG:
- Embedding lag: for a sample of recently changed documents, what is the delay between the document update timestamp and the timestamp when the new embedding is indexed? In a working streaming RAG system this should be in the low single-digit seconds.
- Stale retrieval rate: what fraction of retrievals return a document whose embedding was computed before the document's most recent update? This should be near zero in a streaming system.
- Freshness degradation test: update a set of documents, wait a defined interval, and measure whether the updated content is retrieved correctly. In a batch system this test fails during the gap window. In a streaming system it passes within seconds.
Running these measurements against a baseline batch RAG system and your streaming RAG system side by side is the clearest way to quantify the improvement.
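The freshness degradation test can be driven by a small polling harness. The measure_freshness_lag function below is illustrative; check stands in for whatever retrieval query your system runs, and should return True once the updated document's new embedding is retrieved:

```python
# Illustrative polling harness for the freshness degradation test.
# clock and sleep are injectable so the harness itself is testable.
import time

def measure_freshness_lag(check, timeout=30.0, interval=0.5,
                          clock=time.monotonic, sleep=time.sleep):
    """Seconds until check() returns True, or None if the timeout elapses."""
    start = clock()
    while clock() - start < timeout:
        if check():
            return clock() - start
        sleep(interval)
    return None
```

In a streaming RAG system the measured lag should land in the low single-digit seconds; in a nightly batch system the call times out unless the test happens to straddle a re-index run.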
Comparison: Nightly Batch RAG vs Streaming RAG
| Dimension | Nightly Batch RAG | Streaming RAG |
| --- | --- | --- |
| Embedding freshness | Up to 24 hours stale | Seconds after document change |
| Embedding API cost | Proportional to full corpus size | Proportional to daily change rate |
| Accuracy on updated content | Fails for 24-hour window after update | Correct within seconds of update |
| Infrastructure complexity | Batch orchestration (cron, Airflow) | SQL materialized view, no scheduler |
| Re-indexing overhead | Full corpus re-processed every run | Only changed documents processed |
| Operational failure modes | Job failure leaves stale index | CDC handles connection recovery |
| Scaling behavior | Cost grows linearly with corpus size | Cost grows linearly with change rate |
For a knowledge base where 1% of documents change daily, the embedding cost difference is roughly 100x. For a 50,000-document corpus, that translates from 50,000 embedding calls per day to 500. At current OpenAI pricing, this is a meaningful budget difference for any team running production RAG at scale.
The accuracy difference during update windows is harder to quantify in aggregate but easy to observe in user impact. A support agent that quotes an updated refund policy correctly prevents escalations. A sales assistant that references current pricing closes deals. A compliance tool that cites current regulations avoids errors. These outcomes are not visible in standard retrieval benchmarks, which is why freshness measurement requires dedicated evaluation.
Conclusion
RAG became the dominant LLM grounding pattern because it decoupled knowledge from model weights, allowing frequently changing knowledge to be retrieved dynamically rather than baked into training. The standard batch implementation carried forward an assumption from the pre-streaming era: that periodic re-indexing was the only practical way to keep the vector index current.
That assumption no longer holds in 2026. Streaming databases with native CDC connectors and built-in embedding functions make continuous, event-driven re-indexing straightforward to implement in SQL. The streaming RAG architecture keeps embeddings fresh within seconds of document changes, costs proportionally to change rate rather than corpus size, and eliminates the scheduling layer entirely.
The complete implementation described in this article requires three SQL statements in PostgreSQL (an ALTER SYSTEM to enable logical replication, a CREATE TABLE, and a CREATE PUBLICATION), two in RisingWave for the CDC source and the table, one materialized view with the openai_embedding() call, and one HNSW index. The Python retrieval layer uses a standard psycopg2 connection to port 4566.
Nightly re-indexing was a reasonable constraint when the tools for continuous embedding did not exist. Those tools exist now. The choice to run a nightly batch job in 2026 is no longer a pragmatic simplification. It is a design decision with measurable costs in accuracy and embedding spend.

