{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is AI agent data infrastructure?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "AI agent data infrastructure is the stack of systems that give AI agents access to the data they need to make decisions. It includes the transactional databases where source data lives, the streaming layer that keeps derived views fresh, the serving layer that agents query, and the vector layer for semantic retrieval. The key property that distinguishes an agent-ready data infrastructure from a standard analytics stack is freshness: agents need data that reflects current reality, not a snapshot from hours ago."
      }
    },
    {
      "@type": "Question",
      "name": "Why can't AI agents just query a data warehouse?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Data warehouses refresh on batch schedules, typically every hour or longer. AI agents that make decisions affecting real money, inventory, or customer experience cannot operate on data that is an hour old. A financial agent working from yesterday's positions may generate trades that are already invalid. A customer support agent reading a cached order table may tell a customer their package is in transit when it was already delivered. The batch model introduces a staleness gap that agents cannot work around."
      }
    },
    {
      "@type": "Question",
      "name": "What database should I use for AI agents in 2026?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "For AI agent use cases requiring fresh, queryable, and semantically searchable data, RisingWave is the strongest option in 2026. It provides streaming SQL with incremental materialized views, a built-in vector type with HNSW indexing, an openai_embedding() function for computing embeddings inline, and an official MCP server (risingwavelabs/risingwave-mcp) that exposes 100+ tools for AI agents. It is open source under Apache 2.0 and uses the PostgreSQL wire protocol on port 4566."
      }
    },
    {
      "@type": "Question",
      "name": "What is the Model Context Protocol (MCP) and why does it matter for data?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The Model Context Protocol (MCP) is an open standard created by Anthropic that defines how AI agents discover and access external data sources and tools. Instead of each agent integration requiring custom code, MCP provides a common protocol. For data infrastructure, MCP matters because it is the mechanism by which agents discover what data exists, understand its schema and meaning, and query it in a structured way. A streaming database with an MCP server becomes a data source that any MCP-compatible agent can use without custom integration work."
      }
    },
    {
      "@type": "Question",
      "name": "Do I need a separate vector database for AI agents?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Not if your streaming database has built-in vector support. RisingWave includes a native vector(n) type, a cosine distance operator (<=>), an L2 distance operator (<->), HNSW indexing, and an openai_embedding() function. Adding a separate vector database alongside your streaming database doubles operational complexity without adding capability. You can perform semantic retrieval and structured SQL queries against the same materialized views in the same system."
      }
    }
  ]
}
What Changed in 2026
In 2023 and 2024, the AI agent conversation was mostly theoretical. Demos showed agents browsing the web, writing code, and summarizing documents. The infrastructure question was secondary: if the agent mostly read static documents or made API calls, data freshness was not the critical path.
That changed when agents became autonomous workers with financial authority. In 2026, production agents are buying advertising inventory, routing support tickets, adjusting dynamic pricing, filing expense reports, approving small transactions, and managing cloud resource allocation. They run continuously, not on demand. They make decisions that have real consequences.
When an agent makes a purchasing decision based on inventory data that is six hours old, that is not a demo failure. That is a financial loss. When a customer support agent misquotes a refund policy that was updated two days ago, that is a compliance problem. The gap between what the agent believes and what is actually true has become measurably expensive.
This is why the AI agent data infrastructure conversation in 2026 is fundamentally different from the general analytics conversation of three years ago. It is not about dashboards or reports. It is about giving autonomous systems access to the current state of the world, continuously, at low latency, with enough richness to support semantic reasoning.
The LLM itself is now a commodity. Every major cloud provider offers API access to capable models at competitive prices. The teams that build durable advantages are not the ones with access to better models. They are the ones with better data pipelines feeding those models.
The Observe-Think-Act Loop and Why Data Freshness Is the Bottleneck
Every agent, regardless of how it is implemented, follows some version of an observe-think-act loop. The agent receives a trigger or a task, gathers context about the current state of the world, reasons about what to do, and takes an action. The loop repeats.
The quality of the thinking step is constrained by the quality of the observation step. A well-reasoned decision based on stale facts produces a wrong outcome. The model cannot fix a data problem by reasoning harder.
Consider a pricing agent for an e-commerce platform. Its job is to adjust product prices in response to competitor pricing, inventory levels, and demand signals. It runs every thirty seconds. At each cycle it does the following:
- Observe: query current inventory, recent sales velocity, and competitor prices
- Think: decide whether to raise, lower, or hold the price for each SKU
- Act: write the new price to the catalog system
If the inventory query returns data from an hourly ETL job, the agent may lower prices on items that are actually out of stock, or hold prices on items where competitors moved twenty minutes ago. The logic is correct. The observations are wrong.
The freshness requirement is not a nice-to-have. It is a correctness requirement. For agents with financial authority, stale observations produce incorrect actions, and incorrect actions have real costs.
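The pricing loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production agent: the `Observation` type, the thresholds, and the `observe`/`act` callables are all hypothetical stand-ins for queries against the serving layer and writes to the catalog system.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """A snapshot of the world for one SKU at one loop iteration."""
    on_hand: int            # current inventory level
    our_price: float        # current catalog price
    competitor_price: float # latest observed competitor price

def decide(obs: Observation) -> str:
    """Think step: a deliberately simple pricing rule.
    A real agent would reason with an LLM; the point is that the rule
    is only as good as the observation it receives."""
    if obs.on_hand == 0:
        return "hold"   # never reprice what you cannot sell
    if obs.competitor_price < obs.our_price * 0.95:
        return "lower"  # competitor undercut us by a meaningful margin
    if obs.competitor_price > obs.our_price * 1.05:
        return "raise"
    return "hold"

def run_cycle(observe, act) -> str:
    """One observe-think-act iteration for a single SKU."""
    obs = observe()       # query fresh context from the serving layer
    action = decide(obs)
    act(action)           # write the new price to the catalog system
    return action
```

If `observe()` returns an hourly snapshot instead of current state, `decide()` still runs correctly and still produces the wrong action; no change to the think step can compensate.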
The same pattern appears across every autonomous agent use case:
- A fraud detection agent scoring transactions in real time needs to know what the customer did in the last sixty seconds, not the last hour.
- A supply chain agent deciding whether to trigger a reorder needs current inventory, not this morning's snapshot.
- A document processing agent routing incoming contracts needs to know the current state of each contract in the pipeline, not a batch export from last night.
The observe-think-act loop runs at agent speed. The data layer must keep up.
The Four Data Requirements for AI Agents
The requirements for an AI agent data layer are different from those for a business intelligence platform. Four properties matter most.
Freshness
Agent context must reflect the current state of the system being managed, not a snapshot from the last batch run. Sub-second to low-single-digit second freshness is the target for most production agent use cases. Anything beyond ten seconds introduces meaningful decision error for agents operating in fast-moving environments.
This rules out most traditional data warehouses and any architecture where data propagation depends on scheduled batch jobs. The data layer must process changes continuously as they happen.
Queryability
Agents need to retrieve specific, structured context quickly. "Give me the current inventory level for SKU-1042" must return in milliseconds. "Give me all customers who changed their shipping address in the last five minutes" must be fast enough to support real-time fraud logic.
Full table scans, slow analytical queries, or APIs that require constructing complex request payloads all degrade agent performance. The serving layer must support SQL or structured queries against pre-materialized results, not raw event streams.
Richness
In 2026, agents need both structured data and semantic retrieval from the same context layer. A customer support agent needs to query the order table by customer ID (structured) and also find the most relevant support article for the customer's question (semantic vector search). Building two separate systems for these two operations doubles complexity and introduces synchronization problems.
A data layer that serves both structured SQL and vector similarity queries from the same system is a significant operational simplification.
Discoverability
Agents using MCP need to understand what data is available before they can use it. A materialized view named mv_7_v3_final is not useful to an agent that is discovering the data catalog autonomously. The data layer needs to support metadata that tells agents what each view contains, what columns mean, how to filter them, and how fresh the data is.
This is not just a convenience. Agents that cannot discover and understand available data fall back to guessing or failing. Metadata quality directly affects agent task success rate.
The Stack, Layer by Layer
A production AI agent data infrastructure in 2026 has four layers. Each has a clear responsibility.
The Write Layer: PostgreSQL and MySQL for Transactions
Transactional databases are where source truth lives. Customer records, order state, inventory counts, contract status, employee data: these live in PostgreSQL, MySQL, MongoDB, or SQL Server. These systems are optimized for transactional workloads: ACID guarantees, point queries, high-frequency writes.
Agents should not query these systems directly for context enrichment. Transactional databases are not designed for the kinds of aggregation, joining, and lookback queries that agent context requires. Querying them at agent speed would degrade transaction performance for the operational applications that depend on them.
The write layer is the source of truth, not the serving layer. Its job is to capture every change reliably.
The Streaming Layer: RisingWave as the Continuous Transformation Engine
RisingWave sits between the write layer and the serving layer. It subscribes to your transactional databases via Change Data Capture (CDC) and maintains continuously updated materialized views that agents can query.
When a customer updates their profile in PostgreSQL, that change flows through CDC into RisingWave within milliseconds. RisingWave applies the change to any materialized views that reference the customer profile table. Agents querying those views see the updated state without any polling, batch refresh, or manual intervention.
RisingWave supports CDC directly from PostgreSQL, MySQL, MongoDB, and SQL Server. It also reads from Kafka, Kinesis, and Pulsar. This means the streaming layer can consolidate data from multiple operational systems into unified agent-ready views without requiring all sources to use the same event bus.
Here is how you connect RisingWave to a PostgreSQL operational database via CDC:
-- In RisingWave: create a CDC source pointing to PostgreSQL
CREATE SOURCE app_db_source WITH (
    connector = 'postgres-cdc',
    hostname = 'postgres.internal',
    port = '5432',
    username = 'cdc_reader',
    password = '${POSTGRES_PASSWORD}',
    database.name = 'app_db',
    schema.name = 'public',
    publication.name = 'risingwave_pub'
);
Once connected, every INSERT, UPDATE, and DELETE in the source database flows into RisingWave automatically. You create a table from the CDC source for each upstream table you need, define materialized views over those tables, and RisingWave maintains the views incrementally.
The Serving Layer: Agents Query RisingWave via PostgreSQL Protocol or MCP
RisingWave uses the PostgreSQL wire protocol on port 4566. Any PostgreSQL-compatible client can connect to it. Agents written in Python, TypeScript, Go, or any other language can query RisingWave the same way they would query PostgreSQL.
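Because the wire protocol is standard, an agent-side fetch is ordinary driver code. A minimal Python sketch, assuming psycopg 3 is installed and the `user_context` materialized view defined later in this article exists; the `build_point_query` helper is illustrative:

```python
import re

def build_point_query(view: str, key_column: str) -> str:
    """Build a parameterized point query against a materialized view.
    Identifiers are checked against a conservative pattern so that
    discovered view names are never interpolated unsafely."""
    for ident in (view, key_column):
        if not re.fullmatch(r"[a-z_][a-z0-9_]*", ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    return f"SELECT * FROM {view} WHERE {key_column} = %s"

def fetch_user_context(user_id: int,
                       dsn: str = "postgresql://root@localhost:4566/dev"):
    """Fetch one row of agent context from RisingWave.
    psycopg works because RisingWave speaks the PostgreSQL protocol."""
    import psycopg  # deferred import: only needed when actually connecting
    with psycopg.connect(dsn) as conn:
        query = build_point_query("user_context", "user_id")
        return conn.execute(query, (user_id,)).fetchone()
```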
For agents using MCP, RisingWave provides an official MCP server at risingwavelabs/risingwave-mcp. This server exposes over 100 tools that agents can use to discover schemas, query materialized views, and understand data relationships. Configure it in your agent's MCP settings:
{
  "mcpServers": {
    "risingwave": {
      "command": "npx",
      "args": ["-y", "@risingwavelabs/risingwave-mcp@latest"],
      "env": {
        "RISINGWAVE_CONNECTION_STR": "postgresql://root@localhost:4566/dev"
      }
    }
  }
}
With this configuration, an agent can list all materialized views in the database, inspect their schemas, read their comments, and execute queries against them. The discovery and querying are both handled through the MCP protocol.
The Vector Layer: RisingWave Built-in Vector for Semantic Retrieval
Adding semantic search to agent context does not require a separate vector database. RisingWave includes a native vector(n) type, a cosine distance operator (<=>), an L2 distance operator (<->), and HNSW indexing. The openai_embedding() function computes embeddings inline, inside a SQL query or materialized view definition.
This means you can maintain a materialized view that holds both structured fields and embedding vectors, updated continuously as source documents change. Agents can perform semantic retrieval against the same system they use for structured queries, using the same connection string and the same protocol.
The operational cost difference is significant. A separate vector database requires its own deployment, its own connection management, its own backup strategy, and its own scaling decisions. When your streaming database handles vector natively, that complexity disappears.
Making Data Discoverable with COMMENT ON
The quality of agent decisions depends not just on data freshness, but on whether agents can understand what data means. An agent that discovers a materialized view named user_context needs to know what it contains, how to filter it, and what latency to expect.
RisingWave supports COMMENT ON for tables, materialized views, and columns. These comments are surfaced through the MCP server's discovery tools, making them visible to any agent that queries the schema.
COMMENT ON MATERIALIZED VIEW user_context IS
  'Current user profile: categories, preferences, recent orders.
   Query with WHERE user_id = $1. Updates within 500ms of user actions.';

COMMENT ON COLUMN user_context.preferred_categories IS
  'Comma-separated list of category IDs ranked by 30-day purchase frequency.';

COMMENT ON COLUMN user_context.last_order_status IS
  'Status of the most recent order: pending, shipped, delivered, or returned.';
When an agent discovers this view through the MCP server, it sees both the schema and the documentation. It can construct correct queries without human guidance. This is how you make your data layer genuinely agent-friendly, not just technically accessible.
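To make the agent side concrete: once comments are surfaced through discovery, they can be folded directly into the agent's prompt context. A sketch, where the input dict stands in for a discovery response and its shape is illustrative, not the MCP server's actual output format:

```python
def render_catalog(views: dict) -> str:
    """Render discovered view metadata as prompt-ready text.
    `views` maps view name -> {"comment": str, "columns": {name: doc}}."""
    lines = []
    for name, meta in sorted(views.items()):
        lines.append(f"view {name}: {meta['comment']}")
        for col, doc in meta["columns"].items():
            lines.append(f"  - {col}: {doc}")
    return "\n".join(lines)
```

A view with no comment renders as a bare name, which is exactly the mv_7_v3_final problem: the agent has nothing to reason with.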
SQL Examples: What Agents Actually Query
The value of this architecture becomes concrete when you look at the materialized views agents query in real production systems. Here are four common patterns.
Real-Time User Context View
A recommendation or personalization agent needs a complete picture of the user: what they have bought, what categories they prefer, how recently they were active, and what their current cart contains. This view joins data from multiple operational tables into a single agent-queryable result:
CREATE MATERIALIZED VIEW user_context AS
SELECT
    u.user_id,
    u.email,
    u.account_tier,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(o.total_amount) AS lifetime_value,
    MAX(o.created_at) AS last_order_at,
    COUNT(DISTINCT o.order_id) FILTER (
        WHERE o.created_at > NOW() - INTERVAL '30 days'
    ) AS orders_last_30d,
    string_agg(DISTINCT oi.category, ', ') AS purchased_categories
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id
LEFT JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY u.user_id, u.email, u.account_tier;
An agent querying SELECT * FROM user_context WHERE user_id = $1 gets a fresh, pre-joined answer in milliseconds.
Budget and Spending View for Financial Agents
A financial agent managing departmental budgets needs current spend against approved budget, updated as transactions close. A nightly ETL job creates a staleness gap that makes real-time approval decisions impossible.
CREATE MATERIALIZED VIEW department_budget_status AS
SELECT
    d.department_id,
    d.department_name,
    d.approved_budget,
    COALESCE(SUM(t.amount), 0) AS spent_to_date,
    d.approved_budget - COALESCE(SUM(t.amount), 0) AS remaining_budget,
    ROUND(
        COALESCE(SUM(t.amount), 0) / NULLIF(d.approved_budget, 0) * 100, 2
    ) AS pct_utilized
FROM departments d
LEFT JOIN transactions t
    ON t.department_id = d.department_id
    AND t.fiscal_year = EXTRACT(YEAR FROM NOW())
    AND t.status = 'approved'
GROUP BY d.department_id, d.department_name, d.approved_budget;
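To make the view's arithmetic concrete, here is the same computation for a single department in plain Python (an illustrative mirror, not code you need to deploy). Note how `None` mirrors the NULL that NULLIF(approved_budget, 0) produces when a department's budget is zero, avoiding a division error:

```python
def budget_status(approved_budget: float, spent_to_date: float) -> dict:
    """Mirror of the department_budget_status derived columns.
    pct_utilized is None for a zero budget, matching the NULLIF guard."""
    remaining = approved_budget - spent_to_date
    pct = (round(spent_to_date / approved_budget * 100, 2)
           if approved_budget else None)
    return {
        "spent_to_date": spent_to_date,
        "remaining_budget": remaining,
        "pct_utilized": pct,
    }
```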
Inventory View for Procurement Agents
A procurement agent deciding whether to trigger a purchase order needs current on-hand quantity minus pending fulfillment, not yesterday's warehouse count.
CREATE MATERIALIZED VIEW inventory_status AS
SELECT
    p.sku,
    p.product_name,
    p.reorder_point,
    COALESCE(w.quantity_on_hand, 0) AS on_hand,
    COALESCE(SUM(so.quantity_allocated), 0) AS allocated,
    COALESCE(w.quantity_on_hand, 0)
        - COALESCE(SUM(so.quantity_allocated), 0) AS available,
    CASE
        WHEN COALESCE(w.quantity_on_hand, 0)
             - COALESCE(SUM(so.quantity_allocated), 0)
             <= p.reorder_point THEN true
        ELSE false
    END AS needs_reorder
FROM products p
LEFT JOIN warehouse_inventory w ON w.sku = p.sku
LEFT JOIN sales_order_lines so
    ON so.sku = p.sku AND so.status IN ('open', 'picking')
GROUP BY p.sku, p.product_name, p.reorder_point, w.quantity_on_hand;
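The reorder predicate restated as a plain function for one SKU, as an illustrative mirror of the view's derived columns (field names follow the SQL above):

```python
def inventory_row(reorder_point: int, on_hand: int, allocated: int) -> dict:
    """Mirror of the inventory_status derived columns for one SKU.
    `available` is on-hand stock minus quantity already allocated to
    open sales orders; reorder triggers at or below the reorder point."""
    available = on_hand - allocated
    return {
        "on_hand": on_hand,
        "allocated": allocated,
        "available": available,
        "needs_reorder": available <= reorder_point,
    }
```

The reason allocation matters: a warehouse count of 100 with 98 units promised to open orders is effectively 2 units, and a procurement agent reading raw on-hand numbers would miss the reorder.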
Behavioral Signal View for Recommendation Agents
A recommendation agent needs recent behavioral signals, not just purchase history. What pages did the user visit in the last hour? What did they add to and remove from their cart? What searches did they run?
CREATE MATERIALIZED VIEW user_behavioral_signals AS
SELECT
    user_id,
    COUNT(*) FILTER (
        WHERE event_type = 'product_view'
          AND event_time > NOW() - INTERVAL '1 hour'
    ) AS product_views_1h,
    COUNT(*) FILTER (
        WHERE event_type = 'add_to_cart'
          AND event_time > NOW() - INTERVAL '1 hour'
    ) AS cart_adds_1h,
    COUNT(*) FILTER (
        WHERE event_type = 'search'
          AND event_time > NOW() - INTERVAL '1 hour'
    ) AS searches_1h,
    string_agg(
        search_query, ' '
        ORDER BY event_time DESC
    ) FILTER (
        WHERE event_type = 'search'
          AND event_time > NOW() - INTERVAL '1 hour'
    ) AS recent_search_terms
FROM user_events
WHERE event_time > NOW() - INTERVAL '1 hour'
GROUP BY user_id;
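The same windowed aggregation, restated for a single user in plain Python as an illustrative mirror of the view logic, with events as in-memory tuples rather than a stream:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def behavioral_signals(events, now, window=timedelta(hours=1)):
    """Summarize one user's recent events, mirroring the windowed
    FILTER counts in user_behavioral_signals.
    Each event is a (event_time, event_type, payload) tuple, where
    payload holds the search query for 'search' events."""
    recent = [e for e in events if e[0] > now - window]
    counts = Counter(e[1] for e in recent)
    searches = [e[2] for e in sorted(recent, key=lambda e: e[0], reverse=True)
                if e[1] == "search"]
    return {
        "product_views_1h": counts.get("product_view", 0),
        "cart_adds_1h": counts.get("add_to_cart", 0),
        "searches_1h": counts.get("search", 0),
        "recent_search_terms": " ".join(searches),
    }
```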
What Not to Do: Common Mistakes in 2026
As agent data infrastructure has matured, a set of failure patterns has emerged. These are worth naming directly.
Do Not Give Agents Direct Write Access to Production Databases
Agents that can read and write to the same transactional database where your application runs create risk that is disproportionate to the convenience. A bug in agent logic, an unexpected input, or a prompt injection attack can translate directly into corrupt production data. The read layer for agents should be separate from the write layer for applications, even if the same underlying data is involved.
RisingWave provides a clean separation: applications write to PostgreSQL or MySQL, CDC flows those changes into RisingWave read-only materialized views, and agents query RisingWave. The agent has no path to modify source data directly.
Do Not Serve Agents from a Batch Warehouse
Every data warehouse in production has a latency floor determined by its refresh schedule. Even with an aggressive fifteen-minute refresh cadence, agents are working with context that is up to fifteen minutes old. For autonomous agents making financial, operational, or customer-facing decisions, that gap is unacceptable.
If your current infrastructure has agents querying Snowflake, BigQuery, or Redshift, the agents are operating on batch data. The solution is not a faster refresh schedule. It is a streaming layer that eliminates the batch boundary.
Do Not Build a Separate Vector Database If Your Streaming Database Has Built-In Support
Vector databases became popular because traditional OLTP and OLAP systems did not support vector types or approximate nearest neighbor search. That gap has closed. RisingWave has native vector support, HNSW indexing, and an embedding function. Adding Pinecone, Weaviate, or Qdrant alongside a system that already handles vectors doubles your operational surface area without adding capability.
The coordination problem is particularly damaging: when a document changes, you now need to update both the streaming database and the vector database. Two systems that must stay in sync are two opportunities for them to diverge. One system with vector built in eliminates the coordination problem entirely.
Why 2026 Is Different from 2024
Four converging developments explain why the AI agent data infrastructure conversation looks fundamentally different in 2026 compared to two years ago.
MCP standardized agent-to-data connectivity. Before MCP, connecting an agent to a data source required building a custom integration. Each combination of agent framework and data source needed its own connector. MCP provided the protocol layer that made data sources pluggable. Today, any MCP-compatible agent can connect to any MCP server without custom code. For data infrastructure, this means the integration layer is standardized, and what differentiates data sources is quality, not connectivity.
Vector support became table stakes. In 2024, most streaming and operational databases did not support vector types. Teams that needed semantic retrieval were forced to add a separate vector database. In 2026, mature streaming databases have built-in vector support. The idea of adding a separate system just for vectors now reads as unnecessary complexity in most architectures.
Streaming databases added native AI tooling. The openai_embedding() function in RisingWave is representative of a broader trend: streaming databases are adding functions and integrations that are specifically useful for AI workloads. You can now compute embeddings inside a materialized view definition, which means embeddings stay fresh automatically as documents change, without any orchestration layer.
Autonomous agents with financial authority arrived. In 2024, most production agents were read-only: they summarized, classified, and retrieved, but a human approved any consequential action. In 2026, agents are writing to systems, making purchases, routing work, and adjusting parameters without human approval in the loop for each action. The financial stakes of stale data increased proportionally.
These four shifts together created the demand for a purpose-built AI agent data infrastructure that did not exist as a recognized category two years ago.
Conclusion
Running an LLM is a solved problem. API providers offer capable models at competitive prices, and the gap between foundation models continues to narrow. The durable competitive advantage in AI agent systems in 2026 is the data layer: how fresh the context is, how quickly it reflects system state, how semantically rich it is for retrieval, and how discoverable it is for autonomous agents.
The infrastructure stack that meets these requirements has four layers: a transactional write layer in PostgreSQL or MySQL, a streaming layer in RisingWave that maintains continuously updated materialized views via CDC, a serving layer that agents query over the PostgreSQL protocol or MCP, and a vector layer that lives inside RisingWave rather than as a separate system.
The COMMENT ON pattern for documenting materialized views, the MCP server that exposes those views to agents, and the openai_embedding() function that keeps embeddings fresh without orchestration are the implementation details that turn a technically sound architecture into one that agents can actually use effectively.
Build the data layer first. The agent logic will follow.

