Streaming SQL vs Batch SQL: When to Use Which

You write SQL for both your dbt models and your streaming pipelines, but the two feel completely different in practice. One runs on a schedule and scans the full dataset. The other runs continuously and updates only what changed. One is easy to reason about and debug. The other gives you live results the moment data arrives.

So when should you reach for streaming SQL instead of batch SQL? And when should you stick with the batch approach you already know?

This guide gives you a concrete decision framework, side-by-side SQL comparisons, and real examples you can run yourself. All streaming SQL examples are verified against RisingWave, a streaming database that accepts standard PostgreSQL-compatible SQL.

What Is Batch SQL?

Batch SQL is a query that runs at a point in time, scans some or all of its input data, and produces a result. You schedule it, it runs, it finishes. Tools like dbt, Apache Spark SQL, and traditional database scheduled queries all work this way.

The defining characteristic is that batch SQL is stateless between runs. When your dbt model runs at 2:00 AM, it has no memory of the run at 1:00 AM. It reads the current state of the input tables, computes the result, and writes it somewhere.

A typical batch SQL query looks like this:

-- Run by dbt or a scheduled job
SELECT
    region,
    COUNT(*)                       AS order_count,
    SUM(quantity * unit_price)     AS revenue,
    AVG(quantity * unit_price)     AS avg_order_value
FROM sqltype_orders
WHERE order_status = 'completed'
GROUP BY region
ORDER BY revenue DESC;

This query is correct, readable, and works in any SQL database. The limitation is that it only reflects data as of the moment it ran. If you schedule it to run every hour, your dashboards are up to 60 minutes stale.

What Is Streaming SQL?

Streaming SQL is a continuous query that runs indefinitely. Instead of scanning data and stopping, it watches for incoming data and updates its result incrementally. You define the query once and the system keeps results current as data changes.

In RisingWave, streaming SQL takes the form of a materialized view:

-- Created once, maintained continuously
CREATE MATERIALIZED VIEW sqltype_mv_streaming_revenue AS
SELECT
    p.category,
    COUNT(o.order_id)              AS total_orders,
    SUM(o.quantity * o.unit_price) AS total_revenue
FROM sqltype_orders o
JOIN sqltype_products p ON o.product_id = p.product_id
WHERE o.order_status = 'completed'
GROUP BY p.category;

After this statement runs, you never have to refresh it. When new orders arrive, RisingWave updates the affected rows in sqltype_mv_streaming_revenue within milliseconds. Reading from the view always returns the current result:

SELECT * FROM sqltype_mv_streaming_revenue ORDER BY total_revenue DESC;
  category   | total_orders | total_revenue
-------------+--------------+---------------
 Electronics |            5 |        939.89
 Office      |            3 |        179.88

The key difference: batch SQL is pull-based (you ask for results) and streaming SQL is push-based (results stay current as data arrives).

Side-by-Side Comparison

Here is the same use case written in both styles. The scenario: you want revenue totals by product category, and you want to watch them update as new orders come in.

Batch approach (dbt model)

-- models/revenue_by_category.sql
SELECT
    p.category,
    COUNT(o.order_id)              AS total_orders,
    SUM(o.quantity * o.unit_price) AS total_revenue
FROM {{ ref('orders') }} o
JOIN {{ ref('products') }} p ON o.product_id = p.product_id
WHERE o.order_status = 'completed'
GROUP BY p.category

You schedule this in your dbt job. It runs hourly, scans all rows in orders and products, and overwrites the output table. If 10,000 orders arrived in the last hour and 50 million rows exist in total, the query processes 50 million rows to produce updated numbers.

Streaming approach (RisingWave materialized view)

-- Created once in RisingWave
CREATE MATERIALIZED VIEW sqltype_mv_streaming_revenue AS
SELECT
    p.category,
    COUNT(o.order_id)              AS total_orders,
    SUM(o.quantity * o.unit_price) AS total_revenue
FROM sqltype_orders o
JOIN sqltype_products p ON o.product_id = p.product_id
WHERE o.order_status = 'completed'
GROUP BY p.category;

When the same 10,000 orders arrive, RisingWave processes only those 10,000 rows, updating the running counts and sums incrementally. The other 50 million rows are untouched. Results are fresh within seconds.

Where they differ

| Dimension           | Batch SQL                          | Streaming SQL                              |
|---------------------|------------------------------------|--------------------------------------------|
| Freshness           | Stale until next run               | Sub-second latency                         |
| Compute cost        | Proportional to total dataset size | Proportional to change volume              |
| Query trigger       | Scheduled (cron, orchestrator)     | Automatic on data arrival                  |
| Failure recovery    | Re-run the job                     | Checkpointed, resumes automatically        |
| Historical analysis | Natural fit                        | Requires window functions                  |
| Learning curve      | Standard SQL you already know      | Small set of streaming-specific extensions |
| Tooling             | dbt, Airflow, Spark, any SQL DB    | Streaming databases like RisingWave        |

The Decision Framework

Use batch SQL when...

1. You need historical analysis over large datasets

If your query looks back over months or years of data and latency of minutes or hours is acceptable, batch SQL is the right tool. Reprocessing historical datasets is what batch engines are optimized for.

-- Historical year-over-year comparison: run once or daily
SELECT
    EXTRACT(YEAR FROM order_time)  AS year,
    EXTRACT(MONTH FROM order_time) AS month,
    SUM(quantity * unit_price)     AS revenue
FROM sqltype_orders
GROUP BY year, month
ORDER BY year, month;

2. Your upstream data arrives in scheduled batches

If your data lands in S3 or a data warehouse once per day via nightly exports, streaming SQL does not help. There is nothing to stream. Match your processing model to your data delivery model.

3. You need complex transformations with full dataset context

Some transformations require seeing all rows before producing output: ranking across the full dataset, computing percentiles with exact counts, or deduplication with complex business rules. These are batch-friendly operations.
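As an illustration of full-dataset context, here is a hedged sketch of a batch query over the article's `sqltype_orders` table that ranks regions by revenue and computes an exact 95th-percentile order value. Both computations must observe every row before emitting any output, which is exactly what a batch engine is built for:

-- Batch-friendly: ranking and exact percentiles need the full dataset
SELECT
    region,
    SUM(quantity * unit_price)                              AS revenue,
    RANK() OVER (ORDER BY SUM(quantity * unit_price) DESC)  AS revenue_rank,
    PERCENTILE_CONT(0.95)
        WITHIN GROUP (ORDER BY quantity * unit_price)       AS p95_order_value
FROM sqltype_orders
GROUP BY region;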

4. Your team already owns a dbt project

dbt is excellent tooling with mature testing, documentation, and CI/CD integration. If your use case fits hourly or daily freshness requirements, there is no reason to add a new technology to your stack. Keep your data modeling in dbt.

5. You are doing ad-hoc exploration

Interactive SQL on a data warehouse or analytical database is batch SQL. Notebooks, BI tools, ad-hoc queries: all batch. Streaming SQL is for production pipelines with defined, recurring information needs.

Use streaming SQL when...

1. You need results within seconds of data arriving

Real-time dashboards, live leaderboards, operational monitoring: any use case where data older than a minute becomes stale. Streaming SQL keeps results continuously fresh without polling or scheduled jobs.

2. You are processing event streams

Click events, sensor readings, financial transactions, CDC from databases: data that arrives continuously and needs to be aggregated, filtered, or joined as it arrives is the native workload for streaming SQL.

Time-windowed aggregations over event streams use RisingWave's TUMBLE function:

CREATE MATERIALIZED VIEW sqltype_mv_live_order_count AS
SELECT
    window_start,
    window_end,
    COUNT(*)               AS order_count,
    SUM(quantity * unit_price) AS revenue
FROM TUMBLE(sqltype_orders, order_time, INTERVAL '1 HOUR')
GROUP BY window_start, window_end;

This view updates within seconds whenever a new order arrives. No scheduler, no cron, no job submission.

3. You need to react to data as it changes

Fraud detection, inventory threshold alerts, anomaly detection: use cases where you need to act on a condition immediately when it becomes true. Streaming SQL lets you define those conditions as materialized views and query them at any moment.

-- Always-current view of flagged orders
CREATE MATERIALIZED VIEW sqltype_mv_late_orders AS
SELECT
    order_id,
    customer_id,
    order_status,
    order_time
FROM sqltype_orders
WHERE order_status = 'cancelled';

A downstream service can poll this view continuously or RisingWave can push changes to a Kafka topic or webhook. The materialized view is always current.
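As a sketch of the push path (the broker address and topic name are placeholders for your environment), a RisingWave sink can forward every change to the view into Kafka. `FORMAT DEBEZIUM ENCODE JSON` emits a changelog that captures inserts, updates, and deletes on the materialized view:

-- Emit each change to the materialized view as a JSON changelog message
CREATE SINK sqltype_late_orders_sink FROM sqltype_mv_late_orders
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'broker:9092',  -- placeholder address
    topic = 'late-orders'                          -- placeholder topic
) FORMAT DEBEZIUM ENCODE JSON;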

4. Your batch jobs are too slow to maintain

If your dbt model runs for 20 minutes every hour and the team complains that data is always stale, streaming SQL is the right migration path. The incremental computation model means processing time stays proportional to the volume of new data, not total historical data.

5. You need to join streams with tables

Enriching events with dimensional data, joining two event streams with a time constraint: these are streaming join patterns that batch SQL handles poorly because it lacks the notion of time-ordered event arrival.
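A hedged sketch of a stream-to-stream join with a time constraint (`sqltype_clicks` and `sqltype_purchases` are hypothetical event tables, each with a `user_id` and an `event_time` column): attribute each purchase to a click by the same user within the preceding 30 minutes.

-- Hypothetical attribution join: purchase must follow a click within 30 minutes
CREATE MATERIALIZED VIEW sqltype_mv_attributed_purchases AS
SELECT
    p.purchase_id,
    c.click_id,
    p.user_id
FROM sqltype_purchases p
JOIN sqltype_clicks c
  ON p.user_id = c.user_id
 AND p.event_time BETWEEN c.event_time
                      AND c.event_time + INTERVAL '30 MINUTES';

Expressing this in batch SQL would require repeatedly re-scanning both tables; here the join state is maintained as events arrive.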

Practical Decision Tree

Use this decision tree when evaluating a new use case:

Is your input data arriving continuously (Kafka, CDC, event streams)?
├── Yes → Are results needed within seconds?
│         ├── Yes → Use STREAMING SQL (materialized views)
│         └── No  → Either approach works; batch may be simpler
└── No  → Does your data arrive in scheduled batches (S3, nightly exports)?
          ├── Yes → Use BATCH SQL (dbt, Spark, warehouse queries)
          └── No  → Is this ad-hoc exploration?
                    ├── Yes → Use BATCH SQL (interactive query)
                    └── No  → Evaluate data arrival pattern before deciding

For the most common production scenario, the rule is simpler: if stakeholders ask "why is this data from an hour ago?", you need streaming SQL.

Example: Real-Time Product Leaderboard

Here is a complete example of migrating a batch leaderboard to a streaming one.

Batch version (scheduled SQL)

-- Run every 15 minutes by your orchestrator
SELECT
    p.product_name,
    SUM(o.quantity)                AS units_sold,
    SUM(o.quantity * o.unit_price) AS total_revenue
FROM sqltype_orders o
JOIN sqltype_products p ON o.product_id = p.product_id
GROUP BY p.product_name
ORDER BY total_revenue DESC
LIMIT 5;

Every time a product manager refreshes the leaderboard, they see data up to 15 minutes old. If a flash sale pushes a product to the top, the leaderboard reflects it 15 minutes later.

Streaming version (RisingWave materialized view)

CREATE MATERIALIZED VIEW sqltype_mv_top_products_streaming AS
SELECT
    p.product_name,
    SUM(o.quantity)                AS units_sold,
    SUM(o.quantity * o.unit_price) AS total_revenue
FROM sqltype_orders o
JOIN sqltype_products p ON o.product_id = p.product_id
GROUP BY p.product_name;

To retrieve the leaderboard:

SELECT * FROM sqltype_mv_top_products_streaming
ORDER BY total_revenue DESC
LIMIT 5;
    product_name     | units_sold | total_revenue
---------------------+------------+---------------
 Wireless Headphones |          6 |        479.94
 Mechanical Keyboard |          3 |        389.97
 USB-C Hub           |          4 |        199.96
 Standing Desk Mat   |          3 |        119.97
 Notebook Pack       |         10 |         99.90

The view reflects every order the moment it is inserted, with no schedule, no polling, and no full table scan. When a new order arrives, only the affected product row in the materialized view changes.

When Both Work Together

Batch and streaming SQL are not mutually exclusive. Most mature data platforms use both:

  • Streaming SQL handles operational metrics: live dashboards, alerting, real-time feature stores for ML
  • Batch SQL handles analytical workloads: historical trend reports, data quality checks, long-range forecasting models

A common pattern is to run RisingWave alongside a data warehouse. RisingWave materializes the real-time metrics your product needs to run. The data warehouse stores the historical record for reporting and compliance. Both read from the same Kafka topics or CDC streams.

For teams starting out: begin with batch SQL (dbt is an excellent starting point) and add streaming SQL only when a specific use case demands lower latency. You will know it is time when your batch jobs are running too frequently and stakeholders are still unhappy with freshness.

Common Misconceptions

"Streaming SQL is just batch SQL that runs more often."

No. Batch SQL re-reads all the data on every run. Streaming SQL processes only the new or changed data (the delta) and propagates the effect to maintained results. Running a batch query every 10 seconds would be expensive and still not achieve the same latency as streaming SQL. More detail on how this works is in the incremental materialized views guide.

"Streaming SQL is harder to write."

The SQL syntax in RisingWave is nearly identical to PostgreSQL. The CREATE MATERIALIZED VIEW statement uses the same SELECT, JOIN, WHERE, and GROUP BY clauses you already know. The main additions are time window functions like TUMBLE and HOP for event-time aggregations. See the RisingWave SQL reference for the complete syntax.
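To show one of those extensions, here is a hedged sketch of HOP (view and column names are illustrative): a one-hour revenue window that slides forward every 15 minutes, so each event lands in four overlapping windows.

-- Sliding one-hour windows, advancing every 15 minutes
CREATE MATERIALIZED VIEW sqltype_mv_sliding_revenue AS
SELECT
    window_start,
    window_end,
    SUM(quantity * unit_price) AS revenue
FROM HOP(sqltype_orders, order_time, INTERVAL '15 MINUTES', INTERVAL '1 HOUR')
GROUP BY window_start, window_end;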

"Batch SQL is always cheaper."

At small scale, yes. At high volume, incremental computation is dramatically more efficient. If your batch job scans 500 million rows to process 50,000 new events, the per-event cost is enormous. A streaming pipeline that processes only the 50,000 new events scales far more economically.

"I need Kafka to use streaming SQL."

Not necessarily. RisingWave can read from Kafka, but it also accepts direct inserts via the PostgreSQL protocol. For a development environment or a use case without an existing message bus, you can insert rows directly into RisingWave tables and define materialized views on top of them, as the examples in this article demonstrate. See the RisingWave sources documentation for the full list of supported connectors.
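As a sketch of the no-Kafka path (connection parameters shown are RisingWave's documented defaults; the table is illustrative), you connect with any PostgreSQL client and insert rows directly:

-- Connect with a standard client, e.g.: psql -h localhost -p 4566 -d dev -U root
CREATE TABLE sqltype_events (
    event_id   INT,
    payload    VARCHAR,
    event_time TIMESTAMP
);
INSERT INTO sqltype_events VALUES (1, 'signup', NOW());

Any materialized view defined over sqltype_events updates as soon as the INSERT commits.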

FAQ

What is the difference between streaming SQL and batch SQL?

Streaming SQL runs continuously and maintains precomputed results that update incrementally as new data arrives. Batch SQL runs at a point in time, scans the relevant input data, and produces a result before stopping. The practical consequence is that streaming SQL delivers sub-second freshness while batch SQL delivers results as of the last scheduled run, which may be minutes, hours, or days ago. Streaming SQL is efficient for high-frequency data changes because it processes only new events. Batch SQL is efficient for historical analysis because it can read large datasets in optimized sequential scans.

Can I replace dbt with streaming SQL?

Not entirely, and you should not try to. dbt excels at data modeling, testing, documentation, and managing the relationships between SQL models in a DAG. Those capabilities are not replicated by a streaming database. The better framing is: use dbt for historical and analytical workloads that tolerate hourly or daily freshness, and add a streaming database like RisingWave for use cases that need results within seconds. The two tools address different parts of the data stack and are often deployed together.

How does streaming SQL handle late-arriving data?

Late-arriving events are a well-known challenge in stream processing. RisingWave handles them through watermarks, which define how long the system waits for late data before finalizing a time window. You can configure a watermark on any timestamp column in a source or table: WATERMARK FOR order_time AS order_time - INTERVAL '30 SECONDS'. Events arriving within that delay are included in the correct window. Events arriving after the watermark threshold may be dropped or handled by your application logic. For most operational use cases, a 30-second to 5-minute watermark covers the vast majority of late arrivals. See the watermarks documentation for full details.
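Expanded into full DDL, a hedged sketch of that watermark (the column list is illustrative; RisingWave requires the table to be append-only for a watermark to be defined on it):

-- Events more than 30 seconds behind the watermark may be dropped
CREATE TABLE sqltype_orders_stream (
    order_id   INT,
    quantity   INT,
    unit_price DECIMAL,
    order_time TIMESTAMP,
    WATERMARK FOR order_time AS order_time - INTERVAL '30 SECONDS'
) APPEND ONLY;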

When should I use a materialized view versus a regular view?

A regular view (CREATE VIEW) is a saved SQL query that executes every time you query the view. A materialized view (CREATE MATERIALIZED VIEW) stores the computed result and updates it incrementally as underlying data changes. In RisingWave, the rule of thumb is: if you need to query the same aggregation frequently and the underlying data changes often, use a materialized view. If the query is simple, runs rarely, or the data rarely changes, a regular view is sufficient. Materialized views consume storage for the precomputed result; regular views consume no additional storage.
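To make the contrast concrete, here is a minimal sketch of both definitions over the same aggregation (view names are illustrative):

-- Recomputed from scratch on every read
CREATE VIEW sqltype_v_revenue AS
SELECT region, SUM(quantity * unit_price) AS revenue
FROM sqltype_orders
GROUP BY region;

-- Precomputed once, then maintained incrementally as orders arrive
CREATE MATERIALIZED VIEW sqltype_mv_revenue AS
SELECT region, SUM(quantity * unit_price) AS revenue
FROM sqltype_orders
GROUP BY region;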

Conclusion

The choice between streaming SQL and batch SQL comes down to one question: how fresh does the result need to be?

If hourly or daily freshness is acceptable, batch SQL tools like dbt and Spark SQL are mature, well-understood, and easy to operate. Build your analytical layer there.

If stakeholders need results that reflect the last few seconds of data, or if your pipeline processes continuous event streams from Kafka or a CDC-enabled database, streaming SQL with continuously maintained materialized views is the right model. The SQL syntax is familiar, the operational model is simpler than traditional stream processing frameworks, and the freshness improvement is significant.

Most production data platforms end up with both: batch SQL for historical analysis and reporting, streaming SQL for operational metrics and real-time use cases. The good news is that both are written in SQL, so your team's existing skills transfer directly.


Ready to try streaming SQL? Get started with RisingWave in 5 minutes with the Quickstart guide.

Join the RisingWave Slack community to ask questions and connect with other data engineers building real-time pipelines.
