What Is Streaming SQL?
Streaming SQL is a game-changer in the realm of real-time data processing. This article discusses the emergence of streaming SQL, how it works, and provides an example of Streaming SQL with Materialized View.
Streaming SQL is a game-changer in the realm of real-time data processing. This article discusses the emergence of streaming SQL, how it works, and provides an example of Streaming SQL with Materialized View.
In the age of big data, where data is generated at an unprecedented pace, the need for real-time data processing is more critical than ever. Enter Streaming SQL, a game-changer in the realm of real-time data processing.
Streaming SQL is an extension of SQL (Structured Query Language) that is designed to process and analyze data streams in real-time. Unlike traditional SQL, which performs full computations on static data stored in relational databases, Streaming SQL performs incremental computations on continuously generated data streams, such as sensor data, log files, social media posts, or stock market trades. This incremental computation is crucial for handling the vast amounts of data generated in real-time and providing timely insights.
In the modern world, businesses and organizations generate a vast amount of data every second. This data comes from various sources like web applications, IoT devices, social media, and more. Analyzing this data in real-time is crucial for making timely decisions and gaining insights into business operations, customer behavior, and market trends.
For example, a financial institution needs to monitor stock market data in real-time to make investment decisions. Similarly, a retail business needs to analyze customer interactions and transactions in real-time to provide personalized recommendations and offers.
Traditional batch processing methods, where data is collected, stored, and then processed at a later time, are not suitable for these use cases. This is where real-time data processing with Streaming SQL comes into play.
Streaming SQL enables you to write SQL-like queries to process and analyze data streams in real-time. It provides a familiar and declarative language for specifying operations on data streams, making it easier for developers and analysts who are already familiar with SQL to work with streaming data.
Here's how it works:
Suppose you are running an online store and want to monitor the shopping activity of your customers in real-time. You have a stream of shopping cart events (additions, removals, etc.) and want to maintain a real-time materialized view of the current state of each cart.
Let's assume you have an events_stream
with the following schema:
event_id
: Unique identifier for the event.cart_id
: Unique identifier for the shopping cart.product_id
: Unique identifier for the product.event_type
: Type of event ('add' or 'remove').quantity
: Quantity of the product added or removed.event_time
: Timestamp of the event.CREATE TABLE event_stream(
event_id int,
cart_id int,
product_id int,
event_type varchar,
quantity int,
event_time timestamptz
);
You can create a materialized view of the current state of each cart using Streaming SQL as follows:
CREATE MATERIALIZED VIEW cart_state AS
SELECT
cart_id,
product_id,
SUM(CASE WHEN event_type = 'add' THEN quantity ELSE -quantity END) as quantity
FROM events_stream
GROUP BY cart_id, product_id;
In this example, the cart_state
materialized view will be continuously updated as new events arrive in the events_stream
. The query will perform incremental computations to aggregate the quantity of each product in each cart by adding the quantity for 'add' events and subtracting the quantity for 'remove' events, rather than performing a full computation on the entire data set.
Now, you can query the cart_state
materialized view to get the current state of a specific cart or product.
SELECT *
FROM cart_state
WHERE cart_id = '1234';
This query will return the current state of the cart with cart_id
'1234'.
Streaming SQL goes beyond the capabilities of standard SQL by introducing advanced features like window functions and watermarks that are essential for real-time data processing.
Window functions allow you to perform operations on each segment of data that falls into a particular window or time frame. This is crucial for analyzing data streams that are continuously generated.
There are different types of window functions, but let's focus on two common ones: tumbling windows and hopping windows.
A tumbling window is a fixed-size, non-overlapping, and contiguous time interval. It is used to group events that arrive within a specified time period.
For example, if you want to calculate the amount changes every minute, you can use a tumbling window of one minute.
CREATE MATERIALIZED VIEW product_quantity_changes_per_minute AS
SELECT
window_start,
window_end,
product_id,
SUM(CASE WHEN event_type = 'add' THEN quantity ELSE -quantity END) as quantity_changes
FROM TUMBLE(events_stream, event_time, INTERVAL '1' MINUTE)
GROUP BY window_start, window_end, product_id;
In this example, the query will calculate the quantity_changes
for each one-minute tumbling window.
A hopping window is a fixed-size time interval that hops by a specified hop size. The windows can overlap if the window size is larger than the hop size.
For example, if you want to calculate the quantity changes every minute, but update the result every 30 seconds, you can use a hopping window of one minute with a hop size of 30 seconds.
CREATE MATERIALIZED VIEW product_quantity_changes_per_minute AS
SELECT
window_start,
window_end,
product_id,
SUM(CASE WHEN event_type = 'add' THEN quantity ELSE -quantity END) as quantity_changes
FROM HOP(events_stream, event_time, INTERVAL '30' SECOND, INTERVAL '1' MINUTE)
GROUP BY window_start, window_end, product_id;
In this example, the query will calculate the quantity_changes
for each one-minute window, but the result will be updated every 30 seconds.
Watermarks are a crucial feature in streaming SQL that helps handle late events or out-of-order events in data streams.
A watermark is a metadata associated with events in a data stream that represents the point in time up to which the events are complete. It is used to specify how late an event can arrive relative to the current processing time.
For example, if you want to calculate the quantity changes every minute, but allow events to arrive up to 30 seconds late, you can specify a watermark of 30 seconds.
CREATE TABLE event_stream (
event_id int,
cart_id int,
product_id int,
event_type varchar,
quantity int,
event_time timestamptz,
watermark for event_time as event_time - INTERVAL '30' SECOND
) APPEND ONLY;
CREATE MATERIALIZED VIEW product_quantity_changes_per_minute AS
SELECT
window_start,
window_end,
product_id,
SUM(CASE WHEN event_type = 'add' THEN quantity ELSE -quantity END) as quantity_changes
FROM HOP(events_stream, event_time, INTERVAL '30' SECOND, INTERVAL '1' MINUTE)
GROUP BY window_start, window_end, product_id;
In this example, the query will calculate the quantity_change
for each one-minute tumbling window, but allow events to arrive up to 30 seconds late.
Conclusion
Streaming SQL extends the capabilities of standard SQL with advanced features like window functions and watermarks that are essential for real-time data processing and analysis. It performs incremental computations on data streams, rather than full computations on static data, making it a powerful and flexible tool for real-time data processing and analysis.
Whether you are monitoring the stock market, analyzing social media trends, or optimizing supply chains, Streaming SQL can help you process and analyze data streams in real-time, providing timely insights and supporting data-driven decision-making.
Wondering how to build a real-time market analysis pipeline? In this tutorial, we'll combine Databento's market data streams with RisingWave's stream processing capabilities to create a live market analysis system.
Join us in exploring some of the most highly anticipated and prominent features of this new release!This is a major leap forward and we are excited to share these developments with you.
RisingWave has already demonstrated its ability to simplify and accelerate complex workflows in portfolio monitoring and risk management. For engineers in capital markets, adopting RisingWave isn’t just about keeping up—it’s about staying ahead.