Streaming Aggregation
Streaming Aggregation is the process of continuously calculating summary statistics (like counts, sums, averages, minimums, maximums, or more complex user-defined aggregates) over groups of data within unbounded data streams. Unlike batch aggregation, which computes these statistics once over a finite dataset, streaming aggregation produces results that update dynamically and incrementally as new data arrives or old data expires from a defined scope (often a window).
This is a core stateful operation in stream processing, essential for understanding trends, patterns, and key metrics in real-time.
Core Concepts
- Continuous Calculation: Aggregates are not computed once but are maintained and updated over time.
- Grouping Keys: Aggregations are typically performed per group, defined by one or more key fields in the data (e.g., SUM(sales) GROUP BY product_category).
- Windowing: Since streams are infinite, aggregations are almost always performed over a defined window to make them meaningful and manageable. Common window types include:
- Tumbling Windows: Fixed-size, non-overlapping windows (e.g., hourly sales).
- Hopping Windows: Fixed-size, overlapping windows (e.g., 1-hour sales, updated every 5 minutes).
- Session Windows: Windows based on periods of activity for a key (e.g., average click count per user session).
- Sliding Windows (less common in emit-continuously systems): Similar to hopping but with more focus on continuous updates.
- Global Aggregation (less common for true streams without retraction): Aggregating over all data ever seen (can lead to unbounded state if not carefully managed or if the underlying data isn't naturally finite or retractable).
- Incremental Updates: Efficient stream processing engines update aggregate results incrementally. When a new event arrives, only the affected aggregate values are updated, rather than recomputing from all raw data in the window. This is crucial for low latency and scalability.
- State Management: The system needs to maintain the intermediate state for each group and window (e.g., current sum and count to calculate an average). This state must be managed reliably and durably.
- Emit Strategies: Defines when and how the results of the aggregation are emitted:
- Emit on Window Close: Results are sent downstream only when a window closes.
- Continuous Updates (Emit on Change): Results are updated and emitted whenever a new event causes a change in an aggregate value (common in systems like RisingWave that use materialized views).
Common Aggregate Functions
Standard SQL aggregate functions are typically supported:
- COUNT()
- SUM()
- AVG()
- MIN()
- MAX()
- And often more specialized ones like APPROX_COUNT_DISTINCT() (for approximate distinct counts like HyperLogLog), or support for User-Defined Aggregations (UDAFs).
Challenges
- State Management: Potentially large state if grouping by high-cardinality keys or using long windows.
- Out-of-Order Data: Handling late data correctly is critical for accuracy, especially with event-time windowing. Watermarks are used to manage this.
- Window Completeness vs. Latency: Deciding when a window is "complete enough" to emit results is a trade-off.
Streaming Aggregation in RisingWave
RisingWave provides powerful and easy-to-use streaming aggregation capabilities through its SQL interface, primarily via Materialized Views:
- SQL Syntax: Users define streaming aggregations using standard SQL aggregate functions (SUM, COUNT, AVG, etc.) and GROUP BY clauses. Windowing is often incorporated using time window functions or conditions.
- Materialized Views: The results of streaming aggregations are typically stored in Materialized Views. These views are incrementally and automatically updated by RisingWave as new data arrives.
- Low-Latency Results: Because the aggregates are pre-computed and maintained in materialized views, querying them is extremely fast, providing low-latency access to fresh summary statistics.
- Efficient State Management: RisingWave's Hummock state store efficiently manages the underlying state required for these aggregations.
- Support for Various Window Types: RisingWave supports different types of windowed aggregations.
Example (Conceptual SQL for a Tumbling Window Aggregation):
CREATE MATERIALIZED VIEW hourly_sales_summary AS
SELECT
TUMBLE_START(order_timestamp, INTERVAL '1 hour') AS hour_window_start,
product_category,
SUM(order_amount) AS total_sales,
COUNT(*) AS number_of_orders
FROM
orders_stream
GROUP BY
TUMBLE(order_timestamp, INTERVAL '1 hour'),
product_category;
This materialized view would provide an hourly summary of total sales and order counts for each product category, updated continuously as new orders stream in.
Related Glossary Terms