Streaming Aggregation

Streaming Aggregation is the process of continuously calculating summary statistics (like counts, sums, averages, minimums, maximums, or more complex user-defined aggregates) over groups of data within unbounded data streams. Unlike batch aggregation, which computes these statistics once over a finite dataset, streaming aggregation produces results that update dynamically and incrementally as new data arrives or old data expires from a defined scope (often a window).

This is a core stateful operation in stream processing, essential for tracking trends, patterns, and key metrics in real time.

Core Concepts

  • Continuous Calculation: Aggregates are not computed once but are maintained and updated over time.
  • Grouping Keys: Aggregations are typically performed per group, defined by one or more key fields in the data (e.g., SUM(sales) GROUP BY product_category).
  • Windowing: Since streams are infinite, aggregations are almost always performed over a defined window to make them meaningful and manageable. Common window types include:
    • Tumbling Windows: Fixed-size, non-overlapping windows (e.g., hourly sales).
    • Hopping Windows: Fixed-size, overlapping windows (e.g., 1-hour sales, updated every 5 minutes).
    • Session Windows: Windows based on periods of activity for a key (e.g., average click count per user session).
    • Sliding Windows (less common in emit-continuously systems): Similar to hopping windows, but the window is typically re-evaluated whenever an event enters or leaves it rather than at a fixed hop interval.
    • Global Aggregation (less common for true streams without retraction): Aggregating over all data ever seen, with no window. State can grow without bound unless it is carefully managed or the underlying data is naturally finite or retractable.
  • Incremental Updates: Efficient stream processing engines update aggregate results incrementally. When a new event arrives, only the affected aggregate values are updated, rather than recomputing from all raw data in the window. This is crucial for low latency and scalability.
  • State Management: The system needs to maintain the intermediate state for each group and window (e.g., current sum and count to calculate an average). This state must be managed reliably and durably.
  • Emit Strategies: These define when and how the results of the aggregation are emitted:
    • Emit on Window Close: Results are sent downstream only when a window closes.
    • Continuous Updates (Emit on Change): Results are updated and emitted whenever a new event causes a change in an aggregate value (common in systems like RisingWave that use materialized views).
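
The windowing, incremental-update, and emit-on-change ideas above can be sketched in a few lines of Python. This is an illustrative sketch only, not how RisingWave or any other engine implements them. The names tumbling_window_start, hopping_window_starts, and IncrementalSum are hypothetical, timestamps are assumed to be integer seconds, and the hopping window size is assumed to be a multiple of the hop.

```python
from collections import defaultdict

HOUR = 3600  # window size in seconds (assumed unit)


def tumbling_window_start(ts: int, size: int) -> int:
    """Start of the single non-overlapping window [start, start + size) containing ts."""
    return ts - ts % size


def hopping_window_starts(ts: int, size: int, hop: int) -> list[int]:
    """Starts of every overlapping window [start, start + size) containing ts.

    Assumes size is a multiple of hop, so each event lands in size // hop windows.
    """
    last = ts - ts % hop        # latest window start at or before ts
    first = last - size + hop   # earliest start whose window still covers ts
    return list(range(first, last + 1, hop))


class IncrementalSum:
    """Keeps one running SUM per (window_start, key) and emits a row on every change."""

    def __init__(self, size: int):
        self.size = size
        self.totals = defaultdict(int)  # state: (window_start, key) -> running sum

    def on_event(self, ts: int, key: str, amount: int):
        group = (tumbling_window_start(ts, self.size), key)
        self.totals[group] += amount          # only the affected group is updated
        return (*group, self.totals[group])   # emit on change: the updated row


agg = IncrementalSum(HOUR)
print(agg.on_event(10, "books", 20))    # first row for window 0, key "books"
print(agg.on_event(50, "books", 5))     # same group: the running sum is updated in place
print(agg.on_event(3700, "books", 7))   # a new window gets its own state
```

Each call does a constant amount of work regardless of how many events the window already holds, which is the point of incremental maintenance. An emit-on-window-close variant would hold these rows back until the window is known to be finished.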

Common Aggregate Functions

Standard SQL aggregate functions are typically supported:

  • COUNT()
  • SUM()
  • AVG()
  • MIN()
  • MAX()
  • Often, more specialized functions such as APPROX_COUNT_DISTINCT() (approximate distinct counts, typically computed with a sketch such as HyperLogLog), or support for User-Defined Aggregate Functions (UDAFs).
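
In a streaming engine, each of these functions is usually backed by a small piece of intermediate state rather than by the raw rows. The sketch below shows one possible shape for that state, assuming a simple add / merge / result interface. The class names AvgState and SpreadState and the interface itself are hypothetical, and SpreadState stands in for a user-defined aggregate (MAX minus MIN); no particular engine's UDAF API is implied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AvgState:
    """Intermediate state for AVG: a running sum and count, not the raw rows."""
    total: float = 0.0
    count: int = 0

    def add(self, value: float) -> None:
        self.total += value
        self.count += 1

    def merge(self, other: "AvgState") -> None:
        """Fold in a partial aggregate computed elsewhere (e.g. another partition)."""
        self.total += other.total
        self.count += other.count

    def result(self) -> Optional[float]:
        # Mirrors SQL, where AVG over zero rows is NULL.
        return self.total / self.count if self.count else None


@dataclass
class SpreadState:
    """Hypothetical user-defined aggregate: MAX(value) - MIN(value)."""
    lo: Optional[float] = None
    hi: Optional[float] = None

    def add(self, value: float) -> None:
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def merge(self, other: "SpreadState") -> None:
        # Re-adding the other side's extremes is enough to combine the two states.
        for v in (other.lo, other.hi):
            if v is not None:
                self.add(v)

    def result(self) -> Optional[float]:
        return None if self.lo is None else self.hi - self.lo


left, right = AvgState(), AvgState()
for v in (1, 2, 3):
    left.add(v)
right.add(6)
left.merge(right)
print(left.result())  # average over all four values
```

Keeping the state mergeable is a design choice that lets partial aggregates be computed in parallel and combined later.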

Challenges

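  • State Management: State can grow large when grouping by high-cardinality keys or using long windows, because intermediate state is kept per group and per window.
  • Out-of-Order Data: Handling late data correctly is critical for accuracy, especially with event-time windowing. Watermarks, which estimate how far event time has progressed, are used to decide when late data can no longer affect a window.
  • Window Completeness vs. Latency: Deciding when a window is "complete enough" to emit results is a trade-off: waiting longer admits more late events but delays the output.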
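
The interaction between watermarks, late data, and emit-on-window-close can be illustrated with a small Python sketch. It assumes one common watermark heuristic (the largest event time seen so far minus a fixed allowed lateness) and simply drops events whose window has already been emitted. The class name WindowedCount and its parameters are hypothetical, and real engines offer more options for handling late data.

```python
from collections import defaultdict


class WindowedCount:
    """Event-time tumbling COUNT that emits each window once, after the watermark passes its end."""

    def __init__(self, size: int, allowed_lateness: int):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")   # assumed: max event time seen minus allowed lateness
        self.counts = defaultdict(int)   # window_start -> count, open windows only
        self.dropped = 0                 # events that arrived after their window closed

    def on_event(self, ts: int):
        start = ts - ts % self.size
        if start + self.size <= self.watermark:
            self.dropped += 1            # window already emitted, too late to count
            return []
        self.counts[start] += 1
        self.watermark = max(self.watermark, ts - self.allowed_lateness)
        # Close every window whose end is at or before the watermark.
        closed = sorted(s for s in self.counts if s + self.size <= self.watermark)
        return [(s, self.counts.pop(s)) for s in closed]


w = WindowedCount(size=10, allowed_lateness=5)
for ts in (1, 12, 8, 16, 3, 27):         # 8 is out of order but in time, 3 is too late
    for window_start, count in w.on_event(ts):
        print(window_start, count)
print("dropped:", w.dropped)
```

A larger allowed_lateness would have accepted the event at time 3, at the cost of emitting window 0 later, which is the completeness versus latency trade-off in miniature.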

Streaming Aggregation in RisingWave

RisingWave provides powerful and easy-to-use streaming aggregation capabilities through its SQL interface, primarily via Materialized Views:

  • SQL Syntax: Users define streaming aggregations using standard SQL aggregate functions (SUM, COUNT, AVG, etc.) and GROUP BY clauses. Windowing is often incorporated using time window functions or conditions.
  • Materialized Views: The results of streaming aggregations are typically stored in Materialized Views. These views are incrementally and automatically updated by RisingWave as new data arrives.
  • Low-Latency Results: Because the aggregates are pre-computed and maintained in materialized views, a query reads stored results instead of recomputing them from raw events, providing low-latency access to fresh summary statistics.
  • Efficient State Management: RisingWave's Hummock state store efficiently manages the underlying state required for these aggregations.
  • Support for Various Window Types: RisingWave supports several kinds of windowed aggregation, including tumbling and hopping windows through its TUMBLE() and HOP() time window functions.

Example (Conceptual SQL for a Tumbling Window Aggregation):

CREATE MATERIALIZED VIEW hourly_sales_summary AS
SELECT
    window_start AS hour_window_start, -- Column added by the TUMBLE() time window function
    product_category,
    SUM(order_amount) AS total_sales,
    COUNT(*) AS number_of_orders
FROM
    TUMBLE(orders_stream, order_timestamp, INTERVAL '1 hour') -- Assigns each order to a 1-hour tumbling window
GROUP BY
    window_start,
    product_category;

This materialized view would provide an hourly summary of total sales and order counts for each product category, updated continuously as new orders stream in.
