Stream Processing Glossary

Core Concepts

Deserialization

The reverse process of converting serialized data back into its original data structure or object.

Downstream

Refers to components or operations that come later in a data processing pipeline, closer to the sink.

Event Stream

A continuous flow of data generated by events, such as user interactions, sensor readings, or system logs.

Event Time

The time at which an event actually occurred, as recorded in the event data.

Event-Time Latency

The time difference between when an event actually occurred and when it is processed by the stream processing system.

Latency

The time delay between an event occurring and the result of processing that event being available.

Processing Time

The time at which an event is processed by the stream processing system.

Real-Time Analytics

The practice of analyzing data as it is generated, providing immediate insights and enabling timely decision-making.

Serialization

The process of converting data structures or objects into a format that can be stored or transmitted (e.g., JSON, Avro, Protobuf).
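
In SQL-based stream processors such as RisingWave, the serialization format of an incoming stream is declared when the source is created, so the system knows how to deserialize each message. A minimal sketch, assuming a Kafka topic of JSON-encoded click events (the topic, broker address, and columns are illustrative):

    CREATE SOURCE user_clicks (
        user_id INT,
        url VARCHAR,
        clicked_at TIMESTAMP
    ) WITH (
        connector = 'kafka',
        topic = 'clicks',                            -- illustrative topic name
        properties.bootstrap.server = 'broker:9092'
    ) FORMAT PLAIN ENCODE JSON;                      -- deserialize each message as JSON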

Throughput

The rate at which a stream processing system can process events, typically measured in events per second or bytes per second.

Timestamp

A piece of metadata associated with an event that indicates when it occurred.

Upstream

Refers to components or operations that come earlier in a data processing pipeline, closer to the source.

Stream Processing Operations

Aggregation

The process of summarizing data in a stream by computing metrics such as sum, average, count, or max.
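
In streaming SQL, an aggregation is typically expressed as a materialized view whose results update incrementally as events arrive. A sketch, reusing the illustrative user_clicks source from the Serialization entry:

    CREATE MATERIALIZED VIEW clicks_per_user AS
    SELECT
        user_id,
        COUNT(*) AS click_count        -- maintained incrementally per key
    FROM user_clicks
    GROUP BY user_id;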

Complex Event Processing (CEP)

A method of analyzing streams of events to detect patterns, correlations, and trends.

Deduplication

The process of removing duplicate events from a stream to ensure accurate processing.
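
One common streaming SQL pattern keeps only the earliest event per key using a ranked subquery. A sketch against the illustrative user_clicks source (the deduplication key is an assumption):

    CREATE MATERIALIZED VIEW deduped_clicks AS
    SELECT user_id, url, clicked_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id, url    -- deduplication key
                   ORDER BY clicked_at          -- keep the earliest occurrence
               ) AS row_num
        FROM user_clicks
    ) t
    WHERE row_num = 1;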

Hopping Window

A fixed-size window that advances by a smaller step than its size, creating overlaps.
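
For example, counting clicks over 5-minute windows that advance every minute. A sketch using the HOP table function found in systems such as RisingWave (the argument order shown is hop step, then window size):

    CREATE MATERIALIZED VIEW clicks_5m_sliding AS
    SELECT window_start, window_end, COUNT(*) AS clicks
    FROM HOP(user_clicks, clicked_at, INTERVAL '1 MINUTE', INTERVAL '5 MINUTES')
    GROUP BY window_start, window_end;

Each event falls into five overlapping windows here, because the 5-minute window advances in 1-minute steps.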

Pattern Detection

The process of identifying sequences of events that match a specific pattern within a stream.

Session Window

A variable-size window defined by periods of activity separated by gaps of inactivity; the window closes when no new events arrive within a specified timeout.

Side Input

An auxiliary input to a stream processing pipeline, often used to provide additional context or data to enrich the main stream.

Sliding Window Join

A join operation that matches events from two streams whose timestamps fall within a specified time range of each other.

Stream Aggregation

The process of computing summary statistics (e.g., sum, average, count) over a data stream; used interchangeably with Aggregation.

Stream Enrichment

The process of adding additional data to events in a stream, often by joining the stream with a reference dataset.
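
For example, joining a click stream against a reference table of user profiles. A sketch with illustrative names:

    CREATE MATERIALIZED VIEW enriched_clicks AS
    SELECT c.user_id, c.url, c.clicked_at, u.country
    FROM user_clicks c
    JOIN user_profiles u              -- reference dataset with profile data
      ON c.user_id = u.user_id;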

Stream Join

An operation that combines events from two or more streams based on a common key, often within a specified time window.
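
In streaming SQL, a join between two streams is often bounded in time so that the join state does not grow without limit. A sketch of a time-bounded join between illustrative orders and payments streams:

    CREATE MATERIALIZED VIEW orders_with_payments AS
    SELECT o.order_id, o.amount, p.paid_at
    FROM orders o
    JOIN payments p
      ON o.order_id = p.order_id                             -- common key
     AND p.paid_at BETWEEN o.created_at
                       AND o.created_at + INTERVAL '1 HOUR'; -- time bound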

Tumbling Window

A fixed-size, non-overlapping window. Each event belongs to exactly one tumbling window.
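
For example, assigning each click to exactly one 1-minute bucket. A sketch using the TUMBLE table function found in systems such as RisingWave:

    CREATE MATERIALIZED VIEW clicks_per_minute AS
    SELECT window_start, COUNT(*) AS clicks
    FROM TUMBLE(user_clicks, clicked_at, INTERVAL '1 MINUTE')
    GROUP BY window_start;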

Stateful Processing

A type of stream processing where the system maintains and updates state based on processed events.

Stateless Processing

A type of stream processing where each event is processed independently, without relying on any information about past events.

Window Aggregation

The process of computing an aggregate function (e.g., sum, average, count) over the events within a specific window.

Window Closing

The point in time when a window is considered complete and its results are finalized.

Windowing

A technique for dividing a continuous stream of data into finite chunks or "windows" based on time, count, or other criteria.

Trigger

A mechanism that determines when the results of a windowed computation should be emitted.

Data Handling and Ordering

Allowed Lateness

The maximum amount of time after a window closes that late events can still be processed.

Data Partitioning

The process of dividing a data stream into smaller, independent subsets (partitions) based on a key.

Keyed Stream

A data stream where each event has a key, allowing for partitioning and stateful processing.

Late Data

Events that arrive after the watermark for their corresponding window has already passed.

Out-of-Order Events

Events that arrive at the stream processing system in a different order than their event timestamps; that is, an event arrives after other events that carry later timestamps.

Stream Shuffling

The process of redistributing events in a stream based on a key, often to balance load across processing tasks.

Time Skew

The difference between an event's event time and its processing time, typically introduced by network delays, buffering, or clock drift.

Watermark

A mechanism in event stream processing for tracking the progress of event time. A watermark carrying timestamp t asserts that no further events with timestamps earlier than t are expected, which lets the system decide when windows can close.

Delivery Guarantees & Fault Tolerance

At-Least-Once Semantics

A delivery guarantee ensuring that each event is processed at least once: no events are lost, but an event may be processed more than once after a failure.

At-Most-Once Semantics

A delivery guarantee ensuring that each event is processed at most once: events are never duplicated, but some may be lost on failure.

Dead-Letter Queue

A special queue used to store events that cannot be processed due to errors or invalid data.

Exactly-Once Semantics

A strong delivery guarantee ensuring that each event is processed exactly once, even with failures.

Fault Tolerance

The ability of a system to continue operating correctly in the presence of failures.

Idempotence

A property of an operation where applying it multiple times has the same effect as applying it once.
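
For example, an upsert that sets a key to an absolute value is idempotent, so replaying it after a failure leaves the sink unchanged. A generic PostgreSQL-style sketch (table and values are illustrative):

    INSERT INTO user_totals (user_id, click_count)
    VALUES (42, 10)
    ON CONFLICT (user_id)
    DO UPDATE SET click_count = EXCLUDED.click_count;  -- replaying yields the same state

A counter-example: an increment such as SET click_count = click_count + 1 is not idempotent, because each replay changes the result.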

Outage Recovery

The process of restoring a stream processing system to normal operation after a failure.

State Management

Checkpoint

A consistent snapshot of the state of a stream processing application at a specific point in time.

Incremental Checkpoint

A type of checkpoint that only stores the changes to the state since the last checkpoint, rather than the entire state.

State

Information maintained by the stream processing system about past events.

System Architecture and Components

Backpressure

A mechanism that regulates the flow of data in a stream processing system by signaling upstream operators to slow down when downstream operators cannot keep up.

Dynamic Scaling

The ability of a stream processing system to automatically adjust resources based on workload demands.

Scalability

The ability of a stream processing system to handle increasing amounts of data and processing demands by adding more resources.

Sink

The destination where the processed data from a stream processing system is sent.

Source

The origin of a data stream, such as a message queue, a change data capture (CDC) feed from a database, or application logs.

RisingWave-Specific Concepts

Actor

A unit of concurrent computation in RisingWave.

Arrangement

The physical representation of data streams in RisingWave's storage layer.

Backfill (RisingWave-Specific)

A process in RisingWave for populating a materialized view with historical data while keeping it consistent with real-time updates.

Barrier

A control message used for synchronizing operators and ensuring consistent checkpointing in RisingWave.

Compactor

A service in RisingWave responsible for optimizing the storage layer by merging and reorganizing data.

Compute Node

A node in RisingWave's architecture that performs data processing tasks.

Connector

An interface in RisingWave that enables integration with external systems for data ingestion and output.

Continuous Query

A query in RisingWave that runs continuously on incoming data streams, updating its results in real time.

Frontend Node

The entry point for users to interact with a RisingWave cluster.

Materialized View

A precomputed and continuously updated result of a query in RisingWave.
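
The view is defined once and then maintained incrementally, so ordinary SELECT statements against it return up-to-date results. A sketch with illustrative names:

    CREATE MATERIALIZED VIEW hourly_revenue AS
    SELECT window_start, SUM(amount) AS revenue
    FROM TUMBLE(orders, created_at, INTERVAL '1 HOUR')
    GROUP BY window_start;

    -- An ordinary query reads the continuously maintained result:
    SELECT * FROM hourly_revenue ORDER BY window_start DESC LIMIT 24;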

Meta Node

The central coordinator in a RisingWave cluster that manages metadata and orchestrates cluster operations.

Query Plan

A representation of how a query will be executed in RisingWave.

Schema Change/Schema Evolution

The ability of a database to handle changes to the structure of data without requiring downtime or manual intervention.

Storage Node

A node in RisingWave's architecture that manages the storage of data streams and state.

Stream Fragment

A logical unit of a query plan in RisingWave.

Streaming SQL

A SQL-based language used in RisingWave for defining and querying data streams.

Unified Batch and Stream Processing

RisingWave's ability to handle both batch and stream processing workloads within a single system.

Watermark (RisingWave-Specific)

In RisingWave, a watermark tracks event-time progress on a designated timestamp column of a source or table, telling the system how far event time has advanced so that windows can be closed and state cleaned up.
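
A watermark is declared on a timestamp column when the source is created. A sketch assuming events may arrive up to five seconds out of order (names are illustrative):

    CREATE SOURCE sensor_readings (
        sensor_id INT,
        reading DOUBLE PRECISION,
        reported_at TIMESTAMP,
        WATERMARK FOR reported_at AS reported_at - INTERVAL '5' SECOND
    ) WITH (
        connector = 'kafka',
        topic = 'sensors',
        properties.bootstrap.server = 'broker:9092'
    ) FORMAT PLAIN ENCODE JSON;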

Systems and Tools

Apache Beam

A unified programming model for defining both batch and stream processing pipelines.

Apache Kafka

A distributed, high-throughput, fault-tolerant event streaming platform often used as a message broker for stream processing applications.

Apache Samza

A distributed stream processing framework that is tightly coupled with Apache Kafka.

Apache Spark Streaming

A micro-batch-based stream processing framework built on top of Apache Spark.

ksqlDB

A streaming database built on top of Apache Kafka.

Materialize

A streaming database that provides low-latency access to materialized views over streaming data.

Advanced Topics

Stream Replay

The ability to reprocess historical events from a stream, often for debugging or backfilling purposes.
