RisingWave Glossary

A

ACID Transactions (on Data Lake)

ACID Transactions bring database-style reliability guarantees (Atomicity, Consistency, Isolation, Durability) to data lake operations. Enabled by open table formats like Apache Iceberg, they ensure data integrity even with concurrent reads and writes on cloud object storage. This allows building more reliable data pipelines and analytics directly on the lake.

Aggregation (Streaming)

Streaming aggregation involves continuously calculating summary statistics (like counts, sums, or averages) over groups of data within unbounded data streams. This is often performed within time windows to understand trends or patterns over specific intervals. RisingWave supports powerful streaming aggregations via SQL, often materialized for low-latency access.
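
A minimal sketch of a continuous aggregation in RisingWave SQL, assuming an `orders` stream with `product_id` and `amount` columns (names are illustrative):

```sql
-- Counts and revenue per product, maintained incrementally
-- as new orders arrive.
CREATE MATERIALIZED VIEW order_stats AS
SELECT
    product_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_revenue
FROM orders
GROUP BY product_id;
```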

Apache Iceberg

Apache Iceberg is an open table format designed for huge analytic datasets stored in data lakes (like S3, GCS, ADLS). It brings database-like features such as ACID transactions, scalable metadata management, robust schema evolution, and time travel capabilities directly to lake storage. RisingWave can integrate with Iceberg, using it as a sink for streaming results within a streaming lakehouse architecture.
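
A hedged sketch of sinking a materialized view into an Iceberg table; the connector properties shown are illustrative and depend on your catalog and storage setup:

```sql
-- Illustrative Iceberg sink; exact WITH options vary by
-- RisingWave version, catalog type, and credentials.
CREATE SINK order_stats_iceberg FROM order_stats
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'product_id',
    warehouse.path = 's3://my-bucket/warehouse',  -- hypothetical bucket
    database.name = 'analytics',
    table.name = 'order_stats'
);
```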

At-Least-Once Semantics

At-Least-Once semantics is a data processing guarantee ensuring that each input event is processed one or more times, even in the face of failures. While it prevents data loss, it can result in duplicate processing or output records if failures and retries occur. Systems providing this guarantee often require downstream consumers to handle potential duplicates.

At-Most-Once Semantics

At-Most-Once semantics is a data processing guarantee ensuring that each input event is processed at most one time. This means events might be processed once or potentially skipped entirely (data loss) if failures occur before completion is acknowledged. It's simpler to implement but generally less desirable than at-least-once or exactly-once for critical data pipelines.

B

Backpressure

Backpressure is a feedback mechanism in data streaming systems where a slower downstream operator or sink signals to a faster upstream operator or source to reduce its data sending rate. This prevents the slower component from being overwhelmed, avoiding excessive buffering, potential data loss, or system instability. Effective backpressure management is crucial for building robust and stable streaming pipelines.

C

Change Data Capture (CDC)

Change Data Capture (CDC) is the process of observing changes occurring in a source database (inserts, updates, deletes) and extracting them as a stream of event data. This allows downstream systems, like RisingWave, to react to database modifications in real-time without directly querying the source database frequently. Common tools like Debezium generate CDC streams that RisingWave can ingest and process.
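
A sketch of ingesting Postgres changes with RisingWave's native CDC connector; connection properties are placeholders for your own environment:

```sql
-- Illustrative Postgres CDC setup; host, credentials, and
-- database names are placeholders.
CREATE SOURCE pg_cdc WITH (
    connector = 'postgres-cdc',
    hostname = 'db.example.com',
    port = '5432',
    username = 'rw_user',
    password = 'secret',
    database.name = 'shop'
);

-- Materialize one upstream table from the CDC stream.
CREATE TABLE orders (
    id INT PRIMARY KEY,
    product_id INT,
    amount DECIMAL
) FROM pg_cdc TABLE 'public.orders';
```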

Checkpointing

Checkpointing is a fault tolerance mechanism used in stateful stream processing systems like RisingWave. It involves periodically taking consistent snapshots of the internal state (e.g., aggregation results, join states) and storing them durably (like in RisingWave's State Store). If a failure occurs, the system can restore its state from the latest successful checkpoint and resume processing, minimizing data loss and ensuring consistency.

Compaction (Data Lake / Table Format)

Compaction is an optimization process used within open table formats like Apache Iceberg on data lakes. It involves merging many small data files into fewer larger files, which improves query performance by reducing metadata overhead and optimizing read operations. This is typically a background maintenance task essential for maintaining efficiency in large-scale lakehouse tables.

Connector

A Connector is a software component that enables a data system, like RisingWave, to communicate with external systems for data ingestion (sources) or data egress (sinks). Connectors handle the specifics of interacting with different technologies, such as message queues (Kafka, Pulsar), databases (via JDBC or CDC), or file systems (S3). They abstract away the protocol and format details, simplifying pipeline construction.

Consistency (State Management)

Consistency in stateful stream processing refers to the guarantees provided about the visibility and ordering of state updates, especially in the presence of failures and recovery. Systems like RisingWave aim for strong consistency, ensuring that operations reflect a valid state and that recovered state aligns correctly with processed inputs, often tied to achieving exactly-once semantics.

Continuous Query

A Continuous Query is a query designed to run perpetually over unbounded, continuously arriving data streams, producing results that update dynamically as new data flows in. Unlike traditional database queries that run once on static data, continuous queries define ongoing computations. In RisingWave, SQL is used to define continuous queries, often materialized into views for efficient access.

D

Data Lake

A Data Lake is a centralized repository designed to store large quantities of structured, semi-structured, and unstructured data in its raw format. Built on cost-effective storage like AWS S3 or Google Cloud Storage, it provides flexibility but historically lacked the performance and transactional guarantees of data warehouses. Data lakes serve as the foundation for modern architectures like the Lakehouse.

Data Lakehouse

A Data Lakehouse is a modern data management architecture that combines the low-cost, flexible storage of data lakes with the structure, governance, and performance features of data warehouses. It typically leverages open table formats like Apache Iceberg directly on lake storage, enabling ACID transactions, schema enforcement, and efficient SQL querying. RisingWave can act as a key component in a real-time lakehouse, processing streams and serving fresh results.

Data Partitioning (Streaming)

Data Partitioning in streaming involves distributing incoming data events across multiple parallel instances of processing operators based on a specific key within the data (e.g., user ID, sensor ID). This is essential for achieving scalability, as it allows processing load to be spread across multiple nodes or cores. Correct partitioning is crucial for stateful operations like joins and aggregations to ensure related data is processed together.

Data Stream / Event Stream

A Data Stream (or Event Stream) is an unbounded, ordered, and immutable sequence of data records or events generated continuously over time. Examples include sensor readings, user clickstreams, financial transactions, or database change events (CDC). Stream processing systems like RisingWave are designed to ingest, process, and analyze these continuous streams in real-time.

Dataflow Graph / Pipeline

A Dataflow Graph, or Streaming Pipeline, represents the logical structure of a stream processing job. It typically consists of sources where data originates, transformation operators (like filters, aggregations, joins) that process the data, and sinks where the results are sent. RisingWave compiles SQL queries into optimized dataflow graphs executed across its compute nodes.

Debezium

Debezium is a popular open-source distributed platform for Change Data Capture (CDC). It monitors source databases (like PostgreSQL, MySQL, MongoDB) and produces event streams capturing row-level changes (inserts, updates, deletes). RisingWave has built-in support for ingesting Debezium-formatted CDC streams (often via Kafka), enabling real-time processing of database changes.

Durability (State Management)

Durability in stateful stream processing ensures that once state updates are successfully committed, they persist and survive system failures, such as node restarts or crashes. In RisingWave, durability for streaming state is primarily achieved through its State Store (Hummock), which persists state snapshots and changes to reliable cloud storage, enabling recovery via checkpointing.

E

Emit Strategy / Output Mode

An Emit Strategy (or Output Mode) defines when and how results are produced from stateful streaming operations, particularly windowed aggregations. Common strategies include emitting results only when a time window closes, or emitting updates continuously as new data arrives and affects the window's results. The chosen strategy impacts downstream consumers and the semantics of the output stream.
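
In RisingWave, materialized views emit updates continuously by default; recent versions also support delaying output until a window is finalized. A hedged sketch (this syntax requires a watermark on the time column):

```sql
-- Emit each hourly count exactly once, when the window closes,
-- instead of updating it continuously.
CREATE MATERIALIZED VIEW hourly_counts AS
SELECT window_start, COUNT(*) AS cnt
FROM TUMBLE(events, event_time, INTERVAL '1 hour')
GROUP BY window_start
EMIT ON WINDOW CLOSE;
```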

Event Streaming Platform (ESP)

An Event Streaming Platform (ESP) is a system designed to capture, store, and process streams of events in real-time, durably, and at scale. Key examples include Apache Kafka, Apache Pulsar, and Redpanda. These platforms act as the central nervous system for event-driven architectures, often serving as the primary source and/or sink for stream processing engines like RisingWave.

Event Time

Event Time is the timestamp embedded within a data record itself, indicating when the actual event occurred in the real world (e.g., when a sensor reading was taken or a transaction happened). Processing data based on Event Time allows for accurate results even if data arrives out of order or is delayed. It's crucial for windowing operations and requires mechanisms like watermarks to handle correctly.

Event-Driven Architecture (EDA)

An Event-Driven Architecture (EDA) is a software design pattern where components communicate asynchronously by producing and consuming events. Events represent significant occurrences (e.g., an order placed, a sensor reading updated). Systems like Kafka act as the event backbone, and stream processors like RisingWave consume these events to trigger real-time processing, analytics, or downstream actions.

Exactly-Once Semantics (EOS)

Exactly-Once Semantics (EOS) is the strongest processing guarantee in stream processing, ensuring that each input event's effect on the system's state is recorded precisely once, even amidst failures and retries. Achieving true end-to-end EOS often involves coordination between the stream processor (like RisingWave, which provides EOS for its internal state) and the source/sink connectors using mechanisms like transactions or idempotent writes.

F

Fault Tolerance

Fault Tolerance is the ability of a system, like RisingWave, to continue operating correctly and maintain data integrity even when components fail (e.g., node crashes, network issues). In stateful streaming, this is typically achieved through mechanisms like state checkpointing, replication, and automatic recovery processes. Robust fault tolerance is essential for reliable real-time data processing pipelines.

Filtering (Streaming)

Filtering in streaming involves selectively processing or discarding incoming data events based on specified criteria applied to the event's content. It's a fundamental stateless transformation used to reduce data volume or isolate events of interest for subsequent processing steps like aggregation or joins. In RisingWave, filtering is typically done using the `WHERE` clause in SQL queries.
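
A minimal sketch, assuming the illustrative `orders` stream used elsewhere in this glossary:

```sql
-- Stateless filter: events failing the predicate are discarded.
CREATE MATERIALIZED VIEW big_orders AS
SELECT * FROM orders
WHERE amount > 100;
```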

H

High Availability (HA)

High Availability (HA) refers to system design principles and configurations aimed at ensuring continuous operational performance and minimizing downtime, typically through redundancy and automatic failover. For RisingWave, this involves running multiple instances of critical control plane components and having mechanisms for the system to automatically recover from compute node failures without interrupting service significantly.

Hopping Window

A Hopping Window is a type of time window used in stream processing that has a fixed duration but slides forward by a specified "hop" interval, allowing windows to overlap. For example, a 1-hour window hopping every 10 minutes would produce aggregations covering the last hour, updated every 10 minutes. This is useful for generating frequently updated, moving-average style results.
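
RisingWave exposes hopping windows through the `HOP` table function. A sketch assuming an `events` stream with an `event_time` column:

```sql
-- One-hour windows advancing every 10 minutes, so each event
-- belongs to six overlapping windows: HOP(source, time_col, hop, size).
SELECT window_start, window_end, COUNT(*) AS events_last_hour
FROM HOP(events, event_time, INTERVAL '10 minutes', INTERVAL '1 hour')
GROUP BY window_start, window_end;
```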

Hummock (RW Specific)

Hummock is RisingWave's purpose-built, cloud-native storage engine used within its State Store. It employs a Log-Structured Merge-tree (LSM-tree) architecture to efficiently manage large amounts of streaming state on durable object storage (like S3). Hummock provides features like checkpointing, versioning, and compaction, underpinning RisingWave's fault tolerance, scalability, and state management capabilities.

I

Idempotency

Idempotency refers to the property of an operation where executing it multiple times with the same input produces the same result or state as executing it just once. In stream processing, idempotent operations (especially on sinks) are crucial for achieving effective exactly-once semantics, as they allow safe retries after failures without causing duplicate side effects. For example, writing results with a unique key that allows upserts is an idempotent operation.
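
For illustration, a standard PostgreSQL-style upsert is idempotent because retrying it leaves the target row in the same final state (table and key names are hypothetical):

```sql
-- Safe to retry after a failure: redelivery overwrites rather
-- than duplicates.
INSERT INTO results (sensor_id, reading)
VALUES ('sensor-42', 17.5)
ON CONFLICT (sensor_id) DO UPDATE SET reading = EXCLUDED.reading;
```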

Incremental Computation (RW Specific Capability)

Incremental Computation is the core processing model used by RisingWave, particularly for materialized views. Instead of recomputing results from scratch when new data arrives, RisingWave processes only the incoming changes (inserts, updates, deletes) and efficiently updates the existing state and materialized results. This approach drastically reduces computation load and latency compared to batch reprocessing, enabling fresh, low-latency results.

Ingestion Time

Ingestion Time is the timestamp assigned to an event when it is first received or ingested by a specific system or component, such as RisingWave itself or an upstream broker like Kafka. While easier to manage than Event Time, processing based purely on Ingestion Time can lead to inaccurate results if events experience variable delays before ingestion, especially for time-windowed operations. It's distinct from Event Time (when the event occurred) and Processing Time (when computation happens).

J

Join (Streaming)

A Streaming Join combines data from two or more streams (or a stream and a table) based on common attributes or keys, producing a resulting stream of correlated events. Unlike batch joins on static datasets, streaming joins operate continuously on potentially unbounded, ever-changing inputs, requiring state management to store relevant data from each stream. RisingWave supports various streaming join types (e.g., stream-stream, stream-table) defined using SQL syntax.

K

Kappa Architecture

The Kappa Architecture is a data processing architecture that aims to handle both real-time and batch processing needs using a single streaming-based system. It contrasts with the Lambda Architecture by eliminating the separate batch processing layer, instead relying on the stream processing engine and a durable event log (like Kafka) to reprocess historical data if needed. Systems like RisingWave align well with the Kappa philosophy by providing a unified platform for stream processing and serving.

L

Lambda Architecture

The Lambda Architecture is a traditional big data processing pattern designed to handle both real-time and batch processing needs using two separate layers. It consists of a batch layer for comprehensive, accurate processing of large historical datasets, and a speed (or streaming) layer for low-latency processing of recent data, with results merged in a serving layer. Modern streaming systems like RisingWave, especially within a Lakehouse context, often aim to simplify or replace the Lambda Architecture with a more unified approach like Kappa.

Late Data

Late Data refers to events arriving at the stream processing system after their associated event time has already passed a certain threshold, typically defined by a watermark. Handling late data correctly is crucial for accuracy in event-time processing, especially for windowed operations. Systems may handle late data by dropping it, processing it separately, or allowing updates to previously computed window results if configured to do so.

Latency (Query vs. End-to-End)

Latency in streaming systems refers to the delay experienced by data. End-to-End Latency measures the total time from event occurrence to the final result being available, while Query Latency specifically measures the time taken to retrieve results from the system (e.g., querying a materialized view in RisingWave). Streaming databases like RisingWave aim to minimize both, providing low end-to-end processing latency via incremental computation and low query latency via pre-computed materialized views.

Lookup Join

A Lookup Join is a type of stream enrichment where each incoming event triggers a synchronous request (a "lookup") to an external system (like a database or microservice) to fetch additional data. While conceptually simple, lookup joins can introduce significant latency and become a bottleneck due to the external dependencies for each event. RisingWave's stream-table joins using its internal state offer a more efficient alternative for many enrichment scenarios.

M

Materialized View (RW Specific Feature)

A Materialized View in RisingWave is a core concept where the results of a continuous SQL query (defined using `CREATE MATERIALIZED VIEW`) over streaming data are automatically computed, stored, and kept up-to-date incrementally. Unlike traditional database MVs requiring manual refreshes, RisingWave MVs update dynamically with low latency as source streams change via incremental computation. This makes complex streaming results readily available for fast querying without recomputing on the fly.
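
A minimal sketch, assuming a `clickstream` source with a `page_id` column:

```sql
-- Computed once, then maintained incrementally; queries read
-- the stored result rather than recomputing it.
CREATE MATERIALIZED VIEW clicks_per_page AS
SELECT page_id, COUNT(*) AS click_count
FROM clickstream
GROUP BY page_id;

-- Queried like an ordinary table, always serving fresh results.
SELECT * FROM clicks_per_page WHERE click_count > 1000;
```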

Message Queue (MQ)

A Message Queue (MQ) is a system designed for asynchronous communication, typically between different services, using point-to-point or publish-subscribe patterns (e.g., RabbitMQ, SQS). While they handle message buffering and delivery, MQs traditionally focus on transient message delivery and may lack the strong ordering guarantees, long-term persistence, and replay capabilities of log-based Event Streaming Platforms like Kafka, which are more commonly paired with stream processors.

Microservices (in Streaming)

Microservices often leverage streaming systems for communication and data sharing in event-driven architectures. Services can publish events to an Event Streaming Platform (like Kafka) when significant state changes occur, and other services (or a stream processor like RisingWave) can consume these events to react, update their own state, or perform real-time analytics. This decouples services and enables scalable, real-time data flow.

O

Observability (Metrics, Logging, Tracing)

Observability refers to the ability to understand the internal state and behavior of a system based on the data it generates, typically through metrics, logs, and traces. For a streaming system like RisingWave, observability is crucial for monitoring performance, diagnosing issues, understanding data flow, and managing cluster resources effectively. RisingWave exposes various metrics and logs to support observability tooling.

Open Table Format

An Open Table Format (e.g., Apache Iceberg, Apache Hudi, Delta Lake) defines specifications for organizing and managing large analytical datasets stored in data lakes (like S3, GCS). They bring reliability features like ACID transactions, schema evolution guarantees, time travel, and performance optimizations (partitioning, compaction) directly to lake storage, enabling Data Lakehouse architectures. RisingWave can integrate with these formats, particularly Iceberg, as sinks for building real-time lakehouses.

P

Parallelism / Parallel Execution (RW Specific)

Parallelism in RisingWave refers to the execution of different parts of a streaming dataflow graph concurrently across multiple Compute Nodes or even multiple CPU cores within a node. RisingWave automatically parallelizes streaming jobs based on data partitioning and available resources to achieve high throughput and scalability. The degree of parallelism can often be configured to match the workload and cluster size.

Partitioning (Data Lake / Table Format)

Partitioning in the context of data lakes and open table formats involves organizing data files into separate directories based on the values of specific columns (e.g., date, region, customer ID). This physical layout allows query engines to skip reading irrelevant data, significantly improving query performance by reducing the amount of data scanned. Table formats like Iceberg manage partitioning complexity and allow partition schemes to evolve over time.

PostgreSQL Compatibility (RW Specific Feature)

PostgreSQL Compatibility means that RisingWave aims to understand the PostgreSQL SQL dialect and communicate using the PostgreSQL wire protocol. This allows users familiar with PostgreSQL to leverage their existing SQL knowledge when writing streaming queries in RisingWave. It also enables compatibility with a wide range of standard PostgreSQL client tools, libraries, and BI tools for connecting and interacting with RisingWave.

Processing Time

Processing Time is the time on the stream processing system's internal clock at the moment an event is processed or an operation occurs. Processing based on Processing Time is simpler than Event Time as it doesn't require handling out-of-order events or watermarks. However, results can be non-deterministic and sensitive to system load or processing delays, making it less suitable for applications requiring accuracy based on when events actually happened.

Projection (Streaming)

Projection in streaming involves selecting specific fields (columns) from data events and potentially transforming their values as they flow through a pipeline. It's a common stateless operation used to reshape data, remove unnecessary fields to reduce data size, or prepare data for subsequent processing steps. In RisingWave, projection is typically handled by the `SELECT` clause in SQL queries.
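
A minimal sketch, again using the illustrative `orders` stream:

```sql
-- Keep only the fields needed downstream and derive a new one.
CREATE MATERIALIZED VIEW slim_orders AS
SELECT id, product_id, amount, amount * 0.1 AS estimated_tax
FROM orders;
```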

R

Real-time Data Processing

Real-time Data Processing involves analyzing and acting on data almost instantaneously as it is generated or received, typically within milliseconds or seconds. This contrasts with traditional batch processing which operates on large, static datasets with higher latency. Streaming systems like RisingWave are specifically designed for real-time data processing, enabling immediate insights and automated responses.

Real-time Lakehouse / Streaming Lakehouse

A Real-time Lakehouse (or Streaming Lakehouse) extends the Data Lakehouse concept by incorporating real-time data ingestion, continuous processing, and low-latency querying capabilities directly on lake storage. It aims to unify batch and streaming workloads on a single architecture, often using a streaming engine like RisingWave alongside open table formats like Iceberg. This provides fresh, queryable data without needing separate speed layers.

Recovery Point Objective (RPO)

Recovery Point Objective (RPO) defines the maximum acceptable amount of data loss an organization can tolerate, measured in time, during a disaster or system failure. For a streaming system, RPO relates to how frequently state is checkpointed; a lower RPO (e.g., seconds) requires more frequent checkpointing to minimize potential data loss upon recovery. It's a key metric for designing fault-tolerant systems.

Recovery Time Objective (RTO)

Recovery Time Objective (RTO) defines the target maximum time allowed for restoring a system's functionality after a failure or disaster. For a streaming system like RisingWave, RTO depends on factors like the time needed to detect failure, provision replacement resources (if needed), load the last checkpoint, and resume processing from the correct point in the input streams. It dictates how quickly the system must be back online.

Reverse ETL

Reverse ETL is the process of copying data *from* a central data store (like a data warehouse or lakehouse) *back into* operational systems (like CRMs, marketing tools, or databases) where it can be used to drive business actions. While traditionally batch-oriented, streaming systems like RisingWave can potentially power real-time Reverse ETL use cases by sinking processed results directly into operational tools with low latency.

S

Scalability (Compute & Storage)

Scalability refers to a system's ability to handle increasing load (data volume, query complexity, request rate) by adding more resources. In RisingWave, this often involves scaling compute resources (adding Compute Nodes for processing power) and storage resources (leveraging cloud storage for state) independently. This separation allows tailoring the cluster configuration efficiently to specific workload demands.

Schema (Streaming)

In streaming, the Schema defines the structure and data types of events within a data stream. Managing schemas is crucial as data structures can change over time (schema evolution). Stream processing systems need mechanisms to handle schema changes gracefully, often involving integration with a Schema Registry, especially when using binary formats like Avro or Protobuf.

Schema Evolution (in Lakehouse)

Schema Evolution in a Lakehouse context refers to the ability to safely modify the schema (add, drop, rename, reorder, or change types of columns) of tables managed by open table formats like Apache Iceberg, even after data has been written. These formats provide guarantees that schema changes won't break existing data or require expensive table rewrites, which is essential for long-term data management on data lakes.

Schema Registry

A Schema Registry is a centralized service used to manage and validate schemas, particularly for serialization formats like Avro and Protobuf used in streaming platforms like Kafka. Producers register schemas before sending data, and consumers retrieve them to correctly deserialize incoming messages. This ensures data consistency and facilitates safe schema evolution across different services and applications.
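
A hedged sketch of a RisingWave source that resolves Avro schemas from a registry; the registry URL, broker, and topic are placeholders:

```sql
-- Avro-encoded Kafka source; schemas are fetched from the
-- registry rather than declared inline.
CREATE SOURCE avro_events
WITH (
    connector = 'kafka',
    topic = 'events-avro',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE AVRO (
    schema.registry = 'http://registry.example.com:8081'
);
```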

Separation of Storage and Compute (RW Specific)

Separation of Storage and Compute is a key architectural principle in RisingWave where the components responsible for data processing (Compute Nodes) are distinct from the system responsible for durable state storage (like the Hummock-based State Store using cloud object storage). This allows independent scaling of compute and storage resources based on workload needs, offering better elasticity, fault isolation, and potentially cost efficiency compared to tightly coupled architectures.

Serialization Format

A Serialization Format (or wire format) defines how structured data is converted into a sequence of bytes for transmission over a network or storage (e.g., JSON, Avro, Protobuf). Choosing an appropriate format impacts performance, storage size, schema evolution capabilities, and language interoperability. RisingWave supports various common serialization formats for interacting with sources and sinks like Kafka.

Session Window

A Session Window groups events based on periods of activity separated by defined gaps of inactivity. Unlike time-based windows (Tumbling, Hopping), Session Windows have variable durations determined by the data itself. They are useful for analyzing user sessions, tracking device activity periods, or grouping related events that occur closely together in time but without fixed intervals.

Sink

A Sink represents the destination for results processed by RisingWave, defined using the `CREATE SINK` SQL command. It specifies an external system (like a Kafka topic, database, or Iceberg table) where RisingWave should send the output stream of changes from a table or materialized view. Sinks enable integration with downstream applications, databases, or data lakes, effectively pushing processed data out of RisingWave.
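
A sketch of a Kafka sink for a materialized view whose results can update; broker and topic names are placeholders:

```sql
-- Push the view's change stream to Kafka as keyed upserts.
CREATE SINK order_stats_sink FROM order_stats
WITH (
    connector = 'kafka',
    topic = 'order-stats',
    properties.bootstrap.server = 'broker:9092',
    primary_key = 'product_id'
) FORMAT UPSERT ENCODE JSON;
```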

Source

A Source represents an external origin of streaming data ingested into RisingWave, defined using the `CREATE SOURCE` SQL command. It specifies the connection details, data format (e.g., JSON, Avro), and schema for an upstream system like Kafka, Pulsar, Kinesis, or a CDC stream. Sources allow RisingWave to continuously read data from these external systems to power streaming pipelines and materialized views.
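
A minimal sketch of a Kafka source; broker and topic names are placeholders:

```sql
-- Continuously ingest JSON events from a Kafka topic,
-- starting from the earliest available offset.
CREATE SOURCE events (
    event_id BIGINT,
    page_id INT,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'events',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
```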

State Store (RW Specific)

The State Store in RisingWave is the component responsible for persistently managing the internal state required for stateful streaming operations (like joins, aggregations, and materialized views). It primarily utilizes Hummock, RisingWave's cloud-native LSM-tree based storage engine, which leverages durable object storage (like S3). The State Store ensures state durability, enables fault tolerance via checkpointing, and provides efficient state access for compute operations.

Stateful Stream Processing

Stateful Stream Processing refers to computations on data streams that require maintaining and accessing state derived from previously seen events. Operations like aggregations (e.g., counting events over time), joins (correlating events from different streams), or pattern detection inherently require storing and updating state. RisingWave is designed specifically for efficient and reliable stateful stream processing.

Stateless Stream Processing

Stateless Stream Processing involves operations that act on each incoming data event independently, without needing context or information from previous events. Examples include simple filtering (e.g., `WHERE amount > 100`) or projection (selecting specific fields). While useful, many real-world streaming applications require stateful processing to derive meaningful insights.

Stream Processing

Stream Processing is the paradigm of continuously processing data "in motion" as it arrives, typically with low latency. It involves ingesting unbounded sequences of events (streams), performing transformations, aggregations, and joins, and producing results in real-time. This contrasts with batch processing, which operates on fixed, bounded datasets periodically.

Stream-Stream Join

A Stream-Stream Join combines two or more unbounded data streams based on a common key and typically within a specified time window constraint. Since both inputs are continuously changing, the system needs to maintain state for potentially matching events from each stream that have arrived within the join window. RisingWave supports stream-stream joins defined using SQL.
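
A sketch assuming illustrative `orders` and `shipments` streams; the interval condition bounds how much state each side must retain:

```sql
-- Correlate each order with shipments occurring within an hour.
CREATE MATERIALIZED VIEW order_shipments AS
SELECT o.id, o.amount, s.shipped_at
FROM orders o
JOIN shipments s
  ON o.id = s.order_id
 AND s.shipped_at BETWEEN o.order_time
                      AND o.order_time + INTERVAL '1 hour';
```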

Stream-Table Join / Temporal Join

A Stream-Table Join (often called a Temporal Join) enriches events from a data stream with relatively static or slowly changing data from a table. For each stream event, the system looks up the corresponding value(s) in the table based on a join key, often considering a specific point in time (using event time or processing time). This is common for adding dimensional data to fact streams.
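
In RisingWave, a temporal join against a table uses the `FOR SYSTEM_TIME AS OF PROCTIME()` clause; a sketch assuming a `products` table keyed by `id`:

```sql
-- Enrich each order with the product row as of processing time.
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT orders.id, orders.amount, products.name
FROM orders
JOIN products FOR SYSTEM_TIME AS OF PROCTIME()
  ON orders.product_id = products.id;
```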

Streaming Database

A Streaming Database is a specialized database system designed to ingest, store, process, and serve data from continuous streams in real-time. Unlike traditional databases optimized for storing and querying static data-at-rest, streaming databases like RisingWave excel at continuous query execution, incremental state management, and providing low-latency access to fresh results derived from data-in-motion, often using SQL.

Streaming SQL (RW Specific Capability)

Streaming SQL refers to the use of SQL syntax and semantics to define and execute continuous queries over unbounded data streams. RisingWave uses Streaming SQL as its primary interface, allowing users to express complex stream processing logic (including windowing, joins, aggregations) declaratively. It extends standard SQL concepts to handle the temporal and continuous nature of streaming data.

Subscription (RW Specific Feature)

A Subscription in RisingWave is a mechanism that allows external applications to efficiently consume the stream of changes (inserts, updates, deletes) occurring in a Materialized View. Instead of repeatedly polling the view, applications can subscribe to receive change events pushed by RisingWave, enabling reactive applications and real-time downstream workflows based on the view's results.
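
A hedged sketch of the subscription workflow; exact syntax and options vary by RisingWave version:

```sql
-- Track changes to a materialized view, retaining them for a day.
CREATE SUBSCRIPTION order_stats_sub FROM order_stats
WITH (retention = '1D');

-- Consume the change stream through a cursor.
DECLARE cur SUBSCRIPTION CURSOR FOR order_stats_sub;
FETCH NEXT FROM cur;
```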

T

Table

A Table in RisingWave represents a collection of relational data with a defined schema, created using the `CREATE TABLE` SQL command. Unlike traditional databases, RisingWave tables can be defined directly on top of external data streams (often using `CREATE SOURCE` first, then `CREATE TABLE ... FROM SOURCE`) or serve as targets for data sinks. They function as foundational datasets within RisingWave that can be queried or joined with other streams and materialized views.
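
A minimal sketch of a plain table that can be queried or joined with streams and materialized views:

```sql
-- Accepts inserts, updates, and deletes like a regular table.
CREATE TABLE products (
    id INT PRIMARY KEY,
    name VARCHAR,
    category VARCHAR
);

INSERT INTO products VALUES (1, 'Widget', 'Hardware');
```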

Throughput

Throughput measures the rate at which a streaming system can process data, typically expressed in events or bytes per second. It reflects the system's processing capacity and is influenced by factors like hardware resources, query complexity, data partitioning, and network bandwidth. Achieving high throughput is crucial for handling large-volume data streams effectively.

Time Travel (in Lakehouse)

Time Travel in a Lakehouse context refers to the capability provided by open table formats like Apache Iceberg to query data as it existed at a specific point in the past. Users can query historical snapshots of a table using either a timestamp or a specific snapshot ID. This is invaluable for auditing, debugging data issues, reproducing reports, or analyzing historical trends directly on the data lake.

Time Window

A Time Window is a mechanism in stream processing for grouping events that fall within a specific time interval, enabling operations like aggregations or joins over bounded subsets of an unbounded stream. Windows are defined based on time attributes (Event Time or Processing Time) and come in various types (Tumbling, Hopping, Session) to suit different analytical needs. They are fundamental for analyzing trends and patterns over time.

Tumbling Window

A Tumbling Window is a type of fixed-size, non-overlapping, and contiguous time window used in stream processing. Imagine time being divided into consecutive, fixed-length blocks (e.g., 5-minute intervals); each event belongs to exactly one window. Tumbling windows are commonly used for periodic reports or aggregations like calculating metrics every hour or minute.
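
RisingWave exposes tumbling windows through the `TUMBLE` table function; a sketch assuming the illustrative `events` stream:

```sql
-- Non-overlapping five-minute buckets; each event lands in
-- exactly one window.
SELECT window_start, COUNT(*) AS events_in_bucket
FROM TUMBLE(events, event_time, INTERVAL '5 minutes')
GROUP BY window_start;
```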

U

User-Defined Function (UDF)

A User-Defined Function (UDF) allows users to extend the built-in capabilities of a query language like SQL by writing custom code to perform specific logic. In RisingWave, UDFs can be written in languages like Python, Java, or Rust and can perform scalar computations (acting on single rows), aggregations, or even generate table outputs. They enable embedding complex business logic or specialized calculations directly into streaming queries.
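
A hedged sketch of registering an external UDF; the server address is hypothetical, and the exact syntax (external versus embedded UDFs) varies by RisingWave version:

```sql
-- Assumes a UDF server exposing `gcd` at the given address.
CREATE FUNCTION gcd(INT, INT) RETURNS INT
AS gcd USING LINK 'http://localhost:8815';

SELECT gcd(48, 18);  -- returns 6
```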

W

Watermark

A Watermark is a mechanism used in stream processing, particularly with Event Time semantics, to estimate the progress of time and handle out-of-order or late-arriving data. It acts as a heuristic threshold, indicating that the system generally expects no more events with an event time earlier than the watermark. Watermarks are crucial for triggering window computations reliably when the system believes all relevant data for a window has likely arrived.
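
In RisingWave, watermarks are declared on a time column when defining a source; a sketch with placeholder connector properties:

```sql
-- Tolerate events arriving up to five seconds late; windows
-- over event_time can then be finalized reliably.
CREATE SOURCE watermarked_events (
    event_id BIGINT,
    event_time TIMESTAMPTZ,
    WATERMARK FOR event_time AS event_time - INTERVAL '5 seconds'
) WITH (
    connector = 'kafka',
    topic = 'events',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;
```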