RisingWave Glossary

A

ACID Transactions (on Data Lake)

ACID Transactions bring database-style reliability guarantees (Atomicity, Consistency, Isolation, Durability) to data lake operations. Enabled by open table formats like Apache Iceberg, they ensure data integrity even with concurrent reads and writes on cloud object storage. This allows building more reliable data pipelines and analytics directly on the lake.

Aggregation (Streaming)

Streaming aggregation involves continuously calculating summary statistics (like counts, sums, or averages) over groups of data within unbounded data streams. This is often performed within time windows to understand trends or patterns over specific intervals. RisingWave supports powerful streaming aggregations via SQL, often materialized for low-latency access.
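
A minimal sketch of a continuous aggregation in RisingWave SQL, assuming an `orders` stream with `product_id` and `amount` columns (names are illustrative):

```sql
-- Counts and revenue per product, maintained incrementally
-- as new orders arrive.
CREATE MATERIALIZED VIEW order_stats AS
SELECT
    product_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_revenue
FROM orders
GROUP BY product_id;
```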

Apache Iceberg

Apache Iceberg is an open table format designed for huge analytic datasets stored in data lakes (like S3, GCS, ADLS). It brings database-like features such as ACID transactions, scalable metadata management, robust schema evolution, and time travel capabilities directly to lake storage. RisingWave can integrate with Iceberg, using it as a sink for streaming results within a streaming lakehouse architecture.
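
A hedged sketch of sinking a materialized view into an Iceberg table; the connector properties shown are illustrative and depend on your catalog and storage setup:

```sql
-- Illustrative Iceberg sink; exact WITH options vary by
-- RisingWave version, catalog type, and credentials.
CREATE SINK order_stats_iceberg FROM order_stats
WITH (
    connector = 'iceberg',
    type = 'upsert',
    primary_key = 'product_id',
    warehouse.path = 's3://my-bucket/warehouse',  -- hypothetical bucket
    database.name = 'analytics',
    table.name = 'order_stats'
);
```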

At-Least-Once Semantics

At-Least-Once semantics is a data processing guarantee ensuring that each input event is processed one or more times, even in the face of failures. While it prevents data loss, it can result in duplicate processing or output records if failures and retries occur. Systems providing this guarantee often require downstream consumers to handle potential duplicates.

At-Most-Once Semantics

At-Most-Once semantics is a data processing guarantee ensuring that each input event is processed at most one time. This means events might be processed once or potentially skipped entirely (data loss) if failures occur before completion is acknowledged. It's simpler to implement but generally less desirable than at-least-once or exactly-once for critical data pipelines.

B

Backpressure

Backpressure is a feedback mechanism in data streaming systems where a slower downstream operator or sink signals to a faster upstream operator or source to reduce its data sending rate. This prevents the slower component from being overwhelmed, avoiding excessive buffering, potential data loss, or system instability. Effective backpressure management is crucial for building robust and stable streaming pipelines.

C

Change Data Capture (CDC)

Change Data Capture (CDC) is the process of observing changes occurring in a source database (inserts, updates, deletes) and extracting them as a stream of event data. This allows downstream systems, like RisingWave, to react to database modifications in real-time without directly querying the source database frequently. Common tools like Debezium generate CDC streams that RisingWave can ingest and process.
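
A sketch of ingesting Postgres changes with RisingWave's native CDC connector; connection properties are placeholders for your own environment:

```sql
-- Illustrative Postgres CDC setup; host, credentials, and
-- database names are placeholders.
CREATE SOURCE pg_cdc WITH (
    connector = 'postgres-cdc',
    hostname = 'db.example.com',
    port = '5432',
    username = 'rw_user',
    password = 'secret',
    database.name = 'shop'
);

-- Materialize one upstream table from the CDC stream.
CREATE TABLE orders (
    id INT PRIMARY KEY,
    product_id INT,
    amount DECIMAL
) FROM pg_cdc TABLE 'public.orders';
```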

Checkpointing

Checkpointing is a fault tolerance mechanism used in stateful stream processing systems like RisingWave. It involves periodically taking consistent snapshots of the internal state (e.g., aggregation results, join states) and storing them durably (like in RisingWave's State Store). If a failure occurs, the system can restore its state from the latest successful checkpoint and resume processing, minimizing data loss and ensuring consistency.

Compaction (Data Lake / Table Format)

Compaction is an optimization process used within open table formats like Apache Iceberg on data lakes. It involves merging many small data files into fewer larger files, which improves query performance by reducing metadata overhead and optimizing read operations. This is typically a background maintenance task essential for maintaining efficiency in large-scale lakehouse tables.

Connector

A Connector is a software component that enables a data system, like RisingWave, to communicate with external systems for data ingestion (sources) or data egress (sinks). Connectors handle the specifics of interacting with different technologies, such as message queues (Kafka, Pulsar), databases (via JDBC or CDC), or file systems (S3). They abstract away the protocol and format details, simplifying pipeline construction.

Consistency (State Management)

Consistency in stateful stream processing refers to the guarantees provided about the visibility and ordering of state updates, especially in the presence of failures and recovery. Systems like RisingWave aim for strong consistency, ensuring that operations reflect a valid state and that recovered state aligns correctly with processed inputs, often tied to achieving exactly-once semantics.

Continuous Query

A Continuous Query is a query designed to run perpetually over unbounded, continuously arriving data streams, producing results that update dynamically as new data flows in. Unlike traditional database queries that run once on static data, continuous queries define ongoing computations. In RisingWave, SQL is used to define continuous queries, often materialized into views for efficient access.

D

Data Lake

A Data Lake is a centralized repository designed to store large quantities of structured, semi-structured, and unstructured data in its raw format. Built on cost-effective storage like AWS S3 or Google Cloud Storage, it provides flexibility but historically lacked the performance and transactional guarantees of data warehouses. Data lakes serve as the foundation for modern architectures like the Lakehouse.

Data Lakehouse

A Data Lakehouse is a modern data management architecture that combines the low-cost, flexible storage of data lakes with the structure, governance, and performance features of data warehouses. It typically leverages open table formats like Apache Iceberg directly on lake storage, enabling ACID transactions, schema enforcement, and efficient SQL querying. RisingWave can act as a key component in a real-time lakehouse, processing streams and serving fresh results.

Data Partitioning (Streaming)

Data Partitioning in streaming involves distributing incoming data events across multiple parallel instances of processing operators based on a specific key within the data (e.g., user ID, sensor ID). This is essential for achieving scalability, as it allows processing load to be spread across multiple nodes or cores. Correct partitioning is crucial for stateful operations like joins and aggregations to ensure related data is processed together.

Data Stream / Event Stream

A Data Stream (or Event Stream) is an unbounded, ordered, and immutable sequence of data records or events generated continuously over time. Examples include sensor readings, user clickstreams, financial transactions, or database change events (CDC). Stream processing systems like RisingWave are designed to ingest, process, and analyze these continuous streams in real-time.

Dataflow Graph / Pipeline

A Dataflow Graph, or Streaming Pipeline, represents the logical structure of a stream processing job. It typically consists of sources where data originates, transformation operators (like filters, aggregations, joins) that process the data, and sinks where the results are sent. RisingWave compiles SQL queries into optimized dataflow graphs executed across its compute nodes.

Debezium

Debezium is a popular open-source distributed platform for Change Data Capture (CDC). It monitors source databases (like PostgreSQL, MySQL, MongoDB) and produces event streams capturing row-level changes (inserts, updates, deletes). RisingWave has built-in support for ingesting Debezium-formatted CDC streams (often via Kafka), enabling real-time processing of database changes.

Durability (State Management)

Durability in stateful stream processing ensures that once state updates are successfully committed, they persist and survive system failures, such as node restarts or crashes. In RisingWave, durability for streaming state is primarily achieved through its State Store (Hummock), which persists state snapshots and changes to reliable cloud storage, enabling recovery via checkpointing.

E

Emit Strategy / Output Mode

An Emit Strategy (or Output Mode) defines when and how results are produced from stateful streaming operations, particularly windowed aggregations. Common strategies include emitting results only when a time window closes, or emitting updates continuously as new data arrives and affects the window's results. The chosen strategy impacts downstream consumers and the semantics of the output stream.
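
In RisingWave, materialized views emit updates continuously by default; recent versions also support delaying output until a window is finalized. A hedged sketch (this syntax requires a watermark on the time column):

```sql
-- Emit each hourly count exactly once, when the window closes,
-- instead of updating it continuously.
CREATE MATERIALIZED VIEW hourly_counts AS
SELECT window_start, COUNT(*) AS cnt
FROM TUMBLE(events, event_time, INTERVAL '1 hour')
GROUP BY window_start
EMIT ON WINDOW CLOSE;
```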

Event Streaming Platform (ESP)

An Event Streaming Platform (ESP) is a system designed to capture, store, and process streams of events in real-time, durably, and at scale. Key examples include Apache Kafka, Apache Pulsar, and Redpanda. These platforms act as the central nervous system for event-driven architectures, often serving as the primary source and/or sink for stream processing engines like RisingWave.

Event Time

Event Time is the timestamp embedded within a data record itself, indicating when the actual event occurred in the real world (e.g., when a sensor reading was taken or a transaction happened). Processing data based on Event Time allows for accurate results even if data arrives out of order or is delayed. It's crucial for windowing operations and requires mechanisms like watermarks to handle correctly.

Event-Driven Architecture (EDA)

An Event-Driven Architecture (EDA) is a software design pattern where components communicate asynchronously by producing and consuming events. Events represent significant occurrences (e.g., an order placed, a sensor reading updated). Systems like Kafka act as the event backbone, and stream processors like RisingWave consume these events to trigger real-time processing, analytics, or downstream actions.

Exactly-Once Semantics (EOS)

Exactly-Once Semantics (EOS) is the strongest processing guarantee in stream processing, ensuring that each input event's effect on the system's state is recorded precisely once, even amidst failures and retries. Achieving true end-to-end EOS often involves coordination between the stream processor (like RisingWave, which provides EOS for its internal state) and the source/sink connectors using mechanisms like transactions or idempotent writes.

F

Fault Tolerance

Fault Tolerance is the ability of a system, like RisingWave, to continue operating correctly and maintain data integrity even when components fail (e.g., node crashes, network issues). In stateful streaming, this is typically achieved through mechanisms like state checkpointing, replication, and automatic recovery processes. Robust fault tolerance is essential for reliable real-time data processing pipelines.

Filtering (Streaming)

Filtering in streaming involves selectively processing or discarding incoming data events based on specified criteria applied to the event's content. It's a fundamental stateless transformation used to reduce data volume or isolate events of interest for subsequent processing steps like aggregation or joins. In RisingWave, filtering is typically done using the `WHERE` clause in SQL queries.
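
A minimal sketch, assuming the illustrative `orders` stream used elsewhere in this glossary:

```sql
-- Stateless filter: events failing the predicate are discarded.
CREATE MATERIALIZED VIEW big_orders AS
SELECT * FROM orders
WHERE amount > 100;
```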

H

High Availability (HA)

High Availability (HA) refers to system design principles and configurations aimed at ensuring continuous operational performance and minimizing downtime, typically through redundancy and automatic failover. For RisingWave, this involves running multiple instances of critical control plane components and having mechanisms for the system to automatically recover from compute node failures without interrupting service significantly.

Hopping Window

A Hopping Window is a type of time window used in stream processing that has a fixed duration but slides forward by a specified "hop" interval, allowing windows to overlap. For example, a 1-hour window hopping every 10 minutes would produce aggregations covering the last hour, updated every 10 minutes. This is useful for generating frequently updated, moving-average style results.
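
RisingWave exposes hopping windows through the `HOP` table function. A sketch assuming an `events` stream with an `event_time` column:

```sql
-- One-hour windows advancing every 10 minutes, so each event
-- belongs to six overlapping windows: HOP(source, time_col, hop, size).
SELECT window_start, window_end, COUNT(*) AS events_last_hour
FROM HOP(events, event_time, INTERVAL '10 minutes', INTERVAL '1 hour')
GROUP BY window_start, window_end;
```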

Hummock (RW Specific)

Hummock is RisingWave's purpose-built, cloud-native storage engine used within its State Store. It employs a Log-Structured Merge-tree (LSM-tree) architecture to efficiently manage large amounts of streaming state on durable object storage (like S3). Hummock provides features like checkpointing, versioning, and compaction, underpinning RisingWave's fault tolerance, scalability, and state management capabilities.

I

Idempotency

Idempotency refers to the property of an operation where executing it multiple times with the same input produces the same result or state as executing it just once. In stream processing, idempotent operations (especially on sinks) are crucial for achieving effective exactly-once semantics, as they allow safe retries after failures without causing duplicate side effects. For example, writing results with a unique key that allows upserts is an idempotent operation.
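
For illustration, a standard PostgreSQL-style upsert is idempotent because retrying it leaves the target row in the same final state (table and key names are hypothetical):

```sql
-- Safe to retry after a failure: redelivery overwrites rather
-- than duplicates.
INSERT INTO results (sensor_id, reading)
VALUES ('sensor-42', 17.5)
ON CONFLICT (sensor_id) DO UPDATE SET reading = EXCLUDED.reading;
```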

Incremental Computation (RW Specific Capability)

Incremental Computation is the core processing model used by RisingWave, particularly for materialized views. Instead of recomputing results from scratch when new data arrives, RisingWave processes only the incoming changes (inserts, updates, deletes) and efficiently updates the existing state and materialized results. This approach drastically reduces computation load and latency compared to batch reprocessing, enabling fresh, low-latency results.

Ingestion Time

Ingestion Time is the timestamp assigned to an event when it is first received or ingested by a specific system or component, such as RisingWave itself or an upstream broker like Kafka. While easier to manage than Event Time, processing based purely on Ingestion Time can lead to inaccurate results if events experience variable delays before ingestion, especially for time-windowed operations. It's distinct from Event Time (when the event occurred) and Processing Time (when computation happens).

J

Join (Streaming)

A Streaming Join combines data from two or more streams (or a stream and a table) based on common attributes or keys, producing a resulting stream of correlated events. Unlike batch joins on static datasets, streaming joins operate continuously on potentially unbounded, ever-changing inputs, requiring state management to store relevant data from each stream. RisingWave supports various streaming join types (e.g., stream-stream, stream-table) defined using SQL syntax.

K

Kappa Architecture

The Kappa Architecture is a data processing architecture that aims to handle both real-time and batch processing needs using a single streaming-based system. It contrasts with the Lambda Architecture by eliminating the separate batch processing layer, instead relying on the stream processing engine and a durable event log (like Kafka) to reprocess historical data if needed. Systems like RisingWave align well with the Kappa philosophy by providing a unified platform for stream processing and serving.

L

Lambda Architecture

The Lambda Architecture is a traditional big data processing pattern designed to handle both real-time and batch processing needs using two separate layers. It consists of a batch layer for comprehensive, accurate processing of large historical datasets, and a speed (or streaming) layer for low-latency processing of recent data, with results merged in a serving layer. Modern streaming systems like RisingWave, especially within a Lakehouse context, often aim to simplify or replace the Lambda Architecture with a more unified approach like Kappa.

Late Data

Late Data refers to events arriving at the stream processing system after their associated event time has already passed a certain threshold, typically defined by a watermark. Handling late data correctly is crucial for accuracy in event-time processing, especially for windowed operations. Systems may handle late data by dropping it, processing it separately, or allowing updates to previously computed window results if configured to do so.

Latency (Query vs. End-to-End)

Latency in streaming systems refers to the delay experienced by data. End-to-End Latency measures the total time from event occurrence to the final result being available, while Query Latency specifically measures the time taken to retrieve results from the system (e.g., querying a materialized view in RisingWave). Streaming databases like RisingWave aim to minimize both, providing low end-to-end processing latency via incremental computation and low query latency via pre-computed materialized views.

Lookup Join

A Lookup Join is a type of stream enrichment where each incoming event triggers a synchronous request (a "lookup") to an external system (like a database or microservice) to fetch additional data. While conceptually simple, lookup joins can introduce significant latency and become a bottleneck due to the external dependencies for each event. RisingWave's stream-table joins using its internal state offer a more efficient alternative for many enrichment scenarios.

M

Materialized View (RW Specific Feature)

A Materialized View in RisingWave is a core concept where the results of a continuous SQL query (defined using `CREATE MATERIALIZED VIEW`) over streaming data are automatically computed, stored, and kept up-to-date incrementally. Unlike traditional database MVs requiring manual refreshes, RisingWave MVs update dynamically with low latency as source streams change via incremental computation. This makes complex streaming results readily available for fast querying without recomputing on the fly.
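
A minimal sketch, assuming a `clickstream` source with a `page_id` column:

```sql
-- Computed once, then maintained incrementally; queries read
-- the stored result rather than recomputing it.
CREATE MATERIALIZED VIEW clicks_per_page AS
SELECT page_id, COUNT(*) AS click_count
FROM clickstream
GROUP BY page_id;

-- Queried like an ordinary table, always serving fresh results.
SELECT * FROM clicks_per_page WHERE click_count > 1000;
```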

Message Queue (MQ)

A Message Queue (MQ) is a system designed for asynchronous communication, typically between different services, using point-to-point or publish-subscribe patterns (e.g., RabbitMQ, SQS). While they handle message buffering and delivery, MQs traditionally focus on transient message delivery and may lack the strong ordering guarantees, long-term persistence, and replay capabilities of log-based Event Streaming Platforms like Kafka, which are more commonly paired with stream processors.

Microservices (in Streaming)

Microservices often leverage streaming systems for communication and data sharing in event-driven architectures. Services can publish events to an Event Streaming Platform (like Kafka) when significant state changes occur, and other services (or a stream processor like RisingWave) can consume these events to react, update their own state, or perform real-time analytics. This decouples services and enables scalable, real-time data flow.

O

Observability (Metrics, Logging, Tracing)

Observability refers to the ability to understand the internal state and behavior of a system based on the data it generates, typically through metrics, logs, and traces. For a streaming system like RisingWave, observability is crucial for monitoring performance, diagnosing issues, understanding data flow, and managing cluster resources effectively. RisingWave exposes various metrics and logs to support observability tooling.

Open Table Format

An Open Table Format (e.g., Apache Iceberg, Apache Hudi, Delta Lake) defines specifications for organizing and managing large analytical datasets stored in data lakes (like S3, GCS). They bring reliability features like ACID transactions, schema evolution guarantees, time travel, and performance optimizations (partitioning, compaction) directly to lake storage, enabling Data Lakehouse architectures. RisingWave can integrate with these formats, particularly Iceberg, as sinks for building real-time lakehouses.

P

Parallelism / Parallel Execution (RW Specific)

Parallelism in RisingWave refers to the execution of different parts of a streaming dataflow graph concurrently across multiple Compute Nodes or even multiple CPU cores within a node. RisingWave automatically parallelizes streaming jobs based on data partitioning and available resources to achieve high throughput and scalability. The degree of parallelism can often be configured to match the workload and cluster size.

Partitioning (Data Lake / Table Format)

Partitioning in the context of data lakes and open table formats involves organizing data files into separate directories based on the values of specific columns (e.g., date, region, customer ID). This physical layout allows query engines to skip reading irrelevant data, significantly improving query performance by reducing the amount of data scanned. Table formats like Iceberg manage partitioning complexity and allow partition schemes to evolve over time.

PostgreSQL Compatibility (RW Specific Feature)

PostgreSQL Compatibility means that RisingWave aims to understand the PostgreSQL SQL dialect and communicate using the PostgreSQL wire protocol. This allows users familiar with PostgreSQL to leverage their existing SQL knowledge when writing streaming queries in RisingWave. It also enables compatibility with a wide range of standard PostgreSQL client tools, libraries, and BI tools for connecting and interacting with RisingWave.

Processing Time

Processing Time is the time on the stream processing system's internal clock at the moment an event is processed or an operation occurs. Processing based on Processing Time is simpler than Event Time as it doesn't require handling out-of-order events or watermarks. However, results can be non-deterministic and sensitive to system load or processing delays, making it less suitable for applications requiring accuracy based on when events actually happened.

Projection (Streaming)

Projection in streaming involves selecting specific fields (columns) from data events and potentially transforming their values as they flow through a pipeline. It's a common stateless operation used to reshape data, remove unnecessary fields to reduce data size, or prepare data for subsequent processing steps. In RisingWave, projection is typically handled by the `SELECT` clause in SQL queries.
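
A minimal sketch, again using the illustrative `orders` stream:

```sql
-- Keep only the fields needed downstream and derive a new one.
CREATE MATERIALIZED VIEW slim_orders AS
SELECT id, product_id, amount, amount * 0.1 AS estimated_tax
FROM orders;
```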

R

Real-time Data Processing

Real-time Data Processing involves analyzing and acting on data almost instantaneously as it is generated or received, typically within milliseconds or seconds. This contrasts with traditional batch processing which operates on large, static datasets with higher latency. Streaming systems like RisingWave are specifically designed for real-time data processing, enabling immediate insights and automated responses.

Real-time Lakehouse / Streaming Lakehouse

A Real-time Lakehouse (or Streaming Lakehouse) extends the Data Lakehouse concept by incorporating real-time data ingestion, continuous processing, and low-latency querying capabilities directly on lake storage. It aims to unify batch and streaming workloads on a single architecture, often using a streaming engine like RisingWave alongside open table formats like Iceberg. This provides fresh, queryable data without needing separate speed layers.

Recovery Point Objective (RPO)

Recovery Point Objective (RPO) defines the maximum acceptable amount of data loss an organization can tolerate, measured in time, during a disaster or system failure. For a streaming system, RPO relates to how frequently state is checkpointed; a lower RPO (e.g., seconds) requires more frequent checkpointing to minimize potential data loss upon recovery. It's a key metric for designing fault-tolerant systems.

Recovery Time Objective (RTO)

Recovery Time Objective (RTO) defines the target maximum time allowed for restoring a system's functionality after a failure or disaster. For a streaming system like RisingWave, RTO depends on factors like the time needed to detect failure, provision replacement resources (if needed), load the last checkpoint, and resume processing from the correct point in the input streams. It dictates how quickly the system must be back online.

Reverse ETL

Reverse ETL is the process of copying data *from* a central data store (like a data warehouse or lakehouse) *back into* operational systems (like CRMs, marketing tools, or databases) where it can be used to drive business actions. While traditionally batch-oriented, streaming systems like RisingWave can potentially power real-time Reverse ETL use cases by sinking processed results directly into operational tools with low latency.

S

Scalability (Compute & Storage)

Scalability refers to a system's ability to handle increasing load (data volume, query complexity, request rate) by adding more resources. In RisingWave, this often involves scaling compute resources (adding Compute Nodes for processing power) and storage resources (leveraging cloud storage for state) independently. This separation allows tailoring the cluster configuration efficiently to specific workload demands.

Schema (Streaming)

In streaming, the Schema defines the structure and data types of events within a data stream. Managing schemas is crucial as data structures can change over time (schema evolution). Stream processing systems need mechanisms to handle schema changes gracefully, often involving integration with a Schema Registry, especially when using binary formats like Avro or Protobuf.

Schema Evolution (in Lakehouse)

Schema Evolution in a Lakehouse context refers to the ability to safely modify the schema (add, drop, rename, reorder, or change types of columns) of tables managed by open table formats like Apache Iceberg, even after data has been written. These formats provide guarantees that schema changes won't break existing data or require expensive table rewrites, which is essential for long-term data management on data lakes.

Schema Registry

A Schema Registry is a centralized service used to manage and validate schemas, particularly for serialization formats like Avro and Protobuf used in streaming platforms like Kafka. Producers register schemas before sending data, and consumers retrieve them to correctly deserialize incoming messages. This ensures data consistency and facilitates safe schema evolution across different services and applications.
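
A hedged sketch of a RisingWave source that resolves Avro schemas from a registry; the registry URL, broker, and topic are placeholders:

```sql
-- Avro-encoded Kafka source; schemas are fetched from the
-- registry rather than declared inline.
CREATE SOURCE avro_events
WITH (
    connector = 'kafka',
    topic = 'events-avro',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE AVRO (
    schema.registry = 'http://registry.example.com:8081'
);
```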

Separation of Storage and Compute (RW Specific)

Separation of Storage and Compute is a key architectural principle in RisingWave where the components responsible for data processing (Compute Nodes) are distinct from the system responsible for durable state storage (like the Hummock-based State Store using cloud object storage). This allows independent scaling of compute and storage resources based on workload needs, offering better elasticity, fault isolation, and potentially cost efficiency compared to tightly coupled architectures.

Serialization Format

A Serialization Format (or wire format) defines how structured data is converted into a sequence of bytes for transmission over a network or storage (e.g., JSON, Avro, Protobuf). Choosing an appropriate format impacts performance, storage size, schema evolution capabilities, and language interoperability. RisingWave supports various common serialization formats for interacting with sources and sinks like Kafka.

Session Window

A Session Window groups events based on periods of activity separated by defined gaps of inactivity. Unlike time-based windows (Tumbling, Hopping), Session Windows have variable durations determined by the data itself. They are useful for analyzing user sessions, tracking device activity periods, or grouping related events that occur closely together in time but without fixed intervals.

Sink

A Sink represents the destination for results processed by RisingWave, defined using the `CREATE SINK` SQL command. It specifies an external system (like a Kafka topic, database, or Iceberg table) where RisingWave should send the output stream of changes from a table or materialized view. Sinks enable integration with downstream applications, databases, or data lakes, effectively pushing processed data out of RisingWave.
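
A sketch of a Kafka sink for a materialized view whose results can update; broker and topic names are placeholders:

```sql
-- Push the view's change stream to Kafka as keyed upserts.
CREATE SINK order_stats_sink FROM order_stats
WITH (
    connector = 'kafka',
    topic = 'order-stats',
    properties.bootstrap.server = 'broker:9092',
    primary_key = 'product_id'
) FORMAT UPSERT ENCODE JSON;
```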

Source

A Source represents an external origin of streaming data ingested into RisingWave, defined using the `CREATE SOURCE` SQL command. It specifies the connection details, data format (e.g., JSON, Avro), and schema for an upstream system like Kafka, Pulsar, Kinesis, or a CDC stream. Sources allow RisingWave to continuously read data from these external systems to power streaming pipelines and materialized views.
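
A minimal sketch of a Kafka source; broker and topic names are placeholders:

```sql
-- Continuously ingest JSON events from a Kafka topic,
-- starting from the earliest available offset.
CREATE SOURCE events (
    event_id BIGINT,
    page_id INT,
    event_time TIMESTAMPTZ
) WITH (
    connector = 'kafka',
    topic = 'events',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
```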

State Store (RW Specific)

The State Store in RisingWave is the component responsible for persistently managing the internal state required for stateful streaming operations (like joins, aggregations, and materialized views). It primarily utilizes Hummock, RisingWave's cloud-native LSM-tree based storage engine, which leverages durable object storage (like S3). The State Store ensures state durability, enables fault tolerance via checkpointing, and provides efficient state access for compute operations.

Stateful Stream Processing

Stateful Stream Processing refers to computations on data streams that require maintaining and accessing state derived from previously seen events. Operations like aggregations (e.g., counting events over time), joins (correlating events from different streams), or pattern detection inherently require storing and updating state. RisingWave is designed specifically for efficient and reliable stateful stream processing.

Stateless Stream Processing

Stateless Stream Processing involves operations that act on each incoming data event independently, without needing context or information from previous events. Examples include simple filtering (e.g., `WHERE amount > 100`) or projection (selecting specific fields). While useful, many real-world streaming applications require stateful processing to derive meaningful insights.

Stream Processing

Stream Processing is the paradigm of continuously processing data "in motion" as it arrives, typically with low latency. It involves ingesting unbounded sequences of events (streams), performing transformations, aggregations, and joins, and producing results in real-time. This contrasts with batch processing, which operates on fixed, bounded datasets periodically.

Stream-Stream Join

A Stream-Stream Join combines two or more unbounded data streams based on a common key and typically within a specified time window constraint. Since both inputs are continuously changing, the system needs to maintain state for potentially matching events from each stream that have arrived within the join window. RisingWave supports stream-stream joins defined using SQL.
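
A sketch assuming illustrative `orders` and `shipments` streams; the interval condition bounds how much state each side must retain:

```sql
-- Correlate each order with shipments occurring within an hour.
CREATE MATERIALIZED VIEW order_shipments AS
SELECT o.id, o.amount, s.shipped_at
FROM orders o
JOIN shipments s
  ON o.id = s.order_id
 AND s.shipped_at BETWEEN o.order_time
                      AND o.order_time + INTERVAL '1 hour';
```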

Stream-Table Join / Temporal Join

A Stream-Table Join (often called a Temporal Join) enriches events from a data stream with relatively static or slowly changing data from a table. For each stream event, the system looks up the corresponding value(s) in the table based on a join key, often considering a specific point in time (using event time or processing time). This is common for adding dimensional data to fact streams.
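
In RisingWave, a temporal join against a table uses the `FOR SYSTEM_TIME AS OF PROCTIME()` clause; a sketch assuming a `products` table keyed by `id`:

```sql
-- Enrich each order with the product row as of processing time.
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT orders.id, orders.amount, products.name
FROM orders
JOIN products FOR SYSTEM_TIME AS OF PROCTIME()
  ON orders.product_id = products.id;
```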

Streaming Database

A Streaming Database is a specialized database system designed to ingest, store, process, and serve data from continuous streams in real-time. Unlike traditional databases optimized for storing and querying static data-at-rest, streaming databases like RisingWave excel at continuous query execution, incremental state management, and providing low-latency access to fresh results derived from data-in-motion, often using SQL.

Streaming SQL (RW Specific Capability)

Streaming SQL refers to the use of SQL syntax and semantics to define and execute continuous queries over unbounded data streams. RisingWave uses Streaming SQL as its primary interface, allowing users to express complex stream processing logic (including windowing, joins, aggregations) declaratively. It extends standard SQL concepts to handle the temporal and continuous nature of streaming data.

Subscription (RW Specific Feature)

A Subscription in RisingWave is a mechanism that allows external applications to efficiently consume the stream of changes (inserts, updates, deletes) occurring in a Materialized View. Instead of repeatedly polling the view, applications can subscribe to receive change events pushed by RisingWave, enabling reactive applications and real-time downstream workflows based on the view's results.
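
A hedged sketch of the subscription workflow; exact syntax and options vary by RisingWave version:

```sql
-- Track changes to a materialized view, retaining them for a day.
CREATE SUBSCRIPTION order_stats_sub FROM order_stats
WITH (retention = '1D');

-- Consume the change stream through a cursor.
DECLARE cur SUBSCRIPTION CURSOR FOR order_stats_sub;
FETCH NEXT FROM cur;
```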

T

Table

A Table in RisingWave represents a collection of relational data with a defined schema, created using the `CREATE TABLE` SQL command. Unlike traditional databases, RisingWave tables can be defined directly on top of external data streams (often using `CREATE SOURCE` first, then `CREATE TABLE ... FROM SOURCE`) or serve as targets for data sinks. They function as foundational datasets within RisingWave that can be queried or joined with other streams and materialized views.
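
A minimal sketch of a plain table that can be queried or joined with streams and materialized views:

```sql
-- Accepts inserts, updates, and deletes like a regular table.
CREATE TABLE products (
    id INT PRIMARY KEY,
    name VARCHAR,
    category VARCHAR
);

INSERT INTO products VALUES (1, 'Widget', 'Hardware');
```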

Throughput

Throughput measures the rate at which a streaming system can process data, typically expressed in events or bytes per second. It reflects the system's processing capacity and is influenced by factors like hardware resources, query complexity, data partitioning, and network bandwidth. Achieving high throughput is crucial for handling large-volume data streams effectively.

Time Travel (in Lakehouse)

Time Travel in a Lakehouse context refers to the capability provided by open table formats like Apache Iceberg to query data as it existed at a specific point in the past. Users can query historical snapshots of a table using either a timestamp or a specific snapshot ID. This is invaluable for auditing, debugging data issues, reproducing reports, or analyzing historical trends directly on the data lake.

Time Window

A Time Window is a mechanism in stream processing for grouping events that fall within a specific time interval, enabling operations like aggregations or joins over bounded subsets of an unbounded stream. Windows are defined based on time attributes (Event Time or Processing Time) and come in various types (Tumbling, Hopping, Session) to suit different analytical needs. They are fundamental for analyzing trends and patterns over time.

Tumbling Window

A Tumbling Window is a type of fixed-size, non-overlapping, and contiguous time window used in stream processing. Imagine time being divided into consecutive, fixed-length blocks (e.g., 5-minute intervals); each event belongs to exactly one window. Tumbling windows are commonly used for periodic reports or aggregations like calculating metrics every hour or minute.
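
RisingWave exposes tumbling windows through the `TUMBLE` table function; a sketch assuming the illustrative `events` stream:

```sql
-- Non-overlapping five-minute buckets; each event lands in
-- exactly one window.
SELECT window_start, COUNT(*) AS events_in_bucket
FROM TUMBLE(events, event_time, INTERVAL '5 minutes')
GROUP BY window_start;
```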

U

User-Defined Function (UDF)

A User-Defined Function (UDF) allows users to extend the built-in capabilities of a query language like SQL by writing custom code to perform specific logic. In RisingWave, UDFs can be written in languages like Python, Java, or Rust and can perform scalar computations (acting on single rows), aggregations, or even generate table outputs. They enable embedding complex business logic or specialized calculations directly into streaming queries.
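
A hedged sketch of registering an external UDF; the server address is hypothetical, and the exact syntax (external versus embedded UDFs) varies by RisingWave version:

```sql
-- Assumes a UDF server exposing `gcd` at the given address.
CREATE FUNCTION gcd(INT, INT) RETURNS INT
AS gcd USING LINK 'http://localhost:8815';

SELECT gcd(48, 18);  -- returns 6
```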

W

Watermark

A Watermark is a mechanism used in stream processing, particularly with Event Time semantics, to estimate the progress of time and handle out-of-order or late-arriving data. It acts as a heuristic threshold, indicating that the system generally expects no more events with an event time earlier than the watermark. Watermarks are crucial for triggering window computations reliably when the system believes all relevant data for a window has likely arrived.
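
In RisingWave, watermarks are declared on a time column when defining a source; a sketch with placeholder connector properties:

```sql
-- Tolerate events arriving up to five seconds late; windows
-- over event_time can then be finalized reliably.
CREATE SOURCE watermarked_events (
    event_id BIGINT,
    event_time TIMESTAMPTZ,
    WATERMARK FOR event_time AS event_time - INTERVAL '5 seconds'
) WITH (
    connector = 'kafka',
    topic = 'events',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;
```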