This blog post continues our series of thought leadership pieces that provide insights into the product strategy and evolution for RisingWave (see the first installment here). In this installment, I would like to explain the rationale behind the need for a unified data processing framework and how RisingWave’s streaming-first approach is pioneering this vision.


The Need for a Unified Data Processing Framework


As we unveil the release of RisingWave 2.0 GA, we want to take a moment to reflect on the broader data management landscape. Stream processing has evolved from a trend to a critical necessity for innovative data-driven organizations. Historically, batch and streaming workloads were treated separately, requiring distinct data infrastructures. However, leading data platform vendors now recognize the importance of converging batch and stream processing.


As business challenges have evolved, the line between batch and streaming has blurred, paving the way for a unified data processing model.

Incumbent batch-oriented systems have attempted to handle streaming workloads through “micro-batching,” essentially a faster version of traditional batch processing. However, reprocessing entire tables whenever new data arrives is inefficient; as use cases grow more complex and diverse, this approach yields slower results and higher costs.

In contrast, streaming systems offer superior efficiency and performance by processing incremental changes rather than relying on full micro-batched or snapshotted tables. This approach enables users to write queries that combine and layer derivative and intermediate data, ensuring real-time updates.
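To make the contrast concrete, here is a minimal SQL sketch (the orders table and its columns are purely illustrative): a batch engine re-runs the full aggregation every time fresh numbers are needed, whereas a streaming system declares the same logic once as a materialized view and maintains the result incrementally as new rows arrive.

```sql
-- Batch approach: the aggregate is recomputed over the entire table
-- every time the query is re-run.
SELECT product_id, SUM(amount) AS total_sales
FROM orders
GROUP BY product_id;

-- Streaming approach: the same logic is declared once; the engine
-- maintains the result incrementally as new orders arrive, so reads
-- always return fresh totals without rescanning the table.
CREATE MATERIALIZED VIEW total_sales_mv AS
SELECT product_id, SUM(amount) AS total_sales
FROM orders
GROUP BY product_id;
```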

What’s truly needed is a shift in perspective, where workloads are organized based on data latency needs rather than focusing solely on query performance. This model places workloads on a spectrum that prioritizes data latency requirements, aligning processing approaches with the specific needs of the data.


Let’s take a closer look at the concepts of data latency and query latency. Data latency refers to the time between when data is generated and when it is ready for processing and analysis. This latency is typically influenced by factors like network speed and ingest overhead. In contrast, query latency is the delay between when a query is submitted to a data processing system and when the results are returned. This type of latency is mainly affected by query complexity, data volume, storage type, and the sophistication of the query engine.


The term “real-time” is frequently overused in the data industry, with no vendor openly marketing their platform as a ‘non-real-time’ solution. However, understanding the key technical and functional differences can help practitioners see beyond the marketing hype.

For streaming databases like RisingWave, “real-time” means generating insights from live data, with systems optimized for processing and analyzing high-volume, high-velocity data nearly instantaneously. In contrast, traditional batch systems use “real-time” to describe the delivery of query results, focusing on processing large volumes of historical data in batch mode. While batch systems are effective for identifying past trends and patterns, they fall short in supporting real-time decision-making or embedding analytics into downstream applications.

Ultimately, modern businesses are looking for more than just historical insights or a snapshot of current events. They now prioritize advanced analytics that blend live data with historical information to deliver deeper, actionable insights. RisingWave 2.0 marks the beginning of our journey to address this new frontier and meet the evolving needs of these use cases.


Heralding the New Unified Data Processing Framework with RisingWave 2.0


Traditional data systems built on batch processing are poorly equipped to handle workloads with varying data latency requirements. These systems lack native streaming connections to data sources and sinks, relying instead on fragile ETL processes and complex data pipelines. As a result, users must maintain data consistency through intermediate data sets, which significantly increases data latency and adds substantial operational overhead.

RisingWave’s streaming-first approach effectively addresses both data latency and query latency requirements, ensuring optimal performance across a wide range of use cases. Now, let’s explore the key features that enable RisingWave to provide a unified solution for both streaming and batch workloads.

  • Rich and Extensive Connectivity: RisingWave offers native connectors and adapters for various data sources, including databases, message queues, data lakes, APIs, and IoT devices. This capability is essential for timely data processing, as it handles both real-time streaming data and bulk batch data loads. Our purpose-built streaming connectors are equipped with built-in intelligence to detect back pressure, enabling efficient data ingestion from numerous sources in a decentralized manner. This not only allows ingestion of the most recent data but also supports reprocessing older data sets on demand.
  • Unified Data Model in SQL: A unified approach requires a common data model and a standard language to reduce context switching between workloads and enhance developer productivity. A shared data model also simplifies managing the inherent complexity of diverse data characteristics. RisingWave adopts a standard relational model, enabling the creation of complex data pipelines using standard SQL. This allows SQL query writers to treat their data tables as building blocks, facilitating sophisticated use cases and supporting asynchronously developed pipelines. Even the most complex data pipelines can be constructed using cascading materialized views, as sketched in the first example after this list.
  • Composable Data Pipelines: Modern data applications often require multi-stage pipelines with the flexibility to easily inject business logic into event data. First-generation stream processing systems fell short in this area, especially for average data engineers, limiting their usefulness and hindering the transition from batch-oriented to dynamic, real-time systems. RisingWave addresses this challenge by making data pipelines composable, allowing tables and views generated by one query to be seamlessly used as inputs for downstream queries. This composability ensures that software can be adapted to new requirements without the need for extensive rewrites, facilitating the integration of fresh solutions.
  • Built-in Serving Layer: In modern applications, real-time insights are often accessed by thousands of users through data-driven apps, such as ride-sharing platforms or financial trading desks. To manage a high volume of fast reads, insights are typically delivered from a high-speed serving layer, such as an in-memory data store or an operational database. Adding such a separate store, however, introduces extra latency and complexity due to additional data hops. RisingWave addresses this by unifying these design patterns, eliminating the need for a separate serving store. With its memory-first architecture, data is available for queries as soon as the streaming job produces it. RisingWave also supports disaggregated compute with dedicated serving nodes for ad hoc queries, ensuring efficient data access.
  • Continuous Processing of Live Data: Live data has a short shelf life. RisingWave builds on the familiar database concept of materialized views. Traditionally used to accelerate queries by caching results, materialized views in RisingWave are updated continuously and incrementally to ensure consistently fresh results, removing the tradeoff between speed and freshness of data insights. Additionally, RisingWave manages the entire lifecycle of data, including retention, archival, and purging, to maintain performance and effectively manage storage costs.
  • Temporality of Data: The age of data is a crucial aspect that differs between batch and stream processing workloads. In streaming systems, this refers to the capability to process live data continuously within time windows. For batch processing, the focus is on historical data, often supported by features like ‘time travel’ and ASOF joins. RisingWave offers a comprehensive feature set to address both scenarios, ensuring robust handling of data temporality across various workloads. It supports various time windowing strategies, such as tumbling, sliding, and session windows, along with watermarks and temporal filters to handle unbounded data streams effectively (see the windowing sketch after this list).
  • Interoperability with Other Systems: A system designed to support a unified data processing model should prioritize common standards over custom solutions. This means embracing widely adopted standards for file and storage formats. Interoperability is a core design principle of RisingWave. As Iceberg and Delta increasingly become the de facto standards for data lakehouse table formats, RisingWave provides robust read and write support for both. In addition to bridging batch and streaming paradigms, RisingWave unifies development and usage patterns across Python, SQL, Java, and JavaScript through a common UDF framework, enabling custom business logic to be embedded within data pipelines for more sophisticated data processing, advanced analytics, and ML inferencing (a UDF sketch follows this list).
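To make the connectivity, composability, and serving points above concrete, here is a minimal sketch of a RisingWave pipeline. The Kafka topic, broker address, and schema are illustrative placeholders, and real connector options will differ by environment.

```sql
-- Ingest a stream natively from Kafka (connector settings are placeholders).
CREATE SOURCE rides (
    ride_id   BIGINT,
    city      VARCHAR,
    fare      DOUBLE PRECISION,
    ended_at  TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'rides',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

-- First layer: a continuously maintained aggregate over the raw stream.
CREATE MATERIALIZED VIEW city_fares AS
SELECT city, COUNT(*) AS ride_count, SUM(fare) AS total_fare
FROM rides
GROUP BY city;

-- Second layer: a downstream view composed on top of the first one.
CREATE MATERIALIZED VIEW busy_cities AS
SELECT city, ride_count, total_fare
FROM city_fares
WHERE ride_count > 1000;

-- Results are served directly with ordinary SQL, with no separate
-- serving store or extra data hop.
SELECT * FROM busy_cities ORDER BY total_fare DESC LIMIT 10;
```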
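The temporal features follow the same declarative pattern. The sketch below (again with illustrative names and connector settings, which may vary by version) attaches a watermark to an event-time column and maintains a one-minute tumbling-window aggregate over the unbounded stream.

```sql
-- A watermark on the event-time column tells the engine how long to
-- wait for late events before closing a window on the unbounded stream.
CREATE SOURCE clicks (
    user_id    BIGINT,
    url        VARCHAR,
    event_time TIMESTAMP,
    WATERMARK FOR event_time AS event_time - INTERVAL '5 seconds'
) WITH (
    connector = 'kafka',
    topic = 'clicks',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

-- One-minute tumbling windows, updated continuously as events arrive.
CREATE MATERIALIZED VIEW clicks_per_minute AS
SELECT window_start, COUNT(*) AS click_count
FROM TUMBLE(clicks, event_time, INTERVAL '1 minute')
GROUP BY window_start;
```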
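For the UDF framework, a function written in one of the supported languages can be registered once and then called inline from SQL. The following is a rough sketch of an embedded Python UDF; the function itself is a throwaway example, and the exact registration syntax may differ across RisingWave versions.

```sql
-- Register a small Python function as a scalar UDF (illustrative only;
-- consult the UDF documentation for the syntax your version supports).
CREATE FUNCTION gcd(int, int) RETURNS int
LANGUAGE python AS $$
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a
$$;

-- The UDF can then be used anywhere in a pipeline, for example inside a
-- materialized view or an ad hoc query.
SELECT gcd(48, 36);
```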


RisingWave began its journey as a stream processing system. The design choices made during the development of RisingWave 1.0 laid a solid foundation for addressing a wide range of real-time data workloads. With RisingWave 2.0, we are building on this foundation by introducing features that advance our vision of a unified data processing model. As I conclude this blog, it’s worth acknowledging that the goal of unified data processing may sound overly ambitious. However, we believe that “unified” doesn’t necessarily imply a single platform that addresses every type of workload comprehensively. Instead, RisingWave is uniquely positioned to meet this challenge with our streaming-first approach, focusing on delivering efficient and flexible solutions for modern data needs.


Rayees Pasha

Chief Product Officer
