Redefining Stream Processing: The Outlook for RisingWave Streaming Database in 2024

Stream processing has evolved significantly over the past two decades, moving from a theoretical concept to a vital commercial application. This evolution began with the first generation of data stream management systems, including notable examples like IBM System S, Oracle CQL, and Esper. It then progressed to a second generation of distributed stream processing platforms inspired by MapReduce, such as Apache Spark Streaming and Apache Flink. Today we are witnessing the third generation, marked by the rise of cloud-native stream processing systems, exemplified by RisingWave. This ongoing evolution reflects the growing adoption and refinement of stream processing and its essential role in many industries.

Image source: Fragkoulis, Marios, et al. 'A survey on the evolution of stream processing systems.' The VLDB Journal (2023): 1-35.

Born in early 2021, RisingWave has quickly established itself as a leading example of third-generation stream processing systems, a status it has attained through three years of dedicated development. Since its open-source debut in April 2022, RisingWave has seen swift global adoption across sectors as diverse as internet services, finance, energy, aerospace, supply chains, and self-driving cars. Today, hundreds of RisingWave clusters are in active operation worldwide every day.

Worldwide distribution of RisingWave clusters as of December 2023.

For RisingWave, the year 2024 marks a significant turning point towards maturity. We are fully committed to delivering a stream processing system that is both easier to use and more cost-efficient for everyone. In this article, we'll delve into our outlook for RisingWave in 2024.

Enhancing usability and cost-effectiveness

Since its inception, RisingWave has been dedicated to the mission of 'democratizing stream processing'. Our commitment focuses on two key aspects: ease of use and cost efficiency.

In terms of ease of use, RisingWave is protocol-compatible with PostgreSQL, so it works with the tools and clients of the PostgreSQL ecosystem. Users define stream processing tasks directly by creating materialized views with SQL statements that follow PostgreSQL syntax.

Additionally, RisingWave lets users build cascading materialized views, that is, materialized views defined on top of other materialized views, while keeping results consistent and up to date in real time. This streamlined approach is in stark contrast to traditional stream processing architecture, which requires the integration of multiple components, such as stream processors, message queues, databases, and orchestration tools. As a result, it significantly reduces both the development and maintenance costs of the system.
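To make the idea of cascading materialized views concrete, here is a minimal Python sketch of incremental view maintenance. It assumes a toy in-memory model and is not RisingWave's actual engine; the class `SumView`, the sample keys, and the amounts are all hypothetical. Each view keeps a running sum and forwards every change to the views built on top of it, so a dependent view is updated by the same event without recomputing from scratch.

```python
class SumView:
    """Toy incrementally maintained view: SUM(amount) grouped by a key.

    Hypothetical illustration only; not RisingWave's implementation.
    """

    def __init__(self, group_by):
        self.group_by = group_by   # maps an incoming key to this view's group key
        self.totals = {}           # current contents of the view
        self.downstream = []       # views defined on top of this view

    def apply(self, key, delta):
        """Apply one change and propagate it to dependent views."""
        group = self.group_by(key)
        self.totals[group] = self.totals.get(group, 0) + delta
        for view in self.downstream:
            view.apply(group, delta)


# First view: total amount per (region, user).
per_user = SumView(group_by=lambda key: key)
# Second view, cascaded on the first: total amount per region.
per_region = SumView(group_by=lambda key: key[0])
per_user.downstream.append(per_region)

# Made-up change stream; a negative amount models a refund.
events = [
    (("eu", "alice"), 10),
    (("eu", "bob"), 5),
    (("us", "carol"), 7),
    (("eu", "alice"), -3),
]
for key, amount in events:
    per_user.apply(key, amount)

print(per_user.totals)
print(per_region.totals)
```

In RisingWave itself this relationship is declared in SQL, with one materialized view selecting from another. The sketch only illustrates why both views stay in step after every event.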

Using RisingWave greatly simplifies the architectural complexity of developing stream processing applications.

In terms of cost-effectiveness, RisingWave continues to improve its compute-storage separation architecture to provide stable and efficient support for stream processing with large internal states. RisingWave uses remote object storage as the persistent medium for these internal states, which enables failure recovery and elastic scaling within seconds. Furthermore, RisingWave incorporates tiered storage and intelligent caching mechanisms to minimize the need for remote object storage access, reducing potential latency issues.
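The caching idea can be illustrated with a small Python sketch. It shows the general technique of a least-recently-used (LRU) memory tier in front of a slower persistent tier, under the assumption that a plain dictionary stands in for remote object storage. The names `TieredStateStore` and `remote_reads` are hypothetical, and the sketch does not describe RisingWave's storage engine.

```python
from collections import OrderedDict


class TieredStateStore:
    """Toy two-tier store: a bounded LRU cache over a slower 'remote' tier.

    Hypothetical illustration; the remote tier is just a dict here.
    """

    def __init__(self, remote, cache_capacity):
        self.remote = remote                  # stand-in for object storage
        self.cache = OrderedDict()            # oldest entry first
        self.cache_capacity = cache_capacity
        self.remote_reads = 0                 # counts cache misses

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)       # mark as most recently used
            return self.cache[key]
        self.remote_reads += 1                # cache miss: go to the slow tier
        value = self.remote[key]
        self._admit(key, value)
        return value

    def put(self, key, value):
        self.remote[key] = value              # write through to the durable tier
        self._admit(key, value)

    def _admit(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)    # evict the least recently used key


store = TieredStateStore(remote={"a": 1, "b": 2, "c": 3}, cache_capacity=2)
values = [store.get(k) for k in ["a", "a", "b", "a", "c", "b"]]
print(values, "remote reads:", store.remote_reads)
```

Only the reads that miss the cache touch the slow tier, which is why a better caching strategy translates directly into fewer remote accesses and steadier latency.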

Automatic elastic scaling and performance optimization

By adopting a compute-storage separation architecture, RisingWave achieves unlimited horizontal scalability, ensuring both high availability and elastic scaling capabilities. In 2024, we aim to comprehensively upgrade both RisingWave's internal components and the user experience.

Currently, RisingWave supports elastic scaling within seconds, but the process still requires manual configuration based on the online workload, which limits its convenience. To address this, RisingWave plans to introduce an automatic scaling feature in upcoming versions. This feature will adjust system configurations automatically in response to streaming workloads, so that the system adapts as the cluster scales.

RisingWave will also continue to optimize its tiered storage system. Performance can fluctuate because of remote object storage access, especially on cache misses. We plan to refine our caching strategy to make these fluctuations less likely. Additionally, for complex stateful queries like multi-stream joins, RisingWave will upgrade its optimizer and executor, reinforcing its leadership position in the stream processing domain.

Fully embracing data lakes

RisingWave supports reading from and writing to various data lake formats. Notably, through collaboration with communities like Apache Iceberg, RisingWave has made significant contributions to the Apache Iceberg Rust project. This collaboration has produced a Rust interface for Iceberg and a threefold improvement in write performance for the Iceberg format.

Looking ahead to 2024, RisingWave plans to strengthen its partnerships with key data lake communities and significantly improve its direct read-write capabilities for data lakes. Users will be able to write to data lakes in real time through RisingWave, construct materialized views directly on data lakes, and observe real-time computational results that reflect changes in those data lakes. As a result, users will be able to run unified analyses over both real-time stream data and historical batch data using RisingWave.

Furthermore, RisingWave intends to collaborate with leading data lake and real-time analysis system vendors to develop a streaming lakehouse. This initiative aims to offer users a more cost-effective and real-time data management experience.

Enhancing online serving capability

RisingWave functions as a streaming database, not merely as a stream processing engine. This distinction means users often use RisingWave in online data services to augment traditional operational databases such as MySQL, PostgreSQL, and MongoDB. Common applications include using RisingWave to directly consume change data capture (CDC) streams from operational databases, build real-time materialized views, and serve online data queries directly to user applications.
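As a rough illustration of that pattern, the Python sketch below applies a stream of CDC events to a local copy of the rows and to a derived view that counts orders by status. It is a toy model under stated assumptions: the event shape (`op`, `id`, `row`), the function `apply_change`, and the sample data are hypothetical and do not reflect any specific CDC format or RisingWave's internals.

```python
from collections import Counter


def apply_change(rows, view, event):
    """Apply one CDC event to the row copy and to the derived view."""
    op, key = event["op"], event["id"]
    if op in ("update", "delete"):
        # Retract the old version of the row from the view.
        old = rows.pop(key)
        view[old["status"]] -= 1
        if view[old["status"]] == 0:
            del view[old["status"]]
    if op in ("insert", "update"):
        # Add the new version of the row to the view.
        rows[key] = event["row"]
        view[event["row"]["status"]] += 1


rows = {}                      # latest row per primary key
orders_by_status = Counter()   # derived view: COUNT(*) grouped by status

# Made-up change stream from an operational database.
change_stream = [
    {"op": "insert", "id": 1, "row": {"status": "pending"}},
    {"op": "insert", "id": 2, "row": {"status": "pending"}},
    {"op": "update", "id": 1, "row": {"status": "shipped"}},
    {"op": "delete", "id": 2},
]
for event in change_stream:
    apply_change(rows, orders_by_status, event)

print(dict(orders_by_status))
```

Because every event adjusts the view in place, an application can query the current counts at any time without rescanning the source table.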

In 2024, RisingWave is set to intensify efforts to boost online serving experiences across three dimensions:

Data Ingestion: RisingWave will enhance direct data ingestion capabilities for more data systems, enabling seamless real-time data transmission to RisingWave.
Data Storage: Plans are in place to introduce new table formats for efficient data compression, which will reduce storage costs while allowing external engines direct access to RisingWave data.
Query Serving: RisingWave will integrate advanced capabilities like full-text search and optimize responses for high-concurrency queries, empowering users to create more stable and efficient online data applications.

Giving back to the community

RisingWave's role in the stream processing field is deeply rooted in the support and contributions from nearly 150 open-source contributors and approximately 3,000 community members. We are dedicated to listening to the community, actively responding to user feedback, and incorporating user suggestions. In the upcoming year, RisingWave plans to host multiple online and offline events to further promote the application of stream processing systems.

RisingWave eagerly looks forward to collaborating with community members to develop the next-generation stream processing system, fostering the prosperity and wider adoption of stream processing technology.

This article outlines the primary development directions for 2024 of RisingWave, a representative third-generation stream processing system. These include enhancing usability and cost-effectiveness to lower the barriers to learning, using, and maintaining stream computations; introducing automatic scaling features and further optimizing its tiered storage; strengthening integration with major data lakes; enhancing online serving capability; and hosting more community events to promote the application of stream processing systems.