On April 8th, 2022, we open-sourced RisingWave, a distributed SQL database purpose-built for stream processing. Over the past two years, RisingWave has seen incredible growth. On the community side, we’ve gained nearly 7,000 stars on GitHub, brought together over 2,000 members in our Slack community, and seen thousands of deployments across industries worldwide, including financial services, entertainment, manufacturing, ad tech, e-commerce, and many others. Product-wise, we’ve consistently released new versions every month, with the latest, version 1.10, released in July 2024.
What Have We Achieved?
We’re incredibly proud of what we’ve built and genuinely believe our technology is game-changing. While there are plenty of successful stream processing projects out there—like ksqlDB, Apache Spark Streaming, and Apache Flink—many people still choose RisingWave or are making the switch to us for their stream processing needs. We recently ran a survey in our community, and here’s what we found.
The Top 3 Reasons Why People Choose RisingWave:
- Ease of Use: Most people prefer building business applications over spending weeks learning new tools. RisingWave’s PostgreSQL compatibility eliminates the need to learn complex Java/Scala APIs or specialized SQL dialects. It also offers tools like UDFs and APIs for seamless integration into existing data stacks. Additionally, it abstracts internal complexities such as state management, freeing users from configuring storage parameters (e.g., RocksDB) to handle large workloads.
- Simplified Data Stack: Traditionally, building a stream processing app requires piecing together multiple components, like messaging queues (e.g., Kafka), stream processors, and a serving database. RisingWave simplifies this by integrating ingestion, computation, and serving functionalities into a single system, allowing users to power their entire stream processing application with ease.
- Cost-Efficiency: Many data-driven businesses need to process complex business logic involving continuous stateful computations, such as joins, aggregations, and time windowing. RisingWave is highly optimized for these complex queries, enabling efficient processing at a fraction of the cost.
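To make the PostgreSQL-compatible experience above concrete, here is a rough sketch of what the end-to-end workflow looks like. All names, topics, and connection details are illustrative, and connector options vary by deployment; check the documentation for your version.

```sql
-- Connect with any Postgres client, e.g.: psql -h localhost -p 4566 -d dev

-- Ingest a Kafka topic as a streaming source (details are placeholders).
CREATE SOURCE orders (
    order_id   BIGINT,
    amount     DECIMAL,
    created_at TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

-- A materialized view that is incrementally maintained as events arrive;
-- state management happens inside RisingWave, with no storage tuning.
CREATE MATERIALIZED VIEW hourly_revenue AS
SELECT window_start, SUM(amount) AS revenue
FROM TUMBLE(orders, created_at, INTERVAL '1 hour')
GROUP BY window_start;

-- Query it like an ordinary Postgres table.
SELECT * FROM hourly_revenue ORDER BY window_start DESC LIMIT 24;
```

No Java/Scala job, no separate state backend to configure: standard SQL over a Postgres wire connection drives the whole pipeline.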
The Top 10 Features Loved by RisingWave Users:
- PostgreSQL wire compatibility
- Source and sink integrations
- High-performance streaming joins
- Dynamic scaling
- Instant failure recovery
- Serverless backfilling
- Support for UDF, UDAF, and UDTF
- Automatic schema changes
- Full-fledged streaming SQL (including watermarks, time windowing, and temporal filtering)
- Time travel capabilities
With hundreds of enterprises and startups trusting RisingWave, we believe it has already established itself as one of the best SQL stream processing systems out there. Over the past two years, our focus has been on lowering the barrier to mastering stream processing by 10X. But now that RisingWave has become more powerful, we’re thinking bigger. Instead of just improving the stream processing experience, we’re considering what systems people truly need in their data stack. We need to build something new to take RisingWave to the next level.
Unified Streaming and Batch Processing
After months of research and conversations with customers and users, we came to a conclusion: RisingWave should offer unified streaming and batch processing. Now, you might be wondering—does this mean RisingWave is pivoting into a batch system? Absolutely not. What we’re seeing is that more and more businesses are becoming streaming-first, or are on the path to becoming streaming-first. These businesses operate with data in motion and want to ingest fresh data as quickly as possible to gain real-time insights.
This trend is clear not only in latency-sensitive industries like trading firms and brokerages but also in more traditional sectors like manufacturing and energy, where real-time data is increasingly driving decision-making.
However, stream processing alone often isn’t enough. In many cases, businesses still rely on batch processing to build complete applications. We’ve identified three classic use cases that highlight the need for a unified streaming and batch processing system:
Continuous Monitoring and Analytics
In continuous monitoring and analytics, users often need to join an event stream (e.g., from Kafka) with a batch table (e.g., from S3 or Postgres). A typical scenario is a marketing team wanting to join real-time clickstream data with user profiles, or real-time lead data with CRM information. To support these use cases, a unified batch and streaming system must enable users to batch load data from S3, continuously ingest streaming data from Kafka, and, crucially, join event streams with batch data in real time.
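A stream-to-table join of this kind could look roughly like the sketch below: clicks arrive from Kafka, profiles sit in a batch-loaded table, and each click is enriched with the profile as of processing time. The schemas and options are illustrative, and exact temporal-join syntax may differ across versions.

```sql
-- Table loaded in batch (e.g., backfilled from S3 or Postgres).
CREATE TABLE user_profiles (
    user_id BIGINT PRIMARY KEY,
    segment VARCHAR,
    region  VARCHAR
);

-- Streaming clickstream source from Kafka (details are placeholders).
CREATE SOURCE clicks (
    user_id  BIGINT,
    page     VARCHAR,
    event_ts TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'clicks',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

-- Enrich each click with the profile row as of processing time
-- (a temporal join in streaming-SQL terms).
CREATE MATERIALIZED VIEW enriched_clicks AS
SELECT c.user_id, c.page, c.event_ts, p.segment, p.region
FROM clicks AS c
JOIN user_profiles FOR SYSTEM_TIME AS OF PROCTIME() AS p
    ON c.user_id = p.user_id;
```

The join looks up the table side by its primary key, so late updates to a profile affect only clicks that arrive afterward.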
Feature Engineering and Time Series Analytics
In feature engineering and time-series analytics, users want to apply the same code to both batch and streaming data pipelines. Developing the same logic in separate systems is error-prone and can lead to inconsistencies. For example, feature transformation for training and inference must be identical to avoid performance degradation in production. Ideally, a system should efficiently transform both batch and streaming data, delivering them to an offline store (e.g., S3) for training and an online store (e.g., Redis) for real-time inference, ensuring consistent model performance.
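The "one transformation, two stores" idea can be sketched as a single materialized view fanned out to two sinks. This assumes an `events` source already exists; the connector names match RisingWave's sink catalog, but the full option lists are elided because they depend on credentials and deployment.

```sql
-- One feature transformation, defined once.
CREATE MATERIALIZED VIEW user_features AS
SELECT user_id,
       COUNT(*)    AS event_count,
       AVG(amount) AS avg_order_value
FROM events
GROUP BY user_id;

-- Offline store for training (files on S3; options omitted for brevity).
CREATE SINK features_offline FROM user_features
WITH (connector = 's3' /* plus bucket, path, and credential options */);

-- Online store for low-latency inference (Redis; options omitted).
CREATE SINK features_online FROM user_features
WITH (connector = 'redis' /* plus connection options */);
```

Because both sinks read the same view, training and serving see byte-identical feature logic, which is exactly the consistency property the paragraph above calls for.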
Metrics and Event Stores
In metrics and event stores, users want the ability to analyze newly inserted data while storing historical data in S3 for cost-effective, ad-hoc analytics. For example, in a billing system, users might want to monitor spending spikes in real time while also building dashboards from historical data.
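The billing example might be wired up roughly as follows, assuming a `billing_events` source with `account_id`, `cost`, and `event_ts` columns already exists. Sink options are placeholders.

```sql
-- Real-time view over fresh data: per-account spend in hourly windows,
-- which a dashboard or alert can poll for spikes.
CREATE MATERIALIZED VIEW spend_per_hour AS
SELECT window_start, account_id, SUM(cost) AS spend
FROM TUMBLE(billing_events, event_ts, INTERVAL '1 hour')
GROUP BY window_start, account_id;

-- Archive the raw stream to cheap object storage (e.g., Iceberg on S3)
-- for historical, ad-hoc analytics.
CREATE SINK billing_archive FROM billing_events
WITH (connector = 'iceberg' /* plus catalog, table, and S3 options */);
```

Hot data stays queryable in the view; cold data accumulates in the lake, where any batch engine can reach it.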
Clearly, there’s a need for unified streaming and batch processing. This realization has motivated us to look ahead and expand our vision. That’s why we’re gearing up to release RisingWave 2.0, the SQL database for Unified Streaming and Batch Processing.
Key Focus in RisingWave 2.0
Our goal isn’t to build a full-fledged batch system—we’re not out to create another Spark, or Snowflake, or Redshift. We understand the boundaries of RisingWave and know there’s no silver bullet in the data infrastructure world. Instead of chasing unrealistic, massive projects, we’re focused on practical solutions that deliver real value to our users. Here are the top 3 items on our list:
Streaming and Batch Ingestion
- Schemaless Ingestion: Right now, when RisingWave users consume data from upstream services like Kafka or Postgres, they often have to manually specify the schema—deciding which columns or fields to ingest into RisingWave—unless they use a schema registry. But we think this process could be much smoother. Users shouldn’t have to worry about defining their schemas. Instead, they should just point RisingWave to the source system, and RisingWave should automatically extract the schema. This would significantly simplify the data import process, making it easier for users to get their data into RisingWave and focus on writing queries instead of dealing with ingestion complexities.
- Schema Evolution: In any organization, database schemas change frequently. RisingWave needs to be able to detect these schema changes in upstream databases and automatically adjust. Once a schema change is detected, RisingWave should recalculate materialized views and propagate these changes downstream. This way, users can focus on their core business operations without worrying about evolving schemas.
- Support for More Batch Sources: Right now, RisingWave supports various streaming data sources like Kafka, Pulsar, Redpanda, and PostgreSQL CDC, and it can monitor changes in S3. But users have data stored in more places, like Data Lakes and Data Warehouses. We’re planning to expand RisingWave’s capabilities to support more batch sources, allowing users to import data in bulk or as a stream from these batch systems.
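The gap the schemaless-ingestion work targets is visible in today's DDL. Without a schema registry, every column is spelled out by hand; with one, the column list can already be omitted. The sketch below uses illustrative topics and addresses.

```sql
-- Today, for a plain JSON topic: columns declared manually.
CREATE SOURCE payments (
    payment_id BIGINT,
    user_id    BIGINT,
    amount     DECIMAL,
    paid_at    TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'payments',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

-- With a schema registry (e.g., for Avro), no column list is needed;
-- the schemaless-ingestion work aims to make this the default experience
-- even without a registry.
CREATE SOURCE payments_avro
WITH (
    connector = 'kafka',
    topic = 'payments-avro',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE AVRO (
    schema.registry = 'http://schema-registry:8081'
);
```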
Streaming and Batch Execution
Ad-Hoc Queries on an Expanded List of Data Sources: RisingWave is frequently used to ingest data from various sources and perform stream processing. But many users also want to run ad-hoc queries on these data sources. While RisingWave already supports ad-hoc queries on systems like Kafka, S3, and Iceberg, we know that’s not enough. We’re planning to improve the performance of these queries and expand support to include more data sources, such as different databases, messaging queues, and Data Lakes.
Tunable Materialized Views: RisingWave’s materialized views are currently event-based, meaning each incoming event triggers a refresh. But users don’t always need such high freshness and might be satisfied with updates on an hourly or daily basis. By lowering the freshness requirement, computational costs could be significantly reduced. RisingWave plans to introduce tunable materialized views, allowing users to specify the refresh frequency. This will enable them to balance system freshness with computational resource usage, tailoring the trade-off to their specific needs.
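To illustrate the trade-off, compare today's always-fresh view with a lower-freshness variant. The refresh clause below is hypothetical: no such syntax has been finalized, and it appears here only to show the shape of the feature.

```sql
-- Today: every incoming event triggers incremental maintenance,
-- so the view is always up to date (assuming an 'orders' source exists).
CREATE MATERIALIZED VIEW daily_sales AS
SELECT window_start, SUM(amount) AS total
FROM TUMBLE(orders, created_at, INTERVAL '1 day')
GROUP BY window_start;

-- HYPOTHETICAL sketch of a tunable variant (syntax not final):
-- relaxing freshness to hourly lets the system batch up maintenance
-- work and spend far less compute.
--
-- CREATE MATERIALIZED VIEW daily_sales
-- REFRESH EVERY INTERVAL '1 hour' AS
-- SELECT window_start, SUM(amount) AS total
-- FROM TUMBLE(orders, created_at, INTERVAL '1 day')
-- GROUP BY window_start;
```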
Time Series Support: Although not all streaming data takes the form of time series data, a significant portion does. Many of RisingWave’s users, especially those in financial services and IoT, need specialized operations on time series data. RisingWave will offer various time series operations, including but not limited to:
- As-Of Join: Efficiently joining time series data based on the closest preceding time value.
- Downsampling: Reducing the frequency of time series data by aggregating or selecting data points to create a more manageable dataset.
- Resampling: Aggregating or interpolating time series data to a different time frequency.
- Time Series Aggregation: Computing aggregates over fixed time intervals, such as daily, weekly, or monthly summaries.
- Gap Filling: Detecting and filling in missing data points within a time series using interpolation or other methods.
- Time-Based Joins: Performing joins where the time dimension is a critical factor, such as joining different time series based on overlapping time intervals.
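Of the operations above, fixed-interval aggregation can already be expressed with RisingWave's window functions; the others are the new work. As a baseline, a daily time series summary looks like this (assuming a `sensor_readings` source with `sensor_id`, `reading`, and `reading_ts` columns):

```sql
-- Daily per-sensor summaries over a fixed tumbling window.
CREATE MATERIALIZED VIEW daily_sensor_stats AS
SELECT window_start,
       sensor_id,
       AVG(reading) AS avg_reading,
       MIN(reading) AS min_reading,
       MAX(reading) AS max_reading
FROM TUMBLE(sensor_readings, reading_ts, INTERVAL '1 day')
GROUP BY window_start, sensor_id;
```

Operations like as-of joins and gap filling don't reduce to a simple window like this, which is why they need dedicated support.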
Data Lake Integration
Data Lakes as a Single Source of Truth: More companies are turning to Data Lakes as their single source of truth to consolidate all their data, effectively breaking down data silos. We’re fully aware of this trend and want RisingWave to be part of the Data Lake ecosystem. In RisingWave 2.0, we’re going to strengthen our integration with Data Lakes. We already support reading from Iceberg and continuously writing data to it. Moving forward, we’ll continue to support Iceberg, expand to other Data Lakes, and add support for mainstream Data Lake Catalogs. This will help users better manage schema information and make RisingWave an even more integral part of the ecosystem.
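The Iceberg round trip mentioned above looks roughly like the sketch below. Option names depend on the catalog in use, the paths and names are placeholders, and `enriched_orders` stands in for any existing materialized view; consult the connector documentation for the exact parameters.

```sql
-- Batch-read an existing Iceberg table into RisingWave.
CREATE SOURCE lake_orders
WITH (
    connector = 'iceberg',
    warehouse.path = 's3://my-bucket/warehouse',
    database.name = 'analytics',
    table.name = 'orders'
);

-- Continuously write a materialized view back to the lake.
CREATE SINK orders_to_lake FROM enriched_orders
WITH (
    connector = 'iceberg',
    type = 'upsert',
    warehouse.path = 's3://my-bucket/warehouse',
    database.name = 'analytics',
    table.name = 'orders_enriched',
    primary_key = 'order_id'
);
```

Catalog support is what ties this together: with a shared catalog, the tables RisingWave reads and writes are the same ones the rest of the lake ecosystem sees.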
RisingWave 2.0’s Vision
As of today, RisingWave has been in development for three and a half years. In that time, we’ve built the system from the ground up and have seen rapid growth in both our community and commercial sectors. We’re confident in the future of RisingWave and real-time stream processing. RisingWave will continue as an independent, Apache 2.0-based open-source project focused on better serving our users.
For those with higher demands for real-time stream processing, we also offer RisingWave Cloud and RisingWave Self-Hosted editions, which include premium features, making it easier for users to adopt stream processing and leverage powerful, cost-efficient streaming capabilities.
We’re extremely proud of our mission: to democratize stream processing by making it simple, accessible, and affordable.