Is RisingWave the Right Streaming Database for Your Stream Processing Application?
As a distributed SQL streaming database, RisingWave is well-optimized for stream processing. Let's take a look at whether it is meant for your stream processing application.
If you are looking for a stream processing system for your new streaming analytics or streaming ETL applications, consider RisingWave. As a distributed SQL streaming database, RisingWave is well-optimized for stream processing.
Is it meant for your stream processing application?
The underlying premise of this question is whether RisingWave is well-suited for your workload and business requirements. In this article, we will guide you through the different aspects of RisingWave. We cover user interface, performance, correctness, scalability and elasticity, fault tolerance and availability, durability, serving capability, integrations, and deployment.
RisingWave is a streaming database
RisingWave is an open-source distributed SQL database designed for stream processing. Also known as a streaming database, RisingWave processes streaming data in a database way. It uses SQL as the interface to ingest, manage, query, and store continuously generated data streams.
RisingWave is designed to reduce the complexity and cost of building real-time applications. It consumes streaming data, performs incremental computations when new data come in, and updates results dynamically.
As a streaming database system, RisingWave maintains results in its own storage so that users can access data efficiently. You can also sink data from RisingWave to an external stream for storage or additional processing.
The open-source version of RisingWave is released on GitHub under Apache License 2.0, a real open-source license. It is ready for production use and has been deployed in dozens of companies. The managed version, named RisingWave Cloud, is currently in beta release. Early access is available.
The decoupled compute-storage architecture provides better leverage
RisingWave is a distributed system. It leverages the decoupled compute-storage architecture to achieve infinite scalability at a low cost.
There are currently four types of nodes in the cluster:
- Frontend: Frontend is a stateless proxy that accepts user queries through Postgres protocol. It is responsible for parsing, validating, optimizing, and providing each query's results.
- ComputeNode: ComputeNode is responsible for executing the optimized query plan.
- Compactor: Compactor is a stateless worker node responsible for executing the compaction tasks for the storage engine.
- MetaServer: MetaServer is the central metadata management service. It also acts as a failure detector that periodically sends heartbeats to frontends and ComputeNodes in the cluster.
The topmost component is the Postgres client. It issues queries through TCP-based Postgres wire protocol. The leftmost component is the streaming data source
Kafka is the most representative system for streaming sources. Alternatively, Redpanda, Apache Pulsar, AWS Kinesis, Google Pub/Sub are also widely used. Streams from Kafka will be consumed and processed through the pipeline in the database. The bottom-most component is AWS S3, or MinIO (an open-sourced s3-compatible system).
RisingWave is a SQL streaming database. Users can compose Postgres-style SQL to consume, process, and manage streaming data. Unlike big-data-style stream processing systems (such as Flink and Spark Streaming), RisingWave does not provide native Java, Scala, or Python API. However, since RisingWave is wire- compatible with PostgreSQL, users can directly use third-party PostgreSQL drivers to interact with RisingWave using other programming languages such as Java, Python, and Node.js.
To extend RisingWave's expressiveness in supporting complex queries, we are developing our UDF framework, and Python will be the first supported language. RisingWave will host the UDFs on a dedicated process and call them remotely. We are also exploring deploying UDF with serverless computing services such as AWS Lambda, which offers excellent scalability and low cost.
RisingWave is highly optimized for high-throughput stream processing. Our performance evaluation using the Nexmark benchmark shows RisingWave can achieve 10-50% higher throughput than Apache Flink for most queries. A detailed performance report will be released in Q2 2023.
RisingWave is not an in-memory system. It uses tiered storage architecture to manage its computation states and persistent data: the newly ingested and frequently accessed data are maintained in memory, and the infrequently accessed data are stored in S3 or S3-equivalent remote storage.
RisingWave is written in Rust, a low-level programming language designed for high- performance programming. Compared to JVM-based systems, RisingWave does not suffer any performance penalty caused by runtime management or garbage collection.
RisingWave is a stream processing system. It guarantees “consistency and completeness.” Specifically:
- RisingWave ensures exactly-once semantics, meaning that every single data event will be processed once and only once, even if a system failure occurs.
- RisingWave supports out-of-order processing. Users can enforce RisingWave to process data events in a predefined order, even if data events arrive out of order.
RisingWave is also a database system. But it is not an OLTP database (≤ version 0.1.16) and cannot be used to process database transactions. RisingWave guarantees that all reads will see a consistent snapshot of the database.
Scalability and elasticity
RisingWave was born as a distributed system. All data is stored in an underlying object store, while the compute layer can be swiftly scaled in and out whenever workload fluctuation occurs. The scaling process can be completed almost instantaneously without interrupting data processing.
Internally, RisingWave dynamically partitions data into each compute node using a consistent hashing algorithm. These compute nodes collaborate by computing their unique portion of the data and then exchanging output with each other. Thanks to partitioning, RisingWave can achieve cache locality on each node and get an overall performance comparable to shared-nothing systems.
Fault tolerance and availability
Like many other stream processing systems, RisingWave implements the Chandy- Lamport algorithm to take a consistent snapshot. The consistent snapshot helps RisingWave efficiently recover from failure.
The recovery process is faster than similar systems due to its specialized shared- storage design. The system can continue from the last checkpoint almost immediately after recovery. The only thing affected is the cache, which is rebuilt gradually afterward.
Currently, the default checkpoint frequency is 10 seconds. Hence, the recovery process will bring the system back to where it was 10 seconds ago and continue from there, aka RPO (Recovery Point Objective). In addition, it may take about 10 seconds to detect the failure through heartbeat timeout. Together they make up the Recovery Time Objective or RTO.
RisingWave persists its stored data. Users can insert data into RisingWave by issuing DML statements like
INSERT or directly using connectors.
We recommend using connectors from message brokers or direct CDC from upstream systems for production use. Incoming data is stored in RisingWave’s storage along with the checkpoint. To use a connector, users only need to add some extra properties when creating the table. Connector offsets are persisted in checkpoints to achieve data durability and exactly-once delivery.
In the case of a system failure, RisingWave will continue to consume data from the checkpointed offset, ensuring that no data is missing or lost. However, the users are responsible for ensuring that data stored in message brokers is persisted.
DML statements, including
INSERT/ UPDATE/ DELETE statements, are beneficial for manually modifying data. However, they may suffer from data loss if the system inadvertently fails before they are persisted. The
FLUSH command can explicitly tell them to persist data. In future releases, we plan to provide a write-ahead log (WAL) to ensure zero data loss — better data management.
RisingWave uses materialized views to store stream processing results. It can serve concurrent queries from users at ultra-low latency. Our experimental evaluation using a standard benchmark is a good example. It showed that RisingWave can answer concurrent queries with single-digit millisecond latency. We will release a public experiment report later this year.
Check our comprehensive list of integrations. Here we highlight some of them.
- Message brokers: Apache Kafka, Redpanda, Apache Pulsar, AWS Kinesis, Google Pub/Sub
- Database (direct CDC connection): MySQL, PostgreSQL
- File: S3
- Message brokers: Apache Kafka, Redpanda
- Databases: MySQL, PostgreSQL, Redis, Cassandra, DynamoDB
- Data lake: Apache Hudi, Apache Iceberg
- Realtime dashboards: Grafana
- BI tools: Superset, Metabase, DBeaver
The easiest way to deploy RisingWave is through our official cloud hosting platform, RisingWave Cloud. We now support AWS in most regions, with support for GCP coming soon. Alternatively, we provide an open-source Kubernetes operator to run RisingWave on your local Kubernetes cluster. For storage, any S3-compatible storage service is accepted, including MinIO or Ceph.
We suggest not deploying RisingWave using only binaries, as it may incur additional efforts to achieve high availability.
Others: OLTP and OLAP
RisingWave does not support transaction semantics. So if you are looking for a system to support transactional workloads, there may be better choices than RisingWave. Consider cloud databases like AWS Aurora or Google Spanner. RisingWave does not implement a columnar store and is not optimized for interactive analytical workloads. But RisingWave can easily be integrated with popular OLTP and OLAP databases.
RisingWave is a modern distributed SQL database for stream processing. It greatly saves the efforts of building streaming analytics or streaming ETL applications.
RisingWave can achieve higher cost efficiency than other stream processing systems due to its decoupled compute-storage architecture. RisingWave has already been deployed in many pioneering companies. Learn more about RisingWave.