The last decade witnessed the rapid growth of stream processing technology. My first exposure to stream processing was in 2012, when I was fortunate to intern at Microsoft Research Asia, working on a large-scale distributed stream processing system. I then continued my research and development on stream processing and database systems at the National University of Singapore, Carnegie Mellon University, IBM Almaden Research Center, and AWS Redshift. Most recently, I founded a startup called RisingWave Labs to develop a distributed SQL streaming database.
2023 marks my 11th year in the stream processing area. I never imagined I would work in this field for over a decade, jumping out of my comfort zone and gradually shifting from pure research and engineering to exploring the path of commercialization. In a startup, the most exciting and challenging task is to predict the future: we must constantly evaluate technology and business trends and, based on our understanding and intuition, predict where the industry is heading. As a startup founder building next-generation stream processing systems, I remain optimistic yet humble about the market. I keep asking myself the same questions: Why do we need stream processing? Why do we need a streaming database? Can stream processing really replace batch processing? In this blog, I will draw on my software development and customer engagement experiences to discuss the past, present, and future of stream processing and streaming databases from a practitioner's perspective.
When Do We Really Need a Stream Processing System?
When talking about stream processing systems, people naturally think of low-latency use cases: stock trading, fraud detection, ad monetization, etc. However, how exactly are stream processing systems used in these cases? How low does the latency need to be? Why not use batch processing systems to solve these problems? In this section, I will answer these questions based on my experience.
Classic Stream Processing Scenarios
Regardless of the specific use case, stream processing systems are typically used in the following two scenarios: streaming ingestion and streaming analytics.
Streaming ingestion: joining streams from OLTP databases and message queues and delivering results to data warehouses/data lakes.
Streaming ingestion (ETL). Streaming ingestion provides the capability of continuously sending data from one set of systems to another. In many cases, users need to perform some transformation on the data during ingestion; hence, people sometimes use the terms "streaming ingestion" and "streaming ETL" interchangeably. When do we need streaming ingestion? Here's an example. Let's assume we are running an e-commerce website that uses an OLTP database (e.g., AWS Aurora, CockroachDB, TiDB, etc.) to support online transactions. To better analyze user behaviors, our website needs to track user events (e.g., ad clicks) and record these events in message queues (e.g., Kafka, Pulsar, Redpanda, etc.). To improve the sales strategy based on the most recent data, we have to continuously combine and ingest the transaction data and the user behavior logs into a single data warehouse, such as Snowflake or Redshift, for deep analytics. Stream processing systems play a critical role here: data engineers use them to clean the data, join the two streams, and move the joined results into the data warehouse in real time. In real-world scenarios, data ingested into stream processing systems typically comes from OLTP databases, message queues, or storage systems. After stream processing, the results are most likely dumped back into those systems or inserted into data warehouses or data lakes. One thing worth noting is that data seldom flows out of data warehouses and data lakes, as these systems are typically considered the "terminal" of the data: few of them provide data change logs, so capturing data changes from them can be very challenging.
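To make this concrete, here is a minimal sketch of such a pipeline in RisingWave-style SQL. All object names and connector parameters are hypothetical, credentials and some required settings are omitted, and the exact syntax varies across systems, so treat it as an illustration rather than a definitive recipe.

```sql
-- Ingest the transaction table from the OLTP database via change data capture
-- (hypothetical parameters; credentials omitted).
CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,
    user_id BIGINT,
    amount DECIMAL,
    order_ts TIMESTAMP
) WITH (
    connector = 'postgres-cdc',
    hostname = 'oltp-db',
    port = '5432',
    database.name = 'shop',
    table.name = 'orders'
);

-- Ingest user behavior events from Kafka.
CREATE SOURCE ad_clicks (
    user_id BIGINT,
    ad_id BIGINT,
    click_ts TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'ad_clicks',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

-- Continuously clean and join the two streams.
CREATE MATERIALIZED VIEW orders_with_clicks AS
SELECT o.order_id, o.user_id, o.amount, c.ad_id, c.click_ts
FROM orders AS o
JOIN ad_clicks AS c ON o.user_id = c.user_id;

-- Deliver the joined results downstream, e.g., to a warehouse via JDBC.
CREATE SINK orders_sink FROM orders_with_clicks WITH (
    connector = 'jdbc',
    jdbc.url = 'jdbc:postgresql://warehouse:5439/analytics',
    table.name = 'orders_with_clicks',
    type = 'upsert'
);
```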
Streaming analytics. You may have already heard the term real-time analytics. Streaming analytics is similar to real-time analytics but prioritizes data freshness over query latency (we will discuss this later in the blog). When performing streaming analytics, users want to continuously receive fresh results over recent data (e.g., data generated within the last 30 minutes, one hour, or seven days). Just think of stock market data: users are most interested in insights from the most recent data, and processing historical data may not bring much value in this case. In the streaming analytics scenario, data typically comes from OLTP databases, message queues, and storage systems. Results are usually ingested into a key-value store (such as Cassandra or DynamoDB) or a caching system (such as Redis) to serve user-triggered requests. Some users may send stream processing results directly to the application layer, a typical pattern in alerting applications.
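As a sketch of this pattern (RisingWave-style SQL; the `trades` source and its columns are hypothetical), a materialized view can maintain per-symbol statistics over recent data, and its fresh contents can then be served to dashboards or copied into a store like Redis:

```sql
-- Fresh per-symbol statistics over tumbling 10-minute windows.
-- TUMBLE() is RisingWave-style; other systems spell windowing differently.
CREATE MATERIALIZED VIEW trade_stats AS
SELECT
    symbol,
    window_start,
    COUNT(*)   AS num_trades,
    AVG(price) AS avg_price
FROM TUMBLE(trades, trade_ts, INTERVAL '10 minutes')
GROUP BY symbol, window_start;
```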
While batch processing systems can also support data ingestion and analytics, stream processing systems reduce latency from days or hours to minutes or seconds, which can bring huge benefits to many businesses. In the data ingestion scenario, lower latency allows users of downstream systems (such as data warehouses) to obtain and process the latest data promptly. In the data analytics scenario, downstream systems see the latest processing results in real time, so the results can be presented to users in a timely manner. Still, two questions naturally arise:
- In the data ingestion (or ETL) scenario, wouldn’t it be enough for us to use an orchestration tool (such as Apache Airflow) to periodically trigger a batch processing system (such as Apache Spark) to do computations?
- In the data analytics scenario, many real-time analytical systems (such as ClickHouse, Apache Pinot, Apache Doris, etc.) can already analyze data online. Why do we need a stream processing system?
These two questions are worth exploring, and we will discuss them in the last section.
User Expectations: How Low Does Latency Need to Be?
People always talk about low-latency processing. But what does "low latency" mean? What are users' latency expectations? Seconds? Milliseconds? Microseconds? Is it simply the lower, the better? We have interviewed hundreds of users who have adopted, or are considering adopting, stream processing technologies in production. We found that most use cases require latencies ranging from hundreds of milliseconds to tens of minutes. Many data engineers told us: "After adopting stream processing systems, our computation latency dropped from days to minutes. We are very satisfied with this improvement and are not strongly motivated to reduce the latency further." Ultra-low latency sounds appealing, but, unfortunately, there are not many use cases that actually require it.
In some ultra-low-latency scenarios, general-purpose stream processing systems cannot meet the latency requirements. A typical example is high-frequency quantitative trading. Quantitative trading firms expect their systems to respond to market fluctuations in microseconds and to buy and sell stocks or futures in a timely manner. To achieve this level of latency, many firms move their data centers into buildings physically closer to the stock exchange and carefully choose network providers to reduce any possible network delay. In this scenario, almost all trading firms build their own systems instead of adopting general-purpose stream processing systems such as Flink or RisingWave. This is not only because general-purpose systems often fail to meet the latency requirements, but also because quantitative trading usually requires specialized expressions that general-purpose stream processing systems do not support well.
Another set of low-latency scenarios includes IoT (Internet of Things) and network monitoring. The latency requirement here is usually at the millisecond level and can vary from a few milliseconds to hundreds of milliseconds. In such scenarios, general-purpose stream processing systems can be a good fit. However, some use cases still call for specialized systems. A good example is autonomous vehicles, which are typically equipped with radar and sensors to monitor roadside conditions, speed, coordinates, driver behaviors (e.g., pedaling and braking), and other information. Such information often cannot be sent back to the data center due to privacy concerns and network bandwidth constraints, so a common solution is to deploy embedded computing devices directly on the vehicle for data processing.
Stream processing systems are primarily adopted in scenarios where latency falls into the range of hundreds of milliseconds to several minutes. These scenarios include ad recommendation, fraud detection, stock market dashboards, food delivery, etc. Here, lower latency may bring more value but is not considered critical. Take the stock market dashboard as an example. Individual investors usually make trading decisions by watching stock fluctuations on websites or mobile phones. Do these investors really need to react to market fluctuations within ten milliseconds? Not at all. Human eyes process no more than about 60 frames per second, which means we cannot distinguish refreshes faster than roughly 16 milliseconds. Meanwhile, human reaction time is measured in hundreds of milliseconds: even well-trained sprinters react in about 100-200 milliseconds. Therefore, for a stock market dashboard, which exists solely for humans to gain insights and make decisions, latency below 100 milliseconds is unnecessary. General-purpose stream processing systems (such as Flink and RisingWave) are a great fit in this scenario.
Not all scenarios require low-latency processing. Consider examples like hotel reservations and inventory management: a 30-minute delay is usually not a big issue. In these cases, users may consider either stream processing or batch processing systems, and should decide based on other factors, such as cost efficiency, flexibility, and tech stack complexity.
Stream processing systems do not fit scenarios with no latency requirements, such as machine learning model training, data science, and financial reporting. Batch processing systems are obviously more suitable here. It is worth noting, though, that people have started moving from batch to streaming even in these cases: some companies have adopted stream processing systems to build their online machine-learning platforms.
Looking Back at (Forgotten) History
The previous section described the typical scenarios where stream processing systems fit. In this section, let's talk about the history of stream processing systems.
Apache Storm and Its Successors
For many experienced engineers, Apache Storm may be the first stream processing system they ever used. Apache Storm is a distributed stream computing engine written in Clojure, a niche JVM-based language that many junior developers may not know. Storm was open-sourced by Twitter in 2011. In a big-data era dominated by Hadoop, the emergence of Storm blew many developers' minds. Traditionally, users processed data by first importing large volumes into HDFS and then analyzing it with batch computing engines such as Hadoop. With Storm, data could be processed on the fly, immediately after it flowed into the system, and processing latency dropped drastically: Storm users could receive the latest results in just a few seconds instead of waiting hours or even days.
The initial version of Storm was not entirely user-friendly: users needed to learn many Storm-specific concepts before running a basic "word count" example. Image credit: Getting Started with Storm, by Jonathan Leibiusky, Gabriel Eisbruch, and Dario Simonassi (O'Reilly).
Apache Storm was groundbreaking at its time. However, its initial design was far from perfect: it lacked many basic functionalities that modern stream processing systems provide by default, such as state management, exactly-once semantics, dynamic scaling, and a SQL interface. But it inspired many talented engineers to build next-generation stream processing systems. Just a few years after Storm emerged, many new systems appeared: Apache Spark Streaming, Apache Flink, Apache Samza, Apache Heron, Apache S4, and more. Spark Streaming and Flink eventually stood out and became legends in the stream processing field.
History Before Apache Storm
If you think the history of stream processing systems begins with the birth of Storm, you are probably wrong. In fact, stream processing systems have evolved for over two decades, and, not surprisingly, the concept of stream processing originated in the database community. At the VLDB conference in 2002, a group of researchers from Brown and MIT published a paper titled "Monitoring Streams - A New Class of Data Management Applications", pointing out the demand for stream processing for the first time. In this paper, the authors proposed that, to process streaming data, systems should shift from the traditional "Human-Active, DBMS-Passive" mode to a new "DBMS-Active, Human-Passive" mode: instead of passively reacting to user-issued queries, this new class of database systems should trigger computation upon data arrival and actively push results to users. In academia, the earliest stream processing system was called Aurora; after it, Borealis, STREAM, and many other systems were born.
Aurora: The first stream processing system in academia.
Just a few years after being studied in academia, stream processing technology was adopted by large enterprises. The top three database vendors, Oracle, IBM, and Microsoft, successively launched their own stream processing solutions, known as Oracle CQL, IBM System S, and Microsoft SQL Server StreamInsight. Interestingly, instead of developing standalone stream processing systems, all these vendors chose to integrate stream processing functionality into their existing systems.
2002 to Present: What Has Changed?
The evolution of the stream processing systems: from subsystems of commercial database systems to standalone open-source big data systems.
As mentioned above, stream processing technology was first introduced in academia and soon adopted by industry. In the early days, stream processing was implemented as an integrated module inside commercial database systems. But since 2010, dozens of standalone stream processing systems have been built and adopted at tech companies. What caused this change? One major factor is the birth and growth of the Hadoop-centric big data ecosystem. In conventional commercial database systems, the computation and storage modules were tightly integrated and encapsulated in a single system. While this design greatly simplified the architecture, it also restricted these databases from scaling in distributed environments. Observing this limitation, engineers built big data systems that offered high scalability across dozens or hundreds of machines. The computation and storage layers were divided into independent systems, and users could use low-level interfaces to determine the best configurations and scaling strategies on their own. This design perfectly met the needs of a new batch of Internet companies, such as Twitter, LinkedIn, and Uber. The stream processing systems built at the time (notably Storm and Flink) closely followed the design principles of those batch processing systems (notably Hadoop and Spark), and their high scalability and flexibility perfectly fit the needs of big data engineers.
The Renaissance of Streaming Databases
Looking back at history, we find that the concept of a streaming database was proposed and implemented more than 20 years ago; the aforementioned Aurora system was already a streaming database. However, streaming databases never became as popular as stream processing systems. History moves in spirals. The big data era witnessed stream processing being separated from databases, overthrowing the monopoly of the three giants: Oracle, IBM, and Microsoft. More recently, in the cloud era that started around 2012, batch systems such as Redshift, Snowflake, and ClickHouse have brought the "computing engine" back into the "database". Now it is time to bring stream processing back into databases as well, which is the core idea behind modern streaming databases such as RisingWave and Materialize. But why now? We analyze this in detail in this section.
The Story of PipelineDB and ksqlDB
Since Apache Storm was open-sourced in 2011, stream processing systems have been in the fast lane of development. But streaming databases didn't just disappear: two streaming databases born in the big data era were popular at the time, PipelineDB and ksqlDB.
Although both PipelineDB and ksqlDB are streaming databases, they took quite different technical routes.
- PipelineDB is built entirely on PostgreSQL: it is a PostgreSQL extension for processing continuous queries. Once users have installed PostgreSQL, they can install PipelineDB as a plugin. This design is very user-friendly, since users can adopt PipelineDB without migrating their existing PostgreSQL databases. The core selling point of PipelineDB is real-time updated materialized views: after users insert data into PipelineDB, the latest results are reflected in the materialized views they created, in real time (see the sketch after this list).
- ksqlDB chose a different path: it is tightly coupled with Kafka. In ksqlDB, the intermediate results of computations are stored in Kafka, and communication between nodes also relies on Kafka. The advantages and disadvantages of this route are clear. The advantage is that mature components can be heavily reused, greatly reducing development cost. The disadvantage is that the strong binding to Kafka seriously limits flexibility and dramatically reduces the performance and scalability of the system.
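To illustrate the PipelineDB experience described above, here is a minimal sketch of a continuous view. It follows the `CREATE STREAM`/`CREATE CONTINUOUS VIEW` style documented for classic PipelineDB releases; the stream and column names are made up, and the exact syntax changed across versions.

```sql
-- A stream: rows inserted into it are consumed by continuous queries.
CREATE STREAM page_views (user_id BIGINT, url TEXT);

-- A continuous view that PipelineDB keeps updated as data arrives.
CREATE CONTINUOUS VIEW views_per_user AS
SELECT user_id, COUNT(*) AS view_count
FROM page_views
GROUP BY user_id;

-- Inserting into the stream immediately updates the view...
INSERT INTO page_views (user_id, url) VALUES (1, '/home'), (2, '/cart');

-- ...and the view is read like an ordinary PostgreSQL table.
SELECT * FROM views_per_user;
```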
Aside from technological routes, the two systems have much to do with each other commercially. PipelineDB was founded in 2013 and joined Y Combinator, the well-known Silicon Valley incubator, in 2014. ksqlDB, on the other hand, was built by Confluent in 2016 (in fact, it started as Kafka Streams). Confluent is the commercial company founded by the original Apache Kafka team. At the time, ksqlDB did not receive much attention outside the Kafka ecosystem, due to its strong binding to Kafka and its lack of technical maturity. Nevertheless, in terms of user recognition, both PipelineDB and ksqlDB trailed Spark Streaming and Flink. In 2019, Confluent acquired PipelineDB, and PipelineDB has not been updated since. Sounds familiar? Yes: more recently, in January 2023, Confluent acquired Immerok, a Flink commercialization company founded by Flink core members, and announced with much fanfare that it plans to launch a computing platform based on Flink SQL. PipelineDB and Immerok are important enhancements to the Kafka-centric real-time data stack, and both are direct replacements for ksqlDB, which makes the future of ksqlDB within Confluent even foggier.
The Rise of Cloud-Native Streaming Databases
In recent years, many cloud-native systems have gradually surpassed and disrupted their big data counterparts. The best-known example is Snowflake, which brought cloud data warehouses to the next level with an innovative architecture that separates storage from compute, and rapidly captured the market that had belonged to big-data-era analytic systems such as Impala. In the world of streaming databases, a similar story is unfolding. Cloud-based streaming databases like RisingWave, Materialize, DeltaStream, and TimePlus have emerged in recent years. Although they differ in commercial and technical routes, their general objective is the same: to provide users with streaming database services on the cloud. To achieve that, they focus on designing architectures that fully utilize cloud resources to deliver unlimited horizontal scalability and supreme cost efficiency. With a solid technical route and reasonable product positioning, I believe streaming databases will gradually take over the leading position held by the previous generation of stream processing systems.
Stream Processing Systems and Streaming Databases
The trend of shifting toward "cloud-native" is unstoppable. Now the biggest question is: why build a "cloud-native streaming database" instead of a "cloud-native stream processing system"?
Some people may argue that it is the impact of SQL. One of the major changes brought by cloud services is the democratization of data processing. In the big data era, users were typically engineers familiar with general-purpose programming languages such as Java. In the cloud era, cloud services are adopted by a much wider group of users beyond well-trained programmers. This urgently requires systems to provide a simple, easy-to-understand interface that benefits developers without deep programming knowledge. SQL, the standard query language, obviously meets this requirement: it is a declarative language solely for querying and manipulating data, and it has been the industry standard for decades. The widespread use of SQL has accelerated the popularization of the "database" concept in data engineering, and hence in stream processing.
However, SQL is only an indirect factor, not a direct one. The logic is simple: many stream computing engines (such as Flink and Spark Streaming) already provide SQL interfaces, yet they are not streaming databases. Although the SQL experience of these systems differs from that of traditional databases such as MySQL and PostgreSQL, it proves that having SQL does not necessarily mean being a database.
The key difference between a "streaming database" and a "stream processing system" is that a streaming database owns its storage, while a stream processing system does not. The deeper insight is that streaming databases treat storage as a first-class citizen. Building on this insight, streaming databases meet users' two most fundamental needs: simplicity and ease of use. How? Consider the following aspects.
- Unifying the objects of data operations. In stream processing systems, the stream is the basic object of operation, whereas in databases it is the table. Streaming databases unify the two concepts: a stream is simply a table of unbounded size, so a stream can be processed, queried, and manipulated in exactly the same way as a table in a traditional database.
- Simplifying the data stack. Unlike stream processing systems, streaming databases have ownership of data: users can store data directly in the database, and once processing finishes, the results can be stored in the same system and accessed concurrently by different users. In a stream processing system, which cannot store data, the result of any computation must be exported to an external downstream system, e.g., a key-value store or a cache, before users can access it. In other words, a single streaming database can replace combinations such as Flink + Cassandra/DynamoDB (see the sketch after the figure below). Simplifying the data stack and reducing operations and maintenance costs is obviously what many companies want.
A streaming database can perfectly replace the combination of a streaming engine and a serving system.
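Here is a minimal sketch of that replacement in RisingWave-style SQL (the `transactions` table and its columns are hypothetical): the same system both maintains the streaming result and serves point queries, where a Flink pipeline would first have to sink its output into Cassandra or DynamoDB.

```sql
-- Continuously maintained result, updated as new transactions arrive.
CREATE MATERIALIZED VIEW account_balance AS
SELECT account_id, SUM(amount) AS balance
FROM transactions
GROUP BY account_id;

-- Applications read the fresh result directly from the same database;
-- no separate serving store is needed.
SELECT balance FROM account_balance WHERE account_id = 42;
```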
- Reducing external access overhead. The data sources of modern enterprises are diverse. When using a stream processing system, users often need to access data in external systems, for example joining a Kafka stream with a table in MySQL. In a stream processing system, every access to that MySQL table is a cross-system call, and for data-intensive computation the performance cost of such calls is enormous. A streaming database, by contrast, can store (or cache) the external system's data inside itself, eliminating the costly external calls and greatly improving overall performance (see the sketch after the figure below).
In stream processing systems, cross-system data access introduces huge performance penalties.
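For illustration, a RisingWave-style sketch of this caching pattern (connector parameters are abbreviated and hypothetical; exact CDC syntax varies): the MySQL table is mirrored into the streaming database via change data capture, so the stream-table join reads only local state.

```sql
-- Mirror the MySQL dimension table locally via CDC
-- instead of calling MySQL on every lookup (credentials omitted).
CREATE TABLE products (
    product_id BIGINT PRIMARY KEY,
    category VARCHAR,
    price DECIMAL
) WITH (
    connector = 'mysql-cdc',
    hostname = 'mysql-host',
    port = '3306',
    database.name = 'shop',
    table.name = 'products'
);

-- The join against a Kafka-backed order_events stream (assumed to be
-- defined elsewhere) now touches local storage only.
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT o.order_id, p.category, o.amount
FROM order_events AS o
JOIN products AS p ON o.product_id = p.product_id;
```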
- Providing result interpretability and consistency guarantees. A major pain point of stream processing systems is the lack of interpretability and consistency guarantees for computational results. In Flink, different jobs run continuously and independently, each outputting results downstream at its own pace. Because jobs progress at different rates, a downstream system may receive mutually inconsistent results. Consider a simple example: a user submits two jobs to Flink, one counting how many times Tesla stock has been bought in the past ten minutes, the other counting how many times it has been sold. A downstream system receives continuous output from both jobs, but one result may reflect the state at 8:10 and the other the state at 8:11, and the inconsistency cannot be traced. Seeing such results, users are confused: they cannot judge whether the results are correct, or how they were generated and evolved. In a streaming database, which owns both the input and output data, the system's computing behavior becomes observable, explainable, and strongly consistent. RisingWave is one system that achieves these guarantees: all intermediate computation results are stored in tables stamped with timestamps, and all final results live in materialized views that can be traced through those timestamps. Users can therefore understand a result by querying the historical intermediate results at a specific timestamp.
- Deeply optimized execution plans. Treating storage as a first-class citizen means that the computing layer of a streaming database is aware of the storage layer. This awareness opens new opportunities to optimize computation and storage jointly; a great example is sharing the intermediate state of stream processing across queries to reduce resource overhead. We will not dive into the technical details of this topic here; we refer readers to our previous blog.
Design Principles for Cloud Native Streaming Databases
The biggest difference between cloud deployment and on-premise deployment is that the cloud offers practically unlimited, decoupled resources, while an on-premise deployment has only limited, coupled resources. What exactly does that mean?
- First, users on the cloud no longer need to be aware of the number of physical machines, whereas cluster users in the big data era usually had only a limited number of physical machines.
- Second, cloud vendors expose disaggregated resources to users, who can purchase computing, storage, cache, and other resources separately, according to their needs. On-premise users, in comparison, can only request resources in units of whole machines.
The first difference leads to a fundamental discrepancy in design goals: big data systems aim to maximize performance within limited resources, while cloud-native systems aim to minimize cost given virtually unlimited resources. The second difference leads to a fundamental division in implementation: big data systems achieve embarrassing parallelism on a shared-nothing architecture that couples storage with computing, while cloud-native systems realize on-demand scaling on a shared-storage architecture that separates the two.
These design principles bring new challenges to cloud-native streaming databases. In a streaming database, the storage essentially holds the intermediate states of computation. When those states are separated from compute nodes and placed on persistent cloud storage services such as S3, the obvious challenge is that S3's access latency may greatly affect the latency of stream processing itself. A cloud-native streaming database therefore has to use the local storage of compute nodes and external storage (such as EBS) to cache data and minimize the performance degradation caused by S3 access.
Big data systems and cloud systems have different optimization goals.
Stream Processing and Batch Processing: Replace or Coexist?
Since stream processing can greatly reduce query latency, many people see it as a technology that will subvert batch processing. The opposite view holds that, because modern batch processing systems have been "upgraded" into real-time analytical systems that return results in milliseconds, the value of stream processing is limited. While I am extremely optimistic about the future of stream processing, I do not think it can fully replace batch processing. In my view, the two technologies will coexist and complement each other in the long run.
Stream Processing Systems and Real-Time Analytical Systems
At present, most batch processing systems, including Snowflake, Redshift, and ClickHouse, claim to be capable of real-time analytics on large-scale data. This naturally leads to the question raised in the first section: why do we need a stream processing system when there are already many real-time analytical systems? The answer lies in a fundamental difference in the definition of "real-time". In stream processing systems, a query is defined by the user in advance, and query processing is driven by the data. In real-time analytical systems, a query can be defined by the user at any time, and it is the user who drives query processing. The "real-time" of stream processing means that query results reflect the latest received data; the "real-time" of real-time analytics refers to the low latency of randomly issued queries. Put more plainly, stream processing systems emphasize the real-time nature of computational results, while real-time analytical systems emphasize the real-time nature of user interaction. For scenarios that require high freshness of results, such as a stock trading dashboard, advertising recommendation, or network anomaly detection, stream processing systems are the better choice. For scenarios with high requirements on interactive experience, such as interactive custom reports or ad-hoc analysis of user behavior logs, real-time analytical systems suit better.
The difference between a streaming database and a real-time analytical database: a streaming database processes data before storing it, the data drives the computation, and the focus is on delivering fresh results; a real-time analytical database stores data before processing it, the computation is driven by users, and the focus is on real-time user interaction.
Some people may say: since real-time analytical systems can answer user queries in real time, couldn't a user simply keep sending the same query periodically, thereby ensuring fresh results at all times and achieving the same effect as a stream processing system? There are two problems. The first is implementation complexity: a user is not a machine and cannot sit in front of a computer sending queries continuously. To issue a query periodically, you must either write a program or use an orchestration tool, such as Airflow, to send queries in a loop. Either way, users pay extra costs to operate and maintain external systems or programs. The second, more fundamental, problem is cost. Real-time analytical systems recompute over the whole dataset, while stream processing systems compute incrementally. On large data volumes, the redundant computation performed by real-time analytical systems wastes enormous resources. This is why we should never use an orchestration tool to regularly trigger batch jobs in a real-time analytical system (see the sketch after the figure below).
Stream processing systems use incremental computation to avoid redundant processing costs.
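To make the contrast concrete, consider this sketch (hypothetical `ad_events` table; the first statement is plain SQL, the second is RisingWave-style): the periodic batch approach re-scans every event on each run, while the materialized view processes each new event exactly once.

```sql
-- Periodic batch approach: an Airflow job reruns this every minute,
-- re-scanning the entire ad_events table each time.
SELECT ad_id, COUNT(*) AS clicks
FROM ad_events
WHERE event_type = 'click'
GROUP BY ad_id;

-- Streaming approach: defined once; each arriving event updates the
-- stored counts incrementally, with no repeated full scans.
CREATE MATERIALIZED VIEW ad_click_counts AS
SELECT ad_id, COUNT(*) AS clicks
FROM ad_events
WHERE event_type = 'click'
GROUP BY ad_id;
```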
Stream Processing and Real-Time Materialized Views
Nowadays, many real-time analytical systems have implemented real-time materialized views. Since streaming databases also use materialized views to define stream processing in their kernels, some argue that an analytical system with real-time materialized views removes the need for a separate stream processing system. I disagree, for the following reasons.
- Competition for resources. The core problem an analytical database solves is efficiently processing complex queries on large datasets. In such systems, materialized views play essentially the same role as database indexes: both are caches of computation. Real-time materialized views bring two benefits: they reduce redundant computation for frequently used queries, and pre-computing subqueries shared across different queries speeds up execution. But this positioning makes timely view updates almost impossible: actively updating a materialized view inevitably seizes computing and memory resources, hampering timely responses to user queries. To avoid putting the cart before the horse, almost all analytical databases adopt either a passive update strategy (updates must be actively driven by users) or a lazy update strategy (updates wait until the system is idle) for materialized views.
- Correctness guarantees. Resource competition can easily be solved by adding more compute nodes; guaranteeing correctness is much harder for real-time analytical systems. Batch processing systems process finite, bounded data, while stream processing systems process infinite, unbounded data, where events may arrive out of order due to unstable network communication or differing latencies from different sources. Stream processing systems introduce the concept of watermarks to solve this problem: results are considered semantically correct when out-of-order data is processed within a bounded disorder threshold, and data older than the watermark is discarded (a sketch of a watermark declaration follows this list). Real-time analytical systems lack this basic watermark machinery, which makes it impossible for them to deliver the correctness guarantees that stream processing systems do. Such guarantees are crucial in applications like risk control and advertising recommendation; a system without correctness guarantees over real-time data sources cannot replace stream processing systems.
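As a sketch of what this machinery looks like (RisingWave-style SQL with a hypothetical source; Flink SQL has a similar `WATERMARK FOR` clause): the watermark bounds how long the system waits for out-of-order events, and data arriving behind it is treated as late and excluded from windowed results.

```sql
-- Declare event time with a 5-minute disorder bound: the watermark
-- trails the maximum observed event_ts by 5 minutes, and events
-- arriving behind it are treated as late and dropped.
CREATE SOURCE sensor_readings (
    sensor_id INT,
    reading DOUBLE PRECISION,
    event_ts TIMESTAMP,
    WATERMARK FOR event_ts AS event_ts - INTERVAL '5 minutes'
) WITH (
    connector = 'kafka',
    topic = 'sensor_readings',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;
```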
Considerations Before Adopting Stream Processing
Stream processing is not a panacea, nor can it completely replace batch processing in all scenarios. Before adopting stream processing in your tech stack, you may want to evaluate the following aspects.
- Flexibility. Stream processing requires users to predefine queries in order to compute continuously over streaming data. This makes stream processing less flexible than batch processing, which can actively respond to user-triggered one-shot queries. As a result, stream processing suits scenarios where queries are known before execution, while batch processing suits scenarios that require frequent user interaction.
- Expressiveness. Stream processing uses incremental computation to reduce redundant overhead, but incremental computation also limits the system's expressiveness, because not all computations can be performed incrementally. A simple example is calculating the exact median of a data sequence: no algorithm can maintain the exact median incrementally without keeping the data seen so far around for rescanning. Keep this in mind when evaluating whether stream processing is suitable for highly complex computations.
- Computing costs. Stream processing can greatly reduce the cost of real-time computation, but that does not make it more cost-effective than batch processing in every scenario. In scenarios where result freshness is not critical, batch processing can be cheaper. Think of generating a company's quarterly financial report: stream processing would be overkill here, because incremental computation requires continuously maintaining internal state, which keeps consuming resources, whereas batch processing computes only when a request arrives and spends nothing on maintaining state in between.
Unifying Stream Processing and Batch Processing
Stream processing and batch processing each have their pros and cons, and in many real-world scenarios neither can completely replace the other. Many believe the two will coexist in the long run. Observing this, it is natural to consider unifying stream and batch processing in a single system, and many existing systems, including well-known ones such as Flink and Spark, have explored this direction. While the unified approach fits some scenarios, the reality is that, at least for now, most users still adopt separate solutions for stream processing and batch processing. The reason is straightforward: combining two specialized systems not only sidesteps the performance and stability concerns of a unified engine but also gives users the best of both worlds.
At the current stage, RisingWave focuses on stream processing, and users can leverage RisingWave connectors to integrate with mainstream batch processing systems; in the current version, users can easily export data to systems such as Snowflake, Redshift, Spark, and ClickHouse. I am optimistic about the future of the unified approach to batch and stream processing, and in the long run RisingWave will explore this direction as well. In fact, streaming databases already unify the two modes in one sense: when new data flows in, the streaming database performs stream processing; when a user issues an ad-hoc query, it performs batch processing.
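A minimal sketch of this dual behavior (RisingWave-style SQL; the `orders` table is hypothetical): the materialized view is maintained by stream processing as data arrives, while the ad-hoc SELECT is evaluated as a one-shot batch query over the stored data.

```sql
-- Stream processing: maintained incrementally as orders flow in.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT window_start AS day, SUM(amount) AS revenue
FROM TUMBLE(orders, order_ts, INTERVAL '1 day')
GROUP BY window_start;

-- Batch processing: an ad-hoc, user-triggered query evaluated on demand
-- over the data already stored in the same database.
SELECT AVG(amount) FROM orders WHERE order_ts >= '2023-01-01';
```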
While I am proud to have worked in the stream processing domain for over a decade, I have to be honest and admit that I once thought there was not much room for growth in this area. I could hardly have imagined how cloud computing would change everything and bring stream processing back to the center of the stage after 2020. Nowadays, more and more people are researching, developing, and adopting stream processing technology. I have shared my thoughts on stream processing in this blog, and I hope it helps. Everyone is welcome to join our community to discuss the future of stream processing.
RisingWave is a distributed SQL streaming database in the cloud. It is an open-source system released under the Apache 2.0 license.