When it comes to serverless computing, many people immediately associate it with popular and exciting terms such as “elastic scaling,” “scale-to-zero,” and “pay-as-you-go.” With the rise of the public cloud, we hear about new serverless computing products being introduced all the time, and many well-known cloud services, such as AWS Aurora, AWS Redshift, and Databricks, have launched serverless versions. Serverless computing has quickly become a focus of the industry, with both startups and large enterprises exploring the possibility of launching serverless products. Even in the highly anticipated field of large language models (LLMs), there has been plenty of discussion about serverless machine learning platforms.

Undoubtedly, the concept of serverless computing is very appealing. It precisely addresses users' pain points: hardly any company wants to spend excessive effort on operations, and no company wants to pay high fees for basic services. However, a good concept alone is not enough. Whether a product is widely adopted ultimately depends on whether it serves the interests and needs of both buyers and sellers.

In this article, I will delve into serverless technology from both a business and technical perspective, hoping to bring a more comprehensive understanding to everyone.


Software Services in the Market


Although the concept of cloud computing has become widely known, not all companies are willing to embrace cloud services. Due to various reasons, different companies have their own views on using cloud services, and many companies cannot even access cloud platforms, let alone use cloud services, due to regulatory and risk control requirements. Before discussing serverless computing in detail, let’s talk about the types of software services available in the market.

Let’s start with the on-premises deployment model. On-premises deployment refers to customers directly deploying software services on their own servers or data centers. In this deployment model, enterprises have 100% complete control over their data and software services, providing them with a high level of data security assurance. Of course, having control over data security also means that companies are responsible for their own security. The drawbacks of on-premises deployment are also obvious, namely the high costs of deployment and maintenance. Companies not only need to purchase a large number of servers, but also figure out how to install the desired software on their own servers. If there are failures or upgrades needed, companies may need frequent communication with suppliers and even on-site support from them.

Figure: In the fully managed mode, both the data plane and the control plane are in the public network; in the BYOC mode, the data plane is in the user's cloud network environment, while the control plane is in the public network.

The cloud emerged to solve the pain points of on-prem deployment. In the cloud, the default mode is the fully managed mode, also known as public SaaS. Fully managed means that customers use software services provided by vendors on the public cloud, so both the software a company uses and its data live in the cloud. This mode almost eliminates the operations and maintenance burden for the enterprise, but it also means the enterprise may face a series of security and privacy issues. After all, because the software vendor has permission to manage the software, in theory (although it is almost impossible in practice) the vendor can access the user's data. In the event of a data leak, the consequences can be severe.

To address enterprises' concerns about the security of the fully managed mode, many vendors are beginning to offer private SaaS, also known as the Bring Your Own Cloud (BYOC) mode. The name may sound confusing, but simply put, BYOC can be thought of as the cloud era's version of on-prem deployment. The BYOC mode requires the customer to already have an account with a cloud service provider (such as AWS or GCP), and the software vendor deploys the software inside the customer's cloud account. This way, ownership of the data remains entirely with the enterprise, and the vendor only has access to the control plane of the customer's system. This significantly limits the vendor's access to data, thereby reducing the risk of data leakage. Of course, the BYOC mode also places various operations and maintenance requirements on the vendor, which we will not discuss further here.

The last mode is the one we want to focus on today: serverless. Serverless can be seen as an extension of the fully managed mode. In this mode, all services are deployed by the vendor in the cloud, and customers simply use these services without worrying about backend operations and maintenance. In other words, users no longer need to think about how much compute or storage they have provisioned; they simply pay for what they use.

Take a database system as an example. With a fully managed cloud database service, users need to know how many machines their cloud database occupies, how many CPUs it uses, and how much memory it consumes. With a serverless cloud database, users no longer need to know any of this: they just use the database, and the vendor charges according to the amount of resources actually consumed.

The greatest advantage of the serverless mode is convenience, which may also save users a lot of money. At the same time, serverless lets users worry less about underlying details and can, in theory, achieve automatic elastic scaling based on workload. However, depending on the implementation, serverless may introduce performance and security risks, which users should evaluate carefully before adopting it. We will discuss this in detail in the next section.

Some vendors, even though their core services are not sold as serverless, allow users to run parts of their workloads in a serverless fashion. There are two classic examples. Databricks sells its services only in BYOC mode, yet it still lets users tap serverless resources to accelerate certain queries. AWS Redshift takes a similar approach: although Redshift is a fully managed service, users who want to accelerate a specific query can use the concurrency scaling feature to borrow resources from a shared pool for elastic computation.


Approach to Implementing Serverless


For users, serverless is a highly abstracted layer: as long as users are unaware of the underlying deployment and operational modes, it appears as serverless from their perspective. Serverless can be implemented in various ways, and these approaches can affect performance, security, cost, and other aspects. The two most popular implementation methods are the container-based mode and the multitenancy-based mode.

Figure: In a container-based serverless implementation, each user actually gets an independent cluster isolated by containers, and users do not need to know the cluster configuration details. In a multi-tenant serverless implementation, multiple users share a single large cluster, and resource isolation is handled by the system.

The simplest way to implement serverless is to use Kubernetes on the cloud to orchestrate containers and handle resource isolation at the container level. This implementation is essentially similar to the traditional fully managed approach, but from the user's perspective, they no longer need to be concerned with the underlying details. Although this serverless implementation does not require resource isolation at a lower level, it places high demands on product encapsulation and automated operations and maintenance. With users no longer managing resources themselves, the serverless provider must dynamically allocate resources based on user workloads. If true on-demand resource allocation can be achieved, it can deliver significant cost savings for users. However, dynamic scaling itself poses many challenges, and resource scheduling based on runtime conditions is even harder. If the implementation is not ideal, it can lead to wasted resources or even service downtime due to resource shortages.

The multi-tenant mode is another classic approach to implementing serverless. Some well-known OLTP database vendors provide serverless services based on this architecture. This architecture requires the database itself to handle resource isolation between multiple tenants directly at the database system level. In terms of service provisioning, it can be seen as the vendor opening a large instance where different users share resources within the same instance. This implementation method places high demands on resource control and isolation at the system kernel level, but it achieves efficient resource utilization. After all, compared to container-based implementations, the multi-tenant mode allows different users to share the same resources instead of individually managing resources for each user. The multi-tenant mode is theoretically the most efficient way to utilize resources.


How Customers Think


When considering the selection of software systems, everything starts with the needs and expectations of the users.

For users who can only accept on-prem deployment, the real work begins after signing a contract with the software vendor. Customers may need to assemble a dedicated technical team to deploy and manage the system and ensure it runs smoothly within their internal network. If you are familiar with early IBM and Oracle services, you know that customers had to dedicate professionals to software setup. For software vendors, each customer's environment is unique, so highly customized services must be provided. This undoubtedly increases the workload and costs for both sides.

Transitioning to cloud services simplifies the entire process. The cloud provides unparalleled convenience and elasticity, eliminating the hassle of purchasing and maintaining hardware for customers and allowing them to adjust service scales flexibly based on actual needs. In order to achieve this level of convenience and elasticity, cloud service providers have standardized everything and built trust with users through standardization.

Figure: Comparison of the four modes: on-prem deployment, BYOC mode, fully managed mode, and serverless mode.

If we list all four modes – on-prem deployment, BYOC mode, fully managed mode, and serverless mode – we can see that users consider convenience, elasticity, performance, security, and price when making choices.

  • In terms of convenience, on-prem deployment is highly inconvenient as it requires users to purchase and configure hardware in addition to software. BYOC mode offers moderate convenience as users need to maintain their own cloud accounts and perform lightweight operations. Fully managed mode provides high convenience as users only need to select the specifications of the service. Serverless mode offers extremely high convenience as users may not even need to select specifications and can fully use pay-as-you-go pricing.
  • In terms of elasticity, on-prem deployment lacks flexibility as users generally need to purchase specific machines to install the corresponding software. Both BYOC and fully managed modes can achieve elasticity in the cloud, but in BYOC mode, as the data plane is on the user’s side, there may be certain limitations on dynamic resource allocation and release based on technical and policy considerations. Fully managed mode and serverless mode have almost no limitations regarding elasticity and can achieve high elasticity.
  • In terms of performance, the multi-tenant implementation of serverless, which requires multiple users to share storage and compute resources, may cause performance interference, potentially placing it at a disadvantage compared to dedicated modes. In the other implementation methods, each user has their own independent computing and storage resources, ensuring high-performance guarantees.
  • In terms of security, on-prem deployment mode does not have many security risks. BYOC mode provides high security because data is stored on the user’s side. Fully managed mode and container-based serverless mode clearly have certain security risks. The multi-tenant serverless mode has relatively higher security risks because the data access is not physically isolated.
  • In terms of price, on-prem deployment costs are quite high, as users need to invest in hardware and software purchases as well as maintenance teams. From the vendor's perspective, BYOC and fully managed modes generally maintain consistent pricing models, resulting in minimal price differences. Serverless mode, due to its ability to dynamically adjust resource usage based on user loads and to share resources, can achieve significant cost reductions.

Considering these five aspects, we can conclude that small companies may be more inclined to sacrifice certain performance and security aspects to reduce costs and gain convenience and elasticity. On the other hand, large enterprises, considering their large user bases and data assets, would prioritize performance and security.


How Vendors Think


For software service providers, the ultimate goal is profitability. From a business perspective, regardless of the services provided, vendors aim to maximize profits. When we examine the table from the previous section, we can see that if profitability is desired from serverless services, the provided services must have broad appeal and market demand. This is not only because the average revenue per user from serverless services may be lower than other types of services, but also because providing serverless services entails higher costs for vendors than traditional fully managed or BYOC modes.

We can imagine that one of the core selling points of serverless is elasticity. To ensure users can obtain the necessary resources at any time and enjoy instant startup experiences, vendors generally need to maintain a certain number of ready resources, often referred to as “warm pools.” The cost of maintaining these ready resources must be covered by the vendors themselves. In contrast, if only fully managed services or BYOC modes are sold, the cost of basic resources is borne by users, and the revenue earned by vendors consists of software service fees above the basic costs.

Of course, if a serverless service has a large number of users, compared to traditional cloud services, software vendors will discover more opportunities for profitability. They can increase revenue in various ways, such as resource sharing, metadata node sharing, or overbooking. Overbooking is based on the simple fact that although users may reserve a large amount of resources, a portion of them may not fully utilize their quotas in practice. This provides an opportunity for vendors to sell more resources.

Figure: Differences in business models between serverless and traditional hosting modes.


The diagram describes the differences in business models between serverless and traditional hosting services. In the traditional hosting mode, the ratio of costs to revenue for vendors is relatively fixed and grows linearly with the number of customers. With serverless, when there are only a few customers, vendors generally operate at a loss, and profitability only becomes possible once the number of customers reaches a certain scale. However, unlike traditional hosting services, as the number of customers grows, the vendor's profit margin can increase significantly, resulting in super-linear growth. This may be why the serverless story is so highly sought after in venture capital circles.
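A back-of-the-envelope model makes the shape of these two curves clearer. The symbols below are purely illustrative and not drawn from any vendor's actual numbers: let $F$ be the fixed cost of keeping a warm pool ready, $r$ the average revenue per customer, and $c(n)$ the marginal cost of serving one of $n$ customers, which falls as compute, metadata nodes, and overbooked capacity are shared across more tenants. The vendor's profit is then roughly

$$
P(n) = n\,r - n\,c(n) - F.
$$

For small $n$, the warm-pool term $F$ dominates and $P(n) < 0$; once $n$ passes the break-even point of roughly $F/(r - c(n))$, profit turns positive, and because $c(n)$ keeps shrinking as sharing improves, $P(n)$ can grow faster than linearly, which is the super-linear growth described above.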

However, it is easy to imagine that not all software is suitable for serverless services. From the perspective of market acceptance, only widely accepted technologies have the potential to truly profit from serverless. Additionally, since serverless services serve a large number of users, the likelihood of encountering online failures increases. Therefore, generally speaking, serverless is best suited to commercializing systems that are already mature and well-established, such as MySQL and PostgreSQL, as this significantly reduces the chance of system failures.


Case Study: Providing Serverless Stream Processing


Since I have been involved in the development and commercialization of stream processing systems, I often think about how to provide cloud services like RisingWave as serverless solutions to users. Streaming databases, such as RisingWave, are relatively unique compared to traditional databases, making them suitable as a case study for discussing how to achieve serverless.

Let's start by considering the user's perspective. As mentioned earlier, users weigh convenience, elasticity, performance, security, and price across the different service modes. In terms of convenience, security, and price, streaming databases are not very different from other systems. From the perspectives of elasticity and performance, however, their differentiation becomes apparent. The core concept of streaming databases is the materialized view, which represents a continuous incremental computation over data streams. Data streams may change significantly over time, making elasticity crucial. Moreover, since stream processing involves continuous computation, and users may be highly sensitive to the freshness of results, the system needs to maintain high performance. From the table presented earlier in this article, we can see that the multi-tenant serverless mode, which requires multiple users to share storage and compute resources, may introduce performance interference, making it potentially unsuitable for stream processing. Therefore, the more reliable way to make a stream processing system serverless is through a container-based implementation. However, as mentioned earlier, container-based serverless implementations require efficient resource allocation based on user loads, which is not easy to achieve. As for the user base, it is safe to say that stream processing technology has not yet reached the popularity of PostgreSQL and MySQL, making it challenging to achieve super-linear growth.
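To make the materialized-view idea concrete, here is a minimal sketch in PostgreSQL-style SQL of the kind of statement a streaming database accepts. The stream and column names are hypothetical, and the exact syntax varies from product to product:

```sql
-- Assume a `rides` stream has already been ingested (e.g., from Kafka).
-- The database maintains this view incrementally as new events arrive,
-- so results stay fresh without re-running any batch job.
CREATE MATERIALIZED VIEW rides_per_minute AS
SELECT
    date_trunc('minute', started_at) AS minute,
    count(*)                         AS ride_count,
    avg(distance_km)                 AS avg_distance_km
FROM rides
GROUP BY 1;
```

Elasticity matters precisely because the event rate feeding such a view can swing by orders of magnitude within a day, while the freshness requirement stays constant.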

In conclusion, considering the needs of both customers and vendors, we can draw a conclusion: stream processing system vendors are currently facing difficulties in maximizing profits through serverless services.

|                     | Confluent (Apache Kafka) | Redpanda | StreamNative (Apache Pulsar) | Upstash (Apache Kafka) | RisingWave |
|---------------------|--------------------------|----------|------------------------------|------------------------|------------|
| System Purpose      | Message streaming platform | Message streaming platform | Message streaming platform | Message streaming platform | Stream processing |
| On-prem deployment  | Technical support | Licensing | Technical support / licensing | Not supported | Technical support |
| BYOC mode           | Not supported | Supported | Supported | Not supported | Supported |
| Fully managed mode  | Supported | Supported | Supported | Not supported | Supported |
| Serverless mode     | Not supported | Coming soon | Not supported (but supported at the kernel level) | Supported (container-based) | Coming soon |

Of course, this situation may change over time. In the field of event streaming and messaging systems, companies such as Redpanda and StreamNative (Apache Pulsar) are considering the possibility of providing serverless services. Redpanda's serverless offering will launch soon, and StreamNative's commercialized Apache Pulsar supports multi-tenancy at the kernel level, making it only a matter of time before a serverless mode is offered. Recently, we have also seen the emergence of Upstash, a new vendor that provides a serverless Kafka service. RisingWave, the stream processing company, will launch its serverless RisingWave service in the coming months, making the technology accessible to everyone.

CONCLUSION

In recent years, serverless services on the cloud have gradually emerged, attracting the attention of developers and enterprises. The numerous advantages it brings, such as elasticity, cost-effectiveness, and simplified operations and maintenance, have made more and more projects consider adopting this model. However, every technology has its suitable use cases, and serverless is no exception. It may not be suitable for all software systems and business needs. In the decision-making process, enterprises and developers need to clearly understand its limitations. In general, whether a vendor should develop serverless products and whether a user should purchase serverless products should be decided after an overall analysis of their own situation and the market.

Two weeks ago, Current 2023, the biggest data streaming event in the world, was held in San Jose. It stands out as one of my favorite events of 2023. Not only was the conference venue conveniently located just a 10-minute drive from my home, but it was also a unique gathering where data streaming experts from around the world openly engaged in discussions about technology.

Figure: Current 2023, organized by Confluent, is one of the biggest events in the real-time data streaming space.

If you missed the event, fret not. My friend Yaroslav Tkachenko from Goldsky has penned a comprehensive blog detailing the key insights from the event. Among the many insights he shared, one that particularly piqued my interest was his comments on streaming databases:

  • “They [Streaming databases] can cover 50% to 80% of use cases, but they’ll never arrive close to 100%. But they can be surprisingly sticky!”
  • “I approached some of the vendors and asked if they planned to introduce programmatic APIs down the road. Everyone said no - SQL should give them enough market share.”

As the founder of RisingWave, a leading streaming database (of which I am shamelessly proud), I found that Yaroslav's observations prompted a whirlwind of reflections. My reflections are not rooted in disagreement; on the contrary, I wholeheartedly concur with his viewpoints. His blog post has spurred me to contemplate SQL stream processing and streaming databases from various angles, and I'm eager to share these musings with everyone in the data streaming community and the wider realm of data engineering.


SQL’s Expressiveness


Streaming databases enable users to process streaming data in the same way as they use databases, with SQL naturally being the primary language. Like most modern databases, RisingWave and several other streaming databases prioritize SQL, and they also offer User Defined Functions (UDFs) in languages like Python and Java. However, these databases do not really provide lower-level programmatic APIs.

So, the question we grapple with is: is the expressiveness of SQL (even with UDF support) sufficient?

In my conversations with hundreds of data streaming practitioners, I've found many who argue that SQL alone doesn't suffice for stream processing. The top three use cases that immediately come to mind are: (1) (rule-based) fraud detection; (2) financial trading; (3) machine learning.

For fraud detection, many applications continue to rely on a rule-based approach. For these, Java often proves more straightforward for expressing rules and integrating directly with applications. Why? Primarily because numerous system backends are developed in Java. If the streaming data doesn't require persistence in a database, it becomes much easier to articulate the logic in a single, consistent programming language like Java.

In financial trading, there are concerns about the limitations of SQL, especially when it comes to handling specialized expressions that go beyond standard SQL. While it is possible to embed this logic in User-Defined Functions (UDFs), there are concerns about the increased latency that UDFs can introduce. Traditional UDF implementations, which often involve hosting a UDF server, are known for causing significant delays.

When it comes to machine learning use cases, practitioners have a strong preference for Python. They predominantly develop their applications using Python and rely heavily on popular libraries such as Pandas and Numpy. Using SQL to express their logic is not an instinctive choice for them.

While there are numerous real-world scenarios where SQL, with or without UDF support, adequately meets the requirements (you can refer to some of RisingWave's use cases in fintech and machine learning), it's important to acknowledge that SQL may not match the expressiveness of Java or Python. However, it's worth asking: if SQL can fulfill their stream processing needs, would people choose SQL-centric interfaces or continue relying on Java-centric frameworks? In my opinion, most would opt for SQL. The key point of my argument lies in the widespread adoption and fundamental nature of SQL. Every data engineer, analyst, and scientist is familiar with it. If basic tools can fulfill their needs, why complicate matters with a more complex solution?
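As a rough illustration of the kind of logic that does fit comfortably in SQL, here is a sketch of a simple rule-based check expressed as a continuously maintained view. The schema, the threshold, and the tumbling-window function are invented for the example, and the exact window syntax differs across streaming databases:

```sql
-- Flag cards that make more than 5 payments within a 10-minute window,
-- assuming a hypothetical `payments(card_id, amount, paid_at)` stream.
CREATE MATERIALIZED VIEW suspicious_cards AS
SELECT
    card_id,
    window_start,
    count(*)    AS payment_count,
    sum(amount) AS total_amount
FROM TUMBLE(payments, paid_at, INTERVAL '10 minutes')
GROUP BY card_id, window_start
HAVING count(*) > 5;
```

A rule engine embedded in a Java backend can express the same check, but for teams whose requirements stop at this level of complexity, the SQL version is easier to write, review, and hand to analysts.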


SQL’s Audience


Most data systems that emerged during the big data era, such as Hadoop, Spark, Hive, and Flink, were Java-centric. However, newer systems, like ClickHouse, RisingWave, and DuckDB, are fundamentally database systems that prioritize SQL. So who exactly uses the Java-centric systems, and who uses the SQL-centric ones?

I often find it challenging to convince established companies founded before 2015, such as LinkedIn, Uber, and Pinterest, to adopt SQL for stream processing. To be sure, many of them didn't like my pitch simply because they already have well-established data infrastructures and prefer focusing on application-level projects (well, I know these days many companies are looking into LLMs!). But a closer examination of their data infrastructure reveals some patterns:

  • These companies began with their own data centers;
  • They initiated with a Java-centric big data ecosystem, including technologies like Hadoop, Spark, and Hive;
  • They maintain extensive engineering teams dedicated to data infrastructure;
  • Even when transitioning to the cloud, they lean towards building custom data infrastructure on platforms like EC2, S3, or EKS, rather than opting for hosted data services.

There are several challenging factors that hinder the adoption of stream processing by these enterprises:

  • The enterprises have a codebase that is entirely in Java. Although SQL databases do provide Java UDFs, it is not always feasible to frequently invoke Java code within these UDFs.
  • Some of the big data systems they utilize are customized to their specific requirements, making migration exceptionally difficult.
  • Integrating with extensive big data ecosystems often requires significant engineering efforts.
  • There is also a human element to consider. Teams that have been maintaining Java-centric systems for a long time may perceive a shift to hosted services as a threat to their job security.

While selling SQL-centric systems to these corporations can be daunting, don’t be disheartened. We do have success stories. There are some promising indicators to watch for:

  • Companies transitioning to data warehouses like Snowflake, Redshift, or BigQuery signal a positive shift. The rising endorsement of the "modern data stack" prompts these companies to recenter their data infrastructure around these SQL-centric warehouses. As a result, they are less inclined to manage their own infrastructure or stick with Java-centric systems;
  • Another interesting signal to consider is the occurrence of business shifts and leadership changes. Whether driven by changes in business priorities or new leadership taking charge, these transitions often trigger a reevaluation of existing systems. New leaders may not be inclined to simply maintain the status quo (as there may be limited ROI in doing so) and may be more receptive to exploring alternatives, such as upgrading their data infrastructure.

Interestingly, the growing popularity of SQL stream processing owes much to the support and promotion from Java-centric big data technologies like Apache Spark Streaming and Apache Flink. While these platforms initially started with a Java interface, they have increasingly recognized the importance of SQL. The prevailing trend suggests that most newcomers to these platforms begin their journey with SQL, which predisposes them towards the SQL ecosystem rather than the Java ecosystem. Moreover, even if they initially adopted Java-centric systems, transitioning to SQL streaming databases in the future may be a more seamless pivot than anticipated.


Market Size of SQL Stream Processing


Before delving into the market size of SQL stream processing, it's essential to first consider the broader data streaming market. We must recognize that the data streaming market, as it stands, is somewhat niche compared to the batch data processing market. Debating this would be fruitless. A simple examination of today's market value for Confluent (the leading streaming company) compared to Snowflake (the dominant batch company) illustrates this point. Regardless of its current stature, the streaming market is undoubtedly booming. An increasing amount of venture capital is being invested, and major data infrastructure players, including Snowflake, Databricks, and Mongo, are beginning to develop their own modern streaming systems.

Figure: Today's market cap of Confluent, the leading streaming company, is $9.08B.
Figure: Today's market cap of Snowflake, the dominant batch company, is $53.33B.

It's plausible to suggest that the stream processing market will eventually mirror the batch processing market in its patterns and trends. So, within the batch processing market, what's the size of the SQL segment? The revenue figures for SQL-centric products, such as Snowflake, Redshift, and BigQuery, speak volumes. Then what about the market for products that primarily offer a Java interface? At least in the data infrastructure space, I don't see any strong cash cow at the moment. Someone may mention Databricks, the rapidly growing pre-IPO company commercializing Spark. While no one can deny that Spark is the most widely used big data system in the world, a closer look at Databricks' offerings and marketing strategies quickly leads to the conclusion that the SQL-centric data lakehouse is the thing they are betting on.

Figure: The streaming world vs. the batching world.

This observation raises a paradox: SQL's expressiveness might be limited compared to Java, yet SQL-centric data systems manage to generate more revenue. Why is that?

Firstly, as highlighted in Yaroslav’s blog, SQL caters to approximately 50-80% of use cases. While the exact figure remains elusive, it's evident that SQL suffices for a significant proportion of organizational needs. Hypothetically, if a company determines that SQL stream processing aligns with its use cases, which system would they likely opt for? A SQL-centric one or a Java-centric one? If you're unsure, consider this analogy: if you aim to cut a beef rib, would you opt for a specialized beef rib knife or a Swiss Army knife? The preference is clear.

Secondly, consider the audience for Java. Individuals with a computer science background might proficiently navigate Java and grasp system-specific Java APIs. However, expecting those without such a background to master Java is unrealistic. Even if they did, wouldn't they prefer managing the service independently? While it's not absolute, companies boasting a robust engineering team seem less inclined to outsource.


More Than Just Stream Processing


While we've extensively discussed SQL stream processing, it's time to pivot our attention to streaming databases. Classic stream processing engines like Spark Streaming and Flink have incorporated SQL. As previously mentioned, these engines have begun to use SQL as an entry-level language. Vendors are primarily building on SQL, with Confluent’s Flink offering standing as a notable example. Given that both Spark Streaming and Flink provide SQL interfaces, why the push for streaming databases?

The distinctions are significant. Big data SQL fundamentally diverges from database SQL. For instance, big data SQL often lacks standard SQL statements common in database systems, such as create, drop, alter, insert, update, and delete, among others. Digging deeper, one discerns that a pivotal difference lies in storage capabilities: streaming databases possess them, while stream processing engines typically do not. This discrepancy influences design, implementation, system efficiency, cost, performance, and various other dimensions. For those intrigued by these distinctions, I'd recommend my QCon San Francisco 2023 talk (slide deck here).
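To make the contrast tangible, statements like the following are routine in database SQL but are typically missing or second-class in the SQL layer of a pure computation engine (the table and columns here are made up for illustration):

```sql
-- Ordinary database housekeeping and row-level changes.
CREATE TABLE users (user_id BIGINT PRIMARY KEY, country VARCHAR);
INSERT INTO users VALUES (1, 'US'), (2, 'DE');
UPDATE users SET country = 'CA' WHERE user_id = 1;
DELETE FROM users WHERE user_id = 2;
ALTER TABLE users ADD COLUMN signup_date DATE;
DROP TABLE users;
```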

Furthermore, the fundamental idea behind a database is distinct from that of a computation engine. To illustrate, database users often employ BI tools or client libraries for result visualization, while those using computation engines typically depend on an external system for storage and querying.

Some perceive a streaming database as a fusion of a stream processing engine and a traditional database. Technically, you could construct a streaming database by merging, say, Flink (a stream processing engine) and Postgres (a database). However, such an endeavor would present myriad challenges, including maintenance, consistency, failure recovery, and other intricate technical issues. I'll delve into these in an upcoming blog.

CONCLUSION

Engaging in debates about SQL’s expressiveness becomes somewhat irrelevant. While SQL possesses enough expressiveness for numerous scenarios, languages like Java, Python, and others may outperform it in specific use cases. However, the decision to embrace SQL stream processing is not solely driven by its expressiveness. It is often influenced by a company’s current position and its journey with data infrastructure. Similar to trends observed in the batch domain, one can speculate about the future market size of SQL stream processing. Nonetheless, streaming databases provide capabilities that go beyond mere stream processing. It will be intriguing to witness how the landscape evolves in the years to come.

In today's fast-paced digital landscape, the ability to harness real-time data has become essential for organizations striving to make informed decisions and remain competitive. This comprehensive guide explores the world of streaming databases, shedding light on their distinct characteristics and advantages in handling streaming data. We will delve into the differences between streaming databases and other technologies like stream processing engines, OLAP databases, data warehouses, and messaging systems. By the end of this journey, you'll have a clear understanding of how streaming databases can revolutionize your data management strategies and shape the future of data-driven decision-making.


Streaming Database vs. Stream Processing Engine


Let's dive into a fundamental question: What sets a streaming database apart from a stream processing engine? While both deal with data in motion, they serve distinct purposes.

A stream processing engine primarily focuses on the real-time processing and analysis of data as it flows. Popular stream processing engines include Apache Flink, Apache Spark Streaming, Apache Samza, Apache Storm, and more. These engines handle tasks like data enrichment, filtering, and transformation. Stream processing engines excel at executing complex computations on streaming data, often offering low-level APIs in languages such as Java or Python. Some, like Apache Flink and Apache Spark Streaming, even provide a SQL wrapper on top of their low-level APIs.

On the other hand, a streaming database combines the capabilities of a traditional database with the ability to handle high-speed, real-time data ingestion. Popular streaming databases include RisingWave and KsqlDB. They store and manage data while enabling real-time querying and retrieval. In essence, a streaming database is a database equipped to handle continuous streams of data alongside traditional query operations.

With a stream processing engine, data transformation is possible, but these engines lack storage capabilities. Consequently, a separate database is required to store input or output data. In contrast, streaming databases provide native storage support, allowing you to persist data and serve ad-hoc queries seamlessly.


Is a streaming database just a fusion of a stream processing system and a database?


While it may seem that way, a streaming database is more specialized. It combines features from both systems and optimizes them for real-time ingestion, processing, and querying of streaming data. Notably, streaming databases do not typically support transaction processing. Instead, they leverage a key database concept known as materialized views to represent streaming jobs, ensuring these views are always up-to-date.

From a design perspective, incorporating storage capabilities into a streaming computation engine is a natural approach to process streaming data efficiently. This is because during stream processing, a system must maintain its internal states, representing the data or information retained over time. These internal states must be fault-tolerant and easily scalable across different machines. Storing these states in a persistent storage system is an elegant way to achieve fault tolerance and dynamic scaling.


What are the differences between a streaming database and a real-time OLAP database?


Streaming databases like RisingWave and KsqlDB prioritize result freshness, employing an incremental computation model to optimize latency. In RisingWave, users can pre-define queries, and the system updates query results incrementally as new data arrives. Users can choose to store input and output within RisingWave or deliver them directly to downstream systems. While RisingWave can handle ad-hoc queries, it's not optimized for supporting concurrent user-initiated analytical queries that require long-range scans.
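As a sketch of what that store-or-deliver choice looks like in SQL (the view name and connector options below are hypothetical and vary by product; some systems also require explicit format and encoding clauses, omitted here):

```sql
-- Ad-hoc query against a previously defined materialized view: the result
-- is already up to date, so no batch job needs to run.
SELECT * FROM orders_per_minute ORDER BY minute DESC LIMIT 10;

-- Alternatively, continuously deliver the same view to a downstream system.
-- (Connector options are illustrative only.)
CREATE SINK orders_per_minute_to_kafka
FROM orders_per_minute
WITH (connector = 'kafka', topic = 'orders_per_minute');
```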

On the other hand, OLAP databases, such as Apache Druid, Apache Pinot, and ClickHouse, excel at efficiently answering user-initiated analytical queries. They often implement columnar stores and a vectorized execution engine to accelerate complex query processing over large datasets. OLAP databases shine in use cases where interactive queries are crucial. However, they are not optimized for incremental computation, making it challenging to ensure result freshness. They also lack features like windowing functions and out-of-order processing, making them unsuitable for supporting stream processing applications.

|                     | Streaming Databases | OLAP Databases |
|---------------------|---------------------|----------------|
| Examples            | RisingWave, KsqlDB  | Apache Druid, Apache Pinot, ClickHouse |
| Optimized for       | Continuous analytics | Interactive analytics |
| Computation Model   | Incremental computation, event-driven | Full computation, user-triggered |
| Sample Applications | Continuously monitor "the top 30 longest trips in the last 2 hours" or "the top 10 hottest zip codes for passengers" | Quickly answer questions like "How many users used the Uber app yesterday?" or "What's the average mileage for Uber drivers daily?" |

Table: Several differences distinguish a streaming database from a real-time analytical database.

Streaming databases and OLAP databases both support real-time analytics, but stream processing systems emphasize the real-time nature of computational results, while real-time analytical systems focus on the real-time nature of user interaction. By design, OLAP databases may not support many stream-processing applications. Here are some examples:

  • Streaming ETL: Users often need to continuously join multiple data streams from different sources (e.g., messaging queues, OLTP databases, file systems) and deliver results to downstream systems like data warehouses and data lakes.
  • Continuous monitoring: Users may want to monitor query results continuously.
  • Out-of-order processing: Due to network or system issues, data can arrive out of order in many scenarios. However, users may require results to be computed in a predetermined order.
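For the out-of-order case in particular, streaming databases typically let users declare how late events may arrive when defining a source, so that time windows can be finalized deterministically. The sketch below is illustrative only; the column names and connector options are hypothetical, the watermark clause differs across systems, and format and encoding options are omitted:

```sql
-- Events may arrive up to 5 seconds late; the watermark tells the system
-- when it is safe to close a time window despite out-of-order arrival.
CREATE SOURCE clicks (
    user_id    BIGINT,
    url        VARCHAR,
    event_time TIMESTAMP,
    WATERMARK FOR event_time AS event_time - INTERVAL '5 seconds'
) WITH (connector = 'kafka', topic = 'clicks');
```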


What are the differences between a streaming database and a data warehouse?


A data warehouse serves as a repository for storing and managing historical data, primarily used for business intelligence and decision support. It enables in-depth analysis of historical trends and often involves batch processing for data loading and updates.

On the other hand, a streaming database focuses on real-time data, providing applications with access to the most current information. While both solutions handle data storage and retrieval, streaming databases are tailored for low-latency access to streaming data, while data warehouses prioritize historical data analysis.

In practice, streaming databases can complement data warehouses. Many users employ data warehouses like Redshift and Snowflake to store historical data and support complex analytical queries. Concurrently, they use streaming databases like RisingWave to maintain materialized views for real-time data processing.


Streaming Database vs. Streaming Platforms


Streaming platforms, such as Apache Kafka, Apache Pulsar, or Redpanda, are designed for message passing and event-driven communication between different components of a system. They play a pivotal role in building decoupled and distributed applications.

A streaming database complements streaming platforms by ingesting data from these platforms, processing the streaming data, and storing it in a structured format. This enables SQL-like querying capabilities for real-time analytics. Typically, streaming databases work alongside streaming platforms to provide end-to-end solutions.


Key Benefits of Using Streaming Databases


Now that we've explored what streaming databases are and how they differ from other technologies, let's delve into their key advantages:

  1. Real-Time Insights: Streaming databases grant instant access to up-to-the-second data, empowering businesses to make informed decisions in real-time.
  2. Low Latency: They offer low-latency data ingestion and query responses, making them ideal for applications requiring immediate responses.
  3. Scalability: Streaming databases can horizontally scale to accommodate growing data volumes, ensuring they can handle increasing workloads.
  4. Simplified Architecture: By consolidating data storage and processing in a single platform, streaming databases simplify the architecture of real-time applications.


Streaming Database Examples


Several streaming databases have gained popularity in recent years, each with unique features and strengths:

  1. RisingWave: RisingWave is an open-source distributed SQL database designed for stream processing. It processes streaming data using PostgreSQL-style SQL as the interface, allowing users to ingest, manage, query, and store continuously generated data streams. RisingWave reduces the complexity and cost of building real-time applications by consuming streaming data, performing incremental computations as new data arrives, and updating results dynamically.
  2. KsqlDB: KsqlDB is an open-source streaming SQL engine for Apache Kafka, simplifying the development of real-time stream processing applications. It enables users to express stream processing logic through SQL-like queries, making it accessible to data analysts and developers proficient in SQL. KsqlDB can process and analyze data in real-time from Kafka topics, handling tasks like data transformation, filtering, and aggregation.
  3. Materialize: Materialize is a streaming database that leverages SQL for processing. It can automatically refresh materialized views in a consistent manner, allowing concurrent querying of data in these views. Materialize builds upon Timely Dataflow, developed by Microsoft Research to support incremental and iterative processing. It employs a hot-standby model for fault tolerance.
  4. Timeplus: Timeplus is a data analytics platform designed with a focus on streaming-first analytics, enabling organizations to process both streaming and historical data quickly and intuitively. It empowers data and platform engineers to unlock the value of streaming data using SQL.
  5. DeltaStream: DeltaStream is a stream processing platform designed for developing and deploying streaming applications. It is built on Apache Flink, offering a unified SQL interface for querying and processing streaming data using standard SQL syntax.

Conclusion

In summary, streaming databases represent a potent solution for organizations seeking to leverage real-time data. They stand apart from technologies like stream processing engines, OLAP databases, data warehouses, and messaging systems, offering distinct capabilities and advantages. Despite the challenges they pose, the benefits they provide, such as real-time insights and low-latency data access, make them an enticing choice for modern applications. As the data landscape continues to evolve, streaming databases will undoubtedly play an increasingly pivotal role in shaping the future of data-driven decision-making.

In the fast-evolving field of real-time analytics, many stream processing engines have emerged over the past decade. Notable examples include Apache Storm, Apache Flink, and Apache Samza. These engines have become widely accepted in various enterprises, providing substantial support for real-time processing and analytical applications.

In the last few years, a new and intriguing innovation has surfaced: streaming databases. Starting with early solutions like PipelineDB, conceived as a PostgreSQL plugin, followed by Confluent's KsqlDB designed for Kafka, and the more recent open-source RisingWave—a distributed SQL streaming database—these systems have steadily climbed the ladder of acceptance and popularity.

Figure: Stream processing vs. streaming database.

The question arises: Can stream processing engines and streaming databases be used interchangeably? This article sheds light on their respective design principles and explores these fascinating technologies' distinctions, similarities, use cases, and potential future trajectories.


Design Principles


Stream processing engines and streaming databases both serve critical functions in data stream processing, yet they differ notably in their design principles, particularly in their user interaction interfaces and data storage options. Before diving into their unique characteristics, we need a clear understanding of the historical backdrop that gave rise to these technologies.


The Evolution of Stream Processing Engines

Database systems have been under exploration for over six decades, while computing engines—encompassing both batch and stream processing—are relatively recent innovations. The journey into modern stream processing began in 2004 with the release of Google's MapReduce paper.

MapReduce aimed to optimize resource utilization across networks of commodity machines striving for peak performance. This lofty goal led to the introduction of an elegant low-level programming interface with two principal functions: Map and Reduce. Such a design granted seasoned programmers direct access to these core functions, allowing them to implement specific business logic, control program parallelism, and manage other intricate details on their own.

What set MapReduce apart was its exclusive focus on processing, relegating data storage to external systems like remote distributed file systems. This division resulted in a unique architecture that continues to influence today's stream processing engines and sets the stage for understanding the vital differences and synergies with streaming databases.

To successfully utilize MapReduce inside a company, three critical prerequisites were essential:

  1. The company must possess a substantial volume of data and relevant business scenarios internally;
  2. The company must have access to a sufficient number of commodity machines;
  3. The company must employ a cadre of skilled software engineers.


These three conditions were a high bar for most companies when MapReduce was introduced in 2004. It wasn't until after 2010, with the explosion of social networks and mobile internet, that the first condition began to be satisfied. This shift led major companies to turn their attention to stream processing technology, investing heavily to meet the second and third prerequisites. This investment marked the golden age of development for stream processing engines, which began around 2010. During this period, a slew of exceptional stream processing engines emerged, including Apache Storm, Apache Samza, Apache Flink, among others.

These emergent stream processing engines fully embraced the core principles of the MapReduce design pattern, specifically:

  1. exposing low-level programming interfaces to users;
  2. relinquishing control over data storage.

This design was aptly tailored to the big data era. Usually, technology companies requiring substantial data processing had their own data centers and specialized engineering teams. What these companies needed were improved performance and more adaptable programming paradigms. Coupled with their specialized engineering teams, they could typically deploy distributed file systems to handle vast data storage.

However, the landscape began to shift with the remarkable advancements in cloud computing technology post-2015. More companies, even those without extensive technical backgrounds, desired access to stream processing technology. In response to this demand, stream processing engines embraced SQL, a concise and universally recognized programming language. This adaptation opened the doors for a broader audience to reap the benefits of stream processing technology. As a result, today's leading stream processing engines offer a tiered approach to user interaction, providing low-level programming interfaces such as Java and Scala and more accessible high-level SQL programming interfaces.


User Interaction Interfaces


Modern stream processing engines offer both low-level programming interfaces, such as Java and Scala, and higher-level interfaces, like SQL and Python. These interfaces expose various system runtime details, such as parallelism, allowing users to control more nuanced aspects of their applications. Streaming databases, on the other hand, primarily feature SQL interfaces that hide runtime complexities, and some also provide User-Defined Functions (UDFs) in languages like Python and Java to enhance expressiveness. This divergence in user interaction interfaces leads to two principal balancing acts.

1. Balance Between Flexibility and Ease of Use

Proficient programmers benefit from the considerable expressive capabilities offered by low-level interfaces such as Java and Scala. As Turing-complete languages, Java, Scala, and similar languages theoretically empower users to articulate any logic. For companies with pre-existing Java and Scala libraries, this is particularly advantageous.

However, due to its complexity, this approach may deter those unfamiliar with Java or Scala. Mastering a system's custom API can be time-consuming and demanding, even for technical users.

Streamlining ease of use through higher-level SQL interfaces in streaming databases addresses some usability challenges, but two main obstacles remain. First is the completeness of SQL support, which includes more than simple DML (Data Manipulation Language) statements. Users often require complex DDL (Data Definition Language) operations like creating or deleting roles, users, indexes, and more. Second is the support for the broader SQL ecosystem, requiring additional effort, such as compatibility with management tools like DBeaver or pgAdmin.

Low-level interfaces give technically-focused users more flexibility, while SQL interfaces of streaming databases are tailored for business-focused users, emphasizing ease of use.

2. Balance Between Flexibility and Performance

The availability of low-level programming interfaces allows users to optimize system performance using prior knowledge. Those with deep programming expertise and business logic comprehension often achieve superb performance through low-level interfaces. The flip side is that this may require companies to invest more in building their engineering teams.

By contrast, stream processing engines with tiered interfaces might face performance drawbacks when the high-level programming interfaces are used. Encapsulation leads to performance overhead, and the more encapsulation layers there are, the worse the performance. Intermediate layers in stream processing engines can leave the underlying engine oblivious to the higher-level logic, leading to performance loss. In comparison, streaming databases, which offer only SQL interfaces and keep optimization inside the system, can reach higher performance levels.

In conclusion, stream processing engines with low-level interfaces allow technically-focused users to exploit their expertise for performance benefits. Conversely, streaming databases that provide SQL interfaces are tailored to serve business-focused users, often achieving higher performance bounds without requiring intricate technical expertise.


Data Storage

Apart from user interaction interfaces, the most significant difference between stream processing engines and streaming databases is whether the system stores data. In stream processing, the presence or absence of data storage directly affects various aspects of the system, such as performance, fault recovery, scalability, and use cases.

In a stream processing engine, data input and output typically occur in external systems, such as remote distributed file systems like HDFS. In contrast, a streaming database not only possesses computational capabilities but also includes data storage capabilities. This means it can do at least two things: 1) store inputs, and 2) store outputs.

On the data input side, storing data yields significant performance benefits. Consider a scenario where a naive stream processing engine joins a data stream from a message queue (like Kafka) with a table stored in a remote database (like MySQL). If the stream processing engine lacks data storage, a significant problem arises: whenever a new data item enters the stream, the engine must fetch data from the remote database before performing the calculation. This approach was adopted by the earliest distributed stream processing engines. Its advantage is that it simplifies the system architecture, but the drawback is a sharp decrease in performance. Accessing data across different systems clearly incurs high latency, and introducing high-latency operations into a latency-sensitive stream processing engine degrades performance.

Storing (or caching) data directly within the stream processing system helps to avoid cross-system data access, thereby improving performance. In a streaming database, which is a stream processing system with built-in storage capabilities, a user can directly replicate the table into the streaming database if they need to join a data stream from a message queue with a table from a remote database. This way, all access becomes internal operations, leading to efficient processing. Of course, this is a simplified example. In real-world scenarios, challenges such as large or dynamically changing tables in the remote database may arise, which we won't delve into here.
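Here is a sketch of the scenario just described, once the table lives inside the streaming database. The stream, table, and column names are hypothetical, and the mechanism that keeps `users` replicated from the remote database (for example, CDC) is product-specific and omitted:

```sql
-- `orders` arrives continuously from a message queue; `users` is a local copy
-- of the remote database table, so the join never leaves the system.
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT
    o.order_id,
    o.amount,
    u.country
FROM orders AS o
JOIN users  AS u ON o.user_id = u.user_id;
```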

On the output side, storing output results can offer substantial benefits. To summarize, there are roughly four key advantages:

  1. Simplify the data stack architecture: Instead of using separate stream processing engines for computation and separate storage systems for data storage and query responses, a single system like a streaming database can handle computation, storage, and query responses. This approach significantly simplifies the data stack and often results in cost savings.
  2. Facilitate computation resource sharing: In stream processing engines, computation results are exported to external systems after consuming input data and performing calculations. This makes it challenging to reuse these results within the system directly. A streaming database with built-in storage can save computation results as materialized views internally, enabling other computations to access and reuse those resources directly.
  3. Ensure data consistency: While stream processing engines can ensure exactly-once semantics for processing, they cannot guarantee result access consistency. Since stream processing engines don't have storage, calculation results must be imported into downstream storage systems. The upstream engine must output version information of the results to the downstream system to achieve consistent results visible to users in downstream systems. This puts the burden of ensuring consistency on the user. In contrast, a streaming database with storage can manage computation progress and result version information within the system, assuring users always see consistent results through multi-version control.
  4. Make programs easier to verify: During program development, repeated modification and verification of correctness are common. A common way to verify correctness is to obtain the inputs and outputs and manually check whether the calculation logic meets expectations. This is relatively straightforward in batch processing engines but much harder in stream processing engines: inputs and outputs change dynamically, and because the engine lacks storage, verification requires traversing three systems (the upstream message source, the downstream result store, and the stream processing engine itself) while also tracking computation progress. In a streaming database, by contrast, the calculation results are stored inside the database, so users only need to obtain inputs from the upstream system (typically by fetching offsets from message queues like Kafka) and validate results within a single system. This significantly boosts development efficiency.
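
As a concrete illustration of point 2, here is a minimal sketch in RisingWave-style SQL (all names are hypothetical): a second materialized view is defined on top of the first, so it incrementally consumes the first view's maintained results instead of recomputing the aggregation from the raw stream.

CREATE MATERIALIZED VIEW page_stats AS
SELECT page_id, COUNT(*) AS visits
FROM page_views
GROUP BY page_id;

-- Built on top of page_stats: the aggregation above is computed and maintained only once.
CREATE MATERIALIZED VIEW hot_pages AS
SELECT page_id, visits
FROM page_stats
WHERE visits > 1000;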

Of course, there's no such thing as a free lunch, and software development never offers a silver bullet. Compared to stream processing engines, streaming databases with storage bring many advantages in architecture, resource utilization, consistency, and user experience. However, what do we sacrifice in return?

One significant sacrifice is the complexity of software design. Recall the design of MapReduce. The concept stood out when giants like Oracle and IBM dominated the entire database market because it used a simple model to enable large-scale parallel computing for technology companies running fleets of ordinary machines.

MapReduce directly subverted the storage and computation integration concept in databases, separating computation as an independent product. It enabled users to make computation scalable through programming. In other words, MapReduce achieved large-scale horizontal scalability through a simplified architecture, relying on the availability of highly skilled professional users.

However, when a system requires data storage, everything becomes more complex. To make such a system usable, developers must consider high availability, fault recovery, dynamic scaling, data consistency, and other challenges, each requiring intricate design and implementation. Fortunately, over the past decade, the theory and practice of large-scale databases and stream processing have significantly progressed. Hence, now is indeed a favorable time for the emergence of streaming databases.


Use Cases

While sharing similarities in their computational models, stream processing engines and streaming databases still have nuances that lead to distinct applications and functionalities. These technologies overlap significantly in their applications but diverge in their end-user focus.

Stream processing engines are generally more aligned with machine-centric operations, lacking data storage capabilities and often requiring integration with downstream systems for computation result consumption. In contrast, streaming databases cater more to human interaction, offering storage and random query support, thus enabling direct human engagement.

The dichotomy between machines and humans is evident in their preferences—machines demand high-performance processing of hardcoded programs, while humans favor a more interactive and user-friendly experience. This inherent divergence leads to a differentiation in functionality, user experience, and other aspects between stream processing engines and streaming databases.

In sum, the commonalities in application scenarios between these technologies are nuanced by the specific focus on end users and interaction modes, resulting in distinct features within each system.


A Step Back in History or the Current Trend of the Era?

Databases and computing engines often reflect two divergent philosophies in design, evidenced by their distinct scholarly contributions and industry evolution.

In academia, computing engine papers typically find their home in systems conferences like OSDI, SOSP, and EuroSys, while database-centric work appears at SIGMOD and VLDB. This division was famously highlighted by the 2008 critique "MapReduce: A major step backwards" by David DeWitt and Michael Stonebraker, who argued that MapReduce was a step backwards and offered little innovation relative to decades of database research.

In the realm of stream processing, the question arises: Which philosophy represents a historical regression, and which embodies the current era's trend? I assess that both stream processing engines and streaming databases will coexist and continue evolving for at least the next 3-5 years.

Stream processing engines, emphasizing extreme performance and flexibility, cater to technically proficient users. On the other hand, streaming databases balance elegance in user experience with high performance and fine internal implementation. The mutual integration trend between these philosophies enhances both sides, with computing engines adopting SQL interfaces to boost user experience and databases leveraging UDF capabilities to increase programming adaptability.


Predicting the Future of Stream Processing


Looking back at history, we find that the concept of streaming databases was already proposed and implemented more than 20 years ago. For example, Aurora, one of the earliest academic stream processing systems, was already a streaming database. However, streaming databases are not as popular as stream processing systems. History moves in a spiral pattern.

The big data era witnessed the trend of separating stream processing from databases, overthrowing the monopoly of the three giants: Oracle, IBM, and Microsoft. More recently, in the cloud era starting around 2012, batch processing has seen systems such as Redshift, Snowflake, and ClickHouse bring the "computing engine" back into the "database."

Now it is time to bring stream processing back into databases, which is the core idea behind modern streaming databases such as RisingWave and Materialize. But why now? We analyze it in detail in this section.

With over two decades of development, stream processing remains in its infancy concerning commercial adoption, particularly compared to its batch processing counterpart. However, the industry has reached a consensus on the direction of stream processing. Both stream processing engines and streaming databases concur on the necessity of supporting both stream and batch processing. The primary differentiation lies in harmoniously integrating these two distinct computing models. Essentially, the notion of "unified stream and batch processing" has become a shared understanding within the field.

There are generally three methodologies for accomplishing this unification:

  1. The Single Engine Approach

This strategy involves employing the same computing engine to manage both stream processing and batch processing functionalities. Its advantage is that it offers a relatively straightforward implementation plan and a more cohesive user experience. However, since stream processing and batch processing entail significant differences in implementation aspects like optimization and computation methods, distinct handling of each is often necessary to achieve peak performance.

  2. The Dual Engine Approach

Under this approach, stream processing and batch processing are configured separately as two unique system engines. While it allows for specific optimization of each, it also necessitates considerable engineering resources, high levels of collaboration, and stringent priority management within the engineering team. The need to align and make both parts work seamlessly can be a considerable challenge.

  3. The Sewn-Together System Approach

This third approach involves repurposing existing systems instead of constructing a new, ideal system to provide a unified user experience. Although it seems to be the least engineering-intensive method, delivering a seamless user experience can be highly challenging. Existing market systems vary in support interfaces, and even when they employ SQL, different dialects might be used. Shielding users from these inconsistencies becomes a core issue. Additionally, orchestrating multiple systems to function, interact, and remain cognizant of each other introduces complex engineering challenges.

Conclusion

Though stream processing engines and streaming databases may exhibit some overlap and differences in design and practical applications, both strive to align with corresponding batch processing systems. Selecting which approach to adopt ultimately depends on the user’s assessment, considering their particular scenarios and needs. This decision-making process emphasizes the importance of understanding each system’s unique characteristics and requirements and how they fit into the broader computing landscape.

I stumbled upon an intriguing big data project called Stratosphere a decade ago. What immediately captured my interest was a particular section in its introduction: the ability to initiate a cluster on a single machine and execute MapReduce-based WordCount computations with just three lines of code.

During a time dominated by Hadoop, installing and running a WordCount program would typically require several hours or even days of effort. Therefore, encountering a project that achieved the same functionality in merely three lines of code left an indelible impression on me. Motivated by this concise yet powerful approach, I delved extensively into the project and eventually became a contributor.

In the present day, the project previously known as Stratosphere has transformed and is now recognized as Apache Flink, reigning as the most popular stream processing engine in the realm of big data. However, in contrast to the early days of Stratosphere, Flink has grown into a colossal project with considerable intricacy.

Nevertheless, as someone who contributed to the initial design and development of Flink's stream processing engine, I still yearn for a user experience that embraces simplicity. My aspiration is for users to swiftly embark on their stream processing endeavors and encounter the streamlined efficiency it offers at remarkable speeds.

To uphold this belief, my team and I have crafted RisingWave, a cloud-based streaming database that gives users high-performance distributed stream processing with a PostgreSQL-like experience. In this article, I showcase how you can start your journey into stream processing using RisingWave with a mere four lines of code.


What is Stream Processing?


Stream processing and batch processing are the two fundamental modes of data processing. In the last two decades, stream processing systems and batch processing systems have iterated quickly, evolving from single-machine setups to distributed systems, and from the big data era to cloud computing. Substantial architectural enhancements have been implemented in both batch and stream processing systems.

Figure: The evolution of batch processing and stream processing systems over the last 20 years

The two primary distinctions between stream processing and batch processing are as follows:

  • Stream processing systems operate on event-driven computations, whereas batch processing systems rely on user-initiated computations.
  • Stream processing systems adopt an incremental computation model, while batch processing systems employ a full-computation model (see the sketch below).
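
The difference is easy to see in SQL terms. In a batch system, the query below is re-executed from scratch every time a user runs it; in a streaming database, the same logic is declared once as a materialized view and then maintained incrementally as events arrive (table and view names here are hypothetical).

-- Batch: a user-initiated query, recomputed over the full table on every run.
SELECT COUNT(*) AS order_count FROM orders;

-- Streaming: declared once, then updated incrementally on every new order event.
CREATE MATERIALIZED VIEW order_count AS
SELECT COUNT(*) AS order_count FROM orders;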


Regardless of whether it's stream processing or batch processing, both approaches are progressively shifting toward real-time capabilities. Batch systems are widely used in interactive analysis scenarios, while stream processing systems are extensively applied in monitoring, alerting, automation, and various other scenarios.

Figure: Comparison of real-time OLAP systems and real-time stream processing systems


RisingWave: Stream Processing with PostgreSQL Experience


RisingWave is an open-source distributed SQL streaming database licensed under the Apache 2.0 license. It utilizes a PostgreSQL-compatible interface, allowing users to perform distributed stream processing like operating a PostgreSQL database.

RisingWave is primarily designed for two typical use cases: streaming ETL and streaming analytics.

Streaming ETL refers to the real-time ingestion of various data sources (such as OLTP databases, message queues, and file systems) into destination systems (such as OLAP databases, data warehouses, data lakes, or simply back to OLTP databases, message queues, file systems) after undergoing processing operations like joins, aggregations, groupings, windowing, and more. In this scenario, RisingWave can fully replace Apache Flink.

Figure: Streaming ETL use cases


Streaming analytics, on the other hand, refers to the capability of ingesting data from multiple data sources (such as OLTP databases, message queues, and file systems) and performing complex analytics (with operations like joins, aggregations, groupings, windowing, and more) before displaying results in the BI dashboards.

Users may also directly access data inside RisingWave using client libraries in different programming languages. In this scenario, RisingWave can replace a combination of Apache Flink and SQL/NoSQL databases (such as MySQL, PostgreSQL, Cassandra, and Redis).
Figure: Streaming analytics use cases


Deploying RisingWave with 4 Lines of Code


To install and run RisingWave on a Mac, run the following three commands in a terminal window (refer to our documentation if you are a Linux user):

$ brew tap risingwavelabs/risingwave
$ brew install risingwave
$ risingwave playground


Next, open a new command line window and execute the following command to establish a connection with RisingWave:

$ psql -h localhost -p 4566 -d dev -U root


For ease of understanding, let's first try to create a table and use INSERT to add some test data. In real-world scenarios, we typically need to fetch data from the message queues or OLTP databases, which will be introduced later.

Let's create a table for web browsing records:

CREATE TABLE website_visits (
  timestamp TIMESTAMP,
  user_id VARCHAR,
  page_id VARCHAR,
  action VARCHAR
);


Next, we create a materialized view to count the number of visits, visitors, and last access time for each page. It is worth mentioning that materialized views based on streaming data are a core feature of RisingWave.

CREATE MATERIALIZED VIEW page_visits_mv AS
SELECT page_id,
       COUNT(*) AS total_visits,
       COUNT(DISTINCT user_id) AS unique_visitors,
       MAX(timestamp) AS last_visit_time
FROM website_visits
GROUP BY page_id;


We use INSERT to add some data:

INSERT INTO website_visits (timestamp, user_id, page_id, action) VALUES
  ('2023-06-13T10:00:00Z', 'user1', 'page1', 'view'),
  ('2023-06-13T10:01:00Z', 'user2', 'page2', 'view'),
  ('2023-06-13T10:02:00Z', 'user3', 'page3', 'view'),
  ('2023-06-13T10:03:00Z', 'user4', 'page1', 'view'),
  ('2023-06-13T10:04:00Z', 'user5', 'page2', 'view');


Take a look at the current results:

SELECT * FROM page_visits_mv;

-----Results
 page_id | total_visits | unique_visitors |   last_visit_time
---------+--------------+-----------------+---------------------
 page2   |            2 |               2 | 2023-06-13 10:04:00
 page3   |            1 |               1 | 2023-06-13 10:02:00
 page1   |            2 |               2 | 2023-06-13 10:03:00
(3 rows)


Let's insert five more rows of data:

INSERT INTO website_visits (timestamp, user_id, page_id, action) VALUES
  ('2023-06-13T10:05:00Z', 'user1', 'page1', 'click'),
  ('2023-06-13T10:06:00Z', 'user2', 'page2', 'scroll'),
  ('2023-06-13T10:07:00Z', 'user3', 'page1', 'view'),
  ('2023-06-13T10:08:00Z', 'user4', 'page2', 'view'),
  ('2023-06-13T10:09:00Z', 'user5', 'page3', 'view');


We insert the data in two batches to simulate a continuous influx of data. Let's take another look at the current results:

SELECT * FROM page_visits_mv;

-----Results
 page_id | total_visits | unique_visitors |   last_visit_time
---------+--------------+-----------------+---------------------
 page1   |            4 |               3 | 2023-06-13 10:07:00
 page2   |            4 |               3 | 2023-06-13 10:08:00
 page3   |            2 |               2 | 2023-06-13 10:09:00
(3 rows)


We can see that the results have been updated. This result automatically stays up-to-date when we are processing real-time streaming data.


Interaction with Kafka


Given that message queues are commonly used in stream data processing, let's look at how to retrieve and process data from Kafka in real-time.

If you haven't installed Kafka yet, first download the appropriate compressed package from the official website (using 3.4.0 as an example here), and then unzip it:

$ tar -xzf kafka_2.13-3.4.0.tgz
$ cd kafka_2.13-3.4.0


Now let's start Kafka.

  1. Generate a cluster UUID:
$ KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"


  2. Format log directory:

$ bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties


  3. Start Kafka server:

$ bin/kafka-server-start.sh config/kraft/server.properties


After starting the Kafka server, we can create a topic:

$ bin/kafka-topics.sh --create --topic test --bootstrap-server localhost:9092


Once Kafka is successfully launched, we can directly input messages from the command line.

First, run the following command to start the producer:

$ bin/kafka-console-producer.sh --topic test --bootstrap-server localhost:9092


Once the '>' symbol appears, we can enter the message. To facilitate data consumption in RisingWave, we input data in JSON format:

{"timestamp": "2023-06-13T10:05:00Z", "user_id": "user1", "page_id": "page1", "action": "click"}
{"timestamp": "2023-06-13T10:06:00Z", "user_id": "user2", "page_id": "page2", "action": "scroll"}
{"timestamp": "2023-06-13T10:07:00Z", "user_id": "user3", "page_id": "page1", "action": "view"}
{"timestamp": "2023-06-13T10:08:00Z", "user_id": "user4", "page_id": "page2", "action": "view"}
{"timestamp": "2023-06-13T10:09:00Z", "user_id": "user5", "page_id": "page3", "action": "view"}


We can start a consumer to view the messages we have inputted:

$ bin/kafka-console-consumer.sh --topic test --from-beginning --bootstrap-server localhost:9092


Now let's take a look at how RisingWave retrieves data from this message queue. In this scenario, RisingWave plays the role of a message consumer. Let's switch back to the psql window and create a data source to establish a connection with the previously created topic. It's important to note that we are only establishing the connection at this stage and have yet to start consuming data.

When creating a data source, we can directly define a schema to map the relevant fields from the JSON data in the stream. To avoid conflicts with the table created earlier, we name the data source website_visits_stream.

CREATE source IF NOT EXISTS website_visits_stream (
	timestamp TIMESTAMP,
	user_id VARCHAR,
	page_id VARCHAR,
	action VARCHAR
	)
WITH (
	connector='kafka',
	topic='test',
	properties.bootstrap.server='localhost:9092',
	scan.startup.mode='earliest'
	)
ROW FORMAT JSON;


For RisingWave to start ingesting data and performing computations, we must create a materialized view. For ease of understanding, we create one similar to the earlier example.

CREATE MATERIALIZED VIEW visits_stream_mv AS
	SELECT page_id,
	COUNT(*) AS total_visits,
	COUNT(DISTINCT user_id) AS unique_visitors,
	MAX(timestamp) AS last_visit_time
	FROM website_visits_stream
	GROUP BY page_id;


Now we can take a look at the results:

SELECT * FROM visits_stream_mv;

-----Results
 page_id | total_visits | unique_visitors |   last_visit_time
---------+--------------+-----------------+---------------------
 page1   |            1 |               1 | 2023-06-13 10:07:00
 page2   |            3 |               2 | 2023-06-13 10:08:00
 page3   |            1 |               1 | 2023-06-13 10:09:00
(3 rows)

At this point, we have successfully retrieved data from Kafka and performed processing operations on it.


Advanced: Build a Real-time Monitoring System with RisingWave


Real-time monitoring plays a crucial role in streaming applications. By processing data in real-time, you can visualize and monitor the results as they happen. RisingWave can act as a data source, seamlessly connecting to visualization tools like Superset, Grafana, and more, allowing you to display processed metric data in real-time.

We encourage you to take on the challenge of building your own stream processing and visualization system. For specific steps, you can refer to our use case document. In this document, we showcase how RisingWave can be used to monitor and process system performance metrics, subsequently presenting them in real time through Grafana.

Although our demonstration is relatively simple, we firmly believe that with real data, and in your familiar business scenarios, you can achieve much richer and more impactful display effects.

Figure: Use RisingWave for monitoring and displaying real-time results in Grafana

Conclusion

RisingWave distinguishes itself with its remarkable simplicity, enabling users to engage in distributed stream processing effortlessly using SQL. In terms of performance, RisingWave outperforms stream processing platforms of the big data era, such as Apache Flink.

The extent of this performance improvement is quite noteworthy. Brace yourself for the details: stateless computing demonstrates an enhancement of around 10%-30%, while stateful computing showcases an astonishing 10X+ boost!

Keep an eye out for the upcoming performance report to delve deeper into these findings. An efficient stream processing platform should always prioritize simplicity, and RisingWave delivers precisely that and more.


This article was originally published on Medium.

A streaming database is a type of database that is designed specifically to process large amounts of real-time streaming data. Unlike traditional databases, which store data in batches before processing, a streaming database processes data as soon as it is generated, allowing for real-time insights and analysis. Unlike traditional stream processing engines that do not persist data, a streaming database can store data and respond to user data access requests. Streaming databases are ideal for latency-critical applications such as real-time analytics, fraud detection, network monitoring, and the Internet of Things (IoT) and can simplify the technology stack.


Brief History


The concept of a streaming database was first introduced in academia in 2002. A group of researchers from Brown, Brandeis, and MIT pointed out the demand for managing data streams inside databases and built the first streaming database, Aurora. A few years later, the technology was adopted by large enterprises. The top three database vendors, Oracle, IBM, and Microsoft, successively launched their stream processing solutions: Oracle CQL, IBM System S, and Microsoft SQLServer StreamInsight. Instead of developing streaming databases from scratch, these vendors integrated stream processing functionality directly into their existing databases.

Since the late 2000s, developers inspired by MapReduce have separated stream processing functionality from database systems and developed large-scale stream processing engines, including Apache Storm, Apache Samza, Apache Flink, and Apache Spark Streaming. These systems were designed to continuously process ingested data streams and deliver results to downstream systems. However, compared to streaming databases, stream processing engines do not store data and, therefore, cannot serve user-initiated ad-hoc queries.

Streaming databases kept evolving in parallel with stream processing engines. Two streaming databases, PipelineDB and KsqlDB, were developed in the 2010s and were popular at the time. In the early 2020s, a few cloud-based streaming databases, like RisingWave, Materialize, and DeltaStream, emerged. These products aim to provide streaming database services in the cloud, and their focus is on designing architectures that fully utilize cloud resources to achieve unlimited horizontal scalability and supreme cost efficiency.


Typical Use Cases

Figure: Real-time applications need streaming databases.

Streaming databases are well-suited for real-time applications that demand up-to-date results with a freshness requirement ranging from sub-seconds to minutes. Applications like IoT and network monitoring require sub-second latency, and latency requirements for applications like ad recommendations, stock dashboarding, and food delivery can range from hundreds of milliseconds to several minutes. Streaming databases continuously deliver results at low latency and can be a good fit for these applications.

Some applications are not freshness sensitive and can tolerate delays of tens of minutes, hours, or even days. Some representative applications include hotel reservations and inventory tracking. In these cases, users may consider using either streaming databases or traditional batch-based databases. They should decide based on other factors, such as cost efficiency, flexibility, and tech stack complexity.

Streaming databases are commonly used alongside other data systems in real-time applications to facilitate two classic types of use cases: streaming ingestion (ETL) and streaming analytics.

Figure: Streaming databases are commonly integrated with other modern data systems to facilitate two types of use cases: streaming ingestion (ETL) and streaming analytics.

Streaming Ingestion (ETL)


Streaming ingestion provides a continuous flow of data from one set of systems to another. Developers can use a streaming database to clean streaming data, join multiple streams, and move the joined results into downstream systems in real time. In real-world scenarios, data ingested into the streaming databases typically come from OLTP databases, messaging queues, or storage systems. After processing, the results are most likely to be dumped back into these systems or inserted into data warehouses or data lakes.
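
As a rough sketch of what this looks like in a streaming database such as RisingWave (topic names, columns, and connector properties below are hypothetical, and the exact sink options depend on the downstream system and version), the pipeline is: declare a source, define the cleaning and transformation as a materialized view, and create a sink that continuously delivers the maintained results downstream.

CREATE SOURCE user_events (
  user_id VARCHAR,
  action VARCHAR,
  ts TIMESTAMP
) WITH (
  connector='kafka',
  topic='user-events',
  properties.bootstrap.server='localhost:9092'
) ROW FORMAT JSON;

-- Clean the stream continuously: drop malformed events, normalize the action field.
CREATE MATERIALIZED VIEW cleaned_events AS
SELECT user_id, LOWER(action) AS action, ts
FROM user_events
WHERE user_id IS NOT NULL;

-- Continuously deliver the cleaned results to a downstream topic;
-- additional format/type options may be required depending on the connector.
CREATE SINK cleaned_events_sink FROM cleaned_events
WITH (
  connector='kafka',
  topic='cleaned-events',
  properties.bootstrap.server='localhost:9092'
);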


Streaming Analytics


Streaming analytics focuses on performing complex computations and delivering fresh results on-the-fly. Data typically comes from OLTP databases, message queues, and storage systems in the streaming analytics scenario. Results are usually ingested into a serving system to support user-triggered requests. A streaming database can also serve queries on its own. Users can connect a streaming database directly with a BI tool to visualize results.

With the growing demand for real-time machine learning, streaming databases have also become a crucial tool for enabling agile feature engineering. By utilizing streaming databases to store transformed data as features, developers can respond promptly to changing data patterns and new events. Streaming databases allow for real-time ingestion, processing, and transformation of data into meaningful features that can enhance the accuracy and efficiency of machine learning models while also reducing data duplication and improving data quality. This empowers organizations to make faster and more informed decisions, optimize their machine-learning workflows, and gain a competitive advantage.
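
For instance, a feature used by a real-time model might be maintained as a windowed materialized view over the user_events source sketched above; the TUMBLE-style windowing and all names here are illustrative assumptions rather than a prescribed schema.

-- Per-user click counts over one-hour windows, kept fresh as events arrive
-- and readable by the feature-serving layer like any other table.
CREATE MATERIALIZED VIEW user_click_features AS
SELECT user_id,
       window_start,
       COUNT(*) AS clicks_1h
FROM TUMBLE(user_events, ts, INTERVAL '1 HOUR')
GROUP BY user_id, window_start;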


Streaming Databases vs. Traditional Databases

Figure: Difference between a streaming database and a traditional database.

Traditional databases are designed to store large amounts of batch data and provide fast, consistent access to that data through transactions and queries. They are often optimized for processing complex operations, such as aggregations and joins, that manipulate the data in bulk. Traditional databases’ execution models are often referred to as Human-Active, DBMS-Passive (HADP) models. That is, a traditional database passively stores data, and queries actively initiated by humans trigger computations. Examples of traditional databases include OLTP databases like MySQL and PostgreSQL and OLAP databases like DuckDB and ClickHouse.

Streaming databases, on the other hand, are designed to incrementally process a large volume of continuously ingested data on-the-fly, and provide low-latency access to the data and results for further processing and analysis. They are optimized for processing data as soon as it arrives rather than bulk processing after data is persisted. Streaming databases’ execution models are often called DBMS-active, Human-Passive (DAHP) models. A streaming database actively triggers computation as data comes in, and humans passively receive results from the database. Examples of streaming databases include PipelineDB, KsqlDB, and RisingWave.


Streaming Databases vs. OLTP Databases


An OLTP database is ACID-compliant and can process concurrent transactions. In contrast, a streaming database does not guarantee ACID compliance and, therefore, cannot be used to support transactional workloads. In terms of data correctness, streaming databases enforce consistency and completeness. A well-designed streaming database should guarantee the following two properties:

  • Exactly-once semantics, meaning that every single data event will be processed once and only once, even if a system failure occurs.
  • Out-of-order processing, meaning that users can instruct a streaming database to process data events in a predefined order, even if they arrive out of order.


Streaming Databases vs. OLAP Databases


An OLAP database is optimized for efficiently answering user-initiated analytical queries. OLAP databases typically implement columnar stores and a vectorized execution engine to accelerate complex query processing over large amounts of data. OLAP databases are best suited for use cases where interactive queries are essential. Different from OLAP databases, streaming databases focus more on the resulting freshness, and they use an incremental computation model to optimize latency. Streaming databases typically do not adopt column stores but may implement vectorized execution for query processing.

Conclusion

In conclusion, a streaming database is an essential system for organizations that require real-time insights from large amounts of data. By providing real-time processing, scalability, and reliability, a streaming database can help organizations make better decisions, identify opportunities, and respond to threats in real time.

This article was originally published on DZone.

RisingWave is a distributed SQL streaming database designed to reduce the complexity and cost of building real-time applications. The database is open-source and can be accessed publicly on GitHub under the Apache 2.0 License. RisingWave Cloud provides all the functionality of RisingWave in a managed cloud deployment. It delivers easy stream processing in the cloud while eliminating the challenges of deploying and maintaining your environment.

In the field of stream processing, Apache Flink is one of the most popular open-source frameworks. Flink provides a set of low-level APIs that allow users to write complex stream processing programs in high-level programming languages such as Java, Scala, and Python. Flink later introduced a SQL interface to lower the barrier to application development.

Recently, several companies, including AWS and Confluent, have been investing in managed Flink SQL services to eliminate the pain of deploying and maintaining open-source Flink. Given that both RisingWave and Flink SQL provide SQL-based interfaces for stream processing applications such as fraud detection, real-time dashboarding, and alerting, it's natural to wonder about the distinctions between these technologies.

This article aims to provide a comprehensive explanation from the users' point of view, specifically focusing on three crucial factors: ease of use, cost-efficiency, and correctness.


Ease-of-use


The ease of use of RisingWave surpasses that of Flink SQL, thanks to its PostgreSQL compatibility and database-oriented architecture, as opposed to Flink SQL being a stream processing engine.

  • RisingWave is compatible with PostgreSQL, so users can work with RisingWave the same way they would work with PostgreSQL. This compatibility not only means that RisingWave's SQL syntax is already familiar, but also that users can rely on standard database DDL (such as creating and dropping roles, users, schemas, tables, and views) and DML (such as UPDATE, INSERT, and DELETE) to manage data (a short sketch follows this list). Flink, by contrast, uses big-data-style SQL with many custom syntax rules, requiring users to consult the Flink documentation to learn it. Furthermore, Flink is not a SQL database and does not support database DDL and DML, so users cannot manipulate internal data flexibly.
  • RisingWave connects to popular data sources like Apache Kafka and Apache Pulsar, and, because it is wire-compatible with PostgreSQL, it also works with any system or tool that can connect to PostgreSQL, such as Grafana, DBeaver, and others, taking full advantage of the PostgreSQL ecosystem. Users can even use a PostgreSQL editor plugin in Visual Studio Code to compose SQL, significantly improving development efficiency. Although Flink integrates with various big data systems, it cannot integrate with database management tools, making it less convenient than PostgreSQL-compatible systems like RisingWave.
  • RisingWave is a database where all input and output data are presented as tables or materialized views. Users can easily query both inputs and outputs in RisingWave, enabling them to access, edit, verify, and debug their streaming queries at any time. In contrast, Flink does not offer a straightforward means of accessing internal data, making it challenging for users to inspect their queries and verify correctness.
  • As a database, RisingWave can store results and let users query them ad hoc without importing them into downstream systems. Flink is a stream processing engine that cannot store data: it can only export results to downstream systems, where users must then query them.
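
To illustrate the first point, everything below is ordinary PostgreSQL-style DDL and DML issued to RisingWave through psql; the object names are hypothetical.

CREATE SCHEMA analytics;

CREATE TABLE analytics.orders (
  order_id INT,
  amount DOUBLE PRECISION
);

INSERT INTO analytics.orders VALUES (1, 25.0);
UPDATE analytics.orders SET amount = 30.0 WHERE order_id = 1;
DELETE FROM analytics.orders WHERE order_id = 1;
DROP TABLE analytics.orders;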


Cost-efficiency

  • The efficient architecture and implementation of RisingWave make it a cost-efficient solution for stream processing, with a total cost 2-15x lower than Flink SQL on an industry-standard benchmark (performance report coming in Q2 2023). Additionally, its data storage capabilities offer users added flexibility and cost savings in building their streaming applications.


Architecture

  • RisingWave adopts a decoupled compute-storage architecture like many modern cloud databases. The compute and cloud storage layers can be configured and scaled independently, improving cost efficiency and performance. This architecture also allows each component to be optimized separately, reducing resource waste and avoiding task overload. Flink adopts the conventional coupled compute-storage architecture, which achieves high parallelism and scalability but results in high execution costs.


Implementation

  • RisingWave was architected from the ground up with a clear focus on database technologies. It actively harnesses modern database techniques to boost the efficiency of its optimizer, executors, and storage. Flink assembles its SQL layer on top of its MapReduce-style computation framework, missing many optimization opportunities specific to databases.
  • RisingWave is implemented in Rust, a programming language known for its safety and performance. Rust offers developers low-level control over system resources; its lack of runtime overhead, compile-time optimization, and efficient memory management give RisingWave a significant performance advantage. Flink's Java-based implementation makes it prone to runtime overhead and garbage collection pauses, which can hinder the performance of real-time stream processing.


Usage

  • RisingWave's internal storage system stores tables directly within the database, streamlining access and enhancing performance for real-time applications. This reduces the frequency of external data access, leading to faster query execution and retrieval. Flink SQL lacks an internal storage system, which may incur frequent access to external storage and introduce additional latency and cost for query processing.
  • RisingWave supports layered materialized views, which optimize query performance by reusing computing resources and minimizing redundant computation. These precomputed query results are stored for quick retrieval without re-executing the query. Conversely, Flink SQL treats each query independently, with no resource sharing between queries. While this allows for concurrent query execution, it may recompute intermediate results for different queries, leading to potential inefficiencies.
  • Integrated storage in RisingWave lets users access and query data stored within the system directly, eliminating time-consuming data transfer steps, streamlining data processing workflows, and enabling real-time analytics within RisingWave. In contrast, Flink's lack of internal storage means users must deliver processing results to downstream systems and query the data there, introducing additional complexity and potential delays.


Correctness


RisingWave goes beyond Apache Flink in terms of ensuring the correctness of results delivered to downstream systems. While both platforms guarantee consistency and completeness in stream processing, RisingWave offers additional benefits, such as consistent snapshots for all data access, which ensures that users always see accurate and unambiguous results.

The definition of correctness in stream processing includes consistency and completeness, which refer to:

  • Consistency. Every single data event will be processed once and only once, even if any system failure occurs. This is also known as exactly-once semantics.
  • Completeness. Even if a data stream arrives out of order, the results will ultimately be in order.


Both Apache Flink and RisingWave achieve these goals by employing a consistent checkpoint algorithm and using watermarks to detect out-of-order data. However, RisingWave is a stream processing platform and a streaming database, which means it can guarantee consistent snapshots for all data access. This additional feature ensures that users always see accurate and unambiguous results without any confusion.

Conclusion

In summary, RisingWave and Apache Flink SQL present notable solutions for SQL-based stream processing, with RisingWave standing out for its ability to simplify development and reduce costs compared to Apache Flink SQL. As users evaluate their options, the ease of use, cost-efficiency, and correctness of RisingWave make it a compelling choice for those seeking innovation and efficiency in their real-time data processing endeavors.

If you are looking for a stream processing system for your new streaming analytics or streaming ETL applications, consider RisingWave. As a distributed SQL streaming database, RisingWave is well-optimized for stream processing.


Is it meant for your stream processing application?


The underlying premise of this question is whether RisingWave is well-suited for your workload and business requirements. In this article, we will guide you through the different aspects of RisingWave. We cover user interface, performance, correctness, scalability and elasticity, fault tolerance and availability, durability, serving capability, integrations, and deployment.


RisingWave is a streaming database


RisingWave is an open-source distributed SQL database designed for stream processing. Also known as a streaming database, RisingWave processes streaming data in a database way. It uses SQL as the interface to ingest, manage, query, and store continuously generated data streams.

RisingWave is designed to reduce the complexity and cost of building real-time applications. It consumes streaming data, performs incremental computations when new data come in, and updates results dynamically.

As a streaming database system, RisingWave maintains results in its own storage so that users can access data efficiently. You can also sink data from RisingWave to an external stream for storage or additional processing.

The open-source version of RisingWave is released on GitHub under Apache License 2.0, a real open-source license. It is ready for production use and has been deployed in dozens of companies. The managed version, named RisingWave Cloud, is currently in beta release. Early access is available.


The decoupled compute-storage architecture provides better leverage


RisingWave is a distributed system. It leverages the decoupled compute-storage architecture to achieve infinite scalability at a low cost.

There are currently four types of nodes in the cluster:

  • Frontend: Frontend is a stateless proxy that accepts user queries through Postgres protocol. It is responsible for parsing, validating, optimizing, and providing each query's results.
  • ComputeNode: ComputeNode is responsible for executing the optimized query plan.
  • Compactor: Compactor is a stateless worker node responsible for executing the compaction tasks for the storage engine.
  • MetaServer: MetaServer is the central metadata management service. It also acts as a failure detector that periodically sends heartbeats to frontends and ComputeNodes in the cluster.

Figure: The overall architecture of RisingWave

The topmost component is the Postgres client, which issues queries through the TCP-based Postgres wire protocol. The leftmost component is the streaming data source.

Kafka is the most representative streaming source. Alternatively, Redpanda, Apache Pulsar, AWS Kinesis, and Google Pub/Sub are also widely used. Streams from Kafka are consumed and processed through the pipeline in the database. The bottom-most component is AWS S3 or MinIO (an open-source S3-compatible system).


User interface


RisingWave is a SQL streaming database. Users can compose Postgres-style SQL to consume, process, and manage streaming data. Unlike big-data-style stream processing systems (such as Flink and Spark Streaming), RisingWave does not provide native Java, Scala, or Python APIs. However, since RisingWave is wire-compatible with PostgreSQL, users can directly use third-party PostgreSQL drivers to interact with RisingWave from other programming languages such as Java, Python, and Node.js.

To extend RisingWave's expressiveness in supporting complex queries, we are developing our UDF framework, and Python will be the first supported language. RisingWave will host the UDFs on a dedicated process and call them remotely. We are also exploring deploying UDF with serverless computing services such as AWS Lambda, which offers excellent scalability and low cost.


Performance


RisingWave is highly optimized for high-throughput stream processing. Our performance evaluation using the Nexmark benchmark shows RisingWave can achieve 10-50% higher throughput than Apache Flink for most queries. A detailed performance report will be released in Q2 2023.

RisingWave is not an in-memory system. It uses tiered storage architecture to manage its computation states and persistent data: the newly ingested and frequently accessed data are maintained in memory, and the infrequently accessed data are stored in S3 or S3-equivalent remote storage.

RisingWave is written in Rust, a low-level programming language designed for high-performance programming. Compared to JVM-based systems, RisingWave does not suffer any performance penalty caused by runtime management or garbage collection.


Correctness


RisingWave is a stream processing system. It guarantees “consistency and completeness.” Specifically:

  1. RisingWave ensures exactly-once semantics, meaning that every single data event will be processed once and only once, even if a system failure occurs.
  2. RisingWave supports out-of-order processing. Users can instruct RisingWave to process data events in a predefined order, even if they arrive out of order (see the sketch below).
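
As a rough illustration of point 2, users can declare a watermark on the event-time column when defining a source. The syntax below is a sketch with hypothetical names and values; check the documentation for the exact form supported by your version.

CREATE SOURCE sensor_readings (
  device_id VARCHAR,
  reading DOUBLE PRECISION,
  event_time TIMESTAMP,
  -- Events arriving more than 5 seconds behind the watermark are treated as late.
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  connector='kafka',
  topic='sensor-readings',
  properties.bootstrap.server='localhost:9092'
) ROW FORMAT JSON;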

RisingWave is also a database system. But it is not an OLTP database (as of version 0.1.16) and cannot be used to process database transactions. RisingWave guarantees that all reads will see a consistent snapshot of the database.


Scalability and elasticity


RisingWave was born as a distributed system. All data is stored in an underlying object store, while the compute layer can be swiftly scaled in and out whenever workload fluctuation occurs. The scaling process can be completed almost instantaneously without interrupting data processing.

Internally, RisingWave dynamically partitions data into each compute node using a consistent hashing algorithm. These compute nodes collaborate by computing their unique portion of the data and then exchanging output with each other. Thanks to partitioning, RisingWave can achieve cache locality on each node and get an overall performance comparable to shared-nothing systems.


Fault tolerance and availability


Like many other stream processing systems, RisingWave implements the Chandy-Lamport algorithm to take a consistent snapshot. The consistent snapshot helps RisingWave efficiently recover from failure.

The recovery process is faster than in similar systems thanks to the specialized shared-storage design. The system can continue from the last checkpoint almost immediately after recovery. The only thing affected is the cache, which is rebuilt gradually afterward.

Currently, the default checkpoint frequency is 10 seconds. Hence, a recovery brings the system back to where it was at most 10 seconds earlier and continues from there; this window is the RPO (Recovery Point Objective). In addition, it may take about 10 seconds to detect the failure through heartbeat timeout. Together, these delays make up the Recovery Time Objective (RTO).


Durability


RisingWave persists its stored data. Users can insert data into RisingWave by issuing DML statements like INSERT or directly using connectors.

We recommend using connectors from message brokers or direct CDC from upstream systems for production use. Incoming data is stored in RisingWave’s storage along with the checkpoint. To use a connector, users only need to add some extra properties when creating the table. Connector offsets are persisted in checkpoints to achieve data durability and exactly-once delivery.
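
For example, a table backed by a Kafka connector might be declared roughly as follows (topic, broker, and columns are hypothetical); rows ingested through the connector are persisted in RisingWave's storage together with the connector offset at every checkpoint.

CREATE TABLE orders (
  order_id VARCHAR,
  amount DOUBLE PRECISION,
  ts TIMESTAMP
) WITH (
  connector='kafka',
  topic='orders',
  properties.bootstrap.server='localhost:9092',
  scan.startup.mode='earliest'
) ROW FORMAT JSON;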

In the case of a system failure, RisingWave will continue to consume data from the checkpointed offset, ensuring that no data is missing or lost. However, the users are responsible for ensuring that data stored in message brokers is persisted.

DML statements, including INSERT/UPDATE/DELETE, are useful for manually modifying data. However, such writes may be lost if the system fails before they are persisted. The FLUSH command explicitly forces RisingWave to persist them. In future releases, we plan to provide a write-ahead log (WAL) to ensure zero data loss.
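
For instance, after a manual insert one can issue FLUSH to make sure the write has been checkpointed before relying on it. The table and values below are hypothetical; assume user_actions is a plain table created without a connector.

INSERT INTO user_actions VALUES ('user1', 'view', '2023-06-13 10:10:00');
FLUSH;  -- returns only after the preceding writes have been committed to storage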


Serving capability


RisingWave uses materialized views to store stream processing results. It can serve concurrent queries from users at ultra-low latency. Our experimental evaluation using a standard benchmark is a good example. It showed that RisingWave can answer concurrent queries with single-digit millisecond latency. We will release a public experiment report later this year.
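
As a simple example, reusing the page_visits_mv view from the earlier walkthrough, one can add an index and serve point lookups against the materialized view directly, just as with a regular PostgreSQL table. This is a sketch; the index and column names are illustrative.

CREATE INDEX idx_page_visits ON page_visits_mv (page_id);

-- Point lookup served directly from RisingWave at low latency.
SELECT total_visits, unique_visitors, last_visit_time
FROM page_visits_mv
WHERE page_id = 'page1';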


Integrations


Check our comprehensive list of integrations. Here we highlight some of them.


Source

  • Message brokers: Apache Kafka, Redpanda, Apache Pulsar, AWS Kinesis, Google Pub/Sub
  • Database (direct CDC connection): MySQL, PostgreSQL
  • File: S3


Sink

  • Message brokers: Apache Kafka, Redpanda
  • Databases: MySQL, PostgreSQL, Redis, Cassandra, DynamoDB
  • Data lake: Apache Hudi, Apache Iceberg


3rd-Party Tools

  • Realtime dashboards: Grafana
  • BI tools: Superset, Metabase, DBeaver


Deployment


The easiest way to deploy RisingWave is through our official cloud hosting platform, RisingWave Cloud. We now support AWS in most regions, with support for GCP coming soon. Alternatively, we provide an open-source Kubernetes operator to run RisingWave on your local Kubernetes cluster. For storage, any S3-compatible storage service is accepted, including MinIO or Ceph.

We suggest not deploying RisingWave using only binaries, as it may incur additional efforts to achieve high availability.


Others: OLTP and OLAP


RisingWave does not support transaction semantics. So if you are looking for a system to support transactional workloads, there may be better choices than RisingWave. Consider cloud databases like AWS Aurora or Google Spanner. RisingWave does not implement a columnar store and is not optimized for interactive analytical workloads. But RisingWave can easily be integrated with popular OLTP and OLAP databases.

Conclusion

RisingWave is a modern distributed SQL database for stream processing. It greatly saves the efforts of building streaming analytics or streaming ETL applications.

RisingWave can achieve higher cost efficiency than other stream processing systems due to its decoupled compute-storage architecture. RisingWave has already been deployed in many pioneering companies. Learn more about RisingWave.

The last decade witnessed the rapid growth of stream processing technology. My first exposure to stream processing was in 2012. At that time, I was fortunate to be an intern at Microsoft Research Asia, working on a large-scale distributed stream processing system. Then, I continued my research and development on stream processing and database systems at the National University of Singapore, Carnegie Mellon University, IBM Almaden Research Center, and AWS Redshift. Most recently, I founded a startup called RisingWave Labs, developing a distributed SQL streaming database.

2023 marks my 11th year in the stream processing area. I never imagined that I could have been working in this field for over a decade, jumping out of my comfort zone and gradually shifting from pure research and engineering to exploring the path of commercialization. In a startup, the most exciting and challenging thing is to predict the future: we need to constantly evaluate the technology and business trend and predict the development direction of the industry based on our understanding and intuition. As a startup founder building next-generation stream processing systems, I am always optimistic yet humble in the market. I keep asking myself over and over again: Why do we need stream processing? Why do we need a streaming database? Can stream processing really replace batch processing? In this blog, I will combine my software development and customer engagement experiences to discuss stream processing and streaming databases' past, present, and future from a practitioner's perspective.


When Do We Really Need a Stream Processing System?


When talking about stream processing systems, people naturally think of some low-latency use cases: stock trading, fraud detection, ad monetization, etc. However, how exactly are stream processing systems used in these use cases? How low does the latency need to be when using a stream processing system? Why not use batch processing systems to solve the problems? In this section, I will answer these questions based on my experience.


Classic Stream Processing Scenarios


Regardless of the specific use case, stream processing systems are typically used in the following two scenarios: streaming ingestion and streaming analytics.

Figure: Streaming ingestion: joining streams from OLTP databases and message queues and delivering results to data warehouses/data lakes.

Streaming ingestion (ETL). Streaming ingestion provides the capability of continuously sending data from one set of systems to another. In many cases, users need to perform some transformation on the data during ingestion; hence, people sometimes use the terms "streaming ingestion" and "streaming ETL" interchangeably. When do we need streaming ingestion? Here’s an example. Let’s assume we are running an e-commerce website that uses an OLTP database (e.g., AWS Aurora, CockroachDB, TiDB, etc.) to support online transactions. To better analyze user behaviors, our website needs to track user events (e.g., ad clicks) and record these events into message queues (e.g., Kafka, Pulsar, Redpanda, etc.). To improve the sales strategy based on the most recent data, we have to continuously combine and ingest the transaction data and user behavior logs into a single data warehouse, such as Snowflake or Redshift, for deep analytics. Stream processing systems play a critical role in this case: data engineers use stream processing systems to clean the data, join the two streams, and move the joined results into the data warehouse in real-time. In real-world scenarios, data ingested into the stream processing systems typically come from OLTP databases, messaging queues, or storage systems. After stream processing, the results are most likely to be dumped back into these systems or inserted into data warehouses or data lakes. One thing worth noticing is that data seldom comes out from data warehouses and data lakes, as these systems are typically considered the “terminal” of the data. As few data warehouses and data lakes provide data change logs, capturing data changes from these systems can be very challenging.

Streaming analytics. You may have already heard of the term real-time analytics. Streaming analytics is similar to real-time analytics but more focused on data freshness over query latency (we will discuss this topic later in the blog). When performing streaming analytics, users are interested in continuously receiving fresh results over recent data (e.g., data generated within the last 30 minutes, 1 hour, or seven days). Just think of the stock market data. Users may be more interested in getting insights from the most recent data, and processing historical data, in this case may not bring too much value. In the streaming analytics scenario, data typically comes from OLTP databases, message queues, and storage systems. Results are usually ingested into a key-value store (such as Cassandra, DynamoDB, etc.) or a caching system (such as Redis, etc.) to support user-triggered requests. Some users may directly send stream processing results to the application layer, which is a typical pattern in alerting applications.


While batch processing systems can also support data ingestion and analysis, stream processing systems can reduce latency from days or hours to minutes or seconds. In many businesses, this can bring huge benefits. In the data ingestion scenario, reducing latency can allow users of downstream systems (such as data warehouses) to obtain the latest data promptly and process the latest data. In the data analytics scenario, the downstream system can see the latest data processing results in real time so that the results can be presented to users in a timely manner.

People may ask:

  • In the data ingestion (or ETL) scenario, wouldn’t it be enough for us to use an orchestration tool (such as Apache Airflow) to periodically trigger a batch processing system (such as Apache Spark) to do computations?
  • In the data analysis scenario, many real-time analysis systems (such as Clickhouse, Apache Pinot, Apache Doris, etc.) can analyze data online. Why do we need a stream processing system?


These two issues are worth exploring, and we discuss them in the last section.


User Expectations: How Low Does Latency Need to Be?


People are always talking about low-latency processing. But what does “low latency” mean? What are users’ expectations of latency? Seconds? Milliseconds? Microseconds? Is lower always better? We have interviewed hundreds of users who have adopted or are considering adopting stream processing technologies in production. We find that most use cases require latencies ranging from hundreds of milliseconds to tens of minutes. Many data engineers told us, “After adopting stream processing systems, our computation latency has been reduced from days to minutes. We are very satisfied with this improvement and are not strongly motivated to reduce the latency further.” Ultra-low latency sounds appealing, but, unfortunately, there are not many use cases that truly require it.


In some ultra-low-latency scenarios, general-purpose stream processing systems cannot meet the latency requirements. A typical ultra-low-latency scenario is high-frequency quantitative trading. Quantitative trading firms expect their systems to respond to market fluctuations within microseconds and to buy and sell stocks or futures in a timely manner. To achieve this level of latency, many firms move their data centers into buildings physically closer to the stock exchange and carefully choose network providers to reduce any possible network delay. In this scenario, almost all trading firms build their own systems instead of adopting general-purpose stream processing systems such as Flink or RisingWave. This is not only because general-purpose stream processing systems often fail to meet the latency requirements but also because quantitative trading usually requires specialized expressions that general-purpose systems do not support well.

Another set of low-latency scenarios includes IoT (Internet of Things) and network monitoring. The latency requirement in these scenarios is usually at the millisecond level and can vary from a few milliseconds to hundreds of milliseconds. In such scenarios, general-purpose stream processing systems might be a good fit. However, in some use cases, it is still necessary to use specialized systems. A good example is autonomous vehicles. Autonomous vehicles are typically equipped with radar and sensors to monitor roadside conditions, speed, coordinates, driver behaviors (e.g., pedaling and braking), and other information. Such information may not be sent back to the data center due to privacy issues and network bandwidth constraints. Therefore, a common solution is to deploy embedded computing devices directly on the vehicle for data processing.

Stream processing systems are primarily adopted in scenarios where latency falls into the range of hundreds of milliseconds to several minutes. These scenarios cover ad recommendations, fraud detection, stock market dashboards, food delivery, etc. In these scenarios, lower latency might bring more value but is not considered critical. Let’s take the stock market dashboard as an example. Individual investors usually make trading decisions by staring at websites or mobile phones to watch stock fluctuations. Do these individual investors really need to react to market fluctuations within ten milliseconds? Not at all. Human eyes process no more than roughly 60 frames per second, which means a human cannot distinguish refreshes faster than about 16 milliseconds. At the same time, ordinary human reaction times are on the order of hundreds of milliseconds to a second, and even well-trained sprinters only react in about 100-200 milliseconds. Therefore, for a stock market dashboard, which exists solely for humans to get insights and make decisions, achieving less than 100-millisecond latency is unnecessary. General-purpose stream processing systems (such as Flink, RisingWave, etc.) can be a great fit in this scenario.

Not all scenarios require low-latency processing. Consider examples like hotel reservations and inventory management, where a 30-minute delay is not a big issue. In these cases, users may consider using either stream processing systems or batch processing systems, and they should make the decision based on other factors, such as cost efficiency, flexibility, tech stack complexity, etc.

Stream processing systems do not fit scenarios with no latency requirements. These scenarios include machine learning model training, data science, and financial reporting. Batch processing systems are obviously more suitable here. That said, it is worth noting that people have started moving from batch to streaming even in these cases. In fact, some companies have adopted stream processing systems to build their online machine-learning platforms.


Looking Back at (Forgotten) History


The previous section described the typical scenarios where stream processing systems fit. In this section, let's talk about the history of stream processing systems.


Apache Storm and Its Successors


For many experienced engineers, Apache Storm may be the first stream processing system they ever used. Apache Storm is a distributed stream computing engine written in Clojure, a niche JVM-based language that many junior developers may not know. Storm was open-sourced by Twitter in 2011. In the big-data era dominated by Hadoop, the emergence of Storm blew many developers' minds. Traditionally, users processed data by first importing a large amount of it into HDFS and then analyzing it with batch computing engines such as Hadoop. With Storm, data could be processed on the fly, immediately after it flowed into the system. As a result, data processing latency dropped drastically: Storm users were able to receive the latest results in just a few seconds, without waiting for hours or even days.

Figure: The initial version of Storm was not entirely user-friendly. Users needed to learn many Storm-specific concepts before running a basic “word count” example. Image credit: Getting Started with Storm, by Jonathan Leibiusky, Gabriel Eisbruch, and Dario Simonassi (O'Reilly).

Apache Storm was groundbreaking for its time. However, its initial design was far from perfect. In fact, it lacked many basic functionalities that modern stream processing systems provide by default: state management, exactly-once semantics, dynamic scaling, SQL interfaces, etc. But it inspired many talented engineers to build next-generation stream processing systems. Just a few years after Storm emerged, many new stream processing systems were invented: Apache Spark Streaming, Apache Flink, Apache Samza, Apache Heron, Apache S4, and many more. Spark Streaming and Flink eventually stood out and became legends in the stream processing field.


History Before Apache Storm


If you think that the history of stream processing systems begins with the birth of Storm, you are probably wrong. In fact, stream processing systems have evolved for over two decades. Not surprisingly, the concept of stream processing originated from the database community. At the VLDB conference in 2002, a group of researchers from Brown and MIT published a paper titled "Monitoring Streams - A New Class of Data Management Applications", pointing out the demand for stream processing for the first time. In this paper, the authors proposed that to process streaming data, we should shift from the traditional "Human-Active, DBMS-Passive" processing mode and turn to the new "DBMS-Active, Human-Passive" mode. That is, instead of passively reacting to user-issued queries, the new class of database systems should trigger computation upon data arrival and actively push the results to the users. In academia, the earliest stream processing system was called Aurora. After that, Borealis, STREAM, and many other systems were born.

Figure: Aurora, the first stream processing system in academia.

Just a few years after being studied in academia, stream processing technology was adopted by large enterprises. The top three database vendors, Oracle, IBM, and Microsoft, successively launched their stream processing solutions, known as Oracle CQL, IBM System S, and Microsoft SQL Server StreamInsight. Interestingly, instead of developing standalone stream processing systems, all these vendors chose to integrate stream processing functionality into their existing systems.


2002 to Present: What Has Changed?

Figure: The evolution of stream processing systems: from subsystems of commercial database systems to standalone open-source big data systems.

As mentioned above, stream processing technology was first introduced in academia and soon adopted in industry. In the early days, stream processing was implemented as an integrated module inside commercial database systems. But since 2010, dozens of standalone stream processing systems have been built and adopted in tech companies. What caused this change? One major factor is the birth and growth of the Hadoop-centric big data ecosystem. In conventional commercial database systems, the computation and storage modules were tightly integrated and encapsulated within the same system. While this design greatly simplified the system architecture, it also restricted these database systems from scaling in a distributed environment. Observing this limitation, engineers built big data systems that offer high scalability across dozens or hundreds of machines. The computation and storage layers were split into independent systems, and users could use low-level interfaces to determine the best configurations and strategies for scaling out on their own. This design perfectly met the needs of a new batch of Internet companies, such as Twitter, LinkedIn, and Uber. The stream processing systems built at the time (notably Storm and Flink) closely followed the design principles of the batch processing systems (notably Hadoop and Spark), and their high scalability and flexibility perfectly fit the needs of big data engineers.


The Renaissance of Streaming Databases


Looking back at history, we find that the concept of streaming databases was proposed and implemented more than 20 years ago. For example, the aforementioned Aurora system was already a streaming database. However, streaming databases never became as popular as stream processing systems. History moves in a spiral. The big data era witnessed the trend of separating stream processing from databases and overthrowing the monopoly of the three giants: Oracle, IBM, and Microsoft. More recently, in the cloud era that started around 2012, we have watched batch processing systems such as Redshift, Snowflake, and Clickhouse bring the "computing engine" back into the "database". Now it is time to bring stream processing back into databases, which is the core idea behind modern streaming databases such as RisingWave and Materialize. But why now? We analyze it in detail in this section.


The Story of PipelineDB and ksqlDB


Since Apache Storm was open-sourced in 2011, stream processing systems have entered the fast lane of development. But streaming databases didn't just disappear. Two streaming databases were born in the big data era and gained popularity at the time: PipelineDB and ksqlDB.

Although both PipelineDB and ksqlDB are streaming databases, their technical routes are quite different.

  • PipelineDB is entirely based on PostgreSQL. It is a PostgreSQL extension for processing continuous queries. Once users have installed PostgreSQL, they can install PipelineDB as a plugin. This design is very user-friendly, since users can adopt PipelineDB without migrating their existing PostgreSQL databases. The core selling point of PipelineDB is the continuously updated materialized view: after users insert data into PipelineDB, the latest results are reflected in the materialized views they have created in real time (see the sketch after this list).
  • ksqlDB chose a different path. It is tightly coupled with Kafka: the intermediate results of the computation are stored in Kafka, and communication between nodes also relies on Kafka. The advantages and disadvantages of this technical route are clear. The advantage is that mature components can be heavily reused, greatly reducing development costs. The disadvantage is that the tight binding to Kafka seriously limits flexibility and drags down the performance and scalability of the system.
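For a flavor of PipelineDB's interface, a continuous query looked roughly like the sketch below. This follows PipelineDB's early, pre-1.0 syntax from memory; the exact statements changed across versions, and the stream and column names are invented:

```sql
-- Declare a stream: an append-only, unbounded input that lives inside PostgreSQL.
CREATE STREAM page_views (url TEXT, viewed_at TIMESTAMP);

-- A continuous view is a materialized view that PipelineDB keeps up to date
-- as new rows arrive on the stream.
CREATE CONTINUOUS VIEW view_counts AS
SELECT url, count(*) AS views
FROM page_views
GROUP BY url;

-- Reading the view at any moment returns the latest incrementally maintained counts.
SELECT * FROM view_counts ORDER BY views DESC LIMIT 10;
```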


Beyond their technical routes, the two systems are also closely related commercially. The PipelineDB team was founded in 2013 and joined Y Combinator, the well-known Silicon Valley incubator, in 2014. ksqlDB, on the other hand, was built by Confluent in 2016 (in fact, Kafka Streams came first). Confluent is the commercial company founded by the original Apache Kafka team. At the time, ksqlDB was not receiving much attention outside the Kafka ecosystem due to its tight binding to Kafka and its lack of technical maturity. Moreover, both PipelineDB and ksqlDB remained behind Spark Streaming and Flink in terms of user recognition. In 2019, Confluent acquired PipelineDB, and PipelineDB has not been updated since. Sound familiar? Yes, more recently, in January 2023, Confluent acquired Immerok, a Flink commercialization company founded by Flink core members, and announced with much fanfare that it plans to launch a computing platform based on Flink SQL. PipelineDB and Immerok are important enhancements to the Kafka-centric real-time data stack and direct replacements for ksqlDB, which makes the future of ksqlDB within Confluent even foggier.


The Rise of Cloud-Native Streaming Databases


In recent years, many cloud-native systems have gradually surpassed and disrupted their big data counterparts. One of the most well-known examples is Snowflake. Snowflake brought cloud data warehouses to the next level with its innovative architecture that separates storage and computing, and it rapidly took over the market that belonged to data analytics systems (such as Impala) in the big data era. In the world of streaming databases, a similar story is unfolding. Cloud-based streaming databases like RisingWave, Materialize, DeltaStream, and TimePlus have emerged in recent years. Although there are differences in their commercial and technical routes, their general objective is the same: to provide users with streaming database services on the cloud. To achieve that objective, the focus is on designing an architecture that fully utilizes cloud resources to achieve unlimited horizontal scalability and supreme cost efficiency. With a solid technical route and reasonable product positioning, I believe streaming databases will gradually take over the leading position from the previous generation of stream processing systems.


Stream Processing Systems and Streaming Databases


The trend of shifting toward “cloud-native” is unstoppable. Now the biggest question is: why would you build a "cloud-native streaming database" instead of a "cloud-native stream processing system"?

Some people may argue that it is the impact of SQL. One of the major changes brought about by cloud services is the democratization of data processing systems. In the era of big data, users were typically engineers familiar with general-purpose programming languages such as Java. In the cloud era, cloud services are adopted by a much wider group of users beyond well-trained programmers. This urgently requires systems to provide a simple, easy-to-understand interface that benefits developers without deep programming knowledge. SQL, the standard query language, obviously meets this requirement. SQL is a declarative language designed solely for querying and manipulating data, and it has been the industry standard for decades. The widespread use of SQL has accelerated the popularization of the concept of "databases" in data engineering and, hence, in stream processing.

However, SQL is only an indirect factor, not a direct factor. The logic is simple: many stream computing engines (such as Flink and Spark Streaming) already provide SQL interfaces, but they are not streaming databases. Although the SQL interface of these systems differs from the SQL experience of traditional databases such as MySQL and PostgreSQL, it already proves that having SQL does not necessarily mean having a database.

The key difference between a "streaming database" and a "stream processing system" is that a streaming database has its own storage, while a stream processing system does not. The deeper insight is that streaming databases treat storage as a first-class citizen. Based on this insight, streaming databases meet the two most fundamental needs of users: simplicity and ease of use. How is this done? Let's look at the following aspects.

  • Unifying data operation objects. In stream processing systems, the stream is the basic object of operation, whereas in databases, the table is the basic object of operation. In streaming databases, the concepts of stream and table are unified: a stream is simply a table of unbounded size, so a stream can be processed, queried, and manipulated in exactly the same way as a table in a traditional database. Users no longer need to reason about the difference between a stream and a table.
  • Simplifying the data stack. Unlike stream processing systems, streaming databases have ownership of data. Users can store data directly in the database, and after processing the data through the streaming database, the results can be stored in the same system and accessed concurrently by different users. In stream processing systems, which cannot store data, the results of any computation must be exported to an external downstream system, e.g., a key-value store or a cache, before users can access them. That is to say, a single streaming database can replace combinations such as Flink+Cassandra/DynamoDB (see the sketch after this list). Simplifying the user's data stack and reducing operation and maintenance costs are obviously what many companies expect.
Figure: A streaming database can perfectly replace the combination of a streaming engine and a serving system.
  • Reducing external access overhead. The data sources of modern enterprises are diverse. When using a stream processing system, users often need to access data from external systems, for example joining a data stream from Kafka with a table in MySQL. In a stream processing system, accessing a MySQL table requires a cross-system external call, and the performance cost of such calls is enormous for data-intensive computations. Streaming databases, on the other hand, can store data, so data from the external system can be stored (or cached) inside the streaming database. This greatly improves overall performance by eliminating costly external calls.
Figure: In stream processing systems, cross-system data access introduces huge performance penalties.
  • Providing result interpretability and consistency guarantees. A major pain point of stream processing systems is the lack of interpretability and consistency guarantees for computational results. In Flink, different jobs run and output results to downstream systems continuously and independently. Because different jobs can be at different points of computational progress, the results that the downstream system receives from different jobs may be inconsistent with each other. Consider a simple example: a user submits two jobs to Flink, one computing how many times Tesla stock has been bought in the past ten minutes, the other computing how many times it has been sold in the past ten minutes. A downstream system receives continuous output from the two jobs; one result may reflect the state at 8:10 and the other the state at 8:11. Worse, this inconsistency cannot be traced. Seeing such results, users are confused: they cannot judge whether the results are correct, or how they were generated and evolved. In streaming databases, which own both the input and output data, the computing behavior of the system becomes observable, explainable, and strongly consistent from a theoretical point of view. RisingWave is one of the systems that achieve these guarantees: all intermediate results are stored in tables and stamped with timestamps, and all computational results can be stored in materialized views and traced through timestamps. In this way, users can better understand the computational results by querying the historical intermediate results at a specific timestamp.
  • Deeply optimized execution plans. "Treating storage as a first-class citizen" means that the computing layer of a streaming database is aware of the storage layer. This awareness opens new opportunities to optimize computation and storage jointly. A great example is sharing intermediate state across streaming queries to reduce resource overhead. We will not dive into the technical details of this topic here and refer readers to our previous blog.
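A minimal sketch of what the first two points look like in practice, in RisingWave-flavored SQL (the table, view, and query below are invented for illustration):

```sql
-- A stream is just an unbounded table: it is created, written to, and
-- queried like any other table.
CREATE TABLE user_events (
    user_id BIGINT,
    event_type VARCHAR,
    event_time TIMESTAMP
);

-- Stream processing is expressed as a materialized view over that table;
-- the view is maintained incrementally as new events arrive.
CREATE MATERIALIZED VIEW events_per_user AS
SELECT user_id, count(*) AS event_count
FROM user_events
GROUP BY user_id;

-- Because the streaming database owns the results, applications can serve
-- point lookups directly from it, with no separate Cassandra/DynamoDB layer.
SELECT event_count FROM events_per_user WHERE user_id = 42;
```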


Design Principles for Cloud-Native Streaming Databases


The biggest difference between cloud deployment and on-premise deployment is that the cloud can have unlimited and decoupled resources. In contrast, the on-premise deployment has only limited and coupled resources. What exactly does that mean?

  • First, users on the cloud no longer need to be aware of the number of physical machines, whereas cluster users in the big data era typically had only a limited number of physical machines.
  • Second, cloud vendors expose disaggregated resources to users, who can directly purchase computing, storage, caching, or other resources separately according to their needs. In comparison, users of on-premises deployments can only request resources in units of whole machines.


The first difference causes a fundamental discrepancy in the design goals of big data and cloud-native systems. The goal of big data systems is to maximize performance within limited resources, while the goal of cloud-native systems is to minimize cost under effectively unlimited resources. The second difference causes a fundamental division in implementation. Big data systems achieve embarrassing parallelism through a shared-nothing architecture that couples storage with computing, whereas cloud-native systems realize on-demand scaling through a shared-storage architecture that separates storage and computing.

These design principles bring new challenges to cloud-native streaming databases. In a streaming database, the storage essentially holds the intermediate states of the computation. When intermediate states are separated from computing nodes and placed on persistent cloud storage services such as S3, the obvious challenge is that the data access latency of S3 may greatly affect the latency of the stream processing itself. Therefore, a cloud-native streaming database has to consider how to use the local storage of computing nodes and external storage (such as EBS) to cache data and minimize the performance degradation caused by S3 access.

Figure: Big data systems and cloud systems have different optimization goals.

Stream Processing and Batch Processing: Replace or Coexist?


As stream processing technology can greatly reduce query latency, many people consider it a technology that can subvert batch processing. An opposite view holds that, since modern batch processing systems have been "upgraded" into real-time analytical systems that can return results in milliseconds, the value of stream processing systems is limited. While I am extremely optimistic about the future of stream processing, I do not think stream processing can fully replace batch processing. In my view, these two technologies will coexist and complement each other in the long run.


Stream Processing Systems and Real-Time Analytical Systems


At present, most batch processing systems, including Snowflake, Redshift, Clickhouse, etc., claim to be capable of real-time analytics on large-scale data. This naturally leads to the question raised in the first chapter: why do we need a stream processing system when there are already many real-time analytical systems? Because there is a fundamental difference in the definition of "real-time". In stream processing systems, a query is defined by the user in advance, and query processing is driven by the data. In real-time analytical systems, a query can be defined by the user at any time, and it is the user who drives the query processing. The "real-time" of stream processing systems means that query results reflect the latest received data, whereas the "real-time" of real-time analytical systems refers to the low latency of randomly issued, ad-hoc queries. To put it more clearly, stream processing systems emphasize the real-time nature of computational results, while real-time analytical systems emphasize the real-time nature of user interaction. For scenarios that require high freshness of results, such as a stock trading dashboard, advertising recommendations, or network anomaly detection, stream processing systems are the better choice. For scenarios with high requirements on interactive experience, such as interactive custom reports or analyzing user behavior logs, real-time analytical systems are a better fit.

Figure: The difference between a streaming database and a real-time analytical database. A streaming database processes data before storing it; the computation is driven by data; it focuses on delivering fresh results. A real-time analytical database stores data before processing it; the computation is driven by users; it focuses on enabling real-time user interaction.

Some people may say: since real-time analytical systems can return real-time results to the queries users send, couldn't we simply keep sending the same query periodically, keep the results fresh at all times, and achieve the same effect as a stream processing system? There are two problems. The first is implementation complexity. A user is not a machine and cannot sit in front of a computer sending queries continuously. To issue a query periodically, you must either write a program or use an orchestration tool, such as Airflow, to send queries in a loop. Either way, users pay extra costs to operate and maintain external systems or programs. The second and more fundamental problem is cost. Real-time analytical systems recompute over the whole dataset, while stream processing systems compute incrementally. With large volumes of data, the redundant computation carried out by real-time analytical systems results in a huge waste of resources. This is why we should not regularly trigger batch recomputation in real-time analytical systems with an orchestration tool.
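The contrast can be seen in how the same question is expressed in the two kinds of systems. The sketch below reuses the hypothetical ad_clicks input from earlier and is purely illustrative:

```sql
-- Option 1: periodically re-run the query in a real-time analytical system
-- (e.g., triggered by Airflow). Every run rescans the data and recomputes from scratch.
SELECT ad_id, count(*) AS clicks
FROM ad_clicks
GROUP BY ad_id;

-- Option 2: register the query once in a streaming database. The materialized view
-- is maintained incrementally, so each new click only updates the affected count.
CREATE MATERIALIZED VIEW clicks_per_ad AS
SELECT ad_id, count(*) AS clicks
FROM ad_clicks
GROUP BY ad_id;
```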

Figure: Stream processing systems use incremental computation to avoid redundant processing costs.

Stream Processing and Real-Time Materialized Views


Nowadays, many real-time analytical systems have implemented real-time materialized views. Since streaming databases also use materialized views to define stream processing inside their kernels, some argue that, given an analytical system with real-time materialized views, we no longer need a separate stream processing system. I don't agree with this point of view, for the following reasons.

  • Competition for resources. The core problem analytical databases solve is processing complex queries efficiently on large-scale data sets. In such systems, the positioning of materialized views is essentially no different from that of database indexes: both are caches of computation. Creating real-time materialized views has two benefits: redundant computation over frequently used queries can be reduced, and pre-computing subqueries shared by different queries can speed up query execution. This positioning, however, makes it almost impossible to update the materialized views in a timely manner. Actively updating materialized views would inevitably seize computing and memory resources and hamper timely responses to user queries. To avoid putting the cart before the horse, almost all analytical databases adopt either a passive updating strategy (updates must be explicitly triggered by users) or a delayed updating strategy (updates happen only when the system is idle) for materialized views.
  • Correctness guarantee. The resource competition problem can be solved relatively easily by adding more computing nodes; guaranteeing correctness is much harder for real-time analytical systems. Batch processing systems process finite, bounded data, while stream processing systems process infinite, unbounded data. When processing unbounded data, events may arrive out of order due to unstable network communication or different latencies from different sources. Stream processing systems introduce watermarks to solve this problem: a watermark tells the system how long to wait for late data, so that out-of-order events within the watermark are still incorporated into the right windows and results remain semantically correct, while data older than the watermark is discarded (see the sketch after this list). Real-time analytical systems lack this basic design, which makes it impossible for them to deliver correct results the way stream processing systems do. This correctness guarantee is crucial in application scenarios such as risk control and advertising recommendations. A system without correctness guarantees over real-time data sources cannot replace stream processing systems.
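To make the watermark idea concrete, here is how an event-time watermark can be declared in Flink-style streaming SQL (the table, the ten-second bound, and the datagen test connector are illustrative choices; exact syntax differs across systems):

```sql
-- Events may arrive out of order; the watermark tells the engine to wait up to
-- ten seconds for stragglers before finalizing an event-time window.
CREATE TABLE sensor_readings (
    sensor_id   STRING,
    temperature DOUBLE,
    event_time  TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
    'connector' = 'datagen'  -- test connector; a real pipeline would read from Kafka
);

-- Window results are emitted once the watermark passes the end of each window;
-- events older than the watermark are dropped instead of silently corrupting results.
SELECT
    sensor_id,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    avg(temperature) AS avg_temp
FROM sensor_readings
GROUP BY sensor_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```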


Considerations Before Adopting Stream Processing


Stream processing is not a panacea, nor can it completely replace batch processing in all scenarios. Before adopting stream processing in your tech stack, you may want to evaluate the following aspects.

  • Flexibility. Stream processing requires users to predefine queries, to achieve continuous computation over streaming data. This requirement makes stream processing less flexible than batch processing, which can actively respond to user-triggered one-shot queries. As a result, stream processing is more suitable for scenarios where queries can be created before execution, while batch processing is more suitable for scenarios where frequent user interactions are required.
  • Expressiveness. Stream processing uses incremental computation to reduce redundant overhead. However, incremental computation also brings a major limitation: not all computations can be performed incrementally. A simple example is computing the exact median of a data sequence: no algorithm can maintain the exact median incrementally without retaining or rescanning the whole dataset (see the sketch after this list). Keeping this in mind, you should evaluate whether stream processing is suitable for highly complex scenarios.
  • Computing costs. Stream processing can greatly reduce the cost of real-time computation. But this does not mean that stream processing is more cost-effective than batch processing in all scenarios. In scenarios where result freshness is not critical, batch processing can actually be cheaper. Just think of generating a company's quarterly financial report: stream processing would be overkill. To perform incremental computation, stream processing systems have to continuously maintain internal computation state, which keeps consuming resources. In contrast, batch processing only performs computation when a user request arrives, and no additional resources are spent on maintaining computation state.
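As a concrete illustration of the expressiveness point above, compare the two aggregations below (generic PostgreSQL-style SQL over a hypothetical measurements table; not every streaming engine exposes percentile_cont at all):

```sql
-- An average is incremental-friendly: maintaining it only requires a running
-- sum and count, so each new row is absorbed with constant work.
CREATE MATERIALIZED VIEW avg_latency AS
SELECT service, avg(latency_ms) AS avg_latency_ms
FROM measurements
GROUP BY service;

-- An exact median is not: there is no fixed-size summary that yields it, so a
-- streaming engine must either retain every row or reject/approximate the query.
SELECT service,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) AS median_latency_ms
FROM measurements
GROUP BY service;
```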


Unifying Stream Processing and Batch Processing


Stream processing and batch processing each have their pros and cons, and in many real-world scenarios neither can completely replace the other. Many believe the two will coexist in the long run. Observing this, it is natural to consider unifying stream processing and batch processing in a single system. Many existing systems have explored this direction, including well-known ones such as Flink and Spark. While the unified approach can fit some scenarios, the reality is that, at least for now, most users still adopt separate solutions for stream processing and batch processing. The reason is straightforward: composing two specialized systems not only sidesteps concerns about the performance and stability of a single unified engine but also gives users the best of both worlds.

At the current stage, RisingWave is more focused on stream processing, and users can leverage RisingWave connectors to integrate with mainstream batch processing systems. In the current RisingWave version, users can easily export data to systems such as Snowflake, Redshift, Spark, and ClickHouse. I feel optimistic about the future of a unified approach to batch and stream processing, and in the long run, RisingWave will also explore this direction. In fact, streaming databases are already unifying stream processing and batch processing: when new data flows into the streaming database, the streaming database performs stream processing, but when a user initiates ad-hoc queries, the streaming database performs batch processing.


Postscript


While I am proud of having worked in the stream processing domain for over a decade, I have to be honest and admit that I once thought there was not much room for growth in this area. I could hardly have imagined that cloud computing would change everything and bring stream processing back to the center of the stage after 2020. Nowadays, more and more people are researching, developing, and adopting stream processing technology. I have shared my thoughts on stream processing in this blog and hope it helps. Everyone is welcome to join our community to discuss the future of stream processing.


About RisingWave


RisingWave is a distributed SQL streaming database in the cloud. It is an open-source system released under Apache 2.0 license.

Official website: risingwave.com

Slack community: risingwave.com/slack

Documentation: risingwave.dev

Austin, the Live Music Capital of the World! This city in Texas, USA, has become an emerging tech hub due to the influx of tech giants in recent years. Thanks to its convenient transportation and central location in the United States, Austin is also attracting more and more technology conferences. I was fortunate to visit Austin for the second time this year to participate in a top technology conference in the field of data systems: Current 2022. The name Current may be unfamiliar to many of us, but its former name should be familiar to all engineers: Kafka Summit. For branding and other reasons, the event organizer, Confluent, a giant in the field of data systems, decided to rename the 2022 Kafka Summit to Current 22.

Figure: Austin Airport
Figure: Texas State Capitol

The focus of the Current conference is real-time data systems. Different from widely adopted big data platforms (such as Apache Hadoop, Apache Spark, and Apache Hive) or data warehouses (such as Snowflake, Redshift, and BigQuery), real-time data systems emphasize real-time storage and computation over data generated in real time. In recent years, the field has gained growing recognition in the market with the rise of applications such as real-time reporting, monitoring, and tracking. Confluent, an industry leader in this field, successfully IPO’d in mid-2021, driving a new wave of real-time data system development and adoption.

Figure: Current Conference Sponsor Booth

A great way to understand where a field is heading is to see what startups are doing. During my two days at Current, I chatted with all of Current's sponsors. Apart from certain industry giants, most conference sponsors are startups. In this article, based on my understanding, I will introduce you to these startups to help you understand the future of real-time data systems. I will begin by classifying these companies based on their core products. Then I will share my observations and raise the potential risks I can imagine. And I will end by introducing each startup company's core product and business model.

Disclaimer: This article only introduces the startups that sponsored the Current conference, and this article attempts to ensure that the comments are objective and fair; however, remember these are my personal observations. This article does not constitute any investment advice on any specific company, but I do recommend you invest in this field for the long term 🙂


Classification of Startups


Based on my analysis of the startup companies sponsoring the conference, I have divided their entrepreneurial directions into the following nine categories:

1. Commercializing mature open-source projects on the cloud

This category includes Databricks, Aiven, Conduktor, Imply, StreamNative, StarTree, Immerok, Factorhouse, and more. Such companies can be further divided into three development models:

  • Core teams starting a business: such companies include Databricks, Imply, StreamNative, StarTree, Immerok, etc. One of their core selling points is "orthodoxy": these companies guide the direction of the open-source project's development through strong control by its core contributors, and then try to convert open-source users into paying users through the community.
  • Non-founding teams starting a business: such companies include Conduktor, Factorhouse, etc. These companies are founded by active members of certain open-source projects, though not necessarily founding members. They usually emphasize building out the cloud platform in multiple directions, looking for differentiation from angles such as security, visualization, and system integration.
  • The companies directly commercialize multiple open-source projects. Aiven is one example. By providing hosting services for different open-source projects on the cloud, such companies can reduce the sense of separation between different projects and provide users with a complete set of solutions.

2. Real-time ETL/ELT/data integration

This category contains Decodable, Striim, Airbyte, etc. Airbyte focuses on ELT, while Decodable and Striim lean more toward ETL. One core selling point of these companies is ease of use. For example, Decodable emphasizes that users only need to click a few buttons on the platform and then write SQL to achieve real-time data transformation and ingestion from one database to another.

3. Message queues

This category contains StreamNative, Redpanda, etc. Message queues are fundamental building blocks of a modern data stack. I believe that everybody knows Kafka, which is backed by the company Confluent, the organizer of this Current conference. Message queues proved their value over the past decade. But this does not mean that the development of message queues has come to an end. In the new cloud era, new challengers and products have emerged based on new technologies, such as the separation of storage and computing to reduce costs and increase efficiency.

4. Real-time API

This category contains Aklivity, Macrometa, etc. Such companies fill the gap between data sources and data access, allowing users to easily access data ingested in real time.

5. Edge computing

This category includes Ably and others: companies focusing on real-time data processing on edge devices, which sit very close to the data source. A critical source of real-time data is the data generated by edge devices such as sensors and mobile phones. To reduce latency, real-time computation needs to be carried out on near-source devices rather than funneled wholesale into cloud data centers. This also brings new demands and challenges to the underlying software.

6. Real-time vertical industry SaaS

This category contains Clear Street, Bicycle, and more. Such companies focus on applying real-time data analysis in vertical markets. Their development proves that real-time computing can be adopted at scale in specific industries.

7. Real-time high-level language framework

This category contains Quix, Meroxa, etc. This type of company aims at developers who use a specific programming language (such as Python, etc.). People unfamiliar with low-level languages (e.g., Java, Scala) can easily program their stream processing applications in high-level languages.

8. Real-time analytical database

This category contains Imply, StarTree, InfluxData, Rockset, FeatureBase, and many more. This type of product mainly focuses on improving the ability of real-time data ingestion in traditional analytical databases. Data ingestion latency is reduced, and the newly ingested data is visible to analytics in real time. These databases provide SQL interfaces that allow users to perform analytical queries on data like traditional databases.

9. Streaming database

This category contains RisingWave, Materialize, DeltaStream, TimePlus, and more. Like real-time analytical databases, they all provide SQL interfaces and are optimized for real-time data ingestion. But streaming databases take a step further to support real-time computation over the data: they integrate stream processing technology into the database and support real-time, continuous updates of query results through incremental computation.


Founding Time


The following table summarizes all companies introduced in this blog according to the year of founding.

Year | Companies | Quantity
2012 | InfluxData, Striim | 2
2013 | Databricks | 1
2014 | Slower | 1
2015 | Imply, Oxylabs, Swim | 3
2016 | Aiven, Ably, Rockset, Memgraph, Nussknacker | 5
2017 | Macrometa, FeatureBase | 2
2018 | Acceldata, Clear Street, StarTree | 3
2019 | Conduktor, Materialize, StreamNative, Cube, Redpanda, Tinybird, Meroxa | 7
2020 | Bicycle, Airbyte, CelerData, DeltaStream, Factorhouse, LakeFS | 6
2021 | RisingWave, Decodable, Aklivity, Timeplus | 4
2022 | Immerok | 1

Without a doubt, 2019, 2020, and 2021 were the golden years for launching real-time data systems. Before 2019, the field of real-time computing was far less hot than it is today. But in recent years, as traditional batch data processing has entered a bottleneck period, real-time data processing has begun to attract people's attention. The landmark event of Confluent's IPO in 2021 has brought new confidence into the real-time data processing industry. It is no exaggeration — real-time computing is one of the hottest markets today.


The field of real-time data systems has gradually transformed from being used by technology giants on a small scale to being used by the general public. If Confluent's IPO proves that enterprises need to store real-time data, we can say that many real-time data systems startups are now trying to prove that users also need to do computations on real-time data. These startups in this wave have different entry points, from real-time APIs to real-time analytical databases to streaming databases. Although they seem similar from a macro perspective, everyone is actually looking for segments to differentiate. For example, the real-time API field is more for application developers, while the real-time high-level language frameworks are more for data analysts and scientists. Regardless of the segment, we can see that real-time computing has ushered in the next wave.


Entrepreneurial Challenges


Frankly speaking, I don't think the real-time data system field, or the data system field in general, faces insurmountable technical barriers. After all, everyone is playing a game of balancing performance and resources. I am very bullish on the real-time data system field; of course, if I were not optimistic about this field, I would not have started a business in it. So, are there actually any challenges in launching a startup? Yes, there still are.

The biggest challenge comes from the immature market. Whether a technology can achieve explosive growth depends not on how advanced it is but on whether it matches market demand. The concept of stream processing was first proposed in academia 20 years ago and landed in industry almost a decade ago, but it has been in a tepid state for the past two decades. Although some technology giants have adopted the technology, that does not mean companies at large can readily use it. This is a typical technology-market mismatch. Today, in the field of real-time data systems, we can clearly feel the market heating up, but it will take time for the technology to be as widely recognized and accepted as Oracle in the enterprise market or Snowflake in cloud computing. I predict this will take about 2-5 years. During this time, startups in the industry must spend a lot of time and energy educating the market. This investment is massive and full of unknowns. Of course, as we all know, challenges and opportunities coexist, and whoever grasps the opportunity within the challenge will eventually become the market leader.


Company Details


Next, let's go through the startups that sponsored the Current conference. I only include private companies established within the last ten years (i.e., founded after 2012). As for the giants (such as AWS, Google, and Microsoft), acquired companies, and companies older than ten years, I will not introduce them one by one here.


RisingWave

Founded year: 2021

Keywords: stream processing, database

Funding round: Series A

Last funding round year: 2022

Let me first introduce our company, RisingWave Labs. We launched it in early 2021. Over the past two years, we have grown into a fully distributed team spanning seven time zones. RisingWave has been focused on developing a cloud-native streaming database from day one. The core idea is to democratize stream processing: make it simple, affordable, and accessible. We hope users can develop their stream processing applications as easily as operating a regular database; to do stream processing, the only thing a user needs to do is create a materialized view. Like some other popular systems today, RisingWave does not depend on the JVM ecosystem, which keeps deployment, operation, and maintenance simple. The entire system is written from scratch in Rust, mainly because of the language's efficiency and safety. RisingWave is an open-source project under the Apache license, and its commercial model is to provide cloud services. The private preview version has already been released, and the GA version is expected to be released next year. As a streaming database, RisingWave lets users process streaming data while also storing data, which means users can query the database directly. Many people will think of the popular concept of unifying batch and streaming. RisingWave currently focuses more on stream processing, and its batch processing capabilities will evolve according to project priorities.


Databricks


Official website: https://www.databricks.com/

Founded year: 2013

Keywords: data lake, data platform

Funding round: Pre-IPO

Last funding round year: 2021

Databricks needs no introduction. Valued at up to 38 billion US dollars, the company started with Apache Spark and is moving toward a unified lakehouse (integrating data lakes and data warehouses). This year, Databricks also surpassed $1 billion in revenue. If there are no surprises, the company will presumably have a successful IPO in the next two years. Databricks is steadily moving toward real-time computing, and Spark Streaming is one of its core development directions. This year Databricks also announced the upcoming launch of Lightspeed, its next-generation Spark Streaming engine. Since Lightspeed is not open-sourced yet, I won't elaborate on it.


Slower


Official website: http://slower.ai/

Founded year: 2014

Funding round: unknown

Last funding round year: unknown

Slower is a mysterious company. Although it was founded in 2014, I could not find any concrete information about it on the internet, and its official website is just a simple company logo. After chatting with their employees, I learned that they mainly provide various cloud and on-premises deployment solutions for enterprises. The business covers databases, data platforms, data management tools, machine learning platforms, security, and more. In general, it is a well-rounded solutions team. Because Slower is so mysterious, I will stop here.


Aiven


Official website: https://aiven.io/

Founded year: 2016

Keywords: open source, cloud services

Funding round: Series D

Last funding round year: 2022

Aiven is a Helsinki-based cloud service provider. Unlike many other startups, its story is not about core members of an open-source project starting a company to commercialize that project; it is about providing cloud hosting services for many existing open-source projects! From Kafka to Flink, from ClickHouse to InfluxDB, from MySQL to PostgreSQL, Aiven can provide hosting services for just about any mainstream open-source project. At first glance, this does not seem particularly competitive, but Aiven is in fact addressing a big pain point for users: software selection. Today there is too much data software, and it is too complex. To build complex applications, companies often have to compose a variety of software. Providing a complete set of cloud services largely solves the selection problem for users. In addition, a unified management interface makes the connections between the different pieces of software smoother, without a strong sense of separation, which I believe is another good advantage.


Conduktor


Official website: https://www.conduktor.io/

Founded year: 2019

Keywords: Apache Kafka, cloud service

Funding round: Series A

Last funding round year: 2021

Apache Kafka is a distributed streaming message storage system: as long as there is streaming data, Apache Kafka can be used to store it. However, having this storage system alone is not enough. We also need to deploy, maintain, and operate it, monitor it, and analyze and manage the data on it. Conduktor is a company that does this series of things; their tagline is "Streamline Apache Kafka". At first, I thought this was highly similar to what Confluent does. After chatting with their CTO, I realized that they not only offer Kafka hosting services but also analyze and manage data. For example, you can use their platform to know whether the quality of the data stored in Kafka is reliable, or to query and monitor the data. Conduktor essentially treats Kafka as a data platform: users no longer need to import data into downstream data warehouses or data lakes, but can work with the data within Kafka. I believe this is a good direction.


Decodable


Official website: https://www.decodable.co/

Founded year: 2021

Keywords: data pipelines, data engineering, cloud services

Funding round: Series A

Last funding round year: 2022

Decodable was born in 2021 and is hence a rookie in the field of data engineering. The focus of Decodable is a very familiar problem: ETL. Countless platforms provide ETL capabilities, so what is Decodable's entry point? The answer is straightforward: Decodable provides engineers with a simple and easy-to-use platform. Through a few clicks and some SQL code, you can import data from one platform (such as Apache Kafka or Apache Pulsar) into another (such as Snowflake or Redshift). They also provide a cloud service that allows users to connect to databases on the cloud without installing any software locally. I believe this is an excellent product as well.


Imply


Official website: https://imply.io/

Founded year: 2015

Keywords: Apache Druid, real-time analytics, database

Funding round: Series D

Last funding round year: 2022

Engineers in the database field should be familiar with Imply. Imply commercializes Apache Druid, a well-known real-time analytics engine in the industry. The core objective of Apache Druid is to respond to ad-hoc complex queries on large-scale data with low latency. Although many new startups have emerged in the real-time analytics field in recent years, Imply still maintains a leading position in terms of customer volume by virtue of its stable performance.


Materialize


Official website: https://materialize.com/

Founded year: 2019

Keywords: stream processing, database

Funding Round: Series C

Last funding round year: 2021

Materialize is our company's most direct competitor, and a friendly one. Like us, Materialize's core product is a streaming database. At Current, they finally released their long-awaited product: a cloud-native streaming database. Although Materialize has been building a streaming database on top of the Timely Dataflow open-source project since 2019, for a long time it was a single-machine, purely in-memory database, so its availability could face considerable challenges in real production environments. The new version is worth looking forward to and will hopefully make everyone's eyes sparkle.


StreamNative


Official website: https://streamnative.io/

Founded year: 2019

Keywords: Apache Pulsar, message queue, pub/sub

Funding round: Series A

Last funding round year: 2021

StreamNative was founded in 2019. Although it is young, StreamNative has an outstanding reputation in the open-source and infrastructure community. Its core product is a commercial version of Apache Pulsar. Since Pulsar was open-sourced in 2016, many companies worldwide have adopted it. As message queuing systems, Apache Pulsar and Apache Kafka have two significant differences. Compared to Kafka, which focuses on storing event data, Pulsar also pays attention to the message data generated within applications. Pulsar is also more cloud-native: its architecture separates storage and computing, which makes the entire system more scalable. As for commercialization, StreamNative is currently focused on providing services to users on the cloud.


Ably


Official website: https://ably.com/

Founded year: 2016

Keywords: message queue, pub/sub, edge computing

Funding round: Series B

Last funding round year: 2021

Ably is a London-based company that provides message queuing services on the cloud. Speaking of message queue services, you may think that Ably's product is similar to Kafka or Pulsar. Yes, Ably's products are in the same category as Kafka, but the most significant difference is edge computing. Kafka is usually deployed in a company's data center, where message data is collected in a centralized way. Ably focuses on the edge: to achieve millisecond-level latency, you can deploy its product on the edge cloud and process data directly on the device side, such as mobile phones, sensors, and tablets. Ably has been around for six years, and its cumulative financing has exceeded 80 million US dollars.


Acceldata


Official website: https://www.acceldata.io/

Founded year: 2018

Keywords: observability, cloud services

Funding round: Series B

Last funding round year: 2021

Acceldata is a comprehensive data observability platform in the cloud. Observability platforms have been a hot field in recent years: aside from market leaders such as Splunk and Datadog, various other startups are working in this space. Based on conversations with Acceldata folks, my first impression was that Acceldata is an observability platform that does everything, and I struggled to see how they differ from Datadog and the like. After digging deeper, I found that they are more concerned with the observability of the various systems in the so-called "modern data stack", rather than the observability of the machines themselves or of traditional concerns such as CI/CD. What they observe is therefore somewhat different from the focus of other companies.


Aklivity


Official website: https://www.aklivity.io/

Founded year: 2021

Keywords: real-time API

Funding round: Seed round

Last funding round year: 2022

Aklivity is a startup with only three employees (including the founder himself)! I chatted with all three of them at a cocktail party. They told me they are building an open-source API tool called Zilla, and the company has raised $4 million in its seed round. More specifically, what they are building is a real-time API gateway. Simply put, when users connect to Kafka, different devices and applications may require different interfaces, which is relatively troublesome. Zilla is essentially a unified layer on top of Kafka, so that different applications can access Kafka in the same way.


Bicycle


Official website: https://bicycle.io/

Founded year: 2020

Keywords: revenue operation, SaaS

Funding round: Unknown

Last funding round year: Unknown

At Current 2022, we were pleasantly surprised to discover several SaaS products that provide real-time analytics capabilities, and Bicycle is one of them. Bicycle does not sell a real-time analysis engine or a storage engine; it provides real-time data monitoring, alerting, and analysis for customers. For example, for an e-commerce company, Bicycle can analyze past sales data to predict future sales and help manage revenue accordingly. Bicycle employees revealed that they built their core engine in-house and use machine learning methods to analyze sales data. As the underlying systems for real-time analytics mature, I expect more and more SaaS startups like this to appear.


Clear Street


Official website: https://clearstreet.io/

Founded year: 2018

Keywords: FinTech, SaaS, securities trading

Funding round: Series B

Last funding round year: 2022

Clear Street is a New York-based fintech company that provides a cloud-based securities trading service. Traditional securities brokers, such as banks, suffer from out-of-date infrastructure, opaque information, and low efficiency. Clear Street saw this opportunity and wants to revolutionize the field with cloud computing. If Robinhood is a trading platform for retail investors, then Clear Street, in my opinion, is Robinhood for professional institutions. Through Clear Street, users can not only trade securities but also perform real-time data analysis through a simple interface. In 2021, the average transaction volume processed on its cloud platform in a single trading day reached 3 billion US dollars.


Cube


Official website: https://cube.dev/

Founded year: 2019

Keywords: Headless BI, cloud platform

Funding round: Series B

Last funding round year: 2022

When I first saw Cube's booth, I thought they were a BI visualization company, but it turns out they are not. Cube provides a layer between data storage systems (such as databases and data warehouses) and visual BI tools. It solves several problems:

  1. Consistent metric definitions: when users query data from different sources, the types, units, and representations of the data may not be uniform; Cube provides a data model to reconcile them;
  2. Access control: administrators can set different permissions for different users through Cube, so that each user sees the appropriate reports;
  3. Caching: every query a BI tool sends down to the underlying data storage system incurs access overhead; Cube adds a caching layer to reduce it;
  4. API abstraction: you no longer have to worry about the different APIs of the underlying data storage systems.


Overall, I think Cube is a very thin, easy-to-use tool, and many companies should like it.
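
To make the idea of such a layer concrete, here is a toy Python sketch (not Cube's actual API) showing how a single shared metric definition can be translated into SQL on demand and cached, which is the essence of points 1 and 3 above. All names here are made up for illustration.

```python
# Toy illustration of a semantic layer: metrics are defined once,
# translated to SQL on demand, and results are cached.
from functools import lru_cache

# One shared definition of a metric, instead of each BI tool writing its own SQL.
METRICS = {
    "completed_orders": {
        "table": "orders",
        "measure": "COUNT(*)",
        "filter": "status = 'completed'",
    },
}

def to_sql(metric: str, group_by: str) -> str:
    """Translate a metric definition into SQL for the underlying store."""
    m = METRICS[metric]
    return (
        f"SELECT {group_by}, {m['measure']} AS {metric} "
        f"FROM {m['table']} WHERE {m['filter']} GROUP BY {group_by}"
    )

@lru_cache(maxsize=128)
def query(metric: str, group_by: str) -> str:
    # A real semantic layer would execute the SQL and cache the result set;
    # here we simply cache the generated SQL string.
    return to_sql(metric, group_by)

print(query("completed_orders", "country"))
```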


Immerok


Official website: https://www.immerok.io/

Founded year: 2022

Keywords: Apache Flink, cloud service

Funding round: Seed round

Last funding round year: 2022

Immerok is probably the youngest company to sponsor the Current conference, but what it does will be familiar to anyone working on real-time analytics: commercializing Apache Flink in the cloud. When it comes to Flink-related companies, you may think of Ververica, which Alibaba acquired in 2019. Immerok's relationship with Ververica is a close one: almost the entire founding team of Immerok came from Ververica. Since its establishment in the first half of this year, Immerok has raised a seed round of 17 million euros. Unlike Ververica, which offers on-premises deployments, Immerok is focused entirely on cloud services, which appears to be the future trend.


InfluxData


Official website: https://www.influxdata.com/

Founded year: 2012

Keywords: time series database

Funding round: Series D

Last funding round year: 2019

InfluxData is already very well known and needs little introduction. Its main product, InfluxDB, is a mainstream time series database in the industry. As a commercial open-source software company, InfluxData releases its core code under the permissive MIT license. Interestingly, only the stand-alone version is open source; the distributed version is completely closed source and sold commercially. InfluxData has also recently been rewriting its system kernel in Rust. Rewriting systems in Rust seems to be quite common in the industry these days.
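
For readers unfamiliar with InfluxDB, data points are written in its line protocol: a measurement name, optional tags, fields, and a timestamp on one line. Below is a small Python sketch that builds such a line; the measurement, tags, and field values are made up for illustration.

```python
# Minimal sketch of InfluxDB's line protocol, where each point has the form
#   measurement,tag_key=tag_value field_key=field_value timestamp
import time

def to_line_protocol(measurement, tags, fields, ts_ns=None):
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    ts = ts_ns if ts_ns is not None else time.time_ns()
    return f"{measurement},{tag_str} {field_str} {ts}"

line = to_line_protocol(
    "cpu_load",
    tags={"host": "server01", "region": "us-west"},
    fields={"value": 0.64},
)
print(line)  # e.g. cpu_load,host=server01,region=us-west value=0.64 1660000000000000000
```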


Macrometa


Official website: https://www.macrometa.com/

Founded year: 2017

Keywords: real-time API

Funding round: Series A

Last funding round year: 2021

Macrometa provides real-time data API services. What exactly is a real-time data API service? Essentially, Macrometa can be regarded as a global, real-time, multi-model database. Users write to the database through an API or connect real-time event sources, and the underlying system achieves global real-time synchronization through CRDTs. Users can read, write, and cache data just as they would with a distributed database spanning multiple regions, and they can directly query the data written in real time. Compared to other real-time data services, Macrometa has two specialties. First, it encapsulates the underlying technology into services and exposes a rich set of data models and APIs on top, such as a key-value store, a document database, a graph database, and pub/sub. Second, Macrometa is heavily focused on edge computing: users with a large demand for edge computing can access real-time data simply by calling its API.
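
To illustrate why CRDTs make global multi-region synchronization possible, here is a toy grow-only counter, one of the simplest CRDTs, in Python. It is purely a conceptual sketch and has nothing to do with Macrometa's actual implementation.

```python
# Toy grow-only counter CRDT: each region only increments its own slot,
# and any two replicas can merge by taking element-wise maxima, so all
# regions converge to the same value regardless of merge order.
class GCounter:
    def __init__(self):
        self.counts = {}  # region -> local count

    def increment(self, region, n=1):
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other):
        for region, n in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), n)

    def value(self):
        return sum(self.counts.values())

us, eu = GCounter(), GCounter()
us.increment("us-east", 3)   # writes land in different regions...
eu.increment("eu-west", 2)
us.merge(eu)                 # ...and replicas converge after merging
eu.merge(us)
assert us.value() == eu.value() == 5
```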


Oxylabs


Official website: https://oxylabs.io/

Founded year: 2015

Keywords: Gateway, SERP scraper, SEO

Funding round: Unknown

Last funding round year: Unknown

If you look at Oxylabs' official website, you will find that its core business is network proxying, such as IP address proxies and data center gateways, which seems to have nothing to do with real-time systems. But look closer and you will find that another pillar of Oxylabs' business is real-time SERP scraping. I was relatively unfamiliar with this term, so I did some research; let me briefly introduce it. SERP stands for Search Engine Results Page, and a SERP scraper automatically tracks the results that search engines return for certain keywords, including advertisements, related queries, and web page rankings, and returns those results to users in a structured format. SERP scraping is mainly used by marketing departments to analyze the SEO of their own products and of competitors. The core market value here is providing SaaS services that lower the technical bar. Oxylabs goes a step further by providing automated real-time SERP scraping, which requires a comprehensive real-time data stack, from data crawling to real-time analysis to delivery to users. Oxylabs has been around for seven years and has no public funding records, yet the company has expanded from Lithuania to the rest of the world. This also shows that applications of real-time data can be of really high value.


Quix


Official website: https://quix.io/

Founded year: 2020

Keywords: Python, data science, data engineering

Funding round: Seed round

Last funding round year: 2021

Earlier stream processing platforms were designed for programmers familiar with low-level APIs and typically only offered language interfaces such as Java. For data scientists, however, the dominant programming language is Python. Herein lies an opportunity: simplifying stream processing for users who are more comfortable in Python, such as data scientists and data engineers. Quix is a stream processing platform aimed mainly at such high-level languages. It was founded in the UK by several data scientists from McLaren (you read that right, the company that sells luxury sports cars) and has been a stream processing platform for data scientists and engineers since its inception. It relies on message queues such as Kafka for data input and output and provides streaming data processing services hosted on the cloud. In addition to Python, I noticed from their official website that they have added support for C#.
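
For context on what Quix is simplifying, here is a rough sketch of what consuming a Kafka stream in plain Python with the confluent-kafka client looks like; this is generic Kafka plumbing of the kind a platform like Quix aims to abstract away, not Quix's own API, and the broker address, topic, and threshold are placeholders.

```python
# Rough sketch: hand-rolled Kafka consumption and filtering in plain Python.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "sensor-analytics",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sensor-readings"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        reading = json.loads(msg.value())
        if reading.get("temperature", 0) > 90:
            print("hot sensor:", reading)
finally:
    consumer.close()
```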


Redpanda


Official website: https://redpanda.com/

Founded year: 2019

Keywords: message queue, Apache Kafka compatible

Funding round: Series B

Last funding round year: 2022

Redpanda competes directly with Apache Kafka. If Red Hat is a commercial distribution of Linux, then Redpanda is a commercial distribution of Kafka: its interface is fully compatible with Kafka. Redpanda's main pitch against Kafka is cost efficiency: it claims to be ten times more performant than Kafka, with more than six times better hardware efficiency. As a C++ project, Redpanda has another big selling point: it completely abandons the JVM dependency. When installing Redpanda, users no longer need to install JVM-ecosystem components such as ZooKeeper. I highly agree with this idea. In the big data era, the installation and maintenance complexity of the Hadoop ecosystem was far too high; today it is clear that minimalist deployment and operations are a solid competitive advantage over existing technologies.
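
Because Redpanda is Kafka-compatible, an ordinary Kafka client should work unchanged; the minimal sketch below uses the Python confluent-kafka producer, with only the broker address (a placeholder here) pointing at a Redpanda node instead of a Kafka broker.

```python
# Producing to a Kafka-compatible broker (e.g. Redpanda) with a standard client.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # Redpanda broker

def on_delivery(err, msg):
    # Called once the broker acknowledges (or rejects) the message.
    if err is not None:
        print("delivery failed:", err)
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}]")

producer.produce("events", key=b"user-42", value=b'{"action": "login"}',
                 on_delivery=on_delivery)
producer.flush()  # wait for outstanding deliveries before exiting
```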


Rockset


Official website: https://rockset.com/

Founded year: 2016

Keywords: real-time analytics, database

Funding round: Series B

Last funding round year: 2020

Rockset is a real-time analytics database. There are already many open-source products in this field, yet Rockset is one of the few closed-source ones. In the early days, Rockset was not actually positioned for real-time analytics (OLAP). The founding team is the same group of people who built RocksDB and HDFS at Facebook, and indeed their product is built on RocksDB. I have been following them since they were a team of only about 10 people. I still remember that what they did earliest was essentially SQL on raw data, that is, querying raw data such as semi-structured data and JSON. The product then gradually evolved into a so-called indexing database, and only in the past two years has it been fully positioned as a real-time analytics database. What attracts me most is that back in 2016 they already predicted that data would move to the cloud and that more and more raw data would be retained. Looking back at the products they built from that starting point, there is no doubt that they were ahead of their time.


StarTree


Official website: https://www.startree.ai/

Founded year: 2018

Keywords: Apache Pinot, real-time analytics, database

Funding round: Series B

Last funding round year: 2022

StarTree is a rookie in the real-time analytics database field. Although founded only recently, it has already attracted good attention in Silicon Valley. Beyond the product itself, their VP of Developer Relations, Tim Berglund, has also drawn plenty of attention. StarTree's core business is commercializing Apache Pinot, an open-source real-time analytics database. Compared with other real-time analytics databases, Pinot focuses more on high-concurrency queries, which many user-facing scenarios require (such as LinkedIn's "Who's viewed your profile" feature). In addition to selling its real-time analytics database in the cloud, StarTree offers a SaaS service called ThirdEye that uses Pinot for data anomaly detection. This reflects a broader trend: infrastructure companies are building out their own SaaS layers.


Striim


Official website: https://www.striim.com/

Founded year: 2012

Keywords: data integration, stream processing

Funding round: Series C

Last funding round year: 2021

Striim is a company specializing in data integration. It was founded by the original Oracle GoldenGate team. GoldenGate was acquired by Oracle in 2009, where the team focused on data import and export for the Oracle database. Striim's current core business is the same as GoldenGate's: database-to-database data integration solutions. Because it was founded relatively early, Striim mainly offered on-premises deployments in its early days, but in recent years, with the rise of the cloud, it has expanded into cloud services. What impresses me about the company is how friendly its products are to the data ecosystem: Striim Cloud supports all mainstream databases, data services, and cloud platforms. The company has even developed dozens of connectors and released them in the cloud vendors' application marketplaces, which greatly reduces the complexity of onboarding for users.


Swim


Official website: https://www.swim.inc/

Founded year: 2015

Keywords: real-time data analysis, real-time application

Funding round: Series B

Last funding round year: 2019

Swim's product mainly helps developers build and manage real-time applications on top of streaming data. Their open-source product Swim OS provides a framework for building real-time applications, while the commercial product Swim Continuum adds stream data source management, real-time analysis and visualization of streaming data, and monitoring of application health. It combines application monitoring and business analytics into one platform. In other words, the company provides not a single piece of middleware but a full real-time data processing and analysis solution for business users.


Tinybird


Official website: https://www.tinybird.co/

Founded year: 2019

Keywords: real-time data analysis, real-time API

Funding round: Series A

Last funding round year: 2022

Tinybird is a real-time data analytics company. Data sources are plentiful, but to build an application, the application side still needs to access data through an API, and what Tinybird builds is the bridge from data source to API. Its product supports multiple types of data sources; developers use SQL to transform and process the data and then expose those queries as API endpoints. Downstream applications simply call these endpoints to access the latest data instantly, with no need to build complex data pipelines. From a technical point of view, Tinybird uses the currently popular ClickHouse for data processing.
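
As a purely hypothetical illustration of the developer experience described above, the sketch below calls a published query endpoint with Python's requests library; the URL, parameters, and token are placeholders of my own invention, not Tinybird's documented interface.

```python
# Hypothetical sketch: a downstream application consuming a published
# query-as-API endpoint. All names and the URL are placeholders.
import requests

resp = requests.get(
    "https://api.example.com/endpoints/top_products.json",  # published query endpoint
    params={"country": "US", "limit": 10},                   # filter parameters
    headers={"Authorization": "Bearer <TOKEN>"},              # API token
    timeout=10,
)
resp.raise_for_status()
for row in resp.json().get("data", []):
    print(row)
```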


Airbyte


Official website: https://airbyte.com/

Founded year: 2020

Keywords: data integration, ELT/ETL

Funding round: Series B

Last funding round year: 2021

Airbyte is a fast-growing, Silicon Valley-based data integration company. Its product can be seen as an open-source alternative to Fivetran. Specifically, Airbyte is a data connector platform that supports moving data from many kinds of sources (applications, APIs, message streams, databases, etc.) into target data systems (databases, data warehouses, data lakes, etc.). Unlike many stream computing systems, which require cleaned, structured data or complex data-cleaning logic written by engineers, Airbyte supports end-to-end data integration directly: with simple SaaS-style configuration, data exchange between more than 100 different systems can be set up easily. Thanks to active contributions from the open-source community, the number of Airbyte data connectors has exceeded 150. An official development kit is also provided, and developers can build custom connectors without spending much time (the official claim is within 30 minutes). Airbyte is very popular thanks to its ease of use and its rich documentation and support resources.


CelerData


Official website: https://celerdata.com/index

Founded year: 2020

Keywords: real-time data analysis, database

Funding round: Unknown

Last funding round year: Unknown

CelerData is a new company founded in the United States by StarRocks, a real-time analytical database startup. StarRocks is a commercial product stemming from the open-source project Apache Doris. Similar to Rockset, StarTree, Imply, and others, StarRocks can efficiently handle complex analytical requests. Its interface is MySQL-compatible, and in terms of performance it claims to significantly outperform similar products.


DeltaStream


Official website: https://www.deltastream.io/

Founded year: 2020

Keywords: stream processing, database

Funding round: Seed round

Last funding round year: 2022

DeltaStream is a streaming database company founded at the end of 2020. Its founder, Hojjat Jafarpour, also created Confluent's KSQL project. DeltaStream provides a serverless streaming database for managing and processing data streams in real time. DeltaStream itself does not contain a storage module; instead, it treats streaming storage platforms such as Kafka and AWS Kinesis, or static data sources such as AWS S3, as its storage layer. It lets users read data from one or more sources, perform computations, and write the results to different storage components simultaneously. Internally, DeltaStream uses Apache Flink SQL as its engine.


Factorhouse


Official website: https://factorhouse.io

Founded year: 2020

Keywords: Apache Kafka, cloud service

Funding round: None

Last funding round year: None

Factor House (known as Operatr.IO before a rebranding in September 2022) is a three-person (3!) team based in Australia; its CEO and COO are Derek Troy-West and Kylie Troy-West, respectively. The main product, Kpow, is a web-based visualization tool designed specifically for Apache Kafka that helps enterprise users manage and monitor Kafka resources. Kpow lets users visualize, query, and export real-time data, greatly improving Kafka's observability and ease of maintenance, and it can manage Kafka clusters and topics without complex command-line tools. After chatting with the team, I learned that they started developing the platform at home during the pandemic. They have not taken any funding so far, yet they already have customers.


LakeFS


Official website: https://lakefs.io

Founded year: 2020

Keywords: git-like, multi-version, data lake

Funding round: Series A

Last funding round year: 2021

LakeFS is developed by Treeverse. It allows users to manage data in a data lake the way they manage code: branching, committing, merging, and reverting are all a piece of cake. Data lakes, especially very large ones, are difficult to manage, because the object storage they rely on lacks critical features such as atomicity, rollback, and reproducibility, which hurts data quality and recoverability. In the past, when using a data lake, we would often create a copy of the production environment, test data changes on the copy first, and apply them to production only once they were ready. The problem is that this approach is time-consuming and expensive, and it makes collaboration difficult. LakeFS turns object storage into a Git-like repository without duplicating any data, supporting multi-person collaboration, allowing data to be ingested safely, and reducing the occurrence of errors. Even when errors do occur, the corrupted data can be rolled back atomically, directly in the production environment.
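
As a rough sketch of how branch isolation can look in practice: lakeFS exposes an S3-compatible endpoint in which, as I understand it, the repository acts as the bucket and the branch name prefixes the object key, so a standard client such as boto3 can read the same path from two branches. The endpoint, credentials, repository, and branch names below are placeholders.

```python
# Sketch: reading the same object path from two branches through an
# S3-compatible lakeFS gateway (all names and credentials are placeholders).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # lakeFS gateway, not AWS S3
    aws_access_key_id="<LAKEFS_KEY>",
    aws_secret_access_key="<LAKEFS_SECRET>",
)

prod = s3.get_object(Bucket="my-repo", Key="main/events/2022-10-05.parquet")
test = s3.get_object(Bucket="my-repo", Key="experiment/events/2022-10-05.parquet")
print(len(prod["Body"].read()), len(test["Body"].read()))
```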


Memgraph


Official website: https://memgraph.com

Founded year: 2016

Keywords: graph database, real-time analysis

Funding round: Seed round

Last funding round year: 2021

Memgraph is a low-latency, high-performance in-memory graph database that handles both transactional and analytical graph workloads well. Memgraph can analyze data from multiple data sources and discover potential connections between them, allowing users to run graph algorithms over the data and build their own real-time applications. CEO Dominik Tomicevic mentioned that Memgraph's most typical users come from the chemical, manufacturing, and financial industries, and they all have one thing in common: they need real-time analysis over scattered data.
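
Since Memgraph speaks the Bolt protocol and openCypher, a standard Cypher client can be used against it. Below is a small sketch with the neo4j Python driver; the address, credentials, and data model are placeholders, and this is not an official Memgraph example.

```python
# Sketch: linking two nodes and querying their relationship over Bolt/Cypher.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))  # placeholder auth

with driver.session() as session:
    # Link two entities, then ask what one of them is connected to.
    session.run(
        "MERGE (a:Account {id: $a}) MERGE (b:Account {id: $b}) "
        "MERGE (a)-[:TRANSFERRED_TO]->(b)",
        a="acct-1", b="acct-2",
    )
    result = session.run(
        "MATCH (a:Account {id: $a})-[:TRANSFERRED_TO]->(b) RETURN b.id",
        a="acct-1",
    )
    print([record["b.id"] for record in result])

driver.close()
```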


Meroxa


Official website: https://meroxa.com

Founded year: 2019

Keywords: code-first, real-time analysis

Funding round: Series A

Last funding round year: 2021

With Meroxa, users can build, test, and deploy real-time data applications in days. Meroxa is developer-centric, code-first tooling that lets software engineers spend their time building data products rather than maintaining fragile data systems that were never designed for developers. Meroxa's goal is to help developers focus on building applications with real-time data rather than on automating repetitive operational chores. Their vision is to make Meroxa the industry-leading Data Application Platform as a Service (DAPaaS).


FeatureBase


Official website: https://www.featurebase.com/

Founded year: 2017

Keywords: real-time analysis, bitmap

Funding round: Unknown

Last funding round year: Unknown

FeatureBase, based in Austin, Texas, is the new name that emerged from the recent merger of Molecula Corporation and the Pilosa project. Its product is an OLAP database that uses bitmaps for data indexing. Specifically, FeatureBase converts the data stored in traditional OLAP columns into bitmap-based features, thereby achieving better read and write performance and resource efficiency. FeatureBase also treats streaming data as an important focus, emphasizing the data freshness that bitmap-based streaming updates make possible. FeatureBase can index structured data well, but it cannot handle unstructured data. The product comes in two service modes, open source and cloud service, and supports two query interfaces, SQL and its custom PQL.
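
To give an intuition for why bitmap indexes help, here is a toy Python sketch of the idea (not FeatureBase's implementation): each distinct column value keeps a bitmap of the rows that contain it, so filters and counts reduce to cheap bitwise operations.

```python
# Toy bitmap index: one bitmap (an int used as a bit set) per distinct value.
rows = ["US", "DE", "US", "FR", "US", "DE"]  # a "country" column, row ids 0..5

bitmaps = {}
for row_id, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)

# "WHERE country IN ('US', 'DE')" becomes a single bitwise OR...
matches = bitmaps["US"] | bitmaps["DE"]
# ...and "COUNT(*)" over that filter is a population count of the bits.
print(bin(matches), bin(matches).count("1"))  # 0b110111, 5
```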


Nussknacker


Official website: https://nussknacker.io/

Founded year: 2016

Keywords: real-time analytics, visualization

Funding round: Unknown

Last funding round year: Unknown

Nussknacker is a visual real-time analytics tool. Its target users are managers, analysts, and others accustomed to interactive tools such as Excel. Through visual operations on a web page, users can build analysis and processing logic over data streams without writing code. Nussknacker uses Kafka as its main input and output interface; for simple analytical queries it runs its own lightweight engine, while advanced, complex aggregations are processed on Flink. Nussknacker lowers the bar for building real-time data processing and analysis: business teams can deploy and test processing logic without writing code or relying on professional developers.


Timeplus


Official website: https://www.timeplus.com/

Founded year: 2021

Keywords: stream processing, database

Funding round: Seed round

Last funding round year: 2022

Timeplus was founded by a group of senior experts from Splunk's engineering team. Their core product is a streaming database on the cloud; users can register on the official website and apply for beta access. At first glance, the interactive interface offers a good experience. So far their product is entirely closed source, a business model that lets the team focus more on commercialization.

Summary

Thanks for reading. This blog post is a comprehensive overview of where startups in the field of real-time data systems are heading, which I believe is also the most exciting direction in data infrastructure in recent years. If you are interested in this space, or in RisingWave's open-source and cloud products, do not hesitate to contact us. I believe real-time data systems will advance by leaps and bounds in the near future.


Plug

(Featured image: my talk at the Current conference)

In fact, when I went to the Current conference this time, in addition to chatting with peers and promoting RisingWave, I also gave a technical talk titled "Rethinking State Management in Cloud-Native Streaming Systems." The talk is quite technical and walks through some of RisingWave's internal implementation. If you are interested, you can check out my presentation slides here; the full video is also available on the Current conference website. The RisingWave source code can be accessed here, and the cloud private preview is here. Please check them out and leave your comments!
