TL;DR: It’s a challenging market, yet it holds promising prospects.
My friend Michael Drogalis recently shared a post on LinkedIn that went viral. In it, he voiced his concerns about the current cohort of stream processing startups. He's particularly nervous about the uncertainty surrounding the stream processing market, primarily because the technology's core benefits have yet to see widespread adoption.
I know Michael quite well and believe in his genuine passion for stream processing; he truly hopes that every startup in this market succeeds. As the founder of RisingWave, a stream processing startup, I can vouch for the accuracy of his observations: there are numerous challenges in this market that understandably make stakeholders anxious. Despite this, I continue to invest heavily in stream processing, confident in its potential to flourish.
Now, let’s explore why I think stream processing is a challenging market, yet why I remain committed to investing in it.
From Kafka and beyond
The primary function of a stream processing system is to continuously process streaming data, and Apache Kafka is one of the main sources of such data. Stream processing systems like Apache Flink and RisingWave treat Kafka as one of their most important upstream systems, consuming and processing data from it extensively. Within venture capital circles, there's a prevalent assumption that the stream processing market should be at least as large as the Kafka market: if a dollar can be earned from Kafka, an equivalent dollar should be attainable from stream processing systems like Flink and RisingWave. The validity of this assumption is questionable, however. While stream processing systems often depend on Kafka as a primary upstream system, Kafka does not necessarily require a stream processing system downstream. This raises the question: why do companies choose Kafka in the first place?
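For concreteness, here is what “consuming from Kafka” looks like in a SQL stream processor, using RisingWave as the example. This is only a minimal sketch: RisingWave speaks the Postgres wire protocol, so a stock Postgres driver can issue the DDL, but the topic, columns, and connector properties below are invented for illustration, and exact property names vary by version.

```python
# Minimal sketch: registering a Kafka topic as a source in RisingWave.
# RisingWave is Postgres-wire-compatible, so psycopg2 works as the client.
# Topic, columns, and connector properties are illustrative, not prescriptive.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("""
        CREATE SOURCE clicks (
            user_id VARCHAR,
            url     VARCHAR,
            at      TIMESTAMPTZ
        ) WITH (
            connector = 'kafka',
            topic = 'clicks',
            properties.bootstrap.server = 'localhost:9092'
        ) FORMAT PLAIN ENCODE JSON;
    """)
```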
Kafka’s Use Cases
From my experience, the two primary use cases for Kafka are:
- Decoupling Systems: If a company needs to send data from one system to multiple others, Kafka is often essential. It serves as a buffer between data producers and consumers, allowing them to operate at different speeds or scales, ensuring smooth data flow without system overload.
- Buffering Data: For handling large volumes of data continuously emitted from various sources, such as IoT devices or websites, Kafka is invaluable. Traditional databases like PostgreSQL aren't built to manage such high-throughput, real-time data streams. (Both patterns are sketched in the example after this list.)
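To make these two use cases concrete, here is a hypothetical sketch using the confluent-kafka Python client: one producer writes a click event once, and two independent consumer groups (say, analytics and archiving) each read the same stream at their own pace. Topic and group names are invented.

```python
# Hypothetical sketch: one producer, two independent consumer groups.
# Kafka decouples them; each group tracks its own offsets, so a slow
# archiver never holds back a fast analytics pipeline.
import json
from confluent_kafka import Consumer, Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("clicks", value=json.dumps({"user": "a", "url": "/home"}))
producer.flush()  # the broker now buffers the event durably

def make_consumer(group_id: str) -> Consumer:
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,  # separate groups keep separate offsets
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["clicks"])
    return consumer

for consumer in (make_consumer("analytics"), make_consumer("archiver")):
    msg = consumer.poll(10.0)  # both groups receive the same event
    if msg is not None and not msg.error():
        print(msg.value())
```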
For the first use case, a stream processing system clearly isn't always necessary. But what about the second? Unfortunately, as Michael noted in his LinkedIn post, the necessity of stream processing systems is debatable here too. Many organizations bypass these systems entirely, opting instead to route data directly into data warehouses or lakes. The reason is straightforward: not all organizations need to extract real-time insights from fresh data; many are content with performing periodic computations on batch data.
Challenge: Not all Kafka users need a stream processing system.
ETL vs. ELT
Many people use Kafka to send data from different sources to data warehouses or data lakes. A common debate is whether to use stream processing systems to pre-process the data (ETL) before it reaches its final destination. Thanks to Snowflake and other data warehouse vendors, a shift from ETL to ELT is underway. Snowflake, although not an ELT vendor per se, heavily promotes ELT: the approach encourages users to store raw data and perform repeated computations inside Snowflake, thereby increasing their spend, a strategic move in Snowflake's marketing playbook.
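The distinction is easiest to see in code. The toy sketch below uses an in-memory SQLite database as a stand-in for a warehouse (assuming a Python build whose SQLite includes the JSON1 functions, the default in recent releases); all table and field names are invented. ETL cleans records in flight and loads only the results, while ELT loads raw payloads first and runs the transformation as SQL inside the warehouse, where every re-run consumes warehouse compute.

```python
# Toy contrast between ETL and ELT; SQLite stands in for the warehouse.
import json
import sqlite3

raw_lines = [
    '{"username": "a", "amount": 10}',
    '{"username": "b", "amount": -5}',  # a bad record to filter out
]
warehouse = sqlite3.connect(":memory:")

# ETL: transform before loading; the warehouse only ever sees clean rows.
warehouse.execute("CREATE TABLE clean_events (username TEXT, amount INT)")
for line in raw_lines:
    event = json.loads(line)
    if event["amount"] > 0:  # validation happens in flight
        warehouse.execute("INSERT INTO clean_events VALUES (?, ?)",
                          (event["username"], event["amount"]))

# ELT: load raw payloads as-is, then transform with SQL in the warehouse.
# Re-running this transformation burns warehouse compute each time.
warehouse.execute("CREATE TABLE raw_events (payload TEXT)")
warehouse.executemany("INSERT INTO raw_events VALUES (?)",
                      [(line,) for line in raw_lines])
warehouse.execute("""
    CREATE TABLE clean_events_elt AS
    SELECT json_extract(payload, '$.username') AS username,
           json_extract(payload, '$.amount')   AS amount
    FROM raw_events
    WHERE json_extract(payload, '$.amount') > 0
""")
```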
Despite this shift, examining the current ETL market reveals that stream processing systems are not indispensable. Here’s why:
- ETL Doesn’t Necessarily Require Real Time: Most companies do not need data generated within the last five minutes to appear in their data warehouses; delays of ten minutes to an hour are generally acceptable. Consider Databricks: before its substantial investment in data lakehouses, its core business revolved around providing Spark for batch-based ETL.
- Availability of Specialized ETL Tools: The market is replete with advanced ETL tools that simplify and enhance data processing efficiency. Competing in this space is challenging, and success requires a deep understanding of user experiences with these tools. Note that most ETL users do not need a stream processing system; rather, they need a tool that can manage their data movements effectively and efficiently.
Challenge: The transition from ETL to ELT creates a competitive environment for stream processing vendors, who now need to contend with well-established ETL providers.
Data Processing for Kafka
Stream processing systems can undoubtedly handle much more than just ETL when paired with Kafka. Many companies require capabilities such as real-time monitoring, alerting, and other specific applications, making systems like Flink and RisingWave potentially suitable choices. However, it's important to note that Kafka itself includes Kafka Streams, a built-in module capable of addressing a wide range of stream processing use cases. Organizations that opt for standalone systems like Flink and RisingWave to process their streaming data often have more complex or specialized needs. This scenario significantly raises the competitive bar for stream processing vendors looking to attract and retain customers.
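Kafka Streams is a Java library, so the sketch below is only a rough Python analogue of the pattern it packages: consume, transform, produce. Topic names and the alert threshold are invented. What Kafka Streams layers on top of this hand-rolled loop, such as state stores, rebalancing, and fault tolerance, is precisely what a standalone engine must outdo to justify its presence.

```python
# Rough, hypothetical analogue of a Kafka Streams topology in plain Python:
# read payments, flag large ones, emit alerts to another topic.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "alerting",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    payment = json.loads(msg.value())
    if payment.get("amount", 0) > 10_000:  # illustrative threshold
        producer.produce("alerts", value=json.dumps(payment))
        producer.poll(0)  # serve delivery callbacks without blocking
```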
Challenge: Standalone systems such as Flink and RisingWave must compete with Kafka Streams, the integrated stream processing solution within Kafka.
Company size and sales cycle
While small companies can generate significant amounts of data, it is larger enterprises that most often need to manage substantial streaming data volumes, so Kafka is more likely to be found in sizeable companies. In those companies, the adoption of stream processing systems typically follows the integration of Kafka. Moreover, enterprise sales cycles are long: closing a deal with a large enterprise typically takes 6 to 12 months. It should not be surprising, then, that some stream processing companies (including us!) are still finalizing deals that began in the second half of 2023.
Challenge: Selling stream processing technology to enterprises is consistently challenging.
OLTP and OLAP, then stream processing?
While our discussion has focused heavily on Kafka, stream processing is not inherently dependent on it. A company's data stack typically evolves in a predictable order: it starts with an OLTP database, and when that no longer suffices for growing data needs, it adopts an OLAP database. But where does stream processing fit into this sequence? Typically, stream processing vendors focus on two primary use cases.
Data movement
The first major use case is data movement, including ETL/ELT processes. When data needs to move from operational databases to analytical platforms, a stream processing system can serve as an efficient bridge. Modern systems like Flink and RisingWave can handle database CDC (Change Data Capture) directly, without involving Kafka. However, some OLAP databases and data warehouse vendors, such as Snowflake with its Snowpipe feature, offer built-in data ingestion tools, so stream processing vendors must clearly articulate their value proposition in terms of cost and efficiency compared with these integrated solutions.
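As a sketch of the Kafka-free path, the snippet below asks RisingWave to ingest an operational Postgres table through its CDC connector. Connection details and connector parameters are illustrative, and exact parameter names vary across versions, so treat this as the shape of the workflow rather than a copy-paste recipe.

```python
# Hypothetical sketch: ingesting an operational Postgres table into
# RisingWave via CDC, with no Kafka in between. Connector parameter
# names are from memory and may differ by version; check the docs.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE orders_cdc (
            order_id INT PRIMARY KEY,
            amount   DECIMAL
        ) WITH (
            connector = 'postgres-cdc',
            hostname = 'pg.internal',
            port = '5432',
            username = 'replication_user',
            password = '...',
            database.name = 'shop',
            schema.name = 'public',
            table.name = 'orders'
        );
    """)
```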
Challenge: Beyond the challenges described in the ETL vs. ELT subsection, stream processing vendors also face competition from built-in data ingestion tools.
Real-time materialized views
The second use case revolves around real-time materialized views. Although one might assume that an OLAP database is the go-to choice for analytical workloads, many teams first try to serve those queries from their OLTP database; why add another system if the one you have meets your needs? This is where materialized views come in: they serve as a bridge to stream processing, demonstrating its capabilities within the familiar context of a database. Systems like RisingWave and Materialize are specifically optimized for this purpose.
As the creator of RisingWave, I see materialized views as a prime method for introducing stream processing to those already familiar with databases. However, several challenges remain. First, users often regard materialized views as an advanced feature, so some education is required. Second, because RisingWave operates as a separate system from common databases like PostgreSQL, potential users must decide whether they are willing to maintain an additional system.
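For a sense of the workflow, here is a minimal sketch against RisingWave, which speaks the Postgres wire protocol: define the view once, and the engine keeps it incrementally up to date as upstream data changes. The table and column names are invented, and an `orders` source is assumed to exist already.

```python
# Minimal sketch: a continuously maintained materialized view in RisingWave.
# Because RisingWave is Postgres-wire-compatible, an ordinary driver works.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
with conn.cursor() as cur:
    # Declared once; RisingWave updates it incrementally as rows arrive,
    # rather than recomputing the aggregation on every read.
    cur.execute("""
        CREATE MATERIALIZED VIEW revenue_per_user AS
        SELECT username, SUM(amount) AS total_revenue
        FROM orders
        GROUP BY username;
    """)
    # Reading the view back is a cheap, ordinary query.
    cur.execute("SELECT * FROM revenue_per_user ORDER BY total_revenue DESC;")
    print(cur.fetchall())
```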
Challenge: Effectively integrating stream processing as materialized views remains difficult.
Don’t Start a Stream Processing Business in 2024
If you're considering launching a new stream processing business in 2024, I would strongly advise against it. My reasons are not driven by fear of competition; I am confident that the system we've built over the past three years can compete with any new entrant. Here are my main concerns:
- Technology Constraints: There is little room left for substantial improvements in existing stream processing systems. Building database systems is no longer a realm of hidden secrets; progress is driven primarily by two factors, growing data volumes and advances in hardware. Over the last few years, data volumes have not spiked significantly, and existing systems are well-equipped to handle current workloads. Even with new technologies like S3 Express that enhance decoupled compute-storage architectures, I see no compelling reason to build yet another system around them. RisingWave, for instance, already implements this architecture and supports elastic workloads across numerous enterprises.
- Capital Efficiency: Building a data system from scratch today does not represent a wise allocation of capital. Unlike three years ago, when many stream processing startups began, today's venture capitalists are more inclined to invest in less capital-intensive projects, especially in the AI sector. There, they can fund many companies with lower initial costs, rather than sinking funds into hard-core database technologies.
Even if you plan to build your business on existing platforms like Flink or RisingWave, consider several factors: Are you the original creator of the system? What unique value can you offer? How will you differentiate from other vendors with similar offerings?
Where Opportunities Exist in Stream Processing
While I have listed many reasons why stream processing is a challenging market, that doesn't mean everyone should give up. Indeed, I am still investing in stream processing and remain confident about its future.
Several factors bolster my confidence that stream processing is still a promising field to explore.
First, enhancing the incremental computation model: Whenever people discuss stream processing, they highlight “low latency” as a key benefit. This makes stream processing particularly appealing for applications such as stock dashboards, fraud detection, and IoT monitoring. However, as Michael pointed out in his post, not all applications require such low latency, and many are quite satisfied with data freshness of five or ten minutes. This leads me to believe that low latency should not be the only selling point of stream processing. There's another significant advantage: incremental computation. Unlike traditional batch processing, which recomputes everything from scratch, stream processing computes only the incremental changes, potentially offering a solution far cheaper than conventional technologies. The toy sketch below makes the cost difference concrete.
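In the sketch, a batch-style refresh rescans all history on every run, while the incremental path folds each new event into standing state, doing work proportional to the change rather than to the full size of the data; both arrive at the same answer.

```python
# Toy illustration: batch recomputation vs. incremental maintenance
# of a per-user running total.
from collections import defaultdict

history = []               # every event ever seen
totals = defaultdict(int)  # incrementally maintained state

def batch_refresh() -> dict:
    """Recompute from scratch, as a scheduled batch job would: O(n) per run."""
    result = defaultdict(int)
    for user, amount in history:
        result[user] += amount
    return result

def on_event(user: str, amount: int) -> None:
    """Fold in just the delta, as a stream processor would: O(1) per event."""
    history.append((user, amount))
    totals[user] += amount

for event in [("a", 10), ("b", 5), ("a", 7)]:
    on_event(*event)

assert totals == batch_refresh()  # same answer, very different cost profile
print(dict(totals))               # {'a': 17, 'b': 5}
```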
Second, the database field needs more stream processing solutions: Kafka is widely recognized as a source of streaming data, but as database CDC technology matures, users of operational databases could directly benefit from stream processing systems. Conventional big data systems like Flink and Spark Streaming might be too heavy for these users, indicating substantial room for improvement. Shifting the focus from catering to Kafka users to addressing the needs of database-focused markets could open up new avenues for growth.
Third, building developer-friendly tooling: Consider the capabilities of stream processing—real-time ETL, monitoring, alerting, among others. While the technology for stream processing has matured, there is still a lack of products that enable users to easily develop specialized applications. Although SQL integration has made stream processing more accessible and cloud services have relieved users from managing complex distributed systems, the end goal should always be to solve practical problems. Vendors need to focus on helping users create impactful applications rather than continuously educating them about the technology itself. Neon and Supabase are two very popular developer-friendly database products that stream processing vendors could learn from.
The stream processing market is challenging, and I don't recommend starting a stream processing startup in 2024. Despite these hurdles, however, the sector remains promising, with substantial opportunities for startups not only to survive but to thrive. Winter has arrived, but spring is not far away. No great business is easy to build. If you aim to become a leader in stream processing or the broader data field, maintaining an optimistic outlook is essential. That is why I continue to invest tirelessly in the development of RisingWave: I believe stream processing is the most worthwhile direction to invest in within the data domain.