Stream Processing: Is SQL Good Enough?

Two weeks ago, Current 2023, the biggest data streaming event in the world, was held in San Jose. It stands out as one of my favorite events of 2023. Not only was the conference venue conveniently located just a 10-minute drive from my home, but it was also a unique gathering where data streaming experts from around the world openly discussed technology.

Current 2023, organized by Confluent, is one of the biggest events in the real-time data streaming space.

If you missed the event, fret not. My friend Yaroslav Tkachenko from Goldsky has penned a comprehensive blog post detailing the key insights from the event. Among the many insights he shared, one that particularly piqued my interest was his comment on streaming databases:

“They [Streaming databases] can cover 50% to 80% of use cases, but they’ll never arrive close to 100%. But they can be surprisingly sticky!”
“I approached some of the vendors and asked if they planned to introduce programmatic APIs down the road. Everyone said no - SQL should give them enough market share.”

As the founder of RisingWave, a leading streaming database (of which I am shamelessly proud), I found that Yaroslav's observations prompted a whirlwind of reflections. These reflections are not rooted in disagreement; on the contrary, I wholeheartedly concur with his viewpoints. His blog post has spurred me to contemplate SQL stream processing and streaming databases from various angles. I'm eager to share these musings with everyone in the data streaming community and the wider realm of data engineering.

SQL’s Expressiveness

Streaming databases enable users to process streaming data in the same way as they use databases, with SQL naturally being the primary language. Like most modern databases, RisingWave and several other streaming databases prioritize SQL, and they also offer User Defined Functions (UDFs) in languages like Python and Java. However, these databases do not really provide lower-level programmatic APIs.
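To make the "SQL plus UDF" model concrete, here is a minimal sketch in Python. It uses SQLite from the standard library purely as a stand-in engine; it is not RisingWave's UDF API, and the table, the column names, and the risk_bucket function are all hypothetical. The point is only the shape of the approach: the query stays in SQL, while one piece of custom logic is delegated to a function written in a general-purpose language.

```python
import sqlite3


def risk_bucket(amount: float) -> str:
    """Hypothetical business logic that is easier to write in Python than in SQL."""
    if amount >= 10_000:
        return "high"
    if amount >= 1_000:
        return "medium"
    return "low"


# SQLite is only a stand-in here; a streaming database would register
# UDFs through its own mechanism.
conn = sqlite3.connect(":memory:")
conn.create_function("risk_bucket", 1, risk_bucket)

conn.execute("CREATE TABLE payments (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("alice", 120.0), ("bob", 4_500.0), ("carol", 25_000.0)],
)

# The aggregation is plain SQL; only the bucketing calls back into Python.
rows = conn.execute(
    "SELECT risk_bucket(amount) AS bucket, COUNT(*) AS n "
    "FROM payments GROUP BY bucket ORDER BY bucket"
).fetchall()
print(rows)
```

Everything outside the UDF remains declarative, which is what makes this model approachable, but anything that does not fit a scalar function call still has to be expressed in SQL.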

So, the question we grapple with is: is the expressiveness of SQL (even with UDF support) sufficient?

In my conversations with hundreds of data streaming practitioners, I've found many who argue that SQL alone doesn't suffice for stream processing. The top three use cases that immediately come to my mind are: (1) (rule-based) fraud detection; (2) financial trading; (3) machine learning.

For fraud detection, many applications continue to rely on a rule-based approach. For these, Java often proves more straightforward for expressing rules and directly integrating with applications. Why? Primarily because numerous system backends are developed in Java. If the streaming data doesn't require persistence in a database, it becomes much easier to articulate the logic in a single, consistent programming language like Java.
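As a rough illustration of why imperative code feels natural for rules, here is a minimal sketch of a rule-based checker. It is written in Python rather than Java only for brevity, and everything in it is hypothetical: the Txn record, the two rules, and their thresholds are illustrative assumptions, not taken from any real fraud system. It also assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Txn:
    user_id: str
    amount: float
    ts: float  # event time in seconds


class VelocityRule:
    """Fires when a user makes more than max_txns transactions within window_s seconds."""

    def __init__(self, max_txns: int, window_s: float):
        self.max_txns = max_txns
        self.window_s = window_s
        self._recent = defaultdict(deque)  # user_id -> timestamps inside the window

    def check(self, txn: Txn) -> bool:
        q = self._recent[txn.user_id]
        q.append(txn.ts)
        # Evict timestamps that have fallen out of the sliding window.
        while q and q[0] <= txn.ts - self.window_s:
            q.popleft()
        return len(q) > self.max_txns


class LargeAmountRule:
    """Fires when a single transaction exceeds a fixed threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def check(self, txn: Txn) -> bool:
        return txn.amount > self.threshold


def evaluate(txn, rules):
    """Run every rule (so stateful rules always see the event) and name those that fired."""
    return [type(rule).__name__ for rule in rules if rule.check(txn)]


rules = [VelocityRule(max_txns=3, window_s=60.0), LargeAmountRule(threshold=10_000.0)]
stream = [
    Txn("alice", 20.0, 0.0),
    Txn("alice", 30.0, 10.0),
    Txn("alice", 15_000.0, 20.0),
    Txn("alice", 5.0, 30.0),
    Txn("bob", 10.0, 31.0),
    Txn("alice", 5.0, 100.0),
]

alerts = []
for txn in stream:
    fired = evaluate(txn, rules)
    if fired:
        alerts.append((txn.ts, fired))
print(alerts)
```

Each rule is an ordinary object that can live in the same codebase as the application and be unit tested there, which is the integration convenience the argument above refers to.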

In financial trading, there are concerns about the limitations of SQL, especially when it comes to handling specialized expressions that go beyond standard SQL. While it is possible to embed this logic in User-Defined Functions (UDFs), there are concerns about the increased latency that UDFs can introduce. Traditional UDF implementations, which often involve hosting a UDF server, are known for causing significant delays.

When it comes to machine learning use cases, practitioners have a strong preference for Python. They predominantly develop their applications using Python and rely heavily on popular libraries such as pandas and NumPy. Using SQL to express their logic is not an instinctive choice for them.

While there are numerous real-world scenarios where SQL, with or without UDF support, adequately meets the requirements (you can refer to some of RisingWave's use cases in fintech and machine learning), it's important to acknowledge that SQL may not match the expressiveness of Java or Python. However, it's worth asking whether people whose stream processing needs SQL can fulfill would choose SQL-centric interfaces or continue relying on Java-centric frameworks. In my opinion, most would opt for SQL. The key point of my argument lies in the widespread adoption and fundamental nature of SQL. Every data engineer, analyst, and scientist is familiar with it. If basic tools can fulfill their needs, why complicate matters with a more complex solution?

SQL’s Audience

Most data systems, like Hadoop, Spark, Hive, Flink, that emerged during the big data era were Java-centric. However, newer systems, like ClickHouse, RisingWave, and DuckDB, are fundamentally database systems which prioritize SQL. Who exactly uses these Java-centric systems, and who exactly uses SQL-centric systems?

I often find it challenging to convince established companies founded before 2015, such as LinkedIn, Uber, and Pinterest, to adopt SQL for stream processing. To be sure, many of them didn’t like my pitch simply because they already have well-established data infrastructures and prefer to focus on developing more application-level projects (well, I know these days many companies are looking into LLMs!). But a closer examination of their data infrastructure reveals some patterns:

These companies began with their own data centers;
They started with a Java-centric big data ecosystem, including technologies like Hadoop, Spark, and Hive;
They maintain extensive engineering teams dedicated to data infrastructure;
Even when transitioning to the cloud, they lean towards building custom data infrastructure on platforms like EC2, S3, or EKS, rather than opting for hosted data services.

Several factors hinder the adoption of SQL stream processing by these enterprises:

The enterprises have a codebase that is entirely in Java. Although SQL databases do provide Java UDFs, it is not always feasible to frequently invoke Java code within these UDFs.
Some of the big data systems they utilize are customized to their specific requirements, making migration exceptionally difficult.
Integrating with extensive big data ecosystems often requires significant engineering efforts.
There is also a human element to consider. Teams that have been maintaining Java-centric systems for a long time may perceive a shift to hosted services as a threat to their job security.

While selling SQL-centric systems to these corporations can be daunting, don’t be disheartened. We do have success stories. There are some promising indicators to watch for:

Companies transitioning to data warehouses like Snowflake, Redshift, or BigQuery signal a positive shift. The rising endorsement of the "modern data stack" prompts these companies to recenter their data infrastructure around these SQL-centric warehouses. As a result, they are less inclined to manage their own infrastructure or stick with Java-centric systems;
Another interesting signal to consider is the occurrence of business shifts and leadership changes. Whether driven by changes in business priorities or new leadership taking charge, these transitions often trigger a reevaluation of existing systems. New leaders may not be inclined to simply maintain the status quo (as there may be limited ROI in doing so) and may be more receptive to exploring alternatives, such as upgrading their data infrastructure.

Interestingly, the growing popularity of SQL stream processing owes much to the support and promotion from Java-centric big data technologies like Apache Spark Streaming and Apache Flink. While these platforms initially started with a Java interface, they have increasingly recognized the importance of SQL. The prevailing trend suggests that most newcomers to these platforms begin their journey with SQL, which predisposes them towards the SQL ecosystem rather than the Java ecosystem. Moreover, even if they initially adopted Java-centric systems, transitioning to SQL streaming databases in the future may be a more seamless pivot than anticipated.

Market Size of SQL Stream Processing

Before delving into the market size of SQL stream processing, it's essential to first consider the broader data streaming market. We must recognize that the data streaming market, as it stands, is somewhat niche compared to the batch data processing market. Debating this would be fruitless. A simple comparison of today's market value for Confluent (the leading streaming company) with that of Snowflake (the dominant batch company) illustrates this point. Regardless of its current stature, the streaming market is undoubtedly booming. An increasing amount of venture capital is being invested, and major data infrastructure players, including Snowflake, Databricks, and MongoDB, are beginning to develop their own modern streaming systems.

Today’s market cap of Confluent, the leading streaming company, is $9.08B. Today’s market cap of Snowflake, the dominant batch company, is $53.33B.

It's plausible to suggest that the stream processing market will eventually mirror the batch processing market in its patterns and trends. So, within the batch processing market, what's the size of the SQL segment? The revenue figures for SQL-centric products, such as Snowflake, Redshift, and BigQuery, speak volumes. Then what about the market for products that primarily offer a Java interface? At least in the data infrastructure space, I don’t see any strong cash cow at the moment. Some may mention Databricks, the rapidly growing pre-IPO company commercializing Spark. While no one can deny that Spark is the most widely used big data system in the world, a closer look at Databricks’ offerings and marketing strategies soon leads to the conclusion that the SQL-centric data lakehouse is what they are betting on.

The streaming world vs the batching world.

This observation raises a paradox: SQL's expressiveness might be limited compared to Java, yet SQL-centric data systems manage to generate more revenue. Why is that?

Firstly, as highlighted in Yaroslav’s blog, SQL caters to approximately 50-80% of use cases. While the exact figure remains elusive, it's evident that SQL suffices for a significant proportion of organizational needs. Hypothetically, if a company determines that SQL stream processing aligns with its use cases, which system would they likely opt for? A SQL-centric one or a Java-centric one? If you're unsure, consider this analogy: if you aim to cut a beef rib, would you opt for a specialized beef rib knife or a Swiss Army knife? The preference is clear.

Secondly, consider the audience for Java. Individuals with a computer science background might proficiently navigate Java and grasp system-specific Java APIs. However, expecting those without such a background to master Java is unrealistic. Even if they did, wouldn't they prefer managing the service independently? While it's not absolute, companies boasting a robust engineering team seem less inclined to outsource.

More Than Just Stream Processing

While we've extensively discussed SQL stream processing, it's time to pivot our attention to streaming databases. Classic stream processing engines like Spark Streaming and Flink have incorporated SQL. As previously mentioned, these engines have begun to use SQL as an entry-level language. Vendors are primarily building on SQL, with Confluent’s Flink offering standing as a notable example. Given that both Spark Streaming and Flink provide SQL interfaces, why the push for streaming databases?

The distinctions are significant. Big data SQL fundamentally diverges from database SQL. For instance, big data SQL often lacks standard SQL statements common in database systems, such as create, drop, alter, insert, update, and delete, among others. Digging deeper, one discerns that a pivotal difference lies in storage capabilities: streaming databases possess them, while stream processing engines typically do not. This discrepancy influences design, implementation, system efficiency, cost, performance, and various other dimensions. For those intrigued by these distinctions, I'd recommend my QCon San Francisco 2023 talk (slide deck here).
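The storage distinction can be pictured with a toy sketch. The code below is an illustrative simplification under stated assumptions, not how any real engine or database is implemented: engine_style and MaterializedCount are hypothetical names, and real stream processors do keep internal operator state. The contrast it aims at is where queryable results live: in an external system the caller must supply, or inside the system itself.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def engine_style(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Toy 'engine': emits an updated (key, count) per event and stores no queryable result."""
    counts = Counter()
    for key in events:
        counts[key] += 1
        yield key, counts[key]


class MaterializedCount:
    """Toy 'streaming database' view: maintained incrementally and queryable in place."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = {}

    def ingest(self, key: str) -> None:
        self._counts[key] = self._counts.get(key, 0) + 1

    def query(self, key: str) -> int:
        return self._counts.get(key, 0)


events = ["click", "view", "click", "click"]

# Engine style: the caller has to provide and operate a sink to make results queryable.
external_sink: Dict[str, int] = {}
for key, count in engine_style(events):
    external_sink[key] = count

# Database style: the same result is served directly by the system that computed it.
view = MaterializedCount()
for event in events:
    view.ingest(event)

print(external_sink["click"], view.query("click"))
```

Both paths compute the same counts; what differs is who owns the stored result, which is where the design, cost, and consistency consequences mentioned above come from.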

Furthermore, the fundamental idea behind a database is distinct from that of a computation engine. To illustrate, database users often employ BI tools or client libraries for result visualization, while those using computation engines typically depend on an external system for storage and querying.

Some perceive a streaming database as a fusion of a stream processing engine and a traditional database. Technically, you could construct a streaming database by merging, say, Flink (a stream processing engine) and Postgres (a database). However, such an endeavor would present myriad challenges, including maintenance, consistency, failure recovery, and other intricate technical issues. I'll delve into these in an upcoming blog.

Engaging in debates about SQL's expressiveness becomes somewhat irrelevant. While SQL possesses enough expressiveness for numerous scenarios, languages like Java, Python, and others may outperform it in specific use cases. However, the decision to embrace SQL stream processing is not solely driven by its expressiveness. It is often influenced by a company's current position and its journey with data infrastructure. Similar to trends observed in the batch domain, one can speculate about the future market size of SQL stream processing. Nonetheless, streaming databases provide capabilities that go beyond mere stream processing. It will be intriguing to witness how the landscape evolves in the years to come.