Future of Real-time Data Systems: A Recap of Current ’22

Austin, the Live Music Capital of the World! This city in Texas, USA, is emerging as a tech hub thanks to the influx of tech giants in recent years. With its convenient transportation and central location in the United States, Austin is also attracting more and more technology conferences. I was fortunate to visit Austin for the second time this year to attend a top technology conference in the field of data systems: Current 2022. The name Current may be unfamiliar to many of us, but its former name should be familiar to all engineers: Kafka Summit. For various reasons, including branding, the event organizer, Confluent, a giant in the field of data systems, decided to rename the 2022 Kafka Summit to Current 2022.

Austin Airport Texas State Capitol

The focus of the Current conference is real-time data systems. Unlike widely adopted big data platforms (such as Apache Hadoop, Apache Spark, and Apache Hive) or data warehouses (such as Snowflake, Redshift, and BigQuery), real-time data systems emphasize storing and computing on data as soon as it is generated. In recent years, the field has gradually gained recognition in the market with the rise of applications such as real-time reporting, monitoring, and tracking. Confluent, an industry leader in this field, went public in mid-2021, driving a new wave of real-time data system development and adoption.

Current Conference Sponsor Booth

A great way to understand where a field is heading is to see what startups are doing. During my two days at Current, I chatted with all of Current's sponsors. Apart from certain industry giants, most conference sponsors are startups. In this article, based on my understanding, I will introduce you to these startups to help you understand the future of real-time data systems. I will begin by classifying these companies based on their core products. Then I will share my observations and raise the potential risks I can imagine. And I will end by introducing each startup company's core product and business model.

Disclaimer: This article only introduces the startups that sponsored the Current conference, and this article attempts to ensure that the comments are objective and fair; however, remember these are my personal observations. This article does not constitute any investment advice on any specific company, but I do recommend you invest in this field for the long term :-)

Classification of Startups

Based on the analysis of the various startup companies sponsoring the conference, I have divided the entrepreneurial direction into the following nine categories:

1. Commercializing mature open-source projects on the cloud

This category includes Databricks, Aiven, Conduktor, Imply, StreamNative, StarTree, Immerok, Factorhouse, and more. Such companies can be further divided into three development models:

The core teams start a business: such companies include Databricks, Imply, StreamNative, StarTree, Immerok, etc. One of their core selling points is "orthodoxy". These companies guide the direction of open-source project development through strong control from the core contributors. Then they try to convert open-source users to paying users through the community.
Non-founding teams start businesses: such companies include Conduktor, Factorhouse, etc. These companies are founded by active members of certain open-source projects, though not necessarily the projects' founding members. They usually emphasize the multi-directional development of cloud platforms, looking for differentiation from unique perspectives such as security, visualization, and system integration.
The companies directly commercialize multiple open-source projects. Aiven is one example. By providing hosting services for different open-source projects on the cloud, such companies can reduce the sense of separation between different projects and provide users with a complete set of solutions.

2. Real-time ETL/ELT/data integration

This category contains Decodable, Striim, Airbyte, etc. Airbyte focuses on ELT, while Decodable and Striim lean more toward ETL. One core selling point of these companies is ease of use. For example, Decodable emphasizes that users only need to click a few buttons on the platform and then write SQL to transform and import data in real time from one database to another.

3. Message queues

This category contains StreamNative, Redpanda, etc. Message queues are fundamental building blocks of a modern data stack. I believe that everybody knows Kafka, which is backed by the company Confluent, the organizer of this Current conference. Message queues proved their value over the past decade. But this does not mean that the development of message queues has come to an end. In the new cloud era, new challengers and products have emerged based on new technologies, such as the separation of storage and computing to reduce costs and increase efficiency.

4. Real-time API

This category contains Aklivity, Macrometa, etc. Such companies fill the gap from data source to data access, allowing users to access data that is ingested in real time easily.

5. Edge computing

This category includes Ably and others: companies focusing on real-time data processing on edge devices, which sit very close to the data source. A critical source of real-time data is the data generated by edge devices such as sensors and mobile phones. To reduce latency, real-time computing needs to be carried out on devices near the source rather than being forced through a centralized data center in the cloud. This also brings new demands and challenges to the underlying software.

6. Real-time vertical industry SaaS

This category contains Clear Street, Bicycle, and more. Such companies focus on the application of real-time data analysis in vertical markets. The development of such companies proves that real-time computing can be adopted on a large scale in specific industries.

7. Real-time high-level language framework

This category contains Quix, Meroxa, etc. This type of company targets developers who use a specific programming language (such as Python). People unfamiliar with lower-level languages (e.g., Java, Scala) can easily write their stream processing applications in high-level languages.

8. Real-time analytical database

This category contains Imply, StarTree, InfluxData, Rockset, FeatureBase, and many more. This type of product mainly focuses on improving the ability of real-time data ingestion in traditional analytical databases. Data ingestion latency is reduced, and the newly ingested data is visible to analytics in real time. These databases provide SQL interfaces that allow users to perform analytical queries on data like traditional databases.

9. Streaming database

This category contains RisingWave, Materialize, DeltaStream, Timeplus, and more. Like real-time analytical databases, they all provide SQL interfaces and are optimized for real-time data ingestion. On top of that, streaming databases go a step further and support real-time computation on data. They integrate stream processing technology into the database and use incremental computation to keep query results continuously up to date in real time.
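
To make the idea of incremental computing concrete, here is a minimal Python sketch of how a grouped SUM could be kept up to date as rows arrive or are retracted, instead of being recomputed from scratch on every query. It only illustrates the general technique; the class name `IncrementalSumView` and the sensor keys are hypothetical, and it does not reflect how any of the products above are actually implemented.

```python
from collections import defaultdict


class IncrementalSumView:
    """Toy stand-in for a view like: SELECT key, SUM(value) ... GROUP BY key.

    Hypothetical illustration only; real streaming databases are far more involved.
    """

    def __init__(self):
        self._totals = defaultdict(int)  # running SUM per key
        self._counts = defaultdict(int)  # number of live rows per key

    def apply(self, key, value, sign=1):
        """Apply one change: sign=+1 for an inserted row, sign=-1 for a deleted row."""
        self._totals[key] += sign * value
        self._counts[key] += sign
        if self._counts[key] == 0:
            # No rows left for this key, so the group disappears from the view.
            del self._totals[key]
            del self._counts[key]

    def snapshot(self):
        """Return the current query result without rescanning any input."""
        return dict(self._totals)


view = IncrementalSumView()
view.apply("sensor-a", 10)
view.apply("sensor-b", 5)
view.apply("sensor-a", 7)
view.apply("sensor-b", 5, sign=-1)  # retract the earlier sensor-b row
result = view.snapshot()
print(result)
```

Each change touches only the affected group, so the cost of keeping the result fresh depends on the size of the change rather than on the size of the stored data.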

Founding Time

The following table summarizes all companies introduced in this blog according to the year of founding.

Year	Company	Quantity
2012	InfluxData, Striim	2
2013	Databricks	1
2014	Slower	1
2015	Imply, Oxylabs, Swim	3
2016	Aiven, Ably, Rockset, Memgraph, Nussknacker	5
2017	Macrometa, FeatureBase	2
2018	Acceldata, Clear Street, StarTree	3
2019	Conduktor, Materialize, StreamNative, Cube, Redpanda, Tinybird, Meroxa	7
2020	Bicycle, Airbyte, CelerData, DeltaStream, Factorhouse, LakeFS	6
2021	RisingWave, Decodable, Aklivity, Timeplus	4
2022	Immerok	1

Without a doubt, 2019, 2020, and 2021 were the golden years for launching real-time data system startups. Before 2019, the field of real-time computing was far less hot than it is today. But in recent years, as traditional batch data processing has hit a bottleneck, real-time data processing has begun to attract people's attention. The landmark event of Confluent's IPO in 2021 brought new confidence to the real-time data processing industry. It is no exaggeration to say that real-time computing is one of the hottest markets today.

Business Trends

The field of real-time data systems has gradually moved from small-scale use by technology giants to use by the general public. If Confluent's IPO proves that enterprises need to store real-time data, then many real-time data system startups are now trying to prove that users also need to do computations on real-time data. The startups in this wave have different entry points, from real-time APIs to real-time analytical databases to streaming databases. Although they seem similar from a macro perspective, everyone is actually looking for a segment in which to differentiate. For example, real-time APIs are aimed more at application developers, while real-time high-level language frameworks are aimed more at data analysts and scientists. Regardless of the segment, we can see that real-time computing is entering its next wave.

Entrepreneurial Challenges

Frankly speaking, I don't think the challenges in the real-time data system field, or the data system field in general, are technically insurmountable. After all, everyone is playing a game of balancing performance and resources. I am very bullish on the real-time data system field. Of course, if I were not optimistic about this field, I would not have started a business in it. So, are there actually any challenges in launching a startup? Yes, there still are.

The biggest challenge comes from the immature market. Whether technology can usher in explosive development depends not on how advanced the technology is but on whether the technology matches the market demand. The concept of stream processing was first proposed in academia 20 years ago and landed in the industry almost a decade ago, but it has been in a tepid state for the past two decades. Although some technology giants have adopted such technology, it does not mean that various companies can widely use this technology. This is the typical technology-market mismatch. Today, in the field of real-time data systems, we clearly feel the market is heating up, but it will take time to be widely recognized and accepted, like Oracle in the enterprise market or Snowflake in the cloud computing field. I predict this time to be about 2-5 years. During this time, startups in this industry must spend a lot of time and energy educating the market. This investment is massive and full of unknowns. Of course, as we all know, challenges and opportunities coexist, and whoever can grasp the opportunity in the challenge will eventually become the market leader.

Company Details

Next, let's list the startups that sponsored the Current conference. Here, I only include private companies established within the last ten years (i.e., companies established in or after 2012). I will not introduce the giants (such as AWS, Google, and Microsoft), acquired companies, or companies that have been around for more than 10 years one by one here.

RisingWave

Official website: https://www.risingwave-labs.com/

Founded year: 2021

Keywords: stream processing, database

Funding round: Series A

Last funding round year: 2022

Let me first introduce our company, RisingWave Labs. We launched it in early 2021. Over the past two years, we have grown into a fully distributed team spanning seven time zones. RisingWave has been focusing on developing a cloud-native streaming database from day one. The core idea is to democratize stream processing, making it simple, affordable, and accessible. We hope users can develop their stream processing applications simply by operating a regular database. To do stream processing, the only thing a user needs to do is to create a materialized view. Similar to some popular systems today, RisingWave does not depend on the JVM ecosystem, which makes deployment, operation, and maintenance very simple. The entire system is written in Rust from scratch, mainly because of the efficiency and safety of the Rust language. RisingWave is an open-source project under the Apache license, and its commercial model is to provide cloud services. The private preview version has been released already, and the GA version is expected to be released next year. Because RisingWave is a streaming database, users can process streaming data while also storing it, which also means that users can query the database directly. Many people will think of the popular concept of unifying batch and stream processing. RisingWave currently focuses more on stream processing, and its batch processing capabilities still depend on the project's implementation priorities.

Databricks

Official website: https://www.databricks.com/

Founded year: 2013

Keywords: data lake, data platform

Funding round: Pre-IPO

Last funding round year: 2021

Databricks needs no introduction. Valued at up to 38 billion US dollars, this company started with Apache Spark and is moving in the direction of a unified lakehouse (integrating data lakes and warehouses). This year, Databricks also surpassed $1 billion in revenue. Barring surprises, the company will presumably have a successful IPO in the next two years. Databricks is steadily moving toward real-time computing, and Spark Streaming is one of its core development directions. This year Databricks also announced the upcoming launch of Lightspeed, its next-generation Spark Streaming engine. Since Lightspeed is not open-sourced yet, I won't elaborate on it.

Slower

Official website: http://slower.ai/

Founded year: 2014

Funding round: unknown

Last funding round year: unknown

Slower is a mysterious company. Although it was founded in 2014, I could not find any concrete information about it on the internet, and its official website is just a simple company logo. After chatting with their employees, I learned they mainly provide various cloud and on-premise deployment solutions for enterprises. The business covers databases, data platforms, data management tools, machine learning platforms, security, etc. In general, it is a well-rounded solutions team. Because Slower is so mysterious, I will stop here.

Aiven

Official website: https://aiven.io/

Founded year: 2016

Keywords: open source, cloud services

Funding round: Series D

Last funding round year: 2022

Aiven is a Helsinki-based cloud service provider. Unlike many other startups, its business story is not about core members of an open-source project starting a company to commercialize that project. It is about providing cloud hosting services for various existing open-source projects! From Kafka to Flink, from ClickHouse to InfluxDB, from MySQL to PostgreSQL, they can provide hosting services as long as it is a mainstream open-source project. At first glance, this does not seem particularly competitive, but in fact, Aiven is addressing a big pain point for users: the problem of software selection. Today there is too much data software, and it is too complex. To build complex applications, companies often have to combine a variety of software. Providing a complete set of cloud services largely solves the problem of users choosing software. In addition, the unified management interface makes the connections between different pieces of software smoother, without a strong sense of separation; I believe this is also a good advantage.

Conduktor

Official website: https://www.conduktor.io/

Founded year: 2019

Keywords: Apache Kafka, cloud service

Funding round: Series A

Last funding round year: 2021

Apache Kafka is a distributed streaming message storage system. As long as there is streaming data, Apache Kafka can be used for storage. However, it is not enough to have this storage system. We also need to deploy, maintain, and operate the system, monitor the system, and analyze and manage the data on the system. Conduktor is a company that does this series of things. Their tagline is "Streamline Apache Kafka". I thought it was highly similar to what Confluent did. After chatting with their CTO, I realized that they not only do Kafka hosting services but also analyze and manage data. For example, you can use their platform to know whether the quality of your data stored in Kafka is reliable, or you may want to query or monitor the data. Conduktor essentially takes Kafka as a data platform. Users no longer need to import data into downstream data warehouses or data lakes, but can process data within Kafka. I believe this is a good direction.

Decodable

Official website: https://www.decodable.co/

Founded year: 2021

Keywords: data pipelines, data engineering, cloud services

Funding round: Series A

Last funding round year: 2022

Decodable was founded in 2021, which makes it a rookie in the field of data engineering. The focus of Decodable is a very familiar problem: ETL. Countless platforms provide ETL capabilities, so what is the entry point for Decodable? The answer is straightforward. Decodable provides engineers with a simple and easy-to-use platform: with a few clicks and some SQL code, you can import data from one platform (such as Apache Kafka or Apache Pulsar) into another (such as Snowflake or Redshift). They also provide a cloud service that allows users to connect to databases on the cloud without installing any software locally. I believe this is also an excellent product.

Imply

Official website: https://imply.io/

Founded year: 2015

Keywords: Apache Druid, real-time analytics, database

Funding round: Series D

Last funding round year: 2022

Engineers in the database field should be familiar with Imply. Imply commercializes Apache Druid, a well-known real-time analytics engine in the industry. The core objective of Apache Druid is to answer ad hoc complex queries on large-scale data with low latency. Although many new startups have emerged in the field of real-time analytics in recent years, Imply still maintains a relatively leading position in terms of customer volume by virtue of its stable performance.

Materialize

Official website: https://materialize.com/

Founded year: 2019

Keywords: stream processing, database

Funding round: Series C

Last funding round year: 2021

Materialize is the most direct friend of our company. Like what we do, the core product of Materialize is also a streaming database. At Current, they finally released their long-awaited product: a cloud-native streaming database. Although Materialize has been building a streaming database based on the Timely Dataflow open-source project since 2019, for a long time, Materialize has always been a single-machine database running on pure memory. Hence its availability may encounter considerable challenges in real production environments. However, this new version is worth looking forward to and hopefully will make everyone's eyes sparkle.

StreamNative

Official website: https://streamnative.io/

Founded year: 2019

Keywords: Apache Pulsar, message queue, pub/sub

Funding round: Series A

Last funding round year: 2021

StreamNative was founded in 2019. Although that was not long ago, StreamNative has an outstanding reputation in the open-source and infrastructure community. Their core product is a commercial version of Apache Pulsar. Since it was open sourced in 2016, many companies worldwide have adopted Apache Pulsar. As message queuing systems, Apache Pulsar and Apache Kafka differ in two significant ways. First, compared to Kafka, which focuses solely on storing event data, Pulsar also pays attention to the message data generated within the application. Second, Pulsar is more cloud-native, and its architecture, which separates storage from computing, can make the entire system more scalable. As for commercialization, StreamNative is also currently focusing on providing services to users on the cloud.

Ably

Official website: https://ably.com/

Founded year: 2016

Keywords: message queue, pub/sub, edge computing

Funding round: Series B

Last funding round year: 2021

Ably is a London-based company that provides message queuing services on the cloud. Speaking of message queue services, you may think that Ably's product is similar to Kafka or Pulsar. Yes, Ably's products fall into the same category as Kafka, but the most significant difference is edge computing. Kafka is usually deployed in a company's data center, and through Kafka, we can obtain message data in a centralized way. Ably focuses on the edge. To achieve millisecond-level latency, you can deploy their product on the edge cloud and process data directly on the device side, such as on mobile phones, sensors, and tablet computers. Ably was founded six years ago, and its cumulative financing has exceeded 80 million US dollars.

Acceldata

Official website: https://www.acceldata.io/

Founded year: 2018

Keywords: observability, cloud services

Funding round: Series B

Last funding round year: 2021

Acceldata is a comprehensive data observability platform in the cloud. Observability platforms have been really hot in recent years. Aside from the market leaders, Splunk and Datadog, there are also various other startups working in this field. Based on conversations with Acceldata folks, my first impression was that Acceldata is an observability platform that does everything, but I needed a better understanding of how they differ from Datadog and others. After digging deeper, I found that they are more concerned about the observability of the various systems in the so-called "modern data stack" than the observability of the machine itself or of traditional applications such as CI/CD. What they observe is slightly different from the focus of other companies.

Aklivity

Official website: https://www.aklivity.io/

Founded year: 2021

Keywords: real-time API

Funding round: Seed round

Last funding round year: 2022

Aklivity is a startup with only three employees (including the founder himself)! I chatted with all three of their employees at a cocktail party. They told me they are working on an open-source API tool called Zilla. The company has raised $4 million in its seed round. More specifically, what they are building is a real-time API gateway. Simply put, when users use Kafka, they may have to choose different interfaces to connect to it depending on the device or application, which is relatively troublesome. Zilla, developed by Aklivity, is essentially a unified wrapper on top of Kafka, so that different applications can access Kafka in the same way.

Bicycle

Official website: https://bicycle.io/

Founded year: 2020

Keywords: revenue operation, SaaS

Funding round: Unknown

Last funding round year: Unknown

At Current 2022, we were pleasantly surprised to discover some SaaS products that provide real-time analytics capabilities. Bicycle is one of them. Bicycle is not selling a real-time analysis engine or storage engine; it provides real-time data monitoring, alerting, and analysis functionality for customers. For example, for an e-commerce platform company, Bicycle can analyze past sales data to predict possible future sales and use such data to help the company manage its revenue. Employees from Bicycle told me that they developed their own core engine and use machine learning methods to analyze sales data. As the underlying systems for real-time analytics gradually improve, I feel there will be more and more SaaS startups like this.

Clear Street

Official website: https://clearstreet.io/

Founded year: 2018

Keywords: FinTech, SaaS, securities trading

Funding round: Series B

Last funding round year: 2022

Clear Street is a New York-based fintech platform service provider that provides a cloud-based securities trading service. Traditional securities brokers, such as banks, have out-of-date infrastructures, opaque information, and low efficiency. Clear Street saw this opportunity and wanted to revolutionize the field with cloud computing. If Robinhood is a trading platform for retail investors, then Clear Street is Robinhood for professional institutions, in my opinion. Through Clear Street, users can not only trade securities but also perform real-time data analysis through a simple interface. In 2021, the average transaction volume processed in a single trading day on its cloud platform had reached 3 billion US dollars.

Cube

Official website: https://cube.dev/

Founded year: 2019

Keywords: Headless BI, cloud platform

Funding round: Series B

Last funding round year: 2022

When I first saw Cube's booth, I thought they were a BI visualization company, but it turned out that they were not. What Cube does is a layer between data storage systems (such as databases, data warehouses, etc.) and visual BI tools. It solves several problems:

Unified definitions: When users query data from different data sources, the type, unit, and representation of the data may not be consistent. Cube provides a data model to solve this problem;
Access permission: Administrators can set different permissions for different users through Cube to display different reports for different users;
Cache: Every time you fetch data from a BI tool to the underlying data storage system, there is always a lot of access overhead. Cube provides a layer of caching to solve this problem;
Different APIs: You don’t have to worry about the different APIs of underlying data storage systems.

Overall, I think Cube is a thin and easy-to-use tool, and many companies should like it.
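
The caching point above can be illustrated with a small Python sketch: a layer that sits in front of a slow data store and reuses recent results for identical queries. This is a generic time-to-live cache written under my own assumptions, not Cube's actual implementation or API; `CachedQueryLayer`, `slow_backend`, and the injected clock are all hypothetical names.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedQueryLayer:
    """Hypothetical caching layer between a BI tool and a data store."""

    def __init__(self, backend: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._backend = backend          # function that runs a query on the store
        self._ttl = ttl_seconds          # how long a cached result stays valid
        self._clock = clock              # injectable so the example is deterministic
        self._cache: Dict[str, Tuple[float, Any]] = {}
        self.backend_calls = 0           # counts how often the store was really hit

    def query(self, sql: str) -> Any:
        now = self._clock()
        hit = self._cache.get(sql)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                # fresh enough: skip the expensive call
        self.backend_calls += 1
        result = self._backend(sql)
        self._cache[sql] = (now, result)
        return result


# A fake clock and a fake backend keep the example self-contained.
fake_now = [0.0]


def slow_backend(sql: str) -> str:
    return f"rows for: {sql}"


layer = CachedQueryLayer(slow_backend, ttl_seconds=60, clock=lambda: fake_now[0])
first = layer.query("SELECT 1")
second = layer.query("SELECT 1")   # served from the cache
fake_now[0] = 61.0                 # pretend 61 seconds have passed
third = layer.query("SELECT 1")    # entry expired, so the backend is queried again
print(layer.backend_calls)
```

The trade-off in any such layer is freshness versus load: a longer time-to-live saves more round trips but shows dashboards slightly older data.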

Immerok

Official website: https://www.immerok.io/

Founded year: 2022

Keywords: Apache Flink, cloud service

Funding round: Seed round

Last funding round year: 2022

Immerok is probably the youngest company to sponsor the Current conference. However, what they do may be familiar to people who do real-time analysis: commercializing the Flink system in the cloud. When it comes to Flink-related companies, you may think of Ververica, which Alibaba acquired in 2019. Immerok's relationship with Ververica is extraordinary: Almost the entire founding team of Immerok is from Ververica. Since its establishment in the first half of this year, Immerok has completed a seed round of 17 million euros. Unlike Ververica, which offers on-premise deployments, Immerok has put its entire focus on cloud services. It appears this is the future trend.

InfluxData

Official website: https://www.influxdata.com/

Founded year: 2012

Keywords: time series database

Funding round: Series D

Last funding round year: 2019

InfluxData is already very famous and does not need much introduction. Its main product, InfluxDB, is a mainstream time series database in the industry. As a commercial open-source software company, InfluxData uses the very permissive MIT license to open source its core code. But what's interesting is that only the stand-alone version is open sourced, while the distributed version is completely closed source and paid. InfluxData has also recently been rewriting its system kernel in Rust. It seems that rewriting a system in Rust is quite common in the industry.

Macrometa

Official website: https://www.macrometa.com/

Founded year: 2017

Keywords: Real-time API

Funding round: Series A

Last funding round year: 2021

Macrometa provides real-time data API services. What exactly is a real-time data API service? Essentially Macrometa can be regarded as a global real-time multimodal database. Users write to the database through API or connect to real-time event sources. The underlying system achieves global real-time synchronization through CRDT. Users can implement data read, write and cache services just like using a distributed database across multiple regions and can directly query the data written in real time. Compared to other real-time data services, Macrometa has two specialties. First, Macrometa has encapsulated the underlying technology into services, and the upper layer provides a very rich data model and API, such as key-value store, document database, graph database, and pub/sub; second, Macrometa is very focused on edge computing. Users who have a large demand for edge computing can simply access real-time data by accessing their API.
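
For readers unfamiliar with CRDTs (conflict-free replicated data types), here is a minimal Python sketch of one textbook example, a grow-only counter. Each replica only increments its own slot, and replicas merge by taking the per-replica maximum, so they converge to the same value regardless of the order in which updates are exchanged. I do not know which CRDTs Macrometa actually uses; the class and the replica names `tokyo` and `austin` are purely illustrative.

```python
class GCounter:
    """Grow-only counter, a classic CRDT (illustrative sketch only)."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Taking the element-wise maximum is commutative, associative, and
        # idempotent, which is what lets replicas sync in any order.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


tokyo = GCounter("tokyo")
austin = GCounter("austin")
tokyo.increment(3)     # writes accepted locally in one region
austin.increment(2)    # concurrent writes accepted in another region
tokyo.merge(austin)    # replicas exchange state in either direction
austin.merge(tokyo)
austin.merge(tokyo)    # merging the same state twice changes nothing
print(tokyo.value(), austin.value())
```

Because no replica has to wait for a global lock before accepting a write, this style of data structure fits the multi-region, low-latency setting described above.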

Oxylabs

Official website: https://oxylabs.io/

Founded year: 2015

Keywords: Gateway, SERP scraper, SEO

Funding round: Unknown

Last funding round year: Unknown

If you look at the official website of Oxylabs, you will find that their core business is network proxies, such as IP address proxies and data center gateways. This seems to have nothing to do with real-time systems. But if you take a closer look, you will find that another pillar of Oxylabs' business is real-time SERP scraping. I was relatively unfamiliar with this term before, so I did some research; let me briefly introduce it. SERP stands for Search Engine Result Page, and a SERP scraper automatically tracks the results that a search engine returns for certain keywords. Such results include advertisements, related queries, web page rankings, etc., and they are returned to users in a structured format. SERP scraping is mainly used by marketing departments to analyze the SEO of their own products and those of competitors. The core market value here is to provide SaaS services that lower the technical bar. Oxylabs takes a step further by providing automated real-time SERP scraping. This requires a comprehensive real-time data stack, from data crawling to real-time analysis to delivery to users. Oxylabs was founded seven years ago, yet it has no public funding records. Nevertheless, the company has expanded from Lithuania to the rest of the globe. This also shows that applications of real-time data are really of high value.

Quix

Official website: https://quix.io/

Founded year: 2020

Keywords: Python, data science, data engineering

Funding round: Seed round

Last funding round year: 2021

Earlier stream processing platforms were designed for programmers familiar with low-level APIs, and they only provided interfaces in languages such as Java. For data scientists, however, the dominant programming language is Python. Here comes an opportunity: simplifying stream processing for users who are more familiar with Python, such as data scientists and engineers. Quix is a stream processing platform aimed mainly at high-level languages such as Python. Quix was founded in the UK by several data scientists from McLaren (you read that right, the company that sells luxury sports cars). It has been a stream processing platform for data scientists and engineers since its inception. It relies on message queues such as Kafka for data input and output but provides streaming data processing services hosted on the cloud. In addition to Python, I also noticed from their official website that they have added support for C#.

Redpanda

Official website: https://redpanda.com/

Founded year: 2019

Keywords: message queue, Apache Kafka compatible

Funding round: Series B

Last funding round year: 2022

Redpanda competes directly with Apache Kafka. If Red Hat is a commercial distribution for Linux, then Redpanda is a commercial distribution for Kafka. The interface of Redpanda is fully compatible with Kafka. Compared to Kafka, Redpanda's main selling point is cost-effectiveness: it claims to be ten times more performant than Kafka, with hardware efficiency more than six times higher. As a C++ project, Redpanda also has another big selling point: it completely abandons the JVM dependency. When installing Redpanda, users no longer need to install JVM ecosystem components such as ZooKeeper. I strongly agree with this idea. In the era of big data, the installation and maintenance of the Hadoop ecosystem was too complex. Today, it is clear that a minimalist deployment, operation, and maintenance environment will be a solid competitive advantage over the technologies of the big data era.

Rockset

Official website: https://rockset.com/

Founded year: 2016

Keywords: real-time analytics, database

Funding round: Series B

Last funding round year: 2020

Rockset is a real-time analytics database. There are already many open-source products in this field, and Rockset is one of the few closed-source ones. In the early days, Rockset was not actually aimed at real-time analytics (OLAP). The Rockset founding team is the same group of people who built RocksDB and HDFS at Facebook, and indeed, their product is built on RocksDB. I have been following their product since the company had only 10 people or so. I still remember that what they did in the earliest days was SQL on raw data, that is, querying raw data such as semi-structured data and JSON. After that, it gradually became a so-called indexing database. Only in the past two years has the product been fully positioned as a real-time analytics database. What impresses me most is that back in 2016 they predicted that future data would live in the cloud and that more and more raw data would be retained. Looking back from today at what they built, there is no doubt that they were ahead of their time.

StarTree

Official website: https://www.startree.ai/

Founded year: 2018

Keywords: Apache Pinot, real-time analytics, database

Funding round: Series B

Last funding round year: 2022

StarTree is a newcomer in the real-time analytics database field. Although it was founded recently, it has already gained considerable attention in Silicon Valley. Beyond its products, its VP of DevRel, Tim Berglund, has also attracted much attention. StarTree's core business is commercializing Apache Pinot, an open-source real-time analytics database. Compared with other real-time analytics databases, Pinot focuses more on high-concurrency queries, which many user-facing scenarios require (such as the "Who's viewed your profile" application on LinkedIn). In addition to selling the real-time analytics database in the cloud, StarTree has a SaaS service called ThirdEye that uses Pinot for data anomaly detection. This reflects a trend: infra companies are developing their own SaaS layer.

Striim

Official website: https://www.striim.com/

Founded year: 2012

Keywords: data integration, stream processing

Funding round: Series C

Last funding round year: 2021

Striim is a company specializing in data integration. It was founded by the original Oracle GoldenGate team. GoldenGate was acquired by Oracle in 2009, and the team then focused on data import and export for the Oracle database. Striim's current core business is the same as GoldenGate's: database-to-database data integration. Because it was founded relatively early, Striim initially focused on on-premises deployments. In recent years, with the rise of the cloud, it has also expanded into cloud services. What's impressive about the company is how well its products fit into the data ecosystem. Striim Cloud supports all mainstream databases, data services, and cloud platforms. The company has even developed dozens of connectors and published them in cloud vendors' marketplaces, which greatly reduces the complexity of onboarding for users.

Swim

Official website: https://www.swim.inc/

Founded year: 2015

Keywords: real-time data analysis, real-time application

Funding round: Series B

Last funding round year: 2019

Swim's products help developers build and manage real-time applications based on streaming data. Its open-source product, Swim OS, provides a framework for building real-time applications. Its commercial product, Swim Continuum, covers stream data source management, real-time analysis and display of stream data, and monitoring of application status, combining application monitoring and business analytics in one platform. Rather than a single service (middleware), the company provides a full set of real-time data processing and analysis solutions for business users.

Tinybird

Official website: https://www.tinybird.co/

Founded year: 2019

Keywords: real-time data analysis, real-time API

Funding round: Series A

Last funding round year: 2022

Tinybird is a real-time data analytics company. Data sources come in many forms, but to build an application, the application side still needs to access the data through an API. What Tinybird builds is the bridge from the data source to the API. Its product supports multiple types of data sources; developers use SQL to transform and process the data and then expose those queries as API endpoints. Downstream applications only need to call these endpoints to get the latest data instantly, with no need to build complex data pipelines. From a technical point of view, Tinybird uses the currently popular ClickHouse for data processing.
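To make the "SQL query exposed as an API" idea concrete, here is a minimal sketch in Python. It is not Tinybird's actual API: an in-memory SQLite database stands in for the analytics engine, and the `ENDPOINTS` registry, the `call_endpoint` function, and the `page_views` table are all hypothetical names invented for illustration.

```python
import json
import sqlite3

# In-memory SQLite stands in for the analytics engine (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", "u1"), ("/home", "u2"), ("/pricing", "u1")],
)

# Hypothetical registry mapping an endpoint name to the SQL behind it.
ENDPOINTS = {
    "top_pages": (
        "SELECT page, COUNT(*) AS views FROM page_views "
        "GROUP BY page ORDER BY views DESC, page LIMIT ?"
    ),
}


def call_endpoint(name, *params):
    """Run the SQL registered under `name` and return the rows as JSON."""
    cursor = conn.execute(ENDPOINTS[name], params)
    columns = [col[0] for col in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return json.dumps({"data": rows})


# A downstream application only sees the endpoint name and its parameters.
response = call_endpoint("top_pages", 2)
```

The point of the design is that the application never touches the underlying tables: changing the query behind `top_pages` changes what every caller receives, without any pipeline in between.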

Airbyte

Official website: https://airbyte.com/

Founded year: 2020

Keywords: data integration, ELT/ETL

Funding round: Series B

Last funding round year: 2021

Airbyte is a fast-growing data integration company based in Silicon Valley. Its product can be seen as an open-source alternative to Fivetran. Specifically, Airbyte is a data connector that links multiple data sources (applications, APIs, message streams, databases, etc.) to target data systems (databases, data warehouses, data lakes, etc.). Unlike many stream computing systems, which require data that is already cleaned and structured or complex data cleaning logic written by engineers, Airbyte directly supports end-to-end data integration. Through simple SaaS-style configuration, data exchange between more than 100 different systems can be set up easily. Thanks to active contributions from the open-source community, the number of Airbyte data connectors has exceeded 150. An official development kit is also provided, so developers can build custom connectors without spending too much time (the official claim is within 30 minutes). Airbyte is very popular thanks to its ease of use and its rich documentation and support resources.

CelerData

Official website: https://celerdata.com/index

Founded year: 2020

Keywords: real-time data analysis, database

Funding round: Unknown

Last funding round year: Unknown

CelerData is a new company founded in the United States by StarRocks, a real-time analytical database startup. StarRocks is a commercial product stemming from the open-source project Apache Doris. Similar to Rockset, StarTree, Imply, etc., StarRocks can efficiently handle complex analytical requests. Its interface is compatible with MySQL, and in terms of performance, it claims to be able to significantly outperform similar products.

DeltaStream

Official website: https://www.deltastream.io/

Founded year: 2020

Keywords: stream processing, database

Funding round: Seed round

Last funding round year: 2022

DeltaStream is a streaming database company founded at the end of 2020. Its founder, Hojjat Jafarpour, is also the founder of Confluent's KSQL project. DeltaStream provides a serverless streaming database to manage and process data streams in real time. DeltaStream itself does not contain a storage module; instead, it treats streaming storage platforms such as Kafka and AWS Kinesis, or static data sources such as AWS S3, as its storage layer. It allows users to read data from one or more data sources, perform computations, and write the results to several storage components at once. Internally, DeltaStream uses Apache Flink SQL as its engine.

Factorhouse

Official website: https://factorhouse.io

Founded year: 2020

Keywords: Apache Kafka, cloud service

Funding round: none

Last funding round year: none

Factor House (known as Operatr.IO before its rebranding in September 2022) is a three-member (3!) team based in Australia. Its CEO and COO are Derek Troy-West and Kylie Troy-West, respectively. Its main product, Kpow, is a web-based visualization tool designed specifically for Apache Kafka that helps enterprise users manage and monitor Kafka resources. Kpow lets users visualize, search, and export real-time data, greatly improving Kafka's observability and ease of maintenance. It can also manage all Kafka clusters and topics without complex command lines. After chatting with the team, I learned that they started developing the platform at home during the pandemic. So far, they have not received any funding, but they already have customers.

LakeFS

Official website: https://lakefs.io

Founded year: 2020

Keywords: git-like, multi-version, data lake

Funding round: Series A

Last funding round year: 2021

LakeFS is developed by Treeverse. It lets users manage data in a data lake like code: branch, commit, merge, and revert are all a piece of cake. Data lakes, especially very large ones, are difficult to manage. The object storage they rely on lacks critical features such as atomicity, rollback, and reproducibility, which hurts data quality and recoverability. In the past, when using a data lake, we often created a copy of the production environment, tested data changes on the copy first, and then applied them to production once they were ready. The problem is that this approach is time-consuming and expensive, and it makes collaboration among many people difficult. LakeFS turns object storage into a Git-like repository without duplicating any data, supporting multi-person collaboration, letting only validated data into production, and reducing errors. Even if errors occur, the corrupted data can be rolled back atomically in the production environment.
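The idea of branching without copying data can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not the lakeFS API or its storage format: `ToyRepo` and all of its methods are hypothetical, each commit is a plain mapping from path to content hash, the merge is naive, and revert simply drops the latest snapshot instead of recording a new commit.

```python
import hashlib


class ToyRepo:
    """Toy Git-like layer over an object store; not the lakeFS API."""

    def __init__(self):
        self.objects = {}               # content hash -> bytes, stored once
        self.branches = {"main": [{}]}  # branch -> list of snapshots (path -> hash)

    def create_branch(self, name, source="main"):
        # Copy the snapshot history (pointers only); no object data is duplicated.
        self.branches[name] = list(self.branches[source])

    def commit(self, branch, changes):
        """Record a new snapshot with `changes` (path -> bytes) applied."""
        snapshot = dict(self.branches[branch][-1])
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects[digest] = data
            snapshot[path] = digest
        self.branches[branch].append(snapshot)

    def merge(self, source, target):
        # Naive merge: paths from the source win; conflicts are not detected.
        merged = dict(self.branches[target][-1])
        merged.update(self.branches[source][-1])
        self.branches[target].append(merged)

    def revert(self, branch):
        # Roll back by dropping the latest snapshot in a single step.
        if len(self.branches[branch]) > 1:
            self.branches[branch].pop()

    def read(self, branch, path):
        return self.objects[self.branches[branch][-1][path]]


repo = ToyRepo()
repo.commit("main", {"sales.csv": b"id,amount\n1,10\n"})
repo.create_branch("experiment")
repo.commit("experiment", {"sales.csv": b"id,amount\n1,10\n2,20\n"})
main_before_merge = repo.read("main", "sales.csv")  # production is untouched

repo.merge("experiment", "main")
main_after_merge = repo.read("main", "sales.csv")

repo.revert("main")  # undo the merge in production
main_after_revert = repo.read("main", "sales.csv")
```

Because a branch is only a list of pointers into the shared object store, creating `experiment` costs almost nothing regardless of how large the data is, which is what makes testing changes on a branch cheaper than copying the production environment.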

Memgraph

Official website: https://memgraph.com

Founded year: 2016

Keywords: graph database, real-time analysis

Funding round: Seed round

Last funding round year: 2021

Memgraph is a low-latency, high-performance in-memory graph database that handles both transactional and analytical graph workloads well. Memgraph can analyze data from multiple data sources and discover potential connections between them, allowing users to apply graph algorithms and then build their own real-time applications. CEO Dominik Tomicevic mentioned that the most typical Memgraph users come from the chemical, manufacturing, and financial industries. They all have one thing in common: they need real-time analysis of scattered data.

Meroxa

Official website: https://meroxa.com

Founded year: 2019

Keywords: code-first, real-time analysis

Funding round: Series A

Last funding round year: 2021

With Meroxa, users can build, test, and deploy real-time data applications in days. Meroxa is developer-centric, code-first tooling that lets software engineers spend their time building data products instead of maintaining fragile data systems that weren’t designed for developers. Meroxa's goal is to help developers focus on building applications with real-time data rather than on repetitive operational tasks. Their vision is to make Meroxa the industry-leading Data Application Platform as a Service (DAPaaS).

FeatureBase

Official website: https://www.featurebase.com/

Founded year: 2017

Keywords: real-time analysis, bitmap

Funding round: unknown

Last funding round year: unknown

FeatureBase is based in Austin, Texas, and took its current name after the recent merger of Molecula Corporation and the Pilosa Project. Its product is an OLAP database that uses bitmaps for data indexing. Specifically, FeatureBase converts the data stored in traditional OLAP columns into bitmap-based features, thereby achieving better read and write performance and resource efficiency. At the same time, FeatureBase treats streaming data as an important focus, emphasizing the data freshness that bitmaps bring to streaming updates. FeatureBase can index structured data well, but it does not handle unstructured data. Its product is available in two forms, open source and cloud service, and supports two data access interfaces: SQL and its custom PQL.
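For readers unfamiliar with bitmap indexing, the following Python sketch shows the general technique. It says nothing about FeatureBase's actual implementation: `BitmapIndex` and its methods are hypothetical, and plain Python integers are used as uncompressed bitsets, whereas production systems typically use compressed bitmap formats.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index: one bitset per (column, value) pair."""

    def __init__(self):
        # (column, value) -> integer whose bit i is set when row i has that value
        self.bitmaps = defaultdict(int)
        self.row_count = 0

    def add_row(self, row):
        """Index a new row (a dict of column -> value) and return its row id."""
        row_id = self.row_count
        for column, value in row.items():
            self.bitmaps[(column, value)] |= 1 << row_id
        self.row_count += 1
        return row_id

    def match(self, **conditions):
        """Return ids of rows satisfying every column=value condition."""
        result = (1 << self.row_count) - 1  # start with all rows selected
        for column, value in conditions.items():
            result &= self.bitmaps.get((column, value), 0)  # bitwise AND filter
        return [i for i in range(self.row_count) if (result >> i) & 1]


index = BitmapIndex()
index.add_row({"country": "US", "plan": "pro"})   # row 0
index.add_row({"country": "DE", "plan": "free"})  # row 1
index.add_row({"country": "US", "plan": "free"})  # row 2

us_rows = index.match(country="US")
us_free_rows = index.match(country="US", plan="free")
```

A new row only sets one bit per column, and a multi-condition filter reduces to bitwise AND operations, which hints at why bitmaps suit both frequent streaming updates and fast analytical filters.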

Nussknacker

Official website: https://nussknacker.io/

Founded year: 2016

Keywords: real-time analytics, visualization

Funding round: unknown

Last funding round year: unknown

Nussknacker is a visual real-time analysis tool. Its target users are managers, analysts, and others accustomed to interactive tools such as Excel. Users can build analysis and processing logic on data streams through visual operations on a web page, without writing code. For simple analytical queries, Nussknacker uses Kafka as the main input and output stream interface and runs them on its own lightweight engine, while more complex aggregation operations are processed on Flink. Nussknacker lowers the barrier to building real-time data processing and analysis: business teams can deploy and test business logic without writing code or relying on professional developers.

Timeplus

Official website: https://www.timeplus.com/

Founded year: 2021

Keywords: stream processing, database

Funding round: Seed round

Last funding round year: 2022

Timeplus is a company founded by a group of senior experts from the Splunk engineering team. Their core product is a streaming database in the cloud. Users can register and apply for beta access on the official website. At first glance, the interactive interface offers a good experience. So far, their product is entirely closed-source, a business model that allows the team to focus more on commercialization.

Thanks for being here. This blog is a comprehensive overview of where startups in the field of real-time data systems are heading. I believe this is also the most exciting direction in data infrastructure in recent years. If you are interested in this direction, or in RisingWave's open-source and cloud products, do not hesitate to contact us. I believe that real-time data systems will advance by leaps and bounds in the near future.

Plug

My talk at Current conference

In fact, when I went to the Current conference this time, in addition to chatting with peers and promoting RisingWave, I also gave a technical talk titled "Rethinking State Management in Cloud-Native Streaming Systems". The talk is purely technical and introduces some of the internal implementation of the RisingWave system. If you are interested, you can check out my presentation slides here. Meanwhile, the full video is also available on the Current conference website. Also, the RisingWave source code can be accessed here, and the cloud private preview is here. Please check it out and leave your comments!