1. Spark Structured Streaming
Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on top of the Apache Spark framework. It allows you to process and analyze real-time data streams using the same APIs and programming constructs as batch processing in Spark. Structured Streaming provides a high-level, declarative API for processing data streams, making it easier for developers to work with streaming data.
With the rapid growth of Spark Structured Streaming, Databricks, the company behind Apache Spark and Spark Structured Streaming, announced Project Lightspeed in 2022. Project Lightspeed is an umbrella project aimed at improving four key aspects of Spark Structured Streaming:
- Performance improvements. These include offset management, log purging, microbatch pipelining, state rebalancing, adaptive query execution, and more.
- Enhanced functionalities. New functionalities include multiple stateful operators, stateful processing in Python, dropping duplicates within watermark, and native support for Protobuf serialization.
- Improved observability. Streaming jobs need metrics and tools for monitoring, debugging, and alerting. Project Lightspeed introduces a Python query listener in PySpark that can send streaming metrics to external systems.
- Expanding ecosystem. Project Lightspeed adds new connectors such as Amazon Kinesis and Google Pub/Sub to expand the ecosystem of Spark Structured Streaming.
Project Lightspeed is a significant undertaking, but it has the potential to make Spark Structured Streaming a more powerful and versatile stream processing engine. I am excited to see how it develops in the future.
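To make "dropping duplicates within watermark" more concrete, here is a minimal, framework-free Python sketch of the idea. It is not Spark's implementation or API: the class `WatermarkDeduplicator`, its `delay` parameter, and the integer timestamps are assumptions made purely for illustration. The point is only that deduplication state can be discarded once the watermark has moved past it, so memory stays bounded.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkDeduplicator:
    """Hypothetical helper: drops repeated event IDs seen within the watermark delay."""

    delay: int                                  # allowed lateness, same unit as timestamps
    max_event_time: int = 0                     # highest event time observed so far
    seen: dict = field(default_factory=dict)    # event_id -> event time of first occurrence

    def process(self, event_id, event_time):
        """Return True if the event should be emitted, False if it is dropped."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.delay

        # Events older than the watermark arrive too late and are dropped.
        if event_time < watermark:
            return False

        # Evict state that has fallen behind the watermark.
        self.seen = {k: t for k, t in self.seen.items() if t >= watermark}

        if event_id in self.seen:
            return False
        self.seen[event_id] = event_time
        return True


dedup = WatermarkDeduplicator(delay=10)
events = [("a", 1), ("b", 2), ("a", 3), ("c", 20), ("a", 21), ("b", 5)]
results = [dedup.process(event_id, ts) for event_id, ts in events]
print(results)  # [True, True, False, True, True, False]
```

In this run, the second `"a"` is dropped as a duplicate, the third `"a"` is emitted again because its earlier state was already evicted, and the final `"b"` is dropped because it arrives behind the watermark.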
2. ksqlDB
ksqlDB is a stream processing engine built on top of Apache Kafka and Kafka Streams. It combines stream processing with a relational database model using SQL syntax, which makes it well suited to building real-time applications that process and analyze streaming data. Some of the key features of ksqlDB include:
- SQL interface: ksqlDB uses a SQL interface, which makes it familiar to most developers.
- Stream processing: ksqlDB can be used to process streaming data in real time.
- Relational database model: ksqlDB uses a relational database model, which makes it easy to store and query data.
- Scalability: ksqlDB is scalable and can be deployed on a variety of platforms.
- Reliability: ksqlDB is reliable and can handle high volumes of data.
ksqlDB can be deployed on a variety of platforms, including Confluent Cloud. When ksqlDB is deployed on Confluent Cloud, it is managed by Confluent and is automatically provisioned, scaled, and updated. This makes it easy to get started with ksqlDB and to focus on building applications rather than managing infrastructure.
3. RisingWave
RisingWave is an open-source distributed SQL streaming database designed for the cloud. It aims to reduce the complexity and cost of building real-time applications. RisingWave consumes streaming data, performs incremental computations when new data comes in, and updates results dynamically. As a database system, RisingWave maintains results in its own storage so that users can access data efficiently.
Some of the key features of RisingWave:
- Distributed architecture: RisingWave is a distributed database that can be scaled horizontally to handle large amounts of data.
- SQL interface: RisingWave provides a SQL interface that allows users to query streaming data in a familiar way.
- Incremental computations: RisingWave performs incremental computations when new data comes in, which reduces the processing time and allows for low latency queries.
- Materialized views: RisingWave supports materialized views, which allow users to define the data they need and have it pre-computed for efficient querying.
- Cloud-native architecture: RisingWave is designed to be deployed and managed in the cloud, which makes it easy to scale and manage.
RisingWave is fully open source, so you can deploy it yourself.
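The combination of incremental computation and materialized views described above can be illustrated with a small Python sketch. This is a conceptual toy, not RisingWave code: the class `IncrementalSumView` and its methods are hypothetical names. It maintains the result of a `SUM(amount) ... GROUP BY key` style aggregation by applying each change as a delta, instead of rescanning all historical data.

```python
from collections import defaultdict


class IncrementalSumView:
    """Hypothetical toy view: keeps a per-key sum up to date one change at a time."""

    def __init__(self):
        self.sums = defaultdict(int)
        self.counts = defaultdict(int)   # rows per key, so empty groups can be removed

    def apply(self, key, amount, op="insert"):
        """Apply a single insert or delete; only the affected group is touched."""
        if op not in ("insert", "delete"):
            raise ValueError(f"unsupported operation: {op}")
        sign = 1 if op == "insert" else -1
        self.sums[key] += sign * amount
        self.counts[key] += sign
        if self.counts[key] == 0:
            # No rows left in this group, so it disappears from the result.
            del self.sums[key]
            del self.counts[key]

    def query(self):
        """Reading the view is a lookup of precomputed results."""
        return dict(self.sums)


view = IncrementalSumView()
view.apply("alice", 30)
view.apply("bob", 5)
view.apply("alice", 12)
view.apply("bob", 5, op="delete")
print(view.query())  # {'alice': 42}
```

Each call to `apply` does a constant amount of work regardless of how much data has already been processed, which is the property that keeps query latency low.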
4. Arroyo
Arroyo is an open source distributed stream processing engine written in Rust. It is designed to efficiently perform stateful computations on streams of data. Arroyo lets you ask complex questions of high-volume real-time data with sub-second results.
The Arroyo project was started by a team of engineers from the Y Combinator W23 batch. They are passionate about making real-time data processing more accessible and affordable, and they believe that Arroyo can help organizations of all sizes take advantage of real-time data.
The Arroyo project is still under development, but it has already been used by a number of organizations, including Plaid, Affirm, and Stitch Fix. The project is open source, so anyone can contribute to its development.
Here are some of the features of the Arroyo project:
- SQL and Rust pipelines
- Scales up to millions of events per second
- Stateful operations like windows and joins
- State checkpointing for fault-tolerance and recovery of pipelines
- Timely stream processing via the Dataflow model
Arroyo can be self-hosted, or used via the Arroyo Cloud service managed by Arroyo Systems. If you are looking for a powerful and efficient stream processing engine, then Arroyo is a good option to consider. It is still under development, but it has a lot of potential.
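State checkpointing, listed among the features above, is easier to picture with a small example. The following Python sketch is a generic illustration and does not reflect Arroyo's actual implementation: `CountingOperator`, `run`, and the list-based input are hypothetical. It snapshots an operator's state together with its input offset, then recovers from that snapshot and replays the remaining input.

```python
import copy


class CountingOperator:
    """Hypothetical stateful operator: counts events per key and tracks its input offset."""

    def __init__(self):
        self.counts = {}
        self.offset = 0   # index of the next input event to read

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def checkpoint(self):
        """Snapshot the state and the input position together."""
        return {"counts": copy.deepcopy(self.counts), "offset": self.offset}

    @classmethod
    def restore(cls, snapshot):
        op = cls()
        op.counts = copy.deepcopy(snapshot["counts"])
        op.offset = snapshot["offset"]
        return op


def run(events, operator, stop_before=None):
    """Feed events to the operator, starting from the operator's current offset."""
    end = len(events) if stop_before is None else stop_before
    for key in events[operator.offset:end]:
        operator.process(key)
    return operator


events = ["x", "y", "x", "z", "x"]

op = run(events, CountingOperator(), stop_before=3)
snapshot = op.checkpoint()          # taken after three events
op.process("z")                     # work done after the checkpoint, lost in the "crash"

recovered = run(events, CountingOperator.restore(snapshot))
print(recovered.counts)             # {'x': 3, 'y': 1, 'z': 1}
```

Because the offset is stored alongside the counts, the recovered operator resumes at exactly the right input position and ends with the same result as an uninterrupted run.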
5. Materialize
Materialize is a streaming database that allows you to process data at speeds and scales not possible in traditional databases, but without the cost, complexity, or development time of most streaming engines. It is a good fit for applications that need to process data in real time, such as fraud detection, anomaly detection, and real-time analytics.
Materialize combines the accessibility of SQL databases with a streaming engine that is horizontally scalable, highly available, and strongly consistent. In particular, it is strong in the following aspects:
- Incremental updates. Materialize supports incrementally updated materialized views that are always fresh, even when using complex SQL statements, like multi-way joins with aggregations. Its engine is built on Timely and Differential Dataflow — data processing frameworks backed by many years of research and optimized for this exact purpose.
- Standard SQL support. Materialize follows the SQL standard (SQL-92) implementation, so you interact with it like any relational database: using SQL. You can build complex analytical workloads using any type of join (including non-windowed joins and joins on arbitrary conditions), but you can also leverage exciting new SQL patterns enabled by streaming like Change Data Capture (CDC), temporal filters, and subscriptions.
- PostgreSQL wire-compatibility. Materialize uses the PostgreSQL wire protocol, which allows it to integrate out-of-the-box with many SQL clients and other tools in the data ecosystem that support PostgreSQL — like dbt.
- Strong consistency guarantee. Materialize provides the highest level of transaction isolation: strict serializability. This means that it presents as if it were a single process, despite spanning a large number of threads, processes, and machines. Strict serializability avoids common pitfalls like eventual consistency and dual writes, which affect the correctness of your results.
Materialize is a new kind of data warehouse built for operational workloads: the instant your data changes, Materialize reacts.
6. Quix
Quix Platform is a complete system that enables you to develop, debug, and deploy real-time streaming data applications. Quix provides an online IDE and an open-source stream processing library called Quix Streams. Quix Streams is a client library that can be used in Python or C# code to develop custom elements of a processing pipeline.
Quix Platform was built on top of a message broker, specifically Kafka, rather than on top of a database. Databases introduce latency that can cause problems in real-time applications, and they can also present scaling issues. Quix Platform abstracts these issues away, providing you with a scalable and cost-effective solution.
From the top-down, the Quix stack provides the following:
- Quix Portal, the web-based Integrated Development Environment (IDE)
- REST and websocket APIs
- Quix Streams
7. Bytewax
Bytewax is an open source Python framework for building highly scalable dataflows in a streaming or batch context. It is based on the Timely Dataflow library, which is a dataflow processing library written in Rust. Bytewax provides a number of features that make it a powerful tool for building stream processing applications, including:
- Dataflow programming: Bytewax uses a dataflow programming model, which means that program execution is conceptualized as data flowing through a series of operations or transformations. This makes it easy to build complex applications that process data in real time.
- Stateful processing: Bytewax supports stateful processing, which means that some operations can remember information across multiple events. This is useful for applications that need to track the state of the world, such as fraud detection or anomaly detection.
- Windowing: Bytewax supports windowing, which allows you to aggregate data over a period of time. This is useful for applications that need to track trends or patterns in data.
- Connectors: Bytewax provides connectors to a variety of data sources, such as Kafka, Spark, and Redis. This makes it easy to connect your applications to the data that you need to process.
Bytewax is a relatively new framework, but it has a lot of potential. It is a good choice for organizations that are looking for a powerful and flexible stream processing framework that is written in Python.
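The dataflow programming model and stateful processing described above can be sketched in plain Python. The code below deliberately does not use the Bytewax library, and the `Dataflow` class with its `map`, `filter`, and `stateful_map` methods is a hypothetical stand-in rather than Bytewax's API. It only shows the general shape of the idea: a program is declared as a chain of steps, and some steps keep per-key state across events.

```python
class Dataflow:
    """Hypothetical toy dataflow: an ordered chain of steps applied to each item."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self

    def stateful_map(self, key_fn, initial, fn):
        """fn(state, item) -> (new_state, output); state is remembered per key."""
        states = {}

        def step(item):
            key = key_fn(item)
            state, output = fn(states.get(key, initial), item)
            states[key] = state
            return output

        self.steps.append(("map", step))
        return self

    def run(self, source):
        """Push each item through every step in order; filtered items are skipped."""
        for item in source:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item


# Running total of positive payments per user.
flow = (
    Dataflow()
    .filter(lambda e: e["amount"] > 0)
    .stateful_map(
        key_fn=lambda e: e["user"],
        initial=0,
        fn=lambda total, e: (total + e["amount"], (e["user"], total + e["amount"])),
    )
)

payments = [
    {"user": "u1", "amount": 10},
    {"user": "u2", "amount": 5},
    {"user": "u1", "amount": -3},
    {"user": "u1", "amount": 7},
]
print(list(flow.run(payments)))  # [('u1', 10), ('u2', 5), ('u1', 17)]
```

Note that in this toy version the per-key state lives inside the dataflow object, so running the same `flow` a second time would continue from the earlier totals.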
Comparison
There are many factors to consider when comparing stream processing frameworks, especially when deciding whether to adopt one in production. Here are some of the most important ones:
Latest version: This indicates whether the software is under regular maintenance. An active project is usually backed by a thriving user community.
Open source license: Is the product open source? If so, which license does it use? Different open-source licenses impose different restrictions on usage.
Distributed system: Is the system distributed, and how does it scale with the workload?
Ease of use: How easy is it to develop applications on the framework? This is important for organizations that don't have a lot of in-house expertise in stream processing. In practice, ease of use depends heavily on the user interface.
Stream processing capability: What streaming queries can the system support? Is the supported interface expressive enough to support various applications? In particular, we focus on three aspects:
- Supported join types. Joining a stream with another stream or with a static table can be complicated, because it requires stateful stream processing and therefore complex state management.
- Supported time windows. Which window types (for example, tumbling, hopping, and session windows) does the framework offer for grouping a stream by time?
- Watermark support. Out-of-order data may occur for various reasons. Does the framework support watermarks to handle out-of-order data and guarantee correct results?
Ecosystem: How easily can it be integrated into existing software ecosystems? This is especially crucial for stream processing frameworks, as they must connect to various data sources and sinks.
Deployment model: What deployment models does the framework natively support? What about deployment models from cloud providers? Is Bring Your Own Cloud (BYOC) supported?
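To show how the time window and watermark criteria interact, here is a minimal Python sketch of tumbling-window counting with a watermark. It is a generic illustration and not the behavior of any specific framework in this comparison: the function `tumbling_window_counts` and its parameters `window_size` and `allowed_lateness` are hypothetical, and timestamps are plain integers.

```python
from collections import defaultdict


def tumbling_window_counts(timestamps, window_size, allowed_lateness):
    """Count events per tumbling window, finalizing windows as the watermark passes them.

    Returns (closed, dropped): closed is a list of (window_start, count) in the
    order windows were finalized, dropped lists timestamps that arrived too late.
    """
    open_windows = defaultdict(int)
    closed, dropped = [], []
    max_ts = None

    for ts in timestamps:
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - allowed_lateness
        start = ts - ts % window_size

        if start + window_size <= watermark:
            # The window for this event was already finalized: too late.
            dropped.append(ts)
        else:
            open_windows[start] += 1

        # Finalize every open window that ends at or before the watermark.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                closed.append((s, open_windows.pop(s)))

    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        closed.append((s, open_windows.pop(s)))
    return closed, dropped


closed, dropped = tumbling_window_counts(
    [1, 4, 12, 8, 17, 3, 25], window_size=10, allowed_lateness=5
)
print(closed)   # [(0, 3), (10, 2), (20, 1)]
print(dropped)  # [3]
```

The event with timestamp 8 arrives out of order but is still counted, because the watermark has not yet passed the end of its window. The event with timestamp 3 arrives after its window was finalized and is dropped. A larger `allowed_lateness` tolerates more disorder at the cost of later results.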
![](https://risingwave9.wpcomstaging.com/wp-content/uploads/2023/09/截屏2023-09-26-10.51.28.png)
Apache Flink is undoubtedly a strong and powerful stream processing framework, but it's essential to explore alternatives to determine the best fit for your specific use case. The seven alternatives discussed in this article offer a range of features and integrations, making them valuable contenders for various real-time data processing needs. When choosing an alternative to Apache Flink, consider factors such as your existing technology stack, scalability requirements, and the complexity of your stream processing tasks. Ultimately, the right choice will depend on your organization's unique circumstances and goals.