What Is the Best Way to Write a Stream Processing Application?
Discover the game-changing art of stream processing as we unveil the different ways of coding streaming applications and much more.
Discover the game-changing art of stream processing as we unveil the different ways of coding streaming applications and much more.
Stream processing processes data as it enters the system (as it's in motion). Unlike traditional offline processing, where a complete data set is available for the processor to act on, stream processing continuously works on partial or limited data. So, stream processing algorithms are incremental and stateful.
This type of processing is typically used in applications that require very low latency—for example, in a remote patient monitoring (RPM) solution where a patient monitor device is monitoring a patient at a remote location and a caregiver needs to be notified immediately when certain parameters go below a certain level.
The most simplistic implementation of stream processing uses messaging queues. Data from sensors is pumped into messaging queues to which listeners are attached. The listeners implement the real-time algorithm to compute the necessary values from the data in the stream. The following diagram shows the difference between offline processing and stream processing for an IoT system that monitors the input voltage of an HVAC system to detect a "continuous low input voltage":
Here, the Kafka consumer implementation increases in complexity as more use cases are added, such as the correlation between voltage and current values or exception handling, such as voltage fluctuations. These complexities can be abstracted into a stream processing framework to provide a robust, highly available, resilient environment to implement stream processing applications. A few examples of stream processing frameworks include Apache Flink, Azure Stream Analytics, and RisingWave.
To create a stream processing application, these frameworks use different types of interfaces, which typically fall into the following categories:
This article compares these different interfacing methods, their pros and cons, and the factors you should consider before choosing one.
The following are some of the most important factors to consider when choosing a stream processing framework.
The ease of coding is affected by the steepness of the learning curve and whether developers can quickly adopt the framework. For example, a framework may be considered challenging if developers have to learn a new language's syntax, understand custom APIs, or follow the nonstandard syntax.
There are two ways of coding: declarative (emphasis on achieving the function) and imperative (emphasis on how to achieve the function). Coding with native APIs is imperative programming, and coding SQL is declarative programming. It's easier to ramp up with declarative programming than imperative programming.
The flexibility of the framework is defined by how fast you can go from rapid prototyping to complex algorithms. While visual-based or no-code modeling will get you up and running quickly, it has limited flexibility to extend functions as use cases become more complex.
When a framework fails to extend with the requirements, you have to port over to a new framework. Hence, you need a framework that allows you to write loosely coupled code. Frameworks that provide native APIs tightly couple your code to the framework, while frameworks that provide SQL APIs are loosely coupled.
Data is transient, making debugging more complex. You need to trace the steps in a workflow and narrow it down to the problem step in real time. Moreover, these run on standalone clusters with multiple levels of parallelism, so you need tools that provide workflow visualization and error indicators while the data is flowing.
Typical applications have multiple types of source streams. IoT devices can connect through an MQTT broker or directly to a Kafka queue. This means you need a framework that supports various source streams such as Kafka queues, socket data, and Amazon Kinesis.
Data formats within an application vary. Frameworks need to work with common data formats, such as JSON and serialized objects, and provide ways to define custom deserialization of data in streams.
As mentioned, stream processing frameworks provide different methods for writing applications. The following sections describe some of these methods and their pros and cons.
In this method, the framework provides a visual modeling tool, which you can use to create analytical workflows from source streams. The following screenshot is taken from Microsoft Azure Stream Analytics:
Pros
Cons
In this method, the framework exposes native APIs, such as Java APIs and Scala APIs, that are called directly from the application. Tools like Apache Spark and Apache Flink utilize this method.
Below is some sample Java API code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> dataStream = env
.socketTextStream("192.168.0.103", 9999)
.flatMap(new Splitter())
.keyBy(value -> value.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.sum(1);
Pros
Cons
In this method, the framework provides scripting APIs that can be called in scripting languages such as Python. For example, Apache Flink provides a Table API to call in Python code.
Below is some sample Python code:
from pyflink.table import EnvironmentSettings, TableEnvironment
env_settings = EnvironmentSettings.new_instance().use_blink_planner().build()
table_env = TableEnvironment.create(env_settings)
proj_table = table_env.from_path("X").select(...)
table_env.register_table("projectedTable", proj_table)
Pros
Cons
In this method, the framework allows you to create various SQL artifacts and use standard SQL statements against these artifacts to retrieve and process data. For example, RisingWave uses sources, sinks, and materialized views to allow easy access to streaming data, which can then be accessed via any SQL-based tool or code.
The following is some sample SQL code:
CREATE MATERIALIZED VIEW ad_ctr AS
SELECT
ad_clicks.ad_id AS ad_id,
ad_clicks.clicks_count :: NUMERIC / ad_impressions.impressions_count AS ctr
FROM
(
SELECT
ad_impression.ad_id AS ad_id,
COUNT(*) AS impressions_count
FROM
ad_impression
GROUP BY
ad_id
) AS ad_impressions
JOIN (
SELECT
ai.ad_id,
COUNT(*) AS clicks_count
FROM
ad_click AS ac
LEFT JOIN ad_impression AS ai ON ac.bid_id = ai.bid_id
GROUP BY
ai.ad_id
) AS ad_clicks ON ad_impressions.ad_id = ad_clicks.ad_id;
Pros
Cons
Looking at the different ways to write a streaming processing application, SQL strikes a balance between flexibility, ease of coding, and ease of maintenance for developing streaming applications. SQL is the de facto most popular language for big data analytics and standardized streaming SQL definitions are beginning to make it the future standard of stream processing.
Conclusion
This article looked at the different ways of coding streaming applications with frameworks, their pros and cons, and the various factors that play a role in selecting the right stream processing framework. RisingWave is an open-source distributed SQL database for stream processing. It is designed to reduce the complexity and cost of building real-time applications. Access our GitHub repository to learn more.
Raji Sankar
Community Contributor