What Is the Best Way to Write a Stream Processing Application?
Discover the game-changing art of stream processing as we unveil the different ways of coding streaming applications and much more.
Stream processing acts on data as it enters the system (while it's in motion). Unlike traditional offline processing, where a complete data set is available for the processor to act on, stream processing continuously works on partial or limited data. As a result, stream processing algorithms are incremental and stateful.
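To make "incremental and stateful" concrete, here is a minimal sketch (in plain Python, with hypothetical names) of an operator that maintains a running average: each event updates a small piece of state rather than recomputing over the full data set.

```python
from dataclasses import dataclass

# Hypothetical sketch: an incremental, stateful operator that keeps a
# running average without ever holding the full data set.
@dataclass
class RunningAverage:
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> float:
        # State (count, total) is carried across events; each new event
        # only touches this small state, not the whole stream.
        self.count += 1
        self.total += value
        return self.total / self.count

avg = RunningAverage()
for reading in [10.0, 20.0, 30.0]:
    current = avg.update(reading)
print(current)  # average after the last event
```

A batch job would instead load all readings at once and compute the average in one pass; the streaming version produces an up-to-date answer after every event.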
This type of processing is typically used in applications that require very low latency—for example, in a remote patient monitoring (RPM) solution where a patient monitor device tracks a patient at a remote location and a caregiver needs to be notified immediately when certain parameters fall below a safe level.
The simplest implementation of stream processing uses messaging queues. Data from sensors is pumped into messaging queues to which listeners are attached. The listeners implement the real-time algorithm that computes the necessary values from the data in the stream. The following diagram shows the difference between offline processing and stream processing for an IoT system that monitors the input voltage of an HVAC system to detect a "continuous low input voltage" condition:
Here, the Kafka consumer implementation increases in complexity as more use cases are added, such as the correlation between voltage and current values or exception handling, such as voltage fluctuations. These complexities can be abstracted into a stream processing framework to provide a robust, highly available, resilient environment to implement stream processing applications. A few examples of stream processing frameworks include Apache Flink, Azure Stream Analytics, and RisingWave.
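The listener logic for the "continuous low input voltage" case can be sketched in plain Python (the threshold and window size below are hypothetical, and an in-memory iterable stands in for the Kafka consumer loop):

```python
from collections import deque

LOW_VOLTAGE = 200.0   # hypothetical threshold, in volts
WINDOW = 3            # consecutive low readings that count as "continuous"

def detect_continuous_low_voltage(readings, threshold=LOW_VOLTAGE, window=WINDOW):
    """Yield an alert whenever `window` consecutive readings are below threshold."""
    recent = deque(maxlen=window)  # bounded state, as in any streaming operator
    for value in readings:
        recent.append(value)
        if len(recent) == window and all(v < threshold for v in recent):
            yield f"ALERT: low input voltage for {window} consecutive readings"

# A real consumer would poll a broker; a list stands in for the stream here.
stream = [230.0, 195.0, 190.0, 188.0, 231.0]
alerts = list(detect_continuous_low_voltage(stream))
```

Extending this hand-rolled consumer with correlation across streams, reordering, or failure recovery is exactly where the complexity grows and a framework starts to pay off.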
To create a stream processing application, these frameworks use different types of interfaces, which typically fall into the following categories:
- Visual modeling or no code
- Raw API interface
- Scripting API
- SQL code
This article compares these different interfacing methods, their pros and cons, and the factors you should consider before choosing one.
Factors to Consider When Choosing a Stream Processing Framework
The following are some of the most important factors to consider when choosing a stream processing framework.
Ease of Coding
The ease of coding depends on the steepness of the learning curve and whether developers can quickly adopt the framework. For example, a framework may be considered challenging if developers have to learn a new language's syntax, understand custom APIs, or follow nonstandard syntax.
Declarative vs. Imperative Programming
There are two programming styles: declarative (emphasis on what to achieve) and imperative (emphasis on how to achieve it). Coding with native APIs is imperative programming, while writing SQL is declarative programming. It's easier to ramp up with declarative programming than with imperative programming.
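The contrast can be seen in a toy per-key count (field names are hypothetical): the imperative version spells out every step, while the declarative version only states the result it wants.

```python
# Imperative: spell out *how* to compute a per-key count.
events = [("ad1", 1), ("ad2", 1), ("ad1", 1)]

counts = {}
for key, _ in events:
    counts[key] = counts.get(key, 0) + 1

# Declarative (SQL equivalent): state *what* you want and let the
# engine decide how to compute and incrementally maintain it:
#   SELECT key, COUNT(*) FROM events GROUP BY key;
```

In the imperative version, you also own the state (`counts`) and its lifecycle; in the declarative version, state management is the engine's problem.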
Flexibility and Portability
The flexibility of a framework is defined by how fast you can go from a rapid prototype to a complex algorithm. While visual or no-code modeling will get you up and running quickly, it offers limited flexibility to extend functions as use cases become more complex.
When a framework fails to extend with your requirements, you have to port over to a new framework. Hence, you need a framework that lets you write loosely coupled code. Frameworks that provide native APIs tightly couple your code to the framework, while frameworks that provide SQL interfaces are loosely coupled.
Debugging and Visualization
Streaming data is transient, which makes debugging more complex: you need to trace the steps in a workflow and narrow the problem down to a specific step in real time. Moreover, these applications run on standalone clusters with multiple levels of parallelism, so you need tools that provide workflow visualization and error indicators while the data is flowing.
Multiple Source Streams
Typical applications have multiple types of source streams. IoT devices can connect through an MQTT broker or directly to a Kafka queue. This means you need a framework that supports various source streams such as Kafka queues, socket data, and Amazon Kinesis.
Multiple Message Formats
Data formats within an application vary. Frameworks need to work with common data formats, such as JSON and serialized objects, and provide ways to define custom deserialization of data in streams.
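A custom deserializer is usually just a function from raw bytes to a record, selected per stream. A minimal sketch, assuming hypothetical field names and a two-format dispatch:

```python
import json

def deserialize(raw: bytes, fmt: str) -> dict:
    # Dispatch on the declared format; frameworks let you plug in
    # custom deserializers for each source stream in the same spirit.
    if fmt == "json":
        return json.loads(raw)
    if fmt == "csv":
        sensor, voltage = raw.decode().split(",")
        return {"sensor": sensor, "voltage": float(voltage)}
    raise ValueError(f"unsupported format: {fmt}")

record = deserialize(b'{"sensor": "hvac-1", "voltage": 228.5}', "json")
```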
Comparing Different Ways to Write a Stream Processing Application
As mentioned, stream processing frameworks provide different methods for writing applications. The following sections describe some of these methods and their pros and cons.
Visual Modeling or No Code
In this method, the framework provides a visual modeling tool, which you can use to create analytical workflows from source streams. The following screenshot is taken from Microsoft Azure Stream Analytics:
Pros:

- Visualizes the data flow.
- Enables rapid prototyping and deployment, improving time to market.

Cons:

- Not everything can be achieved using the standard blocks the tool provides.
- Not entirely no-code, as some amount of coding is still required.
- Porting to another framework is impossible.
Raw API Interface
In this method, you code directly against the framework's native APIs—for example, Apache Flink's DataStream API in Java. Below is some sample Java API code:
```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Tuple2<String, Integer>> dataStream = env
    .socketTextStream("192.168.0.103", 9999)
    .flatMap(new Splitter())
    .keyBy(value -> value.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .sum(1);
```
Pros:

- Highly flexible, so complex algorithms can be written.
- Provides fine-grained control over the various parameters of operation, which supports more complex use cases. For example, previous data can drive follow-up operations: in the patient monitor example, the raw API could be configured to check for rising blood pressure if the blood sugar remains high for more than five minutes.
- Easy to convert from streaming to batch mode processing.
- Multiple data streams can be combined with operations such as UNION and JOIN, depending on the support provided by the framework.

Cons:

- Tightly coupled to the framework used, so porting is very difficult.
- Steeper learning curve because the APIs need to be understood.
- Code is hard to understand, and debugging needs very specific tools.
- Tightly coupled to the stream data type because the code is written for a specific data type, so any change in the data type requires code changes.
- Predefined named source streams are mostly absent; hence, the logic is tightly coupled to the defined source stream and cannot be changed without changing code.
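The "previous data drives follow-up operations" pattern from the patient monitor example amounts to a small state machine over event time. A toy sketch in plain Python (thresholds, timings, and names are hypothetical, not any framework's API):

```python
# If blood sugar stays high for five minutes, trigger a follow-up
# blood-pressure check. State is carried across events.
HIGH_SUGAR = 180.0    # hypothetical mg/dL threshold
HOLD_SECONDS = 300    # five minutes

class FollowUpMonitor:
    def __init__(self):
        self.high_since = None  # timestamp when sugar first went high

    def on_event(self, ts: float, sugar: float) -> bool:
        """Return True when the follow-up blood-pressure check should run."""
        if sugar > HIGH_SUGAR:
            if self.high_since is None:
                self.high_since = ts
            return ts - self.high_since >= HOLD_SECONDS
        self.high_since = None  # condition cleared; reset the timer
        return False

mon = FollowUpMonitor()
triggered = [mon.on_event(t, 190.0) for t in (0, 120, 240, 360)]
```

Raw APIs give you direct access to this kind of keyed state and timers; the trade-off, as listed above, is that the resulting code is framework-specific.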
Python API (Scripting)
In this method, the framework provides scripting APIs that can be called in scripting languages such as Python. For example, Apache Flink provides a Table API to call in Python code.
Below is some sample Python code:
```python
from pyflink.table import EnvironmentSettings, TableEnvironment

env_settings = EnvironmentSettings.new_instance().use_blink_planner().build()
table_env = TableEnvironment.create(env_settings)

proj_table = table_env.from_path("X").select(...)
table_env.register_table("projectedTable", proj_table)
```
Pros:

- Reduced complexity compared to using a raw API.
- Easy to code because it uses a scripting language.
- Scripting allows dynamic data types, unlike statically typed languages such as Java, decoupling the stream data type from the code.
- SQL artifacts can be created from multiple connectors, such as Kafka connectors, depending on what's compatible with the framework.
- Decouples the source from the logic: if the data type remains the same, the same SQL artifact can be recreated with a different connector.

Cons:

- Tightly coupled to the framework used, so porting is difficult.
- Requires an understanding of the framework's specific API, so there's a steeper learning curve.
- Requires framework-specific tools for debugging.
SQL Code
In this method, the framework allows you to create various SQL artifacts and run standard SQL statements against them to retrieve and process data. For example, RisingWave uses sources, sinks, and materialized views to provide easy access to streaming data, which can then be queried from any SQL-based tool or code.
The following is some sample SQL code:
```sql
CREATE MATERIALIZED VIEW ad_ctr AS
SELECT
    ad_clicks.ad_id AS ad_id,
    ad_clicks.clicks_count :: NUMERIC / ad_impressions.impressions_count AS ctr
FROM (
    SELECT
        ad_impression.ad_id AS ad_id,
        COUNT(*) AS impressions_count
    FROM ad_impression
    GROUP BY ad_id
) AS ad_impressions
JOIN (
    SELECT
        ai.ad_id,
        COUNT(*) AS clicks_count
    FROM ad_click AS ac
    LEFT JOIN ad_impression AS ai ON ac.bid_id = ai.bid_id
    GROUP BY ai.ad_id
) AS ad_clicks ON ad_impressions.ad_id = ad_clicks.ad_id;
```
Pros:

- Declarative programming allows you to focus on what needs to be achieved rather than how.
- Uses standard SQL, so the learning curve is manageable.
- Materialized views can be created on top of other materialized views, decoupling the various ways of consuming the stream data from the actual computations.
- Multiple sources can be added and removed without affecting the materialized views.
- Multiple sources can be combined easily using SQL JOINs.

Cons:

- Reduced control over algorithm complexity, which may require workarounds using various SQL functions.
- Porting is only possible to other frameworks that support SQL-based access.
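Behind a streaming materialized view, the engine maintains the query result incrementally rather than re-running the query over all history. A toy sketch of that idea for a `GROUP BY` count, in plain Python with hypothetical names (not RisingWave's actual implementation):

```python
# Incremental maintenance of `SELECT ad_id, COUNT(*) FROM ad_impression
# GROUP BY ad_id`: each arriving event updates the stored result.
class CountView:
    def __init__(self):
        self.counts = {}  # the materialized result

    def apply(self, ad_id: str):
        # One event changes one entry; history is never rescanned.
        self.counts[ad_id] = self.counts.get(ad_id, 0) + 1

    def query(self, ad_id: str) -> int:
        # Reads are cheap: the answer is already materialized.
        return self.counts.get(ad_id, 0)

view = CountView()
for event in ["ad1", "ad1", "ad2"]:
    view.apply(event)
```

This is why materialized views can serve low-latency reads: the cost is paid incrementally at write time, not at query time.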
Conclusion
Looking at the different ways to write a stream processing application, SQL strikes the best balance between flexibility, ease of coding, and ease of maintenance. SQL is already the most popular language for big data analytics, and emerging standardized streaming SQL definitions are making it the future standard for stream processing as well.
This article looked at the different ways of coding streaming applications with frameworks, their pros and cons, and the factors that play a role in selecting the right stream processing framework. RisingWave is an open-source distributed SQL database for stream processing, designed to reduce the complexity and cost of building real-time applications. Access our GitHub repository to learn more.