Edge computing is a modern architecture that allows companies to quickly analyze large amounts of data by processing it close to where it is generated. Edge computing is widely used in medicine, transportation, manufacturing, and other industries. However, it comes with challenges around management, maintenance, and ensuring data security and availability.

Stream processing is a technology that makes it possible to continuously receive and process a large stream of data in real time. It works well with edge computing and can help solve challenges that arise in the operation of edge computing.

This article explores the use cases and challenges of edge computing and how stream processing can help solve these challenges.


What Is Edge Computing?


An edge computing architecture involves processing, analyzing, and storing data at or as close to the data source as possible, rather than sending it to the central server. It enables rapid analysis and processing of data on peripheral devices.

In today's world, many devices and sensors constantly generate huge amounts of data. Transferring this data to centralized cloud storage and processing it there is slow and expensive. So, companies use edge computing to reduce the overall volume of data that needs to be transferred to and stored on a central server, reducing latency and network costs and enabling real-time decision-making.


Edge Computing Use Cases


Edge computing is well-suited for situations where large amounts of data are constantly generated, for scenarios with limited bandwidth, and for processes that require real-time decision-making or low latency.


Autonomous Vehicles


Autonomous vehicles process a large amount of data that comes from various on-board sensors. They receive information about the speed and direction of movement, location, presence of other vehicles nearby, etc. For their smooth operation, it's critical to process data in real time. This is where edge computing plays a crucial role in allowing large amounts of data to be processed on peripheral computers in milliseconds.


Patient Health Monitoring


Wearable devices and medical sensors are becoming more and more popular, making it easy to continuously collect important information about patients' health. For example, smartwatches can record an ECG and measure blood pressure and blood oxygen levels.

The analysis of this information in real time makes it possible to detect diseases in their early stages, identify situations when urgent medical assistance is needed, and notify a doctor about them.


Predictive Maintenance


Edge computing technology makes it possible to monitor the technical equipment of enterprises in real time. This enables continuous awareness of equipment conditions, early detection and resolution of issues before they cause breakdowns, and accurate prediction of maintenance needs.

For example, in oil and gas production, sensors are installed on drilling equipment, pumps, and pipelines. Peripheral devices process the data from these sensors and monitor the condition of the equipment, detect early signs of wear and leaks, and predict when maintenance is due.


Increasing Agricultural Yield


Peripheral devices allow farmers to collect data about the temperature and humidity of the air, as well as the condition of the soil, in real time. Based on this data, farmers can make decisions about the need for irrigation, fertilization, etc.

Peripheral devices that are equipped with image recognition capabilities can scan crops for signs of disease or pest infestation. This information makes it possible to identify problems in time and take measures to prevent the spread of diseases and control pests.


Smart Cities


In smart cities, edge computing can provide automatic monitoring and management of traffic flow, detection and notification of the presence of harmful substances in the air, automatic street lighting, and so on. This allows for the effective development of smart cities to improve public safety and citizen satisfaction.


Personalized Advertising


In e-commerce, customer experience data is analyzed to understand customer interests and provide personalized offers. Edge computing makes it possible to analyze data on customer devices and send only the most important information to the cloud. It also helps protect users' privacy, since sensitive information is not sent to the cloud.


Challenges of Edge Computing


Edge computing offers many advantages. However, certain challenges need to be addressed when implementing edge computing in order for it to work as efficiently as possible and bring more business benefits.


Data Analysis


Peripheral devices generate large amounts of data, but they often have limited capacity and computing power, which makes data storage, processing, and analysis difficult. Another problem is the risk of losing valuable information, since only a portion of the data is analyzed at the edge.


Performance and Latency


An unstable network connection, high latency, or limited bandwidth slows down data transfer to the central server. In addition, it's important to correctly determine which data must be transferred to the server. The limited computing resources of peripheral devices also cause performance issues.


Security and Privacy


Ensuring data security and privacy is one of the biggest challenges in implementing edge computing. Since many devices are involved in data processing and analysis, the risk of unauthorized access significantly increases. In addition, peripheral devices are usually less physically secure and more vulnerable than cloud servers.


Resilience


Peripheral devices use a network connection to transfer data to the central cloud. Network failures, increased latency, and packet loss can disrupt the process of data transmission and synchronization. During the operation of peripheral devices, hardware and power failures can occur, which leads to data loss and data unavailability. Therefore, ensuring data availability and accessibility is an important task when implementing edge computing.


Management and Maintenance


Since the number of peripheral devices is very large, the process of their management and maintenance is quite complex. Remote management and monitoring, software updates, and security measures must be implemented.


Using Stream Processing to Resolve Edge Computing Challenges


Stream processing is a technology that allows data to be processed immediately after it is received, which is very important for applications that require real-time data analysis. In addition, data streaming systems provide tools for real-time data aggregation, filtering, enrichment, and analysis. The integration of stream processing and edge computing provides companies with a number of advantages.


Improved Data Analysis


Stream processing provides the ability to aggregate data by combining multiple data points over time, based on specific events, and more. This allows companies to reduce the amount of data and extract more meaningful information. In addition, stream processing can filter out irrelevant or noisy data at the edge, allowing only important information to be transmitted or stored.

The ability to aggregate and filter data reduces the amount of data that needs to be stored, analyzed, and transferred to a centralized server and makes it possible to process and analyze data as it is created.
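To make this concrete, here is a minimal Python sketch of the idea, with a hypothetical window size, threshold, and send_to_cloud() uplink: an edge process windows raw sensor readings, forwards only anomalous values, and otherwise sends a periodic summary instead of every data point.

from collections import deque
from statistics import mean

WINDOW_SIZE = 60           # keep the last 60 readings (hypothetical)
ALERT_THRESHOLD = 90       # forward individual readings above this value (hypothetical)

window = deque(maxlen=WINDOW_SIZE)

def send_to_cloud(payload):
    """Placeholder for the uplink to the central server."""
    print("uplink:", payload)

def on_sensor_reading(value):
    """Called for every raw reading produced at the edge."""
    window.append(value)

    # Filter: only readings that matter leave the device immediately.
    if value > ALERT_THRESHOLD:
        send_to_cloud({"type": "alert", "value": value})

    # Aggregate: once per full window, send one summary instead of 60 points.
    if len(window) == WINDOW_SIZE:
        send_to_cloud({"type": "summary", "avg": mean(window), "max": max(window)})
        window.clear()

# Example: simulate a short stream of readings.
for reading in [72, 75, 95, 74, 73]:
    on_sensor_reading(reading)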


Reduced Latency and Faster Decision-Making


Streaming systems can run on peripheral devices and enable large volumes of streaming data to be processed. They provide tools to aggregate and filter data based on certain criteria. This reduces the amount of data sent to the cloud, saving bandwidth and cloud processing costs, reducing latency, and increasing performance.

Real-time analytics allows organizations to detect malfunctions, trigger alerts, and obtain important statistics that are essential for quick decision-making. RisingWave Database is a streaming database that allows you to receive streaming data, process it incrementally, and dynamically update the results.

Streaming systems also provide data enrichment possibilities that allow you to add context or additional information to streaming data. This may involve merging data streams with reference data or performing search operations.


Simplified Management and Maintenance of Peripheral Devices


Using stream processing, central control systems can communicate instructions to peripheral devices, manage the distribution of firmware updates, and monitor the update process. It's also possible to receive and analyze data on the serviceability and operating status of peripheral devices and perform real-time monitoring. This allows companies to simplify management and maintenance processes, monitor the performance of devices, and identify potential problems.

RisingWave Cloud is a fully managed streaming platform that makes streaming easy and provides convenient tools for monitoring and managing clusters and users.


Increased Security


Stream processing in edge computing helps detect security threats, such as intrusion attempts or suspicious user behavior, in real time. It also makes it possible to send an automatic response when a potential threat is detected (for example, by blocking malicious network traffic).


Ability to Create a Resilient System


Stream processing allows peripheral devices to analyze and process data as it is generated without having to transmit it to a centralized server. This allows peripherals to make decisions autonomously and reduce dependency on the network.

In case a peripheral device fails, such a system allows other devices to work and make decisions independently.

Streaming systems contribute to a resilient system by providing fault tolerance features such as data replication, load balancing, and automatic failover.


Examples of Integration of Edge Computing and Stream Processing


Edge computing and stream processing are used in a wide range of industries to enable real-time data processing, reduce latency, and facilitate rapid decision-making. Let's look at some examples.


Financial Industry


ATMs, as well as online and mobile banking applications installed on edge devices, can collect data about users' behavior and transactions. This data is analyzed using stream processing, which allows financial companies to detect suspicious or fraudulent activity and send instant notifications to the bank's security service.

Customer behavior data is used to generate information about customer preferences and needs, allowing financial companies to provide personalized services and recommendations to customers.

Companies such as Nationwide Building Society, Capital One, and PayPal successfully use edge computing and stream processing to improve customer interaction, monitor transactions in real time, and detect fraud.


Retail


Peripheral devices equipped with GPS trackers are used to monitor the movement of goods in the supply chain. Using stream processing, retailers can track the location and condition of goods. Retailers can streamline logistics and provide customers with real-time shipment tracking information.

The use of edge computing and stream processing also allows for real-time customer feedback analysis, which can be left by the client through mobile applications. This allows for quick identification and resolution of issues and improves customer satisfaction.

The integration of edge computing and stream processing is used by such well-known companies as Amazon, Walmart, and Albertsons.


Transportation


Real-time streaming data analysis helps optimize traffic flow and reduce congestion in smart cities. The ability to obtain data about traffic conditions, the location of pedestrians, and other objects makes it possible for autonomous vehicles to make decisions in a fraction of a second.

Examples of using a combination of edge computing and stream processing include autopilot systems by Tesla and Waymo.


Medicine


The use of peripheral devices makes it easy to remotely monitor the health of patients. The integration of streaming data allows you to quickly obtain important health indicators and respond to emergency situations, such as a sudden increase in blood pressure or heart rate. In addition, the analysis of medical data allows the identification of potential health problems and opportunities for early intervention.

The integration of edge computing with stream processing is used by Philips Healthcare and Siemens Healthineers.

Conclusion

The number of devices continues to grow. Edge computing technology has emerged due to the need to efficiently analyze an ever-increasing amount of data generated by these devices. Edge computing is used in various industries because it has many advantages, including increasing productivity, reducing latency and decision-making time, and reducing costs.

However, a number of difficulties arise in the process of implementing edge computing technology and managing peripheral devices. One of the ways to solve these difficulties is to use stream processing. The integration of edge computing and stream processing increases the efficiency of data processing, simplifies the process of management and control of peripheral devices, and improves the security of devices and data.

Want to learn more about stream processing? Read about how the RisingWave data streaming system works, subscribe to the RisingWave LinkedIn and Twitter pages, or join RisingWave’s Slack workspace.

Using stream processing to power real-time machine learning (ML) can create dynamic and responsive systems capable of making intelligent decisions in real time. Stream processing deals with the continuous and rapid analysis of data as it flows in, enabling organizations to harness data as it's generated. On the other hand, machine learning involves training models on historical data to make predictions or classifications on new data. Traditionally, ML has been used in batch processing scenarios, but its application in real-time scenarios is becoming increasingly important for applications like fraud detection, recommendation systems, and internet of things (IoT) analytics.

This article explores the intersection of stream processing and real-time machine learning. It begins by introducing the concept and highlighting the importance of stream processing in the realm of real-time ML. By the end of the article, you'll have a solid understanding of the synergy between stream processing and real-time ML.


The Need for Stream Processing


At its core, stream processing involves the real-time analysis and manipulation of data as it flows in, allowing organizations to keep pace with the rapid influx of high-velocity data streams. This capability is vital in an era where data is generated at an unprecedented rate from sources like sensors, social media, IoT devices, and so on.

Real-time data ingestion is one key advantage of stream processing. Unlike traditional batch processing, which works with data in chunks, stream processing ingests and analyzes data point by point as it's produced. This real-time ingestion allows businesses to respond instantly to changing conditions. For example, an e-commerce platform can process user interactions as they happen, immediately tailoring product recommendations or promotional offers based on the user's current behavior. This level of responsiveness can significantly enhance the user experience and drive engagement.

Scalability is another compelling reason to embrace stream processing. As data volumes grow, stream processing systems can effortlessly scale to handle the increased load. Whether you have a sudden surge in data or a consistently high data volume, stream processing frameworks like Apache Kafka, Apache Flink, or RisingWave can distribute the workload across multiple nodes, ensuring efficient processing without compromising performance. This scalability is crucial for businesses that need to adapt to changing demands and handle large-scale data processing tasks without incurring exorbitant infrastructure costs.

Moreover, stream processing offers cost efficiency by optimizing resource utilization and allowing you to pay only for the resources you need, precisely when you need them. Additionally, real-time applications rely on the high availability of services. Stream processing supports better fault tolerance in these applications by allowing systems to continue processing even when there are hardware failures on certain nodes or other issues. This ensures uninterrupted data analysis and minimizes downtime.


Why Is Stream Processing Crucial in the Machine Learning Lifecycle?


The significance of stream processing extends far beyond traditional applications. This section explores why stream processing is crucial in the ML lifecycle, shedding light on its role in enhancing the performance and usability of various ML applications. For illustrative purposes, the article focuses on the dynamic and data-intensive world of over-the-top (OTT) platforms, such as Netflix, Amazon Prime Video, and Disney+.


Traditional ML pipeline


Before getting into the details of stream processing in the ML lifecycle, it helps to familiarize yourself with the typical workflow for training and using ML models in the industry. The following diagram illustrates a traditional machine learning pipeline:

[Figure: A traditional machine learning pipeline]

The process begins with the initial source data, which is meticulously transformed to make it suitable for model training. The model training engine leverages algorithms and mathematical techniques to learn patterns and relationships within the data. Through this process, it generates "model artifacts," which essentially hold the distilled knowledge that the model has acquired from the training data. They are then securely stored for future use to make predictions. The ML models need to be retrained to ensure they remain up-to-date with new data. This step is crucial, as models can become outdated as new information becomes available.

This seamless flow, from data preparation and model training to inference and then retraining of models, is repeated as a continuous cycle. It's important to note that in this conventional approach, the model is retrained by combining the new and old data each time the cycle repeats, rather than incrementally.
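As a rough illustration of this cycle, the following scikit-learn sketch retrains a model from scratch on the combined old and new data and stores the resulting artifact; the random feature arrays and the artifact file name are made up for the example.

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical data and newly collected data.
X_old, y_old = np.random.rand(1000, 4), np.random.randint(0, 2, 1000)
X_new, y_new = np.random.rand(200, 4), np.random.randint(0, 2, 200)

# Traditional cycle: retrain on old + new data combined, not incrementally.
X_train = np.vstack([X_old, X_new])
y_train = np.concatenate([y_old, y_new])

model = LogisticRegression().fit(X_train, y_train)

# Store the model artifact for the inference engine to pick up later.
joblib.dump(model, "model_v2.joblib")

# Inference: load the latest artifact and make predictions on new data.
latest = joblib.load("model_v2.joblib")
print(latest.predict(X_new[:5]))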


Importance of Stream Processing in ML Applications


Now that you have an understanding of a traditional ML pipeline, let's use the following example to understand why stream processing is crucial in the ML lifecycle.

Recent years have seen the rise of OTT applications, with vast numbers of viewers embracing streaming content alongside or even in place of regular satellite providers. The limitations of batch ML quickly become evident in such user-facing apps when stream processing is not integrated into the ML application lifecycle.

Imagine you've been watching several movies in the crime genre lately. So, when you log in to your OTT app, you should see a list of recommended crime movies in your profile. However, today you decide to watch some family dramas instead and search for the "drama" category. Ideally, the OTT app should recognize this change and update your recommendations with more drama movies. But, without real-time streaming data feeding into the ML model, the app cannot refresh the recommendation list with the latest drama genre selections. If the app can only use batch ML operations, you'll continue to get recommendations based on the crime genre until the next batch operation cycle completes.


Enabling Continuous Learning Based on Streaming Data


The example in the previous section demonstrated how important stream processing is for ML operations. This section focuses on how to enable the trained model to continually learn from streaming data. The following diagram illustrates how this process works:

[Figure: Real-time ML pipeline]

Real-time machine learning pipelines require a dynamic and responsive approach to model training and inference. The process starts with an initial model training phase in either batch mode or streaming mode, depending on the volume of data available for the use case. If the machine learning use case has a significant amount of historical data, then a traditional bulk data processing and model-building pipeline is suitable. Otherwise, a stream-based processing pipeline can be chosen to process the records one by one or in smaller chunks as they arrive in streams.

The data in general has to undergo meticulous preparation to ensure its suitability for model training. The model training engine leverages advanced algorithms to learn from the data, culminating in the generation of valuable model artifacts. These artifacts, representing the distilled knowledge of the model, are securely stored for future use. When real-time inference is required, the inference engine retrieves the latest published model and serves inference results to designated targets such as users, systems, or devices. This initial model setup lays the groundwork for a flexible and adaptive real-time ML pipeline.

However, the true power of this pipeline lies in its ability to continuously evolve and adapt. Model retraining takes on a real-time stream processing mode where new data streams into the system. In this phase, stream processors and transformers come into play, performing tasks like drift identification and data preparation for model training. The model training engine here takes a more granular approach, allowing training based on each incoming record or incremental records, ensuring that the model remains up-to-date and responsive to changing data patterns. Like in the initial training phase, this process results in the creation of updated model artifacts, which are stored for future use.
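One way to approximate this record-by-record retraining is sketched below with scikit-learn's partial_fit; the simulated stream and feature shapes are assumptions, not part of the pipeline described above.

import numpy as np
from sklearn.linear_model import SGDClassifier

# Model that supports incremental updates via partial_fit.
model = SGDClassifier()
classes = np.array([0, 1])  # the full set of labels must be declared up front

def incoming_records():
    """Simulated data stream; in practice this would read from Kafka or similar."""
    for _ in range(1000):
        yield np.random.rand(1, 4), np.random.randint(0, 2, 1)

for features, label in incoming_records():
    # Update the model on each record as it arrives instead of retraining on
    # the full history; the latest state is immediately usable for inference.
    model.partial_fit(features, label, classes=classes)

print(model.predict(np.random.rand(1, 4)))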

The inference engine plays a pivotal role throughout this dynamic cycle, retrieving the latest published model to deliver real-time inferences to the designated targets. This continuous evolution, from initial batch training to real-time stream processing, ensures that the machine learning models stay relevant and accurate. Real-time ML pipelines like these empower organizations to harness the full potential of their data and make timely, informed decisions.


Additional Stream Processing and Real-Time Machine Learning Use Cases


Integrating stream processing with machine learning applications brings forth a range of significant benefits that can transform how organizations leverage data. Here are some key advantages, along with examples to illustrate the impact:

  • Fraud detection and prevention: ML models that utilize stream processing can continuously analyze transaction data from various sources in real time. ML models can identify patterns of fraudulent activity. For instance, if a transaction occurs in an unusual pattern, the system can trigger an alert to help you immediately take preventive action. This can prevent financial loss and safeguard users in real time.
  • Anomaly detection: ML models that utilize stream processing can continuously analyze network traffic data. ML models can learn what constitutes "normal" behavior and raise alerts when anomalies or security threats are detected. This proactive approach helps mitigate cybersecurity risks.
  • Dynamic pricing: E-commerce companies can adjust prices in real time based on factors like demand, competition, and inventory levels. Stream processing can gather user actions and process market data while ML models predict optimal pricing strategies based on these influencing features. This dynamic pricing approach maximizes revenue and ensures competitiveness.

Conclusion

This article explored the potential of integrating stream processing and ML for more dynamic, intelligent, and responsive applications. You learned about the advantages of stream processing, including real-time data ingestion, scalability, cost efficiency, and fault tolerance. Moreover, you explored the pivotal role of stream processing in the ML lifecycle, illustrated through use cases such as fraud detection, recommendation systems, dynamic pricing, and anomaly detection.

If you want to create real-time ML pipelines, you need a robust, flexible, and reliable stream processing platform. RisingWave is a distributed SQL database specifically designed to streamline the process of building real-time applications. RisingWave simplifies development, reduces costs, and seamlessly processes streaming data, enabling continuous real-time updates—a critical feature in ML workflows.

1. What Is RisingWave?


RisingWave is a streaming database. It can connect to event streaming platforms like Apache Kafka and can be queried with SQL. Unlike traditional relational databases, the primary use case for a streaming database like RisingWave is for analytical purposes. In that sense, RisingWave behaves much like other OLAP (online analytical processing) products like Amazon Redshift or Snowflake.

In modern applications, however, an OLTP (online transaction processing) system, i.e., a relational database, might not be the single source of truth. Think of event-based systems built on Apache Kafka or similar platforms. Here, RisingWave can process such streams in real time and offer a Postgres-compatible interface to this data. Luckily, RisingWave can also persist data on its own, which is what we will use in this article.
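Because RisingWave speaks the Postgres wire protocol, a standard Postgres driver is enough to reach it from Python. A minimal sketch, assuming the service name, port, database, and user from the docker-compose setup shown later in this post:

import psycopg2

# Connection parameters match the setup used below:
# host "risingwave", port 4566, database "dev", user "root" (no password).
conn = psycopg2.connect(host="risingwave", port=4566, dbname="dev", user="root")
conn.autocommit = True

with conn.cursor() as cur:
    # Any simple query works to verify the connection.
    cur.execute("SELECT 1;")
    print(cur.fetchone())

conn.close()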

Django’s ORM is designed for traditional relational databases. Data is expected to be normalized, constraints are enforced, and relations have an integrity guarantee.

So why would anyone use an analytical database, which by its nature cannot enforce constraints such as NOT NULL or UNIQUE, in a Django application?

The answer is simple: dashboarding. The concept of RisingWave lies in providing (materialized) views on the data for analytical purposes. Combined with the power and flexibility of Django's ORM, we can leverage the performance of an analytical database. In this blog post, we are not going to use RisingWave as the primary datastore. Rather, we will use a RisingWave database to build a dashboard for the analytical part of our application.

Other products like Amazon Redshift or Snowflake already have connectors for Django. Since RisingWave doesn't have one yet, we will implement our own and learn about the internals of the Django ORM along the way. Luckily, RisingWave is mostly Postgres compatible, so we will start from the original Django Postgres driver. However, the PostgreSQL compatibility refers to the DQL (Data Query Language) part of SQL – or in simple terms: SELECT statements. In this example, we will only read from RisingWave, not write to it. We will also avoid migrations, because RisingWave's materialized views with their various connectors (Apache Kafka being just one of them) cannot be created with the semantics of Django's Model classes. Don't worry though: we have some sample data to play with.

By the end of this blog post, you will have

  • set up a simple docker-compose setup with a Python webserver, a PostgreSQL instance and a RisingWave instance
  • built a Django ORM connector for RisingWave
  • seen a sample dashboard inside of a Django application built on RisingWave


2. The Setup


docker-compose Setup


For our setup, we need three services: One for the Django application, one for Postgres (to store the transactional data like user accounts in Django), and, of course, one RisingWave service. Here is a simple docker-compose setup that does the trick for us.

version: '3.8'

services:
  app:
    image: python

  postgres:
    image: postgres:latest
    restart: unless-stopped
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: postgres
      POSTGRES_DB: postgres
      POSTGRES_PASSWORD: postgres

  risingwave:
    image: risingwavelabs/risingwave:latest
    restart: unless-stopped

volumes:
  postgres-data:


Devcontainer


To make things easy, we can leverage VSCode's charming devcontainer feature, which happens to support docker-compose setups, too. Within the comfort of our editor, we can access all these services effortlessly. If you check out the GitHub repository and open it in VSCode, it will offer to re-open the project in a container. If you do so, the whole setup will spin up in almost no time. Note that you need Docker and the VSCode Devcontainer extension installed.


In the setup above, we only persist the Postgres database, not RisingWave. You could add a volume mount for the risingwave service as well, but I prefer to have a clean database on every start to explore its features.


3. The Sample Data


Once you have started the three containers from the docker-compose.yaml file, you can access the RisingWave database with the psql command line tool. As for fetching data from the database, RisingWave is protocol-compatible with Postgres.

We use this connection to ingest the test data into RisingWave.

Here is a simple Python script that generates some random order data. We will use it to insert data into RisingWave just as we would with any other relational database. Note that this is not the primary use case for RisingWave; it really shines when you connect external data sources like Kafka or S3. For the purposes of this blog post, however, this data is sufficient. If you want to learn more about the connectivity features of RisingWave, have a look at the excellent documentation.
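The script itself is not reproduced here; a rough equivalent that emits INSERT statements matching the orders table created below might look like this (the value ranges and state codes are assumptions):

import random

STATES = ["FM", "CA", "NY", "TX", "WA"]

def generate_orders(num_orders=1000, items_per_order=3):
    """Yield INSERT statements with random order data for the orders table."""
    for order_num in range(num_orders):
        orderid = 1000001 + order_num
        state = random.choice(STATES)
        for _ in range(items_per_order):
            itemid = 100001 + random.randint(0, 99)
            qty = random.randint(1, 5)
            price = round(random.uniform(10, 1000), 2)
            dt = f"2020-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}"
            yield (
                "INSERT INTO orders (orderid, itemid, qty, price, dt, state) "
                f"VALUES ({orderid}, {itemid}, {qty}, {price}, '{dt}', '{state}');"
            )

if __name__ == "__main__":
    for stmt in generate_orders(10):
        print(stmt)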


Installing The Sample Data


Within our Devcontainer we can use the psql command to connect to the RisingWave service:

psql -h risingwave -p 4566 -d dev -U root

Create table

CREATE TABLE orders (orderid int, itemid int, qty int, price float, dt date, state varchar);

Now, we run gen_data.py and insert the generated fake data into RisingWave:

INSERT INTO orders (orderid, itemid, qty, price, dt, state) VALUES (1000001, 100001, 3, 993.99, '2020-06-25', 'FM');
INSERT INTO orders (orderid, itemid, qty, price, dt, state) VALUES (1000001, 100002, 3, 751.99, '2020-06-25', 'FM');
INSERT INTO orders (orderid, itemid, qty, price, dt, state) VALUES (1000001, 100003, 4, 433.99, '2020-06-25', 'FM');
…


4. The Django Connector


Now that we have the data ingested into our RisingWave instance, we should take care of retrieving this data from within a Django application. Django comes with built-in connectors for different relational databases, including PostgreSQL. The ORM (read: the Model classes in your Django app) communicates with the connector, which in turn communicates with a database driver (e.g., psycopg2 for PostgreSQL). The connector generates vendor-specific SQL and exposes functionality that is specific to the database product.

In the case of an analytical database such as RisingWave, we instead need to disable certain functionality, such as constraint checking or even primary keys. RisingWave purposefully doesn't provide such features (and neither do other streaming databases). The reason is that analytical databases provide storage optimized for reads (for faster analytics) and writes (for faster data ingestion). The result is usually a denormalized schema in which constraints don't need to be checked, because the sources of the data (read: the transactional databases or event streaming sources in your application) are assumed to satisfy the integrity needs of the business logic.

We can start by copying Django's PostgreSQL connector into a new package called django-risingwave. Further down the line, we are going to use the new connector only for read operations (i.e., SELECTs). However, at least in theory, we want to implement some functioning management code (for creating models) and write operations in our module. Given the difference in scope between transactional and analytical DB engines, this might not work as the primary datastore for Django, but we will learn some of the internals of RisingWave while doing so.

As an ad-hoc test, we want at least the initial migrations to run through with the django-risingwave connector. This is, however, already more than we will need, since our intention is to use the connector for unmanaged, read-only models.


base.py


In base.py we find a function that gets the column datatype for varchar columns. RisingWave doesn’t support a constraint on the length, so we just get rid of the parameter:

def get_varchar_column(data):
    return "varchar"

Also, we need to get rid of the data_types_suffix attribute.


features.py


features.py is one of the most important files for Django ORM connectors. Basically, it holds a configuration of the database capabilities. For any generated code that is not vendor specific, the Django ORM will consult this file to turn specific features on or off. We have to disable quite a lot of them to make RisingWave work with Django. Here are some highlights; you'll find the whole file in the GitHub repo.

First, we need to set the minimum required database version down to 9.5 – that's RisingWave's version, not Postgres'.

minimum_database_version = (9, 5)

Next, we disable some features that are mostly needed for constraint checking which RisingWave does not support:

enforces_foreign_key_constraints = False
enforces_unique_constraints = False
allows_multiple_constraints_on_same_fields = False
indexes_foreign_keys = False
supports_column_check_constraints = False
supports_expression_indexes = False
supports_ignore_conflicts = False
supports_indexes = False
supports_index_column_ordering = False
supports_partial_indexes = False
supports_tz_offsets = False
uses_savepoints = False


schema.py


In schema.py, we need to override the _iter_column_sql method, which is not found in the Postgres backend but inherited from BaseDatabaseSchemaEditor. In particular, we get rid of the part that is put in place to check NOT NULL constraints.


introspection.py


In introspection.py we need to change the SQL generated by the get_table_list method. We treat all views and tables as just tables for our demo purposes.


operations.py


In operations.py, we get rid of the DEFERRABLE SQL part of our queries.

def deferrable_sql(self):
    return ""


5. A Simple Django Dashboard


Configure Django to use a second database


Django supports the use of multiple database connections in one project out of the box. This way, we can have the transactional part of our database needs in Postgres, and the analytical part in RisingWave. Exactly what we want here!

DATABASES = {
    "default": {
        "NAME": "postgres",
        "ENGINE": "django.db.backends.postgresql",
        "USER": "postgres",
        "PASSWORD": "postgres",
        "HOST": "postgres"
    },
    "risingwave": {
        "ENGINE": "django_risingwave",
        "NAME": "dev",
        "USER": "root",
        "HOST": "risingwave",
        "PORT": "4566",
    },
}


Creating An Unmanaged Model


Now, let’s create a model to represent the data in our RisingWave service. We want this model to be unmanaged so that it doesn’t get picked up by Django’s migration framework. Also, we will use it for analytical purposes exclusively, so we disable its ability to save data.

class Order(models.Model):
    class Meta:
        managed = False

    orderid = models.IntegerField()
    itemid = models.IntegerField()
    qty = models.IntegerField()
    price = models.FloatField()
    date = models.DateField(db_column="dt")
    state = models.CharField(max_length=2)

    def save(self, *args, **kwargs):
        raise NotImplementedError


Making “Analytical” Queries with Django ORM


As an example for our dashboard, we create four Top-5-rankings:

  1. The top 5 states by total turnover
  2. The top 5 products with the highest average quantity per order
  3. The top 5 overall best-selling items by quantity
  4. The top 5 overall best-selling items by turnover

Let’s take a moment to think about what the corresponding SQL queries would look like. These queries are quite simple: each contains a single aggregation, such as SUM or AVG, a GROUP BY clause, and an ORDER BY clause.

Here are the queries we come up with:

  1. select sum(qty*price) as turnover, state from orders group by state order by turnover desc;
  2. select avg(qty), itemid from orders group by itemid order by avg(qty) desc;
  3. select sum(qty), itemid from orders group by itemid order by sum(qty) desc;
  4. select sum(qty*price) as turnover, itemid from orders group by itemid order by turnover desc;

How would these queries translate to Django’s ORM?

Django does not have a group_by function on its models, but it automatically adds a GROUP BY clause for the fields selected in the values() call. So, the above queries can be written with the Django ORM as follows:

  1. Order.objects.values('state').annotate(turnover=Sum(F('price')*F('qty'))).order_by('-turnover')[:5]
  2. Order.objects.values('itemid').annotate(avg_qty=Avg('qty')).order_by('-avg_qty')[:5]
  3. Order.objects.values('itemid').annotate(total_qty=Sum('qty')).order_by('-total_qty')[:5]
  4. Order.objects.values('itemid').annotate(turnover=Sum(F('price')*F('qty'))).order_by('-turnover')[:5]

On top of that, we need to instruct Django to read these queries not from the default connection but from the risingwave service. We can do so by adding a call to the using method like so:

Order.objects.using('risingwave').values('state').annotate(turnover=Sum(F('price')*F('qty'))).order_by('-turnover')[:5]
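Wiring these queries into an actual dashboard view could look roughly like the following sketch; the template name and context keys are made up for the example.

from django.db.models import Avg, F, Sum
from django.shortcuts import render

from .models import Order

def dashboard(request):
    # All rankings are read from the RisingWave connection, not the default one.
    rw = Order.objects.using("risingwave")
    context = {
        "turnover_by_state": rw.values("state")
            .annotate(turnover=Sum(F("price") * F("qty")))
            .order_by("-turnover")[:5],
        "top_avg_qty_items": rw.values("itemid")
            .annotate(avg_qty=Avg("qty"))
            .order_by("-avg_qty")[:5],
        "best_sellers_by_qty": rw.values("itemid")
            .annotate(total_qty=Sum("qty"))
            .order_by("-total_qty")[:5],
        "best_sellers_by_turnover": rw.values("itemid")
            .annotate(turnover=Sum(F("price") * F("qty")))
            .order_by("-turnover")[:5],
    }
    return render(request, "dashboard/index.html", context)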


Using A Database Router


The data in our Order model isn’t present in the Postgres database at all, so it would be nice if any query to this model would be routed through the RisingWave backend.


The Router Class


Django offers a solution to that problem. We can create a database routing class that implements db_for_read, db_for_write, allow_relation, and allow_migrate. We only need the db_for_read method here; the other methods can simply pass. Since multiple routers can be added to the settings.py file, the database connection returned isn't enforced but treated as a hint by Django's ORM.

from .models import Order

class OlapOltpRouter():
    def db_for_read(self, model, **hints):
        if model == Order:
            return 'risingwave'
    def db_for_write(self, model, **hints):
        pass
    def allow_relation(self, obj1, obj2, **hints):
        pass
    def allow_migrate(self, db, app_label, model_name=None, **hints):
        pass


settings.py


To enable this router (which lives in dashboard/dbrouter.py), we need to add it to the DATABASE_ROUTERS setting in our settings.py:

DATABASE_ROUTERS = ['dashboard.dbrouter.OlapOltpRouter']

Conclusion

  • RisingWave is a powerful analytical database engine that can ingest streaming data sources and offers a performant way to query data
  • Since RisingWave offers PostgreSQL compatibility over the wire, we can quickly hack a connector for Django’s ORM based upon the original Postgres connector
  • Django’s ORM allows building analytical queries and can be used to create dashboards integrated in a Django application easily
  • The connector is publicly released and still needs a lot of polishing. Check out the GitHub repo.


Originally published at: Building A RisingWave Connector for Django ORM

Stream processing engines are tools that consume a continuous series of data and take action on it in real time or near real time. Possible actions include aggregations (sum), analytics (predicting future actions), transformations (converting to a date format), or enrichment of data (combining data to enhance context). Stream processing has become more popular with increases in data collection and the need to decrease application response time, which stream processing addresses by acting on data as soon as it arrives. Technologies for implementing stream processing can simplify data architectures, react appropriately to time-sensitive data, and provide real-time insights and analytics.

Many tools are available for implementing stream processing, each providing stream processing with a unique feature set. These include open source tools such as Apache Spark and Kafka as well as proprietary public cloud services like AWS Kinesis and Google Dataflow.


Types of Stream Processing Engines


Stream processing engines are divided into three major types. When choosing a stream processing engine for your software architecture, consider which type will work best for your offering to filter down your options. The difference between these technologies mainly relates to the directed acyclic graph (DAG) control, which defines the processing functions applied to data as it flows through the stream.


Compositional Engines


Software developers must define the DAG and how processing will proceed in advance with compositional engines. Planning must be done in advance to ensure efficient and effective processing. Processing is controlled entirely by the developer.


Declarative Engines


In declarative engines, the developers implement functions in the stream engine, but the engine itself determines the correct DAG and will direct streamed data accordingly. There may be options for developers to define the DAG in some tools, but the stream engine will also optimize the DAG as needed.


Fully Managed Self-Service Engines


These engines are based on declarative engines, but they further reduce the work required of developers. The DAG is determined by the stream engine, as in declarative engines, but a complete end-to-end solution is provided, including data storage and streaming analytics.


Streaming Tools


There are many current and retired stream engines available for use. The tools chosen for this list are all being actively enhanced and monitored for bugs. Each is popular for its own feature set, making each useful for specific use cases. A variety of self-managed and managed options are also provided.


Open Source


Apache is an open source software foundation producing various projects for community use. All Apache projects listed are available for use without paying licensing fees when running as self-managed software. Some tools are also available as managed software through different cloud providers. While all Apache projects are open source, other open source streaming tools are available.

1. Apache Spark


Apache Spark is an open source distributed streaming engine designed for large-scale data processing. Spark can distribute data processing tasks across multiple nodes, making it a good choice for big data and machine learning. Spark is also flexible in handling both batch and streaming data using a variety of language options such as Java, Python, R, and more. Spark supports SQL querying, machine learning, and graph processing.

Spark can run on a standalone cluster or a cluster using container orchestration tools like Kubernetes or Docker Swarm. The latter option offers automated deployments, resilience, and security. Managed versions are also available through Amazon, Google, and Microsoft.

Spark is a popular engine for scalable computing and is utilized by thousands of companies in various fields. It has a highly active community contributing to the codebase, with contributions from thousands of individuals.

2. Apache Storm


Apache Storm is an open source compositional streaming engine designed for large-scale data processing. Storm uses parallel task computation to operate on one record at a time, making it a real-time streaming technology. Storm can be used for online machine learning, continuous computation, and more. However, it doesn't store data locally, so it's not used for analyzing historical data.

Developers can create Apache Storm clusters natively, on virtual machines, or using container technologies like Kubernetes and Docker. There are no options for managed versions of Storm.

Storm is a popular engine for real-time data integrations since it can handle high throughputs of data that require complex processing. When less complex processing is required, other stream processing engines can provide simpler integrations.

3. Apache Flink


Apache Flink is an open source, distributed streaming engine that supports both batch and stream processing. It excels at processing both unbounded streams (which have a start but no end) and bounded streams (with a defined start and end). It's designed to run stateful streaming applications with low latency and at large scales.

Flink supports running as a standalone cluster or can run on all common cluster environments. It's a distributed system, so it does require compute resources to execute applications.

Flink, with its fault-tolerant design, is used for critical event-driven applications like fraud detection, rule-based alerting, and anomaly detection. It's also used for data analytics applications like quality monitoring and data pipeline applications that feed real-time search indices.

4. Apache Flume


Apache Flume is an open source distributed streaming engine designed to handle log data. Flume can accept data from a variety of external sources, including web servers, Twitter, and Facebook. It's mainly used to aggregate and load data into the Apache Hadoop Distributed File System (HDFS).

Flume is installed and configured as part of the Hadoop repository. It's generally run on the Hadoop cluster using a cluster management tool such as Hadoop YARN.

Since Apache Flume is designed for log aggregations, it's commonly used to collect and move large data volumes within Hadoop. Flume is also used by e-commerce companies to analyze buying behavior and for sentiment analysis.

5. Apache Kafka Streams


Apache Kafka Streams is an open source stream processing library built on Apache Kafka, a distributed streaming and storage system. The combination allows high throughput while delivering especially low latencies. It's able to scale both storage and processing up and down as needed, making it a suitable option for large and small requirements.

Kafka Streams can be deployed to containers, run on virtual machines, and also run on bare metal. Cloud providers like AWS also provide managed Kafka services, including support for Kafka Streams.

Kafka Streams is popular across many different industries. It can be used to process payments and financial transactions, track shipments and other geospatial data, capture and analyze sensor data, or drive event-driven software architectures.

6. RisingWave


RisingWave is an open source, cloud-native streaming database used to support real-time applications. It's a stateful system that can perform complex incremental computations such as aggregations, joins, and windowing. It uses PostgreSQL syntax and stores the results of each step to decrease the latency of future calculations. It accepts data from various sources such as Kafka and Kinesis and supports outputting data to Kafka streams.

RisingWave was built to provide a real-time streaming database that's easy to use. It can support real-time applications without significant operational overhead for software engineering and maintenance.


Managed Stream Processing


Public cloud providers allow users to run stream processing tools on their servers. They include different guarantees on the availability of the service, freeing up developers to work on other tasks besides maintaining the stability of the stream. Some open source tools are available as managed services in the public cloud.

7. Google Cloud Dataflow


Google Cloud Dataflow is a fully managed, cloud-based data processing and streaming service. It's capable of handling stream processing as well as batch processing, and it allows for scheduling batch processing executions. Dataflow also provides auto-scaling to handle spikes in streaming data sizes.

Dataflow comes with extra processing features built in, such as real-time AI patterns that allow the system to react to events within the data flow, including anomaly detection.

8. Amazon Kinesis Data Streams


Amazon Kinesis Data Streams is a fully managed streaming service provided by Amazon's cloud platform. Data is captured from one or more consumers, and the data can then trigger processing (or consumer) functions with batched stream data.

Kinesis Data Streams is designed to handle high data throughput, handling gigabytes of data per second. The cost of the stream is based on what is used and can be either provisioned for use cases with predictable throughput or set to on-demand when the throughput should scale with the data.

This is popular for big data processing in use cases like event data logging, running real-time analytics (using consumer functions), and powering event-driven applications.

9. Amazon Kinesis Data Firehose


Amazon Kinesis Data Firehose is a fully managed streaming service on AWS. This service is designed to collect, transform, and load data streams to permanent storage and analytics services.

Like Kinesis Data Streams, Firehose can handle high data throughput. When data storage is required, Firehose is an effective tool since it formats and stores directly rather than requiring a consumer to provide that functionality.

Firehose is popular for storing stream data in data lakes or data warehouses. It can also provide a large data set for machine learning applications and security analytics.

Conclusion

Stream processing tools continuously ingest real-time data from external sources and perform operations on it. This processing supports event-driven actions, behavioral analysis, and machine learning. Available engines may be self-managed or managed by a cloud provider, which eases the load on in-house development teams. When choosing a stream processing tool, determine which is best for your use case by considering design aspects like throughput, latency, deployment options, and implementation complexity.

Sign up for our monthly newsletter if you’d like to keep up to date on all the happenings with RisingWave. Follow us on Twitter and Linkedin, and join our Slack community to talk to our engineers and hundreds of streaming enthusiasts worldwide.

Stream processing handles data as it enters the system (while it's in motion). Unlike traditional offline processing, where a complete data set is available for the processor to act on, stream processing continuously works on partial or limited data. As a result, stream processing algorithms are incremental and stateful.

This type of processing is typically used in applications that require very low latency—for example, a remote patient monitoring (RPM) solution where a device monitors a patient at a remote location and a caregiver needs to be notified immediately when certain parameters drop below a given threshold.

The most simplistic implementation of stream processing uses messaging queues. Data from sensors is pumped into messaging queues to which listeners are attached. The listeners implement the real-time algorithm to compute the necessary values from the data in the stream. The following diagram shows the difference between offline processing and stream processing for an IoT system that monitors the input voltage of an HVAC system to detect a "continuous low input voltage":

[Figure: Kafka implementation]

Here, the Kafka consumer implementation grows in complexity as more use cases are added, such as correlating voltage and current values, or as exceptions such as voltage fluctuations need to be handled. These complexities can be abstracted into a stream processing framework that provides a robust, highly available, resilient environment for implementing stream processing applications. A few examples of stream processing frameworks are Apache Flink, Azure Stream Analytics, and RisingWave.
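For reference, here is a minimal sketch of such a consumer using the kafka-python library; the topic name, message format, and thresholds are assumptions made for the HVAC example.

import json
from collections import deque
from kafka import KafkaConsumer

LOW_VOLTAGE = 200   # hypothetical threshold in volts
WINDOW = 5          # consecutive low readings that count as "continuous"

consumer = KafkaConsumer(
    "hvac-input-voltage",                      # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

recent = deque(maxlen=WINDOW)

for message in consumer:
    reading = message.value                    # e.g. {"device": "hvac-1", "voltage": 198.5}
    recent.append(reading["voltage"])

    # Raise an alert only when the whole window stays below the threshold.
    if len(recent) == WINDOW and all(v < LOW_VOLTAGE for v in recent):
        print(f"ALERT: continuous low input voltage on {reading['device']}: {list(recent)}")
        recent.clear()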

To create a stream processing application, these frameworks use different types of interfaces, which typically fall into the following categories:

  • Visual modeling or no code
  • Raw API interface
  • Scripting API
  • SQL code


This article compares these different interfacing methods, their pros and cons, and the factors you should consider before choosing one.


Factors to Consider When Choosing a Stream Processing Framework


The following are some of the most important factors to consider when choosing a stream processing framework.


Ease of Coding


The ease of coding is affected by the steepness of the learning curve and whether developers can quickly adopt the framework. For example, a framework may be considered challenging if developers have to learn a new language's syntax, understand custom APIs, or follow nonstandard syntax.


Declarative vs. Imperative Programming


There are two styles of coding: declarative (emphasis on what to achieve) and imperative (emphasis on how to achieve it). Coding with native APIs is imperative programming, while writing SQL is declarative programming. It's easier to ramp up with declarative programming than with imperative programming.
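A tiny contrast in code: the imperative version spells out how to maintain per-key totals, while the declarative version (shown as SQL in a comment) only states what result is wanted. The event fields here are made up for illustration.

# Imperative: describe *how* to compute running totals per device.
totals = {}

def on_event(event):
    device = event["device_id"]
    totals[device] = totals.get(device, 0) + event["value"]

for e in [{"device_id": "a", "value": 3}, {"device_id": "a", "value": 4}]:
    on_event(e)
print(totals)  # {'a': 7}

# Declarative: describe *what* is wanted and let the engine decide how, e.g.:
#   SELECT device_id, SUM(value) FROM events GROUP BY device_id;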


Flexibility


The flexibility of the framework is defined by how fast you can go from rapid prototyping to complex algorithms. While visual-based or no-code modeling will get you up and running quickly, it has limited flexibility to extend functions as use cases become more complex.


Porting


When a framework fails to extend with the requirements, you have to port over to a new framework. Hence, you need a framework that allows you to write loosely coupled code. Frameworks that provide native APIs tightly couple your code to the framework, while frameworks that provide SQL APIs are loosely coupled.


Debugging and Visualization


Data is transient, making debugging more complex. You need to trace the steps in a workflow and narrow problems down to the failing step in real time. Moreover, streaming applications run on clusters with multiple levels of parallelism, so you need tools that provide workflow visualization and error indicators while the data is flowing.


Multiple Source Streams


Typical applications have multiple types of source streams. IoT devices can connect through an MQTT broker or directly to a Kafka queue. This means you need a framework that supports various source streams such as Kafka queues, socket data, and Amazon Kinesis.


Multiple Message Formats


Data formats within an application vary. Frameworks need to work with common data formats, such as JSON and serialized objects, and provide ways to define custom deserialization of data in streams.


Comparing Different Ways to Write a Stream Processing Application


As mentioned, stream processing frameworks provide different methods for writing applications. The following sections describe some of these methods and their pros and cons.


Visual Modeling or No Code


In this method, the framework provides a visual modeling tool, which you can use to create analytical workflows from source streams. The following screenshot is taken from Microsoft Azure Stream Analytics:

[Figure: Azure Stream Analytics job]

Pros

  • Visualizes data flow.
  • Has rapid prototyping and deployment, improving time to market.

Cons

  • Not everything can be achieved using the standard model-provided blocks.
  • Not entirely no-code, as some amount of coding is still required.
  • Porting is impossible.


Raw API Interface


In this method, the framework exposes native APIs, such as Java APIs and Scala APIs, that are called directly from the application. Tools like Apache Spark and Apache Flink utilize this method.

Below is some sample Java API code:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); 
 
 
DataStream<Tuple2<String, Integer>> dataStream = env 
   .socketTextStream("192.168.0.103", 9999) 
   .flatMap(new Splitter()) 
   .keyBy(value -> value.f0) 
   .window(TumblingProcessingTimeWindows.of(Time.seconds(10))) 
   .sum(1); 


Pros

  • Can write complex algorithms since it's highly flexible.
  • Provides fine-grained control over the various parameters of operation, which supports more complex use cases. For example, previous data can drive follow-up operations. In the patient monitor example, the raw API could be configured to check for rising blood pressure if the blood sugar remains high for more than five minutes.
  • Easy to convert from streaming to batch mode processing.
  • Combining multiple data streams is possible based on the support provided by the framework.
  • Operations such as UNION and JOIN are possible on multiple data streams depending on the support provided by the framework.


Cons

  • Tightly coupled to the framework used. Porting is very difficult.
  • Steeper learning curve because APIs need to be understood.
  • Code is hard to understand, and debugging needs very specific tools.
  • Tightly coupled to stream data type because it's coded specifically for the data type. So, any change in the data type changes the code.
  • Predefined named source streams are mostly absent; hence, logic is tightly coupled to the defined source stream and cannot be changed without changing code.


Python API (Scripting)


In this method, the framework provides scripting APIs that can be called in scripting languages such as Python. For example, Apache Flink provides a Table API to call in Python code.

Below is some sample Python code:

from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a Table API environment using the Blink planner.
env_settings = EnvironmentSettings.new_instance().use_blink_planner().build()
table_env = TableEnvironment.create(env_settings)

# Project columns from a registered table "X" and register the result under a new name.
proj_table = table_env.from_path("X").select(...)
table_env.register_table("projectedTable", proj_table)


Pros

  • Reduced complexity compared to using a raw API.
  • Easy to code because it uses scripting.
  • Scripting allows dynamic data types, unlike statically typed languages such as Java, which decouples the stream data type from the code.
  • SQL artifacts can be created from multiple connectors, such as Kafka connectors, depending on what's compatible with the framework.
  • Decouples the source from the logic. If the data type remains the same, the same SQL artifact can be recreated with a different connector.

Cons

  • Tightly coupled to the framework used. Porting is difficult.
  • Requires an understanding of the specific API provided by the framework, so there's a steeper learning curve.
  • Requires framework-specific tools for debugging.


SQL Code


In this method, the framework allows you to create various SQL artifacts and use standard SQL statements against these artifacts to retrieve and process data. For example, RisingWave uses sources, sinks, and materialized views to allow easy access to streaming data, which can then be accessed via any SQL-based tool or code.

The following is some sample SQL code:

CREATE MATERIALIZED VIEW ad_ctr AS 
SELECT 
   ad_clicks.ad_id AS ad_id, 
   ad_clicks.clicks_count :: NUMERIC / ad_impressions.impressions_count AS ctr 
FROM 
   ( 
       SELECT 
           ad_impression.ad_id AS ad_id, 
           COUNT(*) AS impressions_count 
       FROM 
           ad_impression 
       GROUP BY 
           ad_id 
   ) AS ad_impressions 
   JOIN ( 
       SELECT 
           ai.ad_id, 
           COUNT(*) AS clicks_count 
       FROM 
           ad_click AS ac 
           LEFT JOIN ad_impression AS ai ON ac.bid_id = ai.bid_id 
       GROUP BY 
           ai.ad_id 
   ) AS ad_clicks ON ad_impressions.ad_id = ad_clicks.ad_id; 


Pros

  • Declarative programming allows you to focus on what needs to be achieved rather than how.
  • Standard SQL language, so the learning curve is manageable.
  • Materialized views can be created over other materialized views, decoupling the various methods of using the stream data from the actual computations (see the sketch after this list).
  • Multiple sources can be added and removed without affecting the materialized views.
  • Multiple sources can be combined easily using SQL JOINs.
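
As a sketch of that layering, a second materialized view could be defined on top of the ad_ctr view from the example above; the high_ctr_ads name and the 0.05 threshold are illustrative assumptions rather than part of the original example:

-- Hypothetical follow-up view built on top of ad_ctr; the name and threshold are illustrative.
CREATE MATERIALIZED VIEW high_ctr_ads AS
SELECT
    ad_id,
    ctr
FROM
    ad_ctr
WHERE
    ctr > 0.05;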


Cons

  • Reduced control over algorithm complexity, which may require workarounds with various SQL functions.
  • Porting can only be done to other frameworks that support SQL-based access.


Looking at the different ways to write a stream processing application, SQL strikes a balance between flexibility, ease of coding, and ease of maintenance. SQL is already the de facto language for big data analytics, and standardized streaming SQL definitions are beginning to make it the future standard for stream processing.

Conclusion

This article looked at the different ways of coding streaming applications with frameworks, their pros and cons, and the various factors that play a role in selecting the right stream processing framework. RisingWave is an open-source distributed SQL database for stream processing. It is designed to reduce the complexity and cost of building real-time applications. Access our GitHub repository to learn more.

Event-Driven Computation


Everything that happens in the world is an event that reflects the current state of a system. When you recollect memories, look through images in your gallery, or read history books in libraries, you are going through records of past events. In these cases, the records of the events are stored in your brain, galleries, or libraries.

Similarly, every interaction you have on the internet can be regarded as an event. As a developer, besides initiating these events, you are also responsible for designing systems to respond to them in the applications you build. Event-driven computation helps you do this effectively.

Event-driven computation is a programming paradigm that focuses on processing events as they happen by delegating control to a system that watches for events to occur. Many applications rely on it, including graphical user interfaces, fraud detection systems, real-time analytics dashboards, and so on. An event can be defined as a significant change in the state of an application. Examples include a user clicking a button on a website, a sensor detecting a temperature change in the environment, and a fraud detection system flagging a transaction.

Batch processing is an alternative to processing events as they occur. It involves waiting for the data to accumulate before processing it. While this is useful in some situations, many applications and their users require near-real-time responses to events. Event-driven computation can help you achieve this by reducing latency significantly. It also helps applications better use resources by breaking down operations into smaller bits that can be completed more efficiently than batch operations. This also reduces the overall system complexity, which is essential when dealing with systems that involve intensive computations.

This article will discuss event-driven computation and its advantages. You will learn how event-driven computation works and take a look at the various tools that can be used to implement different stages of the event-driven computation process.


What Are the Benefits of Event-Driven Computation?


Event-driven computation is the key component that enables real-time systems to function, and many of the online applications we rely on today are built on this model. The following are just a few examples of event-driven computation’s benefits.


Responsiveness


The responsiveness of a system can be described as its ability to complete a specific task in the shortest amount of time. For instance, when you click a button in a mobile application, you want it to perform its intended function quickly and correctly. With event-driven computation, the latency of your system will be significantly reduced since each designated event is processed as it occurs and not in batches.


Reduced Complexity


Complexity arises in data applications when different services must directly communicate with one another to process ingested data. With event-driven computation, you can decouple your system into individual units that don’t need to communicate directly with one another. As a result, the data flow is separated from the application’s core logic. This makes your system easy to design, maintain, and update.

For instance, consider a music processing pipeline that includes services for ingesting tracks and extracting both low-level and high-level information from them. In a traditional system, the ingestion service would need to coordinate directly with the other services to ensure the track is processed. In an event-driven system, however, the ingestion service simply publishes an event when music is ingested. The other services listen to their respective events and perform their tasks. This makes your system easier to manage and less complex.


Resource Efficiency


Cutting your operating costs is a direct way to increase your company’s profitability. In a batch processing system, a substantial amount of memory and computational power has to be set aside for data storage and processing. In an event-driven computation system, however, your application processes events as they occur, in smaller volumes. Much less memory needs to be allocated, and this can be done on demand. This helps you and your organization save resources.


How Does Event-Driven Computation Work?


To successfully implement event-driven computation, you must first understand the key steps involved: generating events, streaming data, creating a materialized view, querying the view, and persisting data. The flow of data from an IoT temperature sensor installed in a supermarket’s cold storage room will be used as an example to help explain these steps. This storage room is assumed to hold perishable food items such as fruits and vegetables that must be kept within 1 to 2 degrees Fahrenheit of their optimal temperature to extend their shelf life.


Generating an Event


The first stage in event-driven computation is the generation of an event in an application. This indicates that there has been a state change that requires attention. Users, other applications, or the system itself can trigger an event.

In the temperature sensor example, temperature values are recorded at distinct timestamps, which means each temperature reading is regarded as an event to be streamed. Without the critical step of event generation, the entire system would be dormant because there would be no initial trigger. The temperature sensor would function as a thermometer.

Events can be generated using programming languages like Java and Python, as well as frameworks like Apache Kafka and Amazon Kinesis. A sample temperature reading event in JSON format would look something like this:

{
 "temperature": 1,
 "timestamp": "2023-03-23 14:11:04"
}


Streaming Data


After an event is generated, it is streamed to the appropriate application for processing. Streaming is the process of continuously transmitting data in real time, rather than in batch mode. It is vital for event-driven computation because it enables the application to respond to the action initiated by the event generation step in real time.

After recording a new temperature value, the sensor must stream the data to an application or control system that monitors it and makes decisions as needed.

Streaming is carried out using messaging systems or platforms like Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub, and so on. These systems are designed to provide fault tolerance, reliable delivery, scalability, and security for stream data.


Creating a Materialized View


A materialized view is a snapshot of data that represents the system’s current state based on all events processed up to a given point. This is especially useful in streaming scenarios because it allows events to be computed incrementally as they are streamed, reducing the computational burden when the processed data is needed later on.

Materialized views can be stored in memory or a database for later access by other services requiring current information. You can use tools like RisingWave, Apache Spark, or Amazon Redshift to create a materialized view.

Let’s consider how you’d use RisingWave to create a materialized view of temperature readings. RisingWave requires that you first create a source stream that feeds data to the materialized view. Setting up a temperature data stream in Kafka allows you to connect to it in RisingWave with the following SQL statement:

CREATE SOURCE temperature_reading (
    timestamp TIMESTAMP,
    temperature FLOAT
) WITH (
    connector = 'kafka',
    topic = 'temperature_reading',
    properties.bootstrap.server = 'message_queue:29092',
    scan.startup.mode = 'earliest'
) ROW FORMAT JSON;

This code block defines a source called temperature_reading that consumes JSON messages from a Kafka data stream using the CREATE SOURCE command. Each reading is expected to include a temperature value and a timestamp.

Next, you’d create the materialized view using the command below:

CREATE MATERIALIZED VIEW avg_temp_1min AS
SELECT
    AVG(temperature) AS avg_temp,
    window_end AS reported_timestamp
FROM
    TUMBLE(
        temperature_reading,
        timestamp,
        INTERVAL '1' MINUTE
    )
GROUP BY
    window_end;

The code above uses the CREATE MATERIALIZED VIEW command to create a materialized view named avg_temp_1min to calculate the average temperature reading from the sensor every minute. The temperature events are divided into time windows, and the average temperature per window is aggregated in this view. The TUMBLE function is used to map each event into a one-minute window, after which the average temperature is calculated within each time window. The end of the time window is used as the reported timestamp.


Querying the View


The next step is to query the view. Queries, such as SQL queries and GraphQL queries, are used to retrieve relevant data from the materialized view for use in other applications.

Depending on the application’s requirements, these queries could be simple or complex. The query results could be used to gain insights into the system’s state, perform actions based on business logic, or trigger additional events. Materialized views can be queried using RisingWave, Kafka Streams, Spark SQL, and other tools.

The avg_temp_1min materialized view you learned how to create earlier can be queried as follows:

SELECT * FROM avg_temp_1min 
ORDER BY reported_timestamp;

The command in this code block selects all time-windowed average temperature readings and sorts them according to the reported timestamp. The average temperature can also be reported as a metric to a tool like Grafana, which can visualize it and send alerts when temperature readings exceed a certain threshold.


Persisting the Data


After the data has been queried, the final step is to save it, which entails permanently storing it for later retrieval. Persistence can be achieved by writing data to a file, storing it in a database, or storing it in a data warehouse. Persisting data is crucial because it lets you store historical data, which allows you to perform long-term analysis later.

Some tools for persisting data include databases like MySQL, PostgreSQL, and Apache Cassandra, as well as data warehouses like Amazon Redshift and Google BigQuery.
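
As a minimal sketch, the queried results could be archived into a regular table for long-term analysis; in practice the target is often a separate database or data warehouse, and the avg_temp_history table below, along with its columns, is a hypothetical example rather than part of any specific product:

-- Hypothetical history table for long-term storage; names and types are illustrative.
CREATE TABLE avg_temp_history (
    avg_temp FLOAT,
    reported_timestamp TIMESTAMP
);

-- Copy the windowed averages from the materialized view into permanent storage.
INSERT INTO avg_temp_history
SELECT avg_temp, reported_timestamp
FROM avg_temp_1min;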

Conclusion

This article explained event-driven computation and its benefits, such as responsiveness, reduced complexity, and resource efficiency. Additionally, you learned about the process of event-driven computation and some of the typical tools used to complete each step, such as RisingWave, Amazon Kinesis, and so on.

Using event-driven computation, you can create applications that process data more effectively and quickly respond to events. This is important for applications such as financial trading systems, sensor networks, and social media platforms.


What Are Materialized Views in Databases?


Materialized views are database objects that contain the results of your operations, calculations, or queries. They exist as a locally saved representation of your data that you can use and reuse without recalculating or computing data values. Using materialized views saves processing time and cost and allows you to front-load your frequently used or most time-consuming queries.

Materialized views are similar to views in your SQL databases. However, while views are also virtual representations of your data tables, their query results are not stored or cached locally; only the query expression is retained. Therefore, a major difference between views and materialized views is that a view's results must be recomputed every time you query it.

Views are typically utilized when your data is constantly changing and updating but is only viewed sporadically. On the other hand, materialized views are utilized when data has to be viewed often, but the saved tables are not updated frequently.
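
The difference is easiest to see side by side. The following sketch assumes a hypothetical orders table with order_date and amount columns; the only textual difference between the two statements is the MATERIALIZED keyword, but only the second stores its results:

-- A regular view: only the query expression is stored, so the aggregation
-- is recomputed every time the view is queried. (The orders table is hypothetical.)
CREATE VIEW daily_revenue_v AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;

-- A materialized view: the aggregated results are saved locally and can be
-- read repeatedly without recomputation (until the view is refreshed).
CREATE MATERIALIZED VIEW daily_revenue_mv AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;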

This article explores materialized views in modern-day databases, their use cases, why they are important, how they are connected with data analytics, and their limitations.


Why Do You Need Materialized Views?


The process of creating a materialized view, often called materialization, involves caching the outcome of your query operations. The cached query can be a basic SELECT statement, a JOIN operation, an aggregate summary of your data, or any other SQL query. Your materialization creates a concrete table that is stored locally and takes up space just like any other data table.

You can then query this stored table directly instead of executing from your original data tables. With materialized views, your frequently used variables can be precomputed with the needed joins, aggregates, and filters, so they're ready for all subsequent queries. This helps optimize your queries and alleviates performance issues with common and repeated queries.
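
For instance, a materialized view might front-load a join, a filter, and an aggregation in one pass. The customers and orders tables and their columns in this sketch are assumptions used only for illustration:

-- Hypothetical base tables: customers(customer_id, region) and
-- orders(customer_id, amount, status). The join, filter, and aggregate
-- are computed once when the view is materialized.
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT c.region, SUM(o.amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY c.region;

-- Later queries read the precomputed results directly instead of
-- touching the base tables again.
SELECT * FROM revenue_by_region ORDER BY total_revenue DESC;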

Materialized views can be read-only, updatable, or writeable. DML statements (such as INSERT and MERGE statements) cannot be executed on read-only materialized views; however, they may be executed on updatable and writeable materialized views. Your materialized view can also become outdated as values change in the original data tables. Using triggers, you can refresh your materialized views to recompute the saved values, either manually or on a schedule. A few databases, such as RisingWave, automatically update your materialized views with incremental changes as new values appear in the base table.
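
In databases without automatic maintenance, the recomputation is typically triggered explicitly. Using the hypothetical revenue_by_region view sketched above, a PostgreSQL-style manual refresh would look like the following; in a system such as RisingWave this step isn't needed, since the view is maintained incrementally as new data arrives:

-- Recompute the stored results on demand (PostgreSQL-style syntax).
REFRESH MATERIALIZED VIEW revenue_by_region;

-- Optionally keep the view readable during the refresh; in PostgreSQL this
-- variant requires a unique index on the materialized view.
REFRESH MATERIALIZED VIEW CONCURRENTLY revenue_by_region;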


Use Cases for Materialized Views


As a rule, queries utilizing materialized views are significantly faster and require fewer resources than similar queries that access data directly from your base tables. This is especially true for modern databases, where every query execution has an associated cost and the use of your resources and compute power needs to be optimized.

Materialized views can codify the common logic that's important to your business or data processes and then precompute this logic, making it available and more efficient for your downstream processes such as data reporting or analytics.

Materialized views improve the execution time of your most important or resource-intensive queries, enabling more efficient access to immediately relevant data at the cost of some extra storage. While materialized views need to be refreshed to avoid serving outdated data, you can use triggered updates or databases with auto-refresh features combined with stream processing to keep real-time data precomputed and available for your queries and operations.


Why Are Materialized Views Useful in Modern Data Warehouses?


As your organization continues to generate data, the amount of data you need to process for operations such as analytics, monthly data reporting, or dashboarding increases. Certain operations may require data contained across several tables saved on distributed servers. Executing queries on these data tables, in turn, requires an increasing level of compute resources as the data grows larger and more spread out.

Materialized views offer significant benefits to organizations with OLAP operations, which require substantial computation with expected and recurring queries, such as those in extract, transform, load (ETL) or business intelligence (BI) pipelines. A data warehouse with enhanced materialized view capability allows you to improve your query performance, especially when your queries are frequently used and require aggregated, filtered, clustered, or joined data across several data tables.

Modern data warehouses, especially those utilizing cloud computing, offer per-query pricing or cost over time for the use of available compute power. Materialized views help you reduce the execution time and compute power spent on running the same queries.


Cost-Efficient Queries and Reduced Execution Time


Precomputing results on relevant variables for your business logic and processes is a cost-efficient way to streamline the amount of work done when querying, especially considering how much of this work is repeated as queries are used frequently.

Materialized views are low-maintenance and cost-efficient compared to the alternatives. There is a storage cost to cache the data, and compute power is used when the view is created and refreshed, but this is significantly less than repeatedly rerunning the underlying queries or relying on regular views. You can reduce maintenance costs by ensuring your materialized view is filtered and aggregated to provide only the necessary data. Additionally, compute costs can be reduced if you follow best practices and use materialized views for data that doesn't require frequent recomputation.


Streamlined Data Queries across Departments


For example, an e-commerce organization will find it useful to know their top-performing customers and where they come from, which is valuable information across the marketing, operations, and merchandising departments. The information could be collated from different tables (such as KYC tables and orders) and aggregated in different buckets and categories (such as monthly, by region, and all-time best). The query for this information could be used regularly across departments.

Your operations team could use it for reporting purposes, and your marketing team could use the information to reward customers with discounts and market products differently. Lastly, merchandising can optimize products using your customer data. Rather than querying your database every time this information is needed, you can create a materialized view that precomputes this query and makes it available to the various departments faster with fewer resources used.
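
A hedged sketch of such a shared view follows; the customers and orders tables and their columns are assumptions used only for illustration:

-- Hypothetical shared view: monthly spend per customer and region,
-- collated from customer (KYC) data and orders.
CREATE MATERIALIZED VIEW top_customers_monthly AS
SELECT
    c.customer_id,
    c.region,
    DATE_TRUNC('month', o.order_date) AS order_month,
    SUM(o.amount) AS total_spend,
    COUNT(*) AS order_count
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.region, DATE_TRUNC('month', o.order_date);

Marketing, operations, and merchandising can each query this single precomputed view instead of repeating the joins and aggregations against the base tables.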


Secure Data with Segmented Data Privileges


With materialized views, you can limit data privileges within your organization, giving departments or individuals access to only the data subsets they need without full access to data tables that might contain sensitive data and customer information. This creates a more secure environment for your database within your organization.
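
In PostgreSQL-style databases, for example, access to a materialized view can be granted separately from the underlying tables; the marketing role and the view and table names below are hypothetical:

-- Give the marketing role read access to the aggregated view only,
-- without exposing the underlying customer and order tables.
GRANT SELECT ON top_customers_monthly TO marketing;
REVOKE ALL ON customers, orders FROM marketing;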


Limitations in Modern Capabilities


As a concept, materialized views are still being improved and optimized across modern databases. Some databases only offer limited operations in materialized views, and the refresh and update capabilities of materialized views also vary.

Only a few systems support automatically maintained materialized views, which is an important feature for maintaining data freshness and avoiding outdated data within your materialized views. Additionally, incremental updates of materialized views are not available in all systems; in those systems, refreshing or updating your materialized view requires recomputing the query result and rebuilding the view from scratch every time you need an update.

This can be a critical limitation for you and your organization, as the ability to update your materialized views with incremental changes automatically is integral to building real-time materialized views. Real-time materialized views are immediately updated with only the new data from their base tables, avoiding outdated data and extra compute costs.

You can overcome these limitations with a solution that offers real-time materialized views that can be used for cost-efficient, real-time data analytics, ensuring your most relevant data values are available, precomputed, and up-to-date.

RisingWave's mission is to empower real-time data analytics, enabling stream processing for everyone by making it simple, affordable, and accessible. RisingWave uses stream processing inside databases to refresh materialized views in real time, allowing organizations to benefit from cost-efficient and performant queries on real-time data. This reduces the complexity and cost of running data operations and downstream processes in real time.

Conclusion

In this article, you learned about materialized views, how they differ from regular views in SQL databases, their use cases, and how they are useful to your modern data architecture. You also learned about the limitations of materialized views and how real-time materialized views are a more efficient option.

RisingWave Labs is an early-stage startup developing stream processing applications that offer real-time materialized views. Its flagship product, RisingWave, is a cloud-native streaming database that employs SQL as its interface to decrease the difficulty and expense of developing real-time data solutions. RisingWave consumes streaming data from various sources, such as Kafka and Pulsar, and utilizes real-time materialized views to enable continuous queries and dynamic updates.

Check out this latest tutorial to learn how to use RisingWave as a metrics store to monitor its runtime metrics and visualize them using Grafana.
