Real-time data enrichment and analysis remain common scenarios in the stream processing domain. However, many people still find it difficult to deal with this seemingly straightforward scenario due to various complexities.
Before we delve into these issues, let's review what the industry has already addressed:
The current state of real-time data enrichment with StarRocks
CDC ingestion from operational databases: Modern open-source CDC tools, such as RisingWave and Flink CDC, seamlessly support mainstream databases like MySQL, PostgreSQL, and MongoDB.
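As a sketch of what this looks like in RisingWave, the snippet below ingests a MySQL table via the `mysql-cdc` connector. The hostname, credentials, and the `shop.orders` table are hypothetical placeholders:

```sql
-- Connect to the upstream MySQL instance (placeholder host/credentials).
CREATE SOURCE mysql_source WITH (
    connector = 'mysql-cdc',
    hostname = 'mysql-host',
    port = '3306',
    username = 'rw_user',
    password = 'secret',
    database.name = 'shop'
);

-- Materialize one upstream table into RisingWave; changes stream in continuously.
CREATE TABLE orders (
    order_id BIGINT,
    user_id BIGINT,
    amount DECIMAL,
    PRIMARY KEY (order_id)
) FROM mysql_source TABLE 'shop.orders';
```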
Real-time writes to data warehouses: Popular data warehouse systems like StarRocks efficiently support real-time data loading. Specifically, StarRocks offers features such as:
- Primary key tables: This table type maintains a primary key index inside StarRocks, handling updates and deletes efficiently. By default, RisingWave's StarRocks sink requires this table type.
- Partial updates: Ideal for scenarios where only a small number of columns (usually less than 30%) are modified each time in an enriched table.
- Stream Load: StarRocks supports HTTP-based data loading via Stream Load. Throughput tests show that its write throughput meets the requirements of most scenarios.
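To make the first two features concrete, here is a minimal StarRocks primary key table that could receive enriched rows. The table and column names are illustrative, not from the original pipeline:

```sql
-- A StarRocks PRIMARY KEY table: upserts and deletes on order_id are
-- resolved through the internal primary key index.
CREATE TABLE enriched_orders (
    order_id   BIGINT NOT NULL,
    user_id    BIGINT,
    user_name  VARCHAR(64),
    amount     DECIMAL(10, 2),
    updated_at DATETIME
) PRIMARY KEY (order_id)
DISTRIBUTED BY HASH (order_id);
```

Partial updates then let a load job touch only a subset of these columns (say, just `user_name`) while leaving the rest of each row intact.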
Cleansing and enrichment for small state: Many stream processing solutions, both open-source and commercial, offer robust SQL support, eliminating concerns about limited expressiveness during data cleaning.
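For example, a small-state cleansing step in RisingWave can be expressed as an ordinary SQL materialized view. The `orders` table and its columns here are hypothetical:

```sql
-- Normalize and filter incoming rows; the view stays incrementally
-- up to date as new CDC events arrive.
CREATE MATERIALIZED VIEW clean_orders AS
SELECT
    order_id,
    LOWER(TRIM(email)) AS email,
    amount
FROM orders
WHERE amount > 0;
```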
Challenges with large states
While these solutions offer significant advantages, managing large state remains a challenge for systems like Flink SQL and Materialize. Users often face a trade-off between complex memory-disk tuning and a limited state size. Flink SQL, for instance, provides techniques such as time windows with watermarks, lookup joins, and state cleanup via time-to-live (TTL) to mitigate the issue. However, these techniques still require users to learn advanced checkpoint tuning for large state.
Alternatively, for those to whom cost is not an issue, there is another option: investing in high-spec servers, each carrying a price tag in the hundreds of thousands.
Back to our initial discussion. Even though we have solved basic ETL (Extract, Transform, Load) processes (CDC → data cleansing → real-time data warehouse writes), there are still many unresolved pain points in the industry:
- Maintenance of large states for complex stream processing logic
- How to keep intermediate states for a long time (e.g., several months)
- Real-time materialized views based on enriched tables
The final mile of real-time data enrichment
Those who have delved into these issues may realize that what's missing is precisely a large-state storage tailored for streaming:
- Cost-effective storage: Infrequently accessed data resides on affordable options like HDFS or S3, minimizing storage costs.
- Efficient handling of frequent updates in stream-joins.
- Queryable to power real-time metrics for monitoring scenarios.
- Swift recovery from failures.
RisingWave aims to provide precisely such capabilities.
In the context of real-time data enrichment, RisingWave leverages its storage-compute separation to support previously challenging join pipelines at a lower machine cost, with no tuning tricks required. It can handle complex joins involving many tables: at CVTE, for example, RisingWave effortlessly performs joins across up to 13 tables, and it can handle even more tables without difficulty.
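Such a join pipeline is just a materialized view over multiple tables. A scaled-down sketch with hypothetical table and column names:

```sql
-- A multi-way streaming join; RisingWave keeps the join state on its
-- tiered storage rather than in memory-bound operator state.
CREATE MATERIALIZED VIEW enriched_orders AS
SELECT
    o.order_id,
    o.amount,
    u.name  AS user_name,
    p.title AS product_title
FROM orders o
JOIN users u    ON o.user_id = u.id
JOIN products p ON o.product_id = p.id;
```

The real pipelines referenced above simply extend this pattern to a dozen or more joined tables.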
Enhanced real-time analysis stack
As RisingWave's data enrichment capabilities continue to be validated across user scenarios, we are focusing more on refining this area.
In the recently released version 1.7, we optimized writes to StarRocks, making the entire process even more seamless.
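Delivering enriched results to StarRocks is a matter of creating a sink. The sketch below assumes the RisingWave StarRocks sink connector; the host, database, and table names are placeholders:

```sql
-- Stream a materialized view into a StarRocks primary key table.
CREATE SINK enriched_orders_sink FROM enriched_orders WITH (
    connector = 'starrocks',
    type = 'upsert',
    primary_key = 'order_id',
    starrocks.host = 'starrocks-fe',
    starrocks.query_port = '9030',
    starrocks.http_port = '8030',
    starrocks.user = 'root',
    starrocks.password = '',
    starrocks.database = 'demo',
    starrocks.table = 'enriched_orders'
);
```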
Beyond data enrichment, users can also view RisingWave as a real-time data warehouse cache. In scenarios with low real-time requirements, users can leverage StarRocks for historical analysis. However, when real-time demands become critical, RisingWave's enriched tables can be used to develop materialized views. Users can choose the right technology within the stack based on their specific requirements.
Apart from writing through the StarRocks sink, RisingWave can also serve as an external table for StarRocks, unifying the entry point for batch and streaming queries.
```sql
CREATE EXTERNAL RESOURCE rw_jdbc
PROPERTIES (
    "type" = "jdbc",
    "user" = "postgres",
    "password" = "",
    "jdbc_uri" = "jdbc:postgresql://risingwave-standalone:4566/dev",
    "driver_url" = "https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.3/postgresql-42.3.3.jar",
    "driver_class" = "org.postgresql.Driver"
);

CREATE EXTERNAL TABLE users (
    id BIGINT,
    description VARCHAR(255),
    name TEXT,
    created_at DATETIME
) ENGINE = jdbc
PROPERTIES (
    "resource" = "rw_jdbc",
    "table" = "users"
);
```
While the challenge of managing large states has been addressed, schema changes remain an obstacle. Specifically, if columns are added to the upstream MySQL table, can StarRocks and RisingWave automatically adapt by synchronizing and adding the corresponding columns?
Some industry practices demonstrate the ability to capture upstream DDL changes and update StarRocks accordingly. We are actively exploring similar capabilities for future versions of RisingWave. Currently, RisingWave supports manually adding and dropping columns, and we are continuously enhancing this capability.
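The manual workflow today is a pair of standard DDL statements, applied to the affected table in each system (the `users` table and `phone` column below are illustrative):

```sql
-- Applied manually in RisingWave when the upstream schema changes;
-- a corresponding ALTER is issued on the StarRocks side as well.
ALTER TABLE users ADD COLUMN phone VARCHAR;
ALTER TABLE users DROP COLUMN phone;
```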