Streaming dbt: The Right Way to Unlock Stream Processing with RisingWave

Streaming dbt: The Right Way to Unlock Stream Processing with RisingWave

Have you ever stumbled upon the classic Reddit thread about why dbt is so popular:

The most upvoted reply captured the essence.

The most upvoted reply of this thread.

Why dbt?

In short, dbt (Data Build Tool) is a wildly popular tool in the data warehouse field.

Image

Imagine your team is juggling 200+ SQL views, meticulously transforming and cleaning raw data to build a complex ETL pipeline, with thousands of lines of code crammed into a single file, riddled with redundancy and chaos. That's where dbt shines. It offers a dedicated framework for data transformation (the T in ETL) that is built around the powerful Jinja templating language. It enables SQL statements to behave like functions in programming languages, accepting parameters and being reused throughout your project.

{% set payment_methods = ["bank_transfer", "credit_card", "gift_card"] %}

select
    order_id,
    {% for payment_method in payment_methods %}
    sum(case when payment_method = '{{payment_method}}' then amount end) as {{payment_method}}_amount,
    {% endfor %}
    sum(amount) as total_amount
from app_data.payments
group by 1

Unlike many data tools, dbt doesn't magically boost performance. But what it brings to the table is a revolution in SQL development experience:

  • Modularity: With dbt, you can break down your SQL code into modular components, reducing repetition and ensuring a well-organized structure.
  • Jinja Macros: Think of Jinja macros as standardized and flexible SQL user-defined functions (UDFs) that can be utilized across different database systems.
  • Automated Documentation: dbt simplifies documentation generation by automatically rendering your data models as web pages. This feature promotes transparency and understanding across teams, eliminating the need to search for context or data lineage.
  • Built-in Testing: Say goodbye to data issues. With dbt's built-in testing functionality, you can thoroughly test your data models for duplicate entries, unexpected values, and other anomalies.
  • Database Compatibility: dbt seamlessly integrates with popular database systems like Snowflake, BigQuery, and Redshift. It even offers custom macros tailored for each database, bridging any expression gaps.
  • Thriving Community: The dbt ecosystem benefits from a vibrant and active community. It offers a wide range of community-built packages, including libraries packed with functions for manipulating dates and other data operations.

Streaming dbt

While dbt is great for handling data in batches, RisingWave specializes in handling real-time data. The dbt-risingwave adapter (available via GitHub repository or PyPl project) allows you to combine the best of both worlds, enabling you to benefit from the unique features and capabilities of both dbt and RisingWave.

Here's a glimpse of a batch processing job:

{{config(materialized='incremental') }}

select *, my_slow_function(my_column)
from raw_app_data.events

{% if is_incremental() %}
  -- this filter will only be applied on an incremental run
  where event_time > (select max(event_time) from {{ this }})
{% endif %}

And here's the same concept, but in the context of RisingWave's real-time materialized views, which offer effortless maintainability.

{{config(materialized='materialized_view') }}

select *, my_slow_function(my_column)
from raw_app_data.events;

Unlike other solutions, RisingWave eliminates the need for manual query triggers (via dbt run) or result updates. You no longer have to deal with complex incremental models and time-filtering code. RisingWave's auto-refreshing materialized views ensure that your data is always up-to-date, effortlessly handling new arrivals and keeping your data fresh.

For many dbt users, working with tables is more familiar than materialized views, specifically through the use of CREATE TABLE AS SELECT (”CTAS”). In most data warehouse systems, CTAS generates a static snapshot of query results at a specific moment in time. However, in RisingWave, CTAS corresponds to a materialized view, ensuring that all table results are continuously updated in real-time with the incoming data stream. This eliminates the necessity for a cron job to trigger periodic updates.

Real-time Dashboard

A data lineage graph for the u0022jaffle_shopu0022 project, generated by DBT.

To create a real-time dashboard, we can leverage the combined power of dbt, Metabase, and RisingWave. For detailed instructions on setting up the integration between RisingWave and Metabase, please refer to this integration guide.

exposures:
  - name: jaffle_shop
    type: dashboard
    maturity: high
    url: <http://127.0.0.1:3000/dashboard/1-jaffle-shop#refresh=5>

    depends_on:
      - ref('customers')
      - ref('orders')

In line with our models, we can configure a "dbt exposure," which is a feature tailored for data engineering teams to present their final products. It serves as a presentation layer for your data models.

While RisingWave can provide significantly fresher data with its 1-second refresh rate, there may be a cause for concern because Metabase's minimum auto-refresh interval is limited to 1 minute. This means that, by default, it cannot take full advantage of the real-time data freshness offered by RisingWave. But no need to worry, there is a simple solution. Just add #refresh=5 to the URL, and your dashboard will be refreshed every 5 seconds, bringing it to life with live updates.

The Big Picture: dbt + RisingWave – A Match Made in Data Heaven

dbt stands out as the Git of the modern data stack. It is easy to learn yet surprisingly powerful, covering all aspects of constructing a data warehouse. In the era of real-time data warehouses, the combination of dbt and RisingWave is becoming an irresistible force. RisingWave simplifies streaming data processing, while dbt enables data teams to collaborate and organize more effectively.

If you're seeking a dynamic duo to revolutionize your data practices, give dbt and RisingWave a try. The dbt-risingwave adapter is available at GitHub and PyPl. For details about how to install and use the adpter, see our documentation.

If you have any questions or suggestions about the dbt-risingwave adapter, feel free to join the vibrant community on Slack or raise an issue on Github!

This article discusses the advantages of using DBT (Data Build Tool) for data transformation in data warehouses, highlighting its features like modularity, Jinja templating, and built-in testing that streamline the SQL development process. It also introduces RisingWave as a complement to DBT for real-time data processing, offering a new dbt-risingwave adapter that enables auto-refreshing materialized views, simplifying the transition from batch to real-time processing. Lastly, the article suggests how DBT, when combined with RisingWave, can create powerful real-time dashboards, illustrating the broader impact of this integration on modern data stack practices.

The Modern Backbone for Your
Event-Driven Infrastructure
GitHubXLinkedInSlackYouTube
Sign up for our to stay updated.