Monitoring RisingWave Pipelines with Grafana and Prometheus

TL;DR

RisingWave exposes Prometheus metrics on each cluster component (meta, compute, frontend, compactor). Pair those metrics with Grafana using the official dashboard definition from the RisingWave repository, and you have full visibility into ingestion lag, processing throughput, memory pressure, and compaction health. For quick health checks without Prometheus, RisingWave also ships a rich set of internal system catalog views under rw_catalog that you can query directly with SQL.


What Metrics Does RisingWave Expose?

RisingWave follows the standard Prometheus scrape model. Each component in a RisingWave cluster exposes a /metrics HTTP endpoint:

Component     | Default metrics port
--------------+---------------------
Meta node     | 1250
Compute node  | 1222
Frontend node | 2222
Compactor     | 1260

These endpoints return Prometheus text format output, scraped by a standard prometheus.yml job configuration. RisingWave also maintains an official Grafana dashboard definition at grafana/risingwave-user-dashboard.dashboard.py in the main repository.
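
To make the text format concrete, here is a minimal Python sketch that parses a few sample lines into (name, labels, value) tuples. The sample payload and the source_id label are illustrative assumptions, not output captured from a real cluster, and the parser handles only simple lines (no escaped quotes, no timestamps). A production client would more likely rely on an established parsing library.

```python
import re

# Illustrative sample only; real /metrics output contains many more series.
SAMPLE = """\
# HELP stream_source_rows_received_total Rows received per source
# TYPE stream_source_rows_received_total counter
stream_source_rows_received_total{source_id="1"} 1200
stream_source_rows_received_total{source_id="2"} 300
process_resident_memory_bytes 5.24288e+08
"""

# metric_name, optional {labels}, whitespace, value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, float_value) for simple sample lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


parsed = parse_metrics(SAMPLE)
total_rows = sum(
    value for name, _, value in parsed
    if name == "stream_source_rows_received_total"
)
print(total_rows)
```

Summing the two sample series gives the cluster-wide ingested row count for this hypothetical payload.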

Key metric categories

Ingestion metrics track how fast data flows from sources (Kafka, CDC, etc.) into RisingWave:

  • stream_source_rows_received_total -- cumulative rows ingested per source
  • source_kafka_consumer_lag -- consumer group lag per topic partition
  • stream_source_split_change_count -- number of partition reassignments

Processing metrics cover the streaming execution engine:

  • stream_actor_processing_time_ns -- CPU time per streaming actor
  • stream_barrier_send_latency_ms -- end-to-end barrier propagation latency (a proxy for processing lag)
  • stream_backpressure_count -- how often downstream actors block upstream ones

Storage metrics reflect Hummock, RisingWave's LSM-tree storage layer:

  • storage_write_bytes -- bytes written to storage
  • storage_read_bytes -- bytes read during queries
  • compaction_success_count -- completed compaction tasks
  • storage_level_sst_num -- SST file count per LSM level

Resource metrics show cluster health:

  • process_resident_memory_bytes -- RSS per process
  • process_cpu_seconds_total -- CPU usage per process

Setting Up Prometheus Scraping

Minimal prometheus.yml

Add a scrape job for each RisingWave component. Adjust targets to match your deployment:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: risingwave-meta
    static_configs:
      - targets: ['localhost:1250']
        labels:
          component: meta

  - job_name: risingwave-compute
    static_configs:
      - targets: ['localhost:1222']
        labels:
          component: compute

  - job_name: risingwave-frontend
    static_configs:
      - targets: ['localhost:2222']
        labels:
          component: frontend

  - job_name: risingwave-compactor
    static_configs:
      - targets: ['localhost:1260']
        labels:
          component: compactor

For Kubernetes deployments, use the Prometheus Operator's ServiceMonitor CRD. The RisingWave Operator creates a ServiceMonitor automatically when the Prometheus Operator is detected in the cluster.

Quick validation

After Prometheus starts, verify targets are healthy:

# Check Prometheus targets endpoint
curl http://localhost:9090/api/v1/targets | python3 -m json.tool | grep '"health"'
# Should return "health": "up" for each RisingWave component
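
If you prefer a scripted check over grep, the same validation can be sketched in Python. The snippet below works on a trimmed, hypothetical response in the shape returned by the Prometheus /api/v1/targets endpoint; the helper name unhealthy_targets and the sample targets are assumptions for illustration. In practice you would feed it the body fetched from your own Prometheus server.

```python
import json

# Hypothetical, trimmed response in the shape of GET /api/v1/targets.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "risingwave-meta", "instance": "localhost:1250"},
                "health": "up",
            },
            {
                "labels": {"job": "risingwave-compute", "instance": "localhost:1222"},
                "health": "down",
            },
        ]
    },
})


def unhealthy_targets(response_text):
    """Return (job, instance, health) for every active target that is not 'up'."""
    payload = json.loads(response_text)
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t["labels"].get("job"), t["labels"].get("instance"), t.get("health"))
        for t in targets
        if t.get("health") != "up"
    ]


problems = unhealthy_targets(SAMPLE_RESPONSE)
print(problems)
```

An empty list means every scraped RisingWave component reported healthy.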

Key Dashboard Panels in Grafana

Importing the Official Dashboard

The RisingWave repository ships a Grafana dashboard generator. The easiest path is to use the pre-built JSON file. In Grafana:

  1. Go to Dashboards -> Import.
  2. Upload the dashboard JSON from the RisingWave repository, or paste the dashboard ID.
  3. Select your Prometheus data source.
  4. Click Import.

The official dashboard includes panels for barrier latency, actor CPU, memory usage, compaction status, and source lag -- all pre-wired to the correct PromQL queries.

Essential PromQL Queries

If you build a custom dashboard, these queries cover the most important pipeline health signals:

Barrier latency (pipeline heartbeat):

histogram_quantile(0.99,
  rate(stream_barrier_send_latency_ms_bucket[1m])
)

RisingWave uses a barrier-based consistency model. Each barrier is a checkpoint signal that propagates through the entire pipeline. Barrier p99 latency is the single best proxy for end-to-end pipeline health. Values under 100ms indicate a healthy pipeline; values climbing toward seconds indicate backpressure or resource exhaustion.
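
To see where a p99 figure comes from, the following Python sketch re-implements, in simplified form, the linear interpolation that histogram_quantile applies to cumulative buckets. The bucket boundaries and counts are invented for illustration and are not RisingWave's actual bucket layout. In a real query the inputs are per-second bucket rates rather than raw counts, but the interpolation is the same.

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets.

    buckets: [(upper_bound, cumulative_count), ...]; the largest bound
    must be float("inf"), as in Prometheus classic histograms.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return float("nan")


# Hypothetical barrier latency buckets in milliseconds.
BUCKETS = [(50, 900), (100, 980), (500, 995), (1000, 1000), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, BUCKETS)
print(round(p99, 1))
```

With these sample buckets the p99 lands inside the 100 to 500 ms bucket, which shows why coarse bucket boundaries limit how precisely the quantile can be read.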

Source ingestion rate:

rate(stream_source_rows_received_total[1m])

Plot this per source to see per-topic throughput.
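
As a rough illustration of what rate() computes over a counter, here is a simplified Python sketch. The helper simple_rate and the sample values are hypothetical. It treats counter resets as restarts from zero, which is also what Prometheus does, but it skips the extrapolation to window boundaries that PromQL's rate() performs, so real results will differ slightly.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples: [(timestamp_seconds, counter_value), ...], oldest first.
    A drop in value is treated as a counter reset (restart from zero).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset, everything counted since the restart is new.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrapes every 15 seconds, with a process restart before t=45.
SAMPLES = [(0, 1000), (15, 1600), (30, 2200), (45, 100), (60, 700)]
rows_per_second = simple_rate(SAMPLES)
print(rows_per_second)
```

The reset handling is what keeps a node restart from showing up as a large negative spike in the ingestion panel.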

Actor CPU time (hotspot detection):

topk(10, rate(stream_actor_processing_time_ns[1m])) / 1e9

Shows the 10 most CPU-intensive actors. Use this to identify which materialized views are driving compute load.

Memory pressure:

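process_resident_memory_bytes{component="compute"} /
  (node_memory_MemTotal_bytes{instance=~".*"})

This ratio needs node_exporter metrics for node_memory_MemTotal_bytes, and PromQL only divides series whose label sets match. If the instance labels of the RisingWave process and node_exporter differ (for example, different ports), add an explicit on(...) or ignoring(...) matching clause, or relabel so both series share a common host label.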

Compaction backlog:

storage_level_sst_num{level="0"}

Level 0 SST count is the most sensitive early indicator of compaction falling behind. Values above 50 warrant investigation; values above 128 trigger write stalls.


Alerting on Pipeline Lag

Prometheus alerting rules let you page on-call when pipelines degrade. Add these to your rules.yml:

groups:
  - name: risingwave-pipeline
    rules:
      # Alert when barrier latency exceeds 5 seconds (pipeline is slow)
      - alert: RisingWaveHighBarrierLatency
        expr: |
          histogram_quantile(0.99,
            rate(stream_barrier_send_latency_ms_bucket[2m])
          ) > 5000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "RisingWave barrier p99 latency is {{ $value }}ms"
          description: >
            Pipeline processing is slowing down. Check for backpressure,
            memory pressure, or slow downstream sinks.

      # Alert when Kafka consumer lag spikes
      - alert: RisingWaveKafkaLagHigh
        expr: source_kafka_consumer_lag > 100000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag {{ $value }} on {{ $labels.topic }}"
          description: >
            RisingWave is falling behind ingesting from Kafka.
            Check source throughput and compute node resources.

      # Alert when Level 0 compaction is backing up
      - alert: RisingWaveCompactionBacklog
        expr: storage_level_sst_num{level="0"} > 64
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Hummock L0 SST count {{ $value }} -- compaction is behind"

Configure Alertmanager to route these to PagerDuty, Slack, or your on-call platform.
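
The for: clauses in the rules above mean an alert does not fire on the first breach. It stays pending until the condition has held continuously for the configured duration. The Python sketch below simulates that behavior; the helper alert_states, the sample values, and the 15-second evaluation interval are assumptions for illustration, not part of Prometheus itself.

```python
def alert_states(values, threshold, for_seconds, interval_seconds=15):
    """Return 'inactive', 'pending', or 'firing' for each rule evaluation.

    values: the expression result at each evaluation, oldest first.
    interval_seconds: assumed rule evaluation interval.
    """
    states = []
    breached_for = None  # seconds the condition has held continuously
    for value in values:
        if value > threshold:
            if breached_for is None:
                breached_for = 0  # first breach starts the pending timer
            else:
                breached_for += interval_seconds
            states.append("firing" if breached_for >= for_seconds else "pending")
        else:
            breached_for = None  # any recovery resets the timer
            states.append("inactive")
    return states


# Hypothetical barrier p99 readings in ms, checked against the 5000 ms threshold.
readings = [6000, 6000, 6000, 100, 6000]
states = alert_states(readings, threshold=5000, for_seconds=30)
print(states)
```

A single healthy evaluation resets the timer, which is why short latency spikes do not page anyone when for: is set to a few minutes.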


SQL Monitoring Queries

RisingWave exposes a rich set of internal system views under the rw_catalog schema. These are queryable with any PostgreSQL client and provide a fast way to check pipeline health without opening Grafana.

Cluster node health

SELECT
    id,
    host,
    port,
    type,
    state,
    parallelism,
    rw_version,
    started_at
FROM rw_catalog.rw_worker_nodes
ORDER BY type;

On a running local cluster, this returns one row per component. The actual output from RisingWave 2.8.0:

 id |   host    | port |           type           |  state  | parallelism | rw_version |        started_at
----+-----------+------+--------------------------+---------+-------------+------------+---------------------------
  0 | 127.0.0.1 | 5690 | WORKER_TYPE_META         | RUNNING |             | 2.8.0      | 2026-04-15 00:03:35+00:00
  1 | 127.0.0.1 | 5688 | WORKER_TYPE_COMPUTE_NODE | RUNNING |          10 | 2.8.0      | 2026-04-15 00:03:35+00:00
  2 | 127.0.0.1 | 6660 | WORKER_TYPE_COMPACTOR    | RUNNING |          30 | 2.8.0      | 2026-04-15 00:03:35+00:00
  3 | 0.0.0.0   | 4566 | WORKER_TYPE_FRONTEND     | RUNNING |             | 2.8.0      | 2026-04-15 00:03:35+00:00
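
A health check built on this query can be sketched in Python as follows. The function operates on (type, state) rows as any PostgreSQL driver would return them, so the sketch stays independent of a particular client library. The helper name cluster_problems, the required-type set, and the STARTING state in the degraded example are assumptions for illustration.

```python
# Worker types taken from the sample output above.
REQUIRED_TYPES = {
    "WORKER_TYPE_META",
    "WORKER_TYPE_COMPUTE_NODE",
    "WORKER_TYPE_COMPACTOR",
    "WORKER_TYPE_FRONTEND",
}


def cluster_problems(rows, required_types=REQUIRED_TYPES):
    """Return human-readable problems for rows of (worker_type, state)."""
    problems = []
    seen = set()
    for worker_type, state in rows:
        seen.add(worker_type)
        if state != "RUNNING":
            problems.append(f"{worker_type} is {state}")
    # Report component types that have no worker registered at all.
    for missing in sorted(required_types - seen):
        problems.append(f"{missing} is missing")
    return problems


healthy_rows = [(t, "RUNNING") for t in sorted(REQUIRED_TYPES)]
degraded_rows = [
    ("WORKER_TYPE_META", "RUNNING"),
    ("WORKER_TYPE_COMPACTOR", "STARTING"),  # hypothetical non-running state
    ("WORKER_TYPE_FRONTEND", "RUNNING"),
]
print(cluster_problems(healthy_rows))
print(cluster_problems(degraded_rows))
```

Checking for missing worker types as well as non-running states catches the case where a node has dropped out of the catalog entirely.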

Active streaming jobs and their status

SELECT
    id,
    name,
    status,
    parallelism,
    max_parallelism
FROM rw_catalog.rw_streaming_jobs
ORDER BY id;

Each materialized view, source, and sink appears as a streaming job. The status column shows CREATED (running), CREATING (backfilling), or other states. The parallelism field shows whether the job runs with ADAPTIVE (auto-scaled) parallelism or a fixed degree of parallelism.

Pipeline backfill progress

When you create a new materialized view over historical data, RisingWave runs a backfill job. Track its progress:

SELECT
    ddl_id,
    ddl_statement,
    create_type,
    progress,
    initialized_at
FROM rw_catalog.rw_ddl_progress;

This view is empty when no backfill is running and populates during active CREATE MATERIALIZED VIEW operations with large source tables.

CDC ingestion progress

SELECT
    job_id,
    split_total_count,
    split_backfilled_count,
    split_completed_count
FROM rw_catalog.rw_cdc_progress;

For CDC sources (MySQL, PostgreSQL, MongoDB), this view shows how many source splits have been captured. Use it to verify that all CDC splits are fully captured before directing traffic to new materialized views.
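
One way to turn this view into a go/no-go signal is sketched below in Python. It assumes that a job is fully captured once split_completed_count reaches split_total_count; that interpretation, the helper name cdc_progress, and the sample rows are assumptions rather than documented semantics, so verify them against your RisingWave version.

```python
def cdc_progress(rows):
    """Summarize rows of (job_id, total, backfilled, completed) split counts."""
    report = {}
    for job_id, total, _backfilled, completed in rows:
        fraction = completed / total if total else 0.0
        report[job_id] = {
            "fraction": fraction,
            # Assumption: done means every split is reported as completed.
            "done": total > 0 and completed >= total,
        }
    return report


# Hypothetical rows in the column order of the query above.
ROWS = [
    (1, 8, 8, 8),
    (2, 8, 5, 3),
    (3, 0, 0, 0),
]
report = cdc_progress(ROWS)
all_done = all(entry["done"] for entry in report.values())
print(report, all_done)
```

A deployment script could poll this until all_done is true before switching traffic to the new materialized views.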

Table and view storage size

SELECT
    j.name,
    s.total_key_count,
    pg_size_pretty(s.total_key_size + s.total_value_size) AS approx_size
FROM rw_catalog.rw_table_stats s
JOIN rw_catalog.rw_streaming_jobs j ON s.id = j.id
ORDER BY s.total_key_size + s.total_value_size DESC
LIMIT 10;

This query joins rw_table_stats with rw_streaming_jobs to show which materialized views consume the most storage.

System event log

SELECT
    unique_id,
    timestamp,
    event_type
FROM rw_catalog.rw_event_logs
ORDER BY timestamp DESC
LIMIT 20;

rw_event_logs records cluster lifecycle events: node starts, recovery cycles, barrier completions, and more. It is the first place to look when diagnosing unexpected restarts or recovery events. Events like GLOBAL_RECOVERY_START and GLOBAL_RECOVERY_SUCCESS indicate the cluster underwent a recovery pass, which is normal on startup or after a node failure.


Putting It Together: A Monitoring Runbook

For day-to-day pipeline operations, this three-step check covers most issues:

Step 1: Is the cluster healthy?

SELECT type, state, rw_version FROM rw_catalog.rw_worker_nodes;
-- All rows should show state = 'RUNNING'

Step 2: Are pipelines processing?

Open Grafana, look at the barrier latency panel. A stable p99 under 200ms means the pipeline is healthy. A rising trend or spikes above 1 second indicate resource pressure.

Step 3: Is ingestion keeping up?

rate(stream_source_rows_received_total[5m])

Compare to your expected ingest rate. A sudden drop to zero means the source connection is broken. A steady decline while Kafka lag rises means compute is the bottleneck.


Key Takeaways

  • RisingWave exposes Prometheus metrics on each component at well-known ports. Standard Prometheus + Grafana tooling works without plugins.
  • Barrier latency is the single most important metric for pipeline health. Alert on it first.
  • The rw_catalog schema provides SQL-native monitoring that works without a metrics stack -- useful for quick operational checks and for embedding health checks in application code.
  • For Kubernetes deployments, the RisingWave Operator integrates with the Prometheus Operator to automate scrape configuration.

FAQ

Which Grafana dashboard should I use?

Start with the official risingwave-user-dashboard from the RisingWave GitHub repository. It is maintained by the RisingWave engineering team and covers all key subsystems. A second dashboard (risingwave_dev_dashboard) is more granular and oriented toward internal debugging.

What is a reasonable barrier latency SLO?

For most production workloads, a p99 barrier latency under 500ms is healthy. Alert at 2 seconds and page at 5 seconds. Latency consistently above 10 seconds indicates the pipeline is effectively stalled.
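
These tiers can be encoded directly, for example in a script that annotates dashboards or status pages. The Python sketch below uses the thresholds from the answer above; the function name and the "elevated" label for the range between 500 ms and 2 seconds are assumptions added for illustration.

```python
def classify_barrier_p99(latency_ms):
    """Map a p99 barrier latency in milliseconds to an SLO tier."""
    if latency_ms > 10_000:
        return "stalled"   # pipeline is effectively stalled
    if latency_ms >= 5_000:
        return "page"      # page the on-call engineer
    if latency_ms >= 2_000:
        return "alert"     # raise a non-paging alert
    if latency_ms < 500:
        return "healthy"
    return "elevated"      # assumed label for the 500 ms to 2 s gap


for sample in (120, 800, 2500, 6000, 15000):
    print(sample, classify_barrier_p99(sample))
```

Keeping the thresholds in one function makes it easier to tune them per workload without editing several alert rules.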

Can I use Prometheus Operator on Kubernetes?

Yes. The RisingWave Operator creates a ServiceMonitor resource automatically when Prometheus Operator is present in the cluster. No manual scrape configuration is needed.

Are there OpenMetrics or OTLP export options?

RisingWave natively exposes Prometheus text format. You can use the OpenTelemetry Collector with a Prometheus receiver to convert this to OTLP and forward it to any OTLP-compatible backend (Grafana Cloud, Datadog, etc.).

How do I monitor a RisingWave Cloud deployment?

RisingWave Cloud provides a built-in metrics dashboard in the console. For custom alerting, you can use the Prometheus-compatible metrics endpoint exposed by the managed service, or integrate with RisingWave Cloud's native alert configuration.

What is rw_catalog.rw_event_logs useful for?

It is the cluster audit log. Check it when diagnosing unexpected recovery events, unusual startup sequences, or to understand the history of DDL operations that affected pipeline state.


What to Read Next


Ready to instrument your streaming pipelines? Start with RisingWave Cloud for a managed cluster with built-in monitoring, or follow the self-hosted deployment guide to run your own cluster with full Prometheus integration.
