TL;DR
RisingWave exposes Prometheus metrics on each cluster component (meta, compute, frontend, compactor). Pair those metrics with Grafana using the official dashboard definition from the RisingWave repository, and you have full visibility into ingestion lag, processing throughput, memory pressure, and compaction health. For quick health checks without Prometheus, RisingWave also ships a rich set of internal system catalog views under rw_catalog that you can query directly with SQL.
What Metrics Does RisingWave Expose?
RisingWave follows the standard Prometheus scrape model. Each component in a RisingWave cluster exposes a /metrics HTTP endpoint:
| Component | Default metrics port |
|---|---|
| Meta node | 1250 |
| Compute node | 1222 |
| Frontend node | 2222 |
| Compactor | 1260 |
These endpoints return Prometheus text format output, scraped by a standard prometheus.yml job configuration. RisingWave also maintains an official Grafana dashboard definition at grafana/risingwave-user-dashboard.dashboard.py in the main repository.
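To make the text format concrete, here is a minimal Python sketch that parses a /metrics response body into a dictionary. The sample payload is hand-written for illustration, the `source_id` label is an assumption rather than a confirmed RisingWave label, and the parser assumes lines carry no trailing timestamp.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {series: value}.

    Simplified on purpose: comment lines (# HELP / # TYPE) are skipped and
    each sample line is assumed to end with its value (no timestamp).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


# Hand-written sample, not real RisingWave output.
SAMPLE = """\
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.5e+09
# TYPE stream_source_rows_received_total counter
stream_source_rows_received_total{source_id="1"} 42000
"""

metrics = parse_metrics(SAMPLE)
for series, value in metrics.items():
    print(series, value)
```

In practice Prometheus does this parsing for you; the sketch is only useful for ad hoc checks against a single endpoint.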
Key metric categories
Ingestion metrics track how fast data flows from sources (Kafka, CDC, etc.) into RisingWave:
- stream_source_rows_received_total -- cumulative rows ingested per source
- source_kafka_consumer_lag -- consumer group lag per topic partition
- stream_source_split_change_count -- number of partition reassignments
Processing metrics cover the streaming execution engine:
- stream_actor_processing_time_ns -- CPU time per streaming actor
- stream_barrier_send_latency_ms -- end-to-end barrier propagation latency (a proxy for processing lag)
- stream_backpressure_count -- how often downstream actors block upstream ones
Storage metrics reflect Hummock, RisingWave's LSM-tree storage layer:
- storage_write_bytes -- bytes written to storage
- storage_read_bytes -- bytes read during queries
- compaction_success_count -- completed compaction tasks
- storage_level_sst_num -- SST file count per LSM level
Resource metrics show cluster health:
- process_resident_memory_bytes -- RSS per process
- process_cpu_seconds_total -- CPU usage per process
Setting Up Prometheus Scraping
Minimal prometheus.yml
Add a scrape job for each RisingWave component. Adjust targets to match your deployment:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: risingwave-meta
    static_configs:
      - targets: ['localhost:1250']
        labels:
          component: meta
  - job_name: risingwave-compute
    static_configs:
      - targets: ['localhost:1222']
        labels:
          component: compute
  - job_name: risingwave-frontend
    static_configs:
      - targets: ['localhost:2222']
        labels:
          component: frontend
  - job_name: risingwave-compactor
    static_configs:
      - targets: ['localhost:1260']
        labels:
          component: compactor
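If you run several nodes per component, writing targets by hand gets tedious. The following Python sketch renders the same scrape configuration from a mapping of component to hostnames, using the default metrics ports from the table above. The hostnames (`rw-meta-0` and so on) and the function name are hypothetical.

```python
# Default metrics ports as listed in the component table.
DEFAULT_PORTS = {"meta": 1250, "compute": 1222, "frontend": 2222, "compactor": 1260}


def render_scrape_config(hosts, scrape_interval="15s"):
    """Render a prometheus.yml body.

    hosts maps a component name to a list of hostnames; components with
    no hosts are left out of the output.
    """
    lines = ["global:", f"  scrape_interval: {scrape_interval}", "scrape_configs:"]
    for component, port in DEFAULT_PORTS.items():
        targets = [f"{host}:{port}" for host in hosts.get(component, [])]
        if not targets:
            continue
        quoted = ", ".join(f"'{target}'" for target in targets)
        lines += [
            f"  - job_name: risingwave-{component}",
            "    static_configs:",
            f"      - targets: [{quoted}]",
            "        labels:",
            f"          component: {component}",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical hostnames for illustration.
config = render_scrape_config(
    {"meta": ["rw-meta-0"], "compute": ["rw-compute-0", "rw-compute-1"]}
)
print(config)
```

Keeping the `component` label identical to the hand-written configuration means the dashboard and alert queries in this article work unchanged.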
For Kubernetes deployments, use the Prometheus Operator's ServiceMonitor CRD. The RisingWave Operator creates a ServiceMonitor automatically when the Prometheus Operator is detected in the cluster.
Quick validation
After Prometheus starts, verify targets are healthy:
# Check Prometheus targets endpoint
curl http://localhost:9090/api/v1/targets | python3 -m json.tool | grep '"health"'
# Should return "health": "up" for each RisingWave component
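The grep above shows health values but not which job they belong to. As a sketch, the Python below groups the `activeTargets` entries of the /api/v1/targets response by job and lists jobs with any target that is not up. The sample payload is hand-written to match the response shape and is not real output; `fetch_targets` assumes Prometheus is reachable at localhost:9090 and is not called here.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets API response (assumes a reachable Prometheus)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Return {job: {health: count}} for all active targets."""
    summary = defaultdict(lambda: defaultdict(int))
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        summary[job][target["health"]] += 1
    return {job: dict(counts) for job, counts in summary.items()}


def unhealthy_jobs(payload):
    """Return sorted job names that have at least one target not 'up'."""
    return sorted(
        job
        for job, counts in summarize_targets(payload).items()
        if any(health != "up" for health in counts)
    )


# Hand-written sample shaped like the API response.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "risingwave-meta", "instance": "localhost:1250"}, "health": "up"},
            {"labels": {"job": "risingwave-compute", "instance": "localhost:1222"}, "health": "down"},
        ],
        "droppedTargets": [],
    },
}

print(summarize_targets(SAMPLE))
print(unhealthy_jobs(SAMPLE))
```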
Key Dashboard Panels in Grafana
Importing the Official Dashboard
The RisingWave repository ships a Grafana dashboard generator. The easiest path is to use the pre-built JSON file. In Grafana:
- Go to Dashboards -> Import.
- Upload the dashboard JSON from the RisingWave repository, or paste the dashboard ID.
- Select your Prometheus data source.
- Click Import.
The official dashboard includes panels for barrier latency, actor CPU, memory usage, compaction status, and source lag -- all pre-wired to the correct PromQL queries.
Essential PromQL Queries
If you build a custom dashboard, these queries cover the most important pipeline health signals:
Barrier latency (pipeline heartbeat):
histogram_quantile(0.99,
rate(stream_barrier_send_latency_ms_bucket[1m])
)
RisingWave uses a barrier-based consistency model. Each barrier is a checkpoint signal that propagates through the entire pipeline. Barrier p99 latency is the single best proxy for end-to-end pipeline health. Values under 100ms indicate a healthy pipeline; values climbing toward seconds indicate backpressure or resource exhaustion.
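It helps to know how the p99 is derived, because the result is an estimate interpolated from bucket counts rather than an exact measurement. The Python sketch below mirrors the linear interpolation that histogram_quantile applies to cumulative buckets. It is a simplified illustration, not Prometheus source code, and the bucket bounds and counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q < 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, including the
    (math.inf, total_count) bucket that Prometheus histograms carry.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Invented buckets: upper bound in ms, cumulative observation count.
BUCKETS = [(50, 80), (100, 95), (500, 99), (math.inf, 100)]

print(histogram_quantile(0.5, BUCKETS))
print(histogram_quantile(0.99, BUCKETS))
```

One consequence: the reported p99 can never exceed the highest finite bucket bound, so a value pinned at that bound means the real latency may be much higher.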
Source ingestion rate:
rate(stream_source_rows_received_total[1m])
Graph this per source to see per-topic throughput.
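For intuition about what rate() returns, the Python sketch below computes a per-second rate from raw counter samples and treats any decrease as a counter reset, such as a process restart. It is deliberately simplified: the real rate() also extrapolates to the edges of the window, so its results differ slightly. The sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the sampled window.

    samples: list of (unix_seconds, counter_value), oldest first.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Invented samples scraped every 15 seconds.
steady = [(0, 1000), (15, 1600), (30, 2200), (45, 2800), (60, 3400)]
with_reset = [(0, 1000), (15, 1600), (30, 100), (45, 700)]

print(simple_rate(steady))
print(simple_rate(with_reset))
```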
Actor CPU time (hotspot detection):
topk(10, rate(stream_actor_processing_time_ns[1m])) / 1e9
Shows the 10 most CPU-intensive actors. Use this to identify which materialized views are driving compute load.
Memory pressure:
process_resident_memory_bytes{component="compute"} /
(node_memory_MemTotal_bytes{instance=~".*"})
Compaction backlog:
storage_level_sst_num{level="0"}
Level 0 SST count is the most sensitive early indicator of compaction falling behind. Values above 50 warrant investigation; values above 128 trigger write stalls.
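The two thresholds in the paragraph above can be encoded directly, for example in a health-check script. This is a small Python sketch; the function name and the state labels are hypothetical, and the thresholds are the ones stated in this article rather than values read from RisingWave configuration.

```python
INVESTIGATE_ABOVE = 50   # threshold quoted in this article
WRITE_STALL_ABOVE = 128  # threshold quoted in this article


def classify_l0(sst_count):
    """Map a Level 0 SST count to a coarse health state."""
    if sst_count > WRITE_STALL_ABOVE:
        return "write-stall"
    if sst_count > INVESTIGATE_ABOVE:
        return "investigate"
    return "ok"


for count in (12, 64, 200):
    print(count, classify_l0(count))
```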
Alerting on Pipeline Lag
Prometheus alerting rules let you page on-call when pipelines degrade. Add these to your rules.yml:
groups:
  - name: risingwave-pipeline
    rules:
      # Alert when barrier latency exceeds 5 seconds (pipeline is slow)
      - alert: RisingWaveHighBarrierLatency
        expr: |
          histogram_quantile(0.99,
            rate(stream_barrier_send_latency_ms_bucket[2m])
          ) > 5000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "RisingWave barrier p99 latency is {{ $value }}ms"
          description: >
            Pipeline processing is slowing down. Check for backpressure,
            memory pressure, or slow downstream sinks.
      # Alert when Kafka consumer lag spikes
      - alert: RisingWaveKafkaLagHigh
        expr: source_kafka_consumer_lag > 100000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag {{ $value }} on {{ $labels.topic }}"
          description: >
            RisingWave is falling behind ingesting from Kafka.
            Check source throughput and compute node resources.
      # Alert when Level 0 compaction is backing up
      - alert: RisingWaveCompactionBacklog
        expr: storage_level_sst_num{level="0"} > 64
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Hummock L0 SST count {{ $value }} -- compaction is behind"
Configure Alertmanager to route these to PagerDuty, Slack, or your on-call platform.
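The `for:` clause in each rule is what keeps brief spikes from paging anyone: the condition must hold at every evaluation for the full duration before the alert moves from pending to firing. The Python sketch below models that behavior for a simple threshold rule. It is an illustration of the semantics, not Prometheus code, and the sample values are invented.

```python
def alert_state(evaluations, threshold, for_seconds):
    """Return 'inactive', 'pending', or 'firing' as of the last evaluation.

    evaluations: list of (unix_seconds, value), one per rule evaluation,
    oldest first.
    """
    active_since = None
    for ts, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
        else:
            active_since = None  # any dip resets the timer
    if active_since is None:
        return "inactive"
    held_for = evaluations[-1][0] - active_since
    return "firing" if held_for >= for_seconds else "pending"


# Invented barrier p99 values in ms, evaluated every 30 seconds.
spike = [(0, 900), (30, 6000), (60, 800)]
building = [(0, 6000), (30, 6100), (60, 6200)]
sustained = building + [(90, 6300), (120, 6400)]

for name, series in [("spike", spike), ("building", building), ("sustained", sustained)]:
    print(name, alert_state(series, threshold=5000, for_seconds=120))
```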
SQL Monitoring Queries
RisingWave exposes a rich set of internal system views under the rw_catalog schema. These are queryable with any PostgreSQL client and provide a fast way to check pipeline health without opening Grafana.
Cluster node health
SELECT
id,
host,
port,
type,
state,
parallelism,
rw_version,
started_at
FROM rw_catalog.rw_worker_nodes
ORDER BY type;
On a running local cluster, this returns one row per component. The actual output from RisingWave 2.8.0:
id | host | port | type | state | parallelism | rw_version | started_at
----+-----------+------+--------------------------+---------+-------------+------------+---------------------------
0 | 127.0.0.1 | 5690 | WORKER_TYPE_META | RUNNING | | 2.8.0 | 2026-04-15 00:03:35+00:00
1 | 127.0.0.1 | 5688 | WORKER_TYPE_COMPUTE_NODE | RUNNING | 10 | 2.8.0 | 2026-04-15 00:03:35+00:00
2 | 127.0.0.1 | 6660 | WORKER_TYPE_COMPACTOR | RUNNING | 30 | 2.8.0 | 2026-04-15 00:03:35+00:00
3 | 0.0.0.0 | 4566 | WORKER_TYPE_FRONTEND | RUNNING | | 2.8.0 | 2026-04-15 00:03:35+00:00
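To embed this check in a script, you can run the query through any PostgreSQL driver and evaluate the rows. The Python sketch below covers only the evaluation step, so it runs without a database. The function name is hypothetical, and the commented connection snippet assumes the psycopg2 driver and a frontend listening on port 4566 as in the output above.

```python
# Assumed way to obtain rows (not executed here):
#   import psycopg2
#   conn = psycopg2.connect(host="127.0.0.1", port=4566, user="root", dbname="dev")
#   cur = conn.cursor()
#   cur.execute("SELECT type, state FROM rw_catalog.rw_worker_nodes")
#   rows = cur.fetchall()

EXPECTED_TYPES = {
    "WORKER_TYPE_META",
    "WORKER_TYPE_COMPUTE_NODE",
    "WORKER_TYPE_COMPACTOR",
    "WORKER_TYPE_FRONTEND",
}


def find_unhealthy(rows, expected_types=EXPECTED_TYPES):
    """Return human-readable problems for (type, state) rows."""
    problems = []
    seen = set()
    for node_type, state in rows:
        seen.add(node_type)
        if state != "RUNNING":
            problems.append(f"{node_type} is {state}")
    for missing in sorted(set(expected_types) - seen):
        problems.append(f"{missing} has no registered node")
    return problems


# Rows shaped like the sample output, with the compactor removed.
rows = [
    ("WORKER_TYPE_META", "RUNNING"),
    ("WORKER_TYPE_COMPUTE_NODE", "RUNNING"),
    ("WORKER_TYPE_FRONTEND", "RUNNING"),
]
print(find_unhealthy(rows))
```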
Active streaming jobs and their status
SELECT
id,
name,
status,
parallelism,
max_parallelism
FROM rw_catalog.rw_streaming_jobs
ORDER BY id;
Each materialized view, source, and sink appears as a streaming job. The status column shows CREATED (running), CREATING (backfilling), or other states. The parallelism field shows whether the job is running at ADAPTIVE (auto-scaled) or a fixed degree of parallelism.
Pipeline backfill progress
When you create a new materialized view over historical data, RisingWave runs a backfill job. Track its progress:
SELECT
ddl_id,
ddl_statement,
create_type,
progress,
initialized_at
FROM rw_catalog.rw_ddl_progress;
This view is empty when no backfill is running and populates during active CREATE MATERIALIZED VIEW operations with large source tables.
CDC ingestion progress
SELECT
job_id,
split_total_count,
split_backfilled_count,
split_completed_count
FROM rw_catalog.rw_cdc_progress;
For CDC sources (MySQL, PostgreSQL, MongoDB), this view shows how many source splits have been captured. Use it to verify that all CDC splits are fully captured before directing traffic to new materialized views.
Table and view storage size
SELECT
j.name,
s.total_key_count,
pg_size_pretty(s.total_key_size + s.total_value_size) AS approx_size
FROM rw_catalog.rw_table_stats s
JOIN rw_catalog.rw_streaming_jobs j ON s.id = j.id
ORDER BY s.total_key_size + s.total_value_size DESC
LIMIT 10;
This query joins rw_table_stats with rw_streaming_jobs to show which materialized views consume the most storage.
System event log
SELECT
unique_id,
timestamp,
event_type
FROM rw_catalog.rw_event_logs
ORDER BY timestamp DESC
LIMIT 20;
rw_event_logs records cluster lifecycle events: node starts, recovery cycles, barrier completions, and more. It is the first place to look when diagnosing unexpected restarts or recovery events. Events like GLOBAL_RECOVERY_START and GLOBAL_RECOVERY_SUCCESS indicate the cluster underwent a recovery pass, which is normal on startup or after a node failure.
Putting It Together: A Monitoring Runbook
For day-to-day pipeline operations, this three-step check covers most issues:
Step 1: Is the cluster healthy?
SELECT type, state, rw_version FROM rw_catalog.rw_worker_nodes;
-- All rows should show state = 'RUNNING'
Step 2: Are pipelines processing?
Open Grafana and look at the barrier latency panel. A stable p99 under 200ms means the pipeline is healthy. A rising trend or spikes above 1 second indicate resource pressure.
Step 3: Is ingestion keeping up?
rate(stream_source_rows_received_total[5m])
Compare to your expected ingest rate. A sudden drop to zero means the source connection is broken. A steady decline while Kafka lag rises means compute is the bottleneck.
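The decision logic in Step 3 can be written down as a small function, which is handy if you want the runbook to run unattended. This Python sketch is one possible encoding; the function name, the result labels, and the `decline_ratio` cutoff of 0.8 are all assumptions to tune for your workload.

```python
def diagnose_ingestion(current_rate, expected_rate, lag_start, lag_end, decline_ratio=0.8):
    """Classify ingestion health from a rate and a Kafka lag trend.

    current_rate / expected_rate: rows per second.
    lag_start / lag_end: consumer lag at the start and end of the window.
    decline_ratio: hypothetical cutoff below which the rate counts as declined.
    """
    if current_rate == 0:
        # Nothing is arriving at all: suspect the source connection.
        return "source-disconnected"
    if current_rate < expected_rate * decline_ratio and lag_end > lag_start:
        # Slower ingestion plus growing lag: compute cannot keep up.
        return "compute-bottleneck"
    return "keeping-up"


print(diagnose_ingestion(0, 5000, 100, 100))
print(diagnose_ingestion(3000, 5000, 1000, 90000))
print(diagnose_ingestion(4900, 5000, 1000, 900))
```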
Key Takeaways
- RisingWave exposes Prometheus metrics on each component at well-known ports. Standard Prometheus + Grafana tooling works without plugins.
- Barrier latency is the single most important metric for pipeline health. Alert on it first.
- The rw_catalog schema provides SQL-native monitoring that works without a metrics stack -- useful for quick operational checks and for embedding health checks in application code.
- For Kubernetes deployments, the RisingWave Operator integrates with the Prometheus Operator to automate scrape configuration.
FAQ
Which Grafana dashboard should I use?
Start with the official risingwave-user-dashboard from the RisingWave GitHub repository. It is maintained by the RisingWave engineering team and covers all key subsystems. A second dashboard (risingwave_dev_dashboard) is more granular and oriented toward internal debugging.
What is a reasonable barrier latency SLO?
For most production workloads, a p99 barrier latency under 500ms is healthy. Alert at 2 seconds and page at 5 seconds. Latency consistently above 10 seconds indicates the pipeline is effectively stalled.
Can I use Prometheus Operator on Kubernetes?
Yes. The RisingWave Operator creates a ServiceMonitor resource automatically when Prometheus Operator is present in the cluster. No manual scrape configuration is needed.
Are there OpenMetrics or OTLP export options?
RisingWave natively exposes Prometheus text format. You can use the OpenTelemetry Collector with a Prometheus receiver to convert this to OTLP and forward it to any OTLP-compatible backend (Grafana Cloud, Datadog, etc.).
How do I monitor a RisingWave Cloud deployment?
RisingWave Cloud provides a built-in metrics dashboard in the console. For custom alerting, you can use the Prometheus-compatible metrics endpoint exposed by the managed service, or integrate with RisingWave Cloud's native alert configuration.
What is rw_catalog.rw_event_logs useful for?
It is the cluster audit log. Check it when diagnosing unexpected recovery events, unusual startup sequences, or to understand the history of DDL operations that affected pipeline state.
What to Read Next
Ready to instrument your streaming pipelines? Start with RisingWave Cloud for a managed cluster with built-in monitoring, or follow the self-hosted deployment guide to run your own cluster with full Prometheus integration.

