Running a Streaming Database on Kubernetes: A Complete Guide

Stream processing systems are notoriously difficult to operate. Between managing stateful workloads, coordinating distributed components, and handling storage backends, the operational overhead can overshadow the actual data engineering work. Kubernetes offers a path out of this complexity, but only if you understand how streaming databases map onto its primitives.

RisingWave is a distributed SQL streaming database built for cloud-native environments. Its architecture separates compute, compaction, and frontend query serving into independent components, each of which maps naturally to Kubernetes workload types. This separation means you can scale each tier independently, apply different resource profiles, and roll upgrades without downtime.

In this guide, you will deploy RisingWave on Kubernetes using the official Helm chart, configure S3-compatible storage, set up Prometheus and Grafana monitoring, and apply production hardening. By the end, you will have a production-grade streaming database running on Kubernetes that can handle real-time workloads at scale.

RisingWave Architecture on Kubernetes

Before diving into the deployment, it helps to understand how RisingWave's components map to Kubernetes resources. RisingWave follows a disaggregated architecture with four core components:

  • Frontend nodes handle SQL parsing, query planning, and client connections. They are stateless and horizontally scalable. In Kubernetes, these run as a Deployment.
  • Compute nodes execute streaming operators (joins, aggregations, materialized views) and hold in-memory state. These are stateful and run as a StatefulSet.
  • Compactor nodes perform background compaction of the LSM-tree storage layer. They are CPU-intensive but stateless, running as a Deployment.
  • Meta node manages cluster metadata, scheduling, and DDL operations. It runs as a StatefulSet with a single replica (or in HA mode with etcd).

The following diagram (in Mermaid notation) shows how the components connect:

graph TB
    Client[SQL Clients] --> FE[Frontend Nodes<br/>Deployment]
    FE --> CN[Compute Nodes<br/>StatefulSet]
    FE --> Meta[Meta Node<br/>StatefulSet]
    CN --> S3[Object Storage<br/>S3 / MinIO]
    Compactor[Compactor Nodes<br/>Deployment] --> S3
    Meta --> etcd[etcd<br/>StatefulSet]

This separation of concerns is what makes RisingWave a good fit for Kubernetes. Each component has different scaling characteristics, resource requirements, and failure modes. Kubernetes lets you manage each one independently.

Installing RisingWave with Helm

The fastest way to deploy RisingWave on Kubernetes is the official Helm chart. It bundles all components with sensible defaults and exposes configuration through a single values.yaml file.

Prerequisites

You need a running Kubernetes cluster (v1.24+), kubectl configured, and Helm 3.x installed. For local testing, minikube or kind works fine. For production, use a managed Kubernetes service like EKS, GKE, or AKS.

# Add the RisingWave Helm repository
helm repo add risingwave https://risingwavelabs.github.io/helm-charts/
helm repo update

# Install RisingWave with default settings (suitable for development)
helm install my-risingwave risingwave/risingwave \
  --namespace risingwave \
  --create-namespace

This installs RisingWave with an embedded meta store and MinIO for object storage. It is good enough for development and testing, but you will want to customize the configuration for production.

Customizing the Installation

Create a values.yaml file to override defaults:

# values.yaml - Production configuration
image:
  tag: "v2.3.0"

metaNode:
  replicaCount: 1
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

frontendNode:
  replicaCount: 3
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

computeNode:
  replicaCount: 3
  resources:
    requests:
      cpu: "4"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "32Gi"

compactorNode:
  replicaCount: 2
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

Apply the custom configuration:

helm install my-risingwave risingwave/risingwave \
  --namespace risingwave \
  --create-namespace \
  -f values.yaml

Verifying the Deployment

After installation, verify all pods are running:

kubectl get pods -n risingwave

Expected output:

NAME                                    READY   STATUS    RESTARTS   AGE
my-risingwave-compute-0                 1/1     Running   0          2m
my-risingwave-compute-1                 1/1     Running   0          2m
my-risingwave-compute-2                 1/1     Running   0          2m
my-risingwave-compactor-6d8f9b7-x4kl2   1/1     Running   0          2m
my-risingwave-compactor-6d8f9b7-z9mn3   1/1     Running   0          2m
my-risingwave-frontend-5c7d8f-abc12     1/1     Running   0          2m
my-risingwave-frontend-5c7d8f-def34     1/1     Running   0          2m
my-risingwave-frontend-5c7d8f-ghi56     1/1     Running   0          2m
my-risingwave-meta-0                    1/1     Running   0          2m

Connect to the frontend and run a quick test:

kubectl port-forward svc/my-risingwave-frontend -n risingwave 4566:4566 &
psql -h localhost -p 4566 -U root -d dev -c "SELECT version();"
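To see streaming in action beyond a version check, you can create a throwaway source and materialized view. The sketch below uses RisingWave's built-in datagen connector for synthetic data; exact option names can vary across RisingWave versions, so treat it as an illustration rather than copy-paste-ready DDL:

```sql
-- Synthetic table producing ~5 random rows per second (datagen connector)
CREATE TABLE clicks (user_id INT, page_id INT) WITH (
    connector = 'datagen',
    fields.user_id.kind = 'random',
    fields.user_id.min = '1',
    fields.user_id.max = '100',
    fields.page_id.kind = 'random',
    fields.page_id.min = '1',
    fields.page_id.max = '10',
    datagen.rows.per.second = '5'
);

-- A continuously maintained aggregation
CREATE MATERIALIZED VIEW clicks_per_page AS
SELECT page_id, COUNT(*) AS clicks
FROM clicks
GROUP BY page_id;

-- Query the incrementally updated result
SELECT * FROM clicks_per_page ORDER BY clicks DESC;
```

Re-running the final SELECT shows the counts growing as the datagen source emits rows.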

StatefulSet Architecture and Storage Configuration

Why StatefulSets for Compute Nodes

Compute nodes in RisingWave maintain in-memory streaming state (partial aggregations, join buffers, materialized view data). When a compute node restarts, it recovers state from object storage. StatefulSets provide stable network identities and ordered startup/shutdown, which RisingWave's recovery protocol relies on.

Each compute node pod gets a stable hostname like my-risingwave-compute-0, my-risingwave-compute-1, and so on. The meta node uses these identities to track which compute node owns which portion of the streaming graph. If a pod restarts, it comes back with the same identity and can recover its state checkpoint from object storage.
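The stable identity comes from pairing the StatefulSet with a headless Service. As a simplified sketch (resource names and fields here are illustrative, not the Helm chart's exact rendered output):

```yaml
# Headless service: no cluster IP, just one DNS record per pod
apiVersion: v1
kind: Service
metadata:
  name: my-risingwave-compute-headless
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/component: compute
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-risingwave-compute
spec:
  # Each pod gets a stable DNS name of the form
  # my-risingwave-compute-0.my-risingwave-compute-headless
  serviceName: my-risingwave-compute-headless
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/component: compute
  template:
    metadata:
      labels:
        app.kubernetes.io/component: compute
    spec:
      containers:
        - name: compute
          image: risingwavelabs/risingwave:v2.3.0
```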

Configuring Object Storage

RisingWave persists all state to object storage. In production, this is typically Amazon S3, Google Cloud Storage, or Azure Blob Storage. For on-premises deployments, MinIO provides an S3-compatible alternative.

Amazon S3 Configuration

# values.yaml - S3 backend
stateStore:
  s3:
    enabled: true
    bucket: "my-risingwave-state"
    region: "us-east-1"
    # Use IAM roles for EKS (recommended)
    # Or provide credentials directly
    credentials:
      useServiceAccount: true
      serviceAccountAnnotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/risingwave-s3-role"

Using IAM Roles for Service Accounts (IRSA) on EKS is the recommended approach. It avoids embedding AWS credentials in your cluster and follows the principle of least privilege. Create an IAM policy that grants s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket on your state bucket.
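A minimal IAM policy covering those four actions might look like the following; the bucket name matches the example above, and note that ListBucket applies to the bucket ARN itself while the object actions apply to its contents:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RisingWaveStateObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-risingwave-state/*"
    },
    {
      "Sid": "RisingWaveStateList",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-risingwave-state"
    }
  ]
}
```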

MinIO for On-Premises or Development

The Helm chart can deploy MinIO as a sub-chart:

# values.yaml - MinIO backend
minio:
  enabled: true
  persistence:
    size: 100Gi
    storageClass: "fast-ssd"
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"

stateStore:
  minio:
    enabled: true
    endpoint: "http://my-risingwave-minio:9000"
    bucket: "risingwave-state"

For production MinIO deployments, use the MinIO Operator instead. It provides erasure coding, multi-tenant support, and automated certificate management.

Persistent Volume Claims for Meta Node

The meta node stores cluster metadata in an embedded SQLite or etcd backend. In production, use etcd with persistent storage:

metaStore:
  etcd:
    enabled: true
    endpoints:
      - "my-risingwave-etcd-0.my-risingwave-etcd-headless:2379"

etcd:
  enabled: true
  replicaCount: 3
  persistence:
    enabled: true
    size: 10Gi
    storageClass: "fast-ssd"

Scaling Compute, Compactor, and Frontend Independently

One of the key advantages of RisingWave's disaggregated architecture is independent scaling. Each component has different bottleneck characteristics:

| Component | Bottleneck                      | Scale when                                          |
|-----------|---------------------------------|-----------------------------------------------------|
| Frontend  | Connection count, query parsing | Many concurrent clients                             |
| Compute   | Memory, CPU                     | More materialized views or higher throughput        |
| Compactor | CPU, I/O bandwidth              | Write amplification increases, compaction lag grows |
| Meta      | Memory (metadata size)          | Rarely needs scaling beyond 1 replica               |

Scaling Compute Nodes

Compute nodes are the most common scaling target. When you add materialized views or increase ingestion throughput, compute nodes need more memory and CPU.

# Scale compute nodes from 3 to 5
helm upgrade my-risingwave risingwave/risingwave \
  --namespace risingwave \
  -f values.yaml \
  --set computeNode.replicaCount=5

RisingWave automatically rebalances streaming operators across compute nodes after scaling. Existing materialized views continue processing during the rebalancing period, though you may see a brief increase in latency.
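To confirm that the new workers have registered, you can inspect the system catalog from any frontend connection. This assumes RisingWave's `rw_catalog.rw_worker_nodes` system table, whose availability and columns may differ across versions:

```sql
-- List registered worker nodes (compute, frontend, compactor, meta)
SELECT * FROM rw_catalog.rw_worker_nodes;
```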

Scaling Frontend Nodes

Frontend nodes are stateless, so scaling them is straightforward. Use a Horizontal Pod Autoscaler (HPA) to scale based on CPU utilization:

# frontend-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: risingwave-frontend-hpa
  namespace: risingwave
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-risingwave-frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Apply it:

kubectl apply -f frontend-hpa.yaml

Scaling Compactor Nodes

Compactors handle background maintenance of the storage layer. If you notice compaction lag increasing (visible in RisingWave's metrics), add more compactor replicas:

helm upgrade my-risingwave risingwave/risingwave \
  --namespace risingwave \
  -f values.yaml \
  --set compactorNode.replicaCount=4

A good rule of thumb: start with 1 compactor per 2 compute nodes and adjust based on observed compaction lag.
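That rule of thumb can be written down as a tiny sizing helper. This is just an illustration of the heuristic stated above (one compactor per two compute nodes, with a floor of one), not an official sizing formula:

```python
import math

def recommended_compactors(compute_nodes: int) -> int:
    """Heuristic: one compactor per two compute nodes, minimum one."""
    return max(1, math.ceil(compute_nodes / 2))

# For a 3-node compute tier the heuristic suggests 2 compactors
for n in (1, 2, 3, 8):
    print(f"{n} compute node(s) -> {recommended_compactors(n)} compactor(s)")
```

Treat the output as a starting point and adjust based on observed compaction lag, as noted above.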

Monitoring with Prometheus and Grafana

RisingWave exposes a rich set of metrics via Prometheus endpoints. The Helm chart can configure ServiceMonitors for automatic scraping if you have the Prometheus Operator installed.

Enabling Prometheus Metrics

# values.yaml - Monitoring configuration
monitoring:
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
      namespace: monitoring
      labels:
        release: kube-prometheus-stack

Each RisingWave component exposes metrics on port 1250 (configurable). The key metrics to monitor are grouped into the pre-built dashboards described below.

Essential Dashboards

RisingWave provides pre-built Grafana dashboards that you can import directly. The key dashboards cover:

Streaming Performance Dashboard

  • stream_actor_output_buffer_blocking_duration - Time actors spend blocked on output. High values indicate backpressure.
  • stream_barrier_latency - Time for a barrier to propagate through the streaming graph. This directly correlates with data freshness.
  • stream_source_rows_per_second - Ingestion throughput from each source.

Storage Dashboard

  • state_store_write_batch_size - Size of write batches to object storage.
  • compactor_compaction_task_duration - How long compaction tasks take. Growing durations suggest you need more compactor resources.
  • state_store_sst_count - Number of SST files. A growing count without corresponding compaction indicates compactor is falling behind.
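Before importing the full dashboards, a quick at-a-glance check is the mean compaction task duration, which can be derived in PromQL from the histogram's sum and count series (assuming the metric above is exported as a Prometheus histogram):

```promql
# Mean compaction task duration over the last 5 minutes
rate(compactor_compaction_task_duration_sum[5m])
  /
rate(compactor_compaction_task_duration_count[5m])
```

A steadily rising mean is the same signal as growing task durations on the dashboard: the compactors need more resources.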

Resource Utilization Dashboard

  • Per-node CPU and memory usage
  • jemalloc memory allocation tracking
  • Network I/O between components

Setting Up Alerts

Create PrometheusRule resources for critical conditions:

# risingwave-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: risingwave-alerts
  namespace: monitoring
spec:
  groups:
    - name: risingwave
      rules:
        - alert: RisingWaveBarrierLatencyHigh
          expr: |
            histogram_quantile(0.99,
              rate(meta_barrier_duration_seconds_bucket[5m])
            ) > 60
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "RisingWave barrier latency p99 exceeds 60s"
            description: "Barrier latency is high, indicating potential backpressure or resource contention."

        - alert: RisingWaveCompactionLag
          expr: |
            compactor_pending_task_count > 100
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "RisingWave compaction tasks backing up"
            description: "More than 100 pending compaction tasks. Consider scaling compactor nodes."

        - alert: RisingWaveNodeDown
          expr: |
            up{job=~"risingwave.*"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "RisingWave component is down"

Apply it:

kubectl apply -f risingwave-alerts.yaml

Production Best Practices

Resource Isolation with Node Pools

In production, dedicate separate node pools for RisingWave components. Compute nodes are memory-intensive and should not compete with compactors for CPU:

# values.yaml - Node affinity
computeNode:
  nodeSelector:
    workload-type: risingwave-compute
  tolerations:
    - key: "risingwave-compute"
      operator: "Exists"
      effect: "NoSchedule"

compactorNode:
  nodeSelector:
    workload-type: risingwave-compactor

Pod Disruption Budgets

Prevent Kubernetes from evicting too many RisingWave pods at once during node upgrades or cluster autoscaling:

# compute-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: risingwave-compute-pdb
  namespace: risingwave
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: compute
      app.kubernetes.io/instance: my-risingwave

Network Policies

Restrict traffic between RisingWave components and external services:

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: risingwave-compute-policy
  namespace: risingwave
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: compute
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/component: frontend
        - podSelector:
            matchLabels:
              app.kubernetes.io/component: meta
  egress:
    - to:
        - podSelector: {}  # Allow cluster-internal traffic
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0  # S3 access
      ports:
        - port: 443
          protocol: TCP

Rolling Updates and Version Upgrades

Upgrade RisingWave by updating the image tag in your values file. The Helm chart handles the rollout strategy:

# Update the image tag
helm upgrade my-risingwave risingwave/risingwave \
  --namespace risingwave \
  -f values.yaml \
  --set image.tag="v2.4.0"

The recommended upgrade order is: meta node first, then frontend, then compactors, and finally compute nodes. The Helm chart respects this ordering through its StatefulSet update strategy. Always review the RisingWave release notes before upgrading, as some versions require manual migration steps.

Backup and Recovery

RisingWave's state lives in object storage, which simplifies backup. Enable versioning on your S3 bucket and set up cross-region replication for disaster recovery:

# Enable S3 versioning
aws s3api put-bucket-versioning \
  --bucket my-risingwave-state \
  --versioning-configuration Status=Enabled

# Set up lifecycle rules to clean old versions
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-risingwave-state \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "cleanup-old-versions",
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 7
      }
    }]
  }'

For the meta store (etcd), schedule regular snapshots:

# CronJob for etcd snapshots
# Note: this assumes a volume is mounted at /backup (e.g. via a PVC);
# `kubectl create cronjob` alone does not attach one, so patch the CronJob
# or define it in YAML for real use.
kubectl create cronjob etcd-backup \
  --namespace risingwave \
  --image=bitnami/etcd:latest \
  --schedule="0 */6 * * *" \
  -- /bin/sh -c "etcdctl --endpoints=http://my-risingwave-etcd:2379 snapshot save /backup/snapshot-\$(date +%Y%m%d%H%M).db"

How Do You Choose Between RisingWave Cloud and Self-Hosted Kubernetes?

RisingWave Cloud is a fully managed service that handles all the operational work described in this guide: provisioning, scaling, monitoring, upgrades, and backups. It is the right choice when you want to focus entirely on building streaming pipelines without managing infrastructure.

Self-hosted Kubernetes deployment makes sense when you need full control over the infrastructure, have strict data residency requirements, or already run a mature Kubernetes platform with dedicated platform engineering support. The operational overhead is significant but manageable if you have the team and tooling.

How Do You Monitor RisingWave Performance on Kubernetes?

RisingWave exposes Prometheus-compatible metrics on port 1250 of every component. The most critical metric is barrier latency (meta_barrier_duration_seconds), which measures how quickly changes propagate through the streaming graph. In a healthy cluster, p99 barrier latency stays under 10 seconds. If it consistently exceeds 30 seconds, check for resource contention on compute nodes or growing compaction lag. The pre-built Grafana dashboards in the RisingWave repository provide ready-to-use visualization for all key metrics.

What Storage Backend Should You Use for RisingWave on Kubernetes?

Amazon S3 is the recommended storage backend for production deployments on AWS due to its durability, availability, and cost profile. For GCP, use Google Cloud Storage. For on-premises deployments, MinIO with erasure coding provides S3-compatible storage. The choice of storage backend does not affect RisingWave's functionality, but it impacts performance characteristics. S3 has higher per-request latency compared to local SSDs, which RisingWave compensates for with aggressive caching and batch I/O. Ensure your object storage bucket is in the same region as your Kubernetes cluster to minimize latency and data transfer costs.

How Do You Scale RisingWave on Kubernetes?

Scale each RisingWave component independently based on its bottleneck. Frontend nodes scale based on connection count and query complexity. Compute nodes scale based on the number of materialized views and streaming throughput. Compactor nodes scale based on compaction lag metrics. Use helm upgrade --set <component>.replicaCount=N for manual scaling, or configure Horizontal Pod Autoscalers for frontend nodes. Compute node scaling triggers automatic rebalancing of the streaming graph, which RisingWave handles transparently.

Conclusion

Running a streaming database on Kubernetes is not trivial, but RisingWave's disaggregated architecture maps cleanly onto Kubernetes primitives. Here are the key takeaways:

  • Use the official Helm chart for deployment. It encodes best practices for component configuration, startup ordering, and health checks.
  • Scale components independently based on their specific bottleneck: memory and CPU for compute, connection count for frontend, CPU and I/O for compactors.
  • S3-compatible object storage decouples state from compute, making recovery and scaling straightforward. Use IRSA on EKS for secure credential management.
  • Monitor barrier latency and compaction lag as your two primary health indicators. Set up alerts for both.
  • Apply Pod Disruption Budgets and node affinity to prevent cascading failures during cluster maintenance.

For a deeper understanding of RisingWave's architecture, check out the architecture documentation. If you are evaluating streaming databases for your Kubernetes platform, the RisingWave vs Flink comparison provides a detailed feature-by-feature analysis.


Ready to deploy RisingWave on your Kubernetes cluster? Get started in five minutes with the Quickstart.

Join our Slack community to ask questions and connect with other stream processing developers.
