Observability in Streaming Systems

Observability in Streaming Systems refers to the ability to gain deep insights into the internal state and behavior of a stream processing system by examining the data it generates. Given the dynamic, continuous, and often complex nature of streaming applications, robust observability is crucial for development, debugging, performance monitoring, and maintaining operational health. It typically revolves around three main pillars: Metrics, Logging, and Tracing.

Core Pillars of Observability

Metrics (Quantitative Data):
- Numerical measurements of system performance and behavior collected over time.
- Key Streaming Metrics:
  - Throughput: Events/bytes processed per second (per operator, per source/sink, overall pipeline).
  - Latency: Time taken for events to pass through the system or parts of it (e.g., end-to-end latency, operator processing time, sink commit latency).
  - Error Rates: Number or rate of processing errors, failed deserializations, connection issues.
  - Backpressure: Indicators of downstream slowness affecting upstream processing (e.g., buffer utilization, time spent blocked).
  - Watermark Status: Current watermark values, lag behind real-time (for event-time processing).
  - Checkpointing Stats: Duration, size, success/failure rate of checkpoints.
  - State Size: Amount of state being managed by stateful operators.
  - Resource Utilization: CPU, memory, network, disk I/O of system components (e.g., compute nodes, brokers).
- Tools: Prometheus, Grafana, InfluxDB, Datadog Metrics. RisingWave exposes metrics in Prometheus format.
Logging (Qualitative Data):
- Timestamped records of discrete events or errors occurring within the system.
- Purpose in Streaming:
  - Diagnosing errors and unexpected behavior.
  - Tracking the lifecycle of individual events or tasks.
  - Auditing critical operations.
  - Understanding system startup, shutdown, and recovery processes.
- Best Practices: Structured logging (e.g., JSON format) for easier parsing and analysis, appropriate log levels (DEBUG, INFO, WARN, ERROR), contextual information (e.g., operator ID, event ID).
- Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Loki, Fluentd.
Tracing (Request-Scoped Data):
- Provides a view of the entire lifecycle of a request or data flow as it moves through various components of a distributed system.
- Application in Streaming:
  - Understanding end-to-end event flow across multiple operators, sources, and sinks.
  - Identifying bottlenecks by seeing time spent in each processing stage.
  - Debugging issues in complex pipelines involving multiple microservices or external calls.
- Each trace consists of one or more "spans," where each span represents a unit of work and includes metadata like start/end times, operation name, and tags.
- Tools: Jaeger, Zipkin, OpenTelemetry, Datadog APM. While direct event tracing through every low-level operator in a stream processor like RisingWave might be overly verbose, tracing is highly valuable for interactions between RisingWave and external systems (sources, sinks, UDFs) or across microservices that use RisingWave.

Why Observability is Critical for Streaming

Debugging Complexity: Streaming pipelines can be intricate. Observability helps pinpoint where failures or performance degradation occurs.
Performance Optimization: Identifying bottlenecks, understanding resource consumption, and tuning for lower latency and higher throughput.
Operational Stability: Proactive monitoring for issues, understanding system health, and ensuring data integrity.
Capacity Planning: Making informed decisions about scaling resources based on observed load and performance.
Cost Management: Understanding resource usage to optimize cloud costs.
Ensuring Data Correctness: Monitoring for data loss, duplication, or processing errors, especially important for systems aiming for exactly-once semantics.

Observability in RisingWave

RisingWave is designed with observability in mind:

Metrics: Exposes a wide range of internal metrics via a Prometheus endpoint, covering system health, performance, state store operations (Hummock), checkpointing, and more. These can be scraped and visualized using tools like Grafana.
Logging: Provides detailed logs from its various components, which can be collected and analyzed using standard logging solutions.
System Tables & Views: Offers SQL-accessible system tables and views (e.g., rw_catalog, rw_actors, rw_fragments) that provide insights into the system's internal state, metadata, and execution plans.
Tracing Integration (Conceptual): While not providing out-of-the-box distributed tracing for every internal data tuple, its interactions with external systems (connectors, UDFs) can be part of broader application traces.

Effective observability empowers developers and operators to build, manage, and scale reliable and high-performance real-time streaming applications.

Observability in Streaming Systems

Core Pillars of Observability

Why Observability is Critical for Streaming

Observability in RisingWave

Related Blog Posts

Frequently Asked Questions

Related Glossary Terms