How Streaming Databases Handle Backpressure

Every streaming system eventually faces the same problem: a producer generates data faster than a consumer can process it. In a batch pipeline, this mismatch simply means a longer job runtime. In a streaming pipeline, it means growing memory usage, increasing latency, and eventually, system failure.

Backpressure is the mechanism that prevents this failure. It is the feedback signal that flows upstream from a slow consumer to a fast producer, telling it to slow down. Without backpressure, a streaming system has no way to gracefully handle load imbalances, and operators either drop data or run out of memory. How a streaming database handles backpressure is one of the most consequential design decisions in any real-time data architecture.

In this article, you will learn what backpressure is, how three major streaming systems handle it (Apache Flink, Apache Kafka, and RisingWave), how to monitor backpressure in production, and what tuning strategies actually work.

What Is Backpressure in Streaming Systems?

Backpressure is a flow control mechanism used in distributed streaming systems to handle situations where a data-producing component generates data faster than a downstream component can consume it. Think of it as a pressure signal that travels backward through a pipeline: when the downstream can't keep up, the upstream slows down.

In a streaming dataflow graph, data flows from source operators through transformation operators (filters, joins, aggregations) to sink operators. Each operator processes records and passes them to the next. When any operator in this chain slows down, records accumulate in the buffers between operators. Without flow control, those buffers grow without bound until the system runs out of memory.

graph LR
    S[Source] -->|fast| A[Aggregation]
    A -->|fast| J[Join]
    J -->|slow| SK[Sink]
    SK -.->|backpressure signal| J
    J -.->|backpressure signal| A
    A -.->|backpressure signal| S
    style SK fill:#ff6b6b,color:#fff
    style J fill:#ffa94d,color:#fff
    style A fill:#69db7c,color:#fff
    style S fill:#69db7c,color:#fff

The diagram above shows a simple streaming pipeline where the sink operator is slow. Backpressure signals propagate backward from the sink through the join and aggregation operators, eventually slowing down the source.

Backpressure handling strategies fall into three categories:

  • Blocking: The upstream operator pauses until the downstream operator catches up. This is the simplest approach but can cause cascading stalls across the entire pipeline.
  • Buffering: An intermediate buffer absorbs temporary bursts, decoupling the speed of adjacent operators. This works well for short spikes but can exhaust memory during sustained overload.
  • Dropping: The system intentionally discards records when buffers are full. This preserves system stability at the cost of data completeness, and is rarely acceptable for database workloads.

Most production streaming systems use a combination of blocking and buffering, with the specific implementation varying significantly between systems.
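The interplay between buffering and blocking is easy to see with a bounded queue: while the buffer has room it absorbs a burst, and once it is full the producer blocks. This is a minimal sketch of the pattern, not tied to any particular engine:

```python
import queue
import threading
import time

# A bounded buffer between a fast producer and a slow consumer.
# While the buffer has room, it absorbs the burst (buffering);
# once full, put() blocks the producer (blocking backpressure).
buf = queue.Queue(maxsize=4)
consumed = []

def producer():
    for i in range(10):
        buf.put(i)  # blocks when the buffer is full

def consumer():
    while len(consumed) < 10:
        item = buf.get()
        time.sleep(0.01)  # simulate slow processing
        consumed.append(item)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()

print(consumed)  # all 10 records arrive, in order -- nothing dropped
```

The producer never outruns the consumer by more than the buffer size, and no record is lost: exactly the trade-off a blocking-plus-buffering design makes.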

How Apache Flink Handles Backpressure: Credit-Based Flow Control

Apache Flink implements a credit-based flow control mechanism within its network stack. This approach was introduced in Flink 1.5 (via FLINK-7282) to replace the earlier TCP-based backpressure mechanism, and it remains the foundation of Flink's flow control today.

How Credit-Based Flow Control Works

In Flink's model, each downstream task grants "credits" to its upstream task. One credit corresponds to one network buffer (typically 32 KB). The upstream task can only send data when it holds credits from the downstream task.

The process works as follows:

  1. The upstream ResultSubPartition sends a backlog size to the downstream InputChannel, announcing how many buffers of data are ready to send.
  2. The downstream task calculates how many buffers it can accept based on available memory.
  3. The downstream task returns credits to the upstream task, authorizing it to send that many buffers.
  4. Every time the sender transmits a buffer, one credit is subtracted from the total.
  5. When the downstream task's input buffers fill up (because it is processing slowly), it stops granting credits.
  6. This causes the upstream task's output buffers to fill, which prevents it from processing more records.
  7. The backpressure propagates all the way to the source operator.

sequenceDiagram
    participant U as Upstream Task
    participant D as Downstream Task
    U->>D: Backlog size = 5 buffers
    D->>U: Grant 3 credits
    U->>D: Send buffer 1 (credits: 2)
    U->>D: Send buffer 2 (credits: 1)
    U->>D: Send buffer 3 (credits: 0)
    Note over U: Blocked - no credits
    D->>U: Grant 2 more credits
    U->>D: Send buffer 4 (credits: 1)
    U->>D: Send buffer 5 (credits: 0)
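The credit exchange in the diagram reduces to a counter: the sender decrements one credit per buffer sent and stalls at zero until the receiver grants more. A deliberately simplified sketch (real Flink tracks credits per InputChannel inside its Netty-based network stack):

```python
class Sender:
    """Toy model of an upstream task holding credits from downstream."""

    def __init__(self):
        self.credits = 0
        self.sent = []

    def try_send(self, buffers):
        # Transmit one buffer per credit; stall when credits hit zero.
        for buf in buffers:
            if self.credits == 0:
                return buf  # blocked: first buffer that could not be sent
            self.credits -= 1
            self.sent.append(buf)
        return None  # everything sent

sender = Sender()
sender.credits += 3          # downstream grants 3 credits
blocked_at = sender.try_send(["b1", "b2", "b3", "b4", "b5"])
print(blocked_at)            # 'b4' -- stalled after 3 sends

sender.credits += 2          # downstream frees buffers, grants 2 more
sender.try_send(["b4", "b5"])
print(sender.sent)           # all five buffers delivered
```

When the downstream stops granting credits, the sender's output fills, it stops processing, and the stall walks upstream: the propagation described in steps 5-7.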

Why Credit-Based Over TCP-Based?

The earlier TCP-based approach had a critical problem: Flink multiplexes multiple logical channels over a single TCP connection. When one logical channel experienced backpressure, the TCP window would shrink, blocking all logical channels on that connection. A single slow task could block unrelated tasks sharing the same physical connection.

Credit-based flow control operates at the logical channel level, allowing fine-grained per-channel backpressure without cross-contamination.

Monitoring Backpressure in the Flink Web UI

Flink 1.13 and later provide enhanced backpressure visualization in the web UI. Each operator shows a color-coded backpressure status, backed by three timing metrics:

| Metric | Meaning |
| --- | --- |
| backPressuredTimeMsPerSecond | Time the task spends blocked waiting for output buffers |
| idleTimeMsPerSecond | Time the task spends waiting for input data |
| busyTimeMsPerSecond | Time the task spends actively processing |

A task with high backPressuredTimeMsPerSecond is being slowed by its downstream operator. A task with high busyTimeMsPerSecond whose upstream task shows high backPressuredTimeMsPerSecond is the actual bottleneck.
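That rule can be written down directly: walk the pipeline from source to sink and pick the first busy task whose upstream neighbor is blocked. A sketch over invented metric samples:

```python
# Per-task samples of Flink's ms-per-second metrics (invented numbers),
# ordered from source to sink.
tasks = [
    {"name": "source", "backPressuredTimeMsPerSecond": 850, "busyTimeMsPerSecond": 150},
    {"name": "join",   "backPressuredTimeMsPerSecond": 700, "busyTimeMsPerSecond": 300},
    {"name": "agg",    "backPressuredTimeMsPerSecond": 10,  "busyTimeMsPerSecond": 980},
    {"name": "sink",   "backPressuredTimeMsPerSecond": 0,   "busyTimeMsPerSecond": 400},
]

def find_bottleneck(tasks, threshold=500):
    # The bottleneck is a busy task whose *upstream* neighbor is blocked
    # (high backPressuredTimeMsPerSecond) waiting for it to accept data.
    for upstream, task in zip(tasks, tasks[1:]):
        if (upstream["backPressuredTimeMsPerSecond"] > threshold
                and task["busyTimeMsPerSecond"] > threshold):
            return task["name"]
    return None

print(find_bottleneck(tasks))  # 'agg' -- join is blocked, agg is busy
```

Here the source and join are mostly backpressured, the aggregation is saturated, and the sink is underutilized: the aggregation is the operator to optimize or scale.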

How Apache Kafka Handles Backpressure: Consumer Lag and Pull-Based Design

Apache Kafka takes a fundamentally different approach to backpressure because of its architecture. Kafka is a distributed log, not a processing engine. It decouples producers from consumers through persistent, partitioned commit logs. There is no direct backpressure signal from consumer to producer. Instead, backpressure manifests as consumer lag: the growing offset difference between the latest produced message and the latest consumed message.

The Pull-Based Model

Kafka consumers pull data at their own pace rather than having data pushed to them. If a consumer falls behind, it simply fetches less frequently. The data remains safely stored in Kafka's log (subject to retention policies) until the consumer catches up. This pull-based design provides natural backpressure without any explicit signaling mechanism.

graph LR
    P[Producer] -->|write| K[Kafka Broker<br/>Offset: 1000]
    K -->|read at offset 950| C[Consumer<br/>Lag: 50]
    style K fill:#4dabf7,color:#fff
    style C fill:#ffa94d,color:#fff

Managing Consumer Lag

While Kafka's architecture provides inherent resilience to backpressure, consumer lag still needs active management. Key configuration parameters include:

  • max.poll.records: Controls how many records a single poll() call returns. Reducing this value gives the consumer smaller batches to process, preventing timeouts.
  • max.poll.interval.ms: The maximum time between poll() calls before the consumer is considered dead. If processing takes longer than this interval, the consumer is removed from the group and partitions are rebalanced.
  • fetch.min.bytes and fetch.max.wait.ms: Control how much data the broker accumulates before responding to a fetch request, allowing you to trade latency for throughput.
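Those parameters interact: the batch size per poll and the poll interval together bound how much time the consumer may spend on each record before it risks a rebalance. The values below are illustrative starting points to show the trade-offs, not tuned recommendations:

```python
# Illustrative consumer settings for a slow downstream (example values,
# not recommendations).
consumer_config = {
    # Smaller batches per poll() -> shorter processing time per batch,
    # keeping the consumer inside max.poll.interval.ms.
    "max.poll.records": 100,          # Kafka default: 500
    # Maximum time between poll() calls before the consumer is kicked
    # from the group and its partitions are rebalanced.
    "max.poll.interval.ms": 300_000,  # Kafka default: 5 minutes
    # Let the broker accumulate at least 64 KB (or wait up to 500 ms)
    # before answering a fetch: trades latency for throughput.
    "fetch.min.bytes": 65_536,
    "fetch.max.wait.ms": 500,
}

# Rough safety check: the per-record processing budget the config implies.
per_record_budget_ms = (consumer_config["max.poll.interval.ms"]
                        / consumer_config["max.poll.records"])
print(per_record_budget_ms)  # 3000.0 ms per record before a rebalance risk
```

If average per-record processing time approaches that budget, either shrink max.poll.records further or raise max.poll.interval.ms before the group coordinator starts evicting consumers.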

Partition-Level Flow Control

For more granular control, Kafka consumers can pause and resume individual partitions:

// Pause partition when downstream is overloaded
consumer.pause(Collections.singleton(new TopicPartition("events", 0)));

// Resume when downstream recovers
consumer.resume(Collections.singleton(new TopicPartition("events", 0)));

This allows a consumer to temporarily stop fetching from specific partitions while continuing to process others, and it prevents rebalancing because the consumer continues to send heartbeats and call poll().

The Limitation

Kafka's approach works well when consumer lag is temporary and the consumer can eventually catch up. But if a consumer is permanently slower than the producer, lag grows indefinitely. Kafka does not slow down producers, and once data exceeds the retention period, it is deleted regardless of whether the consumer has processed it.
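This failure mode is easy to quantify: if the producer is permanently faster, lag grows linearly, and eventually records expire before the consumer reaches them. A back-of-the-envelope check (all rates and the retention window are made-up numbers):

```python
# Hypothetical rates: producer writes 5,000 rec/s, consumer handles 4,000.
produce_rate = 5_000               # records/second
consume_rate = 4_000               # records/second
retention_seconds = 7 * 24 * 3600  # 7-day retention

lag_growth = produce_rate - consume_rate  # 1,000 records/s, forever

# A record produced at time t is read once the consumer works through
# the lag in front of it, i.e. after a delay of lag_growth * t / consume_rate.
# Data loss starts when that delay exceeds the retention window:
#   lag_growth * t / consume_rate > retention_seconds
t_loss = retention_seconds * consume_rate / lag_growth
print(t_loss / 86400)  # 28.0 -- days until records expire unread
```

A consumer that is only 20% too slow survives for weeks on a 7-day retention, which is exactly why slowly growing lag is easy to ignore until it isn't.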

How RisingWave Handles Backpressure: Barrier-Based Flow Control

RisingWave is a streaming database that uses a barrier-based flow control mechanism tightly integrated with its checkpoint system. Unlike Flink's credit-based per-buffer approach, RisingWave's backpressure operates at the actor level using bounded channels and checkpoint barriers.

Bounded Channels Between Actors

In RisingWave's streaming engine, the dataflow graph is divided into fragments, and each fragment runs as one or more actors on compute nodes. Upstream and downstream actors connect via bounded channels. When a downstream actor cannot consume data quickly enough, its input channel fills up, causing the upstream actor to block on sends. This blocking propagates upstream toward the source actors.

graph TD
    subgraph Compute Node 1
        S1[Source Actor] -->|bounded channel| A1[Agg Actor]
    end
    subgraph Compute Node 2
        A1 -->|bounded channel| J1[Join Actor]
        J1 -->|bounded channel| SK1[Sink Actor]
    end
    subgraph Meta Node
        M[Barrier Generator] -.->|barrier every 1s| S1
    end
    M -.->|collect barriers| SK1
    style J1 fill:#ff6b6b,color:#fff
    style SK1 fill:#ffa94d,color:#fff

Barriers and Checkpoints

RisingWave injects barrier messages into the data stream at a configurable interval (default: every 1 second via barrier_interval_ms). These barriers flow along with regular data through the entire dataflow graph. When a barrier reaches all terminal actors and is reported back to the Meta Node, it marks a consistent checkpoint.

The key interaction with backpressure is this: source actors prioritize barrier consumption. They only ingest external data (from Kafka, for example) when the barrier channel is empty. This means that when backpressure causes barriers to queue up, source actors automatically throttle their ingestion rate. The system self-regulates without any explicit credit or token mechanism.
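The self-regulation can be sketched as a source loop that checks its barrier queue first and pulls external data only when no barrier is pending. This is a simplified model of the behavior described above, not RisingWave's actual actor code:

```python
from collections import deque

def run_source(barriers, external, output_capacity):
    # Toy source actor: barriers always win; external data is pulled only
    # when no barrier is queued. A full output buffer stands in for
    # downstream backpressure, which stalls the actor entirely.
    output = []
    while barriers or external:
        if len(output) >= output_capacity:
            break  # downstream full: actor blocks, ingesting nothing
        if barriers:
            output.append(barriers.popleft())   # prioritize the barrier
        else:
            output.append(external.popleft())   # ingest only when idle
    return output

# Healthy pipeline: the barrier drains immediately, data flows after it.
out_healthy = run_source(deque(["barrier-1"]), deque(["e1", "e2"]),
                         output_capacity=10)
print(out_healthy)  # ['barrier-1', 'e1', 'e2']

# Backpressured pipeline: output is nearly full, so after the barrier
# squeezes in, ingestion of external events stops.
out_blocked = run_source(deque(["barrier-1"]), deque(["e1", "e2"]),
                         output_capacity=1)
print(out_blocked)  # ['barrier-1'] -- the source throttled itself
```

No credits, no tokens: the bounded output and the barrier priority alone are enough to make the source slow down when the pipeline is congested.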

Barrier latency, the time it takes for a barrier to travel from the Meta Node through all compute nodes and back, serves as RisingWave's primary indicator of system health. High barrier latency means the pipeline is under backpressure.

Unaligned Joins: Solving the Amplification Problem

One particularly challenging backpressure scenario occurs with join operators. A single input record might match thousands of records on the other side of a join, producing a burst of output that overwhelms downstream operators. This "join amplification" causes barrier latency to spike because checkpoint barriers get stuck behind the massive output queue.

RisingWave addresses this with unaligned joins, which you can enable with:

SET streaming_enable_unaligned_join = true;

When enabled, RisingWave inserts an intermediate buffer after the join operator. The join writes its amplified output to this buffer. Checkpoint barriers bypass the buffered data and pass through to the next operator immediately, rather than waiting for the slow downstream to consume the join output. This keeps barrier latency low even when the join produces massive data volumes.

The trade-off is a slight increase in end-to-end latency for data flowing through the buffered path. But this localized latency increase is far preferable to the entire pipeline stalling due to backpressure.
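The effect of the intermediate buffer is simply a reordering: instead of queueing behind the amplified join output, the barrier overtakes it. A toy illustration of that ordering (not RisingWave internals):

```python
def emit_order(join_output, barrier, unaligned):
    # Aligned flow: the barrier queues behind the amplified join output,
    # so barrier latency tracks the size of the burst.
    # Unaligned flow: the barrier bypasses the buffered rows and reaches
    # the next operator immediately.
    if unaligned:
        return [barrier] + join_output
    return join_output + [barrier]

burst = [f"row-{i}" for i in range(3)]  # stand-in for amplified output
print(emit_order(burst, "barrier", unaligned=False))  # barrier emitted last
print(emit_order(burst, "barrier", unaligned=True))   # barrier emitted first
```

With unalignment on, checkpoint progress no longer depends on how fast the downstream drains the join's burst, which is why barrier latency stays flat even under heavy amplification.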

Comparing Backpressure Mechanisms Across Systems

Each system's backpressure mechanism reflects its core architecture:

| Aspect | Apache Flink | Apache Kafka | RisingWave |
| --- | --- | --- | --- |
| Mechanism | Credit-based flow control | Pull-based consumer model | Bounded channels + barriers |
| Granularity | Per network buffer | Per partition | Per actor channel |
| Signal direction | Downstream to upstream (credits) | No explicit signal (lag-based) | Downstream to upstream (channel blocking) |
| Buffering | Fixed-size network buffers | Broker-side commit log | Bounded in-memory channels |
| Checkpoint interaction | Barrier alignment in buffers | N/A (no checkpoints) | Barrier priority at sources |
| Data loss risk | None (blocks upstream) | Possible (retention expiry) | None (blocks upstream) |
| Latency impact | Upstream stalls propagate | Consumer lag grows | Barrier latency increases |

Flink's credit-based system provides the most granular control, operating at the network buffer level. Kafka offloads the problem to its durable log, making backpressure a capacity planning concern rather than a flow control one. RisingWave's approach ties backpressure directly to its consistency model through barriers, ensuring that flow control and fault tolerance work in concert.

Monitoring Backpressure in Production

Detecting backpressure early is critical. By the time users notice increased query latency, the system may already be severely constrained. Here are the key metrics to monitor for each system.

Flink Metrics

  • backPressuredTimeMsPerSecond: How many milliseconds per second an operator spends blocked waiting for output buffers. Values above 500 ms/s mean the operator is backpressured more than half the time.
  • numRecordsOutPerSecond: A sudden drop in throughput at a specific operator often indicates it is being backpressured.
  • Checkpoint duration: Increasing checkpoint times can indicate that barriers are stuck behind backpressured operators.

Access these through Flink's web UI or export them to Prometheus and Grafana for alerting.

Kafka Metrics

  • records-lag-max: The maximum lag across all partitions for a consumer group. Set alerts when this exceeds your acceptable latency budget.
  • records-consumed-rate: If consumption rate drops while production rate stays constant, backpressure is building.
  • poll-rate: A declining poll rate suggests the consumer is spending more time processing and less time fetching.

Tools like Conduktor or Confluent Control Center provide visual consumer lag dashboards.

RisingWave Metrics

  • Barrier latency: The primary health indicator. Access the "Streaming - Backpressure" panel in the Grafana dashboard bundled with RisingWave.
  • Actor Output Blocking Time Ratio: Shows which specific actors are experiencing backpressure. A high ratio means the actor is frequently blocked waiting for its downstream to accept data.
  • Source throughput: A drop in source ingestion rate while upstream systems (Kafka, etc.) have data available indicates backpressure throttling at the source.

To diagnose the root cause, find the frontmost bottleneck (the actor closest to the source with high blocking time) and correlate it with the Fragment panel in the RisingWave Dashboard and EXPLAIN CREATE MATERIALIZED VIEW output.

Tuning Strategies for Backpressure

Monitoring tells you backpressure exists. Tuning resolves it. Here are practical strategies organized by system.

General Strategies (All Systems)

  1. Scale out the bottleneck operator: Add parallelism to the slow operator. In Flink, increase the operator's parallelism. In RisingWave, the system can scale fragments across compute nodes.
  2. Optimize the slow operator: Examine the operator's logic. Is it doing expensive I/O? Can you batch writes to external systems? Can you reduce the state size for joins?
  3. Use bounded buffers deliberately: Avoid infinite queues. Always set upper limits on how much data can accumulate between operators. This forces the system to apply backpressure rather than silently consuming all available memory.
  4. Separate fast and slow paths: If some data paths are inherently slower (e.g., writing to a database vs. writing to a log), consider splitting the pipeline so the slow path does not backpressure the fast path.
Flink-Specific Tuning

  • Increase taskmanager.network.memory.buffers-per-channel: More buffers per channel allow larger bursts before backpressure triggers. The default is 2; increasing to 4-8 can help with bursty workloads.
  • Tune taskmanager.network.memory.floating-buffers-per-gate: Floating buffers are shared across channels. Increasing this pool helps when only some channels are bursty.
  • Enable unaligned checkpoints: When backpressure causes checkpoint timeouts, unaligned checkpoints allow barriers to overtake buffered data, similar in concept to RisingWave's unaligned joins.

Kafka-Specific Tuning

  • Increase consumer max.poll.records: Larger batches reduce per-record overhead, but increase per-poll processing time. Find the sweet spot.
  • Add consumer instances: Scale the consumer group horizontally. Each new instance takes over some partitions.
  • Batch downstream writes: Instead of inserting one record at a time into your database, batch 100-1000 records per write. This dramatically reduces per-record I/O overhead.
  • Increase partitions: More partitions allow more consumer instances. But note that partition count cannot be decreased, so plan carefully.
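The batching strategy in particular is mechanical enough to sketch. The loop below accumulates records and flushes them in groups; `flush` is a hypothetical stand-in for one bulk write (e.g. a multi-row INSERT) to your downstream store:

```python
def flush(batch, written):
    # Stand-in for a single bulk write; a real sink would issue one
    # I/O call (e.g. a multi-row INSERT) for the whole batch here.
    written.append(list(batch))
    batch.clear()

def write_batched(records, batch_size, written):
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            flush(batch, written)
    if batch:
        flush(batch, written)  # don't strand a partial final batch

written = []
write_batched(range(2_500), batch_size=1_000, written=written)
print([len(b) for b in written])  # [1000, 1000, 500] -- 3 writes, not 2,500
```

Three bulk writes instead of 2,500 single-record writes is the entire win; in practice you would also flush on a timer so a trickle of records is not held indefinitely.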

RisingWave-Specific Tuning

  • Enable unaligned joins: For pipelines with high join amplification, set streaming_enable_unaligned_join = true to prevent join output from blocking checkpoint barriers.
  • Adjust barrier_interval_ms: The default is 1000ms. Increasing it reduces barrier overhead but makes checkpoints less frequent, so recovery points are coarser. Decreasing it provides more responsive backpressure detection at the cost of higher overhead.
  • Scale compute nodes: RisingWave's cloud-native architecture allows you to add compute nodes independently. The system redistributes actors across available nodes.
  • Optimize materialized views: Use EXPLAIN CREATE MATERIALIZED VIEW to examine the query plan. Look for expensive operators (large state joins, complex aggregations) and consider restructuring the query or breaking it into multiple materialized views.

What Is the Difference Between Backpressure and Data Loss?

Backpressure is a protective mechanism that prevents data loss. When a downstream operator is slow, backpressure signals the upstream to slow down, keeping all data intact at the cost of increased latency. Data loss occurs when a system has no backpressure mechanism (or when buffers overflow before backpressure can take effect) and records are dropped.

In Flink and RisingWave, backpressure blocks the upstream, guaranteeing zero data loss. In Kafka, data is not lost during consumer lag because it is persisted in the broker's commit log, but if the consumer never catches up and data exceeds the retention period, those records are permanently deleted.

How Do You Detect Backpressure Before It Becomes a Problem?

The best approach is to monitor leading indicators rather than waiting for symptoms. Track buffer utilization and channel blocking ratios, not just end-to-end latency. In Flink, watch backPressuredTimeMsPerSecond at every operator. In Kafka, alert on records-lag-max trends rather than absolute values, because a steadily increasing lag is more dangerous than a temporarily high but stable lag. In RisingWave, monitor barrier latency and the Actor Output Blocking Time Ratio in the Grafana dashboard. Set alerts at 50-70% of your capacity limits so you have time to react before the system saturates.
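Alerting on trend rather than absolute lag can be as simple as fitting a slope over a sliding window of lag samples. A sketch with made-up numbers:

```python
def lag_slope(samples):
    # Least-squares slope of lag over evenly spaced samples: positive and
    # sustained means the consumer is falling behind, even if the
    # absolute lag still looks small.
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

stable_but_high = [9_000, 9_100, 8_950, 9_050, 9_000]  # scary number, flat trend
low_but_growing = [100, 300, 500, 700, 900]            # small number, steady climb

print(lag_slope(stable_but_high))  # near zero: no alert
print(lag_slope(low_but_growing))  # 200 records per sample: alert
```

The first series would trip a naive absolute-value alert while being perfectly healthy; the second would sail under it while heading toward retention expiry: the trend is the signal.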

When Should You Scale Out vs. Optimize the Bottleneck?

Scale out when the bottleneck operator is CPU-bound and its logic is already efficient. Adding parallelism distributes the work across more cores or nodes. Optimize first when the bottleneck is I/O-bound (slow database writes, network calls) or when the operator's logic is inefficient (unnecessary state lookups, unoptimized joins). In practice, start with optimization because it is cheaper and often yields larger improvements. If profiling shows the operator is already efficient and simply needs more compute, then scale out.

Conclusion

Backpressure is not a bug or a failure mode. It is a necessary safety mechanism that keeps streaming systems stable under load. The key takeaways from this deep dive:

  • Flink uses credit-based flow control that operates at the network buffer level, providing fine-grained per-channel backpressure without cross-contamination between logical connections.
  • Kafka relies on its pull-based architecture and durable commit log to handle backpressure implicitly through consumer lag, but requires active monitoring and consumer-side tuning.
  • RisingWave uses bounded channels and barrier-based flow control that ties backpressure directly to its checkpoint mechanism, with source actors automatically throttling ingestion when barriers queue up.
  • Monitoring is essential: Track buffer utilization, barrier latency, and consumer lag before end-to-end latency degrades.
  • Tuning follows a pattern: Identify the bottleneck operator, optimize its logic, batch I/O operations, and scale out only when the operator is already efficient.

Understanding how your streaming system handles backpressure is the difference between a pipeline that gracefully degrades under load and one that falls over at 2 AM.


Ready to build streaming pipelines with built-in backpressure handling? Try RisingWave Cloud free, no credit card required. Sign up here.

Join our Slack community to ask questions and connect with other stream processing developers.
