Recovery Point Objective

Recovery Point Objective (RPO) is a business continuity metric that quantifies the maximum amount of data loss an organization can tolerate following a disruptive event or system failure. It is measured in time: the interval between the last usable recovery point (for example, the most recent successful backup or checkpoint) and the moment of failure, during which any changes may be lost.

Essentially, RPO answers the question: "Up to what point in time must data be recoverable for the business to resume operations without unacceptable consequences?"

Key Characteristics

Time-Based Metric: RPO is expressed in units of time (e.g., seconds, minutes, hours, days).
Maximum Tolerance: It defines the upper limit of acceptable data loss, not the actual data loss that will occur.
Business Driven: The RPO for a system or application is determined by business requirements, impact analysis, and the criticality of the data. Different systems will have different RPOs.
Inverse Relationship with Cost: A lower RPO (meaning less data loss is acceptable) typically requires more frequent backups or more sophisticated replication mechanisms, which often translates to higher costs.

How RPO Relates to Backups and Checkpoints

Backup Frequency: For traditional systems, RPO directly determines the minimum backup frequency. If the RPO is 1 hour, a usable backup must complete at least once every hour.
Checkpointing in Streaming: In stateful stream processing systems like RisingWave, RPO is closely related to the frequency and mechanism of checkpointing. A checkpoint captures a consistent snapshot of the system's state. If a failure occurs, the system can recover to the last successful checkpoint. A shorter RPO (e.g., a few seconds) means RisingWave needs to perform checkpoints more frequently to minimize potential state loss upon recovery.
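
To make the link between schedule and RPO concrete, here is a minimal sketch of the worst-case loss window for periodic backups or checkpoints. It assumes each backup captures the state as of its start time and only becomes usable once it finishes; the function names are illustrative and not part of any product API.

```python
from datetime import timedelta


def worst_case_loss_window(interval: timedelta, duration: timedelta) -> timedelta:
    """Longest span of changes that can be lost with periodic backups.

    Assumption: a backup reflects the state at its start time and is only
    usable once it has finished, `duration` later.
    """
    # A failure just before backup N+1 finishes must fall back to backup N,
    # whose contents are one full interval plus one duration old.
    return interval + duration


def meets_rpo(interval: timedelta, duration: timedelta, rpo: timedelta) -> bool:
    """True if the schedule keeps the worst-case loss within the RPO."""
    return worst_case_loss_window(interval, duration) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=1)
    # Hourly backups that take 10 minutes can lose up to 70 minutes of data.
    print(meets_rpo(timedelta(hours=1), timedelta(minutes=10), rpo))     # False
    # Backups every 45 minutes that take 10 minutes stay within the hour.
    print(meets_rpo(timedelta(minutes=45), timedelta(minutes=10), rpo))  # True
```

Under these assumptions, the interval between backups has to be shorter than the RPO by at least the time a backup takes to complete.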

RPO vs. RTO (Recovery Time Objective)

It's important to distinguish RPO from RTO:

RPO (Recovery Point Objective): Focuses on data loss. How much data can we afford to lose?
RTO (Recovery Time Objective): Focuses on downtime. How quickly must the system be back online and operational after a failure?

While related, they are distinct. A system might have a very low RPO (minimal data loss) but a higher RTO (takes longer to recover), or vice versa.
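
The distinction can be illustrated with a small sketch that measures both quantities for a single incident. The timestamps and objectives below are made-up values, and `recovery_metrics` is a hypothetical helper rather than an established API.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_recovery_point: datetime,
                     failure: datetime,
                     service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (data_loss_window, downtime) for one incident."""
    data_loss_window = failure - last_recovery_point  # compared against the RPO
    downtime = service_restored - failure             # compared against the RTO
    return data_loss_window, downtime


if __name__ == "__main__":
    loss, downtime = recovery_metrics(
        last_recovery_point=datetime(2024, 1, 1, 12, 0, 0),
        failure=datetime(2024, 1, 1, 12, 0, 45),
        service_restored=datetime(2024, 1, 1, 12, 10, 0),
    )
    rpo, rto = timedelta(minutes=1), timedelta(minutes=5)
    print(loss <= rpo)      # True: 45 s of data at risk, within a 1-minute RPO
    print(downtime <= rto)  # False: 9 min 15 s of downtime, beyond a 5-minute RTO
```

In this example the incident satisfies the RPO but misses the RTO, which shows that the two objectives have to be checked separately.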

Factors Influencing RPO

Data Criticality: Mission-critical data (e.g., financial transactions, healthcare records) typically requires a very low RPO.
Rate of Data Change: Systems with high data update rates may need lower RPOs to avoid losing significant amounts of new data.
Cost of Data Loss: The financial, operational, legal, and reputational impact of losing data.
Industry Regulations: Compliance requirements might mandate specific RPOs.
Cost of Implementation: The expense of achieving a particular RPO (e.g., more frequent checkpoints, robust replication).
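
As a rough illustration of how the rate of data change interacts with RPO, the sketch below bounds the number of updates at risk for a given RPO and picks the loosest candidate RPO that stays within a tolerated loss. The update rate, tolerance, and candidate values are invented for the example, and both helper names are hypothetical.

```python
from datetime import timedelta
from typing import List, Optional


def records_at_risk(updates_per_second: float, rpo: timedelta) -> float:
    """Upper bound on updates lost if a failure consumes the whole RPO window."""
    return updates_per_second * rpo.total_seconds()


def loosest_acceptable_rpo(updates_per_second: float,
                           max_lost_updates: float,
                           candidates: List[timedelta]) -> Optional[timedelta]:
    """Pick the largest candidate RPO whose worst-case loss is tolerable."""
    acceptable = [c for c in candidates
                  if records_at_risk(updates_per_second, c) <= max_lost_updates]
    return max(acceptable) if acceptable else None


if __name__ == "__main__":
    candidates = [timedelta(seconds=1), timedelta(minutes=1), timedelta(hours=1)]
    # At 2,000 updates/s, a 1-minute RPO risks 120,000 updates and a 1-hour RPO
    # risks 7,200,000, so only the first two fit a 200,000-update tolerance.
    print(loosest_acceptable_rpo(2000, 200_000, candidates))  # 0:01:00
```

A real assessment would weigh this bound against the other factors in the list, such as regulatory requirements and implementation cost.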

RPO in RisingWave

In RisingWave, the achievable RPO is primarily determined by the checkpointing mechanism, which persists state through the Hummock state store:

Checkpoint Frequency: RisingWave periodically takes checkpoints of its internal state (e.g., the state of materialized views, aggregations, joins) and persists them to durable cloud storage.
Minimizing Data Loss: If a failure occurs, RisingWave can restore its state from the latest successful checkpoint. Data processed and state changes made after that checkpoint are lost unless the corresponding input can be replayed from a durable upstream source.
Configurability: The interval for checkpointing in RisingWave is configurable, allowing users to balance the RPO (and thus potential data loss) against the overhead of frequent checkpointing. A shorter interval leads to a lower RPO but potentially more system overhead.
Upstream Source Replay: To achieve near-zero RPO, in addition to RisingWave's checkpointing, upstream data sources (like Kafka) should also be configured for durability and replayability, allowing RisingWave to re-ingest data after recovery, starting from the source position recorded in the last successful checkpoint.
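
The interplay between checkpoints and upstream replay can be shown with a toy model. This is a simplified sketch, not RisingWave's implementation or API: the "state" is a running sum, the upstream source is an in-memory list standing in for a replayable log, and `Checkpoint`, `run_with_failure`, and `checkpoint_every` are names invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkpoint:
    offset: int  # next upstream position to read after restoring
    total: int   # operator state (a running sum) as of that position


def run_with_failure(log: List[int], checkpoint_every: int,
                     crash_at: int, replay: bool) -> int:
    """Process `log`, crash once before reading index `crash_at`, then recover."""
    checkpoint = Checkpoint(offset=0, total=0)
    total = 0
    # Normal processing up to the crash, checkpointing every N records.
    for offset in range(crash_at):
        total += log[offset]
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = Checkpoint(offset=offset + 1, total=total)
    # Crash: in-memory state is gone, so restore from the last checkpoint.
    total = checkpoint.total
    # With a replayable source, resume at the checkpointed offset. Without one,
    # records between the checkpoint and the crash cannot be read again.
    resume_from = checkpoint.offset if replay else crash_at
    for offset in range(resume_from, len(log)):
        total += log[offset]
    return total


if __name__ == "__main__":
    log = list(range(1, 11))  # ten records whose values sum to 55
    print(run_with_failure(log, checkpoint_every=4, crash_at=7, replay=True))   # 55
    print(run_with_failure(log, checkpoint_every=4, crash_at=7, replay=False))  # 37
```

With replay, the result matches a failure-free run. Without it, the three records processed after the last checkpoint (values 5, 6, and 7) are missing from the final state. A smaller `checkpoint_every` shrinks that gap at the cost of more frequent checkpoints.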

Achieving a desired RPO is a critical aspect of designing fault-tolerant and resilient streaming data pipelines.