What Does Consistency Really Mean in Data Systems?
In this article, we will clarify what consistency actually means across different types of data systems.
“Can your system achieve strong consistency?” As a practitioner who has been building stream processing systems over the last few years, I encounter this question frequently. I'm often tempted to answer with a confident yes and promote our product, but in reality the answer is not that simple. The challenge lies not in the question itself but in the fact that "consistency" means different things to different people depending on their technical background.
In fact, people coming from different backgrounds, such as:

- traditional (OLTP) databases,
- distributed systems, and
- stream processing systems,

each have their own interpretation of consistency, and answering without that context can easily lead to misunderstandings. In this article, I will clarify what consistency actually means across these different data systems.
In traditional databases, consistency is a cornerstone of the ACID properties (Atomicity, Consistency, Isolation, Durability). It guarantees that each transaction transitions the database from one valid state to another. For instance, in a banking transaction where one account is debited and another credited, consistency ensures the aggregate balance remains unchanged. It's important to distinguish this from atomicity, the all-or-nothing property of transactions: either every operation within a transaction completes successfully or none of them do, so no partially applied transaction is ever left in the database.
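To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module (the same idea applies to any relational database): a CHECK constraint encodes the consistency rule that balances may never go negative, while the transaction provides atomicity for the two-step transfer.

```python
import sqlite3

# An in-memory bank ledger: the CHECK constraint is the consistency rule,
# the transaction around the two UPDATEs is the atomicity guarantee.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(amount, src, dst):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
    except sqlite3.IntegrityError:
        print("transfer rejected: it would violate the consistency rule")

transfer(30, "alice", "bob")   # succeeds: the aggregate balance stays at 150
transfer(500, "alice", "bob")  # rejected atomically: neither account changes
print(conn.execute("SELECT name, balance FROM accounts").fetchall())
```

If the second transfer were allowed to partially apply, the database would end up in an invalid state; instead the transaction aborts and both accounts keep their previous balances.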
In summary, database consistency focuses on enforcing rules that preserve data correctness and validity throughout transactions.
One of the fundamental concepts frequently addressed in discussions of distributed systems is the consistency component of the well-known CAP theorem, first proposed by Eric Brewer at the University of California, Berkeley, and later formally proven by Seth Gilbert and Nancy Lynch. The theorem has become a staple topic in both academic courses and professional discussions of distributed systems.
In the CAP theorem, "consistency" specifically refers to the uniformity of data across replicas stored on different nodes. Ensuring this kind of consistency in a distributed system is particularly challenging. The theorem describes a fundamental trade-off among three attributes: consistency, availability, and partition tolerance. A distributed system can guarantee at most two of these three attributes at the same time; since network partitions can never be ruled out in practice, the real choice during a partition is between consistency and availability.
Consistency in this context ensures that all replicas of data across different nodes display the same information at any given time. This is crucial for maintaining data integrity, especially in scenarios involving network failures or delays. The theorem illuminates the inherent difficulties in keeping data synchronized across multiple nodes, highlighting the ongoing struggle to balance these three competing demands in the design and maintenance of robust distributed systems.
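As a toy illustration (not modeled on any real system), the sketch below simulates two replicas separated by a network partition. A read from the out-of-date replica can either return possibly stale data, favoring availability, or refuse to answer, favoring consistency; it cannot do both while the partition lasts.

```python
# Toy CAP simulation: two replicas and a broken link between them.
class Replica:
    def __init__(self):
        self.data = {}

primary, secondary = Replica(), Replica()
partitioned = True  # the network link between the replicas is down

def write(key, value):
    primary.data[key] = value
    if not partitioned:
        secondary.data[key] = value  # replication only happens when the link is up

def read_from_secondary(key, prefer_consistency):
    if partitioned and prefer_consistency:
        raise RuntimeError("unavailable: replica cannot prove it is up to date")
    return secondary.data.get(key)  # may be stale while partitioned

write("profile_photo", "new_avatar.png")
print(read_from_secondary("profile_photo", prefer_consistency=False))  # None: stale but available
try:
    read_from_secondary("profile_photo", prefer_consistency=True)
except RuntimeError as err:
    print(err)  # consistent but unavailable during the partition
```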
When discussing streaming systems, the conversation around consistency often diverges from that of databases or distributed systems, reflecting their unique requirements and challenges. In the article “Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka”, the authors describe “exactly-once” semantics as a key aspect of consistency in streaming environments.
In streaming systems, maintaining consistency isn't about replicating data; rather, it focuses on ensuring that every data event is processed exactly once, even in cases of system failure. For instance, consider a financial transaction being processed in real-time. Should the system crash during this process, it is crucial that upon recovery, the system ensures the transaction is not duplicated. This requirement for exactly-once processing is vital for preserving the integrity and consistency of streaming data, ensuring that each transaction is processed accurately and precisely once, regardless of system interruptions.
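The following sketch uses hypothetical names and no particular streaming framework, but it shows the recipe most systems rely on: inputs may be redelivered after a crash (at-least-once delivery), and the sink records each event's identifier atomically with its effect, so replayed events are deduplicated and the observable result is exactly-once.

```python
# Stand-in for a durable store that is updated atomically; in a real system
# the balance and the set of applied event ids would be written in one
# atomic/durable step (e.g., one transaction or one checkpoint).
state = {"balance": 0, "applied_ids": set()}

def apply_transaction(event_id, amount):
    if event_id in state["applied_ids"]:
        return  # replayed event: already applied, skip it
    state["balance"] += amount
    state["applied_ids"].add(event_id)

# tx-2 is delivered twice, simulating a redelivery after a crash.
events = [("tx-1", 100), ("tx-2", -40), ("tx-2", -40)]
for event_id, amount in events:
    apply_transaction(event_id, amount)

print(state["balance"])  # 60: the duplicate delivery of tx-2 had no effect
```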
To highlight the variations in consistency requirements across different systems, consider this table:
| System Type | Consistency Needs | Example |
|---|---|---|
| Databases | Transaction integrity | Bank transactions should always maintain balanced accounts |
| Distributed Systems | Data uniformity across nodes | Profile updates on social media must be consistent everywhere |
| Streaming Systems | Order and processing guarantees | Each financial transaction is processed only once, in real time |
In practice, what users and businesses truly want is reliability and data integrity, which extend beyond the textbook definitions of consistency. They need systems that are not only robust and reliable but also capable of handling real-world complexities. The ultimate measure of success is a system's ability to consistently deliver correct results without errors. Achieving that is not merely about the system “feeling” correct; it must also be theoretically sound and functionally reliable.
Alright, let’s move on to something more intriguing: consistency in distributed streaming databases. If you are not familiar with streaming databases, think of systems like ksqlDB and PipelineDB; essentially, a streaming database is a database purpose-built for stream processing. Interested readers may refer to my previous articles for more details. At RisingWave, we develop a distributed streaming database aimed at reducing the cost and complexity of stream processing.
A distributed streaming database is a database, a distributed system, and a streaming system all at once. So what does its consistency look like? Ideally, a streaming database achieves consistency in all three senses, but this also depends on the specific implementation. It’s not feasible to outline the consistency model of every distributed streaming database here, so let’s focus on RisingWave: what does consistency look like in RisingWave? Let’s explore this across three dimensions.
First, the database dimension: does RisingWave preserve ACID-style consistency, moving only between valid states? The answer is yes. RisingWave ensures that its internal state always transitions from one valid state to another. It is important to note, however, that while RisingWave supports read-only transactions, it does not support read-write transactions across different tables. Therefore, if you need an OLTP database to manage complex transactional workloads, you are better served by solutions such as MySQL, Postgres, CockroachDB, or TiDB. This design decision is influenced by two primary factors:
Moreover, RisingWave is designed to comprehend and process transaction semantics from upstream OLTP databases, which is a pivotal feature for clients in sectors like finance.
Second, the distributed-system dimension: does RisingWave keep data consistent across replicas? The answer is yes. RisingWave can provide high availability across regions. While it does not implement complex consensus protocols such as Paxos or Raft to keep replicas consistent, it leverages S3 for data storage: S3 holds both database tables and the internal state of stream processing, and replicates that data across multiple copies internally to maintain consistency.
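As a rough sketch of this design idea (hypothetical bucket and key names, not RisingWave's actual code), operator state can be checkpointed to S3 with boto3; the object store then handles replication and durability internally, so any node can restore the state after a failure.

```python
import json
import boto3  # assumed to be installed and configured with credentials

s3 = boto3.client("s3")
BUCKET = "example-state-bucket"  # hypothetical bucket name

def persist_checkpoint(epoch: int, operator_state: dict) -> None:
    """Write one stream-processing checkpoint as an immutable object."""
    s3.put_object(
        Bucket=BUCKET,
        Key=f"checkpoints/epoch-{epoch}.json",
        Body=json.dumps(operator_state).encode("utf-8"),
    )

def restore_checkpoint(epoch: int) -> dict:
    """Read the checkpoint back after a node failure; any node can recover it."""
    obj = s3.get_object(Bucket=BUCKET, Key=f"checkpoints/epoch-{epoch}.json")
    return json.loads(obj["Body"].read())
```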
Third, the streaming dimension: does RisingWave provide exactly-once semantics? The answer is yes. RisingWave guarantees exactly-once semantics and handles out-of-order data, so each data event is reflected in the results precisely once, irrespective of interruptions or disruptions in the data stream, maintaining a high level of consistency in streaming data.
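To give a flavor of what out-of-order handling involves (a generic sketch, not RisingWave's implementation), the example below buffers events by event time and only finalizes a tumbling window once the watermark, here the largest event time seen minus an allowed lateness, has passed the end of that window.

```python
from collections import defaultdict

WINDOW = 10           # tumbling window size, in seconds of event time
ALLOWED_LATENESS = 5  # how far out of order events may arrive

buffers = defaultdict(list)  # window start time -> buffered values
max_event_time = 0

def on_event(event_time, value):
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    buffers[(event_time // WINDOW) * WINDOW].append(value)
    watermark = max_event_time - ALLOWED_LATENESS
    for start in sorted(list(buffers)):
        if start + WINDOW <= watermark:  # no more events can join this window
            print(f"window [{start}, {start + WINDOW}): sum={sum(buffers.pop(start))}")

# (4, 2) arrives out of order, but within the lateness bound it still lands
# in the correct window before that window is finalized.
for t, v in [(1, 5), (12, 3), (4, 2), (17, 1)]:
    on_event(t, v)
```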
Summary
Consistency in data systems varies significantly across databases, distributed systems, and streaming systems:

- In databases, consistency means every transaction preserves the correctness and validity of the data (the C in ACID).
- In distributed systems, consistency means replicas on different nodes stay in agreement (the C in CAP).
- In streaming systems, consistency means every event is processed exactly once, even across failures.
RisingWave effectively meets these diverse needs with a robust, adaptable distributed streaming database. Our approach not only adheres to theoretical consistency standards but also excels in real-world applications, making RisingWave a reliable solution in the evolving landscape of data consistency.
Yingjun Wu
Founder and CEO at RisingWave Labs