How does a distributed system both tolerate network partition and achieve consistency?

Question

In the CAP theorem, there are only three possible cases:

C and A without P
C and P without A
A and P without C

How can we have both P and C in the second case?

Doesn't propagation of update from one replica to the other require a communication path between the two replicas?

If network partition happens between two replicas, isn't it impossible to achieve consistency?
If the replicas can achieve consistency, how can network partition between the replicas happen?

The current answer says that in presence of network partition between the replica nodes, the system can stay in consistency, by not responding.

The data stored in the replicas are still not in consistency. Does Consistency mean consistency between the data stored in the replicas, or between the responses from the replica nodes?
How does a replica node know whether network partition happens or the other replica nodes are crashed? In the latter case, it shouldn't stop responding.

Thanks.

score 4 · Answer 1 · edited Dec 29 '19 at 12:28

4

The CAP Theorem says that you can only achieve at maximum two out of the three properties of Consistency (every read receives the most recent write or an error), Availability (every read receives a write, but not necessarily the most recent one), and Partition Tolerance (the system keeps responding when an arbitrary amount of messages between the nodes are dropped or delayed). So, according to the CAP Theorem, if you want to achieve Consistency and Partition Tolerance, you must give up Availability.

Or, to put it simply: if you detect a partition, you simply stop responding to requests! (More precisely, you reply with an error.) If you never respond with an answer, you can never respond with an outdated answer.

if network partition happens between two replica machines, isn't it impossible to achieve consistency?

Yes, it is. But, remember, the CAP Theorem says that in this case you lose Availability. So, you just respond with an error, and by not responding with an answer, you also don't respond with an outdated answer.

Remember, the definition of Consistency is (bold emphasis mine): every read receives the most recent write or an error.

So, the request erroring out satisfies the definition of Consistency.

How can we have both P and C in the second case?

By giving up A.

That is what the CAP Theorem is all about! You have to make a trade-off. If you want your system to be Available and Consistent, then you cannot tolerate network errors. However, since network errors are a fact of life, you have to tolerate them. But then it is impossible to always reply with the most recent write, because your nodes will get out of sync when the network fails. So, you can either always respond but risk responding with outdated data, or you can try to always respond with the latest data, but then you have to accept that you sometimes cannot respond at all.

It is trivially possible to achieve two out of three properties:

CA: Since we assume no partitions here, we can use a centralized datastore.
CP: Always respond with an error.
AP: Always respond with the initial value (ignoring all writes).

What the CAP Theorem proves is that it is impossible to achieve all three.

edited Dec 29 '19 at 12:28

Bart van Ingen Schenau

71,712
20
110
179

answered Dec 28 '19 at 20:07

Jörg W Mittag

101,921
24
218
318

How does a replica node know whether network partition happens or the other replica nodes are crashed? In the latter case, it shouldn't stop responding. – Tim Dec 28 '19 at 21:41
The CAP Theorem doesn't say anything about crashing, so crashing is irrelevant to the CAP Theorem. – Jörg W Mittag Dec 28 '19 at 22:15
That is not true. When a system has availability as "A" in CAP, it tolerates crashing of individual server processes. That belongs to the cases of either "A and P without C" or A and C without P" – Tim Dec 28 '19 at 22:16
Can you point to the specific paragraph or lemma that says that? If the CAP Theorem says that it tolerates crashing of individual server processes, it will also contain a precise definition of what "crashing" means. – Jörg W Mittag Dec 28 '19 at 22:18
a replica node crashing means it stops working, no message or response sent out. – Tim Dec 28 '19 at 22:20
Which lemma says that? – Jörg W Mittag Dec 28 '19 at 22:20
Definition of availability refererenced in the CAP theorem – Tim Dec 28 '19 at 22:20
Could you cite the precise paragraph, please? It would make it much easier to respond if I knew which *precise* paragraph, section, subsection, or lemma you were talking about. In particular, to make sure that we are talking about the same definition of "crashing". – Jörg W Mittag Dec 28 '19 at 22:22
a node crashing means it is down – Tim Dec 28 '19 at 22:23
Are you talking about Brewer's Conjecture or about Gilbert's and Lynch's paper? – Jörg W Mittag Dec 28 '19 at 22:24
Yes. If a node is down i.e. crashes, it can't respond. So that affects availability unless the system has other replicated nodes to respond to any request – Tim Dec 28 '19 at 22:26
It would be helpful if you could *at least* say which paper you are talking about, so I know where to look. – Jörg W Mittag Dec 28 '19 at 22:31
2

Brewer and Gilbert / Lynch use slightly different definitions of "Availability", so if you don't say which of the two definitions you are talking about, this entire discussion is meaningless. For *my* answer, the difference doesn't matter, but I have the feeling that you are making an important distinction between the two, so it would really, really, really, really, really, really, really, really, really help if you wouldn't keep me guessing, but simply state which of the two you are talking about. – Jörg W Mittag Dec 28 '19 at 22:39
I am using the definitions in the paper – Tim Dec 28 '19 at 22:42
1

Which of the two papers? Gilbert / Lynch or Brewer? The definitions are subtly different. – Jörg W Mittag Dec 28 '19 at 22:43
I am not aware there is difference. I use the definitions you gave in the first paragraph of your answer – Tim Dec 28 '19 at 22:45
The definition by Gilbert / Lynch says (**bold** emphasis mine) "For a distributed system to be continuously available, every request received **by a non-failing node** must result in a response." This explicitly says that a failing node *does not* make the system un-available. The definition by Brewer says (**bold** emphasis mine) "For a distributed system to be continuously available, almost all requests received **by a non-failing node** must result in a response." This explicitly says that a failing node *does not* make the system un-available. – Jörg W Mittag Dec 28 '19 at 22:45
However, *you* say: "If a node is down i.e. crashes, it can't respond. So that affects availability [...]" So, it seems you use a different definition of Availability than either Brewer or Gilbert / Lynch, which is why I repeat my question: *please* cite the *precise* paragraph, section, subsection, Lemma, or whatever you are referring to, so that we can have a meaningful discussion. – Jörg W Mittag Dec 28 '19 at 22:47
So both say that a failing node does not make the system un-available. What is the difference? I am not referring to other papers. If you need me to choose, I will stick to Gilber/Lynch's paper, since it gives a proof and thus requires accurate definition. My question still remains, as in the first comment. – Tim Dec 28 '19 at 22:47
I don't understand the question. I already said that *for my argument*, there is no difference between the two. However, you seem to be referring to a *third* definition which *does* include failing nodes, and without knowing where I can look up the definition that you are using, it is impossible to have a meaningful discussion. – Jörg W Mittag Dec 28 '19 at 22:49
"Which of the two papers? Gilbert / Lynch or Brewer? The definitions are subtly different. " You mentioned they are different, and asked me which of the two papers I refered to. Now you said there is no difference. ...? None of these is leading to addressing my question which still stands in my first comment. – Tim Dec 28 '19 at 22:50
"Now you said there is no difference." - I never said that. Not once. Please, don't put words in my mouth. Three times I said that there *is* a difference (but this difference is inconsequential to my argument). Never did I say there is no difference. – Jörg W Mittag Dec 28 '19 at 23:00
"None of these is leading to addressing my question which still stands in my first comment." - The CAP Theorem doesn't say anything about crashing, so crashing is irrelevant to the CAP Theorem. – Jörg W Mittag Dec 28 '19 at 23:00
Nothing is put into your mouth. "Which of the two papers? Gilbert / Lynch or Brewer? The definitions are subtly different. " You mentioned they are different, and asked me which of the two papers I refered to. Then you wrote "I already said that for my argument, there is no difference between the two. " – Tim Dec 28 '19 at 23:04
Let me reiterate my question stated in my first comment. In case of network partition, you reply says the nodes stop responding. My question is how a node detects network partition happens, because it could be other nodes are down. When the other nodes are down, the node should respond not stop responding. – Tim Dec 28 '19 at 23:06
Yes, exactly. I said that they are subtly different and that the difference is irrelevant to *my argument*. However, I don't fully understand your argument yet, so I can't tell whether the difference is relevant to *your* argument, which is why knowing that you are referring to the stronger definition by Gilbert / Lynch and not the weaker definition by Brewer is important. – Jörg W Mittag Dec 28 '19 at 23:06
I am still having trouble understanding your argument, since you said "When a system has availability as "A" in CAP, it tolerates crashing of individual server processes." and that this is stated in "Definition of availability refererenced in the CAP theorem". However, you also say that "So both say that a failing node does not make the system un-available." I don't quite understand how those two statements do not contradict each other, unless by "Definition of availability refererenced in the CAP theorem" you mean a third definition. – Jörg W Mittag Dec 28 '19 at 23:09
Could you come back to my question in my first comment and reiterated in my last comment? – Tim Dec 28 '19 at 23:10
The CAP Theorem doesn't say anything about nodes being down, so nodes being down is irrelevant to the CAP Theorem. – Jörg W Mittag Dec 28 '19 at 23:15
Nodes can fail i.e. crash in CAP. You wrote: The definition by Gilbert / Lynch says (bold emphasis mine) "For a distributed system to be continuously available, every request received by a non-failing node must result in a response." This explicitly says that a failing node does not make the system un-available. The definition by Brewer says (bold emphasis mine) "For a distributed system to be continuously available, almost all requests received by a non-failing node must result in a response." This explicitly says that a **failing** node does not make the system un-available. – Tim Dec 29 '19 at 00:15

score 2 · Answer 2 · answered Jan 03 '20 at 00:28

How can we have both P and C in the second case?

I will answer with a common real-world example.

Common CP system in AWS cloud

Consider a distributed system made up of parts deployed to 3 datacenters (e.g. 3 AWS Availability Zones within a Region).

Now introduce a network partition (P) - e.g. the nodes in one datacenter is unreachable.

Use a consensus algorithm e.g. Raft to achieve consistency in the system (C). With Raft >50% of the nodes most be reachable, so with 2 of 3 datacenters reachable we are fine.

Voila - we have a CP system.

Critisism of CAP

While we are discussing CAP, it worth to mention recent critisism. CAP theorem is meant to be about trade-offs when designing a distributed system. But in a distributed system network partitions is never a trade-off - there is always chance for network partitions in any distributed system.

Source: A Critique of the CAP Theorem, Martin Kleppmann

Martin Kleppmann has written a highly recommended book about distributed systems: Designing Data-Intensive Applications.

As he suggests in his book, CAP Theorem is not something important anymore, there are other more well-definied properties about distributed systems that we should focus on instead.

score 0 · Answer 3 · answered Mar 03 '22 at 21:46

I am going to add another perspective.

CAP: if partitioning is happening, then the system may be either available or consistent. The million dollar question is what is partitioning?

Let's say I have a raft/paxos based system with three nodes. For a new message to be accepted, the message has to be accepted by at least two nodes. In fact, the system tell the customer that the operation succeeded after two nodes are confirmed. The third node will be updated as well, but you need just two to confirm the operation.

Let's say your network became partitioned - two nodes are in partition A and the third node is in partition B.

From CAP point of view, your system as a whole is not partitioned! Since your system as a whole needs two nodes to be available and partition A has them - your system as a whole still available and consistent.

Bottom line - for whatever system you analyze, you have to define what partitioning means.

How does a distributed system both tolerate network partition and achieve consistency?

3 Answers3

Common CP system in AWS cloud

Critisism of CAP