Why is it considered hard to maintain strong consistency in a distributed system?

Question

Why can't you just use strongly consistent reads for all your DB reads, with retries on 500 responses? According to CAP theorem increasing consistency should probably lower availability, but can't the decreased availability (increased 500 responses) be handled fairly easily using retries? (assuming you are fine with a small percentage of queries taking a bit longer due to retries)

Using DynamoDB as an example, but this can be generalized to any noSQL cloud offering - It also seems like DDB with on demand scaling will simply increase your read capacity units (RCU) used if you turn on strong consistency, incurring a higher cost ($) but keeping the same latency on db queries, so it seems like the negative is only higher cost. It seems like you can just keep vertical scaling the DB's processing power to meet your needs. Is it actually plausible that with a noSQL cloud database with a high traffic level, you cannot just throw enough money at it, and it could hit some scaling limit to make strongly consistent reads slower?

And then with regard to the entire question generalized to distributed systems, what does it actually mean for a distributed system to be 'strongly consistent.' I've heard this used to describe systems before but I don't actually know it means beyond 'all DB interactions being strongly consistent.'

My second question might be more basic level but necessary to understand the cost of providing consistency, but why do consistent read issues actually occur, ie. why does stale data occur (in single queries like read/write, not transactions with multiple reads/writes per transaction)?

From what I understand, with decreasing probability any time after a write occurs, it's possible for one reader to read correct data, then a second reader to read stale data AFTER the first reader reads correct data (correct me if this isn't actually true, but my understanding was it is). Why does this happen? Doesn't a read just involve a read from some location on the disk?

score 4 · Answer 1 · answered Jan 18 '21 at 05:23

Retries aren’t guaranteed to succeed. And while your system is retrying, it isn’t available. For just reads, that is only a small problem. What happens when writes fail? “Just retry” doesn’t work because you need to keep the ordering proper for consistency. The entire system is unavailable until the write succeeds (so later reads can get the correct results).

As for your second question: no, reads do not involve reading from the same location on disk. In order to provide availability, distributed systems will replicate data across multiple disks (usually different machines, usually in entirely different data centers/regions). A stale read will happen when the write was successful at one node, but not yet replicated to the node the read is happening at. You could wait until all of the writes are successful (improve consistency), but that comes with a CAP theorem tradeoff.

user3153970 · Answer 2 · 2021-01-18T23:06:18.363

Why can't you just use strongly consistent reads for all your DB reads, with retries on 500 responses?

As you mentioned later, if you have stale data, you don't get 500 responses.

so it seems like the negative is only higher cost

It's higher cost but also higher complexity because a read/write lock needs to be implemented.

To further enhance the locking mechanism, the distribution of keys is maintained as such so the reads are always re-directed towards the consistent server(s). And if replication fails, there is the added complex layer of rollovers. This ensures that data is consistent too.

why does stale data occur

Without strong consistency, if the read request is directed to a replica server that the master write has not populated to, the out-of-date data is returned.

Why is it considered hard to maintain strong consistency in a distributed system?

2 Answers2