What is the difference between masking and tolerating failures?

Question

Distributed Systems (5th ed.) by Coulouris says on pp. 21-22:

1.5.5 Failure handling

Detecting failures: Some failures can be detected. For example, checksums can be used to detect corrupted data in a message or a file. Chapter 2 explains that it is difficult or even impossible to detect some other failures, such as a remote crashed server in the Internet. The challenge is to manage in the presence of failures that cannot be detected but may be suspected.

Masking failures: Some failures that have been detected can be hidden or made less severe. Two examples of hiding failures:

Messages can be retransmitted when they fail to arrive.

File data can be written to a pair of disks so that if one is corrupted, the other may still be correct.

Just dropping a message that is corrupted is an example of making a fault less severe – it could be retransmitted. The reader will probably realize that the techniques described for hiding failures are not guaranteed to work in the worst cases; for example, the data on the second disk may be corrupted too, or the message may not get through in a reasonable time however often it is retransmitted.

Tolerating failures: Most of the services in the Internet do exhibit failures – it would not be practical for them to attempt to detect and hide all of the failures that might occur in such a large network with so many components. Their clients can be designed to tolerate failures, which generally involves the users tolerating them as well. For example, when a web browser cannot contact a web server, it does not make the user wait for ever while it keeps on trying – it informs the user about the problem, leaving them free to try again later. Services that tolerate failures are discussed in the paragraph on redundancy below.

Recovery from failures: Recovery involves the design of software so that the state of permanent data can be recovered or ‘rolled back’ after a server has crashed. In general, the computations performed by some programs will be incomplete when a fault occurs, and the permanent data that they update (files and other material stored in permanent storage) may not be in a consistent state. Recovery is described in Chapter 17.

Redundancy: Services can be made to tolerate failures by the use of redundant components. Consider the following examples:

There should always be at least two different routes between any two routers in the Internet.

In the Domain Name System, every name table is replicated in at least two different servers.

A database may be replicated in several servers to ensure that the data remains accessible after the failure of any single server; the servers can be designed to detect faults in their peers; when a fault is detected in one server, clients are redirected to the remaining servers.

Can both masking and tolerating be achieved through redundancy? (The quote seems to say so. If so, what is the difference between them?)

Do they both need to perform recovery from failures?

I am also wondering what the difference and the relationship are between fault tolerance and (high) availability.

I wouldn't put too much stock in the exact definitions of these words. By the author's own definitions, the numbered items in the Redundancy section could be characterized as "masking," not tolerating. — Robert Harvey, Dec 24 '19 at 20:35

score 1 · Answer 1 · answered Dec 24 '19 at 20:37

Maybe a comparison could help you understand the difference. Imagine you're visiting an e-commerce website. You find a product you want to buy and you click on the “Add to cart” button.

Under the hood, the browser sends an HTTP POST request which is processed by the reverse proxy and sent to an application server which may want to call other services and make some calls to a database.

Imagine that something fails during this task. Either the remote service hasn't replied to the application server, or the application server was disconnected while processing your request, or the database change was not committed successfully.

Masking failures, here, means that the workflow will, by design, try to solve the issue. For instance, the application server could query the faulty service again or make another request to the database, or the application running in your browser can do another HTTP POST if the first one failed.

Tolerating failures would mean that you, as a user, will simply see an error message telling you that, oh, sorry, we couldn't do what you asked, and could you please do it again? No worries, you can simply click the button a second time.
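The contrast can be sketched in a few lines of Python. This is only an illustration under assumptions: `ServiceUnavailable`, `call_with_masking`, `add_to_cart` and `make_flaky` are hypothetical names, and the fake backend stands in for the application server or database from the example above.

```python
class ServiceUnavailable(Exception):
    """Raised by the hypothetical backend call when a request fails."""


def call_with_masking(operation, attempts=3):
    """Masking: retry quietly so the caller never sees a transient failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except ServiceUnavailable:
            # Hide the failure and try again, unless this was the last attempt.
            if attempt == attempts - 1:
                raise


def add_to_cart(operation):
    """Tolerating: when masking runs out, tell the user instead of hanging."""
    try:
        call_with_masking(operation)
        return "Added to cart."
    except ServiceUnavailable:
        return "Sorry, we couldn't add the item. Please try again."


def make_flaky(failures):
    """Build a fake backend call that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ServiceUnavailable("backend did not reply")
        return "ok"

    return operation


print(add_to_cart(make_flaky(2)))  # two failures are masked by the retries
print(add_to_cart(make_flaky(5)))  # retries are exhausted, so the user is told
```

In the first call the user never learns that anything went wrong; in the second, the failure is surfaced and the user is free to try again later.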

Neither has anything to do with redundancy. Redundancy is a very different subject: you can experience failures with or without redundancy, and you can mask or tolerate failures with or without redundancy.

Recovery from failures applies to both situations. In the example above, you don't want to add the product to the cart twice when the user asked to add it once. Doing so would be a bad user experience.
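One common way to keep a repeated request from adding the product twice is to tag each click with a request ID and have the server ignore IDs it has already applied. The sketch below is an assumption for illustration, not something the textbook prescribes; the `Cart` class and its in-memory set are hypothetical stand-ins for the real application server and database.

```python
import uuid


class Cart:
    """Toy in-memory cart that ignores requests it has already applied."""

    def __init__(self):
        self.items = []
        self._seen_request_ids = set()

    def add(self, request_id, product):
        # A repeated request_id means this is a retry of an earlier attempt.
        if request_id in self._seen_request_ids:
            return False
        self._seen_request_ids.add(request_id)
        self.items.append(product)
        return True


cart = Cart()
request_id = str(uuid.uuid4())  # generated once per click, reused on retries

cart.add(request_id, "book")  # first attempt; suppose the reply was lost
cart.add(request_id, "book")  # the retry of the same click changes nothing
print(cart.items)  # ['book']
```

A second, deliberate click would get a fresh ID and so would add the product again, which is what the user would expect.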

Are you going by this textbook's definitions? Because they already seem to be ambiguous. — Robert Harvey, Dec 24 '19 at 20:38
@RobertHarvey: I do. It's strange, the definitions quoted in the question don't seem that ambiguous; actually, I found them rather clear, and I supposed that an extra example would be enough for the OP to understand them. — Arseni Mourzenko, Dec 24 '19 at 20:41
Case in point: *"File data can be written to a pair of disks"* -- **Masking.** *"There should always be at least two different routes between any two routers in the Internet"* -- **Tolerance.** — Robert Harvey, Dec 24 '19 at 20:43
"you, as a user, will simply see a error message". From my understanding this might be the case anyway, depending of course on what "user" means. As a user, I do not want to get an incident report for every failure of Google's drives. But in a broader sense, the reporting is orthogonal to both concepts: someone has to be informed, but not everybody. — Thomas Junk, Dec 25 '19 at 08:16

Thomas Junk · Answer 2 · 2019-12-25T08:17:48.430

From what I understand, the two differ in the level of abstraction involved:

"Masked" here means that lower levels hide failures transparently from the higher levels of the system. A failure at a lower level should be dealt with at that level. HDDs are a simple example: individual sectors may fail without you having to worry about it immediately.
"Tolerance" here means dealing with a failure at the same level where it occurs. In the HDD example, that would be the drive's embedded system taking measures when an incident happens.

Both concepts add up to the resilience of a system: the ability to recover from failure.

The fact that your hard drive crashed shouldn't stop your system from "working". It is an "event" like any other and should be known to the right people.

Another example of tolerance:

Say the job is to deliver a message. The naive solution is to attempt delivery right away and, if that doesn't work, to stop trying.

A more elaborate version would retry later. That is a kind of fault tolerance, in the sense that you assume things might go wrong on the other side.

Then there is the possibility that not only the recipient's system fails but the sender's too. To deal with that, you could add redundancy on the sender's side to minimize failure.

This makes clear that being fault tolerant does not necessarily require redundancy, but adding redundancy makes a system tolerant of a wider range of faults.
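The three variants of message delivery can be sketched roughly as follows. The function names are hypothetical, a "sender" is assumed to be any callable that raises `ConnectionError` on failure, and a plain list stands in for a working delivery channel.

```python
import time


def deliver_naive(send, message):
    """Try once and give up on the first failure."""
    try:
        send(message)
        return True
    except ConnectionError:
        return False


def deliver_with_retry(send, message, attempts=3, delay_seconds=0.0):
    """Tolerate a flaky recipient by trying again later."""
    for attempt in range(attempts):
        if deliver_naive(send, message):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait before the next attempt
    return False


def deliver_redundant(senders, message, attempts=3):
    """Also tolerate a broken sender by falling back to the next one."""
    return any(deliver_with_retry(send, message, attempts) for send in senders)


def broken_sender(message):
    """Stand-in for a sender whose own system has failed."""
    raise ConnectionError("sender is down")


outbox = []  # stand-in for a healthy sender; delivered messages land here

print(deliver_with_retry(broken_sender, "hello"))                  # False
print(deliver_redundant([broken_sender, outbox.append], "hello"))  # True
```

Retrying alone cannot help when the sender itself is down; the redundant variant covers that extra class of faults, which is the point made above.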

Reporting of errors is orthogonal to both concepts.
