Distributed Systems (5th ed.) by Coulouris et al. says on pp. 21-22:
1.5.5 Failure handling
Detecting failures: Some failures can be detected. For example, checksums can be used to detect corrupted data in a message or a file. Chapter 2 explains that it is difficult or even impossible to detect some other failures, such as a remote crashed server in the Internet. The challenge is to manage in the presence of failures that cannot be detected but may be suspected.
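As a concrete illustration of the checksum example (my own sketch, not from the book): the sender appends a CRC-32 to the payload and the receiver recomputes it. The 4-byte trailer layout and the names `frame` and `is_intact` are assumptions made for this example only.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (hypothetical framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(message: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the trailing 4 bytes."""
    if len(message) < 4:
        return False
    payload, received = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


msg = frame(b"hello")
# Simulate corruption in transit by flipping one bit of the first byte.
corrupted = bytes([msg[0] ^ 0x01]) + msg[1:]
```

Here `is_intact(msg)` holds and `is_intact(corrupted)` does not, so the corruption is detected. Nothing comparable tells the receiver that a remote server has crashed; it can only suspect this after a period of silence.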
Masking failures: Some failures that have been detected can be hidden or made less severe. Two examples of hiding failures:
Messages can be retransmitted when they fail to arrive.
File data can be written to a pair of disks so that if one is corrupted, the other may still be correct.
Just dropping a message that is corrupted is an example of making a fault less severe – it could be retransmitted. The reader will probably realize that the techniques described for hiding failures are not guaranteed to work in the worst cases; for example, the data on the second disk may be corrupted too, or the message may not get through in a reasonable time however often it is retransmitted.
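To check my understanding of masking by retransmission, here is a minimal sketch (not from the book). The channel is simulated in memory; `make_lossy_channel`, `send_with_retries`, and the `ACK:` reply format are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional


def make_lossy_channel(drops: int) -> Callable[[bytes], Optional[bytes]]:
    """Return a simulated channel that loses the first `drops` messages."""
    remaining = drops

    def send_once(message: bytes) -> Optional[bytes]:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return None  # message lost, no acknowledgement arrives
        return b"ACK:" + message

    return send_once


def send_with_retries(
    send_once: Callable[[bytes], Optional[bytes]],
    message: bytes,
    max_attempts: int = 3,
) -> bytes:
    """Retransmit until acknowledged; the caller never sees the lost copies."""
    for _ in range(max_attempts):
        reply = send_once(message)
        if reply is not None:
            return reply
    # Worst case: masking fails and the failure becomes visible to the caller.
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


reply = send_with_retries(make_lossy_channel(2), b"m")
```

With two lost messages the caller simply receives the reply, so the failure is hidden. With more losses than `max_attempts`, the `TimeoutError` surfaces, which matches the caveat that hiding is not guaranteed to work in the worst cases.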
Tolerating failures: Most of the services in the Internet do exhibit failures – it would not be practical for them to attempt to detect and hide all of the failures that might occur in such a large network with so many components. Their clients can be designed to tolerate failures, which generally involves the users tolerating them as well. For example, when a web browser cannot contact a web server, it does not make the user wait for ever while it keeps on trying – it informs the user about the problem, leaving them free to try again later. Services that tolerate failures are discussed in the paragraph on redundancy below.
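The browser example could be sketched roughly as follows (my own illustration, not from the book). The `fetch` argument is a hypothetical stand-in for a real network call, so that the example runs without a network.

```python
from typing import Callable


def load_page(fetch: Callable[[str], str], url: str) -> str:
    """Try once; on failure, report the problem instead of retrying for ever."""
    try:
        return fetch(url)
    except (TimeoutError, ConnectionError) as exc:
        # The failure is not hidden: the user is told and decides what to do.
        return f"Cannot reach {url}: {exc}. Please try again later."


def unreachable(url: str) -> str:
    """Simulated server that cannot be contacted."""
    raise ConnectionError("connection refused")


def healthy(url: str) -> str:
    """Simulated server that answers normally."""
    return "<html>ok</html>"


message = load_page(unreachable, "http://example.invalid")
page = load_page(healthy, "http://example.invalid")
```

In contrast to retransmission, the failure here is exposed to the user rather than hidden; the client merely stays usable.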
Recovery from failures: Recovery involves the design of software so that the state of permanent data can be recovered or ‘rolled back’ after a server has crashed. In general, the computations performed by some programs will be incomplete when a fault occurs, and the permanent data that they update (files and other material stored in permanent storage) may not be in a consistent state. Recovery is described in Chapter 17.
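One common way to keep permanent data consistent across a crash is to write the new version to a temporary file and then rename it atomically over the old one. The sketch below is my own illustration and is not necessarily the technique Chapter 17 describes; `save_atomically` and the `crash_before_commit` flag are hypothetical and exist only to simulate a crash.

```python
import json
import os
import tempfile


def save_atomically(path: str, data: dict, crash_before_commit: bool = False) -> None:
    """Write `data` to a temp file, then atomically replace `path` with it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(data, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # push the new version to stable storage
    if crash_before_commit:
        # Simulated crash: the temp file is left behind, `path` is untouched.
        raise RuntimeError("simulated crash before commit")
    os.replace(tmp_path, path)  # the commit point


def load(path: str) -> dict:
    """Read back the last committed version."""
    with open(path) as f:
        return json.load(f)


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "balance.json")
save_atomically(path, {"balance": 100})
try:
    save_atomically(path, {"balance": 50}, crash_before_commit=True)
except RuntimeError:
    pass  # after the "crash", the file still holds the old consistent state
```

After the simulated crash, `load(path)` still returns the last committed state, so the incomplete update has effectively been rolled back.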
Redundancy: Services can be made to tolerate failures by the use of redundant components. Consider the following examples:
There should always be at least two different routes between any two routers in the Internet.
In the Domain Name System, every name table is replicated in at least two different servers.
A database may be replicated in several servers to ensure that the data remains accessible after the failure of any single server; the servers can be designed to detect faults in their peers; when a fault is detected in one server, clients are redirected to the remaining servers.
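The replicated-database example might look roughly like this in code (my own sketch, not from the book). `Replica` and `read_with_failover` are hypothetical names, and the replicas are simulated in memory rather than being real servers.

```python
from typing import Dict, List


class Replica:
    """Simulated server holding a full copy of the data."""

    def __init__(self, name: str, data: Dict[str, str]) -> None:
        self.name = name
        self.data = dict(data)
        self.alive = True

    def read(self, key: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


def read_with_failover(replicas: List[Replica], key: str) -> str:
    """Redirect the request to the next replica when one has failed."""
    for replica in replicas:
        try:
            return replica.read(key)
        except ConnectionError:
            continue  # fault detected, try the remaining servers
    raise ConnectionError("all replicas are down")


data = {"user:1": "alice"}
replicas = [Replica("db-a", data), Replica("db-b", data)]
replicas[0].alive = False  # a single server fails
value = read_with_failover(replicas, "user:1")
```

The read still succeeds after one server fails, so the data remains accessible. If every replica were down, the error would reach the client, so redundancy only covers the failures it was provisioned for.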
What is the difference between masking and tolerating failures?
Can both be achieved through redundancy? (The quote seems to say so. If so, what are the differences between them?)
Do they both need to perform recovery from failures?
I am also wondering what the difference and the relationship are between fault tolerance and (high) availability.