From Coulouris' Distributed Systems 5ed
Chapter 18 Replication
18.1 Introduction
Increased availability: Users require services to be highly available. That is, the proportion of time for which a service is accessible with reasonable response times should be close to 100%. Apart from delays due to pessimistic concurrency control conflicts (due to data locking), the factors that are relevant to high availability are:
- server failures;
- network partitions and disconnected operation (communication disconnections that are often unplanned and are a side effect of user mobility).
To take the first of these, replication is a technique for automatically maintaining the availability of data despite server failures. If data are replicated at two or more failure-independent servers, then client software may be able to access data at an alternative server should the default server fail or become unreachable.
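To make the failover idea concrete, here is a minimal sketch (not from the book) of a client that tries an alternative replica when the default server is unreachable; `fetch(server, key)` is an assumed transport call that raises `ConnectionError` on failure:

```python
def read(key, replicas, fetch):
    """Try each failure-independent replica in turn and return the
    first successful response.  `fetch(server, key)` is a hypothetical
    transport call that raises ConnectionError when `server` has
    failed or is unreachable."""
    last_error = None
    for server in replicas:
        try:
            return fetch(server, key)
        except ConnectionError as e:
            last_error = e  # default server failed; try an alternative
    raise last_error        # no replica reachable: service unavailable

# Example: the first replica is down, so the client falls back to the second.
def fetch(server, key):
    if server == "s1":
        raise ConnectionError("s1 is down")
    return key.upper()

print(read("x", ["s1", "s2"], fetch))  # -> X
```

Availability improves because the service is lost only if *all* replicas are simultaneously unreachable.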
Network partitions (see Section 15.1) and disconnected operation are the second factor that militate against high availability.
(Minor note: the singular subject is "the second factor", so "militates" is the agreeing form.)
Fault tolerance: Highly available data is not necessarily strictly correct data. It may be out of date, for example; or two users on opposite sides of a network partition may make updates that conflict and need to be resolved. A fault-tolerant service, by contrast, always guarantees strictly correct behaviour despite a certain number and type of faults. The correctness concerns the freshness of data supplied to the client and the effects of the client’s operations upon the data. Correctness sometimes also concerns the timeliness of the service’s responses – such as, for example, in the case of a system for air traffic control, where correct data are needed on short timescales.
The same basic technique used for high availability – that of replicating data and functionality between computers – is also applicable for achieving fault tolerance. If up to f of f + 1 servers crash, then in principle at least one remains to supply the service. And if up to f servers can exhibit Byzantine failures, then in principle a group of 2f + 1 servers can provide a correct service, by having the correct servers outvote the failed servers (who may supply spurious values). But fault tolerance is subtler than this simple description makes it seem. The system must manage the coordination of its components precisely to maintain the correctness guarantees in the face of failures, which may occur at any time.
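The 2f + 1 figure can be illustrated with a small sketch (my own, not the book's): a client collects replies from all replicas and accepts a value only if at least f + 1 replicas report it, since any such value must come from at least one correct replica.

```python
from collections import Counter

def majority_vote(replies, f):
    """Mask up to f Byzantine (arbitrary) failures by voting over the
    replies of 2f + 1 replicas.  A value reported by at least f + 1
    replicas is backed by at least one correct replica, so it is safe."""
    assert len(replies) >= 2 * f + 1, "need replies from at least 2f + 1 replicas"
    value, count = Counter(replies).most_common(1)[0]
    if count >= f + 1:
        return value
    raise RuntimeError("no value confirmed by f + 1 replicas")

# One faulty replica (f = 1) among 2f + 1 = 3 cannot outvote the correct ones:
print(majority_vote(["42", "42", "bogus"], f=1))  # -> 42
```

With only 2f replicas, f faulty ones could tie the vote, which is why 2f + 1 is the minimum for outvoting Byzantine failures.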
and, later in the same chapter:
18.3 Fault-tolerant services
(This section describes approaches to achieving fault tolerance. It introduces the correctness criteria of linearizability and sequential consistency, then explores two approaches: passive (primary-backup) replication, in which clients communicate with a distinguished replica; and active replication, in which clients communicate by multicast with all replicas.)
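As a rough sketch of the passive (primary-backup) scheme described above (simplified from the book's full protocol, which also handles view changes and failure detection): the primary executes each client request and pushes the resulting state update to the backups before replying, so a backup can take over with up-to-date state if the primary crashes.

```python
# Simplified sketch of passive (primary-backup) replication.
# Clients talk only to the distinguished primary replica.
class Replica:
    def __init__(self):
        self.state = {}

    def apply_update(self, key, value):      # installed by the primary
        self.state[key] = value

class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def handle(self, key, value):
        self.apply_update(key, value)        # 1. execute the request
        for b in self.backups:               # 2. propagate the state update
            b.apply_update(key, value)       #    (state, not the operation)
        return "ok"                          # 3. reply only after backups are updated

backups = [Replica(), Replica()]
primary = Primary(backups)
primary.handle("x", 1)
assert all(b.state["x"] == 1 for b in backups)  # any backup can now take over
```

In active replication, by contrast, the client would multicast the operation itself to all replicas, each of which executes it independently.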
18.4 Case studies of highly available services: The gossip architecture, Bayou and Coda
(Case studies of three systems for highly available services are considered. In the gossip and Bayou architectures, updates are propagated lazily between replicas of shared data. In Bayou, the technique of operational transformation is used to enforce consistency. Coda is an example of a highly available file service.)
I am trying to understand the difference and relation between fault tolerance and (high) availability.
Does fault tolerance imply high availability? I.e. if a service is fault-tolerant, is it necessarily highly available?
From what I read (which may not be correct), Section 18.3 on fault-tolerant services assumes there are no network partitions between replicas and achieves linearizability, while the systems in Section 18.4 on highly available services can tolerate network partitions between replicas but provide only weak consistency. Is that true?
I am also wondering what the difference and relation are between tolerating failures and masking failures.
Thanks.