
From Coulouris' Distributed Systems 5ed

Chapter 18 Replication

18.1 Introduction

Increased availability: Users require services to be highly available. That is, the proportion of time for which a service is accessible with reasonable response times should be close to 100%. Apart from delays due to pessimistic concurrency control conflicts (due to data locking), the factors that are relevant to high availability are:

  • server failures;
  • network partitions and disconnected operation (communication disconnections that are often unplanned and are a side effect of user mobility).

To take the first of these, replication is a technique for automatically maintaining the availability of data despite server failures. If data are replicated at two or more failure-independent servers, then client software may be able to access data at an alternative server should the default server fail or become unreachable.

Network partitions (see Section 15.1) and disconnected operation are the second factor that militates against high availability.

Fault tolerance: Highly available data is not necessarily strictly correct data. It may be out of date, for example; or two users on opposite sides of a network partition may make updates that conflict and need to be resolved. A fault-tolerant service, by contrast, always guarantees strictly correct behaviour despite a certain number and type of faults. The correctness concerns the freshness of data supplied to the client and the effects of the client’s operations upon the data. Correctness sometimes also concerns the timeliness of the service’s responses – such as, for example, in the case of a system for air traffic control, where correct data are needed on short timescales.

The same basic technique used for high availability – that of replicating data and functionality between computers – is also applicable for achieving fault tolerance. If up to f of f + 1 servers crash, then in principle at least one remains to supply the service. And if up to f servers can exhibit Byzantine failures, then in principle a group of 2f + 1 servers can provide a correct service, by having the correct servers outvote the failed servers (who may supply spurious values). But fault tolerance is subtler than this simple description makes it seem. The system must manage the coordination of its components precisely to maintain the correctness guarantees in the face of failures, which may occur at any time.
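To make the outvoting arithmetic concrete, here is a minimal sketch of majority voting over 2f + 1 replies; the replies list, and the transport that would gather it from the replicas, are my own assumptions, not from the book:

    from collections import Counter

    def majority_read(replies, f):
        """Mask up to f Byzantine replies by majority vote.

        replies: the values returned by 2f + 1 replicas (gathering them is
        left to a hypothetical transport layer). With at most f faulty
        replicas, the correct value appears at least f + 1 times and wins.
        """
        assert len(replies) == 2 * f + 1, "need 2f + 1 replies to outvote f faults"
        value, count = Counter(replies).most_common(1)[0]
        if count >= f + 1:
            return value
        raise RuntimeError("no majority: more than f replicas may be faulty")

    # f = 1, so 2f + 1 = 3 replicas; one returns a spurious value and is outvoted.
    print(majority_read(["x=42", "x=42", "x=99"], f=1))  # prints "x=42"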

and

18.3 Fault-tolerant services

(This section describes approaches to achieving fault tolerance. It introduces the correctness criteria of linearizability and sequential consistency, then explores two approaches: passive (primary-backup) replication, in which clients communicate with a distinguished replica; and active replication, in which clients communicate by multicast with all replicas.)
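For example, here is a bare-bones sketch of passive (primary-backup) replication, assuming synchronous, failure-free propagation to the backups; the class and method names are illustrative, not from the book:

    class Replica:
        """One server holding a full copy of the data."""
        def __init__(self):
            self.state = {}

        def apply(self, key, value):
            self.state[key] = value

    class PrimaryBackupService:
        """Passive replication: clients talk only to the primary, which
        executes each update and pushes it to the backups before replying."""
        def __init__(self, replicas):
            self.replicas = list(replicas)  # replicas[0] acts as the primary

        def write(self, key, value):
            primary, backups = self.replicas[0], self.replicas[1:]
            primary.apply(key, value)    # 1. primary executes the request
            for backup in backups:       # 2. primary propagates the update
                backup.apply(key, value)
            return "ok"                  # 3. only then does it answer the client

        def promote_backup(self):
            """On a primary crash, the first backup takes over as primary."""
            self.replicas.pop(0)
            assert self.replicas, "all f + 1 replicas have failed"

Active replication would instead have the client multicast each request to all replicas, each of which executes it independently.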

18.4 Case studies of highly available services: The gossip architecture, Bayou and Coda

(Case studies of three systems for highly available services are considered. In the gossip and Bayou architectures, updates are propagated lazily between replicas of shared data. In Bayou, the technique of operational transformation is used to enforce consistency. Coda is an example of a highly available file service.)

I am trying to understand the difference and relation between fault tolerance and (high) availability.

  • Does fault tolerance imply high availability? I.e., if a service is fault tolerant, is it necessarily highly available?

  • From what I read (which may not be correct), 18.3 Fault-tolerant services assumes no network partition between replicas and achieves linearizable consistency, while the systems in 18.4 Case studies of highly available services can tolerate network partitions between replicas but provide only weak consistency. Is that true?

I am also wondering what the difference and relation are between tolerating failures and masking failures.

Thanks.

Tim

1 Answer


The two basic concepts are orthogonal; however, they are related. One has to do with the availability of your application, and the other has to do with the correctness of your application. Remember, faults come at differing levels of severity, from software bugs to failing hardware.

Fault Tolerance

"Everything fails, all the time" -- Werner Vogels, Amazon CTO

When you design for fault tolerance, you are addressing issues such as:

  • What happens if my service can't access required resources (database, storage, message queues, etc.)?
  • What happens if I receive bad data?
  • How should I handle another service taking too long to respond?
  • What performance criteria do I have to meet?

There are other related questions you have to ask yourself. It's important to realize that there are many ways your code can fail, and the longer it runs, the more likely it is to hit a soft-failure scenario.

Not all failures cause your code to crash, but avoiding crashes is a big reason that fault-tolerant systems are also more available.
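One common way to address the first bullets above (unreachable resources, slow dependencies) is a timeout-retry-fallback wrapper. A minimal sketch; the operation and fallback callables are placeholders for whatever resource you depend on:

    import time

    def call_with_retries(operation, retries=3, base_delay=0.1, fallback=None):
        """Retry a flaky zero-argument callable with exponential backoff.

        If every attempt fails (timeout or lost connection), fall back to a
        degraded answer, e.g. stale cached data, rather than crashing.
        """
        for attempt in range(retries):
            try:
                return operation()
            except (TimeoutError, ConnectionError):
                time.sleep(base_delay * 2 ** attempt)  # back off before retrying
        if fallback is not None:
            return fallback()
        raise RuntimeError(f"operation failed after {retries} attempts")

    # Example: a database read that falls back to a cached value.
    result = call_with_retries(lambda: {"user": "tim"},
                               fallback=lambda: {"user": "cached-tim"})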

High Availability

Ensuring your software is available for use means that users should not be inconvenienced when other people are also using the system. There is a direct relationship between the cost of running a system and its availability: quite simply, it costs more to be highly available.

To be available means you need contingencies for when things outside of your code's control are causing problems:

  • How do I scale my application?
  • How quickly can I scale the application?
  • What happens when my data center is not available?
  • What happens when the network gets saturated?
  • How should I respond if my data center is the victim of a disaster? (Natural disasters and war can force you to change data centers.)
  • How much will this cost me to run?

You can start to see why microservices and the cloud are key components of engineering for the internet. The idea is that you can scale out (more instances) much more quickly and cheaply than scaling up (more CPU and RAM). Additionally, you can scale back down just as easily when the spike in traffic is over to save you some money when traffic is slower.
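As a back-of-the-envelope illustration of scaling out, a trivial autoscaling rule computes the instance count from current load; the capacity figure and traffic numbers here are invented:

    import math

    def desired_instances(requests_per_sec, capacity_per_instance=100, minimum=2):
        """Scale out: run enough identical instances to absorb the load,
        never dropping below a small floor kept for redundancy."""
        return max(minimum, math.ceil(requests_per_sec / capacity_per_instance))

    print(desired_instances(950))  # traffic spike -> 10 instances
    print(desired_instances(120))  # quiet period  -> 2 instances (scale back down)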

Of course you need to design for this as well. That may require working with data partitioning, multi-region deployments, multi-availability-zone redundancy, etc. And if this is running on the user's hardware (i.e., a desktop application), then you may need to swap modules in and out of memory depending on whether you are using them.
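As one illustration of multi-region redundancy, a client can probe an ordered list of regional endpoints and use the first healthy one; the URLs and the health check below are placeholders:

    def first_healthy(endpoints, is_healthy):
        """Return the first healthy endpoint in preference order.

        endpoints: ordered list of regional URLs (nearest first).
        is_healthy: a health-check callable, e.g. an HTTP ping with a short
        timeout. Returns None if every region is down, in which case the
        caller should degrade gracefully.
        """
        for endpoint in endpoints:
            if is_healthy(endpoint):
                return endpoint
        return None

    # Prefer the local region; fail over to the remote one if it is down.
    regions = ["https://eu-west.example.com", "https://us-east.example.com"]
    active = first_healthy(regions, is_healthy=lambda url: True)  # stubbed check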

End Game

You need both of these disciplines to engineer something that can scale smoothly as demand for your application grows. Start by making your best guess at which problems you are most likely to face, and then monitor your application while it is running. That will show you where your current design has a hard time keeping up with demand reliably.

I guarantee you that none of the big systems people rely on started where they are now. Look at Twitter, Facebook, Airbnb, Netflix, and even this site: each of their architectures is different, even though they share some commonalities, because the different ways those systems are used required different customization. What's more telling is the decision process behind how they got there. Most of us will never engineer something as large as those big names on the internet, but we can learn from the decisions they made and apply them at a smaller scale.

Berin Loritsch