Questions tagged [fault-tolerance]
13 questions
11
votes
1 answer
Concurrent fault-safe data structure
I am building an application that aims to process ~10M data items per second.
Each item is exactly 42 bytes (including a sorting key), which means the total data rate will not be large: about 420 MB/s.
The data structure entries are supposed to be…

SmartArray
- 227
- 1
- 6
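A quick check of the data rate quoted in the excerpt above, as a minimal sketch. The constant names are illustrative and not from the question; the only inputs are the two figures the asker states (about 10M items per second, 42 bytes per item).

```python
# Back-of-the-envelope check of the data rate quoted in the question.
ITEMS_PER_SECOND = 10_000_000   # "~10M data items per second"
BYTES_PER_ITEM = 42             # includes the sorting key

bytes_per_second = ITEMS_PER_SECOND * BYTES_PER_ITEM
megabytes_per_second = bytes_per_second / 1_000_000      # decimal megabytes
mebibytes_per_second = bytes_per_second / (1024 * 1024)  # binary mebibytes

print(f"{megabytes_per_second:.0f} MB/s")   # 420 MB/s
print(f"{mebibytes_per_second:.1f} MiB/s")  # 400.5 MiB/s
```

The 420 MB/s figure therefore assumes decimal megabytes; in binary units the same stream is roughly 400 MiB/s.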
8
votes
10 answers
Boneheaded exceptions should not be caught. Then how to provide fault tolerance and reliability?
I've always been taught that fatal exceptions (indicating problems that cannot be solved programmatically) and boneheaded exceptions (resulting from bugs in my code) should not be caught, handled, or ignored; instead they…

gaazkam
- 3,517
- 3
- 19
- 35
3
votes
5 answers
Design pattern for objects in invalid states
General design pattern for object error state
Consider a simple class Wallet that models a wallet. A Wallet contains a certain amount of Wallet.Cash and it is possible to take money out / put money in.
public class Wallet
{
///…

Benj
- 169
- 5
1
vote
5 answers
When do I stop being paranoid about my code failing?
I'm currently designing a system that, no matter how hard I try to break it (slow network, failures, random server deaths), can recover and rebuild itself. Each action it performs is a fragment, so it can pick up from where it left off. Each…

Daniel Smith
- 83
- 3
1
vote
0 answers
How to design a highly available and fault-tolerant file storage drop location on a Linux box
I am trying to build a highly available and fault-tolerant file drop location on a Linux server. Please find the current system design below:
We have two Linux servers in a secured zone, into which several clients from an unsecured zone will be dropping files…

Valath
- 127
- 2
1
vote
0 answers
Byzantine Failures - Client Voter and Replica on Same Processor
I need help understanding the following text from section 4.2 of this paper: Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
Dependent-Failures Output Optimization. If a client and a state machine replica run…

KJP
- 117
- 5
0
votes
1 answer
What is the crux of the difference between N-version programming and self-monitoring architecture?
Source: https://cs.ccsu.edu/~stan/classes/CS410/Notes16/11-ReliabilityEngineering.html
This is the self-monitoring architecture. Here, computations are carried out across 2 channels; if both provide the same result, then the system is operating correctly, else…

cuajiu
- 9
- 1
0
votes
4 answers
How to guarantee HTTP message delivery in a fault-tolerant way
When an application A communicates with an external/3rd party system B, there is always a chance that B is down.
Say that A raises an event that should be sent as a message to B via HTTP, what is the best way to guarantee that the message is…

bangnab
- 746
- 6
- 14
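One approach commonly discussed for this kind of problem is a durable "outbox": A records each message locally before trying to send it, and only marks it delivered once B acknowledges it. The sketch below is a minimal illustration under stated assumptions, not an answer taken from the thread. The table layout and the names `open_outbox`, `enqueue`, `deliver_pending`, and the injected `send` callable are all hypothetical, and the real HTTP call is left to the caller. It gives at-least-once delivery, so B would need to tolerate duplicates.

```python
import sqlite3
from typing import Callable


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the outbox table. Use a file path for real durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "delivered INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    """Persist the message before any attempt to contact B."""
    cur = conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def deliver_pending(conn: sqlite3.Connection,
                    send: Callable[[int, str], bool]) -> int:
    """Try to send every undelivered message in order.

    `send` is a hypothetical callable that performs the HTTP request and
    returns True only if B acknowledged the message.
    """
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for msg_id, payload in rows:
        try:
            ok = send(msg_id, payload)
        except Exception:
            ok = False  # treat any transport error as "B is down"
        if not ok:
            break  # stop here to preserve ordering; retry on the next call
        conn.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (msg_id,))
        conn.commit()
        delivered += 1
    return delivered
```

A caller would invoke `deliver_pending` periodically (for example from a background worker), so messages queued while B was down go out once it comes back.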
-2
votes
1 answer
Unexpected shutdown before a saga completion
Suppose we have some microservices, and a saga runs a transaction across 6 of them.
What if the whole system dies (unexpected shutdown) in the middle of the saga, at step 4? (The system died, so state is lost.)

Amin Shojaei
- 107
- 4
-3
votes
2 answers
What is the difference between masking and tolerating failures?
Distributed Systems, 5th ed., by Coulouris says on pp. 21-22:
1.5.5 Failure handling
Detecting failures: Some failures can be detected. For example, checksums can be used to detect corrupted data in a message or a file. Chapter 2 explains that it is …

Tim
- 5,405
- 7
- 48
- 84
-4
votes
1 answer
What are the difference and the relation between fault tolerance and (high) availability?
From Coulouris' Distributed Systems 5ed
Chapter 18 Replication
18.1 Introduction
Increased availability: Users require services to be highly available. That is, the proportion of time for which a service is accessible with reasonable…

Tim
- 5,405
- 7
- 48
- 84
-4
votes
3 answers
How does a distributed system both tolerate network partition and achieve consistency?
In the CAP theorem, there are only three possible cases:
C and A without P
C and P without A
A and P without C
How can we have both P and C in the second case?
Doesn't propagation of an update from one replica to the other require a communication…

Tim
- 5,405
- 7
- 48
- 84
-4
votes
1 answer
Does stale data due to a weak level of consistency count as a Byzantine failure?
I have difficulty understanding Section 18.3 Fault Tolerance Services under Ch18 Replication in Coulouris' Distributed Systems. If my reading and understanding are correct (which they might not be),
Section 18.3.1 Passive Replication describes services that…

Tim
- 5,405
- 7
- 48
- 84