Questions tagged [fault-tolerance]

13 questions
11
votes
1 answer

Concurrent fault-safe data structure

I am building an application that aims to process ~10M data items per second. Each item is exactly 42 bytes (including a sorting key), which means the total data rate will not be large: 420 MB/s. The data structure entries are supposed to be…
8
votes
10 answers

Boneheaded exceptions should not be caught. Then how to provide fault tolerance and reliability?

I've always been taught that fatal exceptions (indicating problems that cannot be solved programmatically) and boneheaded exceptions (resulting from bugs in my code) should not be caught, should not be handled, and should not be ignored; instead they…
gaazkam
3
votes
5 answers

Design pattern for objects in invalid states

General design pattern for object error state. Consider a simple class Wallet that models a wallet. A Wallet contains a certain amount of Wallet.Cash, and it is possible to take money out or put money in. public class Wallet { ///…
Benj
1
vote
5 answers

When do I stop being paranoid about my code failing?

I'm currently designing a system that, no matter how hard I try to break it (slow network, failures, random server deaths), can recover and rebuild itself. Each action it performs is a fragment, and it can pick up from where it left off. Each…
1
vote
0 answers

How to design a highly available and fault-tolerant file storage drop location on a Linux box

I am trying to build a highly available and fault-tolerant file drop location on a Linux server. The current system design is as follows: we have 2 Linux servers in a secured zone, into which several clients from an unsecured zone will be dropping files…
Valath
1
vote
0 answers

Byzantine Failures - Client Voter and Replica on Same Processor

I need help understanding the following text from Section 4.2 of this paper, Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial: Dependent-Failures Output Optimization. If a client and a state machine replica run…
0
votes
1 answer

What is the crux of the difference between N-version programming and self-monitoring architecture?

Source: https://cs.ccsu.edu/~stan/classes/CS410/Notes16/11-ReliabilityEngineering.html This is the self-monitoring architecture. Here, computations are carried out across 2 channels; if they both provide the same result, then the system is operating correctly, else…
cuajiu
0
votes
4 answers

How to guarantee HTTP message delivery in a fault-tolerant way

When an application A communicates with an external/3rd-party system B, there is always a chance that B is down. Say that A raises an event that should be sent as a message to B via HTTP. What is the best way to guarantee that the message is…
-2
votes
1 answer

Unexpected shutdown before saga completion

Suppose we have some microservices, and a saga runs to perform a transaction across 6 microservices. What if the whole system dies (unexpected shutdown) in the middle of the saga, at step number 4? (The system died, so state is lost.)
-3
votes
2 answers

What is the difference between masking and tolerating failures?

Distributed Systems, 5th ed., by Coulouris says on pp. 21-22, Section 1.5.5 Failure handling: Detecting failures: Some failures can be detected. For example, checksums can be used to detect corrupted data in a message or a file. Chapter 2 explains that it is …
Tim
-4
votes
1 answer

What are the difference and relation between fault tolerance and (high) availability?

From Coulouris' Distributed Systems, 5th ed., Chapter 18 Replication, Section 18.1 Introduction: Increased availability: Users require services to be highly available. That is, the proportion of time for which a service is accessible with reasonable…
Tim
-4
votes
3 answers

How does a distributed system both tolerate network partition and achieve consistency?

In the CAP theorem, there are only three possible cases: C and A without P; C and P without A; A and P without C. How can we have both P and C in the second case? Doesn't propagation of an update from one replica to the other require a communication…
Tim
-4
votes
1 answer

Does stale data due to a weak level of consistency count as a Byzantine failure?

I have difficulty understanding Section 18.3 Fault Tolerance Services under Ch. 18 Replication in Coulouris' Distributed Systems. If my reading and understanding are correct (which they might not be), Section 18.3.1 Passive Replication describes services that…
Tim