Is it possible to make an electronic system where every sub-system is duplicated, absolutely nothing is a single point of failure, and if a failure (or multiple failures) occurs the system itself will seamlessly route around it?

Back in the 80s I worked with Stratus fault-tolerant computers. Their party trick was being able to hot-swap the microprocessor boards, memory or hard discs seamlessly, without interruption for the end user. But even this system wasn't quite as good as the adverts: both the clock and the motherboard were single points of failure.

At the simplest level, I've built devices that choose which of two batteries to connect to (based on state of charge), but even with a simple set-up like this the relay becomes a single point of failure.

I've also worked with microprocessor-based systems that would connect to three different GPS feeds and take a 'vote' on the most plausible position if one of the three disagreed with the others. But even here, the voting system and the bus it communicated over were single points of failure.
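To make the voting idea concrete, here is a minimal sketch of a mid-value voter over three position fixes. It is only an illustration, not the code from the system described above: the function name `vote_position`, the local (x, y) frame in metres and the 50 m tolerance are all assumptions. Note that the per-axis median is not necessarily one of the three reported fixes.

```python
from statistics import median


def vote_position(fixes, tolerance_m=50.0):
    """Mid-value select over three (x, y) fixes, in metres (assumed local frame).

    Returns (voted_position, suspects), where suspects holds the indices of
    the receivers that lie further than tolerance_m from the voted position.
    """
    if len(fixes) != 3:
        raise ValueError("this sketch expects exactly three fixes")
    # Taking the median per axis means one wild value cannot drag the result.
    voted = (median(f[0] for f in fixes), median(f[1] for f in fixes))
    suspects = [
        i
        for i, (x, y) in enumerate(fixes)
        if ((x - voted[0]) ** 2 + (y - voted[1]) ** 2) ** 0.5 > tolerance_m
    ]
    return voted, suspects


# Hypothetical readings: the third receiver has gone badly wrong.
voted, suspects = vote_position([(100.0, 200.0), (102.0, 198.0), (5000.0, -3000.0)])
```

The sketch also shows the weakness: this one function, and whatever processor and bus it runs on, is still a single point of failure.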

Is it possible to build a multi-function system (for example drone autopilots, missile control systems, nuclear reactor safety systems) where every system and sub-system is fully redundant and any internal system failure is seamlessly routed around?

I'll freely admit this is a follow-up to my previous question about space probes. The consensus answer seemed to be that (in the case of Voyager) reliability was ensured by using one-of-everything and making sure that each component was maximum quality and 100% reliable.

It occurs to me that an alternative approach is three-of-everything, plus some mechanism to ensure that if one-of-the-three becomes faulty it gets ignored.

brhans
ConanTheGerbil
  • 3 of everything is a recognised technique used in aeroplanes I believe. – Andy aka Jan 12 '20 at 11:08
  • An additional "takeaway" from the Voyager answers probably should have been to derate components for power and in some cases voltage. A top quality design is not top reliability without this. – Russell McMahon Jan 12 '20 at 11:18
  • @Andyaka Even with 3-way redundancy there comes a point when the hardware must reduce the 3 answers to a single action...such as activating an actuator in a plane. Redundancy certainly reduces the risk of failure but I don't think you can reasonably say that it eliminates all risk. – Elliot Alderson Jan 12 '20 at 14:40
  • I agree, I was just pointing out to the OP that it was a recognised technique. @ElliotAlderson – Andy aka Jan 12 '20 at 14:59
  • Another shortcoming with triple redundancy (3 of everything) is that you need to ensure that no single fault can take out all 3 systems. A case in point is the DC-10 plane crash in 1989. There was a failure in the 2nd engine (the one that went through the tail) where the main rotor disc came apart, basically exploded. The shrapnel severed the 3 hydraulic lines for the flight control system (all 3 ran in approximately the same location), taking out the main and redundant control paths in one fell swoop. – SteveSh Jan 12 '20 at 21:29

2 Answers

Theoretically, it would be possible if somehow we could detect every possible fault (which we cannot - see next paragraph).

The primary problem is complexity; every time we add a component, we add complexity to a circuit. In the case of an autopilot (for example) there may be several hundred (or more) components for a single channel, and it becomes prohibitively time-consuming to analyse every possible fault. If we tried to detect every possible fault, we would exponentially increase the complexity of an already complex system (thus increasing the potential for faults), and then there is the conundrum: how do I ensure the detection circuits are not faulty?

(I actually encountered this problem where the startup behaviour of an opamp caused the fault monitor to trip even though no fault was present; thankfully this was while we were still in the development phase).

We very quickly end up in an exponential loop with no apparent exit.
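To put a rough number on the analysis burden, here is a small sketch that counts how many fault cases exist when several components fail at once. The figures are purely illustrative assumptions (300 components, two fault modes each, say open and short), and `fault_combinations` is a hypothetical helper name, not part of any real analysis tool.

```python
from math import comb


def fault_combinations(n_components, modes_per_component, simultaneous_faults):
    """Count distinct fault cases with exactly `simultaneous_faults` failed parts.

    Choose which components have failed, then pick one fault mode for each.
    """
    return comb(n_components, simultaneous_faults) * modes_per_component ** simultaneous_faults


# Assumed single-channel design: 300 components, 2 fault modes each.
for k in (1, 2, 3):
    print(k, fault_combinations(300, 2, k))
# 1 600
# 2 179400
# 3 35640800
```

Every monitoring circuit added to catch these cases increases the component count, and the totals grow again.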

The other problem is that some faults are not directly detectable - think things like bypass capacitors failing open.

There may also be a (previously) unknown fault mode for some components (I had to analyse why apparently perfectly good new dry tantalum devices were failing in a spectacularly pyrotechnic manner - it turns out that the act of reflow can damage the part and if it is fed from a low impedance source, it can go into thermal runaway).

In that case, the device becomes a short circuit for a short period of time (a second or so) and then becomes an open circuit and pretty much undetectable in most cases.

Even flight safety critical equipment, such as flight control computers (fly by wire) which are redundant (in the case of some aircraft, triple redundant with the associated sensors also triple redundant) may have backups.

The communication between processors is also galvanically isolated to minimise the possibility of an electrical fault in one processor affecting the other processing channels (but it is still non-zero).

For civil programmes that use microprocessors, the architecture of each processor is different and this is to guard against a particular type of common mode problem; a bug in the microcode. Different architectures have a vanishingly small chance of having such a bug at the same point in a programme (but it is still non-zero).

On some aircraft, the euphemistically named Alternate flight control computer is in fact a backup for the primary computer(s) and it is the 'last man standing' (I salute any test pilot who performs the test flights to prove it actually does operate as they must deliberately fail the primary flight controls).

Safety analysis is a numbers game; for level A avionics (where system failure is catastrophic) the requirement is to show that the probability of failure is less than \$10^{-9}\$ per flight hour, as it is simply not possible to consider every single potential fault.
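As a rough illustration of the arithmetic (not taken from any certification document), here is a sketch of the failure probability of three channels feeding a 2-out-of-3 voter. It assumes the channels fail independently, which is exactly the assumption a common-cause fault breaks. The function name and the example probabilities are hypothetical.

```python
def tmr_failure_probability(p_channel, p_voter=0.0):
    """Failure probability of three identical channels behind a 2-out-of-3 voter.

    Assumes independent channel failures with probability p_channel each,
    and a single voter in series that fails with probability p_voter.
    """
    # The voted channels fail when at least two of the three have failed.
    p_channels = 3 * p_channel**2 * (1 - p_channel) + p_channel**3
    # The system works only if the voter works AND the channel majority works.
    return 1 - (1 - p_voter) * (1 - p_channels)


# Illustrative numbers only: each channel fails with probability 1e-3.
print(tmr_failure_probability(1e-3))                # about 3.0e-6
print(tmr_failure_probability(1e-3, p_voter=1e-6))  # about 4.0e-6
```

With these assumed numbers, the redundancy buys roughly three orders of magnitude, but the result can never be better than the voter (or anything else in series with it), and it never reaches zero.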

We do look at all circuits using a fault tree and extensive analysis (and strict derating in many cases), but the fact is that although we can mitigate against single point of failure mechanisms, we cannot state that such a mechanism does not exist. We can only state the likelihood of such a mechanism existing, and design the system to fail in a predictable manner where possible.

Peter Smith

You cannot guarantee that a system cannot fail, because you can't guarantee that every possible failure has been accounted for. I worked on railroad control systems (specifically Positive Train Control, or PTC) for many years. Safety is the primary concern of these systems. One approach to trying to guarantee that no single fault will cause an accident is fault tree analysis, in which you try to identify all possible faults by starting from high-level failures (collisions, derailments, etc.) and working your way down through the events that could cause them, until you reach single failure events at the bottom of the tree. Then you try to mitigate each of those events using redundancy, bypasses, or simply stopping the trains to avert accidents. However, there is no 100% guarantee, because you may have overlooked a failure. Just look at the history of the space program, where similar safety techniques were used but failures still occurred due to overlooked events (even something as simple as a unit conversion error).
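The bottom-up arithmetic of a fault tree can be sketched in a few lines. This is a toy example under the assumption that the basic events are independent; the helper names `p_and` and `p_or`, the event names and all the probabilities are made up for illustration and do not come from any real PTC analysis.

```python
from math import prod


def p_and(*probs):
    """AND gate: the output event needs every input event to occur."""
    return prod(probs)


def p_or(*probs):
    """OR gate: the output event occurs if at least one input event occurs."""
    return 1 - prod(1 - p for p in probs)


# Hypothetical basic events (probability per trip, illustrative only).
signal_shows_wrong_aspect = 1e-3
onboard_enforcement_fails = 1e-2
track_fault_undetected = 1e-6

# Top event: both protection layers fail together, OR the track fault occurs.
top_event = p_or(
    p_and(signal_shows_wrong_aspect, onboard_enforcement_fails),
    track_fault_undetected,
)
print(top_event)  # about 1.1e-5
```

The calculation only covers events that made it into the tree. A failure nobody thought of contributes nothing to the number, which is why the result can never be read as a guarantee.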

Barry
  • Your railway example is good, but note that, certainly with the older relay logic, we didn't guarantee that the circuit wouldn't fail but did design it so that it would fail in a safe or more restrictive manner, e.g. if the relay for the green signal fails, the driver will see yellow or red. – Transistor Jan 12 '20 at 13:28
  • @Transistor I never said that the system was designed to not fail; it was designed to either have redundancy or go to a safe state (such as stopping the trains) in the event of a failure. The great feature of the Westinghouse air brakes used on trains is that in the event of a failure (loss of air pressure in the main hose), the brakes are applied (contrary to what is often seen on TV and the movies). – Barry Jan 12 '20 at 23:13
  • Sorry, while I addressed it to you it was really meant for the OP. – Transistor Jan 12 '20 at 23:30