Triple Modular Redundancy w/ Majority Voting when a microcontroller fails

Question

In a simple majority voting system, the majority input values will be the output, as shown below.

(Image: majority voter truth table; source: 6502.org)

However, how does one deal with scenarios where one or two microcontrollers fail? Also, how do we overcome the problem of floating inputs when a microcontroller dies?

score 2 · Answer 1 · answered Dec 29 '15 at 04:28

The general assumption in safety and redundantly designed systems is that only one failure occurs.

Failures are also assumed (and the system is designed) to be independent, so a single failure doesn't trigger a cascade of further failures. For instance, if a circuit overheats, a thermal protection trips, rather than that circuit's heat causing another circuit to fail.

In a redundant system as described, you have to assume that only a single failure occurs. This means you have to ensure, through the physical design, wiring, etc., that, for instance, A and B can't short together and Q can't short to A.

The usual failure assumption is that an MCU pin doesn't fail high-Z but instead fails to an incorrect logic state. To cover the high-Z case as well, a passive pull-down resistor on each MCU output (perhaps implemented in the majority gate) can be used. Note that in the redundant system, a high-Z output won't be any worse than a stuck-at state.
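That last point can be checked with a small model. This is only a sketch in Python: `HIGH_Z`, `resolve`, and `vote` are made-up names for illustration, and the pull-down is modelled very crudely as "an undriven line reads 0".

```python
HIGH_Z = None  # illustrative marker for a pin that is not driving its line


def resolve(level, pull=0):
    """Model a passive pull resistor: an undriven line reads as `pull`."""
    return pull if level is HIGH_Z else level


def vote(a, b, c):
    """2-of-3 majority: (a AND b) OR (a AND c) OR (b AND c)."""
    return (a & b) | (a & c) | (b & c)


def voted_output(a, b, c):
    """Vote after the pull-down has resolved any floating line."""
    return vote(resolve(a), resolve(b), resolve(c))


# Two healthy MCUs agree; the third one has failed.
for good in (0, 1):
    assert voted_output(good, good, HIGH_Z) == good  # failed high-Z
    assert voted_output(good, good, 0) == good       # failed stuck-at-0
    assert voted_output(good, good, 1) == good       # failed stuck-at-1
```

With the pull-down in place, the floating line simply becomes one more wrong-or-right logic level, and the two healthy channels still outvote it.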

score 2 · Answer 2 · edited Dec 29 '15 at 19:49

Floating inputs are easy: you bias the line with a pull-up or pull-down resistor (generally 10 kΩ or so). These are good practice in general, since they cover situations where your microprocessor is in reset, is blank, is in the middle of being programmed, etc. They basically cover you in states where your software is not running.

Let's say you have a net that controls the enable of some other IC or device, and it is active high. Placing a 10 kΩ resistor from this net to GND will bias the net low in the absence of any other stimulus. To enable the device, your microcontroller drives the net with, say, a 3.3 V logic signal. This costs 330 µA (virtually nothing) to overcome the resistor, and the circuit will function as designed.
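The 330 µA figure is just Ohm's law applied to the values above (3.3 V across 10 kΩ). A quick check in Python, with a helper name of my own choosing:

```python
def pull_down_current_amps(v_drive, r_pull_ohms):
    """Current the driving pin must source through the pull-down (I = V / R)."""
    return v_drive / r_pull_ohms


current = pull_down_current_amps(3.3, 10_000)
power = 3.3 * current  # power dissipated in the resistor (P = V * I)

print(f"{current * 1e6:.0f} uA")  # prints "330 uA"
print(f"{power * 1e3:.2f} mW")    # prints "1.09 mW"
```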

Now, if we're talking about fault scenarios where an I/O pin has latched up, or you've suffered an SEU (single-event upset, aka bit flip) in an I/O port data register, that is much, much harder to defend against outside of an IC without a physical, external majority voter gate. A 10 kΩ pull-down won't do a thing against a low-impedance MCU I/O pin that has latched high and can source tens of milliamperes.
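To put a rough number on that, treat the latched pin and the pull-down as a voltage divider. The 25 Ω output resistance below is purely an assumed, illustrative figure (check the datasheet for a real part); the 3.3 V and 10 kΩ come from the example above.

```python
V_SUPPLY = 3.3          # logic supply from the example above, in volts
R_PULL_DOWN = 10_000.0  # pull-down from the example above, in ohms
R_PIN_ON = 25.0         # ASSUMPTION: illustrative output resistance of a pin stuck high


def node_voltage(v_supply, r_source, r_pull_down):
    """Voltage on the net: a divider between the stuck pin and the pull-down."""
    return v_supply * r_pull_down / (r_source + r_pull_down)


v_net = node_voltage(V_SUPPLY, R_PIN_ON, R_PULL_DOWN)
print(f"net sits at {v_net:.3f} V of {V_SUPPLY} V")

# The pull-down barely moves the net: it is still a solid logic high.
assert v_net > 0.99 * V_SUPPLY
```

Under that assumption the pull-down costs the stuck pin well under 1 mA and leaves the net within 1% of the rail, which is why the resistor alone is no defence here.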

Latch-up protection is generally implemented with an LCL, or latching current limiter. This can be as simple as putting your circuit behind a power-switch IC that has a programmable current-limit threshold, like a TI TPS2556. In the event of a downstream latch-up, this IC will limit the current that can flow and potentially protect against the permanent hardware damage that results from localized heating during a latch-up event. Terrestrial causes of latch-up are generally over-voltage; orbital causes are energetic particles that impart sufficient LET (linear energy transfer) to trigger the parasitic SCR / latch-up condition. (See also: https://en.m.wikipedia.org/wiki/Latch-up)

Triple Modular Redundancy (TMR) protects you against single faults, as your truth table shows. Multi-fault scenarios get very complex, and they are often considered pathological cases that are deemed so statistically unlikely that no further effort is expended on them.
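If you want to convince yourself of the single-fault claim, a brute-force check is short. This is a sketch in Python with names of my own; it models the worst case, where every faulty channel outputs the inverse of the correct value (a fault that happens to match the correct value is harmless).

```python
from itertools import combinations


def vote(a, b, c):
    """2-of-3 majority voter."""
    return (a & b) | (a & c) | (b & c)


def wrong_outputs(num_faults):
    """Return (wrong, total) over all cases with `num_faults` inverted channels."""
    wrong = 0
    total = 0
    for correct in (0, 1):
        for faulty in combinations(range(3), num_faults):
            inputs = [correct ^ 1 if i in faulty else correct for i in range(3)]
            total += 1
            if vote(*inputs) != correct:
                wrong += 1
    return wrong, total


for n in range(4):
    wrong, total = wrong_outputs(n)
    print(f"{n} faulty channel(s): {wrong} of {total} cases give the wrong output")
```

Zero or one inverted channel never corrupts the output; two or three always do.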

I suppose you could extend further to n-modular redundancy (say jump to 5), but I can tell you that for space applications I've worked on, our system designs are fine with TMR. I'd be curious to hear what you have that requires stricter reliability.
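For a feel of what going from 3 to 5 buys, here is the textbook reliability calculation as a Python sketch. It assumes independent failures, identical modules, and a perfect voter; the function name and the 0.99 per-module figure are made up for illustration, not taken from any real system.

```python
from math import comb


def nmr_reliability(n, r):
    """Probability that a majority of n independent modules work.

    r is the probability that a single module works; the voter is
    assumed to be perfect.
    """
    need = n // 2 + 1  # smallest number of working modules that is a majority
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(need, n + 1))


# Illustrative per-module reliability, not measured data.
for n in (1, 3, 5):
    print(n, nmr_reliability(n, 0.99))
```

For n = 3 this reduces to the familiar 3r² − 2r³. Note that redundancy only helps when the individual modules are already better than a coin flip: below r = 0.5, voting makes things worse.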

score 1 · Answer 3 · answered Dec 29 '15 at 05:32

The problem with the question is that the assumptions are too simplistic when applied to reality - which is essentially what you are asking about - so it's a fair question.

The "truth table" logic above assumes that each system produces a signal declaring its output as either a 0 or a 1, and that the system itself always considers that output valid. That output can then be compared with the output of one or more other systems to see whether they agree.

In the simplistic case:

A single failure is outvoted, and the system works as intended.
Two failures (or, more generally, a majority of failures) cause the faulty controllers to "win" over the good controller, and the output is erroneous. This is the logical outcome of majority voting*. This assumes that the output can be represented as a 0 or 1. (* and happens in politics too :-) )

In the real world, for this to be useful, you want an independent means of detecting failure. The failure-status leads can then be used to modify the voting.
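One way to picture "status leads modify voting" is sketched below in Python. The `vote_with_status` name and the policy (drop channels flagged as failed, report "no valid output" on a tie) are assumptions of mine; a real design would choose its own safe-state behaviour.

```python
def vote_with_status(values, healthy):
    """Majority vote over the channels whose status lead reports healthy.

    Returns 0 or 1, or None when the healthy channels cannot form a
    majority (a tie, or no healthy channel at all).
    """
    active = [v for v, ok in zip(values, healthy) if ok]
    ones = sum(active)
    zeros = len(active) - ones
    if ones > zeros:
        return 1
    if zeros > ones:
        return 0
    return None


# Two controllers have failed, but both failures were detected,
# so the single good controller is no longer outvoted.
print(vote_with_status([1, 0, 0], [True, False, False]))  # prints 1

# One detected failure leaves two channels; if they disagree there is no majority.
print(vote_with_status([1, 0, 0], [True, True, False]))   # prints None
```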

In the case of floating inputs or other failure modes, the answer is "whatever works for you", i.e. the situation will vary from case to case. If you can detect the floating state or some other demonstrably erroneous condition, you accommodate it. If you can't, then you can't.

Note that if there is a floating signal from one (faulty) controller and the other two are OK then the voting result will be valid.

Sampling of the signals must be done synchronously (using clocked sampling) if erroneous results are to be avoided during transitions. This applies whether the voter sees three valid signals or two valid signals plus one invalid or floating one.
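A toy illustration of why, in Python. The waveforms are invented: A and B are healthy and both switch from 0 to 1, but B lags A by three time steps; C is faulty and toggles on every step. The sample times are an assumption too, chosen so that both clock edges fall outside the skew window.

```python
def vote(a, b, c):
    """2-of-3 majority voter."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical waveforms, one entry per time step.
A = [0, 0, 1, 1, 1, 1, 1, 1]  # healthy, switches at t = 2
B = [0, 0, 0, 0, 0, 1, 1, 1]  # healthy, switches at t = 5 (skewed)
C = [0, 1, 0, 1, 0, 1, 0, 1]  # faulty, toggling

# Unclocked voter: the output follows the inputs at every instant.
unclocked = [vote(a, b, c) for a, b, c in zip(A, B, C)]

# Clocked voter: inputs are only sampled once the healthy signals have settled.
SAMPLE_TIMES = (0, 6)  # ASSUMPTION: clock edges fall outside the skew window
clocked = [vote(A[t], B[t], C[t]) for t in SAMPLE_TIMES]

print(unclocked)  # prints [0, 0, 0, 1, 0, 1, 1, 1]
print(clocked)    # prints [0, 1]
```

While A and B disagree (t = 2 to 4), the faulty channel has the casting vote, so the unclocked output glitches high at t = 3 and drops again at t = 4. The clocked version only ever sees the two healthy channels in agreement.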
