PFHD of Dual Redundant Self Checking Circuit

Question

I'm calculating some ridiculous PFHD values for a Dual Redundant Self Checking Circuit and since I'm no expert in reliability calculations (cough) I wanted to check some of the logic.

I have two identical, but independent instances of the same circuit (C1 and C2) both being checked by a third checking circuit (C3). C3 only allows C1 and C2 to be reset if both C1 and C2 are in the tripped state when the Reset button is pressed. Using component FIT data and MIL-HDBK-217F I calculate the failure rate of each of C1 and C2 to be 1E-4/hour, so 100,000 FIT (dominated by a relay in each). C3 consists solely of logic ICs and passives and I calculate a failure rate below 100 FIT (1E-7 /hour).
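As a quick check on the unit conversion above, here is a minimal Python sketch. It assumes the usual definition of FIT as failures per 1E9 device-hours; the function names are illustrative only and do not come from any standard or library.

```python
# FIT = failures per 1e9 device-hours (usual definition, assumed here).
HOURS_PER_FIT_UNIT = 1e9


def fit_to_rate_per_hour(fit: float) -> float:
    """Convert a FIT value to a failure rate in failures per hour."""
    return fit / HOURS_PER_FIT_UNIT


def rate_per_hour_to_fit(rate: float) -> float:
    """Convert a failure rate in failures per hour to FIT."""
    return rate * HOURS_PER_FIT_UNIT


if __name__ == "__main__":
    print(fit_to_rate_per_hour(100_000))  # C1 or C2 (relay dominated)
    print(fit_to_rate_per_hour(100))      # upper bound quoted for C3
```

The two calls correspond to the 100,000 FIT and 100 FIT figures quoted for C1/C2 and C3.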

The machine this controls can only run if both C1 and C2 independently allow it. Ignoring the Self Checking Circuit (C3), I calculate the probability of failure PFHD = 1E-4 * 1E-4 = 1E-8 /hour. That's impressive to say the least (SIL 4) and makes me nervous of the validity of my calculation already.

The trip rate is reasonably high and much higher than the rate at which a significant hazard is averted, so I presume that I can treat the Self Checking circuit (C3) as effective automatic self diagnosis.

For the system as a whole to fail, C1 and C2 both have to fail. But C3 would detect the point at which just one has failed, and the system would then be taken offline and repaired. So, for Failure On Demand, C3 must have failed as well (as an aside, it would need to have failed before the first of C1 and C2 did). As the Self Checking circuit (C3) is independent, Pr(failure) = Pr(failure without self checking) * Pr(C3 failure). So PFHD = 1E-8 * 1E-7 = 1E-15 /hour. You can probably see why I'm doubting my calculation, as that is about once per 10 ages of the universe!
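To put the final figure in perspective, the arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the calculation as posed in the question: it multiplies the rates exactly as described and makes no claim that doing so is valid. The age of the universe is taken as roughly 13.8 billion years, which is an assumption not stated in the question.

```python
HOURS_PER_YEAR = 24 * 365.25
AGE_OF_UNIVERSE_YEARS = 13.8e9  # assumed round figure, not from the question

rate_c1 = rate_c2 = 1e-4  # per hour, from the question
rate_c3 = 1e-7            # per hour, from the question

pfhd_no_check = rate_c1 * rate_c2          # question's figure ignoring C3
pfhd_with_check = pfhd_no_check * rate_c3  # question's figure including C3

# Mean interval between failures implied by the second figure.
mean_hours_between = 1.0 / pfhd_with_check
ages_of_universe = mean_hours_between / HOURS_PER_YEAR / AGE_OF_UNIVERSE_YEARS

print(f"{pfhd_no_check:.1e} /hour without C3, {pfhd_with_check:.1e} /hour with C3")
print(f"roughly {ages_of_universe:.1f} ages of the universe per failure")
```

Under these assumptions the interval comes out at somewhat under ten ages of the universe, which is the order of magnitude quoted above.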

"C3 only allows C1 and C2 to be reset if both C1 and C2 are in the tripped state when the Reset button is pressed." Are you sure about that? As described, if only one unit trips, it cannot be reset. — WhatRoughBeast, Apr 26 '22 at 18:54
Thanks @WhatRoughBeast Yes that is correct as it signals a failure of either C1 or C2 and that the system needs to be fixed. If one was tripped, they should have both tripped and so there is a problem with one of them if that's not the case. — user1228123, Apr 26 '22 at 19:09
" If one was tripped, they should have both tripped " Probably not. Unless the two are perfectly matched, one will trip first. This will interrupt the current to the second, and it should not trip. — WhatRoughBeast, Apr 27 '22 at 16:54
It's only a conceptual drawing, so for the sake of argument let's say it's a water heater, circuits C1 and C2 trip 2 minutes after they detect a temperature of 60C, the system temperature rises at 2C/minute, and the sensors are accurate to 1C. — user1228123, Apr 27 '22 at 19:33
Ever wonder why with one pilot per engine on commercial aircraft and twin instruments, they still have pilot error? Someone made false assumptions of "mean time" when something happens at the "same time" — Tony Stewart EE75, May 02 '22 at 17:25
I'm assuming that "to fail" means that the system does not receive power through C1 and C2. If this is the case, and ignoring C3 for a moment, then C1 and C2 are a simple serial reliability connection, in that both C1 and C2 have to be working properly in order for the machine to be powered up. And that puts the FIT of C1+C2 at 200,000 (the failure rates add). Right? — SteveSh, May 02 '22 at 18:22
Maybe a better way of asking the question is "what is your definition of success"? Is it that the system is receiving power, or is it that the system is turned off if a fault is detected? The calculations are different for those two cases. — SteveSh, May 02 '22 at 18:30
SteveSh you've identified the key difference between failure and dangerous failure. It's a safety system so it failing to make the machine available when it should be available is not dangerous whereas failing to stop the machine when it should is dangerous. You're correct that the probability of any failure (i.e. the machine going down) is C1+C2+C3, but the majority of those failures are safe failures and I need to calculate the probability of unsafe failure. — user1228123, May 03 '22 at 11:49

Tony Stewart EE75 · Answer 1 · 2022-05-02T18:30:35.473

I would not agree with the SIL4 calculation as the mean-time-to-failure might not be random if both are stressed simultaneously. What about if downtime for planned repairs cannot occur during demand operation?

I cannot say what the correct calculation result should be, without knowing if a fault is truly random or if it can occur at the same time.

A dual component for redundancy requires more in-depth knowledge and analysis (in my mind), and may require triple or more redundancy if SIL4 is required.

I'm not an expert anymore on MIL-HDBK-217, but I would expect a full tree diagram of all root causes, stress margins vs failure thresholds, metrics on material quality such as real-time estimates of contact resistance, and assessments of all environmental stresses relative to the assumed environmental limits (fault thresholds and design margins), such as:

%RH, temperature, shock, vibration, stress factors for margin to maximum current or power rating, electrical transients, power surges, ionization, climatic environment, mechanical environment, and exceptions to the assumed environmental application, e.g. lightning stroke frequency vs peak currents.
When dual series paths fail at the same time, the environments for each component need to be isolated before you can assume the behaviour is predictable from Arrhenius aging rates, stress acceleration, or bias from manufacturing flaws. The components are not shown to be isolated or stress reduced by sharing, so your assumption of independent random failure rates is in doubt because of Arrhenius effects.
Stress faults that are simultaneous triggers might be some rapid thermal runaway event, where the contact resistance suddenly increases by an order of magnitude from arcing and the resulting self-heating fuses some bimetallic contact switch closed (such as under extreme cold environment stress).
IMHO, it would take a much smarter high-quality system to ensure SIL4 performance and avoid false trips with preventive maintenance measurements. The components would have to be protected from all significant environmental stress by design.
C3 would have to be flawless, which can never be verified in real time, only inferred using many assumptions about the properties of accelerated aging and the design margins.

Thanks Tony, I don't need SIL4 and completely agree with your points. It's far more likely that there is a systematic failure (wrong part populated in a batch, design flaw, common mode failure such as a lightning strike?) than 1E-8 or 1E-15. But I need to understand the basic calculation, as I think we actually need more like SIL2 or SIL3 and the system is arguably over engineered. The calculation I have done assumes true probabilistic independent failure, which is never truly possible. Are you happy with the calculation approach for SIL2 or SIL3 and that the number ends up crazy small? — user1228123, May 03 '22 at 11:54
Perhaps a statistically minded person with math skills can answer that for you. I always doubted the accuracy due to synchronous stress-induced acceleration factors vs the random quality rates. — Tony Stewart EE75, May 03 '22 at 14:31

SteveSh · Answer 2 · 2022-05-07T15:01:26.267

Let me take a stab at this.

You've said in your comments that 1) your definition of success is that the machine is turned off when a dangerous failure occurs and 2) that circuits C1 and C2 are what monitor for those dangerous conditions, and 3) either one can shut down the machine.

Is this correct, because if not, then the analysis that follows is not applicable.

I am going to ignore C3 for now. So then from a reliability model standpoint, what you have are C1 and C2 in parallel, meaning that you only need one of them to function correctly in this regard. Here is a graphical model of that:

C1 and C2 each have a failure rate of 1E-4/hour, which I take to mean an MTBF of 10,000 hours. This means that C1 and C2 each have a 37% chance of operating correctly (that is, being able to shut down the machine) after 10,000 hours.

Put another way, C1 and C2 each have a 63% chance of failing in that 10,000 hour period, and there is a 40% probability (0.63 * 0.63) that both will fail in the 10,000 hour period, assuming no repairs or maintenance are done.

In order to calculate a probability of success - that the checkers C1 and C2 will work as desired - you need to specify a time period. You did not indicate the time period, so I will use 1,000 hours in the example below.

Over that 1,000 hr period, C1 and C2 each have a probability of working (Ps) of 90.5%, or a probability of failing (Pf) of 9.5%. The probability that one or both of them will work properly (shut the machine down if a dangerous condition is detected) is 1-(1-Ps)(1-Ps) = 99.1%.

Is this consistent with your view?

Analysis based on 1 hour period

In a given 1 hour period, C1 and C2 each have a Ps=0.999900. Combined per the above model, the Ps of both of them taken together is 1.000000 to six decimal places, given that the Pf of both of them together is 9.999E-9, which I think is the same (to within the number of digits) as your 1E-8/hour.
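For readers who want to rerun these numbers, here is a minimal Python sketch of the model used in this answer. It assumes a constant failure rate (exponential lifetimes), independent failures, and no repair within the period; the function names are made up for illustration.

```python
import math

MTBF_HOURS = 10_000.0  # each of C1 and C2, i.e. 1E-4 failures/hour


def p_survive(t_hours: float, mtbf: float = MTBF_HOURS) -> float:
    """Probability that one circuit still works after t_hours."""
    return math.exp(-t_hours / mtbf)


def p_both_fail(t_hours: float, mtbf: float = MTBF_HOURS) -> float:
    """Probability that both independent circuits have failed by t_hours."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    p_fail = -math.expm1(-t_hours / mtbf)
    return p_fail * p_fail


if __name__ == "__main__":
    for t in (10_000, 1_000, 1):
        ps_one = p_survive(t)
        pf_both = p_both_fail(t)
        print(f"t={t:>6} h  Ps(one)={ps_one:.6f}  "
              f"Pf(both)={pf_both:.3e}  Ps(at least one)={1 - pf_both:.6f}")
```

The three periods in the loop are the 10,000 hour, 1,000 hour and 1 hour cases discussed above. Note that this is the probability that both have failed by the end of the period, not the probability that they fail within the same hour.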

edited May 07 '22 at 15:01

answered May 06 '22 at 14:01

Thanks SteveSh. The time period I gave was per hour, but per 1000 hours is valid for consideration. You've shown that the probability of both failing in that time period is 0.9% (i.e. 100%-99.1%). I calculated 1E-8/hour. The big difference is that I've calculated the chance of them failing in the same hour, so that's why multiplying up gives just 1E-8 * 1000 = 1E-5 for 1000 hours: that 1E-5 is the chance that they both fail in the same hour at some point in the 1000 hours. Your probability covers one failing 100s of hours before the other. I think PFHD calculations are done per hour though. – user1228123 May 07 '22 at 13:25
Adding in the self checking aspect ensures that the state of one of C1 or C2 being operational doesn't allow the system to reset. This means that both C1 and C2 have to fail close enough together for the user to persist in trying to reset the system. E.g. if they failed days apart, the user would have given up trying by then and servicing would have been initiated. So maybe PFHD should be the probability of them both failing in an hour (assumes 1 hour of retrying) plus the probability of C3 having failed before C1 and C2 both fail. That then makes probability of an unsafe state in 1000 hours = – user1228123 May 07 '22 at 13:32
probability of an unsafe state in 1000 hours = (1E-4 * 1E-4)*1000 + (1000*1E-8)*(1000*1E-7) = 1E-5 + 1E-9 = 1E-5. The second constituent of the probability is overstated as it doesn't consider the order in which failure happened, but it's negligible anyway. – user1228123 May 07 '22 at 13:42
I think it all comes down to an understanding of the standard approach. Maybe the self check aspect just makes it valid to calculate it on an hourly basis and conclude PFHD=1E-8/hour, because otherwise, as you point out they could fail at different times and be unnoticed. – user1228123 May 07 '22 at 13:44
Rats! The third comment should be: probability of an unsafe state in 1000 hours = (1E-4 * 1E-4)*1000 + (1000*1E-4)*(1000*1E-4)*(1000*1E-7) = 1E-5 + 1E-6 = 1E-5 (approximately). The second constituent of the probability is overstated as it doesn't consider the order in which failure happened, but it's smaller anyway. – user1228123 May 07 '22 at 14:17
@user1228123 - I don't follow your analysis in assessing the probability of an unsafe state. First of all, 1E-4 is not the probability of an unsafe state in C1 or C2 (C1 or C2 failing). It's 1-e^(-t/MTBF), which is 0.095 for t=1,000 hrs and MTBF=10,000 hrs. – SteveSh May 07 '22 at 15:01

score 0 · Answer 3 · answered May 09 '22 at 08:35

I don't know if this is the correct answer according to the standardised approach, but:

let the probability of failure per hour of each circuit of type C1/C2 be p
let the probability of failure per hour of C3 be p'
let T be the service life of the machine in hours

There are two scenarios under which the system fails unsafely:

Case 1: Both C1 and C2 fail within the same hour (1 hour taken as the time a user would persist with a system that refuses to reset) at any point in the life of the machine
Case 2: C3 fails first and then C1 and C2 fail at any point thereafter in the life of the machine

Case 1 is easy and is: Pr(Case 1) = T.p.p

Case 2 is trickier. We first have to work out the probability of C3 failing before a circuit of type C1/C2. A statistical whizz could prove that this probability is of order (p' / p), and that feels intuitively correct. There are two circuits, C1 and C2, though, so the probability that C3 fails before either of them is halved: (p' / 2p). If the service life of the machine were infinite, that would be the end of the story for Case 2, as C1 and C2 would always subsequently fail, but that isn't the case. The probability that C1 fails in a service life of T hours is (1 - (1-p)^T), which approximates to T.p if T << 1/p. So the probability of both failing in service life T is that squared, (T.p) * (T.p), and Pr(Case 2) = (p' / 2p) * (T.p) * (T.p) = T^2.p.p' / 2.

So the probability of dangerous failure in the life of the machine is:

Pr(Case 1 or Case 2) = Pr(Case 1) + Pr(Case 2) [assumes the two cases are mutually exclusive, which isn't right, so it will overestimate]

Pr(Case 1 or Case 2) = T.p.p + T^2.p.p' / 2 = T.p.(p + T.p' / 2)

I guess you could then state it as per hour by dividing by T:

PFHD = p.(p + T.p' / 2)

With my numbers in the question, and adding a service life of 1000 hours: p=1E-4, p'=1E-7, T=1000:

PFHD = p.(p + T.p' / 2) = 1E-4 * (1E-4 + 1000 * 1E-7 / 2) = 1.5E-8/hour

[Note: if T.p isn't much less than 1, but is instead equal to 1 (T = 1E4 hours with my numbers), then (1 - (1-p)^T) = 0.632 and the Case 2 probability will tend to 0.400*(p' / 2p), so PFHD = p.p + 0.4*(p' / 2p)/T = 1E-8 + 0.4*(1E-7 / 2E-4)/1E4 = 3E-8/hour]
[Note: if T.p is instead much greater than 1, then the Case 2 probability will tend to (p' / 2p), so in the limit PFHD = p.p + ((p' / 2p)/T). Hmm, that feels problematic though, as it goes down with increased service life]
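The approximation used for Case 2 can be checked numerically. The sketch below compares the exact per-circuit failure probability 1 - (1-p)^T with the linear form T.p, using the value of p from the question; the function name is hypothetical, and treating each hour as an independent trial is the same simplification the answer makes.

```python
def p_fail_exact(p: float, t_hours: int) -> float:
    """Probability that one circuit fails at least once in t_hours,
    treating each hour as an independent trial with failure probability p."""
    return 1.0 - (1.0 - p) ** t_hours


if __name__ == "__main__":
    p = 1e-4  # per-hour failure probability of C1 or C2, from the question
    for t in (100, 1_000, 10_000, 100_000):
        exact = p_fail_exact(p, t)
        linear = t * p
        print(f"T={t:>7}  exact={exact:.4f}  T.p={linear:.4f}  "
              f"both fail (exact squared)={exact ** 2:.4f}")
```

At T.p = 0.1 the linear form overstates the exact value by about 5%, while at T.p = 1 and beyond it is clearly no longer usable, which is why the notes switch to the exact expression there.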

I've no idea if this is the correct formulation, but it makes some sense probabilistically.

The probability in the question was the probability that all three circuits fail in the same hour, which isn't relevant.

I wanted an answer that was definitive against the standard approach, which this isn't so it remains unanswered, but perhaps this helps.
