10

I have a wireless sensor circuit with a microcontroller, a 2.4 GHz transceiver module, some integrated sensors with an I²C interface, a UART port, and the necessary discrete components.

This board is designed to scavenge power from a solar (PV) panel, with a LiPo battery and a shunt charger. This allows the sensor to be self-powered and to operate indefinitely, requiring the least possible maintenance.

I'd like to explore the possible faults that can occur in a system like this due to aging, violation of environmental specs (temperature, humidity and so on), or improper maintenance (not design issues/bugs), in order to maximize its operating lifetime.

The environment in which the sensor node operates is a building, with the node stuck to the ceiling or the walls, so extreme temperatures and rain are not a concern.

What I came up with are some faults that I try to summarize:

  • Component broken -> open/short circuit
  • Sensor faulty -> wrong output values (but how wrong?)
  • Degraded insulation due to dust/water -> increased leakage
  • Temperature out of range -> ???

How can I estimate how the sensor node is going to fail, and why?

clabacchio
  • Don't forget that the sensor can be just smashed by whoever/whatever and mechanically broken which can cause any faults you could imagine. – sharptooth Jan 24 '12 at 09:57
  • Yes, so far I was also neglecting tampering, as it's a limit case... but any suggestion is welcome! – clabacchio Jan 24 '12 at 10:14
  • Solar panel getting mucked up and not generating enough power. I'm sure the life of some MEMS devices is very sensitive to the environment... guessing. – kenny Jan 27 '12 at 15:39
  • What is the purpose of your study? It could for instance be reducing failure rate, reducing failure effect (fail soft), reducing risk (detecting failure instead of bluntly going on), etc, which all require different approaches. – Wouter van Ooijen Jun 12 '12 at 17:31

5 Answers

8

There are far too many degrees-of-freedom to understand "all" the possible faults. There are, however, techniques to identify and mitigate faults early in the design cycle (i.e. before wide release).

Design-time activities (pre-hardware)

Peer review is always a great way to find bugs. Have someone else analyze your design, and be prepared to defend against their questions (or acknowledge that they found a bug, and fix it!) There's no substitute for scrutiny, and fresh eyes often see things that are missed by tired ones. This works for both hardware and software - schematics can be reviewed just as easily as source code.

For the hardware, as others have said, a DFMEA (Design Failure Mode and Effects Analysis) is a good recommendation. For each component, ask yourself "what happens if this shorts out" and "what happens if this goes open-circuit", and make a record of your analysis. For ICs, also imagine what happens if adjacent pins are shorted to each other (solder bridges, etc.)

For the firmware, static code analysis tools (lint, MISRA compliance checkers, etc.) can be used to reveal hidden bugs in the code. Things like dangling pointers and assignment-instead-of-comparison (= vs ==) are common 'oopsies' that these tools will not miss.
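
To make the assignment-vs-comparison trap concrete, here is a minimal C sketch of the kind of bug a static analyzer flags immediately (the sensor-status names are invented for illustration):

```c
#include <stdio.h>

#define SENSOR_ERROR 1

/* Stub standing in for a real driver call. */
static int read_sensor_status(void) { return 0; /* 0 = OK */ }

int main(void)
{
    int status = read_sensor_status();

    /* Bug: '=' assigns SENSOR_ERROR to status, so this branch is
       ALWAYS taken. lint-style tools flag it; writing the constant
       first (SENSOR_ERROR == status) turns the typo into a compile
       error. */
    if (status = SENSOR_ERROR) {
        printf("sensor fault!\n");
    }
    return 0;
}
```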

A written theory of operation is also very helpful, for both hardware and software. A theory of operation should describe at a fairly high level how the system works, how the protections work, sequencing, etc. Simply putting into words how the logic should flow often leads to the realization that some cases may have been missed ("Um, waitasec, what about this condition?")

Prototype level testing

Once you get hardware in hand, it's time to get to "work".

After all of the theoretical analysis is done, it is crucial to accurately characterize how the device operates within spec. This is commonly referred to as validation testing or qualification. All of the allowable extremes need to be tested.

Another important qualification activity is component stress analysis. Every part is evaluated against its maximum voltage/current/temperature, in a defined operating condition. To ensure robustness, an appropriate derating guideline should be applied (don't exceed 80% of rated voltage, 70% of rated power, etc.).
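
As a trivial worked example of that guideline (the part values here are invented for illustration): a 25 V capacitor seeing 21 V worst-case sits at 84% of its rating, which violates an 80% derating rule.

```c
#include <stdio.h>

/* Hypothetical part: numbers chosen only to illustrate the check. */
#define RATED_VOLTAGE    25.0f  /* V, from the datasheet */
#define APPLIED_VOLTAGE  21.0f  /* V, worst case in circuit */
#define DERATING_LIMIT    0.80f /* 80% guideline from the text */

int main(void)
{
    float stress = APPLIED_VOLTAGE / RATED_VOLTAGE;
    printf("voltage stress: %.0f%% of rating (limit %.0f%%) -> %s\n",
           stress * 100.0f, DERATING_LIMIT * 100.0f,
           stress <= DERATING_LIMIT ? "OK" : "FAIL, use a higher-rated part");
    return 0;
}
```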

Only once you know how things are under normal conditions can you start to speculate about external abnormals, or multiple abnormals like you're describing. Again, the DFMEA model (what happens if X happens) is a good approach. Think of any possible thing a user could do to the unit - short outputs, tie signals together, spill water on it - try them, and see what happens.

A HALT test (highly accelerated life test) is also useful for these types of systems. The unit is put into an environmental chamber and exercised from minimum to maximum temperature, minimum and maximum input and output, with vibration. This will find all sorts of issues, both electrical and mechanical.

This is also a good time to do some embedded fuzz testing - exercise all of the inputs well beyond their expected ranges, send gibberish in through UARTs / I2C, etc. to find holes in the logic. (Bit-banged I2C routines are notorious for locking up the bus, for instance.)
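
A host-side fuzzer for the UART can be as crude as the sketch below (POSIX I/O; the /dev/ttyUSB0 path, frame size, and pacing are assumptions, and the port is presumed already configured):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/ttyUSB0", O_WRONLY | O_NOCTTY);
    if (fd < 0)
        return 1;

    unsigned char buf[64];
    for (int frame = 0; frame < 100000; frame++) {
        for (size_t i = 0; i < sizeof buf; i++)
            buf[i] = (unsigned char)rand();       /* pure gibberish */
        if (write(fd, buf, sizeof buf) < 0)
            break;
        usleep(10000);  /* pace it so a hung parser is observable */
    }
    close(fd);
    return 0;
}
```

Watch the node's behaviour (and current draw) while this runs; firmware that stops beaconing or needs a power cycle afterwards has a hole in its input handling.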

Strife testing is a good way to demonstrate robustness. Disable any protection features like overtemperature, overload, etc., and apply stress until something breaks. Take the unit up as high in temperature as it can go until something fails or some erratic behaviour occurs. Overload the unit until the powertrain fails. If some parameter fails only slightly above worst-case conditions, it's an indication of marginality, and some design consideration may have to be revisited.

You can also take the next-level approach and physically test some of your DFMEA conclusions - actually do the shorts and opens and pin-shorts and see what blows up.

Further reading

My background is in power conversion. We have an industry standard, IPC-9592A, which is an effort to standardize how products should be qualified: which tests to run and how to run them. Many of the tests and methodologies referred to by this document could easily be used in other electrical disciplines.

Adam Lawrence
6

With multiple devices on the I2C interface you have the possibility of the "babbling idiot" problem, where one device fails, hogs the bus, and kills all other I2C transmissions.
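
One common defence (a standard technique, not something from this answer) is the nine-clock bus-unstick sequence: if a slave is holding SDA low, clock SCL manually until it releases the line, then issue a STOP. A sketch, assuming the MCU can temporarily drive the I2C pins as GPIOs (gpio_write/gpio_read/delay_us are hypothetical HAL calls):

```c
#define SCL_PIN 0
#define SDA_PIN 1

void gpio_write(int pin, int level);   /* hypothetical HAL */
int  gpio_read(int pin);               /* hypothetical HAL */
void delay_us(unsigned us);            /* hypothetical HAL */

int i2c_bus_recover(void)
{
    /* A wedged slave holds SDA low; up to 9 clock pulses let it
       finish the byte it believes it is transferring. */
    for (int i = 0; i < 9 && gpio_read(SDA_PIN) == 0; i++) {
        gpio_write(SCL_PIN, 0); delay_us(5);
        gpio_write(SCL_PIN, 1); delay_us(5);
    }
    if (gpio_read(SDA_PIN) == 0)
        return -1;                 /* still stuck: power-cycle the bus */

    /* Generate a STOP: SDA low -> high while SCL is high. */
    gpio_write(SDA_PIN, 0); delay_us(5);
    gpio_write(SDA_PIN, 1); delay_us(5);
    return 0;
}
```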

Soak testing combined with environmental testing would provide a different form of failure analysis. Using marginal components, maximum/minimum/fluctuating temperatures, different humidities, dirty power supplies, noisy RF environments, etc. over a period of time simulates a much longer period of normal usage. The system will have real failures, and failure rates can be calculated.

clabacchio
spearson
3

The most likely fault is firmware bugs. Everything I've done has had a few.

Make sure you have a watchdog timer enabled, and require all critical repeated functions to happen before "petting the dog." I like to set a flag in the timer interrupt and use it to clear the watchdog in the main loop.
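
A minimal C sketch of that pattern (wdt_reset(), the timer ISR hookup, and critical_tasks_ran_ok() are hypothetical placeholders for your MCU's HAL):

```c
#include <stdbool.h>

void wdt_reset(void);              /* hypothetical: kick the watchdog */
bool critical_tasks_ran_ok(void);  /* hypothetical: sensors, radio, ... */

static volatile bool tick_flag = false;

/* Periodic timer ISR: only sets a flag, never touches the watchdog. */
void timer_isr(void)
{
    tick_flag = true;
}

void main_loop(void)
{
    for (;;) {
        /* ... sample sensors, service the radio, sleep ... */

        /* Pet the dog only when the timer is alive AND the critical
           work completed, so a wedged main loop (or a dead timer)
           lets the watchdog reset the part. */
        if (tick_flag && critical_tasks_ran_ok()) {
            tick_flag = false;
            wdt_reset();
        }
    }
}
```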

Test your firmware recovery over reset cycles too.

Since startup is when a lot of failures occur, I like to power the unit through a relay, then write a quick script to power cycle, wait for the radio to indicate wakeup, and repeat. Then do this for 10,000 cycles or so.
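
The harness can be as simple as this C sketch (relay_set_power(), wait_for_radio_hello(), and sleep_ms() are hypothetical helpers for whatever relay board and radio monitor is on hand):

```c
#include <stdio.h>

int  relay_set_power(int on);               /* hypothetical */
int  wait_for_radio_hello(int timeout_ms);  /* hypothetical */
void sleep_ms(int ms);                      /* hypothetical */

int main(void)
{
    int failures = 0;
    for (int cycle = 0; cycle < 10000; cycle++) {
        relay_set_power(0);
        sleep_ms(2000);    /* let the bulk capacitors bleed down */
        relay_set_power(1);
        if (wait_for_radio_hello(5000) != 0) {
            failures++;
            printf("cycle %d: no wakeup beacon\n", cycle);
        }
    }
    printf("%d failures in 10000 cycles\n", failures);
    return 0;
}
```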

markrages
  • Very interesting power-on test. My last company had a project that had to run for multiple years staying synced to a dumb transmitter and could not fault during that time; removing firmware bugs was probably the hardest part. – Kortuk Jun 11 '12 at 13:21
2

A few obvious ones:

  • Battery failure. Possibly loss of electrolyte leading to contamination of the electronics
  • Overvoltage from the PV system
  • Is it moving or near machinery? Then shock/vibration
  • Loss of communication due to external environment (rain/snow absorbing the signal, etc).

If you're doing an FMEA you need to first consider how critical the system is before you can decide what constitutes a fault.

lyndon
2

I am surprised that nobody has mentioned Accelerated Life Testing and Highly Accelerated Life Testing.

One of the important tools you have at your disposal is the rule of thumb that for every 10 °C rise in temperature, the expected life of a component is roughly halved. You can get some idea of the life of your product by testing it at a greatly increased temperature. You don't have to test components beyond their rated temperature to take advantage of this.
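
Written as an acceleration factor AF, with ΔT the rise in °C above the normal operating temperature, that rule of thumb is:

```latex
AF = 2^{\Delta T / 10},
\qquad \text{e.g. } \Delta T = 30\,^{\circ}\mathrm{C}
\;\Rightarrow\; AF = 2^{3} = 8 .
```

So, under this rule of thumb, 1,000 hours in a chamber 30 °C above the nominal operating temperature stands in for roughly 8,000 hours of field operation.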

Rocketmagnet