Programs resistant to hardware issues

Question

I recall at one point reading about embedded development where the programmer took into account things like memory corruption and possibly other hardware issues. For example:

1. If an instruction in memory is somehow corrupted, the program would run correctly anyway.
2. If the value of some variable in memory is changed, the program will still produce the correct result.

Dealing with #2 seems like a reasonable application of error correcting codes, but #1 seems to me like it would be very difficult. Does anyone know of any references or examples of someone doing that in software?

Much higher level, but somewhat related: http://codegolf.stackexchange.com/questions/4486/write-a-program-that-always-outputs-2012-even-if-its-modified — Snowball, Mar 29 '13 at 16:31

score 6 · Accepted Answer · answered Mar 29 '13 at 09:03

There are various techniques to reduce problems like the ones you mention, but there is no 100% solution.

Memory corruption can be corrected by error-correcting code (ECC) memory, at the cost of extra memory and the correcting hardware itself (which adds delay). In some cases you must take care to access all memory regularly, to prevent single-bit errors from developing into (uncorrectable) multi-bit errors.
Sensors are often a source of problems. Reading multiple values and averaging and/or throwing out the outliers helps.
Processors can fail, and software can contain bugs. The space shuttle is a famous example of multiple processors (not all of the same type!) and software written by independent sources. Arbitrating between processors/programs that claim different results can be tricky.
In most cases an occasional problem can be tolerated if it is detected and handled in a safe (or otherwise satisfactory) way. This can vary from halting the system to offering degraded performance.
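
As a rough illustration of the sensor point above (read several values, throw out the outliers, average the rest), here is a minimal sketch in C. The function name `trimmed_mean` and the choice to discard exactly one lowest and one highest reading are assumptions made for the example, not something prescribed by the answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: average the samples after discarding the single
 * lowest and the single highest reading. Requires n >= 3. */
int32_t trimmed_mean(const int32_t *samples, size_t n)
{
    assert(samples != NULL && n >= 3u);

    int32_t lo = samples[0];
    int32_t hi = samples[0];
    int64_t sum = 0;                 /* wide accumulator avoids overflow */

    for (size_t i = 0; i < n; i++) {
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
        sum += samples[i];
    }

    /* Drop one minimum and one maximum, then average what is left. */
    return (int32_t)((sum - lo - hi) / (int64_t)(n - 2u));
}
```

With readings such as 100, 101, 99, 5000, 100, the glitch value 5000 and the lowest value 99 are dropped and the result is 100. A real system might prefer a median or a wider trim, depending on how noisy the sensor is.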

In practice you will have to assess which problems are likely to occur, and then find ways to handle those problems. There is no catch-all solution.

It should also be noted that not all errors affect behavior. E.g., the change of a value derived only from other values and used only to drive a binary decision will not affect behavior as long as the change does not change the binary decision. Similarly, areas where lossy compression can be applied can be more tolerant of errors. Even something like motor control can be made to tolerate errors (e.g., a robot would normally have navigational checks that would correct for too little or too much movement). — , Mar 29 '13 at 23:25

Jeanne Pindar · Answer 2 · 2015-01-20T01:17:35.240

This isn't exactly what you describe, but there are several techniques for forcing embedded systems to reboot when something unexpected happens, after which they hopefully will run correctly.

A watchdog timer is a circuit that reboots the processor if the timer does not get reset by a software instruction every so often. This adds protection against the program getting into an infinite loop, getting stuck waiting for a peripheral, etc.
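
To make the mechanism concrete, here is a small software model of a watchdog in C. It is only a sketch: a real watchdog is a hardware peripheral with vendor-specific registers, and the names `watchdog_t`, `wdt_init`, `wdt_kick`, and `wdt_tick` are hypothetical. Where this model sets `expired`, real hardware would reset the processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog timer. */
typedef struct {
    uint32_t timeout_ticks;  /* reload value */
    uint32_t remaining;      /* ticks left before the "reset" */
    bool     expired;        /* stands in for the hardware reset */
} watchdog_t;

void wdt_init(watchdog_t *w, uint32_t timeout_ticks)
{
    assert(w != NULL && timeout_ticks > 0u);
    w->timeout_ticks = timeout_ticks;
    w->remaining = timeout_ticks;
    w->expired = false;
}

/* Called by healthy application code to restart the countdown. */
void wdt_kick(watchdog_t *w)
{
    if (!w->expired) {
        w->remaining = w->timeout_ticks;
    }
}

/* Called once per timer tick; returns true once the watchdog has expired. */
bool wdt_tick(watchdog_t *w)
{
    if (!w->expired) {
        w->remaining--;
        if (w->remaining == 0u) {
            w->expired = true;
        }
    }
    return w->expired;
}
```

The main loop would call `wdt_kick` once per iteration. If the loop hangs, the kicks stop, the countdown reaches zero, and the reset fires.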

Filling unused program memory with instructions that cause a reboot (or a jump to a location that does that) will help if the processor somehow jumps to those addresses.

score 2 · Answer 3 · answered Mar 29 '13 at 15:55

The only way I could think of to implement #1 is to continually CRC the program memory against a known value and, if it fails the CRC check, retrieve a known good copy of the program from another source. There are many gotchas to this: the stored CRC value might get corrupted somehow, or there may be no 'other source' of your program readily available, so user intervention might be required. Also, since the corruption might be anywhere in program memory, it might affect the CRC code so it doesn't run, or the bootloader so it won't function (you can go as far down the rabbit hole as you like in this regard).
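
A minimal sketch of the check itself, in C, assuming the common CRC-32 variant (reflected polynomial 0xEDB88320, as used by Ethernet and zlib). The answer does not say which CRC to use, so that choice is an assumption, as are the function names `crc32_compute` and `image_is_intact`. On a real part you would pass the start address and length of the flash region, and the expected value would come from wherever your build stores it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Slow but tiny;
 * a table-driven or hardware CRC would be faster. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0u);

    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical helper: true if the image still matches its stored CRC. */
bool image_is_intact(const uint8_t *image, size_t len, uint32_t expected_crc)
{
    return crc32_compute(image, len) == expected_crc;
}
```

A false result cannot tell you whether the image or the stored CRC is the corrupted one, which is exactly the gotcha described above.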

You could combine the CRC check with a watchdog timer or external health monitor, so that if the CRC code fails to run or fails to produce the correct result, the microcontroller will reset and run a special recovery bootloader instead of the application. What the recovery bootloader would do depends on your application: it could somehow alert the users that a new program load is needed or, if you designed for it, attempt to retrieve a pristine copy of your program from external memory if available. The same rabbit hole as above applies: how do you know that the external memory hasn't been corrupted? Or, if the CRC is corrupted, your program would be right but always fail the check.

At some point your device can't handle this type of error by itself and if you want the thing to keep running you'll have to bring a development system and programmer to it to bring it back up. This type of scheme will probably add a few 9's to your reliability though even if it's not perfect.

score 1 · Answer 4 · answered Mar 29 '13 at 12:36

There are two techniques that I know of that are used in this area of study:

Virtualization.
Fault-tolerant computing. [http://en.wikipedia.org/wiki/Fault-tolerant_computer_system]

There may be more techniques in the industry that I don't know. Feel free to complete the list.

score 0 · Answer 5 · answered Mar 29 '13 at 14:14

As mentioned in other answers, there are dedicated HW solutions (such as ECC memory). I've seen several approaches to try to deal with corrupt/failing memory through software:

Write each memory block with a CRC/hash/checksum so that you can detect if a bit were erroneously changed or a memory cell went bad.
There are also special schemes designed to not just detect bad data (bit errors) but also correct it by introducing redundancies (Reed-Solomon Error Correction for example) which will allow you to read back the data correctly, even if part of it is wrong.
If memory is on the edge of going bad, it will show most at boundary cases (low voltage, for example). You can test for memory locations which are about to go bad and mark them as unusable by programmatically lowering the memory voltage to just above the lower operational threshold, writing a known test pattern, and then reading it back. If the memory is on the edge of failing, it can sometimes be detected before it fails, because your read-back pattern won't match your test pattern.

These are just a few ways that software can avert/detect corrupt/failed memory locations.
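
To illustrate the idea of correcting data through redundancy, here is a sketch in C of a much simpler scheme than Reed-Solomon: keep three copies of a value and take a bitwise majority vote on every read (often called triple modular redundancy). This is a swapped-in technique chosen for brevity, not the one named above, and the names `tmr_u32_t`, `tmr_write`, and `tmr_read` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Three redundant copies of one 32-bit value. 'volatile' keeps the
 * compiler from assuming the copies are always identical. */
typedef struct {
    volatile uint32_t copy[3];
} tmr_u32_t;

void tmr_write(tmr_u32_t *v, uint32_t value)
{
    assert(v != NULL);
    v->copy[0] = value;
    v->copy[1] = value;
    v->copy[2] = value;
}

/* Return the bitwise majority of the three copies and write it back,
 * so a flipped bit in one copy is repaired before a second one occurs. */
uint32_t tmr_read(tmr_u32_t *v)
{
    assert(v != NULL);
    uint32_t a = v->copy[0];
    uint32_t b = v->copy[1];
    uint32_t c = v->copy[2];

    /* Each result bit is 1 only if at least two copies agree on 1. */
    uint32_t voted = (a & b) | (a & c) | (b & c);

    v->copy[0] = voted;
    v->copy[1] = voted;
    v->copy[2] = voted;
    return voted;
}
```

This masks any corruption confined to one copy, at the cost of three times the memory. It does not help if the same bit flips in two copies, and ideally the copies would live in physically separate memory regions.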
