6

Have you ever seen an otherwise happy AVR spontaneously glitch and require a reset?

Assuming:

  1. a nice steady power supply that stays inside the specified range
  2. a correctly sized decoupling cap directly between Vcc and ground
  3. normal (not too noisy) household conditions

...what is the real-world mean time between glitches?

I've had hundreds of AVRs running for years and I don't think I have ever seen a real glitch, but maybe I am just lucky?

Yes, I know that you should always use a watchdog, so please don't flame me. But if the likelihood of glitches is very low, there could be applications where it would be reasonable to skip the watchdog to get lower power usage in sleep.

Note also that I understand that the watchdog also protects you from firmware bugs, but I am only asking about spontaneous hardware glitches.

bigjosh
  • Just like you use those features for software, because software can have bugs, hardware can have bugs too. – PlasmaHH Nov 23 '14 at 21:53
  • A watchdog wouldn't be particularly good at guarding against random hardware glitches anyway - you can't possibly monitor every condition that could be modified by a lightning strike or a gamma ray. – Nick Johnson Dec 17 '14 at 17:01
  • If you presume that the error will show up as a spontaneous random change in volatile memory (RAM or a register, which includes the program counter), then I think it is possible to write software that can recover from a single corruption event. Imagine that all your program does is blink an LED on and off. If you fill all of program memory with repeating instructions that alternate between on and off, then I think you can guarantee that the blinking will resume its normal pattern within a single iteration following the corruption. The more complicated the operation, the harder it gets. – bigjosh Dec 18 '14 at 20:15
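
For data in RAM (as opposed to the program counter), one common way to recover from a single corruption event is to keep three copies of a critical variable and take a bitwise majority vote on every read. The following is only a minimal sketch of that idea: the type `tmr_u8` and the functions `tmr_write` and `tmr_read` are hypothetical names, not anything from the thread, and it assumes that at most one of the three copies is corrupted between two reads.

```c
#include <stdint.h>

/* Three redundant copies of one critical byte (hypothetical type). */
typedef struct {
    volatile uint8_t a;
    volatile uint8_t b;
    volatile uint8_t c;
} tmr_u8;

/* Store the same value in all three copies. */
void tmr_write(tmr_u8 *v, uint8_t value)
{
    v->a = value;
    v->b = value;
    v->c = value;
}

/* Return the bitwise majority of the three copies and write it back,
 * so that a single corrupted copy is repaired on the next read. */
uint8_t tmr_read(tmr_u8 *v)
{
    uint8_t a = v->a;
    uint8_t b = v->b;
    uint8_t c = v->c;
    uint8_t voted = (uint8_t)((a & b) | (a & c) | (b & c));

    v->a = voted;
    v->b = voted;
    v->c = voted;
    return voted;
}
```

This does nothing for a corrupted program counter, stack pointer, or peripheral register, so it complements a watchdog rather than replacing one.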

4 Answers

4

Cosmic ray hits and SEUs (single-event upsets) are very real. Just look up data about DRAM and the need for ECC (error-correcting code) memory, and from that you should be able to get a sense of the probability vs. area. Some processes are less prone than others. Smaller processes are more sensitive but also present a smaller capture cross section; sometimes that is a net benefit and sometimes not.

Keep those watchdogs running!

placeholder
  • 1
    I totally accept that these things happen and we should be worried about them - I am just trying to get a loose quantitative feel for how often they actually do happen. For a single MPU, what would you expect is the real-world mean time between events? 1 year? 10 years, 1 million years? – bigjosh Nov 22 '14 at 02:31
  • 1
    +1, but this answer would be improved with some hard numbers. The first large-scale study of DRAM errors, as mentioned in the [Wikipedia ECC memory article](http://en.wikipedia.org/wiki/ECC_memory), measured an average of around 200 to 600 DRAM errors per year per gigabit. – davidcary Nov 28 '14 at 13:06
  • @davidcary Hard numbers? Hmmm, I tell you what: you tell me what process this is and let me see the layout, and I'll give you a very hard number. Until one knows these details, saying more is simply guessing. – placeholder Nov 28 '14 at 18:57
  • 1
    @placeholder: OK, I'll bite. Can you give me some numbers about the 350 nanometer process ATMEGA88; you can see the layout at [Cesar ATMEGA88 Teardown](http://blog.ioactive.com/2008/01/atmega88-teardown.html). Or you could write a sentence about whatever you think is the most important number from the DRAM study that the Wikipedia ECC memory article links to. – davidcary Nov 28 '14 at 21:23
  • @davidcary Sure, I'll wait until you can tell me the well depths and S/D implant depths; if you don't have those, then the implant energies will suffice. And those pictures? I need to see active, or even just an estimate of well plan-view areas and junction edge length. – placeholder Nov 28 '14 at 22:47
  • 2
    So which is it? To get a rough estimate of average errors per year, can I "Just look up data about DRAM ... and from that you should be able to get a sense for the probability vs. area."? Or do I also need a bunch of information about well depths or implant energies that none of the datasheets for the parts I use ever mention? – davidcary Dec 02 '14 at 04:01
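
For a loose feel, the DRAM figure quoted above (around 200 to 600 errors per year per gigabit) can be scaled down linearly to the amount of RAM in a small microcontroller. This is only a back-of-the-envelope sketch: it assumes that a field DRAM error rate applies unchanged to on-chip SRAM, which the thread itself disputes, and it takes 1 gigabit as 1e9 bits. The function names are hypothetical.

```c
/* Expected bit errors per year for `bits` of memory, given a rate quoted
 * in errors per year per gigabit (1 gigabit taken as 1e9 bits).
 * Assumption: the error rate scales linearly with the number of bits. */
double errors_per_year(double rate_per_gbit_year, double bits)
{
    return rate_per_gbit_year * bits / 1e9;
}

/* Mean time between errors, in years, for the same inputs. */
double mean_years_between_errors(double rate_per_gbit_year, double bits)
{
    return 1.0 / errors_per_year(rate_per_gbit_year, bits);
}
```

For the 1 KB (8192 bits) of SRAM in an ATmega88, `mean_years_between_errors(200.0, 8192.0)` comes out at roughly 610 years and `mean_years_between_errors(600.0, 8192.0)` at roughly 203 years per device. Treat that as an order-of-magnitude guess at best, since DRAM cells, SRAM cells, and registers differ in sensitivity.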
3

Depends on the environment and the configuration. It is just about impossible to practically guarantee that, say, a nearby lightning strike will not have enough EMI energy to cause a problem. You can reduce the likelihood with good design, but unless the system is in a Faraday cage with magnetic shielding and heavily filtered feedthroughs there is some possibility of an upset. In space applications, the earth's magnetic field does not have the usual shielding effect, so random upsets are more likely than on earth (but still non-zero in either case). The chances of a small self-contained system (no inputs or outputs, and battery powered) seeing an upset are much lower than if there are wires attached.

There are plenty of systems out there without watchdogs and without proper reset circuits: if the cost of a lockup is low, nobody cares (just cycle the power!). If the cost is high, then using a WDT (internal or external), redundant processors, mechanical overrides or other means may be desirable. Modern processors (and better software design) can support reset on anomalies even without a WDT, for example if the program counter goes out of range. Unused memory may be filled with jumps to a cold-start routine, and other techniques can be used. I'm sure there are a lot of WDTs in use that are pretty much useless because they're being kicked by an ISR or something silly like that.
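
As a rough illustration of the alternative to kicking from an ISR, the sketch below only kicks the watchdog from the main loop, and only after every task has checked in since the last kick. The task names, `task_checkin`, `watchdog_service`, and the stub `hw_watchdog_kick` are all hypothetical; on an AVR built with avr-libc the stub would presumably call `wdt_reset()` from `<avr/wdt.h>`, but here it just counts kicks so the logic can be tried on a PC.

```c
#include <stdint.h>

/* Hypothetical task identifiers, one bit each. */
#define TASK_SENSOR  (1u << 0)
#define TASK_CONTROL (1u << 1)
#define TASK_COMMS   (1u << 2)
#define ALL_TASKS    (TASK_SENSOR | TASK_CONTROL | TASK_COMMS)

static volatile uint8_t alive_flags; /* set by tasks, cleared on each kick */
static unsigned kick_count;          /* stand-in for the hardware watchdog */

/* Stub for the hardware kick (assumption: wdt_reset() on avr-libc). */
static void hw_watchdog_kick(void)
{
    kick_count++;
}

/* Number of kicks so far; only needed for testing on a host. */
unsigned watchdog_kick_count(void)
{
    return kick_count;
}

/* Each task calls this once per healthy pass through its work. */
void task_checkin(uint8_t task_bit)
{
    alive_flags |= task_bit;
}

/* Call from the main loop only, never from a timer ISR: a stuck main
 * loop or a stalled task then stops the kicks and the WDT can fire. */
void watchdog_service(void)
{
    if (alive_flags == ALL_TASKS) {
        alive_flags = 0;
        hw_watchdog_kick();
    }
}
```

If any one task stops checking in, `watchdog_service` stops kicking and the hardware timer eventually resets the part, which is the behaviour a kick-from-ISR design loses.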

Spehro Pefhany
  • 1
    Is there a name for these techniques, perhaps [immunity-aware programming](http://en.wikipedia.org/wiki/immunity-aware_programming)? – davidcary Nov 28 '14 at 13:07
  • I'd put them under the broader category of safety-related software engineering techniques (because that's often the motivator for reducing the risk- property damage or possible injury or death to people). Never heard of "immunity-aware", thanks. – Spehro Pefhany Nov 28 '14 at 14:29
2

Interesting official word from ATMEL:

Hello Josh, I understand that you’re concerned about the interrupt control bits getting flipped randomly. This could not happen unless they’re somehow modified in the firmware or the device is kept in a noisy environment that could cause flash corruption. To prevent the possibility of flash corruption, please refer to the device datasheet section 18.7 Preventing flash corruption. As long as the design conforms to the considerations mentioned for preventing flash corruption, there is no possibility of the interrupt control bits getting corrupted in the device. Hope this clarifies. Please get back to us in case of further queries.

Best regards, Ineyaa N Atmel Support Team

UPDATE

One year later, I now have tens of thousands of these little AVRs out in the world running 24/7, and so far I have not seen a single case of a spontaneous glitch. Pretty amazing. Will update next year!

bigjosh
1

Well... in a typical environment with modern microcontrollers, it does not happen often.

So rarely that it's hard to measure.

It depends on many factors, including undesirable events on the production line. Hardware glitches should never happen in an undamaged microcontroller working in a normal environment, so datasheets don't say anything about reliability.

I personally don't use a watchdog very often, because many of my projects just don't require such protection.

When I do use it, I use it for:

  • extra software bug protection
  • partially damaged microcontroller protection

I only use it when:

  • the microcontroller drives some expensive peripherals (big transistors with a big load)
  • safety is involved: the circuit is connected to mains somehow, or the microcontroller drives something that may destroy something or harm someone
  • the data processed in the microcontroller must be very reliable
Kamil