6

We have a bunch of ARM microcontrollers on test at the moment. The test runs for 360 hours and mostly completes without a hitch, but very occasionally, one of the microcontrollers will hang. We have seen this problem occur twice so far.

Looking at the firmware, there is only one type of place where the code might legitimately hang, and that's where it's waiting for an internal peripheral to complete (e.g. for an EEPROM write to complete, or a byte to finish transmitting on the SPI). There are no other while loops in the code.

There seem to be only two possibilities:

  1. One of the peripherals is getting stuck and failing to complete, causing the code to get stuck in a while loop.
  2. The CPU itself has stopped executing code.

Both of these scenarios seem unlikely, but I was wondering if anyone had seen anything similar?

Further to this question, "How common is it to receive PCBs with failed or weak vias?": we know that some of the PCBs the microcontrollers are mounted on have potentially weak vias. It's possible that the supply voltage was interrupted very briefly during the test.

Rocketmagnet
  • Is the MCU oscillator still running? Does it have a pin that outputs its clock that you can observe on a 'scope? Not a software-generated waveform on a GPIO, but the clock from the internal oscillator. As an aside, I've had boards where the MCU stopped because of EMI, as described [here](https://electronics.stackexchange.com/questions/308379/switching-hv-dc-relay-on-crashes-microcontroller/308416#308416). – TonyM Jul 28 '22 at 09:39
  • There's a third option: an (obscure) error in the code makes execution end up in a place where it shouldn't. – Unimportant Jul 28 '22 at 09:53
  • Regarding the topic of the question, yes, it is possible that messing with power can hang an MCU. How do you mess with the power then? Perhaps there are other things in the hardware that can cause the MCU to hang. We'd need to see the schematics and source code to analyze why it might hang and under what circumstances. – Justme Jul 28 '22 at 10:18
  • @TonyM - We don't have access to the MCU's internal clock, but a PWM output of the chip was still toggling, which means that the Master Clock was still running. The CPU clock is a PLL which runs off the Master Clock. – Rocketmagnet Jul 28 '22 at 11:25
  • That's good. Was the PWM frequency correct, too? – TonyM Jul 28 '22 at 11:40
  • @TonyM - The guy who 'scoped that didn't check the frequency at the time. – Rocketmagnet Jul 28 '22 at 11:46
  • @Unimportant - It's not impossible, but I would say it's massively unlikely. There are no GOTOs in the code, no assembler or function pointer arithmetic. And the MCUs run for hundreds of hours, executing the same code over and over again. – Rocketmagnet Jul 28 '22 at 11:48
  • Note that you might want to enable a watchdog timer in the production version. – user253751 Jul 28 '22 at 12:31
  • I've done long duration testing of hardware. My advice is to buy a USB logic analyzer that can run indefinitely and probe 1-2 debug pins per device which you periodically toggle at known points in the software. Review the logic analyzer data when a device crashes and see if it's doing what you expect. And absolutely probe your device power and make sure it's stable. – user1850479 Jul 28 '22 at 12:38
  • I wonder if the same PCBs would show the glitch if you tested them again. I would check the soldering of those boards. – RemyHx Jul 28 '22 at 12:41
  • As a quick (well, up to 360 hours) and dirty test, you could add decoupling directly onto the MCU pins, if accessible. If you can get 100 nF on, great, 10 uF in parallel as well is much better. See what that does. (Nearly posted this earlier, @MarcusMuller makes the same point.) – TonyM Jul 28 '22 at 12:55
  • @user253751 - Yes, we're implementing a watchdog timer now. If the code is indeed getting stuck, then the timer will trigger, and we'll be able to download a full stack trace from the chip. That should tell us where it's getting stuck (if that's what's happening). – Rocketmagnet Jul 28 '22 at 12:59
  • @RemyHx - Well we've seen the symptom twice now. Also, we've just received the batch of re-made PCBs from a different manufacturer, so hopefully we'll never see it again. – Rocketmagnet Jul 28 '22 at 13:00
  • You say you are using a PLL. So what happens if someone flips a light switch and a burst of EMI causes the PLL to lose lock, even momentarily? Does the MCU switch to a slower stable clock or keep running with an unstable clock? Is there an interrupt for loss of lock? – Justme Jul 28 '22 at 14:02
  • The other thing I have seen is sometimes you might have an input pin unconnected to anything, and if that pin can generate an interrupt, but there is no interrupt handler installed for it, then it can execute an infinite loop when that interrupt fires. A very small disturbance can cause a floating input to toggle. Make sure that interrupts are disabled if they are not used. – user57037 Jul 28 '22 at 16:36
  • Happens with your PC too. – DKNguyen Jul 28 '22 at 18:08
  • For ARM MCUs in particular, a common cause of such hangs is one of the first few exceptions/interrupts triggering unexpectedly, such as a "Bus Fault". These can happen in case of electrical disturbances and if in such a case the default "loop forever" handler is called, then it looks like crashing. – DCTLib Jul 29 '22 at 12:11

5 Answers

9

Very possible, if supply, and especially supply decoupling, is insufficient. It's even been used extensively in breaking into and reverse engineering electronics: it's one way of glitching the processor to take the wrong turn at some point.

Here's an article on it, aimed at a technical but not quite expert audience, linking to quite nice videos and articles on how to glitch USB firmware, where "messing with the supply" led to the CPU not noticing that it should stop outputting the bytes from the internal memory containing the device ID.

Marcus Müller
  • Very interesting video, thanks for the link. It certainly looks like it's possible to make the CPU execute incorrectly by glitching the power supply, and at that point, it's anybody's guess what it'll do. – Rocketmagnet Jul 28 '22 at 13:03
7

Regarding code getting stuck in a while loop, it’s always wise to specify an iteration limit, if only to generate some indication that something has gone terminally wrong. Regarding power fluctuations causing a processor to hang, this is quite possible, especially when carrying out relatively power-hungry tasks such as writing to flash.
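
As a rough illustration of the iteration limit suggested above, here is a minimal sketch in C. The names `wait_until_ready`, `ready_fn` and `fake_peripheral` are made up for this example and do not come from any vendor library; on a real target the callback would read a peripheral status register, and the poll limit would be chosen from the peripheral's worst-case timing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status source: returns true once the peripheral is ready.
 * On real hardware this would read a status register. */
typedef bool (*ready_fn)(void *ctx);

typedef enum { WAIT_OK = 0, WAIT_TIMEOUT = 1 } wait_result;

/* Poll is_ready at most max_polls times instead of spinning forever. */
static wait_result wait_until_ready(ready_fn is_ready, void *ctx,
                                    uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; ++i) {
        if (is_ready(ctx)) {
            return WAIT_OK;
        }
    }
    return WAIT_TIMEOUT; /* caller can log the failure, retry, or reset */
}

/* Simulated peripheral for host-side testing (an assumption of this
 * sketch): it reports "not ready" for a fixed number of polls. */
typedef struct {
    uint32_t polls_until_ready;
} fake_peripheral;

static bool fake_ready(void *ctx)
{
    fake_peripheral *p = (fake_peripheral *)ctx;
    if (p->polls_until_ready == 0) {
        return true;
    }
    p->polls_until_ready--;
    return false;
}
```

A caller would check the return value and, on `WAIT_TIMEOUT`, record which wait failed before recovering. That at least turns a silent hang into something that can be diagnosed.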

Frog
  • Indeed, and in all of our code, that's what we do. But there is a lot of library code that we didn't write that contains while loops. Still, those loops would only get stuck if some kind of hardware fault happened that caused a peripheral to get stuck. – Rocketmagnet Jul 28 '22 at 11:21
  • The cure for which is to improve the library code. – user_1818839 Jul 28 '22 at 11:37
  • Sometimes you **can't** specify an iteration limit with any practical value. Suppose you have a while loop that waits for the user to press a button. This event may happen in the next second or not for another 10 days. What iteration limit would you choose, and what would you do if the limit was exceeded? At some point the hardware itself must be reliable. – Elliot Alderson Jul 28 '22 at 12:44
  • @ElliotAlderson - In that case you would structure the code differently. Instead of waiting for the button in a while() loop, it would come and check the button every millisecond, and in between checks, it would work on its other jobs. That way it could never be stuck there. – Rocketmagnet Jul 28 '22 at 12:56
  • @user_1818839 - Sadly, yes. That may be what we have to do. – Rocketmagnet Jul 28 '22 at 12:56
  • @Rocketmagnet There aren't always other things to do, and this still won't prevent the process from being stuck in the while loop if the button hardware fails. My point is that sometimes you must trust the hardware. – Elliot Alderson Jul 28 '22 at 13:21
  • If there's nothing else to do but wait for a button press, and you're "stuck" in a while loop waiting for the button to be pressed and it's not exiting, then what you've got is just a broken button. Whether that's mechanical, electrical, or in the (loop-free) code which polls the button, it's not a code structure problem at that point. – Dannie Jul 29 '22 at 09:56
5

Does your MCU have a "brownout" reset vector?

By default, most BSPs define this vector as an endless loop, so an instability in the power supply or an unclean power cycle will make the CPU hang instead of reset.
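
One possible alternative to an endless-loop default handler is sketched below in C, under stated assumptions: the handler records why it was entered and then requests a reset, so the firmware can report the cause after reboot. All names here (`fault_record`, `handle_fault`, `fault_take_last_cause`, `fake_reset`) are hypothetical. The reset hook is passed in as a function pointer so the sketch also runs on a host; on a Cortex-M part it could wrap something like CMSIS `NVIC_SystemReset()`, and the record would need to live in RAM that is not cleared at startup.

```c
#include <stdint.h>

#define FAULT_MAGIC 0xFA17C0DEu

enum { CAUSE_NONE = 0, CAUSE_BROWNOUT = 1, CAUSE_UNEXPECTED_IRQ = 2 };

/* Hypothetical record; on a real target it would sit in a no-init RAM
 * section so that it survives a warm reset. */
typedef struct {
    uint32_t magic;
    uint32_t cause;
} fault_record;

typedef void (*reset_fn)(void);

static fault_record g_fault;
static reset_fn g_request_reset;

static void fault_init(reset_fn request_reset)
{
    g_request_reset = request_reset;
}

/* Body for a brownout or "unexpected interrupt" vector: note the cause,
 * then ask for a reset instead of looping forever. On hardware the reset
 * call would not return. */
static void handle_fault(uint32_t cause)
{
    g_fault.magic = FAULT_MAGIC;
    g_fault.cause = cause;
    if (g_request_reset) {
        g_request_reset();
    }
}

/* Called early after boot: returns the recorded cause once, then clears it. */
static uint32_t fault_take_last_cause(void)
{
    if (g_fault.magic != FAULT_MAGIC) {
        return CAUSE_NONE;
    }
    g_fault.magic = 0;
    return g_fault.cause;
}

/* Host-side stand-in for the reset request (assumption of this sketch). */
static int g_reset_count;
static void fake_reset(void) { g_reset_count++; }
```

Whether a reset is the right reaction depends on the product; the point is only that the handler leaves evidence behind rather than hanging silently.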

Simon Richter
  • It can detect brownouts, but it looks like that was not triggered in this case. Presumably, if this was caused by a brownout, then it wasn't low enough to trigger a reset. – Rocketmagnet Jul 28 '22 at 16:35
  • @Rocketmagnet, that's the point -- you end up in the brownout vector when the dip wasn't low enough to trigger a reset, and if the handler then goes into an endless loop instead of jumping to the reset vector, the MCU appears to hang. – Simon Richter Jul 28 '22 at 16:57
4

Yes, this is very well possible. I've faced a similar issue on an ATmega AVR MCU where power sequencing at the start causes some devices to hang on bootup.

And the only place in the code it could be hanging is a while loop where the device waits for the (drumroll) EEPROM to become ready.

I won't say it must be the same problem but your analysis points in that direction. EEPROM writes are quite susceptible to low voltage and in the datasheet of the ATmega 2561 I'm using there are some warnings about making sure not to issue EEPROM writes when the supply voltage is too low.

In my case, not even a reset on the reset line solved the issue. Only power off and power on within about 30 seconds solved the issue. Not all the MCUs behave the same, and the percentage of MCUs that have the problem seems to depend on the production lot.

So check your power system sequencing, stability of the supply, and Vcc blocking capacitors. If you have a few I/O pins to spare, toggle some of these pins at different places in the code. This will make it much easier to see if your CPU has just halted or is in a spinlock.
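
The pin-toggling idea could look roughly like the following C sketch. Everything in it is an assumption for illustration: `fake_gpio_out` stands in for a memory-mapped GPIO output register, and the pin assignments and `main_loop_iteration` are invented names. One pin toggles once per main-loop pass, and a second pin is held high only while waiting on the EEPROM.

```c
#include <stdint.h>

/* Stand-in for a GPIO output data register. On a real target this would
 * be the volatile memory-mapped register from the vendor header. */
static volatile uint32_t fake_gpio_out;

#define DEBUG_PIN_MAIN_LOOP   (1u << 0) /* toggles once per loop pass   */
#define DEBUG_PIN_EEPROM_WAIT (1u << 1) /* high while waiting on EEPROM */

static inline void debug_pin_toggle(uint32_t mask) { fake_gpio_out ^= mask; }
static inline void debug_pin_set(uint32_t mask)    { fake_gpio_out |= mask; }
static inline void debug_pin_clear(uint32_t mask)  { fake_gpio_out &= ~mask; }

/* One pass of a hypothetical main loop with checkpoints around the
 * blocking EEPROM write. */
static void main_loop_iteration(void (*eeprom_write)(void))
{
    debug_pin_toggle(DEBUG_PIN_MAIN_LOOP);
    debug_pin_set(DEBUG_PIN_EEPROM_WAIT);
    eeprom_write();                 /* may block inside library code */
    debug_pin_clear(DEBUG_PIN_EEPROM_WAIT);
}

/* Host-side placeholder for the real EEPROM write. */
static void fake_eeprom_write(void) {}
```

On a hung board, a frozen main-loop pin with the EEPROM-wait pin stuck high would point at the wait loop; both pins frozen with the wait pin low would suggest the CPU stopped somewhere else.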

ocrdu
kruemi
  • Thanks for this. It's possible that it's the same issue. By design the power supply is pretty stable, but in practice we had some badly made boards with weak vias which sometimes go open-circuit. Perhaps one of these open circuits happened at just the right time for one of the periodic EEPROM writes. – Rocketmagnet Jul 28 '22 at 13:45
1

Don't assume that the failed CPU is executing code you have written, or any code at all - if the internal functioning is disrupted, the program counter could be set to anything, or you could have hit an unhandled exception, and the CPU is just halted.

If you think there might have been a problem with faulty vias, have you tried gently flexing the board, or giving it a thermal shock to see if that crashes the CPU? If the CPU still runs, then that is unlikely to be the cause of the problem.

Improving the EEPROM interface might help, but don't be surprised if it doesn't fix the problem completely; if you need high reliability, an independent watchdog circuit is highly desirable.
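
On the firmware side, a watchdog is only useful if it is serviced when the code is genuinely making progress. A minimal C sketch of that idea follows, with hypothetical names throughout (`watchdog_check_in`, `watchdog_service`, the task bits); `g_kick_count` merely stands in for pulsing the input of an external watchdog chip or reloading an internal one.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_MAIN_LOOP (1u << 0)
#define TASK_EEPROM    (1u << 1)
#define ALL_TASKS      (TASK_MAIN_LOOP | TASK_EEPROM)

static uint32_t g_checked_in; /* bits set by tasks that made progress   */
static uint32_t g_kick_count; /* stand-in for the actual watchdog pulse */

/* Each monitored piece of code calls this when it completes its work. */
static void watchdog_check_in(uint32_t task)
{
    g_checked_in |= task;
}

/* Called periodically. Kicks the watchdog only if every monitored task
 * has checked in since the last kick; returns true if it kicked. */
static bool watchdog_service(void)
{
    if ((g_checked_in & ALL_TASKS) != ALL_TASKS) {
        return false; /* something is stuck: let the watchdog expire */
    }
    g_checked_in = 0;
    g_kick_count++;
    return true;
}
```

With this structure, a wait loop that never exits stops its task from checking in, so the watchdog expires even though other code (or a timer interrupt) is still running.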

jayben
  • Indeed, that is one of the scenarios I am imagining. The reason I mention code in my question is that that's the first thing people will suggest when I ask about a CPU seeming to have hung. – Rocketmagnet Jul 29 '22 at 16:09