Options for debugging MCU that is freezing/crashing (AVR32)

Question

I'm currently trying to figure out why the MCU seems to be crashing/freezing.
I can reproduce the crash/freeze almost every time by following the same procedure. I have looked through the code extensively, but I can't find the culprit.

A timer is running on the MCU (clocked at 48 MHz) with an interrupt every 10 μs. To debug it further, I added some code inside this interrupt that toggles an LED every 100 ms. At some point the LED just stops blinking and the MCU becomes unresponsive. No USB/UART communication etc., everything seems dead.
I have measured the VDD voltage and it seems to be fine, no glitches/voltage drops.

I did not write the code, but I have narrowed the problem down to the part that decodes a serial signal coming in through a pin interrupt. I have not been able to narrow it down any further.

The MCU is an AT32UC3 and I have an Atmel-ICE debugger connected to it, but I do not have much experience with debugging on a live MCU.

I suspect it could be some part of the memory that is getting written and corrupted, but I'm not sure.

Any advice on how to proceed with this sort of problem?

@Jeroen3 Yes, it is 10 μs, but no real work is done every 10 μs; a variable is just incremented. It's only there to give better resolution when the value is used later. (The MCU is clocked at 48 MHz.) — Linkyyy, Nov 18 '18 at 15:56
For that sort of application at that speed a hardware timer might be a better choice than an interrupt. When you say you “added code” to the interrupt, do you mean inside the interrupt driver itself or it’s just code that reads the updated variable? — Edgar Brown, Nov 18 '18 at 19:34
@EdgarBrown: I added the code directly to the interrupt routine, so I could be sure it always ran — Linkyyy, Nov 18 '18 at 20:05
What's the speed of the system clock? How much interrupt latency are you counting with just for entering/leaving the interrupt? How many CPU ticks is the code inside the ISR? — Lundin, Nov 19 '18 at 11:48
@Lundin: The system clock is 48 MHz. I'm not sure what you mean by "interrupt latency". Inside the interrupt (I assume you mean the pin ISR?) there are a few lines of code that calculate the time since the last interrupt, check whether it's within spec (smaller than the maximum allowed), then put the time into an array. I'm not sure how to view the asm code and estimate the number of clock ticks. — Linkyyy, Nov 19 '18 at 14:28
So one CPU tick in your system is 20.8 ns and you have 480 ticks per 10 μs. That is roughly 100-150 assembler instructions. Those are tough real-time requirements. Needless to say, you'll need to disassemble the C code and make a theoretical calculation. Good disassemblers give the number of ticks per instruction. As for interrupt latency, I mean the overhead: the number of ticks the CPU needs to store SP etc. and jump to the ISR, and similarly when it returns from the ISR. This needs to be taken into account when you disassemble and calculate timing. — Lundin, Nov 19 '18 at 15:20
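The arithmetic in the comment above can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the ISR body and overhead figures in the usage note are hypothetical. Real numbers have to come from the disassembly and the AT32UC3 documentation.

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles available between two periodic interrupts. */
uint32_t cycles_per_period(uint32_t cpu_hz, uint32_t period_us)
{
    assert(period_us > 0u);
    /* 64-bit intermediate so cpu_hz * period_us cannot overflow. */
    return (uint32_t)(((uint64_t)cpu_hz * period_us) / 1000000u);
}

/* Cycles left for all other code once the ISR body and its entry/exit
 * overhead are subtracted. A negative result means the ISR cannot
 * finish before the next interrupt fires. */
int32_t cycles_left(uint32_t budget, uint32_t isr_body, uint32_t overhead)
{
    return (int32_t)budget - (int32_t)(isr_body + overhead);
}
```

With a 48 MHz clock and a 10 μs period, `cycles_per_period(48000000u, 10u)` gives 480. Assuming, purely as an example, a 100-cycle ISR body and 30 cycles of entry/exit overhead, `cycles_left(480u, 100u, 30u)` leaves 350 cycles per period for everything else.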

score 1 · Answer 1 · answered Nov 18 '18 at 20:32

By adding your modifications to the interrupt routine you seem to have either exposed a race condition in the original code or created a new one.

Writing re-entrant code can be tricky; writing re-entrant interrupt code is even trickier. Writing interruptible interrupt routines is fraught with peril.

The fast interrupt was very likely intended to be fast: do its thing and get out of the way as soon as possible, leaving time for other things to happen. It is also likely to be high priority, interrupting everything that gets in its way, and it is likely consuming more than 5% of all executed instructions. By changing this code you increased that share.

You also state that communications are interrupt-driven, which suggests more time-critical code. Some portions of this code are not likely to be very forgiving of increased delays in their execution. Your modification added the possibility of extra delays.

By modifying the code you increased the time the fast interrupt takes to complete, so one or more of these conditions may have arisen:

The time to execute the communications interrupt routine increased to the point where its logic could not execute correctly.
The communications interrupt routine was not able to finish before it interrupted itself, creating a reentrancy problem or simply overflowing the stack.
You are modifying a register or flag in the fast interrupt routine that is not restored properly, thus causing problems in the communications interrupts.

For doing slow things, like blinking an LED, it’s really not a good idea to increase the execution time of something that is supposed to be fast.
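One common way to keep the fast interrupt short is to let it only count and do the slow work elsewhere. The sketch below is a host-side model of that idea, not the original firmware: `timer_isr`, `led_poll`, `TICKS_PER_BLINK` and the `g_led_on` flag standing in for the GPIO are all hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* 10000 ticks of 10 us each = 100 ms between LED toggles. */
#define TICKS_PER_BLINK 10000u

static volatile uint32_t g_ticks;   /* written only by the ISR */
static uint32_t g_last_toggle;      /* tick count of the last toggle */
static bool g_led_on;               /* stands in for the real GPIO pin */

/* Hypothetical 10 us timer ISR: count and return, nothing else. */
void timer_isr(void)
{
    g_ticks++;
}

/* Called from the main loop. Toggles the LED state once enough ticks
 * have elapsed and returns the current state. */
bool led_poll(void)
{
    uint32_t now = g_ticks;  /* one 32-bit read of the shared counter */

    /* Unsigned subtraction stays correct when the counter wraps. */
    if ((uint32_t)(now - g_last_toggle) >= TICKS_PER_BLINK) {
        g_last_toggle += TICKS_PER_BLINK;
        g_led_on = !g_led_on;
    }
    return g_led_on;
}
```

The trade-off is that an LED driven this way stops when the main loop stalls, even if the timer interrupt is still firing, so it answers a different question than a toggle placed inside the ISR.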

Thanks for your reply. Maybe it was bad wording, but I added the LED toggle to the interrupt routine in order to debug what was going on. The MCU was already freezing up before I added the extra code. At first I put the LED toggle in the main loop, but then I could not tell whether the code was just stuck in some loop, so I moved it into the interrupt routine instead to make sure it ran regularly — Linkyyy, Nov 18 '18 at 20:38
About the communication interrupt routine: I added an interrupt disable at the start and an interrupt enable at the end of the routine, for the same reason you mentioned, that the code might not be able to finish before the next interrupt fired. This did not seem to fix it though — Linkyyy, Nov 18 '18 at 20:46
@Linkyyy this sounds like stack corruption or a leak. You have a hardware debugger attached, so take a look at what the stack pointer(s) are doing. They should always remain mostly stable at the same point in the code. — Edgar Brown, Nov 18 '18 at 21:17
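A related technique, different from watching the stack pointer live in the debugger, is "stack painting": fill the stack region with a known pattern at startup and later check how much of the pattern survived. The sketch below models this on a plain array. The function names and the fill value are assumptions, it assumes a stack that grows toward lower addresses, and on the real target the region would have to come from the linker script.

```c
#include <stddef.h>
#include <stdint.h>

/* Arbitrary fill pattern, chosen to be unlikely as real stack data. */
#define STACK_FILL 0xDEADBEEFu

/* Fill the whole stack region with the pattern. On a target this must
 * run at startup, before the region is in use. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count words that still hold the pattern, starting at the low end of
 * the region (the far end of a descending stack). A result of 0 means
 * the stack reached the end of its region at some point. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

After reproducing the freeze, halting with the debugger and inspecting the region (or calling `stack_unused_words`) shows whether the stack ever grew past its limit, even if the stack pointer looks normal at the moment of the halt.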
