How to debug reset caused by WDTCTL security key violation in MSP430F1611?

Question

I'm working on a sensor project that uses TelosB motes, based on the MSP430F1611 and running TinyOS. My program resets some time after boot. After the PUC, IFG1 has the WDTIFG bit set, indicating that the watchdog timer initiated the reset. This can happen in two cases:

1) Watchdog timer expiration, which applies in watchdog mode only. But the watchdog timer is never started, so this cannot happen.

2) Watchdog timer security key violation. Nowhere does my program explicitly write WDTCTL (address 0x0120). So there must be a memory access bug in my code that illegally writes WDTCTL and causes the security key violation.

Is there any debug tool to help locate where this happens? Or any suggestion on how I should proceed to locate the bug? My program is thousands of lines long, so a manual check is non-trivial.

Is the watchdog ever disabled? The watchdog is running upon boot, and the application must explicitly disable it. Early versions of mspgcc disabled the watchdog as part of the C runtime initialization. That mistake was corrected in later versions. — markrages, Oct 29 '12 at 01:22
Yes, it is disabled when a node is initialized: WDTCTL = WDTPW + WDTHOLD; — sinoTrinity, Oct 29 '12 at 01:34
Consider, as a debug test, creating a do-nothing application that sets things up as your normal application would, then simply loops (or counts on the serial port or whatever). This will minimize the chance of runaway execution, helping you see whether the reset is caused by never successfully disabling the watchdog. Alternatively, you could add code to periodically kick the watchdog; if that makes the resets stop, then you know it was not disabled. — Chris Stratton, Nov 29 '12 at 20:52
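For reference, the key rule discussed above can be sketched in plain C. This is only a host-side model, not code from the poster's application: the constant values are the ones documented for the MSP430x1xx watchdog (WDTPW, WDTHOLD, WDTCNTCL), while the helper names `wdt_write_is_legal` and `wdt_kick_value` are made up for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit values documented for the MSP430x1xx watchdog (same as in msp430.h). */
#define WDT_KEY_WRITE 0x5A00u /* WDTPW: upper byte required on every write */
#define WDT_HOLD      0x0080u /* WDTHOLD: stops the watchdog counter       */
#define WDT_CNTCL     0x0008u /* WDTCNTCL: clears the counter (a "kick")   */

/* Hypothetical helper: true if a 16-bit write of `value` to WDTCTL carries
 * the correct key. Any other upper byte causes a PUC with WDTIFG set. */
static bool wdt_write_is_legal(uint16_t value)
{
    return (value & 0xFF00u) == WDT_KEY_WRITE;
}

/* Hypothetical helper: value to write for a kick that keeps the current
 * configuration bits. `readback` is what was read from WDTCTL; its upper
 * byte reads as 0x69, so it must be replaced by the write key. */
static uint16_t wdt_kick_value(uint16_t readback)
{
    return (uint16_t)(WDT_KEY_WRITE | (readback & 0x00FFu) | WDT_CNTCL);
}
```

Note that, under this rule, a read-modify-write such as `WDTCTL |= WDTHOLD` writes back the 0x69 read key and is itself a violation, as is any stray pointer store that lands on address 0x0120.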

Michael Karas · Answer 1 · 2012-11-11T22:05:06.953

If the problem is repeatable it will help a tremendous amount. The debug process can be addressed if you have a couple of key embedded development tools available. These tools would include a digital oscilloscope and two spare GPIO pins on your processor. In the absence of two spare pins it is often possible to temporarily re-deploy two in-use pins for the purpose of the debug exercise.

The method of debug is one of divide and conquer to discover what section(s) of your code are in use when the errant reset happens. The first step involves adding code in your startup routine that sets one of the spare pins as an output and to a logic level that will then get changed back to default when the MCU goes back into reset. This GPIO will be used as a trigger for your digital scope such that the "end trigger mode" will be used. It is possible that you may already have an I/O in your system that can serve as this trigger event. In some cases the WDT reset may also appear directly visible on the MCU reset pin. If either of these latter cases applies then you can use that instead of having to add use of this first GPIO pin.

The next step is to set the other spare GPIO pin as an output, set it to its non-default level, and then set it back to the other state within a specific branch of your program code. On most MCUs, GPIO pins default to inputs after reset, so an external pull-up is installed to keep the pin high while it is still an input. Your test code would then set the pin as an output driving low and, some time later in the code branch, set the output to a high level.
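The two pins described so far might be driven by helpers along these lines. This is a minimal sketch under stated assumptions: P6.0 and P6.1 are assumed to be spare and fitted with external pull-ups (substitute whatever pins are actually free on your board), the function names are hypothetical, and the host stand-ins for `P6DIR`/`P6OUT` exist only so the sketch compiles off-target.

```c
#include <stdint.h>

#ifdef __MSP430__
#include <msp430.h>             /* real P6DIR / P6OUT registers */
#else
static volatile uint8_t P6DIR;  /* host stand-ins, for off-target builds only */
static volatile uint8_t P6OUT;
#endif

#define DBG_TRIGGER_PIN 0x01u   /* assumption: P6.0 is spare, pulled up externally */
#define DBG_MARKER_PIN  0x02u   /* assumption: P6.1 is spare, pulled up externally */

/* Call once at startup. The pin is driven low while the program runs; a PUC
 * resets P6DIR, the pin becomes an input again and the pull-up raises it,
 * which gives the scope its trigger edge. */
static void dbg_trigger_arm(void)
{
    P6OUT &= (uint8_t)~DBG_TRIGGER_PIN;  /* set the level before enabling the driver */
    P6DIR |= DBG_TRIGGER_PIN;
}

/* Call on entry to the code branch under suspicion: marker goes low. */
static void dbg_marker_enter(void)
{
    P6OUT &= (uint8_t)~DBG_MARKER_PIN;
    P6DIR |= DBG_MARKER_PIN;
}

/* Call on exit from the branch: marker is driven back high. */
static void dbg_marker_exit(void)
{
    P6OUT |= DBG_MARKER_PIN;
}
```

Each call is only one or two register accesses, which is why the instrumentation barely disturbs timing.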

Re-compile, link and load the test version of the code. Now set the scope to "end trigger" on the first-mentioned GPIO, which signals that the errant WDT reset has happened. Put the other GPIO on a second channel of the oscilloscope. Then start up your system and wait until the errant reset happens. If this second GPIO goes low and back high before the trigger event, then you know that the WDT event did not happen within the branch of code where you added the handshake signal. On the other hand, if this second GPIO returns to its default state in conjunction with the WDT reset, then you know that the guilty code is somewhere between the two places in the code where you set and cleared the handshake signal. (Note that if you do not see the second GPIO handshake at all on the scope, then that particular branch of code was never even entered and is thus probably not the guilty region of the code.)

When evaluating the scope traces, note that the WDT reset will happen some time before the scope trigger does, because you set up the scope trigger GPIO just after the code has restarted. You will need to evaluate how much delay there is on your MCU and in your system so that you can accurately judge whether the second signalling GPIO clears at WDT reset time or earlier. Practice runs can help to fine-tune the technique.

The investigation process now involves moving the setting and clearing of the second GPIO handshake around to various parts of your code execution path to try to isolate the specific parts of the code that are involved with the problem. If you find a branch that is suspect, you can then narrow down the area of focus, i.e. "divide and conquer".

One very nice thing about this technique is that the setting and clearing of the signalling GPIO takes very little execution time on the processor and so does not have a serious impact on the real time performance of time critical code. There are other debugging techniques available but they all generally involve much greater execution overhead than the described technique.

Thanks for your insightful reply. However, there are two constraints preventing me from using your tactic: first, I don't have an oscilloscope; second, there are 100+ sensors, the bug only manifests on a few of them (fewer than 10) each run, and the affected sensors change across runs. Even if I had some oscilloscopes, it would not be easy for them to catch these few buggy sensors. Any further suggestions? BTW, the reset is caused by a security key violation. I hope this can reduce the problem space. — sinoTrinity, Nov 11 '12 at 21:30
The only comment I have to make is that this type of problem can be extremely difficult to debug. You need to get the right tools for any trade. Embedded programming is no different in this regard than, say, a plumber or an electrician. You don't expect to see the doctor whip out a jack knife to perform an appendectomy. — Michael Karas, Nov 11 '12 at 21:56

score 2 · Answer 2 · answered Nov 29 '12 at 20:45

Indeed this is one of the challenges of embedded programming. I see several options to try and narrow down the issue:

1) Use JTAG connected to the TelosB and put a breakpoint on accesses to the watchdog timer register (WDTCTL). With luck, the bug will eventually manifest, the debugger will break, and you can inspect the offending access.

2) Use the flash (either info memory or main flash) to log the status of your operations, with a timestamp and other useful information. Then you can connect using JTAG (be careful not to erase the flash), pull the data and look at it. You could even store rudimentary trace information.
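The flash logging in option 2 could look roughly like the sketch below. Everything here is an assumption rather than a tested TelosB implementation: it assumes information segment B at 0x1000 is free and already erased, that FCTL2 (the flash timing generator) has been configured elsewhere, and that event codes never equal 0xFFFF. The names `trace_log`, `trace_store` and `trace_host_erase` are made up; off-target, a RAM array stands in for the flash segment.

```c
#include <stdint.h>
#include <stddef.h>

#define TRACE_WORDS  64u      /* one 128-byte info segment holds 64 words */
#define TRACE_ERASED 0xFFFFu  /* erased flash reads as all ones           */

#ifdef __MSP430__
#include <msp430.h>
/* Assumption: info segment B (0x1000) is unused and was erased beforehand. */
static volatile uint16_t *const trace_mem = (volatile uint16_t *)0x1000;

static void trace_store(volatile uint16_t *slot, uint16_t value)
{
    FCTL3 = FWKEY;           /* clear LOCK                        */
    FCTL1 = FWKEY | WRT;     /* enable word writes                */
    *slot = value;
    FCTL1 = FWKEY;           /* leave write mode                  */
    FCTL3 = FWKEY | LOCK;    /* lock the flash controller again   */
}
#else
static volatile uint16_t trace_mem[TRACE_WORDS];  /* host stand-in for flash */

static void trace_host_erase(void)                /* simulate a segment erase */
{
    size_t i;
    for (i = 0; i < TRACE_WORDS; i++)
        trace_mem[i] = TRACE_ERASED;
}

static void trace_store(volatile uint16_t *slot, uint16_t value)
{
    *slot = (uint16_t)(*slot & value);  /* programming can only clear bits */
}
#endif

/* Append one {tick, event} record to the first free slot.
 * Returns 0 on success, -1 when the segment is full. */
static int trace_log(uint16_t tick, uint16_t event)
{
    size_t i;
    for (i = 0; i + 1 < TRACE_WORDS; i += 2) {
        if (trace_mem[i + 1] == TRACE_ERASED) {   /* event word erased: slot is free */
            trace_store(&trace_mem[i + 1], event); /* event first, so the slot is claimed */
            trace_store(&trace_mem[i], tick);
            return 0;
        }
    }
    return -1;
}
```

With only 32 records per segment, it probably makes sense to log coarse milestones (task entry, radio send, and so on) rather than every call, and to read the segment back over JTAG after a reset.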

Gustavo Litovsky

Aside from what I suggested, you can also approach the debugging from another angle. You can disable parts of the code you're running (and, to speed things up, perhaps disable different parts on each board). This will work if the issue is localized in one of your routines. However, if it takes multiple routines to trigger the failure, you'll have to continue experimenting. What I'm basically suggesting is that if you have two tasks, one for blinking an LED and another for communicating over the radio, then you take two boards and run just one task on each. (However, the LED is unlikely to be the culprit.) – Gustavo Litovsky Nov 30 '12 at 16:53
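One way to arrange the per-board split suggested above is to derive the enabled routine from the node ID, so a single image can be flashed to every board. This is only an illustrative sketch: the subsystem list is hypothetical (LED and radio come from the comment, sensing is an added guess), and the function names are made up. In TinyOS the ID passed in would presumably be `TOS_NODE_ID`.

```c
#include <stdint.h>

/* Hypothetical list of routines that can be switched on independently. */
enum subsystem {
    SUBSYS_LED = 0,
    SUBSYS_RADIO,
    SUBSYS_SENSING,   /* assumption: not named in the thread */
    SUBSYS_COUNT
};

/* Each node runs exactly one subsystem, chosen round-robin by node ID. */
static enum subsystem subsystem_for_node(uint16_t node_id)
{
    return (enum subsystem)(node_id % SUBSYS_COUNT);
}

/* Guard to wrap around each routine's start-up code. */
static int subsystem_enabled(uint16_t node_id, enum subsystem s)
{
    return subsystem_for_node(node_id) == s;
}
```

If resets then show up only on nodes running one particular subsystem, that subsystem is the first place to look; if they vanish everywhere, the fault likely needs an interaction between routines.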
