Mysterious hard-fault when I step over

Question

This question was rewritten to remove several updates and improve clarity.

I have a rather obscure Cortex-M3 based MCU and a fairly big project, written in C++, built in Keil MDK5 with the armcc compiler.

I have this function:

bool CanHandle::sendMsg(CanMsg & msg)
{
    bool result = false; // <--------- problematic line

    CAN_TxMsgTypeDef txMsg;

    if (format == FrameFormat::EXTENDED)
    {
        txMsg.IDE = CAN_ID_EXT;
    }
    else
    {
        txMsg.IDE = CAN_ID_STD;
    }

    ENTER_CRITICAL_SECTION(); 

        uint32_t bufN = CAN_GetEmptyTransferBuffer(set.mdrCan);
        if (bufN != CAN_BUFFER_NUMBER)
        {
            CAN_Transmit(set.mdrCan, bufN, &txMsg);
            isTxQueueFull = false;
        }
        else
        {
            isTxQueueFull = true;
            result = false;
        }

    LEAVE_CRITICAL_SECTION();

    return result;
}

When I compile with -O1, the compiler produces this assembly listing for it:

0x0800068C E92D41FF  PUSH     {r0-r8,lr}
0x08000690 4604      MOV      r4,r0
0x08000692 2700      MOVS     r7,#0x00
0x08000694 7A20      LDRB     r0,[r4,#0x08]
0x08000696 2600      MOVS     r6,#0x00
0x08000698 F04F0801  MOV      r8,#0x01
0x0800069C 2801      CMP      r0,#0x01
0x0800069E D013      BEQ      0x080006C8
0x080006A0 F88D6005  STRB     r6,[sp,#0x05]
0x080006A4 4860      LDR      r0,[pc,#384]  ; @0x08000828
0x080006A6 6800      LDR      r0,[r0,#0x00]
0x080006A8 F3C00508  UBFX     r5,r0,#0,#9
0x080006AC F8D401FC  LDR      r0,[r4,#0x1FC]
0x080006B0 F000FBCA  BL.W     CAN_GetEmptyTransferBuffer (0x08000E48)
0x080006B4 4601      MOV      r1,r0
0x080006B6 2920      CMP      r1,#0x20
0x080006B8 D009      BEQ      0x080006CE
0x080006BA 466A      MOV      r2,sp
0x080006BC F8D401FC  LDR      r0,[r4,#0x1FC]
0x080006C0 F000FBC4  BL.W     CAN_Transmit (0x08000E4C)
0x080006C4 7266      STRB     r6,[r4,#0x09]
0x080006C6 E004      B        0x080006D2
0x080006C8 F88D8005  STRB     r8,[sp,#0x05]
0x080006CC E7EA      B        0x080006A4
0x080006CE F8848009  STRB     r8,[r4,#0x09]
0x080006D2 B004      ADD      sp,sp,#0x10
0x080006D4 4638      MOV      r0,r7
0x080006D6 E8BD81F0  POP      {r4-r8,pc}

And here's the funny thing: when I try to step over the line bool result = false;, I get a hard fault with the UNDEFINSTR flag set. The PC recovered from the stack points to a nonexistent address.
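For readers who want to check the same flag: UNDEFINSTR is one of the UsageFault status bits, which sit in the upper half of the CFSR register (0xE000ED28 on Cortex-M3). Below is a minimal sketch of decoding those bits from a raw CFSR value. The helper name describeUsageFault is hypothetical and not part of my project; on the target the value would be read from the register inside the fault handler.

```cpp
#include <cstdint>
#include <string>

// UsageFault bits as they appear in the 32-bit CFSR (address 0xE000ED28).
// UFSR occupies CFSR bits 16..31, so UFSR bit n is CFSR bit n + 16.
constexpr std::uint32_t CFSR_UNDEFINSTR = 1u << 16; // undefined instruction
constexpr std::uint32_t CFSR_INVSTATE   = 1u << 17; // invalid execution state (EPSR)
constexpr std::uint32_t CFSR_INVPC      = 1u << 18; // invalid PC load / EXC_RETURN
constexpr std::uint32_t CFSR_NOCP       = 1u << 19; // coprocessor access
constexpr std::uint32_t CFSR_UNALIGNED  = 1u << 24; // unaligned access trap
constexpr std::uint32_t CFSR_DIVBYZERO  = 1u << 25; // divide-by-zero trap

// Hypothetical helper: lists the UsageFault flags set in a raw CFSR value.
std::string describeUsageFault(std::uint32_t cfsr)
{
    struct Flag { std::uint32_t mask; const char* name; };
    static const Flag flags[] = {
        { CFSR_UNDEFINSTR, "UNDEFINSTR" },
        { CFSR_INVSTATE,   "INVSTATE"   },
        { CFSR_INVPC,      "INVPC"      },
        { CFSR_NOCP,       "NOCP"       },
        { CFSR_UNALIGNED,  "UNALIGNED"  },
        { CFSR_DIVBYZERO,  "DIVBYZERO"  },
    };

    std::string out;
    for (const Flag& f : flags) {
        if (cfsr & f.mask) {
            if (!out.empty()) {
                out += ' ';
            }
            out += f.name;
        }
    }
    return out.empty() ? std::string("none") : out;
}
```

INVSTATE is listed next to UNDEFINSTR because both tend to show up when execution lands on a garbage address, so it is worth checking which one is actually set.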

But - and here's the really mysterious thing - if I step over the assembly, or step into the C code, or set a breakpoint on line 2 and press run, everything is fine! No hard fault; the program runs from there. If I compile with -O0 or make result volatile, the compiler produces different assembly and the hard fault doesn't occur. I tried different versions of the compiler and IDE, and the problem persists.

Running this program without a debugger produces no fault.

Debugging in the simulator produces no fault, but stepping over that particular line doesn't actually step over; it makes the program run indefinitely.

This function is called from main, so after its end there is just while(1). I believe there is no problem in the outside code.

The code of the function is now at its minimum: if I remove any line, literally any line, the problem goes away. I previously posted several wrong guesses; I have removed them for now. I can't pinpoint any particular instruction or address that produces the hard fault. All function calls there are dummies that return immediately. The CRITICAL_SECTION macros do not produce critical sections at all, but something has to be on those lines or no fault occurs.

I have literally no idea how that can happen. I don't know exactly how the debugger works, or what the difference is between stepping over and setting a breakpoint and hitting "run".

I imagine you get the hardfault at the first line of C++ code. But can you tell exactly in which asm instruction you get the fault? — next-hack, Sep 07 '17 at 15:38
@next-hack I get a hard fault _only_ if I step over this first line "bool result = false". If I set a breakpoint at the next line, it's fine. If I step over the assembly, it's fine. Every time I try to click in that particular area of the assembly, the IDE makes a wild jump somewhere. If I put a breakpoint in the assembly (first three lines) and then step over the C++ code, I get a fault. On the next assembly lines, it's fine. I'm starting to think that there is some very specific "alignment of stars" that generates the fault, not a particular instruction. — Amomum, Sep 07 '17 at 15:59
I had similar problem on another MCU. Hard errors occurred only under very particular cases. After many hours, I found-out in the errata-sheets I had mounted an old silicon version of the chip, that did not support running at the maximum speed. Check the datasheet for errata... — next-hack, Sep 07 '17 at 16:20
@next-hack The errata has 56 pages and lists plenty of errors, but nothing resembling my problem (at least at first glance). Clocking seems fine. — Amomum, Sep 07 '17 at 16:30
The function bleargh actually only corrupts registers :) I'm surprised that there hasn't been a hard fault after that function! Anyway, what are you using to debug? The ULINK2, in SWD? — next-hack, Sep 07 '17 at 17:03
And also, if you use the simulator, of course everything is fine, right? — next-hack, Sep 07 '17 at 17:04
@next-hack I tried ULINK2 with JTAG and ST-Link with SWD (though the MCU is not an STM). All is fine in the simulator, as expected. — Amomum, Sep 07 '17 at 17:27
"I tried this minimal faulting code on stm32f103 (same Cortex-M3, same memory offsets) and it faults there!" - show us _all_ of this code. — Bruce Abbott, Sep 08 '17 at 11:47
@BruceAbbott _all_ of the code is still too big to post here. I posted what I could; please see the update to the question. — Amomum, Sep 08 '17 at 12:06
You could show us a _little_ more of it though, right? What's in the rest of CanHandle::sendMsg(), and what is after it in main()? — Bruce Abbott, Sep 08 '17 at 12:17
@BruceAbbott I'm not sure what difference it makes, because the rest of the code doesn't execute anyway. Please see the edited answer. — Amomum, Sep 08 '17 at 12:24
Is there any initialization missing on the variables? We need to know what's that class. — next-hack, Sep 08 '17 at 12:39
When stepping in the simulator, what is the next line it stops on after 'bool result = false'? — Bruce Abbott, Sep 08 '17 at 13:07
@next-hack What class? CanHandle is really big and looks properly initialized. Since sendMsg doesn't even touch `this`, I think it should not matter. — Amomum, Sep 08 '17 at 13:21
@BruceAbbott it stops on `if (format == FrameFormat::EXTENDED)` — Amomum, Sep 08 '17 at 13:22
@BruceAbbott Actually, no, sorry, I was wrong. In the simulator, when I step over, the debugger doesn't stop at all, as if I had pressed "Run". But when I step into, it steps over. Huh. And on the real hardware "step into" doesn't produce a fault. — Amomum, Sep 11 '17 at 16:01
So as I suspected, the 'step over' is actually executing a lot more than one line of code. Set a breakpoint in the machine code, moving it down one instruction at a time until the hardfault occurs (or until you notice some anomaly). — Bruce Abbott, Sep 11 '17 at 19:07
@BruceAbbott It still doesn't make much sense. I put a breakpoint on the next line (or on the CMP instruction) and step over: it faults and doesn't hit the breakpoint. If I run or step into, it hits the breakpoint and doesn't fault. If the problem were in the code alone, wouldn't it fault without the debugger attached? — Amomum, Sep 12 '17 at 09:10
The debugger may use resources (eg. stack, peripherals) which are changing the machine state enough to trigger the hardfault. This suggests there may be a latent bug in your code. The 4 instructions you are focusing on are probably benign, and something _after_ them is the problem. You need to know what code is being executed during the 'step over'. So use the simulator to find where to put a breakpoint that it actually reaches, then do it in hardware. — Bruce Abbott, Sep 12 '17 at 17:43
Check your stack size. I had recent mysterious hardfaults that turned out to be too small stack size. — Tut, Sep 14 '17 at 20:52
"execution **reaches** MRS and here is the hard fault." - What exactly do you mean by this, does it actually execute the MRS instruction, or merely reach the address it is at? What is the minimum amount of preceding code you can leave in that still produces the fault? — Bruce Abbott, Sep 15 '17 at 16:37
@BruceAbbott As far as I understand now, the fault is produced only if I step over C code, that step over fails (i.e. doesn't stop on the next line of C code), the program just runs, and execution 'runs over' MRS. Just pressing 'Run' won't do it, executing MRS by stepping into won't do it, and running the program without the debugger won't do it. I don't know how to check whether MRS is actually executed, since the PC recovered in the hard fault handler is corrupted. I don't understand why stepping over doesn't work normally, but it seems to be crucial. — Amomum, Sep 15 '17 at 19:02
Please edit your question to show **only** the _minimum_ source code that still causes the fault, assembly listing of this code, and all information provided by the hard fault handler. — Bruce Abbott, Sep 15 '17 at 20:08
@Tut I checked three MCUs from two different vendors, and I didn't find anything relevant in the errata. — Amomum, Sep 18 '17 at 16:14

score 2 · Answer 1 · answered Sep 14 '17 at 18:51

Some ideas that might help when debugging a hard fault on a Cortex-M4; maybe some of them are useful here.

The address of the instruction that caused the hard fault is saved on the stack at offset +0x18 of the exception frame. This is reliable if the fault is synchronous (for a bus fault, the BFARVALID bit is set). If it is not, the fault can be forced to be synchronous by setting the DISDEFWBUF bit in the ACTLR system register.
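As a rough illustration of the stack offset mentioned above, here is a minimal sketch of pulling the stacked PC out of the basic exception frame and checking BFARVALID in a raw CFSR value. The names ExceptionFrame, stackedPc and busFaultAddressValid are hypothetical; the sketch assumes the handler has already picked the right stack pointer (MSP or PSP) and that no extra floating-point context was stacked.

```cpp
#include <cstddef>
#include <cstdint>

// Basic frame pushed by the core on exception entry, lowest address first.
struct ExceptionFrame {
    std::uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};
static_assert(offsetof(ExceptionFrame, pc) == 0x18, "stacked PC sits at +0x18");

// BusFault status bits as seen in the 32-bit CFSR (BFSR is CFSR bits 8..15).
constexpr std::uint32_t CFSR_PRECISERR   = 1u << 9;  // synchronous data bus error
constexpr std::uint32_t CFSR_IMPRECISERR = 1u << 10; // asynchronous data bus error
constexpr std::uint32_t CFSR_BFARVALID   = 1u << 15; // BFAR holds the faulting address

// Hypothetical helper: frameBase is the stack pointer value at exception entry.
inline std::uint32_t stackedPc(const std::uint32_t* frameBase)
{
    return frameBase[offsetof(ExceptionFrame, pc) / sizeof(std::uint32_t)];
}

// Hypothetical helper: true when BFAR can be trusted for this fault.
inline bool busFaultAddressValid(std::uint32_t cfsr)
{
    return (cfsr & CFSR_BFARVALID) != 0;
}
```

If IMPRECISERR is set instead of PRECISERR, the stacked PC may point past the instruction that actually caused the fault, which is the case the write-buffer bit in ACTLR is meant to help with.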

Another thing: when executing code from flash, things like the wait-state configuration, caches, and prefetch buffers can sometimes cause errors like this.

I currently have a similar issue, which seems to be influenced by a linker option (ignore_debug_references); I'm not sure how yet...

I think I was able to reduce my problem to one assembly line. I still have no idea why fault is happening. Please see update to my question. — Amomum, Sep 15 '17 at 12:23

score 1 · Answer 2 · answered Sep 14 '17 at 19:59

MCU is obscure enough to have very nice hardware bugs (and it does have them, actually) but until now there were no bugs in the core itself (because it was bought from ARM, I presume). But there is a possibility.

It is rare to find a bug in the core or the compiler, less rare to find bugs in peripherals, and very common to find bugs in code.

I have literally no idea how that can happen. I don't know exactly how the debugger works, or what the difference is between stepping over and setting a breakpoint and hitting "run".

Step does not interrupt, step-over does.

Also, please enable vector catch. You can do this with a debugger script.
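As a sketch of what such a script would set: on ARMv7-M cores, vector catch is controlled by the VC_* bits of the DEMCR register at 0xE000EDFC, and it only takes effect while halting debug is enabled. The snippet below just computes the register value with the fault-related catch bits set. The function name withFaultVectorCatch is hypothetical, and the actual script command for writing the register depends on the debugger and is not shown.

```cpp
#include <cstdint>

// Debug Exception and Monitor Control Register (ARMv7-M).
constexpr std::uint32_t DEMCR_ADDRESS = 0xE000EDFCu;

// Vector catch enable bits in DEMCR.
constexpr std::uint32_t VC_CORERESET = 1u << 0;  // halt on core reset
constexpr std::uint32_t VC_MMERR     = 1u << 4;  // MemManage fault
constexpr std::uint32_t VC_NOCPERR   = 1u << 5;  // UsageFault: coprocessor access
constexpr std::uint32_t VC_CHKERR    = 1u << 6;  // UsageFault: checking errors
constexpr std::uint32_t VC_STATERR   = 1u << 7;  // UsageFault: state errors
constexpr std::uint32_t VC_BUSERR    = 1u << 8;  // BusFault
constexpr std::uint32_t VC_INTERR    = 1u << 9;  // fault during exception entry/return
constexpr std::uint32_t VC_HARDERR   = 1u << 10; // HardFault

constexpr std::uint32_t VC_ALL_FAULTS =
    VC_MMERR | VC_NOCPERR | VC_CHKERR | VC_STATERR |
    VC_BUSERR | VC_INTERR | VC_HARDERR;

// Hypothetical helper: returns the DEMCR value with all fault catches enabled,
// leaving every other bit of the current value untouched.
constexpr std::uint32_t withFaultVectorCatch(std::uint32_t currentDemcr)
{
    return currentDemcr | VC_ALL_FAULTS;
}
```

With these bits set, the core halts on entry to the fault vector, before the first instruction of the handler runs.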

Vector catch is a nice trick, thank you. I usually just put BKPT instruction in all exception handlers. So that didn't tell me anything new, unfortunately. — Amomum, Sep 15 '17 at 09:10
