
I've built two prototypes of a new custom board based on an STM32F405RG microcontroller. One of them works totally fine, but the second one hits a hard fault after running for a few seconds or more.

To investigate the origin of the hard fault, I turned on full assert in the StdPeriph library and found that the library's DMA_Cmd function is the first place where something goes wrong (or at least the first place where something wrong is detected).
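
For context, full assert in StdPeriph routes every failed `assert_param` check to a user-supplied `assert_failed` handler. The following is only a host-side sketch modelled on that pattern, not the library's own code: the three recording variables are hypothetical, and the handler takes a `const char *` for simplicity. It shows how the failing file and line could be stored for inspection in a debugger.

```c
#include <stdint.h>

/* Hypothetical globals that remember the most recent failed check. */
static const char *volatile last_assert_file = 0;
static volatile uint32_t last_assert_line = 0;
static volatile uint32_t assert_fail_count = 0;

/* Called whenever an assert_param() expression evaluates to false. */
static void assert_failed(const char *file, uint32_t line)
{
    last_assert_file = file;
    last_assert_line = line;
    assert_fail_count++;
    /* On target, a breakpoint or an endless loop would typically go here. */
}

/* Sketch of the check macro: do nothing on success, report on failure. */
#define assert_param(expr) \
    ((expr) ? (void)0 : assert_failed(__FILE__, (uint32_t)__LINE__))
```

Keeping the location in `volatile` variables means it survives optimisation and can be read back after the program is halted.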

It begins like this:

void DMA_Cmd(DMA_Stream_TypeDef* DMAy_Streamx, FunctionalState NewState)
{
    /* Check the parameters */
    assert_param(IS_DMA_ALL_PERIPH(DMAy_Streamx));
    assert_param(IS_FUNCTIONAL_STATE(NewState));
    (...)
}

Apparently, the IS_DMA_ALL_PERIPH macro produces a false result, suggesting that the DMAy_Streamx variable isn't actually a DMA stream.

Here's what the macro looks like:

#define IS_DMA_ALL_PERIPH(PERIPH) (((PERIPH) == DMA1_Stream0) || \
                                   ((PERIPH) == DMA1_Stream1) || \
                                   ((PERIPH) == DMA1_Stream2) || \
                                   ((PERIPH) == DMA1_Stream3) || \
... 
more of the same
...
                                   ((PERIPH) == DMA2_Stream6) || \
                                   ((PERIPH) == DMA2_Stream7))

The weird thing is that it's called from the same context every time (with a DMA1_Stream4 value), and the argument of the function definitely is a DMA stream.

So I decided to dive into the register values and the assembly. Instead of the assert_param macro, I set a simple trap to catch the first occurrence of the fault and inspect the register values (the function is called correctly thousands of times before the fault occurs):

if(IS_DMA_ALL_PERIPH(DMAy_Streamx) == 0) {
    while(1);
}

Here's the disassembly of the relevant part of the function:

08000aac:   ldr     r3, [pc, #136]  ; (0x8000b38 <DMA_Cmd+140>)
08000aae:   cmp     r0, r3
08000ab0:   push    {r4, lr}
08000ab2:   mov     r4, r0
 485        if(IS_DMA_ALL_PERIPH(DMAy_Streamx) == 0) {
08000ab4:   beq.n   0x8000b14 <DMA_Cmd+104>
08000ab6:   adds    r3, #24
08000ab8:   cmp     r0, r3
08000aba:   beq.n   0x8000b14 <DMA_Cmd+104>
08000abc:   adds    r3, #24
08000abe:   cmp     r0, r3

...
12 identical checks here
...

08000b0a:   beq.n   0x8000b14 <DMA_Cmd+104>
08000b0c:   adds    r3, #24
08000b0e:   cmp     r0, r3
08000b10:   beq.n   0x8000b14 <DMA_Cmd+104>
08000b12:   b.n     0x8000b12 <DMA_Cmd+102>

So, the r0 value gets copied to the r4 register and then is successively compared to different values, stored in r3.

But the weird thing happens when investigating the register values after the program falls into the trap (loops at 08000b12):

r0  0x40026078 (Hex)    
r1  0x0 (Hex)   
r2  0x40003800 (Hex)    
r3  0x400264b8 (Hex)    
r4  0x40026070 (Hex)    
(...)
sp  0x1000fe50  
lr  0x800225b (Hex) 
pc  0x8000b12 <DMA_Cmd+102>

What? The r0 value is copied to r4 and compared against r3 without ever being modified, yet after those comparisons r0 and r4 somehow differ by 8?
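
The dumped values can be checked against the stream addresses on a PC. This is a minimal sketch, assuming the STM32F4 memory map (DMA1 at 0x40026000, DMA2 at 0x40026400, stream 0 at offset 0x10, and 0x18 bytes per stream, which matches the `adds r3, #24` stride in the disassembly). The macro and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed STM32F4 memory map values (check against the reference manual). */
#define DMA1_BASE_ADDR  0x40026000u
#define DMA2_BASE_ADDR  0x40026400u
#define STREAM0_OFFSET  0x10u
#define STREAM_STRIDE   0x18u   /* 24 bytes, same step as "adds r3, #24" */

/* Address of stream `index` (0..7) of the controller at `dma_base`. */
static uint32_t stream_addr(uint32_t dma_base, unsigned index)
{
    assert(index < 8u);
    return dma_base + STREAM0_OFFSET + index * STREAM_STRIDE;
}

/* Host-side equivalent of IS_DMA_ALL_PERIPH for a raw address. */
static int is_dma_stream_addr(uint32_t addr)
{
    for (unsigned i = 0; i < 8u; ++i) {
        if (addr == stream_addr(DMA1_BASE_ADDR, i) ||
            addr == stream_addr(DMA2_BASE_ADDR, i)) {
            return 1;
        }
    }
    return 0;
}
```

Under these assumptions, r4 (0x40026070) is DMA1_Stream4, r3 (0x400264b8) is DMA2_Stream7, the last value compared, and r0 (0x40026078) matches no stream at all, so the trap fires for a value the caller never passed.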

At first I thought that some interrupt might stop this function in the middle and corrupt the value of r0, but I don't think that's possible: the code I've shown is called from an ISR with preemption priority 0, which means it cannot be preempted by other interrupts (apart from fault exceptions, NMI and other things I don't use).

Does anyone have any idea what could be the cause of this weird behaviour?

adam
  • Do you know the type of error that causes the hardfault? There are registers for this in the SCB. – Jeroen3 Jul 22 '19 at 09:44
  • with the asserts on I don't even get to the hard fault, because the error gets caught earlier (by the `IS_DMA_ALL_PERIPH` macro). – adam Jul 22 '19 at 10:17
  • You should not focus on tested code, you should focus on code around it. Something probably corrupted whatever it loaded into `*DMAy_Streamx`. – Jeroen3 Jul 22 '19 at 10:55
  • I know, but I'm just having a hard time figuring out what could possibly be corrupting this value. As the registers show, the *DMAy_Streamx value is correct at the entry point of the function (it gets copied to r4, which contains a proper value upon inspection) and is not changed afterwards (as the disassembly shows). This code is called from the interrupt of preemption priority 0, and if I understand correctly how the NVIC works, it should not be interruptible from any other user code. Could this possibly be a hardware or configuration issue? I'm confused, because the second board works fine. – adam Jul 22 '19 at 11:14
  • What is the value of r0 before the branch instruction? – Jeroen3 Jul 22 '19 at 12:04
  • huh. Interestingly enough, after I added an instruction to store the r0 value (`DMA_Stream_TypeDef* volatile x = DMAy_Streamx;`) the problem miraculously disappeared. Damn Heisenbugs... It probably only suppressed the issue, but it's still weird. – adam Jul 22 '19 at 13:12
  • Making the pointer volatile is not the solution... Just saying. Now we're commenting too much. – Jeroen3 Jul 22 '19 at 13:16
  • I know, I just wanted to inspect the r0 value. I used `volatile` so that the compiler wouldn't optimise the value away (I'm working at the -Os optimization level, so it got rid of the variable without the `volatile` qualifier). And you're right. – adam Jul 22 '19 at 13:34
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/96485/discussion-between-jeroen3-and-adam). – Jeroen3 Jul 22 '19 at 13:37
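
On the first comment's point about the fault registers in the SCB: once the hard fault itself is reached, the cause can be narrowed down from CFSR and HFSR. The following is a rough sketch, assuming the standard ARMv7-M layout (CFSR at 0xE000ED28, HFSR at 0xE000ED2C). The enum and function names are hypothetical, and the functions take the raw register values so they can be tried on a PC.

```c
#include <stdint.h>

/* Assumed ARMv7-M (Cortex-M3/M4) fault status register addresses. */
#define SCB_CFSR_ADDR  0xE000ED28u
#define SCB_HFSR_ADDR  0xE000ED2Cu

/* CFSR is split into three sub-registers. */
#define CFSR_MMFSR_MASK  0x000000FFu  /* MemManage fault status */
#define CFSR_BFSR_MASK   0x0000FF00u  /* BusFault status */
#define CFSR_UFSR_MASK   0xFFFF0000u  /* UsageFault status */

/* HFSR bits. */
#define HFSR_VECTTBL  (1u << 1)   /* fault on vector table read */
#define HFSR_FORCED   (1u << 30)  /* escalated configurable fault */

typedef enum {
    FAULT_NONE,
    FAULT_MEMMANAGE,
    FAULT_BUS,
    FAULT_USAGE
} fault_class_t;

/* Report which CFSR sub-register has a bit set (lowest one wins). */
static fault_class_t classify_cfsr(uint32_t cfsr)
{
    if (cfsr & CFSR_MMFSR_MASK) return FAULT_MEMMANAGE;
    if (cfsr & CFSR_BFSR_MASK)  return FAULT_BUS;
    if (cfsr & CFSR_UFSR_MASK)  return FAULT_USAGE;
    return FAULT_NONE;
}

/* True when the hard fault was escalated from another fault type. */
static int hardfault_is_escalated(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED) != 0u;
}
```

On the target, the inputs would come from `SCB->CFSR` and `SCB->HFSR` (CMSIS) or from volatile reads of the two addresses, typically from inside the HardFault handler.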
