Multi MPU programming standard or algorithm?

Question

I've been reading that electronics in space vehicles (satellites and spacecraft) have redundant systems. I'm wondering whether there are standards or algorithms (I'm not sure which term applies) that I could use if, say, I wanted to experiment with four (not sure about the number either) identical PCBs, each with an MPU, doing the same task. For the MPU I'm thinking of a 16-bit part such as a Microchip PIC24. The idea is that if any of the PCBs suffers a single event upset (SEU), for example from a radiation strike, the remaining PCBs could keep working and ignore the one that was hit. I imagine there should also be a way to check whether the affected PCB can recover. I'm asking what to look into for this kind of programming: searching Google for multi-MPU or multi-CPU keeps returning results about multi-core programming, which is not what I'm looking for.

Thanks

Try keywords like "modular redundancy" and "fail-safe". – Elliot Alderson, Sep 17 '21 at 21:01

score 1 · Answer 1 · answered Sep 17 '21 at 21:12

There is nothing really special about this kind of programming in terms of examining inputs and generating outputs: every board should run the same code for that.

Where this diverges from normal programming is that each "module" also needs to back up the other modules, that is, compare their calculated outputs against the expected output. If a module is producing different output, then either the checking module or the module being checked is suspect...

How do you figure out which one? Voting is one common method used. For example:

Board 1 calculates 123.3
Board 2 calculates 224.2
Board 3 calculates 123.3
Board 4 calculates 123.3

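Which board is wrong? Board 2, because the other three agree and outvote it. This type of system requires at least three voting boards. Once it is down to two or fewer, voting no longer works: two boards that disagree can detect the fault but cannot tell which one is wrong. Some systems may also use supervisors...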
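A minimal majority-voting sketch in C for the four-board example above. The names `majority_vote`, `board_is_outvoted`, and `VOTE_TOLERANCE` are hypothetical, and it assumes each board's result has already been collected into one array (how the boards exchange results is not shown). A small tolerance is used instead of exact equality because boards reading independent sensors may produce slightly different results; tolerance-based agreement is not strictly transitive, so treat this as an illustration rather than a qualified voter.

```c
#include <stddef.h>

/* Hypothetical tolerance: results closer than this are counted as agreeing. */
#define VOTE_TOLERANCE 0.01

static double abs_diff(double a, double b)
{
    return a > b ? a - b : b - a;
}

/* Returns the index of a result that agrees with a strict majority of the
   n boards, or -1 if there is no majority (e.g. two boards that disagree). */
int majority_vote(const double *results, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t agree = 0;
        for (size_t j = 0; j < n; j++) {
            if (abs_diff(results[i], results[j]) <= VOTE_TOLERANCE) {
                agree++;  /* counts board i itself as well */
            }
        }
        if (2 * agree > n) {
            return (int)i;
        }
    }
    return -1;
}

/* Returns 1 if board k disagrees with the majority, 0 otherwise
   (including when there is no majority to compare against). */
int board_is_outvoted(const double *results, size_t n, size_t k)
{
    int winner = majority_vote(results, n);
    if (winner < 0) {
        return 0;
    }
    return abs_diff(results[k], results[(size_t)winner]) > VOTE_TOLERANCE;
}
```

Calling `majority_vote` on `{123.3, 224.2, 123.3, 123.3}` returns index 0 (a majority value), and `board_is_outvoted` flags only board 2 (index 1). With two boards that disagree, `majority_vote` returns -1, which is the "two or fewer" case.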

They can also use independent inputs so that sensors can be checked against each other (the sensors are often either smart sensors that provide CRC-protected data or simple sensors that just output a voltage). By comparing sensor readings, the boards can tell whether one of them is operating with a bad input.
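One common way to check redundant sensors against each other is mid-value select: with three readings of the same quantity, use the median and flag any sensor that strays too far from it. This is a swapped-in illustration rather than something the answer specifies; the function names, the ADC-count units, and the `SENSOR_MAX_DEVIATION` limit are hypothetical.

```c
#include <stdint.h>

/* Hypothetical limit (in ADC counts) for how far a sensor may stray from the
   mid value before it is flagged as suspect. Tune for the real sensors. */
#define SENSOR_MAX_DEVIATION 20

static int32_t abs_diff_i32(int32_t a, int32_t b)
{
    return a > b ? a - b : b - a;
}

/* Mid-value select: returns the median of three readings of the same
   quantity, so one sensor failing high or low cannot drive the output. */
int32_t mid_value_select(int32_t a, int32_t b, int32_t c)
{
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

/* Sets bit i of the returned mask for each sensor whose reading is more
   than SENSOR_MAX_DEVIATION away from the mid value. */
unsigned suspect_sensors(const int32_t readings[3])
{
    int32_t mid = mid_value_select(readings[0], readings[1], readings[2]);
    unsigned mask = 0;
    for (unsigned i = 0; i < 3; i++) {
        if (abs_diff_i32(readings[i], mid) > SENSOR_MAX_DEVIATION) {
            mask |= 1u << i;
        }
    }
    return mask;
}
```

For readings `{512, 4095, 515}` (sensor 1 stuck at full scale), the mid value is 515 and `suspect_sensors` returns `0x2`, flagging only that sensor.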

Additionally, there are self-check routines, in which the boards run memory integrity checks, firmware CRC checks, input/output checks, etc. If a board finds that it has bad data, it can take itself offline.
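The firmware CRC self-check can be sketched as below. It uses the widely used CRC-32 (reflected polynomial 0xEDB88320, the variant used by zlib and Ethernet), which is an assumption; a real design might use a CRC-16 or the MCU's hardware CRC unit if it has one. `crc32_compute` and `firmware_image_ok` are hypothetical names, and the sketch assumes the image is readable as a byte buffer, which on a PIC24 would mean reading program memory through the device's own mechanism (not shown).

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32: reflected polynomial 0xEDB88320, initial value and final
   XOR 0xFFFFFFFF. Small but slow; a table-driven or hardware CRC is faster. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Self-check: recompute the CRC over the firmware image and compare it with
   the value stored when the image was built. Returns 1 if the image is
   intact, 0 if it has been corrupted (e.g. by a bit flip in flash). */
int firmware_image_ok(const uint8_t *image, size_t len, uint32_t stored_crc)
{
    return crc32_compute(image, len) == stored_crc;
}
```

The standard check value for this CRC is `0xCBF43926` over the ASCII string `123456789`, which is a quick way to confirm the routine before trusting it.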

These are just some of the methods used to provide redundancy; the actual implementation, and how elaborate it gets, really depends on your failure modes. The starting point is to determine how your system can fail and then design a redundancy scheme around that.

score 1 · Answer 2 · answered Sep 17 '21 at 23:11

The easiest way would be to use a master/slave setup, with peripheral modules on some sort of bus (such as I2C) and the master monitoring them in a closed loop.

You can then easily enable or disable the peripheral modules if their outputs don't match the results expected from what they were commanded to do, and you could even switch from one to the other so they validate each other. This way you only need two modules, not three, because by comparing each module's results to the expected closed-loop results, you can determine which one is correct, or whether the results have actually gone outside expected parameters. This is how car engine control works, just without the redundancy.
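A minimal sketch of the master's closed-loop check, assuming the master already knows what reading each commanded action should produce. `module_state`, `check_module`, `select_active`, and the two thresholds are hypothetical, and the I2C traffic is left out. Requiring several consecutive mismatches before disabling a module is a design choice to ride out a single noisy reading; setting `MAX_MISMATCHES` to 1 would take a module offline on the first mismatch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds; real values depend on the process controlled. */
#define CHECK_TOLERANCE 5   /* allowed |measured - expected| */
#define MAX_MISMATCHES  3   /* consecutive failures before disabling */

typedef struct {
    bool    enabled;
    uint8_t mismatches;   /* consecutive out-of-tolerance readbacks */
} module_state;

/* Called by the master after commanding a module and reading back the
   closed-loop feedback. Returns true if the module is still trusted. */
bool check_module(module_state *m, int32_t expected, int32_t measured)
{
    int32_t err = measured - expected;
    if (err < 0) {
        err = -err;
    }

    if (err <= CHECK_TOLERANCE) {
        m->mismatches = 0;          /* a good reading clears the count */
    } else if (++m->mismatches >= MAX_MISMATCHES) {
        m->enabled = false;         /* take the module offline */
    }
    return m->enabled;
}

/* With two modules, use whichever is still enabled (primary preferred).
   Returns 0 or 1, or -1 if both have been disabled. */
int select_active(const module_state mods[2])
{
    if (mods[0].enabled) return 0;
    if (mods[1].enabled) return 1;
    return -1;
}
```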

The most difficult part is determining what triggers the slave controller to take over, and writing the code that validates that the master is doing what it should. With an SEU in particular, this is not going to be obvious or easy to predict, and the right response to an SEU will differ for every application: some SEUs won't matter for certain functions but could be extremely bad for others.
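One simple trigger for the slave to take over is a heartbeat: the master increments a counter that the slave watches, and the slave takes over if the counter stops changing for too long. This is an illustrative choice, not something the answer prescribes; `takeover_monitor`, `monitor_tick`, and `HEARTBEAT_TIMEOUT_TICKS` are hypothetical. It only detects a master that has hung or reset, not one that keeps running but computes wrong values after an SEU, which is why validating the master's outputs remains the hard part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: slave timer ticks without a fresh heartbeat before the
   slave assumes the master has failed. */
#define HEARTBEAT_TIMEOUT_TICKS 10

typedef struct {
    uint16_t last_counter;  /* last heartbeat counter seen from the master */
    uint16_t silent_ticks;  /* ticks since the counter last changed */
    bool     master_failed; /* latched once the timeout is reached */
} takeover_monitor;

/* Called once per timer tick on the slave with the master's latest heartbeat
   counter (read over the shared bus or a dedicated line; not shown).
   Returns true once the slave should take over. */
bool monitor_tick(takeover_monitor *mon, uint16_t master_counter)
{
    if (master_counter != mon->last_counter) {
        mon->last_counter = master_counter;
        mon->silent_ticks = 0;
    } else if (mon->silent_ticks < HEARTBEAT_TIMEOUT_TICKS) {
        mon->silent_ticks++;
    }
    if (mon->silent_ticks >= HEARTBEAT_TIMEOUT_TICKS) {
        mon->master_failed = true;  /* stays set even if the counter resumes */
    }
    return mon->master_failed;
}
```

Latching `master_failed` keeps a master that stalls and then restarts from grabbing control back on its own; whether it may rejoin is a separate recovery decision.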
