
I am currently on the fence in a debate regarding FPGA environmental testing (thermal cycling).

The question is, can the same FPGA stressed to the same thermal limits yield different results depending on how it is configured/programmed?

For example, data rates, data corruption, etc.

With something like a microcontroller I would say that regardless of how it is programmed the hardware remains the same, so its thermal properties will remain the same. FPGAs are a bit of a grey area, because although the configuration is described in code (i.e. HDL), that code creates a configuration in hardware within the FPGA.

Thoughts and opinions welcome, but if you have any relevant research papers or articles that would be great.

Thanks

Edit: To add a bit of clarity, I will be more specific about the use case. The unit under test is a system control unit, where the main processor is an i.MX6 Arm CPU, with the main signal processing unit being an IGLOO2 FPGA. There are two builds of software available for it:

  1. Test control software (TCS), designed to stress the unit for design proving activities, but serving no purpose in the real world.
  2. Operational software, the software that will be shipped with the item.

The argument is that because the environmental stress testing has been performed with the TCS, there is a gap in our proving evidence: thermal cycling has not been performed using the software with which the item will be shipped. The testing is performed in an environmental chamber which cycles between -40 °C and +70 °C. The counter argument is that the TCS is inherently designed to excessively stress the item, therefore the operational software should be fine.

The third argument thrown in to the mix is that regardless of the build of software, the hardware is the same, so the effects of external thermal stimuli should be the same.

Edit 2: Clearly I'm still struggling to explain myself. The product was designed internally but is being manufactured by a contractor in another country. The main qualification testing (environmental and EMC) is being performed by the contractor, however we are unable to ship them the 'real' software as it is restricted and cannot leave the country. With that in mind, we provided the contractor with technical requirements such as thermal environment and bus speeds, plus a 'sanitized' set of functional requirements designed to be representative without breaking restrictions. From this, they created their own software (a.k.a. the TCS). The question I am asking is: if the code is functionally representative but simply 'not the same' as the software that will ship with the product, then how much of a gap do we really have in our evidence?

  • Are you talking about things like spreading the configuration out on the die so it runs cooler? – DKNguyen Feb 02 '21 at 15:58
  • *"With something like a microcontroller I would say that regardless of how it is programmed the hardware remains the same, so its thermal properties will remain the same"* Not sure I understand what you mean but one thing is sure: depending on the code it runs, a microcontroller may be cooler or hotter. Waiting for an interrupt is not the same as switching a lot of flip-flops due to heavy calculations. The generated heat may even depend on the input data, not only on the executed code. Not sure how you link thermal stressing with data rates and data corruption, though. – dim Feb 02 '21 at 16:31
  • @DKNguyen No, I'm talking about external temperature cycling – Kierran Purden Feb 02 '21 at 17:36
  • @dim I've added more detail to the question. I'm talking about external thermal variances, but agreed the processing load on the device will affect its internal temperature. – Kierran Purden Feb 02 '21 at 17:37

3 Answers

1

Sorry, but both a microcontroller and an FPGA will have different thermal behavior depending on how they are used. If you intend to do thermal cycling or life testing you must run the application software (or similar) on a microcontroller and apply the application configuration (or similar) to an FPGA.

You need to be careful to differentiate between the thermal properties (which depend only on the physical construction and packaging) and the thermal behavior (which depends on usage as well). For thermal cycling and life test you want to be sure that the die itself is as hot, and in the same places, as it will be in use.
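The distinction above can be made concrete with a steady-state junction-temperature estimate. This is a minimal sketch assuming a single junction-to-ambient thermal resistance; the power and theta-JA figures below are illustrative assumptions, not IGLOO2 device data:

```python
# Illustrative junction-temperature estimate: the die runs hotter than
# the chamber ambient by P * theta_JA, so two software/bitstream loads
# with different power draw see different die temperatures in the SAME
# thermal chamber. All numbers are assumed for illustration only.

def junction_temp(ambient_c: float, power_w: float, theta_ja_c_per_w: float) -> float:
    """Steady-state single-resistance model: T_j = T_a + P * theta_JA."""
    return ambient_c + power_w * theta_ja_c_per_w

THETA_JA = 20.0  # C/W, assumed package junction-to-ambient resistance

# Same +70 C chamber soak, two hypothetical configurations:
tj_stress = junction_temp(70.0, 2.5, THETA_JA)  # stress bitstream drawing 2.5 W
tj_oper = junction_temp(70.0, 0.8, THETA_JA)    # operational bitstream, 0.8 W

print(tj_stress)  # 120.0
print(tj_oper)    # 86.0
```

A 34 °C difference in die temperature between the two loads, in the same chamber, is exactly the kind of thermal-behavior gap the answer is describing.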

EDIT: It sounds like your real problem is that you haven't convinced the customer that your "TCS" really is an overstress. If the customer remains skeptical then you have no choice but to run the actual application software. If you are trying to demonstrate something about reliability or failure rate, then running a test without application code (or demonstrably worse) is no test at all.

Elliot Alderson
  • "For thermal cycling and life test you want to be sure that the die itself is as hot, and in the same places, as it will be in use." Agreed, this is in relation to external temperature variances. I understand that the device temperature will vary based on usage and load. The external temperature is designed to stress the unit, and is therefore above the maximum expected device temperature. See edits to original post – Kierran Purden Feb 02 '21 at 17:38
  • My comments were not in regard to external (chamber) temperature but to the die temperature. Thermal resistance between the die and ambient will make the die hotter, and certain **areas** on the die may be hotter than others because of how the logic is used. You need to be extremely clear when you say "device temperature" whether you are referring to the die or to the outside surface of the package, or to something else altogether. – Elliot Alderson Feb 02 '21 at 17:44
  • As we are in development phase, the ‘customer’ at this point in time is internal to the company. This thermal testing will be for design robustness, but general pass-out testing will be performed using application software prior to any deliveries to real customers. Pass-out testing is usually performed at lab conditions (20-25C). Would you agree then that design proving tests with TCS, and pass-out testing with real software does indeed leave an evidential gap? – Kierran Purden Feb 02 '21 at 19:42
  • Sorry, I'm not familiar with the phrases "design proving tests" and "pass-out testing", and I've worked on projects with the most stringent reliability requirements in the industry. Tell us what exactly you want to prove...functionality? reliability? life time? – Elliot Alderson Feb 02 '21 at 19:52
  • I'm with Elliott here. My customers would never accept a unit that was only tested over the narrow temperature range you have mentioned, 20-25C. – SteveSh Feb 02 '21 at 20:28
  • @Elliot Alderson - Maybe "design proving test" == Design Verification Test, or maybe Qual Test. And "pass-out testing" == acceptance testing? – SteveSh Feb 02 '21 at 20:31
  • @SteveSh Your guess sounds reasonable to me. I'm still concerned that the OP can't explain what exactly they are testing and why. – Elliot Alderson Feb 02 '21 at 20:32
  • The power dissipation profile of your test cycle may mean that the device never reaches the extremes of the external temperature cycle. For example, if your application software spends a lot of time in sleep mode, but the test software is always active, the device may get a lot closer to your negative temperature limit when application code is running, leading to a greater delta-T that could result in reduced life. Reliability is related in many ways to the delta-T of the chip/package/pcb/assembly combination, so be sure you know how the difference between test and application affects it. – elchambro Feb 02 '21 at 23:51
  • @ElliotAlderson valid point, I really am struggling to explain my use case here (somewhat due to limitations in what I can actually divulge). Basically, I'm not suggesting only testing the product at a temp range of 20-25C, I'm saying that the design gets proven and qualified with stringent tests, then during the manufacture phase, individual unit acceptance will be simple functionality testing, with periodic batch samples being subjected to further testing. The question is really what is the consequence of using different software for those two stages? – Kierran Purden Feb 03 '21 at 07:34
  • But what does it mean that "the design gets proven and qualified"? This is too vague to be actionable. Is there a requirement for reliability or lifetime for your product, with definitions of worst-case environmental conditions? If so, how do you demonstrate that the product meets those requirements? – Elliot Alderson Feb 03 '21 at 12:39
0

There's an old saying in the aerospace industry that goes along the lines of "Test like you fly, and fly like you test". That means that when you're performing an acceptance test on a product as part of a sell-off to your customer, you want the test environment/scenario to be as close as possible to what the unit will see in use. The only difference in testing vs use is that acceptance testing usually involves a slightly wider temperature range than what might be seen operationally, in order to weed out marginal devices.

So following this line of reasoning, you do NOT want to use a special TCS firmware/software load for your acceptance tests.

Edit 1 - To answer the OP's questions

Not sure what you mean by the "evidential gap". But if you really want to stress the unit, just run it over a wider temperature range than what it expects to see operationally. Here's a contrived example. Say your unit is supposed to operate from 0C to 40C. At a minimum, every unit that's sold should be tested over (say) -5C to 45C. On a design verification (qualification) unit, you expand that temperature range even more, say running one unit over -15C to +60C.
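The margin scheme in that contrived example can be sketched as follows. The numbers are taken from the example above; the margins themselves are illustrative, not from any standard:

```python
# Sketch of widening an operational temperature range for acceptance
# and qualification testing. Margins are illustrative assumptions.

def widened_range(op_min_c: float, op_max_c: float,
                  low_margin_c: float, high_margin_c: float) -> tuple:
    """Return (min, max) of the operational range widened by the margins."""
    return (op_min_c - low_margin_c, op_max_c + high_margin_c)

operational = (0, 40)                                  # specified operating range
acceptance = widened_range(*operational, 5, 5)         # every unit sold
qualification = widened_range(*operational, 15, 20)    # one design-verification unit

print(acceptance)     # (-5, 45)
print(qualification)  # (-15, 60)
```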

The only reason I would do what you're suggesting - come up with a test load for the FPGA to make it run hotter - is if your thermal analysis shows that the FPGA is the weak link, i.e. the part with the least amount of thermal margin, and other components may not survive the extended temperature range testing I mentioned above.

SteveSh
  • As I’ve said to Elliot, this is design proving/robustness testing. Routine pass-out when they are being delivered to customers will be performed using real application software, albeit at a controlled lab temperature (20-25C). I’ve asked Elliot the same question, but with these facts in mind would you also agree that there is still an evidential gap? – Kierran Purden Feb 02 '21 at 19:46
0
  1. You can ask the TCS if retesting is required. I would recommend making an evaluation first (see item 3. below).
  2. Thermal stress does depend on the FPGA code and how it is used.
    Your FPGA will run hotter if you are running it at 100MHz than when you are running it at 100kHz.
    Example: if you had a big serial shift register in your FPGA and fed it a sequence of alternating ones and zeros at 100MHz, you could expect it to run hotter than when feeding it all zeros.
  3. You can evaluate the risk by measuring the temperature when running the program used for testing, and comparing that to running the application program under worst case conditions.
  4. It all depends on whether you wanted to pass testing like Volkswagen did, or if you want to actually test your product and make sure it is safe and functional under all expected conditions. Even when using a typical application, it does depend on whether your device is actually doing hard work or not.
  5. In the end you're still only testing a typical system - not one where all your components have worst case characteristics.
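The frequency and toggle-rate dependence in points 2 and 3 follows from the standard CMOS dynamic-power relation, P = alpha * C * V^2 * f. A minimal sketch, with all capacitance, voltage, and activity values assumed for illustration (not IGLOO2 data):

```python
# CMOS dynamic power: P = alpha * C * V^2 * f, where alpha is the toggle
# (activity) factor. Both clock frequency and data pattern scale the
# dissipated heat. All device values below are illustrative assumptions.

def dynamic_power_w(alpha: float, cap_f: float, vdd_v: float, freq_hz: float) -> float:
    """Dynamic switching power in watts."""
    return alpha * cap_f * vdd_v**2 * freq_hz

C = 2e-9   # 2 nF effective switched capacitance, assumed
V = 1.2    # core supply voltage, assumed

# Shift register fed 0101... toggles every flop each cycle (alpha ~ 1);
# fed all zeros, almost nothing toggles (alpha ~ 0.01, clock tree only).
p_toggle_100mhz = dynamic_power_w(1.0, C, V, 100e6)
p_idle_100mhz = dynamic_power_w(0.01, C, V, 100e6)
p_toggle_100khz = dynamic_power_w(1.0, C, V, 100e3)

print(round(p_toggle_100mhz, 4))  # 0.288
```

The same logic at 100kHz, or fed constant data, dissipates orders of magnitude less, which is why the configuration and the stimulus both matter for thermal testing.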
le_top