28

Once a chip overheats it can start malfunctioning - for example many programs may start failing once some or all parts in a computer overheat.

What exactly happens that makes chips malfunction when they overheat?

stevenvh
sharptooth

6 Answers

28

To expand on other answers.

  1. Higher leakage currents: this can lead to more heating issues and can easily result in thermal runaway.
  2. The signal-to-noise ratio decreases as thermal noise increases. This can result in a higher bit error rate, which causes program instructions to be misread and commands to be misinterpreted, producing "random" operation.
  3. Dopants become more mobile with heat. In a fully overheated chip the transistors can cease to be transistors. This is irreversible.
  4. Uneven heating can make the crystalline structure of the Si break down. You can see the same thing by putting glass through a temperature shock: it shatters. That is a bit extreme, but it illustrates the point. This is irreversible.
  5. ROM memories that depend on a charged isolated plate can lose their contents as temperature increases. The thermal energy, if high enough, can allow electrons to escape the charged conductor. This can corrupt program memory. I regularly see this when someone overheats an already-programmed IC during soldering.
  6. Loss of transistor control: with enough thermal energy, electrons can jump the bandgap. A semiconductor is a material whose bandgap is small enough to be easily bridged with dopants, but large enough that the required operating temperature does not turn it into a conductor, which is what happens once the gap is smaller than the thermal energy of the material. This is an oversimplification and is the basis of another post, but I wanted to add it and put it in my own words.

There are more reasons, but these make an important few.
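As a rough illustration of point 2, the Johnson-Nyquist formula v_n = sqrt(4·k·T·R·B) gives the RMS thermal noise voltage across a resistance. The snippet below is only a sketch: the function name, the 1 kΩ resistance and the 1 MHz bandwidth are made-up assumptions, and real bit error rates depend on far more than this single noise source.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant (exact SI value)


def thermal_noise_vrms(temp_k, resistance_ohm, bandwidth_hz):
    """RMS Johnson-Nyquist noise voltage: sqrt(4 * k * T * R * B)."""
    return math.sqrt(
        4.0 * BOLTZMANN_J_PER_K * temp_k * resistance_ohm * bandwidth_hz
    )


if __name__ == "__main__":
    # Hypothetical node: 1 kOhm source resistance, 1 MHz bandwidth.
    for temp_c in (25, 85, 125, 150):
        v = thermal_noise_vrms(temp_c + 273.15, 1e3, 1e6)
        print(f"{temp_c:4d} C: {v * 1e6:.2f} uV rms")
```

Because the noise voltage only grows with the square root of absolute temperature, this effect on its own is modest; it matters most where the signal margin is already thin.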

Kortuk
  • It seems likely that timing failures would be one of the "more reasons" (wire resistance tends to increase with temperature, so resistance-capacitance limited timing paths might violate their guaranteed worst case time). Of course, DRAM also leaks charge (like flash memory) faster at higher temperatures; without a compensation in refresh rate data can be lost. –  Feb 20 '15 at 12:00
14

The main problem with IC operation at high temperatures is the greatly increased leakage current of individual transistors. The leakage current can increase to such an extent that the switching voltage levels of the devices are affected, so that signals can't propagate properly within the chip and it stops functioning. Chips usually recover when allowed to cool down, but that is not always the case.

Manufacturing processes for high-temperature operation (up to 300C) employ silicon-on-insulator CMOS technology because of the low leakage over a very wide temperature range.

Leon Heller
9

Just one addition to some excellent answers: technically it isn't the dopants that get more mobile; it is an increase in intrinsic carrier concentration. If anything, the carriers get less mobile as the silicon crystal lattice starts to "vibrate" with the increased thermal energy, making it harder for the electrons and holes to flow through the device. I believe physics calls this optical phonon scattering, but I may be wrong.

When the intrinsic carrier concentration increases beyond the doping level you lose electrical control of the device. Intrinsic carriers are the ones that are there before we dope the silicon; the idea of semiconductors is that we add our own carriers to create pn junctions and the other interesting things that transistors do. Silicon tops out at about 150 °C, so heat sinking RF and high-speed processors is very important, as 150 °C is not too difficult to reach in practice. There is a direct link between intrinsic carrier concentration and the off-state leakage current of a device.
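To put rough numbers on how quickly the intrinsic carrier concentration climbs, here is a minimal sketch using the usual textbook scaling n_i ∝ T^1.5 · exp(-Eg / 2kT). The constants are assumptions on my part (n_i ≈ 1e10 cm^-3 at 300 K, a fixed bandgap of 1.12 eV), the function names are made up, and the model ignores the slight narrowing of the bandgap with temperature.

```python
import math

K_EV_PER_K = 8.617333e-5     # Boltzmann constant in eV/K
EG_SILICON_EV = 1.12         # assumed constant bandgap of silicon
NI_300K_PER_CM3 = 1.0e10     # assumed intrinsic concentration at 300 K
T_REF_K = 300.0


def intrinsic_concentration(temp_k):
    """Approximate n_i(T) for silicon, scaled from the 300 K value."""
    prefactor = (temp_k / T_REF_K) ** 1.5
    exponent = -EG_SILICON_EV / (2.0 * K_EV_PER_K) * (1.0 / temp_k - 1.0 / T_REF_K)
    return NI_300K_PER_CM3 * prefactor * math.exp(exponent)


def doping_margin(doping_per_cm3, temp_k):
    """Ratio of dopant concentration to n_i; control is lost as it nears 1."""
    return doping_per_cm3 / intrinsic_concentration(temp_k)


if __name__ == "__main__":
    doping = 1e15  # hypothetical lightly doped region, cm^-3
    for temp_c in (27, 100, 150, 200, 300):
        t_k = temp_c + 273.15
        print(f"{temp_c:4d} C: n_i = {intrinsic_concentration(t_k):.2e} cm^-3, "
              f"margin = {doping_margin(doping, t_k):.2e}")
```

Under these assumptions n_i grows by well over two orders of magnitude between room temperature and 150 °C, which is the trend the leakage current follows.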

As the other chaps have shown, this is just one of the reasons chips fail. It can even come down to something as simple as a wire bond getting too hot and popping off its pad; there's a huge list of things.

SimonBarker
  • When I say that the dopants become more mobile, I mean the physical atoms, not the carriers. The PN junction can drift and stop being a diode with time and heat. Second, when you reach a high enough temperature, your thermal energy, which creates both high-energy phonons that interact with the electrons and much higher IR levels inside the structure, can give electrons enough energy to jump the bandgap between the conduction and valence bands. The Si tops out because its bandgap is such that 150 °C will give electrons the ability to jump. – Kortuk May 05 '11 at 06:38
  • Yeah, I think we are saying the same thing just from different starting point. – SimonBarker May 05 '11 at 08:41
  • 1
    The way you are explaining it sounds exactly how I would have after taking device physics. After taking some applied quantum and solid-state devices I say it a little differently, but we both know how oversimplified these explanations are. I added a bit about this effect to my answer as I think it is very important, and I gave you your first +1, which you deserved. This is an important effect as it leads to thermal runaway very quickly. – Kortuk May 05 '11 at 08:46
8

Although leakage currents increase, I would expect a bigger issue for many MOS-based devices is that the amount of current passed through a MOS transistor in the "on" state will decrease as the device gets hot. For a device to operate correctly, a transistor which is switching a node must be able to charge or discharge any latent capacitance in that part of the circuit before anything else relies upon that node having been switched. Reducing the current-passing ability of transistors will reduce the rate at which they can charge or discharge nodes. If a transistor is unable to charge or discharge a node sufficiently before another part of the circuit relies upon that node having been switched, the circuit will malfunction.
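A minimal sketch of that timing argument follows. It assumes the on-current falls with carrier mobility as roughly T^-1.5 (the exponent is an assumption, and the partly compensating drop in threshold voltage is ignored), and that each node takes about C·V/I to switch. All names and component values are hypothetical.

```python
def on_current(i_on_ref_a, temp_k, t_ref_k=300.0, mobility_exponent=1.5):
    """On-state current, assumed to scale with mobility as (T/T_ref)^-exponent."""
    return i_on_ref_a * (temp_k / t_ref_k) ** (-mobility_exponent)


def node_switch_time(c_node_f, v_swing_v, i_on_a):
    """Time to charge or discharge a node: t = C * V / I."""
    return c_node_f * v_swing_v / i_on_a


def meets_timing(n_stages, c_node_f, v_swing_v, i_on_ref_a, temp_k, clock_hz):
    """True if a chain of n_stages nodes settles within one clock period."""
    i_on = on_current(i_on_ref_a, temp_k)
    path_delay = n_stages * node_switch_time(c_node_f, v_swing_v, i_on)
    return path_delay <= 1.0 / clock_hz


if __name__ == "__main__":
    # Hypothetical path: 8 stages, 10 fF per node, 1 V swing,
    # 100 uA drive at 300 K, 1 GHz clock.
    for temp_k in (300.0, 350.0, 400.0):
        ok = meets_timing(8, 10e-15, 1.0, 100e-6, temp_k, 1e9)
        print(f"{temp_k:.0f} K: {'meets timing' if ok else 'timing violation'}")
```

With these made-up numbers the path uses 80% of the clock period at 300 K, so a modest loss of drive current is enough to push it over the edge when hot.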

Note that for NMOS devices, there was a design trade-off when sizing passive pull-up transistors: the bigger a passive pull-up, the more quickly the node could switch from low to high, but the more power would be wasted whenever the node was low. Many such devices were therefore operated somewhat near the edge of correct operation, and heat-based malfunctions were (and for vintage electronics, remain) fairly common. For common CMOS electronics, such issues are generally less severe; I have no idea to what extent they play a part in practice in things like multi-GHz processors.

supercat
  • 2
    This is a very important effect, I was about to ask Kortuk to add it to his answer. One of the factors behind the max Tj spec for a processor is that above that Tj the processor may not work at the rated speed. This is also why better cooling helps in overclocking. – Andy May 04 '11 at 16:00
  • The first paragraph is why your computer stops working when it gets hot - it slows down too much to keep pace with the clock frequency. – W5VO May 04 '11 at 20:39
  • Actually, there's another factor which may possibly have played a role in NMOS devices, though I wouldn't expect it in most typical designs: many NMOS devices had *minimum* clock speeds, imposed by a requirement to use or refresh the data in dynamic storage nodes before it got drained out by leakage. If leakage currents increase with temperature, the minimum clock speed would also increase. I suspect most devices were operated sufficiently above minimum clock speed that an increase in the minimum speed wouldn't be a problem, but I'm not sure. – supercat May 04 '11 at 20:53
  • @Andy, @W5VO, I was writing my answer last night and forgot that mid way. Night shift does damage to your brain. – Kortuk May 05 '11 at 00:00
2

To complement existing answers, today's circuits are sensitive to the following two aging effects (not only these but they're the main ones on processes < 150nm):

Because temperature increases carrier mobility, it worsens the effects of HCI (hot-carrier injection) and NBTI (negative-bias temperature instability), but temperature is not the primary cause of NBTI and HCI:

  • HCI is caused by a high frequency
  • NBTI by a high voltage

These two silicon aging effects cause both reversible and irreversible damage to the transistors (by degrading the insulator), which increases the transistor threshold voltage (Vt). As a result the part will require a higher voltage to maintain the same level of performance, which implies an increase in the operating temperature and, as said in other posts, increased transistor gate leakage will follow.

To summarise, temperature will not really make the part age faster; it is higher frequency and voltage (i.e. overclocking) that make a part age. But transistor aging will require a higher operating voltage, which makes the part heat up more.

Corollary: the consequence of overclocking is an increase in temperature and required voltage.

Eric
1

The general reason ICs fail irreversibly is that the aluminium metal inside them, which is used to create interconnects between the various elements, melts and opens or shorts devices.

Yes, leakage currents will increase, but generally it's not the leakage current itself that is a problem, but the heat that this causes, and the consequent damage to the metal inside the IC.
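The feedback between leakage and heating can be sketched as a fixed-point iteration: junction temperature sets leakage power, and leakage power feeds back into junction temperature through the thermal resistance. This is only an illustration under stated assumptions; the "leakage doubles every 10 °C" rule of thumb, the 300 °C cut-off, the function name and every component value here are hypothetical.

```python
def settle_temperature(t_ambient_c, r_th_c_per_w, p_dynamic_w, p_leak_ref_w,
                       t_ref_c=25.0, doubling_c=10.0, t_limit_c=300.0,
                       max_iter=1000, tol_c=1e-6):
    """Iterate T = T_amb + R_th * (P_dyn + P_leak(T)) until it settles.

    Returns the steady-state junction temperature in C, or None if the
    temperature passes t_limit_c (treated as thermal runaway) or fails
    to converge within max_iter steps.
    """
    t = t_ambient_c
    for _ in range(max_iter):
        # Assumed rule of thumb: leakage power doubles every doubling_c degrees.
        p_leak = p_leak_ref_w * 2.0 ** ((t - t_ref_c) / doubling_c)
        t_next = t_ambient_c + r_th_c_per_w * (p_dynamic_w + p_leak)
        if t_next > t_limit_c:
            return None  # runaway: heating outpaces what the cooling removes
        if abs(t_next - t) < tol_c:
            return t_next
        t = t_next
    return None


if __name__ == "__main__":
    # Same hypothetical chip (20 W dynamic, 0.5 W leakage at 25 C),
    # first with a good heatsink, then with a poor one.
    print(settle_temperature(25.0, 1.0, 20.0, 0.5))
    print(settle_temperature(25.0, 5.0, 20.0, 0.5))
```

With the lower thermal resistance the loop settles at a stable temperature; with the higher one each step adds more leakage than the cooling path can remove, and the function reports runaway.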

Power circuits (e.g. power supplies, high current drivers etc.) can get damaged because at high voltages, when the transistor drivers switch off quickly, internal currents are generated which cause latch up of the device, or uneven power distribution inside it which causes local heating and subsequent metal failure.

A large number (thousands) of repeated thermal cycles can cause failure because of mismatches between the mechanical expansion of the IC and that of the package, eventually causing bond wires to be ripped off, or delamination of the plastic package material and subsequent mechanical failure.

Of course a large number of IC parametric specs are only specified over a given temperature range, and these may not be in spec outside this. Depending on the design, this can cause failure, or unacceptable parametric shift (while the IC is outside the temperature range) -- this can occur for extreme high or low temperatures.

jp314
  • Aluminium melts at 660°C (​1220°F). ICs die well before this temperature is reached. – Dmitry Grigoryev Jan 08 '16 at 08:59
  • Fundamentally no. At temperatures below this you can certainly get undesired electrical behavior, excessive heating and thermal runaway, but this doesn't actually cause a permanent failure until some portion of the circuit reaches a temperature where the Al (or other metal) diffuses into the silicon. This (eutectic point) is around 500-600 °C. Most other failures are recoverable. Additional failures may be caused by electrical malfunctions allowing excessive voltage to be applied to transistors' gates, or by thermal cycles (which cause mechanical failures). – jp314 Jan 09 '16 at 02:31
  • I still have my doubts. For example, ICs usually specify max soldering temperature around 300°C, so it seems that going over that limit is enough to cause permanent damage. – Dmitry Grigoryev Jan 11 '16 at 11:14