65

I've read a bit about the construction of a digital computer in Schocken and Nisan's The Elements of Computing Systems. But this book says nothing about certain electrical aspects of computers, for example: it's often said that 0's and 1's are represented by voltage; if the voltage is in the interval [0, 0.9), it is a 0, and if the voltage is in the interval [0.9, 1.5), it is a 1 (voltages may vary, I'm only giving an example).

But I never read what keeps the voltages "well-behaved", such that a 0 could never accidentally become a 1 due to electrical volatility[1] inside the computer. Perhaps it's possible for the voltage to be very near 0.9; what is done to keep it from crossing the threshold?

[1]: Supposing it exists.

Null
  • 7,448
  • 17
  • 36
  • 48
Red Banana
  • 767
  • 1
  • 6
  • 7
  • 8
    The current is never very near 0.9, because nothing ever makes the current very near 0.9. – user253751 Apr 27 '16 at 00:04
  • @immibis Yes. I believe it is so and have empirical evidence of that. But I want to know why. – Red Banana Apr 27 '16 at 00:12
  • 7
    Because things are designed to not output currents very near 0.9. You might as well ask "I have solid empirical evidence that my laptop is not charged to 50 gigavolts; why isn't it?" Simply because there's no reason it would be. – user253751 Apr 27 '16 at 00:14
  • 14
    Nitpick: Most digital logic uses voltages, not currents, to represent logical states. –  Apr 27 '16 at 00:54
  • @duskwuff Yes. I always mix them both. It also happens when I try to use left/right. :P – Red Banana Apr 27 '16 at 02:02
  • @Voyska OK. To avoid further confusion, I've suggested an edit to your question to make it consistently use the word "voltage". –  Apr 27 '16 at 02:40
  • 13
    anecdotal evidence: In 2011 I had a bit swapped in a file on a hdd that was working fine for 5 years. – PlasmaHH Apr 27 '16 at 07:54
  • 4
    @PlasmaHH Unrelated. That's [bit decay](https://en.wikipedia.org/wiki/Data_degradation), which occurs because the fact hard disks are capable of storing data for long periods of time at all is a miracle. – cat Apr 27 '16 at 11:55
  • 3
    Every year Microsoft's crash reporting service gets a few (accurate!) reports of machines that crashed on no-op instructions. What does this tell you? – Eric Lippert Apr 27 '16 at 12:18
  • 7
    These accidental switches are exploitable. [Google's explanation](http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html) is an interesting case for this. It definitely makes security a lot tougher when the real API for "Set bit X to 1" is "Set bit X to 1 and sometimes accidentally set bit Y to 1". This also becomes very common in the case of overclocked machines. I recall one company actually added a bunch of calculations (every frame) with known results to see whether the hardware was reliable enough to run the game. – Brian Apr 27 '16 at 13:16
  • 2
    There is a PDF linked at [this article](http://www.zdnet.com/article/flipping-dram-bits-maliciously/) about how manipulating RAM in a certain way can potentially flip a bit somewhere else in that RAM. It's been a while since I read the report, but it's relevant to this topic. – Steve Apr 27 '16 at 20:59
  • @EricLippert: Given that processors execute many instructions simultaneously, should it be surprising that a NOP would be executing when an asynchronous failure is detected? – supercat Apr 28 '16 at 18:39
  • @Steve: I wonder what efforts have been made since then. Having each access to a row randomly trigger a refresh 0.1% of the time will eliminate the issue with a performance cost typically bounded at about 0.1%; I wonder how many systems do such a thing? – supercat Apr 28 '16 at 18:41
  • @supercat: My point is: crash dumps occasionally indicate actually impossible situations like no-ops dereferencing null; this is not a failure of the reporting system giving an instruction on the wrong thread. When MSFT investigated a representative sample of these by following up with customers, they were found to be highly likely to have purchased machines secondhand that they did not know had been overclocked. The processors were hot enough and operating off-label enough that the exception bits were being flipped. So how can we be sure that computers don't flip bits? Keep 'em cool and slow! – Eric Lippert Apr 28 '16 at 21:39
  • 1
    Altitude increases probability of bits flipping, as well as factors like age, heat, manufacturing quality, etc.; see [this field test article](http://dl.acm.org/citation.cfm?doid=2503210.2503257) for some more details. – WBT Apr 28 '16 at 22:07
  • @Brian came here just to comment about rowhammer. Super interesting. – Qix - MONICA WAS MISTREATED Apr 29 '16 at 03:28
  • 3
    We can't be sure. That's why we need ECC RAM for servers running for a long time and special software+hardware for special environments like space, inside X-ray machines or other radioactive devices. [Compiling an application for use in highly radioactive environments](http://stackoverflow.com/q/36827659/995714) – phuclv Apr 29 '16 at 03:34
  • @PlasmaHH Surely that would have been caused by magnetism rather than electricity? (Or possibly electromagnetism, which is sort of both). – Pharap Apr 29 '16 at 05:20
  • @Pharap: since the blocks on the HDD are checksummed, this would have been a weird coincidence. The more likely explanation would be some filesystem defragmentation action going on and on the way something flipped, but no one will ever find out for sure – PlasmaHH Apr 29 '16 at 07:10
  • I interpreted the question as being about the normal case - i.e. "why don't computers accidentally switch 0 to 1 *all the time*?" Almost everything goes wrong occasionally, that's the price of having cheap things. – user253751 Apr 29 '16 at 11:48
  • @EricLippert: I don't know anything close to all the details of modern x86 platforms, but I would expect that Intel/AMD would want to include a "something went off the rails somewhere" trap in cases where the instruction sequencing logic found itself in an "impossible" situation [e.g. one part of the executing unit is waiting for a result from another part that has no operations pending]. I would not expect the instruction-pointer value stored by such a trap to be very "precise", since the whole reason for the trap would be that the processor had lost track of what instructions it was... – supercat Apr 29 '16 at 16:15
  • ...supposed to be executing, so if a nop was somewhere in an execution pipeline when the fault occurred, I would not be surprised if it got listed as the trap address. Are you saying that the faults on nops were something other than "gaaaaahh! execution-unit problem!" faults? – supercat Apr 29 '16 at 16:17
  • 2
    @supercat: Raymond has the details: https://blogs.msdn.microsoft.com/oldnewthing/20050412-47/?p=35923 – Eric Lippert Apr 29 '16 at 17:11
  • 1
    @plasmahh, that does not happen. Hard drive sectors are hashed to detect bit errors; it would take several errors in the hash part of the record to flip a bit in the data part without triggering a disk read error. It's possible you had a RAM failure before the file was written. – Jasen Слава Україні May 01 '16 at 11:58
  • [xkcd: Real Programmers](https://xkcd.com/378/) - and stop pressing `C-x M-c M-butterfly` – Elliott Frisch Mar 13 '17 at 22:25

12 Answers

102

It's often said that 0's and 1's are represented by voltage, if the voltage is in the interval [0, 0.9), then it is a 0. If the voltage is in the interval [0.9, 1.5), then it is a 1 (voltages may vary, I'm only giving an example).

To some degree, you've mostly created this problem by using an unrealistic example. There is a much larger gap between logical low and high in real circuits.

For example, 5V CMOS logic will output 0-0.2V for logic low and 4.7-5V for logic high, and will consistently accept anything under 1.3V as low or anything over 3.7V as high. That is, there are much tighter margins on the outputs than the inputs, and there's a huge gap left between voltages which might be used for logical low signals (<1.3V) and ones that might be used for logical high (>3.7V). This is all specifically laid out to make allowances for noise, and to prevent the very sort of accidental switching that you're describing.
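As a rough sketch of how those input thresholds behave (illustrative C only, using the 1.3 V / 3.7 V figures quoted above; a real input stage is an analog circuit, not code):

#include <stdio.h>

/* Rough model of a 5 V CMOS input stage, using the thresholds quoted
   above (V_IL = 1.3 V, V_IH = 3.7 V). Anything in between is not
   guaranteed to be read consistently. */
const char *interpret(double volts)
{
    if (volts <= 1.3) return "logic low (0)";
    if (volts >= 3.7) return "logic high (1)";
    return "undefined: outside the guaranteed input ranges";
}

int main(void)
{
    double samples[] = {0.1, 0.9, 1.3, 2.5, 3.7, 4.8};
    for (int i = 0; i < 6; i++)
        printf("%4.1f V -> %s\n", samples[i], interpret(samples[i]));
    return 0;
}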

Here's a visual representation of the thresholds for a variety of logic standards, which I've borrowed from interfacebus.com:

[Image: logic level thresholds for a variety of logic standards]

Each column represents one logic standard, and the vertical axis is voltage. Here's what each color represents:

  • Orange: Voltages in this range are output for logical high, and will be accepted as logical high.
  • Light green: Voltages in this range will be accepted as logical high.
  • Pink/blue: Voltages in this range won't be interpreted consistently, but the ones in the pink area will usually end up being interpreted as high, and ones in the blue area will usually be low.
  • Bluish green: Voltages in this range will be accepted as logical low.
  • Yellow: Voltages in this range are output for logical low, and will be interpreted as logical low.
  • 4
    Good answer, though I think it could be more complete: you only cover immunity to (or rather, protection against) noise. There are many other mechanisms responsible for digital errors, and just as many means of protection. The good thing is, I didn't cover immunity to noise in my answer :) – Mister Mystère Apr 27 '16 at 11:42
  • 1
    @MisterMystère That's true! Error correction is kind of a huge topic, though, and I couldn't possibly cover all of it in a single answer. –  Apr 28 '16 at 01:51
  • 1
    @MisterMystère: Well, "noise" is a term that covers all kinds of stochastic error sources. Your examples of EM interference and cosmic radiation fall into the category of "noise" just fine. The only other cause of digital error is a deterministic one, which we call a "bug". But this question is only about the accidental errors. – Ben Voigt Apr 28 '16 at 14:29
  • In your third bullet, I believe you have the colors or the logic switched. The pink should be low and the blue should be high. – Guill Apr 28 '16 at 21:41
  • @Guill Huh? The pink region is above V_T, so it will unreliably be treated as logical high. –  Apr 28 '16 at 22:02
  • So doesn't this just invite the question "How often do voltages fall inside the blue/pink ranges?"? Some of those "unreliable" ranges seem quite large. How do they compare to real-world tolerances? How often would you expect a bit to be accidentally flipped in, say, the RAM of a modern laptop? – naught101 Apr 29 '16 at 02:03
  • @naught101 Keep in mind that these voltage standards only apply to signalling. Voltages inside DRAM are a completely different issue. –  Apr 29 '16 at 02:10
  • @duskwuff: ok, "between CPU and RAM" or similar then - I mean, there are lots of pathways in a computer, how often does one of them result in an accidental flipped bit? – naught101 Apr 29 '16 at 02:26
  • Yeah but who uses 5V transistors these days? This is a real problem for modern small processes that work with much lower voltages - as the troubles with DDR3 ram clearly demonstrate (look up rowhammer attacks). You really can't say "this is not a problem" when there are existing exploits out there that do exactly that. – Voo Apr 29 '16 at 18:16
  • @Voo Margins on 3.3V (or lower) CMOS are, generally speaking, proportional to the 5V margins. And as I've said already, DRAM levels are a separate issue. –  Apr 29 '16 at 23:51
  • @duskwuff Even then, a transistor can easily start violating its specifications (operating outside its temperature range, manufacturing errors,..), which would then violate the propagation delay assumptions of the circuit which would then lead to mistaking a high for low or vice-versa. But I guess it really depends how you interpret the question. – Voo May 01 '16 at 09:02
  • The 5V CMOS example looks completely symmetric. Is this an accident of the materials used or is there a deep reason behind it? – Nayuki May 01 '16 at 23:12
66

We can't. We are just decreasing the probability of errors by adding checks to the data. Depending on what type of data is to be checked, it can either be done via hardware or software, and can take any form from simple checksum bits in serial streams to cyclic state machines allowing only specific transitions to be made at any given time.
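As a toy example of the "checksum bits in serial streams" idea (a minimal sketch, not how any particular interface actually does it): a single parity bit detects any odd number of flipped bits in a byte, though it cannot say which bit flipped, and an even number of flips slips through.

#include <stdint.h>
#include <stdio.h>

/* Even parity over one byte: returns 1 if the number of 1-bits is odd.
   The sender stores/transmits this bit alongside the data; the receiver
   recomputes it and compares. */
static uint8_t parity8(uint8_t b)
{
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return b & 1u;
}

int main(void)
{
    uint8_t sent = 0x5A;            /* data byte */
    uint8_t p    = parity8(sent);   /* parity bit kept with it */

    uint8_t received = sent ^ 0x08; /* simulate one flipped bit in transit */
    if (parity8(received) != p)
        printf("single-bit error detected (0x%02X -> 0x%02X)\n", sent, received);
    return 0;
}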

But it's a vicious circle, isn't it? How can we ensure that the circuit in charge of checking the data is not affected by the same disturbances as the data, giving a false positive? Add another checker? You can see how this can get quite expensive for very little gain in the end.

The question is: how reliable do you want your system to be? Satellites, which embed some of the most reliable computer systems available, sometimes resort to cross-redundancy of non-identical systems as well as voting: three different computers run the same algorithm coded by three different people in three different ways, and if one of the computers gives a result different from the other two, it is restarted (and if it happens again, isolated). But again, if two computers are faulty at the same time, then the wrong computer will be restarted/isolated. Usually, "cold redundancy" is enough: a primary and a secondary circuit are implemented, the primary runs until an error is detected by some sort of (non-protected) monitoring circuit, and the secondary circuit is swapped in. If it's just an error in RAM, the code can be re-run to refresh the data. You just have to decide wisely where to draw the line; it's impossible to make a 100% reliable error detection circuit.
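The voting itself is cheap. A bitwise two-out-of-three majority over three copies of a word, in the spirit of the redundancy described above, is essentially a one-liner (illustrative sketch only):

#include <stdint.h>
#include <stdio.h>

/* Bitwise 2-of-3 majority: each result bit is 1 iff at least two of the
   three copies have that bit set, so a single upset copy is always outvoted. */
static uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

int main(void)
{
    uint32_t stored = 0xCAFEBABE;
    uint32_t copy_a = stored;
    uint32_t copy_b = stored ^ (1u << 7);   /* simulate an upset in one copy */
    uint32_t copy_c = stored;

    printf("voted value: 0x%08X (matches original: %s)\n",
           vote3(copy_a, copy_b, copy_c),
           vote3(copy_a, copy_b, copy_c) == stored ? "yes" : "no");
    return 0;
}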

Satellites (especially at high altitude or in the Van Allen belt) and computers in nuclear plants or other radioactive environments are particularly subject to (keyword:) Single Event Upsets or latchups because of high energy particles colliding with or being absorbed by crystal lattices of semiconductors. Books covering these fields will certainly be your best bet. Paint gets degraded by displacement damage from radiation, so it is totally understandable that semiconductors can be damaged or upset by incoming radiation as well.

Mister Mystère
  • 9,477
  • 6
  • 51
  • 82
  • 3
    I am not sure if you meant to write 'vicious circle', but 'viscous circle' sounds equally funny. – svavil Apr 28 '16 at 07:17
  • 1
    Actually it was viscious circle but viscous circle made me laugh :) – Mister Mystère Apr 28 '16 at 09:20
  • 1
    It only takes log n bits to be able to locate and correct a single bit error in n bits. – Thorbjørn Ravn Andersen Apr 28 '16 at 23:14
  • 1
    This reminds me of a recent question about writing programs to account for hardware errors in computers exposed to radioactive compounds: http://stackoverflow.com/questions/36827659/compiling-an-application-for-use-in-highly-radioactive-environments – Pharap Apr 29 '16 at 05:27
  • I didn't know about the satellites thing that three different 'groups' (or more) are coding the same thing in a different way. Sounds cool :) – desmond13 May 02 '16 at 15:49
33

Single event upsets are no longer just a space or aircraft problem; we have been seeing them happen on the surface for over a decade, maybe two by now.

As mentioned though, at least in space applications we deal with upsets using triple voting (each bit is really three, and a two-thirds vote wins, so if one changes, the other two cover it), and then ECC or EDAC, with scrubbers that go through the RAM at a rate higher than the predicted single event upset rate to clean up upsets (ones that actually push the two-thirds vote wrong).

Then there is total dose: over time, accumulated radiation degrades the material until it no longer works, so you design in enough margin to exceed the life of the vehicle. That's not something we normally worry about on the surface. (And there is latchup.) Using three or more sets of logic in parallel is/was a way to try to avoid traditional rad-hard technology, and, well, you can see how well that is working out.

The folks who used to know how to make hardware for space have for the most part retired or moved on, so we have a number of programs making space trash now. Or they treat space like an earthbound product: instead of trying to make every one of them work and have a controlled re-entry and burn-up, we now expect a certain amount of space trash out of every constellation.

We do see upsets on the surface. Any memory stick (DRAM) you buy has a FIT (Failures In Time) rating, and any chip with RAM in it (all processors, and many others) will have a FIT spec as well for its (SRAM) blocks. RAM is denser and uses smaller transistors, so it is more susceptible to upsets, internally created or external. Most of the time we don't notice or care, as the memory we use for data (watching a video, etc.) is written, read back and not used again before it sits long enough to have an upset. Some memory, like that holding a program or the kernel, is more at risk. But we have long been used to the idea of just rebooting our computer or resetting/rebooting our phone (with some phones/brands you had to remove the battery periodically). Were these upsets, bad software, or a combination?

The FIT numbers for your individual product may exceed the life of that product, but take a large server farm: factor in all the RAM, or chips, or whatever, and the MTBF goes from years (or orders of magnitude beyond that) down to days or hours, somewhere in the farm. And you have ECC to cover what you can of those. Then you distribute the processing load with failovers to cover the machines or software that fail to complete a task.
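To make that scaling concrete (a back-of-the-envelope sketch with invented numbers; real FIT figures vary by part and environment): FIT is failures per 10^9 device-hours, so the farm-wide rate is just the per-device rate times the device count.

#include <stdio.h>

/* FIT = failures per 1e9 device-hours. Illustrative numbers only:
   assume 100 FIT per DRAM device and 16 devices per module. */
int main(void)
{
    double fit_per_device    = 100.0;
    double devices_per_module = 16.0;
    double modules            = 10000.0;   /* a modest server farm */

    double failures_per_hour =
        fit_per_device * devices_per_module * modules / 1e9;
    double mtbf_hours = 1.0 / failures_per_hour;

    printf("~%.3f upsets/hour across the farm, i.e. one every %.1f hours\n",
           failures_per_hour, mtbf_hours);
    return 0;
}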

The desire for solid-state storage, and the move away from spinning media, has created a related problem. The storage used for SSDs (and other non-volatile storage), in order to get faster and cheaper, is much more volatile than we would like and relies on EDAC, because we would be losing data without it. They throw a lot of extra bits in and ECC the whole thing, doing the math to balance speed, cost and longevity of storage. I don't see us turning back; folks want more non-volatile storage everywhere that fits in a tiny package and doesn't dominate the price of the product.

As far as normal circuits go, from the beginning days of using transistors for digital circuits to the present, we pass through the linear portion of the transistor and use it as a switch: we bang it between the rails with some excess to ensure it sticks. Like the light switch on your wall: you flip it more than half way, and a spring does the rest and holds it there. This is why we use digital logic rather than trying to live in the linear region; they tried that early on, but failed. They couldn't stay calibrated.

So we just slam the transistor into its rails, and both sides of a signal will settle by the next clock cycle. Great pains are taken, and the current tools are significantly better than they used to be, in analyzing the chip design to see that by design there is margin on the timing. Then each die on each wafer is tested (that and/or after packaging) to see that each chip is good.

Chip technology relies heavily on statistics based on experiments. When you overclock your CPU, you are pushing that margin; stay within the advertised clock rate, temperature, etc. and your chances of having problems are significantly lower. A 3 GHz xyz processor is simply a 4 GHz chip that failed at 4 GHz but passed at 3 GHz. The parts are basically speed-graded from a production line.

Then there are the connections between chips or boards, and those are subject to problems as well; lots of time and effort go into making standards and board designs, etc., to mitigate errors on those interfaces: USB, keyboard, mouse, HDMI, SATA, and so on, as well as all the traces on the board. On and off the board you have crosstalk issues; again, lots of tools are available, as well as experience in avoiding the problems in the first place, but this is yet another place where the ones and zeros may not come through cleanly.

None of these technologies, even for space, is perfect. It only has to be good enough: a large enough percentage of the product has to cover enough of the expected life span of the product. Some percentage of the smartphones have to make it at least two years, and that's it. Older foundries or older technologies have more experimental data behind them and can produce a more reliable product, but they are slower and may not carry the newest designs, so there you go. The cutting edge is just that: a gamble for everyone.

To your specific question: the transistors on each end of a signal are pushed quickly through their linear region and lean into one of the rails. Analysis is done on every combinational path to determine that it will settle before the clock at the end of the path latches it, so that it is truly made a zero or a one. The analysis is based on experiments. The first chips of a product line are pushed beyond the design boundaries, and shmoo plots are made to determine that there is margin in the design. Variations on the process are made and/or individual candidates are found that represent the slow and fast chips. It is a complicated process: some chips have more material and some have less, running faster but using more energy, or running slower, etc.

You push those to the margins as well, and basically get a warm fuzzy feeling that the design is okay to go into production. JTAG/boundary scan is used to run random patterns through the chips between each latched state to see that the combinational paths are all solid for a design. And where there are concerns, some directed functional tests may happen as well. Further testing of the first silicon, and perhaps random testing, makes sure the product is good. If/when failures occur, that may push you back to adding more functional tests on the production line. It is heavily dependent on statistics/percentages: 1/1,000,000 bad ones getting out may be okay, or 1/1,000, or whatever; it depends on how many you think you will produce of that chip.

The vulnerabilities are as mentioned here and by others. First, the chip itself: how good was the design and the process, and how close to the margin is the weakest path of the specific chip in the product you bought? If it's too close to the edge, then a temperature change or something else can cause timing problems, and bits will latch data that has not settled into a one or a zero. Then there are single event upsets. And then there is noise: again, stuff already mentioned.

Peter Mortensen
  • 1,676
  • 3
  • 17
  • 23
old_timer
  • 8,203
  • 24
  • 33
  • 4
    Your first paragraph makes it sound like there aren't any more issues with aerospace environments, while I think you meant that SEU are no longer just experienced in those environments. – W5VO Apr 27 '16 at 05:03
  • Note that SEUs can be caused from SnPb solder on BGAs due to some of the lead being part of the Uranium decay chain quite apart from free neutron activity. – Peter Smith Apr 27 '16 at 06:56
  • @W5VO, yes, I meant that upsets due to radiation are no longer just a space problem, they are a problem all the way down to the surface. Not as bad as in space, but present. – old_timer Apr 27 '16 at 13:55
  • 1
    I seem to recall that some DEC minicomputer buses had problems with metastability in practice. That's a distinct mechanism for bit-errors from those you've named, right? Or not? – davidbak Apr 27 '16 at 19:40
  • If the signal doesn't settle at all, then you will simply have problems. I didn't think of that case; granted, they didn't have the tools we have today. I have seen and know of chips where, if you enable too many things inside at once, that can cause an inrush current that can either do damage or at least mess up everything. I think some GPUs could be programmed to do too much work and do something similar. Other ways to not have ones be ones and zeros be zeros. – old_timer Apr 27 '16 at 19:46
  • and of course the HCF halt and catch fire instruction. – old_timer Apr 27 '16 at 19:46
  • 2
    @davidbak: Metastability is a nasty problem whose most common effect is that in cases where several bits' values are contingent upon whether some input was low or high at some time in the recent past, they may not all switch together in a fashion consistent with the input being low, nor in a fashion consistent with it being high, but may instead yield an arbitrary mixture of the two behaviors. For example, if code is supposed to branch when a button is pushed, the program counter bits may end up holding an arbitrary mix of the values they would have had if the button was pushed, or if it wasn't. – supercat Apr 27 '16 at 21:47
  • @supercat - yes, and in the case of the DEC buses I mentioned ... sometimes the data sent from a device to the CPU (or vice versa) was not the data received ... – davidbak Apr 27 '16 at 21:53
  • "triple voting" also known as "triple modular redundancy" see [Wikipedia](https://en.wikipedia.org/wiki/Triple_modular_redundancy) – boink May 04 '16 at 11:05
  • I correct myself: "triple voting" is one part/subset of "triple modular redundancy" – boink May 04 '16 at 11:14
12

If you are after a simple answer:

Each digital component in a computer is more restricted in the outputs it produces than in the inputs it accepts. For example, any "input" value from 0V to 2V will be accepted as a 0, but an "output" of 0 will always be in the range 0 to 0.5V. (See duskwuff's answer for some actual values.)

This means that each component helps to "correct" for some of the deviation or noise that has occurred along the line. Of course, if the noise is large enough, the system cannot compensate. Computers in high-radiation environments can be frequently affected by 1s changing to 0s and vice versa.
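A crude numerical picture of that restoring effect (an illustrative sketch with made-up levels, loosely based on the 5 V figures above): each idealized stage reads its input against a mid-rail threshold and re-drives a clean level, so noise added between stages never accumulates.

#include <stdio.h>
#include <stdlib.h>

/* Idealized 5 V buffer: anything below mid-rail is re-driven as 0.2 V,
   anything above as 4.8 V (made-up output levels for illustration). */
static double buffer(double in_volts)
{
    return (in_volts < 2.5) ? 0.2 : 4.8;
}

/* Small random noise added on each wire segment between stages (+/- 0.2 V). */
static double noisy(double v)
{
    return v + ((double)rand() / RAND_MAX - 0.5) * 0.4;
}

int main(void)
{
    double v = 4.8;                        /* a logic high leaving stage 0 */
    for (int stage = 1; stage <= 5; stage++) {
        v = buffer(noisy(v));              /* noise is stripped at each stage */
        printf("after stage %d: %.1f V\n", stage, v);
    }
    return 0;
}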

Basically, computers are designed to tolerate certain levels of noise/interference, which is good enough for most practical purposes.

Artelius
  • 296
  • 1
  • 4
8

It is theoretically possible for the signals to change between a 0 and a 1 because of thermal (and other) noise; however, it is extremely unlikely.

Digital circuits are designed with an attribute called 'noise margin'. This is the amount by which the input has to change before the output flips state. Generally, in CMOS circuits this is about 50% of the supply voltage. Unavoidable thermal noise (which comes from electrons moving around at any temperature above 0 kelvin) in these circuits generates << 1 mV of noise, and the probability that these spikes could exceed (say) 500 mV is exceedingly small.
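To put rough numbers on that (a back-of-the-envelope sketch, assuming the noise is Gaussian): the one-sided probability of a sample exceeding the margin is erfc(margin / (sigma * sqrt(2))) / 2, which for a 500 mV margin and about 1 mV RMS noise underflows to zero in double precision.

#include <math.h>
#include <stdio.h>

/* One-sided probability that zero-mean Gaussian noise with RMS value
   sigma exceeds a given margin: 0.5 * erfc(margin / (sigma * sqrt(2))).
   Numbers are for illustration only; compile with -lm. */
static double p_exceed(double margin, double sigma)
{
    return 0.5 * erfc(margin / (sigma * sqrt(2.0)));
}

int main(void)
{
    double sigma = 0.001;                              /* ~1 mV RMS noise */
    double margins[] = {0.005, 0.010, 0.050, 0.500};   /* volts */

    for (int i = 0; i < 4; i++)
        printf("margin %5.0f mV -> P(exceed) = %g\n",
               margins[i] * 1000.0, p_exceed(margins[i], sigma));
    /* The last line prints 0: a 500-sigma event underflows double precision. */
    return 0;
}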

Digital (e.g. CMOS) gates have gain and saturation characteristics. What this means is that when the input signal is close to the middle of the range, the output changes quickly (high gain), but when it is close to the extremes of the range, it changes slowly. The outcome is that when an input signal is 'close' to the rails, the output is even closer; this means that noise doesn't get amplified.

Other features mentioned above (error correction etc.) mean that errors, even if they do occur, do not propagate.

jp314
  • 18,395
  • 17
  • 46
4

In systems that are prone to error such as communication channels and magnetic storage (and sometimes even RAM), a checksum, CRC, or ECC is stored to reject bad data or correct small errors.
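As a small illustration of the CRC idea (a sketch using the common CRC-8 polynomial 0x07; real links and drives use longer codes and full ECC): the sender stores or transmits the CRC alongside the data, and the receiver recomputes it, so a flipped bit makes the two disagree.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bitwise CRC-8 with polynomial x^8 + x^2 + x + 1 (0x07), init 0x00. */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    uint8_t frame[] = "hello";               /* payload as stored/sent */
    uint8_t check   = crc8(frame, strlen((char *)frame));

    frame[1] ^= 0x04;                         /* simulate one flipped bit */
    if (crc8(frame, strlen((char *)frame)) != check)
        printf("corruption detected: stored CRC 0x%02X no longer matches\n",
               check);
    return 0;
}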

Generally, binary systems are designed so that this isn't possible, but once in a few million or billion* times, a cosmic ray or a blip of noise will nudge things over the line, and the error detection/correction is needed to keep the corruption from affecting the computer in a serious way.

*Communication channels can have a much, much higher error rate!

Daniel
  • 6,168
  • 21
  • 33
4

Computer hardware has become more robust and dependable, but hardware is much too broad a topic for a simple answer. However, it may be of interest to know there is a dependability difference between a common desktop computer and an enterprise server. I found this question/answer thread about server hardware. A server will cost many times that of a comparable desktop. The cost is the result of better hardware that is possibly several times less likely to unexpectedly "switch a 1 and 0".

But hardware is only half the story. Computers can also protect data from unexpected errors using software. Hamming codes are an example where, by adding a small amount of additional data, a small number of errors can not only be detected but also corrected.
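A minimal sketch of how a Hamming(7,4) code does that (illustrative only): three parity bits are added to four data bits, and the receiver's recomputed parity checks form a number that points directly at any single flipped bit, which can then be flipped back.

#include <stdio.h>

/* Hamming(7,4): bits are stored in code[1..7]; parity bits live at
   positions 1, 2 and 4, data bits d1..d4 at positions 3, 5, 6, 7. */
static void encode(const int d[4], int code[8])
{
    code[3] = d[0]; code[5] = d[1]; code[6] = d[2]; code[7] = d[3];
    code[1] = code[3] ^ code[5] ^ code[7];   /* covers positions 1,3,5,7 */
    code[2] = code[3] ^ code[6] ^ code[7];   /* covers positions 2,3,6,7 */
    code[4] = code[5] ^ code[6] ^ code[7];   /* covers positions 4,5,6,7 */
}

/* Recompute the three parity checks; the syndrome is the (1-based)
   position of a single flipped bit, or 0 if the word checks out. */
static int syndrome(const int code[8])
{
    int s1 = code[1] ^ code[3] ^ code[5] ^ code[7];
    int s2 = code[2] ^ code[3] ^ code[6] ^ code[7];
    int s4 = code[4] ^ code[5] ^ code[6] ^ code[7];
    return s1 + 2 * s2 + 4 * s4;
}

int main(void)
{
    int data[4] = {1, 0, 1, 1};
    int code[8];
    encode(data, code);

    code[6] ^= 1;                 /* simulate a single-bit upset */
    int pos = syndrome(code);
    if (pos) {
        code[pos] ^= 1;           /* flip it back */
        printf("corrected single-bit error at position %d\n", pos);
    }
    printf("data recovered: %d %d %d %d\n", code[3], code[5], code[6], code[7]);
    return 0;
}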

st2000
  • 3,228
  • 9
  • 12
3

There are two ways that are commonly used to minimize the probability that a logic bit will be accidentally switched (0 to 1, or 1 to 0).
The first is to provide as large a gap as possible between the voltage levels defined for a 0 and a 1. As you mention, a 0 is defined as a voltage level < 0.9 V, while a 1 is defined as a voltage level > 2.9 V (not, as you say, 0.9 to 1.5). This leaves a voltage gap of 2 V, which means the signal voltage would have to vary by roughly 200% of the logic-low limit before it could "accidentally" switch the state of the bit (very unlikely).
The second is to "clock" the logic signals. Since "accidental" voltage/noise is random and short-lived, by allowing state changes only at particular (and short) intervals, the probability of the variation hitting at the time of the clock edge is minimized.
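A toy simulation of that second mechanism (illustrative only, with made-up timing): a D flip-flop copies its input only on the clock edge, so a short glitch that appears and disappears between edges never reaches the stored bit.

#include <stdio.h>

int main(void)
{
    /* Input signal sampled once per time step; a narrow glitch appears
       at steps 3-4, well away from the clock edges at steps 0, 8, 16. */
    int d[17] = {0,0,0,1,1,0,0,0, 0,0,0,0,0,0,0,0, 0};
    int q = 0;                          /* flip-flop output */

    for (int t = 0; t <= 16; t++) {
        int clock_edge = (t % 8 == 0);  /* rising edge every 8 steps */
        if (clock_edge)
            q = d[t];                   /* only now does the input matter */
        printf("t=%2d d=%d q=%d%s\n", t, d[t], q,
               clock_edge ? "  <- sampled" : "");
    }
    /* The glitch at t=3..4 never appears on q. */
    return 0;
}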

There are, of course, other means and methods used, depending on the degree of reliability required (EDC, ECC, etc.).

Guill
  • 2,430
  • 10
  • 6
2

Good engineering.

A lot of effort goes into a design to prevent data corruption, or to correct it when it can't be sufficiently prevented (eg ECC memory).

Things that can cause data corruption include:

  • electrically noisy environments
  • power related issues
  • timing issues (eg between clock and data lines, or between two differential lines)
  • electrical cross-talk

In short, a lot of engineering goes into digital designs so that software engineers can make the simple assumption that '0' means '0' and '1' means '1'.

sbell
  • 1,571
  • 12
  • 20
  • And of course, ECC memory can also do funny things like trigger a non-maskable interrupt (NMI) if more bits get corrupted than are possible to repair. I think for modern ECC RAM this is more than one bit error in 64 bits (the encodings can correct single-bit errors, and detect but not correct two-bit errors), but I could be wrong about that. In situations where you care about your data, halting the system immediately if something is amiss beyond repair may very well be preferable to limping along not knowing whether the data can be trusted (or worse, knowing that it cannot be trusted). – user Apr 29 '16 at 22:32
2

The two fundamental aspects of practical electronic computers are:

  1. A very stable power supply

  2. Time (usually defined as clock frequency or delay)

Power supplies for computing systems are very strictly specified and regulated. In fact, for any computing system the power supply is typically regulated several times: at the power supply (or battery charger), at the main input to the motherboard, at the input to daughtercards and finally on the chip itself.

This eliminates a lot of noise (electrical volatility). What the CPU sees is a very stable, non-volatile voltage source that it can use to process logic.

The next main source of intermediate values (voltages between what's considered 0 or 1) comes when values transition. Either 1 changing to 0 (fall time) or 0 changing to 1 (rise time). You really can't do much about it except wait for the transition to finish before accepting the output of the circuit. Before the transition completes the circuit's output is considered garbage.

In engineering, the solution to this problem is to specify, on paper, how long you need to wait for the results to be correct. This is the origin of CPU clock frequency: how many GHz you can run the CPU at depends on how long it takes for state changes in the CPU to stabilize.
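In the simplest single-cycle view (a sketch with invented delay numbers; real chips have many paths and extra terms such as clock skew), the clock period just has to cover the worst-case settling time plus the register's setup time.

#include <stdio.h>

/* Invented numbers: the worst-case combinational (settling) delay plus the
   register's setup time set the minimum clock period. */
int main(void)
{
    double t_logic_ns = 0.35;   /* slowest path through the logic */
    double t_setup_ns = 0.05;   /* input must be stable this long before the edge */

    double t_min_ns = t_logic_ns + t_setup_ns;
    printf("minimum clock period: %.2f ns -> max clock about %.2f GHz\n",
           t_min_ns, 1.0 / t_min_ns);
    return 0;
}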

There is actually a third source of volatility: inputs into the circuit. The solution to that problem is similar to the general one above: make sure the signal (voltage or current) entering the system is stable, and make sure the signal has enough time to stabilize.

The second part of the problem is why we sample inputs into latches or registers before processing. The signal may be garbage, but it will be 0-or-1 garbage inside the registers when it is processed. The first part of the problem is what warranties are for.

slebetman
  • 1,660
  • 11
  • 14
2

But I never read what keeps electrical voltages "well-behaved" in a way that a 0 could never accidentally become a 1 due to electrical volatility1 inside the computer. Perhaps it's possible for the voltage to be very near 0.9, then what is done to avoid it passing the threshold?

Feedback is what prevents it from approaching the threshold voltage, and forces it to be well behaved.

This is usually in the form of a latching circuit of some sort, often a clocked latching circuit.

As a simple example, consider the flip-flop. It's designed so that the output is fed back into the logic circuit as an additional input. The logic within the element, therefore, knows what it's outputting and it will keep outputting the same value until the other inputs force it into the opposite state.

Since the circuit is designed so the transistors are fully on or fully off, then it will always output near the power supply and ground limits - it won't go near the 0.9v level, and when it does transition it will quickly and completely move to the other state. The circuits are designed specifically to avoid operating in the analog region between the two states.
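Here is a tiny software model of that feedback (a simulation sketch, not a real hardware description): a cross-coupled NOR latch computes each output from the other output, so with both inputs idle it keeps re-driving whatever state it last settled into.

#include <stdio.h>

/* Cross-coupled NOR latch: q and qn each feed back into the other gate,
   so with S = R = 0 the latch keeps re-asserting whatever it last held. */
static void settle(int s, int r, int *q, int *qn)
{
    for (int i = 0; i < 4; i++) {       /* iterate until the feedback settles */
        int new_q  = !(r | *qn);
        int new_qn = !(s | *q);
        *q  = new_q;
        *qn = new_qn;
    }
}

int main(void)
{
    int q = 0, qn = 1;

    settle(1, 0, &q, &qn);  printf("set:   q=%d\n", q);   /* S pulse -> q=1 */
    settle(0, 0, &q, &qn);  printf("hold:  q=%d\n", q);   /* feedback holds 1 */
    settle(0, 1, &q, &qn);  printf("reset: q=%d\n", q);   /* R pulse -> q=0 */
    settle(0, 0, &q, &qn);  printf("hold:  q=%d\n", q);   /* feedback holds 0 */
    return 0;
}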

Adam Davis
  • 20,339
  • 7
  • 59
  • 95
0

Under normal operating conditions, 0's and 1's rarely glitch, but under bad operating conditions (e.g. a low battery, or that brief time after the AC power is disconnected while the voltage in the capacitor is falling), weird stuff happens and 0's and 1's get messed up all the time.

For this reason, if you are writing defensive code (software or HDL) that you don't want to lock up, you should always consider the case where a number might glitch. For example,

unsigned int i = 0;
while (1)
{
  i++;
  /* do something here */
  if (i == 10) break;
}

it is better to change the `==` to `>=`, just in case the value of i jumps from 9 to 11, or to any number greater than 10, which would cause you to never exit the loop until i wraps around to 0 (say, after 4 billion iterations).

unsigned int i = 0;
while (1)
{
  i++;
  /* do something here */
  if (i >= 10) break;
}

It happens, trust me...

Mark Lakata
  • 354
  • 3
  • 9