
Back in the early 2000s I remember asking why it was so important that servers use ECC memory. The prevailing wisdom at the time was that systems with lots of RAM would be, statistically, more likely to suffer bitflips. This makes sense - if each cell has a 10⁻²⁰ probability of suffering a bitflip per second, then 10⁹ cells have a 10⁻¹¹ probability per second. The more cells you have, the higher the probability of a bitflip in any given time period.

Back then we would be looking at a ballpark of 128MB to 1GB of RAM. These days we regularly put 16GB or more in laptops and desktops, with workstations commonly having 64GB or more. For argument's sake, let's say we've increased total RAM amounts by two orders of magnitude. So we should see a hundred times or so more bitflips, on average, in any given system, assuming that nothing else changed.
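To put rough numbers on that scaling argument, here's a minimal sketch. The per-bit rate is the purely illustrative 10⁻²⁰/s figure from above, not a measured value; only the ratio between the two systems matters:

```python
# Expected bitflips scale linearly with cell count, so going from 128 MB
# to 16 GB should mean ~128x more upsets if nothing else changed.
PER_BIT_RATE = 1e-20  # assumed flips per bit per second, illustration only

systems = {"128 MB (circa 2002)": 128 * 2**20 * 8,  # sizes in bits
           "16 GB (circa 2019)": 16 * 2**30 * 8}

SECONDS_PER_YEAR = 60 * 60 * 24 * 365
for label, bits in systems.items():
    flips_per_year = bits * PER_BIT_RATE * SECONDS_PER_YEAR
    print(f"{label}: {flips_per_year:.2e} expected flips/year")

# The capacity ratio is 16 GiB / 128 MiB = 128, i.e. the expected flip
# rate rises by roughly two orders of magnitude, all else being equal.
```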

The more I thought about it, though, the more I realised that the random bitflip rate should be much higher in newer systems:

  • Lower operating voltages mean less distinction between a 0 and a 1.
  • Lower gate charge means less overall energy required to flip a bit.
  • More densely packed gates increase the likelihood of being affected by cosmic rays.
  • Refresh timings don't seem to have improved. DDR2 tRFC was 40-60 clocks, DDR3 tRFC was more like 90-130 clocks, and DDR4 tRFC is more like 200-450 clocks. When you divide by the internal memory clock rates to get a wall time for each refresh timing, there isn't much of a trend - it's effectively flat, just with a wider spread as time goes on (rough conversion sketched below).
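To back that last bullet up with numbers, here's a rough conversion of those tRFC clock counts to wall time. The MHz values assume one typical speed grade per generation (DDR2-800, DDR3-1600, DDR4-3200 - my picks for illustration; real parts vary by die density and bin), with the timing clock taken as half the data rate:

```python
# Rough tRFC wall times. Clock counts are the ballpark figures from the
# list above; the MHz values are assumed typical speed grades.
configs = {
    "DDR2-800":  (400,  40, 60),   # (timing clock MHz, tRFC min, tRFC max)
    "DDR3-1600": (800,  90, 130),
    "DDR4-3200": (1600, 200, 450),
}

for name, (clock_mhz, lo, hi) in configs.items():
    # clocks / MHz gives microseconds; multiply by 1000 for nanoseconds
    print(f"{name}: tRFC ~ {lo / clock_mhz * 1000:.0f}-"
          f"{hi / clock_mhz * 1000:.0f} ns")
```

That works out to roughly 100-150 ns, 112-162 ns, and 125-281 ns respectively - flat at the low end, with a widening spread.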

But, as far as I know, we're not seeing bitflips everywhere on non-ECC RAM, at least within the confines of our atmosphere.

So, what's the deal? Why aren't we seeing endless bitflips everywhere, at least 100x if not 10000x more frequently than two decades ago? Is ECC actually important in the context of growing RAM sizes, or do the stats not back it up? Or is there some other technology advance that is mitigating bitflip problems in non-ECC memory? I'm particularly interested in answers with authoritative references rather than speculation about error rates.

Polynomial
  • It's a big deal if that RAM has your bank transaction on it – Tony Stewart EE75 Apr 17 '19 at 13:08
  • @SunnyskyguyEE75 That's irrelevant to the question and makes an assumption that the probabilistic risk of using non-ECC memory is high in the first place, which is the exact assumption that I'm challenging here. I understand probabilistic risk and impact (my career depends upon it); the question is about whether those risks are being overstated in the first place. – Polynomial Apr 17 '19 at 13:15
  • Margin can be tested and correlated with error rate, but MTBF for correctable bits depends on brand, temperature, and margin. Ask Crucial. It's a question that depends on the cost of failure and the unknown – Tony Stewart EE75 Apr 17 '19 at 13:17
  • Not all failures are deterministic, but some can be made so by reducing the BIOS delay settings and running Memtest86 for a few hours until error-free at threshold, then backing off for margin. But the difference between a soft error and a hard error depends on brand (noise, crosstalk, supply ripple, motherboard), etc. – Tony Stewart EE75 Apr 17 '19 at 13:21
  • But just as LCD displays have gone from 10 dead pixels per 2 Mpix to 0, DRAM has also improved by several orders of magnitude – Tony Stewart EE75 Apr 17 '19 at 13:25
  • @SunnyskyguyEE75 I'm not asking how to fix the problem, or how to model probabilistic MTBFs. Just interested in authoritative references as to why a 16GB non-ECC DIMM from 2019 seems to have the same apparent MTBF as a 128MB non-ECC DIMM from 2002, despite the conventional wisdom back then saying "if you have more than 1GB of RAM you should use ECC, because of bit flips" and there being two whole orders of magnitude between the sizes, plus all the other negative factors listed above. If you want to speculate, feel free, but that's not really what I'm looking for in an answer. – Polynomial Apr 17 '19 at 13:29
  • I used to measure soft and hard error rates for a living in the early '80s as a test engineer on HDDs, choosing which vendors the corporation should buy from or disqualify. You can measure MTBF by field returns, but you cannot predict it in general. So ask Crucial – Tony Stewart EE75 Apr 17 '19 at 13:32
  • To a great extent it comes down to the use case. The statistical chance of a bit flip (most commonly due to free neutrons) at sea level is really low but non-zero; however, the other part of that is that you may not notice, because the bit that gets flipped may be in an area not currently being used. A study by Boeing on servers in Denver established that bit flips (SEUs in the trade) *do indeed* occur regularly. In avionics we are **required** to use ECC, as the chance increases drastically with altitude. – Peter Smith Apr 17 '19 at 13:33
  • @SunnyskyguyEE75 As I've said, I'm not looking to *predict* it. I work in infosec; I understand the difference between risk/probabilities and predicting actual discrete incidents. It's practically the same as radioactivity - you can model decays/sec, but you can't predict the individual decays with any certainty. I'm asking about differences between rates. "Ask Crucial" isn't a useful answer here. – Polynomial Apr 17 '19 at 13:36
  • The only company I know of to do a long-term study is Xilinx, on the Rosetta project. They do publish soft error rates for all their parts. Note that internal geometries and layout (which is used to reduce the neutron cross section) play a large part in effective vulnerability. – Peter Smith Apr 17 '19 at 13:38
  • There are many causes of soft errors, distinct from hard errors - dozens of reasons - so you cannot generalize for all brands, but you can for one brand: e.g. timing margin, crosstalk, voltage margin, temperature margin, supply noise, EUV lithographic maturity, or contamination rate changes in process controls, etc. – Tony Stewart EE75 Apr 17 '19 at 13:39
  • @PeterSmith Interesting info. I'd be interested in a link to that paper if it's public. I'm aware of the requirement of ECC in higher rad environments (aviation/space) hence my quip of "at least within the confines of our atmosphere". I'm still not sure that proving the existence of bitflips answers the question of why we don't tend to see impactful ones 100-10000x more frequently than in the early 2000s though. – Polynomial Apr 17 '19 at 13:40
  • Have you read https://en.wikipedia.org/wiki/ECC_memory? "On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or even reverse, this trend." – HandyHowie Apr 17 '19 at 13:42
  • @HandyHowie Surely smaller cells make more targets because the overall die size tends to be roughly the same as we increase density in line with capacity expansion? The individual cells are smaller (and have lower gate charge) but you've got more of them per square millimeter. – Polynomial Apr 17 '19 at 13:48
  • I will look for the Boeing report (from many years ago). Here is a backgrounder from Xilinx: https://www.xilinx.com/support/documentation/white_papers/wp286.pdf – Peter Smith Apr 17 '19 at 13:48
  • As MTBF has improved, cost has been reduced with density and smaller lithography, so yields must be maintained somewhat constant, implying that MTBF "may" also hold for short-term effects like row-hammer crosstalk – Tony Stewart EE75 Apr 17 '19 at 13:50
  • Rate slope can be predicted, for example, from S/N ratio and the picoseconds of margin, where a 3.3 GHz CPU can specify steps of ~300 picoseconds for write and read address delays etc., but one cannot predict non-random noise events - e.g. solar flare EMP immunity with shielding - unless by design with tests. While density has increased with Moore's Law, the particle rate in the clean rooms must also be reduced for epi-wafers to maintain yields and field failure rates, so we cannot simply say it is 100x better than 20 yrs ago. There are too many variables to be specified. – Tony Stewart EE75 Apr 17 '19 at 13:59
  • yet, "some" companies can say this based on their selection cirteria setup and environmental tests. Similarily for aerospace, due to higher gamma radiation, lithography must be larger – Tony Stewart EE75 Apr 17 '19 at 14:05
  • Xilinx Conclusion: *It is not possible to make a statement about foundry, process, voltage, or temperature effects without side-by-side experiments in the same beam, at the same time, with a few thousand upsets on each.* – Tony Stewart EE75 Apr 17 '19 at 14:10

2 Answers


Single-event upsets (SEUs) at sea level tend to be caused either by radioactive contaminants in the IC manufacturing materials (particularly the metals) generating alpha particles, or by high-energy neutrons (produced by cosmic rays in the atmosphere) ionizing atoms in the silicon itself.

Over the years, manufacturers have greatly reduced the threat posed by radioactive contaminants. There are also proprietary approaches to cell layout that can help to mitigate the risk of SEU. All of this is likely to be trade secret rather than public information.

And, no, I'm not going to do your literature search for you. However, I recommend that you go through the IEEE Transactions on Nuclear Science if you want references.

Elliot Alderson
  • Are you saying these are the only significant causes of correctable errors? Or of uncorrectable errors, or both? Or is this limited to one cause of SEU? – Tony Stewart EE75 Apr 17 '19 at 15:04
  • Building on this, for the cosmic ray induced SEU case, the probability of getting hit with a cosmic ray correlates to die area, not number of bits within that die, and in general, memory die area has actually been going down over the years. – Nate S. Apr 17 '19 at 16:56

The answer is that more work needs to be done, and researchers aren't sure:

The results show that the radiation susceptibility has actually improved somewhat for devices that have advanced to the 0.13μm level, which contradicts earlier predictions. This trend is encouraging, but it may not necessarily continue for devices that are scaled below 0.1 μm. It is important to note that the recent computer modeling calculations for SEE susceptibility of scaled devices predicts a large increase in collected charge that is directly in conflict with test results for heavy ions and neutron soft errors.
Source: The Effect of Device Scaling on Single-Event Effects in Advanced CMOS Devices

[Figure from The Effect of Device Scaling on Single-Event Effects in Advanced CMOS Devices]

The actual mechanism for SEEs/SEUs has more to do with voltage and geometry than with size. One would think that smaller feature sizes and smaller charge per memory element would make bits easier to flip, but the effect is small; susceptibility is driven more by voltage and cell geometry. This is good news for space applications that rely increasingly on commercial technology (like CubeSats).

Voltage Spike