
I'm currently combing the electrical engineering literature for the sorts of strategies employed to reliably produce highly complex but extremely fragile systems such as DRAM, where you have an array of many millions of components and a single failure can brick the whole system.

It seems like a common strategy is to manufacture a much larger system and then selectively disable damaged rows/columns using settable fuses. I've read [1] that (as of 2008) no DRAM module comes off the line functioning, and that for 1 GB DDR3 modules, with all of the repair technologies in place, the overall yield goes from ~0% to around 70%.
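
To make that concrete, here is a minimal sketch of the kind of yield model I have in mind: a Poisson defect model where a bank with a few spare rows can tolerate a handful of defective rows. Every number in it (row count, per-row defect probability, spare count) is an assumption for illustration, not data from [1].

```python
import math

# Rough Poisson yield model for a DRAM array with spare rows.
# Every number here is an illustrative assumption, not data from [1].

def yield_with_spares(rows, p_row_defect, spares):
    """Probability that at most `spares` rows are defective,
    assuming independent defects (Poisson approximation)."""
    lam = rows * p_row_defect  # expected number of defective rows
    return sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(spares + 1))

rows = 65536         # rows in one bank (assumed)
p_row_defect = 5e-5  # chance any given row is defective (assumed)

print(yield_with_spares(rows, p_row_defect, 0))  # no spares: ~4% yield
print(yield_with_spares(rows, p_row_defect, 8))  # 8 spare rows: ~99%
```

Under those (made-up) numbers a handful of spare rows takes the yield from essentially scrap to nearly every die usable, which is the sort of jump the ~0% to ~70% figure suggests.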

That's just one data point, however. What I'm wondering is: is this something that gets advertised in the field? Is there a decent source discussing the improvement in yield relative to the state of the art? I have sources like [2], which does a decent job of discussing yield from first-principles reasoning, but that's from 1991, and I imagine/hope that things are better now.

Additionally, are redundant rows/columns still employed today? How much additional board space does this redundancy technology require?

I've also been looking at other parallel systems like TFT displays. A colleague mentioned that Samsung, at one point, found it cheaper to manufacture broken displays and then repair them rather than improve their process to an acceptable yield. I've yet to find a decent source on this, however.

Refs

[1]: Gutmann, Ronald J., et al. Wafer Level 3-D ICs Process Technology. New York: Springer, 2008.
[2]: Horiguchi, Masashi, et al. "A flexible redundancy technique for high-density DRAMs." IEEE Journal of Solid-State Circuits 26.1 (1991): 12-17.

Mephistopheles
  • Row and column redundancy is still used today. Block-level redundancy was used in the Itanium 2 L3 cache (see Stefan Rusu et al., "Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache", 2004). Another consideration for yield is binning both for speed/power/operating temperature and "capacity" (e.g., chip multiprocessors can be sold with a range of core counts; even high defect count DRAM might, in theory, be sold as a half-capacity part). –  Feb 26 '15 at 23:06
  • Fascinating, thank you. Looking at the cache design, I see 140 subarrays, each with 2 sub-banks, which in turn have eight 96x256 array blocks. Each block has 32 bits. That means there are, in total, 140*2*8*96*256*32 = 1.762x10^9 bits required to produce 48x10^6 bits of storage. Is this correct? – Mephistopheles Feb 27 '15 at 00:02
  • No, the 32 bits are part of the 96x256 block (12 cache ways * 8 * 4 * 32 bits per cache line). It should also be noted that some of the bits are used for ECC, so the cache had 6MiB of *data*. (The use of ECC introduces another wrinkle in yield under binning. ECC requirements vary by application, and excess ECC can be used to support lower voltage (or refresh rate for DRAM) without data loss for a lower-power part, as well as provide correction for manufacturing defects. This is more a theoretical consideration, as marketing factors generally do not allow such flexibility.) –  Feb 27 '15 at 00:49
  • Thanks again. This is more to gain an estimate of the overall cost of the manufacturing process. That is, how much additional board space (as a proxy for physical resources expended) is required to reach this 6MiB? I'll try to estimate this from the area taken up by the L3 cache and get back to you. – Mephistopheles Feb 27 '15 at 03:23
  • So I took a guess, from this chart [1], that the die area was ~400 mm^2, and from Rusu et al. I'm assuming the bit cell area was 2.45 um^2. [1] also admits that 50% of the board area is L3 cache, and that each subarray communicates with the bus directly. That means there's ~81x10^6 bits possible, but only 48x10^6 utilized. (Not sure what the ECC is, but) does a factor of ~2 sound reasonable? [1]: http://www.decus.de/slides/sy2004/22_04/3c08.pdf – Mephistopheles Feb 27 '15 at 03:41
  • Using bit cell area does not account for row decode and other overhead. The area overhead of redundancy could be simply estimated by recognizing that 4 of the 140 subarrays are spares (a little less than 3% overhead), ignoring extra routing overhead. It should also be noted that 3MiB L3 cache versions were sold, so yield for 6MiB versions was allowed to be lower. (I would *guess* that using larger than minimum-sized transistors for the SRAM cells, for lower leakage, might also decrease the effective defect rate slightly.) 136 used subarrays indicates 8 for ECC (6+% overhead). –  Feb 27 '15 at 12:41
  • Alright, thanks again for clearing that up. My analysis seems to indicate that having even 4 redundancies can provide a very large increase in yield, especially in a system with only on the order of a hundred parts (taking each module as a discrete part with a probability of failure); a rough sketch of that calculation appears after these comments. I guess without solid numbers on the yields for the 6MiB, 3MiB, etc. cache versions it's still moot. – Mephistopheles Feb 27 '15 at 17:07
  • Generally we refer to "die" rather than "board" when talking about semiconductors rather than PCBs. 3% seems reasonable, factor of two absolutely not. – pjc50 Jun 24 '15 at 15:26
  • @Mephistopheles Yes, redundancy techniques are still employed, see my [survey paper](https://www.academia.edu/19490711/A_Survey_Of_Architectural_Techniques_for_Managing_Process_Variation) which reviews multiple papers on this, along with their scope of improvement. – user984260 Dec 30 '15 at 19:57
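
Here is a rough sketch of the subarray-sparing arithmetic discussed in the comments above (140 physical subarrays, 4 of them spares), using a simple binomial model. The per-subarray defect probability is an assumed, illustrative figure, not anything taken from the Itanium 2 papers.

```python
from math import comb

# Sketch of the subarray-sparing arithmetic from the comments above:
# 140 physical subarrays with 4 spares, so up to 4 defective subarrays
# can be tolerated. The per-subarray defect probability is an assumed,
# illustrative figure, not taken from the Itanium 2 papers.

N_SUBARRAYS = 140
SPARES = 4
p_defect = 0.01  # assumed chance a given subarray is defective

def array_yield(n, p, spares):
    """P(at most `spares` of n subarrays are defective), binomial model."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(spares + 1))

print(array_yield(N_SUBARRAYS, p_defect, 0))       # no spares: ~0.25
print(array_yield(N_SUBARRAYS, p_defect, SPARES))  # 4 spares: ~0.99
# Area cost of the spares: 4 / 140, a little under 3%, as noted above.
```

Even under this crude model, four spares turn a ~25% yield into roughly 99% for an area cost of under 3%, which matches the intuition in the comment thread.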

1 Answer


No manufacturer will ever release yield data unless they have to for some reason; it's considered a trade secret. So, to answer your question directly: no, it isn't advertised in the industry.

However, there are many engineers whose job is to improve line throughput and end-of-line yield. This often involves techniques like binning and block redundancy, which make parts that come off the line with defects functional enough to be saleable. Block redundancy is certainly used today, and it's pretty easy to analyze:

(failed blocks per part / blocks per part)^2

That gives you the probability of both parallel blocks failing. I doubt you'd end up with a yield as low as 70%, since 90% is typically the minimum acceptable yield.
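
Plugging rough numbers into that expression, purely for illustration (the per-part figures below are assumptions, not real line data):

```python
# Numeric sketch of the expression above. The per-part figures are
# assumed purely for illustration, not real production data.

failed_blocks_per_part = 3   # assumed average count of defective blocks
blocks_per_part = 1000       # assumed total blocks on the part

p_block_fails = failed_blocks_per_part / blocks_per_part
p_both_fail = p_block_fails ** 2  # probability both parallel blocks fail

print(p_block_fails)  # 0.003
print(p_both_fail)    # ~9e-06

# Treating each of the 1000 block pairs as independent, the part-level
# yield loss from block failures alone would be roughly:
print((1 - p_both_fail) ** blocks_per_part)  # ~0.991
```

Under those assumed numbers, pairing each block with a redundant copy keeps the part-level yield from block defects above 99%, consistent with 90% being the floor for an acceptable process.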

Tom Brendlinger
  • While I appreciate your answer, @Paul-a-clayton provided this information and was also able to cite real publications (specifically the Itanium 2) in the comments. Furthermore, while block redundancy is discussed in those papers, it says "This use of subarrays optimizes the die area utilization without constraining the core floor-plan" with no mention of fault-tolerance. If you have papers that specifically propose block redundancy as a tool for error addressing, they would be greatly appreciated. – Mephistopheles Apr 08 '15 at 16:07