This article shows that DDR4 SDRAM has approximately 8x more bandwidth than DDR1 SDRAM. But the time from setting the column address to when the data is available has only decreased by about 10% (to 13.5 ns). A quick search shows that the access time of the fastest asynchronous SRAM (18 years old) is 7 ns. Why has SDRAM access time decreased so slowly? Is the reason economic, technological, or fundamental?
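To put rough numbers on this, here is a quick comparison. The CAS latency values are typical retail parts and are assumptions, not figures from a specific datasheet:

```python
# Rough comparison of first-word latency vs. peak bandwidth across generations.
# CL values are typical retail parts (assumed), not from a specific datasheet.
parts = {
    # name: (data rate in MT/s, CAS latency in I/O clocks)
    "DDR-400":   (400,   3),
    "DDR4-3200": (3200, 22),
}
for name, (mts, cl) in parts.items():
    io_clock_mhz = mts / 2                  # DDR: two transfers per I/O clock
    latency_ns   = cl / io_clock_mhz * 1e3  # CAS latency in nanoseconds
    bandwidth    = mts * 8 / 1e3            # 64-bit (8-byte) DIMM, GB/s
    print(f"{name}: {bandwidth:.1f} GB/s peak, {latency_ns:.2f} ns CAS")
# DDR-400:   3.2 GB/s peak, 15.00 ns CAS
# DDR4-3200: 25.6 GB/s peak, 13.75 ns CAS  -> ~8x bandwidth, ~10% lower latency
```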
-
Could another possible reason be that it simply isn't that necessary? – Sebastiaan van den Broek Feb 20 '19 at 06:08
-
For example, low access time is necessary to make searching for data in memory faster. – Arseniy Feb 20 '19 at 07:16
-
I realize that extra speed is always nice, but coming from a software developer's perspective, compared to all other I/O and architecture (including microservices that can literally run in different data centers), RAM speed just isn't that much of a bottleneck anymore. Sometimes 'good enough' is good, or at least doesn't warrant the extra R&D to speed it up. I would consider adding that as a potential reason in your question too. – Sebastiaan van den Broek Feb 20 '19 at 07:21
-
According to [Wikipedia](https://en.wikipedia.org/wiki/CAS_latency), DDR3-2200 has a first-word latency of 6.36 ns, which is how long it takes a signal to propagate around 3 ft on FR4; I would say we are pretty close to the physical limits. – Mark Omo Feb 20 '19 at 21:23
3 Answers
It's because it's easier and cheaper to increase the bandwidth of DRAM than to decrease its latency. To get data out of an open row of RAM, a non-trivial amount of work is necessary.
The column address needs to be decoded, the muxes selecting which lines to access need to be driven, and the data needs to move across the chip to the output buffers. This takes a little bit of time, especially given that SDRAM chips are manufactured on a process tailored to high RAM densities rather than high logic speeds. To increase the bandwidth, say by using DDR (1, 2, 3, or 4), most of the logic can be either widened or pipelined and can operate at the same speed as in the previous generation. The only thing that needs to be faster is the I/O driver for the DDR pins.
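A sketch of that "widen rather than speed up" trend, using the per-generation prefetch widths from the DDR specs (the data rates are common top speed grades, and the array clocks are the implied values, not measurements):

```python
# Each DDR generation reaches a higher data rate by reading the internal array
# 'prefetch' words at a time and serializing them onto faster I/O pins.
# Implied array clock = data rate / prefetch width.
generations = [
    # name, typical data rate (MT/s), prefetch width (n)
    ("DDR-400",    400, 2),
    ("DDR2-800",   800, 4),
    ("DDR3-1600", 1600, 8),
    ("DDR4-3200", 3200, 8),  # DDR4 adds bank groups instead of widening again
]
for name, mts, prefetch in generations:
    print(f"{name}: array clock ~{mts // prefetch} MHz, I/O data rate {mts} MT/s")
# The internal array clock stays in the 200-400 MHz range across all four
# generations; only the narrow serializer / I/O path runs faster.
```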
By contrast, to decrease the latency, the entire operation needs to be sped up, which is much harder. Most likely, parts of the RAM would need to be made on a process similar to that used for high-speed CPUs, increasing the cost substantially (the high-speed process is more expensive, plus each chip would need to go through two different processes).
If you compare CPU caches with RAM and with hard disks/SSDs, there's an inverse relationship between storage being large and storage being fast. An L1$ is very fast, but can only hold between 32 kB and 256 kB of data. The reason it is so fast is that it is small:
- It can be placed very close to the CPU using it, meaning data has to travel a shorter distance to get to it
- The wires on it can be made shorter, again meaning it takes less time for data to travel across it
- It doesn't take up much area or many transistors, so making it on a speed optimized process and using a lot of power per bit stored isn't that expensive
As you move up the hierarchy, each storage option gets larger in capacity, but also larger in area and farther away from the device using it, meaning access to it must get slower.
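For a sense of scale, here is a rough capacity/latency ladder. These are order-of-magnitude figures assumed for illustration, not a specific CPU's datasheet:

```python
# Rough capacity/latency ladder for a typical desktop CPU (order-of-magnitude
# figures, assumed for illustration only).
hierarchy = [
    # level, capacity,  approximate load-to-use latency
    ("L1d",  "32 KiB",  "~4 cycles   (~1 ns)"),
    ("L2",   "512 KiB", "~12 cycles  (~3 ns)"),
    ("L3",   "16 MiB",  "~40 cycles  (~10 ns)"),
    ("DRAM", "32 GiB",  "~300 cycles (~70-100 ns)"),
]
for level, cap, lat in hierarchy:
    print(f"{level:4s} {cap:>8s}  {lat}")
# Each step up gains roughly 100-1000x capacity and pays roughly 3-10x latency.
```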

-
Great answer. I just want to emphasise the physical distance factor: at maybe 10 cm for the furthest RAM stick, 1/3 to 1/2 of the speed of light as the signal speed, plus some extra length to route & match the PCB tracks, you could easily be at a 2 ns round-trip time. If ~15% of your delay is caused by the unbreakable universal speed limit... you're doing real good in my opinion. – mbrig Feb 19 '19 at 20:22
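The back-of-the-envelope version of that comment, with the distance, routing overhead, and signal speed as assumed round numbers:

```python
# Round-trip flight time to a DIMM ~10 cm away, assuming ~0.5c signal speed
# on FR4 and ~50% extra trace length for routing/length-matching (assumptions).
c          = 3.0e8                  # speed of light, m/s
trace_len  = 0.10 * 1.5             # 10 cm straight-line, +50% for routing
v          = 0.5 * c                # propagation speed in FR4, roughly c/2
round_trip = 2 * trace_len / v      # seconds
print(f"{round_trip * 1e9:.1f} ns") # ~2.0 ns, vs. ~13.5 ns total CAS latency
```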
-
L1 is also organized uniquely, is directly in the core that uses it, and uses SRAM. – forest Feb 20 '19 at 04:20
-
@forest And also has a pretty strict size limit - make it too large, and there's no way to keep it so fast. – Luaan Feb 20 '19 at 09:12
-
L1d cache can also be heavily optimized for latency, e.g. fetching tags and data in parallel for all ways in the set, so a tag match just muxes the data to the output instead of needing to fetch it from SRAM afterwards. This can also happen in parallel with the TLB lookup on the high bits of the address, if the index bits all come from the offset-within-page part of the address. (So that's one hard limit on size, like @Luaan mentioned: size / associativity <= page-size for this VIPT = PIPT speed hack to work. See [VIPT Cache: Connection between TLB & Cache?](//stackoverflow.com/q/46480015)) – Peter Cordes Feb 20 '19 at 19:18
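The size constraint from that comment, worked through for a common 32 KiB / 8-way L1d arrangement (the configuration is a typical one, used here as an assumption):

```python
# VIPT-as-PIPT trick: the cache index must come entirely from the page-offset
# bits, so size / associativity must not exceed the page size.
page_size = 4096        # bytes, typical 4 KiB page
line_size = 64          # bytes per cache line
size      = 32 * 1024   # 32 KiB L1d (common configuration, assumed)
ways      = 8
sets = size // (ways * line_size)                      # 64 sets
index_plus_offset_bits = (sets * line_size).bit_length() - 1  # 12 bits
assert size // ways <= page_size                       # 4096 <= 4096: just fits
print(sets, index_plus_offset_bits)  # 64 sets; 12 bits = exactly the page offset
```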
C_Elegans provides one part of the answer — it is hard to decrease the overall latency of a memory cycle.
The other part of the answer is that in modern hierarchical memory systems (multiple levels of caching), memory bandwidth has a much stronger influence on overall system performance than memory latency, and so that's where all of the latest development efforts have been focused.
This is true both in general computing, where many processes/threads are running in parallel, and in embedded systems. For example, in the HD video work that I do, I don't care about latencies on the order of milliseconds, but I do need multiple gigabytes/second of bandwidth.
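As a rough illustration of where "multiple gigabytes/second" can come from, here is a sketch. The frame format and pipeline depth are assumptions for illustration, not the answerer's actual workload:

```python
# Uncompressed 1080p60 video, 4 bytes/pixel (assumed format for illustration).
width, height, fps, bytes_per_px = 1920, 1080, 60, 4
stream = width * height * fps * bytes_per_px / 1e9   # ~0.5 GB/s per stream
# A processing pipeline typically reads and writes each frame several times
# (scaling, color conversion, encoder reference frames, ...): assume 6 passes.
passes = 6
print(f"{stream:.2f} GB/s per stream, ~{stream * passes:.1f} GB/s through DRAM")
```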

-
And it should definitely be mentioned that software can be designed for the "high" latency pretty easily in most cases, compared to the difficulty and cost of decreasing the latency. Both CPUs and their software are very good at eliminating the effective latency in most cases. In the end, you don't hit the latency limit as often as you might think, unless you have no idea about how the memory architecture and CPU caching/pre-fetching etc. works. The simple approach usually works well enough for most software, especially single-threaded. – Luaan Feb 20 '19 at 09:15
-
On modern Intel CPUs, memory latency is the limiting factor for *single-core* bandwidth: bandwidth can't exceed max_concurrency / latency, and a single core has limited capacity for off-core requests in flight at once. A many-core Xeon (with higher uncore latency from more hops on the ring bus) has *worse* single-core bandwidth than a quad-core desktop chip, despite having more DRAM controllers. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](//stackoverflow.com/q/39260020). It takes many more threads to saturate memory B/W on a many-core Xeon. – Peter Cordes Feb 20 '19 at 19:54
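The concurrency/latency bound in numbers. The line-fill-buffer count and latencies below are typical figures, treated here as assumptions:

```python
# Single-core DRAM bandwidth is capped by outstanding misses / latency
# (Little's law).  Figures below are typical values, assumed for illustration.
line_bytes = 64    # one cache line per outstanding miss
lfbs       = 10    # L1 line-fill buffers per core (typical for Intel cores)
for name, latency_ns in [("desktop (~75 ns)", 75), ("big Xeon (~110 ns)", 110)]:
    bw = lfbs * line_bytes / (latency_ns * 1e-9) / 1e9   # GB/s
    print(f"{name}: ~{bw:.1f} GB/s per core")
# desktop: ~8.5 GB/s per core; Xeon: ~5.8 GB/s per core -- higher latency
# directly lowers the single-core ceiling, even with more memory channels.
```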
-
Overall your main point is correct: most accesses hit in cache for low latency to avoid stalling the out-of-order back-end. HW prefetch mostly just needs bandwidth to keep up with sequential accesses and to have data ready in cache before the core needs it. DRAM latency is hundreds of core clock cycles, so efficient software has to be tuned to use access patterns that don't *cause* cache misses by defeating both spatial/temporal locality and HW prefetching. Especially for loads, because store buffers can decouple store latency from the rest of the out-of-order backend. – Peter Cordes Feb 20 '19 at 20:02
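A quick way to see the cost of defeating spatial locality and the prefetchers from ordinary user code. The array size is chosen to be far larger than any cache; the exact ratio will vary by machine:

```python
import time
import numpy as np

n = 1 << 24                         # 16M doubles ~= 128 MiB, far bigger than LLC
a = np.random.rand(n)
seq = np.arange(n)                  # sequential indices: prefetch-friendly
rnd = np.random.permutation(n)      # random indices: defeat locality + prefetch

for name, idx in (("sequential", seq), ("random", rnd)):
    t0 = time.perf_counter()
    s = a[idx].sum()                # gather through the index array, then reduce
    print(f"{name}: {time.perf_counter() - t0:.3f} s")
# The random gather is typically several times slower: same amount of data and
# the same instructions, but nearly every load misses all the way to DRAM.
```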
-
For disk I/O, latencies of milliseconds would matter if we didn't have readahead prefetch to hide it for sequential accesses. But the higher the latency, the harder it is to hide. (The better your prefetch algorithms need to be, and the more predictable your access patterns need to be.) And the more requests / data bytes you need to keep in-flight to get the bandwidth you want. – Peter Cordes Feb 20 '19 at 20:06
I don't have that much insight, but I expect it is a bit of all three.
Economic
For the majority of computers/telephones, the speed is more than enough. For faster data storage, SSDs have been developed. People can use video/music and other speed-intensive tasks in (almost) real time. So there is not so much need for more speed (except for specific applications like weather prediction, etc.).
Another reason is that to handle a very high RAM speed, fast CPUs are needed, and that comes with a lot of power usage. The tendency to use them in battery-powered devices (like mobile phones) prevents the use of very fast RAM (and CPUs), which also makes it not economically useful to make them.
Technical
With the decreasing size of chips/ICs (now at the nm level), the speed goes up, but not significantly. The shrink is more often used for increasing the amount of RAM, which is needed more (also an economic reason).
Fundamental
As an example (both are circuits): the easiest way to get more speed (used by SSDs) is to just spread the load over multiple components; this way the 'processing' speeds add up too. Compare using 8 USB sticks, reading from them at the same time and combining the results, instead of reading the data from 1 USB stick after another (which takes 8 times as long).
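A toy model of that striping argument; the device count and per-device figures are made-up round numbers:

```python
# Striping reads across N devices multiplies throughput, but each individual
# request still sees one device's latency.  Numbers are illustrative only.
per_device_mbps   = 30    # one USB stick, MB/s (assumed)
per_device_lat_ms = 1.0   # access latency per request, ms (assumed)
for n_devices in (1, 8):
    throughput = n_devices * per_device_mbps
    print(f"{n_devices} device(s): ~{throughput} MB/s aggregate, "
          f"still ~{per_device_lat_ms} ms to the first byte of any one request")
# Bandwidth scales with parallelism; the latency of an individual access does not.
```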

-
@C_Elegans they are both circuits; for this 'generic' question I don't think there is so much difference. – Michel Keijzers Feb 19 '19 at 17:04
-
The amount of time to open a page hasn't really decreased that much due to the precharge cycle; the amount of energy required is not significantly different today than it was a decade ago. That dominates the access time in my experience. – Peter Smith Feb 19 '19 at 17:08
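To put a number on the full row-open cost, here is a sketch; the 22-22-22 timings are a common DDR4-3200 grade, assumed rather than taken from the comment:

```python
# Full random access to a *closed* row on DDR4-3200 CL22-22-22 (assumed, common
# speed grade): precharge + row activate + column access.
io_clock_mhz  = 1600
tRP, tRCD, CL = 22, 22, 22        # each in I/O clock cycles
total_ns = (tRP + tRCD + CL) / io_clock_mhz * 1e3
print(f"{total_ns:.1f} ns")       # ~41 ns: roughly 3x the CAS-only figure
```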
-
Every search in a data array uses truly random access, not a data stream. Is that such a rare task? >>the easiest way to get more speed (used by SSD), is to just spread the load over multiple components<< Looks very reasonable. So can we say that true progress of RAM stopped more than 20 years ago? – Arseniy Feb 19 '19 at 17:10
-
@MichelKeijzers While they are both circuits, SSDs and SDRAM serve very different use cases and use different techniques for storing data. Additionally, saying that CPUs don't really need faster RAM doesn't make much sense; the entire reason most modern CPUs have 3 levels of caches is that their RAM can't be made fast enough to serve the CPU. – C_Elegans Feb 19 '19 at 17:17
-
Very good question! If DDR4 is fast enough, why do CPUs use so many caches? – Arseniy Feb 19 '19 at 17:21
-
@C_Elegans That's true indeed. So in the end it's more an economic reason. Very fast RAM is expensive, and people don't buy computers with (only) very fast RAM; if people needed it (and wanted to spend money on it), the sellers of computers would put more fast RAM inside. – Michel Keijzers Feb 19 '19 at 17:42
-
Also, fast RAM is low density, and the larger it gets, the harder it is to make it fast. That is why, even with SRAM, CPUs have multiple stages of cache. The L3 cache is much larger than L1 but slower, even though they are both SRAM. – Evan Feb 19 '19 at 17:49
-
You said for **bigger** storage there are SSDs. Did you mean **faster**? It's more expensive to get the same amount of storage in an SSD than in an HDD. The main selling points of SSDs are speed, and perhaps noise and reliability. For capacity, HDDs are still better. – user198712 Feb 20 '19 at 07:09
-
@Arseniy: *If DDR-4 is fast enough why does CPU use so many caches?* **Because DRAM is nowhere near fast enough**. A full cache miss takes hundreds of core clock cycles, far too long for out-of-order execution to hide even with a huge ROB. Even *with* caching + HW prefetching, modern CPUs with wide SIMD vectors and many cores need ever more computational intensity (ALU work per load/store) to saturate their ALUs instead of bottlenecking on memory. e.g. do more while it's in registers. L1d cache B/W usually scales with ALU width, but that means you need to cache-block your code carefully. – Peter Cordes Feb 20 '19 at 20:12
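The "computational intensity" point as a back-of-the-envelope roofline estimate; the per-core throughput and bandwidth figures below are assumptions:

```python
# Roofline-style estimate: how many FLOPs you must do per byte loaded from
# DRAM before the ALUs, not memory, become the bottleneck.  Assumed figures.
peak_gflops_per_core = 100   # e.g. ~3 GHz * 2 FMA ports * 8 lanes * 2 flops/FMA
dram_gbps_per_core   = 8     # single-core DRAM bandwidth (see estimate above)
min_intensity = peak_gflops_per_core / dram_gbps_per_core
print(f"need >= {min_intensity:.1f} FLOPs per byte from DRAM to stay ALU-bound")
# Streaming one 8-byte double and doing one FLOP with it gives 0.125 FLOP/byte,
# about 100x short -- hence cache blocking: reuse each loaded value many times.
```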