7

Most modern languages make heavy use of pointers / references: a typical OOP language uses VMT lookups, a typical functional language builds its key data structures out of pointers, and even typical C code often sports expressions like foo->bar->baz.

It is well-known that dereferencing a pointer is cache-unfriendly, unless the pointer happens to point very near some just-accessed location and hits the same cache line.
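To make the cost concrete, here is a minimal sketch (the `Foo`/`Bar`/`Baz` types are made up for illustration) of the dependent loads behind such a chain:

```c
struct Baz { int value; };
struct Bar { struct Baz *baz; };
struct Foo { struct Bar *bar; };

/* foo->bar->baz->value is a chain of dependent loads: each address is
 * only known once the previous load has completed, so each step can be
 * a fresh cache miss if its target lives far from recently touched memory. */
int chase(const struct Foo *foo) {
    const struct Bar *bar = foo->bar;   /* load 1 */
    const struct Baz *baz = bar->baz;   /* load 2: address depends on load 1 */
    return baz->value;                  /* load 3: address depends on load 2 */
}
```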

Is there, or has there been, hardware that tries to address this problem? If so, why isn't it widely deployed today?

Hitting L1 or even L2 cache is so much faster than hitting RAM on current hardware that making pointer dereferences reap some of the caching benefits seems a worthy goal.

asked by 9000
  • remember that `foo->bar->baz` is `*(*(foo+barOff)+bazOff)`; these are 2 dereferences that can land anywhere in memory, so there is no real way to predict what needs to be prefetched – ratchet freak Jul 08 '13 at 14:08
  • @ratchetfreak: Control flow jumps are not easy to predict either, but dedicated branch prediction logic, coupled with prefetching and speculative execution, is said to have a major effect in modern CPUs. Why is branch prediction practical but "pointer prediction" not? – 9000 Jul 08 '13 at 14:16
  • With branch prediction, there are only ever two outcomes to weigh: branch or not. The processor can guess which of the two is more likely and pipeline it. Dereferencing has as many possible outcomes as there are memory addresses, far too many to make an intelligent guess. – mgw854 Jul 08 '13 at 14:22
  • This is already done extensively. By far the most common pattern of memory access is *ptr++ or similar and caches already optimize for sequential access. – JohnB Jul 08 '13 at 14:35
  • @JohnB: `*ptr++` is handled by reading a cache-line-sized block, but I'm not sure it's the most _common_ pattern, it's just the _easiest to handle_ by current hardware (line-based caches, DDR memory). I think VMT access is more common. – 9000 Jul 08 '13 at 14:45
  • @9000 perhaps... – JohnB Jul 08 '13 at 14:59
  • @ratchetfreak There is no way to predict what needs to be prefetched, but at an OS or compiler level one can do a lot to organize how memory is laid out in relation to how it is used (probably better done by the compiler, as it has a clearer understanding of usage patterns), so that foo, bar, and baz end up in a sequential block because they are accessed as such; that takes advantage of caches far better than an ignorant approach to memory organization (a small layout sketch follows these comments). – Jimmy Hoffa Jul 08 '13 at 15:18
  • And not a complete answer, 9000, but one thing that acts to enhance cache usage is immutability. It's not quite what you're referring to, but immutability allows individual cores' caches to be used more efficiently: there is less memory sharing between cores, so there are fewer global cache accesses and fewer barrier instructions needed to communicate between the caches. You'll see this advantage actively used in code generated by the Haskell compiler. – Jimmy Hoffa Jul 08 '13 at 15:21
  • @JimmyHoffa: while I'm all for immutability and the advantages it offers, I can't help but notice that many purely functional immutable structures (like those from the Okasaki book) use pointers heavily to achieve economical immutability. OTOH the only structure with cache-friendly access currently seems to be a traditional contiguous array. – 9000 Jul 08 '13 at 15:29
  • Remember though, Okasaki's structures are all about recreating traditional non-functional structures in a functional way, with the most efficient approach he could find. In some ways using Okasaki's structures is antithetical to the FP way, where other approaches are more natural and performant. That said, I don't quite see where you get him using pointers; his structures are immutable, value-type-based structures, and though compilers may use pointers under the covers, they also may not. – Jimmy Hoffa Jul 08 '13 at 15:49
  • @JimmyHoffa: I meant things like [trees that reuse unmodified parts instead of making a full copy](http://en.wikipedia.org/wiki/Persistent_data_structure#Trees) (a rough sketch follows these comments). Possibly lists can also be implemented with far fewer pointers than a classic linked list uses, but I don't know whether a fully pointer-less implementation is used by e.g. GHC. The same applies to imperative languages (first of all, the VMT and friends). – 9000 Jul 08 '13 at 16:11
  • 9000 I understand what you're referring to as pointers there, but I disagree that they are; semantically they're not pointers, because though the function receives a set of values and returns some of the same values, in FP the assumption is that the returned values are *new memory* even if some of the same values are in it. The reality may be that many compilers tend to use pointers for that behaviour, but there are also scenarios where GHC recognizes that a `membar`-less approach is preferable. Remember GHC does some amazing optimizations, for instance http://stackoverflow.com/questions/578063 – Jimmy Hoffa Jul 08 '13 at 16:45
  • @JimmyHoffa: While stream fusion is great, it depends on lazy evaluation, and Haskell's lazy evaluation depends on thunks, and don't thunks use pointers to the 'actual calculation to do'? In this question, I'm interested in the _hardware_ side of things, in ways to optimize memory reads that might look random but actually are not, e.g. are known at compile time. – 9000 Jul 08 '13 at 16:55
  • Interesting related read: http://www.yosefk.com/blog/its-done-in-hardware-so-its-cheap.html –  Jul 08 '13 at 20:52
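On the memory-layout point raised in the comments, here is one possible sketch (reusing the hypothetical `Foo`/`Bar`/`Baz` types from above): allocate the linked objects from one contiguous block, so that following `foo->bar->baz` stays within a cache line or two instead of touching three arbitrary locations.

```c
#include <stdlib.h>

struct Baz { int value; };
struct Bar { struct Baz *baz; };
struct Foo { struct Bar *bar; };

/* Layout-aware allocation: foo, bar and baz are placed back to back in a
 * single allocation, so the pointer chain stays spatially local even though
 * the access pattern is still "pointer chasing" as far as the CPU is concerned. */
struct Foo *alloc_foo_chain(void) {
    char *block = malloc(sizeof(struct Foo) + sizeof(struct Bar) + sizeof(struct Baz));
    if (!block)
        return NULL;
    struct Foo *foo = (struct Foo *)block;
    struct Bar *bar = (struct Bar *)(block + sizeof(struct Foo));
    struct Baz *baz = (struct Baz *)(block + sizeof(struct Foo) + sizeof(struct Bar));
    foo->bar = bar;
    bar->baz = baz;
    baz->value = 0;
    return foo;
}
```

Compilers and programmers can also issue explicit prefetch hints (e.g. GCC/Clang's `__builtin_prefetch`) when the address to be chased is known early enough; that is a software-side answer to the same problem.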
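And on the persistent-structure side of the discussion, a rough sketch of the path-copying trees linked above (illustrative code, not from the thread): an update allocates new nodes only along the search path and shares every untouched subtree by pointer, which is exactly why such structures are pointer-heavy and tend to scatter across memory.

```c
#include <stdlib.h>

struct Node {
    int key;
    const struct Node *left, *right;
};

static const struct Node *make_node(int key, const struct Node *l, const struct Node *r) {
    struct Node *n = malloc(sizeof *n);
    n->key = key;
    n->left = l;
    n->right = r;
    return n;
}

/* Persistent insert: the original tree is never modified; only the nodes on
 * the path from the root to the insertion point are copied, everything else
 * is shared with the old version through pointers. */
const struct Node *insert(const struct Node *t, int key) {
    if (!t)
        return make_node(key, NULL, NULL);
    if (key < t->key)
        return make_node(t->key, insert(t->left, key), t->right);  /* right subtree shared */
    if (key > t->key)
        return make_node(t->key, t->left, insert(t->right, key));  /* left subtree shared */
    return t;  /* key already present: the whole tree is shared */
}
```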

1 Answer

3
  • there are cache prefetching strategies. Those I'm aware of try to detect (some subset of) linear patterns of memory accesses, and when they succeed they are quite effective (for instance, I remember having measured a difference between forward and backward sequential accesses on one processor, a difference which was no longer present on later generations); a small illustration of such a linear pattern versus pointer chasing follows this list. I'm not aware of attempts to use memory or register content to drive that prefetching, but I'm not following the field closely enough to be sure that nothing has been done in that area; it could have been proved useless in practice, or less useful than some other things.

  • there is the whole set of efforts built around the principle "if you have to wait, try to do something else and hope it will be useful". Out-of-order (OoO) execution is the single-thread application of it: finding instructions in the thread which don't have to wait for memory or for the results of not-yet-finished instructions, with branch prediction there to help find more candidates. And there is a bunch of variants on making the processor thread-aware and using its resources to advance other threads while one is stuck on a slow memory access, hyperthreading being just one of them.
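A small illustration (hypothetical code, not from the answer) of the difference described in the first point: the first loop produces a fixed-stride address stream that a linear prefetcher can detect and run ahead of, while the second only learns the next address once the current load completes, so such a prefetcher has nothing to extrapolate from.

```c
#include <stddef.h>

struct Node { int value; struct Node *next; };

/* Fixed-stride access: addresses advance by sizeof(int) every iteration,
 * a pattern that linear/stride prefetchers recognize and fetch ahead of. */
long sum_array(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chasing: the next address is the result of the current load,
 * so a stride prefetcher sees no predictable pattern to run ahead of. */
long sum_list(const struct Node *p) {
    long s = 0;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```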

answered by AProgrammer