
Reading these two questions, I see that understanding CPU caching behaviour can be important when dealing with large amounts of data in memory. I would like to understand how caching works so I can add another tool to my optimisation toolbox.

What are the core points about the way the CPU cache works so I can write code that uses it sensibly? Relatedly, is there a way to profile code to see if poor cache use is slowing things down?

Timothy Jones
  • Caches are not the same everywhere; most obviously, they vary in size. Don't expect to learn any deep secrets, only good practices (like Michael Borgwardt's advice). – David Thornley Dec 19 '11 at 18:10

4 Answers

  • Keep your data small if possible
  • Keep things that will be accessed together (or right after one another) next to each other in memory (see the sketch after this list)
  • Learn about your compiler's optimization parameters
  • Read What every programmer should know about memory for more details than you could ever want
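As a minimal sketch of the second point, the following compares a traversal that walks memory contiguously with one that strides through it (the array name and sizes are invented for illustration):

```c
/* Contiguous vs. strided access over the same data. Walking row by
   row touches consecutive bytes, so every cache line fetched is fully
   used; walking column by column jumps COLS * sizeof(int) bytes per
   step and wastes most of each line. */
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static int grid[ROWS][COLS];

long sum_row_major(void)      /* cache friendly: consecutive addresses */
{
    long sum = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            sum += grid[r][c];
    return sum;
}

long sum_column_major(void)   /* same result, far more cache misses */
{
    long sum = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            sum += grid[r][c];
    return sum;
}
```

Both functions compute the same sum, but once the array is larger than the cache the row-major version typically runs several times faster.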
Michael Borgwardt

The intricacy of this issue has been beyond human comprehension for the last five years or so. Combine that with short-vector parallelism (SIMD), and you get a sense of hopelessness: optimizing code by hand is no longer economically feasible. Not that it is impossible, but it is no longer cost-effective.

The current approach is to rely on teaching computers how to optimize: making variations of the code that compute the same answers with different structures (loops, data structures, algorithms) and automatically evaluating their performance. The rules for these code transformations are specified with a rigorous mathematical model, so that they are something both computer scientists can understand and computers can execute.
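As a toy illustration of that workflow (everything here is invented for the example; real systems such as ATLAS or Spiral are far more systematic), generate variants of the same computation and let measurement pick the winner:

```c
/* Autotuning in miniature: time the same matrix transpose with several
   tile sizes and keep whichever the hardware likes best. A real
   harness would warm up, repeat runs, and search a larger space. */
#include <stdio.h>
#include <time.h>

#define N 2048
static double a[N][N], b[N][N];

static void transpose_tiled(int tile)
{
    for (int i = 0; i < N; i += tile)
        for (int j = 0; j < N; j += tile)
            for (int ii = i; ii < i + tile; ii++)
                for (int jj = j; jj < j + tile; jj++)
                    b[jj][ii] = a[ii][jj];
}

int main(void)
{
    int candidates[] = { 8, 16, 32, 64, 128 };   /* all divide N */
    int best = candidates[0];
    double best_time = 1e30;

    for (int k = 0; k < 5; k++) {
        clock_t t0 = clock();
        transpose_tiled(candidates[k]);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("tile %3d: %.3f s\n", candidates[k], secs);
        if (secs < best_time) { best_time = secs; best = candidates[k]; }
    }
    printf("picked tile size %d\n", best);
    return 0;
}
```

The point is not the particular tile sizes but the division of labour: the program, not the programmer, decides which variant best fits the cache.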

The following is a link posted by Larry OBrien in one of his answers.

http://onward-conference.org/2011/images/Pueschel_2011_AutomaticPerformanceProgramming_Onward11.pdf

rwong
    The fastest BLAS implementation (GotoBLAS) uses hand-optimized code to ensure maximal cache usage for matrix multiplication. – quant_dev Jan 14 '12 at 18:42
  • My home computer has 8 cores, with 12 MB of L2 cache per group of 4 cores. So single-threaded code can use 12 MB of L2 cache, while each of 8 threads can use only 3 MB. For 2 or 4 threads you’d try to use cores from each group of four. Good fun. Worst case: you measure the cache size using one thread, then you run eight threads and wrongly assume that each has 12 MB. – gnasher729 Mar 27 '23 at 16:57

It is quite possible to understand and optimize for caches. It starts with understanding the hardware and continues with being in control of the system: the less control you have over the system, the less likely you are to succeed. Linux or Windows running a bunch of applications/threads that are not idling gives you very little of that control.

Most caches are somewhat similar in their properties: they use some part of the address field to look for hits, and they have a depth (ways) and a width (cache line size). Some have write buffers, some can be configured to write through or to bypass the cache on writes, and so on.

You need to be acutely aware of all the memory transactions going on that are hitting that cache (some systems have independent instruction and data caches making the task easier).

You can easily make a cache useless by not carefully managing your memory. For example, suppose you have multiple data blocks you are processing and hope to keep in cache, but they sit at addresses that are even multiples relative to the cache's hit/miss checking, say 0x10000, 0x20000 and 0x30000. If you have more of these than there are ways in the cache, you may quickly end up with something that runs quite slowly with the cache on, slower than it would with the cache off. But change those addresses to, say, 0x10000, 0x21000 and 0x32000, and that might be enough to take full advantage of the cache and reduce the evictions.
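To make that arithmetic concrete, here is a sketch under an assumed cache geometry (64-byte lines, 512 sets, i.e. a 256 KiB 8-way cache; no particular CPU is implied) showing why the first set of addresses collides while the staggered set does not:

```c
/* In a set-associative cache the set index comes from the middle
   address bits: set = (address / line_size) % num_sets. The geometry
   below is an assumption for illustration only. */
#include <stdio.h>

#define LINE_BYTES 64
#define NUM_SETS   512

int main(void)
{
    unsigned long colliding[] = { 0x10000, 0x20000, 0x30000 };
    unsigned long staggered[] = { 0x10000, 0x21000, 0x32000 };

    for (int i = 0; i < 3; i++)   /* all three land in set 0 and
                                     compete for the same few ways */
        printf("0x%05lx -> set %3lu\n", colliding[i],
               (colliding[i] / LINE_BYTES) % NUM_SETS);

    for (int i = 0; i < 3; i++)   /* these spread to sets 0, 64, 128 */
        printf("0x%05lx -> set %3lu\n", staggered[i],
               (staggered[i] / LINE_BYTES) % NUM_SETS);
    return 0;
}
```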

Bottom line: the key to optimizing for a cache (apart from knowing the system quite well) is to keep all of the things you need performance for in the cache at the same time, organizing the data such that it can all be in the cache at once, and preventing things like code execution, interrupts and other regular or random events from evicting significant portions of the data you are using.
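One common way to apply this is to process a large buffer in cache-sized chunks, finishing all the work on a chunk while it is still resident, rather than making several full passes over it. A minimal sketch, assuming roughly 32 KiB of usable cache (the function and constant names are invented):

```c
#include <stddef.h>

/* Assumed cache budget; tune to the actual hardware. */
#define CHUNK (32 * 1024 / sizeof(float))

void scale_then_offset(float *data, size_t n, float s, float o)
{
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = (base + CHUNK < n) ? base + CHUNK : n;
        for (size_t i = base; i < end; i++)  /* pass 1 on the chunk  */
            data[i] *= s;
        for (size_t i = base; i < end; i++)  /* pass 2: chunk is hot */
            data[i] += o;
    }
}
```

Two full passes over a buffer much larger than the cache would fetch every line from memory twice; the chunked version fetches each line once.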

The same goes for code, though it is a little harder because you need to control the locations where the code lives to avoid collisions with other code you want to keep in the cache. When testing or profiling any code that goes through a cache, be aware that adding a single line of code here and there, or even a single nop (anything that shifts the addresses where the code lives from one compile to the next), changes where the cache lines fall within that code, and thus changes what gets evicted and what doesn't in critical sections.

old_timer

Both rwong's and Michael Borgwardt's answers give good advice.

Also, trust the compiler's optimizations first on these issues.

If using a recent GCC compiler, you might use (sparingly) its __builtin_prefetch function. See this answer on Stack Overflow.
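For illustration, here is a minimal sketch of prefetching while walking a linked list (the struct and its fields are invented; only __builtin_prefetch itself is GCC's):

```c
#include <stddef.h>

struct node {
    struct node *next;
    int payload;
};

long sum_list(struct node *head)
{
    long sum = 0;
    for (struct node *p = head; p != NULL; p = p->next) {
        if (p->next != NULL)
            /* Hint: we will read p->next soon, with low temporal
               locality. The compiler may emit a prefetch or ignore it. */
            __builtin_prefetch(p->next, 0, 1);
        sum += p->payload;
    }
    return sum;
}
```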

Basile Starynkevitch