
I am running a program that, among other things, does some matrix multiplications, singular value decompositions, and accesses to matrix subsets on a very large data set (these are the lines of code that take the most time).

I am running the computations on a machine with 40 CPU cores and 1 TB of RAM (yes, 1 TB). The dataset itself is ~450 GB and another matrix in the code is equally big, meaning I use ~95% of the 1 TB of memory. I've noticed that my computations scale quite badly when I increase the dataset size.

My question is: does using several cores instead of just one slow my computations down when memory access takes up most of the time?

Andy
  • The _specific_ architecture & topology of your CPUs, interconnects, caches and memory does affect performance. Unfortunately no _general_ analysis of this is likely to fit in an answer here, so you really need to describe all of those things (plus your program's access patterns, how performance varies with dataset size, and what profiling you did) to get anything resembling an answerable question. – Useless Nov 08 '22 at 16:56
  • To start with, you can look at things like `numastat` (on Linux) and whatever perf counters your OS/CPU offer to get an idea of what changes as your dataset grows. – Useless Nov 08 '22 at 17:00

3 Answers


Does using several cores instead of just one slow my computations down when memory access takes up most of the time?

Maybe. It depends on whether your system is NUMA¹. This can cause a slowdown when one core has to ask another to access its memory.


1: Non-uniform memory access
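A quick way to see whether the box is NUMA at all on Linux (a minimal sketch that just reads the standard sysfs layout; `numactl --hardware` or `lscpu` will give you the same information in more detail):

```python
# Count the NUMA nodes Linux exposes in sysfs: a single node0 means uniform
# memory access; more than one means memory is split across nodes whose
# latency depends on which core touches it.
import glob

nodes = glob.glob("/sys/devices/system/node/node[0-9]*")
print(f"NUMA nodes detected: {len(nodes)}")
```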

Laiv
pjc50
  • A system with 40 cores and 1TB is definitely NUMA. However, the memory is already split up (probably) 5 or 10 or 20 ways. Using more cores means each core is closer to some memory and farther away from some. – user253751 Nov 08 '22 at 14:03
  • Regardless of NUMA, there is also the possibility to be bottlenecked by memory bandwidth. – user253751 Nov 08 '22 at 14:05

With matrix multiplication on a terabyte of data, memory bandwidth can easily be your bottleneck. You have 40 cores, probably running nicely vectorised code… that will put a lot of pressure on your memory subsystem, and the better your code, the higher the pressure.
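If you want a rough number for the bandwidth you actually achieve, a crude probe like the one below already tells you a lot. This is only a sketch, not a proper STREAM benchmark; the ~512 MB array size is an arbitrary choice that is simply far larger than any cache:

```python
import time
import numpy as np

# Time a large copy and report the sustained rate; because the array is much
# bigger than the caches, this measures main-memory read + write bandwidth.
x = np.ones(512 * 1024 * 1024 // 8)   # ~512 MB of float64
t0 = time.perf_counter()
y = x.copy()
dt = time.perf_counter() - t0
print(f"~{(x.nbytes + y.nbytes) / dt / 1e9:.1f} GB/s (read + write)")
```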

I would start by making sure that you are partitioning the work into chunks that can be handled inside the cache of each CPU, for example splitting each matrix into blocks of roughly 256×256 elements, so each block stays well under 2 MB. You might also check how much work per dollar you would get out of an M1 Mac: not that much RAM, but tons of memory bandwidth (800 GB/s), very fast SSD / virtual memory, and a lot cheaper than any 1 TB machine.
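To make the blocking idea concrete, here is a minimal sketch of a tiled multiply in NumPy. The block size of 256 is an assumption (256×256 float64 tiles are 512 KB each); a BLAS-backed `a @ b` already does this internally and far better, so this is only meant to illustrate what "working in cache-sized chunks" means:

```python
import numpy as np

def blocked_matmul(a, b, block=256):
    """Compute a @ b by accumulating cache-sized tiles (illustration only)."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=np.result_type(a, b))
    for i in range(0, n, block):          # tile rows of the output
        for j in range(0, m, block):      # tile columns of the output
            for p in range(0, k, block):  # walk the inner dimension in tiles
                out[i:i+block, j:j+block] += (
                    a[i:i+block, p:p+block] @ b[p:p+block, j:j+block]
                )
    return out
```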

PS: your caches may have problems if the distance between rows is a power of two. A 4096×4096 matrix could be a very bad idea; if that's what you have, try changing it to 4100×4096, for example.
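The padding trick, sketched in NumPy (the sizes are only examples): allocate a slightly wider buffer and work on a view of it, so consecutive rows are no longer exactly a power-of-two number of bytes apart:

```python
import numpy as np

n, pad = 4096, 4            # a few extra columns break the 2**k row stride
buf = np.zeros((n, n + pad))
m = buf[:, :n]              # logical 4096 x 4096 matrix, rows 4100 * 8 bytes apart
print(m.shape, m.strides)   # (4096, 4096), (32800, 8)
```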

gnasher729
  • I bet you the 1TB machine also has 800GB per second memory bandwidth or more. – user253751 Nov 10 '22 at 09:12
  • But it also has an enormous price tag. Dell or Apple machines with 1.5TB are between 20,000 and 30,000 dollars. And 800 GB is an awful lot. I wouldn’t bet on it until I have seen the spec. – gnasher729 Nov 10 '22 at 23:02

It really depends on the implementation.

A very important part of optimization is using the caches efficiently. While L1 and L2 are typically per core, L3 is often shared. In the absolute worst case, the increased number of memory requests from many cores could cause L3 cache lines to be evicted just before they are needed again.

A well-optimized implementation should load a section of data that fits into the L1 cache and do as much computation on it as possible before loading the next chunk, and ideally do the same at the L2 level. This typically results in almost unreadable code, but it can give huge performance gains.

So the first thing I would do is check the library. I would expect common and popular libraries like numpy to perform well, as would libraries whose main selling point is performance, like Intel's performance primitives. I would expect in-house or smaller libraries to perform worse, since this kind of optimization requires a huge time investment.
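With NumPy specifically, you can at least check which BLAS/LAPACK backend you actually got (`numpy.show_config()` is standard NumPy); finding an unoptimized reference BLAS there would go a long way towards explaining poor scaling:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was built against,
# e.g. OpenBLAS or MKL.
np.show_config()
```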

Workstation-class processors typically also scale the amount of cache and the number of memory channels with the number of cores. If you have a 40-core processor, you likely have far more bandwidth available than a single core will use; 4-8 cores per memory channel seems fairly common today.

A well-optimized library should make good use of the resources available. In some cases manual tuning parameters might help, but that will likely be a trial-and-error process. Multiplying huge matrices is just fundamentally slow, though, so you will likely not solve your problem of poor scaling.
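One concrete tuning experiment, sketched under the assumption of a BLAS-backed NumPy plus the third-party `threadpoolctl` package: time the same multiply at several thread counts and keep the sweet spot, since past some point extra threads only fight over memory bandwidth:

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

a = np.random.rand(8192, 8192)
b = np.random.rand(8192, 8192)

for n_threads in (1, 4, 10, 20, 40):
    with threadpool_limits(limits=n_threads):   # caps the BLAS thread pool
        t0 = time.perf_counter()
        a @ b
        print(f"{n_threads:>2} threads: {time.perf_counter() - t0:.2f} s")
```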

JonasH
  • Just an idea: If your code is multithreaded (I bet it is), you may be able to control how many threads are used. And determine the optimal number. – gnasher729 Nov 09 '22 at 16:50
  • My Mac at home: L1 is per core. L2 is per group of 4 cores. L3 is shared between all cores and GPUs. And fast memory with 16 byte lines. – gnasher729 Mar 13 '23 at 14:09