61

I've got some process written in Go. Here's an example counting lines in text, though the question is meant to be far more general than this particular example:

func lineCount(s string) int {
    count := 0
    for _, c := range s {
        if c == '\n' {
            count++
        }
    }
    return count
}

Alright, not bad, but it's too slow, so let's make it concurrent:

func newLine(r rune, c chan<- struct{}, wg *sync.WaitGroup) {
    if r == '\n' {
        c <- struct{}{}
    }
    wg.Done()
}

func sumLines(c <-chan struct{}, result chan<- int) {
    count := 0
    for range c {
        count++
    }
    result <- count
}

func lineCount(s string) int {
    c := make(chan struct{})
    var wg sync.WaitGroup
    for _, r := range s {
        wg.Add(1)
        go newLine(r, c, &wg)
    }
    result := make(chan int)
    go sumLines(c, result)
    wg.Wait()
    close(c)
    return <-result
}
    

Better, because now we're using all our cores, but let's be honest, one goroutine per letter is probably overkill, and we're likely adding a lot of overhead between the horrendous number of goroutines and the locking/unlocking of the wait group. Let's do better:

func newLine(s string, c chan<- int, wg *sync.WaitGroup) {
    count := 0
    for _, r := range s {
        if r == '\n' {
            count++
        }
    }
    c <- count
    wg.Done()
}

func sumLines(c <-chan int, result chan<- int) {
    count := 0
    for miniCount := range c {
        count += miniCount
    }
    result <- count
}

func lineCount(s string) int {
    c := make(chan int)
    var wg sync.WaitGroup
    chunkSize := len(s) / MAGIC_NUMBER
    for i := 0; i < MAGIC_NUMBER; i++ {
        end := (i + 1) * chunkSize
        if i == MAGIC_NUMBER-1 {
            end = len(s) // the last chunk also picks up the remainder
        }
        wg.Add(1)
        go newLine(s[i*chunkSize:end], c, &wg)
    }
    result := make(chan int)
    go sumLines(c, result)
    wg.Wait()
    close(c)
    return <-result
}

So now we're dividing our string up into MAGIC_NUMBER roughly even chunks (the last one picks up the remainder), one goroutine per chunk. I've got 8 cores, so do I ever have a reason to set MAGIC_NUMBER to greater than 8? Again, while I'm writing this question with the example of counting lines in text, the question is really directed at any situation where the problem can be sliced and diced any number of ways, and it's really up to the programmer to decide how many slices to go for.

TheEnvironmentalist
  • 1,159
  • 1
  • 8
  • 14
  • 30
    "Any reason" is incredibly broad. An obvious reason is that you are doing N completely different things, for which it makes sense to use N different processes. Veeeeeeery exaggerated example: on my laptop, my email program and my audio workstation are two different processes, and I see no reason why they should be one. – Jörg W Mittag Aug 31 '20 at 21:27
  • 7
    It certainly requires a benchmark to prove actual performance optimality. My domain is in image processing, and many of the image algorithms I designed process pixels by the rows. These rows are grouped into consecutive bands; each band becomes a "task" that is sent off to a worker for execution. When parallelizing, I have learned that the optimal number of bands is higher than the number of CPU cores. The actual value always needs to be found out through experimentation (benchmarking); there's no reliable rule for prediction. – rwong Aug 31 '20 at 21:29
  • @rwong Mind writing that as an answer? Also if you’ve developed an intuition over the ideal number of processes/threads/goroutines relative to number of processor cores – TheEnvironmentalist Aug 31 '20 at 21:32
  • 3
    Factor 1: L1 and L2 locality. Factor 2: The "task submitted" to "tasks started executing" latency, which is a property of the OS scheduling (or Goroutine scheduling) limit, and which causes some discrepancy between the finishing times of a group of tasks, even if the group of tasks are all started at the exact same time. – rwong Aug 31 '20 at 21:32
  • 2
    @TheEnvironmentalist This question is absolutely a duplicate of something; I just don't bother to find it. It's better to put valuable answers in one place (i.e. the "canonical" Q&A question on the same topic) than to have them scattered around. – rwong Aug 31 '20 at 21:33
  • Factor 3: how busy the computer is doing "other stuff". If it's busy, it's better not to further overload the CPU. – rwong Aug 31 '20 at 21:35
  • 4
    Think of various types of I/O. If a thread is blocked, then it isn't running. Therefore the most efficient number is more than the number of cores available, by some magic number. – NomadMaker Sep 01 '20 at 09:48
  • I'm going to throw your app in a container and limit its CPU usage anyway, so go wild and don't worry too much about this. – Michael Hampton Sep 01 '20 at 12:30
  • 2
    You have the code and the ability to change `MAGIC_NUMBER`. What are the results of **your measurements** of throughput versus `MAGIC_NUMBER`? Until you can demonstrate a difference, it is irrational to search for "a reason to set `MAGIC_NUMBER` to greater than 8?" – Eric Towers Sep 01 '20 at 17:14
  • Consider hyperthreading: your 8 cores would support up to 16 parallel processes... in a way that's not as fast as 16 real cores, but faster than "just" 8 cores. – R. Schmitz Sep 03 '20 at 14:25
  • I agree with benchmarking. I remember a discussion about the Prime95 hand-optimised Windows code, being on the knife edge between CPU bound and IO (memory) bound. One of the users thought he could speed up his throughput by doubling the memory in his machine. Turns out, it was slower! After some discussions about what's going on under the hood, it seems that the added memory shared the same fixed cache lines with the existing memory, suffered more cache misses, and reduced throughput. This was also somewhat data dependent. – Quantum Mechanic Sep 09 '20 at 10:49
  • Creating threads is relatively costly. If what they do saves more time than you burn creating them it's a win – Jay Sep 09 '20 at 15:35
  • Well, it won't let me actually answer for some reason, but a very good reason for using more processes than cores is in super-computing applications where your cores are vector processors which can perform arrays of operations at once (not just a single operation). Optimized correctly, you can run #processes = #cores x #vector arrays – JoeDuncan Sep 10 '20 at 15:42
  • Keep in mind that modern CPUs implement Thermal Dynamic Voltage and Frequency Scaling (DVFS) that makes benchmarking **difficult**, and theoretical reasoning **impossible**, and model-based simulations **unrealistic**. – rwong Sep 24 '20 at 10:52
  • @rwong I wouldn't say theoretical reasoning need be impossible, just more complicated – TheEnvironmentalist Sep 24 '20 at 16:00
  • Sometimes. For example: [Simulating a real-world system in Go](https://www.dotconferences.com/2017/11/sameer-ajmani-simulating-a-real-world-system-in-go). – peterSO Sep 26 '20 at 14:28

10 Answers

182

The canonical time when you use far, far more processes than cores is when your processes aren't CPU bound. If your processes are I/O bound (either disk or more likely network), then you can absolutely and sensibly have a huge number of processes per core, because the processes are sleeping most of the time anyway. Unsurprisingly enough, this is how any modern web server works.
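To put that in the question's terms, here is a minimal Go sketch of the I/O-bound case; the URL list, the hypothetical example.com endpoints, and the cap of 200 in-flight requests are all made-up values for illustration. Each goroutine spends nearly all of its time blocked on the network, so hundreds of them per core cost almost nothing.

package main

import (
    "fmt"
    "net/http"
    "sync"
)

func main() {
    urls := make([]string, 1000)
    for i := range urls {
        urls[i] = fmt.Sprintf("https://example.com/item/%d", i)
    }

    sem := make(chan struct{}, 200) // allow up to 200 requests in flight, far more than cores
    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            sem <- struct{}{}        // take a slot
            defer func() { <-sem }() // give it back
            resp, err := http.Get(u) // the goroutine sleeps here, using no CPU
            if err != nil {
                return
            }
            resp.Body.Close()
        }(u)
    }
    wg.Wait()
}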

Philip Kendall
  • 22,899
  • 9
  • 58
  • 61
  • That's a great point. In the case of I/O bound processes limited by disk read/write speed, would the number of read/write heads (two for most hard drives) become the number to aim for? – TheEnvironmentalist Aug 31 '20 at 22:04
  • 15
    You'd first of all have to convince me of your use case for using magnetic disks rather than SSDs. – Philip Kendall Aug 31 '20 at 22:06
  • 1
    Touché, good sir – TheEnvironmentalist Aug 31 '20 at 22:21
  • 2
    @TheEnvironmentalist Two heads per HD? They're ganged together, it doesn't help. However, what is the device? I have an 8-headed device here. (RAID 5 array.) – Loren Pechtel Sep 01 '20 at 03:41
  • 21
    Note that some seemingly CPU bound tasks can also be I/O bound when processing large amounts of data. People often forget that RAM is I/O and fetching data from RAM is basically I/O wait time. During the fetch you can process previously fetched data that's currently in cache. This is why some CPU intensive tasks like 3D rendering with heavy use of textures or signal processing (image, video, audio etc.) can sometimes benefit from having more tasks than CPU cores – slebetman Sep 01 '20 at 06:30
  • 39
    "this is how any modern web server works." is not true. There are lots of async servers out there. The best know and insanely popular (1/3 of the market) is Nginx. Even though it allows spawning more processes than cores it just doesn't make sense to do that. – freakish Sep 01 '20 at 06:38
  • 14
    @freakish it depends how weasely you want to be about the use of the word "concurrent". Strictly speaking, concurrent tasks don't need to be parallel - you just need one task to start before another one ends, and it's meaningful to talk about asynchronous concurrency. Nginx tasks are not that different to Golang goroutines (that the OP asks about), in that respect. – James_pic Sep 01 '20 at 08:34
  • @slebetman Indeed, GPU hardware is structured around this principle. To get peak performance on a CUDA device you must oversubscribe the hardware. The Cray XMT took this idea to an extreme (some would say absurd) degree by supporting up to 128 hardware threads per core (and here again, you had to oversubscribe in this way to get peak performance from the machine). – Nobody Sep 01 '20 at 12:15
  • @TheEnvironmentalist: When using older head-positioning technologies, it might have been advantageous to duplicate the read/write logic in a drive to allow multiple heads that share a movement mechanism to be read or written simultaneously. Today's drives, however, require that the head move continuously in and out slightly while reading a track to compensate for wobble or other imperfections; it would be unlikely that multiple heads could be kept lined up with their respective tracks unless they had independent mechanisms to move them. – supercat Sep 01 '20 at 15:57
  • When you send a signal to hardware, the hardware/firmware may be able to perform operations without making use of a CPU thread at all. Of course, your code hooks will eventually need to run in a process. However, this will be after the hardware signals your device driver (e.g., via CPU interrupt) to signal your code (e.g., by hooking an event using a driver API) that an event has occurred. The actual async task might not be using any processes (or might be using only 1 for polling). – Brian Sep 01 '20 at 15:58
  • @TheEnvironmentalist: Perhaps there might be some advantage to having a system use both a "fine" and "coarse" movement control for the heads, analogous to the way many CD-players work, and have multiple heads share the coarse movement control while using separate fine controls, but I think hard drive research is focused more on archival capacity than speed. – supercat Sep 01 '20 at 16:00
  • @slebetman: I find myself thinking that operating systems should offer more control over hyperthreading and core affinity. If two threads will be using a lot of shared data, and would need tight memory consistency with each other but not with other threads, running them hyper-threaded on a common core may offer better caching efficiency than running them on separate cores, especially if data that was shared exclusively between those two threads could be processed with "ordinary" reads and writes without having to synchronize with other cores. – supercat Sep 01 '20 at 16:04
  • @slebetman: Additionally, if an operating system had "hints" about which tasks were apt to be memory-bound versus CPU bound, it might be advantageous to pair a memory-bound task with a CPU-bound task on the same core (in cases where neither would benefit from being paired with one particular associated task), so the CPU-bound task could run every time the memory-bound task had a cache miss. – supercat Sep 01 '20 at 16:06
  • @supercat Getting very inappropriate for comments here, but check out Linux's `taskset` and `sched_setaffinity`. Pretty coarse, but they do want you suggest. – Philip Kendall Sep 01 '20 at 16:08
  • 9
    @freakish The body of the question makes it very clear that “process” is used here to mean “any thread of execution running concurrently,” not just “OS process” (since goroutines are lightweight green threads, not OS threads and definitely not OS processes). With that definition, the quote is trivially true: nginx obviously allows more concurrent requests than there are cores. This answer doesn’t claim that the concurrency has to come from OS processes. – Alexis King Sep 01 '20 at 16:54
  • 4
    @AlexisKing The body of the question uses the word "process" only once in the first sentence. While ambiguous, the OP uses "goroutine" elsewhere, multiple times. How is that clear? The answer at best redefines the widely accepted definition of "process" and leads to confusion. – freakish Sep 01 '20 at 18:12
  • 3
    @AlexisKing: Even if it's possible to interpret this answer in a way that makes it correct, I think that many or most readers will interpret "[process](https://en.wikipedia.org/wiki/Process_(computing))" in the usual way, and so be misled. (After all, not everyone here knows how goroutines work. This site is for software engineering in general, not Go in particular.) – ruakh Sep 01 '20 at 23:32
59

Short answer: Yes.

Longer answer:

Set your magic number stupid high, benchmark it, set it low, benchmark it again, and keep doing that until you have your answer.

The number of moving parts here is way too high to arrive at an answer via analysis in any kind of reasonable timeframe; you'll get a much more reliable answer much more quickly by just running comparative benchmarks.

It's not perfect, but it beats the hell out of trying to out-think the web of interactions between a compiler, an OS (that is running other processes), BIOS, and hardware to arrive at an ideal number (which will change with the weather anyway).
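In Go specifically, the built-in benchmark harness makes that sweep cheap to write. A minimal sketch; `lineCountParallel` is not from the question, it's a simple chunked variant included here only so the snippet is self-contained (it uses a per-worker result slice instead of the question's channel, purely to keep the sketch short):

package linecount

import (
    "fmt"
    "strings"
    "sync"
    "testing"
)

// lineCountParallel splits s into `workers` chunks and counts newlines in parallel.
func lineCountParallel(s string, workers int) int {
    counts := make([]int, workers)
    chunk := len(s) / workers
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        lo, hi := i*chunk, (i+1)*chunk
        if i == workers-1 {
            hi = len(s) // last worker picks up the remainder
        }
        wg.Add(1)
        go func(i, lo, hi int) {
            defer wg.Done()
            counts[i] = strings.Count(s[lo:hi], "\n")
        }(i, lo, hi)
    }
    wg.Wait()
    total := 0
    for _, c := range counts {
        total += c
    }
    return total
}

func BenchmarkLineCount(b *testing.B) {
    s := strings.Repeat("some text\n", 1000000)
    for _, workers := range []int{1, 2, 4, 8, 16, 32, 64} {
        b.Run(fmt.Sprintf("workers=%d", workers), func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                lineCountParallel(s, workers)
            }
        })
    }
}

Run it with `go test -bench .` and read the ns/op column; the knee in the curve is your answer for that machine, that day.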

Iron Gremlin
  • 1,115
  • 6
  • 8
  • 24
    This answer contains a great truth: benchmarking beats lots of theory and speculation. – qwr Sep 03 '20 at 07:43
  • 6
    @qwr An adage also known as : *"One experiment is worth a thousand expert opinions"* – J... Sep 04 '20 at 13:25
  • 5
    @qwr: another great truth is that benchmarking is terribly difficult. In particular when developing general purpose solutions such that you don't know the problem size and specifics in advance (and sometimes even have no idea of realistic problems). Some understanding of the trends in behavior is not a luxury. – Yves Daoust Sep 09 '20 at 09:49
  • 3
    @YvesDaoust, this is especially true for parallelized algorithms, where it is far harder to set up "controlled" benchmark experiments. – taciteloquence Sep 10 '20 at 10:29
  • @taciteloquence One of the few things I remember from my university days was a lesson about what used to be simply called 'network computing': unlike single-physical-CPU systems, they're chaotic and totally unpredictable. There is no such thing as a 'controlled benchmark experiment' — you can establish a few baselines, sure, but real life will always be different from what you can assemble in a _lab_. – Gwyneth Llewelyn Sep 10 '20 at 11:14
  • @YvesDaoust Cast not away therefore your boldness, which hath great recompense of reward. – Mast Sep 11 '20 at 13:05
  • The most important part is the "Short answer: yes", as the OP's question was "do I ever have a reason". And yes, there are situations where it will be viable and even recommended to use more threads than cores, but which situations is a much longer answer without an upper bound :) – David Mårtensson Oct 07 '20 at 19:06
12

In A.I. it is common for people to observe super-linear speedups when they write parallel algorithms (that is, > K times speedup with K processes running on K cores). This is because you are often looking for something (for example, the answer to a combinatorial problem), and you stop as soon as one core finds the answer.

Such algorithms can be redesigned to not need many cores, by just "time-sharing" a single core, but this is much harder to implement than just spawning more independent threads, each searching part of the problem.
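A minimal Go sketch of that shape; the search space, the worker count of 16, and the `isSolution` predicate are all placeholders. Each goroutine scans its slice of candidates, and the first hit cancels everyone else.

package main

import (
    "context"
    "fmt"
)

func isSolution(n int) bool { return n == 123456 } // placeholder predicate

func searchChunk(ctx context.Context, lo, hi int, found chan<- int) {
    for n := lo; n < hi; n++ {
        select {
        case <-ctx.Done(): // someone else already found an answer
            return
        default:
        }
        if isSolution(n) {
            select {
            case found <- n:
            case <-ctx.Done():
            }
            return
        }
    }
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    const space, workers = 1000000, 16
    found := make(chan int, 1)
    chunk := space / workers
    for i := 0; i < workers; i++ {
        go searchChunk(ctx, i*chunk, (i+1)*chunk, found)
    }
    answer := <-found
    cancel() // stop the remaining searchers early
    fmt.Println("found:", answer)
}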

Chris Jefferson
  • 221
  • 1
  • 3
  • 1
    That is a very interesting aspect. It's like doing a breadth-first search if you suspect the desired item is in a shallow section. – Peter - Reinstate Monica Sep 02 '20 at 07:24
  • *this is much harder to implement than just spawning more independant threads* This will depend on the programming language. – Max Barraclough Sep 02 '20 at 13:41
  • It would be useful to explain what properties you are thinking of. "green threads" (languages which support functionality which looks exactly like threads to the user, but are run on the same OS thread, with time-slicing done by the language) would make this easy I suppose. Were you thinking of anything else? – Chris Jefferson Sep 02 '20 at 14:14
  • 1
    *This is because you are often looking for something, and you stop as soon as one core finds the answer* - a stop after one core finds a result gives you a K speedup, not a >K speedup. otherwise, how do you actually define a K speedup? – HeyJude Sep 02 '20 at 20:22
  • 1
    Imagine your search must make a "lucky first choice", with 50/50 chance of being right (for example, it must pick an integer and the integer must be even). Once you have picked even you find the answer in 1s; if you pick odd it will take 1,000s to prove wrong. The average time of one run is .5 + (.5*.5)*1001 + (.5*.5*.5)*2001 + ..., which is at least > 500. Now imagine we start 10 processes off at the same time on one CPU. It will take them 10 seconds to get to the answer if they are "fast", but there is only a 1/1024 chance none of them will be lucky, so our speedup is much more than 10x – Chris Jefferson Sep 03 '20 at 06:25
  • 1
    It's worth mentioning that when a task is parallelized, it is often broken up into *convenient* chunks, rather than chunks that require the same amount of *time*. If the threads must occasionally synchronize, then you will always be waiting for the slowest thread to finish. In this case it is better to use more threads than cores so that your cores aren't idle while waiting for the slowest one to finish its work. – taciteloquence Sep 10 '20 at 10:32
11

Take the example of source-based Linux distributions (like Gentoo): to optimize compilation time, they obviously use parallel compilation, with more processes than the number of available "cores" (or processor threads when Hyperthreading is enabled on Intel processors; these are virtual cores even if they share some parts of the internal pipelines and the processing units are internally scheduled). The default is to use the number of (virtual) cores plus one, to avoid being too bound by I/O limits.

Note that disk I/O limits are not systematic, because modern OSes use aggressive filesystem caching in memory. Most of the time the I/O bound is replaced by a memory access time bound (when the data does not fit in the L1-L3 CPU caches or in the optional extra caches on the motherboard, something that has disappeared now that modern processors integrate the memory controller and the L3 cache into the CPU chip).

Compiling Linux requires very frequent access to highly cacheable data (notably header files, but also the temporary compilation units and the various stages of the compiler), so these builds are today much more CPU-bound than I/O-bound (whether to disk or to external network storage, which is also cached).

Now if the work happens mostly in memory, the real limitation is the asynchronous behavior between threads/processes that take unequal time to complete their tasks and must meet at many "rendezvous" points: there is idle time where some threads are waiting, and using one extra thread lets you fill it without excessively costly preemption and scheduling (context switches between threads or processes have a cost on the OS, but using 9 processes/threads on an 8-core CPU limits this overhead to at most 12.5% in infrequent cases, while removing the frequent cases where some cores would sit idle doing nothing).

If you have only a dual-core processor, the benefit of using one more thread is less obvious; on a single CPU you gain nothing, and instead you reduce performance if you try to use 2 competing threads.

My bet, then, is that using (nbcores+1) threads is the best default strategy when nbcores > 2, and just (nbcores) threads otherwise.

But you may want to provide a way to profile your usage, experiment with what works best for your application, and then expose an easily tunable parameter so it can be set according to your latest profiling on the target platform (just like the settings used when compiling Gentoo for some platforms, notably on virtualized OSes or for on-demand deployment).
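In Go that tunable can be as small as a command-line flag defaulting to runtime.NumCPU()+1. The flag name and default below are just an illustration of the idea, not a recommendation:

package main

import (
    "flag"
    "fmt"
    "runtime"
)

var workers = flag.Int("workers", runtime.NumCPU()+1,
    "number of concurrent workers; tune by profiling on the target platform")

func main() {
    flag.Parse()
    fmt.Println("splitting work across", *workers, "workers")
    // ... divide the input into *workers chunks and process them, as in the question ...
}

Each deployment (bare metal, VM, container with a CPU quota) can then carry its own profiled value, e.g. `./count -workers=12`.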

There's no absolute answer about how many threads you should use, as this depends entirely on what your threads are doing and whether they are severely bound by disk I/O, network I/O, or other input events controlled by the user. User input generally involves a lot of idle time, even in games with a very active user moving their mouse and clicking constantly: typical user input events are slow, at most around 10 milliseconds, while other I/O is now much faster to react, notably disk I/O and network I/O; external memory bounds are faster still, measured in microseconds and comparable to the time the OS needs to schedule threads; and cache bounds are faster again, with idle times measured in nanoseconds.

Glorfindel
  • 3,137
  • 6
  • 25
  • 33
  • 5
    +1 for profiling. When you plot throughput vs workers, there's usually a linear part of the curve, and then a part where it starts to flatten out or even decreases. Just before that knee where it flattens out is often where you want to be. And certainly not where throughput decreases. Without profiling, you don't know if the added workers are helping, doing no good, or actually hurting. – Wayne Conrad Sep 01 '20 at 17:36
5

It depends, mainly on your workload and your scheduler concept. Speaking specifically about Go, it is not just common but absolutely the right decision to spawn many more goroutines than you can physically run in parallel if you're doing I/O. CPU sharing only degrades once the number of competing threads (or whatever you call them) becomes orders of magnitude higher than the number of working CPUs.

Note that there are somewhat different scheduler implementations which perform much, much, MUCH better than that: Erlang, with its glorious ability to spawn thousands, tens of thousands and even hundreds of thousands of processes, is a nice example.
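A quick sketch of why that is cheap in Go: goroutines are multiplexed by the runtime onto GOMAXPROCS OS threads (GOMAXPROCS defaults to the number of cores), so a sleeping goroutine costs a few kilobytes of stack, not a core. The 100,000 goroutines and the 100 ms sleep below are arbitrary stand-ins for I/O waits.

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func main() {
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // passing 0 just reads the current value

    var wg sync.WaitGroup
    for i := 0; i < 100000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            time.Sleep(100 * time.Millisecond) // stand-in for a network or disk wait
        }()
    }
    wg.Wait()
}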

Zazaeil
  • 345
  • 1
  • 8
  • 1
    s/span/spawn/ . – Carsten S Sep 01 '20 at 12:15
  • 1
    @CarstenS how does that relate to Go? – Zazaeil Sep 01 '20 at 13:02
  • 2
    sorry for not being clear enough. You wrote “span” twice where I suppose you meant “spawn”. – Carsten S Sep 01 '20 at 13:44
  • @CarstenS oh thanks, I stand corrected. Initially I thought that was some esoteric reference to something on Linux, like `dev/null`. lol. – Zazaeil Sep 01 '20 at 15:00
  • 7
    @SerejaBogolubov It kind of is a reference to something on Linux. `s/regular expression/replacement/` is a `sed` and `vi` command that replaces things that match the regular expression part with the replacement part. It's usually called the substitute command. Some Internet commenters have adopted the syntax to point out typos or suggest replacement words. – 8bittree Sep 01 '20 at 15:31
  • 4
    @SerejaBogolubov s/Linux/Unix – Jens Sep 03 '20 at 21:09
2

You ask for “any reason”. One reason would be that I don’t want to bother counting the number of available cores or virtual cores. And the number of available cores isn’t a good hint either, in case other running apps use the CPU as well.

In other words: It is very very difficult to determine the optimal number of threads, so why bother?

gnasher729
  • 42,090
  • 4
  • 59
  • 119
  • 2
    That's why *work stealing schedulers* are popular (e.g. Scala's parallel collections as well as both popular actor frameworks for Scala use it, Clojure's concurrency engine use it, …) You spawn as many threads as you have parallel cores, divide your task by the number of threads, but whenever a thread is waiting, it will throw the task back into the queue and pick the next one, and if there isn't a next one, it will "steal" one from a different thread. So, by separating out the notion of "concurrent things to do" (task) from "parallel executor" (thread), you approach a fairly good sweet spot. – Jörg W Mittag Sep 02 '20 at 05:34
  • If a thread is starved because there are other apps using the CPU, it will simply have its tasks stolen, so having "too many" threads is not an issue. And you know that you will at least need as many threads as you have parallel execution engines (cores, hyper threads, whatever you want to call them), so finding a "near optimal" number of threads is not an issue. – Jörg W Mittag Sep 02 '20 at 05:36
  • The requirement is, of course, that the tasks are "large" compared to the context switch time. – Jörg W Mittag Sep 02 '20 at 05:37
2

Others have added great answers already, but I'd like to pitch in one more approach.

Start by figuring out what your bottleneck is. That's done by profiling or just using common sense. Then optimize accordingly.

  • If it's I/O (file, network, database, etc.) then a single thread might be all you need, since it will spend most of its time sleeping and waiting for the next data anyway. Add some asynchronicity (note: not multithreading) so that the I/O operation can happen in the background while you do your CPU stuff (see the sketch after this list).
  • If it's CPU, then make as many threads as there are cores. More threads will just slow things down with context switches.
  • Often overlooked, your bottleneck could also be RAM. It's awfully slow compared to the CPU, and most modern CPUs spend much of their time just waiting for data to arrive from RAM. That's why CPU caches and hyperthreading were invented. And I think it would also be the case in the example given here. I don't know Go, but I assume that a string always resides in RAM and doesn't employ any I/O behind the scenes. I'll also assume that the computer has enough RAM and doesn't need to swap data out to disk, and finally that the string in question is much larger than the CPU cache, otherwise all the optimisation is irrelevant. In that case, since you're mostly waiting for RAM, you might see some speedup from multiple threads since they could read data from multiple RAM chips at once, but you'll have to be careful about your MAGIC_NUMBER. Pick a wrong one and you'll clash on the cache lines or the memory chips and essentially serialize everything. After you manage to saturate your memory bus and/or memory chips, you'll hit a ceiling. This number is also VERY specific to the particular combination of hardware, so finding it out might be difficult. Perhaps some sort of algorithm that tries to adjust it automatically on the fly?
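Here is a minimal sketch of that first bullet, assuming the text comes from a hypothetical input.txt: one goroutine reads the next chunk from disk while the main goroutine counts newlines in the chunk it already has, so the I/O wait overlaps the CPU work.

package main

import (
    "bytes"
    "fmt"
    "os"
)

func main() {
    f, err := os.Open("input.txt") // placeholder path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    chunks := make(chan []byte, 4) // small buffer lets the reader run ahead of the counter
    go func() {
        defer close(chunks)
        for {
            buf := make([]byte, 1<<20) // 1 MiB per chunk, an arbitrary size
            n, err := f.Read(buf)
            if n > 0 {
                chunks <- buf[:n]
            }
            if err != nil { // EOF or a real error: stop reading
                return
            }
        }
    }()

    count := 0
    for chunk := range chunks {
        count += bytes.Count(chunk, []byte{'\n'}) // CPU work overlaps the next read
    }
    fmt.Println(count)
}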
Vilx-
  • 5,320
  • 4
  • 20
  • 24
1

You may want to take a look at how Linux load averages are calculated. Essentially, only processes ready to run are counted when evaluating the system load; processes waiting for user input or other data are not counted, which means you can have many more such processes than CPU cores. The whole trick is what to count as load. A prime example is swap: on a system running out of RAM, some processes will be waiting for their RAM pages to be loaded. This typically puts little strain on the CPU; however, spawning even more processes in this situation will only lead to more swapping without increasing system throughput.

In short:

  • Spawning fewer processes than there are CPU cores is guaranteed to keep CPU utilisation under 100%. Therefore, limiting the number of processes to the number of CPU cores is a good first-order approximation.
  • Spawning more processes than CPU cores might increase throughput if not all processes are CPU-bound. So, spawning new processes until CPU utilisation reaches 100% would be a second-order approximation. The problem is that on some systems it never will, so there should at least be a cap on the number of processes. Common cap values are N+1 or 2N for N CPU cores.
  • Finally, there are more complex metrics of system load, like Linux load averages. They work well most of the time and allow many more processes than CPU cores, while still keeping the system responsive. A rough sketch of this approach follows this list.
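A rough Go sketch of the load-average idea, assuming Linux (it reads /proc/loadavg) and a made-up 0.75 idle threshold; it picks a worker count once at startup rather than adapting continuously:

package main

import (
    "fmt"
    "os"
    "runtime"
    "strconv"
    "strings"
)

// loadAvg1 returns the 1-minute load average from /proc/loadavg (Linux only).
func loadAvg1() (float64, error) {
    data, err := os.ReadFile("/proc/loadavg")
    if err != nil {
        return 0, err
    }
    return strconv.ParseFloat(strings.Fields(string(data))[0], 64)
}

func main() {
    cores := runtime.NumCPU()
    workers := cores // first-order approximation: one worker per core

    if load, err := loadAvg1(); err == nil && load/float64(cores) < 0.75 {
        workers = 2 * cores // machine is mostly idle: allow up to the 2N cap
    }
    fmt.Printf("%d cores -> using %d workers\n", cores, workers)
}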
Dmitry Grigoryev
  • 494
  • 2
  • 13
0

For a simple task like counting newlines, it's going to be quite difficult to do better than a simple single-threaded count; your bottleneck here is going to be reading the string from disk or network, which is a serial operation anyway, and a single thread is already going to be significantly faster than the related I/O. For the more general case, I'd suggest reading up on the map-reduce programming model.

As Philip Kendall's answer suggests though, an I/O-bound task is where you'd benefit from running more threads than you have cores; if you have a CPU-bound task, you're unlikely to benefit much from splitting the job into more pieces than you have worker cores.
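For the record, the map-reduce shape applied to the question's own example looks roughly like the sketch below; the 64 KiB chunk size is arbitrary. "Map" each chunk to a partial count, then "reduce" by summing the partials.

package main

import (
    "fmt"
    "strings"
    "sync"
)

func countChunk(chunk string) int { // the "map" step
    return strings.Count(chunk, "\n")
}

func lineCount(s string) int {
    const chunkSize = 64 * 1024

    partials := make(chan int)
    var wg sync.WaitGroup
    for start := 0; start < len(s); start += chunkSize {
        end := start + chunkSize
        if end > len(s) {
            end = len(s)
        }
        wg.Add(1)
        go func(chunk string) {
            defer wg.Done()
            partials <- countChunk(chunk)
        }(s[start:end])
    }
    go func() { // close the channel once every mapper has finished
        wg.Wait()
        close(partials)
    }()

    total := 0 // the "reduce" step
    for p := range partials {
        total += p
    }
    return total
}

func main() {
    fmt.Println(lineCount("one\ntwo\nthree\n"))
}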

Lie Ryan
  • 12,291
  • 1
  • 30
  • 41
  • 3
    Depends on your disk. High end PCIe/NVME storage may have up to four PCI lanes per disk. So in theory reading a single file can be sped up by using up to 4 threads. The parallelism goes even higher with RAID arrays or sharded filesystems – slebetman Sep 01 '20 at 06:33
  • 3
    @slebetman: The number of PCIe lanes is totally irrelevant; that's just bandwidth. The relevant measure is the number of [_command queues_](https://en.wikipedia.org/wiki/Native_Command_Queuing). In SATA, that number is 1. All IO's from all threads end up mixed in that one queue, and that includes I/O from the OS too. NVMe allows up to 65536 queues; clearly a quantum jump. Real SSD's today have not nearly that many queues, 8 would not be unusual. – MSalters Sep 01 '20 at 14:27
  • @MSalters But command queues does not affect how parallel the actual data transfer can be. You can have more than 4 queues fetching data in parallel from multiple flash chips but all that data would still be serialized into a maximum of 4 lanes to be transferred in parallel. The completion of transfer of the PCI lanes are not necessarily synchronous which allows you to use your CPU time to process the current chunk of completed transfer while other lanes are completing. – slebetman Sep 01 '20 at 15:08
  • Neither the number of PCIe lanes nor command queues are relevant. If you read from multiple queues or PCIe lanes, you're dividing bandwidth among multiple readers, but you're not actually going to be reading any faster, and in most cases a single-threaded newline counter would be able to keep up with the full read bandwidth. If you have a really fast, high-end SSD array, it might be possible to read faster than a single thread can handle the line processing, but machines with the type of SSD that can saturate a CPU thread for such a simple workload are still fairly uncommon. – Lie Ryan Sep 01 '20 at 15:08
  • In either case, programs don't really read from the SSD directly; the OS will parallelize the read across multiple PCIe lanes or queues if it has nothing else it needs to read. So these are irrelevant anyway. – Lie Ryan Sep 01 '20 at 15:12
0

Yes. Example: NVidia recommends approximately 3x the number of ALUs, since context switching is lightning fast but memory is extremely slow by comparison. In particular, you could consider GPU memory access as I/O. As others have said, in general you want to "just" use all your resources as they become available, and the distribution of consumers then depends on both the hardware configuration and the nature of the problem being solved. The balance is usually mediated by an OS, and its inner workings have a cost as well that must be taken into account. For example, for some applications RT versions of Linux are needed because the standard pre-emption machinery is not suitable for RT applications.

Yttrill
  • 727
  • 5
  • 10