There are many components in your question.
That article is very intriguing, but please treat it as an eye-opener instead of being a piece of universal truth. The article blends too many things together (across many abstraction levels) to the point of absurdity that I'm not able to identify any central theme from it. (Apologies for this criticism.)
The beginning of that article goes like this:
- Observation: modern multi-core CPUs and their caches use a cache coherency protocol that works like such and such.
- This part is indisputable. It is a fact that is made public by CPU manufacturers.
- Thesis (from the article's author): If we write our data-passing code in a certain way, it should have higher efficiency than if we used OS locking or CAS.
- This part is a falsifiable statement. You can perform your own experiments to see whether the thesis is relevant to your use-cases, and whether the claim holds up when tested on a real computer (as opposed to arguing on paper).
The persuasion in the article is essentially that of decentralization, or the removal of a single point of bottleneck:
- If your system makes use of point-to-point communication between actors, implementing that with channels dedicated to each producer-consumer pair would result in the least contention at the CPU level.
How that relates to the hardware observation?
- You have to tell the compiler, JIT, or VM runtime to avoid reordering your memory read/write instructions.
- You do not need to otherwise use any of the hardware memory synchronization primitives at all, because the cache-coherency protocol already takes care of this use case.
- The table in the article claims that by avoiding the extra hardware memory synchronization primitives, write throughput can increase by at least 20 times.
- Caveat: every generation (family) of CPU is different. I wouldn't take these numbers for granted. If performance is very important to you, you would have performed your own benchmarks.
Drawbacks:
- if you have M producers, and N consumers, and every producer has to talk to every consumer, then you need
(M*N)
as many dedicated communication channels - as opposed to just one, or any number in between.
- Also, each datum must occupy at least one CPU cache-line. Otherwise the cache coherency argument isn't relevant anymore.
- If the dedicated channel is a fixed-size ring buffer, it will have a fixed capacity, so you will have to choose between discarding data if the consumer didn't catch up, or the producer being told it couldn't put more data into it.
You raised a question about:
- What about algorithms and system requirements that inherently require mutual exclusion?
Then the persuasion in the article doesn't apply.
On the other hand, there are many non-trivial distributed algorithms and distributed data structures that have less or no reliance on mutual exclusion, and these are what Facebook, Twitter, AWS, etc. would use. Needless to say these are miles beyond the comprehension of the average programmers. (I don't know much about those either.)
And as you asked that question, you made a leap between conceptual queues to actual queue implementations.
- LMAX Disruptor is a particular queue implementation (a fixed capacity, CPU-cache-friendly ring buffer, whose sole producer doesn't need to issue specific hardware synchronization primitives and the consumers do not need to be synchronized with respect to other consumers)
- The "queues" that are used by Facebook and Twitter are conceptual; in particular they are more like dataflows than queues, even when speaking abstractly. You cannot use LMAX Disruptor for such cases; instead you might have used Apache Storm etc.
A result aggregator is a valid and an important pattern in distributed computing (when there are many machines). I guess this is an indisputable fact.
The article's response to that comment question is that you can borrow that pattern back to the single-machine, too.
However, it is not clear (i.e. dubious) whether doing that on a single-machine brings any benefit on top of using multi-producer, multi-consumer CAS (obstruction-free) queues. It is possible that someone could have refuted this claim with experimental results.
One of my disappointment is that the author's response portrays their approach as being "an alternative to queues", as if it's for the sake of being different.
Ultimately, here is my lesson: I would dare not take one observation from one abstraction level, e.g. cache coherency, and stretch it to something vastly different, e.g. distributed computing.