Understanding the Single-Writer Principle

Question

The article on the Single-Writer Principle of the Mechanical Sympathy blog explains how bad queues are (performance-wise), because they need to be able to receive messages from multiple produers, and how instead we should be using only single Consumer-Producer pairs in our systems, like their "Disruptor" does.

But I fail to understand how to exactly implement that.

For example: Assume some service (e.g. Facebook or Twitter) that tracks data objects (e.g. people or messages). Now assume you need to insert a new data object that somehow affects other data objects (e.g. a new user signs up and other users need to be asked if he's a friend of theirs, or a new message is published and subscribers need to be notified of it).

How does one implement that without a queue of some sort, considering that new data objects are coming in from all directions across all sorts of clients (i.e. producers). You can't exactly run the signup service using using just one thread on one server, and expect them to keep retrying till the signup succeeds, right?

One user on the comments of the mentioned article asks exactly that, and the response is to have the producers just publish their results, and then one additional process that collects them from those individual producers, aggregates them, and then republishes them, so that they are now being published by only "one" producer.

Isn't that just a queue in disguise too? Walking all those producers is going to take time and effort too, right? Why would this implementation be preferable to having the producers synchronize on trying to write into some proper queue in the first place?

Have you read [this](https://lmax-exchange.github.io/disruptor/) and [this](http://martinfowler.com/articles/lmax.html)? — Robert Harvey, Feb 05 '16 at 16:19
Disruptor is about one-to-one direct queues (i-th producer and j-th consumer), and are supposed to run on a small number of computers (nodes) on a fast and reliable network physically located together. Facebook or AWS are about distributed queues across geographics, which require an entire book to explain, but the [AWS SQS introduction](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/Welcome.html) gives a decent picture. It is not so much "performance-wise" but rather what you do when the system's size becomes bigger than could be fit on a single computer. — rwong, Feb 05 '16 at 17:22
*"... trying to write into some proper queue"* can you elaborate what you mean by "proper queue"? A queue on a single computer with low-overhead "lock-free" technique and accessed by CPUs all on cache-coherent interconnects? — rwong, Feb 05 '16 at 17:30
@rwong "proper" being whatever Erlang, Akka, and all those other concurrency solutions use. — BVN, Feb 05 '16 at 19:16

score 3 · Accepted Answer · answered Feb 05 '16 at 21:01

There are many components in your question.

That article is very intriguing, but please treat it as an eye-opener instead of being a piece of universal truth. The article blends too many things together (across many abstraction levels) to the point of absurdity that I'm not able to identify any central theme from it. (Apologies for this criticism.)

The beginning of that article goes like this:

Observation: modern multi-core CPUs and their caches use a cache coherency protocol that works like such and such.
- This part is indisputable. It is a fact that is made public by CPU manufacturers.
Thesis (from the article's author): If we write our data-passing code in a certain way, it should have higher efficiency than if we used OS locking or CAS.
- This part is a falsifiable statement. You can perform your own experiments to see whether the thesis is relevant to your use-cases, and whether the claim holds up when tested on a real computer (as opposed to arguing on paper).

The persuasion in the article is essentially that of decentralization, or the removal of a single point of bottleneck:

If your system makes use of point-to-point communication between actors, implementing that with channels dedicated to each producer-consumer pair would result in the least contention at the CPU level.

How that relates to the hardware observation?

You have to tell the compiler, JIT, or VM runtime to avoid reordering your memory read/write instructions.
You do not need to otherwise use any of the hardware memory synchronization primitives at all, because the cache-coherency protocol already takes care of this use case.
The table in the article claims that by avoiding the extra hardware memory synchronization primitives, write throughput can increase by at least 20 times.
- Caveat: every generation (family) of CPU is different. I wouldn't take these numbers for granted. If performance is very important to you, you would have performed your own benchmarks.

Drawbacks:

if you have M producers, and N consumers, and every producer has to talk to every consumer, then you need (M*N) as many dedicated communication channels - as opposed to just one, or any number in between.
Also, each datum must occupy at least one CPU cache-line. Otherwise the cache coherency argument isn't relevant anymore.
If the dedicated channel is a fixed-size ring buffer, it will have a fixed capacity, so you will have to choose between discarding data if the consumer didn't catch up, or the producer being told it couldn't put more data into it.

You raised a question about:

What about algorithms and system requirements that inherently require mutual exclusion?

Then the persuasion in the article doesn't apply.

On the other hand, there are many non-trivial distributed algorithms and distributed data structures that have less or no reliance on mutual exclusion, and these are what Facebook, Twitter, AWS, etc. would use. Needless to say these are miles beyond the comprehension of the average programmers. (I don't know much about those either.)

And as you asked that question, you made a leap between conceptual queues to actual queue implementations.

LMAX Disruptor is a particular queue implementation (a fixed capacity, CPU-cache-friendly ring buffer, whose sole producer doesn't need to issue specific hardware synchronization primitives and the consumers do not need to be synchronized with respect to other consumers)
The "queues" that are used by Facebook and Twitter are conceptual; in particular they are more like dataflows than queues, even when speaking abstractly. You cannot use LMAX Disruptor for such cases; instead you might have used Apache Storm etc.

A result aggregator is a valid and an important pattern in distributed computing (when there are many machines). I guess this is an indisputable fact.

The article's response to that comment question is that you can borrow that pattern back to the single-machine, too.

However, it is not clear (i.e. dubious) whether doing that on a single-machine brings any benefit on top of using multi-producer, multi-consumer CAS (obstruction-free) queues. It is possible that someone could have refuted this claim with experimental results.

One of my disappointment is that the author's response portrays their approach as being "an alternative to queues", as if it's for the sake of being different.

Ultimately, here is my lesson: I would dare not take one observation from one abstraction level, e.g. cache coherency, and stretch it to something vastly different, e.g. distributed computing.

score 1 · Answer 2 · edited Jun 16 '20 at 10:01

I may be wrong here, but I believe this essentially comes down to overheads.

As stated in your linked article:

For highly contended data it is very easy to get into a situation whereby the system spends significantly more time managing contention than doing real work.

Let's take an example...

Let's say we have an office of programmers, all of whom are having their annual performance appraisal (yay! dead eyes) with the boss. Now the boss wants a list of names for the order of the office's performance appraisals. There are two (well many, but let's just use two here) ways to do this:

The boss puts a sheet of paper and a pen up front. All the programmers in the office (Bob, Alice, Fred, and their friends), go up to the sheet of paper, fight over the pen, and write their name (each having to wait for the pen, and then only getting a line on the paper after whomever is in-front of them is done writing).

- OR -
Bob, Alice, etc all each write their name on a piece of paper, and then hand it to the person at the next desk. Fred (who sits at the end of the row) then hands these pieces (in order - that he received them) to the boss. (Since all forms of data that computer can work with can be stored in computer memory, extra pieces of paper are essentially free)

In both scenarios the boss gets a list of everyone's names, and that list is in an order that is (around about) the same (based on their desk location). However, in the first scenario, significant time was spent waiting (for the pen and paper to become free), as it had many producers and a single consumer. In the second scenario, each programmer is a producer and a consumer - essentially a ring topology (broken at the boss end though).

"how to exactly implement that"

Well, now there's a question!

Obviously it's hard to discuss without a concrete scenario (which would then make this more of an SO question). But basically, each writeable thing in your domain/solution/environment/system would only be writeable by one particular actor at a time. (Depending on your scale, that actor may be a thread, process, machine, cluster, data-center, etc.) Reads would occur from a cache (such as the caching on processors, in web-servers, etc).

For example, the way Google structures its index of the entire (shallow) web, is based upon a divide and conquer technique using clusters responsible for certain keywords. When the crawler finds a new web page with a keyword, only the relevant index cluster needs to be notified.

I'm not sure about this .. Who's the next person down the line in your average cube farm that extends across n dimensions? If you can figure that out, why can't you figure out how to have the people sign up in an orderly fashion? — BVN, Feb 05 '16 at 19:32

Understanding the Single-Writer Principle

2 Answers2

Let's take an example...

"how to exactly implement that"