What lessons did you learn from a project which nearly/actually failed due to bad multithreading?

Question

Sometimes, the framework imposes a certain threading model that makes things an order of magnitude more difficult to get right.

As for me, I have yet to recover from the last failure and I feel that it is better for me not to work on anything that has to do with multithreading in that framework.

I found that I was good at multithreading problems which have simple fork/join, and where data only travels in one direction (while signals can travel in a circular direction).

I am unable to handle GUI in which some work can only be done on a strictly-serialized thread (the "main thread") and other work can only be done on any thread but the main thread (the "worker threads"), and where data and messages have to travel in all directions between N components (a fully connected graph).

By the time I left that project for another one, there were deadlock issues everywhere. I heard that 2-3 months later, several other developers managed to fix all of the deadlock issues, to the point that it could be shipped to customers. I never found out what piece of knowledge I was missing.

Something about the project: the number of message IDs (integer values which describe the meaning of an event which can be sent into the message queue of another object, regardless of threading) runs into several thousand. Unique strings (user messages) also number about a thousand.

Added

The best analogy I got from another team (unrelated to my past or present projects) was to "put the data in a database". ("Database" here refers to centralization and atomic updates.) In a GUI that is fragmented into multiple views all running on the same "main thread", with all the non-GUI heavy lifting done in individual worker threads, the application's data should be stored in a single place which acts like a database, and that "database" should handle all the "atomic updates" involving non-trivial data dependencies. All other parts of the GUI just handle screen drawing and nothing else. The UI parts could cache data, and if it's designed properly the user won't notice when the cache is stale by a fraction of a second. This "database" is also known as "the document" in Document-View architecture. Unfortunately, no: my app actually stores all data in the Views. I don't know why it was like that.
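To make the "database" idea concrete, here is a minimal sketch in Python (my choice of language; the question doesn't name one). The names `Document`, `Snapshot` and `run_workers` are made up for illustration. The point is only that dependent fields change together under one lock, and views read immutable snapshots that may be slightly stale but are never half-updated.

```python
import threading
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Snapshot:
    """Immutable copy of the application's data that a view may cache."""
    items: tuple = ()
    total: int = 0


class Document:
    """Hypothetical central store: the only place that mutates app data."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = Snapshot()

    def add_item(self, value):
        # Both dependent fields change under one lock, so no reader can
        # observe `items` updated without `total`.
        with self._lock:
            old = self._state
            self._state = replace(
                old, items=old.items + (value,), total=old.total + value
            )

    def snapshot(self):
        # Views get an immutable snapshot: possibly stale, always consistent.
        with self._lock:
            return self._state


def run_workers(doc, n_workers=4, per_worker=100):
    """Worker threads push results into the document concurrently."""
    def work():
        for _ in range(per_worker):
            doc.add_item(1)

    threads = [threading.Thread(target=work) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return doc.snapshot()
```

A view would call `doc.snapshot()` when it redraws and never touch the document's internals directly.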

Fellow contributors:

(contributors don't need to use real/personal examples. Lessons from anecdotal examples, if it is judged by yourself to be credible, are also welcome.)

I think being able to 'think in threads' is somewhat of a talent and less something that can be learned, for lack of better wording. I know a lot of developers who have been working with parallel systems for a very long time, but they choke up if the data has to go in more than one direction. — dauphic, May 09 '11 at 21:23
Not reading [What Every Dev Must Know About Multithreaded Apps](http://msdn.microsoft.com/en-us/magazine/cc163744.aspx) — jgauffin, May 09 '11 at 10:49

score 13 · Answer 1 · answered May 09 '11 at 08:45


My favorite lesson – very hard won! – is that in a multithreaded program the scheduler is a sneaky swine that hates you. If things can go wrong, they will, but in an unexpected fashion. Get anything wrong, and you'll be chasing weird heisenbugs (because any instrumentation you add will change the timings and give you a different run pattern).

The only sane way to fix this is to strictly corral all the thread handling into as small a piece of code as possible, one that gets it all right and is very conservative about ensuring that locks are properly held (and with a globally constant order of acquisition too). The easiest way to do that is to not share memory (or other resources) between threads except for messaging which must be asynchronous; that lets you write everything else in a style that is thread-oblivious. (Bonus: scaling out to multiple machines in a cluster is much easier.)
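A rough sketch of the "share nothing, message asynchronously" style, using Python's standard `queue` and `threading` modules. The `worker` and `run_demo` functions and the `_STOP` sentinel are illustrative names, not anything from the project discussed. The worker's state is private to its thread and everything else goes through queues.

```python
import queue
import threading

_STOP = object()  # sentinel telling the worker to shut down


def worker(inbox, outbox):
    """Owns its own state; talks to the world only through the two queues."""
    running_total = 0  # private to this thread, never shared
    while True:
        msg = inbox.get()
        if msg is _STOP:
            break
        running_total += msg
        outbox.put(running_total)  # reply is queued, nobody waits on it here


def run_demo(values):
    inbox, outbox = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(inbox, outbox))
    t.start()
    for v in values:
        inbox.put(v)  # fire and forget: the sender does not wait for a reply
    inbox.put(_STOP)
    t.join()
    # The worker has exited, so draining the outbox here is race-free.
    replies = []
    while not outbox.empty():
        replies.append(outbox.get())
    return replies
```

Because the only shared objects are the queues, the code inside `worker` can be written as if threads did not exist.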

answered May 09 '11 at 08:45

Donal Fellows


+1 for "to not share memory (or other resources) between threads except for messaging which must be asynchronous;" – Nemanja Trifunovic May 09 '11 at 15:23

The *only* way? What about immutable data types? – Aaronaught May 09 '11 at 22:32
`is that in a multithreaded program the scheduler is a sneaky swine that hates you.` - no it doesn't, it does exactly what you told it to do :) – mattnz May 10 '11 at 01:19
@Aaronaught: Global values passed by reference, even if immutable, still require global GC and that reintroduces a whole bunch of global resources. Being able to use per-thread memory management is nice, since it lets you get rid of a whole bunch of global locks. – Donal Fellows May 10 '11 at 08:27
It's not that you can't pass values of non-basic types by reference, but that it requires higher levels of locking (e.g., the “owner” holding a reference until some message comes back, which it's easy to mess up in maintenance) or complex code in the messaging engine to transfer ownership. Or you marshal everything and unmarshal in the other thread, which is much slower (you have to do that when going to a cluster anyway). Cutting to the chase and not sharing memory at all is easier. – Donal Fellows May 10 '11 at 08:33
I'm very curious about your opinion of how well Go handles this? And whether you've used Go and found it to modify your advice above? – Wildcard Dec 15 '18 at 00:02

score 6 · Answer 2 · answered May 09 '11 at 05:28

Here's a few basic lessons I can think of right now (not from projects failing but from real issues seen on real projects):

- Try to avoid any blocking calls while holding a shared resource. A common deadlock pattern: a thread grabs a mutex, makes a callback, and the callback blocks on the same mutex.
- Protect access to any shared data structures with a mutex/critical section (or use lock-free ones, but don't invent your own!).
- Don't assume atomicity; use atomic APIs (e.g. InterlockedIncrement).
- RTFM regarding the thread safety of libraries, objects or APIs you're using.
- Take advantage of the synchronization primitives available, e.g. events, semaphores. (But pay close attention when using them that you know you're in a good state. I've seen many examples of events signalled in the wrong state such that events or data can get lost.)
- Assume threads can execute concurrently and/or in any order, and that context may switch between threads at any time (unless under an OS that makes other guarantees).
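To illustrate the "don't assume atomicity" point: `InterlockedIncrement` is a Win32 call, and Python's standard library has no direct equivalent, so this sketch swaps in a plain lock around the read-modify-write. `Counter` and `hammer` are hypothetical names.

```python
import threading


class Counter:
    """Shared counter whose increment is made indivisible with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        # `self._value += 1` is a read, an add and a write, not one atomic
        # step; the lock keeps another thread from interleaving with it.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def hammer(counter, n_threads=8, n_increments=10_000):
    """Increment the counter from several threads at once."""
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Without the lock, the final count may come up short on some runs and not others, which is exactly the kind of bug that disappears when you add logging.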

score 6 · Answer 3 · answered May 09 '11 at 08:14


Your entire GUI project should only be called from the main thread. Basically, you shouldn't put a single (.net) "invoke" in your GUI. Multithreading should be stuck in separate projects that handle the slower data-access.

We inherited a part where the GUI project is using a dozen threads. It's giving nothing but problems: deadlocks, race conditions, cross-thread GUI calls...
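One way to enforce a "main thread only" rule is to make violations fail loudly instead of occasionally. A small sketch in Python, assuming a hypothetical `main_thread_only` decorator and a stand-in `update_label` function (neither comes from .NET or any real GUI toolkit):

```python
import functools
import threading


def main_thread_only(func):
    """Hypothetical guard: fail fast if GUI code runs off the main thread."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if threading.current_thread() is not threading.main_thread():
            raise RuntimeError(
                f"{func.__name__} must be called from the main thread"
            )
        return func(*args, **kwargs)
    return wrapper


@main_thread_only
def update_label(text):
    # Stand-in for real GUI code.
    return f"label set to {text!r}"


def call_from_worker():
    """Call the GUI function from a worker thread; return the error text."""
    result = {"error": None}

    def work():
        try:
            update_label("oops")
        except RuntimeError as exc:
            result["error"] = str(exc)

    t = threading.Thread(target=work)
    t.start()
    t.join()
    return result["error"]
```

A guard like this doesn't fix the design, but it turns a rare cross-thread glitch into an immediate, reproducible exception.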

answered May 09 '11 at 08:14

Carra


Does "project" mean "assembly"? I don't see how the distribution of classes among assemblies would cause threading problems. – nikie May 09 '11 at 09:47
In my project it's indeed an assembly. But the main point is that all code in those folders has to be called from the main thread, no exceptions. – Carra May 09 '11 at 11:27
I don't think this rule is generally applicable. Yes, you should never call GUI code from another thread. But how you distribute classes to folders/projects/assemblies is an independent decision. – nikie May 09 '11 at 13:50

score 1 · Answer 4 · answered May 09 '11 at 06:09


Java 5 and later has Executors which are intended to make life easier for handling multi-threading fork-join style programs.

Use those, it will remove a lot of the pain.

(and, yes, this I learned from a project :) )

answered May 09 '11 at 06:09


To apply this answer to other languages - use high quality parallel processing frameworks provided by that language whenever possible. (However, only time will tell whether a framework is really great and highly usable.) – rwong May 09 '11 at 11:00
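As an example of the same idea in another language, Python's standard `concurrent.futures` module provides an executor comparable in spirit to Java's. A minimal fork/join sketch (`slow_square` and `fork_join` are placeholder names):

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(n):
    # Stand-in for real work such as I/O or a long computation.
    return n * n


def fork_join(values, max_workers=4):
    # Fork: the executor spreads the calls over a pool of threads.
    # Join: leaving the `with` block waits for all of them to finish.
    # map() yields results in input order, whatever order they complete in.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(slow_square, values))
```

No thread is created, started or joined by hand, which removes most of the places where simple fork/join code goes wrong.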

score 1 · Answer 5 · edited May 09 '11 at 21:02


I have a background in hard realtime embedded systems. You can't test for the absence of problems caused by multithreading. (You can sometimes confirm their presence.) Code has to be provably correct, so follow best practice around any and all thread interaction.

#1 rule: KISS. If it does not need a thread, don't spin one. Serialise as much as possible.
#2 rule: Don't break #1.
#3 rule: If you cannot prove through review that it's correct, it's not.

edited May 09 '11 at 21:02

ChrisF


answered May 09 '11 at 07:55

mattnz


+1 for rule 1. I was working on a project that initially was going to block until another thread was complete - essentially a method call! Fortunately, we decided against that approach. – Michael K May 09 '11 at 18:54
#3 FTW. Better to spend hours struggling with lock timing diagrams or whatever you use to prove that it's good than months wondering why it sometimes falls apart. – May 09 '11 at 22:31

score 1 · Answer 6 · answered May 09 '11 at 21:55

An analogy from a class on multithreading I took last year was very helpful. Thread synchronization is like a traffic signal protecting an intersection (data) from being used by two cars (threads) at once. The mistake a lot of developers make is turning lights red across most of the city to let one car through because they think it's too hard or dangerous to figure out the exact signal they need. That might work well when traffic is light, but will lead to gridlock as your application grows.

That's something I already knew in theory, but after that class the analogy really stuck with me, and I was amazed how often after that I would investigate a threading issue and find one giant queue, or interrupts being disabled everywhere during a write to a variable only two threads used, or mutexes being held a long time when it could be refactored to avoid it altogether.

In other words, some of the worst threading issues are caused by overkill trying to avoid threading issues.
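In code terms, the "one small traffic light" version usually means doing the expensive work with no lock held and taking the lock only for the few lines that touch shared data. A sketch in Python, with made-up names (`Stats`, `expensive_parse`, `record_all`):

```python
import threading


def expensive_parse(raw):
    # Stand-in for slow work that touches no shared data.
    return int(raw.strip())


class Stats:
    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0
        self.total = 0

    def record(self, raw):
        # The slow part runs with no lock held.
        value = expensive_parse(raw)
        # The lock guards only the two shared fields, and only briefly.
        with self._lock:
            self.count += 1
            self.total += value


def record_all(stats, raw_values, n_threads=4):
    """Split the inputs across threads and record them concurrently."""
    def work(chunk):
        for raw in chunk:
            stats.record(raw)

    chunks = [raw_values[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=work, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The "red lights across the city" version would wrap the whole body of `record` in the lock, so every thread would queue up behind the parsing as well.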

score 0 · Answer 7 · answered May 09 '11 at 03:44


Try doing it again.

At least for me, what created a difference was practice. After doing multi threaded and distributed work quite a few times you just get the hang of it.

I think debugging is really what makes it difficult. I can debug multi threaded code using VS but I'm really at a complete loss if I have to use gdb. My fault, probably.

Another thing worth learning more about is lock-free data structures.

I think this question could be improved a lot if you specified the framework. .NET thread pools and background workers are really different from QThread, for example. There are always a few platform-specific gotchas.

answered May 09 '11 at 03:44

Vitor Py


I'm interested in hearing stories from any frameworks, because I believe there are things to learn from each framework, especially ones which I haven't been exposed to. – rwong May 09 '11 at 03:49

debuggers are largely useless in a multi-thread environment. – Pemdas May 09 '11 at 04:50
I already have multi-threaded execution tracers which tell me what the problem is, but won't help me solve it. The crux of my problem is that "according to current design, I can't pass message X to object Y in this manner (sequence); it has to be added to a giant queue and it will *eventually* be processed; but because of this, there is no way for messages to appear to the user at the right time - it will always happen anachronistically and make the user *very, very* confused. You may even need to add progress bars, cancel buttons or error messages to *places which shouldn't have those*." – rwong May 09 '11 at 11:07

score 0 · Answer 8 · answered May 09 '11 at 04:31


I've learned that callbacks from lower-level modules to higher-level modules are a huge evil because they cause locks to be acquired in the opposite order.
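For readers who haven't hit this: two threads taking the same two locks in opposite orders can each end up holding one and waiting forever for the other. One common remedy (not necessarily what this project did) is a fixed global acquisition order. A sketch in Python, where `Account`, `transfer` and the `rank` field are purely illustrative:

```python
import itertools
import threading

_rank_source = itertools.count()  # hypothetical source of a global lock order


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()
        self.rank = next(_rank_source)  # fixed at creation, never changes


def transfer(src, dst, amount):
    if src is dst:
        return
    # Always lock the lower-ranked account first, whichever way the money
    # flows, so two opposing transfers cannot each hold one lock.
    first, second = sorted((src, dst), key=lambda acct: acct.rank)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


def run_opposing_transfers(a, b, rounds=1000):
    """Move money a->b and b->a concurrently; should finish and conserve."""
    def forward():
        for _ in range(rounds):
            transfer(a, b, 1)

    def backward():
        for _ in range(rounds):
            transfer(b, a, 1)

    threads = [threading.Thread(target=forward),
               threading.Thread(target=backward)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

If `transfer` locked `src` then `dst` instead, the two threads in `run_opposing_transfers` could deadlock, though not on every run.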

answered May 09 '11 at 04:31

Sergej Zagursky


callbacks are not evil...the fact they do anything other than thread break is probably the root of the evil. I would be highly suspect of any callback that didn't just send a token to message queue. – Pemdas May 09 '11 at 04:48
Solving an optimization problem (like minimizing f(x)) is often implemented by providing the pointer to a function f(x) to the optimization procedure, which "calls it back" while looking for the minimum. How would you do it without a callback? – quant_dev May 09 '11 at 05:42

No downvote, but callbacks aren't evil. Calling a callback *while holding a lock* is evil. Don't call anything inside a lock when you don't know if it might lock or wait. That not only includes callbacks but also virtual functions, API functions, functions in other modules ("higher level" or "lower level"). – nikie May 09 '11 at 07:05
@nikie: If a lock *must* be held during the callback, either the rest of the API needs to be designed to be reentrant (hard!) or the fact that you're holding a lock needs to be a documented part of the API (unfortunate, but sometimes all you can do). – Donal Fellows May 09 '11 at 08:36
@Donal Fellows: If a lock must be held during a callback, I'd say you have a design flaw. If there's really no other way, then yes, by all means document it! Just like you would document if the callback will be called in a background thread. That's part of the interface. – nikie May 09 '11 at 13:55
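The "don't call anything while holding a lock" advice from this thread can be sketched as follows, in Python with a hypothetical `EventSource` class. State is updated and the listener list is copied under the lock; the callbacks run only after the lock is released.

```python
import threading


class EventSource:
    def __init__(self):
        self._lock = threading.Lock()  # deliberately non-reentrant
        self._listeners = []
        self._last_event = None

    def subscribe(self, callback):
        with self._lock:
            self._listeners.append(callback)

    def last_event(self):
        with self._lock:
            return self._last_event

    def publish(self, event):
        # Update state and copy the listener list while holding the lock...
        with self._lock:
            self._last_event = event
            listeners = list(self._listeners)
        # ...then call out with the lock released, so a callback may safely
        # call back into this object (or block) without deadlocking us.
        for callback in listeners:
            callback(event)
```

In this sketch a listener that calls `last_event()` works fine. If `publish` invoked callbacks inside the `with` block, the same listener would block forever on the non-reentrant lock.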
