How do I unit test multi-threaded code?

Question

Are there ways to unit test your multi-threaded code for race conditions and deadlocks?

To see if they are performing the way they should be...

Josh Kelley · Accepted Answer · 2015-09-23T16:53:10.083

19

CHESS, a project of Microsoft Research. Quoting their site:

CHESS is a tool for finding and reproducing Heisenbugs in concurrent programs. CHESS repeatedly runs a concurrent test ensuring that every run takes a different interleaving. If an interleaving results in an error, CHESS can reproduce the interleaving for improved debugging. CHESS is available for both managed and native programs.

Update (9/23/2015): For C, C++, and Go, you can use ThreadSanitizer.

edited Sep 23 '15 at 16:53

answered Oct 02 '10 at 14:05

Josh Kelley

10,991
7
38
50

CHESS is a great tool, but it doesn't seem to have been updated for a long time. – Andy Lowry Nov 25 '10 at 16:50
2

Does it need to be updated, or is it far enough to be very useful? – Klaim Jan 29 '11 at 11:23
7

Looks great, however the licencing forbids you from using it for any practical purpose so it's basically a toy. (From licence.txt) `SCOPE OF RIGHTS: You may use, copy, reproduce, and distribute this Software for any non-commercial academic purpose Some purposes which can be non-commercial academic are teaching, academic research, and personal experimentation. ... You may not use or distribute this Software or any derivative works in any form for commercial purposes. Examples of commercial purposes would be ... using the Software in the creation or use of commercial products.` – Andy Krouwel Jan 21 '16 at 09:41
Wow this is the only tool I have ever seen that seriously provides a smart way to test concurrency head-on (not the usual advice "don't do concurrency!"). Such a shame it never became productized. Thread interleaving combinatorics seems to be a smart approach to really testing code with synchronization primitives – Schneider Apr 24 '17 at 01:20

score 11 · Answer 2 · answered Nov 24 '10 at 17:04

Valgrind has Helgrind which really helps. Not only does it help point out races that could lead to starvation or deadlock, the slight slowdown of having the program profiled sometimes exposes races that might not be seen otherwise.

So, even if you go commando with some kind of lock free method, it still helps :)

It is POSIX centric, however. It ships with headers that easily make simple unit test libraries like TAP aware that it is running, which is also really helpful. For instance, you could have a thread that normally would not block when trying to acquire a lock go ahead and block (perhaps randomly), just to simulate starvation.

speaking of slow down, activating HT on a Pentium provides exactly the opposite thing. It "speeds" up code somehow, which may expose bugs :) — Matthieu M., Jan 29 '11 at 13:37

Roger C S Wernersson · Answer 3 · 2010-11-25T06:22:05.207

5

I don't remember the details exactly, but this is the general idea. And I've only done it once, but what I did was separate the re-entrant code from the code performing the task, using an interface to be able to mock the task class.

Then I designed my mock-up to be able to lock on a call so that I know the thread is in the critical section, and then call it again and verify that it is waiting, before releasing the first thread and finish cleanly.

Something like that.

I'm not sure that would work for more complex scenarios, but it helps preserve the behaviour during refactorings.

edited Nov 25 '10 at 06:22

answered Nov 24 '10 at 13:23

Roger C S Wernersson

756
5
12

I've done this too: separate the part that handles concurrency from the part that does stuff. I ended up with little state-changing objects that plugged into things that handled concurrency. – Frank Shearar Nov 24 '10 at 13:45

score 2 · Answer 4 · answered Nov 24 '10 at 14:32

At JAOO/GOTO this year I saw this presentation:

http://gotocon.com/aarhus-2010/presentation/Testing%20Asynchronous%20Behaviour%20in%20an%20Instant%20Messaging%20Server

The trick is that you model what your hairball application is supposed to DO, in terms of invocation steps as well as the actual operations on your application. John Hughes software then systematically tries many permutations of invocation steps repeatedly in parallel and checks afterwards that the state of the application corresponds with the state of the model. If an error is found, the software knows how to reduce the steps to the minimal case producing the error.

He demonstrated live how to catch several bugs in core Erlang libraries which had been lurking for 15 years and occasionally reported but nobody could figure out where they came from and hence how to fix. With the minimal cases reported by the software, the library maintainer was able to fix each bug within a day.

It was SO impressive.

John Hughes sells this software through his company.

The basic technology is called QuickCheck - http://en.wikipedia.org/wiki/QuickCheck — , Jan 29 '11 at 11:26

score 2 · Answer 5 · answered Nov 24 '10 at 16:03

Tests with non-reproducible results are useless. That rules out completely random tests, but leaves in tests generated from pseudo-random sequences.
Every actor in a concurrent environment has algorithmic or otherwise non concurrency components that can be tested by conventional means. Having tested them, any remaining failures must be lie in the concurrency logic.
The events in a concurrent system are always in fact a linear sequence of events. If enough precision is used to measure time, then there are no events happening "at the same time". That means that the actors in a concurrent system can be tested by generating events sequentially. Capturing the sequence of events around the time of failure of a concurrent system provides the required test cases.
The code that provides actors with liveliness (threads) is more-often-than-not provided by the operating system or by system libraries. It is safe to assume that said code needs not be tested. The code in charge of communication and synchronization is normally written by the applications programmer. That code can be tested without invoking the system code, that is, without launching any threads.
Boundary conditions in the algorithmic code (queue empty) often require handling in the synchronization code, and that is a good target for testing.
Defining proxies around system code (t.wait()) allows for the use of stubs/mocks of the functionality during testing.

random tests can be reproduced, just keep a log of the random values generated — Steven A. Lowe, Nov 24 '10 at 16:14
The problem with (3) is that many of those events are implicit, such as the CPU updating a cache line. And then there is the joy of a CPU or compiler that reorders instructions. — CodesInChaos, Jan 29 '16 at 16:21

score 1 · Answer 6 · answered Jan 29 '11 at 11:14

You may try my Relacy Race Detector. It's designed to carefully and precisely verify synchronization algorithms like producer-consumer queues and concurrent containers, but not very suitable for verification of whole programs. However, perhaps it's a good idea to spread synchronization and mutexes allover a program anyway, but instead concentrate synchronization in specialized components (that can be verified with Relacy).

score 0 · Answer 7 · answered Jan 29 '16 at 09:45

Not a strict unit test, but a runtime check that has helped me with some intermittently failing tests. It's quick & dirty, but it worked.

When a mutex is granted I keep a track of which thread has it. All mutex requests have a thirty-second timeout, after which they scream deadlock.

I can then use the list of granted mutexes to see which thread is holding the blocking mutex, and why for so long. In my cases so far it's because it was deadlocked on something else, so I can then fix that situation.

This worked for me because my mutexes have a cross-platform wrapper class making it easy to inject the record keeping and timeout. I also knew enough about the application to know it should never be blocking on a mutex for 30 seconds.

It might not be completely general purpose, but it saves a lot of debugging for about a couple of hour's programming effort. The overhead is negligible, and it can be debug-build only.

I'm going to look at extending it to record nested mutex request sequences, and see if any are potentially deadlock inducing (eg. one thread locks A then B and another locks B then A) rather than just actually deadlock inducing, but so far it's been a lot of benefit for trivial effort.

score 0 · Answer 8 · answered Oct 05 '10 at 08:09

0

It's not easy, but basically the only way is to call the multi-threaded code concurrently from multiple threads and change timing and ordering randomly by playing with random Thread.sleep() and Thread.yield() calls (assuming Java).

There are also ready tools available (such as TestNG) that do something like described above, but they're not very mature yet, as far as I know.

answered Oct 05 '10 at 08:09

Joonas Pulakka

23,534
9
64
93

3

As long as you have some way of saying "this combination of timings resulted in an error" because otherwise your tests fail without you knowing why, particularly. – Frank Shearar Oct 10 '10 at 12:59

How do I unit test multi-threaded code?

8 Answers8