Should I take care of race conditions which almost certainly have no chance of occurring?

Question

Let's consider something like a GUI application where the main thread is updating the UI almost instantaneously, and some other thread is polling data over the network or something that is guaranteed to take 5-10 seconds to finish the job.

I've received many different answers to this. Some people say that if a race condition is a statistical impossibility, you shouldn't worry about it at all. Others have said that if there's even a 10^-53% chance (I kid you not on the numbers, this is what I've heard) of some voodoo magic happening due to a race condition, you should always obtain and release locks on the thread that needs them.

What are your thoughts? Is it good programming practice to handle race conditions in such statistically impossible situations? Or would it be totally unnecessary or even counterproductive to add more lines of code to hinder readability?

When people are stating chances like that, why doesn't anyone ask about the education of person stating that number? You need a formal education in statistics before you can back up with a number like that. — Pieter B, Aug 17 '12 at 08:08
And it would not bother you as a programmer? Maybe deep inside your thoughts? — Den, Aug 17 '12 at 08:09
As a physicist, p<1E-140 means p=0. Not going to happen in this universe. 0.00000000000000000000000000000000000000000000000000001% is a _lot_ bigger. — MSalters, Aug 17 '12 at 08:20
Make sure this race condition can't lead to someone _willingly_ crashing your app. This could be the cause of a security problem. — toasted_flakes, Aug 17 '12 at 08:26
"others have **said** that if there's even a 0.00000000000000000000000000000000000000000000000000001% (I kid you not on the numbers, this is what I've **heard**)" I hope they said that number in scientific notation! :) — George Duckett, Aug 17 '12 at 08:41
Readability wouldn't factor in my considerations of whether to use synchronisation or not. Synchronisation code follows well known patterns that do not hamper a developer's ability to read and understand the code. Performance would factor in my considerations. Locking and serializing is a good practice, but sometimes not worth it if it hampers performance. For example, in logs or some other reports on the app's execution, I really don't care much whether it has processed 1,000,000 or 1,000,020 lines at a certain point in time. — Marjan Venema, Aug 17 '12 at 09:18
"almost certainly has no chance of occuring?" means it happens in production at 3 AM and most likely be very expensive. — , Aug 17 '12 at 12:38
I'm surprised no one has mentioned [Murphy's Law](http://en.wikipedia.org/wiki/Murphy's_law). — Casey Kuball, Aug 17 '12 at 14:13
Can you re-write to prevent the race condition from happening? How does the race happen? — Paul, Aug 17 '12 at 15:05
@KazDragon: Catchy, but not true. "One in a million" means that it "happens 0.00001 times out of ten". It's an observation about the relative frequency of an event. If it happened "nine out of ten times", it wouldn't be a "one in a million chance", it would be a "nine in ten chance". — Joel Cornett, Aug 17 '12 at 19:54
@JoelCornett http://www.goodreads.com/quotes/95458-scientists-have-calculated-that-the-chances-of-something-so-patently — Gareth, Aug 18 '12 at 01:14
@Gareth: Hehe. I had heard that quote before, but was unsure where it had come from :) — Joel Cornett, Aug 18 '12 at 01:16
The probability of an error due to cosmic radiation is greater than that race condition. One should put more efforts to harden the computer ;-) http://stackoverflow.com/questions/2580933/cosmic-rays-what-is-the-probability-they-will-affect-a-program According to the above answer - 1.4 × 10^(-15) is the probability of errors being introduced by cosmic radiation ;-) — Lord Loh., Aug 18 '12 at 05:04
It really depends on what the chances are, and what the costs and benefits are. If the chance is smaller than the chance of the atoms in the computer rearranging themselves into a bowl of petunias, don't bother. (Actually, I assume the chances of a bit flip because of cosmic rays, or of an aircraft falling on it, or of lightning hitting it, are waaaay bigger than the chance you listed). Remember: unless you work for NASA, the client wants 99.9% working code **now**, instead of 100% working code 5 years later at ten times the cost. — vsz, Aug 18 '12 at 07:03
Before doing anything else, be sure that the percentage chance of the race condition happening is not one of the sixty-eight percent of statistics which are made up. — hippietrail, Aug 18 '12 at 16:18
"almost certainly has no chance of occurring?" -- famous last words. — David Božjak, Aug 24 '12 at 06:49

score 138 · Accepted Answer · answered Aug 17 '12 at 03:47

If it is truly a 1 in 10^55 event, there would be no need to code for it. That would imply that if you did the operation 1 million times a second, you'd get one bug every 3 * 10^41 years which is, roughly, 10^31 times the age of the universe. If your application has an error only once in every trillion trillion billion ages of the universe, that's probably reliable enough.
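The arithmetic in that paragraph can be checked in a few lines. This is only a sketch: the 365.25-day year and the figure of about 1.38 × 10^10 years for the age of the universe are assumptions not stated in the answer, and the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumed Julian year, about 3.16e7 s
AGE_OF_UNIVERSE_YEARS = 1.38e10         # assumed, roughly 13.8 billion years


def years_between_failures(probability: float, ops_per_second: float) -> float:
    """Expected years between failures for a given per-operation failure probability."""
    expected_seconds = 1.0 / (probability * ops_per_second)
    return expected_seconds / SECONDS_PER_YEAR


# 1-in-10^55 event, attempted one million times per second
years = years_between_failures(1e-55, 1e6)
print(f"{years:.1e} years, {years / AGE_OF_UNIVERSE_YEARS:.1e} ages of the universe")
```

The result lands on the order of 3 × 10^41 years, in line with the figure quoted above.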

However, I would wager very heavily that the error is nowhere near that unlikely. If you can conceive of the error, it is almost certain that it will occur at least occasionally thus making it worth coding correctly to begin with. Plus, if you code the threads correctly at the outset so that they obtain and release locks appropriately, the code is much more maintainable in the future. You don't have to worry when you're making a change that you have to re-analyze all the potential race conditions, re-compute their probabilities, and assure yourself that they won't recur.

Justin Cave

I'm reminded of a comment I read years ago but can't find now "A 1 in a million chance is usually next Tuesday". +1 for saying it's "nowhere near that unlikely". – Bevan Aug 17 '12 at 04:02
+1 for the wager. The best way to deal with race conditions is to get rid of them. – Blrfl Aug 17 '12 at 04:14
It's always nice when probability provides humor ("that's probably reliable enough"), and a good answer besides. – Chelonian Aug 17 '12 at 04:17
@Bevan "A 1 in a million chance is usually next Tuesday" ...unless you are playing a lottery :) – Sergey Kalinichenko Aug 17 '12 at 04:39
@dasblinkenlight But chances of *someone* winning in most lotteries approaches 100%. Predicting *who*, now that's the challenge. – Bevan Aug 17 '12 at 04:59
@Bevan: That comment was exactly what was going through my mind when I read the question - here is the reference: http://blogs.msdn.com/b/larryosterman/archive/2004/03/30/104165.aspx – Doc Brown Aug 17 '12 at 06:39
I upvoted the answer overall, but "If you can conceive of the error, it is almost certain that it will occur at least occasionally" seems very dubious. – LarsH Aug 17 '12 at 14:58
@LarsH: Agreed. The theoretical possibility of something happening has no bearing on its actual probability, which could range anywhere in `0 < x < 1`. – Joel Cornett Aug 17 '12 at 19:58
I down voted this one. The cost of the error matters not its chances of occurring. – Apoorv Aug 18 '12 at 12:13

score 71 · Answer 2 · answered Aug 17 '12 at 03:50

From the cost-benefit standpoint, you should write additional code only when it gets you enough benefit.

For example, if the worst thing that would happen if a wrong thread "wins the race" is that the information would not display, and the user would need to click "refresh", don't bother guarding against the race condition: having to write a lot of code is not worth fixing something that insignificant.

On the other hand, if the race condition could result in incorrect money transfers between bank accounts, then you must guard against the race condition no matter how much code you need to write to solve the problem.

Sergey Kalinichenko

+1: For making the distinction between "Failure that looks like failure" and "Failure that looks like success". Incorrect information is much more serious, depending on the domain. – deworde Aug 17 '12 at 08:37
+1 it makes a big difference what the results of the race condition could be. – Grant Aug 17 '12 at 15:38
+1 The consequence of the race condition should be a major deciding factor in if it should be addressed. A race condition that might cause an airplane crash is far different from a condition that might force the user to reopen an application. – poke Aug 17 '12 at 17:30
+1: I would say that the consequences are probably what you should be analyzing and not the probability of it occurring. If the consequences don't matter, you might not have to handle the race condition EVEN if it is very common. – Leo Aug 17 '12 at 20:28
But don't assume that fixing a race condition automatically means that you have to write more code. It might just as well mean remove a large chunk of buggy code and replace it with a smaller chunk of correct code. – JesperE Aug 19 '12 at 08:06

score 45 · Answer 3 · answered Aug 17 '12 at 04:37

Finding a race condition is the hard part. You probably spent almost as much time writing this question as it would have taken you to fix it. It's not like it makes it that much less readable. Programmers expect to see synchronization code in such situations, and actually might waste more time wondering why it's not there and if adding it would fix their unrelated bug.
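To give a sense of how little code this usually is, here is a minimal sketch of the GUI scenario from the question, assuming the polling thread publishes a result and the UI thread reads it. The class and function names are hypothetical and the "network poll" is simulated by a loop.

```python
import threading


class LatestResult:
    """Most recent poll result, shared by the polling thread and the UI thread."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = 0
        self._value = None

    def publish(self, value) -> None:
        # Called by the polling thread. Both fields change under one lock,
        # so a reader never sees a new value paired with an old version.
        with self._lock:
            self._value = value
            self._version += 1

    def snapshot(self):
        # Called by the UI thread; returns a consistent (version, value) pair.
        with self._lock:
            return self._version, self._value


shared = LatestResult()


def poll_network() -> None:
    # Stand-in for the slow network poll (hypothetical).
    for n in range(1000):
        shared.publish(n)


worker = threading.Thread(target=poll_network)
worker.start()
observed = [shared.snapshot() for _ in range(1000)]  # "UI thread" reads concurrently
worker.join()
```

The two `with self._lock:` lines are the entire synchronization cost, and they follow a pattern most readers will recognize at a glance.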

As far as probabilities are concerned, you would be surprised. I had a race condition bug report last year that I couldn't reproduce with thousands of automated tries, but one system of one customer saw it all the time. The business value of spending 5 minutes to fix it now, versus possibly troubleshooting an "impossible" bug at a customer's installation, makes the choice a no-brainer.

Karl Bielefeldt

This too! Avoid having other programmers ponder about possible problems when reading your code, by doing what is necessary (even if it is 'unlikely' to fail). – Casey Kuball Aug 17 '12 at 14:16
Your point is well taken (fixes made now are quicker and cheaper than those made later) except that it's never going to be just "5 minutes to fix it now". – iconoclast Aug 17 '12 at 14:52
+1 for pointing out that the probability of the race condition probably depends on many factors, so even if it looks unlikely in *your* configuration, it may happen more frequently on a customer system / on a different OS / in the next release etc. – sleske Aug 17 '12 at 18:56

score 27 · Answer 4 · answered Aug 17 '12 at 03:53

Obtain and release the locks. Probabilities change, algorithms change. Skipping the locks is a bad habit to get into, and with them in place you don't have to stop and wonder whether you got the odds wrong when something goes wrong...

jmoreno

+1 for algorithms change. Right now, when you are aware of the race condition, the probabilities are low. After a year, when you've forgotten about the race condition, you may make a change to your code which significantly changes the timing and probability of a bug. – Phil Aug 17 '12 at 13:29

score 13 · Answer 5 · answered Aug 17 '12 at 13:12

> and some other thread is polling data over the network or something that is guaranteed to take 5-10 seconds to finish the job.

Until someone introduces a caching layer to improve performance. Suddenly that other thread finishes nearly instantaneously and the race condition manifests more often than not.

Had exactly this happen a few weeks ago, took about 2 full developer days to find the bug.

Always fix race conditions if you recognize them.
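One way to make the code indifferent to how long the other thread takes is to hand the result over through a thread-safe queue instead of relying on timing. A minimal sketch, with hypothetical function names and a `time.sleep` standing in for the network call:

```python
import queue
import threading
import time


def fetch(delay_seconds: float, results: queue.Queue) -> None:
    # Stand-in for the network poll; delay_seconds models a slow call
    # or a cache hit that returns almost immediately.
    time.sleep(delay_seconds)
    results.put("payload")


def run_once(delay_seconds: float) -> str:
    results: queue.Queue = queue.Queue()
    threading.Thread(target=fetch, args=(delay_seconds, results)).start()
    # The consumer waits until the data exists, however fast or slow
    # the producer turns out to be.
    return results.get(timeout=5)


slow = run_once(0.05)   # the "guaranteed slow" case
fast = run_once(0.0)    # the same call after a cache makes it near-instant
```

A real UI thread would not block like this; it would more likely poll the queue with `get_nowait()` from a timer. The point of the sketch is only that the outcome is the same for both delays.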

score 8 · Answer 6 · answered Aug 17 '12 at 04:46

Simple vs correct.

In many cases, simplicity trumps correctness. It's a cost issue.

Also, race conditions are nasty things that tend not to obey simple statistics. Everything goes fine until some other seemingly unrelated synchronization causes your race condition to suddenly happen half the time. Unless you turn the logs on or debug the code of course.

A pragmatic alternative to preventing a race condition (which can be tricky) can be to detect and log it (bonus for failing hard and early). If it never happens, you lost little. If it does actually happen, you got a solid justification to spend the extra time fixing it.
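A sketch of what "detect and log" could look like, assuming the section in question is believed to be entered by only one thread at a time. The class name and section name are made up for illustration; the guard does not serialize threads, it only checks the assumption and fails loudly when it is violated.

```python
import logging
import threading

log = logging.getLogger("race_detector")


class AssumedExclusive:
    """Context manager for a section we believe only one thread ever enters."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._guard = threading.Lock()

    def __enter__(self):
        # Non-blocking acquire: if someone else is already inside, the
        # "cannot happen" assumption was wrong, so log it and fail early.
        if not self._guard.acquire(blocking=False):
            log.error("race detected in %s: concurrent entry", self._name)
            raise RuntimeError(f"concurrent entry into {self._name}")
        return self

    def __exit__(self, exc_type, exc, tb):
        self._guard.release()
        return False


section = AssumedExclusive("update_model")
errors = []


def intruder() -> None:
    try:
        with section:
            pass
    except RuntimeError as exc:
        errors.append(str(exc))


# Force the "impossible" overlap: a second thread enters while we are inside.
with section:
    t = threading.Thread(target=intruder)
    t.start()
    t.join()
```

If the log stays empty for the life of the product, the guard cost almost nothing. If it fires, the log entry is the justification for a proper fix.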

ptyx

+1 for logging and fail early if fixing it outright is too complicated. – Martin Ba Aug 17 '12 at 07:22
In many cases, simplicity trumps completeness. Synchronization is almost never among those cases. It will almost always come back to bite you (or the poor guy tasked with maintaining your code) later. – reirab Sep 30 '14 at 04:25
@reirab I disagree. If you consider infrequent events, then logged failure is cost effective. An example: if your phone app has a 1/100 failure rate (crash) if the user is switching network at an exact month transition (1/31 23:59:00 -> 2/1 00:00:00), you'll probably never hear about it. But then a 1/10^9 chance of crash on connection on a server is unacceptable. It depends. – ptyx Sep 30 '14 at 16:34

score 7 · Answer 7 · answered Aug 17 '12 at 11:30

If your race-condition is security-related, you should always code to prevent it.

A common example is the race condition in creating or opening files on Unix, which can in some circumstances lead to privilege escalation attacks if the program with the race condition is running with higher privileges than the user interacting with it, such as a system daemon process or, worse still, the kernel.

Even if a race condition has something like 10^(-80) chance of happening randomly, it may well be the case that a determined attacker has a decent chance of creating such conditions deliberately and artificially.
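For the file-creation case, the usual defence is to let the operating system do the existence check and the creation as one atomic step rather than checking first and opening afterwards. A minimal sketch, assuming the goal is "create this file only if nothing is there yet"; the function name and file name are hypothetical:

```python
import os
import tempfile


def create_exclusive(path: str, data: bytes) -> bool:
    """Create path atomically; return False if something already exists there."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    try:
        # O_CREAT | O_EXCL makes "check that it does not exist" and "create it"
        # a single step, so there is no window between an os.path.exists()
        # check and the open() for an attacker to slip a file into.
        fd = os.open(path, flags, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return True


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "state.lock")
    first = create_exclusive(target, b"owner=1")    # creates the file
    second = create_exclusive(target, b"owner=2")   # refused, file already exists
```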

score 6 · Answer 8 · answered Aug 21 '12 at 03:17

Therac-25!

Developers on the Therac-25 project were pretty confident about the timing between the UI and an interface-related issue in a therapeutic X-ray machine.

They should not have been.

You can learn more about this famous life-and-death software disaster at:

http://www.youtube.com/watch?v=izGSOsAGIVQ

or

http://en.wikipedia.org/wiki/Therac-25

Your application may be much less sensitive to failure than medical devices. A helpful method is to rate risk exposure as the product of the likelihood of occurrence and the cost of occurrence over the life of the product for all the units that could be produced.
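As a rough illustration of that product, here is a small calculation. Every number in it is invented for the example (failure probability, usage pattern, fleet size, cost), and the function name is hypothetical.

```python
def risk_exposure(p_failure_per_use: float, uses_per_unit: int,
                  units: int, cost_per_failure: float) -> float:
    """Expected cost of a failure mode over the whole product life and fleet."""
    expected_failures = p_failure_per_use * uses_per_unit * units
    return expected_failures * cost_per_failure


exposure = risk_exposure(
    p_failure_per_use=1e-9,      # assumed chance of hitting the race per operation
    uses_per_unit=365 * 5,       # assumed: once a day for five years
    units=10_000,                # assumed number of copies shipped
    cost_per_failure=50_000.0,   # assumed cost of one incident
)
print(f"expected exposure: {exposure:.2f}")
```

With these made-up inputs the exposure is roughly 900 currency units. Changing any one factor by a few orders of magnitude, as the next paragraph argues can easily happen, changes the answer by the same amount.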

If you have chosen to build your code to last (and it sounds like you have), you should consider Moore's law, which can easily lop off several zeros every few years as computers inside or outside your system get faster. If you ship thousands of copies, lop off more zeros. If users do this operation daily (or monthly) for years, take away a few more. If it is used where Google fiber is available, what then? If the UI garbage collects in the middle of a GUI operation, does that affect the race? Are you using an Open Source or Windows library behind your GUI? Can updates there affect timing?

Semaphores, locks, mutexes, and barrier synchronization are among the ways to synchronize activities between threads. If you are not using them, another person who maintains your program might, and then assumptions about relationships between threads can shift pretty quickly and the calculation about the race condition might be invalidated.

I recommend that you explicitly synchronize because, while you might never see it create a problem, a customer might. In addition, even if your race condition never occurs, what if you or your organization are called to court to defend your code (as Toyota was over the Prius a few years ago)? The more thorough your methodology, the better you will fare. It might be nicer to say "we guard against this unlikely case like this..." than to say, "we know our code will fail, but we wrote down this equation to show it won't happen in our lifetime. Probably."

It sounds like the probability calculation comes from someone else. Do they know your code and do you know them enough to trust that no error was made? If I calculated a 99.99997% reliability for something, I might also think back to my college statistics classes and remember that I did not always get 100%, and back off quite a few percent on my own personal reliability estimates.

While I think this is a good answer, you could argue that your hobby GUI project surely won't cause people to die if you fail to eliminate a race condition. — marktani, Sep 11 '12 at 16:42
I am not much for arguing, but if I were I might argue that anytime we write code we should write it right. If we can practice getting the race conditions out of our hobby projects where the code is simpler and perhaps we are the only author, we will be that much more ready when we tackle work projects where the work of several authors needs to be integrated together. — DeveloperDon, Sep 14 '12 at 16:02

score 4 · Answer 9 · answered Aug 17 '12 at 14:26

> would it be totally unnecessary or even counterproductive to add more lines of code to hinder readability?

Simplicity is only good when it's also correct. Since this code is not correct, future programmers will inevitably look at it when looking for a related bug.

Whichever way you handle it (either by logging it, documenting it, or adding the locks -- this depends on the cost), you will save other programmers time when looking at the code.

score 3 · Answer 10 · answered Aug 17 '12 at 03:54

This would depend on the context. If it's a casual iPhone game, probably not. The flight control system for the next manned space vehicle, probably. It all depends on what the consequences are if the 'bad' result happens, measured against the estimated cost of fixing it.

There is rarely a 'one size fits all' answer for these types of questions because they are not programming questions, but instead economics questions.

GrandmasterB

"The flight control system for the next manned space vehicle" *DEFINITELY*. – deworde Aug 17 '12 at 08:38
probably... definitely... it'd depend on who was in the rocket :-) – GrandmasterB Aug 17 '12 at 18:40

score 3 · Answer 11 · answered Aug 17 '12 at 15:09

Yes, expect the unexpected. I have spent hours (in other people's code ^^) tracking down conditions that should never happen.

This means things such as: always have an else, always have a default in a case statement, initialize variables (yes, really; bugs come from this), check your loops for variables reused across iterations, etc.

If you are worried about threading issues specifically, read blogs, articles, and books on the subject. The current theme seems to be immutable data.
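As a small illustration of the immutable-data idea, here is a sketch using a frozen dataclass. The `Reading` type and its fields are hypothetical. An "update" creates a new object instead of modifying the old one, so a thread that still holds the old object keeps seeing a complete, unchanged value.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Reading:
    """One complete, immutable poll result (hypothetical fields)."""
    sequence: int
    temperature_c: float


current = Reading(sequence=1, temperature_c=21.5)

# Build a new object for the new state; `current` itself is untouched.
updated = replace(current, sequence=2, temperature_c=22.0)

try:
    current.sequence = 99      # in-place mutation is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Sharing the reference to the newest object between threads still needs some care (a lock, a queue, or a runtime guarantee about reference assignment), but no thread can ever observe a half-updated `Reading`.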

score 3 · Answer 12 · answered Aug 17 '12 at 15:48

Just fix it.

I've seen exactly this. One thread manages to make a network request to a server, which does a complex database lookup and responds, before the other thread has got to the next line of code. It happens.

Some customer somewhere will decide one day to run something that hogs all the CPU time for the "fast" thread while leaving the slow thread running, and you'll be sorry :)

score 1 · Answer 13 · answered Aug 17 '12 at 19:41

If you've recognised an unlikely race condition, at least document it in the code!

EDIT: I should add that I'd fix it if at all possible, but at the time of writing the above no other answer explicitly said at least document the problem in the code.

edited Aug 22 '12 at 07:01

answered Aug 17 '12 at 19:41

Mark Hurd

Yep, and at least try and detect it and log it if it happens. IMHO it's perfectly fine not to avoid every error. But at least let someone know that it occurred, and that your assumption that it wouldn't was misguided. – Steve Bennett Aug 22 '12 at 08:21

score 0 · Answer 14 · answered Aug 17 '12 at 13:07

I think that if you already know how and why it could happen, you might as well deal with it. That is, if it doesn't take up a copious amount of resources.

Sjaak van der Heide

score 0 · Answer 15 · edited Aug 18 '12 at 00:34

It all depends on what the consequences of a race condition are. I think the people answering your question are correct for their line of work. Mine is router configuration engines. For me, race conditions leave systems stalled, corrupt, or unconfigured even though the operation reported success. I always use semaphores per router so that I don't have to clean anything up by hand.

I think some of my GUI code is still prone to race conditions in such a way that a user might be given an error because a race condition happened, but I would not accept any such possibility if there is a chance of data corruption or misbehaviour of the application after such an event.

score 0 · Answer 16 · answered Aug 18 '12 at 11:01

Funnily enough, I encountered this problem recently. I didn't even realise a race condition was possible in my circumstance. The race condition only presented itself when multi-core processors became the norm.

The scenario was roughly like this. A device driver raised events for the software to handle. Control had to return to the device driver as soon as possible to prevent a timeout on the device. To ensure this, the event was recorded and queued in a separate thread.

Receive event from device:
{
    Record event details.
    Enqueue event in the queuing thread.
    Acknowledge the event.
}

Queueing thread receives an event:
{
    Retrieve event details.
    Process event.
    Send next command to device.
}

This worked fine for years. Then suddenly it would fail in certain configurations. It turns out that the queueing thread was now running truly in parallel to the event handling thread, rather than sharing a single processor's time. It managed to send the next command to the device before the event had been acknowledged, causing an out-of-sequence error.

Given it only affected one customer in one configuration, I shamefully put a Thread.Sleep(1000) in where the problem was. There's not been a problem since.
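For comparison, here is a sketch of how the ordering could be enforced explicitly instead of with a sleep, assuming each queued event carries a flag that is set once the acknowledgement has gone out. This is not the code from the system described above; the names (`device_send`, `on_device_event`, `worker`) are hypothetical and the device is simulated by a list.

```python
import queue
import threading

sent = []                         # ordered record of what reached the "device"
sent_lock = threading.Lock()


def device_send(message: str) -> None:
    with sent_lock:
        sent.append(message)


def on_device_event(event_id: int, work: queue.Queue) -> None:
    acknowledged = threading.Event()
    work.put((event_id, acknowledged))   # enqueue for the queueing thread
    device_send(f"ack {event_id}")       # acknowledge the event
    acknowledged.set()                   # only now may the next command go out


def worker(work: queue.Queue) -> None:
    while True:
        item = work.get()
        if item is None:                 # sentinel: shut down
            return
        event_id, acknowledged = item
        # ... retrieve details and process the event here ...
        acknowledged.wait()              # block until the ack has been sent
        device_send(f"command after {event_id}")


work: queue.Queue = queue.Queue()
thread = threading.Thread(target=worker, args=(work,))
thread.start()
for event_id in range(100):
    on_device_event(event_id, work)
work.put(None)
thread.join()
```

The worker can still process the event in parallel with the acknowledgement; only the final "send next command" step waits, so the ordering holds on any number of cores without a fixed delay.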
