53

Let's consider something like a GUI application where the main thread is updating the UI almost instantaneously, and some other thread is polling data over the network or doing something else that is guaranteed to take 5-10 seconds to finish the job.

I've received many different answers for this. Some people say that if a race condition is a statistical impossibility, don't worry about it at all, but others have said that if there's even a 10^-53% chance (I kid you not on the numbers, this is what I've heard) of some voodoo magic happening due to a race condition, always obtain/release locks on the thread that needs it.

What are your thoughts? Is it good programming practice to handle race conditions in such statistically-impossible situations? Or would it be totally unnecessary, or even counterproductive, to add more lines of code that hinder readability?

TtT23
  • 1,553
  • 4
  • 20
  • 28
  • Fooled by Randomness – Job Aug 17 '12 at 05:40
  • 21
    When people are stating chances like that, why doesn't anyone ask about the education of person stating that number? You need a formal education in statistics before you can back up with a number like that. – Pieter B Aug 17 '12 at 08:08
  • 4
    And it would not bother you as a programmer? Maybe deep inside your thoughts? – Den Aug 17 '12 at 08:09
  • 27
    As a physicist, p<1E-140 means p=0. Not going to happen in this universe. 0.00000000000000000000000000000000000000000000000000001% is a _lot_ bigger. – MSalters Aug 17 '12 at 08:20
  • 15
    Make sure this race condition can't lead to someone _willingly_ crashing your app. This could be the cause of a security problem. – toasted_flakes Aug 17 '12 at 08:26
  • 4
    "others have **said** that if there's even a 0.00000000000000000000000000000000000000000000000000001% (I kid you not on the numbers, this is what I've **heard**)" I hope they said that number in scientific notation! :) – George Duckett Aug 17 '12 at 08:41
  • Readability wouldn't factor in my considerations of whether to use synchronisation or not. Synchronisation code follows well known patterns that do not hamper a developer's ability to read and understand the code. Performance would factor in my considerations. Locking and serializing is a good practice, but sometimes not worth it if it hampers performance. For example, in logs or some other reports on the app's execution, I really don't care much whether it has processed 1,000,000 or 1,000,020 lines at a certain point in time. – Marjan Venema Aug 17 '12 at 09:18
  • 27
    One in a million chances happen nine times out of ten. – Kaz Dragon Aug 17 '12 at 11:30
  • 27
    "almost certainly has no chance of occurring?" means it happens in production at 3 AM and will most likely be very expensive. –  Aug 17 '12 at 12:38
  • 2
    I'm surprised no one has mentioned [Murphy's Law](http://en.wikipedia.org/wiki/Murphy's_law). – Casey Kuball Aug 17 '12 at 14:13
  • Can you re-write to prevent the race condition from happening? How does the race happen? – Paul Aug 17 '12 at 15:05
  • If it's in your face and you already know about it, why not? – Rig Aug 17 '12 at 19:47
  • @KazDragon: Catchy, but not true. "One in a million" means that it "happens 0.00001 times out of ten". It's an observation about the relative frequency of an event. If it happened "nine out of ten times", it wouldn't be a "one in a million chance", it would be a "nine in ten chance". – Joel Cornett Aug 17 '12 at 19:54
  • 2
    @JoelCornett http://www.goodreads.com/quotes/95458-scientists-have-calculated-that-the-chances-of-something-so-patently – Gareth Aug 18 '12 at 01:14
  • @Gareth: Hehe. I had heard that quote before, but was unsure where it had come from :) – Joel Cornett Aug 18 '12 at 01:16
  • 1
    The probability of an error due to cosmic radiation is greater than that race condition. One should put more efforts to harden the computer ;-) http://stackoverflow.com/questions/2580933/cosmic-rays-what-is-the-probability-they-will-affect-a-program According to the above answer - 1.4 × 10^(-15) is the probability of errors being introduced by cosmic radiation ;-) – Lord Loh. Aug 18 '12 at 05:04
  • It really depends what the chances are, and what the costs and benefits are. If the chance is smaller than the chance of the atoms in the computer rearranging themselves into a bowl of petunia, don't bother. (Actually, I assume the chances of a bit flip because of cosmic rays, or that an aircraft falls on it, or that a lightning hits it, are waaaay bigger than the chance you listed). Remember: unless you work for NASA, the client wants 99.9% working code **now**, instead of 100% working code 5 years later at ten times the cost. – vsz Aug 18 '12 at 07:03
  • Ever heard of a black swan? The 2007 financial crisis? – Apoorv Aug 18 '12 at 12:17
  • 4
    Before doing anything else, be sure that the percentage chance of the race condition happening is not one of the sixty-eight percent of statistics which are made up. – hippietrail Aug 18 '12 at 16:18
  • "almost certainly has no chance of occurring?" -- famous last words. – David Božjak Aug 24 '12 at 06:49

16 Answers

138

If it is truly a 1 in 10^55 event, there would be no need to code for it. That would imply that if you did the operation 1 million times a second, you'd get one bug every 3 * 10^41 years which is, roughly, 10^31 times the age of the universe. If your application has an error only once in every trillion trillion billion ages of the universe, that's probably reliable enough.
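The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the 365.25-day year and the roughly 1.38e10-year age of the universe are assumptions I am plugging in, not figures from the question.

```python
# Back-of-the-envelope check of the figures above.
# Assumptions: a 365.25-day year and a universe roughly 1.38e10 years old.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
AGE_OF_UNIVERSE_YEARS = 1.38e10

odds = 1e55             # one failure per 10^55 operations
ops_per_second = 1e6    # one million operations per second

seconds_per_failure = odds / ops_per_second
years_per_failure = seconds_per_failure / SECONDS_PER_YEAR
universe_ages = years_per_failure / AGE_OF_UNIVERSE_YEARS

print(f"{years_per_failure:.1e} years between failures")
print(f"{universe_ages:.1e} ages of the universe")
```

Both results land in the same order of magnitude as the numbers quoted above.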

However, I would wager very heavily that the error is nowhere near that unlikely. If you can conceive of the error, it is almost certain that it will occur at least occasionally thus making it worth coding correctly to begin with. Plus, if you code the threads correctly at the outset so that they obtain and release locks appropriately, the code is much more maintainable in the future. You don't have to worry when you're making a change that you have to re-analyze all the potential race conditions, re-compute their probabilities, and assure yourself that they won't recur.
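For a sense of how little code "obtain and release locks appropriately" can be, here is a minimal Python sketch. The names `LatestResult`, `publish`, `snapshot`, `poll_once` and `fetch` are all hypothetical, and `fetch` merely stands in for the slow network call from the question.

```python
import threading


class LatestResult:
    """Holds the most recent poll result; safe to share between threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._version = 0

    def publish(self, value):
        # Called by the polling thread when a fetch completes.
        with self._lock:
            self._value = value
            self._version += 1

    def snapshot(self):
        # Called by the UI thread; returns a consistent (version, value) pair.
        with self._lock:
            return self._version, self._value


def poll_once(result, fetch):
    # `fetch` is a placeholder for the slow network call.
    result.publish(fetch())


latest = LatestResult()
worker = threading.Thread(target=poll_once, args=(latest, lambda: {"status": "ok"}))
worker.start()
worker.join()
version, value = latest.snapshot()
```

Because both fields are only touched while holding the lock, the UI thread can never observe a new value paired with an old version number, however the timing shifts later.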

Justin Cave
  • 12,691
  • 3
  • 44
  • 53
  • 67
    I'm reminded of a comment I read years ago but can't find now "A 1 in a million chance is usually next Tuesday". +1 for saying it's "nowhere near that unlikely". – Bevan Aug 17 '12 at 04:02
  • 2
    +1 for the wager. The best way to deal with race conditions is to get rid of them. – Blrfl Aug 17 '12 at 04:14
  • It's always nice when probability provides humor ("that's probably reliable enough"), and a good answer besides. – Chelonian Aug 17 '12 at 04:17
  • 10
    @Bevan "A 1 in a million chance is usually next Tuesday" ...unless you are playing a lottery :) – Sergey Kalinichenko Aug 17 '12 at 04:39
  • 23
    @dasblinkenlight But chances of *someone* winning in most lotteries approaches 100%. Predicting *who*, now that's the challenge. – Bevan Aug 17 '12 at 04:59
  • 3
    @Bevan: That comment was exactly what was going through my mind when I read the question - here is the reference: http://blogs.msdn.com/b/larryosterman/archive/2004/03/30/104165.aspx – Doc Brown Aug 17 '12 at 06:39
  • 1
    I upvoted the answer overall, but "If you can conceive of the error, it is almost certain that it will occur at least occasionally" seems very dubious. – LarsH Aug 17 '12 at 14:58
  • @LarsH: Agreed. The theoretical possibility of something happening has no bearing on its actual probability, which could range anywhere in `0 < x < 1`. – Joel Cornett Aug 17 '12 at 19:58
  • I down voted this one. The cost of the error matters, not its chances of occurring. – Apoorv Aug 18 '12 at 12:13
71

From the cost-benefit standpoint, you should write additional code only when it gets you enough benefit.

For example, if the worst thing that would happen if a wrong thread "wins the race" is that the information would not display, and the user would need to click "refresh", don't bother guarding against the race condition: writing a lot of code is not worth it to fix something that insignificant.

On the other hand, if the race condition could result in incorrect money transfers between bank accounts, then you must guard against the race condition no matter how much code you need to write to solve this problem.

Sergey Kalinichenko
  • 17,393
  • 4
  • 57
  • 73
  • 20
    +1: For making the distinction between "Failure that looks like failure" and "Failure that looks like success". Incorrect information is much more serious, depending on the domain. – deworde Aug 17 '12 at 08:37
  • 2
    +1 it makes a big difference what the results of the race condition could be. – Grant Aug 17 '12 at 15:38
  • +1 The consequence of the race condition should be a major deciding factor in if it should be addressed. A race condition that might cause an airplane crash is far different from a condition that might force the user to reopen an application. – poke Aug 17 '12 at 17:30
  • 1
    +1: I would say that the consequences are probably what you should be analyzing and not the probability of it occurring. If the consequences don't matter, you might not have to handle the race condition EVEN if it is very common. – Leo Aug 17 '12 at 20:28
  • 1
    But don't assume that fixing a race condition automatically means that you have to write more code. It might just as well mean remove a large chunk of buggy code and replace it with a smaller chunk of correct code. – JesperE Aug 19 '12 at 08:06
45

Finding a race condition is the hard part. You probably spent almost as much time writing this question as it would have taken you to fix it. It's not like it makes it that much less readable. Programmers expect to see synchronization code in such situations, and actually might waste more time wondering why it's not there and if adding it would fix their unrelated bug.

As far as probabilities are concerned, you would be surprised. I had a race condition bug report last year that I couldn't reproduce with thousands of automated tries, but one system of one customer saw it all the time. The business value of spending 5 minutes to fix it now, versus possibly troubleshooting an "impossible" bug at a customer's installation, makes the choice a no-brainer.

Karl Bielefeldt
  • 146,727
  • 38
  • 279
  • 479
  • 1
    This too! Avoid having other programmers ponder about possible problems when reading your code, by doing what is necessary (even if it is 'unlikely' to fail). – Casey Kuball Aug 17 '12 at 14:16
  • Your point is well taken (fixes made now are quicker and cheaper than those made later) except that it's never going to be just "5 minutes to fix it now". – iconoclast Aug 17 '12 at 14:52
  • 2
    +1 for pointing out that the probability of the race condition probably depends on many factors, so even if it looks unlikely in *your* configuration, it may happen more frequently on a customer system / on a different OS / in the next release etc. – sleske Aug 17 '12 at 18:56
27

Obtain and release the locks. Probabilities change, algorithms change. Skipping locks is a bad habit to get into, and with them in place you don't have to stop and wonder whether you got the odds wrong when something goes wrong...

jmoreno
  • 10,640
  • 1
  • 31
  • 48
  • 6
    +1 for algorithms change. Right now, when you are aware of the race condition, the probabilities are low. After a year, when you've forgotten about the race condition, you may make a change to your code which significantly changes the timing and probability of a bug. – Phil Aug 17 '12 at 13:29
13

> and some other thread is polling data over the network or something that is guaranteed to take 5-10 seconds to finish the job.

Until someone introduces a caching layer to improve performance. Suddenly that other thread finishes nearly instantaneously and the race condition manifests more often than not.

Had exactly this happen a few weeks ago, took about 2 full developer days to find the bug.

Always fix race conditions if you recognize them.

Michael Borgwardt
  • 51,037
  • 13
  • 124
  • 176
8

Simple vs correct.

In many cases, simplicity trumps correctness. It's a cost issue.

Also, race conditions are nasty things that tend not to obey simple statistics. Everything goes fine until some other seemingly unrelated synchronization causes your race condition to suddenly happen half the time. Unless you turn the logs on or debug the code of course.

A pragmatic alternative to preventing a race condition (which can be tricky) can be to detect and log it (bonus for failing hard and early). If it never happens, you lost little. If it does actually happen, you got a solid justification to spend the extra time fixing it.
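A minimal Python sketch of "detect and log" might look like the following. `SingleThreadGuard` and `RaceDetected` are hypothetical names; the guard does not prevent overlapping access, it only notices it and fails loudly.

```python
import logging
import threading

logger = logging.getLogger("race_detector")


class RaceDetected(RuntimeError):
    pass


class SingleThreadGuard:
    """Does not prevent overlap; it only notices it and fails loudly."""

    def __init__(self, name):
        self._name = name
        self._busy = threading.Lock()

    def __enter__(self):
        # A non-blocking acquire fails if the section is already occupied.
        if not self._busy.acquire(blocking=False):
            logger.error("race detected in %s", self._name)
            raise RaceDetected(self._name)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._busy.release()
        return False


guard = SingleThreadGuard("update_model")
with guard:
    pass  # code that is assumed never to run concurrently
```

If the log line never shows up, the cost was a few lines of code. If it does, you have the evidence needed to justify a proper fix.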

ptyx
  • 5,851
  • 2
  • 22
  • 21
  • 1
    +1 for logging and fail early if fixing it outright is too complicated. – Martin Ba Aug 17 '12 at 07:22
  • In many cases, simplicity trumps completeness. Synchronization is almost never among those cases. It will almost always come back to bite you (or the poor guy tasked with maintaining your code) later. – reirab Sep 30 '14 at 04:25
  • @reirab I disagree. If you consider infrequent events, then logged failure is cost effective. An example: if your phone app has a 1/100 failure rate (crash) if the user is switching network at an exact month transition (1/31 23:59:00 -> 2/1 00:00:00), you'll probably never hear about it. But then a 1/10^9 chance of crash on connection on a server is unacceptable. It depends. – ptyx Sep 30 '14 at 16:34
7

If your race-condition is security-related, you should always code to prevent it.

A common example is a race condition when creating/opening files in Unix, which can in some circumstances lead to privilege escalation attacks if the program with the race condition is running with higher privileges than the user interacting with it, such as a system daemon process or, worse still, the kernel.

Even if a race condition has something like 10^(-80) chance of happening randomly, it may well be the case that a determined attacker has a decent chance of creating such conditions deliberately and artificially.
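As an illustration of the file-creation case, here is a minimal Python sketch. The racy pattern is "check that the path does not exist, then open it", which leaves a window between the two steps. Passing `O_CREAT | O_EXCL` to `os.open` asks the operating system to do both in one atomic step. The helper name `create_exclusive` and the file name are hypothetical.

```python
import os
import tempfile


def create_exclusive(path, data):
    # O_CREAT | O_EXCL makes "check that it does not exist" and "create it"
    # a single atomic step, so nothing can be swapped in between the two.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "report.tmp")
create_exclusive(target, b"first")
try:
    create_exclusive(target, b"second")
    second_write_succeeded = True
except FileExistsError:
    # The file already existed, so the second attempt is refused.
    second_write_succeeded = False
```

This only covers the create-if-absent case; other check-then-use races on files need their own treatment.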

6

Therac-25!

Developers on the Therac-25 project were pretty confident about the timing between the UI and an interface-related issue in a therapeutic X-ray machine.

They should not have been.

You can learn more about this famous life-and-death software disaster at:

http://www.youtube.com/watch?v=izGSOsAGIVQ

or

http://en.wikipedia.org/wiki/Therac-25

Your application may be much less sensitive to failure than medical devices. A helpful method is to rate risk exposure as the product of the likelihood of occurrence and the cost of occurrence over the life of the product for all the units that could be produced.

If you have chosen to build your code to last (and it sounds like you have), you should consider Moore's law that can easily lop off several zeros every few years as computers inside or outside your system get faster. If you ship thousands of copies, lop off more zeros. If users do this operation daily (or monthly) for years, take away a few more. If it is used where Google fiber is available, what then? If the UI garbage collects mid GUI operation, does that affect the race? Are you using an Open Source or Windows library behind your GUI? Can updates there affect timing?
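To make the risk exposure product concrete, here is a small Python sketch. Every number in it is invented purely for illustration, and the function name `risk_exposure` is hypothetical. It uses the simple approximation that expected failures equal probability times number of operations, which is reasonable when the probability is tiny.

```python
def risk_exposure(p_per_operation, ops_per_unit_per_year, units, years, cost_per_failure):
    # Expected number of failures over the product's life, across all units.
    expected_failures = p_per_operation * ops_per_unit_per_year * units * years
    # Exposure is likelihood times cost.
    return expected_failures, expected_failures * cost_per_failure


# Hypothetical inputs: a one-in-a-billion race, one operation per day,
# 10,000 shipped copies, a 10-year product life, 50,000 per incident.
failures, exposure = risk_exposure(1e-9, 365, 10_000, 10, 50_000.0)
print(f"expected failures: {failures:.4f}, exposure: {exposure:.2f}")
```

Rerunning it with a faster machine, more copies, or more frequent use shows how quickly the zeros disappear.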

Semaphores, locks, mutexes, and barrier synchronization are among the ways to synchronize activities between threads. If you are not using them, another person who maintains your program might, and then assumptions about the relationships between threads can shift pretty quickly, invalidating the calculation about the race condition.

I recommend that you explicitly synchronize because while you might not ever see it create a problem, a customer might. In addition, even if your race condition never occurs, what if you or your organization are called to court to defend your code (as Toyota was over the Prius a few years ago)? The more thorough your methodology, the better you will fare. It might be nicer to say "we guard against this unlikely case like this..." than to say, "we know our code will fail, but we wrote down this equation to show it won't happen in our lifetime. Probably."

It sounds like the probability calculation comes from someone else. Do they know your code and do you know them enough to trust that no error was made? If I calculated a 99.99997% reliability for something, I might also think back to my college statistics classes and remember that I did not always get 100%, and back off quite a few percent on my own personal reliability estimates.

DeveloperDon
  • 4,958
  • 1
  • 26
  • 53
  • 1
    +1 for mention of Therac-25. Many important lessons here. – Stuart Marks Aug 21 '12 at 04:24
  • While I think this is a good answer, you could argue that your hobby GUI project surely won't cause people to die if you fail to eliminate a race condition. – marktani Sep 11 '12 at 16:42
  • I am not much for arguing, but if I were I might argue that anytime we write code we should write it right. If we can practice getting the race conditions out of our hobby projects where the code is simpler and perhaps we are the only author, we will be that much more ready when we tackle work projects where the work of several authors needs to be integrated together. – DeveloperDon Sep 14 '12 at 16:02
4

> would it be totally unnecessary or even counterproductive to add more lines of code to hinder readability?

Simplicity is only good when it's also correct. Since this code is not correct, future programmers will inevitably look at it when looking for a related bug.

Whichever way you handle it (either by logging it, documenting it, or adding the locks -- this depends on the cost), you will save other programmers time when looking at the code.

Casey Kuball
  • 213
  • 1
  • 6
3

This would depend on the context. If it's a casual iPhone game, probably not. If it's the flight control system for the next manned space vehicle, probably. It all depends on what the consequences are if the 'bad' result happens, measured against the estimated cost of fixing it.

There is rarely a 'one size fits all' answer for these types of questions because they are not programming questions, but instead economics questions.

GrandmasterB
  • 37,990
  • 7
  • 78
  • 131
3

Yes, expect the unexpected. I have spent hours (in other people's code ^^) tracking down conditions that should never happen.

Things such as: always have an else, always have a default in a case statement, initialize variables (yes, really... bugs happen from this), check your loops for variables reused across iterations, etc.

If you are worried about threading issues specifically, read blogs, articles, and books on the subject. The current theme seems to be immutable data.
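As a rough illustration of the immutable-data approach, the Python sketch below has the worker thread build a frozen record and hand it over through a thread-safe queue, so there is no shared mutable state left to race on. The `Reading` class and its fields are hypothetical.

```python
import queue
import threading
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Reading:
    source: str
    values: tuple  # a tuple, so the contents cannot be changed in place


def worker(out):
    # Build the record completely, then hand it over; it is never edited again.
    out.put(Reading(source="sensor-1", values=(1, 2, 3)))


inbox = queue.Queue()
t = threading.Thread(target=worker, args=(inbox,))
t.start()
t.join()
reading = inbox.get()

try:
    reading.source = "tampered"
    mutated = True
except FrozenInstanceError:
    # Frozen dataclasses refuse attribute assignment.
    mutated = False
```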

Paul
  • 730
  • 3
  • 13
3

Just fix it.

I've seen exactly this. One thread manages to make a network request to a server which does a complex database lookup and responds before the other thread has got to the next line of code. It happens.

Some customer somewhere will decide one day to run something that hogs all the CPU time for the "fast" thread while leaving the slow thread running, and you'll be sorry :)

JohnB
  • 1,231
  • 2
  • 8
  • 10
1

If you've recognised an unlikely race condition, at least document it in the code!

EDIT: I should add that I'd fix it if at all possible, but at the time of writing the above no other answer explicitly said at least document the problem in the code.

Mark Hurd
  • 343
  • 1
  • 3
  • 12
  • 1
    Yep, and at least try and detect it and log it if it happens. IMHO it's perfectly fine not to avoid every error. But at least let someone know that it occurred, and that your assumption that it wouldn't was misguided. – Steve Bennett Aug 22 '12 at 08:21
0

I think that if you already know how and why it could happen, you might as well deal with it. That is, if it doesn't take up a copious amount of resources.

0

It all depends on what the consequences of a race condition are. I think the people answering your question are correct for their line of work. Mine is router configuration engines. For me, race conditions leave systems frozen, corrupted, or unconfigured even though the operation reported success. I always use semaphores per router so that I don't have to clean anything up by hand.
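One way to picture per-router locking is the Python sketch below. It is not the actual engine, just an assumption of how such a scheme could look: a plain `threading.Lock` plays the role of a binary semaphore, and `lock_for`, `configure` and `apply_change` are hypothetical names.

```python
import threading
from collections import defaultdict

_registry_lock = threading.Lock()
_router_locks = defaultdict(threading.Lock)


def lock_for(router_id):
    # Guard the registry itself so two threads cannot create
    # two different locks for the same router.
    with _registry_lock:
        return _router_locks[router_id]


def configure(router_id, apply_change):
    # Only one configuration change per router runs at a time;
    # changes to different routers can still proceed in parallel.
    with lock_for(router_id):
        return apply_change()


result = configure("router-a", lambda: "configured")
```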

I think some of my GUI code is still prone to race conditions, in such a way that a user might be given an error because a race condition happened, but I would not allow any such possibility if there were a chance of data corruption or misbehaviour of the application after such an event.

gnat
  • 21,442
  • 29
  • 112
  • 288
Sylwester
  • 529
  • 2
  • 7
0

Funnily enough, I encountered this problem recently. I didn't even realise a race condition was possible in my circumstance. The race condition only presented itself when multi-core processors became the norm.

The scenario was roughly like this. A device driver raised events for the software to handle. Control had to return to the device driver as soon as possible to prevent a timeout on the device. To ensure this, the event was recorded and queued in a separate thread.

Receive event from device:
{
    Record event details.
    Enqueue event in the queuing thread.
    Acknowledge the event.
}

Queueing thread receives an event:
{
    Retrieve event details.
    Process event.
    Send next command to device.
}

This worked fine for years. Then suddenly it would fail in certain configurations. It turns out that the queueing thread was now running truly in parallel to the event handling thread, rather than sharing a single processor's time. It managed to send the next command to the device before the event had been acknowledged, causing an out-of-sequence error.

Given it only affected one customer in one configuration, I shamefully put a Thread.Sleep(1000) in where the problem was. There's not been a problem since.
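For comparison, a less shameful fix would make the ordering explicit instead of relying on a sleep. The Python sketch below is a simplified model of the scenario, not the real driver code: a `threading.Event` travels with each queued item, and the queueing thread waits on it before sending the next command. All names (`on_device_event`, `queue_worker`, `steps`) are hypothetical.

```python
import queue
import threading

steps = []                      # records the order in which steps happen
steps_lock = threading.Lock()
events = queue.Queue()


def record(step):
    with steps_lock:
        steps.append(step)


def on_device_event(details):
    acked = threading.Event()
    events.put((details, acked))
    record("ack")               # stands in for acknowledging the device
    acked.set()                 # only now may the worker send the next command


def queue_worker():
    details, acked = events.get()
    record("process")
    acked.wait()                # block until the acknowledgement has gone out
    record("send-next")


w = threading.Thread(target=queue_worker)
w.start()
on_device_event({"id": 1})
w.join()
```

Processing can still overlap with the acknowledgement, but the next command can no longer overtake it, regardless of how many cores are available.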

Hand-E-Food
  • 1,635
  • 1
  • 11
  • 15