What's a schrödinbug?

Question

A schrödinbug is a bug that manifests only after someone reading source code or using the program in an unusual way notices that it never should have worked in the first place, at which point the program promptly stops working for everybody until fixed. The Jargon File adds: "Though... this sounds impossible, it happens; some programs have harbored latent schrödinbugs for years."

What is being described is very vague.

Can someone provide an example of what a schrödinbug is like (with a fictional or real-life situation)?

I think you'd understand a schrödinbug better if you knew about Schrödinger's cat: http://en.wikipedia.org/wiki/Shrodingers_cat — Eimantas, Jul 29 '11 at 11:45
@Eimantas I'm actually now more confused but that is an interesting article :) — , Jul 29 '11 at 19:28

score 84 · Accepted Answer · edited Jan 09 '13 at 09:47

In my experience the pattern is this:

System works, often for years
An error is reported
The developer investigates the error and finds a bit of code which seems to be completely flawed and declares that it "could never have worked"
The bug gets fixed and the legend of the code that could never have worked (but did for years) grows

Let's be logical here. Code that could never have worked... could never have worked. If it did work then the statement is false.

So I'm going to say that a bug exactly as described (that is, one where observing the flawed code stops it working) is patent nonsense.

In reality what has happened is one of two things:

1) The developer hasn't fully understood the code. In this case the code is usually a mess, and somewhere in it there is a major but non-obvious sensitivity to some external condition (say a specific OS version or configuration that governs how some function works in some minor but significant way). This external condition is altered (say by a server upgrade or a change which is believed to be unrelated), and that causes the code to break.

The developer then looks at the code and, not understanding the historical context or having the time to trace through every possible dependency and scenario, declares that it could never have worked and rewrites it.

In this situation, the thing to understand is that the claim "it could never have worked" is provably false (because it did).

That's not to say rewriting it is a bad thing; often it isn't. While it's nice to know exactly what was wrong, finding out is often time consuming, and rewriting the section of code is often faster and lets you be sure that you've fixed things.

2) Actually it never worked, just no-one has ever noticed. This is surprisingly common, particularly in large systems. In this instance someone new starts and starts looking at things in a way no-one did before, or a business process changes bringing some previously minor edge case into the main process, and something which never really worked (or worked some but not all of the time) is found and reported.

The developer looks at it and declares "it could never have worked" but the users say "nonsense, we've been using it for years" and they're sort of right but something they consider irrelevant (and usually fail to mention until the developer finds the exact condition at which point they go "oh yes, we do do that now and didn't before") has changed.

Here the developer is right - it could never have worked and didn't ever work.

But in either case one of two things is true:

The claim "it could never have worked" is true and it never has worked - people just thought it did
It did work and the statement "it could never have worked" is false and down to a (usually reasonable) lack of understanding of the code and its dependencies

edited Jan 09 '13 at 09:47

Giorgio


answered Jul 29 '11 at 11:33

Jon Hopkins


Happens to me so often – genesis Jul 29 '11 at 12:30
*Great* insight into the realism of these situations – StuperUser Jul 29 '11 at 13:17
I would guess it is usually the result of a "WTF" moment. I had that once. I re-read some code I wrote and realized that a bug that was recently noticed should have made the whole app break down. Actually, after further inspection, another component I wrote was so good it compensated for the mistakes. – Thaddee Tyl Jul 29 '11 at 15:41
@Thaddee - I've seen that before but I've also seen two bugs in code modules that called each other cancelling each other out so it actually worked. Look at either one and they were broken but together they were fine. – Jon Hopkins Jul 29 '11 at 16:01
@Jon Hopkins: I also got a case of 2 bugs cancelling each other, and that's really surprising. I found a bug, mouthed the infamous statement "it could never have worked", looked deeper to find why it worked anyway, and found another bug which sort of corrected the first one, in most of the cases at least. I was really stunned by the discovery, and by the fact that with only ONE of the bugs, the consequence would have been catastrophic! – Alexis Dufrenoy Nov 23 '11 at 17:13
There is also a **third case**: somebody looks at the code, comes up with a test case, tries it and finds that it indeed fails as he expected from reading the code. At that moment everybody else realizes that it's indeed wrong, but they never tried or noticed. And a **fourth case**: by coincidence the users realize they need the failing case just about the time somebody reads the code and notices that it's wrong (because intervals between random events have an exponential probability distribution, it's quite probable). – Jan Hudec Mar 19 '12 at 12:50
@Jan - The additional cases you give are just variants of case 2 - that it never worked. Ultimately the key thing here is not the detail of why it did or didn't work, it's really just dispelling the myth of something which worked perfectly stopping working for no reason. There is always a reason. – Jon Hopkins Mar 20 '12 at 17:24
I've also seen case 5: the code depended on undefined behaviour, which happened to do what was expected for a few years. Then a different compiler was used: schrödinbug. – Jan 09 '13 at 09:27
IMHO a more common [Heisenbug](http://en.wikipedia.org/wiki/Heisenbug) is when a program happens to work due to a bug elsewhere in the system. The bug was always there but it didn't show until you fixed something, e.g. code which uses undocumented edge cases which could change at any time, or which has work-arounds on work-arounds. Java is full of "features" which can't be fixed because it would break code which incorrectly relied on poor behaviour. – Peter Lawrey Jan 10 '13 at 14:08
When I was programming the Siemens 9750 terminal emulation, there was a user who probed undefined ESC sequences at the original terminal and discovered a behaviour he found useful (it was something like `delete data and attributes from this cursor position down to 3 lines beyond`). I had to add that "feature" to the code and even added his name in the comment (around 1987). – ott-- Jan 10 '13 at 14:53
@Graham Lee I saw that once. The code assumed an uninitialized flag was zero. Worked fine until the compiler was changed to (correctly) not zero out uninitialized memory. – Gort the Robot Jan 10 '13 at 20:12
You took all the fun out of it – Tulains Córdova Feb 21 '14 at 18:45
Note also in case 1 that "it could never have worked" is ambiguous. It might mean, "had it ever executed, it certainly would have failed". If this is the case, then as you observe it must never have been executed. Since it has worked before, the developer would be wrong to say, "had it executed, it would have failed". But they might have meant, "it was never guaranteed to work", i.e. "it was always wrong", and this might be true despite it working on a specific configuration or phase of the moon. – Steve Jessop Jan 29 '15 at 11:07
A third possibility, or perhaps a variant of the first: a major but non-obvious sensitivity to an *internal* condition. In systems that rely on a large amount of Evil Global State, two bugs could cancel each other out. One is noticed by users and fixed, whereupon the other manifests. Unaware of the relationship because of the complexity of the global state, the developers are left wondering how it ever worked. – Kevin Krumwiede Oct 27 '15 at 03:40
Variation on #2 - the code didn't actually work but the system is using different binaries. Once you recompile & get them in sync, you've now introduced the bug into the system. – Sean McSomething Jan 18 '17 at 22:19
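
Several of the comments above describe two bugs that cancel each other out. A minimal sketch of that situation in Python follows. The function names (`to_cents`, `from_cents`, `from_cents_fixed`) are hypothetical and not taken from any of the systems described; the point is only that each function is wrong on its own while the pair round-trips correctly.

```python
def to_cents(amount: int) -> int:
    """Convert whole currency units to cents for storage."""
    return amount * 10  # BUG: should be * 100


def from_cents(stored: int) -> int:
    """Convert stored cents back to whole currency units."""
    return stored // 10  # BUG: should be // 100, but it undoes the bug above


def from_cents_fixed(stored: int) -> int:
    """The 'obvious' fix, applied to only one half of the pair."""
    return stored // 100
```

Round-tripping a value through both buggy functions returns it unchanged, so the pair appears to work. Fixing only the reader, as in `from_cents_fixed(to_cents(amount))`, silently shrinks every value that was stored by the still-buggy writer.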

score 54 · Answer 2 · answered Jul 29 '11 at 11:59

Because everyone mentions code that should never have worked, I'll give you an example I ran into, about 8 years ago on a dying VB3 project that was being converted to .net. Unfortunately the project had to be kept up-to-date until the .net version was complete - and I was the only one there who even remotely understood VB3.

There was one very important function which was called hundreds of times for each calculation - it calculated monthly interest for long-term pension plans. I'll reproduce the interesting parts.

Function CalculateMonthlyInterest([...], IsYearlyInterestMode As Boolean, [...]) As Double
    [about 30 lines of code]
    If IsYearlyInterestMode Then
        [about 30 lines of code]
        If Not IsYearlyInterestMode Then
            [about 30 lines of code (*)]
        End If
    End If
End Function

The part marked with a star had the most important code; it was the only part that did actual calculation. Clearly this should never have worked, right?

It took a lot of debugging, but I eventually found the cause: IsYearlyInterestMode was True, and Not IsYearlyInterestMode was also true. That is because somewhere along the line somebody cast it to an integer, then in a function that is supposed to set it to true incremented it (if it's 0 for False it would be set to 1, which is VB True, so I can see the logic there), then cast it back to a boolean. And I was left with a condition that can never happen and yet happens all the time.
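
The mechanism can be sketched outside VB. The Python below is an illustration only, not the original code: it assumes the flag effectively carried the integer 2 (the value the call sites are said to send) rather than VB's canonical True, which is -1. It relies on two VB behaviours: `Not` applied to an integer is a bitwise complement, and `If` treats any nonzero number as true. The helper names are made up for the example.

```python
def vb_not(value: int) -> int:
    """Model VB's Not on an integer: a bitwise complement, not a logical negation."""
    return ~value


def vb_if(value: int) -> bool:
    """Model how VB's If treats an integer: any nonzero value counts as true."""
    return value != 0


def reaches_starred_block(is_yearly_interest_mode: int) -> bool:
    """Mirror the two nested Ifs from the function above."""
    if vb_if(is_yearly_interest_mode):
        if vb_if(vb_not(is_yearly_interest_mode)):
            return True  # the part marked with a star
    return False
```

With a flag of 2, `vb_not` yields -3, which is still nonzero, so both conditions pass. With a canonical True of -1, `vb_not` yields 0 and the starred block is skipped, which would be consistent with a single call site failing while all the others worked.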

Epilogue: I never fixed that function; I just patched the failing call site to send 2 in like all the other ones. — configurator, Jul 29 '11 at 11:59
@Pacerier: More often when the code is such a mess that it only works correctly by accident. In my example, no developer meant for `IsYearlyInterestMode` to evaluate as both true and not true; the original developer that added a few lines (including one of the `if`s) didn't actually understand how it works - it just happened to work so it was good enough. — configurator, Aug 01 '11 at 10:37

score 16 · Answer 3 · answered Jul 29 '11 at 10:48

Don't know a real-world example, but to simplify it with an example situation:

A bug isn't noticed for a time, because the application doesn't run the code under conditions that cause it to fail.
Someone notices it by doing something outside of normal use (or inspecting the source).
Now that the bug is noticed, the application fails under normal conditions as well, until the bug is fixed.

This may happen because the bug corrupts some state of the application, which then causes failures under the previously normal conditions.
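
A small Python sketch of state corruption of this kind, under invented names (`DEFAULT_OPTIONS`, `format_amount`, and the `raw` flag are all hypothetical): a rarely used code path rewrites shared defaults, so one unusual call changes the result of every normal call that follows.

```python
DEFAULT_OPTIONS = {"digits": 2}


def format_amount(amount: float, raw: bool = False) -> str:
    """Format an amount with the configured number of decimal places."""
    options = DEFAULT_OPTIONS  # BUG: an alias of the shared defaults, not a copy
    if raw:
        options["digits"] = 6  # the rarely used path silently rewrites the defaults
    digits = options["digits"]
    return f"{amount:.{digits}f}"
```

As long as nobody passes `raw=True`, every call uses two decimal places. After the first unusual call, the ordinary calls start returning six decimal places as well, and keep doing so until the process restarts or the bug is fixed.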

answered Jul 29 '11 at 10:48

StuperUser


One explanation is that there had been random failures in the software, that nobody was able to link together mentally. Thus, those errors were thought to be of natural cause (such as random hardware failures). Once the source code is read, people are now able to relate all prior random errors to this one cause, and will realize that it should never have worked in the first place. – rwong Jul 29 '11 at 16:42
A second explanation is that there's a part in the software that is implemented with a chain-of-responsibility pattern. Each handler is written in a robust way, even though one handler has a critical bug. Now, the first handler will always fail, but because the second handler (which overlaps in responsibility) tries to accomplish the same task, the overall operation would seem to have succeeded. If there is any change in the second module, such as a change in its area of responsibility, it would cause an overall failure, although the real bug is in a different location. – rwong Jul 29 '11 at 16:47
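
The chain-of-responsibility explanation can be sketched roughly as follows. All handler names and request strings here are hypothetical; the sketch only shows how an overlapping second handler can hide a first handler that never succeeds.

```python
from typing import Callable, List, Optional

Handler = Callable[[str], Optional[str]]


def resize_handler(request: str) -> Optional[str]:
    # BUG: wrong case in the comparison, so this handler never accepts anything.
    if request == "Resize":
        return "resized"
    return None


def image_handler(request: str) -> Optional[str]:
    # Overlapping responsibility: quietly covers for resize_handler.
    if request == "resize":
        return "resized"
    if request == "rotate":
        return "rotated"
    return None


def image_handler_narrowed(request: str) -> Optional[str]:
    # A later, seemingly unrelated change: this handler now only rotates.
    if request == "rotate":
        return "rotated"
    return None


def dispatch(request: str, handlers: List[Handler]) -> Optional[str]:
    """Return the first non-None result, or None if no handler accepts the request."""
    for handler in handlers:
        result = handler(request)
        if result is not None:
            return result
    return None
```

With `[resize_handler, image_handler]` a `"resize"` request succeeds, so nothing looks wrong. Swapping in `image_handler_narrowed` makes the same request fail, and the change that gets blamed is not where the real bug lives.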

score 13 · Answer 4 · answered Jul 29 '11 at 15:24

A real-life example. I can't show code, but most people will relate to this.

We have a big internal library of utility functions where I work. One day I'm looking for a function to do a particular thing, and I find Frobnicate() and try to use it. Uh-oh: it turns out that Frobnicate() always returns an error code.

Digging into the implementation, I find some basic logic errors in Frobnicate() that make it always fail. In source control I can see that the function hasn't been modified since it was written, meaning that the function has never worked as intended. Why hasn't anybody noticed this? I search through the rest of the source enlistment and find that all of the existing callers of Frobnicate() are ignoring the return value (and therefore contain subtle bugs of their own). If I change those functions to check the return value like they should, then they start failing, too.

This is a case of condition #2 that Jon Hopkins mentioned in his answer, and it's depressingly common in large internal libraries.
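
Since the real code can't be shown, here is a hypothetical Python sketch of the same shape. The body of `frobnicate`, the record layout, and both callers are invented for illustration; only the pattern (a function that always fails, and callers that ignore its return value) comes from the description above.

```python
OK = 0
ERR = -1


def frobnicate(record: dict) -> int:
    """Intended to strip whitespace from record['name'] and return OK."""
    # BUG: `or` should be `and`. No value equals both strings, so this is always true.
    if record["kind"] != "user" or record["kind"] != "group":
        return ERR
    record["name"] = record["name"].strip()
    return OK


def import_unchecked(record: dict) -> dict:
    frobnicate(record)  # return value ignored, so the failure goes unnoticed
    return record


def import_checked(record: dict) -> dict:
    if frobnicate(record) != OK:
        raise ValueError("frobnicate failed")
    return record
```

`import_unchecked` appears to work, but the name is never stripped, which is the subtle bug in the caller. Switching to `import_checked` is the correct change, and it makes every import fail at once.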

... which makes a good reason to avoid writing an internal library wherever an external one is usable. It will be more tested and thus have far fewer such nasty surprises (open-source libraries are preferable, because you can fix them if they do anyway). — Jan Hudec, Mar 19 '12 at 12:57
Yeah, but if programmers ignore the return codes that's not the library's fault. (By the way, when was the last time you checked the retcode of `printf()`?) — JensG, Feb 21 '14 at 17:53

score 9 · Answer 5 · answered Jan 09 '13 at 09:39

Here's a real Schrödinbug I saw in some system code. A root daemon needs to communicate with a kernel module. So the kernel code creates some file descriptors:

int pipeFDs[1];

then sets up communication over a pipe that will be attached to a named pipe:

int pipeResult = pipe(pipeFDs);

This shouldn't work. pipe() writes two file descriptors into the array, but there's only space for one. But for about seven years it did work; the array happened to be before some unused space in memory that got coopted into being a file descriptor.

Then, one day, I had to port the code to a new architecture. It stopped working, and the bug that never should have worked was discovered.

score 5 · Answer 6 · answered Jan 09 '13 at 07:34

A corollary to the Schrödinbug is the Heisenbug - which describes a bug that disappears (or occasionally appears) when attempting to investigate and/or fix it.

Heisenbugs are mythical clever little blighters that run and hide when a debugger is loaded, but come out of the woodwork once you've stopped watching.

In reality, these usually seem to be caused by one or other of the following:

the impact of optimization, where code compiled with -DDEBUG is optimized to a different level from the release build
subtle timing differences due to real-world communication buses or interrupts being subtly different from simulated "perfect" dummy loads

Both highlight the importance of testing release code on release equipment, as well as unit/module/system test using emulators.
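
As a deterministic stand-in for the debug-versus-release differences listed above, here is a small Python sketch in which the diagnostic path itself changes the program's state. The class and function names are hypothetical, and this is only one simplified way such a difference can arise, not a description of any particular system.

```python
class LazyDevice:
    """Toy device that must be initialised before it can be read."""

    def __init__(self) -> None:
        self._ready = False

    def status(self) -> str:
        # Diagnostic helper with a hidden side effect: it initialises the device.
        self._ready = True
        return "ready"

    def read(self) -> int:
        if not self._ready:
            raise RuntimeError("device read before initialisation")
        return 42


def sample(device: LazyDevice, debug: bool, log: list) -> int:
    """Read the device, optionally recording its status first."""
    if debug:
        log.append(device.status())  # observing the device also initialises it
    return device.read()
```

With `debug=True` the read succeeds; with `debug=False` the same call raises, so the bug is only visible when nobody is looking at the diagnostics.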

answered Jan 09 '13 at 07:34

Andrew


Why did I not notice S.Lott's answer and delnan's comment before I posted this? – Andrew Jan 09 '13 at 10:39
I have little experience but have found a couple of these. I was working in an Android NDK environment. When the debugger hit a breakpoint it only halted the Java threads, not the C++ ones, which made some calls possible because the elements had been initialized in C++ by then. Without the debugger, the Java code would run ahead of the C++ code and try to use values that were not initialized yet. – MLProgrammer-CiM Jan 10 '13 at 15:37
I discovered a Heisenbug in our usage of the [Django](https://www.djangoproject.com/) database API a few months ago: When `DEBUG = True`, the name of the "parameters" arg to a raw SQL query changes. We had been using it as a keyword arg for clarity due to the length of the query, which broke completely when it was time to push to the beta site, where `DEBUG = False` – Izkata Feb 22 '14 at 02:38

score 2 · Answer 7 · answered Oct 27 '15 at 03:01

I've seen a few Schrödinbugs and always for the same reason:

Company policy required that everyone was supposed to use a program.
Nobody really used it (mostly because there was no training for it.)
But they couldn't tell management this. So everyone had to say "I've been using this program for 2 years and never encountered this bug until today."
The program never really worked, except for a minority of users (including the developers who wrote it.)

In one case, the program had been subject to plenty of testing, but not on the real database (which was deemed too sensitive, so a fake version was used.)

score 2 · Answer 8 · answered Jan 18 '17 at 15:06

I have an example from my own history; this was some 25 years ago. I was a child doing rudimentary graphics programming in Turbo Pascal. TP had a library called BGI which included some functions that let you copy a region of the screen into a pointer-based memory-block, and then blit it elsewhere. Combined with xor-blitting on a black-and-white-screen it could be used to do simple animation.

I wanted to take it a step further and make sprites. I wrote a program which drew big blocks, with controls to color them in; as you did so it reproduced these as pixels, producing a simple drawing program to create sprites, which it could then copy to memory. There was just one problem: to use these blitted sprites, they would have to be saved to a file so that other programs could read them. But TP had no way of serialising pointer-based memory allocations. The manuals flat out stated they couldn't be written to file.

I came up with a piece of code that did, successfully, write to file. And started writing a test program that blitted a sprite from my drawing program on a background - on my way to creating a game. And it worked, beautifully. The next day however, it stopped working. It showed nothing but a garbled mess. It never worked again. I created a new sprite, and it worked, perfectly - until it didn't, and it was a garbled mess again.

It took a long time but eventually I figured out what was happening. The drawing program was not, as I thought, saving the copied pixel data to file - it was saving the pointer itself. When the next program read the file, it ended up with a pointer to the same block of memory - which still contained what the last program had written there (this was on MS-DOS, memory management was non-existent). But it worked... right until you rebooted, or had run anything which had reused that same area of memory, and then you got a garbled mess because you were blitting a bunch of utterly unrelated data to the video-memory block.
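
The failure mode can be modelled in a few lines of Python. This is a toy sketch, not the Turbo Pascal code: `SharedMemory` stands in for DOS memory that outlives a program, a dict stands in for the disk, and all names are invented for the illustration.

```python
class SharedMemory:
    """Toy stand-in for memory that outlives a program and can be reused by anyone."""

    def __init__(self) -> None:
        self.blocks = {}  # address -> bytes stored at that address
        self._next_address = 0x1000

    def allocate(self, data: bytes) -> int:
        address = self._next_address
        self._next_address += len(data)
        self.blocks[address] = data
        return address


def save_sprite(memory: SharedMemory, pixels: bytes, disk: dict) -> None:
    address = memory.allocate(pixels)
    disk["sprite.dat"] = address  # BUG: writes the pointer, not the pixel data


def load_sprite(memory: SharedMemory, disk: dict) -> bytes:
    address = disk["sprite.dat"]
    return memory.blocks[address]  # only right while nothing has reused the block
```

Loading straight after saving returns the original pixels, so the round trip looks correct. Once anything else overwrites the block at that address, the same file yields whatever happens to be there.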

It should never have worked, it should never even have appeared to work (and on any real OS it wouldn't have) but it still did, and once it broke - it stayed broken.

score 0 · Answer 9 · answered Jan 10 '13 at 20:33

I have never seen a true schrödinbug and I don't think they can exist--finding it won't break things.

Rather, something changed that exposed a bug that's been lurking for ages. Whatever changed is still changed and thus the bug keeps showing up while at the same time someone finds the bug.

score 0 · Answer 10 · answered Jul 29 '11 at 11:04

This happens all the time when people use debuggers.

The debugging environment is different from the actual -- no debugger -- production environment.

Running with a debugger may mask things like stack overflows because the debugger's stack frames mask the bug.

answered Jul 29 '11 at 11:04

S.Lott


I don't think it's referring to the difference between code running in a debugger and when compiled. – Jon Hopkins Jul 29 '11 at 11:20
That's not a schrödinbug, that's a [heisenbug](http://en.wikipedia.org/wiki/Unusual_software_bug#Heisenbug). – Jul 29 '11 at 11:20
@delnan: It's at the edge, IMO. I find it to be an indeterminate thing because there are unknowable degrees of freedom. I like to reserve heisenbug for things where measuring one thing actually disturbs another (i.e., race conditions, optimizer settings, network bandwidth limitations, etc.) – S.Lott Jul 29 '11 at 11:34
@S.Lott: The situation you describe does involve the observation changing things by messing with the stack frames or the like. (The worst such example I ever saw was the debugger would peacefully and "correctly" execute loads of invalid segment register values in single-step mode. The result was some routines in the RTL that shipped despite loading a real mode pointer while in protected mode. Since it was only being copied and not dereferenced it behaved perfectly.) – Loren Pechtel Jan 10 '13 at 20:31
