58

When you arrive in the morning, you find that your software does not work anymore, even though it did when you left yesterday evening.

What do you do? What do you check first? What do you do to stop being angry and start working on your problem? Do you blame your colleagues and go directly to them? What can be done to avoid being in such a situation?

Thomas Owens
Nikko
  • 10
    Blaming is never a great idea. Since you haven't elaborated on the question or problem, it's impossible to guess that the program didn't produce the error itself. For example, if a website on a hosting server reaches its bandwidth limit, it goes down by itself. Therefore you can't blame anyone until you are sure that the code didn't behave appropriately in the first place. – Pankaj Upadhyay Aug 31 '11 at 08:09
  • 1
    well that's not on stack overflow so its more a general question. The blaming part is a bit of a joke :) – Nikko Aug 31 '11 at 08:19
  • But generalization is itself about something :P . Was asking for the scenario which led you to ask this question. – Pankaj Upadhyay Aug 31 '11 at 08:24
  • Where's the relation to software development? – blubb Aug 31 '11 at 08:40
  • 1
    I think it is something that happens a lot in software development. What do you think? – Nikko Aug 31 '11 at 08:43
  • 28
    @Nikko - it didn't work yesterday either. THAT'S what happens a lot in software development :) – Joris Timmermans Aug 31 '11 at 09:07
  • 4
    To avoid being in the situation, don't rush testing so you can deploy in the last few minutes of the afternoon. And take off your rose tinted/peril sensitive sunglasses while testing. –  Aug 31 '11 at 09:48
  • 1
    @Nikko: It may happen often, but it's not really characteristic, is it? (Stupid metaphor: Coffee stains are probably quite common in software companies too, but that doesn't make them on-topic for programmers.SE). To me, this question is more specific to 'generic workplace' than software development. (Not meant to be rude, honestly!) – blubb Aug 31 '11 at 10:49
  • Should I make it clear in the question that it is a general question about software development that may happen to anyone ? Not that it happens to me everyday :) It's not like I'm working in a field (industry) where I have no choice, I can't make mistakes. I just thought it would help people to share this experience. – Nikko Aug 31 '11 at 11:08
  • 18
    Is it something to do with DateTime.Now() ??? – Sarawut Positwinyu Aug 31 '11 at 12:27
  • 1
    There is a rare thing that I haven't seen mentioned here, but that often applies to me. The hardware can sometimes be flaky, which might cause this kind of effect. – PearsonArtPhoto Aug 31 '11 at 13:20
  • I've frequently known Virus scanners to defeat real time systems, that were pre-programmed to operate at certain times of the day... – PearsonArtPhoto Aug 31 '11 at 21:20
  • **I've removed a lot of comments that weren't related to the question at all. If you want to talk about obscure hardware issues causing problems, please use chat.** – ChrisF Sep 01 '11 at 08:14
  • This question appears to be off-topic because it is about proving to your coworkers that you're not crazy. – Ampt Oct 24 '14 at 14:46
  • could be great for http://workplace.stackexchange.com/ if you don't get your answer here; just they won't all be programmers – john mangual Oct 24 '14 at 19:09
  • How can you even consider blaming your coworkers? You do have vcs where all changes are tracked. Right?.... right?!?!? – Esben Skov Pedersen Oct 24 '14 at 20:39

21 Answers

95

The usual suspects are:

  • You thought it worked yesterday, but after a full day of work you were too blind to realize that it didn't work.

  • This morning you can no longer refer to what was in the IDE's cache memory yesterday.

  • The workstation rebooted last night, or a nightly maintenance operation cleared the /tmp directories.

  • Something has changed in the code base: check whether someone (possibly yourself) has committed changes between your last compile yesterday and your latest compile today.

  • Something has changed in the support libraries: check whether those libraries have been recompiled or upgraded. The cause may be inside the project for specific libraries or outside if a new version of an apparently independent package has been deployed.

  • Something has changed in the testing environment: new version of a virtual machine, a stub that has been modified, changes in a remote database server...

  • Something has changed in the compilation chain: changes in Makefiles, new version of IDE, of compiler, of standard libraries...
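Several of the suspects above boil down to "the environment changed under me". One way to check that, sketched here for a Python setup only, is to record a snapshot of the interpreter and installed package versions and compare it with an earlier one. The function names `take_snapshot` and `diff_snapshots` are made up for this illustration; other toolchains would need their own equivalent.

```python
import platform
import sys
from importlib import metadata


def take_snapshot():
    """Collect interpreter, OS, and installed-package versions into a plain dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that report no name
            packages[name.lower()] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": packages,
    }


def diff_snapshots(old, new):
    """Return a list of human-readable differences between two snapshots."""
    changes = []
    for key in ("python", "platform"):
        if old.get(key) != new.get(key):
            changes.append(f"{key}: {old.get(key)} -> {new.get(key)}")
    old_pkgs, new_pkgs = old.get("packages", {}), new.get("packages", {})
    for name in sorted(set(old_pkgs) | set(new_pkgs)):
        before, after = old_pkgs.get(name), new_pkgs.get(name)
        if before != after:
            # None on either side means the package was added or removed.
            changes.append(f"package {name}: {before} -> {after}")
    return changes
```

The snapshot is a plain dict, so it can be dumped to a JSON file in the evening and loaded again in the morning; an empty diff at least rules out this class of cause.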

Robert Harvey
mouviciel
  • 83
    You forgot "divine intervention" and "a high energy particle went through the server and randomly rearranged bits". And solar eruptions. – Kheldar Aug 31 '11 at 13:41
  • 17
    You forgot "you are using , which is notoriously unreliable. – configurator Aug 31 '11 at 14:35
  • 1
    Something has changed in the environment: A disk got filled with logfiles, a CD was ejected, another program exited, users stopped visiting your pages during night, a cronjob ran, ... – Konerak Aug 31 '11 at 15:49
  • 4
    @Kheldar - don't forget malicious sprites, gremlins et al. – immutabl Aug 31 '11 at 16:03
  • 5
    @configurator : that's why you should *always* write your own language. Ask Spolsky about Wasabi! _checks if Atwood is around and runs away_ – Kheldar Aug 31 '11 at 16:14
  • 1
    @Kheldar: I've had something of the sort. Back in the days of cassette tape storage a program started misbehaving--I finally found the problem was a one-bit change that was still legal, just wrong. – Loren Pechtel Aug 31 '11 at 16:16
  • 13
    another classic pitfall is the date change. This is of course especially true for "limit" dates (first or last day of month/week, 29th February, etc). – Brann Aug 31 '11 at 16:29
  • 2
    Or... or... someone called `unsaferPerformIO` (outside of your application, perhaps even in another country), and the trickling side-effects caught up with you. – Thomas Eding Aug 31 '11 at 16:53
  • 1
    @Brann This is a "phase of the moon" bug. – John Lyon Sep 01 '11 at 00:36
  • 1
    similar to 'the workstation got rebooted and /tmp was cleared' is '/tmp or /var some other relevant directory became full and thus files can't be created and written any more' – Andre Holzner Sep 03 '11 at 12:25
  • 2
    No-one seems to have mentioned a potential change in any input data/source data files or database content that breaks the code. Overnight is a favourite time to produce new data snapshots from large systems. – Stuart Sep 06 '11 at 07:22
  • 1
    Could be the .h() function. I once checked-in code, turned my head for a second and must have bumped the h key. I turn back and see the code not checked-in. I figure 'Hrm, coulda swore...guess not.' and checked the code (back) in. I went to Vegas for a work trip (I swear) and got a call about the code not working in my area. I was surprised because I tested the code and it worked fine. A bit of searching and sure enough, I was in Vegas with my team, fixing a rogue .h() function call. Beware the .h() function. It's out there, waiting. – Yatrix Oct 27 '11 at 18:11
  • you forgot. Forgetting to save .csproj after adding a file for visual studio users. – Esben Skov Pedersen Oct 24 '14 at 14:14
  • If it is a web application, it might be: I have tested it with Firefox and assumed it would work with Internet Explorer. Happened to me once and it screwed a live demo. – Giorgio Oct 24 '14 at 19:06
  • You forgot your Jewish scapegoat! – Thomas Eding Oct 24 '14 at 20:05
49

1) If it's not working today, it wasn't working yesterday either.

You thought it was working, but it wasn't.

2) There is a problem, and it must be solved.

Don't think about who's responsible for this or about blaming others.

If nothing has changed between yesterday and today (as I presume from reading your question), it means you should do a better job of testing your code before stating that it works.

To avoid this situation you have to do proper Testing and Debugging.

Define "working" and test the boundaries of your code routines.

  • Try to become one of the users who will use your program or code functionalities.
  • Push your code to the allowed limits and beyond, and actually check if it breaks.

One way of doing this is having an automated set of extensive tests run during the night, so that the day after you can check if something went wrong and fix the problems.

Jose Faeti
  • 7
    I want to give you two upvotes - One for "If it's not working today, it wasn't working yesterday either." and one for "there is a problem, and it must be solved". Both are key ideas that too many people forget. – MattBelanger Aug 31 '11 at 13:19
  • 2
    "If it's not working today, it wasn't working yesterday either." -> This happened to me yesterday doing some front end coding that relied on a cookie. Worked great when the cookie had already been set. Found out it wasn't being created properly anymore the next day once it had expired and was trying to be re-created. – Graham Aug 31 '11 at 14:45
  • @Graham: see "If nothing has changed between yesterday and today [...], it means you should do a better job at testing your code before actually state it's working.". You must be better at testing your code, think about what should happen, think about what MAY happen. Perhaps with a better understanding of the problem, it wouldn't have happened. – Jose Faeti Aug 31 '11 at 14:56
  • As for 1): Maybe the breaking change was in an auxiliary library – phresnel Aug 31 '11 at 18:49
  • Not strictly true... :P I accidentally broke an application by pulling some cache files into my application which were objects that had been serialized completely wrongly. The app was fine and it was working fine; it was just that when I did a git pull, because the cache files were on ignore, the program updated and it needed the objects in a different format. You still get the upvote though ;) – Laykes Aug 31 '11 at 22:01
  • Thinking about 1, supposing there was a date overrun triggered by today's date, but today is the first day when that date overrun occurred: could one not argue it was working yesterday and it is not working today? Or if the licence ran out on some subsidiary product. Details, yes, but I hate to leave an absolute unchallenged :) – glenatron Sep 02 '11 at 13:18
26

Trying to find someone to pass the blame is unconstructive and does not solve problems. Don't do it.

If something worked yesterday and does not work now, then either you have non-deterministic behavior (like a race condition) and having it work yesterday was just luck, or something has changed between then and now, and you need to find out what it is.

How exactly you find out which is the case and how it can be fixed depends on the specifics of the situation, but it always helps to be methodical in eliminating causes, i.e. don't change 5 things at once and stop looking if that helps - find out which specific thing caused the problem, and perhaps write down how to fix it so you can look it up when it happens again 3 weeks from now.

Using the appropriate diagnostic tools (debugger, profiler, network analysis tools) can also make a big difference.

ChrisF
Michael Borgwardt
25

I have worked with code that appeared to change overnight and after a while I came to the conclusion this was due to malevolent pixies crawling into my codebase at night and changing things in such a way that in spite of the fact it worked yesterday, it now doesn't work at all. Indeed in classic Schroedinbug style, not only does it not work now, it is transparently clear that there is no way that it ever could have.

Over time I have realised that it's just possible that in fact pixies have nothing to do with it and that possibly my "time to go home, that will be good enough" last build doesn't get the detailed testing and attention that perhaps it deserves.

My first assumption when I encounter this in the morning is that it is probably my fault as I'm usually the one responsible for my own features or corners of software I'm working on. My second assumption is that I might as well get that coffee now. If it's not something blatantly obvious that a monkey could figure out ( which it sometimes is ) then the chances are good that I have managed to drag in an old version of a library, mistakenly rolled back a file that didn't need to be rolled back or have something cached somewhere that brought it into the build without checking it. Going through my recent Source Control activity tends to reveal things I have done, cleaning the build often removes errant cached versions.

Sometimes it really is nothing to do with me - someone updated a dependency without mentioning it, WindowsUpdate installed something that changed the environment so that my code didn't work; there are a lot of background possibilities, but usually it's a case of manning up and accepting that, like most people, I am basically an idiot.

glenatron
  • 1
    This is a very humble and self-deprecating answer that many of us should adopt. :) I usually chalk these kinds of situations up to "Hey Moe, Hey Larry, I was trying to think and nothing was happening!" at the end of the day. I also use the method of "It works! Quick, check it in and go home before you have the urge to improve it" at the end of the day, to avoid these situations. – Jennifer S Aug 31 '11 at 13:48
  • 3
    One place I worked, nobody's code would function first thing in the morning. Turned out that when we booted our machines Skype would grab the port the application wanted to use when it first started up. – thepeer Aug 31 '11 at 14:22
  • Perhaps the pixie is nothing more than an uninitialized variable? Sometimes the debug version can work when the release version fails (crashes or behaves differently). – Jared Updike Sep 01 '11 at 17:34
20

Use version control. Do a diff, or use your VCS's blame functionality:

  • diff: available in every VCS. Shows you the differences between two versions.
  • blame: available in git, for example. Shows you, line by line, who changed what.

If there is no version control, apart from it being your own or your boss's fault, you can look at the change dates of files and possibly look into your OS' logging facilities.
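If you are stuck without version control, looking at change dates can be automated. The following is only a rough sketch: it walks a directory tree and lists files whose modification time falls within the last day. The function name `recently_modified` and the 24-hour window are assumptions for the example, and modification times can be misleading (a restore or a copy may reset them).

```python
import time
from pathlib import Path


def recently_modified(root, hours=24, now=None):
    """Return (path, mtime) pairs for files under root changed within the last `hours`."""
    now = time.time() if now is None else now
    cutoff = now - hours * 3600
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            mtime = path.stat().st_mtime
            if mtime >= cutoff:
                hits.append((path, mtime))
    # Newest first, so the most recent change is at the top.
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Example use (prints nothing if no file changed in the window):
# for path, mtime in recently_modified("."):
#     print(time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime)), path)
```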

Apart from that: Recompile everything, make sure to also recompile auxiliary libraries.

Of course, once you've found the source of the error, stay calm, ask why the change was made, explain your problem, and propose a solution that makes you both happy. Don't shout at her/him; that would be poison to your productivity.

If there are no changes at all, it's time to see what has changed on the system. E.g., Mac OS computers were recently updated to a new version of Apache, which made some configurations invalid.

phresnel
11

Well, here's a real life example of code that "worked yesterday" and not today... It is from earlier this month.

The application in question pulls information from a database by date, and the default behavior is to get data for the current day. This worked fine on August 8th, but failed on the 9th. It was not tested earlier than this. It would have also worked on 9th of September, and the 10th of October...

Another clue is that we are in the UK, the database in question was in the USA...

So, my answer to your question about what to check first is to double check how you format your dates, because if you mix up the day and the month fields it will work perfectly, but only on 1 day per month:-)
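The day/month mix-up is easy to reproduce. Here is a small sketch, not the code from the incident: it formats a date the UK way (dd/mm/yyyy) and parses it back the US way (mm/dd/yyyy). The helper `round_trip` is invented for the illustration.

```python
from datetime import date, datetime


def round_trip(day):
    """Format a date UK-style (dd/mm/yyyy), then parse it US-style (mm/dd/yyyy)."""
    text = day.strftime("%d/%m/%Y")
    try:
        return datetime.strptime(text, "%m/%d/%Y").date()
    except ValueError:
        return None  # e.g. "13/08/2011": there is no month 13


for day in (date(2011, 8, 8), date(2011, 8, 9), date(2011, 8, 13)):
    print(day.isoformat(), "->", round_trip(day))
```

The 8th of August survives the round trip, the 9th silently becomes the 8th of September, and the 13th fails outright. Exchanging dates in an unambiguous form such as ISO 8601 (`date.isoformat()`) avoids the whole class of problem.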

Steve
5

Fix the bug (however you normally do). Then if you find who caused it send them a polite email letting them know what went wrong.

Every coder makes mistakes, and if you start blaming, it will seriously backfire the next time you do the same thing. (Possibly even this bug was yours.)

Only if you suspect them of being regularly careless should you make a big deal out of a couple of bugs.

Tom Squires
5

...you run regression tests and focus on those that fail.

Actually, that is what you forgot to do yesterday before leaving. It happens.

You don't have any? OK... what were you saying? Blaming? Well... that might work, then.

ZJR
5

The first thing to do when something stops working is to ask yourself - What's different? What has changed?

When something worked last night but fails this morning, the one thing that obviously has changed is - the date and time :)

I'd try and think whether there's any part of the logic I'm working on that depends on dates and might be affected by the passing of time. It's surprising how many times that's the cause of such problems.
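One way to hunt for this kind of bug is to pass the date into the logic as a parameter instead of reading the clock inside it, so that yesterday and today can both be replayed. The sketch below uses two hypothetical functions, `naive_next_month` and `safe_next_month`, purely as an example of date-dependent logic; the first one is buggy on purpose.

```python
import calendar
from datetime import date


def _following_month(today):
    """Return (year, month) of the month after `today`."""
    if today.month == 12:
        return today.year + 1, 1
    return today.year, today.month + 1


def naive_next_month(today):
    """Buggy on purpose: assumes next month has a day with today's number."""
    year, month = _following_month(today)
    return today.replace(year=year, month=month)


def safe_next_month(today):
    """Clamp the day to the length of the target month."""
    year, month = _following_month(today)
    last_day = calendar.monthrange(year, month)[1]
    return today.replace(year=year, month=month, day=min(today.day, last_day))


# Replay "yesterday" and "today" without touching the system clock.
for today in (date(2011, 8, 30), date(2011, 8, 31)):
    try:
        print(today, "naive ->", naive_next_month(today))
    except ValueError as exc:
        print(today, "naive -> fails:", exc)
    print(today, "safe  ->", safe_next_month(today))
```

The naive version works on 30 August and raises on 31 August, because September has no 31st. Nothing in the code changed overnight, only the date.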

If that fails, you should definitely follow up on the other great advice supplied here.

urig
  • 2
    Bugs that involve time peculiarities such as switching into/out of daylight saving are favourites (in October and March)... – Julia Hayward Oct 24 '14 at 14:54
4

A kinda short answer (to write) but kinda long to get the gist of it: Why Programs Fail: A Guide to Systematic Debugging by Andreas Zeller (which might look a bit too academic but it's not)

Shady M. Najib
4

You look in your mailbox for the mail sent by the Continuous Integration engine when the unit test(s) failed (or at the log page, if you weren't watching that specific problem), and see who did the check-in just before that build.

Then go talk to him or her.

4

There are only two possible reasons why your code fails today, but worked yesterday.

Look at the data

There's something in the data you didn't test and/or account for. Either the data isn't validated properly, or an error in logic wasn't revealed until a condition you didn't anticipate occurred. This means the bug was there yesterday, but it was hiding from you under valid data.

I once had some order entry code running fine for weeks. I went home one day, and it died. Investigation the next day revealed that I had a bug hidden in a chain of function calls. In a weakly typed language, I declared an integer when I should have used a long int. The language did the conversion between the two automatically until it couldn't because the number exceeded what would fit into an integer. The system failed on order number 32768.
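The number 32768 is exactly one past the largest signed 16-bit integer, 32767. The original language and code are not shown, so the following is only an illustration of that boundary in Python, using `struct` to force a value into a 16-bit field. `store_order_number` is a made-up name.

```python
import struct

INT16_MAX = 2**15 - 1  # 32767, the largest signed 16-bit value


def store_order_number(n):
    """Pack an order number into a signed 16-bit field, as a narrow integer type would."""
    return struct.pack("<h", n)


for order in (32766, 32767, 32768):
    try:
        store_order_number(order)
        print(order, "fits")
    except struct.error as exc:
        print(order, "fails:", exc)
```

A test that feeds in values at and just beyond each type's limit would have exposed the bug long before the real data got there.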

Look at What Changed

Look at what changed since it worked. Did the IT section push out an OS update? Did another coder modify code that your program uses? Did the user's permissions change? Often, if you find what changed, you'll find the bug.

Andrew Neely
3

Binary chop

works especially well for difficult JavaScript errors. Basically, comment out half the code and see if you still get the error. If you do, it's in the half that is still active; if not, it's in the half you commented out. Halve it again and carry on.

If your code is well encapsulated this is a fantastic, time saving, stress busting tool.

Once you've found the guilty code, it's often worthwhile isolating the error on its own test page.
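The same halving idea can be written down as a small search routine. This is a sketch under a strong assumption: there is a single culprit, and once it is included the failure is always present. The names `first_bad` and `is_broken` are invented here; `is_broken` stands for whatever manual or automated check tells you whether the error shows up with a given set of pieces enabled.

```python
def first_bad(items, is_broken):
    """Return the index of the item whose inclusion first makes is_broken(prefix) True.

    Assumes is_broken([]) is False, is_broken(items) is True, and that the
    failure persists once the culprit is included.
    """
    lo, hi = 0, len(items)  # invariant: items[:lo] works, items[:hi] is broken
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(items[:mid]):
            hi = mid  # culprit is somewhere in the first `mid` items
        else:
            lo = mid  # first `mid` items are fine, look further on
    return hi - 1


# Toy demonstration: eight changes, one of which breaks things.
changes = [f"change-{i}" for i in range(8)]
culprit = "change-5"
checks = []


def is_broken(enabled):
    checks.append(len(enabled))  # record how many checks were needed
    return culprit in enabled


print(changes[first_bad(changes, is_broken)], "found after", len(checks), "checks")
```

The number of checks grows with the logarithm of the number of pieces, which is why this saves so much time. Applied to commits rather than blocks of code, it is the search that `git bisect` automates.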

Jose Faeti
chim
  • of course if your project spans multi-files this can be extended by \*cough*randomly*cough* deleting half of your project's files, that definitely is an effective stress busting tool (make sure you've got a backup though). – Lie Ryan Sep 01 '11 at 07:49
  • Yep, definitely make sure you've got a back up! – chim Sep 05 '11 at 10:32
3

And of course, what can be done to avoid being in such a situation?

Addressing this question, you might want to look into Continuous Integration (CI). Simply put: CI is a process where developers frequently (as many as several times a day) integrate and test all code. The idea is that changes to one module that break another module are quickly found.

In practice, most teams that employ CI use a CI Server (see: Wikipedia's list). The CI Server is usually set up to monitor the SCM repository and start a build when it sees changes. When the build is complete, it runs a series of automated tests and posts the results of the build and tests via e-mail and/or a web page, along with the changes that triggered the build. Hopefully, when something does break the build or tests, you have only a very small change set to look at, so it gets solved more quickly.

There are other questions here about which CI Server to use, so I'll let you find them if interested. Personally, I am a big fan of Jenkins.

[What should I do about things being broken.]

As others have already said, find out what broke and try to fix it. Spending time trying to place blame is time spent not solving the problem.

jwernerny
  • Yes at work we use Jenkins and it's really useful. We can monitor builds on different systems and see right away what fails. We even have a real garage beacon that starts blinking when a build fails. – Nikko Aug 31 '11 at 20:50
3

My natural reaction is always to blame others but over time I have come to realise that it is usually me who is at fault. In addition to all the excellent comments above, it is important that you record for yourself what the final reason was. It doesn't matter whether you use a Wiki which is shared with other team members, a private Twiki, Evernote, a log book or a good memory. The important thing, at the moment you find the answer (and want to get back to work!) is to record the reason.

Ant
2

Presumably, if it's no longer working, you have identified the symptoms of it not working, i.e., it hangs or throws a particular error dialog back at the user.

If the sole description of the problem is "it's not working", the first thing you need to do is gather more information on the symptoms of the problem.

Then you start looking for possible causes, either via logs or attempted recreation of the problem or a combination of both - depends on how your system is set up I guess.

Then you start ruling them out.

temptar
2

That's what usually happens when I take holidays :-)

More seriously, I'd first tell them:

  • I will look into it to see what's wrong and what the root cause could be

  • I will touch base in 30-60 minutes, once I've had a chance to see what's happening

After that time, I can hazard an estimate of what might have happened and how long it will take to fix it, if it's not already fixed, and, if applicable, what data we may have lost (but I have good backups, so hopefully that never happens).


As for the blaming part:

  • if it's just a colleague's typo, there is no need to mention it: shit happens, and the fright from the bug most likely taught him a lesson, so hopefully he won't do it again.

  • if he intentionally did something I told him not to (e.g. give the root password of the production server to the new guy and tell him to make a modification on it directly without supervision; yes, that already happened...), then I have to mention it.

wildpeaks
2

If your usual bug tracing methods don't work and everything is a total mess, it can be wonderful to have a backup that you can restore easily.

This is what I run locally, automatically every hour from 8am to 6pm:

rdiff-backup /path/to/mystuff /path/to/mybackup

Simple, eh ?

If you ever have to restore anything, use

rdiff-backup -r 24h /path/to/mybackup/specific/dir /tmp/restored

rdiff-backup only stores the files that differ. You can use rdiff-backup on Linux, Mac and Windows.

Of course, this shouldn't be your only backup. But it's an extremely easy and cheap way to have a local backup.

Now, I wouldn't recommend this as a normal bug fixing method, but if everything else fails, it's a fallback.

olafure
  • 3
    version control is easier – thepeer Aug 31 '11 at 14:26
  • @thepeer: absolutely agreed. However, there are things that resist source control (especially on the micro-commit schedule), such as large binary files. I'm just happy I can avoid such projects _most of the time_ – sehe Aug 31 '11 at 20:44
  • @thepeer: I didn't really think someone would consider this as an alternative to version control. That would be my idea of an organized chaos :) This way you have a copy of your stuff like it was yesterday. No matter who committed what and when to the version control system. Your last commit might also have been more than 2 days ago. Some projects have certain files ignored from version control. – olafure Sep 01 '11 at 12:14
  • @sehe: with git, which I am currently using, you have your own personal repo so there's no excuse not to commit at every step of the way. – thepeer Sep 02 '11 at 10:57
  • @olafure: any decent version control system should allow you to checkout / clone the complete state of the system at any given point. – thepeer Sep 02 '11 at 10:59
  • @thepeer: inverted my whole point: I'm the micro-commiting type myself; However, with git+large binaries that doesn't pan out. It doesn't matter how local that repo is. If it's several gigs, it hurts – sehe Dec 18 '11 at 22:28
2

The bug may have already existed, but been hidden by external factors, or deep system issues.

This happened to me. A bug developed between 2 builds of our project. Literally the only change we had made was to update to a more recent build of one of the underlying libraries.

Naturally we blamed them. But the only change they had made was to refactor some headers for a faster compile. I agreed that that should not have broken the system.

After much debugging it turned out that the problem was a rogue pointer bug that had been latent in my code for years. Somehow it was never triggered until their refactoring had changed the arrangement of the executable.

Matthew Scouten
1

It was working yesterday because it was being used correctly.

You find that other people use things in ways they're not supposed to, which is a good way of breaking stuff.

It's always good to update code early in the day, as this leaves you with a good testing environment.

Backup!

Robert
-2

I find setting breakpoints to pause and check my data very helpful, to pinpoint exactly where and how it's going bad.

DKC