115

I am confused because in quite a few places I've already read that the so-called 'boneheaded' exceptions (ones that result from bugs in code) are not supposed to be caught. Instead, they must be allowed to crash the application:

At least two of the three above people are established authorities.

I am surprised. Especially for some (important!) use cases, like server side code, I simply can't see why catching such an exception is suboptimal and why the application must be allowed to crash.

As far as I'm aware, the typical solution in such a case is to catch the exception, return HTTP 500 to the client, and have an automatic system send an emergency e-mail to the development team so they can fix the problem ASAP - but not to crash the application (that one request must fail, there's nothing we can do about it, but why take the whole service down and make everyone else unable to use our website? Downtime is costly!). Am I incorrect?
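For example, in ASP.NET Core my understanding is that this can be done with a small piece of middleware, roughly like the sketch below (IAlertSender and its NotifyDevTeamAsync are hypothetical "e-mail the dev team" helpers of mine, not something the framework provides):

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

// Sketch: fail the one request with a 500, alert the developers, keep the server up.
public class ErrorAlertMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<ErrorAlertMiddleware> _logger;
    private readonly IAlertSender _alerts; // hypothetical notification service (assumed singleton)

    public ErrorAlertMiddleware(RequestDelegate next,
        ILogger<ErrorAlertMiddleware> logger, IAlertSender alerts)
    {
        _next = next;
        _logger = logger;
        _alerts = alerts;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context); // run the rest of the pipeline
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Unhandled exception for {Path}", context.Request.Path);
            await _alerts.NotifyDevTeamAsync(ex); // hypothetical "emergency e-mail" helper
            if (!context.Response.HasStarted)
                context.Response.StatusCode = StatusCodes.Status500InternalServerError;
        }
    }
}

It would be registered with app.UseMiddleware<ErrorAlertMiddleware>() early in the pipeline.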

Why am I asking - I'm perpetually trying to finish a hobby project, which is a browser-based game in .NET Core. As far as I'm aware, in many cases the framework does for me, out of the box, precisely what Eric Lippert and Stephen Cleary are recommending against! That is, if handling a request throws, the framework automatically catches the exception and prevents the server from crashing. In a few places, however, the framework does not do this. In such places, I am wrapping my own code with try {...} catch {...} to catch all possible 'boneheaded' exceptions.

One such place, AFAIK, is background tasks. For example, I am now implementing a background ban clearing service that is supposed to clear all expired temporary bans every few minutes. Here, I'm even using a few layers of all-catching try blocks:

try // prevent server from crashing if boneheaded exception occurs here
{
    var expiredBans = GetExpiredBans();
    foreach(var ban in expiredBans)
    {
        try // If removing one ban fails, eg because of a boneheaded problem, 
        {   // still try to remove other bans
            RemoveBan(ban);
        }
        catch
        {

        }
    }
}
catch
{

}

(Yes, my catch blocks are empty right now - I am aware that ignoring these exceptions is unacceptable; adding some logging is perpetually on my TODO list.)
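If and when I finally get to it, I assume the logging would look roughly like this (a sketch; _logger is assumed to be an ILogger<BanClearingService> injected by the framework):

try // prevent server from crashing if boneheaded exception occurs here
{
    var expiredBans = GetExpiredBans();
    foreach (var ban in expiredBans)
    {
        try // if removing one ban fails, still try to remove the other bans
        {
            RemoveBan(ban);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to remove an expired ban; continuing with the rest");
        }
    }
}
catch (Exception ex)
{
    _logger.LogError(ex, "The ban clearing run failed entirely");
}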

Having read the articles I linked to above, I can no longer continue doing this without some serious doubt... Am I not shooting myself in the foot? Why / Why not?

If and why should boneheaded exceptions never be caught?

gaazkam
  • `RemoveBan()` would probably be described as an example of a **"_vexing exception_"**, where its failure shouldn't actually throw an exception. [The article](https://blogs.msdn.microsoft.com/ericlippert/2008/09/10/vexing-exceptions/) gives an example of how `int Int32.Parse(string)` used to throw when it failed to parse an `int` from a (typically user-provided) `string`; this was a poor design choice, and it was later fixed to be `bool Int32.TryParse(string, out int)`, which instead of throwing on failure to parse, would `return false;`. Likewise, you'd probably want `TryRemoveBan()`. – Nat Jan 04 '20 at 18:43
  • @Nat Not sure why? A user is temporarily banned, their ban expired, this ban must be removed, no ifs and buts? Worst case scenario, there is a race condition and the ban gets removed anyway before `RemoveBan`'s `db.SaveChanges` succeeds, so we have an exception that I guess qualifies as exogenous and should be ignored. Boneheaded exceptions from `RemoveBan` are, however, still a problem – gaazkam Jan 04 '20 at 18:49
  • Yup, that'd be an exogenous problem if it's a synchronization issue that can't be properly addressed within the scope of the program's own logic. I guess the author would argue that you should just fix the boneheaded problems rather than catch them; but, if that's not a possibility, I guess that their advice doesn't apply. The worrisome thing about that scenario is that, if the program logic can't be trusted to not have boneheaded mistakes that throw exceptions, it'd seem like it couldn't be trusted to not have boneheaded mistakes that'd do something bad without throwing. – Nat Jan 04 '20 at 18:54
@Nat Yes, the cause of the boneheaded exception must be fixed, I agree. Problem is, let's be realistic, fixing it will likely take a whole day at least, given this is a hobby project. The question is whether any runtime-detected bug should fail just this one request or incur a day-long downtime of the whole service. – gaazkam Jan 04 '20 at 18:59
  • You should add `// TODO` comments in those empty catch blocks. – RonJohn Jan 05 '20 at 05:20
  • 64
    This is a bit OT, but I strongly suggest to fix all empty catch blocks ASAP. Any such empty block can cost your hours when something goes wrong and you have no clue what and where. Just add *any* logging *now*, e.g., write a trivial class with a static method printing somewhere.... switching to something better later is trivial. – maaartinus Jan 05 '20 at 13:44
I've not dealt with an application that had any sort of "ban" functionality, but I would have thought that you don't need to *remove* expired bans. I'd expect you hand out a ban with some length, and actions that fall under the ban's purview (e.g., logging in) will simply fail based on date. Once the date advances past the ban length, the action now succeeds. No removal needed unless it's prematurely lifted. – VLAZ Jan 05 '20 at 20:06
  • 30
    "logging is perpetually on my TODO list" is precisely why you should never catch boneheaded exceptions. I would also argue to never put `// TODO` comments in either. You never actually get back to properly handling either. (I was just recently burned wasting about 2 days in legacy code that swallowed a NPE in an empty catch block.) – Dave Rager Jan 06 '20 at 00:15
@DaveRager my personal policy for `//TODO` is to avoid committing them. It's not a very strict policy but the idea is that I'd place copious `//TODO` comments but I don't want them in the final released code. Either actually *do* them or remove them. Removal should be accompanied by an issue-tracking ticket or just understanding that this will likely never be done (e.g., unneeded feature). I do have some `//TODO` comments left in the code base even after that but the number is generally low. I'd prefer to be even stricter than this but...it's not usually worth it for me. – VLAZ Jan 06 '20 at 06:48
  • @VLAZ So you don't commit until ALL `//TODO`s are finished? – Tvde1 Jan 06 '20 at 12:00
  • @Tvde1 in Git, I do make commits but then only push them when I'm done. I tend to re-write the history to clean it up before a push anyway. Although sometimes I do have to push a `TODO` - no point in letting code rot on my machine for more than a few days without a push. As I said, it's not a very strict policy. – VLAZ Jan 06 '20 at 12:02
  • surely we have a few duplicates of this question : https://softwareengineering.stackexchange.com/questions/164256/is-catching-general-exceptions-really-a-bad-thing – Ewan Jan 06 '20 at 12:27
  • 1
    This should be required reading for all software engineers: https://www.martinfowler.com/ieeeSoftware/failFast.pdf. – JimmyJames Jan 06 '20 at 17:19
  • 1
    In C, we actually use a macro called `assert()` *to purposefully, and immediately crash the process*, without giving any other part of its code *any* chance of recovery. Boneheaded at its best. The effects of this are: Any error that triggers an `assert()` *will* be considered by a human, the programmer inspecting a failed `assert()` gets directly pointed at the spot where the error was detected, and programmers will work hard to make sure no `assert()` is ever triggered. This ensures that the process will never work with corrupted data that wouldn't pass the `assert()` calls in the code. – cmaster - reinstate monica Jan 06 '20 at 20:49
  • 10
    Serious question: **Why isn't the try-catch inside `GetExpiredBans` and `RemoveBan`**? The assumption here seems to be "these can fail arbitrarily at any time and we can always ignore that failure". OK, if that's the case then put the try-catch *inside them* and then the *caller* doesn't need them. Why did you decide to put the try-catch in the caller, rather than the callee? – Eric Lippert Jan 06 '20 at 20:55
  • 13
    Second serious question: How often do those catch blocks execute? That is, how often are you genuinely catching a "boneheaded" exception such as a null dereference? If the answer is "zero times per year" then the catch blocks are unnecessary. If the answer is not zero then *fix all the bugs until the answer is zero*, and then the answer will be zero. – Eric Lippert Jan 06 '20 at 20:59
  • 6
    I feel it important to point out something I haven't seen (m)any answers or comments on this page address. Yes, you should allow boneheaded errors to just happen, because hiding them means hiding a bug in your code. However, the reason they are called "boneheaded" is because they are your fault and should never have happened in the first place. If you find that a boneheaded error gets thrown by your server, don't just wrap it in a try/catch and attempt to "fail gracefully". Rewrite your code so it would never get thrown in the first place. – Abion47 Jan 07 '20 at 00:45
  • @EricLippert Re 1: These are not supposed to fail arbitrarily any time, but unless I make a mathematical proof of the correctness of their implementation I can never be certain they will not fail unexpectedly. Can we always ignore their failure? That depends on the failure, hopefully they won't corrupt the db, but the problem is that - AFAIK - in ASP.NET an exception from a background task takes the whole server down, which stops responding to requests. Which is why I thought I should wrap the entire `ExecuteAsync` code in `try`/`catch` - though probably I should remove the inner `try`/`catch` – gaazkam Jan 07 '20 at 10:59
  • 1
@EricLippert Re 2: Zero times per year is the goal obviously, however, I can never be sure because no testing can ever prove correctness. Anyway, the question will only be answerable once the game has been deployed for at least a year, which is not yet the case. I intend to fix all bugs until the answer is zero as soon as I know about these bugs, while at the moment I am trying to defend against the (always existing) possibility of *unknown*, yet still present bugs. – gaazkam Jan 07 '20 at 11:03
  • 1
    The short version is that the crash is the logging you were too lazy to add. – AI0867 Jan 07 '20 at 13:27
  • @AI0867 that's so right ! – beppe9000 Jan 07 '20 at 19:36
  • 1
    If your theory is that no amount of testing can prove correctness, and you cannot review the code for correctness, and that it is always fine to keep stumbling along when your program fails arbitrarily, then why is there not a try-catch around *every line of code*? I'm not trying to be a jerk about this; I'm trying to understand your attitude here because it is very strange to me. You seem to believe that there is a *specific granularity of code* that it makes sense to impose an "on error resume next" semantics, but I don't understand why you arrived at that granularity instead of another. – Eric Lippert Jan 07 '20 at 22:39
  • As you note in the question, for most people the granularity is "assume that an unexpected failure in the game does not violate the invariants **of the server** and keep the server up". That's an assumption that sounds more like an intuition than a principled argument, but it seems plausible. But you seem to think that the granularity of "recoverable failures" is some statements but not others, and I'm trying to understand why that is. – Eric Lippert Jan 07 '20 at 22:40
I think this rule is misunderstood. You shouldn't catch boneheaded exceptions, but unless you're a hardware developer, something else always will catch the exception. In a web application, the framework will catch the unhandled exception; below the framework, the web/application server will catch it; in a regular executable, the OS will catch it (and likely terminate your program); in kernel space, the hardware will "catch" the error (by setting the right flags or raising interrupts). Boneheaded exceptions are simply those you let the higher level deal with. – Lie Ryan Jan 08 '20 at 13:34
  • @EricLippert *for most people the granularity is "assume that an unexpected failure in the game does not violate the invariants of the server and keep the server up"* - yes, and this is what the outer layer of `try`/`catch` is for; but having read your text about 4 exception types I was understanding that you were arguing to crash the server and incur downtime in the service until the problem is manually fixed. So I got confused and asked this question. – gaazkam Jan 08 '20 at 22:38
As for the inner layer of `try`/`catch`... I think the answer to your question is that this granularity is meant to break each operation into subtasks such that no task fails because another one failed - just like in the case of processing requests: they should be independent from each other, so failing one request shouldn't entail failing any other request (as would happen if the server died). – gaazkam Jan 08 '20 at 22:38
Lifting expired bans seemed to me to be similarly independent from each other as processing different requests. Answers showed me, however, that I may have been wrong, since an invalid attempt to lift a ban risks corrupting the db. Though again, perhaps an invalid attempt to process a user's request (server not dead) incurs a similar risk... – gaazkam Jan 08 '20 at 22:38
  • Think about it this way. The purpose of the exception handler is to maintain a program invariant in the event that there is an exception. **But the existence of the boneheaded exception is evidence that program invariants are already violated in a manner that you do not know how to mitigate**. So now you're in a situation where no matter what you do, program invariants have been violated. The program is no longer in a state where it behaves predictably. If you can shut down the program without shutting down *the server*, then great, do that. – Eric Lippert Jan 08 '20 at 22:46
  • 1
    Now, as for the question about "which is better?" let's at least phrase the question correctly. The question is: which is better: (1) crash the server and incur downtime until it is fixed, or (2) *continue to run a server that is running a program that you don't know what it is doing because its invariants are violated*. Whether you think it is better to run a program that does *something but you don't know what*, or run no program at all, well that is up to you. My choice would be "no program at all". – Eric Lippert Jan 08 '20 at 22:49
  • 1
    @EricLippert Many thanks for your explanations! – gaazkam Feb 07 '20 at 22:04

17 Answers

243

Silent But Deadly

When writing enterprise software, you will eventually learn an essential truth: the worst bug in the world is not one that causes your program to crash. The worst bug in the world is one which causes your program to silently produce a wrong answer that goes unnoticed but eventually produces a massive negative effect (with severe financial implications for your employer). Thus, error messages and crashes are A Good Thing™, because they indicate that your program detected a problem.

Amazing Grace

Now, this seems to conflict with another enterprise virtue, which is "degrade gracefully". Blowing up and not returning any response at all hardly looks like "graceful degradation". And this is why many folks will try very hard to return some response, if they can. Indeed, this is why many frameworks, like Spring, will catch all top-level exceptions and wrap them with a 500 response, as you describe. In general, I think this is OK. After all, most exceptions don't really require a restart of the entire app server if you can just kill/restart a server thread. A sane framework will be careful to not catch Java Errors, like OutOfMemory, for obvious reasons.

But there is one more point to consider: once you get beyond a single server, you will likely have a load balancer in front of your service. And when the LB times out or gets a closed connection, it will generally return a 500 to its client. Thus, the LB will often transform your "server crash" into a client 5xx automatically! Best of both worlds.

Worst Case

In your scenario, what is the worst that can happen if you don't catch the exceptions? Your answer: "Well, my game server dies, and nobody can play!!!" But that's not the worst case. The worst case is, everyone is playing your game, but griefers are ruining it. Players file a bug report and tell you that bans aren't working, but you look at the logs and everything looks fine. Or, legitimate players are getting banned by griefers, and instead of being able to rejoin in a timely manner, the bans are lasting indefinitely, because your server happily ignores failures. The worst thing isn't your game crashing. It's your player trust crashing. Good luck trying to reset that.

Lawnmower Man
  • 106
Excellent answer. With a background in safety critical systems I can add one scenario to your arguments: even in e.g. a plane it is better to crash the computer, leaving the pilot with NO information, than taking the risk of showing WRONG information. The problem has to be solved on a completely different level, e.g. by redundancy. – Hartmut Braun Jan 05 '20 at 10:15
  • 2
The problem with this argument and the argument in the other direction (that crashing is the worst case) is that they miss probability etc. How likely is the worst case, and how many other errors are there? If the system is complex and a lot of small minor bugs could be indicated by an exception, then by crashing directly on each such exception you elevate often-occurring minor bugs to fatal bugs. There might be scenarios that are worse than a crash, but they might as well have a tiny chance. So it's all about risk assessment. – Frank Hopkins Jan 06 '20 at 02:27
  • 2
And btw. there are ways to recover from errors as described: you can use backups to restore an earlier state, hand out goodies to players that were active and are considered honest, you can "punish" bug users etc. pp. There are always other ways to mitigate impacts from bugs. That's part of community management. Btw. please consider the comments not as bashing or the like, but as additional information; pointing out that crashing isn't necessarily the worst that could happen is a very valid point! Feel free to incorporate parts or not as you wish. – Frank Hopkins Jan 06 '20 at 02:30
  • 24
    It depends on the situation. I'm thinking of the first launch of the Ariane V. An unchecked exception took out the rocket when the offending problem would have had absolutely no effect on the mission. When there's real hardware being controlled barfing is sometimes the wrong answer. (The calculation that caused the /0 error was actually only of value while sitting on the pad, it became moot in flight.) – Loren Pechtel Jan 06 '20 at 03:43
  • I agree that reality demands more nuance than a simple rule. However, with a good test environment, uncaught exceptions produce a highly visible record of failure, and allow them to become part of your acceptance criteria. Combine that with 100% unit test coverage which exercises all the exception paths, and you get a lot of robustness without a lot of hand-built tests. It's an entire testing philosophy which I don't have the space to describe here. – Lawnmower Man Jan 06 '20 at 06:01
  • 1
    Why not just log the exceptions to a log file/log DB? Taking down an entire application over an uncaught exception is foolish. – Hugo Jan 06 '20 at 09:58
  • 7
    @HugoZink usually a good idea, but depends on the application and the problem. If it's something that affects a single request, log it, alert sysadmins, and return an error value to the client. If it's something that causes the entire system to become unreliable (memory error, core configuration corruption, etc.) take down the system. There's no one-size-fits-all solution. – jwenting Jan 06 '20 at 10:11
  • 7
    The second worst case is that your servers are always down and so you lose player trust. Which is worse: having griefers run amok for a day, or having nobody play for a day? That depends on how much persistent state your game has, probably. If it's a Quake-style FPS with nothing saved between rounds, it'll be "hey remember that time half the other team could fly?". If it's a Minecraft-style building game, it'll be "hey remember that time we lost three months of work?" – user253751 Jan 06 '20 at 11:36
  • @HugoZink because you have no idea what might happen. It could have failed in such a way that all current requests will get the response meant for someone else (which could be a very expensive breach). This can happen in various async webs servers when trying to "handle" every possible exception. – OrangeDog Jan 06 '20 at 14:15
  • 7
    I can think of much worse cases than that, involving lawsuits and criminal charges. – OrangeDog Jan 06 '20 at 14:16
  • Let It Crash -- so your process dies. Your process monitor (you have one, right?) or equivalent will restart it. Your app gets a clean start. If it's caused by an external issue, well, external dependencies should be wrapped in circuit breakers. Worst case is the app keeps crashing and the process monitor stops restarting the app. Still better than going on in an unknown (unknowable?) state. – Rob Crawford Jan 06 '20 at 18:46
  • 1
    I note that in .NET there are some fatal exceptions that cannot be "caught and killed". For example, a stack overflow exception can be caught by a catch block, but if the catch block does not re-throw it, it is re-thrown automatically. – Eric Lippert Jan 06 '20 at 21:02
  • 1
@FrankHopkins If a minor bug can be elevated to a fatal bug, then it isn't a minor bug. It's best to crash and make the problem immediately and painfully obvious so it can get fixed as quickly as possible than to run the risk of it happening for months or even years before anyone catches on. It's much easier to rebuild your community's trust if it was harmed because of a sketchy weekend rather than if it was because their accounts slowly got irrecoverably corrupted due to a "minor" rounding error over the course of 5 years resulting in them having to start over from scratch. – Abion47 Jan 07 '20 at 00:55
  • 1
It's only elevated by your decision to blindly crash. And yes, it's harder to recover from a serious issue than from an outage of an hour, but a 5-hour outage every week may be way worse than not being able to use a certain ship in-game, not being able to talk to a certain NPC, or a text not being displayed. Whether this is the case obviously depends on the rest of the system, outside expectations etc. pp. But the blanket statement ignores things nearly as much as the counter-argument. – Frank Hopkins Jan 07 '20 at 01:00
  • 1
    @Abion47 And I've seen quite a few game servers run that just logged a lot of exceptions and kept going. The users might have rightly complained about the bugs, but they kept playing (and the bugs were fixed with the next release). If a game is repeatedly not available it can be quickly dead as users often directly look for something else. (What helps of course if you can roughly separate exceptions that likely have small impact from those that have major impact which might be hard to tell). – Frank Hopkins Jan 07 '20 at 01:01
  • @FrankHopkins So if a user tries to move a ship and the server threw a NRE, how is the game or the server supposed to handle that situation? It responds with a 5xx error, telling the game that it's an illegal move? It reattempts the action with the same state that caused the NRE to be thrown in the first place? It trucks on through the NRE and tries to come up with a "close enough" solution? If the game state threw an NRE, then the game state is diseased. Dump the state, log the error, and then (if possible) migrate the player to another fresh server. – Abion47 Jan 07 '20 at 01:08
  • @FrankHopkins If you're using prod to test, then yes, you will suffer. Running a proper test server means that bugs introduced by a new release are highly visible, and test players can report them immediately. Running in test for X hours without crashes gives you much more confidence in your release than if you treat problems more silently and hope you are sufficiently diligent to go fix them...if you even know they exist. – Lawnmower Man Jan 07 '20 at 01:08
  • @FrankHopkins Boneheaded exceptions are just that - boneheaded. If your server threw one, it's your fault, and maybe you deserve a handful of lost players because you clearly didn't test well enough. In any case, I would much rather lose a handful of players because of a buggy few hours than lose my entire community because of a long-standing bug that went unaddressed over my game's entire lifespan. – Abion47 Jan 07 '20 at 01:09
  • 1
    @LawnmowerMan rofl, there are many environments and types of bugs you will not find on the test environment under real life constraints given the resources you have. – Frank Hopkins Jan 07 '20 at 01:10
@Abion47 As I said, from experience, never lost a whole community due to a serious bug. And yes striving for quality is good, but having realistic outlooks and aiming for what likely will happen taking into account other solutions than the system alone is also helpful. As for the movement thing: If it's a multiplayer server, the player simply will not be moved and if the same happens on the next click and the next and the next he hopefully tries to log out and in again and that resolves the issue for him, or he files a bug report in the wider sense and the issue gets fixed asap. – Frank Hopkins Jan 07 '20 at 01:13
  • @Abion47 without that server going down for each and every other player. Now if you have a single player instance, sure, kick him out, restart their game etc. I'm not saying crashing is never good/right. I'm just saying the evaluation purely based on the worst case can easily mislead in each case whether you argue for or against crashing. – Frank Hopkins Jan 07 '20 at 01:14
  • @FrankHopkins That depends on your application and users. A good testing strategy includes something like fuzz testing, where you generate a lot of typical inputs, but also a good number of unusual ones, to push your process into most of the state space. Also, "crash" is itself ambiguous, and includes everything from "total service unavailability" to "helper thread terminated". I don't advocate that you induce total unavailability in the presence of an unexpected exception, but I would still rather pay early to see a bug than late, because the cost usually increases dramatically. – Lawnmower Man Jan 07 '20 at 01:15
  • @LawnmowerMan Oh sure I know about different testing approaches and they sure can reduce the likelihood of something slipping through, but you are rarely running a software in particular if it is "just" gaming where you want to spent the resources to be absolutely sure the new version doesn't break anything under any circumstances. And yes, also, in general agreed, rather fail early in the testing. I'm just saying, if you look at crashing vs. logging and continuing a blank statement either way can be misleading, because the details, context and other problem resolution measures matter. – Frank Hopkins Jan 07 '20 at 01:18
  • @LawnmowerMan i.e. imho there are no easy always right answers (except likely for "never just swallow at least log"), just good guidance rules, but imho it should be made clear that they are exactly that and proper evaluation depends on your environment and likely scenarios. – Frank Hopkins Jan 07 '20 at 01:22
  • 3
    @FrankHopkins I'm going to take a step back here. We are talking about ***boneheaded*** exceptions. As in NREs, OutOfRange, DivideByZero, etc. These are errors that, with minimal foresight, should never have happened to begin with. If they _don't_ crash your server, then that means you have made the conscious decision to allow, catch, and log them, which is indication of code smell. Boneheaded errors should not be caught, because that implies the engineer in charge of that code was too lazy to actually fix the error. – Abion47 Jan 07 '20 at 01:23
  • @FrankHopkins And to be clear, when I say "crash the server", I don't mean SIGFAULT the entire machine and require it to cold reboot. It could mean anything from causing the one request to return with an error message to stopping the one game instance/thread. There are plenty of ways to "crash the server" that don't get any other games or players involved. – Abion47 Jan 07 '20 at 01:24
@Abion47 How could he have been too lazy to fix the error when he didn't know about it? Div by zero can easily happen in an unforeseen case, same for others. And sure that module might have been implemented better, but if you have a general check routine higher up you don't care why that module doesn't do its work in this one case, you need to determine whether you want to crash your module along with potentially the whole server or the, I don't know, player movement thread affecting all players or if you just skip this one guy (and log it and maybe trigger an alert or whatnot). – Frank Hopkins Jan 07 '20 at 01:29
  • @Abion47 Anyway, I'm not trying to convince you and this is going down a chat route. I felt it important to point out that there is more to it than the worst case. Lawnmover Man might agree to place a hint into the answer that it isn't universal or not, I've done the part that was important to me. Have a nice day. – Frank Hopkins Jan 07 '20 at 01:32
  • "Because the app server/framework/etc. is already catching top-level exceptions and returning 500" should be an answer in itself. If you can't do anything to recover but return an error response anyway, custom handlers will probably just add clutter without much more value. – wrschneider Jan 07 '20 at 15:31
  • 2
@FrankHopkins > I for one had an experience of a game where they caught exceptions blindly. Once, minor glitches caused the team to manually restart the server. It's only at that point that they realized the game had been throwing exceptions for a few weeks. How? Well, no game progress had been committed to storage in that time, and the exception was caught by the engine. Now, when they restarted the server, all players' progress was rolled back several weeks. I never played again. Most likely I'm not the only one. – spectras Jan 08 '20 at 00:09
Note that response handlers are often executed on separate threads, which means not catching the exception might not actually crash the application. Therefore, not catching the exception might actually be the worst case. – JojOatXGME Jan 04 '21 at 13:41
  • @JojOatXGME I agree that letting exceptions silently kill a thread is A Very Bad Thing(TM). However, that is getting beyond the scope of the question, into a different question. – Lawnmower Man Jan 04 '21 at 21:24
  • And then product management tells you not to crash ever... – Wouter Apr 24 '21 at 22:32
59

Exceptions should be allowed to crash the system if the system has been left in an unrecoverable undefined state. If you can't put the system back in a defined state that ensures data integrity and security then you crash so the system can be rebooted into that defined state.

Whenever you catch an exception you're taking responsibility for doing all that recovery yourself. You should only leave your catch empty when you are absolutely sure the system is still in that defined state and nothing needs to be done. This happens when the calling code has a better idea of the system state than the code that threw the exception.

That's all there is to it. Calling an exception boneheaded is just being pointlessly rude. In some contexts you simply don't know what will happen to system state when you fail, so you throw and let the calling code figure it out. If the calling code can't figure it out then you crash and let the operator figure it out.

Now if you somehow know that failing won't put the system into a bad state then you don't need to throw in the first place. Throwing was never meant to be the only way to fail to do what you were asked to do. You can return -1, or the empty string, or null (ick), or a null object, or an empty collection, or a "maybe" monad, or print "Your princess is in another castle", or just do nothing, quietly. Yeah, I know. But sometimes that's ok. Why? Because at times failing to do what you're told is correct, expected, and not interesting.
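For example, a rough sketch of the Try-pattern flavour of this idea (the names are made up for illustration, not taken from the asker's code):

// Failing to remove a ban is expected and uninteresting, so report it
// with a return value instead of an exception.
public bool TryRemoveBan(Ban ban)
{
    if (ban == null)
        return false;                 // nothing to do; not interesting

    if (!_banRepository.Remove(ban))  // hypothetical data-access call returning bool
        return false;                 // already gone, e.g. lost a race - still fine

    return true;
}

The caller checks (or deliberately ignores) the bool instead of wrapping the call in a try/catch.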

candied_orange
  • 14
    It is also easier to fix the problem (bug) that caused the exception when the crash makes you aware of it. If you silently catch the exception and keep going, you can end up masking an issue that will become a bigger problem later. – 1201ProgramAlarm Jan 04 '20 at 20:20
  • 1
What you said seems to make sense, but... Still, the result of following this advice in server code is that instead of returning HTTP 500 and continuing to serve other requests we take the whole website down! AFAIK webapps typically opt for the first behavior, not the second? (Unless typically half of webapps' code is error recovery code?) – gaazkam Jan 04 '20 at 20:51
  • 3
    @gaazkam not true. When a website sends you a 500 it's doing so because it has recovered and part of that recovery is letting you know that whatever work you thought you were going to get done is gone. If it hasn't recovered to a safe stable state the last thing it should do is tell you about it. – candied_orange Jan 04 '20 at 22:20
  • @candied_orange So, typically, is half of server side code error recovery code? Do backend authors try to anticipate every possible bug they might introduce and write code to recover from every possible bug to avoid downtime? – gaazkam Jan 04 '20 at 22:59
  • @gaazkam When the service crashes, you get a nice log entry and (if your system is configured for it) the service can be automatically restarted, minimizing downtime and leaving you a clue to help track down the problem. – 1201ProgramAlarm Jan 04 '20 at 23:56
  • @gaazkam It largely depends on the kind of server. If the primary concern is correctness, like a bank website, it's much better for the service to crash than to continue on as though the unexpected error never happened. In your case, you'd have to decide how bad it is for your service to continue running with a faulty ban management system. From the tone of your question, I'm guessing you don't consider it the end of the world. That's fine, as long as you understand the risk you're taking. – wavemode Jan 05 '20 at 23:19
I hate code that _either_ works _or_ returns an error code (my main programming language is full of that). It forces my code to check for the error immediately, distracting from the purpose of the code. If I had my way, I'd wrap that potentially-error-returning call with a function that throws an exception. Then my code can simply assume that everything is fine and continue as normal. – CJ Dennis Jan 06 '20 at 07:19
  • 19
It is not true that calling exceptions boneheaded is *pointlessly rude*. I called them boneheaded exceptions deliberately; the comical rudeness *was* the point, so that the term would be *memorable*. You might not have internalized what is boneheaded about a boneheaded exception. **A boneheaded exception is always avoidable, and almost always indicates that the programmer forgot to handle a valid value or code path**. The fact that some of these defects are subtle and difficult to see, and that's why they persist in the wild, is part of the humour. – Eric Lippert Jan 06 '20 at 15:31
Letting a process crash would "take the whole website down"? This is why operational staff use supervisor processes that can take corrective actions such as restarting the service, and why there are monitoring services that can detect a service outage and inform someone (ergo someone is on call to fix it). If one crashed process takes a service completely offline, then there is no hope of achieving a reasonable uptime; I would implore the responsible team to consider using something more conventional to learn on, like a LAMP stack, if the builds are this fragile. – ThorSummoner Jan 07 '20 at 04:13
43

It doesn't really matter if it's a "boneheaded" exception (e.g. a Java unchecked exception) or not.

The only question to ask yourself is:

"Can the program sensibly continue?"

A "server" is a program that processes messages. Typically, each message is mostly independent from the others. That is: if there is an error processing on message, it makes sense to continue receiving and processing future messages.

This would apply to an HTTP server, a job server, etc.

Some errors may not be sensibly recoverable, like OutOfMemory. But if you can recover from an error (including possibly logging it some way), then you should.
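A rough sketch of that shape of loop (the queue, handler and logger names are illustrative):

// Each message is processed independently: a failure in one message is
// logged and the loop moves on to the next message.
async Task RunAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        var message = await _queue.ReceiveAsync(ct); // illustrative queue API
        try
        {
            await HandleAsync(message);
        }
        catch (OutOfMemoryException)
        {
            throw; // not sensibly recoverable: let the process die and be restarted
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to process a message; continuing with the next one");
        }
    }
}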

Paul Draper
  • 3
I would say "sensibly do something about it" instead of "sensibly continue" - because while sensible actions like "log the error, then exit (and probably get restarted by service supervision)" can be called "recovery", they are hardly "continuing", in casual or practical terms, or in technical terms within the scope of the process' lifespan. – mtraceur Jan 06 '20 at 22:21
  • Boneheaded exceptions might cast doubt on the sensibility of continuing. I mean, if someone's getting exceptions from their own code not behaving as intended, it'd seem potentially hazardous to just keep calling that same code anyway in scenarios in which that code could, for example, corrupt a database of user accounts. For example, in the question, they `try { RemoveBan(ban); } catch {}`, which presumably modifies a database with user accounts; if that throws due to a programming logic error, then what exactly is that code doing, and can it be trusted to keep modifying the database? – Nat Jan 06 '20 at 22:21
  • @Nat, if other operations are dependent on `RemoveBan()` then they should not continue. – Paul Draper Jan 07 '20 at 05:52
How do you know the program can sensibly continue? If that's because you carefully evaluated that error source, it no longer is a "boneheaded" exception. If it's something you did not plan for, then your program is doing something you did not expect, and is now in an unexpected state. There is no telling what invariants no longer hold, what things may be in an inconsistent state and… how long it will corrupt data unnoticed if you leave it be instead of triggering a clean restart by service supervision. – spectras Jan 08 '20 at 00:03
  • @spectras, you know the program can continue because you know the designed scope of the code. When a process segfaults, your OS does not crash. The OS has no idea how critical that component was, but it knows that it's reasonable to isolate errors to that context. – Paul Draper Jan 24 '20 at 17:36
  • @PaulDraper in order not to crash on application failure, the OS relies on hardware-enforced segregation to prevent contamination by the unexpected state. Your code has no such thing available. – spectras Jan 24 '20 at 18:14
  • (in fact falling back on that hardware segregation guarantee to prevent contamination is the very reason crashing the application is a good idea - the OS will use its privileged access to hardware to destroy the contaminated context and create a fresh one in a well-known state — if your application could do that, then yes it would be a good idea to do it at application level. And symmetrically, on embedded systems with no such segregation, the typical response to a crash is a reboot, the idea is the same: restore the application to a known state). – spectras Jan 24 '20 at 18:28
  • @spectras, the OS relies on hardware-provided recovery mechanisms, unless of course, it's software emulated. Java code relies on Java virtual machine-provided mechanisms. The capabilities are different, but idea is the same: divide up software and scope your errors. – Paul Draper Jan 24 '20 at 19:18
As long as it can guarantee a known state. When it cannot, it crashes like any other well designed software. Even your OS does; that's what panics and blue screens are. The flaw in your reasoning is that you do not distinguish between layers. A crash on one layer only needs that layer (and all above) reset. This is why the only examples you can quote are programs that live at layer boundaries: OS, interpreters, VMs. A crash in the layer they *manage* is not a crash in the layer they *live in*. – spectras Jan 26 '20 at 11:54
  • Now you can sure do that. Games usually do it, using an interpreted language within a vm to segregate user add-ons from game core. But that's a very different topic, and.... It does not involve catching any exceptions. – spectras Jan 26 '20 at 11:57
  • 1
    @spectras, the layering you speak of can be as simple as a stack layer; i.e. a function call. There's no more reason to suspect ArrayIndexOutOfBoundsException is corrupting the entire process anymore than a process crashing corrupts the entire OS. (It might have side effects that do that, but that's uncommon.) – Paul Draper Jan 01 '21 at 01:49
23

You are absolutely correct. An exception in server-side application code ("boneheaded" or not) should not crash a web server.

The confusion is because the articles are not clear about what it actually means to "catch" or "crash". If we followed advice never to catch "boneheaded exceptions", then a single application bug should bubble up and cause the whole operating system to crash. This is not what anybody wants!

Rather you should think of a system as multiple layers of isolation. An unexpected crash in one layer should not cause outer layers to crash. An operating system will isolate crashes in individual processes and will not let a crash in one process take down other processes or the whole OS. A web server will typically isolate crashes in single requests and not let them affect any other request.

An application may itself have multiple layers. E.g. an application supporting plug-ins might isolate crashes in individual plug-ins. If a plug-in crashes it is disabled, but the rest of the application continues.
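A sketch of what such an isolation boundary can look like (the plug-in interface here is invented for illustration):

// The host is the isolation layer: a crash in one plug-in disables that
// plug-in, but the host and the other plug-ins keep running.
foreach (var plugin in _plugins)
{
    if (!plugin.Enabled) continue;
    try
    {
        plugin.Execute();
    }
    catch (Exception ex)
    {
        plugin.Enabled = false; // quarantine the faulty unit
        _logger.LogError(ex, "Plug-in {Name} failed and was disabled", plugin.Name);
    }
}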

But isolating crashes can only work if the units of isolation are independent. Web requests are independent from the perspective of the web server. A crash in one request will not cause any other process to fail or receive invalid input.

So the bottom line is that you shouldn't catch boneheaded exception in your own application code, you should let a higher level (the framework) catch them.

Catching a boneheaded exception in your own code does not really make any sense. A boneheaded exception means there is an unexpected bug in that section of code. In other words, the code does something wrong, but you don't know exactly what or why. At best nothing happens when you execute the code; at worst some state is corrupted, which might cause other parts of the system to behave erroneously and corrupt more state.

So if the code detects that there is a bug in the RemoveBan method... why do you want to execute it again? I can't imagine any scenario where you would want to repeatedly execute some code which you know does something wrong.

Toby Speight
JacquesB
  • 1
    "If we followed to never catch "boneheaded exceptions", then a single application bug should bubble up and cause the whole operating system to crash." Sorry, that's just plain wrong. An exception *cannot* bubble out of its process. To the kernel, a throw, stack-unwind, catch is just a bunch of machine code like the rest it loaded from the binary. And the process's internal state, that which can be inconsistent due to an error, is none of the kernel's concern. To the kernel it's just some memory pages that are allocated to the process. That separation is the main reason we use virtual memory. – cmaster - reinstate monica Jan 06 '20 at 20:36
  • 5
@cmaster-reinstatemonica Obviously a .net exception can never escape the CLR. My point is that even if the web server was written in .net, you wouldn't let an exception in a request handler take down the whole server. You *do* want to catch all exceptions, even boneheaded ones. You just have to do it at the right level. – JacquesB Jan 07 '20 at 13:27
17

You learned an important thing: Whenever you read a rule on the internet that must absolutely be followed, you must start thinking about it and decide for yourself whether you should follow the rule in your specific case or not. And you should also think about whether the rule as you understand it is a good rule or not.

Here are some rules that you can think about: A server should never crash unless absolutely necessary. A client, on the other hand, is absolutely allowed to crash if you make sure there is no substantial data loss; users will often not even take notice.

So for a client, crashing is not at all a bad response to a boneheaded exception. For a server, it is a much worse response. Still, you have to weigh up: You have an exception that is totally unexpected to you. You have no idea how to handle it properly (reporting an exception isn't handling it). How much damage could be done by continuing when you know something has gone badly wrong? So do you want to continue operations, with unknown and potentially unlimited damage, do you want to crash the server, causing some damage, or do you want to restart the server (not quite crashing, but restarting it as gently as possible)?

gnasher729
  • 1
    Note that even on a server, "you the generic developer" should not normally be catching these exceptions; your framework should do it for you. – chrylis -cautiouslyoptimistic- Jan 05 '20 at 04:41
  • 4
    I'm not sure that I agree about clients being allowed to crash and users not noticing. If I recall correctly, app stores record the number of crashes for an app and will greatly drop your appearance ranking if your app is crashing. For gaming, crashing is fairly unacceptable, even more-so for anything involving multiplayer/live content. On the other hand, a server will likely have a monitoring service that will at least reboot it after a crash, so things aren't as black and white as this makes it sound – Mars Jan 06 '20 at 01:48
  • 1
    @Mars a client **literally** crashing is a bad thing. Stopping, showing some monobutton error modal dialog and returning to the starting state (screen) after user input is not. – beppe9000 Jan 07 '20 at 19:46
  • @beppe9000 That solves the issue for mobile where crash = lower ranking, but for user experience that doesn't change much. It would also require you to catch boneheaded exceptions in the first place – Mars Jan 08 '20 at 00:12
  • I just realized, *"must absolutely be followed"* should be *"must 'absolutely' be followed".* The former implies that yes, it absolutely must be followed (in which case it doesn't matter what your specific case is!), the latter implies that it's a questionable claim – Mars Jan 08 '20 at 00:15
  • @Mars Yeah, it sucks for users anyways... – beppe9000 Jan 08 '20 at 01:12
10

I think that what you are failing to appreciate is that the real-life consequences of errors can be much worse than simply having the server go down. Just for instance:

  • Erasing a database that is essential to the functioning of a company
  • Granting access to confidential information that shouldn't be granted
  • Approving a financial transaction that shouldn't be approved
  • Disapproving a financial transaction that should be approved
  • Destroying physical equipment under computer control
  • Killing a patient receiving radiation therapy under computer control
  • Causing flight control problems that result in the crash of two aircraft, killing 346 people

Ok, the list is a bit of a scarecrow. Ignoring exceptions is just one of many possible errors that could have terrible consequences like these. But it is important to keep in mind that an exception being raised is telling you that one or more of your assumptions about the operation of the program is not valid, and you really need to think about the possible consequences of that.

Since you are writing a hobby game where none of these outcomes is remotely possible, you could just blow a raspberry and ignore this issue. However, if part of the reason you are writing this hobby project is to educate yourself, and to prepare yourself for writing software that other people will use, then you have to think carefully about this stuff, and practice writing code as if its correct functioning mattered. Yes, this is a lot more work.

No, you don't have to write handlers and recovery code for every conceivable exception. You can just deal with the ones that you think are important and know how to handle. But unless you are absolutely sure that an exception has no impact on the continuing correct functioning of the program, then it has to get passed on to the next higher level of control. That gives it the chance to recover or to abort the program and prevent incorrect functioning.

Charles E. Grant
  • Or in Windows, having a user click OK to illegally pass information that Microsoft promises they won’t use. – WGroleau Jan 05 '20 at 21:52
4

I am surprised. Especially for some (important!) use cases, like server-side code, I simply can't see why catching such an exception is suboptimal and why the application must be allowed to crash.

There's nothing wrong with a crash; in fact, it can be very helpful to have an application crash early. @PaulDraper's answer of 'Can the program sensibly continue?' is really the key way of looking at it.

For example, if my application has been incorrectly configured, then I want it crashing immediately on startup. By failing quickly and obviously, it allows developers/DevOps to quickly identify and rectify the problem. The alternative, returning 500 errors, is less obvious and, if intermittent, may even pass health checks until a customer complains.
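For instance, something like this at startup (a sketch; 'GameDb' is just an example setting name):

// Validate configuration once, at startup, and refuse to start if it's wrong,
// instead of limping along and returning 500s later.
var connectionString = configuration.GetConnectionString("GameDb");
if (string.IsNullOrWhiteSpace(connectionString))
    throw new InvalidOperationException(
        "Connection string 'GameDb' is missing - refusing to start.");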

On the other hand, if I get a sporadic error on a single request, I will typically 'crash' that single request (e.g. return 500), but otherwise, let the application continue running. After all, a single DB timeout shouldn't typically crash a website.

Though this is dependent on the application - if it's processing a single task at a time in a fault-tolerant pipeline (e.g. some kind of job or message processor), letting the application die might be appropriate (e.g. an orchestrator might have a failure policy we want to follow).

Whatever the situation, I will try to log every error, regardless of whether it's handled or allowed to bubble up.

NPSF3000
  • 1
    Touché. Also a fun fact: your server will occasionally crash anyway, in unplanned ways :) And everything else as well. – kubanczyk Jan 05 '20 at 16:55
3

Your example demonstrates exactly why you shouldn't catch 'all possible' exceptions.

If your GetExpiredBans call fails, your code simply carries on as if it had been successful. The unbanner server is up and running and looks good, but actually it's not working at all.

Now if you know that RemoveBan occasionally fails due to a network problem for example, then you could catch that specific exception and implement a retry, or skip that player and move to the next. It's an understood problem and you know the desired behaviour.
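For example (a sketch; the exception type, retry policy and logger are placeholders that would depend on your data access layer):

// Catch only the failure we understand, retry a couple of times,
// then skip this player rather than swallowing everything.
for (var attempt = 1; attempt <= 3; attempt++)
{
    try
    {
        RemoveBan(ban);
        break; // success, move on to the next ban
    }
    catch (TimeoutException ex) // the known, transient network problem
    {
        if (attempt == 3)
        {
            _logger.LogWarning(ex, "Giving up on this ban after 3 attempts");
            break; // skip this player, continue with the rest
        }
        Thread.Sleep(1000); // wait a moment and retry
    }
}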

If your code throws an exception you weren't expecting, then it's best to stop and have a human look at it.

Now you bring up the case of a HTTP server returning 500s instead of crashing.

Here the server is running an external program, i.e. your webpage. It's not the webserver which has a bug, it's someone else's code. You wouldn't expect DOS to crash if you ran a buggy program; returning an error code and moving on is the same as that page crashing.

Ewan
  • 3
    I'd venture that a failing minor functionality like autobanning users isn't a serious enough problem to crash the server, though it is serious enough to send an email alert (or possibly SMS or Twitter message) to a sysadmin to come take a look. – jwenting Jan 06 '20 at 10:13
@jwenting it's your assumptions about the consequences and reason of the _unknown_ exception which are the root of the problem. Adding an SMS message which sends in a tight loop would only exacerbate the issue – Ewan Jan 06 '20 at 10:21
  • 1
    In this particular case, the *ideal* outcome for an unexpected error is that the server continues running, the users don't get unbanned, and the administrator/developer gets notified that something went wrong so they can hopefully fix it before the next auto-unban cycle. – user253751 Jan 06 '20 at 11:33
  • @user253751 sure, and the best way to achieve that is to crash the program and generate an alert through your monitoring solution. "server continues running" is not desirable if its just looping around hitting the same error each time – Ewan Jan 06 '20 at 11:45
  • @Ewan Why would that not be desirable? – user253751 Jan 06 '20 at 11:47
@user253751 processing power costs money. Worst case, your DB grinds to a halt processing millions of GetExpiredBan calls and crashes everything. Now you have a dozen services all presumably doing the same thing, erroring and retrying instantly. Your logs fill up with errors for every service and you can't tell what the cause is – Ewan Jan 06 '20 at 11:51
  • if you have implemented @jwenting's sms solution your phone has probably crashed as well – Ewan Jan 06 '20 at 11:52
  • @Ewan Okay, so you need a more fancy "circuit breaker" which won't try to process expired bans for 6 hours after an error is encountered. That doesn't change the fact that it's better to ignore expired bans for a while than to crash the entire server and lose all your revenue. – user253751 Jan 06 '20 at 12:02
Crashing the unbanner program won't crash anything else. Leaving it running might. You already have a fancy circuit breaker in your monitoring system, which can detect non-running services – Ewan Jan 06 '20 at 12:25
  • also check some of the other answers for varying degrees of bad things that are worse than crashing – Ewan Jan 06 '20 at 12:26
3

Personally, I handle all exceptions.

And email them to myself.

It doesn't have to be email, you could just log them, but if you don't know that these errors are happening, how can you ever hope to correct them?

By all means, try to recover, or gloss over the error - to the user, but not to yourself.

For AJAX requests, this might mean sending a 500 response. For GUIs, the classic Microsoft "something bad happened" (with its implied "we coders know what it is, but we are not going to tell you").

IMO, you need to inform your "user", whether human or software, but you also need to inform yourself, the development/maintenance team. These might be same action, or two different actions, but they both need to be informed.

3

There are actually two different types of exceptions that need to be handled differently.

First of all there are the "Non-bonehead" exceptions like "File not found". These exceptions are kind of expected; even if you check for the file's existence beforehand, it doesn't prove that it will be there when you go to use it.

These you typically catch as low/specific as possible, everyone understands this already because it's how everyone says to handle "Exceptions", so let's ignore it.

The other is unexpected exceptions. This covers both programming errors and some situational errors you aren't expecting to encounter (like the OS ripping a disk out from under you). These are what you are calling "Bonehead" exceptions, but they DO happen.

The important thing is that these are NOT ignored. If the only way to get a team to pay attention is to crash the app hard, then DO SO, but if you can get their attention another way, I recommend catching the most general "Exception" type just inside each primary thread loop and dealing with it in such a way as to get the team's attention, and then try to continue. You'd be surprised at how often this can keep your app running pretty much perfectly while you fix it. It has even allowed me to recover from an out-of-memory situation.

Not catching it is really not good compared to just handling it in a way that gets attention.

Also VERY important:

In Java (at least), by default, when a thread throws an exception it is silently eaten: it doesn't crash your app or give you any indication that part of your program failed, but if that was a long-running thread, everything it powered is completely (invisibly) gone!

This can be fixed by installing a default exception handler, but be careful: without the handler or a try/catch, allowing such an exception to just silently kill a thread is the worst possible solution--it's possibly the most expensive thing you can do to another developer (or yourself!). I've spent weeks tracking down exceptions that were eaten by threads and empty catches!
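
For what it's worth, .NET has a similar trap: an exception thrown inside a Task that nobody awaits can vanish just as quietly. A rough, hedged equivalent of installing a default handler is to hook the process-wide events early in startup (Console.Error stands in here for real logging):

using System;
using System.Threading.Tasks;

// Last-chance visibility for exceptions that would otherwise disappear
// or take the process down without a useful trace.
AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
    Console.Error.WriteLine($"Unhandled exception: {e.ExceptionObject}");

TaskScheduler.UnobservedTaskException += (sender, e) =>
{
    Console.Error.WriteLine($"Unobserved task exception: {e.Exception}");
    e.SetObserved(); // mark it as observed so it is not treated as fatal on older runtimes
};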

Bill K
  • 2,699
  • 18
  • 18
2

There are two broad categories of exceptions: those that arise from the programmer's misuse of an API (e.g., invalid arguments) and those that arise from external causes (e.g., file not found). Exceptions from external causes should always be caught; exceptions from misuse of APIs should never be caught. This is because it's realistically achievable to avoid making these kinds of mistakes in the first place, and because these kinds of bugs usually manifest right away and should be identified and fixed, not handled in the code.

Distinguishing these categories of exceptions is much easier when you have language support (i.e., checked vs. unchecked exceptions) or when working with APIs that clearly distinguish them (i.e., different exception base classes.) Avoiding API misuse is also much easier in languages and IDEs that provide instant access to the complete documentation (e.g., Java/Javadoc in IntelliJ IDEA.) It's considerably more difficult in languages and IDEs that only provide instant access to a severely limited summary of the documentation (e.g., C#/Visual Studio.) In my experience, good IDE support for instant access to complete documentation vastly reduces the number of API misuse bugs that are ever introduced in the first place.

TL;DR:

  1. Constantly refer to the documentation to avoid misusing an API. Corollary: avoid using APIs that are inadequately documented.

  2. Do not catch exceptions arising from misuse of APIs. Corollary: avoid using APIs that do not clearly distinguish between misuse and external errors.
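
To illustrate the distinction with a small C# sketch (BanRepository is a hypothetical example class, and the exception types are the standard .NET ones):

using System;
using System.IO;

public class BanRepository
{
    public void RemoveBan(string playerId)
    {
        // Misuse of the API: the caller passed garbage. Don't catch this anywhere;
        // fix the call site instead.
        if (string.IsNullOrEmpty(playerId))
            throw new ArgumentException("playerId must be provided", nameof(playerId));

        // External cause: the file may be locked or the directory missing no matter
        // how correct our code is. Callers are expected to catch IOException here.
        File.Delete(Path.Combine("bans", playerId + ".ban"));
    }
}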

StackOverthrow
  • 646
  • 3
  • 13
  • 1
    Is out-of-memory external or misuse? That depends. Is malformed data external or misuse? Probably the former, but the latter can be true. – Deduplicator Jan 06 '20 at 19:43
  • @Deduplicator: Yup. In the real world, many programs need to deal with data from potentially unreliable or untrustworthy sources. Sometimes sanitizing data before processing makes sense, but in some cases validating data without processing it would require essentially as much work as processing the data while checking for validity throughout the process, and sanitizing valid data before processing would represent a wasted effort. – supercat Jan 06 '20 at 22:01
  • @Deduplicator The API designer should answer those questions and distinguish the exceptions accordingly. – StackOverthrow Jan 06 '20 at 23:08
  • @user560822 Unfortunately, the API-designer generally simply doesn't have the data to decide. Thus lower layers signaling errors up the chain instead of handling them. – Deduplicator Jan 06 '20 at 23:13
  • @Deduplicator Sure they do. To use your example, the API designer decides and documents who should perform validation. If they let lower layers perform the validation, which they normally would, then the API would throw an exception indicating an external error. If the exception that may be thrown by the lower layer is the wrong type to indicate this, then a good API would translate it to the correct type (catch, wrap, and rethrow.) – StackOverthrow Jan 06 '20 at 23:24
1

An exception you don’t know how to handle at a particular place should not be handled there. It should be passed up the call stack until it can be handled properly. If it is not expected anywhere, there is at least some place where an operation can be canceled without causing a “wrong answer” or a crash.

WGroleau
  • 121
  • 3
1

Exceptions (Boneheaded or not) are Exceptional

If something is exceptional, then by definition you have not considered it, nor do you know how to handle it.

Think of it like this. The sales assistant at the checkout of a store knows how to sell stuff, perform exchanges, etc.

They do not know how to handle Reporter Interview. The correct approach is to throw an exception to their Manager. Their manager may know how to handle this, and if so will then catch the exception and deal with the Reporter Interview. Otherwise what should the manager do?

Pass the exception up to their Manager. Eventually (if not dealt with earlier) this will reach the CEO/Business owner/Directors and have to be dealt with there. Even they may not know. In which case the exception escapes, and the Reporter Interview is not dealt with.

The worst scenario is for the Reporter Interview to be dealt with by one of those people who does not have a clue how to handle it. Imagine the Reporter reporting on a poorly worded comment from the Sales Assistant or the Local Manager. Massive damage to the business reputationally, and quite possibly financially.

Business software, be it in an office or an online game, is essentially just automating a process that would otherwise have had to be done manually by an actual worker. Ergo, you want it to behave like a well-mannered employee. It should only deal with the situations that it has a good, well-defined response/activity for. Everything else should be handed over to its managers (usually its support team) for clarification (new code) and/or action (manual tasks).

Kain0_0
  • 15,888
  • 16
  • 37
  • But we do know that if something goes wrong with removing expired bans, it would be better to let the game continue but leave those players banned, than to crash the game. We also know that if this is likely to corrupt the database, it's better to crash the game than to corrupt the database. – user253751 Jan 06 '20 at 11:34
  • hmmm... No. Knowing a desirable property is not the same as knowing the state and what can be done to progress. I know that it is desirable for my computer to do as I mean, but that does not mean it won't do as I've written. It would be desirable to not kill the system just because a ban could not be lifted for some inane reason. But exactly which inane reasons are those? OutOfMemory? ProgressiveLockFailure? DiskSpaceFull? DivideByZero? AssemblyNotFound? Any of those could prevent the ban from being lifted; exactly which of those leaves the system in a good-to-go state? Try a Task architecture. – Kain0_0 Jan 06 '20 at 13:42
  • All of the above. If any of those are problems for the actual game, then the problem with the ban lifting will be logged, and then the same problem will crash the actual game (and be logged). – user253751 Jan 06 '20 at 14:23
  • ... Either there is information in your scenario that you left out of your question such as logging your swallowed exceptions... or you have been exceptionally fortunate. None of those errors are guaranteed to be raised, consider AssemblyNotFound. A caching layer could preserve that knowledge and route around the issue. Presuming a ban removal on the list that fails before game code executes, that cache is primed. Later when you cleanup the expired ban list, and then at some point restart the server, all of the game servers start failing. Again which exception is boneheaded? – Kain0_0 Jan 06 '20 at 22:36
  • Why wouldn't you get an AssemblyNotFound exception *every* time you called into the missing assembly? – user253751 Jan 07 '20 at 11:09
  • As I pointed out a cache layer could avoid the lookup on the second access as it determines that the service is unavailable and offers an alternative. Before you ask, I have worked in truly insane code bases that have done exactly this. My point isn't that you have a caching layer, nor is it the specific kind of error, nor that your code base is insane. My point is that without investigating the errors actually raised, the code can very easily *not* present them to you later. In short do not presume reproducibility, and do not presume triviality without actually verifying that this is the case – Kain0_0 Jan 07 '20 at 22:26
  • So the first time you call into the NetflixSearch assembly you get an AssemblyNotFound exception, but the second time you get a youtube search instead? That's a bit crazy. – user253751 Jan 08 '20 at 10:37
  • The first time it performs a search to identify implementations of `X`. The failure was expected as the framework is performing a general search of several optional but well known assemblies. The findings are turned into factories which are cached. Subsequent lookups use the cached factories to produce the required implementations. The later code succeeds because the cache is primed. Occasionally the service would take several restarts to get back to running. Occasionally the service displayed retrograde behaviours. Niche and boneheaded, but would have been better for it to just die. – Kain0_0 Jan 08 '20 at 22:51
1

To add an extra bit of flavor:

While you should allow your program to fail, it's not necessary that you make it fail catastrophically.

If your language allows you to create your own exceptions, a great approach is to do something like this (Python):

import sys

# Exception is the base "user" exception class
class MyAppException(Exception): pass
class MySpecificError(MyAppException): pass
class MyOtherError(MyAppException): pass

...

if __name__ == '__main__':
    try:
        do_it()  # or whatever starts your app
    except MyAppException as e:
        # This catches any exceptions that *you* raised, and it is entirely
        # the programmer's fault for not catching them before here!
        print('Well, we forgot to catch *this* exception:', e)
        sys.exit(1)  # or some other failure code - it's up to you.
    except Exception as e:
        # This catches any exceptions that may or may not be within the
        # programmer's control, including things like KeyError or IndexError
        print('This was unexpected:', e)
        sys.exit(1)
    except:
        # Maybe not this one - it catches SystemExit & others. But, you *could*.
        print('Well, this was very unexpected')
        sys.exit(1)
    finally:
        # Do some cleanup stuff - close sockets, or whatever - on both the
        # success and the failure paths
        ...

You can also re-raise your exceptions in Python if you'd like to really crash crash.

Taking this approach, you know:

  1. If you missed any exceptions that the developers should have known about
  2. If you missed any exceptions that you probably should have known about, but just didn't think of.
  3. If there was just something else entirely that failed - solar flares or something that was really outside of your control.

But you don't bother recovering and trying to keep running - as everyone else said, if you can't recover then you really shouldn't. But you don't have to die ungracefully - it's OK to give a last gasp, "Well, that was unexpected!" before keeling over. And in my experience, that's even a good idea because then you can try to fix it or account for the problem. Or just decide that it was something that really wasn't recoverable and you're just going to alert on that or something.

Deduplicator
  • 8,591
  • 5
  • 31
  • 50
Wayne Werner
  • 2,340
  • 2
  • 23
  • 23
  • 3
    One of my favorite things about python is that it already does what I want by default. Uncaught exceptions write a backtrace to stderr and exit the process with return code 1. I only ever catch a bare exception when I want to "backtrack and die loudly" with a special-purpose integration. For *nix systems, stderr and rc!=0 are well integrated: crontab will mail stderr on failure, the system supervisor writes stderr to log files and logs failures, and most (shell-based) task runners understand this behavior well. – ThorSummoner Jan 07 '20 at 04:36
1

You have to consider the reason/purpose for handling an exception, such as:

  • To perform some sort of corrective or recovery action so that the application can meaningfully continue; this implies that the exception and its consequences are well understood
  • To perform some sort of cleanup so the application is restored to a consistent state; again, the consequences of the exception must be fully understood
  • To capture information about the exception for logging and later analysis
  • To fail gracefully, such as to issue some sort of user-friendly/user-meaningful message in an appropriate UI instead of whatever raw message or code is directly reported by the exception; in doing this, it's generally wise to capture the more technical details for analysis by the development team.

If there is no useful purpose served by your exception handler, then a handler shouldn't be there and the exception should be allowed to "bubble up".

"Bonehead" exceptions implies conditions which were not anticipated when the application was initially coded. Un-anticipated exceptions should have as much detail captured as feasible so that the cause can be determined and appropriate action taken. This may be done automatically by the programming framework; if so, then you may not want to explicitly handle them.

Zenilogix
  • 309
  • 1
  • 3
1

I know I am late to the party, but I think there is some nuance that is missing in the existing answers. To quote sage advice from any number of martial arts movies:

The best way to block a punch is to not be there

To bring it into software engineering, we can repurpose the quote this way:

The best way to handle exceptions is to never throw them in the first place

Even though it was never explicitly called out in the first link you provided, that was essentially what the writer was trying to steer the readers toward.

Reasons why exceptions are considered bad:

  • They are very heavy to recover from (stack information takes time to unwind)
  • They were never designed to be used for normal program flow
  • They were designed to allow you to do your best effort to recover
    • That recovery may only be to close open resources and release locks before shutting down

That said, exceptions should not be propagated to users, particularly in a web context:

  • It exposes the internal architecture of your application, making it vulnerable to future attacks
  • For the web, instead of more appropriate responses like a 400 Bad Request or 404 Not Found, you return a generic 500 Internal Server Error

So what do we do?

  • Minimize the use of exceptions whenever possible (e.g. a TryParse that returns a boolean instead of a Parse which throws an exception - see the sketch after this list)
  • Be purposeful about our exceptions
  • Never propagate exceptions to the response, but prefer logging to debug with
  • Fix the bugs that cause exceptions to be thrown
  • Verify your environment in code during startup if you have external dependencies--it's better to fail on startup than to find out at runtime that the service or application couldn't have worked anyway.
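
A minimal C# sketch of the Try pattern mentioned in the first bullet (TryRemoveBan and the lookup helpers are hypothetical names, following the shape of Int32.TryParse):

using System;

public class BanService
{
    // Failing to remove a ban is an expected outcome here, so it is reported
    // through the return value instead of an exception.
    public bool TryRemoveBan(int banId, out string failureReason)
    {
        failureReason = null;

        var ban = FindBan(banId);
        if (ban == null)
        {
            failureReason = "Ban no longer exists (perhaps it was already removed).";
            return false;
        }

        DeleteBan(ban);
        return true;
    }

    // Placeholders for the real data-access code.
    private object FindBan(int banId) => null;
    private void DeleteBan(object ban) { }
}

Callers branch on the boolean (and can log failureReason) instead of wrapping the call in try...catch.
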
Berin Loritsch
  • 45,784
  • 7
  • 87
  • 160
-3

This could not be more straightforward.

There are only 3 types of code:

  1. Code that never throws an exception.

  2. Code that you know will generate exceptions. For example, when connecting to an external RPC server, it is expected that it will sometimes time out. You should add try...catch for this type of code (see the sketch after this list).

  3. Code that you didn't expect to throw errors, but does. Since you didn't expect it to throw, there's nowhere sensible to add try...catch; you cannot wrap everything in an endless number of try...catch blocks. Once an error happens, you just debug and fix it, that's all.
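
For the second category, a minimal C# sketch (the URL and the fallback value are made up) of a call that is expected to fail occasionally:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class RankingClient
{
    private static readonly HttpClient Http = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(5)
    };

    public async Task<string> GetRankingsOrFallbackAsync()
    {
        try
        {
            // Remote calls are *expected* to fail now and then, so catching
            // here is the normal, planned-for case.
            return await Http.GetStringAsync("https://example.com/api/rankings");
        }
        catch (HttpRequestException ex)
        {
            Console.Error.WriteLine($"Ranking service unreachable: {ex.Message}");
            return "[]"; // fall back to an empty result
        }
        catch (TaskCanceledException)
        {
            Console.Error.WriteLine("Ranking service timed out.");
            return "[]"; // same fallback on timeout
        }
    }
}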

How hard is that?

laike9m
  • 135
  • 7