31

In The Pragmatic Programmer, the authors write:

One of the benefits of detecting problems as soon as you can is that you can crash earlier, and crashing is often the best thing you can do. The alternative may be to continue, writing corrupted data to some vital database or commanding the washing machine into its twentieth consecutive spin cycle.

...when your code discovers that something that was supposed to be impossible just happened, your program is no longer viable. Anything it does from this point forward becomes suspect, so terminate it as soon as possible.

To what extent does this principle apply in the context of GUI applications? That is, when faced with an unanticipated exception or an assertion failure, is the best course of action to terminate the GUI program (possibly with an appropriate error message to the user)? What are the trade-offs involved in applying it or not applying it?

What about single-page JavaScript applications? For example, terminating the page (or perhaps prompting to refresh?) when an uncaught promise rejection is detected.
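For concreteness, here is a minimal sketch of the kind of handler I have in mind, using the browser's `unhandledrejection` event; the confirm-and-reload response is just illustrative:

```typescript
// Minimal sketch: treat an unhandled promise rejection as a sign the page
// is no longer viable, and prompt the user to reload. Illustrative only.
window.addEventListener("unhandledrejection", (event: PromiseRejectionEvent) => {
  console.error("Unhandled rejection:", event.reason);
  // "Terminate" the page by offering a refresh rather than limping on.
  if (window.confirm("Something unexpected went wrong. Reload the page?")) {
    window.location.reload();
  }
});
```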

samfrances
  • Excuse me in advance for the question, but are you asking what to do if the GUI fails, or what to do if any application with a GUI fails? – Laiv Feb 16 '21 at 08:34
  • See also: https://softwareengineering.stackexchange.com/q/399424/208524 – jaskij Mar 02 '21 at 18:12

9 Answers

47

Quoting the same passage from the book (emphasis mine):

One of the benefits of detecting problems as soon as you can is that you can crash earlier, and crashing is often the best thing you can do. The alternative may be to continue, writing corrupted data to some vital database or commanding the washing machine into its twentieth consecutive spin cycle.

...when your code discovers that something that was supposed to be impossible just happened, your program is no longer viable. Anything it does from this point forward becomes suspect, so terminate it as soon as possible.

When a programmer uses an assertion, they're saying "This should never happen." Normally, terminating the program under these conditions is an appropriate response, especially since the programmer's assertion has been violated for unknown reasons. This is as true of a program with a GUI as it is for a console program or service.

For normal exceptions, the question becomes the same as it's always been: can we meaningfully recover from this exception? That depends; did the exception occur during a write to a critical database, or did the user simply give us a file name that does not exist?
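As a rough sketch of that distinction (TypeScript/Node, with illustrative names; not from the book):

```typescript
import { readFileSync } from "node:fs";

// Anticipated failure: the user gave us a file name that does not exist.
// We can meaningfully recover, so handle it where it occurs.
function openUserFile(path: string): string | null {
  try {
    return readFileSync(path, "utf8");
  } catch {
    console.warn(`Cannot open ${path}; please choose another file.`);
    return null;
  }
}

// Violated assertion: "this should never happen". The program is no longer
// viable, so stop instead of computing with state we no longer understand.
function invariant(condition: boolean, message: string): asserts condition {
  if (!condition) {
    console.error(`Internal bug: ${message}`);
    process.exit(1);
  }
}
```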

Robert Harvey
  • This. When you find yourself in a situation that you thought could never happen, then there is *no sensible thing* the program could do, because you don't know what's going on. The best thing to do is to do nothing, because that way, *at least* you can't do any damage. – Jörg W Mittag Feb 13 '21 at 17:00
  • I hope the nuclear power plant next door is not coded according to that pragmatic programmer’s principle and that it tries some recovery strategies before deciding to crash the reactor unsafely ... ;-) – Christophe Feb 13 '21 at 17:01
  • @Christophe: That's a *known,* not an unknown. Obviously, "crashes" need to be accomplished *safely,* if possible. But sometimes programs are already dead, not just [mostly dead.](https://www.imdb.com/title/tt0093779/) – Robert Harvey Feb 13 '21 at 17:03
  • A good example of recovering is a web server: if there's an error while handling the request, we let that request crash and continue with the next one. We shouldn't crash the entire server. Similar when handling events in a GUI. The important part is figuring out the correct *error boundary* in the context of the problem domain. This tends to be simpler when there's a clear Controller (as in MVC) or use case layer (in the Clean Architecture terminology). – amon Feb 13 '21 at 17:34
  • @Christophe: I really, really hope the GUI parts of the management software in the nuclear power plant "next door" are coded with the crash-early principle in mind, and do not give their users a false impression of safety. – Doc Brown Feb 13 '21 at 19:50
  • @DocBrown I hope that it will not crash because of an unexpected divide by zero, but will inform the operator that there is a serious problem, still giving him or her the possibility to decide whether or not to lower the graphite rods that would prevent a core meltdown ;-) – Christophe Feb 13 '21 at 20:02
  • @DocBrown Of course, I hope even more that quality assurance will have prevented this divide by zero in the first place :-) – Christophe Feb 13 '21 at 20:03
  • @Christophe: quality assurance will have a much higher chance to find such issues if the program does **not** sweep unexpected divide-by-zeros under the rug, maybe because of a programmer thinking "hey, it is for a power-plant, the program must stay alive under all circumstances". – Doc Brown Feb 13 '21 at 20:16
  • .. and after an unexpected divide-by-zero, I don't think it is a good idea to trust the running process any more; any information about the status of the system given to the operator then is quite unreliable. In a system like a nuclear power plant, there needs to be a redundant fail-safe for such a situation, and the operator needs to be informed that the appropriate action is to make use of the fail-safe. – Doc Brown Feb 13 '21 at 20:24
  • @DocBrown This is a very interesting ethical issue. I think I have read an article recently about a fail-safe system with two pieces of software written by different teams with different technologies, but where both ended up with an inaccurate result in a very specific situation, due to an undue algorithmic assumption. This is unexpected for the arbitration system: should it then crash, leading to a plane crash with 100% probability? Or should it leave control to one of the systems, whose safety is no longer guaranteed, but which could still give a chance to land the plane in 50% of cases? – Christophe Feb 13 '21 at 21:01
  • @Christophe ideally the airplane-software designers have considered the possibility of a crashed process and implemented a mechanism that will handle that eventuality without crashing the plane. In my software (which controls nothing nearly so safety-critical as an aircraft!) I do this by executing the complex code in a child process, and the parent process stands by ready to launch a fresh/new child process if/when the running child process crashes or exits. Once that is implemented, allowing the child process to crash is less problematic. – Jeremy Friesner Feb 14 '21 at 03:31
  • @Christophe: in a "this should not ever happen" error situation, how can you trust your other sensors and information panels are working properly? Your decision based on faulty data to prevent a shutdown might instead guarantee its explosion. Not all errors are the same and different kinds have different mitigation strategies. – whatsisname Feb 14 '21 at 04:33
  • @Christophe In my non-power-plant-engineer opinion, the power plant processor *should* crash if something unexpected happens, and then an auxiliary system should order an emergency shutdown. And they should design and test the hell out of the first processor to make sure it doesn't crash. I.e. it shouldn't crash, not because it hides crashes when unexpected things happen, but because there are zero unexpected things to begin with. But if it does, it shouldn't explode. Multiple layers of safety. – user253751 Feb 14 '21 at 06:40
  • @whatsisname could you even blindly trust any sensor? Isn't an inconsistent sensor value that leads to an inconsistent state, which leads to an "internal error" and an early crash, something that should be prevented by first checking sensor input for possible inconsistency, and by designing the system to react early and appropriately to an unexpected situation ("please check sensor")? – Christophe Feb 14 '21 at 09:06
  • @Christophe Looks like you are assuming that crashed process = crashed plane/plant. We often identify with the code we write, and its crash is the end of the world for our mental model. But this is the opposite of how a system as a whole should behave to stay alive. A failed process should be killed and restarted ASAP; if it cannot be, or if a whole piece of hardware is suffering a RAM fault, it should shut down and let the backup system take over, not try to control the plane based on data from a corrupted memory. – Victor Sergienko Feb 15 '21 at 02:38
  • I never coded for power plants, but I did work on safety-critical systems that had GUIs. The GUIs read the state of the safety systems and displayed it; the safety side never depended on the GUI, and the GUI couldn't crash it. If there was a problem on the safety-critical side (e.g., redundant input mismatch) the only response was to drop power to the hazard and stop everything. In what I worked on, the whole "fail fast" philosophy was irrelevant; you didn't dictate what the "fail" path was: every N nanoseconds you enabled *success* iff safe conditions were met. It was a real-time system. – jrh Feb 15 '21 at 03:06
  • The whole system had to be designed from day 1, from electrical wiring to software, to make a "fail-safe" setup; frankly, if you are working on a system like that, the Pragmatic Programmer probably won't help you much. In fact, only recently did people even trust software to do it; it was 100% hard-wired with relays. If you're imagining a Java application running the safety routines, I sure didn't see that; in fact it was assembly most of the time. You're not reading from files or waiting for web requests in this, I can tell you that. – jrh Feb 15 '21 at 03:12
  • The QA terminology is a bit ill-fitting; you don't have "QA" in safety-rated stuff, it's not good enough. You certify it and have it inspected; it has to pass federal regulations (e.g., OSHA). IMO it's not enough to say "I will crash when there's a problem I haven't seen before"; everyone has to work together to find mistakes in the level below them and establish all the possible unknowns. Hardware and software designed for safety-critical systems try to reduce unknown unknowns by using proper equipment. An "oops, I didn't think of that" can get somebody killed. – jrh Feb 15 '21 at 03:28
  • The GUI and the actual safety control system were two separate real time, high reliability systems; the GUI having a slower, more flexible response time, but it was important to indicate when the GUI lost contact with the safety system; as a fallback there were LED lights hooked directly into the system for the most important information. Both systems had serious limitations like no heap allocation, static type size, and a watchdog timer that would immediately fault the whole system if it took too long to respond, loops were pretty rare and risky. – jrh Feb 15 '21 at 04:01
  • @Christophe "still letting he/she the possibility to decide whether or not to shutdown the graphit grids" --> time to decide may take too long. Discussion about "nuclear" is interesting considering the [Chernobyl failure](https://en.wikipedia.org/wiki/Chernobyl_disaster) was during safety testing. – chux - Reinstate Monica Feb 15 '21 at 17:37
  • @chux-ReinstateMonica That was exactly a situation that wasn't tested. Miscommunication combined with a very unlikely failure scenario. Had the miscommunication not happened, the other problems would've been caught at least somewhat safely; possibly with damage to the plant, but not with damage to the rest of the continent, as actually happened. Or, in other words, had the plant been fully automated and not reliant on pesky humans, it wouldn't have gone wrong. Naturally, that wasn't possible or preferred at the time. – Mast Feb 15 '21 at 21:30
  • I have experience in both software and hardware design of safe (PL e and SIL 3) machinery. When done right, you can absolutely make just about anything safe yet automated. Use quality sensors, use redundant sensors, redundant wiring, never assume data you can't read, always use a back-up sensor in case normal operation fails (basically the hardware equivalent of "this position shouldn't have been reached"; elevators are famous for limit switches like that), et cetera. Even there, 'dead programs tell no lies' is very valid. – Mast Feb 15 '21 at 21:43
  • The moment you can't trust your program, you must fall-back or revert to a trustworthy position. You do this by making paths that aren't likely to be used, but might be used anyway and will be very beneficial should it happen. – Mast Feb 15 '21 at 21:44
  • @jrh I would hope that someone examined the situation ahead and figured out "what is the most reasonable thing to do, which will cause the least damage, if I can't trust my software". If my freezer with ten tons of frozen fish inside malfunctions, then cutting off the power might not be the best default action. – gnasher729 Feb 16 '21 at 08:29
  • @Christophe> I believe you are confusing the technical issue (yes, a divide by zero is nothing that bad if the possibility was anticipated) with the real problem. The problem is that the program is in a state that was never anticipated, which means an unknown set of assumptions it relies on is wrong right now; thus any action it takes from that point relies on wrong assumptions and will lead to an unpredictable outcome. This is akin to trying to continue a mathematical proof after the starting premise has been shown to be false. That cannot work. You have to throw it away and call plan B. – spectras Feb 16 '21 at 10:54
  • @gnasher729 Yup. My comments referred to machines that need to have a response to an immediate danger to life and limb, though. A freezer isn't going to crush somebody, and it might have a much simpler circuit that's just hooked up to a backup generator, I'm not sure it'd even need much software at all. – jrh Feb 16 '21 at 13:15
  • I don't make freezers, but a business-critical (not safety critical) freezer circuit could maybe try and mitigate the chance of a broken temp sensor by having two of them, but the question of "what to do if a sensor dies" is not so simple. The answer is probably "routine maintenance and inspections and having a person on site that can respond to problems". If the system just crashed in this condition (which IMO would be a rather lame response) it'd likely just show as a generic "timeout" error on somebody's screen. I'd rather the software keep running and communicate exactly what went wrong. – jrh Feb 16 '21 at 13:49
  • ... if for whatever reason you didn't plan for the circuit to go bad and didn't think of a valid response like this, you could say it's better for the system to crash, and maybe it even would (but I wouldn't take it for granted). However effective error handling in any case IMO should also ask "how should the layer above me respond to the problem". A generic "assertion failed" gives no information to a maintenance person. If you see a possible failure path, make a strategy for it, assumptions are a form of gambling. Every failure you can plan for and handle gracefully is precious, use it. – jrh Feb 16 '21 at 14:00
24

I think you are asking the wrong question. IMHO it is pretty obvious that the principle applies especially in the context of GUI (and other UI) applications; one should rather ask whether it also applies in the context of non-UI applications.

Why? Simply because for a UI program, there is usually a user sitting in front of the system who can react accordingly. When a UI program detects something which indicates a bug in the running process, there is no justification for sweeping this under the rug or pretending nothing really bad happened. The process should terminate with a clearly visible error message, maybe together with some post-mortem information. This immediately gives the user who sees the message the possibility of taking a sensible course of action. Depending on the kind of application, this might be something like

  • restart the program and see whether the failure occurs again

  • see whether the issue can be worked around

  • check if some data was lost and initiate a recovery procedure

  • call the hotline, or send them an email with the error message and a description of what happened

  • use a different program (or maybe no software at all) as long as there is no bug fix available

or whatever makes most sense for the given system. Doing this as early as possible will not only prevent the program from actively causing any unintended damage, but also make it easier for the hotline or the maintenance dev to find the root cause of the issue.

For non-UI applications, the situation is a little more complicated. They will often require some automatic failure handling, because such systems typically work unattended. To implement this in a reliable fashion, one usually makes use of multiple processes, where at least one process, which is not too complex and written in a robust manner, has the role of a monitor or controller, and the others have the role of workers. The worker processes can still follow the crash-early principle. For the controller, a worker that crashes is anticipated behaviour, so the controller can apply some heuristics for how to react (often it automates exactly the actions listed above, the ones a human would apply to a crashed GUI program). There exist other models than worker-controller for resolving this issue, but all of them utilize multiple processes.
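A minimal sketch of that controller/worker split (Node-flavoured TypeScript; the worker script path is an illustrative assumption):

```typescript
import { fork } from "node:child_process";

// The controller stays small and robust; the worker may crash early.
function superviseWorker(): void {
  const worker = fork("./worker.js"); // illustrative worker script

  worker.on("exit", (code) => {
    if (code !== 0) {
      // A crashing worker is anticipated behaviour for the controller:
      // log it and restart a fresh process in a known-good state.
      console.error(`Worker exited with code ${code}; restarting.`);
      superviseWorker();
    }
  });
}

superviseWorker();
```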

Today even lots of complex GUI applications like web browsers, word processors, and dozens of other programs are implemented that way: certain parts of them run in external processes. When such a process "crashes early", it is often not necessary to shut down the whole program; just terminating the failed process and giving the user some of the options I mentioned, directly within the GUI, can be enough.

Doc Brown
  • There are different classes of non-GUI applications: a daemon, service, or other long-running process expected to handle multiple requests shouldn't crash, but a Unix-style CLI utility might as well; it's not like it can do anything more, and the next item in the pipeline will just restart the application anyway. GUIs are generally more stateful than the latter, and the question becomes not one of keeping the gears turning but one of preserving the user state. This can be done with worker processes, but also with e.g. robust autosave. – Maciej Stachowski Feb 15 '21 at 10:29
  • @MaciejStachowski And even the daemon part is dubious. Unless it has some containment mechanisms, it is probably better to let it crash and have the system restart a fresh copy in a valid state. – spectras Feb 16 '21 at 10:59
  • Also, whenever possible (GUI or CMD), error messages and consequences should be presented in clear, simple language. "Do you want to release poison gas to relieve pressure?" is better than "Fleem connoiter valve?" – GalacticCowboy Feb 16 '21 at 15:42
11

In short

Yes, crashing early is an advisable approach for GUIs as well; but since the robustness expectations are higher there, one should seek to minimize as much as possible the risk of reaching an inconsistent state in the first place.

Some more arguments

Early crash in a GUI

A GUI's goal is to make the user's life easy and the system understandable. When it crashes, a GUI is no longer graphical, and the user is lost; it has failed its prime purpose. Here, for example, is an inviting cafeteria screen at an airport:

[Image: a BSOD on an airport cafeteria screen]

Nevertheless, if the system detects that it can no longer guarantee reliable operation (e.g. an unexpected inconsistency, exhausted resources) and it can do nothing to revert to a safe situation, the best thing is indeed to terminate in a way that limits damage as much as possible. It's not only the advice of The Pragmatic Programmer; it's also an ethical principle: AVOID HARM.

Is it the only way?

Before jumping prematurely to an easy crash-early solution, ask yourself critically how the detected inconsistency could have been avoided in the first place and, if it is unavoidable, how it can be recovered from. Here is a more friendly, recoverable failure:

[Image: a device screen showing a friendly, recoverable error message with instructions]

On finding that the pipe arm is not where it is supposed to be, the device shows a useful error in a user-friendly way instead of crashing. Depending on its design, it could retry, clean up the current state and restart the arm-control subsystem, or just give further instructions when the technician opens the device.

More robust systems

My point here is that "crash early" is still valid, but it should not be considered in isolation, without at the same time considering how to make the system more robust.

How many so-called "internal errors" are just bugs, or the consequences of poor practice? Hackers still exploit buffer overflows today because someone forgot to sanitize the input. Segmentation faults still happen because someone assumed memory allocation always works. Division by zero still crashes programs because a defective sensor returned 0 Kelvin (that's -273°C: couldn't someone have checked that the parameter is in an acceptable range?).
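For instance, a range check at the input boundary turns the sensor problem above into an anticipated, reportable error instead of a mysterious crash deep inside the system (a sketch; the range limits and names are illustrative):

```typescript
// Validate sensor input where it enters the system, so an impossible value
// becomes a "please check sensor" message rather than a divide-by-zero.
const MIN_KELVIN = 200; // plausible operating range, assumed for illustration
const MAX_KELVIN = 400;

function checkedSensorReading(raw: number): number {
  if (!Number.isFinite(raw) || raw < MIN_KELVIN || raw > MAX_KELVIN) {
    throw new RangeError(
      `Sensor reading ${raw} K outside [${MIN_KELVIN}, ${MAX_KELVIN}] K; check the sensor.`
    );
  }
  return raw;
}
```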

Moreover, quite a few errors can be recovered from: a function may raise an exception that is caught to limit the damage; a module, a thread, a process, or a subsystem may be killed, reinitialized, or restarted. The system can even diagnose itself by looking at performance stats, to warn the user preemptively that it is under unusually high load.

And there are systems that are not allowed to fail. In the 20th century, Margaret Hamilton saved the Apollo crew by designing a priority-based scheduler that could cope with an (unanticipated) capacity overload. If her code had simply crashed early, some astronauts might not have made it back to Earth.

We are in the 21st century: every smartphone has a hundred times the processing power of an Apollo space computer. A lot can be done to prevent a crash in the first place.

Conclusion

So yes, consider the crash-early advice of The Pragmatic Programmer as valid. But please also consider, at the same time, at least some elementary defensive programming practices, such as verifying parameters and checking error status after a call. Maybe the higher robustness will require revising the architecture or some additional design, but your users will definitely value it.

Toby Speight
Christophe
  • All true and good advice here, but nevertheless I think this answer gives a wrong impression of what crash-early is about. Crash-early is not primarily about exhausted resources or terminating a program when it detects wrong input data; it is first and foremost about situations where a program detects a state which can only be explained by **an internal bug**. – Doc Brown Feb 14 '21 at 06:33
  • Thank you. My whole point here is to prevent jumping prematurely to a solution (early crash) without first rethinking the problem (the causes of the abnormal situation) and trying to engineer a robust, fault-tolerant solution in the first place. Sorry if resource exhaustion was a bad example. Can you define "internal error" and propose a criterion that distinguishes such errors from other bugs? There are still people out there who claim that memory corruption is an internal bug when in reality it is due to a buffer overflow caused by missing sanity checks on the input (which is not so internal after all). – Christophe Feb 14 '21 at 09:16
  • Maybe I should have just written "bug". By the term "internal" I wanted to emphasize that it is an indication of a bug in the running process, not an issue observed by a watchdog process. And when a process detects a corruption in its memory, this is definitely a justification for an "early crash"; it does not matter whether it is caused by a missing sanity check or a buffer overflow or if it has a different cause. – Doc Brown Feb 14 '21 at 16:16
  • @DocBrown Ok, I see. I think my arguments were weakened by my bad example in the beginning. For me it's not about being for or against early crash; it's only that early crash shall not be considered in isolation. I've tried to clarify my point in an edit. Thank you for your critical and constructive exchange. – Christophe Feb 14 '21 at 22:58
  • @Christophe your answer makes your point much more clear than your comments on the other answer. Definitely a program should deal gracefully with all problems that can be anticipated. The early-crash advice is only about detecting violated invariants, it is not an error handling mechanism and should not be used as such. – spectras Feb 16 '21 at 11:11
  • @spectras thank you for your feedback. I finally feel understood by someone ;-) It appears that comments unfold sometimes in an unexpected way, and it is then difficult to clarify what was initially meant. – Christophe Feb 16 '21 at 12:55
7

It depends. In theory, everything is possible and terminating the whole process is the only safe course of action.

In practice, we (= my company) do the following: When an unexpected error occurs in the business logic layer or below, we

  • catch the exception in the UI layer,
  • roll back all transactions,
  • log the error,
  • show an error message to the user, and then
  • let the user continue to work in the UI layer.

It works for us, because (a) all important database operations are done in the business layer and (b) we use transactions to ensure atomicity of operations, so reverting back to the UI layer is reverting back to a known safe state.
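In code, the pattern looks roughly like this (a sketch in TypeScript; `db`, `logger`, and `showErrorDialog` stand in for the actual application services, which are assumptions here):

```typescript
interface Transaction {
  commit(): Promise<void>;
  rollback(): Promise<void>;
}

// Stand-ins for the real application services (assumptions, not real APIs).
declare const db: { beginTransaction(): Promise<Transaction> };
declare const logger: { error(err: unknown): void };
declare function showErrorDialog(message: string): void;

async function runBusinessAction(
  action: (tx: Transaction) => Promise<void>
): Promise<void> {
  const tx = await db.beginTransaction();
  try {
    await action(tx);
    await tx.commit();
  } catch (err) {
    await tx.rollback(); // revert to a known safe state
    logger.error(err);   // keep the post-mortem information
    showErrorDialog("An unexpected error occurred; your changes were not saved.");
    // Control returns to the UI layer, and the user continues working.
  }
}
```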

Yes, there is the possibility that the UI might now show outdated data, but this is something that can always occur in multi-user systems (even without unexpected errors), so the program needs to be able to deal with it anyway.

We've been doing that for decades and, so far, we have not had one single data corruption that could have been prevented by deliberately crashing the UI. For our software, this has turned out to be the right compromise between safety and user experience. Obviously, if your software is safety-critical (ours isn't) and human lives are at stake, your choice might be different.

Heinzi
  • Who is "we", and what is "our software"? I think how well this approach works depends heavily on the level of isolation between an UI layer and other parts of the system. – Doc Brown Feb 14 '21 at 18:53
  • @DocBrown: We = my company (thx, I clarified this), our software = mostly line-of-business applications (ERP stuff). I fully agree with your second sentence. – Heinzi Feb 14 '21 at 19:09
  • @DocBrown: I also think that this approach works better for multi-user database applications than for other types of software: In the former, you need to persist all "global state" in the database anyway (due to multi-user concurrency) rather than in "local objects". This has two consequences which are beneficial for this approach: The state is protected by transactions (which local object state usually isn't), and local business objects are necessarily short-lived (since they might be out-of-date any time) - which reduces the risk of unexpected errors corrupting local state. – Heinzi Feb 15 '21 at 08:03
2

The assert() primitive is a classic way to test for a condition and to throw an exception if "what can never happen ... just did." When designing software, I fill my code with assertions (and similar "suspicious defenses"), and upon deployment I leave them in. My software is always looking out for trouble, because "the software itself" is really the only party that is capable of realizing that something is wrong and calling attention to it.
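A sketch of such an always-on assertion (TypeScript; the name is mine, not a standard API):

```typescript
// An assertion that stays active after deployment. It throws, so the
// outermost handler described next gets to decide what happens.
function assertAlways(condition: boolean, detail: string): asserts condition {
  if (!condition) {
    throw new Error(`"Impossible" condition reached: ${detail}`);
  }
}
```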

Your application should have an "outermost-level" exception handler which will intercept "an exception of last resort," and you should carefully work out a meaningful class hierarchy for those exceptions – above and beyond whatever is built into the language. Be specific in trapping the exceptions that you expect to find in a particular section of the code, allowing unexpected exceptions to bubble up to a higher-level handler. Figure out a strategy for handling them all, even if one is "we now have no choice but to terminate the program."
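For example, a small hierarchy plus a handler of last resort might look like this (a Node-flavoured sketch with illustrative class names):

```typescript
// A meaningful class hierarchy above the built-in Error type.
class AppError extends Error {}
class UserInputError extends AppError {}     // expected: report and continue
class InvariantViolation extends AppError {} // "can never happen"

// The outermost-level handler: everything unexpected ends up here.
function handlerOfLastResort(err: unknown): void {
  if (err instanceof UserInputError) {
    console.warn(`Please correct your input: ${err.message}`);
    return; // recoverable: carry on
  }
  // Unexpected or "impossible": report it, then terminate the program.
  console.error("Fatal error:", err);
  process.exit(1);
}

process.on("uncaughtException", handlerOfLastResort);
```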

The "exception" mechanism can also be used as a goto of sorts. For example, you might be deeply buried in code when you realize that the user made a mistake. You can throw an exception of an appropriate class, knowing that it will eventually be caught by the chosen handler. You can attach arguments to the exception object, and I suggest that one (dummy ...) argument should be one that allows you to uniquely identify the point in the code where it was thrown. This is actually a very clean way to handle these "exceptional" situations.

"If something goes wrong, a yellow baseball is going to come flying from somewhere toward the catcher's mitt. Therefore, if you don't see such a baseball, nothing has gone wrong (yet) because everybody's looking for trouble."

Mike Robinson
  • On one project we had an `assert_or_else` macro. As in: (not actual syntax) `assert(thing != NULL) else {errno = E_WTF_THE_THING_IS_NULL; return -1;}` This was for an unattended system where crashes were very bad (not nuclear-power-plant levels of bad). The macro automatically logged if the assertion failed, obviously, and on development systems, it crashed. – user253751 Feb 14 '21 at 06:45
  • Yes, it is a good idea to test for problematic issues not only in the debug version of a program, but also in the deployed version. Unfortunately, the classic C macro *assert* was not really designed for this purpose; it is made for debugging, nothing else. So I would recommend not taking the advice to leave "assert" statements in a deployed version too literally; one should use a more sophisticated mechanism which allows more differentiated error handling and ends the program in a more controlled fashion. In most modern programming languages, special exceptions can be utilized for this. – Doc Brown Feb 14 '21 at 07:25
  • In higher-level languages you get that "dummy" argument for free on exceptions, since they capture a stack trace by default which points to the path taken to end up at the exception. – masterX244 Feb 15 '21 at 10:57
  • Active asserts in production code are at least a code smell though, and close to an anti-pattern. The problem with an assert is that it puts up an error message aimed at the coder, reporting debugging information. There is a general UI principle that no message should be reported to the user unless it gives the user information about what they should do. The user should not (I'd go as far as "must not") be told that fooBarFlag is false at line 1365, only whether they need to restart and whether their work got saved. Having these checks is good, but you need a different reporting mechanism. – Graham Feb 16 '21 at 10:04
2

As you're asking about what you perceive to be a difference between "GUI" applications and "non-GUI" ones, which I will interpret as meaning no UI at all since the passage quoted refers to washing machines, it seems to me there are two significant differences between them. Note that I'm speaking from my own experience: I develop both embedded control software and related GUIs, plus some general scripts.

In a GUI application, you have the option to communicate. This is much harder in, say, a motor car's fuel injection controller. Given this, it makes a lot of sense to inform the user of what you think went wrong, even if that's just "the programmer asserted that X should be 0 and it's not", before giving up. If it's a desktop application with no external influence beyond corrupting files, you may choose then to offer an option to carry on regardless. However, I'd weigh that up with serious consideration first.

In an embedded controller, carrying on regardless is likely to have serious potential consequences. On the other hand, giving up completely may not be much better. A fuel injection controller that just stops completely when you're in the middle of overtaking on a single carriageway road with a bend coming up (yes, I've been there!) really isn't good. In this case disabling only part of the functionality or, in extremis, rebooting (quickly!) is a better option.

So as to whether the advice "also" applies to a GUI, I'd say in many ways it applies more to a GUI, but the distinction isn't the right one. The important distinction is whether the application is mission- or safety-critical and, if so, what the results of the FMEA (failure mode and effects analysis) are.

Rob Pearce
2

In a production environment, where an unexpected crash may not be acceptable, it is often possible to abort a smaller unit of execution instead. For a GUI, this would often be a single user action (a menu item selected, a keystroke, an icon clicked, or the like): log the error somewhere and revert the application to the state it was in. The action will not work and the bug will be reported, but this is much less annoying than crashing the whole application.

For a server application, it is often a single web request that can be aborted, though for security reasons you may also need to invalidate the session. Here too, if we shut down the server, the administrator will likely just restart it anyway.

For batch document analysis or web crawling, the unit is usually a single document or picture. Here it is important to log what has not been processed, and to provide the possibility of reprocessing the failed entries later.

A database transaction is a very useful feature in this context, as it allows the system to stay in a consistent state after a rollback. The worst designs I have ever seen made multiple changes to the database without transactions, so that a crash left it inconsistent.

Terminating such an execution unit is one of the proper ways of using exceptions. The unit is terminated by throwing an exception that is caught by the code responsible for cleaning up and starting another unit. Apart from logging, this code normally does not care about the precise reason why the previous unit failed; hence a huge diversity of exception types is not needed. Google Go, for example, has only one such mechanism, panic.
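A sketch of such a unit boundary (TypeScript; `WorkItem`, `fetchNext`, and `handle` are illustrative assumptions):

```typescript
interface WorkItem { id: string }

// Stand-ins for the real queue and processing logic (assumptions).
declare function fetchNext(): Promise<WorkItem>;
declare function handle(item: WorkItem): Promise<void>;

async function serveForever(): Promise<void> {
  for (;;) {
    const item = await fetchNext();
    try {
      await handle(item); // one unit of execution; may fail for any reason
    } catch (err) {
      // Log which unit failed so it can be reprocessed later, then start
      // the next unit in a clean state. Beyond the log entry, the precise
      // reason for the failure does not matter here.
      console.error(`Unit ${item.id} aborted:`, err);
    }
  }
}
```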

h22
1

I think an informative decision on handling errors in UIs comes from one of the most ubiquitous UI technologies, React. The following passage comes from the documentation on Error Boundaries, which are components that can catch and handle errors from components nested below them in the hierarchy:

As of React 16, errors that were not caught by any error boundary will result in unmounting of the whole React component tree.

We debated this decision, but in our experience it is worse to leave corrupted UI in place than to completely remove it. For example, in a product like Messenger leaving the broken UI visible could lead to somebody sending a message to the wrong person. Similarly, it is worse for a payments app to display a wrong amount than to render nothing.

This doesn't say much about error handling in UIs in general, but it does provide a good example of why it's a reasonable default to bubble up errors and fail completely when an error isn't explicitly handled.
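For reference, a minimal error boundary along the lines the documentation describes might look like this (TypeScript, using `React.createElement` to avoid JSX; the fallback text is illustrative):

```typescript
import React from "react";

interface State { hasError: boolean }

class ErrorBoundary extends React.Component<React.PropsWithChildren, State> {
  state: State = { hasError: false };

  // Invoked by React when a descendant component throws during rendering.
  static getDerivedStateFromError(): State {
    return { hasError: true };
  }

  componentDidCatch(error: Error, info: React.ErrorInfo): void {
    // Report the failure rather than leaving a corrupted subtree in place.
    console.error("Subtree crashed:", error, info.componentStack);
  }

  render(): React.ReactNode {
    if (this.state.hasError) {
      // Render a fallback (or nothing) instead of the broken UI.
      return React.createElement("p", null, "Something went wrong; please reload.");
    }
    return this.props.children;
  }
}
```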

mowwwalker
0

When you encounter a situation where you don’t know what to do, the only thing to do is throw up your hands, say “I don’t know what to do”, and stop doing what you were trying to do.

A GUI is composed of various subsystems, and all of those subsystems should do the same thing with a situation they don’t know how to handle: throw up their metaphorical hands and metaphorically say “I don’t know what to do”. Eventually that message will either be lost or will reach a part of the process that knows what to do.
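A sketch of that bubbling, with illustrative names; each layer handles only what it understands and passes the rest upward:

```typescript
class FileNotFound extends Error {}

// Stand-ins for lower-level subsystems (assumptions, not real APIs).
declare function readAndParse(path: string): string;
declare function emptyDocument(): string;

function loadDocument(path: string): string {
  try {
    return readAndParse(path);
  } catch (err) {
    if (err instanceof FileNotFound) {
      return emptyDocument(); // this layer knows what to do with this one
    }
    throw err; // "I don't know what to do": let it bubble up
  }
}
```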

It doesn’t matter if it’s a GUI application, a function in a command-line program, or an AWS function. Anything else is, as the quote implies, a lie. Told to do something, the system either reports back “did it” when in fact it didn’t, or just never says anything again. Both are lies.

Crash early, crash often.

jmoreno