What does the crash early concept mean?

Question

While reading The Pragmatic Programmer e2, I came across Tip 38: Crash Early. Basically, the author, at least to my understanding, advises to avoid catching exceptions and let the program crash. He goes on to say:

One of the benefits of detecting problems as soon as you can is that you can crash earlier, and crashing is often the best thing you can do. The alternative may be to continue, writing corrupted data to some vital database or commanding the washing machine into its twentieth consecutive spin cycle.

Later he says:

In these environments, programs are designed to fail, but that failure is managed with supervisors. A supervisor is responsible for running code and knows what to do in case the code fails, which could include cleaning up after it, restarting it, and so on.

I am struggling to translate that into real code. What could be the supervisor the author is referring to? In Java, I am used to using a lot of try/catch. Do I need to stop doing that? And replace it with what? Do I simply let the program restart every time there is an exception?

Here is the example the author used (Elixir):

try do
  add_score_to_board(score);
rescue
  InvalidScore ->
    Logger.error("Can't add invalid score. Exiting");
    raise
  BoardServerDown ->
    Logger.error("Can't add score: board is down. Exiting");
    raise
  StaleTransaction ->
    Logger.error("Can't add score: stale transaction. Exiting");
    raise
end

This is how Pragmatic Programmers would write this:

add_score_to_board(score);

You understand that if you get an exception you did not expect and know how to handle, then _your world is broken_ and you have no idea what to do in order to proceed correctly. Doesn't "Pull the emergency brake" sound like the most reasonable thing to do? — Thorbjørn Ravn Andersen, Dec 11 '20 at 14:18
I was thinking of answering this but I've already said what I have to say [here](https://softwareengineering.stackexchange.com/a/372036/131624). — candied_orange, Dec 11 '20 at 18:10
@ThorbjørnRavnAndersen: Unfortunately, it's rare for functions to distinguish exceptions that indicate "The requested operation couldn't be completed, but all side effects from the attempt have been rolled back", from "The requested operation couldn't be completed, and some objects have been *expressly invalidated* and will cause exceptions to be thrown if code tries to use them" or "The requested operation couldn't be completed, and the effect on system state is unknown." Many exceptions of the first type should be handled identically *whether or not the particular cause of the failure... — supercat, Dec 11 '20 at 22:48
...had been anticipated. Likewise, in many cases, for exceptions of the second type. If all objects which may have gotten invalidated by an exception are going to get discarded during stack unwinding, the consequences of such invalidation will vanish with them. If an expressly-invalidated object is essential to system functioning, the attempts to use it will trigger an exception and crash when they occur. As before, the particular cause of the exception will often not matter. What really matters is whether the system state will be consistent with caller expectations. — supercat, Dec 11 '20 at 22:52
A different phrasing for the concept that more people seem to intuitively get is: ‘Only catch exceptions you can recover from or expect to happen, and reraise those you can’t recover from.’. The idea is to not let unknown or unhandled states persist and to minimize the potential problems they can cause by bailing out as early as possible once you know you are in an unknown or unhandled state. — Austin Hemmelgarn, Dec 11 '20 at 23:56
@supercat The assumption was that you have never seen the error situation before. How do you know what category it is in then? — Thorbjørn Ravn Andersen, Dec 12 '20 at 03:27
@ThorbjørnRavnAndersen: With the exception hierarchies in frameworks, there probably isn't a good way. If exceptions were categorized as described, however, then a function which calls another function that throws an exception would know whether to let it leak as the same category, escalate it, or catch it. — supercat, Dec 12 '20 at 04:37
@supercat Yes and that is what you as a responsible and careful programmer have done with what you've seen so far. Now you have a situation you've never seen before, and therefore have not been able to categorize. Now what? — Thorbjørn Ravn Andersen, Dec 12 '20 at 10:57
@ThorbjørnRavnAndersen: If in response to an open-file request one invokes a function `LoadWoozleDocument(somePath);` and gets an exception that indicates that the document didn't load, but the attempt had no side effects, one should display a message which indicates that the document didn't load. If the failure was because of e.g. some unanticipated network-drive-related problem, the message may not be as useful as if one anticipated the failure, but the fundamental behavior should remain the same: abandon the attempted operation but otherwise allow the program to proceed as normal. — supercat, Dec 12 '20 at 17:10
@ThorbjørnRavnAndersen: Now suppose one has a function which is supposed to merge some data read from a stream into a `WoozleDocument`, and the presence of invalid data in the stream causes it to fail in such a way that leaves the `WoozleDocument` in question in a corrupted state, but doesn't affect anything else in the system. If that function was called from `LoadWoozleDocument`, and only corrupted the new document that was being created, it should again report that a document couldn't be loaded from the file; once the corrupt document is abandoned the system state would be normal again. — supercat, Dec 12 '20 at 17:17
@ThorbjørnRavnAndersen: Additional complications arise if one adds interfaces into the mix. If an implementation of `IEnumerable` is supposed to receive and parse data from a network stream, and a function receiving the `IEnumerable` is supposed to merge it into a collection, what should happen if the stream successfully returns data, but the data can't be parsed? What if the connection is reset? The method that creates the object that implements `IEnumerable` may anticipate such failures, but a method that's trying to build the collection can't. It may, however, need to... — supercat, Dec 12 '20 at 17:45
...distinguish cases where the collection was left unmodified, those where the collection had some items added but is otherwise valid, and those where the collection has become corrupt as a result of an exception occurring between two operations on the collection which needed to both occur if either one did. Even if the caller would have anticipated the exception thrown by `IEnumerable.MoveNext` that would say nothing about whether the caller was prepared to deal with a partially-modified or corrupted collection. — supercat, Dec 12 '20 at 17:48
There is also the related concept known as [Crash Only Software](https://en.wikipedia.org/wiki/Crash-only_software) — Reginald Blue, Dec 12 '20 at 19:25
Check your spelling. There's a mjaor difference between exiting and existing. — Mast, Dec 12 '20 at 20:12
Some languages like erlang have supervisors built in, some installations use specific restarting supervisors (supervisord, postmaster, runit, ...) for normal operating system deployments, both windows services and linux/unix init or systemd have limited support for restarting. — eckes, Dec 12 '20 at 23:35
@ThorbjørnRavnAndersen Read it. It has good examples. Not reading the explanation of why you are wrong doesn't make you any less wrong. — user253751, Dec 13 '20 at 16:01
Related: https://ericlippert.com/2008/09/10/vexing-exceptions/ — Andrew Savinykh, Dec 14 '20 at 08:51
@ThorbjørnRavnAndersen: Suppose an object method is supposed to merge data from an abstract stream into the object's content. Should the method designer try to predict all the ways in which the attempt to read from the stream might fail? Should the method try to distinguish for the caller scenarios where the attempt had no effect from those where some data was imported but the object is valid, or those where the object's state should be regarded as corrupted? How can the latter be accomplished without Pokemon exception handling? — supercat, Dec 14 '20 at 16:33
@supercat so therefore the mechanism "crashes" when a failure is encountered instead of trying to hobble along, which is what the question is about. The pokemon catching is at top-level so the app doesn't exit completely. A mechanism prepared for failure would also ensure that existing data is not corrupted. — Thorbjørn Ravn Andersen, Dec 14 '20 at 16:52
@ThorbjørnRavnAndersen: Yes, but if e.g. that method was called because the user selected `File Open...`, the surrounding program should typically continue operation without having opened the file, *whether or not its author had predicted the particular way in which the operation had failed*. — supercat, Dec 14 '20 at 18:24
@supercat Yes, _but the operation crashed instead of trying to continue!_ The programmers had then decided that at _this particular spot_ we can recover gracefully from operations crashing. — Thorbjørn Ravn Andersen, Dec 14 '20 at 18:45
@ThorbjørnRavnAndersen: I would not use the term "crash" to describe operations that fail in a fashion that allows recovery without having to kill the process or other execution context in which they occur. Further, my point was that in many cases a programmer can know how to handle a wide range of things that could go wrong without having had to anticipate them *individually*. — supercat, Dec 14 '20 at 18:51
@supercat Sure you wouldn't. So the problem is the interpretation of "crash". Can we agree that the operation stopped execution and reported that back to upper management instead of trying to handle the unknown situation as well as possible and return the result? — Thorbjørn Ravn Andersen, Dec 14 '20 at 19:07
@ThorbjørnRavnAndersen: At the inner level, an exception should be propagated. The behavior of the outer function that runs in response to "File Open...", however, should generally be "open the document if possible, or present a message indicating that it couldn't be opened", with the act of showing the message being regarded as an acceptable and complete way of handling most failures from unanticipated causes. What frameworks generally lack, though, is a means of identifying the few situations that wouldn't be resolved by abandoning the attempt to open the file and showing a message. — supercat, Dec 14 '20 at 20:11

score 84 · Answer 1 · edited Dec 13 '20 at 13:27

Basically, the author, [...] advises to avoid catching exceptions and let the program crash

No, that is a misunderstanding.

The recommendation is to let a program terminate its execution ASAP when there is an indication that it cannot safely continue (the term "crash" can also be replaced by "end gracefully", if one prefers this). The important word here is not "crash", but "early": as soon as such an indication becomes apparent in a certain part of the code, the program should not "hope" that parts of the code executed later might still work, but simply end execution, ideally with a full error report. A common way of ending execution is to throw a specific exception, which transports the information about where the problem occurred to the outermost scope, where the program should be terminated.

Moreover, the recommendation is not against catching exceptions in general. It is against abusing the catching of unexpected exceptions to prevent a program from ending. Continuing a program when it is unclear whether this is safe can mask severe errors, make it hard to find the root cause of a problem, and cause more damage than stopping suddenly would.

Your example shows how to catch some severe exceptions, for logging. But it does not just continue the execution, it rethrows those exceptions, which will probably end the program. That is exactly in line with the "crash early" idea.
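In Java terms, the points above might look roughly like the sketch below. It is only an illustration under assumptions: `ScoreBoard`, `run` and the "negative scores are invalid" rule are made up for the example and are not from the book. The check throws as soon as the bad value is seen, and only the outermost scope catches, reports, and turns the failure into a non-zero exit code.

```java
import java.util.ArrayList;
import java.util.List;

final class CrashEarlyDemo {

    /** Hypothetical score board that refuses bad input before touching its state. */
    static final class ScoreBoard {
        private final List<Integer> scores = new ArrayList<>();

        void add(int score) {
            if (score < 0) {
                // Stop at the point of detection instead of storing bad data.
                throw new IllegalArgumentException("invalid score: " + score);
            }
            scores.add(score);
        }

        int size() {
            return scores.size();
        }
    }

    /** Outermost scope: returns an exit code so the caller decides how to terminate. */
    static int run(ScoreBoard board, int... scores) {
        try {
            for (int score : scores) {
                board.add(score);
            }
            return 0;
        } catch (RuntimeException e) {
            // Report as much as possible, then give up; no attempt to "repair" anything.
            e.printStackTrace();
            return 1;
        }
    }

    public static void main(String[] args) {
        // The second score is invalid, so the third one is never processed.
        System.exit(run(new ScoreBoard(), 10, -5, 20));
    }
}
```

Note that nothing between `add` and `run` catches the exception; the stack trace still points at the place where the problem was first detected.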

And to your question

What could be the supervisor the author is referring to?

Such a supervisor is either a person who deals with the failure of a program, or another program running in a separate process, which monitors the activity of other, more complex programs and can take appropriate action when one of them "fails".

What this is precisely depends heavily on the kind of program, and the potential costs of a failure. Imagine the failure scenarios for

a desktop application with some GUI for managing address data in a database
a malware scanner on your PC
the software which makes the regular backups for the Stack Exchange sites
software which does automatic high speed stock trading
software which runs your favorite search engine or social network
the software in your newest smart TV or your smartphone
controller software for an insulin pump
controller software for steering of an airplane
monitoring software for a nuclear power plant

I think you can imagine for yourself for which of these examples a human supervisor is enough, and where an "automatic" supervisor is required to keep the system stable even when one of its components fails.

edited Dec 13 '20 at 13:27

Peter Mortensen

answered Dec 11 '20 at 12:31

Doc Brown

Indeed. The idea is that *as soon as* you encounter a situation where the program cannot safely continue *you give up*, and do *not* try to "fix" it. Because by trying to fix it, you a) make things more complex, b) potentially make things worse, and most importantly c) you *move away from the point of failure*, so when the system ends up failing anyway, you are so far removed from the original source of the error that you can no longer figure out what caused it. – Jörg W Mittag Dec 11 '20 at 12:46
This is actually analogous to project management: if you encounter a problem in your project 6 months before the deadline, the correct way to handle this is to escalate the problem *now* to your manager, instead of going off on your own trying to fix it by yourself and realizing two weeks before the deadline that your fixes are not working. – Jörg W Mittag Dec 11 '20 at 12:50
Aka [pokemon exception handling](https://wiki.c2.com/?PokemonExceptionHandling). – Jared Smith Dec 11 '20 at 20:18
In Java, an uncaught exception will only terminate the thread. It will only terminate the program, if it's the only (non-daemon) thread. So this idea only works for the simplest of programs. – Paŭlo Ebermann Dec 12 '20 at 00:21
@PaŭloEbermann: then one has to pick a different mechanics, obviously. I edited the specific example out, for not confusing anyone. – Doc Brown Dec 12 '20 at 00:32
@PaŭloEbermann my impression is when Erlang people talk about programs or processes that should be allowed to die, those can very well be tiny little pieces that are part of an actual user program. Typically there is a hierarchy of processes and if one process "crashes" a process higher in the hierarchy (with therefore more general oversight) is responsible to get it going again (e.g. restart ). That seems to mislead many in thinking the whole (user) program is supposed to crash. Imho they mean only the sub-process that is running into a problem. So somewhat similar to exit a Java thread. – Frank Hopkins Dec 12 '20 at 04:45
Perhaps the divide can be bridged if we reduce "crash" to "stop what you're doing and get back to a known-stable state." Depending on the situation, that could mean crashing a sub-process, aborting a task, reverting an operation, terminating a process, even stopping a distributed process running on a thousand machines, all depending on the context and the error. Figuring out what level of known-stable state to get back to is part of the challenge: a web browser shouldn't crash the whole user program if it encounters a network error, while termination might be the right option for a CLI app. – Zach Lipton Dec 12 '20 at 06:47
@ZachLipton That's another benefit of programs that avoid shared mutable state. The smaller the "state box", the easier it is to recover from unexpected errors - you can just "restart the box". If you have shared mutable state all over your application, you really can't safely restart any part of the application - you need to restart the whole thing. It's not surprising web browsers these days tend to have a separate process for each tab (if the context allows it) - it's still the easiest way to do this kind of boxing. – Luaan Dec 12 '20 at 09:14
When a murder happens you don’t touch anything until the cops show up. You don’t move the body. You don’t clean up the blood. – candied_orange Dec 12 '20 at 14:43
For example, if you have a multithreaded Java program where one thread hits an exception that it cannot properly handle, it must re-throw to cause the whole thread to die. And if child threads dying randomly is not the *intended behavior* of the parent process, it must throw, too, because a dying thread is logically an unhandled exception. And "Crash Early" means that you do not run any cleanup or teardown routines when you stop a process; you simply kill it. This forces you to make the startup process robust/stable instead of trying to improve the cleanup routines of a crashing process. – Mikko Rantalainen Dec 13 '20 at 09:36
@FrankHopkins it's also my experience that Erlang people vastly overestimate the ability of their programs to handle failure in small components. Failing big, rather than assuming you can handle an unexplained component failure is still a good strategy. I really wish Riak would actually crash more often, rather than limp on, insisting "'tis but a flesh wound". – James_pic Dec 14 '20 at 12:21
@PaŭloEbermann you can easily set an uncaught exception handler for a thread group that will terminate the whole JVM if you wish. – Holger Dec 14 '20 at 12:56
Many exceptions are actually expected (e.g. network failure) and need to be handled correctly. However, many exceptions just cannot be handled because you don't know what else could be wrong. For example, a UI resource file could not be found. The computer has run out of available memory. It's impossible to recover from such errors but many programmers would still attempt to handle them. To give a specific example, testing that the return value of `malloc` is not `NULL` in C is completely pointless. If the value is `NULL`, something really bad happened and you just cannot recover. – Sulthan Dec 14 '20 at 15:56
@Sulthan: Some implementations guarantee that if `malloc()` returns null, nothing bad will happen if a program is prepared to handle that possibility. IMHO, the Standard should have included an allocation function which will fail gracefully if possible, and one which may force abnormal program termination but would never return anything other than a valid pointer. – supercat Dec 14 '20 at 18:54
@supercat Yes, C is a bit complicated in that regard because if you don't handle these states, you can end up with corrupted memory and a vulnerability. – Sulthan Dec 14 '20 at 19:16
@Sulthan: What's irksome is that programmers have to explicitly check whether `malloc()` returned null, whether or not there would be any possibility of recovery, and implementations have no way of knowing whether using their own out-of-memory handling logic might be better than returning null (e.g. an implementation might invite the OS to suspend execution of a program unless or until more memory becomes available, but allow for the possibility of resuming execution if that happens, but that couldn't happen if a program called `exit()` because `malloc()` returned null.) – supercat Dec 14 '20 at 20:16

score 17 · Answer 2 · edited Dec 20 '20 at 08:44

The important part here is the kind of error you encountered. There are errors that are expected, and where you know what to do with them. Typical examples are network errors, e.g. in your web application you need to display an error if the server doesn't respond, and probably give the user a button to retry. You don't want to crash everything for this kind of error that you can cleanly handle.

Another type of error is one that simply makes the current job impossible. For example, if you need to read 100 different files for a specific job and any of them fails, there is no point in continuing, because the job cannot be completed. So you don't need a try/catch around every file access; you can let the whole thing either succeed completely or fail on the first error.
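As a rough Java illustration of that "all or nothing" job (the class name `AllOrNothingReader` and the temp-file setup are assumptions made up for this sketch), the reading loop contains no try/catch at all and simply declares that it may fail:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class AllOrNothingReader {

    /** Reads every file, or fails with the first IOException it meets. */
    static List<String> readAll(List<Path> paths) throws IOException {
        List<String> contents = new ArrayList<>();
        for (Path path : paths) {
            // Deliberately no try/catch here: one unreadable file makes the job impossible.
            contents.add(Files.readString(path));
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        Path present = Files.createTempFile("crash-early", ".txt");
        present.toFile().deleteOnExit();
        Files.writeString(present, "hello");
        Path missing = present.resolveSibling("crash-early-missing.txt");

        System.out.println(readAll(List.of(present)));
        // The second path does not exist, so this call fails with an IOException
        // and, since nothing catches it, ends the program.
        readAll(List.of(present, missing));
    }
}
```

Whoever calls `readAll` decides what a failed job means; the loop itself never tries to carry on with partial data.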

The most important error, and the one this statement is really about is an unexpected error that has put your application into an unknown state. Let's assume we're in an application with multiple threads and shared memory. We have a try/catch around the whole program in each thread that catches anything. Is it safe to just restart the thread if any kind of arbitrary exception is thrown?

The answer is no, because of the shared state. The error could have done anything to the shared memory and put it into a corrupt state. What you need to do is get the program into a defined, known good state again. In most programming languages this means crashing the entire program and restarting it. You can't recover from having your application in an unknown state. Any of your assumptions might be broken, and there might simply be garbage data in some of your state.

So of course you should handle exceptions if you understand the error and know how to recover, and if it makes sense to handle the error at that particular point and not at a higher level. What you should not do is try to handle errors that you don't understand, and where you can't guarantee that your application is still in a valid state.

What is special about Erlang/Elixir is that you don't need to crash the entire application. The Erlang VM allows you to have easily hundreds of thousands of processes, and each process is completely isolated, there is no shared, mutable memory there. So in many cases you don't need to catch any exceptions at all, you just let the process crash. This can't affect anything outside that process. And Erlang/Elixir has Supervisors that manage these processes, and you can define restart policies there. So in most cases the process that failed would be simply restarted automatically from a known good state.
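For readers coming from Java, a very rough imitation of a supervisor with a restart policy could look like the sketch below. Everything in it (`SimpleSupervisor`, `supervise`, the restart budget) is a made-up illustration, not an Erlang or JDK facility, and it lacks the key Erlang property: Java threads share memory, so the "fresh worker" is only a known good state if the worker factory really builds all of its state from scratch.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class SimpleSupervisor {

    /**
     * Runs a freshly created worker on its own thread. If the worker dies with an
     * uncaught exception, a new worker is created, up to maxRestarts times.
     * Returns the number of restarts that were needed.
     */
    static int supervise(Supplier<Runnable> workerFactory, int maxRestarts)
            throws InterruptedException {
        int restarts = 0;
        while (true) {
            AtomicReference<Throwable> failure = new AtomicReference<>();
            // A new worker per attempt stands in for "restart from a known good state".
            Thread thread = new Thread(workerFactory.get());
            thread.setUncaughtExceptionHandler((t, error) -> failure.set(error));
            thread.start();
            thread.join();

            if (failure.get() == null) {
                return restarts; // worker finished normally
            }
            if (restarts == maxRestarts) {
                // Restart budget exhausted: escalate instead of looping forever.
                throw new IllegalStateException("worker keeps failing", failure.get());
            }
            restarts++;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger attempts = new AtomicInteger();
        // Hypothetical worker that fails on its first two runs.
        int restarts = supervise(() -> () -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IllegalStateException("simulated failure");
            }
        }, 5);
        System.out.println("restarts needed: " + restarts);
    }
}
```

The worker itself contains no error handling; the decision about what to do after a failure lives entirely in the supervising code, which also gives up and escalates once the restart budget is spent.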

Note the requirement for Erlang-like behavior: no shared mutable memory. In practice, only functional languages can properly support this with multithreading. — Mikko Rantalainen, Dec 13 '20 at 09:40
Personally I think the distinction can be drawn at what you can test. If you can test the exception (by introducing the failure, e.g. pulling the network plug) then you can also write a handler. — lalala, Dec 13 '20 at 18:49

lennon310 · Answer 3 · 2020-12-12T14:45:13.070

11

What could be the supervisor the author is referring to?

In the context of the book, the author is referring to the supervisor in Erlang. It handles restart logic for crashing processes and receives exit messages from its dying child processes. The supervisor can then decide what action to take to bring the system back to a stable state. Restart policies for the processes are defined on the supervisor.

Because supervisors in Erlang manage the processes, we can just let a process crash without affecting anything outside it, instead of catching the exceptions (and trying to address/fix them).

In Java, I am used to using a lot of try/catch, do I need to stop doing that?

We should avoid abusing try/catch for unexpected exceptions, because it may be unclear whether it is safe for the program to continue. If the program fails later, it may be very difficult to track down the root cause.

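Taking Java as an example, exceptions inheriting from RuntimeException are unchecked: the compiler does not force you to catch them, and if nothing catches them they terminate the thread that threw them. For example, rather than wrapping code in try/catch, just let it crash on a NullPointerException.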

In your code example, the exception is caught, logged, and then rethrown. It is similar in Java, where a checked exception (which the compiler forces you to handle or declare) can be caught and re-thrown wrapped in an unchecked one without losing the stack trace info, for example

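// needs: import java.sql.SQLException;
try
{
  // saveScore is a hypothetical method declared with "throws SQLException"
  saveScore(score);
}
catch (final SQLException e)
{
  // log the error if necessary; passing e as the cause keeps the original stack trace
  throw new RuntimeException(e);
}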

edited Dec 12 '20 at 14:45

answered Dec 11 '20 at 12:59

lennon310

It's hardly an Erlang-only idea. I've written a similar setup in C: small, simple, reliable monitor program keeping an eye on (and distributing tasks to) a collection of complex, unreliable worker programs. – Mark Dec 12 '20 at 01:47
I was only referring to the book context, it is using Erlang as the example – lennon310 Dec 12 '20 at 02:55
In Java, isn't a threadpool basically very similar. A thread runs into an exception and you let that execution die and abort, but all the others keep running - and depending on threadpool management a new one might be spawned. The benefit of erlang processes is that they are super cheap to handle compare to Java threads and the supervisor tree structure makes it likely easier to define different recovery behaviour along the tree hierarchy. – Frank Hopkins Dec 12 '20 at 04:50
@FrankHopkins, ...though Java threads can be modifying shared state, so one that goes sufficiently far off the rails can break your program as a whole. This is where functional languages make reasoning about side effects easier -- any state changes need to be explicit, so you don't have hidden places where one part of the program can change how another part behaves. – Charles Duffy Dec 14 '20 at 17:48
@FrankHopkins It's not really related to thread pools. The thread pool will typically abort the *task* the thread is running and continue with the next task... but the only reason it does that is because it has no other sensible way to handle it (ti can't propagate to the caller). – user253751 Dec 14 '20 at 18:19

Zach Lipton · Answer 4 · 2020-12-13T06:35:31.320

How about a physical analogy? Your boss instructs you to organize and file some boxes of paperwork and tells you to do exactly as you're told and not to bother her for any reason until the job is done. During the process of filing, you:

Discover the office hallway is blocked off for construction work. You stand there starving and dehydrated for a week until the work is complete and the hallway is reopened.
Discover the file room door is locked, so you pile all the files up in front of the door.
Realize the index that tells you where each file goes is missing, so you make up your own new filing scheme entirely different from the existing one.
Realize the labels on the files are in a writing system you don't understand, so you guess wildly at how to alphabetize them.
Notice one of the boxes contains an active bomb, but you know you're not supposed to disturb your boss, so you file the bomb and don't tell anyone.
Notice the office is now exploded and on fire, and keep delivering files into the flames until the fire department drags you out of the building.

When you meet your boss outside, you let her know you finished the filing job and there were just a few problems you noticed along the way. That's what happens when you don't crash early: at every point in the process, the environment was unsuitable for you to do the work, but you kept on going in the hope it would work out instead of stopping immediately.

So what does that mean for programming? If there's a problem (usually delivered to you in the form of an exception or a failed assertion check), you need to immediately assess whether it's something you can deal with. Unless you have a clear plan to recover from the problem, you should never just keep going on blindly in the hope it's all going to be fine somehow.

There are a lot of judgement calls here that will depend on your application. If you're processing all the files in a directory and one turns out to be corrupt, there's no hard rule about the right thing to do. For some applications, it will make the most sense to roll everything back and leave things as they were. For others, it would make more sense to skip that file and process the rest. Or it might be best to pause and alert a human and give them a choice of what to do, or allow such configuration before the task starts. You'll have to decide what makes the most sense given the context of how your application is used and the ways in which it could cause problems if something goes wrong. This requires even more careful analysis when the software is serving a critical purpose: your judgement about how to handle missing sensor data will likely be different if you're designing a floor cleaning robot (where it may be more important to stop the robot immediately before it causes damage) vs flight control software (where you've put considerable design into redundancy, gradual degradation, and failure modes).

score 5 · Answer 5 · answered Dec 11 '20 at 14:27

Exceptions are meant to communicate to your caller that you couldn't fulfill your job. [That's the most-ignored fact about exceptions.]

Fail Early

That's good advice. As soon as you find out that you can't complete successfully, it's best to immediately inform your caller about that fact (after cleaning up any inconsistent state that you'd otherwise leave behind, if that applies to your application).

Continuing in your program is typically useless and can even be dangerous because of missing or wrong data.

So, e.g. when opening a file, don't immediately catch the exception, log it and continue. The following code will try to read from that file and of course fail as well.

Generally, you write program statements because your logic needs them. So, if one of your steps fails, the whole method won't give the desired results. So, let exceptions that you receive simply bubble up the stack, and actively throw appropriate exceptions whenever you detect failure conditions.

Avoid Catching

Although a good general guideline, "Avoid Catching" is over-simplified.

Better: think three times if you really want to catch exceptions here in this place. I've seen lots and lots of code cluttered with try/catch constructs that are unnecessary and most of the time even quality traps or plain programming mistakes.

Catch exceptions only in places where you can successfully continue, even after some of your program so far has failed. That translates to the question: Do I have a fallback or recovery strategy available that can turn the failure I just experienced into a success? Maybe by a retry/reconnect or by having an alternative algorithm or whatever.

When catching exceptions, you have to be honest with yourself:

You know that something in your current code block failed.
You also know how the failing code labeled the type of failure (by means of creating its exception object), or how some layer in-between re-labeled the original failure reason (by wrapping the original exception object). [You see, this isn't the most reliable source of information.]
Take into account that a given exception type might come from any enclosed piece of code, from any level deep down the stack. So don't assume you know what happened from just looking at the exception object.
With only this information, can you turn your current method into a success?

An honest answer to this reasoning will be "No" in most cases. And then don't catch the exception.

Valid "Yes" situations are e.g. having a retry/reconnect strategy at hand, or an alternative algorithm, or just reasoning about optional code, something like a cleanup that's nice to have, but not necessary for making your current method succeed.
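A retry strategy of that kind might be sketched in Java as follows. The helper name `withRetries` and the choice of `IOException` as "the failure we can recover from" are assumptions for the example; the point is only that the catch block is limited to the one failure with a recovery plan, and that everything else propagates untouched.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class Retry {

    /**
     * Calls op up to maxAttempts times, retrying only on IOException.
     * Every other exception propagates to the caller immediately.
     */
    static <T> T withRetries(Callable<T> op, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                // The one failure we have a recovery plan for: try again.
                lastFailure = e;
            }
        }
        // Recovery did not work, so report the failure instead of hiding it.
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical flaky operation: fails twice, then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("connection reset");
            }
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

If the operation throws, say, an `IllegalStateException`, the helper does not retry at all, because it has no reason to believe a second attempt would turn that failure into a success.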

You should finally catch exceptions at some top-level (user-interface action level, service API top layer, etc.). There:

log the error,
tell the user or your client that their request failed,
and wait for the next request, that probably (hopefully?) won't run into the same problem.

Supervisor

What the author calls a supervisor translates to a well-designed catch block in more traditional languages: a place where you know how to deal with a failure in such a way that you can meaningfully continue.

I was reading through the answers, and was planning on making these exact points, but not only did you beat me to it, you did it better than I would have. Well done. +1 — jmoreno, Dec 13 '20 at 00:22
It's important that the supervisor process does not share writable memory space with the processes it's trying to supervise. Otherwise, a buffer overflow or any other memory corruption can corrupt the supervisor process, too. In practice, you either need to use functional language (where things like variables and memory pointers do not exist at all, e.g. Erlang) or put the supervisor in different process (not thread) and trust the operating system to guard the supervisor memory contents. — Mikko Rantalainen, Dec 13 '20 at 09:43

score 0 · Answer 6 · answered Dec 11 '20 at 12:49

In principle, your code should handle unusual situations, but it shouldn't handle programming errors (and not expecting that an unusual situation might happen is a programming error).

If there are programming errors, then your code should crash, then a developer figures out why it crashed, and fixes it. If there are unusual situations, your code should handle them if it is possible and safe.

gnasher729

Can you provide an example of a programming error? – Peter Mortensen Dec 13 '20 at 11:58
@PeterMortensen one example could be a piece of code which unsafely accesses an array with an out of bounds index – james Dec 15 '20 at 18:16

score 0 · Answer 7 · edited Dec 11 '20 at 19:33

What could be the supervisor the author is referring to?

It depends on the type of program. In the very common case of a webserver, the "supervisor" would be a dispatcher thread that hands off individual requests to worker threads to process. If there is an exception while a request is processed, it's perfectly acceptable (and in fact common) practice to just let that exception bubble up to some high-level exception handler, which logs the exception, rolls back any DB transactions and sends an HTTP 500 error response to the client. The worker thread can then start processing the next request. For a standalone GUI program, you can imagine something similar in the event handler that responds to user input, although here there is a danger of leaving the GUI in a corrupted state.

In Java, I am used to use a lot of try/catch, do I need to stop doing that?

It depends. In many cases, specific exceptions indicate known error conditions that your program can and should handle. For example, trying to parse a date a user entered, but they entered the 30th of February. Keep doing that.
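A small sketch of that date case, assuming the user types ISO dates (yyyy-MM-dd). DateInput and parseUserDate are made-up names; LocalDate.parse and DateTimeParseException are from java.time.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical input helper: catches exactly one known, recoverable error condition.
final class DateInput {
    private DateInput() {}

    // Returns the parsed date, or empty if the text is not a valid ISO date.
    static Optional<LocalDate> parseUserDate(String text) {
        try {
            return Optional.of(LocalDate.parse(text));
        } catch (DateTimeParseException e) {
            // Expected condition: the caller can ask the user to enter the date again.
            return Optional.empty();
        }
    }
}
```

Calling parseUserDate("2020-02-30") yields an empty Optional, so the caller has something sensible to do next. A null argument still blows up with a NullPointerException, which is fine, because that would be a bug in the caller.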

What you should not do is catch(Exception e) and then just continue as if nothing happened.

Java is unusual in that it has checked exceptions, which it forces you to catch (or pollute your method signature with). This is widely considered a failed language design experiment. A pretty common way to deal with checked exceptions you can't usefully handle is to wrap them in a RuntimeException, i.e. do something like catch(IOException e) { throw new RuntimeException("Error processing " + filename, e); }. Note that you can still do something very useful here by adding information (in this case about the file that was being accessed) that will help in debugging when the exception is logged.
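Spelled out as a complete method, the wrapping pattern might look like the sketch below. ConfigReader and readAll are hypothetical names; the file-reading calls are the standard java.nio.file ones.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical reader: it cannot recover from an I/O failure, so it only adds context.
final class ConfigReader {
    private ConfigReader() {}

    static String readAll(String filename) {
        try {
            return new String(Files.readAllBytes(Paths.get(filename)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Keep the original exception as the cause and add the file name for debugging.
            throw new RuntimeException("Error processing " + filename, e);
        }
    }
}
```

The standard library also has java.io.UncheckedIOException for this specific case, if you prefer a more precise type than plain RuntimeException.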

And replace that with what? Do I simply let the program restart every time there is an exception?

Depends on the program. If you can find a central, high-level place where it makes sense to catch exceptions because you can log them and have the program in a consistent state as if whatever action ultimately caused the exception was never started, do that.

If there is no such place, yes, let the program crash. For command line utilities, that is often the correct choice.

score 0 · Answer 8 · answered Dec 11 '20 at 13:22

Further to @DocBrown's answer, it's also worth throwing errors/exceptions in early guard clauses. The example's structure already facilitates this, but people often write code that checks things too late. This can lead to long, WET, highly nested code, more execution than necessary, complex boolean logic, more bugs, and a result that's hard to verify at a glance as bug-free. These concerns apply even if the code can't fall over. But if it can, and you use early guard clauses, you don't just avoid the above problems; you can also assume certain things have succeeded in most of the code.
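For instance, a guard-clause version of a score update might look like this in Java. ScoreBoard, addScore, and the specific validity rules are invented for the example and are not taken from the book.

```java
// Hypothetical score board: all checks happen up front, before any state is touched.
final class ScoreBoard {
    private ScoreBoard() {}

    static int addScore(int[] board, int slot, int score) {
        if (board == null) {
            throw new IllegalArgumentException("board must not be null");
        }
        if (slot < 0 || slot >= board.length) {
            throw new IllegalArgumentException("slot out of range: " + slot);
        }
        if (score < 0) {
            throw new IllegalArgumentException("score must not be negative: " + score);
        }
        // From here on, the code can assume that all inputs are valid.
        board[slot] += score;
        return board[slot];
    }
}
```

The happy path sits at the bottom, unnested, and a bad call fails before the board is modified.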

score 0 · Answer 9 · answered Dec 12 '20 at 19:21

What could be the supervisor the author is referring to?

This could be many different things, depending on the platform and environment in which the program is running.

One possibility is daemontools. This is a Unix-based program which takes charge of starting and stopping programs under its control, including restarting them if they stop unexpectedly. I've successfully used it in production environments.

(I think it's largely been superseded by systemd, which does some of the same things.)

Note that it can only do this if the entire program shuts down — if one thread crashes but others continue, it has no way to tell.

(And, as mentioned in other answers, that would leave your program in an inconsistent state. For example, consider a simple program with one thread reading from a data feed, and another writing it to a DB. If the reader thread crashed, the writer might happily continue indefinitely, with no data to write. However, if the writer thread crashed, the reader could only continue until the memory was full of data waiting to be written — which might take a long time. Either way, it could miss many minutes or hours of data from the feed. Whereas if the whole program shut down, it would miss data for only the few seconds it took to notice the problem, shut down, and be restarted by the supervisor.)

So if your program uses multiple threads (directly or not), it's not safe just to let an exception go uncaught; for safety, you may need to set up a global uncaught-exception handler which does an explicit shutdown of the entire program. (Such a handler needs to be very carefully written, in case it throws an exception… Out-of-memory conditions are especially tricky to handle reliably.)
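In Java, such a handler could be sketched as below. Thread.setDefaultUncaughtExceptionHandler and Runtime.halt are real JDK calls; the class CrashOnUncaught and its two methods are hypothetical. I've assumed halt(1) is wanted because it stops the JVM without running shutdown hooks; use System.exit instead if your hooks must run.

```java
// Hypothetical installer for a "shut everything down" uncaught-exception handler.
final class CrashOnUncaught {
    private CrashOnUncaught() {}

    // The shutdown action is a parameter so the handler can be exercised without killing the JVM.
    static void install(Runnable shutdown) {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            try {
                System.err.println("Uncaught exception in thread " + thread.getName() + ": " + error);
            } catch (Throwable ignored) {
                // Even the logging failed (for example, out of memory); shut down regardless.
            } finally {
                shutdown.run();
            }
        });
    }

    // Production variant: stop the whole JVM so an external supervisor can restart it.
    static void installAndHalt() {
        install(() -> Runtime.getRuntime().halt(1));
    }
}
```

Calling CrashOnUncaught.installAndHalt() once at startup means that a crash in any thread takes the whole process down, which is what a supervisor like daemontools needs in order to notice.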

Things that are really tricky to handle: out of memory because if you try to run any generic code it may try to allocate memory; out of storage space because if you try to log anything, it will fail too if it tries to log to permanent storage. If you run out of all storage, you cannot even log the fact that you did run out of all storage. — Mikko Rantalainen, Dec 13 '20 at 09:46
@MikkoRantalainen Yes, I was writing from bitter experience :-) After many different failures, I ended up _trying_ to log the error in detail, but catching all possible errors and exceptions from that; if that failed, trying to log a simple fixed string; and even if that failed shutting down no matter what. — gidds, Dec 13 '20 at 10:28
I think it meant something different. I edited my answer to mention it. See at the bottom. — FluidCode, Dec 14 '20 at 19:00

FluidCode · Answer 10 · 2020-12-14T18:58:47.330

In Java, almost two decades ago, the concept was called "fail fast", not "crash early". The main issue was that, back then, you usually had a server program handling multiple requests, and it could not stop because of a problem raised while processing a single request. People soon found out that when a program just writes some log lines whenever an exception is thrown, those lines usually end up submerged in a huge amount of trivial reporting, and troubleshooting turns into painful digging through piles of log files. The most pragmatic approach was to send error reports to external applications and, at the same time, to use assertions: assert statements that throw an AssertionError when their condition is false, and that are disabled by default unless the JVM is started with the enable-assertions flag (-ea). The purpose of the assertions was to stop processing quickly in test environments, letting people spot possible issues immediately; hence the term "fail fast", as opposed to "resilient behaviour".
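A tiny illustration of a Java assertion, with a made-up Discount class. The check only runs when the JVM is started with -ea (or -enableassertions); without the flag, the assert line is skipped.

```java
// Hypothetical pricing helper with a fail-fast sanity check for test environments.
final class Discount {
    private Discount() {}

    static int percentOff(int price, int percent) {
        int result = price - (price * percent) / 100;
        // Throws AssertionError under -ea if the result is implausible; a no-op otherwise.
        assert result >= 0 && result <= price : "discount out of range: " + result;
        return result;
    }
}
```

Run with java -ea, a call such as percentOff(100, 150) stops immediately with an AssertionError; run without the flag, it quietly returns a negative price, which is exactly the trade-off described above.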

Unfortunately, later on JUnit reused the word "assert", which created a lot of confusion about it, but this is another issue.

With the development of enterprise Java and server programs spawning multiple instances to process client requests, the need to keep working when an exception is raised lessened. People began saying that if you don't know what to do in a catch block, you had better not wrap the code in a try/catch that would just hide the exception from the caller. This does not mean that you should not use try/catch at all: exceptions should always be handled as well as possible, and you can ignore them only if you are sure that, one way or another, they will be managed at higher levels of the call stack.

Lately, the idea of spawning a child for each request has been expanded by the advocates of reactive programming, with the goal of obtaining fail-fast and resilient behaviour at the same time, provided you have a framework able to supervise, monitor, and automatically handle failed requests. Hence the idea of the supervisor, taken from the Actor Model. Or, better said, the supervisor is a role in the Actor Model as it has been outlined in functional programming; see, for example, the Akka framework or the Actor model in Scala.
