1

I'm currently designing a system that can recover and rebuild itself no matter how hard I try to break it: slow networks, failures, random server deaths. Each action it performs is a fragment, so it can pick up from wherever it left off. Each fragment is marked as "done or not done", and, again, the system can resume its work no matter what happens.
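
In simplified PHP, the idea looks roughly like this (the function and file names are just placeholders; the real system is more involved):

    <?php
    // Simplified sketch of the "fragment with a done-or-not marker" idea:
    // each unit of work records whether it finished, so a re-run after any
    // failure resumes from the first unfinished fragment.
    function runFragment(string $key, callable $work, string $stateFile): void
    {
        $state = is_file($stateFile)
            ? (json_decode(file_get_contents($stateFile), true) ?: [])
            : [];

        if (($state[$key] ?? false) === true) {
            return; // already done on a previous run, skip it
        }

        $work(); // if this throws, the marker stays "not done"

        $state[$key] = true; // mark done only after the work succeeded
        file_put_contents($stateFile, json_encode($state), LOCK_EX);
    }

    // Re-running the script after a crash only executes the fragments
    // that never finished.
    runFragment('step-1', fn () => print("doing step 1\n"), 'pipeline-state.json');
    runFragment('step-2', fn () => print("doing step 2\n"), 'pipeline-state.json');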

Do I stop just because of development time? I've covered all the use cases. I just want my framework to be robust, but how do I know where to draw the line? And is someone who doesn't draw the line impractical, or is that the only way not to leave issues on the table?

I've covered all my core failure cases and lots of exotic ones. Anything beyond that is in the area of "good to have".

When do I stop?

My time is rather unlimited and I can make sure things work out nicely, but I'm at a crossroads where if I do more checks, it becomes difficult to read through even for me.

  • If you have enumerated your vulnerabilities, and have tests that fairly represent those vulnerabilities, and your tests pass, you are done. There is nothing more we can tell you. – BobDalgleish Nov 13 '19 at 22:14
  • "avoid asking subjective questions where … your question is just a rant in disguise: “______ sucks, am I right?”" ([help/dont-ask]) – gnat Nov 13 '19 at 22:18
  • What's your measure of success? Keeping in mind that it is impossible to guarantee 100% uptime and impossible to guarantee that bugs will never exist, no one would ever promise such a thing to a user anyway. It would be more typical to have a Service Level Agreement (SLA) to work against (e.g. uptime percentage, time to respond to live incidents, etc.), then to continuously collect stats to show whether your system works within the parameters agreed in your SLA. If you can collect stats which demonstrate that your system is working as per that agreement, then your job is done. – Ben Cottrell Nov 13 '19 at 22:54
  • @BenCottrell Thank you. That is very, very useful in forming my thinking about how to better architect. I guess I should try to objectively measure what's an acceptable failure rate and test against that assumption. – Daniel Smith Nov 13 '19 at 23:00
  • Final users will find a way to make it fail. – Tulains Córdova Nov 13 '19 at 23:59
  • How are you testing? Are you doing unit tests? – Tulains Córdova Nov 14 '19 at 00:15
  • @TulainsCórdova Unit testing down to the literal type check of everything, plus interacting with it in every single possible way that my partners and I can think of (a QA approach of sorts). We have a suite of servers and hosting plans where we just push our updates, all loaded fully with plugins and other things: slow connections, clogged-up requests, etc. – Daniel Smith Nov 14 '19 at 00:36
  • I have code that's been running fine for decades in shops I no longer work for but I still check up on it. I expect when I'm dead I'll still be checking up on it from the afterlife. – candied_orange Nov 28 '19 at 04:11

5 Answers

2

While your time now might be unlimited, in normal projects it's not.

Your time costs someone money, which means that any error handling you write needs to cost less than the money you would lose if the error actually occurred. That means you need to make at least informal (e.g. in your head) risk and cost-benefit evaluations to check whether the time spent on catching a problem is actually worth the effort. This ends up being mostly a matter of experience with the systems involved.
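
A back-of-the-envelope version of that evaluation can be as crude as this (all the numbers below are invented purely for illustration):

    <?php
    // Expected yearly loss from the failure vs. the cost of handling it.
    $failuresPerYear    = 0.5;     // how often the problem is expected to occur
    $costPerFailure     = 2000;    // money lost each time it happens
    $expectedYearlyLoss = $failuresPerYear * $costPerFailure;   // 1000

    $daysToHandleIt = 2;           // effort to write and test the error handling
    $costPerDevDay  = 600;         // loaded cost of a developer day
    $mitigationCost = $daysToHandleIt * $costPerDevDay;         // 1200

    // Worth doing only if the handling costs less than the loss it prevents.
    echo $mitigationCost < $expectedYearlyLoss
        ? "write the error handling\n"
        : "accept the risk (or revisit the numbers)\n";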

As an example: If you spend two days protecting your order processing code against a network outage, but your hosting provider hasn't had an outage in the last three years and the effect of an outage would be that customers couldn't reach your website to place orders anyway, it's probably not worth putting in that much effort.

Another example: If you spend two weeks making sure your stock broker system resumes after a server death, and the result of not resuming would be thousands of uncommitted stock transactions being lost, then yes, that is definitely worth the effort.

If you aren't being paid and your time essentially has no (monetary) value then you can spend whatever time and effort you want on securing it, because dividing benefit by cost will in that case always result in infinity.

If the code gets too complex due to additional error handling, that's a matter for refactoring, which would usually be covered by the cost-benefit calculation as an additional cost. In your case, you can spend as much time refactoring as it takes until you can understand the code even with all the additional error handling.

In summary: If your time is free and unlimited, only you can decide whether adding more error handling is necessary or worth it. If it's neither free nor unlimited, you catch all errors that are cheaper to catch than to happen.

1

if I do more checks, it becomes difficult to read through even for me

This is the worrying part. As you spend more time crafting your code it should become easier to read.

Considering the practically unlimited time at your disposal, some of it may be well spent on eradicating many of those if statements, of which I suspect there are a lot in your code. Recognize the common logic and give it structure rather than repeating logic and adding conditions. If you post a typical piece of code, I think we will be able to give you some pointers.
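
As a generic illustration of what "apply some structure" can mean (this is not modelled on your actual code), compare repeating defensive checks everywhere with validating once at the boundary:

    <?php
    // Before: every function repeats the same defensive ifs.
    function shipOrderBefore(?array $order): void
    {
        if ($order === null || !isset($order['id'], $order['items'])) {
            throw new InvalidArgumentException('bad order');
        }
        // ... shipping logic ...
    }

    // After: validate once, pass around a value that is valid by construction,
    // and the downstream code needs no further checks.
    final class Order
    {
        public string $id;
        public array $items;

        public static function fromArray(?array $data): self
        {
            if ($data === null || !isset($data['id'], $data['items'])) {
                throw new InvalidArgumentException('bad order');
            }
            $order = new self();
            $order->id = $data['id'];
            $order->items = $data['items'];
            return $order;
        }
    }

    function shipOrderAfter(Order $order): void
    {
        // no defensive ifs needed here; the type guarantees validity
        // ... shipping logic ...
    }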

Martin Maat
  • I agree with the first statement and I've actually gone back to audit a lot of my codebase to see exactly whether I can RELY on what's coming in, so as not to do a check. I've cleaned up a lot of my code that way. My problem is that I'm writing PHP and I don't have the tools you guys have. – Daniel Smith Nov 14 '19 at 17:36
1

A lot of it depends on what kind of application you are building and what SLAs you intend to provide.

No system has ever been built to handle all scenarios perfectly so that the developer can rest. But you have to stop somewhere, and as long as your system does what you intend it to do on most occasions (which is an SLA), and you are able to recover when it misbehaves (the time to recover being an SLA again), you should be fine.
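
For example, if "does what you intend on most occasions" is expressed as an uptime target, checking your collected stats against it is trivial (the target and figures below are made up):

    <?php
    $slaUptimeTarget = 99.9;             // percent, agreed with the users
    $minutesInMonth  = 30 * 24 * 60;     // 43200
    $minutesDown     = 38;               // measured downtime this month

    $measuredUptime = 100 * ($minutesInMonth - $minutesDown) / $minutesInMonth;

    printf("uptime %.3f%% -> %s\n",
        $measuredUptime,
        $measuredUptime >= $slaUptimeTarget ? "within SLA" : "SLA breached");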

skott
  • SLA is a very blunt measure and usually doesn't cover anything other than the live system. You want to be identifying problems way before the code gets to this point. – Robbie Dee Nov 14 '19 at 11:46
  • But Daniel has mentioned in the question: "I've covered all my core failure cases and lots of exotic ones. Anything beyond that is in the area of 'good to have'". Beyond this, the point where one stops further development or optimisation is when the system meets the set expectations, right? Or else there is always scope for that one little improvement which could be done. – skott Nov 14 '19 at 12:08
  • As I outlined in my answer, test cases are but one small part of the approach. – Robbie Dee Nov 14 '19 at 13:40
0

This seems to boil down to how one builds confidence in their code. Well, there are lots of ways:

Rubber duck it

Talk through the design with someone else and see if they can spot any problems. As the old saying goes, if you can't explain it simply, there may be some complexity there that needs teasing out.

Code review

Someone else may have a view on how the code could be improved or spot some problems you hadn't seen.

Lots of cheap tests

Unit tests are cheap tests: the more of these you have, the better. Yes, coverage numbers are good, but make sure the behaviour is adequately tested.
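
A cheap test is something of this size (PHPUnit shown here; the OrderTotal class is invented and kept inline so the example is self-contained):

    <?php
    use PHPUnit\Framework\TestCase;

    // The class under test, defined here purely for illustration.
    final class OrderTotal
    {
        public function __construct(private array $lineAmountsInCents) {}

        public function inCents(): int
        {
            return array_sum($this->lineAmountsInCents);
        }
    }

    // Each test pins down one tiny behaviour and runs in milliseconds,
    // so you can afford to have hundreds of them.
    final class OrderTotalTest extends TestCase
    {
        public function testEmptyOrderTotalsToZero(): void
        {
            $this->assertSame(0, (new OrderTotal([]))->inCents());
        }

        public function testLineAmountsAreSummed(): void
        {
            $this->assertSame(350, (new OrderTotal([100, 250]))->inCents());
        }
    }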

Run it

Design and unit testing is all very well, but does it run as intended when put together?

Destruction testing

Try and break the thing - throw invalid data/conditions at it - how does it stand up?
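
In its smallest form this can be a loop that throws deliberately broken input at a validation routine and checks that it fails loudly (the validateQuantity() function below is invented for illustration):

    <?php
    function validateQuantity(mixed $raw): int
    {
        if (!is_int($raw) || $raw < 1 || $raw > 10000) {
            throw new InvalidArgumentException('invalid quantity: ' . var_export($raw, true));
        }
        return $raw;
    }

    // Garbage in, loud failure out; silence here would be a bug worth investigating.
    foreach ([null, -5, 0, 3.7, "12; DROP TABLE orders", str_repeat('9', 100)] as $garbage) {
        try {
            validateQuantity($garbage);
            echo 'NOT REJECTED: ', var_export($garbage, true), "\n";
        } catch (InvalidArgumentException $e) {
            echo "rejected as expected\n";
        }
    }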

Push it out through the layers

It always works on the developer machine. Try a deployment to test, UAT and beyond. You'll encounter different issues at every hurdle.

Let someone else take it for a spin

What is obvious to you might not be obvious to someone else.

Code fatigue

If you look at the same code time and time again, you stop seeing the obvious. Take a break and walk away. Do something different and come back and look at it with fresh eyes. You can help yourself by keeping your methods and modules small.

Robbie Dee
0

There is no such thing as conclusively knowing you've tested everything you can. Test everything that makes sense. That definitely includes the software specs, and developers can add further tests that they deem necessary.

Going slightly off the silly deep end, since there are only finite possibilities in any finite codebase, you could theoretically test literally everything, but this would go WELL beyond any reasonable amount of test coverage.
No employer is going to pay the wage that accompanies the amount of time you need to test literally every permutation of the application's state. Test what is reasonable to test.
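
A rough back-of-the-envelope calculation (my own made-up numbers) shows how quickly "every permutation" collapses: a single function taking three 32-bit integers already has 2^96 possible inputs.

    <?php
    // Even at a billion test executions per second, exhausting the input
    // space of one three-parameter function would take trillions of years.
    $combinations   = 2 ** 96;             // overflows to float, ~7.9e28
    $testsPerSecond = 1e9;
    $secondsPerYear = 60 * 60 * 24 * 365;

    printf("~%.1e years to run them all\n",
        $combinations / $testsPerSecond / $secondsPerYear);   // ~2.5e12 years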

Don't succumb to thinking that you will get it right the first time. That applies to both your code and your tests.

It can happen that your tests forget to cover a certain scenario. Sometimes, a failure state is so contrived that no one could reasonably see it coming (until it presented itself) and therefore no one pre-emptively wrote a test to catch it.

In such a case, you deal with the issue when it presents itself, wonder why the tests didn't catch this, realize that there is no test to catch this, and then write a test to catch this.

Flater