Good code style to introduce data checks everywhere?

Question

I have a project that is sufficiently large in size that I can't keep every aspect in my head any more. I'm dealing with a number of classes and functions in it, and I'm passing data around.

With time I noticed that I kept getting errors, because I forgot what precise form the data has to have when I pass it to different functions (e.g., one function accepts and outputs an array of strings, another function, that I wrote much later, accepts strings that are kept in a dictionary etc., so I have to transform the strings I'm working with from having them in an array to having them in a dictionary).

To avoid always having to figure out what broke where, I started treating each function and class as being an "isolated entity" in the sense that it cannot rely on outside code giving it the correct input and has to perform input checks itself (or, in some cases, recast the data, if the data is given in the wrong form).

This has greatly reduced the time I spend making sure that the data that I pass around "fits" into every function, because the classes and functions themselves now warn me when some input is bad (and sometimes even correct that) and I don't have to go with a debugger through the whole code anymore to figure out where something went haywire.

One the other hand this has also increased the overall code.
My question is, if this code style appropriate for solving this problem?
Of course, the best solution would be to completely refactor the project and make sure the data has a uniform structure for all functions - but since this project is growing constantly, I would end up spending more and worrying about clean code than actually adding new stuff.

(FYI: I'm still a beginner, so please excuse if this question was naive; my project is in Python.)

Possible duplicate of [How defensive should we be?](https://softwareengineering.stackexchange.com/questions/84857/how-defensive-should-we-be) — gnat, Apr 20 '18 at 08:04
@gnat It is similar, but instead of answering my question, there it provides advice ("be as defensive as you can be") for *that specific instance* that OP mentioned, which is different from my more general query. — user7088941, Apr 20 '18 at 08:15
Check out https://stackoverflow.com/questions/1273211/disable-assertions-in-python — Mikhail, Apr 20 '18 at 08:17
"but since this project is growing constantly, I would end up spending more and worrying about clean code than actually adding new stuff" - this sounds much like you need to start worrying about clean code. Otherwise you will find your productivity slowing and slowing as each new bit of functionality is harder and harder to add because of existing code. Not all refactoring needs to be "complete", if adding something new is hard because of the existing code it touches, refactor just that touching code and make a note of what you'd like to revisit later — matt freake, Apr 20 '18 at 08:37
This is a problem people often face using weakly typed languages. If you do not want or can switch to a more strictly typed language, the answer is simply *"yes, this code style is appropriate for solving this problem"*. Next question? — Doc Brown, Apr 20 '18 at 08:39
In a strictly typed language, with proper data types defined, the compiler would have done this for you. — S.D., Apr 20 '18 at 13:55
Check out design by contract, or contract based design, as well. It may fit your style. It is always best to find errors as soon as possible. — Frank Hileman, Apr 24 '18 at 00:02

JacquesB · Answer 1 · 2018-04-20T12:21:49.393

9

OK, the actual problem is described in a comment below this answer:

For example, in function 1, the expected input is an array of strings, where the first string denotes the title of something and the second a bibliographical reference. In function 2 the expected input is still an array of strings, but now the roles of the strings are inversed

The problem here is the use of a list of strings where the order signifies semantics. This is a really error prone approach. Instead you should create a custom class with two fields named title and bibliographical_reference. That way you are not going to mix them up, and you avoid this problem in the future. Of course this requires some refactoring if you already use lists of strings in a lot of places, but believe me, it will be cheaper in the long run.

The common approach in dynamically types languages is "duck typing", which mean you dont really care about the "type" of the object passed, you only care if it supports the methods you call on it. In your case, you will simply read the field called bibliographical_reference when you need it. If this field does not exist on the object passed, you will get an error, and this indicates the wrong type is passed to the function. This is as good a type check as any.

edited Apr 20 '18 at 12:21

answered Apr 20 '18 at 09:05

JacquesB

57,310
21
127
176

Sometimes th problem is even more subtle: I'm passing the correct type, but the "internal structure" of my input messes the function up: For example, in function 1, the expected input is an array of strings, where the first string denotes the title of something and the second a bibliographical reference. In function 2 the expected input is still an array of strings, but now the roles of the strings are inversed: The first string should be the bibliographical reference and the second one should be the bibliographical reference. I guess for this checks are appropriate? – user7088941 Apr 20 '18 at 10:37
I would really avoid constant refactoring. My codebase changes so much that I would always have to rewrite all the input handling for all the classes and functions, because if my data format has to change for some reason slightly, it seems easier to try to see which functions complain about them and fix that, that doing a whole redesign. – user7088941 Apr 20 '18 at 11:22
1

@user7088941: The problem you describe could be easily solved by having a class with two fields: "title" and "bibliographical_reference". You will not mix that up. Relying on the order in a list of strings seem very error prone. Maybe this is the underlying problem? – JacquesB Apr 20 '18 at 12:02
@JacquesB It is true that not using classes, in the you mentioned, caused some errors. But the advantage was that working with lists kept things "organic" (or, at least, I don't know enough OOP to keep meing "organic" using classes). By "organic" I mean that when outside circumstances forced me to change slightly the internal structure of the list (like, in some instances, changing the order - and some other things as well) I could simply change the lists, and [...] – user7088941 Apr 20 '18 at 13:02
[...] implement in a few functions some code that ensures robustness, so that those functions can figure out how to deal with that changed lists. If I would have used classes it seemed to me that for each change I would have needed to rewrite the class (so that it's not nonsensical to pass to the "title" field bibliographical data) and also the functions - which seemed more time consuming. Or maybe I was wrong? Do you know which OO paradigms would have enabled me to efficiently to such a rewriting? – user7088941 Apr 20 '18 at 13:05
@user7088941: I'm honestly not following...why would you want to pass biographical data to the title field? This is exactly the kind of error you want to avoid. OO means you can change and evolve a class *without* having to change all functions which use the object. If you pass a list of strings, and you change the order or semantics of the items, you will have to update all functions which use the list. It is far more work and far more error prone. – JacquesB Apr 20 '18 at 13:17
3

This is the answer. Python is an object-oriented language, not a list-of-dictionaries-of-strings-to-integers-oriented (or whatever) language. So, use objects. Objects are responsible for managing their own state and enforcing their own invariants, other objects *cannot* corrupt them, ever (if correctly designed). If unstructured or semi-structured data enters your system from the outside, you validate and parse *once* at the system boundary and convert to rich objects as soon as possible. – Jörg W Mittag Apr 20 '18 at 15:05
3

"I would really avoid constant refactoring" - this mental block is your problem. Good code only arises from refactoring. Lots of refactoring. Supported by unit tests. Especially when components need to be extended or evolved. – Doc Brown Apr 20 '18 at 15:52
2

I got it now. +1 for all the nice insights and comments. And thanks to all for their incredibly helpful comments! (While I was using some classes/objects I interspersed them with the mentioned lists, which, as I see now, wasn't a good idea. the question remained *how* to best to implement this, where I used the concrete suggestions from JETMs answer, which really made a radical difference in terms of speed of achieving a bug-free state.) – user7088941 Apr 21 '18 at 07:47

score 5 · Accepted Answer · answered Apr 20 '18 at 11:59

A better solution is to take more advantage of Python language features and tooling.

For example, in function 1, the expected input is an array of strings, where the first string denotes the title of something and the second a bibliographical reference. In function 2 the expected input is still an array of strings, but now the roles of the strings are inversed.

This problem is mitigated with a namedtuple. It's lightweight and gives easy semantic meaning to the members of your array.

To grab the benefit of some automatic type checking without switching languages, you can take advantage of type hinting. A good IDE can use this to let you know when you do something dumb.

You also seem worried about functions going stale when requirements change. This can be caught by automated testing.

While I am not saying that manually checking is never appropriate, a better use of available language features can help you solve this problem in a more maintainable way.

+1 for pointing me to `namedtuple` and all of the other nice things. I did not not about `namedtuple` - and while I knew about automated testing, I never really used it much and did not realize how much it would help me in this case. All of these really seem to be as good as a static analysis. (Automated testing might even be better, as I can catch all the subtle things that would not be caught in a static analysis!) If you know any other, please let me know. I'm going to keep the question open a while longer, but if no other answers come, I'll accept yours. — user7088941, Apr 20 '18 at 13:10

score 3 · Answer 3 · answered Apr 20 '18 at 12:42

First of all, what you are experiencing right now is code smell - try to remember what lead to you becoming conscious of the smell and try to hone your "mental" nose, as the sooner you notice a code smell the sooner - and easier - you are able to fix the underlying issue.

To avoid always having to figure out what broke where, I started treating each function and class as being an "isolated entity" in the sense that it cannot rely on outside code giving it the correct input and has to perform input checks itself.

Defensive programming - as this technique is called - is a valid and often used tool. However, as with all things it is important to use the right amount, too little checks and you will not catch issues, too many and your code will be over-bloated.

(or, in some cases, recast the data, if the data is given in the wrong form).

That might be a less-good idea. If you notice a part of your program is calling a function with incorrectly formatted data, FIX THAT PART, not change the called function to be able to digest bad data anyway.

This has greatly reduced the time I spend making sure that the data that I pass around "fits" into every function, because the classes and functions themselves now warn me when some input is bad (and sometimes even correct that) and I don't have to go with a debugger through the whole code anymore to figure out where something went haywire.

Improving the quality & maintainability of your code is a time saver in the long run (in that sense I must again warn against the self-correction functionality you built into some of your functions - they might be an insidious source for bugs. Just because your program does not crash & burn does not mean it works right...)

To finally answer your question: Yes, defensive programming (i.e. verifying the validity of the provided parameters) is - in a healthy degree - a good strategy. That said, as you said yourself, your code is inconsitent, and I'd strongly reccommend that you spend some time to refactor the parts that smell - you said you do not want to worry about clean code all the time, spending more time on "cleaning" than on new features... If you don't keep your code clean you might spend twice the time "save" from not keeping a clean code on squashing bugs AND will have a hard time implementing new features - technical debt can crush you.

score 1 · Answer 4 · answered Apr 20 '18 at 08:44

1

That's okay. I used to code in FoxPro, where I had a TRY..CATCH blocks almost in every big function. Now, I code in JavaScript/LiveScript and rarely check parameters in "internal" or "private" functions.

"How much to check" depends on the chosen project/language more than it depends on your code skill.

answered Apr 20 '18 at 08:44

Michael Quad

15
4

1

I guess, it was TRY...CATCH...IGNORE. You did about the opposite of what the OP is asking for. IMHO their point is avoiding inconsistency while yours was ensuring that the program doesn't blow up when hitting one. – maaartinus Apr 20 '18 at 10:35
1

@maaartinus that is correct. Programming languages typically gives us constructs that are simple to use to prevent the application of blowing up - but the constructs programming languages give us to prevent inconsistencies seem to be much harder to use: to my knowledge, constantly refactor everything and use classes that best containerize information flow in your application. This is precisely what I'm asking about - is there an easier way to fix this. – user7088941 Apr 20 '18 at 10:44
@user7088941 That's why I avoid weakly typed languages. Python is just fantastic, but for anything bigger, I can't keep track of what I did elsewhere. Therefore, I prefer Java, which is rather verbose (not so much with Lombok and Java 8 features), has strict typing and tools for static analysis. I'd suggest you to try some strictly types language as I don't know how to solve it otherwise. – maaartinus Apr 20 '18 at 11:28
It's not about strict/loose typed parameter. It's about knowing that parameter is correct. Even if you use (integer 4 byte) you may need to check if it is in some range 0..10 for example. If you know that parameter is always 0..10 you dont need to check it. FoxPro doesn't have associative arrays for example, it's very hard to operate with its variables, their scope and so on.. that's why you have to check check check.. – Michael Quad Apr 20 '18 at 11:42
@MichaelQuad that's exactly what I mean. Do you know if there exist OO design principles that are designed to do "check check check" as painlessly as possible? – user7088941 Apr 20 '18 at 13:07
I don't know if it's OO related either. For example, in FoxPro you don't have to worry if you compare 0.1 + 0.2 == 0.3, In JavaScript, you should. You can try to reduce/simplify object APIs, because, checks are obligatory only at the API (public methods) and may be avoided in private methods. You can try to reduce the number of objects interacting at the same level.. – Michael Quad Apr 20 '18 at 14:25
1

@user7088941 It's not OO, but there's the "fail fast" rule. Every non-private method has to check its arguments and throw when anything is wrong. No try-catch, no attempt to fix it, just blow it sky-high. Sure, at some higher level, the exception get logged and handled. As your tests find most problems beforehand and no problems get hidden, the code converges to a bug-free solution much faster than when being error-tolerant. – maaartinus Apr 20 '18 at 14:38

Good code style to introduce data checks everywhere?

4 Answers4