53

Lets say you are coding a function that takes input from an external API MyAPI.

That external API MyAPI has a contract that states it will return a string or a number.

Is it recommended to guard against things like null, undefined, boolean, etc. even though it's not part of the API of MyAPI? In particular, since you have no control over that API you cannot make the guarantee through something like static type analysis so it's better to be safe than sorry?

I'm thinking in relation to the Robustness Principle.

jpmc26
  • 5,389
  • 4
  • 25
  • 37
Adam Thompson
  • 1,243
  • 1
  • 9
  • 14
  • 16
    What are the impacts of not handling those unexpected values if they are returned? Can you live with these impacts? Is it worth the complexity to handle those unexpected values to prevent having to deal with the impacts? – Vincent Savard Jan 14 '19 at 18:13
  • @VincentSavard I know I won't get an absolute answer without this data, but I'm looking for an _in general_ answer. – Adam Thompson Jan 14 '19 at 18:52
  • 55
    If you're expecting them, then by definition they're not unexpected. – Mason Wheeler Jan 14 '19 at 19:33
  • 3
    Possible duplicate of [Differences between Design by Contract and Defensive Programming](https://softwareengineering.stackexchange.com/questions/125399/differences-between-design-by-contract-and-defensive-programming) – gnat Jan 14 '19 at 19:56
  • There is no general answer. Ideally you make the software fail and log an error. But if you cannot fix the error, that doesn't do much good for the user. – Frank Hileman Jan 14 '19 at 20:00
  • 1
    You also don't expect the external API to ever return a status code in the 400/500 range, but you still have code to handle that, right? Developers are people and people sometimes merge breaking API changes. I suspect the answer writes itself at this point. – Ellesedil Jan 14 '19 at 20:49
  • 28
    Remember the API isn't obligated to only give you valid JSON back (I'm assuming this is JSON). You could also get a reply like ` 504 Gateway TimeoutThe server was unable to process your request. Make sure you have typed the address correctly. If the problem persists, please try again later.` – user253751 Jan 14 '19 at 22:18
  • 5
    What does "external API" mean? Is it still under your Control? – Deduplicator Jan 14 '19 at 22:22
  • 5
    Absolutely. expect nothing, expect anything. – S.. Jan 15 '19 at 12:14
  • 1
    I'd say that's not (quite) the right question.  *Expect* dodgy results?  Probably not.  *Handle* dodgy results?  Very probably.  *Consider* the possibility of dodgy results, and what sort of handling or mitigation is appropriate?  Absolutely! – gidds Jan 15 '19 at 16:25
  • Regardless of how you decide to handle it, enter what you were expecting and what you got instead in a log file somewhere, and mark it as an internal error. This separates out the capture/details of the condition from how it's eventually addressed. – user117529 Jan 15 '19 at 23:24
  • @MasonWheeler Happy? I rather liked the paradox inherent in the original wording, but alas, it's gone to satisfy you. – jpmc26 Jan 16 '19 at 03:49
  • 12
    _"A good programmer is someone who looks both ways before crossing a one-way street."_ – jeroen_de_schutter Jan 16 '19 at 09:17

9 Answers9

104

You should never trust the inputs to your software, regardless of source. Not only validating the types is important, but also ranges of input and the business logic as well. Per a comment, this is well described by OWASP

Failing to do so will at best leave you with garbage data that you have to later clean up, but at worst you'll leave an opportunity for malicious exploits if that upstream service gets compromised in some fashion (q.v. the Target hack). The range of problems in between includes getting your application in an unrecoverable state.


From the comments I can see that perhaps my answer could use a bit of expansion.

By "never trust the inputs", I simply mean that you can't assume that you'll always receive valid and trustworthy information from upstream or downstream systems, and therefore you should always sanitize that input to the best of your ability, or reject it.

One argument surfaced in the comments I'll address by way of example. While yes, you have to trust your OS to some degree, it's not unreasonable to, for example, reject the results of a random number generator if you ask it for a number between 1 and 10 and it responds with "bob".

Similarly, in the case of the OP, you should definitely ensure your application is only accepting valid input from the upstream service. What you do when it's not OK is up to you, and depends a great deal on the actual business function that you're trying to accomplish, but minimally you'd log it for later debugging and otherwise ensure that your application doesn't go into an unrecoverable or insecure state.

While you can never know every possible input someone/something might give you, you certainly can limit what's allowable based on the business requirements and do some form of input whitelisting based on that.

Paul
  • 3,277
  • 1
  • 17
  • 16
  • 21
    What is q.v. stand for ? – JonH Jan 14 '19 at 20:54
  • 15
    @JonH basically "see also"... the Target hack is an example that he is referencing https://en.oxforddictionaries.com/definition/q.v. – andrewtweber Jan 14 '19 at 21:43
  • 8
    This answer is as it stands just doesn't make sense. It's infeasible to anticipate each and every way a third-party library might misbehave. If a library function's documentation explicitly assures that the result will always have some properties, then you should be able to rely on it that the designers ensured this property will actually hold. It's _their_ responsibility to have a test suite that checks this kind of thing, and submit a bug fix in case a situation is encountered where it doesn't. _You_ checking these properties in your own code is violating the DRY principle. – leftaroundabout Jan 14 '19 at 23:58
  • 1
    ...That's not to say there aren't in practice often reasons to mistrust certain third-party functions, but _distrusting everything_ doesn't get you anywhere. Following that logic, _you must write everything yourself including the operating system_. I think we can agree that this would not give a better end product... — What I'd argue is the best approach to this kind of things is to choose tooling that gets as close as possible for library providers to _ensure_ contracts: strong, static type systems, automated property-based unit testing etc.. – leftaroundabout Jan 15 '19 at 00:01
  • 24
    @leftaroundabout no, but you should be able to predict all valid things your application can accept and reject the rest. – Paul Jan 15 '19 at 02:15
  • 1
    I'd argue for a more nuanced approach - if a library/API is _known_ to be flaky and buggy, add some extra security in its wrapper (you _are_ using wrappers, right?) If it's trustworthy, you can make your wrappers a lot thinner. – Sebastian Lenartowicz Jan 15 '19 at 07:53
  • 10
    @leftaroundabout It's not about distrusting everything, it's about distrusting external untrusted sources. This is all about threat modelling. If you haven't done that your software isn't secure (how can it be, if you never even thought about against what kind of actors and threats you want to secure your application?). For a run of the mill business software it's a reasonable default to assume that callers could be malicious, while it's rarely sensible to assume your OS is a threat. – Voo Jan 15 '19 at 09:07
  • 2
    Good answer but not complete without [a reference to the OWASP cheat sheet](https://www.owasp.org/index.php/Input_Validation_Cheat_Sheet#Implementing_input_validation) *Data from all potentially untrusted sources should be subject to input validation, including not only Internet-facing web clients but also backend feeds over extranets, from suppliers, partners, vendors or regulators*. Inputs should be validated as narrowly as possible so as to present the smallest possible attack surface. – John Wu Jan 15 '19 at 09:09
  • 1
    You should add that testing such input validation is referred to as "fuzzing". – MooseBoys Jan 15 '19 at 10:01
  • 1
    @Voo yeah, but this answer phrased it as “every source is always untrusted”, which is utterly impractical because it would also entail that, say, the OS' random generator is untrusted. – leftaroundabout Jan 15 '19 at 11:09
  • @left That's not how I read the answer, but that depends to a large part on your mental threat model. But yes I agree that a simplistic "distrust everything" is not helpful. – Voo Jan 15 '19 at 12:02
  • 1
    "What you do when it's not OK is up to you" - There's a [rule of thumb](https://en.wikipedia.org/wiki/Robustness_principle) for this. Basically it's "be permissive in what you accept, and strict in what you send". Thus sanitizing would generally be preferred over outright rejection in most cases, where possible. – aroth Jan 16 '19 at 23:53
  • 2
    @Paul You called this out in the comments, but I think it's worth including in your answer- It's a LOT easier and more sensible to figure out what your application CAN accept, and fail generally on things it can't, than to individually dispatch on all possible varieties of failure. I think that advice is pretty universal, and helps put the rest of your stance in perspective. – Iron Gremlin Jan 17 '19 at 02:02
  • *"...it's not unreasonable to, for example, reject the results of a random number generator if you ask it for a number between 1 and 10 and it responds with "bob"."* I don't understand this, it seems ridiculous. An RNG would never return such a value. That would be beyond exceptional, it would be extremely broken. In languages with type-safety, that would throw an exception at whatever point it was being parsed to a number type. I will never check the type returned by an RNG. – Dom Jan 17 '19 at 09:27
  • @Dom I agree that it's something you'd get for free if you're in a type safe language, but not all are. Also, I was intentionally being a little ridiculous to make the point with a bit of levity. I agree that if your RNG is returning strings you have a bigger problem on your hands. But to be clear, a type safe language is in fact doing what I'm describing by rejecting that unparseable value via exception so that you don't have to. – Paul Jan 17 '19 at 13:27
  • @IronGremlin I had that as the final sentence in my first edit, but I just separated it into its own paragraph to make it stand out more. Thanks for helping me improve the answer! – Paul Jan 17 '19 at 21:34
34

Yes, of course. But what makes you think the answer could be different?

You surely don't want to let your program behave in some unpredictable manner in case the API does not return what the contract says, don't you? So at least you have to deal with such a behaviour somehow. A minimal form of error handling is always worth the (very minimal!) effort, and there is absolutely no excuse for not implementing something like this.

However, how much effort you should invest to deal with such a case is heavily case dependent and can only be answered in context of your system. Often, a short log entry and letting the application end gracefully can be enough. Sometimes, you will be better off to implement some detailed exception handling, dealing with different forms of "wrong" return values, and maybe have to implement some fallback strategy.

But it makes a hell of a difference if you are writing just some inhouse spreadsheet formatting application, to be used by less than 10 people and where the financial impact of an application crash is quite low, or if you are creating a new autonomous car driving system, where an application crash may cost lives.

So there is no shortcut against reflecting about what you are doing, using your common sense is always mandatory.

Doc Brown
  • 199,015
  • 33
  • 367
  • 565
  • What to do is another decision. You may have a fail over solution. Anything asynchronous could be retried before creating an exception log (or dead letter). An active alert to the vendor or provider may be an option if the issue persists. – mckenzm Jan 15 '19 at 02:20
  • @mckenzm: the fact the OP asks a question where the literal answer can obviously be only "yes" is IMHO a sign they may not just be interested in a literal answer. It looks they are asking *"is it necessary to guard against different forms of unexpected values from an API and deal with them differently"*? – Doc Brown Jan 15 '19 at 06:39
  • 1
    hmm, the crap/carp/die approach. Is it our fault for passing bad (but legal) requests? is the response possible, but not usable for us in particular? or is the response corrupt? Different scenarios, Now it does sound like homework. – mckenzm Jan 15 '19 at 21:56
26

The Robustness Principle--specifically, the "be liberal in what you accept" half of it--is a very bad idea in software. It was originally developed in the context of hardware, where physical constraints make engineering tolerances very important, but in software, when someone sends you malformed or otherwise improper input, you have two choices. You can either reject it, (preferably with an explanation as to what went wrong,) or you can try to figure out what it was supposed to mean.

EDIT: Turns out I was mistaken in the above statement. The Robustness Principle doesn't come from the world of hardware, but from Internet architecture, specifically RFC 1958. It states:

3.9 Be strict when sending and tolerant when receiving. Implementations must follow specifications precisely when sending to the network, and tolerate faulty input from the network. When in doubt, discard faulty input silently, without returning an error message unless this is required by the specification.

This is, plainly speaking, simply wrong from start to finish. It is difficult to conceive of a more wrongheaded notion of error handling than "discard faulty input silently without returning an error message," for the reasons given in this post.

See also the IETF paper The Harmful Consequences of the Robustness Principle for further elaboration on this point.

Never, never, never choose that second option unless you have resources equivalent to Google's Search team to throw at your project, because that's what it takes to come up with a computer program that does anything close to a decent job at that particular problem domain. (And even then, Google's suggestions feel like they're coming straight out of left field about half the time.) If you try to do so, what you'll end up with is a massive headache where your program will frequently try to interpret bad input as X, when what the sender really meant was Y.

This is bad for two reasons. The obvious one is because then you have bad data in your system. The less obvious one is that in many cases, neither you nor the sender will realize that anything went wrong until much later down the road when something blows up in your face, and then suddenly you have a big, expensive mess to fix and no idea what went wrong because the noticeable effect is so far removed from the root cause.

This is why the Fail Fast principle exists; save everyone involved the headache by applying it to your APIs.

Mason Wheeler
  • 82,151
  • 24
  • 234
  • 309
  • 7
    While I agree with the principle of what you're saying, I think you're mistaken WRT the intent of the Robustness Principle. I've never seen it intended to mean, "accept bad data", only, "don't be excessively fiddly about good data". For example, if the input is a CSV file, the Robustness Principle wouldn't be a valid argument for trying to parse out dates in an unexpected format, but would support an argument that inferring colum order from a header row would be a good idea. – Morgen Jan 14 '19 at 21:22
  • 9
    @Morgen: The robustness principle was used to suggest that browsers should accept rather sloppy HTML, and led to deployed web sites being much sloppier than they would have been if browsers had demanded proper HTML. A big part of the problem there, though, was the use of a common format for human-generated and machine-generated content, as opposed to the use of separate human-editable and machine-parsable formats along with utilities to convert between them. – supercat Jan 14 '19 at 21:33
  • 10
    @supercat: nevertheless - or just hence - HTML and the WWW was extremely successful ;-) – Doc Brown Jan 14 '19 at 21:43
  • 12
    @DocBrown: A lot of really horrible things have become standards simply because they were the first approach that happened to be available when someone with a lot of clout needed to adopt something that met certain minimal criteria, and by the time they gained traction it was too late to select something better. – supercat Jan 14 '19 at 22:05
  • 5
    @supercat Exactly. JavaScript immediately comes to mind, for example... – Mason Wheeler Jan 14 '19 at 22:11
  • @supercat My bad, I'd forgotten about that one. – Morgen Jan 15 '19 at 00:30
  • 1
    @MasonWheeler Do you have a citation for `It was originally developed in the context of hardware`? – studog Jan 15 '19 at 19:53
  • @studog Oops! I had heard that the idea came from the world of telephone hardware, but when I looked it up just now, it turns out that I was mistaken. It actually comes from [RFC 1958](https://tools.ietf.org/html/rfc1958), laying out general principles for Internet architecture, which is disappointing to see, especially considering the idea that immediately follows the Robustness Principle: "When in doubt, discard faulty input silently, without returning an error message unless this is required by the specification." It's difficult to imagine a more wrongheaded notion of error handling! – Mason Wheeler Jan 15 '19 at 20:24
  • 2
    @supercat What you're describing sounds a lot like "Worse Is Better". I won't rehash the debate on that concept, beyond drawing attention to the fact that there is a debate. – James_pic Jan 16 '19 at 12:24
  • 1
    @MasonWheeler: When using a shared communications medium, it's often impossible to identify the originator of a message with certainty. Further, bandwidth which is used for delivering error indications will count against bandwidth that could be used for carrying useful data. Silent discarding of erroneous packets may make some kinds of troubleshooting more difficult, but doing otherwise can create different problems. – supercat Jan 16 '19 at 15:50
  • @supercat When using a socket connection, you don't need to "identify" the originator of the message; you just send an error back over the socket connection they're establishing with you, to whatever computer happens to be on the other end. – Mason Wheeler Jan 16 '19 at 16:06
  • 1
    @MasonWheeler: Once two-way connectivity is established, that is true. The "ignore silently" principle largely applies at lower-levels of communication before that happens. It also serves, in weaker form, to prevent meltdown where a party receives a malformed message which causes it to send back a malformed error reply, which in turn causes the originator to send back its own malformed response, etc. The latter problems may be mitigated by capping limiting the rate at which error messages will be sent, and ignoring any errors which happen once the rate limit is reached. – supercat Jan 16 '19 at 16:40
14

In general, code should be constructed to uphold the at least the following constraints whenever practical:

  1. When given correct input, produce correct output.

  2. When given valid input (that may or may not be correct), produce valid output (likewise).

  3. When given invalid input, process it without any side-effects beyond those caused by normal input or those which are defined as signalling an error.

In many situations, programs will essentially pass through various chunks of data without particularly caring about whether they are valid. If such chunks happen to contain invalid data, the program's output would likely contain invalid data as a consequence. Unless a program is specifically designed to validate all data, and guarantee that it will not produce invalid output even when given invalid input, programs that process its output should allow for the possibility of invalid data within it.

While validating data early on is often desirable, it's not always particularly practical. Among other things, if the validity of one chunk of data depends upon the contents of other chunks, and if the majority of of the data fed into some sequence of steps will get filtered out along the way, limiting validation to data which makes it through all stages may yield much better performance than trying to validate everything.

Further, even if a program is only expected to be given pre-validated data, it's often good to have it uphold the above constraints anyway whenever practical. Repeating full validation at every processing step would often be a major performance drain, but the limited amount of validation needed to uphold the above constraints may be much cheaper.

supercat
  • 8,335
  • 22
  • 28
  • Then it all comes down to deciding whether the result of an API call is an "input". – mastov Jan 16 '19 at 17:43
  • 1
    @mastov: The answers to many questions will depend upon how one defines "inputs" and "observable behaviors"/"outputs". If a program's purpose is to process numbers stored in a file, its input could be defined as the sequence of numbers (in which case things that aren't numbers aren't possible inputs), or as a file (in which case anything that could appear in a file would be a possible input). – supercat Jan 16 '19 at 18:01
3

Let's compare the two scenarios and try to come to a conclusion.

Scenario 1 Our application assumes the external API will behave as per the agreement.

Scenario 2 Our application assumes the external API can misbehave, hence add precautions.

In general, there is a chance for any API or software to violate the agreements; may be due to a bug or unexpected conditions. Even an API might be having issues in the internal systems resulting in unexpected results.

If our program is written assuming the external API will adhere to the agreements and avoid adding any precautions; who will be the party facing the issues? It will be us, the ones who has written integration code.

For example, the null values that you have picked. Say, as per the API agreement the response should have not-null values; but if it is suddenly violated our program will result in NPEs.

So, I believe it will be better to make sure your application has some additional code to address unexpected scenarios.

lkamal
  • 199
  • 3
1

You should always validate incoming data -- user-entered or otherwise -- so you should have a process in place to handle when the data retrieved from this external API is invalid.

Generally speaking, any seam where extra-orgranizational systems meet should require authentication, authorization (if not defined simply by authentication), and validation.

StarTrekRedneck
  • 188
  • 1
  • 4
1

In general, yes, you must always guard against flawed inputs, but depending on the kind of API, "guard" means different things.

For an external API to a server, you do not want to accidentally create a command that crashes or compromises the state of the server, so you must guard against that.

For an API like e.g. a container class (list, vector, etc), throwing exceptions is a perfectly fine outcome, compromising the state of the class instance may be acceptable to some extent (e.g. a sorted container provided with a faulty comparison operator will not be sorted), even crashing the application may be acceptable, but compromising the state of the application - e.g. writing to random memory locations unrelated to the class instance - is most likely not.

Peter
  • 3,718
  • 1
  • 12
  • 20
0

To give a slightly differing opinion: I think it can be acceptable to just work with the data you are given, even if it violates it's contract. This depends on the usage: It's something that MUST be a string for you, or is it something you are just displaying / does not use etc. In the latter case, simply accept it. I have an API which just needs 1% of the data delivered by another api. I could not care less what kind of data are in the 99%, so I will never check it.

There has to be balance between "having errors because I do not check my inputs enough" and "I reject valid data because I am too strict".

Christian Sauer
  • 1,269
  • 1
  • 9
  • 16
  • 2
    "I have an API which just needs 1% of the data delivered by another api." This then opens up the question *why* your API expects a 100 times more data than it actually needs. If you need to store opaque data to pass on, you don't really have to be specific as to what it is and don't have to declare it in any specific format, in which case the caller wouldn't be violating your contract. – Voo Jan 15 '19 at 19:21
  • 1
    @Voo - My suspicion is that they are calling some external API (like "get weather details for city X") and then cherry-picking the data they need ("current temperature") and ignoring the rest of the returned data ("rainfall", "wind", "forecast temperature", "wind chill", etc...) – Stobor Jan 16 '19 at 02:24
  • @ChristianSauer - I think you are not that far from what the wider consensus is - the 1% of the data that you use makes sense to check, but the 99% which you don't does not necessarily need to be checked. You only need to check the things which could trip your code up. – Stobor Jan 16 '19 at 02:25
0

My take on this is to always, always check each and every input to my system. That means every parameter returned from an API should be checked, even if my program does not use it. I tend to as well check every parameter I send to an API for correctness. There are only two exceptions to this rule, see below.

The reason for testing is that if for some reason the API / input is incorrect my program cannot rely on anything. Maybe my program was linked to an old version of the API that does something different from what I believe? Maybe my program stumbled on a bug in the external program that has never before happened. Or even worse, happens all the time but no one cares! Maybe the external program is beeing fooled by a hacker to return stuff that can hurt my program or the system?

The two exceptions to testing everything in my world are:

  1. Performance after careful measurement of performance:

    • never optimize before you have measured. Testing all input / returned data most often takes a very small time compared to the actual call so removing it often saves little or nothing. I would still keep the error detection code, but comment it out, perhaps by a macro or simply commenting it away.
  2. When you have no clue what to do with an error

    • there are times, not often, when your design simply does not allow handling of the kind of error you would find. Maybe what you ought to do is log an error, but there is no error logging in the system. It is almost always possible to find some way to "remember" the error allowing at least you as a developer to later check for it. Error counters is one good thing to have in a system, even if you elect to not have logging.

Exactly how carefully to check inputs / return values is an important question. As example, if the API is said to return a string, I would check that:

  • the data type actully is a string

  • and that length is between min and max values. Always check strings for max size that my program can expect to handle (returning too large strings is a classical security problem in networked systems).

  • Some strings should be checked for "illegal" characters or content when that is relevant. If your program might send the string to say a database later, it is a good idea to be check for database attacks (search for SQL injection). These tests are best done at the borders of my system, where I can pinpoint where the attack came from and I can fail early. Doing a full SQL injection test might be difficult when strings are later combined, so that test should be done before calling the database, but if you can find some problems early it can be useful.

The reason for testing parameters I send to the API is to be sure that I get a correct result back. Again, doing these tests before calling an API might seem unnecessary but it takes very little performance and may catch errors in my program. Hence the tests are most valuable when developing a system (but nowadays every system seems to be in continous development). Depending on the parameters the tests can be more or less thorough but I tend to find that you can often set allowable min and max values on most parameters that my program could create. Perhaps a string should always have at least 2 characters and be a maximum of 2000 characters long? The min and maximum should be inside what the API allows as I know that my program will never use the full range of some parameters.

ghellquist
  • 177
  • 4