6

I often hear people say they "sanitize input," which would mean make it clean. I understand this to mean "clean of potentially damaging contents," where the function that does the sanitizing would do something like character escaping.

But then I hear things like "sane input," which to me, means the input isn't a string where a double was expected, or "January Third of Nineteen Ninety Five" where "1995/01/03" would have been correct. This represents a matter of formatting.

Then we have "sanity functions," which handle user input to make it usable by the backend of the software. Can this refer to both types of input validation? Does it only deal with the formatting (like "sane input"), or with cleanliness of the input (like "sanitized input")? Are they two different classes of operations, or does sanity in this case just refer to both? I always thought it referred to sanitizing it (if that actually means something different than making it sane) since I thought "sanity" was a root for "sanitize." But I just looked it up and can't find any definition of "sanity" that has anything to do with cleanliness or sanitization.

Are there idioms for each of these operations that I don't know about, or is it always just "sanity functions" which do both of these things? Would it be confusing to see "sanity" and "safety" functions?

Carson Myers
  • Sanitizing input is meant in the same way you would sanitize your hands to get rid of germs – billy.bob Jul 12 '11 at 09:56
  • @billy.bob so then why is "sanity" used? Is it just a misuse of the word, or do "sanity" and "sanitization" refer to different sets of algorithms and concepts in the same domain? – Carson Myers Jul 12 '11 at 09:59
  • Using "sanity" implies that the inputs make sense in the real world, whereas "sanitization" (as in "sanitary" - clean) is about making sure your inputs don't cause problems for your system (e.g. SQL injection, XSS). It is just usually the case that, by checking to see if the input is "sane", you are also checking to see whether it's "sanitary", because usually unclean input doesn't make any sense. – Aether Jul 12 '11 at 10:08
  • @Aether I see, so one automatically includes the other – Carson Myers Jul 12 '11 at 11:20

3 Answers

6

Sane input means input that is acceptable for further processing. Input that fails this test doesn't have to be dangerous - just wrong. Say,

  • A fractional amount of items that are sold only in whole units.
  • A person's name containing newline characters.
  • A PO Box address for a paid-upon-receiving parcel.
  • A value that is against official regulations.
  • A textual description where only a number is accepted.
  • An invalid date, like 31st February.
  • A value out of reasonable bounds, say, a birthdate two centuries ago.
  • An email address without the @ character.
  • A first name with trailing spaces at the end. And so on.
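A few of the checks in this list can be sketched in Python. This is only an illustration: the function name `sanity_problems`, its parameters, and the 130-year age limit are assumptions made for the example, not anything from a real library.

```python
from datetime import date


def sanity_problems(quantity, birthdate_ymd, email, today):
    """Return a list of problems; an empty list means the input looks sane.

    Hypothetical helper: the rules and the 130-year limit are assumptions.
    """
    problems = []

    # Items sold only in whole units: reject fractional or non-positive amounts.
    if quantity != int(quantity) or quantity <= 0:
        problems.append("quantity must be a positive whole number")

    # Impossible dates such as 31 February make date() raise ValueError.
    year, month, day = birthdate_ymd
    try:
        born = date(year, month, day)
    except ValueError:
        problems.append("birthdate is not a real calendar date")
    else:
        # Out of reasonable bounds: in the future or implausibly long ago.
        if born > today or today.year - born.year > 130:
            problems.append("birthdate is out of reasonable bounds")

    # Deliberately crude: only the "no @ character" case from the list.
    if "@" not in email:
        problems.append("email address has no @ character")

    return problems
```

None of these inputs is an attack; the function only reports data that cannot be processed meaningfully.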

Sometimes sanitizing means only fitting the input into a desired standard: changing date ordering and separators so that 1/1/2011 turns into 2011-01-01, stripping whitespace at the ends, capitalizing the country code, etc. Sometimes it's limiting input to sane values: you are entitled to a 100% refund, not 18000%. Sometimes it's discarding gibberish or useless data: a nonexistent zip code renders the whole address invalid, and a wrong number of digits in an account number makes a money transfer impossible.
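As a rough sketch of the "fit into a standard" and "limit to sane values" cases, something like the following would do in Python. The function names are made up for the example, and the day/month/year input order is an assumption (1/1/2011 reads the same either way).

```python
from datetime import datetime


def normalize_date(text, input_format="%d/%m/%Y"):
    """Re-emit a date in ISO form; raises ValueError if it does not parse.

    The default input_format (day/month/year) is an assumption.
    """
    return datetime.strptime(text.strip(), input_format).date().isoformat()


def normalize_country_code(text):
    """Strip surrounding whitespace and capitalize, e.g. ' ca ' -> 'CA'."""
    return text.strip().upper()


def clamp_refund_percent(value):
    """Limit a refund percentage to the 0..100 range."""
    return max(0, min(100, value))
```

Whether an out-of-range value should be clamped like this or rejected outright is a design choice; the clamp simply mirrors the refund example.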

Sanitizing against SQL injection or other attacks is only a small part of the operation.

Edit: Yes, pretty much both are a subset of the same problem - making data fit for further processing.

If your database is dropped because someone wrote '; DROP DATABASE;-- as their username, it's the same kind of problem as when someone entered 0 as a divisor and your backend blew up on division by zero, or when someone stole money from someone else's account by entering a negative value in the amount field of a bank transfer, or when your parcel was returned to sender because you accepted a phone number in place of a ZIP code.
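Two of these cases can be sketched with Python's standard `sqlite3` module. The table layout and the helper names `add_user` and `check_transfer_amount` are assumptions for illustration only; the point is that the hostile username is handled by passing it as a query parameter, while the negative amount is a plain range check.

```python
import sqlite3


def add_user(conn, username):
    # The ? placeholder sends the value separately from the SQL text,
    # so quotes and semicolons in it are stored as plain characters.
    conn.execute("INSERT INTO users (name) VALUES (?)", (username,))


def check_transfer_amount(amount):
    # A negative "transfer" would move money in the wrong direction.
    if amount <= 0:
        raise ValueError("transfer amount must be positive")
    return amount


# Hypothetical in-memory table, just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
add_user(conn, "'; DROP DATABASE;--")
stored = conn.execute("SELECT name FROM users").fetchall()
```

After this runs, `stored` holds the odd username as ordinary text and the table is still there.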

SF.
  • So you're saying that sanitization is a subset of making data sane, rather than them both meaning the same thing, or being the other way around? I know I'm nitpicking, but this is something that robs me of concentration, haha. – Carson Myers Jul 12 '11 at 07:27
  • Just to add some terms, you are distinguishing between the *syntactic* and *semantic* value of the input. – edA-qa mort-ora-y Jul 12 '11 at 07:57
  • @edA-qa yes, you're right, not sure why I didn't use those terms in my question. -- actually, I don't think it's so succinct, I think both have a syntactic component – Carson Myers Jul 12 '11 at 08:07
  • Semantic quality of data is indeterminate (and for practical purposes, bad) if the data is not readable due to being syntactically invalid. Data to be sane must be semantically valid, but syntactic validity is a prerequisite for that. – SF. Dec 23 '19 at 17:46
  • BTW, transforming input syntax to a unified form alone, without much worry about semantic validity or being within safe limits (say, turning the date 00-00-0000 into a timestamp corresponding to 30 November 1 BC, as some library functions tend to), is called normalization. – SF. Dec 23 '19 at 17:53
2

Something everyone seems to overlook: "to sanitize" means "to make sanitary", i.e. to clean up, not "to make sane".

Thus, "sanitizing input" means cleaning up input by normalizing it or removing bad or unnecessary parts, but with the basic assumption that the input is generally sane but possibly flawed in some aspects. This most often applies to input provided by users.

"Sanity checking", on the other hand, means identifying and rejecting input that is fundamentally broken. This is generally used for data provided by other systems or automatically, which is expected to be flawless. Sanity checking is basically a form of defensive programming (fail-fast).

Michael Borgwardt
  • Right, and I often hear or see "sanity" used to refer to both of these concepts, and always just assumed it was a happy coincidence that "sane" and "sanitized" are very similar. But I can find no definition of "sanity" that refers to sanitization. I am curious whether it was mistaken of me to always lump input sanitization with sanity checking, or if it was idiomatic to separate the two. – Carson Myers Jul 12 '11 at 09:26
  • _Sanitary_ and _sanity_ both come from the Latin word _sanus_, meaning "healthy or sane." It might be useful to distinguish input checks at the level of problems with a single field (negative day of month; email address w/no @) vs. real world sensibility (31st of Feb; no such post code in that country) but I haven't seen developers make that distinction. The _clean up_ vs. _filter out_ distinction is common. – Jerry101 Nov 11 '17 at 19:46
1

There's a whole chain of things which can be considered input sanitation:

  • Client-side error avoidance means constructing the user interface so that certain human errors (as different from malicious use) are impossible. This includes setting input field lengths, using date pickers, selection lists which update to contain only valid selections, and discarding any non-numeric characters typed into a number field. Any logic here should not be duplicated in client-side validation.
  • Client-side validation checks the input for user errors which can get past error avoidance. Obvious examples include trailing whitespace (you don't know until submission time that the user isn't planning to add another word) or two dots in a number. This should not check for things which are signs of bots (such as strings longer than the maximum input field length), because the test results will simply be ignored.
  • Server-side input filtering using a blacklist or whitelist is probably the most common definition of sanitation: Remove data from input that you don't think the system is capable of handling properly, and return an error if the result is deemed not usable.
  • Server-side input sanity check could be necessary if the user interface is somehow unable to verify some parts of the input. For example, a calculation or third-party communication might take too long to be done interactively by the client-side error avoidance code.
  • Server-side escaping is the good twin sister of filtering. By escaping once and only once on input and output you ensure that your entire stack is able to handle any input thrown at it.
  • Database restrictions are the final checkpoint, and there should be plenty of validation there (foreign keys, sane column lengths and data types, triggers if necessary) to ensure that a data insert is atomic, and that the result of a successful commit is usable. Any errors caught at this level should be a sure indication of willful attempts at sabotage.

Sanity can thus refer to formatting (date, int), validity (password, IBAN), computability (regular expression, mathematical formula), and just plain neatness (trailing spaces).
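The "escape once and only once" point from the escaping step can be sketched in Python with the standard `html.escape` function. The `render_comment` helper is hypothetical, and HTML output is just one assumed target; other output contexts need their own escaping.

```python
import html


def render_comment(text):
    """Escape user text exactly once, at the point where it enters HTML."""
    return "<p>" + html.escape(text) + "</p>"


once = html.escape("a & b")
# Escaping a second time mangles the text: the & in &amp; is escaped again.
twice = html.escape(once)
```

Here `once` is `a &amp; b`, while `twice` becomes `a &amp;amp; b`, which is the kind of corruption that escaping at more than one layer produces.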

l0b0
  • I see what you're saying, but shouldn't client side input logic _always_ be duplicated server side? It's always possible that the input will be coming from a browser with JavaScript disabled, or from a source that isn't a browser at all. Also, I was asking mainly about the semantics of the words sanity, sane and sanitization, and whether there were idioms surrounding those semantics or if it was all just grouped under the word "sanity." This sort of addressed that, but I feel like this answer missed the point a little. – Carson Myers Jul 12 '11 at 09:08
  • The server should only duplicate those parts which cannot be captured by the client. I agree that this should include anything that is validated by JavaScript only, but regardless of accessibility concerns most web sites are terrible this way. The answer was meant as a summary of what is normally grouped under "sanity, sane and sanitization." – l0b0 Jul 12 '11 at 09:37
  • Sorry to keep bothering with this, but what form of validation would only require client-side validation? Do you mean things like checkboxes being boolean? Or something else? And I understand, but I was wondering specifically the differences between those terms, if they existed. I mentioned that you sort of answered the question, in a way that said "they're pretty much the same" without explicitly saying so. FWIW I didn't downvote you. – Carson Myers Jul 12 '11 at 09:57
  • "The server should only duplicate those parts which cannot be captured by the client" should have been explained more in detail. First, the client should be able to capture all normal (i.e., non-hacker) errors and report them nicely to the *user*. After submitting to the server, any validation errors should be reported to the *developer* and considered a client-side validation *bug* or a *hacking attempt*. On the client side, this should just result in a generic error message and a request to try again. – l0b0 Jul 12 '11 at 11:28
  • Okay -- but if these errors happen because JavaScript is turned off (which may or may not be worth ignoring, depending on the form's purpose), wouldn't it be better to just correct it on the server side also, rather than bothering the user _and_ the developer with avoidable error messages/logs? – Carson Myers Jul 12 '11 at 11:36
  • I would never, ever *correct* user provided data on the server side. That way lies madness, and a huge chunk of complicated code for corner cases. If the form should work without JavaScript then the JS-only *validation* should be repeated on the server (using the same JS code as the one sent to the client!) to be able to provide more useful user feedback. – l0b0 Jul 12 '11 at 12:14
  • I think I see where you're coming from. I was referring to simple things, like making sure phone numbers and times are stored in a certain format, or making sure numerical values can be comprehended by the application (stripping spaces, and depending on the locale, commas). It's just always bugged me when I get errors like "PLEASE ENTER YOUR PHONE NUMBER THIS WAY!" when that can be easily and reliably fixed. – Carson Myers Jul 12 '11 at 12:19