60

The way I see it, SQL injection attacks can be prevented by:

  1. Carefully screening, filtering, encoding input (before insertion into SQL)
  2. Using prepared statements / parameterized queries

I suppose that there are pros and cons for each, but why did #2 take off and become more or less the de facto way to prevent injection attacks? Is it just safer and less prone to error, or were there other factors?

As I understand, if #1 is used properly and all caveats are taken care of, it can be just as effective as #2.

Sanitizing, Filtering, and Encoding

There was some confusion on my part about what sanitizing, filtering, and encoding meant. I'll say that for my purposes, all of the above can be considered part of option 1. In this case I understand that sanitizing and filtering have the potential to modify or discard input data, while encoding preserves the data as-is, but encodes it properly to avoid injection attacks. I believe that escaping data can be considered a way of encoding it.

Parameterized Queries vs Encoding Library

There are answers where the concepts of parameterized queries and encoding libraries are treated interchangeably. Correct me if I'm wrong, but I am under the impression that they are different.

My understanding is that encoding libraries, no matter how good they are, always have the potential to modify the SQL "program", because they make changes to the SQL itself before it is sent off to the RDBMS.

Parameterized queries, on the other hand, send the SQL program to the RDBMS, which then optimizes the query, defines the query execution plan, selects the indexes to be used, etc., and only then plugs in the data, as the last step inside the RDBMS itself (see the rough JDBC sketch after the diagrams below).

Encoding Library

  data -> (encoding library)
                  |
                  v
SQL -> (SQL + encoded data) -> RDBMS (execution plan defined) -> execute statement

Parameterized Query

                                               data
                                                 |
                                                 v
SQL -> RDBMS (query execution plan defined) -> data -> execute statement
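To make the distinction concrete, here is roughly how I picture the two flows in JDBC (a sketch only; the table, the column, and the `hypotheticalEncode` helper are made up for illustration):

import java.sql.*;

class QueryFlows {

    // Encoding-library flow: the value is spliced into the SQL text before it reaches
    // the RDBMS, so the final "program" depends on the encoder getting it right.
    static ResultSet encodedQuery(Connection conn, String name) throws SQLException {
        String sql = "SELECT * FROM users WHERE name = '" + hypotheticalEncode(name) + "'";
        return conn.createStatement().executeQuery(sql);
    }

    // Parameterized flow: the SQL "program" is sent and planned first; the value is
    // bound afterwards, purely as data, and cannot change the statement's structure.
    static ResultSet parameterizedQuery(Connection conn, String name) throws SQLException {
        PreparedStatement ps = conn.prepareStatement("SELECT * FROM users WHERE name = ?");
        ps.setString(1, name);
        return ps.executeQuery();
    }

    // Stand-in for a vendor-specific escaping routine -- the fragile part of option 1.
    static String hypotheticalEncode(String s) {
        return s.replace("'", "''");
    }
}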

Historical Significance

Some answers mention that historically, parameterized queries (PQ) were created for performance reasons, before injection attacks that targeted encoding issues became popular. At some point it became apparent that PQs were also pretty effective against injection attacks. To keep with the spirit of my question, why did PQs remain the method of choice, and why did they flourish above most other methods when it comes to preventing SQL injection attacks?

Dennis
  • 8,157
  • 5
  • 36
  • 68
  • 1
    Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/45375/discussion-on-question-by-dennis-why-did-sql-injection-prevention-mechanism-evol). – maple_shaft Sep 14 '16 at 16:13
  • 23
    Prepared statements are **not** a result of evolution from SQL injection attacks. They were there from the beginning. Your question is based on a false premiss. – user207421 Sep 14 '16 at 22:14
  • 4
    If you think you are smarter than the bad guys then go for #1 – paparazzo Sep 15 '16 at 04:51
  • 1
    "why did PQ remain the method of choice" Because it's the easiest and most robust. Plus the aforementioned performance advantages to PQ's. There's really not a downside. – Paul Draper Sep 17 '16 at 10:12
  • For the same reason we gave our cars round wheels, rather than softer suspension. – Lightness Races in Orbit Sep 18 '16 at 00:20
  • 1
    Because it's the correct solution to the problem of how to do queries, *even if it weren't for the issue of SQL injection in a security context*. Forms that require escaping and use in-band data with commands are **always a design bug** because they're error-prone, counter-intuitive, and break badly when used wrong. See also: shell scripting. – R.. GitHub STOP HELPING ICE Sep 18 '16 at 01:42

16 Answers

148

The problem is that #1 requires you to effectively parse and interpret the entirety of the SQL variant you're working against, so you know if it is doing something it shouldn't. And keep that code up to date as you update your database. Everywhere you accept input for your queries. And not screw it up.

So yes, that sort of thing would stop SQL injection attacks, but it is absurdly more costly to implement.
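For a taste of how that goes wrong in practice, here is a deliberately naive sketch (the table name and the escaping rule are illustrative only):

// A "sanitizer" that only doubles single quotes. Whether that is enough depends
// entirely on which SQL dialect you happen to be talking to.
class NaiveEscapeDemo {

    static String naiveEscape(String input) {
        return input.replace("'", "''");
    }

    public static void main(String[] args) {
        String userInput = "\\' OR 1=1 -- ";             // attacker-controlled value
        String sql = "SELECT * FROM users WHERE name = '"
                   + naiveEscape(userInput) + "'";

        // Prints: SELECT * FROM users WHERE name = '\'' OR 1=1 -- '
        // In a dialect where a backslash escapes the following quote (e.g. MySQL's
        // default mode), the doubled quote is swallowed by the backslash, the string
        // ends early, and OR 1=1 escapes into the query. Same code, different
        // dialect, new hole -- which is exactly the maintenance burden above.
        System.out.println(sql);
    }
}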

Telastyn
  • 108,850
  • 29
  • 239
  • 365
  • 2
    Thanks ... can you give an example of why such code would have to keep track of the SQL variant? So far I can only imagine I'd have to check for things like "is this really a string? -> Escape all the quotes using proper encoding, quote the entire thing and send it". "Is this really a number? -> send it in". I suppose if the way numbers are accepted in SQL variant itself changes, i.e. `1e4` may be accepted as number in one variant but not in another. Is that the kind of change you are talking about? – Dennis Sep 12 '16 at 14:26
  • 61
    @dennis - Well, what is a quote in your SQL variant? "? '? ”? U+2018‎? \u2018? Are there tricks to separate expressions? Can your subqueries do updates? There are _many_ things to consider. – Telastyn Sep 12 '16 at 14:45
  • 7
    @Dennis every DB engine has its own way of doing things such as escaping characters in strings. That is a lot of holes to plug, especially if an application needs to work with multiple DB engines or be compatible with future versions of the same engine that might change some minor query syntax that could be exploitable. –  Sep 12 '16 at 23:34
  • 2
@Dennis also consider the different collations available, I believe SQL Server (??) had an issue where if a specific collation was used, it could be used for injection as a character specific to that set wasn't checked for by a certain sanitiser (I believe it was another apostrophe). Unless you want to spend your days checking your sanitiser still works with every possible collation, parameterisation is easier and watertight. – Neil P Sep 13 '16 at 08:55
  • 12
Another benefit of prepared statements is the performance gain you get when you have to re-run the same query with different values. Also, prepared statements can know if a value is truly meant as `null`, a string or a number and act accordingly. This is very good for security. And even if you run the query once, the DB engine will already have it optimized for you. Better yet if it is cached! – Ismael Miguel Sep 13 '16 at 10:43
  • 9
    @Dennis Mr. Henry Null will thank you for doing this the right way. – Mathieu Guindon Sep 13 '16 at 19:43
  • 1
    Who is Mr. Henry Null? – Dennis Sep 13 '16 at 20:37
  • 14
    @Dennis the first name is irrelevant. The problem is with the last name. See [Stack Overflow](http://stackoverflow.com/q/4456438/1188513), [Programmers.SE](http://programmers.stackexchange.com/q/313819/68834), [Fox Sports](http://imgur.com/RW1F8b4), [Wired](https://www.wired.com/2015/11/null/), [BBC](http://www.bbc.com/future/story/20160325-the-names-that-break-computer-systems), and whatever else you can turn up in a quick Google search ;-) – Mathieu Guindon Sep 14 '16 at 01:59
  • 2
    It's also worth pointing out that #2 essentially takes care of #1 for rdbms that pass text at the network level but will likely be better coded and tested than some homebrew #1 code. – James Snell Sep 15 '16 at 16:39
  • Beware, mortal - the gods punish hubris, and enjoy a good laugh... – Bob Jarvis - Слава Україні Sep 16 '16 at 11:45
79

Because option 1 is not a solution. Screening and filtering means rejecting or removing invalid input. But any input might be valid. For example, an apostrophe is a valid character in the name "O'Malley". It just has to be encoded correctly before being used in SQL, which is what prepared statements do.
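For example, bound as a parameter the name goes straight through untouched (a minimal JDBC sketch; the `customers` table is made up):

import java.sql.*;

class StoreName {
    // The apostrophe in O'Malley never becomes part of the SQL text, so nothing has
    // to be rejected, stripped or escaped -- the name is stored exactly as typed.
    static void insertCustomer(Connection conn, String lastName) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO customers (last_name) VALUES (?)")) {
            ps.setString(1, lastName);   // e.g. "O'Malley"
            ps.executeUpdate();
        }
    }
}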


After you added the note, it seems you are basically asking why use a standard library function rather than writing your own functionally similar code from scratch? You should always prefer standard library solutions to writing your own code. It is less work and more maintainable. This is the case for any functionality, but especially for something which is security sensitive it makes absolutely no sense to reinvent the wheel on your own.

JacquesB
  • 57,310
  • 21
  • 127
  • 176
  • 2
    That's it (and that was the missing part in two other answers, so +1). Given how the question is formulated, it's not about *sanitizing* user input, but, and I quote the question: “filtering input (before insertion)”. If the question is now about sanitizing the input, then why would you do it yourself instead of letting the library do it (while, also, losing the opportunity to have cached execution plans, by the way)? – Arseni Mourzenko Sep 12 '16 at 16:17
  • 1
    I feel like I am missing a deeper meaning with this answer. I could say that `O'Malley` may be encoded like `'O\'Malley'`, or `"O'Malley"`, or even `O'Malley`, and things will be fine. But it cannot be encoded like `'O'Malley'` (Because this last encoding will consider the value to be only `'O'` and the rest will either result in an error or an injection attack). I may be missing nuances between *sanitizing* and *filtering* and *encoding* though. – Dennis Sep 12 '16 at 20:29
  • I understand that arbitrary input is (usually) meant to be "as-is", not to be altered, and thus it should not be modified, but only properly encoded as to not break the SQL program (& not cause an injection attack). I get that. But ... why won't option 1 be a solution in this case, if I encode it properly myself? (ignoring the difficulties & caveats & catching-up with 3rd party software that it may present here, as mentioned in other answers). I guess ... what's the difference between me encoding it and 3rd party library or software or RDBMS doing the encoding? – Dennis Sep 12 '16 at 20:37
  • 8
    @Dennis: Sanitizing or filtering means *removing* information. Encoding means transforming the representation of data *without* losing information. – JacquesB Sep 12 '16 at 20:39
  • 9
    @Dennis: filtering means either accepting or rejecting user input. For instance, “Jeff” would be *filtered* as input of “User's age” field, because the value is obviously invalid. If, instead of filtering input, you start transforming it by, for instance, replacing the single quote character, then you are doing *exactly* the same thing as the database libraries where they use parametrized queries; in this case, your question is simply “Why would I use something which exists and was written by experts in the field, when I can reinvent the wheel in every project?” – Arseni Mourzenko Sep 12 '16 at 20:40
  • ah, I see. In case of my question I meant more encoding and not filtering. Filtering/sanitizing can be used for something else (I believe more about ensuring validity of information rather than preventing injection attack). I can see that using filtering to *modify* data can be a way to prevent injection attack and yes it won't be an overall solution for precisely the reasons you've stated. I say makes sense & great answer. Thanks! – Dennis Sep 12 '16 at 20:49
  • 3
    @Dennis: `O\'Malley` is using the slash to _escape_ the quote for proper insertion (at least in some databases). In MS SQL or Access it can be escaped with an additional quote `O''Malley`. Not very portable if you have to do it yourself. – AbraCadaver Sep 12 '16 at 21:42
  • @AbraCadaver Right, but if (!) the SQL interface you use provides something like `mysql_escape()` (don't remember the exact spelling), you can do something like `"INSERT INTO xy (f1, f2) VALUES ('" + mysql_escape(value1) + "', '" + mysql_escape(value2) + "')"` and it is safe. But all other arguments, such as caching etc., remain effective. – glglgl Sep 13 '16 at 13:49
@Jacques, Re: "using a library function": I thought that parameterizing queries offers somehow a different technique altogether than "an encoding library function". Are you saying they are effectively identical in their functionality? I've heard arguments for parameterizing queries state that the technique separates the `program` (the SQL statement) from `data` (the parameters). So I was under the impression that there was a fundamental difference between the `library` and `parametrized` approaches and that they are not the same. – Dennis Sep 13 '16 at 14:10
  • 1
    I don't know the exact details of how RDBMS work when they receive SQL, but I understood parameterized queries as RDBMS first accepts `the program` -- the SQL query itself without parameters, then does whatever it needs to do to optimize the query, pick and choose proper indexes to use, define execution plan, etc, etc, *and only then* plug in the data parameters as the very last step. So even if given data is rogue, that data will not be able to affect the SQL `program` at this step. – Dennis Sep 13 '16 at 14:16
  • 1
An `encoding library`, on the other hand, precedes the RDBMS transformations: the library encodes the data into the SQL before an execution plan is made, thereby making it potentially vulnerable to injection. Because no matter how good the encoding library is, it has the *potential to modify the SQL program*, since it is making modifications to the SQL itself before it is sent off to the RDBMS. – Dennis Sep 13 '16 at 14:18
  • 5
    I can't tell you how many times my name has been outright rejected by a system. Sometimes I've even seen errors caused by SQL injection just from using my name. Heck, I was once asked to change my username because I was actually breaking something on the backend. – Alexander O'Mara Sep 14 '16 at 17:06
  • 2
    @AlexanderO'Mara: Well, at least your name isn't something like `Alexander O';Delete from Users;--` ;-) – dan04 Sep 14 '16 at 22:22
  • @Dennis: I am not really sure what your actual question is now? – JacquesB Sep 15 '16 at 18:13
  • And you should be able to store little Bobby tables; as a name in your database. – gnasher729 Apr 14 '23 at 17:44
60

If you're trying to do string processing, then you're not really generating an SQL query. You're generating a string that can produce an SQL query. There's a level of indirection that opens up a lot of room for errors and bugs. It's somewhat surprising really, given that in most contexts we're happy to interact with something programmatically. For instance, if we have some list structure and want to add an item, we usually don't do:

List<Integer> list = new ArrayList<>(Arrays.asList(1, 2, 3));
String strList = list.toString();                /* to get "[1, 2, 3]" */
strList = strList.replace("2, 3", "2, 5, 3");    /* manipulate strList to become "[1, 2, 5, 3]" */
list = parseList(strList);                       /* some hand-written parser back to a List */

If someone suggests doing that, you'd rightly respond that it's rather ridiculous, and that one should just do:

List<Integer> list = /* ... */;
list.add(2, 5);   /* insert the value 5 at index 2 */

That interacts with the data structure at its conceptual level. It doesn't introduce any dependency on how that structure might be printed or parsed. Those are completely orthogonal decisions.

Your first approach is like the first sample (only a little bit worse): you're assuming that you can programmatically construct the string that will be correctly parsed as the query that you want. That depends on the parser, and a whole bunch of string processing logic.

The second approach of using prepared queries is much more like the second sample. When you use a prepared query, you're essentially parsing a pseudo-query that's legal but has some placeholders in it, and then using an API to correctly substitute some values in there. You no longer involve the parsing process, and you don't have to worry about any string processing.

In general, it's much easier, and much less error prone, to interact with things at their conceptual level. A query isn't a string; a query is what you get when you parse a string, or construct one programmatically (or whatever other method lets you create one).

There's a good analogy here between C-style macros that do simple text replacement and Lisp-style macros that do arbitrary code generation. With C-style macros, you can replace text in the source code, and that means that you have the ability to introduce syntactic errors or misleading behavior. With Lisp macros, you're generating code in the form that the compiler processes it (that is, you're returning the actual data structures that the compiler processes, not text that the reader has to process before the compiler can get to it). With a Lisp macro, you can't generate something that would be a parse error, though. E.g., you can't generate (let ((a b) a.

Even with Lisp macros, you can still generate bad code though, because you don't necessarily get to be aware of the structure that's supposed to be there. E.g., in Lisp, (let ((a b)) a) means "establish a new lexical binding of the variable a to the value of variable b, and then return the value of a", and (let (a b) a) means "establish new lexical bindings of the variables a and b and initialize them both to nil, and then return the value of a." Those are both syntactically correct, but they mean different things. To avoid this issue, you could use more semantically-aware functions and do something like:

Variable a = new Variable("a");
Variable b = new Variable("b");
Let let = new Let();
let.getBindings().add(new LetBinding(a,b));
let.setBody(a);
return let;

With something like that, it's impossible to return something that's syntactically invalid, and it's much harder to return something that's accidentally not what you wanted.

Joshua Taylor
  • 1,610
  • 11
  • 11
  • Good explanation! – Mike Partridge Sep 13 '16 at 14:22
  • 2
    You lost me at "good analogy" but I upvoted based on the preceding explanation. :) – Wildcard Sep 14 '16 at 02:30
  • 1
    Excellent example! - And you could add: Depending on the datatype it is sometimes not even possible or feasible to create a parsable string. - What if one of my parameters is a **free-text field** containing a story draft (~10.000 characters) ? or what if one parameter is a **JPG-Image** ? - The only way then is a parametrized query – Falco Sep 14 '16 at 14:05
Actually no - it is a pretty bad description of why prepared statements evolved as a defense to sql injection. Particularly given the code example is in java, which was not around when parameterized queries were developed, likely in the timeframe when C/C++ were considered state of the art. SQL databases started to be used in the early years of the 1970-1980 timeframe. WAY before higher level languages were popular. Heck, I would say many of them came to make working with databases easier (PowerBuilder anyone?) – TomTom Sep 15 '16 at 07:09
  • @TomTom actually, I agree with most of your content. I've only implicitly touched on the security aspect here. On SO, I answer lots of SPARQL (the RDF query language, with some similarities to SQL) questions and lots of people run into issues because they concatenate strings rather than using parameterized queries. Even without injection attacks, parameterized queries help avoid bugs/crashes, and bugs/crashes can be security issues, too, even if they're not injection attacks. So I'd say less and more: parameterized queries are good, even if SQL injection wasn't an issue, and they're good... – Joshua Taylor Sep 15 '16 at 10:51
@tomtom for preventing more than just SQL injection. So I agree with (and upvoted) your answer: parameterized queries didn't evolve because of SQL injection, though awareness of SQL injection might have helped increase their adoption. But as for language issues, I wasn't writing Java, per se, just outlining some ideas in pseudocode that's kind of similar; the language doesn't matter. – Joshua Taylor Sep 15 '16 at 10:55
@TomTom I feel you misinterpreted this answer: it explains _why_ prepared statements are a more popular defense, not _how_ or _when_ they evolved. And the writer does a good job of illustrating this using programming languages that are common today, probably so that it's understandable to more people currently. It's simply good authorship. – Ricardo van den Broek Sep 15 '16 at 16:47
  • @RicardovandenBroek Yes, that is the answer. But that does not make it an answer that answers the question as written in the title - ergo it is not appropriate. The question is about the evolution, explicitly. – TomTom Sep 15 '16 at 17:20
  • @TomTom OP explained his question: ''Why did #2 take off and become considered to be more or less the de facto way to prevent injection attacks? Is it just safer and less prone to error or were there other factors?" This answer addresses that, instead of dwelling on the possibility that the OP had a misunderstanding of the history of and between these methods. Even though the latter is still interesting and hence your answer too is appreciated. – Ricardo van den Broek Sep 15 '16 at 18:04
  • Except it did not. It was best practice WAY BEFORE sql injections. – TomTom Sep 15 '16 at 19:36
21

It helps that option #2 is generally considered a best practice because the database can cache the query plan for the parameterized query and reuse it across executions. Parameterized queries predate the issue of SQL injection by several years (I believe); it just so happens that you can kill two birds with one stone.

JasonB
  • 339
  • 1
  • 3
  • 10
    SQL injection has been an issue since SQL was first invented. It didn't become an issue later. – Servy Sep 12 '16 at 17:31
  • 10
    @Servy Theoretically yes. Practically it only became a real issue when our input mechanisms went online, presenting a massive attack surface for anyone to hammer. – Jan Doggen Sep 12 '16 at 19:42
  • 8
[Little Bobby Tables](http://xkcd.com/327/) would disagree that you need either the internet or a large user base to leverage SQL injection. And of course networks *pre-date* SQL, so it's not like you would need to wait for networks once SQL came out. Yes, security vulnerabilities are *less* exposed when your application has a small user base, but they're still security vulnerabilities, and people *do* exploit them when the database itself has valuable data (and many *very* early databases had very valuable data, as only people with valuable databases could afford the tech). – Servy Sep 12 '16 at 19:52
  • 5
    @Servy to my knowledge, dynamic SQL was a relatively late feature; initial use of SQL was mostly precompiled/preprocessed with parameters for values (both in and out), so parameters in queries might predate SQL injection in software (maybe not in ad-hoc/CLI queries). – Mark Rotteveel Sep 12 '16 at 20:16
  • 2
    According to Wikipedia "The first public discussions of SQL injection started appearing around 1998" though I don't know if it was being used before then and not written about. – Martin Smith Sep 13 '16 at 05:05
  • 6
    They might predate *awareness of* SQL injection. – user253751 Sep 13 '16 at 05:11
  • 1
Anyway the point is that not only is it safer to use parametrized queries, but it also has advantages in performance and memory footprint – Borjab Sep 14 '16 at 15:16
  • 1
    Enjoy filling up your DB's prepared statement cache with `SELECT ... FROM ... WHERE somecolumnname IN (?, ?)`, `SELECT ... FROM ... WHERE somecolumnname IN (?, ?, ?)`, `SELECT ... FROM ... WHERE somecolumnname IN (?, ?, ?, ?)`, `SELECT ... FROM ... WHERE somecolumnname IN (?, ?, ?, ?, ?)`, etc. depending on the argument count. – Damian Yerrick Sep 14 '16 at 15:44
  • "also it has advantages in performance and memory footprint" - No. Not in 2016. But yes more than 25 years ago. As someone looked it up, for example, query plan caching was added by Microsoft to SQL Server in 1998 (v7). Which means earlier SQL Server AND Sybase Products had performance gains when not using dynamic SQL. – TomTom Sep 14 '16 at 17:43
20

Simply said: They did not. Your statement:

Why did SQL Injection prevention mechanism evolve into the direction of using Parameterized Queries?

is fundamentally flawed. Parameterized queries have existed far longer than SQL injection has been widely known. They were generally developed as a way to avoid string concatenation in the usual "form for search" functionality that LOB (Line of Business) applications have. Many - MANY - years later, someone found a security issue with said string manipulation.

I remember doing SQL 25 years ago (when the internet was NOT widely used - it was just starting), and I remember doing SQL against IBM DB2, IIRC version 5 - and that had parameterized queries already.

TomTom
  • 545
  • 3
  • 8
  • thanks. Why was there a need to avoid string concatenation? It seems to me like that would be a useful feature. Did someone have an issue with it? – Dennis Sep 14 '16 at 13:50
  • 3
Two actually. First, it is not always totally trivial - why deal with memory allocation etc. when it is not needed. But second, in ancient times plan caching on the SQL database side was not exactly that great - compiling SQL was expensive. As a side effect of using prepared statements (which is where parameters come from), execution plans could be reused. SQL Server introduced auto-parameterization (to reuse query plans even without parameters - they are deduced and implied) I think either 2000 or 2007 - somewhere in between, IIRC. – TomTom Sep 14 '16 at 13:54
  • 2
    Having parameterized queries does not eliminate the ability to do string concatenation. You can do string concatenation to generate a parameterized query. Just because a feature is useful doesn't mean it's always a good choice for a given problem. – JimmyJames Sep 14 '16 at 14:13
  • Yes, but as I said - by the time they were invented, dynamic SQL came with a quite decent performance hit ;) Even today people tell you that dynamic SQL query plans in sql server are not reused (which is wrong since - hm - as I said some point between 2000 and 2007 - so QUITE long). At that old time you really wanted PREPARED statements if you run sql multiple times ;) – TomTom Sep 14 '16 at 14:37
  • Plan caching for dynamic SQL was in fact added to SQL Server 7.0, in *1998* - http://sqlmag.com/database-performance-tuning/inside-sql-server-sql-server-70-plan-caching – Mike Dimmick Sep 14 '16 at 15:35
  • Ah, off by 2 years. Amazing - I was actually part of the technical beta for a project I was in... yeah, it was added back then. Before that - you better used prepared statements for performance ;) Actually stored procedures. So, there are factual reasons - databases are complex and are around a LONG time. – TomTom Sep 14 '16 at 16:27
  • You might safely concatenate a query from a bunch of constant strings though, at configuration time. It doesn't mean you're concatenating arbitrary user input. – Useless Sep 16 '16 at 15:16
@SantiBailors To add to what Useless wrote, I agree that if you are concatenating SQL with user input, you are still subject to SQLi, but that's not the whole story: one reason you might concatenate strings together to generate parameterized statements is if you have a variable number of parameters. Templates work well for this kind of thing, like what MyBatis/IBatis provides. – JimmyJames Sep 16 '16 at 19:23
13

In addition to all the other good answers:

The reason #2 is better is that it separates your data from your code. With #1 your data is part of your code, and that's where all the bad things come from. With #1 you get your query and need to perform additional steps to make sure the query understands your data as data, whereas with #2 your code is code and your data is data.

Pieter B
  • 12,867
  • 1
  • 40
  • 65
  • 3
    Separating code and data also means that your defences against hostile code injection are written and tested by the database vendor. Therefore if something passed as a parameter along with a harmless query ends up trashing or subverting your database, the database company's reputation is on the line, and your org might even sue them and win. It also means that if that code contains an exploitable bug, the odds are pretty good that it is someone else's site where all heck breaks loose, rather than yours. (Just don't ignore the security bugfixes!) – nigel222 Sep 15 '16 at 16:49
11

Parameterized queries, apart from providing SQL injection defence, often have an additional benefit of being compiled only once, then executed multiple times with different parameters.

From the SQL database point of view, `select * from employees where last_name = 'Smith'` and `select * from employees where last_name = 'Fisher'` are distinctly different and therefore require separate parsing, compilation, and optimization. They will also occupy separate slots in the memory area dedicated to storing compiled statements. In a heavily loaded system with a large number of similar queries that have different parameters, the computation and memory overhead can be substantial.

Consequently, using parameterized queries often provides major performance advantages.
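In JDBC terms that reuse looks roughly like this (a sketch; the `first_name` column is assumed, and whether the server actually shares the plan depends on the driver and RDBMS, as the comments below point out):

import java.sql.*;

class ReusedStatement {
    // One statement text, parsed and planned once, then executed with different values.
    // With literals spliced in, the 'Smith' and 'Fisher' queries would be two distinct
    // statements to parse, optimize and cache separately.
    static void printEmployees(Connection conn) throws SQLException {
        String sql = "SELECT first_name FROM employees WHERE last_name = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String name : new String[] {"Smith", "Fisher"}) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("first_name"));
                    }
                }
            }
        }
    }
}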

mustaccio
  • 227
  • 1
  • 4
  • I think that's the theory (based on used prepared statements for parameterised queries). In practice, I doubt this is really that often the case, as most implementations will just prepare-bind-execute in one call, so use a different prepared statement for each parameterised query unless you take explicit steps to actually prepare statements (and a library-level `prepare` is often quite different from an actual SQL-level `prepare`). – jcaron Sep 14 '16 at 06:23
  • The following queries are also different to the SQL parser: `SELECT * FROM employees WHERE last_name IN (?, ?)` and `SELECT * FROM employees WHERE last_name IN (?, ?, ?, ?, ?, ?)`. – Damian Yerrick Sep 14 '16 at 15:45
Yes, they have. Which is why MS did add query plan caching back in 1998 to SQL Server 7. As in: your information is a generation old. – TomTom Sep 14 '16 at 17:44
  • 1
    @TomTom -- query plan caching is not the same as auto-parameterization, at which you appear to be hinting. As in, read before you post. – mustaccio Sep 14 '16 at 18:04
  • @mustaccio Actually at least MS did introduce both at the same time. – TomTom Sep 14 '16 at 18:19
5

Wait but why?

Option 1 means you have to write sanitizing routines for every type of input, whereas option 2 is less error-prone and less code for you to write/test/maintain.

Almost certainly, "taking care of all caveats" is more complex than you think it is, and your language (for example Java's PreparedStatement) has more going on under the hood than you think.

Prepared statements or parametrized queries are pre-compiled in the database server, so when the parameters are set, no SQL concatenation is done because the query is no longer an SQL string. An additional advantage is that the RDBMS caches the query, and subsequent calls are considered to be the same SQL even when the parameter values vary, whereas with concatenated SQL, every time the query is run with different values the query is different and the RDBMS has to parse it, create the execution plan again, etc.

Tulains Córdova
  • 39,201
  • 12
  • 97
  • 154
  • 1
JDBC does not sanitize anything. The protocol has a specific part for parameters, and the DB simply does not interpret those parameters. That is why you can't set a table name from a parameter. – talex Sep 13 '16 at 10:35
@talex It must at least escape single quotes. – Tulains Córdova Sep 13 '16 at 10:38
  • 1
    Why? if parameter is not parsed or interpreted there is no reason to escape something. – talex Sep 13 '16 at 10:41
  • @talex To prevent SQL injection maybe? – Tulains Córdova Sep 13 '16 at 10:56
  • We still talking about PreparedStatement? You can look into JDBC code and notice that no escaping are done there. – talex Sep 13 '16 at 11:09
  • 11
    I think you have the wrong image of how a parameterized query works. It is not just a case of the parameters being substituted in later, *they are never substituted in*. A DBMS turns any query into a "plan", a set of steps it's going to execute to get your result; in a parameterized query, that plan is like a function: it has a number of variables that need to be supplied when it is executed. By the time the variables are supplied, the SQL string has been completely forgotten, and the plan is just executed with the provided values. – IMSoP Sep 13 '16 at 12:18
  • 2
    @IMSoP That was a misconception of mine. Although I think it's a common one as you can see in the two most voted answers to this question in SO http://stackoverflow.com/questions/3271249/difference-between-statement-and-preparedstatement . I read about it and you are right. I edited the answer. – Tulains Córdova Sep 13 '16 at 13:59
  • @TulainsCórdova It is indeed a common misunderstanding. See also [my explanation for why you can't use a table or column name as a bound parameter](http://stackoverflow.com/a/15990488/157957). – IMSoP Sep 13 '16 at 15:03
  • @IMSoP If the number of variables is itself variable, such as in the right side of operator `IN`, how does the DBMS convert that to a "plan"? – Damian Yerrick Sep 14 '16 at 15:47
  • @DamianYerrick At least in JDBC the list inside IN cannot be specified by a parameter. What you usually do is to dynamically create a list of parameters "IN (?,?,?,?,?)" . – Tulains Córdova Sep 14 '16 at 15:50
  • @DamianYerrick As Tulains says, in most cases, it can't, and this is one of the limitations of parameterized queries (along with cases where you genuinely need to vary the plan dynamically, e.g. sort order). However, some DBMSes do have array types, and could at least in theory construct a plan with an unbound array parameter. If they did, that would allow you to do `WHERE some_column = ANY( ? )`, which is equivalent to `WHERE some_column IN ( ... )`. It would rely on the driver being able to convert your programming language's array or list type into the correct representation, though. – IMSoP Sep 14 '16 at 16:22
  • "Prepared statements or parametrized queries are pre-compiled in the database server so" - as are dynamic SQL since a long time. It is not like manufacturers are idiots. MS added query plan caching with auto parameterization - back in 1998. – TomTom Sep 14 '16 at 17:45
  • 3
    @TomTom That's great for *performance*, but it does nothing for *security*. By the time a compromised piece of dynamic SQL is compiled and cached, *the program has already been altered*. Creating a plan from non-dynamic parameterised SQL and then passing data elements is still fundamentally different from a DBMS abstracting the similarity between two queries presented to it as complete SQL strings. – IMSoP Sep 16 '16 at 09:47
1

Let's imagine what an ideal "sanitize, filter and encode" approach would look like.

Sanitizing and filtering might make sense in the context of a particular application, but ultimately they both boil down to saying "you can't put this data in the database". For your application, that might be a good idea, but it's not something you can recommend as a general solution, since there will be applications that need to be able to store arbitrary characters in the database.

So that leaves encoding. You could start by having a function that encodes strings by adding escape characters, so that you can substitute them in yourself. Since different databases need different characters escaping (in some databases, both \' and '' are valid escape sequences for ', but not in others), this function needs to be provided by the database vendor.
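A sketch of what such a function might look like, just to show how vendor-specific it immediately becomes (the dialect names are invented, and real escaping rules are considerably longer than this):

class StringEncoder {

    enum Dialect { DOUBLED_QUOTES, BACKSLASH_ESCAPES }

    // Encode a string value as a quoted SQL literal for the given dialect.
    // This only covers strings -- and only the simplest part of their rules.
    static String encodeStringLiteral(String value, Dialect dialect) {
        switch (dialect) {
            case DOUBLED_QUOTES:
                return "'" + value.replace("'", "''") + "'";
            case BACKSLASH_ESCAPES:
                return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
            default:
                throw new IllegalArgumentException("unknown dialect");
        }
    }
}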

But not all variables are strings. Sometimes you need to substitute in an integer, or a date. These are represented differently to strings, so you need different encoding methods (again, these would need to be specific to the database vendor), and you need to substitute them into the query in different ways.

So maybe things would be easier if the database handled substitution for you too - it already knows what types the query expects, and how to encode data safely, and how to substitute them into your query safely, so you don't need to worry about it in your code.

At this point, we've just reinvented parameterised queries.

And once queries are parameterised, it opens up new opportunities, such as performance optimizations, and simplified monitoring.

Encoding is hard to do right, and encoding-done-right is indistinguishable from parameterisation.

If you really like string interpolation as a way of building queries, there are a couple of languages (Scala and ES2015 come to mind) that have pluggable string interpolation, so there are libraries that let you write parameterised queries that look like string interpolation, but are safe from SQL injection - so in ES2015 syntax:

import {sql} from 'cool-sql-library'

let result = sql`select *
    from users
    where user_id = ${user_id}
      and password_hash = ${password_hash}`.execute()

console.log(result)
James_pic
  • 274
  • 3
  • 5
  • 1
    "Encoding is hard to do right" - hahaha. It is not. A day or two, it is all documented. I wrote an encoder many years ago for an ORM (because sql server has a limit on parameters and thus it is problematic to insert 5000-10000 rows in one statement (back 15 years ago). I do not remember that being a large problem. – TomTom Sep 14 '16 at 17:47
  • 1
Perhaps SQL Server is sufficiently regular that it's a non-problem, but I've encountered problems in other DBs - corner cases with mismatched character encodings, obscure config options, locale-specific date and number issues. All solvable, but needing at least a cursory understanding of the DB's quirks (I'm looking at you, MySQL and Oracle). – James_pic Sep 15 '16 at 08:53
  • 3
    @TomTom Encoding is actually very hard to get right once you factor in time. What do you do when your DB vendor decides to create a new comment style in the next release or when a bareword becomes a new keyword in an upgrade? You could theoretically get encoding actually right for one release of your RDBMS and be wrong on the next revision. Don't even get started on what happens when you switch vendors to one that has [conditional comments using nonstandard syntax](http://dev.mysql.com/doc/refman/5.7/en/comments.html) – Eric Sep 17 '16 at 06:07
  • @Eric, that's frankly horrifying. (I use Postgres; if it has any such bizarre warts I've yet to encounter them.) – Wildcard Nov 24 '16 at 11:00
1

Offering an alternative perspective on this one - I see SQL injection (and other injection attacks) as being a serialisation problem... you're taking a complex structure (in this case, a database query), serialising it to string, and sending it across the wire to the database which has to deserialise it into a usable form.

That last part is the key. The database doesn't just magically run an SQL string directly... it has to parse the SQL string into some kind of syntax tree, turning embedded values into tokens.

So in that model, parameterised queries make perfect sense... the caller already has those values separated out, and the database needs them separated out, so you might as well keep them in the form that's easiest for both sides, instead of worrying about how to safely encode them in such a way that the parser can safely decode them.

Simon Geard
  • 189
  • 3
0

In option 1, you are working with an input set of size=infinity that you are trying to map to a very large output size. In option 2, you have bounded your input to whatever you choose. In other words:

  1. Carefully screening and filtering [infinity] for [all safe SQL queries]
  2. Using [preconsidered scenarios limited to your scope]

According to other answers, there also appears to be some performance benefits from limiting your scope away from infinity and towards something manageable.

0

One useful mental model of SQL (especially modern dialects) is that each SQL statement or query is a program. In a native binary executable program, the most dangerous kinds of security vulnerabilities are overflows where an attacker can overwrite or modify the program code with different instructions.

A SQL injection vulnerability is isomorphic to a buffer overflow in a language like C. History has shown that buffer overflows are extremely difficult to prevent -- even extremely critical code subject to open review has often contained such vulnerabilities.

One important aspect of the modern approach to solving overflow vulnerabilities is the use of hardware and OS mechanisms to mark particular parts of memory as non-executable, and to mark other parts of memory as read-only. (See the Wikipedia article on Executable space protection, for example.) That way, even if an attacker could modify data, the attacker cannot cause their injected data to be treated as code.

So if a SQL injection vulnerability is equivalent to a buffer overflow, then what's the SQL equivalent to an NX bit, or to read-only memory pages? The answer is: prepared statements, which include parameterized queries plus similar mechanisms for non-query requests. The prepared statement is compiled with certain parts marked read-only, so an attacker cannot change those parts of the program, and other parts marked as non-executable data (the parameters of the prepared statement), which the attacker could inject data into but which will never be treated as program code, thus eliminating most of the potential for abuse.

Certainly, sanitizing user input is good, but to really be secure you need to be paranoid (or, equivalently, to think like an attacker). A control surface outside of the program text is the way to do that, and prepared statements provide that control surface for SQL. So it should come as no surprise that prepared statements, and thus parameterized queries, are the approach that the vast majority of security professionals recommend.

Daniel Pryden
  • 3,268
  • 1
  • 21
  • 21
  • This is all nice and dandy, but it does not address the question as per title at all. – TomTom Sep 14 '16 at 17:46
  • 1
    @TomTom: What do you mean? The question is exactly about why parameterized queries are the preferred mechanism for preventing SQL injection; my answer explains why parameterized queries are more secure and robust than sanitizing user input. – Daniel Pryden Sep 14 '16 at 18:16
  • I am sorry, but MY question reads "Why did SQL Injection prevention mechanism evolve into the direction of using Parameterized Queries?". They did not. It is not about the now, it is about the history. – TomTom Sep 14 '16 at 18:19
0

I already wrote about this here: https://stackoverflow.com/questions/6786034/can-parameterized-statement-stop-all-sql-injection/33033576#33033576

But, just to keep it simple:

The way parameterized queries work is that the SQL query is sent as a query, and the database knows exactly what this query will do; only then will it insert the username and password, merely as values. This means they cannot affect the query, because the database already knows what the query will do. So in this case it would look for a username of "Nobody OR 1=1'--" and a blank password, which should come up false.

This isn't a complete solution though, and input validation will still need to be done, since this won't prevent other problems, such as XSS attacks: you could still put JavaScript into the database. Then if this is read out onto a page, it would be treated as normal JavaScript, depending on any output validation. So really the best thing to do is still to use input validation, but also to use parameterized queries or stored procedures to stop any SQL attacks.
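A minimal JDBC sketch of that kind of login check (the table and column names are assumptions):

import java.sql.*;

class LoginCheck {
    // The username arrives purely as a value, never as SQL, so "Nobody OR 1=1'--" is
    // just an odd-looking name to look up -- and no row will match it.
    static boolean credentialsMatch(Connection conn, String username, String passwordHash)
            throws SQLException {
        String sql = "SELECT 1 FROM users WHERE username = ? AND password_hash = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, username);
            ps.setString(2, passwordHash);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();   // false for the injected "username" above
            }
        }
    }
}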

Josip Ivic
  • 1,617
  • 5
  • 28
  • 46
0

I've never used SQL. But obviously you hear about what problems people have, and SQL developers had problems with this "SQL injection" thing. For a long time I couldn't figure it out. And then I realized that people were creating SQL statements, real textual SQL source statements, by concatenating strings, some of which were entered by a user. And my first thought on that realisation was shock. Total shock. I thought: How can anyone be so ridiculously stupid and create statements in any programming language like that? To a C, or C++, or Java, or Swift developer, this is utter madness.

That said, it is not very difficult to write a C function that takes a C string as its argument, and produces a different string that looks exactly like a string literal in C source code representing the same string. For example, that function would translate `abc` to `"abc"`, `"abc"` to `"\"abc\""`, and `"\"abc\""` to `"\"\\\"abc\\\"\""`. And once that C function is written, it isn't difficult at all to generate C source code where the text from an input field supplied by the user is turned into a C string literal. That isn't hard to make safe. Why SQL developers wouldn't use that approach as a way to avoid SQL injections is beyond me.
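Roughly this kind of thing, sketched here in Java rather than C for brevity (only quotes and backslashes are handled):

class Quoter {
    // abc -> "abc", "abc" -> "\"abc\"", and so on: every backslash and double quote
    // in the value is escaped, then the whole thing is wrapped in quotes. A real
    // version would also handle newlines, control characters, and so on.
    static String toLiteral(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}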

"Sanitizing" is a totally flawed approach. The fatal flaw is that it makes certain user inputs illegal. You end up with a database where a generic text field cannot contain text like ; Drop Table or whatever you would use in an SQL injection to cause damage. I find that quite unacceptable. If a database stores text, it should be able to store any text. And the practical flaw is that sanitizer can't seem to get it right :-(

Of course, parameterized queries are what any programmer using a compiled language would be expecting. It makes life so much easier: You have some string input, and you never even bother to translate it into an SQL string, but just pass it as a parameter, with no chance of any characters in that string causing any damage.

So from the point of view of a developer using compiled languages, sanitizing is something that would never occur to me. The need for sanitizing is insane. Parameterised queries are the obvious solution to the problem.

(I found Josip's answer interesting. He basically says that with parameterised queries you can stop any attack against SQL, but then you can have text in your database that is used to create a JavaScript injection :-( Well, we have the same problem again, and I don't know if JavaScript has a solution to that.)

gnasher729
  • 42,090
  • 4
  • 59
  • 119
0

When I first heard of SQL injections I had the impression that something magically clever had to go on. Then I found it was the absolute opposite: textual substitution in a query is just utter stupidity on an incredible scale.

If parameterised queries didn't exist, there would be one possible solution: escaping. As an example, I can write a C function that takes any `char*` containing a C string as its argument and outputs a C string containing a C-language string literal that compiles to that C string. For example, a `char*` containing `"abc"`, including the quotes, would return `"\"abc\""`. You must do something like that to produce JSON. And the same could be done for SQL (hoping that SQL doesn't make it too hard).

gnasher729
  • 42,090
  • 4
  • 59
  • 119
  • 1
    Indeed... SQL injection (and command injection, etc) are all essentially the same thing as XSS... variants on the theme of serializing a complex structure into a byte stream and reading it back again. If you construct that byte stream using the right tools, it all works... but if you take shortcuts, things can go very very wrong. Escaping _can_ work, and is sometimes the only option, but you need to handle the encoding *exactly* the same as what the parser expects. – Simon Geard Apr 16 '23 at 21:05
-2

The main problem is that hackers found ways to get around the sanitization, while parameterized queries were an existing mechanism that worked perfectly, with the extra benefits of performance and memory.

Some people simplify the problem as "it is just the single quote and double quote", but hackers found smart ways to avoid detection, like using different encodings or making use of database functions.

Anyway, you only needed to forget to sanitize one single string to create a catastrophic data breach. Hackers were able to automate scripts to download the complete database with a series of queries. If the software is well known, like an open source suite or a famous business suite, you could simply attack the users and passwords table.

On the other hand, moving away from concatenated queries was just a matter of learning parameterized ones and getting used to them.

Borjab
  • 1,339
  • 7
  • 16