19

http://highscalability.com/blog/2010/11/9/facebook-uses-non-stored-procedures-to-update-social-graphs.html talks about how Facebook moved logic out of the database into the application in order to improve ease of development (scaling better across employees instead of across databases).

Are there any studies/articles/benchmarks that evaluate the business and technical tradeoffs of running as much of the logic in the database versus in the application?

Specifically, I'm wondering what scales better (both in terms of the database performance and ease of development as the business grows):

Scenario 1 (Logic in database)

  • Running as much of the logic as possible on the database end.
  • Using database-specific SQL extensions/features.
  • Using stored procedures.
  • Few queries over short transactions, low network overhead.

Scenario 2 (Logic in application)

  • Running as much of the logic as possible on the application end.
  • Limiting oneself to features which are common across all major databases.
  • Not using stored procedures.
  • Many queries over long transactions, high network overhead.

What kind of people-scaling overhead can one expect in Scenario 1?
What kind of database-scaling overhead can one expect in Scenario 2?
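
To make the two scenarios concrete, here is a minimal sketch (the accounts table and the balance rule are invented for illustration; SQLite stands in for whatever database is actually in use, and a single set-based statement stands in for a server-side stored procedure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)",
                 [(i, i % 100) for i in range(1, 1001)])

# Scenario 1: logic in the database. One statement, one short transaction,
# one round trip. (A real stored procedure would wrap this server-side.)
with conn:
    conn.execute("UPDATE accounts SET balance = balance + 10 WHERE balance < 50")

# Scenario 2: logic in the application. Many queries, a longer transaction,
# and every row is pulled back to the application before the rule is applied.
with conn:
    rows = conn.execute("SELECT id, balance FROM accounts").fetchall()
    for account_id, balance in rows:
        if balance < 50:  # the business rule lives in application code
            conn.execute("UPDATE accounts SET balance = ? WHERE id = ?",
                         (balance + 10, account_id))
```

Scenario 1 ships the rule to the data; Scenario 2 ships the data to the rule and holds the transaction open while the application decides.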

Gili
  • @HLGEM, I flagged this question for moving to programmers. Once that's done, I'll cross-post it to http://dba.stackexchange.com/ and see what they say :) – Gili Sep 30 '14 at 20:27
  • 2
    @Gili and it'll just get closed because cross posting is not appreciated. – jwenting Oct 01 '14 at 10:20
  • @jwenting, fortunately someone else already did it: http://dba.stackexchange.com/q/2450/4719 so I won't be doing the same. – Gili Oct 01 '14 at 14:15
  • @Gili: [I've posted your interesting question also to reddit](http://redd.it/2i0cvy). You might get some more traction over there... – Lukas Eder Oct 01 '14 at 21:52
  • @FlorianMargaine, I saw that question but the answers were very subjective and did little to back up their claims. I am attempting to conduct a more balanced discussion this time around. Also, there is an added focus on people-scaling this time around that did not exist in the original question. – Gili Oct 02 '14 at 18:03

4 Answers

17

Answering my own question (aggregating everything I've read so far):

Summary: Contrary to popular belief, you shouldn't use stored procedures for performance or security reasons. You should use them to support multiple applications, to enforce data integrity and to implement cross-cutting concerns that cannot be easily implemented in the application layer.

Feel free to push all remaining logic into the application layer.
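
To illustrate the data-integrity and cross-cutting points, here is a minimal sketch (the schema, the balance rule, and the audit table are invented; SQLite's CHECK constraints and triggers stand in for whatever your database provides, including stored procedures). The rule holds for every application, import job, and ad-hoc script that touches the table, whether or not its authors remembered it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The rule "balance may never go negative" lives with the data, so it holds
# for everything that writes this table, not just one well-behaved app.
conn.executescript("""
CREATE TABLE accounts (
    id      INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0)
);
CREATE TABLE account_audit (
    account_id  INTEGER NOT NULL,
    old_balance INTEGER NOT NULL,
    new_balance INTEGER NOT NULL,
    changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Cross-cutting concern (auditing) implemented once, below the applications.
CREATE TRIGGER audit_balance AFTER UPDATE OF balance ON accounts
BEGIN
    INSERT INTO account_audit (account_id, old_balance, new_balance)
    VALUES (OLD.id, OLD.balance, NEW.balance);
END;
""")

conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
conn.execute("UPDATE accounts SET balance = 40 WHERE id = 1")   # audited

try:
    conn.execute("UPDATE accounts SET balance = -5 WHERE id = 1")
except sqlite3.IntegrityError as e:
    print("rejected by the database, not by any one application:", e)
```

A stored procedure guarding the writes would achieve the same end with finer control over which operations are allowed, as discussed in the comments below.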

Gili
  • 1
    Stored procs can be as subject to SQL injection as prepared statements, but that is not the only security need that can be handled by stored procs. When you use prepared statements, you have to grant access at the table/view level. When you use stored procs, access can be granted at the stored proc level only, and thus no one can do anything that is not in the proc. If you combine that with not using dynamic SQL, then it is more secure. In a financial system, for instance, it would be a necessary internal control to not allow anyone except the admins to do anything to the database not defined by a proc. – HLGEM Oct 01 '14 at 13:23
  • @HLGEM, your point is already covered by the Security section, bullet #2. According to the link at the end of that bullet, you should be able to express almost the same thing using views. (Granted, you can't protect from buggy code setting columns to the wrong value but the same could be true for stored procedures) – Gili Oct 01 '14 at 14:15
  • No, you absolutely cannot do the same thing with views. Yes, you can limit which specific fields can be updated through views, but with stored procedures you can limit the particular actions a user can take in a much more detailed way. You can make them perform multiple tasks, for instance, in a particular order on different tables, which you can't do by managing security through views. In a financial system, if you do not use stored procs your system is wide open for fraud. – HLGEM Oct 01 '14 at 14:33
  • 2
    @HLGEM, you are right that asserting permissions over stored procedures instead of table columns gives you better control of what values get written into columns, and in what order; but for all intents and purposes the difference is meaningless. An attacker who controls your webapp layer has as much ability to commit fraud as an attacker who controls your database layer. Instead of directly adding/removing funds from a particular account, he can simply transfer everyone else's money into his account. The ship is sunk either way. – Gili Oct 01 '14 at 15:17
  • That's a great answer to your own question. It really boils down to these various pros and cons of each approach leading to the fact that there is not a clear and obvious best way to design applications with respect to where logic should reside. – Lukas Eder Oct 01 '14 at 17:43
  • 1
    Instead of putting logic into a database you can create (web) services which get called by the applications. If your database contains lots of code, that code should also be in source control. For such maintainability reasons I don't like stored procedures at all. – beginner_ Oct 02 '14 at 10:29
  • "If the database needs to support multiple applications," => then you implement an application providing the functional API over the database, and this application will have sole access to the database :) – Matthieu M. Oct 02 '14 at 17:20
  • 1
    @MatthieuM. Keep in mind that the applications on top are written in different programming languages and run on different platforms. You could expose a REST API, to be sure, but writing and using one is usually more work than direct database access. That's not to say you shouldn't do it, but it's not a clear winner over stored procedures. – Gili Oct 02 '14 at 17:45
  • 1
    @beginner_, database schema and stored procedures are easily version-controlled, so that's not a valid reason. See http://flywaydb.org/ for example (and there are many more). As I explained to MatthieuM above, a REST API is great but it incurs a lot of extra work. – Gili Oct 02 '14 at 17:49
  • 1
    @MatthieuM. and you may have requirements that don't allow that. And you may not even own or be allowed to change all the applications. You may need to use 3rd party applications which don't allow that. – HLGEM Oct 02 '14 at 19:28
  • @HLGEM: If you are not allowed to change some applications, then the discussion about best practices/recommendations is moot. *Obviously* such a discussion is only worthwhile if you can actually make the decision in the end (or present arguments to someone who can). – Matthieu M. Oct 03 '14 at 06:19
  • @Gili: Well, there is indeed a development cost; whether it is worthwhile or not depends on how much it costs to spin up a new service in your architecture (where I work, it is so simple that it is actually less expensive than delivering a new library). However, you then benefit from all the advantages you listed... and perhaps one more. On performance, you have missed the fact that databases today are "central", and taking workload off a central point toward distributed application servers allows scaling more easily (to be balanced against lugging MBs over the network). – Matthieu M. Oct 03 '14 at 06:26
  • @MatthieuM., Maybe I'm wrong, but this sounds like a strawman argument. While I agree that moving computationally expensive operations out of the database will improve scalability, in my experience stored procedures are I/O bound, not computationally-bound. In such a case, moving them out of the database will not improve scalability. – Gili Oct 03 '14 at 07:15
  • @Gili: actually, it's a real-life anecdote, not a strawman argument; note, however, that I did not specify CPU-bound: high memory consumption can also be an issue (when multiplied by the number of clients). – Matthieu M. Oct 03 '14 at 07:30
  • @MatthieuM. Apologies. So what do you propose moving from the database to the application, and how will it reduce memory usage? (I ask because I can't see how moving the queries themselves would help, so you must be moving something else) – Gili Oct 03 '14 at 16:16
  • @Gili: actually, moving the queries helps in that while a procedure/query is running it is consuming memory that is only released once it completes. For example, imagine having to do action X (taking some time) on N elements: the cursor/query generating the N elements can be freed once those N elements are handed over to another process, and said process can then perform X on one item at a time. With a stored procedure (or just a single big query), the cursor may instead occupy a lot of memory the whole time. – Matthieu M. Oct 03 '14 at 16:38
  • @MatthieuM. http://dba.stackexchange.com/a/78179/4719 explains that breaking up one query into multiple queries results in a performance loss of 10x to 100x (see the comments). There is also nothing preventing databases from releasing memory as they process elements because their execution plan knows they'll never need to navigate backwards. Honestly, I'd be surprised if they aren't already doing this. If you believe this really works, please back up your claims with concrete numbers (run benchmarks). – Gili Oct 03 '14 at 16:55
  • @Gili: It certainly made things slower though I do not think it was that much (remember the "lugging MBs on the network" comment); and I don't have numbers any longer nor am I inclined to investigate and re-create the situation (honestly, I have better things to do). It's a compromise, things got slower but we reduced the footprint. – Matthieu M. Oct 03 '14 at 17:19
6

Scalability isn't the only issue in making this choice (and I do know of successful databases with trillions of records, so databases are more scalable than you might think). Nor is it probably the most important one.

First you need to look at the meaning of the data. Something like Facebook is intrinsically tied to its application, so putting the logic there isn't as risky as it would be for a business application that gets its data from imports, database jobs, and user data entry from several different applications, including some that the business may not control. The risk to data integrity is therefore the first and most important thing to consider when making this choice.

Also, how the data will be used and in what environment is critical. How is the information audited? Are there regulatory requirements? How is the reporting done? Do I need to be able to reuse the logic in a different reporting system as well as in the user application? If so, the logic needs to live outside of the application. Do I need to export the data, and how is that affected by the logic being in the application? Does that mean the people writing the reporting and export code will not be able to see what the logic is, because they don't have access to the application code? That can be a major problem.

Another consideration is scalability, which includes performance. How much scalability do you need? Very few things need the scalability of a Facebook. How much performance do you need? Designing for scalability you will never need leads to less-than-optimal results. And will the methods you use to put logic in the application have a negative impact on database performance? Many ORMs, for instance, write terrible database code.
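
To illustrate that last point, here is a minimal sketch of the "N+1" access pattern that lazy-loading ORMs often generate (tables and data are invented; plain sqlite3 stands in for the SQL the ORM would emit behind the scenes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"c{i}") for i in range(1, 101)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, (i % 100) + 1, i * 10) for i in range(1, 1001)])

# "N+1" pattern, typical of lazy-loading ORM code: one query for the orders,
# then one more query per order to fetch its customer.
orders = conn.execute("SELECT id, customer_id, total FROM orders").fetchall()
report = []
for order_id, customer_id, total in orders:
    name = conn.execute("SELECT name FROM customers WHERE id = ?",
                        (customer_id,)).fetchone()[0]
    report.append((order_id, name, total))

# Set-based version: the database does the matching in a single statement.
report = conn.execute("""
    SELECT o.id, c.name, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchall()
```

The first version issues 1,001 queries; the second issues one and lets the database do the matching.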

Then there is the argument about less time for development, which is ridiculous. If you know what you are doing, putting the logic in the database does not take more time than putting it in the application. It is just that most application developers aren't SQL specialists. But is saving the devs the time it would take to get better at SQL really a plus? No, it is not, because that choice almost always comes at a cost to database performance and to data integrity.

What I am trying to say is that there is no one-size-fits-all answer. There are some applications where putting the logic in the application makes sense and some where it does not, but thinking there are only one or two critical factors to consider when making the choice is generally a mistake.

HLGEM
  • 2
    Very good points about ensuring data integrity if multiple applications access the same database, and implementing cross-cutting concerns like auditing and reporting between the application and database. Regarding the ease of development, I think the main point is that stored procedures are harder to read, write, debug and test than modern programming languages/ecosystems. That last point about debugging/testing is a big one. And yes, granted, most programmers are simply too lazy to learn SQL (the rise of ORMs is proof of this) and everyone suffers as a result. – Gili Sep 30 '14 at 20:37
2

Well, there's the practical answer, and then there's this one.

The question is titled "Where should the application logic run?" But if you think about it, this conflates two things that ought to be separable:

  • How should the application logic be coded?
  • Where should the application logic be run?

One of the big, not generally recognized weaknesses of present-day database technology is that making a choice for either of these questions usually forces your hand on the other one:

  • If you choose to run application logic on the database, you're forced either to use the database's procedural language (eeek, PL/SQL) or to rely on really cumbersome support for stored procedures written in third-party languages.
  • If you choose to code application logic in a good programming language, then you lose the data integrity advantages unless you duplicate effort and code some of the same logic at the database as well.

But ideally, I should be able to code all my application logic once, in a powerful programming language of my choice, and whether the logic runs on the application or the database should be a late-bound decision—one that I choose when I deploy the application, or even as late as runtime.

In fact, ideally I should be able to run the same logic simultaneously in the browser, the application server and the database. Think of validation rules in a forms-driven web application. Ideally you want the validation logic to run at all three tiers:

  1. The browser should run the validation logic, because it can provide very low latency feedback to the user when they're filling the form wrong. Also, by preventing the user from submitting invalid forms, it reduces load on the lower tiers.
  2. The application server should run the validation logic, because it cannot trust the browser to do so—a hacker can bypass the browser's logic. Also, by catching invalid forms before they're sent to the database, it reduces the load on that critical shared resource.
  3. The database should run the validation logic, because it's responsible for enforcing data integrity for multiple users of the data.

So really you'd want to code that logic once, and have tools that apply it in all three places. But precious little technology exists to achieve this.
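
Little off-the-shelf tooling exists for this, but as a rough sketch of the idea (field names and rules are invented), the validation rules can at least be declared once as data and then rendered into each tier. Here they are rendered as application-side checks and as database CHECK constraints; a third renderer could emit the browser-side JavaScript:

```python
import sqlite3

# Single declaration of the rules (invented example fields).
RULES = {
    "age":      {"type": int, "min": 0, "max": 150},
    "username": {"type": str, "min_len": 3, "max_len": 30},
}

def validate(record):
    """Application-tier rendering of RULES."""
    for field, rule in RULES.items():
        value = record[field]
        if not isinstance(value, rule["type"]):
            raise ValueError(f"{field}: wrong type")
        if "min" in rule and value < rule["min"]:
            raise ValueError(f"{field}: below minimum")
        if "max" in rule and value > rule["max"]:
            raise ValueError(f"{field}: above maximum")
        if "min_len" in rule and len(value) < rule["min_len"]:
            raise ValueError(f"{field}: too short")
        if "max_len" in rule and len(value) > rule["max_len"]:
            raise ValueError(f"{field}: too long")

def ddl_checks():
    """Database-tier rendering of the same RULES as CHECK constraints."""
    checks = []
    for field, rule in RULES.items():
        if "min" in rule:
            checks.append(f"CHECK ({field} >= {rule['min']})")
        if "max" in rule:
            checks.append(f"CHECK ({field} <= {rule['max']})")
        if "min_len" in rule:
            checks.append(f"CHECK (length({field}) >= {rule['min_len']})")
        if "max_len" in rule:
            checks.append(f"CHECK (length({field}) <= {rule['max_len']})")
    return ",\n    ".join(checks)

conn = sqlite3.connect(":memory:")
conn.execute(f"""
    CREATE TABLE users (
        username TEXT NOT NULL,
        age      INTEGER NOT NULL,
        {ddl_checks()}
    )""")

validate({"username": "alice", "age": 30})                                       # app tier
conn.execute("INSERT INTO users (username, age) VALUES (?, ?)", ("alice", 30))   # db tier
```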

sacundim
  • That is a **very** good point. I feel this pain on a daily basis when implementing REST APIs. As you said, there is a good reason for implementing the same validation logic at different levels and there doesn't seem to be a way to avoid the duplication. (That said, I encourage you to move this to a new question because the answer doesn't actually address this question) – Gili Oct 02 '14 at 17:56
1

I tend to recommend / prefer:

  • Logging

    • App level for application logic ("delete user", "change password", "add customer").
    • For compliance / audit, apply it directly at the database level (since this will catch any other, indirect means of access, such as DBAs).
  • Security / Permissions

    • Security should be implemented within the application.
    • Within the RDBMS, security can be implemented a dozen different ways (SQL admins, Local Admins who are often also SQL admins (not that they should be), DB owner, DB roles, object rights)... this can easily lead to confusion ("who has rights to X?", "how does Jane Doe have permission to Y?")... avoid it wherever possible. Also, rights granted directly on the DB can be dangerous, since there are so many tools for accessing databases (Excel, Access, etc.)... it is too easy for users to hit the DB directly and do things that bypass the app's process.
    • Let's face it, most apps DON'T use integrated authentication (NTLM / Kerberos). Web apps require Kerberos (delegation) for integrated authentication to reach the database, which is annoying and confusing to many; plus it means that permissions within the DB can cause problems for the app (if, for example, the app doesn't know how to handle a specific call failing because the user isn't in the correct group)... plus, there are too many ways to apply permissions directly in SQL to enforce good consistency... PLUS, apps aren't going to scan the DB for such misconfigurations ("gee, this user can run things because they're SA, when in fact they SHOULD be a member of this DB role").
    • (MSSQL) Application roles are rarely used, but SHOULD BE (looking at the devs here). Great way to enforce app permissions within SQL.
  • Queries

    • Simple queries (SELECT *, SELECT by ID) can be handled by ORMs writing dynamic SQL (see the sketch after this list).
    • Complex queries tend to be better written by hand, so write a sproc and map it within the ORM.
  • INS/UPD/DEL

    • no specific preference.
    • on one hand, passing an app object to a sproc makes the INS/UPD transaction across tables easier
    • on the other hand, scaling out may mean that some tables are moved to other servers, so the transactions aren't contained to a single server
    • also, transaction locking requires tuning to ensure that locks won't become performance impediments. The alternative "NoSQL" / "eventual consistency" model is high performance, but involves more code and more sysadmin effort (since there are usually more systems involved to track and propagate the data between the databases / cache servers).
  • SDL (schema definition)

    • should only run during deployment... presumably an installation / upgrade process runs the SDL code
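
A minimal sketch of the Queries bullet above (tables are invented, and plain sqlite3 stands in for both the ORM and a server that supports stored procedures): trivial lookups use generated SQL, while the complex report is hand-written and mapped to a single function, the way a sproc would be mapped in an ORM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE sales (id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER, sold_on TEXT);
""")

# Simple, generated query: the kind of dynamic SQL an ORM emits for you.
def find_by_id(table, row_id):
    # The table name comes from application code, never from user input;
    # values are always bound as parameters.
    return conn.execute(f"SELECT * FROM {table} WHERE id = ?", (row_id,)).fetchone()

# Complex query, written by hand and mapped to one function; on a server
# with stored procedures this body would live in the proc instead.
def top_sellers(since, limit=10):
    return conn.execute("""
        SELECT p.name, SUM(s.qty) AS units, SUM(s.qty * p.price) AS revenue
        FROM sales s
        JOIN products p ON p.id = s.product_id
        WHERE s.sold_on >= ?
        GROUP BY p.id
        ORDER BY revenue DESC
        LIMIT ?
    """, (since, limit)).fetchall()

find_by_id("products", 1)
top_sellers("2014-01-01")
```
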
  • how does this answer the question asked, "What kind of people-scaling overhead can one expect in Scenario 1? What kind of database-scaling overhead can one expect in Scenario 2?" See [answer] – gnat Oct 02 '14 at 14:44
  • I agree that the Hybrid approach may be the best one. – HLGEM Oct 02 '14 at 19:29