161

Recently I read a lot about noSQL DBMSs. I understand CAP theorem, ACID rules, BASE rules and the basic theory. But didn't find any resources on why is noSQL scalable more easily than RDBMS (e.g. in case of a system that requires lots of DB servers)?

I guess that keeping constraints and foreign keys cost resources and when a DBMS is distributed, it is a lot more complicated. But I expect there's a lot more than this.

Can someone please explain how noSQL/SQL affects scalability?

gnat
  • 21,442
  • 29
  • 112
  • 288
ducin
  • 1,729
  • 3
  • 11
  • 7
  • 8
    "I guess that keeping constraints and foreign keys cost resources and when a DBMS is distributed, it is a lot more complicated. But I expect there's a lot more than this." - Actually, that's it. More accurately, that's the one common characteristic that makes most NoSQL solutions more scalable than their SQL cousins (for certain data models). But NoSQL is an extremely vague term, different families of NoSQL databases have different characteristics that make them more scalable. – yannis Apr 08 '13 at 21:49
  • 9
    Of course SQL databases scale perfectly well into trillions of records, they just need some expertise to design and set them up that application developers don't have. And generally a fairly expensive set of hardware and licenses. – HLGEM Apr 08 '13 at 22:19
  • possible duplicate of [When would someone use MongoDB (or similar) over traditional RDMS?](http://programmers.stackexchange.com/questions/54373/when-would-someone-use-mongodb-or-similar-over-traditional-rdms) and of [Are NoSQL databases going to take the place of relational databases? Is SQL going away?](http://programmers.stackexchange.com/questions/5354/are-nosql-databases-going-to-take-the-place-of-relational-databases-is-sql-goin) – gnat Apr 09 '13 at 06:21
  • 8
    In my opinion this question is not a duplicate of either of those. The mongodb question is (besides a bad title making it seem more specific) asking something else which is in fact more general. Voted to reopen. – Joeri Sebrechts Apr 15 '13 at 14:31

4 Answers4

251

It's not about NoSQL vs SQL, it's about BASE vs ACID.

Scalable has to be broken down into its constituents:

  • Read scaling = handle higher volumes of read operations
  • Write scaling = handle higher volumes of write operations

ACID-compliant databases (like traditional RDBMS's) can scale reads. They are not inherently less efficient than NoSQL databases because the (possible) performance bottlenecks are introduced by things NoSQL (sometimes) lacks (like joins and where restrictions) which you can opt not to use. Clustered SQL RDBMS's can scale reads by introducing additional nodes in the cluster. There are constraints to how far read operations can be scaled, but these are imposed by the difficulty of scaling up writes as you introduce more nodes into the cluster.

Write scaling is where things get hairy. There are various constraints imposed by the ACID principle which you do not see in eventually-consistent (BASE) architectures:

  • Atomicity means that transactions must complete or fail as a whole, so a lot of bookkeeping must be done behind the scenes to guarantee this.
  • Consistency constraints mean that all nodes in the cluster must be identical. If you write to one node, this write must be copied to all other nodes before returning a response to the client. This makes a traditional RDBMS cluster hard to scale.
  • Durability constraints mean that in order to never lose a write you must ensure that before a response is returned to the client, the write has been flushed to disk.

To scale up write operations or the number of nodes in a cluster beyond a certain point you have to be able to relax some of the ACID requirements:

  • Dropping Atomicity lets you shorten the duration for which tables (sets of data) are locked. Example: MongoDB, CouchDB.
  • Dropping Consistency lets you scale up writes across cluster nodes. Examples: riak, cassandra.
  • Dropping Durability lets you respond to write commands without flushing to disk. Examples: memcache, redis.

NoSQL databases typically follow the BASE model instead of the ACID model. They give up the A, C and/or D requirements, and in return they improve scalability. Some, like Cassandra, let you opt into ACID's guarantees when you need them. However, not all NoSQL databases are more scalable all the time.

The SQL API lacks a mechanism to describe queries where ACID's requirements are relaxed. This is why the BASE databases are all NoSQL.

Personal note: one final point I'd like to make is that most cases where NoSQL is currently being used to improve performance, a solution would be possible on a proper RDBMS by using a correctly normalized schema with proper indexes. As proven by this very site (powered by MS SQL Server) RDBMS's can scale to high workloads, if you use them appropriately. People who don't understand how to optimize RDBMS's should stay away from NoSQL, because they don't understand what risks they are taking with their data.

Update (2019-09-17):

The landscape of databases has evolved since posting this answer. While there is still the dichotomy between the RDBMS ACID world and the NoSQL BASE world, the line has become fuzzier. The NoSQL databases have been adding features from the RDBMS world like SQL API's and transaction support. There are now even databases which promise SQL, ACID and write scaling, like Google Cloud Spanner, YugabyteDB or CockroachDB. Typically the devil is in the details, but for most purposes these are "ACID enough". For a deeper dive into database technology and how it has evolved you can take a look at this slide deck (the slide notes have the accompanying explanation).

Joeri Sebrechts
  • 12,922
  • 3
  • 29
  • 39
  • While I agree that *some* NoSQL stores replace ACID with BASE, that still isn't a common feature for *all* stores that fall under the NoSQL "category", which is an ill-defined in the first place. After a while, the interpretation of the term switched from "No SQL" to "Not Only SQL", but as many such databases still do JOINs or have started implementing SQLesque dialects, Mark Madsen has re-coined the term to mean something else in his [history of databases in no-tation](http://www.nosql-vs-sql.com/#no-tation): *"No, SQL"* ;-) – Lukas Eder Dec 15 '13 at 08:41
  • 3
    To avoid joins, we will have de-normalized data in NoSQL leading to repetition and more storage. But then same can be achieved in RDBMS if we are OK with de-normalization. So "Joins" or "no Joins" is depending on the DBA, and not on database type. Correct ? – Kaushik Lele May 31 '15 at 07:12
  • mysql doesn't count? are you kidding? tell that to youtube, wikipedia and about half of top 100 website of the world – dynamic Jul 02 '15 at 09:43
  • 2
    @dynamic Those sites either use heavy caching, or they shard. Those designs place the complexity of scaling the data outside of the db. You might as well use nosql in such a case, because that's exactly the trade-off nosql makes. – Joeri Sebrechts Jul 02 '15 at 18:35
  • 1
    "SQL API lacks a mechanism to describe queries where ACID's requirements are relaxed". Technically true, but SQL server has taken a timid step in that direction. SQL 2014 introduces Delayed Durability, relaxing the D in ACID, in exchange for reducing write log pressure. – EBarr Jul 27 '15 at 00:56
  • 3
    This should be the accepted answer imo. It's very clear with examples but manages to remain concise. – Olshansky Jun 03 '17 at 23:36
  • Thx for this overview! Do I understand correctly any scalable solution (which implies distributed) *must* (mathematically) give up C or A from CAP already; and in addition, for performance reasons, it also *chooses* to give up anywhere from 0 to 3 letters from ACID (anything except I, since that one isn't costly I assume). (And of course the C in CAP and ACID are different.) – max Jun 23 '17 at 23:09
  • The way CAP works out: anything that is distributed (P) must respond to network partitions (part of the cluster cannot reach the other) by either becoming unavailable (no A), or by becoming inconsistent (no C). You can draw it out on paper: just draw two connected nodes, then break the connection, and reason about what happens with a row update on one side followed by a read of that same row on the other side. Btw, I is costly even without the network, so if you dig deep enough you'll find almost every database does not perfectly isolate transactions in some cases. – Joeri Sebrechts Jun 26 '17 at 07:19
  • I'm wondering why this was not the accepted answer! It clearly answers the question in the first 4 lines. – Kasper Feb 02 '20 at 14:28
  • It seems that you think consistency in ACID as consistency in CAP(roughly). However, consistency in ACID doesn't mean "Consistency constraints mean that all nodes in the cluster must be identical". What you are talking about is more or less similar to C in CAP(roughly) – guo Mar 28 '20 at 03:20
  • @guo feel free to propose an edit – Joeri Sebrechts Mar 30 '20 at 06:36
  • This post and the deck hands down the best post on why noSQL! Thank you! – Manish Jain Oct 16 '20 at 23:28
  • Some like Fauna (multi-region distributed & scalable) which some would call "NoSQL" were built to be strongly consistent.The line has faded completely if you ask me: https://fauna.com/blog/a-comparison-of-scalable-database-isolation-levels. Disclaimer: I work for them as a developer advocate, didn't intend to 'market' here, it did however pain me to see "ACID enough" there. If you realize that some of those actually provide stronger consistency by default and a stronger max consistency level than traditional databases (https://fauna.com/blog/introduction-to-transaction-isolation-levels) – Brecht De Rooms Feb 01 '21 at 17:08
  • 1
    @BrechtDeRooms well, again, the devil is in the details. I said the holy grail was "SQL, ACID and write scaling" and fauna does not seem to offer SQL. Also, I've just read through [Aphyr's report](https://jepsen.io/analyses/faunadb-2.5.4) on Fauna, and while it is an impressive system, the strict serializability aspect does seem to come with a bunch of asterisks. You're definitely right that the traditional RDBMS's often are a lot less consistent than people think they are. – Joeri Sebrechts Feb 02 '21 at 11:12
  • Very good points thanks. There is actually a hackathon this week to generate FQL from SQL. Maybe one day it becomes the holy grail :) – Brecht De Rooms Feb 06 '21 at 18:01
  • @JoeriSebrechts it is possible that any node can provide stale value in worst case while doing horizontal scaling ? if yes then does SQL provides consistency ? – Rishabh Agrawal May 29 '21 at 19:17
  • @RishabhAgrawal Yes, if you're reading from a different node than the original write went to, stale values are possible. If you mean "do SQL RDMBS's provide consistency across servers?", then typically if what they offer is master/slave replication then they don't provide consistency. Clustered RDMBS's that at least try to do this do exist, but the consistency story they bring is tricky. See for example [Aphyr's analysis of MariaDB Galera Cluster](https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster). – Joeri Sebrechts Jun 03 '21 at 15:40
  • I believe consistency in ACID is not similar to the consistency you have mentioned. In ACID, Consistency refers to maintaining data integrity constraints. – Freeze Francis Aug 28 '22 at 16:30
  • @FReezeFRancis You are right, this is an oversimplification. Consistency in distributed operations generally means that the system's observers see a consistent history of operations (e.g. a value that was observed as written can be read), which in a traditional RDBMS generally means syncing writes as they happen. Jepsen has the most accurate explanation of consistency in distributed systems: http://jepsen.io/consistency – Joeri Sebrechts Sep 05 '22 at 11:01
  • @JoeriSebrechts very good points, can you please share if there's any video recording of the talk? Tried searching on meetup but couldn't find one. – Prashant Shubham Oct 21 '22 at 14:33
  • @PrashantShubham sorry, no recording – Joeri Sebrechts Oct 23 '22 at 11:31
119

noSQL databases give up a massive amount of functionality that a SQL database gives you by it's very nature.

Things like automatic enforcement of referential integrity, transactions, etc. These are all things that are very handy to have for some problems, and which require some interesting techniques to scale outside of a single server (think about what happens if you need to lock two tables for an atomic transaction, and they are on different servers!).

noSQL databases don't have all that. If you need that stuff, you need to do it yourself, but if you DON'T need it (and there are a lot of applications that don't), then boy howdy are you in luck. The DB doesn't have to do all of these complex operations and locking across much of the dataset, so it's really easy to partition the thing across many servers/disks/whatever and have it work really fast.

Michael Kohne
  • 10,038
  • 1
  • 36
  • 45
  • 7
    Didn't know it was that simple – Honinbo Shusaku Apr 06 '17 at 16:29
  • 12
    this accepted answer is totally failing to mention the NoSQL sharding capability which is missing from SQL. Sharding is what makes NoSQL horizontally scalable. – hyankov Nov 11 '17 at 12:10
  • 11
    @HristoYankov And it works because the NoSQL system doesn't do all the things that don't play nicely with sharding. – user253751 Jan 17 '19 at 22:33
  • 2
    @HristoYankov: SQL database can be horizontal sharded, and not all NoSQL databases can be horizontally sharded easily. Sharding isn't really the reason why you want to use NoSQL. – Lie Ryan Sep 17 '19 at 08:30
  • 3
    @HristoYankov The accepted answer goes one level deeper than your note of "totally failing to mention the NoSQL sharding capability which is missing from SQL". The accepted answer, rightfully, talks about WHY horizontal sharding is more difficult with SQL databases. In fact, I spent a good 20 min searching for the answer to this and pretty much everyone just rolls out the "ohh NoSQL shards better", without mentioning any reason whatsoever. Totally useless response. The accepted responses here answers the question perfectly - albeit very briefly. Would be nice to have more reasons listed also. – Phoeniyx Sep 26 '19 at 20:28
  • 1
    This answer was great at the time it was answered but is probably outdated in the sense that these interesting techniques do exist nowadays. NoSQL abandons typically one of relations/transactions/schema/integrity in some degree. However, there are nowadays scalable databases ("NoSQL" & NewSQL) that provide transactions/relations and some actually provide strong transactions than conventional databases (https://fauna.com/blog/a-comparison-of-scalable-database-isolation-levels, Fauna/Spanner). We should just forget the term NoSQL, it was never well-defined has even less meaning today. – Brecht De Rooms Feb 01 '21 at 17:06
8

It's true that NoSQL databases (MongoDB, Redis, Riak, Memcached, etc.) don't maintain foreign key constraints, and atomic operations must be more explicitly specified. It's also true that SQL databases (SQL Server, Oracle, PostgreSQL, etc.) can be scaled to handle very large performance requirements by seasoned DBAs.

NoSQL databases allow seasoned programmers, who are well aware of race-conditions and atomic operations, to forego a large amount of processing only required in a small percentage of today's web application code. NoSQL databases certainly have atomic operations and most all transactional requirements present in SQL databases can also be obtained NoSQL databases. The difference is the level of abstraction. NoSQL databases remove the higher levels of abstraction and hand that capability to the application programmer, thereby resulting is faster code overall with the increased probability of data corruption by unseasoned programmers.

As a result we are much more likely to see NoSQL databases being used more and more heavily in the web application space, where development time and performance are very important. Financial and corporate software is likely to retain it's SQL heritage because hardware performance is relatively cheap, they have seasoned DBAs on-hand, and the increased risk caused by unseasoned programmers is not palatable.

  • 2
    I'm not sure I agree with the part about atomic transactions, in the ACID sense (although it's difficult to comment on "NoSQL", since it's up for debate what exactly we mean). Most of the performance gains in "typical" NoSQL DBs are achieved through loosening of consistency guarantees (see: [eventual consistency](http://en.wikipedia.org/wiki/Eventual_consistency), ACID vs. BASE). If eventual consistency is good enough for an application (and it often is), then this allows for much more efficient horizontal scaling. – Daniel B Apr 09 '13 at 06:19
4

From IBM developerWorks: Supply cloud-level data scalability with NoSQL databases

Scalability is the system that should be able to support very large databases with very high request rates at very low latency.

NoSQL systems have a number of design features in common:

  • The ability to horizontally scale out throughput over many servers.
  • A simple call level interface or protocol (in contrast to a SQL binding).
  • Support for weaker consistency models than the ACID transactions in most traditional RDBMS.
  • Efficient use of distributed indexes and RAM for data storage.
  • The ability to dynamically define new attributes or data schema.

Why relational databases may not be optimal for Scaling

In general, relational database management systems have been considered as a "one-size-fits-all solution for data persistence and retrieval" for decades. They have matured after extensive research and development efforts and very successfully created a large market and solutions in different business domains.

The ever-increasing need for scalability and new application requirements have created new challenges for traditional RDBMS, including some dissatisfaction with this one-size-fits-all approach in some web-scale applications. The answer to this has been a new generation of low-cost, high-performance database software designed to challenge dominance of relational database management systems. A big reason for the NoSQL movement is that different implementations of web, enterprise, and cloud computing applications have different requirements of their databases — not every application requires rigid data consistency, for example.

Another example: For high-volume websites like eBay, Amazon, Twitter, or Facebook, scalability and high availability are essential requirements that cannot be compromised. For these applications, even the slightest outage can have significant financial consequences and impacts customer trust.

Over on DBA.SE: What does horizontal scaling mean?

Horizontal Scaling is essentially building out instead of up. You don't go and buy a bigger beefier server and move all of your load onto it, instead you buy 1+ additional servers and distribute your load across them.

Horizontal scaling is used when you have the ability to run multiple instances on servers simultaneously. Typically it is much harder to go from 1 server to 2 servers then it is to go from 2 to 5, 10, 50, etc.

Once you've addressed the issues of running parallel instances, you can take great advantage of environments like Amazon EC2, Rackspace's Cloud Service, GoGrid, etc as you can bring instances up and down based on demand, reducing the need to pay for server power you aren't using just to cover those peak loads.

Relational Databases are one of the more difficult items to run full read/write in parallel.

Md Mahbubur Rahman
  • 4,747
  • 5
  • 32
  • 38