
My understanding is that the main feature Cassandra has to offer is linear performance at any scale, meaning that if I know 1 C* node can handle 500 queries or commands per second from my app, then I can also rest assured that 100 C* nodes in the same cluster will be able to handle 500 * 100 = 50K queries or commands per second.

My understanding of the main tradeoff between RDBMS and NoSQL is that NoSQL systems tend to favor scalability, but doing so (mechanically, due to how they are implemented) requires them to relax transactability. Hence NoSQL systems like C* typically scale extremely well but can't offer the classic ACID transactions that RDBMS systems like MySQL can.

My understanding is that since scalability and transactability are mutually exclusive, there aren't any magical NoSQL databases out there that offer C*-like scalability (again: linear performance at any scale) and also ship Java clients that implement JTA (commit + rollback capabilities).

These are my assumptions heading into this question: if I'm wrong or misguided about any of them, please begin by correcting me!


Assuming I'm more or less correct in those assumptions, what does one do when both scalability and ACID transactions are actually needed? My only idea here would be to implement the following:

  1. Configure the app to write to an RDBMS (like MySQL) using a JDBC/JTA-compliant driver (i.e. using transactions)
  2. Somehow, configure the RDBMS to replicate (either in real time or with very low latency) to a highly scalable DB (like Cassandra). This might be a configuration option that the RDBMS itself offers, or, much more likely, code you need to write yourself to continuously ETL out of one system and into another (a rough sketch of such a loop follows the list). The idea here is that the app would then read from the NoSQL table, and still have very performant reads against a massive amount of data.
  3. Somehow configure the RDBMS tables with TTLs so that the tables don't grow extremely large and start requiring sharding and other tricks that can slow transactions down. Again, this might be a configuration option that the RDBMS itself offers, or, more likely, code you need to write yourself.
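
To make steps 2 and 3 concrete, here is a rough sketch of the kind of ETL loop I have in mind. Every table, column, and connection string below is made up for illustration; MySQL's JDBC driver and the DataStax Java driver 4.x are assumed.

```java
// Rough sketch of steps 2 and 3: poll recently-changed rows out of MySQL,
// copy them into Cassandra, and prune old rows from MySQL so its tables
// stay small. All table/column names here are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;

import com.datastax.oss.driver.api.core.CqlSession;

public class MySqlToCassandraEtl {
    public static void main(String[] args) throws Exception {
        try (Connection mysql = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/app", "user", "pass");
             CqlSession cassandra = CqlSession.builder().build()) {

            Instant watermark = Instant.EPOCH; // persist this between runs in real code

            while (true) {
                // Step 2: copy everything that changed since the last pass
                PreparedStatement select = mysql.prepareStatement(
                        "SELECT id, payload, updated_at FROM events WHERE updated_at > ?");
                select.setTimestamp(1, Timestamp.from(watermark));
                try (ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        cassandra.execute(
                                "INSERT INTO app.events (id, payload) VALUES (?, ?)",
                                rs.getLong("id"), rs.getString("payload"));
                        watermark = rs.getTimestamp("updated_at").toInstant();
                    }
                }

                // Step 3: emulate a TTL on the MySQL side with a periodic delete
                mysql.prepareStatement(
                        "DELETE FROM events WHERE updated_at < NOW() - INTERVAL 10 MINUTE")
                     .executeUpdate();

                Thread.sleep(1000); // poll roughly once per second
            }
        }
    }
}
```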

Are there any better known solutions here? Any pitfalls/caveats to this approach?

  • Do you write to the DB with no regard for past transactions? And how would the RDBMS enforce something like a foreign key if the DB doesn't have the info? – Nelson Jul 07 '18 at 03:29
  • Thanks @Nelson (+1), this is a good point to be conscious of when configuring the TTLs. In my particular case the data being written has temporal relevancy, but only within a scrolling window of a few minutes. So I'd get into trouble if I set the TTLs to less than a few minutes, but I should be fine beyond that. But thank you for pointing that out! – hotmeatballsoup Jul 07 '18 at 14:47

1 Answer


Your assumptions are not so much wrong as incomplete. The trade-off between scalability (tolerating network partitions, the P in CAP) and transactability (consistency, the C in CAP) does not pose itself at the level of a system but at the level of an individual operation or transaction. That is to say, a transaction can be consistent or it can leverage horizontal scale, but not both. Databases are, however, free to provide both mechanisms. For example, Cassandra supports lightweight transactions and majority (QUORUM) reads/writes, which give you mechanisms to ensure consistency for the operations that need it.
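
As a minimal sketch of both mechanisms (the DataStax Java driver 4.x is assumed, and the app.accounts table and its columns are hypothetical):

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class CassandraConsistencyDemo {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Lightweight transaction: a compare-and-set on a single row,
            // coordinated via Paxos, so it is linearizable but noticeably slower
            ResultSet cas = session.execute(
                    "UPDATE app.accounts SET balance = 40 WHERE id = 1 IF balance = 50");
            System.out.println("applied? " + cas.wasApplied());

            // Majority read: a QUORUM read is guaranteed to see any write that was
            // itself acknowledged at QUORUM, at the cost of extra latency
            SimpleStatement read = SimpleStatement
                    .newInstance("SELECT balance FROM app.accounts WHERE id = 1")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM);
            System.out.println("balance = " + session.execute(read).one().getInt("balance"));
        }
    }
}
```

The point is that you pay the consistency cost per operation, not for the database as a whole.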

The picture becomes even more complicated because these are technical properties, but they don't behave exactly the way you would intuitively expect. For example, Google's Cloud Spanner delivers scalability and strong consistency from the perspective of the database user, even though internally it is eventually consistent. Sharding is a way to cheat, with the system as a whole leveraging horizontal scale but all of the data within one shard being strongly consistent.
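
To illustrate the sharding idea with a toy example (everything here is hypothetical): route each customer to exactly one RDBMS instance, so transactions that touch only that customer stay fully ACID on a single server, while the fleet as a whole scales horizontally.

```java
// Toy shard router: each customer key maps to one of N strongly-consistent
// RDBMS instances. Connection URLs and the keying scheme are made up.
import java.util.List;

public class ShardRouter {
    private final List<String> jdbcUrls; // one URL per shard, e.g. jdbc:mysql://shard0/app, ...

    public ShardRouter(List<String> jdbcUrls) {
        this.jdbcUrls = jdbcUrls;
    }

    /** All data for a given customer lives on exactly one shard, so transactions
        touching only that customer remain fully ACID on that single server. */
    public String shardFor(String customerId) {
        int shard = Math.floorMod(customerId.hashCode(), jdbcUrls.size());
        return jdbcUrls.get(shard);
    }
}
```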

Now, to your question of how to achieve scalability and consistency, some strategies:

  • The problem really only poses itself in cases where you must read before you write. You can try isolating the parts of your workload that fall into that category and use a different mechanism or database to serve them. Applying sharding can help to further subdivide the set of consistent data to fit within one server.
  • You can take advantage of the fact that probably not all reads must have a consistent view, and use caches or non-ACID queries (e.g. Cassandra read consistency ONE).
  • In cases where you want to replace a complex graph of data with a new complex graph that must be entirely consistent, you can use a versioning strategy: write the new version of the graph using non-atomic writes, then switch to that version with a single atomic write (see the sketch below).
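
A sketch of that last versioning strategy, again assuming the DataStax Java driver and hypothetical graph_data / graph_pointer tables:

```java
import java.util.List;
import java.util.UUID;

import com.datastax.oss.driver.api.core.CqlSession;

public class GraphVersionPublisher {
    /** Writes a new version of the graph with plain (non-atomic) writes, then
        flips a single pointer row so readers switch to the complete new version. */
    public static void publishNewGraph(CqlSession session, List<String> nodes) {
        UUID newVersion = UUID.randomUUID();

        // 1. Non-atomic writes: readers never see these rows, because they still
        //    resolve the old version via the pointer row
        for (String node : nodes) {
            session.execute(
                    "INSERT INTO app.graph_data (version, node) VALUES (?, ?)",
                    newVersion, node);
        }

        // 2. Atomic switch: a single-partition write (or an LWT, if concurrent
        //    publishers are possible) moves all readers to the new version at once
        session.execute(
                "UPDATE app.graph_pointer SET current_version = ? WHERE name = 'default'",
                newVersion);
    }
}
```

Readers first look up current_version and then query graph_data for that version only; old versions can be cleaned up lazily or expired with a TTL.
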
Joeri Sebrechts