5

I am new to NoSQL and have always heard of "key-value" DBs like Cassandra or Riak and expected their tables to literally have two columns: a key-col and a value-col. I just started working at a new shop where the senior engineers chose Cassandra as the main datastore and setup their own Cassandra cluster themselves. Based on the design and use of the tables, however, I'm beginning to suspect that we're not using Cassandra the "right way"; that is, the way its intended to be used (as a KV DB).

The tables are all totally relational, just like a MySQL or Postgres DB, with non-key UUID columns in some tables acting as the primary keys for records in other tables. And the tables are wide, sometimes having 10+ columns and multiple columns making up the table key.

However in Cassandra you can't do joins so the software is jumping through all of these ridiculous hurdles to treat the non-relational KV datastore as if it was relational, where the code queries the data for one table, and fetches a UUID ("key") for the records it wants to JOIN to in another table, then it queries that second table for records with a matching UUID/key. To me this just seems crazy. I think I get the buzzphrase "Code is king", but it just seems bonkers and wildly inefficient to be hitting the database multiple times and stitching the data together in the app layer...I feel like this means our data is actually relational and that we should be using some RDBMS.

So all this to ask: Is this truly the way Cassandra was intended to be used, or should the tables truly be 2-column KV pairs, or is the proper use of Cassandra something else entirely?

smeeb
  • 4,820
  • 10
  • 30
  • 49
  • 4
    You work for a shop that's going to write a relational database from scratch. Run. Run away. Run fast. Run far. – Blrfl Feb 15 '17 at 11:22
  • 3
    You should ask them why they chose Cassandra over a rdbms. The answer may be illuminating – Paul Feb 15 '17 at 11:34
  • 2
    I did ask and they literally regurgitated the "Code is king" mantra :-/ . Please keep in mind though this question is *more* about helping me understand the *proper* way to use Cassandra, rather than to focus on the senior engineer's shortcomings...I'm still curious about how Cassandra is intended to be used!!! – smeeb Feb 15 '17 at 11:40
  • 1
    I guess you are asking how to use SQL on a NoSQL database? – johnny Feb 15 '17 at 15:26
  • 1
    Nope @johnny I'm looking for (a) clarification that the way our shop is using Cassandra is *not* the way to use it, and then tangentially (b) interested in what the proper use for it is. – smeeb Feb 15 '17 at 15:29
  • Remember most data modeling books and just about any training or education will be based of relational models -- you need to somehow forget a lot of that to successfully model data on other platforms. – Wyatt Barnett Feb 15 '17 at 19:56
  • 2
    You are questioning the nosql religion with rationality. You are treading on dangerous turf. i approve, but you should probably run while you can. – Jim Garrison Feb 16 '17 at 17:02

1 Answers1

9

If you use Cassandra and need to emulate a lot of JOIN's, then you are doing it wrong. The reason for such a data model is usually that you are following the habit you learned from relational databases and normalize your data as much as possible. This leads to data getting spread out over many tables. This is a bad idea in Cassandra, because you have no efficient way to JOIN these tables back together.

The first thing one must learn when switching from relational databases to Cassandra is that you must not be afraid of data duplication. Redundancies in Cassandra data models are normal and often recommended. Let's take a simple invoice database, for example.

An invoice in a relational database would have one entry in a table for the invoice head and several entries in another table for the invoice positions. In Cassandra you would drop the head-table and instead duplicate all that information from the head in all rows of the position-table.

In a relational database, your invoice-table would likely have a column "productId" which is the primary key of another table "Products". In Cassandra that Products table would likely still exist, but you would replicate all the fields from it which are relevant for invoices in the invoice position table.

The goal of all this is that you can show an invoice to the user with all the relevant information by only making a single query on the invoice positions table.

The result of this is that your tables in Cassandra tend to be far wider than in relational databases. A lot of small tables usually means that you need to query a lot of different tables to get the information you need. Yes, Cassandra is called a key-value database, but that doesn't mean that there is a size restriction for either keys or values or that values must not be more than one field. The values can be as large as you want.

For further reading, I recommend the article Basic Rules of Cassandra Data Modeling by DataStax.

Philipp
  • 23,166
  • 6
  • 61
  • 67
  • Thanks @Philipp (+1) - but in your invoice example, what's the *actual benefit* to storing the (duplicated) data in Cassandra? Thanks again! – smeeb Feb 15 '17 at 14:30
  • 1
    @smeeb The benefit for Cassandra is that you don't need to do a three-table JOIN over heads, positions and products and can get all the data you want in a single query. When you design Cassandra data models, you should look at what queries you expect and then design your data model in a way that each query can be satisfied with querying one table, even if that means that you have the same information in multiple tables. – Philipp Feb 15 '17 at 14:39
  • 1
    @smeeb You might also consider the corollary of "code is king": **"database model is peasant"**. If the king wants something, the peasants are supposed to bend over backwards and fulfill that wish even at the expense of their own well-being. If the data model needs to include redundancies to fulfill a certain request with one query, so be it. – Philipp Feb 15 '17 at 14:50
  • Thanks again @Philipp (+1 again), so are you saying that the single table queries (in Cassandra) are faster than JOINS that you would typically have to do in an RDBMS? If so, is *that* really the primary motivation for Cassandra? That if you need super quick queries that would normally be bogged down by slow JOINs, you can get them in Cassandra, **but at the price of data duplication?** Am I understanding this right?!? Thanks again! – smeeb Feb 15 '17 at 14:52
  • @smeeb The advantages and disadvantages of Cassandra over relational databases are a far wider topic than just this one aspect. This is an issue which is far too complex to answer with a comment. – Philipp Feb 15 '17 at 14:54
  • Sure, understood @Philipp, but can you confirm that my assertion above is at least correct or not? Also, any "TL;DR" articles/summaries you can recommend that would cover Cassandra's benefits/strongpoints? Thanks again! – smeeb Feb 15 '17 at 15:06