8

I have some experience with document-based stores (MongoDB and CouchDB) and I am interested in exploring wide-column databases.

Based on my initial exploration I can grasp a basic understanding of how wide column stores are different, but I do not really understand in which type of operations they are a better fit than an indexed document store.

My initial impression is that column stores are better if the column combinations for the queries are highly dynamic (no indexed view really required) and/or if there is a high rate of writing (that triggers map-reduce indexes in a document store).

Performance-wise, it seems that column stores might be better if I have documents with many properties but not all of them are needed. Document stores seem to promote that the whole document will be retrieved, but not sure how much impact this really has. Maybe the document needs to have many filtered columns to make a difference.

Also, I got the impression that column stores "might" be more performant for multi-tenant systems which shared databases where one of the columns holds the tenant id and maybe another one the roles.

And I am getting the feeling that wide column stores are very good for the queries done by data analysis applications, where there is a large set of collected data for each entry, only a few fields must be extracted and the combination of columns is totally random.

My Question: What types of queries are better handled in wide-column stores as opposed to document stores?

SystematicFrank
  • 897
  • 8
  • 18
  • recommended reading: **[Gorilla vs. Shark](http://blog.stackoverflow.com/2011/08/gorilla-vs-shark/)** -- _"if you... don’t want your question to get instantly closed... — try to keep Gorilla vs. Shark in mind."_ – gnat Sep 04 '14 at 15:38
  • 5
    I thought this question was more like: "under which conditions do sharks have an advantage fighting a gorilla" to which the answer is "under water". That page makes me think that people is reading my question like "columns or documents! which one is cooler?" unfortunately I do not know column stores well enough to see the obvious answer that others can immediately point out. – SystematicFrank Sep 05 '14 at 05:02
  • The simplest data model is the fastest at the simplest of queries. Use the following infographic for reference: [![Data Models in NoSQL and NewSQL Databases Comparison](https://i.stack.imgur.com/rHv9J.png)](https://i.stack.imgur.com/rHv9J.png) **References** * [PWC Accelerator: Technology Forecast](https://www.pwcaccelerator.com/pwcsaccelerator/media-technology-forecast.html) – Paul Sweatte Aug 18 '18 at 15:40
  • Here's a [couple](http://info.jethro.io/blog/database-architectures-business-intelligence-comparison) of [case studies](https://semanticommunity.info/Other/Big_Data_at_NIST) – Paul Sweatte Nov 15 '18 at 13:57

2 Answers2

1

Column stores are great for when you have many fields, but rarely need to read all of them at once.

Because the data is stored per column, the CPU only needs to fetch the data that it needs. Also, columns can be aligned contiguously on disk, and CPUs are better at reading and filtering aligned contiguous data. Duplicate data is easier to compress in a column store as well, because each column will store a single data type, which can be compressed more readily when kept together. Columnar data can be more expensive to write to, so it's great for read-often, write-rarely data like analytics snapshots.

I recommend reading the Dremel research paper from Google to understand the benefits they found from using a columnar format.

If you want to get a better feel for it, you can try out BigQuery on some public datasets. That should help you get a feel for which kinds of queries run faster or more efficiently on a distributed columnar datastore.

Document datastores are better for data without an obvious schema, or data that will usually be fetched together. They also tend to be more general-purpose. For many web developers, a document datastore is the go-to datastore for user data, because they tend to perform well for most small applications, and can be optimized/scaled further by adding indexes.

-2

I can't answer this question for you, and no one else can either, because "Gorilla vs Shark" as noted in comments above. But I will help anyway.

You have omitted an important preceding question:

What are the characteristics of the data set I am querying?

That is just as important, if not more so, than the specific queries you want to run. Some useful questions to ask about your data are:

  • How much data do I have? Does it fit into memory? On one server? On a cluster?
  • How does my data change? Does it get mass updates on a predictable frequency? Unpredictable frequency? Does it get streamed updates? Trickled transaction updates? No updates at all?
  • What is the structure of the entities in my data? Are there any "one-to-many" relationships? Is it all tabular? Is it mostly tabular?
  • What is the sparseness of my data? Is it reasonably complete? Is it mostly empty?

If you are considering this in the abstract and don't have any specific data set in mind, then there is no reasonable answer to your question.

And even with a specific, well-defined set of data, and answers to all these questions, you still might not know without doing a bakeoff of particular implementations.

Tajh Taylor
  • 102
  • 3
  • asking about the data is the wrong question! the real thing you need to ask about is the characteristics of the queries – user253751 Nov 17 '22 at 17:35