Near real-time analytics from Cassandra with frequent updates

Question

We have an activity metrics page where users can select a date range and see other users' aggregated activity (by action), optionally filtered by 4 or 5 fields. Actions happen sequentially, but one of the fields is "Tags", and a user may change the tags of old actions at any time. The data is in Cassandra 3.7, with the partition key being company_id, action_year, action_week. Each week has about 70k actions (each action has 20 columns of long or int data and is keyed by the partition key plus action_timestamp and action_key as the row key).

PRIMARY KEY ((company_id, action_year, action_week), action_date, action_key)
) WITH CLUSTERING ORDER BY (action_date ASC, action_key ASC)

In our first version we query all the actions for a period and do all the aggregation and filtering in memory. When the user selects a couple of weeks, the whole request takes 10 to 15 seconds. We expect to scale to thousands of users requesting these analytics, which should work in near real time.

We thought of moving the filtering to C* using ALLOW FILTERING, but the WHERE clause seems very limited. We are also worried about the frequent updates to the tags.

What other options do we have? We thought of Druid, but maybe it's too much for what we need. Spark, maybe? Are we using C* wrong, and should we be caching full weeks elsewhere?

score 2 · Answer 1 · answered Oct 27 '16 at 21:59

First of all, I highly recommend you take this data modeling course from DataStax. They claim it takes 12 hours, but depending on your prior experience it may take less. It's well worth your time if you're struggling with issues like this.

Second, ALLOW FILTERING is almost always a bad idea. It's useful in a limited set of use cases, but magically making already slow queries faster isn't one of them.

When I've taught data modeling in Cassandra, the issue programmers struggle with most is that Cassandra trades off storage space for speed. If you want decent speed, you're going to require custom tables for each query, and you're going to end up with several duplicates of each record. This is especially hard to grasp for programmers used to relational databases, where you try to avoid duplication as much as possible.

That means adding a tag to a record should usually end up creating a brand new partition with duplicate data in a separate table with a primary key something like:

PRIMARY KEY ((company_id, action_year, action_week, tag), action_date, action_key)
) WITH CLUSTERING ORDER BY (action_date ASC, action_key ASC)
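
To make the duplication concrete, here is a minimal in-memory sketch of the write path such a per-tag table implies. It is not a Cassandra client and not necessarily how the answer's author would implement it: a plain dict stands in for the table, and the names actions_by_tag, write_action, retag_action, read_by_tag and the action_type column are hypothetical. It assumes the application knows an action's previous tags when they change, so that a tag edit becomes one delete per removed tag and one insert per added tag.

```python
from collections import defaultdict

# In-memory stand-in for the per-tag table: partition key -> {clustering key: row}.
# The partition key mirrors (company_id, action_year, action_week, tag).
actions_by_tag = defaultdict(dict)


def partition_key(action, tag):
    return (action["company_id"], action["action_year"], action["action_week"], tag)


def clustering_key(action):
    return (action["action_date"], action["action_key"])


def write_action(action, tags):
    """Write one copy of the action into each tag's partition."""
    for tag in tags:
        actions_by_tag[partition_key(action, tag)][clustering_key(action)] = dict(action)


def retag_action(action, old_tags, new_tags):
    """Apply a tag change as deletes for removed tags and inserts for added ones."""
    for tag in set(old_tags) - set(new_tags):
        actions_by_tag[partition_key(action, tag)].pop(clustering_key(action), None)
    write_action(action, set(new_tags) - set(old_tags))


def read_by_tag(company_id, year, week, tag):
    """Read a single partition; rows come back in clustering order."""
    rows = actions_by_tag.get((company_id, year, week, tag), {})
    return [rows[key] for key in sorted(rows)]


# Hypothetical sample data: one action with two tags, later retagged.
action = {"company_id": 1, "action_year": 2016, "action_week": 43,
          "action_date": "2016-10-27T21:59:00", "action_key": 1001, "action_type": 3}
write_action(action, ["billing", "urgent"])
retag_action(action, ["billing", "urgent"], ["billing", "review"])
```

Filtering by any single tag then reads exactly one partition per week, at the cost of one extra row per tag on every write and a delete plus insert on every tag edit.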

If you have other common columns you filter by, like perhaps action_key, those columns should be in the partition key in a near-duplicate table. You're trading off extra writes for efficient reads. Same for aggregates. Do it at write time if possible. If you can't avoid filtering in memory, do everything possible to make your partitions as close to your end result as possible first.
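
Aggregating at write time could look roughly like the sketch below. Again this is an in-memory stand-in, not driver code: a Counter plays the role of a counter-style table keyed by the tag partition plus day and action type, and the names action_counts, record_action, move_counts, weekly_totals, action_day and action_type are hypothetical. It assumes a tag edit is applied as a decrement on the old tag and an increment on the new one.

```python
from collections import Counter

# Stand-in for a pre-aggregated table keyed by
# ((company_id, action_year, action_week, tag), action_day, action_type).
action_counts = Counter()


def count_key(action, tag):
    return (action["company_id"], action["action_year"], action["action_week"],
            tag, action["action_day"], action["action_type"])


def record_action(action, tags):
    """Increment the per-tag count when an action is written."""
    for tag in tags:
        action_counts[count_key(action, tag)] += 1


def move_counts(action, old_tags, new_tags):
    """On a tag change, decrement removed tags and increment added ones."""
    for tag in set(old_tags) - set(new_tags):
        action_counts[count_key(action, tag)] -= 1
    for tag in set(new_tags) - set(old_tags):
        action_counts[count_key(action, tag)] += 1


def weekly_totals(company_id, year, week, tag):
    """Sum the counts per action_type for one tag partition, skipping zeros."""
    totals = Counter()
    for (c, y, w, t, _day, action_type), n in action_counts.items():
        if (c, y, w, t) == (company_id, year, week, tag):
            totals[action_type] += n
    return {action_type: n for action_type, n in totals.items() if n}


# Hypothetical sample data: two actions, one of which later loses a tag.
first = {"company_id": 1, "action_year": 2016, "action_week": 43,
         "action_day": "2016-10-27", "action_type": 3}
second = {"company_id": 1, "action_year": 2016, "action_week": 43,
          "action_day": "2016-10-28", "action_type": 3}
record_action(first, ["billing", "urgent"])
record_action(second, ["billing"])
move_counts(first, ["billing", "urgent"], ["billing"])
```

The page query then sums a handful of pre-computed rows instead of scanning roughly 70k actions per week. Note that an action with several tags is counted once under each tag, so totals across tags cannot simply be added together.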

Best of luck. Switching your data modeling mindset to Cassandra's can be mind-bending at first, but it's worth the effort.

Thanks, the problem is that I have multiple tags per item, so also per action_key. I understand that the key is data duplication, what I don't see is a good solution for handling multiple tags and be able to filter by any of them. Maybe Cassandra is the wrong tool for that. — Federico Pugnali, Oct 31 '16 at 14:16
