How to perform summary analytics on production databases?

Question

I want to use production data to generate "user activity" metrics.

My thought is to set up a cron job that regularly reads production databases to summarize user activity and then stores the results in a separate metrics database. This metrics database can then be used to generate various dashboards.

However, directly reading production databases to generate metrics seems abusive -- hence my question.

What is best practice for this situation?

score 5 · Accepted Answer · answered Feb 21 '16 at 04:22

Why is reading the data "abusive"?

Presumably, the business has some level of need for operational and analytic reporting. How is that accomplished today? In smaller organizations, that generally involves running queries against the production database(s) that the OLTP applications are manipulating. At some point, the load that reporting puts on the system grows and there is a desire to separate the higher priority OLTP load from the lower priority reporting load. At that point, there are various architectural options such as maintaining a replica of the production database that is a few seconds or a few minutes behind the OLTP system and moving the reporting there or building a data warehouse and moving the reporting there. Building a data warehouse generally involves, among other things, building and maintaining aggregate tables of some sort.

Of course, that generally means that you need additional servers, additional database licenses, additional processes to monitor, and generally involves adding complexity to the production environment. Architecting and building out either the replica database approach or the data warehouse approach generally involves a fair amount of effort and thought-- different databases provide a wealth of different tools, there are ETL suites to manage building the data warehouse, you start to worry about things like data lineage, etc.

I have seen both of these options used. A data warehouse is a lot more work, but performs much better in the long run. A replica is easy to set up, but depending on the query and take structure, could be intensive. But even on production, a database server is meant to be queried. Try it, and if you have performance issues, reevaluate. — mgw854, Feb 21 '16 at 04:59

score 1 · Answer 2 · answered Feb 21 '16 at 12:06

If your primary database is relational one, a RDBMS, like for example MySQL, you should store the data you want to generate statistics from in something like Elastic Search. With this approach your primary db is your persistent storage, while you copy over relevant data to your search index in ES. You could then do all kinds of fun things with your data!

It is very usual to have multiple databases for the same datasets. They are all there to achieve a specific task and for advanced metrics a search engine is a perfect fit.

How to perform summary analytics on production databases?

2 Answers2