2

If you need to access application information from a database for analytics, is it better to copy the information first and analyse the copy, or is it okay to analyse the data retrieved fresh from the application database?

Tulains Córdova
irl_irl
  • Does anyone know if posts can be marked as "community" posts here, or is that just on SO? – irl_irl Mar 12 '17 at 21:17
  • CW functionality has been greatly scaled back from the old days. I think you have an option when creating the post to start with; otherwise, you would need to flag it for a moderator to change it. –  Mar 14 '17 at 19:01

2 Answers

2

The problem with NoSQL is that it's like going back in time to the age of tape drives. In my experience it completely sucks for almost everything small to medium sized, especially statistics, where you have a very large number of rather small rows, you need to access many of them, and real DECIMAL types (not lossy floats) are sometimes nice to have. I was recently wondering where the hype is coming from. I think it's from multi-million-dollar corporations that can afford thousands of servers running NoSQL, because that way it's much cheaper and safer to scale very large apps than hiring big engineering teams to design and maintain an SQL cluster.

And it's really surprising that their devs, who need to manually sanitize every piece of data, aren't ending up in mental asylums in large numbers ;)

I tested some of the best available NoSQL solutions about two weeks ago, loading one day's worth of stats (about 2-3 million rows). What took less than 100 ms in unoptimized MySQL took over 10 seconds in NoSQL, with some optimizations, on the same hardware. Not to mention that the standard aggregation functions failed because there were "too many rows to aggregate" and I had to write map-reduce jobs instead, which is really pathetic. A simple aggregated/sorted query that takes 20 seconds to write in SQL will take days to implement in a NoSQL "database" and still perform badly.
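
For illustration, here is a minimal sketch of the kind of single-statement aggregation a relational engine gives you out of the box. It uses Python's standard sqlite3 module with an invented `stats` table and made-up sample rows; none of the names come from the benchmark above.

```python
import sqlite3

# Hypothetical stats table: one small row per event, as described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stats (
        ts     TEXT NOT NULL,   -- event timestamp (ISO 8601)
        metric TEXT NOT NULL,   -- metric name, e.g. 'page_view'
        value  REAL NOT NULL    -- measured value
    )
""")
conn.executemany(
    "INSERT INTO stats (ts, metric, value) VALUES (?, ?, ?)",
    [
        ("2017-03-12T10:00:00", "page_view", 1),
        ("2017-03-12T10:05:00", "page_view", 1),
        ("2017-03-12T11:00:00", "load_time", 0.42),
    ],
)

# One aggregated, sorted query -- no hand-written map-reduce job needed.
rows = conn.execute("""
    SELECT date(ts) AS day, metric, COUNT(*) AS n, SUM(value) AS total
    FROM stats
    GROUP BY date(ts), metric
    ORDER BY day, metric
""").fetchall()
print(rows)
```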

I think if you don't have 5-10 servers, and you're not extremely bored and looking for something stupid to do with your life, then forget it. JSON is a poor format for storing large amounts of data; the time needed to parse that bloat and send it around will kill you. There are some legitimate cases for a tape drive, and for NoSQL... but it's short-sighted to use a tape drive just because "it has more space".

If you are worried about additional statistics: store most of the data in normal columns, and use JSON columns to store everything exotic. At least users won't wait 10 minutes to get a report on commonly used metrics, and for the less-used stuff you can put up a warning that it will take, say, 20 minutes to generate.
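
A minimal sketch of that hybrid layout, again with Python's sqlite3 module: the table and column names are illustrative, and the `json_extract()` call assumes an SQLite build with the JSON1 functions (MySQL and PostgreSQL have their own JSON column types and extraction functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,   -- common metric: plain, indexable column
        created TEXT    NOT NULL,   -- common metric: plain column
        extra   TEXT                -- everything exotic, stored as a JSON blob
    )
""")
conn.execute(
    "INSERT INTO events (user_id, created, extra) VALUES (?, ?, ?)",
    (42, "2017-03-12", '{"ab_test_bucket": "B", "referrer": "newsletter"}'),
)

# Common report: cheap, touches only plain columns.
print(conn.execute(
    "SELECT user_id, COUNT(*) FROM events GROUP BY user_id"
).fetchall())

# Rare report: digs into the JSON blob, acceptably slower.
print(conn.execute(
    "SELECT json_extract(extra, '$.ab_test_bucket') AS bucket, COUNT(*) "
    "FROM events GROUP BY json_extract(extra, '$.ab_test_bucket')"
).fetchall())
```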

As for using the application DB for long analytical queries: you can design it "properly" on some MVCC-based engine, which is very hard to do, and then run your queries alongside the application's queries without locking data. Or you can use the standard way of "moving data around" called replication in master-slave mode, do the analytical queries on the replica, and save yourself a couple of months of your life, plus the days of application downtime when it turns out that MVCC isn't always as great as advertised :)
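
A rough sketch of the replication option, just to show the shape of it: the application keeps writing to the primary connection, while reporting queries go to a replica that the database's own replication keeps in sync. The file names, the `signups(user_id, created)` schema, and both functions are invented for illustration, with sqlite3 standing in for whatever master-slave setup you actually run.

```python
import sqlite3

# Placeholder connections: in a real setup these would point at the primary
# and at a read replica kept up to date by the database's own replication.
primary = sqlite3.connect("app_primary.db")        # application reads/writes
replica = sqlite3.connect("analytics_replica.db")  # replicated copy, reports only

def record_signup(user_id: int) -> None:
    """OLTP work stays on the primary (assumes a signups(user_id, created) table)."""
    primary.execute(
        "INSERT INTO signups (user_id, created) VALUES (?, datetime('now'))",
        (user_id,),
    )
    primary.commit()

def signups_per_day():
    """Heavy reporting runs against the replica, so it cannot block the application."""
    return replica.execute(
        "SELECT date(created) AS day, COUNT(*) FROM signups GROUP BY date(created)"
    ).fetchall()
```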

Slawek
1

This is pretty broad.

First you need to design a database for the day-to-day features.

You will likely want some real-time analytics for the user, for which you can most likely just use the existing database design.

For historical data you could have a dedicated table where one column is the feature name, so you can perform analytics per feature and/or look for correlations across features. That could be an external NoSQL-like database, but I would not call it unstructured data, as you need enough structure to run the analytics.
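
A minimal sketch of such a historical table, assuming an invented `feature_usage` layout with the feature name as an ordinary column; the names and sample rows are illustrative only.

```python
import sqlite3

# One row per usage event, with the feature name as a plain column so
# per-feature analytics and cross-feature correlations stay simple queries.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature_usage (
        user_id INTEGER NOT NULL,
        feature TEXT    NOT NULL,   -- e.g. 'search', 'export', 'share'
        used_at TEXT    NOT NULL    -- event timestamp
    )
""")
conn.executemany(
    "INSERT INTO feature_usage VALUES (?, ?, ?)",
    [
        (1, "search", "2017-03-01"),
        (1, "export", "2017-03-01"),
        (2, "search", "2017-03-02"),
    ],
)

# Per-feature analytics: usage counts and distinct users for each feature.
print(conn.execute("""
    SELECT feature, COUNT(*) AS uses, COUNT(DISTINCT user_id) AS users
    FROM feature_usage
    GROUP BY feature
    ORDER BY uses DESC
""").fetchall())
```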

paparazzo