I'm looking at building out the architecture for the following, and wanted to see what others think about it.
Assume that the system is running some non-trivial algorithm (so it's not simply a sum of something, etc.) on the data collected for each user. Some users will have 10 rows of data, some will have tens of thousands. The data is user geo positions over time. There would be on the order of 10-100M users, and data for many users comes in every day, potentially every minute for some.
At periodic intervals (1/5/15 minutes, basically as soon as possible), I'd want to run that non-trivial algorithm on each user's data, which would spit out a couple of numbers that would then be reported out.
One way to model that would be to store the data in a NoSQL DB and process each user's data on an Akka cluster. Any recommendations for the DB?
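To make the Akka side concrete, here's a rough sketch of what I'm picturing, assuming Akka Typed Cluster Sharding with one entity actor per user. `UserProcessor`, `AddPosition` and the `runAlgorithm`/`report` stand-ins are made-up names for illustration, not an actual implementation:

```scala
import scala.concurrent.duration._
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object UserProcessor {
  sealed trait Command
  final case class AddPosition(lat: Double, lon: Double, ts: Long) extends Command
  private case object Recompute extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("UserProcessor")

  def apply(userId: String): Behavior[Command] =
    Behaviors.withTimers { timers =>
      // Re-run the algorithm for this user on a fixed interval (the 1/5/15-minute cadence).
      timers.startTimerAtFixedRate(Recompute, 1.minute)
      running(userId, Vector.empty)
    }

  private def running(userId: String, positions: Vector[(Double, Double, Long)]): Behavior[Command] =
    Behaviors.receiveMessage {
      case AddPosition(lat, lon, ts) =>
        // Append-only: new points are only ever added to this user's history.
        running(userId, positions :+ ((lat, lon, ts)))
      case Recompute =>
        report(userId, runAlgorithm(positions))
        Behaviors.same
    }

  // Stand-ins for the actual non-trivial algorithm and the reporting sink.
  private def runAlgorithm(positions: Vector[(Double, Double, Long)]): Double = positions.size.toDouble
  private def report(userId: String, value: Double): Unit = println(s"$userId -> $value")
}

// Registering the sharded entity: each user id maps to one in-memory actor.
// val sharding = ClusterSharding(system)
// sharding.init(Entity(UserProcessor.TypeKey)(ctx => UserProcessor(ctx.entityId)))
```

The appeal of sharding by user id is that each user's data and recompute stays on one node, but it runs straight into the memory question below.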
The user data here is basically an append log: once added, data won't change, but it keeps growing all the time, and some users have disproportionately more data than others. In order to process the data per user, all of it needs to be loaded into memory somewhere. The best possible scenario would be to keep all data in memory and re-process it at a one-minute interval; the downside is that I would need terabytes of RAM to do that, and if the in-memory servers go down, all of the data would need to be re-loaded, which would take a while.
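To make that reload concern concrete, the cold-start path per user would look roughly like this, with the NoSQL store abstracted behind a hypothetical `PositionStore`/`readAllPositions` interface since the actual DB choice is the open question:

```scala
import scala.concurrent.{ExecutionContext, Future}

final case class Position(lat: Double, lon: Double, ts: Long)

trait PositionStore {
  // Append-only history: rows are only ever added for a user, never updated.
  def readAllPositions(userId: String): Future[Vector[Position]]
}

class UserReprocessor(store: PositionStore)(implicit ec: ExecutionContext) {
  // Load a user's full history into memory and run the algorithm over it.
  // On a cold start every user has to go through this, which is where the
  // "re-loading would take a while" worry comes from.
  def reprocess(userId: String): Future[Double] =
    store.readAllPositions(userId).map(runAlgorithm)

  // Placeholder for the actual non-trivial algorithm.
  private def runAlgorithm(positions: Vector[Position]): Double =
    positions.size.toDouble
}
```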