I'm looking at building out the architecture for the following, and wanted to see what others think about it.
Assume that the system is running some non-trivial algorithm (so it's not simply a sum of something, etc.) on the data collected for each user. Some users will have 10 rows of data, some will have tens of thousands. The data is user geo positions over time. There would be on the order of 10-100M users, and data for many users comes in every day, potentially every minute for some.
At periodic intervals (1/5/15 minutes, basically as soon as possible), I'd want to run that non-trivial algorithm on each user's data, which would spit out a couple of numbers that would then be reported out.
One way to model that would be to store the data in a NoSQL DB and process each user's data on an Akka cluster. Any recommendations for the DB?
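To make the Akka side concrete, here's a rough sketch of what I'm picturing, assuming Akka Typed Cluster Sharding with one entity actor per user. `UserProcessor`, `AddPosition` and the `runAlgorithm`/`report` stand-ins are made-up names for illustration, not an actual implementation:

```scala
import scala.concurrent.duration._
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object UserProcessor {
  sealed trait Command
  final case class AddPosition(lat: Double, lon: Double, ts: Long) extends Command
  private case object Recompute extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("UserProcessor")

  def apply(userId: String): Behavior[Command] =
    Behaviors.withTimers { timers =>
      // Re-run the algorithm for this user on a fixed interval (the 1/5/15-minute cadence).
      timers.startTimerAtFixedRate(Recompute, 1.minute)
      running(userId, Vector.empty)
    }

  private def running(userId: String, positions: Vector[(Double, Double, Long)]): Behavior[Command] =
    Behaviors.receiveMessage {
      case AddPosition(lat, lon, ts) =>
        // Append-only: new points are only ever added to this user's history.
        running(userId, positions :+ ((lat, lon, ts)))
      case Recompute =>
        report(userId, runAlgorithm(positions))
        Behaviors.same
    }

  // Stand-ins for the actual non-trivial algorithm and the reporting sink.
  private def runAlgorithm(positions: Vector[(Double, Double, Long)]): Double = positions.size.toDouble
  private def report(userId: String, value: Double): Unit = println(s"$userId -> $value")
}

// Registering the sharded entity: each user id maps to one in-memory actor.
// val sharding = ClusterSharding(system)
// sharding.init(Entity(UserProcessor.TypeKey)(ctx => UserProcessor(ctx.entityId)))
```

The appeal of sharding by user id is that each user's data and recompute stays on one node, but it runs straight into the memory question below.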
The user data here is basically an append log: once added, data won't change, but it keeps growing all the time, and some users have disproportionately more data than others. In order to process the data per user, all of it needs to be loaded into memory somewhere. The best possible scenario would be to keep all data in memory and re-process it at a one-minute interval; the downside is that I would need terabytes of RAM to do that, and if the in-memory servers go down, all of the data would need to be re-loaded, which would take a while.
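To make that reload concern concrete, the cold-start path per user would look roughly like this, with the NoSQL store abstracted behind a hypothetical `PositionStore`/`readAllPositions` interface since the actual DB choice is the open question:

```scala
import scala.concurrent.{ExecutionContext, Future}

final case class Position(lat: Double, lon: Double, ts: Long)

trait PositionStore {
  // Append-only history: rows are only ever added for a user, never updated.
  def readAllPositions(userId: String): Future[Vector[Position]]
}

class UserReprocessor(store: PositionStore)(implicit ec: ExecutionContext) {
  // Load a user's full history into memory and run the algorithm over it.
  // On a cold start every user has to go through this, which is where the
  // "re-loading would take a while" worry comes from.
  def reprocess(userId: String): Future[Double] =
    store.readAllPositions(userId).map(runAlgorithm)

  // Placeholder for the actual non-trivial algorithm.
  private def runAlgorithm(positions: Vector[Position]): Double =
    positions.size.toDouble
}
```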