11

The application will continuously (approximately once per second) collect users' locations and store them.

This data is structured. In a relational database, it would be stored as: | user | timestamp | latitude | longitude |

However, there is too much data. There will be 60 × 60 × 24 = 86,400 records per user, daily. Even with 1000 users, this means 86,400,000 records daily.

And it is not only 86,400,000 records daily: these records will also be processed, and the processed versions will be stored as well. So multiply that number by approximately 2.

How I plan to use the data

Essentially, I plan to make coarser-grained versions of the location data for easier consumption (a sketch of one such pass follows the list below). That is:

  1. Sort the received data by timestamp.
  2. Iterating over this list in order, determine whether the location has changed significantly (by checking how much the latitude and longitude changed).
  3. Represent each run of non-significant location changes as a single entry in the output (hence, the output is a coarser-grained version of the location data).
  4. Repeat this process on the output, requiring an even larger latitude and longitude change to count as significant. The output produced from the previous output will therefore be even more coarse grained.
  5. Repeat the whole process as many times as needed.
  6. Aggregate a range of resolutions and send them to users. Also, store all resolutions of the data for later consumption.
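A minimal sketch of one coarsening pass, assuming location records are simple (timestamp, latitude, longitude) tuples and that a "significant change" is just a configurable threshold on the coordinate deltas (both assumptions are mine, not part of the question):

```python
from typing import List, NamedTuple

class Fix(NamedTuple):
    timestamp: float   # seconds since epoch
    lat: float
    lon: float

def coarsen(fixes: List[Fix], threshold_deg: float) -> List[Fix]:
    """Collapse runs of fixes whose lat/lon stay within `threshold_deg`
    of the last kept fix into a single entry (the first of the run)."""
    if not fixes:
        return []
    # Step 1: sort by timestamp.
    ordered = sorted(fixes, key=lambda f: f.timestamp)
    kept = [ordered[0]]
    for fix in ordered[1:]:
        last = kept[-1]
        # Steps 2-3: only keep fixes that moved "significantly".
        if (abs(fix.lat - last.lat) >= threshold_deg
                or abs(fix.lon - last.lon) >= threshold_deg):
            kept.append(fix)
    return kept

# Steps 4-5: feed each output back in with a larger threshold, e.g.:
# resolutions = [raw_fixes]
# for t in (0.0001, 0.001, 0.01):
#     resolutions.append(coarsen(resolutions[-1], t))
```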

What should I use to store this data? Should I use a relational database or a NoSQL solution? What other things should I consider when designing this application?

Utku
    2000 records per second like that probably won't trouble an up-to-date SQL engine. A simple capacity test would be to get a console program writing some random data to files that get bulk loaded. – Caleth Jan 03 '17 at 10:21
    @Caleth But is it scalable? What about when the user base grows 100 times? – Utku Jan 03 '17 at 10:39
    Measure what your hardware can currently handle. The bottleneck is likely to be either CPU "processing" the values, or raw disk speed. What do you intend to *do* with all this data? That should shape what tech you choose for storage – Caleth Jan 03 '17 at 11:00
    Caleth is absolutely right. Millions of records don't faze a modern database system. NoSQL stores are very good at *writing* huge amounts of data very fast, but ultimately you want to do something that involves *reading* things again. How much reading you will need often determines what kind of store you should use. – Kilian Foth Jan 03 '17 at 11:18
    To give a good answer, we need to know how you plan to *use* this data. A database might be a good choice if you want ad-hoc queries, while a file-based solution would probably be better for whole-dataset analysis. Voting to close. – kdgregory Jan 03 '17 at 13:05
  • @kdgregory I have provided more details on how I plan to use the data. – Utku Jan 03 '17 at 14:58
  • This kind of data is easily losslessly compressed. Also, if you stream the data properly, sorting on timestamp can be trivial. – Erik Eidt Jan 03 '17 at 15:47
  • Given your updated requirements, I don't think that you'll benefit from a database (although a distributed database such as Cassandra _might_ provide simpler/more performant querying than something like Hadoop). – kdgregory Jan 03 '17 at 16:35
  • What is the point of storing the full data at all? Why don't you do the first processing step "on the fly" and store only the "lower resolution" data? And if you think you really need to store the whole high-resolution data, why don't you just use a logfile for each user, or rotating logfiles per day or per hour? Do you really need the full resolution in a database? What is your use case for the high resolution data besides the processing you described? – Doc Brown Jan 03 '17 at 17:14
  • @DocBrown Yes, I can store only the lower resolution data as well. Storing the higher-resolution data might be useful for understanding the reliability of the data. That is, a device might be sending its location every second for an hour, and then only once every 10 seconds during the next hour. Storing the full-resolution data might help to tell the difference. – Utku Jan 03 '17 at 17:56
  • @Utku: well, my point is: when you store the full resolution data simply in one file per user and a fixed time interval, you have a much lower risk of encountering scaling problems, and reduce the need for developing a complex "high speed database solution". – Doc Brown Jan 03 '17 at 18:25
  • Are there any fundamental / rockstar books on the process behind storing / handling large data sets? – zero_cool Dec 31 '18 at 16:58

3 Answers

9

Some alternatives for storing this data:

  1. Message queue (possibly distributed), like Apache Kafka

This will be optimized for writing and reading a stream of data. It is ideal for collecting data streams in an easy to process format, but it cannot typically be queried except by reading out the stream in its entirety. So, this would be either for archival purposes, or an intermediate step on the way to a processing layer.
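For illustration only, a hedged sketch of writing location events to Kafka with the kafka-python client; the broker address, topic name, and message layout are assumptions, not part of the answer:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address and topic name.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def record_location(user_id: str, timestamp: float, lat: float, lon: float) -> None:
    # Keying by user id keeps each user's stream ordered within a partition.
    producer.send("locations", key=user_id,
                  value={"ts": timestamp, "lat": lat, "lon": lon})
```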

  2. Relational database(s)

You can just write it to the database, and when the volume exceeds what the DB can handle, you can shard the database (= have multiple subsets of the data sit on different database servers). Benefit: you can use a relational DB and don't have to learn anything new. Downside: all code dealing with the DB must know which shard each piece of data lives on, and aggregated queries must be done in application software.
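A rough sketch of the shard-routing code the application then has to carry; hashing by user id is just one possible scheme, and the connection strings are placeholders:

```python
import hashlib

# Placeholder DSNs, one per shard.
SHARD_DSNS = [
    "postgresql://db0.example.com/locations",
    "postgresql://db1.example.com/locations",
    "postgresql://db2.example.com/locations",
]

def shard_for_user(user_id: str) -> str:
    """Route every record for a given user to the same shard."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

# Aggregated queries must fan out over all shards and be merged in
# application code, which is the main downside mentioned above.
```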

  3. Distributed NoSQL database, like Cassandra.

You write your data to a distributed NoSQL database, and it will automatically shard the data for you. Cassandra allows you to run queries across the cluster, requiring less application code to get back at the data. Benefit: more naturally suited to large amounts of data. Downside: it requires specific expertise and a deep understanding of how these systems work to achieve good performance and make the data queryable according to your needs. NoSQL is not a magic performance fix; it is a set of trade-offs which must be understood to be navigated.
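As an example of the modelling decisions involved, a hedged sketch of a possible Cassandra table using the DataStax Python driver; the contact point, keyspace, and the choice of partitioning by (user, day) are assumptions on my part:

```python
import datetime

from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])        # assumed contact point
session = cluster.connect("tracking")   # assumed (pre-existing) keyspace

# Assumed schema: partitioning by (user, day) keeps partitions bounded,
# clustering by timestamp makes per-user time-range reads cheap.
session.execute("""
    CREATE TABLE IF NOT EXISTS locations (
        user text,
        day  date,
        ts   timestamp,
        lat  double,
        lon  double,
        PRIMARY KEY ((user, day), ts)
    ) WITH CLUSTERING ORDER BY (ts ASC)
""")

session.execute(
    "INSERT INTO locations (user, day, ts, lat, lon) VALUES (%s, %s, %s, %s, %s)",
    ("user-b", datetime.date(2017, 1, 3),
     datetime.datetime(2017, 1, 3, 10, 0, 0), 41.0082, 28.9784),
)
```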

  4. Hadoop / files

The data is appended to files, which are distributed automatically across servers by the Hadoop platform, processed on those servers using tools like MapReduce or Apache Spark, and finally queried (as files) using a Hadoop SQL engine like Hive or Impala.
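As an illustration only, a small PySpark sketch that reads such appended files and computes per-user daily fix counts; the file path, format, and column names are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("location-rollup").getOrCreate()

# Assumed layout: CSV files with columns user,timestamp,lat,lon appended under this path.
raw = (spark.read
       .option("header", "true")
       .csv("hdfs:///data/locations/*.csv"))

# One possible whole-dataset job: count fixes per user per day.
daily_counts = (raw
    .withColumn("day", F.to_date(F.from_unixtime(F.col("timestamp").cast("long"))))
    .groupBy("user", "day")
    .count())

daily_counts.write.mode("overwrite").parquet("hdfs:///data/location_daily_counts")
```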

Which to choose?

The trade-offs between these alternatives are complex, and they very much depend on both your write and your read patterns, so the only person who can decide on these trade-offs is you. If you lack the time to build up a deep understanding of these alternatives, then just use a relational DB and figure out a sharding solution as you go along. In all likelihood, YAGNI.

Joeri Sebrechts
  • I have provided more details about how I plan to use the data. Would you like to add anything given this information? – Utku Jan 03 '17 at 14:55
  • Still not quite clear to me what you mean by "resolution". Do you want to aggregate onto geographical level (city, state, ...) or onto some coordinate system like a geohash? Or are you interested in the amount of delta because you want to build notifications based on movement thresholds? In short: what is this all for? – Joeri Sebrechts Jan 03 '17 at 15:36
  • It is for tracking users. Users track each other, and I graph where the users whom they track have been in the last 5 hours on the devices. Essentially, the finer grained, the better. However, mobile devices have a limited amount of memory, hence you cannot send the data without reducing its resolution. That is, let's say user A is tracking user B, C and D. If I simply forward whatever location data I receive from B, C and D to A without doing any processing on the server side, user A's device's memory will fill very quickly. Hence, I need to do some processing. – Utku Jan 03 '17 at 16:45
  • If I were to build what you're describing, I'd construct it as a series of Kafka logs connected via Spark streaming, where the positions are integrated across windows in the Spark stream, and the final output Kafka log is provided as pull and push web APIs to the clients. However ... that's a lot of very particular technology, and depending on your background and available time those choices may be wrong for you. – Joeri Sebrechts Jan 04 '17 at 07:57
  • Thanks. I will keep that in mind but following the YAGNI principle, I am planning to use a relational database for now. When the need arises, I will switch to something that better suits the application. Please feel free to edit any information into your answer, if you like. – Utku Jan 04 '17 at 08:06
7

Look into your requirements a little deeper. There is a way to create the illusion of tracking position every second.

If you have an app that knows your current GPS location and writes it to a database, why would you keep writing the location if it doesn't change? Even if you require the data, if the user has been asleep for 7 hours, you can programmatically fill in the missing time slots with a duplicate location to do your calculations or mapping or whatever else you need to do.
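A small sketch of that idea: only persist a fix when it differs from the last stored one, and re-inflate the gaps at read time. The `store` callback, in-memory cache, and 1-second step are placeholders, not part of the answer:

```python
from typing import Dict, List, Tuple

Fix = Tuple[float, float]           # (lat, lon)
_last_fix: Dict[str, Fix] = {}      # last stored fix per user (in-memory)

def maybe_store(user_id: str, ts: float, fix: Fix, store) -> None:
    """Only hit storage when the position actually changed."""
    if _last_fix.get(user_id) != fix:
        _last_fix[user_id] = fix
        store(user_id, ts, fix)     # `store` is whatever persistence layer is used

def forward_fill(rows: List[Tuple[float, Fix]], until: float,
                 step: float = 1.0) -> List[Tuple[float, Fix]]:
    """Recreate a dense series by repeating the last known fix once per `step` seconds."""
    dense: List[Tuple[float, Fix]] = []
    for (ts, fix), nxt in zip(rows, rows[1:] + [(until, None)]):
        t = ts
        while t < nxt[0]:
            dense.append((t, fix))
            t += step
    return dense
```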

If you do track the location every second, do you have to store this data forever? You can archive the records to another database to prevent the current table from getting too large. Or you could even just keep the records where there is a position change. This is common in data warehouses.

JeffO
2

Your data is a set of time series: for each user, two numbers (latitude and longitude) that evolve with time. Typically, you are NOT looking for any kind of relational storage, but rather RRD (round-robin database) storage. This kind of storage focuses heavily on reducing the I/O cost of numerous small writes by buffering them.

Relational storage is heresy for this volume of time-series data. However, be warned that RRD tools are not nearly as well supported for programmatic use as SQL is. You are probably looking at serious integration work, but it's hardly avoidable given your requirements.
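If you go that route, a hedged sketch using the rrdtool Python bindings; the step size, data-source names, and retention windows below are guesses meant only to illustrate the shape of an RRD (and note that naively averaging latitude/longitude is only a rough approximation):

```python
import rrdtool  # pip install rrdtool (Python bindings for RRDtool)

# One RRD per user: two gauges sampled every second, kept at progressively
# coarser resolutions (1 s for a day, 10 s for a week, 60 s for a month).
rrdtool.create(
    "user_b.rrd",
    "--step", "1",
    "DS:lat:GAUGE:5:-90:90",
    "DS:lon:GAUGE:5:-180:180",
    "RRA:AVERAGE:0.5:1:86400",
    "RRA:AVERAGE:0.5:10:60480",
    "RRA:AVERAGE:0.5:60:43200",
)

# Append one sample: "N" means "now".
rrdtool.update("user_b.rrd", "N:41.0082:28.9784")
```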

Diane M