9

Background

A local database contains nearly 1.3 billion unique rows. Each row is indirectly associated with a specific latitude and longitude (location). Each row has a date stamp.

Use Case

The problem is as follows:

  1. The user sets a starting/ending date, and a range of values (e.g., 100 to 105).
  2. The system gathers all the rows that match the given dates, grouped by location.
  3. The system determines the locations that, during those dates, have a statistical likelihood of falling into the given range of values (a simplified query sketch follows this list).
  4. The system displays all matching locations to the user.
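
For concreteness, a single request boils down to a query of roughly this shape (table and column names anticipate the structure described in the update below; the real statistic is computed in PL/R, and a plain average stands in for it here):

    -- Simplified shape of one request: average value1 per location over the
    -- chosen dates, keeping locations whose average falls inside the range.
    SELECT location_id
      FROM measurement   -- hypothetical name for the big table
     WHERE taken BETWEEN '1900-01-01' AND '1900-01-06'
     GROUP BY location_id
    HAVING avg(value1) BETWEEN 100 AND 105;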

This is a problem of speed and scale.

Question

What is the least expensive solution architecture you can imagine that would allow such a system to retrieve results for users in under five seconds?

Current System

The environment is currently:

  • PostgreSQL 8.4 (upgrade is possible; switching databases is not an option)
  • R and PL/R
  • XFS
  • WD VelociRaptor
  • 8 GB RAM (Corsair G.Skill; 1.3 GHz)
  • Quad core GenuineIntel 7 (2.8 GHz)
  • Ubuntu 10.10

Hardware upgrades are acceptable.

Update - Database Structure

The billions of rows are in a table resembling:

id | taken | location_id | category | value1 | value2 | value3
  • id - Primary key
  • taken - Date assigned to the row
  • location_id - Reference to the latitude/longitude
  • category - A description of the data
  • value1 .. 3 - The other values the user can query

The taken column typically holds consecutive dates for each location_id; some locations have data from 1800 to 2010 (about 77,000 distinct dates), and many of the dates are duplicated across locations because each location covers the same date range.

There are seven categories and the tables are already split by category (using child tables). Each category contains ~190 million rows. In the near future, the number of rows per category will exceed a billion.
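
For reference, the split is done with ordinary inheritance-based child tables, roughly along these lines (the table and index names here are illustrative, not the real ones):

    -- Parent table with the columns listed above.
    CREATE TABLE measurement (
        id          bigserial PRIMARY KEY,
        taken       date     NOT NULL,
        location_id integer  NOT NULL,
        category    smallint NOT NULL,
        value1      numeric,
        value2      numeric,
        value3      numeric
    );

    -- One inheritance-based child table per category (seven in total).
    CREATE TABLE measurement_cat1 (
        CHECK (category = 1)
    ) INHERITS (measurement);

    -- Each child is clustered on a B-tree over location, date, and category.
    CREATE INDEX measurement_cat1_loc_taken_cat_idx
        ON measurement_cat1 (location_id, taken, category);
    CLUSTER measurement_cat1 USING measurement_cat1_loc_taken_cat_idx;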

There are approximately 20,000 locations and 70,000 cities. The locations are correlated to city by latitude and longitude. Assigning each location to a particular city means finding the city's boundaries, which is not a trivial task.

Ideas

Some ideas I have include:

  • Find a cloud service to host the database.
  • Create an SSD RAID stripe.
  • Create a table that amalgamates all the locations by city (pre-calculation; sketched below).
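
For the last idea, a rough sketch of the pre-calculated rollup might look like the following; it assumes a hypothetical location_city mapping table, which is exactly the non-trivial city-boundary problem noted above:

    -- Hypothetical rollup: one row per city, category, and day.
    CREATE TABLE daily_city_summary AS
    SELECT lc.city_id,
           m.category,
           m.taken,
           avg(m.value1) AS avg_value1
      FROM measurement m
      JOIN location_city lc ON lc.location_id = m.location_id
     GROUP BY lc.city_id, m.category, m.taken;

    CREATE INDEX daily_city_summary_idx
        ON daily_city_summary (category, taken, city_id);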

Thank you!

Dave Jarvis
    "switching databases is not an option" well that pretty much eliminates most solutions. good luck! – Steven A. Lowe May 25 '11 at 02:47
  • It's hard to say without more information about what exactly you're doing with those records. Also, are you looking for 5 seconds worst case (which probably means every record examined and zero locations match)? – Guy Sirton May 25 '11 at 03:07
  • @Steven: The code relies heavily on R for statistical analysis. Switching databases would necessitate a new R integration. The code also uses JasperReports, which calls PostgreSQL stored procedures. – Dave Jarvis May 25 '11 at 05:46
  • @Guy: I have updated the question with details about the database structure. Five seconds is a worst case. – Dave Jarvis May 25 '11 at 05:49
  • What does "have a statistical likelihood of falling into the given range of values." mean? –  May 25 '11 at 06:03
  • @Dave: How much time does the current system take? Is the current system using [PostGIS](http://en.wikipedia.org/wiki/PostGIS)? Is `location_id` a `geography` or `geometry`, or refers to a second table? Is the `location_id` column indexed? – rwong May 25 '11 at 06:19
  • Totally agree with Steven here. Otherwise I would have suggested Hadoop – Chani May 25 '11 at 06:51
  • @rwong: I haven't benchmarked yet. A similar system (currently running on the given hardware and using similar statistics calculations) using 43 million rows (with appropriate indexes) runs in under 10 seconds. The difference is that the user can supply a city, which cuts the amount of data from 20,000 locations to fewer than 100. – Dave Jarvis May 25 '11 at 07:04
  • @Thorbjørn: The user picks a range of values from 100 to 105 for Jan 1st to Jan 6th. The system (1) averages the values at each location for every city during the specified dates; (2) finds all locations with an average generalized additive model value that falls between the given range of values; (3) displays to the user what cities match the given criteria. – Dave Jarvis May 25 '11 at 07:21
  • "each" and "every" implies a lot of work needing to be done. Can ANY of this be calculated in advance. –  May 25 '11 at 08:06
  • As Andersen has pointed out, pre-calculating would seriously speed up things. How much of this is dynamic data vs static data? – Darknight May 25 '11 at 08:33
  • @Thorbjørn & @Darknight - In the ideas section I list pre-calculating, which would reduce the data to one value per city per day (per category). The calculation could recur annually, or even monthly, I suppose. This was my plan if there were no other possibilities (the calculations will probably take weeks). – Dave Jarvis May 25 '11 at 09:05
  • @Dave, plenty of possibilities, but the question is what is relevant to you. Have you investigated where the current bottlenecks are yet? –  May 25 '11 at 09:11
  • @Thorbjørn: Not yet; the new data is currently importing into the existing system. The existing system, however, has been optimized. (One initial problem was that the physical model on disk and the logical model orders were not in sync, which prevented the indexes from being used. Clustering fixed that, if I recall.) The old system does not hunt for cities, which is the key requirement of the new system. – Dave Jarvis May 25 '11 at 09:19
  • @Dave, make an initial prototype so at least you can get an INKLING of an idea of what you are dealing with. –  May 25 '11 at 10:50
  • maybe helpful: [CLUSTERing on geometry indices](http://postgis.refractions.net/documentation/manual-1.3/ch05.html#id2574260), [explain analyze](http://wiki.postgresql.org/wiki/Introduction_to_VACUUM,_ANALYZE,_EXPLAIN,_and_COUNT) – rwong May 25 '11 at 14:27
  • @Dave: if your project is educational/research and you're looking for cloud service, consider [Open Cirrus](https://opencirrus.org/content/participating-open-cirrus). – rwong May 25 '11 at 14:56
  • @Dave: in your description, "The user picks a range of values from 100 to 105 for *Jan 1st to Jan 6th*", do you mean Jan 1st to Jan 6th of a *single year*, or for every year (i.e. seasonal)? – rwong May 25 '11 at 16:11
  • @rwong: A single year. Thanks for the note about clustering; I have used it in the past. As for explains, http://explain.depesz.com/ is a useful site. – Dave Jarvis May 25 '11 at 22:39
  • Maybe some variation of a [Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter)? – Homde May 25 '11 at 08:25

8 Answers

12

Since you cannot switch databases, the most important thing is to be absolutely certain where the bottleneck is now for a given set of representative requests.

If you do full table scans, you need appropriate indexes.

If you wait on I/O, you need more memory for caching (Jeff Atwood recently mentioned that 24 GB systems were within reach on the desktop).

If you wait on CPU you need to see if your calculations can be optimized.

This requires a pointy DBA hat and an operating-system hat, but it is worth it to ensure you are barking up the right tree.
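
A minimal way to see which of the three you are waiting on, using the table layout from the question (names are illustrative), is to run a representative request under EXPLAIN ANALYZE while watching top and iostat:

    -- Representative request, simplified to an average. A Seq Scan node over
    -- the whole child table suggests missing or unusable indexes; long actual
    -- times while the disk is pegged (per iostat) point at I/O.
    EXPLAIN ANALYZE
    SELECT location_id, avg(value1)
      FROM measurement_cat1
     WHERE taken BETWEEN '1900-01-01' AND '1900-01-14'
     GROUP BY location_id;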

  • However you slice and dice it, even if each row takes only 100 bytes, 1.3 billion rows ≈ 121 GB. With all your indexes etc., I am sure this will be much more. On a single box you are going to be slow unless you have some serious hardware: SSDs plus tonnes of RAM. The cheaper way is to scale across boxes. – Subu Sankara Subramanian May 25 '11 at 12:26
  • @Subu, you want to go distributed? Now you have two problems... –  May 25 '11 at 12:28
  • Heh, that I agree with :) But it's cheaper! – Subu Sankara Subramanian May 25 '11 at 13:42
  • @Thorbjørn: Thank you for your time and all your help. I think I will reduce the data set to 25 million rows per category then apply indexes on the date. That should reduce the scan to ~70000 rows (per day, with a limit of two weeks for the range), which should be fairly snappy. – Dave Jarvis May 25 '11 at 22:45
  • @Dave, you still need to know where your bottlenecks are. Learn it while you don't _have_ to. –  May 26 '11 at 05:40
  • @Thorbjørn: Just adding the smaller data set will take several years, not weeks, without optimization. – Dave Jarvis May 29 '11 at 05:31
  • I dó not understand why this is so? –  May 29 '11 at 07:54
  • @Thorbjørn: It was because the locations were not indexed (location & dates were indexed, but not locations alone). This reduces the time from several years to 28 days to create the smaller data set. – Dave Jarvis May 29 '11 at 23:19
  • @Dave, perhaps I would understand it better if you showed some code? –  May 31 '11 at 01:36
  • @Thorbjørn: Subset creation code: http://pastebin.com/U0HS8xDN Disk space is an issue, so I cannot try to speed it up using `WITH`. After the inserts are finished, I should be able to query the data in various ways, and have it be fast. – Dave Jarvis May 31 '11 at 02:19
4

How about partitioning the table into multiple pieces located on different hosts, based on the date stamp? This is horizontally scalable, and as long as you have enough boxes, you could write a small aggregation engine on top of these setups.

If you find that the date stamp changes too much, you could partition based on the locations instead, which is again horizontally scalable. (Hopefully not many more latitudes/longitudes will be added!)
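
On a single box this could start as plain range partitioning by date using inheritance plus constraint exclusion, for example (names made up); spreading the children across hosts and aggregating on top would be the next step:

    -- One child table per date range (per year here).
    CREATE TABLE measurement_y2009 (
        CHECK (taken >= DATE '2009-01-01' AND taken < DATE '2010-01-01')
    ) INHERITS (measurement);

    CREATE INDEX measurement_y2009_taken_location_idx
        ON measurement_y2009 (taken, location_id);

    -- Lets the planner skip children whose CHECK constraint excludes the
    -- requested date range.
    SET constraint_exclusion = on;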

  • Thank you for the ideas. There are potentially 77,066 dates, and new dates will be added going forward. I have a single machine. There are 20,000 locations, yet splitting by location wouldn't help because the data to analyze spans all locations. – Dave Jarvis May 25 '11 at 05:57
  • And how is using the cloud different from the above solution? – Chani May 25 '11 at 06:50
  • This is what I thought of as well. Some kind of horizontal partition so that the search can happen in parallel across all the partitions. –  May 25 '11 at 07:45
  • Splitting on the day would probably be the most helpful, resulting in 2562 separate tables (366 days x 7 categories). – Dave Jarvis May 26 '11 at 03:07
4

The worst-case scenario is that the date range covers all the dates in your database.

You're looking to read 1.3 billion records and do some sort of analysis on each record against the entered values, on one physical machine, in less than 5 seconds. The outcome can be all locations or none; you know nothing in advance.

Given these parameters, I would say this is likely impossible.

Just look at your hard drive: its maximum sustained rate is less than 150 MB/s, so even at ~100 bytes per row, reading 1.3 billion records means well over 100 GB and on the order of 15 minutes of sequential I/O, not 5 seconds. CPU-wise, you are not going to be able to do any sort of statistical analysis on 1.3 billion records in 5 seconds either.

Your only hope (tm :-) ) is finding some sort of lookup function, based on the values entered by the user, that narrows the search down by a few orders of magnitude. You can calculate this lookup function offline. Without knowing more about the exact match criteria, I don't think anyone can tell you how to do that, but an example would be to partition the range of values into discrete intervals and create a lookup that gives you all the records in each interval. As long as the intervals are small enough, you can do real work within one, e.g. pruning away entries that don't match the user-entered value. Basically you are trading space for time.
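
As one hedged illustration of trading space for time, assuming hypothetical table names and a mean-style statistic (the real pruning rule depends on your exact match criteria):

    -- Offline: one summary row per location and day, tagged with a coarse
    -- value bucket (width 5, purely illustrative).
    CREATE TABLE daily_location_summary AS
    SELECT location_id,
           taken,
           avg(value1)                 AS avg_value1,
           floor(avg(value1) / 5)::int AS value_bucket
      FROM measurement_cat1
     GROUP BY location_id, taken;

    CREATE INDEX daily_location_summary_idx
        ON daily_location_summary (value_bucket, taken, location_id);

    -- Online: fetch only candidate rows near the requested interval (how far
    -- the buckets must extend depends on the real statistic), then hand the
    -- survivors to the exact calculation.
    SELECT location_id, taken, avg_value1
      FROM daily_location_summary
     WHERE value_bucket BETWEEN 20 AND 21   -- buckets covering 100..105
       AND taken BETWEEN '1900-01-01' AND '1900-01-06';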

It may be possible to hold all the records (or at least the important part of them) in memory. Probably not in 8 GB. That would at least eliminate the disk I/O portion, though even memory bandwidth may be insufficient to scan through everything in 5 seconds. At any rate, this is another technique for speeding up these sorts of applications (combine it with my previous suggestion).

You mention using a cloud service. Yes, if you pay for enough CPU and I/O muscle and partition your database across many servers, you can brute-force / divide-and-conquer it.

Guy Sirton
  • Thank you for the answer. Hardware upgrades are a consideration, as per the ideas I listed. A sub-$750 USD solution would be ideal. – Dave Jarvis May 25 '11 at 07:09
2

I second rwong's comment on the question: PostgreSQL offers appropriate index types and tools (GiST indexes, GIN indexes, PostGIS, geometric types) such that geodata and datetime-related data should be searchable along those criteria without much trouble.

If your queries on these criteria take seconds, it probably means no such indexes are being used. Can you confirm that you have investigated these as appropriate?

Denis de Bernardy
  • Thank you. The seven child tables are clustered on the location, date, and category using a btree. I researched GIN indexes last year and they did not (or would not) help, as I recall. – Dave Jarvis May 25 '11 at 07:15
  • Indexing location based on a B-Tree isn't the slightest bit useful considering the type of searches you're looking into. You need an inverted index that works with the needed operators, which in the case of PostGIS usually means GiST. You might want to highlight a few of the slow queries... – Denis de Bernardy May 25 '11 at 07:40
1

Given that you use PostgreSQL and latitude/longitude data, you should definitely use PostGIS as well; that way you can add a GiST spatial index to your database to help speed things up.

I have such a table (with 350k rows) on a configuration much smaller than yours (2 cores and barely 2 GB of RAM), yet searches take less than one second.
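
A minimal sketch of the setup, assuming a location table with latitude and longitude columns (PostGIS 1.x style, to match the 8.4-era stack):

    -- Add a geometry column, populate it, and index it with GiST.
    SELECT AddGeometryColumn('location', 'geom', 4326, 'POINT', 2);

    UPDATE location
       SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);

    CREATE INDEX location_geom_gist ON location USING gist (geom);

    -- Bounding-box searches (e.g. all locations inside a city's extent)
    -- can then use the index via the && operator.
    SELECT id
      FROM location
     WHERE geom && ST_MakeEnvelope(-124.0, 48.0, -122.0, 50.0, 4326);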

wildpeaks
0

Maybe you could break away from the relational model, as Essbase did with its OLAP architecture: Essbase Wikipedia

What I mean is: create one table per city, ending up with 1,000+ tables. Not one table as you suggested, but many. Index each table by date and location. Many tables, many indexes --> faster.
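
A sketch of what one such per-city table could look like (entirely illustrative names; with tens of thousands of cities the DDL would have to be generated from the city list):

    -- One table per city, indexed by date and location as suggested.
    CREATE TABLE measurement_city_00042 (
        taken       date     NOT NULL,
        location_id integer  NOT NULL,
        category    smallint NOT NULL,
        value1      numeric,
        value2      numeric,
        value3      numeric
    );

    CREATE INDEX measurement_city_00042_taken_location_idx
        ON measurement_city_00042 (taken, location_id);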

mihaela
  • Thanks for the note. There are over 70,000 cities, and many different latitude/longitude values fall within a specific city area. – Dave Jarvis May 25 '11 at 05:24
  • @Dave: can you build a voronoi diagram for cities and classify lat/lon values into tessellations? (i.e. if it sounds haphazard, let it be.) Then, during lookup, you'll search for all cities whose tessellation touches the query's lat/lon ranges. If voronoi tessellation is too slow, square boxes (e.g. 5 deg lat x 5 deg lon) might be worth trying. – rwong May 25 '11 at 19:57
0

Regarding your idea of finding a cloud service to host the database: have you come across SimpleGeo yet? They just cut the ribbon on a Storage service that is apparently "specifically tuned to store and query location data really, really fast", although the cost to store and query more than a billion rows might make this approach unfeasible.

IanI
-2

You are expecting a bicycle to run on the highway. Currently you are looking for a solution that tackles this problem only; you are not foreseeing the problem of what happens when you have 2 billion records. Scalability must be addressed. The answer is simple: use an object database, e.g. InterSystems Caché.

And believe you me, I'm not from InterSystems ;-)

anerjan